Ansys Products

Unplanned exit of Fluent occasionally causes a Fluent subdirectory to become inaccessible.

    • shampton
      Subscriber

      We run AFS with RHEL 7 as a shared file system on our HPC clusters.  One of our groups uses Fluent heavily.  When a job exits abnormally, there is a chance that some directory on the Fluent executable's search path will become inaccessible for all users.  Any attempt to run Fluent after this event fails because the directory is no longer visible.  The only fix is to have root manually go in and clean out the system cache.  We cannot reproduce this with any regularity, but it happens often enough that the group is continually frustrated.  Do you have any insight into which file system calls may be related?  In particular, are you aware of any parallel I/O operations in Fluent?  AFS does not support MPI I/O.

    • JakeC
      Ansys Employee

      Hi shampton,


      Unfortunately, I don't know anything about the AFS file system.


      When you refer to the "system cache", what exactly do you mean?  What exactly does root need to do to get things working again?


      Is it safe to assume that if you run the job on a traditional file system and Fluent crashes, everything in the file structure stays the same and is still accessible?


      In other words, are you only seeing this when using the AFS file system?


      I am fairly certain that the MPI processes themselves write out to the disk, but I don't know if multiple processes write out to the same file at the same time.


       


      Thank you,


      Jake

    • Mangesh Bhide
      Ansys Employee

      Hello


      - Check whether the problem persists when the Fluent installation is mounted read-only, so that nothing can write to that path, or install Fluent on a path that is not on the cached file system, for example on NFS.


      - Is this OpenAFS? If so, check the section "Forcing the Update of Cached Data" in the OpenAFS manual to see whether it helps, or whether that is what is already being done. If it is some other AFS implementation, please refer to its manual for flushing the cache. In any case, a program exiting abnormally should not affect the operating system's file cache for that program. What is the exact error message when Fluent is run afterwards, and what is the exact command line used to start Fluent?


      - Refer to this page and its 3 subtopics for information on parallel I/O:


      https://ansyshelp.ansys.com/account/secured?returnurl=/Views/Secured/corp/v182/flu_ug/flu_ug_par_data_files.html


      See whether writing regular dat files instead of parallel dat files gets around the issue.
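
A minimal sketch of the cache-refresh suggestion above, assuming the site runs OpenAFS and that its standard `fs checkvolumes` and `fs flushvolume -path` commands are available on the compute node. The install path in the usage note is hypothetical, and whether these commands match what root currently does to recover is not known from this thread. The helper only builds and prints the commands unless `dry_run` is turned off.

```python
import shlex
import subprocess


def build_flush_commands(paths):
    """Return OpenAFS `fs` invocations that refresh cached data for each path."""
    # `fs checkvolumes` asks the cache manager to re-check volume name/ID mappings.
    commands = [["fs", "checkvolumes"]]
    for path in paths:
        # `fs flushvolume -path <dir>` discards cached data for the volume holding <dir>.
        commands.append(["fs", "flushvolume", "-path", path])
    return commands


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False.

    Returns the list of exit codes of the commands that were actually run.
    """
    exit_codes = []
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            completed = subprocess.run(cmd, capture_output=True, text=True)
            exit_codes.append(completed.returncode)
    return exit_codes
```

For example, `run_commands(build_flush_commands(["/afs/example.org/apps/ansys"]))` prints the two commands without running them; the path here is a placeholder, not the poster's actual install location.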


       

    • shampton
      Subscriber

      Thanks for the feedback.  I should clarify that the running jobs write primarily to our scratch file system, not to AFS.  What happens is that a directory goes missing completely on a compute node, and that directory is always within the Ansys hierarchy.  This is obviously an AFS issue; we just don't understand what Ansys does to cause it, as no other program triggers it.  We'll try moving Ansys to a read-only volume and see if that alleviates the issue.  Thanks again.
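
One way to pin down which directory disappears is to record the directory tree on a healthy node and compare against it after a crash. The sketch below is only an illustration under assumptions: the function names, the depth limit, and the install root passed in are hypothetical and not part of any Ansys or AFS tooling.

```python
import os


def snapshot_dirs(root, max_depth=3):
    """Return sorted relative paths of directories under root, down to max_depth levels."""
    found = []
    root = os.path.abspath(root)
    for current, dirnames, _files in os.walk(root):
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # stop os.walk from descending any further
        if rel != ".":
            found.append(rel)
    return sorted(found)


def missing_dirs(root, expected):
    """Return the expected relative paths that can no longer be listed under root."""
    missing = []
    for rel in expected:
        try:
            os.listdir(os.path.join(root, rel))
        except OSError:
            # Covers both "no such directory" and permission or I/O errors.
            missing.append(rel)
    return missing
```

Taking the snapshot once on a working node and running `missing_dirs` from a job epilogue after an abnormal exit would show whether the same subdirectory is affected each time.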

    • Mangesh Bhide
      Ansys Employee

      Thank you for the update; I hope moving ANSYS to a different mount helps. I doubt that just running a program could cause an issue with the file system, especially if it is mounted read-only.


      If flushing the file system cache or rebooting helps, then it would appear that the file system is being remounted or re-cached correctly.


      If only portions of the path go "missing" after an abort, then I wonder whether the issue is a function of cache size and the number of files, the size of the files, or a combination of the two.
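
To test the cache-size idea above, the number and total size of files under the installation could be compared with the configured client cache size. The following is a rough sketch; the function name is hypothetical, and how the result relates to the AFS cache settings on a given node is an assumption that would need checking against the site's configuration.

```python
import os
import stat


def tree_stats(root):
    """Return (file_count, total_bytes) for regular files under root."""
    count = 0
    total = 0
    for current, _dirs, files in os.walk(root):
        for name in files:
            try:
                info = os.lstat(os.path.join(current, name))
            except OSError:
                continue  # skip entries that vanish or cannot be read
            if stat.S_ISREG(info.st_mode):  # ignore symlinks and special files
                count += 1
                total += info.st_size
    return count, total
```

If the totals for the affected subtree are close to or larger than the client cache, that would support the cache-size explanation; if they are far smaller, it points elsewhere.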


      Thank you for your understanding. I hope moving the Fluent installation to a different location resolves the issue with this file system. Do let us know.


       
