Platform

Ansys solving system coupling problems on a remote cluster does not provide consistent results.

    • NOS_ATX
      Subscriber

      Hello All,


      Our group has established an RSM connection to solve system coupling (Transient Structural & Fluent) problems. However, we discovered that the RSM/cluster does not provide consistent results. We have tested a small project many times, and runs of the same project ended differently. Only a few runs succeeded and finished as we expected, with a running time of around 8-12 hours. The other attempts either ran into connection/license problems or kept running until they reached the kill-off time (120 hours), which we could not understand. This issue did not occur when we ran the two modules individually. Can somebody help us figure out why our RSM solver behaves like this?


       


      Thanks,


       


      Boyuan

    • Rob
      Ansys Employee

      To check, the solver results (contours etc) are the same for runs that complete, and the inconsistent behaviour is that the jobs don't always complete? 


      If it's the latter, check what else is running on the cluster. If the licence is lost during the run, it can interrupt the solver or prevent additional activities from being triggered. For example, Fluent checks the licence at intervals during the run and when writing: if someone else grabs the Fluent licence during the Mechanical run, you have a problem. 


      Are there any errors visible in the Workbench window? 

    • NOS_ATX
      Subscriber

      Hi rwoolhou,


      Yes, the solver results are the same; we also compared them with results from runs completed locally.


      The license should be fine, since we had that problem once before and it produced a clear warning/error. The only thing that seems strange to me is that, during the run, the log showed the machine repeatedly transferring the file scLog.scl_ and also reporting that the file already exists. I wonder if the run got stuck in a loop.


      Thanks,


      Boyuan 

    • Rob
      Ansys Employee

      Thanks Boyuan. My suspicion would be on the IT side and whether some permissions are getting mixed up or timing out during the runs. Not something we can fix on an open forum, so you may need to ask your ANSYS contact to talk to the US support team. 

    • Karthik R
      Administrator

      Hello Boyuan,


      Thank you for your patience. We are looking internally on how best to help you with your question.


      Thank you.


      Best Regards,


      Karthik

    • JakeC
      Ansys Employee

      Hi Boyuan,


      Could you please attach the RSM Job report from the failed job?


      You can get that by opening the RSM Job Monitor, selecting the job, right click in the bottom pane, and select Save Job Report.


      If you could send the log from a failed one, and one from a successful run, that would be helpful.


      I have a feeling that the solver process may be crashing, and thus never finishing.  That will be tough to tell from just the RSM log, but it's a place to start anyway.


       


      Thank you,


      Jake

    • NOS_ATX
      Subscriber

      Hi Jake,


      I have saved some logs from failed runs, but I did not save a successful one. Can I retrieve them? Also, how should I attach a log file on this open forum? Some of them are quite big (400 MB).


      Best,


      Boyuan

    • JakeC
      Ansys Employee

      Hi Boyuan,


      Are you able to zip up that log file?  It should make it much smaller.


      RSM Job Logs are cleaned up eventually, so you may not be able to retrieve them.


      Could you create a much smaller project that uses the same types of systems and analysis, and run that as well? (one that solves quicker)


      I would like to try and compare a successful log to the failed one.


      Thank you,


      Jake
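
For anyone following along, compressing a large job report before attaching it can be scripted. The snippet below is only a minimal sketch using the Python standard library; the function name `compress_log` and the example file name are hypothetical and not part of RSM.

```python
import gzip
import shutil
from pathlib import Path


def compress_log(src: str) -> Path:
    """Write a gzip-compressed copy of `src` next to it and return the new path."""
    source = Path(src)
    target = source.with_name(source.name + ".gz")
    # Stream the file in chunks so a log of several hundred MB
    # is never loaded into memory all at once.
    with source.open("rb") as fin, gzip.open(target, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return target
```

Calling `compress_log("rsm_job_report.txt")` (a hypothetical file name) would leave the original untouched and write `rsm_job_report.txt.gz` beside it. Text logs with many repeated lines usually compress very well.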

    • NOS_ATX
      Subscriber

      Hi Jake,


      I have tried, but no run has succeeded since. I will attach a failed log for you.


      Thanks!


      Boyuan

    • NOS_ATX
      Subscriber

      Hi Jake,


      Any thoughts on the attached failed log?


      Thanks!


      Boyuan

    • JakeC
      Ansys Employee

      Hi Boyuan,


      I apologize, it doesn't seem I was notified by the forum that you had uploaded a file.


      Let me look into it today.


      Thank you,


      Jake

    • JakeC
      Ansys Employee

      Hi Boyuan,


      Can you see if the cluster admin will install the RSM Launcher service on a cluster submit node?


      I think that will help to stabilize the issues that you are seeing.  I would really like to see if that helps.


      Thank you,


      Jake

    • NOS_ATX
      Subscriber

      Hi Jake,


      Here is the response from the cluster admin:


       


      "The cluster doesn't support running 'services' as a general rule; if this is something that can be installed and run by users on an as-needed basis within a submission, it might be an option.


       


      If it's something that needs to run 24x7 with dedicated resources waiting for connections from outside the cluster, it's not something we can offer, but if it can run elsewhere and connect to the cluster over ssh to start jobs, then it might be possible.


       


      Is there any official documentation about it?  I see other sites referencing it, but very little seems to be from ANSYS officially."


       


       


      Best,


      Boyuan

    • JakeC
      Ansys Employee

      Hi Boyuan, 


      Yes, the official docs for RSM are here:


      https://ansyshelp.ansys.com/account/secured?returnurl=/Views/Secured/corp/v192/wb_rsm/wb_rsm.html


       


      You might be able to run the RSMLauncher service as your own user, but they will need to open a port on the cluster's firewall.


      Is that an option?


       


      Thank you,


      Jake

    • NOS_ATX
      Subscriber

      Hi Jake,


      Sorry for the long delay; I was out of the country for a while. The cluster admin responded:


      "Based on your description, it may be possible; presumably the firewall change would be to allow incoming connections to the cluster.  That by itself won't work, we can't route internet traffic directly to processing nodes.


       


      However, if you start a job that listens to a port on the processing node, you can use ssh's port forwarding so that you can log in to ghpcc06 and forward a port on your local pc through the ssh tunnel to the node running your job."


       


      Again, we do not have connection problems when running CFD or FEA individually on the cluster. The problem shows up when we run system coupling cases.


       


      Thanks!


       


      Boyuan
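
The SSH port forwarding that the cluster admin describes above can be sketched roughly as follows. This is an illustration under assumptions, not a tested RSM setup: the user name, compute node name, and port number are all hypothetical placeholders, and only the login host `ghpcc06` comes from the admin's message (it may need a fully qualified domain name in practice).

```python
import shlex


def build_tunnel_command(user, login_host, compute_node, remote_port, local_port):
    """Return the ssh argument list for a local port forward through login_host."""
    # local_port on the PC is forwarded, via login_host, to remote_port
    # on the compute node where the job is listening.
    forward = f"{local_port}:{compute_node}:{remote_port}"
    # -N: do not run a remote command, only forward; -L: local port forwarding.
    return ["ssh", "-N", "-L", forward, f"{user}@{login_host}"]


# Hypothetical values: "your_user", "compute-node" and port 9000 are placeholders.
cmd = build_tunnel_command("your_user", "ghpcc06", "compute-node", 9000, 9000)
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to hand to `subprocess.run(cmd)` without shell quoting issues. Whether RSM can actually be pointed at such a tunnel is a separate question that this sketch does not answer.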

    • Rob
      Ansys Employee

      System coupling will automatically open & close tools as needed: I'm not sure if the system will treat these differently to manually opened files. 


      Would it be possible to try a small parametric run (just set a pipe with variable diameter) and output the mass flow and see if that will work. You'd only need a few thousand cells and it'll complete in under 30 minutes. IT can then monitor the system to see what is being called. 

    • NOS_ATX
      Subscriber

      Hi rwoolhou,


      I know it's been a long time, but our group came back to this project and remote cluster approach.


      Yes, we still have this problem. In the Job Monitor, the job we submitted, which is up and running, seems to repeat itself with output like this:


       


      Inquiring for files by Name: [FSI_R1x1.2_3_Cluster_1_updated_dp0.wppz,FSI_R1x1.2_3_Cluster_1_files/dp0/SC/SC/scLog.scl_]
      Transferring file: A subdirectory or file E:LegendLauDocumentsPhDCurrentProjectSimulationsScanIP_NewFSI_R1x1.2_3_Cluster_1_pendingUDP-4FSI_R1x1.2_3_Cluster_1_filesdp0SCSC already exists.
      Transferring file: scLog.scl_                | 32 kB |  32.0 kB/s | ETA: 00:00:01 |  44%
      Transferring file: scLog.scl_                | 72 kB |  72.4 kB/s | ETA: 00:00:00 | 100%
      74.11 KB, 3.83 sec (19.35 KB/sec)
      Transferring file: A subdirectory or file E:LegendLauDocumentsPhDCurrentProjectSimulationsScanIP_NewFSI_R1x1.2_3_Cluster_1_pendingUDP-4FSI_R1x1.2_3_Cluster_1_filesdp0SCSC already exists.
      Transferring file: scLog.scl_                | 32 kB |  32.0 kB/s | ETA: 00:00:01 |  44%
      Transferring file: scLog.scl_                | 72 kB |  72.4 kB/s | ETA: 00:00:00 | 100%
      74.11 KB, 5.02 sec (14.77 KB/sec)
      Inquiring for files by Name: [FSI_R1x1.2_3_Cluster_1_updated_dp0.wppz,FSI_R1x1.2_3_Cluster_1_files/dp0/SC/SC/scLog.scl_]
      Transferring file: A subdirectory or file E:LegendLauDocumentsPhDCurrentProjectSimulationsScanIP_NewFSI_R1x1.2_3_Cluster_1_pendingUDP-4FSI_R1x1.2_3_Cluster_1_filesdp0SCSC already exists.
      Transferring file: scLog.scl_                | 32 kB |  32.0 kB/s | ETA: 00:00:01 |  44%
      Transferring file: scLog.scl_                | 72 kB |  72.4 kB/s | ETA: 00:00:00 | 100%
      74.11 KB, 3.41 sec (21.75 KB/sec)
      Transferring file: A subdirectory or file E:LegendLauDocumentsPhDCurrentProjectSimulationsScanIP_NewFSI_R1x1.2_3_Cluster_1_pendingUDP-4FSI_R1x1.2_3_Cluster_1_filesdp0SCSC already exists.


       


      Do you have any idea what's the cause?


       


      Best,


       


      Boyuan
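
To quantify the repetition in a saved job report like the excerpt above, the completed transfers can be counted per file. The following is a minimal sketch that assumes the log lines look exactly like those in the excerpt; the function name and the regular expression are illustrative and not part of RSM. It only measures how often a file is re-transferred and does not identify the cause.

```python
import re
from collections import Counter

# Matches progress lines that reached 100%, e.g.
# "Transferring file: scLog.scl_   | 72 kB |  72.4 kB/s | ETA: 00:00:00 | 100%"
COMPLETED = re.compile(r"Transferring file:\s+(\S+)\s+\|.*\|\s+100%")


def count_completed_transfers(log_text: str) -> Counter:
    """Count how many times each file name reached 100% in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = COMPLETED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two transfer cycles copied from the excerpt (the path line is shortened here).
sample = """\
Transferring file: A subdirectory or file ... already exists.
Transferring file: scLog.scl_                | 32 kB |  32.0 kB/s | ETA: 00:00:01 |  44%
Transferring file: scLog.scl_                | 72 kB |  72.4 kB/s | ETA: 00:00:00 | 100%
Transferring file: scLog.scl_                | 32 kB |  32.0 kB/s | ETA: 00:00:01 |  44%
Transferring file: scLog.scl_                | 72 kB |  72.4 kB/s | ETA: 00:00:00 | 100%
"""
print(count_completed_transfers(sample))
```

Comparing this count between a run that finishes and one that hangs would show whether the repeated transfer of scLog.scl_ is normal periodic log syncing or something specific to the stuck jobs.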

    • tsiriaks
      Ansys Employee

      Hi Boyuan,


      Can you try a very small project like the one Rob mentioned?


      Please post the full RSM Job report of this run.


      Also, if possible, we should make sure that the firewall has nothing to do with this by completely (but temporarily) turning off the firewall on the cluster. If your IT team is not comfortable with that, what about turning it off only on the submit node and one compute node? Then you just make sure that the job is submitted to solve on that node.


      Thanks,


      Win

    • maryam
      Subscriber

      Hi Boyuan,


      I saw your post here. I am going to run my FSI simulation (Transient Structural and Fluent) on an HPC cluster, but I am a beginner at it. If you do not mind, could you give me some information on how to do this? Or, if possible, could you send me the job files for running system coupling?


      Best,


      Mari
