Coupled Run partitioning on HPC

ansysuser USA Member Posts: 44

Hello,

I recently got a coupled system to run on an HPC cluster, but I discovered that the simulation runs much slower there than on my desktop PC.  This is not the case when I run Fluent alone on the same cluster, so I think the trouble is in the partitioning.

I found a link ( https://ansyshelp.ansys.com/account/secured?returnurl=/Views/Secured/corp/v194/sysc_ug/sysc_commandline_analysis_setup_modifysettings.html?q=ParallelArguments ) which says that ParallelArguments can be used in execCon to allocate the number of processes used by each of the coupling participants.  I had to search for an example of how to use this command, and the only one I could find is in the Forte Beta Features Manual for Forte 2019R.  Since this is the only documented use of the command, I am not sure I am using it correctly, but this is what I put in the input file for the system coupling run:

execCon['Solution'].ExecutionControl.ParallelArguments = '-nprocesses 16'

I put this line in the input file for both Solution and Solution 1, since I am first testing on a single node with 32 cores.  However, this results in an error:

Error: eval: unbound variable
Error Object: -nprocesses

 

When that didn't work, I dug further into the documentation and found the PartitionParticipants command ( https://ansyshelp.ansys.com/account/secured?returnurl=/Views/Secured/corp/v194/sysc_ug/sysc_cl_PartitionParticipants.html ) and tried to follow example 46, where it is supposed to simply divide the nodes equally between the participants.  But I got another error:



Traceback (most recent call last):
  File "/apps/r/ansys/v193/SystemCoupling/PyLib/Controller.py", line 139, in <module>
    _run(sys.argv)
  File "/apps/r/ansys/v193/SystemCoupling/PyLib/Controller.py", line 135, in _run
    _executeScript(options)
  File "/apps/r/ansys/v193/SystemCoupling/PyLib/Controller.py", line 107, in _executeScript
    kernel.commands.readScriptFile(scriptFile)
  File "PyLib/kernel/commands/__init__.py", line 31, in readScriptFile
  File "PyLib/kernel/commands/CommandManager.py", line 168, in readScriptFile
  File "inputfile_p.in", line 3, in <module>
    PartitionParticipants()
  File "PyLib/kernel/commands/CommandDefinition.py", line 72, in func
  File "PyLib/kernel/commands/__init__.py", line 28, in executeCommand
  File "PyLib/kernel/commands/CommandManager.py", line 120, in executeCommand
  File "PyLib/cosimulation/externalinterface/cosim_commands/partitioning.py", line 78, in execute
  File "PyLib/cosimulation/partitioning/__init__.py", line 63, in partitionParticipants
  File "PyLib/cosimulation/partitioning/__init__.py", line 119, in sharedAllocateMachinesByFraction
  File "PyLib/cosimulation/partitioning/__init__.py", line 142, in __allocateCommon
  File "PyLib/cosimulation/partitioning/machinelist.py", line 50, in loadMachines
RuntimeError: Host list not found.

 

From the traceback it looks like ANSYS expects a host list, even though the example I mentioned above issues the command exactly the way I did.  The problem is that I cannot specify the host names beforehand: this is a cluster with hundreds of nodes, and when I submit a job I get whatever nodes are available according to the queuing system.
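
The only workaround I can think of is to recover the host names at run time from the scheduler inside the job, something like the sketch below (this assumes a PBS-style scheduler that provides $PBS_NODEFILE; SLURM exposes the equivalent in SLURM_JOB_NODELIST).  How such a list would then be handed to PartitionParticipants is not clear to me from the documentation:

import os

# Hosts allocated by the queuing system: $PBS_NODEFILE has one entry per
# allocated processor, so reduce it to the unique node names.
with open(os.environ['PBS_NODEFILE']) as f:
    hosts = sorted(set(line.strip() for line in f if line.strip()))

print('Allocated hosts: %s' % hosts)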

Question:  How can I control the number of nodes used by System Coupling on the HPC cluster?  I need to be able to run on both a single node and on several nodes, each with 32 cores.

 

Thanks 

Comments

  • Steve Posts: 153 Forum Coordinator
    edited June 2019

    Hi, 

    I've been looking into this over the last few days. You're on the right track with ExecutionControl.ParallelArguments and PartitionParticipants. I need to check a few more things and I'll get back to you later today or tomorrow with more information.

    Steve

  • Steve Posts: 153 Forum Coordinator
    edited June 2019

    Hi, 

    For Fluent, set ParallelArguments = '-t16', and for Mechanical, set ParallelArguments = '-n16'. Let me know if this works.
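
    In the input file that would look something like this (a sketch; I'm assuming 'Solution' is the Fluent participant and 'Solution 1' is the Mechanical one, as in your earlier post, with execCon obtained the same way as before):

    execCon['Solution'].ExecutionControl.ParallelArguments = '-t16'     # Fluent: 16 processes
    execCon['Solution 1'].ExecutionControl.ParallelArguments = '-n16'   # Mechanical: 16 cores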

  • ansysuser USA Posts: 141 Member
    edited June 2019

    Hello,

    I tried this again and it seems to have worked, though it is still slower than my desktop PC by a factor of 3.  (My desktop has two 16-core processors, and I ran the system coupling with 16 cores for Fluent and 16 for Mechanical.)  This first run was just to test a single 32-core HPC node.

    In the Fluent Solution.trn file from HPC I see:

    -------------------------------------------------------------------------------
    ID     Hostname      Core   O.S.      PID        Vendor                     
    -------------------------------------------------------------------------------
    n0-15  t388.cluster  16/32  Linux-64  8724-8739  Intel(R) Xeon(R) E5-2683 v4
    host   t388.cluster         Linux-64  8548       Intel(R) Xeon(R) E5-2683 v4

    MPI Option Selected: ibmmpi
    Selected system interconnect: shared-memory

    And in the Mechanical Solution.out file from HPC I see it is using the other half of the same 32-core node:

      CONNECTION TO COUPLING SERVER REQUESTED
        SYSTEM NAME                = Solution
        HOST NAME                  = t388.cluster
        SERVER PORT REQUESTED      = 48421      

     PARAMETER STATUS-           (   1 PARAMETERS DEFINED)

     NAME                              VALUE                        TYPE
     N16                                                             CHARACTER

     

    But now when I give it two nodes, hoping to see some performance improvement, the system does not give one node to Fluent and one to Mechanical; instead it makes them share the same 32-core node and seemingly ignores the other node altogether.  According to the scheduler, nodes t298 and t299 are allocated to this job, yet in the Fluent file I see that Fluent has all 32 cores of t298:

     

    --------------------------------------------------------------------------------
    ID     Hostname      Core   O.S.      PID          Vendor                     
    --------------------------------------------------------------------------------
    n0-31  t298.cluster  32/32  Linux-64  13040-13071  Intel(R) Xeon(R) E5-2683 v4
    host   t298.cluster         Linux-64  12888        Intel(R) Xeon(R) E5-2683 v4

    MPI Option Selected: ibmmpi
    Selected system interconnect: shared-memory
    --------------------------------------------------------------------------------

    When I look in the Mechanical file I see it also has 32 cores from t298:

     

      CONNECTION TO COUPLING SERVER REQUESTED
        SYSTEM NAME                = Solution
        HOST NAME                  = t298.cluster
        SERVER PORT REQUESTED      = 43358      

     PARAMETER STATUS-           (   1 PARAMETERS DEFINED)

     NAME                              VALUE                        TYPE
     N32                                                             CHARACTER

    So this is not what I expected.  I would have thought that one of them (Fluent or Mechanical) would get node t298 and the other would get node t299, which I can see is reserved for me and sitting idle while this job executes.

    Steve, can you escalate this request?  I have been trying to get this to run on HPC for a couple of months now, and running ANSYS with system coupling on HPC should be a fairly common thing.  I really need to finish this evaluation so we can figure out whether purchasing HPC licenses will give our group enough of a performance boost, or whether we should just stick with desktop licenses.  With Fluent as a standalone solver we see dramatic performance increases on HPC, but so far that is not the case with system coupling when comparing one 32-core node to my desktop's pair of 16-core processors.  I don't think the coupling has had a fair trial yet, though, because I can't get it to do what it should be able to do.
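
    What I was hoping to be able to do is something like the sketch below: read the two allocated node names from the scheduler when the job starts and pin one solver to each node.  This assumes a PBS-style scheduler with $PBS_NODEFILE, assumes execCon is obtained the same way as in my input file above, and assumes the extra host flags are simply passed through to Fluent (-cnf) and to the Mechanical/MAPDL solver (-machines), which I have not been able to verify:

    import os

    # Unique node names allocated by the queuing system (assumption: PBS-style
    # scheduler providing $PBS_NODEFILE).
    with open(os.environ['PBS_NODEFILE']) as f:
        nodes = sorted(set(line.strip() for line in f if line.strip()))

    fluent_node, mech_node = nodes[0], nodes[1]   # needs at least two allocated nodes

    # Give each participant its own 32-core node; passing the host flags through
    # ParallelArguments to Fluent (-cnf) and MAPDL (-machines) is an unverified assumption.
    execCon['Solution'].ExecutionControl.ParallelArguments = '-t32 -cnf=' + fluent_node
    execCon['Solution 1'].ExecutionControl.ParallelArguments = '-n32 -machines ' + mech_node + ':32'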

     

    Thanks!

     

     

  • ansysuser USA Posts: 141 Member
    edited June 2019

    Hello,

     

    I am still waiting for a reply to this question. 

     

    Thanks

  • ansysuser USA Posts: 141 Member
    edited June 2019

    Hello,

     

    I still need input on this problem.  Please reply as to the status.

     

    Thanks

  • ansysuser USA Posts: 141 Member
    edited June 2019

    Hello,

    I am still trying to solve this.  Please respond to let me know of any other options.

     

    Thanks.

  • maryam Posts: 34 Member
    edited April 2020

    Hi,

    I saw your post here.  I am going to run my FSI simulation (Transient Structural and Fluent) on an HPC cluster, but I am a beginner at this.  If you do not mind, could you give me some information on how to do this, or, if possible, send me the job files you use for running system coupling?

    Best,

    Mari

  • maryam Posts: 34 Member
    edited April 2020

    ansysuser,

    I was able to run the system coupling successfully on HPC, but I have the speed problem as well. Did you find a solution?

     

    Maryam

  • AhmedMALIM Posts: 2 Member

    Hello,

    I'm working on FSI.  I tried the Oscillating Plate tutorial with the command-line interface (CLI) on a Windows machine and it works properly.

    The tutorial is here: https://ansyshelp.ansys.com/account/secured?returnurl=/Views/Secured/corp/v202/en/sysc_tut/sysc_tut_oscplate_cli_fluent_steps.html

    When I tried to run the same tutorial on the Linux HPC cluster (SLURM), all the commands completed properly, but the last command that starts the solve ( Solve() ) gives me the following error, which seems similar to the one above:

    Caught exception in ApplicationContext.mapModules
    Traceback (most recent call last):
      File "PyLib/kernel/framework/ApplicationContext.py", line 152, in __call__
      File "PyLib/kernel/framework/ApplicationContext.py", line 61, in _makeComponent
      File "PyLib/kernel/framework/ApplicationContext.py", line 59, in <lambda>
      File "PyLib/kernel/framework/Annotate.py", line 50, in injected
      File "PyLib/kernel/remote/service/__init__.py", line 280, in __init__
      File "PyLib/kernel/remote/service/__init__.py", line 234, in _load
      File "PyLib/kernel/remote/service/__init__.py", line 146, in _initializeComputeNodes
      File "PyLib/kernel/remote/service/__init__.py", line 77, in _makeMultiportArgs
      File "PyLib/kernel/remote/service/__init__.py", line 34, in getParallelOptions
      File "PyLib/cosimulation/partitioning/machinelist.py", line 66, in loadMachines
      File "PyLib/cosimulation/partitioning/machinelist.py", line 252, in _constructMachineListSLURM
    TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

    Traceback (most recent call last):
      File "PyLib/main/Controller.py", line 147, in <module>
      File "PyLib/main/Controller.py", line 143, in _run
      File "PyLib/main/Controller.py", line 92, in _executeScript
      File "PyLib/kernel/commands/__init__.py", line 31, in readScriptFile
      File "PyLib/kernel/commands/CommandManager.py", line 169, in readScriptFile
      File "run.py", line 39, in <module>
        Solve()
      File "PyLib/kernel/commands/CommandDefinition.py", line 74, in func
      File "PyLib/kernel/commands/__init__.py", line 28, in executeCommand
      File "PyLib/kernel/commands/CommandManager.py", line 122, in executeCommand
      File "PyLib/cosimulation/externalinterface/core/solver.py", line 125, in execute
      File "PyLib/kernel/framework/ApplicationContext.py", line 30, in __getattr__
      File "PyLib/kernel/framework/ApplicationContext.py", line 50, in __madeComponent
      File "PyLib/kernel/framework/ApplicationContext.py", line 59, in <lambda>
      File "PyLib/kernel/framework/Annotate.py", line 50, in injected
      File "PyLib/cosimulation/solver/__init__.py", line 70, in __init__
      File "PyLib/kernel/fluxdb/FluxDB.py", line 336, in FluxDB
      File "PyLib/kernel/remote/RemoteObjects.py", line 202, in __init__
      File "PyLib/kernel/remote/RemoteObjects.py", line 88, in __init__
      File "PyLib/ComputeNodeCommand.py", line 81, in newfunc
      File "PyLib/kernel/remote/RemoteStatus.py", line 18, in forceRemoteStart
      File "PyLib/kernel/framework/ApplicationContext.py", line 157, in __call__
      File "PyLib/kernel/framework/ApplicationContext.py", line 152, in __call__
      File "PyLib/kernel/framework/ApplicationContext.py", line 61, in _makeComponent
      File "PyLib/kernel/framework/ApplicationContext.py", line 59, in <lambda>
      File "PyLib/kernel/framework/Annotate.py", line 50, in injected
      File "PyLib/kernel/remote/service/__init__.py", line 280, in __init__
      File "PyLib/kernel/remote/service/__init__.py", line 234, in _load
      File "PyLib/kernel/remote/service/__init__.py", line 146, in _initializeComputeNodes
      File "PyLib/kernel/remote/service/__init__.py", line 77, in _makeMultiportArgs
      File "PyLib/kernel/remote/service/__init__.py", line 34, in getParallelOptions
      File "PyLib/cosimulation/partitioning/machinelist.py", line 66, in loadMachines
      File "PyLib/cosimulation/partitioning/machinelist.py", line 252, in _constructMachineListSLURM
    TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
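
    The TypeError above comes from _constructMachineListSLURM calling int() on a value that is None, so it looks like one of the SLURM environment variables it relies on is not set inside my batch job.  A quick check like the sketch below (run before Solve(), or env | grep SLURM in the sbatch script) should show which variable is missing; exactly which one the code reads is only my guess:

    import os

    # Print the standard SLURM job variables; one of these being unset would explain
    # the int(None) failure in the machine-list construction (which one is an assumption).
    for var in ('SLURM_JOB_NODELIST', 'SLURM_NNODES', 'SLURM_NTASKS',
                'SLURM_TASKS_PER_NODE', 'SLURM_NTASKS_PER_NODE', 'SLURM_CPUS_ON_NODE'):
        print('%s = %s' % (var, os.environ.get(var)))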

  • BenMar Posts: 6 Member

    Hello,


    As Ahmed said above, we are trying to run System Coupling on a SLURM HPC cluster. Could you please share a minimal working example from the discussion above?

    It is unclear which of the options mentioned above actually helped, so sharing the run.py that did work, and/or the environment variables or other settings, would be a great help!

    Thank you and keep safe!

    Benoît
