Electronics

Running ANSYS HFSS on multiple nodes on SLURM based cluster

    • riswadkar1994
      Subscriber

      Hi all,

      Could someone point me to standardized instructions for running HFSS on multiple nodes of a SLURM-based cluster? We are currently limited by the RAM on a single node, although multi-core simulations (parametric sweeps) do work on a single node.

      I did find an interesting solution here (https://forum.ansys.com/discussion/18341/running-hfss-on-a-slurm-based-machine-rsm-cannot-be-accessed), but I cannot yet get my cluster admins to install the ansoftrsmservice executable. Is there a way around installing this executable? I can see that ARC is pre-installed on the cluster, but running the ARC node executable on every node before submitting the job might not be possible. If anyone has a standardized procedure for multi-node jobs on SLURM-based clusters, using ARC or anything else, I would be happy to hear about it.

      I tried to submit a job on ansysedt on a cluster compute node (https://vis.tacc.utexas.edu/#) as follows:

      Tools->Job Management->Submit Job

      The hostnames of the assigned nodes are given by the NODELIST column (squeue -u )

      The following is the preview of the job submission:

      /home1/apps/ANSYS/AnsysEM20.1/Linux64/desktopjob -cmd dso -jobid RSM_14259 -machinelist list=c506-011:1:21:90%:1,c506-012:1:21:90%:1,c506-013:1:21:90%:1,c506-014:1:21:90%:1 -monitor -waitforlicense -useelectronicsppe -ng -batchoptions " -batchsolve 20210711_Single_Qubit1_a:Optimetrics:ParametricSetup1'work2/08252/ameya/stampede2/tmp/Cavity Qubit Start_parametric_2.aedt'
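For readers assembling a similar -machinelist value by hand, here is a minimal sketch that joins hostnames into the comma-separated form shown in the preview above. The function name build_machinelist is hypothetical, and the per-host suffix (:1:21:90%:1) is copied verbatim from the preview; the thread does not say what each field means, so treat it as an opaque string.

```shell
#!/bin/bash
# Hypothetical helper: join hostnames into a "-machinelist list=..." value.
# $1 is the per-host suffix, copied verbatim from the job preview above;
# the remaining arguments are hostnames.
# The function body runs in a subshell so its variables stay local.
build_machinelist() (
    suffix="$1"
    shift
    list=""
    for host in "$@"; do
        # Add a comma only when the list already holds an entry.
        list="${list:+${list},}${host}${suffix}"
    done
    printf '%s\n' "$list"
)

# Inside a SLURM allocation the hostnames could instead come from:
#   scontrol show hostnames "$SLURM_JOB_NODELIST"
build_machinelist ":1:21:90%:1" c506-011 c506-012
```

The output can then be passed as -machinelist list=<value> on the command line.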

      I found that the default port slurmd uses to listen for incoming requests from slurmctld is 6818 (https://slurm.schedmd.com/network.html), but I could not configure the port when submitting the job.

      When I submit the job, I get a message that the job submission is successful and get redirected to Monitor Job - RSM which gives the following error message:

      Connecting to running job.

      ==================================

      Running LSDSO job 'RSM_5415'

      Location: /home1/apps/ANSYS/AnsysEM20.1/Linux64/desktopjob

      Batch Solve/Save: 'work2/08252/ameya/stampede2/tmp/Cavity Qubit Start_parametric_2.aedt

      Starting Batch Run: 04:11:30PM Thursday, August 12, 2021

      Temp directory: /tmp

      ==================================

      Error: (T=08/12/21 16:11:31): Failed to launch engine on machine 'c506-012' or obtain a compatible interface

      Error: (T=08/12/21 16:11:31): Failed to activate child job. Node 'c506-012' is removed from this job's available resources.

      Error: (T=08/12/21 16:11:31): Failed to launch engine on machine 'c506-013' or obtain a compatible interface

      Error: (T=08/12/21 16:11:31): Failed to activate child job. Node 'c506-013' is removed from this job's available resources.

      Error: (T=08/12/21 16:11:31): Failed to launch...

    • ANSYS_MMadore
      Ansys Employee
      If you run version 2021R1 or 2021R2, there is direct integration with SLURM; please refer to the Help documentation: https://ansyshelp.ansys.com/Views/Secured/Electronics/v211/en/Subsystems/HFSS/HFSS.htm#HPC/IntegrationwithSLURMLinuxScheduler.htm%3FTocPath%3DHFSS%2520Help%7CHigh%2520Performance%2520Computing%7CHigh%2520Performance%2520Computing%2520(HPC)%2520Integration%7C_____8
      Prior releases require a custom scheduler proxy and ansoftrsm.
      ARC is not an option.
    • riswadkar1994
      Subscriber
      I tried using the scheduler proxy and running ansoftrsmservice on each node. I get the following error on every compute node on the cluster other than localhost:
      Unable to locate or start COM engine on '192.168.217.68'
      I ran ansoftrsmservice on each node as follows:
      for NODE in $(cat nodefile_local_ip)
      do
      # start the RSM service
      ssh $NODE $WORKDIR/rsm/Linux64/ansoftrsmservice start
      # register engines with RSM (otherwise it'll complain that it can't find it)
      ssh $NODE /home1/apps/ANSYS/AnsysEM20.1/Linux64/RegisterEnginesWithRSM.pl add
      telnet $NODE 32958
      done
      SSH and telnet both connect to the other compute nodes. The SSH output is:
      Starting Ansoft Remote Simulation Manager: Done
      Registering product-specific engines with RSM...
      /home1/apps/ANSYS/AnsysEM20.1/Linux64/HFSSCOMENGINE: Done
      And the telnet output is
      Trying 192.168.217.49...
      Connected to 192.168.217.49.
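Because telnet is interactive, a non-interactive port check can be easier to run in a loop. Below is a minimal sketch using bash's /dev/tcp redirection together with the coreutils timeout command. The function name port_open and the 3-second limit are assumptions; the port number 32958 and the file nodefile_local_ip come from the post above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a TCP connection to host $1, port $2
# can be opened within 3 seconds. Relies on bash's /dev/tcp feature.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Check the RSM port on every node listed in the node file, if it exists.
if [ -r nodefile_local_ip ]; then
    while read -r node; do
        if port_open "$node" 32958; then
            echo "$node: port 32958 reachable"
        else
            echo "$node: port 32958 NOT reachable"
        fi
    done < nodefile_local_ip
fi
```

As this thread shows, a reachable port only confirms that something is listening; it does not prove that the engine can be launched on that node.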

      Following is my ansysedt command
      /home1/apps/ANSYS/2021R2/AnsysEM21.2/Linux64/ansysedt -regserver -autoextract -distributed -ng -machinelist list="${ABQHOSTS}" -UseElectronicsPPE -Logfile "9_21_21_2.log" -batchoptions $OptFile -batchsolve -monitor 20210711_Single_Qubit1_a:Optimetrics:ParametricSetup1 "/work2/08252//stampede2/tmp/Cavity_Qubit_Start_parametric_2.aedt"

      From reading the documentation, it looks like libdesktop.so calls a subroutine called "LaunchProcess", which the /schedulers/proxies.cfg file redirects to /schedulers/SLURM_LaunchProcess_linux.py. That script fails to start a process, and schedulers/scripts/WinHPC/ansysem_messages.py produces the error message. I don't know where along this chain the error occurs.

      I even checked the network while running ansysedt.
      ansoftrsm does start listening on port 32958:
      tcp 0 0 0.0.0.0:32958 0.0.0.0:* LISTEN 376438/ansoftrsmser
      My SSH packets go through:
      tcp 0 0 192.168.217.49:33852 192.168.217.51:22 TIME_WAIT -
      And my telnet packets go through:
      tcp 0 0 192.168.217.49:57054 192.168.217.50:32958 FIN_WAIT2 -
      I do see ansysedt and HFSSCOMENGINE connections on local ports, but nothing to the local IPs that I provide for the other compute nodes:
      tcp 0 0 127.0.0.1:41246 127.0.0.1:46590 ESTABLISHED 427384/ansysedt.exe
      tcp 0 0 127.0.0.1:45576 127.0.0.1:39634 ESTABLISHED 427818/HFSSCOMENGIN
      tcp1680 206.76.217.49:38506 129.114.99.212:54845 ESTABLISHED 427922/ansyscl

      Any help is much appreciated
    • J0sh8830
      Subscriber
      Hi riswadkar1994, I hope you are doing well. I am having a similar issue trying to run a distributed solve on the cluster at my university. Have you had any luck with this issue yet?
      Best regards Josh
    • ANSYS_MMadore
      Ansys Employee
    • riswadkar1994
      Subscriber
      I did everything on that page (https://ansyshelp.ansys.com/Views/Secured/Electronics/v211/en/Subsystems/HFSS/HFSS.htm#HPC/IntegrationwithSLURMLinuxScheduler.htm%3FTocPath%3DHFSS%2520Help%7CHigh%2520Performance%2520Computing%7CHigh%2520Performance%2520Computing%2520(HPC)%2520Integration%7C_____8) up to the section "Using the generic scheduler with SLURM".
      Everything in and after this section seems to be GUI based, but I don't have access to the GUI. I am submitting the job from a command line on the login node, after which SLURM schedules it on some compute nodes while I remain in the login node shell.
      Is there something that can be written in the batchoptions file to set the SelectScheduler option from the command line? I do see that the error (Unable to locate or start COM engine on 'c506-124.stampede2.tacc.utexas.edu' :(12:03:38 PM Sep 20, 2021)) comes from schedulers/scripts/WinHPC/ansysem_messages.py, where the SLURM scheduler Python files are located.
    • ANSYS_MMadore
      Ansys Employee
      If you submit using batch, you need:
      export AppFolder=/opt/AnsysEM/AnsysEM21.2/Linux64
      export ANSYSEM_GENERIC_MPI_WRAPPER=${AppFolder}/schedulers/scripts/utils/slurm_srun_wrapper.sh
      export ANSYSEM_COMMON_PREFIX=${AppFolder}/common
      srun_cmd="srun --overcommit --export=ALL -n 1 -N 1 --cpu-bind=none --mem-per-cpu=0 --overlap "
      ${srun_cmd} ${AppFolder}/ansysedt -ng -monitor -waitforlicense -useelectronicsppe=1 -distributed -machinelist numcores=XX -auto -batchoptions "" -batchsolve Project.aedt
    • ANSYS_MMadore
      Ansys Employee
      Important: you must be running AEDT 2021R2 and SLURM 20.x for the following steps.
      Gather the following to confirm the scheduler version and network info:
      $ cat /etc/*release
      $ sinfo -V
      $ ifconfig


      Enabling the tight integration change: this requires editing slurm_srun_wrapper.sh and setting the batch option 'RemoteSpawnCommand'.

      1. Edit slurm_srun_wrapper.sh
      Archive/copy .../AnsysEM21.2/Linux64/schedulers/scripts/utils/slurm_srun_wrapper.sh to slurm_srun_wrapper.sh.ORIG.

      Edit .../AnsysEM21.2/Linux64/schedulers/scripts/utils/slurm_srun_wrapper.sh and insert the following at line 28: host=$(echo "${host}" | cut -d'.' -f1)
      Example (the lines marked => are the inserted ones):
      if [[ -n "$ANSYSEM_SLURM_JOB_ID" ]]
      then
      export SLURM_JOB_ID="${ANSYSEM_SLURM_JOB_ID}"
      echo "set SLURM_JOB_ID=${SLURM_JOB_ID}" >> "$DEBUG_FILE"
      fi
      => # tfs447753
      => host=$(echo "${host}" | cut -d'.' -f1)
      verStr=`scontrol --version`
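To see what the inserted line does, the sketch below applies the same cut expression to the fully qualified hostname that appears in the error message earlier in this thread. The function name short_host is hypothetical; the cut command itself is the one given in step 1. One plausible reading is that the wrapper needs the short node name that SLURM knows, but the thread does not state the reason for the change.

```shell
#!/bin/bash
# Hypothetical wrapper around the expression inserted in step 1:
# keep only the part of a hostname before the first dot.
short_host() {
    echo "$1" | cut -d'.' -f1
}

# A fully qualified name is reduced to its short form.
short_host "c506-124.stampede2.tacc.utexas.edu"
# A name without dots passes through unchanged.
short_host "c506-124"
```

Both calls print c506-124, so the change is harmless on clusters that already use short names.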

      2. After changing slurm_srun_wrapper.sh, run the following to set the defaults to tight integration (this will create or modify //AnsysEM21.2/Linux64/config/default.XML).

      Run as root or as the installation owner:
      cd /opt/AnsysEM/AnsysEM21.2/Linux64
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Desktop/Settings/ProjectOptions/ProductImprovementOptStatus" -RegistryValue 0 -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS 3D Layout Design/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS-IE/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Maxwell 2D/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Maxwell 3D/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Q3D Extractor/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Icepak/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS 3D Layout Design/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS-IE/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Maxwell 3D/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Maxwell 2D/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Q3D Extractor/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Icepak/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
      ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Desktop/Settings/ProjectOptions/ProductImprovementOptStatus" -RegistryValue 0 -RegistryLevel install
      # ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Desktop/Settings/ProjectOptions/AnsysEMPreferredSubnetAddress" -RegistryValue "192.168.1.0/24" -RegistryLevel install
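The MPIVendor and RemoteSpawnCommand commands above differ only in the product and key names, so they can be generated in a loop. The following is a dry-run sketch that only prints the commands for review; the function name print_registry_cmds is hypothetical, while every flag and value is taken from the list above. It leaves out the HFSS-IE entries, which a later reply in this thread describes as legacy, and the ProductImprovementOptStatus setting, which would still need to be run separately.

```shell
#!/bin/bash
# Hypothetical dry-run helper: print (do not execute) the UpdateRegistry
# commands that set MPIVendor and RemoteSpawnCommand for each product.
# The function body runs in a subshell so its variables stay local.
print_registry_cmds() (
    product_name="ElectronicsDesktop2021.2"
    for product in "HFSS" "HFSS 3D Layout Design" "Maxwell 2D" \
                   "Maxwell 3D" "Q3D Extractor" "Icepak"; do
        for pair in "MPIVendor=Intel" "RemoteSpawnCommand=scheduler"; do
            key="${pair%%=*}"     # text before the first "="
            value="${pair#*=}"    # text after the first "="
            printf './UpdateRegistry -set -ProductName %s -RegistryKey "%s/%s" -RegistryValue "%s" -RegistryLevel install\n' \
                "$product_name" "$product" "$key" "$value"
        done
    done
)

print_registry_cmds
```

After checking the printed lines, they could be run from the Linux64 installation directory, for example by piping the output to sh.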



      4. An example batch script:

      Create $HOME/anstest/job.sh with the following contents (correct the highlighted values):

      #!/bin/bash
      #SBATCH -N 3            # allocate 3 nodes
      #SBATCH -n 12           # 12 tasks total
      ##SBATCH --exclusive    # no other jobs on the nodes while job is running
      #SBATCH -J AnsysEMTest  # sensible name for the job

      #Set job folder, scratch folder, project, and design (Design is optional)
      JobFolder=$(pwd)
      ProjName=OptimTee-DiscreteSweep-FineMesh.aedt
      DsnName="TeeModel:Nominal"

      # Executable path and SLURM custom integration variables
      AppFolder=/opt/AnsysEM/AnsysEM21.2/Linux64

      # setup environment and srun
      export ANSYSEM_GENERIC_MPI_WRAPPER=${AppFolder}/schedulers/scripts/utils/slurm_srun_wrapper.sh
      export ANSYSEM_COMMON_PREFIX=${AppFolder}/common
      export ANSYSEM_TASKS_PER_NODE=${SLURM_TASKS_PER_NODE}

      # setup srun
      srun_cmd="srun --overcommit --export=ALL -n 1 --cpu-bind=none --mem-per-cpu=0 --overlap "

      # MPI timeout: the default of 30 minutes is meant for cloud; for on-premises clusters a lower value of 120 or 240 seconds is suggested
      export MPI_TIMEOUT_SECONDS=120

      # System networking environment variables. These depend on the HPC system and should not be edited by users!
      # export ANSOFT_MPI_INTERCONNECT=ib
      # export ANSOFT_MPI_INTERCONNECT_VARIANT=ofed

      # Skip dependency check
      # export ANS_NODEPCHECK=1

      # Autocompute total cores from node allocation
      CoreCount=$((SLURM_JOB_NUM_NODES * SLURM_CPUS_ON_NODE))

      # Run Job
      ${srun_cmd} ${AppFolder}/ansysedt -ng -monitor -waitforlicense -useelectronicsppe=1 -distributed -auto -machinelist numcores=$CoreCount -batchoptions "" -batchsolve ${DsnName} ${JobFolder}/${ProjName}



      Then run it:
      $ dos2unix $HOME/anstest/job.sh
      $ chmod +x $HOME/anstest/job.sh
      $ sbatch $HOME/anstest/job.sh

      When complete, send the resulting log files:
      $HOME/anstest/OptimTee-DiscreteSweep-FineMesh.aedt.batchinfo/*.log
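The CoreCount line in the script multiplies two SLURM environment variables. Here is a small worked sketch of that arithmetic with a guard for the case where the variables are unset, for example when the script is run outside a SLURM job. The function name core_count and the fallback values 3 and 4 are hypothetical. Note that SLURM_CPUS_ON_NODE describes the node running the batch script, so the product assumes every node contributes the same number of CPUs.

```shell
#!/bin/bash
# Hypothetical helper mirroring the CoreCount line of the batch script.
# $1 = number of nodes, $2 = CPUs per node.
core_count() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "core_count: node or CPU count missing (not inside a SLURM job?)" >&2
        return 1
    fi
    echo $(( $1 * $2 ))
}

# Inside a job the SLURM variables are used; the fallbacks (3 nodes,
# 4 CPUs each) are illustrative values only.
core_count "${SLURM_JOB_NUM_NODES:-3}" "${SLURM_CPUS_ON_NODE:-4}"
```

With 3 nodes and 4 CPUs per node this gives 12, which matches the 12 tasks requested in the example script.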


    • riswadkar1994
      Subscriber
      I am running slurm 18.08.5-2
    • riswadkar1994
      Subscriber
      srun: the --overlap option is not recognized
    • ANSYS_MMadore
      Ansys Employee
      Correct. Please remove --overlap; it is only supported under SLURM version 20.
    • ANSYS_MMadore
      Ansys Employee
      Please note, slurm 18 is not tested or supported.
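For scripts that must run on more than one cluster, the srun options can be chosen from the reported SLURM version. The sketch below is a minimal illustration: the function names slurm_major and srun_flags are hypothetical, and the threshold of major version 20 follows the replies above, which do not name the exact release that introduced --overlap. It does not make SLURM 18 a supported configuration.

```shell
#!/bin/bash
# Hypothetical helper: extract the major version from strings such as
# "slurm 18.08.5-2" (the format printed by sinfo -V) or "20.11.8".
slurm_major() {
    echo "$1" | sed -e 's/^[^0-9]*//' -e 's/\..*$//'
}

# Hypothetical helper: print the srun options from the example script,
# adding --overlap only when the major version is 20 or higher.
# The function body runs in a subshell so its variables stay local.
srun_flags() (
    flags="--overcommit --export=ALL -n 1 --cpu-bind=none --mem-per-cpu=0"
    major="$(slurm_major "$1")"
    if [ "${major:-0}" -ge 20 ]; then
        flags="$flags --overlap"
    fi
    echo "$flags"
)

# In a job script the argument would be "$(sinfo -V)".
srun_flags "slurm 18.08.5-2"
```

The result can be assigned with srun_cmd="srun $(srun_flags "$(sinfo -V)") " in place of the fixed string in the batch script.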
    • zwyll
      Subscriber
      Hi mmadore, I have root access to the machine on which riswadkar1994 is running ANSYS HFSS, and I have followed your steps 1 and 2. Step 1 is simple and caused no problems. However, in step 2 I encounter two errors when running the commands for HFSS-IE:

      staff.stampede2(575)# ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS-IE/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
      MPIVendor is not DSO option
      Available DSO options forare:
      HPCLicenseType
      registry key is not defined in this product
      error on to set registry value at

      and

      staff.stampede2(580)# ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS-IE/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
      RemoteSpawnCommand is not DSO option
      Available DSO options forare:
      HPCLicenseType
      registry key is not defined in this product
      error on to set registry value at

      Other commands seem to be working well. Could you please let me know how to resolve these two issues?

      Thanks.
    • ANSYS_MMadore
      Ansys Employee
      You can ignore the HFSS-IE entries; those are legacy and not required.