August 31, 2021 at 7:34 pm
riswadkar1994
Subscriber
Hi all,
Could someone point me to some standardized instructions for running HFSS on multiple nodes of a SLURM-based cluster? Currently we are limited by the RAM on a single node, but multi-core simulations (parametric sweeps) do work on a single node.
I did find an interesting solution here (https://forum.ansys.com/discussion/18341/running-hfss-on-a-slurm-based-machine-rsm-cannot-be-accessed), but I cannot yet get my cluster admins to install the ansoftrsmservice executable. Is there a way around installing this executable? I can see that ARC is pre-installed on the cluster, but running the ARC node executable on every node before submitting the job might not be possible. If anyone has a standardized procedure for doing multi-node jobs on SLURM-based clusters using ARC or anything else, I will be happy to hear it.
I tried to submit a job from ansysedt on a cluster compute node (https://vis.tacc.utexas.edu/#) as follows:
Tools->Job Management->Submit Job
The hostnames of the assigned nodes are given by the NODELIST column of the squeue -u output. Following is the preview of the job submission:
/home1/apps/ANSYS/AnsysEM20.1/Linux64/desktopjob -cmd dso -jobid RSM_14259 -machinelist list=c506-011:1:21:90%:1,c506-012:1:21:90%:1,c506-013:1:21:90%:1,c506-014:1:21:90%:1 -monitor -waitforlicense -useelectronicsppe -ng -batchoptions "" -batchsolve 20210711_Single_Qubit1_a:Optimetrics:ParametricSetup1 'work2/08252/ameya/stampede2/tmp/Cavity Qubit Start_parametric_2.aedt'
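For reference, a machinelist string in that form can also be assembled inside the SLURM allocation itself rather than copied from squeue output. A minimal sketch, reusing the per-host suffix from the preview above (scontrol and the SLURM_JOB_NODELIST variable are assumed to be available in the job environment):
# expand the compressed SLURM node list into one hostname per line
NODES=$(scontrol show hostnames "${SLURM_JOB_NODELIST}")
MACHINELIST=""
for HOST in ${NODES}
do
    # append each host with the same per-host settings used in the preview (:1:21:90%:1)
    MACHINELIST="${MACHINELIST}${HOST}:1:21:90%:1,"
done
MACHINELIST="${MACHINELIST%,}"   # drop the trailing comma
echo "-machinelist list=${MACHINELIST}"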
I found that the default port used by slurmd to listen for incoming requests from slurmctld is 6818 (https://slurm.schedmd.com/network.html), but I could not configure this port when submitting the job.
When I submit the job, I get a message that the job submission is successful and am redirected to Monitor Job - RSM, which gives the following error message:
Connecting to running job.
==================================
Running LSDSO job 'RSM_5415'
Location: /home1/apps/ANSYS/AnsysEM20.1/Linux64/desktopjob
Batch Solve/Save: 'work2/08252/ameya/stampede2/tmp/Cavity Qubit Start_parametric_2.aedt
Starting Batch Run: 04:11:30PM Thursday, August 12, 2021
Temp directory: /tmp
==================================
Error: (T=08/12/21 16:11:31): Failed to launch engine on machine 'c506-012' or obtain a compatible interface
Error: (T=08/12/21 16:11:31): Failed to activate child job. Node 'c506-012' is removed from this job's available resources.
Error: (T=08/12/21 16:11:31): Failed to launch engine on machine 'c506-013' or obtain a compatible interface
Error: (T=08/12/21 16:11:31): Failed to activate child job. Node 'c506-013' is removed from this job's available resources.
Error: (T=08/12/21 16:11:31): Failed to launch...
September 1, 2021 at 12:48 pm
ANSYS_MMadore
Ansys Employee
If you run version 2021 R1 or 2021 R2, there is direct integration with SLURM; please refer to the Help documentation: https://ansyshelp.ansys.com/Views/Secured/Electronics/v211/en/Subsystems/HFSS/HFSS.htm#HPC/IntegrationwithSLURMLinuxScheduler.htm%3FTocPath%3DHFSS%2520Help%7CHigh%2520Performance%2520Computing%7CHigh%2520Performance%2520Computing%2520(HPC)%2520Integration%7C_____8
Running prior releases requires a custom scheduler proxy and ansoftrsm.
ARC is not an option.
October 5, 2021 at 11:39 pm
riswadkar1994
Subscriber
I tried using the scheduler proxy and running ansoftrsmservice on each node. I get the following error on all compute nodes other than localhost:
Unable to locate or start COM engine on '192.168.217.68'
I ran ansoftrsmservice on each node as follows:
for NODE in $(cat nodefile_local_ip)
do
# start the RSM service
ssh $NODE $WORKDIR/rsm/Linux64/ansoftrsmservice start
# register engines with RSM (otherwise it'll complain that it can't find it)
ssh $NODE /home1/apps/ANSYS/AnsysEM20.1/Linux64/RegisterEnginesWithRSM.pl add
telnet $NODE 32958
done
Both SSH and telnet connect to the other compute nodes. The SSH output tells me the following:
Starting Ansoft Remote Simulation Manager: Done
Registering product-specific engines with RSM...
/home1/apps/ANSYS/AnsysEM20.1/Linux64/HFSSCOMENGINE: Done
And the telnet output is
Trying 192.168.217.49...
Connected to 192.168.217.49.
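As an aside, the same port check can be done non-interactively so it does not hang in a loop the way telnet can. A rough sketch, assuming bash with /dev/tcp support and the coreutils timeout command on the submitting node:
for NODE in $(cat nodefile_local_ip)
do
    # exit status 0 means something accepted a TCP connection on the RSM port
    if timeout 5 bash -c "echo > /dev/tcp/${NODE}/32958" 2>/dev/null
    then
        echo "${NODE}: port 32958 reachable"
    else
        echo "${NODE}: port 32958 NOT reachable"
    fi
done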
Following is my ansysedt command
/home1/apps/ANSYS/2021R2/AnsysEM21.2/Linux64/ansysedt -regserver -autoextract -distributed -ng -machinelist list="${ABQHOSTS}" -UseElectronicsPPE -Logfile "9_21_21_2.log" -batchoptions $OptFile -batchsolve -monitor 20210711_Single_Qubit1_a:Optimetrics:ParametricSetup1 "/work2/08252//stampede2/tmp/Cavity_Qubit_Start_parametric_2.aedt"
From reading the documentation, it looks like libdesktop.so calls a subroutine called "LaunchProcess", which the /schedulers/proxies.cfg file redirects to /schedulers/SLURM_LaunchProcess_linux.py; that script fails to start a process, and schedulers/scripts/WinHPC/ansysem_messages.py produces the error message. I don't know where along this chain the error occurs.
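One way to narrow down which stage of that chain produces the message is to search the scheduler scripts for the error text and inspect the proxy configuration directly. A sketch, using the 2021 R2 install path from the ansysedt command above:
INSTALL=/home1/apps/ANSYS/2021R2/AnsysEM21.2/Linux64
# find which script(s) contain the error text
grep -rn "Unable to locate or start COM engine" "${INSTALL}/schedulers/"
# inspect the proxy configuration and the SLURM launch scripts it points to
cat "${INSTALL}/schedulers/proxies.cfg"
ls -l "${INSTALL}/schedulers/scripts/"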
I even checked the network while running ansysedt:
I find that ansoftrsmservice does start listening on port 32958:
tcp        0      0 0.0.0.0:32958          0.0.0.0:*              LISTEN      376438/ansoftrsmser
I do find my SSH packets going through
tcp        0      0 192.168.217.49:33852   192.168.217.51:22      TIME_WAIT   -
And my telnet packets going through
tcp        0      0 192.168.217.49:57054   192.168.217.50:32958   FIN_WAIT2   -
I do see ansysedt and HFSSCOMENGINE connections on local ports, but nothing on the other local IPs that I provide for the other compute nodes:
tcp        0      0 127.0.0.1:41246        127.0.0.1:46590        ESTABLISHED 427384/ansysedt.exe
tcp        0      0 127.0.0.1:45576        127.0.0.1:39634        ESTABLISHED 427818/HFSSCOMENGIN
tcp      168      0 206.76.217.49:38506    129.114.99.212:54845   ESTABLISHED 427922/ansyscl
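For completeness, the same socket listing can be collected from every node in the allocation in one pass. A sketch, assuming ss is installed on the compute nodes and reusing the nodefile from the loop above:
for NODE in $(cat nodefile_local_ip)
do
    echo "=== ${NODE} ==="
    # show all TCP sockets (listening and established) that involve the RSM port
    ssh "${NODE}" "ss -tanp | grep 32958 || echo 'no sockets on 32958'"
done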
Any help is much appreciated
November 19, 2021 at 2:15 am
J0sh8830
Subscriber
Hi riswadkar1994, I hope you are doing well. I am having a similar issue trying to run a distributed solve on the cluster at my university. Have you had any luck with this issue yet?
Best regards, Josh
November 19, 2021 at 12:07 pm
ANSYS_MMadore
Ansys Employee
Have you followed the Ansys Help documentation regarding integration with SLURM?
https://ansyshelp.ansys.com/Views/Secured/Electronics/v211/en/Subsystems/HFSS/HFSS.htm#HPC/IntegrationwithSLURMLinuxScheduler.htm%3FTocPath%3DHFSS%2520Help%7CHigh%2520Performance%2520Computing%7CHigh%2520Performance%2520Computing%2520(HPC)%2520Integration%7C_____8
November 20, 2021 at 11:08 pm
riswadkar1994
Subscriber
I did everything on that page (https://ansyshelp.ansys.com/Views/Secured/Electronics/v211/en/Subsystems/HFSS/HFSS.htm#HPC/IntegrationwithSLURMLinuxScheduler.htm%3FTocPath%3DHFSS%2520Help%7CHigh%2520Performance%2520Computing%7CHigh%2520Performance%2520Computing%2520(HPC)%2520Integration%7C_____8) up until the section "Using the Generic Scheduler with SLURM". Everything in and after that section seems to be GUI-based, but I don't have access to the GUI: I am submitting the job from the command line on the login node, after which SLURM schedules it onto compute nodes while I remain in the login-node shell.
Is there something that can be written in the batchoptions file to pass the SelectScheduler option from the command line? I do see that the error (Unable to locate or start COM engine on 'c506-124.stampede2.tacc.utexas.edu', 12:03:38 PM, Sep 20, 2021) comes from schedulers/scripts/WinHPC/ansysem_messages.py, where the SLURM scheduler Python files are located.
November 22, 2021 at 2:21 pm
ANSYS_MMadore
Ansys Employee
If you submit using batch, then you need the following:
export AppFolder=/opt/AnsysEM/AnsysEM21.2/Linux64
export ANSYSEM_GENERIC_MPI_WRAPPER=${AppFolder}/schedulers/scripts/utils/slurm_srun_wrapper.sh
export ANSYSEM_COMMON_PREFIX=${AppFolder}/common
srun_cmd="srun --overcommit --export=ALL -n 1 -N 1 --cpu-bind=none --mem-per-cpu=0 --overlap "
${srun_cmd} ${AppFolder}/ansysedt -ng -monitor -waitforlicense -useelectronicsppe=1 -distributed -machinelist numcores=XX -auto -batchoptions "" -batchsolve Project.aedt
November 22, 2021 at 3:03 pm
ANSYS_MMadore
Ansys Employee
Important: you must be running AEDT 2021 R2 and SLURM 20.x for the following steps.
You would want to gather the following to confirm the OS, scheduler version, and network info:
$ cat /etc/*release
$ sinfo -V
$ ifconfig
Enabling the tight-integration change requires editing slurm_srun_wrapper.sh and setting the 'RemoteSpawnCommand' option.
1. Edit slurm_srun_wrapper.sh:
Archive/copy .../AnsysEM21.2/Linux64/schedulers/scripts/utils/slurm_srun_wrapper.sh to slurm_srun_wrapper.sh.ORIG.
Edit .../AnsysEM21.2/Linux64/schedulers/scripts/utils/slurm_srun_wrapper.sh and insert the following at line 28: host=$(echo "${host}" | cut -d'.' -f1)
For example (the lines marked with => are the insertion):
if [[ -n "$ANSYSEM_SLURM_JOB_ID" ]]
then
export SLURM_JOB_ID="${ANSYSEM_SLURM_JOB_ID}"
echo "set SLURM_JOB_ID=${SLURM_JOB_ID}" >> "$DEBUG_FILE"
fi
=> # tfs447753
=> host=$(echo "${host}" | cut -d'.' -f1)
verStr=`scontrol --version`
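If you would rather script that edit than make it by hand, a sed sketch follows; it assumes the stock 2021 R2 wrapper so that line 28 is still the correct insertion point (keep the .ORIG copy in case the line offset differs on your install):
WRAPPER=/opt/AnsysEM/AnsysEM21.2/Linux64/schedulers/scripts/utils/slurm_srun_wrapper.sh
cp -p "${WRAPPER}" "${WRAPPER}.ORIG"
# insert the short-hostname line so it becomes line 28 of the wrapper
sed -i '28i host=$(echo "${host}" | cut -d'"'"'.'"'"' -f1)' "${WRAPPER}"
# confirm the insertion landed where expected
sed -n '25,32p' "${WRAPPER}"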
2. After making the slurm_srun_wrapper.sh file change, run the following to set the defaults for tight integration (this will create/modify .../AnsysEM21.2/Linux64/config/default.XML).
As root or the installation owner:
cd /opt/AnsysEM/AnsysEM21.2/Linux64
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Desktop/Settings/ProjectOptions/ProductImprovementOptStatus" -RegistryValue 0 -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS 3D Layout Design/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS-IE/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Maxwell 2D/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Maxwell 3D/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Q3D Extractor/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Icepak/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS 3D Layout Design/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS-IE/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Maxwell 3D/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Maxwell 2D/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Q3D Extractor/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Icepak/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Desktop/Settings/ProjectOptions/ProductImprovementOptStatus" -RegistryValue 0 -RegistryLevel install
# ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "Desktop/Settings/ProjectOptions/AnsysEMPreferredSubnetAddress" -RegistryValue "192.168.1.0/24" -RegistryLevel install
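Since the values and level are the same for every product, the per-product commands above can also be issued in a loop; this is just a compact restatement of the same registry keys (drop HFSS-IE from the list if your install rejects it, as noted later in this thread):
cd /opt/AnsysEM/AnsysEM21.2/Linux64
for PROD in "HFSS" "HFSS 3D Layout Design" "HFSS-IE" "Maxwell 2D" "Maxwell 3D" "Q3D Extractor" "Icepak"
do
    # MPI vendor and remote spawn command, per product, at the install level
    ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "${PROD}/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
    ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "${PROD}/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
done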
4. An example batch script:
Create $HOME/anstest/job.sh with the following contents (adjust the paths, project, and design names for your system):
#!/bin/bash
#SBATCH -N 3                # allocate 3 nodes
#SBATCH -n 12               # 12 tasks total
##SBATCH --exclusive        # no other jobs on the nodes while this job is running
#SBATCH -J AnsysEMTest      # sensible name for the job
#Set job folder, scratch folder, project, and design (Design is optional)
JobFolder=$(pwd)
ProjName=OptimTee-DiscreteSweep-FineMesh.aedt
DsnName="TeeModel:Nominal"
# Executable path and SLURM custom integration variables
AppFolder=/opt/AnsysEM/AnsysEM21.2/Linux64
# setup environment and srun
export ANSYSEM_GENERIC_MPI_WRAPPER=${AppFolder}/schedulers/scripts/utils/slurm_srun_wrapper.sh
export ANSYSEM_COMMON_PREFIX=${AppFolder}/common
export ANSYSEM_TASKS_PER_NODE=${SLURM_TASKS_PER_NODE}
# setup srun
srun_cmd="srun --overcommit --export=ALL -n 1 --cpu-bind=none --mem-per-cpu=0 --overlap "
# MPI timeout defaults to 30 min (cloud default); 120 or 240 seconds is suggested on-prem
export MPI_TIMEOUT_SECONDS=120
# System networking environment variables - HPC-system dependent; these should not be edited by users!
# export ANSOFT_MPI_INTERCONNECT=ib
# export ANSOFT_MPI_INTERCONNECT_VARIANT=ofed
# Skip dependency check
# export ANS_NODEPCHECK=1
# Autocompute total cores from node allocation
CoreCount=$((SLURM_JOB_NUM_NODES * SLURM_CPUS_ON_NODE))
# Run Job
${srun_cmd} ${AppFolder}/ansysedt -ng -monitor -waitforlicense -useelectronicsppe=1 -distributed -auto -machinelist numcores=$CoreCount -batchoptions "" -batchsolve ${DsnName} ${JobFolder}/${ProjName}
Then run it:
$ dos2unix $HOME/anstest/job.sh
$ chmod +x $HOME/anstest/job.sh
$ sbatch $HOME/anstest/job.sh
When complete, send the resulting log files:
$HOME/anstest/OptimTee-DiscreteSweep-FineMesh.aedt.batchinfo/*.log
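A quick first pass over those logs before sending them can be done with plain grep; nothing AEDT-specific is assumed here:
# list the batch logs and scan them for obvious problems
ls -l $HOME/anstest/OptimTee-DiscreteSweep-FineMesh.aedt.batchinfo/
grep -inE "error|fail|abort" $HOME/anstest/OptimTee-DiscreteSweep-FineMesh.aedt.batchinfo/*.log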
November 22, 2021 at 3:26 pm
riswadkar1994
Subscriber
I am running SLURM 18.08.5-2.
November 22, 2021 at 3:28 pm
riswadkar1994
Subscriber
The srun --overlap option is not recognized.
November 22, 2021 at 3:31 pm
ANSYS_MMadore
Ansys Employee
Correct, please remove --overlap; that option is only supported under SLURM 20.x.
November 22, 2021 at 3:31 pm
ANSYS_MMadore
Ansys Employee
Please note, SLURM 18 is not tested or supported.
November 23, 2021 at 9:32 pm
zwyll
Subscriber
Hi mmadore, I have root access to the machine that riswadkar1994 is running Ansys HFSS on, and I have followed your steps 1 and 2. Step 1 is simple and caused no problems. However, in step 2 I encounter two errors when running the commands for HFSS-IE:
staff.stampede2(575)# ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS-IE/MPIVendor" -RegistryValue "Intel" -RegistryLevel install
MPIVendor is not DSO option
Available DSO options forare:
HPCLicenseType
registry key is not defined in this product
error on to set registry value at
and
staff.stampede2(580)# ./UpdateRegistry -set -ProductName ElectronicsDesktop2021.2 -RegistryKey "HFSS-IE/RemoteSpawnCommand" -RegistryValue "scheduler" -RegistryLevel install
RemoteSpawnCommand is not DSO option
Available DSO options forare:
HPCLicenseType
registry key is not defined in this product
error on to set registry value at
Other commands seem to be working well. Could you please let me know how to resolve these two issues?
Thanks.
November 23, 2021 at 10:23 pm
ANSYS_MMadore
Ansys Employee
You can actually ignore the HFSS-IE entries; those are legacy and not required.