March 24, 2022 at 9:47 am, amnonjw (Subscriber)
Hi Lumerical team,
We are working with a cloud computing server where we can request to run jobs on several CPUs. The server is used for several applications other than Lumerical, and it can also run a job across several nodes using the same number of CPUs. A quick search of this forum showed that other people are running Lumerical on several nodes.
The code used to submit the job to the server's job manager is below. When checking the job status (using qstat) I see that the job manager accepted the request (thanks to the addition of the flag "#PBS -l place=scatter:excl"):
March 29, 2022 at 6:37 am, amnonjw (Subscriber)
Update:
When I checked the output file (the ".o" file) I saw multiple lines like these:
n037(process 26): Creating temporary file  failed
n037(process 26): Temporary file  is on Read Only mode, it should be Read and Write mode.
n037(process 28): Creating temporary file  failed
n037(process 28): Temporary file  is on Read Only mode, it should be Read and Write mode.
n037(process 33): Creating temporary file  failed
n037(process 33): Temporary file  is on Read Only mode, it should be Read and Write mode.
n037(process 35): Creating temporary file  failed
n037(process 35): Temporary file  is on Read Only mode, it should be Read and Write mode.
I think that Lumerical's mpiexec executable is attempting to create a temporary file to share data across nodes. Either the temporary file is being created in the application folder, where my user has no permission to write, or the MPI application is run as a general system user that has no permission to write in my user folder.
Can you please check your code in the executables "bin/fdtd-engine-mpich2nem" and "mpich2/nemesis/bin/mpiexec" for the above error lines?
March 29, 2022 at 7:55 pm, Lito Yap (Ansys Employee)
The simulation file should be on shared network storage that is accessible by the same path/folder on all nodes running the simulation, where your user/account has read/write access.
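One way to test this requirement before submitting a job is to try creating a scratch file in the simulation folder, ideally from each node that will run the job. The sketch below is only an illustration: `check_rw` is a hypothetical helper name, not part of Lumerical or PBS, and it only verifies that the current user can create and delete a file in the given directory.

```shell
#!/bin/sh
# check_rw DIR: succeed only if the current user can create and remove
# a file in DIR. Hypothetical helper, not part of Lumerical or PBS.
check_rw() {
    dir=$1
    # mktemp fails if DIR does not exist or is not writable
    tmp=$(mktemp "$dir/rwtest.XXXXXX" 2>/dev/null) || {
        echo "no read/write access: $dir"
        return 1
    }
    rm -f "$tmp"
    echo "read/write ok: $dir"
}
```

Inside a PBS job this could be called as `check_rw "$PBS_O_WORKDIR"`, assuming the simulation file lives in the directory the job was submitted from.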
March 29, 2022 at 7:58 pm, amnonjw (Subscriber)
Thank you Lito, I will check with the cluster administrator.
March 31, 2022 at 11:34 am, amnonjw (Subscriber)
Hi Lito, currently there are no issues with running a job on a single node with multiple CPUs. Since I don't have control over which specific node is used, I think that all nodes can access files in my user directory without problems. Is this relevant to your comment?
I found another information page on using mpich2. There's an example for running a simulation on several nodes:
$ /opt/lumerical/v221/mpich2/nemesis/bin/mpiexec -hosts node01:4,hosts02:8,node03:16 /opt/lumerical/v221/bin/fdtd-engine-mpich2nem -logall -t 1 $HOME/working/paralleltest.fsp
I see that the command must include the names of the nodes, either as hostnames or as IP addresses. Is this mandatory?
In our cluster server we can request a job that runs on several nodes but we don't know in advance the hostnames or IP addresses of the nodes. I don't know of a way for a running job to get (on the fly) a list of IPs of the nodes that are allocated for it.
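Under PBS, a running job can usually read the hostnames allocated to it from the file named by `$PBS_NODEFILE` (the same variable used in the job script later in this thread). As a rough sketch, the helper below turns such a file into the `host:count` list used by the `-hosts` option in the example above. `hosts_from_nodefile` is a hypothetical name, and the sketch assumes one hostname per line, repeated once per MPI process; how often each host is repeated depends on the scheduler and on how the resources were requested.

```shell
#!/bin/sh
# hosts_from_nodefile FILE: print "host:count,host:count,..." where count is
# the number of times each host appears in FILE (first-seen order is kept).
# Hypothetical helper; FILE would typically be "$PBS_NODEFILE" inside a job.
hosts_from_nodefile() {
    awk '
        NF == 0 { next }                  # skip blank lines
        !($1 in n) { order[++k] = $1 }    # remember the first appearance
        { n[$1]++ }                       # count entries per host
        END {
            for (i = 1; i <= k; i++)
                printf "%s%s:%d", (i > 1 ? "," : ""), order[i], n[order[i]]
            printf "\n"
        }
    ' "$1"
}
```

Inside a job script the result could be passed on as `-hosts "$(hosts_from_nodefile "$PBS_NODEFILE")"`. Many MPI launchers can also read the node file directly through a machine file option, which avoids building the list by hand.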
March 31, 2022 at 7:35 pm, Lito Yap (Ansys Employee)
The example shown in our guide applies when you know the hostnames/IPs of the machines and run the simulation in distributed mode without a job scheduler. Check with your IT team whether a different MPI, e.g. OpenMPI or Intel MPI, is configured on the cluster for parallel jobs, and run the simulation using that MPI installation.
April 12, 2022 at 1:27 pm, amnonjw (Subscriber)
Hello, I contacted our cluster support team. They informed me that we have the following installations available:
They asked which versions are compatible with Lumerical's FDTD engine executables.
April 12, 2022 at 5:44 pm, Lito Yap (Ansys Employee)
We support Intel MPI 2019 and OpenMPI 3.1.4. I think OpenMPI 3.1.6 would be the closest. Hope this helps.
April 13, 2022 at 9:36 am, amnonjw (Subscriber)
Hi Lito, I think I managed to run successfully on several nodes. See below the script I used to run on 3 nodes with 10 cores each.
The only indication I have that the application actually uses all of the allocated resources is the run time (for example, compared with a job that uses 10 cores on a single node). In the log file, the 6th row says "Running on host: n024", and I know that n024 is the name of one of the nodes used; maybe it is the first in the list of nodes. Is this because the fdtd-engine doesn't actually know it is running on several nodes in parallel?
Is there any indication I can get to verify the simulation is running on several nodes simultaneously?
Code for PBS:
#PBS -N transformer
#PBS -q zeus_all_q
#PBS -l select=3:ncpus=10
#PBS -l walltime=24:00:00
#PBS -l place=scatter
$MPIRUN --mca mpi_warn_on_fork 0 -machinefile $PBS_NODEFILE $MY_PROG ./$INPUT
April 13, 2022 at 7:54 pm, Lito Yap (Ansys Employee)
The simulation_p0.log file should indicate the number of processors used:
All processes are communicating and ready to proceed to simulation...
Running fsp file: C:\Program Files\Lumerical\v221\bin\fdtd-engine-msmpi.exe -t 1 -remote paralleltest.fsp
number of processors is 12
Or try to run with the -logall option for the Lumerical engine.
$MPIRUN --mca mpi_warn_on_fork 0 -machinefile $PBS_NODEFILE $MY_PROG -t 1 -logall ./$INPUT
This should create a corresponding simulation_p##.log file for each of the processes/cores used to run the simulation. For example, with #PBS -l select=3:ncpus=10 you will get 30 simulation_p##.log files.
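A quick way to check this after a run is to count the per-process log files and compare the result with the number of processes requested. The sketch below assumes log names of the form `<prefix>_p<N>.log`, as described above; `count_rank_logs` and its arguments are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# count_rank_logs DIR PREFIX: print how many PREFIX_p<N>.log files exist
# in DIR. Hypothetical helper; the file name pattern is an assumption
# based on the simulation_p##.log names mentioned in this thread.
count_rank_logs() {
    count=0
    for f in "$1"/"$2"_p*.log; do
        # an unmatched glob stays literal, so test that the file exists
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

For the job above, `count_rank_logs . simulation` would be expected to print 30 if all processes wrote a log; a smaller number suggests that fewer processes were started than requested.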