Lumerical parallel running on several nodes

    • amnonjw

      Hi Lumerical team,

      We are working with a cloud computing server where we can request to run jobs on several CPUs. The cloud server is used for several applications other than Lumerical, and can also run jobs across several nodes using the same total number of CPUs. A quick search in this forum showed other people are running Lumerical on several nodes.

      The code used to submit the job to the server job-manager is below. When checking the job status (using qstat) I see that the job manager accepted the request (thanks to the addition of the flag "#PBS -l place=scatter:excl").
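A minimal sketch of the PBS directives being described (job name, queue, and resource names are illustrative, not the original script; the "scatter:excl" placement is the flag quoted above):

```shell
#!/bin/bash
# Minimal sketch of a PBS submission with exclusive, scattered placement:
# request 3 nodes with 10 cores each, each node used exclusively by this job.
#PBS -N fdtd-job
#PBS -l select=3:ncpus=10
#PBS -l place=scatter:excl
#PBS -l walltime=24:00:00
cd "$PBS_O_WORKDIR"
```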

    • amnonjw
      When I checked the output file (".o" file) I see multiple lines:
      n037(process 26): Creating temporary file [] failed
      n037(process 26): Temporary file [] is on Read Only mode, it should be Read and Write mode.
      n037(process 28): Creating temporary file [] failed
      n037(process 28): Temporary file [] is on Read Only mode, it should be Read and Write mode.
      n037(process 33): Creating temporary file [] failed
      n037(process 33): Temporary file [] is on Read Only mode, it should be Read and Write mode.
      n037(process 35): Creating temporary file [] failed
      n037(process 35): Temporary file [] is on Read Only mode, it should be Read and Write mode.

      I think that the Lumerical MPI executable is attempting to create a temporary file to share data across the nodes. Either the temporary file is being created in the application folder, where my user has no permission to write, or the MPI application runs as a general system user that has no permission to write in my user folder.
      Can you please check your code in the executables "bin/fdtd-engine-mpich2nem" and "mpich2/nemesis/bin/mpiexec" for the source of the above error lines?

    • Lito Yap
      Ansys Employee
      The simulation file should be on shared network storage that is accessible by the same path/folder on all nodes running the simulation where your user/account has read/write access.
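A quick way to check this from a shell on any of the nodes (a minimal sketch; the file name is illustrative and created here just for the demo) is to test read/write access on the shared path:

```shell
# Minimal sketch: verify read/write access to the simulation file on the
# shared mount. "paralleltest.fsp" is an illustrative name; repeating this
# on each node (e.g. via ssh) confirms the mount behaves the same everywhere.
FSP=paralleltest.fsp
touch "$FSP"                      # stand-in file for this demo
if [ -r "$FSP" ] && [ -w "$FSP" ]; then
  echo "read/write OK"
else
  echo "missing read/write access"
fi
```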
    • amnonjw
      Thank you Lito, I will check with the cluster administrator.
    • amnonjw
      Hi Lito, Currently there are no issues with running a job on a single node with multiple CPUs, and since I don't have control over which specific node is used, I assume all nodes can access files in my user directory without problems. Is this relevant to your comment?
      I found another information page on using MPICH2, with an example of running a simulation on several nodes:
      $ /opt/lumerical/v221/mpich2/nemesis/bin/mpiexec -hosts node01:4,node02:8,node03:16 /opt/lumerical/v221/bin/fdtd-engine-mpich2nem -logall -t 1 $HOME/working/paralleltest.fsp
      I see that the command must include the names of the nodes, either as hostnames or IP addresses. Is this mandatory?
      On our cluster we can request a job that runs on several nodes, but we don't know the hostnames or IP addresses of the nodes in advance. I don't know of a way for a running job to get, on the fly, a list of the IPs of the nodes allocated to it.
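For reference, PBS does expose the allocated hosts at run time: it writes one line per allocated core to the file named by $PBS_NODEFILE. A hedged sketch of turning that file into the host:count list mpiexec expects (the node file contents below are fabricated for illustration):

```shell
# Hedged sketch: derive the "host:count" list mpiexec expects from a PBS
# node file. PBS writes one line per allocated core, so counting repeated
# hostnames gives the core count per node.
# Fake node file for illustration; inside a job, use "$PBS_NODEFILE" instead.
printf 'node01\nnode01\nnode01\nnode01\nnode03\nnode03\n' > nodefile.txt
HOSTS=$(sort nodefile.txt | uniq -c | awk '{printf "%s%s:%d", s, $2, $1; s=","}')
echo "$HOSTS"   # node01:4,node03:2
# Then, e.g.: mpiexec -hosts "$HOSTS" .../fdtd-engine-mpich2nem -t 1 file.fsp
```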
    • Lito Yap
      Ansys Employee
      The example shown in our guide is for when you know the hostnames/IPs of the machines running the simulation in distributed mode without a job scheduler. Consult with IT on whether a different MPI is configured on the cluster for parallel jobs, e.g. OpenMPI or Intel MPI, and run the simulation using that MPI installation.
    • amnonjw
      Hello, I contacted our cluster support team. They informed me that the following installations are available:
      intelMPI 17
      intelMPI 22
      OpenMPI 3
      OpenMPI 3.1.6
      OpenMPI 4
      OpenMPI 4.1.1
      They asked - which versions are compatible for Lumerical's fdtd engine executables?

    • Lito Yap
      Ansys Employee
      We support Intel MPI 2019 and OpenMPI 3.1.4. I think OpenMPI 3.1.6 would be the closest. Hope this helps.
    • amnonjw
      Hi Lito, I think I managed to run successfully on several nodes. Below is the script I used to run on 3 nodes with 10 cores each.
      The only indication I have that the application actually uses the full capacity of the available resources is the run time (for example, compared with a job that uses 10 cores on a single node). When I review the log file, the 6th row says "Running on host: n024", and I know that n024 is the name of one of the nodes used, maybe the first in the list of nodes. Is this because the fdtd-engine doesn't actually know it runs on parallel nodes?
      Is there any indication I can get to verify the simulation is running on several nodes simultaneously?

      Code for PBS:
      #PBS -N transformer
      #PBS -q zeus_all_q
      #PBS -l select=3:ncpus=10
      #PBS -l walltime=24:00:00
      #PBS -l place=scatter
      cd $PBS_O_WORKDIR
      source /usr/local/openmpi-3.1.6/
      source /usr/local/lumerical-2021R2.5/
      $MPIRUN --mca mpi_warn_on_fork 0 -machinefile $PBS_NODEFILE $MY_PROG ./$INPUT
    • Lito Yap
      Ansys Employee
      The simulation_p0.log file should indicate the number of processors used:
      All processes are communicating and ready to proceed to simulation...
      Running fsp file: C:\Program Files\Lumerical\v221\bin\fdtd-engine-msmpi.exe -t 1 -remote paralleltest.fsp
      number of processors is 12
      Or try to run with the -logall option for the Lumerical engine.
      $MPIRUN --mca mpi_warn_on_fork 0 -machinefile $PBS_NODEFILE $MY_PROG -t 1 -logall ./$INPUT
      This should create a corresponding simulation_p##.log file for each of the processes/cores used to run the simulation, i.e. for #PBS -l select=3:ncpus=10 you will get 30 simulation_p##.log files.
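As an illustration of what to expect, the check below fabricates the 30 per-rank logs a 3-node x 10-core run with -logall would produce (host names are made up), then counts the files and the distinct hosts mentioned in them:

```shell
# Illustration only: fabricate the 30 per-rank logs of a 3-node x 10-core
# -logall run, then verify the rank count and the distinct hosts.
mkdir -p logdemo
for i in $(seq 0 29); do
  # 10 ranks per fabricated host: n024, n025, n026
  echo "Running on host: n0$((24 + i / 10))" > "logdemo/simulation_p${i}.log"
done
ls logdemo/simulation_p*.log | wc -l                          # 30 ranks logged
grep -h "Running on host" logdemo/simulation_p*.log | sort -u # one line per node
```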
