Ansys Products

Running Fluent with TORQUE Job Scheduler

    • Xingchun Wang
      Subscriber

      Hi,


      Does anyone have experience running Fluent on a cluster with Torque as the job scheduler?


      I know Fluent is integrated with the PBS Pro job scheduler, but we have Torque. The two are very similar, yet when I start Fluent with the [-pbs] flag, the system returns errors. If we start Fluent from a PBS script instead, we can only use a single compute node, which has 36 processors.


      We now want to use more compute nodes. How should we do that?


      Or is there a way to manually control which nodes the processes are spawned on?

    • tsiriaks
      Ansys Employee

      Hey Xingchun,


      Sorry for the delay. I rarely look at this section. I would recommend submitting future questions about cluster/RSM setup in the 'Installation and Licensing' section. This 'Systems' section is for physics systems; see


      https://www.ansys.com/products/systems


       


      Now for your question, let me ask around about this and I will let you know what I find out tomorrow.


      Thanks,


      Win

    • Xingchun Wang
      Subscriber

      Hi Tsiriaks,


      Thank you for your suggestions and help. I just moved it to that section.


      Let me know if you need any other necessary information.


      All the best


      Xingchun

    • tsiriaks
      Ansys Employee

      Hi Xingchun,


      Thank you for moving it here.


      I'm still asking around. Someone will help you soon.


      One piece of information that might be useful at some point: what is the OS of the cluster? If I remember correctly from past SRs, your systems usually run Ubuntu (which is not supported, as we discussed).


      Thanks,


      Win

    • JakeC
      Ansys Employee

      Hi Xingchun,


      Can you paste in the command you are using to launch using the -pbs flag, and post what the errors are exactly that you are getting?


       


      Also, what does the PBS script you mention do? Can you paste in the contents of that script? What happens if you request more than 36 cores? Can you also paste in the final submission command that is called from the PBS script?


      Depending on how that script works, you may need to provide a machine list to fluent to use, as opposed to setting the number of cores.
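
      As a rough illustration of the machine-list approach, here is a minimal sketch of a Torque batch script. It assumes Torque's $PBS_NODEFILE (one hostname per allocated core), Fluent's -cnf= hosts-file option, and -ssh for remote spawning. The journal name run.jou, the DRY_RUN switch, and the launch_fluent helper are made up for this example, and the flags should be checked against the documentation for your Fluent version.

```shell
#!/bin/bash
# Sketch only: resource values mirror the thread (2 nodes x 36 cores).
#PBS -l nodes=2:ppn=36,walltime=4:00:00

# Build (and optionally run) a Fluent command line from a Torque node file.
# Torque writes one line per allocated core, so the number of non-empty
# lines is the total process count to pass to -t.
launch_fluent() {
    nodefile=$1
    journal=$2   # hypothetical journal file, e.g. run.jou
    ncpus=$(grep -c . "$nodefile")
    set -- fluent 3ddp -g "-t${ncpus}" "-cnf=${nodefile}" -mpi=openmpi -ssh -i "$journal"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be inspected first.
        echo "$*"
    else
        "$@"
    fi
}

# Only act when running inside a Torque job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    cd "${PBS_O_WORKDIR:-.}" || exit 1
    launch_fluent "$PBS_NODEFILE" run.jou
fi
```

      Submitted with qsub, this prints the command it would run; setting DRY_RUN=0 in the script makes it launch Fluent for real. Passing the node file through -cnf, rather than only a core count through -t, is what tells Fluent which hosts to spread the processes over.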


       


      Thank you,


      Jake


       

    • Xingchun Wang
      Subscriber

      No, it's running CentOS 7.3

    • Xingchun Wang
      Subscriber

      Hi Jake,


      Thanks for replying,


      The command is: [fluent -g -t72 3ddp -mpi=openmpi -pdefault -pbs]. When I use this command, it automatically submits a job to the scheduler, but the process does not continue. An alternative is to provide the machine list, as you mentioned; please see the following image.



      This approach gives the same result as using the -pbs flag. I have also attached a screenshot showing the result.


       



      If we request more than 36 cores, which means more than one compute node, the cluster spawns all of the processes on a single compute node. For example, if we request 72 cores, all 72 Fluent processes are spawned on one node, which in fact slows down the computation.
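
      One way to narrow this down is to check how many slots the scheduler actually assigned on each host. Below is a minimal sketch, assuming a Torque-style node file with one hostname per allocated core; the helper name slots_per_host is made up for illustration.

```shell
# Print "<count> <host>" for every host listed in a Torque-style node file
# (one line per allocated core).
slots_per_host() {
    sort "$1" | uniq -c | awk '{print $1, $2}'
}

# Inside a running job this would typically be called as:
#   slots_per_host "$PBS_NODEFILE"
```

      If only one hostname shows up, the allocation itself covers a single node. If two hosts appear with 36 slots each but all 72 processes still land on one node, the host list is probably not reaching Fluent.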


       


      Hope this information helps


       


      All the best


      Xingchun Wang

    • JakeC
      Ansys Employee

      Hi Xingchun,


      Can you print out the contents of pnodes.txt and ncpus in that script, and post the results?


      Also can you confirm that passwordless ssh is set up for your user?  Meaning you can ssh between compute nodes without it asking you for a password.
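
      A quick way to check this from inside a job is sketched below. It assumes a node file with one hostname per line and relies on ssh's BatchMode option, which makes ssh fail instead of prompting for a password. The check_passwordless_ssh helper and the SSH_CMD override are names invented for this example.

```shell
# Try a non-interactive ssh to every unique host in a node file.
# Returns non-zero if any host would have required a password.
check_passwordless_ssh() {
    nodefile=$1
    failed=0
    for host in $(sort -u "$nodefile"); do
        # SSH_CMD is a hypothetical override, handy for testing the loop
        # without a real cluster; it defaults to plain ssh.
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Inside a running job this would typically be called as:
#   check_passwordless_ssh "$PBS_NODEFILE"
```

      Every host should report "ok" when the check is run from the first compute node of the job.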


      Lastly what type of interconnect do you have between the compute nodes?  Right now it is trying to use ethernet, but do you have infiniband or something similar?


       


      Thank you,


      Jake

    • Xingchun Wang
      Subscriber

      Hi Jake,


      No problem!


      For the first question, pnodes is a file that is generated every time the script runs. For example, it looks like:



       


      As for ncpus (number of CPUs), I copied that part of the code from the user manual. Originally it was exported as a system variable and passed to -t as [-t$ncpus], but I didn't adopt that method.


      And yes, I can ssh between compute nodes without a password. As for the interconnect, I'm not sure what we have, but I tried all of the options and I don't think it helped, so I kept the default.


       


      All the best


      Xingchun

    • JakeC
      Ansys Employee

      Hi Xingchun,


      That all looks correct to me.


       


      Have you been able to run other types of workloads on this cluster and distribute across nodes?


      Do the compute nodes have Hyperthreading enabled?


      Lastly can you try IBM mpi instead of OpenMPI?


      Thank you,


      Jake

    • tsiriaks
      Ansys Employee

      Hi Xingchun,


      Aside from what Jake has asked, I have heard the following from a Fluent cluster setup expert:


      "Torque, specified version, with MOAB is supported only via RSM; that is well documented. Additionally, there is an issue with Torque in that PBS Torque does not accept core allocation using '-l select=n'; it needs to be changed to use the -l ppn format.


      qsub  -q batch -l select=32:ncpus=1:mpiprocs=1 -V -o stdout.out -e stderr.out."


      Please try with the ppn format but if that doesn't help, you would need to either use RSM to submit jobs via Torque/MOAB or hire a third party Ansys channel partner to assist in customization.
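
      To make the difference between the two formats concrete, here is a small sketch that builds the resource string for each scheduler from a node count and a per-node core count. The function names are made up for illustration, and the 2 x 36 layout in the comments is only an example.

```shell
# Torque style: nodes=<N>:ppn=<cores per node>, walltime in the same -l list.
torque_resource_request() {
    echo "nodes=$1:ppn=$2,walltime=$3"
}

# PBS Pro style: select=<chunks>:ncpus=<cores>:mpiprocs=<ranks per chunk>.
pbspro_select_request() {
    echo "select=$1:ncpus=$2:mpiprocs=$2"
}

# Example (prints the strings only, nothing is submitted):
#   torque_resource_request 2 36 4:00:00   -> passed to Torque as: qsub -l <string> ...
#   pbspro_select_request 2 36             -> passed to PBS Pro as: qsub -l <string> ...
```

      The helpers only print the string; it still has to be passed to qsub with -l on the corresponding scheduler.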


      Thank you,


      Win

    • Xingchun Wang
      Subscriber

      Hi Jake and Win


      Thank you for your reply. Please allow me some time to try your solution; I need to discuss the details with our cluster administrator.


      Hope this time it works, I will report back ASAP.


       


      All the best


      Xingchun

    • Xingchun Wang
      Subscriber

      Hi Jake,


      I just confirmed we have Hyperthreading enabled, and actually, we don't have IBM mpi on the cluster.


      All the best


      Xingchun

    • Xingchun Wang
      Subscriber

      Hi Win


      I'm a little confused about what you said. It sounds like [qsub -q batch -l select=32:ncpus=1:mpiprocs=1 -V -o stdout.out -e stderr.out] won't work on TORQUE, and you want me to try the ppn format.


      As I understand it, we are already using the ppn format. For a normal submission we use [qsub -l nodes=2:ppn=36,walltime=4:00:00], which gives you 2 nodes with 36 processors on each node and 4 hours to use them. Is this the ppn format you are talking about?


      Also, I checked with our administrator, and he suggested that I start Fluent with a lower-level command, using mpirun to start the processes.


      Hope this information is helpful.


       


      All the best


      Xingchun

    • tsiriaks
      Ansys Employee

      Hi Xingchun,


      Yes. Have you tried the command that uses ppn instead?


      What is the script that you specify when using qsub ? Please provide the full/actual command that you've tried.


      As mentioned, if this doesn't work, you would need to setup RSM for it.


      Thanks,


      Win

    • Xingchun Wang
      Subscriber

      The script I use is already in this post; I have just copied and pasted it here. I named the script submit.pbs. To run it in batch mode, we submit the job to TORQUE with [qsub submit.pbs].



       


      All the best


      Xingchun 

    • tsiriaks
      Ansys Employee

      Hi Xingchun,


      Ah yea, I missed that. Sorry.


      In that case, I think you would need to follow one of the two ways that the Fluent cluster setup expert has mentioned


      1. Setup RSM to submit jobs via Torque/MOAB


      2. Hire a third party Ansys channel partner to assist in customization to work with your current setup (which is not supported).


      Thanks,


      Win

    • Xingchun Wang
      Subscriber

      Hi, Win


      I checked with the administrator: we don't actually have MOAB; we use Maui instead. So the question is, can RSM be configured with this combination?


      And again, is there a way to start the Fluent processes with a lower-level command, using mpirun?


       


      All the best


      Xingchun

    • tsiriaks
      Ansys Employee

      Hi Xingchun,


      I got the answer


      "you shouldn't need to use the mpiexec or mpirun with Fluent"


      As for Maui with RSM, this is not supported, so it would need some kind of customization.


      Thanks,


      Win

    • Xingchun Wang
      Subscriber

      OK, I see, thank you for your information


       


      All the best


      Xingchun
