Installing Ansys 2019R1 in a Cray/CLE environment

Hi,

We have configured the ARC Master on the login node and the ARC Node on a compute node. When we test RSM, it reports an error: the job cannot be submitted and the ARC Node fails to start.
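For reference, a minimal sketch to confirm that both services are actually running on their respective hosts before re-running the RSM test. The "ArcMaster" and "ArcNode" process-name patterns are assumptions based on the process names that appear in the RSM logs further down; adjust them to match your installation.

import os
import subprocess

DEVNULL = open(os.devnull, "w")

def process_running(pattern):
    # pgrep -f exits with 0 when at least one process command line matches
    return subprocess.call(["pgrep", "-f", pattern], stdout=DEVNULL) == 0

# On the login node the "ArcMaster" result is the relevant one;
# on the compute node, the "ArcNode" result.
for name in ("ArcMaster", "ArcNode"):
    print("%s: %s" % (name, "running" if process_running(name) else "NOT running"))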



Thank you

Answers

  • yskalencat

    RSM Version: 19.3.328.0, Build Date: 11/18/2018 14:06:25

    Job Name: RSM Queue Test Job

      Type: SERVERTEST

      Client Directory: /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/2r2bm361.x9s

      Client Machine: login

      Queue: RSM Queue [localhost, default]

      Template: SERVERTEST

    Cluster Configuration: localhost [localhost]

      Cluster Type: ARC

      Custom Keyword: blank

      Transfer Option: None

      Staging Directory: blank

      Local Scratch Directory: /workarea/workarea/ansys

      Platform: Linux

      Using SSH for inter-node communication on cluster

      Cluster Submit Options: blank

      Normal Inputs: [*,commands.xml,*.in]

      Cancel Inputs: [-]

      Excluded Inputs: [-]

      Normal Outputs: [*]

      Failure Outputs: [-]

      Cancel Outputs: [-]

      Excluded Outputs: [-]

      Inquire Files:

       normal: [*]

       inquire: [*.out]

    Submission in progress...

    Runtime Settings:

      Job Owner: hpcadmin

      Submit Time: Thursday, 17 December 2020 15:26

      Directory: /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/2r2bm361.x9s

    2.67 KB, .05 sec (55.81 KB/sec)

    Submission in progress...

    JobType is: SERVERTEST

    Final command platform: Linux

    RSM_PYTHON_HOME=/workarea/athenasoftwares/ansys2019R1/ansys_inc/v193/commonfiles/CPython/2_7_15/linx64/Release/python

    RSM_HPC_JOBNAME=RSMTest

    Distributed mode requested: True

    RSM_HPC_DISTRIBUTED=TRUE

    Running 5 commands

    Job working directory: /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/2r2bm361.x9s

    Number of CPU requested: 1

    AWP_ROOT193=/workarea/athenasoftwares/ansys2019R1/ansys_inc/v193

    Testing writability of working directory...

    /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/2r2bm361.x9s

    If you can read this, file was written successfully to working directory

    Writability test complete

    Checking queue default exists ...

    Job will run locally on each node in: /workarea/workarea/ansys/ack7tpu4.7hx

    JobId was parsed as: 1

    External operation: 'queryStatus' has failed. This may or may not become a fatal error

    Status parsing failed to parse the primary command output: '

    External operation: 'parseStatus' has failed. This may or may not become a fatal error

    Parser could not parse the job status. Checking for completed job exitcode...

    Status Failed

    Problem during Status. The parser was unable to parse the output and did not output the variable: RSM_HPC_OUTPUT_STATUS.

    Error: Please check that the master service is started and that there is no firewall blocking access on ports 11193, 12193, or 13193

    Output:

    The status command failed to get single job status to the master service on: login:11193.

    External operation: 'queryStatus' has failed. This may or may not become a fatal error

    Status parsing failed to parse the primary command output: '

    External operation: 'parseStatus' has failed. This may or may not become a fatal error

    Parser could not parse the job status. Checking for completed job exitcode...

    Status Failed

    Problem during Status. The parser was unable to parse the output and did not output the variable: RSM_HPC_OUTPUT_STATUS.

    Error: Please check that the master service is started and that there is no firewall blocking access on ports 11193, 12193, or 13193

    Output:

    The status command failed to get single job status to the master service on: login:11193.
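    The two "queryStatus" failures above point at the master service on login:11193. A minimal reachability sketch, assuming the host name "login" and the ports 11193/12193/13193 quoted in the error message (run it from the compute node):

    import socket

    MASTER_HOST = "login"          # master host as reported in the error above
    PORTS = (11193, 12193, 13193)  # ports named in the error message

    for port in PORTS:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(5)
        try:
            s.connect((MASTER_HOST, port))
            print("port %d: reachable" % port)
        except socket.error as exc:
            print("port %d: NOT reachable (%s)" % (port, exc))
        finally:
            s.close()

    If a port is not reachable, that matches the "firewall blocking access" hint in the error output.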

  • yskalencat

    RSM_PYTHON_HOME=/workarea/athenasoftwares/ansys2019R1/ansys_inc/v193/commonfiles/CPython/2_7_15/linx64/Release/python

    RSM_HPC_JOBNAME=RSMTest

    Distributed mode requested: True

    RSM_HPC_DISTRIBUTED=TRUE

    Running 5 commands

    Job working directory: /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/gdqlpi3q.1d8

    Number of CPU requested: 1

    AWP_ROOT193=/workarea/athenasoftwares/ansys2019R1/ansys_inc/v193

    Testing writability of working directory...

    /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/gdqlpi3q.1d8

    If you can read this, file was written successfully to working directory

    Writability test complete

    Checking queue local exists ...

    Submit parsing failed to parse the primary command output: 'ArcMaster process running as rsmadmin

    External operation: 'parseSubmit' has failed. This may or may not become a fatal error

    ArcMaster process running as rsmadmin

    ARCNode Process could not be reached

    Skipping autostart because Master processes is started as a service.

    Job not submitted. Error: Job is too big to fit in the queue local with the currently assigned machines.

    Failed to submit job to cluster

    Submit Failed

    Problem during Submit. The parser was unable to parse the output and did not output the variable: RSM_HPC_OUTPUT_JOBID.

    Error:

    Output: ArcMaster process running as rsmadmin

    ARCNode Process could not be reached

    Skipping autostart because Master processes is started as a service.

    Job not submitted. Error: Job is too big to fit in the queue local with the currently assigned machines.

  • yskalencat

    Exec Node Name   Associated Master   State     Service User   Avail  Max   Avail  Max   Avail  Max
    ===================================================================================================
    nid00033         login               Running   root              72   50       *    *       *    *
    nid00034         login               Running   root              72   50       *    *       *    *

      * Indicates that resources have not been set up. Any resource request will be accepted.


    Updating Users and Groups...


    Groups matching *

     rsmadmins


    Users matching *

     root

     rsmadmin


    Updating queues...



    Name      Status   Priority   Start Time   End Time   Max Jobs   Allowed Machines          Allowed Users
    =========================================================================================================
    default   Active   0          00:00:00     23:59:59   *          login:nid00033:nid00034   all
    local     Active   0          00:00:00     23:59:59   *          login                     all

      * Indicates that resources have not been set up. Any resource request will be accepted.

  • MangeshANSYS (Forum Coordinator)

    Please refer to the RSM documentation and search for "Example: Setting Up a Multi-Node ANSYS RSM Cluster (ARC)". Please double-check that all of the steps in Step #1 were followed. The arcmaster node needs to be able to communicate with the arcnode.
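    Since the cluster configuration in the logs shows "Using SSH for inter-node communication on cluster", it is also worth confirming that non-interactive (passwordless) SSH works from the arcmaster host to each arcnode. A minimal sketch, assuming the node names shown in the arcnodes output above:

    import subprocess

    NODES = ["nid00033", "nid00034"]   # execution nodes from the arcnodes output

    for node in NODES:
        # BatchMode=yes makes ssh fail instead of prompting for a password
        rc = subprocess.call(["ssh", "-o", "BatchMode=yes",
                              "-o", "ConnectTimeout=5", node, "hostname"])
        print("%s: %s" % (node, "ok" if rc == 0 else "ssh failed (rc=%d)" % rc))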

  • yskalencat

    I have set up the ARC Master and client nodes, and they can communicate with each other. I can see the execution nodes, and I have set up the queues using arcconfigui.

    I have configured RSM and I can see the ARC queue in it. However, once I submit a job from RSM, the ARC Master service stops, and the process reports that it cannot communicate with the ARC Master.
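    If the master service really is stopping at submission time, one way to see exactly when it drops is to poll the master port while the test job is submitted. A rough sketch, assuming the login:11193 endpoint from the earlier error output; the interval and duration are arbitrary:

    import socket
    import time

    HOST, PORT = "login", 11193   # endpoint from the earlier RSM error

    for _ in range(60):           # watch for roughly a minute; Ctrl+C to stop early
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(2)
        try:
            s.connect((HOST, PORT))
            status = "up"
        except socket.error:
            status = "DOWN"
        finally:
            s.close()
        print("%s %s:%d %s" % (time.strftime("%H:%M:%S"), HOST, PORT, status))
        time.sleep(1)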

  • yskalencat

    I was able to install Ansys on the Cray system and configure ARC and RSM.

    The problem we are facing now is that, while running the job through Workbench via RSM, the job status shows "Running", but the progress monitor shows "waiting for background task the solution component in Fluent" and the job does not proceed further.
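    One way to narrow this down is to follow the Fluent transcript in the RSM staging directory while the job sits in that state, to see whether the solver is actually iterating. A minimal sketch, assuming the Solution.trn transcript name and the staging directory reported in the detailed job log below (the directory name is generated per job):

    import os
    import time

    # Staging directory and transcript name taken from the job log below;
    # the c9a37ufo.wge part changes for every job.
    TRANSCRIPT = "/workarea/workarea/Hessamedinn/c9a37ufo.wge/Solution.trn"

    pos = 0
    while True:                       # Ctrl+C to stop
        if os.path.exists(TRANSCRIPT):
            with open(TRANSCRIPT) as f:
                f.seek(pos)
                chunk = f.read()
                pos = f.tell()
            if chunk:
                print(chunk)
        else:
            print("transcript not created yet")
        time.sleep(5)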


  • yskalencat

    RSM Version: 19.4.420.0, Build Date: 04/08/2019 10:09:10

    Job Name: HPC1-DP0-Fluent-Solution (A3)

      Type: Workbench_FLUENT

      Client Directory: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution

      Client Machine: login

      Queue: Cluster2 [localhost, cluster2]

      Template: Workbench_FLUENTJob.xml

    Cluster Configuration: localhost [localhost]

      Cluster Type: ARC

      Custom Keyword: blank

      Transfer Option: NativeOS

      Staging Directory: /workarea/workarea/Hessamedinn/

      Delete Staging Directory: True

      Local Scratch Directory: blank

      Platform: Linux

      Using SSH for inter-node communication on cluster

      Cluster Submit Options: blank

      Normal Inputs: [*]

      Cancel Inputs: [sec.interrupt]

      Excluded Inputs: [-]

      Normal Outputs: [*]

      Failure Outputs: [secStart.log,secInterrupt.verify,SecDebugLog.txt,sec.solverexitcode,sec.envvarvalidation.executed,sec.failure,SecCommPackage.sec]

      Cancel Outputs: [secStart.log,secInterrupt.verify,SecDebugLog.txt,SecCommPackage.sec]

      Excluded Outputs: [hosts.txt,cleanup-fluent-*.sh,sec.interrupt,SecInput.txt,rerunsec.bat,rerunsec.sh]

      Inquire Files:

       cancel: [secStart.log,secInterrupt.verify,SecDebugLog.txt,SecCommPackage.sec]

       progress: [status.txt.gz]

       transcript: [Solution.trn]

       failure: [secStart.log,secInterrupt.verify,SecDebugLog.txt,sec.solverexitcode,sec.envvarvalidation.executed,sec.failure,SecCommPackage.sec]

       __SecCommunicationPackage: [SecCommPackage.sec]

       __SecServerConfig: [ServerConfigFile.conf]

       normal: [*]

    Submission in progress...

    Created storage directory /workarea/workarea/Hessamedinn/c9a37ufo.wge

    Runtime Settings:

      Job Owner: hessamedinn

      Submit Time: Monday, 01 February 2021 16:46

      Directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge

    Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/FFF-1-1-00000.dat.gz

    Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/FFF-1.cas.gz

    Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/FFF-1.set

    Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/SecInput.txt

    Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/SolutionPending.jou

    Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/commands.xml

    Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/fluentLauncher.txt

    606.3 KB, .44 sec (1387.12 KB/sec)

    Submission in progress...

    JobType is: Workbench_FLUENT

    Final command platform: Linux

    RSM_PYTHON_HOME=/workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/commonfiles/CPython/2_7_15/linx64/Release/python

    RSM_HPC_JOBNAME=Workbench

    Distributed mode requested: True

    RSM_HPC_DISTRIBUTED=TRUE

    Running 1 commands

    Job working directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge

    Number of CPU requested: 50

    AWP_ROOT194=/workarea/athenasoftwares/ansys2019R2/ansys_inc/v194

    Checking queue cluster2 exists ...

    JobId was parsed as: 72

    Job submission was successful.

    Job is running on hostname login

    Job user from this host: hessamedinn

    Starting directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge

    Reading control file /workarea/workarea/Hessamedinn/c9a37ufo.wge/control_e1d41399-eac3-4eb0-96b8-07db32182843.rsm ....

    Correct Cluster verified

    Cluster Type: ARC

    Underlying Cluster: ARC

      RSM_CLUSTER_TYPE = ARC

    Compute Server is running on LOGIN

    Reading commands and arguments...

      Command 1: $AWP_ROOT194/SEC/SolverExecutionController/runsec.sh, arguments: , redirectFile: None

    Running from shared staging directory ...

      RSM_USE_LOCAL_SCRATCH = False

      RSM_LOCAL_SCRATCH_DIRECTORY =

      RSM_LOCAL_SCRATCH_PARTIAL_UNC_PATH =

    Cluster Shared Directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge

      RSM_SHARE_STAGING_DIRECTORY = /workarea/workarea/Hessamedinn/c9a37ufo.wge

    Job file clean up: True

    Use SSH on Linux cluster nodes: True

      RSM_USE_SSH_LINUX = True

    LivelogFile: NOLIVELOGFILE

    StdoutLiveLogFile: stdout_e1d41399-eac3-4eb0-96b8-07db32182843.live

    StderrLiveLogFile: stderr_e1d41399-eac3-4eb0-96b8-07db32182843.live

    Reading input files...

      *

    Reading cancel files...

      sec.interrupt

    Reading output files...

      *

      secStart.log

      secInterrupt.verify

      SecDebugLog.txt

      sec.solverexitcode

      sec.envvarvalidation.executed

      sec.failure

      SecCommPackage.sec

    Reading exclude files...

      hosts.txt

      cleanup-fluent-*.sh

      sec.interrupt

      SecInput.txt

      rerunsec.bat

      rerunsec.sh

      stdout_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout

      stderr_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout

      control_e1d41399-eac3-4eb0-96b8-07db32182843.rsm

      hosts.dat

      exitcode_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout

      exitcodeCommands_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout

      stdout_e1d41399-eac3-4eb0-96b8-07db32182843.live

      stderr_e1d41399-eac3-4eb0-96b8-07db32182843.live

      ClusterJobCustomization.xml

      ClusterJobs.py

      clusterjob_e1d41399-eac3-4eb0-96b8-07db32182843.sh

      clusterjob_e1d41399-eac3-4eb0-96b8-07db32182843.bat

      inquire.request

      inquire.confirm

      request.upload.rsm

      request.download.rsm

      wait.download.rsm

      scratch.job.rsm

      volatile.job.rsm

      restart.xml

      cancel_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout

      liveLogLastPositions_e1d41399-eac3-4eb0-96b8-07db32182843.rsm

      stdout_e1d41399-eac3-4eb0-96b8-07db32182843_kill.rsmout

      stderr_e1d41399-eac3-4eb0-96b8-07db32182843_kill.rsmout

      sec.interrupt

      hosts.txt

      cleanup-fluent-*.sh

      SecInput.txt

      rerunsec.bat

      rerunsec.sh

    Reading environment variables...

      RSM_PYTHON_LOCALE = en-us

    Reading AWP_ROOT environment variable name ...

      AWP_ROOT environment variable name is: AWP_ROOT194

    Reading Low Disk Space Warning Limit ...

      Low disk space warning threshold set at: 2.0GiB

    Reading File identifier ...

      File identifier found as: e1d41399-eac3-4eb0-96b8-07db32182843

    Done reading control file.

    AWP_ROOT194 install directory: /workarea/athenasoftwares/ansys2019R2/ansys_inc/v194

    RSM_MACHINES = login:2:nid00034:48

    Number of nodes assigned for current job = 2

    Machine list: ['login', 'nid00034']

    Start running job commands ...

    Running on machine : login

    Current Directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge

    Running command: /workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/SEC/SolverExecutionController/runsec.sh

    Redirecting output to None

    Final command arg list : ['/workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/SEC/SolverExecutionController/runsec.sh']

    Running Process

    Running Solver : /workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/fluent/bin/fluent --albion --run --launcher_setting_file "fluentLauncher.txt" --fluent_options " -gu -driver null -workbench-session -i \"SolutionPending.jou\" -mpi=ibmmpi -t50 -cnf=hosts.txt -ssh"

    /workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/fluent/fluent19.4.0/bin/fluent -r19.4.0 --albion --run --launcher_setting_file fluentLauncher.txt

    Warning: DISPLAY environment variable is not set.

     Graphics and GUI will not operate correctly

     without this being set properly.

    Warning: DISPLAY environment variable is not set.

     Graphics and GUI will not operate correctly

     without this being set properly.


    Inquiring for files by Tag: [progress]

    No files transferred.
