Installing Ansys 2019 R1 in a Cray/CLE environment
Hi,
We have configured the ARC master on the login node and the arcnode on a compute node. When we test RSM, it gives errors such as "cannot submit job" and "failed to start the arcnode".
Thank you
Answers
RSM Version: 19.3.328.0, Build Date: 11/18/2018 14:06:25
Job Name: RSM Queue Test Job
Type: SERVERTEST
Client Directory: /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/2r2bm361.x9s
Client Machine: login
Queue: RSM Queue [localhost, default]
Template: SERVERTEST
Cluster Configuration: localhost [localhost]
Cluster Type: ARC
Custom Keyword: blank
Transfer Option: None
Staging Directory: blank
Local Scratch Directory: /workarea/workarea/ansys
Platform: Linux
Using SSH for inter-node communication on cluster
Cluster Submit Options: blank
Normal Inputs: [*,commands.xml,*.in]
Cancel Inputs: [-]
Excluded Inputs: [-]
Normal Outputs: [*]
Failure Outputs: [-]
Cancel Outputs: [-]
Excluded Outputs: [-]
Inquire Files:
normal: [*]
inquire: [*.out]
Submission in progress...
Runtime Settings:
Job Owner: hpcadmin
Submit Time: Thursday, 17 December 2020 15:26
Directory: /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/2r2bm361.x9s
2.67 KB, .05 sec (55.81 KB/sec)
Submission in progress...
JobType is: SERVERTEST
Final command platform: Linux
RSM_PYTHON_HOME=/workarea/athenasoftwares/ansys2019R1/ansys_inc/v193/commonfiles/CPython/2_7_15/linx64/Release/python
RSM_HPC_JOBNAME=RSMTest
Distributed mode requested: True
RSM_HPC_DISTRIBUTED=TRUE
Running 5 commands
Job working directory: /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/2r2bm361.x9s
Number of CPU requested: 1
AWP_ROOT193=/workarea/athenasoftwares/ansys2019R1/ansys_inc/v193
Testing writability of working directory...
/hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/2r2bm361.x9s
If you can read this, file was written successfully to working directory
Writability test complete
Checking queue default exists ...
Job will run locally on each node in: /workarea/workarea/ansys/ack7tpu4.7hx
JobId was parsed as: 1
External operation: 'queryStatus' has failed. This may or may not become a fatal error
Status parsing failed to parse the primary command output: '
External operation: 'parseStatus' has failed. This may or may not become a fatal error
Parser could not parse the job status. Checking for completed job exitcode...
Status Failed
Problem during Status. The parser was unable to parse the output and did not output the variable: RSM_HPC_OUTPUT_STATUS.
Error: Please check that the master service is started and that there is no firewall blocking access on ports 11193, 12193, or 13193
Output:
The status command failed to get single job status to the master service on: login:11193.
External operation: 'queryStatus' has failed. This may or may not become a fatal error
Status parsing failed to parse the primary command output: '
External operation: 'parseStatus' has failed. This may or may not become a fatal error
Parser could not parse the job status. Checking for completed job exitcode...
Status Failed
Problem during Status. The parser was unable to parse the output and did not output the variable: RSM_HPC_OUTPUT_STATUS.
Error: Please check that the master service is started and that there is no firewall blocking access on ports 11193, 12193, or 13193
Output:
The status command failed to get single job status to the master service on: login:11193.
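Both queryStatus failures above point at the client being unable to reach the ARC master service on login:11193. A minimal first check with standard Linux tools (the "arcmaster" pattern in the grep is an assumption based on the ArcMaster name that appears later in this log, and nc can be swapped for any equivalent port probe):
# On the login node: is anything listening on the ports named in the error?
ss -ltn | grep -E '11193|12193|13193'
# Is the master process alive? (name pattern is an assumption)
ps -ef | grep -i arcmaster | grep -v grep
# From a compute node or any other host: is the master port reachable through the firewall?
nc -vz login 11193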
RSM_PYTHON_HOME=/workarea/athenasoftwares/ansys2019R1/ansys_inc/v193/commonfiles/CPython/2_7_15/linx64/Release/python
RSM_HPC_JOBNAME=RSMTest
Distributed mode requested: True
RSM_HPC_DISTRIBUTED=TRUE
Running 5 commands
Job working directory: /hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/gdqlpi3q.1d8
Number of CPU requested: 1
AWP_ROOT193=/workarea/athenasoftwares/ansys2019R1/ansys_inc/v193
Testing writability of working directory...
/hpchome/hpcadmin/.local/share/Temp/RsmConfigTest/gdqlpi3q.1d8
If you can read this, file was written successfully to working directory
Writability test complete
Checking queue local exists ...
Submit parsing failed to parse the primary command output: 'ArcMaster process running as rsmadmin
External operation: 'parseSubmit' has failed. This may or may not become a fatal error
ArcMaster process running as rsmadmin
ARCNode Process could not be reached
Skipping autostart because Master processes is started as a service.
Job not submitted. Error: Job is too big to fit in the queue local with the currently assigned machines.
Failed to submit job to cluster
Submit Failed
Problem during Submit. The parser was unable to parse the output and did not output the variable: RSM_HPC_OUTPUT_JOBID.
Error:
Output: ArcMaster process running as rsmadmin
ARCNode Process could not be reached
Skipping autostart because Master processes is started as a service.
Job not submitted. Error: Job is too big to fit in the queue local with the currently assigned machines.
Exec Node Name Associated Master State Service User Avail Max Avail Max Avail Max
===============================================================================================================================
nid00033 login Running root 72 50 * * * *
nid00034 login Running root 72 50 * * * *
* Indicates that resources have not been set up. Any resource request will be accepted.
Updating Users and Groups...
Groups matching *
rsmadmins
Users matching *
root
rsmadmin
Updating queues...
Name Status Priority Start Time End Time Max Jobs Allowed Machines Allowed Users
================================================================================================================================
default Active 0 00:00:00 23:59:59 * login:nid00033:nid00034 all
local Active 0 00:00:00 23:59:59 * login all
* Indicates that resources have not been set up. Any resource request will be accepted.
Please refer to the RSM documentation and search for "Example: Setting Up a Multi-Node ANSYS RSM Cluster (ARC)". Please double-check that every line in Step #1 was followed. The arcmaster node needs to be able to communicate with the arcnode.
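A minimal connectivity sketch along those lines, using only generic Linux tools and the port numbers from the error text above (11193 is the master port reported in the error; which port the arcnode listens on is not stated, so 12193 below is a guess):
# From the arcmaster host (login): check that each exec node resolves and is reachable
for node in nid00033 nid00034; do
    ping -c 1 "$node"
    nc -vz "$node" 12193    # assumed arcnode port; adjust to your setup
done
# From an exec node back to the master port named in the error
ssh nid00034 'nc -vz login 11193'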
I have set up the ARC master and client nodes, and they can communicate with each other. I can see the execution nodes, and I have set up the queues using arcconfigui.
I have configured RSM and I can see the ARC queue in it. However, once I submit a job from RSM, the ARC master service stops, and the process reports that it cannot communicate with the ARC master.
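One way to narrow that down is to watch the master process while the job is being submitted, and then look at the system journal for the moment it disappears. A rough sketch (the "arcmaster" grep pattern is taken from the log text above; the journal filter is only a broad guess, since the location of ARC's own log files is not shown here):
# In one terminal on login: watch the master process while submitting the job from RSM
watch -n 2 'ps -ef | grep -i arcmaster | grep -v grep'
# Afterwards, look for anything logged around the time it stopped
journalctl --since "15 minutes ago" | grep -i arc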
I was able to install Ansys on the Cray and configure ARC and RSM.
The problem we are facing now is that when running a job through Workbench via RSM, the job status shows Running, but the progress pane shows "waiting for background task for the Solution component in Fluent" and it does not proceed further.
RSM Version: 19.4.420.0, Build Date: 04/08/2019 10:09:10
Job Name: HPC1-DP0-Fluent-Solution (A3)
Type: Workbench_FLUENT
Client Directory: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution
Client Machine: login
Queue: Cluster2 [localhost, cluster2]
Template: Workbench_FLUENTJob.xml
Cluster Configuration: localhost [localhost]
Cluster Type: ARC
Custom Keyword: blank
Transfer Option: NativeOS
Staging Directory: /workarea/workarea/Hessamedinn/
Delete Staging Directory: True
Local Scratch Directory: blank
Platform: Linux
Using SSH for inter-node communication on cluster
Cluster Submit Options: blank
Normal Inputs: [*]
Cancel Inputs: [sec.interrupt]
Excluded Inputs: [-]
Normal Outputs: [*]
Failure Outputs: [secStart.log,secInterrupt.verify,SecDebugLog.txt,sec.solverexitcode,sec.envvarvalidation.executed,sec.failure,SecCommPackage.sec]
Cancel Outputs: [secStart.log,secInterrupt.verify,SecDebugLog.txt,SecCommPackage.sec]
Excluded Outputs: [hosts.txt,cleanup-fluent-*.sh,sec.interrupt,SecInput.txt,rerunsec.bat,rerunsec.sh]
Inquire Files:
cancel: [secStart.log,secInterrupt.verify,SecDebugLog.txt,SecCommPackage.sec]
progress: [status.txt.gz]
transcript: [Solution.trn]
failure: [secStart.log,secInterrupt.verify,SecDebugLog.txt,sec.solverexitcode,sec.envvarvalidation.executed,sec.failure,SecCommPackage.sec]
__SecCommunicationPackage: [SecCommPackage.sec]
__SecServerConfig: [ServerConfigFile.conf]
normal: [*]
Submission in progress...
Created storage directory /workarea/workarea/Hessamedinn/c9a37ufo.wge
Runtime Settings:
Job Owner: hessamedinn
Submit Time: Monday, 01 February 2021 16:46
Directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge
Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/FFF-1-1-00000.dat.gz
Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/FFF-1.cas.gz
Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/FFF-1.set
Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/SecInput.txt
Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/SolutionPending.jou
Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/commands.xml
Uploading file: /workarea/workarea/Hessamedinn/HPC1_pending/dp0_FLU_Solution/fluentLauncher.txt
606.3 KB, .44 sec (1387.12 KB/sec)
Submission in progress...
JobType is: Workbench_FLUENT
Final command platform: Linux
RSM_PYTHON_HOME=/workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/commonfiles/CPython/2_7_15/linx64/Release/python
RSM_HPC_JOBNAME=Workbench
Distributed mode requested: True
RSM_HPC_DISTRIBUTED=TRUE
Running 1 commands
Job working directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge
Number of CPU requested: 50
AWP_ROOT194=/workarea/athenasoftwares/ansys2019R2/ansys_inc/v194
Checking queue cluster2 exists ...
JobId was parsed as: 72
Job submission was successful.
Job is running on hostname login
Job user from this host: hessamedinn
Starting directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge
Reading control file /workarea/workarea/Hessamedinn/c9a37ufo.wge/control_e1d41399-eac3-4eb0-96b8-07db32182843.rsm ....
Correct Cluster verified
Cluster Type: ARC
Underlying Cluster: ARC
RSM_CLUSTER_TYPE = ARC
Compute Server is running on LOGIN
Reading commands and arguments...
Command 1: $AWP_ROOT194/SEC/SolverExecutionController/runsec.sh, arguments: , redirectFile: None
Running from shared staging directory ...
RSM_USE_LOCAL_SCRATCH = False
RSM_LOCAL_SCRATCH_DIRECTORY =
RSM_LOCAL_SCRATCH_PARTIAL_UNC_PATH =
Cluster Shared Directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge
RSM_SHARE_STAGING_DIRECTORY = /workarea/workarea/Hessamedinn/c9a37ufo.wge
Job file clean up: True
Use SSH on Linux cluster nodes: True
RSM_USE_SSH_LINUX = True
LivelogFile: NOLIVELOGFILE
StdoutLiveLogFile: stdout_e1d41399-eac3-4eb0-96b8-07db32182843.live
StderrLiveLogFile: stderr_e1d41399-eac3-4eb0-96b8-07db32182843.live
Reading input files...
*
Reading cancel files...
sec.interrupt
Reading output files...
*
secStart.log
secInterrupt.verify
SecDebugLog.txt
sec.solverexitcode
sec.envvarvalidation.executed
sec.failure
SecCommPackage.sec
Reading exclude files...
hosts.txt
cleanup-fluent-*.sh
sec.interrupt
SecInput.txt
rerunsec.bat
rerunsec.sh
stdout_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout
stderr_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout
control_e1d41399-eac3-4eb0-96b8-07db32182843.rsm
hosts.dat
exitcode_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout
exitcodeCommands_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout
stdout_e1d41399-eac3-4eb0-96b8-07db32182843.live
stderr_e1d41399-eac3-4eb0-96b8-07db32182843.live
ClusterJobCustomization.xml
ClusterJobs.py
clusterjob_e1d41399-eac3-4eb0-96b8-07db32182843.sh
clusterjob_e1d41399-eac3-4eb0-96b8-07db32182843.bat
inquire.request
inquire.confirm
request.upload.rsm
request.download.rsm
wait.download.rsm
scratch.job.rsm
volatile.job.rsm
restart.xml
cancel_e1d41399-eac3-4eb0-96b8-07db32182843.rsmout
liveLogLastPositions_e1d41399-eac3-4eb0-96b8-07db32182843.rsm
stdout_e1d41399-eac3-4eb0-96b8-07db32182843_kill.rsmout
stderr_e1d41399-eac3-4eb0-96b8-07db32182843_kill.rsmout
sec.interrupt
hosts.txt
cleanup-fluent-*.sh
SecInput.txt
rerunsec.bat
rerunsec.sh
Reading environment variables...
RSM_PYTHON_LOCALE = en-us
Reading AWP_ROOT environment variable name ...
AWP_ROOT environment variable name is: AWP_ROOT194
Reading Low Disk Space Warning Limit ...
Low disk space warning threshold set at: 2.0GiB
Reading File identifier ...
File identifier found as: e1d41399-eac3-4eb0-96b8-07db32182843
Done reading control file.
AWP_ROOT194 install directory: /workarea/athenasoftwares/ansys2019R2/ansys_inc/v194
RSM_MACHINES = login:2:nid00034:48
Number of nodes assigned for current job = 2
Machine list: ['login', 'nid00034']
Start running job commands ...
Running on machine : login
Current Directory: /workarea/workarea/Hessamedinn/c9a37ufo.wge
Running command: /workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/SEC/SolverExecutionController/runsec.sh
Redirecting output to None
Final command arg list : ['/workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/SEC/SolverExecutionController/runsec.sh']
Running Process
Running Solver : /workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/fluent/bin/fluent --albion --run --launcher_setting_file "fluentLauncher.txt" --fluent_options " -gu -driver null -workbench-session -i \"SolutionPending.jou\" -mpi=ibmmpi -t50 -cnf=hosts.txt -ssh"
/workarea/athenasoftwares/ansys2019R2/ansys_inc/v194/fluent/fluent19.4.0/bin/fluent -r19.4.0 --albion --run --launcher_setting_file fluentLauncher.txt
Warning: DISPLAY environment variable is not set.
Graphics and GUI will not operate correctly
without this being set properly.
Warning: DISPLAY environment variable is not set.
Graphics and GUI will not operate correctly
without this being set properly.
Inquiring for files by Tag: [progress]
No files transferred.
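When Workbench sits at the "waiting for background task" message while RSM reports the job as running, two quick checks are whether the solver processes actually started on the assigned machines and what the transcript in the shared staging directory says. A sketch using the hostnames and staging path from the log above (the process-name pattern is an assumption):
# Did Fluent actually start on the second assigned node?
ssh nid00034 'ps -ef | grep -i fluent | grep -v grep'
# Follow the solver transcript in the staging directory once the solver starts writing it
tail -f /workarea/workarea/Hessamedinn/c9a37ufo.wge/Solution.trn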