Distributed solver

Hello,

I'm playing with distributed solving in version 19.4 (2019 R2). My cluster has 3 PCs. A couple of minutes after the solver started, the job failed.

I can't find what caused it. Could you please review the log file and advise?

Thanks, Nick


Comments

  • tsiriaks Member
    edited September 2019

    Hi Nick,

    Sorry, ANSYS employees are not allowed to download file attachments. Could you post the contents as inline text, or post screenshots inline?

    Are you using the ANSYS Student (free) version?

    Is this a Mechanical, Fluids, or Electronics product?

    Thanks,

    Win

  • peteroznewman Member
    edited September 2019

    Win, it looks like a directory error.

  • kolyan007 Member
    edited October 2019

    Thanks Peter, good spotting. That folder ScrCA9A is missing under _ProjectScratch. How can I find out why Mechanical is trying to access it? Could it be because of some references in the project?

  • tsiriaks Member
    edited October 2019

    Thanks for the screenshot Peter.

    So, this is Mechanical + ANSYS RSM.

    I'm gonna need much more information to proceed. Note: That folder should be automatically created by the solver.

    When you said 'cluster', do you mean it has a job scheduler, or is it just 3 machines with no job scheduler connecting to each other? If you have your own scheduler, what is its name?

    How do you submit the jobs? Please show screenshots of the workflow and setup in both the Workbench and RSM Configuration GUIs.

    Post a screenshot of the top part of the RSM Job Report.

    Thanks,

    Win

  • kolyan007 Member
    edited October 2019

    Thanks for your reply. I was thinking there was some incompatibility between my 3-machine cluster and R2, so I reinstalled the OS on the master machine (maincad) that is used to send tasks to the slaves in the cluster, installed R3 on it and on the slaves, then replicated the settings.

    I don't have my own scheduler. All machines connect to each other.

    There are screenshots of my config below.

    ARC configuration on master


    Note: I'm not sure why, but Service Management is running as cluster02user. Not sure how to change it back to the default.

    RSM config on master:

    RSM configuration on master


    Somehow Queues refresh fails.

    This is after I deleted credentials for cluster01

    RSM configuration on master

    RSM config for cluster02

    RSM configuration on master


     

    And RSM config for localhost

    RSM configuration on master


    These 3 services are running on each machine in the cluster:

    Master:

    Services on master

    Cluster01

    Services on cluster01

    Cluster02

    Services on cluster02

    My Solve Process settings:

    Solve settings

    Solve in mechanical

     

    Also, I've tested telnet to port 9195 and SMB in Windows; there are no issues from the OS perspective. All machines use the same credentials.

     

    Additionally, I've found that if I run arcconfig node modify %COMPUTERNAME% -mn maincad, the test in RSM stays in scheduled mode, but if I replace my master node with %COMPUTERNAME%, the test goes through fine. Which option is correct?

    Cluster01 port 9195 listening

     

    Cluster02 port 9195 listening

     

     

  • tsiriaks Member
    edited October 2019

    OK, so you are using ANSYS's own lightweight scheduler (ARC).

    I normally use commands to set this up, so I'm not familiar with the settings in the ARC Config GUI, but I've asked a colleague to help check it.

    However, I see a lot of misconfigurations.

    Essentially, this is what you need to do to properly set up RSM-ARC:

     

    Create share directory

    Create a directory on a machine that you would like to use as the staging directory.

     

    For this example, I will use E:\RSMStaging.

    Now share that directory out to all solving machines and the 'submit host' so that anyone can read and write to it.

    ---------

    Right-click on RSMStaging and select Properties, then select the Sharing tab.

    Click on Share...

     

    Type in:

    Everyone

    And click Add

     

    Type in:

    Domain Computers

    Click Add

     

    Type in:

    Domain Users

    Click Add

     

    Give each of the users Read/Write access

    Then click "Share"

    Then click Done
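
    For reference, the same sharing can be done from an elevated command prompt; this is just a sketch, assuming the E:\RSMStaging path from this example:

    :: create the network share and grant change (read/write) access
    net share RSMStaging=E:\RSMStaging /GRANT:Everyone,CHANGE /GRANT:"Domain Users",CHANGE /GRANT:"Domain Computers",CHANGE

    :: make sure the NTFS permissions also allow modify access on the folder itself
    icacls E:\RSMStaging /grant "Everyone:(OI)(CI)M"

    Either route (GUI or command line) gives the same result.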

     

     

    On 'submit host' machine (This is a dedicated node. Do not install this on all nodes)

     

    This is the machine that receives requests from clients, then queues and distributes jobs to solve on solving machines. This machine itself can also be a solving machine.

     

    Two components are required here: 

     

    1. RSM Launcher service, which deals with the communications between client machine and this machine

     

    2. ARC Master service is one of the two parts of the ARC job scheduler, a light-weight job scheduler that ANSYS provides. This part deals with the communications between RSM Launcher and solving machine(s) , essentially, queueing and distributing jobs to solving machine(s).

     

     

     

     

     

    On solving machine(s)

     

    Install ARC Node service, which is used for communicating with the ARC Master service.
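
    As a quick sanity check after the services are in place, you can run something like the following on each solving machine. This is only a sketch: the arcconfig line mirrors the command already quoted earlier in this thread (with maincad as the master node), and the findstr filter just looks for service names containing RSM or ARC.

    :: point this node's ARC Node service at the master / submit host (maincad here)
    cd /d "C:\Program Files\ANSYS Inc\v195\RSM\ARC\tools\winx64"
    arcconfig node modify %COMPUTERNAME% -mn maincad

    :: list installed services whose names mention RSM or ARC
    sc query state= all | findstr /i /c:"RSM" /c:"ARC"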

     

     

     

     

     

     

  • jcallery Member
    edited October 2019

    Hi Nick,

     

    There seem to be a number of different issues to look at here.

    First, please uninstall the ARC Master and RSM Launcher services from the two machines that are NOT the head node.

    You can do this from the command line by doing the following:

    Start -> Type cmd -> Right click on Command Prompt and select Run As administrator

    At the prompt type:

    cd /d "c:program filesansys incv195RSMin"

    ansunconfigrsm -launcher

    cd /d "c:program filesansys incv195RSMARC oolswinx64"

    uninstallservice -arcmaster

     

    This should leave the ARC Node service running on MainCad, Cluster01, and Cluster02.

    This should leave the ARC Master and RSM Launcher services running on MainCad only.

     

    Once that is done, go to MainCad and run the admin command prompt as stated above, then type:

    cd /c "c:program filesansys incv195RSMARC oolswinx64"

    arcnodes

    arcqueues

    Then please paste a screenshot of that.

     

    Thank you,

    Jake

     

     

     

  • tsiriaks Member
    edited October 2019

    Continued

     

    On client machines. Note: All info here is the same on any client machine. The Submit Host is the dedicated node above; do not use local machine information.

    Launch RSM Configuration GUI

    Once you are in the GUI, click the plus sign at the top left, then fill in the name or IP address of the Submit Host machine and its OS.

    Once this is done, click "Apply".

    Go to the "File Management" tab, select "Operating System File Transfer", and put in the share path to the RSMStaging directory you shared earlier (again, this is the same UNC path that you set up above; do not use local machine information).

    Once this is done, click "Apply".

    Then go to the "Queues" tab, click "Import/Refresh Cluster Queues", and specify your login username and password. This will bring up all the queues. You can enable and disable any queue that you like.

    Then make sure that port 9195 is open between the client and the ‘submit host’ machines.
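
    A quick way to verify the port; this is just a sketch, using the hostnames from this thread, and the firewall rule name is arbitrary:

    :: from the client, confirm the submit host answers on the RSM port
    telnet maincad 9195

    :: on the submit host, confirm something is listening on 9195
    netstat -an | findstr 9195

    :: if needed, open the port in Windows Firewall on the submit host
    netsh advfirewall firewall add rule name="ANSYS RSM 9195" dir=in action=allow protocol=TCP localport=9195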

  • tsiriaks Member
    edited October 2019

    sounds good, please keep us posted

    Thanks,

    Win

  • kolyan007 Member
    edited October 2019

    Just checked RSM Cluster Load Monitoring on the HEAD node; it shows only 1 node, not 3. I've manually created another queue and added the other nodes to it, but it doesn't help.

    My queues and the RSM Cluster output on the HEAD node are below:

     

  • kolyan007 Member
    edited October 2019

    Followed the ANSYS RSM user guide to finalize my cluster setup.

     

    Got 3 nodes under RSM Cluster Monitoring and the tests were fine, but there is an error when I try to solve my project. Maincad is my Submit Host + solver, Cluster02 is a solver only. Please advise what could be wrong?

      

  • jcallery Member
    edited October 2019

    Looks like something is blocking the IntelMPI processes.

    Firewall / User Account Control / Antivirus / etc...

    You could try IBM MPI instead.

    First make sure IBM/Platform MPI is installed on all machines.

    Then, in Mechanical, go to Solve Process Settings -> Select your config -> Advanced

    In the Additional Command Line Arguments add:

    -mpi ibmmpi

     

    Then try it again.

     

    Thank you,

    Jake

  • kolyan007 Member
    edited October 2019

    Thanks for the advice. I've installed the MPI services only on the Submit Host. I will add them to the remote solvers and add these command-line arguments.

  • kolyan007 Member
    edited October 2019

    What arguments should I use if I will be using Intel MPI (once MPI is installed on all machines)?

    Which installation option of IBM MPI would be best for Windows 10 (Service Mode, Service Only, or HPC Mode)?

  • kolyan007 Member
    edited October 2019

    Right, so IBM MPI is installed on all machines in Service Mode; I hope that's the correct option. The -mpi ibmmpi argument has also been added. Telnet to TCP 8636 from the Submit Host to the solvers is OK.

    Now solving is working, but I don't see any CPU load on my solvers. Is there anything else to check?

     

    Start running job commands ...

    Running on machine : Maincad

    Current Directory: c:\RSMStaging\ofwkpry5.3rq

    Running command: C:\Program Files\ANSYS Inc\v195\SEC\SolverExecutionController\runsec.bat

    Redirecting output to  None

    Final command arg list : ['C:\Program Files\ANSYS Inc\v195\SEC\SolverExecutionController\runsec.bat']

    Running Process

    Running Solver : C:\Program Files\ANSYS Inc\v195\ansys\bin\winx64\ANSYS195.exe -b nolist -s noread -p ansys -mpi ibmmpi -i remote.dat -o solve.out -dis -machines maincad:1:cluster01:1:cluster02:1 -dir "cRSMStaging/ofwkpry5.3rq"

    WARNING: No cached password or password provided.

             use '-pass' or '-cache' to provide password

     *** The AWP_ROOT195 environment variable is set to: C:\Program Files\ANSYS Inc\v195

     *** The AWP_LOCALE195 environment variable is set to: en-us

     *** The ANSYS195_DIR environment variable is set to: C:\Program Files\ANSYS Inc\v195\ANSYS

     *** The ANSYS_SYSDIR environment variable is set to: winx64

     *** The ANSYS_SYSDIR32 environment variable is set to: intel

     *** The CADOE_DOCDIR195 environment variable is set to: C:\Program Files\ANSYS Inc\v195\CommonFiles\help\en-us\solviewer

     *** The CADOE_LIBDIR195 environment variable is set to: C:\Program Files\ANSYS Inc\v195\CommonFiles\Language\en-us

     *** The LSTC_LICENSE environment variable is set to: ANSYS

     *** The P_SCHEMA environment variable is set to: C:\Program Files\ANSYS Inc\v195\AISOL\CADIntegration\Parasolid\PSchema

     *** The TEMP environment variable is set to: C:\Users\User\AppData\Local\Temp

     

  • tsiriaks Member
    edited October 2019

    This seems good (and by the way, Service Mode is the correct choice for IBM MPI).

    If you open up Task Manager on the solving machines, do you see any ANSYS.exe or ANSYS195.exe?
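
    You can also check remotely from the Submit Host; this is a sketch and assumes your account has admin rights on the solve nodes (cluster01/cluster02 as named in this thread):

    :: look for a running distributed solver process on each solve node
    tasklist /s cluster01 /fi "IMAGENAME eq ANSYS195.exe"
    tasklist /s cluster02 /fi "IMAGENAME eq ANSYS195.exe"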

  • kolyan007 Member
    edited October 2019

    Only the Submit Host shows some ANSYS activity; the solvers don't show any. Weird.

    This screenshot is from the Submit Host:

  • jcallery Member
    edited October 2019

    Please post a screenshot of your Solution properties if submitting from Workbench, or post a screenshot of your Solve Process Settings and the Advanced tab for the solve process setting you are using.

     

    Thank you,

    Jake

  • kolyan007 Member
    edited October 2019

    Yesterday I was playing with Intel MPI, as I had previously seen some load on the solvers with it. The screenshot with Solve Process Settings and the Advanced tab is below.

    So I've tried with the -mpi intelmpi argument and without. The result was the same, i.e. the solving process started on all machines, CPU load was 100% but only for a couple of minutes, then I got an MPI_Abort error (the full error message is also below).

    Note: Regarding the Invalid working directory value= CRSMStaging/aokqqj0p.5k1 message, it's weird, as the folder actually exists and is accessible by the solvers. Not sure why it has wrong slashes here.

    Also, I do have an NVIDIA Quadro GPU in each machine. Is it reasonable to enable GPU Acceleration to improve performance?

     

    Error message (excerpt):

    send of 20 bytes failed.

    Command Exit Code: 99

    application called MPI_Abort(MPI_COMM_WORLD, 99) - process 0

    [[email protected]] ..\hydra\utils\sock\sock.c (420): write error (Unknown error)

    [[email protected]] ..\hydra\utils\launch\launch.c (121): shutdown failed, sock 648, error 10093

    ClusterJobs Command Exit Code: 99

    Saving exit code file: C:\RSMStaging\aokqqj0p.5k1\exitcode_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        Exit code file: C:\RSMStaging\aokqqj0p.5k1\exitcode_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout has been created.

    Saving exit code file: C:\RSMStaging\aokqqj0p.5k1\exitcodeCommands_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        Exit code file: C:\RSMStaging\aokqqj0p.5k1\exitcodeCommands_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout has been created.

    =======SOLVE.OUT FILE======

    solve -->  *** FATAL ***

    solve --> Attempt to run ANSYS in a distributed mode failed.

    solve -->   The Distributed ANSYS process with MPI Rank ID =    4 is not responding.

    solve -->   Please refer to the Distributed ANSYS Guide for detailed setup and configuration information.

    ClusterJobs Exiting with code: 99

    Individual Command Exit Codes are: [99]

     

     

     

    Full error log:

    RSM Version: 19.5.420.0, Build Date: 30/07/2019 57:06 am

    RSM File Version: 19.5.0-beta.3+Branch.release-19.5.Sha.9c15e0a3e0a608e8efe8c568e494d2160788d025

    RSM Library Version: 19.5.0.0

    Job Name: Annulus_Thread_MAX_Sludge-SidWalls=No Separation,rhoSludge=2,3,t=0-DP0-Model (B4)-Static Structural (B5)-Solution (B6)

        Type: Mechanical_ANSYS

        Client Directory: C:\TestProject\_ProjectScratch\Scr6531

        Client Machine: Maincad

        Queue: Default [localhost, default]

        Template: Mechanical_ANSYSJob.xml

    Cluster Configuration: localhost [localhost]

        Cluster Type: ARC

        Custom Keyword: blank

        Transfer Option: NativeOS

        Staging Directory: C:\RSMStaging

        Delete Staging Directory: True

        Local Scratch Directory: blank

        Platform: Windows

        Cluster Submit Options: blank

        Normal Inputs: [*.dat,file*.*,*.mac,thermal.build,commands.xml,SecInput.txt]

        Cancel Inputs: [file.abt,sec.interrupt]

        Excluded Inputs: [-]

        Normal Outputs: [*.xml,*.NR*,CAERepOutput.xml,cyclic_map.json,exit.topo,file*.dsub,file*.ldhi,file*.png,file*.r0*,file*.r1*,file*.r2*,file*.r3*,file*.r4*,file*.r5*,file*.r6*,file*.r7*,file*.r8*,file*.r9*,file*.rd*,file*.rst,file.BCS,file.ce,file.cm,file.cnd,file.cnm,file.DSP,file.err,file.gst,file.json,file.nd*,file.nlh,file.nr*,file.PCS,file.rdb,file.rfl,file.spm,file0.BCS,file0.ce,file0.cnd,file0.err,file0.gst,file0.nd*,file0.nlh,file0.nr*,file0.PCS,frequencies_*.out,input.x17,intermediate*.topo,Mode_mapping_*.txt,NotSupportedElems.dat,ObjectiveHistory.out,post.out,PostImage*.png,record.txt,solve*.out,topo.err,toput,vars.topo,SecDebugLog.txt,secStart.log,sec.validation.executed,sec.envvarvalidation.executed,sec.failure]

        Failure Outputs: [*.xml,*.NR*,CAERepOutput.xml,cyclic_map.json,exit.topo,file*.dsub,file*.ldhi,file*.png,file*.r0*,file*.r1*,file*.r2*,file*.r3*,file*.r4*,file*.r5*,file*.r6*,file*.r7*,file*.r8*,file*.r9*,file*.rd*,file*.rst,file.BCS,file.ce,file.cm,file.cnd,file.cnm,file.DSP,file.err,file.gst,file.json,file.nd*,file.nlh,file.nr*,file.PCS,file.rdb,file.rfl,file.spm,file0.BCS,file0.ce,file0.cnd,file0.err,file0.gst,file0.nd*,file0.nlh,file0.nr*,file0.PCS,frequencies_*.out,input.x17,intermediate*.topo,Mode_mapping_*.txt,NotSupportedElems.dat,ObjectiveHistory.out,post.out,PostImage*.png,record.txt,solve*.out,topo.err,toput,vars.topo,SecDebugLog.txt,sec.solverexitcode,secStart.log,sec.failure,sec.envvarvalidation.executed]

        Cancel Outputs: [file*.err,solve.out,secStart.log,SecDebugLog.txt]

        Excluded Outputs: [-]

        Inquire Files:

          normal: [*.xml,*.NR*,CAERepOutput.xml,cyclic_map.json,exit.topo,file*.dsub,file*.ldhi,file*.png,file*.r0*,file*.r1*,file*.r2*,file*.r3*,file*.r4*,file*.r5*,file*.r6*,file*.r7*,file*.r8*,file*.r9*,file*.rd*,file*.rst,file.BCS,file.ce,file.cm,file.cnd,file.cnm,file.DSP,file.err,file.gst,file.json,file.nd*,file.nlh,file.nr*,file.PCS,file.rdb,file.rfl,file.spm,file0.BCS,file0.ce,file0.cnd,file0.err,file0.gst,file0.nd*,file0.nlh,file0.nr*,file0.PCS,frequencies_*.out,input.x17,intermediate*.topo,Mode_mapping_*.txt,NotSupportedElems.dat,ObjectiveHistory.out,post.out,PostImage*.png,record.txt,solve*.out,topo.err,toput,vars.topo,SecDebugLog.txt,secStart.log,sec.validation.executed,sec.envvarvalidation.executed,sec.failure]

          cancel: [file*.err,solve.out,secStart.log,SecDebugLog.txt]

          failure: [*.xml,*.NR*,CAERepOutput.xml,cyclic_map.json,exit.topo,file*.dsub,file*.ldhi,file*.png,file*.r0*,file*.r1*,file*.r2*,file*.r3*,file*.r4*,file*.r5*,file*.r6*,file*.r7*,file*.r8*,file*.r9*,file*.rd*,file*.rst,file.BCS,file.ce,file.cm,file.cnd,file.cnm,file.DSP,file.err,file.gst,file.json,file.nd*,file.nlh,file.nr*,file.PCS,file.rdb,file.rfl,file.spm,file0.BCS,file0.ce,file0.cnd,file0.err,file0.gst,file0.nd*,file0.nlh,file0.nr*,file0.PCS,frequencies_*.out,input.x17,intermediate*.topo,Mode_mapping_*.txt,NotSupportedElems.dat,ObjectiveHistory.out,post.out,PostImage*.png,record.txt,solve*.out,topo.err,toput,vars.topo,SecDebugLog.txt,sec.solverexitcode,secStart.log,sec.failure,sec.envvarvalidation.executed]

          RemotePostInformation: [RemotePostInformation.txt]

          SolutionInformation: [solve.out,file.gst,file.nlh,file0.gst,file0.nlh,file.cnd]

          PostDuringSolve: [file.rcn,file.redm,file.rfl,file.rfrq,file.rmg,file.rdsp,file.rsx,file.rst,file.rth,solve.out,file.gst,file.nlh,file0.gst,file0.nlh,file.cnd]

          transcript: [solve.out,monitor.json]

    Submission in progress...

    Created storage directory C:\RSMStaging\aokqqj0p.5k1

    Runtime Settings:

        Job Owner: MAINCAD\User

        Submit Time: Tuesday, 8 October 2019 11:40 pm

        Directory: C:\RSMStaging\aokqqj0p.5k1

    Uploading file: C:\TestProject\_ProjectScratch\Scr6531\ds.dat

    Uploading file: C:\TestProject\_ProjectScratch\Scr6531\remote.dat

    Uploading file: C:\TestProject\_ProjectScratch\Scr6531\commands.xml

    Uploading file: C:\TestProject\_ProjectScratch\Scr6531\SecInput.txt

    39898.74 KB, .08 sec (498807.21 KB/sec)

    Submission in progress...

    JobType is: Mechanical_ANSYS

    Final command platform: Windows

    RSM_PYTHON_HOME=C:\Program Files\ANSYS Inc\v195\commonfiles\CPython\2_7_15\winx64\Release\python

    RSM_HPC_JOBNAME=Mechanical

    Distributed mode requested: True

    RSM_HPC_DISTRIBUTED=TRUE

    Running 1 commands

    Job working directory: C:\RSMStaging\aokqqj0p.5k1

    Number of CPU requested: 12

    AWP_ROOT195=C:\Program Files\ANSYS Inc\v195

    Checking queue default exists ...

    JobId was parsed as: 4

    Job submission was successful.

    Job is running on hostname Maincad

    Job user from this host: User

    Starting directory: C:\RSMStaging\aokqqj0p.5k1

    Reading control file C:\RSMStaging\aokqqj0p.5k1\control_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsm ....

    Correct Cluster verified

    Cluster Type: ARC

    Underlying Cluster: ARC

        RSM_CLUSTER_TYPE = ARC

    Compute Server is running on MAINCAD

    Reading commands and arguments...

        Command 1: %AWP_ROOT195%\SEC\SolverExecutionController\runsec.bat, arguments: , redirectFile: None

    Running from shared staging directory ...

        RSM_USE_LOCAL_SCRATCH = False

        RSM_LOCAL_SCRATCH_DIRECTORY =

        RSM_LOCAL_SCRATCH_PARTIAL_UNC_PATH =

    Cluster Shared Directory: C:\RSMStaging\aokqqj0p.5k1

        RSM_SHARE_STAGING_DIRECTORY = C:\RSMStaging\aokqqj0p.5k1

    Job file clean up: True

    Use SSH on Linux cluster nodes: False

        RSM_USE_SSH_LINUX = False

    LivelogFile: NOLIVELOGFILE

    StdoutLiveLogFile: stdout_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.live

    StderrLiveLogFile: stderr_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.live

    Reading input files...

        *.dat

        file*.*

        *.mac

        thermal.build

        commands.xml

        SecInput.txt

    Reading cancel files...

        file.abt

        sec.interrupt

    Reading output files...

        *.xml

        *.NR*

        CAERepOutput.xml

        cyclic_map.json

        exit.topo

        file*.dsub

        file*.ldhi

        file*.png

        file*.r0*

        file*.r1*

        file*.r2*

        file*.r3*

        file*.r4*

        file*.r5*

        file*.r6*

        file*.r7*

        file*.r8*

        file*.r9*

        file*.rd*

        file*.rst

        file.BCS

        file.ce

        file.cm

        file.cnd

        file.cnm

        file.DSP

        file.err

        file.gst

        file.json

        file.nd*

        file.nlh

        file.nr*

        file.PCS

        file.rdb

        file.rfl

        file.spm

        file0.BCS

        file0.ce

        file0.cnd

        file0.err

        file0.gst

        file0.nd*

        file0.nlh

        file0.nr*

        file0.PCS

        frequencies_*.out

        input.x17

        intermediate*.topo

        Mode_mapping_*.txt

        NotSupportedElems.dat

        ObjectiveHistory.out

        post.out

        PostImage*.png

        record.txt

        solve*.out

        topo.err

        toput

        vars.topo

        SecDebugLog.txt

        secStart.log

        sec.validation.executed

        sec.envvarvalidation.executed

        sec.failure

        *.xml

        *.NR*

        CAERepOutput.xml

        cyclic_map.json

        exit.topo

        file*.dsub

        file*.ldhi

        file*.png

        file*.r0*

        file*.r1*

        file*.r2*

        file*.r3*

        file*.r4*

        file*.r5*

        file*.r6*

        file*.r7*

        file*.r8*

        file*.r9*

        file*.rd*

        file*.rst

        file.BCS

        file.ce

        file.cm

        file.cnd

        file.cnm

        file.DSP

        file.err

        file.gst

        file.json

        file.nd*

        file.nlh

        file.nr*

        file.PCS

        file.rdb

        file.rfl

        file.spm

        file0.BCS

        file0.ce

        file0.cnd

        file0.err

        file0.gst

        file0.nd*

        file0.nlh

        file0.nr*

        file0.PCS

        frequencies_*.out

        input.x17

        intermediate*.topo

        Mode_mapping_*.txt

        NotSupportedElems.dat

        ObjectiveHistory.out

        post.out

        PostImage*.png

        record.txt

        solve*.out

        topo.err

        toput

        vars.topo

        SecDebugLog.txt

        sec.solverexitcode

        secStart.log

        sec.failure

        sec.envvarvalidation.executed

    Reading exclude files...

        stdout_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        stderr_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        control_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsm

        hosts.dat

        exitcode_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        exitcodeCommands_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        stdout_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.live

        stderr_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.live

        ClusterJobCustomization.xml

        ClusterJobs.py

        clusterjob_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.sh

        clusterjob_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.bat

        inquire.request

        inquire.confirm

        request.upload.rsm

        request.download.rsm

        wait.download.rsm

        scratch.job.rsm

        volatile.job.rsm

        restart.xml

        cancel_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        liveLogLastPositions_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsm

        stdout_782e78b5-2c00-45fa-b7ac-6976aa49c2cf_kill.rsmout

        stderr_782e78b5-2c00-45fa-b7ac-6976aa49c2cf_kill.rsmout

        sec.interrupt

        stdout_782e78b5-2c00-45fa-b7ac-6976aa49c2cf_*.rsmout

        stderr_782e78b5-2c00-45fa-b7ac-6976aa49c2cf_*.rsmout

        stdout_task_*.live

        stderr_task_*.live

        control_task_*.rsm

        stdout_task_*.rsmout

        stderr_task_*.rsmout

        exitcode_task_*.rsmout

        exitcodeCommands_task_*.rsmout

        file.abt

    Reading environment variables...

        RSM_IRON_PYTHON_HOME = C:\Program Files\ANSYS Inc\v195\commonfiles\IronPython

        RSM_TASK_WORKING_DIRECTORY = C:\RSMStaging\aokqqj0p.5k1

        RSM_USE_SSH_LINUX = False

        RSM_QUEUE_NAME = default

        RSM_CONFIGUREDQUEUE_NAME = Default

        RSM_COMPUTE_SERVER_MACHINE_NAME = Maincad

        RSM_HPC_JOBNAME = Mechanical

        RSM_HPC_DISPLAYNAME = Annulus_Thread_MAX_Sludge-SidWalls=No Separation,rhoSludge=2,3,t=0-DP0-Model (B4)-Static Structural (B5)-Solution (B6)

        RSM_HPC_CORES = 12

        RSM_HPC_DISTRIBUTED = TRUE

        RSM_HPC_NODE_EXCLUSIVE = FALSE

        RSM_HPC_QUEUE = default

        RSM_HPC_USER = MAINCAD\User

        RSM_HPC_WORKDIR = C:\RSMStaging\aokqqj0p.5k1

        RSM_HPC_JOBTYPE = Mechanical_ANSYS

        RSM_HPC_ANSYS_LOCAL_INSTALL_DIRECTORY = C:\Program Files\ANSYS Inc\v195

        RSM_HPC_VERSION = 195

        RSM_HPC_STAGING = C:\RSMStaging\aokqqj0p.5k1

        RSM_HPC_LOCAL_PLATFORM = Windows

        RSM_HPC_CLUSTER_TARGET_PLATFORM = Windows

        RSM_HPC_STDOUTFILE = stdout_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        RSM_HPC_STDERRFILE = stderr_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        RSM_HPC_STDOUTLIVE = stdout_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.live

        RSM_HPC_STDERRLIVE = stderr_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.live

        RSM_HPC_SCRIPTS_DIRECTORY_LOCAL = C:\Program Files\ANSYS Inc\v195\RSM\Config\scripts

        RSM_HPC_SCRIPTS_DIRECTORY = C:\Program Files\ANSYS Inc\v195\RSM\Config\scripts

        RSM_HPC_SUBMITHOST = localhost

        RSM_HPC_STORAGEID = <JobId>    <id>a7ab7b67-dc3d-44cb-a665-813ffa216eee</id>    <nativeId>Mechanical=LocalOS#localhost$C:\RSMStaging\aokqqj0p.5k1</nativeId>    <timestamp>Tuesday, October 08, 2019 10:40:05.011 AM</timestamp>    <fullyDefined>True</fullyDefined></JobId>

        RSM_HPC_PLATFORMSTORAGEID = C:\RSMStaging\aokqqj0p.5k1

        RSM_HPC_NATIVEOPTIONS =

        ARC_ROOT = C:\Program Files\ANSYS Inc\v195\RSM\Config\scripts\..\..\ARC

        RSM_HPC_KEYWORD = ARC

        RSM_PYTHON_LOCALE = en-us

    Reading AWP_ROOT environment variable name ...

        AWP_ROOT environment variable name is: AWP_ROOT195

    Reading Low Disk Space Warning Limit ...

        Low disk space warning threshold set at: 2.0GiB

    Reading File identifier ...

        File identifier found as: 782e78b5-2c00-45fa-b7ac-6976aa49c2cf

    Done reading control file.

    RSM_AWP_ROOT_NAME = AWP_ROOT195

    AWP_ROOT195 install directory: C:\Program Files\ANSYS Inc\v195

    RSM_MACHINES = maincad:4:cluster01:4:cluster02:4

    Number of nodes assigned for current job = 3

    Machine list: ['maincad', 'cluster01', 'cluster02']

    Checking DisableUNCCheck ...

    Unable to read registry for HKEY_CURRENT_USER\Software\Microsoft\Command Processor

    Ignore verifying DisableUNCCheck, you may check it manually if not already done so.

    Start running job commands ...

    Running on machine : Maincad

    Current Directory: C:\RSMStaging\aokqqj0p.5k1

    Running command: C:\Program Files\ANSYS Inc\v195\SEC\SolverExecutionController\runsec.bat

    Redirecting output to  None

    Final command arg list : ['C:\Program Files\ANSYS Inc\v195\SEC\SolverExecutionController\runsec.bat']

    Running Process

    Running Solver : C:\Program Files\ANSYS Inc\v195\ansys\bin\winx64\ANSYS195.exe -b nolist -s noread -acc nvidia -na 1 -p ansys -mpi intelmpi -i remote.dat -o solve.out -dis -machines maincad:4:cluster01:4:cluster02:4 -dir "CRSMStaging/aokqqj0p.5k1"

     *** The AWP_ROOT195 environment variable is set to: C:\Program Files\ANSYS Inc\v195

     *** The AWP_LOCALE195 environment variable is set to: en-us

     *** The ANSYS195_DIR environment variable is set to: C:\Program Files\ANSYS Inc\v195\ANSYS

     *** The ANSYS_SYSDIR environment variable is set to: winx64

     *** The ANSYS_SYSDIR32 environment variable is set to: intel

     *** The CADOE_DOCDIR195 environment variable is set to: C:\Program Files\ANSYS Inc\v195\CommonFiles\help\en-us\solviewer

     *** The CADOE_LIBDIR195 environment variable is set to: C:\Program Files\ANSYS Inc\v195\CommonFiles\Language\en-us

     *** The LSTC_LICENSE environment variable is set to: ANSYS

     *** The P_SCHEMA environment variable is set to: C:\Program Files\ANSYS Inc\v195\AISOL\CADIntegration\Parasolid\PSchema

     *** The TEMP environment variable is set to: C:\Users\User\AppData\Local\Temp

     *** FATAL ***

      Invalid working directory value= CRSMStaging/aokqqj0p.5k1

      (specified working directory does not exist)

     *** NOTE ***

      USAGE: C:Program FilesANSYS Incv195ANSYSinwinx64ANSYS.EXE

                         [-d device name] [-j job name]

                         [-b list|nolist] [-m scratch memory(mb)]

                         [-s read|noread] [-g] [-db database(mb)]

                         [-p product] [-l language]

                         [-dyn] [-np #] [-mfm]

                         [-dvt] [-dis] [-machines list]

                         [-i inputfile] [-o outputfile]

                         [-ser port] [-scport port]

                         [-scname couplingname] [-schost hostname]

                         [-smp] [-mpi intelmpi|ibmmpi|msmpi]

                         [-dir working_directory ]

     *** FATAL ***

      Invalid working directory value= CRSMStaging/aokqqj0p.5k1

      (specified working directory does not exist)

     *** NOTE ***

      USAGE: C:Program FilesANSYS Incv195ANSYSinwinx64ANSYS.EXE

                         [-d device name] [-j job name]

                         [-b list|nolist] [-m scratch memory(mb)]

                         [-s read|noread] [-g] [-db database(mb)]

                         [-p product] [-l language]

                         [-dyn] [-np #] [-mfm]

                         [-dvt] [-dis] [-machines list]

                         [-i inputfile] [-o outputfile]

                         [-ser port] [-scport port]

                         [-scname couplingname] [-schost hostname]

                         [-smp] [-mpi intelmpi|ibmmpi|msmpi]

                         [-dir working_directory ]

                         [-acc nvidia] [-na #]

     *** FATAL ***

      Invalid working directory value= CRSMStaging/aokqqj0p.5k1

      (specified working directory does not exist)

     *** NOTE ***

      USAGE: C:Program FilesANSYS Incv195ANSYSinwinx64ANSYS.EXE

                         [-d device name] [-j job name]

                         [-b list|nolist] [-m scratch memory(mb)]

                         [-s read|noread] [-g] [-db database(mb)]

                         [-p product] [-l language]

                         [-dyn] [-np #] [-mfm]

                         [-dvt] [-dis] [-machines list]

                         [-i inputfile] [-o outputfile]

                         [-ser port] [-scport port]

                         [-scname couplingname] [-schost hostname]

                         [-smp] [-mpi intelmpi|ibmmpi|msmpi]

                         [-dir working_directory ]

                         [-acc nvidia] [-na #]

     *** FATAL ***

      Invalid working directory value= CRSMStaging/aokqqj0p.5k1

      (specified working directory does not exist)

     *** NOTE ***

      USAGE: C:Program FilesANSYS Incv195ANSYSinwinx64ANSYS.EXE

                         [-d device name] [-j job name]

                         [-b list|nolist] [-m scratch memory(mb)]

                         [-s read|noread] [-g] [-db database(mb)]

                         [-p product] [-l language]

                         [-dyn] [-np #] [-mfm]

                         [-dvt] [-dis] [-machines list]

                         [-i inputfile] [-o outputfile]

                         [-ser port] [-scport port]

                         [-scname couplingname] [-schost hostname]

                         [-smp] [-mpi intelmpi|ibmmpi|msmpi]

                         [-dir working_directory ]

                         [-acc nvidia] [-na #]

                         [-acc nvidia] [-na #]

     *** FATAL ***

      Invalid working directory value= CRSMStaging/aokqqj0p.5k1

      (specified working directory does not exist)

     *** NOTE ***

      USAGE: C:Program FilesANSYS Incv195ANSYSinwinx64ANSYS.EXE

                         [-d device name] [-j job name]

                         [-b list|nolist] [-m scratch memory(mb)]

                         [-s read|noread] [-g] [-db database(mb)]

                         [-p product] [-l language]

                         [-dyn] [-np #] [-mfm]

                         [-dvt] [-dis] [-machines list]

                         [-i inputfile] [-o outputfile]

                         [-ser port] [-scport port]

                         [-scname couplingname] [-schost hostname]

                         [-smp] [-mpi intelmpi|ibmmpi|msmpi]

                         [-dir working_directory ]

                         [-acc nvidia] [-na #]

     *** FATAL ***

      Invalid working directory value= CRSMStaging/aokqqj0p.5k1

      (specified working directory does not exist)

     *** NOTE ***

      USAGE: C:Program FilesANSYS Incv195ANSYSinwinx64ANSYS.EXE

                         [-d device name] [-j job name]

                         [-b list|nolist] [-m scratch memory(mb)]

                         [-s read|noread] [-g] [-db database(mb)]

                         [-p product] [-l language]

                         [-dyn] [-np #] [-mfm]

                         [-dvt] [-dis] [-machines list]

                         [-i inputfile] [-o outputfile]

                         [-ser port] [-scport port]

                         [-scname couplingname] [-schost hostname]

                         [-smp] [-mpi intelmpi|ibmmpi|msmpi]

                         [-dir working_directory ]

                         [-acc nvidia] [-na #]

     *** FATAL ***

      Invalid working directory value= CRSMStaging/aokqqj0p.5k1

      (specified working directory does not exist)

     *** NOTE ***

      USAGE: C:Program FilesANSYS Incv195ANSYSinwinx64ANSYS.EXE

                         [-d device name] [-j job name]

                         [-b list|nolist] [-m scratch memory(mb)]

                         [-s read|noread] [-g] [-db database(mb)]

                         [-p product] [-l language]

                         [-dyn] [-np #] [-mfm]

                         [-dvt] [-dis] [-machines list]

                         [-i inputfile] [-o outputfile]

                         [-ser port] [-scport port]

                         [-scname couplingname] [-schost hostname]

                         [-smp] [-mpi intelmpi|ibmmpi|msmpi]

                         [-dir working_directory ]

                         [-acc nvidia] [-na #]

     *** FATAL ***

      Invalid working directory value= CRSMStaging/aokqqj0p.5k1

      (specified working directory does not exist)

     *** NOTE ***

      USAGE: C:Program FilesANSYS Incv195ANSYSinwinx64ANSYS.EXE

                         [-d device name] [-j job name]

                         [-b list|nolist] [-m scratch memory(mb)]

                         [-s read|noread] [-g] [-db database(mb)]

                         [-p product] [-l language]

                         [-dyn] [-np #] [-mfm]

                         [-dvt] [-dis] [-machines list]

                         [-i inputfile] [-o outputfile]

                         [-ser port] [-scport port]

                         [-scname couplingname] [-schost hostname]

                         [-smp] [-mpi intelmpi|ibmmpi|msmpi]

                         [-dir working_directory ]

                         [-acc nvidia] [-na #]

    send of 20 bytes failed.

    Command Exit Code: 99

    application called MPI_Abort(MPI_COMM_WORLD, 99) - process 0

    [[email protected]] ..\hydra\utils\sock\sock.c (420): write error (Unknown error)

    [[email protected]] ..\hydra\utils\launch\launch.c (121): shutdown failed, sock 648, error 10093

    ClusterJobs Command Exit Code: 99

    Saving exit code file: C:\RSMStaging\aokqqj0p.5k1\exitcode_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        Exit code file: C:\RSMStaging\aokqqj0p.5k1\exitcode_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout has been created.

    Saving exit code file: C:\RSMStaging\aokqqj0p.5k1\exitcodeCommands_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout

        Exit code file: C:\RSMStaging\aokqqj0p.5k1\exitcodeCommands_782e78b5-2c00-45fa-b7ac-6976aa49c2cf.rsmout has been created.

    =======SOLVE.OUT FILE======

    solve -->  *** FATAL ***

    solve --> Attempt to run ANSYS in a distributed mode failed.

    solve -->   The Distributed ANSYS process with MPI Rank ID =    4 is not responding.

    solve -->   Please refer to the Distributed ANSYS Guide for detailed setup and configuration information.

    ClusterJobs Exiting with code: 99

    Individual Command Exit Codes are: [99]

     

     

  • jcallery Member
    edited October 2019

    Hi,

    If you are trying to distribute across multiple machines, your staging directory can't be a local drive path ("C:").

    You will need to share that directory, and in the staging directory setting use the UNC path to that shared directory, e.g.:

    \\maincad\RSMStaging

     

    Please be sure to set the RSMStaging share security to be open to all users and machines.
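
    Once the share is in place, a quick sanity check from each solve node (a sketch; writetest.txt is just a throwaway file name):

    :: confirm the staging share resolves and is writable from this node
    dir \\maincad\RSMStaging
    echo test > \\maincad\RSMStaging\writetest.txt
    del \\maincad\RSMStaging\writetest.txt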

     

    Thank you,

    Jake

     
