Ansys Products

Multi-node RSM ARC fails jobs when adding more than 1 compute node

    • Tomas Gintautas
      Subscriber

      I have set up a compute cluster with one head node and multiple compute nodes underneath it, running CentOS 7.9 (essentially this):

      If I use a queue that contains only one compute node, all calculations go through fine. The moment I add additional nodes to the queue, I start getting this error (with the scratch directory set in HPC Side File Management):
      Failed to create working directory on execution nodes via node share/mount.
      ClusterJobs Command Exit Code: 1004

      If I set the HPC staging directory in HPC Side File Management instead, jobs still do not go through, but the error is different (it hangs at this stage):

      Running Process
      Running Solver : /mnt/resource/ansys_install/v221/ansys/bin/ansys221 -b nolist -s noread -apip off -p meba -i remote.dat -o solve.out -dis -machines af99a381cd6f419783359a34da1f218b000000:8:af99a381cd6f419783359a34da1f218b000001:8 -dir "/mnt/resource/batch/tasks/fsmounts/R/ep30uf1s.w4g"
      Host key verification failed.

      Furthermore, the model calculations fail even though the RSM test in the RSM configuration GUI passes with a green tick mark.

      Does anyone have ideas? What am I missing? The credentials are cached with arccredentials on the master node (and on the compute nodes), and passwordless SSH is set up across the cluster.

      Thanks!

       

    • George Karnos
      Ansys Employee

      The error message:
      Host key verification failed.
      points to a passwordless SSH issue.
      Test passwordless SSH between all of the cluster machines.
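
      One way to run that test is sketched below in Python. It is only an illustration: the function names (parse_machines, ssh_check_command, check_all_pairs) are made up for this example and are not part of RSM or ARC. It assumes OpenSSH is installed on every node, that the script is run on the head node as the same user that submits the jobs, and that the head node can already log in to each compute node. Because the hang appears when the solver is started with two hosts in -machines, the sketch checks every ordered pair of nodes rather than only head-to-node logins; that is an assumption about where the failing connection originates. BatchMode=yes makes ssh fail instead of prompting, so an unknown host key or a missing authorized key shows up as a non-zero exit code and a message on stderr.

```python
import itertools
import subprocess


def parse_machines(spec):
    """Return the host names from a '-machines host1:n1:host2:n2' string."""
    fields = spec.split(":")
    return fields[0::2]  # even positions are hosts, odd positions are core counts


def ssh_check_command(source, target, timeout=5):
    """Build a command that logs in to `source` and, from there,
    attempts a non-interactive SSH login to `target`."""
    opts = ["-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout}"]
    return ["ssh", *opts, source, "ssh", *opts, target, "true"]


def check_all_pairs(hosts, runner=subprocess.run):
    """Try every ordered (source, target) pair and collect the failures.

    `runner` is injectable so the logic can be exercised without a cluster.
    """
    failures = []
    for source, target in itertools.permutations(hosts, 2):
        result = runner(
            ssh_check_command(source, target),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failures.append((source, target, result.stderr.strip()))
    return failures
```

      For the job above, calling check_all_pairs(parse_machines("af99a381cd6f419783359a34da1f218b000000:8:af99a381cd6f419783359a34da1f218b000001:8")) would return a list of (source, target, error text) tuples, one for each direction in which the login failed. An empty list means every node could reach every other node without a prompt.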
