Ansys Products

Cluster’s node random disconnect

    • kevin1986
      Subscriber

      Hi guys,

      We have a 4-node cluster with AMD EPYC CPUs, connected through a Mellanox switch. However, sometimes a node disconnects from the others: either this node cannot ssh to the other nodes, or this node's folder cannot be mounted. It happens quite randomly. I have tried different versions of Fluent, different Mellanox switches, different DAC cables, different CentOS versions, and both Intel MPI and Open MPI, but the problem persists.

      I installed Fluent in a shared folder that all nodes can access. Do I need to install a local copy of Fluent on each node?

      I really don't know why this happens. Can any experts give me a clue?

      Thanks.
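Since the disconnects are random, it may help to record exactly when each node stops answering and on which service. The following is only a minimal sketch: the host names `node1` to `node4` are hypothetical placeholders, and it assumes SSH listens on its default port 22 and the shared folder is served over NFS on port 2049. It only tests whether a TCP connection can be opened, which is not the same as a working login or mount.

```python
import socket
import time

# Hypothetical host names; replace with the cluster's real node names.
NODES = ["node1", "node2", "node3", "node4"]
# Assumed ports: 22 is the default for SSH, 2049 the default for NFS.
PORTS = {"ssh": 22, "nfs": 2049}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe(nodes, ports, check=port_open):
    """Return the (host, service) pairs that could not be reached."""
    failures = []
    for host in nodes:
        for service, port in ports.items():
            if not check(host, port):
                failures.append((host, service))
    return failures


def monitor(nodes, ports, interval=60.0, rounds=None, check=port_open):
    """Probe repeatedly and print one timestamped line per failure.

    With rounds=None the loop runs until it is interrupted.
    """
    done = 0
    while rounds is None or done < rounds:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for host, service in probe(nodes, ports, check):
            print(f"{stamp} unreachable: {host} ({service})", flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling `monitor(NODES, PORTS)` on every node and redirecting the output to a local file (not to the shared folder) would give timestamps that can be compared with the system logs of the node and the switch.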

    • RK
      Ansys Employee
      Hello Kevin,
      You might want to check with your IT department on this first.
    • kevin1986
      Subscriber
      Hi RK, thanks for your reply. Actually, I work for the IT department... I just have a question. When we build a cluster, we have two methods:
      1) We install Ansys on each node.
      2) We install Ansys in one folder, then share this folder with mount -o tcp,port=2049,
      so that each node can run Ansys.
      Right now we use method 2). I also found that method 1) works. I am not sure whether method 1) is preferable when building a cluster.
      Thanks.
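With method 2), one thing that can be checked when a node misbehaves is whether the shared folder is still listed as mounted and with which options. A minimal sketch, assuming a Linux node where the kernel lists mounts in `/proc/mounts`; the path `/shared/ansys_inc` is a hypothetical placeholder for the real share, and mount points containing spaces are not handled.

```python
import os


def find_mount(mount_point, mounts_text):
    """Return (device, fstype, options) for mount_point, or None if absent.

    mounts_text is the content of /proc/mounts, one mount per line:
    device, mount point, filesystem type, options, dump, pass.
    """
    target = os.path.normpath(mount_point)
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and os.path.normpath(fields[1]) == target:
            # Keep the last match: a later mount hides earlier ones.
            found = (fields[0], fields[2], fields[3].split(","))
    return found


def describe_share(mount_point, mounts_file="/proc/mounts"):
    """Print whether mount_point is currently mounted, and how."""
    with open(mounts_file) as handle:
        entry = find_mount(mount_point, handle.read())
    if entry is None:
        print(f"{mount_point}: not mounted")
    else:
        device, fstype, options = entry
        print(f"{mount_point}: {fstype} from {device}, options {','.join(options)}")
```

For example, `describe_share("/shared/ansys_inc")` could be run on the affected node right after a failure. A mount that is still listed can nevertheless be unresponsive if the server is unreachable, so this check complements a network test rather than replacing it.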
    • kevin1986
      Subscriber
      These days we have tried a different Mellanox switch, different cables, and different IB cards, and we still have this problem. We are running the Linux version of Fluent 2022. I would appreciate any kind of help or advice :-)
    • RK
      Ansys Employee
      Hi Kevin,
      We acknowledge your response. One of us will get back to you shortly.