Thanks Ghobad. I also tried intel2018 in the GUI (typing intel2018 in manually). Please see the latest results.

  1. With the default intel setting, bandwidth is very low even when 4 cores are used: only 1000-1400.
  2. With 4 cores, intel2018's bandwidth is around 9000. Very good, and much better than the default intel.
  3. With 64 cores, intel2018's minimum bandwidth drops to 900, which seriously slows down the calculation.
  4. msmpi has a similar issue. It is also slow.



Finally, I would say Fluent 2020r2 does not have this issue at all. I hope this information helps them debug the issue further :-)