Description
The TCP BTL has long used heuristics to determine whether peer interface A can reach peer interface B.
It would be great if, instead, the TCP BTL consulted the OS routing table (e.g., via libnl on Linux) to see if interface A can reach interface B.
This would prevent cases like the following:
- Node A has interfaces Xa, Ya, and Za
- Node B has interfaces Xb, Yb, and Zb
- Xa and Xb are reachable from each other. Ditto for Ya and Yb. For simplicity, let's assume that Xa and Xb are on subnet 1, and Ya and Yb are on subnet 2.
- Za and Zb are not reachable from each other.
- By default, the TCP BTL will pair Xa/Xb and Ya/Yb. In certain cases, it will also pair Za and Zb without checking whether they can actually reach each other. This leads to error messages like the following, and the MPI process hangs:
[ivy08][[463,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.102.1.194 failed: No route to host (113)
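For illustration only, here is a rough sketch of what "ask the OS instead of guessing" could look like without libnl. It uses the plain sockets API: binding a UDP socket to the local interface address and calling connect() toward the peer address makes the kernel do a route lookup without sending any packet. The helper name `route_exists` is hypothetical and is not part of Open MPI, and this is not how the TCP BTL works today. Note that a route in the table does not guarantee the peer actually answers, so this would only filter out some bad pairings.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function).
 * Returns  1 if the kernel finds a route from local_ip to peer_ip,
 *          0 if the kernel reports the network/host as unreachable,
 *         -1 on invalid input or any other socket error.
 * IPv4 only, to keep the sketch short. */
static int route_exists(const char *local_ip, const char *peer_ip)
{
    struct sockaddr_in local, peer;
    int fd, rc = -1;

    memset(&local, 0, sizeof(local));
    memset(&peer, 0, sizeof(peer));
    local.sin_family = AF_INET;
    local.sin_port = 0;            /* let the kernel pick a port */
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9);      /* arbitrary port; nothing is sent */

    if (inet_pton(AF_INET, local_ip, &local.sin_addr) != 1 ||
        inet_pton(AF_INET, peer_ip, &peer.sin_addr) != 1) {
        return -1;                 /* not a valid dotted-quad address */
    }

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -1;
    }

    /* Pin the source address to the local interface we want to test. */
    if (bind(fd, (struct sockaddr *) &local, sizeof(local)) == 0) {
        /* connect() on a UDP socket only performs the route lookup. */
        if (connect(fd, (struct sockaddr *) &peer, sizeof(peer)) == 0) {
            rc = 1;
        } else if (errno == ENETUNREACH || errno == EHOSTUNREACH) {
            rc = 0;
        }
    }

    close(fd);
    return rc;
}
```

In the scenario above, the idea would be to call something like `route_exists(<Za address>, <Zb address>)` before pairing Za with Zb, and to skip the pair when it returns 0. A libnl-based lookup would give more detail (e.g., which route matched), at the cost of being Linux-specific.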
Reported by @bturrubiates.