Publication

X-SRQ - Improving Scalability and Performance of Multi-Core InfiniBand Clusters

by Galen M Shipman, Stephen W Poole
Publication Type
Conference Paper
Book Title
Recent Advances in Parallel Virtual Machine and Message Passing Interface
Publication Date
Page Numbers
33 to 42
Volume
5205
Publisher Location
Heidelberg, Germany
Conference Name
15th European Parallel-Virtual-Machine-and-Message-Passing-Interface-Users-Group Meeting (PVM/MPI)
Conference Location
Dublin, Ireland
Conference Date
-

To improve the scalability of InfiniBand on large-scale clusters, Open MPI introduced a protocol known as B-SRQ [2]. This protocol was shown to provide much better memory utilization of send and receive buffers for a wide variety of benchmarks and real-world applications.

Unfortunately, B-SRQ increases the number of connections between communicating peers: while addressing one scalability problem of InfiniBand, the protocol introduced another. To alleviate this connection scalability problem, a small enhancement to the reliable connection transport was requested that allows multiple shared receive queues to be attached to a single reliable connection. The modified transport is now known as the extended reliable connection transport.

X-SRQ is a new transport protocol in Open MPI, based on B-SRQ, that takes advantage of this improvement in connection scalability. This paper introduces the X-SRQ protocol and details its significantly improved scalability over B-SRQ, including a reduction in the memory footprint of connection state by as much as two orders of magnitude on large-scale multi-core systems. In addition to improving scalability, the performance of latency-sensitive collective operations is improved by up to 38%, and the variability of results decreases significantly. A detailed analysis of both the improved memory scalability and the improved performance is presented.
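The connection-count argument in the abstract can be made concrete with a small counting sketch. This is an illustrative model only, not the paper's analysis: the function names, the parameters `num_srqs` and `procs_per_node`, and the example sizes are all hypothetical. It assumes that B-SRQ needs one reliable connection per remote process per shared receive queue, and that with the extended reliable connection transport a single connection can serve every shared receive queue of a peer. The optional `procs_per_node` argument adds a further assumption, not stated in the abstract, that one connection can reach all processes on a remote multi-core node.

```python
from math import ceil


def bsrq_connections(remote_procs: int, num_srqs: int) -> int:
    """Connections one process holds under the assumed B-SRQ model:
    one reliable connection per (remote process, shared receive queue)."""
    return remote_procs * num_srqs


def xsrq_connections(remote_procs: int, procs_per_node: int = 1) -> int:
    """Connections one process holds under the assumed X-SRQ model.

    With procs_per_node=1, each remote process needs a single connection
    regardless of how many shared receive queues it exposes.
    With procs_per_node>1, we ASSUME one connection reaches every process
    on a remote node (an assumption of this sketch, not of the abstract).
    """
    return ceil(remote_procs / procs_per_node)


if __name__ == "__main__":
    # Hypothetical job: 4096 remote processes, 4 SRQs each, 16 cores per node.
    remote, srqs, cores = 4096, 4, 16
    print("B-SRQ:", bsrq_connections(remote, srqs))
    print("X-SRQ, per-process:", xsrq_connections(remote))
    print("X-SRQ, per-node (assumed):", xsrq_connections(remote, cores))
```

Under these assumptions the per-process connection count no longer grows with the number of shared receive queues, and under the per-node assumption it also shrinks with the core count. The sketch counts connections only; it does not estimate the memory consumed per connection or reproduce the paper's measured results.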