Performance Evaluation of MPI on Cray XC40 Xeon Phi Systems
| Content Provider | Semantic Scholar |
|---|---|
| Author | Parker, S. G.; Chunduri, Sudheer; Harms, Kevin; Kandalla, Krishna |
| Copyright Year | 2018 |
| Abstract | The scale and complexity of large-scale systems continues to increase, therefore optimal performance of commonly used communication primitives such as MPI point-to-point and collective operations is essential to the scalability of parallel applications. This work presents an analysis of the performance of the Cray MPI point-to-point and collective operations on the Argonne Theta Cray XC Xeon Phi system. The performance of key MPI routines is benchmarked using the OSU benchmarks, and from the collected data analytical models are fit in order to quantify the performance and scaling of the point-to-point and collective implementations. In addition, the impact of congestion on the repeatability and relative performance consistency of MPI collectives is discussed. INTRODUCTION Given the technological trends in the performance of the compute and network components of HPC systems [6], the performance and scaling of parallel applications on these systems is highly dependent on their communication performance. Optimal implementation and usage of MPI point-to-point and collective routines is essential for the performance of MPI-based applications. In this study the performance of MPI point-to-point and collective routines on the Theta supercomputer at the Argonne Leadership Computing Facility (ALCF) is evaluated using the OSU benchmarks (an illustrative ping-pong sketch appears at the end of this page). Theta is a Cray XC40 system equipped with the Cray Aries interconnect in a Dragonfly topology and a proprietary Cray MPI implementation derived from the open source MPICH implementation. The compute nodes on Theta utilize the Intel Xeon Phi Knights Landing processor. Numerous models for MPI point-to-point performance exist, and in this work the Hockney [2] or "postal" model is utilized to represent the latency of MPI point-to-point operations. In an MPI library the collective routines are implemented using a sequence of point-to-point message exchanges. Different message patterns or algorithms are used for the different collectives, and within the same MPI collective different patterns are used for different message sizes and node counts. Well-established models for collective performance have been defined [7] and are used as a basis for fitting analytic models to the measured collective latency data (the model forms and a fitting sketch are given at the end of this page). Additionally, the impact of network congestion on MPI performance is quantified using an MPI benchmark that has been run repeatedly on Theta on different days under different load conditions. Finally, MPI performance consistency guidelines that have been defined in the literature [3, 8, 9] were tested on Theta and the adherence of the Cray MPI implementation on Theta to these guidelines was evaluated. THETA SYSTEM DESCRIPTION The ALCF Cray XC40 system Theta is an 11.7 petaflop system that utilizes the second-generation Intel Xeon Phi Knights Landing many-core processor and the Cray Aries interconnect. A high-level system description is given in Table 1. The system consists of 4,392 compute nodes with an aggregate 281,088 cores, with the nodes housed in 24 racks. The compute nodes are connected using a 3-tier Dragonfly topology with two racks forming a Dragonfly group, resulting in a total of twelve groups. Aries Network The Cray XC series utilizes the Cray Aries interconnect, a follow-on to the previous generation Gemini interconnect. Aries utilizes a system-on-chip design that combines four network interface controllers (NIC) and a 48-port router onto a single device which connects to four XC compute nodes via a 16x PCI-Express Gen3 connection.
Each NIC is connected to two injection ports on the router and 40 ports are available for links between routers. The PCIe interface provides 8 GT/s per direction with 16 bits for a total of 16 GB/s of peak bandwidth (the arithmetic is spelled out at the end of this page). The network links may be electrical or optical, with the electrical links providing 5.25 GB/s of bandwidth per direction and the optical links providing 4.7 GB/s. A node initiates network operations by writing across the host interface to the NIC. The NIC then creates packets containing the request information and issues them to the network, with packets containing up to 64 bytes of data. Aries implements two messaging protocols: fast memory access (FMA) and block transfer engine (BTE). FMA offers minimized overhead for 8-64 byte operations, resulting in a fast path for single-word put, get, and non-fetching atomic operations, but requires the CPU to be involved in the message transfer. This provides low latency and a fast issue rate for small transfers. Writes to the FMA window produce a stream of put operations, each transferring 64 bytes of data. The BTE is used for larger messages, can result in higher achieved bandwidth, and provides for asynchronous transfers independent of the CPU; however, message latencies are higher. To utilize the BTE, a process writes a block transfer descriptor to a queue and the Aries hardware performs the operation asynchronously. Up to four concurrent block transfers are supported, allowing maximum bandwidth to be achieved for smaller concurrent transfers. Block transfers have higher latency than FMA transfers but can transfer up to 4 GB without CPU involvement. Aries supports atomic operations, including put operations such as atomic add and get operations such as conditional swaps, and maintains a 64-entry atomic operation cache to reduce host reads when multiple processes access the same variable. These network atomics are not coherent with respect to local memory operations. Additionally, the NIC contains a collective engine that provides hardware support for reduction and barrier operations. Table 1 (Theta – Cray XC40 system): Processor core: KNL (64-bit); CPUs per node: 1; # of cores per CPU: 64; Max nodes/rack: 192; Racks: 24; Nodes: 4,392; Interconnect: Cray Aries Dragonfly; Dragonfly groups: 12; # of links between groups: 12. [Figure: MPI software layers: Application, Pt2Pt, Collectives, RMA] |
| File Format | PDF, HTM/HTML |
| Alternate Webpage(s) | https://cug.org/proceedings/cug2018_proceedings/includes/files/pap131s2-file2.pdf |
| Alternate Webpage(s) | https://cug.org/proceedings/cug2018_proceedings/includes/files/pap131s2-file1.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
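The abstract above notes that point-to-point latency is measured with the OSU benchmarks. As a rough illustration of what such a measurement does, here is a minimal ping-pong sketch; it is not the OSU code, and the use of Python/mpi4py, the message sizes, and the iteration counts are assumptions made purely for illustration.

```python
# Minimal ping-pong latency sketch (illustrative only, not the OSU benchmark).
# Run with exactly two ranks, e.g.: mpiexec -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "this sketch expects exactly 2 ranks"

def one_way_latency_us(nbytes, iters=1000, warmup=100):
    """Average one-way latency in microseconds for messages of nbytes bytes."""
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for i in range(warmup + iters):
        if i == warmup:            # discard the warm-up iterations from the timing
            comm.Barrier()
            t0 = MPI.Wtime()
        if rank == 0:
            comm.Send([buf, MPI.BYTE], dest=1, tag=0)
            comm.Recv([buf, MPI.BYTE], source=1, tag=0)
        else:
            comm.Recv([buf, MPI.BYTE], source=0, tag=0)
            comm.Send([buf, MPI.BYTE], dest=0, tag=0)
    elapsed = MPI.Wtime() - t0
    return elapsed / iters / 2 * 1e6   # round trip -> one way, seconds -> microseconds

if __name__ == "__main__":
    for size in [1, 8, 64, 512, 4096, 32768, 262144]:
        lat = one_way_latency_us(size)
        if rank == 0:
            print(f"{size:8d} bytes  {lat:10.2f} us")
```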
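The Hockney ("postal") model cited in the abstract expresses point-to-point latency with a start-up term and a per-byte term, and the collective models in the literature are typically built from the same two parameters. The forms below are the standard textbook versions; the exact variants fitted in the paper may differ.

```latex
% Hockney ("postal") model for a point-to-point message of m bytes,
% with start-up latency alpha and per-byte cost beta (inverse bandwidth):
\[ T_{\mathrm{pt2pt}}(m) = \alpha + \beta m \]

% Standard collective models built from the same parameters
% (textbook forms; the paper's fitted variants may differ):
\[ T_{\mathrm{bcast}}(m, P) \approx \lceil \log_2 P \rceil \,(\alpha + \beta m) \]   % binomial-tree broadcast
\[ T_{\mathrm{allreduce}}(m, P) \approx 2 \lceil \log_2 P \rceil \,\alpha
     + 2\,\tfrac{P-1}{P}\,\beta m + \tfrac{P-1}{P}\,\gamma m \]                      % Rabenseifner-style; gamma = per-byte reduction cost
```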
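The abstract also states that analytic models are fit to the measured latency data. Once the latencies are in hand, a least-squares fit of the two Hockney parameters is straightforward; the sketch below uses synthetic data generated from assumed parameter values (not measurements from Theta) purely to show the mechanics.

```python
# Least-squares fit of the Hockney model T(m) = alpha + beta*m.
# Illustrative sketch: the data are synthetic, generated from assumed
# parameters, not Theta measurements.
import numpy as np

# Message sizes in bytes and synthetic latencies in microseconds.
sizes = np.array([1, 8, 64, 512, 4096, 32768, 262144], dtype=float)
alpha_true_us, beta_true_us_per_byte = 3.0, 0.0005        # assumed values for the demo
rng = np.random.default_rng(0)
latencies = (alpha_true_us + beta_true_us_per_byte * sizes
             + rng.normal(0.0, 0.05, size=sizes.shape))    # small measurement noise

# Fit alpha (intercept) and beta (slope) by linear least squares.
A = np.column_stack([np.ones_like(sizes), sizes])
(alpha_fit, beta_fit), *_ = np.linalg.lstsq(A, latencies, rcond=None)

print(f"alpha ~ {alpha_fit:.3f} us, beta ~ {beta_fit*1e3:.3f} ns/byte, "
      f"asymptotic bandwidth ~ {1.0/beta_fit/1e3:.2f} GB/s")
```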
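The peak PCIe bandwidth quoted in the system description (8 GT/s per direction over a 16-bit-wide path, giving 16 GB/s) follows directly from the link width; the arithmetic, ignoring the small 128b/130b encoding overhead of PCIe Gen3, is:

```latex
% Peak per-direction bandwidth of the 16x PCIe Gen3 host interface,
% ignoring the 128b/130b encoding overhead:
\[ 8\ \mathrm{GT/s} \times 16\ \mathrm{bits/transfer}
   = 128\ \mathrm{Gbit/s}
   = \tfrac{128}{8}\ \mathrm{GB/s}
   = 16\ \mathrm{GB/s} \]
```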