X10 at Petascale
| Field | Value |
|---|---|
| Content Provider | Semantic Scholar |
| Author | Herta, Benjamin; Cunningham, Dolora; Grove, David; Kambadur, Prabhanjan; Saraswat, Vijay A.; Shinnar, Avraham; Takeuchi, Mikio; Vaziri, Mandana |
| Copyright Year | 2013 |
| Abstract | X10 is a high-performance, high-productivity programming language aimed at large-scale distributed and shared-memory parallel applications. It is based on the Asynchronous Partitioned Global Address Space (APGAS) programming model, supporting the same fine-grained concurrency mechanisms within and across nodes. We demonstrate that X10 delivers solid performance at petascale by running (weak scaling) eight application kernels on an IBM Power 775 supercomputer utilizing up to 55680 Power7 cores (1.7 Pflop/s). We sketch advances in distributed termination detection, distributed load balancing, and use of high-performance interconnects that enable X10 to scale out to thousands of nodes. 1. OVERVIEW X10 is a high-performance, high-productivity programming language developed at IBM. It is a class-based, strongly typed, garbage-collected, object-oriented language [8, 7]. To support concurrency and distribution, X10 uses the Asynchronous Partitioned Global Address Space programming model (APGAS [6]). This model introduces two key concepts, places and asynchronous tasks, and a few mechanisms for coordination. With these, APGAS can express both regular and irregular parallelism, message-passing-style and active-message-style computations, and fork-join and bulk-synchronous parallelism. In contrast to hybrid models like MPI+OpenMP, the same constructs underpin both intra- and inter-place concurrency (a minimal X10 sketch of these constructs follows the metadata table below). We present experimental results for eight kernels. We implement the four HPC Class 2 Challenge benchmarks: HPL, FFT, RandomAccess, and Stream Triad [3], as well as Smith-Waterman [10], Betweenness Centrality [1], K-Means [4], and Unbalanced Tree Search (UTS) [5]. We run them on a large Power 775 system with a theoretical peak performance of 1.7 Pflop/s. (The X10 tool chain and the benchmark codes are publicly available at http://x10-lang.org.) For the HPC Challenge benchmarks, X10 today achieves 41% to 87% of the system's potential at scale, as reported by IBM's optimized-runs entry to the HPC Class 1 Challenge in Nov. 2012 [2]. To the best of our knowledge, our UTS implementation is the first to scale linearly to petaflop systems for geometric trees. Our K-Means and Smith-Waterman codes also scale linearly. Although still statically load-balanced, our Betweenness Centrality code can process 245 billion edges per second using 47040 cores. These results have been made possible by our solutions to the distributed termination detection problem at scale, the effective use of high-performance interconnects, and a refined distributed load balancing scheme for UTS. 2. OPTIMIZING FOR PETASCALE The X10 programming model relies heavily on termination detection. Detection scopes can be nested, distributed, and collocated. The reference implementation has to handle arbitrary distributed task graphs and cope with arbitrary network latencies; as a result, it does not scale. We identify a series of common patterns of distributed termination detection and provide specialized scalable implementations for these, relying on a combination of static analysis, dynamic analysis, and user input (pragmas) to guide selection. Supercomputers such as the Power 775 are built upon high-performance interconnects and provide network acceleration mechanisms such as collectives and RDMAs. We augment X10's communication APIs to expose these primitives when available and emulate them when not (e.g., over Ethernet). The UTS benchmark measures the rate of traversal of a tree generated on the fly using a splittable random number generator. The tree is highly irregular and unpredictable, so dynamic distributed load balancing is indispensable. We start from the state-of-the-art lifeline-based load balancer described in [9], which scales well only to hundreds of Power 775 nodes, and make it scale to thousands of nodes. The reference algorithm cleverly combines random work-stealing with organized work-sharing to distribute work dynamically and efficiently. We improve the work queue implementation by compactly representing intervals of sibling nodes (see the second sketch after the table). We reduce termination detection overheads by separating work-stealing and work-sharing termination scopes. We optimize routes by taming the random victim selection and exploiting indirect routes in termination detection. |
| File Format | PDF; HTM/HTML |
| Alternate Webpage(s) | http://sc13.supercomputing.org/sites/default/files/PostersArchive/tech_posters/post158s2-file3.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
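
The APGAS constructs and the pragma-guided termination detection mentioned in the abstract can be illustrated with a short, self-contained X10 sketch. It is not taken from the paper or its benchmark codes: the class name is invented, the code follows X10 2.x syntax, and the `@Pragma(Pragma.FINISH_SPMD)` annotation reflects the finish pragmas shipped with the X10 2.x compiler, whose exact constant names may differ between releases.

```x10
import x10.compiler.Pragma;

// Minimal APGAS sketch (illustrative, not the paper's code): one
// asynchronous task is spawned at every place, and the enclosing
// finish blocks until all of them, including remote ones, complete.
// The pragma is an assumption based on the X10 2.x compiler: it
// asserts that this finish follows a flat, one-task-per-place pattern,
// so the runtime can substitute a specialized, cheaper
// termination-detection protocol for the general-purpose one.
public class ApgasSketch {
    public static def main(args:Rail[String]) {
        @Pragma(Pragma.FINISH_SPMD)
        finish for (p in Place.places()) {
            at (p) async {
                Console.OUT.println("hello from place " + here.id);
            }
        }
    }
}
```

Dropping the pragma leaves the program's meaning unchanged; it is an optimization hint that only affects which termination-detection implementation the runtime selects, provided the asserted pattern actually holds.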
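The remark about "compactly representing intervals of sibling nodes" in the UTS work queue can be sketched as follows. This is a hypothetical reconstruction, not the authors' implementation: the class name, fields, fixed capacity, and the steal policy (take half of every interval) are illustrative assumptions, and the code assumes X10 2.4-style syntax (Long-indexed Rails, Int literals with an `n` suffix). The point is that a deque entry describes a parent node together with a range of child indices, so a steal transfers a few (parent, lower, upper) triples rather than an explicit list of tree nodes.

```x10
// Hypothetical interval-based work queue for a UTS-style traversal
// (illustrative names and policy; not the paper's code). Each slot
// records a parent descriptor plus the half-open range [lower, upper)
// of its children that still have to be expanded.
class IntervalDeque {
    private static val CAPACITY = 4096;            // sketch: no growth or overflow handling
    private val parent = new Rail[Long](CAPACITY); // stand-in for the node's hash state
    private val lower  = new Rail[Int](CAPACITY);  // first child index still to expand
    private val upper  = new Rail[Int](CAPACITY);  // one past the last child index
    private var top:Long = 0;

    // record that children [lo, hi) of node p are still pending
    public def push(p:Long, lo:Int, hi:Int):void {
        if (lo >= hi) return;
        parent(top) = p; lower(top) = lo; upper(top) = hi;
        top++;
    }

    // give a thief roughly half of every pending interval; the victim
    // keeps [lo, mid) and the thief receives [mid, hi) for each entry
    public def steal():IntervalDeque {
        val loot = new IntervalDeque();
        for (i in 0..(top-1)) {
            val mid = (lower(i) + upper(i)) / 2n;
            if (mid > lower(i)) {
                loot.push(parent(i), mid, upper(i));
                upper(i) = mid;
            }
        }
        return loot;
    }
}
```

In UTS, children are generated deterministically from the parent's hash state and a child index, which is what makes an index interval sufficient to reconstruct stolen subtrees on the thief's side; the Long descriptor above stands in for that state for brevity.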