Partnership for Advanced Computing in Europe
Urgent Computing service in the PRACE Research Infrastructure
| Content Provider | Semantic Scholar |
|---|---|
| Author | Miroslaw Kupczyk, Damian Kaliszan, Huub Stoffers, Niall Wilson, Felip Moll |
| Copyright Year | 2017 |
| Abstract | The document presents the background and assumptions of the pilot deployment of the Urgent Computing environment in the PRACE RI. It presents several possible scenarios of integrating the functionality, bearing in mind that the PRACE RI is a distributed environment with distinct policies, requirements and local limitations. The final recommendations and guidelines will be presented in the project deliverable. |

**1. Design prerequisites**

The goal is to demonstrate a pilot installation of an environment ready to use for Urgent Computing (UC). To implement specific solutions, several possible scenarios were proposed for integrating the prospective functionality into the PRACE RI, under the general agreement of the PRACE Technical Board and Board of Directors. The computing power of the PRACE RI is distributed, which imposes non-uniform behaviour in terms of managing a new service, although that service is expected to be a single coherent user environment designed to run the applications in question.

First, the authors focused on balancing the pros and cons of the available ideas [7], [8], [9]. In parallel, the authors examined technical and local-policy aspects at selected test sites (PRACE Exec Sites): PSNC, ICHEC, SURFsara and BSC. The technical outcome of the document confirms the feasibility of running an urgent computing application (UC Application) on each considered PRACE-RI site. The conclusions include a recommended policy involving three main steps:

1. Appropriate software is installed on a dedicated EXEC Host.
2. The user of the UC Application can operate in normal PRACE Project mode (e.g., a DECI project), but:
3. The UC Application should start as quickly as agreed in the policy agreement. Here, the means of entering the extraordinary state of the scheduling, based on configuration provided in advance, must be agreed (see the sketch after this list).
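The third step above depends on entering an "extraordinary state of the scheduling" on the basis of configuration agreed and provided in advance. The paper does not name a particular batch system or mechanism, so the following is only a minimal sketch, assuming a Slurm-based EXEC Host on which the site operators have pre-created a high-priority QoS for UC work; the QoS name, the account and the job-script path are illustrative assumptions, not details from the paper.

```python
"""Minimal sketch: submitting a UC Application under a pre-agreed
high-priority QoS on a Slurm-based EXEC Host.

Assumptions (not from the paper): the site runs Slurm, the operators
have pre-created a QoS that realises the "extraordinary scheduling
state", and the UC job script is pre-installed at an agreed path.
All names below are illustrative."""
import subprocess

UC_JOB_SCRIPT = "/opt/uc/run_uc_application.sh"  # hypothetical pre-installed job script
UC_QOS = "urgent"                                # hypothetical pre-configured QoS
UC_ACCOUNT = "uc_pilot"                          # hypothetical PRACE project account


def submit_uc_job(walltime: str = "02:00:00") -> str:
    """Submit the UC job with the pre-agreed priority settings and
    return the Slurm job ID."""
    cmd = [
        "sbatch",
        "--parsable",            # print only the job ID (and cluster name)
        f"--qos={UC_QOS}",
        f"--account={UC_ACCOUNT}",
        f"--time={walltime}",
        "--job-name=uc_application",
        UC_JOB_SCRIPT,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip().split(";")[0]


if __name__ == "__main__":
    print("UC job submitted with ID", submit_uc_job())
```

Whether the extraordinary state maps to a dedicated QoS, a standing reservation or operator-driven preemption is precisely the kind of site-specific detail that the policy agreement has to settle.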
Each of the aforementioned points is split into smaller entities and treated separately in the following chapters. The conclusions are the following:

1. Executing a UC Application does not violate any PRACE-wide policy.
2. Exploitation of a UC Application will require the availability of a licensing agreement with the owner of such an application, as well as of the input data, either per EXEC Host or PRACE-wide.
3. It might happen that the memory requirements of a UC Application exceed the specification of the computational nodes; some EXEC Hosts do not have virtual memory configured at the OS level at all (e.g., PSNC).

**2. Important differences between regular jobs and UC jobs**

As PRACE is an HPC infrastructure designed for research, the compute-only part of most of its supercomputers is not fault-tolerant against several sorts of hardware failure, and certainly not against power failures [b]. Since redundancy and fault tolerance are not free, and hardware and power failures occur quite infrequently, it is more cost-efficient to spend money on total capacity and simply reschedule and rerun the jobs that have suffered from an occasional failure than to spend more money on resilience and redundancy. On PRACE systems it is also customary that maintenance intervals are planned and announced in advance, without giving users a back-up production site during maintenance [c]. If the announcements are made well in advance, this is no problem for regular research jobs.

For UC jobs the matter is quite different. It is a defining hallmark of the UC context that the demand is only foreseeable for a short period prior to execution. Means to synchronize, by delaying or speeding up the course of certain events, may be lacking altogether. The urgency usually implies that there is no second chance. Starting a job after the deadline has passed, or having to rerun a failed job, will make the result, although correct, useless simply because it is too late.

**3. Organizing coarse-grained redundancy of systems in stand-by for UC**

It is not realistic to expect that the design and maintenance practices of the PRACE HPC infrastructure will be changed significantly to accommodate the specific requirements of urgent computing. The fault tolerance needed to mitigate the risk that an urgent computing task fails to meet its deadline is thus best served by "coarse-grained" redundancy: operational platforms for a selected UC case should be implemented at more than one site [d]. The probability that all sites involved must go into unplanned maintenance at the same time is small. The probability that they all have planned maintenance at the same time is reduced too, and could of course be ruled out entirely if the sites that are peers in supporting UC cases coordinated and adjusted their planned maintenance sessions so that they do not overlap. While the organization of cross-site coarse-grained redundancy is vital, it should of course not preclude other enhancements of fault tolerance at the site level.

Implementing an operational platform supporting a UC case implies that the systems are each in a state of "warm standby" for the real event. The term "warm standby" was borrowed from the SPRUCE frequently asked questions list [2]. To be, and remain, in a state of warm standby:

● All needed application software must be pre-installed.
● One or more data sets suitable for validation must be pre-installed.
● There must be a validation protocol using the pre-installed software and pre-installed data.
● Runs executing the validation protocol must be scheduled regularly, based on the accepted SLA, to verify that the applications keep performing as expected, especially after system changes, software upgrades of general-purpose and computational libraries, etc. (see the sketch after the footnotes below).
● Validation runs can be regular batch jobs. A budget of sufficient core hours to perform these jobs regularly must be allocated.
● Since pre-installed software and data may be damaged by human error, hardware malfunctioning, or even the above-noted updating of other software components, there should also be regularly tested

Footnotes:

* Corresponding author. E-mail address: miron@man.poznan.pl
[b] Datacentres do have solutions such as diesel-powered emergency power aggregates to overcome grid outages, but not at the scale needed to support their relatively power-intensive compute-only equipment, which is often only protected against short peaks and dips.
[c] Planned complete service outages can range from several minutes or hours, e.g., in the case of system-wide software or firmware updates, to several days in the case of major hardware replacements.
[d] The principle is to have a set of systems that each may fail to offer continuous availability for UC cases, in order to increase the probability of actual availability when needed. Having more redundant systems reduces the risk of missing results, but increases the maintenance of the UC environment. The number of redundant systems must be part of the SLA.
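To illustrate the warm-standby checklist above, here is a minimal sketch of what the payload of a regularly scheduled validation run could look like, assuming the validation protocol amounts to running the pre-installed application on the pre-installed validation data set and comparing the output against a stored reference. The paths, file names and the use of a SHA-256 checksum are illustrative assumptions, not details from the paper.

```python
"""Minimal sketch of the payload of a periodic warm-standby validation
run. Paths, file names and the SHA-256 comparison are illustrative
assumptions; the real validation protocol is defined per UC case and
per site SLA."""
import hashlib
import subprocess
from pathlib import Path

APP = Path("/opt/uc/bin/uc_application")                    # hypothetical pre-installed application
VALIDATION_INPUT = Path("/opt/uc/data/validation_case.in")  # hypothetical pre-installed data set
REFERENCE_SHA256 = Path("/opt/uc/data/validation_case.sha256")
OUTPUT = Path("/tmp/uc_validation_output.out")


def run_validation() -> bool:
    """Run the pre-installed application on the pre-installed validation
    data and compare the output against the stored reference checksum."""
    subprocess.run([str(APP), str(VALIDATION_INPUT), str(OUTPUT)], check=True)
    digest = hashlib.sha256(OUTPUT.read_bytes()).hexdigest()
    expected = REFERENCE_SHA256.read_text().split()[0]
    return digest == expected


if __name__ == "__main__":
    if run_validation():
        print("warm standby OK")
    else:
        print("VALIDATION FAILED: site is not ready for UC")
```

Such a script would itself be submitted as a regular batch job on the schedule fixed in the SLA, so that a failing check immediately signals that the site has dropped out of warm standby.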
| File Format | PDF, HTM/HTML |
| Alternate Webpage(s) | http://www.prace-ri.eu/IMG/pdf/WP243.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |