Project Presentation – DASH – Data-Aware Scheduling at Higher scale

Abstract

While the computing power of supercomputers keeps increasing at an exponential rate, their capacity to manage data movement is reaching its limits. This imbalance is expected to be one of the key limitations to the development of future HPC applications. We propose to rethink how I/O is managed in supercomputers. More specifically, the novelty of this project is to account for known HPC application behaviors (periodicity, limited number of concurrent applications) to define static strategies. We expect that these static strategies can then be turned into dynamic strategies that are more efficient than current ones.
In this study, we plan to include a provision for dynamicity to cope with any uncertain behavior of applications. We also plan to research how to model and include emerging technologies. Finally, we plan to explore the importance and impact of reliability and energy efficiency on I/O management strategies.

Context, position, objectives

In the race to larger supercomputers, the most commonly used metric is computational power. However, supercomputers are not simply computers with billions of processors. Data movement is one of the reasons why Sunway TaihuLight (the world's fastest supercomputer as of June 2016) reaches 93 PetaFlops on HPL (a performance benchmark based on dense linear algebra) but struggles to reach 0.37 PetaFlops on HPCG, a recent benchmark based on actual HPC applications [4].

Historically, the design of algorithms has focused on computational power (for example, by using parallelism efficiently or by minimizing computational complexity). With extreme-scale computing, the focus is changing: the abundance of computing power now faces a bottleneck caused by the ever-growing need for data [1].

DASH: Innovative scheduling to overcome bandwidth limitations

I/O movements are critical at scale. Observations on the Intrepid machine at Argonne National Laboratory show that I/O transfers can be slowed down by up to 70% due to congestion [6].
In 2013, Argonne upgraded its in-house supercomputer, moving from Intrepid (peak performance: 0.56 PFlop/s; peak I/O throughput: 88 GB/s) to Mira (peak performance: 10 PFlop/s; peak I/O throughput: 240 GB/s). While both figures seem to have improved considerably, the I/O throughput that a given application requires scales linearly (or worse) with its performance. What should be noticed, then, is a downgrade from about 160 GB/PFlop to 24 GB/PFlop!
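
The ratio quoted above can be checked with a few lines of arithmetic. The sketch below divides the peak I/O throughput by the peak performance of each machine, using only the figures given in the text; the name io_per_pflop is ours, chosen for illustration. Note that 88 / 0.56 is closer to 157 GB/PFlop, which the text rounds to 160.

```python
# Peak figures quoted in the text
# (performance in PFlop/s, I/O throughput in GB/s).
machines = {
    "Intrepid": {"peak_pflops": 0.56, "peak_io_gbs": 88.0},
    "Mira": {"peak_pflops": 10.0, "peak_io_gbs": 240.0},
}


def io_per_pflop(peak_pflops: float, peak_io_gbs: float) -> float:
    """I/O throughput available per unit of compute, in GB per PFlop."""
    return peak_io_gbs / peak_pflops


ratios = {name: io_per_pflop(**m) for name, m in machines.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} GB/PFlop")

# How many times less I/O bandwidth per PFlop Mira offers than Intrepid.
drop_factor = ratios["Intrepid"] / ratios["Mira"]
print(f"Intrepid/Mira ratio: {drop_factor:.1f}x")
```

Under these figures, the bandwidth available per unit of compute dropped by a factor of between 6 and 7 across the upgrade.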

To solve this problem, several approaches have been tried. Some recent work looks at the application level and tries to reduce the amount of data sent by transforming it (compression) [5] or preprocessing it [9]. The goal is to reduce the volume of data by processing it before it is sent to the I/O file system. The core of this proposal is orthogonal to these solutions and hence can be used jointly with them. In this proposal, we propose to look at how the I/O data is managed by the system.

The current behavior of HPC machines with respect to I/O bandwidth is “first-come, first-served” (FCFS) [10]: among all applications that are competing for I/O bandwidth, priority is given to the application that asked for it first.
This naive strategy creates congestion at the I/O node level, resulting in large slowdowns in the performance of applications. Additional architectural enhancements (burst buffers) are added to the I/O nodes to help during large I/O bursts, but they can only complement efficient I/O management, not replace it [8].
Recently, several groups have proposed different online scheduling strategies that perform better than the greedy FCFS strategy [6,11].
A general concern about these strategies is that they may not scale well because they put additional stress on I/O nodes: to be efficient, they need to keep a lot of information and recompute their decisions frequently. The key point for the next generation of I/O management is not to add another bottleneck to the system.
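
To make the FCFS behavior described above concrete, here is a minimal sketch of how a large transfer can delay a small one that arrives just after it. It rests on simplifying assumptions that are ours, not taken from the cited systems: a single shared bandwidth, transfers served one at a time at full bandwidth, and a slowdown defined as the time spent waiting and transferring divided by the transfer time on a dedicated system. The names Request and fcfs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app: str
    arrival: float  # time at which the application asks for I/O (s)
    volume: float   # amount of data to transfer (GB)


def fcfs(requests, bandwidth):
    """Serve requests one at a time, in arrival order, at full bandwidth.

    Returns {app: (finish_time, slowdown)}, where slowdown is the time
    between the request and its completion divided by the time the
    transfer would take with the bandwidth to itself.
    """
    clock = 0.0
    result = {}
    for r in sorted(requests, key=lambda r: r.arrival):
        start = max(clock, r.arrival)   # wait for earlier transfers
        dedicated = r.volume / bandwidth
        clock = start + dedicated
        result[r.app] = (clock, (clock - r.arrival) / dedicated)
    return result


# A large transfer arrives first; a small one arrives one second later.
reqs = [Request("A", 0.0, 100.0), Request("B", 1.0, 10.0)]
out = fcfs(reqs, bandwidth=10.0)
for app, (finish, slowdown) in out.items():
    print(f"{app}: finishes at {finish:.1f}s, slowdown {slowdown:.1f}x")
```

In this toy instance, application B needs only one second of I/O but waits behind A's ten-second transfer, which is the kind of congestion effect a smarter schedule tries to avoid.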

We propose here a completely novel paradigm to deal with this problem. The originality of this project is to use known HPC application behaviors. Observations show that most HPC applications periodically alternate between (i) operations (computations, local data accesses) executed on the compute nodes, and (ii) I/O transfers of data, and that this behavior can be predicted beforehand [2,3].
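
The alternating behavior described above can be captured by a very small model. The sketch below is an illustration under assumptions of our own: each iteration has a constant compute time and a constant I/O volume, and the application runs alone with the full bandwidth. The class PeriodicApp and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodicApp:
    name: str
    compute_time: float  # seconds of computation per iteration (assumed constant)
    io_volume: float     # GB transferred at the end of each iteration (assumed constant)

    def iteration_time(self, bandwidth: float) -> float:
        """Duration of one compute + I/O iteration on a dedicated system."""
        return self.compute_time + self.io_volume / bandwidth

    def io_windows(self, bandwidth: float, iterations: int):
        """(start, end) of each I/O phase when the application runs alone."""
        period = self.iteration_time(bandwidth)
        return [
            (k * period + self.compute_time, (k + 1) * period)
            for k in range(iterations)
        ]


app = PeriodicApp("sim", compute_time=8.0, io_volume=20.0)
windows = app.io_windows(bandwidth=10.0, iterations=3)
print(windows)
```

Because the I/O phases recur at a fixed interval, their start times are known in advance, which is the property a static schedule can exploit.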

Building on this structural property, along with facts specific to HPC applications (there are in general very few applications running concurrently on a machine, and the applications run for many iterations with similar behavior), the goal is to design new algorithms for I/O scheduling. The novelty of this class of algorithms for I/O management is that we intend them to be computed statically.
To deal with the complexity of designing static algorithms, we will give them a periodic behavior: the schedule of the I/O volumes of the different applications is repeated over time. This is critical because the number of instances of all applications is often very high, and the overall complexity of a non-periodic static schedule would make those algorithms unusable. We envision the implementation of this periodic scheduler to take place at two levels:

The job scheduler would know the application profiles. Using these profiles, it would be in charge of computing a periodic schedule every time an application enters or leaves the system.
Application-side I/O management strategies would then be responsible for ensuring the correct transfer of I/O at the right time, by limiting the bandwidth used by nodes that transfer I/O.
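
The two levels above can be sketched as follows. This is a deliberately naive illustration, not the algorithm the project intends to design: it assumes that each application performs exactly one iteration per period, that I/O windows are exclusive and use the full bandwidth, and that profiles are exact. The functions build_periodic_schedule (scheduler side) and may_transfer (application side) are hypothetical names.

```python
def build_periodic_schedule(profiles, bandwidth):
    """Scheduler side: compute one exclusive I/O window per app per period.

    profiles: {app: (compute_time_s, io_volume_gb)}
    Returns (period, {app: (start, end)}), with window times relative
    to the start of the period.
    """
    io_time = {a: vol / bandwidth for a, (_, vol) in profiles.items()}
    # The period must hold every I/O window, and leave each application
    # enough time to compute between two of its own windows.
    period = max(
        sum(io_time.values()),
        max(w + io_time[a] for a, (w, _) in profiles.items()),
    )
    windows, t = {}, 0.0
    for a in sorted(profiles):
        windows[a] = (t, t + io_time[a])
        t += io_time[a]
    return period, windows


def may_transfer(app, now, period, windows):
    """Application side: is `app` allowed to use the I/O system at time `now`?"""
    start, end = windows[app]
    return start <= now % period < end


profiles = {"A": (8.0, 20.0), "B": (3.0, 10.0)}
period, windows = build_periodic_schedule(profiles, bandwidth=10.0)
print(period, windows)
print(may_transfer("B", 12.5, period, windows))
```

The schedule is recomputed only when the set of applications changes, so the per-transfer decision on the application side reduces to a constant-time check against a fixed pattern. The price of this naive pattern is that an application with a short iteration, such as B here, is slowed down to the common period.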

To be able to manage I/O at scale, particular care will be taken to address the following challenges:

Robustness and dynamicity to cope with uncertainties: while most applications are very structured, there might be some variability in the computational cost and in the I/O volumes transferred. The solutions designed will need to be robust to this variability. One solution might be to combine dynamic adjustments with the proposed offline solutions.
Reliability: today’s data management is considered to be reliable [1]. However, with more stress put on the I/O system, failures and data corruption are expected to become the norm. In the long run, the proposed solution will need to take fault tolerance into account.
Energy efficiency: finally, the energy cost of data movement is expected to be one of the key energy consumers in future systems [7]. Energy efficiency will have to be incorporated into the scheduler we design.

Bibliography

[1] Advanced Scientific Computing Advisory Committee (ASCAC). Ten technical approaches to address the challenges of Exascale computing. http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf.
[2] Guillaume Aupy, Yves Robert, Frédéric Vivien, and Dounia Zaidouni. “Checkpointing algorithms and fault prediction.” In: Journal of Parallel and Distributed Computing 74.2 (2014), pp. 2048–2064.
[3] Philip Carns, Robert Latham, Robert Ross, Kamil Iskra, Samuel Lang, and Katherine Riley. “24/7 characterization of petascale I/O workloads.” In: Proceedings of CLUSTER09. IEEE. 2009, pp. 1–10.
[4] Jack Dongarra and Michael A Heroux. “Toward a new metric for ranking high performance computing systems.” In: Sandia Report, SAND2013-4744 312 (2013).
[5] Matthieu Dorier, Gabriel Antoniu, Franck Cappello, Marc Snir, and Leigh Orf. “Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O.” In: Cluster Computing (CLUSTER), 2012 IEEE International Conference on. IEEE. 2012, pp. 155–163.
[6] Ana Gainaru, Guillaume Aupy, Anne Benoit, Franck Cappello, Yves Robert, and Marc Snir. “Scheduling the I/O of HPC applications under congestion.” In: Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. IEEE. 2015, pp. 1013–1022.
[7] Peter Kogge and John Shalf. “Exascale Computing Trends: Adjusting to the “New Normal” in Computer Architecture.” In: IEEE, 2013.
[8] N. Liu et al. “On the Role of Burst Buffers in Leadership-Class Storage Systems.” In: MSST/SNAPI. 2012.
[9] Christopher Sewell et al. “Large-scale compute-intensive analysis via a combined in-situ and co-scheduling workflow approach.” In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM. 2015, p. 50.
[10] Xuechen Zhang, Kei Davis, and Song Jiang. “IOrchestrator: improving the performance of multi-node I/O systems via inter-server coordination.” In: Proceedings of SC10. 2010.
[11] Zhou Zhou, Xu Yang, Dongfang Zhao, Paul Rich, Wei Tang, Jia Wang, and Zhiling Lan. “I/O-aware batch scheduling for petascale computing systems.” In: 2015 IEEE International Conference on Cluster Computing. IEEE. 2015, pp. 254–263.