Keynote Talks

HPC, data analytics and AI: enabling their convergence through workflow methodologies

by Rosa Badia (Barcelona Supercomputing Center, Spain)
Thursday, November 3rd — 11h00 – 12h00
Chair: Lionel Eyraud-Dubois (Inria Bordeaux, France)

Abstract: High-Performance Computing (HPC) systems are increasingly large and complex. At the same time, the user community is providing ever more complex application workflows to leverage them. On the application side, current trends aim to combine data analytics and artificial intelligence with HPC modeling and simulation. However, the programming models and tools differ across these fields, and there is a need for methodologies that enable the development of workflows combining HPC software, data analytics, and artificial intelligence. The eFlows4HPC project aims to provide a workflow software stack that fulfills this need.
The project is also developing the HPC Workflows as a Service (HPCWaaS) methodology, which provides tools to simplify the development, deployment, execution, and reuse of workflows. The project showcases its advances with three application Pillars of industrial and social relevance: manufacturing, climate, and urgent computing for natural hazards. The talk will present the current progress and findings of the project.

Bio: Rosa M. Badia holds a PhD in Computer Science (1994) from the Technical University of Catalonia (UPC). She is the manager of the Workflows and Distributed Computing research group at the Barcelona Supercomputing Center (BSC). She has made significant contributions to parallel programming models for multicore and distributed computing, in particular through her work on task-based programming models over the last 15 years. The research group focuses on PyCOMPSs/COMPSs, a parallel task-based programming model for distributed computing, and its application to the development of large heterogeneous workflows that combine HPC, Big Data, and Machine Learning. Dr Badia has published nearly 200 papers in international conferences and journals on the topics of her research. She has been active in projects funded by the European Commission and in contracts with industry. She is a member of the HiPEAC Network of Excellence. She received the Euro-Par Achievement Award 2019 for her contributions to parallel processing, the DonaTIC award in the Academia/Researcher category in 2019, and the HPDC Achievement Award 2021 for her innovations in parallel task-based programming models, workflow applications and systems, and leadership in the high-performance computing research community. Rosa Badia is the PI of eFlows4HPC.

Handling Failures on High Performance Computing Platforms: Checkpointing and Scheduling Techniques

by Anne Benoît (École Normale Supérieure de Lyon, France)
Wednesday, November 2nd — 13h30 – 14h30
Chair: Olivier Beaumont (Inria Bordeaux, France)

Abstract: In this talk, we will discuss how to address failures on high-performance computing (HPC) platforms through checkpointing and/or scheduling coupled with re-execution techniques. While these techniques are now well established, the frequency of checkpointing and/or the amount of replication still needs to be optimized carefully. First, we will review the classical technique of periodic checkpointing à la Young/Daly. While the optimal checkpointing interval can be determined with a formula, we will assess the usefulness and limitations of this formula, in particular with non-memoryless failure distributions and workflows with dependencies between tasks. Finally, we will discuss the challenges raised by scheduling parallel tasks that may be subject to silent errors, and re-executed until successful completion.
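
A note on the formula mentioned above: the classical Young/Daly first-order approximation for the optimal checkpointing period, with C the time to take a checkpoint and \mu the platform MTBF (the notation here is ours, not necessarily the speaker’s), is

    T_{\mathrm{opt}} \approx \sqrt{2 \mu C}

Intuitively, checkpointing more often than this wastes time writing checkpoints, while checkpointing less often wastes more re-executed work per failure; the talk assesses when this approximation remains useful, for instance under non-memoryless failure distributions.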

Bio: Anne Benoit is an Associate Professor in the LIP laboratory and the Chair of the Computer Science Department at ENS Lyon, France. She is also the Chair of the IEEE CS Technical Community on Parallel Processing (TCPP). She is Associate Editor (in Chief) of JPDC and ParCo, and she has been Associate Editor of IEEE TPDS, JPDC, and SUSCOM. She was the general co-chair of IPDPS’22, and she has chaired the program committee of several major conferences in her field, in particular SC, IPDPS, ICPP and HiPC. Her research interests include algorithm design and scheduling techniques for parallel and distributed platforms, with a focus on energy awareness and resilience. See bit.ly/abenoit for further information.

Cloud Management

by Walfredo Cirne (Google)
Friday, November 4th — 11h00 – 12h00
Chair: Cristiana Bentes (UERJ, Brazil)

Abstract: This talk defines the Private Cloud Management program and presents Flex, Google’s solution for it. It covers how Flex makes Google’s private cloud easier to operate, as well as the key techniques used to make it more efficient. It also discusses how we preserve governance by charging internal users for the resources they prompted Google to buy, in spite of aggressive resource sharing and on-demand allocation.

Bio: Walfredo Cirne has worked on many aspects of parallel scheduling and cluster management for the past 25 years. He is currently with the Technical Infrastructure Group at Google, where he leads Flex, Google’s solution for resource management of its internal cloud. Previously, he was a faculty member at the Universidade Federal de Campina Grande, where he led the OurGrid project. Walfredo holds a PhD in Computer Science from the University of California San Diego, and Bachelor’s and Master’s degrees from the Universidade Federal de Campina Grande.

Linking Scientific Instruments and Computation: Patterns, Technologies, Experiences

by Ian Foster (Argonne National Laboratory, Illinois and Inria Grenoble, France)
Friday, November 4th — 11h00 – 12h00
Chair: Alfredo Goldman (USP, Brazil)

Abstract: Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines—what we call flows—that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion, machine learning model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.

Bio: Ian Foster is Senior Scientist and Distinguished Fellow, and Director of the Data Science and Learning Division, at Argonne National Laboratory, and the Arthur Holly Compton Distinguished Service Professor of Computer Science at the University of Chicago. He has a BSc degree from the University of Canterbury, New Zealand, and a PhD from Imperial College, United Kingdom, both in computer science. His research is in distributed, parallel, and data-intensive computing technologies, and applications to scientific problems.

Systems for ML and ML for Systems: A Virtuous Cycle

by Kunle Olukotun (Stanford University and SambaNova Systems)
Thursday, November 3rd — 14h30 – 15h30
Chair: Viktor Prasanna (USC, USA)

Abstract: This talk is about the virtuous interplay between machine learning (ML) and systems. I will show examples of how systems optimized for ML computation can be used to train more accurate and capable ML models and how these ML models can be used to improve upon the ad-hoc heuristics used in system design and management. These improved systems can then be used to train better ML models. The latest trend in ML is the development of Foundation models. Foundation models are large pretrained models that have obtained state-of-the-art quality in natural language processing, vision, speech, and other areas. These models are challenging to train and serve because they are characterized by billions of parameters, irregular data access (sparsity) and irregular control flow. I will explain how Reconfigurable Dataflow Accelerators (RDAs) can be designed to accelerate foundation models with these characteristics. SambaNova Systems is using RDA technology to achieve record-setting performance on foundation models. I will describe how the RDAs can also be used to build Taurus, an intelligent network data plane that enables ML models to be used to manage computer networks at full line-rate bandwidths. In particular, a Taurus prototype detects two orders of magnitude more events in a security application than a state-of-the-art system based on conventional network technology.

Bio: Kunle Olukotun is the Cadence Design Professor of Electrical Engineering and Computer Science at Stanford University. Olukotun is a pioneer in multicore processor design and the leader of the Stanford Hydra chip multiprocessor (CMP) research project. He founded Afara Websystems to develop high-throughput, low-power multicore processors for server systems. The Afara multi-core processor, called Niagara, was acquired by Sun Microsystems and now powers Oracle’s SPARC-based servers. In 2017, Olukotun co-founded SambaNova Systems, a Machine Learning and Artificial Intelligence company, and continues to lead as their Chief Technologist.
Olukotun is the Director of the Pervasive Parallel Lab and a member of the Data Analytics for What’s Next (DAWN) Lab, developing infrastructure for usable machine learning. He is a member of the National Academy of Engineering, an ACM Fellow, and an IEEE Fellow for contributions to the design of multiprocessors on a chip and the commercialization of this technology. He also received the Harry H. Goode Memorial Award.
Olukotun received his Ph.D. in Computer Engineering from The University of Michigan.
