Workplan 2017

Distributed Infrastructure Support for Workflow and Data Management

Design of a Cloud Approach for Dataset Integration (Task 1): Next-generation scientific discoveries
increasingly happen at the boundaries of datasets, e.g., across multiple science disciplines, institutions,
and spatial and temporal scales. Today, data integration processes and methods are largely ad hoc or manual.
What is needed is a generalized resource infrastructure that integrates knowledge of the data and of the
processing tasks being performed by the user in the context of the data and resource lifecycle.
Clouds provide an important infrastructure platform that can be leveraged by embedding such knowledge
for distributed data integration, and this will be the focus of this research area. In 2017, we will work
on the cloud system design, leveraging the work done at LBNL on the design of the E-HPC system. We
will also work on real-time data processing platforms that target elasticity in clouds, i.e., dynamically
adjusting resource allocation to the current needs of the workload, as sketched below.
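To illustrate the kind of elasticity policy we have in mind for the real-time processing platform, the
sketch below adjusts a pool of workers to the backlog of pending data items. The Cluster interface
(num_workers, add_workers, remove_workers), the per-worker throughput assumption, and the bounds are
hypothetical placeholders for this sketch, not an existing API or a fixed design.

    # Hypothetical elasticity controller: scale a worker pool to the backlog
    # of pending data items. The Cluster interface and the parameter values
    # are placeholders, not part of any existing system.
    def elastic_scaling_step(cluster, pending_items, items_per_worker=100,
                             min_workers=1, max_workers=64):
        """Adjust the number of workers to the current backlog."""
        current = cluster.num_workers()
        # Target enough workers to drain the backlog at the assumed
        # per-worker rate (ceiling division), within the allowed bounds.
        target = -(-pending_items // items_per_worker)
        target = max(min_workers, min(max_workers, target))
        if target > current:
            cluster.add_workers(target - current)
        elif target < current:
            cluster.remove_workers(current - target)
        return target

Such a step would be invoked periodically by the platform's resource manager, so the allocation follows
the observed load rather than a static reservation.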

Data analysis for anomaly detection during workflow execution (Task 2): We plan to expand
the work done in 2016 on detecting anomalies during the execution of scientific workflows. Using several
scientific computing workflows as exemplars, we will study where integrity failures could occur at any
point during the workflow, including at the scientific instruments, the networks, and the HPC systems,
and where provenance data could tell us more about the existence or cause of a failure. Where
provenance data can be captured, we will develop systems to capture and analyze that data as
a proof of concept. Where provenance data cannot currently be captured, we will recommend ways in
which hardware or software designs could be altered in future implementations to capture such data. If
time allows, we will prove the necessity and/or sufficiency of such data to demonstrate the completeness
of the approach. Amir Teshome Wonjiga's internship will be related to this task.

Dynamic workflow execution on HPC platforms (Task 3): Today, GinFlow is not designed to be
deployed on an HPC environment. In particular, there are no proper scheduling strategies to optimise the
mapping of tasks onto compute nodes. One direction we will explore is to make GinFlow HPC-ready.
Specifically, we plan to devise strategies to dynamically assign CPU power to tasks as the execution
moves forward in the graph of tasks. This would, on the one hand, make GinFlow more HPC-ready,
and in particular deployable on the NERSC computing platform, and on