DALHIS – Data Analysis on Large-scale Heterogeneous Infrastructures for Science
The DALHIS associate team is a collaboration between the Myriads Inria project-team (Rennes, France), the Avalon Inria project-team (Lyon, France), and the LBNL Data Science and Technology (DST) department (Berkeley, USA).
Data produced by scientific instruments (large facilities like telescopes or field data), large-scale experiments, and high-fidelity simulations are increasing in magnitude and complexity. Existing data analysis methods, tools, and infrastructure are often difficult to use and unable to provide the complete data management, collaboration, and curation environment needed to manage these complex, dynamic, and large-scale data analysis environments. It is important to treat data as a first-class, discoverable, dynamic resource in the context of collaborative analytics. Enabling an integrated scientific data analysis ecosystem is key to accelerating the pace of scientific insight.
Three scientific fields that are struggling with collaborative data challenges in projects involving the United States (LBL) and France (INRA, IN2P3, CNRS, and ESRF) are Carbon cycle understanding research (Fluxnet and ICOS), Cosmology (SNFactory and LSST), and Light Source Science (ALS and ESRF).
Building a scientific data analysis environment to address today’s experiment and observation needs requires the development of an effective and efficient data management and computational environment. Two key challenges in this pursuit are a) Energy efficiency: the ability to manage resources efficiently to limit power consumption and b) Data integration: bringing data from distributed sources into a single data analysis environment. Partnership between computer scientists and domain scientists enables scientific projects to take advantage of the best available computing tools and practices in building a data ecosystem. In the past, these partnerships have focused on algorithms. However, two important challenges that have received too little attention are a) Coordination: orchestration of processing pipelines and data movement, particularly across distributed resources and b) Interfaces: user and programmatic interfaces that are intuitive for scientists. Through the application of techniques from Human-Computer Interaction, significant advancements in this space can be made. In the DALHIS renewal, we propose to work on projects specifically targeting these four challenges.
Objectives (2016 – 2018)
The goal of the Inria-LBL collaboration is to create a collaborative distributed software ecosystem to manage the data lifecycle and enable data analytics on distributed data sets and resources. Specifically, our goal is to build a dynamic software stack that is user-friendly, scalable, energy-efficient, and fault-tolerant. We plan to approach the problem from two dimensions.
- Distributed Infrastructure Support for Workflow and Data Management: Research to determine appropriate execution environments that allow users to seamlessly execute their end-to-end dynamic data analysis workflows across various resource environments and scales while meeting energy-efficiency, performance, and fault-tolerance goals.
- Deep partnerships with Scientific Collaborations: We will engage in deep partnerships with scientific teams and use a mix of user research with system software R&D to address specific challenges that these communities face. Our experience will in turn inform future research directions.