DALHIS – Data Analysis on Large-scale Heterogeneous Infrastructures for Science

The DALHIS associate team is a collaboration between the Myriads Inria project-team (Rennes, France), the Avalon Inria project-team (Lyon, France), and the LBNL Data Science and Technology (DST) department (Berkeley, USA).

Data produced by scientific instruments (large facilities like telescopes or field data), large-scale experiments, and high-fidelity simulations are increasing in magnitude and complexity. Existing data analysis methods, tools, and infrastructure are often difficult to use and cannot provide the complete data management, collaboration, and curation environment that these complex, dynamic, and large-scale analyses require. It is important to treat data as a first-class, discoverable, dynamic resource in the context of collaborative analytics. Enabling an integrated scientific data analysis ecosystem is key to accelerating the pace of scientific insight.

Three scientific fields that are struggling with collaborative data challenges in projects involving the United States (LBL) and France (INRA, IN2P3, CNRS, and ESRF) are carbon cycle research (Fluxnet and ICOS), cosmology (SNFactory and LSST), and light source science (ALS and ESRF).

Building a scientific data analysis environment that addresses today’s experiment and observation needs requires an effective and efficient data management and computational environment. Two key challenges in this pursuit are a) Energy efficiency: the ability to manage resources efficiently in order to limit power consumption, and b) Data integration: bringing data from distributed sources into an integrated data analysis environment. Partnerships between computer scientists and domain scientists enable scientific projects to take advantage of the best available computing tools and practices when building a data ecosystem. In the past, these partnerships have focused on algorithms. However, two important challenges have received too little attention: a) Coordination: orchestration of processing pipelines and data movement, particularly across distributed resources, and b) Interfaces: user and programmatic interfaces that are intuitive for scientists. Applying techniques from Human-Computer Interaction can bring significant advances in this space [33]. In the DALHIS renewal, we propose to work on projects specifically targeting these four challenges.


DALHIS 2016-2018

The goal of the second phase of the collaboration is to create a collaborative distributed software ecosystem to manage the data lifecycle and enable data analytics on distributed data sets and resources. Specifically, our goal is to build a dynamic software stack that is user-friendly, scalable, energy-efficient, and fault-tolerant. We plan to approach the …

DALHIS 2013-2015

The objective of the first three years of the collaboration was to create a software ecosystem to facilitate seamless data analysis across desktops, HPC, and cloud environments. Specifically, our goal was to build a dynamic software stack that is user-friendly, scalable, energy-efficient, and fault-tolerant.

Research areas

A programming environment for scientific data analysis workflows.

An integrated capability that …