Results (2016-2017)

The work of the associate team in 2016-2018 centered on two areas: distributed infrastructure support for workflow and data management, and deep partnerships with scientific collaborations.

Distributed Infrastructure Support for Workflow and Data Management

Energy-efficient data-intensive workflow execution:  We explored a way for energy-aware
HPC cloud users to reduce their footprint on cloud infrastructures by reducing the size of the virtual
resources they request. A user who agrees to reduce her impact on the environment can choose
to trade performance for energy savings by executing her application on fewer resources. The unused
resources are freed for other applications and thus favor better consolidation of the whole system. The
better the consolidation, the lower the electrical consumption. The proposed system offers three execution
modes: Big, Medium and Little. An algorithm selects the size of the VMs for executing each
task of a workflow depending on the selected execution mode. Medium mode uses the user’s
normal VM resources for each workflow stage, Little mode reduces the VMs by one size for the workflow,
and Big mode increases the VMs by one size for the workflow.
We evaluated the impact of the proportion of users selecting the Big, Medium or Little mode on a data
center’s energy consumption. The evaluation combined three kinds of scientific workflows,
energy consumption measurements for the execution of these workflows on a real platform, and traces
of jobs submitted to a production HPC center. Using these inputs, we simulated the data center energy
consumption for different proportions of users selecting the three available modes. A paper on this work was accepted and presented at PDP 2017.
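
The mode-based selection described above can be sketched as follows. This is only an illustration, not the algorithm from the paper: the ladder of VM sizes, the names `VM_SIZES`, `select_vm_size` and `plan_workflow`, and the clamping at the ends of the ladder are all assumptions introduced here.

```python
# Hypothetical ladder of VM sizes, smallest to largest (illustrative names).
VM_SIZES = ["small", "medium", "large", "xlarge", "2xlarge"]

# Shift applied to the user's normal VM size for each execution mode.
MODE_OFFSET = {"little": -1, "medium": 0, "big": +1}


def select_vm_size(normal_size: str, mode: str) -> str:
    """Return the VM size for one task, given the normal size and the mode."""
    index = VM_SIZES.index(normal_size) + MODE_OFFSET[mode]
    # Assumption: stay on the smallest/largest size when no further step exists.
    index = max(0, min(index, len(VM_SIZES) - 1))
    return VM_SIZES[index]


def plan_workflow(stages: dict, mode: str) -> dict:
    """Map every workflow stage to the VM size chosen for the given mode."""
    return {stage: select_vm_size(size, mode) for stage, size in stages.items()}


if __name__ == "__main__":
    stages = {"preprocess": "medium", "simulate": "xlarge", "reduce": "small"}
    for mode in ("little", "medium", "big"):
        print(mode, plan_workflow(stages, mode))
```

In this sketch the mode is applied uniformly to every stage of the workflow, which matches the "one size for the workflow" wording above.
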

Building a workflow for data analysis for anomaly detection:  We worked on building a workflow
for anomaly detection in HPC environments using statistical data. An initial analysis of traffic to and
from NERSC data nodes was conducted in order to determine whether, and to what extent, a percentage of
this traffic is normalized. We focused on the following characteristics: number of connections, connection
frequency, port range and size of transfers. We were able to identify specific patterns that define a
customizable model of normal behavior and flag hosts that deviate from this model. We plan to expand our
model to other types of traffic and include more characteristics such as number of packet retransmissions,
filename path and username.
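
As a rough illustration of flagging hosts that deviate from a model of normal behavior, the sketch below fits a per-characteristic mean and standard deviation on known-normal hosts and flags any characteristic that lies more than a threshold number of standard deviations away. The z-score rule, the feature names and the threshold are assumptions made for this example; the source does not state which statistical model was used.

```python
from statistics import mean, pstdev

# Hypothetical feature names standing in for the characteristics listed above.
FEATURES = ("connections", "connections_per_hour", "distinct_ports", "bytes_transferred")


def fit_baseline(normal_hosts):
    """Compute (mean, standard deviation) per feature from known-normal hosts."""
    baseline = {}
    for feature in FEATURES:
        values = [host[feature] for host in normal_hosts]
        baseline[feature] = (mean(values), pstdev(values))
    return baseline


def deviating_features(host, baseline, threshold=3.0):
    """Return the features on which a host deviates from the baseline."""
    flagged = []
    for feature, (mu, sigma) in baseline.items():
        if sigma == 0:
            # No variation in the baseline: any different value is a deviation.
            if host[feature] != mu:
                flagged.append(feature)
        elif abs(host[feature] - mu) / sigma > threshold:
            flagged.append(feature)
    return flagged


# Made-up observations, used only to exercise the functions.
NORMAL_HOSTS = [
    {"connections": 100, "connections_per_hour": 10, "distinct_ports": 4, "bytes_transferred": 1_000_000},
    {"connections": 110, "connections_per_hour": 12, "distinct_ports": 5, "bytes_transferred": 1_200_000},
    {"connections": 90, "connections_per_hour": 9, "distinct_ports": 4, "bytes_transferred": 900_000},
    {"connections": 105, "connections_per_hour": 11, "distinct_ports": 6, "bytes_transferred": 1_100_000},
    {"connections": 95, "connections_per_hour": 8, "distinct_ports": 5, "bytes_transferred": 800_000},
]

if __name__ == "__main__":
    baseline = fit_baseline(NORMAL_HOSTS)
    scanner = {"connections": 104, "connections_per_hour": 11,
               "distinct_ports": 60, "bytes_transferred": 1_050_000}
    print(deviating_features(scanner, baseline))
```

The threshold is the customizable part of this sketch: lowering it flags more hosts, raising it flags fewer.
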

Data integrity in HPC systems: Amir Teshome’s 3-month internship at LBNL (April-June 2017)
under the supervision of Sean Peisert focused on data integrity in High Performance Computing (HPC)
systems. Such systems are used by scientists to solve complex science and engineering problems. Data
integrity, one element of the security triad, can be defined simply as the absence of improper data alterations. Its
goal is to ensure that data is written correctly as intended and read from disk, memory or network exactly
as it was written. Assuring such consistency in HPC scientific workflows is challenging, and failing to do
so may falsify the result of an experiment. During the internship we studied where in the workflow
data integrity could be affected, what solutions currently exist, and how we can leverage them
to achieve better security and performance on next-generation HPC supercomputers. In general,
data integrity in HPC environments could be affected: at the source (e.g. experimental setup), in the
network, at processing time and finally in storage. Existing solutions include error correction code (ECC)
at memory level, checksum at the network level and different levels (and types) of replication depending
on the sensitivity of applications. Replication could be done at memory level (e.g. replicating MPI
processes) or replicating the entire HPC computation (e.g. doing replicated computation at different HPC
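
As a small illustration of the checksum idea mentioned above (data should be read back exactly as it was written), the following sketch records a SHA-256 digest when a file is written and verifies it on read. The helper names are hypothetical, and the choice of SHA-256 is an assumption for this example; the source does not say which checksums were studied.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_with_checksum(path, data: bytes) -> str:
    """Write data and return the digest of what was intended to be written."""
    with open(path, "wb") as handle:
        handle.write(data)
    return hashlib.sha256(data).hexdigest()


def verify(path, expected_digest: str) -> bool:
    """Check that the file content still matches the recorded digest."""
    return sha256_of_file(path) == expected_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "result.dat")
        digest = write_with_checksum(path, b"experiment output")
        print("intact:", verify(path, digest))
        # Simulate an improper alteration of the stored data.
        with open(path, "r+b") as handle:
            handle.write(b"E")
        print("after alteration:", verify(path, digest))
```
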

Design of a Cloud Approach for Dataset integration:  Next-generation scientific discoveries
are at the boundaries of datasets, e.g., across multiple science disciplines, institutions and spatial
and temporal scales. This task is carried out in the context of Deduce. The DST team
worked on a) identifying data change characteristics from a number of different domains, b) developing an
elastic resource framework for data integration called E-HPC that manages a dynamic resource pool that
can grow and shrink for data workflows, and c) comparing and contrasting existing approaches for real-time
analyses on HPC. A paper on E-HPC was accepted and presented at WORKS 2017, a workshop held in
conjunction with SC|17. We also developed a Deduce framework that allows a user to compare two data
versions and presents filesystem and metadata changes to the user.
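
To make the idea of comparing two data versions concrete, here is a minimal sketch that snapshots two directory trees and reports added, removed and modified files. It is not the Deduce implementation: the function names, the use of file size plus a SHA-256 digest as the per-file record, and the three change categories are assumptions made for illustration.

```python
import hashlib
import os


def snapshot(root):
    """Map each relative file path under root to (size in bytes, SHA-256 digest)."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            with open(full_path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            relative = os.path.relpath(full_path, root)
            entries[relative] = (os.path.getsize(full_path), digest)
    return entries


def compare_versions(old_root, new_root):
    """Report files added, removed or modified between two data versions."""
    old, new = snapshot(old_root), snapshot(new_root)
    common = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(path for path in common if old[path] != new[path]),
    }
```

Calling `compare_versions("v1", "v2")` on two directories (hypothetical paths) returns a dictionary with three sorted lists of relative paths.
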
Complementarily, the Myriads team worked on evaluating data processing environments deployed in
clouds. We performed a thorough comparative analysis of four data stream processing platforms (Apache
Flink, Spark Streaming, Apache Storm, and Twitter Heron), chosen for their potential to
process both streams and batches in real time. The goal of the work is to guide the choice of a resource-efficient
adaptive streaming platform for a given application. For the comparative performance analysis of
the chosen platforms, we ran experiments on 8-node clusters on the Grid’5000 experimental testbed and selected a wide variety of applications ranging from a conventional benchmark (word count application) to sensor-based IoT applications (air quality monitoring application) to statistical batch processing (flight delay analysis application).


Deep partnerships with scientific collaborations

AmeriFlux and FLUXNET

Data exploration: The carbon flux datasets from AmeriFlux (Americas) and FLUXNET
(global) consist of long-term time series data and other measurements at each tower site. There
are over 800 flux towers around the world collecting this data. The non-time series measurements include
information critical to performing analysis on the site’s data. Examples include canopy height, species
distribution, soil properties, leaf area and instrument heights. These measurements are reported as a
variable group, in which the value is reported together with supporting information such as the method
of measurement. Each variable group has a different number and type of parameters that are
reported. The current output format is a normalized file. Users have found this file difficult to use.
Our earlier work in the associate team focused on building user interfaces to specify the data. This
year we jointly worked on developing a Jupyter Notebook that would serve as a tool for users to read in
and explore the data in a personalized tutorial-type environment. We developed two notebooks and the
next step is to start user testing on the notebooks.
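
The kind of reshaping such a notebook might perform on a normalized file can be sketched as below: rows that share a group identifier are collected into one record per variable group. The column names (`SITE_ID`, `GROUP_ID`, `VARIABLE_GROUP`, `VARIABLE`, `DATAVALUE`), the variable names and the sample values are assumptions made up for this illustration and may not match the actual file layout or the notebooks that were developed.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in an assumed normalized layout: one row per reported parameter.
SAMPLE = """SITE_ID,GROUP_ID,VARIABLE_GROUP,VARIABLE,DATAVALUE
XX-Tst,1,GRP_HEIGHTC,HEIGHTC,12.5
XX-Tst,1,GRP_HEIGHTC,HEIGHTC_STATISTIC,Mean
XX-Tst,1,GRP_HEIGHTC,HEIGHTC_DATE,20160715
XX-Tst,2,GRP_LAI,LAI_TOT,3.1
XX-Tst,2,GRP_LAI,LAI_METHOD,Litterfall
"""


def read_variable_groups(handle):
    """Collect the rows of each variable group into a single dictionary."""
    groups = defaultdict(dict)
    for row in csv.DictReader(handle):
        key = (row["SITE_ID"], row["GROUP_ID"], row["VARIABLE_GROUP"])
        groups[key][row["VARIABLE"]] = row["DATAVALUE"]
    records = []
    for (site, group_id, group_name), parameters in groups.items():
        record = {"SITE_ID": site, "GROUP_ID": group_id, "VARIABLE_GROUP": group_name}
        record.update(parameters)  # each group keeps its own set of parameters
        records.append(record)
    return records


if __name__ == "__main__":
    for record in read_variable_groups(io.StringIO(SAMPLE)):
        print(record)
```

Because each variable group carries a different set of parameters, the sketch keeps one dictionary per group rather than forcing all groups into a single fixed-width table.
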

Mobile application for reliable collection of field data for FLUXNET: Building on the initial usability feedback gathered in 2015 on the application interface designs, we decided on the mobile application workflow to implement. We developed a first prototype using the PhoneGap platform, which provided two advantages: (1) some of the existing HTML, CSS and JavaScript web resources could be reused, and (2) the same code base generates mobile applications for the iOS, Android and Windows platforms simultaneously. The main functionality realized in the application prototype is that the user can download all the required site data by logging in through the application, and then view and edit the data at the tower site (even in offline mode). The next logical step would be to develop the synchronization and validation of data held locally in the application with the servers.

Astroparticle Physics

The Large Synoptic Survey Telescope will soon produce an unprecedented
catalog of celestial objects. Physicists in the US and in France will be able to exploit this large volume
of data through a Data Access Center, which is an end-to-end integrated system for data management,
analysis, and visualization. In 2017, members of Fred Suter’s team investigated the use of Jupyter
notebooks as the main interface for data exploration and analysis. A first prototype has already been
used in several training sessions for physicists. Discussions with people at Berkeley Lab will probably
help this prototype evolve into a production tool suited to user needs.