Workplan 2014

Our workplan for 2014 covers three main directions: energy efficiency, scientific workflow management, and data life-cycle management in clouds.

Energy efficiency

Energy consumption and workload models

Infrastructure as a Service (IaaS) clouds provide their customers with virtual machines on demand and through self-service, following a pay-as-you-go pricing model. Virtual machines are dispatched to the servers of the underlying computing cluster by the IaaS cloud management system. In IaaS clouds, resources are managed dynamically in order to take into account the load variation in the different virtual machines and the fluctuation in the number of virtual machines hosted in the data center.

We aim to develop the methods and infrastructure needed to understand the energy consumption behavior of large-scale systems and the resource utilization of the workloads they execute. The energy consumption and workload models will make it possible to improve data center resource management, in particular with regard to energy saving and thermal management. We target on-line analysis that enables the resource management system to react quickly and autonomously to changes in the workload, by adapting the number of active servers to the actual load and by launching virtual machine consolidation algorithms only when appropriate.
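
As a minimal sketch of adapting the number of active servers to the load, assume a hypothetical homogeneous cluster in which every server absorbs the same amount of load and some headroom is kept for spikes. The function name, its parameters and the numbers are illustrative assumptions, not part of an existing resource manager.

```python
import math


def active_servers_needed(total_load, server_capacity,
                          target_utilization=0.8, min_servers=1):
    """Return how many servers should stay powered on for a given load.

    total_load and server_capacity use the same (hypothetical) unit,
    for instance vCPUs. target_utilization leaves headroom so that
    short load spikes can be absorbed without waking up a server.
    """
    if server_capacity <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be > 0 and utilization in (0, 1]")
    usable_per_server = server_capacity * target_utilization
    return max(min_servers, math.ceil(total_load / usable_per_server))


# Made-up example: 100 vCPUs requested, 16 vCPUs per server, 80% target.
servers_on = active_servers_needed(100, 16, 0.8)
```

Servers beyond this count are candidates for being switched off once their virtual machines have been consolidated elsewhere.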

We plan to study the design and implementation of an on-line analysis infrastructure. Such an infrastructure is logically composed of three main components: data collection, model creation/validation, and prediction.
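
The three components can be illustrated with a deliberately simple sketch. It assumes, purely for illustration, that a node's power draw is linear in its CPU utilization. The samples are made up and all function names are hypothetical, and the actual models studied in this work may be quite different.

```python
from statistics import mean


def fit_linear_power_model(samples):
    """Model creation: least-squares fit of power = idle + slope * utilization."""
    xs = [u for u, _ in samples]
    ys = [p for _, p in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct utilization values")
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    idle = y_bar - slope * x_bar
    return idle, slope


def predict_power(model, utilization):
    """Prediction: estimated power (W) for a CPU utilization in [0, 1]."""
    idle, slope = model
    return idle + slope * utilization


def mean_absolute_error(model, samples):
    """Model validation: average error (W) on samples not used for fitting."""
    return mean(abs(predict_power(model, u) - p) for u, p in samples)


# Data collection is replaced here by made-up (utilization, watts) samples.
training = [(0.0, 100.0), (0.5, 150.0), (1.0, 200.0)]
held_out = [(0.25, 125.0), (0.75, 175.0)]

model = fit_linear_power_model(training)
error = mean_absolute_error(model, held_out)
```

In an on-line setting the fit would be refreshed as new samples arrive, and the validation error would indicate when the model has drifted.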

Energy-efficient cloud elasticity for data-driven applications

Distributed and parallel systems offer users tremendous computing capacity. They rely on distributed computing resources linked by networks, and they require algorithms and protocols to manage these resources transparently for users. Recently, the maturity of virtualization techniques has allowed the emergence of virtualized infrastructures (Clouds). These infrastructures provide resources to users dynamically, adapted to their needs. By benefiting from economies of scale, Clouds can efficiently manage and offer a virtually unlimited number of resources, reducing costs for users.

However, the rapid growth in demand for Cloud services leads to a worrying and uncontrolled increase in their electricity consumption. In this context, we will focus on data-driven applications, which need to process large amounts of data. These applications have elastic needs in terms of computing resources, as their workload varies over time. While reducing energy consumption and improving performance are often conflicting goals, this internship aims at studying possible trade-offs for energy-efficient data processing without performance impact. As elasticity comes at the cost of reconfigurations, these trade-offs will take into account the time and energy required by the infrastructure to dynamically adapt the resources to the application needs.
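
One simple way to reason about the cost of reconfiguration is a break-even idle time: switching a server off only pays if it stays unused long enough to amortize the energy spent shutting down and booting again. The sketch below is an illustrative assumption, not the algorithm this work will produce. The function names are hypothetical and the power and energy values are made up.

```python
def break_even_idle_time(p_idle_w, p_off_w, e_transition_j, t_transition_s):
    """Minimum idle duration (s) above which switching off saves energy.

    Staying on for T seconds costs p_idle_w * T joules. Switching off and
    back on costs e_transition_j during t_transition_s seconds, plus
    p_off_w for the rest of the idle period.
    """
    if p_idle_w <= p_off_w:
        raise ValueError("an idle server must draw more power than a switched-off one")
    t = (e_transition_j - p_off_w * t_transition_s) / (p_idle_w - p_off_w)
    # The idle period must at least cover the off/on transition itself.
    return max(t, t_transition_s)


def should_switch_off(expected_idle_s, p_idle_w, p_off_w,
                      e_transition_j, t_transition_s):
    """True if the predicted idle period exceeds the break-even time."""
    threshold = break_even_idle_time(p_idle_w, p_off_w,
                                     e_transition_j, t_transition_s)
    return expected_idle_s > threshold


# Made-up node: 100 W idle, 10 W off, 20 kJ and 150 s for an off/on cycle.
threshold_s = break_even_idle_time(100.0, 10.0, 20000.0, 150.0)
```

With these made-up values the break-even time is a little over 205 seconds, so a predicted idle gap of two minutes would not justify a shutdown.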

The validation of the proposed algorithms may rely on the French experimental platform Grid’5000 [1]. This platform comprises about 8,000 cores geographically distributed across 10 sites linked by a dedicated gigabit network. Some of these sites have wattmeters which report the power consumption of the computing nodes in real time. This validation step is essential, as it will check that the selected criteria are met: energy efficiency, performance and elasticity.
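
To compare runs on the energy criterion, wattmeter readings have to be turned into an energy figure. A minimal sketch, assuming the readings are available as (timestamp in seconds, power in watts) pairs, is given below. This input format and the function name are assumptions, not the actual Grid’5000 wattmeter interface.

```python
def energy_joules(samples):
    """Integrate (timestamp_s, power_w) samples with the trapezoidal rule."""
    ordered = sorted(samples)
    total = 0.0
    for (t0, p0), (t1, p1) in zip(ordered, ordered[1:]):
        total += (p0 + p1) / 2.0 * (t1 - t0)
    return total


# Made-up readings taken one second apart on a single node.
readings = [(0, 100.0), (1, 110.0), (2, 120.0)]
joules = energy_joules(readings)
kilowatt_hours = joules / 3.6e6  # 1 kWh = 3.6 MJ
```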

Energy efficiency analysis of resource management systems

In addition, we will investigate the energy efficiency of various resource management systems, which are a critical component of the software footprint on large-scale infrastructures. Specifically, we will evaluate Hadoop, Snooze and batch queueing software.

Scientific workflow management

We identified several steps to be undertaken towards the HOCL-based execution of TIGRES-based workflow specifications. They can be grouped into two categories:

Limitations of the HOCL runtime prototype: The current prototype suffers from several limitations, mainly because it was initially designed as a research prototype. For instance, the only way to invoke the tasks of the workflows it runs is through Web service standards. One planned task is to extend its invocation capabilities, for instance to support simpler schemes, so as to simplify the porting of applications.
Development of the interface between TIGRES and the HOCL runtime: The main issue is code generation. Starting from a TIGRES workflow specification, we need to generate the set of files containing the HOCL code to be processed by the HOCL engines. TIGRES and HOCL have limited overlap in their language and usage (HOCL is intended for specification while TIGRES is intended to be a runtime language). Thus, this requires a non-negligible engineering effort. An ADT proposal addressing these issues is planned on INRIA’s side.

Data life-cycle management in clouds

Infrastructure as a Service provides a convenient way for scientists to procure, control and pay for the resources they need. However, storage and data management in clouds raise several difficulties. First, our previous work has shown that I/O performance is highly variable and significantly lower than what scientists are used to in HPC environments. Recently, there has been work exploring the use of high-performance file systems such as Lustre in public cloud environments. Our plan for the next year includes evaluating such offerings on Amazon’s cloud environment. The results of this evaluation will inform the building blocks necessary for fault-tolerant, energy-efficient management of cloud environments. Second, the work conducted under DALHIS in Year 1 revealed that storage planning is closely tied to both compute planning and application patterns. This indicates a need to explore custom VM types. However, custom VM types make the scheduling problem in clouds significantly more challenging. In Year 2, we will explore the trade-offs of custom virtual machine types.

Failure management in large-scale distributed systems

Large-scale distributed systems gather large numbers of heterogeneous resources and make them available to numerous users. Understanding the usage of these infrastructures is of vital importance in order to size them adequately (in terms of resources, air conditioning and energy supply, for instance), as well as to anticipate usage bursts and the failures and bottlenecks they may cause. Previous analyses of Grid’5000 logs have shown that Grid’5000 usage is highly heterogeneous and is made of activity peaks and gaps [2,3]. Yet, the overall average usage of Grid’5000 is comparable to the average usage of computational Grids [2,4]. Thus, the tools developed for Grid’5000 within this project should be adaptable to other Grid environments. First, in this project, we want to develop tools to detect failures and to help failure management for the Grid’5000 platform. These tools aim at anticipating failures in order to treat certain categories of failures automatically, thus easing administration tasks. They will be based on the available monitoring data of the platform, such as logs, which will need to be analyzed to better understand how Grid’5000 is used. Second, we want to adapt the forecasting tools developed for failure detection to energy management. The idea is to be able to anticipate the power consumption peaks and gaps of the computing nodes, the storage and networking resources, and the air conditioning infrastructure.
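
As an illustration of what spotting peaks and gaps in a monitored series can look like, the sketch below flags samples that deviate strongly from a sliding window of recent values. This is a generic z-score heuristic chosen for brevity, not the forecasting method the project will develop. The function name, window size, threshold and sample values are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_peaks_and_gaps(samples, window=5, threshold=2.0):
    """Label each sample as 'peak', 'gap' or 'normal'.

    A sample is flagged when it deviates from the mean of the previous
    `window` samples by more than `threshold` standard deviations.
    Samples seen before the window is full are labeled 'normal', and a
    perfectly flat window (zero deviation) never flags anything.
    """
    history = deque(maxlen=window)
    labels = []
    for value in samples:
        label = "normal"
        if len(history) == window:
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0:
                z = (value - mu) / sigma
                if z > threshold:
                    label = "peak"
                elif z < -threshold:
                    label = "gap"
        labels.append(label)
        history.append(value)
    return labels


# Made-up power readings (W) for one node.
power_w = [100, 102, 98, 101, 99, 180, 100, 20]
labels = detect_peaks_and_gaps(power_w)
```

Because a flagged value enters the window afterwards, it inflates the deviation for a few samples. A more robust variant could rely on medians or exclude flagged values from the history.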

This work will be based on an in-depth analysis of the usage logs of Grid’5000. Log analysis can have different objectives: resource monitoring [4], usage reporting [3], usage visualization [5], pattern detection [6], energy consumption estimation [2], failure detection [5], etc. As explained before, in our context, the objectives of this log analysis are twofold: 1) anticipating usage needs in terms of resources and energy consumption; and 2) helping administration decisions in case of failures or crashes.
This project will require: 1) collecting usage data (through the monitoring tools deployed on Grid’5000, such as Monika and Ganglia); 2) analyzing them (with statistical methods, for instance); 3) designing automated detection tools (on-line and off-line algorithms); and 4) validating these tools (by deploying them on Grid’5000).
The detection tools will be based on data mining and machine learning techniques. For instance, these techniques can be used to build profiles of typical users and of typical resources (e.g. a computing node). On-line and off-line techniques will be studied for the different purposes of this project: reporting, forecasting, etc. Different types of data will be used for this analysis, depending on their availability on the Grid’5000 sites: resource reservations, hardware failure history, CPU usage, disk usage, network usage, and energy consumption measurements.
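
A first step towards user profiles is to aggregate raw reservation records into one feature vector per user, which a clustering or classification method can then consume. The sketch below assumes a hypothetical record layout with `user`, `nodes` and `duration_s` fields. It does not reflect the actual format of the Grid’5000 reservation logs.

```python
from collections import defaultdict
from statistics import mean


def build_user_profiles(reservations):
    """Aggregate reservation records into one feature dictionary per user."""
    per_user = defaultdict(list)
    for record in reservations:
        per_user[record["user"]].append(record)

    profiles = {}
    for user, records in per_user.items():
        profiles[user] = {
            "jobs": len(records),
            "mean_nodes": mean(r["nodes"] for r in records),
            "mean_duration_s": mean(r["duration_s"] for r in records),
            "node_seconds": sum(r["nodes"] * r["duration_s"] for r in records),
        }
    return profiles


# Made-up reservation records.
reservations = [
    {"user": "alice", "nodes": 2, "duration_s": 3600},
    {"user": "alice", "nodes": 4, "duration_s": 7200},
    {"user": "bob", "nodes": 64, "duration_s": 600},
]
profiles = build_user_profiles(reservations)
```

The same aggregation could be keyed by node instead of by user to obtain resource profiles.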

References:

[1] http://www.grid5000.fr
[2] Marcos Dias de Assunção, Anne-Cécile Orgerie, and Laurent Lefèvre, “An Analysis of Power Consumption Logs from a Monitored Grid Site”, IEEE/ACM International Conference on Green Computing and Communications (GreenCom-2010), pages 61-68, 2010.
[3] Anne-Cécile Orgerie and Laurent Lefèvre, “A year in the life of a large-scale experimental distributed system: the Grid’5000 platform in 2008”, INRIA Research Report RR-7481, 2010.
[4] A. Iosup, C. Dumitrescu, D. Epema, Hui Li, and L. Wolters, “How are real grids used? The analysis of four grid traces and its implications”, IEEE/ACM International Conference on Grid Computing (GRID), 2006.
[5] Lucas Mello Schnorr, Arnaud Legrand and Jean-Marc Vincent, “Detection and analysis of resource usage anomalies in large distributed systems through multi-scale visualization”, Concurrency and Computation: Practice and Experience, Wiley, 2012.
[6] Marcelo Finger, Germano C. Bezerra and Danilo R. Conde, “Resource use pattern analysis for opportunistic grids”, International Workshop on Middleware for Grid Computing (MGC), 2008.
[7] https://www.grid5000.fr/mediawiki/index.php/Hemera