DirtyData

DirtyData is a research project
to reduce the human cost of data preparation and integration
as well as to facilitate statistical analysis of incomplete data.

The future of data integration

Data integration and data cleaning, i.e. going from raw data to statistical analysis, are a major stumbling block in data science. DirtyData is an ambitious research project uniting several major actors of French public and private research around this problem.

Two research axes are funded. One revolves around dirty-data integration; it is funded by the ANR (Agence Nationale de la Recherche).
The other strives to develop new techniques to analyze incomplete data. It is funded by the DATAIA Institute.

Research axes/Funding


DirtyData integration

This project aims to reduce the cost of data preparation by integrating it directly into the statistical analysis. Our key insight is that machine learning itself deals well with noise and errors. Hence, we aim to develop methodology to do statistical analysis directly on the original dirty data. For this, the operations currently done to clean data before the analysis must be adapted to a statistical framework that captures errors and inconsistencies. Our research agenda combines the data-integration state of the art in database research with statistical modeling and regularization from machine learning.
A challenge is to turn the entities present in the data into representations well-suited for statistical learning that are robust to potential errors but do not wash out uncertainty.
We address typing entries, deduplication (finding different forms of the same entity), building joins across dirty tables, and correcting errors and missing data. Our goal is for these steps to be generic enough to directly digest dirty data without user-defined rules. The methods developed are empirically evaluated on a variety of datasets, including data from a number of open-data websites.
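As an illustration of what an error-robust representation of entries can look like, here is a minimal sketch based on character n-grams: a one-letter typo only perturbs a few n-grams, so similarity between dirty forms of the same entity stays high. The function names and the Jaccard measure are our own illustrative choices, not the project's actual methods.

```python
def ngrams(s, n=3):
    """Set of character n-grams of s, lowercased and padded with spaces."""
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings.
    Robust to typos: a single-letter error changes only a few n-grams."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def best_match(entry, canonical, n=3):
    """Deduplication sketch: map a dirty entry to its most similar
    canonical form, with no user-defined cleaning rules."""
    return max(canonical, key=lambda c: similarity(entry, c, n))

best_match("police oficer", ["police officer", "nurse", "teacher"])
# -> "police officer"
```

Feeding such similarities (rather than exact string matches) to a statistical model is one way dirty entries can be analyzed directly, without a prior cleaning pass.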


Missing Big Data

‘Big data’, often observational and compound rather than experimental and homogeneous, poses missing-data challenges: missing values are structured and not independent of the outcome variables of interest. Deleting incomplete observations at best loses information and at worst warps conclusions through selection bias.
Our missing-data research is motivated by applications in medical data, with the Traumabase and UK Biobank, which feature a great diversity of missing values. In particular, we would like to tackle the problem of causal inference with inverse propensity weighting when the data is incomplete.
We propose to use more powerful models that can exploit the large sample sizes to impute the missing values, even when they are generated by a non-ignorable mechanism. We also consider alternatives to imputation, directly adapting models such as random forests to handle missing values in the features.
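The selection-bias problem and the inverse-propensity-weighting idea can be sketched on a toy simulation. Everything below (the data-generating process, the known propensities) is illustrative only; in real applications the propensities must be estimated, and doing so when the covariates are themselves incomplete is precisely the open problem the project targets.

```python
import random

random.seed(0)

# Simulate: Y depends on X, and Y is more often missing when X is small,
# so the observed Y over-represent large values (selection bias).
n = 100_000
rows = []
for _ in range(n):
    x = random.random()
    y = x + random.gauss(0, 0.1)
    p = 0.9 if x > 0.5 else 0.3          # P(Y observed | X), known here
    rows.append((x, y if random.random() < p else None, p))

# Naive mean over complete cases is biased upward (~0.625; true mean 0.5).
observed = [y for _, y, _ in rows if y is not None]
naive_mean = sum(observed) / len(observed)

# IPW: reweight each observed Y by 1 / P(observed | X) to undo the bias,
# recovering an (approximately) unbiased estimate of the mean.
ipw_mean = sum(y / p for _, y, p in rows if y is not None) / n
```

Dropping incomplete rows gives roughly 0.625 instead of the true 0.5; the reweighted estimate corrects this, which is the behaviour the paragraph above describes.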