DirtyData project

Statistical learning on non-curated data

DirtyData

DirtyData is a research project
to reduce the human cost of data preparation and integration
as well as to facilitate statistical analysis of incomplete data.

The future of data integration

Data integration and data cleaning, i.e., going from the raw data to the statistical analysis, are a major stumbling block in data science. DirtyData is an ambitious research project uniting several major actors of French public and private research around this problem.

Two research axes are funded. One revolves around dirty-data integration and is funded by the ANR (Agence Nationale de la Recherche).
The other strives to develop new techniques to analyze incomplete data. It is funded by the DataAI institute.

Research axes / Funding

DirtyData integration
This project aims to reduce the cost of data preparation by integrating it directly into the statistical analysis. Our key insight is that machine learning itself deals well with noise and errors. Hence, we aim to develop methodology to do statistical analysis directly on the original dirty data. For this, the operations currently done to clean data before the analysis must be adapted to a statistical framework that captures errors and inconsistencies. Our research agenda combines the data-integration state of the art in database research with statistical modeling and regularization from machine learning.
A challenge is to turn the entities present in the data into representations well-suited for statistical learning that are robust to potential errors but do not wash out uncertainty.
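
As an illustration of such a representation, the sketch below encodes a dirty string by its character n-gram similarity to a few reference entries, so that a typo shifts the encoding slightly instead of creating an unrelated category. It is a minimal sketch and not necessarily the project's own method: the function names, the choice of trigrams with Jaccard similarity, and the example job titles are all assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a padded, lower-cased string."""
    padded = f" {text.strip().lower()} "
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its vector of similarities to the prototypes."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical reference categories and dirty entries (illustration only).
prototypes = ["police officer", "firefighter"]
dirty_entries = ["Police Oficer", "fire fighter", "firefighter"]
encoded = similarity_encode(dirty_entries, prototypes)

for entry, vector in zip(dirty_entries, encoded):
    print(entry, [round(s, 2) for s in vector])
```

Each entry becomes a continuous vector rather than a one-hot code, so misspelled variants stay close to the entity they most resemble while still carrying some uncertainty.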
We address typing entries, deduplication (finding different forms of the same entity), building joins across dirty tables, and correcting errors and missing data. Our goal is that these steps should be generic enough to digest dirty data directly, without user-defined rules. The methods developed are empirically evaluated on a variety of datasets, including data from a number of open-data websites.
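
To make the idea of a join across dirty tables concrete, here is a minimal sketch of an approximate left join built on Python's standard `difflib` module. The helper `fuzzy_left_join`, the `cutoff` value and the example tables are hypothetical and only illustrate the principle; a real pipeline would need a more scalable matching strategy.

```python
import difflib


def fuzzy_left_join(left_rows, right_rows, key, cutoff=0.8):
    """Left-join two lists of dicts on an approximately matching string key."""
    right_by_key = {row[key].strip().lower(): row for row in right_rows}
    candidates = list(right_by_key)
    joined = []
    for row in left_rows:
        query = row[key].strip().lower()
        # Keep only the single best candidate above the similarity cutoff.
        matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
        match = right_by_key[matches[0]] if matches else {}
        merged = {**match, **row}  # values from the left table take precedence
        merged["matched_key"] = match.get(key)  # None when nothing matched
        joined.append(merged)
    return joined


# Hypothetical tables with inconsistent spellings of the join key.
salaries = [
    {"city": "Paris", "salary": 10},
    {"city": "Pariss", "salary": 12},
    {"city": "Marseile", "salary": 9},
    {"city": "Oslo", "salary": 11},
]
regions = [
    {"city": "Paris", "region": "Ile-de-France"},
    {"city": "Marseille", "region": "Provence"},
]

joined = fuzzy_left_join(salaries, regions, key="city")
for row in joined:
    print(row)
```

Rows without a sufficiently close match keep their original columns and get `matched_key` set to `None`, so the uncertainty of the match stays visible downstream.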

Funding: ANR (Agence Nationale de la Recherche)

Missing Big Data
‘Big data’, often observational and compound rather than experimental and homogeneous, poses missing-data challenges: missing values are structured and not independent of the outcome variables of interest. Deleting incomplete observations causes information loss at best and, at worst, distorted conclusions due to selection bias.
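
The selection bias caused by deleting incomplete observations can be seen in a small simulation. The sketch below is purely illustrative: the outcome distribution, the observation probabilities (0.9 and 0.3) and the sample size are arbitrary assumptions, chosen so that missingness depends on the outcome.

```python
import random

rng = random.Random(0)  # fixed seed so the simulation is reproducible

# Simulated outcome of interest, centred on zero.
outcomes = [rng.gauss(0.0, 1.0) for _ in range(10_000)]


def is_observed(y):
    """A row is complete with a probability that depends on the outcome."""
    return rng.random() < (0.9 if y < 0 else 0.3)


# Complete-case analysis: drop every row with a missing value.
complete_cases = [y for y in outcomes if is_observed(y)]

full_mean = sum(outcomes) / len(outcomes)
complete_case_mean = sum(complete_cases) / len(complete_cases)

print(f"mean over all rows:      {full_mean:.3f}")
print(f"mean over complete rows: {complete_case_mean:.3f}")
```

Because rows with high outcomes are dropped more often, the complete-case mean is pulled well below the mean of the full sample.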
Our missing-data research is motivated by applications in medical data, with the Traumabase and UK Biobank, which feature a great diversity of missing values. In particular, we would like to tackle the problem of causal inference with inverse propensity weighting when the data is incomplete.
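
For readers unfamiliar with inverse propensity weighting, the following minimal sketch estimates an average treatment effect on a toy dataset in which the confounder is sometimes missing. Treating "missing" as its own stratum when estimating propensities is a simplifying assumption made here for illustration, not a method claimed by the project; the data and the function name `ipw_ate` are hypothetical.

```python
from collections import defaultdict


def ipw_ate(rows):
    """Inverse-propensity-weighted estimate of the average treatment effect.

    Each row is (x, treated, y), where the confounder x may be None.
    Propensities are estimated as the treated fraction within each stratum
    of x, with None treated as a stratum of its own (an assumption).
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_treated, n_total]
    for x, treated, _ in rows:
        counts[x][0] += treated
        counts[x][1] += 1
    propensity = {x: n_treated / n for x, (n_treated, n) in counts.items()}

    total = 0.0
    for x, treated, y in rows:
        e = propensity[x]  # must lie strictly between 0 and 1
        total += treated * y / e - (1 - treated) * y / (1 - e)
    return total / len(rows)


# Toy data: (confounder, treated, outcome). Entirely made up.
rows = (
    [(0, 1, 2.0)] + [(0, 0, 0.0)] * 3
    + [(1, 1, 5.0)] * 3 + [(1, 0, 3.0)]
    + [(None, 1, 3.0), (None, 0, 1.0)]
)

treated_outcomes = [y for _, t, y in rows if t == 1]
control_outcomes = [y for _, t, y in rows if t == 0]
naive_difference = (
    sum(treated_outcomes) / len(treated_outcomes)
    - sum(control_outcomes) / len(control_outcomes)
)
ate = ipw_ate(rows)

print(f"naive difference in means: {naive_difference:.2f}")
print(f"IPW estimate:              {ate:.2f}")
```

In this toy example the naive difference in means is inflated because treated units are over-represented in the stratum with higher outcomes; reweighting by the inverse propensity corrects for that imbalance.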
We propose to use more powerful models that can benefit from the large sample sizes to impute the missing values, even when they are generated by a non-ignorable mechanism. We also consider alternatives to imputation, by directly adapting models such as random forests to handle missing values in the features.
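
One way a tree-based model can handle missing values without imputation is to let each split decide on which side the missing entries go. The sketch below fits a single regression split (a stump) in that spirit; it is a simplified illustration under that assumption, not the project's implementation, and the function names and data are hypothetical.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_stump_with_missing(xs, ys):
    """Fit one regression split; None values are routed to the better side."""
    observed = sorted({x for x in xs if x is not None})
    best = None
    for low, high in zip(observed, observed[1:]):
        threshold = (low + high) / 2
        for missing_goes_left in (True, False):
            left, right = [], []
            for x, y in zip(xs, ys):
                go_left = missing_goes_left if x is None else x <= threshold
                (left if go_left else right).append(y)
            error = sum_squared_error(left) + sum_squared_error(right)
            if best is None or error < best["error"]:
                best = {
                    "error": error,
                    "threshold": threshold,
                    "missing_goes_left": missing_goes_left,
                    "left_value": sum(left) / len(left),
                    "right_value": sum(right) / len(right),
                }
    return best


def predict(stump, x):
    """Predict for one feature value, which may be None."""
    go_left = stump["missing_goes_left"] if x is None else x <= stump["threshold"]
    return stump["left_value"] if go_left else stump["right_value"]


# Toy feature with missing entries whose outcomes resemble the high-x group.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]

stump = fit_stump_with_missing(xs, ys)
print(stump["threshold"], stump["missing_goes_left"], predict(stump, None))
```

Here the missing entries are sent to the right-hand side because their outcomes look like those of the high-value group, so the missingness itself is used as a signal rather than being filled in first.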

Funding: DataAI institute

Presentation: Useful results from DirtyData for machine learning in Python on non-curated data

A presentation on practical results from the DirtyData project, for data analysts who run machine learning in Python on non-curated data.

Slides: "Machine learning on non curated data", by Gael Varoquaux