Fast statistical analysis of web data via sparse learning
- Dr. Francis Bach, SIERRA project-team, Inria Paris Rocquencourt
- Prof. Laurent El Ghaoui, University of California Berkeley
STATWEB aims to provide web-based tools for the analysis and visualization of large corpora of text documents, with a focus on databases of news articles. The team uses advanced algorithms, drawing on recent progress in machine learning and statistics, to let a user quickly produce a short summary and an associated timeline showing how a given topic is described in the news media.
- Theoretical analysis of sparse PCA: The team has shown that methods based on convex relaxations could reach the desired detection threshold, improving over the best known polynomial-time result.
- Parallel graph cuts for computer vision: While general submodular minimization is challenging, the team has proposed a new approach that exploits existing decomposability of submodular functions. In contrast to previous approaches, the method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize.
- Natural language processing with weakly supervised learning: The team has designed a new algorithm to extract relations from text using distant supervision.
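As an illustration of the distant supervision idea mentioned in the last item, here is a minimal sketch of the generic labeling heuristic: any sentence that mentions both entities of a known fact is treated as a (noisy) training example for that fact's relation. This is not the team's algorithm; the knowledge base, the `distant_labels` function, and the example sentences are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (subject, object) -> relation.
KNOWLEDGE_BASE: Dict[Tuple[str, str], str] = {
    ("Inria", "France"): "located_in",
    ("UC Berkeley", "California"): "located_in",
}


def distant_labels(sentences: List[str]) -> List[Tuple[str, str, str, str]]:
    """Label each sentence that mentions both entities of a known fact.

    Returns (sentence, subject, relation, object) tuples. The labels are
    noisy: a sentence can mention both entities without expressing the
    relation.
    """
    labeled = []
    for sentence in sentences:
        for (subj, obj), relation in KNOWLEDGE_BASE.items():
            # Plain substring matching stands in for real entity linking.
            if subj in sentence and obj in sentence:
                labeled.append((sentence, subj, relation, obj))
    return labeled


corpus = [
    "Inria is a research institute based in France.",
    "A delegation from France visited Inria last week.",  # noisy match
    "UC Berkeley hosted the workshop.",                   # only one entity
]
training_examples = distant_labels(corpus)
```

The second sentence shows why the supervision is called weak: it is labeled `located_in` even though it does not state that relation, so a relation extractor trained on such data has to tolerate label noise.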
Publications and Awards:
- 2 Journal articles.
- 2 Conference papers.
A. d’Aspremont, F. Bach, L. El Ghaoui. Approximation Bounds for Sparse Principal Component Analysis. Mathematical Programming, 2013.
The collaboration has continued in several forms, among them the Inria@SiliconValley postdoctoral grant of Edouard Grave, who has been studying “semi-supervised models for event extraction from textual data”. The associate team has built a lasting collaboration and mutual respect between the Inria and UC Berkeley teams.