Context: In many scientific applications, increasingly large datasets are being acquired to describe biological or physical phenomena more accurately. While the dimensionality of the resulting measurements has increased, the number of available samples is often limited by physical or financial constraints. The result is large amounts of complex data observed on small batches of samples. A natural question is then: which features in the data are really informative about some outcome of interest? Answering it amounts to inferring the relationship between each variable and the outcome, conditionally on all other variables. Statistical guarantees on these associations are needed in many fields of data science, where competing models require rigorous statistical assessment. Yet such guarantees are very hard to obtain. In particular, it is not uncommon for a brain imaging analysis task to have a sample size n of 100 but a number of covariates p of 100,000, corresponding to the number of brain voxels. For such situations, clustering the brain voxels into regions has been introduced as a form of dimension reduction.
- The main objective of this project is to develop and extend theoretical results and practical estimation procedures that make statistical inference feasible in such high-dimensional settings.
- The development of robust methods to estimate the distribution of the covariates, in particular their covariance matrix, will also be considered.
- The corresponding software will be developed and the novelty of the inference schemes assessed, with a focus on brain imaging applications. Successful implementations of the procedures will be added to statistical, possibly domain-specific, libraries such as nilearn (nilearn.github.io) and hidimstat (ja-che.github.io/hidimstat/).