Fast-Big Workshop, 30 September 2022

Program:

Welcome: 9:30-10:00

Morning session: 10:00-12:45 (includes a coffee break)

Antoine Villié (LBBE) Selective inference for sequence motifs
- Convolutional neural networks (CNNs) achieve good performance in predicting the phenotype for unannotated biological sequences. To that end, they optimize filters that can be interpreted as sequence motifs. Such motifs appear to be relevant variants for genome-wide association studies (GWAS), which aim to identify correlations between genetic variants and a trait. They are indeed better suited than the standard variants for GWAS applied to meta-genomes or organisms with accessory genomes.
  To our knowledge, there are no existing frameworks to perform inference on the trained filters of a CNN. Although standard data-splitting strategies do exist for GWAS, testing the association between the motifs and the phenotype using those strategies results in both lower performance for motif optimization and a loss of statistical power in the context of small-scale datasets.
  In the present work, we first develop a stable step-wise procedure to select a small number of sequence motifs associated with a trait, and we draw a formal link between our procedure and CNNs for biological sequences.
  We then take advantage of recent advances in post-selection inference to produce a well-calibrated testing procedure for the association between the selected motifs and the trait, while accounting for our selection procedure.
Ahmad Chamma (Inria MIND) Statistically Valid Inference on Population-Level Variable Importance via Conditional Permutation
- Linear models or decision trees are popular in applied research for their simplicity and interpretability. Yet, the ever-growing amount of complex data is calling for high-capacity prediction models such as Random Forests (RFs) or Deep Neural Networks (DNNs). Unfortunately, these methods do not automatically provide clear-cut rules to explain a model's predictions by aspects of the input data. Statistical procedures for inferring variable importance are, therefore, under active development. Permutation approaches are popular. They gauge the importance of a variable of interest by measuring the impact of shuffling its values on outcome prediction. On the flip side, these approaches risk misidentifying unimportant variables as important in the presence of systematic correlations. This can be solved by permutation schemes preserving conditional dependencies between variables. Here we present the Conditional Permutation Importance (CPI) approach, which is both model-agnostic and computationally lean. We show theoretically and empirically that CPI overcomes the limitations of standard permutation importance by providing Type-I error control. When used with a deep neural network, CPI consistently showed top accuracy and power across benchmarks. An empirical benchmark on real-world data analysis in the UK Biobank showed that CPI provided a more parsimonious selection of statistically significant variables. Our results argue that CPI can be readily used as a drop-in replacement for non-conditional permutation-based methods, as it can adapt to the presence of correlated predictors.
Alexandre Blain (Inria MIND / IMT) Notip: Non-parametric true discovery proportion control for brain imaging
- Cluster-level inference procedures are widely used for brain mapping. These methods compare the size of clusters obtained by thresholding brain maps to an upper bound under the global null hypothesis, computed using Random Field Theory or permutations. However, the guarantees obtained by this type of inference (i.e. at least one voxel is truly activated in the cluster) are not informative with regard to the strength of the signal therein. There is thus a need for methods to assess the amount of signal within clusters; yet such methods have to take into account that clusters are defined based on the data, which creates circularity in the inference scheme. This has motivated the use of post hoc estimates that allow statistically valid estimation of the proportion of activated voxels in clusters. In the context of fMRI data, the All-Resolutions Inference framework introduced in Rosenblatt et al. (2018) provides post hoc estimates of the proportion of activated voxels. However, this method relies on parametric threshold families, which results in conservative inference. In this paper, we leverage randomization methods to adapt to data characteristics and obtain tighter false discovery control. We obtain Notip, for Non-parametric True Discovery Proportion control: a powerful, non-parametric method that yields statistically valid guarantees on the proportion of activated voxels in data-derived clusters. Numerical experiments demonstrate substantial gains in the number of detections compared with state-of-the-art methods on 36 fMRI datasets. The conditions under which the proposed method brings benefits are also discussed.

Lunch: 12:45-14:00

Afternoon session: 14:00-17:00

Yohann de Castro (ECL/IUF) Multiple Testing and Variable Selection along the path of the Least Angle Regression
- The Least-Angle Regression algorithm (LAR, Efron et al., 2004) is a simple and powerful greedy method to select variables in high-dimensional contexts or to search for the strongest associations in biostatistics. Measuring the strength of the resulting associations is a challenging task, because one must account for the effects of the selection. The fact that we have searched for the strongest associations means that we must set a higher bar for declaring the associations that we see significant. In this talk, we will see how to address this issue, giving for the first time the joint law of the p-values entering along the LAR path. This study will lead us to a new formulation of the LAR and to FDR control of the variables selected by this algorithm.
Iqraa Meah (Sorbonne University Paris) Online multiple testing with super-uniformity reward
- Valid online inference is an important problem in contemporary multiple testing research, to which various solutions have been proposed recently. It is well known that these existing methods can suffer from a significant loss of power if the null p-values are conservative. In this work, we extend the previously introduced methodology to obtain more powerful procedures for the case of super-uniformly distributed p-values. These types of p-values arise in important settings, e.g. when discrete hypothesis tests are performed or when the p-values are weighted. To this end, we introduce the method of super-uniformity reward (SUR) that incorporates information about the individual null cumulative distribution functions. Our approach yields several new 'rewarded' procedures that offer uniform power improvements over known procedures and come with mathematical guarantees for controlling online error criteria based either on the family-wise error rate (FWER) or the marginal false discovery rate (mFDR). We illustrate the benefit of super-uniform rewarding in real-data analyses and simulation studies. While discrete tests serve as our leading example, we also show how our method can be applied to weighted p-values.
Guillermo Durand (LMO, Université Paris Saclay) Post hoc false positive control with structured hypotheses
- We reverse the usual paradigm of multiple testing: instead of prescribing a rejection set with statistical guarantees, we construct confidence bounds on the number of false positives in any subset of hypotheses, valid uniformly over all possible subsets. Based on a general construction technique provided by Blanchard, Neuvial and Roquain (2020), we further assume some hierarchical spatial structure for the signal to derive new bounds and compare them to pre-existing ones. This is joint work with the aforementioned authors.

Talk length: 20 min + 25 min for questions

Venue: Salle Emmy Noether, Inria Paris

Video conference link:

https://inria.webex.com/inria/j.php?MTID=mfa8911f766bd9bd3a2c8108ff0938c1f
