New results
Modeling the dynamics of proteins
Simpler protein domain identification using spectral clustering
The decomposition of a biomolecular complex into domains is an important step to investigate biological functions and ease structure determination. A successful approach to do so is the SPECTRUS algorithm, which provides a segmentation based on spectral clustering applied to a graph coding inter-atomic fluctuations derived from an elastic network model.
In 19, we present three straightforward and useful additions to SPECTRUS. For single structures, we show that high-quality partitionings can be obtained from a graph Laplacian derived from pairwise interactions, without normal modes. For sets of homologous structures, we introduce a Multiple Sequence Alignment mode, exploiting both the sequence-based information (MSA) and the geometric information embodied in experimental structures. Finally, we propose to analyze the resulting clusters/domains using the so-called D-Family matching algorithm, which establishes a correspondence between the domains yielded by two decompositions and can be used to handle fragmentation issues.
Our domains compare favorably to those of the original SPECTRUS and to those of the deep-learning-based method Chainsaw. Using two complex cases, we show in particular that our method is the only one handling complex conformational changes involving several sub-domains. Finally, a comparison of our method and Chainsaw, using the manually curated domain classification ECOD as a reference, shows that high-quality domains are obtained without using any evolutionary information.
The software is provided in the Structural Bioinformatics Library, see SBL and Spectral domain explorer.
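To illustrate the idea of partitioning a structure from a graph Laplacian built on pairwise interactions alone, the following minimal sketch splits a toy point set into two domains using the sign of the Fiedler vector. It is not the implementation from the SBL: the distance-cutoff contact graph, the 8 Å cutoff, the function names, and the toy coordinates are all assumptions chosen for illustration, and a real decomposition would typically cluster several eigenvectors into more than two domains.

```python
import itertools
import numpy as np

def contact_laplacian(coords, cutoff=8.0):
    """Unnormalized Laplacian L = D - W of a distance-cutoff contact graph.

    The cutoff-based graph is an assumption made for this sketch; any
    pairwise interaction weight could be substituted for W.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    weights = (dist < cutoff).astype(float)
    np.fill_diagonal(weights, 0.0)  # no self-loops
    return np.diag(weights.sum(axis=1)) - weights

def fiedler_bipartition(laplacian):
    """Two-way split by the sign of the eigenvector of the 2nd smallest eigenvalue."""
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues returned in ascending order
    return (eigvecs[:, 1] > 0).astype(int)

# Toy "structure": two compact 3x3x3 blobs of pseudo-atoms joined by a 2-point linker.
grid = np.array(list(itertools.product((0.0, 3.0, 6.0), repeat=3)))
blob_a = grid
blob_b = grid + np.array([27.0, 0.0, 0.0])
linker = np.array([[13.0, 3.0, 3.0], [20.0, 3.0, 3.0]])
coords = np.vstack([blob_a, blob_b, linker])

labels = fiedler_bipartition(contact_laplacian(coords))
```

In this toy case each blob receives a single label, and each linker point joins the blob it touches. The 0/1 label values themselves are arbitrary, since an eigenvector is only defined up to sign.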
Algorithmic foundations
Improved seeding strategies for k-means and k-GMM
In 18, we revisit the randomized seeding techniques for k-means clustering and k-GMM (Gaussian Mixture Model fitting with Expectation-Maximization), formalizing their three key ingredients: the metric used for seed sampling, the number of candidate seeds, and the metric used for seed selection. This analysis yields novel families of initialization methods exploiting a lookahead principle, which conditions the seed selection on an enhanced coherence with the final metric used to assess the algorithm, and a multipass strategy to tame the effect of randomization.
Experiments show a significant improvement over classical contenders. In particular, for k-means, our methods improve on the recently designed multi-swap strategy, with similar results in terms of the sum of squared errors (SSE), and on k-means++ seeding.
Our experimental analysis also sheds light on subtle properties of k-means that are often overlooked, including the (lack of) correlation between the SSE upon seeding and the final SSE, the variance reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods.
Practically, our most effective seeding methods are strong candidates to become one of the standard techniques, if not the standard one. From a theoretical perspective, our formalization of seeding opens the door to a new line of analytical approaches.
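The three ingredients named above (sampling metric, number of candidates, selection metric) can be made concrete with the classical greedy k-means++ baseline. The sketch below is that baseline only, not the lookahead or multipass methods of 18. The function names, the default of three candidates, and the toy data are assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def greedy_kmeanspp(points, k, n_candidates=3, rng=None):
    """Greedy k-means++ seeding; returns (seeds, SSE of points to nearest seed).

    Assumes the data contain at least k distinct points; otherwise the
    sampling weights can all become zero and random.choices raises.
    """
    rng = rng or random.Random(0)
    seeds = [rng.choice(points)]  # first seed: uniform at random
    # d2[i] = squared distance from points[i] to its nearest seed so far
    d2 = [sq_dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        # Ingredient 1: sampling metric (here D^2 weighting).
        # Ingredient 2: number of candidate seeds drawn per step.
        candidates = rng.choices(points, weights=d2, k=n_candidates)
        # Ingredient 3: selection metric (here the SSE after adding the candidate).
        best, best_d2, best_sse = None, None, float("inf")
        for cand in candidates:
            new_d2 = [min(d, sq_dist(p, cand)) for d, p in zip(d2, points)]
            sse = sum(new_d2)
            if sse < best_sse:
                best, best_d2, best_sse = cand, new_d2, sse
        seeds.append(best)
        d2 = best_d2
    return seeds, sum(d2)

# Toy data: three tight, well-separated clusters in the plane.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
seeds, seeding_sse = greedy_kmeanspp(points, k=3)
```

Changing any one of the three ingredients (for instance, selecting candidates with a different metric than the SSE upon seeding) yields a different member of the family of seeding methods. The value returned as `seeding_sse` is the SSE upon seeding, not the final SSE after running k-means.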
Modeling high dimensional point clouds with the spherical cluster model
A parametric cluster model is a statistical model providing geometric insights into the points defining a cluster. The spherical cluster model (SC) approximates a finite point set
First, we show that fitting a spherical cluster yields a strictly convex but not smooth combinatorial optimization problem. Second, we present an exact solver using the Clarke gradient on a suitable stratified cell complex defined from an arrangement of hyper-spheres. Finally, we present experiments on a variety of datasets ranging in dimension from
The SC model is of direct interest for high dimensional multivariate data analysis, and the application to the design of mixtures of SC will be reported in a companion paper.
Applications in structural bioinformatics and beyond
Fold or flop: quality assessment of AlphaFold predictions on whole proteomes
The reliability of AlphaFold predictions is primarily assessed by the method's self-reported score, the predicted Local Distance Difference Test (pLDDT). For model organisms, AlphaFold predictions show that 30% to 40% of all amino acids fall into the low-confidence range of pLDDT. Moreover, pLDDT has occasionally failed to flag predictions that are physically implausible. This raises two fundamental questions: can we identify more robust indicators of reliability? And do unreliable predictions exhibit shared structural or biophysical traits?
To address these questions, we introduce semi-global statistics that characterize packing properties at multiple scales, combined with simultaneous dimensionality reduction and clustering 23. We use these to perform a systematic whole-proteome structural quality assessment of the predictions contained in the AlphaFold Database (AFDB), investigating connections between unreliable predictions, fold classification, and intrinsic disorder propensity.
Our results reveal consistent relationships between low-confidence predictions, clustering of intrinsically disordered regions (IDRs), and distinctive packing properties, thereby highlighting both strengths and limitations of current self-assessment metrics. This work provides a framework for deeper confidence assessment of AlphaFold predictions and offers generalizable strategies for distinguishing reliable from unreliable structural models.
Characterizing the fragmentation of AlphaFold predictions
The Nobel Prize-winning program AlphaFold computes plausible structures of (well) folded proteins. The main quality assessment is based on the predicted Local Distance Difference Test (pLDDT), a per-amino-acid confidence score. To enhance quality assessment, we provide novel quantitative measures to identify coherent amino acid (a.a.) stretches along the sequence in terms of pLDDT values 22. These measures, which rely on standard tools from topological data analysis and combinatorics, quantify the coherence / fragmentation of AlphaFold predictions. The outcome of our analysis can readily be used to select reliable regions/domains within proteins whose pLDDT values span the entire pLDDT range.
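As a much simplified illustration of what a coherent stretch along the sequence can mean, the sketch below extracts maximal runs of consecutive residues whose pLDDT is at or above a threshold, and counts these runs for several thresholds. This is not the set of measures introduced in 22. The function names, the thresholds 50, 70 and 90, and the pLDDT values are assumptions made for the example.

```python
def stretches_above(plddt, threshold):
    """Maximal runs of consecutive residues with pLDDT >= threshold.

    Returns half-open (start, end) index pairs into the sequence.
    """
    runs, start = [], None
    for i, value in enumerate(plddt):
        if value >= threshold and start is None:
            start = i  # a new stretch opens here
        elif value < threshold and start is not None:
            runs.append((start, i))  # the current stretch closes
            start = None
    if start is not None:
        runs.append((start, len(plddt)))  # stretch reaching the C-terminus
    return runs

def fragmentation_profile(plddt, thresholds=(50, 70, 90)):
    """Number of stretches (connected components of the superlevel set) per threshold."""
    return {t: len(stretches_above(plddt, t)) for t in thresholds}

# Hypothetical per-residue pLDDT values for a 14-residue chain.
plddt = [35, 42, 78, 85, 91, 93, 88, 64, 55, 72, 95, 96, 48, 30]
runs_70 = stretches_above(plddt, 70)    # [(2, 7), (9, 12)]
profile = fragmentation_profile(plddt)  # {50: 1, 70: 2, 90: 2}
```

Tracking how the number of stretches changes as the threshold sweeps the pLDDT range is the one-dimensional analogue of counting connected components in a superlevel-set filtration. A prediction whose count grows quickly with the threshold is more fragmented in this sense.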
Orphan genes survey
Orphan genes are protein-coding genes that lack detectable homologs in other species, making them lineage-specific and evolutionarily enigmatic. This review 20 synthesizes research on orphan genes in animals and fungi, summarizing their prevalence, proposed origins (including divergence and de novo emergence), and biological roles. Orphan genes are implicated in diverse processes such as reproduction, development, adaptation, and disease, highlighting their functional importance. They are especially interesting for computational biology because identifying them challenges homology-based annotation methods and requires novel comparative and statistical approaches. By consolidating scattered knowledge, this work provides a foundation for developing better computational tools to detect, classify, and model the evolution and function of orphan genes.
Orphan genes detection and classification
Building on the broader synthesis of orphan gene prevalence and function, we provide a focused, data-driven case in plant-parasitic nematodes of the genus Meloidogyne. Using comparative genomics across 85 nematode species, we show that orphan genes are not rare anomalies but constitute 18% of the genome, with strong transcriptional support 24. By integrating synteny and ancestral sequence reconstruction, the work quantifies the relative contributions of divergence and de novo gene birth, directly addressing questions raised in the earlier review. Proteomic and translatomic evidence further validates these genes as bona fide coding sequences with distinctive molecular features. Together, this study builds a new and effective pipeline for detecting and classifying orphan genes, and exemplifies how computational approaches can move from cataloging orphan genes to dissecting their origins and linking them to lineage-specific adaptations such as parasitism.