


{"id":139,"date":"2019-07-09T09:08:29","date_gmt":"2019-07-09T07:08:29","guid":{"rendered":"http:\/\/project.inria.fr\/ludo2\/?p=139"},"modified":"2019-07-09T09:14:29","modified_gmt":"2019-07-09T07:14:29","slug":"raweb","status":"publish","type":"post","link":"https:\/\/project.inria.fr\/ludo2\/raweb\/","title":{"rendered":"results"},"content":{"rendered":"<br \/>\n<h4>New results<\/h4>\n<div class='subsecClass'>\n<h4>Modeling the dynamics of proteins<\/h4>\n<p>                   <subsection_keyword_list><b>Keywords: <\/b>Protein flexibility, protein conformations, collective coordinates, conformational sampling, loop closure, kinematics, dimensionality reduction.<\/subsection_keyword_list>                   <subsection id=\"ABS-RA-2025-uid29\" level=\"2\">                     <\/p>\n<h4>Simpler protein domain identification using spectral clustering<\/h4>\n<p>The decomposition of a biomolecular complex into domains is an important step to investigate biological functions and ease structure determination. A successful approach to do so is the <hi rend=\"tt\">SPECTRUS<\/hi> algorithm, which provides a segmentation based on spectral clustering applied to a graph coding inter-atomic fluctuations derived from an elastic network model.<\/p>\n<p>We present \u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2025_bibitem_cazals-hal-04504447\">19<\/ref>, which makes three straightforward and useful additions to <hi rend=\"tt\">SPECTRUS<\/hi>. For single structures, we show that high quality partitionings can be obtained from a graph Laplacian derived from pairwise interactions\u2013without normal modes. For sets of homologous structures, we introduce a Multiple Sequence Alignment mode, exploiting both the sequence based information (MSA) and the geometric information embodied in experimental structures. 
Finally, we propose to analyze the clusters\/domains delivered using the so-called D-Family matching algorithm, which establishes a correspondence between domains yielded by two decompositions, and can be used to handle fragmentation issues.<\/p>\n<p>Our domains compare favorably to those of the original <hi rend=\"tt\">SPECTRUS<\/hi>, and those of the deep learning based method <hi rend=\"tt\">Chainsaw<\/hi>. Using two complex cases, we show in particular that our method is the only one handling complex conformational changes involving several sub-domains. Finally, a comparison of our method and <hi rend=\"tt\">Chainsaw<\/hi>, using the manually curated domain classification <hi rend=\"tt\">ECOD<\/hi> as a reference, shows that high quality domains are obtained without using any evolutionary information.<\/p>\n<p>Our method is provided in the Structural Bioinformatics Library, see <ref xlink:href=\"http:\/\/sbl.inria.fr\" location=\"extern\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\">SBL<\/ref> and <ref xlink:href=\"https:\/\/sbl.inria.fr\/doc\/Spectral_domain_explorer-user-manual.html\" location=\"extern\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\">Spectral domain explorer<\/ref>.<\/p>\n<\/p><\/div>\n<p>                 <\/subsection>                 <\/p>\n<div class='subsecClass'>\n<h4>Algorithmic foundations<\/h4>\n<p>                   <subsection_keyword_list><b>Keywords: <\/b>Computational geometry, computational topology, optimization, graph theory, data analysis, statistical physics.<\/subsection_keyword_list>                   <subsection id=\"ABS-RA-2025-uid31\" level=\"2\">                     <\/p>\n<h4>Improved seeding strategies for k-means and k-GMM<\/h4>\n<p>In <ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2025_bibitem_carriere-hal-05441325\">18<\/ref>, we revisit the randomized seeding techniques for <hi rend=\"tt\">k-means<\/hi> 
clustering and <hi rend=\"tt\">k-GMM<\/hi> (Gaussian mixture model fitting with Expectation-Maximization), formalizing their three key ingredients: the metric used for seed sampling, the number of candidate seeds, and the metric used for seed selection. This analysis yields novel families of initialization methods exploiting a <hi rend=\"it\">lookahead<\/hi> principle (conditioning the seed selection on an enhanced coherence with the final metric used to assess the algorithm) and a <hi rend=\"it\">multipass strategy<\/hi> to tame the effect of randomization.<\/p>\n<p>Experiments show a significant improvement over classical contenders. In particular, for <hi rend=\"tt\">k-means<\/hi>, our methods improve on the recently designed multi-swap strategy (similar results in terms of sum of squared errors (SSE), seeding <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mrow><mo>\u223c<\/mo><mo>\u00d7<\/mo><mn>6<\/mn><\/mrow><\/math><\/formula> faster), which was the first one to outperform the greedy <hi rend=\"tt\">k-means++<\/hi> seeding.<\/p>\n<p>Our experimental analysis also sheds light on subtle, often overlooked properties of <hi rend=\"tt\">k-means<\/hi>, including the (lack of) correlations between the SSE upon seeding and the final SSE, the variance reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods.<\/p>\n<p>Practically, our most effective seeding methods are strong candidates to become one of the standard techniques, if not the standard technique. From a theoretical perspective, our formalization of seeding opens the door to a new line of analytical approaches.<\/p>\n<\/p><\/div>\n<div class='subsecClass'>\n<h4>Modeling high dimensional point clouds with the spherical cluster model<\/h4>\n<div class='moreClass'>In collaboration with L. Goldenberg (former Inria intern). 
<\/div>\n<p>A parametric cluster model is a statistical model providing geometric insights into the points defining a cluster. The <hi rend=\"it\">spherical cluster model<\/hi> (SC) approximates a finite point set <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mrow><mi>P<\/mi><mo>\u2282<\/mo><msup><mi>\u211d<\/mi><mi>d<\/mi><\/msup><\/mrow><\/math><\/formula> by a sphere <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mrow><mi>S<\/mi><mo>(<\/mo><mi>c<\/mi><mo>,<\/mo><mi>r<\/mi><mo>)<\/mo><\/mrow><\/math><\/formula> as follows. Taking <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mi>r<\/mi><\/math><\/formula> as a fraction <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mrow><mi>\u03b7<\/mi><mo>\u2208<\/mo><mo>(<\/mo><mn>0<\/mn><mo>,<\/mo><mn>1<\/mn><mo>)<\/mo><\/mrow><\/math><\/formula> (hyper-parameter) of the standard deviation of distances between the center <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mi>c<\/mi><\/math><\/formula> and the data points, the cost of the SC model is the sum over all data points lying outside the sphere <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mi>S<\/mi><\/math><\/formula> of their power distance with respect to <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mi>S<\/mi><\/math><\/formula>. The center <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mi>c<\/mi><\/math><\/formula> of the SC model is the point minimizing this cost. Note that <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mrow><mi>\u03b7<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><\/math><\/formula> yields the celebrated center of mass used in KMeans clustering. 
We make three contributions\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2025_bibitem_cazals-hal-05442010\">21<\/ref>.<\/p>\n<p>First, we show that fitting a spherical cluster yields a strictly convex but not smooth combinatorial optimization problem. Second, we present an exact solver using the Clarke gradient on a suitable stratified cell complex defined from an arrangement of hyper-spheres. Finally, we present experiments on a variety of datasets ranging in dimension from <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mrow><mi>d<\/mi><mo>=<\/mo><mn>9<\/mn><\/mrow><\/math><\/formula> to <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mrow><mi>d<\/mi><mo>=<\/mo><mn>10<\/mn><mo>,<\/mo><mn>000<\/mn><\/mrow><\/math><\/formula>, with two main observations. First, the exact algorithm is orders of magnitude faster than Broyden-Fletcher-Goldfarb-Shanno (BFGS) based heuristics for datasets of small\/intermediate dimension and small values of <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mi>\u03b7<\/mi><\/math><\/formula>, and for high dimensional datasets (say <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mrow><mi>d<\/mi><mo>&gt;<\/mo><mn>100<\/mn><\/mrow><\/math><\/formula>) whatever the value of <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mi>\u03b7<\/mi><\/math><\/formula>. 
Second, the center of the SC model behaves as a parameterized high-dimensional median.<\/p>\n<p>The SC model is of direct interest for high dimensional multivariate data analysis, and its application to the design of mixtures of SC models will be reported in a companion paper.<\/p>\n<\/p><\/div>\n<p>                 <\/subsection>                 <\/p>\n<div class='subsecClass'>\n<h4>Applications in structural bioinformatics and beyond<\/h4>\n<p>                   <subsection_keyword_list><b>Keywords: <\/b>Docking, scoring, interfaces, protein complexes, phylogeny, evolution.<\/subsection_keyword_list>                   <subsection id=\"ABS-RA-2025-uid34\" level=\"2\">                     <\/p>\n<h4>Fold or flop: quality assessment of AlphaFold predictions on whole proteomes<\/h4>\n<p>The reliability of <hi rend=\"tt\">AlphaFold<\/hi> predictions is primarily assessed by the method\u2019s self-reported score, the predicted Local Distance Difference Test (pLDDT). For model organisms, <hi rend=\"tt\">AlphaFold<\/hi> predictions show that 30% to 40% of all amino acids fall into the low-confidence range of pLDDT. Moreover, pLDDT has occasionally failed to flag predictions that are physically implausible. This raises two fundamental questions: can we identify more robust indicators of reliability? And do unreliable predictions exhibit shared structural or biophysical traits?<\/p>\n<p>To address these questions, we introduce semi-global statistics characterizing packing properties at multiple scales, and performing dimensionality reduction and clustering at once\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2025_bibitem_sarti-hal-05438855\">23<\/ref>. 
We use these to perform a systematic whole-proteome structural quality assessment of the predictions contained in the AlphaFold Database (AFDB), investigating connections between unreliable predictions, fold classification, and intrinsic disorder propensity.<\/p>\n<p>Our results reveal consistent relationships between low-confidence predictions, clustering of intrinsically disordered regions (IDRs), and distinctive packing properties, thereby highlighting both strengths and limitations of current self-assessment metrics. This work provides a framework for deeper confidence assessment of <hi rend=\"tt\">AlphaFold<\/hi> predictions and offers generalizable strategies for distinguishing reliable from unreliable structural models.<\/p>\n<\/p><\/div>\n<div class='subsecClass'>\n<h4>Characterizing the fragmentation of AlphaFold predictions<\/h4>\n<p>The Nobel Prize-winning program <hi rend=\"tt\">AlphaFold<\/hi> computes plausible structures of (well) folded proteins. The main quality assessment is based on the <hi rend=\"it\">predicted Local Distance Difference Test<\/hi> (pLDDT), a per-amino-acid confidence score. To enhance quality assessment, we provide novel quantitative measures to identify <hi rend=\"it\">coherent<\/hi> amino acid (a.a.) stretches along the sequence in terms of pLDDT values\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2025_bibitem_cazals-hal-05438856\">22<\/ref>. These measures, which rely on standard tools from topological data analysis and combinatorics, qualify the coherence \/ fragmentation of <hi rend=\"tt\">AlphaFold<\/hi> predictions. 
The outcome of our analysis can readily be used to select reliable regions\/domains within proteins whose pLDDT values span the entire pLDDT range.<\/p>\n<\/p><\/div>\n<div class='subsecClass'>\n<h4>Orphan genes survey<\/h4>\n<p>Orphan genes are protein-coding genes that lack detectable homologs in other species, making them lineage-specific and evolutionarily enigmatic. This review\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2025_bibitem_seckin-hal-05455139\">20<\/ref>\u00a0 synthesizes research on orphan genes in animals and fungi, summarizing their prevalence, proposed origins (including divergence and de novo emergence), and biological roles. Orphan genes are implicated in diverse processes such as reproduction, development, adaptation, and disease, highlighting their functional importance. They are especially interesting for computational biology because identifying them challenges homology-based annotation methods and requires novel comparative and statistical approaches. By consolidating scattered knowledge, this work provides a foundation for developing better computational tools to detect, classify, and model the evolution and function of orphan genes.<\/p>\n<\/p><\/div>\n<div class='subsecClass'>\n<h4>Orphan genes detection and classification<\/h4>\n<p>Building on the broader synthesis of orphan gene prevalence and function, we provide a focused, data-driven case in plant-parasitic nematodes of the genus Meloidogyne. Using comparative genomics across 85 nematode species, we show that orphan genes are not rare anomalies but constitute \u00a018% of the genome, with strong transcriptional support\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2025_bibitem_seckin-hal-05438858\">24<\/ref>. 
By integrating synteny and ancestral sequence reconstruction, the work quantifies the relative contributions of divergence and de novo gene birth, directly addressing questions raised in the earlier review. Proteomic and translatomic evidence further validates these genes as bona fide coding sequences with distinctive molecular features. Together, this study builds a new and effective pipeline for detecting and classifying orphan genes, and exemplifies how computational approaches can move from cataloging orphan genes to dissecting their origins and linking them to lineage-specific adaptations such as parasitism.<\/p>\n<\/p><\/div>\n<p>                 <\/subsection>               <\/p>\n","protected":false},"excerpt":{"rendered":"<p>New results Modeling the dynamics of proteins Keywords: Protein flexibility, protein conformations, collective coordinates, conformational sampling, loop closure, kinematics, dimensionality reduction. Simpler protein domain identification using spectral clustering The decomposition of a biomolecular complex into domains is an important step to investigate biological functions and ease structure determination. 
A successful\u2026<\/p>\n<p> <a class=\"continue-reading-link\" href=\"https:\/\/project.inria.fr\/ludo2\/raweb\/\"><span>Continue reading<\/span><i class=\"crycon-right-dir\"><\/i><\/a> <\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-139","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/posts\/139","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/comments?post=139"}],"version-history":[{"count":10,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/posts\/139\/revisions"}],"predecessor-version":[{"id":149,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/posts\/139\/revisions\/149"}],"wp:attachment":[{"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/media?parent=139"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/categories?post=139"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/tags?post=139"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}