


{"id":139,"date":"2019-07-09T09:08:29","date_gmt":"2019-07-09T07:08:29","guid":{"rendered":"http:\/\/project.inria.fr\/ludo2\/?p=139"},"modified":"2019-07-09T09:14:29","modified_gmt":"2019-07-09T07:14:29","slug":"raweb","status":"publish","type":"post","link":"https:\/\/project.inria.fr\/ludo2\/raweb\/","title":{"rendered":"results"},"content":{"rendered":"<br \/>\n<h4>New results<\/h4>\n<p\/>\n<div class='subsecClass'>\n<h4>Modeling the dynamics of proteins<\/h4>\n<p>                   <subsection_keyword_list><b>Keywords: <\/b>Protein flexibility, protein conformations, collective coordinates, conformational sampling, loop closure, kinematics, dimensionality reduction.<\/subsection_keyword_list>                   <subsection id=\"ABS-RA-2024-uid27\" level=\"2\">                     <\/p>\n<h4>Simpler protein domain identification using spectral clustering<\/h4>\n<p\/>\n<p>The decomposition of a biomolecular complex into domains is an important step to investigate biological functions and ease structure determination. A successful approach to do so is the <hi rend=\"tt\">SPECTRUS<\/hi> algorithm, which provides a segmentation based on spectral clustering applied to a graph coding inter-atomic fluctuations derived from an elastic network model.<\/p>\n<p>We present \u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2024_bibitem_cazals-hal-04504447\">20<\/ref>, which makes three straightforward and useful additions to <hi rend=\"tt\">SPECTRUS<\/hi>. For single structures, we show that high quality partitionings can be obtained from a graph Laplacian derived from pairwise interactions\u2013without normal modes. For sets of homologous structures, we introduce a Multiple Sequence Alignment mode, exploiting both the sequence based information (MSA) and the geometric information embodied in experimental structures. 
Finally, we propose to analyze the clusters\/domains delivered using the so-called D-Family matching algorithm, which establishes a correspondence between domains yielded by two decompositions, and can be used to handle fragmentation issues.<\/p>\n<p>Our domains compare favorably to those of the original <hi rend=\"tt\">SPECTRUS<\/hi>, and those of the deep learning based method <hi rend=\"tt\">Chainsaw<\/hi>. Using two complex cases, we show in particular that our method is the only one handling complex conformational changes involving several sub-domains. Finally, a comparison of our method and <hi rend=\"tt\">Chainsaw<\/hi>, using the manually curated domain classification <hi rend=\"tt\">ECOD<\/hi> as a reference, shows that high quality domains are obtained without using any evolutionary information.<\/p>\n<p>Our method is provided in the Structural Bioinformatics Library, see <ref xlink:href=\"http:\/\/sbl.inria.fr\" location=\"extern\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\">SBL<\/ref> and <ref xlink:href=\"https:\/\/sbl.inria.fr\/doc\/Spectral_domain_explorer-user-manual.html\" location=\"extern\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\">Spectral domain explorer<\/ref>.<\/p>\n<\/p><\/div>\n<p>                 <\/subsection>                 <\/p>\n<div class='subsecClass'>\n<h4>Algorithmic foundations<\/h4>\n<p>                   <subsection_keyword_list><b>Keywords: <\/b>Computational geometry, computational topology, optimization, graph theory, data analysis, statistical physics.<\/subsection_keyword_list>                   <subsection id=\"ABS-RA-2024-uid29\" level=\"2\">                     <\/p>\n<h4>A mini-review of clustering algorithms and their theoretical properties, with applications to molecular science<\/h4>\n<p\/>\n<p>Clustering is a fundamental task, in particular to analyze potential and free energy landscapes in molecular science. 
In this survey\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2024_bibitem_cazals-hal-04504440\">19<\/ref>, I review the key properties of three remarkable clustering algorithms (<hi rend=\"tt\">k-means++<\/hi>, persistence-based clustering, and spectral clustering) from a dual perspective. The first is the specification of the main mathematical and algorithmic properties of the algorithms; the second is the relevance of these methods for structural, thermodynamic, and kinetic analysis. Doing so provides a unique opportunity to mention important connections between optimization, graph theory, geometry, and theoretical biophysics.<\/p>\n<\/p><\/div>\n<div class='subsecClass'>\n<h4>Improved seeding strategies for <hi rend=\"tt\">k-means<\/hi> and Gaussian mixture fitting with Expectation-Maximization<\/h4>\n<p\/>\n<p><hi rend=\"tt\">k-means<\/hi> clustering and Gaussian mixture model fitting are fundamental tasks in data analysis and statistical modeling. In practice, both algorithms follow a general iterative pattern, relying on (randomized) seeding techniques.<\/p>\n<p>We revisit the previous seeding methods and formalize their key ingredients (metric used for seed sampling, number of seed candidates, metric used for seed selection). This analysis results in casting most of the previous methods into a coherent framework and, most importantly, yields novel families of initialization methods. 
Notably, these novel methods exploit a <hi rend=\"it\">lookahead<\/hi> principle\u2013conditioning the seed selection on better agreement with the final metric used to assess the algorithm, and a <hi rend=\"it\">multipass strategy<\/hi>\u2013using at least two selection passes to tame the effect of randomization.<\/p>\n<p>Experiments show a consistent constant factor improvement over classical contenders in terms of the final metric (sum of squared errors (SSE) for <hi rend=\"tt\">k-means<\/hi>, log-likelihood for Expectation-Maximization applied to Gaussian mixture model fitting), at the same cost. Roughly speaking, our improvement over the greedy smart seeding of <hi rend=\"tt\">k-means++<\/hi> matches the improvement of this greedy smart seeding over the classical randomized smart seeding.<\/p>\n<p><hi rend=\"bold\">Remark.<\/hi> Due to the double blind review process of machine learning conferences, the tech report will be made public in early 2025.<\/p>\n<\/p><\/div>\n<div class='subsecClass'>\n<h4>Subspace-Embedded Spherical Clusters: a novel cluster model for compact clusters of arbitrary dimension<\/h4>\n<div class='moreClass'>In collaboration with L. Goldenberg (former Inria intern), and with S. Suren (IIT Delhi). <\/div>\n<p>Subspace clustering aims at selecting a small number of original coordinates (features) so that clusters are clearly identified in those subspaces. Subspace techniques rely on parametric cluster models, including affine, spherical, and Gaussian cluster models, to name a few. To go beyond full-dimensional spherical cluster models and affine clusters of arbitrary dimension, we introduce <hi rend=\"it\">Subspace-embedded spherical clusters<\/hi> (SESC), a novel cluster model for compact clusters of arbitrary intrinsic dimension. The well-posed nature of such clusters is established via the study of an optimization problem relying on an arrangement of hyper-spheres. 
This arrangement is used to exhibit a piecewise smooth strictly convex function, amenable to nonsmooth optimization.<\/p>\n<p>We illustrate the merits of the SESC model via comparisons against projection medians and the distance to the measure, and for clustering.<\/p>\n<p><hi rend=\"bold\">Remark.<\/hi> Due to the double blind review process of machine learning conferences, the tech report will be made public in early 2025.<\/p>\n<\/p><\/div>\n<p>                 <\/subsection>                 <\/p>\n<div class='subsecClass'>\n<h4>Applications in structural bioinformatics and beyond<\/h4>\n<p>                   <subsection_keyword_list><b>Keywords: <\/b>Docking, scoring, interfaces, protein complexes, phylogeny, evolution.<\/subsection_keyword_list>                   <subsection id=\"ABS-RA-2024-uid33\" level=\"2\">                     <\/p>\n<h4><hi rend=\"tt\">AlphaFold<\/hi> predictions on whole genomes at a glance<\/h4>\n<p\/>\n<p>The 2024 Nobel prize in chemistry was awarded to David Baker (Univ. of Washington) for <hi rend=\"it\">computational protein design<\/hi>, and to Demis Hassabis and John Jumper (Google DeepMind, London, UK), for <hi rend=\"it\">protein structure prediction<\/hi>. The DeepMind software, called <hi rend=\"tt\">AlphaFold<\/hi>, plays a crucial role in helping biologists understand protein functions. We designed novel statistical analyses to assess predictions\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2024_bibitem_cazals-hal-04872025\">21<\/ref>.<\/p>\n<p>For model organisms, <hi rend=\"tt\">AlphaFold<\/hi> predictions show that 30% to 40% of amino acids have a (very) low pLDDT (predicted local distance difference test) confidence score. 
This observation, combined with the method&#8217;s high complexity, calls for investigating difficult cases, the link with IDPs (intrinsically disordered proteins) or IDRs (intrinsically disordered regions), and potential hallucinations. We do so via four contributions. First, we provide a multiscale characterization of stretches with coherent <formula type=\"inline\"><math xmlns=\"http:\/\/www.w3.org\/1998\/Math\/MathML\"><mtext>pLDDT<\/mtext><\/math><\/formula> values along the sequence, an important analysis for model quality assessment. Second, we leverage the 3D atomic packing properties of predictions to represent a structure as a distribution. This distribution is then mapped into the so-called <hi rend=\"it\">2D arity map<\/hi>, which simultaneously performs dimensionality reduction and clustering, effectively summarizing all structural elements across all predictions. Third, using the database of domains <hi rend=\"tt\">ECOD<\/hi>, we study potential biases in <hi rend=\"tt\">AlphaFold<\/hi> predictions at the sequence and structural levels, identifying a specific region of the arity map populated with low quality 3D domains. Finally, with a focus on proteins with intrinsically disordered regions (IDRs), using DisProt and AIUPred, we identify specific regions of the arity map characterized by false positives and false negatives in terms of IDRs.<\/p>\n<p>In summary, the arity map sheds light on the accuracy of <hi rend=\"tt\">AlphaFold<\/hi> predictions, both in terms of 3D domains and IDRs.<\/p>\n<\/p><\/div>\n<div class='subsecClass'>\n<h4>EncoMPASS: a database for the analysis of membrane protein structures, and symmetries<\/h4>\n<p\/>\n<p>Membrane proteins (MPs) constitute about 30% of the proteome of each organism, but they represent only 2% of the entries in the Protein Data Bank (PDB), as their three-dimensional structure is difficult to determine experimentally. 
Membrane protein structures differ from the rest of the proteome in two respects: 1) despite the great variety of functions performed, their structures are very similar, thus making structural classification more challenging, and 2) although symmetric regions are common throughout the whole proteome, in MPs they are often essential for their functional mechanism.<\/p>\n<p>Among the databases collecting and organizing experimental structures of MPs, EncoMPASS is the only one relating the structure and internal symmetry of experimentally determined membrane protein complexes. In this new publication\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2024_bibitem_aleksandrova-hal-04472000\">18<\/ref>, the pipeline and founding criteria for building the database are described along with a complete analysis of the available data. The quality and consistency checks regularly performed on EncoMPASS make it a high quality resource for membrane protein structure algorithms.<\/p>\n<\/p><\/div>\n<div class='subsecClass'>\n<h4>Detecting orphan proteins in a nematode&#8217;s genome<\/h4>\n<p\/>\n<p>Proteins classified in the same family are called homologs and are thought to share a common ancestor from which they have evolved. Proteins that cannot at present be classified in any known family are called orphan proteins, and their existence can be attributed either to the current limitations in protein classification (we then speak of <hi rend=\"it\">distant homologs<\/hi>) or to genuinely novel proteins (<hi rend=\"it\">de novo proteins<\/hi>). Determining whether a protein is an orphan &#8211; or, even more, a distant homolog or a de novo protein &#8211; is particularly challenging due to the uncertainties and intricacy of homolog detection. 
In the poster\u00a0<ref location=\"biblio\" xlink:type=\"simple\" xlink:show=\"replace\" xlink:actuate=\"onRequest\" xlink:href=\"#ABS-RA-2024_bibitem_seckin-hal-04615706\">23<\/ref> presented at JOBIM2024 by E. Se\u00e7kin, we present a new pipeline for determining orphan proteins, and its application to the genomes of the <hi rend=\"it\">Meloidogyne<\/hi> genus of nematodes. This work is a fundamental step in preparation for the first-ever algorithm for characterizing the structure of orphan proteins.<\/p>\n<\/p><\/div>\n<p>                 <\/subsection>               <\/p>\n","protected":false},"excerpt":{"rendered":"<p>New results Modeling the dynamics of proteins Keywords: Protein flexibility, protein conformations, collective coordinates, conformational sampling, loop closure, kinematics, dimensionality reduction. Simpler protein domain identification using spectral clustering The decomposition of a biomolecular complex into domains is an important step to investigate biological functions and ease structure determination. 
A successful\u2026<\/p>\n<p> <a class=\"continue-reading-link\" href=\"https:\/\/project.inria.fr\/ludo2\/raweb\/\"><span>Continue reading<\/span><i class=\"crycon-right-dir\"><\/i><\/a> <\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-139","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/posts\/139","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/comments?post=139"}],"version-history":[{"count":10,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/posts\/139\/revisions"}],"predecessor-version":[{"id":149,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/posts\/139\/revisions\/149"}],"wp:attachment":[{"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/media?parent=139"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/categories?post=139"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/project.inria.fr\/ludo2\/wp-json\/wp\/v2\/tags?post=139"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}