OmicFinder – Querying the genomic treasure

Genomic data enable critical advances in medicine, ecology, ocean monitoring, and agronomy. Sequencing data accumulate exponentially in public genomic data banks such as the European Nucleotide Archive (ENA). A major limitation is that these data, which amount to petabytes of sequences, cannot currently be queried as a whole.

In this context, OmicFinder is an Inria challenge: a four-year project starting in late 2023. Its aim is to provide a novel global search engine that makes it possible to query nucleotide sequences against the vast amount of publicly available genomic data. The central algorithmic idea of a genomic search engine is to index and query small exact words (hundreds of billions over millions of datasets), as well as the associated metadata.
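To make the idea of indexing and querying small exact words concrete, here is a minimal sketch in Python. It is a toy illustration only, not OmicFinder's actual data structure: the function names (`kmers`, `build_index`, `query`), the word length `k`, and the `min_fraction` threshold are all assumptions made for this example. It uses a plain in-memory dictionary, which would not scale to the volumes mentioned above.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping word of length k found in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(datasets, k):
    """Map each word of length k to the set of dataset names containing it.

    `datasets` is a hypothetical mapping: dataset name -> list of sequences.
    """
    index = defaultdict(set)
    for name, sequences in datasets.items():
        for seq in sequences:
            for word in kmers(seq, k):
                index[word].add(name)
    return index


def query(index, seq, k, min_fraction=0.8):
    """Return the datasets sharing at least `min_fraction` of the query's words.

    The result maps each matching dataset name to the fraction of query
    word occurrences found in that dataset.
    """
    words = list(kmers(seq, k))
    if not words:
        return {}
    hits = defaultdict(int)
    for word in words:
        # .get() avoids creating empty entries in the defaultdict
        for name in index.get(word, ()):
            hits[name] += 1
    return {
        name: count / len(words)
        for name, count in hits.items()
        if count / len(words) >= min_fraction
    }


if __name__ == "__main__":
    # Two tiny made-up datasets, for illustration only.
    datasets = {
        "sample_A": ["ACGTACGTGA"],
        "sample_B": ["TTTTGGGGCC"],
    }
    index = build_index(datasets, k=4)
    print(query(index, "ACGTACG", k=4))
```

In this sketch a dataset is reported when a large enough share of the query's words occurs in it exactly; the threshold trades sensitivity against false hits. A production engine would replace the dictionary with compact, disk-based, and distributed structures.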

In addition to creating fundamental novel algorithms and data structures (GenScale team), the project develops new approaches to improve the query experience and the information returned in answers by integrating Semantic Web technologies (Dyliss team). Given the volume of data considered, part of the research focuses on clever index distribution (Iroko team). Throughout the project, we are committed to proposing methods that minimize the environmental impact generated by the massive use of the tools that will be produced, in particular through the use of specialized hardware (Taran team).

The external partners are CEA-GenoScope, Elixir, Pasteur Institute, Inria Challenge OceanIA, CEA-CNRGH, and Mediterranean Institute of Oceanography. They participate in algorithmic developments and provide validation and use cases.

About