A distinctive feature of the dnarXiv project is that it targets nanopore sequencing. This sequencing technology can read long DNA molecules (several tens of thousands of nucleotides). Using long molecules increases the ratio of useful information to indexing information.

The synthesis stage, whatever the technology, can currently only produce short molecules (between 60 and 200 nucleotides). A document must therefore be split into many small DNA molecules, each of which must carry additional information so that the document can be reassembled.
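To make the ratio between useful and indexing information concrete, here is a minimal sketch. It assumes 2 bits per nucleotide, one fixed-length index per molecule that is just large enough to number every fragment, and no primers or error-correction redundancy. The function name `payload_ratio` and the document size used below are illustrative choices, not values from the dnarXiv project.

```python
import math


def payload_ratio(document_bits: int, molecule_nt: int) -> float:
    """Fraction of a molecule left for payload after reserving a fragment index.

    Illustrative assumptions: 2 bits per nucleotide, no primers,
    no error-correction redundancy, one fixed-length index per molecule.
    """
    bits_per_nt = 2
    molecule_bits = molecule_nt * bits_per_nt
    # Search for the shortest index (in nucleotides) able to address every fragment.
    for index_nt in range(molecule_nt):
        payload_bits = (molecule_nt - index_nt) * bits_per_nt
        fragments = math.ceil(document_bits / payload_bits)
        if 4 ** index_nt >= fragments:
            return payload_bits / molecule_bits
    raise ValueError("molecule too short to index this document")


one_megabyte = 8_000_000  # document size in bits (hypothetical example)
short_ratio = payload_ratio(one_megabyte, 100)     # synthesis-scale molecule
long_ratio = payload_ratio(one_megabyte, 10_000)   # nanopore-scale molecule
```

Under these assumptions, a 100-nucleotide molecule spends 8 nucleotides on the index (ratio 0.92), whereas a 10,000-nucleotide molecule spends 5 (ratio 0.9995). Real designs also add primers and redundancy, so the gap in practice depends on the full molecule layout.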

The biotechnology research axis of the dnarXiv project aims to develop original protocols to design long DNA molecules from short oligonucleotides generated by DNA synthesizers. It also validates the decoding using nanopore sequencing technology.

This part of the project is dedicated to channel modeling and error-correction coding for DNA data storage targeting nanopore sequencing. Two research axes have been investigated:

  1. A novel statistical model for DNA storage that accounts for the memory within DNA storage error events and reflects the way nanopore sequencing works. Compared to existing channel models, the proposed model represents experimental datasets more accurately.
  2. A full error-correction scheme for DNA storage, based on a consensus algorithm and non-binary LDPC codes. In particular, a novel synchronization method has been introduced that eliminates the deletion errors remaining after the consensus step, before a belief-propagation LDPC decoding algorithm is applied to correct substitution errors. This method exploits the LDPC code structure to correct deletions and does not require any extra redundancy.
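The consensus step in the second axis can be illustrated with a simplified sketch. It is not the project's consensus algorithm: it assumes the reads have already been aligned to a common length, with `-` marking a gap, and applies a plain column-wise majority vote. The function name `majority_consensus` is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads: list[str]) -> str:
    """Column-wise majority vote over reads pre-aligned to equal length.

    '-' marks a gap. A column where the gap wins the vote is dropped,
    which is how a deletion can survive the consensus step.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # most_common(1) returns the most frequent symbol in this column.
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


# A substitution and a deletion in the minority are both voted out.
clean = majority_consensus(["ACGTAC", "ACGAAC", "AC-TAC"])
# When most reads share the same deletion, it remains in the consensus.
residual = majority_consensus(["AC-TAC", "AC-TAC", "ACGTAC"])
```

The second call shows why a later synchronization stage is useful: the consensus comes out one nucleotide short, and a decoder that only handles substitutions would be misaligned from that position onward.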

We have been working on a solution for writing encrypted data onto synthetic DNA molecules that takes into account the constraints of DNA synthesis and of the error-correction code. Our solution consists of an encoding process applied after data encryption. It accounts for the fact that any cryptosystem produces uniformly distributed output, so the encoded DNA sequences have a non-zero probability of containing forbidden DNA sequences or patterns. Although this process adds some data overhead, it achieves a good information rate compared to existing work.
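The non-zero probability mentioned above can be quantified for one example constraint. The sketch below assumes the forbidden pattern is a homopolymer run (the same nucleotide repeated) longer than a chosen limit. This is an illustrative constraint, not necessarily the one used in the project. It computes the exact probability that a uniformly random sequence violates the limit; the function name `prob_forbidden_run` is hypothetical.

```python
def prob_forbidden_run(length: int, max_run: int) -> float:
    """Probability that a uniformly random DNA sequence of `length` nucleotides
    contains a homopolymer run longer than `max_run`.

    Counts allowed sequences exactly by dynamic programming over the
    length of the final run.
    """
    if length < 1 or max_run < 1:
        raise ValueError("length and max_run must be positive")

    # ending_in[r - 1] = number of allowed sequences whose final run has length r.
    ending_in = [4] + [0] * (max_run - 1)
    for _ in range(length - 1):
        total = sum(ending_in)
        # A different base (3 choices) starts a new run of length 1;
        # the same base extends a run, allowed only up to max_run.
        ending_in = [3 * total] + ending_in[:-1]

    allowed = sum(ending_in)
    return 1 - allowed / 4 ** length


p_short = prob_forbidden_run(4, 3)     # only AAAA, CCCC, GGGG, TTTT violate
p_oligo = prob_forbidden_run(200, 3)   # a synthesis-scale random sequence
```

For four nucleotides and a limit of three, only 4 of the 256 sequences are forbidden. The probability grows with sequence length, which is why an encoding stage placed after encryption has to handle such patterns rather than assume they never occur.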

The dnarXiv platform aims to support both real and in-silico experiments. Using a set of dedicated commands to store and read data on DNA, various scenarios can be investigated.

