Case: Solving the puzzle of plant genomes

High-quality genome assemblies of plants are an important first step in gaining insights into plant development, disease resistance and crop improvement. But plant genomes are notoriously difficult to assemble. A new approach reduces the complexity of the process by weeding out errors and making more efficient use of data.

A billion little fragments 

Genome assembly is the process of taking a large number of relatively short DNA sequences, fragments of 100,000 letters at most, and putting them back together in a computer. This creates a representation of the original set of chromosomes (the genome). In some plants, we are talking about a billion little fragments. Needless to say, this requires a lot of computing power. 
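To make the idea of "putting fragments back together" concrete, here is a minimal sketch of a greedy overlap assembler. It is a toy illustration only: the function names and the `min_len` threshold are assumptions made for this example, and it ignores sequencing errors, reverse-complement strands and repeats, which are exactly the things that make real plant genomes hard.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the longest overlap.

    Hypothetical helper for illustration; stops when no pair overlaps
    by at least `min_len` letters and returns the remaining contigs.
    """
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Join the pair, writing the shared letters only once
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

The nested loop compares every fragment with every other fragment, so the work grows with the square of the number of fragments. With a billion fragments that is far out of reach for a naive approach like this one, which is why real assemblers rely on cleverer indexing and a lot of computing power.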

Jigsaw puzzle

Plant genomes are difficult to assemble for several reasons. “For one, plant genomes often have multiple copies of a chromosome,” bioinformatics researcher Sven Warris (Wageningen University and Research) explains. “They also contain many repetitive fragments. For software, it’s very difficult to work out what belongs together. You can compare it to a jigsaw puzzle with a lot of pieces of blue sky that are all only slightly different from each other. The smaller the pieces, the more complex it is to assemble the puzzle. Last but not least, plant genomes can be very large, and are much more complex than a human genome.”

High quality DNA 

For many genome projects that aim to determine the complete genetic information of an organism (e.g. a plant or animal), it is important to be able to use the PacBio or Nanopore sequencing technologies. “These platforms produce very long DNA reads and are therefore useful in creating high-quality de novo assemblies. These are assemblies made without the use of a reference genome, like a jigsaw puzzle without the box with the picture on it. However, these sequencing platforms also require high-quality DNA in large amounts as input. This is not always available. The DNA can be amplified in the lab, but this process introduces so-called chimeric fragments (pieces that don’t belong together get stuck together), hampering the downstream analyses of the data.”

New approach

Warris, together with an international group of researchers, developed a new approach that reduces the complexity of the genome assembly process. It enables them to select a single chromosome, amplify the DNA, and sequence it on, for example, the PacBio Sequel platform. The tool they developed for this, Pacasus, makes it possible to use low-input material to create high-quality de novo assemblies. “De novo assembly tools have a hard time identifying the artificial chimeric fragments. This new approach allows us to identify and separate these pieces, so we don’t need to discard information anymore.”
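The idea of identifying a chimeric fragment and separating its pieces, rather than throwing the whole read away, can be sketched in a few lines. The example below is a toy and not Pacasus's actual algorithm: it assumes, purely for illustration, one simple kind of chimera in which a read folds back onto its own reverse complement, and it requires an exact match. The function names and the `min_arm` threshold are hypothetical; a real tool would need an error-tolerant alignment to cope with noisy long reads.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_fold_back(read: str, min_arm: int = 8) -> Optional[int]:
    """Find the position where `read` folds back onto its reverse complement.

    Assumption for this sketch: the chimera is an exact palindrome around
    the junction. Returns the junction index, or None if no fold-back with
    at least `min_arm` matching letters on each side is found.
    """
    best_pos, best_arm = None, 0
    for pos in range(1, len(read)):
        arm = 0
        # Walk outwards from the junction while the letters pair up as complements
        while (pos - 1 - arm >= 0 and pos + arm < len(read)
               and read[pos + arm] == reverse_complement(read[pos - 1 - arm])):
            arm += 1
        if arm >= min_arm and arm > best_arm:
            best_pos, best_arm = pos, arm
    return best_pos


def split_chimera(read: str, min_arm: int = 8) -> list[str]:
    """Split a fold-back chimera into two reads; leave other reads untouched."""
    pos = find_fold_back(read, min_arm)
    return [read] if pos is None else [read[:pos], read[pos:]]


original = "ATGGCGTACC"
chimera = original + reverse_complement(original)
print(split_chimera(chimera))
```

The `min_arm` threshold is there because short palindromic stretches also occur naturally in DNA; only a long fold-back is treated as an artefact. Both pieces are kept after the split, which mirrors the point made above that no information needs to be discarded.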

GPUs on HPC Cloud

Pacasus runs most efficiently using Graphics Processing Unit (GPU) technology. “Within this project we made use of SURFsara’s HPC Cloud service, which offers a flexible infrastructure for research purposes while providing access to High Performance Computing resources, namely GPUs.” GPU-accelerated computing offloads compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU (processor). By design, GPUs have a massively parallel architecture consisting of thousands of smaller, more efficient cores, able to handle many tasks simultaneously. In other words, if the scientific application’s code is optimized to take advantage of GPUs, it will run much faster.

Flexible environment 

Warris aimed to set up a flexible environment with access to GPUs in which data could easily be shared between the computing resources. A (virtual) cluster of Virtual Machines (VMs) was set up in the HPC Cloud, with direct access to GPU hardware for performance reasons. Shared network storage (2 terabytes) was made available across the cluster via a network file system (NFS) setup. The distribution of Pacasus job runs on the cluster was managed by HTCondor, a workload management system for compute-intensive jobs. “At peak use we had 9 GPUs attached to 5 VMs, including access to a high-end node with the newest NVIDIA Tesla P100 accelerators.”
