Solving the puzzle of plant genomes
High-quality genome assemblies of plants are an important first step towards gaining insights into plant development, disease resistance and crop improvement. But plant genomes are notoriously difficult to assemble. A new approach reduces the complexity of the process by weeding out errors and making more efficient use of data.
A billion tiny fragments
In genome assembly, you take a large number of relatively short DNA sequences, fragments of up to 100,000 letters, and stitch them back together in a computer. The result is a representation of the original set of chromosomes (the genome). For some plants this involves a billion tiny fragments (reads), so it requires a lot of computing power.
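As a toy illustration of what stitching reads together means, the sketch below joins two reads when the end of one exactly matches the start of the other. The function names, the example sequences and the minimum overlap of three letters are assumptions made for this example. Real assemblers tolerate sequencing errors and handle millions of reads at once.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep `left` whole and append only the part of `right` past the overlap.
    return left + right[size:]


if __name__ == "__main__":
    # The reads share the four letters "GTAC".
    print(merge_reads("ATGCGTAC", "GTACCTGA"))  # ATGCGTACCTGA
```

The repeat problem described in the article shows up even here: when many reads end in the same letters, several merges look equally valid and the software has to choose between them.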
Plant genomes are difficult to put together for several reasons. "Plants often have multiple copies of a chromosome," explains bioinformatician Sven Warris (Wageningen University and Research). "Their genomes also contain many repetitive fragments. For software, it is very difficult to figure out what belongs together. You can compare it to a jigsaw puzzle with many pieces of blue sky that are all only slightly different from each other. The smaller the pieces, the more complex it is to put the puzzle together. Last but not least, plant genomes can be very large and are much more complex than a human genome."
For many genome projects, which aim to determine the complete genetic information of an organism such as a plant or an animal, it is important to be able to use PacBio or Nanopore sequencing technologies. "These platforms produce very long DNA reads and are therefore useful in making high-quality de novo assemblies. De novo means without a reference genome, like a jigsaw puzzle without the picture on the box. However, these sequencing platforms also require large quantities of high-quality DNA as input. This is not always available. DNA can be amplified in the lab, but this process introduces so-called chimeric fragments (pieces that do not belong together stick together), which complicates data analysis."
Warris, together with an international group of researchers, developed a new approach that reduces the complexity of the genome assembly process. It allows them to select a single chromosome, amplify the DNA and sequence it on, for example, the PacBio sequencing platform. The tool they developed for this purpose, Pacasus, allows them to use low-quality material to make high-quality de novo assemblies. "De novo assembly tools struggle to identify the artificial chimeric fragments. With this new approach, we can identify and separate these pieces, so we no longer have to throw away information."
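The article does not describe how Pacasus recognises chimeric fragments, so the following is only a toy sketch under a stated assumption: that a chimeric read consists of a fragment joined to its own reverse complement, forming a palindrome. The function names, the `min_arm` threshold and the example sequences are hypothetical, and the exact matching used here would not survive real sequencing errors; a practical tool would need error-tolerant alignment. The point is only to show how such a read can be cut at the junction instead of being discarded.

```python
# Assumption: chimeras are palindromic, i.e. fragment + reverse complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_palindrome_junction(read, min_arm=4):
    """Return the position where the longest palindromic arm meets, or None."""
    best_pos, best_arm = None, 0
    for pos in range(1, len(read)):
        arm = 0
        # Walk outwards from the junction while both sides stay complementary.
        while (
            pos - 1 - arm >= 0
            and pos + arm < len(read)
            and read[pos - 1 - arm].translate(COMPLEMENT) == read[pos + arm]
        ):
            arm += 1
        if arm >= min_arm and arm > best_arm:
            best_pos, best_arm = pos, arm
    return best_pos


def split_chimera(read, min_arm=4):
    """Cut a palindromic read at its junction; leave other reads untouched."""
    pos = find_palindrome_junction(read, min_arm)
    if pos is None:
        return [read]
    return [read[:pos], read[pos:]]


if __name__ == "__main__":
    # "TGAGCCTT" is the reverse complement of "AAGGCTCA".
    print(split_chimera("AAGGCTCATGAGCCTT"))  # ['AAGGCTCA', 'TGAGCCTT']
    print(split_chimera("AAGGCTCAAC"))        # ['AAGGCTCAAC']
```

Both halves are kept after the cut, which mirrors the idea in the quote above: the information in a chimeric read is separated rather than thrown away.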
GPUs on HPC Cloud
Pacasus works most efficiently on GPUs (graphics processing units). "Within this project, we used SURF's HPC Cloud service. That offers a flexible infrastructure for research purposes while providing access to High Performance Computing infrastructure, in this case GPUs." Compute-intensive parts of the application are offloaded to the GPU, while the rest of the code still runs on the CPU (the main processor). GPUs have a massively parallel architecture made up of thousands of smaller, more efficient cores that can process many tasks simultaneously. If a scientific application's code is optimised to take advantage of GPUs, it can run much faster.
Warris wanted to set up a flexible environment with access to GPUs while easily sharing data between computing infrastructures. A virtual cluster of virtual machines (VMs) was set up in the HPC Cloud with direct access to GPU hardware for performance. The cluster could access 2 terabytes of shared network storage. The distribution of Pacasus tasks across the cluster was managed by HTCondor, a workload management system for compute-intensive tasks. "At peak utilisation, we had 9 GPUs connected to 5 VMs, including access to a high-end node with the latest NVIDIA Tesla P100 accelerators."
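To give an impression of how work can be handed to HTCondor, the sketch below generates a submit description that queues one GPU job per chunk of reads. It is not the project's actual configuration: the wrapper script `run_pacasus.sh`, the chunk file names and the log file name are hypothetical, and the article does not say how the reads were divided. Only the submit commands themselves (`executable`, `arguments`, `request_gpus`, `output`, `error`, `log` and `queue ... from`) are standard HTCondor syntax.

```python
def build_submit_description(chunks, executable="run_pacasus.sh", gpus_per_job=1):
    """Return HTCondor submit-description text that queues one job per chunk."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    lines = [
        f"executable = {executable}",      # hypothetical wrapper script
        "arguments = $(chunk)",            # each job receives one chunk name
        f"request_gpus = {gpus_per_job}",  # ask the scheduler for a GPU slot
        "output = $(chunk).out",
        "error = $(chunk).err",
        "log = pacasus.log",               # hypothetical shared log file
        "queue chunk from (",
    ]
    lines += [f"    {name}" for name in chunks]
    lines.append(")")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical chunk files on the shared network storage.
    print(build_submit_description(["reads_000.fasta", "reads_001.fasta"]))
```

Saved to a file, the generated text could be passed to `condor_submit`, after which the scheduler assigns each job to a machine with a free GPU.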