Principles and Practice in Scaling Computational Genomics Applications

Overview

Summary

Genomics is an important area of study that holds the promise of unraveling how our genes guide our lives and how personalized gene therapies may be used to treat various medical conditions. The increasing adoption of massively parallel sequencing technologies, known as next-generation sequencing (NGS) technologies, has made computation the bottleneck in our ability to ingest all the genetic information and to synthesize actionable knowledge from it. Even after sequencing is complete, computational codes are required for analysis and interpretation to extract biologically significant results.

The approach that computational scientists are pursuing is, unsurprisingly, parallelization of computational genomics applications. There are some obvious sources of parallelism in many of these applications; e.g., in DNA sequence alignment, each query (essentially a string of DNA base pairs) in a set of queries can be matched against the database (a reference corpus of genetic information) in parallel. However, the approaches to parallelization have predominantly been point solutions for individual algorithms, and these point solutions run into bottlenecks even at moderate scales. For example, in our empirical measurements with mpiBLAST, a parallel implementation of BLAST (the de facto standard tool for matching nucleic acid sequences against databases of known sequences), we found that it took an hour and a half to match a single query (albeit a large query of 263K base pairs) when run on 120 cores.

Further, for a number of significant problems, parallelization is in its infancy, e.g., for RNA structure determination. Finally, as the scale of the genomic information grows at a rapid pace, the computational codes have to deal with ever increasing scales, where scale implies larger process counts and larger data sizes.

We are developing a principled way to extract parallelism from these applications. We leverage the approximate nature of some of them (e.g., a DNA sequence match does not need to be exact) to aid in the parallelization. Our insight is that most of these applications have at their heart two fundamental algorithms, string matching and graph matching, and we have identified a coarse-grained, inter-query parallelism and a fine-grained, intra-query parallelism inherent in both. A second thrust is performance modeling: the quality of the scaling model determines how well our system can predict the behavior of a program at large scales. We will extend existing modeling techniques so that they can capture a wide variety of scaling behaviors and automatically select which program behaviors to track. We will also model the data dependence that many of the target algorithms exhibit. Finally, we will use the model to detect and diagnose performance issues that often crop up in computational genomics applications at large data sizes.

Achieved Technical Goals

Gene sequencing instruments are producing huge volumes of data, straining the capabilities of current database-searching algorithms and hindering the efforts of researchers analyzing large collections of data to obtain greater insights. In the space of parallel genomic sequence search, most of the popular software packages, such as mpiBLAST, use the database-segmentation approach, wherein the entire database is sharded and searched on different nodes. However, this approach does not scale well with the increasing length of individual query sequences or the rapid growth in the size of sequence databases. In our work, we propose a fine-grained parallelism technique, called Orion, that divides the input query into an adaptive number of fragments in addition to sharding the database. Our technique achieves higher parallelism (and hence speedup) and better load balancing, while maintaining 100% accuracy. We show that it is 12.3X faster than mpiBLAST for solving a relevant comparative genomics problem.

The classic algorithm for performing sequence alignment, i.e., identifying matches between a query and a database of sequences, is the Basic Local Alignment Search Tool (BLAST) [1], [2] (references are from our Supercomputing 2014 paper). BLAST operates by comparing each sequence in the input query set against each sequence in a database to identify alignments that partially or completely overlap. The more similarity there is, the higher the alignment's score. The E-value is a numerical value that captures the likelihood that the similarity is statistically significant. Alignments with an E-value below a certain threshold are output as potential matches by the algorithm. Section II of the paper describes the algorithm in more detail. The National Center for Biotechnology Information (NCBI) provides public databases of gene sequences that researchers can search using BLAST. Unfortunately, the explosive growth in the number of biological sequences poses a formidable challenge to current database-searching algorithms. In December 2013, the GenBank database, hosted by NCBI, had about 170 million sequences, and the number of bases has doubled approximately every 18 months [3], [19].
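For intuition, BLAST's E-value follows the Karlin–Altschul statistics: the expected number of chance alignments with score at least S between a query of length m and a database of total length n falls exponentially in S. A minimal sketch (the constants K and lam below are illustrative placeholders; the real values depend on the scoring matrix and gap penalties):

```python
import math

def evalue(score, m, n, K=0.041, lam=0.267):
    """Karlin-Altschul E-value: expected number of chance alignments
    with raw score >= `score` between a query of length m and a
    database of total length n. K and lam are placeholder statistical
    parameters, not values for any particular scoring scheme."""
    return K * m * n * math.exp(-lam * score)
```

Because the E-value falls exponentially with the alignment score, a fixed cutoff such as E < 1e-5 admits only strongly similar sequences.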

Given the exponential growth in the size of sequence databases, and the requirement to query longer sequences, current database-searching algorithms struggle to provide alignment and search results in a timely manner. Early parallel BLAST implementations [5], [7] exploited coarse-grained parallelism: individual queries can be processed simultaneously against the same database. However, while such parallelism improves throughput, it does not help an individual researcher with a single query. For example, a BLAST job with a query sequence of 100,000 contiguous fragments (i.e., contigs, or overlapping sequenced data reads) BLASTed against the nonredundant (NR) nucleotide database could take 70 days [30]! To provide genomics researchers with reasonable latency for their searches, exploiting additional parallelism has become a necessity.

The most popular open-source parallelization of BLAST is mpiBLAST, which uses, unsurprisingly, MPI to run BLAST in parallel on clusters [8]. mpiBLAST adopts a natural parallelization strategy. Because BLAST compares the input query against each sequence in the database separately, parallelism can be exploited by performing multiple such comparisons concurrently. mpiBLAST thus shards the database into multiple pieces, each containing a subset of the database's sequences, and distributes the shards across the computational nodes in the cluster. These shards can then be searched independently and simultaneously for alignments with the input query.
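The database-segmentation idea can be sketched as follows. This is a simplified illustration that treats sequences as plain strings and uses greedy bin-packing to keep shard sizes roughly equal; mpiBLAST's actual formatdb-based partitioning differs in detail:

```python
def shard_database(sequences, num_shards):
    """Greedy bin-packing sketch of database segmentation: assign each
    sequence (longest first) to the currently lightest shard so shards
    end up with roughly equal total length. Each shard can then be
    searched against the query independently, on a different node."""
    shards = [[] for _ in range(num_shards)]
    loads = [0] * num_shards
    for seq in sorted(sequences, key=len, reverse=True):
        i = loads.index(min(loads))   # lightest shard so far
        shards[i].append(seq)
        loads[i] += len(seq)
    return shards
```

Balancing by total sequence length, rather than sequence count, matters because search time grows with the amount of database text scanned, not the number of entries.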

Unfortunately, while mpiBLAST can exploit parallelism by sharding large databases, and even by processing multiple input queries in parallel, it has significant limitations for many biological use cases. In long sequence alignment, a long input query is matched against a database. Such use cases are becoming increasingly common. With the rapid expansion of next generation sequencing technologies, the number of organisms whose entire genomes are being sequenced has been growing at a rapid pace. Once a genome is sequenced, it is annotated, which involves (among other processes) comparing the newly-sequenced genome, or parts thereof, with that of a closely-related organism or with the expansive NT database, to establish the evolutionary relations of this newly-sequenced organism. This results in large queries, with the upper bound being the size of the entire genome, which can be millions of nucleotides.

In this scenario, mpiBLAST runs out of parallelization opportunities. There is but one input sequence, so parallelism across multiple simultaneous queries is impossible. And increasing the number of database shards to increase parallelism suffers from diminishing returns: even if the database contains enough sequences to profitably create additional shards, additional shards increase scheduling overhead as well as the time required to aggregate the output from each query-shard work unit.
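The diminishing returns can be captured by a toy cost model in which search time shrinks with the shard count while per-shard scheduling and output-aggregation overheads grow with it. The constants here are illustrative, not measured mpiBLAST values:

```python
def total_time(work, shards, overhead_per_shard):
    """Toy cost model for database segmentation: search time shrinks
    as work / shards, but scheduling and result-merging costs grow
    linearly with the shard count, so adding shards eventually makes
    the search slower rather than faster."""
    return work / shards + overhead_per_shard * shards
```

With work = 10^6 and an overhead of 2 time units per shard, this model's optimum sits near sqrt(work / overhead), about 700 shards; past that point, every extra shard adds more overhead than it removes search time.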

Moreover, mpiBLAST’s parallelization strategy can lead to severe load imbalance with large queries, or with queries of very different sizes. If a query sequence is long, or has many matches with a particular database sequence, it will take a long time to process, while a short query sequence, or one with little similarity to a database sequence, can be completed much faster. As a result, the execution time of different query-shard work units can vary significantly, a problem that is only exacerbated as queries get longer [26], [12]. Further, it is difficult to predict the running time of a unit of work from simple metrics such as the length of the query [12]. Consequently, the static load-balancing approach of mpiBLAST tends to create severe load imbalances among the nodes processing different work units, as we show experimentally in our evaluation.

To address these concerns, we propose Orion, a new parallel BLAST implementation that exploits finer-grained parallelism than mpiBLAST, achieving both more parallelism in the face of long sequences as well as better load balance. The key insight behind Orion is that a single, long query sequence need not be matched against a database sequence serially; instead, the query can be fragmented into sub-queries (which we call “query fragments”), each of which can be matched against the database independently and in parallel. Figure 1 captures the various levels of parallelism inherent in sequence alignment. The early approaches to sequence alignment primarily targeted the lowest level, processing multiple queries in parallel against the entire database, while mpiBLAST exploits the two lowest levels, processing the same query against different database shards simultaneously. Orion exploits all levels of parallelism: inter-query, intra-database, and intra-query.

In Orion, we limit the size of the overlap between fragments by analyzing the input parameters, such as the scoring thresholds and mismatch penalties in the BLAST algorithm, and we employ a novel extension and aggregation strategy to avoid missing alignments. Our fragmenting strategy ensures that in practice there is no loss in accuracy, i.e., every sequence that would be matched successfully by BLAST will also be matched successfully by Orion. At the same time, the overlaps are not so large as to eliminate the scope for intra-query parallelism.
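The fragmentation idea can be sketched as follows. This is a simplified illustration: Orion derives the fragment length and overlap from an analytical model of BLAST's scoring parameters, whereas here they are taken as fixed inputs:

```python
def fragment_query(query, frag_len, overlap):
    """Split a long query into overlapping fragments, recording each
    fragment's global start offset. Consecutive fragments share
    `overlap` characters so that short alignments near a boundary are
    fully contained in at least one fragment; each fragment can then
    be searched against the database independently and in parallel."""
    if frag_len <= overlap:
        raise ValueError("fragment length must exceed the overlap")
    step = frag_len - overlap
    frags, start = [], 0
    while start < len(query):
        frags.append((start, query[start:start + frag_len]))
        if start + frag_len >= len(query):
            break                      # last fragment reaches the end
        start += step
    return frags
```

Because each fragment carries its global offset, hits reported in fragment-local coordinates can later be mapped back onto the original query for aggregation.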

We introduce three chief novelties:
1) We develop an analytical model based on BLAST’s scoring formula that identifies the optimal fragmentation strategy, avoiding redundant work.
2) We introduce a speculative extension strategy that allows alignments that may cross query fragment boundaries to be identified.
3) We build an aggregation algorithm that combines full and partial alignments from each fragment to generate a final set of alignments that matches the original sequential algorithm.
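The aggregation step (novelty 3) can be illustrated by merging partial alignments that abut or overlap in global query coordinates against the same database sequence. This sketch merges coordinate spans only; the actual algorithm also reconciles scores and E-values across fragment boundaries:

```python
def aggregate(hits):
    """Merge partial alignments from different query fragments.
    Each hit is (db_seq_id, global_query_start, global_query_end);
    overlapping or touching spans on the same database sequence are
    combined into a single alignment, as fragment-boundary hits
    would be in the full aggregation algorithm."""
    by_db = {}
    for db_id, qs, qe in hits:
        by_db.setdefault(db_id, []).append((qs, qe))
    merged = []
    for db_id, spans in sorted(by_db.items()):
        spans.sort()
        cur_s, cur_e = spans[0]
        for s, e in spans[1:]:
            if s <= cur_e:                  # overlaps/abuts: extend
                cur_e = max(cur_e, e)
            else:                           # gap: emit and restart
                merged.append((db_id, cur_s, cur_e))
                cur_s, cur_e = s, e
        merged.append((db_id, cur_s, cur_e))
    return merged
```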

We parallelize and implement our algorithm using the Hadoop MapReduce framework, and demonstrate that our algorithm yields better parallelization, performance and load balance than mpiBLAST, while producing the same results.
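Conceptually, the Hadoop implementation maps each (query fragment, database shard) work unit independently and reduces the per-fragment hits into final alignments. A tiny in-process stand-in for the map/shuffle/reduce phases (the function names and the hit-counting example are hypothetical illustrations, not Orion's actual job layout):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Minimal single-process model of Hadoop's phases: the mapper
    emits (key, value) pairs per record, the shuffle groups values
    by key, and the reducer combines each group. In Orion's setting,
    a map task searches one fragment-shard pair and the reduce phase
    plays the role of the alignment-aggregation step."""
    groups = defaultdict(list)
    for rec in records:                          # "map" phase
        for key, value in mapper(rec):
            groups[key].append(value)
    return {k: reducer(k, vs) for k, vs in groups.items()}  # "reduce"

# Hypothetical example: count hits per database sequence across fragments.
work_units = [("frag0", ["dbA", "dbB"]), ("frag1", ["dbA"])]
hit_counts = run_mapreduce(
    work_units,
    mapper=lambda rec: [(db, 1) for db in rec[1]],
    reducer=lambda key, vals: sum(vals),
)
```

Framing the search this way lets the framework handle scheduling and data movement, so the dynamic assignment of fragment-shard work units also mitigates the static load imbalance described above.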

Download Software

You can download the Orion software from GitHub:
https://github.com/purdue-dcsl/Orion

Future Work

We are looking to leverage our two key insights for parallelizing computational genomics applications to parallelize other applications beyond our initial target of genomic sequence alignment. We observe that these applications have parallelism in two dimensions: the first is in repeated invocations of the algorithm for matching different “queries” (we call this inter-query parallelism, or coarse-grained parallelism (CGP)), and the second is the scope to match different parts of an individual query in parallel (we call this intra-query parallelism, or fine-grained parallelism (FGP)).

We are looking to model the performance of these applications at large scales – large process counts plus large data sizes. This will enable us to run them at large scales and detect and diagnose any performance problems automatically.

Students

  • Kanak Mahadik, kmahadik AT purdue DOT edu
  • Bowen Zhou, bzhou AT purdue DOT edu

Funding Source

National Science Foundation, Division of Computer and Communication Foundations (CCF) – Exploiting Parallelism and Scalability (XPS), “On the Hunt for Correctness and Performance Bugs in Large-scale Programs,” Proposal No. CCF-1337158, September 2013-September 2015.

Last modified: August 24, 2015