Genomics is an important area of study that holds the promise of unraveling mysteries of how our genes guide our lives and how personalized gene therapies may be used to treat various medical conditions. The increasing adoption of massively-parallel sequencing technologies, known as next generation sequencing (NGS) technologies, has resulted in a situation where computation has become the bottleneck in our ability to ingest all the genetic information and to synthesize actionable knowledge from this information. After the sequencing has been done, computational codes are also required for analysis and interpretation to extract biologically-significant data.
The approach that computational scientists are pursuing is, unsurprisingly, parallelization of the computational genomics applications. There are some obvious sources of parallelism in many of these applications, e.g., in the DNA sequence alignment algorithms, each query (basically a string of DNA base pairs) in a set of queries can be matched against the database (a reference corpus of genetic information) in parallel. However, the solution approaches to parallelization have predominantly been point solutions for individual algorithms. These point solutions have run into bottlenecks even when run at moderate scales. For example, in our empirical measurements with a parallel implementation of BLAST, the de facto application to match nucleic acid sequences to databases of known sequences, called mpiBLAST, we found that it took an hour and a half to match a single query (albeit a large query of 263K base pairs) when run on 120 cores.
Further, for a number of significant problems, parallelization is in its infancy, e.g., for RNA structure determination. Finally, as the scale of the genomic information grows at a rapid pace, the computational codes have to deal with ever increasing scales, where scale implies larger process counts and larger data sizes.
We are developing a principled way to extract parallelism from applications. We leverage the approximate nature of some of these applications (e.g., the DNA sequence match does not need to be exact) to aid in the parallelism. We have had the insight that most applications have at their heart two fundamental algorithms, string matching and graph matching, and have identified a coarse-grained, inter-query parallelism and a fine-grained intra-query parallelism inherent in both. The quality of the scaling model determines how well our system can predict the behavior of a program at large scales. We will extend existing modeling techniques that can capture a wide variety of scaling behaviors and automatically select which program behaviors to track. We will also model the data dependence that many of the target algorithms exhibit. We will use the model to detect and diagnose performance issues that often crop up in the computational genomics applications at large data sizes.