Publications

Pre-Prints

Tim Dunn, Justin M. Zook, James M. Holt, Satish Narayanasamy, "Jointly benchmarking small and structural variant calls with vcfdist". In Submission: Genome Biology, 2024.
PDF Code
@article{dunn2024vcfdist,
  author={Dunn, Tim and Zook, Justin M and Holt, James M and Narayanasamy, Satish},
  title={Jointly benchmarking small and structural variant calls with vcfdist},
  journal={bioRxiv},
  year={2024},
  publisher={Cold Spring Harbor Laboratory},
  doi={10.1101/2024.01.23.575922},
  URL={https://doi.org/10.1101/2024.01.23.575922}
}

Recent improvements in long-read sequencing accuracy have enabled calling phased small and structural variants from a single analysis pipeline. Despite this, the current standard tools for variant calling evaluation are only designed for either small (vcfeval) or large (Truvari) variants. In this work we extend vcfdist -- previously a small variant calling evaluator -- to evaluate structural variants, making it the first benchmarking tool to jointly evaluate phased single-nucleotide polymorphisms (SNPs), small insertions/deletions (INDELs), and structural variants (SVs) for the whole genome. We find that a joint evaluation reduces measured false negative and false positive variant calls across the board: by 28.1% for SNPs, 19.1% for INDELs, and 52.4% for SVs over 50 bases. Building on vcfdist's alignment-based evaluation, we also jointly analyze phasing accuracy. vcfdist identifies that 43% to 92% of all flip errors called by the standard phasing evaluation tool WhatsHap are false positives due to differences in variant representations.

Vishwaratn Asthana, Erika M. Nieves, Pallavi Bugga, Clara Smith, Tim Dunn, Satish Narayanasamy, J. Scott VanEpps, "Development of a rapid, culture-free, universal bacterial identification system using internal transcribed spacer targeting primers". In Submission: PNAS, 2024.

Antibiotic resistance is a significant problem of interest around the world. Numerous studies have shown that the indiscriminate administration of broad-spectrum antibiotics is a primary contributor to the increasing prevalence of antibiotic resistance. Unfortunately, culture, the gold standard for bacterial identification, is a time-intensive process, often taking 24-72 hours. Further, some organisms are unculturable, requiring unique species-specific molecular assays to identify. Due to this extended diagnostic period, broad-spectrum antibiotics are generally prescribed to prevent poor outcomes. To overcome the deficits of culture-based methods and single species-specific molecular assays, we have developed a universal bacterial identification system. The platform utilizes a unique universal polymerase chain reaction (PCR) primer set that targets the internal transcribed spacer (ITS) regions between conserved bacterial genes, creating a distinguishable amplicon signature for every bacterial species. Bioinformatic simulation demonstrates that at least 45 commonly isolated pathogenic species can be uniquely identified using this approach. We experimentally confirmed these predictions on seven representative pathogenic bacterial species, including Gram-negatives and Gram-positives, aerobes and anaerobes, and spore formers. Without a priori knowledge of the infectious organism, this system can rapidly identify the unique amplicon signature generated by multiple bacterial species in a single reaction. In addition to determining the identity of the infectious organism, the system can also determine the corresponding concentration of each pathogen. We also show that the system is resilient to human DNA contamination at physiologic concentrations, eliminating the need for complex and time-intensive extraction methods. As a proof-of-principle, we spiked pig urine and human blood with bacterial isolates and found that the system functions reliably in clinical biofluids. Finally, we confirmed that the universal bacterial identification system performed as designed in clinical urinary tract infection samples.

Journal Papers

Tim Dunn, Satish Narayanasamy, "vcfdist: Accurately benchmarking phased small variant calls in human genomes". Nature Communications Volume 14, 2023.
PDF Code
@article{dunn2023vcfdist,
  author={Dunn, Tim and Narayanasamy, Satish},
  title={vcfdist: Accurately benchmarking phased small variant calls in human genomes},
  journal={Nature Communications},
  year={2023},
  volume={14},
  number={1},
  pages={8149},
  issn={2041-1723},
  doi={10.1038/s41467-023-43876-x},
  URL={https://doi.org/10.1038/s41467-023-43876-x}
}

Accurately benchmarking small variant calling accuracy is critical for the continued improvement of human whole genome sequencing. In this work, we show that current variant calling evaluations are biased towards certain variant representations and may misrepresent the relative performance of different variant calling pipelines. We propose solutions, first exploring the affine gap parameter design space for complex variant representation and suggesting a standard. Next, we present our tool vcfdist and demonstrate the importance of enforcing local phasing for evaluation accuracy. We then introduce the notion of partial credit for mostly-correct calls and present an algorithm for clustering dependent variants. Lastly, we motivate using alignment distance metrics to supplement precision-recall curves for understanding variant calling performance. We evaluate the performance of 64 phased Truth Challenge V2 submissions and show that vcfdist improves measured insertion and deletion performance consistency across variant representations from R² = 0.97243 for baseline vcfeval to 0.99996 for vcfdist.

Tim Dunn, David Blaauw, Reetuparna Das, Satish Narayanasamy, "nPoRe: n-Polymer Realigner for improved pileup-based variant calling". BMC Bioinformatics Volume 24, 2023.
PDF Code
@article{dunn2023npore,
  title={nPoRe: n-polymer realigner for improved pileup-based variant calling},
  author={Dunn, Tim and Blaauw, David and Das, Reetuparna and Narayanasamy, Satish},
  journal={BMC Bioinformatics},
  volume={24},
  number={1},
  pages={1--21},
  year={2023},
  publisher={BioMed Central}
}

Despite recent improvements in nanopore basecalling accuracy, germline variant calling of small insertions and deletions (INDELs) remains poor. Although precision and recall for single nucleotide polymorphisms (SNPs) now regularly exceed 99.5%, INDEL recall at relatively high coverages (85x) remains below 80% for standard R9.4.1 flow cells. Current nanopore variant callers work in two stages: an efficient pileup-based method identifies candidates of interest, and then a more expensive full-alignment model provides the final variant calls. Most false negative INDELs are lost during the first (pileup-based) step, particularly in low-complexity repeated regions. We show that read phasing and realignment can recover a significant portion of INDELs lost during this stage. In particular, we extend Needleman-Wunsch affine gap alignment by introducing new gap penalties for more accurately aligning repeated n-polymer sequences such as homopolymers (n = 1) and tandem repeats (2 <= n <= 6). On our dataset with 60.6x coverage, haplotype phasing improves INDEL recall in all evaluated high confidence regions from 63.76% to 70.66% and then nPoRe realignment improves it further to 73.04%, with no loss of precision.
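
As a rough illustration of the affine gap alignment that the abstract above builds on, the sketch below computes a global alignment score with separate gap-open and gap-extend penalties (the Gotoh formulation of Needleman-Wunsch). The function name `affine_gap_score` and all scoring values are assumptions chosen for illustration. This is not nPoRe's implementation, and it does not include the n-polymer-specific gap penalties the paper introduces.

```python
NEG_INF = float("-inf")

def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Global alignment score with affine gaps (Gotoh).

    A gap of length k costs gap_open + k * gap_extend.
    All penalty values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # M: alignment ends in a match/mismatch
    # X: alignment ends with a[i-1] aligned to a gap
    # Y: alignment ends with b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Either open a new gap or extend the existing one.
            X[i][j] = max(
                M[i - 1][j] + gap_open + gap_extend,
                Y[i - 1][j] + gap_open + gap_extend,
                X[i - 1][j] + gap_extend,
            )
            Y[i][j] = max(
                M[i][j - 1] + gap_open + gap_extend,
                X[i][j - 1] + gap_open + gap_extend,
                Y[i][j - 1] + gap_extend,
            )
    return max(M[n][m], X[n][m], Y[n][m])
```

With these assumed penalties, one gap of length two is cheaper than two separate gaps of length one, which is the property that makes affine scoring a natural starting point for handling INDELs in repeated regions.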

Tim Dunn, Erdal Cosgun, "A cloud-based pipeline for analysis of FHIR and long-read data". Bioinformatics Advances, Volume 3, Issue 1, 2023. https://doi.org/10.1093/bioadv/vbac095
PDF Code
@article{dunn2023fhir,
  title={A cloud-based pipeline for analysis of FHIR and long-read data},
  author={Dunn, Tim and Cosgun, Erdal},
  journal={Bioinformatics Advances},
  volume={3},
  number={1},
  year={2023},
  month={01},
  issn={2635-0041},
  doi={10.1093/bioadv/vbac095},
  publisher={Oxford University Press}
}

As genome sequencing becomes cheaper and more accurate, it is becoming increasingly viable to merge this data with electronic health information to inform clinical decisions. In this work we demonstrate a full pipeline for working with both PacBio sequencing data and clinical FHIR data, from initial data to tertiary analysis. The electronic health records are stored in FHIR -- Fast Healthcare Interoperability Resource -- format, the current leading standard for health care data exchange. For the genomic data, we perform variant calling on long read PacBio HiFi data using Cromwell on Azure. Both data formats are parsed, processed, and merged in a single scalable pipeline which securely performs tertiary analyses using cloud-based Jupyter notebooks. We include three example applications: exporting patient information to a database, clustering patients, and performing a simple pharmacogenomic study.

Conference Papers

Yufeng Gu, Arun Subramaniyan, Tim Dunn, Alireza Khadem, Kuan-Yu Chen, Somnath Paul, Md Vasimuddin, Sanchit Misra, David Blaauw, Satish Narayanasamy, Reetuparna Das, "GenDP: A Framework of Dynamic Programming Acceleration for Genome Sequencing Analysis". 50th International Symposium on Computer Architecture (ISCA-50). 2023.
Code
@inproceedings{gu2023gendp,
  title={GenDP: a framework of dynamic programming acceleration for genome sequencing analysis},
  author={Gu, Yufeng and Subramaniyan, Arun and Dunn, Tim and Khadem, Alireza and Chen, Kuan-Yu and Paul, Somnath and Vasimuddin, Md and Misra, Sanchit and Blaauw, David and Narayanasamy, Satish and Das, Reetuparna},
  booktitle={ISCA-50: 50th International Symposium on Computer Architecture},
  year={2023}
}

Genomics is playing an important role in transforming healthcare. Genetic data, however, is being produced at a rate that far outpaces Moore's Law. Many efforts have been made to accelerate genomics kernels on modern commodity hardware such as CPUs and GPUs, as well as custom accelerators (ASICs) for specific genomics kernels. While ASICs provide higher performance and energy efficiency than general-purpose hardware, they incur a high hardware design cost. Moreover, in order to extract the best performance, ASICs tend to have significantly different architectures for different kernels. The divergence of ASIC designs makes it difficult to run commonly used modern sequencing analysis pipelines due to software integration and programming challenges.

With the observation that many genomics kernels are dominated by dynamic programming (DP) algorithms, this paper presents GenDP, a framework of dynamic programming acceleration including DPAx, a DP accelerator, and DPMap, a graph partitioning algorithm that maps DP objective functions to the accelerator. DPAx supports DP kernels with various dependency patterns, such as 1D and 2D DP tables and long-range dependencies in the graph structure. DPAx also supports different DP objective functions and precisions required for genomics applications. GenDP is evaluated on genomics kernels in both short-read and long-read analysis pipelines, achieving 157.8× throughput/mm² over GPU baselines and 132.0× throughput/mm² over CPU baselines.

Tim Dunn*, Harisankar Sadasivan*, Jack Wadden, Kush Goliya, Kuan-Yu Chen, David Blaauw, Reetuparna Das, Satish Narayanasamy, "SquiggleFilter: An Accelerator for Portable Virus Detection". 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Virtual Event, Athens, Greece, 2021. IEEE MICRO 2022 Top Picks Honorable Mention
PDF Code Short Talk Full Talk
@inproceedings{dunn2021squigglefilter,
  title={SquiggleFilter: An Accelerator for Portable Virus Detection},
  author={Dunn, Tim and Sadasivan, Harisankar and Wadden, Jack and Goliya, Kush and Chen, Kuan-Yu and Blaauw, David and Das, Reetuparna and Narayanasamy, Satish},
  booktitle={MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture},
  pages={535--549},
  year={2021}
}

The MinION is a recent-to-market handheld nanopore sequencer. It can be used to determine the whole genome of a target virus in a biological sample. Its Read Until feature allows us to skip sequencing a majority of non-target reads (DNA/RNA fragments), which constitute more than 99% of all reads in a typical sample. However, it does not have any on-board computing, which significantly limits its portability.

We analyze the performance of a Read Until metagenomic pipeline for detecting target viruses and identifying strain-specific mutations. We find new sources of performance bottlenecks (the basecaller used to classify each read) that are not addressed by past genomics accelerators.

We present SquiggleFilter, a novel hardware accelerated dynamic time warping (DTW) based filter that directly analyzes MinION's raw squiggles and filters everything except target viral reads, thereby avoiding the expensive basecalling step. We show that our 14.3 W, 13.25 mm² accelerator has 274× greater throughput and 3481× lower latency than existing GPU-based solutions while consuming half the power, enabling Read Until for the next generation of nanopore sequencers.
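
To make the filtering idea concrete, here is a minimal software sketch of textbook dynamic time warping between a raw signal and a reference signal, followed by a threshold test. The names `dtw_distance` and `keep_read`, the absolute-difference cost, and the threshold are all assumptions for illustration. The sketch says nothing about how the SquiggleFilter hardware organizes the computation or which DTW variant it uses.

```python
def dtw_distance(query, reference):
    """Classic DTW distance between two numeric sequences.

    Uses absolute difference as the local cost and keeps only two
    rows of the dynamic programming table at a time.
    """
    n, m = len(query), len(reference)
    inf = float("inf")
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned to empty prefix
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            # Extend the cheapest of the three neighbouring alignments.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def keep_read(query, reference, threshold):
    """Hypothetical filter: keep a read if it warps closely onto the reference."""
    return dtw_distance(query, reference) <= threshold
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because warping lets the repeated sample align to the same query point, which is why DTW suits signals whose timing varies from read to read.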

Arun Subramaniyan, Yufeng Gu, Tim Dunn, Somnath Paul, Md Vasimuddin, Sanchit Misra, David Blaauw, Satish Narayanasamy, Reetuparna Das, "GenomicsBench: A Benchmark Suite for Genomics". IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Virtual Event, 2021.
PDF Code
@inproceedings{subramaniyan2021genomicsbench,
  title={GenomicsBench: A Benchmark Suite for Genomics},
  author={Subramaniyan, Arun and Gu, Yufeng and Dunn, Timothy and Paul, Somnath and Vasimuddin, Md and Misra, Sanchit and Blaauw, David and Narayanasamy, Satish and Das, Reetuparna},
  booktitle={2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
  pages={1--12},
  year={2021},
  organization={IEEE}
}

Over the last decade, advances in high-throughput sequencing and the availability of portable sequencers have enabled fast and cheap access to genetic data. For a given sample, sequencers typically output fragments of the DNA in the sample. Depending on the sequencing technology, the fragments range from 150-250 bases at high accuracy to tens of thousands of bases at much lower accuracy. Sequencing data is now being produced at a rate that far outpaces Moore's law and poses significant computational challenges on commodity hardware. To meet this demand, software tools have been extensively redesigned and new algorithms and custom hardware have been developed to deal with the diversity in sequencing data. However, a standard set of benchmarks that captures the diverse behaviors of these recent algorithms and can facilitate future architectural exploration is lacking.

To that end, we present the GenomicsBench benchmark suite, which contains 12 computationally intensive data-parallel kernels drawn from popular bioinformatics software tools. It covers the major steps in short and long-read genome sequence analysis pipelines such as basecalling, sequence mapping, de novo assembly, variant calling and polishing. We observe that while these genomics kernels have abundant data level parallelism, it is often hard to exploit on commodity processors because of input-dependent irregularities. We also perform a detailed microarchitectural characterization of these kernels and identify their bottlenecks. GenomicsBench includes parallel versions of the source code with CPU and GPU implementations as applicable, along with representative input datasets of two sizes: small and large.

Talks

Tim Dunn, Satish Narayanasamy, "vcfdist: accurately benchmarking phased variant calls". University of Michigan Department of Computational Medicine and Bioinformatics (DCMB) Tools and Technology Seminar Series. Ann Arbor, Michigan, 2024.
Slides

Tim Dunn, Satish Narayanasamy, "vcfdist: accurately benchmarking phased variant calls". Genomics Deep Dives with Google Genomics. Virtual, 2023.
Slides

Tim Dunn, Satish Narayanasamy, "vcfdist: accurately benchmarking phased small variant calls in human genomes". Genome in a Bottle Consortium Meeting. Virtual, 2023.
Slides

Tim Dunn, David Blaauw, Reetu Das, Satish Narayanasamy, "nPoRe: n-Polymer Realigner for improved pileup-based variant calling". University of Michigan Department of Computational Medicine and Bioinformatics (DCMB) Tools and Technology Seminar Series. Ann Arbor, Michigan, 2022.
Slides

Tim Dunn, "A cloud-based pipeline for analysis of FHIR and long-read data". Internal Microsoft Research Talk. Virtual, 2022.
Slides

Tim Dunn*, Danika Gaviola*, Jason Rising*, "Integrating a virtual Trusted Platform Module into the OpenXT Hypervisor". Internal Assured Information Security Talk. Rome, New York, 2019.
Slides Unavailable

Tim Dunn, Daniel Thuerck, Michael Goesele, "SemiDefinite Program (SDP) Solver". Deutsche Akademische AustauschDienst (DAAD) Research Internships in Science and Engineering (RISE) Student Conference. Heidelberg, Germany, 2018.
Slides

Poster Presentations

Tim Dunn, Satish Narayanasamy, "vcfdist: accurately benchmarking phased small variant calls". Advances in Genome Biology and Technology: Precision Health Conference (AGBT). San Diego, 2023.
Poster

Tim Dunn, Erdal Cosgun, "A cloud-based pipeline for analysis of FHIR and long-read data". Cold Spring Harbor Laboratory: Biological Data Science Conference. Virtual, 2022.
Poster

Tim Dunn, David Blaauw, Reetu Das, Satish Narayanasamy, "nPoRe: n-Polymer Realigner for improved pileup-based variant calling". Oxford Nanopore Technologies: London Calling. Virtual, 2022.
Poster

Tim Dunn, Sean Banerjee, "Using GPUs to Mine Large Scale Software Problem Repositories". Clarkson University Symposium on Undergraduate Research Experiences (SURE). 2016. Best Poster Presentation in EE & CS, Honorable Mention Oral Presentation
Poster Slides

Workshop Papers

Tim Dunn, Sean Banerjee, Natasha Banerjee, "User-Independent Detection of Swipe Pressure using a Thermal Camera for Natural Surface Interaction". IEEE 20th International Workshop on Multimedia Signal Processing (MMSP). Vancouver, Canada, 2018. Top 5% Paper, MMSP 2018
PDF Code Slides
@inproceedings{dunn2018user,
  title={User-independent detection of swipe pressure using a thermal camera for natural surface interaction},
  author={Dunn, Tim and Banerjee, Sean and Banerjee, Natasha Kholgade},
  booktitle={2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP)},
  pages={1--6},
  year={2018},
  organization={IEEE}
}

In this paper, we use a thermal camera to distinguish hard and soft swipes performed by a user interacting with a natural surface by detecting differences in the thermal signature of the surface due to heat transferred by the user. Unlike prior work, our approach provides swipe pressure classifiers that are user-agnostic, i.e., that recognize the swipe pressure of a novel user not present in the training set, enabling our work to be ported into natural user interfaces without user-specific calibration. Our approach generates average classification accuracy of 76% using random forest classifiers trained on a test set of 9 subjects interacting with paper and wood, with 8 hard and 8 soft test swipes per user. We compare results of the user-agnostic classification to user-aware classification with classifiers trained by including training samples from the user. We obtain average user-aware classification accuracy of 82% by adding up to 8 hard and 8 soft training swipes for each test user. Our approach enables seamless adaptation of generic pressure classification systems based on thermal data to the specific behavior of users interacting with natural user interfaces.

Tim Dunn, Natasha Banerjee, Sean Banerjee, "GPU Acceleration of Document Similarity Measures for Automated Bug Triaging". 1st International Workshop on Software Faults (IWSF). Ottawa, Canada, 2016.
PDF Code
@inproceedings{dunn2016gpu,
  title={GPU acceleration of document similarity measures for automated bug triaging},
  author={Dunn, Tim and Banerjee, Natasha Kholgade and Banerjee, Sean},
  booktitle={2016 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)},
  pages={140--145},
  year={2016},
  organization={IEEE}
}

Large-scale open source software bug repositories from companies such as Mozilla, RedHat, Novell and Eclipse have enabled researchers to develop automated solutions to bug triaging problems such as bug classification, duplicate classification and developer assignment. However, despite the repositories containing millions of usable reports, researchers utilize only a small fraction of the data. A major reason for this is the polynomial time and cost associated with making comparisons to all prior reports. Graphics processing units (GPUs) with several thousand cores have been used to accelerate algorithms in several domains, such as computer graphics, computer vision and linguistics. However, they have remained unexplored in the area of bug triaging.

In this paper, we demonstrate that the problem of comparing a bug report to all prior reports is an embarrassingly parallel problem that can be accelerated using graphics processing units (GPUs). Comparing the similarity of two bug reports can be performed using frequency based methods (e.g. cosine similarity and BM25F), sequence based methods (e.g. longest common substring and longest common subsequence) or topic modeling. For the purpose of this paper we focus on cosine similarity, longest common substring and longest common subsequence. Using an NVIDIA Tesla K40 GPU, we show that frequency and sequence based similarity measures are accelerated by 89 and 85 times respectively when compared to a pure CPU based implementation. This allows us to generate similarity scores for the entire Eclipse repository, consisting of 498,161 reports, in under a day, as opposed to 83.4 days using a CPU based approach.
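
The three similarity measures named above can be sketched on the CPU as follows. This is a plain Python illustration, not the paper's GPU code: the function names, the whitespace tokenization, and the use of raw term counts for cosine similarity are assumptions. Each pairwise comparison is independent of every other pair, which is what makes the all-pairs problem embarrassingly parallel.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of raw term-count vectors (whitespace tokens)."""
    ta = Counter(doc_a.lower().split())
    tb = Counter(doc_b.lower().split())
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm_a = math.sqrt(sum(c * c for c in ta.values()))
    norm_b = math.sqrt(sum(c * c for c in tb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def longest_common_substring(a, b):
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def longest_common_subsequence(a, b):
    """Length of the longest (not necessarily contiguous) shared subsequence."""
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]
```

Both sequence measures accept any indexable sequence, so they can be run on characters or on lists of tokens.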

Undergraduate Honors Thesis

Tim Dunn, "Detection of Swipe Pressure using a Thermal Camera and ConvNets for Natural Surface Interaction". Clarkson University Honors Program. Potsdam, New York, 2019.
PDF Code

In this paper, I present a system for reliably distinguishing between two levels of applied finger pressure on planar surfaces using a thermal camera. This work is the first to do so without requiring prior per-user calibration, and will enable arbitrary natural materials to be used as touchscreen surfaces in augmented reality applications. Two approaches were explored during this research for swipe pressure identification, which took place over the Spring and Fall Semesters of 2018. The first approach used morphological filters and supplied handcrafted features as input to a random forest classifier. The second approach used convolutional neural networks to classify both raw and filtered video data using several approaches. It was found that convolutional neural networks could only consistently outperform the random forest classifier when the same morphological filtering had been applied on the input videos.