A clinical diagnostics team required optimisation of their Lyme disease pathogen detection pipeline to address critical limitations affecting diagnostic accuracy. The pipeline needed to identify and quantify DNA from Borrelia species and 53 other Lyme-associated pathogens across blood, urine, and cerebrospinal fluid samples. Technical challenges included managing highly fragmented DNA in urine samples, distinguishing true pathogen signals from contamination, and achieving reliable detection of co-infections whilst maintaining specificity. The clinical imperative was monitoring treatment responses following provocative therapy, which is designed to provoke latent Lyme infections into detectable active states.
Review legacy pipeline architecture and identify technical limitations affecting diagnostic performance
Improve pathogen detection sensitivity whilst maintaining clinical specificity
Expand reference database to include post-2019 strain discoveries and international variants
Integrate comprehensive BLAST analysis for ambiguous sequence identifications
Implement robust low-complexity read filtering to reduce technical noise
Establish automated workflow infrastructure with parallel processing capability
Design contamination detection protocols for sample quality assurance
Conducted comprehensive assessment of legacy pipeline processing 762 amplicon targets across 53 pathogenic species including Borrelia burgdorferi, Babesia, Bartonella, Ehrlichia, Anaplasma, and tick-borne viruses
Performed off-target binding analysis by BLAST-aligning amplicon flanking sequences against viral and prokaryotic databases
Identified critical low-complexity read contamination comprising long AG and CT tandem repeats
Evaluated diagnostic concordance issues between molecular detection and Western Blot serology
Rewrote complete workflow in Python utilising Pysam package for efficient SAM/BAM file processing with FASTA indexing
Implemented stringent low-complexity read filtering removing tandem repeat sequences
Tightened quality control with fastp parameters stricter than the legacy Q20/Q30 thresholds
Integrated NCBI BLAST API functionality for reads with alignment scores below 40, enabling comparison against comprehensive pathogen databases
Configured SLURM scheduler for parallel sample processing and concurrent BLAST queries
Developed R scripts for human-readable BLAST result interpretation and clinical reporting
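The low-complexity filtering step described above can be illustrated with a minimal sketch. It assumes a read is discarded when long AG or CT tandem repeats cover more than a set fraction of its bases. The cut-offs `MIN_REPEAT_UNITS` and `MAX_REPEAT_FRACTION` and the function names are hypothetical; the case study does not state the values or the exact rule the pipeline uses.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical cut-offs: the case study does not report the values it used.
MIN_REPEAT_UNITS = 10       # consecutive AG or CT copies that count as a long repeat
MAX_REPEAT_FRACTION = 0.5   # reads above this share of repeat bases are discarded

# One dinucleotide followed by at least MIN_REPEAT_UNITS - 1 further copies of itself.
# GAGA... and TCTC... runs are caught as well, offset by one base.
_TANDEM = re.compile(r"(AG|CT)\1{%d,}" % (MIN_REPEAT_UNITS - 1))


def repeat_fraction(seq: str) -> float:
    """Return the fraction of bases covered by long AG or CT tandem repeats."""
    if not seq:
        return 0.0
    covered = sum(m.end() - m.start() for m in _TANDEM.finditer(seq.upper()))
    return covered / len(seq)


def is_low_complexity(seq: str) -> bool:
    """True when tandem repeats dominate the read."""
    return repeat_fraction(seq) > MAX_REPEAT_FRACTION


def filter_reads(reads: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only the (name, sequence) pairs that pass the filter."""
    for name, seq in reads:
        if not is_low_complexity(seq):
            yield name, seq


example = [
    ("repeat_read", "AG" * 30 + "ACGTTGCA"),
    ("pathogen_like", "ACGTTGCAAGCTTGACCGTA" * 3),
]
print([name for name, _ in filter_reads(example)])
```

In a Pysam-based workflow the same check could be applied to each read's `query_sequence` before alignment results are counted; that wiring is left out here to keep the sketch dependency-free.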
Maintained dual-threshold host depletion using hg38 reference with MAPQ ≥30 and alignment score <30 criteria
Implemented multi-region alignment confidence scoring requiring hits to multiple amplicon targets
Created three-tier reporting classification system: confirmed matches (high-confidence exact alignments), potential matches (requiring BLAST validation), and novel strain detection (previously uncharacterised organisms)
Built automated contamination flagging for Enterococcus and other non-target species
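As a rough illustration of multi-region confidence scoring, the sketch below groups alignment hits by organism and applies the alignment-score cut-offs the case study reports (AS ≥ 40 for high confidence; 30 ≤ AS < 40 with at least two amplicons for medium confidence). The `Hit` record, the `unresolved` label, and the assumption that a single AS ≥ 40 hit is enough for a high-confidence call are illustrative choices, not the pipeline's documented behaviour.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One read-to-amplicon alignment (hypothetical record layout)."""
    organism: str
    amplicon: str
    alignment_score: int


HIGH_AS = 40               # high-confidence cut-off reported in the case study
MEDIUM_AS = 30             # lower bound of the medium-confidence band
MIN_AMPLICONS_MEDIUM = 2   # distinct amplicons required for a medium call


def classify(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return a confidence label per organism from its amplicon hits."""
    high: Dict[str, Set[str]] = defaultdict(set)
    medium: Dict[str, Set[str]] = defaultdict(set)
    for hit in hits:
        if hit.alignment_score >= HIGH_AS:
            high[hit.organism].add(hit.amplicon)
        elif hit.alignment_score >= MEDIUM_AS:
            medium[hit.organism].add(hit.amplicon)
        # Hits below MEDIUM_AS are ignored in this sketch.

    calls: Dict[str, str] = {}
    for organism in set(high) | set(medium):
        if high.get(organism):
            calls[organism] = "high"
        elif len(medium.get(organism, ())) >= MIN_AMPLICONS_MEDIUM:
            calls[organism] = "medium"
        else:
            # Assumption: single medium-band hits go to BLAST or manual review.
            calls[organism] = "unresolved"
    return calls


example_hits = [
    Hit("Borrelia burgdorferi", "amp1", 45),
    Hit("Babesia", "amp7", 33),
    Hit("Babesia", "amp9", 35),
    Hit("Bartonella", "amp3", 31),
]
print(classify(example_hits))
```

Counting distinct amplicons rather than reads means that many reads piling onto one region cannot promote an organism to a medium-confidence call on their own.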
Configured Apache Airflow for hourly automated workflow scheduling
Deployed provisional infrastructure on AWS EC2 instance pending permanent hosting arrangements
Deposited complete codebase on GitHub repository with comprehensive documentation
Established failed read retention protocols enabling manual review of ambiguous results
Raised mapping rate from 5-10% to 40% (a four- to eight-fold increase) through low-complexity filtering
Expanded pathogen coverage to include international Borrelia species complex members addressing travel-related exposures
Integrated post-2019 strain discoveries, extending temporal coverage six years beyond the legacy database
Implemented multi-amplicon validation requiring concordant hits across multiple genome regions
Established confidence thresholds: high confidence (AS≥40, ~98% positive predictive value), medium confidence (30≤AS<40 requiring ≥2 amplicons)
Created automated quality metrics tracking read length distributions, GC content, and base quality profiles
Enabled parallel processing of multiple samples concurrently through SLURM integration
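The quality metrics listed above (read length distribution, GC content, and base quality) can be computed per sample in a few lines. The following is a minimal sketch only: it assumes reads arrive as (sequence, Phred quality list) pairs, and the function name and returned keys are hypothetical, not the pipeline's actual interface.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, List[int]]  # (sequence, per-base Phred qualities)


def summarise_reads(reads: Iterable[Read]) -> Dict[str, object]:
    """Aggregate read-length histogram, GC fraction and mean base quality."""
    length_hist: Counter = Counter()
    total_bases = 0
    gc_bases = 0
    quality_sum = 0

    for seq, quals in reads:
        seq = seq.upper()
        length_hist[len(seq)] += 1
        total_bases += len(seq)
        gc_bases += seq.count("G") + seq.count("C")
        quality_sum += sum(quals)

    return {
        "read_count": sum(length_hist.values()),
        "length_histogram": dict(length_hist),
        # Guard against empty input so the ratios stay defined.
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
        "mean_base_quality": quality_sum / total_bases if total_bases else 0.0,
    }


sample = [
    ("ACGT", [30, 30, 30, 30]),
    ("GGCC", [40, 40, 20, 20]),
    ("AT", [10, 10]),
]
print(summarise_reads(sample))
```

Tracking these values per sample and per run makes drifts visible, for example the shorter read lengths expected from fragmented urine-derived DNA.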
Addressed fundamental sensitivity limitations through comprehensive BLAST integration expanding detection beyond constrained local reference database
Enhanced specificity by implementing rigorous low-complexity filtering eliminating technical artefacts masquerading as pathogen signals
Provided contamination detection capability identifying sample integrity issues requiring reprocessing
Established framework for novel strain identification enabling detection of previously uncharacterised Borrelia variants
"The updated PathoDNA pipeline from Umbizo represents a significant step forward in addressing the complex diagnostic challenges of Lyme disease. The integration of BLAST functionality with stringent quality filtering has substantially improved our confidence in the results, whilst the longitudinal tracking capabilities provide unprecedented insights into treatment responses. The contamination detection features have proven particularly valuable in identifying sample processing issues before they impact clinical decisions."
Client Clinical Diagnostics Team
Legacy Pipeline Assessment and Issue Identification: 2 weeks
Off-target Analysis and Database Updates: 1 week
Core Pipeline Rewrite in Python with Pysam Integration: 2 weeks
BLAST API Integration and R Script Development: 1 week
Quality Control Enhancement and Low-complexity Filtering: 1 week
SLURM Configuration and Automation Infrastructure: 1 week
Reporting System Development and Visualisation: 1 week
Documentation and GitHub Repository Preparation: 1 week
Total Project Duration: 10 weeks
Medium-term objectives encompass Laboratory Developed Test regulatory pathway validation including analytical validation studies, precision and accuracy assessments, and clinical utility documentation. Longitudinal integration studies will correlate molecular results with clinical outcomes and treatment responses, validating quantitative pathogen monitoring for guiding therapeutic decisions. Long-term expansion opportunities include adaptation to additional tick-borne pathogens and investigation of machine learning approaches for pattern recognition in co-infection profiles and treatment response prediction based on accumulated longitudinal molecular data.