Information Wellness Blog

Detailed Reviews and Guides about energy and informational health and wellness

Challenges of Whole Genome Sequencing

Clinical WGS faces many unique challenges. Disease-causing variants may be difficult to identify due to strand bias and repetitive sequences; furthermore, certain sites contain large insertions/deletions not represented in reference genomes.

The sensitivity of clinical WGS depends on a variety of experimental parameters, such as DNA input and library preparation method. To evaluate these, five NA12878 samples were sequenced on MGISEQ-2000 sequencers using different DNA inputs and both PCR-based and PCR-free library preparation protocols, and the results were compared.


Genome sequencing technology has transformed medical research and discovery and has helped reduce the costs of genetic testing and treatment. As the technology advances, more individuals will have their genomes sequenced. This points to a shift in health care from treatment toward prevention: identifying pre-disease risk factors allows patients to take preventative steps that reduce the need for costly medical interventions.

The genomic revolution is providing healthcare systems with tools to identify potentially costly diseases, like cancer, early. This allows healthcare systems to improve patient outcomes while cutting costs by limiting hospitalizations, chronic therapy and unnecessary tests and treatments. Yet the challenge remains of creating an affordable genomic data ecosystem; success will come to those who combine technological infrastructure, clinical decision support systems and genetic counseling into one package so everyone can reap its benefits.

An essential component of this paradigm shift is the creation of human and animal genomic reference databases. Such reference genomes are essential to translational genomics and precision medicine. They allow disease-causing variants to be identified, which improves genetic counseling and the clinical validity of genomic testing, and they help predict the effects of variants on an individual's genotype and phenotype, which is the basis of personalized medicine.

At present, available whole genome sequencing (WGS) and whole exome sequencing (WES) data can be analyzed effectively using existing reference populations and various bioinformatics tools. One example is gnomAD, a large database of human genetic variation with high coverage of coding regions. That coverage allows accurate gene mapping and rare variant detection without recourse to imputation methods, which have proven ineffective for recessive variants.

Numerous WGS and WES sequencing technologies are now available to generate this reference data, such as Nextera-based whole genome sequencing or an economical augmented exome capture assay (AEC). AEC has proven comparable to standard exome sequencing kits while offering higher coverage of coding regions.


WGS has rapidly advanced from a research tool to a clinical diagnostic test in recent years, making accuracy vital. Initial evaluations revealed significant variation between exome capture kits, sequencing platforms, aligners, and variant callers in SNV and indel genotype calls; as a result, technical benchmarks and variation-evaluation methods must be established promptly.

In response to these concerns, WGS data from the MGISEQ-2000 platform have been evaluated under different experimental parameters, such as library preparation protocol, DNA input amount and read length, to measure data quality, determine sensitivity, and assess coverage of disease-related genes and CNVs.

WGS results depend on a range of variables, such as the depth and accuracy of sequence coverage, the complexity of the reference genome, and the alignment algorithm. Several of these are determined by the choice of bioinformatics pipeline: the Burrows-Wheeler Aligner (BWA) is commonly used to align raw sequencing reads, while the Genome Analysis Toolkit (GATK) is widely used to identify SNVs and indels.

Recent research examined the sensitivity of WGS for variant detection across 8394 disease-related genes with differing DNA inputs, comparing PCR-based and PCR-free WGS approaches on the same genome. PCR-free WGS outperformed PCR-based WGS in the distributions of read depth (DP) and genotype quality (GQ), in SNV and indel detection rates, and in breadth of coverage across disease-related genes and CNVs.
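To make the comparison concrete, detection sensitivity can be computed as the share of known (truth-set) variants that a call set recovers. The following is a minimal sketch, not the study's actual method; the variant tuples, the `detection_sensitivity` helper and the toy data are all assumptions made up for illustration.

```python
from typing import NamedTuple, Set, Tuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


class Sensitivity(NamedTuple):
    true_positives: int
    false_negatives: int
    sensitivity: float


def detection_sensitivity(truth: Set[Variant], calls: Set[Variant]) -> Sensitivity:
    """Sensitivity = TP / (TP + FN), measured against a truth set."""
    if not truth:
        raise ValueError("truth set is empty")
    tp = len(truth & calls)   # truth variants that were called
    fn = len(truth - calls)   # truth variants that were missed
    return Sensitivity(tp, fn, tp / (tp + fn))


# Toy data, invented for illustration only.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr2", 900, "T", "C"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "GA"),
    ("chr3", 10, "A", "T"),   # not in the truth set, so it does not affect sensitivity
}

result = detection_sensitivity(truth, calls)
print(result)
```

Running the same function on a PCR-based and a PCR-free call set for the same sample would give two directly comparable numbers.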

Filtering can also affect sensitivity. For instance, strand bias filters have been reported to account for up to 68% of false positives, whether used alone or combined with other filters; they may also produce false negative results in patients who carry high-confidence genotypes.
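For readers unfamiliar with strand bias filters, here is a rough sketch of the idea: count reads supporting the reference and alternate alleles on each strand, run Fisher's exact test on the resulting 2x2 table, and express the p-value on a Phred scale. This is an illustration under stated assumptions, not the implementation used by any particular variant caller; the function names and read counts are hypothetical.

```python
import math


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = math.comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


def strand_bias_phred(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Phred-scaled p-value: higher means stronger evidence of strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return max(0.0, -10.0 * math.log10(max(p, 1e-300)))


# Hypothetical read counts: one balanced site and one where the
# alternate allele appears on the forward strand only.
balanced = strand_bias_phred(20, 20, 10, 10)
biased = strand_bias_phred(20, 20, 20, 0)
print(f"balanced site: {balanced:.1f}, biased site: {biased:.1f}")
```

A filter would discard calls whose score exceeds some cutoff. Where that cutoff sits decides how many real variants are thrown away along with the artifacts, which is the trade-off described above.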

The MGISEQ-2000 platform was also used to assess the sensitivity of a PCR-free WGS pipeline. The pipeline identified 19 previously confirmed variants across 11 clinical cases. Its sensitivity was further improved by a lower discard rate, meaning fewer variants were removed from the final variant set for failing stringent inclusion criteria.


WGS and exome sequencing (WES) fail to cover many disease-associated variants at sufficient depth. These mutations may lie within genes with known transcripts or in non-genic regions whose role is unknown. We conducted WGS and WES experiments, drawing on data compiled from 5 databases, to assess breadth of coverage and to investigate how different experimental parameters affect both sensitivity and breadth of coverage.

In this study, we utilized the MGISEQ-2000 to perform WGS on five NA12878 samples that varied in terms of DNA input and library preparation protocol. We evaluated their performance in terms of depth of coverage, genotype quality (GQ) distribution, SNP/indel detection sensitivity for disease-related genes as well as copy number variations (CNVs). Furthermore, we developed and implemented an automated (PCR-free) WGS pipeline and evaluated it on 11 clinical cases.
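Depth and breadth of coverage are simple to compute once per-base depths are available. The sketch below assumes depths have already been extracted into plain lists; the gene names, depth values and thresholds are invented for illustration and do not come from the study.

```python
from typing import Dict, Sequence


def mean_depth(depths: Sequence[int]) -> float:
    """Average number of reads covering each position."""
    if not depths:
        raise ValueError("no positions supplied")
    return sum(depths) / len(depths)


def breadth_of_coverage(depths: Sequence[int], min_depth: int) -> float:
    """Fraction of positions covered by at least min_depth reads."""
    if not depths:
        raise ValueError("no positions supplied")
    return sum(d >= min_depth for d in depths) / len(depths)


# Hypothetical per-base depths for one region.
depths = [0, 5, 12, 30, 31, 28, 9, 40, 22, 18]
print(mean_depth(depths))               # average depth across the region
print(breadth_of_coverage(depths, 10))  # share of bases at 10x or more
print(breadth_of_coverage(depths, 20))  # share of bases at 20x or more

# The same measure per gene, using made-up gene labels.
per_gene: Dict[str, Sequence[int]] = {
    "GENE_A": [30, 32, 29, 31],
    "GENE_B": [3, 0, 12, 8],
}
poorly_covered = [
    gene for gene, d in per_gene.items() if breadth_of_coverage(d, 10) < 0.9
]
print(poorly_covered)
```

Reporting breadth at several thresholds, rather than mean depth alone, shows whether a disease-related gene has stretches too thinly covered for reliable calling.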

Our results indicate that most false negative variants (FNVs) in WES and WGS calls are caused by filtering. More specifically, in Platypus 87 percent of FNVs were eliminated because they resided in difficult-to-sequence or difficult-to-call regions, where they triggered filters for low base quality, high allele frequency, homopolymers longer than 10 bp, variant quality below the threshold of 20, too many haplotypes, sequence context or strand bias. The strand bias filter had the highest error rate, with an estimated 68 percent of false positives attributable to this filter alone.

As genomic technologies transition from research tools to clinical diagnostic tests, their accuracy must remain uncompromised so that medically relevant variants are detected reliably. Unfortunately, several barriers still stand in the way of adopting WGS for medical diagnosis; one of them is the lack of data on, and comprehensive understanding of, its technical limitations.


The human genome poses formidable technical difficulties for genomic sequencing. It contains 50-69% repetitive sequence, such as transposable elements and low-complexity regions (for example homopolymers). Large insertions, deletions, and rearrangements that are not represented in the reference genome add to global mapping and alignment problems and may lead to false negative or false positive variant calls. In addition, many disease-related genes have multiple paralogs with complex structures, which makes them hard to sequence with short reads.
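Low-complexity regions such as long homopolymers are easy to flag in a sequence. As a small sketch, the function below finds runs of a single base longer than 10 bp; the helper name and the toy sequence are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def find_homopolymers(seq: str, min_len: int = 11) -> List[Tuple[int, int, str]]:
    """Return (start, end, base) for single-base runs of at least min_len.

    Coordinates are 0-based and half-open, so end - start is the run length.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One base followed by at least (min_len - 1) repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


# Toy sequence: a 12 bp A run, a 5 bp T run (too short) and an 11 bp G run.
sequence = "ACGT" + "A" * 12 + "CG" + "T" * 5 + "G" * 11
runs = find_homopolymers(sequence)
print(runs)
```

Positions flagged this way are ones where short-read alignments and indel calls deserve extra scrutiny.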

As WGS evolves from a research tool into a clinical diagnostic test, the accuracy of its results matters more and more. Variant calling algorithms depend on high-quality sequence data to produce accurate genotypes, so understanding how experimental parameters affect variant detection becomes critical.

In this study, we evaluated PCR-free WGS using samples from the Genome in a Bottle consortium. Datasets were generated at NIST on Illumina HiSeq 2500 instruments using PCR-free v2 chemistry. Raw data for the NIST RM8398 genome were aligned using BWA MEM (version reported variously as V2.0 and V2.5) with default settings, followed by Base Quality Score Recalibration using GATK v2.8-1-g932cd3a.

Five NA12878 samples were sequenced on the MGISEQ-2000 platform using different DNA inputs and library preparation protocols (PCR-based versus PCR-free). Sensitivity, depth of coverage and genotype quality were evaluated. The results show that PCR-free WGS performs comparably to traditional exome sequencing technology, which makes it an important diagnostic technology.