Human Genome Reference Sequence: Summary or Example?


There is no one human genome. Each person starts life with two non-identical copies of a genome, and variations both small and large begin to accumulate each time those copies are copied. And then there are the differences between individuals. If we think of the genome as a single list of bases at specific positions, then point mutations (substitutions and small insertions and deletions) are easy enough to map to those positions. Major structural variants (inversions, translocations and repetitive sequences), however, complicate that mapping. Reference genomes, consensus representations of deeply sequenced human genomes, have traditionally been the basis for mapping nucleotides and variants to positions on chromosomes, but long-read technologies are making it increasingly apparent that structural variants are quite common and that new methods for representing the human genome are needed.

The first of the following articles lays out why a more advanced model for capturing the variation in the human genome is needed. The article after that describes how multiple genomes and their structural variation can be summarized using graphs, a computational improvement on the current linear reference genomes. The last article discusses some of the single-molecule sequencing technology bringing this issue to the fore. There are many other articles that deal with this topic, but these are a good start.

Yang, et al. (2019) One reference genome is not enough. Genome Biology

Abstract

A recent study on human structural variation indicates insufficiencies and errors in the human reference genome, GRCh38, and argues for the construction of a human pan-genome.

########################################################################################

Here’s an article describing how structural variants can be captured in a graph.

Rakocevic, et al. (2019) Fast and accurate genomic analyses using genome graphs. Nature Genetics

Abstract

The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, thus impairing analysis accuracy. Here we present a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels). The pipeline processes one whole-genome sequencing sample in 6.5 h using a system with 36 CPU cores. We show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is an important advance toward fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
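To make the idea of a graph reference concrete, here is a minimal Python sketch. It is not the implementation from Rakocevic et al.; the node sequences, node IDs, and the helper names `spell` and `all_paths` are all made up for illustration. Each node holds a short stretch of sequence, each edge says which stretch may follow which, and every path through the graph spells out one possible haplotype.

```python
from collections import defaultdict

# Hypothetical toy graph: every sequence and node ID below is invented.
nodes = {
    1: "ACGT",
    2: "G",      # reference allele at a SNP site
    3: "A",      # alternate allele at the same site
    4: "TTC",
    5: "GGGA",   # segment present only in haplotypes carrying an insertion
    6: "CAT",
}

# Directed edges: which segment may follow which.
edges = defaultdict(set)
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]:
    edges[src].add(dst)


def spell(path):
    """Return the sequence spelled by a walk, checking that each step is an edge."""
    for a, b in zip(path, path[1:]):
        if b not in edges[a]:
            raise ValueError(f"no edge {a} -> {b}")
    return "".join(nodes[n] for n in path)


def all_paths(start, end):
    """Enumerate every walk from start to end (fine for a tiny acyclic graph)."""
    if start == end:
        return [[end]]
    return [[start] + rest
            for nxt in sorted(edges[start])
            for rest in all_paths(nxt, end)]


reference = spell([1, 2, 4, 6])      # the single path a linear reference would store
alternate = spell([1, 3, 4, 5, 6])   # same graph, SNP plus insertion
haplotypes = [spell(p) for p in all_paths(1, 6)]

print(reference)        # ACGTGTTCCAT
print(alternate)        # ACGTATTCGGGACAT
print(len(haplotypes))  # 4
```

A linear reference keeps only the first of these four paths, so a read that carries the inserted segment has nothing to align against. In the graph, the same read can follow the path through node 5.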

########################################################################################

Here’s an article describing how next-next generation sequencing is illuminating the diversity of structural variants across human populations.

Chaisson, et al. (2015) Resolving the complexity of the human genome using single-molecule sequencing. Nature

Abstract

Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 using single-molecule real-time sequencing, next-generation mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6. This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of unreported and Asian-specific structural variants, and high-quality haplotyping of clinically relevant alleles for precision medicine.
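The abstract reports contig, scaffold, and phased-block sizes as N50 values. For readers unfamiliar with the statistic, here is a small Python sketch of the usual definition: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the overall total. The function name `n50` and the example lengths are made up for illustration and are not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that pieces of length >= L together
    account for at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the total
            return length


# Invented example lengths (arbitrary units), total = 300.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 70, because 80 + 70 = 150 is half of 300
```

A higher N50 means more of the assembly sits in long, unbroken pieces, which is why the statistic is used as a measure of contiguity.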

Thank you for reading!