Human Genome Reference Sequence: Summary or Example?

There is no one human genome. Each person starts life with two non-identical copies of a genome, and variations both small and large begin to accumulate each time those copies are copied. And then there are the differences between individuals. If we think of the genome as a single list of bases at specific positions, then point mutations (substitutions, small insertions and deletions) are easy enough to map to those positions. Major structural variants (inversions, translocations and repetitive sequences), however, complicate how we map these mutations. Reference genomes, a consensus representation of deeply sequenced human genomes, have traditionally been the basis of how we map nucleotides and variants to positions on chromosomes, but long-read technologies are making it increasingly apparent that structural variants are quite common and that new methods for representing the human genome are needed.

The first of the following articles lays out why a more advanced model for capturing the variation in the human genome is needed. The article after that describes how multiple genomes and their structural variation can be summarized using graphs, a computational improvement on the current linear reference genomes. The last article discusses some of the single-molecule sequencing technology bringing this issue to the fore. There are many other articles that deal with this topic, but these are a good start.

Yang, et al. (2019) One reference genome is not enough. Genome Biology

Abstract

A recent study on human structural variation indicates insufficiencies and errors in the human reference genome, GRCh38, and argues for the construction of a human pan-genome.

########################################################################################

Here’s an article describing how structural variants can be captured in a graph.

Rakocevic, et al. (2019) Fast and accurate genomic analyses using genome graphs. Nature Genetics

Abstract

The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, thus impairing analysis accuracy. Here we present a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels). The pipeline processes one whole-genome sequencing sample in 6.5 h using a system with 36 CPU cores. We show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is an important advance toward fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
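
To make the graph idea concrete, here is a minimal Python sketch of a toy sequence graph. It is not the implementation from the paper: the node ids, sequences and helper names (`spell`, `all_paths`) are all made up for illustration. One SNP and one insertion are stored as alternative branches, so the linear reference is just one path among several.

```python
# Toy sequence graph. Every id and sequence here is hypothetical.
nodes = {
    1: "ACGT",
    2: "A",    # reference allele of a SNP
    3: "G",    # alternate allele of the same SNP
    4: "TTC",
    5: "GGA",  # insertion carried by only some genomes
    6: "CAT",
}
edges = {
    1: [2, 3],
    2: [4],
    3: [4],
    4: [5, 6],  # the edge 4 -> 6 skips the insertion
    5: [6],
    6: [],
}


def spell(path):
    """Concatenate node sequences along a path, checking each step is an edge."""
    for a, b in zip(path, path[1:]):
        if b not in edges[a]:
            raise ValueError(f"no edge {a} -> {b}")
    return "".join(nodes[n] for n in path)


def all_paths(start, end):
    """List every path from start to end (only sensible for a tiny acyclic graph)."""
    if start == end:
        return [[end]]
    return [[start] + rest for nxt in edges[start] for rest in all_paths(nxt, end)]


reference = spell([1, 2, 4, 6])                          # the "linear reference" path
haplotypes = sorted(spell(p) for p in all_paths(1, 6))   # all four allele combinations
```

A read carrying the G allele and the insertion matches one of the four paths exactly, whereas against `reference` alone it would show a mismatch plus an unaligned insertion.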

########################################################################################

Here’s an article describing how next-next generation sequencing is illuminating the diversity of structural variants across human populations.

Chaisson, et al. (2015) Resolving the complexity of the human genome using single-molecule sequencing. Nature

Abstract

Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 using single-molecule real-time sequencing, next-generation mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6.
This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of unreported and Asian-specific structural variants, and high-quality haplotyping of clinically relevant alleles for precision medicine.
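
The abstract leans on N50 as its measure of contiguity. For readers who have not met it: sort the contigs from longest to shortest and add them up until you reach half of the total assembly length; the length of the contig that gets you there is the N50. The sketch below uses the usual "at least half" convention, and the toy lengths are made up, not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Walk the lengths from longest to shortest and return the length at which
    the running total first reaches at least half of the overall total.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), total = 32.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)  # the two 8s already cover half of 32, so this is 8
```

A higher N50 means more of the assembly sits in long pieces, which is why the same statistic is quoted for contigs, scaffolds and phased blocks.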

Thank you for reading!

The Human Family Tree is a Bush

More evidence of branching off and reconnecting in the early history of humans

It has been an exciting week for human ancestry. First, a new species of hominid was identified in the Philippines, Homo luzonensis, and now there’s evidence of the formerly elusive Denisovans in the ancient ancestry of Papuans. Adding to the excitement, this group found evidence of at least three distinct Denisovan lineages, and that humans likely interbred with Denisovan cousins somewhere around New Guinea. This is all pretty amazing, considering we first became aware of Denisovans from a single DNA sample from a finger, found in a cave, in Siberia.

Multiple Deeply Divergent Denisovan Ancestries in Papuans

Jacobs et al., Cell (Research Article)

Highlights

• A new dataset of 161 genomes covering the understudied Indonesia-New Guinea region

• Introgressing Denisovans comprise at least three genetically divergent groups

• Papuans carry haplotypes from two Denisovan groups, with one unique to Oceania

• Some Denisovan introgression was recent and likely occurred in New Guinea or Wallacea

Summary—Genome sequences are known for two archaic hominins—Neanderthals and Denisovans—which interbred with anatomically modern humans as they dispersed out of Africa. We identified high-confidence archaic haplotypes in 161 new genomes spanning 14 island groups in Island Southeast Asia and New Guinea and found large stretches of DNA that are inconsistent with a single introgressing Denisovan origin. Instead, modern Papuans carry hundreds of gene variants from two deeply divergent Denisovan lineages that separated over 350 thousand years ago. Spatial and temporal structure among these lineages suggest that introgression from one of these Denisovan groups predominantly took place east of the Wallace line and continued until near the end of the Pleistocene. A third Denisovan lineage occurs in modern East Asians. This regional mosaic suggests considerable complexity in archaic contact, with modern humans interbreeding with multiple Denisovan groups that were geographically isolated from each other over deep evolutionary time.


Read the original article HERE … and other summaries here and here.

What your genome won't tolerate

This is one of those projects that’s so clearly interesting and important that it’s surprising nobody has done it already: specifically, this is a very thorough and well-executed analysis of all the places in the human genome that do not appear to tolerate being mutated. If you have access, it’s worth reading. —RPR

Measuring intolerance to mutation in human genetics

Zachary L. Fuller, Jeremy J. Berg, Hakhamanesh Mostafavi, Guy Sella & Molly Przeworski

Nature Genetics (Research Article)

Abstract—In numerous applications, from working with animal models to mapping the genetic basis of human disease susceptibility, knowing whether a single disrupting mutation in a gene is likely to be deleterious is useful. With this goal in mind, a number of measures have been developed to identify genes in which protein-truncating variants (PTVs), or other types of mutations, are absent or kept at very low frequency in large population samples—genes that appear ‘intolerant’ to mutation. One measure in particular, the probability of being loss-of-function intolerant (pLI), has been widely adopted. This measure was designed to classify genes into three categories, null, recessive and haploinsufficient, on the basis of the contrast between observed and expected numbers of PTVs. Such population-genetic approaches can be useful in many applications. As we clarify, however, they reflect the strength of selection acting on heterozygotes and not dominance or haploinsufficiency.
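
The "contrast between observed and expected numbers of PTVs" can be illustrated with a plain observed/expected ratio. To be clear, this is a simplified stand-in and not the pLI calculation itself, and the gene names and counts below are hypothetical. A ratio well below 1 means PTVs are depleted in the population sample; as the abstract cautions, that depletion speaks to selection on heterozygotes, not to dominance or haploinsufficiency.

```python
def observed_expected_ratio(observed_ptvs, expected_ptvs):
    """Ratio of observed to expected PTV counts for one gene."""
    if expected_ptvs <= 0:
        raise ValueError("expected count must be positive")
    return observed_ptvs / expected_ptvs


# Hypothetical genes: (observed PTVs, expected PTVs under a neutral mutation model).
genes = {
    "GENE_A": (2, 40.0),   # strongly depleted
    "GENE_B": (35, 38.0),  # close to expectation
}

ratios = {name: observed_expected_ratio(obs, exp) for name, (obs, exp) in genes.items()}

# Gene names ordered from most to least depleted of PTVs.
most_depleted_first = sorted(ratios, key=ratios.get)
```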


READ MORE …

Cornell sequences student genomes

What’s in Your DNA? Cornell Genomics Class Provides Students Free 23andMe Testing

“The first-ever sequencing of the human genome cost $2.7 billion. Today, the service 23andMe offers personal genome sequencing for less than $200. And for students enrolled in Cornell’s personal genomics class, it’s free.

Prof. Charles Aquadro, molecular biology and genetics, has been teaching Molecular Biology and Genetics 1290: Personal Genomics and Medicine: Why Should You Care About What’s in Your Genes for seven years now, following his hugely successful Cornell University Genetic Ancestry Project, a collaboration that traced the ancestry of over 200 Cornell undergraduates in the spring of 2011.”


READ MORE …

Important reminder that your genome is your property

Your invaluable genome

Genomic data is the currency of a new era of medicine that promises incredible advances. Here, bioinformatician Nana Mensah explains why…

In the race for greatest medical revolution of the 21st century, genomics is undoubtedly a frontrunner. Those outside of the field, however, might still find themselves wondering: 'what's all the fuss about?' There are many reasons why genomics is revolutionary, but data is at the root of it all. As genomics is used more and more in mainstream care, it becomes ever more important to understand the great power and value of this new kind of data, writes Nana Mensah.

From cell to computer

While the word ‘genome’ refers to the entire sequence of DNA of an individual organism, the term ‘genomic data’ refers to its digital representation – a large data file resulting from the sequencing process.

READ MORE …