Human Genome Reference Sequence: Summary or Example?


There is no one human genome. Each person starts life with two non-identical copies of a genome, and variations both small and large begin to accumulate each time those copies are copied. And then there are the differences between individuals. If we think of the genome as a single list of bases at specific positions, then point mutations (substitutions, small insertions and deletions) are easy enough to map to those positions. Major structural variants (inversions, translocations and repetitive sequences), however, complicate that mapping. Reference genomes, consensus representations of deeply sequenced human genomes, have traditionally been the basis for mapping nucleotides and variants to positions on chromosomes, but long-read technologies are making it increasingly apparent that structural variants are quite common and that new methods for representing the human genome are needed.

The first of the following articles lays out why a more advanced model for capturing the variation in the human genome is needed. The article after that describes how multiple genomes and their structural variation can be summarized using graphs, a computational improvement on the current linear reference genomes. The last article discusses some of the single-molecule sequencing technology bringing this issue to the fore. There are many other articles that deal with this topic, but these are a good start.

Yang, et al. (2019) One reference genome is not enough. Genome Biology

Abstract

A recent study on human structural variation indicates insufficiencies and errors in the human reference genome, GRCh38, and argues for the construction of a human pan-genome.

########################################################################################

Here’s an article describing how structural variants can be captured in a graph.

Rakocevic, et al. (2019) Fast and accurate genomic analyses using genome graphs. Nature Genetics

Abstract

The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, thus impairing analysis accuracy. Here we present a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels). The pipeline processes one whole-genome sequencing sample in 6.5 h using a system with 36 CPU cores. We show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is an important advance toward fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
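
To make the idea of a graph reference a little more concrete, here is a minimal toy sketch in Python. It is not the implementation described in the paper: the node names, sequences and variants are all invented for illustration, and real genome graph tools use far more compact data structures. It only shows how one SNP and one insertion turn a single linear sequence into a small graph whose paths spell out alternative haplotypes.

```python
from typing import Dict, List

# Toy sequence graph: node id -> sequence, plus directed edges.
# All sequences and variants below are hypothetical.
nodes: Dict[str, str] = {
    "n1": "ACGT",  # shared prefix
    "n2": "A",     # reference allele of a SNP
    "n3": "G",     # alternate allele of the same SNP
    "n4": "TTC",   # shared middle segment
    "n5": "GGA",   # insertion carried by only some haplotypes
    "n6": "CA",    # shared suffix
}
edges: Dict[str, List[str]] = {
    "n1": ["n2", "n3"],  # branch at the SNP
    "n2": ["n4"],
    "n3": ["n4"],
    "n4": ["n5", "n6"],  # n4 -> n6 skips the insertion
    "n5": ["n6"],
    "n6": [],
}


def haplotypes(start: str, end: str) -> List[str]:
    """Enumerate every sequence spelled by a path from start to end."""
    if start == end:
        return [nodes[end]]
    return [
        nodes[start] + tail
        for nxt in edges[start]
        for tail in haplotypes(nxt, end)
    ]


all_haps = haplotypes("n1", "n6")
for hap in all_haps:
    print(hap)
```

Two variant sites give four possible paths through this graph, whereas a linear reference would record only one of them. A read carrying the alternate allele or the insertion can match a path exactly instead of being aligned with mismatches or gaps.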

########################################################################################

Here’s an article describing how next-next generation sequencing is illuminating the diversity of structural variants across human populations.

Chaisson, et al. (2015) Resolving the complexity of the human genome using single-molecule sequencing. Nature

Abstract

Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 using single-molecule real-time sequencing, next-generation mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6.
This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of unreported and Asian-specific structural variants, and high-quality haplotyping of clinically relevant alleles for precision medicine.
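
The abstract above leans on N50 as its measure of contiguity (contigs, scaffolds and phased blocks). For readers who have not met the term, here is a short Python sketch of the usual definition: sort the pieces from longest to shortest and report the length at which the running total first reaches half of all assembled bases. The function name and the contig lengths are made up for illustration and have nothing to do with the AK1 data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that pieces of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


# Hypothetical contig lengths (arbitrary units), 300 in total.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half of 300, so this prints 70
```

A higher N50 means that more of the assembly sits in long, unbroken pieces, which is why the statistic is quoted for contigs, scaffolds and phased blocks alike.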

Thank you for reading!

Where mutations are not tolerated: a good summary of an outstanding study

Big datasets pinpoint new regions to explore the genome for disease

A dataset of more than 100,000 individuals allows researchers to identify genetic regions that are intolerant to change and may underlie developmental disorders.


Imagine rain falling on a square of sidewalk. While the raindrops appear to land randomly, over time a patch of sidewalk somehow remains dry. The emerging pattern suggests something special about this region. The analogy captures a new method devised by researchers at University of Utah Health. They examined genetic data from more than 100,000 healthy people to identify regions of our genes that are intolerant to change. They believe that DNA mutations in these "constrained" regions may cause severe pediatric diseases.

"Instead of focusing on where DNA changes are, we looked for parts of genes where DNA changes are not," said Aaron Quinlan, Ph.D., associate professor of Human Genetics and Biomedical Informatics at U of U Health and associate director of the USTAR Center for Genetic Discovery. "Our model searches for exceptions to the rule of dense genetic variation in this massive dataset to reveal constrained regions of genes that are devoid of variation. We believe these regions may be lethal or cause extreme phenotypes of disease when mutated."

While this approach is conceptually simple, only recently have there been enough human genomes available to make it possible. These newly identified invariant stretches may reveal new disease-causing genes and can be used to help pinpoint the cause of disease in patients with developmental disorders. The results of this study are available online in the December 10 issue of the journal Nature Genetics.
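
The "dry patch of sidewalk" idea can be sketched in a few lines of Python. This is only an illustration of the core intuition, scanning a region for long stretches where no variant has been observed. It is not the published model, which is certainly more involved, and the function name, coordinates and threshold are all hypothetical.

```python
from typing import List, Tuple


def variant_free_gaps(positions: List[int], start: int, end: int,
                      min_len: int) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) intervals within [start, end] that
    contain no observed variant and span at least min_len bases."""
    # Keep only variants inside the region, sorted and de-duplicated.
    inside = sorted({p for p in positions if start <= p <= end})
    gaps = []
    prev = start - 1  # sentinel just before the region
    for pos in inside + [end + 1]:  # sentinel just after the region
        first, last = prev + 1, pos - 1
        if last - first + 1 >= min_len:
            gaps.append((first, last))
        prev = pos
    return gaps


# Hypothetical variant positions pooled across many individuals.
observed = [3, 5, 6, 20, 22, 23]
print(variant_free_gaps(observed, start=1, end=30, min_len=5))
# prints [(7, 19), (24, 30)]
```

With only a handful of individuals, most gaps would simply reflect too little data. A stretch without variation only becomes meaningful once the surrounding sequence is densely covered with variants, which is why a dataset of this size matters.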



The true number of human miRNAs

An estimate of the total number of true human miRNAs

Julia Alles, Tobias Fehlmann, Ulrike Fischer, Christina Backes, Valentina Galata, Marie Minet, Martin Hart, Masood Abu-Halima, Friedrich A Grässer, Hans-Peter Lenhof, Andreas Keller, and Eckart Meese

Nucleic Acids Research (Research Article)

Abstract—While the number of human miRNA candidates continuously increases, only a few of them are completely characterized and experimentally validated. Toward determining the total number of true miRNAs, we employed a combined in silico high- and experimental low-throughput validation strategy. We collected 28 866 human small RNA sequencing data sets containing 363.7 billion sequencing reads and excluded falsely annotated and low quality data. Our high-throughput analysis identified 65% of 24 127 mature miRNA candidates as likely false-positives. Using northern blotting, we experimentally validated miRBase entries and novel miRNA candidates. By exogenous overexpression of 108 precursors that encode 205 mature miRNAs, we confirmed 68.5% of the miRBase entries with the confirmation rate going up to 94.4% for the high-confidence entries and 18.3% of the novel miRNA candidates. Analyzing endogenous miRNAs, we verified the expression of 8 miRNAs in 12 different human cell lines. In total, we extrapolated 2300 true human mature miRNAs, 1115 of which are currently annotated in miRBase V22. The experimentally validated miRNAs will contribute to revising targetomes hypothesized by utilizing falsely annotated miRNAs.
