How to Train your Genomics Models

First open resource hosts trained machine-learning genomics models to facilitate their use and exchange

A powerful new resource, one that is genuinely a new kind of resource, has come online and will, with luck, help accelerate advances in genomics and the fight against many types of disease. The scale of genome data is so large that computational tools are required at every major step of acquiring, organizing, and analyzing genomes. Generating useful models from large genomic datasets, the kind produced when studying human disease, is often difficult and time-consuming, and many aspects of this work are now being automated using various machine learning approaches. Machine learning in this context can be roughly summarized as using computers to generate and evaluate huge numbers of statistical models in order to clarify relationships in datasets. To do this, a machine learning program needs to train on useful datasets. So for many cutting-edge applications, the program doesn't just need to be written but also trained, and this second step can require large amounts of time and computational resources, making the transmission and broader application of these programs less likely. Until now, that is. The Kipoi repository is the first open resource for trained machine learning models in genomics, making cutting-edge approaches available to clinicians and smaller labs. This resource should speed both the application of, and innovation in, machine-learning-based genomics approaches, and hopefully we will all benefit from this new site for the free exchange of ideas.

For more information, here’s a nice summary from Technology Networks.

Here is the introduction from the original article, published in Nature Biotechnology.

Advances in machine learning, coupled with rapidly growing genome sequencing and molecular profiling datasets, are catalyzing progress in genomics1. In particular, predictive machine learning models, which are mathematical functions trained to map input data to output values, have found widespread usage. Prominent examples include calling variants from whole-genome sequencing data2,3, estimating CRISPR guide activity4,5 and predicting molecular phenotypes, including transcription factor binding, chromatin accessibility and splicing efficiency, from DNA sequence1,6,7,8,9,10,11. Once trained, these models can be probed in silico to infer quantitative relationships between diverse genomic data modalities, enabling several key applications such as the interpretation of functional genetic variants and rational design of synthetic genes.
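The phrase "mathematical functions trained to map input data to output values" is easy to picture with a deliberately tiny sketch. The example below scores a DNA sequence against a position weight matrix, a simple stand-in for the kind of trained model the article describes; the weights are invented for illustration and are not taken from any published model.

```python
# Toy illustration: a "trained" predictive model is just a function whose
# parameters were fit to data. Here the learned parameters are a position
# weight matrix (PWM); all weights are invented for illustration only.
PWM = [
    {"A": 0.9, "C": 0.2, "G": 0.1, "T": 0.2},  # position 1 favors A
    {"A": 0.1, "C": 0.8, "G": 0.2, "T": 0.1},  # position 2 favors C
    {"A": 0.1, "C": 0.1, "G": 0.9, "T": 0.2},  # position 3 favors G
]

def score(seq):
    """Map an input DNA sequence (the data) to an output binding score."""
    return sum(weights[base] for weights, base in zip(PWM, seq))

print(score("ACG"))  # strong match to the motif
print(score("TTT"))  # weak match
```

Once such a function is trained, it can be "probed in silico" exactly as the article says: for instance, scoring every possible variant of a sequence to ask which mutations would most disrupt predicted binding.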

However, despite the pivotal importance of predictive models in genomics, it is surprisingly difficult to share and exchange models effectively. In particular, there is no established standard for depositing and sharing trained models. This lack is in stark contrast to bioinformatics software and workflows, which are commonly shared through general-purpose software platforms such as the highly successful Bioconductor project12. Similarly, there exist platforms to share genomic raw data, including Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/), ArrayExpress (https://www.ebi.ac.uk/arrayexpress) and the European Nucleotide Archive (https://www.ebi.ac.uk/ena). In contrast, trained genomics models are made available via scattered channels, including code repositories, supplementary material of articles and author-maintained web pages. The lack of a standardized framework for sharing trained models in genomics hampers not only the effective use of these models—and in particular their application to new data—but also the use of existing models as building blocks to solve more complex tasks.

READ MORE …

Categorizing Cells with Machine Learning and Latent Space


Two exciting and complementary machine learning methods for assigning cell identity from single-cell sequencing data were published in a paper from Johns Hopkins. The first program, scCoGAPS, defines latent spaces from a single-cell RNA-sequencing dataset to categorize cells, and the second, projectR, evaluates those latent spaces in independent target datasets using transfer learning. These two methods are interesting advances toward a goal that is likely still far off: understanding exactly what makes each cell what it is. For an excellent summary, read the press release, Finding A Cell’s True Identity.

The original article is more complicated reading but interesting throughout.

Stein-O’Brien et al. (2019) Decomposing Cell Identity for Transfer Learning across Cellular Measurements, Platforms, Tissues, and Species. Cell Systems.

Summary

Analysis of gene expression in single cells allows for decomposition of cellular states as low-dimensional latent spaces. However, the interpretation and validation of these spaces remains a challenge. Here, we present scCoGAPS, which defines latent spaces from a source single-cell RNA-sequencing (scRNA-seq) dataset, and projectR, which evaluates these latent spaces in independent target datasets via transfer learning. Application to scRNA-seq data from the developing mouse retina reveals intrinsic relationships across biological contexts and assays while avoiding batch effects and other technical features. We compare the dimensions learned in this source dataset to adult mouse retina, a time-course of human retinal development, select scRNA-seq datasets from developing brain, chromatin accessibility data, and a murine-cell type atlas to identify shared biological features. These tools lay the groundwork for exploratory analysis of scRNA-seq data via latent space representations, enabling a shift in how we compare and identify cells beyond reliance on marker genes or ensemble molecular identity.
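The core move in the abstract, reusing latent spaces learned from a source dataset to describe cells in an independent target dataset, can be sketched with a toy projection. This is a minimal illustration of the general idea, not the authors' actual algorithm (projectR uses more sophisticated regression-based projection); the genes, factors, and numbers are all invented.

```python
import math

# Toy sketch of latent-space projection ("transfer learning" in the sense
# used here): factor loadings learned from a source dataset are reused to
# describe cells from a new target dataset. All numbers are invented.

# Loadings learned from the source data: 4 genes x 2 latent factors.
# Columns are orthonormal here so projection reduces to a dot product.
s = 1 / math.sqrt(2)
loadings = [
    [s, 0.0],  # gene 1 contributes to factor 1
    [s, 0.0],  # gene 2 contributes to factor 1
    [0.0, s],  # gene 3 contributes to factor 2
    [0.0, s],  # gene 4 contributes to factor 2
]

def project(expression):
    """Express a new cell's gene-expression vector in the learned latent space."""
    return [sum(loadings[g][f] * expression[g] for g in range(len(expression)))
            for f in range(2)]

target_cell = [2.0, 2.0, 0.0, 0.0]  # a cell from an independent dataset
print(project(target_cell))          # dominated by factor 1
```

The payoff described in the paper is that cells from different assays, tissues, or species can be compared in the same learned coordinate system, rather than by marker genes alone.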

Google AI variant caller goes deep on rice genomes

Analyzing 3024 rice genomes characterized by DeepVariant


“Rice is an ideal candidate for study in genomics, not only because it’s one of the world’s most important food crops, but also because centuries of agricultural cross-breeding have created unique, geographically-induced differences. With the potential for global population growth and climate change to impact crop yields, the study of this genome has important social considerations.

This post explores how to identify and analyze different rice genome mutations with a tool called DeepVariant. To do this, we performed a re-analysis of the Rice 3K dataset and have made the data publicly available as part of the Google Cloud Public Dataset Program pre-publication and under the terms of the Toronto Statement.

We aim to show how AI can improve food security by accelerating genetic enhancement to increase rice crop yield. According to the Food and Agriculture Organization of the United Nations, crop improvements will reduce the negative impact of climate change and loss of arable land on rice yields, as well as support an estimated 25% increase in rice demand by 2030.”


READ MORE …

We'll need AI to deal with coming wave of genome data

Getting smart about artificial intelligence

By: Alison Cranage, Science writer

We'll need AI to deal with coming wave of genome data. Genome Media.

“Genomics is set to become the biggest source of data on the planet, overtaking the current leading heavyweights – astronomy, YouTube and Twitter. Genome sequencing currently produces a staggering 25 petabytes of digital information per year. A petabyte is 10^15 bytes, or about 1,000 times the average storage on a personal computer. And there is no sign of a slowdown.”
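The scale claim in the quote is easy to verify with back-of-the-envelope arithmetic, assuming roughly 1 TB as the "average storage on a personal computer" the quote refers to:

```python
# Back-of-the-envelope check of the quote's numbers (assumes ~1 TB per PC).
petabyte = 10**15           # bytes
terabyte = 10**12           # bytes; rough personal-computer storage
yearly = 25 * petabyte      # genome sequencing output per year, per the quote

print(petabyte // terabyte)  # PCs needed to hold one petabyte: 1,000
print(yearly // terabyte)    # PCs needed to hold a year of sequencing: 25,000
```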


READ MORE …

Cancer mutation characterization with machine learning (original article -- very cool)

Integrated structural variation and point mutation signatures in cancer genomes using correlated topic models

Loss of DNA repair mechanisms can leave specific mutation signatures in the genomes of cancer cells. To identify cancers with broken DNA-repair processes, accurate methods are needed for detecting mutation signatures and, in particular, their activities or probabilities within individual cancers. In this paper, we introduce a class of statistical modeling methods used for natural language processing, known as “topic models,” which outperform standard methods for signature analysis. We show that topic models that incorporate signature probability correlations across cancers perform best, while jointly analyzing multiple mutation types improves robustness to low mutation counts.
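The generative picture behind signature analysis (and behind topic models generally) is that each signature is a probability distribution over mutation types, and a tumor's observed mutation spectrum is a weighted mixture of signatures. The sketch below illustrates only that mixing step, with invented signatures and exposures, not real COSMIC signatures and not the paper's correlated topic model.

```python
# Toy sketch of the generative idea behind signature/topic models: a tumor's
# mutation spectrum is a mixture of signature distributions, weighted by
# per-tumor "exposures" (activities). All values invented for illustration.
signatures = {
    "sig_A": {"C>A": 0.7, "C>T": 0.2, "T>C": 0.1},
    "sig_B": {"C>A": 0.1, "C>T": 0.8, "T>C": 0.1},
}
exposures = {"sig_A": 0.25, "sig_B": 0.75}  # this tumor is mostly sig_B

def expected_spectrum(signatures, exposures):
    """Mix the signature distributions according to their exposures."""
    spectrum = {}
    for name, dist in signatures.items():
        for mutation, p in dist.items():
            spectrum[mutation] = spectrum.get(mutation, 0.0) + exposures[name] * p
    return spectrum

print(expected_spectrum(signatures, exposures))
```

Signature-detection methods run this picture in reverse: given observed mutation counts, they infer which exposures (and, in de novo analysis, which signatures) best explain the data, which is exactly where the paper's correlated topic models come in.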



READ MORE…

More machine learning making models...

Sberbank creates algorithm to do data scientists' job


It seems that even data scientists are not immune to the corrosive impact of artificial intelligence on the jobs market. Russia's Sberbank claims to have created an algorithm - Auto ML (machine learning) - that "acts like a data scientist", creating its own models that then solve application tasks.

The bank carried out its first pilot in January, using Auto ML algos to create several baseline models to help with the targeting of sales campaigns.

READ MORE…