Fix non-semantic newlines for `e.g. `, `i.e. `, `; @` (greenelab#812).
Closes greenelab#791.

* add author_info for @michaelmhoffman

* merge from upstream

* Fix non-semantic newlines for `e.g. `, `i.e. `, `; @`

Performed by

```shell
perl -0pi -e 's/e\.g\.\n/e.g. /g; s/i\.e\.\n/i.e. /g; s/;\n@/; @/g' *.md
```
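For readers less familiar with `perl -0pi` (slurp each whole file, edit in place), a Python sketch of the same three substitutions, applied to one file's text in memory, could look like this (the function name is illustrative):

```python
import re

def fix_nonsemantic_newlines(text: str) -> str:
    """Join newlines that split `e.g. `, `i.e. `, and `; @citation`,
    mirroring the perl one-liner (which operates on the full file text)."""
    text = re.sub(r"e\.g\.\n", "e.g. ", text)
    text = re.sub(r"i\.e\.\n", "i.e. ", text)
    text = re.sub(r";\n@", "; @", text)
    return text
```

Operating on the whole text at once matters: the newlines being removed sit between lines, so a line-by-line pass would never see them.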

Addresses greenelab#791

* Remove newline

* Remove newline
michaelmhoffman authored and agitter committed Jan 23, 2018
1 parent 57563a3 commit 1db0c13
Showing 6 changed files with 34 additions and 75 deletions.
6 changes: 2 additions & 4 deletions content/02.intro.md
@@ -120,10 +120,8 @@ Despite these recent advances, a number of challenges exist in this area of rese
Deep learning can be applied to answer more fundamental biological questions; it is especially suited to leveraging large amounts of data from high-throughput
"omics" studies.
One classic biological problem where machine learning, and now deep learning, has been extensively applied is molecular target prediction.
For example, deep recurrent neural networks (RNNs) have been used to predict gene targets of microRNAs [@doi:10.1109/icnn.1994.374637], and CNNs have been applied to predict protein residue-residue contacts and secondary structure [@doi:10.1371/journal.pcbi.1005324; @doi:10.1109/TCBB.2014.2343960; @doi:10.1038/srep18962].
Other recent exciting applications of deep learning include recognition of functional genomic elements such as enhancers and promoters [@doi:10.1101/036129; @doi:10.1007/978-3-319-16706-0_20; @doi:10.1093/nar/gku1058] and prediction of the deleterious effects of nucleotide polymorphisms [@doi:10.1093/bioinformatics/btu703].

#### Treatment of patients

23 changes: 7 additions & 16 deletions content/03.categorize.md
@@ -28,25 +28,18 @@ In all cases that we examined, fewer than one million images were available for
Researchers have developed subtask-specific strategies to address this challenge.

Data augmentation provides an effective strategy for working with small training sets.
The practice is exemplified by a series of papers that analyze images from mammographies [@tag:Dhungel2015_struct_pred_mamm; @tag:Dhungel2016_mamm; @tag:Zhu2016_mult_inst_mamm; @tag:Zhu2016_advers_mamm; @tag:Dhungel2017_mamm_min_interv].
To expand the number and diversity of images, researchers constructed adversarial training examples [@tag:Zhu2016_advers_mamm].
Adversarial training examples are constructed by applying a transformation that changes training images but not their content---for example by rotating an image by a random amount.
An alternative in this domain is to train towards human-created features before subsequent fine-tuning [@tag:Dhungel2016_mamm], which can help to sidestep this challenge, though it gives up deep learning techniques' strength as feature constructors.
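As a minimal sketch of such label-preserving augmentation (not the pipeline of the cited papers; random 90-degree rotations and flips stand in for rotation by an arbitrary amount):

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a random label-preserving transform: a rotation by a multiple
    of 90 degrees, optionally followed by a horizontal flip."""
    image = np.rot90(image, k=int(rng.integers(4)))
    if rng.integers(2):
        image = image[:, ::-1]
    return image

rng = np.random.default_rng(0)
original = np.arange(16).reshape(4, 4)
batch = [augment(original, rng) for _ in range(8)]  # 8 augmented copies
```

Each augmented copy carries the original image's label, so a small labeled set can be expanded many-fold at essentially no cost.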

A second strategy repurposes features extracted from natural images by deep learning models, such as ImageNet [@tag:Russakovsky2015_imagenet], for new purposes.
Diagnosing diabetic retinopathy through color fundus images became an area of focus for deep learning researchers after a large labeled image set was made publicly available during a 2015 Kaggle competition [@tag:Pratt2016_dr].
Most participants trained neural networks from scratch [@tag:Pratt2016_dr; @tag:Abramoff2016_dr; @tag:Leibig2016_dr], but Gulshan et al. [@tag:Gulshan2016_dt] repurposed a 48-layer Inception-v3 deep architecture pre-trained on natural images and surpassed the state-of-the-art specificity and sensitivity.
Such features were also repurposed to detect melanoma, the deadliest form of skin cancer, from dermoscopic [@tag:Codella2016_ensemble_melanoma; @tag:Yu2016_melanoma_resnet] and non-dermoscopic images of skin lesions [@tag:Jafari2016_skin_lesions; @tag:Esfahani2016_melanoma; @tag:Esteva2017_skin_cancer_nature] as well as age-related macular degeneration [@tag:Burlina2016_amd].
Pre-training on natural images can enable very deep networks to succeed without overfitting.
For the melanoma task, reported performance was competitive with or better than a board of certified dermatologists [@tag:Codella2016_ensemble_melanoma; @tag:Esteva2017_skin_cancer_nature].
Reusing features from natural images is also an emerging approach for radiographic images, where datasets are often too small to train large deep neural networks without these techniques [@tag:Bar2015_nonmed_tl; @tag:Shin2016_cad_tl; @tag:Rajkomar2017_radiographs; @tag:Lakhani2017_radiography].
A deep CNN trained on natural images boosts performance in radiographic images [@tag:Rajkomar2017_radiographs].
However, the target task required either re-training the initial model from scratch with special pre-processing or fine-tuning of the whole network on radiographs with heavy data augmentation to avoid overfitting.
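A minimal sketch of this feature-reuse idea, with randomly generated vectors standing in for frozen features from a pretrained network's penultimate layer and a fresh logistic-regression head trained on top (all sizes and names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for fixed features extracted by a pretrained network.
features = rng.normal(size=(300, 64))
separator = rng.normal(size=64)
labels = (features @ separator > 0).astype(float)  # a learnable toy task

# Only the new classification head is trained; the features stay frozen.
w = np.zeros(64)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(features @ w + b)))  # sigmoid head
    grad = p - labels                          # logistic-loss gradient
    w -= lr * features.T @ grad / len(labels)
    b -= lr * grad.mean()

accuracy = float((((features @ w + b) > 0) == (labels > 0.5)).mean())
```

Because only the small head is fit, far fewer labeled examples are needed than training the full network from scratch would require.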

@@ -366,10 +359,8 @@ First, research samples are frequently non-representative of the general populat
One well-known consequence of these biases in genomics is that penetrance is consistently lower in the general population than would be implied by case-control data, as reviewed in [@doi:10.1086/512821].
Moreover, real genetic associations found in one population may not hold in other populations with different patterns of linkage disequilibrium (even when population stratification is explicitly controlled for [@doi:10.1038/nrg2813]).
As a result, many genomic findings are of limited value for people of non-European ancestry [@doi:10.1371/journal.pbio.1001661] and may even lead to worse treatment outcomes for them.
Methods have been developed for mitigating some of these problems in genomic studies [@doi:10.1086/512821; @doi:10.1038/nrg2813], but it is not clear how easily they can be adapted for deep models that are designed specifically to extract subtle effects from high-dimensional data.
For example, differences in the equipment that tended to be used for cases versus controls have led to spurious genetic findings (e.g. Sebastiani et al.'s retraction [@doi:10.1126/science.333.6041.404-a]).
In some contexts, it may not be possible to correct for all of these differences to the degree required to prevent a deep network from exploiting them.
Moreover, the complexity of deep networks makes it difficult to determine when their predictions are likely to be based on such nominally-irrelevant features of the data (called "leakage" in other fields [@doi:10.1145/2382577.2382579]).
When we are not careful with our data and models, we may inadvertently say more about the way the data was collected (which may involve a history of unequal access and discrimination) than about anything of scientific or predictive value.
27 changes: 9 additions & 18 deletions content/04.study.md
@@ -23,8 +23,7 @@ As RNA-seq continues to fall in price and rise in throughput, sample sizes will
Already several deep learning approaches have been applied to gene expression data with varying aims.
For instance, many researchers have applied unsupervised deep learning models to extract meaningful representations of gene modules or sample clusters.
Denoising autoencoders have been used to cluster yeast expression microarrays into known modules representing cell cycle processes [@tag:Gupta2015_exprs_yeast] and to stratify yeast strains based on chemical and mutational perturbations [@tag:Chen2016_exprs_yeast].
Shallow (one hidden layer) denoising autoencoders have also been fruitful in extracting biological insight from thousands of _Pseudomonas aeruginosa_ experiments [@tag:Tan2015_adage; @tag:Tan2016_eadage] and in aggregating features relevant to specific breast cancer subtypes [@tag:Tan2014_psb].
These unsupervised approaches applied to gene expression data are powerful methods for identifying gene signatures that may otherwise be overlooked.
An additional benefit of unsupervised approaches is that ground truth labels, which are often difficult to acquire or are incorrect, are nonessential.
However, the genes that have been aggregated into features must be interpreted carefully.
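A toy sketch of the one-hidden-layer denoising autoencoder idea described above (masking noise, tied weights; the data and all dimensions are invented for illustration, not taken from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Toy "expression" data: 200 samples x 50 genes driven by two latent modules.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 50))
data = sigmoid(latent @ loadings)  # values in (0, 1)

# One-hidden-layer denoising autoencoder with tied weights.
n_hidden = 10
W = rng.normal(scale=0.1, size=(50, n_hidden))
b_hidden = np.zeros(n_hidden)
b_out = np.zeros(50)
lr = 0.1

losses = []
for _ in range(200):
    noisy = data * (rng.random(data.shape) > 0.1)  # masking noise
    hidden = sigmoid(noisy @ W + b_hidden)
    out = sigmoid(hidden @ W.T + b_out)
    err = out - data                               # reconstruct the *clean* input
    losses.append(float((err ** 2).mean()))
    # Backprop for squared error with sigmoid units and tied weights.
    d_out = err * out * (1 - out)
    d_hidden = (d_out @ W) * hidden * (1 - hidden)
    W -= lr * (noisy.T @ d_hidden + d_out.T @ hidden) / len(data)
    b_hidden -= lr * d_hidden.mean(axis=0)
    b_out -= lr * d_out.mean(axis=0)
```

After training, the columns of `W` play the role of the learned "features" whose high-weight genes are inspected for biological meaning.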
@@ -198,16 +197,14 @@ But one could easily replace them with the output of one of the enhancer or prom

### Micro-RNA binding

Prediction of microRNAs (miRNAs) and miRNA targets is of great interest, as they are critical components of gene regulatory networks and are often conserved across great evolutionary distance [@tag:Bracken2016_mirna; @tag:Berezikov2011_mirna].
While many machine learning algorithms have been applied to these tasks, they currently require extensive feature selection and optimization.
For instance, one of the most widely adopted tools for miRNA target prediction, TargetScan, trained multiple linear regression models on 14 hand-curated features including structural accessibility of the target site on the mRNA, the degree of site conservation, and predicted thermodynamic stability of the miRNA-mRNA complex [@tag:Agarwal2015_targetscan].
Some of these features, including structural accessibility, are imperfect or empirically derived.
In addition, current algorithms suffer from low specificity [@tag:Lee2016_deeptarget].

As in other applications, deep learning promises to achieve equal or better performance in predictive tasks by automatically engineering complex features to minimize an objective function.
Two recently published tools use different recurrent neural network-based architectures to perform miRNA and target prediction with solely sequence data as input [@tag:Park2016_deepmirgene; @tag:Lee2016_deeptarget].
Though the results are preliminary and still based on a validation set rather than a completely independent test set, they were able to predict microRNA target sites with higher specificity and sensitivity than TargetScan.
Excitingly, these tools seem to show that RNNs can accurately align sequences and predict bulges, mismatches, and wobble base pairing without requiring the user to input secondary structure predictions or thermodynamic calculations.
Further incremental advances in deep learning for miRNA and target prediction will likely be sufficient to meet the current needs of systems biologists and other researchers who use prediction tools mainly to nominate candidates that are then tested experimentally.
@@ -226,13 +223,11 @@ Here we focus on deep learning methods for two representative sub-problems: seco
Secondary structure refers to local conformation of a sequence segment, while a contact map contains information on all residue-residue contacts.
Secondary structure prediction is a basic problem and an almost essential module of any protein structure prediction package.
Contact prediction is much more challenging than secondary structure prediction, but it has a much larger impact on tertiary structure prediction.
In recent years, the accuracy of contact prediction has greatly improved [@doi:10.1371/journal.pcbi.1005324; @doi:10.1093/bioinformatics/btu791; @doi:10.1073/pnas.0805923106; @doi:10.1371/journal.pone.0028766].

One can represent protein secondary structure with three different states (alpha helix, beta strand, and loop regions) or eight finer-grained states.
Accuracy of a three-state prediction is called Q3, and accuracy of an 8-state prediction is called Q8.
Several groups [@doi:10.1371/journal.pone.0032235; @doi:10.1109/TCBB.2014.2343960; @doi:10.1038/srep11476] applied deep learning to protein secondary structure prediction but were unable to achieve significant improvement over the *de facto* standard method PSIPRED [@doi:10.1006/jmbi.1999.3091], which uses two shallow feedforward neural networks.
In 2014, Zhou and Troyanskaya demonstrated that they could improve Q8 accuracy by using a deep supervised and convolutional generative stochastic network [@arxiv:1403.1347].
In 2016 Wang et al. developed a DeepCNF model that improved Q3 and Q8 accuracy as well as prediction of solvent accessibility and disorder regions [@doi:10.1038/srep18962; @doi:10.1007/978-3-319-46227-1_1].
DeepCNF achieved a higher Q3 accuracy than the standard maintained by PSIPRED for more than 10 years.
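Q3 and Q8 are simply per-residue accuracies over the predicted state string; a minimal sketch (the state letters and function name here are illustrative):

```python
def q_accuracy(predicted: str, true: str) -> float:
    """Per-residue accuracy: Q3 over a 3-state alphabet such as
    H (helix), E (strand), C (coil); Q8 over an 8-state alphabet."""
    if not true or len(predicted) != len(true):
        raise ValueError("sequences must be non-empty and equal length")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

q3 = q_accuracy("HHHEECCC", "HHHEECCH")  # 7 of 8 residues agree
```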
@@ -351,8 +346,7 @@ This preprocessing effectively reduces the number of graph convolutional layers
A field poised for dramatic revolution by deep learning is bioimage analysis.
Thus far, the primary use of deep learning for biological images has been for segmentation---that is, for the identification of biologically relevant structures in images such as nuclei, infected cells, or vasculature---in fluorescence or even brightfield channels [@doi:10.1371/journal.pcbi.1005177].
Once so-called regions of interest have been identified, it is often straightforward to measure biological properties of interest, such as fluorescence intensities, textures, and sizes.
Given the dramatic successes of deep learning in biological imaging, we simply refer to articles that review recent advancements [@doi:10.3109/10409238.2015.1135868; @doi:10.1371/journal.pcbi.1005177; @doi:10.1007/978-3-319-24574-4_28].
For deep learning to become commonplace for biological image segmentation, we need user-friendly tools.

We anticipate an additional paradigm shift in bioimaging that will be brought about by deep learning: what if images of biological samples, from simple cell cultures to three-dimensional organoids and tissue samples, could be mined for much more extensive biologically meaningful information than is currently standard? For example, a recent study demonstrated the ability to predict lineage fate in hematopoietic cells up to three generations in advance of differentiation [@doi:10.1038/nmeth.4182].
@@ -410,14 +404,11 @@ Metagenomics, which refers to the study of genetic material---16S rRNA or whole-
In recent years, machine learning has proved to be a powerful tool for metagenomic analysis.
16S rRNA has long been used to deconvolve mixtures of microbial genomes, yet this ignores more than 99% of the genomic content.
Subsequent tools aimed to classify 300--3000 bp reads from complex mixtures of microbial genomes based on tetranucleotide frequencies, which differ across organisms [@tag:Karlin], using supervised [@tag:McHardy; @tag:nbc] or unsupervised methods [@tag:Abe].
Then, researchers began to use techniques that could estimate relative abundances from an entire sample faster than classifying individual reads [@tag:Metaphlan; @tag:wgsquikr; @tag:lmat; @tag:Vervier].
There is also great interest in identifying and annotating sequence reads [@tag:yok; @tag:Soueidan].
However, the focus on taxonomic and functional annotation is just the first step.
Several groups have proposed methods to determine host or environment phenotypes from the organisms that are identified [@tag:Guetterman; @tag:Knights; @tag:Stratnikov; @tag:Segata] or overall sequence composition [@tag:Ding].
Also, researchers have looked into how feature selection can improve classification [@tag:Liu; @tag:Segata], and techniques have been proposed that are classifier-independent [@tag:Ditzler; @tag:Ditzler2].
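A tetranucleotide-frequency signature of the kind these classifiers consume can be computed directly from a read; a minimal sketch (not any particular tool's implementation):

```python
from collections import Counter
from itertools import product

def tetranucleotide_frequencies(read: str) -> dict:
    """Relative frequency of each overlapping 4-mer in a read."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 possible 4-mers
    counts = Counter(read[i:i + 4] for i in range(len(read) - 3))
    total = max(sum(counts[k] for k in kmers), 1)  # ignore 4-mers with non-ACGT bases
    return {k: counts[k] / total for k in kmers}

freqs = tetranucleotide_frequencies("ACGTACGTACGT")
```

Each read becomes a fixed-length 256-dimensional vector, which is what makes these signatures convenient inputs for both the supervised and unsupervised binning methods cited above.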

Most neural networks are used for phylogenetic classification or functional annotation from sequence data where there is ample data for training.
Neural networks have been applied successfully to gene annotation (e.g. Orphelia [@tag:Hoff] and FragGeneScan [@doi:10.1093/nar/gkq747]).