using single-cell transcriptome sequencing to …...olfactory stem cells and neural regeneration the...
TRANSCRIPT
Using Single-Cell Transcriptome Sequencing toInfer Olfactory Stem Cell Fate Trajectories
Sandrine DudoitDivision of Biostatistics and Department of Statistics
University of California, Berkeleywww.stat.berkeley.edu/~sandrine
BioC 2016: Where Software and Biology ConnectStanford University
June 25, 2016
Version: 25/06/2016, 05:52
1 / 88
Acknowledgments
• Sandrine Dudoit, Division of Biostatistics and Department ofStatistics, UC Berkeley.
I Fanny Perraudeau, Graduate Group in Biostatistics, UCBerkeley.
I Davide Risso, Division of Biostatistics, UC Berkeley.I Kelly Street, Graduate Group in Biostatistics, UC Berkeley.
• John Ngai, Department of Molecular and Cell Biology, UCBerkeley – Principal investigator.
I Diya Das.I Russell Fletcher.I David Stafford.
• Elizabeth Purdom, Department of Statistics, UC Berkeley.
• Jean-Philippe Vert, Mines ParisTech and Institut Curie, Paris,France.
I Svetlana Gribkova.
2 / 88
Acknowledgments
• Nir Yosef, Department of Electrical Engineering and ComputerSciences, UC Berkeley.
I Michael Cole.I Allon Wagner.
• Funded by BRAIN Initiative and California Institute forRegenerative Medicine (CIRM).
3 / 88
Outline
1 Olfactory Stem Cell Fate TrajectoriesOlfactory Stem Cells and Neural RegenerationOlfactory Epithelium p63 DatasetAnalysis Pipeline
2 Exploratory Data Analysis and Quality Assessment/Control
3 Normalization and Expression QuantitationMotivationMethodsSoftware: sconeZero-Inflated Negative Binomial Model
4 Resampling-Based Sequential Ensemble ClusteringMotivationMethodsSoftware: clusterExperiment
5 Cell Lineage and Pseudotime InferenceMotivation
4 / 88
Outline
MethodsSoftware: slingshot
5 / 88
Workshop
Workshop: Today, 1:00-2:50 pm (Session 1, Intermediate)Analysis of single-cell RNA-seq data with R and Bioconductor
Davide Risso, Kelly Street, Michael Cole, UC Berkeley
6 / 88
Olfactory Stem Cells and Neural Regeneration
R. Fletcher, J. Ngai
7 / 88
Olfactory Stem Cells and Neural Regeneration
• The nature of stem cells giving rise to the nervous system isof particular interest in neurobiology, because neural stemcells remain active in certain brain regions for the entire life ofan individual.
• We focus on the mouse olfactory epithelium (OE), a site ofactive neurogenesis in the postnatal animal.
• Adult olfactory stem cells support the replacement of olfactorysensory neurons and non-neuronal support cells (e.g.,sustentacular) over postnatal life and can reconstitute theentire OE following injury.
• The OE is a convenient system to study, due to itsexperimental accessibility (in situ analysis) and its limitednumber of cell types:
I olfactory sensory neurons (OSN),I sustentacular cells (SUS),
8 / 88
Olfactory Stem Cells and Neural Regeneration
I cells of the Bowman gland,I microvillous cells (rare).
9 / 88
Olfactory Stem Cells and Neural Regeneration
Figure 1: Mouse olfactory epithelium.
10 / 88
Olfactory Stem Cells and Neural Regeneration
Sustentacular cells
Mature olfactory neurons
Immature olfactory neurons Globose basal cells Horizontal basal cells Olfactory ensheathing glia
Bowman’s gland
• HBCs: multipotent, quiescent – deep reserve adult tissue stem cell • GBCs: proliferative progenitor cells + transit-amplifying cells
The Horizontal Basal Cell Is an Adult Tissue Stem Cell
Figure 2: Olfactory epithelium cell types.
11 / 88
Olfactory Stem Cells and Neural Regeneration
Open questions.
• Determine the stage at which the neuronal and non-neuronallineages bifurcate/diverge.
• Characterize discrete intermediate stages of cell differentiation.
• Identify the genetic networks and signaling pathways thatpromote self-renewal and regulate the transition todifferentiation.
12 / 88
Olfactory Stem Cells and Neural Regeneration
p63 regulation of horizontal basal cells.
• The horizontal basal cell (HBC) is an adult tissue stem cell.
• The p63 protein (tumor protein p63, TP63) promotesself-renewal of HBC by blocking differentiation.
• When p63 is down-regulated, this “brake” is removed,allowing differentiation to proceed at the expense ofself-renewal. Thus, p63 can be viewed as a “molecularswitch” that decides between the alternate stem cell fates ofself-renewal vs. differentiation.
• We use single-cell RNA-Seq to analyze cell fate trajectoriesfrom olfactory stem cells (HBC) of p63 conditional knock-outmice.
13 / 88
Olfactory Epithelium p63 Dataset
X
Cre-ER Krt5
YFP
loxP-STOP-loxP
Rosa
p63
+
Experimental design
Russell Fletcher, Levi Gadye, Mike Sanchez
~ ~
0 24h 48h 72h 96h 7d 14d
tamoxifen YFP+;p63+/+: 98 cells
Resting HBCs
FACS-purify lineage-traced cells -> analyze by single cell RNA-Seq
Differentiating cells
YFP+;p63-/-: 88 cells 100 cells 128 cells 110 cells 98 cells
Figure 3: Experimental design. Single-cell RNA-Seq for 636 HBC: 102wild-type and 534 p63 knock-out cells.
14 / 88
Olfactory Epithelium p63 Dataset
P01
P02
P03
A
P03
B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
Number of cells per batch, colored by bio
num
ber
of c
ells
0
10
20
30
40
50
60KO_14DPTKO_24HPTKO_48HPTKO_7DPTKO_96HPTWT_72HPTWT_96HPT
Figure 4: Experimental design. Number of cells per batch, colored bybiological condition.
15 / 88
Olfactory Epithelium p63 Dataset
• Single-cell RNA-Seq for 636 HBC.
I 102 wild-type (WT)/resting cells534 p63 knock-out (KO) cells, at five timepoints followingtamoxifen treatment.
I Biological replicate: Cells from 1–3 mice.At least two replicates per biological condition.
I One FACS run and one C1 run per biological replicate=⇒ 14 batches.
I 8 HiSeq runs (96 cells/lane, single-end 50-base-pair reads).
• Some confounding between biological and technical effects.
• Batch effects nested within biological effects.
16 / 88
Olfactory Epithelium p63 Dataset
• Marker genes. 94 marker genes, curated from literature andfrom prior microarray, sequencing, and in situ experiments,e.g., neuronal, progenitor cell markers.
• Housekeeping genes. 715 housekeeping (HK) genes, curatedfrom prior microarray experiments, expected to be constantlyand highly-expressed across cells of the OE.
• Gene-level read counts. TopHat2 alignment to RefSeq mm10genome and featureCounts
(bioinf.wehi.edu.au/featureCounts) counting, withgenes defined as union of all isoforms.
17 / 88
Zero Inflation
Proportion of genes with zero count, pre gene filtering
0.0
0.2
0.4
0.6
0.8
0.0 0.2 0.4 0.6 0.8 1.0
01
23
45
6
Proportion of cells in which a gene is detected, pre gene filtering
N = 22054 Bandwidth = 0.02443
Den
sity
(a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected
Figure 5: Zero inflation. Pre gene filtering.
18 / 88
Zero Inflation
• Single-cell RNA-Seq data have many more genes with zeroread counts than bulk RNA-Seq data.
• This zero inflation could occur for biological reasons (i.e., thegene is simply not expressed) or technical reasons (e.g., lowcapture efficiency).
• Zero-count gene filtering is advisable for normalization anddownstream analyses.
• Most RNA-Seq normalization methods involve scaling andperform poorly when many genes have zero counts.
• In particular, the global-scaling method of Anders and Huber(2010), implemented in the Bioconductor R package DESeq,discards any gene having zero count in at least one sample. Inpractice, the scaling factors are therefore estimated based ononly a handful of genes, e.g., 5/22,054 genes for OE dataset.
19 / 88
Zero Inflation
• Full-quantile (FQ) normalization also doesn’t behave properlydue to ties from the large number of zeros.
• We apply the following zero-count gene filtering to the OEdataset: Retain only the genes with at least nr = 20 reads, inat least ns = 40 samples.This yields 9,133/22,054 genes.
20 / 88
Zero Inflation
Proportion of genes with zero count, post gene filtering
0.0
0.2
0.4
0.6
0.8
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.5
1.0
1.5
2.0
2.5
Proportion of cells in which a gene is detected, post gene filtering
N = 9133 Bandwidth = 0.03333
Den
sity
(a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected
Figure 6: Zero inflation. Post gene filtering.
21 / 88
Sample-Level QC
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
1e+06
2e+06
3e+06
QC measures by batch
NR
EA
DS
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
500000100000015000002000000250000030000003500000
QC measures by batch
NA
LIG
NE
D
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
90
92
94
96
QC measures by batch
RA
LIG
N
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
203040506070
QC measures by batch
TOTA
L_D
UP
Figure 7: Sample-level QC. Boxplots of QC measures, by batch.
22 / 88
Sample-Level QC
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
0.00.10.20.30.40.50.6
QC measures by batch
PC
T_I
NT
RO
NIC
_BA
SE
S
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
0.100.150.200.250.30
QC measures by batch
PC
T_I
NT
ER
GE
NIC
_BA
SE
S
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
0.20.30.40.50.60.70.8
QC measures by batch
PC
T_M
RN
A_B
AS
ES
P01
P02
P03
AP
03B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
0.60.81.01.21.41.61.82.0
QC measures by batch
ME
DIA
N_C
V_C
OV
ER
AG
E
Figure 8: Sample-level QC. Boxplots of QC measures, by batch.
23 / 88
Sample-Level QC
−5 0 5 10
−4
−2
02
46
QC PCA by batch
PC1
PC
2
P01
P02
P03
A
P03
B
P04
P05
P06
P10
P11
P12
P13
P14
Y01
Y04
−4
−2
0
2
4
6
QC PCA by batch
PC
1
(a) QC PC2 vs. PC1, colored by batch (b) QC PC1, by batch
Figure 9: Sample-level QC. Principal component analysis (PCA) ofsample-level QC measures.
24 / 88
Sample-Level QC
−40 −20 0 20 40 60 80
−5
05
10QC PC1 vs. count PC1
Count PC1
QC
PC
1
Cor=−0.49
PC1 PC2 PC3 PC4 PC5
Absolute correlation of count PC and QC measures
0.0
0.2
0.4
0.6
0.8
1.0
NREADSNALIGNEDRALIGNTOTAL_DUPPRIMERPCT_RIBOSOMAL_BASESPCT_CODING_BASESPCT_UTR_BASESPCT_INTRONIC_BASESPCT_INTERGENIC_BASESPCT_MRNA_BASESMEDIAN_CV_COVERAGEMEDIAN_5PRIME_BIASMEDIAN_3PRIME_BIAS
(a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures
Figure 10: Sample-level QC. Association of counts and sample-level QCmeasures.
25 / 88
Sample-Level QC: Summary
• The distribution of QC measures can vary substantially withinand between batches.
• Some QC measures clearly point to low-quality samples, e.g.,low percentage of mapped reads (RALIGN).
• There can be a strong association between QC measures andread counts (cf. PCA).
• Filtering samples based on QC measures is advisable, asnormalization procedures may not be able to adjust for QCand some samples simply have low quality.
• Normalization procedures based on QC measures (e.g.,regression on first few PC of QC measures) should also beconsidered.
26 / 88
Gene-Level Counts
0
2
4
6
8
10
Gene−level log−count
log(
coun
t+1)
−4
−2
0
2
4
6
8
Gene−level RLE
RLE
(a) Log-count (b) RLE
Figure 11: Gene-level counts. Gene-level log-count and relative logexpression (RLE = log-ratio of read count to median read count acrosssamples).
27 / 88
Gene-Level Counts: Summary
• After gene and sample filtering and before normalization,there are large differences in gene-level count distributionswithin and between batches (cf. RLE, housekeeping genes).
• The counts are still zero-inflated.
• There can be substantial association of counts andsample-level QC measures.
• Normalization is essential before any clustering or differentialexpression analysis, to ensure that observed differences inexpression measures between samples and/or genes are trulydue to differential expression and not technical artifacts.
28 / 88
SCONE
D. Risso, M. Cole, N. Yosef
29 / 88
SCONE
SCONE: Single-Cell Overview of Normalized Expression. A generalframework for the normalization of scRNA-Seq data.
• Range of normalization methods.I Global-scaling, e.g., DESeq, TMM.I Full-quantile (FQ).I Unknown factors of unwanted variation: Remove unwanted
variation (RUV).I Known factors of unwanted variation: Regression-based
normalization on, e.g., QC PC, C1 run.
• Normalization performance metrics.
• Numerical and graphical summaries of normalized read countsand metrics.
• R package scone, to be released through the BioconductorProject: github.com/yoseflab/scone.
30 / 88
SCONE
Figure 12: scone. Regression model.
31 / 88
SCONE
Performance metrics. (Green: Good when high; Red: Good whenlow.)
• BIO SIL: Average silhouette width by biological condition.
• BATCH SIL: Average silhouette width by batch.
• PAM SIL: Maximum average silhouette width for PAMclusterings, for a range of user-supplied numbers of clusters.
• EXP QC COR: Maximum squared Spearman correlationbetween count PCs and QC measures.
• EXP UV COR: Maximum squared Spearman correlationbetween count PCs and factors of unwanted variation(preferably derived from other set of negative control genesthan used in RUV).
• EXP WV COR: Maximum squared Spearman correlationbetween count PCs and factors of wanted variation (derivedfrom positive control genes).
32 / 88
SCONE
• RLE MED: Mean squared median relative log expression(RLE).
• RLE IQR: Mean inter-quartile range (IQR) of RLE.
33 / 88
Software Package scone
Application to OE p63 dataset.
• Apply and evaluate 172 normalization procedures using mainscone function.
I scaling method: None, DESeq, TMM, FQ.I uv factors: None, RUVg k = 1, · · · , 5, QC PC k = 1, · · · , 5.I adjust biology: Yes/no.I adjust batch: Yes/no.
• Select a normalization procedure based on (function of) theperformance scores.Unweighted mean score =⇒ none,fq,qc k=4,bio,no batch
Weighted mean score =⇒none,fq,qc k=2,no bio,no batch
34 / 88
Software Package scone
−0.15 −0.10 −0.05 0.00 0.05 0.10
−0.
15−
0.05
0.00
0.05
0.10
SCONE: Biplot of scores colored by mean score
PC1
PC
2
BIO_SIL
BATCH_SILPAM_SIL
EXP_QC_COR
EXP_UV_COREXP_WV_COR
RLE_MEDRLE_IQR
Figure 13: scone. Biplot of performance scores, colored by mean score(yellow high/good, blue low/bad).
35 / 88
Software Package scone
−0.4 −0.2 0.0 0.2
−0.
20−
0.10
0.00
0.10
SCONE: Score PCA colored by mean score
PC1
PC
2
Top meanTop weighted meanfq,ruv_k=1none
Figure 14: scone. PCA of performance scores, colored by mean score.
36 / 88
Software Package scone
−0.4 −0.2 0.0 0.2
−0.
20−
0.10
0.00
0.10
SCONE: Score PCA colored by method
PC1
PC
2
nonefqdeseqtmm
Figure 15: scone. PCA of performance scores, colored by method –scaling method.
37 / 88
Software Package scone
−0.4 −0.2 0.0 0.2
−0.
20−
0.10
0.00
0.10
SCONE: Score PCA colored by method
PC1
PC
2
no_bio,no_batchno_bio,batchbio,no_batchbio,batch
Figure 16: scone. PCA of performance scores, colored by method –adjust biology, adjust batch.
38 / 88
Software Package scone
FQ + RUVg(HK, k=1): W by batch
W
−0.
15−
0.05
0.00
0.05
0.10
−0.20 −0.15 −0.10 −0.05 0.00 0.05 0.10
−5
05
10
FQ + RUVg(HK, k=1): QC PC1 vs. W
W
QC
PC
1
Cor=−0.56
(a) Unwanted factor W (b) QC PC1 vs. W
Figure 17: scone. Association of RUVg unwanted factor W and QCmeasures for none,fq,ruv k=1,no bio,batch.
39 / 88
Software Package scone
−3
−2
−1
0
1
2
3
4
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: RLE
RLE
−2
0
2
4
6
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: HK RLE
RLE
(a) All genes (b) Housekeeping genes
Figure 18: scone. Gene-level relative log expression (RLE = log-ratio ofread count to median read count across samples) for method with topweighted mean score none,fq,qc k=2,no bio,no batch.
40 / 88
Software Package scone
−20 0 20 40
−5
05
10
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: QC PC1 vs. count PC1
count PC1
QC
PC
1
Cor=0
PC1 PC2 PC3 PC4 PC5
SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: Absolute correlation of count PC and QC measure
0.0
0.2
0.4
0.6
0.8
1.0
NREADSNALIGNEDRALIGNTOTAL_DUPPRIMERPCT_RIBOSOMAL_BASESPCT_CODING_BASESPCT_UTR_BASESPCT_INTRONIC_BASESPCT_INTERGENIC_BASESPCT_MRNA_BASESMEDIAN_CV_COVERAGEMEDIAN_5PRIME_BIASMEDIAN_3PRIME_BIAS
(a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures
Figure 19: scone. Association of counts and sample-level QC measures,none,fq,qc k=2,no bio,no batch.
41 / 88
Software Package scone: Summary
• Unnormalized gene-level counts exhibit large differences indistributions within and between batches and association withsample-level QC measures.
• Different normalization methods vary in performanceaccording to SCONE metrics and lead to different distributionsof gene-level counts, hence clustering and DE results.
• Global-scaling normalization. Not aggressive enough to handlepotentially large batch effects and association of counts andQC measures. Biological effects are dominated by nuisancetechnical effects. Additionally, for DESeq, the scaling factorsare computed based on only a handful of genes with non-zerocounts in all cells (5/22,054).
42 / 88
Software Package scone: Summary
• Batch effect normalization. Adjusting for batch effectswithout properly accounting for the nesting of batch withinbiological effects (no bio,batch) in the regression model isproblematic, as this removes the biological effects of interest(e.g., empirical Bayes framework of ComBat).
• FQ followed by QC-based or RUVg normalization. Seemseffective: Similar RLE distributions between samples, lowerassociation of counts and QC measures. The first unwantedfactor of RUVg is correlated with the first QC PC.
• The remaining analyses are based onnone,fq,qc k=2,no bio,no batch, the best methodaccording to weighted mean score.
43 / 88
Software Package scone: Summary
• Interpretation of performance metrics. Some metrics tend tofavor certain methods over others, e.g., EXP UV COR(correlation between count PCs and factors of unwantedvariation) naturally favors RUVg, especially when the same setof negative controls are used for normalization and evaluation.Hence, a careful, global interpretation of the metrics isrecommended.
• Negative controls. The selection of proper, distinct sets ofnegative controls is important, as these are used for bothnormalization (RUVg) and assessment of normalization results(EXP UV COR).
• Ongoing efforts.I Zero-inflated negative binomial (ZINB) model.I User-supplied factors unwanted and wanted variation (UV and
WV, respectively).I Other methods (e.g., ComBat/sva).
44 / 88
Software Package scone: Summary
I Other performance metrics.I Visualization.I Shiny app for interactive web interface.
45 / 88
Resampling-Based Sequential Ensemble Clustering
D. Risso, E. Purdom
46 / 88
Motivation
• Robustness to choice of samples. Both hierarchical andpartitioning methods tend to be sensitive to the choice ofsamples to be clustered. Outlying samples/clusters (e.g., glia)are common in scRNA-Seq and mask interesting substructurein the data, often requiring the successive pruning out ofdominating clusters to get to the finer structure.
• Robustness to clustering algorithm and tuning parameters.Clustering results are sensitive to pre-processing steps such asnormalization and dimensionality reduction, as well as to thechoice of clustering algorithm and associated tuningparameters (e.g., distance function, number of clusters).
47 / 88
Motivation
• Not focusing on the number of clusters. A major tuningparameter of partitioning methods such as partitioning aroundmedoids (PAM) and k-means is the number of clusters k .Methods for selecting k (e.g., silhouette width) are sensitiveto the choice of samples, normalization, and other tuningparameters. They tend to be conservative (low k), i.e.,capture only the coarse clustering structure and maskinteresting substructure in the data. Additionally, the numberof clusters k is often not of primary interest.E.g. Silhouette width with PAM selects only k = 2 clusters forthe OE p63 dataset.
• Not forcing samples into clusters. Some samples may beoutliers, that do not really belong to any clusters. Leavingthem out can improve the quality and interpretability of theclustering as well as downstream analyses (e.g., identificationof cluster marker genes).
48 / 88
Motivation
• Cluster gene expression signatures. Common differentialexpression statistics are not well-suited for finding markergenes for the clusters, especially for finer structure in ahierarchy.
49 / 88
Resampling-Based Sequential Ensemble Clustering
• We have developed a resampling-based sequential ensembleclustering approach, with the aim of obtaining stable andtight clusters.
• Ensemble clustering, i.e., aggregating multiple clusteringsobtained from different algorithms or applications of a givenalgorithm to resampled versions of the learning set, is ageneral approach for improving stability. This can be viewedas the unsupervised analog of ensemble methods in supervisedlearning, e.g., bagging, boosting, random forests.
• Our approach is related to bagged/consensus/tight clustering(Dudoit and Fridlyand, 2003; Leisch, 1999; Tseng and Wong,2005).
• R package clusterExperiment, released through theBioconductor Project.
50 / 88
Resampling-Based Sequential Ensemble Clustering
RSEC: Resampling-based Sequential Ensemble Clustering.
• Given a base clustering algorithm (e.g., PAM, k-means) andassociated tuning parameters (e.g., number of principalcomponents, number of clusters k , distance matrix), generatea single candidate clustering using
I resampling-based clustering to find robust and tight clusters;I sequential clustering to find stable clusters over a range of
numbers of clusters (Tseng and Wong, 2005).
• Generate a collection of candidate clusterings by repeating theabove procedure for different base clustering algorithms andtuning parameters.
• Identify a consensus over the different candidate clusterings.
• Merge non-differential clusters.
• Find cluster signatures by testing for differential expressionbetween selected subsets of clusters.
51 / 88
Resampling-Based Sequential Ensemble Clustering
• Visualization. Comparison of multiple clusterings of the samesamples, heatmaps of co-clustering matrices, heatmaps withhierarchical clustering of genes and/or samples.
52 / 88
Differential Expression
• Find cluster gene expression signatures, i.e., marker genes, bytesting for differential expression between selected subsets ofclusters.
• Standard F -statistic. Tests for any difference betweenclusters. Sensitive to outlying samples/clusters. Non-specific,i.e., not useful for interpreting differences between clusters.
• Standard solution in (generalized) linear models/ANOVA is toconsider contrasts between groups of clusters. By using themachinery of the (generalized) linear model, we use all of thesamples in testing these contrasts, rather than just thosesamples involved in the corresponding clusters.
I All pairwise. All pairwise comparisons between clusters.I One against all. Compare each cluster to union of remaining
clusters.
53 / 88
Differential Expression
I Dendrogram. Create a hierarchy of clusters, work up the tree,test for DE between sister nodes (as in approach used formerging clusters).
• For each contrast, test for DE using empirical Bayes linearmodeling approach of R package limma, with voom option toaccount for mean-variance relationship of log-counts (i.e.,over-dispersion).
54 / 88
Software Package clusterExperiment
Workflow.
• clusterMany. Generate a collection of candidate clusterings,for different base clustering algorithms and tuning parameters,with option to use resampling and sequential approaches.
• combineMany. Find consensus clustering across severalclusterings.
• Identify non-differential clusters that should be merged intolarger clusters.
I makeDendrogram. Hierarchical clustering of the clusters foundby combineMany.
I mergeClusters. Merge clusters of this hierarchy based on DEbetween nodes.
• RSEC. Wrapper function around the clusterExperimentworkflow.
55 / 88
Software Package clusterExperiment
• getBestFeatures. Find cluster signatures by testing fordifferential expression between selected subsets of clusters.
• Visualization.I plotClusters. Comparison of multiple clusterings of the
same samples. Based on ConsensusClusterPlus package.I plotHeatmap. Heatmaps of co-clustering matrices, heatmaps
with hierarchical clustering of genes and/or samples (interfaceto aheatmap from NMF package).
56 / 88
Software Package clusterExperiment
Application to OE p63 dataset.
• clusterMany: Generate 22 candidate clusterings.I Dimensionality reduction: 25, 50 PC.I Euclidean distance.I Base clustering method: PAM, k = 5, · · · , 15.I Resampling-based clustering: B = 100, proportion = 0.7,α = 0.3.
I Sequential clustering: k0 = 15, β = 0.9.I clusterFunction=c("hierarchical01").
• combineMany(ce, clusterFunction="hierarchical01",
whichClusters="workflow", proportion=0.7,
propUnassigned=0.5, minSize=5).
• mergeClusters(ce, mergeMethod="adjP",
cutoff=0.05).
57 / 88
Software Package clusterExperiment
Clusterings from clusterMany
nPCAFeatures=50,k0=15nPCAFeatures=25,k0=15nPCAFeatures=50,k0=14nPCAFeatures=25,k0=14nPCAFeatures=50,k0=13nPCAFeatures=25,k0=13nPCAFeatures=50,k0=12nPCAFeatures=25,k0=12nPCAFeatures=50,k0=11nPCAFeatures=25,k0=11nPCAFeatures=50,k0=10nPCAFeatures=25,k0=10
nPCAFeatures=50,k0=9nPCAFeatures=25,k0=9nPCAFeatures=50,k0=8nPCAFeatures=25,k0=8nPCAFeatures=50,k0=7nPCAFeatures=25,k0=7nPCAFeatures=50,k0=6nPCAFeatures=25,k0=6nPCAFeatures=50,k0=5nPCAFeatures=25,k0=5
Figure 20: clusterExperiment. Comparison of 22 clusterMany
clusterings using plotClusters.
58 / 88
Software Package clusterExperiment
combineMany−1c1c10c11c12c13c14c15c16c17c18c19c2c20c21c22c23c24c25c26c27c28c29c3
0
0.2
0.4
0.6
0.8
1
Co−clustering proportion matrix
combineMany
Figure 21: clusterExperiment. Heatmap of co-clustering matrix forclusterMany clusterings, used to create combineMany clustering(plotHeatmap).
59 / 88
Software Package clusterExperiment
c1
c2
c3
c4
c5
c6
c7c8c9
c10
c11c12
c13
c14
c15
c16
c17c18
c19c20
c21c22
c23
c24c25
c26c27
c28
c29c30c31
c32
c1
c2
c3
c4
c5
c6
c7c8c9
c10
c11c12
c13
c14
c15
c16
c17c18
c19c20
c21c22
c23
c24c25
c26c27
c28
c29c30c31
c32
0.38
0.058
0.15
0.028
0.066
0.088
0.035 0.014
0.014
0.039
0.039
0.026
0.023
0.017
0.016
0.0065
0.076
0.00790.011
0.0086
0.0110.0042
0.0077
0.0092
0.006
0.0065
0.0086
0.0056
0.0051
0.0038
0.0056
Figure 22: clusterExperiment. Dendrograms from combineMany andmergeClusters.
60 / 88
Software Package clusterExperiment
−50 0 50
−40
−20
020
4060
80
Count PCA by cluster
PC1
PC
2
−112345678
Figure 23: clusterExperiment. PCA of gene-level log-counts, colored bymergeClusters.
61 / 88
Software Package clusterExperiment
mergeClusters−1m1m2m3m4m5m6m7m8
Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor
−4
−2
0
2
4
6
8
10
Heatmap of DE genes, dendrogram contrasts
mergeClustersBio
Figure 24: clusterExperiment. Heatmap of log-counts forgetBestFeatures DE genes using dendrogram contrasts (269, top 50 ineach of the 6 nodes).
62 / 88
Software Package clusterExperiment
sirt5
hnrnpl
rmi2
gm17821
nup98
cox6c
oaz1
rpl27
rplp1
rpl13a
rps20
rps7
rpl23
rpl37a
rpl23a
rps29
rpl32
rplp2
rpl24
fau
rpl36
rpl37
rps23
rpl35
prrc2c
son
kcnq1ot1
nip7
eif1
cfl1
rpl10
lars2
rps25
rpl13
naca
rplp0
rpl7
rpl8
rps6
rps3a1
rpl21
rps11
rpl14
hmgb1
rpl3
trns1
rnr1
ppia
rps27
myh9
rpl9
rps9
rps19
rpl41
rps14
map1b
eif4g2
calm2
ywhae
gsk3b
calm1
rn18s−rs5
malat1
actb
gnb1
rnr2
hsp90ab1
eef1a1
h3f3b
canx
btg1
srsf2
csde1
fth1
sod1
set
ywhaz
ubb
tmsb4x
prdx1
akr1a1
sptbn1
ubc
mt1
dstn
rps5
rpl4
rps3
rpl18a
actr2
ddx5
hnrnpa2b1
skp1a
hspa8
ptges3
tpt1
actg1
ftl1
ptma
gm1821 mergeClusters−1m1m2m3m4m5m6m7m8
Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor
0
2
4
6
8
10
Heatmap of DE genes, F−test
mergeClustersBio
Figure 25: clusterExperiment. Heatmap of log-counts forgetBestFeatures DE genes using F -test (top 100).
63 / 88
Software Package clusterExperiment
sox11
creer
gstm2
cyp2f2
cbr2
trim66
gap43
ebf1
ebf2
ncam1
omp
cnga2
gng13
rgs11
lhx2
ebf3
nhlh1
gng8
arhgdig
cyp2g1
lgr5
cyp1a2
scarb1
coch
cd24a
cftr
ceacam1
prmt5
fabp5
neurod1
nhlh2
zfp423
ebf4
rrm1
ezh2
hes6
ccnd1
rbm24
kit
dtl
tead2
mcm7
ccna2
ect2
top2a
cdca8
pbk
il6
notch1
yap1
icam1
ppara
fgfr2
dkk3
igfbp4
itgb5
egfr
igfbp2
pax6
sall1
elf5
il33
sox2
wnt4
lifr
nrcam
trp63
krt14
krt5
sox9
kitl
sfrp1
f3
perp
ccnd2
hmgb2
tubb5
cdkn1b
egfp
igfbp5
krt8
hes1
notch2 mergeClusters−1m1m2m3m4m5m6m7m8
Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor
0
2
4
6
8
10
Heatmap of markers genes
mergeClustersBio
Figure 26: clusterExperiment. Heatmap of log-counts for “a priori”markers genes (83).
64 / 88
Software Package clusterExperiment: Summary
• Tuning parameters.I α controls tightness.I β controls stability.I k0, the initial number of clusters used in sequential clustering,
is the parameter with the greatest impact on the results.Larger k0 tend to lead to smaller and tighter clusters.
• Caveat. The DE analysis is exploratory and nominal p-valuesonly a rough summary of significance (reliance on models, tinyp-values even after adjustment for multiple testing, same dataused to define clusters and to perform DE analysis).
• Ongoing efforts.I Cluster confidence measures.I Potentially assign unclustered observations to clusters.I DE using ZINB model.I Greater modularity (e.g., distance functions, DE test).I Shiny app for interactive web interface.
65 / 88
Cell Lineage and Pseudotime Inference
K. Street, D. Risso, E. Purdom
66 / 88
Motivation
• Mapping transcriptional progression from stem cells tospecialized cell types is essential for properly understandingthe mechanisms regulating cell and tissue differentiation.
• There may not always be a clear distinction between states,but rather a smooth transition, with individual cells existingon a continuum between states.
• In such a case, cells may undergo gradual transcriptionalchanges, where the relationship between states can berepresented as a continuous lineage dependent upon anunderlying spatial or temporal variable. This representation,referred to as pseudotemporal ordering, can help usunderstand how cells differentiate and how cell fate decisionsare made (Bendall et al., 2014; Campbell et al., 2015; Ji andJi, 2016; Petropoulos et al., 2016; Shin et al., 2015; Trapnellet al., 2014).
67 / 88
Motivation
• We have developed Slingshot as a flexible and robustframework for inferring cell lineages and pseudotimes in thestudy of continuous differentiation processes.
68 / 88
Cell Lineage and Pseudotime Inference
• Input/Output.I Input. Normalized gene expression measures and cell
clustering.I Output. Cell lineages, i.e., subsets of ordered cell clusters.
Cell pseudotimes, i.e., for each lineage, ordered sequence ofcells and associated pseudotimes.
• Dimensionality reduction.I Principal component analysis (PCA) seems effective and
simple, in conjunction with steps detailed next.I Other approaches include related linear methods, e.g.,
independent component analysis (ICA) (Trapnell et al., 2014,Monocle), and non-linear methods, e.g., Laplacianeigenmaps/spectral embedding (Campbell et al., 2015,Embeddr), t-distributed stochastic neighbor embedding(t-SNE) (Bendall et al., 2014; Petropoulos et al., 2016,Wanderlust).
69 / 88
Cell Lineage and Pseudotime Inference
• Inferring cell lineages.I Minimum spanning tree (MST; ape package) over cell clusters,
with between-cluster distance based on Euclidean distancebetween cluster means scaled by within-cluster covariance.
I Outlying clusters. Identified using granularity parameter ω thatlimits maximum edge weight in the tree. Specifically, buildMST using an artificial cluster Ω, with distance ω from otherclusters (a fraction of maximum pairwise distance betweenclusters), and then remove Ω.
I Root and leaf nodes. May either be pre-specified orautomatically selected.Root node. If not pre-specified, selected based on parsimony(i.e., set of lineages with maximal number of clusters sharedbetween them).Leaf nodes. If pre-specified, constrained MST.
I A lineage is then defined as any unique path coming out of theroot node and ending in a leaf node.
70 / 88
Cell Lineage and Pseudotime Inference
I Constructing the MST on clusters (Ji and Ji, 2016; Shin et al.,2015, TSCAN,Waterfall) vs. cells (Trapnell et al., 2014,Monocle) offers greater stability and computational efficiency,less complex lineages, and easier determination of directionalityand branching.
• Inferring cell pseudotimes.I Iterative procedure inspired from the principal curve algorithm
of Hastie and Stuetzle (1989); principal.curve function inprincurve package.
I In the case of branching lineages, a shrinkage step is includedat each iteration, that forces a degree of similarity between thecurves in the neighborhood of shared clusters.
I Pseudotime values are derived by orthogonal projection ontothe curves.
I Cells belonging to clusters that are included in multiplelineages have multiple, similar pseudotime values.
71 / 88
Cell Lineage and Pseudotime Inference
I Previous approaches also use smooth curves to representlineages (Campbell et al., 2015; Petropoulos et al., 2016,Embeddr), while others use piecewise linear paths through theMST and extract orderings either by orthogonal projection (Jiand Ji, 2016; Shin et al., 2015, TSCAN,Waterfall) or PQ tree(Trapnell et al., 2014, Monocle).
I We find that smooth curves provide discerning power not foundin piecewise linear trajectories, while also adding stability overa range of dimensionality reduction and clustering methods.
• Differential expression. Regression of gene expressionmeasures on pseudotime, e.g., generalized additive models(GAM) (Ji and Ji, 2016, TSCAN).
• Visualization. Two- and three-dimensional plots of celllineages and pseudotimes, gene-level trajectories, heatmaps forDE genes.
72 / 88
Cell Lineage and Pseudotime Inference
• R package slingshot, to be released through the BioconductorProject: github.com/kstreet13/slingshot.
73 / 88
Software Package slingshot
• Modularity.I Integrates easily with a range of normalization, clustering, and
dimensionality reduction methods.I get lineages: Given expression measures and cluster labels,
use MST to infer lineages.get lineages(X, clus.labels, start.clus = NULL,
end.clus = NULL, dist.fun = NULL, omega = Inf,
distout = FALSE).I get curves: Given lineages, infer pseudotimes.
get curves(X, clus.labels, lineages, thresh =
1e-04, maxit = 100, stretch = 2, shrink = TRUE).
• Flexibility. Can be used with varying levels of supervision.I Cluster-based approach allows for easy supervision when
researchers have prior knowledge of cell classes, while stillbeing able to detect novel branching events.
I User-supplied or data-driven selection of root and leaf nodes.
74 / 88
Software Package slingshot
• Visualization.
I plot tree: MST in 2 and 3D.I plot curves: Lineage curves in 2 and 3D.
75 / 88
Software Package slingshot
Application to OE p63 dataset.
• Applied to the first three principal components, Slingshotidentifies two lineages: The first corresponds to theHBC-to-neurons transition, the second to theHBC-to-sustentacular cells transition.
• A first-pass DE analysis, based on a regression of log-count onpseudotime using GAM, suggests that many genes areinvolved in the differentiation process.
• Among the top 100 DE genes for each lineage, only 13 are DEin both, suggesting distinct processes in the neuronal vs.non-neuronal lineages.
76 / 88
Software Package slingshot
−50 0 50−
500
50
PC2
PC
1
−40 0 20 40 60−
400
2040
6080
PC3
PC
2
Figure 27: slingshot. PCA of gene-level log-counts, colored by clusters,with MST edges used to infer lineages (get lineages, plot tree).
77 / 88
Software Package slingshot
−50 0 50−
500
50
PC2
PC
1
−40 0 20 40 60−
400
2040
6080
PC3
PC
2
Figure 28: slingshot. PCA of gene-level log-counts, colored by clusters,with smooth curves representing lineages and used to infer pseudotimes(get curves, plot curves).
78 / 88
Software Package slingshot
0 50 150
04
8
fstl5
pseudotimelo
g(co
unt+
1)
0 50 150
04
8
ugt2a1
pseudotime
log(
coun
t+1)
0 50 150
06
gstm1
pseudotime
log(
coun
t+1)
0 50 150
04
8
ncam1
pseudotime
log(
coun
t+1)
0 50 150
04
8
gstm2
pseudotimelo
g(co
unt+
1)
0 50 150
04
8
rtn1
pseudotime
log(
coun
t+1)
0 50 150
04
8
aqp3
pseudotime
log(
coun
t+1)
0 50 150
04
8
scn9a
pseudotime
log(
coun
t+1)
0 50 150
04
gm581
pseudotimelo
g(co
unt+
1)
Slingshot: DE genes based on GAM, lineage 1
Figure 29: slingshot. Scatterplots of gene-level log-count vs. pseudotimefor GAM DE genes in lineage 1 (HBC–Neurons).
79 / 88
Software Package slingshot
jakmip1
tuba1a
gm11223
stmn1
sult1d1
ak1
ebf3
elavl3
myt1l
atf5
ttll7
homer2
uchl1
stmn3
rd3
calb2
snap25
cntn4
aplp1
clgn
rtp1
gap43
trim66
sept3
ank2
spock1
gm581
scn9a
mgst3
gng13
5730409k12rik
prune2
flrt1
syt11
cystm1
dcx
stmn2
fstl5
ncam1
ebf1
tubb3
ebf2
ell3
rtn1
map1b
gnb1
rps27
prdx1
sp8
ifitm2
runx1
ezr
zfp36l1
tcf7l2
atf3
krt18
sdc4
prdx6
gstm1
cyp2f2
cbr2
fmo6
ccnd2
gsta4
slc26a7
lypd2
notch2
aox2
hspb1
socs3
dapl1
s100a11
ahnak
sik1
hes1
bhlhe40
irf1
cyr61
junb
nr4a1
klf4
fosb
wls
rbm47
foxn3
ptn
gja1
ifitm3
clu
aqp4
anxa2
ces1d
ephx1
zfp36
epas1
gpm6a
scgb1c1
ugt2a1
aqp3
gstm2 mergeClustersm1m2m3m5m7m8
BioHBCHBC transitionINP/GBCINP2iOSNmOSNsSUSSUS precursorSUS progenitor
0
2
4
6
8
10
Slingshot: DE genes based on GAM, lineage 1
mergeClustersBio
Figure 30: slingshot. Heatmap of log-counts for GAM DE genes inlineage 1 (HBC–Neurons), cells sorted by pseudotime.
80 / 88
Software Package slingshot
Lineages.
$lineage1
[1] "8" "3" "7" "5" "2" "1"
$lineage2
[1] "8" "3" "7" "4"
Biological annotation of clusters.
cl
b 1 2 3 4 5 7 8
HBC 0 0 26 0 0 1 47
HBC transition 0 0 35 0 0 19 0
INP/GBC 0 0 1 0 41 0 0
INP2 0 34 0 0 0 0 0
iOSN 1 77 1 0 0 0 0
Microvillous 0 0 0 8 0 0 0
mOSNs 33 1 0 0 0 0 0
SUS 0 0 1 61 0 5 0
SUS precursor 0 0 7 2 4 62 0
SUS progenitor 0 0 0 0 6 0 0
81 / 88
Software Package slingshot: Summary
Ongoing efforts.
• Number of lineages: User-supplied, testing for distinctlineages, merging non-differential lineages.
• DE within and between (i.e., bifurcation) lineages.
• Visualization.
• Performance measures.
• OOP with S4 classes and methods.
• Shiny app for interactive web interface.
82 / 88
References
S. Anders and W. Huber. Differential expression analysis for sequence countdata. Genome Biology, 11(10):R106, 2010.
S. C. Bendall, K. L. Davis, E. D. Amir, M. D. Tadmor, E. F. Simonds, T. J.Chen, D. K. Shenfeld, G. P. Nolan, and D. Pe’er. Single-cell trajectorydetection uncovers progression and regulatory coordination in human B celldevelopment. Cell, 157(3):714–725, 2014.
B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison ofnormalization methods for high density oligonucleotide array data based onbias and variance. Bioinformatics, 19(2):185–193, 2003.
J. H. Bullard, E. A. Purdom, K. D. Hansen, and S. Dudoit. Evaluation ofstatistical methods for normalization and differential expression inmRNA-Seq experiments. BMC Bioinformatics, 11:Article 94, 2010. URLhttp://www.biomedcentral.com/1471-2105/11/94/abstract. (Highlyaccessed).
83 / 88
References
K. Campbell, C. P. Ponting, and C. Webber. Laplacian eigenmaps and principalcurves for high resolution pseudotemporal ordering of single-cell RNA-seqprofiles. Technical report, MRC Functional Genomics Unit, University ofOxford, UK, 2015. URLbiorxiv.org/content/early/2015/09/18/027219.
S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clusteringprocedure. Bioinformatics, 19(9):1090–1099, 2003. URL http:
//bioinformatics.oxfordjournals.org/content/19/9/1090.abstract.
J. Gagnon-Bartsch, L. Jacob, and T. P. Speed. Removing unwanted variationfrom high dimensional data with negative controls. Technical Report 820,Department of Statistics, University of California, Berkeley, 2013.
J. A. Gagnon-Bartsch and T. P. Speed. Using control genes to correct forunwanted variation in microarray data. Biostatistics, 13(3):539–552, 2012.
T. Hastie and W. Stuetzle. Principal curves. Journal of the AmericanStatistical Association, 84(406):502–516, 1989.
R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis,U. Scherf, and T. P. Speed. Exploration, normalization, and summaries ofhigh density oligonucleotide array probe level data. Biostatistics, 4(2):249–264, 2003.
84 / 88
References
L. Jacob, J. Gagnon-Bartsch, and T. P. Speed. Correcting gene expression datawhen neither the unwanted factors nor the factor of interest is observed.Annals of Applied Statistics, 2013. (Submitted).
Z. Ji and H. Ji. TSCAN: Pseudo-time reconstruction and evaluation insingle-cell RNA-seq analysis. Nucleic Acids Research, 2016.
W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarrayexpression data using empirical Bayes methods. Biostatistics, 8(1):118–127,2007.
F. Leisch. Bagged clustering. Technical Report 51, SFB Adaptive InformationSystems and Modelling in Economics and Management Science, ViennaUniversity of Economics and Business Administration, Vienna, Austria,August 1999. URLwww.ci.tuwien.ac.at/~leisch/papers/fl-techrep.html.
J. Loven, D. A. Orlando, A. A. Sigova, C. Y. Lin, P. B. Rahl, C. B. Burge,D. L. Levens, T. I. Lee, and R. A. Young. Revisiting global gene expressionanalysis. Cell, 151(3):476–482, 2012.
85 / 88
References
J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad. RNA-seq:an assessment of technical reproducibility and comparison with geneexpression arrays. Genome Research, 18(9):1509–1517, 2008.
A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mappingand quantifying mammalian transcriptomes by RNA-Seq. Nature Methods,5(7):621–628, 2008.
S. Petropoulos, D. Edsgard, B. Reinius, Q. Deng, S. P. Panula, S. Codeluppi,A. Plaza Reyes, S. Linnarsson, R. Sandberg, and F. Lanner. Single-cellRNA-Seq reveals lineage and X chromosome dynamics in humanpreimplantation embryos. Cell, 165(In press):1–15, 2016.
D. Risso, K. Schwartz, G. Sherlock, and S. Dudoit. GC-content normalizationfor RNA-Seq data. BMC Bioinformatics, 12:Article 480, 2011. URLhttp://www.biomedcentral.com/1471-2105/12/480/abstract. (Highlyaccessed).
D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. Normalization of RNA-seq datausing factor analysis of control genes or samples. Nature Biotechnology, 32(9):896–902, 2014a. URL http:
//www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2931.html.
86 / 88
References
D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. The role of spike-in standards inthe normalization of RNA-seq. In S. Datta and D. Nettleton, editors,Statistical Analysis of Next Generation Sequencing Data, Frontiers inProbability and the Statistical Sciences, chapter 9, pages 169–190. SpringerInternational Publishing, 2014b.
M. D. Robinson and A. Oshlack. A scaling normalization method for differentialexpression analysis of RNA-seq data. Genome Biology, 11(3):R25, 2010.
J. Shin, D. A. Berg, Y. Zhu, J. Y. Shin, J. Song, M. A. Bonaguidi,G. Enikolopov, D. W. Nauen, K. M. Christian, G. Ming, and H. Song.Single-cell RNA-Seq with Waterfall reveals molecular cascades underlyingadult neurogenesis. Cell Stem Cell, 17(3):360–372, 2015.
C. Trapnell, D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. J.Lennon, K. J. Livak, T. S. Mikkelsen, and J. L. Rinn. The dynamics andregulators of cell fate decisions are revealed by pseudotemporal ordering ofsingle cells. Nature Biotechnology, 4(32):381–391, 2014.
G. C. Tseng and W. H. Wong. Tight clustering: a resampling-based approachfor identifying stable and tight patterns in data. Biometrics, 61(1):10–16,2005.
87 / 88
References
C. A. Vallejos, D. Risso, A. Scialdone, S. Dudoit, and J. C. Marioni.Normalizing single-cell RNA sequencing data: challenges and opportunities.Nature Methods, 2016. (Submitted).
Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool fortranscriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
88 / 88