using single-cell transcriptome sequencing to …...olfactory stem cells and neural regeneration the...

Using Single-Cell Transcriptome Sequencing toInfer Olfactory Stem Cell Fate Trajectories

Sandrine DudoitDivision of Biostatistics and Department of Statistics

University of California, Berkeleywww.stat.berkeley.edu/~sandrine

BioC 2016: Where Software and Biology ConnectStanford University

June 25, 2016

Version: 25/06/2016, 05:52

1 / 88

http://www.stat.berkeley.edu/~sandrine

Acknowledgments

• Sandrine Dudoit, Division of Biostatistics and Department ofStatistics, UC Berkeley.

I Fanny Perraudeau, Graduate Group in Biostatistics, UCBerkeley.

I Davide Risso, Division of Biostatistics, UC Berkeley.I Kelly Street, Graduate Group in Biostatistics, UC Berkeley.

• John Ngai, Department of Molecular and Cell Biology, UCBerkeley – Principal investigator.

I Diya Das.I Russell Fletcher.I David Stafford.

• Elizabeth Purdom, Department of Statistics, UC Berkeley.

• Jean-Philippe Vert, Mines ParisTech and Institut Curie, Paris,France.

I Svetlana Gribkova.

2 / 88

Acknowledgments

• Nir Yosef, Department of Electrical Engineering and ComputerSciences, UC Berkeley.

I Michael Cole.I Allon Wagner.

• Funded by BRAIN Initiative and California Institute forRegenerative Medicine (CIRM).

3 / 88

Outline

1 Olfactory Stem Cell Fate TrajectoriesOlfactory Stem Cells and Neural RegenerationOlfactory Epithelium p63 DatasetAnalysis Pipeline

2 Exploratory Data Analysis and Quality Assessment/Control

3 Normalization and Expression QuantitationMotivationMethodsSoftware: sconeZero-Inflated Negative Binomial Model

4 Resampling-Based Sequential Ensemble ClusteringMotivationMethodsSoftware: clusterExperiment

5 Cell Lineage and Pseudotime InferenceMotivation

4 / 88

Outline

MethodsSoftware: slingshot

5 / 88

Workshop

Workshop: Today, 1:00-2:50 pm (Session 1, Intermediate)Analysis of single-cell RNA-seq data with R and Bioconductor

Davide Risso, Kelly Street, Michael Cole, UC Berkeley

6 / 88

Olfactory Stem Cells and Neural Regeneration

R. Fletcher, J. Ngai

7 / 88


• The nature of stem cells giving rise to the nervous system isof particular interest in neurobiology, because neural stemcells remain active in certain brain regions for the entire life ofan individual.

• We focus on the mouse olfactory epithelium (OE), a site ofactive neurogenesis in the postnatal animal.

• Adult olfactory stem cells support the replacement of olfactorysensory neurons and non-neuronal support cells (e.g.,sustentacular) over postnatal life and can reconstitute theentire OE following injury.

• The OE is a convenient system to study, due to itsexperimental accessibility (in situ analysis) and its limitednumber of cell types:

I olfactory sensory neurons (OSN),I sustentacular cells (SUS),

8 / 88


I cells of the Bowman gland,I microvillous cells (rare).

9 / 88


Figure 1: Mouse olfactory epithelium.

10 / 88


Sustentacular cells

Mature olfactory neurons

Immature olfactory neurons Globose basal cells Horizontal basal cells Olfactory ensheathing glia

Bowman’s gland

•  HBCs: multipotent, quiescent – deep reserve adult tissue stem cell •  GBCs: proliferative progenitor cells + transit-amplifying cells

The Horizontal Basal Cell Is an Adult Tissue Stem Cell

Figure 2: Olfactory epithelium cell types.

11 / 88


Open questions.

• Determine the stage at which the neuronal and non-neuronallineages bifurcate/diverge.

• Characterize discrete intermediate stages of cell differentiation.

• Identify the genetic networks and signaling pathways thatpromote self-renewal and regulate the transition todifferentiation.

12 / 88


p63 regulation of horizontal basal cells.

• The horizontal basal cell (HBC) is an adult tissue stem cell.

• The p63 protein (tumor protein p63, TP63) promotesself-renewal of HBC by blocking differentiation.

• When p63 is down-regulated, this “brake” is removed,allowing differentiation to proceed at the expense ofself-renewal. Thus, p63 can be viewed as a “molecularswitch” that decides between the alternate stem cell fates ofself-renewal vs. differentiation.

• We use single-cell RNA-Seq to analyze cell fate trajectoriesfrom olfactory stem cells (HBC) of p63 conditional knock-outmice.

13 / 88

Olfactory Epithelium p63 Dataset

X

Cre-ER Krt5

YFP

loxP-STOP-loxP

Rosa

p63

+

Experimental design

Russell Fletcher, Levi Gadye, Mike Sanchez

~ ~

0 24h 48h 72h 96h 7d 14d

tamoxifen YFP+;p63+/+: 98 cells

Resting HBCs

FACS-purify lineage-traced cells -> analyze by single cell RNA-Seq

Differentiating cells

YFP+;p63-/-: 88 cells 100 cells 128 cells 110 cells 98 cells

Figure 3: Experimental design. Single-cell RNA-Seq for 636 HBC: 102wild-type and 534 p63 knock-out cells.

14 / 88


P01

P02

P03

A

P03

B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

Number of cells per batch, colored by bio

num

ber

of c

ells

0

10

20

30

40

50

60KO_14DPTKO_24HPTKO_48HPTKO_7DPTKO_96HPTWT_72HPTWT_96HPT

Figure 4: Experimental design. Number of cells per batch, colored bybiological condition.

15 / 88


• Single-cell RNA-Seq for 636 HBC.

I 102 wild-type (WT)/resting cells534 p63 knock-out (KO) cells, at five timepoints followingtamoxifen treatment.

I Biological replicate: Cells from 1–3 mice.At least two replicates per biological condition.

I One FACS run and one C1 run per biological replicate=⇒ 14 batches.

I 8 HiSeq runs (96 cells/lane, single-end 50-base-pair reads).

• Some confounding between biological and technical effects.

• Batch effects nested within biological effects.

16 / 88


• Marker genes. 94 marker genes, curated from literature andfrom prior microarray, sequencing, and in situ experiments,e.g., neuronal, progenitor cell markers.

• Housekeeping genes. 715 housekeeping (HK) genes, curatedfrom prior microarray experiments, expected to be constantlyand highly-expressed across cells of the OE.

• Gene-level read counts. TopHat2 alignment to RefSeq mm10genome and featureCounts

(bioinf.wehi.edu.au/featureCounts) counting, withgenes defined as union of all isoforms.

17 / 88

http://bioinf.wehi.edu.au/featureCounts

Zero Inflation

Proportion of genes with zero count, pre gene filtering

0.0

0.2

0.4

0.6

0.8

0.0 0.2 0.4 0.6 0.8 1.0

01

23

45

6

Proportion of cells in which a gene is detected, pre gene filtering

N = 22054 Bandwidth = 0.02443

Den

sity

(a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected

Figure 5: Zero inflation. Pre gene filtering.

18 / 88

Zero Inflation

• Single-cell RNA-Seq data have many more genes with zeroread counts than bulk RNA-Seq data.

• This zero inflation could occur for biological reasons (i.e., thegene is simply not expressed) or technical reasons (e.g., lowcapture efficiency).

• Zero-count gene filtering is advisable for normalization anddownstream analyses.

• Most RNA-Seq normalization methods involve scaling andperform poorly when many genes have zero counts.

• In particular, the global-scaling method of Anders and Huber(2010), implemented in the Bioconductor R package DESeq,discards any gene having zero count in at least one sample. Inpractice, the scaling factors are therefore estimated based ononly a handful of genes, e.g., 5/22,054 genes for OE dataset.

19 / 88

Zero Inflation

• Full-quantile (FQ) normalization also doesn’t behave properlydue to ties from the large number of zeros.

• We apply the following zero-count gene filtering to the OEdataset: Retain only the genes with at least nr = 20 reads, inat least ns = 40 samples.This yields 9,133/22,054 genes.

20 / 88

Zero Inflation

Proportion of genes with zero count, post gene filtering

0.0

0.2

0.4

0.6

0.8

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.5

1.0

1.5

2.0

2.5

Proportion of cells in which a gene is detected, post gene filtering

N = 9133 Bandwidth = 0.03333

Den

sity

(a) Proportion of genes with zero count (b) Proportion of cells in which a gene is detected

Figure 6: Zero inflation. Post gene filtering.

21 / 88

Sample-Level QC

P01

P02

P03

AP

03B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

1e+06

2e+06

3e+06

QC measures by batch

NR

EA

DS

P01

P02

P03

AP

03B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

500000100000015000002000000250000030000003500000


NA

LIG

NE

D

P01

P02

P03

AP

03B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

90

92

94

96


RA

LIG

N

P01

P02

P03

AP

03B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

203040506070


TOTA

L_D

UP

Figure 7: Sample-level QC. Boxplots of QC measures, by batch.

22 / 88

Sample-Level QC

P01

P02

P03

AP

03B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

0.00.10.20.30.40.50.6


PC

T_I

NT

RO

NIC

_BA

SE

S

P01

P02

P03

AP

03B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

0.100.150.200.250.30


PC

T_I

NT

ER

GE

NIC

_BA

SE

S

P01

P02

P03

AP

03B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

0.20.30.40.50.60.70.8


PC

T_M

RN

A_B

AS

ES

P01

P02

P03

AP

03B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

0.60.81.01.21.41.61.82.0


ME

DIA

N_C

V_C

OV

ER

AG

E

Figure 8: Sample-level QC. Boxplots of QC measures, by batch.

23 / 88

Sample-Level QC

−5 0 5 10

−4

−2

02

46

QC PCA by batch

PC1

PC

2

P01

P02

P03

A

P03

B

P04

P05

P06

P10

P11

P12

P13

P14

Y01

Y04

−4

−2

0

2

4

6

QC PCA by batch

PC

1

(a) QC PC2 vs. PC1, colored by batch (b) QC PC1, by batch

Figure 9: Sample-level QC. Principal component analysis (PCA) ofsample-level QC measures.

24 / 88

Sample-Level QC

−40 −20 0 20 40 60 80

−5

05

10QC PC1 vs. count PC1

Count PC1

QC

PC

1

Cor=−0.49

PC1 PC2 PC3 PC4 PC5

Absolute correlation of count PC and QC measures

0.0

0.2

0.4

0.6

0.8

1.0

NREADSNALIGNEDRALIGNTOTAL_DUPPRIMERPCT_RIBOSOMAL_BASESPCT_CODING_BASESPCT_UTR_BASESPCT_INTRONIC_BASESPCT_INTERGENIC_BASESPCT_MRNA_BASESMEDIAN_CV_COVERAGEMEDIAN_5PRIME_BIASMEDIAN_3PRIME_BIAS

(a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures

Figure 10: Sample-level QC. Association of counts and sample-level QCmeasures.

25 / 88

Sample-Level QC: Summary

• The distribution of QC measures can vary substantially withinand between batches.

• Some QC measures clearly point to low-quality samples, e.g.,low percentage of mapped reads (RALIGN).

• There can be a strong association between QC measures andread counts (cf. PCA).

• Filtering samples based on QC measures is advisable, asnormalization procedures may not be able to adjust for QCand some samples simply have low quality.

• Normalization procedures based on QC measures (e.g.,regression on first few PC of QC measures) should also beconsidered.

26 / 88

Gene-Level Counts

0

2

4

6

8

10

Gene−level log−count

log(

coun

t+1)

−4

−2

0

2

4

6

8

Gene−level RLE

RLE

(a) Log-count (b) RLE

Figure 11: Gene-level counts. Gene-level log-count and relative logexpression (RLE = log-ratio of read count to median read count acrosssamples).

27 / 88

Gene-Level Counts: Summary

• After gene and sample filtering and before normalization,there are large differences in gene-level count distributionswithin and between batches (cf. RLE, housekeeping genes).

• The counts are still zero-inflated.

• There can be substantial association of counts andsample-level QC measures.

• Normalization is essential before any clustering or differentialexpression analysis, to ensure that observed differences inexpression measures between samples and/or genes are trulydue to differential expression and not technical artifacts.

28 / 88

SCONE

D. Risso, M. Cole, N. Yosef

29 / 88

SCONE

SCONE: Single-Cell Overview of Normalized Expression. A generalframework for the normalization of scRNA-Seq data.

• Range of normalization methods.I Global-scaling, e.g., DESeq, TMM.I Full-quantile (FQ).I Unknown factors of unwanted variation: Remove unwanted

variation (RUV).I Known factors of unwanted variation: Regression-based

normalization on, e.g., QC PC, C1 run.

• Normalization performance metrics.

• Numerical and graphical summaries of normalized read countsand metrics.

• R package scone, to be released through the BioconductorProject: github.com/yoseflab/scone.

30 / 88

http://github.com/yoseflab/scone

SCONE

Figure 12: scone. Regression model.

31 / 88

SCONE

Performance metrics. (Green: Good when high; Red: Good whenlow.)

• BIO SIL: Average silhouette width by biological condition.

• BATCH SIL: Average silhouette width by batch.

• PAM SIL: Maximum average silhouette width for PAMclusterings, for a range of user-supplied numbers of clusters.

• EXP QC COR: Maximum squared Spearman correlationbetween count PCs and QC measures.

• EXP UV COR: Maximum squared Spearman correlationbetween count PCs and factors of unwanted variation(preferably derived from other set of negative control genesthan used in RUV).

• EXP WV COR: Maximum squared Spearman correlationbetween count PCs and factors of wanted variation (derivedfrom positive control genes).

32 / 88

SCONE

• RLE MED: Mean squared median relative log expression(RLE).

• RLE IQR: Mean inter-quartile range (IQR) of RLE.

33 / 88

Software Package scone

Application to OE p63 dataset.

• Apply and evaluate 172 normalization procedures using mainscone function.

I scaling method: None, DESeq, TMM, FQ.I uv factors: None, RUVg k = 1, · · · , 5, QC PC k = 1, · · · , 5.I adjust biology: Yes/no.I adjust batch: Yes/no.

• Select a normalization procedure based on (function of) theperformance scores.Unweighted mean score =⇒ none,fq,qc k=4,bio,no batch

Weighted mean score =⇒none,fq,qc k=2,no bio,no batch

34 / 88


−0.15 −0.10 −0.05 0.00 0.05 0.10

−0.

15−

0.05

0.00

0.05

0.10

SCONE: Biplot of scores colored by mean score

PC1

PC

2

BIO_SIL

BATCH_SILPAM_SIL

EXP_QC_COR

EXP_UV_COREXP_WV_COR

RLE_MEDRLE_IQR

Figure 13: scone. Biplot of performance scores, colored by mean score(yellow high/good, blue low/bad).

35 / 88


−0.4 −0.2 0.0 0.2

−0.

20−

0.10

0.00

0.10

SCONE: Score PCA colored by mean score

PC1

PC

2

Top meanTop weighted meanfq,ruv_k=1none

Figure 14: scone. PCA of performance scores, colored by mean score.

36 / 88


−0.4 −0.2 0.0 0.2

−0.

20−

0.10

0.00

0.10

SCONE: Score PCA colored by method

PC1

PC

2

nonefqdeseqtmm

Figure 15: scone. PCA of performance scores, colored by method –scaling method.

37 / 88


−0.4 −0.2 0.0 0.2

−0.

20−

0.10

0.00

0.10

SCONE: Score PCA colored by method

PC1

PC

2

no_bio,no_batchno_bio,batchbio,no_batchbio,batch

Figure 16: scone. PCA of performance scores, colored by method –adjust biology, adjust batch.

38 / 88


FQ + RUVg(HK, k=1): W by batch

W

−0.

15−

0.05

0.00

0.05

0.10

−0.20 −0.15 −0.10 −0.05 0.00 0.05 0.10

−5

05

10

FQ + RUVg(HK, k=1): QC PC1 vs. W

W

QC

PC

1

Cor=−0.56

(a) Unwanted factor W (b) QC PC1 vs. W

Figure 17: scone. Association of RUVg unwanted factor W and QCmeasures for none,fq,ruv k=1,no bio,batch.

39 / 88


−3

−2

−1

0

1

2

3

4

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: RLE

RLE

−2

0

2

4

6

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: HK RLE

RLE

(a) All genes (b) Housekeeping genes

Figure 18: scone. Gene-level relative log expression (RLE = log-ratio ofread count to median read count across samples) for method with topweighted mean score none,fq,qc k=2,no bio,no batch.

40 / 88


−20 0 20 40

−5

05

10

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: QC PC1 vs. count PC1

count PC1

QC

PC

1

Cor=0

PC1 PC2 PC3 PC4 PC5

SCONE weighted mean score −none,fq,qc_k=2,no_bio,no_batch−: Absolute correlation of count PC and QC measure

0.0

0.2

0.4

0.6

0.8

1.0

NREADSNALIGNEDRALIGNTOTAL_DUPPRIMERPCT_RIBOSOMAL_BASESPCT_CODING_BASESPCT_UTR_BASESPCT_INTRONIC_BASESPCT_INTERGENIC_BASESPCT_MRNA_BASESMEDIAN_CV_COVERAGEMEDIAN_5PRIME_BIASMEDIAN_3PRIME_BIAS

(a) QC PC1 vs. count PC1 (b) Correlation of count PC and QC measures

Figure 19: scone. Association of counts and sample-level QC measures,none,fq,qc k=2,no bio,no batch.

41 / 88

Software Package scone: Summary

• Unnormalized gene-level counts exhibit large differences indistributions within and between batches and association withsample-level QC measures.

• Different normalization methods vary in performanceaccording to SCONE metrics and lead to different distributionsof gene-level counts, hence clustering and DE results.

• Global-scaling normalization. Not aggressive enough to handlepotentially large batch effects and association of counts andQC measures. Biological effects are dominated by nuisancetechnical effects. Additionally, for DESeq, the scaling factorsare computed based on only a handful of genes with non-zerocounts in all cells (5/22,054).

42 / 88


• Batch effect normalization. Adjusting for batch effectswithout properly accounting for the nesting of batch withinbiological effects (no bio,batch) in the regression model isproblematic, as this removes the biological effects of interest(e.g., empirical Bayes framework of ComBat).

• FQ followed by QC-based or RUVg normalization. Seemseffective: Similar RLE distributions between samples, lowerassociation of counts and QC measures. The first unwantedfactor of RUVg is correlated with the first QC PC.

• The remaining analyses are based onnone,fq,qc k=2,no bio,no batch, the best methodaccording to weighted mean score.

43 / 88


• Interpretation of performance metrics. Some metrics tend tofavor certain methods over others, e.g., EXP UV COR(correlation between count PCs and factors of unwantedvariation) naturally favors RUVg, especially when the same setof negative controls are used for normalization and evaluation.Hence, a careful, global interpretation of the metrics isrecommended.

• Negative controls. The selection of proper, distinct sets ofnegative controls is important, as these are used for bothnormalization (RUVg) and assessment of normalization results(EXP UV COR).

• Ongoing efforts.I Zero-inflated negative binomial (ZINB) model.I User-supplied factors unwanted and wanted variation (UV and

WV, respectively).I Other methods (e.g., ComBat/sva).

44 / 88


I Other performance metrics.I Visualization.I Shiny app for interactive web interface.

45 / 88

Resampling-Based Sequential Ensemble Clustering

D. Risso, E. Purdom

46 / 88

Motivation

• Robustness to choice of samples. Both hierarchical andpartitioning methods tend to be sensitive to the choice ofsamples to be clustered. Outlying samples/clusters (e.g., glia)are common in scRNA-Seq and mask interesting substructurein the data, often requiring the successive pruning out ofdominating clusters to get to the finer structure.

• Robustness to clustering algorithm and tuning parameters.Clustering results are sensitive to pre-processing steps such asnormalization and dimensionality reduction, as well as to thechoice of clustering algorithm and associated tuningparameters (e.g., distance function, number of clusters).

47 / 88

Motivation

• Not focusing on the number of clusters. A major tuningparameter of partitioning methods such as partitioning aroundmedoids (PAM) and k-means is the number of clusters k .Methods for selecting k (e.g., silhouette width) are sensitiveto the choice of samples, normalization, and other tuningparameters. They tend to be conservative (low k), i.e.,capture only the coarse clustering structure and maskinteresting substructure in the data. Additionally, the numberof clusters k is often not of primary interest.E.g. Silhouette width with PAM selects only k = 2 clusters forthe OE p63 dataset.

• Not forcing samples into clusters. Some samples may beoutliers, that do not really belong to any clusters. Leavingthem out can improve the quality and interpretability of theclustering as well as downstream analyses (e.g., identificationof cluster marker genes).

48 / 88

Motivation

• Cluster gene expression signatures. Common differentialexpression statistics are not well-suited for finding markergenes for the clusters, especially for finer structure in ahierarchy.

49 / 88


• We have developed a resampling-based sequential ensembleclustering approach, with the aim of obtaining stable andtight clusters.

• Ensemble clustering, i.e., aggregating multiple clusteringsobtained from different algorithms or applications of a givenalgorithm to resampled versions of the learning set, is ageneral approach for improving stability. This can be viewedas the unsupervised analog of ensemble methods in supervisedlearning, e.g., bagging, boosting, random forests.

• Our approach is related to bagged/consensus/tight clustering(Dudoit and Fridlyand, 2003; Leisch, 1999; Tseng and Wong,2005).

• R package clusterExperiment, released through theBioconductor Project.

50 / 88


RSEC: Resampling-based Sequential Ensemble Clustering.

• Given a base clustering algorithm (e.g., PAM, k-means) andassociated tuning parameters (e.g., number of principalcomponents, number of clusters k , distance matrix), generatea single candidate clustering using

I resampling-based clustering to find robust and tight clusters;I sequential clustering to find stable clusters over a range of

numbers of clusters (Tseng and Wong, 2005).

• Generate a collection of candidate clusterings by repeating theabove procedure for different base clustering algorithms andtuning parameters.

• Identify a consensus over the different candidate clusterings.

• Merge non-differential clusters.

• Find cluster signatures by testing for differential expressionbetween selected subsets of clusters.

51 / 88


• Visualization. Comparison of multiple clusterings of the samesamples, heatmaps of co-clustering matrices, heatmaps withhierarchical clustering of genes and/or samples.

52 / 88

Differential Expression

• Find cluster gene expression signatures, i.e., marker genes, bytesting for differential expression between selected subsets ofclusters.

• Standard F -statistic. Tests for any difference betweenclusters. Sensitive to outlying samples/clusters. Non-specific,i.e., not useful for interpreting differences between clusters.

• Standard solution in (generalized) linear models/ANOVA is toconsider contrasts between groups of clusters. By using themachinery of the (generalized) linear model, we use all of thesamples in testing these contrasts, rather than just thosesamples involved in the corresponding clusters.

I All pairwise. All pairwise comparisons between clusters.I One against all. Compare each cluster to union of remaining

clusters.

53 / 88

Differential Expression

I Dendrogram. Create a hierarchy of clusters, work up the tree,test for DE between sister nodes (as in approach used formerging clusters).

• For each contrast, test for DE using empirical Bayes linearmodeling approach of R package limma, with voom option toaccount for mean-variance relationship of log-counts (i.e.,over-dispersion).

54 / 88

Software Package clusterExperiment

Workflow.

• clusterMany. Generate a collection of candidate clusterings,for different base clustering algorithms and tuning parameters,with option to use resampling and sequential approaches.

• combineMany. Find consensus clustering across severalclusterings.

• Identify non-differential clusters that should be merged intolarger clusters.

I makeDendrogram. Hierarchical clustering of the clusters foundby combineMany.

I mergeClusters. Merge clusters of this hierarchy based on DEbetween nodes.

• RSEC. Wrapper function around the clusterExperimentworkflow.

55 / 88


• getBestFeatures. Find cluster signatures by testing fordifferential expression between selected subsets of clusters.

• Visualization.I plotClusters. Comparison of multiple clusterings of the

same samples. Based on ConsensusClusterPlus package.I plotHeatmap. Heatmaps of co-clustering matrices, heatmaps

with hierarchical clustering of genes and/or samples (interfaceto aheatmap from NMF package).

56 / 88



• clusterMany: Generate 22 candidate clusterings.I Dimensionality reduction: 25, 50 PC.I Euclidean distance.I Base clustering method: PAM, k = 5, · · · , 15.I Resampling-based clustering: B = 100, proportion = 0.7,α = 0.3.

I Sequential clustering: k0 = 15, β = 0.9.I clusterFunction=c("hierarchical01").

• combineMany(ce, clusterFunction="hierarchical01",

whichClusters="workflow", proportion=0.7,

propUnassigned=0.5, minSize=5).

• mergeClusters(ce, mergeMethod="adjP",

cutoff=0.05).

57 / 88


Clusterings from clusterMany

nPCAFeatures=50,k0=15nPCAFeatures=25,k0=15nPCAFeatures=50,k0=14nPCAFeatures=25,k0=14nPCAFeatures=50,k0=13nPCAFeatures=25,k0=13nPCAFeatures=50,k0=12nPCAFeatures=25,k0=12nPCAFeatures=50,k0=11nPCAFeatures=25,k0=11nPCAFeatures=50,k0=10nPCAFeatures=25,k0=10

nPCAFeatures=50,k0=9nPCAFeatures=25,k0=9nPCAFeatures=50,k0=8nPCAFeatures=25,k0=8nPCAFeatures=50,k0=7nPCAFeatures=25,k0=7nPCAFeatures=50,k0=6nPCAFeatures=25,k0=6nPCAFeatures=50,k0=5nPCAFeatures=25,k0=5

Figure 20: clusterExperiment. Comparison of 22 clusterMany

clusterings using plotClusters.

58 / 88


combineMany−1c1c10c11c12c13c14c15c16c17c18c19c2c20c21c22c23c24c25c26c27c28c29c3

0

0.2

0.4

0.6

0.8

1

Co−clustering proportion matrix

combineMany

Figure 21: clusterExperiment. Heatmap of co-clustering matrix forclusterMany clusterings, used to create combineMany clustering(plotHeatmap).

59 / 88


c1

c2

c3

c4

c5

c6

c7c8c9

c10

c11c12

c13

c14

c15

c16

c17c18

c19c20

c21c22

c23

c24c25

c26c27

c28

c29c30c31

c32

c1

c2

c3

c4

c5

c6

c7c8c9

c10

c11c12

c13

c14

c15

c16

c17c18

c19c20

c21c22

c23

c24c25

c26c27

c28

c29c30c31

c32

0.38

0.058

0.15

0.028

0.066

0.088

0.035 0.014

0.014

0.039

0.039

0.026

0.023

0.017

0.016

0.0065

0.076

0.00790.011

0.0086

0.0110.0042

0.0077

0.0092

0.006

0.0065

0.0086

0.0056

0.0051

0.0038

0.0056

Figure 22: clusterExperiment. Dendrograms from combineMany andmergeClusters.

60 / 88


−50 0 50

−40

−20

020

4060

80

Count PCA by cluster

PC1

PC

2

−112345678

Figure 23: clusterExperiment. PCA of gene-level log-counts, colored bymergeClusters.

61 / 88


mergeClusters−1m1m2m3m4m5m6m7m8

Bio−1−2ContaminantsHBCHBC transitionINP/GBCINP2iOSNMicrovillousmOSNsSUSSUS precursor

−4

−2

0

2

4

6

8

10

Heatmap of DE genes, dendrogram contrasts

mergeClustersBio

Figure 24: clusterExperiment. Heatmap of log-counts forgetBestFeatures DE genes using dendrogram contrasts (269, top 50 ineach of the 6 nodes).

62 / 88


sirt5

hnrnpl

rmi2

gm17821

nup98

cox6c

oaz1

rpl27

rplp1

rpl13a

rps20

rps7

rpl23

rpl37a

rpl23a

rps29

rpl32

rplp2

rpl24

fau

rpl36

rpl37

rps23

rpl35

prrc2c

son

kcnq1ot1

nip7

eif1

cfl1

rpl10

lars2

rps25

rpl13

naca

rplp0

rpl7

rpl8

rps6

rps3a1

rpl21

rps11

rpl14

hmgb1

rpl3

trns1

rnr1

ppia

rps27

myh9

rpl9

rps9

rps19

rpl41

rps14

map1b

eif4g2

calm2

ywhae

gsk3b

calm1

rn18s−rs5

malat1

actb

gnb1

rnr2

hsp90ab1

eef1a1

h3f3b

canx

btg1

srsf2

csde1

fth1

sod1

set

ywhaz

ubb

tmsb4x

prdx1

akr1a1

sptbn1

ubc

mt1

dstn

rps5

rpl4

rps3

rpl18a

actr2

ddx5

hnrnpa2b1

skp1a

hspa8

ptges3

tpt1

actg1

ftl1

ptma

gm1821 mergeClusters−1m1m2m3m4m5m6m7m8


0

2

4

6

8

10

Heatmap of DE genes, F−test

mergeClustersBio

Figure 25: clusterExperiment. Heatmap of log-counts forgetBestFeatures DE genes using F -test (top 100).

63 / 88


sox11

creer

gstm2

cyp2f2

cbr2

trim66

gap43

ebf1

ebf2

ncam1

omp

cnga2

gng13

rgs11

lhx2

ebf3

nhlh1

gng8

arhgdig

cyp2g1

lgr5

cyp1a2

scarb1

coch

cd24a

cftr

ceacam1

prmt5

fabp5

neurod1

nhlh2

zfp423

ebf4

rrm1

ezh2

hes6

ccnd1

rbm24

kit

dtl

tead2

mcm7

ccna2

ect2

top2a

cdca8

pbk

il6

notch1

yap1

icam1

ppara

fgfr2

dkk3

igfbp4

itgb5

egfr

igfbp2

pax6

sall1

elf5

il33

sox2

wnt4

lifr

nrcam

trp63

krt14

krt5

sox9

kitl

sfrp1

f3

perp

ccnd2

hmgb2

tubb5

cdkn1b

egfp

igfbp5

krt8

hes1

notch2 mergeClusters−1m1m2m3m4m5m6m7m8


0

2

4

6

8

10

Heatmap of markers genes

mergeClustersBio

Figure 26: clusterExperiment. Heatmap of log-counts for “a priori”markers genes (83).

64 / 88

Software Package clusterExperiment: Summary

• Tuning parameters.I α controls tightness.I β controls stability.I k0, the initial number of clusters used in sequential clustering,

is the parameter with the greatest impact on the results.Larger k0 tend to lead to smaller and tighter clusters.

• Caveat. The DE analysis is exploratory and nominal p-valuesonly a rough summary of significance (reliance on models, tinyp-values even after adjustment for multiple testing, same dataused to define clusters and to perform DE analysis).

• Ongoing efforts.I Cluster confidence measures.I Potentially assign unclustered observations to clusters.I DE using ZINB model.I Greater modularity (e.g., distance functions, DE test).I Shiny app for interactive web interface.

65 / 88

Cell Lineage and Pseudotime Inference

K. Street, D. Risso, E. Purdom

66 / 88

Motivation

• Mapping transcriptional progression from stem cells tospecialized cell types is essential for properly understandingthe mechanisms regulating cell and tissue differentiation.

• There may not always be a clear distinction between states,but rather a smooth transition, with individual cells existingon a continuum between states.

• In such a case, cells may undergo gradual transcriptionalchanges, where the relationship between states can berepresented as a continuous lineage dependent upon anunderlying spatial or temporal variable. This representation,referred to as pseudotemporal ordering, can help usunderstand how cells differentiate and how cell fate decisionsare made (Bendall et al., 2014; Campbell et al., 2015; Ji andJi, 2016; Petropoulos et al., 2016; Shin et al., 2015; Trapnellet al., 2014).

67 / 88

Motivation

• We have developed Slingshot as a flexible and robustframework for inferring cell lineages and pseudotimes in thestudy of continuous differentiation processes.

68 / 88


• Input/Output.I Input. Normalized gene expression measures and cell

clustering.I Output. Cell lineages, i.e., subsets of ordered cell clusters.

Cell pseudotimes, i.e., for each lineage, ordered sequence ofcells and associated pseudotimes.

• Dimensionality reduction.I Principal component analysis (PCA) seems effective and

simple, in conjunction with steps detailed next.I Other approaches include related linear methods, e.g.,

independent component analysis (ICA) (Trapnell et al., 2014,Monocle), and non-linear methods, e.g., Laplacianeigenmaps/spectral embedding (Campbell et al., 2015,Embeddr), t-distributed stochastic neighbor embedding(t-SNE) (Bendall et al., 2014; Petropoulos et al., 2016,Wanderlust).

69 / 88


• Inferring cell lineages.I Minimum spanning tree (MST; ape package) over cell clusters,

with between-cluster distance based on Euclidean distancebetween cluster means scaled by within-cluster covariance.

I Outlying clusters. Identified using granularity parameter ω thatlimits maximum edge weight in the tree. Specifically, buildMST using an artificial cluster Ω, with distance ω from otherclusters (a fraction of maximum pairwise distance betweenclusters), and then remove Ω.

I Root and leaf nodes. May either be pre-specified orautomatically selected.Root node. If not pre-specified, selected based on parsimony(i.e., set of lineages with maximal number of clusters sharedbetween them).Leaf nodes. If pre-specified, constrained MST.

I A lineage is then defined as any unique path coming out of theroot node and ending in a leaf node.

70 / 88


I Constructing the MST on clusters (Ji and Ji, 2016; Shin et al.,2015, TSCAN,Waterfall) vs. cells (Trapnell et al., 2014,Monocle) offers greater stability and computational efficiency,less complex lineages, and easier determination of directionalityand branching.

• Inferring cell pseudotimes.I Iterative procedure inspired from the principal curve algorithm

of Hastie and Stuetzle (1989); principal.curve function inprincurve package.

I In the case of branching lineages, a shrinkage step is includedat each iteration, that forces a degree of similarity between thecurves in the neighborhood of shared clusters.

I Pseudotime values are derived by orthogonal projection ontothe curves.

I Cells belonging to clusters that are included in multiplelineages have multiple, similar pseudotime values.

71 / 88


I Previous approaches also use smooth curves to representlineages (Campbell et al., 2015; Petropoulos et al., 2016,Embeddr), while others use piecewise linear paths through theMST and extract orderings either by orthogonal projection (Jiand Ji, 2016; Shin et al., 2015, TSCAN,Waterfall) or PQ tree(Trapnell et al., 2014, Monocle).

I We find that smooth curves provide discerning power not foundin piecewise linear trajectories, while also adding stability overa range of dimensionality reduction and clustering methods.

• Differential expression. Regression of gene expressionmeasures on pseudotime, e.g., generalized additive models(GAM) (Ji and Ji, 2016, TSCAN).

• Visualization. Two- and three-dimensional plots of celllineages and pseudotimes, gene-level trajectories, heatmaps forDE genes.

72 / 88


• R package slingshot, to be released through the BioconductorProject: github.com/kstreet13/slingshot.

73 / 88

http://github.com/kstreet13/slingshot

Software Package slingshot

• Modularity.I Integrates easily with a range of normalization, clustering, and

dimensionality reduction methods.I get lineages: Given expression measures and cluster labels,

use MST to infer lineages.get lineages(X, clus.labels, start.clus = NULL,

end.clus = NULL, dist.fun = NULL, omega = Inf,

distout = FALSE).I get curves: Given lineages, infer pseudotimes.

get curves(X, clus.labels, lineages, thresh =

1e-04, maxit = 100, stretch = 2, shrink = TRUE).

• Flexibility. Can be used with varying levels of supervision.I Cluster-based approach allows for easy supervision when

researchers have prior knowledge of cell classes, while stillbeing able to detect novel branching events.

I User-supplied or data-driven selection of root and leaf nodes.

74 / 88


• Visualization.

I plot tree: MST in 2 and 3D.I plot curves: Lineage curves in 2 and 3D.

75 / 88



• Applied to the first three principal components, Slingshotidentifies two lineages: The first corresponds to theHBC-to-neurons transition, the second to theHBC-to-sustentacular cells transition.

• A first-pass DE analysis, based on a regression of log-count onpseudotime using GAM, suggests that many genes areinvolved in the differentiation process.

• Among the top 100 DE genes for each lineage, only 13 are DEin both, suggesting distinct processes in the neuronal vs.non-neuronal lineages.

76 / 88


−50 0 50−

500

50

PC2

PC

1

−40 0 20 40 60−

400

2040

6080

PC3

PC

2

Figure 27: slingshot. PCA of gene-level log-counts, colored by clusters,with MST edges used to infer lineages (get lineages, plot tree).

77 / 88


−50 0 50−

500

50

PC2

PC

1

−40 0 20 40 60−

400

2040

6080

PC3

PC

2

Figure 28: slingshot. PCA of gene-level log-counts, colored by clusters,with smooth curves representing lineages and used to infer pseudotimes(get curves, plot curves).

78 / 88


0 50 150

04

8

fstl5

pseudotimelo

g(co

unt+

1)

0 50 150

04

8

ugt2a1

pseudotime

log(

coun

t+1)

0 50 150

06

gstm1

pseudotime

log(

coun

t+1)

0 50 150

04

8

ncam1

pseudotime

log(

coun

t+1)

0 50 150

04

8

gstm2

pseudotimelo

g(co

unt+

1)

0 50 150

04

8

rtn1

pseudotime

log(

coun

t+1)

0 50 150

04

8

aqp3

pseudotime

log(

coun

t+1)

0 50 150

04

8

scn9a

pseudotime

log(

coun

t+1)

0 50 150

04

gm581

pseudotimelo

g(co

unt+

1)

Slingshot: DE genes based on GAM, lineage 1

Figure 29: slingshot. Scatterplots of gene-level log-count vs. pseudotimefor GAM DE genes in lineage 1 (HBC–Neurons).

79 / 88


jakmip1

tuba1a

gm11223

stmn1

sult1d1

ak1

ebf3

elavl3

myt1l

atf5

ttll7

homer2

uchl1

stmn3

rd3

calb2

snap25

cntn4

aplp1

clgn

rtp1

gap43

trim66

sept3

ank2

spock1

gm581

scn9a

mgst3

gng13

5730409k12rik

prune2

flrt1

syt11

cystm1

dcx

stmn2

fstl5

ncam1

ebf1

tubb3

ebf2

ell3

rtn1

map1b

gnb1

rps27

prdx1

sp8

ifitm2

runx1

ezr

zfp36l1

tcf7l2

atf3

krt18

sdc4

prdx6

gstm1

cyp2f2

cbr2

fmo6

ccnd2

gsta4

slc26a7

lypd2

notch2

aox2

hspb1

socs3

dapl1

s100a11

ahnak

sik1

hes1

bhlhe40

irf1

cyr61

junb

nr4a1

klf4

fosb

wls

rbm47

foxn3

ptn

gja1

ifitm3

clu

aqp4

anxa2

ces1d

ephx1

zfp36

epas1

gpm6a

scgb1c1

ugt2a1

aqp3

gstm2 mergeClustersm1m2m3m5m7m8

BioHBCHBC transitionINP/GBCINP2iOSNmOSNsSUSSUS precursorSUS progenitor

0

2

4

6

8

10

Slingshot: DE genes based on GAM, lineage 1

mergeClustersBio

Figure 30: slingshot. Heatmap of log-counts for GAM DE genes inlineage 1 (HBC–Neurons), cells sorted by pseudotime.

80 / 88


Lineages.

$lineage1

[1] "8" "3" "7" "5" "2" "1"

$lineage2

[1] "8" "3" "7" "4"

Biological annotation of clusters.

cl

b 1 2 3 4 5 7 8

HBC 0 0 26 0 0 1 47

HBC transition 0 0 35 0 0 19 0

INP/GBC 0 0 1 0 41 0 0

INP2 0 34 0 0 0 0 0

iOSN 1 77 1 0 0 0 0

Microvillous 0 0 0 8 0 0 0

mOSNs 33 1 0 0 0 0 0

SUS 0 0 1 61 0 5 0

SUS precursor 0 0 7 2 4 62 0

SUS progenitor 0 0 0 0 6 0 0

81 / 88

Software Package slingshot: Summary

Ongoing efforts.

• Number of lineages: User-supplied, testing for distinctlineages, merging non-differential lineages.

• DE within and between (i.e., bifurcation) lineages.

• Visualization.

• Performance measures.

• OOP with S4 classes and methods.

• Shiny app for interactive web interface.

82 / 88

References

S. Anders and W. Huber. Differential expression analysis for sequence countdata. Genome Biology, 11(10):R106, 2010.

S. C. Bendall, K. L. Davis, E. D. Amir, M. D. Tadmor, E. F. Simonds, T. J.Chen, D. K. Shenfeld, G. P. Nolan, and D. Pe’er. Single-cell trajectorydetection uncovers progression and regulatory coordination in human B celldevelopment. Cell, 157(3):714–725, 2014.

B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison ofnormalization methods for high density oligonucleotide array data based onbias and variance. Bioinformatics, 19(2):185–193, 2003.

J. H. Bullard, E. A. Purdom, K. D. Hansen, and S. Dudoit. Evaluation ofstatistical methods for normalization and differential expression inmRNA-Seq experiments. BMC Bioinformatics, 11:Article 94, 2010. URLhttp://www.biomedcentral.com/1471-2105/11/94/abstract. (Highlyaccessed).

83 / 88

http://www.biomedcentral.com/1471-2105/11/94/abstract

References

K. Campbell, C. P. Ponting, and C. Webber. Laplacian eigenmaps and principalcurves for high resolution pseudotemporal ordering of single-cell RNA-seqprofiles. Technical report, MRC Functional Genomics Unit, University ofOxford, UK, 2015. URLbiorxiv.org/content/early/2015/09/18/027219.

S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clusteringprocedure. Bioinformatics, 19(9):1090–1099, 2003. URL http:

//bioinformatics.oxfordjournals.org/content/19/9/1090.abstract.

J. Gagnon-Bartsch, L. Jacob, and T. P. Speed. Removing unwanted variationfrom high dimensional data with negative controls. Technical Report 820,Department of Statistics, University of California, Berkeley, 2013.

J. A. Gagnon-Bartsch and T. P. Speed. Using control genes to correct forunwanted variation in microarray data. Biostatistics, 13(3):539–552, 2012.

T. Hastie and W. Stuetzle. Principal curves. Journal of the AmericanStatistical Association, 84(406):502–516, 1989.

R. A. Irizarry, B. Hobbs, F. Collin, Y. D. Beazer-Barclay, K. J. Antonellis,U. Scherf, and T. P. Speed. Exploration, normalization, and summaries ofhigh density oligonucleotide array probe level data. Biostatistics, 4(2):249–264, 2003.

84 / 88

biorxiv.org/content/early/2015/09/18/027219

http://bioinformatics.oxfordjournals.org/content/19/9/1090.abstract

http://bioinformatics.oxfordjournals.org/content/19/9/1090.abstract

References

L. Jacob, J. Gagnon-Bartsch, and T. P. Speed. Correcting gene expression datawhen neither the unwanted factors nor the factor of interest is observed.Annals of Applied Statistics, 2013. (Submitted).

Z. Ji and H. Ji. TSCAN: Pseudo-time reconstruction and evaluation insingle-cell RNA-seq analysis. Nucleic Acids Research, 2016.

W. E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarrayexpression data using empirical Bayes methods. Biostatistics, 8(1):118–127,2007.

F. Leisch. Bagged clustering. Technical Report 51, SFB Adaptive InformationSystems and Modelling in Economics and Management Science, ViennaUniversity of Economics and Business Administration, Vienna, Austria,August 1999. URLwww.ci.tuwien.ac.at/~leisch/papers/fl-techrep.html.

J. Loven, D. A. Orlando, A. A. Sigova, C. Y. Lin, P. B. Rahl, C. B. Burge,D. L. Levens, T. I. Lee, and R. A. Young. Revisiting global gene expressionanalysis. Cell, 151(3):476–482, 2012.

85 / 88

www.ci.tuwien.ac.at/~leisch/papers/fl-techrep.html

References

J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad. RNA-seq:an assessment of technical reproducibility and comparison with geneexpression arrays. Genome Research, 18(9):1509–1517, 2008.

A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mappingand quantifying mammalian transcriptomes by RNA-Seq. Nature Methods,5(7):621–628, 2008.

S. Petropoulos, D. Edsgard, B. Reinius, Q. Deng, S. P. Panula, S. Codeluppi,A. Plaza Reyes, S. Linnarsson, R. Sandberg, and F. Lanner. Single-cellRNA-Seq reveals lineage and X chromosome dynamics in humanpreimplantation embryos. Cell, 165(In press):1–15, 2016.

D. Risso, K. Schwartz, G. Sherlock, and S. Dudoit. GC-content normalizationfor RNA-Seq data. BMC Bioinformatics, 12:Article 480, 2011. URLhttp://www.biomedcentral.com/1471-2105/12/480/abstract. (Highlyaccessed).

D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. Normalization of RNA-seq datausing factor analysis of control genes or samples. Nature Biotechnology, 32(9):896–902, 2014a. URL http:

//www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2931.html.

86 / 88

http://www.biomedcentral.com/1471-2105/12/480/abstract

http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2931.html

http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.2931.html

References

D. Risso, J. Ngai, T. P. Speed, and S. Dudoit. The role of spike-in standards inthe normalization of RNA-seq. In S. Datta and D. Nettleton, editors,Statistical Analysis of Next Generation Sequencing Data, Frontiers inProbability and the Statistical Sciences, chapter 9, pages 169–190. SpringerInternational Publishing, 2014b.

M. D. Robinson and A. Oshlack. A scaling normalization method for differentialexpression analysis of RNA-seq data. Genome Biology, 11(3):R25, 2010.

J. Shin, D. A. Berg, Y. Zhu, J. Y. Shin, J. Song, M. A. Bonaguidi,G. Enikolopov, D. W. Nauen, K. M. Christian, G. Ming, and H. Song.Single-cell RNA-Seq with Waterfall reveals molecular cascades underlyingadult neurogenesis. Cell Stem Cell, 17(3):360–372, 2015.

C. Trapnell, D. Cacchiarelli, J. Grimsby, P. Pokharel, S. Li, M. Morse, N. J.Lennon, K. J. Livak, T. S. Mikkelsen, and J. L. Rinn. The dynamics andregulators of cell fate decisions are revealed by pseudotemporal ordering ofsingle cells. Nature Biotechnology, 4(32):381–391, 2014.

G. C. Tseng and W. H. Wong. Tight clustering: a resampling-based approachfor identifying stable and tight patterns in data. Biometrics, 61(1):10–16,2005.

87 / 88

References

C. A. Vallejos, D. Risso, A. Scialdone, S. Dudoit, and J. C. Marioni.Normalizing single-cell RNA sequencing data: challenges and opportunities.Nature Methods, 2016. (Submitted).

Z. Wang, M. Gerstein, and M. Snyder. RNA-Seq: a revolutionary tool fortranscriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.

88 / 88

using single-cell transcriptome sequencing to …...olfactory stem cells and neural regeneration the...

Documents