[Figure 1 schematic. Steps shown: poly(A)+ RNA; first-strand synthesis and tailing by SMARTScribe RT (SMART CDS primer); template switching and extension by SMARTScribe RT (SMARTer II A Oligonucleotide), labelled as a single step; amplification of cDNA by PCR, giving double-stranded cDNA; library construction by either shearing (Covaris), end repair, base addition and adapter ligation, or tagmentation (Tn5); PCR amplification; sequence library.]
Supplementary Figure 1. The Smart-Seq protocol. Protocol for cDNA amplification and library preparation from few or single cells. Cells are first lysed in a reverse-transcription-compatible buffer, and the reaction is initiated with an oligo-dT-containing primer. First-strand synthesis is completed with the addition of a few untemplated C nucleotides, followed by template switching and the incorporation of the SMARTer II A oligonucleotide. Full-length cDNAs are amplified by PCR to obtain a few nanograms of DNA. Fragmentation and adapter introduction are performed using either acoustic shearing or transposase tagmentation to generate sequencing libraries.
[Figure 2, panels a-h: read coverage (0% to 100%) versus distance from the 3' end (kb), one panel per transcript length class: 0-1 kb, 1-2 kb, 2-3 kb, 3-4 kb, 4-5 kb, 5-7 kb, 7-9 kb, 9-15 kb. Series: oocytes (Smart-Seq, n=3); oocytes, 2-cell and 4-cell blastomeres (Ref. 7, n=24); ES cells, ICM, epiblast (Ref. 8, n=34).]
Supplementary Figure 2. Comparison of Smart-Seq and Tang et al. data (2009, 2010). (a-h) Read coverage across transcripts for Smart-Seq oocytes, for oocytes and blastomeres from Ref. 7, and for cells from Ref. 8. Transcripts have been separated according to their lengths, as indicated above the plots. The gray vertical line indicates the point from which coverage is expected to drop, based on the span of transcript lengths included in each panel. (i-k) Comparisons of sensitivity between mouse oocyte data generated with Smart-Seq (green) and from Ref. 7 (red), using 5 million reads per cell. (i) The number of genes detected in each oocyte (the result is not sensitive to the read cutoff for detection). (j) The percentage of genes detected in pairs of oocytes, as a function of the gene expression level (maximum over biological replicates). Error bars, s.e.m. (n�7). (k) Variation in expression level estimates between pairs of oocytes, as a function of the mean expression. Error bars, s.e.m. (n�7).
[Figure 2, panels i-k: genes detected per oocyte (axis 0 to 10,000); detection (0% to 100%) versus RPKM, log10 (0 to 3); standard deviation, log10 scale (0.0 to 0.3) versus RPKM, log10 (0 to 3). Series: oocytes (Smart-Seq); oocytes (Ref. 7, Tang et al. 2009).]
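The pairwise detection analysis described for panel (j) of Supplementary Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `pairwise_detection`, the bin width, and the rule "a gene counts as detected when its RPKM exceeds `threshold`" are all hypothetical choices. Genes are binned by the maximum log10 RPKM within each pair of cells, and the fraction of genes detected in both cells is averaged over pairs.

```python
from itertools import combinations
from math import floor, log10


def pairwise_detection(cells, bin_width=1.0, threshold=0.0):
    """Mean fraction of genes detected in both cells of a pair, per expression bin.

    cells: dict mapping cell name -> dict mapping gene -> RPKM.
    Genes are binned by log10 of the larger RPKM in the pair (an assumption).
    Returns a dict mapping the lower edge of each log10 bin to the mean fraction.
    """
    fractions_per_bin = {}
    for a, b in combinations(sorted(cells), 2):
        counts = {}  # bin lower edge -> [detected in both, total]
        for gene in set(cells[a]) | set(cells[b]):
            x = cells[a].get(gene, 0.0)
            y = cells[b].get(gene, 0.0)
            top = max(x, y)
            if top <= threshold:
                continue  # not detected in either cell of this pair
            edge = floor(log10(top) / bin_width) * bin_width
            entry = counts.setdefault(edge, [0, 0])
            entry[1] += 1
            if min(x, y) > threshold:
                entry[0] += 1
        for edge, (both, total) in counts.items():
            fractions_per_bin.setdefault(edge, []).append(both / total)
    return {edge: sum(v) / len(v) for edge, v in sorted(fractions_per_bin.items())}


if __name__ == "__main__":
    toy = {
        "cell1": {"A": 5.0, "B": 50.0, "C": 0.0},
        "cell2": {"A": 0.0, "B": 40.0, "C": 3.0},
        "cell3": {"A": 2.0, "B": 60.0, "C": 4.0},
    }
    print(pairwise_detection(toy))
```

In the toy data, the highly expressed gene is found in every pair, while the weakly expressed genes drop out of some cells, which is the qualitative pattern the panel is meant to show.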
[Figure 3, panels a-b: (a) read coverage tracks for mBrain (Smart-Seq) and ES cells (Tang 2010), with the 3' end marked; (b) genome browser view of POLR2B, coordinates 57,850 kb to 57,900 kb (56 kb shown), coverage scale [0 - 77].]
Supplementary Figure 3. Full-transcript coverage and reconstruction. (a) Smart-Seq reads mapping to a 14-kb region of the RNA polymerase II alpha gene locus containing 27 of the 28 exons. The read coverage is displayed for Smart-Seq data from diluted amounts of mouse brain RNA (green) and for previously published single-cell data from mouse embryonic stem (ES) cells (red) from Tang et al. 2010. (b) Example of a fully reconstructed transcript (human POLR2B) from Smart-Seq reads from an individual cell of the T24 cancer cell line. Reads were visualized using the Integrative Genomics Viewer (IGV, Broad Institute).
[Figure 4, panels a-d: number of genes detected (0 to 15,000) versus number of LNCaP cells (Smart-Seq; 1 to 160), followed by mRNA-Seq from millions of cells, with annotations 5 cells (n~11,200 genes), 10 cells (n~11,900 genes) and 80 cells (n~13,400 genes); two plots of gene detection (%) versus gene expression (RPKM, log10; 0.0 to 3.5), with legends 10ng, 1ng, 100pg, 10pg for mRNA-Seq, Smart-Seq (PE) and Smart-Seq (Tn5), and 1M, 5M, 10M, 20M reads and full depth within cell lines (non-amplified, 1ng, 100pg, 10pg); Spearman correlation (0.65 to 0.71) versus read depth (million uniquely mapped reads; 0.1 to 100).]
Supplementary Figure 4. Sensitivity and variation in Smart-Seq data. (a) The number of detected genes in individual cells or when combining data from two or more cells, as indicated on the x-axis. For comparison, the last data point shows the number of genes detected in standard mRNA-Seq from millions of cells. (b) Gene detection sensitivity presented as the mean percentage of genes detected in replicates, binned according to expression level. We performed all pairwise comparisons within groups of replicates and report the mean and 95% confidence interval. Sequencing libraries constructed using either transposase “tagmentation” (Tn5) or shearing followed by ligation (PE) from the indicated low amounts of starting material are compared. (c) Gene detection sensitivity for cancer cells as in Fig. 2b, but using different sequencing depths. (d) Spearman correlation (rho) in gene expression levels between different cells of the same origin, as a function of sequencing depth.
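The cumulative count in panel (a) of Supplementary Figure 4 can be illustrated with a short sketch. It assumes that "combining data" means taking the union of genes detected across cells; the original analysis may instead have pooled reads before calling detection. The function name `cumulative_detected` and the detection rule (RPKM above `threshold`) are hypothetical.

```python
def cumulative_detected(cells, threshold=0.0):
    """Number of distinct genes detected after adding each cell in turn.

    cells: list of dicts mapping gene -> RPKM, in the order they are combined.
    A gene counts as detected once its RPKM exceeds `threshold` in any cell
    seen so far (an assumed rule, used here for illustration only).
    """
    seen = set()
    counts = []
    for expression in cells:
        seen.update(gene for gene, value in expression.items() if value > threshold)
        counts.append(len(seen))
    return counts


if __name__ == "__main__":
    toy = [
        {"A": 1.0, "B": 0.0},
        {"B": 2.0, "C": 0.5},
        {"A": 3.0},
    ]
    print(cumulative_detected(toy))
```

The curve produced this way can only rise or plateau, which matches the saturating shape expected when more cells are combined.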
[Figure 5, panel a: Refseq transcripts detected (8,000 to 20,000) versus number of PCR cycles (0 to 18; left) and versus total RNA starting amount (0.01, 0.1, 1, 10 ng and NA; right).]
Supplementary Figure 5. Transcript detection and coverage with varying PCR cycles. (a) The number of Refseq transcripts detected as a function of the number of PCR cycles (left) and the starting amount of total RNA (right). An RPKM threshold of 0.1 was used for detection. Colors separate independent dilution series. The error bars show the standard error. (b) Mean read coverage over transcripts. We compared read coverage for Smart-Seq data generated from 1 ng or 10 pg of total RNA using either 12 or 18 cycles of PCR. No reproducible effect of the number of PCR cycles on read coverage was observed. Error bars represent the standard deviation.
[Figure 5, panel b: read coverage (%; 0 to 100) versus position along the transcript (%), from 5' end to 3' end. Series: Smart-Seq 1ng, 18 cycles; 10pg, 18 cycles; 1ng, 12 cycles.]
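The detection rule used for panel (a) of Supplementary Figure 5 (an RPKM threshold of 0.1) can be written out as a small sketch. RPKM is computed with its standard definition, reads per kilobase of transcript per million mapped reads. The function names and the choice of ">=" rather than ">" at the threshold are assumptions made for illustration.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def count_detected(read_counts, lengths_bp, total_mapped_reads, threshold=0.1):
    """Count transcripts whose RPKM reaches `threshold`.

    read_counts: dict mapping transcript -> number of mapped reads.
    lengths_bp: dict mapping transcript -> length in base pairs.
    Whether the boundary value itself counts as detected is an assumption.
    """
    return sum(
        1
        for transcript, n in read_counts.items()
        if rpkm(n, lengths_bp[transcript], total_mapped_reads) >= threshold
    )


if __name__ == "__main__":
    counts = {"T1": 10, "T2": 1, "T3": 0}
    lengths = {"T1": 2000, "T2": 4000, "T3": 1500}
    # T1 has RPKM 1.0 and T2 has RPKM 0.05 at 5 million mapped reads.
    print(count_detected(counts, lengths, total_mapped_reads=5_000_000))
```

Because RPKM scales with sequencing depth, a single read can fall below the 0.1 threshold in a deep library, as T2 does in the toy example.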
[Figure 6: three scatter plots of RPKM, log2 (axes 0 to 12): mBrain (1ng, rep 1) versus mBrain (1ng, rep 2); mBrain (50pg, rep 1) versus mBrain (50pg, rep 2); T24 cell (cell 1) versus T24 cell (cell 2). Annotated correlations, in extraction order: spearman r=0.97, pearson r=0.95; spearman r=0.81, pearson r=0.83; spearman r=0.78, pearson r=0.74. Boxplot panel: Pearson correlation (0.4 to 1.0) for dilution series (Brain and UHRR, PE and Tn5; 10ng, 1ng, 100pg, 10pg), human cancer cell line cells (LNCaP, T24, PC3), and human cells and cancer cell line cells (SKMEL, UACC, PM, CTC, hES), together with comparisons between sample types (Brain vs. UHRR; T24 vs. LNCaP; T24 vs. PC3; LNCaP vs. PC3; PM vs. SKMEL, UACC, CTC, hES; CTC vs. SKMEL, UACC, PM, hES).]
Supplementary Figure 6. Correlations in gene expression levels between technical and biological replicates. (a) Correlations of gene expression levels between dilution or single-cell replicates are shown as boxplots (black). Comparisons between different cell types, or between diluted RNA of different origins, are shown in red. Correlations were computed from log2-transformed RPKM gene expression levels. (b-d) Representative scatter plots of gene expression levels from libraries generated from independent technical replicates of 1 ng total RNA (b) or 50 pg total RNA (c), or from independent T24 cancer cell line cells (d). These scatter plots are representative of the variation in gene expression levels found; the results from all pairwise comparisons are summarized in Fig. 2c,d.
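The correlation measure used in Supplementary Figure 6 (correlation of log2-transformed RPKM values between two libraries) can be sketched as below, with a rank-based Spearman variant for comparison. The legend does not say how genes with zero expression were handled, so restricting the calculation to genes with RPKM above zero in both samples is an assumption, and all function names are hypothetical.

```python
from math import log2, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def log2_rpkm_correlations(sample_a, sample_b):
    """Pearson and Spearman correlation of log2 RPKM over shared expressed genes.

    Genes with RPKM of zero in either sample are skipped (an assumption).
    """
    shared = [g for g in sample_a if sample_b.get(g, 0.0) > 0 and sample_a[g] > 0]
    x = [log2(sample_a[g]) for g in shared]
    y = [log2(sample_b[g]) for g in shared]
    return pearson(x, y), spearman(x, y)


if __name__ == "__main__":
    rep1 = {"G1": 1.0, "G2": 2.0, "G3": 4.0, "G4": 8.0, "G5": 0.0}
    rep2 = {"G1": 2.0, "G2": 4.0, "G3": 8.0, "G4": 16.0, "G5": 5.0}
    print(log2_rpkm_correlations(rep1, rep2))
```

Working on the log scale keeps a handful of very highly expressed genes from dominating the Pearson coefficient, which is one common motivation for the transformation.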
[Figure 7, panels a-b: correlation with transcript G+C fraction and correlation with transcript length (axes -0.4 to 0.4) for LNCaP, UHR, hBrain and mBrain. Series: mRNA-Seq; Smart-Seq (PE); Smart-Seq (Tn5).]
Supplementary Figure 7. Analyses of GC and length biases in Smart-Seq and mRNA-Seq data. (a) Pearson correlation coefficients between log-transformed expression values and transcript GC content for pre-amplified (Smart-Seq) and standard mRNA-Seq data. (b) Pearson correlation coefficients between log-transformed expression values and log transcript length for pre-amplified (Smart-Seq) and standard mRNA-Seq data. As the true biological GC and length biases are not known, there is no obvious best correlation. We found no systematic difference between Smart-Seq data generated with shearing and ligation (PE), Smart-Seq data generated with Tn5-mediated “tagmentation” (Tn5), and standard mRNA-Seq samples, but there was a tendency towards lower GC content correlations for Smart-Seq data.
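[Figure 8: matrix of mean correlation ± s.e.m. Columns are Smart-Seq samples; rows are standard mRNA-Seq samples. Layout reconstructed from the extraction order, read row by row.]

mRNA-Seq      ESC           brain 10 pg   brain 100+ pg   CTC
ESC           0.72±0.005    0.58±0.003    0.74±0.008      0.56±0.016
brain         0.60±0.004    0.73±0.003    0.92±0.009      0.54±0.013
melanoma      0.65±0.002    0.57±0.002    0.71±0.002      0.63±0.006
lymph node    0.61±0.004    0.60±0.002    0.76±0.007      0.56±0.014
leukocytes    0.59±0.004    0.55±0.003    0.70±0.006      0.56±0.015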
Supplementary Figure 8. Correlation analyses between Smart-Seq and mRNA-Seq gene expression levels. Individual human ES cells (ESC, n=8), diluted brain RNA at 10 pg and at 100 pg or more, and individual CTCs (n=6), compared with mRNA-Seq data from
error of the mean is shown for each comparison. Within each Smart-Seq sample (i.e. each column), the matrix cells are colored on a linear white-to-red scale, with white at the lowest mean correlation and red at the maximum mean correlation. Note
mRNA-Seq data from ES cells and melanoma respectively. Identical brain RNA prepared with Smart-Seq and mRNA-Seq correlates more highly with increasing starting amounts in Smart-Seq. ESC RNA-Seq data were downloaded from GEO (GSM672836).
[Figure 9: heatmap of expression (RPKM, log2; scale 0 to 10). Genes listed: PMEL, MITF, TYR, MLANA. Samples: SKMEL, UACC, PM, CTC, ESC; immune mRNA-Seq samples BL, BJ, W, L; PBL.]
Supplementary Figure 9. Single-cell cDNAs from peripheral blood lymphocytes. Two individual peripheral blood lymphocytes (PBLs) were prepared using Smart-Seq and sequenced at shallow depth on a MiSeq. These two cells (PBL) were compared with Smart-Seq transcriptomes from melanoma cancer cell lines (SKMEL5, UACC257), primary melanocytes (PM), circulating tumor cells (CTC) and human embryonic stem cells (ESC), and with mRNA-Seq transcriptome data from Burkitt's lymphoma cell lines (BL41, BJAB), white blood cells (W) and lymph nodes (L). Gene expression levels (RPKM, log2) for marker genes of the melanocyte and hematopoietic lineages are shown in a heatmap. Despite the shallow sequencing depth, we detected markers of the hematopoietic lineage in both PBLs, but not melanocyte lineage markers.
[Lineage markers: melanocytes; lymphocytes, monocytes (CD53, PTPRC, CCL5, CD4). Label: Smart-Seq on individual cells. Color scale: expression (RPKM, log10), 0 to 3.]
[Figure 10: heatmap rows, in extraction order: MAGEH1, MAGEA1, MAGEC1, MAGEB1+MAGEB4, MAGEA11, MAGEA8, MAGEF1, MAGED4+MAGED4B, MAGED4+MAGED4B, MAGEE1, MAGED2, MAGED1, MAGEB6, MAGEL2, MAGEA4, MAGEB10, MAGEB18, MAGEC3, MAGEB16, MAGEB3, MAGEE2, MAGEA9+MAGEA9B, MAGEA9+MAGEA9B, MAGEB2, MAGEC2, MAGEA10+MAGEA10-MAGEA5+MAGEA5, MAGEA2+MAGEA2B, MAGEA2+MAGEA2B, MAGEA12, MAGEA6, MAGEA3. Columns: SKMEL5, UACC257, PM, CTC, hESC, BL.]
Supplementary Figure 10. Expression of all melanoma-associated antigens (MAGEs). Heatmap showing the expression levels of all MAGE genes in CTCs, PM, melanoma cell lines (SKMEL5 and UACC257), human ESCs and the Burkitt's lymphoma cell lines BL41 and BJAB (BL).
[Color scale: relative expression (log2), -5 to 5.]
Supplementary Figure 11. Detection of SNPs and mutations in CTCs. (a) Genomic sites with support for an allelic difference between single-cell CTCs and the reference sequence were sorted according to their QUAL values (Phred-scaled probability of polymorphism). For bins of genomic sites, we analyzed the fraction of sites already present in dbSNP (build 132) and the ratio of transitions to transversions (Ti/Tv). In genomic DNA resequencing studies, one expects to find ~90% of individual differences in dbSNP and a Ti/Tv ratio of ~2.1 (genome-wide) or ~2.8 (exons). We found that sites with QUAL values of 500 or higher (dashed line) had 92% overlap with dbSNP and a Ti/Tv ratio of 2.3, suggesting that genotype calls above that QUAL threshold are reliable. At this threshold we found 4,312 genomic sites using transcriptome data from the six CTCs and requiring that the allele be supported in two or more CTCs. (b) The number of sites identified when requiring support in an increasing number of CTCs. The barplot shows the number of genomic sites with known or novel transitions and transversions. We found a large difference between sites supported in one cell and sites supported in two cells, likely reflecting a large number of false positives among calls from a single cell. Requiring support in two or more cells led to a more consistent and slowly declining number of SNP or mutation calls, which probably reflects statistical power. The QUAL threshold used is 500, as indicated by the dashed line in (a).
[Figure 11, panels a-b: (a) genomic sites (axis 1,600 to 10,000) versus Ti/Tv (1 to 2.5) and versus fraction of sites overlapping dbSNP (132) (0.5 to 1.0); (b) number of sites (cumulative; 0 to 8,000) versus minimum number of samples (1 to 6). Bars: transitions, known SNPs; transversions, known SNPs; transitions, novel sites; transversions, novel sites.]
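The quality checks described in the legend of Supplementary Figure 11 (a QUAL cutoff of 500, support in two or more CTCs, dbSNP overlap and the Ti/Tv ratio) can be sketched as follows. The record layout `(ref, alt, qual, in_dbsnp, n_cells)` and the function names are hypothetical, chosen only for this illustration; the source does not describe the data structures the authors used.

```python
PURINES = {"A", "G"}


def is_transition(ref, alt):
    """True for purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T) changes."""
    return ref != alt and ((ref in PURINES) == (alt in PURINES))


def summarize_sites(sites, min_qual=500, min_cells=2):
    """Filter variant sites and report count, dbSNP fraction and Ti/Tv.

    sites: iterable of (ref, alt, qual, in_dbsnp, n_cells) tuples, where
    n_cells is the number of cells supporting the allele (assumed layout).
    Returns (number of sites kept, fraction in dbSNP, Ti/Tv ratio).
    """
    kept = [s for s in sites if s[2] >= min_qual and s[4] >= min_cells]
    if not kept:
        return 0, 0.0, 0.0
    known = sum(1 for s in kept if s[3])
    transitions = sum(1 for s in kept if is_transition(s[0], s[1]))
    transversions = len(kept) - transitions
    titv = transitions / transversions if transversions else float("inf")
    return len(kept), known / len(kept), titv


if __name__ == "__main__":
    toy = [
        ("A", "G", 900, True, 3),
        ("C", "T", 600, True, 2),
        ("A", "C", 700, False, 2),
        ("G", "A", 400, True, 5),   # fails the QUAL cutoff
        ("T", "C", 800, True, 1),   # supported in only one cell
    ]
    print(summarize_sites(toy))
```

Raising `min_cells` from 1 to 2 removes calls seen in a single cell, which is where the legend reports most of the likely false positives.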