
Page 1: Detection of New Transcripts with Oligonucleotide Arrayscompdiag.molgen.mpg.de/docs/talk_13_02_03_roepcke.pdf · DC13 – Unknown Gene 441 008 409 828 PCR-Produkt (Sonde) mRNA: NM_020188

Detection of New Transcripts with Oligonucleotide Arrays

How can one use the plethora of expression data?

Presented by Stefan Röpcke


Thesis - Overview

Data analysis of Affymetrix chips (normalisation, data condensation, DB)

Cancer research
• reanalysis of public raw data
• Bhattacharjee et al. (PNAS 2001)
• Garber et al. (PNAS 2001)
• comparison of technologies
• comparison to literature data

Transcript detection
• analysis of expression of the same genomic locus
• evidence for antisense transcripts
• artefact or biology?

Interdependencies of gene structure and expression
• yeast, fly
• reanalysis of public data


Oligo Array Experiment

1. Extraction of poly-A RNA

2. Amplification and labeling of the RNA

3. Fragmentation, hybridisation and staining


Overview – Data Analysis

[Figure: chip layout illustrating the terms feature, probe pair (PM and MM) and probeset.]

1. Feature-level normalisation: offset subtraction, division by the median

2. Data condensation:
• Third-quartile method: 75th percentile of the perfectly matching oligos (PM)
• Wilcoxon test: nonparametric, paired test of whether the PM values are brighter than the MM values

Result: ONE representative expression value and ONE detection score per probeset and chip
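A minimal Python sketch of the two steps named on this slide (feature-level normalisation, then condensation by the third-quartile method). It is an illustration only, not the implementation used in the thesis: the function names, the explicit `offset` argument and the choice of the "inclusive" quantile definition are assumptions, since the slide does not say how the offset is determined or how the percentile is interpolated.

```python
from statistics import median, quantiles


def normalise_features(intensities, offset):
    """Subtract an offset from every feature, then divide by the median.

    `offset` is a hypothetical parameter; the slide does not specify
    how the offset is chosen.
    """
    shifted = [x - offset for x in intensities]
    scale = median(shifted)
    if scale <= 0:
        raise ValueError("median after offset subtraction must be positive")
    return [x / scale for x in shifted]


def third_quartile(pm_values):
    """Condense the PM intensities of one probeset to their 75th percentile."""
    # quantiles(..., n=4) returns the three quartile cut points;
    # index 2 is the third quartile.
    return quantiles(pm_values, n=4, method="inclusive")[2]


if __name__ == "__main__":
    chip = normalise_features([110, 120, 130, 140, 150], offset=100)
    print(chip)
    print(third_quartile([1, 2, 3, 4, 5]))
```

Using an upper quantile of the PM values rather than their mean makes the expression value less sensitive to a few weak oligos in a probeset.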


Implementation & Storage

Data about
• the sequences
• the experiments
• the samples

CEL files: an intensity and a deviation value for each feature

• Wilcoxon test
• Expression value calculation

• Quality control
• Normalization

Perl database interface



Expression Analysis of Lung Cancer Samples

Comparison of Oligo and cDNA Array Data

[Figure: two scatter plots, "Adenocarcinoma versus normal lung" and "Squamous cell carcinoma versus normal lung". X-axis: average log ratio on cDNA arrays. Y-axis: average log ratio on oligo arrays. Values > 0: higher in tumor; values < 0: lower in tumor. RED points: the most "differential" genes in one of the data sets.]


Thesis - Overview

Data analysis of Affymetrix chips (normalisation, data condensation, DB)

Cancer research
• reanalysis of public raw data
• Bhattacharjee et al. (PNAS 2001)
• Garber et al. (PNAS 2001)
• comparison of technologies
• comparison to literature data

Transcript detection
• analysis of expression of the same genomic locus
• evidence for antisense transcripts
• artefact or biology?

Interdependencies of gene structure and expression
• yeast, fly
• reanalysis of public data
• average expression vs gene length
• length of introns/exons


Highly expressed Genes are shorter

[Figure: average expression (log scale) against gene length (log scale), with curves for the median, the 75th percentile and the 90th percentile. Additional panel: length distribution of non-annotated genes.]

Yeast expression data set from Cho et al. [PNAS 1998]

6200 predicted yeast genes

2820 annotated genes only (MIPS, SGD)


Detection of Putative Transcripts

• Analysis of probesets from the same locus
• Evidence of antisense transcripts
• Real transcripts or technical artefact?

[Figure: genomic sequence with a 3' exon covered by 2 probesets, sense and antisense.]


Data Set

Sequence sets on metaGen chip I
• 6117 probesets (each with 20 perfectly matching (PM) oligos of 25 bases + 20 mismatch (MM) oligos)
• circa 4000 different genes
• 1066 sense-antisense sequence pairs
• 588 CDS-UTR sequence pairs

310 hybridisations with poly(A+) RNA
• 250 samples from cancer patients
• 60 samples from cell lines

Subset used for this presentation
• BLAST against the RefSeq database (NCBI)
• Criterion: exactly one almost complete hit (at least 18 oligos)
-> 102 sense-antisense sequence pairs
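The selection criterion for the subset can be sketched as a simple filter. This is only one possible reading of "exactly one almost complete hit": a probeset is kept if exactly one RefSeq entry is matched by at least 18 of its oligos, and weaker hits are ignored. The input layout (a dict of `(refseq_id, matching_oligos)` tuples) and all identifiers in the example are hypothetical.

```python
def select_unambiguous(hits_by_probeset, min_oligos=18):
    """Keep probesets with exactly one hit covering >= min_oligos oligos.

    hits_by_probeset maps a probeset name to a list of
    (refseq_id, number_of_matching_oligos) tuples (assumed layout).
    Returns a dict probeset -> the single qualifying refseq_id.
    """
    selected = {}
    for probeset, hits in hits_by_probeset.items():
        complete = [refseq for refseq, n_oligos in hits if n_oligos >= min_oligos]
        if len(complete) == 1:
            selected[probeset] = complete[0]
    return selected


if __name__ == "__main__":
    # Hypothetical example data, not taken from the chip.
    example = {
        "ps1": [("REF_A", 20)],                # one complete hit: kept
        "ps2": [("REF_B", 19), ("REF_C", 18)],  # two complete hits: dropped
        "ps3": [("REF_D", 17)],                # no complete hit: dropped
        "ps4": [("REF_E", 18), ("REF_F", 5)],   # weak second hit ignored: kept
    }
    print(select_unambiguous(example))
```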


Data Analysis

[Figure: chip layout illustrating the terms feature, probe pair (PM and MM) and probeset.]

Third-quartile method
• 75th percentile of the perfectly matching oligos (PM)

Wilcoxon test
• nonparametric, paired test
• tests whether the PM values are brighter than the MM values

How to compare two different probesets on one chip? (sense vs antisense probeset)

Detection call
• p-value > 0.05: absent
• p-value < 0.05: present
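The detection call can be illustrated with a small, self-contained sketch of a one-sided Wilcoxon signed-rank test on the PM/MM pairs of one probeset. It is an assumption-laden illustration rather than the thesis code: it works on raw PM minus MM differences, drops zero differences, gives tied absolute differences their average rank, and computes the exact p-value by enumerating the sign distribution. The slide leaves p = 0.05 undefined; here it is treated as absent.

```python
def wilcoxon_one_sided(pm, mm):
    """Exact one-sided signed-rank p-value for H1: PM tends to exceed MM."""
    diffs = [p - m for p, m in zip(pm, mm) if p != m]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # Doubled average ranks keep everything in integers even with ties.
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * mean of ranks i+1 .. j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    # Null distribution: every rank carries a + or - sign with probability 1/2.
    total = sum(ranks2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[w_plus2:]) / 2 ** n


def detection_call(p_value, alpha=0.05):
    """'present' if p < alpha, otherwise 'absent' (p == alpha counts as absent)."""
    return "present" if p_value < alpha else "absent"


if __name__ == "__main__":
    pm = [12, 15, 19, 24, 30]
    mm = [11, 13, 16, 20, 25]
    p = wilcoxon_one_sided(pm, mm)
    print(p, detection_call(p))
```

Because each probeset gets its own call from its own PM/MM pairs, the calls of a sense and an antisense probeset on the same chip can be compared directly even if their absolute intensities differ.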


Counting Schema

Example: zinc finger protein 106 (ZFP106), number of experiments per combination of detection calls

                           Reverse complement
                           present   absent
Forward strand   present         5      255
                 absent          3       47

Representation in table format

             Count experiments
HIT ID       a_a  a_p  p_a  p_p  ORI  HIT HEADER
NM_017657     47    3  250   10  +/-  hypothetical protein FLJ20080 (FLJ20080), mRNA
NM_022473     47    3  255    5  +/-  zinc finger protein 106 (ZFP106), mRNA
NM_019000     45    0  235   30  +/-  hypothetical protein (FLJ20152), mRNA

p_a means: sense probeset present and antisense probeset absent
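The counting schema amounts to a cross-tabulation of the two detection calls over all experiments. A minimal sketch, assuming each probeset comes with one "present"/"absent" call per experiment (the function name and input format are illustrative, not taken from the talk):

```python
def count_calls(forward_calls, reverse_calls):
    """Count experiments for each combination of forward/reverse calls.

    Both arguments are equally long lists of "present"/"absent" strings,
    one entry per experiment. Keys follow the slide: the first letter is
    the forward (sense) call, the second the reverse (antisense) call.
    """
    if len(forward_calls) != len(reverse_calls):
        raise ValueError("need one call per experiment for both probesets")
    counts = {"a_a": 0, "a_p": 0, "p_a": 0, "p_p": 0}
    for fwd, rev in zip(forward_calls, reverse_calls):
        if fwd not in ("present", "absent") or rev not in ("present", "absent"):
            raise ValueError("calls must be 'present' or 'absent'")
        counts[f"{fwd[0]}_{rev[0]}"] += 1
    return counts


if __name__ == "__main__":
    forward = ["present", "present", "present", "absent", "absent"]
    reverse = ["present", "absent", "absent", "present", "absent"]
    print(count_calls(forward, reverse))
```

For the 310 hybridisations of this data set, the four counts of every sequence pair add up to 310, as in the table rows above.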


Chip Results Contradictory to the Annotation (out of 102 Pairs)

             Count experiments (Forward_Reverse)
HIT ID       a_a  a_p  p_a  p_p  ORI  Mean of F-R  HIT HEADER
NM_005033     92   13  187   18  +/+       0.6110  polymyositis/scleroderma autoantigen 1 (75kD) (PMSCL1)
NM_014943    183    0  127    0  +/+       0.2265  KIAA0854 protein (KIAA0854)
NM_015310    282    5   23    0  +/+      -1.6180  KIAA0942 protein (KIAA0942)
NM_031431    300    0   10    0  +/+       1.6951  tethering factor SEC34 (SEC34)
NM_022497    136  159    0   15  +/-       3.4254  mitochondrial ribosomal protein S25 (MRPS25)
NM_032659    219   89    2    0  +/-       1.6605  hypothetical protein MGC11138 (MGC11138)
NM_014654    252   58    0    0  +/-      -0.3542  KIAA0468 gene product (KIAA0468)
NM_024709    296   14    0    0  +/-      -1.2547  hypothetical protein FLJ14146 (FLJ14146)

p_a means: sense probeset present and antisense probeset absent

RED-marked entries are contrary to the expectations


Evidence for Antisense Transcripts

22 out of 102

             Count experiments
HIT ID       a_a  a_p  p_a  p_p  ORI  Mean of F-R  HIT HEADER
NM_018011     10    2   24  274  +/-       6.2635  hypothetical protein FLJ10154 (FLJ10154)
NM_016127      3    3   34  270  +/-       9.8122  HSPC035 protein (LOC51669)
NM_018509      9    3   32  266  +/-       9.4732  hypothetical protein PRO1855 (PRO1855)
NM_006526     26   19   37  228  +/-       1.0152  zinc finger protein 217 (ZNF217)
NM_018975      4    0   97  209  +/-       5.0257  TRF2-interacting telomeric RAP1 protein (RAP1)
NM_016629     19    2   98  191  +/-       7.6667  hypothetical protein (LOC51323)
NM_024026     35   65   32  178  +/-       1.4489  mitochondrial ribosomal protein 63 (MRP63)
NM_016617      8    2  125  175  +/-       3.4644  hypothetical protein (BM-002)
NM_021238      5    1  146  158  +/-       6.7714  TERA protein (TERA)
NM_030912      7    1  147  155  +/-       6.247   tripartite motif protein TRIM8 (TRIM8)
NM_022349     38   41   81  150  +/-       2.8373  CD20-like precursor (LOC64166)
NM_006283     14    2  156  138  +/-       9.6877  transforming, acidic coiled-coil containing prot, 1 (TACC1)
NM_015385    101   34   43  132  +/-       1.7782  SH3-domain protein 5 (ponsin) (SH3D5)
NM_014050     67   55   64  124  +/-       2.373   mitochondrial ribosomal protein L42 (MRPL42)
NM_023037     86   11  122   91  +/-       1.558   putative gene product (13CDNA73)
NM_022781     25    6  201   78  +/-       1.6658  hypothetical protein FLJ21343 (FLJ21343)
NM_007106     36    2  194   78  +/-       2.0472  ubiquitin-like 3 (UBL3)
NM_016052      1  232    0   77  +/-      -1.7541  CGI-115 protein (LOC51018)
NM_032323    282   13   15    0  +/-       2.2953  hypothetical protein MGC13102 (MGC13102)
NM_020188      7  139    2  162  +/+      -1.0526  DC13 protein (DC13)
NM_015070     39   31   93  147  +/+       0.6826  KIAA0853 protein (KIAA0853)
NM_017689     62   16  121  111  +/+      -7.3946  hypothetical protein FLJ20151 (FLJ20151)

p_a means: sense probeset present and antisense probeset absent

RED-marked entries are contrary to the expectations


Correlation of Expression Values of Sense and Antisense Probeset

[Figure: scatter plots. X-axis: expression values of the annotated strand. Y-axis: expression of the antisense strand.]

• Most cases: the annotated strand is more highly and more variably expressed
• Few cases: annotated and antisense strand are similarly expressed


Biological Phenomenon or Technical Artefact

Sense-antisense sequence pairs (total 102)

1. Reflection of the annotation in most pairs

2. BUT contradictory chip results in 29 pairs:
• 21 pairs of probesets (sense-antisense) are both present in more than 76 out of 310 experiments.
• A further 8 pairs: chip results contradictory to the annotated orientation
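The first of the two criteria can be re-checked against the p_p column of the 22-row table on the "Evidence for Antisense Transcripts" slide. The sketch below copies those p_p counts and applies the "more than 76 out of 310" threshold; the variable names are illustrative.

```python
# p_p counts (both probesets present) copied from the table on the
# "Evidence for Antisense Transcripts" slide, keyed by RefSeq ID.
P_P_COUNTS = {
    "NM_018011": 274, "NM_016127": 270, "NM_018509": 266, "NM_006526": 228,
    "NM_018975": 209, "NM_016629": 191, "NM_024026": 178, "NM_016617": 175,
    "NM_021238": 158, "NM_030912": 155, "NM_022349": 150, "NM_006283": 138,
    "NM_015385": 132, "NM_014050": 124, "NM_023037": 91, "NM_022781": 78,
    "NM_007106": 78, "NM_016052": 77, "NM_032323": 0, "NM_020188": 162,
    "NM_015070": 147, "NM_017689": 111,
}

THRESHOLD = 76  # "present in more than 76 out of 310 experiments"


def both_present(counts, threshold=THRESHOLD):
    """Return the pairs whose p_p count exceeds the threshold."""
    return {refseq: n for refseq, n in counts.items() if n > threshold}


if __name__ == "__main__":
    selected = both_present(P_P_COUNTS)
    print(len(selected), "of", len(P_P_COUNTS), "listed pairs pass the threshold")
```

Of the 22 listed pairs, only NM_032323 (p_p = 0) fails the threshold, which matches the 21 pairs quoted on this slide.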

[Figure: in situ hybridisation images, bt12 antisense (14885/94) and bt12 sense (14885/94).]

In situ hybridisation: sense probe as standard control


No Systematic Effect of the General Quality of Experiments

In how many experiments are sense and antisense both present?

[Figure: bar graph of counts of experiments against the number of sense-antisense pairs that are both present.]

Table of Indicators for the Extremes

                    Counts of
Experiment          Negative probe pairs  Saturated  Present calls  Background  Range of values
No_69_1_3                          41804       1546           2394      837.07           2755.5
No_138_1_1                         41927       1928           2266      845.55           2449
No_184_1_1                         44132        328           2217     1068.6            1932.5
No_185_1_1                         44918        330           2120      965.22           1603.3
No_242_1_1                         42214       1043           2389      904.21           2404
No_254_1_1                         41582       1202           2487      850.38           2465.7
mdamb231_p5                        40594       4449           2480     1938.55           5719.7
B61_1_1                            41547       1478           2469      567.92           2214.8

No7_3rd_15ug                       53027          2           1297      875.83            340.2
No16_3rd_15ug                      55847          4            983     1151.68            592
No_27_1_1                          50842        530           1526     2235.63           3417.5
No_78_1_1                          55829        190            776     1134.03           2099
no_203x_1_2                        50574          1           1526      844.81            575
C32_1_1                            51371        124           1414      569.65            637.8
Cal6_14rere                        51750         32           1433     1441.33            789
Cal6_14__06_10                     52238         20           1326     1301.98            662
Cal6_7rere_06_10                   51885          1           1346      894.78            392.5


No Systematic Effect of GC-Content

[Figure: average GC content in the probeset (Y-axis) against the number of samples (out of 310) in which sense and antisense probes detect a signal (X-axis).]
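The quantity on the vertical axis, the average GC content of a probeset, can be computed as sketched below. The function names and the example oligos are hypothetical; the talk does not show how the value was calculated, so this is only the obvious definition (fraction of G and C bases per oligo, averaged over the oligos of the probeset).

```python
def gc_fraction(oligo):
    """Fraction of G and C bases in one oligo sequence."""
    seq = oligo.upper()
    if not seq:
        raise ValueError("empty oligo sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def mean_gc(probeset_oligos):
    """Average GC fraction over all oligos of one probeset."""
    fractions = [gc_fraction(oligo) for oligo in probeset_oligos]
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    # Hypothetical short sequences, for illustration only.
    print(gc_fraction("GGCCAATT"))
    print(mean_gc(["GGCC", "AATT"]))
```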


No Systematic Effect of Preamplification

[Figure: portion of cell line samples (total 63) in which sense and antisense probes detect a signal (Y-axis) against the portion of tissue samples (preamplified, total 247) in which sense and antisense probes detect a signal (X-axis). Each red star represents one RefSeq transcript.]


Potential Binding Sites for the T7-(dT)24 Primer

ID         end_pos  oligo(dT)
NM_032323        0  -
NM_030912      800  TTTTCCTTTTGGATTTTGTTTTGGTTTTGGCTTTGTTTTTGATTTTTTTTTATT
NM_030912      460  TTTTGTTTTTGTTTTTTTAAATTTTTTT
NM_024026        0  -
NM_023037      600  TTTTCTTTTCTTTTTTCTTTTTTCTTTTTCCTTTCTTT
NM_022781     1500  TTTGTTTTGATATTTTACTGCCTTTCCTTTTAATTTTTGTTT
NM_022349      900  TTTTTTTTT
NM_021238     2000  TTTTTTTTTTTTTTTTTTT
NM_020188      700  AATAAAATAATTTAAATAAAAAAAAAAAA
NM_018975        0  -
NM_018509        0  -
NM_018011      560  TTTTTTGTTTTTGTTTTTGTTTTTTTAATTT
NM_018011      480  TTTTTTTTTTTTTTTT
NM_017689      630  AAAATAGTTAAAAAACAACAAAAAA
NM_017689     2000  AAAAAAAAAAAAAAAAA
NM_016629      720  TTTTTATTATTGTTTGTCCTTTATAAATTTTCTT
NM_016617      520  TTTTTATTTTTATTTATTTGTTTTGTTTTTTTTTTTTTTT
NM_016617     3500  TTTTTTTTTTT
NM_015070     2100  AAAAAAAGGACAGAAAAAGAAAAGCATTGAAAAAAAACGTAAAAAA
NM_014050      400  TTTTCTTTTTTCTTTTTCTATTTT
NM_015385     1200  TTTTTTTTTTTTTGTTGTTGTTGTTGTTTT
NM_015385      320  TTTTTTTTTCCTTTTTAATTTCATTATTTTGGGTTTTTGGTTTTT
NM_016052      480  TTTTATTTTTT
NM_016127        0  -
NM_007106      800  TTTTCATTCTGATTACTCTTTTTTT
NM_006526     1680  TTTTTTTTTTTTT
NM_006526      750  TTTTTTTTTTTT
NM_006283      880  TTTTTTGTAGTTTTTT
X00351         560  TTTATTTGTTTTTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTT
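A simplified way to look for such candidate primer binding sites is to scan a sequence for uninterrupted runs of T (or U). The sketch below does only that: the talk itself used a FASTA search, which also reports interrupted stretches like several of those listed above, so this regex scan is a stand-in and not the method of the talk. The function name and the default `min_len=9` are hypothetical.

```python
import re


def find_t_stretches(seq, min_len=9):
    """Find uninterrupted runs of at least min_len T (or U) bases.

    Returns a list of (start, end, stretch) tuples with 1-based,
    inclusive coordinates. min_len is an illustrative default.
    """
    dna = seq.upper().replace("U", "T")  # treat RNA and DNA alike
    pattern = re.compile(r"T{%d,}" % min_len)
    return [(m.start() + 1, m.end(), m.group()) for m in pattern.finditer(dna)]


if __name__ == "__main__":
    example = "ACG" + "T" * 10 + "GCA"  # hypothetical sequence
    print(find_t_stretches(example))
```

An analogous scan for runs of A would cover the A-rich entries in the list.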


Potential Artefacts of RNA preparation

[Figure: schematic of the RNA preparation with the steps mRNA, reverse transcription, RNase H digestion, second-strand synthesis, in vitro transcription and cRNA. Sequence labels such as 5' AAA…UUU…U 3' and TTT…T7 mark where the oligo-dT-T7 primer binds.]

Internal binding site for the oligo-dT-T7 primer

Strongest hypothesis (known in EST data)


Result of FASTA Search

Search for U-stretches in RefSeq mRNA (FASTA score > 25)

                  U-stretch 5'  U-stretch 3'
selected (21)            57.1%         71.4%
remaining (81)           30.9%         49.4%
total: 102

[Figure: mRNA (5' AAA…UUU…U 3') with the U-stretch located 5' of the oligo set or 3' of the oligo set.]


Sequence Selection for Wet Experiments

Hunting for antisense transcripts

Dragnet search

NO hints for systematic technical problems

• DD3, ZNF217, Ponsin, MRPL42: most differential
• CGI-115, DC13, ZNF217: F-R plot
• FLJ10154, HSPC035, PRO1855: highest pp
• PRO1855, HSPC035, MRP63: no U-stretch
• β-Actin, GapDH (_st present in 97.4%, 58.5%): well known?
• CGI-115: contrary to annotation
• OPN3/KMO (positive control)


DC13 – Unknown Gene

[Figure: gene structure of DC13. Genomic sequence: NT_010380.7 (position labels 441 008 and 409 828). mRNA: NM_020188 (706 bp) with exons 1 to 4 (position labels 1, 140, 256, 328 and 706). PCR product (probe) with primers at 34-53 and 568-545.]

One of the best candidates

EST data: partial overlap with BM039 (unknown gene, antisense)


Confirmation of the Chip Results

Hybridisation of labeled oligos on a multiple tissue northern

• Oligo selection from the probeset, non-overlapping -> 7-10 oligos (25 bases long) per transcript
• Radioactive end labeling & hybridisation

[Figure: Northern blot, DC13 gene. Panels (A) and (B), labelled Sense (RefSeq) and Antisense, each with the blots MTN1 and MTN2. Lanes: heart, brain, placenta, lung, liver, skeletal muscle, kidney, pancreas, spleen, thymus, testis, ovary, small intestine, PBL, prostate, colon. Size markers: 9.5, 7.5, 4.4, 2.4, 1.35 and 0.24 kb.]

• 3 out of 4 examples were confirmed
• Positive control did not work
• Specificity?


RNA In Situ Hybridisation (DC13)

DC13 sense probe (the antisense transcript is detected)

DC13 antisense probe (the sense transcript is detected)

Breast cancer tissue, resected and paraffin-embedded

1 strong and 2 weak signals for sense and antisense, out of 4


Summary & Perspectives

Findings
• Antisense transcripts for a high percentage of genes
• NO systematic technical error found
• (pending) Confirmation of the transcripts by RPA

If TRUE
• In-depth analysis of EST data
• Design of an antisense array


Acknowledgement

Dr. Mennerich (project planning, lab work)
Dr. Pilarsky (Chip Lab)
Eva Klopocki
Anke Vogel
Nicol Creutzburg (ISH)

metaGen