20150330 abundance profiles - universiteit...


20150330 abundance profiles - Universiteit Utrecht · theory.bio.uu.nl/BPA/2015/slides/20150330_abundance... · 2015. 3. 30.

30‐Mar‐15


Abundance profiles

Bas E. Dutilh

Systems Biology: Bioinformatic Data Analysis

Utrecht University, March 30th 2015

Omics sciences

• The suffix ‐ome refers to a totality of some sort

• Gene (genetics) → Genome → Genomics

• Transcript (RNA) → Transcriptome → Transcriptomics

• Protein → Proteome → Proteomics

• Metabolite → Metabolome → Metabolomics

• Lipid → Lipidome → Lipidomics

• Microbe → Microbiome → Microbiomics (?!)

[Figure: DNA → RNA → Protein]

DNA sequencing

• First generation

– Chain termination sequencing

  • Sanger

• Second generation

– Massively parallel sequencing

  • Illumina (MiSeq)

  • Ion Torrent

• Third generation

– Single molecule sequencing

  • Oxford Nanopore (MinION)

  • Pacific Biosciences (PacBio)

Massively parallel sequencing

• Many thousands, up to billions of short DNA sequences

– ~50‐500 base pairs

– Bad quality nucleotides need to be removed (trimming)!

• These reads are randomly sampled from the total DNA/RNA content of a sample

– This gives a detailed overview of the sequences in the sample


Who/what is present in the sample?

• Last week we discussed how to annotate these sequencing reads by aligning to a reference database

– Fast, heuristic similarity search programs are used for this

• The result is an overview of the genes/functions/microbes and their relative abundances in the sequencing reads

[Figure: abundance overviews for functions and for species or taxa, on a scale from low (few reads) to high (many reads)]

Micro‐arrays

• Micro‐arrays are another way of identifying sequences in a sample

– Micro‐arrays can only identify previously known sequences

– That is why DNA sequencing is now the standard

[Figure: a micro‐array is a glass slide that contains pieces of DNA with a known sequence. Green/red labeled sample sequences hybridize to the sequences on the micro‐array.]

These results are just lists of numbers

• DNA: metagenome: list of all microbial genes and organisms and relative abundances in an environment

– Micro‐organisms play important roles in ecology and thus in health

• RNA: transcriptome lists all genes and their relative expression values in a cell/tissue

– Gene expression is important for phenotype

• Protein: proteome lists all proteins in a sample and their relative abundances

– Proteins perform most of the functions in a cell

• A list of numbers is also known as a multidimensional vector, for example: $\vec{a} = (a_x, a_y, a_z)$

DNA: human microbiome

• Which bacterial phyla are present in different human body sites?

• Which metabolic functions do they encode?

• Can we use this to understand the differences between people?

[Figure: microbiome composition in different healthy people]

DNA: microbiome studies

Jack Gilbert, Argonne National Lab

RNA: discover cancer biomarker genes

• Can we improve the prognosis for cancer patients by analyzing their gene expression profile?

[Figure: expression profiles showing up‐regulated and down‐regulated genes; axes: Genes and Patients]


RNA: heat shock response

• How do Saccharomyces cerevisiae (yeast) genes respond to increased growth temperature?

[Figure: expression profiles showing up‐regulated and down‐regulated genes; axes: Genes and Time (minutes) at 5, 15, 30 and 60]

High‐throughput sequencing

• Same organism, different tissues or body sites

– For example: brain versus liver, mouth versus gut

• Same tissue, same organism

– For example: treatment versus control, tumor versus healthy

• Same tissue, different organisms

– For example: wildtype versus knock‐out/transgenic/mutant, comparing monozygotic twin pairs

• Time course experiments 

– For example: effect of a treatment, development of a tissue, response of microbiota to environmental change

Scaling

• When analyzing transcriptome data, we find that the RNA expression of gene A consists of:

– 4,000 reads in a healthy tissue sample

– 10,000 reads in a tumor sample

• Can we conclude that gene A is over‐expressed in cancer?

– The total sizes of the transcriptomic datasets are:

  • 80,000 sequencing reads from the healthy tissue

  • 500,000 sequencing reads from the tumor

– No, relative to the total number of reads the gene is actually expressed at a much lower level in the tumor

Healthy: $\frac{4{,}000}{80{,}000} = 0.05$

Tumor: $\frac{10{,}000}{500{,}000} = 0.02$
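The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `relative_expression` is made up for illustration and is not part of the lecture material.

```python
def relative_expression(gene_reads: int, total_reads: int) -> float:
    """Fraction of all reads in a sample that were assigned to one gene."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return gene_reads / total_reads


# Read counts for gene A and dataset sizes as given on the slide.
healthy = relative_expression(4_000, 80_000)
tumor = relative_expression(10_000, 500_000)

# Ratio below 1 means the gene is relatively less abundant in the tumor.
tumor_vs_healthy = tumor / healthy

print(f"healthy: {healthy:.2f}, tumor: {tumor:.2f}, "
      f"tumor/healthy: {tumor_vs_healthy:.1f}")
```

Although the tumor sample has 2.5 times more reads for gene A in absolute terms, its relative expression is only about 0.4 times that of the healthy sample.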

Comparing read counts

• To compare samples, we need to:

– Scale the numbers so that they add up to 1

  • This accounts for differences in the sample size (total number of reads)

  • Divide each number by the total number of reads

– Normalize numbers so that they are (close to) normally distributed

  • This is important in many statistical tests

  • Biological data are often log‐normally distributed; if so, you could take the logarithm of the number of reads to normalize

[Figure: distributions of Value and of Log (value)]
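The two steps above could be sketched as follows. The read counts are hypothetical (only the 4,000 out of 80,000 for gene A comes from the earlier slide), the function names are made up, and the pseudocount that avoids log(0) is an added assumption that the slides do not mention.

```python
import math


def scale_counts(counts):
    """Divide each count by the total so that the values add up to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def log_transform(counts, pseudocount=1.0):
    """Take log10 of each count; the pseudocount (an assumption) avoids log(0)."""
    return [math.log10(c + pseudocount) for c in counts]


# Hypothetical read counts for three genes in one sample of 80,000 reads.
read_counts = [4_000, 16_000, 60_000]

scaled = scale_counts(read_counts)          # fractions of the total
log_counts = log_transform([9, 99, 999])    # roughly 1, 2 and 3

print(scaled)
print(log_counts)
```

Whether to scale first and take the logarithm afterwards, and which pseudocount to use, are choices that depend on the data and the downstream statistical test.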

Gene expression in time

• Normalized and scaled gene expression values

– For example, expression of genes in aging Arabidopsis leaves

[Figure: line plot of Gene 1, Gene 2 and Gene 3; x‐axis: Time/environments/samples… (…or leaves…), 1 to 10; y‐axis: Abundance/Expression, 0 to 0.25]


Microbial abundance in time

• Normalized and scaled microbial abundance values

– For example, presence of pathogens on rotting Arabidopsis leaves

[Figure: line plot of Microbe 1, Microbe 2 and Microbe 3; x‐axis: Time/environments/samples… (…or leaves…), 1 to 10; y‐axis: Abundance/Expression, 0 to 0.25]

Research setup

1. Design experimental conditions and sampling strategy

2. Extract DNA/RNA/protein

3. Sequence nucleotides or proteins

4. Quality control of sequencing reads or peptides

5. Annotate (e.g. align reads to database) and count

6. Normalize and scale the counts

7. Compare samples, clustering (next lecture)

8. Interpret results and perform verification experiments

Quantifying similarity between vectors

• Based on these measurements, which genes/microbes/etc. are more similar to each other?

• Abundance/expression levels are most similar between [plot symbol] and [plot symbol]

• Abundance/expression patterns are most similar between [plot symbol] and [plot symbol]

• We can use a distance measure to quantify the (dis‐)similarity between the lists

– Many different distance measures exist

[Figure: line plot of three abundance/expression profiles; x‐axis: Time/Environments/Samples, 1 to 10; y‐axis: Abundance/Expression, 0 to 0.25]

Distance matrices

• Distance matrix

0  x  y
x  0  z
y  z  0

• Similarity matrix

1      1 ‐ x  1 ‐ y
1 ‐ x  1      1 ‐ z
1 ‐ y  1 ‐ z  1

• The two matrices are each other's inverse: distance = 1 ‐ similarity
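As a small illustration of the relation distance = 1 ‐ similarity, the conversion can be written as a sketch like the one below. The values chosen for x, y and z are hypothetical, and the function name is made up.

```python
def similarity_to_distance(similarity):
    """Apply distance = 1 - similarity to every entry of a square matrix."""
    return [[1.0 - s for s in row] for row in similarity]


# Hypothetical distances x, y, z between three items.
x, y, z = 0.2, 0.5, 0.3

similarity = [
    [1.0,     1.0 - x, 1.0 - y],
    [1.0 - x, 1.0,     1.0 - z],
    [1.0 - y, 1.0 - z, 1.0],
]

distance = similarity_to_distance(similarity)
for row in distance:
    print([round(value, 3) for value in row])
```

The diagonal of the distance matrix is 0 because every item has similarity 1 to itself.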

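Manhattan distance (levels)

• Definition (for two points A and B in two dimensions): $d_{AB} = |X_A - X_B| + |Y_A - Y_B|$

• Data: three abundance/expression profiles (columns, shown as plot symbols on the slide) in samples 1 to 10 (rows)

Sample  Profile 1  Profile 2  Profile 3
1       0.20       0.15       0.12
2       0.17       0.15       0.09
3       0.16       0.16       0.08
4       0.20       0.15       0.11
5       0.20       0.16       0.12
6       0.17       0.16       0.10
7       0.16       0.15       0.08
8       0.20       0.15       0.12
9       0.18       0.16       0.11
10      0.16       0.15       0.08

• Example (profiles 1 and 2): $d_{12} = |0.20 - 0.15| + |0.17 - 0.15| + |0.16 - 0.16| + |0.20 - 0.15| + |0.20 - 0.16| + |0.17 - 0.16| + |0.16 - 0.15| + |0.20 - 0.15| + |0.18 - 0.16| + |0.16 - 0.15| = 0.265$

– The rounded values shown here add up to 0.26; the reported 0.265 was presumably computed from unrounded data

• Other pairs: $d_{13} = 0.799$ and $d_{23} = 0.534$

• Distance matrix (rows and columns in the order profile 1, 2, 3)

0      0.265  0.799
0.265  0      0.534
0.799  0.534  0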
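A minimal sketch of the Manhattan distance on the slide's data table follows. The names `PROFILE_1` to `PROFILE_3` are placeholders for the three table columns, and the values are the rounded numbers from the table, so the results differ slightly from the values reported on the slide.

```python
# Rounded values copied from the slide's table (entries are samples 1 to 10).
# The names PROFILE_1..3 are placeholders for the three table columns.
PROFILE_1 = [0.20, 0.17, 0.16, 0.20, 0.20, 0.17, 0.16, 0.20, 0.18, 0.16]
PROFILE_2 = [0.15, 0.15, 0.16, 0.15, 0.16, 0.16, 0.15, 0.15, 0.16, 0.15]
PROFILE_3 = [0.12, 0.09, 0.08, 0.11, 0.12, 0.10, 0.08, 0.12, 0.11, 0.08]


def manhattan_distance(a, b):
    """Sum of absolute differences between two equally long profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


d12 = manhattan_distance(PROFILE_1, PROFILE_2)
d13 = manhattan_distance(PROFILE_1, PROFILE_3)
d23 = manhattan_distance(PROFILE_2, PROFILE_3)

print(f"d12 = {d12:.3f}, d13 = {d13:.3f}, d23 = {d23:.3f}")
```

With the rounded table values this gives about 0.26, 0.79 and 0.53, close to the 0.265, 0.799 and 0.534 reported on the slide.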

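Euclidean distance (levels)

• Definition (for two points A and B in two dimensions): $d_{AB}^2 = (X_A - X_B)^2 + (Y_A - Y_B)^2$, so $d_{AB} = \sqrt{(X_A - X_B)^2 + (Y_A - Y_B)^2}$

• Data: the same table of 10 samples and three profiles as for the Manhattan distance

• Example (first and second column of the data table): $d^2 = (0.20 - 0.15)^2 + (0.17 - 0.15)^2 + (0.16 - 0.16)^2 + (0.20 - 0.15)^2 + (0.20 - 0.16)^2 + (0.17 - 0.16)^2 + (0.16 - 0.15)^2 + (0.20 - 0.15)^2 + (0.18 - 0.16)^2 + (0.16 - 0.15)^2 = 0.0105$, so $d = 0.103$

– The rounded values shown here add up to 0.0102; the reported 0.0105 was presumably computed from unrounded data

• Other pairs: $d = 0.253$ (first and third column) and $d = 0.178$ (second and third column)

• Distance matrix (rows and columns in the order of the table columns)

0      0.103  0.253
0.103  0      0.178
0.253  0.178  0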
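The Euclidean distance can be sketched in the same way. Again, the `PROFILE_*` names are placeholders for the three table columns and the input values are the rounded numbers from the table, so small deviations from the slide's results are expected.

```python
import math

# Rounded values copied from the slide's table (entries are samples 1 to 10).
# The names PROFILE_1..3 are placeholders for the three table columns.
PROFILE_1 = [0.20, 0.17, 0.16, 0.20, 0.20, 0.17, 0.16, 0.20, 0.18, 0.16]
PROFILE_2 = [0.15, 0.15, 0.16, 0.15, 0.16, 0.16, 0.15, 0.15, 0.16, 0.15]
PROFILE_3 = [0.12, 0.09, 0.08, 0.11, 0.12, 0.10, 0.08, 0.12, 0.11, 0.08]


def euclidean_distance(a, b):
    """Square root of the summed squared differences between two profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


e12 = euclidean_distance(PROFILE_1, PROFILE_2)
e13 = euclidean_distance(PROFILE_1, PROFILE_3)
e23 = euclidean_distance(PROFILE_2, PROFILE_3)

print(f"e12 = {e12:.3f}, e13 = {e13:.3f}, e23 = {e23:.3f}")
```

With the rounded inputs the results are roughly 0.101, 0.250 and 0.176, compared with 0.103, 0.253 and 0.178 on the slide. Both distance measures rank the pairs in the same order for this table.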


Comparing patterns instead of distances

• Correlation can be used to quantify the similarity between patterns

• Data: the same table of 10 samples and three profiles as for the Manhattan and Euclidean distances

[Figure: scatter plots of Abundance/Expression of one profile against Abundance/Expression of another, both axes 0 to 0.25; one pair has r = ‐0.35 (low correlation), another pair has r = 0.97 (high correlation)]

• Similarity matrix (correlation r): 1 on the diagonal; pairwise values 0.97, ‐0.35 and ‐0.16

• Distance matrix (1 ‐ r): 0 on the diagonal; pairwise values 0.03, 1.35 and 1.16
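The sketch below computes a correlation and the corresponding distance 1 ‐ r for the table data. It assumes that the r on the slide is the Pearson correlation coefficient, which the slides do not state explicitly. The `PROFILE_*` names are placeholders for the three table columns, and because the table values are rounded to two decimals (which matters most for the nearly constant second column) the results need not match the slide's values.

```python
import math

# Rounded values copied from the slide's table (entries are samples 1 to 10).
# The names PROFILE_1..3 are placeholders for the three table columns.
PROFILE_1 = [0.20, 0.17, 0.16, 0.20, 0.20, 0.17, 0.16, 0.20, 0.18, 0.16]
PROFILE_2 = [0.15, 0.15, 0.16, 0.15, 0.16, 0.16, 0.15, 0.15, 0.16, 0.15]
PROFILE_3 = [0.12, 0.09, 0.08, 0.11, 0.12, 0.10, 0.08, 0.12, 0.11, 0.08]


def pearson_correlation(a, b):
    """Pearson correlation coefficient r of two equally long profiles."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    if spread_a == 0 or spread_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / math.sqrt(spread_a * spread_b)


def correlation_distance(a, b):
    """Distance derived from similarity: 1 - r, ranging from 0 to 2."""
    return 1.0 - pearson_correlation(a, b)


r13 = pearson_correlation(PROFILE_1, PROFILE_3)
r12 = pearson_correlation(PROFILE_1, PROFILE_2)

print(f"r13 = {r13:.2f}, distance = {correlation_distance(PROFILE_1, PROFILE_3):.2f}")
print(f"r12 = {r12:.2f}, distance = {correlation_distance(PROFILE_1, PROFILE_2):.2f}")
```

The first and third columns follow the same pattern at different levels, so their correlation is high even though their Manhattan and Euclidean distances are the largest of the three pairs.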

Compare patterns instead of distances

• Correlation can be used to quantify the similarity between patterns

[Figure: line plot of the abundance/expression profiles over Time/Environments/Samples (1 to 10), next to a scatter plot of Abundance/Expression of one profile against another; r = ‐0.35 (little correlation), r = 0.97 (positive correlation)]