
Upload: nelson-nicholas-melton

Post on 03-Jan-2016

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS

BiC BioCentrum-DTU Technical University of Denmark

Prediction of significant positions in biological sequences

Ilka Hoof
Ph.D. student

Immunological Bioinformatics

Center for Biological Sequence Analysis

Danmarks Tekniske Universitet

Significant positions?

HIV-1 gp120

PDB: 2NY7

Antibody-binding site?

Significant positions?

HIV-1 protease

PDB: 2CEN

Catalytic efficiency?

Significant positions?

“Which sites in HIV-1 protease contribute significantly to the fitness level of an HIV-1 mutant?”

“Where is the binding site of a specific antibody located on the antigen?”

“Which sites are important for enzymatic activity?”

Given: a multiple sequence alignment and a numerical value associated with each sequence. The values imply a ranking of the sequences.

What we’re interested in: which positions distinguish high-ranking from low-ranking sequences?

e.g.
binders vs. non-binders
high vs. low fitness
high vs. low enzymatic activity

The data we have

The output we want

...how do we get there?

SigniSite 1.0

http://www.cbs.dtu.dk/services/SigniSite/

SigniSite - method

Rank-based statistical test

real-valued data    ranks
0.002               1.0
0.084               2.0
0.128               3.0
0.273               4.0
0.593               5.5
0.593               5.5
0.892               7.0
0.923               8.0
0.999               9.0

Calculate mean rank for each residue type
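The ranking step can be sketched in a few lines of Python. This is only an illustration, not the SigniSite implementation: the function names and the example column `AAAAEAEEE` are made up here, while the nine values are the ones shown on the slide. Tied values share the average of the ranks they occupy, which is how the two 0.593 entries both get rank 5.5.

```python
from collections import defaultdict

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def mean_rank_per_residue(column, ranks):
    """Mean rank of the sequences carrying each residue type in one column."""
    grouped = defaultdict(list)
    for residue, rank in zip(column, ranks):
        grouped[residue].append(rank)
    return {residue: sum(r) / len(r) for residue, r in grouped.items()}

values = [0.002, 0.084, 0.128, 0.273, 0.593, 0.593, 0.892, 0.923, 0.999]
ranks = average_ranks(values)
column = "AAAAEAEEE"  # hypothetical alignment column, one residue per sequence
mean_ranks = mean_rank_per_residue(column, ranks)
print(ranks)
print(mean_ranks)
```

In this toy column, E sits mostly in the high-ranking sequences, so its mean rank comes out well above that of A.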

SigniSite - the method

SigniSite - the method

Calculate the mean rank for each residue type.

SigniSite - the method

What’s the null hypothesis of our statistical test?

The observed mean rank of a residue type does not significantly deviate from the expected mean rank.

What is expected?

We assume a random distribution of the amino acids in the column. Given N sequences, the expected mean rank is

$$\mu_R = \frac{N + 1}{2}$$

Z score determines significance

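Given the shape of the distribution, what’s significant?

[Figure: distribution of mean ranks, annotated with the mean, the standard deviation (sd) and the observed mean rank]

The Z score can be calculated from the mean and the standard deviation:

$$Z = \frac{R_{obs} - \mu_R}{\sigma_R}$$

where R_obs is the observed mean rank, μ_R the mean and σ_R the standard deviation of the distribution.

A Z score above +1.96 corresponds to p < 0.025 (one-sided).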
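As a small sketch of this step, the Z score and the matching one-sided p-value under a normal approximation can be computed with the standard library alone. The function names are illustrative and not part of SigniSite; the tail probability uses the complementary error function `math.erfc`.

```python
import math

def z_score(observed_mean_rank, expected_mean_rank, sd):
    """Distance of the observed mean rank from the expectation, in sd units."""
    return (observed_mean_rank - expected_mean_rank) / sd

def upper_tail_p(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_at_cutoff = upper_tail_p(1.96)
print(f"one-sided p at Z = 1.96: {p_at_cutoff:.4f}")
```

For a residue enriched among low-ranking sequences, the Z score is negative and the lower tail `upper_tail_p(-z)` is the relevant probability.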

Z score determines significance

observed mean rank for E

Are the random mean ranks normally distributed?

Same mean, but different standard deviation

[Figure: mean rank distributions for residue frequencies 0.5, 0.25, 0.1 and 0.05]
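The effect can be reproduced with a small simulation. This is a sketch under assumed settings that do not come from the slides: 100 sequences, 2000 random draws per frequency and a fixed seed. For each residue frequency a random subset of ranks is drawn and its mean recorded; the means stay close to (N + 1) / 2 while the spread grows as the residue gets rarer.

```python
import random
import statistics

def simulate_mean_ranks(n_sequences, frequency, trials, rng):
    """Draw random rank subsets and return the mean rank of each draw."""
    n_with_residue = round(n_sequences * frequency)
    all_ranks = range(1, n_sequences + 1)
    return [
        statistics.fmean(rng.sample(all_ranks, n_with_residue))
        for _ in range(trials)
    ]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
summary = {}
for frequency in (0.5, 0.25, 0.1, 0.05):
    draws = simulate_mean_ranks(100, frequency, 2000, rng)
    summary[frequency] = (statistics.fmean(draws), statistics.stdev(draws))

for frequency, (mean, sd) in summary.items():
    print(f"frequency {frequency}: mean {mean:.1f}, sd {sd:.1f}")
```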

How to estimate the standard deviation?

Our test resembles the Wilcoxon rank-sum statistic: given two samples of size n1 and n2, with n1 + n2 = N, let R be the mean rank of sample 1.

The distribution of the mean rank R can be approximated by the normal distribution with mean

$$\mu_R = \frac{N + 1}{2}$$

and standard deviation

$$\sigma_R = \sqrt{\frac{n_2 (N + 1)}{12 \, n_1}}$$
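A quick numerical check of the standard deviation, assuming the usual Wilcoxon result for untied ranks, sd = sqrt(n2 (N + 1) / (12 n1)). The sketch compares that closed form with the exact value obtained by enumerating every possible choice of n1 ranks out of N, which is only feasible for small N. Function names are illustrative.

```python
import itertools
import math
import statistics

def mean_rank_sd(n1, n_total):
    """Closed-form sd of the mean rank of n1 items drawn from ranks 1..n_total."""
    n2 = n_total - n1
    return math.sqrt(n2 * (n_total + 1) / (12 * n1))

def exact_mean_rank_sd(n1, n_total):
    """Exact sd by enumerating all rank subsets of size n1 (small n_total only)."""
    means = [
        sum(subset) / n1
        for subset in itertools.combinations(range(1, n_total + 1), n1)
    ]
    return statistics.pstdev(means)

formula_sd = mean_rank_sd(3, 8)
enumerated_sd = exact_mean_rank_sd(3, 8)
print(formula_sd, enumerated_sd)
```

The two numbers agree, which also shows why rare residues (small n1) have a wider null distribution than frequent ones.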

Coping with ties

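Formula as before, but weighted with the tie-correction factor T:

$$\sigma_R = \sqrt{\frac{n_2 (N + 1)}{12 \, n_1} \, T}$$

where

$$T = 1 - \frac{\sum_{i=1}^{m} (t_i^3 - t_i)}{N^3 - N}$$

and t is a vector which contains the counts of ties, i.e. t_i is the number of observations sharing the i-th distinct value, and m denotes the number of distinct values in the data set.

Example:
all values the same => T = 0
all values different => T = 1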
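The tie-correction factor is easy to compute from the raw values. The sketch below assumes the standard form T = 1 - sum(t_i^3 - t_i) / (N^3 - N), which reproduces both limiting cases from the example, and it assumes T multiplies the variance (so it sits under the square root of the standard deviation). Function names are illustrative.

```python
import math
from collections import Counter

def tie_correction(values):
    """Tie-correction factor T; 1 without ties, 0 if all values are equal."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    tie_counts = Counter(values).values()  # t_i for each distinct value
    return 1.0 - sum(t**3 - t for t in tie_counts) / (n**3 - n)

def tie_corrected_sd(n1, values):
    """sd of the mean rank of n1 sequences, assuming T scales the variance."""
    n_total = len(values)
    n2 = n_total - n1
    return math.sqrt(n2 * (n_total + 1) / (12 * n1) * tie_correction(values))

values = [0.002, 0.084, 0.128, 0.273, 0.593, 0.593, 0.892, 0.923, 0.999]
t_factor = tie_correction(values)
print(t_factor, tie_corrected_sd(4, values))
```

With a single tied pair among nine values the correction is tiny; it matters when the measured values are coarse and many sequences share the same number.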

Simple example

[Figure: example data set split into category 1 and category 2]

Simple example

Tie correction vs. no tie correction

[Table: standard deviation and Z score, with and without tie correction]

Multiple testing problem

We perform a significance test for each amino acid type in each column.

Problem: The more hypotheses we test, the higher the probability of obtaining at least one false positive.

Each test is performed with the same type-I error α, e.g. α = 0.05. For m independent significance tests, the total significance level α_tot is then given by

$$\alpha_{tot} = 1 - (1 - \alpha)^m$$

Examples:
1 test: α_tot = 1 - (1 - 0.05)^1 = 0.05
2 tests: α_tot = 1 - (1 - 0.05)^2 = 0.0975
100 tests: α_tot = 1 - (1 - 0.05)^100 ≈ 0.99

Correction for multiple testing necessary!
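The three examples can be recomputed with a one-line function. The sketch assumes the tests are independent, which is what the formula 1 - (1 - α)^m requires; the function name is illustrative.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false positive in m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

rates = {m: familywise_error_rate(0.05, m) for m in (1, 2, 100)}
for m, rate in rates.items():
    print(f"{m} tests: {rate:.4f}")
```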

How many statistical tests are performed?

One test per amino acid type and column:

$$m = \sum_{i} w_i$$

where w_i is the number of different amino acids in column i and the sum runs over all columns of the alignment.
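Counting the tests for a given alignment amounts to summing the number of distinct residues per column. In this sketch the three-sequence alignment is made up, and gap characters (`-`) are ignored; the slides do not say how SigniSite treats gaps, so that choice is an assumption.

```python
def count_tests(alignment, gap="-"):
    """Number of (column, residue type) pairs, i.e. the sum of w_i over columns."""
    columns = zip(*alignment)  # transpose rows (sequences) into columns
    return sum(len(set(column) - {gap}) for column in columns)

alignment = ["ACDE", "ACDF", "GCDE"]  # hypothetical toy alignment
m = count_tests(alignment)
print(m)
```

Fully conserved columns still contribute one test each here, even though their single residue type can never deviate from the expected mean rank.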

Correction for multiple testing

Adjusted p-values using Bonferroni’s single-step method:

Multiply all unadjusted p-values by the number of tests m.

Adjusted p-values are given by

$$\tilde{p}_j = \min(m \, p_j, 1)$$

for j = 1, ..., m
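A minimal sketch of the Bonferroni adjustment, using the standard definition in which every p-value is multiplied by m and capped at 1. The input p-values are made up for illustration.

```python
def bonferroni_adjust(p_values):
    """Bonferroni single-step adjustment: min(m * p, 1) for every p-value."""
    m = len(p_values)
    return [min(m * p, 1.0) for p in p_values]

adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
print(adjusted)
```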

Correction for multiple testing

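Adjusted p-values using Holm’s step-down method:

Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$ be the observed ordered unadjusted p-values.

Adjusted p-values are given by

$$\tilde{p}_{(j)} = \max_{k = 1, \dots, j} \left\{ \min\big((m - k + 1) \, p_{(k)}, 1\big) \right\}$$

for j = 1, ..., m

So, nothing more than: multiply the smallest p-value by m, the second smallest by m - 1, and so on, never letting an adjusted p-value fall below the one before it (and capping at 1).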
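Holm's step-down adjustment can be sketched as a single pass over the sorted p-values, following the standard definition of the method. The input values are made up, and the result is returned in the original input order.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    adjusted = [0.0] * m
    running_max = 0.0
    for position, j in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min((m - position) * p_values[j], 1.0)
        # Enforce monotonicity: never drop below an earlier adjusted value.
        running_max = max(running_max, candidate)
        adjusted[j] = running_max
    return adjusted

holm = holm_adjust([0.01, 0.04, 0.03])
print(holm)
```

Holm's adjusted p-values are never larger than the Bonferroni ones, so the method rejects at least as many hypotheses at the same α.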

Application of SigniSite

Ab-binding affinity to HIV-1 gp120

Alignment length: 569 residues

SigniSite web service

SigniSite results

10 significant sites identified.

Holm step-down correction, α = 0.05

Heatmap

SigniSite results

Sequence logos

display Z score for all amino acid types

display Z score only for significant amino acid types

“ordinary” frequency logo

SigniSite results

SDPpred

http://math.genebee.msu.ru/~psn/index.htm

Kalinina et al. (2004), Protein Sci 13(2): 443-56

SDPpred

• Categories instead of continuous values

• Mutual information

• Amino acids with similar physico-chemical properties are weakly penalized

• Statistical test: observed mutual inf. = expected mutual inf.?

$$I_p = \sum_{i=1}^{N} \sum_{\alpha=1}^{20} f_p(\alpha, i) \log \frac{f_p(\alpha, i)}{f_p(\alpha) \, f(i)}$$
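The mutual information of one alignment column can be sketched as follows. This is not the SDPpred code, and it assumes one particular reading of the symbols: f_p(α, i) is the fraction of sequences that belong to category i and carry residue α at position p, f_p(α) is the overall residue frequency in the column, and f(i) is the fraction of sequences in category i. The weak penalty for physico-chemically similar amino acids is left out.

```python
import math
from collections import Counter

def column_mutual_information(column, groups):
    """Mutual information between residue type and category for one column."""
    n = len(column)
    joint = Counter(zip(column, groups))
    residue_counts = Counter(column)
    group_counts = Counter(groups)
    total = 0.0
    # Only observed (residue, category) pairs are visited; pairs with zero
    # joint frequency contribute nothing to the sum.
    for (residue, group), count in joint.items():
        f_joint = count / n
        f_residue = residue_counts[residue] / n
        f_group = group_counts[group] / n
        total += f_joint * math.log(f_joint / (f_residue * f_group))
    return total

groups = [1, 1, 2, 2]  # hypothetical category label per sequence
specific = column_mutual_information("AAEE", groups)
unspecific = column_mutual_information("AEAE", groups)
print(specific, unspecific)
```

A column whose residues split exactly along the categories reaches the maximum (log 2 for two equally sized categories), while a column that is independent of the categories scores zero.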

SDPpred - Results

SDPpred - Results

SDPpred - Results

Conclusion

• You can use SigniSite and SDPpred to find sites of interest in your biological data

• Logos are a nice and clear way of displaying sequence information

• Whenever you perform statistical tests, remember the multiple testing problem!