Lecture 3: Data Types in Computational Biology/Systems Biology; Useful Websites; Handling Multivariate Data: Concept and Types of Metrics, Distances, etc.; K-means Clustering

Upload: rafael

Post on 24-Feb-2016

TRANSCRIPT

Page 1: Lecture 3 Data  Types in computational biology/Systems biology Useful websites

Lecture 3
Data Types in computational biology/Systems biology
Useful websites
Handling Multivariate data: Concept and types of metrics, distances etc.
K-means clustering

What is systems biology?

Each lab/group has its own definition of systems biology.

This is because systems biology requires understanding and integrating different levels of omics information, drawing on knowledge from different branches of science, and individual labs/groups work on different areas.

Theoretical target: Understanding life as a system.

Practical targets: Serving humanity by developing new-generation medical tests, drugs, foods, fuels, materials, sensors, logic gates, etc.

Understanding life, or even a single cell, as a system is complicated and requires comprehensive analysis of different data types and/or sub-systems. Mostly, individual groups or people work on different sub-systems.

Some of the useful data types that are currently (at least partially) available:

• Genome sequences
• Binding motifs in DNA sequences or cis-regulatory regions
• Codon usage
• Gene expression levels for global gene sets/microRNAs
• Protein sequences
• Protein structures
• Protein domains
• Protein-protein interactions
• Binding relations between proteins and DNA
• Regulatory relations between genes
• Metabolic pathways
• Metabolite profiles
• Species-metabolite relations
• Plant usage in traditional medicines

Usually, experiments are conducted in wet labs to generate such data. In dry labs like ours, we analyze these data to extract targeted information using different algorithms, statistics, etc.

Data Types in computational biology/Systems biology

>gi|15223276|ref|NP_171609.1| ANAC001 (Arabidopsis NAC domain containing protein 1); transcription factor [Arabidopsis thaliana]
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRENNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDLLPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQQFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLVDERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPLHNYKAQEQPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG

Sequence data (Genome/Protein sequence)

Usually, alignment tools such as BLAST (a fast seed-and-extend heuristic) or exact dynamic programming algorithms (e.g. Needleman-Wunsch, Smith-Waterman) are used to determine how well two or more sequences match each other.

Sequence matching/alignments
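The dynamic programming idea behind pairwise sequence alignment can be sketched as follows. This is a minimal Needleman-Wunsch global alignment scorer, not BLAST itself; the function name and the scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap     # gap inserted in b
            left = score[i][j - 1] + gap   # gap inserted in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


identical = needleman_wunsch_score("MEDQ", "MEDQ")   # four matches
one_gap = needleman_wunsch_score("MEDQ", "MEQ")      # three matches and one gap
```

Real tools use substitution matrices (e.g. BLOSUM) and affine gap penalties instead of the flat scores assumed here.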

CODONS

CODON USAGE
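Codon usage can be summarized as the relative frequency of each codon in a coding sequence. A minimal sketch, assuming an in-frame DNA string; the function name and the example sequence are hypothetical and not taken from the slides.

```python
from collections import Counter


def codon_usage(dna):
    """Return the relative frequency of each codon in an in-frame DNA string."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3          # ignore a trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical toy sequence: ATG GCT GCT GCC TAA
usage = codon_usage("ATGGCTGCTGCCTAA")
```

Here GCT and GCC both encode alanine, so comparing their frequencies shows which synonymous codon this toy sequence prefers.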

Multivariate data (Gene expression data/Metabolite profiles)

Many types of clustering algorithms are applicable to multivariate data, e.g. hierarchical clustering, K-means, SOM (self-organizing maps), etc.

Multivariate data can also be modeled using multivariate probability distribution functions.

Binary relational data (protein-protein interactions, regulatory relations between genes, metabolic pathways) form networks.

Clustering is usually used to extract information from networks.

Multivariate data and sequence data can also be converted to networks, after which network clustering can be applied.

AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH
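A list of binary relations like the one above is an edge list, and it can be turned into a network directly. The following is a minimal sketch, assuming the interactions are undirected; the function and variable names are illustrative.

```python
from collections import defaultdict

# The six protein pairs listed on the slide, read as undirected edges.
edges = [
    ("AtpB", "AtpA"),
    ("AtpG", "AtpE"),
    ("AtpA", "AtpH"),
    ("AtpB", "AtpH"),
    ("AtpG", "AtpH"),
    ("AtpE", "AtpH"),
]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


network = build_adjacency(edges)
# Degree = number of interaction partners of each protein.
degree = {node: len(neighbours) for node, neighbours in network.items()}
```

In this small network AtpH interacts with all four other proteins, which is the kind of structure network clustering tries to detect.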

Useful Websites

Some websites where we can find different types of data and links to other databases:

www.geneontology.org
www.genome.ad.jp/kegg
www.ncbi.nlm.nih.gov
www.ebi.ac.uk/databases
http://www.ebi.ac.uk/uniprot/
http://www.yeastgenome.org/
http://mips.helmholtz-muenchen.de/proj/ppi/
http://www.ebi.ac.uk/trembl
http://dip.doe-mbi.ucla.edu/dip/Main.cgi
www.ensembl.org

[Slides 12-20: content not preserved in this transcript; slides 17 and 18 are titled NETWORK TOOLS.]

Source: Knowledge-Based Bioinformatics: From Analysis to Interpretation, Gil Alterovitz and Marco Ramoni (Editors)

Handling Multivariate data: Concept and types of metrics

Multivariate data format

Multivariate data example

Distances, metrics, dissimilarities and similarities are related concepts

A metric is a function that satisfies the following properties:

A function that satisfies only conditions (i)-(iii) is referred to as a distance.
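The list of properties itself is not preserved in this transcript. A standard formulation that is consistent with the remark about conditions (i)-(iii) is sketched below; the numbering and the symbols d, a, b, c are assumptions, not a quotation from the slide.

```latex
\begin{align*}
\text{(i)}\quad   & d(a, b) \ge 0                  && \text{(non-negativity)} \\
\text{(ii)}\quad  & d(a, b) = d(b, a)              && \text{(symmetry)} \\
\text{(iii)}\quad & d(a, a) = 0                    && \\
\text{(iv)}\quad  & d(a, b) = 0 \iff a = b         && \text{(definiteness)} \\
\text{(v)}\quad   & d(a, b) + d(b, c) \ge d(a, c)  && \text{(triangle inequality)}
\end{align*}
```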

Source: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit (Editors)

Example: Let X = (4, 6, 8) and Y = (5, 3, 9).

These measures consider the expression measurements as points in some metric space.
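Which measures the slide computed for X and Y is not preserved in this transcript. As an illustration, the sketch below evaluates three common choices (Euclidean, Manhattan and Chebyshev distance) on the example vectors; the choice of measures and the function names are assumptions.

```python
import math

X = (4, 6, 8)
Y = (5, 3, 9)


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


d_euclidean = euclidean(X, Y)   # sqrt(1 + 9 + 1) = sqrt(11), about 3.317
d_manhattan = manhattan(X, Y)   # 1 + 3 + 1 = 5
d_chebyshev = chebyshev(X, Y)   # max(1, 3, 1) = 3
```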

Widely used metrics for finding similarity

Correlation
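As a sketch of how correlation can be computed for two expression vectors, the code below evaluates the Pearson correlation coefficient on X = (4, 6, 8) and Y = (5, 3, 9). Turning it into a dissimilarity as 1 - r is one common convention and is an assumption here, as are the function and variable names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


X = (4, 6, 8)
Y = (5, 3, 9)

r = pearson(X, Y)              # sqrt(3/7), about 0.655
correlation_distance = 1 - r   # small when the two profiles rise and fall together
```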

Statistical distance between points

The Euclidean distance between points Q and P is larger than that between Q and the origin O, yet P and Q appear to belong to the same cluster while Q and O do not.

The statistical distance (Mahalanobis distance) between two vectors can be calculated if the variance-covariance matrix is known or can be estimated.
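The effect described above can be reproduced with a small sketch of the Mahalanobis distance in two dimensions. The coordinates of P, Q and O and the covariance matrix below are hypothetical values chosen so that the data are elongated along the diagonal; they are not taken from the slide.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y for a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(quad)


# Hypothetical setup: strong positive correlation between the two variables.
cov = ((1.0, 0.9), (0.9, 1.0))
O, Q, P = (0.0, 0.0), (1.0, -1.0), (4.0, 2.0)

euclid_qp = math.dist(Q, P)              # about 4.24, larger than Q to O
euclid_qo = math.dist(Q, O)              # about 1.41
maha_qp = mahalanobis_2d(Q, P, cov)      # about 3.08, along the direction of high variance
maha_qo = mahalanobis_2d(Q, O, cov)      # about 4.47, across the direction of high variance
```

With these assumed values the ordering flips: Q is closer to O in Euclidean terms but closer to P in statistical distance.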

Distances between distributions

Unlike the previous approach (i.e. considering expression measurements as points in some metric space), the data for each feature can be considered an independent sample from a population.

The data therefore reflect the underlying population, and we need to measure similarities between two densities/distributions.

Kullback-Leibler Information

Mutual information

KLI measures how much the shape of one distribution resembles the other.

MI is large when the joint distribution is quite different from the product of the marginals.
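Both quantities can be sketched for discrete distributions as follows. The formulas on the slide are not preserved in this transcript, so the discrete forms, the natural logarithm and the function names used here are assumptions; the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler information KL(p || q) for discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """Mutual information of X and Y, where joint[i][j] = P(X=i, Y=j)."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]         # X determines Y completely

mi_independent = mutual_information(independent)   # 0
mi_dependent = mutual_information(dependent)       # log(2)
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])    # 0 for identical distributions
```

Mutual information is the Kullback-Leibler information between the joint distribution and the product of the marginals, which is why it vanishes in the independent case.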

K-means clustering

Source: “Clustering Challenges in Biological Networks” edited by S. Butenko et al.

Source: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/

1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1).
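The remaining iterations are not preserved in this transcript, so the following is only a sketch of how K-means proceeds from these initial centroids. The coordinates of medicines C and D are an assumption (only A and B are given above), and the function name and the stopping rule (stop when assignments no longer change) are illustrative.

```python
import math

# A and B come from the slide; C and D are assumed values for illustration.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}


def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = {
            name: min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break                      # no point changed cluster, so we stop
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == k]
            if members:
                centroids[k] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return assignment, centroids


assignment, final_centroids = k_means(points, [(1, 1), (2, 1)])
```

Under these assumed coordinates the procedure ends with A and B in one cluster and C and D in the other.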

Hierarchical clustering

AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH

Data are not always available as binary relations, as they are for protein-protein interactions, where we can directly apply network clustering algorithms.

In many cases, for example in microarray gene expression analysis, the data are multivariate.

Source: An Introduction to Bioinformatics Algorithms by Jones & Pevzner

We can convert multivariate data into networks and then apply network clustering algorithms, which we will discuss in a later class.

If the dimension of the multivariate data is 3 or less, we can cluster them by plotting them directly.

However, when the dimension is more than 3, we can apply hierarchical clustering to the multivariate data.

In hierarchical clustering the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place.

Some data reveal good cluster structure when plotted but some data do not.

Data plotted in 2 dimensions

Hierarchical clustering is a technique that organizes elements into a tree.

A tree is a connected graph that has no cycles.

A tree with n nodes has exactly n-1 edges, which is the maximum number of edges an acyclic graph on n nodes can have.

[Figure: a graph and a tree]

Hierarchical clustering is subdivided into two types:

1. agglomerative methods, which proceed by a series of fusions of the n objects into groups, and

2. divisive methods, which separate the n objects successively into finer groupings.

Agglomerative techniques are more commonly used.

The data can be viewed at any level between a single cluster containing all objects and n clusters each containing a single object.

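Distance measurements

The Euclidean distance between points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$, in Euclidean n-space, is defined as:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Euclidean distance between g1 and g2:

$$\sqrt{(10 - 10)^2 + (8 - 0)^2 + (10 - 9)^2} = \sqrt{0 + 64 + 1} \approx 8.0622$$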

Instead of Euclidean distance, correlation can also be used as a distance measurement.

For biological analysis involving genes and proteins, nucleotide and/or amino acid sequence similarity can also be used as the distance between objects.

• An agglomerative hierarchical clustering procedure produces a series of partitions of the data, Pn, Pn-1, ..., P1. The first, Pn, consists of n single-object 'clusters'; the last, P1, consists of a single group containing all n cases.
• At each stage the method joins together the two clusters that are closest together (most similar). (At the first stage, of course, this amounts to joining together the two objects that are closest together, since at the initial stage each cluster has one object.)
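The procedure can be sketched as a short loop that records every partition from Pn down to P1. This is a naive illustration, assuming Euclidean distance between objects and taking the distance between two clusters as the smallest pairwise distance by default; the function name and the example points are hypothetical.

```python
import math


def agglomerative(points, linkage=min):
    """Return the partitions Pn, ..., P1 as lists of clusters of point indices.

    `linkage` reduces all pairwise distances between two clusters to one
    number: `min` and `max` are two possible choices.
    """
    clusters = [[i] for i in range(len(points))]      # Pn: every object alone
    partitions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters that are closest together.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        partitions.append([list(c) for c in clusters])
    return partitions


# Hypothetical 2-D objects: two tight pairs and one outlier.
example_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
partitions = agglomerative(example_points)
```

Each entry of `partitions` has one cluster fewer than the previous one, and the order of merges is what a dendrogram displays.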

Differences between methods arise because of the different ways of defining distance (or similarity) between clusters.

How can we measure distances between clusters?

Single linkage clustering

The distance between two clusters A and B, D(A,B), is computed as D(A,B) = min { d(i,j) : object i is in cluster A and object j is in cluster B }.

Complete linkage clustering

The distance between two clusters A and B, D(A,B), is computed as D(A,B) = max { d(i,j) : object i is in cluster A and object j is in cluster B }.

Average linkage clustering

The distance between two clusters A and B, D(A,B), is computed as D(A,B) = TAB / (NA * NB), where TAB is the sum of all pairwise distances between objects of cluster A and objects of cluster B, and NA and NB are the sizes of clusters A and B respectively.

There are NA * NB such pairs (edges) in total.

Average group linkage clustering

The distance between two clusters A and B, D(A,B), is computed as D(A,B) = average { d(i,j) : observations i and j are both in cluster t, the cluster formed by merging clusters A and B }.

If the merged cluster t contains n objects, there are n(n-1)/2 such pairs (edges) in total.
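The four cluster-distance definitions (single, complete, average and average group linkage) can be compared side by side. This is a minimal sketch assuming Euclidean distance d(i,j) between objects; the function names and the two example clusters are hypothetical.

```python
import math
from itertools import combinations


def pair_distances(A, B):
    """All NA * NB distances between an object of A and an object of B."""
    return [math.dist(a, b) for a in A for b in B]


def single_linkage(A, B):
    return min(pair_distances(A, B))


def complete_linkage(A, B):
    return max(pair_distances(A, B))


def average_linkage(A, B):
    # TAB / (NA * NB)
    return sum(pair_distances(A, B)) / (len(A) * len(B))


def average_group_linkage(A, B):
    # Average over all n(n-1)/2 pairs inside the merged cluster t.
    merged = list(A) + list(B)
    within = [math.dist(p, q) for p, q in combinations(merged, 2)]
    return sum(within) / len(within)


# Hypothetical clusters: two vertical pairs, three units apart.
A = [(0, 0), (0, 1)]
B = [(3, 0), (3, 1)]

d_single = single_linkage(A, B)            # 3.0
d_complete = complete_linkage(A, B)        # sqrt(10), about 3.162
d_average = average_linkage(A, B)          # (3 + 3 + 2*sqrt(10)) / 4
d_group = average_group_linkage(A, B)      # also counts the two within-cluster pairs
```

Average group linkage gives the smallest value here because it includes the short distances inside A and inside B.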

Alizadeh et al. Nature 403: 503-511 (2000).

Classifying bacteria based on 16s rRNA sequences.