protein structure similarity calculation and visualization cmps 561-fall 2014 sumi singh sxs5729


Upload: hector-stephens

Post on 17-Jan-2016


PROTEIN STRUCTURE SIMILARITY CALCULATION AND VISUALIZATION

CMPS 561-FALL 2014

SUMI SINGH

SXS5729


Protein Structure

Primary Structure

RPDFCLEPPYAGACRARIIRYFYNAKAGLCQ

Sequence of amino acids. Not enough for functional prediction.

Tertiary Structure (3D Structure)

Formed by the 3D folding pattern of the protein. It makes the protein functional.


Comparing protein 3D structures to get functional insight

[Figure: structure of 1QLQ and structure of 4HHB]

Compare structures of two DIFFERENT proteins


Significance of comparing protein 3D structures

Structural similarity between two proteins suggests functional similarity

Predict binding site

Predict drug interaction


Structural elements represented by a quintuple of features

$\{\mathit{Label}_1, \mathit{Label}_2, \mathit{Label}_3, \theta, D\}$

Labels represent primary structure (amino acid sequence)

Theta ($\theta$) represents orientation

Length ($D$) represents size/scale

Theta and Length capture the tertiary/3D structure


Structural alphabet (key) generation

1. Generate all possible triples of amino acids

2. Assign labels to amino acids in triple

3. Perform rule based label arrangement

4. Calculate Angle and Length

5. Quintuple: $\{\mathit{Label}_1, \mathit{Label}_2, \mathit{Label}_3, \theta, D\}$

6. Mapping from structure space into unique key (integer space)

[Figure: triangle with vertices Label1, Label2, Label3, side lengths d12, d13, d23, angle θ1, and Representative Length (D)]
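
A minimal Python sketch of the "generate all possible triples" step, together with the three side lengths d12, d13, d23 shown in the figure. The residue list and coordinates are made up for illustration, and the names `all_triples` and `triangle_sides` are hypothetical. The label-arrangement rule, the definitions of θ and D, and the mapping to an integer key are not given on this slide, so they are left out rather than guessed.

```python
from itertools import combinations
from math import comb, dist


def all_triples(residues):
    """Return every unordered triple of (label, xyz) residues."""
    return combinations(residues, 3)


def triangle_sides(p1, p2, p3):
    """Return the side lengths (d12, d13, d23) of the triangle p1-p2-p3."""
    return dist(p1, p2), dist(p1, p3), dist(p2, p3)


# Hypothetical residues: (amino acid label, 3D coordinate).
residues = [
    ("ARG", (0.0, 0.0, 0.0)),
    ("PRO", (3.0, 0.0, 0.0)),
    ("ASP", (0.0, 4.0, 0.0)),
    ("PHE", (0.0, 0.0, 5.0)),
]

for (l1, p1), (l2, p2), (l3, p3) in all_triples(residues):
    d12, d13, d23 = triangle_sides(p1, p2, p3)
    print(l1, l2, l3, round(d12, 2), round(d13, 2), round(d23, 2))

# Number of triples for a chain of n residues is n choose 3.
print(comb(len(residues), 3))
```

The number of triples grows as n(n-1)(n-2)/6, so a chain of 300 residues already produces about 4.5 million triples.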


Output of the key generation system

For every protein, millions of keys are generated, each representing some special feature.

The protein structure is represented and stored as unique KEY-COUNT pairs.
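
As an illustration of the KEY-COUNT representation, the sketch below collapses a stream of generated keys into unique key-count pairs. The key values are invented and `key_count_pairs` is a hypothetical helper name; the real keys come from the key generation system.

```python
from collections import Counter


def key_count_pairs(keys):
    """Collapse a sequence of integer keys into {key: count}."""
    return Counter(keys)


# Hypothetical keys produced for one protein (repeats are expected).
keys = [1021, 77, 1021, 5, 77, 1021]
profile = key_count_pairs(keys)

for key, count in sorted(profile.items()):
    print(key, count)
```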


Learning goals


Becoming familiar with a complex research problem and the process of solving it, including reading and understanding published research papers and using them in problem solving.

Implementing algorithm(s) in parallel and demonstrating the speedup from serial to parallel.

Visualizing the output.


Task Outline


Calculate pairwise similarity between two proteins, implemented in PARALLEL (moduleA)

Similarity Computation

Jaccard Coefficient, which takes sets (unique keys, count = {0,1}) as its arguments

Jaccard-Tanimoto Coefficient, which takes multi-sets (count > 1) as its arguments

[Figure: structure of 1QLQ with the TSR key-count set representing it, and structure of 4HHB with the TSR key-count set representing it]
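
A hedged Python sketch of the two similarity computations on key-count dictionaries. The set version is the standard Jaccard coefficient (shared keys over all keys). For the multi-set version, this sketch assumes one common generalization, the sum of minimum counts divided by the sum of maximum counts; the exact Jaccard-Tanimoto formula expected for the project should be taken from the referenced paper. The function names and the sample counts are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient on the key sets; counts are ignored."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    if not union:
        return 1.0  # convention chosen here for two empty inputs
    return len(keys_a & keys_b) / len(union)


def generalized_jaccard(a, b):
    """Assumed multi-set version: sum of min counts / sum of max counts."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 1.0


# Hypothetical key-count pairs for two proteins.
p = {1: 2, 2: 1, 3: 4}
q = {2: 3, 3: 1, 4: 1}

print(jaccard(p, q))              # 2 shared keys out of 4 -> 0.5
print(generalized_jaccard(p, q))  # (1 + 1) / (2 + 3 + 4 + 1) -> 0.2
```

When every count is 0 or 1, the multi-set version in this sketch reduces to the plain Jaccard coefficient.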


Input to moduleA

Some keys may be present in one protein and absent in another, as they represent unique features.

All input files will be given as key-count pairs.

Keys are integers, each representing a unique structural feature.

Every key listed for a given protein has a corresponding count >= 1.
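
A small Python sketch of reading one protein's key-count pairs. The slides only say the files are text files, so the layout assumed here (one whitespace-separated `key count` pair per line) is a guess, and `read_key_counts` is a hypothetical name.

```python
import io


def read_key_counts(lines):
    """Parse 'key count' lines into {key: count}, enforcing count >= 1."""
    profile = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key_text, count_text = line.split()
        key, count = int(key_text), int(count_text)
        if count < 1:
            raise ValueError(f"line {line_no}: count must be >= 1, got {count}")
        profile[key] = profile.get(key, 0) + count
    return profile


# In-memory stand-in for a key-count text file (values are invented).
sample = io.StringIO("5 1\n77 2\n1021 3\n")
print(read_key_counts(sample))
```

With a real file, the same function can be called on an open file object, for example inside `with open(path) as handle:`.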


Output from moduleA

Display or write the pairwise similarity between every pair of protein files as a lower triangular matrix, for comparison purposes.

You will be given a set of proteins and must calculate the all-by-all pairwise similarity between them.
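
One possible shape for the all-by-all step, sketched in Python. Each pair below the diagonal is an independent task, so the pairs can be handed to any map-like function; the default is the serial built-in `map`. The similarity used is the assumed min/max multi-set Jaccard, the diagonal is set to 1.0 on the assumption that every protein has at least one key, and all names and key counts are hypothetical.

```python
def generalized_jaccard(a, b):
    """Assumed multi-set Jaccard: sum of min counts / sum of max counts."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 1.0


def _pair_similarity(pair):
    """Top-level helper so it can be sent to worker processes."""
    a, b = pair
    return generalized_jaccard(a, b)


def lower_triangular_similarity(profiles, map_fn=map):
    """Return (names, rows) where rows[i][j] is the similarity for j <= i."""
    names = sorted(profiles)
    index_pairs = [(i, j) for i in range(len(names)) for j in range(i)]
    tasks = [(profiles[names[i]], profiles[names[j]]) for i, j in index_pairs]
    scores = list(map_fn(_pair_similarity, tasks))
    rows = [[None] * (i + 1) for i in range(len(names))]
    for (i, j), score in zip(index_pairs, scores):
        rows[i][j] = score
    for i in range(len(names)):
        rows[i][i] = 1.0  # a non-empty protein compared with itself
    return names, rows


# Invented key-count pairs; the protein IDs are only used as labels.
profiles = {
    "1QLQ": {1: 2, 2: 1},
    "4HHB": {2: 1, 3: 1},
    "1A06": {1: 2, 2: 1},
}
names, rows = lower_triangular_similarity(profiles)
for name, row in zip(names, rows):
    print(name, " ".join(f"{value:.2f}" for value in row))
```

To run the pairs in parallel on one machine, the `map` method of a `concurrent.futures.ProcessPoolExecutor` can be passed as `map_fn`; a Hadoop MapReduce version would distribute the same list of pairs across mappers.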


Input to and output of moduleB (visualization module)

The all-by-all pairwise similarity calculated in moduleA will be used as input to moduleB.

The output should be a connectivity graph (as shown in the next slide) between all proteins.

Each edge must display the similarity value.

Preferably, each edge length is weighted by the similarity value between the two proteins it connects.
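
A hedged Python sketch of turning the lower triangular matrix into an edge list that a graph tool can load. It writes a CSV with `Source`, `Target`, `Type`, `Weight`, and `Label` columns, which is the layout assumed here for Gephi's spreadsheet import; check the Gephi documentation before relying on it. The function name `write_edge_list`, the node names, and the similarity values are invented for illustration.

```python
import csv
import io


def write_edge_list(names, rows, out):
    """Write one undirected, weighted edge per below-diagonal matrix entry."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Type", "Weight", "Label"])
    for i, row in enumerate(rows):
        for j in range(i):  # j < i skips the diagonal
            similarity = row[j]
            writer.writerow(
                [names[i], names[j], "Undirected", similarity, f"{similarity:.0%}"]
            )


# Invented similarity values for three placeholder proteins.
names = ["A", "B", "C"]
rows = [
    [1.0],
    [0.83, 1.0],
    [0.75, 0.74, 1.0],
]

buffer = io.StringIO()
write_edge_list(names, rows, buffer)
print(buffer.getvalue())
```

The `Label` column carries the similarity as a percentage so it can be shown on each edge, while `Weight` keeps the raw value for layouts that take edge weights into account.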


Construct structural similarity graph (moduleB)

Method for finding the global structural connectivity between proteins that contain a specific domain of interest.

[Figure: connectivity graph over the proteins 1A06, 1AD5, 1FMK, 1ERK, 1FGK, and 1CKI; each edge is labeled with a pairwise similarity value (83%, 75%, 74%, 85%, 85%, 74%, 75%, 84%)]


Final system

Should integrate moduleA and moduleB.

Given a set of proteins, it should find the all-by-all similarity between them, display the lower triangular similarity matrix, and construct the similarity graph.


What do you get from me?


1. Training protein structure (key-count) files with their precalculated similarity values, both Jaccard and Jaccard-Tanimoto (around 50 proteins). You can use these to evaluate your system.

2. Test set (50 proteins), only key-count pairs and no similarity values.

3. All the files will be text files.

4. The time I took to calculate the all-by-all similarity on the test and training sets using an optimized serial algorithm, for comparison with your parallel implementation.
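
For the serial versus parallel comparison, speedup is conventionally the serial time divided by the parallel time. A minimal Python sketch, with hypothetical helper names `timed` and `speedup` and made-up timings:

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(serial_seconds, parallel_seconds):
    """Speedup = T_serial / T_parallel."""
    return serial_seconds / parallel_seconds


# Made-up timings: 10 s serial versus 2.5 s parallel.
print(speedup(10.0, 2.5))  # 4.0

# Timing an arbitrary call; sum() stands in for the real computation.
result, elapsed = timed(sum, range(1000))
print(result, elapsed >= 0.0)
```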


You can use Hadoop MapReduce for moduleA.

Visualization can be done in Gephi: http://gephi.github.io/

Information on Jaccard and Jaccard-Tanimoto can be found in the following paper:

http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Lower triangular matrix:

http://en.wikipedia.org/wiki/Triangular_matrix