
Computational Biology, Part 24
Biological Imaging IV

Robert F. Murphy

Copyright 2001. All rights reserved.

Proteomics

The set of proteins expressed in a given cell type or tissue is called its proteome

Not all transcripts are actually made into protein, and the steady-state level of protein expression is controlled by many factors other than transcript amount

Protein differences between cell types responsible for different roles of those cells


Things to learn about proteins
  sequence
  location - gives insight into function
  structure
  activity
  partners

Almost nothing is known about most proteins!

One Approach to Proteomics - CD-tagging

Infect cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random protein

Examine many cells, each of which expected to express one tagged protein, to determine the subcellular location of that protein

Use fluorescence microscopy

Principles of CD-Tagging (CD = Central Dogma)

[Figure: Genomic DNA (Exon 1, Intron 1, Exon 2) + CD-cassette -> Tagged DNA (Exon 1, Tag, Exon 2) -> Tagged mRNA -> Tagged Protein carrying the Tag (Epitope)]

CD-Tagging: Proof of concept

Use a CD-cassette containing the hemagglutinin (HA) epitope
Insert the cassette into introns of the nucleolin gene
Obtain clonal lines expressing the tagged protein
Image the distribution of nucleolin using immunofluorescence microscopy

Results: CD-Tagging

[Figure: Tagged Nucleolin in HeLa Cells]

CD-Tagging: Extensions

Improved epitope tag
  The HA epitope works only in one reading frame
  Designed an epitope that is the same in all three reading frames - the universal epitope
Endogenously fluorescent tags
  Can use fluorescent proteins (e.g., GFP, YFP) as the inserted tag!
  Don’t need fixation and antibodies

CD-tagging project

Large project funded by National Cancer Institute to identify locations for all expressed genes
  Jonathan Jarvik
  Peter Berget
  Robert Murphy

My group responsible for automated analysis of subcellular location patterns

The Problem

Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns

Current determinations do not lend themselves to incorporation into databases (at best, databases may describe a protein in comment fields as being a “cytoskeletal protein” or an “endosomal protein” even though these are known to be imprecise)

Cartoonist's view of Subcellular Locations

Cells Alive! “rollover” cell with information on each organelle

http://www.cellsalive.net/cells/animcell.htm

The Starting Point

A systematic, quantitative approach to protein localization (whether from a pattern analysis or a bioinformatics perspective) has not been presented previously

The Goal

[Figure: automated output for an image, reading "This is a Golgi protein!"]

More problems

Direct (point-by-point) comparison of individual images is not possible, since
  different cells have different shapes, sizes, orientations
  organelles within cells are not found in fixed locations

The Approach

1. Create sets of images showing the localization of many different proteins (each set defines one class of pattern)

2. Reduce each image to a set of numerical values (“features”) that are insensitive to position and rotation of the cell

3. Use statistical classification methods to “learn” how to distinguish each class using the features

Input

Created image database for HeLa cells
  Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules
  Includes classes that are similar to each other

Example Images

Patterns that might be easily confused
[Figure panels: Endoplasmic Reticulum (ER) and Mitochondria; Lysosomes (LAMP2) and Endosomes (TfR); F-actin and Tubulin]

Classes expected to be indistinguishable
[Figure panels: Golgi (Giantin) and Golgi (gpp130)]

Features

Zernike moment features (based on the Zernike polynomials) - give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern

Haralick texture features - give information on correlations in intensity between adjacent pixels

Examples of Zernike Polynomials

[Figure: example Zernike polynomials]

Zernike Moments Reconstruction

[Figure panels: Original; reconstructions at Order 12, Order 20, and Order 45]

Features

Developed additional features (SLF, for Subcellular Location Features)

Motivated by descriptions of patterns used by biologists (e.g., punctate, perinuclear)

Combined with Zernike and Haralick features to give 84 features used to describe each image

Example Features from SLF1

Number of fluorescent objects per cell
Variance of the object sizes
Ratio of the largest object to the smallest
Average distance of objects to the ‘center of fluorescence’
Fraction of convex hull occupied by fluorescence
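
The first four features listed above can be sketched in a few lines. This is a minimal illustration, not the SLF1 implementation: the function names `label_objects` and `object_features`, the fixed threshold, the 4-connected definition of an "object", and the use of the population variance are all assumptions made for the example.

```python
from math import hypot
from statistics import mean, pvariance


def label_objects(mask):
    """Group True pixels into 4-connected objects, each a list of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                stack, pixels = [(r, c)], []
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                objects.append(pixels)
    return objects


def object_features(image, threshold):
    """Compute a few object-based features for one image (list of pixel rows)."""
    mask = [[value > threshold for value in row] for row in image]
    objects = label_objects(mask)
    if not objects:
        raise ValueError("no pixels above threshold")
    sizes = [len(obj) for obj in objects]
    # Center of fluorescence: intensity-weighted centroid of above-threshold pixels.
    total = sum(image[y][x] for obj in objects for y, x in obj)
    cof_y = sum(image[y][x] * y for obj in objects for y, x in obj) / total
    cof_x = sum(image[y][x] * x for obj in objects for y, x in obj) / total
    # Distance from each object's unweighted centroid to the center of fluorescence.
    distances = [
        hypot(mean(y for y, _ in obj) - cof_y, mean(x for _, x in obj) - cof_x)
        for obj in objects
    ]
    return {
        "num_objects": len(objects),
        "size_variance": pvariance(sizes),  # population variance (assumed)
        "largest_to_smallest": max(sizes) / min(sizes),
        "mean_distance_to_cof": mean(distances),
    }
```

Because the features are counts, size statistics, and distances to a centroid, they do not change when the cell is translated or rotated by multiples of 90 degrees on the pixel grid, which is the property the approach asks of its features.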

The Approach

1. Acquisition of Images
2. Image Processing
3. Feature Extraction
4. Classifier Design and Training
5. Classification

          feature1  feature2  ...  featureN
Image1    0.3489    0.1294    ...  1.9012
Image2    0.4985    0.4823    ...  1.8390
...       ...
ImageM    1.8245    0.8290    ...  0.9018

[Classifier output: This is a Golgi Protein]

Backpropagation Neural Network

[Figure: Input 1, Input 2, ..., Input n feed internal ‘neurons’, which feed Output 1, Output 2, ..., Output m]

Classification accuracy for single images

Rows: true class. Columns: predicted class, in the order DNA, ER, Giantin, GPP130, LAMP, Mitoch., Nucleoli, Actin, TfRecept, Tubulin. Values are percentages; blank cells are not shown, so each row lists only its nonblank entries, in column order.

DNA    99 1
ER     86 3 5 5
Giant  77 19 1 2 1
GPP    18 78 2 2 1
LAM    1 3 2 73 1 2 17 1
Mito.  9 2 4 77 2 6
Nucl.  2 1 2 1 94
Actin  3 91 6
TfR    5 3 1 25 3 5 55 5
Tub.   5 1 7 1 4 5 77

Average Correct Classification Rate: 81%
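
As a check on how the average rate relates to the table, the sketch below averages the per-class correct percentages. It assumes that the largest entry in each row of the single-image table is the diagonal (correctly classified) one and that the average is unweighted; the name `average_correct_rate` is made up for the example.

```python
# Diagonal (correctly classified) percentages read from the single-image table,
# assuming the largest entry in each row is the diagonal one.
correct = {
    "DNA": 99, "ER": 86, "Giantin": 77, "GPP130": 78, "LAMP": 73,
    "Mitoch.": 77, "Nucleoli": 94, "Actin": 91, "TfRecept": 55, "Tubulin": 77,
}


def average_correct_rate(per_class_correct):
    """Unweighted mean of the per-class correct-classification percentages."""
    return sum(per_class_correct.values()) / len(per_class_correct)


rate = average_correct_rate(correct)
print(f"{rate:.1f}%")  # 80.7%, which rounds to the 81% quoted above
```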

How does it work? Scatter plot for TfR and LAMP2

[Figure: scatter plot of TfR and LAMP2 images; one axis is number of objects per cell]

Feature Subsets

The large number of features used may make training of the network harder due to the large number of weights needing to be adjusted

Therefore stepwise discriminant analysis was used to select a subset of the features that optimizes a criterion for distinguishing classes

Results: “Best” Features

Rows: true class. Columns: predicted class, in the order DNA, ER, Giantin, GPP130, LAMP, Mitoch., Nucleoli, Actin, TfRecept, Tubulin. Values are percentages; blank cells are not shown, so each row lists only its nonblank entries, in column order.

DNA    99 1
ER     87 2 1 7 2 2
Giant  1 77 19 1 1 1
GPP    16 78 2 1 1 1
LAM    1 5 2 74 1 1 16 1
Mito.  8 2 2 79 1 2 6
Nucl.  1 1 2 95
Actin  1 96 2
TfR    5 1 1 20 3 2 62 6
Tub.   4 8 1 5 81

Average Correct Classification Rate: 83%

How to do even better

Biologists interpreting images of protein localization typically view many cells before reaching a conclusion

Can simulate this by classifying sets of cells from the same microscope slide

Also applicable for colonies of CD-tagged cells
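
One simple way to classify a set is to classify each image on its own and then take a vote. The slides do not say which rule was used, so the sketch below is an assumption: a plurality vote that falls back to "Unknown" unless the winning class holds more than a chosen fraction of the set. The name `classify_set` and the `min_fraction` parameter are hypothetical.

```python
from collections import Counter


def classify_set(single_image_labels, min_fraction=0.5):
    """Assign one class to a set of images from their single-image labels.

    Returns "Unknown" unless the most common label accounts for more than
    min_fraction of the set (an assumed rule, not taken from the slides).
    """
    if not single_image_labels:
        raise ValueError("empty image set")
    label, votes = Counter(single_image_labels).most_common(1)[0]
    if votes / len(single_image_labels) <= min_fraction:
        return "Unknown"
    return label


# Ten images from one slide: six called TfR, three LAMP, one ER.
print(classify_set(["TfR"] * 6 + ["LAMP"] * 3 + ["ER"]))      # TfR
# No class has a majority, so the set is left unassigned.
print(classify_set(["TfR"] * 4 + ["LAMP"] * 4 + ["ER"] * 2))  # Unknown
```

Voting helps because single-image errors for a class such as TfR are spread over several wrong classes, so the correct class usually still wins across ten images.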

Classification accuracy for sets of ten images

Rows: true class. Columns: predicted class, in the order DNA, ER, Giantin, GPP, LAMP, Mito., Nucleoli, Actin, TfR, Tubulin, Unknown. Values are percentages; blank cells are not shown, so each row lists only its nonblank entries, in column order.

DNA       100
ER        100
Giantin   98 1
GPP       99 1
LAMP      97 1 2
Mito.     100
Nucleoli  100
Actin     100
TfR       6 88 6
Tubulin   100

Average Correct Classification Rate = 98% (99% for those sets not considered “unknown”)

Conclusion so far

Have demonstrated feasibility of using automated classification to assign a subcellular location “class” to an image

Gearing up to do this for thousands of proteins

SLIC (Subcellular Location Image Classifier)

[Figure: SLIC output reading "This is a Golgi protein!"]

Extending to 3D

Have begun extending this approach to 3D images collected by confocal microscopy

Also beginning to collect 3D images by new method using “grating imager” (with F. Lanni)

3D labeling approach
  All Proteins labeled with Cy5 conjugated reactive dye
  DNA labeled with PI
  Specific Proteins labeled with primary Ab + secondary Alexa488 conjugated Ab

Features for 3D Images

Use a subset of the 2D SLF features:
  Number of Objects
  Euler Number
  Average Object Size
  Standard Deviation of Object Sizes
  Ratio of the Largest to the Smallest Object Size
  Average Distance of Objects from COF
  Standard Deviation of Object Distances from COF
  Ratio of the Largest to Smallest Object Distance

DNA Features

Use the parallel DNA image to calculate
  The average object distance from the COF of the DNA image
  The variance of object distances from the DNA COF
  The ratio of the largest to the smallest object to DNA COF distance
  The distance between the protein COF and the DNA COF
  The ratio of the volume occupied by protein to that occupied by DNA
  The fraction of the protein fluorescence that co-localizes with DNA

3D Classification Results with 14 features

Rows: true class. Columns: output of classifier. Values are shown for the five classes with results; no values are listed for ER, Mitoch., Nucleolin, TfR, or Tubulin.

True Class   DNA  Gia  GPP  LAMP  Act
DNA           94    6    0     0    0
Giantin        3   97    0     0    0
GPP130         0    0  100     0    0
LAMP2          0    0    2    92    6
Actin          0    0    0     3   97

Overall accuracy = 96%

2D Results - Same 14 Features

Rows: true class. Columns: output of classifier. Values are shown for the five classes with results; no values are listed for ER, Mitoch., Nucleolin, TfR, or Tubulin.

True Class   DNA  Gia  GPP  LAMP  Act
DNA          100    0    0     0    0
Giantin        0   59   36     4    1
GPP130         2   38   56     4    0
LAMP2          0    3    3    93    2
Actin          0    0    0     4   96

Overall accuracy = 82%

Next: Experiment Interpretation

Growing use of digital microscopy anticipated to give rise to a need for a variety of computational approaches that can automate extraction of information from images or testing of hypotheses using image sets

Key is design and validation of feature sets

Goal: Typical Image Selection

To develop automated methods for selecting a representative image from a set of images obtained by fluorescence microscopy

TypIC - Typical Image Chooser

[Figure: TypIC, with input labeled Image Set and output reading "The third image is the most typical of the set!!"]

Motivation
  Authors/Speakers must choose images for publication/presentation that represent an entire set
  Currently choice is subjective and may change over time
  Currently choice cannot be verified by others

Approach
  Use sets of images collected for the classification project to evaluate various approaches to choosing a typical image

Sample Images

Approach
  Calculate numerical features that contain information about each image (just like when classifying images)
  Calculate the similarity of each image to the other images (using the numerical features)
  Choose the image that is representative (typical) by choosing the image that is most similar to the others
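
The three steps above can be sketched as follows. The similarity measure is an assumption: the example z-scores each feature and uses Euclidean distance, picking the image with the smallest total distance to all others. The names `standardize` and `most_typical` are invented for the illustration and are not TypIC's actual interface.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(features):
    """Z-score each feature column so no single feature dominates the distance."""
    columns = list(zip(*features))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead of by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in features]


def most_typical(features):
    """Index of the image with the smallest total distance to all other images."""
    z = standardize(features)

    def total_distance(i):
        return sum(
            sqrt(sum((a - b) ** 2 for a, b in zip(z[i], z[j])))
            for j in range(len(z))
        )

    return min(range(len(z)), key=total_distance)


# One row of feature values per image; the last image is unlike the rest.
feature_matrix = [
    [0.0, 5.0],
    [1.0, 6.0],
    [2.0, 7.0],
    [3.0, 8.0],
    [10.0, 15.0],
]
print(most_typical(feature_matrix))  # 2
```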

Image Similarity

Why do we need to be able to measure image similarity?
  To find images similar to a particular image, either on the web, in a database or on a microscope
  To pick a representative image from a set
  To test hypotheses regarding images (are two images or groups of images the same or different)

What is typical?

What do we mean by a typical (or representative) point in multidimensional space?

In one dimension, we think of the median point

What we need then is a multidimensional median
  Problem: No unique definition

Possible approaches to multidimensional median
  Convex peeling
  Closest point to combination of unidimensional medians
  Closest point to mean
  >>> In all cases, beware of outliers!
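
Two of the approaches listed above, closest point to the combination of unidimensional medians and closest point to the mean, differ only in how the center is built. The sketch below illustrates both with made-up data; the helper name `closest_to_center` is hypothetical, and convex peeling is not covered.

```python
from math import dist
from statistics import mean, median


def closest_to_center(points, center_fn):
    """Index of the point nearest the center obtained by applying
    center_fn (e.g. median or mean) to each coordinate separately."""
    center = [center_fn(column) for column in zip(*points)]
    return min(range(len(points)), key=lambda i: dist(points[i], center))


# Four similar points plus one outlier (the last one).
points = [(1.0, 1.0), (2.2, 2.5), (3.0, 2.0), (2.5, 3.0), (40.0, 50.0)]
by_median = closest_to_center(points, median)
by_mean = closest_to_center(points, mean)
```

Here the two rules disagree (index 1 for the medians, index 3 for the mean) because the single outlier drags the mean away from the bulk of the points while leaving the coordinate-wise medians in place, which is the reason for the warning about outliers.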

Results For Golgi (giantin) Images

[Figure: giantin images ordered from Most Typical to Least Typical]

Goal: Image Set Comparison

A common paradigm in molecular cell biology is to compare the distribution of a protein with and without the addition of a potential perturbing agent (e.g., drug, overexpressed protein)

Such experiments usually assayed by visual examination

We have explored automating such comparisons

SImEC - Statistical Imaging Experiment Comparator

[Figure: SImEC, with inputs labeled Image Set 1 and Image Set 2 and output reading "These sets are statistically different!"]

Inputs to Method

1) Two sets of images taken under identical conditions except for condition being tested (e.g., with & without drug)
  Should have roughly equal number of images in each set
  Total number of images between both sets should exceed the number of features

2) A specification of the feature set to be used
  default is 65 features, 49 Zernike moments and 16 SLF features

3) A confidence level
  default is 95%

Method

Calculate feature matrix for each set of images

Compare feature matrices using a multivariate hypothesis test called the Hotelling T2-test

Hotelling T2 test

Let $n_1$ and $n_2$ be the number of images in the two sets
Let $p$ be the number of features
Calculate mean vector for each set, $I_1$ and $I_2$
Calculate covariance matrices for each set, $\mathrm{cov}_1$ and $\mathrm{cov}_2$

Calculate merged covariance matrix

$$S = \frac{(n_1 - 1)\,\mathrm{cov}_1 + (n_2 - 1)\,\mathrm{cov}_2}{n_1 + n_2 - 2}$$

Calculate Mahalanobis distance between the mean vectors using combined covariance matrix (measures how far apart the two sets are)

$$D^2 = (I_1 - I_2)\, S^{-1}\, (I_1 - I_2)^T$$

Calculate Hotelling $T^2$

$$T^2 = \frac{n_1 n_2}{n_1 + n_2}\, D^2$$

and associated F statistic

$$F = \frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\, p}\, T^2$$

Hotelling T2 test

This F statistic has $p$ and $n_1 + n_2 - p - 1$ degrees of freedom
Tests $H_0$: $I_1 = I_2$
Accept $H_0$ if F is less than the critical value for the two degrees of freedom
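
The calculation can be sketched with NumPy as below. This is an illustrative implementation of the formulas on these slides, not the SImEC code; the function name `hotelling_t2` is an assumption, and the degrees of freedom returned are the standard ones for the two-sample test, $p$ and $n_1 + n_2 - p - 1$. The critical value is not computed here; it would be looked up for the chosen confidence level in an F table or a statistics library.

```python
import numpy as np


def hotelling_t2(features1, features2):
    """Two-sample Hotelling T^2 statistic and its F transform.

    features1, features2: arrays of shape (n1, p) and (n2, p), one row per image.
    Returns (T2, F, (df1, df2)).
    """
    x1 = np.asarray(features1, dtype=float)
    x2 = np.asarray(features2, dtype=float)
    n1, p = x1.shape
    n2 = x2.shape[0]
    if n1 + n2 - p - 1 <= 0:
        raise ValueError("total number of images must exceed p + 1")
    diff = x1.mean(axis=0) - x2.mean(axis=0)        # I1 - I2
    cov1 = np.atleast_2d(np.cov(x1, rowvar=False))  # covariance of set 1
    cov2 = np.atleast_2d(np.cov(x2, rowvar=False))  # covariance of set 2
    s = ((n1 - 1) * cov1 + (n2 - 1) * cov2) / (n1 + n2 - 2)  # merged covariance S
    d2 = float(diff @ np.linalg.solve(s, diff))     # squared Mahalanobis distance D^2
    t2 = n1 * n2 / (n1 + n2) * d2
    f = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t2
    return t2, f, (p, n1 + n2 - p - 1)


# One feature, three images per set: T^2 reduces to the squared two-sample t statistic.
t2, f, df = hotelling_t2([[1.0], [2.0], [3.0]], [[2.0], [3.0], [4.0]])
print(t2, f, df)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of S explicitly. S must be invertible, which is one reason the total number of images should exceed the number of features.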

Summary of Method

Collect 2 sets of images
Extract features
Perform Hotelling T2 test
If F value falls below critical value for desired confidence level (e.g., 95%) then the two distributions are considered to be the same

F values for comparison of all pairs of classes using 65 features

Class      No. of images  DAPI   ER    giantin  gpp130  LAMP2  mc151  nucleolin  phal.  tfr
DAPI       87
ER         86             83.2
giantin    87             206.1  34.7
gpp130     85             227.4  44.5  2.4
LAMP2      84             112.2  13.8  10.7     11.4
mc151      73             152.4  8.9   39.2     44.5    15.9
nucleolin  73             79.8   39.8  17.2     15.1    14.5   46.6
phal.      98             527.2  63.5  325.3    354.0   109.8  16.0   266.4
tfr        91             102.8  7.4   14.8     15.6    2.8    9.2    20.5       29.1
tubulin    91             138.3  10.8  63.0     72.2    18.4   7.0    49.4       22.4   5.5

Critical values are approximately 1.4 for all comparisons (depends on number of images)

Comparison of two sets drawn randomly from the same class

                                     TfR    Phal
Average F                            1.05   1.05
Critical F (0.95)                    1.63   1.61
Number of failing sets out of 1000   47     45

Expected result obtained: 95% of randomly drawn sets are considered to be the same

SImEC

Have system for comparing image sets
  can detect subtle differences
  but still concludes that two sets of images of the same protein are the same

Conclusions

New frontier of automated cell biology just opening
  Classification of subcellular patterns
  Selection of representative images
  Comparison of image sets

Will be combined with informatics tools to produce self-justifying, self-populating knowledge bases for proteins
