Computational Biology, Part 24
Biological Imaging IV
Robert F. Murphy
Copyright 2001. All rights reserved.
Proteomics
The set of proteins expressed in a given cell type or tissue is called its proteome
Not all transcripts are actually made into protein, and the steady-state level of protein expression is controlled by many factors other than transcript amount
Protein differences between cell types are responsible for the different roles of those cells
Proteomics
Things to learn about proteins
  sequence
  location - gives insight into function
  structure
  activity
  partners
Almost nothing is known about most proteins!
One Approach to Proteomics - CD-tagging
Infect cells with a retrovirus carrying a DNA sequence that will produce a “tag” in a random protein
Examine many cells, each of which is expected to express one tagged protein, to determine the subcellular location of that protein
Use fluorescence microscopy
Principles of CD-Tagging (CD = Central Dogma)
[Diagram: genomic DNA (Exon 1, Intron 1, Exon 2) + CD-cassette -> tagged DNA (Exon 1, Tag, Exon 2) -> tagged mRNA -> tagged protein carrying the tag (epitope)]
CD-Tagging: Proof of concept
Use a CD-cassette containing the hemagglutinin (HA) epitope
Insert the cassette into introns of the nucleolin gene
Obtain clonal lines expressing the tagged protein
Image the distribution of nucleolin using immunofluorescence microscopy
Results: CD-Tagging
[Image: tagged nucleolin in HeLa cells]
CD-Tagging: Extensions
Improved epitope tag
  The HA epitope works only in one reading frame
  Designed an epitope that is the same in all three reading frames - the universal epitope
Endogenously fluorescent tags
  Can use fluorescent proteins (e.g., GFP, YFP) as the inserted tag!
  Don’t need fixation and antibodies
CD-tagging project
Large project funded by the National Cancer Institute to identify locations for all expressed genes
  Jonathan Jarvik
  Peter Berget
  Robert Murphy
My group is responsible for automated analysis of subcellular location patterns
The Problem
Different investigators may use different terms to refer to the same pattern or the same term to refer to different patterns
Current determinations do not lend themselves to incorporation into databases (at best, databases may describe a protein in comment fields as being a “cytoskeletal protein” or an “endosomal protein” even though these are known to be imprecise)
Cartoonist's view of Subcellular Locations
Cells Alive! “rollover” cell with information on each organelle
http://www.cellsalive.net/cells/animcell.htm
The Starting Point
A systematic, quantitative approach to protein localization (whether from a pattern analysis or a bioinformatics perspective) has not been presented previously
The Goal
[Figure: automated interpretation of an image, reporting "This is a Golgi protein!"]
More problems
Direct (point-by-point) comparison of individual images is not possible, since
  different cells have different shapes, sizes, orientations
  organelles within cells are not found in fixed locations
The Approach
1. Create sets of images showing the localization of many different proteins (each set defines one class of pattern)
2. Reduce each image to a set of numerical values (“features”) that are insensitive to position and rotation of the cell
3. Use statistical classification methods to “learn” how to distinguish each class using the features
Input
Created image database for HeLa cells
Ten classes covering all major subcellular structures: Golgi, ER, mitochondria, lysosomes, endosomes, nuclei, nucleoli, microfilaments, microtubules
Includes classes that are similar to each other
Example Images
Patterns that might be easily confused
  [Images: Endoplasmic Reticulum (ER) vs. Mitochondria]
  [Images: Lysosomes (LAMP2) vs. Endosomes (TfR)]
  [Images: F-actin vs. Tubulin]
Classes expected to be indistinguishable
  [Images: Golgi (Giantin) vs. Golgi (gpp130)]
Features
Zernike moment features (based on the Zernike polynomials) - give information on basic nature of pattern (e.g., circle, donut) and sizes (frequencies) present in pattern
Haralick texture features - give information on correlations in intensity between adjacent pixels
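Examples of Zernike Polynomials
[Figure: example Zernike polynomials]
Zernike Moments Reconstruction
[Figure: original image and its reconstructions from Zernike moments up to order 12, order 20 and order 45]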
Features
Developed additional features (SLF, for Subcellular Location Features)
Motivated by descriptions of patterns used by biologists (e.g., punctate, perinuclear)
Combined with Zernike and Haralick features to give 84 features used to describe each image
Example Features from SLF1
  Number of fluorescent objects per cell
  Variance of the object sizes
  Ratio of the largest object to the smallest
  Average distance of objects to the ‘center of fluorescence’
  Fraction of convex hull occupied by fluorescence
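The five example features above can be sketched in a few lines. This is only an illustration, not the SLF1 implementation: the slides do not say how objects are found, so the sketch assumes an object is a connected group of pixels above a user-supplied `threshold`, and the function name `example_slf_features` is hypothetical.

```python
import numpy as np
from scipy import ndimage
from scipy.spatial import ConvexHull


def example_slf_features(image, threshold):
    """Illustrative versions of five SLF1-style features for one 2D cell image."""
    image = np.asarray(image, dtype=float)
    mask = image > threshold                   # assumption: objects = above-threshold pixels
    labels, n_objects = ndimage.label(mask)    # 4-connected components
    if n_objects == 0:
        raise ValueError("no pixels above threshold")

    sizes = np.bincount(labels.ravel())[1:]    # pixels per object (label 0 is background)
    index = np.arange(1, n_objects + 1)
    centers = np.array(ndimage.center_of_mass(image, labels, index))
    # Intensity-weighted centre of all above-threshold fluorescence.
    cof = np.array(ndimage.center_of_mass(image * mask))
    distances = np.linalg.norm(centers - cof, axis=1)

    # Convex hull of the pixel corners, so the hull encloses whole pixels.
    pixels = np.argwhere(mask).astype(float)
    offsets = np.array([[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]])
    corners = (pixels[:, None, :] + offsets[None, :, :]).reshape(-1, 2)
    hull_area = ConvexHull(corners).volume     # for 2D points, .volume is the area

    return {
        "n_objects": int(n_objects),
        "size_variance": float(np.var(sizes)),
        "size_ratio": float(sizes.max() / sizes.min()),
        "mean_distance_to_cof": float(distances.mean()),
        "hull_fraction": float(mask.sum() / hull_area),
    }
```

None of these values depends on where the cell sits in the frame or how it is rotated, which is the property the approach requires of its features.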
The Approach
1. Acquisition of Images
2. Image Processing
3. Feature Extraction
4. Classifier Design and Training
5. Classification

          feature1  feature2  ...  featureN
Image1    0.3489    0.1294    ...  1.9012
Image2    0.4985    0.4823    ...  1.8390
...       ...
ImageM    1.8245    0.8290    ...  0.9018

Output: "This is a Golgi Protein"
Backpropagation Neural Network
[Diagram: Input 1 ... Input n -> internal ‘neurons’ -> Output 1 ... Output m]
Classification accuracy for single images
Rows: true class. Columns: predicted class, in the order DNA, ER, Giantin, GPP130, LAMP, Mitoch., Nucleoli, Actin, TfRecept, Tubulin.
Blank cells are omitted, so each row lists only its nonzero entries, left to right.
DNA    99 1
ER     86 3 5 5
Giant  77 19 1 2 1
GPP    18 78 2 2 1
LAM    1 3 2 73 1 2 17 1
Mito.  9 2 4 77 2 6
Nucl.  2 1 2 1 94
Actin  3 91 6
TfR    5 3 1 25 3 5 55 5
Tub.   5 1 7 1 4 5 77
Average Correct Classification Rate: 81%
How does it work? Scatter plot for TfR and LAMP2
[Scatter plot; axis label: number of objects per cell]
Feature Subsets
The large number of features used may make training of the network harder due to the large number of weights needing to be adjusted
Therefore stepwise discriminant analysis was used to select a subset of the features that optimizes a criterion for distinguishing classes
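To make the idea of criterion-driven feature selection concrete, here is a minimal sketch. It is a simplification under stated assumptions: it only adds features (true stepwise discriminant analysis can also remove them and uses significance thresholds to stop), and it assumes Wilks' lambda as the criterion, which the slides do not name. The function names are hypothetical.

```python
import numpy as np


def wilks_lambda(X, y):
    """det(within-class scatter) / det(total scatter); smaller means better separation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    centered = X - X.mean(axis=0)
    total = centered.T @ centered
    within = np.zeros_like(total)
    for c in np.unique(y):
        d = X[y == c] - X[y == c].mean(axis=0)
        within += d.T @ d
    return np.linalg.det(within) / np.linalg.det(total)


def forward_select(X, y, n_keep):
    """Greedily add the feature that lowers Wilks' lambda the most (forward-only)."""
    X = np.asarray(X, dtype=float)
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        best = min(remaining, key=lambda j: wilks_lambda(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The returned column indices would then be used to cut the feature matrix down before training the network.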
Results: “Best” Features
Rows: true class. Columns: predicted class, in the order DNA, ER, Giantin, GPP130, LAMP, Mitoch., Nucleoli, Actin, TfRecept, Tubulin.
Blank cells are omitted, so each row lists only its nonzero entries, left to right.
DNA    99 1
ER     87 2 1 7 2 2
Giant  1 77 19 1 1 1
GPP    16 78 2 1 1 1
LAM    1 5 2 74 1 1 16 1
Mito.  8 2 2 79 1 2 6
Nucl.  1 1 2 95
Actin  1 96 2
TfR    5 1 1 20 3 2 62 6
Tub.   4 8 1 5 81
Average Correct Classification Rate: 83%
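The two average rates quoted for single images can be reproduced from the tables. The short check below assumes that the largest value in each row is the correctly classified percentage and that the average is an unweighted mean over the ten classes; neither is stated on the slides.

```python
# Per-class correct percentages, read off as the largest value in each row
# (assumption), in the order DNA, ER, Giantin, GPP130, LAMP, Mito.,
# Nucleoli, Actin, TfR, Tubulin.
correct_all_features = [99, 86, 77, 78, 73, 77, 94, 91, 55, 77]
correct_best_features = [99, 87, 77, 78, 74, 79, 95, 96, 62, 81]


def average_rate(per_class_correct):
    """Unweighted mean of the per-class correct classification percentages."""
    return sum(per_class_correct) / len(per_class_correct)


print(average_rate(correct_all_features))   # 80.7, reported as 81%
print(average_rate(correct_best_features))  # 82.8, reported as 83%
```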
How to do even better
Biologists interpreting images of protein localization typically view many cells before reaching a conclusion
Can simulate this by classifying sets of cells from the same microscope slide
Also applicable for colonies of CD-tagged cells
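One simple way to classify a set of cells is to classify each image separately and let the images vote. The sketch below assumes a strict-majority rule with an "unknown" outcome when no class wins; the slides do not state which rule was actually used, and `classify_set` and `min_fraction` are hypothetical names.

```python
from collections import Counter


def classify_set(single_image_labels, min_fraction=0.5):
    """Return the class predicted for more than `min_fraction` of the images,
    or "unknown" if no class reaches that share (assumed voting rule)."""
    if not single_image_labels:
        raise ValueError("need at least one single-image prediction")
    label, votes = Counter(single_image_labels).most_common(1)[0]
    if votes / len(single_image_labels) <= min_fraction:
        return "unknown"
    return label


# Ten images from one slide: six called TfR, four called LAMP.
print(classify_set(["TfR"] * 6 + ["LAMP"] * 4))  # TfR
```

Because individual errors are outvoted, a set can be classified correctly even when a fair share of its single images are not.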
Classification accuracy for sets of ten images
Rows: true class. Columns: predicted class, in the order DNA, ER, Giantin, GPP, LAMP, Mito., Nucleoli, Actin, TfR, Tubulin, Unknown.
Blank cells are omitted, so each row lists only its nonzero entries, left to right.
DNA       100
ER        100
Giantin   98 1
GPP       99 1
LAMP      97 1 2
Mito.     100
Nucleoli  100
Actin     100
TfR       6 88 6
Tubulin   100
Average Correct Classification Rate = 98% (99% for those sets not considered “unknown”)
Conclusion so far
Have demonstrated feasibility of using automated classification to assign a subcellular location “class” to an image
Gearing up to do this for thousands of proteins
SLIC (Subcellular Location Image Classifier)
[Figure: SLIC reporting "This is a Golgi protein!"]
Extending to 3D
Have begun extending this approach to 3D images collected by confocal microscopy
Also beginning to collect 3D images by new method using “grating imager” (with F. Lanni)
3D labeling approach
  All proteins labeled with Cy5 conjugated reactive dye
  DNA labeled with PI
  Specific proteins labeled with primary Ab + secondary Alexa488 conjugated Ab
Features for 3D Images
Use a subset of the 2D SLF features:
  Number of Objects
  Euler Number
  Average Object Size
  Standard Deviation of Object Sizes
  Ratio of the Largest to the Smallest Object Size
  Average Distance of Objects from COF
  Standard Deviation of Object Distances from COF
  Ratio of the Largest to Smallest Object Distance
DNA Features
Use the parallel DNA image to calculate
  The average object distance from the COF of the DNA image
  The variance of object distances from the DNA COF
  The ratio of the largest to the smallest object to DNA COF distance
  The distance between the protein COF and the DNA COF
  The ratio of the volume occupied by protein to that occupied by DNA
  The fraction of the protein fluorescence that co-localizes with DNA
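The last three DNA features in the list do not require finding individual objects and can be sketched directly on a pair of 3D arrays. This is an illustration only: the thresholds used to decide which voxels are "occupied" are assumptions, and the function names are hypothetical.

```python
import numpy as np


def center_of_fluorescence(volume):
    """Intensity-weighted mean voxel coordinate (COF) of a 3D image."""
    volume = np.asarray(volume, dtype=float)
    coords = np.indices(volume.shape).reshape(volume.ndim, -1)
    weights = volume.ravel()
    return coords @ weights / weights.sum()


def dna_reference_features(protein, dna, protein_threshold, dna_threshold):
    """Three DNA-referenced features for a protein image and its parallel DNA image."""
    protein = np.asarray(protein, dtype=float)
    dna = np.asarray(dna, dtype=float)
    protein_mask = protein > protein_threshold   # assumed definition of "occupied"
    dna_mask = dna > dna_threshold
    cof_distance = np.linalg.norm(
        center_of_fluorescence(protein) - center_of_fluorescence(dna)
    )
    volume_ratio = protein_mask.sum() / dna_mask.sum()
    coloc_fraction = protein[protein_mask & dna_mask].sum() / protein[protein_mask].sum()
    return {
        "cof_distance": float(cof_distance),
        "volume_ratio": float(volume_ratio),
        "coloc_fraction": float(coloc_fraction),
    }
```

Distances here are in voxel units; for confocal stacks with anisotropic voxels the coordinates would need to be scaled by the voxel dimensions first.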
3D Classification Results with 14 features
Rows: true class. Columns: output of classifier. Only five classes have entries; the rows for ER, Mitoch., Nucleolin, TfR and Tubulin are empty.
          DNA  Giantin  GPP130  LAMP2  Actin
DNA       94   6        0       0      0
Giantin   3    97       0       0      0
GPP130    0    0        100     0      0
LAMP2     0    0        2       92     6
Actin     0    0        0       3      97
Overall accuracy = 96%
2D Results - Same 14 Features
Rows: true class. Columns: output of classifier. Only five classes have entries; the rows for ER, Mitoch., Nucleolin, TfR and Tubulin are empty.
          DNA  Giantin  GPP130  LAMP2  Actin
DNA       100  0        0       0      0
Giantin   0    59       36      4      1
GPP130    2    38       56      4      0
LAMP2     0    3        3       93     2
Actin     0    0        0       4      96
Overall accuracy = 82%
Next: Experiment Interpretation
Growing use of digital microscopy is anticipated to give rise to a need for a variety of computational approaches that can automate extraction of information from images or testing of hypotheses using image sets
Key is design and validation of feature sets
Goal: Typical Image Selection
To develop automated methods for selecting a representative image from a set of images obtained by fluorescence microscopy
TypIC - Typical Image Chooser
[Figure: an image set is given to TypIC, which reports "The third image is the most typical of the set!!"]
Motivation
Authors/Speakers must choose images for publication/presentation that represent an entire set
Currently the choice is subjective and may change over time
Currently the choice cannot be verified by others
Approach
Use sets of images collected for the classification project to evaluate various approaches to choosing a typical image
[Images: sample images]
Approach
Calculate numerical features that contain information about each image (just like when classifying images)
Calculate the similarity of each image to the other images (using the numerical features)
Choose the image that is representative (typical) by choosing the image that is most similar to the others
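These three steps can be sketched as a ranking over a feature matrix. The similarity measure is an assumption made for illustration (Euclidean distance between z-scored feature vectors, summed over all other images); the slides do not specify the measure at this point, and `rank_by_typicality` is a hypothetical name.

```python
import numpy as np


def rank_by_typicality(features):
    """Order image indices from most to least typical.

    `features` is an (images x features) matrix. An image is taken to be
    typical when its summed distance to all other images is small.
    """
    X = np.asarray(features, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # leave constant features unscaled
    Z = (X - X.mean(axis=0)) / std           # put features on a common scale
    diff = Z[:, None, :] - Z[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))
    return np.argsort(distances.sum(axis=1))
```

The first index in the returned order is the candidate "most typical" image and the last is the least typical.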
Image Similarity
Why do we need to be able to measure image similarity?
  To find images similar to a particular image, either on the web, in a database or on a microscope
  To pick a representative image from a set
  To test hypotheses regarding images (are two images or groups of images the same or different)
What is typical?
What do we mean by a typical (or representative) point in multidimensional space?
In one dimension, we think of the median point
What we need then is a multidimensional median
  Problem: No unique definition
Possible approaches to multidimensional median
  Convex peeling
  Closest point to combination of unidimensional medians
  Closest point to mean
>>> In all cases, beware of outliers!
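Two of the listed approaches are easy to compare side by side, which also shows why outliers matter. The sketch below is illustrative only (hypothetical function names, plain Euclidean distance assumed) and does not cover convex peeling.

```python
import numpy as np


def closest_to(X, target):
    """Index of the row of X nearest to `target` (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(X - target, axis=1)))


def typical_by_mean(X):
    """Closest point to the mean vector."""
    X = np.asarray(X, dtype=float)
    return closest_to(X, X.mean(axis=0))


def typical_by_medians(X):
    """Closest point to the vector of per-feature (unidimensional) medians."""
    X = np.asarray(X, dtype=float)
    return closest_to(X, np.median(X, axis=0))


# Four similar points and one outlier.
points = [[0, 0], [1, 1], [2, 2], [3, 3], [100, 100]]
print(typical_by_mean(points))     # 3: the outlier drags the mean to (21.2, 21.2)
print(typical_by_medians(points))  # 2: the medians stay at (2, 2)
```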
Results For Golgi (giantin) Images
[Figure: giantin images ordered from most typical to least typical]
Goal: Image Set Comparison
A common paradigm in molecular cell biology is to compare the distribution of a protein with and without the addition of a potential perturbing agent (e.g., drug, overexpressed protein)
Such experiments are usually assayed by visual examination
We have explored automating such comparisons
SImEC - Statistical Imaging Experiment Comparator
[Figure: Image Set 1 and Image Set 2 are given to SImEC, which reports "These sets are statistically different!"]
Inputs to Method
1) Two sets of images taken under identical conditions except for the condition being tested (e.g., with & without drug)
  Should have roughly equal numbers of images in each set
  Total number of images in both sets should exceed the number of features
2) A specification of the feature set to be used
  default is 65 features: 49 Zernike moments and 16 SLF features
3) A confidence level
  default is 95%
Method
Calculate feature matrix for each set of images
Compare feature matrices using a multivariate hypothesis test called the Hotelling T² test
Hotelling T² test
Let n1 and n2 be the number of images in the two sets
Let p be the number of features
Calculate the mean vector for each set, I1 and I2
Calculate the covariance matrix for each set, cov1 and cov2
Calculate the merged covariance matrix
$$S = \frac{(n_1 - 1)\,\mathrm{cov}_1 + (n_2 - 1)\,\mathrm{cov}_2}{n_1 + n_2 - 2}$$
Calculate the Mahalanobis distance between the mean vectors using the combined covariance matrix (measures how far apart the two sets are)
$$D^2 = (I_1 - I_2)\, S^{-1}\, (I_1 - I_2)^T$$
Calculate Hotelling T²
$$T^2 = \frac{n_1 n_2}{n_1 + n_2}\, D^2$$
and the associated F statistic
$$F = \frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\, p}\, T^2$$
This F statistic has p and n1+n2-p-1 degrees of freedom
Tests H0: I1 = I2
Accept H0 if F is less than the critical value for the two degrees of freedom
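The whole test fits in a short function. This is a minimal sketch of the calculation described above rather than the SImEC code itself; the function name and return format are hypothetical, and it uses the standard two-sample result that F follows an F distribution with p and n1+n2-p-1 degrees of freedom.

```python
import numpy as np
from scipy.stats import f as f_dist


def hotelling_t2_test(X1, X2, confidence=0.95):
    """Two-sample Hotelling T^2 test on (images x features) matrices.

    Returns (F, critical_F, same), where `same` is True when H0 (equal
    mean feature vectors) is accepted at the given confidence level.
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    n1, p = X1.shape
    n2 = X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    cov1 = np.atleast_2d(np.cov(X1, rowvar=False))
    cov2 = np.atleast_2d(np.cov(X2, rowvar=False))
    S = ((n1 - 1) * cov1 + (n2 - 1) * cov2) / (n1 + n2 - 2)   # merged covariance
    D2 = diff @ np.linalg.solve(S, diff)                       # Mahalanobis distance
    T2 = n1 * n2 / (n1 + n2) * D2
    F = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
    critical = f_dist.ppf(confidence, p, n1 + n2 - p - 1)
    return float(F), float(critical), bool(F < critical)
```

The merged covariance matrix must be invertible, which is why the total number of images has to exceed the number of features.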
Summary of Method
Collect 2 sets of images
Extract features
Perform Hotelling T² test
If the F value falls below the critical value for the desired confidence level (e.g., 95%) then the two distributions are considered to be the same
F values for comparison of all pairs of classes using 65 features
Class      No. of images  DAPI   ER    giantin  gpp130  LAMP2  mc151  nucleolin  phal.  tfr
DAPI       87
ER         86             83.2
giantin    87             206.1  34.7
gpp130     85             227.4  44.5  2.4
LAMP2      84             112.2  13.8  10.7     11.4
mc151      73             152.4  8.9   39.2     44.5    15.9
nucleolin  73             79.8   39.8  17.2     15.1    14.5   46.6
phal.      98             527.2  63.5  325.3    354.0   109.8  16.0   266.4
tfr        91             102.8  7.4   14.8     15.6    2.8    9.2    20.5       29.1
tubulin    91             138.3  10.8  63.0     72.2    18.4   7.0    49.4       22.4   5.5
Critical values are approximately 1.4 for all comparisons (depends on number of images)
Comparison of two sets drawn randomly from the same class
                                    TfR   Phal
Average F                           1.05  1.05
Critical F (0.95)                   1.63  1.61
Number of failing sets out of 1000  47    45
Expected result obtained: 95% of randomly drawn sets are considered to be the same
SImEC
Have system for comparing image sets
  can detect subtle differences
  but still concludes that two sets of images of the same protein are the same
Conclusions
New frontier of automated cell biology just opening
  Classification of subcellular patterns
  Selection of representative images
  Comparison of image sets
Will be combined with informatics tools to produce self-justifying, self-populating knowledge bases for proteins