Ontology-based Annotation & Query of TMA data
Nigam Shah
Stanford Medical Informatics ([email protected])
Tissue Microarrays
www.nature.com/clinicalpractice/onc
Stanford tissue microarray database
http://tma.stanford.edu/tma_portal/
Key analysis issue
Tissue microarrays query a large number of samples/patients for one protein.
The key query dimension in TMA data is therefore the tissue sample.
Because there is no commonly used ontology for describing the diagnosis [or annotations] of a given TMA sample in TMAD, it is not easy to query along this dimension.
Ontologies considered
The NCI Thesaurus, version 05.09g
SNOMED-CT, from UMLS 2005AA
Available annotations for a block
Each donor block in the TMA has semi-structured text associated with it.

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Subclass 4
2334  Ovary        MMMT
3335  Prostate     Carcinoma  Adeno              intraductal
7022  Bladder      Carcinoma  Transitional cell  In situ
7288  Testis       teratoma   immature           Embryonal carcinoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid
6663  lung         Sarcoma    Leiomyo            epithelioid
4713  stomach      carcinoma  unknown
Map text to ontology terms
Make all possible permutations. Rules to weed out bad permutations.
Check for an exact match with NCI and SNOMED-CT terms (and/or synonyms). Rules to weed out bad matches.
Example: "Prostate Carcinoma Adeno intraductal" gives 24 permutations, such as:
Prostate Carcinoma Adeno intraductal
Carcinoma Prostate intraductal Adeno
Adeno Carcinoma intraductal Prostate
Prostate intraductal Adeno Carcinoma
The matching term is Prostate_Ductal_Adenocarcinoma.
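
The permutation-and-match idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TMAD implementation: the synonym table is a hypothetical stand-in for the NCI Thesaurus and SNOMED-CT name lists, the `normalize` helper is an assumed way of letting "Adeno Carcinoma" meet "Adenocarcinoma", and trying permutations of every length (not only the full 24) is an assumption made so that partial matches can surface. The weeding rules mentioned above are not modeled.

```python
from itertools import permutations


def normalize(text):
    """Lowercase and drop non-alphanumeric characters, so that
    'Adeno Carcinoma' and 'Adenocarcinoma' compare equal."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


# Hypothetical synonym table (assumption): term -> names/synonyms.
SYNONYMS = {
    "Prostate_Ductal_Adenocarcinoma": [
        "Prostate Ductal Adenocarcinoma",
        "Prostate Intraductal Adenocarcinoma",
    ],
    "Bladder_Carcinoma": ["Bladder Carcinoma"],
}

# Normalized name -> ontology term, for exact-match lookup.
INDEX = {
    normalize(name): term
    for term, names in SYNONYMS.items()
    for name in names
}


def map_term_set(fields):
    """Try every ordering of every non-empty subset of the fields and
    collect the ontology terms whose name or synonym matches exactly."""
    matches = set()
    for size in range(1, len(fields) + 1):
        for ordering in permutations(fields, size):
            term = INDEX.get(normalize(" ".join(ordering)))
            if term is not None:
                matches.add(term)
    return sorted(matches)


fields = ["Prostate", "Carcinoma", "Adeno", "intraductal"]
full_orderings = list(permutations(fields))
print(len(full_orderings))    # 24 orderings of the four fields
print(map_term_set(fields))   # ['Prostate_Ductal_Adenocarcinoma']
```

In this sketch the ordering "Prostate intraductal Adeno Carcinoma" is the one that hits the (hypothetical) synonym, which is why generating permutations matters for semi-structured text.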
Sample matches (from NCI-T)

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Ontology terms
2334  Ovary        MMMT                                                                Malignant_Mixed_Mesodermal_Mullerian_Tumor
3335  Prostate     Carcinoma  Adeno              intraductal                           Prostate_Ductal_Adenocarcinoma
7022  Bladder      Carcinoma  Transitional cell  In situ                               Stage_0_Transitional_Cell_Carcinoma
                                                                                       Transitional_Cell_Carcinoma
                                                                                       Bladder_Carcinoma
                                                                                       Carcinoma_in_situ
7288  Testis       teratoma   immature           Embryonal carcinoma                   Immature|Teratoma
                                                                                       Testicular_Embryonal_Carcinoma
                                                                                       Immature_Teratoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis  Hepatocellular_Carcinoma
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid                           Soft_Tissue_Sarcoma
                                                                                       Leiomyosarcoma
                                                                                       Epithelioid_Sarcoma
6663  lung         Sarcoma    Leiomyo            epithelioid                           Lung_Sarcoma
                                                                                       Leiomyosarcoma
                                                                                       Epithelioid_Sarcoma
4713  stomach      carcinoma  unknown                                                  Gastric_carcinoma
Results and validation
Mapped the term-sets for 8495 records, which correspond to 783 distinct term-sets.
577 term-sets (6614 records) matched to the NCI Thesaurus.
365 term-sets (3465 records) matched to SNOMED-CT.
In total, mapped 6871 records (80% of the annotated records in TMAD; 641 distinct term-sets) to one or more ontology terms.
Validation   NCI                         SNOMED-CT
             Appropriate  Inappropriate  Appropriate  Inappropriate
Set-1        41           9              41           9
Set-2        42           8              43           7
Set-3        46           4              38           12
Total        129          21             122          28
Average (%)  43.0 (86%)   7.0 (14%)      40.67 (81%)  9.33 (19%)
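
The totals, averages and percentages in the validation table follow directly from the three per-set counts. A small sketch of that arithmetic, using only the numbers shown above (the function and variable names are illustrative, not part of the original work):

```python
# (appropriate, inappropriate) counts per validation set, from the table.
VALIDATION = {
    "NCI": [(41, 9), (42, 8), (46, 4)],
    "SNOMED-CT": [(41, 9), (43, 7), (38, 12)],
}


def summarize(sets):
    """Return totals, per-set means and the overall share of appropriate mappings."""
    appropriate = sum(a for a, _ in sets)
    inappropriate = sum(i for _, i in sets)
    reviewed = appropriate + inappropriate
    return {
        "appropriate": appropriate,
        "inappropriate": inappropriate,
        "mean_appropriate": round(appropriate / len(sets), 2),
        "mean_inappropriate": round(inappropriate / len(sets), 2),
        "pct_appropriate": round(100 * appropriate / reviewed),
    }


summary = {name: summarize(sets) for name, sets in VALIDATION.items()}
for name, row in summary.items():
    print(name, row)
```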
Browsing interface
Parent and sibling nodes with data (burlywood)
Child nodes with data (yellow)
Child nodes with no data (grey)
Click on the “anchor” link to get data
Updates since February

                                   2/17/2006  9/23/2006
Donor blocks to match              8495       8518
Donor blocks with NCI match        6614       7162
Donor blocks with SNOMEDCT match   3465       6959
Donor blocks with any match        6871       7399
Donor blocks with both match       3208       6722

                                     2/17/2006  9/23/2006
Distinct Terms                       783        791
Distinct Terms with NCI match        577        610
Distinct Terms with SNOMEDCT match   365        610
Distinct Terms with any match        641        651
Distinct Terms with both match       295        569

[Bar charts: the distinct-term and donor-block counts above, 2/17/2006 vs 9/23/2006]
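
The "any match" and "both match" donor-block counts are tied together by inclusion-exclusion: blocks matched by either ontology equal the NCI matches plus the SNOMED-CT matches minus the blocks matched by both. A short sketch of that check on the donor-block figures reported for the two dates (the helper name is illustrative):

```python
def any_match(nci, snomed, both):
    """Inclusion-exclusion: blocks matched by at least one ontology."""
    return nci + snomed - both


# Donor-block counts as reported for the two snapshot dates.
DONOR_BLOCKS = {
    "2/17/2006": {"nci": 6614, "snomed": 3465, "both": 3208, "any": 6871},
    "9/23/2006": {"nci": 7162, "snomed": 6959, "both": 6722, "any": 7399},
}

for date, c in DONOR_BLOCKS.items():
    computed = any_match(c["nci"], c["snomed"], c["both"])
    print(date, computed, computed == c["any"])
```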
How does ontology-based annotation help?
Better search: for example, we can retrieve samples of all the retroperitoneal tumors or malignant uterine neoplasms.
Better integration of data: we can correlate gene expression with protein expression across multiple tumor types.
Tissue microarray data from TMAD; gene expression data from GEO.
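
To make the "better search" point concrete: once samples carry ontology terms, a query for a general class can be answered by collecting every sample annotated with that class or any of its descendants. The sketch below assumes a toy is-a hierarchy; the parent links are invented for illustration and are not taken from the NCI Thesaurus or SNOMED-CT. The sample IDs and leaf terms reuse the examples shown earlier.

```python
from collections import defaultdict

# Toy is-a links, child -> parents (assumption, not real ontology content).
IS_A = {
    "Soft_Tissue_Sarcoma": ["Sarcoma"],
    "Lung_Sarcoma": ["Sarcoma"],
    "Leiomyosarcoma": ["Sarcoma"],
    "Bladder_Carcinoma": ["Carcinoma"],
    "Hepatocellular_Carcinoma": ["Carcinoma"],
    "Sarcoma": ["Neoplasm"],
    "Carcinoma": ["Neoplasm"],
}

# Donor block ID -> ontology terms it was mapped to.
ANNOTATIONS = {
    6662: {"Soft_Tissue_Sarcoma", "Leiomyosarcoma"},
    6663: {"Lung_Sarcoma", "Leiomyosarcoma"},
    7022: {"Bladder_Carcinoma"},
    8060: {"Hepatocellular_Carcinoma"},
}


def descendants(term):
    """Return the term together with everything below it in the hierarchy."""
    children = defaultdict(set)
    for child, parents in IS_A.items():
        for parent in parents:
            children[parent].add(child)
    seen, stack = {term}, [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def samples_under(term):
    """IDs of samples annotated with the term or any of its descendants."""
    terms = descendants(term)
    return sorted(sid for sid, annots in ANNOTATIONS.items() if annots & terms)


print(samples_under("Sarcoma"))    # [6662, 6663]
print(samples_under("Neoplasm"))   # [6662, 6663, 7022, 8060]
```

A free-text search for "sarcoma" would find these two blocks as well, but the hierarchy is what lets a query such as "all malignant uterine neoplasms" reach samples whose text never mentions those words.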
Integrating mRNA and protein expression
[Figure: two data matrices, samples by proteins and samples by genes]
Partial alignment of NCI-T and SNOMED-CT as a “bonus”
Steps in Alignment
Anchor identification: identify similar class labels in the ontologies to be aligned. Usually done by string matching.
Ontology structure: use the “similar” classes as anchors and examine the local [graph] structure around them to inform the “similarity” metric.
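
The two steps can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class labels are toy examples rather than real ontology content, anchors are found by normalized exact string match, and the structure step is approximated by a Jaccard overlap of neighbors translated through the anchors. The similarity metric used in the actual alignment work is not specified in these slides.

```python
def normalize(label):
    """Lowercase and strip non-alphanumerics before comparing labels."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def find_anchors(labels_a, labels_b):
    """Step 1: pair classes whose normalized labels are identical."""
    index_b = {normalize(label): label for label in labels_b}
    return {a: index_b[normalize(a)] for a in labels_a if normalize(a) in index_b}


def neighbor_similarity(a, b, edges_a, edges_b, anchors):
    """Step 2: Jaccard overlap between a's neighbors (translated through
    the anchors) and b's neighbors."""
    translated = {anchors[n] for n in edges_a.get(a, ()) if n in anchors}
    neighbors_b = set(edges_b.get(b, ()))
    union = translated | neighbors_b
    return len(translated & neighbors_b) / len(union) if union else 0.0


# Toy ontologies (hypothetical labels and links).
LABELS_A = ["Carcinoma", "Gastric_Neoplasm", "Gastric_Carcinoma"]
LABELS_B = ["carcinoma", "gastric neoplasm", "Carcinoma of stomach",
            "Sarcoma of lung", "sarcoma", "lung neoplasm"]
EDGES_A = {"Gastric_Carcinoma": {"Carcinoma", "Gastric_Neoplasm"}}
EDGES_B = {
    "Carcinoma of stomach": {"carcinoma", "gastric neoplasm"},
    "Sarcoma of lung": {"sarcoma", "lung neoplasm"},
}

anchors = find_anchors(LABELS_A, LABELS_B)
for candidate in ("Carcinoma of stomach", "Sarcoma of lung"):
    score = neighbor_similarity("Gastric_Carcinoma", candidate,
                                EDGES_A, EDGES_B, anchors)
    print(candidate, score)
```

Here "Gastric_Carcinoma" has no string match in the second ontology, but its anchored neighbors point to "Carcinoma of stomach", which is the kind of evidence the structure step is meant to supply.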
[Figure: the two ontology trees to be aligned, one with nodes Root and Term-1 to Term-5, the other with nodes R and t1 to t7]
We might improve alignment …
Ontology [graph] structure based step
Provide anchors from annotated data
[Figure: the two ontology trees (Root, Term-1 to Term-5; R, t1 to t7) with the pairs Term-2/t1 and Term-5/t5 marked; sample S2 is linked to both t5 and Term-5]
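
One way to read "provide anchors from annotated data" is that a record mapped to a term in each ontology hints that those two terms correspond. The sketch below counts such co-annotations and keeps pairs seen at least `min_support` times. The threshold, the function name and the `snomed:`-prefixed labels are all hypothetical placeholders; real records would carry NCI Thesaurus and SNOMED-CT identifiers.

```python
from collections import Counter
from itertools import product


def candidate_anchors(records, min_support=2):
    """Count (NCI term, SNOMED term) pairs that annotate the same record
    and keep the pairs supported by at least min_support records."""
    counts = Counter()
    for nci_terms, snomed_terms in records:
        counts.update(product(sorted(nci_terms), sorted(snomed_terms)))
    return sorted(pair for pair, n in counts.items() if n >= min_support)


# Toy records: (terms matched in ontology 1, terms matched in ontology 2).
RECORDS = [
    ({"Leiomyosarcoma"}, {"snomed:leiomyosarcoma"}),
    ({"Leiomyosarcoma", "Lung_Sarcoma"}, {"snomed:leiomyosarcoma"}),
    ({"Hepatocellular_Carcinoma"}, {"snomed:liver-cell-carcinoma"}),
]

print(candidate_anchors(RECORDS))
# [('Leiomyosarcoma', 'snomed:leiomyosarcoma')]
```

Requiring more than one supporting record is a design choice made here to filter incidental pairings such as Lung_Sarcoma with the leiomyosarcoma label; lowering `min_support` to 1 keeps every co-annotated pair.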
Better Text-mapping, Better Alignment

                            2/17/2006  7/23/2006
Distinct Terms              783        791
Terms with NCI match        577        620
Terms with SNOMEDCT match   365        610
Terms with any match        641        654
Terms with both match       295        576

[Bar chart: the distinct-term counts above, 2/17/2006 vs 7/23/2006]
Summary
Ability to map word-groups to ontology terms
[Figures repeated from earlier slides: the samples-by-proteins and samples-by-genes matrices; the two ontology trees with anchors provided from annotated data and the ontology [graph] structure based step]
Credits and acknowledgements
Pathology: Robert Marinelli, Matt van de Rijn
Medical Informatics: Kaustubh Supekar, Daniel Rubin, Mark Musen
Funding: NIH