
From Bioinformatics to Health Information Technology

©Edited by Mingrui Zhang, CS Department, Winona State University, 2009

Outline

• What can we contribute to cancer research and treatment from Computer Science or Mathematics?

• How do we adapt our expertise for them?
– Introduction to lung cancer problems
– Brief review on microarray technology
– An existing computer algorithm, FCM
– Adaptation of FCM for biological problems
– Case and control study
– Software integration


Genomic Factors Associated with Lung Cancer


Lung Cancer

• Lung cancer is the leading cause of death among cancer victims in the United States.

– It claims more lives than colon, prostate, and breast cancer combined.

– Smoking is the most significant risk factor for lung cancer, yet only 10% of ever-smokers develop lung cancer.


Observed and projected lung cancer death rates, United States, 1930–2003

The observed death rates are based on data published by the National Center for Health Statistics, Centers for Disease Control. The dotted lines represent straight-line projections of the observed slope from 1950–1975 in men and from 1975–1990 in women. (http://tobaccocontrol.bmj.com)

National Expenditures for Medical Treatment for the Most Common Cancers

Based on cancer prevalence in 1998 and cancer-specific costs for 1997–1999, projected to 2004 using the medical care component of the Consumer Price Index. (http://progressreport.cancer.gov)

Low Survival Rate

Type      5-Year survival for all stages   Early Detection   Late Detection
Lung      14.9%                            48.7%             21%
Breast    86.6%                            97.0%             23.2%
Prostate  97.5%                            100%              34.0%
Colon     62.3%                            90.1%             9.2%


Lung Cancer

• Interesting questions:
1. Are there factors other than smoking that contribute to lung cancer?
2. How does second-hand smoke contribute to cancer?

• Goals:
– To predict the effectiveness of lung cancer treatments


Approach

• Case-control studies
– Cases are a group of lung cancer patients
– Controls are a group of people without lung cancer

• Identify causal factors for lung cancer other than smoking
– Environmental factors
– Genetic factors: metabolic pathway genes
– Interaction between environmental factors and the metabolic pathway genes
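A case-control comparison is commonly summarized with an odds ratio for each candidate factor. The sketch below is illustrative only: the function name and the counts in the usage note are hypothetical, and the confidence interval uses the standard log-based (Woolf) approximation rather than anything stated on the slides.

```python
import math

def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio and approximate 95% confidence interval for a 2x2 table.

    All four counts must be positive; this is a sketch, not a full analysis.
    """
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio), Woolf's method.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(ratio) - 1.96 * se)
    high = math.exp(math.log(ratio) + 1.96 * se)
    return ratio, (low, high)
```

For example, `odds_ratio(30, 70, 10, 90)` compares a made-up exposure seen in 30 of 100 cases and 10 of 100 controls. An interval that excludes 1 suggests an association, not proof of causation.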


Biomedical Informatics Research

[Diagram. Labels: Data Collection; Clinical Information; Blood; SNPs; Lung Tissue; Microarray; Genetic Information; Modeling; Data Mining Software Tool]


Microarray Technology: Genes Attributed to Cancer


Microarray Experiment

• To understand the roles certain genes play in the progression of cancer, cancer tissue is taken and used in a microarray experiment.


Gene Expression

• There are over 10,000 different probes used.
• Each dot represents the location of a gene probe.

Probe      GSM10966  GSM10966  GSM10966  GSM10966  GSM10966  GSM10966  GSM10966  GSM10966
Hs.544577  39513.15  37409.76  29715.27  20536.73  12636.55  30290.4   5380.344  8963.756
Hs.561260  9229.464  21894.5   16412.41  12636.55  17446.9   13930.68  5254.592  24600.7
Hs.567356  19056.91  32565.55  26372.36  19455.14  23552.87  22335.71  17281.12  21230.21
Hs.436873  35737.01  35313.64  38926     21348.13  23681.18  42374.79  28674.05  12128.16
Hs.517792  9468.588  14035.96  5919.428  5919.428  14174.17  11231.28  17788.3   24389.6
Hs.487027  29305.41  36906.58  21473.12  22962.5   41911.48  34204.18  28017.95  33684.04
Hs.1422    15283.12  25165     23287.42  15057.2   23223.62  15438.6   19509.78  8151.832
Hs.202453  15641.64  20993.95  18437.06  18131.26  20888.96  20207.75  18781.45  19350.3
Hs.592158  22163.46  37996.65  26291.84  30290.4   36422.28  30758.42  24600.7   23980.17
Hs.525622  15737.5   24801.41  17638.29  19056.91  23429.47  20444.12  20089.34  19633.77
Hs.226755  28674.05  33514.62  19799.32  24314.99  28297.22  41428.3   28781.67  20327.08
Hs.250687  13646.48  28297.22  19860.2   16085.3   16269.2   14468.99  34204.18  12725.62
Hs.425633  8644.256  5516.18   11838.31  6331.144  2745.636  4145.204  10042.56  5919.428
Hs.386168  9999.856  30404.73  16830.96  25076.12  23820.99  12172.11  9736.256  6141.688
Hs.584908  24600.7   27737.02  24923.88  24243.52  16573.38  27044.52  7832.452  10143.37
Hs.133379  31731.44  19350.3   12991.75  19118     31893.15  29930.01  15781.63  31497.45
Hs.150423  5061.944  12172.11  9780.32   10709.47  8246.392  5868.392  1589.784  3577.928
Hs.175473  33684.04  33356.45  24243.52  28569.07  43439.98  37409.76  22699.38  21473.12
Hs.83169   39184.98  31232.35  33026.59  6282.64   3898.764  31132.38  30883.74  10624.71
Hs.532325  2344.252  1622.224  4019.816  2998.332  997.068   2784.528  12269.02  1310.164
Hs.518267  13831.62  15188.12  8554.644  15587.79  18781.45  17593.42  7651.908  12588.82

Fuzzy Clustering

• The algorithm assigns a gene to a given number of clusters

• Each gene may belong to more than one cluster with different degrees of membership

• The method produces a set of cluster centroids and a membership table

A. Gasch and M. Eisen, "Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering," Genome Biology, vol. 3, pp. 1-22, 2002.

Fuzzy C-Means Clustering

• A set of N samples with their features: $X = \{x_1, x_2, \ldots, x_N\}^T$

• $x_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]$ is sample i with its p features
• A cluster $c_j = [c_{j1}, c_{j2}, \ldots, c_{jp}]$
• The fuzzy membership $u_{ij}$ of sample i to a cluster $c_j$


Fuzzy C-Means (FCM) Clustering

Randomly initialize the membership matrix $u_{ij}$.

Repeat until $\| u^{(t)} - u^{(t-1)} \| < \varepsilon$, for t = 1, 2, ...

Compute cluster centroids

$$ c_j^{(t)} = \frac{\sum_{i=1}^{N} \left( u_{ij}^{(t-1)} \right)^{m} x_i}{\sum_{i=1}^{N} \left( u_{ij}^{(t-1)} \right)^{m}}, \quad j = 1, 2, \ldots, C. $$

Find sets $I_i = \{\, j \mid 1 \le j \le C;\ d_{ij} = 0 \,\}$ and $\tilde{I}_i = \{1, 2, \ldots, C\} - I_i$.

Compute membership as

$$ u_{ij}^{(t)} = \begin{cases} \left[ \sum_{k=1}^{C} \left( \dfrac{d_{ij}^{(t)}}{d_{ik}^{(t)}} \right)^{\frac{2}{m-1}} \right]^{-1}, & I_i = \emptyset \\[2ex] 0 \ \text{for } j \in \tilde{I}_i, \ \text{with } \sum_{j \in I_i} u_{ij}^{(t)} = 1, & I_i \neq \emptyset \end{cases} $$
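The FCM iteration can be sketched in NumPy as follows. This is a minimal illustration with a Euclidean distance, not the authors' implementation. The function name, the default parameters, and the choice to split membership equally among centroids that coincide with a sample (one way to satisfy the sum-to-one condition in the singular case) are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Return (centroids, U); U[i, j] is the membership of sample i in cluster j."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships, each row normalized to sum to 1.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        U_prev = U
        # Centroid update: mean of the samples weighted by membership**m.
        W = U_prev ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        zero = d == 0.0
        on_centroid = zero.any(axis=1)
        # Regular case: u_ij = 1 / sum_k (d_ij / d_ik)**(2 / (m - 1)).
        inv = np.where(zero, 1.0, d) ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Singular case (assumed handling): a sample lying exactly on one or
        # more centroids shares its membership equally among them, 0 elsewhere.
        if on_centroid.any():
            hits = zero[on_centroid].astype(float)
            U[on_centroid] = hits / hits.sum(axis=1, keepdims=True)
        if np.linalg.norm(U - U_prev) < eps:
            break
    return centroids, U
```

Calling `fuzzy_c_means(X, n_clusters=3)` on an array with one row per gene returns the centroids and the membership table. Each row of the table sums to 1, so a gene can carry partial membership in several clusters.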

Adaptation of Fuzzy Clustering for Bioinformatics Problems


Kernels and Validity Indexes

• Different kernels/distance metrics
– Distance metrics: Euclidean distance based; Pearson correlation based
– Choice of fuzziness, m

• Different validity indexes
– Crisp: WCSS, FOM, etc.
– Fuzzy: Xie's index, partition coefficient, etc.
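Two of the fuzzy validity indexes can be computed directly from the membership table. The sketch below assumes "Xie's" refers to the Xie-Beni index and uses Euclidean distances; the function names are illustrative.

```python
import numpy as np

def partition_coefficient(U):
    """Mean squared membership: 1/C for a fully fuzzy partition, 1 for a crisp one."""
    return float(np.sum(U ** 2) / U.shape[0])

def xie_beni(X, centroids, U, m=2.0):
    """Compactness over separation; smaller values suggest a better partition."""
    # Squared Euclidean distance from every sample to every centroid.
    d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    compactness = np.sum((U ** m) * d2)
    # Smallest squared distance between two distinct centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    separation = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(separation, np.inf)
    return float(compactness / (X.shape[0] * separation.min()))
```

One common use is to run the clustering for several values of the cluster count C or the fuzziness m and compare the index values across runs.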

Different Versions of Fuzzy Clustering

• Methods are categorized according to the objective function and the metrics used in the method

Objective Function J   Metrics       m             Data Sets
K-means [3]            Correlation   2             Yeast
J-means [1]            Euclidean     1.15 ~ 1.75   Cancer, Blood
C-means [2, 5, 6]      Euclidean     1.1 ~ 2.54    Serum, Sporulation, Yeast, Cancer, Cell line


Adapting the Kernel

Initialization:
1. Classify genes into biological processes based on Gene Ontology terms;
2. Use pre-classified genes to initialize $c_j^{(0)}$ and the membership $u_{ij}$;
3. Normalize membership $u_{ij}$ by $\sum_{j=1}^{C} u_{ij} = 1$ for each gene i.

Apply FCM with a squared Pearson correlation distance, $d_{ij}^2 = 1 - \rho_{X_i, C_j}^2$, where $\rho_{X_i, C_j}$ is the Pearson correlation between a gene $x_i$ and a cluster $c_j$.
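The adapted kernel and the annotation-based initialization can be sketched as below. Several points are assumptions rather than details from the slides: the squared Pearson correlation distance is taken to be 1 − ρ², the Gene Ontology classification is represented as a binary gene-by-process matrix `A`, initial centroids are the mean profiles of the annotated genes, and unannotated genes start with equal membership in every cluster.

```python
import numpy as np

def squared_pearson_distance(X, centroids):
    """d2[i, j] = 1 - rho(x_i, c_j)**2 for gene profiles X and cluster centroids.

    Profiles with zero variance would divide by zero and are not handled here.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    Cc = centroids - centroids.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(Xc, axis=1)[:, None] * np.linalg.norm(Cc, axis=1)[None, :]
    rho = (Xc @ Cc.T) / norms
    return 1.0 - rho ** 2

def init_from_annotations(X, A):
    """A[i, j] = 1 if gene i is annotated to biological process j, else 0.

    Every process needs at least one annotated gene.
    """
    A = np.asarray(A, dtype=float)
    # Initial centroid of process j: mean profile of the genes annotated to it.
    centroids0 = (A.T @ X) / A.sum(axis=0)[:, None]
    # Genes without any annotation start with equal membership everywhere.
    U0 = np.where(A.sum(axis=1, keepdims=True) > 0, A, 1.0)
    # Normalize so that the memberships of each gene sum to 1.
    U0 = U0 / U0.sum(axis=1, keepdims=True)
    return centroids0, U0
```

Under this reading of the distance, perfectly anti-correlated profiles are as close as perfectly correlated ones, because the correlation is squared.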

Fuzzy WCSS Index


Gene Expression

References

1. Zhang, M., et al., 2008. A Fuzzy C-Means Algorithm Using a Correlation Metrics and Gene Ontology. In The 19th International Conference on Pattern Recognition, Tampa, Florida, USA.

2. Zhang, M., Zhang, W., Sicotte, H., and Yang, P., 2009. Validating a Correlation-Based Fuzzy C-means Clustering Algorithm. IEEE EMBC, submitted.

GEO Databases

• http://www.ncbi.nlm.nih.gov/geo/


Single Nucleotide Polymorphisms: Genomic Variations in Disease


Single Nucleotide Polymorphisms

• SNPs are single bases at a particular locus where individual people have differences in their sequences.

– SNPs are another form of genomic variation in a population


Population Based

• Each ethnic group has its own collection of SNPs.
• Human SNPs are classified by major or minor alleles.
– Major alleles are common to all humans
– Minor alleles are useful within an ethnic group

• You should know the average allele frequencies of the population you are studying!
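Allele frequencies for a single SNP can be estimated by counting alleles across the genotyped individuals. The sketch below assumes genotypes are given as two-letter strings such as "AG". The function names and the sample data in the usage note are hypothetical.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Map each allele to its frequency; genotypes are strings like "AG"."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def major_minor(genotypes):
    """Return ((major_allele, freq), (minor_allele, freq)) for one SNP."""
    ranked = sorted(allele_frequencies(genotypes).items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[0], ranked[-1]
```

For instance, `major_minor(["AA", "AG", "AG", "GG", "AA"])` counts six A alleles and four G alleles, so A is the major allele in that made-up sample. The same SNP can show different frequencies in a different population.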


HapMap


HapMap Project

• The International HapMap Consortium has identified >1 million SNPs
– Samples from four populations
– 1 SNP every 2 kb of genomic sequence


Use SNPs as Markers

• SNPs are reliable markers
– Most genes contain at least one SNP
– Combinations of alleles are associated with particular diseases.

• Study of evolution
– Understand how a subpopulation adapted to the environment by comparing the differences in their SNPs

• DNA fingerprinting for criminal or parental verification.
• Genotype-specific medication


HapMap Data and Haploview


Haploview


The Goal is to Determine the Best Treatments or to Improve Patient's Quality of Life


Prototype Software Architecture

[Architecture diagram. Labels:
– EHR database (Mayo Clinic); EHR database (Clinic X); ...
– View: Input Form; Presentation (prediction output); CSS Styles; Current Patient's Data
– Model: Prediction Model (R); Variable Definition (XML)
– Controller: Model Manager (Model Interface, JRI); View Manager (Web Form Generator, Presentation Generator)]
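The idea of a web form generator driven by an XML variable definition can be sketched as follows. The XML schema, the variable names, and the `/predict` action are all hypothetical, and the sketch is in Python for illustration only; it does not show the prototype's actual interface to its R models.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical variable definition; the real schema is not given in the slides.
VARIABLES_XML = """
<variables>
  <variable name="age" label="Age at diagnosis" type="number"/>
  <variable name="smoking" label="Smoking status" type="choice">
    <option>never</option>
    <option>former</option>
    <option>current</option>
  </variable>
</variables>
"""

def generate_form(xml_text, action="/predict"):
    """Build an HTML input form with one field per <variable> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for var in root.findall("variable"):
        name = escape(var.get("name"), quote=True)
        label = escape(var.get("label", var.get("name")))
        if var.get("type") == "choice":
            options = "".join(
                f"<option>{escape(opt.text or '')}</option>"
                for opt in var.findall("option")
            )
            field = f'<select name="{name}">{options}</select>'
        else:
            field = f'<input type="number" name="{name}">'
        rows.append(f"<label>{label} {field}</label>")
    form_open = f'<form method="post" action="{escape(action, quote=True)}">'
    return form_open + "".join(rows) + "</form>"
```

Keeping the variable list in a separate definition file means the form can follow a change of prediction model without hand-editing the page, which appears to be the motivation for the Variable Definition and Web Form Generator components.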


Prototype Web-based Tool

References

1. Zhang, M., Olson, S., Francioni, J., Gegg-Harrison, T., Meng, N., Sun, Z., and Yang, P., 2009. Integrating R Models with Web Technologies. HEALTHINF 2009, Porto, Portugal, January 2009.

2. Gegg-Harrison, T., Zhang, M., Meng, N., Sun, Z., and Yang, P., 2009. Porting a Cancer Treatment Prediction to a Mobile Device. IEEE EMBC, submitted.