Knowledge Discovery - IST Department at RIT (rpv/local/syllabi/discovery/KnowledgeDiscovery1.pdf)
Knowledge Discovery
Our goal
data ... to information ... to knowledge ... to understanding (wisdom)
Why do we need Knowledge Discovery?
• Data explosion: web usage, automated data collection tools, mature database technology
• Too much data and too little knowledge
• Humans not able to sift through the data effectively
• Computational approaches to data analysis are required for the continually increasing, accumulated data
Potential Applications
• Market analysis, customer relationship management
• Risk analysis and management
• Fraud detection
• Text mining newsgroups, email, documents
• Web mining of logs, data streams for customization, advertising, marketing
• Biology and Medicine - many types of high-throughput data for diagnostics, predictive and personalized medicine
Even Better
Consult the Domain Expert(s)
The Process
• Guided Discovery
  – PBL
  – Knowledge Discovery
  – Learn through examples and practice
• Same general approach may be applied to many different problem domains
• Select appropriate methods to customize approach
• No one right answer!
Running Example of KD
• Gene Expression Data
• Why a good example?
  – Biotechnology advances created huge influx of data
  – Biologists not equipped to analyze the data
  – Computational scientists didn’t understand the biology
  – KDD process sorely needed
  – Has significantly advanced over the last 10 years
Papers
• Data preprocessing and transformation
  – Quackenbush
• Need for standards
  – MAGE-ML
  – www.mged.org
• Mining large datasets for patterns
  – Molecular Classification of Cancer
  – Golub et al.
A Typical Scenario
• Biologist designs and runs an experiment and delivers samples (along with $$) to the Functional Genomics lab for high-throughput gene expression analysis. A couple of weeks later the biologist picks up a CD with multiple files containing the raw data and some preprocessed data… not knowing how to analyze the data, the biologist calls in your help…
• Where do we start?
  – Understand the domain and the problems
High Throughput Systems for Studying Global Gene Expression are Complex
• Need to learn about and consider:
  – the biology behind the experiments & the interpretation of the experiments
  – how the data is acquired (biotechnology)
  – the data issues
Biology Basics: The Flow of Information
A gene is expressed in 2 steps:
• DNA is transcribed into RNA (mRNA)
• RNA is translated into protein
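As a toy illustration of the two steps, the sketch below transcribes a short coding-strand DNA string into mRNA and then translates it codon by codon. The example sequence and the four-entry codon table are chosen only for illustration (the real genetic code has 64 codons), and the function names are hypothetical.

```python
# Toy codon table: only the codons used in the example (the full table has 64).
CODON_TABLE = {"AUG": "M", "GCU": "A", "UGG": "W", "UAA": "*"}


def transcribe(dna_coding_strand: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCTTGGTAA")  # step 1: DNA -> mRNA
protein = translate(mrna)          # step 2: mRNA -> protein
print(mrna, protein)
```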
Genotype to Phenotype
• Individual cells in an organism have the same genes (DNA)
  – the genotype
but…. not all genes are active (expressed) in each cell
• It is the expression of thousands of genes and their products (RNA, proteins), functioning in a complicated and orchestrated way, that makes a specific cell what it is.
  – the phenotype
Gene Expression Depends on Context
• The subsets of genes that are expressed (RNA/protein) will differ among cells, tissues, organs, conditions…
  – the subset expressed confers unique properties to the cell
[Figure: example cell types - muscle, neuron, liver]
Differential Gene Expression
• The level of expression of genes also differs with the cellular context
• i.e. the amount of a given RNA will vary
• We can think of gene expression (in higher organisms) as having both an “on/off” switch and a “volume” control
What Biologists Want to Know: Specific Patterns of Gene Expression
• Tissue/Cell type-specific
  – e.g. skin cell vs. brain cell
  – e.g. keratinocyte vs. melanocyte
• Developmental stage
  – e.g. embryonic skin cell vs. adult skin cell
• Disease state
  – e.g. normal skin cell vs. skin tumor cell
• Environment-specific (drugs, toxins)
  – e.g. skin cell untreated vs. treated
But also, the more difficult problem: Gene Networks
• Genes and their products are related through their roles in:
  – metabolic pathways
  – cell signalling networks
Metabolic Pathway
From KEGG Database
Cell Signalling Networks
www.mpi-dortmund.mpg.de/departments/dep1/signaltransduktion/image3.gif
What can we learn by studying global patterns of gene expression?
• Individual gene expression patterns
• Classifications: for diagnosis, prediction…
  – Groups of Genes
  – Molecular taxonomy of disease
• Gene Networks/Pathways:
  – Reconstruction of metabolic & regulatory pathways
Now that we have some understanding of the domain and goals…
• What about the data?
  – How are the data generated?
  – Data type?
  – Data quality?
  – Need for data cleaning and preprocessing?
Knowledge Discovery Process
Consult the Domain Expert(s)
GeneChip® Oligonucleotide Array
High-throughput gene expression analysis
Recall that DNA and RNA are composed of strings of nucleotides
• A gene of interest will have a specific nucleotide sequence
• DNA and RNA sequences can form bonds with complementary bases on another string - called base-pairing.
• When we do this experimentally we call it hybridization and we can detect it by labeling one of the strings (aka strands)
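Base-pairing can be sketched as string matching: a probe can pair with a target wherever the target contains the probe's reverse complement. The snippet below is a simplified model that assumes DNA letters only and perfect matches; real hybridization tolerates some mismatches and depends on experimental conditions. The sequences and function names are made up for illustration.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the strand that would base-pair with `dna`, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def can_hybridize(probe: str, target: str) -> bool:
    """True if the target contains the exact reverse complement of the probe."""
    return reverse_complement(probe) in target.upper()


probe = "ATGC"
target = "TTGCATAA"
print(reverse_complement(probe), can_hybridize(probe, target))
```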
GeneChip® Expression Analysis
Hybridization and Staining
[Figure labels: Array; cRNA Target; Hybridized Array; Streptavidin-phycoerythrin conjugate]
Courtesy of M. Hessner, CAAGED Workshop
How do Affymetrix microarrays work?
• 12-20 probes are picked to “interrogate” a gene; the idea is to get multiple measurements. Each probe is a 25-mer oligonucleotide that binds to a gene
• The collection of probes that are designed to hybridize to the same gene is called a “probe set”…. there may be tens of thousands of these probe sets on a given chip
• Probe sets have identification names called “Affymetrix IDs”, which look like “10329_g_at”, etc. On any GeneChip, some probe sets are dedicated to “Quality Control”; these begin with “AFFX_”
• Take-home message: have to learn a lot of terminology
Affymetrix Chips
300,000 “Probes”
Perfect Match and Mismatch
Average Difference Values
Courtesy of J. Glasner, CAAGED Workshop
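One simple reading of an "Average Difference" value is the mean of PM minus MM over the probe pairs of a probe set. The sketch below assumes exactly that and leaves out any outlier trimming or other refinements that vendor software may apply; the intensities are invented for illustration.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of (perfect match - mismatch) over the probe pairs of one probe set."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM must contain the same number of probes")
    return mean(p - m for p, m in zip(pm, mm))


# Hypothetical intensities for four probe pairs of a single probe set.
pm = [520, 610, 480, 700]
mm = [200, 250, 230, 300]
avg_diff = average_difference(pm, mm)
print(avg_diff)
```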
Affymetrix Analysis
• High resolution image of the scanned microarray generates a DAT file
• Since the probes are laid out in a grid fashion, and each probe position is determined in terms of its X-Y co-ordinates, one can compute the PM and MM probe intensities from the pixelated image
• The CDF (chip description file) library file contains the XY layout of every probe
Affymetrix Data Flow
[Flow diagram labels: Hybridized GeneChip; Scan Chip; DAT file; Process Image (GCOS); CEL file; CDF file; MAS5 (GCOS); CHP file; TXT file; RPT file; EXP file]
GeneChip Operating Software (GCOS) - Affymetrix
http://www.affymetrix.com/products/software/specific/gcos.affx
Affymetrix File Types
• DAT file:
  – Raw (TIFF) optical image of the hybridized chip
• CDF File (Chip Description File):
  – Provided by Affy, describes layout of chip
• CEL File:
  – Processed DAT file (intensity/position values)
  – http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/cel.html
• CHP File:
  – The “CHP” file contains summarized gene expression scores after probe cells are analyzed;
  – format is:
    Gene           Avg.D   Presence
    AFFX_CreX_at   48      A
    AFFX_BioB_at   149     P
• TXT File:
  – Probe set expression values with annotation (CHP file in text format)
• RPT File
  – Generated by Affy software, report of QC info
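The three-column layout shown for the CHP/TXT output can be read with a few lines of code. This sketch parses only the whitespace-separated text rendering from the slide, not the actual CHP file format, and it assumes the Presence codes "P" and "A" stand for present and absent calls. The function name is hypothetical.

```python
def parse_expression_rows(text):
    """Parse 'Gene Avg.D Presence' rows into dictionaries, skipping the header."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        gene, avg_diff, call = line.split()
        rows.append({"gene": gene, "avg_diff": float(avg_diff), "call": call})
    return rows


sample = """Gene Avg.D Presence
AFFX_CreX_at 48 A
AFFX_BioB_at 149 P"""

rows = parse_expression_rows(sample)
# Assumption: "P" marks a probe set called present.
present = [row["gene"] for row in rows if row["call"] == "P"]
# Probe sets whose IDs begin with "AFFX_" are the quality-control sets.
qc_sets = [row["gene"] for row in rows if row["gene"].startswith("AFFX_")]
print(present, qc_sets)
```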
Knowledge Discovery Process
Consult the Domain Expert(s)
Data Quality
• Most data mining techniques can tolerate some level of imperfection in the data, but improving data quality can improve quality of analyses
• Main issues
  – Noise
  – Outliers
  – Missing values
  – Duplicate data
  – Inconsistent data
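A few of these issues can be checked mechanically before any mining. The sketch below counts missing values and duplicate IDs and flags outliers using the median absolute deviation; the cutoff of 5, the record layout, and the sample values are assumptions chosen for illustration, not a recommended standard.

```python
from statistics import median


def quality_report(records, mad_cutoff=5.0):
    """Summarize missing values, duplicate IDs, and outliers in (id, value) records."""
    ids = [record_id for record_id, _ in records]
    values = [value for _, value in records if value is not None]
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(value - center) for value in values)
    outliers = [v for v in values if mad > 0 and abs(v - center) / mad > mad_cutoff]
    return {
        "missing": len(records) - len(values),
        "duplicate_ids": len(ids) - len(set(ids)),
        "outliers": outliers,
    }


# Hypothetical measurements: one missing value, one repeated ID, one extreme value.
records = [("g1", 10.0), ("g2", 11.0), ("g3", None), ("g4", 10.5),
           ("g2", 11.0), ("g5", 95.0), ("g6", 9.5)]
report = quality_report(records)
print(report)
```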
There are Many Problems Facing Expression Analysis on the Biotech side
• Standardization & quality control in the experiments (affects data quality at many levels)
• Cost
Problem in reproducibility of experimental data
• Lots of variation in arrays
  – more than 100 experimental steps
• Sources of variation
  – biological variability in each RNA extract
  – each labeling reaction is different
  – each slide is a separate hybridization
  – spots on the slide are variable across slides (and within slides when double spotted)
  – each “color” is scanned separately
• Need Replicates and Statistics!
Outcome
• “Noisy” data
• Data preprocessing is necessary
  – normalization
  – scaling
• Heavy reliance on statistics today
What do the spots (intensity measurements) represent?
• Fluorescence intensity is a measure of the relative abundance of individual mRNAs (expressed genes) in given samples
  – e.g. experimental relative to control
• But, gene expression experiments are run on “multiple samples”. Why?
• We are trying to understand a dynamic process - each sample only represents a “snapshot”
  – Compare among samples (different arrays)
  – Compare across a time-course of related samples
How can we use the data?
• For microarrays, we can only really depend on between-sample fold change, not absolute values or within-sample comparisons (>1.3-2.0 fold change, in general)
• Take-home message: Have to be careful when comparing between arrays; from experiment to experiment….
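Fold change between two samples is usually handled on a log2 scale so that increases and decreases are symmetric. The sketch below assumes a 2.0-fold cutoff, the upper end of the range quoted above; the intensities and function names are illustrative.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """log2 of the treated/control ratio: +1 means doubled, -1 means halved."""
    return math.log2(treated / control)


def is_changed(treated: float, control: float, min_fold: float = 2.0) -> bool:
    """True if the gene changed by at least `min_fold` in either direction."""
    return abs(log2_fold_change(treated, control)) >= math.log2(min_fold)


print(log2_fold_change(400.0, 100.0))  # four-fold increase
print(is_changed(120.0, 100.0))        # 1.2-fold: below the cutoff
print(is_changed(50.0, 100.0))         # two-fold decrease: meets the cutoff
```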
Pre-processing
• Gene filtering
  – control genes
  – uninformative genes
• Normalization and scaling
  – allows comparisons across arrays
  – scaling to control dynamic range
• Transformation
  – logarithmic transformation for improved statistical properties
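The three steps can be chained in a short pipeline. The sketch below drops control probe sets by their "AFFX_" prefix, rescales each array to a common median, and applies a log2 transform. Median scaling and the target value of 100 are assumptions chosen for simplicity; they are one of many possible normalization schemes, and the data are invented.

```python
import math
from statistics import median


def preprocess(arrays, gene_ids, control_prefix="AFFX_", target_median=100.0):
    """Filter control genes, scale each array to a common median, then log2-transform."""
    keep = [i for i, gene in enumerate(gene_ids) if not gene.startswith(control_prefix)]
    kept_ids = [gene_ids[i] for i in keep]
    processed = []
    for values in arrays:
        kept = [values[i] for i in keep]
        scale = target_median / median(kept)  # per-array scaling factor
        processed.append([math.log2(value * scale) for value in kept])
    return kept_ids, processed


# Hypothetical data: the second array is uniformly twice as bright as the first.
gene_ids = ["AFFX_BioB_at", "gene_a", "gene_b", "gene_c"]
arrays = [[149.0, 50.0, 100.0, 200.0],
          [300.0, 100.0, 200.0, 400.0]]
kept_ids, processed = preprocess(arrays, gene_ids)
print(kept_ids, processed)
```

After scaling, the two arrays in this example become identical, which is the point of normalization: a global brightness difference between arrays is removed before genes are compared.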
Normalization
[Scatter plot axes: Cy3 signal (log2), Cy5 signal (log2)]
Take-home Message
• Important to remember that once preprocessing, normalization, transformation of the data have occurred, all downstream mining will be affected.
Data Representation
• Flat file
• Vector data
• Sparse matrix (text) data
• Sequence data (e.g. web or genomic)
• Time series
• Image data
• Spatio-temporal
Three levels of microarray gene expression data processing
Brazma et al., Nature Genetics, 29:365-371, 2001
Outcomes of Microarray Analysis
Large, complex data sets of high dimensionality
  – example of a routine study: 50,000 “genes” from 20 samples - approx. 1-2 × 10^6 pieces of data
Challenges for Bioinformatics
• annotation, storage, retrieval, sharing of data
• information from the data
Knowledge Discovery Process
Consult the Domain Expert(s)
State of Microarray Data
• Wide availability of technology has given rise to a large number of distributed databases
• data scattered among many independent sites (accessible via Internet) or not publicly available at all
• Need for standardization!
MGED Group and Standardization Issues
• Microarray Gene Expression Database (MGED) Group
  www.mged.org
• MGED is taking on the challenge of standardization
• Four major projects
MGED Projects
• MIAME - The formulation of the minimum information about a microarray experiment required to interpret and verify the results.
• MAGE - The establishment of a data exchange format (MAGE-ML) and object model (MAGE-OM) for microarray experiments.
MGED Projects
• Ontologies - The development of ontologies for microarray experiment description and biological material (biomaterial) annotation in particular.
• Normalization - The development of recommendations regarding experimental controls and data normalization methods.
MAGE-ML
• the XML representation of the MAGE-OM
• the DTD (document type definition) is what is specified in MAGE-ML
  – rules or declarations
  – what tags can be used
  – what tags contain
• MAGE-OM
  • http://www.mged.org/Workgroups/MAGE/mage-om.html
• mapping of microarray experimental workflow to the OM
• DTD
  • http://www.omg.org/docs/dtc/03-05-03.dtd
• MAGE-STK software toolkit
  – defines an API to MAGE-OM
  – in Java, Perl, C++
• Used to
  – export data to MAGE-ML
  – store data in relational database
  – input data to analysis tools
• Reader: MAGE-ML docs into objects
• Writer: objects into MAGE-ML
Knowledge Discovery Process
Consult the Domain Expert(s)
Data Mining Techniques
• Exploratory data analysis
• Descriptive modeling
• Predictive modeling
• Pattern discovery
• others
Exploratory Data Analysis
• Interactive and visual
• Insight and feel for the data in a broad sense
  – Provide summaries
    • e.g. max/min, mean/median, variance etc.
  – Visualization
    • Histograms, scatter plots
• Useful for data validation or verification
• Simple exploratory data analysis is invaluable
  – Always get a cursory view of the data before applying data mining algorithms
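A cursory view of this kind needs very little code. The sketch below computes the summaries listed above and a text-only histogram using the standard library; the six values and the bin width of 5 are arbitrary illustrative choices.

```python
from statistics import mean, median, variance


def summarize(values):
    """Return the basic summaries: min, max, mean, median, and sample variance."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "variance": variance(values),
    }


def text_histogram(values, bin_width):
    """Map each bin's lower edge to a bar of '#' characters, one per value."""
    counts = {}
    for value in values:
        start = (value // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return {start: "#" * n for start, n in sorted(counts.items())}


values = [1, 8, 12, 16, 12, 8]  # illustrative expression values
summary = summarize(values)
histogram = text_histogram(values, 5)
print(summary)
print(histogram)
```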
Pattern Discovery
• Discover interesting local patterns in data rather than characterizing data globally
• Market basket data
  – Discover that if customers buy wine and bread, they buy cheese with a 0.9 probability
  – Known as association rules
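The 0.9 in the wine-and-bread example is what association-rule mining calls the confidence of the rule: among baskets containing the antecedent, the fraction that also contain the consequent. The sketch below computes it for a made-up set of baskets constructed to reproduce that value; the data and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction with the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    both = [t for t in with_antecedent if consequent <= set(t)]
    return len(both) / len(with_antecedent)


# Made-up baskets: 10 contain wine and bread, and 9 of those also contain cheese.
baskets = ([{"wine", "bread", "cheese"}] * 9
           + [{"wine", "bread"}]
           + [{"bread", "milk"}] * 2)
rule_confidence = confidence(baskets, {"wine", "bread"}, {"cheese"})
rule_support = support(baskets, {"wine", "bread", "cheese"})
print(rule_confidence, rule_support)
```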
Descriptive Modeling
• Build model for underlying process
  – Simulate the data if needed
• Cluster analysis to find natural groups in the data
• Bayesian network to find dependency models among variables
Predictive Modeling
• Predict a variable Y, given a p-dimensional vector X
  – Classification: Y is categorical
  – Regression: Y is real-valued
• Much like function approximation
  – Learning the relationship between Y and X
• Statistics and machine learning have many algorithms for predictive modeling
  – Emphasis is often on predictive accuracy rather than understanding the model itself.
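As a minimal instance of classification (categorical Y, p-dimensional X), the sketch below fits a nearest-centroid classifier: it averages the training vectors of each class and assigns a new vector to the closest average. This is only one of many possible algorithms, picked here for brevity; the two-gene data and the "normal"/"tumor" labels are invented.

```python
import math


def fit_centroids(X, y):
    """Learn one mean vector (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical training data: expression of two genes in four samples.
X = [[1.0, 9.0], [2.0, 8.0], [9.0, 1.0], [8.0, 2.0]]
y = ["normal", "normal", "tumor", "tumor"]
centroids = fit_centroids(X, y)
print(predict(centroids, [2.0, 7.0]), predict(centroids, [7.0, 3.0]))
```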
Mining of Expression Data
Recall that:
• A gene expression pattern derived from a single microarray is simply a snapshot (one experimental sample vs reference)
• Usually want to understand a process or changes in expression over a collection of samples
gene expression profile
Working with Gene Expression Data
• Hypothesis-driven approaches
  – Typically model-oriented
  – Descriptive statistics relying on prior knowledge and good design
• Discovery-based
  – Few, if any, a priori hypotheses
  – Data-driven and algorithm-oriented
  – Statistical algorithms
  – Machine learning using heuristic techniques
Testing Hypotheses
• Based on prior biological knowledge
• Simplest
  – look for individual differentially expressed genes
  – fold changes
• Scatter plot
• Statistical measures
Scatter plot
Some simple statistics
• If we are looking at samples that seem to belong to two groups or conditions
• t-test compares the means of two groups while accounting for the standard error of the difference of the means
• ANOVA if want to extend the analysis to more than two groups
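The t statistic described above is the difference of the group means divided by the standard error of that difference. The sketch below uses the unequal-variance (Welch) form of the standard error, which is an assumption since the slides do not say which variant is meant; it stops at the statistic and does not compute a p-value. The expression values are invented.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Difference of means divided by the standard error of that difference."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical log-expression values of one gene in two conditions.
treated = [8.0, 9.0, 10.0]
control = [5.0, 6.0, 7.0]
t_statistic = welch_t(treated, control)
print(t_statistic)
```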
But, gene chips allow us to measure thousands of genes....
• Across multiple samples
![Page 67: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/67.jpg)
Goal of Analysis of Expression Matrix
• Some statistical methods applied to:
1. "Group" similar genes together => groups of functionally similar genes.
2. "Group" similar cell samples together.
3. "Extract" representative genes in each group.
![Page 68: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/68.jpg)
Typical approach
• Look for patterns
  – compare rows to find evidence for co-regulation of genes
  – compare columns to find evidence for relatedness among samples
1) Choose a measure of similarity (distance) among the objects being compared: each row or column is considered a vector in space
2) Then, group together objects (genes or samples) with similar properties: this is a multidimensional analysis
![Page 69: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/69.jpg)
69
An experiment
• 12 Genes
• Expression values at 0, 2, 4, 6, 8 and 10 hours
![Page 70: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/70.jpg)
70
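Table 4.2 of Campbell/Heyer

| Name | 0 hrs | 2 hrs | 4 hrs | 6 hrs | 8 hrs | 10 hrs |
|------|-------|-------|-------|-------|-------|--------|
| C | 1 | 8 | 12 | 16 | 12 | 8 |
| D | 1 | 3 | 4 | 4 | 3 | 2 |
| E | 1 | 4 | 8 | 8 | 8 | 8 |
| F | 1 | 1 | 1 | .25 | .25 | .1 |
| G | 1 | 2 | 3 | 4 | 3 | 2 |
| H | 1 | .5 | .33 | .25 | .33 | .5 |
| I | 1 | 4 | 8 | 4 | 1 | .5 |
| J | 1 | 2 | 1 | 2 | 1 | 2 |
| K | 1 | 1 | 1 | 1 | 3 | 3 |
| L | 1 | 2 | 3 | 4 | 3 | 2 |
| M | 1 | .33 | .25 | .25 | .33 | .5 |
| N | 1 | .125 | .0833 | .0625 | .0833 | .125 |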
![Page 71: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/71.jpg)
71
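Take logs

| Name | 0 hrs | 2 hrs | 4 hrs | 6 hrs | 8 hrs | 10 hrs |
|------|-------|-------|-------|-------|-------|--------|
| C | 0 | 3.0 | 3.58 | 4.0 | 3.58 | 3.0 |
| D | 0 | 1.58 | 2.0 | 2.0 | 1.58 | 1.0 |
| E | 0 | 2.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| F | 0 | 0 | 0 | -2.0 | -2.0 | -3.32 |
| G | 0 | 1.0 | 1.58 | 2.0 | 1.58 | 1.0 |
| H | 0 | -1.0 | -1.6 | -2.0 | -1.6 | -1.0 |
| I | 0 | 2.0 | 3.0 | 2.0 | 0 | -1.0 |
| J | 0 | 1.0 | 0 | 1.0 | 0 | 1.0 |
| K | 0 | 0 | 0 | 0 | 1.58 | 1.58 |
| L | 0 | 1.0 | 1.58 | 2.0 | 1.58 | 1.0 |
| M | 0 | -1.6 | -2.0 | -2.0 | -1.6 | -1.0 |
| N | 0 | -3.0 | -3.59 | -4.0 | -3.59 | -3.0 |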
• Compare
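The logged values are consistent with a base-2 logarithm of the expression ratios (for example, 8 maps to 3.0 and .25 maps to -2.0). A minimal sketch of that transform for three of the rows, assuming base 2 and rounding to two decimals:

```python
import math

# Expression ratios for three genes, copied from Table 4.2.
ratios = {
    "C": [1, 8, 12, 16, 12, 8],
    "F": [1, 1, 1, 0.25, 0.25, 0.1],
    "N": [1, 0.125, 0.0833, 0.0625, 0.0833, 0.125],
}

# log2 turns an n-fold increase into +log2(n) and an n-fold decrease
# into -log2(n), so up- and down-regulation become symmetric around 0.
log_ratios = {
    gene: [round(math.log2(value), 2) for value in values]
    for gene, values in ratios.items()
}

for gene, values in log_ratios.items():
    print(gene, values)
```

After the transform, gene C (induced up to 16-fold) and gene N (repressed up to 16-fold) become near mirror images of each other.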
![Page 72: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/72.jpg)
72
How Similar are two Rows?
• How similar are the expressions of two genes?
• First we'll normalize each row
• Calculate the mean and standard deviation for each gene
• Normalize each value by subtracting the mean and dividing by the standard deviation.
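The normalization steps above can be sketched as follows. The helper name `normalize_row` is hypothetical, and using the population standard deviation (rather than the sample one) is an assumption, since the slides do not say which is intended.

```python
import statistics


def normalize_row(values):
    """Return z-scores: subtract the row mean, divide by the row SD."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation (divides by N).
    sd = statistics.pstdev(values)
    return [(value - mean) / sd for value in values]


row_g = [1, 2, 3, 4, 3, 2]  # gene G from Table 4.2
z_scores = normalize_row(row_g)
print([round(z, 3) for z in z_scores])
```

A normalized row has mean 0 and standard deviation 1, so two genes can be compared by the shape of their profiles rather than by magnitude. A row with constant expression has a standard deviation of 0 and would need special handling.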
![Page 73: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/73.jpg)
73
How Similar are two Rows?
• Calculate the Pearson Correlation between pairs of rows
• Correlation quantifies the extent to which the expression patterns of two genes go up or down together, regardless of their magnitudes.
• Calculated by taking the dot product of the two normalized vectors and dividing by the number of values

> (pc '( 1 2 3 4 3 2 ) ; row G
      '( 1 2 3 4 3 2 )) ; row L
1.0
> (pc '( 1 2 3 4 3 2 ) ; row G
      '( 1 3 4 4 3 2 )) ; row D
0.8971499589146109
![Page 74: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/74.jpg)
74
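Some other pairs

| Name | 0 hrs | 2 hrs | 4 hrs | 6 hrs | 8 hrs | 10 hrs |
|------|-------|-------|-------|-------|-------|--------|
| C | 1 | 8 | 12 | 16 | 12 | 8 |
| D | 1 | 3 | 4 | 4 | 3 | 2 |
| E | 1 | 4 | 8 | 8 | 8 | 8 |
| F | 1 | 1 | 1 | .25 | .25 | .1 |
| G | 1 | 2 | 3 | 4 | 3 | 2 |
| H | 1 | .5 | .33 | .25 | .33 | .5 |
| I | 1 | 4 | 8 | 4 | 1 | .5 |
| J | 1 | 2 | 1 | 2 | 1 | 2 |
| K | 1 | 1 | 1 | 1 | 3 | 3 |
| L | 1 | 2 | 3 | 4 | 3 | 2 |
| M | 1 | .33 | .25 | .25 | .33 | .5 |
| N | 1 | .125 | .0833 | .0625 | .0833 | .125 |

> (pc '( 1 3 4 4 3 2) ; row D
      '( 1 .33 .25 .25 .33 .5)) ; row M
-0.9260278787295065
> (pc '( 1 2 3 4 3 2) ; row G
      '( 1 .5 .33 .25 .33 .5)) ; row H
-0.9090853650855358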
![Page 75: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/75.jpg)
75
Pearson Correlation
• pc(G,L) = 1: identically expressed genes
• pc(G,D) = .897: similarly expressed genes
• pc(D,M) = -.926: reciprocally expressed
• pc(G,H) = -.909: also reciprocally expressed
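The slides do not show the definition of `pc`. The sketch below is one way to write it in Python using the standard Pearson formula; it is not the original Lisp code, only an illustration that yields the same four values from the rows of Table 4.2.

```python
import math


def pc(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [value - mean_x for value in x]
    dy = [value - mean_y for value in y]
    # Dot product of the centered vectors, scaled by their lengths.
    dot = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return dot / norm


rows = {
    "D": [1, 3, 4, 4, 3, 2],
    "G": [1, 2, 3, 4, 3, 2],
    "H": [1, 0.5, 0.33, 0.25, 0.33, 0.5],
    "L": [1, 2, 3, 4, 3, 2],
    "M": [1, 0.33, 0.25, 0.25, 0.33, 0.5],
}

for first, second in [("G", "L"), ("G", "D"), ("D", "M"), ("G", "H")]:
    print(f"pc({first},{second}) = {pc(rows[first], rows[second]):.3f}")
```

Centering and scaling inside `pc` is equivalent to normalizing each row first and then taking the dot product of the normalized rows.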
![Page 76: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/76.jpg)
Descriptive and Predictive Modeling
• Clustering
• Feature extraction/selection
• Classification - discrimination analysis
![Page 77: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/77.jpg)
Analytic Approaches
• Clustering: identification of associations between data points; organization of data into groups
• Unsupervised Clustering: genes are clustered by similarity/correlation, or other criteria based on the X-values. No external information about the Y-variables (the response) is used → doesn't reveal groups of genes with special interest for tissue discrimination
• Supervised Methods: grouping of variables (genes), controlled by information about both the X and Y variables → supervised algorithms try to find gene clusters whose average expression profile has great potential for explaining the response Y, i.e. for tissue discrimination
![Page 78: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/78.jpg)
• Unsupervised Clustering Algorithms
  – Hierarchical
  – K-means
  – Self-organizing maps
  – Others
![Page 79: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/79.jpg)
Gene Expression Matrix & Hierarchical Clustering
Eisen et al., http://www.pnas.org/cgi/content/full/95/25/14863
[Figure: gene expression matrix with genes as rows and samples as columns]
![Page 80: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/80.jpg)
Theory
• Hierarchical Clustering works by sequentially joining the two nearest clusters, then joining the next two closest clusters, and so on, so that the nearest clusters are joined first and the farthest clusters last.
• Initially each individual data point forms its own cluster
![Page 81: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/81.jpg)
Hierarchical Clustering Algorithm
• Given a set of N items to be clustered, and an N*N distance (or similarity) matrix.
1. Start by assigning each item to a cluster, so that if you have N items, you will now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be defined as the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster. You now have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
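The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an efficient implementation: the function name is hypothetical, cluster distances are taken as the minimum item-to-item distance (single linkage), and they are recomputed on each pass rather than updated incrementally.

```python
def hierarchical_clustering(distance):
    """Agglomerative clustering on an N x N distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a tuple of item indices.
    """
    # Step 1: every item starts as its own cluster.
    clusters = [(i,) for i in range(len(distance))]
    merges = []
    while len(clusters) > 1:  # Step 4: repeat until one cluster remains.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Step 3 (assumed single linkage): closest pair of members.
                d = min(distance[a][b]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        # Step 2: merge the closest pair of clusters.
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional items; distance is the absolute difference.
points = [0.0, 1.0, 5.0, 6.5]
distance = [[abs(p - q) for q in points] for p in points]

merges = hierarchical_clustering(distance)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The ordered list of merges is exactly the information a dendrogram displays: which clusters joined, and at what distance.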
![Page 82: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/82.jpg)
Hierarchical in action
![Page 83: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/83.jpg)
Variations of Hierarchical Algorithm
• Step 3 (computing distances between the new cluster and each of the old clusters) can be done in several different ways: single linkage, average linkage and complete linkage.
• In single linkage the distance between clusters is equal to the shortest distance from any one member of one cluster to any one member of the other cluster.
• In average linkage the distance between two clusters is defined as the average distance from any member of one cluster to any member of the other cluster.
• In complete linkage the distance between clusters is the maximum distance from any one member of the first cluster to any one member of the second cluster.
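The three linkage rules differ only in how they summarize the member-to-member distances. A minimal sketch, with hypothetical function names and made-up one-dimensional clusters:

```python
from itertools import product


def single_linkage(cluster_a, cluster_b, dist):
    """Shortest distance between any member of a and any member of b."""
    return min(dist(a, b) for a, b in product(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b, dist):
    """Longest distance between any member of a and any member of b."""
    return max(dist(a, b) for a, b in product(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b, dist):
    """Mean distance over all pairs with one member from each cluster."""
    total = sum(dist(a, b) for a, b in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))


def abs_diff(a, b):
    return abs(a - b)


# Hypothetical clusters of one-dimensional items.
cluster_a = [1.0, 2.0]
cluster_b = [4.0, 7.0]

print("single:  ", single_linkage(cluster_a, cluster_b, abs_diff))
print("average: ", average_linkage(cluster_a, cluster_b, abs_diff))
print("complete:", complete_linkage(cluster_a, cluster_b, abs_diff))
```

For the same pair of clusters, single linkage never exceeds average linkage, which never exceeds complete linkage, so the choice can change which clusters are merged next.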
![Page 84: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/84.jpg)
Variations of Hierarchical Algorithm
• Self Organizing Tree Algorithm
  – Unsupervised neural network with a binary tree topology
  – Combination of SOM and hierarchical clustering
  – Run time is approximately linear
    • Faster than normal hierarchical method
  – Uses divisive method
    • In comparison to bottom-up method of hierarchical
![Page 85: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/85.jpg)
Advantages
• Hierarchical clustering results in a visual representation that is convenient for humans to analyze
• Unlike k-means and SOM, it does not require the number of clusters to be specified a priori
![Page 86: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/86.jpg)
Why cluster analysis may not be "the" answer
• Clustering methods typically require user inputs:
  Example: distance measure
• Clustering methods differ in the way that the number of clusters is specified.
• Clustering methods are often sensitive to the initialization condition (starting guess)
• Local vs. global sampling of clustering space
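The distance-measure point can be made concrete with rows from Table 4.2. The sketch below asks which of genes C and D is closest to gene G under two measures; the helper names are hypothetical, and defining correlation distance as 1 minus the Pearson correlation is one common convention, assumed here.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """1 - Pearson correlation: small when profiles share a shape."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    dot = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return 1 - dot / norm


gene_g = [1, 2, 3, 4, 3, 2]
candidates = {
    "C": [1, 8, 12, 16, 12, 8],
    "D": [1, 3, 4, 4, 3, 2],
}


def nearest(metric):
    """Name of the candidate gene closest to G under the given metric."""
    return min(candidates, key=lambda name: metric(gene_g, candidates[name]))


nearest_euclidean = nearest(euclidean)
nearest_correlation = nearest(correlation_distance)
print("Euclidean nearest to G:  ", nearest_euclidean)
print("Correlation nearest to G:", nearest_correlation)
```

Euclidean distance favors D, whose values are close in magnitude to G, while correlation distance favors C, whose profile has nearly the same shape at a larger scale. The same data can therefore cluster differently depending on this one user input.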
![Page 87: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/87.jpg)
Cluster Analysis Challenges
• "Noise" in the data itself
• Large data sets
  – most of the techniques currently used were not developed for multidimensional data
• What about networks?
  – limitation of cluster analysis: similarity in expression pattern suggests co-regulation but doesn't reveal cause-effect relationships
![Page 88: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/88.jpg)
Feature Selection & Classification
• First, identify features (genes) that discriminate between classes
• Then use features for classification
  – machine learning approach
  – supervised analysis
  – assignment of a new sample to a previously specified class, based on sample features and a trained classifier
![Page 89: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/89.jpg)
"Classic" Example: Classification of AML vs. ALL
• Biological/Clinical Problems:
  – previously, no single reliable test to distinguish them
  – differ greatly in clinical course & response to treatments
• Comparing 2 acute leukemias
  – acute myeloid leukemia (AML)
  – acute lymphoid leukemia (ALL)
Golub et al., Science Oct 15 1999: 531-537
![Page 90: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/90.jpg)
Study Design
![Page 91: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/91.jpg)
![Page 92: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/92.jpg)
The prediction of a new sample is based on 'weighted votes' of a set of informative genes
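A rough sketch of weighted voting, based on my reading of the scheme described by Golub et al.: each informative gene gets a weight (difference of class means divided by the sum of class standard deviations) and a decision boundary (midpoint of the class means), and votes for one class in proportion to how far the new sample lies from that boundary. The function names, the use of the sample standard deviation, and the training values are all assumptions for illustration; the 0.3 prediction-strength threshold is the one quoted in the results.

```python
import statistics


def train_gene(class1_values, class2_values):
    """Return (weight, boundary) for one gene from training samples."""
    mu1, mu2 = statistics.mean(class1_values), statistics.mean(class2_values)
    sd1, sd2 = statistics.stdev(class1_values), statistics.stdev(class2_values)
    weight = (mu1 - mu2) / (sd1 + sd2)  # signal-to-noise style weight
    boundary = (mu1 + mu2) / 2          # midpoint between class means
    return weight, boundary


def predict(sample, gene_params, threshold=0.3):
    """Classify a sample; return (label, prediction strength)."""
    votes_class1 = votes_class2 = 0.0
    for value, (weight, boundary) in zip(sample, gene_params):
        vote = weight * (value - boundary)
        if vote > 0:
            votes_class1 += vote
        else:
            votes_class2 += -vote
    total = votes_class1 + votes_class2
    if total == 0:
        return "uncertain", 0.0
    strength = abs(votes_class1 - votes_class2) / total
    winner = "class1" if votes_class1 > votes_class2 else "class2"
    return (winner if strength >= threshold else "uncertain"), strength


# Made-up training values for two informative genes.
gene_params = [
    train_gene([5, 6, 7], [1, 2, 3]),  # higher in class1
    train_gene([1, 2, 3], [6, 7, 8]),  # higher in class2
]

print(predict([6.5, 2.0], gene_params))  # both genes vote for class1
print(predict([4.5, 4.9], gene_params))  # votes cancel out
```

When the genes disagree and their votes nearly cancel, the prediction strength falls below the threshold and the sample is reported as uncertain rather than forced into a class.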
![Page 93: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/93.jpg)
Results of the study
1) Clustering of microarray data using tumors of known type
   found 1100 of 6817 genes correlated with class distinction
2) Formation of a class predictor = 50 most informative genes used as a training set
   classification of unknown tumors
![Page 94: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/94.jpg)
Results
How to test the validity of class predictors?
• Cross-validation tests: The 50-gene predictor assigned 36 of the 38 samples as either AML or ALL and the remaining two as uncertain (PS < 0.3). All 36 predictions agreed with the patients' clinical diagnosis;
• Independent test: The 50-gene predictor was applied to an independent collection of 34 leukemia samples. The predictor assigned 29 of the 34 samples, and the accuracy was 100%;
• Prediction strength: median PS = 0.77 in cross-validation and 0.73 in independent test (Fig. 3A).
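One common form of cross-validation is leave-one-out: hold out one sample, train on the rest, predict the held-out sample, and repeat for every sample. The sketch below assumes that form and substitutes a toy nearest-mean classifier on a single made-up feature for the 50-gene predictor, so it shows only the validation loop, not the study's method.

```python
def nearest_mean_classifier(training):
    """Build a predictor from (value, label) pairs using class means."""
    by_label = {}
    for value, label in training:
        by_label.setdefault(label, []).append(value)
    means = {label: sum(vals) / len(vals) for label, vals in by_label.items()}
    # Predict the label whose class mean is closest to the new value.
    return lambda value: min(means, key=lambda lab: abs(value - means[lab]))


def leave_one_out_accuracy(samples):
    """Fraction of samples predicted correctly when each is held out."""
    correct = 0
    for i, (value, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # withhold sample i
        predict = nearest_mean_classifier(training)
        if predict(value) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical single-feature samples from two classes.
samples = [
    (1.0, "A"), (1.5, "A"), (2.0, "A"),
    (5.0, "B"), (5.5, "B"), (6.0, "B"),
]

accuracy = leave_one_out_accuracy(samples)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

The key discipline is that the held-out sample plays no part in training, so the accuracy estimates how the predictor would behave on samples it has not seen.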
![Page 95: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/95.jpg)
Results
Class discovery
• If the AML-ALL distinction were not already known, could it have been discovered simply on the basis of gene expression?
![Page 96: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/96.jpg)
Results
Two-cluster analysis
(1). Cluster tumors by gene expression:
• A two-cluster SOM was applied to automatically group the 38 initial leukemia samples into two classes on the basis of the expression pattern of all 6817 genes.
![Page 97: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/97.jpg)
Results
Determine whether putative classes produced are meaningful.
• The clusters were first evaluated by comparing them to the known AML-ALL classes (Fig. 4A). Class A1 contained mostly ALL (24 of 25 samples) and class A2 contained mostly AML (10 of 13 samples). The SOM was thus quite effective at automatically discovering the two types of leukemia.
![Page 98: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/98.jpg)
Results
• How could one evaluate such putative clusters if the "right" answer were not already known?
Class discovery could be tested by class prediction: if putative classes reflect true structure, then a class predictor based on these classes should perform well.
![Page 99: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/99.jpg)
![Page 100: Knowledge Discovery - IST Department at RITrpv/local/syllabi/discovery/KnowledgeDiscovery1.pdfThe Process • Guided Discovery – PBL – Knowledge Discovery – Learn through examples](https://reader031.vdocuments.site/reader031/viewer/2022021821/5b05d1627f8b9a79538b8e05/html5/thumbnails/100.jpg)