Advanced Gene Mapping Course
Rockefeller University, NY
Subrata Paul
Genetics is to Statistics what Physics is to Mathematics
Genetics is a leading motivation for the development of new basic statistics.
Topics Covered
• Population- and family-based association studies
• Data QC
• Rare variant association analysis
• Detecting interaction
• Imputation
• Meta-analysis
• Linear mixed model
• eQTL mapping
• Evolutionary genetics
• Incorporating functionality in rare variant association
• Missing heritability

• PLINK
• VAT
• GenABEL
• BEAM3
• CASSI
• MACH
• MINIMAC
• METAL
• GCTA-MLMA
• GERP
• GenMAPP
• Etc.
Instructors
• Heather Cordell, Institute of Genetic Medicine, Newcastle University, UK
• Suzanne M. Leal, Baylor College of Medicine
• Goncalo Abecasis, University of Michigan School of Public Health
• Nancy J. Cox, Vanderbilt Genetics Institute
• Shamil Sunyaev, Department of Medicine, Harvard Medical School
GWAS: WTCCC
Wellcome Trust Case Control Consortium
• 7 different diseases: bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 and type 2 diabetes
• 2000 cases for each disease
• Common population-based controls
• Found signals for 6 of the 7 diseases
• Expanded to WTCCC2 and WTCCC3 with 5200 common controls
Data QC:
• Low call rates; excess heterozygosity
• X chromosome markers useful for checking gender
• Checking relationship and ethnicity
• Mendelian misinheritances
• Hardy-Weinberg disequilibrium
• Minor allele frequency
Data QC: Call Rates and Heterozygosity
Inbreeding
Sample Contamination
Assessing Sex
• Samples recorded as male with an excess of heterozygous SNPs on the X chromosome can denote:
  • Females mislabeled as males
  • Males with Klinefelter syndrome (XXY)
• Samples recorded as female with an excess of homozygous genotypes on the X chromosome can denote:
  • Males mislabeled as females
  • Females with Turner syndrome (a single X)
• Sex discrepancies can also be observed due to sample mix-ups
• Samples for which the sex is incorrect should be removed from the analysis (the sample is probably not from the person you think it is)
Data QC:
• Ethnicity
QQ Plots (good)
QQ Plots (bad)
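Genomic Inflation Factor
• The genomic inflation factor is the ratio of the median of the observed test statistics to the median expected under the null hypothesis, and is usually denoted $\lambda$
• No inflation of the test statistics: $\lambda = 1$
• Inflation: $\lambda > 1$
• Deflation: $\lambda < 1$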
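A minimal sketch of the $\lambda$ calculation described above, assuming the test statistics are 1-degree-of-freedom chi-square statistics (so the expected median under the null is about 0.4549). The function name and the input values are made up for illustration.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """Return lambda = median(observed statistics) / expected median under the null."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

# Hypothetical 1-df association statistics, for illustration only.
stats = [0.02, 0.15, 0.46, 1.10, 3.90]
lam = genomic_inflation(stats)
print(f"lambda = {lam:.3f}")
```

In a real GWAS the list would hold one statistic per SNP; a value of $\lambda$ noticeably above 1 points to stratification or other confounding.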
Population Stratification
• The population sampled actually consists of several sub-populations that do not intermix
• Can lead to spurious associations (false positives, i.e. type I errors) in case/control studies
• Solutions:
  • PCA
  • MDS (Multidimensional Scaling), also known as principal coordinates analysis
Population Stratification (PCA)
• Compute the eigenvectors and eigenvalues of the matrix of correlations between individuals (based on IBD or IBS)
• Include principal component scores from the top 10 (say) eigenvectors as covariates in a logistic regression analysis
• Plotting the first principal components (first two) lets you visualize ethnic outliers
• Linear Mixed Model
  • Estimate the kinship matrix (IBD sharing) between pairs of individuals using genome-wide genotype data
  • Use this to model their (extra) correlation, in a linear regression type analysis
Pop Stratification (Variance components models)
• An alternative approach based on variance components models has been proposed
  • Kang et al. (2010) Nat Genet 42:348-354
  • Zhang et al. (2010) Nat Genet 42:355-360
• Based on methods designed to test for genotype associations with quantitative traits: linear regression
$y = \mu + \beta x + \epsilon$
where $y$ is the trait value, $x$ is a variable coding for genotype, and $\epsilon \sim N(0, \sigma^2)$ is the residual error
Variance Components (mixed) models
• Linear mixed models allow this idea to be applied to related individuals
• $\epsilon \sim MVN(0, V)$, where the variance/covariance matrix $V$ follows the standard variance components model, accounting for known kinship
  • $V_{ij} = \sigma_a^2 + \sigma_e^2$ for $i = j$
  • $V_{ij} = 2\Phi_{ij}\sigma_a^2$ for $i \neq j$
• $\sigma_a^2$ and $\sigma_e^2$ represent the additive polygenic variance (due to all loci) and the environmental (= error) variance, respectively
• $\Phi_{ij}$ is half the expected IBD sharing between individuals $i$ and $j$ (= their kinship coefficient)
• Closely related to QTDT (Abecasis et al. 2000a; b), which implements a slightly more general/complex model
• Software to implement: GenABEL, EMMAX, FaST-LMM, GEMMA, MMM
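The covariance structure of the variance components model above can be sketched as follows. This is only an illustration: the function name, the kinship values and the two variance components are assumptions, not output from any of the packages listed.

```python
def covariance_matrix(kinship, var_a, var_e):
    """Build V with V_ii = var_a + var_e and V_ij = 2 * Phi_ij * var_a for i != j."""
    n = len(kinship)
    V = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                V[i][j] = var_a + var_e
            else:
                V[i][j] = 2.0 * kinship[i][j] * var_a
    return V

# Hypothetical kinship matrix: individuals 0 and 1 are full siblings
# (kinship coefficient 1/4), individual 2 is unrelated to both.
kinship = [
    [0.50, 0.25, 0.00],
    [0.25, 0.50, 0.00],
    [0.00, 0.00, 0.50],
]
V = covariance_matrix(kinship, var_a=0.6, var_e=0.4)
for row in V:
    print(row)
```

With these assumed values the siblings share covariance 2 × 0.25 × 0.6 = 0.3, while the unrelated individual has zero covariance with both.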
Linear Mixed Model (detailed)
$Y_i = \sum_j \beta_j X_{ij} + \epsilon$
$X_{ij}$: normalized genotype of individual $i$ at SNP $j$
In matrix form:
$\bar{y} = X\bar{\beta} + \epsilon$
Two important matrices:
$LD = \frac{1}{M} X^T X$
$GRM = \frac{1}{N} X X^T$
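A small sketch of how the two matrices can be computed from a genotype matrix. The slide does not define the normalising constants, so this sketch assumes rows of $X$ are individuals and columns are SNPs, divides $XX^T$ by the number of SNPs and $X^TX$ by the number of individuals, and uses explicit names (`n_ind`, `n_snp`) instead of $N$ and $M$. The genotype values are hypothetical.

```python
from statistics import mean, pstdev

def standardise(G):
    """Centre and scale each SNP column (rows = individuals, columns = SNPs).
    Assumes every SNP is polymorphic, so no column has zero variance."""
    n_ind, n_snp = len(G), len(G[0])
    X = [[0.0] * n_snp for _ in range(n_ind)]
    for j in range(n_snp):
        col = [G[i][j] for i in range(n_ind)]
        mu, sd = mean(col), pstdev(col)
        for i in range(n_ind):
            X[i][j] = (col[i] - mu) / sd
    return X

def grm(X):
    """Genetic relationship matrix: X X^T divided by the number of SNPs."""
    n_ind, n_snp = len(X), len(X[0])
    return [[sum(X[a][j] * X[b][j] for j in range(n_snp)) / n_snp
             for b in range(n_ind)] for a in range(n_ind)]

def ld(X):
    """SNP-by-SNP correlation (LD) matrix: X^T X divided by the number of individuals."""
    n_ind, n_snp = len(X), len(X[0])
    return [[sum(X[i][a] * X[i][b] for i in range(n_ind)) / n_ind
             for b in range(n_snp)] for a in range(n_snp)]

# Hypothetical genotypes (0/1/2 minor-allele counts): 4 individuals x 3 SNPs.
G = [[0, 1, 2],
     [1, 1, 0],
     [2, 0, 1],
     [1, 2, 1]]
X = standardise(G)
K = grm(X)  # 4 x 4, individuals by individuals
R = ld(X)   # 3 x 3, SNPs by SNPs
```

Because each column is standardised, the LD matrix has ones on its diagonal and the GRM diagonal averages to one.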
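Linear Mixed Model (detailed)
Our model:
$Y_i = \sum_j \beta_j X_{ij} + \epsilon$
We have to fit markers individually:
$Y_i = \beta_1 X_{i1} + \sum_{j \ge 2} \beta_j X_{ij} + \epsilon \approx \beta_1 X_{i1} + \epsilon'$
For each SNP we can fit the model:
$Y_i = \beta X_i + u_i + \epsilon$, with $\epsilon \sim N(0, I\sigma^2)$ and $u \sim MVN(0, GRM)$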
ROADTRIPS
• Robust Association-Detection Test for Related Individuals with Population Substructure
• Thornton and McPeek (2010) AJHG 86:172-184
• Extension of MQLS (Maximum Quasi-Likelihood Statistic)
• Both methods construct an adjusted version of the case/control $\chi^2$ (or Armitage trend) test
• Both use known pedigree relationships to correct for relatedness
• ROADTRIPS also uses a covariance matrix based on kinship/IBD sharing to correct for unknown relatedness/population stratification
Complex Trait: Rare Variants
• MRV (Multiple Rare Variant hypothesis): complex traits are the result of multiple rare variants with a large phenotypic effect
• Large effect size compared to common variants
• Although these variants are rare, collectively they may be quite common
• Strong evidence that rare variants play an important role
Functional Rare Variants
Kiezun, Garimella, Do, Stitziel et al., Nature Genetics 2012
Analysis of Rare Variants
• Difficulties
  • Lack of a rare variant catalog with reference genotypes
  • Large sample size needed
    • Sampling an allele with frequency 0.5% or 0.05% with probability 99% needs 460 or 4,600 individuals, respectively
  • Better analytical toolbox needed to gain power
  • Common variants have only a limited capacity to tag rare variants
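The sample sizes quoted above for observing a rare allele can be reproduced with a short calculation. This sketch assumes diploid individuals whose two chromosomes are sampled independently, so the chance of seeing at least one copy among $n$ individuals is $1 - (1 - f)^{2n}$; the slide does not state the formula, so treat this as one plausible derivation.

```python
import math

def individuals_needed(freq, prob=0.99):
    """Smallest number of diploid individuals n with 1 - (1 - freq)**(2n) >= prob."""
    chromosomes = math.log(1.0 - prob) / math.log(1.0 - freq)
    return math.ceil(chromosomes / 2.0)

for freq in (0.005, 0.0005):
    print(f"allele frequency {freq:.2%}: {individuals_needed(freq)} individuals")
```

Under these assumptions the results land at roughly 460 and 4,600 individuals, in line with the figures on the slide.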
• Single Marker Test
  • Chi-square test
  • Cochran-Armitage test for trend
• Multiple Marker Test
  • Hotelling's T^2
  • Logistic regression
  • Minimal p-value
Single Marker Test
• For case-control data, possible methods: chi-squared, Fisher's exact, Cochran-Armitage trend, logistic regression (linear regression)
• Fisher's exact test is recommended when there are small counts
• Regression analysis allows controlling for confounders
• Correction for multiple comparisons is needed
  • Controlling the FWER results in a loss of power
  • Obtain empirical p-values by random permutation, or control the FDR (sequential Bonferroni-type procedure)
• Sample size must be very large for sufficient power
  • Need 6,400, 54,000 and 540,000 samples for MAF 0.1, 0.01 and 0.001 to get 80% power
• Success example: insulin processing; sample size 8,000; variants in SGSM2 with MAF = 1.4%, $p = 8.7 \times 10^{-1?}$ and MADD with MAF = 3.7%, $p = 7.6 \times 10^{-1?}$
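As an illustration of one of the single-marker tests listed above, here is a minimal sketch of the Cochran-Armitage trend test on a 2 × 3 genotype table with additive scores (0, 1, 2). The genotype counts are hypothetical, and the p-value uses the 1-degree-of-freedom chi-square tail written via `math.erfc`.

```python
import math

def cochran_armitage(cases, controls, scores=(0, 1, 2)):
    """Trend test on genotype counts; returns (chi-square statistic, p-value)."""
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    # Score-weighted difference between case and control counts.
    T = sum(t * (r * S - s * R) for t, r, s in zip(scores, cases, controls))
    # Variance of T under the null hypothesis of no association.
    var = sum(t * t * n * (N - n) for t, n in zip(scores, totals))
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            var -= 2 * scores[i] * scores[j] * totals[i] * totals[j]
    var *= R * S / N
    chi2 = T * T / var
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value

# Hypothetical counts of genotypes with 0, 1 and 2 copies of the minor allele.
cases = [60, 30, 10]
controls = [75, 22, 3]
chi2, p_value = cochran_armitage(cases, controls)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

In a genome-wide scan this p-value would still need the multiple-comparison correction discussed above.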
Multiple Marker Tests
• Multiple regression: reduced degrees of freedom
• Hotelling's two-sample $T^2$ test:
  • Reduction of power with number of variants
  • Greatly affected by MAF
  • Identified risk allele (direction) is needed
• MDMR (Multivariate Distance Matrix Regression)
  • Uses genetic similarity of individuals
  • Don't need to identify the risk allele at each variant
• KBAT (Kernel-Based Association Test)
  • Based on a genotype similarity score between individuals, measured by a kernel function
  • No assumption about direction
  • Can handle correlated and/or independent SNPs
Gene-Based Aggregation Tests
• Regression based tests
• Burden tests (collapsing)
• Adaptive burden tests
• Variance component tests
• Combination of the above
• Evaluate cumulative effects of multiple variants
• CMC (Combined Multivariate and Collapsing)
CMC
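A rough sketch of the collapsing idea behind burden tests such as CMC: each individual is reduced to a carrier indicator (at least one rare allele in the gene), and carrier frequencies are compared between cases and controls. This shows only the collapsing step with a plain 2 × 2 chi-square, not the full CMC procedure, and all genotypes are made up.

```python
import math

def collapse(genotypes):
    """Return 1 for an individual carrying any rare allele in the gene, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def collapsing_test(case_geno, control_geno):
    """Compare carrier counts in cases and controls; returns (statistic, p-value)."""
    case_carriers = sum(collapse(case_geno))
    control_carriers = sum(collapse(control_geno))
    stat = chi2_2x2(case_carriers, len(case_geno) - case_carriers,
                    control_carriers, len(control_geno) - control_carriers)
    return stat, math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 df

# Hypothetical rare-variant genotypes: rows = individuals, columns = variants.
case_geno = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 1]]
control_geno = [[0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
stat, p_value = collapsing_test(case_geno, control_geno)
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```

With counts this small an exact test would be the safer choice; the chi-square is used here only to keep the sketch short.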
Recent Methods and Summary
• CMC: jointly assesses the role of common and rare variants
• WSS: weighted sum statistic
• KBAC: kernel-based adaptive cluster test (weighting scheme)
• SKAT: sequence kernel association test
• Power to detect association depends on:
  • The number and proportion of causal variants
  • Population frequency
  • Their effect sizes and directionality
  • Number of genes contributing to the trait
  • The fraction of causal variants located (by sequencing, e.g. exome seq)
Recent Methods and Summary
• Statistical tests are sensitive to disease architecture
• Different tests show strength for different effect size distributions:
  • WSS: $1/x(1 - x)$, where $x$ is the population frequency
  • SKAT: $\beta(x; a_1, a_2)$ for pre-specified $a_1, a_2$
• Allow opposite effects on traits
  • Step-up, C-alpha, the replication-based test, SKAT
Software Packages
Gene × Gene Interaction
Testing for Interaction
• Logistic (linear) regression for case/control data
  • '--epistasis' in PLINK
• More powerful: case-only analysis
  • Interaction ⟺ correlation between relevant predictors
  • Test the null hypothesis that the two loci are independent (no correlation)
  • Chi-square test of independence
  • Gains power from the assumption that the two loci are independent in the population
• Preferable to incorporate case-only and case-control estimators into a single test (greater power than logistic); --fast-epistasis in PLINK performs such a test
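The case-only idea above can be sketched as a chi-square test of independence on the 3 × 3 table of genotypes at two loci, counted among cases only. This is a generic illustration, not what PLINK computes internally. The counts are hypothetical, and the p-value uses the closed-form tail for 4 degrees of freedom, $e^{-x/2}(1 + x/2)$.

```python
import math

def chi2_independence(table):
    """Pearson chi-square statistic for independence in an r x c count table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def chi2_sf_4df(x):
    """Upper-tail probability of a chi-square distribution with 4 df."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Hypothetical genotype counts among cases only:
# rows = genotype at locus A (0/1/2), columns = genotype at locus B (0/1/2).
cases_table = [[40, 30, 10],
               [30, 40, 20],
               [10, 20, 30]]
stat = chi2_independence(cases_table)
p_value = chi2_sf_4df(stat)
print(f"chi2 = {stat:.2f}, p = {p_value:.2e}")
```

A small p-value here is only evidence of interaction if the two loci really are independent in the general population, which is the assumption the case-only design relies on.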
PLINK --fast-epistasis
• Exhaustive search: use GPUs; suffers from multiple testing
• Data mining approach: use cross-validation to avoid overfitting
• Multifactor Dimensionality Reduction (Ritchie et al. (2001) AJHG)
• Random Forest (CART)
• Penalized regression methods (Zhu et al. (2014))
• Entropy based methods
• BEAM (Zhang et al. (2007))
• Bayesian model selection
• MCMC, MECPM (Jiang and Neapolitan (2015))
Other Techniques
THANK YOU