Advanced Gene Mapping Course (math.ucdenver.edu/~spaul/empty/hostedfiles/...)

Post on 29-Mar-2020

Advanced Gene Mapping Course

Rockefeller University, NY

Subrata Paul

Genetics for Statistics is what Physics is for Mathematics

Genetics is a leading motivation for development of new basic statistics.

Topics Covered

• Population and family based association studies
• Data QC
• Rare variant association analysis
• Detecting Interaction
• Imputation
• Meta-analysis
• Linear mixed model
• eQTL mapping
• Evolutionary genetics
• Incorporate functionality in rare variant association
• Missing heritability

• PLINK
• VAT
• GenABEL
• BEAM3
• CASSI
• MACH
• MINIMAC
• METAL
• GCTA-MLMA
• GERP
• GenMAPP
• Etc.

Instructors

• Heather Cordell, Institute of Genetic Medicine, Newcastle University, UK
• Suzanne M. Leal, Baylor College of Medicine
• Goncalo Abecasis, Univ of Michigan School of Public Health
• Nancy J. Cox, Vanderbilt Genetics Institute
• Shamil Sunyaev, Department of Medicine, Harvard Medical School

GWAS: WTCCC

Wellcome Trust Case Control Consortium

• 7 different diseases: bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 and type 2 diabetes
• 2000 cases for each disease
• Common population-based controls
• Found signals in 6 out of 7 diseases
• Expanded to WTCCC2 and WTCCC3 with 5200 common controls

Data QC:

• Low call rates; excess heterozygosity
• X chromosome markers useful for checking gender
• Checking relationship and ethnicity
• Mendelian misinheritances
• Hardy-Weinberg disequilibrium
• Minor Allele Frequency

Data QC: Call rates and Heterozygosity

Inbreeding

Sample Contamination

Assessing Sex

• Males with an excess of heterozygous SNPs on the X chromosome can denote
  • Females mislabeled as males
  • Males with Klinefelter syndrome

• Females with an excess of homozygous genotypes on the X chromosome can denote
  • Males mislabeled as females
  • Females with Turner syndrome

• Can be observed due to sample mix-ups
• Samples for which the sex is incorrect should be removed from the analysis (probably not the person you think it is)
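
The sex check described in these bullets can be sketched as a comparison of the reported sex with the heterozygosity rate on the X chromosome. This is a minimal illustration, not the procedure of any particular tool: the function names, the 0/1/2 genotype coding, and both thresholds are assumptions made for the example.

```python
def x_heterozygosity(genotypes):
    """Fraction of non-missing X-chromosome genotypes that are heterozygous.

    Genotypes are coded 0, 1, 2 (copies of the minor allele); None is missing.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    return sum(1 for g in called if g == 1) / len(called)


def check_sex(reported_sex, genotypes, male_max_het=0.02, female_min_het=0.10):
    """Return 'ok' or 'discordant' for one sample.

    Both thresholds are illustrative assumptions, not values from the slides.
    """
    het = x_heterozygosity(genotypes)
    if reported_sex == "M" and het > male_max_het:
        return "discordant"  # reported male but heterozygous on X
    if reported_sex == "F" and het < female_min_het:
        return "discordant"  # reported female but almost fully homozygous on X
    return "ok"


# Hypothetical samples: (reported sex, X-chromosome genotypes).
samples = {
    "s1": ("M", [0, 2, 2, 0, 2, 0, 0, 2]),     # no heterozygous calls
    "s2": ("F", [0, 1, 2, 1, 1, 0, 2, 1]),     # many heterozygous calls
    "s3": ("M", [1, 1, 0, 2, 1, 1, None, 1]),  # reported male, heterozygous on X
}
flags = {name: check_sex(sex, g) for name, (sex, g) in samples.items()}
```

A discordant flag only says that the label and the genotypes disagree; it does not distinguish a mislabeled sample from a sex-chromosome aneuploidy or a sample mix-up.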

Data QC:

• Ethnicity

QQ Plots (good)

QQ Plots (bad)

Genomic Inflation Factor

• Genomic Inflation Factor is the ratio of the median of the test statistics to the expected median and is usually represented as $\lambda$
• No inflation of the test statistics: $\lambda = 1$
• Inflation: $\lambda > 1$
• Deflation: $\lambda < 1$
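
As a small worked example of this ratio, the sketch below computes $\lambda$ from a list of test statistics. It assumes the statistics are chi-square distributed with 1 degree of freedom under the null, which the slides do not state explicitly; the function name and the simulated inputs are illustrative.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 df: the square of the
# 75th percentile of a standard normal (about 0.455).
EXPECTED_MEDIAN_1DF = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation_factor(chi2_stats):
    """lambda = median of observed statistics / median expected under the null."""
    return median(chi2_stats) / EXPECTED_MEDIAN_1DF


# Null-like statistics: squared standard-normal quantiles on an even grid.
n = 999
null_stats = [NormalDist().inv_cdf((i + 0.5) / n) ** 2 for i in range(n)]
lam_null = genomic_inflation_factor(null_stats)

# Uniformly inflated statistics, as unaccounted stratification might produce.
lam_inflated = genomic_inflation_factor([1.2 * s for s in null_stats])
```

Here `lam_null` comes out very close to 1, and scaling every statistic by 1.2 scales $\lambda$ by the same factor.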

Population Stratification

• Population sampled actually consists of several sub-populations that do not intermix
• Can lead to spurious false positives (type 1 errors) in case/control studies
• Solutions:
  • PCA
  • MDS (Multidimensional Scaling), also known as principal coordinates analysis

Population Stratification (PCA)

• Compute the eigenvectors and eigenvalues of the matrix of correlations between individuals (based on IBD or IBS)
• Include principal component scores from the top 10 (say) eigenvectors as covariates in a logistic regression analysis
• Plotting the first principal components (first two) lets you visualize ethnic outliers
• Linear Mixed Model
  • Estimate kinship matrix (IBD sharing) between pairs of individuals using genome-wide genotype data
  • Use this to model their (extra) correlation, in a linear regression type analysis
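
The PCA step in the bullets above can be sketched on a toy genotype matrix. This is a dependency-free illustration under stated assumptions: genotypes are coded 0/1/2, the individual-by-individual matrix is built from standardized genotypes rather than from IBD or IBS estimates, and only the leading eigenvector is extracted, by power iteration. All function names and the data are hypothetical.

```python
import random


def standardize_columns(genotypes):
    """Center and scale each SNP (column); monomorphic SNPs are dropped."""
    n = len(genotypes)
    kept = []
    for column in zip(*genotypes):
        mean = sum(column) / n
        var = sum((g - mean) ** 2 for g in column) / n
        if var > 0:
            kept.append([(g - mean) / var ** 0.5 for g in column])
    return [list(row) for row in zip(*kept)]


def correlation_between_individuals(x):
    """N x N matrix of average genotype products across SNPs."""
    m = len(x[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / m for rj in x] for ri in x]


def top_eigenvector(matrix, iterations=200, seed=0):
    """Leading eigenvector by power iteration (unit length, arbitrary sign)."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in matrix]
    for _ in range(iterations):
        w = [sum(k * vj for k, vj in zip(row, v)) for row in matrix]
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    return v


# Hypothetical data: individuals 0-2 and 3-5 come from two sub-populations
# with very different allele frequencies at four SNPs.
genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
    [0, 0, 2, 2],
    [0, 0, 2, 1],
    [0, 1, 2, 2],
]
pc1 = top_eigenvector(correlation_between_individuals(standardize_columns(genotypes)))
```

The entries of `pc1` take one sign for the first three individuals and the opposite sign for the last three, which is the kind of separation a plot of the first components would show. In an association analysis these scores would enter the regression as covariates.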

Pop Stratification (Variance components models)

• An alternative approach based on variance components models has been proposed
  • Kang et al. (2010) Nat Genet 42:348-354
  • Zhang et al. (2010) Nat Genet 42:355-360

• Based on methods designed to test for genotype associations with quantitative traits: linear regression

$y = \mu + \beta x + \epsilon$

where

• $y$ is the trait value
• $x$ is a variable coding for genotype
• $\epsilon \sim N(0, \sigma^2)$ is the residual error

Variance Components (mixed) models

• Linear mixed models allow this idea to be applied to related individuals
• $\epsilon \sim MVN(0, V)$, where the variance/covariance matrix $V$ follows the standard variance components model, accounting for known kinship

• $V_{ij} = \sigma_a^2 + \sigma_e^2$ for $i = j$
• $V_{ij} = 2\Phi_{ij}\sigma_a^2$ for $i \neq j$

• $\sigma_a^2$, $\sigma_e^2$ represent the additive polygenic variance (due to all loci) and the environmental (= error) variance respectively

• $\Phi_{ij}$ is half the expected IBD sharing between individuals $i$ and $j$ (= their kinship coefficient)

• Closely related to QTDT (Abecasis et al. 2000a,b), which implements a slightly more general/complex model
• Software to implement: GenABEL, EMMAX, FaST-LMM, GEMMA, MMM
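
To make the two-case definition of $V$ concrete, the sketch below builds the matrix from a kinship matrix $\Phi$ and the two variance components. The function name, the example pedigree, and the variance values are assumptions chosen for illustration.

```python
def covariance_from_kinship(kinship, var_additive, var_environment):
    """V[i][j] = sigma_a^2 + sigma_e^2 if i == j, else 2 * Phi[i][j] * sigma_a^2."""
    n = len(kinship)
    return [
        [
            var_additive + var_environment
            if i == j
            else 2.0 * kinship[i][j] * var_additive
            for j in range(n)
        ]
        for i in range(n)
    ]


# Kinship coefficients for two full siblings (0.25 between them) and one
# unrelated individual. The diagonal entries are not used by the formula.
phi = [
    [0.50, 0.25, 0.00],
    [0.25, 0.50, 0.00],
    [0.00, 0.00, 0.50],
]
# Hypothetical variance components.
V = covariance_from_kinship(phi, var_additive=0.6, var_environment=0.4)
```

With these inputs the siblings share a covariance of $2 \times 0.25 \times 0.6 = 0.3$, while the unrelated individual has zero covariance with both.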

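Linear Mixed Model (detailed)

$Y_i = \sum_j \beta_j X_{ij} + \epsilon$

$X_{ij}$: normalized genotype of individual $i$ at SNP $j$

In matrix form:

$\bar{y} = X\bar{\beta} + \epsilon$

Two important matrices:

$LD = \frac{1}{M} X^{T} X$

$GRM = \frac{1}{N} X X^{T}$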
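
The two matrices can be computed directly from a normalized genotype matrix. The sketch below uses plain Python lists and a tiny hypothetical $X$ with individuals in rows and SNPs in columns. The slide does not say which of $M$ and $N$ counts individuals, so the code divides each product by the dimension that is summed over; that choice is an assumption.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scale(a, factor):
    return [[factor * x for x in row] for row in a]


# Normalized genotypes: 4 individuals (rows) by 2 SNPs (columns); each column
# has mean 0 and variance 1.
X = [
    [1.0, 1.0],
    [1.0, -1.0],
    [-1.0, 1.0],
    [-1.0, -1.0],
]
n_individuals, n_snps = len(X), len(X[0])

# SNP-by-SNP matrix: average over individuals of products of two SNP columns.
ld = scale(matmul(transpose(X), X), 1.0 / n_individuals)
# Individual-by-individual matrix: average over SNPs of products of two rows.
grm = scale(matmul(X, transpose(X)), 1.0 / n_snps)
```

In this toy example the two SNPs are uncorrelated, so `ld` is the identity, while `grm` has off-diagonal entries of -1 for the pairs of individuals whose genotypes are exact opposites.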

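Linear Mixed Model (detailed)

Our model

$Y_i = \sum_j \beta_j X_{ij} + \epsilon$

We have to fit markers individually

$Y_i = \beta_1 X_{i1} + \sum_{j \geq 2} \beta_j X_{ij} + \epsilon \sim \beta_1 X_{i1} + \epsilon'$

For each SNP we can fit the model

$Y_i = \beta X_i + u_i + \epsilon$, with $\epsilon \sim N(0, I\sigma^2)$ and $u \sim MVN(0, GRM)$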

ROADTRIPS

• Robust Association-Detection Test for Related Individuals with Population Substructure
  • Thornton and McPeek (2010) AJHG 86:172-184

• Extension of MQLS (Maximum Quasi-Likelihood Statistic)
• Both methods construct an adjusted version of the case/control $\chi^2$ (or Armitage Trend) test
  • Using known pedigree relationships to correct for relatedness
• ROADTRIPS also uses a covariance matrix based on kinship/IBD sharing to correct for unknown relatedness/population stratification

Complex Trait: Rare Variants

• MRV (Multiple Rare Variant hypothesis): complex traits are the result of multiple rare variants with a large phenotypic effect
• Large effect size compared to common variants
• Although these variants are rare, collectively they may be quite common
• Strong evidence that rare variants play an important role

Functional Rare Variants

Kiezun, Garimella, Do, Stitziel et al. Nature Genetics 2012

Analysis of Rare Variants

• Difficulties
  • Lack of a rare variant catalog with reference genotypes
  • Large sample size needed
    • Sampling an allele with frequency 0.5% or 0.05% with probability 99% needs 460 or 4600 individuals respectively
  • Better analytical toolbox needed to gain power
  • Common variants have only a limited capacity to tag rare variants
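
The sample sizes quoted for observing a rare allele can be checked with a short calculation. The sketch assumes that the 2n chromosomes of n diploid individuals are independent draws from the population, so the allele is missed entirely with probability $(1 - p)^{2n}$; the function name is illustrative.

```python
import math


def individuals_needed(allele_freq, detection_prob=0.99):
    """Smallest n with P(allele seen at least once in 2n chromosomes) >= detection_prob."""
    chromosomes = math.log(1.0 - detection_prob) / math.log(1.0 - allele_freq)
    return math.ceil(chromosomes / 2.0)


n_half_percent = individuals_needed(0.005)         # allele frequency 0.5%
n_twentieth_percent = individuals_needed(0.0005)   # allele frequency 0.05%
```

Under this assumption the two results land at, or within a few individuals of, the 460 and 4600 quoted in the list.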

• Single Marker Test
  • Chi-square test
  • Cochran-Armitage test for trend

• Multiple Marker Test
  • Hotelling's $T^2$
  • Logistic Regression
  • Minimal p-value

Single Marker Test

• For case-control data, possible methods: chi-squared, Fisher's exact, Cochran-Armitage trend, logistic regression (linear regression)
• Fisher's exact test is recommended when there are small counts
• Regression analysis controlling for confounders
• Correction for multiple comparisons needed
  • Controlling FWER results in a loss of power
  • Obtain empirical p-values by random permutation or control FDR (sequential Bonferroni-type procedure)
• Sample size must be very large for sufficient power
  • Need 6,400, 54,000 and 540,000 samples for MAF 0.1, 0.01 and 0.001 to get 80% power
• Success example: insulin processing; sample of 8000, variants in SGSM2 with MAF = 1.4%, 𝑝 = 8.7×10WKX and MADD with MAF 3.7%, 𝑝 = 7.6×10WKZ
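
A single-marker chi-square test with a multiple-testing threshold can be sketched as follows. The allele counts and the number of tests are hypothetical, and the test shown is the allelic 2x2 chi-square, one of several options named in the list; the p-value uses the closed form for 1 degree of freedom.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and 1-df p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a margin is zero; the test is undefined")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical allele counts: rows are cases and controls,
# columns are minor and major alleles.
stat, p = chi_square_2x2(60, 1940, 30, 1970)

# Bonferroni threshold for a hypothetical one million tests at FWER 0.05.
n_tests = 1_000_000
significant = p < 0.05 / n_tests
```

The minor allele is twice as frequent in cases as in controls here, yet the p-value does not come close to the corrected threshold, which illustrates why the slides stress very large samples for low-frequency variants.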

Multiple Marker Tests

• Multiple regression: reduced degrees of freedom
• Hotelling's two-sample $T^2$ test:
  • Reduction of power with number of variants
  • Greatly affected by MAF
  • Identified risk allele (direction) is needed

• MDMR (Multivariate Distance Matrix Regression)
  • Uses genetic similarity of individuals
  • Don't need to identify risk allele at each variant

• KBAT (Kernel-Based Association Test)
  • Based on genotype similarity score between individuals measured by a kernel function
  • No assumption about direction
  • Can handle correlated and/or independent SNPs

Gene based Aggregation Tests

• Regression based tests
• Burden tests (collapsing)
• Adaptive burden tests
• Variance component tests
• Combination of the above

• Evaluate cumulative effects of multiple variants
• CMC (Combined Multivariate and Collapsing)

CMC
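
The collapsing idea behind burden tests can be illustrated with a deliberately simplified sketch. It is not the full CMC method: it collapses all rare variants in one gene into a single carrier indicator and compares carrier counts between cases and controls with Fisher's exact test, leaving out the multivariate part that combines collapsed groups with common variants. The data and function names are hypothetical.

```python
from math import comb


def collapse(rare_genotypes):
    """1 if the individual carries a minor allele at any rare variant, else 0."""
    return int(any(g > 0 for g in rare_genotypes))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):  # hypergeometric probability of k in the top-left cell
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    observed = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9))


# Hypothetical genotypes (0/1/2 minor-allele counts) at three rare variants in one gene.
cases = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0], [0, 1, 0]]
controls = [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]]

case_carriers = sum(collapse(g) for g in cases)
control_carriers = sum(collapse(g) for g in controls)
p_value = fisher_exact_two_sided(
    case_carriers, len(cases) - case_carriers,
    control_carriers, len(controls) - control_carriers,
)
```

No single variant here is carried by more than two cases, but after collapsing, five of six cases are carriers against one of six controls.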

Recent Methods and Summary

• CMC: jointly assesses role of common and rare variants
• WSS: weighted sum statistic
• KBAC: kernel based adaptive cluster test; weighting scheme
• SKAT: sequence kernel association test
• Power to detect association depends on
  • The number and proportion of causal variants
  • Population frequency
  • Their effect sizes and directionality
  • Number of genes contributing to the trait
  • The fraction of causal variants located (by sequencing, e.g. exome seq)

Recent methods and Summary

• Statistical tests are sensitive to disease architecture
• Different tests show strength for different effect size distributions:
  • WSS: $1/(x(1 - x))$; $x$ is the population frequency
  • SKAT: $\beta(x; a_1, a_2)$ for pre-specified $a_1, a_2$

• Allow opposite effects on traits
  • Step-up, C-alpha, the replication-based test, SKAT
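
The two frequency-dependent weighting schemes can be compared numerically. The sketch evaluates the weight $1/(x(1-x))$ and a Beta density weight at a few allele frequencies. The parameter values $a_1 = 1$, $a_2 = 25$ and the frequencies are illustrative assumptions, not values given in the slides.

```python
from math import gamma


def wss_style_weight(x):
    """Weight 1 / (x * (1 - x)) for population frequency x."""
    return 1.0 / (x * (1.0 - x))


def beta_density_weight(x, a1, a2):
    """Beta(x; a1, a2) density used as a frequency-dependent weight."""
    normalizer = gamma(a1 + a2) / (gamma(a1) * gamma(a2))
    return normalizer * x ** (a1 - 1.0) * (1.0 - x) ** (a2 - 1.0)


frequencies = [0.001, 0.01, 0.05]
wss_weights = [wss_style_weight(x) for x in frequencies]
# a1 = 1, a2 = 25 is an illustrative choice that favors rarer variants.
skat_weights = [beta_density_weight(x, 1.0, 25.0) for x in frequencies]
```

Both schemes give more weight to rarer variants, but the first grows without bound as the frequency approaches zero, while the Beta density with $a_1 = 1$ stays bounded, so the relative emphasis on the very rarest variants differs considerably.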

Software Packages

Gene × Gene Interaction

Testing for Interaction

• Logistic (linear) regression for case/control data
  • '--epistasis' in PLINK
• More powerful: case-only analysis
  • Interaction ⟺ correlation between relevant predictors
  • Test the null hypothesis that the two loci are independent (no correlation)
  • Chi-square test of independence
  • Gains power with the assumption that the two loci are independent in the population
• Preferable to incorporate case-only and case-control estimators into a single test (greater power than logistic); '--fast-epistasis' in PLINK performs such a test
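
The case-only analysis in the bullets above amounts to a chi-square test of independence between the genotypes at two loci among cases. The sketch below works on a hypothetical 3x3 table of genotype counts and assumes all row and column totals are positive, so the test has 4 degrees of freedom. It is not PLINK's implementation, and it is only valid under the stated assumption that the two loci are independent in the population.

```python
import math


def case_only_interaction_test(table):
    """Pearson chi-square test of independence on a 3x3 genotype table of cases.

    table[i][j] counts cases with genotype i at locus A and genotype j at locus B.
    Assumes every row and column total is positive, giving 4 degrees of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Closed-form survival function of a chi-square with 4 df.
    p_value = math.exp(-stat / 2.0) * (1.0 + stat / 2.0)
    return stat, p_value


# Hypothetical counts among 100 cases with the same margins in both tables.
independent = [[16, 16, 8], [16, 16, 8], [8, 8, 4]]
correlated = [[30, 8, 2], [8, 24, 8], [2, 8, 10]]

stat_ind, p_ind = case_only_interaction_test(independent)
stat_dep, p_dep = case_only_interaction_test(correlated)
```

The first table matches its expected counts exactly and gives no evidence of interaction; the second, with cases concentrated on matching genotypes, gives a very small p-value.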

PLINK --fast-epistasis

• Exhaustive search: use GPUs; suffers from multiple testing
• Data mining approach: use cross-validation to avoid overfitting
• Multifactor Dimensionality Reduction (Ritchie et al. (2001) AJHG)
• Random Forest (CART)
• Penalized regression methods (Zhu et al. (2014))
• Entropy based methods
• BEAM (Zhang et al. (2007))
• Bayesian model selection
• MCMC, MECPM (Jiang and Neapolitan (2015))

Other Techniques

THANK YOU
