identification of non-invasive cytokine biomarkers for...
Abstract
Polycystic ovary syndrome (PCOS) is a common endocrine disorder that affects up to 20% of women; however, diagnosis is commonly unreliable and non-quantitative. Here we use supervised machine learning and measurements of 51 cytokines from a large cohort of patients to identify a low-dimensional set of potential biomarkers for diagnosis of PCOS. Both whole blood and individual follicular fluid (FF) aspirates were collected from women during oocyte retrieval before intracytoplasmic sperm injection with in vitro fertilization (ICSI/IVF) and linked with patients' PCOS status as diagnosed by the Rotterdam criteria (n = 69 PCOS, n = 222 non-PCOS). We trained a binary support vector machine (SVM) using a random subset of patient data to determine the cytokine profile associated with PCOS. Our resultant model includes 3 variables and is 76% accurate. This provides insight into the immunological basis of PCOS and may define a potential non-invasive, quantitative strategy for diagnosis.
Introduction
Definitions & Equations
Statistical Analysis
Using Support Vector Machines
Other Analyses
Future Directions
The dataset described in this study is a small subset of a much larger collection of patient data. In the future we plan to incorporate more of these data to determine what more can be said about the classification of PCOS and the prediction of fertility treatment success. Another open question is whether follicular fluid data are better suited to predicting pregnancy outcomes, while blood plasma data are better suited to predicting the presence of PCOS. Other methods of analysis (e.g., different machine learning classifiers) may also have strengths that our current models lack.
References
1. http://www.med.nyu.edu/chibi/sites/default/files/chibi/Final.pdf
2. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3872139/
3. http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
4. Latchman, David S. Gene Regulation - Fifth Edition. Taylor & Francis Group, 2005. Print.
5. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2785020/
Acknowledgements
Thank you to the members of the Gunawardena lab, including but not limited to John, Sieu, Dan, David, Deepesh, Mohan, Felix, and Javi. This work was supported by the Gwill York and Paul Maeder Research Award for Systems Biology and the FAS Center for Systems Biology. This work was supported by the National Science Foundation, Award ID 1462629.
Identification of Non-invasive Cytokine Biomarkers for Polycystic Ovary Syndrome Using Supervised Machine Learning
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
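As a worked check of the three metrics defined above, the sketch below computes them from confusion-matrix counts. The counts used (TP = 7, FN = 25, TN = 93, FP = 8) are not reported on the poster; they are an assumption, back-calculated here from the metrics reported for the FF full model on its test set. The function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # fraction of PCOS samples correctly flagged
    specificity = tn / (tn + fp)                 # fraction of control samples correctly cleared
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all samples classified correctly
    return sensitivity, specificity, accuracy


# Assumed counts, back-calculated from the reported FF "All" test-set metrics.
sens, spec, acc = classification_metrics(tp=7, fn=25, tn=93, fp=8)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} accuracy={acc:.4f}")
```

Because controls outnumber PCOS samples, accuracy can stay near 75% even when few PCOS samples are detected, which is why sensitivity is worth reading alongside it.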
PCOS is an endocrine disorder that affects up to 20% of women. It is diagnosed using the Rotterdam criteria, which require at least two of the following to be present: hyperandrogenism, which could include clinical hirsutism (excessive hairiness) or biochemically raised testosterone levels; anovulation, which can present as difficulty getting pregnant; and the presence of cysts in the ovaries, as seen on ultrasound.
With these diagnostic criteria in mind, one thing my project aims to accomplish is to provide a different measurement of PCOS using concrete levels of cytokines. Samples were taken from 291 women who first underwent long-protocol ovarian hyperstimulation, a technique used to induce ovulation from multiple ovarian follicles. Samples were then taken from each woman's blood plasma and follicular fluid. There are approximately twice as many follicular fluid samples because each ovary was sampled separately. Fifty-one whole blood and FF cytokines were then measured by fluid-phase multiplex cytometric immunoassay (the resultant dataset is pictured below). Each cytokine is detected with a different antibody, which makes the full panel an expensive and lengthy test. Another goal of this project is therefore to reduce the number of cytokines, or features, needed to predict PCOS in patients.
Cytokines: Cytokines are small secreted proteins released by cells that have a specific effect on the interactions and communications between cells.
[Figure: Dataset visualization. These two figures represent the plasma (P) and follicular fluid (FF) datasets, which are 291 x 51 and 530 x 51 respectively. Panels: P Dataset, FF Dataset. Color scale: low levels to high levels.]
A comprehensive list of the 48 cytokines measured is: IL.1a, IL.1b, IL.1ra, IL.2, IL.2ra, IL.3, IL.4, IL.5, IL.6, IL.7, IL.8, IL.9, IL.10, IL.12..p40., IL.12..p70., IL.13, IL.15, IL.16, IL.17, IL.18, CTACK, Eotaxin, FGF, G.CSF, GM.CSF, GRO.a, IFN.a, IFN.g, IP.10, LIF, MCP.1, MCP.3, M.CSF, MIF, MIG, MIP.1a, MIP.1b, b.NGF, PDGF, RANTES, SCF, SDF.1a, TGF.b, TNF.a, TNF.b, TRAIL, VEGF, CRP
Logistic Regression:
\[ \ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X \]
Linear Kernel:
\[ K(x_1, x_2) = x_1 \cdot x_2 \]
\[ f(x) = \operatorname{sign}(w \cdot x + b) \]
SVM Optimization Problem:
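The heading above names an optimization problem without stating it. A standard soft-margin formulation, consistent with the decision function f(x) = sign(w · x + b), is sketched here as an assumption; whether the study used this form or the hard-margin variant is not stated. The penalty C, the slack variables \xi_i, and the labels y_i \in \{-1, +1\} are symbols introduced for illustration.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n
```

A larger C penalizes misclassified training samples more heavily, and a smaller C favors a wider margin.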
Daniela Perry1,2, Tathagata Dasgupta2, Joseph Dexter2, Sarah Field3, Michele Cummings3, Vinay Sharma3, Nadia Gopichandran3, Ellis Baskind3, Nicholas Orsi3, Jeremy Gunawardena2
1 Cornell University, Department of Biological Statistics and Computational Biology
2 Harvard Medical School, Department of Systems Biology
3 University of Leeds Institute of Cancer and Pathology
First, we put all 48 variables in a logistic regression model and predicted the probability of each sample being PCOS or Control (ALL). Both datasets gave comparably poor performance. Then we conducted stepwise regression, which uses the Akaike information criterion (AIC) to eliminate the least significant variable at each step (STEP). Once this process was complete, we analyzed the performance of the model using the FF and P datasets once again. Finally, to get a sense of how the models would perform with a more clinically feasible number of variables, we removed the variable with the highest p-value, one at a time, until we were left with just four cytokines (FOUR). ROC curves highlighting the performance of all six models are displayed below, along with a visual of accuracy, specificity, and sensitivity.
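The AIC-driven elimination step can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the poster does not say which software was used, the data here are synthetic, the feature names (`signal`, `noise_a`, `noise_b`) and all function names are hypothetical, and the logistic fit is a plain gradient-ascent routine swapped in to keep the example dependency-free. The later p-value elimination down to four variables is not shown.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, steps=400, lr=0.5):
    """Fit by batch gradient ascent; return the (approximately) maximized log-likelihood."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # w[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for row, label in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))
            grad[0] += label - p
            for j, xj in enumerate(row):
                grad[j + 1] += (label - p) * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    log_lik = 0.0
    for row, label in zip(X, y):
        p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        log_lik += label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return log_lik


def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik


def model_aic(X, y, cols):
    """AIC of a logistic model restricted to the given column indices (plus intercept)."""
    sub = [[row[c] for c in cols] for row in X]
    return aic(fit_logistic(sub, y), len(cols) + 1)


def backward_stepwise(X, y, names):
    """Drop one variable at a time while doing so lowers the AIC."""
    keep = list(range(len(names)))
    best = model_aic(X, y, keep)
    while keep:
        trials = [(model_aic(X, y, [c for c in keep if c != j]), j) for j in keep]
        trial_aic, drop = min(trials)
        if trial_aic >= best:  # no single removal improves the model
            break
        best = trial_aic
        keep.remove(drop)
    return [names[c] for c in keep], best


# Synthetic data: only the first feature carries information about the label.
random.seed(0)
names = ["signal", "noise_a", "noise_b"]
X, y = [], []
for _ in range(120):
    row = [random.gauss(0, 1) for _ in names]
    X.append(row)
    y.append(1 if row[0] + random.gauss(0, 0.5) > 0 else 0)

selected, final_aic = backward_stepwise(X, y, names)
print(selected, round(final_aic, 2))
```

Each round refits the model once per remaining variable, so the cost grows quickly with the number of features. The sketch uses three features for that reason.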
Principal Component Analysis (PCA):
[Figure: "ff.pca" variance plot. y-axis: Variances; x-axis: principal components 1 to 9.]
[Figure: PCA plot of the samples with variable labels. x-axis: PC1 (36.2% explained var.); y-axis: PC2 (19.1% explained var.); groups: Control, PCOS. Variable labels: IL.1a, IL.1b, IL.1ra, IL.2, IL.2ra, IL.3, IL.4, IL.5, IL.6, IL.7, IL.8, IL.9, IL.10, IL.12..p40., IL.12..p70., IL.13, IL.15, IL.16, IL.17, IL.18, CTACK, Eotaxin, FGF, G.CSF, GM.CSF, GRO.a, HGF, ICAM.1, IFN.a, IFN.g, IP.10, LIF, MCP.1, MCP.3, M.CSF, MIF, MIG, MIP.1a, MIP.1b, b.NGF, PDGF, RANTES, SCF, SCGF.b, SDF.1a, TGF.b, TNF.a, TNF.b, TRAIL, VCAM.1, VEGF, CRP.]
K-Means Clustering: K-means clustering is an iterative algorithm with a random starting point that groups data points based on their distance from each other. Points are randomly assigned to a cluster, then cluster placement is optimized by repeatedly finding the center of each cluster and reassigning each point to its nearest center. Our clustering analysis resulted in the graph to the right, which is highly condensed because of the presence of so many unique outliers, which is consistent with one-class classification.
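The procedure described above can be sketched in a few lines. This is an illustrative version only: the data are two synthetic, well-separated blobs rather than the cytokine measurements, the function names are hypothetical, and the use of several random restarts (keeping the run with the lowest within-cluster sum of squares) is an added assumption that the poster does not mention.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_restarts=10, max_iter=100, seed=0):
    """K-means with random initial assignment; returns (labels, inertia) of the best restart."""
    rng = random.Random(seed)
    dim = len(points[0])
    best_labels, best_inertia = None, float("inf")
    for _ in range(n_restarts):
        labels = [rng.randrange(k) for _ in points]  # random initial cluster per point
        for _ in range(max_iter):
            # Update step: each center is the mean of its current members.
            centers = []
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if not members:
                    members = [rng.choice(points)]  # re-seed an empty cluster
                centers.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
            # Assignment step: each point moves to its nearest center.
            new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
            if new_labels == labels:
                break
            labels = new_labels
        inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels, best_inertia


# Synthetic example: two blobs centered at (0, 0) and (10, 10).
data_rng = random.Random(1)
blob_a = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(20)]
blob_b = [(data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)) for _ in range(20)]
labels, inertia = kmeans(blob_a + blob_b, k=2)
print(labels, round(inertia, 2))
```

Because cluster centers are means, a handful of extreme points can pull a center far from the bulk of the data, which is one way outliers distort the resulting picture.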
PCA uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. The line graph above shows how much of the variance in the data is accounted for by each principal component. The graph to the left shows the groupings based on the two principal components that account for the most variance.
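To make the idea of "variance accounted for by each principal component" concrete, here is a small sketch that computes the leading components and their explained-variance ratios. It is a hedged illustration: the data are a synthetic two-variable example, the function name is hypothetical, and power iteration with deflation is used only to keep the example dependency-free; it is not claimed to be the method behind the figures.

```python
import math


def pca(rows, n_components=2, n_iter=200):
    """Return (components, explained_variance_ratios) using power iteration with deflation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)] for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))
    components, ratios = [], []
    for _ in range(n_components):
        # Slightly asymmetric start vector, normalized to unit length.
        v = [1.0 + 0.1 * i for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in this direction
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance captured along v.
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        components.append(v)
        ratios.append(eigval / total_var)
        # Deflation: remove the captured direction before finding the next one.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components, ratios


# Synthetic example: points along the line y = 2x with a small perpendicular jitter.
rows = []
for t in range(-5, 6):
    s = 1 if t % 2 == 0 else -1
    rows.append((t - 0.2 * s, 2 * t + 0.1 * s))

components, ratios = pca(rows)
print(components[0], ratios)
```

In this toy example nearly all of the variance lies along the first component, whereas the poster's figure reports 36.2% and 19.1% for PC1 and PC2.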
[Workflow: Full Model -> Stepwise Regression -> “Stepwise Model” -> P-value elimination -> Four Variable Model]
Model        # Variables  Training Set    Testing Set     Specificity  Sensitivity  Accuracy
FF All       48           75% of full FF  25% of full FF  0.9207921    0.21875      0.7518797
FF Stepwise  21           75% of full FF  25% of full FF  0.950495     0.25         0.7819549
FF Four      4            75% of full FF  25% of full FF  0.990099     0.0625       0.7669173
P All        48           75% of full P   25% of full P   0.8571429    0.2222222    0.7027027
P Stepwise   12           75% of full P   25% of full P   0.8392857    0.1111111    0.6621622
P Four       4            75% of full P   25% of full P   0.9821429    0.05555556   0.7567568
[Figure: Plots of IL.10 by group (x-axis: Group, Control vs. PCOS; y-axis: IL.10). Panels: Group vs. IL.10 (P); Group vs. IL.10 (FF); Group vs. IL.10 (FF All); Group vs. IL.10 (FF Step); Group vs. IL.10 (FF Four); Group vs. IL.10 (P All); Group vs. IL.10 (P Four); Group vs. IL.10 (P Step).]
First, we performed a grid search to optimize the parameters of our SVM model using a linear kernel. Then, we trained a binary classifier and tested its performance using 5-fold cross validation (results in the table below). The low strength of this model led us to conduct one-class classification for outlier detection. We then retested our model excluding the points we found to be outliers.
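The first two steps, grid search and 5-fold cross validation of a linear SVM, can be sketched as below. This is a minimal sketch under stated assumptions: the poster does not name the software used, so the trainer here is a simple stochastic subgradient method on the hinge loss, swapped in for illustration. The data are synthetic and linearly separable, the grid of C values is arbitrary, and all function names are hypothetical. The one-class outlier step is not shown.

```python
import random


def train_linear_svm(X, y, C=1.0, epochs=50, lr=0.01, seed=0):
    """Stochastic subgradient descent on the soft-margin objective; labels must be -1 or +1."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1.0 - lr / n) for wj in w]  # shrink step from the (1/2)||w||^2 term
            if margin < 1.0:                       # hinge loss is active for this sample
                w = [wj + lr * C * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * C * y[i]
    return w, b


def predict(w, b, x):
    """Decision function f(x) = sign(w . x + b), with ties assigned to +1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def cv_accuracy(X, y, C, k=5, seed=0):
    """Mean accuracy over k folds; each sample is held out exactly once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w, b = train_linear_svm([X[i] for i in train], [y[i] for i in train], C=C)
        correct += sum(predict(w, b, X[i]) == y[i] for i in fold)
    return correct / len(X)


def grid_search(X, y, grid=(0.01, 0.1, 1.0, 10.0)):
    """Pick the C with the best cross-validated accuracy."""
    scores = {C: cv_accuracy(X, y, C) for C in grid}
    best_C = max(scores, key=scores.get)
    return best_C, scores


# Synthetic example: two well-separated classes in two dimensions.
data_rng = random.Random(42)
X = [[data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)] for _ in range(50)]
X += [[data_rng.gauss(-2, 0.5), data_rng.gauss(-2, 0.5)] for _ in range(50)]
y = [1] * 50 + [-1] * 50

best_C, scores = grid_search(X, y)
print(best_C, scores)
```

Selecting parameters and reporting accuracy on the same folds tends to give an optimistic estimate, so a separate held-out test set, such as the 75%/25% split in the results table, is a useful complement.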
[Workflow: Grid Search -> Apply best parameters -> Train a binary classification SVM -> Test model performance (5-fold cross validation) -> Low model strength -> One-class classification -> Test model without outliers (5-fold cross validation)]
[Figure: How SVM Works [1]. Schematic panels showing Group 1 and Group 2.]
Model   # Variables  Training Set             Testing Set              Specificity  Sensitivity  Accuracy
FF SVM  7            75% of full FF           25% of full FF           1            0.0625       0.7744361
                     5-fold CV                5-fold CV                0.9876543    0.03846154   0.7570093
                     5-fold CV (no outliers)  5-fold CV (no outliers)  1            0            0.7583333
P SVM   3            75% of full P            25% of full P            0.9821429    0.05555556   0.7567568
                     5-fold CV                5-fold CV                1            0            0.7627119
                     5-fold CV (no outliers)  5-fold CV (no outliers)  1            0.06666667   0.7878788