COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error of outcome measurement instrument
user manual
Version 1.0 dated January 2021

Lidwine B Mokkink, Maarten Boers, Cees van der Vleuten, Donald L Patrick, Jordi Alonso, Lex M Bouter, Henrica CW de Vet, Caroline B Terwee

Contact
LB Mokkink, PhD
Amsterdam UMC, Vrije Universiteit Amsterdam
Department of Epidemiology and Data Science
Amsterdam Public Health research institute
De Boelelaan 1117, 1081 BT Amsterdam
The Netherlands
Website: www.cosmin.nl
E-mail: w.mokkink@amsterdamumc.nl

The development of the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error was part of the VENI programme with project number 91617098, funded by ZonMw (The Netherlands Organisation for Health Research and Development).
Table of Contents
Foreword
1. Background information
1.1 COSMIN initiative and steering committee
1.2 How to cite this manual
1.3 Development of the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error
1.4 Definitions of reliability and measurement error
1.5 Focus of the COSMIN Risk of Bias tool
1.6 The structure of the COSMIN Risk of Bias tool
1.7 The "worst-score-counts" method
1.8 Relevance of the research question
1.9 Using the COSMIN Risk of Bias tool in a systematic review
1.10 Expertise required for using the tool
1.11 Using the COSMIN Risk of Bias tool to assess studies on PROMs or ObsROMs
1.12 A Risk of Bias tool is not a study design checklist, nor a reporting guideline
2. Part A. Understanding how a study informs us about the reliability and measurement error of an outcome measurement instrument
2.1 Components of outcome measurement instruments
2.2 Extracting the elements of a comprehensive research question
2.3 Example of how to use Part A of the COSMIN Risk of Bias tool to assess the quality of a study by Skeie et al. (2015)
3. Part B. Assessing the risk of bias of a study on reliability or measurement error
3.1 Elaboration on standards for studies on reliability
3.2 Elaboration on standards for studies on measurement error
3.3 Example of how to use Part B of the COSMIN Risk of Bias tool to assess the quality of a study by Skeie et al. (2015)
4. Using the COSMIN Risk of Bias tool in a systematic review of outcome measurement instruments
4.1 The eleven-step procedure for conducting a systematic review of ClinROMs, PerFOMs, or laboratory values
Appendix 1. Data Extraction table of relevant information for each included study in a systematic review.
Appendix 2. Risk of Bias ratings per standard per study
Appendix 3. Example of a Flow-chart
Appendix 4. Example of reporting table on characteristics of the included measurement instruments.
Appendix 5. Example of reporting table on characteristics of the study populations.
Appendix 6. Overview Table of quality and results of studies on reliability and measurement error.
Appendix 7. Summary of Findings Tables for Reliability and Measurement error.
References
Foreword
The COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error was developed to transparently and systematically assess the methodological quality of studies on reliability and measurement error of all types of outcome measurement instruments. It is an extended version of the COSMIN Risk of Bias checklist for the boxes reliability and measurement error for PROMs (1). It was developed for clinician-reported outcome measures (ClinROMs) (including e.g. readings based on imaging modalities and ratings based on observations), performance-based outcome measurement instruments (PerFOMs), or biomarkers, also called laboratory values (2, 3). These measurement instruments are more complex than PROMs, as not only patients are involved, but also professionals, and sometimes (complex) devices. Specifically in studies on reliability and measurement error, these additional sources of variation complicate the design of the studies and may influence their quality.

As different sources of variation can play a role, different studies can be conducted to assess the reliability or measurement error of an outcome measurement instrument. To assess the quality of such a study, one should understand (1) how the results of a published study on reliability or measurement error inform us about the reliability and measurement error of the outcome measurement instrument under study, and (2) whether we can trust the result found in the study by assessing the risk of bias of the study. These two steps are reflected in the new COSMIN Risk of Bias tool to assess the quality of studies on reliability or measurement error of outcome measurement instruments (4).

The quality assessment of a study on reliability or measurement error can be conducted in the context of a systematic review of outcome measurement instruments. In such a review all measurement properties are considered, the quality of each study is assessed, the results of the studies are extracted, and per measurement property an overall conclusion is drawn about the quality of the instrument based on all available evidence for each measurement instrument. Subsequently, the quality of the evidence is graded, taking the number, quality, and (consistency of) results of the studies into account. A recommendation for the most suitable instrument is made, based on quality, feasibility and interpretability of each instrument. As this is not an easy task to perform, we encourage the use of systematic and transparent methods when conducting such systematic reviews. We developed the COSMIN methodology for conducting systematic reviews of PROMs (5), including the COSMIN Risk of Bias checklist (1, 6). When conducting a systematic review of other types of outcome measurement instruments, such as ClinROMs, PerFOMs, or laboratory values, this newly developed COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error can be incorporated into the COSMIN methodology. In this manual we will explain how this new tool should be used.

1. Background information
1.1 COSMIN initiative and steering committee
The COSMIN initiative aims to improve the selection of health measurement instruments both in research and clinical practice by developing tools for selecting the most suitable instrument for a given situation. COSMIN is an international initiative consisting of a multidisciplinary team of researchers with expertise in epidemiology, psychometrics, and qualitative research, and in the development and evaluation of outcome measurement instruments in the field of health care, as well as in performing systematic reviews of outcome measurement instruments.

This tool was developed in a Delphi study (4). The steering committee of this study consisted of:
Lidwine B Mokkink
Maarten Boers
Cees van der Vleuten
Donald L Patrick
Jordi Alonso
Lex M Bouter
Henrica CW de Vet
Caroline B Terwee

We are very grateful to all the panelists of this study, who provided us with many helpful and critical comments and arguments (in alphabetical order): M.A. D’Agostino, Dorcas Beaton, Sophie van Belle, Sandra Beurskens, Kristie Bjornson, Jan Boehnke, Patrick Bossuyt, Don Bushnell, Stefan Cano, Saskia le Cessie, Alessandro Chiarotto, Mike Clark, Jon Deeks, Iris Eekhout, Jim Farnsworth II, Oke Gerke, Sabine Goldhahn, Robert M. Gow, Philip Griffiths, Cristian Gugiu, Jean-Benoit Hardouin, Desirée van der Heijde, I-Chan Huang, Ellen Janssen, Brian Jolly, Lars Konge, Jan Kottner, Brittany Lapin, Hanneke van der Lee, Mariska Leeflang, Nancy Mayo, Sue Mallett, Joy C. MacDermid, Geert Molenberghs, Holger Muehlan, Koen Neijenhuijs, Raymond Ostelo, Laura Quinn, Dennis Revicki, Jussi Repo, Johannes B. Reitsma, Anne W. Rutjes, Mohsen Sadatsafavi, David Streiner, Matthew Stephenson, Berend Terluin, Zyphanie Tyack, Werner Vach, Gemma Vilagut Saiz, Marc K. Walton, Matthijs Warrens, and Daniel Yee Tak Fong.
1.2 How to cite this manual
This manual accompanies the tool developed in the Delphi study. Please refer to the article when using the manual of the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error:

LB Mokkink, M Boers, CPM van der Vleuten, LM Bouter, J Alonso, DL Patrick, HCW de Vet, CB Terwee. COSMIN Risk of Bias tool to assess the quality of studies on reliability or measurement error of outcome measurement instruments: a Delphi study. BMC Medical Research Methodology. 2020;20(293).

1.3 Development of the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error
This COSMIN tool was developed in a Delphi study consisting of three rounds. For more information about the methods of this study, we refer to Mokkink et al. 2020. In this Delphi study we reached consensus on how to formulate a comprehensive research question for studies on reliability and measurement error, on components of outcome measurement instruments (which are the potential sources of variation relevant in studies on reliability and measurement error), and on standards to assess the quality of a study on reliability and measurement error of ClinROMs, PerFOMs, or laboratory values. Based on those results, we developed the COSMIN Risk of Bias tool, which comprises two parts: 1) seven elements that make up a comprehensive research question of the study, which inform us on how the reliability and measurement error of the outcome measurement instrument was studied, and 2) standards on design requirements and preferred statistical methods of studies on reliability and measurement error, which can be used to assess the quality of the study.

1.4 Definitions of reliability and measurement error
Reliability and measurement error are important measurement properties of outcome measurement instruments. Reliability and measurement error are determined based on the same study design and data collection, but with different statistical methods. These measurement properties are therefore related, but distinct.

Reliability is defined as the proportion of the total variance in the measurement which is due to true differences between patients (7). It refers to the extent to which an instrument is able to distinguish between patients; a reliability study investigates the extent to which different sources of variation influence the measurement. This gives direction for how to improve the measurement, for example by standardization or restriction of the source of variation. Reliability can be calculated with an Intra-class Correlation Coefficient (ICC), a Generalizability Coefficient or with a kappa. Reliability parameters are expressed as a proportion and lie between 0 and 1.

Measurement error is defined as the systematic and random error of a patient’s score that is not attributed to true changes in the construct to be measured (7). It refers to how close the scores of repeated measurements in stable patients are; such studies investigate the absolute deviation of the scores or the amount of error of repeated measurements in stable patients. In case of categorical outcomes it is also called ‘agreement’. For continuous outcomes, measurement error is expressed in the measurement units of the measurement instrument with a Standard Error of Measurement (SEM) or Limits of Agreement (LoA). For categorical outcomes, agreement is expressed as percentage total agreement or percentage specific (e.g. positive and negative) agreement.

1.5 Focus of the COSMIN Risk of Bias tool
We focus on outcome measurement instruments, defined as instruments used to monitor the health status of (a group of) people over time, for example in a clinical trial or in clinical practice.

Several types of measurement instruments exist, such as patient-reported outcome measures (PROMs); observer-reported outcome measures (ObsROMs; i.e. proxy measures); clinician-reported outcome measurement instruments (ClinROMs) (including e.g. readings based on imaging modalities and ratings based on observations); performance-based outcome measurement instruments (PerFOMs); and biomarker outcomes, also called laboratory values (2). The COSMIN Risk of Bias tool to assess reliability and measurement error is specifically developed for ClinROMs, PerFOMs, and laboratory values (see Table 1 for examples). These outcome measurement instruments typically require involvement of one or more professionals to operate equipment or tools, to give instructions to the patient (e.g. to perform a task or action) or to come to a score through their clinical expertise (e.g. after observing a patient or an image). An outcome measurement instrument comprises the whole measurement procedure to come to a score, including issues such as materials, communication (e.g. instructions and motivating patients in case of a performance-based test), clinical judgment, and performing a task. All issues relevant for reliable and valid measurement should be described in the measurement protocol of an outcome measurement instrument.
Table 1. Examples of ClinROMs, PerFOMs, and laboratory values

Clinician-reported outcome measurement instruments (ClinROMs)
- Clinician-reported rating of the severity of a disease or condition. For example, the Hamilton Anxiety Rating Scale to assess the severity of anxiety symptoms comprises 14 items that are scored by a clinician (8).
- A Global Assessment of the severity of a condition scored e.g. on a single-item Visual Analogue Scale by a health-care professional.
- Result of clinical examination of (patho)physiology, such as blood pressure or a count of swollen joints.
- Clinical reading of device-based results (often imaging), such as power Doppler ultrasonography to assess cardiac structure, function and hemodynamics (echocardiography) (9), or MRI used to evaluate cartilage defect size, depth, and subchondral bone in order to assess chondral and osteochondral lesions at the knee (10).

Performance-based outcome measurement instruments (PerFOMs)
- A performance-based walking test (e.g. the timed 25-foot walk test (11)), in which a professional instructs a patient to walk 25 feet at his own comfortable pace with or without a walking aid. Time needed to cover 25 feet is measured by the professional.

Laboratory value or biomarker
- Laboratory value such as HbA1c (glycated haemoglobin) measured by the turbidimetric inhibition immunoassay (TINIA) (12).

Different versions or operationalizations of outcome measurement instruments
To measure a specific construct, different versions of a measurement instrument may exist. For example, the Doloplus is a clinical tool for behavioural pain assessment in cognitively impaired patients, and is administered e.g. by the attending nurse. The original Doloplus-1 contained 15 items, while the Doloplus-2 contains 10 items (13).

A measurement instrument (i.e. the measurement protocol) can be operationalized in many different ways, and each operationalization could be considered a different version. For example, the specific equipment used to measure the range of motion (ROM) can differ, e.g., a simple universal goniometer (14) or an electromagnetic 3-dimensional tracking system (15). The location to be measured can differ, e.g., the neck (14) or the shoulder (16). The background of the professional involved can differ, e.g., a rheumatologist or a radiologist who conducts the measurement, and these raters may have had different levels of training (17).

In principle, we consider each version of an outcome measurement instrument or each different operationalization of the measurement protocol as a separate measurement instrument, until evidence is provided (e.g. testing of measurement invariance, or reliability) that the versions perform similarly.
1.6 The structure of the COSMIN Risk of Bias tool
The COSMIN Risk of Bias tool comprises two parts. Part A helps to understand how the results of a published study inform us about the reliability or measurement error of the outcome measurement instruments under study. Part B helps to assess whether we can trust the result obtained in the study by assessing the risk of bias of the study.

Part A. For a good understanding of how the results of a study inform us about the reliability and measurement error of the instrument, a good understanding of the design of the study and its corresponding comprehensive research question is needed. In Part A we describe the seven elements that we recommend to be extracted, and that together can be used to construct a comprehensive research question for each analysis. In addition, Part A of the tool contains an overview of the components of outcome measurement instruments. These components are the potential sources of variation that can either be studied (i.e. varied across the repeated measurements), or are kept or assumed to be stable (i.e. standardized).

Part B. Next, we developed two boxes with standards for studies on reliability and for studies on measurement error, respectively. As in the COSMIN Risk of Bias checklist for PROMs (1), standards refer to design requirements and preferred statistical methods of studies on measurement properties. For example, ‘reliability and measurement error should be assessed in patients that are assumed to be stable’; or ‘measurement error should be assessed with the standard error of measurement or with the limits of agreement’. The standards are stated as questions: e.g. ‘were patients stable in the interim period on the construct to be measured?’.

We refer to ‘preferred’ statistical methods. By ‘preferred’ we mean that these statistical methods are appropriate to use when evaluating reliability or measurement error of outcome measurement instruments, and are commonly used. Other methods may be appropriate to use as well (for example bi-factor models or Multi-Trait Multi-Method (MTMM) analyses, or newly developed methods). It is not our intention to comprehensively describe all possible statistical methods, but rather to describe the adequate methods that are commonly used in the literature. It is up to the user of the COSMIN tool how studies using these less commonly used methods are assessed.

1.7 The “worst-score-counts” principle
Each standard in a box is scored on a four-point scale, i.e. ‘very good’, ‘adequate’, ‘doubtful’, and ‘inadequate’; see chapter 3 for more information. As in the COSMIN Risk of Bias checklist for PROMs (1), we use the worst-score-counts method (18) to come to a rating for the quality of the study on reliability or measurement error.
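As an illustration of the worst-score-counts principle, the sketch below takes the ratings given to the standards of one box and returns the worst of them as the rating for that box. It is a minimal example, not part of the COSMIN tool itself: the names `Rating` and `worst_score_counts` and the example ratings are hypothetical, and the handling of standards rated as not applicable is left out.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Four-point scale of the tool; a higher value means a worse rating."""
    VERY_GOOD = 1
    ADEQUATE = 2
    DOUBTFUL = 3
    INADEQUATE = 4


def worst_score_counts(ratings):
    """Return the worst rating among the standards of one box (hypothetical helper)."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rated standard is required")
    # Because worse ratings have higher values, the worst rating is the maximum.
    return max(ratings)


# Hypothetical ratings for the standards of one box in one study.
box_ratings = [Rating.VERY_GOOD, Rating.ADEQUATE, Rating.DOUBTFUL, Rating.VERY_GOOD]
overall = worst_score_counts(box_ratings)
print(overall.name)
```

With these example ratings, a single ‘doubtful’ standard determines the rating of the whole box, even though the other standards were rated ‘very good’ or ‘adequate’.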
1.8 Relevance of the research question
While many different research questions concerning the reliability or measurement error of an outcome measurement instrument can be investigated, the relevance of a study is not under question when using this tool. The relevance of a study refers to different aspects:
- Choice of the potential source(s) of variation that have been varied over the repeated measurements.
- Choice of the target population of patients and professionals (when applicable) of the study.
- Choice of how the measurement protocol was executed, when applicable.
- Choice of evaluating the specific measurement property, either reliability or measurement error. Often only reliability is reported, while the measurement error can be calculated using the same data.
When using this COSMIN Risk of Bias tool, these aspects will be extracted from the design of the study (in Part A). However, no judgement will be given about the appropriateness of the choices made. The choices made in the research question and study design by the researchers determine the interpretation and generalizability of the results.

1.9 Using the COSMIN Risk of Bias tool in a systematic review
The COSMIN Risk of Bias tool is developed to assess the quality of a published study. One application of the COSMIN Risk of Bias tool is to assess the quality of studies when conducting a systematic review on measurement instruments. COSMIN developed a systematic methodology for conducting systematic reviews of PROMs (5). It consists of a 10-step procedure, in which the COSMIN Risk of Bias checklist (1) (containing standards for all nine measurement properties) can be applied to the studies to assess the quality of each study. To use the COSMIN methodology for conducting systematic reviews of other types of instruments (that is, other than PROMs), we advise replacing boxes 6 (Reliability) and 7 (Measurement error) with the COSMIN Risk of Bias tool to assess the quality of studies on reliability and measurement error of outcome measurement instruments. More information about how to conduct a systematic review using the new COSMIN Risk of Bias tool can be found in chapter 4.
1.10 Expertise required for using the tool
Assessing the quality of a study on reliability and measurement error, i.e. for use in a systematic review on the quality of outcome measurement instruments, is quite complex and time consuming, and it requires expertise within the research team on several aspects. We recommend that at least one of the team members should have expertise on the construct to be measured, e.g. to understand what appropriate time intervals are between repeated measurements; on the measurement instruments, e.g. to understand what concomitant sources of variation could be (and that these should be restricted or standardized; see element 2 in Part A); and on the patient population, e.g. to understand whether patients were stable between repeated measurements or whether subgroups of patients can be considered in one study. A clinical expert might combine these areas of expertise. A methodological expert with expertise on the theory of reliability and measurement error should be part of the team, e.g. to understand whether the design is appropriately analyzed (e.g. standard 7).

1.11 Using the COSMIN Risk of Bias tool to assess studies on PROMs or ObsROMs
This new COSMIN Risk of Bias tool is developed specifically for ClinROMs, PerFOMs, and laboratory values. However, it can also be used to assess the quality of studies on reliability or measurement error of PROMs or observer-reported outcome measures (ObsROMs; i.e. observations made, appraised, and recorded by a person other than the patient who does not require specialized professional training (2), e.g. proxy measures). For these two types of instruments, though, the tool may seem unnecessarily complex. The first step in the tool (i.e. understanding how the results inform us on the quality of the measurement instrument under study) is often obvious, as the aim of reliability studies of PROMs and ObsROMs is most often to assess test-retest reliability or measurement error of the whole measurement instrument (as these measurement instruments can only be taken in one go, and the only potential source of variance is occasion). The second step in the tool (assessing the quality of the study using the standards) will lead to the same rating as using the standards of the Risk of Bias checklist for PROMs. The standards on design requirements in both tools are partly the same. However, the new types of outcome measurement instruments for which we adapted the COSMIN checklist (i.e. ClinROMs, PerFOMs and laboratory values) require additional standards, which are not usually applicable for PROMs and ObsROMs. (If such a standard is applicable in a specific study, it could be rated using the ‘other flaws’ standard in the Risk of Bias checklist for PROMs.) The response options for standards on preferred statistical methods in the new tool are formulated somewhat differently, but will lead to the same rating as the PROM Risk of Bias checklist.
1.12 A Risk of Bias tool is not a study design checklist, nor a reporting guideline
This COSMIN Risk of Bias tool is developed to assess the quality (i.e. risk of bias) of a published study on reliability or measurement error. This tool is not developed as a design checklist or a reporting guideline. When designing or reporting a study on reliability or measurement error, additional items are relevant to consider or report. For example, the sample size of patients and the number of raters or repeated measurements are important in the design of a study, and when reporting, specific results such as the variance components, 95% confidence intervals around ICCs, marginals when reporting kappas, or additional assumptions are required.

2. Part A. Understanding how a study informs us about the reliability and measurement error of an outcome measurement instrument
In general, the design of a study on reliability and measurement error is about repeated measurement in stable patients. Each measurement is accompanied by some error. This error is caused by sources of variation, such as the equipment used, the professionals involved, and other components of measurement instruments. For example, the score on an instrument can be influenced by how the rater motivates the patient, how the machine was set up, or by the occasion (e.g. first and second occasion, day of the week, time of the day). In chapter 2.1 we systematically describe all components of outcome measurement instruments, which are the potential sources of variation of an outcome measurement instrument.

Many different sources of variation can affect the measurement, and each of them can be studied using a different study design. Each study design answers a different research question, and each research question gives specific information about the quality of the measurement instrument. To understand how a study can inform us about the quality of an outcome measurement instrument, we describe in chapter 2.2 seven elements of a comprehensive research question.

Part A of the tool contains the overviews of the components of outcome measurement instruments (for outcome measurement instruments that do not involve biological sampling, and those that involve biological sampling, respectively), and the seven elements of a comprehensive research question. In chapter 2.3 we provide an example in which we show how to use Part A of the tool, by applying it to a paper by Skeie et al. (19). In chapter 2.2 we will use this example, too (among other examples).

2.1 Components of outcome measurement instruments
All measurement instruments consist of components, such as equipment and preparatory actions. We developed two taxonomies of components of outcome measurement instruments, one for outcome measurement instruments that do not involve biological sampling (i.e. ClinROMs and PerFOMs) (see Table 2), and one for those that do (i.e. the laboratory values, such as blood or urine tests, tissue biopsy) (see Table 3).
Table 2. Components of outcome measurement instruments that do not involve biological sampling

Component: Equipment
Elaboration: All equipment necessary in the preparation, the administration, and the assignment of scores of the outcome measurement instrument.
Examples: Questionnaire forms, computers, tablet, pen and paper; stair steps of a specific height; device or tools (such as stopwatch, probe, tube); ultrasound machine, ultrasound gels, MRI scanner; software.

Component: Preparatory actions preceding raw data collection by professionals, patients, and others (if applicable)
Elaboration:
1. General preparatory actions, such as required expertise or training for professionals to prepare, administer, store or assign the scores.
2. Specific preparatory actions for each measurement, such as
   - preparations of equipment, environment, storage by professionals (a)
   - preparations of the patient (b) by the professional
   - preparations undertaken by the patients
Examples:
- General preparatory actions: training, education or experience required, certification.
- Preparation of equipment: calibration of device/equipment, adjust settings of the machine.
- Preparation of the environment: light conditions, room temperature, humidity, specific length of a walking track.
- Preparation for storage: design database and logbook.
- Preparations of the patient by the professional: provide general and preparatory instructions for the patients, such as explaining the tasks/action that need to be performed including time schedule, safety issues and side effects; instructions on diet (e.g. use of caffeine), clothing (e.g. comfortable shoes, no jewelry, glasses or devices), performance during tests (e.g. perform a task as usual; try to walk as fast as you can; lie as calm as possible); set some training or perform a familiarization session. Attaching electrodes to the body, injection with radioactive substance or contrast dye, positioning the patient, applying ultrasound gel.
- Preparations undertaken by the patients: listening to and understanding the instructions provided; adherence to the preparatory instructions such as fasting, resting, taking medication, bowel preparation, exercising, shaving.

Component: Collection of raw data
Elaboration: All actions undertaken by patient and professional(s) to collect the data, before any data processing.
Examples: The patient completing questions at home, or at the hospital; or performing the tasks; the rater observing or timing the performance; switching the imaging device on and off; positioning and moving the ultrasound probe.

Component: Data processing and storage
Elaboration: All actions undertaken on the raw data to store it in a usable (electronic) form for later data manipulation (such as score assignment or statistical analysis).
Examples: The digitally converted signal of a specific body MRI scan, which is temporarily stored in the K-space, is sent to an image processor where a mathematical formula (i.e. Fourier transformation) is applied, leading to an image which is displayed on a monitor and saved on a computer. Other examples: answers to question items are recorded on e.g. paper forms and stored, or Likert scale format response options are converted into a 0-4 score and directly entered in a computer database. Performance of data quality checks, e.g. double entry or validation checks on the stored/entered data.

Component: Assignment of the score(s)
Elaboration: Methods used to convert processed data into a score (c) that constitutes the outcome measurement instrument.
Examples: A calculation of a mathematical formula or the application of a scoring algorithm (e.g. a set of rules to be followed) to the processed data; a clinician selects the specific images and judges the severity and quantity of e.g. lesions on the set of images or compares it to a reference; scores adjusted for e.g. missing data or patients using devices such as mobility aids.

(a) Professionals are those who are involved in the preparation or the performance of the measurement, in the data processing, or in the assignment of the score; this may be done by one and the same person, or by different persons.
(b) In the COSMIN methodology we use the word ‘patient’. However, sometimes the target population is not patients, but e.g. healthy individuals, caregivers, clinicians, or body structures (e.g. joints, or lesions). In these cases, the word patient should be read as e.g. healthy volunteer, clinician, or the relevant body structure.
(c) The score can be further used or interpreted, by converting a score to another scale, metric or classification. For example, a continuous score is classified into an ordinal score (e.g. mild/moderate/severe), a score is dichotomized into below or above a normal value, patients are classified as responder to the intervention (e.g. when their change is larger than the Minimal Important Change (MIC) value).
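The conversions named in footnote (c) can be sketched in a few lines. The example below is only an illustration: the function names, the cut-off values, the normal value, the MIC value, and the assumption that a higher score means improvement are all hypothetical and would have to come from the measurement protocol of the instrument concerned.

```python
def to_ordinal(score, cut_offs=(10.0, 20.0)):
    """Classify a continuous score as mild, moderate or severe (hypothetical cut-offs)."""
    mild_upper, moderate_upper = cut_offs
    if score < mild_upper:
        return "mild"
    if score < moderate_upper:
        return "moderate"
    return "severe"


def is_above_normal(score, normal_value=15.0):
    """Dichotomize a score into below/at or above a normal value (hypothetical value)."""
    return score > normal_value


def is_responder(baseline, follow_up, mic=5.0):
    """Classify a patient as responder when the change exceeds the MIC.

    Assumes, for illustration only, that a higher score means improvement.
    """
    change = follow_up - baseline
    return change > mic


print(to_ordinal(12.5))
print(is_above_normal(12.5))
print(is_responder(baseline=8.0, follow_up=14.0))
```

Each function only reinterprets a score that has already been assigned; none of them repeats any part of the measurement itself.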
Table 3. Components of outcome measurement instruments that involve biological sampling

Component: Equipment
Elaboration: All equipment used in the preparation, the administration, and the determination of the values of the outcome measurement instrument.
Examples: Collection tools, such as venipuncture set, biopsy tool; material containers, such as for blood plasma (EDTA or heparin tube), for tissue (container for frozen specimens for immunofluorescence, jar filled with formalin), for urine collection (sterile, screw-top container), for standard microscopic tissue evaluation (fluid or tissue for culture (sterile jar)); laboratory equipment such as centrifuges, cabinets, and chromatography systems, computers, software.

Component: Preparatory actions preceding sample collection by professionals, patients, and others (if applicable)
Elaboration:
1. General preparatory actions, such as required expertise or training for professionals to prepare, administer, store and determine the value.
2. Specific preparatory actions for each measurement, such as
   - preparations of equipment, environment, and storage by professionals (a)
   - preparation of the patient (b) by the professional
   - preparatory actions undertaken by the patients
Examples:
- General preparatory actions: training, education or experience required, certification.
- Preparation of equipment: calibration of device/equipment, adjust settings of the machine.
- Preparation of the environment: light conditions, room temperature, humidity.
- Preparation of storage: set up all equipment for storage.
- Preparation of the patient by the professional: provide general and preparatory instructions to the patients, such as explaining the measurement procedure including safety issues and side effects; instructions on diet; insertion and withdrawal of a catheter into a blood vessel.
- Preparatory actions undertaken by the patients: listening to and understanding the instructions provided; adherence to the preparatory instructions such as fasting, resting, taking medication, exercising, shaving, washing of hands.

Component: Collection of biological sample
Elaboration: All actions undertaken to collect the biological sample, before any sample processing.
Examples: Taking a blood sample or tissue biopsy, collection of a sample of urine ‘mid-stream’ in a container.

Component: Biological sampling processing and storage
Elaboration: All actions undertaken to be able to preserve, transport, and store the biological sample for determination; and, if applicable, further actions undertaken on the stored sample to be able to conduct the determination of the biological sample.
Examples: Initial reaction of material to reagent in container (e.g. anticoagulation by heparin). Blood is decomposed (by gravity) into plasma and blood cells, and stored at a specific temperature. Tissue is snap frozen by immersion in liquid nitrogen, or fixed in formalin, embedded in/processed to paraffin for long-term storage. Blood is collected in a tube containing an aqueous solution of the tetra-sodium salt of ethylene-diamine-tetra-acetic acid (EDTA) and mixed with air to lyse the erythrocytes and convert hemoglobin to oxyhemoglobin. Cut sections or prepare a smear on a slide; tissues are stained by immunofluorescent markers specific for certain surface antigens. Screw the lid of the urine container shut, put it in a sealed plastic bag and store it in the fridge at around 4 degrees Celsius, for max. 24 hours.

Component: Determination of the value of the biological sample
Elaboration: Methods used for counting or quantifying the amount of the substance or entity of interest (c).
Examples: The absorbance of oxyhemoglobin at 540 nm through spectrophotometry quantifies the hemoglobin concentration in the sample. The presence of the marker on the cell surface is detected and quantified by fluorescence signal intensity. Rater observes each slide and counts positive cells in an area. A calculation or the application of a mathematical formula to the prepared sample.

(a) Professionals are those who are involved in the preparation or the performance of the measurement, in the data processing, or in the assignment of the score; this may be done by one and the same person, or by different persons.
(b) In the COSMIN methodology we use the word ‘patient’. However, sometimes the target population is not patients, but e.g. healthy individuals, caregivers, clinicians, or body structures (e.g. joints, or lesions). In these cases, the word patient should be read as e.g. healthy volunteer, clinician, or relevant body structure.
(c) The value can be further processed into a clinical score, if applicable, by a linear or semi-quantitative conversion. For example, a continuous score is classified into an ordinal score (e.g. mild/moderate/severe), a score is dichotomized into below or above a normal value, patients are classified as responder on treatment (e.g. when their change is larger than the Minimal Important Change (MIC) value). As no noise will occur from this conversion, this is not a potential source of variance, but rather an interpretation of the value. Therefore we do not include this phase in the components for outcome measurement instruments that involve biological materials.
2.2 Extracting the elements of a comprehensive research question
Before we can comprehensively assess the information in a study on the reliability or measurement error of an instrument, we need to fully understand the design of the study and reformulate the research question into what we call a ‘comprehensive research question’. Often the published research question is not specific enough to rate the adequacy of the study design. For example, if the stated aim of a study is to assess inter-rater reliability of an instrument, it is clear that raters will be varied. However, without further information it is not clear whether the interest is in the inter-rater reliability of the whole measurement procedure (e.g. by different clinicians), or only in the reliability of a part of the measurement procedure (e.g. only the assignment of the score based on an image). To get a complete picture, we recommend extracting seven elements from the publication that together can form the ‘comprehensive research question’ (see Table 4). Note that one article can contain multiple questions, each requiring an extraction of the seven elements.

Table 4. Elements of a comprehensive research question.
1 the name of the outcome measurement instrument
2 the version of the outcome measurement instrument or way of operationalization of the measurement protocol
3 the construct measured by the measurement instrument
4 a specification whether one is interested in a reliability parameter (i.e. a relative parameter such as, for continuous outcomes, an ICC or Generalizability coefficient φ, or Kappa κ) or a parameter of measurement error (i.e. an absolute parameter expressed in the unit of measurement, e.g. SEM, LoA or SDC; or for categorical outcomes expressed as agreement or misclassification, e.g. the percentage specific agreement)
5 a specification of the components of the measurement instrument that will be repeated (especially when only part of the measurement instrument is repeated, e.g. only assignment of the score based on the same images)
6 a specification of the source(s) of variation that will be varied (e.g. time or occasion, the (level of expertise of) professionals, the machines, or other components of the measurement)
7 a specification of the patient population studied
ICC = Intraclass correlation coefficient; SEM = standard error of measurement; LoA = Limits of Agreement; SDC = smallest detectable change.
Elaboration on the elements of a comprehensive research question

Element 1. The name of the outcome measurement instrument

The name of the instrument should be exactly specified. Sometimes, this is readily apparent, e.g. the 6 minute Walking test (6MWT) or the Nine Hole Peg Test (NHPT). In some cases, a measurement protocol involves multiple measurement instruments (e.g. the Multiple Sclerosis Functional Composite (MSFC) includes the Timed 25-Foot Walk test, the Nine Hole Peg Test, and the Paced Auditory Serial Addition Test (11)), while in other cases (e.g. imaging) there may not yet be a clear name. Note that the name of the machine is not the name of the outcome measurement instrument; often a machine can be used to measure a variety of parameters (e.g. Grey scale ultrasound [to measure] synovial thickening (synovial hypertrophy) or Doppler ultrasound [to measure] increased blood flow (synovial hyperemia) (19)), or a pathological entity can be measured by different types of images (for example, enthesitis measured by ultrasound (17) or by MRI (20)). We recommend including the type of measurement (e.g. ultrasound) in combination with the entity measured as the name of the score (e.g. ultrasound enthesitis score).

Element 2. The version of the outcome measurement instrument or way of operationalization of the measurement protocol

Details on the version and operationalization of the outcome measurement instrument should be extracted. Details on the specific version refer to e.g. the length of the task (e.g. the 2-, 6- or 12-minute walking test (21)), or the number of items included in the version (e.g. Doloplus-1 or Doloplus-2 (13)), or the language used (the English (21) or Dutch version (22) of the 6-minute walk test). Choices in how the measurement protocol was operationalized may affect the measurement, and should thus be made explicit. Specifically, the components that are potential sources of variation need to be listed, for example, specific characteristics of the equipment used (e.g. brand and type of the machine), and characteristics of the professionals involved in the measurement (e.g. background and experience). The taxonomy of the components of measurement instruments (see chapter 2.1) can be used for this.

Element 2 refers to components known or expected to influence the score that are not the object of study. To eliminate the influence of these potential sources of variation on the scores obtained, these components should have been restricted or standardized in the study. For example, if it is expected that different types or brands of machines may interfere with the score, only one type and brand of machine is used (and reported). In the study by Skeie et al (2015) only the Medison Accuvix V10 ultrasound scanner with a 3–7 MHz curvilinear probe was used (19); in other words, the brand and type of machine and probe were standardized. Moreover, chiropractors with respectively 4 and 8 years of experience in diagnostic ultrasound for the musculoskeletal system, and with a
postgraduate diploma in diagnostic ultrasound were involved in the measurements (19). Thus, the background of the raters was restricted to a specific profession (i.e. chiropractors) with a specific duration of expertise (4/8 years in diagnostic ultrasound) having received specific training.

In addition, in some cases the instrument procedure requires multiple readings, and a summary statistic (usually the mean, but sometimes the median, maximum or minimum) is calculated as, or used to assign, the final score (i.e. the result of the measurement). A well-known example is blood pressure measurement in the clinic (see footnote 1). How the measurement is taken should be specified, as it is needed to assess standard 7 (see chapter 3).

For people familiar with the terminology of the Generalizability Theory, the version or the way of operationalization of the measurement instrument refers to the facets of stratification, where patients (i.e. the object of measurement) are nested in a facet (23).
Element 3. The construct measured by the measurement instrument

To identify exactly which outcome measurement instrument was studied, we recommend extracting the construct measured, unless it is clear from the given name. The construct refers to what is being measured, i.e. the ‘aspect of health’. It is also referred to as the ‘concept of interest’ or the ‘intended objective to be measured’. When the measurement instrument does not have a name, identifying the construct can help to fully characterize the outcome measurement instrument (which we also recommend mentioning in the name, i.e. element 1). Table 5 provides some examples. Note that a study on reliability or measurement error does not provide information about whether the construct is indeed being measured; for that you need validity and accuracy studies.
1 To measure blood pressure, the technician first palpates the radial artery, inflates the cuff until the pulse disappears, inflates an extra 20-30 mm Hg, and then slowly deflates until the pulse reappears. The pressure is noted, and the measurement begins: first, the stethoscope is placed on the brachial artery just medial and above the cubital fold. Then the cuff is reinflated. The pressure is quickly increased to 30 mm Hg above the previous reading, and then slowly deflated until the pulse sounds are detected (systolic blood pressure, measured in 2 mm increments), then further deflated until the sounds disappear (diastolic blood pressure). The cuff is fully deflated, then inflated again to repeat the measurement.
Table 5. Examples of elements 1, 2, and 3.

Example 1
Element 1 (name): Nine hole peg test (24)
Element 2 (version / operationalization): A wooden or plastic board with 9 holes (10 mm diameter, 15 mm depth), placed apart by 32 mm (25)
Element 3 (construct): Finger dexterity

Example 2
Element 1 (name): Ultrasound enthesitis score
Element 2 (version / operationalization): Sonography images obtained by experienced sonographers using the Esaote Technos MPX machine
Element 3 (construct): Enthesitis

Example 3
Element 1 (name): HbA1c value based on immune-turbidimetry (12)
Element 2 (version / operationalization): Turbidimetric inhibition immunoassay (TINIA), including 2 reagents (i.e. anti-HbA1c antibody (R1), and buffer/polyhapten reagent (R2)); Tetradecyltrimethylammonium bromide (TTAB) as detergent; Roche/Hitachi cobas c systems.
Element 3 (construct): HbA1c (glycated haemoglobin)
Element 4. Specification of the measurement property of interest

When the measurement property of interest is reliability, the study will report relative parameters such as an ICC, Generalizability coefficient φ, or Kappa κ. When the measurement property of interest is measurement error, the study will report absolute parameters, either expressed in the unit of measurement, such as SEM, LoA or SDC, or expressed as agreement or misclassification, e.g. the percentage specific agreement.

We recommend using the COSMIN terminology to determine whether a study assessed reliability or measurement error, regardless of the terms used in the article, because confusion persists about the correct application of these terms. For example, when a particular article states that ‘reliability’ was assessed, but the standard error of measurement (SEM) or the limits of agreement are reported, the result of that study should be considered as evidence for measurement error (26). When an author states that ‘agreement between raters’ was evaluated using the kappa statistic, the result of this study refers to the reliability of the outcome measurement instrument (27).
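To make the term "absolute parameter" concrete, the following is a minimal sketch of how LoA, SEM and SDC could be derived from two repeated measurements per patient. It is an illustration under stated assumptions, not the method prescribed by the tool: the function name and the example scores are hypothetical, and the SEM is computed as the standard deviation of the differences divided by √2, a variant that ignores systematic differences between the two measurements.

```python
import math
from statistics import mean, stdev


def measurement_error_parameters(first, second):
    """Absolute parameters (in the unit of measurement) for paired scores.

    Assumption: exactly two measurements per patient, continuous scores.
    """
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = mean(diffs)          # systematic difference between measurements
    sd_diff = stdev(diffs)           # sample SD of the differences
    # Limits of Agreement: mean difference +/- 1.96 * SD of the differences
    loa = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)
    # SEM from the SD of the differences (ignores systematic differences)
    sem = sd_diff / math.sqrt(2)
    # Smallest Detectable Change at the individual level
    sdc = 1.96 * math.sqrt(2) * sem
    return {"mean_diff": mean_diff, "loa": loa, "sem": sem, "sdc": sdc}


# Hypothetical scores for four patients measured twice
result = measurement_error_parameters([10, 12, 14, 16], [11, 11, 15, 15])
print(result)
```

All four values are in the unit of the score, which is what distinguishes them from relative parameters such as the ICC.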
Element 5. Specification of the components of the measurement instrument that will be repeated (Figure 1)

It should be extracted whether the interest of the study is in the reliability or measurement error of the whole measurement procedure (see Figure 1, study A), or only in part of the measurement procedure (see Figure 1, study B). For example, based on a static image that was made once for a patient, only the assignment of the score was repeated, or the performance of a task by each patient was videotaped, and only the last component (i.e. assignment of the scores) is repeated.

Figure 1. Which part of the measurement is repeated.
Element 6. Specification of the components of the measurement instrument that will be varied

The component of the measurement instrument that is being varied across the measurements is the main focus of the study. Examples are time or occasion (test-retest, or intra-rater), the professionals (inter-rater), or the machines (inter-machine or inter-device) (28). For example, in Figure 1 raters are varied: rater A conducts the first measurement and rater B conducts the second measurement for each patient.
In the design of the study one or more sources can be considered. For example, both the machine and the rater who conducts the whole measurement are varied across the repeated measurements (see Figure 2, study A). The taxonomies of components of measurement instruments (see chapter 2.1) can be used to consider various potential sources of variation.

Figure 2. Designs in which components are varied across repeated measurements

Alternatively, the researchers can assume that a component (e.g. preparation or assignment of the score) is ‘stable’, in other words, that the rater who prepares the measurement or who assigns the score will not introduce error in this part of the measurement (indicated in grey in Figure 2, studies B and C), and investigate only the influence of other components, e.g. equipment, preparation, collection of raw data, and data processing and storage.

In the designs shown in Figures 1 and 2 we assume that all patients were measured this way. This is called a crossed design (29). However, so-called nested designs are possible, too (see Figure 3). In these designs, some of the patients are measured following measurement conditions A and other patients are measured using measurement conditions B. In Figure 3 a nested inter-rater reliability design is shown, where some of the patients are measured first by rater A and next by rater B (i.e. measurement condition A), while other patients are measured first by rater C and next by rater D (i.e. measurement condition B), etc. These designs are appropriate to use, and this can be taken into account in the calculation of the ICC. For example, by calculating
variance components per measurement condition, and next pooling these variance components (weighted by sample size) across the measurement conditions (e.g. (30)), or by using a one-way random effects model (31).

Figure 3. Nested inter-rater reliability design.

For people familiar with the terminology of the Generalizability Theory, the components that are being varied across measurements are called the random or fixed facets of Generalizability (23).
Element 7. Patient population

The reliability depends on the homogeneity or heterogeneity of the study population. Therefore, the sample (and its subgroups) included in the study should be extracted and assessed by the user of this tool. In the study by Skeie et al (2015) the recruited sample consisted of low back pain patients and patients with other spinal complaints, but also of pain-free subjects. This latter group could have increased the variance between patients and, subsequently, influenced the results (i.e. increased the ICC) of the reliability study.
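The dependence of a reliability parameter on sample heterogeneity can be illustrated with a small sketch. It assumes the common simplified form of a reliability coefficient, variance between patients divided by the sum of that variance and the error variance; the function name and all variance values are hypothetical and are not taken from the Skeie study.

```python
def icc_from_variances(var_between_patients, var_error):
    """Reliability as the proportion of total variance due to true
    differences between patients (simplified, assumed form)."""
    return var_between_patients / (var_between_patients + var_error)


# Hypothetical values: the measurement error is identical in both samples,
# only the spread between patients differs.
var_error = 4.0
icc_homogeneous = icc_from_variances(6.0, var_error)     # patients only
icc_heterogeneous = icc_from_variances(16.0, var_error)  # patients plus pain-free subjects

print(icc_homogeneous, icc_heterogeneous)
```

With the same error variance, the more heterogeneous sample yields the higher coefficient, which is why the composition of the sample has to be extracted.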
In the COSMIN methodology we use the word patient. However, sometimes the study population of interest consists of healthy individuals, body structures (e.g. joints, kidneys), clinicians or caregivers. In these cases, the word patient should be read as e.g. healthy person or caregiver.

For people familiar with the terminology of the Generalizability Theory, the patient population refers to the object of measurement or the facets of differentiation (23).
2.3 Example of how to use Part A of the COSMIN Risk of Bias tool to assess the quality of a study by Skeie et al. (2015)

In this chapter we provide an example of how to use Part A of the COSMIN tool, using a paper by Skeie et al. (19). To get a full understanding of the study, we recommend first reading the introduction and method section of the paper. In this paper four different studies are described. Here we use the first two substudies, and provide a summary of these two studies.

In this paper, the lumbar multifidus muscle (LMM) thickness score (study 1) and contraction score (study 2) were investigated by ultrasound. The measurement proceeds as follows: a patient is asked to lie down in a specific position, and the probe is placed on a very specific body part. This yields an on-screen image. Subsequently, a marker is placed on a specific structure (i.e. the apex of the facet joint) identified on the image. In study 1, a still image is recorded, and the first rater places the second marker on another specific structure (i.e. processus mammillaris) on this image, and measures the distance between the markers with the calliper software. The distance between the two markers corresponds with the thickness of the LMM. The first rater repeats the second marker placement and distance measurement on the still image twice, for a total of three measurements. The patient leaves. Next, based on the very same still image (with only the first marker visible) a second rater places the second marker on the screen and measures the distance a total of three times. Next, all data is transferred to a separate paper by rater 1, who calculates a mean value per patient per rater. This mean value is the LMM thickness score. The repeated placement of the second marker on the still image and application of the calliper tool to measure the distance between the two markers is part of one measurement (19). This procedure is depicted in Figure 3, study 1.

Figure 3. Study designs of Skeie et al.
In study 2, for each patient each of the raters independently generated one image of the LMM in the resting state and one image of the LMM in the contracted state. Using a split-screen of the two still images of both states, each rater measured thickness (i.e. calliper-assessed distance between the markers) of the two states three times. Next, rater 1 transferred the data to a separate paper and calculated mean values of the thickness of each state. Next, rater 1 calculated the ‘LMM contraction score’ as the exact change in thickness (contracted LMM minus resting LMM) (19). This procedure is depicted in Figure 3, study 2.
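The score assignment described for study 2 (mean of three thickness readings per state, then contracted minus resting) can be sketched as follows. The function name, the readings and the unit (mm) are hypothetical and serve only to illustrate the arithmetic; they are not data from Skeie et al.

```python
from statistics import mean


def lmm_contraction_score(resting_readings, contracted_readings):
    """Mean contracted thickness minus mean resting thickness,
    for one patient and one rater."""
    return mean(contracted_readings) - mean(resting_readings)


# Hypothetical readings in mm: three calliper measurements per state
resting = [24.0, 25.0, 26.0]
contracted = [29.0, 30.0, 31.0]

print(lmm_contraction_score(resting, contracted))
```

Because the summary statistic is part of the score assignment, the number of readings averaged belongs to the operationalization of the instrument (element 2).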
Based on the thorough elaboration of the study performed and described by Skeie and colleagues, we extract the elements of a comprehensive research question.

Table 6. Example of how to use Part A of the COSMIN Risk of Bias tool based on the study by Skeie (19).

Element 1. Name of the instrument
Instruction: Alternatively: type of instrument and parameter
Study 1: Ultrasound measurement of the lumbar multifidus muscle (LMM) thickness score
Study 2: Ultrasound measurement of the LMM contraction score

Element 2. Version or way of operationalization
Instruction: All relevant components that are known or expected to influence the score, and which are standardized or restricted (facet of stratification (23))
Studies 1 and 2:
  Equipment: Medison Accuvix V10 ultrasound scanner with a 3–7 MHz curvilinear probe.
  Preparatory actions: two chiropractors with 4 and 8 years, respectively, of experience in diagnostic ultrasound for the musculoskeletal system, with a postgraduate diploma in diagnostic ultrasound; still on-screen images were obtained with the subjects in a prone position with a pillow placed under the abdomen to flatten the lumbar lordosis.
Study 1:
  Preparation: An image was generated on-screen and a marker was placed on the image on the mamillary process of the level to be measured.
  Unprocessed data collection: The second marker was placed on the on-screen image, and the distance was computed by the calliper software. This part was repeated three times.
  Data processing and storage: Data is transferred to a separate paper by rater 1.
  Assignment of the score: Rater 1 calculated a mean value per patient per rater.
Study 2:
  Preparation: In resting position, an image was generated on-screen and a marker was placed on the image on the mamillary process of the level to be measured. Next, in contracted state (LMM contraction was induced by a contralateral arm lifting task), an image was generated on-screen, too, and a marker was placed on the image.
  Unprocessed data collection: Based on the split-screen of both images, the second marker was placed on each image, and the distance (per image) was calculated by the calliper software. This part was repeated three times.
  Data processing and storage: Data is transferred to a separate paper by rater 1.
  Assignment of the score: Rater 1 calculates a mean value per patient per rater for both states. Next, the rater calculated the ‘LMM contraction score’ as the exact change in thickness (contracted LMM minus resting LMM).

Element 3. Construct
Instruction: Description of what is being measured
Study 1: LMM thickness
Study 2: LMM contraction, which is the change in LMM thickness between contracted and resting state (contracted LMM minus resting LMM).

Element 4. Measurement property
Study 1: Reliability and measurement error
Study 2: Reliability and measurement error

Element 5. Components that will be repeated
Instruction: Either the whole measurement (i.e. all components) or the assignment of the score (i.e. last component)
Study 1: The whole measurement will be repeated. However, the focus of interest is on the unprocessed data collection: placing of the second marker on the on-screen image (mean of three times).
Study 2: The whole measurement will be repeated. However, the focus of interest is on the preparation (i.e. preparation and generation of images in the resting and contracted states, and the placing of the first marker), and on the unprocessed data collection (placing of the second marker on the on-screen image (mean of three times)).

Element 6. Source(s) of variation varied
Instruction: Components that are varied across the measurements (i.e. focus of analysis; facet of generalizability (23))
Study 1: Raters (n=2; inter-rater reliability)
Study 2: Raters (n=2; inter-rater reliability)

Element 7. Patient population
Instruction: (i.e. facet of differentiation (23))
Studies 1 and 2: LBP patients, patients with other spinal complaints such as mid back pain, neck pain, and/or extremity pain, and pain-free subjects (n=30 in each experiment, total n=120)
Based on the extracted information, a comprehensive research question can be formulated as:

Study 1: What is the inter-rater reliability of the data collection phase of the lumbar multifidus muscle (LMM) thickness score, based on the mean of three distances marked with the calliper software on a still image of the ultrasound measurement, measured using the Medison Accuvix V10 ultrasound scanner with a 3–7 MHz curvilinear probe by post-graduate experienced chiropractors, in LBP patients, patients with other spinal complaints such as mid back pain, neck pain, and/or extremity pain, and pain-free subjects?

Study 2: What is the inter-rater reliability of the preparation, generation, and data collection phases of the lumbar multifidus muscle (LMM) contraction score, based on the mean of three distances marked with the calliper software on an on-screen image in resting and in contraction state of the ultrasound measurement, measured using the Medison Accuvix V10 ultrasound scanner with a 3–7 MHz curvilinear probe by post-graduate experienced chiropractors, in LBP patients, patients with other spinal complaints such as mid back pain, neck pain, and/or extremity pain, and pain-free subjects?

Please note that we do not recommend always reporting the research question as one long question like this. However, we consider it very useful to describe all this information clearly, e.g. in the method section of a paper.
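For reviewers who extract the seven elements for many studies, it may help to record them in a fixed structure. The sketch below is one possible way to do so; the class name, field names and the `missing_elements` helper are hypothetical and not part of the COSMIN tool. The example values are taken from the extraction for study 1 above.

```python
from dataclasses import dataclass, fields


@dataclass
class ComprehensiveResearchQuestion:
    """The seven elements extracted in Part A (field names are illustrative)."""
    instrument_name: str = ""                # element 1
    version_or_operationalization: str = ""  # element 2
    construct: str = ""                      # element 3
    measurement_property: str = ""           # element 4
    components_repeated: str = ""            # element 5
    sources_of_variation: str = ""           # element 6
    patient_population: str = ""             # element 7

    def missing_elements(self):
        """Names of the elements that have not been extracted yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


study_1 = ComprehensiveResearchQuestion(
    instrument_name="Ultrasound measurement of the LMM thickness score",
    version_or_operationalization=(
        "Medison Accuvix V10 scanner, 3-7 MHz curvilinear probe; "
        "two chiropractors; mean of three marked distances on a still image"
    ),
    construct="LMM thickness",
    measurement_property="Reliability and measurement error",
    components_repeated="Whole measurement; focus on unprocessed data collection",
    sources_of_variation="Raters (n=2; inter-rater reliability)",
    patient_population=(
        "LBP patients, patients with other spinal complaints, pain-free subjects"
    ),
)

print(study_1.missing_elements())
```

An empty list indicates that all seven elements were filled in; a publication with several research questions would yield one such record per question.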
3. Part B. Assessing the risk of bias of a study on reliability or measurement error

Part B of the COSMIN Risk of Bias tool contains two boxes with standards that can be used to determine whether the result of a study on reliability or measurement error, respectively, can be trusted. Standards refer to the design requirements of the study or to the preferred statistical methods. Standards 1 to 5 in both boxes refer to design requirements. These standards are the same for studies on reliability and for studies on measurement error, as the same design can be used for assessing both measurement properties. Three standards refer to the preferred statistical methods for studies on reliability and two standards refer to the preferred statistical methods for studies on measurement error.

In the COSMIN Risk of Bias tool, we included standards concerning the preferred statistical methods that are appropriate to use when evaluating reliability or measurement error of outcome measurement instruments (see also section 1.6). Other methods may be appropriate to use as well (for example bi-factor models or Multi-Trait Multi-Method (MTMM) analyses, or newly developed methods). It is not our intention to comprehensively describe all possible statistical methods, but rather to describe the adequate methods that are commonly used in the literature.

Each box also contains a standard asking whether there were any other important methodological flaws that are not covered by the other standards (standard 6), but that may have led to biased results or conclusions. Some flaws are rather uncommon, and therefore do not justify a separate standard. In chapter 3.1 we provide several examples of these flaws.

Each standard is scored on a four-point rating system (i.e. ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’), in line with the COSMIN Risk of Bias checklist for Patient-Reported Outcome Measures (PROMs) (1). Subsequently, the lowest rating given in a box determines the final rating, i.e. the quality of the study (this is called the worst-score-counts method (18) to determine the risk of bias). Sometimes a response option is indicated in grey, meaning that the response option is not applicable for the standard, and users should choose between the other options. Finally, some standards can be rated as ‘not applicable’.

In general, a standard on a design requirement is rated as ‘very good’ when evidence or convincing arguments are provided that the standard is met; ‘adequate’ when it is assumable, although not explicitly described, that the standard is met; ‘doubtful’ when it is unclear whether the standard is met; and ‘inadequate’ when there is evidence that the standard is not met (18). A standard about preferred statistical methods is in general rated as ‘very good’ when a preferred method was optimally used; ‘adequate’ when the preferred method was used,
but it was not optimally used; ‘doubtful’ when it is unclear if a preferred method was used; and ‘inadequate’ when the statistical methods used are considered inadequate.

The boxes for reliability and measurement error, respectively, can be found here. Below, an elaboration of each standard is described for reliability (chapter 3.1) and measurement error (chapter 3.2). In chapter 3.3 we provide an example for rating the box on reliability in the study by Skeie, which was also used as an example in chapter 2.3.
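The worst-score-counts method can be sketched in a few lines. This is an illustration only: the function name, the dictionary layout and the example ratings are hypothetical, and standards rated ‘not applicable’ are assumed to be left out of the comparison.

```python
# Ratings ordered from worst to best
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def worst_score_counts(ratings):
    """Return the lowest rating given in a box.

    `ratings` maps a standard label to its rating; standards rated
    'not applicable' are ignored (assumption of this sketch).
    """
    applicable = [r for r in ratings.values() if r != "not applicable"]
    if not applicable:
        return None
    # The rating with the smallest index in RATING_ORDER is the worst one
    return min(applicable, key=RATING_ORDER.index)


# Hypothetical ratings for the box on reliability
box_reliability = {
    "standard 1": "not applicable",
    "standard 2": "very good",
    "standard 3": "adequate",
    "standard 4": "doubtful",
    "standard 5": "very good",
    "standard 6": "very good",
    "standard 7": "adequate",
}

print(worst_score_counts(box_reliability))
```

A single low rating therefore determines the quality rating of the whole study, regardless of how many other standards were rated ‘very good’.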
3.1 Elaboration on standards for studies on reliability

The box on reliability contains five standards about design requirements, one standard on ‘other flaws’ and three standards about preferred statistical methods. For each standard we give suggestions for how to rate the standard.

Standard 1. Stability of the patient
Were patients stable in the time between the repeated measurements on the construct to be measured?
very good: Yes (evidence provided)
adequate: Reasons to assume standard was met
doubtful: Unclear
inadequate: No (evidence provided)
NA: Not applicable
Elaboration: Patients should be stable with regard to the construct to be measured between the repeated measurements. When an intervention such as surgery or medication is given in the interim period, it is likely that (many of) the patients have changed on the construct to be measured. In other words, they are not stable, and the standard should be rated as ‘inadequate’. When the aim is to assess the reliability of the assignment of the score, e.g. using static images or videos of the performance of a task as object of interest (see Figure 1, study 2, in chapter 2.2), this standard is not applicable as the images and videos were acquired only once. Furthermore, the measurement can interfere with the stability of the patient. For example, there should be enough time between repeated measurements for patients to recover from experienced pain or fatigue and to return to their initial state. If not, the standard should be rated as ‘doubtful’, as it is unclear whether the patients are stable on the construct to be measured. When evidence or convincing arguments are provided that the patients were stable, the standard is scored ‘very good’.

Standard 2: Time interval
Was the time interval between the measurements appropriate?
very good: Yes
doubtful: Doubtful, OR time interval not stated
inadequate: No
Elaboration: The time interval between the measurements must be appropriate. The definition of “appropriate” depends on the construct to be measured and the study population. The time interval should be long enough to prevent recall bias of previous scores in case of intra-rater reliability, and short enough to ensure that patients have not changed on the construct to be measured. For example, synovitis can change in a few days, while a change in cartilage or bone status would take a few months.
Standard 3. Similar measurement conditions
Were the measurement conditions similar for the measurements, except for the condition being evaluated as a source of variation?
very good: Yes (evidence provided)
adequate: Reasons to assume standard was met, OR change was unavoidable
doubtful: Unclear
inadequate: No (evidence provided)
Elaboration: Each repeated measurement should be conducted with the same measurement protocol, except for the source of variation that was intentionally varied, i.e. element 6 of the comprehensive research question (see chapter 2.2). For example, if the aim was to understand the variation due to different raters (i.e. inter-rater reliability), only the raters should be varied. Other concomitant sources of variation (i.e. element 2 of the comprehensive research question, see chapter 2.2) should be kept similar. Was the study up to standard? Were all equipment, preparatory actions, the environmental conditions (e.g. temperature), and methods of processing the same in both measurements? For example, when the patients are very likely to show a learning effect (for example on a performance-based test), the absence of a familiarization session should yield a rating of doubtful or inadequate on this standard, as the first measurement can then be considered to be the familiarization session, and the measurement conditions are not the same. A description of similarity of the measurement conditions of the repeated measurements can be considered as evidence.

Standard 4. Administration of measurements
In instruments that do not involve biological sampling, the administration refers to the components ‘Collection of raw data’ and ‘Data processing and storage’ (see chapter 2.1). In instruments involving biological sampling, it refers to the components ‘Collection of biological sampling’ and ‘Biological sampling processing and storage’ (see chapter 2.1).
Did the professional(s) administer the measurement without knowledge of scores or values of other repeated measurement(s) in the same patients?
very good: Yes (evidence provided)
adequate: Reasons to assume standard was met
doubtful: Unclear
inadequate: No (evidence provided)
Elaboration: All measurements should be administered by the professional(s) involved without them having knowledge of the scores or values of other repeated measurements on the same outcome measurement instrument. This means that the measurements should all be administered without knowledge of the prior (e.g. in case of an intra-rater reliability study) or other (e.g. in case of an inter-rater reliability study) score(s) or value(s) on the instrument of interest.
The rating of this standard is rather subjective. For example, if in a study the raters independently administered the measurement, and none were involved in the care of the patients (making it very unlikely that the raters received additional information on the patients, including knowledge of the score(s) of other repeated measurements), this can be considered as ‘evidence provided’, and the rating is ‘very good’. When the other score is known to the professional while administering the repeated measurement, it may influence the way the measurement is administered. For example, with a severe score obtained with an imaging technique, the repeated measurement can be administered more carefully, and more time can be used to look at the patient. If it is known that this was the case, the rating is ‘inadequate’. When there is no explicit description, but it seems very unlikely that the raters knew the scores or values of other repeated measurements, it can be rated as ‘adequate’, or ‘doubtful’.

In some situations this standard is not applicable, i.e. when the administration (i.e. collection of the raw material or biological sample, data or sampling processing and storage) is not repeated in the study, but only the assignment of the score or the determination of the value (see for example chapter 2.2, element 5 of the comprehensive research question, or Figure 1, study 2).

Standard 5. Assignment of the score or determination of the biological value
Did the professional(s) assign scores or determine values without knowledge of the scores or values of other repeated measurement(s) in the same patients?
very good: Yes (evidence provided)
adequate: Reasons to assume standard was met
doubtful: Unclear
inadequate: No (evidence provided)
Elaboration: The scores on all measurements should be assigned, or values should be determined, by the professional(s) involved without them having knowledge of the scores or values of other repeated measurements. This means that assigning a score to a measurement or determining the value of a biological sample should be done without knowledge of the prior (e.g. in case of an intra-rater reliability study) or other (e.g. in case of an inter-rater reliability study) score(s) or value(s) on the instrument of interest. Although part of the determination of the value of a biological sample can be an automatic step, human action may be required for this determination. An example is a urine pH level test to measure the acidity or alkalinity of urine, where the color of the strip is interpreted by the professional. The rating is similar to that explained for standard 4.
Standard 6. Other important flaws
Were there any other important flaws in the design or statistical methods of the study?
very good: No
doubtful: Minor other methodological flaws
inadequate: Yes
Elaboration: This standard is included because there might be uncommon design flaws that are not covered by other standards but that may cause additional risk of bias. Below, some examples are provided.

When various professionals are involved in the measurement instrument, and one of the professionals is the attending physician of the patient, this physician has (much) more information about the patient than the other professionals. In some situations, depending on the aim of the study and the specific construct to be measured, this could be considered a flaw because of the influence on the scores obtained.

In the previous chapter we saw in the example of Skeie that part of the sample comprised healthy patients, whereas the authors were ultimately interested in these measurements in low back pain patients (19). This will increase the variance between patients, and thereby the results of the study (i.e. the ICC or G coefficient). Depending on where this study sits in the development of the instrument, this could be deemed proper (when the full range of the scores is not yet known) or an important flaw (when the purpose is to determine the reliability of measurement in the clinical setting of low back pain).

A final example refers to the use of the ICC model for average scores. Although discussed under standard 7 for reliability, it may be that the ICC for the mean score of the measurements is reported, whereas in clinical practice the single score is used. Depending on the purpose of the study this can be proper (when the mean score is going to be used in future research) or an important flaw (when the study is aimed at proving reliability in clinical practice, where the single score is used).

It is up to the user of the COSMIN Risk of Bias tool whether a flaw is considered minor (and is rated as ‘doubtful’) or important (and is rated as ‘inadequate’). The scores of the other flaws are included in the overall score/rating based on the worst-score-counts principle.
Standard 7: Preferred statistical methods for continuous scores
For continuous scores: was an intraclass correlation coefficient (ICC) calculated?
very good: ICC calculated; the model or formula was described, and matches the study design and the data
adequate: ICC calculated but model or formula was not described or does not optimally match the study design, OR Pearson or Spearman correlation coefficient calculated WITH evidence provided that no systematic difference between measurements has occurred
doubtful: Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic difference between measurements has occurred, OR WITH evidence provided that systematic difference between measurements has occurred
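The role that a systematic difference between measurements plays in this standard can be made concrete with a small sketch. It computes two single-measurement ICCs from a crossed patients-by-raters table using the widely cited mean-square formulas: an agreement form (two-way random effects, often labelled model 2.1) and a consistency form (often labelled model 3.1). The function name and the scores are hypothetical; this is an illustration under those assumptions, not a prescribed procedure.

```python
def icc_single_measure(scores):
    """Return (agreement ICC, consistency ICC) for single measurements.

    `scores` is a list of rows, one per patient, each holding one score
    per rater (fully crossed design, no missing values).
    """
    n = len(scores)      # number of patients
    k = len(scores[0])   # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols                   # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))

    # Consistency: systematic rater differences are left out of the denominator
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    # Agreement: systematic rater differences count as error
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return agreement, consistency


# Hypothetical data: rater B scores systematically higher than rater A
scores = [[10, 12], [12, 15], [14, 16], [16, 19], [18, 20]]
agreement, consistency = icc_single_measure(scores)
print(round(agreement, 3), round(consistency, 3))
```

In this example the consistency form is clearly higher than the agreement form, because the systematic difference between the raters is ignored in the former. This is the overestimation that the rating options for this standard guard against.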
Elaboration: For continuous scores the intraclass correlation coefficient (ICC) is preferred to evaluate reliability. ICCs are a family of statistical parameters, including Generalizability (G) coefficients and Decision (D) coefficients.

To get a ‘very good’ rating, the ICC model used in the reliability study should match the study design (and the aim) of the study that is being assessed. Therefore, the model or formula of the ICC or G coefficient used should be described. It should be clear, e.g., whether a crossed or nested design was used (see also Element 6 in chapter 2.2), or whether a one-way random effects model, or a two- or three-way random or mixed effects model was used. Next, it should be compared to the study design using the extracted information from Part A, and it should be determined whether the ICC or G coefficient used indeed matches the study design.

The ICC based on the two-way mixed effects model of consistency (31) (also referred to as ICC model 3.1 (32)), and the Pearson or Spearman correlation coefficient do not take a systematic difference between the repeated measurements into account, and are therefore considered less appropriate, as this can lead to overestimating the reliability. Therefore, based on information about a systematic difference between the sources of variation considered (e.g. raters), either ‘adequate’ (when no or very little systematic difference occurred) or ‘doubtful’ (when there was a systematic difference between e.g. the raters) can be rated. When the study was designed to investigate a specific source of variation (e.g. inter-rater), and the systematic differences between this source of variation in the repeated measurements were taken into account in the formula (for example, by using the ICC random effects model for agreement (31), also referred to as Model 2.1 (32), or the φ coefficient (see e.g. (23))), the study can be rated as ‘very good’.

When a study is designed without any specific source of variation being considered, the appropriate ICC model is a one-way random effects model (31). In this situation the use
ofaone‐wayrandomeffectsmodelcanberatedas‘verygood’,whiletheuseofothermodelscanberatedas‘adequate’.Next,theICCcanbecalculatedforasinglemeasurementoranaveragemeasurement(31).Ifasinglemeasurementisnormallyusedinclinicalpracticeortrials(andnottheaveragescoreofmultiplemeasurements,suchisdonebyabloodpressuremeasurement),theICCforsinglemeasuresshouldhavebeencalculated.TheICCaveragereferstothereliabilityoftheaveragedscoreofthemeasurements,andreferstotheuseoftheaveragedscoreonrepeatedmeasurements.WhentheICCforaveragemeasuresisreported,inthesituationthatusuallyasinglemeasurementistaken,werecommendthisstandardtoberatedas‘adequate’,asthemodeldoesnotoptimallymatchthedesignofthestudy.However,wealsorecommendinthissituation,toratestandard6(i.e.otherflaws),as‘doubtful’oreven‘inadequate’(seealsotheexampleatstandard6).Moreover,togeta‘verygood’rating,thedescribedICCorGcoefficientmodelorformulashouldmatchthedata.Ifthereisa(known)problemwithnormaldistributionofthedata(normality)whichisnotproperlytakenintoaccount,thestudycouldberatedas‘adequate’insteadof‘verygood’.Itisimpossibletodescribeallotherflawshere,ThereforeitisuptotheuseroftheCOSMINRiskofBiastooltodecidehowtheidentifiedflawshouldbescored.Relevantquestioninthisregardishowcertainandhowlargetheinfluenceisonthestudyresult.Standard8:Preferredstatisticalmethodsforordinalscores verygood adequate doubtful inadequate
For ordinal scores: was a (weighted) kappa calculated?
very good: Kappa calculated; the weighting scheme was described, and matches the study design and the data
adequate: Kappa calculated, but weighting scheme not described or does not optimally match the study design
Elaboration: To assess reliability for ordinal scores, Cohen’s kappa (33-35) is considered the preferred statistical parameter. No better alternative is known (4, 36). Information on the specific kappa used should be described in terms of whether a weighting scheme was used and which scheme was used. Unweighted kappa considers any misclassification equally inappropriate. However, a misclassification between two adjacent categories may be less erroneous than a misclassification between categories that are further apart. A weighted kappa takes this into account (e.g. using linear or quadratic weights (37)). If the goal of the study was to consider any misclassification as equally important, and it was stated that the unweighted kappa was used, this standard can be rated as ‘very good’. However, in other situations (e.g. when misclassification between categories further apart is a bigger problem than misclassification between adjacent categories) a specific weighting scheme is preferred. If an unweighted kappa was calculated in that case, the standard could be rated as ‘adequate’.
Standard 9: Preferred statistical methods for dichotomous or nominal scores
very good    adequate    doubtful    inadequate
For dichotomous/nominal scores: was Kappa calculated for each category against the other categories combined?
very good: Kappa calculated for each category against the other categories combined
Elaboration: A study on reliability of an outcome measurement instrument with dichotomous or nominal scores gets a ‘very good’ score when an unweighted kappa was calculated for each category against the other categories (33).
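The effect of the weighting scheme discussed under Standard 8 can be illustrated with a small calculation. The sketch below is not part of the COSMIN tool: the function name `weighted_kappa`, its arguments and the example scores are assumptions made for illustration, and it uses the common disagreement-weight formulation of Cohen's kappa for two raters.

```python
def weighted_kappa(rater1, rater2, categories, weights="unweighted"):
    """Cohen's kappa for two raters (illustrative sketch, hypothetical helper).

    categories: ordered list of all possible scores (at least two).
    weights: "unweighted", "linear" or "quadratic".
    """
    k = len(categories)
    n = len(rater1)
    index = {c: i for i, c in enumerate(categories)}

    # Observed proportions in the k x k cross-table of the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        # Weight 0 on the diagonal; larger weights for categories further apart.
        if weights == "unweighted":
            return 0.0 if i == j else 1.0
        distance = abs(i - j) / (k - 1)
        return distance if weights == "linear" else distance ** 2

    observed_dis = sum(disagreement(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected_dis = sum(disagreement(i, j) * row[i] * col[j]
                       for i in range(k) for j in range(k))
    return 1.0 - observed_dis / expected_dis


# Hypothetical example: six patients scored on a 3-point ordinal scale.
# All disagreements are between adjacent categories.
r1 = [1, 1, 2, 2, 3, 3]
r2 = [1, 2, 2, 3, 3, 3]
kappas = {w: weighted_kappa(r1, r2, [1, 2, 3], w)
          for w in ("unweighted", "linear", "quadratic")}
```

In this example the unweighted kappa is 0.50, the linear weighted kappa 0.625 and the quadratic weighted kappa 0.75, which shows why the weighting scheme used should be reported.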
3.2 Elaboration on standards for studies on measurement error
Standards 1 to 6 of the box for standards for studies on measurement error are the same as for studies on reliability. For an elaboration on each of the standards, please see above.
Standard 7: Preferred statistical methods for continuous scores
very good    adequate    doubtful    inadequate
For continuous scores: was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC), Limits of Agreement (LoA) or Coefficient of Variation (CV) calculated?
very good: SEM, SDC, LoA or CV calculated; the model or formula for the SEM/SDC is described; it matches the reviewer constructed research question and the data
adequate: SEM, SDC, LoA or CV calculated, but the model or formula is not described or does not optimally match the reviewer constructed research question and evidence provided that no systematic difference has occurred
doubtful: SEM consistency, SDC consistency, or LoA or CV calculated, without knowledge about systematic difference or with evidence provided that systematic difference has occurred
inadequate: SEM calculated based on Cronbach’s alpha, OR using SD from another population
Elaboration: For continuous scores, the preferred measures for the measurement error of a single score are the SEM, LoA or the Coefficient of Variation (CV); the SDC is preferred as a measure for change scores. Different formulas can be used to calculate these various measures. Therefore, we will first describe their formulas. Subsequently, we will explain the standard for studies using SEM and SDC derived from variance components analyses. Next, we will discuss LoA, SEM and SDC using the SD difference. We will explain when ignoring the influence of the source of variation is appropriate. And last, we will discuss some other methods used, including the CV.
Measures that take all error into account, including the systematic difference between repeated measurements, based on a one-way or two-way effects model, are:
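$SEM_{one\text{-}way\ random} = \sqrt{\sigma^2_{error}}$ (1)
$SEM_{agreement} = \sqrt{\sigma^2_{systematic} + \sigma^2_{residual}}$ (2)
$SDC_{agreement} = 1.96 \cdot \sqrt{2} \cdot SEM_{agreement} = 1.96 \cdot \sqrt{2} \cdot \sqrt{\sigma^2_{systematic} + \sigma^2_{residual}}$ (3)
Measures that do not take the systematic difference between repeated measurements into account:
$SEM_{consistency} = \sqrt{\sigma^2_{residual}}$ (4)
$SDC_{consistency} = 1.96 \cdot \sqrt{2} \cdot SEM_{consistency} = 1.96 \cdot \sqrt{2} \cdot \sqrt{\sigma^2_{residual}}$ (5)
$SEM_{consistency} = \frac{SD_{difference}}{\sqrt{2}}$ (6)
$SDC_{consistency} = 1.96 \cdot \sqrt{2} \cdot SEM_{consistency} = 1.96 \cdot \sqrt{2} \cdot \frac{SD_{difference}}{\sqrt{2}}$ (7)
$LoA = \bar{d} \pm 1.96 \cdot SD_{difference}$ (8)
$SDC_{consistency} = 1.96 \cdot SD_{difference}$ (9)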
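To make the distinction between the agreement and consistency versions of the SEM and SDC concrete, the sketch below estimates variance components for a fully crossed design (every patient scored once by every rater) and derives both versions. It is a minimal illustration under stated assumptions, not the procedure prescribed by the tool: the function names and the example scores are hypothetical, the variance components are obtained with the usual two-way ANOVA estimators, and negative variance estimates are set to zero.

```python
import math
import statistics


def variance_components(scores):
    """scores[i][j]: score of patient i given by rater j (crossed design)."""
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of raters (levels of the source of variation)
    grand = sum(map(sum, scores)) / (n * k)
    patient_means = [sum(row) / k for row in scores]
    rater_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Mean squares of the two-way ANOVA without replication.
    ms_patients = k * sum((m - grand) ** 2 for m in patient_means) / (n - 1)
    ms_raters = n * sum((m - grand) ** 2 for m in rater_means) / (k - 1)
    ss_residual = sum(
        (scores[i][j] - patient_means[i] - rater_means[j] + grand) ** 2
        for i in range(n) for j in range(k))
    ms_residual = ss_residual / ((n - 1) * (k - 1))

    return {
        "patients": max((ms_patients - ms_residual) / k, 0.0),
        "systematic": max((ms_raters - ms_residual) / n, 0.0),
        "residual": ms_residual,
    }


def measurement_error(scores):
    vc = variance_components(scores)
    sem_agreement = math.sqrt(vc["systematic"] + vc["residual"])
    sem_consistency = math.sqrt(vc["residual"])
    factor = 1.96 * math.sqrt(2)
    return {
        "sem_agreement": sem_agreement,
        "sdc_agreement": factor * sem_agreement,
        "sem_consistency": sem_consistency,
        "sdc_consistency": factor * sem_consistency,
    }


# Hypothetical data: rater 2 always scores exactly 5 points higher than rater 1.
shifted = measurement_error([[10, 15], [20, 25], [30, 35]])

# Hypothetical data with random differences between the two raters.
noisy_scores = [[10, 12], [20, 19], [30, 33]]
noisy = measurement_error(noisy_scores)
differences = [b - a for a, b in noisy_scores]
sem_from_sd_difference = statistics.stdev(differences) / math.sqrt(2)
```

With the first data set the consistency SEM is zero while the agreement SEM is about 3.5, because the only error is the systematic difference between the raters. With two raters, the consistency SEM from the variance components equals the SD of the differences divided by √2.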
To get a ‘very good’ rating, the formula used should match the study design (and the aim) of the study that is being assessed. Therefore, it should be clear what the aim is, and which measure or which formula was used in the study being assessed.
Measurement error derived from variance components analyses (formulas 1-5)
The specific model used should be clearly described, e.g. whether a one-way random effects model, or a two- or three-way random or mixed effects model was used, and whether all error (except for the variance due to variation between patients) was included in the calculation of the measurement error, or whether the systematic error between the levels of the source of variation that is being varied in the design is ignored (i.e. as occurs when calculating SEM consistency for single scores (formula 4) and SDC consistency for change scores (formula 5)). Next, it should be compared to the study design using the extracted information about the comprehensive research question (see Part A of the tool), and determined whether the method used indeed matches the study design. In other words, when the aim of the study was to assess the measurement error of a single score of any measurement taken in clinical practice or trials, the aim is to generalize the results beyond (e.g.) the specific raters involved in the study. In this case, the systematic error between raters should be taken into account; the raters (in this example) should be considered random; and all error should be taken into account (i.e. formulas 1-3) to match the design of the study (and this is rated ‘very good’). If in this case (with the aim to generalize beyond the specific raters) the SEM consistency (formula 4) or SDC consistency (formula 5) was calculated (i.e. ignoring a systematic difference between raters), evidence should be provided that no (or only a very small) systematic difference has occurred between the raters. In case of no or very small differences the standard can be rated as ‘adequate’, as the SEM agreement (formula 2) and SEM consistency (formula 4), or SDC agreement (formula 3) and SDC consistency (formula 5), will be the same or very close. If it is unclear whether systematic differences occurred (because it was not reported), the standard is rated as ‘doubtful’.
Measurement error derived from the SD difference (formulas 6-9)
The measurement error of a single score or a change score can also be calculated using the SD difference. This refers to the standard deviation of the difference of the scores on the repeated measurements (38, 39). In a Bland and Altman plot two repeated measurements per patient are plotted: on the x-axis the mean score of the two measurements, and on the y-axis the difference between the repeated measurements (39). Although the plot is designed in such a way that systematic differences can easily be seen (i.e. the line of the mean difference in scores, and the limits of agreement located asymmetrically around zero), the systematic difference is disregarded when the SDC is calculated from these limits (resulting in the SDC consistency). Therefore, if a (large) systematic error between the repeated measurements occurred, while the aim of the study is to generalize beyond the specific source of variation (e.g. raters), the standard should be rated as ‘doubtful’, as the result of the study underestimates the measurement error.
When is a measure of consistency (formulas 4-9) appropriate?
Sometimes, the source of variation that is being varied across the measurements is considered to be fixed in a study. This means that the aim of the study is not to generalize beyond the specific study objects included in the study. For example, in a study only two raters are considered (e.g. the raters Myrthe and Brechtje), and the aim of the study is to determine whether these two raters will come to equal scores (e.g. because they will be the only two raters involved in the measurements for a specific trial). If a systematic error occurs between Myrthe and Brechtje (e.g. Myrthe systematically scores 5 points higher compared to Brechtje), the scores obtained in the trial can easily be adjusted by subtracting 5 points from each measurement obtained by Myrthe. In this study, the source of variation ‘rater’ is deemed irrelevant (31), as the systematic difference will be adjusted later on when using the instrument by either Myrthe or Brechtje. In this specific situation, the SEM consistency, SDC consistency or the limits of agreement match the aim and design of the study, so it can be rated as ‘very good’. However, these results cannot be generalized to other raters, as ‘rater’ was considered fixed. Therefore, the study is less relevant in other situations, especially when there is a systematic difference between the raters.
Measurement error calculated using the formula SD * √(1 - ICC)
There is another formula which is sometimes used to calculate the SEM from the ICC: SEM = SD * √(1 - ICC) (40). The standard deviation refers to the pooled SD of the sample, that is of SD test and SD retest. Using this formula is only justified if the data for ICC and SD are derived from the same study. When the SD is based on another population, this is considered inadequate, as the SD of this other population may be smaller, and subsequently, the measurement error is smaller. Moreover, sometimes Cronbach’s alpha is inserted in the formula instead of the ICC. This is considered inadequate, as this measure is based on one full-scale measurement where items are considered as the repeated measurements, instead of at least two full-scale measurements using the total score in the calculation of the SEM. Often Cronbach’s alpha is higher than ICCs based on repeated measurements, thus leading to smaller SEM values. By rating this inadequate, the result of this study can still be considered; however, it is considered to be less trustworthy. Moreover, Cronbach’s alpha is sometimes used inappropriately, because it is calculated for a scale that is not unidimensional, or based on a formative model. In such cases Cronbach’s alpha cannot be interpreted. Other parameters that are based on single measurements, such as the person separation index (or other IRT-based measurement error measures) or Omega, are not covered by measurement error according to the COSMIN taxonomy, but by internal consistency.
The Coefficient of variation
Coefficient of variation (CV) is also a parameter of measurement error. It is often used in physics and to present the measurement error of a device. When developing a new device the measurement error is assessed by measuring a fixed sample many (e.g. 50) times. The SD of these measurements is the standard error of measurement. Often the measurement error increases with higher values. For these situations the CV is a suitable measure, as the CV expresses the SD as a percentage of the mean value: in formula, CV = SD / mean. Usually, it is expressed as a percentage; for example, the measurement error is 2% of the measured value. The assumption underlying the CV is that the CV gives a constant value over all values of the mean, so that the SD is e.g. 2% of the mean value, regardless of a mean value of 10 or 100 or 1000. In a Bland and Altman plot, we had a contrary assumption, i.e. that the SD of the difference is constant over the mean values on the x-axis. If the differences are lower with small values and higher with large values, the horizontal lines of the limits of agreement give a wrong value: too large for the small values and too small for the large mean values. In that case one should transform the data. Often a natural logarithm or base-10 logarithm transformation is used. This has the advantage that the limits of agreement can be directly expressed as coefficients of variation (41).
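The two formulas in this section, SEM = SD * √(1 - ICC) and CV = SD / mean, can be written out as a short sketch. The function names and example numbers are hypothetical, and the pooled SD is computed as the root of the mean of the test and retest variances, which assumes equal sample sizes at test and retest.

```python
import math
import statistics


def pooled_sd(sd_test, sd_retest):
    # Assumption: equal numbers of patients at test and retest.
    return math.sqrt((sd_test ** 2 + sd_retest ** 2) / 2)


def sem_from_icc(sd_test, sd_retest, icc):
    # SEM = SD_pooled * sqrt(1 - ICC); SD and ICC should come from the same study.
    return pooled_sd(sd_test, sd_retest) * math.sqrt(1 - icc)


def coefficient_of_variation(repeated_values):
    # SD of repeated measurements of one fixed sample, relative to their mean.
    return statistics.stdev(repeated_values) / statistics.mean(repeated_values)


# Hypothetical numbers, for illustration only.
sem = sem_from_icc(sd_test=10.0, sd_retest=10.0, icc=0.75)   # 10 * 0.5
cv = coefficient_of_variation([98, 100, 102])                # 2 / 100
```

Because the SEM scales directly with the SD that is inserted, taking the SD from a more homogeneous population would give a smaller SEM, which is why the manual rates that practice as inadequate.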
Standard 8: Preferred statistical methods for dichotomous, nominal, or ordinal scores
very good    adequate    doubtful    inadequate
For dichotomous/nominal/ordinal scores: Was the percentage specific (e.g. positive and negative) agreement calculated?
very good: % specific agreement calculated
adequate: % agreement calculated
Elaboration: Often kappa is considered as a measure of agreement; however, kappa is a measure of reliability (42). An appropriate parameter of measurement error (also called agreement) of dichotomous/nominal/ordinal scores is the proportion of specific agreement (42-44). It is a measure that expresses the agreement separately for each category of the score, that is, positive and negative ratings agreement in case the score is dichotomous.
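For a dichotomous score rated twice, the proportions of specific agreement can be computed from the 2x2 cross-table. The sketch below is illustrative only: the function name, the cell labels a, b, c, d and the counts are assumptions, and it uses the commonly cited formulas 2a / (2a + b + c) for positive agreement and 2d / (2d + b + c) for negative agreement.

```python
def specific_agreement(a, b, c, d):
    """Agreement for two ratings of a dichotomous score (illustrative sketch).

    a: both ratings positive      b, c: the two kinds of discordant pairs
    d: both ratings negative
    """
    total = a + b + c + d
    return {
        "overall": (a + d) / total,
        "positive": 2 * a / (2 * a + b + c),
        "negative": 2 * d / (2 * d + b + c),
    }


# Hypothetical counts for 100 patients.
agreement = specific_agreement(a=40, b=5, c=5, d=50)
```

Reporting the positive and negative agreement separately shows whether the raters agree less well on one of the two categories, which the overall percentage agreement hides.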
3.3 Example of how to use Part B of the COSMIN Risk of Bias tool to assess the quality of a study by Skeie et al. (2015)
In this chapter we provide an example of how to use Part B of the COSMIN tool, using again the paper by Skeie et al. (19). To fully understand the explanation in Table 7, we recommend first reading the introduction and method section of the paper, and the summary provided at page 27/28. In this paper four different studies are described. Here we use the first two substudies.
Table 7. Example of how to use Part B of the COSMIN Risk of Bias tool based on the study by Skeie (19).
Standards on design requirements for Reliability and Measurement error
Design requirements | Rating study 1 | Rating study 2
1. Were patients stable in the time between the repeated measurements on the construct to be measured?
Study 1: NA (measurements were based on a still image)
Study 2: Very good. Measurements were conducted in succession.
2. Was the time interval between the repeated measurements appropriate?
Study 1: NA
Study 2: Very good. The time interval (i.e. the second rater started immediately after the first had completed the procedure) has probably not influenced the scores.
3. Were the measurement conditions similar for the repeated measurements, except for the condition being evaluated as a source of variation?
Study 1: Very good
Study 2: Very good
4. Did the professional(s) administer the measurement without knowledge of scores or values of other repeated measurement(s) in the same patients?
Study 1: Very good. None of the previous scores were available
Study 2: Very good. None of the previous scores were available
5. Did the professional(s) assign the scores or determine the values without knowledge of the scores or values of other repeated measurement(s) in the same patients?
Study 1: Very good. None of the previous scores were available
Study 2: Very good. None of the previous scores were available
6. Were there any other important flaws in the design or statistical methods of the study?
Study 1: For reliability: Doubtful. 5 of 30 persons (see Table 1 of the paper) were pain-free subjects, which could have substantially increased the variation between the patients, and subsequently the ICC
Study 2: For reliability: Very good (in this study no pain-free persons were included, see Table 1 of the paper)
For measurement error: Very good. Heterogeneity of the sample is considered less of a problem, as the variation between patients is not included in the parameter.
Standards on preferred statistical methods for Reliability | Rating study 1 | Rating study 2
7. For continuous scores: was an Intraclass Correlation Coefficient (ICC) calculated?
Adequate. ICC two-way mixed single measures (3.1) and two-way mixed average measures (3.2) were calculated. This is the ICC consistency, which does not take the systematic error between raters into account. The study aims to generalize beyond the raters involved; therefore, the raters should not be considered fixed, and the ICC model does not optimally match the research aim and design. Based on the mean of the measurements provided in Table 2, we can conclude that no systematic difference between the raters occurred. The ICC two-way mixed average measures (3.2) refers to the practice in which two raters would measure each patient (with triple placement of second marker), and both final scores were averaged. As this will not be common practice, we will ignore this ICC. The repetition of part of the measurement is already part of one measurement.
8. For ordinal scores: was a (weighted) Kappa calculated?
Study 1: Not applicable
Study 2: Not applicable
9. For dichotomous/nominal scores: was Kappa calculated for each category against the other categories combined?
Study 1: Not applicable
Study 2: Not applicable
Final Risk of Bias rating Reliability studies
Study 1: Doubtful
Study 2: Adequate
Standards on preferred statistical methods for Measurement error | Rating study 1 | Rating study 2
7. For continuous scores: was the Standard Error of Measurement (SEM), Smallest Detectable Change (SDC), Limits of Agreement (LoA) or Coefficient of Variation (CV) calculated?
Adequate, as the limits of agreement were calculated, while the aim was to generalize beyond the raters included in this study, and probably there was no systematic difference between the raters.
8. For dichotomous/nominal/ordinal scores: Was the percentage specific (e.g. positive and negative) agreement calculated?
Study 1: Not applicable
Study 2: Not applicable
Final Risk of Bias rating study on Measurement error
Study 1: Adequate
Study 2: Adequate
4. Using the COSMIN Risk of Bias tool in a systematic review of outcome measurement instruments
Researchers and clinicians who are deciding on the most suitable outcome measurement instrument for use in their study can often choose from multiple different instruments. The selection should be based on the evidence of the quality of the outcome measurement instruments (i.e. reliability, validity, and responsiveness), as well as on aspects of feasibility and interpretability. A high-quality systematic review on outcome measurement instruments gives a clear overview of all important aspects to make your choice.
Understanding the quality of the studies and the quality of the measurement instrument under study is a challenging task, specifically for researchers and clinicians who are less familiar with the methodology to evaluate all measurement properties. Therefore, in 2018, we (COSMIN initiative) published a thorough methodology to conduct a systematic review of PROMs (5). It consisted of a ten-step procedure to summarize the available evidence per measurement property per included PROM and draw conclusions on each measurement property per PROM, and subsequently to give recommendations on the most suitable PROM for a given purpose, including also feasibility and interpretability aspects. This methodology also includes the COSMIN Risk of Bias checklist to assess the quality of studies on measurement properties of PROMs (1), including standards for design requirements and preferred statistical methods organized in boxes per measurement property.
To perform a systematic review on the quality of ClinROMs, PerFOMs and laboratory values, the same methodology can be used. However, we recommend some adaptations. Two aspects of the COSMIN methodology for systematic reviews of PROMs are different for ClinROMs, PerFOMs or laboratory values: the recommendation to use different boxes for reliability and measurement error, and the addition of a new step.
The new boxes
In systematic reviews of ClinROMs, PerFOMs or laboratory values the COSMIN Risk of Bias checklist for PROMs (1) can be used, although the boxes for reliability and measurement error should be replaced with the COSMIN Risk of Bias tool to assess the quality of a study on reliability or measurement error (4). Standards for most of the remaining measurement properties (i.e. content validity, internal consistency, construct validity, criterion validity and responsiveness) developed for PROMs can be used for other types of measurement instruments as well. Some measurement properties are only relevant for multi-item instruments based on a reflective model (i.e. structural validity and internal consistency). For some other measurement properties only the final score or value of a measurement instrument is considered (i.e. hypotheses testing for construct validity, criterion validity and responsiveness). The quality of studies on these measurement properties is similarly assessed for all types of outcome measurement instruments, and the existing boxes from the COSMIN Risk of Bias checklist for PROMs can be used.
An additional step
In a reliability study or a study on measurement error of a PROM the focus of interest is usually on the quality of the PROM as it is being used in clinical practice (analyzed using a one-way random effects model), or on the test-retest reliability (using a two-way random effects model of agreement). However, the focus of interest in a reliability study of other types of measurement instruments is much more diverse. As explained in chapter 2, there are many potential sources of variation (i.e. many different ways to operationalize the components of outcome measurement instruments) that could be the focus of interest in a study on reliability. Each result of all those studies tells you something about the quality of the instrument (and gives suggestions for improvement of the measurement by standardizing or restricting the source of variation which showed the largest error). Based on an overview of all these studies, a best-evidence measurement protocol can be recommended.
In a COSMIN review of ClinROMs, PerFOMs or laboratory values, an additional step is needed in the ten-step procedure (see Figure 3), specifically in the assessment of reliability and measurement error. To interpret the results of studies included in a systematic review well, you need to decide how the results of the study you want to assess inform you about the quality of the measurement instrument. Therefore, we separated the assessment of reliability and measurement error from the other measurement properties.
Change in the methodology
Based on our experience using the methodology, we decided to remove step 8 (which was ‘Evaluate interpretability and feasibility’) from the methodology. Aspects of interpretability and feasibility are only extracted (and summarized) rather than evaluated. Therefore, this step is irrelevant in the methodology. However, we consider it very useful to have a separate step on data extraction. Once you have included all the studies in a review, we recommend that you first extract all necessary information from an article, before assessing the risk of bias and the quality of the instrument. Relevant information to be extracted refers to characteristics of the included measurement instruments, information on feasibility and interpretability, characteristics of the studies, and the results of the study.
Consequently, the step numbers deviate from the step numbers presented in the original 10-step procedure of the COSMIN methodology to conduct a systematic review of PROMs (5).
Figure 3. Eleven-step procedure for conducting a systematic review on any type of outcome measurement instrument
4.1 The eleven-step procedure for conducting a systematic review of ClinROMs, PerFOMs, or laboratory values
Below, a summary is given of the eleven-step procedure. In the user manual of the COSMIN methodology for systematic reviews of PROMs (45) a thorough explanation of each step is provided. Only the steps that are different for a review of outcome measurement instruments other than PROMs are described here in detail. Please note that the numbers of the steps have changed.
The methodology of a systematic review of outcome measurement instruments is subdivided into three parts (A, B, and C) (5).
Steps 1-4: Perform the literature search
Steps 1-4 are standard procedures when performing systematic reviews, and are in agreement with existing guidelines for reviews (46, 47): formulating the specific aim of the review and the eligibility criteria, performing the literature search, and selecting relevant publications.
In the research question and eligibility criteria four key elements should be included: 1) the construct; 2) the population; 3) the type(s) of instruments; and 4) the measurement properties of interest.
In the search strategy we recommend also using these key elements, except for the type of instruments, as we are not aware of highly sensitive search blocks for different types of measurement instruments. Search filters for different constructs may be found at https://blocks.bmi-online.nl/. When using the search filter for finding studies on measurement properties (48) of ClinROMs, PerFOMs and laboratory values, we recommend using additional search terms for finding studies using Generalizability theory. This string, developed with the help of a clinical librarian, can be added with the Boolean “OR” to the search filter.
PubMed search string for finding studies using Generalizability theory:
G-theory[tiab] OR "G theory"[tiab] OR "generalizability theory"[tiab] OR "generalisability theory"[tiab]
EMBASE search string for finding studies using Generalizability theory:
'g-theory':ti,ab OR 'g theory':ti,ab OR 'generalizability theory':ti,ab OR 'generalisability theory':ti,ab
Step 5: Data extraction
Once you have included all relevant articles, you check per article which measurement properties were evaluated (and subsequently decide which COSMIN boxes are relevant to be completed for the specific article). When reading through the article, at this point, we recommend that you extract all information from the article about the characteristics of the included measurement instruments (for suggestions of characteristics see appendix 4), including aspects of feasibility and interpretability (see below). Interpretability is defined as the degree to which one can assign qualitative meaning (that is, clinical or commonly understood connotations) to a quantitative score or change in scores of an outcome measurement instrument (7). Both the interpretability of single scores and the interpretability of change scores are informative to report in a systematic review. The interpretation of single scores can be outlined by providing information on the distribution of scores in the study population or other relevant subgroups, as it may reveal clustering of scores, and it can indicate floor and ceiling effects. The interpretability of change scores can be enhanced by reporting M(C)IC values. However, there is an ongoing debate about how these values should be assessed.
Feasibility is defined as the ease of application of the measurement instrument in its intended context of use, given constraints such as time or money (49). Aspects of feasibility are, for example, completion time, cost of an instrument, length of the instrument, type and ease of administration. Feasibility applies to both the patients and the professionals who are involved in the measurement. The concept ‘feasibility’ is related to the concept ‘clinical utility’, where feasibility refers to a measurement instrument, and clinical utility refers to an intervention (50).
Interpretability and feasibility are not measurement properties because they do not refer to the quality of an outcome measurement instrument. However, they are considered important aspects for a well-considered selection of an outcome measurement instrument.
Steps 6-9: Evaluate the measurement properties
Steps 6-9 concern the evaluation of the nine measurement properties of the included outcome measurement instruments. In these steps, per measurement property, data are extracted on the characteristics of the studies and the result of each study, the risk of bias of the included studies is rated by using the COSMIN Risk of Bias standards, and the results of the studies are rated by applying the criteria for good measurement properties. Subsequently, all evidence is summarized, and the quality of all available evidence per measurement property per measurement instrument is graded using a modified GRADE approach.
Characteristics of the studies refer to the characteristics of the included patient populations, and the population of included professionals (for suggestions of characteristics see appendix 5). For specific recommendations for extracting information on the results of studies on reliability and measurement error, see step 8, Extracting information (p 53).
In step 6 the content validity is assessed. In step 7 the internal structure (structural validity, internal consistency and cross-cultural validity/measurement invariance) is assessed. As the assessment of reliability and measurement error requires an additional step (i.e. understanding how the results of a study inform you about the reliability or measurement error of an outcome measurement instrument), these two measurement properties are now assessed in a separate step, i.e. step 8, apart from the assessment of the measurement properties criterion validity, hypotheses testing for construct validity, and responsiveness (i.e. step 9).
Step 6. Evaluate content validity
In step 6 content validity is evaluated. In the current standards and criteria for assessing content validity of PROMs (6) emphasis is put on the relevance, comprehensiveness, and comprehensibility of the PROM for the construct, target population, and intended context of use. In this assessment the development of the PROM is also considered, specifically the item elicitation phase and the results from the pilot-testing phase. The assessment of content validity of other types of instruments may be different, and more research is needed to develop standards and criteria for other types of measurement instruments.
Assessing the content validity of measurement instruments that include multiple items (either based on a reflective or formative model) can lean heavily on the standards and criteria for PROMs. However, because professionals are involved in the measurement, the three aspects of content validity (i.e. relevance, comprehensiveness, and comprehensibility) should be asked of the professionals. Depending on the construct of interest, these aspects could be asked of patients, too, for example for PerFOMs, or ClinROMs about symptoms or severity of conditions.
For the assessment of content validity of measurement instruments that consist of a single parameter (e.g. imaging-based parameters, or laboratory values), other aspects are likely more relevant. For example, you should judge whether it makes sense that the measurement instrument indeed measures the construct it purports to measure, based on theory and medical knowledge, and based on the claims by the manufacturer. In addition, the unit of measurement should match the construct to be measured. For example, a 6-minute walk test, expressed as the distance covered over a time of 6 minutes, measures walking capacity, rather than physical functioning (51). As currently no standards and criteria for content validity exist, face validity (which is a rather subjective judgment about whether the content of the instrument indeed looks like an adequate reflection of the construct to be measured) could be assessed by the reviewer.
Step 7. Evaluate the internal structure
In step 7 the internal structure (structural validity, internal consistency and cross-cultural validity/measurement invariance) is assessed. This step is only relevant when the measurement instrument is a multi-item instrument based on a reflective model. The standards (1) and criteria (5) provided for systematic reviews of PROMs can be used.
Step 8. Evaluate reliability and measurement error
Next, in step 8 reliability and measurement error are assessed. In chapters 2 and 3 we have explained how to assess the quality of each study on reliability and measurement error.
In a systematic review, per study, you should first extract information about the elements of a comprehensive research question (see chapter 2), the specific ICC model or formula, and the results of each study. Next, you should assess the study quality using the standards (see chapter 3), and assess the results of each study by comparing the results against the criteria for good measurement properties (5). Subsequently, you should summarize all evidence for reliability and for measurement error, respectively, and grade the quality of the evidence using the modified GRADE approach (5). Based on this overview, you can recommend the best-evidence measurement protocol for a specific measurement instrument.
Extracting information
In Appendix 1 we provide an example of a data extraction table. First, we recommend extracting the seven elements of a comprehensive research question, and the research question as stated by the authors in the article. Based on the elements, you can subsequently formulate a comprehensive research question. Next, we recommend extracting the information about the key elements of the review, i.e. the construct, population, type of measurement instrument, and measurement properties of interest. The construct to be measured (element 3 of a comprehensive research question) and the specific measurement properties (element 4 of a comprehensive research question) are already extracted, so the target population and the type of measurement instrument remain to be extracted. The target population refers to the target population of the specific study. In the example of Skeie et al. (19), the target population was patients with low-back pain. This can be different from the study population (i.e. the sample used) as extracted in item 7, or (slightly) different from the target population of the review (e.g. a broader population). In the study of Skeie, not only patients with low-back pain were included, but also patients with other spinal complaints such as mid-back pain, neck pain, and/or extremity pain, or even pain-free subjects. The type of measurement instrument refers to whether the instrument under study is a ClinROM, PerFOM, laboratory value, PROM or ObsROM.
Last, we recommend extracting information about the statistics: the model or formula used, the result, and, if applicable, its 95% confidence interval. If available, we recommend extracting the variance components, or the SD sample or SD difference (see also chapter 3.2 for more explanation). For ordinal or dichotomous data we recommend extracting the raw numbers in the cells plus marginal totals.
Risk of Bias assessment
The next step in the review is to assess the quality of each study, using Part B of the Risk of Bias tool to assess reliability and measurement error (as described in chapter 3). We recommend using the worst-score-counts method to come to an overall rating per study. In Appendix 2 we provide an example of a table to organize these ratings. We recommend that each study is assessed by two independent reviewers, and that they come to consensus.
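The worst-score-counts method can be sketched in a few lines: the overall rating of a study is the lowest rating given to any applicable standard. The function name and the way ratings are encoded are assumptions made for illustration; the example ratings are those given to the reliability standards 1 to 7 of the two substudies in Table 7.

```python
# Ratings ordered from best to worst.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]
NOT_APPLICABLE = {"na", "not applicable"}


def worst_score_counts(ratings):
    """Return the lowest rating among the applicable standards (sketch)."""
    applicable = [r.lower() for r in ratings if r.lower() not in NOT_APPLICABLE]
    if not applicable:
        return None  # no standard could be rated
    return max(applicable, key=RATING_ORDER.index)


# Reliability standards 1-7 as rated in Table 7 (standards 8 and 9 not applicable).
study_1 = ["NA", "NA", "very good", "very good", "very good", "doubtful", "adequate"]
study_2 = ["very good", "very good", "very good", "very good", "very good",
           "very good", "adequate"]

overall_1 = worst_score_counts(study_1)
overall_2 = worst_score_counts(study_2)
```

This reproduces the final reliability ratings in Table 7: ‘doubtful’ for study 1 and ‘adequate’ for study 2.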
Comparison against the criteria for good measurement properties
Each result of each single study on reliability or measurement error is now compared against the criteria for good measurement properties (5). As no criteria for the unweighted Kappa and CV were provided in the guidelines for systematic reviews of PROMs, we added these missing criteria (see Table 8). Criteria for % specific agreement are difficult to set, because they are, just like sensitivity and specificity, highly dependent on the situation. As a rule of thumb 80% might be used.
Table 8. Extended criteria for good reliability and measurement error (adapted from Prinsen et al. (5))
Reliability
+ ICC or (weighted) Kappa ≥ 0.70
? ICC or (weighted) Kappa not reported
- ICC or (weighted) Kappa < 0.70
Measurement error
+ SDC or LoA or CV*√2*1.96 < M(C)IC ¹; % specific agreement > 80% ²
? MIC not defined
- SDC or LoA or CV*√2*1.96 > M(C)IC ¹; % specific agreement < 80% ²
¹ The M(C)IC value may come from another study.
² Sometimes a higher percentage is more appropriate; when substantiated, this could be appropriate, too.
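The criteria in Table 8 can be applied mechanically once the results have been extracted. The sketch below is a hypothetical helper, not part of the COSMIN methodology: the function names are assumptions, a result exactly equal to the M(C)IC is treated as insufficient because the table does not specify that case, and a missing result is rated as indeterminate.

```python
import math


def rate_reliability(icc_or_kappa):
    """Rate an ICC or (weighted) kappa against the 0.70 criterion."""
    if icc_or_kappa is None:
        return "?"                      # not reported
    return "+" if icc_or_kappa >= 0.70 else "-"


def sdc_from_cv(cv):
    """Table 8 compares CV * sqrt(2) * 1.96 with the M(C)IC."""
    return cv * math.sqrt(2) * 1.96


def rate_measurement_error(sdc_or_loa, mic):
    """Rate an SDC or LoA against the M(C)IC (same units assumed)."""
    if sdc_or_loa is None or mic is None:
        return "?"                      # e.g. MIC not defined
    # Assumption: equality is rated as insufficient.
    return "+" if sdc_or_loa < mic else "-"


def rate_specific_agreement(proportion, threshold=0.80):
    """Rule of thumb of 80%; a higher threshold can be passed when substantiated."""
    if proportion is None:
        return "?"
    return "+" if proportion > threshold else "-"
```

For example, `rate_reliability(0.82)` gives `"+"`, and `rate_measurement_error(4.0, mic=3.0)` gives `"-"` because the smallest detectable change exceeds the minimal important change.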
Summarizing the evidence
To come to an overall conclusion on the reliability or the measurement error of an outcome measurement instrument, one should first decide whether the results from multiple studies can be combined. You should take two aspects into account in this decision. 1) Do the results refer to the same information (i.e. refer to the same underlying comprehensive research question)? Results from different designs (i.e. different components were varied across the repeated measurements) give you other information about the reliability of an instrument, and therefore cannot simply be summarized. And 2) Are the results consistent, that is, are all results either sufficient (+) or insufficient (-)? In case of inconsistency in results, we recommend searching for reasons for this inconsistency, e.g. different designs or statistical models, different populations, different backgrounds of raters. Subsequently, subgroups of studies can be summarized.
To summarize the evidence, you can either qualitatively summarize the results (e.g. describe the range of the results) or quantitatively pool the results. In reliability studies, only the point estimate of an ICC or Cohen’s kappa is used to conclude whether the specific measurement instrument has sufficient reliability (e.g. in the criteria that we propose above). Therefore, it is not necessary to pool the data to obtain a more precise point estimate.
The measurement error refers to the absolute deviation of the score from the ‘true’ score or the amount of error in the score. The point estimate of the measurement error parameter refers to this deviation or error, and therefore it is used to know how precisely the measurement instrument is able to measure a patient. To come to a more precise point estimate of the measurement error, the parameters obtained in studies with the same design (i.e. that have the same underlying comprehensive research question) can be pooled, when the confidence intervals are also reported (which can be obtained using the sample size (39) or bootstrapping methods (52)).
Note that you should only summarize or pool parameters of measurement error that were derived from the same study design and the same model or formula. For example, the SEM consistency (either formula 4 or 8, chapter 3.2) and SEM agreement (formula 2, chapter 3.2) should not be combined. However, SEM consistency using either formula 4 or 6 (chapter 3.2) can be combined as they will lead to the same result, and the SDC consistency using either formula 5, 7, or 9 (chapter 3.2) can be combined. The same results are found when using either the SEM one-way random effects model (formula 1, chapter 3.2) or SEM agreement (formula 2, chapter 3.2). This is because all sources of variance (apart from the variance between patients) are taken into account in both formulas. Therefore, these parameters can be combined.
Handling inconsistent results
If the results of studies with the same underlying research question are inconsistent (e.g. both sufficient and insufficient results are found), explanations for the inconsistency should be explored first. For example, slightly different study populations or methods may have been used. If an explanation is found, subgroups of studies (e.g. based on the same study population, or in which the same source of variation is varied) can be summarized. The overall conclusion for (e.g.) reliability can subsequently be drawn per subgroup. When the explanation is found in the quality of the studies (i.e. very good and adequate studies lead to another overall rating than doubtful and inadequate studies), the doubtful and inadequate quality studies may be reported but ignored in this step, so that only very good and adequate quality studies are considered decisive in determining the overall rating. This should be explained in the manuscript.
If studies with the same underlying research question show inconsistent results, and no explanation can be found, one can conclude that the results are inconsistent.
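The decision logic described above can be sketched as follows. This is a simplified illustration under stated assumptions: it only automates the explanation based on study quality (other explanations, such as different populations or methods, require judgement and should be handled by forming subgroups first), and the function name and input format are hypothetical.

```python
DECISIVE = {"very good", "adequate"}


def overall_rating(studies):
    """Combine per-study ratings into '+', '-', '±' (inconsistent) or '?'.

    studies: list of (rating, quality) tuples for ONE comprehensive research
    question, with rating in {'+', '-', '?'} and quality in
    {'very good', 'adequate', 'doubtful', 'inadequate'}.
    """
    rated = [(r, q) for r, q in studies if r in ("+", "-")]
    if not rated:
        return "?"  # nothing but indeterminate results
    ratings = {r for r, _ in rated}
    if len(ratings) == 1:
        return ratings.pop()  # consistent results
    # Inconsistent: check whether study quality explains the inconsistency.
    high = {r for r, q in rated if q in DECISIVE}
    low = {r for r, q in rated if q not in DECISIVE}
    if len(high) == 1 and len(low) == 1 and high != low:
        return high.pop()  # only very good/adequate studies are decisive
    return "±"  # unexplained inconsistency
```

For example, one adequate study rated + and one doubtful study rated - yields +, whereas two adequate studies with opposite ratings yield ±.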
We refer to the User manual of the COSMIN methodology for systematic reviews of PROMs for more information.
Rate the quality of the summarized result
If multiple studies can be qualitatively summarized (e.g. as the range of results) or quantitatively pooled, the overall result can again be compared to the criteria for good measurement properties (see Table 8). You can then conclude that the outcome measurement instrument has either sufficient (+) or insufficient (-) reliability or measurement error, or that the results are inconsistent (±) or indeterminate (?). For more information, we refer to the User manual of the COSMIN methodology for systematic reviews of PROMs.
Grading the quality of the evidence using the modified GRADE approach
After summarizing or pooling all evidence per outcome measurement instrument for reliability and for measurement error, and rating the summarized or pooled results against the criteria for good measurement properties, the next step is to grade the quality of the evidence. The quality of the evidence refers to the confidence that the summarized or pooled result is trustworthy. We developed a modified GRADE (Grading of Recommendations Assessment, Development, and Evaluation) approach to grade the evidence as high, moderate, low or very low (5), based on 1) risk of bias (i.e. the methodological quality of the studies), 2) inconsistency (i.e. unexplained inconsistency of results across studies), 3) imprecision (i.e. total sample size of the available studies), and 4) indirectness (i.e. evidence from different populations than the population of interest in the review). This procedure is described in the User manual of the COSMIN methodology for systematic reviews of PROMs (5, 45).
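A minimal sketch of how such a grading could be recorded is given below. It assumes the general GRADE convention of starting at high quality and moving down one level per downgrade step; the number of steps per factor is a reviewer judgement to be made with the rules in the referenced manual (5, 45), which are not reproduced here. The function and parameter names are hypothetical.

```python
LEVELS = ("high", "moderate", "low", "very low")


def grade_quality(risk_of_bias: int = 0, inconsistency: int = 0,
                  imprecision: int = 0, indirectness: int = 0) -> str:
    """Return the evidence level after applying reviewer-chosen downgrades.

    Each argument is the number of levels to downgrade for that factor
    (assumption: 0 means no concern). The result never drops below 'very low'.
    """
    downgrades = (risk_of_bias, inconsistency, imprecision, indirectness)
    if any(d < 0 for d in downgrades):
        raise ValueError("downgrade steps must be non-negative")
    return LEVELS[min(sum(downgrades), len(LEVELS) - 1)]
```

For example, `grade_quality(risk_of_bias=1)` gives "moderate", and several serious concerns together bottom out at "very low".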
Draw conclusion on 'best-evidence measurement protocol'
The results of reliability studies with their specific designs inform you whether a source of variation (for example the training of a rater, or the specific machine used) importantly affects the score (i.e. the measurement). If possible, this source of variation should be standardized or restricted in future measurements. By looking at all evidence for the various sources of variation, you can now draw conclusions about how to standardize and restrict the measurement, and describe this best-evidence measurement protocol.
Step 8. Evaluate criterion validity, hypotheses testing for construct validity, and responsiveness
In step 8, criterion validity, hypotheses testing for construct validity, and responsiveness are assessed. The standards (1) and criteria (5) provided for systematic reviews of PROMs can be used.
Steps 10-11: Select the outcome measurement instrument
Steps 10 and 11 concern formulating recommendations (step 10) and reporting the systematic review (step 11).
Step 10. Formulate recommendations
The goal of a systematic review on measurement instruments is to get an overview of all available evidence on the quality of outcome measurement instruments that measure a specific construct in a defined patient population. Based on this overview, and taking aspects of feasibility and interpretability into account, we recommend that you formulate recommendations about the most suitable outcome measurement instrument. To come to an evidence-based and fully transparent recommendation, we recommend categorizing the included measurement instruments into three categories. Per type of measurement instrument you can conclude which instrument(s) are recommended (category A), promising (category B), or insufficient (category C) and should not be used anymore.
Category (A):
We recommend using different definitions of category (A), depending on the structure of the measurement instrument:
Multi-item reflective: evidence for sufficient content validity (any level), AND sufficient internal consistency (at least low quality, meaning also sufficient structural validity)
Multi-item formative: evidence for sufficient content validity (any level)
Single item (single parameter) (no gold standard): sufficient face validity (rated by e.g. the reviewer team), AND evidence for sufficient reliability (any level)
Single item (gold standard available): evidence for sufficient criterion validity, AND evidence for sufficient reliability (any level)
Category (B): outcome measurement instrument not categorized as 'A' or 'C'.
Category (C): outcome measurement instrument with high quality evidence for an insufficient measurement property.
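The three categories can be sketched as a lookup of required properties per instrument structure. The sketch makes several assumptions that are not stated in the text above: category C is checked before category A, the quality level of each piece of evidence (e.g. "at least low quality" for internal consistency) is judged by the reviewer before calling the function, and all names are hypothetical.

```python
# Properties that must be rated sufficient for category A, per structure.
REQUIRED_FOR_A = {
    "multi-item reflective": {"content validity", "internal consistency"},
    "multi-item formative": {"content validity"},
    "single item, no gold standard": {"face validity", "reliability"},
    "single item, gold standard": {"criterion validity", "reliability"},
}


def categorize(structure: str, sufficient: set,
               high_quality_insufficient: bool = False) -> str:
    """Return 'A' (recommended), 'B' (promising) or 'C' (insufficient).

    sufficient: names of measurement properties with sufficient evidence.
    high_quality_insufficient: True if there is high quality evidence for
    an insufficient measurement property.
    """
    if structure not in REQUIRED_FOR_A:
        raise ValueError(f"unknown instrument structure: {structure!r}")
    if high_quality_insufficient:
        return "C"  # assumption: C takes precedence over A
    if REQUIRED_FOR_A[structure] <= set(sufficient):
        return "A"
    return "B"  # not categorized as A or C
```

For example, a single-item instrument without a gold standard that has sufficient face validity and sufficient reliability falls in category A; without the reliability evidence it falls in category B.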
Step 11. Report the systematic review
In accordance with the PRISMA Statement (53, 54), we recommend reporting the following information:
(1) the search strategy (for example on a website or in the (online) supplemental materials to the article at issue), and the results of the literature search and selection of the studies and measurement instruments, displayed in the PRISMA flow diagram (including the final number of articles and the final number of measurement instruments included in the review) (Appendix 3);
(2) the characteristics of the included measurement instruments, including aspects of feasibility and interpretability (Appendix 4);
(3) the characteristics of the studies, including the characteristics of the included patient populations and of the population of included professionals (Appendix 5);
(4) the methodological quality ratings of each study per measurement property per measurement instrument (i.e. very good, adequate, doubtful, inadequate), the results of each study, and the accompanying ratings of the results based on the criteria for good measurement properties (sufficient (+) / insufficient (-) / indeterminate (?)). An example is provided in the User Manual for conducting systematic reviews of PROMs (45). In Appendix 6 we provide examples specifically for columns on reliability and measurement error. The table could be published, for example, as an Appendix or as supplemental material on the website of the journal only;
(5) a Summary of Findings (SoF) table per measurement property, including the pooled or summarized results of the measurement properties, the overall rating (i.e. sufficient (+) / insufficient (-) / inconsistent (±) / indeterminate (?)), and the grading of the quality of evidence (high, moderate, low, very low). An example is provided in the User Manual for conducting systematic reviews of PROMs (45). In Appendix 7 we provide examples specifically for columns on reliability and measurement error. These SoF tables (i.e. one per measurement property) will ultimately be used in providing recommendations for the selection of the most appropriate PROM for a given purpose or a particular context of use.
Appendix 1. Data extraction table of relevant information for each included study in a systematic review.
Extraction item | Instruction | Study 1 | Study 2
Elements of a comprehensive research question
1. Name of the instrument | Alternatively: type of instrument and parameter | |
2. Version or way of operationalization | All relevant components that are known or expected to influence the score, and which are standardized or restricted (facet of stratification (23)) | Equipment: / Preparatory actions: / Unprocessed data/sample collection: / Data processing and storage: / Assignment of the score/determination of the value: | Equipment: / Preparatory actions: / Unprocessed data/sample collection: / Data processing and storage: / Assignment of the score/determination of the value:
3. Construct | Description of what is being measured | |
4. Measurement property | Reliability and/or measurement error | |
5. Components that will be repeated | e.g. whole measurement (i.e. all components) or some of the components | |
6. Source(s) of variation varied | Components which are varied across the measurements (i.e. focus of analysis; facet of generalizability (23)) | |
7. Patient population | (i.e. facet of differentiation (23)) | |
The research question
Published research question | As formulated by the authors | |
Comprehensive research question | As formulated by the reviewer | |
Additional key element of research aim of the review
Target population | Description of the population to which the authors want to generalize | |
Types of measurement instrument | e.g. ClinROM, PerFOM, laboratory value, PROM or ObsROM | |
Statistical information and results
Model or formula used | Statistical model | |
Result | e.g. results (95% CI) of ICC, kappa, SEM, LoA and systematic difference | |
Variance components | All reported variance components | |
Apply criteria for good measurement property* | sufficient (+), insufficient (-), or indeterminate (?) | |
* Although this is a rating, and not data extraction, we include it here, as the information required to make the rating is extracted here.
Appendix 2. Risk of Bias ratings per standard per study
Risk of Bias rating, study 1 | rater 1 | rater 2 | consensus
Design requirements
1 Stability of the patients | | |
2 Time interval | | |
3 Similarity of measurement condition | | |
4 Administration without knowledge of scores or values | | |
5 Score assignment or determination of values without knowledge of the scores or values | | |
6 Other important flaws | | |
Statistical methods
7 For continuous scores: ICC | | |
8 For ordinal scores: Kappa | | |
9 For dichotomous/nominal scores: Kappa for each category against the other categories combined? | | |
Final rating | | |
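The final rating in the table above follows from the per-standard ratings. A minimal sketch, assuming the "worst-score-counts" method (section 1.7) means that the lowest rating given to any applicable standard becomes the final rating; the function name and the treatment of standards marked "not applicable" are assumptions.

```python
# Ordered from best to worst.
RATING_ORDER = ("very good", "adequate", "doubtful", "inadequate")


def final_rating(standard_ratings):
    """Return the worst rating among the applicable standards."""
    applicable = [r for r in standard_ratings if r != "not applicable"]
    if not applicable:
        raise ValueError("at least one applicable standard is required")
    unknown = [r for r in applicable if r not in RATING_ORDER]
    if unknown:
        raise ValueError(f"unknown rating(s): {unknown}")
    # The rating with the highest index in RATING_ORDER is the worst one.
    return max(applicable, key=RATING_ORDER.index)
```

For example, consensus ratings of very good, adequate and doubtful across the standards give a final rating of doubtful.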
Appendix 3. Example of a flow chart
Appendix 4. Example of reporting table on characteristics of the included measurement instruments.
Name (reference to first article) | Construct | Intended context of use | Best-evidence measurement protocol | Target population | Type of measurement instrument | Feasibility aspects | Interpretability aspects
LMM thickness (19) | Thickness of resting muscle | Evaluation | Training diagnostic ultrasound. Specific instructions for patient, and probe positions. | Patients with low back pain | Ultrasound | | Mean score in mix of pain patients was 27.9 mm (±3.2)
LMM contraction (19) | Comparison of the thickness of resting muscle with that of activated muscle | Evaluation | Training diagnostic ultrasound. Specific instructions for patient, and probe positions. | Patients with low back pain | Ultrasound | | Mean score in mix of pain patients ranges 1.3 mm (±1.7) – 3.5 mm (±2.6)
Other characteristics which may be extracted are: conceptual model used, recommended by standardization initiatives, full copy available, fit for purpose (diagnostic, prognostic, evaluation).
Aspects of feasibility are, for example, completion time, licensing information and costs of an instrument, and type and ease of administration. Feasibility applies to both the patients and the professionals who are involved in the measurement. You may consider reporting this information in a separate table.
Aspects of interpretability refer to 1) interpretability of single scores (e.g. information on the distribution of scores in the study population or other relevant subgroups, and floor and ceiling effects), and 2) interpretability of change scores (i.e. M(C)IC values).
Appendix 5. Example of reporting table on characteristics of the study populations.
Measurement instrument | Reference | Measurement property assessed | Patient population: sample size | Patient population: patient characteristics | Professional population: sample size | Professional population: characteristics of professionals | Response rate
LMM contraction | (19) Study 2 | Reliability, measurement error | 30 | 47% female, age mean (SD) 37 (±12); LBP n=20; neck/mid back pain n=5; extremity pain n=1; pain free n=4 | 2 | Chiropractors experienced in diagnostic ultrasound for the musculoskeletal system, i.e. 4 and 8 years resp., with a postgraduate diploma in diagnostic ultrasound. Before the study, both developed the protocol of diagnostic ultrasound that was applied in this study. |
 | (19) Study 3 | Reliability | 30 | 50% female, age mean (SD) 38 (±11); LBP n=23; neck/mid back pain n=7 | 2 | |
 | (19) Study 4 | Reliability, measurement error | 30 | 43% female, age mean (SD) 40 (±11); LBP n=20; neck/mid back pain n=6; extremity pain n=3; pain free n=1 | 2 | |
B | 1 | | | | | |
 | 2 | | | | | |
Patient characteristics refer to, e.g. age, gender, disease characteristics (diagnosis, disease duration, disease severity), setting, and geographical location.
Rater characteristics may refer to, e.g. professional background, specific training received, or years of experience.
Appendix 6. Overview table of quality and results of studies on reliability and measurement error.
Measurement instrument (MI) (ref) | Type of MI | Reliability: n | Reliability: study quality | Reliability: result (rating) | Measurement error: n | Measurement error: study quality | Measurement error: result (rating)
LMM contraction score (study 2) (19) | Ultrasound | 30 | Adequate | 0.97 (0.92-0.98) | 30 | Adequate | LoA [−0.94; 1.22 mm]
LMM contraction score (study 3) (19) | Ultrasound | 30 | Adequate | 0.94 (0.88-0.97) | | |
LMM contraction score (study 4) (19) | Ultrasound | 30 | Adequate | 0.97 (0.94-0.99) | 30 | Adequate | LoA [−1.32; 1.25 mm]
LMM contraction score (ref) | | | | | | |
LMM contraction score (ref) | | | | | | |
Pooled or summary result (overall rating) | | 90 | | 0.94-0.97 (+) | 90 | | SDC consistency = 1.08; 1.29 (a)
(a) Calculated from LoA
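The SDC values in the pooled row can be reproduced from the reported limits of agreement. The sketch below assumes that the LoA were computed as the mean difference ± 1.96 × SD of the differences, so that half the width of the LoA interval equals the consistency-type SDC and the midpoint equals the systematic difference; the function name is hypothetical.

```python
def loa_summary(lower: float, upper: float) -> dict:
    """Derive the systematic difference and the SDC from the limits of agreement."""
    if upper < lower:
        raise ValueError("upper limit must not be smaller than lower limit")
    return {
        # Midpoint of the interval: mean difference between the measurements.
        "systematic_difference": (upper + lower) / 2,
        # Half-width of the interval: 1.96 * SD of the differences.
        "sdc_consistency": (upper - lower) / 2,
    }


study_2 = loa_summary(-0.94, 1.22)  # LoA reported for study 2, in mm
study_4 = loa_summary(-1.32, 1.25)  # LoA reported for study 4, in mm
```

Under this assumption, study 2 gives an SDC of 1.08 mm and study 4 gives 1.285 mm, which matches the 1.08 and 1.29 shown in the table after rounding.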
Appendix 7. Summary of Findings tables for reliability and measurement error.
Based on the studies on reliability described by Skeie (19)
Reliability | Summary result | Overall rating | Quality of evidence
Ultrasound measurement of the LMM contraction score – best-evidence measurement protocol: rater, day and active motor tasks performed before measurement were not of influence | Range ICC: 0.94-0.97 | Sufficient | High (two studies of adequate quality)
Measurement instrument B – | | |
Based on the studies on measurement error described by Skeie (19)
Measurement error | Summary result | Overall rating | Quality of evidence
Ultrasound measurement of the LMM contraction score – best-evidence measurement protocol: rater, day and active motor tasks performed before measurement were not of influence | Range SDC consistency: 1.08-1.29; MIC = not assessed | ? |
Measurement instrument B – | | |
References
1. Mokkink LB, de Vet HCW, Prinsen CAC, Patrick DL, Alonso J, Bouter LM, et al. COSMIN Risk of Bias checklist for systematic reviews of Patient-Reported Outcome Measures. Qual Life Res. 2018;27(5):1171-9.
2. Walton MK, Powers JA, Hobart J, et al. Clinical outcome assessments: A conceptual foundation – Report of the ISPOR Clinical Outcomes Assessment Emerging Good Practices Task Force. Value Health. 2015;18:741-52.
3. Powers JH 3rd, Patrick DL, Walton MK, Marquis P, Cano S, Hobart J, et al. Clinician-Reported Outcome Assessments of Treatment Benefit: Report of the ISPOR Clinical Outcome Assessment Emerging Good Practices Task Force. Value Health. 2017;20(1):2-14.
4. Mokkink LB, Boers M, van der Vleuten CPM, Bouter LM, Alonso J, Patrick DL, et al. COSMIN Risk of Bias tool to assess the quality of studies on reliability or measurement error of outcome measurement instruments: a Delphi study. BMC Medical Research Methodology. 2020;20(293).
5. Prinsen CAC, Mokkink LB, Bouter LM, Alonso J, Patrick DL, de Vet HCW, et al. COSMIN guideline for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1147-57.
6. Terwee CB, Prinsen CAC, Chiarotto A, Westerman MJ, Patrick DL, Alonso J, et al. COSMIN methodology for evaluating the content validity of patient-reported outcome measures: a Delphi study. Qual Life Res. 2018;27(5):1159-70.
7. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63(7):737-45.
8. Hamilton M. The assessment of anxiety states by rating. Br J Med Psychol. 1959;32(1):50-5.
9. Douglas PS, DeCara JM, Devereux RB, Duckworth S, Gardin JM, Jaber WA, et al. Echocardiographic imaging in clinical trials: American Society of Echocardiography Standards for echocardiography core laboratories: endorsed by the American College of Cardiology Foundation. J Am Soc Echocardiogr. 2009;22(7):755-65.
10. Jungmann PM, Welsch GH, Brittberg M, Trattnig S, Braun S, Imhoff AB, et al. Magnetic Resonance Imaging Score and Classification System (AMADEUS) for Assessment of Preoperative Cartilage Defect Severity. Cartilage. 2017;8(3):272-82.
11. Fischer JSJ, A.J.; Kniker, J.E.; Rudick, R.A.; Cutter, G. Multiple Sclerosis Functional Composite (MSFC). Administration and scoring manual. 2001.
12. Genc S, Omer B, Aycan-Ustyol E, Ince N, Bal F, Gurdol F. Evaluation of turbidimetric inhibition immunoassay (TINIA) and HPLC methods for glycated haemoglobin determination. J Clin Lab Anal. 2012;26(6):481-5.
13. Holen JC, Saltvedt I, Fayers PM, Hjermstad MJ, Loge JH, Kaasa S. Doloplus-2, a valid tool for behavioural pain assessment? BMC Geriatr. 2007;7:29.
14. Farooq MN, Mohseni Bandpei MA, Ali M, Khan GA. Reliability of the universal goniometer for assessing active cervical range of motion in asymptomatic healthy persons. Pak J Med Sci. 2016;32(2):457-61.
15. Jordan K, Haywood KL, Dziedzic K, Garratt AM, Jones PW, Ong BN, et al. Assessment of the 3-dimensional Fastrak measurement system in measuring range of motion in ankylosing spondylitis. J Rheumatol. 2004;31(11):2207-15.
16. Correll S, Field J, Hutchinson H, Mickevicius G, Fitzsimmons A, Smoot B. Reliability and Validity of the Halo Digital Goniometer for Shoulder Range of Motion in Healthy Subjects. Int J Sports Phys Ther. 2018;13(4):707-14.
17. D'Agostino MA, Aegerter P, Jousse-Joulin S, Chary-Valckenaere I, Lecoq B, Gaudin P, et al. How to evaluate and improve the reliability of power Doppler ultrasonography for assessing enthesitis in spondylarthritis. Arthritis Rheum. 2009;61(1):61-9.
18. Terwee CB, Mokkink LB, Knol DL, Ostelo RW, Bouter LM, de Vet HC. Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist. Qual Life Res. 2012;21(4):651-7.
19. Skeie EJ, Borge JA, Leboeuf-Yde C, Bolton J, Wedderkopp N. Reliability of diagnostic ultrasound in measuring the multifidus muscle. Chiropr Man Therap. 2015;23:15.
20. Mathew AJ, Ostergaard M. Magnetic Resonance Imaging of Enthesitis in Spondyloarthritis, Including Psoriatic Arthritis-Status and Recent Advances. Front Med (Lausanne). 2020;7:296.
21. Butland RJ, Pang J, Gross ER, Woodcock AA, Geddes DM. Two-, six-, and 12-minute walking tests in respiratory disease. Br Med J (Clin Res Ed). 1982;284(6329):1607-8.
22. de Jong K ea. Richtlijnen 6-minutes timed walking test. 2000.
23. Bloch R, Norman G. Generalizability theory for the perplexed: a practical introduction and guide: AMEE Guide No. 68. Med Teach. 2012;34(11):960-92.
24. Feys P, Lamers I, Francis G, Benedict R, Phillips G, LaRocca N, et al. The Nine-Hole Peg Test as a manual dexterity performance measure for multiple sclerosis. Mult Scler. 2017;23(5):711-20.
25. Mathiowetz V, Weber K, Kashman N, Volland G. Adult norms for the Nine Hole Peg Test of finger dexterity. Occup Particip Health. 1985;5:24-38.
26. Arvidsson Lindvall M, Anderzen-Carlsson A, Appelros P, Forsberg A. Validity and test-retest reliability of the six-spot step test in persons after stroke. Physiother Theory Pract. 2020;36(1):211-8.
27. Romani J, Giavedoni P, Roe E, Vidal D, Luelmo J, Wortsman X. Inter- and Intra-rater Agreement of Dermatologic Ultrasound for the Diagnosis of Lobular and Septal Panniculitis. J Ultrasound Med. 2020;39(1):107-12.
28. Gellhorn AC, Carlson MJ. Inter-rater, intra-rater, and inter-machine reliability of quantitative ultrasound measurements of the patellar tendon. Ultrasound Med Biol. 2013;39(5):791-6.
29. Brennan RL. Generalizability Theory. New York: Springer-Verlag; 2001.
30. Govaerts MJ, van der Vleuten CP, Schuwirth LW. Optimising the reproducibility of a performance-based assessment test in midwifery education. Adv Health Sci Educ Theory Pract. 2002;7(2):133-45.
31. McGraw KOW, S.P. Forming inferences about some intraclass correlation coefficients. Psychological Methods. 1996;1:30-46.
32. Shrout PE, Fleiss JL. Intraclass Correlations: Uses in assessing rater reliability. Psychological Bulletin. 1979;86:420-8.
33. Kraemer HC, Periyakoil VS, Noda A. Kappa coefficients in medical research. Tutorial in biostatistics. Statistics in Medicine. 2002;21:2109-29.
34. Cohen J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 1968;70:213-20.
35. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20:37-46.
36. Vach W. The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol. 2005;58(7):655-61.
37. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement. 1973;33:613-9.
38. Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res. 1999;8(2):135-60.
39. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307-10.
40. de Vet HC, Terwee CB, Mokkink L, Knol DL. Measurement in Medicine. Cambridge: Cambridge University Press; 2011.
41. Euser AM, Dekker FW, le Cessie S. A practical approach to Bland-Altman plots and variation coefficients for log transformed variables. J Clin Epidemiol. 2008;61(10):978-82.
42. de Vet HC, Mokkink LB, Terwee CB, Hoekstra OS, Knol DL. Clinicians are right not to like Cohen's kappa. BMJ. 2013;346:f2125.
43. de Vet HC, Dikmans RE, Eekhout I. Specific agreement on dichotomous outcomes can be calculated for more than two raters. J Clin Epidemiol. 2017.
44. de Vet HCW, Mullender MG, Eekhout I. Specific agreement on ordinal and multiple nominal outcomes can be calculated for more than two raters. J Clin Epidemiol. 2018;96:47-53.
45. Mokkink LB, de Vet HC, Prinsen CA, Patrick DL, Alonso J, Bouter LM, et al. COSMIN methodology for systematic reviews of Patient-Reported Outcome Measures (PROMs) - user manual. 2018. Available from: www.cosmin.nl.
46. Higgins JP, Green S. Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration; 2011. Available from: www.handbook.cochrane.org.
47. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. 2013. Available from: http://methods.cochrane.org/sdt/handbook-dta-reviews.
48. Terwee CB, Jansma EP, Riphagen II, de Vet HC. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18(8):1115-23.
49. Boers M, Kirwan JR, Tugwell P, Beaton D, Bingham CO III, Conaghan PG, et al. The OMERACT handbook. OMERACT; 2015.
50. Smart A. A multi-dimensional model of clinical utility. International Journal for Quality in Health Care. 2006;18(5):377-82.
51. Stratford PW, Kennedy D, Pagura SM, Gollish JD. The relationship between self-report and performance-related measures: questioning the content validity of timed tests. Arthritis Rheum. 2003;49(4):535-40.
52. Efron B. Better bootstrap confidence intervals. Journal of the American Statistical Association. 1987;82(397):171-85.
53. Moher D, Liberati A, Tetzlaff J, Altman DG, Group P. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.
54. Peterson DAB, P.; Jabusch, H. C.; Altenmuller, E.; Frucht, S. J. Rating scales for musician's dystonia: the state of the art. Neurology. 2013;81(6):589-98.