THRio: Database Linkage and THRio Database Issues

Database matching
- There are several systems that do not “talk” to each other
  - SINAN: reportable diseases (TB, AIDS)
  - SIM: mortality
  - SICOM: pharmaceutical database (ARVs)
  - THRio: our DB
- Original plan
  - Match THRio with the other 3 DBs above

Database matching
- Problems
  - There is no unique identifier common to all systems
  - We use name, gender, DOB and mother’s name as surrogates
  - The information is not uniform: many missing variables, especially mother’s name
- THRio
  - Standardization of name abbreviations
  - Double data entry
  - Not enough: names are misspelled
- The other databases: even worse
  - No QC
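
A minimal sketch of how the surrogate identifier (name, gender, DOB, mother's name) could be built after standardizing names. The abbreviation table, the field names (`name`, `gender`, `dob`, `mother_name`) and the helper names are assumptions for illustration; the actual THRio standardization rules are not given in these slides.

```python
import unicodedata

# Hypothetical abbreviation table; the real THRio list is not shown in the slides.
ABBREVIATIONS = {"MA": "MARIA", "FCO": "FRANCISCO", "STO": "SANTO"}


def normalize_name(name):
    """Upper-case, strip accents and punctuation, expand known abbreviations."""
    if not name:
        return ""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep letters and spaces only; everything else becomes a space.
    cleaned = "".join(
        c if c.isalpha() or c.isspace() else " " for c in no_accents.upper()
    )
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


def surrogate_key(record):
    """Build the (name, gender, DOB, mother's name) tuple used in place of an ID."""
    return (
        normalize_name(record.get("name")),
        (record.get("gender") or "").upper(),
        record.get("dob") or "",
        normalize_name(record.get("mother_name")),
    )


print(surrogate_key({"name": "José da Silva", "gender": "m",
                     "dob": "1970-03-02", "mother_name": None}))
```

A missing mother's name becomes an empty string here, so two records can still be compared on the remaining fields.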

Database matching
- Proposed strategy
  - Compare different approaches
    - Translated SOUNDEX
    - Reclink: probabilistic linkage
    - Other algorithms
  - Apply to different examples and get sensitivity/specificity for each one
- SICOM
  - Sequential matching
  - Match TB before doing the sequential matching

Database matching
- The project was split:
  - ARV database revisited
  - Development of a new algorithm for database linkage

Database matching
- ARV database revisited
  - Consistency problems (as pointed out before)
  - First HAART abstracted for THRio
  - Inconsistency confirmed
    - Dates did not match (40%)
    - Drugs did not match
  - Now all the ART history will be collected (since HAART only)
  - Should we insist and compare the database with the whole history?

Database matching
- Development of algorithm for database linkage
  - Using Python to implement the interface
    - Adapted soundex algorithm
    - “Gestalt” algorithm, rather hyperbolic
    - Direct field comparisons
  - Including a hierarchical structure for searching and comparing records
    - Means taking advantage of differences in amount of information available
  - Computational problems
    - Optimization
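
As an illustration of the “gestalt” idea, Python's standard `difflib.SequenceMatcher` computes a similarity ratio close to Ratcliff/Obershelp gestalt pattern matching. This is only a sketch: the slides do not say which implementation was used, and the function name `name_similarity` is an assumption.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity score in [0, 1] for two already-normalized names."""
    if not a or not b:
        # With a missing name there is nothing to compare.
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


print(name_similarity("JOSE", "JOSA"))                   # one differing letter
print(name_similarity("MARIA SOUZA", "MARIA DE SOUSA"))  # particle + spelling variant
```

The ratio is 2 × (matched characters) / (total characters in both strings), so a single wrong letter in a short name lowers the score more than the same error in a long name.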

Database matching
- Blocking
  - Speeds up computation
    - I’ll be concerned with records that are a little similar to begin with
  - Soundex
    - First and last names
    - Mother’s first and last names
    - First name and mother’s last name
  - Needed to expand to account for errors in the first letter of the first and last names
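
The blocking step above can be sketched as follows. This uses the standard American Soundex, not the adapted/translated version mentioned in the slides, and it does not include the expansion for first-letter errors. The field names and the helpers `first_last`, `block_keys` and `candidate_pairs` are assumptions for illustration.

```python
from collections import defaultdict

# Standard American Soundex letter codes.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (encoded + "000")[:4]


def first_last(full_name):
    tokens = (full_name or "").split()
    return (tokens[0], tokens[-1]) if tokens else ("", "")


def block_keys(rec):
    """The three Soundex blocks listed on the slide."""
    first, last = first_last(rec.get("name"))
    m_first, m_last = first_last(rec.get("mother_name"))
    keys = set()
    if first and last:
        keys.add(("name", soundex(first), soundex(last)))
    if m_first and m_last:
        keys.add(("mother", soundex(m_first), soundex(m_last)))
    if first and m_last:
        keys.add(("mixed", soundex(first), soundex(m_last)))
    return keys


def candidate_pairs(left, right):
    """Index pairs (i, j) whose records share at least one block key."""
    index = defaultdict(set)
    for j, rec in enumerate(right):
        for key in block_keys(rec):
            index[key].add(j)
    return {(i, j) for i, rec in enumerate(left)
            for key in block_keys(rec) for j in index[key]}
```

Only the pairs returned by `candidate_pairs` would go on to the full comparison, which is what cuts the number of comparisons.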

Database matching
- Full comparison
  - All fields exactly the same
  - Small error in DOB
  - Similar names (gestalt): generates scores
  - A combination of the above
- Several “levels” created
- Have to choose 2 cutoffs
  - Not a match
  - Definitely a match
  - Have to manually decide
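
The two cutoffs split the score range into three zones. A minimal sketch, assuming a score in [0, 1]; the cutoff values 0.70 and 0.90 are placeholders, not the values chosen in the project.

```python
def classify(score, lower=0.70, upper=0.90):
    """Map a comparison score to one of the three decisions on the slide."""
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if score >= upper:
        return "match"
    if score < lower:
        return "not a match"
    return "manual check"


for s in (0.95, 0.80, 0.40):
    print(s, classify(s))
```

Moving the cutoffs apart sends more pairs to manual checking; moving them together trades manual work for a higher risk of wrong automatic decisions.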

Database matching
- Computational problems, testing phase
  - Using PostgreSQL and Python
    - Too slow when matching with the TB database
    - > 100,000 records
  - Changed the algorithm to Python only
- Computational times (currently)
  - THRio x SIM (12,689 x 2,922): 3-4 minutes
  - THRio x TB (12,689 x 102,919): 100-105 minutes

Database matching
- Results
  - First we chose a sample of the mortality database
    - Year 2005
    - AIDS only
    - 871 records
  - Matched with THRio database
    - 10,344 records at the time

Database matching
- Compared Manual x Reclink x Algorithm
  - We were going to use the manual linkage as the gold standard
    - The algorithm found 13 extra right matches
  - We used the combination of those as the standard

Database matching

                     Manual check
Algorithm      Match   Not match   Total
Match            107           1     108
Not match          0       10236   10236
Total            107       10237   10344

Sensitivity: 100%   Specificity: 99.9%   PPV: 99%   NPV: 100%

                     Manual check
RecLink        Match   Not match   Total
Match             89           0      89
Not match         18       10237   10255
Total            107       10237   10344

Sensitivity: 83.2%   Specificity: 100%   PPV: 100%   NPV: 99.8%
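
The four summary measures follow directly from the cell counts in the two tables. A short sketch that recomputes them; the function name `diagnostic_metrics` is an assumption, and the counts are the ones shown above.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 table of counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Rows are the linkage method, columns are the manual check.
algorithm = diagnostic_metrics(tp=107, fp=1, fn=0, tn=10236)
reclink = diagnostic_metrics(tp=89, fp=0, fn=18, tn=10237)

for label, metrics in (("Algorithm", algorithm), ("RecLink", reclink)):
    print(label, {name: f"{value:.1%}" for name, value in metrics.items()})
```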

Database matching
- The algorithm outperformed both RecLink and manual check
  - But after some adjustments
  - That was just the “training phase”
- The only mistake still has to be checked: it may be a twin brother
  - Full info and only one different letter in the first name
- We still have to test it again with a different sample and with TB

Database matching
- THRio (latest) x SIM (2003-2005)
  - 340 matches (total)
    - 79 (23%) to be manually checked only
    - This means that both DBs have good quality, at least in terms of completeness
    - Ended up with 273 matches and one possible mistake
  - When we actually implement it...
    - Extra check with date of last annotation in the chart

Database matching
- Challenge
  - TB database
    - Data quality is much poorer than SIM
    - Might lead to lower sensitivity
    - Will lead to much more manual checking
    - Development of interface to help work

Database matching
- THRio (latest) x TB (1995-2005)
  - 6453 matches (total)
    - 3870 (60%) to be manually checked
    - 721 (11%) with names only
  - Quality is much worse than SIM
    - Many duplicates
  - Proposed solutions:
    - Reduce time frame (for prospective TB cases only)
    - Use date of TB diagnosis to exclude duplicates
    - GUI to help
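
One way the date of TB diagnosis could be used to exclude duplicates is to treat notifications for the same person that fall close together as a single episode. This is a sketch under stated assumptions: the 180-day window, the field names `person_key` and `diagnosis_date`, and the function name are all hypothetical, and the slides do not describe the rule actually adopted.

```python
from datetime import date, timedelta


def drop_duplicate_notifications(records, window_days=180):
    """Keep one record per person per episode, using the diagnosis date."""
    kept = []
    last_kept = {}  # person_key -> diagnosis date of the last record kept
    ordered = sorted(records, key=lambda r: (r["person_key"], r["diagnosis_date"]))
    for rec in ordered:
        previous = last_kept.get(rec["person_key"])
        if previous is None or rec["diagnosis_date"] - previous > timedelta(days=window_days):
            kept.append(rec)
            last_kept[rec["person_key"]] = rec["diagnosis_date"]
    return kept


notifications = [
    {"person_key": "A", "diagnosis_date": date(2003, 1, 10)},
    {"person_key": "A", "diagnosis_date": date(2003, 2, 1)},   # same episode
    {"person_key": "A", "diagnosis_date": date(2005, 6, 1)},   # later episode
    {"person_key": "B", "diagnosis_date": date(2004, 3, 3)},
]
print(len(drop_duplicate_notifications(notifications)))
```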

Database matching
- Further discussion for mortality:
  - What database to use?
    - All causes x HIV-AIDS as a basic cause
      - Patients may be dying of other causes
    - Municipality x State
      - Patients may live in other cities
      - Municipality just records deaths that occurred in the city

Data analysis issues
- Complex structure
  - Currently 17 tables with information
  - Dates are not date fields
    - We need dates!!!
- We don’t collect information about specific visits
  - It is the information since last annotation up to the current one, which could mean multiple visits
- Definitions are hard to make

Data analysis issues
- All the events have to be based on dates
  - Partial missing dates
    - In general I’ll accept missing days, turned to 15
  - What to use as a surrogate?
    - For data collected under the study: date of last annotation
    - What about baseline data?
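
The rule for partial dates (accept a missing day and set it to 15) can be sketched as below. How the date parts are stored is an assumption, since the slides only say that dates are not date fields; the function name `impute_date` is hypothetical.

```python
from datetime import date


def impute_date(year, month, day=None):
    """Return a date, using day 15 when only the day is missing.

    Returns None when the year or month is missing, since the slides
    only mention accepting missing days.
    """
    if not year or not month:
        return None
    return date(int(year), int(month), int(day) if day else 15)


print(impute_date(2005, 9, 1))    # complete date
print(impute_date(2005, 9))       # missing day becomes the 15th
print(impute_date(2005, None))    # too incomplete to use
```

Day 15 exists in every month, so the imputed value is always a valid date and is at most about two weeks away from the true one.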

Data analysis issues
- Definition of baseline data
  - Study begins on September 1st 2005
  - Baseline data collection finished in June 2006
  - “Baseline form” doesn’t mean baseline information
    - Is it baseline for the study or for the patient?
    - What about new patients? Do they have baseline data?

Data analysis issues
- Definition of a new patient
  - We have two “candidate” dates
    - Date of enrollment in the clinic
      - Could be long before HIV diagnosis
    - Date of HIV diagnosis
      - Could be long before enrollment in that clinic
  - A “new” patient is not necessarily new, depending on what we want
    - Do we need newly diagnosed or newly enrolled?
    - Should we use both?

Data analysis issues
- Several possible outcomes
  - Primary outcome of study (TB)
  - Secondary outcome (death)
  - Operational outcomes
    - Waiting for PPD
    - PPD placed and read
    - Reactive PPD
    - INH started
- How to deal with all of these?

Data analysis issues
- General output for data analysis
  - For each patient, look for baseline status
    - As of Sept 2005 or at enrollment
  - Look for all changes in time
    - Need the dates!!!
  - Set up like a database for survival analysis
    - For every change repeat records with
      - Initial status
      - Initial date
      - Final status
      - Final date
  - Possible to customize for specific outcomes
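
The survival-analysis layout described above (one record per change, with initial and final status and date) can be sketched as follows. The event structure, the status labels and the function name `to_intervals` are assumptions for illustration, not the actual THRio table design.

```python
from datetime import date


def to_intervals(patient_id, events):
    """Turn dated status changes into one row per transition.

    `events` is a list of (date, status) pairs; the earliest one is
    taken as the baseline status.
    """
    ordered = sorted(events, key=lambda event: event[0])
    rows = []
    for (start_date, start_status), (end_date, end_status) in zip(ordered, ordered[1:]):
        rows.append({
            "patient_id": patient_id,
            "initial_status": start_status,
            "initial_date": start_date,
            "final_status": end_status,
            "final_date": end_date,
        })
    return rows


events = [
    (date(2006, 1, 20), "PPD placed and read"),
    (date(2005, 9, 1), "Waiting for PPD"),      # baseline as of Sept 2005
    (date(2006, 2, 10), "INH started"),
]
for row in to_intervals(1, events):
    print(row)
```

Customizing for a specific outcome would then amount to filtering or collapsing the rows on `final_status`.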