Upload: augustus-moser

Post on 30-Dec-2015

THRio

Database Linkage and THRio Database Issues

Database matching
- There are several systems that do not "talk" to each other
  - SINAN: reportable diseases (TB, AIDS)
  - SIM: mortality
  - SICOM: pharmaceutical database (ARVs)
  - THRio: our DB
- Original plan
  - Match THRio with the other 3 DBs above

Database matching
- Problems
  - There is no unique identifier common to all systems
  - We use name, gender, DOB and mother's name as surrogates
  - The information is not uniform: many variables are missing, especially mother's name
- THRio
  - Standardization of name abbreviations
  - Double data entry
  - Not enough: names are still misspelled
- The other databases are even worse
  - No QC
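The standardization step could be sketched as below. This is only an illustration, not the THRio code: the particle list and the record field names (`name`, `gender`, `dob`, `mother_name`) are assumptions made for the example.

```python
import unicodedata

# Hypothetical list of Portuguese name particles to drop (an assumption,
# not taken from the slides).
PARTICLES = {"DE", "DA", "DO", "DAS", "DOS", "E"}


def normalize_name(name):
    """Uppercase a name, strip accents and punctuation, drop particles."""
    if not name:
        return ""
    # Decompose accented characters, then discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep letters and spaces only; anything else becomes a space.
    cleaned = "".join(c if c.isalpha() or c.isspace() else " " for c in unaccented)
    tokens = [t for t in cleaned.upper().split() if t not in PARTICLES]
    return " ".join(tokens)


def surrogate_key(record):
    """Build the surrogate identifier: name, gender, DOB, mother's name."""
    return (
        normalize_name(record.get("name")),
        record.get("gender"),
        record.get("dob"),
        normalize_name(record.get("mother_name")),  # often missing
    )
```

Normalizing before comparison removes differences in case, accents and particles, but it does not fix misspellings, which is why fuzzy matching is still needed.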

Database matching
- Proposed strategy
  - Compare different approaches
    - Translated SOUNDEX
    - Reclink: probabilistic linkage
    - Other algorithms
  - Apply to different examples and get sensitivity/specificity for each one
    - SICOM
    - Sequential matching
    - Match TB before doing the sequential matching

Database matching
- The project was split:
  - ARV database revisited
  - Development of a new algorithm for database linkage

Database matching
- ARV database revisited
  - Consistency problems (as pointed out before)
  - First HAART abstracted for THRio
    - Inconsistency confirmed
    - Dates did not match (40%)
    - Drugs did not match
  - Now all the ART history will be collected (since HAART only)
  - Should we insist and compare the database with the whole history?

Database matching
- Development of algorithm for database linkage
  - Using Python to implement the interface
    - Adapted soundex algorithm
    - "Gestalt" algorithm (a rather hyperbolic name)
    - Direct field comparisons
  - Including a hierarchical structure for searching and comparing records
    - Means taking advantage of differences in the amount of information available
  - Computational problems
    - Optimization
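A minimal sketch of a "gestalt" name comparison. Python's standard `difflib.SequenceMatcher` is documented as being closely related to Ratcliff and Obershelp's "gestalt pattern matching", so it is used here; whether the THRio algorithm calls this exact class is an assumption, and `name_similarity` is a name made up for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a similarity score in [0, 1] for two names (1.0 = identical)."""
    # Compare case-insensitively; None disables the "junk" heuristic.
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


if __name__ == "__main__":
    for pair in [("MARCOS", "MARCUS"), ("MARIA", "MARIA"), ("ANA", "XYZ")]:
        print(pair, round(name_similarity(*pair), 3))
```

The score is the share of characters that fall in common blocks, so a single wrong letter in a long name costs little, while unrelated names score near zero.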

Database matching
- Blocking
  - Speeds up computation
  - I'll be concerned only with records that are a little similar to begin with
  - Soundex
    - First and last names
    - Mother's first and last names
    - First name and mother's last name
  - Needed to expand to account for errors in the first letter of the first and last names
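The blocking idea can be sketched as follows. The code uses the standard American Soundex, not the translated/adapted version mentioned in the slides, and the field names (`first`, `last`, `mother_first`, `mother_last`) are assumptions. It does not include the expansion for first-letter errors.

```python
from collections import defaultdict

# Standard Soundex letter-to-digit table.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Standard 4-character Soundex; expects accent-free input."""
    letters = [c for c in (name or "").upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            digits.append(code)
        prev = code
    return (letters[0] + "".join(digits) + "000")[:4]


def blocking_keys(rec):
    """The three Soundex key pairs listed on the slide."""
    first, last = soundex(rec.get("first")), soundex(rec.get("last"))
    m_first, m_last = soundex(rec.get("mother_first")), soundex(rec.get("mother_last"))
    keys = set()
    if first and last:
        keys.add(("name", first, last))
    if m_first and m_last:
        keys.add(("mother", m_first, m_last))
    if first and m_last:
        keys.add(("mixed", first, m_last))
    return keys


def candidate_pairs(db_a, db_b):
    """Return index pairs (i, j) that share at least one blocking key."""
    index = defaultdict(list)
    for j, rec in enumerate(db_b):
        for key in blocking_keys(rec):
            index[key].append(j)
    pairs = set()
    for i, rec in enumerate(db_a):
        for key in blocking_keys(rec):
            pairs.update((i, j) for j in index.get(key, ()))
    return pairs
```

Only the pairs returned by `candidate_pairs` would go on to the full comparison, instead of every record against every record.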

Database matching
- Full comparison
  - All fields exactly the same
  - Small error in DOB
  - Similar names (gestalt): generates scores
  - A combination of the above
- Several "levels" created
- Have to choose 2 cutoffs
  - Not a match
  - Definitely a match
  - Have to manually decide
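The two cutoffs split the scores into three zones. A minimal sketch, assuming a single numeric score per candidate pair; the cutoff values in the example are placeholders, not the ones chosen for THRio.

```python
def classify(score, lower, upper):
    """Assign a candidate pair to one of three zones using two cutoffs."""
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if score >= upper:
        return "definitely a match"
    if score < lower:
        return "not a match"
    return "manual decision"


if __name__ == "__main__":
    # Placeholder cutoffs, for illustration only.
    for s in (0.95, 0.70, 0.30):
        print(s, classify(s, lower=0.6, upper=0.9))
```

Moving the cutoffs closer together reduces manual work but raises the risk of automatic mistakes, which is the trade-off behind choosing them.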

Database matching
- Computational problems: testing phase
  - Using PostgreSQL and Python
    - Too slow when matching with the TB database
    - > 100,000 records
  - Changed the algorithm to Python only
- Computational times (currently)
  - THRio x SIM (12,689 x 2,922): 3-4 minutes
  - THRio x TB (12,689 x 102,919): 100-105 minutes
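For scale, the number of record pairs in a full cross comparison can be worked out from the sizes above. This is plain arithmetic on the slide's figures; how many pairs the algorithm actually compares after blocking is not stated.

```python
def full_comparisons(n_a, n_b):
    """Number of record pairs if every record is compared with every other."""
    return n_a * n_b


sim_pairs = full_comparisons(12_689, 2_922)    # THRio x SIM
tb_pairs = full_comparisons(12_689, 102_919)   # THRio x TB

print(f"THRio x SIM: {sim_pairs:,} pairs")
print(f"THRio x TB:  {tb_pairs:,} pairs ({tb_pairs / sim_pairs:.1f} times as many)")
```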

Database matching
- Results
  - First we chose a sample of the mortality database
    - Year 2005
    - AIDS only
    - 871 records
  - Matched with THRio database
    - 10,344 records at the time

Database matching
- Compared Manual x Reclink x Algorithm
  - We were going to use the manual linkage as the gold standard
    - The algorithm found 13 extra right matches
  - We used the combination of those as the standard

Database matching

                        Manual check
Algorithm       Match   Not match    Total
Match             107           1      108
Not match           0       10236    10236
Total             107       10237    10344

Sensitivity: 100%  Specificity: 99.9%  PPV: 99%  NPV: 100%

                        Manual check
RecLink         Match   Not match    Total
Match              89           0       89
Not match          18       10237    10255
Total             107       10237    10344

Sensitivity: 83.2%  Specificity: 100%  PPV: 99%  NPV: 99%
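The four measures follow directly from the cell counts in the two tables. A short sketch, taking the method as the rows and the manual check as the standard in the columns; the function name is made up for the example. Values recomputed this way may differ slightly from the rounded figures on the slide.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # true matches that were found
        "specificity": tn / (tn + fp),  # true non-matches that were rejected
        "ppv": tp / (tp + fp),          # reported matches that are right
        "npv": tn / (tn + fn),          # reported non-matches that are right
    }


algorithm = diagnostic_metrics(tp=107, fp=1, fn=0, tn=10236)
reclink = diagnostic_metrics(tp=89, fp=0, fn=18, tn=10237)

for name, result in (("Algorithm", algorithm), ("RecLink", reclink)):
    print(name, {k: f"{100 * v:.2f}%" for k, v in result.items()})
```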

Database matching
- The algorithm outperformed both RecLink and manual check
  - But after some adjustments
  - That was just the "training phase"
- The only mistake still has to be checked to see whether it is a twin brother
  - Full info and only one different letter in the first name
- We still have to test it again with a different sample and with TB

Database matching
- THRio (latest) x SIM (2003-2005)
  - 340 matches (total)
    - Only 79 (23%) to be manually checked
    - This means that both DBs have good quality, at least in terms of completeness
    - Ended up with 273 matches and one possible mistake
- When we actually implement it…
  - Extra check with date of last annotation in the chart

Database matching
- Challenge: TB database
  - Data quality is much poorer than SIM
  - Might lead to lower sensitivity
  - Will lead to much more manual checking
  - Development of interface to help work

Database matching
- THRio (latest) x TB (1995-2005)
  - 6453 matches (total)
    - 3870 (60%) to be manually checked
    - 721 (11%) with names only
  - Quality is much worse than SIM
    - Many duplicates
- Proposed solutions:
  - Reduce time frame (for prospective TB cases only)
  - Use date of TB diagnosis to exclude duplicates
  - GUI to help
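One possible reading of "use date of TB diagnosis to exclude duplicates" is sketched below: notifications for the same person with close diagnosis dates are treated as one episode. The slides do not say how this would be done, so the 30-day window and the field names `person_key` and `diagnosis_date` are assumptions.

```python
from datetime import date


def drop_duplicate_notifications(records, window_days=30):
    """Keep one record per person per TB episode.

    A record is dropped when the same person has an earlier record whose
    diagnosis date is within window_days of it. Chains of close dates
    collapse into the first record of the chain.
    """
    kept = []
    last_seen = {}
    ordered = sorted(records, key=lambda r: (r["person_key"], r["diagnosis_date"]))
    for rec in ordered:
        prev = last_seen.get(rec["person_key"])
        if prev is None or (rec["diagnosis_date"] - prev).days > window_days:
            kept.append(rec)
        last_seen[rec["person_key"]] = rec["diagnosis_date"]
    return kept


if __name__ == "__main__":
    sample = [
        {"person_key": "A", "diagnosis_date": date(2004, 1, 10)},
        {"person_key": "A", "diagnosis_date": date(2004, 1, 20)},  # duplicate
        {"person_key": "A", "diagnosis_date": date(2005, 3, 1)},   # new episode
    ]
    print(len(drop_duplicate_notifications(sample)))
```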

Database matching
- Further discussion for mortality:
  - What database to use?
  - All causes x HIV-AIDS as a basic cause
    - Patients may be dying of other causes
  - Municipality x State
    - Patients may live in other cities
    - Municipality just records deaths that occurred in the city

Data analysis issues

Data analysis issues
- Complex structure
  - Currently 17 tables with information
  - Dates are not date fields
    - We need dates!!!
- We don't collect information about specific visits
  - It is the information from the last annotation up to the current one, which could mean multiple visits
- Definitions are hard to make

Data analysis issues
- All the events have to be based on dates
  - Partial missing dates
    - In general I'll accept missing days, which are set to 15
- What to use as a surrogate?
  - For data collected under the study: date of last annotation
  - What about baseline data?
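The rule for partial dates can be sketched as below: a missing day becomes the 15th. Returning `None` when the month or year is missing is an assumption, since the slide only says that missing days are accepted.

```python
from datetime import date


def impute_date(year, month, day=None):
    """Build a date from separate fields, setting a missing day to 15."""
    if year is None or month is None:
        return None  # assumption: too incomplete to use
    return date(year, month, 15 if day is None else day)


if __name__ == "__main__":
    print(impute_date(2005, 9))      # day missing
    print(impute_date(2005, 9, 1))   # complete date
    print(impute_date(2005, None))   # month missing
```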

Data analysis issues
- Definition of baseline data
  - Study begins on September 1st, 2005
  - Baseline data collection finished in June 2006
  - "Baseline form" doesn't mean baseline information
  - Is it baseline for the study or for the patient?
  - What about new patients? Do they have baseline data?

Data analysis issues
- Definition of a new patient
  - We have two "candidate" dates
    - Date of enrollment in the clinic
      - Could be long before HIV diagnosis
    - Date of HIV diagnosis
      - Could be long before enrollment in that clinic
  - A "new" patient is not necessarily new, depending on what we want
  - Do we need newly diagnosed or newly enrolled?
  - Should we use both?
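One way to keep both options open is to store two flags per patient rather than a single "new" indicator. A sketch, assuming "new" means on or after the study start date of September 1st, 2005; that cutoff rule and the function name are assumptions.

```python
from datetime import date

STUDY_START = date(2005, 9, 1)  # study start date given in the slides


def new_patient_flags(enrollment_date, hiv_diagnosis_date, study_start=STUDY_START):
    """Flag a patient as newly enrolled and/or newly diagnosed."""
    return {
        "newly_enrolled": enrollment_date is not None
        and enrollment_date >= study_start,
        "newly_diagnosed": hiv_diagnosis_date is not None
        and hiv_diagnosis_date >= study_start,
    }


if __name__ == "__main__":
    # Diagnosed years earlier elsewhere, enrolled in this clinic during the study.
    print(new_patient_flags(date(2006, 2, 1), date(1999, 5, 15)))
```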

Data analysis issues
- Several possible outcomes
  - Primary outcome of study (TB)
  - Secondary outcome (death)
  - Operational outcomes
    - Waiting for PPD
    - PPD placed and read
    - Reactive PPD
    - INH started
- How to deal with all of these?

Data analysis issues
- General output for data analysis
  - For each patient, look for baseline status
    - As of Sept 2005 or at enrollment
  - Look for all changes in time
    - Need the dates!!!
  - Set up like a database for survival analysis
    - For every change repeat records with
      - Initial status
      - Initial date
      - Final status
      - Final date
  - Possible to customize for specific outcomes
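The survival-style layout can be sketched as one row per interval between status changes. The function and field names are made up for the example, and closing the last interval at the end of follow-up with an unchanged status is an assumption about how censoring would be recorded.

```python
from datetime import date


def to_intervals(patient_id, baseline_status, baseline_date, changes, end_date):
    """Turn a list of (date, new_status) changes into interval records."""
    rows = []
    status, start = baseline_status, baseline_date
    for change_date, new_status in sorted(changes):
        rows.append({
            "patient_id": patient_id,
            "initial_status": status,
            "initial_date": start,
            "final_status": new_status,
            "final_date": change_date,
        })
        status, start = new_status, change_date
    # Last interval: no further change observed up to end of follow-up.
    rows.append({
        "patient_id": patient_id,
        "initial_status": status,
        "initial_date": start,
        "final_status": status,
        "final_date": end_date,
    })
    return rows


if __name__ == "__main__":
    history = [(date(2005, 11, 2), "PPD placed and read"),
               (date(2006, 1, 15), "INH started")]
    for row in to_intervals(1, "Waiting for PPD", date(2005, 9, 1),
                            history, date(2006, 6, 30)):
        print(row)
```

Customizing for a specific outcome would then mean filtering or recoding the `final_status` values of interest.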

Thank you!