TRANSCRIPT
integrating Data for Analysis, Anonymization, and SHaring
Natural Language Processing Wendy Chapman Danielle Mowery
Tools & Services
Collaborative Knowledge Authoring
Visualization of NLP Annotations
Evaluation Workbench
De-Identification
Classifier Development
Annotation Environment
Increase access to text through NLP
Decrease Burden of Developing NLP
NLP Tools & Services for iDASH
Surveillance from Twitter
NLP App Customization
Overview
• How can we encourage sharing of clinical data? » Creating an iDASH de-identification application
• How can we decrease the burden in creating training cases and annotating? » Developing an iDASH annotation environment » Demo of the iDASH annotation environment
• De-identification use case
7/19/12 Supported by the NIH Grant U54 HL108460 to the University of California, San Diego
Enabling Data Sharing
• Kawasaki Disease DBP has patient data » images » structured data » clinical reports
• Sharing this clinical data with other researchers » Offers opportunities for research advances » Presents many challenges
• How can we enable sharing of Kawasaki Disease and other clinical data? » Informed consent » Customizable DUA for data providers » HIPAA-compliant storage
De-identification of Clinical Data
• Missing link » Tool for removing 18 HIPAA Identifiers
• Headers – fairly straightforward • Text – more difficult
NAME: Yongsan Wong MRN: 5238492 DOB: 06.06.2006
--------------------------------------------------
This is a 14-month-old baby boy (Yongsan) who was transferred from Children’s Community with presumptive diagnosis of Kawasaki with fever for more than 5 days and conjunctivitis, mild arthritis with edema, rash, resolving and with elevated neutrophils and thrombocytosis, elevated CRP and ESR. When he was sent to the hospital, he had a fever of 102.
Patient names
Hospital names
Medical record numbers
…
Headers Text
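Header fields like these follow predictable "LABEL: value" patterns, which is what makes them the "fairly straightforward" part. A minimal sketch in Python, assuming hypothetical surrogate tokens like [**NAME**] (the patterns and token names are illustrative, not iDASH's actual rules):

```python
import re

# Hypothetical header-scrubbing sketch: each field follows a "LABEL: value"
# pattern, so regexes suffice here; the narrative text below the header
# needs a trained de-identification model instead.
HEADER_PATTERNS = [
    (re.compile(r"MRN:\s*\d+"), "MRN: [**MRN**]"),
    (re.compile(r"DOB:\s*\d{2}\.\d{2}\.\d{4}"), "DOB: [**DATE**]"),
    (re.compile(r"NAME:\s*[A-Za-z][A-Za-z .'-]*?(?=\s*(?:MRN:|DOB:|$))"),
     "NAME: [**NAME**]"),
]

def scrub_header(line: str) -> str:
    """Replace header PHI fields with surrogate tokens."""
    for pattern, surrogate in HEADER_PATTERNS:
        line = pattern.sub(surrogate, line)
    return line

print(scrub_header("NAME: Yongsan Wong MRN: 5238492 DOB: 06.06.2006"))
# NAME: [**NAME**] MRN: [**MRN**] DOB: [**DATE**]
```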
Customizable De-identification Service
BoB
Run de-id tool locally
Retrain on local data
Evaluate de-id on local data
Produce de-id texts
Enable sharing of clinical data
1. Pre-trained de-id application
2. Interface for corrections & retraining
3. Support for evaluation of output
Danielle Mowery, Brett South, Anurag Nara, Liqin Wang, Mingyuan Zhang, Shazia Ashfaq, Melissa Tharp
1. Build a Shareable De-identified Corpus
• MT Samples » Website with thousands of medical transcriptions » Minimally de-identified » Freely available
• Pilot annotation phase » 6 annotators » 350 reports
• Distributed annotation phase » Recruit community annotators » 2,000 reports
Research Questions:
- What is the best way to train many annotators?
- How does pre-annotation help?
- Does clustering data improve speed?
Danielle Mowery, Brett South, Liqin Wang, Mingyuan Zhang, Anurag Narra, Shazia Ashfaq
BoB: Best of Breed
Initial De-id Tool - BoB
• Developed at the Salt Lake City VA
• Incorporates techniques used in all other de-identification applications
• Statistical • Regular expressions • Dictionaries
Eventually add other open source tools for user to select from
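The combination of statistical models, regular expressions, and dictionaries can be pictured as an ensemble of span detectors whose candidates are merged. A toy sketch (the detector names, dictionary contents, and the union rule are illustrative assumptions, not BoB's actual implementation):

```python
import re

# Toy "best of breed" ensemble: each detector proposes PHI spans as
# (start, end, class) tuples; a simple union merges them. A real system
# would add a trained statistical detector and a smarter combination rule.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
FIRST_NAMES = {"yongsan", "maria", "john"}  # toy dictionary

def regex_detector(text):
    return [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]

def dictionary_detector(text):
    spans = []
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group().lower() in FIRST_NAMES:
            spans.append((m.start(), m.end(), "NAME"))
    return spans

def ensemble(text, detectors):
    # Union of all detectors' spans, a stand-in for BoB's combination step
    spans = set()
    for detector in detectors:
        spans.update(detector(text))
    return sorted(spans)

text = "Call Yongsan at 619-555-0100."
print(ensemble(text, [regex_detector, dictionary_detector]))
```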
2. Interface for Correction & Retraining
eHOST
University of Utah – Brett South, Chris Leng
[Screenshot: Evaluation Workbench — document & annotations pane, outcome measures for selected annotations, classification selector, report list, attributes and relationships for the selected annotation]
Christensen, Murphy, Frabetti, Rodriguez, Savova
3. Evaluation Workbench
Creating a Training Set
• Time-consuming » Recruiting & training annotators for high agreement
• Expensive » Domain experts especially expensive » Need annotation by multiple people
• Logistically challenging » Managing files and batches of reports » Setting up annotation tool
• Redundant » Hasn’t someone created a schema for this before?
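"High agreement" between annotators is commonly summarized with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected chance agreement from each annotator's label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators label four tokens as PHI or O (other)
print(cohens_kappa(["PHI", "O", "PHI", "O"], ["PHI", "O", "O", "O"]))  # 0.5
```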
iDASH Annotation Environment
Annotation Admin eHOST
Client apps on local computer
S Duvall, B South, G Savova, N Elhadad, H Hochheiser
Goal: provide an environment to decrease the burden of annotation
Annotator Registry
iDASH Web Services
Evaluation Workbench
Annotator Registry
• Enlist for annotation • Certify for annotation tasks
» Personal health information » Part-of-speech tagging » UMLS mapping
• Set pay rate • Searchable • Available for inclusion in new annotation task
http://idash.ucsd.edu/nlp-annotator-registry
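A registry entry reduces to a small, searchable record. An illustrative data model (the field names are assumptions, not the actual iDASH registry schema):

```python
from dataclasses import dataclass, field

# Illustrative registry record; field names are assumptions, not the
# actual iDASH registry schema.
@dataclass
class Annotator:
    name: str
    certifications: set = field(default_factory=set)  # e.g. {"phi", "pos", "umls"}
    pay_rate: float = 0.0  # per report

def search(registry, certification):
    """Annotators certified for a given task type, available for new tasks."""
    return [a for a in registry if certification in a.certifications]

registry = [
    Annotator("A. Smith", {"phi", "pos"}, 0.50),
    Annotator("B. Jones", {"umls"}, 0.75),
]
print([a.name for a in search(registry, "phi")])  # ['A. Smith']
```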
1. Assign annotators to a task
Annotation Admin
2. Create a Schema
3. Assign users and set time expectations
4. Keep track of progress
eHOST
Syncs with Annotation Admin » Download schema to annotate with » Download batch of reports to annotate » Upload annotated reports
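The sync cycle amounts to three calls against the Annotation Admin. In the sketch below, AdminServer and its method names are illustrative stand-ins, not the real iDASH web-service API:

```python
# Sketch of the eHOST sync cycle; AdminServer and its method names are
# illustrative stand-ins for the Annotation Admin web services.
class AdminServer:
    def __init__(self, schema, batches):
        self.schema = schema      # annotation schema shared by all annotators
        self.batches = batches    # batch_id -> list of report texts
        self.uploaded = {}        # (batch_id, index) -> annotations

    def get_schema(self):
        return self.schema

    def get_batch(self, batch_id):
        return self.batches[batch_id]

    def upload(self, report_id, annotations):
        self.uploaded[report_id] = annotations

def sync_and_annotate(server, batch_id, annotate):
    schema = server.get_schema()                                # 1. download schema
    for i, report in enumerate(server.get_batch(batch_id)):     # 2. download batch
        server.upload((batch_id, i), annotate(report, schema))  # 3. upload results

server = AdminServer({"classes": ["PHI"]}, {"b1": ["fever of 102", "MRN 5238492"]})
sync_and_annotate(server, "b1", lambda report, schema: [])
print(len(server.uploaded))  # 2
```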
Evaluation Workbench
• Compare annotations from two sources • Drill down to understand differences • Calculate outcome measures • Perform error analysis
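Comparing annotations from two sources typically reduces to exact-span matching, with precision, recall, and F1 as the outcome measures. A minimal sketch:

```python
def span_metrics(reference, system):
    """Exact-span precision, recall, and F1 between two annotation sets."""
    ref, hyp = set(reference), set(system)
    tp = len(ref & hyp)  # spans identical in offsets and class
    precision = tp / len(hyp) if hyp else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# (start, end, class) spans from a reference annotator and an NLP system
reference = {(0, 7, "NAME"), (20, 27, "MRN")}
system = {(0, 7, "NAME"), (30, 34, "DATE")}
print(span_metrics(reference, system))  # (0.5, 0.5, 0.5)
```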
Demo of iDASH Annotation Environment
Annotation Admin eHOST
Client apps on local computer
Danielle Mowery
Annotator Registry
Evaluation Workbench
iDASH Web Services
Conclusion
• iDASH NLP Ecosystem goals » Decrease barriers to sharing of clinical data » Enhance clinical data use for research
• Leveraging the iDASH secure cloud
• Future work » Evaluate and extend the Annotation environment for crowdsourcing
» Create a customizable de-id application for iDASH users
Thank you!
Questions?