How can Natural Language Processing help MedDRA coding? April 16 2018 Andrew Winter Ph.D., Senior Life Science Specialist, Linguamatics


Summary

About NLP and NLP in life sciences
Uses of NLP with MedDRA
Examples in MedDRA coding of adverse events in FDA drug labels
How NLP could feed into MedDRA development

© Linguamatics 2018

Use of NLP in Life Sciences

Advanced text analytics delivers value along the pipeline

Gene-disease mapping

Target ID/selection

Mutation/expression analysis

Toxicity analysis and prediction

Biomarker discovery

Drug repurposing

Patent analysis

KOL identification

Opportunity scouting

Trial site selection and study design

Safety

Competitive intelligence

Pharmacovigilance

Voice of the Customer analysis

Comparative Effectiveness

Regulatory Submission QC

HEOR

SAR

Social media analysis

IDMP

Real World Evidence

NLP Turns Text into Actionable Insights

Turn text into structured data using sophisticated queries

Analytics

To drive analytics

Enterprise Warehouse

Natural Language Processing – Ontologies – Statistical Methods – Machine Learning – Chemistry – Regular Expressions – etc.

Transform unstructured or semi-structured data into insights to advance human health

NLP finds information however it is expressed

Different word, same meaning
• cyclosporine
• ciclosporin
• Neoral
• Sandimmune

Different expression, same meaning
• Non-smoker
• Does not smoke
• Does not drink or smoke
• Denies tobacco use

Different grammar, same meaning
• 5mg/kg of cyclosporine per day
• 5mg/kg per day of cyclosporine
• cyclosporine 5mg/kg per day

Same word, different context
• Diagnosed with diabetes
• Family history of diabetes
• No family history of diabetes

NLP

Blend of powerful rule- and machine learning-based methods to transform unstructured data into structured

Linguistic Processing
• Precise linguistic relationships, sentence co-occurrence
• Precise negation e.g. “pressure” not “blood pressure”
• Multiple languages

Terminologies/Ontologies
• Search for concepts and their synonyms with spelling and optical character recognition (OCR) correction
• Out of the box or custom ontologies

Quantitative Data
• Quantitative & pattern-based data extraction at scale e.g. numerical data, dates, gene mutations
• Range search

Chemistry
• Identify and extract chemicals in context based on substructure and chemical similarity

Results Normalization
• Ontology and rule-based normalization of results
• Essential for organizing structured output
• Enables indirect relations, filtering/faceting results, etc.

Table & Region Processing
• Unique capability to capture knowledge from tables embedded in documents
• Fielded search within regions of a document

Data normalization: always treat the same concept in the same way – the key to structured results

Concept         Text                                           Normalized Value
Diseases        breast cancer                                  Breast Neoplasm
                carcinoma of the breast
Genes           Raf-1                                          RAF1
                Raf I
Dates           27th Feb 2014                                  20140227
                2014/02/27
Measurements    0.2g                                           200 mg
                Two hundred milligrams
Mutations       Val 158 Met                                    V158M
                Val by Met at codon 158
Behaviours      denies alcohol and tobacco use                 Non-smoker
                is not a cigarette smoker
Relationships   ...nimesulide, a selective COX2 inhibitor, …   Entrez ID: 5743
                                                               inhibits

Data normalization

Overview
• Convert text into a standard format
• Is a fundamental component in transforming text into structured data and driving actionable insights

Key benefits
Find concepts however they are expressed
Join results to discover new indirect relationships
Cluster or facet results by concept or quantity
Compare measurements with different units e.g. kg vs. lbs
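
As a rough illustration of what normalization means in practice, the sketch below maps a few of the surface forms from the table above to a single value each. It is not the Linguamatics implementation: the function names, the tiny synonym table and the supported date and unit formats are all assumptions made for this example.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical synonym table; a real system would draw on a curated ontology.
SYNONYMS = {
    "breast cancer": "Breast Neoplasm",
    "carcinoma of the breast": "Breast Neoplasm",
    "raf-1": "RAF1",
    "raf i": "RAF1",
}

# Conversion factors to milligrams (assumed set of units for this sketch).
UNIT_TO_MG = {"mg": Decimal("1"), "g": Decimal("1000"), "kg": Decimal("1000000")}


def normalize_concept(text):
    """Map a surface form to its normalized value, or None if unknown."""
    return SYNONYMS.get(text.strip().lower())


def normalize_date(text):
    """Normalize two example date layouts to YYYYMMDD, else return None."""
    # Drop ordinal suffixes such as the "th" in "27th".
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in ("%d %b %Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return None


def normalize_mass_mg(text):
    """Convert a numeric mass such as '0.2g' to milligrams, else return None."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(mg|g|kg)\s*", text)
    if not m:
        return None
    return Decimal(m.group(1)) * UNIT_TO_MG[m.group(2)]
```

Spelled-out quantities such as "Two hundred milligrams" are deliberately out of scope here; handling them would need a number-word parser on top of this.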

Use of NLP with MedDRA

Errors in Regulatory Submissions
Social Media
Adverse Events in Drug Labels

Commonly reported conditions included Seasonal allergies, Back pain, and Hypercholesterolaemia. The majority of AEs were considered treatment related in all cohorts and the relationship between treatment groups and between cohorts was similar to that observed for all-causality AEs. Permanent discontinuations were reported at higher rates in the Rx groups than in the placebo groups in the 3 pooled cohorts. The majority of AEs leading to permanent discontinuation were considered treatment related in both treatment groups in all cohorts. The single most frequently reported event was headache, which was reported in approximately 40% of Rx subjects and 20% of placebo subjects in the 2000 Pooled cohort. Other AEs reported across all cohorts at rates greater in Rx subjects than placebo subjects included Seasonal allergies and Insomnia (2000 8.4% vs 5.4%, 2003 0.9% vs 0.8%, 2006 14.0% vs 10.1%; Rx vs placebo respectively).

Sample table and text highlighting, to show inconsistencies between data. The highlight colour lets the reviewer rapidly assess where the errors are and what type they are, and then correct them appropriately.

Table: Most Frequently Reported Medical Conditions (≥5% in Any Treatment Group)

Study                               2000 Pooled Studies         2003 Pooled Study
                                    Rx            Pbo           Rx            Pbo
Total Number Subjects               N=997         N=927         N=1021        N=956

Number (%) of Subjects
Cardiac disorders                   70 (7.0)      32 (3..5)     108 (10.6)    101 (10.6)
Angina pectoris                     4 (0.4)       5 (0.5)       74 (7.2)      71 (7.4)
Dyspepsia                           174 (17.5)    120 (12.9)    3 (0.3)       2 (0.2)
GERD                                83 (8.3)      52 (5.6)      30 (2.9)      27 (2.8%)
Metabolic / nutritional disorders   253 (25.4)    165 (17.8)    194 (19.0)    212 (22.2)
Dyslipedaemia                       1 (0.1)       0 (0)         15 (1.5)      19 (2.0)
Hypercholesterolaemia               65 (6.5)      50 (5.4)      88 (8.6)      103 (10.8)
Hyperlipidaemia                     147 (14.7)    79 (8.5)      56 (5.5)      66 (6.9)
Osteoarthritis                      102 (10.2)    57 (6.6)      12 (1.2)      11 (1.2)
Nervous system disorders            628 (63.0)    409 (44.1)    28 (2.7)      19 (2.0)
Headache                            413 (41.4)    280 (30.2)    9 (0.9)       7 (0.7)
Psychiatric disorders               137 (13.7)    81 (8.7)      14 (1.4)      15 (1.6)
Insomnia                            84 (8.4)      47 (5.1)      9 (0.9)       8 (0.8)

Key

Incorrect formatting: doubled period, incorrect number of decimal places, addition of percent sign
Incorrect calculation: number of patients divided by total number does not agree with percent term
Incorrect threshold: presence of row does not agree with table title
Text-Table inconsistency: numbers in the table do not agree with numbers in the accompanying text
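
The first three error types in the key lend themselves to simple automated checks. The following is a minimal sketch, assuming each cell has the form "count (percent)" with one decimal place; the function names, the 0.1 tolerance and the 5% default are choices made for this example rather than anything stated in the slides.

```python
import re

# Expected cell layout, e.g. "70 (7.0)": a count, then a percentage to one decimal place.
CELL = re.compile(r"^(\d+)\s*\((\d+\.\d)\)$")


def check_cell(cell, total, tolerance=0.1):
    """Return a list of problems found in one 'count (percent)' table cell."""
    m = CELL.match(cell.strip())
    if not m:
        # Covers doubled periods, missing decimals and stray percent signs.
        return ["incorrect formatting"]
    count, pct = int(m.group(1)), float(m.group(2))
    expected = 100.0 * count / total
    if abs(expected - pct) > tolerance:
        return ["incorrect calculation"]
    return []


def meets_threshold(percentages, threshold=5.0):
    """True if at least one treatment group reaches the table's stated threshold."""
    return any(p >= threshold for p in percentages)
```

Applied to the sample table, `check_cell("32 (3..5)", 927)` and `check_cell("27 (2.8%)", 956)` are flagged for formatting, `check_cell("57 (6.6)", 927)` is flagged for calculation, and the Dyslipedaemia row fails `meets_threshold`. Text-table inconsistencies would additionally require extracting the numbers from the accompanying prose.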

Use Case: Automated Blinded Data Review for Regulatory Submissions

Before unblinding a clinical trial, data are checked for errors and inconsistencies

Among the many checks performed, MedDRA terms for Adverse Events Reports are verified, including:

− Is the Preferred Term valid in any version of MedDRA?
    Reporter may have inserted the Investigator Entry in the wrong field, or used an LLT

− Are multiple MedDRA versions in use in the same trial?
    Reporter Error or Error when generating the blinded data

− Does the specified version of MedDRA agree with the Preferred Terms being reported?
    Reporter may have used a more precise MedDRA term from a more recent version of MedDRA

− Does the Preferred Term agree with the declared System Organ Class?

Automation of this process is in use at large pharma
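
The checks listed above can be sketched against a toy dictionary. Everything in this example is hypothetical: the version labels "vA" and "vB", the record layout and the function names are invented for illustration, and the two term/class pairs are taken from the sample table earlier in the deck rather than from a MedDRA release.

```python
# Toy stand-in for MedDRA: version label -> {Preferred Term: System Organ Class}.
# Version labels and contents are hypothetical, NOT real MedDRA data.
TOY_MEDDRA = {
    "vA": {"Headache": "Nervous system disorders"},
    "vB": {
        "Headache": "Nervous system disorders",
        "Insomnia": "Psychiatric disorders",
    },
}


def check_record(pt, soc, version, dictionary=TOY_MEDDRA):
    """Return the issues found for one reported (PT, SOC, version) triple."""
    # Check 1: is the PT valid in any version at all?
    if not any(pt in terms for terms in dictionary.values()):
        return ["PT not valid in any version"]
    terms = dictionary.get(version, {})
    # Check 2: does the declared version actually contain this PT?
    if pt not in terms:
        return ["PT not in declared version"]
    # Check 3: does the PT agree with the declared SOC?
    if terms[pt] != soc:
        return ["PT does not agree with declared SOC"]
    return []


def versions_in_use(records):
    """Collect the distinct dictionary versions declared across a trial's records."""
    return {r["version"] for r in records}
```

A trial would be flagged for mixed versions when `versions_in_use` returns more than one label.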

Use Case: Social Media Analysis

Social media: plenty of AEs mentioned
Language informal
Linguistic patterns can find mentions of AEs without using a dictionary
Using MedDRA LLTs finds only one of the following 4 examples

Use Case: Extraction of Adverse Events using MedDRA

Extraction of adverse events, MedDRA terms and frequency of occurrence, clustered by medicinal product
Structured results can be used to populate a database, e.g. IDMP
− Different customers have different MedDRA requirements, e.g. PT vs LLT, which is easy to accommodate

Results table (background) and highlighted source document (foreground) are shown

Extraction of AEs from FDA Drug Labels

FDA drug labels are not structured
Want to compare AEs found in Real World Evidence with known AEs
Find AEs from within text, and within tables
Pull out values in order to filter to only those AEs where the rate is greater than placebo

Use of NLP terminology features in extracting AEs

Increase recall with:
− Morphological variants
− Spelling correction
− Matching across conjunctions
− Mapping multiple concepts to MedDRA PT

Increase precision with:
− Excluding inappropriate contexts
− Use of document sections to exclude inappropriate terms

Increase recall: morphological variants

MedDRA PT “Congenital anomaly”

[Figure: search hits for this PT; * marks additional hits found when using morphological variants]
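
To make the idea concrete, here is a deliberately crude sketch of matching a term against text after stripping plural endings. The suffix rules and function names are assumptions for this example; a production system would rely on a proper morphological analyser rather than two string rules.

```python
import re


def crude_stem(word):
    """Very rough plural stripping; stands in for real morphological analysis."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # anomalies -> anomaly
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # infants -> infant
    return w


def stem_phrase(text):
    """Tokenize on letters only and stem every token."""
    return [crude_stem(t) for t in re.findall(r"[A-Za-z]+", text)]


def contains_term(text, term):
    """True if the stemmed term occurs as a contiguous token run in the stemmed text."""
    hay, needle = stem_phrase(text), stem_phrase(term)
    n = len(needle)
    return any(hay[i:i + n] == needle for i in range(len(hay) - n + 1))
```

With this, the plural form "congenital anomalies" is found when searching for the PT “Congenital anomaly”, which an exact string match would miss.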

Increase recall: spelling correction

MedDRA PT “Hypersensitivity”
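
One simple way to approximate spelling correction is fuzzy string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library; this is a substitute technique chosen for illustration, not the method used in the product, and the function name, the 0.9 cutoff and the misspelled example sentence are assumptions.

```python
import difflib
import re


def fuzzy_hits(text, term, cutoff=0.9):
    """Return tokens in the text that closely match a single-word term."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    # n must be positive, so fall back to 1 when the text has no tokens.
    return difflib.get_close_matches(
        term.lower(), tokens, n=len(tokens) or 1, cutoff=cutoff
    )


# A misspelling (missing "i") is still matched to the PT.
print(fuzzy_hits("Severe hypersensitvity reactions were reported", "Hypersensitivity"))
```

A high cutoff keeps precision reasonable; lowering it trades precision for recall and would need checking against real data.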

Increase recall: MedDRA matching across conjunctions

MedDRA PT “Hepatic neoplasm” OR “Thyroid neoplasm”
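
Matching across a conjunction amounts to distributing the shared head word over each modifier. The following is a minimal sketch for the two-modifier case only; the regular expression and function name are assumptions for this example, and the example phrase is made up to fit the two PTs on the slide.

```python
import re

# Matches "<modifier> and/or <modifier> <head>", e.g. "hepatic and thyroid neoplasms".
COORD = re.compile(r"\b(\w+)\s+(?:and|or)\s+(\w+)\s+(\w+)\b", re.IGNORECASE)


def expand_conjunction(phrase):
    """Expand 'A and B head' into ['A head', 'B head']; otherwise return the phrase unchanged."""
    m = COORD.search(phrase)
    if not m:
        return [phrase]
    first, second, head = m.groups()
    return [f"{first} {head}", f"{second} {head}"]
```

The pattern over-generates on ordinary coordinated text, so the expanded candidates would still need to be looked up in the terminology, with morphological variants allowed, before being accepted as hits.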

Increase recall: mapping multiple concepts to a MedDRA PT

MedDRA PT “Blood creatinine increased”

• Blood creatinine increased
• Creatinine blood increased
• Creatinine high
• Creatinine increased
• Creatinine serum increased
• Increased serum creatinine
• Plasma creatinine increased
• Raised serum creatinine
• Serum creatinine increased

has low recall.

Combining MedDRA PT “Blood creatinine”

• Blood creatinine
• Creatinine
• Plasma creatinine
• Serum creatinine

with Relation “Increase”

• Increase
• Elevate
• Raise
• ...

in a linguistic pattern allowing flexibility in expression... gives significant additional recall (*).

[Figure: search hits, with the additional hits marked *]
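
The combination of a concept with a relation can be approximated with a regular expression that allows either order and a few intervening words. This is only a sketch of the idea: a regex is a stand-in for a real linguistic pattern, and the three-word gap and all names below are assumptions for this example.

```python
import re

# Concept synonyms, taken from the "Blood creatinine" list above.
CONCEPT = r"(?:blood|plasma|serum)?\s*creatinine"
# Word stems for the "Increase" relation (increase, elevated, raised, ...).
RELATION = r"(?:increas|elevat|rais)\w*"
# Allow up to three intervening words between concept and relation.
GAP = r"(?:\W+\w+){0,3}?\W+"

PATTERN = re.compile(
    rf"\b(?:{CONCEPT}{GAP}{RELATION}|{RELATION}{GAP}{CONCEPT})\b",
    re.IGNORECASE,
)


def mentions_creatinine_increase(text):
    """True if a creatinine concept and an increase relation occur close together."""
    return PATTERN.search(text) is not None
```

This finds "Elevation of blood creatinine" and "creatinine levels were raised", neither of which appears in the synonym list for the PT. It does not handle negation, which is a separate precision step.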

Increase precision: exclusion of hits in inappropriate contexts when searching for adverse events

Thousands of examples of MedDRA concepts that are not AEs. Linguistic patterns can filter out inappropriate contexts.
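
A very small version of such a context filter is sketched below. The cue phrases, the 40-character window and the function name are assumptions made for this example; real linguistic patterns are considerably richer than a list of preceding cue words.

```python
import re

# Hypothetical cue phrases that, when they precede a term in the same clause,
# suggest the term is not being reported as an adverse event.
EXCLUSION = re.compile(
    r"\b(?:no|not|without|history of|indicated for|treatment of)\b[^.;]{0,40}$",
    re.IGNORECASE,
)


def passes_context_filter(sentence, term):
    """True if the term occurs and is not preceded by an exclusion cue."""
    idx = sentence.lower().find(term.lower())
    if idx == -1:
        return False
    # Only look at the text leading up to the term.
    return EXCLUSION.search(sentence[:idx]) is None
```

With these cues, "Diagnosed with diabetes" passes the filter, while "Family history of diabetes" and "No family history of diabetes" are excluded.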

Increase precision: using document regions - exclusion of PTs that occur in Indications when searching for AEs

AE hits can be removed when the same PT is found in the Indications section
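
Once hits carry the name of the document region they came from, the exclusion is a simple set operation. In this sketch the hit format (a section name paired with a PT), the section labels and the example terms are all assumptions for illustration.

```python
def filter_indications(hits):
    """Drop hits whose PT also appears in the Indications section.

    hits: list of (section, preferred_term) tuples.
    """
    indicated = {pt for section, pt in hits if section == "Indications"}
    return [
        (section, pt)
        for section, pt in hits
        if section != "Indications" and pt not in indicated
    ]


# Hypothetical hits from one drug label.
hits = [
    ("Indications", "Hypertension"),
    ("Adverse Reactions", "Hypertension"),
    ("Adverse Reactions", "Headache"),
]
print(filter_indications(hits))
```

Only the Headache hit survives, since Hypertension is what the drug in this made-up label treats.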

How NLP could feed into MedDRA development: improved coverage of terminology

Terms appearing with MedDRA terms in the same list
Explicit constructions such as “AEs such as”, or from tables
Look for terms in appropriate contexts e.g. “made me ?”

Noun phrases occurring in a list after “adverse events such as”, and which are not already in MedDRA
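
A rough sketch of this kind of term discovery: find the list that follows "adverse events such as", split it into items, and keep the items that are not already known. The regular expressions, the function name and the example sentence are assumptions for this example, and splitting on commas and "and"/"or" is only a crude substitute for proper noun phrase detection.

```python
import re

# Captures the list that follows the trigger phrase, up to the end of the sentence.
LIST_INTRO = re.compile(r"adverse events such as ([^.]+)", re.IGNORECASE)
# Splits on commas (with an optional following and/or) and on bare and/or.
ITEM_SPLIT = re.compile(r",\s*(?:and\s+|or\s+)?|\s+and\s+|\s+or\s+")


def candidate_terms(text, known_terms):
    """Return list items after the trigger phrase that are not in known_terms."""
    known = {t.lower() for t in known_terms}
    found = []
    for m in LIST_INTRO.finditer(text):
        for item in ITEM_SPLIT.split(m.group(1)):
            item = item.strip().lower()
            if item and item not in known and item not in found:
                found.append(item)
    return found


# Made-up sentence; "brain fog" stands in for a term missing from the dictionary.
text = "Patients reported adverse events such as headache, insomnia and brain fog."
print(candidate_terms(text, ["Headache", "Insomnia"]))
```

Candidates found this way would still need expert review before being proposed as new terms.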

Noun phrases occurring in the same list as another MedDRA term

Summary

NLP is required to rule out inappropriate contexts, improving precision
NLP techniques, e.g. morphological variants and OCR correction, improve recall
String-based synonym matching cannot cope with all the variation found in real text, e.g. “Elevation of blood creatinine”; here linguistic patterns are required
Region and table processing are often required to get the right context
