Introduction to MedIEQ
Quality Labelling of Medical Web content using Multilingual Information Extraction
http://zeus.iit.demokritos.gr/medieq

Martin Labský
labsky@vse.cz
Knowledge Engineering Group (KEG)
University of Economics Prague (UEP)

1.2.2006 WP6 – Information Extraction


Purpose of MedIEQ

• Medical web sites are increasingly popular
• Content strongly affects users’ decisions
• Therefore, quality labeling is very important
• Agencies invest large effort into labeling websites manually
• We develop tools to minimize their effort
• Tools will be multi-lingual, will support different and evolving labeling criteria


Agenda

• Partners
• Description of relevant work packages [3]
  – Web content collection, Information Extraction, Lexical and semantic resources
  – Goals, tasks, partners
  – Existing tools (to be extended)
  – New tools (to be developed)
  – Existing resources (to be made accessible)
• Milestones & deliverables
• References
• Questions


Partners

• Agencies
  – WMA: Web Mèdica Acreditada (Es)
    • assigns a quality label that is shown on medical websites
    • websites ask for the label, are suggested changes, then get it
  – AQuMed: Agency for Quality Labeling in medicine (De)
    • maintains a web directory organized by topics
    • only good-quality websites are present
• Developers
  – NCSR Demokritos and I-Sieve (spin-off) (Gr)
  – UEP: University of Economics Prague (Cz)
  – UNED: National University of Distance Education (Es)
  – HUT: Helsinki University of Technology (Fi)


Web Content Collection (WP5)


Website monitoring

• Regular visits to labeled website
• Checking pages
  – for relevant changes
  – which changes are relevant?
    • manual rules, machine learning...
  – alert agency when significant changes occur
  – or, increase the website’s (web page’s) priority in a list of to-be-checked resources
  – show what has changed, suggest solution
• Needed by WMA, AQuMed
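
The monitoring loop above could be prototyped roughly as follows. This is only a sketch under assumptions: page text is compared between two visits, and a change counts as "relevant" when it touches a hand-written watch list. The names `fingerprint`, `changed_lines`, `is_relevant` and the `WATCHED_TERMS` list are hypothetical and not part of any MedIEQ tool.

```python
import difflib
import hashlib

# Hypothetical manual rule: terms whose change should alert the agency.
WATCHED_TERMS = ("sponsor", "privacy", "contact")


def fingerprint(text: str) -> str:
    """Hash of the whitespace-normalised page text (cheap 'has it changed?' test)."""
    normalised = " ".join(text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def changed_lines(old: str, new: str) -> list[str]:
    """Lines removed ('- ') or added ('+ ') between two snapshots of a page."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def is_relevant(diff_lines: list[str]) -> bool:
    """Manual-rule stand-in: a change is relevant if it mentions a watched term."""
    return any(term in line.lower() for line in diff_lines for term in WATCHED_TERMS)


old_page = "Contact: info@example.org\nSponsor: Acme Pharma"
new_page = "Contact: info@example.org\nSponsor: Other Corp"

if fingerprint(old_page) != fingerprint(new_page):
    diff = changed_lines(old_page, new_page)
    if is_relevant(diff):
        print("alert agency:", diff)
```

A machine-learned relevance model, as the slide suggests, could replace `is_relevant` without touching the rest.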


Web focused crawling

• Find new medical websites
• Use multiple existing search engines
  – specify lists of keywords / keyphrases
  – give sample “similar” documents
  – use Google/Yahoo API and filter their results
• NCSR already has a focused crawler
  – we should contribute to its development
• Needed by WMA
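
The "filter their results" step might look like the sketch below. It assumes search hits have already been fetched somehow (no real search-engine API is called here) and keeps those whose snippet mentions enough of the keyphrases. `SearchHit`, `keyphrase_score`, `filter_hits` and the 0.5 threshold are illustrative assumptions, not the NCSR crawler's interface.

```python
from dataclasses import dataclass


@dataclass
class SearchHit:
    """Hypothetical container for one search-engine result."""
    url: str
    snippet: str


def keyphrase_score(text: str, keyphrases: list[str]) -> float:
    """Fraction of keyphrases that occur in the text (case-insensitive)."""
    if not keyphrases:
        return 0.0
    lowered = text.lower()
    return sum(1 for phrase in keyphrases if phrase.lower() in lowered) / len(keyphrases)


def filter_hits(hits: list[SearchHit], keyphrases: list[str],
                threshold: float = 0.5) -> list[SearchHit]:
    """Keep hits whose snippet covers at least `threshold` of the keyphrases."""
    return [hit for hit in hits if keyphrase_score(hit.snippet, keyphrases) >= threshold]


hits = [
    SearchHit("http://example.org/diabetes", "Diabetes treatment and insulin therapy"),
    SearchHit("http://example.org/cars", "Used cars for sale"),
]
medical = filter_hits(hits, ["diabetes", "treatment"])
print([hit.url for hit in medical])
```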


Website spidering

• Walk pages of a single website
• Classify each page
  – in order to choose relevant docs for quality labeling
  – e.g. contact page, page containing treatment description, page with sponsors
  – use machine learning, e.g. based on a bag-of-words (unigram, bigram) document representation
• Spidering strategy
  – which documents belong together (e.g. page 1/7)
  – which links to follow next
• NCSR has a spider
  – uses classifiers from Weka for doc classification
  – we should contribute
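
To make the bag-of-words (unigram, bigram) representation concrete, here is a minimal sketch. It is not the Weka-based classifier the NCSR spider uses: as a stand-in it sums the features of each class into a centroid and picks the class with the highest cosine similarity. All function names and the tiny training set are hypothetical.

```python
import math
import re
from collections import Counter


def ngram_features(text: str) -> Counter:
    """Bag of unigrams and bigrams over lower-cased word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train_centroids(labeled_pages: list[tuple[str, str]]) -> dict[str, Counter]:
    """Sum the n-gram counts of all training pages per page type."""
    centroids: dict[str, Counter] = {}
    for label, text in labeled_pages:
        centroids.setdefault(label, Counter()).update(ngram_features(text))
    return centroids


def classify(text: str, centroids: dict[str, Counter]) -> str:
    """Return the page type whose centroid is most similar to the page."""
    features = ngram_features(text)
    return max(centroids, key=lambda label: cosine(features, centroids[label]))


# Illustrative training data only.
training = [
    ("contact", "Contact us by phone or email at our address"),
    ("treatment", "This treatment reduces symptoms of the disease"),
    ("sponsors", "Our sponsors and funding partners"),
]
centroids = train_centroids(training)
print(classify("Email or phone our office", centroids))
```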


Information Extraction (WP6)


IE introduction

• Documents to extract from
  – pages retrieved & classified by spider
    • from known websites
    • from crawler
  – monitored labeled pages that have changed
• Information to be extracted
  – derived from agencies’ labeling criteria
  – e.g. contact information of responsible persons, sponsor names, privacy warning texts...
• Questions
  – how much human intervention needed?
  – complexity of label sets to be supported?
  – methodology of porting to a new language?


Example extracted information I.

• Transparency and honesty
  – site provider (company name, contact)
  – site purpose, type of target audience
  – funding (grants, sponsors)
• Authority
  – source citation for information provided, its type and date
  – names and credentials of all information providers
• Privacy and data protection
  – privacy policy description
• Timeliness of information
  – dates of publication/modification
• Accountability
  – names (and roles) of people responsible for presented information
  – editorial policy description


Example extracted information II.

• Content
  – medical terms, e.g. disease and drug names
  – statements recommending a certain product/method
  – advertisements
  – disallowed combinations (e.g. advertisement for X adjacent to an article related to X)
• Formal
  – mandatory statements (e.g. importance of physical examination, privacy warnings when posting data into chats)


Sources of extraction knowledge

• Training data
  – scarcity will be a problem for most extracted attributes
  – different types: labeled documents, sample extracted data, data previously extracted from the same website, domain dictionaries
• Extraction patterns
  – induced (semi)automatically from scarce training data
  – or even authored manually
• Background domain knowledge
  – relations between extracted attributes, cardinalities ...
  – e.g. typically just one company is the web site’s provider, but there are often multiple sponsors
• Web site structure
  – exploit common formatting of a group of documents within a website
  – exploit common formatting used for a particular type of extracted data across different websites
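
The cardinality example (one provider, possibly many sponsors) can be written down as a small check over extracted attribute/value pairs. This is a sketch of the idea only; the `CARDINALITIES` table and `cardinality_violations` are assumed names and do not reflect how the Ex system encodes its extraction ontologies.

```python
from collections import Counter
from typing import Optional

# Hypothetical background knowledge: attribute -> (min, max); None = unbounded.
CARDINALITIES: dict[str, tuple[int, Optional[int]]] = {
    "provider": (1, 1),     # typically just one company provides the site
    "sponsor": (0, None),   # there are often multiple sponsors
}


def cardinality_violations(extracted: list[tuple[str, str]],
                           constraints: dict = CARDINALITIES) -> list[str]:
    """Report attributes whose number of extracted values breaks a constraint."""
    counts = Counter(attribute for attribute, _value in extracted)
    problems = []
    for attribute, (low, high) in constraints.items():
        found = counts[attribute]
        if found < low or (high is not None and found > high):
            upper = "*" if high is None else high
            problems.append(f"{attribute}: found {found}, expected {low}..{upper}")
    return problems


candidates = [
    ("provider", "Acme Health Ltd."),
    ("provider", "Other Corp"),
    ("sponsor", "Foundation A"),
    ("sponsor", "Foundation B"),
]
print(cardinality_violations(candidates))
```

A violation like the one reported here could be used to discard the weaker of two competing provider candidates or to flag the page for manual review.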


IE tools

• Ex (UEP)
  – IE system under development using “extraction ontologies”
  – extracts instances from semi-structured documents
  – utilizes training data + manually defined patterns, includes spider
  – old version based on HMMs – http://eso.vse.cz/~labsky/client/
• Named entity recognizer (UNED)
  – extracts dates, person/institution names
• 3rd party IE tools
  – wrapper management systems
  – e.g. LP2-based IE tool or annotation editor from Sheffield


Website assessment

• Check website’s technical correctness
  – SEO (findability in search engines with respect to some keyphrases)
  – accessibility (possibility of font enlargement, blind access, pages hidden deep in website structure, color schemes perceivable by anybody)
  – formal correctness (dead links, violations of HTML standards, failure to display well under at least the 3 most popular browsers)
• Check non-technical correctness
  – e.g. typos, “clear, easy-to-understand language”
  – more: check for black-listed phrases, claims, etc.
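
As an illustration of one formal-correctness check, dead links, the sketch below collects the `<a href>` targets of a page with Python's standard `html.parser` and asks a caller-supplied function whether each one is reachable. The `is_alive` callback is an assumption that keeps the example offline; a real check would issue HTTP requests. `LinkCollector` and `dead_links` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def dead_links(html: str, base_url: str, is_alive: Callable[[str], bool]) -> list[str]:
    """Return the links on the page for which `is_alive` reports failure."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [url for url in collector.links if not is_alive(url)]


page = '<a href="contact.html">Contact</a> <a href="/old">Old</a> <a name="top">Top</a>'
reachable = {"http://example.org/site/contact.html"}  # stand-in for real HTTP checks
print(dead_links(page, "http://example.org/site/index.html", lambda url: url in reachable))
```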


Website assessment tools

• Relaxed (UEP)
  – HTML validator based on Relax NG and Schematron patterns
  – can perform formal checks of website content beyond DTDs
  – http://relaxed.sourceforge.net/
• SEO tool (UEP)
  – could Honza’s SEO tool be extended?


IE Deliverables

• Duration: M1-M28
• Deliverables
  – D8: Methodology & architecture of IE (M9)
  – D9.1: First version of IE toolkit (M15)
  – D9.2: Final version of IE toolkit (M24)


Lexical and semantic resources (WP7)


Lexical and semantic resources

• Sp, De, En, Cz, Gr, Fi, Catalan (7!)
• We are in charge of Cz, De (!)
• Semantic
  – thesauri, ontologies (MeSH)
  – lists of cures, vaccine names, lists of medical companies, illnesses, diagnoses
  – generic ontologies and translation dictionaries (e.g. EuroWordNet)
• Lexical
  – lemmatizers/morphology analyzers, part-of-speech taggers, chunkers, syntactic parsers
  – medical document collections (for classification)


References

• MedIEQ:
  – http://www.iit.demokritos.gr/~vangelis/MedIEQ/
  – http://zeus.iit.demokritos.gr/medieq
• Related projects:
  – WRAPIN http://debussy.hon.ch/cgi-bin/Wrapin/ClientWrapin.pl
  – Quatro http://www.quatro-project.org/DC2005.htm
  – CROSSMARC http://www.iit.demokritos.gr/skel/crossmarc/
• Relaxed:
  – http://badame.vse.cz/validator/
• Ex:
  – http://eso.vse.cz/~labsky/doc/ex.pdf
• Ellogon:
  – http://www.ellogon.org/


Questions

• ?