biomedical articles per year

Page 1: Biomedical articles per year

BioASQ

A challenge on large-scale biomedical semantic indexing and question answering

George Paliouras

National Center for Scientific Research ‘Demokritos’

Intelligent Information Management Targeted Competition Framework ICT-2011.4.4(d)

George Paliouras, May 2014 www.bioasq.org

Page 2: Biomedical articles per year

Biomedical articles per year

Page 3: Biomedical articles per year

Questions of biomedical experts

Yes/No question: “Are there any DNMT3 proteins present in plants?”

Exact Answer: “Yes”

Ideal Answer: “Yes. The plant DOMAINS REARRANGED METHYLTRANSFERASE2 (DRM2) is a homolog of the mammalian de novo methyltransferase DNMT3. DRM2 contains a novel arrangement of the motifs required for DNA methyltransferase catalytic activity.”

Page 4: Biomedical articles per year

Questions of biomedical experts

Factoid question: “What is the methyl donor of DNA (cytosine-5)-methyltransferases?”

Exact Answer: “S-adenosyl-L-methionine”

Ideal Answer: “S-adenosyl-L-methionine (AdoMet, SAM) is the methyl donor of DNA (cytosine-5)-methyltransferases. DNA (cytosine-5)-methyltransferases catalyze the transfer of a methyl group from S-adenosyl-L-methionine to the C-5 position of cytosine residues in DNA.”

Page 5: Biomedical articles per year

Questions of biomedical experts (III)

List question: “Which species may be used for the biotechnological production of itaconic acid?”

Exact Answer: “Aspergillus terreus”, “Aspergillus niger”, “Ustilago maydis”

Ideal Answer: “In 1955, the production of itaconic acid was firstly described for Ustilago maydis. Some Aspergillus species, like A. itaconicus and A. terreus, show the ability to synthesize this organic acid and A. terreus can secrete significant amounts to the media. Itaconic acid is mainly supplied by biotechnological processes with the fungus Aspergillus terreus. Cloning of the cadA gene into the citric acid producing fungus A. niger showed that it is possible to produce itaconic acid also in a different host organism.”

Page 6: Biomedical articles per year

Questions of biomedical experts (III)

Summary question: “How do histone methyltransferases cause histone modification?”

Exact Answer: -

Ideal Answer: “Histone methyltransferases (HMTs) are responsible for the site-specific addition of covalent modifications on the histone tails, which serve as markers for the recruitment of chromatin organization complexes. There are two major types of HMTs: histone-lysine N-Methyltransferases and histone-arginine N-methyltransferases. The former methylate specific lysine (K) residues such as 4, 9, 27, 36, and 79 on histone H3 and residue 20 on histone H4. The latter methylate arginine (R) residues such as 2, 8, 17, and 26 on histone H3 and residue 3 on histone H4. Depending on what residue is modified and the degree of methylation (mono-, di- and tri-methylation), lysine methylation of histones is linked to either transcriptionally active or silent chromatin.”
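The four question types on the preceding slides differ mainly in the shape of the exact answer: yes/no, a single entity, a list of entities, or none at all for summary questions. A minimal Python sketch of that structure follows. The class, field and function names are illustrative assumptions, not the BioASQ data format, and the ideal answers are shortened to their first sentence.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


# Hypothetical container: field names are illustrative, not the BioASQ schema.
@dataclass
class Question:
    qtype: str  # "yesno", "factoid", "list" or "summary"
    body: str
    exact_answer: Optional[Union[str, List[str]]]  # None for summary questions
    ideal_answer: str


def exact_answer_shape(q: Question) -> str:
    """Describe what kind of exact answer the question type expects."""
    shapes = {
        "yesno": "yes or no",
        "factoid": "a single entity",
        "list": "a list of entities",
        "summary": "none (ideal answer only)",
    }
    if q.qtype not in shapes:
        raise ValueError(f"unknown question type: {q.qtype}")
    return shapes[q.qtype]


yesno = Question(
    qtype="yesno",
    body="Are there any DNMT3 proteins present in plants?",
    exact_answer="Yes",
    ideal_answer="Yes. The plant DOMAINS REARRANGED METHYLTRANSFERASE2 (DRM2) "
                 "is a homolog of the mammalian de novo methyltransferase DNMT3.",
)
list_q = Question(
    qtype="list",
    body="Which species may be used for the biotechnological production of itaconic acid?",
    exact_answer=["Aspergillus terreus", "Aspergillus niger", "Ustilago maydis"],
    ideal_answer="In 1955, the production of itaconic acid was firstly described "
                 "for Ustilago maydis.",
)

for q in (yesno, list_q):
    print(q.qtype, "->", exact_answer_shape(q))
```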

Page 7: Biomedical articles per year

Page 8: Biomedical articles per year

Finding relevant snippets

Page 9: Biomedical articles per year

Not only texts: ontologies, linked data, …

Page 10: Biomedical articles per year

Page 11: Biomedical articles per year

Information from structured data

List question: “Which forms of cancer is the Tpl2 gene associated with?”

Related concepts:
http://www.disease-ontology.org/api/metadata/DOID:162 (cancer)
http://www.uniprot.org/uniprot/M3K8_RAT (TPL2 synonym)

Related RDF triple:
Subject: http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3003 (lung cancer)
Predicate: http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene
Object: http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/TPL2
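The related triple on this slide is a plain subject/predicate/object statement whose three terms are all IRIs. As a small sketch, it can be held in a tuple and written out as one N-Triples line. The `Triple` type and `to_ntriples` helper are illustrative names, not part of any BioASQ tool; the IRIs are the ones shown on the slide.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose three terms are all IRIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


BASE = "http://www4.wiwiss.fu-berlin.de/diseasome/resource"

triple = Triple(
    subject=f"{BASE}/diseases/3003",               # lung cancer
    predicate=f"{BASE}/diseasome/associatedGene",
    obj=f"{BASE}/genes/TPL2",
)

print(to_ntriples(triple))
```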

Page 12: Biomedical articles per year

Page 13: Biomedical articles per year

BioASQ Vision

• Make sure this knowledge is used to the benefit of patients
• Need to make it accessible to biomedical experts
• Search is not effective enough
• Push research in automated answering of questions
• A challenge for such systems can achieve a multiplying effect

Page 14: Biomedical articles per year

What is BioASQ?

A challenge funded by the European Union (FP7).

Task a: Hierarchical text classification
• Organizers distribute new unclassified PubMed articles.
• Participants assign MeSH terms to the articles.
• Evaluation based on annotations of PubMed curators.

Task b: IR, QA, summarization, …
• Organizers distribute English biomedical questions.
• Participants provide: relevant articles, snippets, concepts, triples, “exact” answers, “ideal” answers.
• Evaluation: both automatic (GMAP, MRR, ROUGE etc.) and manual (by biomedical experts).

Page 15: Biomedical articles per year

The challenge

Task a

Task b

Page 16: Biomedical articles per year

Page 17: Biomedical articles per year

Behind the scenes

Page 18: Biomedical articles per year

BioASQ Platform

Page 19: Biomedical articles per year

Datasets

Task a      1st challenge   2nd challenge
Training    10,876,004      12,628,968
Test        83,490          71,950

Task b      1st challenge   2nd challenge
Training    29              310
Test        281             500

Task b data contain gold articles, snippets, concepts, triples, “exact” and “ideal” answers prepared by biomedical experts from around Europe.

Page 20: Biomedical articles per year

Data sources

They include both text and structured info.
► PubMed abstracts, PubMed Central articles, MeSH.
► Gene Ontology, UniProt, Jochem, Disease Ontology.

Page 21: Biomedical articles per year

Annotation: questions and queries

Page 22: Biomedical articles per year

Annotation: snippets

Page 23: Biomedical articles per year

Annotation: answers

Page 24: Biomedical articles per year

Assessment: relevance of material

Page 25: Biomedical articles per year

Assessment: information in answers

Page 26: Biomedical articles per year

BioASQ social network

Page 27: Biomedical articles per year

Oracle

Page 28: Biomedical articles per year

Oracle

Page 29: Biomedical articles per year

Two cycles

2013 Schedule
March 2013: Evaluation infrastructure & dry-run data
June 2013: Start of the challenge
August 2013: End of the challenge
September 2013: BioASQ workshop

2014 Schedule
February 2014: Start of Task 2A
March 2014: Start of Task 2B
May 2014: End of the challenge
September 2014: BioASQ workshop

The official challenge is over, but…
► Task a continues to run each week.
► An oracle for task b will be available soon.
► Oracles will remain available.
► Third cycle is being designed …

Page 30: Biomedical articles per year

Challenge participants so far

Page 31: Biomedical articles per year

Challenge participants in each cycle

Page 32: Biomedical articles per year

Evaluation measures

Task a: Hierarchical text classification
Flat measures for multi-label classification: Accuracy, MiF, MaF, EBF
Hierarchical measures: LCA-F (new), HF

Task b: IR, QA, summarization, …
Phase A: standard IR measures, mean precision, mean recall, mean F-measure, MAP (used for winners selection), G-MAP
Phase B:
‘Exact answers’ (based on type): accuracy (yes/no), strict/lenient accuracy, MRR (factoid), mean F-measure (list)
‘Ideal answers’: manual scores from the experts {Readability, Repetition, Information Precision and Recall}, plus ROUGE
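For Phase A, MAP and G-MAP both summarise per-question average precision (AP) over ranked result lists. MAP takes the arithmetic mean and G-MAP the geometric mean, which penalises questions with very low AP more heavily. The following is a sketch using the textbook definitions. The document identifiers and the smoothing constant `eps` are assumptions for illustration, not values taken from the challenge.

```python
import math
from typing import List, Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)


def mean_ap(aps: List[float]) -> float:
    """Arithmetic mean of per-question AP values (MAP)."""
    return sum(aps) / len(aps)


def geometric_mean_ap(aps: List[float], eps: float = 1e-5) -> float:
    """Geometric mean of per-question AP values (G-MAP).

    eps keeps a single zero AP from collapsing the product; its value is an assumption.
    """
    return math.exp(sum(math.log(ap + eps) for ap in aps) / len(aps))


# Two toy questions with hypothetical document ids.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),              # AP = (1/2) / 1
]
aps = [average_precision(ranked, relevant) for ranked, relevant in runs]
map_score = mean_ap(aps)
gmap_score = geometric_mean_ap(aps)
print(aps, map_score, gmap_score)
```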

Page 33: Biomedical articles per year

First year technology/results overview

• Task 1a
– Mainly SVMs and learning-to-rank.
– Mostly flat classification, ignoring class taxonomy.
– Mediocre results by hierarchical methods.
– One of the systems outperformed NLM’s system.
• Task 1b
– Phase A (retrieve relevant documents, concepts, snippets, triples): low performance (compared to baselines).
– Phase B (formulate ‘exact’ and ‘ideal’ answers): poor performance for ‘exact’ answers (except for yes/no questions); high performance for ‘ideal’ answers (paragraph-sized summaries), but starting with gold documents, snippets etc.
• Large scope for improvements, esp. in Task 1b.

Page 34: Biomedical articles per year

“Exact” answer results (batch 2/3)
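The factoid measures behind these results (strict accuracy, lenient accuracy, MRR) can be sketched as follows, assuming each system returns a ranked list of candidate answers per question. Strict accuracy counts a question only if the top candidate is correct, lenient accuracy if any candidate is correct, and MRR averages the reciprocal rank of the first correct candidate. The case-insensitive string matching and the example candidates are simplifying assumptions, not the challenge's matching rules.

```python
from typing import List, Sequence, Set, Tuple


def first_correct_rank(candidates: Sequence[str], gold: Set[str]) -> int:
    """1-based rank of the first candidate matching a gold answer, or 0 if none."""
    gold_norm = {g.strip().lower() for g in gold}
    for rank, cand in enumerate(candidates, start=1):
        if cand.strip().lower() in gold_norm:
            return rank
    return 0


def factoid_scores(
    runs: List[Tuple[Sequence[str], Set[str]]]
) -> Tuple[float, float, float]:
    """Return (strict accuracy, lenient accuracy, MRR) over all questions."""
    ranks = [first_correct_rank(cands, gold) for cands, gold in runs]
    n = len(ranks)
    strict = sum(r == 1 for r in ranks) / n
    lenient = sum(r > 0 for r in ranks) / n
    mrr = sum(1.0 / r for r in ranks if r > 0) / n
    return strict, lenient, mrr


# Gold synonyms taken from the factoid slide; the candidate lists are made up.
gold = {"S-adenosyl-L-methionine", "AdoMet", "SAM"}
runs = [
    (["S-adenosyl-L-methionine", "ATP"], gold),  # correct at rank 1
    (["ATP", "SAM"], gold),                      # correct at rank 2
    (["folate"], gold),                          # no correct candidate
]
strict, lenient, mrr = factoid_scores(runs)
print(strict, lenient, mrr)
```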

Page 35: Biomedical articles per year

“Ideal” answer results (batch 2/3)
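Besides the manual scores, ideal answers are compared automatically with ROUGE, which measures n-gram overlap between a system summary and a reference. The slides do not say which ROUGE variants were used, so the sketch below shows only the simplest one, unigram recall (ROUGE-1), with a naive tokenizer. It is an illustration of the idea, not the official scorer.

```python
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams also found in the candidate (clipped counts)."""
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())


reference = "DRM2 is a homolog of DNMT3"
candidate = "DRM2 is a plant homolog"
score = rouge1_recall(candidate, reference)
print(round(score, 3))
```

Here four of the six reference unigrams (drm2, is, a, homolog) appear in the candidate, so the recall is 4/6.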

Page 36: Biomedical articles per year

Results – task a – flat measures
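Two of the flat measures, micro-F (MiF) and macro-F (MaF), can be sketched for multi-label output as follows. Micro-F pools true positives, false positives and false negatives over all labels, so frequent labels dominate. Macro-F averages per-label F1, so rare labels weigh as much as frequent ones. The label names are illustrative, and averaging only over labels that occur in the gold or predicted sets is an assumption of this sketch.

```python
from typing import Dict, List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, float]:
    """Return (micro-F, macro-F) for per-document gold and predicted label sets."""
    counts: Dict[str, List[int]] = {}  # label -> [tp, fp, fn]
    for g, p in zip(gold, pred):
        for label in g | p:
            c = counts.setdefault(label, [0, 0, 0])
            if label in g and label in p:
                c[0] += 1
            elif label in p:
                c[1] += 1
            else:
                c[2] += 1
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)
    macro = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    return micro, macro


# Two toy articles with illustrative labels.
gold = [{"Humans", "DNA Methylation"}, {"Humans"}]
pred = [{"Humans"}, {"Humans", "Neoplasms"}]
micro, macro = micro_macro_f1(gold, pred)
print(micro, macro)
```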

Page 37: Biomedical articles per year

Results – task a – hierarchical
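Hierarchical measures give partial credit when a predicted class is close to the gold class in the taxonomy. The sketch below assumes HF denotes the ancestor-augmented hierarchical F-measure: both label sets are extended with all their ancestors before precision and recall are computed. The three-level toy hierarchy is made up for illustration, and LCA-F is not shown.

```python
from typing import Dict, List, Set


def with_ancestors(labels: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the labels together with every ancestor reachable via `parents`."""
    seen: Set[str] = set()
    stack = list(labels)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def hierarchical_f(gold: Set[str], pred: Set[str], parents: Dict[str, List[str]]) -> float:
    """F-measure computed on ancestor-augmented gold and predicted sets."""
    g = with_ancestors(gold, parents)
    p = with_ancestors(pred, parents)
    overlap = len(g & p)
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Toy hierarchy (child -> parents); not an excerpt of MeSH.
parents = {
    "lung cancer": ["cancer"],
    "breast cancer": ["cancer"],
    "cancer": ["disease"],
}
hf = hierarchical_f({"lung cancer"}, {"breast cancer"}, parents)
print(round(hf, 3))
```

A flat F-measure would score this prediction 0. Here the shared ancestors (cancer, disease) give precision and recall of 2/3 each.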

Page 38: Biomedical articles per year

First challenge prizes

Page 39: Biomedical articles per year

Sustainability

Making the challenge viable, at very low cost, after the end of the project

• BioASQ Oracle
• Software release and installation instructions
• Benchmark datasets
• BioASQ social network
• Involvement of the biomedical community in the process
• Attracting sponsors for prizes

Page 40: Biomedical articles per year

Project Consortium

1. National Centre for Scientific Research “Demokritos” – NCSR “D” (EL)
2. Transinsight GmbH – TI (D)
3. Universite Joseph Fourier – UJF (F)
4. University Leipzig – ULEI (D)
5. Universite Pierre et Marie Curie Paris 6 – UPMC (F)
6. Athens University of Economics and Business – Research Centre – AUEB-RC (EL)

Page 41: Biomedical articles per year

Project Consortium

Page 42: Biomedical articles per year

Get in touch!

BioASQ workshop @CLEF (Sheffield, Sept 14)

Visit www.bioasq.org
Follow @BioASQ

Page 43: Biomedical articles per year

Useful Links

• BioASQ Annotation & assessment tools:
– http://at.bioasq.org/
– http://assess.bioasq.org/
– https://github.com/AKSW/BioASQ-AT
• BioASQ social network:
– http://sn.bioasq.org/
– https://github.com/AKSW/BioASQ-SN
• BioASQ platform:
– http://bioasq.lip6.fr/
• BioASQ Oracles:
– http://bioasq.lip6.fr/oracle/

A. Kosmopoulos, I. Partalas, E. Gaussier, G. Paliouras, I. Androutsopoulos, Evaluation Measures for Hierarchical Classification: a unified view and novel approaches. Data Mining and Knowledge Discovery (To appear)