SKIMMR: Making Knowledge Discovery Easier Vít Nováček ([email protected]) February 8th, 2013 @ DERI meeting

Upload: information-sciences-institute

Post on 09-Jul-2015


DESCRIPTION

Unlike full reading, 'skim-reading' means looking quickly over information in an attempt to cover more material whilst still retaining a superficial view of the underlying content. In this work, we emulate this natural human activity by providing a dynamic graph-based view of entities automatically extracted from text using superficial text parsing and processing techniques. We provide a preliminary web-based tool (called 'SKIMMR') that generates a network of inter-related concepts from a set of documents. In SKIMMR, a user may browse the network to investigate the lexically-driven information space extracted from the documents. When a particular area of that space looks interesting to a user, the tool can display the documents that are most relevant to the displayed concepts. We present this as a simple, viable methodology for browsing a document collection (such as a collection of scientific research articles) in an attempt to limit the information overload of examining that collection.

TRANSCRIPT

Page 1: Im2013vit

SKIMMR: Making Knowledge Discovery Easier

Vít Nováček ([email protected])

February 8th, 2013 @ DERI meeting

Page 2: Im2013vit

Introduction SKIMMR Demo Evaluation Conclusions

Outline

Introduction
SKIMMR
    KB Computation
    KB Utilisation
Demo
Evaluation
    Evaluated Features
    Evaluation Methodology
Conclusions

Page 3: Im2013vit

Machine-Aided Skim Reading

Traditional (Skim) Reading
  - full reading – deep insights (slow)
  - skim reading – superficial overview (quicker)

How Can Automation Help?
  - going deep is hard
  - large-scale shallow processing more feasible

What Kind of Automation?
  - extraction (text and data mining)
  - augmentation (computing more complex content)
  - indexing and querying
  - presentation of the results

Related Work
  - processing: text mining, graph analysis, distributional semantics, fuzzy IR
  - presentation: GoPubMed, Textpresso, IVEA, CORAAL, Exhibit, ...

Image source: http://a-pieceofpaper.blogspot.com

Page 4: Im2013vit

Input/Extraction Pipe-Lines

Text Extraction
  - preprocessing (tokenization, tagging, shallow parsing)
  - NE recognition
  - relation extraction
  - co-occurrence analysis + statistics (PMI, TF/IDF, ...)

Digesting Linked Data
  - graph decomposition
  - cluster analysis
  - co-occurrence analysis + statistics (PMI, TF/IDF, ...)

Extraction Results
  - $(s, p, o, r, w)$ statements
  - subject, predicate, object, provenance, weight

Image source: http://atyoursurveys.blogspot.com
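The co-occurrence step of the pipeline above can be sketched as follows. This is only an illustration under stated assumptions, not the SKIMMR code: the `Statement` record, the `cooccurs_with` predicate name and the choice of document-level positive PMI as the weight are all hypothetical, picked to show how $(s, p, o, r, w)$ tuples might come out of raw term lists.

```python
import math
from collections import Counter, namedtuple
from itertools import combinations

# Hypothetical record type mirroring the (s, p, o, r, w) tuple from the slide.
Statement = namedtuple("Statement", "subject predicate object provenance weight")


def cooccurrence_statements(docs):
    """Emit one weighted co-occurrence statement per term pair and document.

    docs maps a provenance identifier to the list of terms found in it.
    The weight is the pointwise mutual information (PMI) of the pair,
    estimated from document frequencies; non-positive PMI pairs are dropped.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing each term pair
    for terms in docs.values():
        unique = sorted(set(terms))
        term_df.update(unique)
        pair_df.update(combinations(unique, 2))

    statements = []
    for prov, terms in docs.items():
        for a, b in combinations(sorted(set(terms)), 2):
            # PMI = log2( p(a, b) / (p(a) * p(b)) ) with document-level estimates.
            pmi = math.log2(pair_df[(a, b)] * n_docs / (term_df[a] * term_df[b]))
            if pmi > 0:
                statements.append(Statement(a, "cooccurs_with", b, prov, pmi))
    return statements


# Toy input: four "documents" with already extracted terms (made-up data).
toy_docs = {
    "d1": ["parkinson", "dopamine"],
    "d2": ["parkinson", "dopamine", "tremor"],
    "d3": ["aspirin"],
    "d4": ["aspirin", "tremor"],
}
toy_statements = cooccurrence_statements(toy_docs)
```

Each statement keeps the identifier of the document it came from, which is what later allows results to be traced back to their sources.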

Page 5: Im2013vit

Computing the Knowledge Base

Distributional Representation
  - aggregated co-occurrence/relation statements
  - statements → tensor representation
  - every element still linked to its provenance
  - matrix perspectives of the tensor

Augmentation
  - perspectives give rise to emergent patterns like:
      - semantic similarity
      - concept clusters and taxonomies
      - IF-THEN rules
      - concept ordering and relative relevance

Image source: www.bystonline.org
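A minimal sketch of the representation described above, assuming that statements are aggregated by summing their weights and that semantic similarity is measured as the cosine between rows of one matrix perspective. Both choices, and all function names, are illustrative assumptions; the slides do not specify them.

```python
import math
from collections import defaultdict


def build_tensor(statements):
    """Aggregate (s, p, o, r, w) statements into a sparse 3-way tensor.

    Returns (tensor, provenance): tensor maps (s, p, o) to the summed weight,
    provenance maps (s, p, o) to the set of sources supporting that element.
    """
    tensor = defaultdict(float)
    provenance = defaultdict(set)
    for s, p, o, r, w in statements:
        tensor[(s, p, o)] += w
        provenance[(s, p, o)].add(r)
    return dict(tensor), dict(provenance)


def subject_perspective(tensor):
    """Flatten the tensor into a matrix: one row per subject, one column per (p, o)."""
    rows = defaultdict(dict)
    for (s, p, o), w in tensor.items():
        rows[s][(p, o)] = w
    return dict(rows)


def cosine(u, v):
    """Cosine similarity of two sparse row vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy statements (made-up data): subject, predicate, object, provenance, weight.
toy = [
    ("a", "rel", "x", "d1", 1.0),
    ("b", "rel", "x", "d2", 1.0),
    ("a", "rel", "y", "d1", 1.0),
    ("a", "rel", "x", "d3", 0.5),
]
tensor, provenance = build_tensor(toy)
rows = subject_perspective(tensor)
similarity_ab = cosine(rows["a"], rows["b"])
```

Other perspectives (rows per object, or per predicate) follow the same pattern with a different choice of row key.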

Page 6: Im2013vit

Indexing the Knowledge Base

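Term Index

$$
\begin{array}{c|cccc}
 & T_1 & T_2 & \cdots & T_n \\ \hline
T_1 & w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\
T_2 & w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
T_n & w_{n,1} & w_{n,2} & \cdots & w_{n,n}
\end{array}
\qquad w_{i,j} \in [0, 1]
$$

Statement Index

$$
\begin{array}{c|cccc}
 & S_1 & S_2 & \cdots & S_m \\ \hline
T_1 & c_{1,1} & c_{1,2} & \cdots & c_{1,m} \\
T_2 & c_{2,1} & c_{2,2} & \cdots & c_{2,m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
T_n & c_{n,1} & c_{n,2} & \cdots & c_{n,m}
\end{array}
\qquad c_{i,j} \in \{0, 1\}
$$

Provenance Index

$$
\begin{array}{c|cccc}
 & P_1 & P_2 & \cdots & P_q \\ \hline
S_1 & w_{1,1} & w_{1,2} & \cdots & w_{1,q} \\
S_2 & w_{2,1} & w_{2,2} & \cdots & w_{2,q} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
S_m & w_{m,1} & w_{m,2} & \cdots & w_{m,q}
\end{array}
\qquad w_{i,j} \in [0, 1]
$$

Auxiliary Fulltext Index
  - user’s entry point
  - increasing robustness
  - “keys”: queries
  - values: term identifiers
  - fairly standard IR:
      - OKAPI BM25F

Image source: http://teptdataservices.blogspot.com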
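To make the role of the auxiliary fulltext index concrete, here is a hedged sketch that maps a free-text query to ranked term identifiers. It uses plain Okapi BM25 over a single field (the term label), which is a simplification of the BM25F variant named on the slide; the function name, the `k1` and `b` defaults and the toy labels are assumptions for illustration only.

```python
import math
from collections import Counter


def bm25_rank(query, term_labels, k1=1.2, b=0.75):
    """Rank term identifiers by the BM25 score of their label against the query.

    term_labels maps a term identifier to its textual label. Only this one
    field is indexed here; BM25F would combine several weighted fields.
    """
    docs = {tid: label.lower().split() for tid, label in term_labels.items()}
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n
    # Document frequency: in how many labels does each token occur?
    df = Counter(tok for tokens in docs.values() for tok in set(tokens))

    scores = {}
    for tid, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for tok in query.lower().split():
            if tok not in tf:
                continue
            idf = math.log(1 + (n - df[tok] + 0.5) / (df[tok] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[tok] * (k1 + 1) / (tf[tok] + length_norm)
        if score > 0:
            scores[tid] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy term labels (made-up data); the keys stand in for term identifiers.
toy_labels = {"T1": "parkinson disease", "T2": "heart disease", "T3": "dopamine"}
ranked = bm25_rank("parkinson disease", toy_labels)
ranked_ids = [tid for tid, _ in ranked]
```

The identifiers returned by such a look-up would then serve as the entry points into the term index.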

Page 7: Im2013vit

Querying the Knowledge Base

Initial Result Term Set
  - example query: $? \leftrightarrow T_x$ AND ($? \leftrightarrow T_y$ OR $? \leftrightarrow T_z$)
  - term index look-up:

$$
\begin{aligned}
F_x &= \{(T_1, w_{x,1}), (T_2, w_{x,2}), \ldots, (T_n, w_{x,n})\} \\
F_y &= \{(T_1, w_{y,1}), (T_2, w_{y,2}), \ldots, (T_n, w_{y,n})\} \\
F_z &= \{(T_1, w_{z,1}), (T_2, w_{z,2}), \ldots, (T_n, w_{z,n})\}
\end{aligned}
$$

  - combining atomic results: $F_x \cap (F_y \cup F_z)$

Complete Results
  - terms: $R_T = \{(T_1, w^T_1), (T_2, w^T_2), \ldots, (T_n, w^T_n)\}$, where $w^T_i$ are the weights resulting from the combination
  - statements: $R_S = \{(S_1, w^S_1), (S_2, w^S_2), \ldots, (S_m, w^S_m)\}$, where $w^S_i = f_\nu\left(\sum_{j=1}^{n} w^T_j \, c_{j,i}\right)$
  - provenances: $R_P = \{(P_1, w^P_1), (P_2, w^P_2), \ldots, (P_q, w^P_q)\}$, where $w^P_i = f_\nu\left(\sum_{j=1}^{m} w^S_j \, w_{j,i}\right)$

Image source: http://nuget.org
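The query evaluation above can be sketched in a few lines. Two points are assumptions rather than something the slides state: the fuzzy set operators are taken to be min (intersection) and max (union), and $f_\nu$ is stood in for by a simple division by the largest value. The three indices are stored as sparse nested dicts, with missing entries read as zero.

```python
def fuzzy_and(f, g):
    """Fuzzy intersection: keep shared terms with the minimum weight."""
    return {t: min(w, g[t]) for t, w in f.items() if t in g}


def fuzzy_or(f, g):
    """Fuzzy union: keep all terms with the maximum weight."""
    return {t: max(f.get(t, 0.0), g.get(t, 0.0)) for t in f.keys() | g.keys()}


def normalise(weights):
    """Assumed stand-in for f_nu: divide by the largest value."""
    top = max(weights.values(), default=0.0)
    return {k: v / top for k, v in weights.items()} if top > 0 else dict(weights)


def propagate(source_weights, index):
    """Compute w_i = f_nu(sum_j source_weights[j] * index[j][i])."""
    totals = {}
    for j, w_j in source_weights.items():
        for i, entry in index.get(j, {}).items():
            totals[i] = totals.get(i, 0.0) + w_j * entry
    return normalise(totals)


def answer(term_index, statement_index, provenance_index, tx, ty, tz):
    """Evaluate the example query ? <-> Tx AND (? <-> Ty OR ? <-> Tz)."""
    r_t = fuzzy_and(term_index[tx], fuzzy_or(term_index[ty], term_index[tz]))
    r_s = propagate(r_t, statement_index)   # terms -> statements via c_{j,i}
    r_p = propagate(r_s, provenance_index)  # statements -> provenances via w_{j,i}
    return r_t, r_s, r_p


# Toy indices (made-up data).
term_index = {
    "Tx": {"T1": 0.8, "T2": 0.4},
    "Ty": {"T1": 0.5},
    "Tz": {"T2": 0.9, "T3": 0.7},
}
statement_index = {"T1": {"S1": 1, "S2": 1}, "T2": {"S2": 1}}
provenance_index = {"S1": {"P1": 1.0}, "S2": {"P1": 0.5, "P2": 1.0}}

r_t, r_s, r_p = answer(term_index, statement_index, provenance_index, "Tx", "Ty", "Tz")
```

In this toy run, `T3` drops out because it is related to `Tz` but not to `Tx`, and the statement supported by both remaining terms ends up with the highest weight.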

Page 8: Im2013vit

Let’s Learn About Some Grim Stuff!

Page 9: Im2013vit

What to Evaluate?

Quality of the Extracted/Computed Content
  - “noise-to-signal” ratio
  - relevance of results w.r.t. queries
  - information value (obvious vs. enlightening)

User Experience
  - usability of SKIMMR
      - general
      - domain-specific
  - performance benefits (over a base-line)

Image source: http://voguepay.com

Page 10: Im2013vit

How to Evaluate?

Quality of the Extracted/Computed Content
  - identification (or creation) of a gold standard
  - generalised IR measures
  - committee-based annotation of the results

User Experience
  - SUS survey
  - domain-specific survey
  - user performance analysis (SKIMMR vs. base-line)

Image source: http://www.123rf.com
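For the SUS survey mentioned above, the standard System Usability Scale scoring rule can be written down directly. The function name and the example answers are illustrative; the sketch assumes the usual ten-item questionnaire with answers on a 1 to 5 scale.

```python
def sus_score(responses):
    """Score one System Usability Scale questionnaire (ten answers, each 1-5).

    Odd-numbered items are positively worded and contribute (answer - 1);
    even-numbered items are negatively worded and contribute (5 - answer).
    The sum is scaled by 2.5 to give a value between 0 and 100.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten answers on a 1-5 scale")
    total = 0
    for position, answer in enumerate(responses, start=1):
        total += (answer - 1) if position % 2 else (5 - answer)
    return total * 2.5


# A neutral respondent (all answers 3) lands in the middle of the scale.
neutral = sus_score([3] * 10)
```

Averaging the per-respondent scores for SKIMMR and for the base-line tool would give one simple number to compare in the user experience part of the evaluation.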

Page 11: Im2013vit

Conclusions and Future Work

Current Status
  - machine-aided skim reading notion coined
  - basic theoretical background proposed
  - a prototype implemented (general and biomedical versions)
      - http://pypi.python.org/pypi/skimmr_gt/0.1-a1
      - http://pypi.python.org/pypi/skimmr_bm/0.1-a1

Next Steps
  - evaluation (with a gold standard and sample users)
  - dissemination and follow-ups (write-up, proposals)
  - back-end extensions:
      - more (complex) types of relations
      - proper APIs (development, web service, ...)
      - database and/or cloud storage
  - front-end extensions:
      - smoother transition between the graphs
      - complex querying
      - additional visualisations (trends, focused provenances, ...)

Image source: http://support.pacifichost.com
