tna taxonomies 20160525

Upload: jeremie-charlet

Post on 14-Apr-2017

Category:

Technology


TRANSCRIPT

Page 1: TNA taxonomies 20160525
Page 2: TNA taxonomies 20160525

Jeremie Charlet

25th May 2016

Presentation of Taxonomy Applications and their development to the BBC

Page 3: TNA taxonomies 20160525

Introduction

– Categorisation was initially done with Autonomy: it took the Taxonomy team 2 years to write and perfect the category queries

– Since we migrated our search engine to Solr, we had to build the taxonomy tools from scratch

Example category query for "air force": "Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …
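A minimal sketch, in Python rather than in Autonomy or Solr query syntax, of how a category query of this shape behaves: a record belongs to the category if any of the OR-ed phrases occurs in its text. The function name `matches_category`, the case-insensitive whole-word comparison and the sample record texts are assumptions made for illustration; only the phrases come from the slide.

```python
import re

# Phrases from the "air force" example above (the slide elides the rest).
AIR_FORCE_PHRASES = [
    "Air Force", "air forces", "Air Ministry", "Air Historical Branch",
    "Air Department", "Air Board", "Air Council",
    "Department of the Air Member", "air army",
]

def matches_category(text, phrases):
    """Return True if any phrase occurs in text as whole words (case-insensitive)."""
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False

# Hypothetical record descriptions.
in_category = matches_category("Minutes of the Air Council, 1936", AIR_FORCE_PHRASES)
not_in_category = matches_category("Fairground accounts", AIR_FORCE_PHRASES)
```

Whether matching should be case-insensitive is exactly one of the tuning questions raised later in the deck; it is hard-coded here only to keep the sketch short.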

Page 4: TNA taxonomies 20160525

Plan

Introduction

1. Solution

2. How we implemented it

3. Attempt on Machine Learning

Conclusion: learnings and next steps

http://discovery.nationalarchives.gov.uk/

Page 5: TNA taxonomies 20160525

Categories displayed on Discovery, our archives portal

Administration User Interface for taxonomists

Command Line Interface to categorise everything once

Batch Job to categorise documents every day

1/ Solution

Page 6: TNA taxonomies 20160525

1. Solution / Discovery

Page 7: TNA taxonomies 20160525

1. Solution / admin GUI

Page 8: TNA taxonomies 20160525

1. Solution / admin GUI

Page 9: TNA taxonomies 20160525

Application to categorise documents every day
1. to categorise new documents
2. to re-categorise documents when they are updated

1. Solution / daily updates

Page 10: TNA taxonomies 20160525

1. Solution / daily updates

Page 11: TNA taxonomies 20160525

Application to categorise everything once
1. to do it for the first time
2. to apply the latest modifications from taxonomists to all documents

1. Solution / categorise all docs

Page 12: TNA taxonomies 20160525

Under the hood of taxonomy-batch-cat-all

1. Solution / categorise all docs

Page 13: TNA taxonomies 20160525

Categorisation and updates on Solr are decoupled

1. Solution / categorise all docs

Page 14: TNA taxonomies 20160525

Architecture diagram for daily updates (Java side)

1. Solution

Page 15: TNA taxonomies 20160525

Plan

Introduction

1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day

2. How we implemented it

3. Attempt on Machine Learning

Conclusion: learnings and next steps

Page 16: TNA taxonomies 20160525

To get it right

To get it fast
• Algorithm
• Fine tuning
• Distributed system with Akka

2. Implementation

Page 17: TNA taxonomies 20160525

Many parameters to take into account
• Is case sensitivity important?
• Use punctuation?
• Use synonyms?
• Ignore stop words (of, the, a, …)?
• Use wildcards?
• Which metadata to use?

= Iterative process

How to evaluate whether our results are valid?
> Use documents and categories from the former system
> Categorise them again and compare the results

To do that quickly, we created a command-line interface

[jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
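The slides do not say which measure was used to compare the new results with those of the former system. As one plausible sketch, the comparison can be expressed as micro-averaged precision and recall over the category sets of each document. The function `evaluate`, the choice of metrics and the sample data are all assumptions for illustration, not the project's actual evaluation code.

```python
def evaluate(expected, predicted):
    """Compare category sets per document; return (precision, recall).

    expected and predicted map a document ID to a set of category names.
    Documents that appear only in predicted are ignored.
    """
    true_pos = false_pos = false_neg = 0
    for doc_id, expected_cats in expected.items():
        predicted_cats = predicted.get(doc_id, set())
        true_pos += len(expected_cats & predicted_cats)
        false_pos += len(predicted_cats - expected_cats)
        false_neg += len(expected_cats - predicted_cats)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical data: categories from the former system vs. a new run.
former = {"C1": {"Air Force", "Military"}, "C2": {"Poverty"}}
new_run = {"C1": {"Air Force"}, "C2": {"Poverty", "Military"}}
precision, recall = evaluate(former, new_run)
```

Re-running such a comparison after each parameter change (synonym filter on or off, for example) is what makes the process iterative.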

2. Implementation / get it right

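Answers to the questions on this slide, in order:
• Case sensitivity: it depends
• Punctuation: it depends
• Synonyms: yes
• Ignore stop words: no, use stop words
• Wildcards: * ?
• Metadata: title, description, context description, categories, people, places, corporate bodies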

Page 18: TNA taxonomies 20160525

We apply our 136 categories to 22 million records in 1.5 days (~5 ms per document)
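The per-document figure can be checked with a few lines of arithmetic. The sketch below assumes the run takes the full 1.5 days of wall-clock time and that the work is spread evenly over the 13 processes mentioned on this slide; both are simplifications.

```python
records = 22_000_000
seconds = 1.5 * 24 * 60 * 60   # 1.5 days of wall-clock time
processes = 13                 # Akka processes mentioned on this slide

# Wall-clock time per document for the system as a whole.
ms_per_doc_overall = seconds / records * 1000

# Time each process spends per document if the load is evenly spread (assumption).
ms_per_doc_per_process = ms_per_doc_overall * processes
```

Under these assumptions the overall figure comes to just under 6 ms per document, in line with the rounded "~5 ms", while each individual process spends roughly 77 ms on a document.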

• For each record, we create an in-memory index containing only that document and run our queries against it. We then run only the matching queries against the complete index to obtain a score that lets us rank the matches

• Distributed system with Akka (13 processes running on 2 servers)

2 × 24-core CPU, 40 GB RAM
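The two-phase idea (cheap matching against a single document, then scoring only the matching queries with corpus-wide statistics) can be sketched in plain Python. This is an illustration under stated assumptions, not the Lucene implementation: queries are reduced to sets of single terms, and a simple tf × idf sum stands in for Lucene's scoring. The function names and the sample data are hypothetical.

```python
import math

def tokens(text):
    return text.lower().split()

def categorise(doc_text, queries, corpus):
    """Phase 1: find the queries that match this one document.
    Phase 2: score only those queries using corpus-wide statistics."""
    doc_tokens = tokens(doc_text)
    doc_terms = set(doc_tokens)

    # Phase 1: cheap boolean match against the single document.
    matching = {name: terms & doc_terms
                for name, terms in queries.items() if terms & doc_terms}

    # Phase 2: tf * idf over the whole corpus (stand-in for Lucene scoring).
    corpus_terms = [set(tokens(text)) for text in corpus]
    scores = {}
    for name, matched_terms in matching.items():
        score = 0.0
        for term in matched_terms:
            doc_freq = sum(1 for terms in corpus_terms if term in terms)
            idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1
            score += doc_tokens.count(term) * idf
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical corpus and category queries.
corpus = ["squadron records of the air ministry",
          "poor law union records",
          "air raid damage reports"]
queries = {"Air Force": {"squadron", "ministry"},
           "Poverty": {"poor", "workhouse"},
           "Records": {"records"}}
ranked = categorise(corpus[0], queries, corpus)
```

With 136 categories, most queries fail phase 1 for any given record, so the more expensive scoring step runs for only a handful of them.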

2. Implementation / get it fast

Page 19: TNA taxonomies 20160525

Use the right Directory implementation for your system (NRTCachingDirectory instead of the default one)
> 1 line in 1 file = 20% faster on search queries

Use a filter instead of a query to restrict the search to a single document, and use the low-level API carefully

Profile your application frequently
> Identify ugly code, where to add caching, where to add concurrency
> We spent 7% of the time creating Query objects for every document: instead, create them once and store them in memory
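The last point, building query objects once rather than once per document, can be sketched as follows. Compiled regular expressions stand in for Lucene Query objects here; the class `CategoryMatcher`, its `build_count` counter and the sample data are hypothetical names introduced for the illustration.

```python
import re

class CategoryMatcher:
    """Builds one compiled pattern per category up front and reuses it for every document."""

    def __init__(self, category_phrases):
        self.build_count = 0      # how many patterns were built, for illustration
        self._patterns = {}
        for name, phrases in category_phrases.items():
            alternation = "|".join(re.escape(phrase) for phrase in phrases)
            self._patterns[name] = re.compile(
                r"\b(?:" + alternation + r")\b", re.IGNORECASE)
            self.build_count += 1

    def categories_for(self, text):
        """Return the categories whose pre-built pattern matches the text."""
        return [name for name, pattern in self._patterns.items()
                if pattern.search(text)]

# Hypothetical categories and records.
matcher = CategoryMatcher({
    "Air Force": ["Air Ministry", "air forces"],
    "Poverty": ["poor law", "workhouse"],
})
documents = ["Air Ministry correspondence",
             "Workhouse admission register",
             "Parish boundary map"]
results = [matcher.categories_for(text) for text in documents]
```

The construction cost is paid once per category instead of once per category per document, which is the saving the profiling step pointed to.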

2. Implementation / get it fast

Page 20: TNA taxonomies 20160525

How to transmit documents to categorise efficiently?

By sending messages to workers

See the problem?

[Diagram: a Categorisation Supervisor sends messages, each carrying a list of document IDs such as C456321;C65465;C654879;C56879, to three Categorisation Workers; many such messages are shown queued up.]

2. Implementation / get it fast

Page 21: TNA taxonomies 20160525

Solution: http://www.michaelpollmeier.com/akka-work-pulling-pattern/

2. Implementation / get it fast

Page 22: TNA taxonomies 20160525

Applied to the taxonomy applications

https://github.com/nationalarchives/taxonomy

There are 2 types of batch application (each runs in its own application server)

• 1 instance of Taxonomy-cat-all-supervisor

• N instances of Taxonomy-cat-all-worker

The categorisation supervisor browses the whole index and retrieves 1000 documents at a time

A categorisation worker receives categorisation requests that each contain a list of documents to categorise
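The supervisor and worker roles can be sketched with threads and a bounded queue. This is only an analogy for the work pulling idea, not Akka code: in the Akka pattern workers send explicit requests for work to the master, whereas here a worker "pulls" by taking the next batch from the queue once it has finished the previous one, and the small queue bound keeps the supervisor from flooding the workers. The worker count and the placeholder category are assumptions; the batch size of 1000 is taken from the slide.

```python
import queue
import threading

BATCH_SIZE = 1000    # the supervisor retrieves 1000 documents at a time
NUM_WORKERS = 3      # hypothetical; the slides only say N worker instances
STOP = None          # sentinel telling a worker there is no more work

def supervisor(doc_ids, work_queue):
    """Walk the whole index in batches. put() blocks while the queue is full,
    so batches are handed out only as fast as workers pull them."""
    for start in range(0, len(doc_ids), BATCH_SIZE):
        work_queue.put(doc_ids[start:start + BATCH_SIZE])
    for _ in range(NUM_WORKERS):
        work_queue.put(STOP)

def worker(work_queue, results, lock):
    """Pull one batch at a time, categorise it, then pull the next."""
    while True:
        batch = work_queue.get()
        if batch is STOP:
            break
        # Placeholder for the real categorisation of each document.
        categorised = [(doc_id, "placeholder-category") for doc_id in batch]
        with lock:
            results.extend(categorised)

def run(doc_ids):
    work_queue = queue.Queue(maxsize=NUM_WORKERS)  # small bound gives back-pressure
    results, lock = [], threading.Lock()
    workers = [threading.Thread(target=worker, args=(work_queue, results, lock))
               for _ in range(NUM_WORKERS)]
    for thread in workers:
        thread.start()
    supervisor(doc_ids, work_queue)
    for thread in workers:
        thread.join()
    return results
```

For example, `run([f"C{i}" for i in range(2500)])` hands out three batches (1000, 1000 and 500 IDs) and returns 2500 categorised entries, in no guaranteed order.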

2. Implementation / get it fast

Page 23: TNA taxonomies 20160525

Plan

Introduction

1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day

2. How we implemented it
– Get it right
– Get it fast
  • Fine tuning
  • Distributed system with Akka

3. Attempt on Machine Learning

Conclusion: learnings and next steps

Page 24: TNA taxonomies 20160525

Research on a training-set-based solution for 2 months

1. Take a data set of known (already classified) documents
2. Split it into a test set and a training set
– Train the system with the training set
– Evaluate it using the test set
– Iterate until satisfactory
3. Move it to production
– Classify new documents using the trained system
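The train, evaluate and classify loop above can be sketched end to end. The slides do not say which algorithm was used, so a small multinomial naive Bayes classifier is swapped in here purely for illustration; the function names, the sample records and the fixed split are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per category: the training step of multinomial naive Bayes."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            score += math.log((word_counts[category][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

def accuracy(test_docs, model):
    """Fraction of test documents whose predicted category equals the known one."""
    correct = sum(1 for text, category in test_docs
                  if classify(text, model) == category)
    return correct / len(test_docs)

# Step 1: hypothetical, already classified records.
data = [
    ("squadron flight log air ministry", "Air Force"),
    ("air council minutes on squadron strength", "Air Force"),
    ("bomber squadron operations record", "Air Force"),
    ("poor law union workhouse register", "Poverty"),
    ("workhouse admission and relief records", "Poverty"),
    ("relief payments to the poor", "Poverty"),
]
# Step 2: split, train, evaluate.
training_set = data[0:2] + data[3:5]
test_set = [data[2], data[5]]
model = train(training_set)
score = accuracy(test_set, model)
# Step 3: classify a new document with the trained model.
new_category = classify("squadron strength report", model)
```

Two well-separated categories make this toy case easy; the next slide explains why the same approach struggled with 136 broad or overlapping categories.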

3. Attempt on Machine Learning

Page 25: TNA taxonomies 20160525

Why it did not work

1. Using category queries to create the training set
– Highly dependent on the validity/accuracy of the category queries

2. Nature of our categories
– far too many (136)
– categories too vague / broad or too similar ("Poverty", "Military"): they do not suit such a system

3. Not the right tool? We used the tool built into Lucene (search engine)

4. Nature of the data? Quality of the metadata?

3. Attempt on Machine Learning

Page 26: TNA taxonomies 20160525

Plan

Introduction

1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day

2. How we implemented it
– Get it right
– Get it fast
  • Fine tuning
  • Distributed system with Akka

3. Attempt on Machine Learning

Conclusion: learnings and next steps

Page 27: TNA taxonomies 20160525

Conclusion: learnings and next steps

Gains and losses

No * wildcard within words

categorisation 10 times faster

use of free solutions (*)

admin interface more fluid and useable

Page 28: TNA taxonomies 20160525

Conclusion: learnings and next steps

Possible improvements
- Update documents for 1 category on demand
- Create a more generic solution
- Add the missing GUIs (reporting, categorise all)
- Build the solution upon Solr, not Lucene
- Use cloud services instead of on-site servers

Next steps
- Categorise other archives
- Work on new digital-born records
- New categories?
- New research on machine learning?

[Diagram labels: Solr, Lucene]

Page 29: TNA taxonomies 20160525

Thank you for listening

Any questions?