tna taxonomies 20160525

Jeremie Charlet

25th May 2016

Presentation of Taxonomy Applications and their development to the BBC

Introduction

– Categorisation was initially done with Autonomy: the Taxonomy team spent 2 years writing and perfecting the category queries

– Since we migrated our search engine to Solr, we had to build the taxonomy tools from scratch

Category "air force": "Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …

Plan

Introduction

1. Solution

2. How we implemented it

3. Attempt on Machine Learning

Conclusion: learnings and next steps

http://discovery.nationalarchives.gov.uk/

Categories displayed on Discovery, our archives portal

Administration User Interface for taxonomists

Command Line Interface to categorise everything once

Batch Job to categorise documents every day

1/ Solution

1. Solution / Discovery

1. Solution / admin GUI

1. Solution / admin GUI

Application to categorise documents every day
1. to categorise new documents
2. to re-categorise documents when they are updated

1. Solution / daily updates

1. Solution / daily updates

Application to categorise everything once
1. to do it for the first time
2. to apply the latest modifications from taxonomists to all documents

1. Solution / categorise all docs

Under the hood of taxonomy-batch-cat-all

1. Solution / categorise all docs

Categorisation and updates on Solr are decoupled

1. Solution / categorise all docs

Architecture diagram for daily updates (Java side)

1. Solution

Plan

Introduction

1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day

2. How we implemented it




To get it right

To get it fast
• Algorithm
• Fine tuning
• Distributed system with Akka

2. Implementation

Many parameters to take into account
• Is case sensitivity important?
• Use punctuation?
• Use synonyms?
• Ignore stop words (of, the, a, …)?
• Use wildcards?
• Which metadata to use?

= Iterative process

How to evaluate if our results are valid?
> Use documents and categories from the former system
> Categorise them again and compare results

To do that quickly, we created a Command Line Interface

[jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
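The comparison step described above (categorise the documents from the former system again, then compare) can be sketched as a precision/recall calculation. This is only an illustration: the function name, the choice of metrics and the toy data are assumptions, not the output or internals of the actual CLI.

```python
from typing import Dict, Set


def evaluate(expected: Dict[str, Set[str]], actual: Dict[str, Set[str]]) -> Dict[str, float]:
    """Compare the categories of the former system (expected) with the new ones (actual)."""
    true_pos = false_pos = false_neg = 0
    for doc_id, old_categories in expected.items():
        new_categories = actual.get(doc_id, set())
        true_pos += len(old_categories & new_categories)   # found by both systems
        false_pos += len(new_categories - old_categories)  # only found by the new system
        false_neg += len(old_categories - new_categories)  # missed by the new system
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return {"precision": precision, "recall": recall}


# Toy data (hypothetical): document id -> categories.
former_system = {"C456321": {"Air Force", "Military"}, "C65465": {"Poverty"}}
new_system = {"C456321": {"Air Force"}, "C65465": {"Poverty", "Military"}}
scores = evaluate(former_system, new_system)
```

Re-running such a comparison after each parameter change is what makes the process iterative: a setting is kept only if the scores improve.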

2. Implementation / get it right

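Answers:
• Case sensitivity: it depends
• Punctuation: it depends
• Synonyms: yes
• Ignore stop words: no, use stop words
• Wildcards: * and ?
• Metadata: title, description, context description, categories, people, places, corporate bodies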
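One way to picture these parameters is as switches on a text analysis chain that runs before any query is matched. The sketch below is a simplified stand-in written in plain Python, not the project's Lucene analyser; the function name, its options and the synonym table are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical synonym table, for illustration only.
SYNONYMS: Dict[str, str] = {"aeroplane": "aircraft", "airplane": "aircraft"}


def analyze(text: str, lowercase: bool = True, use_synonyms: bool = True,
            strip_punctuation: bool = True) -> List[str]:
    """Turn raw text into tokens; each flag mirrors one of the questions above."""
    if strip_punctuation:
        tokens = re.findall(r"\w+", text)  # keep only runs of word characters
    else:
        tokens = text.split()              # punctuation stays attached to tokens
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if use_synonyms:
        tokens = [SYNONYMS.get(t.lower(), t) for t in tokens]
    # Stop words (of, the, a, ...) are deliberately kept.
    return tokens


tokens = analyze("The Aeroplane, of the Air Ministry")
```

Toggling one flag at a time and re-running the evaluation shows which settings help for a given set of category queries.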

We apply our 136 categories to 22 million records in 1.5 days (~5 ms per doc)

• For each document, we create an in-memory index containing only that document and run our category queries against it. We then run the queries that matched against the complete index to get a score that lets us rank the matches

• Distributed system with Akka (13 processes running on 2 servers)

2 * 24-core CPU, 40 GB RAM
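The two-phase algorithm above can be sketched in plain Python. This is a toy model under stated assumptions, not the project's Lucene code: category queries are reduced to an OR of phrases, the "complete index" is a list of tokenised documents, and the tf * idf score is a stand-in for whatever scoring the full index provides. The "Poverty" query is hypothetical.

```python
import math
import re
from typing import List


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def count_phrase(tokens: List[str], phrase: List[str]) -> int:
    """Number of times the phrase occurs as consecutive tokens."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


class Category:
    def __init__(self, name: str, phrases: List[str]):
        self.name = name
        self.phrases = [tokenize(p) for p in phrases]  # query = OR of these phrases


def matching_categories(doc_tokens: List[str], categories: List[Category]) -> List[Category]:
    """Phase 1: run every category query against the single document only."""
    return [c for c in categories if any(count_phrase(doc_tokens, p) for p in c.phrases)]


def score(doc_tokens: List[str], category: Category, corpus: List[List[str]]) -> float:
    """Phase 2: score a matching category using corpus-wide statistics (toy tf * idf)."""
    total = 0.0
    for phrase in category.phrases:
        tf = count_phrase(doc_tokens, phrase)
        if tf:
            df = sum(1 for other in corpus if count_phrase(other, phrase))
            total += tf * math.log(1 + len(corpus) / max(df, 1))
    return total


def categorise(text: str, categories: List[Category], corpus: List[List[str]]) -> List[str]:
    tokens = tokenize(text)
    ranked = sorted(((score(tokens, c, corpus), c.name)
                     for c in matching_categories(tokens, categories)), reverse=True)
    return [name for _, name in ranked]


texts = ["Air Ministry report on the Air Council",
         "Poor relief: poor relief granted after an Air Ministry inquiry",
         "Parish poor relief records"]
corpus = [tokenize(t) for t in texts]
categories = [Category("Air Force", ["Air Ministry", "Air Council", "Air Board"]),
              Category("Poverty", ["poor relief"])]  # hypothetical query
ranked = categorise(texts[1], categories, corpus)
```

The point of the first phase is that most of the 136 queries fail cheaply against a one-document index, so only the few that match need the more expensive scoring pass.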

2. Implementation / get it fast

Use the right index storage driver for your system (Lucene's NRTCachingDirectory instead of the default one)
> 1 line in 1 file = 20% faster on search queries

Use a filter instead of a query to restrict the search to a single document, and use the low-level API with care

Profile your application frequently
> Identify ugly code, where to add caching, where to add concurrency
We spent 7% of the time creating Query objects for every document: instead, create them once and store them in memory
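The last point, building the query objects once rather than per document, can be sketched as follows. The class and its naive parser are hypothetical stand-ins for the real Lucene Query construction, and substring matching is a deliberate simplification.

```python
from typing import Dict, List, Tuple


class CachedCategoriser:
    """Parses every category query once at start-up and reuses the result for all documents."""

    def __init__(self, raw_queries: Dict[str, str]):
        self.parse_calls = 0  # instrumentation, to show how often parsing happens
        self.parsed: Dict[str, Tuple[str, ...]] = {
            name: self._parse(query) for name, query in raw_queries.items()
        }

    def _parse(self, query: str) -> Tuple[str, ...]:
        # Stand-in for an expensive parser: split an OR-of-phrases query into phrases.
        self.parse_calls += 1
        return tuple(part.strip().strip('"').lower() for part in query.split(" OR "))

    def categorise(self, text: str) -> List[str]:
        lowered = text.lower()
        return [name for name, phrases in self.parsed.items()
                if any(phrase in lowered for phrase in phrases)]


queries = {"Air Force": '"Air Force" OR "air forces" OR "Air Ministry"',
           "Poverty": '"poor relief"'}  # second query is hypothetical
categoriser = CachedCategoriser(queries)
documents = ["Air Ministry correspondence", "Parish poor relief records", "Map of Kent"]
results = [categoriser.categorise(doc) for doc in documents]
```

However many documents are processed, `parse_calls` stays equal to the number of categories.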


How to transmit documents to categorise efficiently?

By sending messages to workers

See the problem?

[Diagram: the Categorisation Supervisor sends messages, each a list of document IDs such as C456321;C65465;C654879;C56879, to the Categorisation Worker; the messages accumulate on the worker side]


Solution: http://www.michaelpollmeier.com/akka-work-pulling-pattern/


Applied to taxonomy Applications

https://github.com/nationalarchives/taxonomy

There are 2 types of batch application (each runs in its own application server)

• 1 instance of Taxonomy-cat-all-supervisor

• N instances of Taxonomy-cat-all-worker

The categorisation supervisor browses the whole index and retrieves 1000 documents at a time

The categorisation worker receives categorisation requests, each containing a list of documents to categorise
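The idea behind work pulling is that workers ask for a batch when they are idle, so the supervisor never has more batches outstanding than there are workers. Below is a minimal sketch of that idea using Python threads rather than Akka actors; the class, method names and counters are hypothetical and the categorisation step itself is left out.

```python
import threading
from typing import List, Optional


class Supervisor:
    """Hands out one batch of document IDs only when a worker asks for it."""

    def __init__(self, doc_ids: List[str], batch_size: int):
        self._batches = (doc_ids[i:i + batch_size] for i in range(0, len(doc_ids), batch_size))
        self._lock = threading.Lock()
        self.in_flight = 0      # batches handed out and not yet reported done
        self.max_in_flight = 0  # highest value in_flight ever reached
        self.done: List[str] = []

    def request_work(self) -> Optional[List[str]]:
        with self._lock:
            batch = next(self._batches, None)
            if batch is not None:
                self.in_flight += 1
                self.max_in_flight = max(self.max_in_flight, self.in_flight)
            return batch

    def report_done(self, batch: List[str]) -> None:
        with self._lock:
            self.in_flight -= 1
            self.done.extend(batch)


def worker(supervisor: Supervisor) -> None:
    while True:
        batch = supervisor.request_work()  # pull: the worker decides when it is ready
        if batch is None:
            return
        # ... categorise each document of the batch here ...
        supervisor.report_done(batch)


def run(doc_ids: List[str], n_workers: int, batch_size: int) -> Supervisor:
    supervisor = Supervisor(doc_ids, batch_size)
    threads = [threading.Thread(target=worker, args=(supervisor,)) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return supervisor


all_ids = [f"C{i}" for i in range(5000)]
supervisor = run(all_ids, n_workers=4, batch_size=1000)
```

Compared with pushing every batch to the workers up front, nothing piles up in a worker's queue, and a slow worker simply asks for work less often.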

2. Implementation / get it right

Plan

Introduction


2. How we implemented it
– Get it right
– Get it fast
  • Fine tuning
  • Distributed system with Akka




We spent 2 months researching a solution based on a training set

1. Take a data set of known (already classified) documents
2. Split it into a test set and a training set
– Train the system with the training set
– Evaluate it using the test set
– Iterate until satisfactory
3. Move it to production
– Classify new documents using the trained system
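The workflow above can be sketched end to end with a tiny classifier. The naive Bayes model here is a swapped-in technique chosen for brevity, not the Lucene built-in tool that was actually tried, and the labelled examples are invented for illustration.

```python
import math
import random
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Example = Tuple[str, str]  # (document text, category)


def split(data: List[Example], test_ratio: float, seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Step 2: shuffle the known documents and split them into training and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


class NaiveBayes:
    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocab: Set[str] = set()

    def train(self, examples: List[Example]) -> None:
        for text, category in examples:
            words = text.lower().split()
            self.word_counts[category].update(words)
            self.doc_counts[category] += 1
            self.vocab.update(words)

    def classify(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        best, best_score = "", -math.inf
        for category, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # prior
            denom = sum(self.word_counts[category].values()) + len(self.vocab)
            for word in text.lower().split():
                # Laplace smoothing so unseen words do not zero out the score.
                score += math.log((self.word_counts[category][word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


def accuracy(model: NaiveBayes, test_set: List[Example]) -> float:
    return sum(model.classify(text) == cat for text, cat in test_set) / len(test_set)


# Step 1: invented, already classified documents.
data = [("air force squadron records", "Military"), ("army regiment war diary", "Military"),
        ("navy ship log", "Military"), ("air ministry bombing report", "Military"),
        ("poor relief parish records", "Poverty"), ("workhouse admission register", "Poverty"),
        ("poor law union accounts", "Poverty"), ("charity relief fund", "Poverty")]
training_set, test_set = split(data, test_ratio=0.25)
model = NaiveBayes()
model.train(training_set)
test_accuracy = accuracy(model, test_set)  # iterate on features or model until satisfactory
```

With labels that come from category queries rather than from human review, the accuracy measured this way only tells you how well the model imitates the queries.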


Why it did not work
1. Using category queries to create the training set
– Highly dependent on the validity/accuracy of the category queries
2. Nature of our categories
– far too many (136)
– categories too vague / broad or too similar (“Poverty”, “Military”): they do not suit such a system
3. Not the right tool? We used the tool built into Lucene (a search engine)
4. Nature of the data? Quality of the metadata?


Plan

Introduction


2. How we implemented it
– Get it right
– Get it fast
  • Fine tuning
  • Distributed system with Akka





Gains and losses

No * wildcard within words

categorisation 10 times faster

use of free solutions (*)

admin interface more fluid and useable


Possible improvements
- Update documents for 1 category on demand
- Create a more generic solution
- Add missing GUI (reporting, categorise all)
- Build the solution upon Solr, not Lucene
- Use Cloud Services instead of on-site servers

Next steps
- Categorise other archives
- Work on new digital-born records
New categories? New research on machine learning?

Solr

Lucene

Thank you for listening

Any questions?
