tna taxonomies 20160525


Jeremie Charlet

25th May 2016

Presentation to the BBC of the Taxonomy Applications and their development

Introduction

– Categorisation was initially done with Autonomy: two years of work by the Taxonomy team to write and perfect category queries

– When we migrated our search engine to Solr, we had to rebuild the taxonomy tools from scratch

Example category query for “air force”: "Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …

Plan

Introduction

1. Solution

2. How we implemented it

3. Attempt at Machine Learning

Conclusion: learnings and next steps

http://discovery.nationalarchives.gov.uk/

Categories displayed on Discovery, our archives portal

Administration User Interface for taxonomists

Command Line Interface to categorise everything once

Batch Job to categorise documents every day

1. Solution

1. Solution / Discovery

1. Solution / admin GUI

1. Solution / admin GUI

Application to categorise documents every day:
1. to categorise new documents
2. to re-categorise documents when they are updated

1. Solution / daily updates

1. Solution / daily updates

Application to categorise everything once:
1. to do it for the first time
2. to apply the latest modifications from taxonomists to all documents

1. Solution / categorise all docs

Under the hood of taxonomy-batch-cat-all

1. Solution / categorise all docs

Categorisation and updates on Solr are decoupled

1. Solution / categorise all docs

Architecture diagram for daily updates (Java side)

1. Solution

Plan

Introduction

1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day

2. How we implemented it

3. Attempt at Machine Learning

Conclusion: learnings and next steps

http://discovery.nationalarchives.gov.uk/

To get it right

To get it fast:
• Algorithm
• Fine tuning
• Distributed system with Akka

2. Implementation

Many parameters to take into account:
• Is case sensitivity important?
• Use punctuation?
• Use synonyms?
• Ignore stop words (of, the, a, …)?
• Use wildcards?
• Which metadata to use?

= Iterative process
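In Lucene terms, most of these switches end up in the analyzer chain. A minimal sketch of one such combination, using Lucene's CustomAnalyzer builder (the factory names are Lucene's own; the chosen combination and the synonyms file name are illustrative assumptions, not our production configuration):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    public class AnalyzerConfig {
        static Analyzer build() throws Exception {
            return CustomAnalyzer.builder()
                .withTokenizer("standard")                              // split on word boundaries, dropping most punctuation
                .addTokenFilter("lowercase")                            // case-insensitive matching
                .addTokenFilter("synonym", "synonyms", "synonyms.txt")  // expand synonyms from a file
                // no "stop" filter: stop words are kept, matching the choice below
                .build();
        }
    }

Each toggle in the list above adds or removes one line of this chain, which is what makes the tuning an iterative process.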

How to evaluate whether our results are valid?
> Use documents and categories from the former system
> Categorise them again and compare results

To do that quickly, we created a Command Line Interface

[jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true

2. Implementation / get it right

Answers:
• Case sensitivity: it depends
• Punctuation: it depends
• Synonyms: yes
• Stop words: no, use stop words
• Wildcards: * and ?
• Metadata: title, description, context description, categories, people, places, corporate bodies

We apply our 136 categories to 22 million records in 1.5 days (~5 ms per doc)

• We create an in-memory index containing a single document and run our category queries against it. Then we run the matching queries against the complete index to obtain a score that enables us to rank matches (see the sketch after this list)

• Distributed system with Akka (13 processes running on 2 servers)

2 × 24-core CPUs, 40 GB RAM
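The single-document matching step can be sketched with Lucene's MemoryIndex, which is what the description above suggests; the field names and analyzer are illustrative assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;

    public class SingleDocMatcher {
        // Returns the categories whose query matches this one document.
        static List<String> matchingCategories(String title, String description,
                                               Map<String, Query> categoryQueries) {
            Analyzer analyzer = new StandardAnalyzer();
            MemoryIndex index = new MemoryIndex();
            index.addField("title", title, analyzer);
            index.addField("description", description, analyzer);
            List<String> matches = new ArrayList<>();
            for (Map.Entry<String, Query> entry : categoryQueries.entrySet()) {
                // search() returns a relevance score, 0.0f when the query does not match
                if (index.search(entry.getValue()) > 0.0f) {
                    matches.add(entry.getKey());
                }
            }
            return matches;
        }
    }

The matching categories can then be re-run against the complete index to rank them, as described above.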

2. Implementation / get it fast

Use the right Directory implementation for your system (NRTCachingDirectory instead of the default one)
> 1 line in 1 file = 20% faster on search queries
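A minimal sketch of that one-line change, assuming Lucene's NRTCachingDirectory wrapping a standard FSDirectory (the path and cache thresholds here are illustrative, not our production values):

    import java.nio.file.Paths;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;

    public class IndexDirectory {
        static Directory open() throws Exception {
            Directory fsDir = FSDirectory.open(Paths.get("/path/to/index"));
            // cache small merged segments in RAM: up to 5 MB per merge, 60 MB in total
            return new NRTCachingDirectory(fsDir, 5.0, 60.0);
        }
    }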

Use a filter instead of a query to search on only 1 document, and use the low-level API with care

Profile your application frequently
> Identify ugly code, where to add caching, where to add concurrency
We spent 7% of the time creating Query objects for every document: instead, create them once and keep them in memory (see the sketch below)
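A minimal sketch of that fix: parse each category query once and cache the resulting Query objects (the class, field, and method names are illustrative):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;

    public class CategoryQueryCache {
        private final Map<String, Query> cache = new ConcurrentHashMap<>();
        private final QueryParser parser = new QueryParser("text", new StandardAnalyzer());

        // Parse each category query string once; reuse the Query for every document.
        public Query get(String categoryQueryString) throws Exception {
            Query cached = cache.get(categoryQueryString);
            if (cached != null) return cached;
            Query parsed;
            synchronized (parser) {           // QueryParser is not thread-safe
                parsed = parser.parse(categoryQueryString);
            }
            cache.put(categoryQueryString, parsed);
            return parsed;
        }
    }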

2. Implementation / get it fast

How to transmit documents to categorise efficiently?

By sending messages to workers

See the problem?

[Diagram: the Categorisation Supervisor pushes messages of document references (C456321;C65465;C654879;C56879;…) to three Categorisation Workers; the pushed messages pile up in the workers' mailboxes faster than they can be processed.]

2. Implementation / get it fast

Solution: the work-pulling pattern (http://www.michaelpollmeier.com/akka-work-pulling-pattern/)
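The gist of the pattern, in a minimal Akka (classic Java API) sketch: idle workers ask the supervisor for work, so no mailbox is ever flooded. The actor and message names here are illustrative, not the production ones:

    import java.util.ArrayDeque;
    import java.util.Queue;
    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;

    // worker -> supervisor: "I am idle, give me a batch"
    class GiveMeWork {}
    // supervisor -> worker: one batch of document references
    class Work {
        final String docRefs;
        Work(String docRefs) { this.docRefs = docRefs; }
    }

    class Supervisor extends AbstractActor {
        private final Queue<String> pendingBatches = new ArrayDeque<>();
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(String.class, pendingBatches::add)   // batches read from the index
                .match(GiveMeWork.class, msg -> {
                    // hand out a batch only when a worker asks for one
                    String batch = pendingBatches.poll();
                    if (batch != null) getSender().tell(new Work(batch), getSelf());
                })
                .build();
        }
    }

    class Worker extends AbstractActor {
        private final ActorRef supervisor;
        Worker(ActorRef supervisor) { this.supervisor = supervisor; }
        @Override
        public void preStart() {
            supervisor.tell(new GiveMeWork(), getSelf());   // pull the first batch
        }
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                .match(Work.class, work -> {
                    categorise(work.docRefs);                       // run the category queries
                    supervisor.tell(new GiveMeWork(), getSelf());   // then pull the next batch
                })
                .build();
        }
        private void categorise(String docRefs) { /* categorisation work goes here */ }
    }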

2. Implementation / get it fast

Applied to the Taxonomy applications

https://github.com/nationalarchives/taxonomy

There are 2 types of batch applications (each runs in its own application server)

• 1 instance of Taxonomy-cat-all-supervisor

• N instances of Taxonomy-cat-all-worker

The categorisation supervisor browses the whole index and retrieves 1,000 documents at a time

The categorisation worker receives categorisation requests that contain a list of documents to categorise
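A minimal sketch of how the supervisor could page through the whole index 1,000 documents at a time, assuming Lucene's searchAfter API (the dispatch call is a hypothetical placeholder for handing the batch to the work-pulling side):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;

    public class IndexBrowser {
        static void browseAll(Directory dir) throws Exception {
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            ScoreDoc last = null;
            while (true) {
                // fetch the next page of 1000 documents after the last one seen
                TopDocs page = searcher.searchAfter(last, new MatchAllDocsQuery(), 1000);
                if (page.scoreDocs.length == 0) break;
                dispatchToWorkers(page.scoreDocs);   // hypothetical: queue batch for workers
                last = page.scoreDocs[page.scoreDocs.length - 1];
            }
        }
        static void dispatchToWorkers(ScoreDoc[] batch) { /* send to supervisor queue */ }
    }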

2. Implementation / get it fast

Plan

Introduction

1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day

2. How we implemented it
– Get it right
– Get it fast
  • Fine tuning
  • Distributed system with Akka

3. Attempt at Machine Learning

Conclusion: learnings and next steps

http://discovery.nationalarchives.gov.uk/

Research on a training-set-based solution for 2 months

1. Take a data set of known (already classified) documents
2. Split it into a test set and a training set

– Train the system with the training set
– Evaluate it using the test set
– Iterate until satisfactory

3. Move it to production
– Classify new documents using the trained system
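Since the slides mention Lucene's built-in tool, the loop could look roughly like this with Lucene's classification module; the constructor shown follows the 6.x-era API (it differs between Lucene versions) and the field names are assumptions, not the code actually used:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.BytesRef;

    public class TrainAndClassify {
        static String classify(Directory trainingSetDir, String testDocText) throws Exception {
            IndexReader trainingReader = DirectoryReader.open(trainingSetDir);
            SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier(
                    trainingReader, new StandardAnalyzer(),
                    null,            // no filter query: train on the whole training set
                    "category",      // field holding the known class
                    "description");  // text field(s) to learn from
            // classify a test document, then compare with its known category
            ClassificationResult<BytesRef> result = classifier.assignClass(testDocText);
            return result.getAssignedClass().utf8ToString();
        }
    }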

3. Attempt at Machine Learning

Why it did not work
1. Using category queries to create the training set

– Highly dependent on the validity/accuracy of the category queries

2. Nature of our categories
– far too many (136)
– categories too vague/broad or too similar (“Poverty”, “Military”): they do not suit such a system

3. Not the right tool? We used Lucene's (a search engine's) built-in tool

4. Nature of the data? Quality of the metadata?

3. Attempt at Machine Learning

Plan

Introduction

1. Solution
– Discovery portal
– Administration UI
– Tool to categorise everything once
– Batch Job to categorise every day

2. How we implemented it
– Get it right
– Get it fast
  • Fine tuning
  • Distributed system with Akka

3. Attempt at Machine Learning

Conclusion: learnings and next steps

http://discovery.nationalarchives.gov.uk/

Conclusion: learnings and next steps

Gains and losses
– No * within words
– Categorisation 10 times faster
– Use of free solutions (*)
– Admin interface more fluid and usable

Conclusion: learnings and next steps

Possible improvements
– Update documents for 1 category on demand
– Create a more generic solution
– Add the missing GUIs (reporting, categorise all)
– Build the solution upon Solr, not Lucene
– Use cloud services instead of onsite servers

Next steps
– Categorise other archives
– Work on new born-digital records: new categories? New research on machine learning?

(*) Free solutions: Solr and Lucene

Thank you for listening

Any questions?
