From keyword searching to discourse mining
Pim Huijnen, Juliette Lonij
DH2016, Kraków 15 July 2016
From: The oasis, 13 April 1912, p.9. Chronicling America: Historic American Newspapers. Lib. of Congress.
Tangherlini, T. R. and Leonard, P. (2013). Trawling in the Sea of the Great Unread: Sub-corpus
topic modeling and Humanities research, Poetics, 41: 725-749.
Van den Hoven, M., Van den Bosch, A. and Zervanou, K. (2010). Beyond Reported History:
Strikes That Never Happened. Proceedings of the First International AMICUS Workshop on
Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts,
Vienna: 20-28.
Wiedemann, G. and Niekler, A. (2014). Document Retrieval for Large Scale Content Analysis
using Contextualized Dictionaries. Terminology and Knowledge Engineering, Berlin, June
2014: https://hal.archives-ouvertes.fr/hal-01005879.
Goals researcher-in-residence project
Using extensive and context-specific word lists (‘dictionaries’) to replace the contingency of single keywords
Developing a script to extract dictionaries from literature based on topic modeling
Experimenting with tools to visualise results of dictionary searching in kranten.delpher.nl
Flexibility (evaluation based on human expertise)
Transparency (avoiding black-boxing)
Practicality (available for the wider public)
KB researcher-in-residence project
Script to extract dictionaries
Diagram (labels recovered): Topic modeling, TF-IDF
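The TF-IDF step named on this slide can be illustrated with a minimal sketch. It is not the code of the keyword_generator script; the function names `tokenize` and `tfidf_keywords` and the parameter `top_n` are hypothetical, and the topic modeling step is left out. The sketch assumes a small set of target texts (the literature) and a set of background texts to compare against.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and return its alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def tfidf_keywords(target_docs, background_docs, top_n=10):
    """Rank words from target_docs by a simple TF-IDF score.

    Term frequency is counted over the target documents only;
    document frequency is counted over target and background together.
    """
    all_docs = [tokenize(doc) for doc in target_docs + background_docs]
    n_docs = len(all_docs)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in all_docs:
        df.update(set(tokens))

    # Term frequency: how often does each word occur in the target texts?
    tf = Counter()
    for doc in target_docs:
        tf.update(tokenize(doc))

    scores = {word: tf[word] * math.log(n_docs / df[word]) for word in tf}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Words that occur in every document (such as function words) receive a score of zero and drop to the bottom of the ranking, which is why a background collection is needed. The resulting list is a candidate dictionary that still needs evaluation by a human expert.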
Visualising results of dictionary searches in Delpher
Use an OR-query to search the KB’s newspaper corpus
Visualise results on the basis of Solr’s relevancy score (min. no. of words)
(arbeid* OR bedrij* OR beheer OR controle* OR factor* OR functie* OR kost* OR leiding* OR loon* OR maatregel* OR management OR methode* OR model* OR norm* OR organisatie* OR plannen OR prijs OR productie OR rationeel OR rendement OR reorganisatie OR statistiek OR taylor OR tijd OR werkbesparing OR werkverdeeling)
kbresearch.nl/dictionary
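A minimal sketch of how such a query string could be assembled from a dictionary, and of how the "minimum number of words" idea could be checked locally on a single article. The function names and the `min_terms` parameter are hypothetical. The local check is a simple stand-in and does not reproduce Solr's relevancy scoring; a trailing `*` is treated as a prefix wildcard.

```python
import re


def build_or_query(terms):
    """Join dictionary terms into one parenthesised OR-query string."""
    return "(" + " OR ".join(terms) + ")"


def matched_terms(text, terms):
    """Return the dictionary terms that occur in the text.

    A term ending in '*' matches any token starting with that prefix;
    any other term must match a token exactly.
    """
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    hits = set()
    for term in terms:
        if term.endswith("*"):
            prefix = term[:-1].lower()
            if any(token.startswith(prefix) for token in tokens):
                hits.add(term)
        elif term.lower() in tokens:
            hits.add(term)
    return hits


def passes_threshold(text, terms, min_terms=3):
    """True if at least min_terms distinct dictionary terms occur in the text."""
    return len(matched_terms(text, terms)) >= min_terms
```

Returning the set of matched terms, rather than only a count, also gives a view of which combination of dictionary words an article actually contains.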
Challenges
Running an OR-query of 25+ (or, preferably, more) words on a dataset of 90,000,000+ documents
Accounting for particularities of the corpus:
* number of newspaper titles per year
* changes in newspaper titles over the years
* changes in article length over the years
Getting an idea of the exact combination of words in the visualised results
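One common way to account for a corpus that grows or shrinks over the years is to plot relative rather than absolute hit counts. The sketch below is an assumption about how this could be done, not a description of the project's tool; the function name `relative_frequency` and both input dictionaries are hypothetical.

```python
def relative_frequency(hits_per_year, articles_per_year):
    """Divide the number of matching articles by the total number of articles per year.

    Years without any articles in the corpus are skipped,
    so that no division by zero occurs.
    """
    result = {}
    for year, total in articles_per_year.items():
        if total > 0:
            result[year] = hits_per_year.get(year, 0) / total
    return result
```

With made-up numbers, 50 hits among 1000 articles gives 0.05, while 120 hits among 4000 articles gives 0.03: the absolute count rises although the relative share falls. Changes in article length would need a separate correction, for example counting per word instead of per article.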
Thank you!
https://github.com/jlonij/keyword_generator
http://blog.kbresearch.nl/
http://www.pimhuijnen.com