The ILK Suite of Text Tools
Antal van den Bosch, ILK Research Group, Faculty of Humanities, Tilburg University, http://ilk.uvt.nl
Political Mashup Meeting, Amsterdam, March 19, 2008

The ILK Text Tools
• Text Quality Management
  – Text normalization
  – Spelling and grammar checking
  – Structured data cleaning
• Text Mining
  – Entity recognition
  – Relation finding
• Text Recommendation
  – Document recommendation
  – Expert recommendation

ILK Text Tools Applications
• Cultural Heritage
  – Historical texts: Royal Library, DBNL
  – Entity recognition: Naturalis field books
  – Structured data cleaning: Naturalis, Beeld & Geluid, Army Museum, Meertens
• Service and media industries
  – Text mining: Textkernel B.V.
  – Recommendation: Trouw

TICCL
• Text-induced corpus cleanup
  – Martin Reynaert
• Robust, scalable method for finding wordform variants
• Sensitive to morphology and context
• Knowledge-free
[Diagram: Very large corpus, Linked word list, Dirty word list, indexes]

TICCL
• hartstochtelijk hartstochtelyk hartstochtelyke hartstochtlijk hartstochtlijke hartstochtlyk hartstogtelijk hartstogtelijke hartstogtelijks hartstogtelyk
• wenkbrauwen wenkbraauwen wenkbraeuwen wenkbrauwen winkbraauwen wynbraauwen wynbrauwen
• Nederland NEDERLANDEN Nederlan Nederland Nederlanden Nederlander Nederlandse Nederlandt Nederlandts Nederlandze Nederlansch Nederlanse Nederlant Nederlants Neederland Neerland Neerlands Neerlandts Neerlants Netherlands
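
Clusters like the ones above can be roughly approximated with a much simpler technique than TICCL itself. The following is a minimal sketch, assuming plain Levenshtein edit distance with a hypothetical threshold of 2 edits. It is not TICCL's algorithm, which the slides do not detail, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_variants(focus, word_list, max_distance=2):
    """Return words within max_distance edits of focus (case-insensitive).

    max_distance is an assumed threshold chosen for this illustration.
    """
    focus_lower = focus.lower()
    return [word for word in word_list
            if word != focus
            and levenshtein(focus_lower, word.lower()) <= max_distance]


# A few wordforms taken from the clusters above.
words = ["hartstochtelyk", "hartstogtelijk", "hartstochtlijk",
         "wenkbrauwen", "Neerland"]
variants = find_variants("hartstochtelijk", words)
print(variants)
```

A pairwise comparison like this grows quadratically with the word list, so it would not scale to a very large corpus. That is one reason a dedicated, index-based method is needed.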

Other Text QM Tools
• Knowledge-free, corpus-driven
• Tokenization and sentence splitting
• Grammar checking
  – All d/t/dt errors
    • gebeurd/gebeurt, word/wordt
  – Inflectional and derivational errors
• Run-on/split detection
• Word completion
[Diagram: Dirty corpus, Cleaner corpus, Disambiguator]

MITCH: Mining Natural History
• Piroska Lendvai, Marieke van Erp, Steve Hunt
• Field books and registers describe objects in many valuable facets:
  – In ambiguous, elliptic language
  – In multiple languages
  – Describing animals, people, biotopes, geographical names, time expressions

Cleaning and overhauling data
| Auteur (author) | Determinator (identifier) | Familie (family) | Genus | Land (country) | Bewaarmethode (preservation method) |
| --- | --- | --- | --- | --- | --- |
| (Daudin, 1802) | | Bataguridae | Anolis | Cambodja | (Schild droog) |
| (Schlegel) | G. vd. Boog | Colubridae | Geophis | Indonesia | |
| Schneider | M.S. Hoogmoed | | Bufo | Suriname | |
| (Horst, 1883) | Tyler, M.J. | Hylidae | Litoria | | alcohol |
Geophis? Rhabdophis? Actual value: Geophis. Expected: Rhabdophis.
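
The "actual versus expected" check above can be illustrated with a small sketch. The slides do not say how the expected value is computed. The sketch assumes that it is the majority value among the other records that share the same values in a few context columns. The records, the choice of context columns, and the function name are all hypothetical.

```python
from collections import Counter, defaultdict


def flag_suspect_cells(records, target, context):
    """Flag records whose target value differs from the majority value
    found among the other records with the same context values."""
    by_context = defaultdict(list)
    for index, record in enumerate(records):
        key = tuple(record[column] for column in context)
        by_context[key].append(index)

    suspects = []
    for indices in by_context.values():
        for index in indices:
            others = Counter(records[i][target] for i in indices if i != index)
            if not others:
                continue  # no other records to compare against
            expected, votes = others.most_common(1)[0]
            actual = records[index][target]
            # Only flag when a strict majority of the other records disagree.
            if actual != expected and votes > sum(others.values()) / 2:
                suspects.append((index, actual, expected))
    return suspects


# Toy records invented for illustration, not taken from the Naturalis data.
records = [
    {"Familie": "Colubridae", "Land": "Indonesia", "Genus": "Rhabdophis"},
    {"Familie": "Colubridae", "Land": "Indonesia", "Genus": "Rhabdophis"},
    {"Familie": "Colubridae", "Land": "Indonesia", "Genus": "Geophis"},
    {"Familie": "Hylidae", "Land": "Suriname", "Genus": "Hyla"},
]
suspects = flag_suspect_cells(records, target="Genus", context=["Familie", "Land"])
print(suspects)
```

Flagged cells are candidates for review by a curator rather than automatic corrections, since an unusual value may simply be a rare but valid record.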

Entity type correction

Entity detection in fieldbooks
1 ex. Leptodactylus wagneri At base of tree on small island, primary forest, 20.45-22.00 u. RMNH 23865
Lithodytes lineatus, Brownsberg, aan voet, onder stuk rot hout, 13.07.1968, 8.45 u., RMNH 26076 Dorsolateraal strepen heldergeel, tekening op dijen vuurrood, veel feller als bij P. femoralis.
Gonyocephalus auritus Meyer, 3 ex. (1 juv.), Misool.Hoedt 1867.
RMNH 17656 Eleutherodactylus zeuctotylus 1 [vrouw] Lelygebergte, 4 km N.O. van airstrip, distr. Marowijne, Suriname, 19-VIII-1975, onder stuk hout, 610m, l [plus] d M. S. Hoogmoed.
→ Number, Genus, Species, Biotope, Collection Time

Entity detection in fieldbooks
• Training on labeled examples
  – Easy: short, regular entities
  – Hard: longer textual descriptions
• Metadata detection in description entities
  – Types of forest, soil, … in biotopes
  – Physical appearance, … in special comments
• By automatically learning the “grammar” of these entities (ABL)
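
The "easy" short, regular entities can often be caught with hand-written patterns, which gives a useful baseline before training on labeled examples. The following is a minimal sketch, assuming regular expressions written by inspecting the sample field book entries. The patterns and the label names `RegistrationNumber` and `Date` are illustrative assumptions. This is not the trained approach the slides describe.

```python
import re

# Hand-written patterns for short, regular entity types (illustrative only).
PATTERNS = {
    "RegistrationNumber": re.compile(r"\bRMNH \d+\b"),
    "Number": re.compile(r"\b\d+ ex\."),
    "CollectionTime": re.compile(r"\b\d{1,2}\.\d{2}(?:-\d{1,2}\.\d{2})? u\."),
    "Date": re.compile(r"\b\d{1,2}\.\d{2}\.\d{4}\b|\b\d{1,2}-[IVX]+-\d{4}\b"),
}


def detect_entities(text):
    """Return (label, matched text) pairs ordered by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


entry = ("1 ex. Leptodactylus wagneri At base of tree on small island, "
         "primary forest, 20.45-22.00 u. RMNH 23865")
entities = detect_entities(entry)
for label, value in entities:
    print(label, "->", value)
```

Patterns like these do not generalize to the "hard" entities such as biotope descriptions, whose wording varies freely. That is where learning from labeled examples becomes necessary.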

Expert search
• Toine Bogers, A Propos project
• Two types
  – Expert finding
  – Expert profiling
• Evidence of expertise
  – Content-based evidence
  – Evidence from social networks
  – Activity-based evidence
• Current results on academic workgroup
  – Content-based not better than citation-based
  – Number of citations just as good as PageRank
  – “authorship = expertise”? not 100%
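
The comparison between raw citation counts and PageRank can be made concrete on a toy citation graph. The sketch below assumes a hypothetical four-author graph and a standard power-iteration PageRank with a damping factor of 0.85. It is not the evaluation setup of the A Propos project.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank. graph maps each node to the nodes it cites."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, cited in graph.items():
            if cited:
                share = damping * rank[node] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


def citation_counts(graph):
    """Count how often each node is cited."""
    counts = {node: 0 for node in graph}
    for cited in graph.values():
        for target in cited:
            counts[target] += 1
    return counts


# Hypothetical graph: author -> authors they cite.
graph = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C", "B"]}
ranks = pagerank(graph)
counts = citation_counts(graph)
by_rank = sorted(graph, key=lambda node: -ranks[node])
by_count = sorted(graph, key=lambda node: -counts[node])
print(by_rank, by_count)
```

In this toy graph both rankings put the same authors at the top. That illustrates, but does not demonstrate, the kind of agreement reported for the academic workgroup.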

Trouw Recommender
• News article recommender for Trouw
  – Recommend related stories for article posted online
  – Editors provide feedback on recommendations
  – Approved recommendations are automatically placed online
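
Related stories are commonly found by comparing article texts. The following is a minimal sketch, assuming a bag-of-words TF-IDF representation with cosine similarity. The slides do not say which similarity measure the Trouw recommender uses, and the article texts, IDs, and function names are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build a TF-IDF weighted term vector for each tokenized document."""
    doc_freq = Counter(term for tokens in documents.values() for term in set(tokens))
    total = len(documents)
    vectors = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        vectors[doc_id] = {term: count * math.log(total / doc_freq[term])
                           for term, count in counts.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(article_id, documents, top_n=2):
    """Rank the other articles by similarity to the given article."""
    vectors = tfidf_vectors(documents)
    scores = [(cosine(vectors[article_id], vectors[other]), other)
              for other in documents if other != article_id]
    ranked = sorted(scores, reverse=True)[:top_n]
    return [other for score, other in ranked if score > 0]


# Invented example articles, already tokenized on whitespace.
articles = {
    "a1": "parliament votes for new climate law".split(),
    "a2": "climate law passes after long parliament debate".split(),
    "a3": "football club wins national cup final".split(),
    "a4": "museum opens exhibition on dutch painters".split(),
}
related = recommend("a1", articles)
print(related)
```

In the workflow described above, a list like `related` would only be a set of candidates. Editors review them, and only approved recommendations go online.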

Other ILK Text Tools
• Translation
  – Memory-based, any pair of languages
• Morpho-syntactic analysis: Tadpole
  – Part-of-speech tagging, lemmatization
  – Dependency parsing, 20 languages
• Text-to-speech conversion
  – Dutch speech synthesizer: NeXTeNS
• Word sense disambiguation, co-reference resolution, paraphrasing, named entity recognition.

Thank you
http://ilk.uvt.nl
Toine Bogers, Martin Reynaert, Piroska Lendvai, Marieke van Erp, Steve Hunt, Peter Berck, Ko van der Sloot, Herman Stehouwer, Menno van Zaanen, Tanja Gaustad, Sebastiaan Tesink, Erwin Marsi, Iris Hendrickx, Antal van den Bosch, Walter Daelemans