Corpus analysis for indexing: when corpus-based terminology makes a difference
Débora Oliveira, Luís Sarmento, Belinda Maia, Diana Santos
Linguateca
Corpus-based indexing of a specialized Web portal in PT & EN
Interdisciplinary work
– Information retrieval
– Corpus-based terminology
Corpógrafo
– Web-based environment for terminology work
Busca
– Linguateca’s site search engine
LINGUATECA
Linguateca is a distributed language resource centre for Portuguese
Aim: contributing to the quality of NLP resources for Portuguese
Increasingly large website at http://www.linguateca.pt since mid-1998
– Several on-line resources (corpora, tools, publications, etc.) produced by Linguateca
– Catalogue of resources produced by other researchers
– 1300 web documents and 2500 external links
Busca: a simple search engine
A search engine for our site:
1. Person Search (simple database query)
2. Publication Search (simple database query)
3. Simple keyword search (Free-text Search):
– Processing of rtf, ps and pdf files included
– Whole system based on CQP: “Site as a corpus”
– All words are “alike”: no TF/IDF, no document clustering, no terminological knowledge
Search systems 1 and 2 are OK, but not System 3 (too naive! too simple...)
How could we improve Busca?
Our group has extensive experience in terminology
Terminology and IR/search engines seem a “perfect match”
– BUT terminology has not been widely accepted in IR
Our question: is the knowledge of terminologically relevant units going to help us improve Busca?
– At indexing stage
– At query processing stage
– At result ranking stage
– ...
Looking at Busca logs
January 2003 - April 2005
1527 “free-text search” queries:
– Excluding own searches
– Very few queries for more than 2 years!!
Some statistics:
Repetition of the search strings
Once; 835; 77%
Twice; 170; 15%
Three times; 55; 5%
Four times; 25; 2%
Five times or more; 13; 1%
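A repetition breakdown of this kind can be computed from a query log in a few lines. The sketch below is only an illustration: the function name, the normalisation step (trimming and lowercasing) and the toy log are assumptions, not the procedure actually used on the Busca logs.

```python
from collections import Counter

def repetition_buckets(queries):
    """Count how many distinct search strings occur once, twice, and so on."""
    # Assumption: strings are compared after trimming and lowercasing.
    per_string = Counter(q.strip().lower() for q in queries if q.strip())
    labels = {1: "Once", 2: "Twice", 3: "Three times", 4: "Four times"}
    buckets = Counter()
    for count in per_string.values():
        buckets[labels.get(count, "Five times or more")] += 1
    return dict(buckets)

# Toy log, invented for illustration only.
toy_log = ["corpus", "Corpus", "verbos", "linguagem natural", "corpus "]
print(repetition_buckets(toy_log))
```

Whether "Corpus" and "corpus" should count as the same string is a design choice; without normalisation the share of strings seen only once would be higher.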
Number of queries vs size of the search string (in tokens)
1; 590
2; 242
3; 126
4; 66
5 or more; 74
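The size of a search string can be measured as its number of whitespace-separated tokens, with everything above a cap collapsed into a final "5 or more" bucket. This is a minimal sketch under that assumption; the function name and the sample strings are illustrative.

```python
from collections import Counter

def length_histogram(search_strings, cap=5):
    """Histogram of search strings by token count; bucket `cap` means 'cap or more'."""
    hist = Counter()
    for s in search_strings:
        n_tokens = len(s.split())
        if n_tokens == 0:
            continue  # ignore empty strings
        hist[min(n_tokens, cap)] += 1
    return dict(hist)

# Sample strings chosen for illustration.
sample = [
    "corpus",
    "linguagem natural",
    "corpus da folha de são paulo",
    "verbos irregulares",
]
print(length_histogram(sample))
```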
What was being searched in Busca?
search string #
Variaçoes 10
Adjunto 9
Cabeça 8
Verbos 7
Corpus 5
corpus da folha de são Paulo 5
linguagem natural 5
Peniche 5
registros doque é Conjuções coordenadas 5
Sexo 5
Tesouro 5
Tradução 5
Trail 5
About 4
Adjetivos 4
Admir 4
Árvore 4
Autor 4
Concordância 4
Consultoria 4
search string (2 or more tokens) #
corpus da folha de são paulo 5
linguagem natural 5
Registros doque é Conjuções coordenadas 5
creme de legumes 4
ele é nada mais nada menos que um idiota 4
há momentos 4
lingua portuguesa 7%AA série 4
o cortiço 4
redação coerência e coesão 4
singno linguistico 4
Vanguardaeuropeia 4
verbos irregulares 3
adjunto adniminal 3
cetem publico um milhao de palavras 3
comparable corpora 3
concordancia verbal 3
dicionário técnico 3
emprego do artigo 3
ensino%2C portugues%2C lingua estrangeira 3
floresta sintactica 3
What was being searched in Google to get to Linguateca’s site?
Search string # queries
linguateca 832
dicionario ingles portugues on line 812
literatura infantil 625
livrarias 602
portugues para estrangeiros 582
priberam 463
compara 457
avalon 451
editoras 431
power translator 431
livrarias portugal 424
dicionario portugues ingles on line 392
dicionario portugues aurelio 391
português para estrangeiros 384
dinalivro 381
dicionario portugues 360
curriculum vitae 349
dicionario portugues ingles 334
dicionario portugues on line 315
Enciclopedias 310
Word in search string # occurrences
de 36151
portugues 18102
dicionario 14228
dicionário 11725
ingles 10920
download 8757
português 8419
on 8270
line 7966
para 7941
em 6746
da 5612
inglês 5349
do 5063
e 5054
online 4953
portuguesa 4230
lingua 3350
tradução 3034
Termos 2895
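The table counts accented and unaccented spellings separately ("dicionario" and "dicionário", "portugues" and "português", "ingles" and "inglês"). One way to merge such variants is to fold diacritics before counting. The sketch below uses Python's standard `unicodedata` module; the function names are illustrative and the counts are copied from the table above.

```python
import unicodedata
from collections import Counter

def fold_accents(word):
    """Lowercase and strip diacritics, e.g. 'dicionário' -> 'dicionario'."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def merge_variants(word_counts):
    """Sum the counts of words that differ only in case or diacritics."""
    merged = Counter()
    for word, count in word_counts.items():
        merged[fold_accents(word)] += count
    return merged

# Counts taken from the table above.
rows = {
    "dicionario": 14228, "dicionário": 11725,
    "portugues": 18102, "português": 8419,
    "ingles": 10920, "inglês": 5349,
}
merged = merge_variants(rows)
print(merged.most_common())
```

Folding loses real distinctions in some Portuguese word pairs, so a search engine might fold only as a fallback when the exact spelling returns nothing.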
Overview of queries found in logs
Informatics in general
– E.g.: “CAD”, “Pascal”, “Java”, “Autocad 2000”
Topics concerning Portuguese language (literature, grammar, use)
– E.g.: “figuras de estilo”, “verbos”, “Tipos de Sujeito Indeterminado e Oração sem Sujeito”, “verbo inacusativo”, “expressões idiomáticas”.
General tools or resources.
– E.g.: “corpora”, “dicionário”, “conjugador de verbos”
Overview of queries found in logs
Specific fields or knowledge domains.
– E.g.: “extracção de informação”, “terminologia”, “semântica lexical”, “Portuguese language history”.
Queries about specific tools or resources.
– E.g.: “Cetempúblico”, “Cetenfolha” (two corpora from Linguateca), “COMPARA”, “Corpógrafo”
Queries that seem to be intended for our on-line concordance tools rather than for the search engine.
– E.g.: “sem nada”, "abonad.+", "ansioso para", “porém (ocorrências)”.
Some conclusions
All six cases suggest that users have:
– different goals in mind
– different knowledge about the content of the site
Users ARE familiar with terminological units (TUs):
– especially noun phrases
– use them in search expressions naturally, even if the TUs are inappropriate with respect to the content of our website
Sometimes users type incomplete, ill-defined or misspelled terminological units.
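Misspelled units such as “adjunto adniminal” or “singno linguistico” (both found in the logs) can often be mapped back to a known TU by approximate string matching. The sketch below uses `difflib.get_close_matches` from the Python standard library. The list of known TUs, the function name and the cutoff of 0.6 are assumptions made for illustration; this is not how Busca works.

```python
import difflib

def suggest_tus(query, known_tus, n=3, cutoff=0.6):
    """Return up to n known TUs that resemble a possibly misspelled query."""
    by_lower = {tu.lower(): tu for tu in known_tus}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lower), n=n, cutoff=cutoff
    )
    return [by_lower[h] for h in hits]

# Hypothetical list of validated TUs.
KNOWN_TUS = [
    "signo linguístico",
    "adjunto adnominal",
    "verbo intransitivo",
    "concordância verbal",
]

for typed in ["adjunto adniminal", "singno linguistico"]:
    print(typed, "->", suggest_tus(typed, KNOWN_TUS))
```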
Initial improvements for Busca
Each document in the site should be indexed using only the TUs it contains
Quite easy if the complete list of TUs is known: the Corpógrafo may help us in this!
Knowing all possible variants and synonyms of a given TU
For more problematic search strings (ambiguous, incomplete) > set of TUs suggesting re-formulation to user
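Indexing each document by the TUs it contains can be sketched as an inverted index restricted to a known TU list. The code below is a minimal illustration, assuming the TU list is already available (for example, exported from the Corpógrafo). The tokenizer, the function names and the two toy documents are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase word tokens; \\w also matches accented letters in Python 3."""
    return re.findall(r"\w+", text.lower())

def index_by_tus(documents, tus, max_len=5):
    """Map each TU to the set of document ids in which it occurs."""
    tu_set = {tuple(tokenize(tu)) for tu in tus}
    index = defaultdict(set)
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        for start in range(len(tokens)):
            # Try every n-gram of up to max_len tokens starting here.
            for n in range(1, max_len + 1):
                gram = tuple(tokens[start:start + n])
                if len(gram) == n and gram in tu_set:
                    index[" ".join(gram)].add(doc_id)
    return dict(index)

# Toy documents, invented for illustration.
docs = {
    "d1": "Um sistema de tradução baseado em corpus.",
    "d2": "O corpus e o dicionário.",
}
tus = ["sistema de tradução", "corpus", "dicionário"]
tu_index = index_by_tus(docs, tus)
print(tu_index)
```

Words that are not part of any TU ("um", "baseado", "em") never enter the index, which is the point of the proposal: the index stays small and every key is a meaningful unit.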
Empirical work
Subcorpus - 178 files in Portuguese
Total number of tokens approximately 1M.
Corpógrafo > extracted and manually validated 1209 TUs
TUs by number of words:
1 word; 27%
2 words; 42%
3 words; 18%
4 words; 9%
5+ words; 4%
[Figure: Frequency and distribution of the 1209 TUs extracted, with Regions 1, 2 and 3 marked. The axes are set to logarithmic scale.]
Explanation of chart
Region 1: frequent but not widely distributed TUs. E.g.: “modelo coclear”, “taxa de disparos” - usually compound words.
Region 2: frequent and widely distributed TUs. E.g.: “análise”, “corpus”, “modelo”, “linguística”, etc. - usually very generic TUs, and/or single words (they nevertheless have multiple possible modifiers).
Region 3: where less frequent and less distributed TUs may be found. E.g.: “verbo intransitivo”, “relação semântica”, “vibração macromecânica”.
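The three regions can be approximated by thresholding each TU's total frequency and the number of documents it appears in. The sketch below is only an illustration: the two thresholds and the per-TU figures are invented, since the slides give no numeric region boundaries.

```python
def region(freq, n_docs, freq_min=50, docs_min=20):
    """Assign a TU to Region 1, 2 or 3; thresholds are hypothetical."""
    frequent = freq >= freq_min
    widespread = n_docs >= docs_min
    if frequent and widespread:
        return 2  # frequent and widely distributed
    if frequent:
        return 1  # frequent but concentrated in few documents
    if not widespread:
        return 3  # less frequent and less distributed
    return None   # infrequent yet widespread: outside the three regions

# (frequency, number of documents) pairs, invented for illustration.
toy_stats = {
    "modelo coclear": (120, 3),
    "corpus": (900, 110),
    "verbo intransitivo": (4, 2),
}
regions = {tu: region(f, d) for tu, (f, d) in toy_stats.items()}
print(regions)
```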
Items to help searches
Synonyms Portuguese (53 pairs) - E.g.: “adjectivo: adjetivo”, “bibliografia: documento: publicação”;
Translation equivalents between Portuguese-English (107 pairs) - E.g.: “dicionário: dictionary”;
Synonyms English (23 pairs) - E.g.: “parsing system: parser”;
Acronyms in Portuguese and English (81) - E.g.: “RI: Recuperação de Informação”.
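Lists like these lend themselves to query expansion: a query that matches any member of an equivalence group is expanded to the whole group before searching. The sketch below uses only the example pairs quoted above. The data structure and function names are assumptions, not a description of Busca.

```python
# Equivalence groups built from the examples on this slide.
GROUPS = [
    {"adjectivo", "adjetivo"},
    {"bibliografia", "documento", "publicação"},
    {"dicionário", "dictionary"},
    {"parsing system", "parser"},
    {"ri", "recuperação de informação"},
]

def build_lookup(groups):
    """Map every term to the full set of terms it is equivalent to."""
    lookup = {}
    for group in groups:
        for term in group:
            lookup.setdefault(term, set()).update(group)
    return lookup

def expand(query, lookup):
    """Return the query plus its known equivalents, sorted for stable output."""
    key = query.strip().lower()
    return sorted(lookup.get(key, {key}))

LOOKUP = build_lookup(GROUPS)
print(expand("RI", LOOKUP))
print(expand("Adjetivo", LOOKUP))
print(expand("corpus", LOOKUP))
```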
POS occur. % Examples
CN + ADJ 504 41,6 vagueza gramatical, sumarização automática
CN 226 18,7 dicionário, gramática
CN + PRP + CN 178 14,7 sistema de tradução, sinal de fala
PN 52 4,3 COMPARA, Corpógrafo
CN + PRP + CN + ADJ 37 3,1 reconhecimento de dígitos isolados, resolução da ambigüidade lexical
CN + PN 35 2,9 dicionário Aurélio, sistema Edite
CN + PRP + CN + PRP + CN 28 2,3 arquitectura do sistema de interrogações, processo de aquisição de vocabulário
CN + ADJ + PRP + CN 20 1,7 Legendagem automática de notícias, reconhecimento óptico de caracteres
CN + PRP + PN 19 1,6 modelo de Kanis-Deboer, teorema de Bayes, rede de Elman
Acronym/abbreviation 14 1,2 bd, cce, IA, lil
CN + ADJ + PRP + CN + ADJ 9 0,7 processamento automático da linguagem natural, criação semi-automática de recursos lexicais
CN + ADJ + PRP + PN 3 0,2 modelo auditivo de Seneff, modelo coclear de Goldstein
Other POS structures 84 7
The distribution of existing POS structures (ADJ – adjective; CN – common noun; PN – proper noun; PRP – preposition)
Semantic Classification 1
Language resources. E.g.: “corpora”, “CETEMPúblico”, “dicionário”, “Wordnet”, “COMPARA”, etc.
Tools and systems. E.g.: “anotador”, “analisador morfológico”, “Corpógrafo”, etc.
Actions and processes. E.g.: “aquisição de vocabulário”, “extracção de terminologia”, “anotação de corpora”.
Semantic Classification 2
Specific theories and models. E.g.: “modelo auditivo de Seneff”, “algoritmo de Earley”, etc.
Linguistic concepts and phenomena. E.g.: “polissemia”, “ambiguidade lexical”, “verbo inacusativo”, “advérbio de tempo”, “adjectivo”, etc.
Disciplines or knowledge fields. E.g.: “lexicografia”, “engenharia da linguagem”, “inteligência artificial”, “semântica lexical”, etc.
Suggestions
For:
– Improvement of Busca’s search capabilities
– User satisfaction.
Easier searching
Single words
– Suggest possible modifiers of word
– With names of resources > direct the user to the resource – e.g. COMPARA
Mechanism to cope with different varieties of spelling in Portuguese
Synonym lists, acronym lists and translation equivalents
Clustering of results
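Suggesting modifiers for a single-word query can be sketched as a lookup of all multi-word TUs that begin with that word. The TUs below are taken from examples elsewhere in these slides. The function name is illustrative, and matching on the first token only is a simplifying assumption.

```python
def suggest_modifiers(word, tus):
    """Return multi-word TUs whose first token is the given word."""
    head = word.strip().lower()
    matches = [
        tu for tu in tus
        if len(tu.split()) > 1 and tu.split()[0].lower() == head
    ]
    return sorted(matches, key=str.lower)

# TUs quoted in the slides.
TUS = [
    "modelo coclear",
    "modelo auditivo de Seneff",
    "modelo de Kanis-Deboer",
    "verbo intransitivo",
    "modelo",
]
print(suggest_modifiers("modelo", TUS))
```

A user who types only “modelo” could then be offered the more specific units instead of every page containing the bare word.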
More suggestions
Semantic classification of keywords + pragmatic rules of thumb
If interested in a particular technology/tool/resource > systems that apply or implement such a technology or function
E.g. “morphology” > choice
– “scientific discipline”
– “applications that deal with morphology” (morphological analysers, stemmers, morphological generators, POS taggers)
– “specific systems that perform any of these tasks” (Palavroso, PALMORF, etc.)
– “evaluation”
More suggestions
Manually select correct semantic classification of each TU (partially done)
Automatic text categorization system
Corpógrafo tools for finding semantic relations and building thesauri/ontologies for helping navigation
ETC
Conclusions on interdisciplinary work
Requires
– Mutual understanding
– Tolerance
– Mental gymnastics
Exemplified here with
– Computer science
– Computational linguistics
– Terminology
Thank You!
Contact:
– www.linguateca.pt
– www.linguateca.pt/corpografo
Débora Oliveira: [email protected]
Luís Sarmento: [email protected]
Belinda Maia: [email protected]
Diana Santos: [email protected]