GrETEL showcase
Introduction to GrETEL, a search engine for linguists to query treebanks by example
GrETEL: an introduction
Liesbeth Augustinus
Vincent Vandeghinste Ineke Schuurman Frank Van Eynde
CCL, KU Leuven
GrETEL
GrETEL (Greedy Extraction of Trees for Empirical Linguistics)
• linguistic search engine taking examples in natural language as input
– you don’t need to know a formal query language
– you don’t need to be familiar with the annotation scheme used in a specific treebank
3
Nederbooms
GrETEL was created as part of the CLARIN-VL
project “Nederbooms”.
• Originally, only the LASSY treebank (written
Dutch, +/- 1 million words) was supported
• Currently (version 1.2), the CGN treebank
(spoken Dutch, also +/- 1 million words) is
also supported
4
GrETEL
In the present version of GrETEL, you must first
decide whether you are interested in a corpus of
spoken Dutch (CGN) or one of written Dutch (Lassy).
You can do this using the links on the left or
those in the main text.
5
[Screenshot: Nederbooms.png]
6
Further options
Note that the left navigation bar (previous slide) also lets you use
• XPath search (formal query language)
• String search (regular text search plus regular expressions)
We suggest that you do not use these until you are familiar with GrETEL, including the XPath options it offers (with a fall-back option!)
7
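As a rough illustration of what a string search with regular expressions does, here is a minimal Python sketch. The sentences, the pattern, and the function name `string_search` are made up for this example; they are not taken from GrETEL or from the treebanks.

```python
import re

# Hypothetical mini-corpus; a real search would run over the treebank text.
SENTENCES = [
    "Een aantal mensen lacht.",
    "De mensen lachen.",
    "Er staan een aantal auto's buiten.",
]

# Case-insensitive pattern: the words "een aantal" followed by one more word.
PATTERN = re.compile(r"\been aantal \w+", re.IGNORECASE)


def string_search(sentences, pattern):
    """Return (sentence, matched text) pairs for every sentence that matches."""
    hits = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match:
            hits.append((sentence, match.group(0)))
    return hits


for sentence, matched in string_search(SENTENCES, PATTERN):
    print(matched, "<-", sentence)
```

A search like this sees only the words: it cannot tell whether the phrase functions as a subject, which is what the example-based search is for.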
Further options 2
Furthermore, the left navigation bar offers
access to
• Manuals and further documentation
(papers, slide shows, …)
• The Alpino parser, i.e., the parser used for
the corpora as they appear in GrETEL
• Tree viewers, i.e., tools for visualizing
syntactic tree structures
8
How to …
In the next slides we will show how
GrETEL can be used, step-by-step.
Good luck!
Step 1
• Insert a sentence (or part of a sentence) that is representative of the type of construction you are interested in.
– Note that the role of this construction in the example matters for what follows: if you are looking for een aantal mensen (several people) as a subject, it should function as the subject in the example sentence as well
• Click on Submit
Upper part, 2nd page
(See the next slides for an explanation of the
matrix shown above)
Lower part, 2nd page
The guidelines for filling out the matrix explain
the meaning of the various options (cf. slides 12/13).
Below them you can submit your preferences (as
stated in the matrix), return to the previous
page to adapt the input sentence (back), or
insert a new query (which has the same result at this level)
12
2nd page, step 1
The matrix asks you to state how similar
to the input sentence the results should
be:
The first option (pos) results in similar
constructions
The last option (token) results in exactly the
same construction
Note, however, that for full sentences, the
chance of finding an identical sentence
will be small!
13
2nd page, step 1 (bis)
The option extended pos allows you to make a
more fine-grained selection, for example to
differentiate between singular nouns and
plural nouns.
• Note that for pronouns, this option will have
more or less the same result as token, as these
have very detailed tags (± 190) in both Lassy
and CGN, almost one per token
The option lemma will search for sentences
with the same word, but not necessarily the
same form of that word.
14
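One way to picture the four matrix options is as increasingly specific attribute tests on a word. The Python sketch below is only an illustration under stated assumptions: it assumes Alpino-style XML in which a word node carries `pt`, `postag`, `lemma`, and `word` attributes, and both the toy node and the helper `constraint_for` are hypothetical, not GrETEL's actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical word node, hand-written in an Alpino-like style (assumption).
NODE = ET.fromstring(
    '<node rel="hd" pt="n" postag="N(soort,mv,basis)" '
    'lemma="mens" word="mensen"/>'
)

# Assumed mapping from each matrix option to the attribute it constrains.
LEVEL_TO_ATTRIBUTE = {
    "pos": "pt",
    "extended pos": "postag",
    "lemma": "lemma",
    "token": "word",
}


def constraint_for(node, level):
    """Return an XPath-style attribute test for one word at the given level."""
    attribute = LEVEL_TO_ATTRIBUTE[level]
    return '@{}="{}"'.format(attribute, node.get(attribute))


for level in LEVEL_TO_ATTRIBUTE:
    print(level, "->", constraint_for(NODE, level))
```

Under these assumptions, pos only requires a noun, extended pos additionally requires a plural noun, lemma accepts any form of the same word, and token accepts only this exact form.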
2nd page, step 1 (bis)
Note that you can mix the options, and that parts of the example can also be marked as optional (cf. slide 10)
• Leaving all parts optional, however, will result in an error message. In that case, clicking on ‘back’ will return you to the page where you can specify what you want.
Clicking on ‘show parse tree’ results in a tree at the bottom of the page (cf next slide)
Bottom, 2nd page
This is the tree that was automatically created for
your example sentence. It allows you to check
whether the sentence is analyzed correctly. If
not, you may want to adapt the input sentence
slightly (using the back option).
16
Page 2, step 2
After selecting the relevant parts in the matrix, you can specify whether
• the word order should be respected, i.e. whether the subject should be in first position, as in the example sentence
• the dominating node should be ignored. This is mainly relevant for the distinction between main clause and subordinate clause
• extended pos should be split. This is recommended if you may want to adapt the XPath query later in the process
This can be done using the Options, cf. slide 10
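The first two options can be thought of as switches on the pattern. The sketch below is a simplified illustration, not how GrETEL implements them: it assumes Alpino-like nodes with `cat`, `rel`, and `begin` attributes, and the three toy clauses and the function `matches` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written toy clauses in an Alpino-like style (assumption);
# `begin` is the position of the constituent's first word.
MAIN = ET.fromstring(            # "mensen lachen"
    '<node cat="smain">'
    '<node rel="su" pt="n" word="mensen" begin="0"/>'
    '<node rel="hd" pt="ww" word="lachen" begin="1"/>'
    '</node>'
)
SUBORDINATE = ET.fromstring(     # "(dat) mensen lachen"
    '<node cat="ssub">'
    '<node rel="su" pt="n" word="mensen" begin="1"/>'
    '<node rel="hd" pt="ww" word="lachen" begin="2"/>'
    '</node>'
)
INVERTED = ET.fromstring(        # "(vandaag) lachen mensen"
    '<node cat="smain">'
    '<node rel="hd" pt="ww" word="lachen" begin="1"/>'
    '<node rel="su" pt="n" word="mensen" begin="2"/>'
    '</node>'
)


def matches(clause, respect_order=False, ignore_top=False):
    """Test a clause against the pattern 'main clause with subject and verb'."""
    # Ignoring the dominating node drops the test on the clause category.
    if not ignore_top and clause.get("cat") != "smain":
        return False
    subject = clause.find("node[@rel='su']")
    head = clause.find("node[@rel='hd']")
    if subject is None or head is None:
        return False
    # Respecting word order requires the subject to precede the verb.
    if respect_order:
        return int(subject.get("begin")) < int(head.get("begin"))
    return True


print(matches(SUBORDINATE), matches(SUBORDINATE, ignore_top=True))
print(matches(INVERTED), matches(INVERTED, respect_order=True))
```

In this toy setting, ignoring the dominating node lets the subordinate clause match as well, and respecting word order excludes the clause in which the verb precedes the subject.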
Upper part, 3rd page
The trees show which parts were selected: in red
in the left tree, isolated in the right tree
Lower part, 3rd page
19
Step 4 (optional)
Below the trees, you will see the XPath
query that was generated automatically
and that reflects your choices. It will be used to
select similar constructions in the corpus.
• You may want to adapt it, but you don’t have to! Note
that you can always fall back on the original one (link just
above the box with the query)
• The slide shows the effect of splitting up a complex tag, which
makes it easier to adapt the query (cf. slide 19)
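To give an impression of what such a query selects, the sketch below runs a comparable search over a toy tree in Python. Everything in it is an assumption for illustration: the tree is hand-written and simplified (a real parse may differ), the attribute names follow an Alpino-like style, and the query string is not one generated by GrETEL. Python's standard library supports only a subset of XPath, so the nested conditions are checked in code.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written tree for "een aantal mensen lacht" (assumption).
TREE = ET.fromstring("""
<alpino_ds>
  <node rel="top" cat="top">
    <node rel="--" cat="smain">
      <node rel="su" cat="np">
        <node rel="det" cat="np">
          <node rel="det" pt="lid" lemma="een" word="een"/>
          <node rel="hd" pt="n" lemma="aantal" word="aantal"/>
        </node>
        <node rel="hd" pt="n" lemma="mens" word="mensen"/>
      </node>
      <node rel="hd" pt="ww" lemma="lachen" word="lacht"/>
    </node>
  </node>
</alpino_ds>
""")

# How the query might look for a full XPath engine (illustrative only).
FULL_XPATH = ('//node[@rel="su" and @cat="np" and node[@rel="det"] '
              'and node[@rel="hd" and @pt="n"]]')


def find_subject_nps(tree):
    """Find subject NPs that have a determiner and a noun as head."""
    found = []
    for np in tree.findall(".//node[@rel='su'][@cat='np']"):
        has_det = np.find("node[@rel='det']") is not None
        has_noun_head = np.find("node[@rel='hd'][@pt='n']") is not None
        if has_det and has_noun_head:
            found.append(np)
    return found


def words(node):
    """Collect the words below a node in document order."""
    return [n.get("word") for n in node.iter("node") if n.get("word")]


for np in find_subject_nps(TREE):
    print(" ".join(words(np)))
```

Adapting the query then amounts to loosening or tightening individual tests, for example dropping the test on the determiner.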
Step 5
• Here you can select the parts of the corpus you
want to use. Click on Treebank when you want to
use part of the corpus, and after that on the parts
you are interested in.
• You can also specify whether you are interested in
some context.
At the bottom you can submit your choices, reset
them (which restores the default state), go back one page, or
submit a new query
Upper part, 4th page
Above you see the (modified) XPath, which you may want to
download for further use. Below it are listings of the results;
the one on the right can be shown or hidden.
Middle part, 4th page
In the middle of the page, the selected sentences
are shown (cf. next slide)
Lower part, 4th page
This table contains all results:
• Clicking on SENTENCE ID shows the tree on this page (previous slide)
• Clicking at the right brings up either a full-page tree or the XML format
• At the top you can download all results, in either a printer-friendly or a ‘machine-friendly’ format.
25
And now …
Give GrETEL a try !!
Good luck!