E-science at Språkbanken - Göteborgs universitet
LT-based E-science at Språkbanken
Språkbanken kick-off
January 2015
Definition of e-science
• E-Science (or eScience) is computationally intensive science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing; the term sometimes includes technologies that enable distributed collaboration, such as the Access Grid.
• Most of the research activities into e-Science have focused on the development of new computational tools and infrastructures to support scientific discovery.
• Digital humanities and social science
• Political
• Medical
• Historical …
• Korp/Karp front end
• Korp
• Karp
Methods
Historical resources
Modern resources
Infrastructures
Language Technology
Application areas
Front ends
One example, Lärka
Swe-Clarin
The SB definition of e-science
• IT-based research methodology
• With or without large amounts of data
• Corpus linguistics is a prominent example
• In the domain of digital humanities and social sciences
What has been done at SB?
What are we working on?
Words (multiwords)
Relations
Coordinations
Topics
Readability
MWE detection to improve parsing quality
1. Use lists as a basis for, e.g., idioms, terminology and entities
2. Add regular expressions and pattern matching to find more MWEs
3. Perform parsing
This confirmed intuition and previous experiments: pre-recognizing MWEs improves parsing (by 16%). Figure from Boleda G. & Evert S.:
“Multiword Expressions: A pain in the neck of lexical semantics”
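The three steps above can be illustrated with a minimal sketch. It is not the SB implementation: the list entries, the date pattern, the function names and the underscore-joining convention are all assumptions made for illustration, and the parser itself is left out. The idea shown is only that MWEs found by list lookup and by regular expressions are merged into single tokens before parsing.

```python
import re

# Hypothetical list-based MWEs (idioms, terminology, entities).
MWE_LIST = ["i dag", "till och med", "Göteborgs universitet"]

# Hypothetical pattern-based MWEs: here, Swedish dates such as "12 januari 2015".
MONTHS = ("januari|februari|mars|april|maj|juni|juli|augusti|"
          "september|oktober|november|december")
MWE_PATTERNS = [re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")]


def find_spans(sentence):
    """Return character spans of all MWE candidates, longest first per start."""
    spans = []
    for mwe in MWE_LIST:
        pattern = r"\b" + re.escape(mwe) + r"\b"
        spans += [m.span() for m in re.finditer(pattern, sentence, re.IGNORECASE)]
    for pattern in MWE_PATTERNS:
        spans += [m.span() for m in pattern.finditer(sentence)]
    return sorted(spans, key=lambda s: (s[0], s[0] - s[1]))


def mark_mwes(sentence):
    """Tokenize on whitespace, but join each recognized MWE into one token."""
    tokens, pos = [], 0
    for start, end in find_spans(sentence):
        if start < pos:  # overlaps an MWE that was already accepted
            continue
        tokens.extend(sentence[pos:start].split())
        tokens.append(sentence[start:end].replace(" ", "_"))
        pos = end
    tokens.extend(sentence[pos:].split())
    return tokens


tokens = mark_mwes("Vi ses i dag på Göteborgs universitet den 12 januari 2015")
print(tokens)
```

The resulting token list would then be handed to the parser, which sees each MWE as a single unit.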
Semantics in Storytelling in Swedish Fiction
Relation extraction from Swedish Prose Fiction (SPF)
• List of relations
• NEE to extract names and aliases; document-centered approach to link aliases to names
• Extract sentences with at least 2 names.
• Detect relation
Automatic detection would significantly improve coverage of relations
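A minimal sketch of the list-based procedure in the bullets above, under stated assumptions: the alias table stands in for the output of the name-extraction step, the relation word list and the example text are invented, and the function name is hypothetical. Sentences with at least two linked names are kept, and a relation is reported only when a word from the list occurs in the sentence.

```python
import re
from itertools import combinations

# Hypothetical alias table (alias -> canonical name), standing in for NEE output.
ALIASES = {"Anna": "Anna Lind", "Karl": "Karl Berg", "Herr Berg": "Karl Berg"}

# Hypothetical list of relation words (Swedish) and the relation they signal.
RELATION_WORDS = {"syster": "sibling", "bror": "sibling", "hustru": "spouse"}

TEXT = "Anna var syster till Karl. Herr Berg reste till staden. Karl och Anna möttes."


def extract_relations(text, aliases, relation_words):
    """Return (name_a, name_b, relation) for each sentence with >= 2 names."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Link every alias found in the sentence to its canonical name.
        names = set()
        for alias, canonical in aliases.items():
            if re.search(r"\b" + re.escape(alias) + r"\b", sentence):
                names.add(canonical)
        if len(names) < 2:
            continue
        # First relation word in the sentence, or None if the list has no match.
        words = [w.lower() for w in re.findall(r"\w+", sentence)]
        relation = next((relation_words[w] for w in words if w in relation_words), None)
        for a, b in combinations(sorted(names), 2):
            results.append((a, b, relation))
    return results


for triple in extract_relations(TEXT, ALIASES, RELATION_WORDS):
    print(triple)
```

The third sentence yields a name pair with no relation, which is the coverage gap that automatic relation detection would address.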
Relations between 2 males = red, between 2 females = green, otherwise blue.
Swedish Pseudo-Coordination (SPC) detection and change
• Verb pairs where the first is light
• ”åka och handla” (go and shop), ”gå och gifta sig” (go and get married), ”ringa och berätta” (call and tell)
• Typical properties apply:
• E.g., ”både” (both) is not possible: ”jag både satt och läste” (I both sat and read)
• No paraphrasing: ”Mona satt och hon läste” (Mona sat and she read).
• Try to separate SPCs from non-SPCs using these features
• False positives
• We think non-SPC, algorithm guesses SPC.
• Relaxing gives a drop in P/R
• fara, resa, trilla, varda, stog, vända, testa, mejla, maila, kommentera, blogga, googla
Precision and recall for Blogmixen
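For reference, precision and recall for such a classifier can be computed as in the sketch below. The labels are invented toy data, not Blogmixen results, and the function name is hypothetical. A false positive is counted as on the slide: the annotators say non-SPC, the algorithm guesses SPC.

```python
def precision_recall(gold, predicted):
    """Compute (precision, recall) for boolean SPC labels.

    gold[i] is True if annotators consider candidate i an SPC,
    predicted[i] is True if the algorithm guesses SPC.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly found SPCs
    fp = sum(1 for g, p in pairs if not g and p)    # non-SPC guessed as SPC
    fn = sum(1 for g, p in pairs if g and not p)    # SPC that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels for six verb-pair candidates (illustrative only).
gold = [True, True, True, True, False, False]
predicted = [True, True, False, False, True, False]
precision, recall = precision_recall(gold, predicted)
print(f"P = {precision:.2f}, R = {recall:.2f}")
```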
Topic Modeling
• SPF used as data set
• Topic Modeling applied (Mallet)
Which parts of a document belong to which topic
Which part of any document belongs to topic i
Link original resources to help validate topics
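The two views named above (topics per part of a document, and parts per topic) can be sketched as lookups over per-part topic proportions. The data layout is an assumption: it is a hypothetical table keyed by document and part index, not Mallet's actual output format, and all names and numbers are invented.

```python
# Hypothetical topic proportions per document part, e.g. exported from a
# topic model run: (document id, part index) -> proportion for each topic.
PARTS = {
    ("novel_a", 0): [0.7, 0.2, 0.1],
    ("novel_a", 1): [0.1, 0.8, 0.1],
    ("novel_b", 0): [0.2, 0.1, 0.7],
    ("novel_b", 1): [0.6, 0.3, 0.1],
}


def dominant_topic(proportions):
    """Index of the topic with the highest proportion."""
    return max(range(len(proportions)), key=proportions.__getitem__)


def topics_of_document(doc_id, parts):
    """Which parts of one document belong to which topic."""
    return {idx: dominant_topic(p)
            for (doc, idx), p in sorted(parts.items()) if doc == doc_id}


def parts_for_topic(topic, parts):
    """Which parts of any document belong to the given topic."""
    return [key for key, p in sorted(parts.items()) if dominant_topic(p) == topic]


print(topics_of_document("novel_a", PARTS))
print(parts_for_topic(0, PARTS))
```

Keeping the document id and part index in the key is what makes it possible to link back to the original resource when validating a topic.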
Readability of text
• All paragraphs assigned to topic i that are easy to read.
• Investigate different readability measures for text.
• Measures for English applied to Swedish
Readability measures are not very reliable when applied directly to Swedish texts.
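As one concrete example of a readability measure, the sketch below computes LIX (läsbarhetsindex), a measure commonly used for Swedish: average sentence length plus the percentage of words longer than six letters. The slides do not say which measures were investigated, so LIX is an assumption chosen for illustration, and the sentence splitting here is a deliberate simplification.

```python
import re


def lix(text):
    """LIX = words per sentence + 100 * (words longer than 6 letters) / words."""
    # Simplified sentence splitting on ., ! and ? (an assumption of this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, including å, ä and ö.
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


sample = "Jag läser. Universitetet publicerar forskningsresultat."
print(lix(sample))
```

A higher score indicates harder text; paragraphs assigned to a topic could be filtered by such a score to find the easy-to-read ones.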
Twitter Analysis around political debates
• Start with some hashtags, e.g., #pldebatt
• Find all tweets = core
• Train classifier to find related tweets
• Divide into known topics (from debate)
Figure labels: CORE; Topic 1 to Topic 6; October; May
E.g., Topic 10: jan björklund, allians, frisyr, åkesson, slips, siffra, sverige, prata, romson, nöjd
E.g., Topic 1: Digram attackera, fusklapp läcka, vinna, ord, dålig, analys, tydlig, jobba, önska, missa
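The first three steps of the Twitter pipeline (seed hashtags, core tweets, classifier for related tweets) might look like the sketch below. The slides do not name the classifier, so a small multinomial Naive Bayes with add-one smoothing is swapped in as an assumption. All tweets are invented, and the hand-picked unrelated background tweets used as negative examples are also an assumption of this sketch.

```python
import math
import re
from collections import Counter

SEED_HASHTAGS = {"#pldebatt"}  # seed hashtag taken from the slide


def tokenize(tweet):
    return re.findall(r"#?\w+", tweet.lower())


def split_core(tweets, hashtags=SEED_HASHTAGS):
    """Core = tweets containing a seed hashtag; the rest is unlabelled."""
    core, rest = [], []
    for tweet in tweets:
        (core if hashtags & set(tokenize(tweet)) else rest).append(tweet)
    return core, rest


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (toy stand-in)."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)
        self.counts = {label: Counter() for label in self.priors}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, doc):
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        scores = {}
        for label, counts in self.counts.items():
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(self.priors[label] / sum(self.priors.values()))
            score += sum(math.log((counts[t] + 1) / total) for t in tokens)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy tweets.
tweets = [
    "Björklund pratar om skolan #pldebatt",
    "Åkesson och Romson i debatt #pldebatt",
    "Vilken debatt, Åkesson pratar om skolan",
    "Fint väder och god lunch idag",
]
background = ["God lunch i solen", "Fint väder idag igen"]  # assumed negatives

core, rest = split_core(tweets)
model = NaiveBayes().fit(
    core + background,
    ["related"] * len(core) + ["unrelated"] * len(background),
)
related = [t for t in rest if model.predict(t) == "related"]
print(related)
```

The final step, dividing the core and related tweets into the known debate topics, is not shown here.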
Conclusions
• There are many, many interesting things to do in the field of E-science
Come and join us!
Future work
• Workshop on SB-related activities for DHSS on April 17th
• Want to present your work?