E-science at Språkbanken - Göteborgs universitet

LT-based E-science at Språkbanken Språkbanken kick-off January 2015

Post on 09-Sep-2020

Page 1: E-science at Språkbanken - Göteborgs universitet · kommentera, blogga, googla Precision and recall for Blogmixen. Topic Modeling •SPF used as data set •Topic Modeling applied

LT-based E-science at Språkbanken

Språkbanken kick-off

January 2015

Page 2

Definition of e-science

• E-Science (or eScience) is computationally intensive science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing; the term sometimes includes technologies that enable distributed collaboration, such as the Access Grid.

• Most of the research activities into e-Science have focused on the development of new computational tools and infrastructures to support scientific discovery.

Page 3

[Diagram, built up over pages 3 to 6; recoverable labels below]

Application areas

• Digital humanities and social science

• Political

• Medical

• Historical …

Front ends

• Korp/Karp front end

Infrastructures

• Korp

• Karp

Language Technology

Methods

Historical resources

Modern resources

Callouts added during the build: “One example, Lärka” (page 4); “Swe-Clarin” (page 5)

Page 7

The SB definition of e-science

• IT-based research methodology

• With or without large amounts of data

• Corpus linguistics is a prominent example

• In the domain of digital humanities and social sciences

Page 8

What has been done at SB?

What are we working on?

Words (multiwords)

Relations

Coordinations

Topics

Readability

Twitter

Page 9

MWE detection to improve parsing quality

1. Use lists as a basis for, e.g., idioms, terminology and entities

2. Add regular expressions and pattern matching to find more MWEs

3. Perform parsing

This confirmed the intuition, and previous experiments, that pre-recognizing MWEs improves parsing (by 16%). Figure from Boleda G. & Evert S.:

“Multiword Expressions: A pain in the neck of lexical semantics”
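The three steps above can be sketched roughly as follows. This is only an illustration, not the pipeline used at Språkbanken: the MWE list, the regular expression and the function name `merge_mwes` are hypothetical, and the parsing step is left out because the slides do not name a parser.

```python
import re

# Hypothetical MWE list (step 1); a real list would hold idioms, terms and entity names.
MWE_LIST = ["i alla fall", "till och med"]

# Hypothetical pattern (step 2): two or more consecutive capitalised words,
# used here as a crude way to find multiword names.
NAME_PATTERN = re.compile(r"\b[A-ZÅÄÖ][a-zåäö]+(?: [A-ZÅÄÖ][a-zåäö]+)+\b")


def merge_mwes(sentence: str) -> list[str]:
    """Return tokens in which each recognised MWE is joined into one unit."""
    # Step 1: list lookup, longest entries first so that longer MWEs win.
    for mwe in sorted(MWE_LIST, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(mwe) + r"\b", re.IGNORECASE)
        sentence = pattern.sub(lambda m: m.group(0).replace(" ", "_"), sentence)
    # Step 2: pattern matching for MWEs that the list does not cover.
    sentence = NAME_PATTERN.sub(lambda m: m.group(0).replace(" ", "_"), sentence)
    # Step 3 (parsing) would take these tokens as its input.
    return sentence.split()


tokens = merge_mwes("Hon träffade Selma Lagerlöf i alla fall")
```

Joining each MWE with underscores is one simple way to hand the parser a single token; other encodings would work equally well.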

Page 11

Semantics in Storytelling in Swedish Fiction

Relation extraction from Swedish Prose Fiction (SPF)

• List of relations

• NEE to extract names and aliases; document-centred approach to link aliases to names

• Extract sentences with at least 2 names.

• Detect relation

Automatic detection would significantly improve coverage of relations

Relations between 2 males = red, between 2 females = green, otherwise blue.
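A minimal sketch of the extraction steps listed on this slide, assuming the name and alias linking has already been done. The alias table, the cue-word list and the function name `extract_relations` are all hypothetical; the slides do not show the actual relation list or the detection method.

```python
import re
from itertools import combinations

# Hypothetical inputs: alias -> canonical name (the result of the NEE and
# alias-linking steps) and a tiny relation list mapping cue words to labels.
ALIASES = {"Anna": "Anna Berg", "fru Berg": "Anna Berg", "Karl": "Karl Berg"}
RELATION_CUES = {"gift": "spouse", "bror": "sibling", "syster": "sibling"}


def extract_relations(text):
    """Return (name_a, name_b, relation) for sentences naming at least two people."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Link each alias found in the sentence to its canonical name.
        names = sorted({canon for alias, canon in ALIASES.items()
                        if re.search(r"\b" + re.escape(alias) + r"\b", sentence)})
        if len(names) < 2:
            continue  # keep only sentences with at least 2 names
        words = re.findall(r"\w+", sentence.lower())
        # Detect a relation by looking for a cue word from the relation list.
        for cue, relation in RELATION_CUES.items():
            if cue in words:
                for a, b in combinations(names, 2):
                    results.append((a, b, relation))
    return results


relations = extract_relations("Anna var gift med Karl. Karl gick hem.")
```

Cue-word lookup is the simplest possible detector and is used here only to make the flow concrete.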

Page 13

Swedish Pseudo Coordination (SPC) detection and change

• Verb pairs where the first is light

• ”åka och handla” (go and shop), ”gå och gifta sig” (go and get married), ”ringa och berätta” (call and tell)

• Typical properties apply:

• E.g., ”både” (both) is not possible: ”jag både satt och läste” (I both sat and read)

• No paraphrasing: ”Mona satt och hon läste” (Mona sat and she read).

• Try to distinguish SPCs from non-SPCs using these features

• False positives

• We think non-SPC, algorithm guesses SPC.

• Relaxing drop in P/R

• fara, resa, trilla, varda, stog, vända, testa, mejla, maila, kommentera, blogga, googla

Precision and recall for Blogmixen
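For reference, precision and recall for the SPC classifier can be computed as in the sketch below. The gold labels and predictions are invented for illustration; the Blogmixen figures themselves are not reproduced in this transcript.

```python
def precision_recall(gold, predicted):
    """gold and predicted are parallel lists of booleans (True = SPC)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g and p for g, p in pairs)
    # False positive: we think non-SPC, the algorithm guesses SPC.
    fp = sum((not g) and p for g, p in pairs)
    # False negative: we think SPC, the algorithm guesses non-SPC.
    fn = sum(g and (not p) for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example: five candidate verb pairs.
gold = [True, True, False, False, True]
predicted = [True, False, True, False, True]
precision, recall = precision_recall(gold, predicted)
```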

Page 14

Topic Modeling

• SPF used as data set

• Topic Modeling applied (Mallet)

Which parts of a document belong to which topic

Which part of any document belongs to topic i

Link original resources to help validate topics
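The two lookups named on this slide can be sketched as follows, assuming per-paragraph topic proportions are already available from the topic model. The data structure, the identifiers and the numbers are hypothetical, and taking the highest-proportion topic as "the" topic of a part is a simplifying assumption.

```python
# Hypothetical topic proportions per paragraph; keys are
# (document id, paragraph index), values are one proportion per topic.
doc_topics = {
    ("novel_a", 0): [0.7, 0.2, 0.1],
    ("novel_a", 1): [0.1, 0.8, 0.1],
    ("novel_b", 0): [0.2, 0.1, 0.7],
    ("novel_b", 1): [0.6, 0.3, 0.1],
}


def dominant_topic(proportions):
    """Index of the topic with the highest proportion."""
    return max(range(len(proportions)), key=proportions.__getitem__)


def topics_of_document(doc_id):
    """Which parts of a document belong to which topic."""
    return {part: dominant_topic(p)
            for (d, part), p in doc_topics.items() if d == doc_id}


def parts_for_topic(i):
    """Which part of any document belongs to topic i."""
    return sorted(key for key, p in doc_topics.items() if dominant_topic(p) == i)


by_part = topics_of_document("novel_a")
topic_0_parts = parts_for_topic(0)
```

Keeping the (document, paragraph) key alongside each result is what makes it possible to link back to the original resource when validating a topic.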

Page 15

Readability of text

• All paragraphs assigned to topic i that are easy to read.

• Investigate different readability measures for text.

• Measures developed for English, applied to Swedish

Readability measures are not very reliable when applied directly to Swedish texts.
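As one concrete example of a readability measure, the sketch below computes LIX, a formula commonly used for Swedish: average sentence length plus the percentage of words longer than six letters. The slides do not say which measures were investigated, so the choice of LIX is an assumption made purely for illustration, and the tokenisation is deliberately naive.

```python
import re


def lix(text: str) -> float:
    """LIX = words / sentences + 100 * long words / words (long = more than 6 letters)."""
    words = re.findall(r"\w+", text)
    # Naive sentence split on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# Invented two-sentence example: 11 words, 2 sentences, 1 long word.
score = lix("Det här är en mening. Detta är en betydligt längre mening.")
```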

Page 19

Twitter Analysis around political debates

• Start with some hashtags, e.g., #pldebatt

• Find all tweets = core

• Train classifier to find related tweets

• Divide into known topics (from debate)

Diagram labels: CORE; Topic 1 to Topic 6; October; May

E.g. Topic 10: jan björklund, allians, frisyr, åkesson, slips, siffra, sverige, prata, romson, nöjd

E.g. Topic 1: Digram attackera, fusklapp läcka, vinna, ord, dålig, analys, tydlig, jobba, önska, missa
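The first three steps on this slide (seed hashtags, core tweets, classifier for related tweets) can be sketched as below. The slides do not say which classifier was trained, so the small Naive Bayes model here is a stand-in; the tweets are invented and the function names `train` and `is_related` are hypothetical. Only the hashtag #pldebatt comes from the slide.

```python
import math
import re
from collections import Counter

SEED_HASHTAGS = {"#pldebatt"}  # from the slide; all tweets below are invented


def tokenize(tweet):
    return re.findall(r"#?\w+", tweet.lower())


def train(core, background):
    """Count words per class: 'related' (core tweets) vs. 'other' (background)."""
    counts = {"related": Counter(), "other": Counter()}
    for label, tweets in (("related", core), ("other", background)):
        for tweet in tweets:
            # Drop the seed hashtags so the model learns from the remaining words.
            counts[label].update(t for t in tokenize(tweet) if t not in SEED_HASHTAGS)
    return counts


def is_related(tweet, counts):
    """Naive Bayes decision with add-one smoothing and equal class priors."""
    vocab = set(counts["related"]) | set(counts["other"])
    scores = {}
    for label, counter in counts.items():
        total = sum(counter.values()) + len(vocab)
        scores[label] = sum(math.log((counter[t] + 1) / total)
                            for t in tokenize(tweet) if t in vocab)
    return scores["related"] > scores["other"]


tweets = [
    "Björklund pratar skola igen #pldebatt",
    "Åkesson och Romson i debatt om jobb #pldebatt",
    "Fin promenad i solen idag",
    "Kaffe och bulle i solen",
]
# Core = every tweet carrying a seed hashtag; the rest serve as background here.
core = [t for t in tweets if SEED_HASHTAGS & set(tokenize(t))]
background = [t for t in tweets if t not in core]
model = train(core, background)
related = is_related("Romson pratar jobb och skola", model)
```

Tweets that the model marks as related could then be added to the core before dividing the whole set into the known debate topics.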

Page 20

Conclusions

• There are many, many interesting things to do in the field of E-science

Come and join us!

Future work

• Workshop on SB-related activities for DHSS on April 17th

• Want to present your work?