
Software Project

Artem Sokolov
Computerlinguistik, Universität Heidelberg
Summer Semester 2015

(includes material by Laura Jehl)

1 Overview
2 Projects Presentation


Overview

- Audience: bachelor students
- Requirements:
  - Programmierprüfung (programming exam)
  - participation in the Ressourcenvorkurs (preparatory course on resources)
- Duration: 1 semester
- Credits: 6 LP + 4 LP ÜK


Outline

1 Overview

2 Projects Presentation


Software Project In a Nutshell

You will have to accomplish an NLP/ML task working in autonomous teams:

1 planning

2 implementing

3 testing

4 documenting

5 packaging

6 presenting


Goal

You will learn how to:

1 develop a practical plan from an abstract task
  - reformulate the task in your own words
  - modularize the task, define inter-relations and reusable parts
  - split responsibilities among your teammates
  - prioritize the sub-tasks and develop a project timeline
  - define architecture, data structures and interfaces

2 map the plan to viable work-packages and team-member assignments
  - design common libraries and routines
  - implement the work-packages assigned to you on time
  - document and test your modules for other members to use
  - settle on means of communication and issue tracking, ensuring no organizational blocks arise

3 present your results to other teams and analyze them
  - clearly present the project's goal
  - explain the main difficulties and problems you encountered
  - demonstrate the results and explain them
  - argue for further improvements

Teamwork

- regular meetings within the team (in person or by video) and with me
- shared logbooks: wiki/notes/documents
- shared calendar
- weekly status reports from all team members
- code reviews
- refactoring


Semester Plan

Dates, events, announcements:

http://www.cl.uni-heidelberg.de/courses/ss15/softwareprojekt


Bidding on Projects

Send an email to sokolov@cl.. before Friday, 24.04, with:

1 subject: “SWP Anmeldung”

2 the projects you'd like to take, in decreasing order of priority (rate all 6)

3 the programming languages you know, each with an experience grade (1: no experience, ..., 5: mother tongue)

4 optional: names of teammates (if you already know whom you'd like to team up with)


Semester Week by Week

date         week  event             details
21.04        1     intro             today
28.04        2     team formation    next week
05.05        3     kickoff meetings  bring questions!
12.05        4     plan              2 pages, by email
19.05        5     status report     per team, in person
26.05        6     specification     slides
02.06–14.07  7–13  status reports    7 weeks; per team, in person/by email
21.07        14    presentation      slides, last meeting
28.07        15    delivery          package, hard deadline


Next meeting

Splitting into Teams – 28.04, 14:15

- teams of 3 to 5 members
- tentative project assignments
- splitting into teams (teams may be formed beforehand)
- everyone gets a project


Kickoffs

Kickoff Meetings in Teams – 05.05

- read and discuss the articles
- identify questions or unclear points
- elaborate a plan (tasks, timeline, responsibilities)


Project Plan

Project Plan – to be sent on 12.05

1 description of the task (goal)

2 ... and of the solution plan (method), in your own words

3 evaluation methods/measures

4 the datasets and tools that are necessary

- write-up: about 2 pages


1st Report

First Status Report – 19.05

1 status of the specification preparation

2 can be done by email or in person


Specification Report

Specification Report – to be handed in on 26.05

1 content:
  - task, solution approach, evaluation (from the Plan)
  - selection of required resources and methods/algorithms

2 modularization & task distribution:
  - definitions of modules/tasks
  - sub-task assignments to people
  - timeline

3 concrete planning:
  - program architecture
  - data structures
  - programming languages

- oral report with slides: 20 min. + 5 min. questions


Regular meetings

Status reports: 02.06 – 14.07

- on an individual basis with every team
- normally in person, weekly (Tue, 14h-18h)
- reserve a 30-45 min. slot between 14h and 18h, by email
- no meetings at other times
- if all goes as planned, send the accomplished milestones by email
  - still plan at least a bi-weekly meeting in person
  - 30.06: all teams report by email


Presentation

Final presentation – 21.07

1 your implementation of the task and the approach

2 presentation of evaluation results

3 “demonstration” of the running system

4 lessons learned:
  - which hypotheses proved to be wrong?
  - which new hypotheses replaced them?
  - what did not work?
  - what could be done better?

- slides, max 25 min. + 10 min. questions
- optional: poster


Delivery

Final deadline for a packaged project – 28.07

1 this is a hard deadline

2 delivery by email/link

3 “should just work”

Requirements for passing:

- participation/reporting on all dates
- plan
- specification
- final presentation
- documentation + packaging
  - documentation of the source code
  - README / INSTALL


Resources@CL

1 Resources: https://wiki.cl.uni-heidelberg.de/foswiki/bin/view/Main/Resources/webhome

2 Style guide: https://wiki.cl.uni-heidelberg.de/foswiki/bin/view/Main/Resources/SoftwareStyleguide

3 recommended: svn or git


Outline

1 Overview

2 Projects Presentation


Projects Overview

MachLearn DeepLearn CrossLang InfRetr (Big)Data HumanCompInt

1 3 4 4 3

2 4 4 3

3 4 3 3 4

4 3 4 3

5 3 3 4

6 3 4 4


Project Description Template

1 Motivation (why?)

2 Idea (how?)

3 Task (have to do this)

4 Info (recommended data, tools)

5 Research Problems (need a challenge?)


Project 1: Motivation

Corpus Harvesting from Quasi-parallel Linked Data

Motivation:

- how to induce knowledge from linked data?
- medical research uses terminology inaccessible to a general audience
  - although only one language is used
- example: linking observed symptoms to a preliminary diagnosis
- quasi-crosslingual information retrieval problem:
  - layman English → “medicalese” English
  - “brain worm” → “neurocysticercosis”


Project 1: Idea

- need to map terms from everyday English to terminology-heavy English
- learn from examples of plain-English texts and the relevant medical articles
  - how do we know about the relevance?
  - from the citations that documents make to scientific research!


Project 1: Example


Project 1: Task

1 create a large train/dev/test dataset (http://nutritionfacts.org)
  - crawl transcripts of videos, descriptions, notes and summary articles (plain English)
  - crawl and extract text from the linked research PDF articles (“medicalese” English)

2 come up with a meaningful graded-relevance scheme, e.g.
  - direct link: 1, linked from a linked article: 2, same category: 3, ...
  - consider links from additional fields: comments, doctor notes, descriptions

3 choose some (CL)IR baselines, e.g.:
  - SMT + IR engine: e.g. cdec + Lucene, Moses + Lucene
  - standard term-weighting schemes: tf-idf, BM25
  - approaches from our group
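A graded-relevance scheme like the one in step 2 could be sketched as below. This is only an illustration under stated assumptions: the `links` and `category` dictionaries are a hypothetical representation of the crawled data, the function name is invented for the example, and the level numbers simply copy the example from the slide (1 for a direct link, 2 for a second-hop link, 3 for a shared category).

```python
from collections import deque

# Level numbers follow the example scheme: 1 = direct link,
# 2 = linked from a linked article, 3 = same category.
DIRECT, SECOND_HOP, SAME_CATEGORY = 1, 2, 3


def relevance_levels(query_doc, links, category):
    """Assign a relevance level to every document related to query_doc.

    links:    dict mapping a document id to the ids it links to (hypothetical format)
    category: dict mapping a document id to a category label (hypothetical format)
    """
    levels = {}
    # Breadth-first search over the link graph, limited to two hops.
    queue = deque([(query_doc, 0)])
    seen = {query_doc}
    while queue:
        doc, depth = queue.popleft()
        if depth == 2:
            continue
        for target in links.get(doc, ()):
            if target not in seen:
                seen.add(target)
                levels[target] = DIRECT if depth == 0 else SECOND_HOP
                queue.append((target, depth + 1))
    # Documents that share the query's category but were not reached by links.
    query_cat = category.get(query_doc)
    if query_cat is not None:
        for doc, cat in category.items():
            if cat == query_cat and doc not in seen:
                levels[doc] = SAME_CATEGORY
    return levels


# Toy data, invented for the example.
links = {"video1": ["paperA", "paperB"], "paperA": ["paperC"]}
category = {"video1": "nutrition", "paperD": "nutrition", "paperE": "cardiology"}
print(relevance_levels("video1", links, category))
```

Breadth-first search guarantees that a document reachable both directly and through another article receives the closer level. Links from additional fields (comments, doctor notes) could be merged into `links` before the call.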


Project 1: Tools/Info

- Recommended Tools
  - Python (crawling: sitescraper; PDF: pypdf/pdfminer)
  - Unix tools (pdftotext, libpoppler)
  - IR engines (Lucene/Xapian)
  - Hadoop or multi-threaded evaluation
  - look at the in-house tools from our group

- Info
  - for people with an interest in the Web, CLIR and SMT
  - Data: to be created in the project
  - Evaluation metrics: MAP/NDCG/PRES
  - Literature for ideas: [Schamoni et al., 2014, Sokolov et al., 2013]
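Two of the evaluation metrics named above, NDCG and average precision (whose mean over queries is MAP), can be sketched in a few lines. This is a minimal sketch with several assumptions: gains are used linearly (some definitions use 2^gain - 1), larger gains mean more relevant documents (so level numbers where 1 is the closest link would first need to be mapped, e.g. gain = 4 - level), and the ranked list is assumed to contain every judged document. PRES is not covered.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains given in ranked order."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_gains, k=None):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_gains, reverse=True)
    if k is not None:
        ranked_gains, ideal = ranked_gains[:k], ideal[:k]
    best = dcg(ideal)
    return dcg(ranked_gains) / best if best > 0 else 0.0


def average_precision(ranked_relevant):
    """Average precision for binary judgments in ranked order (True = relevant)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


print(ndcg([3, 2, 1]))                          # perfectly ordered ranking
print(average_precision([True, False, True]))   # relevant at ranks 1 and 3
```

Computing both metrics on the same runs is cheap, and NDCG is the one that makes use of graded judgments.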


Project 1: Research Problems

- try query expansion with UMLS [Eck et al., 2004]
- improve over the CLIR baselines
- learn an SMT system from quasi-parallel data


Project 6: Motivation

Feedback-based Learning for Machine Translation

- obtaining strictly parallel data is time-consuming and expensive
- weak binary feedback (“ok”/“garbage”) is relatively cheap and easy to collect
  - it can be obtained even from monolingual users
- this kind of feedback is a natural fit for online SMT applications (smartphones)
  - where quick translation improvement is highly desirable and
  - repetitions of the same mistake annoy users


Project 6: Idea

- incorporate the binary feedback into the max-margin structural loss of the SVM
- the loss will combine both
  - the fully supervised sub-dataset (where references exist)
  - the online, weakly supervised sub-dataset (where only binary feedback is available)
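To make the idea of one loss over two kinds of supervision concrete, here is a generic hinge-loss sketch over an n-best list. It is not the update from [Saluja and Zhang, 2014]: the function names, the sparse feature dictionaries, the choice of the highest-scoring other candidate as rival, and the learning rate and margin values are all assumptions made for illustration.

```python
def dot(w, f):
    """Model score of a sparse feature dict f under weights w."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())


def add_scaled(w, f, scale):
    """In-place update w += scale * f."""
    for k, v in f.items():
        w[k] = w.get(k, 0.0) + scale * v


def supervised_update(w, nbest, costs, lr=0.1):
    """Structured hinge-loss step when a reference exists.

    nbest: list of feature dicts, one per candidate translation
    costs: cost of each candidate against the reference (e.g. 1 - sentence BLEU)
    """
    oracle = min(range(len(nbest)), key=lambda i: costs[i])
    # Cost-augmented prediction: the most violating candidate.
    fear = max(range(len(nbest)), key=lambda i: dot(w, nbest[i]) + costs[i])
    loss = (dot(w, nbest[fear]) + costs[fear]
            - dot(w, nbest[oracle]) - costs[oracle])
    if loss > 0:
        add_scaled(w, nbest[oracle], lr)
        add_scaled(w, nbest[fear], -lr)
    return max(loss, 0.0)


def feedback_update(w, nbest, shown, ok, lr=0.1, margin=1.0):
    """Hinge-loss step when only binary feedback on the shown candidate exists."""
    others = [i for i in range(len(nbest)) if i != shown]
    if not others:
        return 0.0
    rival = max(others, key=lambda i: dot(w, nbest[i]))
    gap = dot(w, nbest[shown]) - dot(w, nbest[rival])
    # "ok": the shown candidate should beat the rival by a margin;
    # "garbage": the rival should beat the shown candidate by a margin.
    sign = 1.0 if ok else -1.0
    loss = margin - sign * gap
    if loss > 0:
        add_scaled(w, nbest[shown], sign * lr)
        add_scaled(w, nbest[rival], -sign * lr)
    return max(loss, 0.0)


# Toy stream mixing both kinds of examples (features invented for the example).
w = {}
nbest = [{"lm": 1.0}, {"tm": 1.0}]
supervised_update(w, nbest, costs=[0.0, 0.5])    # reference available
feedback_update(w, nbest, shown=1, ok=False)     # user marked output as garbage
print(w)
```

Both functions update the same weight vector, so supervised and feedback-only examples can be interleaved in a single online pass.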


Project 6: Task

1 implement the approach of [Saluja and Zhang, 2014]

2 evaluate on FR-EN and PT-EN datasets


Project 6: Info/Tools

- Recommended Tools
  - decoders: Moses (C++) or cdec (C++/Python)
  - study the existing implementation for cdec: http://github.com/asaluja/cdec

- Info
  - for people with an interest in SMT, human-computer interaction, commercial NLP
  - Data
    - FR-EN: LIG corpus, http://bit.ly/1Gc679Q
    - PT-EN: part of the Unbabel corpus (available in-house)


Project 6: Research Problems

- tackle the problem of long sentences
- design a better update
- improve over the baselines


Viel Glück! (Good luck!)

Send an email to sokolov@cl.. before Friday, 24.04, with:

1 subject: “SWP Anmeldung”

2 the projects you'd like to take, in decreasing order of priority (rate all 6)

3 the programming languages you know, each with an experience grade (1: no experience, ..., 5: mother tongue)

4 optional: names of teammates (if you already know whom you'd like to team up with)


Eck, M., Vogel, S., and Waibel, A. (2004). Improving statistical machine translation in the medical domain using the Unified Medical Language System. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland.

Saluja, A. and Zhang, Y. (2014). Online discriminative learning for machine translation with binary-valued feedback. Machine Translation, 28(2):69–90.

Schamoni, S., Hieber, F., Sokolov, A., and Riezler, S. (2014). Learning translational and knowledge-based similarities from relevance rankings for cross-language retrieval. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 488–494.

Sokolov, A., Jehl, L., Hieber, F., and Riezler, S. (2013). Boosting cross-language retrieval by learning bilingual phrase associations from relevance rankings. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1688–1699.
