July 14, 2005 National E-Science Centre Searching Speech: A Research Agenda Douglas W. Oard College of Information Studies and Institute for Advanced Computer Studies University of Maryland, College Park


July 14, 2005 National E-Science Centre

Searching Speech: A Research Agenda

Douglas W. Oard
College of Information Studies and Institute for Advanced Computer Studies
University of Maryland, College Park


Some Grid Use at Maryland

• Global Land Cover Facility
  – 13 TB of raw and derived data from 5 satellites

• Digital archives
  – Preserving the meaning of metadata structure

• Access grid
  – No-operator information studies classroom


Expanding the Search Space

[Figure: Scanned Docs; Identity: Harriet; “… Later, I learned that John had not heard …”]


Indexable Speech

• What if we could collect “everything”?
  – 1 billion users of speech-enabled devices
  – Each producing >10K words per day
  – Much of it not worth finding

• Comparison case: Web search
  – Google indexes ~10 billion Web pages
  – Perhaps averaging ~1K words each
  – Much of it not worth finding
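The comparison between collected speech and Web search can be checked with back-of-envelope arithmetic. The sketch below uses only the round figures from the slide; the variable names are illustrative, and ">10K words per day" is treated as exactly 10K, so the speech total is a lower bound.

```python
# Back-of-envelope comparison using the round figures from the slide.
speech_users = 1_000_000_000       # users of speech-enabled devices
words_per_user_per_day = 10_000    # lower bound: ">10K words per day"
web_pages = 10_000_000_000         # ~10 billion indexed Web pages
words_per_page = 1_000             # rough average, ~1K words each

speech_words_per_day = speech_users * words_per_user_per_day
web_words = web_pages * words_per_page

print(f"speech, one day: {speech_words_per_day:.0e} words")
print(f"indexed Web:     {web_words:.0e} words")
print(f"ratio:           {speech_words_per_day / web_words:.1f}")
```

On these figures, a single day of speech is at least as large as the whole indexed Web, about 10^13 words each.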


A Web of Speech?

                                              Web in 1995              Speech in 2004
Storage (words per $)                         300K                     1.5M
Internet Backbone (simultaneous users)        250K                     30M
“Last Mile” (download time)                   1 second (no graphics)   Streaming
Display Capability (computers/US population)  10%                      100%
Search Systems                                Lycos, Yahoo


The Need for Scalable Solutions

[Figure: collection sizes in millions of hours on a logarithmic axis from 0.0001 to 10,000. Labeled collections: Speech in a day, Webcasts in a year, British Library, Shoah Foundation, SingingFish, SpeechBot, TDT]


Some Spoken Word Collections

• Broadcast programming
  – News, interview, talk radio, sports, entertainment

• Storytelling
  – Books on tape, oral history, folklore

• Incidental recording
  – Speeches, courtrooms, meetings, phone calls


Indexing Options

• Transcript-based (e.g., NASA)
  – Manual transcription, editing by interviewee

• Thesaurus-based (e.g., Shoah Foundation)
  – Manually assign descriptors to points in an interview

• Catalog-based (e.g., British Library)
  – Catalog record created from interviewer’s notes

• Speech-based (MALACH)
  – Create access points with speech processing


Supporting “Intellectual Access”

[Figure: search process diagram. Elements: Source Selection, Query Formulation, Query, Search (Search System), Ranked List, Selection, Recording, Examination, Recording, Delivery. Feedback paths: Query Reformulation and Relevance Feedback, Source Reselection]

• Speech Processing
• Computational Linguistics
• Information Retrieval
• Information Seeking
• Human-Computer Interaction
• Digital Libraries


Some Technical Challenges

• “Fast” ASR systems are way too slow
  – 6 orders of magnitude slower than tokenization

• Situational sublanguage induces variability
  – Impedes interactive vocabulary acquisition

• Knee in the WER/MAP curve comes early
  – 30-40% for broadcast news
  – Somewhere below 30% for conversations

• Skimmable summaries from imperfect ASR
  – Particularly important for linear media

• Classic IR measures focus on “documents”
  – Conversational boundaries are ambiguous
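Word error rate (WER) is the measure behind the "knee" described in this list. The sketch below shows the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example strings are illustrative and not taken from the talk.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of eight reference words.
print(word_error_rate(
    "information from the radio and from the press",
    "information from the radio and from the dress",
))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.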


Start Time Error Cost

[Figure: cost (axis from 0.0 to 1.0) as a function of start time error (axis from -5 to 5)]


Shoah Foundation Collection

• Substantial scale
  – 116,000 hours; 52,000 interviews; 32 languages

• Spontaneous conversational speech
  – Accents, elderly, emotional, …

• Accessible
  – $100 million collection and digitization investment

• Manually indexed (10,000 hours)
  – Segmented, thesaurus terms, people, summaries

• Users
  – A department working full time on dissemination


Interview Excerpt

• Audio characteristics
  – Accented (this one is unusually clear)
  – Separate channels for interviewer / interviewee

• Dialog structure
• Interviewers have different styles

• Content characteristics
  – Domain-specific terms
  – Named entity mentions and relationships


MALACH Languages

Testimonies (average 2.25 hours each), as of January 31, 2004

            English   Czech   Russian   Slovak   Polish
Collected    24,874     573     7,080      573    1,400
Cataloged    22,820     531     7,016      464      989
Indexed      22,820      22       701        0        0
Digitized    13,735     374     3,052      427      835
Completed    11,464      22       287        0        0


Observational Studies

8 independent searchers
  – Holocaust studies (2)
  – German Studies
  – History/Political Science
  – Ethnography
  – Sociology
  – Documentary producer
  – High school teacher

8 teamed searchers
  – All high school teachers

Thesaurus-based search

Rich data collection
  – Intermediary interaction
  – Semi-structured interviews
  – Observational notes
  – Think-aloud
  – Screen capture

Qualitative analysis
  – Theory-guided coding
  – Abductive reasoning


Relevance Criteria

Number of mentions, by criterion

Criterion                 All (N=703)   Think-Aloud Relevance Judgment (N=300)   Query Form. (N=248)
Topicality                535 (76%)     219                                      234
Richness                  39 (5.5%)     14                                       0
Emotion                   24 (3.4%)     7                                        0
Audio/Visual Expression   16 (2.3%)     5                                        0
Comprehensibility         14 (2%)       1                                        10
Duration                  11 (1.6%)     9                                        0
Novelty                   10 (1.4%)     4                                        2

6 scholars, 1 teacher, 1 film producer, working individually


Topicality

[Figure: bar chart of total mentions (axis from 0 to 140) by topicality facet: Object, Time Frame, Organization/Group, Subject, Event/Experience, Place, Person]

6 scholars, 1 teacher, 1 movie producer, working individually


Test Collection Design

[Figure: system components: Query Formulation, Speech Recognition, Boundary Detection, Content Tagging, Automatic Search, Interactive Selection]


Test Collection Design

[Figure: test collection pipeline. Components: Interviews, Speech Recognition, Boundary Detection, Content Tagging, Topic Statements, Query Formulation, Automatic Search, Ranked Lists, Relevance Judgments, Evaluation, Mean Average Precision]

Annotations on the figure:
  – Manual: Topic boundaries; Automatic: Topic boundaries
  – Manual: ~5 Thesaurus labels, 3-sentence summaries; Automatic: Thesaurus labels
  – Automatic: 35% interview-tuned, 40% domain-tuned
  – Training: 38 existing; Evaluation: 25 new
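Mean average precision (MAP) is the evaluation measure named on this slide. The sketch below uses the standard definition with binary relevance judgments and assumes each ranked list contains no duplicate segments. The function names, topic numbers, and segment identifiers are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """runs: topic -> ranked list of ids; qrels: topic -> relevant ids."""
    scores = [average_precision(runs.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores)


# Hypothetical topics and segment ids, for illustration only.
runs = {"1148": ["s1", "s2", "s3", "s4"], "1149": ["s9", "s7"]}
qrels = {"1148": {"s1", "s3"}, "1149": {"s7"}}
print(mean_average_precision(runs, qrels))
```

Dividing by the number of relevant items, rather than the number retrieved, penalizes a system for relevant segments it never returns.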


CLEF-2005 CL-SR Track

• Test collection distributed by ELDA
  – ~7,800 segments from ~300 English interviews
    • Hand segmented / known boundaries
  – 63 topics (title/description/narrative)
    • 38 for training, 25 for blind evaluation
    • 5 languages (EN, SP, CZ, DE, FR)
  – Relevance judgments
    • Search-guided + post-hoc judgment pools

• 5 participating teams
  – DCU, Maryland, Pitt, Toronto/Waterloo, UNED

• One required cross-site baseline run
  – ASR segments / English TD topics


Additional Resources

• Thesaurus
  – ~3,000 core concepts
    • Plus alternate vocabulary + standard combinations
  – ~30,000 location-time pairs, with lat/long
  – Both “is-a” and “part-whole” relationships

• In-domain expansion collection
  – 186,000 3-sentence summaries

• Indexer’s scratchpad notes

• Digitized speech
  – .mp2 or .mp3


English ASR

[Figure: English word error rate (%), axis from 0 to 100, plotted from Jan-02 to Jan-05, with the ASR2003A and ASR2004A systems marked]

Training: 200 hours from 800 speakers


<DOCNO>VHF00017-062567.005</DOCNO>

<KEYWORD> Warsaw (Poland), Poland 1935 (May 13) - 1939 (August 31), awareness of political or military events, schools </KEYWORD>

<PERSON> Sophie P[…], Henry H[…] </PERSON>

<SUMMARY> AH talks about the college she attended before the war. She mentions meeting her husband. She discusses young peoples' awareness of the political events that preceded the outbreak of war. </SUMMARY>

<SCRATCHPAD> graduated HS, went to college 1 year, professional college hotel management; met future husband, knew that they'd end up together; sister also in college, nice social life, lots of company, not too serious; already got news from Czechoslovakia, Sudeten, knew that Poland would be next but what could they do about it, very passive; just heard info from radio and press </SCRATCHPAD>

<ASRTEXT> no no no they did no not not uh i know there was no place to go we didn't have family in a in other countries so we were not financially at the at extremely went so that was never at plano of my family it is so and so that was the atmosphere in the in the country prior to the to the war i graduate take the high school i had one year of college which was a profession and that because that was already did the practical trends f so that was a study for whatever management that eh eh education and this i i had only one that here all that at that time i met my future husband and that to me about any we knew it that way we were in and out together so and i was quite county there was so whatever i did that and this so that was the person that lived my sister was it here is first year of of colleagues and and also she had a very strongly this antisemitic trend and our parents there was a nice social life young students that we had open house always pleasant we had a lot of that company here and and we were not too serious about that she we got there we were getting the they already did knew he knew so from czechoslovakia from they saw that from other part and we knew the in that that he is uhhuh the hitler spicy we go into this year this direction that eh poland will be the next country but there was nothing that we would do it at that time so he was a very very he says belong to any any organizations especially that the so we just take information from the radio and from the dress </ASRTEXT>
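A record in this layout can be split into its fields with a short parser. The sketch below assumes every field is a flat, non-nested element with an upper-case tag name, as in the example above. The distributed collection may wrap or escape records differently, and `parse_segment` is a hypothetical helper, not part of any released tool.

```python
import re

# Matches <TAG> ... </TAG> pairs with an upper-case tag name (assumed flat).
FIELD = re.compile(r"<(?P<tag>[A-Z]+)>(?P<body>.*?)</(?P=tag)>", re.DOTALL)


def parse_segment(record: str) -> dict:
    """Return a mapping from tag name to stripped field text."""
    return {m.group("tag"): m.group("body").strip()
            for m in FIELD.finditer(record)}


sample = (
    "<DOCNO>VHF00017-062567.005</DOCNO>\n"
    "<KEYWORD> Warsaw (Poland), schools </KEYWORD>\n"
    "<ASRTEXT> we just take information from the radio </ASRTEXT>"
)
fields = parse_segment(sample)
print(fields["DOCNO"])
print(fields["KEYWORD"].split(", "))
```

Keeping the fields separate makes it easy to index manual metadata (KEYWORD, SUMMARY, SCRATCHPAD) apart from the ASRTEXT field when comparing conditions.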


Segment duration (s)

Min.       1st Qu.   Median   Mean     3rd Qu.   Max.        NA's
-2044.00   54.01     224.90   391.70   326.00    287400.00   75031.00

44.5%??


Keywords vs. Segment duration


Nodes descending from parents of leaves


Years spoken in ASR


Min.     1st Qu.   Median   Mean     3rd Qu.   Max.
0.0000   0.0000    0.0000   0.6575   1.0000    13.0000

Spoken dates in release ASR


Current classifier performance:

MAP: 0.2374, even after mixing in scratchpad/summary text from 20NN and remixing with time-label densities estimated with a Gaussian kernel at 5x the default bandwidth

46,601 (1,175)   3,610 (169)
 1,437 (168)       613 (47)


An Example English Topic

Number: 1148

Title: Jewish resistance in Europe

Description: Provide testimonies or describe actions of Jewish resistance in Europe before and during the war.

Narrative: The relevant material should describe actions of only- or mostly Jewish resistance in Europe. Both individual and group-based actions are relevant. Type of actions may include survival (fleeing, hiding, saving children), testifying (alerting the outside world, writing, hiding testimonies), fighting (partisans, uprising, political security). Information about undifferentiated resistance groups is not relevant.


5-level Relevance Judgments

• “Classic” relevance (to “food in Auschwitz”)
  – Direct: Knew food was sometimes withheld
  – Indirect: Saw undernourished people

• Additional relevance types
  – Context: Intensity of manual labor
  – Comparison: Food situation in a different camp
  – Pointer: Mention of a study on the subject

Binary qrels


Comparing Index Terms

[Figure: mean average precision (axis from 0 to 0.5) for title queries with adjudicated judgments, by index source: ASR, Scratchpad, ThesTerm, Summary, Metadata, +Persons]


Searching Manual Transcripts

[Figure: average precision (axis from 0.0 to 1.0) for title queries with adjudicated judgments, comparing ASR, ASR+Rel+Top10, and Metadata. Labeled topics: jewish kapo(s), fort ontario refugee camp]


Category Expansion

[Figure: kNN categorization. 3,199 training segments pair spoken words (hand transcribed) with thesaurus terms; test segments pair spoken words (ASR transcript) with the thesaurus terms assigned by the categorizer, which are then indexed. F = 0.19 (microaveraged)]

[Figure: mean average precision (axis from 0.00 to 0.10) against a horizontal axis from 0.0 to 1.0 running between ASR Words and Thesaurus Terms; title queries, linear score combination, adjudicated judgments. Labeled value: 0.0941]
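The kNN categorization step named on this slide assigns thesaurus terms to a new segment by copying them from the most similar training segments. The slide does not give the features, the value of k, or the voting scheme, so the sketch below is an assumption throughout: it uses bag-of-words cosine similarity, k = 3, and similarity-weighted voting. The training texts and labels are made up for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_labels(test_text, training, k=3, top_n=2):
    """training: list of (text, labels). Returns up to top_n labels,
    ranked by the summed similarity of the k nearest training segments."""
    query = Counter(test_text.lower().split())
    nearest = sorted(
        ((cosine(query, Counter(text.lower().split())), labels)
         for text, labels in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for similarity, labels in nearest:
        if similarity > 0:              # ignore neighbours with no overlap
            for label in labels:
                votes[label] += similarity
    return [label for label, _ in votes.most_common(top_n)]


# Hypothetical training segments (hand-transcribed text plus labels).
training = [
    ("we went to school in warsaw", ["schools"]),
    ("my father worked in a factory", ["employment"]),
    ("the school teacher was kind", ["schools"]),
]
print(knn_labels("i had one year of school", training))
```

Training on clean transcripts and applying the model to ASR output, as the slide describes, means the test vocabulary is noisier than the training vocabulary, which is one plausible reason for the modest F value reported.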


ASR-Based Search

[Figure: mean average precision (axis from 0.00 to 0.10) for title queries with adjudicated judgments, by system: Inquery, Character n-grams, Okapi, Okapi + Query Expansion, Okapi + Category Expansion, Okapi + QE + CE. Annotations: +27%; average of 3.4 relevant segments in top 20]


Rethinking the Problem

• Segment-then-label models planned speech well
  – Producers assemble stories to create programs
  – Stories typically have a dominant theme

• The structure of natural speech is different
  – Creation: digressions, asides, clarification, …
  – Use: intended use may affect desired granularity
    • Documentary film: brief snippet to illustrate a point
    • Classroom teacher: longer self-contextualizing story


Activation Matrix

[Figure: matrix with axes Labels and Time]


Training Data: 196,000 Segments

Location-Time   Subject                           Person
Berlin-1939     Employment                        Josef Stein
Berlin-1939     Family life                       Gretchen Stein, Anna Stein
Dresden-1939    Schooling                         Gunter Wendt, Maria
Dresden-1939    Relocation, Transportation-rail

(Rows run along the interview time axis.)

+ Segment summaries + Indexer’s notes


Preprocessing Training Data

• Normalize labeled categories?
  – Food in hiding -> food AND hiding

• Develop class models
  – Existing hierarchy, types of personal relationships

• Determine the extent for each label and class
  – Merge the extent of repeated labels
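The last preprocessing step, merging the extent of repeated labels, can be pictured as interval merging over manually indexed segments. The sketch below is one possible reading, not the project's procedure: it assumes each segment carries a start time, an end time, and a list of labels, and it merges spans for the same label when they touch or overlap. The `max_gap` parameter and the times are hypothetical.

```python
def merge_label_extents(segments, max_gap=0.0):
    """segments: iterable of (start, end, labels), times in seconds.
    Returns label -> list of merged (start, end) extents."""
    extents = {}
    for start, end, labels in sorted(segments, key=lambda s: (s[0], s[1])):
        for label in labels:
            spans = extents.setdefault(label, [])
            if spans and start - spans[-1][1] <= max_gap:
                # Same label continues: extend the previous extent.
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return extents


# Hypothetical segment times; labels follow the training data example.
segments = [
    (0, 60, ["Berlin-1939", "Employment"]),
    (60, 150, ["Berlin-1939", "Family life"]),
    (150, 240, ["Dresden-1939", "Schooling"]),
    (240, 300, ["Dresden-1939", "Relocation"]),
]
for label, spans in merge_label_extents(segments).items():
    print(label, spans)
```

With these inputs the two Berlin-1939 segments become one extent from 0 to 150 seconds, while Employment keeps its single 0 to 60 second extent.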


Characteristics of the Problem

• Clear dependencies
  – Correlated assignment of applications
  – Living in Dresden negates living in Berlin

• Heuristic basis for class models
  – Persons, based on type of relationship
  – Date/Time, based on part-whole relationship
  – Topics, based on a defined hierarchy

• Heuristic basis for guessing without training
  – Text similarity between labels and spoken words

• Heuristic basis for smoothing
  – Sub-sentence retrieval granularity is unlikely


Modeling Location

[Figure: locations Berlin, Dresden, and Germany]

• Presence in a new location negates presence in the prior location
• Location granularity varies (inclusion relationships are known)


A Class Model for People

[Figure: example states: {father, mother, sister}, {father, mother}, {sister, friend}, nobody]

• Several people may be discussed simultaneously
• Small inventory of relationship types
• Relationship type is known for most people that are mentioned


Search

• Compute a score at each time based on:
  – How likely is each descriptor? (~TF)
  – How selective is each descriptor? (~IDF)
  – What related descriptors are active? (~expansion)

• Determine passage start time based on:
  – Score trajectory (sequence of scores)
  – Additional heuristics (e.g., pause, speaker turn)

• Rank passages based on score trajectory
  – e.g., by peak score within the passage
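The three search steps listed here can be sketched over an activation matrix that stores, for each descriptor, a likelihood at every time step. Everything specific in the sketch is an assumption made for illustration: the likelihood-times-IDF score, the fixed threshold used to pick passage boundaries, and the example numbers. The expansion term and the pause and speaker-turn heuristics are left out.

```python
import math


def score_trajectory(activation, query, doc_freq, n_segments):
    """activation: descriptor -> per-time-step likelihoods in [0, 1].
    Returns one score per time step: sum of likelihood (~TF) times IDF."""
    steps = len(next(iter(activation.values())))
    scores = [0.0] * steps
    for descriptor in query:
        idf = math.log(n_segments / (1 + doc_freq.get(descriptor, 0)))
        row = activation.get(descriptor, [0.0] * steps)
        for t, likelihood in enumerate(row):
            scores[t] += likelihood * idf
    return scores


def find_passages(scores, threshold):
    """Return (start, end, peak) for each run of time steps scoring at or
    above threshold, ranked by peak score within the passage."""
    passages, start = [], None
    for t, s in enumerate(scores + [float("-inf")]):  # sentinel closes last run
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            passages.append((start, t - 1, max(scores[start:t])))
            start = None
    return sorted(passages, key=lambda p: p[2], reverse=True)


# Hypothetical activations over six time steps.
activation = {
    "Berlin":     [0.9, 0.8, 0.1, 0.0, 0.0, 0.7],
    "Employment": [0.2, 0.9, 0.1, 0.0, 0.0, 0.3],
}
scores = score_trajectory(activation, ["Berlin", "Employment"],
                          doc_freq={"Berlin": 99, "Employment": 9},
                          n_segments=1000)
for start, end, peak in find_passages(scores, threshold=1.0):
    print(start, end, round(peak, 2))
```

Returning a start time with a peak score, rather than a fixed segment, fits the idea that knowing where to begin listening may be enough for linear media.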


Timelines for the whole

interview text


Some Open Issues

• Is the expressive power of a lattice needed?
  – An activation matrix is an unrolled lattice

• What states do we need to represent?
  – Balance fidelity, accuracy, and complexity

• How to integrate manual onset marks?

• How much training data do we need?
  – Annotating new data costs ~$100/hour

• How will people use the system we build?


Non-English ASR Systems

[Figure: word error rate (WER, %, axis from 30 to 70) plotted from 10/01 to 10/06 for Czech, Russian, Slovak, Polish, and Hungarian ASR systems. Labeled configurations: 20h + LMTr; 45h + LMTr; 84h + LMTr; 100h + LMTr; + LMTr + TC; + standard.; + adapt. Labeled WER values: 66.07%, 57.92%, 50.82%, 45.91%, 45.75%, 41.15%, 40.69%, 38.57%, 35.51%, 34.49%]


Planning for the Future

• Tentative CLEF-2006 CL-SR Plans:
  – Adding a Czech collection
  – Larger English collection (~900 hours)
    • Adding word lattice as standard data
  – No-boundary evaluation design
  – ASR training data (by special arrangement)
    • Transcripts, pronunciation lexicon, language model

• Possible CLEF-2007 CL-SR Options:
  – Add a Russian or Slovak collection?
  – Much larger English collection (~5,000 hours)?


The CLEF CL-SR Team

USA

• Shoah Foundation
  – Sam Gustman
• IBM TJ Watson
  – Bhuvana Ramabhadran
  – Martin Franz
• U. Maryland
  – Doug Oard
  – Dagobert Soergel
• Johns Hopkins
  – Zak Schefrin

Europe

• U. Cambridge (UK)
  – Bill Byrne
• Charles University (CZ)
  – Jan Hajic
  – Pavel Pecina
• U. West Bohemia (CZ)
  – Josef Psutka
  – Pavel Ircing
• UNED (ES)
  – Fernando López-Ostenero


More Things to Think About

• Privacy protection
  – Working with real data has real consequences

• Are fixed segments the right retrieval unit?
  – Or is it good enough to know where to start?

• What will it cost to tailor an ASR system?
  – $100K to $1 million per application?

• Do we need to change what we collect?
  – Speaker enrollment, metadata standards, …


Final Thoughts

• The moving hand, having writ, moves on
  – Ephemeral webcasting
  – Forgone acquisition opportunities


For More Information

• The MALACH project
  – http://www.clsp.jhu.edu/research/malach

• CLEF-2005 evaluation
  – http://www.clef-campaign.org

• NSF/DELOS Spoken Word Access Group
  – http://www.dcs.shef.ac.uk/spandh/projects/swag
