chapter 9 topic detection and tracking

Upload: xenexx

Post on 10-Apr-2018


  • 8/8/2019 Chapter 9 Topic Detection and Tracking

    1/28

    Chapter 9: Topic Detection and Tracking (TDT)

    Some slides from Overview NIST Topic Detection and Tracking
      - Introduction and Overview by G. Doddington
      - http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt99/presentations/index.htm


    TDT Task Overview

    5 R&D Challenges:
      Story Segmentation
      Topic Tracking
      Topic Detection
      First-Story Detection
      Link Detection

    TDT3 Corpus Characteristics:
      Two Types of Sources: Text, Speech
      Two Languages: English (30,000 stories), Mandarin (10,000 stories)
      11 Different Sources:
        8 English: ABC, CNN, PRI, VOA, NBC, MNB, APW, NYT
        3 Mandarin: VOA, XIN, ZBN

    ** See http://www.itl.nist.gov/iaui/894.01/tdt3/tdt3.htm for details.
    ** See http://morph.ldc.upenn.edu/Projects/TDT3/ for details.


    Preliminaries

    A topic is a seminal event or activity, along with all directly related events and activities.

    A story is a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.


    Example Topic

    Title: Mountain Hikers Lost

    WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January.

    WHERE: Orres, France

    WHEN: January 1998

    RULES OF INTERPRETATION: 5. Accidents


    The Segmentation Task:

    To segment the source stream into its constituent stories, for all audio sources (for Radio and TV only).

    [Figure: a transcription stream of text (words), with segments marked as Story or Non-story]


    The Topic Tracking Task:

    To detect stories that discuss the target topic, in multiple source streams.

    Find all the stories that discuss a given target topic.
      Training: Given Nt sample stories that discuss a given target topic,
      Test: Find all subsequent stories that discuss the target topic.

    [Figure: story stream split into training data and test data; stories are marked on-topic or unknown, and unknown stories are not guaranteed to be off-topic]
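The slides define the tracking task but not a method for it. As a minimal sketch under stated assumptions, the block below tracks a topic by summing the bag-of-words counts of the Nt training stories into a centroid and answering YES for each later story whose cosine similarity to that centroid reaches a threshold. The tokenizer, the cosine measure, the threshold of 0.3, and the toy stories are all illustrative assumptions, not part of the TDT specification.

```python
import math
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words term counts (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def track(training_stories, test_stories, threshold=0.3):
    """Return one YES/NO decision per test story, in arrival order.

    training_stories: the Nt on-topic sample stories.
    threshold: hypothetical cut-off; a real system would tune it.
    """
    centroid = Counter()
    for story in training_stories:
        centroid.update(vectorize(story))
    return [cosine(centroid, vectorize(s)) >= threshold for s in test_stories]


# Toy data (invented for illustration), Nt = 2.
training = [
    "avalanche buries hikers in french alps",
    "rescue teams search for lost hikers after avalanche",
]
test = [
    "hikers found after avalanche in alps",
    "stock markets close higher on tech earnings",
]
decisions = track(training, test)
```

Each decision is made using only the training stories, which mirrors the task's requirement that test stories arrive after training; a fuller system might also adapt the centroid as on-topic stories are found.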


    Topic Detection Conditions

    Decision Deferral Conditions:
      Maximum decision deferral period in # of source files: 1, 10, 100


    The First-Story Detection Task:

    To detect the first story that discusses a topic, for all topics.

    There is no supervised topic training (like Topic Detection).

    [Figure: stories along a time axis, marked as First Stories or Not First Stories, for Topic 1 and Topic 2]
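Because there is no topic training, a first-story detector can only compare each incoming story with what it has already seen. The following is a rough sketch of one common baseline, assumed here rather than taken from the slides: a story is flagged as a first story when its best cosine similarity to every earlier story falls below a novelty threshold. The threshold of 0.2, the whitespace tokenizer, and the toy stream are illustrative assumptions.

```python
import math
from collections import Counter


def bag(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def first_story_flags(stories, novelty_threshold=0.2):
    """Flag each story, in time order, as a first story (True) or not (False)."""
    seen = []
    flags = []
    for text in stories:
        vec = bag(text)
        # Similarity to the closest earlier story; 0.0 when nothing was seen yet.
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < novelty_threshold)
        seen.append(vec)
    return flags


# Toy stream (invented for illustration) with two topics.
stream = [
    "avalanche buries hikers in french alps",
    "earthquake strikes coastal city overnight",
    "rescuers reach hikers trapped by avalanche in alps",
    "aftershocks follow earthquake in coastal city",
]
flags = first_story_flags(stream)
```

In this toy stream the first two stories open new topics and the last two are follow-ups, so only the first two are flagged.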


    The Link Detection Task

    To detect whether a pair of stories discuss the same topic.

    The topic discussed is a free variable.

    Topic definition and annotation is unnecessary.

    The link detection task represents a basic functionality, needed to support all applications (including the TDT applications of topic detection and tracking).

    The link detection task is related to the topic tracking task, with Nt = 1.

    [Figure: a pair of stories, labeled "same topic?"]


    TDT3 Evaluation Methodology

    All TDT3 tasks are cast as statistical detection (yes-no) tasks.

    Story Segmentation: Is there a story boundary here?

    Topic Tracking: Is this story on the given topic?

    Topic Detection: Is this story in the correct topic-clustered set?

    First-story Detection: Is this the first story on a topic?

    Link Detection: Do these two stories discuss the same topic?

    Performance is measured in terms of detection cost, which is a weighted sum of miss and false alarm probabilities:

    $C_{Det} = C_{Miss} \cdot P_{Miss} + C_{FA} \cdot P_{FA}$    (e.g. $C_{Miss} = 0.2$, $C_{FA} = 0.98$)

    Detection cost is normalized to lie between 0 and 1:

    $(C_{Det})_{Norm} = C_{Det} / \min\{C_{Miss}, C_{FA}\}$
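To make the cost formula concrete, here is a small sketch that computes the detection cost and its normalized form from a list of yes/no decisions. The default weights are the example values quoted on the slide; the `error_rates` helper and the toy decisions are hypothetical additions for illustration, and the helper assumes the truth list contains at least one target and one non-target.

```python
def detection_cost(p_miss, p_fa, c_miss=0.2, c_fa=0.98):
    """C_Det = C_Miss * P_Miss + C_FA * P_FA (weights from the slide's example)."""
    return c_miss * p_miss + c_fa * p_fa


def normalized_detection_cost(p_miss, p_fa, c_miss=0.2, c_fa=0.98):
    """(C_Det)_Norm = C_Det / min{C_Miss, C_FA}."""
    return detection_cost(p_miss, p_fa, c_miss, c_fa) / min(c_miss, c_fa)


def error_rates(decisions, truth):
    """Miss and false alarm probabilities from parallel lists of booleans.

    decisions[i] is the system's YES/NO answer, truth[i] the reference label.
    Assumes at least one target and one non-target in truth.
    """
    targets = sum(truth)
    non_targets = len(truth) - targets
    misses = sum(1 for d, t in zip(decisions, truth) if t and not d)
    false_alarms = sum(1 for d, t in zip(decisions, truth) if d and not t)
    return misses / targets, false_alarms / non_targets


# Toy evaluation: 4 targets, 10 non-targets, one miss and one false alarm.
truth = [True] * 4 + [False] * 10
decisions = [True, True, True, False] + [True] + [False] * 9
p_miss, p_fa = error_rates(decisions, truth)
cost = detection_cost(p_miss, p_fa)
norm_cost = normalized_detection_cost(p_miss, p_fa)

# A system that always answers NO has P_Miss = 1 and P_FA = 0.
always_no = normalized_detection_cost(1.0, 0.0)
```

With these weights, a system that always answers NO gets a normalized cost of exactly 1, so a normalized cost below 1 means the system does better than that trivial strategy.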


    Example Performance Measures:

    [Figure: Tracking Results on Newswire Text (BBN); Normalized Tracking Cost on a log axis from 0.01 to 1, for English and Mandarin]


    1999 TDT3 Tracking Results
    Required Evaluation Condition: 4 English Training Stories, Multilingual Test Texts, Newswire Text + Broadcast News ASR, Given ASR Boundaries

    [Figure: Normalized Tracking Cost on a log axis from 0.01 to 1 for systems BBN1, CMU1, Dragon1, UPenn1, GE1, UIowa1, UMd1; legend: 95% Confidence Bars, Actual Decision Cost, Minimum DET Graph Cost; one value is labeled 0.092]


    1999 TDT3 Tracking Results: Required Evaluation Condition


    Story Segmentation using

    Decision Trees


    Results


    Relevance Models and Link Detection

    Given two stories A and B
    Determine if topic(A) = topic(B)
      Estimate topic models of A and B (e.g. language models)
      Measure distance between the models (e.g. Kullback-Leibler)


    Generating Queries

    Suppose you have some source of queries
    You have generated several queries q_1 ... q_N from this source
    What is $P(q_{N+1} \mid q_1 \ldots q_N)$?

    [Figure: observed queries q_1, q_2, q_3 followed by an unknown next query "?"]


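    Universe of Models

    [Figure: queries q_1, q_2, q_3 and the unknown next query "?", with candidate models M_1, M_2, ..., M_k]

    $P(q_{N+1} \mid q_1 \ldots q_N) = \sum_{i=1}^{k} P(q_{N+1} \mid M_i) \, P(M_i \mid q_1 \ldots q_N)$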
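The sum over models can be sketched numerically. The block below assumes each model M_i is a unigram distribution over single-term queries, that queries are drawn independently given the model, and that the prior over models is uniform unless one is supplied; under those assumptions the posterior P(M_i | q_1 ... q_N) follows from Bayes' rule. The two toy models and their probabilities are invented for illustration.

```python
def posterior_over_models(models, observed, prior=None):
    """P(M_i | q_1 ... q_N) by Bayes' rule.

    models: list of dicts mapping a term to P(term | M_i).
    observed: the queries q_1 ... q_N seen so far (single terms here).
    Assumes at least one model gives the observed queries nonzero likelihood.
    """
    k = len(models)
    prior = prior or [1.0 / k] * k
    joint = []
    for p_m, model in zip(prior, models):
        likelihood = p_m
        for q in observed:
            likelihood *= model.get(q, 0.0)
        joint.append(likelihood)
    total = sum(joint)
    return [j / total for j in joint]


def predict_next(models, observed, candidate):
    """P(q_{N+1} | q_1 ... q_N) = sum_i P(q_{N+1} | M_i) * P(M_i | q_1 ... q_N)."""
    post = posterior_over_models(models, observed)
    return sum(model.get(candidate, 0.0) * p for model, p in zip(models, post))


# Two toy models (invented): one about an avalanche, one about finance.
models = [
    {"avalanche": 0.5, "hikers": 0.4, "market": 0.1},
    {"avalanche": 0.1, "hikers": 0.1, "market": 0.8},
]
observed = ["avalanche", "hikers"]
post = posterior_over_models(models, observed)
p_next = predict_next(models, observed, "hikers")
```

After seeing "avalanche" and "hikers", nearly all posterior mass sits on the first model, so the predicted probability of "hikers" is close to that model's own value of 0.4.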


    Using Relevance Models in Link Detection

    Question: Are stories S1 and S2 linked?
    Approach:
      Create a relevance model for each of S1 and S2
      Measure the distance between the models


    Building Relevance Models as Topic Models

    S1:
      Generate queries from it
      Retrieve documents from the collection
      Estimate

    $P(w \mid D) = \lambda \frac{tf_{w,D}}{|D|} + (1 - \lambda) \frac{cf_w}{\text{Coll.Size}}$

    $P(w \mid S_1) = P(w \mid q_1 \ldots q_N) = \sum_{D \in R} P(w \mid D) \, P(D \mid q_1 \ldots q_N)$
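The estimation step can be illustrated with a short sketch. It assumes the smoothing weight is a single constant (called `lam` here), that P(D | q_1 ... q_N) is proportional to the product of the smoothed query-term probabilities under a uniform document prior, and that the retrieved set R is passed in directly instead of coming from a real retrieval engine. The function names, the weight of 0.5, and the three-document collection are hypothetical.

```python
from collections import Counter


def smoothed_doc_model(doc_tokens, coll_counts, coll_size, lam=0.5):
    """Return p(w) = lam * tf(w, D) / |D| + (1 - lam) * cf(w) / collection size."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)

    def p(w):
        return lam * tf[w] / n + (1 - lam) * coll_counts[w] / coll_size

    return p


def relevance_model(query, retrieved_docs, collection, lam=0.5):
    """P(w | S) = sum over D in R of P(w | D) * P(D | q_1 ... q_N).

    Assumes every query term occurs somewhere in the collection, so that
    the document weights do not all collapse to zero.
    """
    coll_counts = Counter(t for d in collection for t in d)
    coll_size = sum(coll_counts.values())
    doc_models = [
        smoothed_doc_model(d, coll_counts, coll_size, lam) for d in retrieved_docs
    ]
    # P(D | q_1 ... q_N) proportional to prod_j P(q_j | D), uniform prior over D.
    weights = []
    for p in doc_models:
        likelihood = 1.0
        for q in query:
            likelihood *= p(q)
        weights.append(likelihood)
    total = sum(weights)
    weights = [w / total for w in weights]
    return {
        w: sum(p(w) * wt for p, wt in zip(doc_models, weights))
        for w in coll_counts
    }


# Toy collection (invented); here R is simply the whole collection.
collection = [
    "avalanche buries hikers in alps".split(),
    "hikers lost after avalanche".split(),
    "market rallies after earnings".split(),
]
query = ["avalanche", "hikers"]
model = relevance_model(query, collection, collection)
```

Because each smoothed document model sums to 1 over the collection vocabulary and the document weights sum to 1, the resulting relevance model is itself a probability distribution over words.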


    Measuring Distances

    Kullback-Leibler Distance

    $D(S_1 \| S_2) = \sum_w P(w \mid S_1) \log \frac{P(w \mid S_1)}{P(w \mid S_2)}$

    Symmetric Kullback-Leibler Distance

    $D_{sym}(S_1 \| S_2) = \frac{1}{2} \left( D(S_1 \| S_2) + D(S_2 \| S_1) \right)$

    Kullback-Leibler Distance with Clarity (GE: general English)

    $D_{Cl}(S_1 \| S_2) = \sum_w P(w \mid S_1) \log \frac{P(w \mid S_2)}{P(w \mid GE)}$
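The three measures can be compared on toy distributions. This sketch assumes each model is a dict over the same vocabulary with strictly positive (smoothed) probabilities, and the three example distributions are invented. Note that the clarity form, as written, equals D(S1 || GE) minus D(S1 || S2) by simple algebra, so larger values indicate closer models, unlike the plain KL distance.

```python
import math


def kl(p, q):
    """D(p || q) = sum_w p(w) * log(p(w) / q(w)); assumes q(w) > 0 wherever p(w) > 0."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def symmetric_kl(p, q):
    """D_sym(p || q) = 0.5 * (D(p || q) + D(q || p))."""
    return 0.5 * (kl(p, q) + kl(q, p))


def clarity_adjusted(p, q, general):
    """D_Cl(p || q) = sum_w p(w) * log(q(w) / general(w))."""
    return sum(pw * math.log(q[w] / general[w]) for w, pw in p.items() if pw > 0)


# Toy models (invented): two similar stories and a general-English background.
s1 = {"avalanche": 0.5, "hikers": 0.4, "the": 0.1}
s2 = {"avalanche": 0.4, "hikers": 0.5, "the": 0.1}
ge = {"avalanche": 0.05, "hikers": 0.05, "the": 0.9}

d_12 = kl(s1, s2)
d_sym = symmetric_kl(s1, s2)
d_cl = clarity_adjusted(s1, s2, ge)
```

The background model down-weights words that are common in general English, so agreement on topical words such as "avalanche" counts for more than agreement on "the".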


    Comparison on Training Data


    Comparison on Evaluation Data


    Distance Metric / Smoothing

    Sym. KL distance + clarity is not only the best method but is also robust against changes in the smoothing.


    Summary

    TDT:
      International Benchmark
      Various subtasks
      Link detection