topic detection and tracking using web mining



  • International Journal of Exploring Emerging Trends in Engineering (IJEETE)

    Vol. 02, Issue 01, JAN, 2015 WWW.IJEETE.COM

    ISSN 2394-0573 All Rights Reserved 2014 IJEETE Page 5

    TOPIC DETECTION AND TRACKING USING WEB MINING

    Nain Kanwal Kaur

    Assistant Professor, Department of Computer Science and Engineering,

    Continental Institute of Engineering and Technology, Jalvehra, Punjab, India

    [email protected]

    Abstract:- Web mining is the application of data mining techniques to discover patterns from the Web. Topic tracking is one of the technologies that has been developed for, and can be used in, the text mining process. The main purpose of topic tracking is to identify and follow events presented in multiple news sources, including newswires and radio and TV broadcasts. This paper presents a survey of topic tracking techniques.

    Keywords- Text mining, topic detection, topic tracking

    I INTRODUCTION

    The World Wide Web (WWW) is a popular, interactive medium, and the amount of data and information available on it continues to grow tremendously. It is a collection of documents, text files, images, and other forms of data in structured, semi-structured, and unstructured form. The primary aim of web mining is to extract useful information and knowledge from the Web.

    Raw Data -> Patterns -> Knowledge

    Web mining is used to capture relevant information, create new knowledge from relevant data, personalize information, and learn about consumers or individual users, among other purposes. Depending on the type of data being mined, web mining can be divided into three categories:

    (i) Web usage mining,

    (ii) Web content mining and

    (iii) Web structure mining.

    II WEB CONTENT MINING

    Web content mining is the mining, extraction, and integration of useful data, information, and knowledge from Web page content. Heterogeneity and a lack of structure permeate much of the ever-expanding body of information sources on the World Wide Web. Research in this field also draws on techniques from other disciplines such as information retrieval (IR) and natural language processing (NLP) [12].

    III TEXT MINING

    Text mining is a new area of computer science with strong connections to natural language processing, data mining, machine learning, information retrieval, and knowledge management. Several approaches exist for identifying patterns, including automated classification and clustering [14]. The field has received a great deal of attention because of the ever-increasing need to manage the information that resides in the vast number of available documents [16].

    Figure 1. Typical text mining process [6]


    IV TOPIC DETECTION AND TRACKING (TDT)

    Topic detection and tracking (TDT) applications aim to organize the temporally ordered stories of a news stream according to the events they describe. A topic tracking system works by keeping user profiles and, based on the documents a user views, predicting other documents of interest to that user [5]. The task of topic tracking is to monitor a stream of news stories and find those that discuss the same topic as a few positive sample stories [20]. It brings dispersed information together and makes it easy for the user to gain a general understanding [11]. Topic tracking can be applied in many areas of industry. It can be used to alert companies whenever a competitor is in the news. It could also be used in the medical industry by doctors and others looking for new treatments.

    The tasks of TDT can be summarized as:

    1. The Topic Tracking Task: The TDT topic tracking task is defined as the task of associating incoming stories with topics that are known to the system. A topic is known by its association with stories that discuss it, so each target topic is defined by one or more on-topic stories. To support this task, a small set of on-topic training stories is identified for each topic to be tracked.

    2. The Supervised Adaptive Tracking Task: An optional variant of the topic tracking task is supervised adaptive tracking. This task is identical to the topic tracking task except that, for each story judged to be on-topic, the relevance judgment for that story is then made available, allowing supervised adaptation during tracking.

    3. The New Event Detection Task: The TDT new event detection task is defined as the task of detecting, in a chronologically ordered stream of stories from multiple sources (and in multiple languages), the first story that discusses an event.

    4. The Link Detection Task: The TDT link detection task is defined as the task of determining whether two stories discuss the same topic. Thus, the system must embody an understanding of what a topic is, and this understanding must be independent of topic specifics.
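    The tracking and link detection tasks above can be illustrated with a minimal sketch. It assumes a bag-of-words representation, cosine similarity, and a fixed decision threshold; none of these choices comes from the surveyed papers, and the function names and the threshold of 0.3 are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one story (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_topic(training_stories):
    """Sum the vectors of the on-topic training stories into one topic profile."""
    profile = Counter()
    for story in training_stories:
        profile.update(vectorize(story))
    return profile


def is_on_topic(story, topic_profile, threshold=0.3):
    """Tracking decision: YES if the story is close enough to the topic profile."""
    return cosine(vectorize(story), topic_profile) >= threshold


def same_topic(story_a, story_b, threshold=0.3):
    """Link detection decision for a pair of stories (threshold is an assumption)."""
    return cosine(vectorize(story_a), vectorize(story_b)) >= threshold


# A topic defined by two on-topic training stories, as in the tracking task.
topic = build_topic([
    "earthquake strikes coastal city",
    "rescue teams search earthquake rubble",
])
print(is_on_topic("earthquake rescue teams reach city", topic))
print(is_on_topic("stock markets close higher", topic))
```

    Supervised adaptive tracking could be approximated in this sketch by calling `topic.update(vectorize(story))` whenever a story is confirmed to be on-topic, so that the profile drifts with the topic.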

    Figure 2. Architecture of a topic tracking system [13]

    V LITERATURE REVIEW

    The event detection problem is a part of topic detection and tracking (TDT). A topic is a seminal activity or event together with all associated events. An event is an occurrence reported at a particular time and place, along with its consequences. It is defined by a list of stories that discuss that single event. New events are those reported in stories that discuss an event not already covered by previous stories. Real-time detection of events and discovery of their evolution should be explored in order to present news stories more effectively.

    James Allan, Ron Papka and Victor Lavrenko [1] performed event detection using a clustering algorithm and a threshold model. The major components of the model are the properties of an event. For event tracking, filtering methods are deployed. Event detection strictly follows an online setting, i.e., it processes one news story at a time. The work covers the properties of event identity, which determine whether two events are the same. A system incorporating these properties performs new event detection by comparing each newly arrived story in the stream with the existing ones. The algorithm used for new event detection is a modified version of single-pass clustering.
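    A minimal sketch of plain single-pass clustering for new event detection is given here. It is not the modified algorithm of [1]; the bag-of-words vectors, the cosine measure, and the threshold of 0.2 are assumptions made only for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_new_events(stories, threshold=0.2):
    """Process stories in arrival order; return True for each story that
    starts a new event cluster and False for each story merged into an old one."""
    clusters = []  # one summed term-count vector per event seen so far
    flags = []
    for text in stories:
        vec = Counter(text.lower().split())
        scores = [cosine(vec, cluster) for cluster in clusters]
        best = max(scores, default=0.0)
        if best < threshold:
            clusters.append(Counter(vec))             # nothing similar: new event
            flags.append(True)
        else:
            clusters[scores.index(best)].update(vec)  # merge into closest event
            flags.append(False)
    return flags


stream = [
    "flood hits river town",
    "river town flood damage grows",
    "team wins cup final",
    "cup final team celebrates win",
]
print(detect_new_events(stream))
```

    Each story is seen exactly once and the decision is made immediately, which is what the strict online setting requires.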

    Yiming Yang, Jaime Q. Carbonell, Ralf D. Brown, Thomas Pierce, Brian Archibald, and Xin Liu [19] applied topic detection and tracking (TDT) with the aim of devising an intelligent system that automatically detects novel events in large volumes of news stories. The method accepts news stories from various TV channels and radio broadcasts as input. The subtasks of TDT include segmenting speech-recognized input into news stories, detecting events in the segmented news streams, and tracking events of interest to the user. The event detection task is unsupervised and takes two forms [18]:

    1. Retrospective detection.

    2. Online detection.

    Hassan Sayyadi, Matthew Hurst and Alexey Maykov [15] presented an algorithm for new event detection that detects events by building a keyword graph and applying community detection methods. Events are characterized by a set of keywords. The keywords, which include named entities, are extracted from the news articles. The key factor is the dependency between the extracted keywords: because the same set of terms can denote more than one event, keywords alone are ambiguous. A graph, called the key graph, is therefore constructed from the extracted keywords. Each node in the graph represents a keyword, and each edge represents the co-occurrence of two keywords in multiple documents. The algorithm performs three tasks: building the key graph, community detection, and document clustering. For each keyword, term frequency (TF), document frequency (DF), and inverse document frequency (IDF) values are computed to determine its relevance and its association with other keywords. A node is removed if its keyword has a low document frequency. An edge is removed if the co-occurrence of its keywords falls below a threshold value.
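    The graph construction and pruning steps can be sketched as follows. This is an illustration only: the thresholds `min_df` and `min_cooccur` are hypothetical, and connected components are used as a simple stand-in for the community detection method of [15]. The final document clustering step is not shown.

```python
from collections import Counter
from itertools import combinations


def build_keygraph(docs, min_df=2, min_cooccur=2):
    """docs: list of keyword sets, one per document.
    Returns an adjacency dict {keyword: set of neighbours} after pruning."""
    df = Counter(k for doc in docs for k in set(doc))
    kept = {k for k, n in df.items() if n >= min_df}   # drop low-DF nodes
    pair_counts = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc) & kept), 2):
            pair_counts[(a, b)] += 1
    graph = {k: set() for k in kept}
    for (a, b), n in pair_counts.items():
        if n >= min_cooccur:                           # drop weak edges
            graph[a].add(b)
            graph[b].add(a)
    return graph


def communities(graph):
    """Connected components of the pruned graph (stand-in for community detection)."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


docs = [
    {"election", "vote", "senate"},
    {"election", "vote", "poll"},
    {"storm", "flood", "coast"},
    {"storm", "flood", "vote"},
]
print(communities(build_keygraph(docs)))
```

    In this toy input, "vote" appears alongside "storm" and "flood" only once, so those edges are pruned and the two events stay separate.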

    Wei CHEN, Chun CHEN, Li-Jun ZHANG, Can WANG and Jia-Jun BU [3] monitor the news stream over a predefined duration to identify bursty events. Each event is represented by features (i.e., keywords). A bursty event comprises bursty features, whose frequency increases when the corresponding event occurs. The steps involved are identifying bursty features in the current window for different periods, grouping the detected bursty features into bursty events, each associated with a power value corresponding to its bursty level, and discovering the evolution of events. Bursty features are identified using an online multi-resolution burst detection (OMRBD) algorithm.
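    The idea of a bursty feature can be illustrated with a much simpler rule than OMRBD, whose details are not given in this survey. The sketch assumes a feature is bursty when its count in the current window exceeds the mean of its past counts by `k` standard deviations; `k`, `min_count`, and the "power" value computed here are all hypothetical.

```python
from statistics import mean, pstdev


def bursty_features(history, current, k=2.0, min_count=3):
    """history: {feature: [count in each past window]}
    current: {feature: count in the current window}
    Returns {feature: excess over its historical mean} for bursty features."""
    bursty = {}
    for feature, count in current.items():
        past = history.get(feature, [])
        mu = mean(past) if past else 0.0
        sigma = pstdev(past) if past else 0.0
        if count >= min_count and count > mu + k * sigma:
            bursty[feature] = count - mu  # crude stand-in for a burst power value
    return bursty


history = {"earthquake": [1, 0, 2, 1], "weather": [5, 6, 5, 6]}
current = {"earthquake": 12, "weather": 6, "tsunami": 4}
print(bursty_features(history, current))
```

    Here "weather" is frequent but steady, so it is not flagged, while "earthquake" and the previously unseen "tsunami" are. Grouping such features into events would be a separate step.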

    Giridhar Kumaran and James Allan [8] address new event detection (NED), which involves monitoring the news stream to identify stories that report on a new event. In this work, NED is treated as a binary classification problem. Each news story has three representations built on the basis of named entities. Named entities are used because the occurrence of a new event does not follow a pattern and is almost instantaneous. Named entities such as persons, locations, and organizations are identified. When two stories describe the same event, their named entities and topic terms will be similar.

    W. Lam, H. M. L. Meng, K. L. Wong and J. C. H. Yen [9] presented a method called contextual analysis for event detection in a continuous stream of newswire stories. The method does not depend only on keywords to describe an event; it also takes into account concept terms, named entities such as persons, locations, and organizations, and story terms. The information obtained from these terms, along with their weights, is used for event detection. The event detection model is composed of three components:

    1. Similarity calculation component.

    2. Grouping the relevant elements by means of agglomerative clustering.

    3. Event identification.
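    The first two components can be sketched generically. The example assumes Jaccard overlap between term sets as the similarity measure and a greedy merge rule with a threshold of 0.3; these are illustrative choices, not the weighting scheme or clustering criterion of [9].

```python
def jaccard(a, b):
    """Similarity component: overlap between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(items, threshold=0.3):
    """Agglomerative grouping: repeatedly merge the most similar pair of
    clusters until no pair reaches the threshold.
    Returns a list of clusters, each a list of item indices."""
    clusters = [set(item) for item in items]   # merged term set per cluster
    members = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = jaccard(clusters[i], clusters[j])
                if score >= best:
                    best, pair = score, (i, j)
        if pair is None:
            break                              # no pair is similar enough
        i, j = pair
        clusters[i] |= clusters[j]
        members[i] += members[j]
        del clusters[j], members[j]
    return members


stories = [
    {"fire", "forest", "smoke"},
    {"forest", "fire", "evacuation"},
    {"bank", "rate", "cut"},
    {"rate", "cut", "loan"},
]
print(agglomerate(stories))
```

    Each resulting group would then be passed to the event identification component, which is not modeled here.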

    James Allan, Victor Lavrenko and Margaret E. Connell [2] described the purpose of new event detection as finding the point at which the system must decide to start a new cluster. The new event evaluation focuses entirely on whether a system finds the stories that trigger new topics and ignores what happens within the topics. The approach evaluates the tasks within topic detection and tracking (TDT) using a signal detection methodology. A TDT system produces a binary YES/NO judgment for every story in a stream.
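    In a signal detection setting, YES/NO judgments are scored by two error rates: misses (a target story judged NO) and false alarms (a non-target story judged YES). The sketch below computes both and combines them into a weighted cost. The weights `c_miss`, `c_fa`, and the prior `p_target` are assumptions chosen for illustration and are not taken from this survey.

```python
def detection_rates(decisions, truth):
    """decisions, truth: parallel lists of booleans
    (system YES/NO judgment vs. whether the story really is a target)."""
    misses = sum(1 for d, t in zip(decisions, truth) if t and not d)
    false_alarms = sum(1 for d, t in zip(decisions, truth) if d and not t)
    targets = sum(truth)
    non_targets = len(truth) - targets
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / non_targets if non_targets else 0.0
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Weighted cost of the two error types (default weights are assumptions)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


decisions = [True, False, True, False, False]
truth = [True, True, False, False, False]
p_miss, p_fa = detection_rates(decisions, truth)
print(p_miss, p_fa, detection_cost(p_miss, p_fa))
```

    Sweeping the system's decision threshold and plotting the two rates against each other gives the usual trade-off curve between misses and false alarms.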

    Wang Xiaowei, Jiang Longbin, Ma Jialin and Jiangyan proposed an improved approach to topic tracking [17]. Their multi-vector model extracts NER (named entity recognition) features from the text and places them in a separate vector. It first selects the features and classifies them according to the characteristics of the different tasks, then calculates the vectors, and finally selects the combination of models and optimizes the parameters.

    Xianfei Zhang, Zhigang Guo and Bicheng Li proposed a new method for news topic tracking [20]. The LSI-SVM (Latent Semantic Indexing-Support Vector Machine) method makes an in-depth analysis of the co-occurrence of words and provides a way of dealing with synonymy automatically. It is based on the assumption that there is an underlying, or latent, structure in the pattern of word usage across documents.

    Figure 3. NER System Architecture [17]

    Keyword extraction can be used to track topics over time. Keywords are the set of significant words in an article that give readers a high-level description of its contents. Manual keyword extraction, however, is an extremely difficult and time-consuming task. This problem was addressed by Sungjick Lee and Han-joon Kim [10], who proposed an automatic, unsupervised keyword extraction technique. The conventional model evaluates the importance of a word within a single document, whereas the proposed variants evaluate its importance across a whole document collection.
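    As a generic illustration of unsupervised keyword scoring, the sketch below ranks the terms of each document by plain TF-IDF. This is the textbook weighting, not the variants proposed in [10]; the function name and the tie-breaking rule (alphabetical order) are assumptions.

```python
import math
from collections import Counter


def top_keywords(docs, n=3):
    """docs: list of token lists. Returns the n highest TF-IDF terms per document."""
    df = Counter(term for doc in docs for term in set(doc))
    total = len(docs)
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(total / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:n])
    return result


docs = [
    "the election vote count the election".split(),
    "the storm flood warning the storm".split(),
    "the market report".split(),
]
print(top_keywords(docs, n=1))
```

    A word such as "the" that occurs in every document receives a score of zero, while a word concentrated in one document rises to the top of that document's list.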

    Kamaldeep Kaur and Vishal Gupta [6] proposed event detection and topic tracking for Punjabi news streams. Punjabi is a highly inflectional and agglutinating language, offering one of the richest and most challenging sets of linguistic and statistical features and producing long, complex word forms. Topic tracking for Punjabi has been tried with two approaches:


    1. NER based approach

    2. Keyword extraction approach

    CONCLUSION

    Topic tracking monitors a stream of news stories and finds those that discuss the same topic as a few positive sample stories. This paper has studied two approaches: NER feature extraction and keyword extraction. Language-dependent and language-independent features are formed and analyzed. Named entities such as date/time, location, person name, organization, and designation are extracted, along with keywords drawn from titles, cue phrases, and high-frequency nouns.

    REFERENCES:

    [1] Allan, James, Papka, Ron and Lavrenko, Victor, 1998, Online new event detection and tracking. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery Special Interest Group on Information Retrieval, p. 37-45.

    [2] Allan, James, Lavrenko, Victor and Connell, Margaret E., A Month to Topic Detection and Tracking in Hindi.

    [3] Chen, Wei, Chen, Chun, Zhang, Li-Jun, Wang, Can and Bu, Jia-Jun, 2010, Online detection of bursty events and their evolution in news streams, J. Zhejiang Univ.-Sci. C, 340-355.

    [4] Dadgar, Omid, Topic Detection and Tracking, Available: www.tcnj.edu/~mmmartin/.../TDT/TopicDetectionTracking04.ppt

    [5] Gupta, Vishal and Lehal, G. S., 2009, A Survey of Text Mining Techniques and Applications, Journal of Emerging Technologies in Web Intelligence.

    [6] Kaur, Kamaldeep and Gupta, Vishal, 2011, Topic Tracking for Punjabi Language, Computer Science & Engineering: An International Journal (CSEIJ), Vol. 1, No. 3.

    [7] Kolya, Anup Kumar, Ekbal, Asif and Bandyopadhyay, Sivaji, 2009, A Simple Approach for Monolingual Event Tracking System in Bengali, 8th International Symposium on Natural Language Processing, IEEE.

    [8] Kumaran, Giridhar and Allan, James, Using names and topics for new event detection.

    [9] Lam, W., Meng, H. M. L., Wong, K. L. and Yen, J. C. H., Event detection using contextual analysis, Int. J. Intell. Syst., 16 (4): 525-546. [doi: 10.1002/int.1022]

    [10] Lee, Sungjick and Kim, Han-joon, 2008, News Keyword Extraction for Topic Tracking, 4th International Conference on Networked Computing and Advanced Information Management, IEEE.

    [11] Liu, Yan, Lv, Nan, Luo, Junyong and Yang, Huijie, 2009, Subtopic Based Topic Evolution Analysis, International Conference on Web Information Systems and Mining, IEEE.

    [12] Navathe, Shamkant B. and Elmasri, Ramez, 2000, Data Warehousing and Data Mining, in Fundamentals of Database Systems, Pearson Education Pvt. Inc., Singapore, 841-872.

    [13] Qin, Xiangju and Zhang, Yang, 2008, Improving the Performance of Topic Tracking System by Ensemble, International Conference on Computer Science and Software Engineering, IEEE.

    [14] Radovanovic, Milos and Ivanovic, Mirjana, 2008, Text Mining: Approaches and Applications, Novi Sad J. Math., Vol. 38, No. 3: 227-234.

    [15] Sayyadi, Hassan, Hurst, Matthew and Maykov, Alexey, 2009, Event detection and tracking in social streams, 3rd Intl AAAI Conference on Weblogs and Social Media, ICWSM 09, AAAI.

    [16] Stavrianou, Anna, Andritsos, Periklis and Nicoloyannis, Nicolas, 2007, Overview and Semantic Issues of Text Mining, SIGMOD Record, Vol. 36, No. 3.

    [17] Wang, Xiaowei, Jiang, Longbin, Ma, Jialin and Jiangyan, 2008, Use of NER Information for Improved Topic Tracking, Eighth International Conference on Intelligent Systems Design and Applications, IEEE.

    [18] Yang, Y., Pierce, T. and Carbonell, J., 1998, A study on retrospective and on-line event detection. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

    [19] Yang, Yiming, Carbonell, Jaime Q., Brown, Ralf D., Pierce, Thomas, Archibald, Brian and Liu, Xin, 1999, Learning approaches for detecting and tracking news events, IEEE Intell. Syst., 32-43.

    [20] Zhang, Xianfei, Guo, Zhigang and Li, Bicheng, 2009, An Effective Algorithm of News Topic Tracking, Global Congress on Intelligent Systems, IEEE.

    AUTHOR'S BIOGRAPHY

    Nain Kanwal Kaur is currently working as Assistant Professor, Department of Computer Science and Engineering, Continental Institute of Engineering and Technology, Jalvehra, Punjab, India.