topic detection and tracking using web mining

International Journal of Exploring Emerging Trends in Engineering (IJEETE)

Vol. 02, Issue 01, JAN, 2015 WWW.IJEETE.COM

ISSN 2394-0573 All Rights Reserved 2014 IJEETE Page 5

TOPIC DETECTION AND TRACKING USING WEB MINING

Nain Kanwal Kaur

Assistant Professor, Department of Computer Science and Engineering,

Continental Institute of Engineering and Technology, Jalvehra, Punjab, India

[email protected]

Abstract:- Web mining is the application of data mining techniques to discover patterns from the Web. Topic tracking is one of the technologies that have been developed for use in the text mining process. Its main purpose is to identify and follow events presented in multiple news sources, including newswires, radio and TV broadcasts. This paper presents a survey of topic tracking techniques.

Keywords- Text mining, topic detection, topic tracking

I INTRODUCTION

The World Wide Web (WWW) is a popular, interactive medium, and the amount of data and information available on it is growing tremendously. It is a collection of documents, text files, images and other forms of data in structured, semi-structured and unstructured form. The primary aim of web mining is to extract useful information and knowledge from the Web.

Raw Data → Patterns → Knowledge

Web mining is used to capture relevant information, create new knowledge out of relevant data, personalize information and learn about consumers or individual users, among other uses. Web mining can be divided into three categories depending on the type of data mined:

(i) Web usage mining,

(ii) Web content mining and

(iii) Web structure mining.

II WEB CONTENT MINING

Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. Heterogeneity and a lack of structure permeate much of the ever-expanding body of information sources on the World Wide Web. Research activities in this field also involve techniques from other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP) [12].

III TEXT MINING

Text mining is a relatively new area of computer science with strong connections to natural language processing, data mining, machine learning, information retrieval and knowledge management. Several approaches exist for the identification of patterns, including automated classification and clustering [14]. The field of text mining has received a lot of attention because of the ever-increasing need to manage the information that resides in the vast number of available documents [16].

Figure 1. Typical text mining process [6]




IV TOPIC DETECTION AND TRACKING (TDT)

Topic detection and tracking (TDT) applications aim to organize the temporally ordered stories of a news stream according to the events they discuss. A topic tracking system works by keeping user profiles and, based on the documents the user views, predicting other documents of interest to the user [5]. The task of topic tracking is to monitor a stream of news stories and find those that discuss the same topic as a few positive sample stories [20]. It gathers dispersed information and makes it easy for the user to get a general understanding [11]. Topic tracking can be applied in many areas of industry. It can be used to alert companies whenever a competitor is in the news. It could also be used in the medical industry by doctors and others looking for new treatments. The tasks of TDT can be summarized as:

1. The Topic Tracking Task: The TDT topic tracking task is defined to be the task of associating incoming stories with topics that are known to the system. A topic is known by its association with stories that discuss it. Thus each target topic is defined by one or more stories that are on the topic. To support this task, a small set of on-topic training stories is identified for each topic to be tracked.

2. The Supervised Adaptive Tracking Task: An optional variant of the topic tracking task is supervised adaptive tracking. This task is identical to the topic tracking task except that, for each story judged to be on-topic, the relevance judgment for that story is then made available, allowing supervised adaptation during tracking.

3. The New Event Detection Task: The TDT new event detection task is defined to be the task of detecting, in a chronologically ordered stream of stories from multiple sources (and in multiple languages), the first story that discusses an event.

4. The Link Detection Task: The TDT link detection task is defined to be the task of determining whether two stories discuss the same topic. Thus, the system must embody an understanding of what a topic is, and this understanding must be independent of topic specifics.
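The topic tracking task in item 1 can be illustrated with a minimal sketch. It assumes a bag-of-words representation, a topic profile built by summing the term counts of the training stories, and a cosine-similarity threshold. The function names, the whitespace tokenizer and the threshold value of 0.2 are illustrative assumptions, not part of the TDT task definition.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case, whitespace-tokenized term counts (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_topic_profile(training_stories):
    """Sum the term counts of the on-topic training stories into one profile."""
    profile = Counter()
    for story in training_stories:
        profile.update(bag_of_words(story))
    return profile


def track(profile, story, threshold=0.2):
    """Judge an incoming story on-topic if it is similar enough to the profile.

    The threshold is an assumed value chosen for this toy example.
    """
    return cosine(profile, bag_of_words(story)) >= threshold


# A topic defined by two on-topic training stories.
training = [
    "earthquake strikes coastal city rescue teams deployed",
    "rescue teams search rubble after earthquake",
]
profile = build_topic_profile(training)

on_topic = track(profile, "earthquake rescue continues in coastal city")
off_topic = track(profile, "stock markets rally on strong earnings")
```

In the supervised adaptive variant (item 2), a story confirmed as on-topic could additionally be folded into the profile with `profile.update(bag_of_words(story))`, so that the profile follows the topic as it evolves.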

Figure 2. Architecture of a topic tracking system [13]

V LITERATURE REVIEW

The event detection problem is part of topic detection and tracking (TDT). A topic is a seminal activity or event together with all associated events. An event is an occurrence reported at a particular time and place, along with its consequences. It is defined by a list of stories that discuss that single event. New-event stories are those that discuss an event not already reported in previous stories. Real-time detection of events and discovery of their evolution should be explored in order to present news stories more effectively.

James Allan, Ron Papka and Victor Lavrenko [1] performed event detection using a clustering algorithm and a threshold model. The major components of the model are the properties of an event. For event tracking, filtering methods are deployed. Event detection strictly follows an online setting, i.e., it processes one news story at a time. The work encompasses properties of event identity, which determine whether two events are the same. A system incorporating these properties performs new event detection by comparing each newly arrived story in the stream with the existing ones. The algorithm used for new event detection is a modified version of single-pass clustering.
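The online, one-story-at-a-time behaviour described above can be sketched with plain single-pass clustering. This is a generic illustration and not the modified algorithm of [1]: the bag-of-words vectors, the cosine similarity measure, the summed-vector cluster representation and the threshold of 0.3 are all assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Term-count vector from whitespace tokens (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_new_events(stories, threshold=0.3):
    """Process stories in arrival order and flag those that start a new cluster.

    A story is flagged as a new event when its best similarity to every
    existing cluster falls below the (assumed) threshold.
    """
    clusters = []  # each cluster is the summed term-count vector of its stories
    flags = []
    for text in stories:
        vec = vectorize(text)
        best_idx, best_sim = None, 0.0
        for i, centroid in enumerate(clusters):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx is None or best_sim < threshold:
            clusters.append(Counter(vec))    # start a new cluster: new event
            flags.append(True)
        else:
            clusters[best_idx].update(vec)   # absorb the story into its cluster
            flags.append(False)
    return flags


stream = [
    "flood hits river town",
    "river town flood damage grows",
    "election results announced tonight",
    "flood relief reaches river town",
]
flags = detect_new_events(stream)
```

Because each decision is made before later stories arrive, the result depends on the order of the stream, which matches the strictly online setting.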

Yiming Yang, Jaime Q. Carbonell, Ralf D. Brown, Thomas Pierce, Brian Archibald and Xin Liu [19] proposed topic detection and tracking (TDT) to devise an intelligent system that automatically detects novel events from large volumes of news stories. The method accepts news stories from various TV channels and radio broadcasts as input. The subtasks of TDT include segmenting speech-recognized input into news stories, detecting events in the segmented news streams, and tracking events of interest to the user. The event detection task is unsupervised and takes two forms [18]:

1. Retrospective detection.

2. Online detection.

Hassan Sayyadi, Matthew Hurst and Alexey Maykov [15] presented an algorithm for new event detection that detects events by creating a keyword graph and applying community detection methods. Events are characterized by a set of keywords. Keywords, which include named entities, are extracted from the news articles. The key factor is the dependency between the extracted keywords. More than one event can be denoted by the same set of terms, causing ambiguity. A graph, called the key graph, is therefore constructed from the extracted keywords. Each node in the graph represents a keyword, and each edge represents the co-occurrence of two keywords in multiple documents. The proposed algorithm performs three tasks: building the key graph, community detection and document clustering. For each keyword, the term frequency (TF), document frequency (DF) and inverse document frequency (IDF) values are computed to determine its relevance and its association with other keywords. A node is removed if its keyword has a low document frequency. An edge is removed if the co-occurrence of its keywords is below some threshold value.
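The graph-building and pruning steps described above can be sketched as follows. This is a simplified illustration and not the implementation of [15]: the thresholds `min_df` and `min_cooccurrence` are assumed values, and connected components are used as a plain stand-in for the community detection step.

```python
from collections import Counter
from itertools import combinations


def build_key_graph(docs, min_df=2, min_cooccurrence=2):
    """Build a keyword co-occurrence graph from a list of keyword sets.

    Nodes with document frequency below min_df are dropped, as are edges
    whose keywords co-occur in fewer than min_cooccurrence documents.
    Both thresholds are assumptions chosen for this toy example.
    """
    df = Counter()
    for keywords in docs:
        df.update(set(keywords))
    nodes = {k for k, n in df.items() if n >= min_df}

    co = Counter()
    for keywords in docs:
        kept = sorted(set(keywords) & nodes)
        co.update(combinations(kept, 2))  # one count per document per pair

    graph = {k: set() for k in nodes}
    for (a, b), n in co.items():
        if n >= min_cooccurrence:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group keywords by connectivity (a stand-in for community detection)."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


docs = [
    {"flood", "river", "rescue"},
    {"flood", "river", "damage"},
    {"election", "vote", "ballot"},
    {"election", "vote", "flood"},
]
graph = build_key_graph(docs)
communities = connected_components(graph)
```

Each resulting keyword group can then serve as the description of one candidate event, and documents can be assigned to the group whose keywords they share most, which corresponds to the document clustering task.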

Wei Chen, Chun Chen, Li-Jun Zhang, Can Wang and Jia-Jun Bu [3] monitor the news stream for a predefined duration to identify bursty events. An event is represented using features (i.e., keywords). A bursty event comprises bursty features whose frequency increases as the corresponding event occurs. The steps involved are identifying bursty features in the current window for different periods, grouping the detected bursty features to formulate bursty events, each associated with a power value corresponding to its bursty level, and discovering the evolution of events. Bursty features are identified using an online multi-resolution burst detection (OMRBD) algorithm.
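To make the idea of a bursty feature concrete, the sketch below flags time windows in which a keyword's count rises sharply above its recent history. It uses a simple z-score rule at a single resolution and is not the OMRBD algorithm of [3]. The history length and the z-score threshold are assumed values.

```python
from statistics import mean, pstdev


def bursty_windows(counts, history=3, z_threshold=2.0):
    """Return indices of windows where a feature's count is unusually high.

    counts: per-window frequencies of one feature (e.g. one keyword per day).
    A window is flagged when its count exceeds the mean of the preceding
    `history` windows by at least `z_threshold` standard deviations.
    """
    bursts = []
    for t in range(history, len(counts)):
        past = counts[t - history:t]
        mu = mean(past)
        sigma = pstdev(past) or 1.0  # fall back to 1.0 when the history is flat
        if (counts[t] - mu) / sigma >= z_threshold:
            bursts.append(t)
    return bursts


# Daily counts of one keyword; the jump at index 4 marks the onset of a burst.
daily_counts = [2, 3, 2, 3, 15, 14, 3]
bursts = bursty_windows(daily_counts)
```

This rule only marks the onset of a burst, because once the high counts enter the history window they raise the baseline. Grouping features that burst in the same windows would be the next step toward forming bursty events.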

Giridhar Kumaran and James Allan [8] perform new event detection (NED), which involves monitoring the news stream to identify stories that report on a new event. In this work, NED is treated as a binary classification problem. Each news story has three representations on the basis of named entities. Since the occurrence of a new event does not follow a pattern and is almost instantaneous, named entities are used. Named entities such as persons, locations and organizations are identified. When two stories depict the same event, their named entities and topic terms will be similar.

W. Lam, H. M. L. Meng, K. L. Wong and J. C. H. Yen [9] presented a method called contextual analysis for event detection in a continuous stream of newswire stories. The proposed method does not depend only on keywords to describe an event, but also takes into account concept terms, named entities such as persons, locations and organizations, and story terms. The information obtained from these terms, along with their weights, is used for event detection. The event detection model is composed of three components:

1. Similarity calculation component.

2. Grouping the relevant elements by means of agglomerative clustering.

3. Event identification.

James Allan, Victor Lavrenko and Margaret E. Connell [2] described the purpose of new event detection as finding the point at which the system must decide to start a new cluster. The new event evaluation focuses entirely on whether or not a system finds the triggers of new topics and ignores what happens within the topics. The approach evaluates the tasks within topic detection and tracking (TDT) using a signal detection methodology. A TDT system produces binary YES/NO judgments for every story in a stream.
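Under a signal detection view, the YES/NO judgments are compared with reference labels to obtain a miss rate and a false alarm rate, which can be combined into a single weighted cost. The sketch below assumes a cost of the form C_det = C_miss · P_miss · P_target + C_fa · P_fa · (1 − P_target). The cost weights and the target prior used here are illustrative placeholders and are not taken from [2].

```python
def error_rates(decisions, truths):
    """Compute (P_miss, P_fa) from parallel lists of booleans.

    decisions: the system's YES/NO judgment for each story.
    truths: whether each story is actually on-target.
    """
    misses = sum(1 for d, t in zip(decisions, truths) if t and not d)
    false_alarms = sum(1 for d, t in zip(decisions, truths) if d and not t)
    targets = sum(truths)
    non_targets = len(truths) - targets
    p_miss = misses / targets if targets else 0.0
    p_fa = false_alarms / non_targets if non_targets else 0.0
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Weighted combination of miss and false alarm rates (assumed weights)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


decisions = [True, False, False, True, False, False]
truths = [True, True, False, False, False, False]

p_miss, p_fa = error_rates(decisions, truths)
cost = detection_cost(p_miss, p_fa)
```

Sweeping the system's decision threshold and recomputing the two rates at each setting traces out the trade-off between misses and false alarms.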

Wang Xiaowei, Jiang Longbin, Ma Jialin and Jiangyan came up with an improved approach for topic tracking [17]. They proposed a multi-vector model that extracts NER features from text and places them in a separate vector. The approach first selects the features and classifies them according to the characteristics of the different tasks, then calculates the vectors, and finally selects the combination of models and optimizes the parameters.

Xianfei Zhang, Zhigang Guo and Bicheng Li proposed a new method for news topic tracking [20]. The LSI-SVM (Latent Semantic Indexing-Support Vector Machine) method makes an in-depth analysis of the co-occurrence of words and provides a way of dealing with synonymy automatically. It is based on the assumption that there is an underlying, or latent, structure in the pattern of word usage across documents.

Figure 3. NER System Architecture [17]

Keyword extraction can be used to track topics over time. Keywords are the set of significant words in an article that give readers a high-level description of its contents. However, manual keyword extraction is an extremely difficult and time-consuming task. This problem has been addressed by Sungjick Lee and Han-joon Kim [10], who proposed an automatic, unsupervised keyword extraction technique. The conventional model evaluates the degree of importance of a word within a single document, whereas the proposed variants evaluate the degree of importance of a word across a whole document collection.
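As a point of reference for the conventional, per-document weighting mentioned above, the following sketch ranks the words of each document by a basic TF-IDF score. It assumes length-normalized term frequency and a natural-log inverse document frequency. The collection-level variants proposed in [10] are not reproduced here.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=2):
    """Return the top_k (term, score) pairs for each tokenized document.

    Score = (term count / document length) * ln(N / document frequency).
    Ties are broken alphabetically so that the ranking is deterministic.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    results = []
    for tokens in docs:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_k])
    return results


docs = [
    "flood flood river news".split(),
    "election vote news".split(),
    "river bridge news".split(),
]
keywords = tfidf_keywords(docs, top_k=1)
top_terms = [pairs[0][0] for pairs in keywords]
```

A word such as "news", which appears in every document, receives a score of zero, so it is never chosen as a keyword when any more distinctive word is present.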

Kamaldeep Kaur and Vishal Gupta [6] proposed event detection and topic tracking for Punjabi news streams. Punjabi is a highly inflectional and agglutinating language, providing one of the richest and most challenging sets of linguistic and statistical features and resulting in long and complex word forms. Topic tracking for the Punjabi language has been tried with two approaches:




1. NER based approach

2. Keyword extraction approach

CONCLUSION

Topic tracking monitors a stream of news stories and finds those that discuss the same topic as a few positive sample stories. In this paper, two approaches are studied: NER feature extraction and keyword extraction. Language-dependent and language-independent features are formed and analyzed. Named entities such as date/time, location, person name, organization and designation, as well as keywords from titles, cue phrases and high-frequency nouns, are extracted.

REFERENCES:

[1] Allan, James, Papka, Ron and Lavrenko, Victor, 1998, Online new event detection and tracking. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery Special Interest Group on Information Retrieval, p 37-45.

[2] Allan, James, Lavrenko, Victor and Connell, Margaret E., A Month to Topic Detection and Tracking in Hindi.

[3] Chen, Wei, Chen, Chun, Zhang, Li-Jun, Wang, Can, Bu, Jia-Jun, 2010, Online detection of bursty events and their evolution in news streams, J. Zhejiang Univ.-Sci C, 340-355.

[4] Dadgar, Omid, Topic Detection and Tracking, Available: www.tcnj.edu/~mmmartin/.../TDT/TopicDetectionTracking04.ppt

[5] Gupta, Vishal, Lehal, G. S., 2009, A Survey of Text Mining Techniques and Applications, in Journal of Emerging Technologies in Web Intelligence.

[6] Kaur, Kamaldeep and Gupta, Vishal, 2011, Topic Tracking for Punjabi Language, Computer Science & Engineering: An International Journal (CSEIJ), Vol. 1, No. 3.

[7] Kolya, Anup Kumar, Ekbal, Asif, Bandyopadhyay, Sivaji, 2009, A Simple Approach for Monolingual Event Tracking System in Bengali, 8th International Symposium on Natural Language Processing, IEEE.

[8] Kumaran, Giridhar and Allan, James, Using names and topics for new event detection.

[9] Lam, W., Meng, H. M. L., Wong, K. L., Yen, J. C. H., Event detection using contextual analysis, Int. J. Intell. Syst., 16 (4): 525-546. [doi: 10.1002/int.1022]

[10] Lee, Sungjick, Kim, Han-joon, 2008, News Keyword Extraction for Topic Tracking, 4th International Conference on Networked Computing and Advanced Information Management, IEEE.

[11] Liu, Yan, Lv, Nan, Luo, Junyong, Yang, Huijie, 2009, Subtopic Based Topic Evolution Analysis, International Conference on Web Information Systems and Mining, IEEE.

[12] Navathe, Shamkant B. and Elmasri, Ramez, 2000, Data Warehousing and Data Mining, in Fundamentals of Database Systems, Pearson Education Pvt Inc, Singapore, 841-872.

[13] Qin, Xiangju, Zhang, Yang, 2008, Improving the performance of Topic Tracking System by Ensemble, International Conference on Computer Science and Software Engineering, IEEE.

[14] Radovanovic, Milos, Ivanovic, Mirjana, 2008, Text Mining: Approaches and Applications, Novi Sad J. Math, Vol. 38, No. 3: 227-234.

[15] Sayyadi, Hassan, Hurst, Matthew and Maykov, Alexey, 2009, Event detection and tracking in social streams, 3rd Intl AAAI Conference on Weblogs and Social Media, ICWSM 09, AAAI.

[16] Stavrianou, Anna, Andritsos, Periklis, Nicoloyannis, Nicolas, 2007, Overview and Semantic Issues of Text Mining, Sigmod Record, Vol. 36, No. 3.

[17] Xiaowei, Wang, Longbin, Jiang, Jiangyan, Ma, Jialin, 2008, Use of NER Information for Improved Topic Tracking, Eighth International Conference on Intelligent Systems Design and Applications, IEEE.

[18] Yang, Y., Pierce, T., Carbonell, J., 1998, A study on retrospective and on-line event detection. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

[19] Yang, Yiming, Carbonell, Jaime Q., Brown, Ralf D., Pierce, Thomas, Archibald, Brian and Liu, Xin, 1999, Learning approaches for detecting and tracking news events, IEEE Intell. Syst., 32-43.

[20] Zhang, Xianfei, Guo, Zhigang, Li, Bicheng, 2009, An Effective Algorithm of News Topic Tracking, Global Congress on Intelligent Systems, IEEE.

AUTHOR'S BIOGRAPHY

Nain Kanwal Kaur is currently working as Assistant Professor, Department of Computer Science and Engineering, Continental Institute of Engineering and Technology, Jalvehra, Punjab, India.
