
Post on 01-Sep-2020


  • Developing Efficient Algorithms

    for Incremental

    Mining of Sequential Patterns

    A Thesis

    Submitted in fulfillment of the

    requirements for the award of the degree of

    Doctor of Philosophy

    Submitted by

    Bhawna Mallick (Registration No: 950903027)

    Under the supervision of

    Dr. Deepak Garg

    Associate Professor,

    Computer Science and Engineering Department,

    Thapar University, Patiala – 147004

    Dr. P S Grover

    Professor of Computer Science,

    Guru Tegh Bahadur Institute of Technology,

    Guru Gobind Singh Indraprastha University, Delhi – 110006

    Computer Science and Engineering Department

    Thapar University,

    Patiala-147004 (Punjab), India.

    August 2013

  • To

    My Family

  • CERTIFICATE

    I hereby certify that the work which is being presented in this thesis entitled

    DEVELOPING EFFICIENT ALGORITHMS FOR INCREMENTAL MINING OF

    SEQUENTIAL PATTERNS, in fulfillment of the requirements for the award of

    degree of DOCTOR OF PHILOSOPHY submitted in Computer Science and

    Engineering Department (CSED), Thapar University, Patiala, Punjab, is an authentic

    record of my own work carried out under the supervision of Dr. Deepak Garg and

    Dr. P. S. Grover, and refers to the work of other researchers, which is duly listed in

    the reference section.

    The matter presented in this thesis has not been submitted for the award of any

    other degree of this or any other university.

    (Bhawna Mallick)

    Registration No. 950903027

    This is to certify that the above statement made by the candidate is correct and true

    to the best of my knowledge and belief.

    (Deepak Garg)

    Supervisor

    Associate Professor,

    Computer Science and Engineering Department,

    Thapar University, Patiala – 147004.

    (P.S. Grover)

    Supervisor

    Professor of Computer Science,

    Guru Tegh Bahadur Institute of Technology,

    Guru Gobind Singh Indraprastha University, Delhi – 110006.

  • ACKNOWLEDGEMENTS

    It would not have been possible to write this doctoral thesis without the help and support

    of the kind people around me, to only some of whom it is possible to give particular

    mention here.

    At the outset, this research endeavor has been only possible with the help,

    support and able guidance of my supervisors, Dr. Deepak Garg and Dr. P.S. Grover. I am

    highly indebted to Dr. P. S. Grover for his advice, encouragement and unsurpassed

    knowledge of comparative linguistics. The good support, diligence and friendship of Dr.

    Deepak Garg have been invaluable on both an academic and a personal level, for which I

    am extremely grateful. It has indeed been a privilege to carry out this research work

    under their guidance.

    I would like to acknowledge the Director and Dean (R&D), Thapar University for

    the academic and technical support which has been indispensable. I express my gratitude

    to the Doctoral Committee comprising Dr. Maninder Singh, Dr. R. K. Sharma and the PG

    Coordinator, Dr. Inderveer Chana, for monitoring the progress and providing valuable

    suggestions for improvement of my Ph.D. work from time to time. I also thank the

    faculty and staff of the Computer Science and Engineering Department, especially the Head

    of Department, Dr. Maninder Singh, for promoting a stimulating and welcoming academic

    and cooperative environment.

    I am grateful to the Director and Management, Galgotias College of Engineering

    and Technology, Greater Noida, Uttar Pradesh for providing me resources and facilities

    to carry out this research work. I would also like to thank my colleagues and friends for

    their support and valuable suggestions throughout. I am thankful to my friends working

    in the corporate sector for providing the feedback and suggestions required for this research

    work.

    My hearty thanks go to all the anonymous referees of several international research

    journals in the field of intelligent data mining and fuzzy set theory. The research work

    presented in this thesis owes much of its quality to their constructive

    comments and ideas for improvement.

    I would like to thank the many acclaimed researchers across the world who promptly

    provided reprints of their valuable research papers for my reference in this

    thesis. These include Dr. Lei Chang, Dr. C.H. Cheng, Dr. T.P. Hong, Dr. Jiang Hua,

    Dr. S. Piramuthu, Dr. Jong Bum Lee, Dr. S. K. Gupta, Dr. M.Y. Lin, Dr. Jian Pei, Dr. V.

    Kumar, Dr. X. Wu, Dr. Mingming Zhou, and several others.

    I express gratitude to my loving husband Nitin for his personal support and great

    patience at all times. I acknowledge the constant cooperation from him without which

    this work could not have been possible. My parents and sisters have given me their

    unequivocal support throughout, as always, for which my mere expression of thanks

    likewise does not suffice. I dedicate this thesis to my family.

    Above all, I pay my reverence to the almighty God.

    Date: July, 2013.

    Place: Thapar University, Patiala. (Bhawna Mallick)

  • ABSTRACT

    Our capacity to generate and collect data has grown with the development of

    technology. The ultimate intent of this massive data collection is to use it for purposes

    such as process efficiency, system automation and competitive advantage. This is

    achieved by discovering previously unidentified patterns in data, both implicit and

    explicit, that can guide the process of decision making. Data mining is the process of

    finding rules and patterns that describe the characteristic properties of the data.

    Sequential pattern mining is one method of data mining. A sequential pattern is a

    sequence of itemsets that occur frequently in a specific order, where all items in the

    same itemset are assumed to have the same transaction time value.
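
    The definition above can be made concrete with a minimal sketch. It assumes that each
    customer sequence is a list of itemsets ordered by transaction time and that support is
    the fraction of sequences containing the pattern. The names `contains`, `support`, `db`
    and `pattern` are illustrative and do not come from the thesis.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def contains(sequence: List[Itemset], pattern: List[Itemset]) -> bool:
    """Return True if `pattern` occurs in `sequence` in order.

    Each pattern itemset must be a subset of a distinct itemset of the
    sequence, and the matched itemsets must appear in the same order.
    Greedy earliest matching is sufficient for this check.
    """
    pos = 0
    for wanted in pattern:
        # Skip sequence itemsets that do not include the wanted itemset.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # The next pattern itemset must match a later itemset.
    return True


def support(database: List[List[Itemset]], pattern: List[Itemset]) -> float:
    """Fraction of sequences in `database` that contain `pattern`."""
    if not database:
        return 0.0
    hits = sum(1 for seq in database if contains(seq, pattern))
    return hits / len(database)


# Hypothetical toy database: one list of itemsets per customer.
db = [
    [frozenset({"bread", "milk"}), frozenset({"butter"})],
    [frozenset({"bread"}), frozenset({"jam"}), frozenset({"butter"})],
    [frozenset({"butter"}), frozenset({"bread"})],
    [frozenset({"milk"}), frozenset({"bread", "butter"})],
]
pattern = [frozenset({"bread"}), frozenset({"butter"})]
print(support(db, pattern))
```

    In this toy data only the first two customers buy bread and then butter in a later
    transaction. Buying both in the same transaction, as the fourth customer does, does
    not match a two-itemset pattern.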

    Sequential pattern mining is used in a wide spectrum of areas, such as financial

    data analysis of banks and other financial institutions, retail industry, customer shopping

    history, goods transportation, consumption and services, telecommunication industry,

    biological data analysis, scientific applications, and network intrusion detection.

    Sequential pattern mining algorithms generally produce an enormous number of

    patterns, both interesting and uninteresting. The miner may therefore have difficulty

    selecting the interesting or significant ones. The core idea behind this study is the

    incorporation of specific constraints into the mining process. Another important

    concern is that sequential pattern mining methods are based on the assumption that

    the data is centralized and static. When data is dynamically added to or deleted from

    the databases, these methods waste computational and input/output (I/O) resources. It is

    undesirable to mine sequential patterns from scratch each time a small set of

    sequences grows, or when some sequences are added, deleted or updated in the

    database.
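
    The cost argument above can be illustrated with a minimal sketch. It is a deliberate
    simplification, not one of the algorithms proposed in this thesis: it assumes a fixed
    set of tracked candidate patterns and keeps their occurrence counts up to date by
    scanning only the sequences that are added or removed. The class `IncrementalSupport`
    and all other names are hypothetical. A full incremental miner must also handle
    patterns that were not tracked but become frequent after an update.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Itemset = FrozenSet[str]
Pattern = Tuple[Itemset, ...]


def contains(sequence: List[Itemset], pattern: Pattern) -> bool:
    """Return True if `pattern` occurs in `sequence` in order."""
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


class IncrementalSupport:
    """Maintain occurrence counts for tracked patterns without full rescans."""

    def __init__(self, patterns: Iterable[Pattern]) -> None:
        self.counts: Dict[Pattern, int] = {p: 0 for p in patterns}
        self.size = 0  # Number of sequences currently in the database.

    def add(self, sequences: Iterable[List[Itemset]]) -> None:
        """Scan only the newly added sequences."""
        for seq in sequences:
            self.size += 1
            for p in self.counts:
                if contains(seq, p):
                    self.counts[p] += 1

    def remove(self, sequences: Iterable[List[Itemset]]) -> None:
        """Scan only the deleted sequences (assumed to have been added before)."""
        for seq in sequences:
            self.size -= 1
            for p in self.counts:
                if contains(seq, p):
                    self.counts[p] -= 1

    def frequent(self, min_support: float) -> List[Pattern]:
        """Tracked patterns whose support meets the threshold."""
        if self.size == 0:
            return []
        return [p for p, c in self.counts.items() if c / self.size >= min_support]


# Hypothetical usage: two tracked patterns, an initial load, then updates.
p1: Pattern = (frozenset({"bread"}), frozenset({"butter"}))
p2: Pattern = (frozenset({"milk"}),)
tracker = IncrementalSupport([p1, p2])

old = [
    [frozenset({"bread", "milk"}), frozenset({"butter"})],
    [frozenset({"butter"}), frozenset({"bread"})],
]
tracker.add(old)
tracker.add([[frozenset({"bread"}), frozenset({"jam"}), frozenset({"butter"})]])
after_insert = tracker.frequent(0.6)

tracker.remove([old[1]])
after_delete = tracker.frequent(0.6)
```

    Each update touches only the changed sequences, so its cost depends on the size of the
    increment rather than on the size of the whole database.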

    Discovered sequential patterns become outdated as databases evolve incrementally

    over time (progressive databases). This necessitates incremental mining of sequential

    patterns from progressive databases so as to capture the dynamic nature of data

    addition, deletion and updating. The present study extensively reviews the algorithms

    available in the literature for the discovery of constraint-based sequential patterns

    and for the incremental mining of sequential patterns. On the basis of this review,

    efficient algorithms for incremental mining of sequential patterns from

    progressive databases are proposed.

    This study contributes to the literature by identifying the approach and

    constraints that can be used for developing algorithms for the incremental mining of

    sequential patterns. The constraints have been selected to keep track of customer

    purchasing history and to support customer value analysis for CRM. A COPRE

    (COnstraint-based PREfix) framework is developed based on the pattern growth approach;

    it can incorporate any number of constraints and be used for any domain. The first three

    proposed algorithms, “Constraint-based sequential pattern mining algorithm

    incorporating compactness, length and monetary”, “CFM-PrefixSpan algorithm” and

    “Progressive CFM-Miner algorithm”, are based on this framework. Real-world

    business and commercial databases consist of numerical and time-stamped data. We can

    get more relevant rules from these databases by making use of fuzzy set theory to

    soften the sharp boundaries between intervals. Fuzzy sequential mining generates simple and

    practical patterns which are close to human reasoning. So, the fourth algorithm,

    “Progreslattice-Miner (PLM) algorithm” is ba
