deepak garg phd, deepak garg ieee, deepak garg acm developing efficient algorithms for incremental...
Post on 01-Sep-2020
0 views
Embed Size (px)
TRANSCRIPT
Developing Efficient Algorithms
for Incremental
Mining of Sequential Patterns
A Thesis
Submitted in fulfillment of the
requirements for the award of the degree of
Doctor of Philosophy
Submitted by
Bhawna Mallick (Registration No: 950903027)
Under the supervision of
Dr. Deepak Garg Dr. P S Grover Associate Professor, Professor of Computer Science,
Computer Science and Engineering Guru Tegh Bahadur Institute of Technology,
Department, Thapar University, Guru Gobind Singh Indraprastha University,
Patiala – 147004 Delhi – 110006
Computer Science and Engineering Department
Thapar University,
Patiala-147004 (Punjab), India.
August 2013
To
My Family
CERTIFICATE
I hereby certify that the work which is being presented in this thesis entitled
DEVELOPING EFFICIENT ALGORITHMS FOR INCREMENTAL MINING OF
SEQUENTIAL PATTERNS, in fulfillment of the requirements for the award of
degree of DOCTOR OF PHILOSOPHY submitted in Computer Science and
Engineering Department (CSED), Thapar University, Patiala, Punjab, is an authentic
record of my own work carried out under the supervision of Dr. Deepak Garg and
Dr. P. S. Grover, and refers the work of other researchers, which are duly listed in
the reference section.
The matter presented in this thesis has not been submitted for the award of any
other degree of this or any other university.
(Bhawna Mallick)
Registration No. 950903027
This is to certify that the above statement made by the candidate is correct and true
to the best of my knowledge and belief.
(Deepak Garg) (P.S. Grover)
Associate Professor, Professor of Computer Science,
Computer Science and Engineering Guru Tegh Bahadur Institute of Technology,
Department, Thapar University, Guru Govind Singh Indraprastha University,
Patiala – 147004. Delhi – 110006.
Supervisor Supervisor
ii
ACKNOWLEDGEMENTS
It would not have been possible to write this doctoral thesis without the help and support
of the kind people around me, to only some of whom it is possible to give particular
mention here.
At the outset, this research endeavor has been only possible with the help,
support and able guidance of my supervisors, Dr. Deepak Garg and Dr. P.S. Grover. I am
highly indebted to Dr. P. S. Grover for his advice, encouragement and unsurpassed
knowledge of comparative linguistics. The good support, diligence and friendship of Dr.
Deepak Garg have been invaluable on both an academic and a personal level, for which I
am extremely grateful. It has indeed been a privilege to carry out this research work
under their guidance.
I would like to acknowledge the Director and Dean (R&D), Thapar University for
the academic and technical support which has been indispensable. I express my gratitude
to the Doctoral Committee comprising of Dr. Maninder Singh, Dr. R. K. Sharma and PG
Coordinator, Dr. Inderveer Chana for monitoring the progress and providing valuable
suggestions for improvement of my Ph. D. work from time to time. I also thank the
faculty and staff of Computer Science and Engineering Department, especially the Head
of Department, Dr. Maninder Singh in promoting a stimulating and welcoming academic
and cooperative environment.
I am grateful to the Director and Management, Galgotias College of Engineering
and Technology, Greater Noida, Uttar Pradesh for providing me resources and facilities
to carry out this research work. I would also like to thank my colleagues and friends for
their support and valuable suggestions throughout. I am thankful to my friends working
in corporate for providing necessary feedback and suggestions required for this research
work.
My hearty thanks to all the anonymous referees of several international research
journals in the field of intelligent data mining and fuzzy set theory. The research work
presented in this thesis report has accomplished the eminence due to the constructive
comments and ideas for improvement by them.
iii
I would like to thank many acclaimed researchers across the world for timely
contributing the reprints of their valuable research papers for my reference apropos this
thesis report. These include Dr. Lei Chang, Dr. C.H. Cheng, Dr T.P. Hong, Dr. Jiang Hua,
Dr. S. Piramuthu, Dr. Jong Bum Lee, Dr. S. K. Gupta, Dr. M.Y. Lin, Dr. Jian Pei, Dr. V.
Kumar, Dr. X. Wu, Dr. Mingming Zhou, and several others.
I express gratitude to my loving husband Nitin for his personal support and great
patience at all times. I acknowledge the constant cooperation from him without which
this work could not have been possible. My parents and sisters have given me their
unequivocal support throughout, as always, for which my mere expression of thanks
likewise does not suffice. I dedicate this thesis to my family.
Above all, I pay my reverence to the almighty God.
Date: July, 2013.
Place: Thapar University, Patiala. (Bhawna Mallick)
iv
ABSTRACT
There has been an increase in our capabilities of both generating and collecting
data with the progress of humanity and development of technology. The ultimate intent
of this massive data collection is to utilize it for different reasons that include process
efficiency, system automation and competitive benefits among others. This is achieved by
determining both implicit and explicit unidentified patterns in data that can direct the
process of decision making. Data mining is a process of finding rules and patterns that
describe the characteristic properties of the data. Sequential pattern mining is one of the
methods of data mining. Sequential pattern is a sequence of item sets that frequently
occur in a specific order; where all items in the same item set are supposed to have the
same transaction time value.
Sequential pattern mining is used in wide spectrum of areas, such as financial
data analysis of banks and other financial institutions, retail industry, customer shopping
history, goods transportation, consumption and services, telecommunication industry,
biological data analysis, scientific applications, network intrusion detection.
The sequential pattern mining algorithms, in general, result in enormous number
of interesting and uninteresting patterns. Therefore, the miner may have difficulty in
selecting the interesting/significant ones. The core idea behind this study is the
incorporation of specific constraints into the mining process. One another important
concern of sequential pattern mining methods is that they are based on assumption that
the data is centralized and static. For dynamic data addition and deletion in the
databases, these methods waste computational and input/output (I/O) resources. It is
undesirable to mine sequential patterns from scratch each time when a small set of
sequences grow, or when some new sequences are added, deleted or updated in the
database.
There is consistent outdating of the discovered sequential patterns because of the
incremental evolution of databases (progressive databases) over time. This necessitates
incremental mining of sequential patterns from progressive database so as to capture the
dynamic nature of data addition, deletion and updating. In present study, an extensive
study of the algorithms that perform discovery of constraint-based sequential patterns
v
and incremental mining of sequential patterns available in the literature is done. On the
basis of the study; efficient algorithms for incremental mining of sequential patterns from
progressive database are proposed.
This study contributes to the literature as it identifies the approach and
constraints that can be used for developing algorithms for the incremental mining of
sequential patterns. The constraints have been selected to keep track of customer
purchasing history and further be used for customer value analysis for CRM. A COPRE
(COnstraint-based PREfix) framework is developed based on pattern growth approach
that can incorporate any number of constraints and used for any domain. The first three
proposed algorithms, ‚Constraint-based sequential pattern mining algorithm
incorporating compactness, length and monetary‛; ‚CFM-PrefixSpan algorithm‛ and
‚Progressive CFM-Miner algorithm‛ are based on this framework. The real-world
business and commercial databases consist of numerical and time-stamped data. We can
get more relevant rules from these databases by making use of fuzzy set theory to
minimize the sharp cut between intervals. Fuzzy sequential mining generate simple and
practical patterns which are close to human reasoning. So, the fourth algorithm,
‚Progreslattice-Miner (PLM) algorithm‛ is ba