E-Filing Report
CHAPTER 1
INTRODUCTION
This chapter provides an overview of the research project and discusses the
research background, problem statement, objectives, scope and significance of
the research.
1.1 Research Background
E-filing provides access to a large database that consists of a list of
electronic files. According to Olson, Edwards and Monty (2003), e-filing is a
highly secure and reliable method for sending, receiving and managing legal
documents. Finding files manually takes time, whereas e-filing provides
secured access that lets staff identify the files they need without searching
large shelves by hand. Olson et al. (2003) also stated that state courts,
federal courts and law firms across the country are increasingly using
e-filing to improve access to documents, maximize resources and streamline
filing and service activities. With e-filing it is much easier to check the
status of a file and identify its location before retrieving the physical file.
The purpose of this research is to develop a prototype of an e-filing
web-based system for Majlis Daerah Kerian. Majlis Daerah Kerian, Parit Buntar,
Perak acts as a local government, the unit of government closest to the
citizens. Local governments include municipalities, local authorities, town
councils and city councils. There are eight departments in Majlis Daerah
Kerian: the Law and Administration Unit, Assessment Unit, Information
Technology Unit, Account and Finance Unit, License and Parking Unit, Town
Service Unit, Garden and Recreation Unit, and Building Unit.
Within the e-filing web-based system, staff can easily gather information
about the status of files and identify the files that meet their
requirements. The system is developed using a data mining technique,
specifically clustering. According to Phyu (2009), data mining involves the
use of sophisticated data analysis tools to discover previously unknown, valid
patterns and relationships in large data sets. Data mining therefore consists
of more than collecting and managing data; it also includes analysis and
prediction. Garofalakis, Rastogi, Seshadri and Shim (1999) stated that there
are three popular data mining techniques: association rules, classification
and clustering. This research identifies a suitable searching method from
among the association, classification and clustering techniques in order to
develop a prototype of the e-filing web-based system.
1.2 Problem Statement
The staff of Majlis Daerah Kerian face difficulties in managing and
identifying the files that meet their requirements, because searching for
files manually is difficult. According to Mrs. Shalina, Administrative
Assistant of Majlis Daerah Kerian, a manual search involves the following
steps:
a. Look up the number of the required file in a log book.
b. Determine the file name from the file number.
c. Look for the file on many large shelves, which takes a long time.
d. Check each staff member’s desk, or other departments in Majlis
Daerah Kerian, if the file is not on the shelf.
All these steps create barriers to responding well to each action. With this
system, staff can find the file that satisfies their needs, which creates an
interactive environment for them.
1.3 Aim
The aim of this research project is to provide a suitable searching method
using data mining techniques for an e-filing web-based system.
1.4 Objective of the Research
To achieve the aim above, the research has four objectives:
a. To identify the requirements for e-filing at Majlis Daerah Kerian.
b. To identify a searching method based on data mining techniques.
c. To design an e-filing web-based system.
d. To demonstrate the e-filing web-based system using the identified data
mining technique.
1.5 Significance of Research
The significance of this development is that the system can be used by staff
at Majlis Daerah Kerian. E-filing will act as an information center where
staff can gather information about the status of files. It also provides staff
with an interactive environment for choosing the files that meet their
requirements.
1.6 Scope of Study
The e-filing web-based system is developed using PHP with a MySQL
database. The development is for Majlis Daerah Kerian, Parit Buntar, Perak
and focuses on filing management only. It is a web-based application that
can be accessed via a browser and will be used internally by Majlis Daerah
Kerian’s employees.
1.7 Limitation
The main task in this study is to gather information from staff at Majlis
Daerah Kerian who are involved in filing management. The information is
gathered through interviews, which require arranging schedules and finding
the right interviewees in order to obtain proper and effective sessions.
Interview time is the main constraint, because the researcher has to
reschedule whenever an interviewee cancels a session. This makes it difficult
for the researcher to gather all of the information and raises the
possibility of missing some important information. The interview sessions
were conducted at Majlis Daerah Kerian, Parit Buntar, Perak.
Another limitation is that there are three different data mining techniques,
and the researcher must select the one that best suits the objective. The
researcher needs to study each data mining technique properly and find
related journals that support the findings.
Next, a large number of data mining tools are available, but not all of them
support every kind of data mining technique. The researcher therefore needs
to study the tools in terms of their functions and their usability with the
selected technique. Furthermore, the tool used in this research is new to the
researcher, so time is required to become familiar with it.
The researcher’s experience is another limiting factor, as this is the
researcher’s first research project. However, the researcher can learn and
obtain proper guidance from the research plan and from the instructions of
the supervisor and examiner.
1.8 Outcomes/Deliverables
The outcome of the research project is a suitable searching method using a
data mining technique for an e-filing web-based system.
1.9 Layout of Dissertation
This research project has both a theoretical and a practical part. The
theoretical part describes the concepts and reviews the literature on e-filing
and data mining techniques. The practical part consists of an analysis of the
data gathered from the interview sessions and of secondary data from the
literature review.
The remaining chapters of this research are:
Chapter 2 reviews the literature on e-filing and data mining
techniques, which acts as a reference for this research project.
Chapter 3 describes the research approach and methodology used in
this research project. The choice of method, how data is gathered and
the strategy used to perform an analysis of the data are explained.
Chapter 4 discusses the construction of the system’s prototype.
Chapter 5 discusses the findings and the analysis from the interview
sessions and secondary data.
Chapter 6 provides the conclusion and recommendations for further
research.
1.10 Summary
This chapter explained the background of the problem and its proposed
solution, together with a brief explanation of the solution. It covered the
important aspects of the project, such as the research background, objectives,
scope and significance. The methodology diagram shown in Figure 3.1 in
Chapter 3 and the contents of this chapter serve as the basis for direction
in the following chapters.
The next chapter discusses the literature review for the research project.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
This chapter describes in detail the literature that supports the research
project. The literature review also clarifies the relationship between this
study and previous work on the topic. The chapter covers an overview of
e-filing and data mining, a brief explanation of each data mining technique
and the steps in selecting data mining tools.
2.2 E-Filing
2.2.1 Introduction to E-Filing
E-filing provides access to a large database that consists of a list of
electronic files. According to Olson et al. (2003), e-filing is a highly
secure and reliable method for sending, receiving and managing legal
documents and case information. However, the rules for implementing
e-filing need to be fully understood in order to achieve the best filing.
2.2.2 Purposes of the Rules in E-Filing
According to Olson et al. (2003), there are several reasons why rules are
important for electronic filing:
To define the electronic filing system: Electronic filing and
service can mean anything, so the type of files must be clearly
defined in order to provide guidance on where and how to access
them.
To authorize electronic filing and service: Rules of
procedure are very specific when it comes to defining the
mechanical rules of filing. The valid method for delivering a
document into the right file needs to be identified.
To clearly specify the procedural mechanics: The rules should
state how to file electronically, how security, service and filing
deadlines are handled, and how to sign documents electronically,
so that the procedure stays simple.
To encourage use of electronic filing: Electronic filing
is new to some people, and training is the way to encourage
them to use the system.
2.2.3 Proposed Model Rules for E-Filing
According to Olson et al. (2003), these rules below may be cited as
“e-filing rules” :
Short title
Clear definitions of files
Give authority
Determine authorized users
Give effective date
Signature to identify responsible user
2.3 What is Data Mining?
2.3.1 Definition of Data Mining
According to Phyu (2009), data mining is the use of sophisticated data
analysis tools to discover previously unknown, valid patterns and
relationships in large data sets.
According to Chen, Han and Yu (1996), data mining which is also
referred to as knowledge discovery in databases, means a process of
nontrivial extraction of implicit, previously unknown and potentially
useful information (such as knowledge rules, constraints, regularities)
from data in databases.
Tang, Steinbach and Kumar (2006) stated that data mining is the
process of automatically discovering useful information in large
database repositories. Data mining techniques are deployed to scour
large database in order to find novel and useful patterns that might
otherwise remain unknown.
Many other terms found in articles and journals carry a similar or
slightly different meaning, such as knowledge mining from databases,
knowledge extraction, data archeology, data dredging and data
analysis.
2.3.2 Data Mining and Knowledge Discovery
Data mining is an integral part of knowledge discovery in databases,
which is the overall process of converting raw data into useful
information, as shown in Figure 2.1. This process consists of a series
of transformation steps, from data preprocessing to postprocessing of
data mining results (Tang et al., 2006).
Figure 2.1 : The Process of knowledge discovery in database.
Tang et al. (2006) stated that the input data can be stored in a variety
of formats (flat files, spread-sheets, or relational tables) and may
reside in a centralized data repository or be distributed across multiple
sites. The purpose of preprocessing is to transform the raw input data
into an appropriate format for subsequent analysis. The steps involved
in data preprocessing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and
selecting records and features that are relevant to the data mining task
at hand. Because of the many ways data can be collected and stored,
data preprocessing is perhaps the most laborious and time-consuming
step in the overall knowledge discovery process.
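The preprocessing steps described above (fusing data from multiple sources, removing noise and duplicate observations, and selecting relevant records and features) can be illustrated with a minimal Python sketch. The field names (file_no, title, status, clerk) and the sample records are hypothetical, invented for illustration only, and are not taken from Majlis Daerah Kerian's data.

```python
def preprocess(sources, keep_fields):
    """Fuse several record lists, drop noisy and duplicate rows, keep chosen fields."""
    fused = [row for source in sources for row in source]  # fuse multiple sources
    seen = set()
    cleaned = []
    for row in fused:
        # Normalize the identifier so that "mdk/01" and "MDK/01 " match.
        file_no = (row.get("file_no") or "").strip().upper()
        if not file_no:        # noise: a record without an identifier
            continue
        if file_no in seen:    # duplicate observation
            continue
        seen.add(file_no)
        selected = {field: row.get(field) for field in keep_fields}  # feature selection
        selected["file_no"] = file_no
        cleaned.append(selected)
    return cleaned


# Hypothetical inputs: a transcribed log book and a shelf inventory.
log_book = [
    {"file_no": "mdk/01", "title": "Parking licence", "status": "out", "clerk": "A"},
    {"file_no": "", "title": "Unlabelled", "status": "in", "clerk": "B"},
]
shelf_list = [
    {"file_no": "MDK/01 ", "title": "Parking licence", "status": "out"},
    {"file_no": "MDK/02", "title": "Building permit", "status": "in"},
]

records = preprocess([log_book, shelf_list], keep_fields=["file_no", "title", "status"])
for record in records:
    print(record)
```

In this sketch the unlabelled record is treated as noise and the second copy of MDK/01 as a duplicate, so two cleaned records remain. What counts as noise in a real filing system would have to be decided from the actual data.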
“Closing the loop” is the phrase often used to refer to the process of
integrating data mining results into decision support systems. For
example, in business applications, the insights offered by data mining
results can be integrated with campaign management tools so that
effective marketing promotions can be conducted and tested. Such
integration requires a postprocessing step that ensures that only valid
and useful results are incorporated into the decision support system.
Statistical measures or hypothesis testing methods can also be applied
during postprocessing to eliminate spurious data mining results.
According to Shyu, Chen and Haruechaiyasak (2005), data mining, or
knowledge discovery in databases, has emerged recently as an active
research area for extracting implicit, previously unknown, and
potentially useful information from large databases. They apply data
mining techniques in the information retrieval (IR) context, specifically
as information filtering tools for a recommender system framework.
The overall process for designing and implementing a recommender
system is illustrated in Figure 2.2. The process involves the following
five steps.
Figure 2.2 : Process for designing and implementing a
recommender system (Shyu et al., 2005)
Data Collection: This initial step involves the collection of data sets
for executing the data mining algorithms. Three data components are
considered: (a) textual content (i.e., index terms or keywords), (b) link
structure (embedded hyperlinks within Web pages), and (c) user log
records.
Data Preprocessing: This step is required to clean and transform the
collected data sets into the formats which are suitable for the data
mining algorithms. This step includes the data reduction and selection
techniques to improve the efficiency of the data mining algorithms.
Information Filtering via Data Mining: This step is the core
process of the recommender system framework, where the data sets
are analyzed and the data mining algorithms are applied as the
information filtering tools to generate and discover any useful and
interesting recommended outputs.
Database Design and Implementation: This step aims to improve the
efficiency of data and information access and retrieval.
User Interface Design and Implementation: The user interface acts
as an intermediary between the users and the recommender system.
This step involves the design and implementation of a Web (i.e.,
HTTP) server which receives the users’ requests via the WWW,
processes the requests by accessing the database, and responds by
returning the results to the users. The user interface provides a
recommendation function with the user personalization technique by
requiring each user to log into the system in order to keep track of the
preferences.
2.3.3 Challenges of Data Mining
According to Tang et al. (2006), traditional data analysis techniques
have often encountered practical difficulties in meeting the challenges
posed by new data sets.
Chen et al. (1996) stated that it is important to examine what kind of
features an applied knowledge discovery system is expected to have
and what kind of challenges may be faced in the development of data
mining techniques.
Chen et al. (1996) also list the challenges faced during the
development of data mining techniques:
a. Handling of different types of data.
There are many kinds of data and databases used in
different applications, so a knowledge discovery system
should be able to perform effective data mining on
different kinds of data. Since most available databases are
relational, it is crucial that a data mining system performs
effective knowledge discovery on relational data. Many
databases also contain complex data types, such as
structured data and complex data objects, hypertext and
multimedia data, spatial and temporal data, transaction
data, legacy data and so on, so a powerful system should be
able to perform efficient data mining on complex types of
data as well. However, because of the diversity of data
types, a given data mining system may handle only specific
kinds of data, as in systems dedicated to knowledge mining
in relational databases, transaction databases, spatial
databases, multimedia databases and so on.
b. Efficiency and scalability of data mining algorithms.
To extract information from a large amount of data in
databases, the knowledge discovery algorithms must be
efficient and scalable. That is, the running time of a data
mining algorithm must be predictable and acceptable for
large databases.
c. Usefulness, certainty, and expressiveness of data
mining results.
The discovered knowledge should accurately portray the
contents of the database and be useful for certain
applications. This also encourages a systematic study of
measuring the quality of the discovered knowledge,
including interestingness and reliability, by the
construction of statistical, analytical and simulative
models and tools.
d. Expression of various kinds of data mining requests
and results.
Different kinds of knowledge can be discovered from a
large amount of data. It is important to examine the
discovered knowledge from different views and present it
in different forms. This requires both the data mining
requests and the discovered knowledge to be expressed in
high-level languages or graphical user interfaces, so that
the data mining task can be specified by nonexperts and the
discovered knowledge is understandable and directly usable
by users.
e. Interactive mining of knowledge at multiple abstraction
levels.
A high-level data mining query should be treated as a
probe which may disclose some interesting traces for
further exploration. Interactive discovery allows users to
interactively refine a data mining request, dynamically
change data focusing, progressively deepen a data mining
process and flexibly view the data and data mining results
at multiple abstraction levels and from different angles.
f. Mining information from different sources of data.
Many sources of data are available through local and wide-
area computer networks, including the Internet. Mining
knowledge from different sources of formatted or
unformatted data with diverse characteristics is a new
challenge for data mining. Data mining may help by coming
up with simple query systems.
g. Protection of privacy and data security.
Protecting data security and guarding against the invasion
of privacy are important when data can be viewed from
many different angles and at different abstraction levels.
Security measures can prevent the disclosure of sensitive
information.
However, these requirements may conflict. For example,
protection of data security may conflict with the requirements of
interactive mining of multiple-level knowledge from different angles.
2.4 Data Mining Techniques
2.4.1 Overview of Data Mining Techniques
According to Garofalakis et al. (1999), key data mining algorithms
have been developed for large databases. They also identified the
popular data mining techniques as association rules, classification
and clustering.
2.4.2 Classifying Data Mining Techniques
Chen et al. (1996) stated that data mining techniques can be classified
according to the following criteria:
Type of databases to work on
A data mining system can be classified according to the
kinds of databases on which the data mining is performed.
Identifying the data type is important in order to specify
the area in which the system will operate. For example, a
system is a relational data miner if it discovers knowledge
from relational data, or an object-oriented one if it mines
knowledge from object-oriented databases. In general, a
data miner can be classified according to its mining of
knowledge from the following different kinds of databases:
relational databases, transaction databases, object-oriented
databases, deductive databases, spatial databases, temporal
databases, multimedia databases, heterogeneous databases,
active databases, legacy databases, and the Internet
information-base.
Type of knowledge to be mined
Data miners can discover several kinds of knowledge,
including association rules, characteristic rules,
classification rules, clustering and deviation analysis.
The knowledge discovered also depends on the abstraction
level of the databases.
Type of techniques to be utilized
Data miners can be categorized according to the
underlying data mining techniques and approach. For
example, they can be categorized according to the driving
method into autonomous knowledge miners, data-driven
miners, query-driven miners and interactive data miners.
They can also be categorized according to the underlying
data mining approach into generalization-based mining,
pattern-based mining, mining based on statistics or
mathematical theories, integrated approaches, and so on.
2.4.3 Association Rules
Association rules provide a useful mechanism for discovering
correlations among items belonging to customer transactions in a
market basket database (Garofalakis et al., 1999). For example, given
a database of sales transactions, it is desirable to discover the
important associations among items such that the presence of some
items in a transaction implies the presence of other items in the
same transaction.
Chen et al. (1996) stated that the problem of mining association rules
can be decomposed into the following two steps:
a. Discover the large item sets.
b. Use the large item sets to generate the association rules for
the database.
The overall performance of mining association rules is determined by
the first step. After the large item sets are identified, the
corresponding association rules can be derived in a straightforward
manner.
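The two steps above can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the algorithm of Chen et al. (1996): the minimum support count and minimum confidence thresholds are standard in the association-rule literature but are not given in this text, and the transactions and threshold values are invented for the example.

```python
from itertools import combinations


def large_itemsets(transactions, min_support):
    """Step a: find every item set contained in at least min_support transactions."""
    items = sorted({item for t in transactions for item in t})
    counts = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                counts[frozenset(candidate)] = support
                found = True
        if not found:  # no large set of this size, so no larger one can exist
            break
    return counts


def association_rules(counts, min_confidence):
    """Step b: derive rules lhs -> rhs from the large item sets."""
    rules = []
    for itemset, support in counts.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a large item set is itself large, so counts[lhs] exists.
                confidence = support / counts[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return rules


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
counts = large_itemsets(transactions, min_support=2)
rules = association_rules(counts, min_confidence=0.8)
for lhs, rhs, confidence in rules:
    print(lhs, "->", rhs, confidence)
```

The first function does almost all of the work, since it scans the transactions for every candidate item set, while the second only divides counts that are already known. This mirrors the observation that overall performance is determined by the first step.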
Figure 2.3 : The general architecture of Mining Association Rule
model (Defit & Md Sap, 2001)
Figure 2.3 represents the general architecture of the Mining
Association Rule (MAR) model. The MAR model consists of two main
modules, pre-processing and processing. The pre-processing module
transforms the data and identifies and removes inconsistent data from
the databases. The processing module then generates rules and
evaluates the generated rules.
2.4.4 Classification
Data classification is the process which finds the common properties
among a set of objects in a database and classifies them into different
classes, according to a classification model (Chen et al., 1996).
Chen et al. (1996) also stated that the objective of classification is
to:
a. Analyze the training data.
b. Develop an accurate description or a model for each class
using the features available in the data.
Garofalakis et al. (1999) stated that classification is useful in the
Web context to build taxonomies and topic hierarchies on Web pages,
and subsequently perform context-based searches for Web pages
relating to a specific topic. Decision tree classifiers are popular since
they are easily interpreted by humans and are efficient to build.
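The two objectives of classification (analyzing the training data, then building a description of each class from the available features) can be illustrated with a small Python sketch. It uses a simple word-count model rather than a decision tree, purely to keep the example short. The class labels reuse two department names from Chapter 1, but the file titles and their assignment to departments are hypothetical.

```python
from collections import Counter, defaultdict


def train(training_data):
    """Analyze labelled examples and build one word-count description per class."""
    model = defaultdict(Counter)
    for text, label in training_data:
        model[label].update(text.lower().split())
    return model


def classify(model, text):
    """Assign the class whose description shares the most words with the text."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)


# Hypothetical training data: (file title, responsible department).
training_data = [
    ("parking licence renewal", "License and Parking Unit"),
    ("hawker licence application", "License and Parking Unit"),
    ("building plan approval", "Building Unit"),
    ("building renovation permit", "Building Unit"),
]
model = train(training_data)
predicted = classify(model, "new building permit")
print(predicted)
```

Because the categories are fixed in advance by the labelled examples, this is classification rather than clustering: a new file title can only be assigned to a department that already appears in the training data.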
Figure 2.4 : Hierarchical Classification Process
(Khodra & Widyantoro, 2007)
Figure 2.4 shows the hierarchical classification process, which
consists of two stages: an offline stage and an online stage. The
offline stage encodes classification scheme metadata for each web
page. In the online stage, all search results are hierarchically
categorized using the classification scheme provided in the metadata
of the retrieved documents. The classification scheme is a total
ordering of classes from the most general (i.e. the root of the
ontology) to the most specific (i.e. a leaf of the ontology). Khodra
and Widyantoro (2007) use Lucene as the search engine and combine it
with an interactive navigation interface generator that uses this
hierarchical structure to present the list of search results
hierarchically.
2.4.5 Clustering
Visnick (2003) defined clustering as a technique to achieve high data
density. She classifies clustering into three techniques, namely
isolate index, object pooling and object modeling, each of which
performs a different function.
Chen et al. (1996) defined clustering as the process of grouping
physical or abstract objects into classes of similar objects. It helps
the data miner construct a meaningful partitioning of a large set of
objects based on a “divide and conquer” methodology, which
decomposes a large-scale system into smaller components to simplify
design and implementation.
Garofalakis et al. (1999) defined clustering as a useful technique for
discovering interesting data distributions and patterns in the
underlying data.
Qiu, Davis and Ikem (2004) stated that clustering techniques are
heuristic in nature. Almost all techniques have a number of arbitrary
parameters that can be “adjusted” to improve results.
Clustering techniques fall into the following broad categories :
a. Hierarchical vs partitional : Hierarchical techniques produce
a nested sequence of partitions, with a single, all inclusive
cluster at the top and singleton clusters of individual instances
at the bottom. Each intermediate level can be viewed as a
combination of two clusters from the next lower level, or a
split of a cluster from the next higher level into two.
Partitional (or non-nested) techniques create a one-level
partitioning of the data instances. After the user specifies the
desired number of clusters, a partitional approach typically
finds all clusters at once. This is in contrast to traditional
hierarchical schemes, which bisect a cluster to get two clusters
or merge two clusters to get one.
b. Divisive vs agglomerative : Hierarchical clustering
techniques proceed either from the top to the bottom or from
the bottom to the top, i.e. clustering starts with one large
cluster and splits it, or starts with clusters each containing a
point and then merges them.
c. Incremental vs non-incremental : Some clustering
techniques work with one instance at a time and decide how to
place it into an appropriate cluster, but most clustering
techniques are non-incremental, using information about all
the instances at once to form clusters.
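As an illustration of the hierarchical, agglomerative (bottom-up) category described above, the following Python sketch starts with singleton clusters and repeatedly merges the two closest clusters, recording the nested sequence of partitions. The one-dimensional sample points and the single-linkage merge rule are assumptions chosen for brevity; they do not come from the cited sources.

```python
def agglomerative(points, target_clusters=1):
    """Bottom-up clustering of 1-D points: start from singletons, merge the closest pair."""
    clusters = [[p] for p in sorted(points)]
    history = [[list(c) for c in clusters]]  # nested sequence of partitions
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])  # merge two clusters into one
        del clusters[j]
        history.append([list(c) for c in clusters])
    return history


levels = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0])
for level in levels:
    print(level)
```

Each entry of `levels` is one level of the hierarchy, from singleton clusters at the bottom to a single all-inclusive cluster at the top. A divisive method would produce a similar hierarchy by splitting in the opposite direction.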
Typical pattern clustering activity involves the following steps (Jain,
Murty and Flynn, 2000):
Pattern representation (optionally including feature
extraction and/or selection)
Definition of a pattern proximity measure appropriate to the
data domain
Clustering or grouping
Data abstraction (if needed), and
Assessment of output (if needed).
Figure 2.5 : Stages in clustering (Jain et al., 2000)
Figure 2.5 depicts a typical sequencing of the first three of these steps,
including a feedback path where the grouping process output could
affect subsequent feature extraction and similarity computations.
Pattern representation refers to the number of classes, the number of
available patterns, and the number, type, and scale of the features
available to the clustering algorithm. Some of this information may
not be controllable by the practitioner. Feature selection is the process
of identifying the most effective subset of the original features to use
in clustering. Feature extraction is the use of one or more
transformations of the input features to produce new salient features.
Either or both of these techniques can be used to obtain an appropriate
set of features to use in clustering. Pattern proximity is usually
measured by a distance function defined on pairs of patterns. A
variety of distance measures are in use in the various communities. A
simple distance measure like Euclidean distance can often be used to
reflect dissimilarity between two patterns, whereas other similarity
measures can be used to characterize the conceptual similarity
between patterns. The grouping step can be performed in a number of
ways. The output clustering (or clusterings) can be hard (a partition of
the data into groups) or fuzzy (where each pattern has a variable
degree of membership in each of the output clusters).
Hierarchical clustering algorithms produce a nested series of
partitions based on a criterion for merging or splitting clusters based
on similarity. Partitional clustering algorithms identify the partition
that optimizes (usually locally) a clustering criterion.
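A partitional algorithm of the kind just described can be sketched with k-means, which uses Euclidean distance as the proximity measure and produces a hard clustering. K-means is named here only as a common example of a partitional method; the text does not state which algorithm the system uses. The sample points and the starting centroids are assumptions, and the centroids are passed in explicitly so that the result is reproducible.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def k_means(points, centroids, max_iterations=100):
    """Partition points into len(centroids) hard clusters."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change: a (local) optimum was reached
            break
        centroids = new_centroids
    return clusters, centroids


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters, centroids = k_means(points, centroids=[(1, 1), (8, 8)])
print(clusters)
print(centroids)
```

The result depends on the starting centroids, which is one of the arbitrary parameters that Qiu, Davis and Ikem (2004) note can be adjusted. The stopping rule only guarantees a local optimum of the clustering criterion.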
2.5 Selecting Data Mining Techniques
It is important for the researcher to select a suitable searching method using
data mining techniques in order to accomplish the objective. The researcher
decided to review three main data mining techniques: classification,
association and clustering. These techniques serve the same overall objective
of data mining but differ in their function and suitability for the system.
According to Tang et al. (2006), data mining is a technology that blends
traditional data analysis methods with sophisticated algorithms for processing
large volumes of data. It has also opened up exciting opportunities for
exploring and analyzing new types of data and for analyzing old types of data
in new ways.
Classification, which is the task of assigning objects to one of several
predefined categories, is a pervasive problem that encompasses many diverse
applications. Examples include detecting spam email messages based upon
the message header and content, categorizing cells as malignant or benign
based upon the results of MRI scans and classifying galaxies based upon their
shapes. (Tang et al., 2006)
Association is useful for discovering interesting relationships hidden in large
data sets. The uncovered relationships can be represented in the form of
association rules or sets of frequent items. Many business enterprises
accumulate large quantities of data from their day-to-day operations. For
example, huge amounts of customer purchase data are collected daily at the
checkout counters of grocery stores. Retailers are interested in analyzing the
data to learn about the purchasing behavior of their customers. Such valuable
information can be used to support a variety of business-related applications
such as marketing promotions, inventory management and customer
relationship management. Association techniques discover patterns in large
transaction data sets and evaluate the discovered patterns in order to
prevent the generation of spurious results (Tang et al., 2006).
Clustering divides data into groups (clusters) that are meaningful, useful, or both.
If meaningful groups are the goal, then the clusters should capture the natural
structure of the data. The concept of clustering has been around for a long
time. It has several applications, particularly in the context of information
retrieval and in organizing web resources. The main purpose of clustering is
to locate information and in the present day context, to locate most relevant
electronic resources. In database management, data clustering is a technique
in which, the information that is logically similar is physically stored
together. In order to increase the efficiency of search and the retrieval in
database management, the number of disk accesses is to be minimized. In
clustering, since the objects of similar properties are placed in one class of
objects, a single access to the disk can retrieve the entire class. If the
clustering takes place in some abstract algorithmic space, we may group a
population into subsets with similar characteristic, and then reduce the
problem space by acting on only a representative from each subset.
Clustering is ultimately a process of reducing a mountain of data to
manageable piles. Examples include analyzing the large amounts of genetic
information that are now available, grouping search results into a small
number of clusters, identifying different types of depression, and segmenting
customers into a small number of groups for additional analysis and
marketing activities (Ravichandra, 2003).
However, it is important for the researcher to identify the suitability of each
technique in order to implement a good searching method. The researcher
reviewed the techniques based on their definition, concept, functions,
suitability and the examples given in several journals (refer to Table 2.1).
Table 2.1 : Differences of Classification, Association and Clustering
techniques

Definition
Classification: Data classification is the process which finds the common
properties among a set of objects in a database and classifies them into
different classes, according to a classification model. (Chen et al., 1996)
Association: Association rules provide a useful mechanism for discovering
correlations among items belonging to customer transactions in a market
basket database. (Garofalakis et al., 1999)
Clustering: Clustering is the process of grouping physical or abstract
objects into classes of similar objects. (Chen et al., 1996)

Concept
Classification: Classification, which is the task of assigning objects to
one of several predefined categories, is a pervasive problem that encompasses
many diverse applications. (Tang et al., 2006)
Association: Association is useful for discovering interesting relationships
hidden in large data sets. (Tang et al., 2006)
Clustering: Clustering divides data into groups (clusters) that are
meaningful, useful, or both. (Ravichandra, 2003)

Functions
Classification: Classification is useful in the Web context to build
taxonomies and topic hierarchies on Web pages, and subsequently perform
context-based searches for Web pages relating to a specific topic.
(Garofalakis et al., 1999)
Association: It discovers patterns in large transaction data sets and
evaluates the discovered patterns in order to prevent the generation of
spurious results. (Tang et al., 2006)
Clustering: It helps the data miner to construct a meaningful partitioning of
a large set of objects based on a "divide and conquer" methodology which
decomposes a large scale system into smaller components to simplify design
and implementation. (Chen et al., 1996)

Suitability
Classification: Develop an accurate description or a model for each class
using the features available in the data. (Chen et al., 1996)
Association: Discovering correlations among items belonging to customer
transactions in a market basket database for market analysis. (Garofalakis et
al., 1999)
Clustering: Increase the efficiency of search and retrieval in database
management. (Ravichandra, 2003)

Examples
Classification: Detecting spam email messages based upon the message header
and content, categorizing cells as malignant or benign based upon the results
of MRI scans and classifying galaxies based upon their shapes. (Tang et al.,
2006)
Association: Huge amounts of customer purchase data are collected daily at
the checkout counters of grocery stores. Retailers are interested in
analyzing the data to learn about the purchasing behavior of their customers.
(Tang et al., 2006)
Clustering: Analyze the large amounts of genetic information that are now
available, group search results into a small number of clusters, identify
different types of depression and segment customers into a small number of
groups for additional analysis and marketing activities. (Ravichandra, 2003)
Based on the comparison above, which reviewed each technique by its
definition, concept, functions, suitability and the examples given in several
journals, the researcher found that clustering is the most suitable searching
method for the e-filing web-based system.
Although classification, association and clustering are similar in that all
three support information retrieval, they differ in how the information is
retrieved, analyzed and delivered. Classification assigns objects to one of
several predefined categories in order to develop a model for each class
using the features available in the data. Association discovers correlations
among data in order to identify interesting relationships hidden in large
data sets, especially for market analysis. Clustering, by contrast, groups
physical or abstract objects into sets of similar objects to provide a
simplified list of data. In other words, it divides data into groups that are
similar, meaningful and useful.
Clustering suits the searching method because partitioning a large set of
data decomposes a large search result into smaller components, which
simplifies the content. It helps users to review accurate search results that
fulfil their needs and expectations.
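Decomposing a search result into smaller groups can be sketched as follows. This is a simple "leader" clustering over word overlap (Jaccard similarity), chosen only because it is short; it is not the algorithm used by Carrot², and the file titles and the threshold value are hypothetical.

```python
# Hypothetical file titles returned by a search, invented for illustration.
results = [
    "parking licence renewal 2009",
    "building permit application",
    "parking licence appeal",
    "building permit inspection report",
    "garden maintenance schedule",
]


def tokens(title):
    """Lower-cased set of words in a title."""
    return set(title.lower().split())


def jaccard(a, b):
    """Overlap of two word sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_results(titles, threshold=0.3):
    """Each title joins the first group whose leader is similar enough,
    otherwise it starts a new group and becomes that group's leader."""
    groups = []  # list of (leader word set, member titles)
    for title in titles:
        words = tokens(title)
        for leader_words, members in groups:
            if jaccard(words, leader_words) >= threshold:
                members.append(title)
                break
        else:
            groups.append((words, [title]))
    return [members for _, members in groups]
```

With these sample titles, the five results fall into three groups (parking licences, building permits and the garden schedule), so a user scans three headings instead of one undivided list.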
In terms of suitability, clustering increases the efficiency of search and
retrieval of information in database management (Ravichandra, 2003). It
analyzes the search results to identify similarities between them and
provides a simplified list of results. Further reasons why clustering is
suitable as the searching method in the e-filing web-based system are
discussed in Chapter 5 (Result and Findings).
2.6 Selecting Data Mining Tools
Data mining tools are widely used to solve real-world problems in
engineering, science and business (Abbott, Matkovsky & Elder, 1998). The
number of data mining tools keeps increasing, which makes selecting an
effective tool more challenging. The data mining tool market has become more
crowded in recent years, with more than 50 commercial data mining tools
listed on the KDnuggets website (http://www.kdnuggets.com). KDnuggets
describes itself as the data mining community's top resource since 1997 for
data mining and analytics news, tools, jobs, courses, data and more.
Collier, Carey, Sautter and Marjaniemi (1999) proposed four categories of
criteria for selecting among the assortment of commercially available data
mining tools, which are:
a. Performance
Performance (Table 2.2) is the ability to handle a variety of data
sources in an efficient manner. From a computational perspective,
hardware configuration has a major impact on tool performance, and
some data mining algorithms are more efficient than others.
However, this category focuses on the qualitative aspects of a
tool's ability to handle data easily under a variety of hardware
configurations. The criteria to consider are platform variety,
software architecture, heterogeneous data access, data size,
efficiency, interoperability and robustness.
Table 2.2 : Computational Performance Criteria (Collier et al., 1999)
Criteria: Description
Platform Variety: Does the software run on a wide-variety of computer platforms? More importantly, does it run on typical business user platforms?
Software Architecture: Does the software use client-server architecture or a stand-alone architecture? Does the user have a choice of architectures?
Heterogeneous Data Access: How well does the software interface with a variety of data sources (RDBMS, ODBC, CORBA, etc)? Does it require any auxiliary software to do so? Is the interface seamless?
Data Size: How well does the software scale to large data sets? Is performance linear or exponential?
Efficiency: Does the software produce results in a reasonable amount of time relative to the data size, the limitations of the algorithm, and other variables?
Interoperability: Does the tool interface with other KDD support tools easily? If so, does it use a standard architecture such as CORBA or some other proprietary API?
Robustness: Does the tool run consistently without crashing? If the tool cannot handle a data mining analysis, does it fail early or when the analysis appears to be nearly complete? Does the tool require monitoring and intervention or can it be left to run on its own?
b. Functionality
There is a variety of capabilities, techniques and methodologies
for data mining (Table 2.3). Software functionality helps to assess
how well a tool adapts to different data mining problems. The
criteria in the functionality category are algorithmic variety,
prescribed methodology, model validation, data type flexibility,
algorithm modifiability, data sampling, reporting and model
exporting.
Table 2.3 : Functionality Criteria (Collier et al., 1999)
Criteria: Description
Algorithmic Variety: Does the software provide an adequate variety of mining techniques and algorithms including neural networks, rule induction, decision trees, clustering, etc.?
Prescribed Methodology: Does the software aid the user by presenting a sound, step-by-step mining methodology to help avoid spurious results?
Model Validation: Does the tool support model validation in addition to model creation? Does the tool encourage validation as part of the methodology?
Data Type Flexibility: Does the implementation of the supported algorithms handle a wide-variety of data types, continuous data without binning, etc.?
Algorithm Modifiability: Does the user have the ability to modify and fine-tune the modeling algorithms?
Data Sampling: Does the tool allow random sampling of data for predictive modeling?
Reporting: Are the results of a mining analysis reported in a variety of ways? Does the tool provide summary results as well as detailed results? Does the tool select actual data records that fit a target profile?
Model Exporting: After a model is validated does the tool provide a variety of ways to export the tool for ongoing use (e.g., C program, SQL, etc.)?
c. Usability
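Usability concerns how well a tool accommodates different levels
and types of users (Table 2.4). One problem with easy-to-use mining
tools is their potential misuse. The criteria to consider are user
interface, learning curve, user types, data visualization, error
reporting, action history and domain variety. Data cleansing, value
substitution, data filtering, binning, deriving attributes,
randomization, record deletion, handling blanks, metadata
manipulation and result feedback belong to the fourth category of
Collier et al. (1999), ancillary task support.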
Table 2.4 : Usability Criteria (Collier et al., 1999)
Criteria: Description
User Interface: Is the user interface easy to navigate and uncomplicated? Does the interface present results in a meaningful way?
Learning Curve: Is the tool easy to learn? Is the tool easy to use correctly?
User Types: Is the tool designed for beginning, intermediate, advanced users or a combination of user types? How well suited is the tool for its target user type? How easy is the tool for analysts to use? How easy is the tool for business (end) users to use?
Data Visualization: How well does the tool present the data? How well does the tool present the modeling results? Are there a variety of graphical methods used to communicate information?
Error Reporting: How meaningful is the error reporting? How well do error messages help the user debug problems? How well does the tool accommodate errors or spurious model building?
Action History: Does the tool maintain a history of actions taken in the mining process? Can the user modify parts of this history and re-execute the script?
Domain Variety: Can the tool be used in a variety of different industries to help solve a variety of different kinds of business problems? How well does the tool focus on one problem domain? How well does it focus on a variety of domains?
Data mining tools are costly and generally come with a moderately steep
learning curve. Selecting the wrong tool is expensive in terms of both wasted
money and time. These categories for selecting data mining tools help
practitioners avoid spending much time only to discover that a particular
tool does not provide the necessary solution (Collier et al., 1999).
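One plausible way to apply such criteria categories is a weighted score per tool. The sketch below is a generic weighted average and is not claimed to be the exact procedure of Collier et al. (1999); the tool names, the 1 to 5 scores and the weights are all hypothetical.

```python
# Hypothetical weights per category (higher means more important to the project).
weights = {"performance": 2, "functionality": 3, "usability": 1}

# Hypothetical 1-5 scores for three unnamed candidate tools.
tool_scores = {
    "Tool A": {"performance": 4, "functionality": 3, "usability": 5},
    "Tool B": {"performance": 3, "functionality": 5, "usability": 2},
    "Tool C": {"performance": 5, "functionality": 2, "usability": 4},
}


def weighted_score(scores, weights):
    """Weighted average of the category scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(scores[category] * weight
               for category, weight in weights.items()) / total_weight


def rank_tools(tool_scores, weights):
    """Tool names ordered from the highest weighted score to the lowest."""
    return sorted(tool_scores,
                  key=lambda name: weighted_score(tool_scores[name], weights),
                  reverse=True)
```

Because functionality carries the largest weight in this example, the tool with the strongest functionality score ranks first even though it scores lowest on usability. Changing the weights changes the ranking, which is why the weights should be agreed before the tools are scored.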
Bialynicka (2008) stated that the following data mining tools are suited to
clustering:
Scatter
Grouper
Carrot²
Vivisimo
Scatter is designed for browsing and supports online clustering based on two
novel clustering algorithms, buckshot and fractionation. Buckshot is fast and
suited to online clustering, while fractionation is accurate and suited to
offline initial clustering of the entire set (Bialynicka, 2008).
Grouper is suitable for online use and operates on query result snippets.
It clusters together documents that share large common subphrases
(Bialynicka, 2008).
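Clustering snippets by shared phrases can be illustrated with a very small sketch. It indexes two-word phrases only and is a deliberate simplification, not Grouper's actual implementation; the snippets are hypothetical.

```python
from collections import defaultdict

# Hypothetical query result snippets, invented for illustration.
snippets = [
    "land tax assessment notice",
    "annual land tax report",
    "parking fine notice",
    "land tax appeal form",
]


def phrases(snippet, n=2):
    """All runs of n consecutive words in a snippet, lower-cased."""
    words = snippet.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def phrase_clusters(snippets, n=2, min_docs=2):
    """Map each phrase shared by at least min_docs snippets to their indices."""
    index = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for phrase in phrases(snippet, n):
            index[phrase].add(doc_id)
    return {phrase: sorted(ids) for phrase, ids in index.items()
            if len(ids) >= min_docs}
```

Here the shared phrase doubles as a readable label for the cluster, which is one reason phrase-based grouping suits search result snippets.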
Carrot² is a component framework that allows components to be substituted
for input (from other search engines), filtering (stemming, distance measure
and clustering) and output of the results (Bialynicka, 2008).
Vivisimo is a commercial online clustering tool that supports hierarchical
and conceptual clustering techniques (Bialynicka, 2008).
However, for this research project, the researcher used Carrot², a free tool
that is available for learning purposes.
Carrot² is an open source search results clustering engine. It can
automatically organize small collections of documents, e.g. search results,
into thematic categories (Carrot², 2010).
Apart from two specialized document clustering algorithms, Carrot² offers
ready-to-use components for fetching search results from various sources
including YahooAPI, GoogleAPI, MSN Live API, eTools Meta Search,
Lucene, SOLR, Google Desktop and more. Carrot² is implemented in Java,
but it easily integrates with non-Java software, such as PHP, Ruby or C#
(Carrot², 2010).
2.7 Summary
This chapter provided an overview of e-filing and data mining techniques
based on a literature review of several journals. Rules in e-filing, an
overview of data mining and challenges in data mining were discussed. The
researcher also reviewed three basic data mining techniques, which are
classification, association and clustering, compared them and selected the
suitable data mining technique for the searching method in the e-filing
web-based system (refer to Table 2.1). Based on the comparison in Table 2.1,
the researcher found that clustering is the suitable searching method for the
e-filing web-based system. In addition, after reviewing several journals on
data mining tools, the researcher chose Carrot² (an open source search
results clustering engine), a free tool that is available for learning
purposes.
The next chapter discusses the research approach and the methodology for the
research project.
CHAPTER 3
RESEARCH APPROACH AND METHODOLOGY
3.1 Introduction
This chapter describes the methodology and approaches used in the research,
from problem identification to development of the system. To achieve the
objectives of this project, the right approach must be applied in order to
reach sound conclusions. This research followed five major steps to develop
the prototype of the e-filing web-based system using data mining techniques:
problem identification and planning, requirement gathering, requirement
analysis, design model and develop prototype. An overview of this methodology
is shown in Figure 3.1.
Figure 3.1 : Overview of Research Approach and Methodology
3.2 Problem Identification and Planning
This phase identifies the goal, scope, budget, schedule, technology and
system development process, methods and tools to ensure that everything is in
place. However, the plan depends on what the researcher intends to deliver
according to the stakeholders' requirements.
Before planning the project, the researcher should know the current situation
and the problems of the old system. Understanding potential problems is
central to making the development successful. After the researcher
identifies the problems, the scope of the project is defined. The goal must
be determined, and the objectives of the project must address the problems
that have been identified. After analyzing all the problems and identifying
which tasks need to be done, a measurable and achievable project plan is
scheduled using Microsoft Project. For this research, Microsoft Project is
used to produce a Gantt chart (refer to Appendix A - Project Planning) as a
guideline for the researcher to complete the project. This phase involves the
following steps:
a. Discuss the current problem with staff at Majlis Daerah
Kerian
The current problems need to be identified so that they can be
solved in the subsequent tasks.
b. Identify goal, objective, scope, and significance of research
The goal, objective, scope and significance of the research need to
be clearly defined.
c. Plan related tasks
The related tasks are planned and scheduled using Microsoft Project.
Time must be allocated carefully and every task must be stated to
ensure the completion of the research.
3.3 Requirement Gathering
Requirement gathering is the process of collecting all the information
needed to develop the system. In this phase, data collection methods are
applied to identify the concepts and requirements that will be applied in
developing the e-filing web-based system. For this research, there are two
types of data collection, which are:
3.3.1 Primary Data
Primary data are requirements gathered from original sources such as
interviews, questionnaires and observation. For this research, the
researcher used data from interviews with staff at Majlis Daerah
Kerian. Interviewing is a technique used to gain detailed information
about the subject of interest of this research. This includes the
software and hardware used and the problems that arise in the current
system, so that requirements can be identified. Table 3.1 below lists
the people involved in the interview sessions for gathering the
requirements of the e-filing web-based system.
Table 3.1 : Information of people involved in the interview
Respondent Name                      Position and Department
Mr. Gobibaskaran A/L Govindaraju     Head of IT Department, Majlis Daerah Kerian
Puan Shalina Mat Piah                Administrative Assistant, Majlis Daerah Kerian
The main advantage of interviews is that the answers of the
interviewees are more spontaneous, without extended reflection. This
can be achieved by using a top-down approach where the interviewer
starts with a general question and progresses to specific questions
about the task. Interviews should be planned in advance by defining a
set of interview questions to be asked. This not only ensures
consistency between interviews conducted with different interviewees
but also helps to focus on the purpose of the interview session.
The deliverable of this activity is the set of identified requirements
needed for the e-filing web-based system.
3.3.2 Secondary Data
The secondary data for this research were collected from resources such
as articles, journals, books and other related academic publications
about e-filing and data mining. They are important for gaining a deeper
understanding of e-filing and data mining.
3.4 Requirement Analysis
This is the next stage after all data have been collected in the requirement
gathering phase. The primary data collected need to be analyzed to define the
system requirements for developing the e-filing web-based system. The
collected data need to be studied and analyzed properly in order to have
accurate, reliable and relevant information during development. These
requirements helped the researcher to identify the use cases that produce the
system functions, and the researcher finally produced the Software
Requirement Specification (SRS) documentation.
In addition, the secondary data collected during the requirement gathering
phase are useful for identifying a suitable searching method using data
mining techniques. The researcher compared three popular data mining
techniques (association, classification and clustering) and selected a
suitable searching method from among them. The tool used during this phase is
Rational Rose.
3.5 Design Model
The model is designed and determined before proceeding with the actual
construction of the database and system. The system interface, classes,
objects and their relations are designed using Rational Rose. All diagrams
related to this research, including the class diagram, use case diagram and
sequence diagram, are designed based on the results of the requirement
analysis phase.
After all the objects and classes have been illustrated clearly with their
attributes and methods, the database is developed. This activity is
accomplished using the MySQL database. At the end of this activity, a
detailed design (database model) is produced. The deliverable of this phase
is documented in the Software Design Document (SDD).
3.6 Develop Prototype
The develop prototype phase builds the application using the appropriate
development technologies. In this phase, the researcher develops the
prototype of the e-filing web-based system using data mining techniques.
Apache is used as the web server, MySQL as the database server and PHP as the
programming language. Dreamweaver is used as the workspace for writing the
code and Carrot² as the data mining tool. At the end of this phase, the
e-filing prototype system using a data mining technique is produced.
3.7 Summary
The research methodology describes the research strategy that is used in this
research project. For this research project, a plan of action is laid out that
shows how the problem will be investigated, what information will be
collected using which method and how this information will be analyzed to
come to the conclusion. It consists of problem identification and planning,
requirement gathering, requirement analysis, design model and develop
prototype.
The methodology stated above was followed to develop the e-filing web-based
system in order to achieve the project's objectives and to fulfil the
requirements specified by the user. With an understandable and achievable
methodology, the project was carried out in a proper manner and was
consequently completed effectively.
The next chapter discusses the construction for the research project.
CHAPTER 4
PROTOTYPE CONSTRUCTION
4.1 Introduction
This chapter explains the construction of the prototype in depth, detailing
the development of the e-filing web-based system. It also explains the
results and how they achieve the project objectives.
4.2 Software Requirements
Specified below is the list of software tools that are selected during the
development process. These include operating system and other applications
that are compulsory for the system to be developed and deployed.
4.2.1 Software Tools
Table 4.1 : Software Tools Specifications
No.  Software                             Type
1.   Windows XP SP2                       Operating System (OS)
2.   MySQL                                Database Server
3.   PHP                                  Programming Platform
4.   Apache                               Web Server
5.   Rational Rose Enterprise Edition     Unified Modeling Language Software
6.   Adobe Photoshop CS3                  Graphics Design Software
7.   Macromedia Dreamweaver MX 2004       Workspace Software
8.   Carrot²                              Open source framework for building search clustering engines
4.2.2 Software Tools Installation
Referring to Table 4.1, the main tools, which are Apache, MySQL Server
version 5, Rational Rose Enterprise Edition, Adobe Photoshop CS3,
Macromedia Dreamweaver MX 2004 and Carrot², are explained further
below.
a. Apache
The Apache HTTP Server, commonly referred to as Apache, is
web server software notable for playing a key role in the initial
growth of the World Wide Web. In 2009 it became the first
web server software to surpass the 100 million web site
milestone. Apache was the first viable alternative to the
Netscape Communications Corporation web server (currently
known as Sun Java System Web Server), and has since
evolved to rival other Unix-based web servers in terms of
functionality and performance. Apache supports a variety of
features, many implemented as compiled modules which
extend the core functionality. These can range from server-
side programming language support to authentication schemes.
Some common language interfaces support Perl, Python, Tcl,
and PHP. Apache provides a variety of MultiProcessing
Modules (MPMs) which allow Apache to run in a process-
based, hybrid (process and thread) or event-hybrid mode, to
better match the demands of each particular infrastructure.
This implies that the choice of correct MPM and the correct
configuration is important. Where compromises in
performance need to be made, the design of Apache is to
reduce latency and increase throughput, relative to simply
handling more requests, thus ensuring consistent and reliable
processing of requests within reasonable time-frames. (Apache,
2002)
b. MySQL Version 5
MySQL is the world's most popular open source database
software, with over 100 million copies of its software
downloaded or distributed throughout its history. With its
superior speed, reliability, and ease of use, MySQL has
become the preferred choice for Web, Web 2.0, SaaS, ISV,
Telecom companies and forward-thinking corporate IT
Managers because it eliminates the major problems associated
with downtime, maintenance and administration for modern,
online applications. (MySQL, 2009)
MySQL Server was chosen as the data store for the E-Filing
web-based system because of its consistency, fast performance,
high reliability and ease of use. The researcher only needs to
follow the instructions in the installation wizard until the
installation process is completed. Once the installation is
completed, MySQL Server Version 5 can be used in the
development of the E-Filing web-based system.
c. Rational Rose Enterprise Edition
According to IBM Corporation (2006), Rational Rose enables
the creation of the following types of UML based diagrams:
activity diagrams, class, component, deployment, sequence,
state chart, use case, collaboration, physical storage and
deployment, and physical data and tables.
The researcher used Rational Rose Enterprise Edition to create
the UML models for the e-filing web-based system, which consist
of the use case diagram, sequence diagrams and class diagram.
d. Adobe Photoshop CS3
Photoshop CS3 is part of Adobe's Creative Suite (along with a
host of other products such as Illustrator). It is Adobe's
flagship bitmap editor, and as a professional-level editor for
fine art photography it has no viable alternative. Photoshop is
the industry standard because of its flexibility and
extensibility (it supports a wide range of third-party plug-ins),
its support for color management, and the robustness of its tools
(Levy, 2007).
The researcher used Adobe Photoshop CS3 to design the interface
of the E-Filing web-based system, which consists of the header,
logo and system layout.
e. Macromedia Dreamweaver MX 2004
Dreamweaver is a powerful web page creation and web site
management tool. It offers numerous, sophisticated functions
that can be used to create professional quality web sites.
Because of this, it’s one of the most popular web authoring
tools among web designers. (San Diego State University,
2004)
The researcher used Macromedia Dreamweaver MX 2004 as the
workspace software for writing the PHP code of the E-Filing
web-based system.
f. Carrot²
According to Carrot² (2010), Carrot² is an open source search
results clustering engine. It can automatically organize small
collections of documents, e.g. search results, into thematic
categories.
Apart from two specialized document clustering algorithms,
Carrot² offers ready-to-use components for fetching search
results from various sources including YahooAPI, GoogleAPI,
MSN Live API, eTools Meta Search, Lucene, SOLR, Google
Desktop and more. Carrot² is implemented in Java, but it easily
integrates with non-Java software, such as PHP, Ruby or C#.
The researcher used Carrot², an open source framework, to
build a search results clustering engine. It organizes the
search results into topics, fully automatically and without
external knowledge such as taxonomies or preclassified content.
4.3 Hardware Requirements
For developing and deploying the e-filing web-based system, the minimum
hardware requirement is a standard personal computer with an Intel or AMD
processor, a standard motherboard, an 80 GB hard disk and 512 MB of DDR RAM.
No additional external device is needed for this project.
4.4 Development Phase
Based on the research methodology depicted in Figure 3.1, the system
construction process covers the last three phases of the methodology, which
are the Requirement Analysis, Design and Development phases. Each process
involved in these phases is explained further below.
4.4.1 Requirement Analysis Phase
In this construction process, the researcher analyzed the requirements
in more detail. The researcher drew the use case diagram in Rational
Rose, focusing on a high-level, user-centered view of the system. This
supports analysis of the class diagram, which is the primary model for
describing the internal structure and behavior of the system.
Furthermore, each use case is described thoroughly, stating the flows
involved within it, and the corresponding sequence diagrams are
produced. As a result, a summary of requirements for the development
of the E-Filing web-based system is fully constructed. For details on
the requirements, please refer to Appendix D: Software Requirement
Specification (SRS).
4.4.2 Design Phase
The design phase is concerned with specifying an e-filing web-based
system that will meet the requirements. The design of this project
takes place at two main levels, which are system design and detailed
design.
a. System Design
System design focuses on architectural aspects that affect
the entire system (Bennett, McRobb & Farmer, 2006). The
system design of the e-filing web-based system involved
setting standards: the design of the human-computer
interface, the coding standards and the selection of a
suitable database management system for data storage. This
project uses MySQL as the database management system and
PHP as the programming language.
b. Detailed Design
Detailed design addresses the design of classes and the
detailed workings of the system. It was based on the
requirements in the Software Requirement Specification (SRS)
and follows an object-oriented design approach. In an
object-oriented system, detailed design is concerned with the
design of objects. Object design is mainly concerned with the
specification of attribute types, how operations function, and
how objects are linked to other objects (Bennett et al., 2006).
For a detailed description of the class diagram, please refer
to Appendix E: Software Design Document (SDD).
4.4.3 Development Phase
In this phase, a series of development tasks was performed. It
consists of constructing the database, establishing its connection
and coding. These tasks are explained further below.
a. Coding
This task was done concurrently with the enhancement of the
interfaces. The necessary code was added to the programs to
enable the interfaces to function correctly. Figure 4.1 shows
one of the code segments constructed during development using
Macromedia Dreamweaver MX 2004.
Figure 4.1 : Coding index.php
b. Data Mining Techniques
This task was carried out concurrently, enhancing the e-filing
web-based system with a searching method based on data mining
techniques. Clustering was selected as the suitable data
mining technique for the searching method. The researcher used
Carrot², an open source framework for building search
clustering engines. The necessary code was added to the system
to cluster search results.
c. Interface
Figure 4.2 shows the main page of the system. This page
appears after an authorized user (staff) logs into the system.
It shows the list of menus that staff use to operate the
system.
(Refer to Appendix F – Description of Interface System)
Figure 4.2 : The main page interface of e-filing
4.5 Summary
This chapter explained in detail the construction of the E-Filing web-based
system. The researcher reviewed the software tools selected during the
development process. These include the operating system and other
applications that are compulsory for the system to be developed and deployed,
such as Dreamweaver MX 2004, Apache, MySQL, Rational Rose and Carrot². The
researcher also stated the minimum hardware requirements for developing and
deploying the E-Filing web-based system. Finally, the researcher reviewed the
series of development tasks that were performed, which consists of the
requirement analysis, design and development phases.
The next chapter discusses the result and findings for the research project.
CHAPTER 5
RESULT AND FINDINGS
5.1 Introduction
This chapter will explain how the collected data is organized, analyzed and
finalized to be used in the development phase of the research. The result of
the research that has been conducted will be explained in depth in this
chapter. It includes the findings and result gathered from the interviews and
discussions.
5.2 Interview Results
To generate good interview questions, the researcher followed a model for
navigating interview processes in requirements elicitation (refer to Figure 5.1).
Figure 5.1 : A Model for Navigating Interview Processes in Requirements Elicitation
To develop a Software Requirements Specification (SRS) of good quality,
it is important to elicit requirements from stakeholders correctly. The
interview sessions were conducted with Encik Gobibaskaran A/L
Govindaraju, the Head of Information Technology at Majlis Daerah Kerian,
and Puan Shalina Mat Piah, the Administrative Assistant at Majlis Daerah
Kerian. The interview questions fall into two categories. The first
category focused on the current problems faced by staff in Majlis
Daerah Kerian, and all the necessary data about those problems was
collected through it. The second category focused on the
functional requirements for the system to be developed. The sample
interview questions can be found in Appendix C.
5.2.1 Current Problems
Interviewee :
Puan Shalina Mat Piah, Administrative Assistant,
Majlis Daerah Kerian.
The results gained from the first category of interview questions
are presented in Table 5.1 below.
Table 5.1 : The problems that have been identified from the interviews.
Problem | Researcher | Interviewee
PQ.1 | Is the current manual system easy and comfortable for you? | No
PQ.2 | Please describe the current manual system for managing and searching files. | It involves many steps: (1) searching the log book for the number of the required file; (2) determining the file name from the file number; (3) checking for the needed file on many big shelves, which takes a long time; (4) checking each staff member's table or other departments in Majlis Daerah Kerian if the file is not on the shelf.
PQ.3 | Is it easy to identify the suitable files manually according to your requirement? | No
PQ.4 | Why do you think it is not easy to identify the suitable files manually? | It is difficult to search for the suitable files. It is difficult to know the status of the files. It takes a long time. There are thousands of files on the shelves. Sometimes files are exchanged between departments.
PQ.5 | In your opinion, is it important for MDK to have a web-based system that will act as an information center for staff to gather information about the status of the files? | Yes, of course
5.2.2 Functional Requirements
Interviewee :
Encik Gobibaskaran A/L Govindaraju, Head of IT Department,
Majlis Daerah Kerian.
The second category of interview questions focused on
the functional requirements of the system. The requirements and
suggestions gathered from the interviews are presented in Table
5.2 below.
Table 5.2 : The requirements and suggestions that have been
identified from the interviews
Requirement | Researcher | Interviewee
RQ.1 | How many types of user are required in the system? | Three: Administrator, Manager and Staff
RQ.2 | What do you think the E-Filing web-based system should have? | Store general staff information. Store file information. Store the status and location of the files. Implement automated searching to identify suitable files.
RQ.3 | What are the roles of Administrator, Manager and Staff in the system? | Admin: handle user accounts, view and delete files. Manager: handle user information, maintain files and delete staff. Staff: handle user information and maintain files.
RQ.4 | What is your suggestion about the language to develop the system? | Use an open source language that suits any platform, such as PHP.
RQ.5 | What is your suggestion about the database to develop the system? | MySQL database
Based on Table 5.2, several processes for the system were
identified. These requirements describe the system functionality of
the e-filing web-based system. They were collected and analyzed
to produce the new system.
5.3 Use Case Diagram
[Figure 5.2 shows three actors (Admin, Manager and Staff) and eight use cases: Maintain User Account, View Files, Delete Files, Validate User, Maintain Files Information, Maintain User Information, Delete Staff and Maintain Customer Information.]
Figure 5.2 : Use Case Diagram for E-Filing web-based system
Figure 5.2 above shows the use case diagram for the e-filing web-based
system. It illustrates the functionality available to the administrator,
manager and staff. All three must register before they can use the
system, and must then log into it. Once logged in, the admin can
maintain user accounts, view files and delete files. The manager can
maintain user information, maintain files information, maintain
customer information and delete staff. Staff can maintain user
information, maintain files information and maintain customer information.
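The access rules described above can be sketched as a simple lookup from role to permitted use cases. This is only an illustrative sketch in Python, not the prototype's PHP code: the role and use case names come from Figure 5.2, while the dictionary layout and the function name `is_allowed` are assumptions made for this example.

```python
# Hypothetical sketch: roles and use cases follow Figure 5.2; the data
# structure and the function name are assumptions, not the prototype's code.
PERMISSIONS = {
    "admin": {"maintain_user_account", "view_files", "delete_files"},
    "manager": {
        "maintain_user_information",
        "maintain_files_information",
        "maintain_customer_information",
        "delete_staff",
    },
    "staff": {
        "maintain_user_information",
        "maintain_files_information",
        "maintain_customer_information",
    },
}


def is_allowed(role: str, use_case: str, logged_in: bool) -> bool:
    """Return True only for a logged-in user whose role includes the use case."""
    if not logged_in:
        # Validate User comes first: nothing is permitted before login.
        return False
    return use_case in PERMISSIONS.get(role, set())
```

Keeping the rules in one table makes it easy to compare them against the use case diagram, for example to confirm that only the manager can reach Delete Staff.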
The use cases are described in Table 5.3.
Table 5.3 : Description of Use Case diagram
Use Case | Description
Maintain User Account | Used by the administrator to update and delete the accounts of users of the system.
View Files | Used by the administrator to view files from all departments in Majlis Daerah Kerian.
Delete Files | Used by the administrator to delete files from all departments in Majlis Daerah Kerian.
Validate User | Used by the administrator, manager and staff to log into the system.
Maintain User Information | Used by the manager and staff to register and to update their information.
Maintain Files Information | Used by the manager and staff to add new files, update files and delete files.
Maintain Customer Information | Used by the manager and staff to add new customers, update customers and delete customers.
Delete Staff | Used by the manager to delete staff who do not belong to their department.
5.4 Class Diagram
Figure 5.3 : Class Diagram for E-Filing web-based system
Figure 5.3 is the class diagram for the e-filing web-based system. A class diagram is a type of static structure diagram. It shows the system's classes, their attributes and the relationships between the classes.
[Contents of Figure 5.3, recovered from the diagram text]
Boundary classes (<<boundary>>):
file_form: file_id, file_name, file_status, file_remark, open_date, update_date, staff_no, dept_name
customer_form: file_id, cust_ic, cust_name, cust_add1, cust_add2, cust_city, cust_postcode, cust_state, cust_phone, staff_no
staff_form: staff_no, staff_ic, staff_name, staff_add1, staff_add2, staff_city, staff_postcode, staff_state, staff_hp, staff_email, dept_name, advisor_no
advisor_form: advisor_no, advisor_ic, advisor_name, advisor_hp, advisor_email
login_form: user_name, user_password
Control classes (<<control>>):
staff_control: search_staff(), set_staff_detail(), set_staff_update(), removeStaff()
advisor_control: set_advisor_detail(), set_advisor_update()
login_control: set_user_update(), remove_user(), validate_user()
file_control: search_files(), set_file_detail(), set_file_update(), remove_file()
customer_control: search_cust(), set_cust_detail(), set_cust_update(), remove_cust()
Entity classes (<<entity>>, with <<PK>> marking the primary key):
advisor: <<PK>> advisor_no, advisor_ic, advisor_name, advisor_hp, advisor_email, dept_name; operations add_advisor(), update_advisor(), display_advisor()
login: <<PK>> user_name, user_password, user_id, user_level, user_dept; operations update_user(), delete_user(), display_user()
staff: <<PK>> staff_no, staff_ic, staff_name, staff_add1, staff_add2, staff_city, staff_postcode, staff_state, staff_hp, staff_email, dept_name, advisor_no; operations add_staff(), update_staff(), delete_staff(), display_staff()
customer: <<PK>> cust_id, file_id, cust_ic, cust_name, cust_add1, cust_add2, cust_city, cust_postcode, cust_state, cust_phone, staff_no; operations add_cust(), update_cust(), delete_cust(), display_cust()
file: <<PK>> file_id, file_name, file_status, file_remark, open_date, update_date, staff_no, dept_name; operations add_files(), update_files(), delete_files(), display_files()
Associations are labelled "manage", "validate" and "have", with multiplicities 1, 0..* and 0..n. The end points of each association cannot be recovered from the extracted text.
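To make the file entity more concrete, the following minimal sketch in Python creates a table with the attributes listed for the file class and runs a keyword search similar in spirit to search_files(). It is only an illustration: the prototype uses PHP and MySQL, whereas this sketch uses Python's built-in sqlite3 module as a stand-in, and the column types, the sample row and the function body are assumptions.

```python
import sqlite3

# Stand-in for the MySQL "file" entity. Column names follow the class diagram;
# column types and the sample row are assumptions made for illustration.
SCHEMA = """
CREATE TABLE file (
    file_id     TEXT PRIMARY KEY,
    file_name   TEXT NOT NULL,
    file_status TEXT,
    file_remark TEXT,
    open_date   TEXT,
    update_date TEXT,
    staff_no    TEXT,
    dept_name   TEXT
)
"""


def search_files(conn, keyword):
    """Return (file_id, file_name, file_status, dept_name) for names containing keyword."""
    cursor = conn.execute(
        "SELECT file_id, file_name, file_status, dept_name "
        "FROM file WHERE file_name LIKE ? ORDER BY file_id",
        (f"%{keyword}%",),  # parameterized to avoid SQL injection
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(
    "INSERT INTO file VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("F001", "Parking licence 2010", "On shelf", "", "2010-01-05",
     "2010-02-01", "S01", "License and Parking Unit"),
)
rows = search_files(conn, "licence")
```

A search of this kind returns the status and department of a file without anyone visiting the shelf, which is the information the interviewees asked for.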
5.5 Clustering as the Suitable Searching Method
5.5.1 Introduction
For this research project, it is important for the researcher to select a
suitable searching method based on data mining techniques. The researcher
decided to review three main data mining techniques: classification,
association and clustering. These techniques share the same
objective of data mining, but differ in their function
and suitability for the system.
The researcher reviewed the techniques based on their definition, concept,
functions, suitability and the examples given in several journals (refer to
Table 2.1 in Chapter 2, Literature Review, page 25).
Based on the comparison in Table 2.1, the researcher found that clustering is
the suitable searching method for the e-filing web-based system.
5.5.2 Why Clustering Search Result
This decision is supported by several journals that identify clustering as
a suitable searching method. According to Zhang, Xie and Wu
(2006), clustering groups the search results into several clustered
collections, which makes it easy for users to locate
the valuable search results that they really need.
Aliakbary, Khayyamian and Abolhassani (2008) stated that clustering
search results helps the user to get an overview of the returned results and to focus
on the desired clusters. Most search result clustering methods use
the title, URL and snippets returned by a search engine as the source of
information for creating the clusters.
According to Lipai (2008), clustering the results of search tools means
grouping them into object classes that are constructed from the
characteristics of the search results. The purpose is to simplify the
user's work in retrieving the needed information and to help the user
find better quality results faster.
Bialynicka (2008) stated that clustering organizes search results
into groups, so that different groups correspond to different user
needs. This is because a ranked list is not enough, and documents
pertaining to different topics cannot be compared. Besides, there are
relationships between the results that can be used to cluster
the search results.
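The idea of grouping results under shared characteristics can be illustrated with a deliberately simple sketch in Python. It groups result titles under the term that the most titles share. This is not the algorithm used by Carrot² or by any of the cited authors; the stop word list, the function names and the sample titles are all assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# Assumed, minimal stop word list for the illustration.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokens(text, query_terms):
    """Lower-case words of text, without stop words and query terms."""
    words = (w.strip(".,:;()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS and w not in query_terms}


def cluster_results(titles, query, min_size=2):
    """Group titles under the most widely shared term that each one contains."""
    query_terms = set(query.lower().split())
    token_sets = [tokens(t, query_terms) for t in titles]
    doc_freq = Counter(term for ts in token_sets for term in ts)
    # Candidate labels: terms shared by at least min_size titles, most shared first.
    labels = sorted(
        (t for t, n in doc_freq.items() if n >= min_size),
        key=lambda t: (-doc_freq[t], t),
    )
    clusters = defaultdict(list)
    for title, ts in zip(titles, token_sets):
        label = next((lab for lab in labels if lab in ts), "other")
        clusters[label].append(title)
    return dict(clusters)


# Made-up sample titles, used only to show the grouping.
titles = [
    "Building plan approval file",
    "Building permit renewal file",
    "Parking licence file",
    "Parking compound record file",
    "Garden maintenance file",
]
groups = cluster_results(titles, query="file")
```

With these sample titles the results fall into a "building" group, a "parking" group and a remaining "other" group, so a user can go straight to the relevant group instead of scanning one long list.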
5.5.3 Examples of Clustering Search Result
Jasco (2007) gives an example of the usefulness of clustering techniques in
a search result list. Figure 5.4 below shows Google's one-dimensional
result list without clustering techniques. For the keywords "clustering search
result", Google returns about 15,500,000 results, a list that is
too large to choose from easily.
Figure 5.4 : Google’s One Dimensional Result List
Figure 5.5 below shows a good search result list with a clustering
technique. For the same keywords "clustering search result" as in
Figure 5.4 above, it returns only about 194 results, which is more
accurate, simpler and easier to choose from.
Figure 5.5 : Good clustering result list
Figure 5.6 below shows a search result list with a clustering technique
that is available on the World Wide Web (http://search.carrot2.org).
Figure 5.6 : Good clustering result list from http://search.carrot2.org
5.5.4 Clustering Search Result from e-filing web-based system
Figure 5.7 below shows the search result list with the clustering technique
that is available in the e-filing web-based system.
Figure 5.7 : Good clustering result list from e-filing web-based system
Figure 5.8 below shows the data mining tool provided by Carrot²,
an open source framework for building search clustering
engines. The necessary code was added to the system to cluster
search results.
Figure 5.8 : Data Mining Tool by Carrot²
5.6 Summary
In this chapter, the researcher explained how the collected data is organized,
analyzed and finalized to be used in the development phase of the research.
The researcher analyzed the results of interviews with two staff members in Majlis Daerah
Kerian in terms of their current problems and the functional requirements for the
e-filing web-based system. The researcher also discussed the reasons why
clustering was selected as the suitable searching method for the e-filing web-based
system, citing several journals and examples that support
clustering as a suitable method for clustering search results, and presented the clustered results
from the e-filing web-based system.
The next chapter discusses the conclusion and recommendations for the
research project.
CHAPTER 6
CONCLUSION AND RECOMMENDATIONS
6.1 Introduction
This chapter summarizes what the researcher has done, from
defining the objectives to obtaining the findings through the development of
the prototype e-filing web-based system using data mining techniques.
It also concludes the report for this project and presents the limitations
of the software and recommendations for those who wish to pursue
research on the development of the e-filing web-based system.
6.2 Conclusions
In conclusion, the research project on the development of a
prototype e-filing web-based system using data mining techniques
achieved all of its objectives, based on the defined
research approach and methodology, which consists of proper theoretical
findings (secondary data) and data findings (primary data). It is hoped
that achieving these objectives provides solutions to the current
problems in Majlis Daerah Kerian, Parit Buntar, Perak.
The first objective of the research project is to identify the requirements
for e-filing at Majlis Daerah Kerian. This objective was
achieved through requirement gathering, by conducting interview sessions with
staff in Majlis Daerah Kerian to learn the current problems and the
functional requirements for the e-filing web-based system. The deliverable for
this objective has been documented and can be found in Appendix D:
Software Requirement Specification (SRS).
The second objective of the research project is to identify the searching
method based on data mining techniques. For this phase, the researcher reviewed
many resources such as articles, journals, books and other academic
publications about e-filing and data mining in order to gain a
deeper understanding of both. This secondary data was
useful for identifying a suitable searching method using data mining techniques.
The researcher compared three popular data mining techniques
(association, classification and clustering) in order to identify a suitable
technique for the searching method in the e-filing web-based system. This objective
was achieved when the researcher found that clustering is the suitable
searching method for the e-filing web-based system.
After the second objective was achieved, the research proceeded with the
third objective of designing the e-filing web-based system. This objective was
achieved through the design stage, which consists of system design and
detailed design. The system design highlights the importance of
interface design with human computer interface characteristics
through the proper choice of colors, buttons and fonts. In addition,
an overall system structure was produced to illustrate
how the overall system works. The detailed design addressed the design of
classes and the detailed working of the system, and
described the attributes, operations and classes. The deliverables of the third objective
have been documented and can be found in Appendix E:
Software Design Document (SDD).
The fourth objective of this project is to demonstrate the e-filing web-based
system using the identified data mining technique. The fourth objective
follows from the three objectives already achieved. It was based on the
project methodology, which consists of gathering and analyzing requirements,
then designing a model that follows the user requirements. Finally, the
prototype was developed by translating the
design into program code using the selected programming platform, database
server, web server and data mining technique. Thus, the last
objective has been realized.
The e-filing web-based system for Majlis Daerah Kerian is
expected to provide staff with an interactive environment for
determining the suitable files that meet their requirements. It is
also expected to help staff identify the files they need more
accurately and faster as a result of using a suitable searching method based on the
selected data mining technique. The system is also expected to become an
information center where staff in Majlis Daerah Kerian can gather information
about the status of the files.
Although all the objectives have been achieved, the e-filing web-based
system using a data mining technique is far from complete and has its own
limitations. There are still many improvements that can be considered to
enhance this project. The limitations and recommendations for this project are
discussed below.
6.3 Limitations
The project encountered a number of limitations while in progress. The
limitations are as follows:
a. The interview sessions for gathering information about the current
problems and functional requirements were conducted only with the Head
of Information Technology and the Administrative Assistant of Majlis
Daerah Kerian. Interviewing only two people provided limited
information about the requirements.
b. Due to time constraints, the researcher developed only a prototype of
the e-filing web-based system, which is intended for demonstration
purposes.
c. There are many journals on data mining techniques, but the
researcher had difficulty understanding them because the
field was unfamiliar.
d. There are three different data mining techniques, and the researcher had to
select the one that best suits the objective.
The researcher needed to study each technique properly
and find related journals that support the findings.
e. A large number of data mining tools are available, but not all of
them support every kind of data mining technique. The researcher therefore
needed to study the tools based on their function and usability with the
selected data mining technique. Furthermore, the tool used in this
research was new to the researcher, so time was required to become
familiar with it.
f. The experience of the researcher is another limiting factor of this
research, as this is the researcher's first research project. However, the
researcher was able to learn and was properly guided by the research
plan and by instruction from the supervisor and examiner.
6.4 Recommendations
Several recommendations can be considered to further enhance
the development of the e-filing web-based system:
a. The project scope could be expanded so that the system covers the
contents of the files and not only their status.
b. The system could be used by other local governments, not
only Majlis Daerah Kerian.
c. The system could be made available online through the Internet so that it
can be accessed by everyone at any time and from anywhere. At present,
access is limited to the Local Area Network (LAN).
It is hoped that the implementation of this system will lead to further
enhancements in future projects.
REFERENCES
Abbott, D.W., Matkovsky, I.P., & Elder, J.F. (1998). An Evaluation of High-end
Data Mining Tools for Fraud Detection. IEEE Transaction on Knowledge and
Data Engineering, 2836.
Aliakbary, S., Khayyamian, M., & Abolhassani, H. (2008). Using Social Annotations
for Search Result Clustering. Retrieved February 10, 2010, from
http://www.springerlink.com/index/v770wm385n256p68.pdf
Apache. (2002). Retrieved February 14, 2010, from The Apache Software
Foundation: http://apache.org/
Bennett, S., McRobb, S., & Farmer, R. (2006). Object-Oriented Systems Analysis
and Design Using UML Third Edition. McGraw-Hill Education(UK)
Limited.
Bialynicka, I. (2008). Clustering Web Search Results. Retrieved March 2, 2010, from
http://medialab.di.unipi.it/web/Search+QA/Seminar/Clustering.ppt
Carrot² (2010). Carrot²-Open Source Search Results Clustering Engine. Retrieved
March 1, 2010, from Carrot² Website : http://project.carrot2.org/index.html
Chen, M., Han, J., & Yu, S.Y. (1996). Data Mining : An Overview from a Database
Perspective. IEEE Transaction on Knowledge and Data Engineering, 8, 6.
Collier, K., Carey, B., Sautter, D., & Marjaniemi, C. (1999). A Methodology for
Evaluating and Selecting Data Mining Software. IEEE Transaction on
Knowledge and Data Engineering, 2-4.
Defit, S., & Md Sap, M. N. (2009). Mining Association Rule from Large Databases.
Retrieved October 10, 2009, from http://fsksm.utm.edu.my
Garofalakis, M. N., Rastogi, R., Seshadri, S., & Shim, K. (1999). Data Mining and
the Web : Past, Present and Future. Retrieved July 17, 2009, from
http://www.softnet.tuc.gr/~minos/Papers/widm99.pdf
IBM Corporation. (2006). IBM Rational Rose. Retrieved March 1, 2010, from
http://ftp.software.ibm.com/software/rational/web/datasheets/rose_ds.pdf
Jain, A. K., Murty, M. N., & Flynn, P. J. (2000). Data Clustering: A Review. ACM
Computing.
Jasco, P. (2007). Clustering Search Result, Part 1: Web-wide Search Engines.
Retrieved January 5, 2010, from http://www.emeraldinsight.com/1468-4527.htm
Khodra, M. L., Widyantoro, D. H. (2007). An Efficient and Effective Algorithm for
Hierarchical Classification of Search Results. Retrieved March 20, 2010,
from http://repository.gunadarma.ac.id:8000/711/1/C-07.pdf
Lee, H. K. (2005). Inductive Clustering : A Technique for Clustering Search Results.
Retrieved July 15, 2009, from
http://sifaka.cs.uiuc.edu/course/598cxz05s/report-hle.pdf
Levy, P. (2007). A Review of Adobe Photoshop CS3. Retrieved February 3, 2010,
from http://www.becs-wa.org/PhotoShop_CS3.pdf
Lipai, A. (2008). World Wide Web Metasearch Clustering Algorithm. Retrieved
March 13, 2010, from http://revistaie.ase.ro/content/46/Adina%20Lipai.pdf
MySQL. (2009). Retrieved December 28, 2009, from MySQL Website:
http://www.mysql.com/
Olson, T., Edwards, M., & Monty, H.A. (2003). A Guide to Model Rules for
Electronic Filing and Service. Retrieved July 15, 2009, from
http://www.ncsconline.org/WC/Publications/External_ElFileModelRulesLexisPub.pdf
Phyu, T.N. (2009). Survey of Classification Techniques in Data Mining. Retrieved
August 5, 2009, from
http://www.iaeng.org/publication/IMECS2009/IMECS2009pp727-731.pdf
Qiu, M., Davis, S., & Ikem, F. (2004). Evaluation of Clustering Techniques in Data
Mining Tools. Retrieved January 5, 2010, from
http://www.iacis.org/iis/2004_iis/PDFfiles/QiuDavisIkem.pdf
Ravichandra, R. (2003). Data Mining and Clustering Techniques. Retrieved April 1,
2010, from
https://drtc.isibang.ac.in/bitstream/handle/1849/121/K_ikr_datamining.PDF?sequence=2
San Diego State University. (2004). Dreamweaver MX 2004 Introduction. San
Diego, Berkeley. Academic Affairs.
Shyu, M. L., Chen, S. C., & Haruechaiyasak, C. (2005). A Data Mining Framework
for Building A Web-Page Recommender System. Retrieved February 12, 2010, from
http://www.hlt.nectec.or.th/Publications/Conferences/A%20Data%20Mining%20Framework%20for%20Building%20A%20Web-Page%20Recommender%20System.pdf
Tang, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining.
Boston : Pearson Education.
Visnick, L. (2003). Clustering Techniques. Retrieved July 30, 2009, from
http://www.progress.com/realtime/docs/whitepapers/clustering_techniques.pdf
Zhang, H., Xie, K., & Wu, H. (2006). An Efficient Algorithm for Clustering Search
Engine Results. Retrieved February 6, 2010, from http://www.ieee.org
APPENDICES
APPENDIX A : PROJECT PLANNING
APPENDIX B : PROGRESS SLIDE PRESENTATION
APPENDIX C : INTERVIEW QUESTION
APPENDIX D : SOFTWARE REQUIREMENT SPECIFICATION (SRS)
APPENDIX E : SOFTWARE DESIGN DOCUMENT (SDD)
APPENDIX F : DESCRIPTION OF SYSTEM INTERFACE
APPENDIX G : IN-PROGRESS ASSESSMENT
UNIVERSITI TEKNOLOGI MARA
DEVELOPMENT OF E-FILING FOR MAJLIS DAERAH KERIAN
USING DATA MINING TECHNIQUES
MOHAMED SYAHMI BIN MOHAMED ISA
BSc. (Hons)
INFORMATION SYSTEM ENGINEERING
MAY 2010
Universiti Teknologi MARA
Development of E-Filing for Majlis Daerah Kerian
using Data Mining Techniques
Mohamed Syahmi Bin Mohamed Isa
Thesis submitted in fulfillment of the requirements for Bachelor of Science (Hons)
Information System Engineering Faculty of Computer and Mathematical Sciences
MAY 2010
DECLARATION
This declaration is to certify that this thesis and all of its submitted contents are
original in its stature, excluding those in which have been acknowledged specifically
in the references. The contents of this thesis are of my own endeavor and any ideas
or quotations from the work of other people; published or otherwise are fully
acknowledged in accordance with the standard referring practices of the discipline.
Name of Candidate : MOHAMED SYAHMI BIN MOHAMED ISA
Candidate’s ID No. : 2008287242
Programme : BACHELOR OF SCIENCE (HONS)
INFORMATION SYSTEM ENGINEERING
(CS 226)
Faculty : FACULTY OF COMPUTER AND
MATHEMATICAL SCIENCES
Project Title : DEVELOPMENT OF E-FILING FOR MAJLIS
DAERAH KERIAN USING DATA MINING
TECHNIQUES
Signature of
candidate :
Date : 24th MAY 2010
APPROVAL
DEVELOPMENT OF E-FILING FOR MAJLIS DAERAH KERIAN
USING DATA MINING TECHNIQUES
By
Mohamed Syahmi Bin Mohamed Isa
2008287242
This thesis is prepared under the direction of thesis coordinators, Assoc. Prof. Wan
Nor Amalina Wan Hariri and Assoc. Prof. Rashidah Md. Rawi, Information System
Engineering Program, and it has been approved by the thesis supervisor, Puan
Norisan Abd Karim. It was submitted to the Faculty of Computer and Mathematical
Sciences and was accepted in partial fulfillment of the requirement for the degree of
Bachelor of Science.
Approved by:
__________________________
Madam Norisan Abd Karim
Thesis Supervisor
Date: 24th May 2010
DEDICATION
“For my mother, Sadiah Binti Harun,
my late father, Mohamed Isa Bin Harun, and my brothers.”
ACKNOWLEDGEMENT
Praise be to Allah SWT Most Gracious, Most Beneficent
Firstly, I would like to express my gratitude to Allah S.W.T for giving me the strength to
complete this project. Without His blessing and permission, this project could
not have been completed.
I would like to give my sincere appreciation to my supervisor, Puan Norisan Abd
Karim, for her concern, advice, support and encouragement throughout the progress
of this thesis. My gratitude also goes to the coordinators of the Final Year Project (ITS690),
FSKM, UiTM Shah Alam, Assoc. Prof. Wan Nor Amalina Wan Hariri and Assoc.
Prof. Rashidah Md. Rawi, for their valuable guidance in the completion of this
project.
Special thanks to Mr. Gobibaskaran and Puan Shalina for giving me the opportunity to
conduct the interview sessions that helped me gather the requirements for this
project.
Last but not least, heartfelt thanks to my parents, who gave me an
appreciation of learning and taught me the value of perseverance and resolve. I
would also like to thank my friends for their support, and everyone
who directly or indirectly helped me in this project. Thank you for inspiring me in
ways that cannot be put into words. May Allah SWT bless all of you.
TABLE OF CONTENTS
TITLE PAGE
ACKNOWLEDGEMENT i
TABLE OF CONTENT ii
LIST OF TABLES vi
LIST OF FIGURES vii
ABSTRACT viii
CHAPTER 1
INTRODUCTION
1.1 Research Background 1
1.2 Problem Statement 2
1.3 Aim 3
1.4 Objective of the Research 3
1.5 Significance of Research 3
1.6 Scope of Study 4
1.7 Limitation 4
1.8 Outcomes/Deliverables 5
1.9 Layout of Dissertation 5
1.10 Summary 6
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction 7
2.2 E-Filing
2.2.1 Introduction to E-Filing 7
2.2.2 Purposes of the Rules in E-Filing 7
2.2.3 Proposed Model Rules for E-Filing 8
2.3 What is Data Mining
2.3.1 Definition of Data Mining 9
2.3.2 Data Mining & Knowledge Discovery 9
2.3.3 Challenges of Data Mining 12
2.4 Data Mining Techniques
2.4.1 Overview of Data Mining Techniques 15
2.4.2 Classifying Data Mining Techniques 15
2.4.3 Association Rules 17
2.4.4 Classification 18
2.4.5 Clustering 20
2.5 Selecting Data Mining Techniques 22
2.6 Selecting Data Mining Tools 29
2.7 Summary 33
CHAPTER 3
RESEARCH APPROACH AND METHODOLOGY
3.1 Introduction 34
3.2 Problem Identification and Planning 35
3.3 Requirement Gathering 36
3.3.1 Primary Data 36
3.3.2 Secondary Data 37
3.4 Requirement Analysis 37
3.5 Design Model 38
3.6 Develop Prototype 38
3.7 Summary 39
CHAPTER 4
PROTOTYPE CONSTRUCTION
4.1 Introduction 40
4.2 Software Requirement 40
4.2.1 Software Tools 40
4.2.2 Software Tools Installation 41
4.3 Hardware Requirements 44
4.4 Development Phase 44
4.4.1 Requirement Analysis Phase 45
4.4.2 Design Phase 45
4.4.3 Development Phase 46
4.5 Summary 48
CHAPTER 5
RESULT AND FINDINGS
5.1 Introduction 49
5.2 Interview Results 49
5.2.1 Current Problems 50
5.2.2 Functional Requirements 52
5.3 Use Case Diagram 54
5.4 Class Diagram 56
5.5 Clustering as the Suitable Searching Method 57
5.5.1 Introduction 57
5.5.2 Why Clustering Search Result 57
5.5.3 Examples of Clustering Search Result 58
5.5.4 Clustering Search Result from e-filing
web-based system 61
5.6 Summary 62
CHAPTER 6
CONCLUSION AND RECOMMENDATIONS
6.1 Introduction 63
6.2 Conclusions 63
6.3 Limitations 65
6.4 Recommendations 66
REFERENCES 68
APPENDICES 71
APPENDIX A : Project Planning A
APPENDIX B : Progress Slide Presentation B
APPENDIX C : Interview Question C
APPENDIX D : Software Requirements Specification (SRS) D
APPENDIX E : Software Design Document (SDD) E
APPENDIX F : Description Of System Interface F
APPENDIX G : In-Progress Assessment G
LIST OF TABLES
Table 2.1 : Differences of Classification, Association and Clustering techniques 25
Table 2.2 : Computational Performance Criteria (Collier et. al, 1999) 30
Table 2.3 : Functionality Criteria (Collier et. al, 1999) 30
Table 2.4 : Usability Criteria (Collier et. al, 1999) 31
Table 4.1 : Software Tools Specifications 40
Table 5.1 : The problems that have been identified from the interviews 50
Table 5.2 : The requirement and suggestion that had been identified
from the interviews 52
Table 5.3 : Description of Use Case diagram 55
LIST OF FIGURES
Figure 2.1 : The Process of knowledge discovery in database 10
Figure 2.2 : Process for designing and implementing a recommender
system (Shyu et al., 2005) 11
Figure 2.3: The general architecture of Mining Association Rule model
(Defit & Md Sap, 2001) 17
Figure 2.4: Hierarchical Classification Process (Khodra & Widyantoro, 2007) 19
Figure 2.5 : Stages in clustering (Jain et al., 1999) 21
Figure 3.1 : Overview of Research Approach and Methodology 34
Figure 4.1 : Coding index.php 47
Figure 4.2 : The main page interface of e-filing 48
Figure 5.1 : A Model for Navigating Interview Processes in
Requirements Elicitation 49
Figure 5.2 : Use Case Diagram for E-Filing web-based system 54
Figure 5.3 : Class Diagram for E-Filing web-based system 56
Figure 5.4 : Google’s One Dimensional Result List 59
Figure 5.5 : Good clustering result list 59
Figure 5.6 : Good clustering result list from http://search.carrot2.org 60
Figure 5.7 : Good clustering result list from e-filing web-based system 61
Figure 5.8 : Data Mining Tool by Carrot² 62
ABSTRACT
The e-filing web-based system is a development project that uses a data mining
technique called clustering. There are different types of data mining technique,
each useful for particular functions and conditions. Majlis Daerah Kerian acts as a local
government, the government unit that is closest to the citizens; local governments
include municipalities, local authorities, town councils and city councils. The staff
in Majlis Daerah Kerian face difficulties in managing and identifying the files
that meet their requirements. This is because there are thousands of files and eight
departments, so searching for files manually is difficult and involves many
steps. This research provides a suitable searching method using a data mining
technique for the e-filing web-based system. The researcher compared
three different data mining techniques (association, classification and clustering) to
identify a suitable technique for searching files, and conducted interview sessions
with staff in Majlis Daerah Kerian to gather detailed requirements. The
e-filing web-based system for Majlis Daerah Kerian will help staff to identify the
files they need more accurately and faster as a result of using a suitable searching method
based on the selected data mining technique. It will also provide staff with an interactive
environment for determining the suitable files that meet their
requirements. It is expected that this e-filing web-based system will act as an
information center for staff in Majlis Daerah Kerian to gather information about
the status of the files.