52213065 e filing report

CHAPTER 1

INTRODUCTION

This chapter provides an overview of this research project and discusses the research background, problem statement, objectives of the research, research scope and significance of the research.

1.1 Research Background

E-filing provides access to a large database that consists of a list of electronic files. According to Olson, Edwards and Monty (2003), e-filing is a highly secure and reliable method for sending, receiving and managing legal documents. Finding needed files manually takes time, whereas e-filing provides secured access to identify needed files easily without searching manually through huge shelves. Olson et al. (2003) also stated that state courts, federal courts and law firms across the country are using e-filing more and more to improve access to documents, maximize resources and streamline filing and service activities. With e-filing, it is much easier to know the status of the needed files and to identify their location before going to the physical files.

The purpose of this research is to develop a prototype of an e-filing web-based system for Majlis Daerah Kerian. Majlis Daerah Kerian, Parit Buntar, Perak acts as a local government, which is the government unit closest to the citizens; local governments include municipalities, local authorities, town councils and city councils. There are eight departments in Majlis Daerah Kerian, which are the Law and Administration Unit, Assessment Unit, Information Technology Unit, Account and Finance Unit, License and Parking Unit, Town Service Unit, Garden and Recreation Unit, and Building Unit.

Within the e-filing web-based system, staff can easily gather information about the status of the files and identify suitable files that meet their requirements. The system is developed using a data mining technique, specifically the clustering technique. According to Phyu (2009), data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. This is because data mining consists of more than collecting and managing data; it also includes analysis and prediction. Garofalakis, Rastogi, Seshadri and Shim (1999) stated that there are three popular data mining techniques, which are association rules, classification and clustering. This research identified a suitable searching method from among the association, classification and clustering techniques in order to develop a prototype of the e-filing web-based system.

1.2 Problem Statement

The staff in Majlis Daerah Kerian face difficulties in managing and identifying needed files that meet their requirements. This is because it is difficult to search for needed files manually. According to Mrs. Shalina, Administrative Assistant of Majlis Daerah Kerian, there are many steps involved in searching for files manually, which are:

a. Search for the number of the required file using a log book.

b. Determine the file name using the file number.

c. Look for the needed file on many big shelves, which takes a long time.

d. Check each staff member's table or other departments in Majlis Daerah Kerian if the file is not on the shelf.

All these steps create barriers to responding promptly to each action. With this system, staff can find the file that satisfies their needs, which creates an interactive environment for them.

1.3 Aim

The aim of this research project is to provide a suitable searching method using data mining techniques for an e-filing web-based system.

1.4 Objective of the Research

To achieve the aim above, the project has four objectives:

a. To identify the requirements for e-filing from Majlis Daerah Kerian.

b. To identify the searching method based on data mining techniques.

c. To design an e-filing web-based system.

d. To demonstrate the e-filing web-based system using the identified data mining technique.

1.5 Significance of Research

The significance of this development is that the system can be used by staff in Majlis Daerah Kerian. E-filing will act as an information center for staff to gather information about the status of the files. Besides that, it also provides staff with an interactive environment for choosing the suitable files that meet their requirements.

1.6 Scope of Study

The e-filing web-based system is developed using PHP with a MySQL database. The development is for Majlis Daerah Kerian, Parit Buntar, Perak and focuses on filing management only. This is a web-based application that can be accessed via a browser and will be used internally by Majlis Daerah Kerian's employees.

1.7 Limitation

The important task carried out in this study is to gather information from staff in Majlis Daerah Kerian who are involved in filing management. This is conducted through interviews, which require arranging schedules and finding the right interviewee in order to hold proper and effective interview sessions. Interview time is the main constraint, because the researcher has to reschedule the interview whenever the interviewee cancels a session. This makes it difficult for the researcher to gather all of the information, and some important information may be missed. The interview sessions were conducted at Majlis Daerah Kerian, Parit Buntar, Perak.

Another limitation is that there are three different data mining techniques, and the researcher must select the technique that best suits the objective. The researcher needs to study each data mining technique properly and find related journals that support the findings.

Next, a large number of data mining tools are available, but not all of the tools support every kind of data mining technique. The researcher therefore needs to study the tools based on their function and usability with the selected technique. Furthermore, the tool used in this research is new to the researcher, so time is required to become familiar with it.

The experience of the researcher is another limiting factor of the research, as this is the researcher's first research project. However, the researcher can learn and be properly guided by the research plan and by instruction from the supervisor and examiner.

1.8 Outcomes/Deliverables

The outcome of the research project is a suitable searching method using a data mining technique for an e-filing web-based system.

1.9 Layout of Dissertation

This research project has both a theoretical and a practical part. The theoretical part describes the concepts and literature review of e-filing and data mining techniques. The practical part consists of an analysis of data gathered from the interview sessions and secondary data from the literature review.

The remaining chapters of this research are:

Chapter 2 presents the literature review on e-filing and data mining techniques. This literature acts as a reference for this research project.

Chapter 3 describes the research approach and methodology used in

this research project. The choice of method, how data is gathered and

the strategy used to perform an analysis of the data are explained.

Chapter 4 discusses the construction of the system’s prototype.

Chapter 5 discusses the findings and the analysis from the interview

sessions and secondary data.

Chapter 6 provides the conclusion and recommendations for further research.

1.10 Summary

This chapter explains the background of the problem and its proposed solution, together with a brief explanation of the solution. The important aspects of the project, such as the research background, objectives, scope and significance of the project, are included in this chapter. The methodology diagram shown in Figure 3.1 in Chapter 3 and the other contents of this chapter will be used in the following chapters as the basis for direction.

The next chapter discusses the literature review for the research project.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

This chapter describes in detail the related literature that supports the research project. The literature review also clarifies the relationship between the study and previous work conducted on the topic. This chapter covers an overview of e-filing and data mining, a brief explanation of each data mining technique and the steps in selecting data mining tools.

2.2 E-Filing

2.2.1 Introduction to E-Filing

E-filing provides access to a large database that consists of a list of electronic files. According to Olson et al. (2003), e-filing is a highly secure and reliable method for sending, receiving and managing legal documents and case information. However, the rules for implementing e-filing need to be fully understood in order to achieve the best filing.

2.2.2 Purposes of the Rules in E-Filing

According to Olson et al. (2003), there are several reasons why rules are important for electronic filing:

To define the electronic filing system: Electronic filing and service can mean many things, so the exact information regarding the type of files must be clearly defined in order to provide guidance on where and how to access the files.

To authorize electronic filing and service: Rules of procedure are very specific when it comes to defining the mechanical rules of filing. The valid method for delivering documents into the right files needs to be identified for the best filing.

To clearly specify the procedural mechanics: How to file electronically, security, service and filing deadlines, and how to sign documents electronically should be specified simply in order to avoid complexity.

To encourage use of electronic filing: Electronic filing is new to some people, and training is the solution for encouraging them to use the system.

2.2.3 Proposed Model Rules for E-Filing

According to Olson et al. (2003), these rules below may be cited as

“e-filing rules” :

Short title

Clear definitions of files

Give authority

Determine authorized users

Give effective date

Signature to identify responsible user

2.3 What is Data Mining?

2.3.1 Definition of Data Mining

According to Phyu (2009), data mining is the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.

According to Chen, Han and Yu (1996), data mining which is also

referred to as knowledge discovery in databases, means a process of

nontrivial extraction of implicit, previously unknown and potentially

useful information (such as knowledge rules, constraints, regularities)

from data in databases.

Tang, Steinbach and Kumar (2006) stated that data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown.

There are also many other terms found in articles and journals that carry a similar or slightly different meaning, such as knowledge mining from databases, knowledge extraction, data archaeology, data dredging or data analysis.

2.3.2 Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases, which is the overall process of converting raw data into useful information, as shown in Figure 2.1. This process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results (Tang et al., 2006).

Figure 2.1 : The Process of knowledge discovery in database.

Tang et al. (2006) stated that the input data can be stored in a variety

of formats (flat files, spread-sheets, or relational tables) and may

reside in a centralized data repository or be distributed across multiple

sites. The purpose of preprocessing is to transform the raw input data

into an appropriate format for subsequent analysis. The steps involved

in data preprocessing include fusing data from multiple sources,

cleaning data to remove noise and duplicate observations, and

selecting records and features that are relevant to the data mining task

at hand. Because of the many ways data can be collected and stored,

data preprocessing is perhaps the most laborious and time-consuming

step in the overall knowledge discovery process.
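
The preprocessing steps described above (fusing data from multiple sources, removing noise and duplicate observations, and selecting relevant features) can be sketched in Python as follows. This is only a minimal illustration: the record fields (`file_no`, `title`, `unit`, `shelf`) and the sample values are hypothetical and are not taken from Tang et al. (2006) or from Majlis Daerah Kerian data.

```python
def preprocess(sources, keep_fields):
    """Fuse record lists, drop noisy and duplicate rows, keep selected fields."""
    fused = [row for source in sources for row in source]  # fuse all sources
    seen = set()
    cleaned = []
    for row in fused:
        # Normalise the (hypothetical) key so "mdk/01" and "MDK/01 " match.
        file_no = (row.get("file_no") or "").strip().upper()
        if not file_no:       # a row without a key is treated as noise
            continue
        if file_no in seen:   # duplicate observation
            continue
        seen.add(file_no)
        selected = {field: row.get(field) for field in keep_fields}
        selected["file_no"] = file_no
        cleaned.append(selected)
    return cleaned


# Hypothetical input: one list typed from a log book, one from a spreadsheet.
log_book = [
    {"file_no": "mdk/01", "title": "Parking permits", "shelf": "A"},
    {"file_no": "", "title": "Unlabelled", "shelf": "B"},
]
spreadsheet = [
    {"file_no": "MDK/01 ", "title": "Parking permits", "unit": "License"},
    {"file_no": "MDK/02", "title": "Tax assessment", "unit": "Assessment"},
]
records = preprocess([log_book, spreadsheet], keep_fields=["file_no", "title"])
print(records)
```

In this sketch the unlabelled row is discarded as noise and the two spellings of the same file number collapse into one record, leaving two cleaned records.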

“Closing the loop” is the phrase often used to refer to the process of

integrating data mining results into decision support systems. For

example, in business applications, the insights offered by data mining

results can be integrated with campaign management tools so that

effective marketing promotions can be conducted and tested. Such

integration requires a postprocessing step that ensures that only valid

and useful results are incorporated into the decision support system.

Statistical measures or hypothesis testing methods can also be applied

during postprocessing to eliminate spurious data mining results.

According to Shyu, Chen and Haruechaiyasak (2005), data mining or knowledge discovery in databases has emerged recently as an active research area for extracting implicit, previously unknown, and potentially useful information from large databases. They applied data mining techniques in the information retrieval (IR) context, specifically as information filtering tools for the recommender system framework.

The overall process for designing and implementing a recommender

system is illustrated in Figure 2.2. The process involves the following

five steps.

Figure 2.2 : Process for designing and implementing a

recommender system (Shyu et al., 2005)

Data Collection: This initial step involves the collection of data sets

for executing the data mining algorithms. Three data components are

considered: (a) textual content (i.e., index terms or keywords), (b) link

structure (embedded hyperlinks within Web pages), and (c) user log

records.

Data Preprocessing: This step is required to clean and transform the

collected data sets into the formats which are suitable for the data

mining algorithms. This step includes the data reduction and selection

techniques to improve the efficiency of the data mining algorithms.

Information Filtering via Data Mining: This step is the core

process of the recommender system framework, where the data sets

are analyzed and the data mining algorithms are applied as the

information filtering tools to generate and discover any useful and

interesting recommended outputs.

Database Design and Implementation: A database is designed and implemented to improve the efficiency of data and information access and retrieval.

User Interface Design and Implementation: The user interface acts

as an intermediary between the users and the recommender system.

This step involves the design and implementation of a Web (i.e.,

HTTP) server which receives the users’ requests via the WWW,

processes the requests by accessing the database, and responds by

returning the results to the users. The user interface provides a

recommendation function with the user personalization technique by

requiring each user to log into the system in order to keep track of the

preferences.

2.3.3 Challenges of Data Mining

According to Tang et al. (2006), traditional data analysis techniques

have often encountered practical difficulties in meeting the challenges

posed by new data sets.

Chen et al. (1996) stated that it is important to examine what kinds of features an applied knowledge discovery system is expected to have and what kinds of challenges may be faced in the development of data mining techniques.

Chen et al. (1996) also provide a list of challenges faced during the development of data mining techniques, which are:

a. Handling of different types of data.

There are many kinds of data and databases used in different applications, so a knowledge discovery system should be able to perform effective data mining on different kinds of data. Since most available databases are relational, it is crucial that a data mining system performs effective knowledge discovery on relational data. Besides, many databases contain complex data types, such as structured data and complex data objects, hypertext and multimedia data, spatial and temporal data, transaction data, legacy data and so on, so a powerful system should be able to perform efficient data mining on complex types of data as well. However, in order to cope with the diversity of data types, a data mining system may be built to handle specific kinds of data, such as systems dedicated to knowledge mining in relational databases, transaction databases, spatial databases, multimedia databases and so on.

b. Efficiency and scalability of data mining algorithms.

In order to extract information from a large amount of data in databases, the knowledge discovery algorithms must be efficient and scalable. That is, the running time of a data mining algorithm must be predictable and acceptable for large databases.

c. Usefulness, certainty, and expressiveness of data

mining results.

The discovered knowledge should accurately portray the contents of the database and be useful for certain applications. This also encourages a systematic study of measuring the quality of the discovered knowledge, including interestingness and reliability, by the construction of statistical, analytical and simulative models and tools.

d. Expression of various kinds of data mining requests

and results.

Different kinds of knowledge can be discovered from a large amount of data. It is important to examine the discovered knowledge from different views and present it in different forms. This requires expressing both the data mining requests and the discovered knowledge in high-level languages or graphical user interfaces, so that the data mining task can be specified by non-experts and the discovered knowledge is understandable and directly usable by users.

e. Interactive mining of knowledge at multiple abstraction levels.

A high-level data mining query should be treated as a probe which may disclose some interesting traces for further exploration. Interactive discovery allows users to interactively refine a data mining request, dynamically change data focusing, progressively deepen a data mining process and flexibly view the data and data mining results at multiple abstraction levels from different areas.

f. Mining information from different sources of data.

Many sources of data are available through local and wide-area computer networks, including the Internet. Mining knowledge from different sources of formatted or unformatted data with diverse content has become a new challenge for data mining. Data mining may help by providing simple query systems.

g. Protection of privacy and data security.

Protecting data security and guarding against the invasion of privacy are important when data can be viewed from many different angles and at different abstraction levels. Security measures can prevent the disclosure of sensitive information.

However, these requirements may conflict with one another. For example, protection of data security may conflict with the requirements of interactive mining of multiple-level knowledge from different angles.

2.4 Data Mining Techniques

2.4.1 Overview of Data Mining Techniques

According to Garofalakis et al. (1999), data mining techniques

describe key data mining algorithms that have been developed for

large databases.

Garofalakis et al. (1999) also stated the popular data mining

techniques which are association rules, classification and clustering.

2.4.2 Classifying Data Mining Techniques

Chen et al. (1996) stated that data mining techniques can be classified according to the following criteria:

Type of databases to work on

A data mining system can be classified according to the

kinds of databases on which the data mining is performed.

It is important to identify the data type in order to

specify the area in which the system will operate. For example, a

system is a relational data miner if it discovers knowledge

from relational data, or an object-oriented one if it mines

knowledge from object-oriented databases. In general, a

data miner can be classified according to its mining of

knowledge from the following different kinds of databases:

relational databases, transaction databases, object oriented

databases, deductive databases, spatial databases, temporal

databases, multimedia databases, heterogeneous databases,

active databases, legacy databases, and the Internet

information-base.

Type of knowledge to be mined

Data miners should identify several kinds of knowledge

including association rules, characteristic rules,

classification rules, clustering and deviation analysis.

However, this knowledge depends on the abstraction level of

the databases.

Type of techniques to be utilized

Data miners will be categorized according to the

underlying data mining techniques and approach. For

example, it can be categorized according to the driven

method into autonomous knowledge miner, data-driven

miner, query-driven miner, and interactive data miner. It

can also be categorized according to its underlying data

mining approach into generalization based mining, pattern-

based mining, mining based on statistics or mathematical

theories, and integrated approaches, etc.

2.4.3 Association Rules

Association rules provide a useful mechanism for discovering correlations among items belonging to customer transactions in a market basket database (Garofalakis et al., 1999). For example, given a database of sales transactions, it is desirable to discover the important associations among items such that the presence of some items in a transaction implies the presence of other items in the same transaction.

Chen et al. (1996) stated that the problem of mining association rules can be decomposed into the following two steps:

a. Discover the large item sets.

b. Use the large item sets to generate the association rules for

the database.

It is noted that the overall performance of mining association rules is

determined by the first step. After the large item sets are identified,

the corresponding association rules can be derived in a

straightforward manner.
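
The two steps can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the algorithm of Chen et al. (1996): it enumerates every itemset by brute force instead of using Apriori-style pruning, so it is only practical for tiny baskets, and the function names, thresholds and sample transactions are hypothetical.

```python
from itertools import combinations


def large_itemsets(transactions, min_support):
    """Step 1: count every itemset and keep those meeting min_support."""
    counts = {}
    for basket in transactions:
        items = sorted(set(basket))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    total = len(transactions)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}


def association_rules(supports, min_confidence):
    """Step 2: split each large itemset into antecedent => consequent."""
    rules = []
    for itemset, support in supports.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(itemset, size):
                consequent = tuple(i for i in itemset if i not in antecedent)
                # Every subset of a large itemset is itself large,
                # so its support is guaranteed to be in the table.
                confidence = support / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical market basket data.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
supports = large_itemsets(transactions, min_support=0.5)
rules = association_rules(supports, min_confidence=0.9)
print(rules)  # [(('butter',), ('bread',), 1.0)]
```

The first function dominates the running time because it inspects every transaction and every candidate itemset, which is consistent with the remark that overall performance is determined by the first step; the second function only reads the support table.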

Figure 2.3 : The general architecture of Mining Association Rule

model (Defit & Md Sap, 2001)

Figure 2.3 represents the general architecture of the Mining Association Rule (MAR) model. The MAR model consists of two main modules: the pre-processing module and the processing module. The pre-processing module is used to transform data and to identify and remove inconsistent data from databases. Next, the processing module is executed to generate rules and evaluate the generated rules.

2.4.4 Classification

Data classification is the process which finds the common properties

among a set of objects in a database and classifies them into different

classes, according to a classification model. (Chen et al., 1996)

Chen et al. (1996) also stated that the objectives of classification are:

a. To analyze the training data.

b. To develop an accurate description or model for each class using the features available in the data.

Garofalakis et al. (1999) stated that classification is useful in the Web context to build taxonomies and topic hierarchies on Web pages, and subsequently perform context-based searches for Web pages relating to a specific topic. Decision tree classifiers are popular since they are easily interpreted by humans and are efficient to build.
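
To make the idea of learning a class description from training data concrete, the following Python sketch builds a one-level decision tree (a decision stump). It is a simplified stand-in for the decision tree classifiers mentioned above, not a method taken from the cited papers; the features (`unit`, `age`), the labels and the training rows are hypothetical.

```python
from collections import Counter, defaultdict


def train_stump(rows, labels):
    """Pick the feature whose value -> majority-class rule makes fewest errors."""
    best = None
    for feature in rows[0]:
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[feature]][label] += 1
        # One branch per feature value, labelled with its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in by_value.items()}
        errors = sum(rule[row[feature]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    _, feature, rule = best
    return feature, rule, default


def classify(model, row):
    feature, rule, default = model
    return rule.get(row[feature], default)


# Hypothetical training data describing files.
rows = [
    {"unit": "License", "age": "old"},
    {"unit": "License", "age": "new"},
    {"unit": "Building", "age": "old"},
    {"unit": "Building", "age": "new"},
    {"unit": "Garden", "age": "new"},
]
labels = ["archive", "active", "archive", "active", "active"]

model = train_stump(rows, labels)
print(model[0])                                                # age
print(classify(model, {"unit": "Town Service", "age": "old"}))  # archive
```

The learned model is readable as a rule ("old files are archived, new files are active"), which reflects why tree-shaped classifiers are considered easy for humans to interpret.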

Figure 2.4 : Hierarchical Classification Process

(Khodra & Widyantoro, 2007)

Figure 2.4 shows the hierarchical classification process, which consists of two stages: an offline stage and an online stage. The offline stage encodes classification scheme metadata for each web page. In the online stage, all search results are hierarchically categorized using the classification scheme provided in the metadata of the retrieved documents. A classification scheme is a total ordering of classes from the most general class (i.e. the root of the ontology) to the most specific class (i.e. a leaf of the ontology). Khodra and Widyantoro (2007) used Lucene as the search engine and combined it with an interactive navigation interface generator that uses this hierarchical structure to present the list of search results hierarchically.
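
The online stage can be pictured with a short Python sketch that nests search results under their root-to-leaf classification scheme. This is only an assumed data layout for illustration: it does not use Lucene, and the result titles, class names and the `group_hierarchically` and `render` helpers are hypothetical.

```python
def group_hierarchically(results):
    """Nest search results under their root-to-leaf classification path."""
    tree = {"children": {}, "documents": []}
    for title, scheme in results:
        node = tree
        for class_name in scheme:  # walk from most general to most specific
            node = node["children"].setdefault(
                class_name, {"children": {}, "documents": []}
            )
        node["documents"].append(title)
    return tree


def render(node, depth=0):
    """Return the tree as indented text lines for a navigation view."""
    lines = []
    for title in node["documents"]:
        lines.append("  " * depth + "- " + title)
    for class_name, child in node["children"].items():
        lines.append("  " * depth + class_name + "/")
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical search results: (title, classification scheme metadata).
results = [
    ("Parking by-law", ("Law", "By-laws")),
    ("Licence form", ("License",)),
    ("Building by-law", ("Law", "By-laws")),
]
tree = group_hierarchically(results)
print("\n".join(render(tree)))
```

Because each scheme is an ordered path, two results with the same metadata end up under the same leaf, and a user can browse from a general class down to a specific one.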

2.4.5 Clustering

Visnick (2003) defined clustering as a technique to achieve high data density. She classifies clustering into different techniques, namely isolate index, object pooling and object modeling, each of which performs a different function.

Chen et al. (1996) defined clustering as the process of grouping physical or abstract objects into classes of similar objects. It helps the data miner to construct a meaningful partitioning of a large set of objects based on a “divide and conquer” methodology, which decomposes a large-scale system into smaller components to simplify design and implementation.

Garofalakis et al. (1999) defined clustering as a useful technique for

discovering interesting data distributions and patterns in the

underlying data.

Qiu, Davis and Ikem (2004) stated that clustering techniques are

heuristic in nature. Almost all techniques have a number of arbitrary

parameters that can be “adjusted” to improve results.

Clustering techniques fall into the following broad categories :

a. Hierarchical vs partitional : Hierarchical techniques produce

a nested sequence of partitions, with a single, all inclusive

cluster at the top and singleton clusters of individual instances

at the bottom. Each intermediate level can be viewed as a

combination of two clusters from the next lower level, or a

split of a cluster from the next higher level into two.

Partitional (or non-nested) techniques create a one-level

partitioning of the data instances. After the user specifies the

desired number of clusters, a partitional approach typically

finds all clusters at once. This is in contrast to traditional

hierarchical schemes, which bisect a cluster to get two clusters

or merge two clusters to get one.

b. Divisive vs agglomerative : Hierarchical clustering

techniques proceed either from the top to the bottom or from

the bottom to the top, i.e. clustering starts with one large

cluster and splits it, or starts with clusters each containing a

point and then merges them.

c. Incremental vs non-incremental : Some clustering

techniques work with one instance at a time and decide how to

place it into an appropriate cluster, but most clustering

techniques are non-incremental, using information about all

the instances at once to form clusters.
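
As an illustration of the hierarchical, agglomerative (bottom-up) and non-incremental case, the Python sketch below starts with one cluster per point and repeatedly merges the two closest clusters. The single-link distance, the stopping rule (a target number of clusters) and the sample points are assumptions made for this example and are not prescribed by the cited authors.

```python
import math


def single_link(a, b):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, target_clusters):
    """Start with singleton clusters and repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda pair: single_link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into i
        del clusters[j]
    return clusters


# Hypothetical two-dimensional points.
points = [(1, 1), (1, 2), (8, 8), (9, 8), (5, 5)]
clusters = agglomerative(points, target_clusters=2)
print(clusters)  # [[(1, 1), (1, 2)], [(8, 8), (9, 8), (5, 5)]]
```

Recording each merge instead of stopping at a target count would give the full nested sequence of partitions; a divisive technique would run in the opposite direction, starting from one all-inclusive cluster and splitting it.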

Typical pattern clustering activity involves the following steps (Jain,

Murty and Flynn, 2000) :

Pattern representation (optionally including feature extraction and/or selection)

Definition of a pattern proximity measure appropriate to the data domain

Clustering or grouping

Data abstraction (if needed), and

Assessment of output (if needed).

Figure 2.5 : Stages in clustering (Jain et al., 2000)

Figure 2.5 depicts a typical sequencing of the first three of these steps,

including a feedback path where the grouping process output could

affect subsequent feature extraction and similarity computations.

Pattern representation refers to the number of classes, the number of

available patterns, and the number, type, and scale of the features

available to the clustering algorithm. Some of this information may

not be controllable by the practitioner. Feature selection is the process

of identifying the most effective subset of the original features to use

in clustering. Feature extraction is the use of one or more

transformations of the input features to produce new salient features.

Either or both of these techniques can be used to obtain an appropriate

set of features to use in clustering. Pattern proximity is usually

measured by a distance function defined on pairs of patterns. A

variety of distance measures are in use in the various communities. A

simple distance measure like Euclidean distance can often be used to

reflect dissimilarity between two patterns, whereas other similarity

measures can be used to characterize the conceptual similarity

between patterns. The grouping step can be performed in a number of

ways. The output clustering (or clusterings) can be hard (a partition of

the data into groups) or fuzzy (where each pattern has a variable

degree of membership in each of the output clusters).
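
The difference between a hard and a fuzzy output can be shown with a small Python sketch based on Euclidean distance. The membership formula used here is the one commonly associated with fuzzy c-means, which is an assumption of this example since the text above does not name a specific fuzzy method; the cluster centers and the `fuzziness` parameter are likewise hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two patterns of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hard_assignment(pattern, centers):
    """Hard clustering: the pattern belongs to exactly one (nearest) cluster."""
    distances = [euclidean(pattern, c) for c in centers]
    return distances.index(min(distances))


def fuzzy_memberships(pattern, centers, fuzziness=2.0):
    """Fuzzy clustering: a degree of membership in every cluster, summing to 1."""
    distances = [euclidean(pattern, c) for c in centers]
    on_center = [d == 0.0 for d in distances]
    if any(on_center):  # pattern coincides with one or more centers
        return [1.0 / sum(on_center) if hit else 0.0 for hit in on_center]
    power = 2.0 / (fuzziness - 1.0)
    weights = [d ** -power for d in distances]  # closer centers weigh more
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical cluster centers and pattern.
centers = [(0.0, 0.0), (4.0, 0.0)]
pattern = (1.0, 0.0)
print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(hard_assignment(pattern, centers))   # 0
print(fuzzy_memberships(pattern, centers))
```

With these values the pattern is three times closer to the first center, so the hard output places it entirely in cluster 0, while the fuzzy output gives it a membership of roughly 0.9 in cluster 0 and 0.1 in cluster 1.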

Hierarchical clustering algorithms produce a nested series of

partitions based on a criterion for merging or splitting clusters based

on similarity. Partitional clustering algorithms identify the partition

that optimizes (usually locally) a clustering criterion.
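
A common example of a partitional algorithm that optimizes its criterion only locally is k-means, sketched below in Python. K-means is chosen here as an illustration and is not named in the passage above; the initial centers, the iteration limit and the sample points are hypothetical.

```python
import math


def k_means(points, centers, max_iterations=100):
    """Lloyd-style k-means: alternate assignment and center update."""
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else center
            for group, center in zip(groups, centers)
        ]
        if new_centers == centers:  # no change: a (local) optimum is reached
            break
        centers = new_centers
    return centers, groups


# Hypothetical points and starting centers.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centers, groups = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(final_centers)  # [(1.0, 1.5), (8.5, 8.0)]
```

The result depends on the starting centers, which is why the optimum is described as local: a different initialization can converge to a different partition of the same data.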

2.5 Selecting Data Mining Techniques

It is important for the researcher to select a suitable searching method using data mining techniques in order to accomplish the objective. The researcher decided to review three main data mining techniques, which are classification, association and clustering. These techniques serve the same objective of data mining, but differ in terms of their function and suitability for the system.

According to Tang et al. (2006), data mining is a technology that blends

traditional data analysis methods with sophisticated algorithms for processing

large volumes of data. It has also opened up exciting opportunities for

exploring and analyzing new types of data and for analyzing old types of data

in new ways.

Classification, which is the task of assigning objects to one of several

predefined categories, is a pervasive problem that encompasses many diverse

applications. Examples include detecting spam email messages based upon

the message header and content, categorizing cells as malignant or benign

based upon the results of MRI scans and classifying galaxies based upon their

shapes. (Tang et al., 2006)

Association is useful for discovering interesting relationships hidden in large

data sets. The uncovered relationships can be represented in the form of

association rules or sets of frequent items. Besides, many business enterprises

accumulate large quantities of data from their day-to-day operations. For

example, huge amounts of customer purchase data are collected daily at the

checkout counters of grocery stores. Retailers are interested in analyzing the

data to learn about the purchasing behavior of their customers. Such valuable

information can be used to support a variety of business-related applications

such as marketing promotions, inventory management and customer

relationship management. Association techniques discover patterns in

large transaction data sets and evaluate the discovered patterns in order to

prevent the generation of spurious results. (Tang et al., 2006)

Clustering divides data into groups (clusters) that are meaningful, useful, or both.

If meaningful groups are the goal, then the clusters should capture the natural

structure of the data. The concept of clustering has been around for a long

time. It has several applications, particularly in the context of information

retrieval and in organizing web resources. The main purpose of clustering is

to locate information and in the present day context, to locate most relevant

electronic resources. In database management, data clustering is a technique

in which, the information that is logically similar is physically stored

together. In order to increase the efficiency of search and the retrieval in

database management, the number of disk accesses is to be minimized. In

clustering, since the objects of similar properties are placed in one class of

objects, a single access to the disk can retrieve the entire class. If the

clustering takes place in some abstract algorithmic space, we may group a

population into subsets with similar characteristic, and then reduce the

problem space by acting on only a representative from each subset.

Clustering is ultimately a process of reducing a mountain of data to

manageable piles. For example, it can be used to analyze the large amounts of

genetic information that are now available, group search results into a small

number of clusters, identify different types of depression, and segment

customers into a small number of groups for additional analysis and

marketing activities (Ravichandra, 2003).
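The notion of grouping a population into subsets and then acting on one representative per subset can be sketched with a very small k-means routine. This is an illustrative sketch only: the 2-D points are hypothetical, the first k points are assumed as starting centroids, and k-means is one common clustering algorithm rather than the method prescribed by Ravichandra (2003).

```python
def kmeans(points, k, iterations=10):
    """Tiny k-means for 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for members, old in zip(clusters, centroids):
            if members:
                xs = [m[0] for m in members]
                ys = [m[1] for m in members]
                updated.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # assignments are stable, so stop early
        centroids = updated
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

Each returned centroid plays the role of the representative of its subset, so later processing can work with two values instead of six.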

It is important for the researcher to identify the suitability of each

technique in order to implement a good searching method. The researcher

reviewed the techniques based on their definition, concept, functions,

suitability and examples given in several journals (refer to Table 2.1).


Table 2.1 : Differences of Classification, Association and Clustering techniques

Definition

Classification : Data classification is the process which finds the common properties among a set of objects in a database and classifies them into different classes, according to a classification model. (Chen et al., 1996)

Association : Association rules provide a useful mechanism for discovering correlations among items belonging to customer transactions in a market basket database. (Garofalakis et al., 1999)

Clustering : Clustering is the process of grouping physical or abstract objects into classes of similar objects. (Chen et al., 1996)

Concept

Classification : Classification, which is the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. (Tang et al., 2006)

Association : Association is useful for discovering interesting relationships hidden in large data sets. (Tang et al., 2006)

Clustering : Clustering divides data into groups (clusters) that are meaningful, useful, or both. (Ravichandra, 2003)

Functions

Classification : Classification is useful in the Web context to build taxonomies and topic hierarchies on Web pages, and subsequently perform context-based searches for Web pages relating to a specific topic. (Garofalakis et al., 1999)

Association : It discovers patterns in large transaction data and evaluates the discovered patterns in order to prevent the generation of spurious results. (Tang et al., 2006)

Clustering : It helps the data miner to construct a meaningful partitioning of a large set of objects based on a “divide and conquer” methodology, which decomposes a large-scale system into smaller components to simplify design and implementation. (Chen et al., 1996)

Suitability

Classification : Develop an accurate description or a model for each class using the features available in the data. (Chen et al., 1996)

Association : Discovering correlations among items belonging to customer transactions in a market basket database for market analysis. (Garofalakis et al., 1999)

Clustering : Increase the efficiency of search and retrieval in database management. (Ravichandra, 2003)

Examples

Classification : Detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes. (Tang et al., 2006)

Association : Huge amounts of customer purchase data are collected daily at the checkout counters of grocery stores. Retailers are interested in analyzing the data to learn about the purchasing behavior of their customers. (Tang et al., 2006)

Clustering : Analyzing the large amounts of genetic information that are now available, grouping search results into a small number of clusters, identifying different types of depression, and segmenting customers into a small number of groups for additional analysis and marketing activities. (Ravichandra, 2003)

According to the comparison above, after reviewing each technique based on

its definition, concept, functions, suitability and examples given in several

journals, the researcher found that clustering is a suitable searching method for

the e-filing web-based system.


Although classification, association and clustering are similar in terms

of information retrieval, they differ in how the information is

retrieved, analyzed and delivered. Classification assigns

objects to one of several predefined categories and develops a model for each

class using the features available in the data. Association discovers

correlations among data items in order to identify interesting relationships hidden in

large data sets, especially for market analysis. Clustering, in contrast, groups

physical or abstract objects into classes of similar objects to provide a simplified

list of data. In other words, it divides data into groups that are similar,

meaningful and useful.

Clustering suits searching because partitioning a large set of data

decomposes a large search result into smaller components that simplify the

content. It helps users to review accurate search results that fulfill their needs

and expectations.

In terms of suitability, clustering increases the efficiency of search and

retrieval of information in database management (Ravichandra, 2003). It

analyzes the search results to identify similarities between them and

provides a simplified list of results. Further information on why

clustering is suitable as the searching method in the e-filing web-based system is

discussed in Chapter 5 (Result and Findings).


2.6 Selecting Data Mining Tools

Data mining tools are used widely to solve real-world problems in

engineering, science and business (Abbott, Matkovsky & Elder, 1998).

The number of data mining tools is increasing, and selecting an

effective tool has become more challenging. The data mining tool

market has become more crowded in recent years, with more than 50

commercial data mining tools listed on the KDnuggets website

(http://www.kdnuggets.com). KDnuggets describes itself as the data mining

community’s top resource since 1997 for data mining and analytics news,

tools, jobs, courses, data and more.

Collier, Carey, Sautter and Marjaniemi (1999) proposed four categories of

criteria for selecting from among the assortment of commercially available

data mining tools. Three of them are reviewed here:

a. Performance

Performance is the ability to handle a variety of data sources

in an efficient manner (Table 2.2). From a computational perspective,

hardware configuration has a major impact on tool performance.

Besides, some data mining algorithms are more efficient than others.

However, this category focuses on the qualitative aspects of a

tool’s ability to easily handle data under a variety of hardware

configurations. The criteria to consider in this category are

platform variety, software architecture, heterogeneous data access,

data size, efficiency, interoperability and robustness.


Table 2.2 : Computational Performance Criteria (Collier et al., 1999)

Criteria : Description

Platform Variety : Does the software run on a wide-variety of computer platforms? More importantly, does it run on typical business user platforms?

Software Architecture : Does the software use client-server architecture or a stand-alone architecture? Does the user have a choice of architectures?

Heterogeneous Data Access : How well does the software interface with a variety of data sources (RDBMS, ODBC, CORBA, etc)? Does it require any auxiliary software to do so? Is the interface seamless?

Data Size : How well does the software scale to large data sets? Is performance linear or exponential?

Efficiency : Does the software produce results in a reasonable amount of time relative to the data size, the limitations of the algorithm, and other variables?

Interoperability : Does the tool interface with other KDD support tools easily? If so, does it use a standard architecture such as CORBA or some other proprietary API?

Robustness : Does the tool run consistently without crashing? If the tool cannot handle a data mining analysis, does it fail early or when the analysis appears to be nearly complete? Does the tool require monitoring and intervention or can it be left to run on its own?

b. Functionality

Functionality covers the variety of capabilities, techniques and methodologies

that a tool offers for data mining (Table 2.3). Software functionality helps to

assess how well the tool adapts to different data mining problems.

The criteria in the functionality category are algorithmic variety,

prescribed methodology, model validation, data type flexibility,

algorithm modifiability, data sampling, reporting and model exporting.

Table 2.3 : Functionality Criteria (Collier et al., 1999)

Criteria : Description

Algorithmic Variety : Does the software provide an adequate variety of mining techniques and algorithms including neural networks, rule induction, decision trees, clustering, etc.?

Prescribed Methodology : Does the software aid the user by presenting a sound, step-by-step mining methodology to help avoid spurious results?

Model Validation : Does the tool support model validation in addition to model creation? Does the tool encourage validation as part of the methodology?

Data Type Flexibility : Does the implementation of the supported algorithms handle a wide-variety of data types, continuous data without binning, etc.?

Algorithm Modifiability : Does the user have the ability to modify and fine-tune the modeling algorithms?

Data Sampling : Does the tool allow random sampling of data for predictive modeling?

Reporting : Are the results of a mining analysis reported in a variety of ways? Does the tool provide summary results as well as detailed results? Does the tool select actual data records that fit a target profile?

Model Exporting : After a model is validated does the tool provide a variety of ways to export the tool for ongoing use (e.g., C program, SQL, etc.)?


c. Usability

Usability is the accommodation of different levels and types of users (Table 2.4).

One problem with easy-to-use mining tools is their potential

misuse. The criteria to consider are user interface, learning curve,

user types, data visualization, error reporting, action history and

domain variety.

Table 2.4 : Usability Criteria (Collier et al., 1999)

Criteria : Description

User Interface : Is the user interface easy to navigate and uncomplicated? Does the interface present results in a meaningful way?

Learning Curve : Is the tool easy to learn? Is the tool easy to use correctly?

User Types : Is the tool designed for beginning, intermediate, advanced users or a combination of user types? How well suited is the tool for its target user type? How easy is the tool for analysts to use? How easy is the tool for business (end) users to use?

Data Visualization : How well does the tool present the data? How well does the tool present the modeling results? Are there a variety of graphical methods used to communicate information?

Error Reporting : How meaningful is the error reporting? How well do error messages help the user debug problems? How well does the tool accommodate errors or spurious model building?

Action History : Does the tool maintain a history of actions taken in the mining process? Can the user modify parts of this history and re-execute the script?

Domain Variety : Can the tool be used in a variety of different industries to help solve a variety of different kinds of business problems? How well does the tool focus on one problem domain? How well does it focus on a variety of domains?

Data mining tools are costly and generally come with a moderately steep

learning curve. Selecting the wrong tool is expensive in terms of both wasted

money and time. These categories for selecting data mining tools help

practitioners avoid spending much time only to discover that a particular tool

does not provide the necessary solution (Collier et al., 1999).
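One simple way to apply such categories when comparing candidate tools is a weighted score. The sketch below is an assumption for illustration only: the tool names, the 1 to 5 ratings and the category weights are hypothetical, and the calculation is not claimed to be the exact procedure of Collier et al. (1999).

```python
def weighted_score(ratings, weights):
    """Weighted average of per-category ratings (higher is better)."""
    total_weight = sum(weights.values())
    return sum(ratings[category] * w for category, w in weights.items()) / total_weight


# Hypothetical importance of each category for a given project.
weights = {"performance": 3, "functionality": 4, "usability": 3}

# Hypothetical ratings on a 1 (poor) to 5 (excellent) scale.
tools = {
    "Tool A": {"performance": 4, "functionality": 3, "usability": 5},
    "Tool B": {"performance": 5, "functionality": 4, "usability": 2},
}

scores = {name: weighted_score(ratings, weights) for name, ratings in tools.items()}
best_tool = max(scores, key=scores.get)
```

Changing the weights changes the ranking, so the weights should be agreed on before the tools are rated.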

Bialynicka (2008) stated that the following data mining tools support

clustering:

Scatter

Grouper

Carrot²

Vivisimo


Scatter is designed for browsing and supports online clustering based on two

novel clustering algorithms, buckshot and fractionation. Buckshot is

fast enough for online clustering, while fractionation is accurate for offline initial

clustering of the entire set (Bialynicka, 2008).

Grouper is suitable for online purposes and operates on query result snippets.

It clusters together documents with large common subphrases

(Bialynicka, 2008).

Carrot² is a component framework that allows components to be substituted for

input (from other search engines), filtering (stemming, distance measure and

clustering) and output of the results (Bialynicka, 2008).

Vivisimo is a commercial online clustering engine that supports hierarchical and

conceptual clustering techniques (Bialynicka, 2008).
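The idea of clustering snippets that share common phrases can be sketched in a few lines. This is a simplified illustration under stated assumptions: the snippets are hypothetical file titles, phrases are taken as plain two-word sequences, and the code is not the actual algorithm used by Grouper or Carrot².

```python
from collections import defaultdict


def word_ngrams(text, n):
    """Set of n-word phrases that occur in the text (lower-cased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def group_by_shared_phrases(snippets, n=2, min_docs=2):
    """Map each phrase shared by at least min_docs snippets to their indexes."""
    phrase_to_docs = defaultdict(set)
    for index, snippet in enumerate(snippets):
        for phrase in word_ngrams(snippet, n):
            phrase_to_docs[phrase].add(index)
    return {
        phrase: sorted(docs)
        for phrase, docs in phrase_to_docs.items()
        if len(docs) >= min_docs
    }


# Hypothetical search result snippets.
snippets = [
    "building permit application 2009",
    "renewal of business license",
    "building permit appeal",
    "business license fee schedule",
]
groups = group_by_shared_phrases(snippets)
```

Each shared phrase doubles as a readable label for its group, which is the main attraction of phrase-based clustering for search results.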

For this research project, however, the researcher used Carrot², a free tool that is

available for learning purposes.

Carrot² is an open source search results clustering engine. It can

automatically organize small collections of documents, e.g. search results,

into thematic categories. (Carrot², 2010)

Apart from two specialized document clustering algorithms, Carrot² offers

ready-to-use components for fetching search results from various sources

including YahooAPI, GoogleAPI, MSN Live API, eTools Meta Search,

Lucene, SOLR, Google Desktop and more. Carrot² is implemented in Java,

but it easily integrates with non-Java software, such as PHP, Ruby or C#.

(Carrot², 2010)


2.7 Summary

This chapter provides an overview of e-filing and data mining techniques based

on a literature review of several journals. Rules in e-filing, an overview of

data mining and challenges in data mining are discussed. The researcher also

reviews three basic data mining techniques, namely classification,

association and clustering, compares

them and selects the suitable data mining technique for the searching

method in the e-filing web-based system (refer to Table 2.1). Based on the

comparison in Table 2.1, the researcher found that clustering is a suitable

searching method for the e-filing web-based system. In addition, after reviewing

several journals on data mining tools, the researcher chose Carrot² (an open

source search results clustering engine), a free tool available for learning

purposes, for this research project.

The next chapter discusses the research approach and the methodology for the

research project.


CHAPTER 3

RESEARCH APPROACH AND METHODOLOGY

3.1 Introduction

This chapter describes the methodology and approaches that were used in the

research, from problem identification to development of the system. To

achieve the objectives of this project, the right approach must be applied to

reach sound conclusions. This research used five major steps to develop a

prototype of the e-filing web-based system using data mining techniques. It

consists of problem identification and planning, requirement gathering,

requirement analysis, design model and develop prototype. An overview of

this methodology is shown in Figure 3.1.

Figure 3.1 : Overview of Research Approach and Methodology


3.2 Problem Identification and Planning

This phase identifies the goal, scope, budget, schedule, technology and

system development process, methods and tools to ensure that everything is

in the right place. However, the plan depends on what the researcher intends

according to the stakeholders’ requirements.

Before planning the project, the researcher should know the

current situation and the problems of the old system. An understanding of

potential problems is essential to make the development

successful. After the researcher identifies the problems, the scope of the project

is defined. The goal must be determined and the objectives of the project

must address the problems that have been identified. After analyzing all

the problems and identifying what tasks need to be done, a measurable

and achievable project plan is scheduled using Microsoft Project.

For this research, Microsoft Project was used to produce a Gantt chart (refer to

Appendix A - Project Planning) as a guideline for the researcher to finish

the project. This phase involves the following steps:

a. Discuss the current problems with staff at Majlis Daerah

Kerian

The current problems need to be identified so that they can be

addressed in the subsequent tasks.

b. Identify goal, objective, scope, and significance of research

The goal, objective, scope and significance of the research need to be

clearly defined.

c. Plan related tasks

The related tasks are scheduled using Microsoft Project.

Time must be allocated carefully and every task must be

stated to ensure the completion of the research.


3.3 Requirement Gathering

Requirement gathering is the process of gathering all the information needed

to develop the system. In this phase, a method of data collection is

applied to identify the concepts and

requirements that apply in developing the e-filing web-based

system. For this research, there are two types of data collection:

3.3.1 Primary Data

Primary data is gathered from original

sources such as interviews, questionnaires and observation. For this

research, the researcher used data from interviews with staff at

Majlis Daerah Kerian. Interviewing is a technique used to gain

detailed information about the subject of interest of this

research. This includes the software and hardware used and also the

problems that arise in the current system, so that requirements can be identified.

Table 3.1 below lists the people involved in the

interview sessions for gathering the requirements of the e-filing web-based

system.

Table 3.1 : Respondents involved in the interview

Respondent Name Department

Mr. Gobibaskaran A/L Govindaraju Head of IT Department,


Puan Shalina Mat Piah Administrative Assistant,



The main advantage of interviews is that the answers of the

interviewees are more spontaneous, without extended reflection. An interview

can follow a top-down approach in which the interviewer

starts with general questions and progresses to specific questions about the

task. Interviews should be planned in advance by defining a set of interview

questions to be asked. This not only helps to ensure

consistency between interviews conducted with different interviewees

but also helps to focus on the purpose of the interview session.

The deliverable of this activity is the set of identified requirements

needed for the e-filing web-based system.

3.3.2 Secondary Data

The secondary data for this research was collected from

many resources such as articles, journals, books and other related

academic publications about e-filing and data mining. It is

important for gaining a deeper understanding of e-filing and data mining.

3.4 Requirement Analysis

This is the next stage after all data has been collected in the requirement

gathering phase. The primary data collected needs to be analyzed to

define the system requirements for developing the e-filing web-based system. The

collected data needs to be studied and analyzed properly in order to have

accurate, reliable and relevant information during the development. These

requirements helped the researcher to identify the use cases that define the

system functions and finally to produce the Software Requirement

Specification (SRS) documentation.


In addition, the secondary data collected during the requirement gathering phase is

useful for identifying a suitable searching method using data mining techniques.

The researcher compared three popular data mining techniques (association,

classification and clustering) in order to identify a suitable

searching method among them. The

tool used during this phase is Rational Rose.

3.5 Design Model

The model is designed and determined before proceeding with the actual

construction of the database and system. The system interface, classes, objects

and their relations are designed using Rational Rose. All diagrams related

to this research, including the class diagram, use case diagram and sequence

diagram, are designed based on the results of the requirement analysis

phase.

After all the objects and classes are illustrated clearly with their attributes and

methods, the database is developed. This activity is

accomplished using the MySQL database. At the end of this activity, a

detailed design (database model) is produced. The deliverable of this phase

is documented in the Software Design Document (SDD).
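To make the idea of a database model concrete, the sketch below creates one table and runs a keyword search against it. Everything in it is a hypothetical illustration: the table name file_record, its columns and the sample rows are assumptions rather than the schema in the SDD, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory database used only for this illustration (stand-in for MySQL).
conn = sqlite3.connect(":memory:")

# Hypothetical table holding the status and physical location of each file.
conn.execute(
    """
    CREATE TABLE file_record (
        file_id    INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        department TEXT NOT NULL,
        status     TEXT NOT NULL,
        location   TEXT NOT NULL
    )
    """
)

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO file_record (title, department, status, location) VALUES (?, ?, ?, ?)",
    [
        ("Business license renewal", "License and Parking Unit", "Available", "Shelf B2"),
        ("Building plan approval", "Building Unit", "On loan", "Shelf A1"),
    ],
)


def find_files(conn, keyword):
    """Return (title, department, status, location) rows whose title contains keyword."""
    cursor = conn.execute(
        "SELECT title, department, status, location FROM file_record "
        "WHERE title LIKE ? ORDER BY title",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()


matches = find_files(conn, "license")
```

The query uses a bound parameter rather than string concatenation, which keeps user-supplied search keywords from being interpreted as SQL.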

3.6 Develop Prototype

Prototype development involves building the application using

the appropriate development technologies. In this phase, the researcher

develops the prototype of the e-filing web-based system using data mining

techniques. Apache is used as the web server, MySQL as the

database server, and PHP as the programming language for

development. Dreamweaver is used as the workspace for writing the

program code and Carrot² as the data mining tool. At the end of this phase, the e-filing

prototype system using data mining techniques is produced.

3.7 Summary

The research methodology describes the research strategy that is used in this

research project. For this research project, a plan of action is laid out that

shows how the problem will be investigated, what information will be

collected using which method and how this information will be analyzed to

come to the conclusion. It consists of problem identification and planning,

requirement gathering, requirement analysis, design model and develop

prototype.

The methodology stated above was followed to develop the e-filing web-

based system in order to achieve the project’s objectives as well as to

fulfill the requirements specified by the user. With an understandable and

achievable methodology, the project was carried out in a proper manner and

consequently completed effectively.

The next chapter discusses the construction of the prototype for the research project.


CHAPTER 4

PROTOTYPE CONSTRUCTION

4.1 Introduction

This chapter explains the construction of the prototype in depth and

details the development of the e-filing web-based

system. It explains the results and how they achieve the project objectives.

4.2 Software Requirements

Specified below is the list of software tools that are selected during the

development process. These include operating system and other applications

that are compulsory for the system to be developed and deployed.

4.2.1 Software Tools

Table 4.1 : Software Tools Specifications

No.  Software                            Type
1.   Windows XP SP2                      Operating System (OS)
2.   MySQL                               Database Server
3.   PHP                                 Programming Platform
4.   Apache                              Web Server
5.   Rational Rose Enterprise Edition    Unified Modeling Language Software
6.   Adobe Photoshop CS3                 Graphics Design Software
7.   Macromedia Dreamweaver MX 2004      Workspace Software
8.   Carrot²                             Open source framework for building search clustering engines


4.2.2 Software Tools Installation

The main tools listed in Table 4.1, namely Apache, MySQL Server version 5,

Rational Rose Enterprise Edition, Adobe Photoshop CS3, Macromedia

Dreamweaver MX 2004 and Carrot², are explained further below.

a. Apache

The Apache HTTP Server, commonly referred to as Apache is

web server software notable for playing a key role in the initial

growth of the World Wide Web. In 2009 it became the first

web server software to surpass the 100 million web site

milestone. Apache was the first viable alternative to the

Netscape Communications Corporation web server (currently

known as Sun Java System Web Server), and has since

evolved to rival other Unix-based web servers in terms of

functionality and performance. Apache supports a variety of

features, many implemented as compiled modules which

extend the core functionality. These can range from server-

side programming language support to authentication schemes.

Some common language interfaces support Perl, Python, Tcl,

and PHP. Apache provides a variety of MultiProcessing

Modules (MPMs) which allow Apache to run in a process-

based, hybrid (process and thread) or event-hybrid mode, to

better match the demands of each particular infrastructure.

This implies that the choice of correct MPM and the correct

configuration is important. Where compromises in

performance need to be made, the design of Apache is to

reduce latency and increase throughput, relative to simply

handling more requests, thus ensuring consistent and reliable

processing of requests within reasonable time-frames. (Apache,

2002)


b. MySQL Version 5

MySQL is the world's most popular open source database

software, with over 100 million copies of its software

downloaded or distributed throughout its history. With its

superior speed, reliability, and ease of use, MySQL has

become the preferred choice for Web, Web 2.0, SaaS, ISV,

Telecom companies and forward-thinking corporate IT

Managers because it eliminates the major problems associated

with downtime, maintenance and administration for modern,

online applications. (MySQL, 2009)

MySQL Server was chosen as the data storage for the e-filing

web-based system because of its consistency, fast

performance, high reliability and ease of use. The researcher

only needs to follow the instructions in the installation wizard until the

installation process is completed. Once the installation is

completed, MySQL Server version 5 can be used in the

development of the e-filing web-based system.

c. Rational Rose Enterprise Edition

According to IBM Corporation (2006), Rational Rose enables

the creation of the following types of UML based diagrams:

activity diagrams, class, component, deployment, sequence,

state chart, use case, collaboration, physical storage and

deployment, and physical data and tables.

The researcher used Rational Rose Enterprise Edition to create

the UML models for the e-filing web-based system, which consist of

use case, sequence and class diagrams.


d. Adobe Photoshop CS3

Photoshop CS3 is part of Adobe’s Creative Suite (along with a

host of other products such as Illustrator). It is Adobe’s

flagship bitmap editor, and as a professional-level editor for fine

art photography it has no viable alternative. Photoshop is the

industry standard because of its flexibility and extensibility (it

supports a wide range of third-party plug-ins), its support for

color management, and the robustness of its tools. (Levy,

2007)

The researcher used Adobe Photoshop CS3 to design the interface

of the e-filing web-based system, which consists of the header, logo and

system layout.

e. Macromedia Dreamweaver MX 2004

Dreamweaver is a powerful web page creation and web site

management tool. It offers numerous, sophisticated functions

that can be used to create professional quality web sites.

Because of this, it’s one of the most popular web authoring

tools among web designers. (San Diego State University,

2004)

The researcher used Macromedia Dreamweaver MX 2004 as the

workspace software for writing the PHP

code of the e-filing web-based system.

f. Carrot²

According to Carrot² (2010), Carrot2 is an Open Source Search

Results Clustering Engine. It can automatically organize small


collections of documents, e.g. search results, into thematic

categories.

Apart from two specialized document clustering algorithms,

Carrot2 offers ready-to-use components for fetching search

results from various sources including YahooAPI, GoogleAPI,

MSN Live API, eTools Meta Search, Lucene, SOLR, Google

Desktop and more. Besides, Carrot2 is implemented in Java,

but it easily integrates with non-Java software, such as PHP,

Ruby or C#.

The researcher used Carrot², an open source framework, to

build a search results clustering engine. It organizes the

search results into topics, fully automatically and without

external knowledge such as taxonomies or preclassified content.

4.3 Hardware Requirements

For developing and deploying the e-filing web-based system, the minimum

hardware requirement is a standard personal computer with an

Intel or AMD processor, a standard motherboard, an 80 GB hard disk and 512 MB

of DDR RAM. No additional external device is needed for this project.

4.4 Development Phase

Based on the research methodology depicted in Figure 3.1, the system construction

process covers the last three phases of the research methodology, which are

requirement analysis, design and development. Each process

involved in these phases is explained further below.


4.4.1 Requirement Analysis Phase

In this construction process, the researcher analyzed the requirements

in more detail. The researcher drew the use case diagram using

Rational Rose, focusing on a high-level,

user-centered view of the system. The use cases were then used to derive the

class diagram, which is the primary model for describing the internal

structure and behavior of the system. Furthermore, each use

case was described thoroughly, stating the flows involved within it, and

the sequence diagrams were produced. As

a result, a summary of requirements for the development of the e-filing

web-based system was fully constructed. For details on the requirements,

please refer to Appendix D: Software Requirement Specification (SRS).

4.4.2 Design Phase

The design phase is concerned with specifying the e-filing web-based

system that will meet the requirements. The design of this project

takes place at two main levels, which are system design and detailed

design.

a. System Design

System design focuses on architectural aspects that affect

the entire system (Bennett, McRobb & Farmer, 2006). The

system design of the e-filing web-based system involved setting

standards for the design of the human-computer interface,

specifying the coding standard, and selecting a

suitable database management system for data storage.

This project uses MySQL as the database management system and

PHP as the programming language.


b. Detailed Design

Detailed design addresses the design of classes and the

detailed workings of the system. It was based on the

requirements specified in the Software Requirement

Specification (SRS) and follows an object-oriented design

approach. In an object-oriented system, the detailed design is

concerned with the design of objects. Object design is mainly

concerned with the specification of attribute types, how

operations function, and how objects are linked to other objects

(Bennett et al., 2006). For a detailed description of the class diagram,

please refer to Appendix E: Software Design Document (SDD).

4.4.3 Development Phase

In the development phase, a series of development tasks were

performed: constructing the database,

establishing its connection, and coding. These tasks are explained

further below.

a. Coding

This task was done concurrently with the enhancement of the

interfaces. The necessary code was added to the programs to

enable the interfaces to function correctly. Figure 4.1 shows

one of the code segments constructed during

development using Macromedia Dreamweaver MX 2004.


Figure 4.1 : Coding index.php

b. Data Mining Techniques

This task was done concurrently with the enhancement of the
e-filing web-based system, adding a searching method based on data
mining techniques. Clustering was selected as the suitable data
mining technique for the searching method. The researcher used
Carrot², an open source framework for building search
clustering engines. The necessary code was added to the
system to cluster search results.
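The idea of clustering search results can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the prototype itself uses PHP, and it does not reproduce the algorithms used by Carrot². The function name `cluster_by_term`, the stop-word list and the sample file titles are all hypothetical; the sketch simply groups result titles under every keyword that at least two titles share.

```python
from collections import defaultdict

# Hypothetical stop-word list; a real system would need a fuller one.
STOP_WORDS = {"the", "of", "and", "for", "a", "an", "in", "to"}


def cluster_by_term(results):
    """Group result titles under each non-stop-word they contain.

    Returns a dict mapping a label term to the titles containing it,
    keeping only terms shared by at least two titles. A title may
    appear in more than one cluster.
    """
    groups = defaultdict(list)
    for title in results:
        terms = {w for w in title.lower().split() if w not in STOP_WORDS}
        for term in terms:
            groups[term].append(title)
    return {t: titles for t, titles in groups.items() if len(titles) >= 2}


# Hypothetical search results (file titles).
results = [
    "Parking licence renewal",
    "Parking compound appeal",
    "Building plan approval",
    "Building licence application",
]
clusters = cluster_by_term(results)
for label in sorted(clusters):
    print(label, "->", clusters[label])
```

Each cluster label here is just a shared word, which keeps the example short; a clustering engine would normally derive more descriptive labels.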

c. Interface

Figure 4.2 shows the main page of the system. This page
appears after an authorized user (staff) logs into the system.
It shows the list of menus that staff use to operate the
system.
(Refer to Appendix F: Description of System Interface.)


Figure 4.2 : The main page interface of e-filing

4.5 Summary

This chapter explained the construction of the e-filing web-based system in
detail. The researcher reviewed the software tools selected for the
development process, namely the operating system and the other applications
required to develop and deploy the system: Dreamweaver MX 2004, Apache,
MySQL, Rational Rose and Carrot². The researcher also identified the minimum
hardware requirements for developing and deploying the system. Finally, the
chapter reviewed the development tasks that were performed, which consist of
the requirement analysis, design and development phases.

The next chapter discusses the result and findings for the research project.


CHAPTER 5

RESULT AND FINDINGS

5.1 Introduction

This chapter will explain how the collected data is organized, analyzed and

finalized to be used in the development phase of the research. The result of

the research that has been conducted will be explained in depth in this

chapter. It includes the findings and result gathered from the interviews and

discussions.

5.2 Interview Results

To generate good interview questions, the researcher followed a model for
navigating interview processes in requirements elicitation (refer to Figure 5.1).

Figure 5.1 : A Model for Navigating Interview Processes in Requirements Elicitation


To develop a Software Requirements Specification (SRS) of good quality,
it is important to elicit requirements from stakeholders correctly.
Interview sessions were conducted with Encik Gobibaskaran A/L
Govindaraju, the Head of Information Technology at Majlis Daerah Kerian,
and Puan Shalina Mat Piah, an Administrative Assistant at Majlis Daerah
Kerian. The interview questions fall into two categories. The first
category focuses on the current problems faced by staff at Majlis
Daerah Kerian; all the necessary data about these problems was
collected through this category. The second category focuses on the
functional requirements of the system to be developed. The sample interview
questions can be found in Appendix C.

5.2.1 Current Problems

Interviewee :

Puan Shalina Mat Piah, Administrative Assistant,


The results gained from the first category of interview questions
are presented in Table 5.1 below.

Table 5.1 : The problems that have been identified from the interviews.

PQ.1
Researcher: Is the current manual system easy and comfortable for you?
Interviewee: No.

PQ.2
Researcher: Please describe the current manual system for managing and searching files.
Interviewee: It involves many steps:
1. Search the log book for the number of the required file.
2. Determine the file name from the file number.
3. Check for the file on many large shelves, which takes a long time.
4. If the file is not on the shelf, check each staff member's desk or the other departments in Majlis Daerah Kerian.

PQ.3
Researcher: Is it easy to identify suitable files manually according to your requirements?
Interviewee: No.

PQ.4
Researcher: Why do you think it is not easy to identify suitable files manually?
Interviewee: It is difficult to search for suitable files. It is difficult to know the status of the files. It takes a long time. There are thousands of files on the shelves. Sometimes files are exchanged between departments.

PQ.5
Researcher: In your opinion, is it important for MDK to have a web-based system that acts as an information center where staff can gather information about the status of files?
Interviewee: Yes, of course.


5.2.2 Functional Requirements

Interviewee :

Encik Gobibaskaran A/L Govindaraju, Head of IT Department,


The second category of the interview focuses on the functional
requirements of the system. The requirements and suggestions
gathered from the interviews are presented in Table 5.2 below.

Table 5.2: The requirement and suggestion that had been

identified from the interviews

RQ.1
Researcher: How many types of user are involved in the system?
Interviewee: Three: Administrator, Manager and Staff.

RQ.2
Researcher: What do you think the e-filing web-based system should have?
Interviewee: It should store general staff information, store file information, store the status and location of the files, and implement automated searching to identify suitable files.

RQ.3
Researcher: What are the roles of the Administrator, Manager and Staff in the system?
Interviewee:
Admin: handle user accounts, view and delete files.
Manager: handle user information, maintain files and delete staff.
Staff: handle user information and maintain files.

RQ.4
Researcher: What is your suggestion for the language used to develop the system?
Interviewee: Use an open source language that suits any platform, such as PHP.

RQ.5
Researcher: What is your suggestion for the database used to develop the system?
Interviewee: A MySQL database.

Based on Table 5.2, several processes for the system were
identified. These requirements describe the functionality of the
e-filing web-based system. They were collected and analyzed
to produce the new system.
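The requirement to store file status and support automated searching can be sketched as a small database lookup. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the MySQL database named by the interviewee. The column names follow the `file` entity in the class diagram, while the helper `find_status`, the status values and the sample rows are hypothetical, and `dept_name` is assumed here to indicate where a file is held.

```python
import sqlite3

# In-memory database standing in for the MySQL server (an assumption
# made so that the example is self-contained).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file ("
    " file_id TEXT PRIMARY KEY,"
    " file_name TEXT,"
    " file_status TEXT,"
    " dept_name TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO file VALUES (?, ?, ?, ?)",
    [
        ("F001", "Parking licence renewal", "available", "License and Parking Unit"),
        ("F002", "Building plan approval", "borrowed", "Building Unit"),
    ],
)


def find_status(conn, keyword):
    """Return (id, name, status, department) for files whose name contains keyword."""
    rows = conn.execute(
        "SELECT file_id, file_name, file_status, dept_name"
        " FROM file WHERE file_name LIKE ? ORDER BY file_id",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


for row in find_status(conn, "parking"):
    print(row)
```

The query is parameterized so that the search keyword is never spliced into the SQL text, which is the usual way to avoid SQL injection in a search form.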


5.3 Use Case Diagram

Actors: Admin, Manager, Staff
Use cases: Validate User, Maintain User Account, View Files, Delete Files,
Maintain User Information, Maintain Files Information,
Maintain Customer Information, Delete Staff

Figure 5.2 : Use Case Diagram for E-Filing web-based system

Figure 5.2 above shows the use case diagram for the e-filing web-based
system. It illustrates the functionality available to the administrator,
manager and staff. All three must first register and then log in before
they can use the system. Once logged in, the admin can maintain user
accounts, view files and delete files. The manager can maintain user
information, maintain files information, maintain customer information
and delete staff. Staff can maintain user information, maintain files
information and maintain customer information.


The use cases are described in Table 5.3.

Table 5.3 : Description of Use Case diagram

Use Case: Maintain User Account
Description: Used by the administrator to update and delete the accounts of users of the system.

Use Case: View Files
Description: Used by the administrator to view files from all departments in Majlis Daerah Kerian.

Use Case: Delete Files
Description: Used by the administrator to delete files from all departments in Majlis Daerah Kerian.

Use Case: Validate User
Description: Used by the administrator, manager and staff to log into the system.

Use Case: Maintain User Information
Description: Used by the manager and staff to register and to update their information.

Use Case: Maintain Files Information
Description: Used by the manager and staff to add new files, update files and delete files.

Use Case: Maintain Customer Information
Description: Used by the manager and staff to add, update and delete customers.

Use Case: Delete Staff
Description: Used by the manager to delete staff who do not belong to their department.
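The mapping between roles and use cases in Figure 5.2 and Table 5.3 can be written down as a simple permission table. The following Python sketch is only an illustration: the prototype itself is written in PHP, and the identifiers `PERMISSIONS` and `can` are hypothetical names, not taken from the system.

```python
# Use cases available to each role, transcribed from Figure 5.2 and Table 5.3.
PERMISSIONS = {
    "admin": {
        "validate_user",
        "maintain_user_account",
        "view_files",
        "delete_files",
    },
    "manager": {
        "validate_user",
        "maintain_user_information",
        "maintain_files_information",
        "maintain_customer_information",
        "delete_staff",
    },
    "staff": {
        "validate_user",
        "maintain_user_information",
        "maintain_files_information",
        "maintain_customer_information",
    },
}


def can(role, use_case):
    """Return True if the given role may perform the given use case."""
    return use_case in PERMISSIONS.get(role, set())


print(can("manager", "delete_staff"))
print(can("staff", "delete_staff"))
```

Keeping the rules in one table means a new role or use case is added in a single place, and an unknown role is denied by default.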


5.4 Class Diagram

Figure 5.3 : Class Diagram for E-Filing web-based system

Figure 5.3 is the class diagram for the e-filing web-based system. A class diagram is a static structure diagram of the system: it shows the system's classes, their attributes, and the relationships between the classes.

Boundary classes (attributes):
file_form: file_id, file_name, file_status, file_remark, open_date, update_date, staff_no, dept_name
customer_form: file_id, cust_ic, cust_name, cust_add1, cust_add2, cust_city, cust_postcode, cust_state, cust_phone, staff_no
staff_form: staff_no, staff_ic, staff_name, staff_add1, staff_add2, staff_city, staff_postcode, staff_state, staff_hp, staff_email, dept_name, advisor_no
advisor_form: advisor_no, advisor_ic, advisor_name, advisor_hp, advisor_email
login_form: user_name, user_password

Control classes (operations):
staff_control: search_staff(), set_staff_detail(), set_staff_update(), removeStaff()
advisor_control: set_advisor_detail(), set_advisor_update()
login_control: set_user_update(), remove_user(), validate_user()
file_control: search_files(), set_file_detail(), set_file_update(), remove_file()
customer_control: search_cust(), set_cust_detail(), set_cust_update(), remove_cust()

Entity classes (attributes; operations):
advisor: <<PK>> advisor_no, advisor_ic, advisor_name, advisor_hp, advisor_email, dept_name; add_advisor(), update_advisor(), display_advisor()
login: <<PK>> user_name, user_password, user_id, user_level, user_dept; update_user(), delete_user(), display_user()
staff: <<PK>> staff_no, staff_ic, staff_name, staff_add1, staff_add2, staff_city, staff_postcode, staff_state, staff_hp, staff_email, dept_name, advisor_no; add_staff(), update_staff(), delete_staff(), display_staff()
customer: <<PK>> cust_id, file_id, cust_ic, cust_name, cust_add1, cust_add2, cust_city, cust_postcode, cust_state, cust_phone, staff_no; add_cust(), update_cust(), delete_cust(), display_cust()
file: <<PK>> file_id, file_name, file_status, file_remark, open_date, update_date, staff_no, dept_name; add_files(), update_files(), delete_files(), display_files()

Associations in the figure are labelled manage, validate and have, with multiplicities 1 and 0..* (also written 0..n).
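The `file` entity and the `file_control` class from Figure 5.3 can be sketched as follows. This is a Python illustration only, since the prototype is written in PHP. The attribute and operation names come from the class diagram, but the attribute types, the in-memory storage, the parameters of each operation and the sample values are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class File:
    """Entity class; field types are assumed (the diagram gives names only)."""
    file_id: str
    file_name: str
    file_status: str
    file_remark: str
    open_date: date
    update_date: date
    staff_no: str
    dept_name: str


class FileControl:
    """Control class that keeps files in memory instead of a database."""

    def __init__(self):
        self._files = {}  # maps file_id to File

    def set_file_detail(self, record):
        # Store a new file record under its primary key.
        self._files[record.file_id] = record

    def set_file_update(self, file_id, status, on):
        # Change the status and the update date of an existing record.
        record = self._files[file_id]
        record.file_status = status
        record.update_date = on

    def search_files(self, keyword):
        # Case-insensitive substring match on the file name.
        key = keyword.lower()
        return [f for f in self._files.values() if key in f.file_name.lower()]

    def remove_file(self, file_id):
        # Delete the record if present; unknown identifiers are ignored.
        self._files.pop(file_id, None)


# Hypothetical sample record.
control = FileControl()
control.set_file_detail(File("F001", "Parking licence renewal", "available", "",
                             date(2010, 1, 5), date(2010, 1, 5), "S01",
                             "License and Parking Unit"))
control.set_file_update("F001", "borrowed", date(2010, 2, 1))
print(control.search_files("parking")[0].file_status)
```

Separating the entity from the control class mirrors the boundary, control and entity stereotypes used in the diagram: forms collect the input, control classes apply the operations, and entities hold the data.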


5.5 Clustering as the Suitable Searching Method

5.5.1 Introduction

For this research project, it was important to select a suitable
searching method based on data mining techniques. The researcher
reviewed three main data mining techniques: classification,
association and clustering. These techniques share the general
objective of data mining but differ in their function and in their
suitability for the system.

The researcher reviewed the techniques based on their definition, concept,
functions, suitability and the examples given in several journals (see
Table 2.1 in Chapter 2, Literature Review).

Based on the comparison in Table 2.1, the researcher found that
clustering is the most suitable searching method for the e-filing
web-based system.

5.5.2 Why Clustering Search Result

This decision is supported by several journals that identify clustering
as a suitable searching method. According to Zhang, Xie and Wu
(2006), clustering groups the search results into several
collections, so that users can more easily locate the results
they actually need.


Aliakbary, Khayyamian and Abolhassani (2008) stated that clustering
search results helps the user to get an overview of the returned results
and to focus on the desired clusters. Most search result clustering
methods use the title, URL and snippets returned by a search engine as
the source of information for creating the clusters.
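How titles and snippets can serve as the information source for clustering is illustrated by the following minimal Python sketch. It represents each result as a bag of words and compares results by cosine similarity, which is one common choice and not necessarily the measure used by the cited authors. The function names `to_vector` and `cosine` and the sample results are hypothetical.

```python
import math
from collections import Counter


def to_vector(title, snippet):
    """Bag-of-words term counts built from a result's title and snippet."""
    return Counter((title + " " + snippet).lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical search results as (title, snippet) pairs.
r1 = to_vector("parking licence", "renewal of parking licence")
r2 = to_vector("parking compound", "appeal parking compound")
r3 = to_vector("building plan", "approval")

print(cosine(r1, r2) > cosine(r1, r3))
```

Results whose similarity exceeds a chosen threshold would be placed in the same cluster; here the two parking results are closer to each other than either is to the building result.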

According to Lipai (2008), clustering the results of search tools means
grouping them into object classes constructed from the characteristics
of the search results. The purpose is to simplify the user's work in
retrieving the needed information and to help the user find
better-quality results faster.

Bialynicka (2008) stated that clustering organizes search results
into groups, so that different groups correspond to different user
needs. A ranked list alone is not enough, because documents
pertaining to different topics cannot be compared. In addition, there are
relationships between the results that can be used to cluster
the search results.

5.5.3 Examples of Clustering Search Result

Jasco (2007) gives an example of the usefulness of clustering techniques
in a search result list. Figure 5.4 below shows Google's one-dimensional
result list without clustering. For the keywords “clustering search
result”, Google returns about 15,500,000 results, a list that is
large and difficult to choose from.


Figure 5.4 : Google’s One Dimensional Result List

Figure 5.5 below shows a good search result list produced with a clustering
technique. For the same “clustering search result” keywords as in
Figure 5.4 above, it returns only about 194 results, a list that is more
accurate, simpler and easier to choose from.

Figure 5.5 : Good clustering result list


Figure 5.6 below shows a clustered search result list that is
available on the World Wide Web (http://search.carrot2.org).

Figure 5.6 : Good clustering result list from http://search.carrot2.org


5.5.4 Clustering Search Result from e-filing web-based system

Figure 5.7 below shows a clustered search result list produced by
the e-filing web-based system.

Figure 5.7 : Good clustering result list from e-filing web-based system

Figure 5.8 below shows the data mining tool provided by Carrot²,
an open source framework for building search clustering
engines. The necessary code was added to the system to cluster
search results.


Figure 5.8 : Data Mining Tool by Carrot²

5.6 Summary

In this chapter, the researcher explained how the collected data was organized,
analyzed and finalized for use in the development phase of the research.
The researcher analyzed the results of interviews with two staff members at
Majlis Daerah Kerian in terms of their current problems and the functional
requirements for the e-filing web-based system. The researcher also discussed
why clustering was selected as the suitable searching method for the system,
drawing on several journals and examples that support clustering as a method
for organizing search results, and presented the clustered results produced by
the e-filing web-based system.

The next chapter discusses the conclusion and recommendations for the

research project.


CHAPTER 6

CONCLUSION AND RECOMMENDATIONS

6.1 Introduction

This chapter will conclude what has been done by the researcher from

defining the objectives until obtaining the findings through developing

the prototype of e-filing web-based system using data mining techniques.

This chapter also concludes the report for this project and provides limitations

of the software and recommendations for those who wish to pursue the

research on the development of the e-filing web-based system.

6.2 Conclusions

In this research project on developing a prototype e-filing web-based
system using data mining techniques, the researcher achieved all of the
objectives by following the defined research approach and methodology,
which draws on both theoretical findings (secondary data) and data
findings (primary data). Achieving these objectives is hoped to provide
solutions to the current problems at Majlis Daerah Kerian, Parit Buntar,
Perak.

The first objective of the research project was to identify the requirements
for e-filing at Majlis Daerah Kerian. This objective was achieved through
requirement gathering: interview sessions were conducted with staff at
Majlis Daerah Kerian to learn about the current problems and the
functional requirements for the e-filing web-based system. The deliverable
for this objective is documented in Appendix D: Software Requirement
Specification (SRS).


The second objective of the research project was to identify a searching
method based on data mining techniques. For this phase, the researcher
reviewed many resources, such as articles, journals, books and other
academic publications about e-filing and data mining, in order to gain a
deeper understanding of both. This secondary data was useful for
identifying a suitable searching method. The researcher compared three
popular data mining techniques (association, classification and
clustering) to identify the technique best suited to searching in the
e-filing web-based system. The objective was achieved when the researcher
found that clustering is the suitable searching method for the system.

After the second objective had been achieved, the research proceeded with the
third objective, designing the e-filing web-based system. This objective was
achieved through the design stage, which comprises system design and
detailed design. In system design, the development of the e-filing web-based
system highlighted the importance of interface design and of human-computer
interface characteristics through the proper choice of colors, buttons and
fonts. In addition, an overall system structure was produced to illustrate
how the system works as a whole. Detailed design addressed the design of
classes and the detailed workings of the system, and described the
attributes, operations and classes. The deliverables for the third objective
are documented in Appendix E: Software Design Document (SDD).

The fourth objective of this project was to demonstrate the e-filing web-based
system using the identified data mining technique. The fourth objective
builds on the three objectives already achieved. It follows the project
methodology, which consists of gathering and analyzing requirements and
then designing a model that meets the user requirements. Finally, the
prototype was developed by translating the design into program code using
the selected programming platform, database server, web server and data
mining technique. Thus, the last objective has been realized.


The e-filing web-based system developed for Majlis Daerah Kerian is
expected to provide staff with an interactive environment for choosing
the files that meet their requirements. It is also expected to help staff
identify the files they need more accurately and more quickly, as a result
of a searching method based on the selected data mining technique. The
system is further expected to become an information center where staff at
Majlis Daerah Kerian can gather information about the status of files.

Although all the objectives have been achieved, the e-filing web-based
system using a data mining technique is far from complete and has its own
limitations. Many improvements could still be made to
enhance this project. The limitations of and recommendations for this
project are discussed below.

6.3 Limitations

The project had encountered a number of limitations while in progress. The

limitations are as follows :

a. The interview sessions for gathering information about the current
problems and functional requirements were conducted only with the Head
of Information Technology and an Administrative Assistant of Majlis
Daerah Kerian. Interviewing only two people provided limited
information about the requirements.

b. Due to time constraints, the researcher developed only a prototype of
the e-filing web-based system, intended for demonstration purposes.


c. There are many journals on data mining techniques, but the
researcher had difficulty understanding them because the subject
was unfamiliar.

d. There are three different data mining techniques, and the researcher
had to select the one that best suits the objective. This required
studying each technique properly and finding related journals
that support the findings.

e. A large number of data mining tools are available, but not all of them
support every kind of data mining technique. The researcher therefore
had to study the tools in terms of their function and their usability with
the selected data mining technique. Furthermore, the tool used in this
research was new to the researcher, so time was needed to become
familiar with it.

f. The researcher's experience is another limiting factor. This is the
researcher's first research project. However, the researcher was able
to learn and was properly guided by the research plan and by
instructions from the supervisor and examiner.

6.4 Recommendations

Several recommendations can be considered to further enhance
the development of the e-filing web-based system:

a. Expand the project scope so that the system covers the
contents of the files as well as their status.

b. Make the system usable by other local governments, not
only Majlis Daerah Kerian.


c. Put the system online through the Internet so that it can be
accessed at any time and from anywhere. At present, access is
limited to the Local Area Network (LAN).

It is hoped that the implementation of this system will lead to further
enhancements in future projects.


REFERENCES

Abbott, D.W., Matkovsky, I.P., & Elder, J.F. (1998). An Evaluation of High-end

Data Mining Tools for Fraud Detection. IEEE Transaction on Knowledge and

Data Engineering, 2836.

Aliakbary, S., Khayyamian, M., & Abolhassani, H. (2008). Using Social Annotations
for Search Result Clustering. Retrieved February 10, 2010, from
http://www.springerlink.com/index/v770wm385n256p68.pdf

Apache. (2002). Retrieved February 14, 2010, from The Apache Software

Foundation: http://apache.org/

Bennett, S., McRobb, S., & Farmer, R. (2006). Object-Oriented Systems Analysis

and Design Using UML Third Edition. McGraw-Hill Education(UK)

Limited.

Bialynicka, I. (2008). Clustering Web Search Results. Retrieved March 2, 2010, from

http://medialab.di.unipi.it/web/Search+QA/Seminar/Clustering.ppt

Carrot² (2010). Carrot²-Open Source Search Results Clustering Engine. Retrieved

March 1, 2010, from Carrot² Website : http://project.carrot2.org/index.html

Chen, M., Han, J., & Yu, S.Y. (1996). Data Mining : An Overview from a Database

Perspective. IEEE Transaction on Knowledge and Data Engineering, 8, 6.

Collier, K., Carey, B., Sautter, D., & Marjaniemi, C. (1999). A Methodology for

Evaluating and Selecting Data Mining Software. IEEE Transaction on

Knowledge and Data Engineering, 2-4.


Defit, S., & Md Sap, M. N. (2009). Mining Association Rule from Large Databases.

Retrieved October 10, 2009, from http://fsksm.utm.edu.my

Garofalakis, M. N., Rastogi, R., Seshadri, S., & Shim, K. (1999). Data Mining and

the Web : Past, Present and Future. Retrieved July 17, 2009, from

http://www.softnet.tuc.gr/~minos/Papers/widm99.pdf

IBM Corporation. (2006). IBM Rational Rose. Retrieved March 1, 2010, from

http://ftp.software.ibm.com/software/rational/web/datasheets/rose_ds.pdf

Jain, A. K., Murty, M. N., & Flynn, P. J. (2000). Data Clustering: A Review. ACM

Computing.

Jasco, P. (2007). Clustering Search Result, Part 1: Web-wide Search Engines.

Retrieved January 5, 2010, from http://www.emeraldinsight.com/1468-4527.htm

Khodra, M. L., & Widyantoro, D. H. (2007). An Efficient and Effective Algorithm for
Hierarchical Classification of Search Results. Retrieved March 20, 2010,
from http://repository.gunadarma.ac.id:8000/711/1/C-07.pdf

Lee, H. K. (2005). Inductive Clustering : A Technique for Clustering Search Results.
Retrieved July 15, 2009, from
http://sifaka.cs.uiuc.edu/course/598cxz05s/report-hle.pdf

Levy, P. (2007). A Review of Adobe Photoshop CS3. Retrieved February 3, 2010,

from http://www.becs-wa.org/PhotoShop_CS3.pdf

Lipai, A. (2008). World Wide Web Metasearch Clustering Algorithm. Retrieved

March 13, 2010, from http://revistaie.ase.ro/content/46/Adina%20Lipai.pdf

MySQL. (2009). Retrieved December 28, 2009, from MySQL Website:
http://www.mysql.com/


Olson, T., Edwards, M., & Monty, H.A. (2003). A Guide to Model Rules for
Electronic Filing and Service. Retrieved July 15, 2009, from
http://www.ncsconline.org/WC/Publications/External_ElFileModelRulesLexisPub.pdf

Phyu, T.N. (2009). Survey of Classification Techniques in Data Mining. Retrieved

August 5, 2009, from

http://www.iaeng.org/publication/IMECS2009/IMECS2009pp727-731.pdf

Qiu, M., Davis, S., & Ikem, F. (2004). Evaluation of Clustering Techniques in Data

Mining Tools. Retrieved January 5, 2010, from

http://www.iacis.org/iis/2004_iis/PDFfiles/QiuDavisIkem.pdf

Ravichandra, R. (2003). Data Mining and Clustering Techniques. Retrieved April 1,
2010, from
https://drtc.isibang.ac.in/bitstream/handle/1849/121/K_ikr_datamining.PDF?sequence=2

San Diego State University. (2004). Dreamweaver MX 2004 Introduction. San

Diego, Berkeley. Academic Affairs.

Shyu, M. L., Chen, S. C., & Haruechaiyasak, C. (2005). A Data Mining Framework
for Building A Web-Page Recommender System. Retrieved February 12, 2010, from
http://www.hlt.nectec.or.th/Publications/Conferences/A%20Data%20Mining%20Framework%20for%20Building%20A%20Web-Page%20Recommender%20System.pdf

Tang, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining.

Boston : Pearson Education.

Visnick, L. (2003). Clustering Techniques. Retrieved July 30, 2009, from
http://www.progress.com/realtime/docs/whitepapers/clustering_techniques.pdf

Zhang, H., Xie, K., & Wu, H. (2006). An Efficient Algorithm for Clustering Search

Engine Results. Retrieved February 6, 2010, from http://www.ieee.org


APPENDICES


APPENDIX A

PROJECT PLANNING


APPENDIX B

PROGRESS SLIDE PRESENTATION


APPENDIX C

INTERVIEW QUESTION


APPENDIX D

SOFTWARE

REQUIREMENT SPECIFICATION

(SRS)


APPENDIX E

SOFTWARE DESIGN DOCUMENT

(SDD)


APPENDIX F

DESCRIPTION OF SYSTEM

INTERFACE


APPENDIX G

IN-PROGRESS ASSESSMENT


UNIVERSITI TEKNOLOGI MARA

DEVELOPMENT OF E-FILING FOR MAJLIS DAERAH KERIAN

USING DATA MINING TECHNIQUES

MOHAMED SYAHMI BIN MOHAMED ISA

BSc. (Hons)

INFORMATION SYSTEM ENGINEERING

MAY 2010


Universiti Teknologi MARA

Development of E-Filing for Majlis Daerah Kerian

using Data Mining Techniques

Mohamed Syahmi Bin Mohamed Isa

Thesis submitted in fulfillment of the requirements for Bachelor of Science (Hons)

Information System Engineering Faculty of Computer and Mathematical Sciences

MAY 2010


DECLARATION

This declaration is to certify that this thesis and all of its submitted contents are

original in its stature, excluding those in which have been acknowledged specifically

in the references. The contents of this thesis are of my own endeavor and any ideas

or quotations from the work of other people; published or otherwise are fully

acknowledged in accordance with the standard referring practices of the discipline.

Name of Candidate : MOHAMED SYAHMI BIN MOHAMED ISA

Candidate’s ID No. : 2008287242

Programme : BACHELOR OF SCIENCE (HONS)

INFORMATION SYSTEM ENGINEERING

(CS 226)

Faculty : FACULTY OF COMPUTER AND

MATHEMATICAL SCIENCES

Project Title : DEVELOPMENT OF E-FILING FOR MAJLIS

DAERAH KERIAN USING DATA MINING

TECHNIQUES

Signature of

candidate :

Date : 24th MAY 2010


APPROVAL

DEVELOPMENT OF E-FILING FOR MAJLIS DAERAH KERIAN

USING DATA MINING TECHNIQUES

By

Mohamed Syahmi Bin Mohamed Isa

2008287242

This thesis is prepared under the direction of thesis coordinators, Assoc. Prof. Wan

Nor Amalina Wan Hariri and Assoc. Prof. Rashidah Md. Rawi, Information System

Engineering Program, and it has been approved by the thesis supervisor, Puan

Norisan Abd Karim. It was submitted to the Faculty of Computer and Mathematical

Sciences and was accepted in partial fulfillment of the requirement for the degree of

Bachelor of Science.

Approved by:

__________________________

Madam Norisan Abd Karim

Thesis Supervisor

Date: 24th May 2010


DEDICATION

“For my mother, Sadiah Binti Harun,

my late father, Mohamed Isa Bin Harun, and my brothers.”


ACKNOWLEDGEMENT

Praise be to Allah SWT Most Gracious, Most Beneficent

Firstly, I would like to pay my gratitude to Allah S.W.T for giving me strength to be

able to complete this project. Without His blessing and permission, this project could

not have been completed.

I would like to give my sincere appreciation to my supervisor, Puan Norisan Abd
Karim, for her concern, advice, support and encouragement throughout the progress
of this thesis. My gratitude also goes to the coordinators of the Final Year Project
(ITS690), FSKM, UiTM Shah Alam, Assoc. Prof. Wan Nor Amalina Wan Hariri and Assoc.
Prof. Rashidah Md. Rawi, for their valuable guidance in the completion of this
project.

Special thanks to Mr. Gobibaskaran and Puan Shalina for giving the opportunity to

perform the interview session that helped me in gathering the requirements for this

project.

Last but not least, heartfelt thanks to my parents, who gave me an
appreciation of learning and taught me the value of perseverance and resolve. I
would also like to thank my friends for their support, and everyone who
directly or indirectly helped me in this project. Thank you for inspiring me in
ways that cannot be put into words. May Allah SWT bless all of you.


TABLE OF CONTENTS

TITLE PAGE

ACKNOWLEDGEMENT i

TABLE OF CONTENT ii

LIST OF TABLES vi

LIST OF FIGURES vii

ABSTRACT viii

CHAPTER 1

INTRODUCTION

1.1 Research Background 1

1.2 Problem Statement 2

1.3 Aim 3

1.4 Objective of the Research 3

1.5 Significance of Research 3

1.6 Scope of Study 4

1.7 Limitation 4

1.8 Outcomes/Deliverables 5

1.9 Layout of Dissertation 5

1.10 Summary 6

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction 7

2.2 E-Filing

2.2.1 Introduction to E-Filing 7

2.2.2 Purposes of the Rules in E-Filing 7

2.2.3 Proposed Model Rules for E-Filing 8

2.3 What is Data Mining

2.3.1 Definition of Data Mining 9

ii

86

2.3.2 Data Mining & Knowledge Discovery 9

2.3.3 Challenges of Data Mining 12

2.4 Data Mining Techniques

2.4.1 Overview of Data Mining Techniques 15

2.4.2 Classifying Data Mining Techniques 15

2.4.3 Association Rules 17

2.4.4 Classification 18

2.4.5 Clustering 20

2.5 Selecting Data Mining Techniques 22

2.6 Selecting Data Mining Tools 29

2.7 Summary 33

CHAPTER 3

RESEARCH APPROACH AND METHODOLOGY

3.1 Introduction 34

3.2 Problem Identification and Planning 35

3.3 Requirement Gathering 36

3.3.1 Primary Data 36

3.3.2 Secondary Data 37

3.4 Requirement Analysis 37

3.5 Design Model 38

3.6 Develop Prototype 38

3.7 Summary 39

CHAPTER 4

PROTOTYPE CONSTRUCTION

4.1 Introduction 40

4.2 Software Requirement 40

4.2.1 Software Tools 40

4.2.2 Software Tools Installation 41

4.3 Hardware Requirements 44

4.4 Development Phase 44

iii

87

4.4.1 Requirement Analysis Phase 45

4.4.2 Design Phase 45

4.4.3 Development Phase 46

4.5 Summary 48

CHAPTER 5

RESULT AND FINDINGS

5.1 Introduction 49

5.2 Interview Results 49

5.2.1 Current Problems 50

5.2.2 Functional Requirements 52

5.3 Use Case Diagram 54

5.4 Class Diagram 56

5.5 Clustering as the Suitable Searching Method 57

5.5.1 Introduction 57

5.5.2 Why Clustering Search Result 57

5.5.3 Examples of Clustering Search Result 58

5.5.4 Clustering Search Result from e-filing

web-based system 61

5.6 Summary 62

CHAPTER 6

CONCLUSION AND RECOMMENDATIONS

6.1 Introduction 63

6.2 Conclusions 63

6.3 Limitations 65

6.4 Recommendations 66

REFERENCES 68

iv

88

APPENDICES 71

APPENDIX A : Project Planning A

APPENDIX B : Progress Slide Presentation B

APPENDIX C : Interview Question C

APPENDIX D : Software Requirements Specification (SRS) D

APPENDIX E : Software Design Document (SDD) E

APPENDIX F : Description Of System Interface F

APPENDIX G : In-Progress Assessment G

LIST OF TABLES

Table 2.1 : Differences of Classification, Association and Clustering techniques 25

Table 2.2 : Computational Performance Criteria (Collier et al., 1999) 30

Table 2.3 : Functionality Criteria (Collier et al., 1999) 30

Table 2.4 : Usability Criteria (Collier et al., 1999) 31

Table 4.1 : Software Tools Specifications 40

Table 5.1 : The problems that have been identified from the interviews 50

Table 5.2 : The requirements and suggestions that have been identified from the interviews 52

Table 5.3 : Description of Use Case diagram 55

LIST OF FIGURES

Figure 2.1 : The Process of knowledge discovery in database 10

Figure 2.2 : Process for designing and implementing a recommender system (Shyu et al., 2005) 11

Figure 2.3 : The general architecture of Mining Association Rule model (Defit & Md Sap, 2001) 17

Figure 2.4 : Hierarchical Classification Process (Khodra & Widyantoro, 2007) 19

Figure 2.5 : Stages in clustering (Jain et al., 1999) 21

Figure 3.1 : Overview of Research Approach and Methodology 34

Figure 4.1 : Coding index.php 47

Figure 4.2 : The main page interface of e-filing 48

Figure 5.1 : A Model for Navigating Interview Processes in Requirements Elicitation 49

Figure 5.2 : Use Case Diagram for E-Filing web-based system 54

Figure 5.3 : Class Diagram for E-Filing web-based system 56

Figure 5.4 : Google’s One Dimensional Result List 59

Figure 5.5 : Good clustering result list 59

Figure 5.6 : Good clustering result list from http://search.carrot2.org 60

Figure 5.7 : Good clustering result list from e-filing web-based system 61

Figure 5.8 : Data Mining Tool by Carrot² 62

ABSTRACT

The e-filing web-based system is a development project that uses a data mining technique called clustering. Different data mining techniques are useful for different functions and conditions. Majlis Daerah Kerian acts as a local government, the unit of government closest to citizens; local government includes municipalities, local authorities, town councils and city councils. The staff of Majlis Daerah Kerian face difficulties in managing and identifying the files that meet their requirements. This is because the council holds thousands of files across eight departments, so searching for a file manually is difficult and involves many steps. This research provides a suitable searching method for the e-filing web-based system using a data mining technique. The researcher compared three data mining techniques (association, classification and clustering) to identify the one most suitable for searching files, and interviewed staff at Majlis Daerah Kerian to gather detailed requirements. The e-filing web-based system developed for Majlis Daerah Kerian will help staff identify the files they need more accurately and quickly, as a result of a searching method based on the selected data mining technique. It will also provide staff with an interactive environment for choosing the files that meet their requirements. It is expected that this e-filing web-based system will act as an information center where staff in Majlis Daerah Kerian can gather information about the status of files.
