database written project draftacademic.udayton.edu/davesalisbury/classtuf/mba664... · infor global...

DATA MINING TEAM #1 Kristen Durst Mark Gillespie Banan Mandura MBA 664: Database Management

Post on 01-Feb-2020

DATA MINING

TEAM #1

Kristen Durst

Mark Gillespie

Banan Mandura

MBA 664: Database Management

OUTLINE

INTRODUCTION

DATA MINING DEFINITION AND APPLICATIONS

DATA MINING PRODUCTS

DATA MINING PROCESS

DATA MINING TECHNIQUES

DATA MINING EXAMPLE

CONCLUSION

REFERENCES

APPENDIX: FIGURES

INTRODUCTION

The purpose of this paper is to provide a brief overview of data mining and how data

mining complements database technology. First, a definition for data mining will be provided

and some example applications will be discussed. Next, a few of the better-known data

mining companies will be presented along with the software and services they provide.

Following the review of data mining products, an approach to the data mining process will be

discussed along with an overview of a few of the more prominent data mining analysis

techniques. Finally, a data mining example will be presented that illustrates the data mining

process by means of a data collection and statistical approach to a real world problem. The

intent is to provide the reader with a better feel for the data mining process and how it may be

applied in actual applications.

DATA MINING DEFINITION AND APPLICATIONS

Data mining is an analysis process applied to large amounts of data with the intent of

identifying hidden, unknown patterns and relationships within the data thereby enabling the user

to draw conclusions and predict future outcomes. Practitioners of data mining are not as

concerned with determining what has happened based on an analysis of their data as they are

about predicting what will happen in the future. Data mining has grown in interest and

application over the last several years as advances in computer processing and digital data

storage have greatly increased the speed with which data can be accessed and processed while

simultaneously reducing the cost and infrastructure required to store the data and the results. As

will be discussed later, data mining does require a process, but in practice, the data mining

process is not uniform from user to user. However, the data mining process will generally

include the following three high level steps:

( a ) Description of the data to summarize attributes of the available data

( b ) Predictive modeling derived from a portion of the existing data

( c ) Verification of the model against the larger domain of data in the real world

Despite the wide interest in and “buzzword” status of data mining, a user who wishes to

implement data mining must recognize what data mining is not and what data mining cannot do.

Data mining is not simply the blind application of a series of algorithms to large sets of data.

The data mining analyst must still understand the data and its origins, the business in which the

data originated and is used, as well as the analytical methods that are applied to the data and the

results of that analysis. Furthermore, data mining does not indicate what you must do with the

data and the results. Only a knowledgeable user of the data will be able to assess the value of the

patterns and relationships gleaned from the data mining approach and apply them to make a

positive impact to their business.

Data mining can be implemented in any business to aid the analysis and resolution of

multiple problems; however, the use of data mining has been most widely noted in the

telecommunications, credit card, financial and retail industries among others. For instance, the

telecommunications industry has studied data to determine which customers are most likely to

turn over or “churn” on their cell phone contracts; the credit card industry is able to detect and

track fraudulent use of their services; financial companies are able to predict corporate stock

performance; and retailers are able to tailor which products to stock and offer to particular

customers. Unfortunately, the benefits of data mining do not come without a cost, and

practitioners of data mining must recognize the potential legal and ethical concerns resulting

from the widespread application of data mining tools. In particular, the ability to track and

identify individual consumer behavior through the aggregation of data from multiple sources

when the original data was in fact anonymous is of concern and has resulted in the adoption of

data control policies within many corporations.

DATA MINING PRODUCTS

A wide range of data mining software and service providers exist in the marketplace

today and they serve a wide range of customers. According to a 2008 study by the Gartner

Group, an information technology research and advisory firm, five of the largest data mining

software companies are indicated below:

ANGOSS SOFTWARE CORPORATION (www.angoss.com)

Angoss offers a suite of software tools to perform predictive diagnostics. These tools

cover all phases of the data mining process including profiling, exploration, modeling,

implementation, scoring and validation. Key software tools include KnowledgeSEEKER

for profiling and visualization, KnowledgeSTUDIO, a decision tree based tool for

predictive analytics, and StrategyBUILDER, a tool combining analysis results into

business rules.

INFOR GLOBAL SOLUTIONS (www.infor.com)

Infor Global Solutions is the world’s third largest software company and has acquired a

wide range of software applications that include Infor CRM Epiphany, an integrated

software tool that performs marketing, sales and service analytics.

PORTRAIT SOFTWARE (www.portraitsoftware.com)

Portrait Software provides a suite of marketing analysis tools to support marketing,

service and selling activities. Portrait Software offers products that perform marketing

automation as well as predictive analytics. Quadstone Analytics is one of their predictive

modeling tools and it employs various techniques including decision trees, regression,

additive scorecards, clustering and uplift modeling.

SAS INSTITUTE (www.sas.com)

SAS is a leader in the data mining community and provides tools and solutions to a broad

range of customers. SAS Enterprise Miner and SAS Analytics offer customers access to a

multitude of methods and techniques to perform statistical analysis, data visualization,

forecasting, and model management and deployment. (SAS was originally an acronym

for Statistical Analysis System.)

SPSS INC (www.spss.com)

SPSS Inc. provides a range of products in four families allowing customers to perform

Data Collection, Modeling, Statistical Analysis, and Deployment. These tools can be

integrated with Clementine, a data mining workbench that uses a wide range of data

mining techniques. (The name SPSS is derived from Statistical Package for the Social

Sciences).

DATA MINING PROCESS

A formal, uniformly accepted methodology for the process of data mining does not truly

exist. However, a 2002 survey by KDnuggets.com, a leading web-based data mining resource,

indicated that 51% of the 189 respondents follow CRISP-DM (CRoss Industry Standard

Process for Data Mining), a methodology developed and advocated by SPSS. Another 12% of

respondents reported that they apply the tools described by SAS’s SEMMA approach.

Nevertheless, the remaining 38% of those taking the survey indicated that they follow their own

methodology, the methodology devised by their employer, or nothing at all. Despite the

apparent lack of a uniform process for data mining, all approaches to data mining will likely

incorporate activities to accomplish the tasks of (1) problem definition, (2) data collection, (3)

data review, (4) data conditioning, (5) model building, (6) model evaluation, and (7)

documentation and deployment.

Because SPSS and SAS are known leaders in the data mining community, SPSS’s CRISP-DM method and SAS’s

SEMMA approach will be discussed in more detail below. Although these approaches do not

explicitly call out the seven activities just described, those seven activities are embedded within

the SPSS and SAS approaches, and they will likely be incorporated into any successful data

mining approach.

CRISP-DM (CRoss Industry Standard Process for Data Mining)

CRISP-DM was conceived in 1996 by a consortium consisting of DaimlerChrysler,

SPSS, and NCR. The intent was to develop a data mining approach that was not specific to any

particular industry, application, or analysis tool. With funding from the European Commission,

the consortium conducted a workshop and, upon finding general agreement on the need for a data

mining template, CRISP-DM was born.

CRISP-DM is a hierarchical process model that consists of a set of tasks with various

degrees of definition. The top level of the hierarchy is the Phase. Each Phase consists of generic

tasks, the second level of the hierarchy. The tasks are generic in order to maintain the neutrality

of the process, and they are intended to be complete, applicable to the entire process, as well as

stable, tolerant of new and unplanned developments. Specialized tasks form the third level, and

these are designed for the unique, particular nature of problems to be solved. Finally, records of

actions, decisions, and results form the fourth and final level of the CRISP-DM hierarchy. The

data mining context will determine the mapping from the generic levels (levels 1 and 2) to the

more specific levels (levels 3 and 4).

Moreover, CRISP-DM is described by a six phase reference model that flows in a

particular sequence but does not require the user to follow the phases in a fixed path. Users will

likely find a need to move back and forth iteratively between phases as individual phase results

come into focus. The CRISP-DM methodology accommodates that requirement. Finally,

CRISP-DM is designed to be cyclical in nature with an understanding that the data mining

activity may not end once a solution is derived. New questions and problems are likely to be

identified from the solution that may demand a continuous flow of follow-on activity. The

six-phase CRISP-DM cyclical model is briefly described below.

Phase 1 – Business Understanding: The purpose of Business Understanding is to assess

the objectives and requirements of the business and articulate these needs into a specific

problem or problems the business wishes to solve.

Phase 2 – Data Understanding: Data Understanding consists of preliminary data

collection along with the assessment of any insights into the data and any data quality

issues. Potential data segregation may occur and preliminary hypotheses may be formed.

Phase 3 – Data Preparation: In data preparation, data quality issues are resolved and the

final data set for analysis is generated. Any required data transforms are completed as is

necessary data cleansing. Multiple iterations may be required.

Phase 4 – Modeling: The methodology is neutral to any of the various data modeling

approaches. Multiple modeling choices may be reviewed and tailored to the specific

problem and available data. If the desired modeling technique requires specific data

conditions, a return to Phase 3 may be required. Multiple techniques may be applied.

Phase 5 – Evaluation: In Phase 5, the model is complete and validated for sufficient

quality. If quality in the model is lacking or it fails to meet the needs of the business, a

review and return to Phase 4 may be necessary.

Phase 6 – Deployment: In Deployment, the data and model are organized and presented

to the customer for use. Data visualization is critical as is documentation of all detailed

process steps and their results.

SEMMA (Sample, Explore, Modify, Model, Assess)

SAS proposes that SEMMA is not so much a data mining methodology as it is a set of

tools deployed within their SAS Enterprise Miner software that can be integrated into any data

mining method. SEMMA articulates that it is the user’s responsibility to define the business

problem to be solved and acquire and condition the data appropriately. The SEMMA focus is on

model development. A brief description of the five elements of SEMMA follows:

SAMPLE: The Sample activity consists of extracting a statistically representative data set

from the larger data domain. The sample must adequately represent the larger data set

but be small enough for ease of manipulation. The data may be partitioned to facilitate

model training, validation and testing.

EXPLORE: In the Explore activity, different views and plots of the data are generated

and trends or unusual data instances are discovered. Additionally, traditional statistical

analysis tools or data mining techniques may be employed to ascertain any data

subgroups.

MODIFY: Modification results from the creation, selection and transformation of data in

preparation for the modeling activity. New variables or groups may be defined and any

outliers, data points resulting from special cause variation, may be eliminated. The data

set is updated accordingly.

MODEL: Modeling allows the user to fit the data using a wide variety of modeling

techniques and predict outcomes as derived from the overarching business need.

Techniques that may be applied include neural nets, decision trees, logistic regression,

and k-nearest neighbor to name a few.

ASSESS: Finally, in assessment, the model is evaluated for usefulness in solving the

articulated business problem and validated against a held-out subset of the data. As in all data

mining approaches, the model is checked for overfitting to ensure the model is not tuned

so tightly to the model development subset that it cannot adequately predict outcomes

from other data sets.
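
The partitioning mentioned under SAMPLE, which supplies the held-out data that ASSESS relies on, might be sketched in Python as follows. This is a minimal illustration and not SAS Enterprise Miner functionality; the function name partition, the 60/20/20 proportions, and the fixed seed are assumptions made for the example.

```python
import random

def partition(records, train=0.6, validation=0.2, seed=42):
    """Shuffle records and split them into training, validation and test sets.

    The proportions and the seed are illustrative assumptions,
    not SEMMA requirements.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validating = shuffled[n_train:n_train + n_validation]
    testing = shuffled[n_train + n_validation:]  # whatever remains
    return training, validating, testing

training, validating, testing = partition(range(100))
print(len(training), len(validating), len(testing))
```

A model fitted to the training portion that performs much worse on the validation or test portion is a likely sign of overfitting.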

From the brief assessment of CRISP-DM and SEMMA above, it is clear that there is

commonality of activity in any data mining approach even if the terminology and articulation of

methods are different. As indicated in the introductory paragraph, any good data mining

approach will include the tasks of (1) problem definition, (2) data collection, (3) data review, (4)

data conditioning, (5) model building, (6) model evaluation, and (7) documentation and

deployment.

DATA MINING TECHNIQUES

STATISTICAL METHODS

The data mining techniques that most people are familiar with are statistical methods such

as sample statistics or linear regression. These are usually used for very simple problems that

have very few predictive variables. If the problem is more complex, another method

would be more appropriate.

Sample statistics involve looking at particular variables and calculating the minimum

value, maximum value, mean, median, and variance. For example, a retail store could analyze

its sales data and compute the summary statistics for each of its products over the previous

quarter. It can quickly draw conclusions about particular product lines, and if it finds

something unexpected or interesting, it can “mine” further using more complex

methods.
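
As a minimal sketch of the sample statistics just listed, the following Python fragment summarizes one product line. The weekly sales figures and the function name summarize are hypothetical and serve only to illustrate the calculation.

```python
import statistics

# Hypothetical weekly unit sales for one product over a quarter.
weekly_units = [120, 135, 128, 90, 310, 125, 131, 127, 122, 129, 133, 126, 124]

def summarize(values):
    """Return the summary statistics named in the text."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
    }

summary = summarize(weekly_units)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

In this made-up data the mean sits well above the median because of a single unusually strong week, which is the kind of unexpected result that would prompt further mining.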

Linear regression is an easy way to predict values based on a simple equation. When the

correct model involves many interacting variables, linear regression

would not be appropriate. However, for simple situations this can be a very powerful method.

For example, Company ABC has data on its customers’ income levels and their purchases. As

shown in Figure 1, as the customer’s income increases their total purchase amount increases. A

line is fit through the data to minimize the squared error between the data points and the line. The line

then becomes an equation: Total Purchase Amount = y-intercept + (Slope * Customer’s Income).

ABC can predict what the sales amount will be based on what the customer’s income level is by

putting it into the equation. Also, since ABC knows the relationship between income and sales,

they can restrict their marketing to only certain income levels.
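
The line fit described above can be sketched with ordinary least squares. The income and purchase figures below are hypothetical, as is the function name fit_line; the sketch only illustrates how the y-intercept and slope in the equation might be obtained and then used for prediction.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: customer income (thousands of dollars) and total purchases (dollars).
incomes = [30, 40, 50, 60, 70]
purchases = [200, 260, 310, 370, 420]

intercept, slope = fit_line(incomes, purchases)
predicted = intercept + slope * 55  # Total Purchase Amount for an income of 55
print(intercept, slope, predicted)
```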

NEAREST NEIGHBOR FOR PREDICTION

Nearest neighbor for prediction is a very easy data mining technique to understand. The

concept comes from the idea that you can predict the outcome or how something is going to

behave based on how other cases “near” it in terms of the predictive variables behave. An everyday example of how

this is used is in real estate. When someone buys a house, the realtor will check to see what other

houses in the area sold for because this is a good predictor of what the house for sale should be

worth. This technique works best when there are only a few predictive

variables.

A simple example of how a business might use this technique is a company with a

product it wants to start selling in a new city. Company XYZ wants to estimate how many

units will sell so they can determine if it is worth moving into the new market. XYZ has a

database of the current sales data of each city where the product is already being sold. The

predictive variables are the population of the city and the distance from the city to the nearest place where their

competitor’s product is sold. As shown in Figure 2, each city is

represented by a letter that corresponds to one of three categories of units sold: >200

units, 100 – 200 units, and <100 units. These markers are placed in the graph by the population

of the city and the distance away from their competitor. There is a U marker that represents the

new city, where the amount of product sold is unknown and must be predicted. Using the

nearest neighbor for prediction method, the U marker is nearest to more cities falling into the A

sales category than any other sales category. This means that we could predict that this city will

behave in the same way and will have sales greater than 200 units. XYZ should plan on

extending their product line to this market given the high prediction of sales.

With the nearest neighbor for prediction method, it is also possible to estimate how

confident the company can be in its prediction. If the new case’s predictive variables are extremely close

to those of its neighbors, then there is a higher level of confidence. However, if no

neighbors are close, a prediction can still be made, but with very little confidence.

This is extremely valuable because a company would not want to follow through with a major

investment with a prediction that has a low level of confidence.
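
The nearest neighbor prediction for Company XYZ might be sketched as below. The city data, the choice of three neighbors, the min-max scaling of the two variables, and the use of the vote share as a rough confidence measure are all assumptions made for illustration; Figure 2 itself is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical cities: (population in thousands, miles to competitor, sales category).
# Category A stands for more than 200 units, as in the text; B and C are the other two.
cities = [
    (500, 40, "A"), (450, 35, "A"), (480, 45, "A"),
    (300, 20, "B"), (280, 25, "B"),
    (100, 5, "C"), (120, 10, "C"),
]

def scale(value, low, high):
    """Min-max scale so that neither variable dominates the distance."""
    return (value - low) / (high - low)

def predict_category(population, distance, k=3):
    """Majority vote among the k nearest cities; vote share as a simple confidence."""
    pops = [c[0] for c in cities]
    dists = [c[1] for c in cities]
    p_lo, p_hi = min(pops), max(pops)
    d_lo, d_hi = min(dists), max(dists)
    p = scale(population, p_lo, p_hi)
    d = scale(distance, d_lo, d_hi)
    ranked = sorted(
        cities,
        key=lambda c: math.hypot(scale(c[0], p_lo, p_hi) - p,
                                 scale(c[1], d_lo, d_hi) - d),
    )
    votes = Counter(c[2] for c in ranked[:k])
    category, count = votes.most_common(1)[0]
    return category, count / k

print(predict_category(470, 38))  # the "U" city
```

The vote share is only one simple proxy for confidence; the distances to the neighbors themselves could also be reported, since a unanimous vote among distant neighbors still deserves little trust.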

NEURAL NETWORK

A neural network is a data mining technique that is much more complex. Some benefits

of this technique are that it can use extremely large numbers of predictive variables; once the

network has been created and confirmed as successful, it can be used again and

again; and it can be used in many different types of situations. The disadvantages of this

technique are that the outcomes are not very easy to interpret and it can be very time consuming

to get the data into the right format for running the model.

A neural network is a complex computer model that takes input variables and then

outputs a solution. Neural networks include an input layer, hidden layer, and output layer. The

input layer consists of the predictive variables that go into the model. The hidden layer is created

by the computer model and is not seen by the user. The output layer is the end prediction that

has been calculated by the model. All of the variables that go into the neural network have to be

converted into numeric variables with values between 0 and 1.

Company XYZ could use a simple neural network to predict the same scenario that was

used for the nearest neighbor for prediction method. The database that has the population of the

city, the distance away from where their competitor’s product is sold in relation to the city, and

the product sales would be used to create the model. The computer would use the population of the

city and the distance away from the competition for the input layer and then go through a training

phase. During the training phase the computer will assign various weights to each of the variables

and then output a number that represents the predicted product sales. This number will be

between 0 and 1 and needs to be interpreted in terms of the actual ranges that are

provided in the database. For example, an output of less than 0.333 means the product sales are

less than 100 units, an output between 0.333 and 0.666 means the product sales are between 100

and 200 units, and an output greater than 0.666 means the product sales are greater than 200

units. The computer will keep comparing its output with the actual data and adjusting the weights as needed to

create the best model for computing the predicted product sales. As shown in Figure 3, once the

model has finished training on the data, the new city can be entered into the model and the predicted

product sales can be computed. This model has an output of 0.736, so the product sales are

predicted to be greater than 200 units.
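
To make the three layers and the interpretation of the output concrete, the sketch below runs a single forward pass through a very small network and maps the result onto the sales ranges given above. The weights are hypothetical and untrained, the two hidden nodes and the sigmoid activation are assumptions, and the weight-adjustment step is omitted; this is not the model behind Figure 3.

```python
import math

def sigmoid(z):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input layer -> one hidden layer -> single output between 0 and 1."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

def sales_category(output):
    """Map the network output onto the sales ranges described in the text."""
    if output < 0.333:
        return "<100 units"
    if output <= 0.666:
        return "100-200 units"
    return ">200 units"

# Inputs already scaled to 0-1: population and distance from the competitor.
new_city = [0.8, 0.7]
output = forward(new_city, [[1.5, 0.5], [-0.5, 2.0]], [0.0, -0.5], [1.2, 0.9], -0.6)
print(round(output, 3), sales_category(output))
print(sales_category(0.736))  # the output quoted in the text
```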

CLUSTERING/SEGMENTING

Clustering or segmenting is a data mining technique in which nothing

specific is predicted. This technique forms groups whose members are similar to one another and very

different from members of other groups. This can help to give a good overall view of the data and what is going on in the

business. For example, if Company MNO has a database of demographic information on their

customers and their buying habits, then segmenting can be used to find buying patterns based on

that demographic data. As shown in Figure 4, male consumers under the age of forty are

behaving in the same way, which is drastically different from female consumers over the age of

forty. For this particular example you can group by gender, by age, and

also by both gender and age. These groupings can then be used for different marketing

campaigns. The marketing techniques should be different for each group. Not all the variables

in the database will be used for clustering or segmentation and some will need to be removed by

the user if they do not make any meaningful sense.
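
One common way to form such groups is k-means clustering, sketched below on hypothetical customer data. The source does not name a particular clustering algorithm, so k-means, the two variables used (age and monthly spend), and the starting centroids are all assumptions for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid unchanged
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (age, monthly spend in dollars).
customers = [(22, 40), (25, 45), (31, 50), (47, 120), (52, 130), (58, 125)]

centroids, clusters = kmeans(customers, [customers[0], customers[-1]])
print(clusters)
```

In practice the variables would normally be scaled first and several starting points tried, since k-means is sensitive to both.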

Clustering can also be used to identify potential problems by finding outliers. For

example, through clustering company MNO determined they have a much higher sales volume for

snowboards in their stores that are within fifty miles of a ski resort. However, they found

one store where there is a low volume of sales even though they are only twenty-four miles away

from a ski resort. With some more research, company MNO realized that their sales were down

in that store because that area had become saturated with so many competitors. With this new

information they decided to pull their store out of this area because they cannot compete with the

larger stores.

DECISION TREE

A decision tree is a predictive model that groups classification variables together into a

“tree”. Each branch represents a classification group that has been divided. The decision tree

splits the data into groups by examining all of the data and first picking the variable that produces the

greatest separation between categories. Then each group can continue to be split at each level

until there are no more logical splits to be made. Decision trees are designed to handle

categorical data, but numeric data can be made into categories to use in the tree.

The advantages of decision trees are that they can be very easy to interpret, there is not much

involved in getting the data ready to process, and they can be used for a variety of situations.

One disadvantage of decision trees is that, for simpler problems, this method can be more time

consuming than linear regression.

Recently a decision tree was used at a consumer products company to help determine

why a consumer study the company had conducted did not produce the results that were predicted. The

conclusions and numbers are the same, but the product and categories have been changed to

protect confidentiality.

A study was designed to test how well a consumer likes the change in the design of a

chair cushion. The consumer ranked how comfortable they thought the chair was on a scale of

zero to five with zero being extremely uncomfortable and five being extremely comfortable and

a 0.5 step increment in between. Then a second chair was presented to the consumer and they

scored this chair as well. It was predicted that the second chair would score higher and the

average increase in score from chair one to chair two was calculated across all the consumers.

As shown in Figure 5, the change was minimal and a decision tree was constructed to

provide insight into the reason why. The variable producing the largest split in the average change in

score was the baseline score, that is, the score for chair one. If the score for chair one was

below 3.75 then the average change in score was 0.57. If the score for chair one was greater than

3.75 then the average change in score was -0.46. The company continued to split the tree into what was

influencing the scores further down, but the true value of the tree was in the first split. The

consumers who started with a high score did not have much room for improvement and actually

averaged a decrease. The consumers that started out with a low score did improve their score, as

predicted. From this information it was determined that the study was designed incorrectly and

they should have only recruited consumers that were using more of the scale in their evaluation

of the first chair.
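
How a tree might locate a first split such as the 3.75 threshold can be sketched as a search for the cut point that leaves the two groups most internally consistent. The study data are confidential, so the rows below are hypothetical, and the variance-reduction criterion and the function names are assumptions rather than the method used by the company's software.

```python
def sum_of_squares(values):
    """Total squared deviation from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows):
    """rows: (baseline score, change in score) pairs.

    Returns (threshold, mean change below, mean change at or above).
    """
    baselines = sorted({b for b, _ in rows})
    best = None
    for low, high in zip(baselines, baselines[1:]):
        threshold = (low + high) / 2  # candidate cut between adjacent scores
        below = [c for b, c in rows if b < threshold]
        above = [c for b, c in rows if b >= threshold]
        score = sum_of_squares(below) + sum_of_squares(above)
        if best is None or score < best[0]:
            best = (score, threshold,
                    sum(below) / len(below), sum(above) / len(above))
    return best[1:]

# Hypothetical consumers: score for chair one, and change in score for chair two.
rows = [(2.0, 1.0), (2.5, 0.5), (3.0, 0.5), (3.5, 0.5),
        (4.0, -0.5), (4.5, -0.5), (5.0, -0.5)]

threshold, mean_below, mean_above = best_split(rows)
print(threshold, mean_below, mean_above)
```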

DATA MINING EXAMPLE

A brief data mining example follows. Although this example did not specifically follow

one of the more widely accepted data mining methodologies or use any of the sophisticated data

mining modeling techniques, it does illustrate the use of the principles of data mining in that

large amounts of data were extracted from multiple databases and a meaningful model was

generated to describe a business process and ultimately change behavior.

In this example, the business problem in question was assessing the status of and

improving the on time delivery of New Product Introduction (NPI) hardware. NPI hardware is

defined as fabricated assemblies that are all components of a larger machine assembly. The

delivery problem stems from the fact that the fabrication of the very first set of NPI hardware

required for the assembly of the very first machine is often late relative to the required due date.

For the study at hand, the actual delivery of the first set of hardware was over 25 days late to the

customer orders with an inter-quartile range of 35 days.

To initiate analysis, a fishbone diagram was derived to assess potential causes of late

delivery and a data collection plan was generated. In the data collection plan, data was identified

for extraction from two different databases, the engineering product definition database and the

manufacturing database. The data extracted from these two databases was merged into a single

data set for analysis.
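
The merge of the two extracts might look like the following sketch, which joins records on a shared part number. The part numbers, field names, and values are hypothetical; the actual database schemas are not described in the paper.

```python
# Hypothetical extracts keyed by part number; all field names are assumptions.
engineering = {
    "P-100": {"description": "bracket", "design_release_day": 12},
    "P-200": {"description": "housing", "design_release_day": 30},
    "P-300": {"description": "bracket", "design_release_day": 41},
}
manufacturing = {
    "P-100": {"work_order_day": 20, "delivered_day": 75},
    "P-200": {"work_order_day": 38, "delivered_day": 96},
}

def merge_extracts(eng, mfg):
    """Inner join: keep only the parts that appear in both extracts."""
    return {part: {**eng[part], **mfg[part]}
            for part in sorted(eng.keys() & mfg.keys())}

merged = merge_extracts(engineering, manufacturing)
for part, record in merged.items():
    print(part, record)
```

Parts present in only one extract (P-300 here) are dropped by the inner join and would need to be reviewed separately as a data quality issue.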

An initial review of the data following evaluation of the process capability for on time

delivery indicated two distinct subgroups of data: assemblies described as “brackets” and

assemblies described as “not brackets.” From this segregation of data, analysis proceeded on the

distinct subgroups and new data was generated to describe design and manufacturing sub-

processes based on the extraction of time based event data from the databases. Preliminary plots

of the data were generated and initial models were attempted using linear regression and general

linear model approaches. Unfortunately, no single regression was able to adequately describe

the data.

Finally, an attempt was made to categorize some of the sub-process times by

percentiles and plot the data against on time delivery as a main effects plot. This plot revealed

that manufacturing activity was related to on time delivery but design activity was not. In fact,

the main effects plot indicated that parts designed closer to the due date actually had better on

time delivery than parts designed further from the due date. Thus it was clear that no single

regression of design and manufacturing data could adequately describe the process.

From this new understanding of the data, an approach was made to fit only the

manufacturing data to on time delivery and an acceptable regression was found. Similarly, an

attempt was made to fit the design data to one of the manufacturing variables, the creation of the

part identity in the manufacturing database, and again a suitable regression was found. These

two regressions were combined through a common term to form a single regression equation.

This equation, while not fitting the on time delivery data very well, as evidenced by a poor R-squared

value (coefficient of determination), did produce an accurate representation of the on time delivery

distribution. Thus while the combined regression model could not predict how late a particular

part would be, it could, based on the design release distribution, predict what percentage of parts

would be late and when all the parts would be available for assembly to the first machine.

Moreover, this model showed that the greatest contributor to the variation in on time delivery

was the variation in design lead time. By using the derived regression equation along with the

known distributions for the input parameters and the desired on time delivery distribution, a

Monte Carlo simulation was employed to calculate the required cumulative distribution for

design releases to ensure on time delivery. This cumulative distribution was compared to

previous design release plans and the design release plans were shown to be faulty.
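
The Monte Carlo step described above can be sketched as follows. This is a minimal illustration in Python under loudly stated assumptions: the regression coefficients, the normal input distributions, and all function names are hypothetical and are not the values or tools from the study. The sketch draws inputs at random, pushes them through a linear lateness model, and searches for the shift in design release that brings the simulated fraction of late parts down to a target.

```python
import random

# Hypothetical model: lateness = B0 + B1 * design_offset + B2 * mfg_time (days).
# These coefficients are invented for illustration only.
B0, B1, B2 = 5.0, 0.3, 0.8


def simulate_lateness(n, design_shift=0.0, seed=0):
    """Simulate (ship date - due date) in days for n parts.

    design_shift moves every design release earlier by that many days.
    The input distributions below are assumptions, not fitted values.
    """
    rng = random.Random(seed)
    lateness = []
    for _ in range(n):
        # Design release relative to the due date (negative means before it).
        design_offset = rng.gauss(-60.0, 25.0) - design_shift
        mfg_time = rng.gauss(20.0, 8.0)
        lateness.append(B0 + B1 * design_offset + B2 * mfg_time)
    return lateness


def fraction_late(lateness):
    """Share of simulated parts that ship after the due date."""
    return sum(1 for days in lateness if days > 0) / len(lateness)


def required_shift(target, n=5000, step=5.0, max_shift=400.0):
    """Smallest tested shift (days) whose simulated late fraction meets target."""
    shift = 0.0
    while shift <= max_shift:
        if fraction_late(simulate_lateness(n, shift)) <= target:
            return shift
        shift += step
    return None  # target not reachable within max_shift
```

Because every trial reuses the same random seed, the simulated late fraction can only fall as the shift grows, which keeps the search well behaved. As in the study, such a model speaks to the distribution of lateness across parts rather than to the lateness of any particular part.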

As a result of this study, design plans were updated to reflect these findings and the

required design release schedules. A data visualization tool was developed to extract data from

the design database and plot the design release requirements against the design release plan and

the actual design releases. This tool allowed for an assessment of the ability to meet on time

delivery before all designs were released and the taking of corrective action in advance of the

hardware due date. Furthermore, a data tool was developed to extract and format manufacturing

data (Bill of Material, manufacturing status, quantity on hand, and manufacturing work order

information) for coordination of design and manufacturing processes along with the rapid

assessment of manufacturing status for all assemblies required to build a particular machine.
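
The comparison performed by the design release tool can be sketched as follows. This is a minimal illustration in Python, assuming each part has a required, a planned, and (once released) an actual release date; the function and field names are hypothetical and do not describe the actual tool or its database.

```python
from datetime import date


def cumulative_count(release_dates, as_of):
    """Number of parts released on or before as_of (None = not released)."""
    return sum(1 for d in release_dates if d is not None and d <= as_of)


def release_status(required, planned, actual, as_of):
    """Compare cumulative required, planned, and actual releases on a date."""
    req = cumulative_count(required, as_of)
    plan = cumulative_count(planned, as_of)
    act = cumulative_count(actual, as_of)
    return {
        "required": req,
        "planned": plan,
        "actual": act,
        # A positive shortfall signals a need for corrective action.
        "plan_shortfall": max(0, req - plan),
        "actual_shortfall": max(0, req - act),
    }
```

Evaluating the status over a series of dates yields the three cumulative curves (required, planned, and actual) that such a tool would plot, so a shortfall becomes visible before the hardware due date.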

Unfortunately, this example is not a direct implementation of one of the data mining

methodologies described in the paper. Nor does this example illustrate the use of any of the elaborate

data modeling techniques that might be used on larger, more complex data sets. However, this

example does illustrate the basic principles of the data mining process (defining a business need,

data collection, data review, data conditioning, model generation, model evaluation, and

documentation and deployment) and the use of data and data modeling to change behavior and

solve a business need.

CONCLUSION

This paper has provided an overview of data mining and has included a data mining

definition, a review of where data mining may be applied, a summary of data mining products,

an assessment of the data mining process, and a synopsis of some of the major data mining

techniques. The paper concludes with an example project that attempts to illustrate the data

mining process and how it may be used to solve a real world problem.

REFERENCES

Angoss Software. Angoss Software Corporation, Inc. Last visited 7 APR 09. <http://www.angoss.com/analytics_software/>

Chapman, Pete et al., CRISP-DM 1.0. 2000. Retrieved from <http://www.spss.com/media/collateral/CRISP-DM_1.0__Step-by-Step_Data_Mining_Guide.pdf> 5 APR 09.

Infor Solutions. Infor Global Solutions. Last visited 7 APR 09. <http://www.infor.com/solutions/>

Poll: What Main Methodology Are You Using for Data Mining. JUL 2002. KDnuggets. Last visited 5 APR 09. <http://www.kduggets.com/polls/2002/methedology.htm>

Portrait Software Solutions. Portrait Software plc. Last visited 7 APR 09. <http://www.portraitsoftware.com/Products>

SAS Enterprise Miner. SAS Institute, Inc. Last visited 5 APR 09. <http://www.sas.com/offices/europe/uk/technologies/analytics/datamining/miner/semma.html>

SAS Products and Solutions. SAS Institute, Inc. Last visited 7 APR 09. <http://www.sas.com/software/>

SPSS Software. SPSS, Inc. Last visited 7 APR 09. <http://www.spss.com/software/?source=homepage&hpzone=nav_bar>

Two Crows Corporation. 2005. Introduction to Data Mining and Knowledge Discovery, Third Edition. (ISBN: 1-852095-02-5). Retrieved from <http://www.twocrows.com/booklet.htm> 25 FEB 09.

Figure 1: KDnuggets.com, 2002 Survey; Data Mining Process

Figure 2: CRISP-DM Breakdown

Figure 3: CRISP-DM Phases and Flow

Figure 4: Linear Regression Technique

Figure 5: Nearest Neighbor Technique

Figure 6: Neural Net Technique

Figure 7: Clustering / Segmenting Technique

Figure 8: Decision Tree Technique

[Plot: On Time Delivery. Vertical axis: Probability (0 to 0.4). Horizontal axis: -50 to 96. Series: Delivery Actual - fit; Delivery Required.]

Figure 9: Example – On Time Delivery Problem Statement

[Panels: Brainstorm Variation Sources; Data Collection Plan]

Figure 10: Example – On Time Delivery Data Collection

TOTAL LEAD TIME by Part Type: p < .05

Level      N    Mean   StDev  ----+---------+---------+---------+--
BRACKET    520  x6.76  x3.14  (--*-)
DUCT       138  x6.70  x0.40  (----*---)
MANIFOLD    44  x9.95  x4.68  (-------*-------)
TUBE        47  x3.60  x2.79  (------*-------)

Pooled StDev = 68.47

Figure 11: Example – Data Segmentation

[Plot panel variables: SHIP_DUE, IR CREATE, BOM CREATE, BOMC_MODC, BOMC_MODP, BOMC_MODI, MODC_DUE, MODI_DUE, BOMC_DUE, MODI_MODC.]

[Plot: Main Effects Plot - Data Means for SHIP-DUE. Vertical axis: SHIP-DUE (0 to 60). Factors: CAT MO_FINIS, CAT MO_START, CAT SCHED_ST, CAT MAN-DUE, CAT BOM_CR-D, CAT MOD_ISSU, CAT MODEL_CR. Factor levels: 0-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100.]

Figure 12.1: Example – Model Building

[Diagram: process timeline relative to DUE DATE (0), with - Time and + Time directions. Events shown: Model PRE, SHIP DATE, BOM Create, Components Available, MAN Release, MO Finish, Scheduled MO Start, MO Start, Model / DWG Issue, IR Create, Model Create. Annotations: X – make smaller; X – make more negative; Y – make smaller. Percentages shown: 52.8%, 28.3%, 8.4%, 7.1%, 3.5%.]

SHIP-DUE = 7.97 + 0.269*(MODEL_CR-DUE) + 0.173*(CR-ISS) + 0.704*(MAN_BOMC) + 0.748*(SCH_ST-MAN) + 0.862*(MOS_MOFIN) [R^2A 4.4%] – {R^2A(1) 76.5%, R^2A(2) 68.0%}

Figure 12.2: Example – Model Building

[Plot: Overlay Chart. Vertical axis: Probability (0 to 1.2). Horizontal axis: -49.25 to 85.75. Series: SHIP DUE MODEL (Predicted Delivery, Regression); SHIP DUE ACTUAL (Actual Delivery).]

Figure 13: Example – Model Evaluation

[Plot: Overlay Chart. Vertical axis: Probability (0 to 1.2). Horizontal axis: -298.00 to 82.00. Series: MODI ACT; modi calc new. Labels: Issue Required for On-Time Delivery; Issue Actual.]

Figure 14: Example – Model Evaluation

[Plot: BRACKETS SUMMARY, All Due Dates. Vertical axis: Number of Parts (0 to 100). Horizontal axis: Date (08/06/05 to 06/24/06). Series: CUM Req Issue (Requirements); CUM Plan Issue (Plan); CUM Actual Issue (Actual). Warnings: # Issued No PRE - 6; # Issued Post Due - 0; # Multiple Issued Files - 12; # Complex Not Planned Early - 0; # Complex Not Issued Early - 0.]

[Plot: BRACKET PLANNING. Vertical axis: Cumulative Percent (0.5 to 1.1). Horizontal axis: Days (-200 to 50). Series: OLD PLAN; NEW PLAN; REQUIRED.]

Figure 15: Example – Deployment