DATA MINING
TEAM #1
Kristen Durst
Mark Gillespie
Banan Mandura
MBA 664: Database Management
Team #1 MBA 664 ii
OUTLINE
INTRODUCTION
DATA MINING DEFINITION AND APPLICATIONS
DATA MINING PRODUCTS
DATA MINING PROCESS
DATA MINING TECHNIQUES
DATA MINING EXAMPLE
CONCLUSION
REFERENCES
APPENDIX: FIGURES
INTRODUCTION
The purpose of this paper is to provide a brief overview of data mining and how data
mining complements database technology. First, a definition for data mining will be provided
and some example applications will be discussed. Next, a few of the better-known data
mining companies will be presented along with the software and services they provide.
Following the review of data mining products, an approach to the data mining process will be
discussed along with an overview of a few of the more prominent data mining analysis
techniques. Finally, a data mining example will be presented that illustrates the data mining
process by means of a data collection and statistical approach to a real-world problem. The
intent is to give the reader a better feel for the data mining process and how it may be
applied in practice.
DATA MINING DEFINITION AND APPLICATIONS
Data mining is an analysis process applied to large amounts of data with the intent of
identifying hidden, unknown patterns and relationships within the data thereby enabling the user
to draw conclusions and predict future outcomes. Practitioners of data mining are not as
concerned with determining what has happened based on an analysis of their data as they are
with predicting what will happen in the future. Data mining has grown in interest and
application over the last several years as advances in computer processing and digital data
storage have greatly increased the speed with which data can be accessed and processed while
simultaneously reducing the cost and infrastructure required to store the data and the results. As
will be discussed later, data mining does require a process, but in practice, the data mining
process is not uniform from user to user. However, the data mining process will generally
include the following three high level steps:
(a) Description of the data to summarize attributes of the available data
(b) Predictive modeling derived from a portion of the existing data
(c) Verification of the model against the larger domain of data in the real world
Despite the wide interest in and “buzzword” status of data mining, a user who wishes to
implement data mining must recognize what data mining is not and what data mining cannot do.
Data mining is not simply the blind application of a series of algorithms to large sets of data.
The data mining analyst must still understand the data and its origins, the business in which the
data originated and is used, as well as the analytical methods that are applied to the data and the
results of that analysis. Furthermore, data mining does not indicate what you must do with the
data and the results. Only a knowledgeable user of the data will be able to assess the value of the
patterns and relationships gleaned from the data mining approach and apply them to make a
positive impact on their business.
Data mining can be implemented in any business to aid the analysis and resolution of
multiple problems; however, the use of data mining has been most widely noted in the
telecommunications, credit card, financial, and retail industries, among others. For instance, the
telecommunications industry has studied data to determine which customers are most likely to
turn over or “churn” on their cell phone contracts; the credit card industry is able to detect and
track fraudulent use of their services; financial companies are able to predict corporate stock
performance; and retailers are able to tailor which products to stock and offer to particular
customers. Unfortunately, the benefits of data mining do not come without a cost, and
practitioners of data mining must recognize the potential legal and ethical concerns resulting
from the widespread application of data mining tools. In particular, the ability to track and
identify individual consumer behavior through the aggregation of data from multiple sources
when the original data was in fact anonymous is of concern and has resulted in the adoption of
data control policies within many corporations.
DATA MINING PRODUCTS
A wide range of data mining software and service providers exists in the marketplace
today, serving customers of all sizes. A 2008 study by the Gartner Group, an information
technology research and advisory firm, identified five of the largest data mining software
companies, which are described below:
ANGOSS SOFTWARE CORPORATION (www.angoss.com)
Angoss offers a suite of software tools to perform predictive analytics. These tools
cover all phases of the data mining process, including profiling, exploration, modeling,
implementation, scoring, and validation. Key software tools include KnowledgeSEEKER
for profiling and visualization; KnowledgeSTUDIO, a decision-tree-based tool for
predictive analytics; and StrategyBUILDER, a tool that combines analysis results into
business rules.
INFOR GLOBAL SOLUTIONS (www.infor.com)
Infor Global Solutions is the world's third largest enterprise software company and has
acquired a wide range of software applications, including Infor CRM Epiphany, an
integrated software tool that performs marketing, sales, and service analytics.
PORTRAIT SOFTWARE (www.portraitsoftware.com)
Portrait Software provides a suite of marketing analysis tools to support marketing,
service and selling activities. Portrait Software offers products that perform marketing
automation as well as predictive analytics. Quadstone Analytics is one of their predictive
modeling tools and it employs various techniques including decision trees, regression,
additive scorecards, clustering and uplift modeling.
SAS INSTITUTE (www.sas.com)
SAS is a leader in the data mining community and provides tools and solutions to a broad
range of customers. SAS Enterprise Miner and SAS Analytics offer customers access to a
multitude of methods and techniques to perform statistical analysis, data visualization,
forecasting, and model management and deployment. (SAS was originally an acronym
for Statistical Analysis System.)
SPSS INC. (www.spss.com)
SPSS Inc. provides a range of products in four families, allowing customers to perform
Data Collection, Modeling, Statistical Analysis, and Deployment. These tools can be
integrated with Clementine, a data mining workbench that employs a wide range of data
mining techniques. (The name SPSS is derived from Statistical Package for the Social
Sciences.)
DATA MINING PROCESS
A formal, uniformly accepted methodology for the process of data mining does not truly
exist. However, a 2002 survey by KDnuggets.com, a leading web-based data mining resource,
indicated that 51% of the 189 respondents follow CRISP-DM (CRoss Industry Standard
Process for Data Mining), a methodology developed and advocated by SPSS. Another 12% of
respondents reported that they apply the tools described by SAS's SEMMA approach.
The remaining respondents indicated that they follow their own
methodology, the methodology devised by their employer, or no methodology at all. Despite the
apparent lack of a uniform process for data mining, all approaches to data mining will likely
incorporate activities to accomplish the tasks of (1) problem definition, (2) data collection, (3)
data review, (4) data conditioning, (5) model building, (6) model evaluation, and (7)
documentation and deployment.
Because SPSS and SAS are recognized leaders in the data mining community, SPSS's CRISP-DM
method and SAS's SEMMA approach will be discussed in more detail below. Although these approaches do not
explicitly call out the seven activities just described, those seven activities are embedded within
the SPSS and SAS approaches, and they will likely be incorporated into any successful data
mining approach.
CRISP-DM (CRoss Industry Standard Process for Data Mining)
CRISP-DM was conceived in 1996 by a consortium consisting of DaimlerChrysler,
SPSS, and NCR. The intent was to develop a data mining approach that was not specific to any
particular industry, application, or analysis tool. With funding from the European Commission,
the consortium conducted a workshop and, upon finding general agreement on the need for a data
mining template, CRISP-DM was born.
CRISP-DM is a hierarchical process model that consists of a set of tasks with various
degrees of definition. The top level of the hierarchy is the Phase. Each Phase consists of generic
tasks, the second level of the hierarchy. The tasks are generic in order to maintain the neutrality
of the process, and they are intended to be complete, applicable to the entire process, as well as
stable, tolerant of new and unplanned developments. Specialized tasks form the third level, and
these are designed for the unique, particular nature of problems to be solved. Finally, records of
actions, decisions, and results form the fourth and final level of the CRISP-DM hierarchy. The
data mining context will determine the mapping from the generic levels (levels 1 and 2) to the
more specific levels (levels 3 and 4).
Moreover, CRISP-DM is described by a six-phase reference model that flows in a
particular sequence but does not require the user to follow the phases in a fixed path. Users will
likely find a need to move back and forth iteratively between phases as individual phase results
come into focus; the CRISP-DM methodology accommodates that requirement. Finally,
CRISP-DM is designed to be cyclical in nature, with an understanding that the data mining
activity may not end once a solution is derived. New questions and problems are likely to be
identified from the solution and may demand a continuous flow of follow-on activity. The six-phase
CRISP-DM cyclical model is briefly described below.
Phase 1 – Business Understanding: The purpose of Business Understanding is to assess
the objectives and requirements of the business and articulate these needs into a specific
problem or problems the business wishes to solve.
Phase 2 – Data Understanding: Data Understanding consists of preliminary data
collection along with the assessment of any insights into the data and any data quality
issues. Potential data segregation may occur and preliminary hypotheses may be formed.
Phase 3 – Data Preparation: In data preparation, data quality issues are resolved and the
final data set for analysis is generated. Any required data transforms are completed as is
necessary data cleansing. Multiple iterations may be required.
Phase 4 – Modeling: The methodology is neutral to any of the various data modeling
approaches. Multiple modeling choices may be reviewed and tailored to the specific
problem and available data. If the desired modeling technique requires specific data
conditions, a return to Phase 3 may be required. Multiple techniques may be applied.
Phase 5 – Evaluation: In Phase 5, the model is complete and validated for sufficient
quality. If quality in the model is lacking or it fails to meet the needs of the business, a
review and return to Phase 4 may be necessary.
Phase 6 – Deployment: In Deployment, the data and model are organized and presented
to the customer for use. Data visualization is critical, as is documentation of all detailed
process steps and their results.
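The iterative, non-fixed flow of the phases can be sketched in a few lines of Python. This is purely illustrative (it is not from the CRISP-DM guide); the only backtrack rules modeled are the two returns named in the phase descriptions above (Modeling back to Data Preparation, Evaluation back to Modeling).

```python
# Purely illustrative sketch: the six CRISP-DM phases as an iterative walk
# in which a phase may send the analyst back to an earlier phase.

PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# The two backtracks named in the phase descriptions above.
BACKTRACKS = {"Modeling": "Data Preparation", "Evaluation": "Modeling"}

def run_crisp_dm(needs_rework):
    """Walk the phases; needs_rework(phase) -> True triggers one backtrack."""
    visited, i, reworked = [], 0, set()
    while i < len(PHASES):
        phase = PHASES[i]
        visited.append(phase)
        if phase in BACKTRACKS and phase not in reworked and needs_rework(phase):
            reworked.add(phase)
            i = PHASES.index(BACKTRACKS[phase])
        else:
            i += 1
    return visited

# Example: the model fails its quality review once in Evaluation, so the
# walk revisits Modeling before reaching Deployment.
path = run_crisp_dm(lambda p: p == "Evaluation")
```

The sketch captures only the control flow of the methodology; the real content of each phase is, of course, the analytical work described above.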
SEMMA (Sample, Explore, Modify, Model, Assess)
SAS proposes that SEMMA is not so much a data mining methodology as it is a set of
tools deployed within their SAS Enterprise Miner software that can be integrated into any data
mining method. SEMMA articulates that it is the user’s responsibility to define the business
problem to be solved and acquire and condition the data appropriately. The SEMMA focus is on
model development. A brief description of the five elements of SEMMA follows:
SAMPLE: The Sample activity consists of extracting a statistically significant data set
from the larger data domain. The data set must adequately represent the larger data set
but be small enough for ease of manipulation. The data may be partitioned to facilitate
model training, validation, and testing.
EXPLORE: In the Explore activity, different views and plots of the data are generated
and trends or unusual data instances are discovered. Additionally, traditional statistical
analysis tools or data mining techniques may be employed to ascertain any data
subgroups.
MODIFY: Modification results from the creation, selection and transformation of data in
preparation for the modeling activity. New variables or groups may be defined and any
outliers, data points resulting from special cause variation, may be eliminated. The data
set is updated accordingly.
MODEL: Modeling allows the user to fit the data using a wide variety of modeling
techniques and predict outcomes as derived from the overarching business need.
Techniques that may be applied include neural nets, decision trees, logistic regression,
and k-nearest neighbor to name a few.
ASSESS: Finally, in assessment, the model is evaluated for usefulness in solving the
articulated business problem and validated against the subset of data. As in all data
mining approaches, the model is checked for overfitting to ensure it is not tuned
so tightly to the model development subset that it cannot adequately predict outcomes
from other data sets.
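The Sample step described above can be sketched as follows. This is an illustrative sketch, not SAS code; the 60/20/20 split and the data set are assumptions made for the example.

```python
import random

# Illustrative sketch of the SEMMA "Sample" step: draw a manageable random
# sample from a large data domain, then partition it into training,
# validation, and test subsets for the later Model and Assess steps.

def sample_and_partition(domain, sample_size, seed=42, splits=(0.6, 0.2, 0.2)):
    rng = random.Random(seed)
    sample = rng.sample(domain, sample_size)   # representative subset
    n_train = int(splits[0] * sample_size)
    n_valid = int(splits[1] * sample_size)
    train = sample[:n_train]
    valid = sample[n_train:n_train + n_valid]
    test = sample[n_train + n_valid:]
    return train, valid, test

# Example: 100,000 records sampled down to 1,000, split 60/20/20.
domain = list(range(100_000))
train, valid, test = sample_and_partition(domain, 1_000)
```

Holding the validation and test subsets out of model training is what makes the later overfitting check meaningful.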
From the brief assessment of CRISP-DM and SEMMA above, it is clear that there is
commonality of activity in any data mining approach even if the terminology and articulation of
methods are different. As indicated in the introductory paragraph, any good data mining
approach will include the tasks of (1) problem definition, (2) data collection, (3) data review, (4)
data conditioning, (5) model building, (6) model evaluation, and (7) documentation and
deployment.
DATA MINING TECHNIQUES
STATISTICAL METHODS
The data mining techniques that most people are familiar with are statistical methods such
as sample statistics or linear regression. These are usually applied to relatively simple problems that
have few predictive variables. If the problem were more complex, another method
would be more appropriate.
Sample statistics involve looking at particular variables and calculating the minimum
value, maximum value, mean, median, and variance. For example, a retail store could analyze
its sales data and compute summary statistics for particular products over the previous quarter.
Managers can quickly draw conclusions about particular product lines, and if they find
something unexpected or interesting, they can "mine" further using more complex
methods.
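These sample statistics are straightforward to compute directly; the sketch below uses hypothetical quarterly sales figures for a single product line.

```python
import statistics

# Minimal sketch of the sample statistics described above, applied to
# hypothetical quarterly sales figures for one product line.

def summarize(values):
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance
    }

quarterly_sales = [120, 95, 143, 110, 98, 167, 130, 121]
stats = summarize(quarterly_sales)
```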
Linear regression is an easy way to predict values based on a simple equation. There
could be many interactions involved to find the correct model, in which case linear regression
would not be appropriate. However, for simple situations this can be a very powerful method.
For example, suppose Company ABC has data on its customers' income levels along with its
sales data. As shown in Figure 4, as a customer's income increases, the total purchase amount
increases. A line is fit through the data to minimize the error between the data points and the
line. The line then becomes an equation: Total Purchase Amount = y-intercept + (Slope * Customer's Income).
ABC can predict a customer's purchase amount by putting that customer's income level into the
equation. Also, since ABC knows the relationship between income and sales, it can restrict its
marketing to certain income levels.
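The fitted line can be sketched with ordinary least squares in a few lines of Python. Company ABC and its figures are hypothetical; the data is chosen to be perfectly linear so the recovered slope and intercept are easy to verify.

```python
# Illustrative sketch: ordinary least squares fit of
# Total Purchase Amount = intercept + slope * income.

def fit_line(incomes, purchases):
    n = len(incomes)
    mean_x = sum(incomes) / n
    mean_y = sum(purchases) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, purchases))
             / sum((x - mean_x) ** 2 for x in incomes))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical customer data: income (thousands) vs. total purchases.
incomes = [30, 45, 60, 75, 90]
purchases = [300, 450, 600, 750, 900]  # perfectly linear for illustration
intercept, slope = fit_line(incomes, purchases)
predicted = intercept + slope * 50  # predict purchases at a $50k income
```

With real, noisy data the same formulas apply; the fitted line simply minimizes the squared error rather than passing through every point.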
NEAREST NEIGHBOR FOR PREDICTION
Nearest neighbor for prediction is a very easy data mining technique to understand. The
concept comes from the idea that you can predict the outcome or how something is going to
behave based on how other observations that are "near" it behave. An everyday example of how
this is used is in real estate. When someone buys a house, the realtor will check what other
houses in the area sold for because this is a good predictor of what the house for sale should be
worth. This technique works best and is ideal when there are only a few predictive
variables.
A simple example of how a business might use this technique is if a company had a
product they wanted to start selling in a new city. Company XYZ wants to estimate how many
units will sell so they can determine if it is worth moving into the new market. XYZ has a
database of the current sales data of each city where the product is already being sold. The
predictive variables are the population of the city and the city's distance from the nearest
location where a competitor's product is sold. As shown in Figure 5, each city is
represented by a letter corresponding to one of three categories of units sold: >200
units, 100-200 units, and <100 units. The markers are plotted by city population
and distance from the competitor. A "U" marker represents the
new city, where the number of units sold is unknown and is what we want to predict. Using the
nearest neighbor for prediction method, the U marker is nearest to more cities in the A
sales category than in any other sales category. We could therefore predict that this city will
behave the same way and have sales greater than 200 units. Given the high sales prediction,
XYZ should plan on extending its product line to this market.
With the nearest neighbor for prediction method it is also possible to estimate how
confident the company can be in its prediction. If the new point's predictor values are extremely close
to those of its neighbors, there is a higher level of confidence. If no neighbors
are close, a prediction can still be made, but with very little confidence.
This is extremely valuable because a company would not want to follow through on a major
investment based on a prediction with a low level of confidence.
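The Company XYZ scenario can be sketched with a small k-nearest-neighbor routine. The city data below is hypothetical, and note one caveat: in real use the two variables would be rescaled to a common range so that population does not dominate the distance calculation.

```python
import math
from collections import Counter

# Illustrative sketch of nearest-neighbor prediction for the Company XYZ
# scenario. Each known city is (population, miles_to_competitor, category);
# all values are hypothetical.

def knn_predict(known_cities, new_city, k=3):
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    nearest = sorted(known_cities, key=lambda c: dist(c[:2], new_city))[:k]
    votes = Counter(c[2] for c in nearest)
    return votes.most_common(1)[0][0]  # majority vote of the k neighbors

known = [
    (250_000, 40, ">200"), (240_000, 35, ">200"), (230_000, 45, ">200"),
    (120_000, 20, "100-200"), (110_000, 15, "100-200"),
    (40_000, 5, "<100"), (35_000, 8, "<100"),
]
category = knn_predict(known, (245_000, 38))
```

The distance from the new city to its k nearest neighbors also gives the confidence measure described above: small distances mean high confidence, large distances mean the prediction should be treated with caution.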
NEURAL NETWORK
A neural network is a much more complex data mining technique. Its benefits
are that it can handle an extremely large number of predictive variables; that, once the
network has been created and confirmed as successful, it can be reused again and
again; and that it can be applied in many different types of situations. Its disadvantages
are that the outcomes are not easy to interpret and that it can be very time consuming
to get the data into the right format for running the model.
A neural network is a complex computer model that takes input variables and then
outputs a solution. Neural networks include an input layer, hidden layer, and output layer. The
input layer consists of the predictive variables that go into the model. The hidden layer is created
by the computer model and is not seen by the user. The output layer is the end prediction that
has been calculated by the model. All of the variables that go into the neural network have to be
converted into numeric variables with values between 0 and 1.
Company XYZ could use a simple neural network for the same scenario used in
the nearest neighbor for prediction example. The database containing each city's population,
the city's distance from where the competitor's product is sold, and
the product sales would be used to create the model. The computer would use the population
and the distance from the competition as the input layer and then go through a training
phase. During the training phase the computer assigns various weights to each of the variables
and outputs a number that represents the predicted product sales. This number will be
between 0 and 1 and must be interpreted in terms of the actual ranges in the
database. For example, an output of less than 0.333 means product sales are
less than 100 units, an output between 0.333 and 0.666 means product sales are between 100
and 200 units, and an output greater than 0.666 means product sales are greater than 200
units. The computer keeps testing against the actual data and adjusting the weights as needed to
create the best model for predicting product sales. As shown in Figure 6, once the
model has finished training, the new city can be entered into the model and the predicted
product sales computed. This model produced an output of 0.736, so product sales are
predicted to be greater than 200 units.
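The scaling and interpretation steps just described can be sketched in Python. The 0.333/0.666 thresholds and the 0.736 output come from the example above; the normalization bounds are assumed for illustration, and the network itself (the weight fitting) is omitted.

```python
# Illustrative sketch of the data preparation and output interpretation
# around a neural network: inputs are rescaled to [0, 1] before entering
# the input layer, and the 0-1 output is mapped back to a sales category.

def min_max_scale(value, lo, hi):
    """Rescale a raw input variable into [0, 1] for the input layer."""
    return (value - lo) / (hi - lo)

def interpret_output(y):
    """Map the network's 0-1 output back onto the sales categories."""
    if y < 0.333:
        return "<100 units"
    elif y <= 0.666:
        return "100-200 units"
    return ">200 units"

# Hypothetical bounds: populations 10k-500k, distances 0-100 miles.
population_in = min_max_scale(245_000, 10_000, 500_000)
distance_in = min_max_scale(38, 0, 100)

prediction = interpret_output(0.736)  # the model output quoted in the text
```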
CLUSTERING/SEGMENTING
Clustering, or segmenting, is a data mining technique in which nothing specific
is predicted. Instead, the technique forms groups whose members are similar to one another
and different from other groups. This can give a good overall view of the data and of what
is going on in the business. For example, if Company MNO has a database of demographic
information on its customers and their buying habits, segmenting can be used to find buying
patterns based on that demographic data. As shown in Figure 7, male consumers under the age of
forty behave similarly to one another but drastically differently from female consumers over
the age of forty. In this example the data can be grouped by gender, by age, or by both
gender and age. These groupings can then be used for different marketing
campaigns, with marketing techniques tailored to each group. Not all the variables
in the database will be used for clustering or segmentation; some will need to be removed by the
user if they are not meaningful.
Clustering can also be used to identify potential problems by finding outliers. For
example, through clustering Company MNO determined that it has a much higher sales volume for
snowboards in stores within fifty miles of a ski resort. However, it found
one store with a low sales volume even though it is only twenty-four miles
from a ski resort. With further research, Company MNO realized that sales were down
at that store because the area had become saturated with competitors. With this new
information it decided to pull out of the area because it could not compete with the
larger stores.
DECISION TREE
Decision trees are a predictive model that divides data into groups arranged in a
"tree." Each branch represents a classification group that has been split off. The decision tree
splits the data by examining all of the data and picking first the variable that produces the
greatest separation between categories. Each resulting group can then continue to be split at each level
until there are no more logical splits to be made. Decision trees are designed to handle
categorical data, but numeric data can be binned into categories for use in the tree.
The advantages of decision trees are that they can be very easy to interpret, that little
effort is required to prepare the data, and that they can be used in a variety of situations.
One disadvantage is that for simpler problems this method can be more time
consuming than linear regression.
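The core step of selecting the first split can be sketched as follows. This is a simplified criterion chosen for illustration (real tree algorithms use impurity or variance-reduction measures); the records and candidate thresholds are hypothetical, loosely modeled on the chair study described below.

```python
# Illustrative sketch of choosing a decision tree's first split: try each
# candidate threshold on a numeric variable and keep the one producing the
# largest gap between the mean outcomes of the two branches. (Simplified;
# real implementations use impurity or variance-reduction criteria.)

def best_split(records, threshold_candidates):
    """records: list of (variable_value, outcome). Returns (threshold, gap)."""
    best = None
    for t in threshold_candidates:
        low = [y for x, y in records if x < t]
        high = [y for x, y in records if x >= t]
        if not low or not high:
            continue  # a split must leave records on both sides
        gap = abs(sum(low) / len(low) - sum(high) / len(high))
        if best is None or gap > best[1]:
            best = (t, gap)
    return best

# Hypothetical data: (baseline score for chair one, change in score).
records = [(2.5, 0.6), (3.0, 0.5), (3.5, 0.6), (4.0, -0.4), (4.5, -0.5)]
threshold, gap = best_split(records, [3.0, 3.75, 4.25])
```

On this invented data the routine picks the 3.75 threshold, mirroring the first split reported in the study below.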
Recently, a decision tree was used at a consumer products company to help determine
why a consumer study did not produce the results that were predicted. The
conclusions and numbers below are real, but the product and categories have been changed to
protect confidentiality.
A study was designed to test how well consumers liked a change in the design of a
chair cushion. Each consumer rated how comfortable they thought the chair was on a scale of
zero to five, in 0.5-point increments, with zero being extremely uncomfortable and five being
extremely comfortable. A second chair was then presented to the consumer and
scored as well. It was predicted that the second chair would score higher, and the
average increase in score from chair one to chair two was calculated across all the consumers.
As shown in Figure 8, the change was minimal, and a decision tree was constructed to
provide insight into the reason why. The largest split in what influenced the average
change was the baseline score, i.e., the score for chair one. If the score for chair one was
below 3.75, the average change in score was +0.57; if the score for chair one was greater than
3.75, the average change in score was -0.46. The company continued to split the tree further
down, but the true value of the tree was in the first split. Consumers who started with a high
score did not have much room for improvement and actually averaged a decrease, while
consumers who started with a low score did improve their scores, as predicted. From this
information it was determined that the study had been designed incorrectly: only consumers
who used more of the scale in their evaluation of the first chair should have been recruited.
DATA MINING EXAMPLE
A brief data mining example follows. Although this example did not specifically follow
one of the more widely accepted data mining methodologies or use any of the sophisticated data
mining modeling techniques, it does illustrate the principles of data mining: large
amounts of data were extracted from multiple databases, and a meaningful model was
generated to describe a business process and ultimately change behavior.
In this example, the business problem in question was assessing the status of and
improving the on time delivery of New Product Introduction (NPI) hardware. NPI hardware is
defined as fabricated assemblies that are all components of a larger machine assembly. The
delivery problem stems from the fact that the fabrication of the very first set of NPI hardware
required for the assembly of the very first machine is often late relative to the required due date.
For the study at hand, the actual delivery of the first set of hardware was more than 25 days late
relative to the customer orders, with an interquartile range of 35 days.
To initiate analysis, a fishbone diagram was derived to assess potential causes of late
delivery and a data collection plan was generated. In the data collection plan, data was identified
for extraction from two different databases, the engineering product definition database and the
manufacturing database. The data extracted from these two databases was merged into a single
data set for analysis.
An initial review of the data following evaluation of the process capability for on time
delivery indicated two distinct subgroups of data: assemblies described as “brackets” and
assemblies described as “not brackets.” From this segregation of data, analysis proceeded on the
distinct subgroups and new data was generated to describe design and manufacturing sub-
processes based on the extraction of time based event data from the databases. Preliminary plots
of the data were generated and initial models were attempted using linear regression and general
linear model approaches. Unfortunately, no single regression was able to adequately describe
the data.
Finally, an attempt was made to categorize some of the sub-process process times by
percentiles and plot the data against on time delivery as a main effects plot. This plot revealed
that manufacturing activity was related to on time delivery but design activity was not. In fact,
the main effects plot indicated that parts designed closer to the due date actually had better on
time delivery than parts designed further from the due date. Thus it was clear that no single
regression of design and manufacturing data could adequately describe the process.
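The percentile categorization used for the main effects plot can be sketched as follows. The sub-process times, the quartile cuts, and the simple nearest-rank percentile rule are all assumptions made for illustration.

```python
# Illustrative sketch of binning a sub-process time into percentile
# categories before plotting main effects. Uses a simple nearest-rank
# percentile rule; statistics packages offer more refined interpolation.

def percentile_edges(values, cuts=(25, 50, 75)):
    s = sorted(values)
    def pct(p):
        idx = min(len(s) - 1, int(round(p / 100 * (len(s) - 1))))
        return s[idx]
    return [pct(p) for p in cuts]

def categorize(value, edges):
    for i, edge in enumerate(edges):
        if value <= edge:
            return i          # bin index: 0 = fastest quartile, etc.
    return len(edges)

times = [3, 7, 8, 12, 15, 21, 30, 45]  # hypothetical sub-process days
edges = percentile_edges(times)
bin_index = categorize(10, edges)      # a 10-day time falls in bin 1
```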
From this new understanding of the data, an approach was made to fit only the
manufacturing data to on time delivery and an acceptable regression was found. Similarly, an
attempt was made to fit the design data to one of the manufacturing variables, the creation of the
part identity in the manufacturing database, and again a suitable regression was found. These
two regressions were combined through a common term to form a single regression equation.
This equation, while not fitting the on time delivery data very well, as evidenced by a poor
R-squared value, did produce an accurate representation of the on time delivery
distribution. Thus while the combined regression model could not predict how late a particular
part would be, it could, based on the design release distribution, predict what percentage of parts
would be late and when all the parts would be available for assembly to the first machine.
Moreover, this model showed that the greatest contributor to the variation in on time delivery
was the variation in design lead time. By using the derived regression equation along with the
known distributions for the input parameters and the desired on time delivery distribution, a
Monte Carlo simulation was employed to calculate the required cumulative distribution for
design releases to ensure on time delivery. This cumulative distribution was compared to
previous design release plans and the design release plans were shown to be faulty.
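The Monte Carlo step can be sketched as follows. Every number here is an assumption made for illustration (the study's actual regression equation and distributions are not given in the paper); the sketch only shows how sampled inputs combine into a distribution of delivery outcomes.

```python
import random

# Illustrative Monte Carlo sketch: sample a design-release offset and a
# manufacturing lead time from assumed normal distributions, combine them,
# and accumulate the fraction of parts delivered after the due date.

def simulate_delivery(n_trials=10_000, seed=1):
    rng = random.Random(seed)
    late = 0
    for _ in range(n_trials):
        # Hypothetical inputs: design released ~60 days before the due
        # date, manufacturing taking ~45 days, both normally distributed.
        release = rng.gauss(-60, 15)      # days relative to due date
        lead_time = rng.gauss(45, 10)     # manufacturing days
        delivery_vs_due = release + lead_time
        if delivery_vs_due > 0:           # positive = delivered late
            late += 1
    return late / n_trials

late_fraction = simulate_delivery()
```

Run in reverse, the same machinery supports the step described above: fixing the desired on time delivery distribution and solving for the cumulative design-release distribution required to achieve it.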
As a result of this study, design plans were updated to reflect the learning and the
required design release schedules. A data visualization tool was developed to extract data from
the design database and plot the design release requirements against the design release plan and
the actual design releases. This tool allowed the ability to meet on time delivery to be assessed
before all designs were released, so corrective action could be taken in advance of the
hardware due date. Furthermore, a data tool was developed to extract and format manufacturing
data (Bill of Material, manufacturing status, quantity on hand, and manufacturing work order
information) for coordination of design and manufacturing processes along with the rapid
assessment of manufacturing status for all assemblies required to build a particular machine.
Admittedly, this example is not a direct implementation of one of the data mining
methodologies described in this paper, nor does it illustrate any of the elaborate
data modeling techniques that might be used on larger, more complex data sets. However, this
example does illustrate the basic principles of the data mining process (defining a business need,
data collection, data review, data conditioning, model generation, model evaluation, and
documentation and deployment) and the use of data and data modeling to change behavior and
solve a business need.
CONCLUSION
This paper has provided an overview of data mining and has included a data mining
definition, a review of where data mining may be applied, a summary of data mining products,
an assessment of the data mining process, and a synopsis of some of the major data mining
techniques. The paper concludes with an example project that attempts to illustrate the data
mining process and how it may be used to solve a real world problem.
REFERENCES

Angoss Software. Angoss Software Corporation, Inc. Last visited 7 APR 09. <http://www.angoss.com/analytics_software/>

Chapman, Pete, et al. CRISP-DM 1.0. 2000. Retrieved 5 APR 09 from <http://www.spss.com/media/collateral/CRISP-DM_1.0__Step-by-Step_Data_Mining_Guide.pdf>

Infor Solutions. Infor Global Solutions. Last visited 7 APR 09. <http://www.infor.com/solutions/>

"Poll: What Main Methodology Are You Using for Data Mining?" JUL 2002. KDnuggets. Last visited 5 APR 09. <http://www.kdnuggets.com/polls/2002/methodology.htm>

Portrait Software Solutions. Portrait Software plc. Last visited 7 APR 09. <http://www.portraitsoftware.com/Products>

SAS Enterprise Miner. SAS Institute, Inc. Last visited 5 APR 09. <http://www.sas.com/offices/europe/uk/technologies/analytics/datamining/miner/semma.html>

SAS Products and Solutions. SAS Institute, Inc. Last visited 7 APR 09. <http://www.sas.com/software/>

SPSS Software. SPSS, Inc. Last visited 7 APR 09. <http://www.spss.com/software/?source=homepage&hpzone=nav_bar>

Two Crows Corporation. 2005. Introduction to Data Mining and Knowledge Discovery, Third Edition. ISBN 1-852095-02-5. Retrieved 25 FEB 09 from <http://www.twocrows.com/booklet.htm>
APPENDIX: FIGURES
Figure 1: KDnuggets.com, 2002 Survey; Data Mining Process
Figure 2: CRISP-DM Breakdown
Figure 3: CRISP-DM Phases and Flow
Figure 4: Linear Regression Technique
Figure 5: Nearest Neighbor Technique
Figure 6: Neural Net Technique
Figure 7: Clustering / Segmenting Technique
Figure 8: Decision Tree Technique
[Chart: On Time Delivery – probability (0 to 0.4) versus days (-50 to 96), overlaying the fitted actual-delivery distribution ("Delivery Actual - fit") against the required delivery ("Delivery Required").]
Figure 9: Example – On Time Delivery Problem Statement
[Panels: Brainstorm Variation Sources; Data Collection Plan.]
Figure 10: Example – On Time Delivery Data Collection
TOTAL LEAD TIME by Part Type: p < .05

Level      N    Mean    StDev
BRACKET    520  x6.76   x3.14
DUCT       138  x6.70   x0.40
MANIFOLD    44  x9.95   x4.68
TUBE        47  x3.60   x2.79

Pooled StDev = 68.47
Figure 11: Example – Data Segmentation
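The segmentation behind Figure 11 amounts to grouping total lead time by part type and comparing each group's count, mean, and spread. A minimal sketch of that grouping follows; the lead-time values are made up for illustration (the actual study had 520 brackets, 138 ducts, 44 manifolds, and 47 tubes).

```python
# Group total lead time by part type and summarize each group, as in the
# Figure 11 one-way comparison. The observations below are hypothetical.
from collections import defaultdict
from statistics import mean, stdev

records = [
    ("BRACKET", 62.0), ("BRACKET", 71.5), ("BRACKET", 58.25),
    ("DUCT", 66.0), ("DUCT", 67.5), ("DUCT", 65.5),
    ("MANIFOLD", 95.0), ("MANIFOLD", 104.5), ("MANIFOLD", 88.0),
    ("TUBE", 30.0), ("TUBE", 41.0), ("TUBE", 36.5),
]

# Collect each part type's lead times into its own group.
groups = defaultdict(list)
for part_type, lead_time in records:
    groups[part_type].append(lead_time)

# Report N, mean, and standard deviation per group.
for part_type, times in sorted(groups.items()):
    print(f"{part_type:<10} N={len(times):<3} "
          f"mean={mean(times):7.2f} stdev={stdev(times):6.2f}")
```

A significant difference between the group means (the figure's p < .05) is what justified modeling the bracket segment separately.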
[Candidate interval variables (SHIP_DUE, IR CREATE, BOM CREATE, BOMC_MODC, BOMC_MODP, BOMC_MODI, MODC_DUE, MODI_DUE, BOMC_DUE, MODI_MODC) and a Main Effects Plot – Data Means for SHIP-DUE: mean SHIP-DUE (0 to 60 days) across binned levels (0-20 through 90-100) of CAT MODEL_CR, CAT MOD_ISSU, CAT BOM_CR-D, CAT MAN-DUE, CAT SCHED_ST, CAT MO_START, and CAT MO_FINIS.]
Figure 12.1: Example – Model Building
[Process timeline from IR Create through Model / DWG Issue, Model Create, BOM Create, Components Available, MAN Release, Scheduled MO Start, MO Start, and MO Finish to the DUE DATE and SHIP DATE, annotated with the intervals to shorten ("make smaller") and their relative contributions (52.8%, 28.3%, 8.4%, 7.1%, 3.5%).]
SHIP-DUE = 7.97 + 0.269*(MODEL_CR-DUE) + 0.173*(CR-ISS) + 0.704*(MAN_BOMC) + 0.748*(SCH_ST-MAN) + 0.862*(MOS_MOFIN) [R^2A 4.4%] – {R^2A(1) 76.5%, R^2A(2) 68.0%}
Figure 12.2: Example – Model Building
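The regression model in Figure 12.2 can be read as a simple linear function of the five interval variables. The sketch below restates it: the intercept and coefficients come from the figure, while the function and argument names are our own labels for those intervals.

```python
# The Figure 12.2 regression model as a function. Coefficients are taken from
# the figure; the function and argument names are hypothetical labels.
def predicted_ship_due(model_cr_due, cr_iss, man_bomc, sch_st_man, mos_mofin):
    """Predicted SHIP-DUE (days relative to the due date) from five intervals."""
    return (7.97
            + 0.269 * model_cr_due
            + 0.173 * cr_iss
            + 0.704 * man_bomc
            + 0.748 * sch_st_man
            + 0.862 * mos_mofin)

# With every interval at zero, the prediction is the intercept: 7.97 days late.
# Shrinking the intervals (the "make smaller" levers in the figure) pulls the
# prediction downward, toward on-time delivery.
```

This is the form in which the model could be evaluated against held-out orders, as the model evaluation overlays in the next figures illustrate.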
[Overlay Chart: probability versus SHIP-DUE (approximately -49 to 86 days), comparing the actual delivery distribution (SHIP DUE ACTUAL, "Actual Delivery") with the model's predicted distribution (SHIP DUE MODEL, "Predicted Delivery (Regression)").]
Figure 13: Example – Model Evaluation
[Overlay Chart: probability versus model-issue timing (approximately -298 to 82 days), comparing actual issue dates (MODI ACT, "Issue Actual") with the issue dates the model calls for ("modi calc new", "Issue Required for On-Time Delivery").]
Figure 14: Example – Model Evaluation
[BRACKETS SUMMARY: cumulative number of parts versus date (08/06/05 to 06/24/06) for CUM Req Issue, CUM Plan Issue, and CUM Actual Issue (Requirements / Plan / Actual), across all due dates, with warnings: # Issued No PRE - 6, # Issued Post Due - 0, # Multiple Issued Files - 12, # Complex Not Planned Early - 0, # Complex Not Issued Early - 0. BRACKET PLANNING: cumulative percent versus days (-200 to 50) for OLD PLAN, NEW PLAN, and REQUIRED.]
Figure 15: Example – Deployment