from statistics to data science - kansas state …zifeiliu/files/fac_zifeiliu...from statistics to...
TRANSCRIPT
![Page 2: From statistics to data science - Kansas State …zifeiliu/files/fac_zifeiliu...From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu 2 The Data-Information-Knowledge-Wisdom](https://reader033.vdocuments.site/reader033/viewer/2022060214/5f057ae47e708231d4132b8c/html5/thumbnails/2.jpg)
The Data-Information-Knowledge-Wisdom Hierarchy
- Russell Ackoff
What?
How much?
How many?
How?
Why?
Individual facts (quantities, characters, or symbols)
1 exabyte = 1 billion GB = 10^18 bytes
How do we make decisions?
Experience
Data (experiments)
Statistics
Big data, data science
(Probability, uncertainty)
Questions that you can answer with data science
• How much? - or - How many?
– Regression algorithms
• What is it? Is this A or B?
– Classification algorithms
• Is this weird?
– Anomaly detection algorithms
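Correlation vs. causation
[Diagram: five possible explanations, (1) to (5), for an observed relationship between A and B. The arrows are not recoverable from the transcript; one case involves a third variable C, and case (5) is coincidence.]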
Causation is not observed but inferred
• Social drinking vs. earnings
• Energy consumption vs. economic growth
• Debt rate vs. performance of company
• Shoe size vs. reading ability
• Ice cream consumption vs. rate of drowning
• Obesity vs. diabetes (risk factor)
• Children who get tutored get worse grades than
children who do not get tutored
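The ice cream vs. drowning pair is commonly explained by a third variable that drives both. The simulation below is a minimal sketch of that situation, assuming a hypothetical confounder `temperature`; every coefficient and noise level is invented for illustration. It shows a strong raw correlation that largely disappears once the confounder is accounted for.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def residuals(ys, xs):
    """Residuals of y after a least-squares straight-line fit on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)
# Hypothetical confounder C: daily temperature.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
# A and B each depend on C, but not on each other.
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 3) for t in temperature]

r_raw = pearson_r(ice_cream, drownings)
r_adjusted = pearson_r(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))
print(f"raw correlation: {r_raw:.2f}")
print(f"correlation after removing the temperature effect: {r_adjusted:.2f}")
```

The adjusted value is the correlation between what is left of each variable after a straight-line fit on temperature, which is one simple way to hold the confounder constant.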
Population vs. sample
Population (size N)
Sample (size n)
Statistic
Standard deviation: s
Standard error: $SE_{\bar{Y}} = \dfrac{s}{\sqrt{n}}$
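As a small worked example, the standard error of the sample mean is the sample standard deviation s divided by the square root of the sample size n. The sketch below uses made-up measurements and assumes s is computed with the usual n - 1 denominator.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the mean: s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(len(sample))

# Hypothetical sample of n = 8 measurements.
sample = [4.8, 5.1, 5.4, 4.9, 5.3, 5.0, 5.2, 4.7]
print("mean:", statistics.mean(sample))
print("standard deviation s:", statistics.stdev(sample))
print("standard error:", standard_error(sample))
```

The standard deviation describes the spread of individual observations, while the standard error describes the uncertainty of the sample mean and shrinks as n grows.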
| True situation | Our conclusion | Outcome | Control errors |
|---|---|---|---|
| No effect (negative) | Not significant | True negative | |
| No effect (negative) | Significant (reject H0) | False positive ("Type I error") | Confidence level, P value |
| Has an effect (positive) | Significant (reject H0) | True positive | |
| Has an effect (positive) | Not significant | False negative ("Type II error") | Statistical power, sample size |
Null hypothesis (H0): A has no effect on B.
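The two error types can be estimated by simulation. The sketch below is only an illustration under stated assumptions: a two-sided z-test with known standard deviation 1, a sample size of 30, a significance level of 0.05, and a hypothetical true effect of 0.5 for the "has an effect" case. None of these numbers come from the slides.

```python
import random
from statistics import NormalDist, mean

def p_value(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: population mean = 0 (sigma known)."""
    z = mean(sample) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(true_mean, n=30, alpha=0.05, trials=2000, seed=1):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if p_value(sample) < alpha:
            rejections += 1
    return rejections / trials

# H0 is true (no effect): every rejection is a false positive (Type I error).
type_i_rate = rejection_rate(true_mean=0.0)
# H0 is false (effect = 0.5): every rejection is a true positive.
power = rejection_rate(true_mean=0.5)
type_ii_rate = 1 - power  # missed effects are false negatives (Type II error)

print(f"Type I error rate: {type_i_rate:.3f}")
print(f"Power: {power:.3f}, Type II error rate: {type_ii_rate:.3f}")
```

The Type I rate stays close to the chosen significance level, while the Type II rate depends on the effect size and the sample size, which is why power and sample size appear in the last column of the table.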
Avoid confounding variables
Confounding/nuisance variables: undesired sources of variation that affect the dependent variable.
[Diagram: an independent variable and a dependent variable (labeled A and B), with additional variables C, D, E, and F]
• If you can, fix the confounding variable (make it a constant).
• If you can’t fix the confounding variable, use blocking.
• If you can neither fix nor block the confounding variable, use randomization.
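Blocking and randomization can be combined: group the experimental units by the nuisance variable, then assign treatments at random within each group. The following is a minimal sketch of that idea; the function name, the plots, and the fields are all hypothetical.

```python
import random
from collections import defaultdict

def randomized_block_assignment(units, block_of, treatments, seed=0):
    """Randomly assign treatments within each block.

    units: list of unit ids; block_of: dict mapping unit id -> block label.
    Units in a block are shuffled, then treatments are dealt out in turn,
    so each treatment appears (nearly) equally often in every block.
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of[unit]].append(unit)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical experiment: 8 plots in two fields; "field" is the blocking variable.
plots = [f"plot{i}" for i in range(8)]
field = {p: ("north" if i < 4 else "south") for i, p in enumerate(plots)}
assignment = randomized_block_assignment(plots, field, ["control", "treated"])
for p in plots:
    print(p, field[p], assignment[p])
```

Because each field receives both treatments equally often, a difference between fields cannot be mistaken for a treatment effect.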
Common probability distributions
Regression analysis
R^2: coefficient of determination, 0 to 1
R: correlation coefficient, -1 to +1
• Linear regression
• Logistic regression
• Nonlinear regression
• Stepwise regression
- Forward
- Backward
• Ridge, LASSO & ElasticNet regression
- Handle multicollinear variables
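For simple linear regression with one predictor and an intercept, the coefficient of determination equals the square of the correlation coefficient. The sketch below fits a line by ordinary least squares and computes both quantities; the data points are made up for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r, r_squared).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5  # correlation coefficient, -1 to +1
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_res / syy  # coefficient of determination, 0 to 1
    return slope, intercept, r, r_squared

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r, r_squared = linear_fit(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} x, R = {r:.4f}, R^2 = {r_squared:.4f}")
```

The sign of R follows the sign of the slope, while R^2 only reports the fraction of variance explained.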
Machine learning
• Learning:
- improve performance from experience.
• Machine learning:
- teach computers to make and improve predictions based on data; an approach to achieving artificial intelligence
- classification
- prediction (regression)
• Data mining:
- use algorithms to create knowledge from data.
Bayesian statistics for machine learning
Bayes' rule provides the tools to update the probability for a
hypothesis as more evidence or information becomes available.
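A short sketch of this updating process: each posterior becomes the prior for the next piece of evidence. The numbers are hypothetical (a 1% prior, and evidence that appears with probability 0.95 when the hypothesis is true and 0.05 when it is false).

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior P(H | E) from Bayes' rule:

    P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)]
    """
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence

# Hypothetical numbers: start at 1% and observe the same kind of evidence twice.
beliefs = [0.01]
for _ in range(2):
    beliefs.append(bayes_update(beliefs[-1], 0.95, 0.05))

for step, belief in enumerate(beliefs):
    print(f"after {step} observations: P(H) = {belief:.3f}")
```

A single observation raises the probability from 1% to about 16%; a second, independent observation of the same kind raises it to about 78%.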
Common data science algorithms
• Linear regression
• Decision tree
• Random forest
• Association rule mining
• K-Means clustering
Unsupervised = exploratory
Supervised = predictive
Decision tree
• The attribute with the largest std reduction is chosen for the decision node.
• Stop when std for the branch becomes smaller than a certain fraction (e.g., 5%)
of std for the full dataset or when too few instances remain in the branch.
http://www.saedsayad.com/decision_tree_reg.htm
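[Figure: example split into three branches: 4/14 of the instances with Std=3.49, 5/14 with Std=10.87, and 5/14 with Std=7.78; Std=9.32 for the node being split]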
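The std reduction for a candidate split is the std before the split minus the weighted average of the branch stds. The sketch below applies that formula to the figure's numbers, assuming that Std=9.32 is the std before the split and that the fractions are the shares of instances falling in each branch.

```python
def std_reduction(parent_std, branches):
    """Standard deviation reduction of a split.

    branches: list of (fraction_of_instances, branch_std) pairs.
    """
    weighted_std = sum(fraction * std for fraction, std in branches)
    return parent_std - weighted_std

# Values read from the figure (interpretation assumed, see above).
branches = [(4 / 14, 3.49), (5 / 14, 10.87), (5 / 14, 7.78)]
sdr = std_reduction(9.32, branches)
print(f"std reduction: {sdr:.2f}")  # about 1.66
```

Repeating this calculation for every attribute and choosing the largest reduction gives the decision node.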
Decision tree
• You can define a split point for either a categorical or a continuous variable.
• Split the dataset based on the homogeneity of the data.
[Figure: plot with axes X1 and X2]
Classification & Regression Trees (CART)
(Ankit Sharma, 2014)
Random forest
• Averages multiple deep decision trees, each trained on a different part of the same training set, to overcome the overfitting problem of an individual decision tree.
• Widely used machine learning algorithm for classification
- About two-thirds of the total training data are selected at random to grow each tree.
- At each node, predictor variables are selected at random, and the best split among them is used to split the node.
- For each tree, the leftover data (about one-third) are used to calculate the out-of-bag error rate.
- Each tree gives a classification. The forest chooses the classification having the most votes over all the trees in the forest.
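Two of these steps are easy to sketch without training any trees: drawing the data for one tree and combining the trees' votes. The code assumes bootstrap sampling with replacement, which leaves roughly 63% of the rows in the bag and 37% out of it, in line with the approximate two-thirds and one-third quoted above. The class labels in the vote are hypothetical.

```python
import random
from collections import Counter

def bootstrap_split(n_rows, rng):
    """Draw n_rows row indices with replacement.

    Returns (in_bag, out_of_bag): the drawn indices (with repeats) and the
    indices that were never drawn.
    """
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag

def majority_vote(tree_predictions):
    """Class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(1000, rng)
oob_fraction = len(out_of_bag) / 1000
print(f"out-of-bag fraction for one tree: {oob_fraction:.2f}")

# Hypothetical predictions from five trees for a single observation.
forest_class = majority_vote([">50K", "<=50K", "<=50K", ">50K", "<=50K"])
print("forest prediction:", forest_class)
```

The out-of-bag rows act as a built-in test set for the tree that did not see them, which is how the out-of-bag error rate is obtained without a separate validation split.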
Variable importance plot
Random forests can be used to rank the importance of variables in a regression or classification problem.
• Mean decrease accuracy: how much the model accuracy decreases if the values of that variable are randomly permuted (in effect, dropping its information)
• Mean decrease Gini: measure of variable importance based on the Gini impurity index used for the calculation of splits in trees
[Figure: variable importance plot, classifying income of adults]
Association rule mining
An association rule is a pattern that states when X occurs, Y occurs with certain probability (If/then statement).
Initially used for Market Basket Analysis to find how items purchased by customers are related.
$$\text{support} = \frac{(X \cup Y).\text{count}}{n}$$
$$\text{confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}$$
Goal: Find all rules that satisfy the user-specified minimum support
and minimum confidence.
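Support is the fraction of all n transactions that contain both X and Y, and confidence is the fraction of the transactions containing X that also contain Y. A minimal sketch with made-up shopping baskets, assuming X occurs in at least one transaction:

```python
def support_and_confidence(transactions, x, y):
    """Support and confidence of the rule X -> Y.

    Assumes X occurs in at least one transaction (otherwise the
    confidence is undefined).
    """
    x, y = set(x), set(y)
    n = len(transactions)
    count_xy = sum(1 for t in transactions if (x | y) <= set(t))
    count_x = sum(1 for t in transactions if x <= set(t))
    return count_xy / n, count_xy / count_x

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
support, confidence = support_and_confidence(baskets, {"bread"}, {"milk"})
print(f"bread -> milk: support = {support:.2f}, confidence = {confidence:.2f}")
```

Here bread and milk appear together in 3 of 5 baskets (support 0.60), and 3 of the 4 baskets with bread also contain milk (confidence 0.75).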
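Association rule mining (the Apriori Algorithm)

Transaction database (min support = 50%):

| TID | Items |
|---|---|
| 100 | 1 3 4 |
| 200 | 2 3 5 |
| 300 | 1 2 3 5 |
| 400 | 2 5 |

Candidate 1-itemsets:

| itemset | sup. |
|---|---|
| {1} | 2 |
| {2} | 3 |
| {3} | 3 |
| {4} | 1 |
| {5} | 3 |

Frequent 1-itemsets:

| itemset | sup. |
|---|---|
| {1} | 2 |
| {2} | 3 |
| {3} | 3 |
| {5} | 3 |

Candidate 2-itemsets: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Candidate 2-itemsets with support counts:

| itemset | sup |
|---|---|
| {1 2} | 1 |
| {1 3} | 2 |
| {1 5} | 1 |
| {2 3} | 2 |
| {2 5} | 3 |
| {3 5} | 2 |

Frequent 2-itemsets:

| itemset | sup |
|---|---|
| {1 3} | 2 |
| {2 3} | 2 |
| {2 5} | 3 |
| {3 5} | 2 |

Candidate 3-itemsets: {2 3 5}

Frequent 3-itemsets:

| itemset | sup |
|---|---|
| {2 3 5} | 2 |

Rules from {2 3 5}:
{2, 3} → 5, confidence = 100%
{3, 5} → 2, confidence = 100%
{2, 5} → 3, confidence = 67%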
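The level-by-level procedure in this example can be sketched in a few lines. This is an illustrative implementation, not an optimized one: it counts supports by scanning every transaction for every candidate, and the helper `rules` only builds rules with a single item on the right-hand side. It is run on the four transactions of the example with a minimum support of 50%.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan the database to count each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: s for c, s in counts.items() if s >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def rules(frequent, itemset):
    """Confidence of each rule (itemset minus one item) -> that item."""
    itemset = frozenset(itemset)
    result = {}
    for item in itemset:
        antecedent = itemset - {item}
        confidence = frequent[itemset] / frequent[antecedent]
        result[(tuple(sorted(antecedent)), item)] = confidence
    return result

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(db, min_support=0.5)
rule_conf = rules(frequent, {2, 3, 5})
for (antecedent, item), conf in sorted(rule_conf.items()):
    print(f"{set(antecedent)} -> {item}: confidence = {conf:.0%}")
```

The prune step is what removes {1 2 3} and {1 3 5} before any counting: each contains a 2-itemset ({1 2} or {1 5}) that is not frequent, so only {2 3 5} survives as a candidate 3-itemset.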
K-Means clustering
The algorithm works iteratively to assign each data point to one of K groups based on feature similarity (e.g., a defined distance measure). It produces:
• The centroids of the K clusters
• Labels for the training data
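The iteration alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The following is a minimal sketch, assuming squared Euclidean distance as the similarity measure; the data points and the optional `initial_centroids` parameter are illustrative choices, not part of the slides.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Return (centroids, labels) after iterating until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print("centroids:", centroids)
print("labels:", labels)
```

The result depends on the starting centroids, so in practice the algorithm is usually run from several random initializations and the best clustering is kept.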
Open-source language for data science
Become a data scientist?
Demand for deep analytical talent in the U.S. projected to be 50-60% greater than supply by 2018.
Job trends from indeed.com