from statistics to data science - kansas state …zifeiliu/files/fac_zifeiliu...from statistics to...

From statistics to data science

BAE 815 (Fall 2017)

Dr. Zifei Liu



The Data-Information-Knowledge-Wisdom Hierarchy

- Russell Ackoff

What?

How much?

How many?

How?

Why?

Individual facts (quantities, characters, or symbols)


1 exabyte = 1 billion GB = 10^18 bytes


How do we make decisions?

Experience

Data (Experiments)

Statistics

Big data Data science

(Probability, uncertainty)

Questions that you can answer with data science

• How much? - or - How many?
– Regression algorithms
• What is it? Is this A or B?
– Classification algorithms
• Is this weird?
– Anomaly detection algorithms

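Correlation vs. causation

[Diagram: five possible explanations for a correlation between A and B. Panels (1), (2), and (4) link A and B directly (arrow directions lost in extraction), panel (3) involves a third variable C, and panel (5) is coincidence.]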

Causation is not observed but inferred

• Social drinking vs. earnings

• Energy consumption vs. economic growth

• Debt rate vs. performance of company

• Shoe size vs. reading ability

• Ice cream consumption vs. rate of drowning

• Obesity vs. diabetes (risk factor)

• Children who get tutored get worse grades than children who do not get tutored
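Several of the pairs above (ice cream vs. drowning, for instance) are usually explained by a third variable that drives both sides. The following is a minimal simulation of that situation, assuming a hypothetical confounder (daily temperature) and made-up coefficients; none of these numbers come from the slides. It compares the raw correlation with the correlation left after temperature is regressed out of both variables.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def residuals(y, x):
    """Residuals of y after a least-squares straight-line fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical data: temperature drives both variables; neither causes the other.
temp = [rng.uniform(0, 35) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temp]
drownings = [0.5 * t + rng.gauss(0, 5) for t in temp]

r_raw = pearson(ice_cream, drownings)
r_partial = pearson(residuals(ice_cream, temp), residuals(drownings, temp))
print(f"raw correlation: {r_raw:.2f}")
print(f"correlation after removing temperature: {r_partial:.2f}")
```

The raw correlation is clearly positive even though the simulated ice cream variable has no effect on the simulated drowning variable; once temperature is accounted for, the remaining correlation is close to zero.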

Population vs. sample

Population

Sample

Statistic

Standard deviation: s

Standard error:

$$SE_{\bar{Y}} = \frac{s}{\sqrt{n}}$$

N: population size; n: sample size
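As a small worked example of the standard error of the mean, the sketch below computes s and SE for a hypothetical sample (the measurements are invented for illustration) using only the Python standard library.

```python
import math
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements
n = len(sample)
s = statistics.stdev(sample)        # sample standard deviation (n - 1 denominator)
se = s / math.sqrt(n)               # standard error of the sample mean
print(f"n = {n}, mean = {statistics.mean(sample)}, s = {s:.3f}, SE = {se:.3f}")
```

Because SE divides s by the square root of n, quadrupling the sample size only halves the standard error.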


True situation           | Our conclusion          | Outcome                          | Control errors
No effect (negative)     | Not significant         | True negative                    |
No effect (negative)     | Significant (Reject H0) | False positive ("Type I error")  | Confidence level, P value
Has an effect (positive) | Significant (Reject H0) | True positive                    |
Has an effect (positive) | Not significant         | False negative ("Type II error") | Statistical power, sample size

Null hypothesis (H0): A has no effect on B.
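The two error types can be illustrated by simulation. The sketch below is a hedged illustration, not a method from the slides: it assumes normally distributed data with a known standard deviation of 1, a two-sided z-test, and hypothetical settings (30 observations per group, an effect of 0.8, alpha = 0.05). It estimates how often H0 is rejected when A truly has no effect and when it does.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(a, b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming known sigma."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(effect, n=30, trials=2000, alpha=0.05, seed=1):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(trials):
        control = [rng.gauss(0, 1) for _ in range(n)]
        treated = [rng.gauss(effect, 1) for _ in range(n)]
        if z_test_p_value(treated, control) < alpha:
            rejected += 1
    return rejected / trials

type_i_rate = rejection_rate(effect=0.0)   # H0 true: rejections are false positives
power = rejection_rate(effect=0.8)         # H0 false: rejections are true positives
type_ii_rate = 1 - power                   # misses are false negatives
print(f"Type I rate: {type_i_rate:.3f}")
print(f"Power: {power:.3f}, Type II rate: {type_ii_rate:.3f}")
```

The Type I rate stays near the chosen alpha regardless of sample size, while the Type II rate falls as n (and therefore statistical power) grows.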

Avoid confounding variables

Confounding/nuisance variables (undesired sources of variation that affect the dependent variable)

[Diagram: variables A and B (independent and dependent variable) with additional variables C, D, E, F]

If you can, fix the confounding variable (make it a constant).

If you can’t fix the confounding variable, use blocking.

If you can neither fix nor block the confounding variable, use randomization.
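Blocking and randomization are often combined in a randomized block design. The sketch below is one possible illustration; the block names, plot names, treatments, and the function `randomized_block_assignment` are all hypothetical. Each block receives every treatment exactly once, and the order within a block is randomized.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=0):
    """Assign each treatment once per block, in random order within the block."""
    rng = random.Random(seed)
    plan = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("this sketch assumes one unit per treatment in each block")
        shuffled = list(treatments)
        rng.shuffle(shuffled)              # randomization inside the block
        plan[block] = dict(zip(units, shuffled))
    return plan

# Hypothetical example: three fields (blocks), three plots each, three treatments.
blocks = {
    "field_1": ["plot_a", "plot_b", "plot_c"],
    "field_2": ["plot_d", "plot_e", "plot_f"],
    "field_3": ["plot_g", "plot_h", "plot_i"],
}
treatments = ["control", "low_dose", "high_dose"]
plan = randomized_block_assignment(blocks, treatments)
for block, assignment in plan.items():
    print(block, assignment)
```

Field-to-field differences (the blocked nuisance variable) then affect all treatments equally, and any remaining plot-level differences are spread across treatments by chance rather than by the experimenter's choice.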

Common probability distributions

Regression analysis

R^2: coefficient of determination, 0 to 1

R: correlation coefficient, -1 to +1

• Linear regression

• Logistic regression

• Nonlinear regression

• Stepwise regression
- Forward
- Backward
• Ridge, LASSO & ElasticNet regression
- Handle multicollinearity among variables
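For a simple linear regression with one predictor, R and R^2 can be computed directly from sums of squares. The sketch below uses hypothetical data points and plain Python; it also shows that in this one-predictor case R^2 equals the square of R.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]                 # hypothetical predictor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical responses

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

slope = sxy / sxx                   # least-squares slope
intercept = mean_y - slope * mean_x
r = sxy / sqrt(sxx * syy)           # correlation coefficient, between -1 and +1

ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
r_squared = 1 - ss_res / syy        # coefficient of determination, between 0 and 1
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R={r:.4f}, R^2={r_squared:.4f}")
```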

Machine learning

• Learning:

- improve performance from experience.

• Machine learning:

- teach computers to make and improve predictions based on data; an approach to achieving artificial intelligence

- classification

- prediction (regression)

• Data mining:

- use algorithms to create knowledge from data.

Bayesian statistics for machine learning

Bayes' rule provides the tools to update the probability for a

hypothesis as more evidence or information becomes available.
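In symbols, Bayes' rule is P(H | E) = P(E | H) P(H) / P(E). The sketch below applies it twice to a binary hypothesis, using the posterior from the first piece of evidence as the prior for the second. The scenario and all probabilities are hypothetical, and the two pieces of evidence are assumed to be conditionally independent.

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior P(H | E) from Bayes' rule for a binary hypothesis H."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence

prior = 0.01           # hypothetical: 1% of cases have the condition
sensitivity = 0.95     # hypothetical P(positive test | condition)
false_positive = 0.05  # hypothetical P(positive test | no condition)

after_one = bayes_update(prior, sensitivity, false_positive)
after_two = bayes_update(after_one, sensitivity, false_positive)
print(f"after one positive test: {after_one:.3f}")
print(f"after two positive tests: {after_two:.3f}")
```

With these assumed numbers, one positive result raises the probability from 1% to roughly 16%, and a second positive result raises it to roughly 78%.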


Common data science algorithms

• Linear regression
• Decision tree
• Random forest
• Association rule mining
• K-Means clustering

Unsupervised = exploratory

Supervised = predictive

Decision tree

• The attribute with the largest standard deviation (std) reduction is chosen for the decision node.
• Stop when std for the branch becomes smaller than a certain fraction (e.g., 5%) of std for the full dataset or when too few instances remain in the branch.

http://www.saedsayad.com/decision_tree_reg.htm

[Figure: example split of a 14-instance dataset (Std=9.32) into three branches: 4/14 with Std=3.49, 5/14 with Std=10.87, and 5/14 with Std=7.78]
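The branch sizes and standard deviations shown on this slide are enough to work out the std reduction for the split. The sketch below assumes the reduction is the overall std minus the size-weighted average of the branch stds, which is the convention used on the cited page; variable names are illustrative.

```python
# Branch sizes and standard deviations read off the slide's example.
total_std = 9.32
branches = [(4, 3.49), (5, 10.87), (5, 7.78)]   # (instances, std) per branch

n_total = sum(count for count, _ in branches)
weighted_std = sum(count / n_total * std for count, std in branches)
std_reduction = total_std - weighted_std
print(f"weighted std after split: {weighted_std:.2f}")
print(f"std reduction: {std_reduction:.2f}")
```

Repeating this calculation for every candidate attribute and keeping the one with the largest reduction gives the decision node.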

Decision tree

• You can define a split point for either a categorical or a continuous variable.
• Split the dataset based on homogeneity of data.

[Figure: Classification & Regression Trees (CART), showing split points that partition the X1-X2 plane (Ankit Sharma, 2014)]
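For a continuous variable, a common way to pick the split point is to try the midpoint between each pair of adjacent sorted values and keep the one that leaves the two sides most homogeneous. The sketch below is one such approach for a regression target, scoring homogeneity by the size-weighted standard deviation; the function `best_split` and the data are hypothetical.

```python
from statistics import pstdev

def best_split(xs, ys):
    """Return (threshold, weighted_std) of the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Lower weighted std means more homogeneous child nodes.
        score = len(left) / n * pstdev(left) + len(right) / n * pstdev(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

xs = [1, 2, 3, 10, 11, 12]        # hypothetical continuous predictor
ys = [5, 6, 5, 20, 21, 19]        # hypothetical target
threshold, score = best_split(xs, ys)
print(threshold, round(score, 3))
```

For a classification target the same scan applies, with an impurity measure such as the Gini index in place of the standard deviation.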

Random forest

• Averages multiple deep decision trees, each trained on a different part of the same training set, to overcome the overfitting problem of an individual decision tree
• Widely used machine learning algorithm for classification
- Approx. 2/3 of the total training data are selected at random to grow each tree.
- A random subset of predictor variables is considered at each node, and the best split among them is used to split the node.
- For each tree, the leftover (1/3) data are used to calculate the out-of-bag error rate.
- Each tree gives a classification. The forest chooses the classification having the most votes over all the trees in the forest.
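Two of the ingredients above, the random selection of training rows and the final vote, can be sketched in a few lines. The example assumes rows are drawn with replacement (bootstrap sampling), which the slide does not state explicitly; under that assumption about 63% of distinct rows end up in each tree's sample, which is where the "approx. 2/3 in, 1/3 left over" figures come from. The tree outputs are hypothetical placeholders rather than trained models.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def majority_vote(predictions):
    """Class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
n_rows = 10_000
in_bag = set(bootstrap_indices(n_rows, rng))
in_bag_fraction = len(in_bag) / n_rows       # distinct rows used to grow the tree
oob_fraction = 1 - in_bag_fraction           # rows left over for the out-of-bag error
print(f"in-bag: {in_bag_fraction:.3f}, out-of-bag: {oob_fraction:.3f}")

tree_votes = ["high", "low", "high", "high", "low"]   # hypothetical per-tree outputs
print(majority_vote(tree_votes))
```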

Variable importance plot

Random forests can be used to rank the importance of variables in a regression or classification problem.

• Mean decrease accuracy: how much the model accuracy decreases if we drop that variable
• Mean decrease Gini: measure of variable importance based on the Gini impurity index used for the calculation of splits in trees

Classifying income of adults

Association rule mining

An association rule is a pattern that states when X occurs, Y occurs with certain probability (If/then statement).

Initially used for Market Basket Analysis to find how items purchased by customers are related.

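$$\text{support} = \frac{(X \cup Y).\text{count}}{n}$$

$$\text{confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}$$

where n is the total number of transactions.

Goal: Find all rules that satisfy the user-specified minimum support and minimum confidence.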

Association rule mining (the Apriori Algorithm)

Transaction database (Min support = 50%)

TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Candidate 1-itemsets

itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

Frequent 1-itemsets

itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

Candidate 2-itemsets

itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

Frequent 2-itemsets

itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

Candidate and frequent 3-itemsets

itemset | sup
{2 3 5} | 2

Rules from {2 3 5}

{2, 3} → {5}: confidence = 100%
{3, 5} → {2}: confidence = 100%
{2, 5} → {3}: confidence = 67%
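The four-transaction example on this slide can be reproduced with a short level-wise search. The following is a minimal sketch of the Apriori idea, not an optimized or canonical implementation; the function names are illustrative, and the transactions and the 50% minimum support are taken from the slide.

```python
from itertools import combinations

# Transactions from the slide's example (TID -> items).
transactions = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
min_support = 0.5

def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if set(itemset) <= items)

def apriori():
    """Return {frequent itemset: support count} using level-wise search."""
    min_count = min_support * len(transactions)
    candidates = [frozenset([i]) for i in sorted(set().union(*transactions.values()))]
    frequent, k = {}, 1
    while candidates:
        level = {c: support_count(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def confidence(x, y):
    """Confidence of the rule x -> y: count(x and y together) / count(x)."""
    return support_count(set(x) | set(y)) / support_count(x)

frequent = apriori()
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
print(round(confidence({2, 5}, {3}), 2))
```

The prune step is what keeps the search small: {1 2 3} and {1 3 5} are never counted because {1 2} and {1 5} are already known to be infrequent.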

K-Means clustering

The algorithm works iteratively to assign each data point to one of K groups based on feature similarity (e.g., a defined distance measure).

• Find the centroids of the K clusters

• Labels for the training data
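The iteration can be sketched as alternating assignment and update steps (often called Lloyd's algorithm). The version below is a simplified illustration with assumptions that are not from the slides: Euclidean distance, the first K points as initial centroids (real implementations usually initialize randomly), and hypothetical 2-D data. It returns the two outputs listed above, the centroids and the labels.

```python
from math import dist

def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break                          # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                    # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = k_means(points, k=2)
print(labels)
print(centroids)
```

K must be chosen in advance, and the result can depend on the initial centroids, so the algorithm is typically run from several starting points.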

Open-source language for data science


Demand for deep analytical talent in the U.S. projected to be 50-60% greater than supply by 2018.


Become a data scientist?

Job trends from indeed.com
