
Page 1:

Probability Theory and Machine Learning in Science

Ji-Hun Kim

Seoul National University

ROSAEC Workshop 2013

Page 2:

Outline

1 Probability in Science

2 Machine Learning

3 Machine Learning in Action

4 Opportunities in Machine Learning

Page 3:

Plan

1 Probability in Science

2 Machine Learning

3 Machine Learning in Action

4 Opportunities in Machine Learning

Page 4:

What is Probability? Two schools of thought

Frequentist probability

The traditional definition:

P(event) ≡ lim_{N→∞} N_event / N_total

Bayesian probability

Became popular in the 20th century. Probability is a generalization of logic in the presence of uncertainty.

P(event) = degree of plausibility.

What does "a 30% chance of rain tomorrow" mean?
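A small illustration of the frequentist limit (my own sketch, not from the slides; the 0.3 bias echoes the rain example):

```python
import numpy as np

# Frequentist probability as a limiting relative frequency:
# P(event) = lim N->inf  N_event / N_total.
rng = np.random.default_rng(0)
flips = rng.random(1_000_000) < 0.3          # events with true probability 0.3
for n in (10, 1_000, 1_000_000):
    print(n, flips[:n].mean())               # relative frequency -> 0.3
```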

Page 5:

Probability Theory as Extended Logic (from Probability Theory by E.T. Jaynes)

Probability theory follows from three assumptions:

1 Degrees of plausibility are represented by real numbers.

2 As new information supporting the truth of a proposition is supplied, the number increases continuously and monotonically. The deductive limit must be obtained where appropriate.

3 Consistency.
  - If a conclusion can be reasoned out in more than one way, every possible way must lead to the same result.
  - Propriety: the theory must take account of all information, provided it is relevant to the question.
  - Jaynes consistency: equivalent states of knowledge must be represented by equivalent plausibility assignments.

From these, we derive

0 ≤ p(x) ≤ 1

p(A, B|C) = p(A|C) p(B|A, C) = p(B|C) p(A|B, C)

p(A|B) + p(Ā|B) = 1
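As a one-step addition (not on the slide): equating the two factorizations of the product rule and dividing by p(B|C) gives Bayes' theorem, the workhorse of everything that follows.

```latex
% From p(A|C) p(B|A,C) = p(B|C) p(A|B,C), divide by p(B|C) (> 0):
\[
  p(A \mid B, C) \;=\; \frac{p(A \mid C)\, p(B \mid A, C)}{p(B \mid C)},
  \qquad
  p(B \mid C) \;=\; \sum_{A} p(A \mid C)\, p(B \mid A, C).
\]
```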

Page 6:

Resolving Ambiguities/Aesthetics

Given clues, there are many possible guesses:

Principle of maximum entropy

Given constraints, prefer the least informative conclusion, i.e., the probability distribution of maximum entropy. For example:

No prior information: uniform distribution.

Known mean: Poisson distribution.

Known mean and variance: Gaussian distribution.
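A sketch of where these solutions come from (my addition, following the standard Lagrange-multiplier argument): maximizing the entropy under moment constraints always yields an exponential-family distribution.

```latex
% Maximize H[p] = -\int p(x)\,\ln p(x)\,dx subject to \int p = 1 and
% the constraints E[f_k(x)] = F_k. Setting the variation of
%   L = H[p] - \lambda_0 (\textstyle\int p - 1)
%            - \sum_k \lambda_k (\textstyle\int p\, f_k - F_k)
% to zero gives
\[
  p(x) \;=\; \frac{1}{Z(\lambda)} \exp\!\Big(-\sum_k \lambda_k f_k(x)\Big),
  \qquad
  Z(\lambda) \;=\; \int \exp\!\Big(-\sum_k \lambda_k f_k(x)\Big)\,dx .
\]
% With f_1 = x and f_2 = x^2 (known mean and variance) this is a
% Gaussian; with no constraints it is the uniform distribution.
```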

Given data, there are many possible theories:

Occam's Razor

Among the consistent theories, prefer the simplest one.

This follows automatically from Bayesian model selection.
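A brief sketch of that last claim (my addition, following the standard evidence argument): in Bayesian model selection, a complex model must spread its prior over many possible datasets, so its evidence for any particular dataset is diluted.

```latex
% Compare models by their evidence
%   p(D | M) = \int p(D | \theta, M)\, p(\theta | M)\, d\theta.
% Approximating around the best-fit \hat{\theta} gives
\[
  p(D \mid M) \;\approx\; p(D \mid \hat{\theta}, M)\;
  \underbrace{\frac{\Delta\theta_{\text{posterior}}}
                   {\Delta\theta_{\text{prior}}}}_{\text{Occam factor}\;\le\;1},
\]
% so flexible models (wide priors) are automatically penalized:
% Occam's razor without adding it by hand.
```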

Page 7:

Rise of Probability in Science

The laws of nature are themselves non-deterministic; e.g., quantum mechanics is a probabilistic theory.

Bell's theorem: no local hidden variable can remove the randomness of QM.

The uncertainty principle of QM: it is impossible to measure position and momentum simultaneously and exactly.

Non-linear dynamics: a tiny difference in initial conditions can yield a huge deviation in the long-term future.

For a complex system, what we can study is its statistical properties.

The present is a unique event; it is often impossible to repeat experiments.

Fortunately, it is usually not the exact state of a system (each proton in a cell) but its statistical properties (a human) that we are interested in.

Page 8:

When to use Machine Learning?

Probability theory alone is often insufficient to solve interesting problems.

Many problems are too difficult to solve from first principles:
  - Listening to and speaking Korean.
  - Real estate prices (such as an apartment or the 63 Building).
  - Higgs jets vs gluon jets (in particle physics).
  - A definition of life from quantum mechanics and chemistry.

Induction, which requires data, complements deductive reasoning.

Abstraction is lossy data compression; efficient encoding builds on the sample distribution.

Insufficient data: when data is high-dimensional, it is impossible to collect a sufficient sample, say 2^100 points. How do we draw conclusions from incomplete data?

Humans also need experience to learn something, e.g., how to walk. Who can write a text on how to walk, by the way?

Machine learning is about learning from data.

Page 9:

Plan

1 Probability in Science

2 Machine Learning

3 Machine Learning in Action

4 Opportunities in Machine Learning

Page 10:

What is Machine Learning?

Supervised machine learning

From training samples of (label, data) pairs, assign a label to new data.

Applications: classification, spam filtering, real estate pricing, movie rating, book recommendation, ...

Algorithms: k-nearest neighbors, neural networks, support vector machines, ...

Unsupervised learning

Assign labels to unlabelled data; i.e., there are no training samples.

Applications:
  - Structure finding: life vs things, animals vs plants, ...
  - Text topic modeling: extract topics from texts.
  - Lossy data compression.
  - Label generation for supervised learning.

Algorithms: clustering, auto-encoders, ...
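A concrete illustration of the two modes (my addition, not from the slides; the toy blobs and parameter choices are assumptions for demonstration only):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # supervised
from sklearn.cluster import KMeans                  # unsupervised

# Toy data: two blobs in 2-D (hypothetical, for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # labels, known only in the supervised case

# Supervised: learn from (label, data) pairs, then label a new point.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([[4.5, 5.2]]))   # -> [1], near the second blob

# Unsupervised: no labels; the algorithm invents cluster labels itself.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:5], labels[-5:])     # two discovered groups
```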

Page 11:

Three Elements of Machine Learning

Machine Learning = Representation + Evaluation + Optimization

0 Data gathering: remember, we are often short of data, not algorithms!
  - Data pre-processing: noise correction, parsing, ...

1 Representation
  - Feature extraction: how do we represent the data?
  - Hypothesis space: the models or classifiers we consider.

2 Evaluation
  - Evaluation function: a measure of the goodness of models.
  - Parametric vs non-parametric models.

3 Optimization
  - Strategy to find the best model within the hypothesis space.
  - Monte Carlo methods/parallel tempering, stochastic gradient, ...
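To make the three elements concrete, here is a minimal, hedged sketch (my own toy example, not from the talk): a linear model as the representation, squared error as the evaluation function, and stochastic gradient descent as the optimization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Representation: a linear model y ≈ w·x + b (hypothetical toy problem).
X = rng.uniform(-1, 1, (200, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b + rng.normal(0, 0.05, 200)  # noisy targets

# Evaluation: mean squared error over the data.
def mse(w, b):
    return np.mean((X @ w + b - y) ** 2)

# Optimization: stochastic gradient descent on the squared error.
w, b, lr = np.zeros(3), 0.0, 0.1
for epoch in range(50):
    for i in rng.permutation(len(X)):          # one sample at a time
        err = X[i] @ w + b - y[i]
        w -= lr * err * X[i]                   # gradient of (err^2)/2 wrt w
        b -= lr * err
print(mse(w, b), w.round(2), b.round(2))       # approaches the true values
```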

Page 12:

Supervised Learning

Decision Tree
0 We have N-dimensional data.
1 Select a coordinate and slice it for the best discrimination.
2 For each segment, repeat step 1 until a desired data purity is reached or the tree reaches a depth limit.

Random Forests
1 Randomly select a subset of coordinates.
2 Build a decision tree with the subset.
3 Repeat steps 1 and 2, N times.
4 Apply majority rule for the final decision.

Neural networks, support vector machines, the k-nearest neighbors algorithm, ...
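A hedged sketch of the random-forest recipe above (the tree count, depth, feature-subset size, and toy data are my assumptions, not the talk's):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

# Toy N-dimensional data: two Gaussian classes in 10-D (illustrative only).
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(1, 1, (100, 10))])
y = np.array([0] * 100 + [1] * 100)

trees, subsets = [], []
for _ in range(25):                                   # step 3: repeat N times
    cols = rng.choice(10, size=4, replace=False)      # step 1: random coordinate subset
    t = DecisionTreeClassifier(max_depth=3).fit(X[:, cols], y)  # step 2
    trees.append(t); subsets.append(cols)

def forest_predict(x):
    votes = [int(t.predict(x[cols].reshape(1, -1))[0]) for t, cols in zip(trees, subsets)]
    return np.bincount(votes).argmax()                # step 4: majority rule

print(forest_predict(np.full(10, 1.2)))               # -> 1 (near the class-1 mean)
```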

Page 13:

Unsupervised Learning

Clustering

Grouping similar items; we need a distance measure.

k-means clustering: start with k random means; assign each point to its nearest mean and move each mean to the center of its cluster (sketched below).

Hierarchical clustering: repeatedly merge the pair of clusters at minimum distance until the remaining clusters are well apart.

...
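A from-scratch sketch of the k-means loop described above (toy data and the iteration cap are my choices):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), k, replace=False)]   # start with k random means
    for _ in range(iters):
        # Assign each point to its nearest mean.
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each mean to the center of its cluster.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else means[j] for j in range(k)])
        if np.allclose(new, means):
            break
        means = new
    return labels, means

# Toy usage (illustrative data: two blobs on a line):
X = np.vstack([np.random.default_rng(3).normal(m, 0.3, (40, 2)) for m in (0, 3)])
labels, means = kmeans(X, 2)
print(means.round(1))
```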

Blind signal separation

Lossy compression by dimensionality reduction.

Principal component analysis: choose axes that best align with the points (directions of maximum variance).

Independent component analysis: choose axes that are statistically independent.

...
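A minimal sketch of PCA as lossy dimensional reduction (my illustration; the SVD route is one standard implementation, and the toy data is invented):

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA via SVD: project onto directions of maximum variance."""
    Xc = X - X.mean(axis=0)            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T    # coordinates along the top components

# Toy usage: 2-D data stretched along one direction (illustrative only).
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.5], [0.5, 0.3]])
Z = pca(X, 1)                          # lossy compression: 2-D -> 1-D
print(Z.shape)                         # (200, 1)
```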

Page 14:

Unsupervised Learning and Hierarchical Structure in Nature

Life in nature:

Life := Plant, Animal.
Animal := Vertebrate, Invertebrate.
Vertebrate := Reptilia, Mammal, ...
Mammal := Dog, Cat, ...
Dog := Jindo, Shepherd, Poodle, ...

How does the human brain (visual cortex) recognize an image? A current guess is:

Image ≡ cells ('pixels') in the retina.
Pixels → basic shapes such as lines, points, brightness, colors.
Lines, points, brightness, colors → edges, contrasts.
Edges, contrasts → objects.
Objects → complex objects (on a background).

How can we learn these from high-dimensional data? Representation discovery (e.g., dimensionality reduction) and structure finding (hierarchical clustering).

Page 15:

Parametric vs non-parametric algorithms

Parametric algorithms

have a fixed number of model parameters;
are easier to make computationally efficient;
have limited learning capacity.
Examples: neural networks, k-means clustering, ...

Non-parametric algorithms

the number of model parameters grows as the data size increases;
a way to go for a human-like learning system.
Examples: support vector machines, hierarchical clustering, ...

Page 16:

Choosing a Machine Learning Algorithm

Each algorithm has many options and parameters to adjust. It is like medicine:

there is no best, general solution;

there are many (imperfect) solutions to problems;

it is difficult for non-experts to use them correctly and efficiently;

we need experts (or expert systems?) to configure and combine them.

Diversity helps produce better solutions.

Which is the best algorithm? It depends on the problem.

No Free Lunch theorem: in optimization and supervised machine learning, no algorithm outperforms all others when averaged over all problems.

Which algorithm to use? All of them!

Recall that for complex real-world problems, each expert often has a distinct opinion, too. Many real-world machine learning systems employ multiple algorithms with majority rule; a minimal voting ensemble is sketched below.
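A minimal sketch of the "use all of them" strategy (the model choices, toy data, and voting details are my assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (80, 2)), rng.normal(2, 1, (80, 2))])
y = np.array([0] * 80 + [1] * 80)

# Train several different algorithms on the same data.
models = [KNeighborsClassifier(5).fit(X, y),
          SVC().fit(X, y),
          DecisionTreeClassifier(max_depth=4).fit(X, y)]

def vote(x):
    """Majority rule over the individual predictions."""
    preds = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return np.bincount(preds).argmax()

print(vote(np.array([1.8, 2.1])))   # -> 1, by majority vote
```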

Page 17:

Plan

1 Probability in Science

2 Machine Learning

3 Machine Learning in Action

4 Opportunities in Machine Learning

Pages 18-21:

Distance measures matter

Point: x = (age, height, weight). To cluster points, we need a distance measure.

Distance - I

d² = (a₁ − a₂)² + (h₁ − h₂)² + (w₁ − w₂)²

In fact, a_i, h_i, w_i are of different types. Lesson: dimensions (or types) are important.

Distance - II

Let's normalize them correctly:

d² = (a₁ − a₂)²/σ_a² + (h₁ − h₂)²/σ_h² + (w₁ − w₂)²/σ_w²

What should the normalization scales be?

Distance - III

Scaling is a global transformation. What about a local transformation?

d² = (a₁ − a₂)²/σ_a(a₁, a₂)² + (h₁ − h₂)²/σ_h(h₁, h₂)² + (w₁ − w₂)²/σ_w(w₁, w₂)²

Conceptually simple, but symbolically complex. How can we simplify this?
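A hedged sketch of the three distance variants above (the sample data and especially the local scale function are illustrative assumptions, not the talk's choices):

```python
import numpy as np

X = np.array([[25, 175, 70], [30, 160, 55], [60, 170, 80.0]])  # (age, height, weight)

def d_naive(x, y):
    return np.sum((x - y) ** 2)                 # Distance I: mixes units

sigma = X.std(axis=0)                           # global scales from the sample
def d_scaled(x, y):
    return np.sum(((x - y) / sigma) ** 2)       # Distance II: standardized

def d_local(x, y, scale=lambda u, v: 1.0 + 0.5 * (np.abs(u) + np.abs(v))):
    # Distance III: a local, coordinate-dependent scale (illustrative choice).
    s = scale(x, y)
    return np.sum(((x - y) / s) ** 2)

print(d_naive(X[0], X[1]), d_scaled(X[0], X[1]), d_local(X[0], X[1]))
```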

Page 22:

CERN LHC

A proton-proton collider. Four detectors are connected by a large ring.

Page 23:

Clustering examples from high energy physics - I
A proton-proton collision event

We call a collimated bunch of hadrons a hadron jet. A hadron jet comes from a colored particle.

Page 24:

Clustering examples from high energy physics - II
Finding events with specific clusters

[Figure: event selection cuts.
  - Higgs jet with p_T > 200 GeV and W/Z jet with p_T > 200 GeV, both within |η| = 2.5.
  - The 3rd-hardest jet has p_T < 30 GeV; no leptons with p_T > 10 GeV.
  - Both leading jets are required to satisfy the τ₂^rest < 0.08 cut and the cos θ_s < 0.8 cut.
  - Both leading subjets of the Higgs candidate jet are required to be b-tagged.]

Pages 25-30:

Clustering examples from high energy physics - III
Finding a cluster with N sub-clusters

To find a deformed object, define a local, deformed distance measure.

[Figure, built up over several animation frames: a particle X decays, and in the lab frame its decay products form a jet J_X. For an ideally reconstructed jet, X's decay products are exactly the jet's constituent particles, so

p_X = p_{J_X} ≡ Σ_{i∈J_X} p_i.

Boosting by p_{J_X} → p_{J_X,rest} takes the jet to its rest frame (equivalently, X's rest frame). Rest-frame subjets are clustered in the jet rest frame, using SISCone jets in spherical coordinates, and can then be viewed back in the lab frame.]
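A hedged sketch of the boost step in the figure (my illustration; four-vectors are written (E, px, py, pz) and the toy constituents are invented):

```python
import numpy as np

def boost_to_rest(p_jet, p):
    """Boost four-vector p = (E, px, py, pz) into the rest frame of p_jet."""
    E, vec = p_jet[0], p_jet[1:]
    m = np.sqrt(E**2 - vec @ vec)           # jet invariant mass
    beta = vec / E                          # boost velocity
    gamma = E / m
    bp = beta @ p[1:]
    E_rest = gamma * (p[0] - bp)
    p_rest = p[1:] + ((gamma - 1) * bp / (beta @ beta) - gamma * p[0]) * beta
    return np.concatenate([[E_rest], p_rest])

# Toy jet made of two constituents (illustrative numbers only):
c1 = np.array([100.0, 99.0, 10.0, 5.0])
c2 = np.array([80.0, 78.0, -12.0, 8.0])
p_jet = c1 + c2                             # p_X = p_JX = sum of constituents
print(boost_to_rest(p_jet, p_jet).round(6)) # -> (m, 0, 0, 0): the jet at rest
```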

Page 31:

Plan

1 Probability in Science

2 Machine Learning

3 Machine Learning in Action

4 Opportunities in Machine Learning

Page 32:

Computer Science Topics in Machine Learning

Distributed storage and processing: Hadoop, Apache Spark, ...

Distributed computing: MPI (too low-level) vs PGAS models (IBM X10, Cray Chapel, ...).

Databases: SQL queries? Distributed? ACID?

Avoid over-specification: declarative over imperative. Critical for optimization, auto-tuning, and auto-configuration.

Optimization for concurrency: e.g., addition + associativity = map-reduce addition (see the sketch after this list).

Vectorization: human effort is too expensive to write SIMD by hand, and energy is too expensive not to use it.

Heterogeneous computing: CPU and/or GPU?

Scaling at large: minimizing communication cost; topology-aware algorithms.

How can computer languages help to check numerical programs?
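A minimal sketch of that associativity point (my example; the process pool is just one possible backend):

```python
from functools import reduce
from multiprocessing import Pool

# Because addition is associative (and commutative), a big sum can be
# split into chunks, reduced independently, then combined: map-reduce.
def chunk_sum(chunk):
    return reduce(lambda a, b: a + b, chunk, 0)

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]        # 4 independent shards

if __name__ == "__main__":
    with Pool(4) as pool:
        partial = pool.map(chunk_sum, chunks)  # map: per-chunk reduction
    print(reduce(lambda a, b: a + b, partial)) # reduce: combine the partials
    # Same answer as sum(data): associativity makes the split legal.
```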

Page 33:

Open Questions in Machine Learning

How do we decide whether there are clusters at all?

How do we embed the symmetries of data?
  - Data have various symmetries: rotation, re-scaling, ...
  - How can we compare, say, images of different sizes?
  - In fact, humans are not very good at this (can you read a book upside down without difficulty?).

How do we select the correct algorithm for a given question?

A human brain does not need a million images to learn 'animal'. How?

Why would a simulator of the brain (~10 watts) consume a few GW?

How do we devise parallel and scalable algorithms, instead of parallelizing sequential ones?

Page 34:

Opportunities in Machine Learning

Science needs data. Historically, theory innovations often follow an explosion of experimental data.

Data:
  - Internet traffic, storage, and contents: doubling every few years.
  - The rise of public data sets.

Demands:
  - Data sizes will continue to grow exponentially; people want to use them.
  - Directories/files will soon be obsolete; what will be the successor? Key-value stores. How do we generate keys?

Tools:
  - Computing power and efficiency are also improving exponentially.
  - The rise of distributed computing.
  - Self-check and self-management tools are rapidly evolving.

Theory:
  - The only thing missing at this point.
  - Machine learning?!