TRANSCRIPT
What is Machine Learning?
o AI versus ML versus DL
o ML as an inverse problem
o ML Inference
Part 1: Introduction to Machine Learning
• What is machine learning?
– AI versus machine learning versus Deep Learning
– Machine learning and inference
– Single layer neural networks
• Loss functions
– Supervised training
– Loss minimization
– Minimum mean squared error (MMSE) loss function
– Inference versus training
• Gradient descent optimization
– Definition of gradient descent
– Computing the function and loss gradient
– Convex sets and functions
– Local minimum, saddle points, and global minimum
– Optimization theorems
• Training and Generalization
– Overfitting, underfitting, and generalization error
– Training data, validation data, and testing data
– Interpreting training and validation error
• Probability and Estimation
– Probability, random variables, marginal and conditional expectation
– Frequentist and Bayesian estimation, ML and MAP estimators
– The bias/variance tradeoff
AI versus ML versus DL
§ Artificial Intelligence (AI):
– Create computers that behave intelligently, like humans
– Alan Turing 1950: Turing test
– Broadly defined:
• Artificial general intelligence and Strong AI: Data on Star Trek
• Weak AI and the Chinese room (Searle 1980): Exhibits intelligent behavior but is not conscious
• Narrow AI: Task-specific AI such as Siri
§ Machine Learning (ML):
– Train an algorithm to reproduce answers from data
§ Deep Learning (DL):
– A particularly successful ML method based on deep sequences of neural networks
[Figure: nested circles showing DL ⊂ ML ⊂ AI. Source: Wikimedia Commons]
Inverse Problems
§ Determine some unknown quantity from available data.
– y: what we can measure or observe
– x: what we would like to know
§ This is an inverse problem:
– Computer vision, sensing, demodulation, speech recognition, etc.
– Business analytics: What movie will the customer like best?
– Science: What is the structure of this particle?
[Diagram: The World maps the unknown x to the observed data y]
Solving an Inverse Problem
§ Goal of Machine Learning (ML): Solve this inverse problem
§ Observations:
– The answer, x̂, is usually not equal to the unknown, x.
– But hopefully, x̂ is close to x.
§ Questions:
– How do we compute the inverse?
– Is there a best inverse?
§ Mick Jagger's Theorem:
– You can't always get what you want. But if you try sometimes, you just might find you get what you need.
[Diagram: The World maps the unknown x to the data y; inversion or inference maps y to the estimate x̂]
Machine Learning: Inference
§ Machine Learning approach
§ Comments
– Adjust the parameter θ to achieve the "best," or at least a "good," answer
– x̂ has a "hat" because it is an estimate (i.e., a guess) of the true unknown x.
– y, x̂, and θ are usually finite-dimensional vectors.
[Diagram: the inference function maps the data y, using the parameters θ, to the estimate x̂]
ML: Mathematical Inference
§ Mathematical representation of ML inference
§ Questions:
– What family of functions do we choose for f_θ(⋅)?
– What is the dimension of the parameter vector θ?
– What is our goal?
– How do we determine the best value of θ?
$\hat{x} = f_\theta(y)$

where f_θ is the inference function, θ the parameters, y the data, and x̂ the estimate.
Single Layer Neural Networks
o Mathematical representation
o Graphical representation
o One-hot encoding
o Activation functions
What ML Function should we use?
§ Possible choices
– Support vector machines (SVM)
– Radial basis functions (RBF)
– Gaussian mixture functions
§ Neural Networks
– Very high capacity/model order
– Easy to train with modern analytical/computational tools
– Shallow neural networks: Use one easy-to-train layer.
– Deep neural networks: Train a hierarchical stack of layers
$\hat{x} = A f(By)$

[Diagram: y → single-layer network → x̂]
Single Layer Dense NN
§ Single layer NN, graphically
– Mathematically, x̂ = A f(By + b), where
• B ∈ ℜ^{N1×N0} is a matrix of multiplicative weights.
• b ∈ ℜ^{N1} is a column vector of additive offsets.
• f : ℜ^{N1} → ℜ^{N1} is a point-wise activation function.
• A ∈ ℜ^{N2×N1} is a matrix of multiplicative weights.
• Typical activation function: logistic sigmoid

$f_i(x) = \frac{1}{1 + e^{-x_i}}$
[Diagram: y → B → (+ b) → f(⋅) → A → x̂]
Point-Wise Activation Functions
• Logistic sigmoid function

$f_i(x) = \frac{1}{1 + e^{-x_i}}$

• Rectified linear unit (ReLU)

$f_i(x) = \begin{cases} 0 & \text{if } x_i \le 0 \\ x_i & \text{if } x_i > 0 \end{cases}$

• Leaky ReLU

$f_i(x) = \begin{cases} \alpha x_i & \text{if } x_i \le 0 \\ x_i & \text{if } x_i > 0 \end{cases}$
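As a concrete aid, here is a minimal NumPy sketch of the three point-wise activations above (the function names and the leaky-ReLU slope alpha=0.01 are my own choices, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, component-wise: f_i(x) = 1 / (1 + e^{-x_i})
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU: f_i(x) = 0 for x_i <= 0, and x_i for x_i > 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: f_i(x) = alpha * x_i for x_i <= 0, and x_i for x_i > 0
    return np.where(x > 0, x, alpha * x)
```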
Point-Wise Activation Function
§ Point-wise activation function, f : ℜ^N → ℜ^N
[Diagram: each input component x_i passes through its own f_i(⋅) to produce the corresponding output component]
Gradient of Point-Wise Activation Function
§ Gradient Matrix is:
– Diagonal
– Sparse (most entries are zero)
– Fast to compute and apply
$\nabla f(x) = \begin{bmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_N \end{bmatrix} \in \Re^{N \times N}$ (the gradient matrix), with entries $[\nabla f(x)]_{i,j} = \frac{\partial f_i(x)}{\partial x_j}$ (row index $i$, column index $j$),

so the only non-zero entries are the diagonal terms $d_i = \frac{\partial f_i(x_i)}{\partial x_i}$.
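For example, a hedged sketch of this diagonal gradient in the sigmoid case (using the standard identity f_i' = f_i (1 − f_i); the helper name is mine):

```python
import numpy as np

def sigmoid_gradient(x):
    # The gradient matrix of a point-wise activation is diagonal,
    # with d_i = df_i(x_i)/dx_i; all off-diagonal entries are zero.
    s = 1.0 / (1.0 + np.exp(-x))
    d = s * (1.0 - s)      # diagonal entries d_i
    return np.diag(d)      # dense form shown for clarity; in practice,
                           # keep only the vector d and apply it element-wise
```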
Single Layer NN: Abstract Form
§ Single layer NN, mathematical representation:

$f_\theta(y) = A f(By + b)$

where θ = (B, A, b) is the set of all NN parameters.
[Diagram: y → B → (+ b) → f(⋅) → A → x̂ = f_θ(y), with θ = (B, A, b)]
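A minimal sketch of this forward map, assuming the logistic sigmoid activation (random parameter values, and the dimensions from the flow-diagram example below, are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N0, N1, N2 = 3, 4, 2                 # input, hidden, and output dimensions

# theta = (B, A, b): all single-layer NN parameters
B = rng.normal(size=(N1, N0))        # multiplicative weights in R^{N1 x N0}
b = rng.normal(size=N1)              # additive offsets in R^{N1}
A = rng.normal(size=(N2, N1))        # multiplicative weights in R^{N2 x N1}

def f_theta(y):
    # Single-layer NN: f_theta(y) = A f(B y + b)
    hidden = 1.0 / (1.0 + np.exp(-(B @ y + b)))   # point-wise sigmoid
    return A @ hidden

y = rng.normal(size=N0)              # a data vector
x_hat = f_theta(y)                   # the estimate, in R^{N2}
```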
Single Layer NN Flow Diagram
§ Example for N0 = 3, N1 = 4, and N2 = 2.
• Approximation theorem:
– Cybenko 1989, "Approximation by Superpositions of a Sigmoidal Function" ⇒ Any continuous function can be approximated by a single layer neural network!
– But the number of hidden units might be huge!!
[Flow diagram: N0 = 3 inputs y_1, y_2, y_3; N1 = 4 hidden units; N2 = 2 outputs x̂_1, x̂_2; the edges into the hidden layer are the B matrix entries, and the edges into the output layer are the A matrix entries]
One-Hot Encoding for Classification
§ A method to encode the class of an object
– A vector y ∈ ℜ^{N0} needs to be classified into one of M possible classes.
§ Standard encoding:
– x ∈ {0, ⋯, M − 1}, where each value represents a different class
§ One-hot encoding:
– x ∈ ℜ^M s.t.

$x_j = \begin{cases} 1 & \text{if class} = j \\ 0 & \text{if class} \ne j \end{cases}$

§ Example: For M = 5 and class = 3, then x = (0, 0, 0, 1, 0)
[Diagram: y is mapped to a one-hot vector with components labeled Class 0 through Class 4]
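A short sketch of one-hot encoding (the function name is mine):

```python
import numpy as np

def one_hot(class_index, M):
    # x_j = 1 if class == j, and 0 otherwise (classes 0, ..., M-1)
    x = np.zeros(M)
    x[class_index] = 1.0
    return x

print(one_hot(3, 5))   # M = 5, class = 3  ->  [0. 0. 0. 1. 0.]
```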
Continuous Encoding for Classification
§ Define an M-dimensional simplex as

$S^M = \left\{ x \in \Re^M : \forall j,\ x_j \ge 0 \ \text{and}\ \sum_{j=1}^{M} x_j = 1 \right\}$

– Then x̂ ∈ S^M ⊂ ℜ^M
– Like a probability density for each class
§ Advantage:
– Continuous function on a convex set
– Makes optimization easier
– Allows for representation of probability densities
§ Example: For M = 5 and class = 3, then x̂ = (p_0, p_1, p_2, p_3, p_4), a probability for each class
[Figure: the simplex in 3D; y is mapped to a vector x̂ with components labeled Class 0 through Class 4]
Softmax Activation Functions
• Softmax

$f_i(x) = \frac{e^{x_i}}{\sum_j e^{x_j}}$

– Joint activation function
– Notice that f(x) ∈ S^M ⊂ ℜ^M.
– It can be interpreted as a probability density.
[Diagram: the simplex in 3D; each output component is produced by the joint softmax activation over all inputs]
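A minimal softmax sketch; subtracting max(x) before exponentiating is a standard numerical-stability trick (my addition, not from the slides) that leaves the result unchanged:

```python
import numpy as np

def softmax(x):
    # f_i(x) = e^{x_i} / sum_j e^{x_j}; the output lies on the simplex S^M
    z = np.exp(x - np.max(x))        # shift for numerical stability
    return z / np.sum(z)

s = softmax(np.array([1.0, 2.0, 3.0]))
print(s, s.sum())                    # entries are nonnegative and sum to 1
```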
Gradient of Softmax Function
§ Gradient Matrix is:
– Dense matrix (most or all entries are non-zero)
– Usually slow to compute, but some tricks in this case
$\nabla f(x) = \frac{1}{\sum_k e^{x_k}} \begin{bmatrix} e^{x_1} & & 0 \\ & \ddots & \\ 0 & & e^{x_M} \end{bmatrix} - \frac{1}{\left(\sum_k e^{x_k}\right)^2} \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_M} \end{bmatrix} \begin{bmatrix} e^{x_1} & \cdots & e^{x_M} \end{bmatrix}$

Entry-wise,

$[\nabla f(x)]_{i,j} = \frac{1}{\sum_k e^{x_k}} \left( e^{x_i}\,\delta_{i-j} - \frac{e^{x_i} e^{x_j}}{\sum_k e^{x_k}} \right) = f_i(x)\,\delta_{i-j} - f_i(x)\, f_j(x)$
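A sketch of this dense gradient using the final identity above, [∇f(x)]_{i,j} = f_i δ_{i−j} − f_i f_j:

```python
import numpy as np

def softmax_jacobian(x):
    # Dense M x M gradient matrix: diag(s) - s s^T, where s = softmax(x)
    z = np.exp(x - np.max(x))
    s = z / np.sum(z)
    return np.diag(s) - np.outer(s, s)
```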
The Loss Function
o Measuring supervised training error
o Mathematical representation
o Parameter estimation through loss minimization
o Inference and training as inverse problems
Machine Learning: Supervised Training
§ Training:
– Measure x too! This produces training data.
• Collecting training data is application specific
• It can be difficult, expensive, or even impossible
– Select θ so that the error ε = x − x̂ is small
– How do we measure "small"?
[Diagram: training data collection provides the known x; The World produces the data y; inference x̂ = f_θ(y) gives the estimate; the error ε = x − x̂ is used to adjust θ]
ML: Supervised Training
§ Find lots of training data
– (x_k, y_k) for k = 0, ⋯, K − 1
§ Define a loss function L(θ)
§ Pick θ to minimize L(θ)
[Diagram: the training pairs (x_k, y_k) feed the inference function x̂_k = f_θ(y_k); the estimates x̂_k are compared with the known x_k through the loss L(θ)]
The ML Loss Function
§ What is a loss function?
– The loss function is a measure of training error
– So for example, x_k ∈ ℜ^{N2}
– What I really mean is

$\mathcal{L}oss = \mathcal{L}(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - \hat{x}_k \right\|^2 = \frac{1}{K} \sum_{k=0}^{K-1} \sum_{j=0}^{N_2 - 1} \left( x_{k,j} - \hat{x}_{k,j} \right)^2$

But wait! Gobbledygook alert!

$\mathcal{L}oss = \mathcal{L}(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2$
Loss Function Properties
§ Facts:
– Usually called mean squared error (MSE). But technically, MSE is really defined as

$\mathrm{MSE} = E\left[ \left\| X_k - f_\theta(Y_k) \right\|^2 \right] \approx \mathcal{L}_{\mathrm{MSE}}(\theta)$

– When $\mathcal{L}(\theta) = 0$, then for all $k$, $x_k = f_\theta(y_k)$, where

$\mathcal{L}_{\mathrm{MSE}}(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2$
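A minimal sketch of this empirical MSE loss over K training pairs (names are illustrative; f_theta can be any inference function, e.g., the single-layer NN sketched earlier):

```python
import numpy as np

def mse_loss(f_theta, xs, ys):
    # L_MSE(theta) = (1/K) * sum_k || x_k - f_theta(y_k) ||^2
    K = len(xs)
    return sum(np.sum((x - f_theta(y)) ** 2) for x, y in zip(xs, ys)) / K
```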
Parameter Estimation using Loss Minimization
§ Estimate θ* by minimizing loss

$\theta^* = \arg\min_\theta \mathcal{L}_{\mathrm{MSE}}(\theta)$

Question: What is the lowest point in the USA? Death Valley

$\text{Badwater Basin} = \arg\min_{p \in \mathrm{USA}} \mathrm{Elevation}(p)$

$-282 \text{ feet} = \min_{p \in \mathrm{USA}} \mathrm{Elevation}(p)$

– "min" returns the minimum value
– "arg min" returns the parameter that minimizes the value

$\theta^* = \arg\min_\theta \mathcal{L}_{\mathrm{MSE}}(\theta) = \arg\min_\theta \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2$
What does this mean?
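It means we search over θ for the parameter value that makes the loss as small as possible. Gradient descent, covered later in the outline, is one common way to do this; as a hedged preview, here is a toy loss minimization on a one-parameter linear model (the model, step size, and data are all illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training pairs: x_k = theta_true * y_k + noise
theta_true = 2.5
ys = rng.normal(size=20)
xs = theta_true * ys + 0.1 * rng.normal(size=20)

def loss(theta):
    # L_MSE(theta) = (1/K) * sum_k (x_k - theta * y_k)^2
    return np.mean((xs - theta * ys) ** 2)

theta, step = 0.0, 0.1
for _ in range(200):
    # finite-difference gradient of the loss (illustrative only)
    grad = (loss(theta + 1e-6) - loss(theta - 1e-6)) / 2e-6
    theta -= step * grad             # gradient descent update

print(theta)                         # approaches theta* near theta_true = 2.5
```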
ML Supervised Training
§ Generate training data:
– Collect K training pairs, (x_0, y_0), (x_1, y_1), ⋯, (x_{K−1}, y_{K−1})
§ Estimate parameter:
– θ* = argmin_θ L_MSE(θ)
§ This is also an inverse problem!
– Trying to find the "hidden" or "latent" value of θ.
[Diagram: the training pairs (x_k, y_k) feed ML training, which minimizes the loss to produce θ*]
Complete ML System
§ Supervised training:
– Off-line and often slow
– Requires expensive training data
§ Online inference:
– On-line and usually needs to be fast
[Diagram: supervised training uses the training pairs (x_k, y_k) to produce the parameter estimate θ*; online inference then computes x̂ = f_{θ*}(y) on new testing data]
Two Inverse Problems in ML
§ Inference inverse problem:
– Estimate the unknown, x, from the available measurements, y.
§ Training inverse problem:
– Estimate the unknown parameter, θ*, from the training pairs {(x_k, y_k)}_{k=0}^{K−1}.
[Diagram: ML training maps the training data (x_k, y_k) to the parameters θ*; machine learning inference maps the data y to the estimate x̂]