TRANSCRIPT
What is Machine Learning?
o AI versus ML versus DL
o ML as an inverse problem
o ML Inference
Part 1: Introduction to Machine Learning
• What is machine learning?
– AI versus machine learning versus Deep Learning
– Machine learning and inference
– Single layer neural networks
• Loss functions
– Supervised training
– Loss minimization
– Minimum mean squared error (MMSE) loss function
– Inference versus training
• Gradient descent optimization
– Definition of gradient descent
– Computing the function and loss gradient
– Convex sets and functions
– Local minimum, saddle points, and global minimum
– Optimization theorems
• Training and Generalization
– Overfitting, underfitting, and generalization error
– Training data, validation data, and testing data
– Interpreting training and validation error
• Probability and Estimation
– Probability, random variables, marginal and conditional expectation
– Frequentist and Bayesian estimation, ML and MAP estimators
– The bias/variance tradeoff
AI versus ML versus DL
§ Artificial Intelligence (AI):
– Create computers that behave intelligently, like humans
– Alan Turing 1950: Turing test
– Broadly defined:
• Artificial general intelligence and Strong AI: Data on Star Trek
• Weak AI and the Chinese room (Searle 1980): Exhibits intelligent behavior but is not conscious
• Narrow AI: Task-specific AI such as Siri
§ Machine Learning (ML):
– Train an algorithm to reproduce answers from data
§ Deep Learning (DL):
– A particularly successful ML method based on deep sequences of neural networks
[Figure: nested circles showing DL ⊂ ML ⊂ AI. Source: Wikimedia Commons]
Inverse Problems
§ Determine some unknown quantity from available data.
– y: what we can measure or observe
– x: what we would like to know
§ This is an inverse problem:
– Computer vision, sensing, demodulation, speech recognition, etc.
– Business analytics: What movie will the customer like best?
– Science: What is the structure of this particle?
[Diagram: The World maps the unknown x to the observed data y]
Solving an Inverse Problem
§ Goal of Machine Learning (ML): Solve this inverse problem
§ Observations:
– The answer, x̂, is usually not equal to the unknown, x.
– But hopefully, x̂ is close to x.
§ Questions:
– How do we compute the inverse?
– Is there a best inverse?
§ Mick Jagger's Theorem:
– You can't always get what you want. But if you try sometimes, you just might find you get what you need.
[Diagram: The World maps the unknown x to the data y; inversion or inference maps y to the estimate x̂]
Machine Learning: Inference
§ Machine Learning approach
§ Comments
– Adjust the parameter θ to achieve the "best," or at least a "good," answer
– x̂ has a "hat" because it is an estimate (i.e., a guess) of the true unknown x.
– y, x̂, and θ are usually finite-dimensional vectors.
[Diagram: the inference function maps the data y, using the parameters θ, to the estimate x̂]
ML: Mathematical Inference
§ Mathematical representation of ML inference
§ Questions:
– What family of functions do we choose for f_θ(⋅)?
– What is the dimension of the parameter vector θ?
– What is our goal?
– How do we determine the best value of θ?
$\hat{x} = f_\theta(y)$

where f_θ is the inference function, θ the parameters, y the data, and x̂ the estimate.
Single Layer Neural Networks
o Mathematical representation
o Graphical representation
o One-hot encoding
o Activation functions
What ML Function should we use?
§ Possible choices
– Support vector machines (SVM)
– Radial basis functions (RBF)
– Gaussian mixture functions
§ Neural Networks
– Very high capacity/model order
– Easy to train with modern analytical/computational tools
– Shallow neural networks: Use one easy-to-train layer.
– Deep neural networks: Train a hierarchical stack of layers
$\hat{x} = A f(By)$

[Diagram: y → single-layer network → x̂]
Single Layer Dense NN
§ Single layer NN, graphically
– Mathematically, x̂ = A f(By + b), where
• B ∈ ℜ^{N1×N0} is a matrix of multiplicative weights.
• b ∈ ℜ^{N1} is a column vector of additive offsets.
• f : ℜ^{N1} → ℜ^{N1} is a point-wise activation function.
• A ∈ ℜ^{N2×N1} is a matrix of multiplicative weights.
• Typical activation function: logistic sigmoid

$f_i(x) = \frac{1}{1 + e^{-x_i}}$
[Diagram: y → B → (+ b) → f(⋅) → A → x̂]
Point-Wise Activation Functions
• Logistic sigmoid function

$f_i(x) = \frac{1}{1 + e^{-x_i}}$

• Rectified linear unit (ReLU)

$f_i(x) = \begin{cases} 0 & \text{if } x_i \le 0 \\ x_i & \text{if } x_i > 0 \end{cases}$

• Leaky ReLU

$f_i(x) = \begin{cases} \alpha x_i & \text{if } x_i \le 0 \\ x_i & \text{if } x_i > 0 \end{cases}$
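As a concrete aid, here is a minimal NumPy sketch of the three point-wise activations above (the function names and the leaky-ReLU slope alpha=0.01 are my own choices, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, component-wise: f_i(x) = 1 / (1 + e^{-x_i})
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU: f_i(x) = 0 for x_i <= 0, and x_i for x_i > 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: f_i(x) = alpha * x_i for x_i <= 0, and x_i for x_i > 0
    return np.where(x > 0, x, alpha * x)
```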
Point-Wise Activation Function
§ Point-wise activation function, f : ℜ^N → ℜ^N
[Diagram: each input component x_i passes through its own f_i(⋅) to produce the corresponding output component]
Gradient of Point-Wise Activation Function
§ Gradient Matrix is:
– Diagonal
– Sparse (most entries are zero)
– Fast to compute and apply
$\nabla f(x) = \begin{bmatrix} d_1 & & 0 \\ & \ddots & \\ 0 & & d_N \end{bmatrix} \in \Re^{N \times N}$ (the gradient matrix), with entries $[\nabla f(x)]_{i,j} = \frac{\partial f_i(x)}{\partial x_j}$ (row index $i$, column index $j$),

so the only non-zero entries are the diagonal terms $d_i = \frac{\partial f_i(x_i)}{\partial x_i}$.
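For example, a hedged sketch of this diagonal gradient in the sigmoid case (using the standard identity f_i' = f_i (1 − f_i); the helper name is mine):

```python
import numpy as np

def sigmoid_gradient(x):
    # The gradient matrix of a point-wise activation is diagonal,
    # with d_i = df_i(x_i)/dx_i; all off-diagonal entries are zero.
    s = 1.0 / (1.0 + np.exp(-x))
    d = s * (1.0 - s)      # diagonal entries d_i
    return np.diag(d)      # dense form shown for clarity; in practice,
                           # keep only the vector d and apply it element-wise
```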
Single Layer NN: Abstract Form
§ Single layer NN, mathematical representation:

$f_\theta(y) = A f(By + b)$

where θ = (B, A, b) is the set of all NN parameters.
[Diagram: y → B → (+ b) → f(⋅) → A → x̂ = f_θ(y), with θ = (B, A, b)]
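A minimal sketch of this forward map, assuming the logistic sigmoid activation (random parameter values, and the dimensions from the flow-diagram example below, are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N0, N1, N2 = 3, 4, 2                 # input, hidden, and output dimensions

# theta = (B, A, b): all single-layer NN parameters
B = rng.normal(size=(N1, N0))        # multiplicative weights in R^{N1 x N0}
b = rng.normal(size=N1)              # additive offsets in R^{N1}
A = rng.normal(size=(N2, N1))        # multiplicative weights in R^{N2 x N1}

def f_theta(y):
    # Single-layer NN: f_theta(y) = A f(B y + b)
    hidden = 1.0 / (1.0 + np.exp(-(B @ y + b)))   # point-wise sigmoid
    return A @ hidden

y = rng.normal(size=N0)              # a data vector
x_hat = f_theta(y)                   # the estimate, in R^{N2}
```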
Single Layer NN Flow Diagram
§ Example for N0 = 3, N1 = 4, and N2 = 2.
• Approximation theorem:
– Cybenko 1989, "Approximation by Superpositions of a Sigmoidal Function" ⇒ Any continuous function can be approximated by a single layer neural network!
– But the number of hidden units might be huge!!
[Flow diagram: N0 = 3 inputs y_1, y_2, y_3; N1 = 4 hidden units; N2 = 2 outputs x̂_1, x̂_2; the edges into the hidden layer are the B matrix entries, and the edges into the output layer are the A matrix entries]
One-Hot Encoding for Classification
§ A method to encode the class of an object
– A vector y ∈ ℜ^{N0} needs to be classified into one of M possible classes.
§ Standard encoding:
– x ∈ {0, ⋯, M − 1}, where each value represents a different class
§ One-hot encoding:
– x ∈ ℜ^M s.t.

$x_j = \begin{cases} 1 & \text{if class} = j \\ 0 & \text{if class} \ne j \end{cases}$

§ Example: For M = 5 and class = 3, then x = (0, 0, 0, 1, 0)
[Diagram: y is mapped to a one-hot vector with components labeled Class 0 through Class 4]
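A short sketch of one-hot encoding (the function name is mine):

```python
import numpy as np

def one_hot(class_index, M):
    # x_j = 1 if class == j, and 0 otherwise (classes 0, ..., M-1)
    x = np.zeros(M)
    x[class_index] = 1.0
    return x

print(one_hot(3, 5))   # M = 5, class = 3  ->  [0. 0. 0. 1. 0.]
```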
Continuous Encoding for Classification
§ Define an M-dimensional simplex as

$S^M = \left\{ x \in \Re^M : \forall j,\ x_j \ge 0 \ \text{and}\ \sum_{j=1}^{M} x_j = 1 \right\}$

– Then x̂ ∈ S^M ⊂ ℜ^M
– Like a probability density for each class
§ Advantage:
– Continuous function on a convex set
– Makes optimization easier
– Allows for representation of probability densities
§ Example: For M = 5 and class = 3, then x̂ = (p_0, p_1, p_2, p_3, p_4), a probability for each class
[Figure: the simplex in 3D; y is mapped to a vector x̂ with components labeled Class 0 through Class 4]
Softmax Activation Functions
• Softmax

$f_i(x) = \frac{e^{x_i}}{\sum_j e^{x_j}}$

– Joint activation function
– Notice that f(x) ∈ S^M ⊂ ℜ^M.
– It can be interpreted as a probability density.
[Diagram: the simplex in 3D; each output component is produced by the joint softmax activation over all inputs]
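A minimal softmax sketch; subtracting max(x) before exponentiating is a standard numerical-stability trick (my addition, not from the slides) that leaves the result unchanged:

```python
import numpy as np

def softmax(x):
    # f_i(x) = e^{x_i} / sum_j e^{x_j}; the output lies on the simplex S^M
    z = np.exp(x - np.max(x))        # shift for numerical stability
    return z / np.sum(z)

s = softmax(np.array([1.0, 2.0, 3.0]))
print(s, s.sum())                    # entries are nonnegative and sum to 1
```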
Gradient of Softmax Function
§ Gradient Matrix is:
– Dense matrix (most or all entries are non-zero)
– Usually slow to compute, but some tricks in this case
$\nabla f(x) = \frac{1}{\sum_k e^{x_k}} \begin{bmatrix} e^{x_1} & & 0 \\ & \ddots & \\ 0 & & e^{x_M} \end{bmatrix} - \frac{1}{\left(\sum_k e^{x_k}\right)^2} \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_M} \end{bmatrix} \begin{bmatrix} e^{x_1} & \cdots & e^{x_M} \end{bmatrix}$

Entry-wise,

$[\nabla f(x)]_{i,j} = \frac{1}{\sum_k e^{x_k}} \left( e^{x_i}\,\delta_{i-j} - \frac{e^{x_i} e^{x_j}}{\sum_k e^{x_k}} \right) = f_i(x)\,\delta_{i-j} - f_i(x)\, f_j(x)$
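A sketch of this dense gradient using the final identity above, [∇f(x)]_{i,j} = f_i δ_{i−j} − f_i f_j:

```python
import numpy as np

def softmax_jacobian(x):
    # Dense M x M gradient matrix: diag(s) - s s^T, where s = softmax(x)
    z = np.exp(x - np.max(x))
    s = z / np.sum(z)
    return np.diag(s) - np.outer(s, s)
```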
The Loss Function
o Measuring supervised training error
o Mathematical representation
o Parameter estimation through loss minimization
o Inference and training as inverse problems
Machine Learning: Supervised Training
§ Training:
– Measure x too! This produces training data.
• Collecting training data is application specific
• It can be difficult, expensive, or even impossible
– Select θ so that the error ε = x − x̂ is small
– How do we measure "small"?
[Diagram: training data collection provides the known x; The World produces the data y; inference x̂ = f_θ(y) gives the estimate; the error ε = x − x̂ is used to adjust θ]
ML: Supervised Training
§ Find lots of training data
– (x_k, y_k) for k = 0, ⋯, K − 1
§ Define a loss function L(θ)
§ Pick θ to minimize L(θ)
[Diagram: the training pairs (x_k, y_k) feed the inference function x̂_k = f_θ(y_k); the estimates x̂_k are compared with the known x_k through the loss L(θ)]
The ML Loss Function
§ What is a loss function?
– The loss function is a measure of training error
– So for example, x_k ∈ ℜ^{N2}
– What I really mean is

$\mathcal{L}oss = \mathcal{L}(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - \hat{x}_k \right\|^2 = \frac{1}{K} \sum_{k=0}^{K-1} \sum_{j=0}^{N_2 - 1} \left( x_{k,j} - \hat{x}_{k,j} \right)^2$

But wait! Gobbledygook alert!

$\mathcal{L}oss = \mathcal{L}(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2$
Loss Function Properties
§ Facts:
– Usually called mean squared error (MSE). But technically, MSE is really defined as

$\mathrm{MSE} = E\left[ \left\| X_k - f_\theta(Y_k) \right\|^2 \right] \approx \mathcal{L}_{\mathrm{MSE}}(\theta)$

– When $\mathcal{L}(\theta) = 0$, then for all $k$, $x_k = f_\theta(y_k)$, where

$\mathcal{L}_{\mathrm{MSE}}(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2$
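A minimal sketch of this empirical MSE loss over K training pairs (names are illustrative; f_theta can be any inference function, e.g., the single-layer NN sketched earlier):

```python
import numpy as np

def mse_loss(f_theta, xs, ys):
    # L_MSE(theta) = (1/K) * sum_k || x_k - f_theta(y_k) ||^2
    K = len(xs)
    return sum(np.sum((x - f_theta(y)) ** 2) for x, y in zip(xs, ys)) / K
```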
Parameter Estimation using Loss Minimization
§ Estimate θ* by minimizing loss

$\theta^* = \arg\min_\theta \mathcal{L}_{\mathrm{MSE}}(\theta)$

Question: What is the lowest point in the USA? Death Valley

$\text{Badwater Basin} = \arg\min_{p \in \mathrm{USA}} \mathrm{Elevation}(p)$

$-282 \text{ feet} = \min_{p \in \mathrm{USA}} \mathrm{Elevation}(p)$

– "min" returns the minimum value
– "arg min" returns the parameter that minimizes the value

$\theta^* = \arg\min_\theta \mathcal{L}_{\mathrm{MSE}}(\theta) = \arg\min_\theta \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2$
What does this mean?
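It means we search over θ for the parameter value that makes the loss as small as possible. Gradient descent, covered later in the outline, is one common way to do this; as a hedged preview, here is a toy loss minimization on a one-parameter linear model (the model, step size, and data are all illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training pairs: x_k = theta_true * y_k + noise
theta_true = 2.5
ys = rng.normal(size=20)
xs = theta_true * ys + 0.1 * rng.normal(size=20)

def loss(theta):
    # L_MSE(theta) = (1/K) * sum_k (x_k - theta * y_k)^2
    return np.mean((xs - theta * ys) ** 2)

theta, step = 0.0, 0.1
for _ in range(200):
    # finite-difference gradient of the loss (illustrative only)
    grad = (loss(theta + 1e-6) - loss(theta - 1e-6)) / 2e-6
    theta -= step * grad             # gradient descent update

print(theta)                         # approaches theta* near theta_true = 2.5
```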
ML Supervised Training
§ Generate training data:
– Collect K training pairs, (x_0, y_0), (x_1, y_1), ⋯, (x_{K−1}, y_{K−1})
§ Estimate parameter:
– θ* = argmin_θ L_MSE(θ)
§ This is also an inverse problem!
– Trying to find the "hidden" or "latent" value of θ.
[Diagram: the training pairs (x_k, y_k) feed ML training, which minimizes the loss to produce θ*]
Complete ML System
§ Supervised training:
– Off-line and often slow
– Requires expensive training data
§ Online inference:
– On-line and usually needs to be fast
[Diagram: supervised training uses the training pairs (x_k, y_k) to produce the parameter estimate θ*; online inference then computes x̂ = f_{θ*}(y) on new testing data]
Two Inverse Problems in ML
§ Inference inverse problem:
– Estimate the unknown, x, from the available measurements, y.
§ Training inverse problem:
– Estimate the unknown parameter, θ*, from the training pairs {(x_k, y_k)}_{k=0}^{K−1}.
[Diagram: ML training maps the training data (x_k, y_k) to the parameters θ*; machine learning inference maps the data y to the estimate x̂]