More Machine Learning: Linear Regression, Squared Error, L1 and L2 Regularization, Gradient Descent


TRANSCRIPT

  • Slide 1
  • More Machine Learning: Linear Regression, Squared Error, L1 and L2 Regularization, Gradient Descent
  • Slide 2
  • Recall: Key Components of Intelligent Agents. Representation Language: graphs, Bayes Nets. Inference Mechanism: A*, variable elimination, Gibbs sampling. Learning Mechanism: Maximum Likelihood, Laplace Smoothing, and many more: linear regression, perceptron, k-Nearest Neighbor, ... Evaluation Metric: likelihood, and many more: squared error, 0-1 loss, conditional likelihood, precision/recall, ...
  • Slide 3
  • Recall: Types of Learning. The techniques we have discussed so far are examples of a particular kind of learning. Supervised: the training examples included the correct labels or outputs. Vs. unsupervised (or semi-supervised, or distantly-supervised, ...): none (or some, or only part) of the labels in the training data are known. Parameter estimation: we only tried to learn the parameters in the BN, not the structure of the BN graph. Vs. structure learning: the BN graph is not given as an input, and the learning algorithm's job is to figure out what the graph should look like. The distinctions below aren't actually about the learning algorithm itself, but rather about the type of model being learned. Classification: the output is a discrete value, like Happy or Not Happy, or Spam or Ham. Vs. regression: the output is a real number. Generative: the model of the data represents a full joint distribution over all relevant variables. Vs. discriminative: the model assumes some fixed subset of the variables will always be inputs or evidence, and it creates a distribution for the remaining variables conditioned on the evidence variables. Parametric vs. nonparametric: I will explain this later. We won't talk much about structure learning, but we will cover some other kinds of learning (regression, unsupervised, discriminative, nonparametric, ...) in later lectures.
  • Slide 4
  • Regression vs. Classification Our NBC spam detector was a classifier: the output Y was one of two options, Ham or Spam. More generally, classifiers give an output from a (usually small) finite (or countably infinite) set of options. E.g., predicting who will win the presidency in the next election is a classification problem (finite set of possible outcomes: US citizens). Regression models give a real number as output. E.g., predicting what the temperature will be tomorrow is a regression problem. Any real number greater than or equal to 0 (Kelvin) is a possible outcome.
  • Slide 5
  • Quiz: Regression vs. Classification. For each prediction task below, determine whether regression or classification is more appropriate.
      - Predict who will win the Super Bowl next year
      - Predict the gender of a baby when it's born
      - Predict the weight of a child one year from now
      - Predict the average life expectancy of all babies born today
      - Predict the price of Apple, Inc.'s stock at the close of trading tomorrow
      - Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow
  • Slide 6
  • Answers: Regression vs. Classification. For each prediction task below, determine whether regression or classification is more appropriate.
      - Predict who will win the Super Bowl next year: Classification
      - Predict the gender of a baby when it's born: Classification
      - Predict the weight of a child one year from now: Regression
      - Predict the average life expectancy of all babies born today: Regression
      - Predict the price of Apple, Inc.'s stock at the close of trading tomorrow: Regression
      - Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow: Classification
  • Slide 7
  • Concrete Example: Suppose I want to buy a house that's 2000 square feet. Predict how much it will cost. [Figure: scatter plot of house price vs. square footage; the prediction read off the plot is about 175,000.]
  • Slide 8
  • More realistic data. [Figure: reported crime statistics for U.S. counties; violent crime per capita plotted against the percentage of the population under the federal poverty level.]
  • Slide 9
  • Linear Regression. Suppose there are N input variables, X_1, ..., X_N (all real numbers). A linear regression is a function that looks like this: Y = w_0 + w_1 X_1 + w_2 X_2 + ... + w_N X_N. The w_i variables are called weights or parameters. Each one is a real number. The set of all functions that look like this (one function for each choice of weights w_0 through w_N) is called the hypothesis class for linear regression.
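    A minimal sketch of one hypothesis from this class as code (the names predict, weights, and inputs are illustrative, not from the slides):

        def predict(weights, inputs):
            # weights = [w_0, w_1, ..., w_N]; inputs = [x_1, ..., x_N] for one example
            y = weights[0]                          # intercept term w_0
            for w, x in zip(weights[1:], inputs):
                y += w * x                          # add w_j * x_j for each input variable
            return y

        # Example: the hypothesis Y = 1.0 + 0.5*X_1 + 2.0*X_2 evaluated at (X_1, X_2) = (3, 4)
        print(predict([1.0, 0.5, 2.0], [3.0, 4.0]))  # -> 10.5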
  • Slide 10
  • Hypotheses. In this example, there is only one input variable: X_1 is square footage. The hypothesis class is all functions Y = w_0 + w_1 * (square footage). Several example elements of the hypothesis class are drawn, e.g. Y = 100 + 900*X_1, Y = 55100 + 900*X_1, and Y = 80000 + 270*X_1.
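    Each drawn line is one fully specified function. For instance, evaluated at X_1 = 2000 square feet these three hypotheses predict 100 + 900*2000 = 1,800,100; 55,100 + 900*2000 = 1,855,100; and 80,000 + 270*2000 = 620,000 respectively. They are just illustrative members of the hypothesis class, not necessarily good fits to the data.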
  • Slide 11
  • Learning for Linear Regression. Linear regression tells us a whole set of possible functions to use for prediction. How do we choose the best one from this set? This is the learning problem for linear regression. Input: a set of training examples, where each example contains a value for (X_1, ..., X_N, Y). Output: a set of weights (w_0, ..., w_N) for the best-fitting linear regression model.
  • Slide 12
  • Quiz: Learning for Linear Regression. Data (X, Y) pairs: (10, 80), (30, 40), (15, 70), (55, -10). For this data, what's the best-fit linear regression model?
  • Slide 13
  • Answer: Learning for Linear Regression. Data (X, Y) pairs: (10, 80), (30, 40), (15, 70), (55, -10). Using the first two points:
        80 = w_0 + w_1 * 10
        40 = w_0 + w_1 * 30
    Subtracting the second equation from the first:
        80 - 40 = w_1 * 10 - w_1 * 30
        40 = w_1 * (-20)
        w_1 = -2
    Substituting back into the first equation:
        80 = w_0 + (-2) * 10
        w_0 = 100
    So the best-fit model is Y = 100 + (-2) * X.
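    As a quick check, the remaining two points also lie on this line: 100 - 2*15 = 70 and 100 - 2*55 = -10.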
  • Slide 14
  • Linear Regression with Noisy Data. In the previous example, we could use only two points and find a line that passed through all of the remaining points. In this example, the points are only approximately linear: no single line passes through all points exactly. We'll need a more complex algorithm to handle this.
  • Slide 15
  • Quadratic Loss (a.k.a. Squared Error). [Table: M training examples; example i consists of inputs X_{i1}, X_{i2}, ..., X_{iN} and output Y_i, for i = 1, ..., M.]
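    The loss formula on this slide did not survive extraction; the standard quadratic (squared-error) loss in this notation is:

        Loss(w_0, ..., w_N) = Σ_{i=1..M} ( Y_i - (w_0 + w_1 X_{i1} + ... + w_N X_{iN}) )^2

    That is, for each training example, compare the model's prediction to the true output Y_i, square the difference, and sum over all M examples.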
  • Slide 16
  • Objective Function
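    The formula on this slide is likewise not preserved; presumably it is the usual least-squares objective: choose the weights that minimize the quadratic loss on the training data,

        (w_0*, ..., w_N*) = argmin over (w_0, ..., w_N) of Σ_{i=1..M} ( Y_i - (w_0 + w_1 X_{i1} + ... + w_N X_{iN}) )^2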
  • Slide 17
  • Closed-form Solution for 1 input variable
  • Slide 18
  • Slide 19
  • Closed-form Result
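    The result itself is not in the transcript; the standard closed-form minimizer of the quadratic loss with one input variable (obtained by setting the partial derivatives with respect to w_0 and w_1 to zero) is:

        w_1 = ( M * Σ_i X_i Y_i - (Σ_i X_i)(Σ_i Y_i) ) / ( M * Σ_i X_i^2 - (Σ_i X_i)^2 )
        w_0 = ( Σ_i Y_i - w_1 * Σ_i X_i ) / M

    where the sums run over the M training examples (X_i, Y_i).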
  • Slide 20
  • Quiz: Learning for Linear Regression. Data (X, Y) pairs: (10, 80), (30, 40), (15, 70), (55, -10). Using the closed-form solution for quadratic loss, compute w_0 and w_1 for this dataset.
  • Slide 21
  • Answer: Learning for Linear Regression. Data (X, Y) pairs: (10, 80), (30, 40), (15, 70), (55, -10). Using the closed-form solution for quadratic loss, compute w_0 and w_1 for this dataset. Note that w_1 and w_0 match what we calculated before!
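    The slide's worked numbers are not preserved; carrying out the computation with the closed-form expressions above (M = 4):

        Σ X_i = 110,  Σ Y_i = 180,  Σ X_i Y_i = 2500,  Σ X_i^2 = 4250
        w_1 = (4 * 2500 - 110 * 180) / (4 * 4250 - 110^2) = -9800 / 4900 = -2
        w_0 = (180 - (-2) * 110) / 4 = 400 / 4 = 100

    which matches Y = 100 + (-2) * X from before.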
  • Slide 22
  • Overfitting and Regularization. [Figure/equation not preserved in the transcript; the surviving labels are "parameter" and "loss".]
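    The formulas themselves are not in the transcript; the L1 and L2 regularization named in the title are standardly added to the loss as a penalty on the weights, with a regularization strength λ:

        L2-regularized (ridge):  Loss(w) + λ * ( w_1^2 + w_2^2 + ... + w_N^2 )
        L1-regularized (lasso):  Loss(w) + λ * ( |w_1| + |w_2| + ... + |w_N| )

    Both penalties discourage large weights and so help prevent overfitting; the L1 penalty additionally tends to drive some weights exactly to zero. (The intercept w_0 is usually left unpenalized.)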
  • Slide 23
  • Gradient Descent. For more complex loss functions, it is often NOT POSSIBLE to find closed-form solutions. Instead, people resort to iterative methods, which find better and better parameter estimates until they converge to the best setting. We'll go over one example of this kind of method, called gradient descent.
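    Concretely, gradient descent repeatedly nudges each weight a small step downhill on the loss surface. The standard update for each weight w_j is

        w_j ← w_j - α * ∂Loss(w) / ∂w_j      for j = 0, ..., N

    where α > 0 is the learning rate (the subject of the next slide).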
  • Slide 24
  • Gradient Descent. [Equation not preserved in the transcript; the labeled quantity is the learning rate.]
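    A minimal runnable sketch of gradient descent for the one-variable squared-error objective above (the function and variable names, learning rate, and step count are illustrative choices, not from the slides):

        # Gradient descent for Y = w0 + w1 * X under quadratic (squared-error) loss.
        def gradient_descent(xs, ys, learning_rate=0.0001, steps=100000):
            w0, w1 = 0.0, 0.0                      # initial guess for the parameters
            for _ in range(steps):
                # Partial derivatives of sum_i (y_i - (w0 + w1 * x_i))**2
                grad_w0 = sum(-2 * (y - (w0 + w1 * x)) for x, y in zip(xs, ys))
                grad_w1 = sum(-2 * x * (y - (w0 + w1 * x)) for x, y in zip(xs, ys))
                # Step downhill, scaled by the learning rate
                w0 -= learning_rate * grad_w0
                w1 -= learning_rate * grad_w1
            return w0, w1

        # On the quiz dataset this approaches the exact answer (w0, w1) = (100, -2):
        print(gradient_descent([10, 30, 15, 55], [80, 40, 70, -10]))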
  • Slide 25
  • Quiz: Gradient. [Figure: LOSS as a function of the parameter w, with points a, b, and c marked on the curve.] At which of the points a, b, c is the gradient about zero? Check the boxes that apply.
  • Slide 26
  • Answer: Gradient. [Figure: LOSS as a function of the parameter w, with points a, b, and c marked on the curve.] The gradient is about zero at all three points: a, b, and c.
  • Slide 27
  • Quiz: Gradient. [Figure: LOSS as a function of the parameter w, with points a, b, and c marked on the curve.] Options: a, b, c, equal everywhere.
  • Slide 28
  • Answer: Gradient. [Figure: LOSS as a function of the parameter w, with points a, b, and c marked on the curve.] The marked option is "equal everywhere": the gradient is the same at a, b, and c.
  • Slide 29
  • Quiz: Gradient Descent. Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w? [Figure: LOSS as a function of the parameter w, with candidate initialization points a, b, and c marked on the curve.] Options: a, b, c.
  • Slide 30
  • Answer: Gradient Descent. Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w? [Figure: LOSS as a function of the parameter w, with points a, b, and c marked on the curve.] The marked answer is c.