
  • The unreasonable effectiveness of mathematics, revisited

    Big data and neuroscience

    Jaime Gómez-Ramírez

    Fundación Reina Sofía. Centre for Research in Neurodegenerative Diseases

    April 11, 2018


  • Outline


  • The effectiveness of mathematics

    Einstein: "The most incomprehensible thing about the world is that it is comprehensible"

    Wigner: "The unreasonable effectiveness of mathematics"

    Gelfand: "The unreasonable ineffectiveness of mathematics in biology"


  • The effectiveness of mathematics

    heat loss in coffee: $\frac{dQ}{dt} = A_s (T_{coffee} - T_{room})$
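    A minimal numerical sketch of this cooling law in its temperature form, $dT/dt = -k\,(T_{coffee} - T_{room})$, using a simple Euler step; the constant, temperatures, and time step are made-up illustration values, not from the talk:

```python
# Euler integration of Newton's law of cooling for a cup of coffee.
k, T_room = 0.05, 20.0          # cooling constant (1/min), room temperature (C)
T, dt = 90.0, 1.0               # initial coffee temperature (C), time step (min)

for minute in range(30):
    T += dt * (-k * (T - T_room))
print(f"temperature after 30 minutes: {T:.1f} C")
```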


  • The effectiveness of mathematics

    Wigner's 1960 essay: "the enormous usefulness of mathematics in natural science is something bordering on the mysterious"

    The typical interpretation of Wigner's text is as follows:

    premise: math concepts arise from an aesthetic impulse in humans
    premise: it is unreasonable to think that those same impulses are effective
    observation: nevertheless, it so happens that they are effective
    consequence: it follows that math concepts are unreasonably effective (assuming the aesthetic premise is valid)

    e.g. imaginary numbers, tensors. Math concepts appear and propagate


  • The effectiveness of mathematics

    Wigner did seminal work on group theory applied to discovering symmetry principles

    Group theory replaced previous methods of analysis in quantum mechanics (derided at the time as the "Gruppenpest"), finding invariants instead of seeking explicit solutions by calculus

    The goal of science is not to explain nature (the black box) but to explain the regularities in the behavior of the object: "Not the things in themselves but the relationships between the things" (Poincaré)

    The search for causal explanation in terms of mathematical principles necessitates the belief in the mathematical structure of the universe, the c-word


  • The effectiveness of mathematics

    We are "lucky" that regularities exist and that we can grasp them mathematically

    This is Newton's contribution, and this is in essence why deep learning works

    Regularities are invariant with respect to space and time: if $A, B, \ldots \to X, Y, \ldots$, then under a transformation $T$ also $T(A), T(B) \to T(X), T(Y)$. Convolutional networks exploit image invariance to work (a cat is a cat is a cat)
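    A toy illustration of the same idea in code: a convolution responds to a pattern in the same way wherever it appears, so a shifted input just shifts the response. The 1-D signal, pattern, and matched filter below are illustrative inventions, not from the talk.

```python
import numpy as np

# Translation equivariance: convolving a shifted signal gives a shifted
# response, so the same "cat detector" fires wherever the cat appears.
def conv_valid(x, w):
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(len(x) - len(w) + 1)])

pattern = np.array([1.0, 2.0, 1.0])            # the "cat"
signal = np.zeros(12); signal[2:5] = pattern   # cat at position 2
shifted = np.roll(signal, 4)                   # same cat at position 6

filt = pattern[::-1]                           # matched filter
print(np.argmax(conv_valid(signal, filt)))     # -> 2
print(np.argmax(conv_valid(shifted, filt)))    # -> 6
```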


  • $t = \sqrt{2s/g}$

    What makes it possible for us to discover regularities is the division between initial conditions and regularities.

    Laws of nature have the form IF initial conditions THEN event.

    That's why causality is so hard: we need to include/exclude all possible combinations of antecedents (initial conditions)


  • God doesn't play dice, e.g. stochastic Brownian motion

    Our knowledge of nature contains "a strange hierarchy": events we observe → laws (regularities to discover) → symmetry (invariance principles)

    The future is always uncertain, but nevertheless there are correlations -laws- that we can discover


  • AI, Machine Learning, Deep Learning

    AI ⊃ Machine Learning ⊃ Deep Learning

    ANNs are nonlinear mapping systems whose functioning principles are vaguely based on the nervous systems of mammals

    Data is the most valuable asset and computation is a cheap commodity ("information wants to be free")


  • Perceptron

    $y = f\left(\sum_k w_k x_k\right)$   (1)

    "A Logical Calculus of the Ideas Immanent in Nervous Activity", McCulloch & Pitts, 1943. "If it doesn't rain ($x_1 w_1$) and the homework is done ($x_2 w_2$), go to the movies, $y$ (output)"

    Neurons with a binary threshold activation function are analogous to first-order logic sentences

    By itself a neuron (or an ANN unit) does very little, but a sufficiently large network with appropriate structure and properly chosen weights can approximate any function with arbitrary accuracy
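    A minimal sketch of such a binary-threshold unit, wired to reproduce the slide's movie example; the weights and threshold are one possible choice, not given in the talk:

```python
import numpy as np

# McCulloch-Pitts-style unit: fire iff the weighted sum reaches the threshold.
# Here: go to the movies only if it is not raining AND the homework is done.
def neuron(x, w, threshold):
    return int(np.dot(w, x) >= threshold)

w, threshold = np.array([1.0, 1.0]), 2.0       # both inputs must be active
for not_raining in (0, 1):
    for homework_done in (0, 1):
        y = neuron(np.array([not_raining, homework_done]), w, threshold)
        print(f"not_raining={not_raining} homework_done={homework_done} -> y={y}")
```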


  • Perceptron

    A perceptron is any feedforward network of nodes with responses like equation (2) below.

    $y = f\left(\sum_k w_k x_k\right) = f(z)$   (2)

    In general, $f$ is a bounded, nondecreasing, nonlinear squashing function, e.g. the sigmoid

    $f(z) = \dfrac{1}{1 + e^{-z}}, \qquad f'(z) = \dfrac{e^{-z}}{(1 + e^{-z})^2}$
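    As a quick sanity check, the closed-form derivative above can be compared against a finite difference; the test points and step size are arbitrary choices:

```python
import numpy as np

# Compare the analytic sigmoid derivative with a central finite difference.
f = lambda z: 1.0 / (1.0 + np.exp(-z))
fprime = lambda z: np.exp(-z) / (1.0 + np.exp(-z)) ** 2

z, h = np.array([-2.0, 0.0, 1.5]), 1e-6
numeric = (f(z + h) - f(z - h)) / (2 * h)
print(np.max(np.abs(numeric - fprime(z))))     # ~1e-10: the formulas agree
```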


  • Perceptron

    Other choices are the tanh, the step function, and more recently the ReLU.

    $y = \mathrm{ReLU}(z) = \max(0, z), \qquad y' = 1 \text{ for } z > 0$

    ReLU works better and faster (constant gradient); the non-differentiable point at $z = 0$ can be handled by a smooth approximation such as the softplus $y = \ln(1 + e^{z})$

    Reduced likelihood of the gradient vanishing

    Sparsity is produced when $z \le 0$; sigmoids, on the other hand, tend to produce denser representations
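    A small numerical illustration of the last two points, using made-up random pre-activations; the sample size and comparison threshold are arbitrary:

```python
import numpy as np

# ReLU zeroes out roughly half of zero-centred pre-activations (sparsity),
# the sigmoid never does; softplus is a smooth approximation of ReLU.
rng = np.random.default_rng(0)
z = rng.standard_normal(10_000)

relu = np.maximum(0.0, z)
sigmoid = 1.0 / (1.0 + np.exp(-z))
softplus = np.log1p(np.exp(z))                 # ln(1 + e^z)

print("zeros under ReLU:   ", np.mean(relu == 0.0))
print("zeros under sigmoid:", np.mean(sigmoid == 0.0))
print("max |ReLU - softplus| for z > 2:", np.max(np.abs(relu[z > 2] - softplus[z > 2])))
```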


  • What can and can't perceptrons do?

    (Single-layer) perceptrons can correctly classify only data sets that are linearly separable (i.e. separable by a hyperplane)

    The XOR function is famously not linearly separable, and this matters because many classification problems are not linearly separable.


  • What can and can't perceptrons do?

    There are $2^{2^d}$ boolean functions of $d$ boolean input variables, and only $O(2^{d^2})$ of them are linearly separable.

    For $d = 2$, 14/16 are linearly separable (XOR and its complement are the exceptions), but for $d = 4$ only 1882/65536 are linearly separable (the $d = 2$ count is checked in the sketch after this list).

    Although it was known at the time that multilayer networks were more powerful than single-layer ones, learning algorithms for multilayer architectures were not yet known
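    A brute-force verification of the d = 2 claim; the weight grid is an ad hoc choice that happens to suffice for two inputs:

```python
import itertools
import numpy as np

# Of the 16 boolean functions of two variables, exactly 14 are linearly
# separable; XOR and XNOR are the two exceptions. A function counts as
# separable if some threshold unit (w1, w2, b) from a small grid realises it.
inputs = list(itertools.product((0, 1), repeat=2))
grid = np.arange(-2.0, 2.01, 0.5)

def separable(truth_table):
    for w1, w2, b in itertools.product(grid, grid, grid):
        outputs = tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in inputs)
        if outputs == truth_table:
            return True
    return False

tables = list(itertools.product((0, 1), repeat=4))     # all 16 truth tables
count = sum(separable(t) for t in tables)
print(count, "of", len(tables), "are linearly separable")   # 14 of 16
```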


  • Deep networks

    ANNs learn by example and use backpropagation

    If the data are well behaved, the network will learn not only the training examples but also the underlying relationships

    ANNs are adaptive and self-repairing; they also have some fault tolerance due to their redundant parallel structure (dense connectivity makes them resilient to minor damage: graceful degradation)

    Units within a layer are independent, so they can be evaluated simultaneously; e.g. a network with 2,000 nodes in two layers produces a response in 2 time steps rather than the 2,000 steps needed if each neuron had to be processed serially (dependently)

    Until the advent of GPUs this advantage was not fully exploited by computers


  • Deep networks

    Table: ANN versus real nervous system

    MLP                        Nervous system
    feedforward                recurrent
    dense (fully connected)    sparse (local)
    O(10^2)-O(10^4) units      O(10^10) neurons, O(10^15) synapses
    static                     dynamic: spike trains, synchronization, fatigue


  • A frame


  • Why is an MLP better than a single layer?

    y = mx is a system with one parameter, m: what kind of datasets can it separate? Only the linearly separable ones.

    y = sin(kx) also has one parameter, the frequency k, but it can separate any arbitrary distribution of points on the x-axis.


  • Universality of MLP

    Any bounded function can be approximated with arbitrary accuracy if enough hidden units are available: multilayer perceptrons are universal approximators (a numerical sketch follows below)

    How many layers do we need for this astounding property (universal approximation)? Kolmogorov showed that one hidden layer is sufficient

    Any continuous function from n variables to an m-dimensional output can be implemented by a network with one hidden layer

    Unfortunately the proof is not constructive, that is, it does not tell us how the weights should be chosen to produce such a function
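    A minimal sketch of the flavour of this result: one hidden layer of tanh units with random input weights, and only the output weights fitted by least squares, already approximates a smooth 1-D target well. The target function, width, and scales are illustrative choices, not from the talk.

```python
import numpy as np

# One-hidden-layer approximation of a smooth 1-D function.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 400)[:, None]
target = np.sin(2 * x).ravel() + 0.3 * x.ravel() ** 2

n_hidden = 50
W = rng.normal(scale=2.0, size=(1, n_hidden))   # random input-to-hidden weights
b = rng.normal(scale=2.0, size=n_hidden)        # random hidden biases
H = np.tanh(x @ W + b)                          # hidden activations

v, *_ = np.linalg.lstsq(H, target, rcond=None)  # fit only the output layer
print("max approximation error:", np.max(np.abs(H @ v - target)))
```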


  • How important is the universality of MLP?

    Is universal approximation a rare property? Not really: many other systems, such as polynomials, trigonometric polynomials (e.g. Fourier series), wavelets, and kernel regression systems (SVMs), also have universal approximation properties


  • Architecture

    The first layer detects edges, and the second captures the more abstract concepts of loops and straight lines; this is the hope behind having a layered structure, and it works for the reasons Wigner already pointed out


  • Gradient descent

    Cost $C(w)$; the gradient $\nabla C(w)$ is a huge column vector with one component per weight and bias (e.g. $784 \cdot 16 + 16 \cdot 16 + 16 \cdot 10 + 16 + 16 + 10$ dimensions in a small MNIST-style network), and at a minimum $\nabla C(w) = 0$. The gradient is the direction of steepest increase, so its negative gives the direction to take to decrease the error (cost) most quickly
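    A bare-bones gradient descent loop on a toy quadratic cost; the matrix, learning rate, and iteration count are illustrative, and the result is compared against the closed-form least-squares solution:

```python
import numpy as np

# Gradient descent on C(w) = ||A w - y||^2: repeatedly step against the gradient.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

cost = lambda w: np.sum((A @ w - y) ** 2)
grad = lambda w: 2 * A.T @ (A @ w - y)

w, lr = np.zeros(3), 0.01
for _ in range(500):
    w -= lr * grad(w)                    # move in the -gradient direction
print("final cost:        ", cost(w))
print("least-squares cost:", cost(np.linalg.lstsq(A, y, rcond=None)[0]))
```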


  • Backprop

    The method to calculate the gradient vector, which tells you which direction to take and how steep the slope is

    1. compute ∇C
    2. take a step in the −∇C direction
    3. repeat

    Learning is finding the weights that minimize the cost function. Backprop is the algorithm used within gradient descent to compute that gradient. Learning is 'just' finding the right weights and biases.


  • Backprop in action, chain rule

    The cost of one training example is $C_0 = (a^L - y)^2$; the last activation is $a^L = \sigma(w^L a^{L-1} + b^L) = \sigma(z^L)$

    How sensitive is the cost function to small changes in the weight?

    $\dfrac{\partial C_0}{\partial w^L} = \dfrac{\partial z^L}{\partial w^L}\,\dfrac{\partial a^L}{\partial z^L}\,\dfrac{\partial C_0}{\partial a^L}$

    $\dfrac{\partial C_0}{\partial a^L} = 2(a^L - y), \qquad \dfrac{\partial a^L}{\partial z^L} = \sigma'(z^L), \qquad \dfrac{\partial z^L}{\partial w^L} = a^{L-1}$

    Average over all training examples: $\dfrac{\partial C}{\partial w^L} = \dfrac{1}{n}\sum_{k=0}^{n-1}\dfrac{\partial C_k}{\partial w^L}$

    $\nabla C = \left[\dfrac{\partial C}{\partial w^1}, \dfrac{\partial C}{\partial b^1}, \ldots, \dfrac{\partial C}{\partial w^L}\right]$
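    A numerical check of the chain rule above on a chain of single-unit layers; the particular weights, biases, input, and target are arbitrary illustration values:

```python
import numpy as np

# Chain of single-unit layers: a^l = sigma(w^l a^{l-1} + b^l), C0 = (a^L - y)^2.
# Analytic dC0/dw^L = a^{L-1} * sigma'(z^L) * 2(a^L - y), as on the slide.
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))

def forward(w, b, a0):
    a = a0
    for wl, bl in zip(w, b):
        z = wl * a + bl
        a = sigma(z)
    return a, z                                    # last activation and pre-activation

w, b, a0, y = [0.7, -1.2, 0.5], [0.1, 0.3, -0.2], 0.8, 1.0
aL, zL = forward(w, b, a0)
aL_minus_1 = forward(w[:-1], b[:-1], a0)[0]
analytic = aL_minus_1 * dsigma(zL) * 2 * (aL - y)

h = 1e-6                                           # finite-difference step
w_plus = w.copy(); w_plus[-1] += h
numeric = ((forward(w_plus, b, a0)[0] - y) ** 2 - (aL - y) ** 2) / h
print(analytic, numeric)                           # the two values agree closely
```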


  • Curse of dimensionality

    Curse of dimensionality refers to the apparent intractability of systematically searching through a high-dimensional space

    As n gets bigger it gets harder and harder to sample all the boxes: with n dimensions, each allowing for m states, we have $m^n$ possible combinations
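    The growth of $m^n$ is easy to make concrete; m and the range of n below are arbitrary illustration values:

```python
# Number of cells to sample with m states per dimension and n dimensions.
m = 10
for n in (2, 5, 10, 20):
    print(f"n = {n:2d}: {m ** n:.2e} cells")
```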


  • Blessing of dimensionality

    In an MLP the approximation error decreases with the number of training samples, error $O(1/\sqrt{N})$, and also with the number of hidden units, error $O(1/M)$; unlike other systems, e.g. polynomials, this is independent of the input size and avoids the curse of dimensionality.

    From these results we can build bounds, for example

    $N > O(Mp/\epsilon)$   (3)

    where $N$ is the number of samples, $M$ the number of hidden nodes, $p$ the input dimension ($Mp$ the number of parameters), and $\epsilon$ the desired approximation error.

    More layers are better and do not harm


  • Bias-variance trade-off

    The bias-variance trade-off is the problem of simultaneously minimizing two sources of error in an estimator. The bias-variance decomposition:

    $\mathrm{MSE} = E\big[(\hat\theta - \theta)^2\big] = \big(E[\hat\theta] - \theta\big)^2 + \mathrm{Var}(\hat\theta) = \big(\mathrm{Bias}(\hat\theta)\big)^2 + \mathrm{Var}(\hat\theta)$   (4)

    The bias/variance trade-off in deep learning is not exactly a trade-off: it can be tackled algorithmically
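    Equation (4) can be checked by simulation with a deliberately biased estimator; the shrinkage factor, sample size, and number of trials are arbitrary:

```python
import numpy as np

# Monte-Carlo check of MSE = Bias^2 + Var for a shrunken sample mean.
rng = np.random.default_rng(0)
theta, n, trials = 2.0, 20, 200_000

samples = rng.normal(loc=theta, scale=1.0, size=(trials, n))
theta_hat = 0.9 * samples.mean(axis=1)         # shrinkage introduces bias

mse = np.mean((theta_hat - theta) ** 2)
bias = np.mean(theta_hat) - theta
print(mse, bias ** 2 + np.var(theta_hat))      # equal up to Monte-Carlo noise
```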


  • Bias-variance trade-off

    Table: Bias variance

                   high variance   high bias   high bias and variance   low bias and variance
    train error    2%              15%         15%                      0.5%
    dev error      11%             15%         30%                      1%

    You don't have the dialectical tension of one thing or the other; in the table we have four cases rather than a trade-off, and luckily we can take actions that fit every case.


  • Bias-variance trade-off

    A bigger network will improve your fit without hurting the variance problem, with the caveat that you regularize properly.

    Before, we couldn't make one better without hurting the other; now we can get both better.


  • Ensemble models

    Idea: you don't want an organization where everyone is the same ('good'); you may want to introduce variability

    Decision trees are grown by introducing a random element, e.g. at each node choose randomly the features used to split the node

    Random forests (randomly constructed trees), each tree voting for a class; bagging = bootstrap + aggregation. Great predictors, but interpretability is obscured by the complexity of the model (accuracy generally requires more complex prediction methods)
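    A minimal scikit-learn sketch of the idea; the synthetic dataset and hyperparameters are illustrative, not from the talk:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Many randomly grown trees, each voting for a class.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```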


  • Computational Topology

    Topology is concerned with the properties of space that are preserved under continuous deformations: stretching, crumpling and bending, but not tearing or gluing

    Topology is an intermediate analysis medium that focuses on coarse structures.

    Why use topology on big data?

    It studies the invariants of continuous deformations of the shape of data (resistant to the threshold-selection problem). It allows measures of shape (clumps, holes and voids) which are invariant across scales


  • Persistent homology

    Edges in a graph capture dyadic relationships.

    Graphs can't capture higher-order relationships, but simplicial complexes can

    A simplicial complex is a generalized graph consisting of vertices, edges, triangles and higher-dimensional simplices glued together.


  • Persistent homology

    $C_0(X) = \langle v_1, v_2, v_3, v_4 \rangle$, $C_1(X) = \langle e_1, e_2, e_3, e_4, e_5 \rangle$, $C_2(X) = \langle \sigma_1 \rangle$

    Boundary operators $\rho_1 : C_1(X) \to C_0(X)$ and $\rho_2 : C_2(X) \to C_1(X)$: applied to an edge, $\rho_1$ yields a difference of vertices; the higher-order operator $\rho_2$ acts on triangles (2-simplices)

    A loop arises when $\rho_1(e_1 + e_2 + e_3) = 0 = \rho_1(e_1 + e_5 + e_4)$; both loops are in the kernel of $\rho_1$, $\mathrm{Ker}(\rho_1) = \{x \in C_1(X) : \rho_1(x) = 0\}$


  • Persistent homology

    $e_1 + e_2 + e_3$ is obtained as the image of the triangle $\sigma_1$ under the map $\rho_2$, whereas $e_1 + e_5 + e_4$ is not the image of any triangle; in other words, with $\mathrm{Im}(\rho_2) = \{y \in C_1(X) : \exists x \in C_2(X),\ \rho_2(x) = y\}$, we have $e_1 + e_2 + e_3 \in \mathrm{Im}(\rho_2)$ and $e_1 + e_5 + e_4 \notin \mathrm{Im}(\rho_2)$. The 1-dimensional homology is the quotient space $H_1(X) = \mathrm{Ker}(\rho_1)/\mathrm{Im}(\rho_2)$

    $H_i(X) = \dfrac{\mathrm{Ker}(\rho_i)}{\mathrm{Im}(\rho_{i+1})}$   (5)
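    The dimension of $H_1$ for a small complex like the one above can be read off from the ranks of the boundary matrices. The specific vertex/edge incidences below are an assumed concrete realisation (the slide only names the chains), chosen so that the triangle on $v_1, v_2, v_3$ is filled and one loop through $v_4$ remains:

```python
import numpy as np

# H1 = Ker(rho_1) / Im(rho_2): 4 vertices, 5 edges, 1 filled triangle.
# Oriented boundary matrices; columns of rho1 are e1..e5, rows are v1..v4.
rho1 = np.array([
    [-1,  0, -1,  0, -1],
    [ 1, -1,  0, -1,  0],
    [ 0,  1,  1,  0,  0],
    [ 0,  0,  0,  1,  1],
])
rho2 = np.array([[1, 1, -1, 0, 0]]).T          # boundary of the triangle: e1 + e2 - e3

assert not np.any(rho1 @ rho2)                 # boundaries of boundaries vanish

dim_ker_rho1 = rho1.shape[1] - np.linalg.matrix_rank(rho1)
dim_im_rho2 = np.linalg.matrix_rank(rho2)
print("dim H1 =", dim_ker_rho1 - dim_im_rho2)  # 1: one unfilled loop remains
```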



  • Conclusions

    With enough imagination a classifier (or regression) can be useful to solve a large number of problems

    Deep learning works because there is structure in the world, but we don't know why, because we don't know anything about the initial conditions: "laws of nature are precise beyond anything reasonable; we know virtually nothing about the initial conditions" (Wigner)

    There are other ways to reduce complexity in big data while preserving maximal intrinsic information: computational topology

    Occam's dilemma (lex parsimoniae): accuracy generally requires more complex prediction methods; simple and interpretable functions do not make the most accurate predictions

    The curse of dimensionality can be a blessing


  • Thanks!
