ICML 2012: Learning Hierarchies of Invariant Features

Learning Hierarchies of Invariant Features. Yann LeCun, Courant Institute of Mathematical Sciences and Center for Neural Science, New York University.


TRANSCRIPT

1. Learning Hierarchies of Invariant Features. Yann LeCun, Courant Institute of Mathematical Sciences and Center for Neural Science, New York University.

2. Challenges for Machine Learning, Vision, Signal Processing, AI, Neuroscience. How can learning build a perceptual system? How do we learn representations of the perceptual world? In ML/CV/ASR/MIR: how do we learn features (not just classifiers)? With good representations, we can learn categories from just a few examples. ML has neglected the question of learning representations, relying instead on domain expertise to engineer features and kernels. Deep learning addresses the problem of learning representations. Goal 1: biologically plausible methods for deep learning. Goal 2: representation learning for computer perception.
3. Architecture of Mainstream Systems. Traditional way: handcrafted features + classifier: [feature extraction (handcrafted)] → [simple trainable classifier]. Mainstream approaches to image and speech recognition: [low-level features (fixed): MFCC, SIFT, HoG] → [mid-level features (unsupervised): mixture of Gaussians, K-means, sparse coding] → [classifier (supervised)].
4. Trainable Feature Hierarchies. Why can't we make all the modules trainable? Proposed way: a hierarchy of trained features: [trainable feature transform] → [trainable feature transform] → [trainable classifier/predictor], with a learned internal representation.
5. The Mammalian Visual Cortex is Hierarchical. The ventral (recognition) pathway in the visual cortex has multiple stages: Retina → LGN → V1 → V2 → V4 → PIT → AIT → ... [picture from Simon Thorpe] [Gallant & Van Essen]
6. Feature Transform = Normalization → Filter Bank → Non-Linearity → Pooling. Stacking multiple stages of [Normalization → Filter Bank → Non-Linearity → Pooling], followed by a classifier. Normalization: variations on whitening; subtractive: average removal, high-pass filtering; divisive: local contrast normalization, variance normalization. Filter bank: dimension expansion, projection on an overcomplete basis. Non-linearity: sparsification, saturation, lateral inhibition, ...; component-wise shrinkage or tanh, winner-takes-all. Pooling: aggregation over space or feature type, subsampling: $\text{AVERAGE}: \frac{1}{K}\sum_i X_i$; $\text{MAX}: \max_i X_i$; $L_p: \big(\sum_i X_i^p\big)^{1/p}$; $\text{PROB}: \frac{1}{b}\log\big(\sum_i e^{b X_i}\big)$.
7. Feature Transform = Normalization → Filter Bank → Non-Linearity → Pooling. Filter bank + non-linearity = non-linear embedding in high dimension. Feature pooling = contraction, dimensionality reduction, smoothing. Learning the filter banks at every stage; creating a hierarchy of features. The basic elements are inspired by models of the visual (and auditory) cortex: the simple-cell/complex-cell model of [Hubel and Wiesel 1962]. Many traditional feature extraction methods are based on this: SIFT, GIST, HoG, convolutional networks, ... [Fukushima 1974-1982], [LeCun 1988-now], [Poggio 2005-now], [Ng 2006-now], and many others.
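The stage described on slides 6-7 maps directly onto standard Torch7 modules. Below is a minimal, illustrative sketch (not code from the talk): the module names and signatures (nn.SpatialConvolution, nn.Tanh, nn.SpatialMaxPooling, nn.SpatialContrastiveNormalization, image.gaussian) are assumed from the stock nn/image packages, and the filter counts and sizes are arbitrary.

```lua
-- One [filter bank -> non-linearity -> pooling -> contrast normalization] stage,
-- built from Torch7 'nn' modules (names/signatures assumed; sizes illustrative).
require 'nn'
require 'image'

local nInput, nOutput = 3, 64                 -- e.g. RGB input, 64 feature maps
local stage = nn.Sequential()
stage:add(nn.SpatialConvolution(nInput, nOutput, 9, 9))   -- filter bank: 9x9 kernels
stage:add(nn.Tanh())                                      -- point-wise non-linearity
stage:add(nn.SpatialMaxPooling(2, 2, 2, 2))               -- pooling + 2x2 subsampling
-- subtractive + divisive local contrast normalization (see slide 46);
-- the (nFeatures, kernel) signature is an assumption
stage:add(nn.SpatialContrastiveNormalization(nOutput, image.gaussian(9)))

local x = torch.rand(nInput, 75, 75)          -- a fake input image
print(stage:forward(x):size())                -- 64 feature maps at reduced resolution
```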
8. Basic Convolutional Network Architecture. Simple cells (multiple convolutions) and complex cells (pooling, subsampling), operating on retinotopic feature maps. [LeCun et al. 89], [LeCun et al. 98]
9. Convolutional Network Architecture.
10. Convolutional Network (ConvNet). Input: 83x83. Layer 1: 64@75x75 (convolution, 64 9x9 kernels). Layer 2: 64@14x14 (10x10 pooling, 5x5 subsampling). Layer 3: 256@6x6 (convolution, 4096 9x9 kernels). Layer 4: 256@1x1 (6x6 pooling, 4x4 subsampling). Output: 101 categories. Non-linearity: shrinkage function, tanh. Pooling: L2, average, max, average-tanh. Training: supervised (1988-2006), unsupervised + supervised (2006-now).
11. Convolutional Network (vintage 1990). filters → tanh → average → tanh → filters → tanh → average → tanh → filters → tanh.
12. Mainstream object recognition pipeline 2006-2010: similar to ConvNets. Two stages of [filter bank → non-linearity → feature pooling] followed by a classifier: oriented edges (SIFT, simple) → winner-takes-all → histogram (sum); then K-means or sparse coding → histogram (sum) over a pyramid → SVM or another simple classifier. Fixed low-level features + unsupervised mid-level features + simple classifier. Example (on the Caltech-101 dataset): SIFT + vector quantization + pyramid pooling + SVM: >65% [Lazebnik et al. CVPR 2006]; SIFT + local sparse-coding macrofeatures + pyramid pooling + SVM: >77% [Boureau et al. ICCV 2011].
13. Other Applications with State-of-the-Art Performance. Traffic sign recognition (German Traffic Sign Recognition Benchmark, GTSRB): 97.2% accuracy. House number recognition (Google Street View House Numbers): 94.8% accuracy.
14. ConvNet Architecture with Multi-Stage Features. Feature maps from all stages are pooled/subsampled and sent to the final classification layers. Pooled low-level features: good for textures and local motifs. High-level features: good for gestalt and global shape. [Sermanet, Chintala, LeCun, arXiv:1204.3968, 2012]
15. Industrial Applications of ConvNets. AT&T/Lucent/NCR: check reading, OCR, handwriting recognition (deployed 1996). NEC: intelligent vending machines and advertising posters, cancer cell detection, automotive applications. Google: face and license plate removal from Street View images. Microsoft: handwriting recognition, speech detection. Orange: face detection, HCI, cell-phone-based applications. Startups and other companies.
16. Fast Scene Parsing. [Farabet, Couprie, Najman, LeCun, ICML 2012]
17. Labeling every pixel with the object it belongs to. Would help identify obstacles, targets, landing sites, dangerous areas. Would help line up depth maps with edge maps. [Farabet et al. ICML 2012]
18. Scene Parsing/Labeling: ConvNet Architecture. Each output sees a large input context: a 46x46 window at full resolution, 92x92 at 1/2 resolution, 184x184 at 1/4 resolution. [7x7 conv] → [2x2 pool] → [7x7 conv] → [2x2 pool] → [7x7 conv] → categories. Trained supervised on fully-labeled images. Pipeline: Laplacian pyramid → level-1 features → level-2 features → upsampled level-2 features → categories.
19. Scene Parsing/Labeling: System Architecture. Original image → multi-scale (band-pass filtered) pyramid → ConvNet → dense feature maps.
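A hedged sketch of the multi-scale feature extraction on slides 18-19: the same (here untrained, randomly initialized) ConvNet is applied to every level of an image pyramid and the resulting maps are brought back to a common resolution. image.scale stands in for both the band-pass (Laplacian) pyramid construction and the upsampling; all sizes, filter counts, and the three scales are illustrative assumptions.

```lua
-- Shared-weight ConvNet features over an image pyramid (sketch only).
require 'nn'
require 'image'

-- a small stand-in for the multi-stage ConvNet feature extractor
local features = nn.Sequential()
features:add(nn.SpatialConvolution(3, 16, 7, 7))
features:add(nn.Tanh())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
features:add(nn.SpatialConvolution(16, 64, 7, 7))
features:add(nn.Tanh())

local img    = torch.rand(3, 240, 320)          -- fake full-resolution frame
local scales = {1, 1/2, 1/4}                    -- three pyramid levels
local maps   = {}
for i, s in ipairs(scales) do
  local h, w  = math.floor(240 * s), math.floor(320 * s)
  local level = image.scale(img, w, h)          -- pyramid level (band-pass filtering omitted)
  local f = features:forward(level):clone()     -- same module, so weights are shared across scales
  maps[i] = image.scale(f, 64, 48)              -- upsample feature maps to a common resolution
end
-- concatenate along the feature dimension into one dense feature map
local dense = torch.cat(torch.cat(maps[1], maps[2], 1), maps[3], 1)
print(dense:size())                             -- e.g. 192 x 48 x 64
```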
20. Method 1: majority over super-pixel regions. The input image follows two paths: superpixel boundary hypotheses (superpixel boundaries), and a multiscale ConvNet whose features (d = 768 per pixel) feed a convolutional classifier producing soft category scores; a majority vote over superpixels yields categories aligned with region boundaries. [Farabet et al. IEEE T. PAMI 2012]
21. Method 2: optimal cover of purity tree. A 2-layer neural net predicts the distribution of categories within each segment; segments come from a spanning tree built on a pixel-similarity graph. [Farabet et al. ICML 2012]
22. Scene Parsing/Labeling: Performance. Stanford Background dataset [Gould 2009]: 8 categories. SIFT Flow dataset [Liu 2009]: 33 categories. Barcelona dataset [Tighe 2010]: 170 categories. [Farabet et al. IEEE T. PAMI 2012]
23. Scene Parsing/Labeling: Results. Samples from the SIFT Flow dataset (Liu). [Farabet et al. 2012]
24. Scene Parsing/Labeling: SIFT Flow dataset (33 categories). Samples from the SIFT Flow dataset (Liu). [Farabet et al. ICML 2012]
25. Scene Parsing/Labeling: SIFT Flow dataset (33 categories). [Farabet et al. ICML 2012]
26.-29. Scene Parsing/Labeling: further result images. [Farabet et al. ICML 2012]
30. Scene Parsing/Labeling. No post-processing; frame-by-frame. The ConvNet runs at 50 ms/frame on Virtex-6 FPGA hardware, but communicating the features over Ethernet limits system performance.
31. Scene Parsing/Labeling: Temporal Consistency. Majority vote on spatio-temporal super-pixels, reset every second.
32. Unsupervised Feature Learning: variations on the sparse auto-encoder theme.
33. Learning Features with Unsupervised Pre-Training. Supervised learning requires lots of labeled samples, but most available data is unlabeled. Models need to be large to understand the task, yet large models have many parameters and require many labeled samples. Unsupervised learning can be used to pre-train the system before supervised refinement: unsupervised pre-training consumes degrees of freedom while placing the system in a favorable region of parameter space, and supervised refinement merely finds the closest local minimum within the attractor found by unsupervised pre-training. Unsupervised feature learning through sparse/overcomplete auto-encoders: with high-dimensional and sparse representations, the data manifold is flattened (any collection of points is flatter in higher dimension).
34. Sparse Coding & Sparse Modeling. [Olshausen & Field 1997] Sparse linear reconstruction. Energy = reconstruction error + code sparsity: $E(Y,Z) = \|Y - W_d Z\|^2 + \lambda \sum_j |z_j|$, where $Y$ is the input and $Z$ holds the features $z_j$. Inference is slow: $Z^* = \arg\min_Z E(Y,Z)$.
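To make the "inference is slow" remark on slide 34 concrete, here is a small ISTA-style inference loop for that energy, written with basic Torch7 tensor calls; the dictionary is random, and the dimensions, sparsity weight, and iteration count are arbitrary choices for the sketch.

```lua
-- ISTA-style iterative inference for E(Y,Z) = ||Y - Wd Z||^2 + lambda * sum_j |z_j|.
require 'torch'

local function shrink(x, t)                       -- soft shrinkage (slide 38)
  return torch.sign(x):cmul(torch.cmax(torch.abs(x) - t, 0))
end

local n, m, lambda = 64, 256, 0.5                 -- input dim, overcomplete code dim, sparsity weight
local Wd = torch.randn(n, m)                      -- decoder / dictionary (random placeholder)
local Y  = torch.randn(n)                         -- an input patch

local function energy(Z)
  return (Y - Wd * Z):pow(2):sum() + lambda * torch.abs(Z):sum()
end

-- step size from a power-iteration estimate of the top eigenvalue of Wd'Wd
local v = torch.randn(m)
for i = 1, 20 do v = Wd:t() * (Wd * v); v = v / v:norm() end
local L = 2 * (Wd:t() * (Wd * v)):norm()

local Z = torch.zeros(m)
for iter = 1, 200 do                              -- this loop is the slow part
  local grad = Wd:t() * (Wd * Z - Y) * 2          -- gradient of the reconstruction term
  Z = shrink(Z - grad / L, lambda / L)
  if iter % 50 == 0 then print(iter, energy(Z)) end  -- the energy should keep decreasing
end
```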
35. How to Speed Up Inference in a Generative Model? Factor graph with an asymmetric factor: Factor A is a distance between the input Y and Decoder(Z); Factor B constrains the latent variable Z. Inference Z → Y is easy: run Z through the deterministic decoder and sample Y. Inference Y → Z is hard, particularly if the decoder function is many-to-one. MAP: minimize the sum of the two factors with respect to Z: Z* = argmin_Z Distance[Decoder(Z), Y] + FactorB(Z).
36. Idea: Train a simple function to approximate the solution. [Kavukcuoglu, Ranzato, LeCun, rejected by every conference, 2008-2009] Train a simple feed-forward function to predict the result of a complex optimization on the data points of interest: a fast feed-forward model (encoder) is added to the generative model, with its own distance factor to Z. 1. Find the optimal Zi for all Yi; 2. train the encoder to predict Zi from Yi.
37. Predictive Sparse Decomposition (PSD): sparse auto-encoder. [Kavukcuoglu, Ranzato, LeCun, 2008, arXiv:1010.3467] Predict the optimal code with a trained encoder. Energy = reconstruction error + code prediction error + code sparsity: $E(Y,Z) = \|Y - W_d Z\|^2 + \|Z - g_e(W_e, Y)\|^2 + \lambda \sum_j |z_j|$, with $g_e(W_e, Y) = \text{shrinkage}(W_e Y)$.
38. Soft Shrinkage Non-Linearity.
39. PSD: Basis Functions on MNIST. The basis functions (and encoder matrix) are digit parts.
40. Predictive Sparse Decomposition (PSD): Training. Training on natural image patches; 12x12 patches, 256 basis functions.
41. Learned Features on natural patches: V1-like receptive fields.
42. Better Idea: Give the right structure to the encoder. [Gregor & LeCun, ICML 2010], [Bronstein et al. ICML 2012] ISTA/FISTA is an iterative algorithm that converges to the optimal sparse code: the input Y is multiplied by We, passed through a shrinkage non-linearity sh(), and the code Z is fed back into the sum through a matrix S.
43. LISTA: Train the We and S matrices to give a good approximation quickly. Think of the FISTA flow graph as a recurrent neural net where We and S are trainable parameters. Time-unfold the flow graph for K iterations; learn the We and S matrices with backprop-through-time; get the best approximate solution within K iterations: Y → [We] → (+) → [sh] → [S] → (+) → [sh] → ... → Z.
44. Learning ISTA (LISTA) vs ISTA/FISTA.
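A sketch of the LISTA forward pass of slides 42-44: the ISTA recurrence is unrolled for a fixed number of steps, with We and S treated as parameters. They are random placeholders here; in the actual method they are learned with backprop-through-time so that the K-step output approximates the optimal sparse code.

```lua
-- Unrolled LISTA encoder (sketch with placeholder parameters).
require 'torch'

local function shrink(x, t)
  return torch.sign(x):cmul(torch.cmax(torch.abs(x) - t, 0))
end

local n, m, K = 64, 256, 3              -- input dim, code dim, unrolled iterations
local We = torch.randn(m, n) * 0.1      -- trainable filter matrix (placeholder values)
local S  = torch.randn(m, m) * 0.01     -- trainable "explaining away" matrix (placeholder values)
local th = 0.1                          -- shrinkage threshold

local Y = torch.randn(n)
local B = We * Y                        -- computed once per input
local Z = shrink(B, th)
for k = 1, K do                         -- the time-unfolded flow graph of slide 43
  Z = shrink(B + S * Z, th)
end
print(Z:ne(0):sum() .. ' active code units out of ' .. m)
```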
45. One Stage: filter bank → shrinkage → L2 pooling → contrast normalization. Convolutions → shrinkage → L2 pooling and subsampling → subtractive + divisive contrast normalization. This is one stage of the ConvNet.
46. Local Contrast Normalization. Performed on the state of every layer, including the input. Subtractive local contrast normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (high-pass filter). Divisive local contrast normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps. Subtractive + divisive LCN performs a kind of approximate whitening.
47. Using PSD to Train a Hierarchy of Features. Phase 1: train the first layer using PSD.
48. Using PSD to Train a Hierarchy of Features (continued). Phase 2: use the encoder + absolute value as a feature extractor.
49. Using PSD to Train a Hierarchy of Features (continued). Phase 3: train the second layer using PSD.
50. Using PSD to Train a Hierarchy of Features (continued). Phase 4: use the encoder + absolute value as the second-stage feature extractor.
51. Using PSD to Train a Hierarchy of Features (continued). Phase 5: train a supervised classifier on top. Phase 6 (optional): train the entire system with supervised back-propagation.
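The training phases of slides 47-51 produce, at run time, a simple feed-forward stack: encoder filters, absolute-value rectification, pooling, a second such stage, and a classifier. The sketch below assembles that stack from standard nn modules; the weights are random stand-ins for the PSD-trained filters, and the input size, filter sizes, pooling, and category count are assumptions.

```lua
-- Feed-forward stack corresponding to the PSD training phases (sketch).
require 'nn'

local stack = nn.Sequential()
-- phase 2: stage-1 encoder (filters would come from PSD phase 1) + |.| rectification + pooling
stack:add(nn.SpatialConvolution(1, 64, 9, 9))
stack:add(nn.Abs())
stack:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- phase 4: stage-2 encoder (filters would come from PSD phase 3) + |.| rectification + pooling
stack:add(nn.SpatialConvolution(64, 256, 9, 9))
stack:add(nn.Abs())
stack:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- phase 5: supervised classifier on top of the unsupervised features
stack:add(nn.Reshape(256 * 10 * 10))        -- 10x10 maps result from the assumed 64x64 input
stack:add(nn.Linear(256 * 10 * 10, 101))    -- e.g. the 101 Caltech categories
-- phase 6 (optional) would fine-tune the whole stack with supervised backprop

print(stack:forward(torch.rand(1, 64, 64)):size())   -- 101 class scores
```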
52. Using PSD Features for Object Recognition. 64 filters on 9x9 patches, trained with PSD with a linear-sigmoid-diagonal encoder.
53. Multistage Hubel-Wiesel Architecture: Filters. Stage 1 and Stage 2 filters, after PSD and after supervised refinement.
54. Results on Caltech-101 with sigmoid non-linearity (like the HMAX model).
55. Results on Caltech-101: purely supervised, with soft-shrink, L2 pooling, contrast normalization. Supervised learning with a soft-shrinkage non-linearity, L2 complex cells, and a sparsity penalty on the complex-cell outputs: 71%.
56. What does Local Contrast Normalization Do? Original image, reconstruction with LCN, reconstruction without LCN.
57. Why Do Random Filters Work? Random filters vs. trained filters for simple cells; optimal stimuli for each complex cell.
58. Small NORB dataset. Two-stage system: error rate versus number of labeled training samples, with no normalization, random filters, unsupervised filters, supervised filters, and unsupervised + supervised filters.
59. Convolutional Sparse Coding / Convolutional PSD. [Kavukcuoglu, Sermanet, Boureau, Mathieu, LeCun, NIPS 2010]: convolutional PSD. [Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: deconvolutional network. [Lee, Gross, Ranganath, Ng, ICML 2009]: convolutional Boltzmann machine. [Norouzi, Ranjbar, Mori, CVPR 2009]: convolutional Boltzmann machine. [Chen, Sapiro, Dunson, Carin, preprint 2010]: deconvolutional network with automatic adjustment of the code dimension.
60. Convolutional Training. Problem: with patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector; but when the filters are used convolutionally, neighboring feature vectors are highly redundant. Patch-level training produces lots of filters that are shifted versions of each other.
61. Convolutional Sparse Coding. Replace the dot products with dictionary elements by convolutions. The input Y is a full image, each code component Zk is a feature map (an image), and each dictionary element is a convolution kernel. Regular sparse coding vs. convolutional sparse coding: $Y = \sum_k W_k * Z_k$. See also deconvolutional networks [Zeiler, Taylor, Fergus CVPR 2010].
62. Convolutional PSD: Encoder with a soft sh() Function. Convolutional formulation: extend sparse coding from the patch level to the image level; patch-based learning vs. convolutional learning.
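The convolutional reconstruction of slides 61-62 can be written in a few lines with torch.conv2 (the routine used on slide 116). In this sketch the dictionary and the sparse feature maps are random placeholders, and the 'F' (full-convolution) option of torch.conv2 is an assumption.

```lua
-- Y ~ sum_k conv(Z_k, W_k): reconstruction in convolutional sparse coding (sketch).
require 'torch'

local K, ks = 16, 9                          -- number of kernels, kernel size
local H, W  = 56, 56                         -- feature-map size
local kernels = torch.randn(K, ks, ks)       -- dictionary of convolution kernels
local Z = torch.randn(K, H, W)               -- one code feature map per kernel
Z:cmul(torch.abs(Z):gt(1.5):double())        -- fake sparsity: zero out small entries

local recon = torch.zeros(H + ks - 1, W + ks - 1)
for k = 1, K do
  recon:add(torch.conv2(Z[k], kernels[k], 'F'))   -- full convolution, accumulated over kernels
end
print(recon:size())                          -- a 64x64 reconstructed "image"
```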
63. Pedestrian Detection, Face Detection. [Osadchy, Miller, LeCun JMLR 2007], [Kavukcuoglu et al. NIPS 2010]
64. ConvNet Architecture with Multi-Stage Features. Feature maps from all stages are pooled/subsampled and sent to the final classification layers. Pooled low-level features: good for textures and local motifs; high-level features: good for gestalt and global shape. Input: 78x126 RGB; filter bank + tanh stages with L2 pooling 3x3 (22 feature maps) and average pooling 2x2 (64 feature maps). [Sermanet, Chintala, LeCun, arXiv:1204.3968, 2012]
65. Pedestrian Detection (INRIA Dataset). [Kavukcuoglu et al. NIPS 2010]
66. Convolutional PSD pre-training for pedestrian detection. ConvPSD pre-training improves the accuracy of pedestrian detection over purely supervised training from random initial conditions.
67. Convolutional PSD pre-training for pedestrian detection. Trained on INRIA; tested on INRIA, Daimler, TUD-Brussels, ETH. Same testing protocol as in [Dollár et al. T.PAMI 2011].
68. Results on Near-Scale Images (>80 pixels tall, no occlusions). INRIA (p=288), Daimler (p=21790), ETH (p=804), TudBrussels (p=508).
69. Results on Reasonable Images (>50 pixels tall, few occlusions). INRIA (p=288), Daimler (p=21790), ETH (p=804), TudBrussels (p=508).
70. Musical Genre Recognition. Same architecture, different data. [Henaff et al. ISMIR 2011]
71. Convolutional PSD Features on Time-Frequency Signals. Input: constant-Q transform over 46.4 ms windows (1024 samples); 96 filters, with frequencies spaced every quarter tone (4 octaves). Architecture: input: a sequence of contrast-normalized CQT vectors; 1: PSD features, 512 trained filters; 2: shrinkage-function rectification; 3: pooling over 5 seconds; 4: linear SVM classifier; 5: pooling of SVM categories over 30 seconds. GTZAN dataset: 1000 clips, 30 seconds each; 10 genres: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. Results: 84% correct classification (the state of the art is at 92%, with many features).
72. Architecture: contrast norm → filters → shrink → max pooling. Subtractive + divisive contrast normalization → filters → shrinkage → max pooling (5 s) → linear classifier. A single-stage convolutional network; the filters are trained with PSD (unsupervised).
73. Constant-Q Transform over 46.4 ms → Contrast Normalization. Subtractive + divisive contrast normalization.
74. Convolutional PSD Features on Time-Frequency Signals. Octave-wide features and full 4-octave features: minor 3rd, perfect 4th, perfect 5th, quartal chord, major triad, transient.
75. PSD Features on Constant-Q Transform. Octave-wide features; encoder basis functions; decoder basis functions.
76. Time-Frequency Features. Octave-wide features on 8 successive acoustic vectors: almost no temporal structure in the filters!
77. Accuracy on the GTZAN dataset (small, old, etc.). Accuracy: 83.4%; state of the art: 84.3%. Very fast.
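A rough sketch of the single-stage audio pipeline of slides 71-72, with Torch7 temporal modules standing in for the real system: the constant-Q front end is replaced by random frames, the PSD-trained filters by a randomly initialized TemporalConvolution, and the per-segment SVM plus 30-second vote by average pooling followed by a linear layer, so the ordering of classification and final pooling differs from the slide. Module names, signatures, and all sizes are assumptions.

```lua
-- Genre-recognition pipeline skeleton (simplified sketch, not the ISMIR 2011 system).
require 'nn'

local nBins, nFilters = 96, 512                      -- CQT bins, learned filters
local framesPerClip   = 646                          -- roughly 30 s of 46.4 ms frames

local net = nn.Sequential()
net:add(nn.TemporalConvolution(nBins, nFilters, 1))  -- frame-wise filter bank (PSD-trained in the talk)
net:add(nn.SoftShrink(0.1))                          -- shrinkage rectification
net:add(nn.TemporalMaxPooling(107, 107))             -- max pooling over roughly 5 s of frames
net:add(nn.Mean(1))                                  -- aggregate the 5 s segments over the clip
net:add(nn.Linear(nFilters, 10))                     -- 10 GTZAN genres (linear classifier)

local clip = torch.rand(framesPerClip, nBins)        -- stand-in for contrast-normalized CQT frames
print(net:forward(clip):size())                      -- 10 genre scores
```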
78. Learning Mid-Level Features. [Boureau et al., CVPR 2010, ICML 2010, CVPR 2011]
79. ConvNets and Conventional Vision Architectures are Similar. Two stages of [filter bank → non-linearity → feature pooling] followed by a classifier: oriented edges (SIFT) with winner-takes-all → histogram (sum); K-means → pyramid histogram (sum) → SVM with a histogram-intersection kernel. Can't we use the same tricks as ConvNets to train the second stage of a conventional vision architecture? Stage 1: SIFT. Stage 2: sparse coding over neighborhoods + pooling.
80. Using DL/ConvNet ideas in conventional recognition systems. [Boureau et al. CVPR 2010] Adapting insights from ConvNets: jointly encode spatial neighborhoods instead of single points (increase the spatial receptive fields of higher-level features); use max pooling instead of average pooling; train a supervised dictionary for sparse coding. This yields state-of-the-art results: 75.7% on Caltech-101 (+/- 1.1%), a record for a single system; 85.6% on 15-Scenes (+/- 0.2%), a record.
81. The Competition: SIFT + Sparse Coding + PMK-SVM. Replacing K-means with sparse coding. [Yang 2008], [Boureau, Bach, Ponce, LeCun 2010]
82. Sparse Coding within Clusters. Splitting the sparse coding into clusters: only similar things get pooled together. [Boureau et al. CVPR 2011]
83. Learning Invariant Features (learning complex cells). [Kavukcuoglu, Ranzato, Fergus, LeCun, CVPR 2009], [Gregor & LeCun 2010]
84. Learning Invariant Features with L2 Group Sparsity. Unsupervised PSD ignores the spatial pooling step; could we devise a similar method that learns the pooling layer as well? Idea [Hyvärinen & Hoyer 2001]: group sparsity on pools of features. A minimum number of pools must be non-zero; the number of features that are on within a pool doesn't matter; pools tend to regroup similar features: $E(Y,Z) = \|Y - W_d Z\|^2 + \|Z - g_e(W_e, Y)\|^2 + \sum_k \sqrt{\sum_{j \in P_k} Z_j^2}$ (an L2 norm within each pool).
85. Learning Invariant Features with L2 Group Sparsity. Idea: features are pooled in groups; sparsity = sum over groups of the L2 norm of the activity in each group. [Hyvärinen & Hoyer 2001]: subspace ICA (decoder only, square). [Welling, Hinton, Osindero NIPS 2002]: pooled product of experts (encoder only, overcomplete, log Student-T penalty on L2 pooling). [Kavukcuoglu, Ranzato, Fergus, LeCun, CVPR 2010]: invariant PSD (encoder-decoder, like PSD; overcomplete, L2 pooling). [Le et al. NIPS 2011]: Reconstruction ICA (same as [Kavukcuoglu 2010] with a linear encoder and tied decoder). [Gregor & LeCun arXiv:1006.0448, 2010], [Le et al. ICML 2012]: locally-connected, non-shared (tiled) encoder-decoder. INPUT Y → simple features Z → L2 norm within each pool → invariant features; encoder only (PoE, ICA), decoder only, or encoder-decoder (iPSD, RICA).
86. Pooling Similar Features using Group Sparsity. A sparse, overcomplete version of Hyvärinen's subspace ICA; the decoder ensures reconstruction (unlike ICA, which requires an orthogonal matrix). 1. Apply filters on a patch (with a suitable non-linearity); 2. arrange the filter outputs on a 2D plane; 3. square the filter outputs; 4. minimize the square root of the sum of blocks of squared filter outputs. [Kavukcuoglu, Ranzato, Fergus, LeCun, CVPR 2009], [Jenatton, Obozinski, Bach AISTATS 2010], [Le et al. NIPS 2011]
87. Groups are local in a 2D Topographic Map. The filters arrange themselves spontaneously so that similar filters enter the same pool. The pooling units can be seen as complex cells; the outputs of the pooling units are invariant to local transformations of the input (for some it's translations, for others rotations or other transformations).
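The group-sparsity term of slides 84-86 is easy to state in code: cut the code vector into pools and sum the per-pool L2 norms. The sketch below uses contiguous pools for simplicity; in the topographic-map experiments of slide 87 the pools are overlapping neighborhoods on a 2D grid.

```lua
-- sum_k sqrt( sum_{j in P_k} z_j^2 ): the L2 group-sparsity penalty (sketch).
require 'torch'

local m, poolSize = 256, 4                        -- code dimension, features per pool
local Z = torch.randn(m)                          -- a code vector (random placeholder)

-- view the code as (numPools x poolSize); clone so Z itself is left untouched
local pools   = Z:view(m / poolSize, poolSize):clone()
local penalty = pools:pow(2):sum(2):sqrt():sum()  -- per-pool L2 norms, then summed over pools
print('group-sparsity penalty: ' .. penalty)
```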
88. Image-level training, local filters but no weight sharing. Training on 115x115 images; kernels are 15x15 (not shared across space!). [Gregor & LeCun 2010] Decoder with local receptive fields, no shared weights, 4x overcomplete, L2 pooling, group sparsity over pools; the encoder predicts the (inferred) code from the input, and the decoder reconstructs the input from it.
89. Image-level training, local filters but no weight sharing. Topographic maps of continuously-varying features; local overlapping pools are invariant complex cells. [Gregor & LeCun arXiv:1006.0448] (double-tanh encoder), [Le et al. ICML 2012] (linear encoder).
90. Image-level training, local filters but no weight sharing. Training on 115x115 images; kernels are 15x15 (not shared across space!).
91. (Figure.) K. Obermayer and G. G. Blasdel, Journal of Neuroscience, Vol. 13, 4114-4129 (monkey); Michael C. Crair et al., The Journal of Neurophysiology, Vol. 77, No. 6, June 1997, pp. 3381-3385 (cat). 119x119 image input, 100x100 code, 20x20 receptive field size, sigma = 5.
92. Image-level training, local filters but no weight sharing. Color indicates orientation (by fitting Gabors).
93. Theory of Repeated [Filter Bank → L2 Pooling → Average Pooling]. Stéphane Mallat's scattering transform: a theory of ConvNet-like architectures. [Mallat & Bruna CVPR 2011] Classification with Scattering Operators; [Mallat & Bruna, arXiv:1203.1513, 2012] Invariant Scattering Convolution Networks; [Mallat CPAM 2012] Group Invariant Scattering.
94. Sparse Coding Using Lateral Inhibition. [Gregor, Szlam, LeCun, NIPS 2011]
95. Invariant Features via Lateral Inhibition. Replace the L1 sparsity term by a lateral-inhibition matrix: an easy way to impose some structure on the sparsity. [Gregor, Szlam, LeCun NIPS 2011]
96. Invariant Features via Lateral Inhibition: Structured Sparsity. Each edge in the tree indicates a zero in the S matrix (no mutual inhibition); Sij is larger if two neurons are far apart in the tree.
97.-99. (Figure-only slides.)
100. Invariant Features via Lateral Inhibition: Topographic Maps. Non-zero values in S form a ring in a 2D topology; input patches are high-pass filtered.
101. Invariant Features via Lateral Inhibition: Topographic Maps. Non-zero values in S form a ring in a 2D topology. Left: no high-pass filtering of the input; right: patch-level mean removal.
102. Invariant Features via Lateral Excitation: Topographic Maps. Short-range lateral excitation + L1 sparsity.
103. Learning What/Where Features with Temporal Constancy. [Gregor & LeCun arXiv:1006.0448, 2010]
104. Invariant Features through Temporal Constancy. An object is the cross-product of an object type and instantiation parameters (e.g. object size: small, medium, large). Mapping units [Hinton 1981], capsules [Hinton 2011]. [Karol Gregor et al.]
105. What-Where Auto-Encoder Architecture. An encoder (filters f with matrices W1, W2) maps the inputs S^t, S^(t-1), S^(t-2) to inferred codes: per-frame C1 codes (the "where") and a single C2 code shared across the frames (the "what"); a decoder (W1, W2) produces the predicted inputs and predicted codes.
106. Low-Level Filters Connected to Each Complex Cell. C1 (where), C2 (what).
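A toy numeric illustration of the what/where split behind complex cells and the what-where auto-encoder (slides 87 and 104-106): within one pool of simple-cell responses, an invariant magnitude plays the role of the "what" and the identity of the winning unit plays the role of the "where". This is only a conceptual sketch, not the architecture of slide 105.

```lua
-- What/where split within one pool of simple-cell responses (conceptual sketch).
require 'torch'

local pool = torch.abs(torch.randn(4))     -- responses of 4 simple cells in one pool
local what = pool:norm()                   -- invariant part: L2 norm over the pool
local _, where = pool:max(1)               -- instantiation part: index of the winning unit
print(('what = %.3f, where = unit %d'):format(what, where[1]))
```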
107. Generating from the Network (input shown).
108. Hardware Implementations.
109. Higher End: FPGA with the NeuFlow architecture. Now running on a Picocomputing 8x10 cm high-performance FPGA board: Virtex-6 LX240T, 680 MAC units, 20 NeuFlow tiles. Full scene labeling at 20 frames/sec (50 ms/frame) at 320x240. New board with Virtex 6.
110. NeuFlow: Architecture. Multiport memory / DMA controller (x12 on a Virtex-6 LX240T); a RISC CPU to reconfigure the tiles and data paths at runtime; a global network-on-chip to allow fast reconfiguration; a grid of passive processing tiles (PTs) (x20 on a Virtex-6 LX240T).
111. NeuFlow: Processing Tile Architecture. A configurable router to stream data in and out of the tile, to neighbors or DMA ports; term-by-term streaming operators (MUL, DIV, ADD, SUB, MAX); a configurable piecewise linear or quadratic mapper; a configurable bank of FIFOs for stream buffering (up to 10 kB per PT); and a full 1D/2D parallel convolver with 100 MAC units (multiplicities on the slide: x20, x8 with 2 per tile, x4, x4, x8) [Virtex-6 LX240T].
112. NeuFlow ASIC: 2.5x5 mm, 45 nm, 0.6 W, 160 GOPS. Collaboration NYU-Purdue (Eugenio Culurciello's group). Suitable for vision-enabled embedded and mobile devices. Status: waiting for first samples at the end of 2012. [Pham, Jelaca, Farabet, Martini, LeCun, Culurciello 2012]
113. NeuFlow: Performance.

|                   | Intel I7 (4 cores) | neuFlow Virtex-4 | neuFlow Virtex-6 | nVidia GT335m | neuFlow IBM 45nm | nVidia GTX480 |
| Peak GOP/sec      | 40                 | 40               | 160              | 182           | 160              | 1350          |
| Actual GOP/sec    | 12                 | 37               | 147              | 54            | 147              | 294           |
| FPS               | 14                 | 46               | 182              | 67            | 182              | 374           |
| Power (W)         | 50                 | 10               | 10               | 30            | 0.6              | 220           |
| Embed? (GOP/s/W)  | 0.24               | 3.7              | 14.7             | 1.8           | 245              | 1.34          |

114. Software Platforms: Torch7. http://www.torch.ch [Collobert, Kavukcuoglu, Farabet 2011]
115. Torch7. An ML/vision development environment; a collaboration between IDIAP, NYU, and NEC Labs. It uses the Lua interpreter/compiler as a front-end: a very simple and efficient scripting language with an ultra-simple, natural syntax ("it's like Scheme underneath, but with a normal syntax"); Lua is widely used in the video game and web industries. Torch7 extends Lua with numerical, ML, and vision libraries: a powerful vector/matrix/tensor engine (developed by us); interfaces to numerical libraries, OpenCV, etc.; back-ends for SSE, OpenMP, CUDA, ARM-Neon, ...; a Qt-based GUI and graphics. Open source: http://www.torch.ch/; first released in early 2012.
116. Torch7. An ML/vision development environment: as simple as Matlab, but faster and better designed as a language.

MATRIX and VECTOR operations:

    > M = torch.Tensor(2,2):fill(2)
    > N = torch.Tensor(2,4):fill(3)
    > x = torch.Tensor(2):fill(4)
    > y = torch.Tensor(2):fill(5)
    > = x*y      -- dot product
    40
    > = M*x      -- matrix-vector product
    16
    16
    [torch.Tensor of dimension 2]

2D convolution:

    x = torch.rand(100,100)
    k = torch.rand(10,10)
    res1 = torch.conv2(x,k)

Training a neural net:

    mlp = nn.Sequential()
    mlp:add( nn.Linear(10, 25) )    -- 10 inputs, 25 hidden units
    mlp:add( nn.Tanh() )            -- hyperbolic tangent transfer function
    mlp:add( nn.Linear(25, 1) )     -- 1 output
    criterion = nn.MSECriterion()   -- mean squared error criterion
    trainer = nn.StochasticGradient(mlp, criterion)
    trainer:train(dataset)          -- train using some examples

117. Acknowledgements. Camille Couprie, Karol Gregor, Y-Lan Boureau, Clément Farabet, Kevin Jarrett, Arthur Szlam, Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, Pierre Sermanet, Laurent Najman (ESIEE), Eugenio Culurciello (Purdue).
118. The End.