Deep Learning Technology Today

Deep Learning Technology Today. Whole Brain Architecture Study Group (2nd meeting). Seiya Tokui, 2014/1/30.

DESCRIPTION

These are the slides from my talk at the 2nd Whole Brain Architecture Study Group. The talk is a survey covering everything from the basics of Deep Learning to interesting recently proposed topics.

TRANSCRIPT

1. Title slide: Deep Learning Technology Today, 2014/1/30.

2. Speaker: Seiya Tokui, Preferred Infrastructure, Jubatus Project. @beam2d (Twitter, GitHub, etc.)

3. 2011: speech recognition. DNN-HMM acoustic models cut the word error rate by roughly 10% relative to GMM baselines, putting Deep Learning in the spotlight. F. Seide, G. Li and D. Yu. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks. INTERSPEECH 2011, pp. 437-440.

4. 2012: image recognition. The deep convolutional neural network "SuperVision" won ILSVRC2012 with a margin of about 10% over the runner-up. J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla and F.-F. Li. Large Scale Visual Recognition Challenge 2012. ILSVRC2012 Workshop.

5. 2013: industry moves. Google acquired DNNresearch (Geoffrey Hinton, Alex Krizhevsky and Ilya Sutskever); Baidu founded the Institute of Deep Learning (Kai Yu); Facebook opened its AI Lab in December (Yann LeCun, Marc'Aurelio Ranzato); Yahoo acquired IQ Engines and LookFlow; and in January 2014 Google acquired DeepMind.

6. Outline: Deep Learning, the zoo of deep learning models, and further topics (some bullet text lost in extraction).

7. (Outline, continued; the bullet text of this slide did not survive extraction apart from repeated mentions of Deep Learning.)

8. (Section header: Deep Learning.)

9. The zoo of models around Deep Learning: Perceptron, Boosting, SVM, RF, GMM, Sparse Coding, AE, DAE, RBM, DNN, SFFNN, CNN, RNN, DBN, DBM, GSN, Bayes NP, Sum-Product. Cf. Y. LeCun and M. A. Ranzato. Deep Learning Tutorial. ICML 2013.

10. Feed-Forward Neural Network. Each hidden unit applies an activation function f to its pre-activation: h_j = f(w_j1 x_1 + w_j2 x_2 + w_j3 x_3 + w_j4 x_4) = f(w_j^T x). In vector form h = f(Wx), or with a bias term h = f(Wx + b). (See the first code sketch after slide 18.)

11. Training: forward propagation (fprop) computes the output; backpropagation (bprop) sends the error L between the output and the ground truth back through the layers to obtain the gradients.

12. Activation functions: the sigmoid 1 / (1 + e^(-x)), tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x)), the Rectified Linear Unit (ReLU) max(0, x), and the maxout unit, which takes the max over several linear units. I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville and Y. Bengio. Maxout Networks. ICML 2013.

13. Training a neural network: mini-batch SGD with the update w ← w - η · (1/B) Σ_{i=1..B} ∂L(x_{B_i}) / ∂w over mini-batches of size B, typically combined with L2 regularization (weight decay) or L1 regularization. Acceleration: Momentum and Nesterov's Accelerated Gradient*; adaptive learning rates (AdaGrad**, vSGD***); batch methods such as L-BFGS and Hessian-Free optimization. (See the second sketch after slide 18.) * I. Sutskever, J. Martens, G. Dahl and G. Hinton. On the importance of initialization and momentum in deep learning. ICML 2013. ** J. Duchi, E. Hazan and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. JMLR 12 (2011) 2121-2159. *** T. Schaul, S. Zhang and Y. LeCun. No More Pesky Learning Rates. ICML 2013.

14. Dropout: during SGD, randomly set units to 0 (around 20% of input units and 50% of hidden units). It behaves like an adaptive L2-style regularizer*, combines well with ReLU and maxout units, and has variants such as DropConnect and Adaptive Dropout. (See the third sketch after slide 18.) G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv 1207.0580. * S. Wager, S. Wang and P. Liang. Dropout Training as Adaptive Regularization. NIPS 2013.

15. Restricted Boltzmann Machine: a bipartite undirected model over visible units v and hidden units h with energy E(v, h) = -a^T v - b^T h - h^T W v in the binary case, or E(v, h) = ||v - a||^2 / 2 - b^T h - h^T W v for Gaussian visible units. The log-likelihood gradient is ∂ log p(v) / ∂w_ij = <v_i h_j>_data - <v_i h_j>_model.

16. Contrastive Divergence (CD-k): approximate the model expectation by running k steps of Gibbs sampling starting from the data, giving Δw_ij ∝ <v_i h_j>_data - <v_i h_j>_reconstruction. Even k = 1 (CD-1) works well in practice. (See the fourth sketch after slide 18.)

17. Deep Belief Network*: stack RBMs and train them with greedy layer-wise pre-training using Contrastive Divergence, optionally followed by up-down fine-tuning or top-down regularization**; also used to initialize DNNs. * G. E. Hinton, S. Osindero and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation 2006. ** H. Goh, N. Thome, M. Cord and J.-H. Lim. Top-Down Regularization of Deep Belief Networks. NIPS 2013.

18. Deep Boltzmann Machine: a fully undirected deep model, unlike the DBN. The gradient has the same form as for the RBM, ∂ log p(v) / ∂w_ij = <v_i h_j>_data - <v_i h_j>_model; the data term is approximated with a factorized (mean-field) distribution and the model term with persistent MCMC. R. Salakhutdinov and G. Hinton. Deep Boltzmann Machines. AISTATS 2009.
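A minimal sketch in Python/NumPy (not from the slides) of the forward pass on slide 10 combined with the ReLU activation from slide 12; the layer sizes and random inputs are assumptions made only for illustration:

import numpy as np

def relu(x):
    # Rectified Linear Unit from slide 12: element-wise max(0, x)
    return np.maximum(0.0, x)

def forward(W, b, x):
    # Slide 10: pre-activation Wx + b, then the activation function f
    return relu(W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 4))  # 4 inputs -> 5 hidden units (assumed sizes)
b = np.zeros(5)
x = rng.normal(size=4)
print(forward(W, b, x))  # hidden activations h = f(Wx + b)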
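A hedged sketch of the mini-batch SGD update on slide 13 with classical momentum and L2 weight decay; the quadratic loss used here is only a stand-in so the example runs end to end, not anything from the slides:

import numpy as np

def grad_loss(w, X, y):
    # Gradient of the mean squared error 0.5 * ||Xw - y||^2 / B over a mini-batch
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(256, 3)), np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

w, v = np.zeros(3), np.zeros(3)
lr, momentum, weight_decay, B = 0.1, 0.9, 1e-4, 32
for epoch in range(20):
    perm = rng.permutation(len(y))
    for i in range(0, len(y), B):
        idx = perm[i:i + B]
        g = grad_loss(w, X[idx], y[idx]) + weight_decay * w  # mean gradient + L2 term
        v = momentum * v - lr * g                            # classical momentum
        w = w + v
print(w)  # close to true_w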
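A minimal dropout sketch for slide 14, assuming the formulation of Hinton et al.: zero each hidden unit with probability p at training time and scale activations by (1 - p) at test time (modern libraries often use the inverted variant that scales during training instead):

import numpy as np

def dropout(h, p, rng, train=True):
    if train:
        mask = rng.random(h.shape) >= p   # keep each unit with probability 1 - p
        return h * mask
    return h * (1.0 - p)                  # match the expected activation at test time

rng = np.random.default_rng(0)
h = rng.normal(size=10)
print(dropout(h, p=0.5, rng=rng))               # training: roughly half the units zeroed
print(dropout(h, p=0.5, rng=rng, train=False))  # test: scaled activations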
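A hedged sketch of one CD-1 update for a binary RBM (slides 15-16): the positive phase uses <v_i h_j>_data and the negative phase approximates <v_i h_j>_model with a single Gibbs step (the reconstruction). Unit counts and the learning rate are illustrative assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, rng, lr=0.1):
    h0_prob = sigmoid(v0 @ W.T + b)                  # p(h = 1 | v0)
    h0 = (rng.random(h0_prob.shape) < h0_prob) * 1.0
    v1_prob = sigmoid(h0 @ W + a)                    # reconstruction p(v = 1 | h0)
    h1_prob = sigmoid(v1_prob @ W.T + b)             # p(h = 1 | v1)
    batch = len(v0)
    W += lr * (h0_prob.T @ v0 - h1_prob.T @ v1_prob) / batch
    a += lr * (v0 - v1_prob).mean(axis=0)
    b += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
W = 0.01 * rng.normal(size=(8, 16))       # 16 visible units, 8 hidden units (assumed)
a, b = np.zeros(16), np.zeros(8)
v0 = (rng.random((32, 16)) < 0.5) * 1.0   # a random binary mini-batch
W, a, b = cd1_update(W, a, b, v0, rng)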
19. Autoencoder (AE): a neural network trained to reconstruct its input through a bottleneck hidden layer. Variants include the Contractive AE*, Sparse AE and Denoising AE**; tied weights W' = W^T are commonly used, and the DAE is connected to score matching for RBMs***. * S. Rifai, P. Vincent, X. Muller, X. Glorot and Y. Bengio. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. ICML 2011. ** P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008. *** P. Vincent. A Connection Between Score Matching and Denoising Autoencoders. TR 1358, Dept. IRO, Universite de Montreal.

20. Denoising Autoencoder (DAE): corrupt the input (e.g. with salt-and-pepper noise that sets units to 0 or 1) and train the network to reconstruct the clean input; it also has a generative interpretation*. (See the first sketch after slide 34.) * Y. Bengio, L. Yao, G. Alain and P. Vincent. Generalized Denoising Auto-Encoders as Generative Models. NIPS 2013.

21. Stacked Denoising Autoencoder: train DAEs layer by layer and stack them; the learned features show disentanglement, and the stack is used to pre-train deep nets. P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio and P.-A. Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. JMLR 11 (2010) 3371-3408.

22. Discussion: do deep nets really need to be deep? (Much of this slide's text did not survive extraction.) L. J. Ba and R. Caruana. Do Deep Nets Really Need to be Deep? ArXiv 1312.6184.

23. (Section header: Deep Learning.)

24. Convolutional Neural Network (CNN): convolve learned filters over the input to produce feature maps, analogous to simple cells; efficient implementations use GPUs and FFTs. Figure: Convolutional Neural Network (LeNet) in Deep Learning 0.1 Documentation. http://deeplearning.net/tutorial/lenet.html#details-and-notation

25. Pooling (subsampling) over a rectangle of a feature map: L2-pooling sqrt( (1/|rectangle|) Σ_{(i,j) in rectangle} x_ij^2 ), max-pooling max_{(i,j) in rectangle} x_ij, and average-pooling (1/|rectangle|) Σ_{(i,j) in rectangle} x_ij. A CNN repeats convolution / activation / pooling, with pooling playing the role of complex cells. (See the second sketch after slide 34.)

26. Local Contrast Normalization over feature maps: subtractive normalization v_ijk = x_ijk - Σ_{ipq} w_pq · x_{i, j+p, k+q}, followed by divisive normalization y_ijk = v_ijk / max(c, (Σ_{ipq} w_pq · v_{i, j+p, k+q}^2)^(1/2)). K. Jarrett, K. Kavukcuoglu, M. A. Ranzato and Y. LeCun. What is the Best Multi-Stage Architecture for Object Recognition? ICCV 2009.

27. SuperVision: overlapping pooling and Local Response Normalization, trained on two GPUs that split the feature maps between them; the dominant architecture through 2013. A. Krizhevsky, I. Sutskever and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.

28. Deconvolutional NN: visualize CNN features by unpooling through the max-pooling switches and running the network backwards; this analysis fed into the ILSVRC2013 winner (Clarifai). M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. ArXiv 1311.2901v3.

29. Stochastic Feedforward NN: a network mixing deterministic and stochastic neurons, trained with a generalized EM procedure in which the E-step samples the stochastic units and the M-step uses backpropagation. Y. Tang and R. Salakhutdinov. Learning Stochastic Feedforward Neural Networks. NIPS 2013.

30. Building block: a sparse autoencoder with encoder, decoder and pooling (Topographic ICA). Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean and A. Y. Ng. Building High-level Features Using Large Scale Unsupervised Learning. ICML 2012.

31. Google's large-scale unsupervised network: stacked modules combining convolution, TICA pooling and Local Contrast Normalization, trained on 10,000,000 YouTube frames. Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean and A. Y. Ng. Building High-level Features Using Large Scale Unsupervised Learning. ICML 2012.

32. Discriminative Recurrent Sparse Auto-Encoder: a recurrent sparse autoencoder whose unrolled computation resembles a DNN, in which "part units" and "categorical units" emerge. J. T. Rolfe and Y. LeCun. Discriminative Recurrent Sparse Auto-Encoders. ICLR 2013.

33. (Section header: Deep Learning.)

34. Recursive Neural Network: apply the same neural network recursively over a tree structure, making the computation deep along the tree. R. Socher, C. C.-Y. Lin, A. Y. Ng and C. D. Manning. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. ICML 2011.
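A minimal sketch of one denoising-autoencoder update (slides 19-20), assuming tied weights W' = W^T, sigmoid units, masking noise and a squared reconstruction error against the clean input; the hand-written gradients are illustrative, not the slides' code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dae_step(W, b, c, x, rng, corruption=0.3, lr=0.1):
    # Corrupt the input by zeroing a random subset of components (masking noise)
    x_tilde = x * (rng.random(x.shape) >= corruption)
    h = sigmoid(x_tilde @ W.T + b)          # encoder
    y = sigmoid(h @ W + c)                  # decoder with tied weights W' = W^T
    # Backprop of 0.5 * ||y - x||^2 against the *clean* input x
    dy = (y - x) * y * (1 - y)
    dh = (dy @ W.T) * h * (1 - h)
    gW = dh.T @ x_tilde + h.T @ dy          # tied weights receive both contributions
    W -= lr * gW / len(x)
    b -= lr * dh.mean(axis=0)
    c -= lr * dy.mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(6, 12))          # 12 inputs, 6 hidden units (assumed)
b, c = np.zeros(6), np.zeros(12)
x = (rng.random((32, 12)) < 0.5) * 1.0      # a random binary mini-batch
W, b, c = dae_step(W, b, c, x, rng)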
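A sketch of the three pooling operations on slide 25, applied to non-overlapping 2x2 regions of a single feature map; the feature-map shape is an assumption for illustration:

import numpy as np

def pool2x2(fmap, kind="max"):
    H, W = fmap.shape
    # Group the map into 2x2 blocks, one row of 4 values per block
    blocks = fmap.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(H // 2, W // 2, 4)
    if kind == "max":      # max-pooling: max over the rectangle
        return blocks.max(axis=-1)
    if kind == "average":  # average-pooling: mean over the rectangle
        return blocks.mean(axis=-1)
    if kind == "l2":       # L2-pooling: sqrt of the mean of squares
        return np.sqrt((blocks ** 2).mean(axis=-1))
    raise ValueError(kind)

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(fmap, "max"))
print(pool2x2(fmap, "average"))
print(pool2x2(fmap, "l2"))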
35. Applications of Recursive NNs: semantic compositionality and sentiment analysis. R. Socher, B. Huval, C. D. Manning and A. Y. Ng. Semantic Compositionality through Recursive Matrix-Vector Spaces. EMNLP 2012. R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng and C. Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013. Demo: http://nlp.stanford.edu/sentiment/

36. Recurrent Neural Network Language Model (RNNLM): predict the next word from the hidden state at time t-1; outperforms N-gram models and learns word embeddings as a by-product. T. Mikolov, M. Karafiat, L. Burget, J. H. Cernocky and S. Khudanpur. Recurrent neural network based language model. INTERSPEECH 2010.

37. Training an RNN: Backpropagation Through Time unrolls the recurrence over t = 1, 2, 3, ... into a DNN and applies ordinary backpropagation. (See the first sketch at the end of this transcript.)

38. Deep Recurrent Neural Network: stack recurrent layers to combine depth in the DNN sense with recurrence. M. Hermans and B. Schrauwen. Training and Analyzing Deep Recurrent Neural Networks. NIPS 2013.

39. Skip-gram model: not deep itself, but the learned word vectors support analogical reasoning, e.g. v(brother) - v(sister) + v(queen) ≈ v(king); implementation: word2vec. (See the second sketch at the end of this transcript.) T. Mikolov, K. Chen, G. Corrado and J. Dean. Efficient Estimation of Word Representations in Vector Space. ICLR 2013.

40. DeViSE: combine SuperVision image features with Skip-gram word embeddings into a visual-semantic embedding, enabling zero-shot learning. A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato and T. Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. NIPS 2013.

41. (Section header: Deep Learning.)

42. Deep reinforcement learning: Deep Q-Networks learn to play Atari games from raw screens, a POMDP setting; developed at DeepMind, since acquired by Google. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller. Playing Atari with Deep Reinforcement Learning. NIPS Deep Learning Workshop 2013.

43. Knowledge matters: some tasks are very hard for a plain NN to learn without prior information; Curriculum Learning* orders training examples from easy to hard. C. Gulcehre and Y. Bengio. Knowledge Matters: Importance of Prior Information for Optimization. NIPS Deep Learning Workshop 2012. * Y. Bengio, J. Louradour, R. Collobert and J. Weston. Curriculum Learning. ICML 2009.

44. Localization and detection: OverFeat, from LeCun's group, integrates recognition, localization and detection with a SuperVision-style convolutional network and took part in ILSVRC2013. P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus and Y. LeCun. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ArXiv 1312.6229.

45. (The slide's own commentary was lost in extraction.) Y. Bengio. Evolving Culture vs Local Minima. ArXiv 1203.2990, 2012.

46. Summary: Deep Learning in 2014; topics covered include Dropout, DAEs and Recurrent Nets. (Parts of this slide's text were lost in extraction.)

47. Preferred Infrastructure, Inc. 2014.
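A minimal sketch of the recurrence behind slides 36-37: an Elman-style hidden state h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). Unrolling this loop over time is exactly what Backpropagation Through Time differentiates; the shapes and the tanh nonlinearity are assumptions made for illustration:

import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:                          # one step per time step t = 1, 2, 3, ...
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        states.append(h)
    return states                         # kept so BPTT can run backwards over them

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(4, 3))      # 3-dim input, 4-dim hidden state (assumed)
W_hh = 0.1 * rng.normal(size=(4, 4))
b = np.zeros(4)
xs = rng.normal(size=(5, 3))              # a length-5 input sequence
print(len(rnn_forward(xs, W_xh, W_hh, b)))  # 5 hidden states, one per step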
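A toy illustration of the analogical reasoning on slide 39; the 4-dimensional "embeddings" here are made up purely so the example runs, whereas real word2vec vectors are learned from large corpora:

import numpy as np

emb = {
    "king":    np.array([0.9, 0.8, 0.1, 0.0]),
    "queen":   np.array([0.9, 0.1, 0.8, 0.0]),
    "brother": np.array([0.1, 0.8, 0.1, 0.7]),
    "sister":  np.array([0.1, 0.1, 0.8, 0.7]),
}

def most_similar(query, exclude):
    # Cosine similarity between the query vector and every stored embedding
    return max((w for w in emb if w not in exclude),
               key=lambda w: emb[w] @ query / (np.linalg.norm(emb[w]) * np.linalg.norm(query)))

query = emb["brother"] - emb["sister"] + emb["queen"]
print(most_similar(query, exclude={"brother", "sister", "queen"}))  # -> "king"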