Deep Learning on Big Data
Aurelio [email protected]
Ordine degli Ingegneri della provincia di Ancona
Facoltà di Ingegneria dell'Informazione, Informatica e Statistica (I3S)
http://ispac.diet.uniroma1.it
Ancona, June 12, 2015
Prologue
• Aristotle argued that all people express similar intellectual faculties and that the differences are due to teaching and example.
• My elementary school teacher said that "man is intelligent because he has the ability to adapt".
• Bernard Widrow (LMS inventor): "I'm an 'adaptive' guy"
Keywords: teaching, example and adaptation
The Big Data Phenomenon
Exponential growth of available information
• Social networks
• Sensor networks
• Internet of Things
• Bureaucratic and domain-specific databases
• Apps
• ….
Big Data cycle
[Diagram: the Big Data cycle: Apps → Data → Users → Apps]
Big Data: many 'V's
• By 2020: about 44×10^21 bytes (44 zettabytes)
• Volume
• Velocity
• Variety
• Variability
Source: IDC's Digital Universe Study (EMC)
Big Data
• Untapped opportunities for socioeconomic growth (World Economic Forum)
• "Data is the new oil of the Internet and the new currency of the digital world." (Meglena Kuneva, European Consumer Commissioner)
• Data in the 21st century is like oil in the 18th century: an immensely valuable, untapped asset.
Big Data: the many-'V' definition
Big problem: extraction of 'V'alue from the large pools of data
Cost center → Profit center
Harvesting valuable knowledge from Big Data is not an ordinary task.
Today, machine learning methods have come to play a vital role in Big Data analytics and knowledge discovery.
Big-Data relevant themes
• Computational Intelligence Methods: deep learning methods, Deep Neural Nets, Convolutional Neural Nets, Distributed Neural Nets, meta-heuristics, ...
• Data constraints: massive scale, decentralized, real-time streams
• Infrastructure (massive scale): cloud storage, high-speed networks, high-speed computers
• Value: BD business models, BD analytics, high-value-added products, ...
• Computational model: adaptive, parallel, distributed, local connections, 'green'
• Task: modeling, prediction, classification, clustering, ...
• Support projects that can transform our ability to harness knowledge in novel ways from huge volumes of digital data.
• In April 2013, U.S. President Barack Obama announced another federal project, a new brain-mapping initiative called BRAIN (Brain Research through Advancing Innovative Neurotechnologies).
• President Barack Obama's Big Data Keynote, Hadoop World 2015 (he talks about the importance of Big Data and Data Science) (19 Feb 2015)
Biologically inspired computing
Biologically inspired approach ....
[Diagram: the brain combines instinct, knowledge, experience, culture, emotions, memory, a priori knowledge, rules and reasoning ability to produce aware deduction and action]
Moreover: fusion with other information ....
Most of our behavior, which combines information, knowledge and intelligence, happens unconsciously.
Ex. Complex scene summarization in a few words
Characteristics of the biological brain
The neuron cell
[Diagram: dendrites (receivers), cell body with nucleus, axon, and axon terminals (transmitters); stimuli enter, a response leaves]
• Birth of Artificial Neural Networks (ANNs) (1940s)
• The formal neuron of McCulloch and Pitts (1943)
• Simple biologically inspired circuit
[Diagram: the formal neuron. The inputs $x_1, \dots, x_M$ (stimuli arriving at the dendrites) are weighted by the synaptic weights $w_1, \dots, w_M$ and combined at a summing junction with a threshold (bias) to give the cell potential (activation) $s = \mathbf{w}^T\mathbf{x}$; a non-linear activation function $\varphi(\cdot)$ produces the response on the axon, $y = \varphi(s) = \varphi(\mathbf{w}^T\mathbf{x})$]
• It can be implemented by a very simple algorithm, which makes it suitable for Artificial Neural Networks.
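As a concrete illustration of that simplicity, here is a minimal sketch of the formal neuron in Python/NumPy (the logistic activation and all values are illustrative assumptions, not from the slides):

```python
import numpy as np

def neuron(x, w, b, phi=lambda s: 1.0 / (1.0 + np.exp(-s))):
    """Formal neuron: cell potential s = w^T x + b, output y = phi(s)."""
    s = np.dot(w, x) + b          # summing junction (with bias)
    return phi(s)                 # non-linear activation

# Example: a 3-input neuron with a logistic activation
x = np.array([1.0, 0.5, -0.2])    # stimuli (dendrites)
w = np.array([0.4, -0.6, 0.9])    # synaptic weights
print(neuron(x, w, b=0.1))        # response (axon)
```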
Learning model and paradigms
• Learning model: simple rewarding mechanism
• In general terms we can define two learning paradigms:
– Supervised
– Unsupervised
Supervised learning: learning through teaching by examples
[Diagram: stimuli enter the learning system, which emits a response; a supervisor or teacher supplies the correct answer; a comparison produces the error $e[n]$, which drives the rewarding mechanism (an external forcing) that updates the weights, $\mathbf{w}_{n+1} = \mathbf{w}_n + \mathrm{Rewarding\_Function}$]
Rewarding mechanism: error-function minimization driven by the provided examples.
Learning by error correction
• A learning algorithm with concrete and useful results is the LMS algorithm (delta rule) of Bernard Widrow (1959).
[Diagram: external stimuli (signals) $\mathbf{x}$ feed the adaptive element, which produces the response $y_n = \mathbf{w}^T\mathbf{x}$; a comparison with the desired output $d$ (from the supervisor or teacher) yields the error $e = d - y$, and the learning algorithm updates the weights with the delta rule $\mathbf{w}_{n+1} = \mathbf{w}_n + \mu\, e\, \mathbf{x}$ (step size $\mu$)]
Bernard Widrow, "I'm an 'adaptive' guy". Professor Emeritus, Electrical Engineering Department, Stanford University, USA.
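A minimal sketch of the LMS update reconstructed above, identifying a toy linear system (the step size, data and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])          # unknown system to identify
w = np.zeros(3)                              # adaptive weights
mu = 0.05                                    # step size (learning rate)

for _ in range(2000):
    x = rng.standard_normal(3)               # external stimuli (signals)
    d = w_true @ x                           # desired output (teacher)
    y = w @ x                                # response y_n = w^T x
    e = d - y                                # error from the comparison
    w = w + mu * e * x                       # delta rule: w_{n+1} = w_n + mu e x

print(np.round(w, 3))                        # ~ [2.0, -1.0, 0.5]
```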
Multi-Layer Neural Networks
[Diagram: feed-forward computation through many hidden layers,
$\mathbf{y} = \Phi^{(3)}\big(\mathbf{W}^{(3)}\, \Phi^{(2)}\big(\mathbf{W}^{(2)}\, \Phi^{(1)}(\mathbf{W}^{(1)}\mathbf{x})\big)\big)$,
from the input vector (pattern) $\mathbf{x}$ to the outputs $\mathbf{y}$; the outputs are compared with the correct answer to get the error signal, which is back-propagated to get the derivatives for learning]
Back-Propagation learning algorithm (mid 1980s)
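A compact sketch of this feed-forward/back-propagate cycle on the classic XOR task (network size, learning rate and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
phi = lambda s: np.tanh(s)
dphi = lambda s: 1.0 - np.tanh(s) ** 2

# XOR patterns: input vectors X and correct answers T
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
T = np.array([[0], [1], [1], [0]], float)

W1 = rng.standard_normal((2, 4)) * 0.5        # hidden-layer weights
W2 = rng.standard_normal((4, 1)) * 0.5        # output-layer weights
eta = 0.5

for _ in range(5000):
    # feed-forward computation
    S1 = X @ W1; H = phi(S1)
    S2 = H @ W2; Y = phi(S2)
    # compare outputs with the correct answer to get the error signal
    E = T - Y
    # back-propagate the error signal to get the derivatives for learning
    D2 = E * dphi(S2)
    D1 = (D2 @ W2.T) * dphi(S1)
    W2 += eta * H.T @ D2
    W1 += eta * X.T @ D1

print(np.round(Y.ravel(), 2))                 # ~ [0, 1, 1, 0]
```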
Unsupervised learning
Learning through self-adaptation
[Diagram: stimuli enter the learning system, which emits a response; the rewarding mechanism is internal, with no external forcing]
Rewarding mechanism: a simple primal instinct that creates the adaptation, i.e. a natural evolutionary behavior.
Unsupervised learning
Hebbian learning
• Hebb's postulate: the strength of the connection depends on the activity between the neurons.
Donald Hebb (Canadian psychologist, 1904-1985), McGill University, Montreal
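Hebb's postulate is commonly formalized as the update $\Delta\mathbf{w} = \eta\, y\, \mathbf{x}$ for a linear neuron $y = \mathbf{w}^T\mathbf{x}$; the sketch below (with illustrative data) shows the weights aligning with the dominant correlation direction:

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.standard_normal(2) * 0.01
eta = 0.01

for _ in range(1000):
    x = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]])
    y = w @ x                     # post-synaptic activity
    w += eta * y * x              # Hebb: strength grows with joint activity

# Plain Hebbian growth is unbounded (Oja's rule adds normalization);
# the *direction* still tends toward the data's principal component.
print(w / np.linalg.norm(w))
```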
Neural Networks History: the Gartner Hype Cycle
• Neural network disillusionment
[Chart: expectations vs. time (1950-70, '80, '90, '00, '06, '10): Widrow's LMS at the technology trigger; MLP and BP at the peak of inflated expectations (hype, media); the trough of disillusionment; NNs' rebirth with RNNs along the slope of enlightenment toward the plateau of productivity]
BP-NNs disillusionment, '80s and '90s
• Supervised learning
– It requires labeled training data
– Almost all data is unlabeled
• Long learning time
– Very slow in networks with many hidden layers
– Vanishing gradient problem
• It may fall into poor local minima
– For deep networks these may be too far from the optimal solution
Back-propagation problems in the '80s and '90s
Three main problems of BP:
1. Difficulty of producing labeled training data sets: not enough labeled data.
2. CPUs not fast enough.
3. Difficulty of finding correct weights: error-propagation problems.
What has happened recently
1. Labeled data sets got much bigger.
2. Computers got much faster.
3. New paradigm for learning deep layers using unlabeled data (2006).
• Result: deep neural networks are now the state-of-the-art for many real-world problems.
Deep Neural Networks
Neural Networks History: the Gartner Hype Cycle
[Chart: the same hype cycle, updated: Widrow's LMS at the technology trigger (1950-70); MLP and BP at the peak of inflated expectations; the trough of disillusionment; NNs' 2nd rebirth with DNNs ('06) and DNN adoption in industry ('10), climbing the slope of enlightenment toward a plateau of productivity?]
Deep Neural Networks: Gartner Hype Cycle
[Chart: hypothesized trend: after BP's cycle, DNN expectations rise again, extrapolated toward 'Strong AI'; WARNING flagged by Bill Gates and Stephen Hawking]
http://www.huffingtonpost.com/james-barrat/hawking-gates-artificial-intelligence_b_7008706.html
Machine Learning performance vs. amount of data
[Chart: performance vs. amount of data: deep learning methods keep improving as data grows, while standard machine learning algorithms plateau]
Deep Learning definition
• Many definitions:
• DL is a set of algorithms in machine learning that attempt to learn at multiple levels, corresponding to different levels of abstraction. It typically uses artificial neural networks.
• DL is a class of machine learning techniques that exploit many layers of non-linear information processing for supervised or unsupervised feature extraction and transformation, and for pattern analysis and classification.
• ....
DL biological evidence
• For example, the layered organization of the visual system
[Diagram: external stimuli → receptors → hidden layers (memory, ideation, psyche, etc.) → motoneuron → muscle cells: many levels of transformation]
DL psychological-cognitive evidence
• Knowledge is represented at different levels of abstraction
[Diagram: the Ladder of Abstraction and the Data-Wisdom Pyramid, from concreteness to abstraction: Data → Information → Knowledge → Understanding → Insight → Wisdom]
Example of Deep Learning solutions
• Apple: Siri speech recognition, iPhone personal assistant, ...
• Facebook: massive data analysis, ...
• Google: Translator, Android's voice recognition, Word2Vec text processing (Google acquires AI startup DeepMind for > $500M), ...
• IBM: brain-like computing, deep learning for Big Data (IBM acquires AlchemyAPI, enhancing Watson's deep learning capabilities), ...
• Microsoft: speech, massive data analysis, ...
• Twitter: acquires deep learning startup Madbits
• Yahoo: acquires startup LookFlow to work on Flickr and deep learning
• As data keeps getting bigger, DL is coming to play a key role in:
– Data modeling
– Analytics solutions
– Leverage for competitive advantage
Three main DNN families (L. Deng, D. Yu 2014)
• Deep networks for unsupervised or generative learning
– Capture high-order correlations of the observed data when no information about target class labels is available.
• Deep networks for supervised learning
– Directly provide discriminative power for pattern classification purposes.
• Hybrid deep networks
– A mix of the previous models. The goal is discrimination, assisted, often in a significant way, by the outcomes of generative or unsupervised deep networks.
• The research activity in the field is very high.
Unsupervised generative model
• Ex. Deep Belief Networks (DBN)
• Stack of Restricted Boltzmann Machines (RBMs)
• Independent unsupervised training of each layer
[Diagram: input layer $V$, hidden layers $H_1, H_2, H_3$ and the output layer, connected by weights $W^{(1)}, \dots, W^{(4)}$; each pair of adjacent layers is trained as an RBM]
• DBNs can effectively utilize large amounts of unlabeled data for exploiting complex data structures.
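A minimal sketch of the greedy, layer-wise idea using scikit-learn's BernoulliRBM (layer sizes and the random binary data are illustrative assumptions; a real DBN would add supervised fine-tuning on top):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(3)
V = (rng.random((500, 64)) > 0.5).astype(float)    # unlabeled binary data

# Greedy, independent unsupervised training of each layer
rbm1 = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)
H1 = rbm1.fit_transform(V)                         # layer 1: V -> H1 (weights W(1))

rbm2 = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=20, random_state=0)
H2 = rbm2.fit_transform(H1)                        # layer 2: H1 -> H2 (weights W(2))

print(H2.shape)                                    # (500, 16): top-level features
```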
Deep networks for supervised learning
• Ex. Convolutional Neural Network (CNN), Yann LeCun (NYU)
• Specific architecture for image classification
Fig. from: http://parse.ele.tue.nl/cluster/2/CNNArchitecture.jpg
• Biologically inspired: small collections of neurons look at small portions of the input image, like receptive fields.
Convolutional Neural Network (CNN)
[Diagram, layers 1-7: convolutional layers (the same weights are used at all spatial locations in a layer), followed by fully-connected layers and a softmax to predict the object class]
• Biologically inspired: small collections of neurons look at small portions of the input image, like receptive fields.
• Won the 2012 ImageNet challenge with a 16.4% top-5 error rate
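The convolutional idea itself (one small kernel, shared across all spatial locations, each output seeing only a small receptive field) fits in a few lines; a minimal sketch with an illustrative kernel:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (cross-correlation, as in CNN practice):
    the same kernel slides over all spatial locations, so each output
    unit sees only a small receptive field and all units share weights."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

image = np.random.default_rng(4).random((8, 8))
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)     # 3x3 vertical-edge detector
print(conv2d(image, edge_kernel).shape)            # (6, 6) feature map
```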
Hybrid DNN architecture
[Diagram: layers $W^{(1)}, W^{(2)}, \dots, W^{(N-1)}$ are trained by unsupervised learning; a softmax classifier $W^{(N)}$ is trained by supervised learning; the whole network then undergoes supervised final fine-tuning]
DNN by stacked autoencoder
[Diagram: hidden layers with weights $W^{(1)}, \dots, W^{(4)}$, each pre-trained as an autoencoder; a softmax classifier on top produces the output classes]
• Separate unsupervised pre-training of the hidden layers
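A minimal sketch of the separate, layer-wise pre-training described above: each sigmoid autoencoder is trained on the codes of the previous one (sizes, learning rate and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
phi = lambda s: 1.0 / (1.0 + np.exp(-s))          # logistic activation

def train_autoencoder(X, n_hidden, eta=0.5, epochs=1000):
    """Train one sigmoid autoencoder layer by gradient descent on the
    reconstruction error; return encoder weights and the hidden codes."""
    n = X.shape[1]
    W_enc = rng.standard_normal((n, n_hidden)) * 0.1
    W_dec = rng.standard_normal((n_hidden, n)) * 0.1
    for _ in range(epochs):
        H = phi(X @ W_enc)                        # encode
        R = phi(H @ W_dec)                        # reconstruct
        E = X - R                                 # reconstruction error
        D_dec = E * R * (1 - R)
        D_enc = (D_dec @ W_dec.T) * H * (1 - H)
        W_dec += eta * H.T @ D_dec / len(X)       # backprop within the layer
        W_enc += eta * X.T @ D_enc / len(X)
    return W_enc, phi(X @ W_enc)

X = rng.random((200, 20))                          # unlabeled data
W1, H1 = train_autoencoder(X, 10)                  # pre-train layer 1 on the input
W2, H2 = train_autoencoder(H1, 5)                  # pre-train layer 2 on layer-1 codes
print(H2.shape)                                    # (200, 5): features for a softmax on top
```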
Large Scale Deep Neural Network
Parallel and distributed computing
[Diagram: computing architectures connected by a high-speed network: SM-MIMD, DM-MIMD, SIMD and special-purpose architectures, vector supercomputers, workstations and storage]
Large Scale Distributed Deep Networks
• Problem: training a deep network with billions of parameters using tens of thousands of CPU cores.
• Exploit many kinds of parallelism (a toy data-parallel sketch follows this list):
• Data parallelism
• Model parallelism
• Data and model parallelism
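A single-process toy sketch of the data-parallel case: each 'worker' computes a gradient on its own shard and the averaged gradient updates the shared weights (the least-squares problem is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((10000, 5))
d = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])       # targets of a linear system

w = np.zeros(5)
shards = np.array_split(np.arange(10000), 4)        # data parallelism: 4 workers

for _ in range(200):
    # each worker computes the gradient of the squared error on its shard
    grads = [-2 * X[s].T @ (d[s] - X[s] @ w) / len(s) for s in shards]
    w -= 0.05 * np.mean(grads, axis=0)              # combine and apply

print(np.round(w, 2))                               # ~ [1, -2, 0.5, 0, 3]
```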
Large scale DNN
• Model parallelism: the network is partitioned across machines 1-4
• Network partitions give minimal network traffic: the most densely connected areas are on the same partition
[Diagram: a deep network split across four machines, with the data feeding the bottom partitions]
Large scale SGD
• Asynchronous Stochastic Gradient Descent (SGD), Widrow's generalized delta rule: $\mathbf{w}_{n+1} = \mathbf{w}_n + \Delta\mathbf{w}_n$
• 'Downpour' SGD(1): model replicas asynchronously fetch the parameters $\mathbf{w}$ from, and push the gradients $\Delta\mathbf{w}$ to, the parameter server.
[Diagram: a parameter server above the model replicas, each replica training on its own data shard]
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng, 'Large Scale Distributed Deep Networks', NIPS 2012.
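A toy, single-machine sketch of the Downpour idea: threads play the model replicas, a shared dict plays the parameter server, and each replica asynchronously fetches w and pushes gradients (real Downpour is distributed across machines; all sizes here are illustrative):

```python
import numpy as np
import threading

rng = np.random.default_rng(7)
X = rng.standard_normal((8000, 4))
d = X @ np.array([1.0, -1.0, 2.0, 0.5])           # target system

params = {"w": np.zeros(4)}                        # toy 'parameter server'
lock = threading.Lock()
shards = np.array_split(np.arange(8000), 4)        # data shards

def replica(shard, seed, steps=2000, mu=0.01):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        w = params["w"].copy()                     # (1) fetch parameters w
        j = r.choice(shard)                        # one sample from this shard
        e = d[j] - w @ X[j]                        # local error
        with lock:                                 # (2) push gradient Delta w
            params["w"] += mu * e * X[j]

threads = [threading.Thread(target=replica, args=(s, i)) for i, s in enumerate(shards)]
for t in threads: t.start()
for t in threads: t.join()
print(np.round(params["w"], 2))                    # ~ [1, -1, 2, 0.5]
```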
Large scale L-BFGS
• Limited-memory quasi-Newton algorithm of Broyden, Fletcher, Goldfarb and Shanno (L-BFGS): $\mathbf{w}_{n+1} = \mathbf{w}_n + \Delta\mathbf{w}_n$
• A single 'coordinator' sends small messages to the model replicas and to the parameter server to orchestrate batch optimization.
[Diagram: the coordinator exchanges small messages with the parameter server and the model replicas, each replica training on its own data]
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng, 'Large Scale Distributed Deep Networks', NIPS 2012.
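For the batch-optimization side, here is single-process L-BFGS via SciPy on an illustrative least-squares problem (a stand-in for the distributed scheme above, not a reimplementation of it):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
X = rng.standard_normal((1000, 6))
d = X @ rng.standard_normal(6)                     # targets of a linear system

def loss_and_grad(w):
    """Batch objective and its gradient, as a worker pool would supply them."""
    e = X @ w - d
    return 0.5 * np.mean(e ** 2), X.T @ e / len(e)

res = minimize(loss_and_grad, np.zeros(6), jac=True, method="L-BFGS-B")
print(np.round(res.x, 2), res.nit)                 # solution and iteration count
```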
DNN on Big Data applications
DNN: state-of-the-art performance reported in several domains
• Text, Language Model and Natural Language Processing
• Information Retrieval
• Visual Object Recognition and Computer Vision
• Speech Recognition and Audio Processing
• Multimodal and Multi-task Learning: Text-Image, Speech-Image, ...
Text and Language processing
• Feedforward Neural Net Language Model
[Diagram: neural net language model architecture: the input words $w(t-M), \dots, w(t-2), w(t-1)$ pass through a projection layer $U$ and a hidden layer to the output, predicting $w(t)$]
• The training is done using backpropagation.
• The word vectors are in the matrix $U$.
Text and Language processing
• Skip-gram architecture: predicts the surrounding words given the current word
[Diagram: the input $w(t)$ passes through hidden layers to the outputs $w(t-2), w(t-1), w(t+1), w(t+2)$]
Text and Language processing
• Ex. skip-gram text model
[Diagram: raw sparse features feed a single embedding function, whose output feeds a hierarchical-softmax classifier]
Mikolov, Chen, Corrado and Dean, 'Efficient Estimation of Word Representations in Vector Space', http://arxiv.org/abs/1301.3781.
Text and Language processing
• Continuous Bag-of-Words (CBOW) architecture: predicts the current word given the context
[Diagram: the inputs $w(t-2), w(t-1), w(t+1), w(t+2)$ pass through hidden layers to the output $w(t)$]
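Both architectures can be tried directly; a minimal sketch using the gensim library (gensim 4 parameter names; the tiny corpus is an illustrative assumption, real models train on billions of words):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; real models train on billions of words
sentences = [
    ["deep", "learning", "on", "big", "data"],
    ["skip", "gram", "predicts", "surrounding", "words"],
    ["cbow", "predicts", "the", "current", "word"],
] * 50

# sg=1 selects the skip-gram architecture; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=32, window=2, sg=1, min_count=1, epochs=20)
print(model.wv.most_similar("data", topn=3))
```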
Example: Google
• Neural network trained to predict a word given the words near it.
• It allows you to create numerical representations of each word.
• These representations can be manipulated mathematically like classic vectors.
• Training is carried out on databases of hundreds of billions of words.
http://deeplearning4j.org/word2vec.html
Example: Google
• W2V is a neural net that processes text before that text is handled by deep-learning algorithms.
• W2V creates features without human intervention, including the context of individual words.
• W2V can make highly accurate guesses about a word's meaning based on its past appearances.
• Word: 'france'

  Word          Cosine distance
  ------------  ---------------
  spain         0.678515
  belgium       0.665923
  netherlands   0.652428
  italy         0.633130
  switzerland   0.622323
  luxembourg    0.610033
  portugal      0.577154
  russia        0.571507
  germany       0.563291
  catalonia     0.534176
Example: Google
• Here's a graph of words associated with "China" using Word2vec
[Figure: word-vector neighborhood of "China"]
Example: Google
• ‘Semantic computation’
• The word vectors capture many linguistic regularities, for example the vector operations:
vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and
vector('king') - vector('man') + vector('woman') is close to vector('queen')
• W2V is a key element for the development of applications of great 'V'alue operating on Big Data.
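A hedged sketch of this 'semantic computation' using gensim's downloader and Google's pre-trained News vectors (the model name follows gensim's catalogue; the download is a large file):

```python
import gensim.downloader as api

# Downloads Google's pre-trained News vectors on first use (large file)
wv = api.load("word2vec-google-news-300")

# vector('king') - vector('man') + vector('woman') ~ vector('queen')
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# vector('Paris') - vector('France') + vector('Italy') ~ vector('Rome')
print(wv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
```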
Visual object recognition: GoogleNet(1)
• Deep network: 1 billion connections; a 9-layer, locally connected sparse autoencoder trained over a dataset of 10 million 200x200-pixel images downloaded from the Internet.
• Training: parallel asynchronous stochastic gradient descent on a cluster of 1,000 machines (16,000 cores) for three days.
Image from (1)
(1) Q. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng, 'Building high-level features using large scale unsupervised learning', International Conference on Machine Learning, 2012.
Visual object recognition (2014, Google)
• Winner of the 2014 ImageNet challenge with a 6.66% top-5 error rate
• 6 modules of convolutional layers
• 24 layers deep!
• Good fine-grained classification
• Good generalization
• Sensible errors
Image from: http://www.engadget.com/2014/09/08/google-details-object-recognition-tech/
[Example predictions: "Snake", "Dog"]
Visual object recognition: complex scene summarization in a few words(1)
"Two pizzas sitting on top of a stove top oven"
(1) Google Research Blog, http://googleresearch.blogspot.it/2014/11/a-picture-is-worth-thousand-coherent.html
App: NVIDIA's DRIVE PX(1)
Self-Driving Cars Using Deep Learning
(1) http://dataconomy.com/nvidias-drive-px-platform-to-pave-way-for-self-driving-cars-using-deep-learning/
Industrial sectors of interest
• Topics include:
Banking / Retail / Finance
– Identify: prospective customers, dissatisfied customers, good customers, bad payers
– Obtain: more effective advertising, less credit risk, less fraud, decreased churn rate
– Finance: econometrics, time-series analysis and prediction
Biomedical / Biometrics
– Medicine: screening, diagnosis and prognosis, drug discovery (semantic medicine)
– Security: face recognition, signature/fingerprint/iris verification, speaker recognition, DNA fingerprinting, ...
Computer / Internet / Multimedia
– Computer interfaces: troubleshooting wizards, handwriting and speech, brain waves
– Internet: hit ranking, text categorization, text translation, sentiment analysis, ...
– Cyber security: network anomaly, cyber-attack prediction, spam detection, malicious code recognition, ...
– Audio/video processing, audio-video content retrieval, scene analysis, video games, virtual movies, ...
Electrical / Computer Engineering
– Wireless communication, cognitive radio, remote sensing, array processing, multi-sensor data fusion, robotics, Smart Grid, intelligent house, ...
Data processing
– Classification, time-series filtering, prediction, regression, clustering, spam filtering, security, ...
• Etc.
Research activity @
Computational intelligence
• Fast DNN models and architectures
• Random feature extraction
• Semi-supervised models
• Evolutionary methods for learning
• Distributed learning with Big Data
Research activity @
Ex. Large Scale Distributed Learning on Big Data
• Development of learning algorithms that need no communication to a single central node and that can scale to large networks.
• The data are distributed over a network of interconnected agents.
• Applications include: learning on sensor networks, on peer-to-peer networks, on swarms of robots, ...
• Lynx toolbox: an open-source MATLAB toolbox designed for fast prototyping of supervised machine learning simulations.
Highlights of Deep Learning on Big Data
• DL can be used to merge symbolic and non-symbolic heterogeneous information
• Development of parallel DL algorithms distributed on clusters of servers and/or parallel hardware (e.g. CUDA GPUs, ...)
• Mixed supervised and unsupervised learning
• Possibility of continuous adaptation (learning while working)
• Possibility of customized solutions for specific problems
• Real-time data stream processing
• On the order of weeks to train on large-scale datasets, even on the fastest available GPUs
• Heuristic approach for determining the network topology
• Many tricks needed to make them learn optimally
• Developing applications with DL requires expertise and experience
Epilogue
• Had Aristotle and my elementary school teacher already understood everything?
Conclusions
• The problem of artificial consciousness seems to constitute the final act in the history of engineering. Giving the term 'engineer' the extended meaning of 'one who makes', the construction of an artifact able to say 'I exist' could represent the final dream of the human builder who wants to build even without knowing.
Ing. Vincenzo Tagliasco (1941-2008)
• My great ailment has always been, and always will be, one: that of desiring and dreaming, instead of willing and doing.
Ing. Carlo Emilio Gadda (1893-1973)
• Questions?