Ivan Laptev
WILLOW, INRIA/ENS/CNRS, Paris
Computer Vision:
Weakly-supervised learning from
video and images
CSClub
Saint Petersburg
November 17, 2014
Joint work with: Piotr Bojanowski – Rémi Lajugie – Maxime Oquab –
Francis Bach – Leon Bottou – Jean Ponce –
Cordelia Schmid – Josef Sivic
Contacts: Official website: http://visionlabs.ru/ Contact person: Alexander Khanin, E-mail: [email protected], Tel.: +7 (926) 988-7891
About the company
VisionLabs is a team of professionals with deep knowledge and substantial practical experience in developing computer vision algorithms and intelligent systems.
We create and deploy computer vision technologies, opening new possibilities for changing the world around us for the better.
Team
Alexander Khanin, Chief Executive Officer
Alexey Nekhaev, Executive Officer
Slava Kazmin, Chief Technical Officer
Ivan Laptev, Scientific advisor
Sergey Milyaev, Senior CV engineer
Alexey Kordichev, Financial advisor
Ivan Truskov, Software developer
Sergey Cherepanov, Software developer
Our team is a symbiosis of science and business
Areas of activity
Face recognition technology: a system for detecting fraudsters in banks
License plate recognition technology: a system for tracking and automating vehicle access
Safe city technologies: a system for detecting violations and dangerous situations
Nation-scale projects
Achievements
We are looking for like-minded people
Creating and deploying intelligent systems
Solving interesting practical problems
Working in a friendly, ambitious team
Thank you for your attention!
What is Computer Vision?
What is the recent progress?
1990s:
• Research: recognition at the level of a few toy objects (COIL-20 dataset)
• Industry: automated quality inspection (controlled lighting, scale, …)
Now:
• Face recognition in social media
• ImageNet: 14M images, 21K classes; 6% top-5 error rate in the 2014 challenge
Why image and video analysis? Data:
• ~5K image uploads every minute
• >34K hours of video uploaded every day
• TV channels recorded since the 60's
• ~30M surveillance cameras in the US => ~700K video hours/day
• ~2.5 billion new images/month
• And even more with future wearable devices
Sources: movies, TV, YouTube
Why looking at people?
How many person-pixels are in the video?
Movies: 40%, TV: 35%, YouTube: 34%
How many person-pixels in our daily life?
Wearable camera data (Microsoft SenseCam dataset): ~4%
What are the difficulties?
• Large variations in appearance: occlusions, non-rigid motion, viewpoint changes, clothing, …
• Manual collection of training samples is prohibitive: many action classes, rare occurrence
• Action vocabulary is not well-defined
Examples: Action "Open": … Action "Hugging": …
This talk:
• Brief overview of recent techniques
• Weakly-supervised learning from video and scripts
• Weakly-supervised learning with convolutional neural networks
Standard visual recognition pipeline
Example action classes: GetOutCar, AnswerPhone, Kiss, HandShake, StandUp, DriveCar
1. Collect image/video samples and corresponding class labels
2. Design an appropriate data representation with certain invariance properties
3. Design or use existing machine learning methods for learning and classification
Bag-of-Features action recognition [Laptev, Marszałek, Schmid, Rozenfeld 2008]
1. Extraction of local features: space-time patches
2. Feature description
3. Feature quantization: k-means clustering (k = 4000), yielding an occurrence histogram of visual words
4. Non-linear SVM with χ2 kernel
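The quantization and kernel steps of this pipeline can be sketched in numpy as follows. The vocabulary and descriptors are toy stand-ins for space-time features, and the χ² bandwidth heuristic is an assumption of this sketch, not the paper's setting:

```python
import numpy as np

def assign_to_vocabulary(descriptors, vocabulary):
    """Quantize local descriptors to their nearest visual word (k-means centers)."""
    # pairwise squared distances: |d|^2 - 2 d.c + |c|^2
    d2 = (descriptors**2).sum(1)[:, None] - 2 * descriptors @ vocabulary.T \
         + (vocabulary**2).sum(1)[None, :]
    return d2.argmin(axis=1)

def bof_histogram(descriptors, vocabulary):
    """Occurrence histogram of visual words, L1-normalized."""
    words = assign_to_vocabulary(descriptors, vocabulary)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chi2_kernel(H1, H2, eps=1e-10):
    """Exponentiated chi-square kernel between rows of H1 and H2."""
    num = (H1[:, None, :] - H2[None, :, :])**2
    den = H1[:, None, :] + H2[None, :, :] + eps
    dist = 0.5 * (num / den).sum(-1)
    return np.exp(-dist / dist.mean())  # mean distance as a bandwidth heuristic
```

The resulting kernel matrix can be passed to any kernel SVM trainer (e.g. a precomputed-kernel SVM) for action classification.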
Action classification
Test episodes from movies “The Graduate”, “It’s a Wonderful Life”,
“Indiana Jones and the Last Crusade”
Where to get training data?
• Shoot actions in the lab: KTH dataset, Weizmann dataset, …
  - Limited variability
  - Unrealistic
• Manually annotate existing content: HMDB, Olympic Sports, UCF50, UCF101, …
  - Very time-consuming
• Use readily available video scripts: www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
  - Scripts are available for 1000's of hours of movies and TV series
  - Scripts describe dynamic and static content of videos
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…
…
RICK
Why weren't you honest with me? Why
did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA
Oh, it wasn't my secret, Richard.
Victor wanted it that way. Not even
our closest friends knew about our
marriage.
…
Subtitle timestamps (01:20:17, 01:20:23) align the subtitles with the movie script.
• Scripts available for >500 movies (no time synchronization)
www.dailyscript.com, www.movie-page.com, www.weeklyscript.com …
• Subtitles (with time info) are available for most movies
• Can transfer time to scripts by text alignment
Script-based video annotation
[Laptev, Marszałek, Schmid, Rozenfeld 2008]
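The time-transfer step can be sketched with Python's standard difflib: align the script and subtitle word streams, then copy each matched subtitle word's timestamp onto the corresponding script word. This is a toy sketch; the text and times are made up, and a real system would use a more robust alignment:

```python
import difflib

def align_script_to_subtitles(script_words, subtitle_words, word_times):
    """Transfer subtitle timestamps to script words by matching the word streams.

    subtitle_words[i] is shown at word_times[i]; returns {script_index: time}
    for every script word that the matcher pairs with a subtitle word.
    """
    sm = difflib.SequenceMatcher(a=script_words, b=subtitle_words, autojunk=False)
    times = {}
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            times[block.a + k] = word_times[block.b + k]
    return times

# Toy example: the subtitle stream lacks the scene description present in the script.
script = "Rick sits down with Ilsa it wasn't my secret Richard".lower().split()
subs   = "it wasn't my secret Richard".lower().split()
sub_t  = [80.6, 80.9, 81.2, 81.5, 81.8]   # seconds, one made-up time per word
mapping = align_script_to_subtitles(script, subs, sub_t)
```

Here the scene description ("Rick sits down with Ilsa") receives no timestamp, while the matched dialogue words inherit the subtitle timing, which is exactly how time is transferred to scripts.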
Scripts as weak supervision
Challenges:
• Imprecise temporal localization (uncertainty, e.g. between 24:25 and 24:51)
• No explicit spatial localization
• NLP problems, scripts ≠ training labels:
  "… Will gets out of the Chevrolet. …" vs. "… Erin exits her new truck …" → Get-out-car
Previous work
Sivic, Everingham, and Zisserman,
''Who are you?'' -- Learning Person Specific
Classifiers from Video, In CVPR 2009.
Buehler, Everingham, and Zisserman "Learning
sign language by watching TV (using weakly
aligned subtitles)", In CVPR 2009.
Duchenne, Laptev, Sivic, Bach and Ponce,
"Automatic Annotation of Human Actions in
Video", In ICCV 2009.
Joint Learning of Actors and Actions [Bojanowski et al. ICCV 2013]
Script: "Rick walks up behind Ilsa". Which person track is Rick? Which track walks?
Formulation: Cost function
• Actor labels (Rick, Ilsa, Sam), actor image features, actor classifier
• Weak supervision from scripts:
  - Person p appears at least once in clip N (e.g., p = Rick)
  - Action a appears at least once in clip N (e.g., a = Walk)
  - Person p and action a appear together in clip N
Image and video features
• Face features: facial features [Everingham'06]; HOG descriptor on normalized face image
• Action features: dense trajectory features in the person bounding box [Wang et al.'11]
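A toy numpy sketch of the HOG idea behind the face descriptor: a histogram of gradient orientations weighted by gradient magnitude. Real HOG adds cell grids and block normalization, which are omitted here:

```python
import numpy as np

def hog_like_descriptor(patch, n_bins=8):
    """Toy HOG-style descriptor: magnitude-weighted histogram of unsigned
    gradient orientations over a grayscale patch, L2-normalized."""
    gy, gx = np.gradient(patch.astype(float))        # row and column gradients
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A patch whose intensity ramps along x has all gradient energy in one bin.
patch = np.tile(np.arange(8.0), (8, 1))
desc = hog_like_descriptor(patch)
```

For the ramp patch every gradient points along x, so the descriptor concentrates in the first orientation bin, which is the invariance-to-absolute-intensity behavior that makes HOG useful on normalized face crops.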
Results for Person Labelling
American Beauty (11 character names), Casablanca (17 character names)
Results for Person + Action Labelling
Casablanca,
Walking
Finding Actions and Actors in Movies
[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
Action Learning with Ordering Constraints [Bojanowski et al. ECCV 2014]
Cost Function
Weak supervision from ordering constraints on Z: each clip is annotated with an ordered sequence of action indices (e.g., action labels 2, 4, 1, 2, 3, 2) that must be assigned, in that order, to consecutive video time intervals.
Is the optimization tractable?
• Path constraints are implicit
• Cannot use off-the-shelf solvers
• Frank-Wolfe optimization algorithm
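Frank-Wolfe only needs a linear minimization oracle (LMO) over the feasible set, which is why it suits constraints like these, where the paper's LMO is a dynamic program over valid orderings. Below is a generic sketch on a toy instance, the probability simplex with a quadratic cost; this simplified oracle is my substitution, not the paper's:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, iters=5000):
    """Generic Frank-Wolfe: repeatedly move toward the vertex returned by the
    linear minimization oracle, staying inside the convex feasible set."""
    x = x0.astype(float)
    for k in range(iters):
        s = lmo(grad(x))                 # argmin_{s in C} <grad f(x), s>
        gamma = 2.0 / (k + 2.0)          # standard diminishing step size
        x = (1 - gamma) * x + gamma * s
    return x

# Toy instance: minimize 0.5 * ||x - b||^2 over the probability simplex.
b = np.array([0.2, 0.3, 0.5])
grad = lambda x: x - b

def simplex_lmo(g):
    """Best simplex vertex for a linear objective: one-hot at the min coordinate."""
    s = np.zeros_like(g)
    s[g.argmin()] = 1.0
    return s

x = frank_wolfe(grad, simplex_lmo, np.array([1.0, 0.0, 0.0]))
```

Because every iterate is a convex combination of vertices, the path constraints never have to be written out explicitly, which is the point made on this slide.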
Results
• 937 video clips from 60 Hollywood movies
• 16 action classes
• Each clip is annotated by a sequence of n actions (2 ≤ n ≤ 11)
Object recognition with Convolutional Neural Networks
• The ImageNet Large-Scale Visual Recognition Challenge is very hard: 1000 classes, 1.2M images
• The ILSVRC'12 results of Krizhevsky et al. improved on other methods by a large margin
• 2012: Krizhevsky et al.; 2014: GoogLeNet, 6% top-5 error
CNN of Krizhevsky et al. NIPS'12
• Learns low-level features at the first layer
• Has some tricks, but the main principle is similar to LeCun'88
• Has 60M parameters and 650K neurons
• Success seems to be determined by (a) lots of labeled images and (b) very fast GPU implementations; neither (a) nor (b) was available until very recently
Approach
1. Design training/test procedure using sliding windows
2. Train adaptation layers to map labels
See also [Girshick et al.'13], [Donahue et al.'13], [Sermanet et al.'14], [Zeiler and Fergus'13]; the transfer learning and ImageNet workshops at ICCV'13
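The adaptation-layer idea can be sketched as follows: the pretrained convolutional layers are frozen, and only a small new layer is trained to map source features to target labels. The features below are synthetic stand-ins for pre-extracted FC7-like activations; this is a minimal sketch, not the authors' implementation:

```python
import numpy as np

def train_adaptation_layer(feats, labels, lr=0.1, epochs=500):
    """Train a single linear adaptation layer (logistic regression) on top of
    frozen pretrained features; the feature extractor itself is never updated."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = feats @ w + b
        p = 1.0 / (1.0 + np.exp(-z))      # sigmoid
        g = p - labels                    # gradient of log-loss wrt logits
        w -= lr * feats.T @ g / len(labels)
        b -= lr * g.mean()
    return w, b

# Hypothetical pre-extracted features for a two-class toy problem.
rng = np.random.default_rng(1)
pos = rng.normal(loc=+1.0, size=(50, 8))
neg = rng.normal(loc=-1.0, size=(50, 8))
feats = np.vstack([pos, neg])
labels = np.array([1.0] * 50 + [0.0] * 50)
w, b = train_adaptation_layer(feats, labels)
pred = (feats @ w + b) > 0
```

Training only the adaptation layer is cheap and needs far fewer labeled target examples than retraining the whole network, which is why this transfer setup works with modest target datasets.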
Approach – sliding window training / testing
Results: object localization
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
Vision works?
VOC Action Classification Taster Challenge
Given the bounding box of a person, predict whether they are performing a given action: playing instrument? reading?
Encourages research on still-image activity recognition: a more detailed understanding of images
Nine Action Classes
Phoning Playing Instrument Reading Riding Bike Riding Horse
Running Taking Photo Using Computer Walking
CNN action recognition and localization
Qualitative results: reading, phoning, playing instrument
Results PASCAL VOC 2012
Object classification
Action classification
[Oquab, Bottou, Laptev, Sivic 2013, HAL-00911179]
Are bounding boxes needed for training CNNs?
Image-level labels: Bicycle, Person [Oquab, Bottou, Laptev, Sivic, 2014]
Motivation: labeling bounding boxes is tedious
Motivation: image-level labels are plentiful
“Beautiful red leaves in a back street of Freiburg”
[Kuznetsova et al., ACL 2013]
http://www.cs.stonybrook.edu/~pkuznetsova/imgcaption/captions1K.html
Motivation: image-level labels are plentiful
“Public bikes in Warsaw during night”
https://www.flickr.com/photos/jacek_kadaj/8776008002/in/photostream/
Let the algorithm localize the object in the image
[Oquab, Bottou, Laptev, Sivic, 2014]
Example training images with bounding boxes
The locations of objects or their parts learnt by the CNN
NB: Related to multiple instance learning, e.g. [Viola et al.’05] and weakly supervised object
localization, e.g. [Pandey and Lazebnik'11], [Prest et al.'12], [Oh Song et al. ICML'14], …
Approach: search over object’s location
1. Efficient window sliding to find object location hypothesis
2. Image-level aggregation (max-pool)
3. Multi-label loss function (allow multiple objects in image)
See also [Sermanet et al.'14] and [Chatfield et al.'14]
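Steps 1 and 2 can be sketched in numpy: slide a fixed-size window over the image, score every position for a class, then take a global max-pool so the per-image score is the best-scoring candidate location. The window scorer here is a stand-in (mean intensity); in the real system it is the CNN's class score:

```python
import numpy as np

def score_map(image, window_scorer, win=4, stride=2):
    """Step 1: slide a window over the image and score each position (one class)."""
    H, W = image.shape
    rows = (H - win) // stride + 1
    cols = (W - win) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = image[i*stride:i*stride+win, j*stride:j*stride+win]
            out[i, j] = window_scorer(patch)
    return out

def image_score(image, window_scorer):
    """Step 2: aggregate the score map with a global max-pool, so the per-image
    score comes from the best-scoring candidate object position."""
    m = score_map(image, window_scorer)
    return m.max(), np.unravel_index(m.argmax(), m.shape)

# Stand-in scorer: mean intensity; the bright square plays the role of the object.
img = np.zeros((12, 12))
img[6:10, 6:10] = 1.0
score, loc = image_score(img, np.mean)
```

The argmax location is a byproduct of the max-pool, which is why image-level supervision still yields (coarse) localization.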
Diagram: C1-C2-C3-C4-C5, FC6, FC7 (9216-dim, 4096-dim, 4096-dim vectors), adaptation layers FCa, FCb; a max-pool over the image gives a per-image score for each class (motorbike, person, diningtable, pottedplant, chair, car, bus, train, …).
1. Efficient window sliding to find object location
[Figure 2: Network architecture C1-C2-C3-C4-C5, FC6, FC7, FCa, FCb. The layer legend indicates the number of maps, whether the layer performs cross-map normalization (norm), pooling (pool), or dropout, and reports its subsampling ratio with respect to the input image. Convolutional feature extraction layers are trained on 1512 ImageNet classes (Oquab et al., 2014); adaptation layers are trained on Pascal VOC. See [21, 26] and Section 3 for full details.]
Initial work [1, 6, 7, 15, 37] on weakly supervised object localization has focused on learning from images containing prominent and centered objects in images with limited background clutter. More recent efforts attempt to learn from images containing multiple objects embedded in complex scenes [2, 9, 28] or from video [30]. These methods typically localize objects with visually consistent appearance in the training data that often contains multiple objects in different spatial configurations and cluttered backgrounds. While these works are promising, their performance is still far from the fully supervised methods. Our work is also related to recent methods that find distinctive mid-level object parts for scene and object recognition in an unsupervised [34] or a weakly supervised [10, 20] setting.
In contrast to the above methods we develop a weakly supervised learning method based on convolutional neural networks (CNNs) [22, 24]. Convolutional neural networks have recently demonstrated excellent performance on a number of visual recognition tasks that include classification of entire images [11, 21, 40], predicting presence/absence of objects in cluttered scenes [4, 26, 31, 32] or localizing objects by bounding boxes [16, 32]. However, the current CNN architectures assume in training a single prominent object in the image with limited background clutter [11, 21, 32, 40] or require fully annotated object locations in the image [16, 26]. Learning from images containing multiple objects in cluttered scenes with only weak object presence/absence labels has been so far limited to representing entire images without explicitly searching for the location of individual objects [4, 31, 40], though some level of robustness to the scale and position of objects is gained by jittering.
In this work, we develop a weakly supervised convolutional neural network pipeline that learns from complex scenes containing multiple objects by explicitly searching over possible object locations and scales in the image. We demonstrate that our weakly supervised approach achieves the best published result on the Pascal VOC 2012 object classification dataset, outperforming methods training from entire images [4, 31, 40] as well as performing on par or better than fully supervised methods [26].
3 Network architecture for weakly supervised learning
We build on the fully supervised network architecture of [26] that consists of five convolutional and four fully connected layers and assumes as input a fixed-size image patch containing a single relatively tightly cropped object. To adapt this architecture to weakly supervised learning we introduce the following three modifications. First, we treat the fully connected layers as convolutions, which allows us to deal with nearly arbitrary-sized images as input. Second, we explicitly search for the highest-scoring object position in the image by adding a single global max-pooling layer at the output. Third, we use a cost function that can explicitly model multiple objects present in the image. The three modifications are discussed next and the network architecture is illustrated in Figure 2.
…
2. Image-level aggregation using global max-pool
…
3. Multi-label loss function
(to allow for multiple objects in image)
Sum of K (=20) log-loss functions, one for each of the K classes:

  ℓ(f(x), y) = Σ_{k=1}^{K} log(1 + e^(−y_k f_k(x)))

where f(x) is the K-vector of network outputs for image x and y is the K-vector of (+1, −1) labels indicating the presence/absence of each class.
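A numpy sketch of this per-class log-loss and its gradient with respect to the network outputs; the gradient's sign is what determines whether a class score gets pushed up (present class) or down (absent class). The example scores and labels are made up:

```python
import numpy as np

def multilabel_logloss(scores, labels):
    """Sum over K classes of log(1 + exp(-y_k * f_k(x))), with y_k in {-1, +1}."""
    return np.log1p(np.exp(-labels * scores)).sum()

def grad_wrt_scores(scores, labels):
    """Gradient of the loss wrt the network outputs: negative for present
    classes (score should increase), positive for absent classes (decrease)."""
    return -labels / (1.0 + np.exp(labels * scores))

scores = np.array([2.0, -1.0, 0.5])     # network outputs after global max-pool
labels = np.array([1.0, -1.0, -1.0])    # class 0 present, classes 1-2 absent
g = grad_wrt_scores(scores, labels)
```

Because each class has its own independent loss term, multiple objects in the same image are handled naturally, unlike a softmax over mutually exclusive classes.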
Search for objects using max-pooling
Max-pooling search over per-class score maps (aeroplane map, car map): the receptive field of the maximum-scoring neuron marks where the network "found something".
• Correct label: increase the score for this class ("Keep up the good work!")
• Incorrect label: decrease the score for this class ("Wrong!")
What is the effect of errors?
Multi-scale training and testing
Rescale input images by factors in [0.7 … 1.4].
Figure 3: Weakly supervised training (per-class score maps: chair, diningtable, sofa, pottedplant, person, car, bus, train, …).
Figure 4: Multiscale object recognition.
Convolutional adaptation layers. The network architecture of [26] assumes a fixed-size image patch of 224×224 RGB pixels as input and outputs a 1×1×N vector of per-class scores, where N is the number of classes. The aim is to apply the network to bigger images in a sliding window manner, thus extending its output to n×m×N, where n and m denote the number of sliding window positions in the x- and y-direction in the image, respectively, computing the N per-class scores at all input window positions. While this type of sliding was performed in [26] by applying the network to independently extracted image patches, here we achieve the same effect by treating the fully connected adaptation layers as convolutions. For a given input image size, the fully connected layer can be seen as a special case of a convolution layer where the size of the kernel is equal to the size of the layer input. With this procedure the output of the final adaptation layer FC7 becomes a 2×2×N output score map for a 256×256 RGB input image, as shown in Figure 2. As the global stride of the network is 32 pixels, adding 32 pixels to the image width or height increases the width or height of the output score map by one. Hence, for example, a 2048×1024 pixel input would lead to a 58×26 output score map containing the score of the network for all classes for the different locations of the input 224×224 window with a stride of 32 pixels. While this architecture is typically used for efficient classification at test time, see e.g. [32], here we also use it at training time (as discussed in Section 4) to efficiently examine the entire image for possible locations of the object during weakly supervised training.
Explicit search for object's position via max-pooling. The aim is to output a single image-level score for each of the object classes independently of the input image size. This is achieved by aggregating the n×m×N matrix of output scores for n×m different positions of the input window using a global max-pooling operation into a single 1×1×N vector, where N is the number of classes. Note that the max-pooling operation effectively searches for the best-scoring candidate object position within the image, which is crucial for weakly supervised learning where the exact position of the object within the image is not given at training. In addition, due to the max-pooling operation the output of the network becomes independent of the size of the input image, which will be used for multi-scale learning in Section 4.
Multi-label classification cost function. The Pascal VOC classification task consists in telling whether at least one instance of a class is present in the image or not. We treat the task as a separate
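The score-map size arithmetic in the paragraph above (window 224, global stride 32) can be checked directly:

```python
def output_map_size(height, width, window=224, stride=32):
    """Score-map size when a fixed window slides over the image with the
    network's global stride (fully-convolutional evaluation)."""
    return ((height - window) // stride + 1,
            (width - window) // stride + 1)
```

A 256×256 input indeed yields a 2×2 map, and a 2048-wide by 1024-tall input yields 58 window positions horizontally and 26 vertically, matching the 58×26 map quoted in the text.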
Training videos
Test results on 80 classes in Microsoft COCO dataset