Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond
TRANSCRIPT
![Page 1: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/1.jpg)
Large-Scale Video Understanding: YouTube and Beyond
Rahul Sukthankar
Machine Perception, Google Research
https://research.google.com/teams/perception/
AI Frontiers Conference - Nov. 3, 2017
![Page 2: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/2.jpg)
Machine Perception Really Works!
(better than I expected)
![Page 3: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/3.jpg)
Sample of Perception tech in products
Signals for Image Search ranking, related images, search-by-image, etc.
![Page 4: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/4.jpg)
Sample of Perception tech in products
Cloud Video API
Cloud Vision API
![Page 5: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/5.jpg)
Sample of Perception tech in products
(Seth LaForge, Nexus 5X)
HDR+ in Android Camera
Mobile Vision API
![Page 6: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/6.jpg)
Sample of Perception tech in products
Organizing Photos image & video collections and making them searchable by content
Microvideo tech in Photos & Motion Stills
De-reflection & tracking in Photo Scanner
![Page 7: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/7.jpg)
Sample of Perception tech in products
Personalized sticker packs in Allo
On-device handwriting input & recognition
OCR for lots of languages
![Page 8: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/8.jpg)
Sample of Perception tech in products
Visual & auditory annotation & signals on YouTube
Thumbnail/preview selection & optimization for YouTube
Non-speech sound captions on YouTube
![Page 9: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/9.jpg)
Sample of Perception tech in products
Region tracking for custom blurring tool on YouTube
Mobile creative effects on YouTube
![Page 10: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/10.jpg)
Useful Applications for Video Technology
Help users create, enhance, organize, and discover videos.
watch, listen, understand
capture a moment
improve & manipulate
![Page 11: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/11.jpg)
Privacy Region Tracking & Blurring for YouTube
![Page 12: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/12.jpg)
Fun Effects from Tracking (on Mobile) for YouTube
![Page 13: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/13.jpg)
Large-Scale Video Annotation for YouTube
![Page 14: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/14.jpg)
Large-Scale Video Annotation for YouTube
Video understanding pipeline as of ~5 years ago:
pixels & sound samples
→ extract features (hand-designed descriptors) → frame features
→ quantize & aggregate (codebook histogram) → video features
→ train model (e.g., AdaBoost) with training data
→ “Roller-blading”
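The quantize & aggregate step of this older pipeline can be sketched as follows. This is a minimal illustration only, assuming a codebook of centroids has already been built; the function names (`nearest_codeword`, `codebook_histogram`) and the toy 2-d descriptors are hypothetical and do not come from the talk. Training a classifier such as AdaBoost on the resulting histograms is omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(descriptor: Vector, codebook: List[Vector]) -> int:
    """Quantize: index of the codebook entry closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))


def codebook_histogram(descriptors: List[Vector],
                       codebook: List[Vector]) -> List[float]:
    """Aggregate: L1-normalized histogram of codeword assignments."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy example: three codewords, four frame descriptors (all values made up).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frame_descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
video_feature = codebook_histogram(frame_descriptors, codebook)
print(video_feature)  # [0.25, 0.5, 0.25]
```

The histogram has one bin per codeword, so its length is fixed regardless of how many frames the video contains, which is what lets a single classifier consume videos of any duration.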
![Page 15: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/15.jpg)
Large-Scale Video Annotation for YouTube
Modern video understanding pipeline:
pixels & sound samples + training data
→ extract features: magic box containing many convolutional, deep, end-to-end buzzwords :-)
→ “Roller-blading”
![Page 16: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/16.jpg)
Deep-learned visual features
Inception model trained on noisy data (images)
Bottleneck embedding layer (1000-d)
Videos with noisy labels
Frame-level → Video-level:
- Max pooling
- Avg pooling
- VLAD pooling
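The three frame-to-video pooling options named above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the production implementation: the toy 2-d frame features and the two VLAD centers are made up, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def max_pool(frames: List[Vector]) -> List[float]:
    """Element-wise maximum over all frame features."""
    return [max(col) for col in zip(*frames)]


def avg_pool(frames: List[Vector]) -> List[float]:
    """Element-wise mean over all frame features."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def vlad_pool(frames: List[Vector], centers: List[Vector]) -> List[float]:
    """VLAD: per-center sum of residuals, concatenated and L2-normalized."""
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in centers]
    for f in frames:
        # Assign the frame to its nearest center.
        k = min(range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(f, centers[i])))
        for j in range(dim):
            residuals[k][j] += f[j] - centers[k][j]
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy frame-level features and centers (values are made up).
frames = [(1.0, 4.0), (3.0, 2.0), (2.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 4.0)]
print(max_pool(frames))   # [3.0, 4.0]
print(avg_pool(frames))   # [2.0, 2.0]
print(vlad_pool(frames, centers))
```

Max and average pooling keep the frame feature dimensionality, while VLAD produces one residual vector per center, so its output length is the number of centers times the feature dimension.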
![Page 17: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/17.jpg)
Deep-learned vs. handcrafted features
+80% mean avg. precision
40x more compact features
Deep-learned visual features, VLAD coding: 1024-d, 0.272 MAP
Handcrafted audio-visual features: ~40K-d, 0.153 MAP
(Chart: Mean Average Precision, 0 to 0.40, vs. Dimensionality)
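The two headline figures on this slide follow from the two data points. A quick check, assuming "~40K-d" means roughly 40,000 dimensions (the exact figure is not given):

```python
deep_map, handcrafted_map = 0.272, 0.153
deep_dim, handcrafted_dim = 1024, 40_000  # "~40K-d" taken as 40,000 (assumption)

# Relative improvement in mean average precision.
relative_gain = (deep_map - handcrafted_map) / handcrafted_map
# How many times smaller the deep-learned feature vector is.
compaction = handcrafted_dim / deep_dim

print(f"relative MAP gain: {relative_gain:.1%}")  # 77.8%, i.e. roughly +80%
print(f"compaction factor: {compaction:.1f}x")    # 39.1x, i.e. roughly 40x
```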
![Page 18: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/18.jpg)
Personal video search in Google Photos
Lots of videos
Almost no metadata
![Page 19: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/19.jpg)
“Dancing” on the web
![Page 20: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/20.jpg)
“Dancing” in home videos
![Page 21: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/21.jpg)
Domain adaptation: Finding home videos on YouTube
By capture device
By video frame rate
By video orientation
![Page 22: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/22.jpg)
The technology behind personal video search
Video:
1. Image / photo annotation model (trained on web images)
![Page 23: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/23.jpg)
The technology behind personal video search
Video:
1. Image / photo annotation model (trained on web images)
2. YouTube frame annotation model (trained on video thumbnails): domain-adapted frame-level vision model
![Page 24: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/24.jpg)
The technology behind personal video search
Video:
1. Image / photo annotation model (trained on web images)
2. YouTube frame annotation model (trained on video thumbnails): domain-adapted frame-level vision model
3. YouTube video annotation model (trained on YouTube videos): domain-adapted video-level vision model
![Page 25: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/25.jpg)
The technology behind personal video search
Video:
1. Image / photo annotation model (trained on web images)
2. YouTube frame annotation model (trained on video thumbnails): domain-adapted frame-level vision model
3. YouTube video annotation model (trained on YouTube videos): domain-adapted video-level vision model
Audio:
4. YouTube audio annotation model (trained on YouTube videos): domain-adapted audio model
![Page 26: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/26.jpg)
The technology behind personal video search
Video:
1. Image / photo annotation model (trained on web images)
2. YouTube frame annotation model (trained on video thumbnails): domain-adapted frame-level vision model
3. YouTube video annotation model (trained on YouTube videos): domain-adapted video-level vision model
Audio:
4. YouTube audio annotation model (trained on YouTube videos): domain-adapted audio model
Combined:
5. Fusion & calibration (trained on home videos): domain-adapted personal video model
Output: toddler, dancing
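One common way to build a fusion-and-calibration stage is late fusion: take a weighted sum of the per-model scores for a label and pass it through a sigmoid (Platt-style scaling). The talk does not say which method is actually used, so the sketch below is an assumption throughout; the model names, scores, weights, and bias are all hypothetical.

```python
import math
from typing import Dict


def fuse_and_calibrate(scores: Dict[str, float],
                       weights: Dict[str, float],
                       bias: float) -> float:
    """Late fusion: weighted sum of model scores, squashed to a probability."""
    z = bias + sum(weights[name] * s for name, s in scores.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical raw scores for the label "dancing" from the four models.
scores = {"photo": 0.2, "frame": 0.6, "video": 0.9, "audio": 0.7}
# Hypothetical weights; in a real system they would be fit on labeled data
# (the slide indicates this stage is trained on home videos).
weights = {"photo": 0.5, "frame": 1.0, "video": 2.0, "audio": 1.5}

probability = fuse_and_calibrate(scores, weights, bias=-2.0)
print(round(probability, 3))
```

Fitting the weights and bias on the target domain lets the fused score act as a calibrated confidence there, even though each upstream model was trained on different data.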
![Page 27: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/27.jpg)
Evolution of personal video annotation models
(Chart axis: 1, 2, 3, 4)
![Page 28: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/28.jpg)
Evolution of personal video annotation models
Photo annotation model applied on video frames
(Chart axis: 1, 2, 3, 4)
![Page 29: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/29.jpg)
Evolution of personal video annotation models
Photo annotation model applied on video frames
Domain adaptation + fusion across frames
(Chart axis: 1, 2, 3, 4)
![Page 30: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/30.jpg)
Evolution of personal video annotation models
Photo annotation model applied on video frames
Domain adaptation + fusion across frames
Fusion across multiple vision models
(Chart axis: 1, 2, 3, 4)
![Page 31: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/31.jpg)
Evolution of personal video annotation models
Photo annotation model applied on video frames
Domain adaptation + fusion across frames
Fusion across multiple vision models
Fusion across multiple audio-visual models
(Chart axis: 1, 2, 3, 4)
![Page 32: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/32.jpg)
Evolution of personal video annotation models
(Chart axis: 1, 2, 3, 4)
> 2x recall gain
![Page 33: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/33.jpg)
Learning aesthetics: YouTube Thumbnails
![Page 34: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/34.jpg)
Learning aesthetics: YouTube Thumbnails
YouTube thumbnail quality model
![Page 35: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/35.jpg)
Learning aesthetics: YouTube Thumbnails
![Page 36: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/36.jpg)
Learning aesthetics: YouTube Thumbnails
Improving YouTube video thumbnails with deep neural nets, Google Research Blog, Oct. 2015
![Page 37: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/37.jpg)
Video retargeting (spatial)
Original video. Reframed for a banner aspect ratio.
![Page 38: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/38.jpg)
Video retargeting (temporal)
Video preview:
(duration: 6 secs)
![Page 39: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/39.jpg)
Motion Stabilization
![Page 40: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/40.jpg)
Motion Stills app
Stream One-Up
![Page 41: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/41.jpg)
Motion Stills examples: cinemagraphs
![Page 42: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/42.jpg)
Motion Stills examples: gifs / memes
![Page 43: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/43.jpg)
Motion Stills examples: timelapse
![Page 44: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/44.jpg)
Promising Directions for Future Research:
Learning from Video
![Page 45: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/45.jpg)
Sermanet, Self-Supervised Imitation, Google Brain BAVM 2017
Self-Supervised Imitation
Pierre Sermanet*, Corey Lynch*, Yevgen Chebotar*, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine
Google Brain + University of Southern California
* equal contribution
![Page 46: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/46.jpg)
Multi-view capture
![Page 47: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/47.jpg)
Time-Contrastive Networks (TCN)
(source: [Rippel et al 2015])
arxiv.org/abs/1704.06888v2
sermanet.github.io/imitate
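As described in the linked arXiv paper, a time-contrastive network is trained with a triplet loss: an anchor frame and a positive frame captured at the same moment from a different viewpoint should embed close together, while a negative frame from a different moment in the same view should embed farther away. Below is a minimal sketch of that loss on precomputed embeddings. The margin value and the toy 2-d embeddings are assumptions for illustration, and the embedding network itself is omitted.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 0.2) -> float:
    """Hinge loss: zero once the negative is farther than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


# Toy embeddings (made up): one moment seen from two views, plus a later moment.
anchor = (0.0, 0.0)    # view 1, time t
positive = (0.1, 0.0)  # view 2, time t
negative = (1.0, 0.0)  # view 1, a different time

good = triplet_loss(anchor, positive, negative)  # already well separated
bad = triplet_loss(anchor, negative, positive)   # roles swapped: penalized
print(good, bad)
```

Because the supervision comes only from synchronized timestamps across views, no human labels are needed, which is why the approach is described as self-supervised.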
![Page 48: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/48.jpg)
Approach (pouring, real)
* RL used: Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning, Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., Levine, S. [ICML 17]
![Page 49: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/49.jpg)
Resulting policies
![Page 50: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/50.jpg)
Pose imitation (real robot)
![Page 51: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/51.jpg)
Useful Datasets for Video Understanding
● Large-scale video annotation
○ Sports-1M: > 1M videos from ~500 classes [with Stanford]
○ YouTube-8M: ~8M videos from ~4800 classes
● Action recognition in video
○ THUMOS: Temporal localization in untrimmed videos [with UCF, INRIA]
○ Kinetics: 400+ short clips for each of 400 actions [with DeepMind]
○ AVA: Spatially localized atomic actions [with Berkeley, INRIA]
● Object recognition
○ YouTube-BB: Spatially localized objects in video (80 classes)
○ Open Images: Spatially localized objects in images (600 classes)
![Page 52: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/52.jpg)
Sports-1M: 1.1M videos from 487 sports classes (video classification)
![Page 53: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/53.jpg)
YouTube-8M Video Research Dataset
research.google.com/youtube8m/
![Page 54: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/54.jpg)
THUMOS Challenge Series: Temporal Localization in Untrimmed Videos
![Page 55: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/55.jpg)
YouTube Bounding Boxes: Spatial localization of one object through time
![Page 56: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/56.jpg)
AVA: Spatial localization of an actor performing atomic actions
Atomic action: “Paint”
![Page 57: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/57.jpg)
Open Images v3 - detailed spatial annotations in images
Example validation images
![Page 58: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/58.jpg)
Open Images v3 - detailed spatial annotations in images
Example validation images
![Page 59: Rahul Sukthankar at AI Frontiers: Large-Scale Video Understanding: YouTube and Beyond](https://reader031.vdocuments.site/reader031/viewer/2022021503/5a6766977f8b9a8a378b4825/html5/thumbnails/59.jpg)
Conclusion
● Significant progress in large-scale video annotation for YouTube
● Video understanding has many applications beyond YouTube
● We encourage others to work on video through public datasets
● Many exciting research problems ahead, particularly in learning from video
(I think there’s a lot more progress to be made in video understanding)