
Page 1: Experimental Design for Machine Learning on Multimedia Data

Experimental Design for Machine Learning on Multimedia Data

Dr. Gerald Friedland, [email protected]


See: http://www.icsi.berkeley.edu/~fractor/spring2019/

Page 2: Experimental Design for Machine Learning on Multimedia Data

Hands On: Multimedia Methods for Large Scale Video Analysis

Dr. Gerald Friedland, [email protected]


…was:

…in 2012

Page 3: Experimental Design for Machine Learning on Multimedia Data

Multimedia on the Internet is Growing

Raphaël Troncy: Linked Media: Weaving non-textual content into the Semantic Web, MozCamp, 03/2009.


Page 4: Experimental Design for Machine Learning on Multimedia Data

Multimedia on the Internet is Growing

• YouTube claims 65k → 100k video uploads per day, or 48 → 72 hours of video every minute

• Flickr claims 1M image uploads per day

• Twitter: up to 120M messages per day => Twitpic, yfrog, plixi & co: 1M images per day


Page 5: Experimental Design for Machine Learning on Multimedia Data

Why do we care?

• Consumer-produced multimedia allows empirical studies at a never-before-seen scale in various research disciplines such as sociology, medicine, economics, environmental sciences, computer science...

• Recent buzzword: BIGDATA

Page 6: Experimental Design for Machine Learning on Multimedia Data

Problem

How can YOU effectively work on large scale multimedia data (without working at Google)?


Page 7: Experimental Design for Machine Learning on Multimedia Data

What is this class about?

• Introduction to large-scale video processing from a CS point of view (application driven)

• Covers different modalities: Visual, Audio, Tags, Sensor Data

• Covers different side topics, such as Mechanical Turk for annotation

Page 8: Experimental Design for Machine Learning on Multimedia Data

Content of 2012 Class

• Visual methods for video analysis
• Acoustic methods for video analysis
• Meta-data and tag-based methods for video analysis
• Inferring from the social graph and collaborative filtering
• Information fusion and multimodal integration
• Coping with memory and computational issues
• Crowd sourcing for ground truth annotation
• Privacy issues and societal impact of video retrieval

Page 9: Experimental Design for Machine Learning on Multimedia Data

Content of 2019 Class

• Less anecdotal
• More systematic
• Adds: Concepts for Machine Learning measurements
• See: Machine Learning Cheat Sheet and Design Process

Page 10: Experimental Design for Machine Learning on Multimedia Data

Why you should be interested in the Topic

• Processing of large-scale, consumer-produced videos is a brand-new, emerging field driven by:
– Massive government funding (e.g., IARPA Aladdin, IARPA Finder, NSF BIGDATA)
– Industry's need to make consumer videos retrievable and searchable
– Interesting academic questions

Page 11: Experimental Design for Machine Learning on Multimedia Data

Some Examples of my Research

• Video Concept Detection• Forensic User Matching• Location Estimation• Fast Approaches with EC2

and HPC

Page 12: Experimental Design for Machine Learning on Multimedia Data

Video2GPS: Multimodal Location Estimation of Flickr Videos

Gerald Friedland, International Computer Science Institute, Berkeley, [email protected]

Page 13: Experimental Design for Machine Learning on Multimedia Data

Definition


• Location Estimation = estimating the geo-coordinates of the origin of the content recorded in digital media.

• Here: a regression task – Where was the media recorded, in latitude, longitude, and altitude?

G. Friedland, O. Vinyals, and T. Darrell: "Multimodal Location Estimation", pp. 1245-1251, ACM Multimedia, Florence, Italy, October 2010.

Page 14: Experimental Design for Machine Learning on Multimedia Data

Motivation 1


Training data comes for free on a never-before-seen scale!

Portal               %     Total
YouTube (estimate)   3.0   3M
Flickr               4.5   180M

Allows tackling tasks of never-before-seen difficulty?

Page 15: Experimental Design for Machine Learning on Multimedia Data

Placing Task

Automatically guess the location of a Flickr video:
• i.e., assign geo-coordinates (latitude and longitude)
• Using one or more of:
– Visual/Audio content
– Metadata (title, tags, description, etc.)
– Social information

Page 16: Experimental Design for Machine Learning on Multimedia Data

Consumer-Produced, Unfiltered Videos...

Page 17: Experimental Design for Machine Learning on Multimedia Data

Data Description (2011)

• Training Data
– 10k Flickr videos / metadata / visual keyframes (+ features) / geo-tags
– 6M Flickr photos / metadata / visual features
• Test Data
– 5k Flickr videos / metadata / visual keyframes (+ features)
– no geo-tags
• Test/Training split: by UserID

Page 18: Experimental Design for Machine Learning on Multimedia Data

Your Best Guess???

Page 19: Experimental Design for Machine Learning on Multimedia Data

Our Approach


Tag-based Approach
+ Visual Approach
+ Acoustic Approach
-------------------
Multimodal Approach

Page 20: Experimental Design for Machine Learning on Multimedia Data

Intuition for the Approach

Investigate the 'spatial variance' of a feature:
– Spatial variance is small: the feature is likely location-indicative.
– Spatial variance is high: the feature is likely not indicative.


Page 21: Experimental Design for Machine Learning on Multimedia Data

Example Revisited

Tags: ['pavement', 'ucberkeley', 'berkeley', 'greek', 'greektheatre', 'spitonastranger', 'live', 'video']

Page 22: Experimental Design for Machine Learning on Multimedia Data

Tag-based Approach

• 'greektheatre' would be the correct answer; however, it was not seen in the development dataset

• 'ucberkeley' is the 2nd-best answer: estimation error: 0.332 km (a code sketch of this selection heuristic follows the table below)

Tag              Matches in Training Set   Spatial Variance
pavement         2                         5.739
ucberkeley       4                         0.132
berkeley         14                        68.138
greek            0                         N/A
greektheatre     0                         N/A
spitonastranger  0                         N/A
live             91                        6453.109
video            2967                      6735.844
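The tag heuristic above fits in a few lines of code. The following is a minimal sketch, assuming a dictionary tag_locations that maps each training tag to the (latitude, longitude) points where it was observed; the helper names and the exact definition of spatial variance (here, mean squared distance to the tag centroid) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of tag-based location estimation via spatial variance.
# Assumes tag_locations: dict mapping tag -> list of (lat, lon) training points.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def tag_centroid_and_variance(points):
    """Centroid of a tag's occurrences and the mean squared distance (km^2) to it."""
    lat_c = sum(lat for lat, _ in points) / len(points)
    lon_c = sum(lon for _, lon in points) / len(points)
    var = sum(haversine_km(lat, lon, lat_c, lon_c) ** 2
              for lat, lon in points) / len(points)
    return (lat_c, lon_c), var

def estimate_from_tags(video_tags, tag_locations):
    """Return the centroid of the matching tag with the lowest spatial variance."""
    best_var, best_centroid = float("inf"), None
    for tag in video_tags:
        points = tag_locations.get(tag, [])
        if not points:
            continue  # tag unseen in training data, e.g. 'greektheatre'
        centroid, var = tag_centroid_and_variance(points)
        if var < best_var:
            best_var, best_centroid = var, centroid
    return best_centroid
```

On the example above, 'ucberkeley' (4 matches, spatial variance 0.132) would be selected over 'berkeley', 'live', and 'video', yielding an estimate roughly 0.3 km from the true location.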

Page 23: Experimental Design for Machine Learning on Multimedia Data

Visual Approach

• Assumption: similar images → similar locations
• Based on related work: reduce location estimation to an image retrieval problem
• Use the median frame of the video as the keyframe
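As a rough sketch of this pipeline, the snippet below grabs the middle frame of a video with OpenCV, describes it with a simple global feature, and copies the location of the nearest geo-tagged training image. The color-histogram feature and the array names are stand-in assumptions, not the descriptors the actual system used.

```python
# Rough sketch of the visual baseline: middle frame as keyframe, simple global
# feature, 1-NN retrieval against geo-tagged training images.
import cv2
import numpy as np

def median_keyframe(video_path):
    """Read the frame at the temporal midpoint of the video."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError("could not read keyframe from " + video_path)
    return frame

def describe(frame, bins=8):
    """Normalized global color histogram as a placeholder visual feature."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-9)

def nearest_location(frame, train_features, train_coords):
    """1-NN retrieval: geo-coordinates of the closest training image."""
    dists = np.linalg.norm(train_features - describe(frame), axis=1)
    return train_coords[int(np.argmin(dists))]
```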

Page 24: Experimental Design for Machine Learning on Multimedia Data

Audio Approach

• Idea: Different places have different acoustic “signatures”

• Due to data sparsity: Initially only used for major cities

• Classification system similar to a GMM/UBM speaker ID system, using MFCC19 features.
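A toy version of this idea is sketched below: fit one Gaussian mixture per city on MFCC frames and pick the city whose model scores the test soundtrack highest. The real system was a GMM/UBM speaker-ID style pipeline (universal background model, MAP adaptation), which this simplified sketch leaves out; librosa and scikit-learn are assumed.

```python
# Toy sketch of city classification from audio: one GMM per city over MFCC
# frames. Omits the UBM / MAP-adaptation machinery of a real GMM/UBM system.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, n_mfcc=19):
    """19-dimensional MFCC frames (the slide's 'MFCC19')."""
    y, sr = librosa.load(wav_path, sr=16000, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # shape (frames, 19)

def train_city_models(city_to_wavs, n_components=64):
    """Fit one diagonal-covariance GMM per city on pooled MFCC frames."""
    models = {}
    for city, wavs in city_to_wavs.items():
        frames = np.vstack([mfcc_frames(w) for w in wavs])
        models[city] = GaussianMixture(n_components=n_components,
                                       covariance_type="diag").fit(frames)
    return models

def classify_city(wav_path, models):
    """Return the city whose model gives the highest average log-likelihood."""
    frames = mfcc_frames(wav_path)
    return max(models, key=lambda city: models[city].score(frames))
```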

Page 25: Experimental Design for Machine Learning on Multimedia Data

An Experiment

• Listen! Which city was this recorded in? Pick one of: Amsterdam, Bangkok, Barcelona, Beijing, Berlin, Cairo, CapeTown, Chicago, Dallas, Denver, Duesseldorf, Fukuoka, Houston, London, Los Angeles, Lower Hutt, Melbourne, Moscow, New Delhi, New York, Orlando, Paris, Phoenix, Prague, Puerto Rico, Rio de Janeiro, Rome, San Francisco, Seattle, Seoul, Siem Reap, Sydney, Taipei, Tel Aviv, Tokyo, Washington DC, Zuerich

• Solution: Tokyo, highest confidence score!

Page 26: Experimental Design for Machine Learning on Multimedia Data

Audio-Only Results

[Figure: DET curve for common-user city identification with 540 videos and 9,700 trials. Axes: False Alarm probability (in %) vs. Miss probability (in %). EER = 22%.]
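For reference, an equal error rate like the 22% reported here is the operating point where the miss and false-alarm probabilities coincide. The sketch below shows one straightforward way to estimate it from per-trial detection scores; the score convention (higher means "target city") is an assumption.

```python
# Straightforward EER estimate from detection scores: sweep thresholds and
# find where miss probability and false-alarm probability are closest.
import numpy as np

def equal_error_rate(scores, labels):
    """EER given per-trial scores and binary labels (True = target trial)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = float("inf"), 0.5
    for t in np.unique(scores):
        accept = scores >= t
        p_miss = np.mean(~accept[labels])    # target trials rejected
        p_fa = np.mean(accept[~labels])      # non-target trials accepted
        if abs(p_miss - p_fa) < best_gap:
            best_gap, eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return eer
```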

Page 27: Experimental Design for Machine Learning on Multimedia Data

Problem: Bias

Page 28: Experimental Design for Machine Learning on Multimedia Data

Multimodal Results: MediaEval 2011 Dataset

0"

10"

20"

30"

40"

50"

60"

70"

80"

90"

10m" 100"m" 1"km" 5"km" 10"km" 50"km" 100"km"

Coun

t"[%]"

Distance"between"es>ma>on"and"ground"truth"

Figure 3: The resulting accuracy of the algorithmas described in Section 5.

used as the final distance between frames. The scaling of the weights was learned by using a small sample of the training set and normalizing the individual distance distributions so that the standard deviation of each of them would be similar. We use 1-NN matching between the test frame and all the images in a 100 km radius around the 1 to 3 coordinates from the previous step. We pick the match with the smallest distance and output its coordinates as the final result.
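A sketch of this fusion and refinement step follows, under stated assumptions: per_modality_dist is a hypothetical callable returning a per-modality distance dict for a test frame and a training image, train_images is a list of dicts with a 'coords' entry, haversine_km is the helper from the tag sketch above, and the inverse-standard-deviation normalization is one plausible reading of the scaling described.

```python
# Sketch of the multimodal 1-NN refinement: scale each modality's distances to
# comparable spread, then match the test keyframe against training images
# within 100 km of the candidate coordinates from the previous step.
import numpy as np

def fit_scales(sample_dists_by_modality):
    """Per-modality scale factors (inverse std) learned on a small training sample."""
    return {m: 1.0 / (np.std(d) + 1e-9)
            for m, d in sample_dists_by_modality.items()}

def fused_distance(dists_by_modality, scales):
    """Weighted sum of per-modality distances after scaling."""
    return sum(scales[m] * d for m, d in dists_by_modality.items())

def refine_estimate(candidates, test_frame, train_images, scales,
                    per_modality_dist, within_km=100.0):
    """1-NN among training images within `within_km` of any candidate coordinate."""
    best_d, best_coords = float("inf"), None
    for img in train_images:
        if min(haversine_km(*img["coords"], *c) for c in candidates) > within_km:
            continue  # outside the search radius around the 1 to 3 candidates
        d = fused_distance(per_modality_dist(test_frame, img), scales)
        if d < best_d:
            best_d, best_coords = d, img["coords"]
    return best_coords if best_coords is not None else candidates[0]
```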

This multimodal algorithm is less complex than previous algorithms (see Section 3), yet produces more accurate results on less training data. The following section analyzes the accuracy of the algorithm and discusses experiments to support individual design decisions.

6. RESULTS AND ANALYSIS

The evaluation of our results is performed by applying the same rules and using the same metric as in the MediaEval 2010 evaluation. In MediaEval 2010, participants were to build systems to automatically guess the location of the video, i.e., assign geo-coordinates (latitude and longitude) to videos using one or more of: video metadata (tags, titles), visual content, audio content, and social information. Even though training data was provided (see Section 4), any "use of open resources, such as gazetteers, or geo-tagged articles in Wikipedia was encouraged" [16]. The goal of the task was to come as close as possible to the geo-coordinates of the videos as provided by users or their GPS devices. The systems were evaluated by calculating the geographical distance from the actual geo-location of the video (assigned by a Flickr user, the creator of the video) to the predicted geo-location (assigned by the system). While it was important to minimize the distances over all test videos, runs were compared by finding how many videos were placed within a threshold distance of 1 km, 5 km, 10 km, 50 km and 100 km. For analyzing the algorithm in greater detail, here we also show distances of below 100 m and below 10 m. The lowest distance category is about the accuracy of a typical GPS localization system in a camera or smartphone.
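The scoring just described boils down to a great-circle error per video and a tally per threshold. A small sketch follows, reusing haversine_km from the tag sketch above and treating the 10 m and 100 m categories as 0.01 km and 0.1 km.

```python
# Sketch of MediaEval-style placing evaluation: distance between prediction and
# ground truth, reported as the fraction of videos within each threshold.
def placing_accuracy(predicted, actual,
                     thresholds_km=(0.01, 0.1, 1, 5, 10, 50, 100)):
    """Fraction of videos placed within each distance threshold (in km)."""
    errors = [haversine_km(p_lat, p_lon, a_lat, a_lon)
              for (p_lat, p_lon), (a_lat, a_lon) in zip(predicted, actual)]
    return {t: sum(e <= t for e in errors) / len(errors) for t in thresholds_km}
```

For instance, the tags-only run discussed in Section 6.1 places 3774 of the 5091 test videos (about 74%) within 100 km, and the multimodal run places 3989 (about 78%).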

First we discuss the results as generated by the algorithm described in Section 5. The results are visualized in Figure 3. The results shown are superior in accuracy to any system presented in MediaEval 2010. Also, although we added additional data to the MediaEval training set, which was permitted under the rules explained above, we added less data than other systems in the evaluation, e.g. [21]. Compared to any other system, including our own, the system presented here is the least complex.

[Figure 4: The resulting accuracy when comparing tags-only, visual-only, and multimodal location estimation as discussed in Section 6.1. X-axis: distance between estimation and ground truth (10 m to 100 km); Y-axis: count (%); curves: Visual Only, Tags Only, Visual+Tags.]

6.1 About the Visual Modality

Probably one of the most obvious questions is the impact of the visual modality. As a comparison, the image-matching based location estimation algorithm in [8] started reporting accuracy at the granularity of 200 km. As can be seen in Figure 4, this is consistent with our results: using the location of the 1-best nearest neighbor in the entire database compared to the test frame results in a minimum accuracy of 10 km. In contrast to that, tag-based localization reaches accuracies of below 10 m. For the tags-only localization we modified the algorithm from Section 5 to output only the 1-best geo-coordinate centroid of the matching tag with the lowest spatial variance and to skip the visual matching step. While the tags-only variant of the algorithm already performs well, using visual matching on top of the algorithm decreases the accuracy in the finer-granularity ranges but increases overall accuracy, as in total more videos can be classified below 100 km. Out of the 5091 test videos, using only tags, 3774 videos can be estimated correctly with an accuracy better than 100 km. The multimodal approach estimates 3989 correctly in the range below 100 km.

6.2 The Influence of Non-Disjoint User Sets

Each individual person has their own idiosyncratic method of choosing keywords for certain events and locations when they upload videos to Flickr. Furthermore, the spatial variance of the videos uploaded by one user is low on average. At the same time, a certain number of users upload many videos. Therefore, taking into account to which user a video belongs seems to give a higher chance of finding geographically related videos. For this reason, videos in the MediaEval 2010 test dataset were chosen to have a disjoint set of users from the training dataset. However, the additional training images provided for MediaEval 2010 are not user disjoint

Page 29: Experimental Design for Machine Learning on Multimedia Data

How is the Class Organized?

• Once a week: lectures, a fundamental introduction (from me)
• Once a week: project meetings
• Guest lectures by YouTube and Bo Li (adversarial examples)

Page 30: Experimental Design for Machine Learning on Multimedia Data

Hands-On Part

• Do a project on a large-scale data set
• Data accessible through ICSI accounts
• Additional computation: $100 in Amazon EC2 money

Page 31: Experimental Design for Machine Learning on Multimedia Data

Lecture Material

• Background material for lectures: G. Friedland, R. Jain: Introduction to Multimedia Computing, Cambridge University Press, 2014.

• Support material available: http://www.mm-creole.org

Page 32: Experimental Design for Machine Learning on Multimedia Data

How do you receive Credit?

• Attend lecture regularly
• Do a project

More information: http://www.icsi.berkeley.edu/~fractor/spring2019/