
Page 1: Experimental Design for Machine Learning on Multimedia Data

Experimental Design for Machine Learning on Multimedia Data

Dr. Gerald Friedland, [email protected]


See: http://www.icsi.berkeley.edu/~fractor/spring2019/

Page 2: Experimental Design for Machine Learning on Multimedia Data

Hands On: Multimedia Methods for Large Scale Video Analysis

Dr. Gerald Friedland, [email protected]


…was:

…in 2012

Page 3: Experimental Design for Machine Learning on Multimedia Data

Multimedia on the Internet is Growing

Raphaël Troncy: Linked Media: Weaving non-textual content into the Semantic Web, MozCamp, 03/2009.


Page 4: Experimental Design for Machine Learning on Multimedia Data

Multimedia on the Internet is Growing

• YouTube claims 65k → 100k video uploads per day, or 48 → 72 hours of video every minute

• Flickr claims 1M image uploads per day

• Twitter: up to 120M messages per day => Twitpic, yfrog, plixi & co: 1M images per day


Page 5: Experimental Design for Machine Learning on Multimedia Data

Why do we care?

• Consumer-produced multimedia allows empirical studies at a never-before-seen scale in various research disciplines such as sociology, medicine, economics, environmental sciences, computer science...

• Recent buzzword: BIGDATA

Page 6: Experimental Design for Machine Learning on Multimedia Data

Problem

How can YOU effectively work on large scale multimedia data (without working at Google)?


Page 7: Experimental Design for Machine Learning on Multimedia Data

What is this class about?

• Introduction to large-scale video processing from a CS point of view (application driven)

• Covers different modalities: Visual, Audio, Tags, Sensor Data

• Covers different side topics, such as Mechanical Turk for annotation

Page 8: Experimental Design for Machine Learning on Multimedia Data

Content of 2012 Class

• Visual methods for video analysis
• Acoustic methods for video analysis
• Meta-data and tag-based methods for video analysis
• Inferring from the social graph and collaborative filtering
• Information fusion and multimodal integration
• Coping with memory and computational issues
• Crowd sourcing for ground truth annotation
• Privacy issues and societal impact of video retrieval

Page 9: Experimental Design for Machine Learning on Multimedia Data

Content of 2019 Class

• Less anecdotal
• More systematic
• Adds: Concepts for Machine Learning measurements
• See: Machine Learning Cheat Sheet and Design Process

Page 10: Experimental Design for Machine Learning on Multimedia Data

Why you should be interested in the Topic

• Processing of large-scale, consumer-produced videos is a brand-new, emerging field driven by:
– Massive government funding (e.g., IARPA Aladdin, IARPA Finder, NSF BIGDATA)
– Industry's need to make consumer videos retrievable and searchable
– Interesting academic questions

Page 11: Experimental Design for Machine Learning on Multimedia Data

Some Examples of my Research

• Video Concept Detection• Forensic User Matching• Location Estimation• Fast Approaches with EC2

and HPC

Page 12: Experimental Design for Machine Learning on Multimedia Data

Video2GPS: Multimodal Location Estimation of Flickr Videos

Gerald Friedland, International Computer Science Institute, Berkeley, [email protected]

Page 13: Experimental Design for Machine Learning on Multimedia Data

Definition


• Location Estimation = estimating the geo-coordinates of the origin of the content recorded in digital media.

• Here: a regression task – Where was the media recorded, in latitude, longitude, and altitude?

G. Friedland, O. Vinyals, and T. Darrell: "Multimodal Location Estimation", pp. 1245-1251, ACM Multimedia, Florence, Italy, October 2010.

Page 14: Experimental Design for Machine Learning on Multimedia Data

Motivation 1


Training data comes for free on a never-before-seen scale!

Portal               %     Total
YouTube (estimate)   3.0   3M
Flickr               4.5   180M

Allows tackling tasks of never-before-seen difficulty?

Page 15: Experimental Design for Machine Learning on Multimedia Data

Placing Task

Automatically guess the location of a Flickr video:
• i.e., assign geo-coordinates (latitude and longitude)
• Using one or more of:
– Visual/Audio content
– Metadata (title, tags, description, etc.)
– Social information

Page 16: Experimental Design for Machine Learning on Multimedia Data

Consumer-Produced, Unfiltered Videos...

Page 17: Experimental Design for Machine Learning on Multimedia Data

Data Description (2011)

• Training Data
– 10k Flickr videos / metadata / visual keyframes (+ features) / geo-tags
– 6M Flickr photos / metadata / visual features
• Test Data
– 5k Flickr videos / metadata / visual keyframes (+ features)
– no geo-tags
• Test/Training split: by UserID

Page 18: Experimental Design for Machine Learning on Multimedia Data

Your Best Guess???

Page 19: Experimental Design for Machine Learning on Multimedia Data

Our Approach


Tag-based Approach
+ Visual Approach
+ Acoustic Approach
-------------------
Multimodal Approach

Page 20: Experimental Design for Machine Learning on Multimedia Data

Intuition for the Approach

Investigate the 'spatial variance' of a feature:
– Spatial variance is small: the feature is likely location-indicative.
– Spatial variance is high: the feature is likely not indicative.


Page 21: Experimental Design for Machine Learning on Multimedia Data

Example Revisited

Tags: ['pavement', 'ucberkeley', 'berkeley', 'greek', 'greektheatre', 'spitonastranger', 'live', 'video']

Page 22: Experimental Design for Machine Learning on Multimedia Data

Tag-based Approach

• 'greektheatre' would be the correct answer; however, it was not seen in the development dataset

• 'ucberkeley' is the 2nd-best answer: estimation error: 0.332 km (a code sketch of this selection heuristic follows the table below)

Tag              Matches in Training Set   Spatial Variance
pavement         2                         5.739
ucberkeley       4                         0.132
berkeley         14                        68.138
greek            0                         N/A
greektheatre     0                         N/A
spitonastranger  0                         N/A
live             91                        6453.109
video            2967                      6735.844
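The tag heuristic above fits in a few lines of code. The following is a minimal sketch, assuming a dictionary tag_locations that maps each training tag to the (latitude, longitude) points where it was observed; the helper names and the exact definition of spatial variance (here, mean squared distance to the tag centroid) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of tag-based location estimation via spatial variance.
# Assumes tag_locations: dict mapping tag -> list of (lat, lon) training points.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def tag_centroid_and_variance(points):
    """Centroid of a tag's occurrences and the mean squared distance (km^2) to it."""
    lat_c = sum(lat for lat, _ in points) / len(points)
    lon_c = sum(lon for _, lon in points) / len(points)
    var = sum(haversine_km(lat, lon, lat_c, lon_c) ** 2
              for lat, lon in points) / len(points)
    return (lat_c, lon_c), var

def estimate_from_tags(video_tags, tag_locations):
    """Return the centroid of the matching tag with the lowest spatial variance."""
    best_var, best_centroid = float("inf"), None
    for tag in video_tags:
        points = tag_locations.get(tag, [])
        if not points:
            continue  # tag unseen in training data, e.g. 'greektheatre'
        centroid, var = tag_centroid_and_variance(points)
        if var < best_var:
            best_var, best_centroid = var, centroid
    return best_centroid
```

On the example above, 'ucberkeley' (4 matches, spatial variance 0.132) would be selected over 'berkeley', 'live', and 'video', yielding an estimate roughly 0.3 km from the true location.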

Page 23: Experimental Design for Machine Learning on Multimedia Data

Visual Approach

• Assumption: similar images → similar locations
• Based on related work: reduce location estimation to an image retrieval problem
• Use the median frame of the video as the keyframe
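As a rough sketch of this pipeline, the snippet below grabs the middle frame of a video with OpenCV, describes it with a simple global feature, and copies the location of the nearest geo-tagged training image. The color-histogram feature and the array names are stand-in assumptions, not the descriptors the actual system used.

```python
# Rough sketch of the visual baseline: middle frame as keyframe, simple global
# feature, 1-NN retrieval against geo-tagged training images.
import cv2
import numpy as np

def median_keyframe(video_path):
    """Read the frame at the temporal midpoint of the video."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError("could not read keyframe from " + video_path)
    return frame

def describe(frame, bins=8):
    """Normalized global color histogram as a placeholder visual feature."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-9)

def nearest_location(frame, train_features, train_coords):
    """1-NN retrieval: geo-coordinates of the closest training image."""
    dists = np.linalg.norm(train_features - describe(frame), axis=1)
    return train_coords[int(np.argmin(dists))]
```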

Page 24: Experimental Design for Machine Learning on Multimedia Data

Audio Approach

• Idea: Different places have different acoustic “signatures”

• Due to data sparsity: Initially only used for major cities

• Classification system similar to a GMM/UBM speaker ID system, using MFCC19 features.
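A toy version of this idea is sketched below: fit one Gaussian mixture per city on MFCC frames and pick the city whose model scores the test soundtrack highest. The real system was a GMM/UBM speaker-ID style pipeline (universal background model, MAP adaptation), which this simplified sketch leaves out; librosa and scikit-learn are assumed.

```python
# Toy sketch of city classification from audio: one GMM per city over MFCC
# frames. Omits the UBM / MAP-adaptation machinery of a real GMM/UBM system.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, n_mfcc=19):
    """19-dimensional MFCC frames (the slide's 'MFCC19')."""
    y, sr = librosa.load(wav_path, sr=16000, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # shape (frames, 19)

def train_city_models(city_to_wavs, n_components=64):
    """Fit one diagonal-covariance GMM per city on pooled MFCC frames."""
    models = {}
    for city, wavs in city_to_wavs.items():
        frames = np.vstack([mfcc_frames(w) for w in wavs])
        models[city] = GaussianMixture(n_components=n_components,
                                       covariance_type="diag").fit(frames)
    return models

def classify_city(wav_path, models):
    """Return the city whose model gives the highest average log-likelihood."""
    frames = mfcc_frames(wav_path)
    return max(models, key=lambda city: models[city].score(frames))
```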

Page 25: Experimental Design for Machine Learning on Multimedia Data

An Experiment

• Listen! Which city was this recorded in? Pick one of: Amsterdam, Bangkok, Barcelona, Beijing, Berlin, Cairo, CapeTown, Chicago, Dallas, Denver, Duesseldorf, Fukuoka, Houston, London, Los Angeles, Lower Hutt, Melbourne, Moscow, New Delhi, New York, Orlando, Paris, Phoenix, Prague, Puerto Rico, Rio de Janeiro, Rome, San Francisco, Seattle, Seoul, Siem Reap, Sydney, Taipei, Tel Aviv, Tokyo, Washington DC, Zuerich

• Solution: Tokyo, highest confidence score!

Page 26: Experimental Design for Machine Learning on Multimedia Data

Audio-Only Results

[Figure: DET curve for common-user city identification with 540 videos and 9,700 trials. Axes: False Alarm probability (in %) vs. Miss probability (in %). EER = 22%.]
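For reference, an equal error rate like the 22% reported here is the operating point where the miss and false-alarm probabilities coincide. The sketch below shows one straightforward way to estimate it from per-trial detection scores; the score convention (higher means "target city") is an assumption.

```python
# Straightforward EER estimate from detection scores: sweep thresholds and
# find where miss probability and false-alarm probability are closest.
import numpy as np

def equal_error_rate(scores, labels):
    """EER given per-trial scores and binary labels (True = target trial)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = float("inf"), 0.5
    for t in np.unique(scores):
        accept = scores >= t
        p_miss = np.mean(~accept[labels])    # target trials rejected
        p_fa = np.mean(accept[~labels])      # non-target trials accepted
        if abs(p_miss - p_fa) < best_gap:
            best_gap, eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return eer
```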

Page 27: Experimental Design for Machine Learning on Multimedia Data

Problem: Bias

Page 28: Experimental Design for Machine Learning on Multimedia Data

Multimodal Results: MediaEval 2011 Dataset

0"

10"

20"

30"

40"

50"

60"

70"

80"

90"

10m" 100"m" 1"km" 5"km" 10"km" 50"km" 100"km"

Coun

t"[%]"

Distance"between"es>ma>on"and"ground"truth"

Figure 3: The resulting accuracy of the algorithmas described in Section 5.

used as the final distance between frames. The scaling of the weights was learned by using a small sample of the training set and normalizing the individual distance distributions so that the standard deviation of each of them would be similar. We use 1-NN matching between the test frame and all the images in a 100 km radius around the 1 to 3 coordinates from the previous step. We pick the match with the smallest distance and output its coordinates as the final result.
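A sketch of this fusion and refinement step follows, under stated assumptions: per_modality_dist is a hypothetical callable returning a per-modality distance dict for a test frame and a training image, train_images is a list of dicts with a 'coords' entry, haversine_km is the helper from the tag sketch above, and the inverse-standard-deviation normalization is one plausible reading of the scaling described.

```python
# Sketch of the multimodal 1-NN refinement: scale each modality's distances to
# comparable spread, then match the test keyframe against training images
# within 100 km of the candidate coordinates from the previous step.
import numpy as np

def fit_scales(sample_dists_by_modality):
    """Per-modality scale factors (inverse std) learned on a small training sample."""
    return {m: 1.0 / (np.std(d) + 1e-9)
            for m, d in sample_dists_by_modality.items()}

def fused_distance(dists_by_modality, scales):
    """Weighted sum of per-modality distances after scaling."""
    return sum(scales[m] * d for m, d in dists_by_modality.items())

def refine_estimate(candidates, test_frame, train_images, scales,
                    per_modality_dist, within_km=100.0):
    """1-NN among training images within `within_km` of any candidate coordinate."""
    best_d, best_coords = float("inf"), None
    for img in train_images:
        if min(haversine_km(*img["coords"], *c) for c in candidates) > within_km:
            continue  # outside the search radius around the 1 to 3 candidates
        d = fused_distance(per_modality_dist(test_frame, img), scales)
        if d < best_d:
            best_d, best_coords = d, img["coords"]
    return best_coords if best_coords is not None else candidates[0]
```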

This multimodal algorithm is less complex than previous algorithms (see Section 3), yet produces more accurate results on less training data. The following section analyzes the accuracy of the algorithm and discusses experiments to support individual design decisions.

6. RESULTS AND ANALYSIS

The evaluation of our results is performed by applying the same rules and using the same metric as in the MediaEval 2010 evaluation. In MediaEval 2010, participants were to build systems to automatically guess the location of the video, i.e., assign geo-coordinates (latitude and longitude) to videos using one or more of: video metadata (tags, titles), visual content, audio content, and social information. Even though training data was provided (see Section 4), any "use of open resources, such as gazetteers, or geo-tagged articles in Wikipedia was encouraged" [16]. The goal of the task was to come as close as possible to the geo-coordinates of the videos as provided by users or their GPS devices. The systems were evaluated by calculating the geographical distance from the actual geo-location of the video (assigned by a Flickr user, the creator of the video) to the predicted geo-location (assigned by the system). While it was important to minimize the distances over all test videos, runs were compared by finding how many videos were placed within a threshold distance of 1 km, 5 km, 10 km, 50 km and 100 km. For analyzing the algorithm in greater detail, here we also show distances of below 100 m and below 10 m. The lowest distance category is about the accuracy of a typical GPS localization system in a camera or smartphone.
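The scoring just described boils down to a great-circle error per video and a tally per threshold. A small sketch follows, reusing haversine_km from the tag sketch above and treating the 10 m and 100 m categories as 0.01 km and 0.1 km.

```python
# Sketch of MediaEval-style placing evaluation: distance between prediction and
# ground truth, reported as the fraction of videos within each threshold.
def placing_accuracy(predicted, actual,
                     thresholds_km=(0.01, 0.1, 1, 5, 10, 50, 100)):
    """Fraction of videos placed within each distance threshold (in km)."""
    errors = [haversine_km(p_lat, p_lon, a_lat, a_lon)
              for (p_lat, p_lon), (a_lat, a_lon) in zip(predicted, actual)]
    return {t: sum(e <= t for e in errors) / len(errors) for t in thresholds_km}
```

For instance, the tags-only run discussed in Section 6.1 places 3774 of the 5091 test videos (about 74%) within 100 km, and the multimodal run places 3989 (about 78%).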

First we discuss the results as generated by the algorithm described in Section 5. The results are visualized in Figure 3. The results shown are superior in accuracy to any system presented in MediaEval 2010. Also, although we added additional data to the MediaEval training set, which was permitted under the rules explained above, we added less data than other systems in the evaluation, e.g. [21]. Compared to any other system, including our own, the system presented here is the least complex.

[Figure 4: The resulting accuracy when comparing tags-only, visual-only, and multimodal location estimation as discussed in Section 6.1. X-axis: distance between estimation and ground truth (10 m to 100 km); Y-axis: count (%); curves: Visual Only, Tags Only, Visual+Tags.]

6.1 About the Visual Modality

Probably one of the most obvious questions is the impact of the visual modality. As a comparison, the image-matching based location estimation algorithm in [8] started reporting accuracy at the granularity of 200 km. As can be seen in Figure 4, this is consistent with our results: using the location of the 1-best nearest neighbor in the entire database compared to the test frame results in a minimum accuracy of 10 km. In contrast to that, tag-based localization reaches accuracies of below 10 m. For the tags-only localization we modified the algorithm from Section 5 to output only the 1-best geo-coordinate centroid of the matching tag with the lowest spatial variance and to skip the visual matching step. While the tags-only variant of the algorithm already performs well, using visual matching on top of the algorithm decreases the accuracy in the finer-granularity ranges but increases overall accuracy, as in total more videos can be classified below 100 km. Out of the 5091 test videos, using only tags, 3774 videos can be estimated correctly with an accuracy better than 100 km. The multimodal approach estimates 3989 correctly in the range below 100 km.

6.2 The Influence of Non-Disjoint User Sets

Each individual person has their own idiosyncratic method of choosing keywords for certain events and locations when they upload videos to Flickr. Furthermore, the spatial variance of the videos uploaded by one user is low on average. At the same time, a certain number of users upload many videos. Therefore, taking into account to which user a video belongs seems to give a higher chance of finding geographically related videos. For this reason, videos in the MediaEval 2010 test dataset were chosen to have a disjoint set of users from the training dataset. However, the additional training images provided for MediaEval 2010 are not user disjoint

Page 29: Experimental Design for Machine Learning on Multimedia Data

How is the Class Organized?

• Once a week: lectures, a fundamental introduction (from me)
• Once a week: project meetings
• Guest lectures by YouTube and Bo Li (adversarial examples)

Page 30: Experimental Design for Machine Learning on Multimedia Data

Hands-On Part

• Do a project on a large-scale data set
• Data accessible through ICSI accounts
• Additional computation: $100 in Amazon EC2 money

Page 31: Experimental Design for Machine Learning on Multimedia Data

Lecture Material

• Background material for lectures: G. Friedland, R. Jain: Introduction to Multimedia Computing, Cambridge University Press, 2014.

• Support material available: http://www.mm-creole.org

Page 32: Experimental Design for Machine Learning on Multimedia Data

How do you receive Credit?

• Attend lecture regularly
• Do a project

More information: http://www.icsi.berkeley.edu/~fractor/spring2019/