
Cricket stroke extraction: Towards creation of a large-scale cricket actions dataset

Arpan Gupta[0000−0002−9417−3169] and Sakthi Balan M.[0000−0003−1817−7173]

The LNM Institute of Information Technology, Jaipur, Rajasthan, India. {arpan, sakthi.balan}@lnmiit.ac.in

https://www.lnmiit.ac.in/

Abstract. In this paper, we deal with the problem of temporal action localization for a large-scale untrimmed cricket videos dataset. Our action of interest for cricket videos is a cricket stroke played by a batsman, which is usually covered by cameras placed in the stands of the cricket ground at both ends of the cricket pitch. After applying a sequence of preprocessing steps, we have ≈ 73 million frames for 1110 videos in the dataset at constant frame rate and resolution. The method of localization is a generalized one which applies a trained random forest model for CUTs detection (using summed-up grayscale histogram difference features) and two linear SVM camera models (CAM1 and CAM2) for first frame detection, trained on HOG features of CAM1 and CAM2 video shots. CAM1 and CAM2 are assumed to be part of the cricket stroke. At the predicted boundary positions, the HOG features of the first frames are computed and a simple algorithm is used to combine the positively predicted camera shots. In order to make the process as generic as possible, we did not consider any domain-specific knowledge, such as tracking or specific shape and motion features. A detailed analysis of our methodology is provided, along with the metrics used for evaluation of the individual models and the final predicted segments. We achieved a weighted mean TIoU of 0.5097 over a small sample of the test set.

Keywords: Cricket stroke extraction · sports video processing · HOG · shot boundary detection · untrimmed videos · temporal localization

1 Introduction

Vision researchers are still a long way from achieving human-level understanding of videos by a machine. Although we get good results for basic tasks on images, extending the same methods to videos may not be that straightforward. Major challenges in video understanding include camera motion, illumination changes, partial occlusion, etc. Many of these challenges can be avoided in applications where a stationary camera is used, since the moving foreground objects are much easier to segment out. On the other hand, video content from a non-stationary camera, as in telecast videos, movies and ego-centric videos, gives rise to all of these challenges. It is tough to come up with a unified model that deals with all of the above challenges, while at the same time ensuring real-time processing of the video frames.



Vision researchers, after seeing the success of deep neural networks [18] on ImageNet [25], have tried to apply them to activity recognition tasks [32,15,27,3]. The need for large-scale annotated video datasets, for training the deep neural networks, was a driving factor that resulted in a number of such benchmark video datasets. Some of them are Sports-1M [15], YouTube-8M [2], ActivityNet [10], Kinetics [16], etc.

Activity recognition in sports telecast videos is an active area of research, and there have been quite a few works that focus on sports events. Thomas et al. [31] and Wang et al. [33] provide detailed surveys of some of the current and past works, respectively. Although Sports-1M and UCF Sports [30] are among the largest available sports datasets, they cannot be used to learn models for any one sport in particular, as there is not enough data to recognize all types of events in a single sport.

We take a large-scale dataset of untrimmed cricket videos and try to localize the individual cricket strokes being played by the batsman, which is our event of interest, hereafter referred to as a stroke. Usually, annotating a dataset of such scale is done using a crowd-sourcing platform like Amazon Mechanical Turk, similar to the works in [19,10,25]. In this work, we hand-annotate only a small subset of validation and test set videos for evaluation purposes and bootstrap the model to make predictions on the entire dataset.

Our motivation for this work is to come up with a generalized approach for developing a large-scale dataset of domain-specific telecast videos. We do not use any form of data (audio, text, etc.) other than the RGB frames of the untrimmed videos, and make minimum assumptions regarding our domain of interest, i.e., cricket. There are highly accurate tracking systems, like Hawk-Eye [1], which are already being used in the Decision Review System (DRS), but their data is private and they have their own set of stationary cameras and sensors installed in the sporting ground. A dataset of such scale can be used to train, for example, C3D [14] type deep neural networks, and later solve the problem of automatic content-based recognition of types of cricket strokes. A direct use of this model would be in automatic commentary generation, apart from content-based browsing and indexing of cricket videos.

We collected a large set of untrimmed cricket telecast videos, performed a series of pre-processing steps to give them a uniform frame rate and resolution, and then applied our model for the prediction of temporally localized stroke segments. The applied model involves simple shot boundary detection using a summed-up grayscale histogram difference feature, a couple of camera models that recognize the starting frames of specific camera shots that are part of the stroke, and finally a procedure that makes stroke segment predictions by finding stroke shots.

Section 2 provides a brief review of the related works. Our methodology is explained in detail in Section 3, which includes the details of the sub-problems of boundary detection, training of camera models and prediction of stroke segments. Section 4 describes our experimental setup and the evaluation metrics used for boundary detection, camera models, and action localization. Finally, we give the results in Section 5, followed by the conclusion in Section 6.



2 Related Work

The problem of action recognition in videos has picked up pace with the onset of deep neural networks [15,32,27,8,34,21]. These works modify the architecture of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) and train them on video data. As a result, they produce state-of-the-art results at the cost of increasing the number of parameters and the training time.

The tasks of classification, tracking, segmentation and temporal localization are quite inter-dependent and may involve similar approaches and features. In some works, the problem of temporal action localization has also been tackled using an end-to-end learning network [35,37]. Segmenting the object of interest and tracking it in the sequence of frames has been done in a few works like [36,13,12].

Some of the above approaches need pre-trained models that can be fine-tuned on their own problem-specific datasets, while others use large benchmark datasets for training. Applying such techniques to a sporting event requires a lot of hand-annotated training data, which is hard to get or may include a lot of noise. Automated ways of creating datasets may involve using a third-party API, like the YouTube Data API (as done in [15]), or extraction using the text meta-data of videos, which may not be accurate at all. Nga et al. [22] proposed a method to extract action videos based on tags. Deciding the relevancy of a tag is, in itself, a research problem. For these reasons, content-based action extraction from videos is a good choice for automatic construction of a large-scale action dataset. Hu et al. [11] provide a survey of content-based methods for video extraction.

Content-based retrieval of actions from untrimmed videos is tough when the number of actions increases or when a more generalized set of actions is considered. The second problem does not arise for sports videos, as the videos of a particular sport have the same types of actions performed at intervals. We take cricket as our test case and try to come up with a framework for cricket stroke extraction. A cricket stroke is a cricketing shot played by a batsman when a bowler bowls a ball. In this paper, we refer to the cricket shot as a stroke, so that it is not confused with a video shot, which can be defined as a sequence of frames captured by a single camera over a time interval during which it does not switch to any other camera.

Coming up with a purely content-based, domain-specific event extraction for untrimmed cricket telecast videos has not been attempted by many. Sharma et al. [26] tried to annotate the videos by mapping the segments to the text commentary using dynamic programming alignment, which may not always be available or may be noisy. Moreover, their dataset is quite small compared to what we are trying to achieve. Some other works have also looked at extraction of cricketing events, but they have not considered it at such a scale as ours, or have not tried to generalize their solutions.


Fig. 1. Our framework for the prediction of cricket strokes, based on the learned models for shot boundary detection (SBD, using summed-up grayscale histogram differences), the camera models (CAM1 and CAM2, applied to HOG features of frames at the predicted CUT positions), and combining the predictions using Algorithm 1.

Kumar et al. [20] and some similar works have tried to classify frames based on the ground, pitch, or players present in them, or have come up with a rule-based solution to classify motion features. None of them have analyzed their results on an entirely new set of cricket videos.

In our work, we collect a large set of raw untrimmed cricket videos and report our results on a subset of these videos that have been hand-annotated for the temporal localization task. The architecture (Figure 1) can be generalized to other sports, since there are no cricket-specific assumptions involved.

Temporal localization using deep neural networks has been quite successful recently [38,9]. There are also works that do not use deep neural nets, such as Soomro et al. [29], who use an unsupervised spectral clustering approach to find similar action types and localize them. Our approach also does not use deep neural networks, as our objective is to come up with a dataset that is large enough to train CNNs with millions of parameters. Even pre-trained networks need a sufficiently large amount of labeled data for fine-tuning. We have done the labeling for only a small set of highlight videos (1 GB of World Cup T20 2016) and bootstrapped simple machine learning models trained on only grayscale histogram differences and HOG [7] features.

3 Cricket stroke extraction

Extracting cricket strokes from untrimmed videos is analogous to the action detection (localization) task, where the action of interest is a cricket stroke being played by a batsman.

The live telecast videos of a sports event are created by a set of cameras that have the task of covering most of the sporting arena. There are two types of cameras, fixed and moving. The fixed cameras (assigned to camera-men) have a defined objective of capturing a specific sporting activity, while the moving cameras, e.g., spider-cams in cricket fields, may be controlled remotely and may not necessarily cover the relevant sporting activity at all times.



The main sporting activity in a cricket match is a bowler bowling a ball, followed by a batsman playing a stroke and then scoring runs. An over consists of 6 such deliveries, each of which is, potentially, an event of interest. Automatically recognizing the outcome of each delivery is a tough problem, which may require a complex system with domain knowledge and learned models that interact in a rule-based manner, as done in [20]. In our work, we consider the basic problem of localizing a cricketing event, namely the direction of stroke play. Here, the starting point of our event of interest is the camera shot where the bowler is about to deliver the ball, and the ending point is the camera shot that captures the direction of the cricket stroke played. Generally, our event of interest is captured by one camera shot or at most two subsequent camera shots. An illustration is provided in Figure 2, where frame 99 (F99) is the starting point and F153 is the ending point of the event of interest. The major portion of the event is captured using two subsequent camera shots, CAM1 and CAM2¹, where CAM2 captures a wider area to locate the movement of the ball and then gradually focuses on it. There may also be a case where the batsman just taps the ball and it does not travel beyond CAM1's field-of-view. Therefore, we need to segment out either CAM1 shots or CAM1 + CAM2 shots.

The above type of modeling can be applied (with minor changes) to a number of other sports, like tennis, badminton, or baseball. An illustration of our temporal cricket stroke localization model is given in Figure 1. The pipeline of simple models, trained on only a small set of sample videos, is used to predict temporal stroke segments in a large set of raw untrimmed cricket telecast videos. The evaluation for the bootstrapped predictions is done on a subset of the main dataset, called the test set sample. This subset, along with a validation set sample, has been hand-annotated with ground-truth cricket stroke segments. An evaluation over these subsets gives an estimate of what we can expect as the overall accuracy.

A learning-based approach for detection and localization, in order to get generalized results, would require a large amount of labeled data, where the labels should contain a minimum amount of noise. Such a dataset for cricket telecast videos may help the research community come up with better learned models for this particular sporting domain. Choosing a purely content-based modeling approach, we need to minimize the manual annotation effort, which leaves us with the idea of bootstrapping smaller models' predictions to the large raw video dataset, where a "smaller" model is one having only a few parameters.

Our method is purely content-based, since we do not use any extra information other than the RGB video frames and the features extracted from them. The training of simple machine learning models on a shallow and a high-dimensional feature descriptor, for localization, has been performed using a small highlights video dataset. Here, the shallow feature is the summed-up absolute value of histogram differences of consecutive grayscale frames, used for shot boundary detection (CUTs), and HOG is the high-dimensional feature descriptor of the frame, used to recognize the starting frames of CAM1 and CAM2 video shots. The shallow descriptor is computationally less expensive than the high-dimensional feature descriptor.

1 Please note that we use CAM1 and CAM2 for referring to the camera shots as well as to the models trained on the first frames of these shots.



The steps involved in the cricket stroke extraction are as follows:

1. Dividing the telecast videos into individual camera shots using shot boundary detection. Here only the CUTs are considered, since the detection of gradual transitions involves an additional overhead. The CUT predictions are done using a random forest [5,23] model, trained on the summed-up absolute values of histogram differences of consecutive grayscale video frames taken from our sample dataset (refer to Table 1).

2. A small dataset of first frames was created using only the sample dataset videos for training the CAM1 and CAM2 camera models. These sets had 367 and 336 frames each, where nearly half are positive samples and half are negative samples. We trained two linear SVMs [6,23] on the HOG [7] features of the training subset of these first frames.

3. Having the boundary predictions and the first frame recognition models, a simple approach is to extract only those video shots that give positive results for their first frame HOG features. These will be cricket strokes. Algorithm 1 describes a simple approach for extracting these events.

4. The evaluation of the extracted cricket strokes can be done by defining an evaluation metric and having a test set of human-annotated cricket stroke localized segments. We choose the 1D version of the IoU metric (Intersection over Union), called the temporal IoU (TIoU), and measure how much overlap there is between our predictions and the ground-truth annotations.

Given below are the steps in greater detail.

3.1 Preprocessing

The raw cricket telecast videos, collected from different sources like YouTube and Hotstar, had different frame rates and resolutions. All the videos were resized to 360 × 640 with a constant frame rate of 25 FPS. This step was carried out using FFmpeg. Having a constant frame rate and constant frame dimensions ensures the uniformity of the motion features that may be extracted from the videos.
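As an illustration, this preprocessing can be scripted around FFmpeg; the sketch below re-encodes every raw video to 640 × 360 at 25 FPS. The directory layout and the exact FFmpeg flags are our assumptions, not necessarily the commands used in this work.

```python
# Sketch of the preprocessing step: re-encode every raw video to 640x360 at 25 FPS.
# Directory names and FFmpeg options are illustrative assumptions.
import subprocess
from pathlib import Path

SRC_DIR = Path("raw_videos")     # hypothetical input directory
DST_DIR = Path("preprocessed")   # hypothetical output directory
DST_DIR.mkdir(exist_ok=True)

for src in SRC_DIR.glob("*.mp4"):
    dst = DST_DIR / src.name
    # -vf scale=640:360 resizes to width x height = 640 x 360,
    # -r 25 forces a constant frame rate of 25 FPS.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-vf", "scale=640:360", "-r", "25", str(dst)],
        check=True,
    )
```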

3.2 Shot boundary detection

Shot boundary detection has been studied for decades and is often the first step of any content-based video processing system. There are two types of boundaries: hard CUT transitions (which occur where one camera shot ends abruptly and the next begins) and gradual transitions (like fades, wipes, dissolves, etc.). Our focus is only on the detection of CUT boundaries, since CUTs are the most common in sports telecast videos. As future work, one might focus on the detection of gradual transitions as well, but that would tend to increase the overall processing time, which needs to be minimized when dealing with any kind of large-scale data processing.


The use of CUT predictions is that we can jump directly from one camera shot to the next in an untrimmed video. Iterating over the CUT predictions, and extracting the HOG features from only the first frames, speeds up our processing.

We tested histogram differences (grayscale and RGB) and weighted-χ² differences of consecutive frames. Equation 1 gives the summed-up absolute value of histogram differences for the consecutive i-th and (i+1)-th grayscale frames, where N is the number of bins representing the different gray levels.

D(i, i+1) = \sum_{n=1}^{N} \left| H_i(n) - H_{i+1}(n) \right|    (1)
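A minimal sketch of Equation 1, assuming OpenCV and 256 histogram bins (the function name is ours), is given below.

```python
# Minimal sketch of Equation 1: summed absolute grayscale histogram difference
# between consecutive frames of a video.
import cv2
import numpy as np

def histogram_differences(video_path, n_bins=256):
    cap = cv2.VideoCapture(video_path)
    diffs, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [n_bins], [0, 256]).ravel()
        if prev_hist is not None:
            # D(i, i+1) = sum_n |H_i(n) - H_{i+1}(n)|
            diffs.append(np.abs(prev_hist - hist).sum())
        prev_hist = hist
    cap.release()
    return np.array(diffs)
```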

3.3 Camera Models

We may assume that each fixed camera has a defined task, i.e., it will regularly cover similar actions being performed. The starting scene for cricket will have a bowler taking a run-up to the bowling crease and the batsman standing at the other end of the pitch, ready to face the delivery. This stroke can be detected by identifying the first frame of this video shot (using the predicted CUT positions) and segmenting out the shot till the next CUT position. The accuracy of this method relies on how accurate the CUT predictions are and how accurately we detect the first frames of the stroke. Figure 2 shows a sample cricket stroke containing CAM1 and CAM2 video shots.

Fig. 2. The sequence of frames in a cricket stroke (CAM1: F99–F137, CAM2: F138–F153). As the ball goes outside the field-of-view of CAM1, the camera is switched to CAM2. Note there is a CUT between frames F137 and F138.


3.4 Extraction Algorithm

Algorithm 1 is a simple approach to localize our activities of interest, given the individual pretrained camera models and a set of CUT predictions for an untrimmed video. The final predictions for the stroke segments may be of any length, which includes segments obtained from a false positive CUT followed by a false positive CAM1/CAM2 prediction. This occurs mostly at places where there are gradual transitions. These false positive segments are of small duration and can simply be discarded by a filtering step, which is explained in Section 4.4.

Algorithm 1 Extract cricket strokes

procedure localizeActions(srcVideo, videoCuts, cam1, cam2)
    ▷ srcVideo: the input video; videoCuts: list of predicted CUTs for srcVideo;
    ▷ cam1, cam2: trained camera models
    vid_segments, start_frame, end_frame ← emptyList([ ]), −1, −1
    for i, cut ← enumerate(videoCuts) do                 ▷ iterate over the CUT positions
        set cut position in srcVideo
        frame ← readFrameFromVideo()
        hog ← computeHOG(frame)
        cam1_pred ← cam1.predict(hog)
        cam2_pred ← cam2.predict(hog)
        if start_frame == −1 then
            if cam1_pred == 1 then                       ▷ positive prediction for cam1
                start_frame ← cut
        else if start_frame ≥ 0 then
            if cam2_pred == 0 then
                end_frame ← cut − 1
                vid_segments.append([start_frame, end_frame])   ▷ save a predicted segment
                start_frame ← end_frame ← −1
                if cam1_pred == 1 then
                    start_frame ← cut
            else                                         ▷ cam2 is positive: extend to the next CUT
                if (i + 1) < len(videoCuts) then
                    end_frame ← videoCuts[i + 1] − 1
                else
                    end_frame ← nFrames(srcVideo) − 1    ▷ number of frames in the video
                vid_segments.append([start_frame, end_frame])
                start_frame ← end_frame ← −1
    return vid_segments
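For readers who prefer executable code, the following Python sketch mirrors Algorithm 1; the compute_hog helper and the cam1/cam2 classifier objects (anything exposing a scikit-learn-style predict) are assumed, not part of the original listing.

```python
# Python sketch of Algorithm 1 (stroke localization from CUT predictions).
# `compute_hog` and the cam1/cam2 classifiers are assumed helpers.
import cv2

def localize_actions(src_video, video_cuts, cam1, cam2, compute_hog):
    cap = cv2.VideoCapture(src_video)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    vid_segments, start_frame = [], -1
    for i, cut in enumerate(video_cuts):
        cap.set(cv2.CAP_PROP_POS_FRAMES, cut)       # jump to the first frame of the shot
        ok, frame = cap.read()
        if not ok:
            break
        hog = compute_hog(frame).reshape(1, -1)
        cam1_pred = int(cam1.predict(hog)[0])
        cam2_pred = int(cam2.predict(hog)[0])
        if start_frame == -1:
            if cam1_pred == 1:                      # a CAM1 shot starts a stroke
                start_frame = cut
        else:
            if cam2_pred == 0:                      # stroke ended at the previous shot
                vid_segments.append([start_frame, cut - 1])
                start_frame = cut if cam1_pred == 1 else -1
            else:                                   # CAM2 shot: extend to the next CUT
                end_frame = (video_cuts[i + 1] - 1 if (i + 1) < len(video_cuts)
                             else n_frames - 1)
                vid_segments.append([start_frame, end_frame])
                start_frame = -1
    cap.release()
    return vid_segments
```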


3.5 Bootstrapping the predictions

We refer to the final predictions made on the large dataset as the bootstrapped predictions, since the trained models used an entirely different dataset (the Highlight videos) for training. The accuracy of these predictions depends largely on the Highlight videos.

4 Experimentation Details

This section describes our experimental setup in detail.

4.1 Data Description

The raw cricket telecast videos were first transformed to have a constant frame rate (25 FPS) and resolution (360 × 640) using FFmpeg. We had a sample dataset of only 26 highlight videos, each of around 2-5 minutes duration. This dataset was set aside for experimentation and training of the intermediate CUTs and CAM models. The main dataset of untrimmed videos (also referred to as the full dataset) had 273 GB (> 73 million frames) of untrimmed videos from various sources, containing Test match videos, ODI videos and T20 videos. We neglected any local cricket match videos, where the telecast camera was positioned at a non-standard position. A brief summary of both datasets is given in Table 1. Further, we partitioned the datasets into training, validation and testing sets in the given ratios. The highlight videos are annotated with CUT positions and stroke segments. A subset of validation set videos and a subset of test set videos of the full dataset are manually annotated with the stroke segments. They are used for the final evaluation.

Table 1. Details of our datasets

Property            Highlights dataset        Full dataset
No. of Videos       26                        1110
Total Size          1 GB (approx.)            273 GB (approx.)
Dimensions (H, W)   360 × 640                 360 × 640
Frame Rate (FPS)    25.0                      25.0
Event Description   ICC World Cup T20 2016    Varied sources
Training set        16 videos (≈ 60%)         ≈ 50%
Validation set      5 videos (≈ 20%)          ≈ 25%
Test set            5 videos (≈ 20%)          ≈ 25%
Annotations         CUTs, strokes             strokes on a subset


4.2 Shot Boundary Detection

We tested a number of approaches for detecting the shot boundaries (CUTs) based only on the sum of the absolute values of histogram differences of consecutive frames, such as global thresholding on grayscale/RGB differences [28], weighted-χ² differences [17], and applying simple classifiers on these features. The best performance on the sample dataset's test set was given by applying a random forest model to the grayscale histogram differences of consecutive frames.
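A minimal sketch of such a CUT classifier, assuming scikit-learn and our own choice of hyperparameters, is shown below.

```python
# Sketch of the CUT classifier: a random forest over the 1-D histogram-difference
# feature. Labels (1 = CUT between frame i and i+1, 0 = no CUT) come from the
# annotated Highlights videos; the hyperparameters here are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

def train_cut_model(diffs_train, labels_train, diffs_test, labels_test):
    X_train = np.asarray(diffs_train).reshape(-1, 1)
    X_test = np.asarray(diffs_test).reshape(-1, 1)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, labels_train)
    # Precision, recall and F-score on the held-out Highlights test set.
    p, r, f, _ = precision_recall_fscore_support(
        labels_test, model.predict(X_test), average="binary")
    return model, (p, r, f)
```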

Evaluation Criteria: We follow the evaluation process of TRECVid for the detection of only CUT transitions. Details are specified in [24]. A false positive is referred to as an insertion, while a false negative is a deletion. The presence of gradual transitions, or setting a low threshold (in the case of global thresholding), creates insertions and, as a result, tends to reduce the overall precision, while setting a global threshold to a high value misses actual CUT boundaries and reduces the overall recall. Therefore, the F-score is a suitable evaluation criterion for CUT boundary detection.

4.3 Camera Models

The only assumption in our work that is specific to cricket, as illustrated in Figure 2, is that CAM1 and CAM1+CAM2 video shots comprise our events of interest and therefore need to be localized. These video shots could be recognized by extracting a number of features, such as shape features, tracking of cricketing objects, textures, etc., but all of these may not be generally applicable to other sporting domains. We notice that the first frames of CAM1 shots are quite "similar", and the same is the case for the first frames of CAM2. We chose to extract a fine-grained HOG feature vector for the grayscale first frames of CAM1 and CAM2 and train simple machine learning models on them. The sample datasets for these two models had 367 and 336 frames, respectively, of which about half are positive samples and half are negative samples. These samples were extracted manually from a subset of the highlight sample videos dataset, and the negative frames are the first frames of some other random camera shots that are not cricket strokes. Figure 3 shows a few of the frames used for training the CAM models.
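A possible realization of a CAM model, assuming scikit-image HOG parameters that the paper does not specify, could look as follows.

```python
# Sketch of a CAM model: a linear SVM over HOG features of grayscale first frames.
# The HOG parameters below are assumptions; the paper only states that a
# fine-grained HOG descriptor of the 360x640 grayscale frame is used.
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def frame_hog(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def train_cam_model(positive_frames, negative_frames):
    # positive_frames / negative_frames: lists of first frames (BGR arrays)
    X = np.array([frame_hog(f) for f in positive_frames + negative_frames])
    y = np.array([1] * len(positive_frames) + [0] * len(negative_frames))
    return LinearSVC().fit(X, y)
```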

4.4 Filtering final predictions

The final predictions are of the form [s_i, e_i], where s_i and e_i are the starting and ending frame positions, respectively, of the i-th predicted segment. The value of d_i = e_i − s_i for a cricket stroke should be sufficiently large, considering the fact that the actions take time and the frame rate is constant at 25 FPS. A low value of d_i generally occurs due to gradual transitions, fast camera motion, or advertisements occurring in between the telecast video. We apply a filtering step to remove any segments that have d_i < T. The results for T = 0, 10, 20, ..., 100 on the validation set samples are given in Figure 4. As the best result occurs for T = 60, we set this value for our final accuracy on the test set samples.
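The filtering step itself is a one-liner; a sketch (with our own function name) is given below.

```python
# Minimal sketch of the filtering step: drop predicted segments shorter than T frames.
def filter_segments(segments, T=60):
    # segments: list of [start_frame, end_frame] pairs
    return [[s, e] for (s, e) in segments if (e - s) >= T]
```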


Fig. 3. A few training examples from our CAM1 and CAM2 sample datasets (positive and negative first frames for each camera model).

4.5 Temporal Intersection over Union (IoU)

The metric used for evaluation of the localization task is the Weighted Mean Temporal IoU, motivated from [4]. If, for a specific untrimmed video, the set of predicted segments is S = {s_1, s_2, ..., s_m} and the set of ground-truth segments is Ŝ = {ŝ_1, ŝ_2, ..., ŝ_n}, then the Weighted Mean Temporal IoU is given by Equation 2, where the mean is taken over the TIoU_i values of the i-th untrimmed video, each weighted by the number of ground-truth segments n_i in that video, and V is the total number of untrimmed videos in the test dataset. Each segment s ∈ S or ŝ ∈ Ŝ is of the form [start segment position, end segment position]. In Equation 3, TIoU_i is calculated by taking the maximum overlap of each ground-truth segment with all the predicted segments, and vice versa.

TIoU_{mean} = \frac{\sum_{i=1}^{V} n_i \, TIoU_i}{\sum_{i=1}^{V} n_i}    (2)

TIoU_i = \frac{1}{2} \left[ \frac{1}{n} \sum_{k=1}^{n} \max_{j \in N_m} \frac{|\hat{s}_k \cap s_j|}{|\hat{s}_k \cup s_j|} + \frac{1}{m} \sum_{k=1}^{m} \max_{j \in N_n} \frac{|s_k \cap \hat{s}_j|}{|s_k \cup \hat{s}_j|} \right]    (3)
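A sketch of Equations 2 and 3 in NumPy, treating segments as inclusive [start, end] frame intervals (the helper names are ours), is given below.

```python
# Sketch of Equations 2 and 3: temporal IoU per video and the weighted mean over
# the test set. Segments are [start_frame, end_frame] pairs (inclusive).
import numpy as np

def _interval_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union if union > 0 else 0.0

def tiou_video(pred_segments, gt_segments):
    # Equation 3: average of the best-match overlaps in both directions.
    if not pred_segments or not gt_segments:
        return 0.0
    gt_term = np.mean([max(_interval_iou(g, p) for p in pred_segments)
                       for g in gt_segments])
    pred_term = np.mean([max(_interval_iou(p, g) for g in gt_segments)
                         for p in pred_segments])
    return 0.5 * (gt_term + pred_term)

def weighted_mean_tiou(per_video_preds, per_video_gts):
    # Equation 2: each video's TIoU weighted by its number of ground-truth segments.
    weights = np.array([len(g) for g in per_video_gts])
    tious = np.array([tiou_video(p, g)
                      for p, g in zip(per_video_preds, per_video_gts)])
    return float((weights * tious).sum() / weights.sum())
```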

4.6 A Note on Data Parallelism

When dealing with any large-scale data processing, parallelism is essential. The extraction of a single feature from the entire dataset may take weeks if performed serially.


For our full dataset of ≈ 273 GB, the extraction of grayscale histogram differences takes more than 10 days. If we consider the extraction of a fine-grained feature like HOG, which is computationally expensive, it would take much more time. We followed a data parallelism approach for any kind of feature extraction or prediction. The untrimmed videos were divided into batches, and each batch was parallelized over a fixed number of processes running in parallel on different cores. Table 2 shows the approximate time for some of the feature extraction operations, with different batch sizes and running on "#Jobs" cores in parallel.

Table 2. Execution times compared with and without data parallelization. The Sorted Videos column refers to whether the videos were sorted based on their sizes.

Feature        Sorted Videos?   Batch Size   #Jobs   Time (approx.)
HDiffs Gray    Unsorted         1            1       > 240 hours
HDiffs BGR     Unsorted         50           10      > 40 hours
HDiffs BGR     Sorted           50           10      6.62 hours
Wt-χ² Diffs    Sorted           50           10      6.66 hours
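One way to obtain this kind of batched, multi-process feature extraction is joblib; the paper does not name the library it used, so the sketch below is only an assumption about how such a setup might look.

```python
# Sketch of data-parallel feature extraction: videos are processed in batches,
# with a fixed number of worker processes per batch. joblib is one option; the
# exact parallelization code used in the paper is not specified.
from joblib import Parallel, delayed

def extract_features_parallel(video_paths, extract_fn, batch_size=50, n_jobs=10):
    results = []
    # Sorting by file size helps balance work across processes (see Table 2).
    # video_paths are assumed to be pathlib.Path objects.
    video_paths = sorted(video_paths, key=lambda p: p.stat().st_size)
    for i in range(0, len(video_paths), batch_size):
        batch = video_paths[i:i + batch_size]
        results.extend(Parallel(n_jobs=n_jobs)(
            delayed(extract_fn)(str(p)) for p in batch))
    return results
```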

5 Results and Discussion

In this section, we describe the results of our individual models and our final predicted segments.

Shot Boundary Detection: CUTs detection worked best with the grayscale histogram difference feature. The number of histogram bins was set to the maximum (256), since changing it to any lower value decreased the accuracy. In the case of global thresholding, the weighted-χ² difference values perform better than the grayscale and RGB histogram differences. The accuracy of a trained random forest model exceeds the global thresholding approach by a big margin, and a random forest trained on the summed-up absolute grayscale histogram difference features works better than one trained on weighted-χ² differences. Refer to Table 3 for the accuracy of the CUTs model on the test set of our Highlight videos dataset.

Table 3. Evaluation results for shot boundary detection model

Model                Feature         Dataset      Precision   Recall   F-score
SBD RF (Bins: 256)   HDiffs (Gray)   Highlights   0.9796      0.9711   0.9753
SBD RF (Bins: 128)   HDiffs (Gray)   Highlights   0.9380      0.8510   0.8924


Camera Models: We train linear SVMs on the HOG features of the first frames. The accuracy values for the two trained models are given in Table 4. Out of 111 test samples, there was 1 false negative and 1 false positive for CAM1, while for CAM2 there were 4 false positives and 1 false negative. For the final predictions, we used these models on the extracted HOG features of the first frames.

Table 4. Evaluation results for CAM models.

Model            Feature   #Test samples   Error
CAM1 LinearSVM   HOG       111             1.80%
CAM2 LinearSVM   HOG       101             4.95%

TIoU: The weighted mean TIoU metric can be applied to any temporal localization task for untrimmed videos. A minimum value of 0 for this metric denotes that our predictions are completely off, while the maximum value of 1 denotes that we predict perfectly. Our predictions of the stroke segments were filtered, as explained in Section 4.4, and then evaluated on the validation set sample videos. Refer to Figure 4 for the results on the validation set samples. The value of the filtering parameter was set to 60, i.e., any predicted stroke segment of size less than 60 frames is neglected. The final evaluation result is calculated against the 30 untrimmed videos taken from the test set of our full dataset. The weighted mean TIoU was 0.5097.

We also tried modifying Algorithm 1 to make predictions on the first 5 frames of each shot and keep only those CAM shots for which p out of 5 frames are positive predictions. In each case, however, the accuracy was lower than what has been reported.

Fig. 4. Evaluation on validation set samples after filtering out segments smaller than the given size


6 Conclusion and Future work

In this work, we demonstrated a simple approach for the extraction of similar sporting actions from telecast videos by performing experiments over a large-scale dataset of cricket videos. Here, our action of interest was a cricket stroke played by a batsman. The sequence of extraction steps that we followed may be generalized to other sporting events as well. Extracting cricket strokes is a temporal localization problem, for which the final accuracy (weighted mean TIoU) was 0.5097, which is quite accurate considering the fact that we did not use any complicated approach, such as training CNNs or extracting any motion information.

Our objective is to come up with a large-scale cricket actions dataset, which can be used to train deep neural networks for cricket video understanding. There is a lot of scope for improvement of our results, which is our future work. In addition, given the individual cricket stroke segments, we would like to cluster them into different types by looking at the motion features. These clusters should represent the different types of cricket strokes, such as a stroke towards mid-wicket, towards long-off, etc.

References

1. Hawk-Eye Innovations: Hawk-Eye in cricket. https://www.hawkeyeinnovations.com/sports/cricket, accessed: 2018-03-12
2. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: YouTube-8M: A large-scale video classification benchmark. CoRR abs/1609.08675 (2016), http://arxiv.org/abs/1609.08675
3. Baccouche, M., Mamalet, F., Wolf, C.: Sequential deep learning for human action recognition. Proc. Int. Conf. Human Behavior Understanding (HBU) pp. 29-39 (2011). https://doi.org/10.1007/978-3-642-25446-8
4. Baraldi, L., Grana, C., Cucchiara, R.: A deep siamese network for scene detection in broadcast videos. In: Proceedings of the 23rd ACM International Conference on Multimedia. pp. 1199-1202. MM '15, ACM, New York, NY, USA (2015). https://doi.org/10.1145/2733373.2806316
5. Breiman, L.: Random forests. Machine Learning 45(1), 5-32 (Oct 2001). https://doi.org/10.1023/A:1010933404324
6. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273-297 (Sep 1995). https://doi.org/10.1023/A:1022627411411
7. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). vol. 1, pp. 886-893 (June 2005). https://doi.org/10.1109/CVPR.2005.177
8. Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4), 677-691 (April 2017). https://doi.org/10.1109/TPAMI.2016.2599174
9. Gao, J., Yang, Z., Chen, K., Sun, C., Nevatia, R.: TURN TAP: Temporal unit regression network for temporal action proposals. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
10. Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 961-970 (2015). https://doi.org/10.1109/CVPR.2015.7298698
11. Hu, W., Xie, N., Li, L., Zeng, X., Maybank, S.: A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 41(6), 797-819 (2011). https://doi.org/10.1109/TSMCC.2011.2109710
12. Jain, M., van Gemert, J., Jegou, H., Bouthemy, P., Snoek, C.G.M.: Tubelets: Unsupervised action proposals from spatiotemporal super-voxels. International Journal of Computer Vision 124(3), 287-311 (Sep 2017). https://doi.org/10.1007/s11263-017-1023-9
13. Jain, M., van Gemert, J., Jegou, H., Bouthemy, P., Snoek, C.G.M.: Action localization with tubelets from motion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 740-747 (2014). https://doi.org/10.1109/CVPR.2014.100
14. Ji, S., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), 221-231 (2013). https://doi.org/10.1109/TPAMI.2012.59
15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. pp. 1725-1732 (2014). https://doi.org/10.1109/CVPR.2014.223
16. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The Kinetics human action video dataset. CoRR abs/1705.06950 (2017), http://arxiv.org/abs/1705.06950
17. Ko, K.C., Kang, O.H., Lee, C.W., Park, K.H., Rhee, Y.W.: Scene change detection using the weighted chi-test and automatic threshold decision algorithm. Lecture Notes in Computer Science 3983, 1060-1069 (2006). https://doi.org/10.1007/11751632_114
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25, pp. 1097-1105. Curran Associates, Inc. (2012)
19. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: A large video database for human motion recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2556-2563 (2011). https://doi.org/10.1109/ICCV.2011.6126543
20. Kumar, A., Garg, J., Mukerjee, A.: Cricket activity detection. In: International Image Processing, Applications and Systems Conference, IPAS 2014. pp. 1-6 (2014). https://doi.org/10.1109/IPAS.2014.7043264
21. Ng, J.Y.H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4694-4702 (2015). https://doi.org/10.1109/CVPR.2015.7299101
22. Nga, D.H., Yanai, K.: Automatic extraction of relevant video shots of specific actions exploiting Web data. Computer Vision and Image Understanding 118, 2-15 (2014). https://doi.org/10.1016/j.cviu.2013.03.009
23. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
24. Ruiloba, R., Joly, P., Marchand-Maillet, S., Quenot, G.: Towards a standard protocol for the evaluation of video-to-shot segmentation algorithms. European Workshop on Content Based Multimedia Indexing 1 (1999)
25. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211-252 (2015). https://doi.org/10.1007/s11263-015-0816-y
26. Sharma, R.A., Sankar, K.P., Jawahar, C.V.: Fine-grain annotation of cricket videos. CoRR abs/1511.07607 (2015), http://arxiv.org/abs/1511.07607
27. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199, pp. 1-11 (2014), http://arxiv.org/abs/1406.2199
28. Smeaton, A.F., Over, P., Doherty, A.R.: Video shot boundary detection: Seven years of TRECVid activity. Computer Vision and Image Understanding 114(4), 411-418 (2010). https://doi.org/10.1016/j.cviu.2009.03.011
29. Soomro, K., Shah, M.: Unsupervised action discovery and localization in videos. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
30. Soomro, K., Zamir, A.R.: Action recognition in realistic sports videos, pp. 181-208. Springer International Publishing, Cham (2014). https://doi.org/10.1007/978-3-319-09396-3_9
31. Thomas, G., Gade, R., Moeslund, T.B., Carr, P., Hilton, A.: Computer vision for sports: Current applications and research topics. Computer Vision and Image Understanding 159, 3-18 (2017). https://doi.org/10.1016/j.cviu.2017.04.011
32. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)
33. Wang, J.R., Parameswaran, N.: Survey of sports video analysis: Research issues and applications. In: Proceedings of the Pan-Sydney Area Workshop on Visual Information Processing. pp. 87-90. VIP '05, Australian Computer Society, Inc., Darlinghurst, Australia (2004)
34. Wang, K., Wang, X., Lin, L., Wang, M., Zuo, W.: 3D human activity recognition with reconfigurable convolutional neural networks. In: Proceedings of the 22nd ACM International Conference on Multimedia. pp. 97-106. MM '14, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2647868.2654912
35. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks: Towards good practices for deep action recognition. CoRR abs/1608.00859 (2016), http://arxiv.org/abs/1608.00859
36. Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Learning to track for spatio-temporal action localization. CoRR abs/1506.01929 (2015), http://arxiv.org/abs/1506.01929
37. Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. arXiv (2015), http://arxiv.org/abs/1511.06984. https://doi.org/10.1109/CVPR.2016.293
38. Zheng, S., Dongang, W., Shih-Fu, C.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. pp. 1049-1058 (2016). https://doi.org/10.1109/CVPR.2016.119