
A Frame-based Decision Pooling Method for Video Classification

Ambika Ashirvad Mohanty1, Bipul Vaibhav2
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, Guwahati, India
[email protected], [email protected]

Amit Sethi
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati, Guwahati, India

Abstract— This paper proposes an ingenious and fast method to classify videos into fixed broad classes, which would assist searching and indexing using semantic keywords. The model extracts constituent frames from videos and maps low-level features extracted from these frames to high-level semantics. We use color, structure and texture features extracted from a standard image database to train an SVM classifier to classify videos into five different classes, viz. Mountains, Forests, Buildings, Deserts, and Seas, with reasonable accuracy. The model is expected to be quite fast with an optimized implementation, as the methods used for feature extraction are not computationally complex and have fast algorithms available.

Keywords—video classification; content-based video retrieval; feature extraction; SVM

I. INTRODUCTION
With the abundance of videos on a wide variety of topics available to us from various sources, most notably the Internet, there is an increased need to classify videos according to their content to enable easier searching and indexing. Most online sources of videos return a list of relevant videos in response to our queries. In most cases, the videos in the result have associated tags that are matched against the words of our query. However, a better approach would be to return the videos whose actual ‘contents’, as opposed to attached tags, match the semantic keywords. Most of the content-based classification methods proposed in recent years employ features extracted from key frames, or shots, and computationally heavy algorithms. Though these methods are quite effective in terms of results, they are time consuming and their hardware requirements are high. We propose a model to classify videos that can be implemented on basic computer hardware, yet returns good accuracy for basic, broad classes of videos.

The model we propose is better at classifying classes that are broader in scope, such as geographical environments, human crowds, fires, etc., as opposed to specific examples of events, e.g. car accidents, specific human interactions, etc. We tested the model on five different classes, viz. (a) Mountains, (b) Seas, (c) Tall Buildings, (d) Deserts and (e) Forests, and were able to obtain excellent results in classification of test videos. Of course, methods such as [1], [2] are more accurate when the classes involved are more specific. But what our model compromises in accuracy on specific classes, it compensates for with better speed in classification into broad categories.

Compared to techniques described in [1], [2], which involve spatio-temporal segmentation, we split a given video into its individual frames (as many as there are in the video) and extract features from each such frame; each frame is then classified into one of the five classes by an SVM classifier trained on a large image database, similar to the method adopted in [3] (in [3], one out of every seven frames is used for feature extraction, whereas we use all of the constituent frames). The video is assigned to the class to which a majority of its frames belong.

The features that were used to train the said model were of three types: (a) Texture features, (b) Color features, and (c) Structure features. Texture features account for information about the spatial arrangement of color or intensities in an image or selected region of an image [4]. Structure features provide information about edges and boundaries detected in an image, while color features are a measure of pixel level color intensities.

II. PROPOSED MODEL
Our objective while designing the model was computational simplicity, as opposed to obtaining exceptional accuracy. Hence, we made sure that the algorithms and methods used for feature extraction, as well as for training the classifier, were not very complex.

500 training images from the ‘Corel’ image dataset are used for training. The features extracted from these images were 18 color features from RGB histograms; 258 texture features from the entropy of the FFT, the entropy of the grayscale image, and the range of the grayscale image; and six structural features from the Canny operator. An SVM classifier was then trained on these 282 features from the 500 training images. The following sections describe in detail the process of feature extraction, the training, and finally the testing and test results.

A. Feature Extraction
Prior to feature extraction from a given image, we needed to convert it to a resolution that would be constant for all training images, since features extracted from images of different resolutions would differ in number. We chose this constant resolution to be 256-by-256. After converting the training images to the same resolution, we proceeded to extract the color, texture and structure features as described in the following sections.
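As a rough sketch of this per-frame pipeline (Python with OpenCV and NumPy is an assumption on our part; the paper itself mentions MATLAB), the preprocessing and 282-dimensional feature assembly might look as follows. The three extract_* helpers are hypothetical names, sketched in the subsections below.

```python
# Sketch of the per-frame preprocessing and 282-D feature assembly described above.
# Assumes OpenCV (cv2) and NumPy; extract_color_features, extract_texture_features
# and extract_structure_features are hypothetical helpers sketched in the
# following subsections.
import cv2
import numpy as np

TARGET_SIZE = (256, 256)  # constant resolution used for every image/frame

def extract_feature_vector(image_bgr: np.ndarray) -> np.ndarray:
    """Resize to 256x256 and concatenate the 18 + 258 + 6 = 282 features."""
    resized = cv2.resize(image_bgr, TARGET_SIZE)
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
    color = extract_color_features(resized)        # 18 colour features
    texture = extract_texture_features(gray)       # 258 texture features
    structure = extract_structure_features(gray)   # 6 structure features
    return np.concatenate([color, texture, structure])  # shape (282,)
```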

a) Color features: For colour features, we used statistical properties of normalized RGB histograms as described in [5]. The RGB colour space can be defined as a unit cube with red, green, and blue axes [6]. A normalized RGB colour space is one in which the RGB colour cube has unit dimensions, so all values of red, green, and blue are assumed to lie in the range [0, 1] [7]. An RGB histogram can be thought of as a type of bar graph, where each bar represents a particular colour of the RGB colour space being used. The ‘bars’ in a colour histogram are referred to as bins and are represented on the x-axis, while the y-axis represents the number of pixels in each bin [6]. In our work, we used 12 bins each for red, green and blue. The statistical values from [5] that were used as colour features are described below. The first-order histogram probability is P(g) = N(g)/M, where M is the total number of pixels in the image and N(g) is the number of pixels at grey level g. We used six colour features based on the first-order histogram probability, viz. mean, standard deviation, two forms of skew, energy and entropy. The mean is defined as

$\bar{g} = \sum_{g=0}^{L-1} g\,P(g) = \frac{1}{M}\sum_{r}\sum_{c} I(r,c)$

where L is the total number of intensity levels (L = 256 for 8-bit data), and ‘r’ and ‘c’ represent the row and column indices of the image I. The mean is representative of brightness. The second feature used is the standard deviation, defined as

$\sigma_{g} = \sqrt{\sum_{g=0}^{L-1} (g - \bar{g})^{2}\,P(g)}$

which is a measure of contrast. Skew is defined as

$\mathrm{SKEW} = \frac{1}{\sigma_{g}^{3}} \sum_{g=0}^{L-1} (g - \bar{g})^{3}\,P(g)$

and is a measure of asymmetry about the mean in the intensity-level distribution. An alternate definition of skew, also used by us as a colour feature, is given by

$\mathrm{SKEW}' = \frac{\bar{g} - \mathrm{mode}}{\sigma_{g}}$

The final two features used for colour are energy and entropy, defined as

$\mathrm{ENERGY} = \sum_{g=0}^{L-1} \left[ P(g) \right]^{2}, \qquad \mathrm{ENTROPY} = -\sum_{g=0}^{L-1} P(g)\,\log_{2} P(g)$

Energy is an indication of the distribution of intensity levels, and entropy gives us the number of bits needed to code the image data. Thus, we have six colour features for each histogram and, accounting for the three histograms, a total of 18 colour features.
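A minimal sketch of these colour features (Python with NumPy is an assumption on our part): six first-order statistics per 12-bin channel histogram, 18 values in total. The mode used in the alternate skew is approximated here by the peak histogram bin.

```python
import numpy as np

def first_order_stats(p: np.ndarray) -> np.ndarray:
    """Mean, standard deviation, two skew measures, energy and entropy of a
    normalized histogram p (the six first-order statistics of [5])."""
    g = np.arange(len(p))                                  # bin indices as grey levels
    mean = np.sum(g * p)
    std = np.sqrt(np.sum((g - mean) ** 2 * p))
    skew = np.sum((g - mean) ** 3 * p) / (std ** 3 + 1e-12)
    skew_alt = (mean - g[np.argmax(p)]) / (std + 1e-12)    # (mean - mode) / std, mode ~ peak bin
    energy = np.sum(p ** 2)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return np.array([mean, std, skew, skew_alt, energy, entropy])

def extract_color_features(image_bgr: np.ndarray) -> np.ndarray:
    """18 colour features: six statistics for each channel's 12-bin histogram.
    Note: OpenCV loads images in B, G, R channel order."""
    feats = []
    for c in range(3):
        hist, _ = np.histogram(image_bgr[:, :, c], bins=12, range=(0, 256))
        feats.append(first_order_stats(hist / hist.sum()))
    return np.concatenate(feats)
```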

b) Edge-Detection (Structural) features: Edge detection refers to the process of identifying and locating sharp discontinuities in an image, i.e. abrupt changes in pixel intensity that characterize the boundaries of objects in a scene [11]. There are several edge detection techniques available, which vary in complexity and in accuracy depending on the edge orientation, noise environment and edge structure [11]. Some popular edge detection techniques are the Sobel operator, the Roberts cross method and the Prewitt operator. In our work, we made use of a very efficient edge detection technique, Canny’s edge detection algorithm [12]. The steps involved in this algorithm are as follows. In the first step we compute a 2-D spatial gradient measurement on the image: pixel values at each point of the output represent the estimated absolute magnitude of the spatial gradient of the input image at that point. The gradient is computed by convolving the image with a pair of masks, Gx and Gy, one for each of the horizontal and vertical directions.

Let us refer to the convolution results as Gxx and Gyy. Thus, Gxx = Gx*I, and Gyy = Gy*I, where I is the grayscale image. These are then used to find the magnitude and orientation, as follows.

|G| = |Gxx| + |Gyy|    (1)
Θ = tan⁻¹(Gyy / Gxx)    (2)

The angle obtained in (2) is the edge orientation, which is resolved into one of four directions (0, 45, 90 and 135 degrees), whichever it is closest to [10]. From the direction map obtained via (2), an 18-bin histogram is then built, and the six statistical features described in the section on colour features are computed from it, giving six structure features.
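As a loose sketch of these structure features (again Python with OpenCV and NumPy as assumptions; the paper's exact convolution masks and four-direction quantization step are not reproduced here), one could build an 18-bin orientation histogram over Canny edge pixels and reuse the six statistics from the colour-feature sketch:

```python
import cv2
import numpy as np

def extract_structure_features(gray: np.ndarray) -> np.ndarray:
    """Six structure features: the six first-order statistics of an 18-bin
    histogram of gradient orientations, taken at Canny edge pixels."""
    edges = cv2.Canny(gray, 100, 200)                 # binary edge map (thresholds are placeholders)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # stand-in horizontal gradient mask
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # stand-in vertical gradient mask
    theta = np.degrees(np.arctan2(gy, gx)) % 180.0    # edge orientation in [0, 180)
    hist, _ = np.histogram(theta[edges > 0], bins=18, range=(0.0, 180.0))
    p = hist / max(hist.sum(), 1)
    return first_order_stats(p)   # same six statistics as for the colour histograms
```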

Figure 1. Examples of frames classified into the five different classes: (a) Mountain, (b) Forest, (c) Building, (d) Desert, (e) Sea

Although other operators are faster than Canny (and speed has been an important factor throughout this work), the Canny operator was chosen because the other operators, while quite successful on most of the classes we have chosen, failed to differentiate between certain classes with intuitively similar edge structure, as mentioned in Section III. This problem was solved by using Canny’s algorithm.

c) Texture Features: Subjectively, texture can be described as an attribute representing the spatial arrangement of the gray levels of the pixels in a region of a digital image [8]. Statistics of the 2-D Fourier transform of an image, for which fast implementations are available, can be used to effectively characterize its texture [9]. In our model, we used the entropy of the FFT of the image converted to grayscale, which gave us one texture feature. Another texture feature we used was the entropy of the grayscale image itself, a statistical measure of randomness. In addition to these two entropy features, we used the ‘range’ of the grayscale matrix, which gave us 256 additional features: the range function in MATLAB returns an array in which each element is the range (maximum minus minimum) of the corresponding column of the grayscale input image, so the number of features obtained is equal to the number of columns in the image. For uniformity in the number of features extracted from all of the images, we convert every image to 256-by-256 resolution before processing, so we obtain 256 features from the range in addition to the two entropy features, for a total of 258 texture features. There are various methods of texture analysis with different feature extraction schemes; some popular ones are the grey level co-occurrence matrix (GLCM), the run length matrix (RLM), and Markov random fields (MRF), among others [10]. The reason we chose to use the entropy of the FFT and the range of the grayscale matrix was computational simplicity, which has been a focus throughout our work.
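A minimal sketch of these 258 texture features (Python with NumPy as an assumption; the entropy here is a 256-bin histogram entropy and is not necessarily identical to the MATLAB functions used in the paper):

```python
import numpy as np

def extract_texture_features(gray: np.ndarray) -> np.ndarray:
    """258 texture features for a 256x256 grayscale image: entropy of the FFT
    magnitude, entropy of the grayscale image, and the 256 per-column ranges."""
    def entropy_of(values: np.ndarray) -> float:
        hist, _ = np.histogram(values, bins=256)
        p = hist / hist.sum()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    fft_entropy = entropy_of(np.abs(np.fft.fft2(gray)))   # entropy of 2-D FFT magnitude
    gray_entropy = entropy_of(gray)                        # entropy of the grayscale image
    col_range = gray.max(axis=0) - gray.min(axis=0)        # range (max - min) of each column
    return np.concatenate([[fft_entropy, gray_entropy], col_range.astype(float)])
```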

B. Training the SVM Classifier
Representing each image as a feature vector of 282 elements, we can consider every image as a point in a 282-D space. To classify these points, we employ a support vector machine (SVM). A support vector machine constructs a hyper-plane, or a set of hyper-planes, in a high- or infinite-dimensional space, which can be used for classification. An SVM classifier is a binary classifier, but we are dealing with multiple classes in our work, so to avoid using multiple steps of binary classification, we used five SVMs to achieve ‘one vs. all’ classification. We used the ‘Corel’ dataset of images to train our classifier, with 100 images of each of the five classes. In the first step, the 282 features of the training images were extracted, and the SVM classifier was trained to map the features to the five different classes. For each class, we trained an SVM to categorize an image as belonging either to that particular class or to any of the other classes. Thus, after five such iterations, we are able to determine exactly which class a particular image belongs to.
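A minimal sketch of this one-vs-all training stage, assuming scikit-learn (the paper does not specify the SVM implementation or kernel, so a linear SVM inside scikit-learn's one-vs-rest wrapper is used here purely for illustration; the feature scaling step is also an added assumption):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

CLASSES = ["Mountains", "Forests", "Buildings", "Deserts", "Seas"]

def train_classifier(X: np.ndarray, y: np.ndarray):
    """Train five one-vs-all SVMs on 282-D feature vectors.
    X has shape (n_images, 282); y holds class indices 0..4 into CLASSES."""
    clf = make_pipeline(StandardScaler(),                 # scaling: an added assumption
                        OneVsRestClassifier(LinearSVC()))  # one binary SVM per class
    clf.fit(X, y)
    return clf
```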

C. Testing and Results
To test the model we used 250 short-duration test videos (50 belonging to each class), obtained from “footage.shutterstock.com”. To test a particular video, we first break it down into its individual constituent frames. Each of these frames is then converted to 256-by-256 resolution so as to have uniformity in the number of features extracted per frame. Next, each frame is tested on the SVM classifier, which classifies it into one of the five set classes. A few typical examples of frames classified into the five classes are shown in Fig. 1.

After testing and classifying all of the frames, we check into which class a majority of the frames have been classified. If the ratio of the frames classified into that particular class to the total number of frames is greater than a pre-set ratio (0.3 in our case), then the video is assigned to that class.
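A minimal sketch of this frame-based decision pooling (Python with OpenCV is an assumption; extract_feature_vector and CLASSES refer to the hypothetical helpers sketched earlier):

```python
import cv2
import numpy as np

def classify_video(path: str, clf, ratio_threshold: float = 0.3):
    """Classify every constituent frame of a video and return the majority class,
    provided it accounts for at least `ratio_threshold` of the frames."""
    cap = cv2.VideoCapture(path)
    votes = np.zeros(len(CLASSES), dtype=int)
    while True:
        ok, frame = cap.read()
        if not ok:                                         # no more frames
            break
        features = extract_feature_vector(frame)           # 282-D vector, as sketched earlier
        votes[int(clf.predict(features.reshape(1, -1))[0])] += 1
    cap.release()
    total = votes.sum()
    best = int(np.argmax(votes))
    if total and votes[best] / total >= ratio_threshold:
        return CLASSES[best]
    return None                                             # no sufficiently dominant class
```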

Figure 2. Flowchart depicting the model

The classification results obtained on the 250 test videos were remarkably accurate for a model as relatively simple as ours. We used fifty videos belonging to each class; the number of videos out of fifty that were correctly classified is shown in the following table.

Class        Correctly classified (out of 50)    Percentage
Mountains    48                                  96
Buildings    46                                  92
Seas         42                                  84
Forests      44                                  88
Deserts      41                                  82

III. DISCUSSION
• Although the model proposed is for video classification, it can be used as an image classifier as well. For this, we just need to use images instead of frames extracted from videos.

• The results we obtained are quite good, but accuracy can be expected to decrease when classifying videos obtained from other sources. This is because videos generally contain human/animal interactions, and even if their ‘background’ belongs to one of the aforementioned classes, the features extracted from such frames may not be discriminative enough for accurate classification. For videos with minimal human/animal presence, however, the model works very accurately, is quite fast, and is relatively simple to implement.

• Instead of a “one vs. all” classifier, an alternative could be a “one vs. one” classifier applied in multiple stages. For example, we could first separate the classes into two broad groups, natural and artificial; within each group we could add two more subclasses, which could be further divided into two similar classes. This process of step-by-step binary classification would be slower than the method adopted by us, but it could be advantageous in that, when a video cannot be classified distinctly into one particular class, it can still be classified under a broader subclass. Also, when implemented in, say, a search engine, this method would enable users to search using vague keywords.

• Canny’s algorithm was chosen after testing the videos with a model trained using features from the Roberts cross and Sobel operators. In both cases, the model often confused mountains with buildings, hence returning poor accuracy for both classes. However, if one of these classes were replaced with another that has quite different structure features, then using the Roberts cross or Sobel operator would be a faster option.

REFERENCES

[1] L.-Q. Xu and Y. Li, “Video classification using spatial-temporal features and PCA.”
[2] M. R. Naphade and T. S. Huang, “A probabilistic framework for semantic video indexing, filtering and retrieval.”
[3] M. Yang, Y. Lin, F. Lv, S. Zhu, K. Yu, M. Dikmen, L. Cao, and T. S. Huang, “Video semantic indexing using image classification.”
[4] L. G. Shapiro and G. C. Stockman, Computer Vision. Upper Saddle River: Prentice–Hall, 2001.
[5] S. Sergyan, “Color histogram features based image classification in content-based image retrieval systems,” in Applied Machine Intelligence and Informatics, 2008 (SAMI 2008), 6th International Symposium on, pp. 221–224, Jan. 2008.
[6] S. V. Sakhare and V. G. Nasre, “Design of feature extraction in content based image retrieval (CBIR) using color and texture.”
[7] M. Poorani, T. Prathiba, and G. Ravindran, “Integrated feature extraction for image retrieval.”
[8] IEEE Std. 610.4-1990.
[9] U. Indahl and T. Næs, J. Chemometrics, vol. 12, pp. 261–278, 1998.
[10] M. H. Bharati, J. J. Liu, and J. F. MacGregor, “Image texture analysis: methods and comparisons.”
[11] R. Maini and H. Aggarwal, “Study and comparison of various image edge detection techniques.”
[12] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–698, 1986.