


Framework for Texture Classification and Retrieval Using Scale Invariant Feature Transform

Tuan Do, Antti Aikala and Olli Saarela VTT Technical Research Centre of Finland

Espoo, Finland [email protected], [email protected], [email protected]

Abstract— Texture images can be characterized with key features extracted from the images. In this paper, the scale invariant feature transform (hereinafter SIFT) algorithm is utilized to generate local features for texture image classification. The local features are selected as inputs for the texture classification framework. For each texture category, a texton dictionary is built based on the local features. To establish the texton dictionary, an adaptive mean shift clustering algorithm is run on all local features to generate key features (called textons) for the texton dictionary. The texton dictionaries of the different texture categories are supposed to be distinctive from each other to provide the highest performance in terms of classification accuracy. A framework is proposed for classifying images into their corresponding categories by matching their local features with textons from the texton dictionaries. This is done with a histogram of 'match' vectors versus texture categories. Finally, our texture image database and the Ponce texture database are used to test the proposed approach. The results indicate the potential of the proposed method through the high classification accuracies achieved: 100% with our testing database for both classification and retrieval, and 92% and 100% with the Ponce database for classification and retrieval, respectively.

Keywords- SIFT; local feature; adaptive mean shift clustering; texton; texton dictionary

I. INTRODUCTION

Local invariant features have recently been receiving more and more attention in the computer vision field. Several descriptors have been proposed, such as SIFT [1], the gradient location and orientation histogram (GLOH), steerable weighted median filters [2], [3] and the spin image [4]. Many interesting applications are based on these features, such as face, object and scene recognition with SIFT [5], [6], human age estimation with GLOH [7] and automatic identification of 3D landmarks [8]. Among these features, SIFT features were considered to have advantages over the others for some specific applications, such as moving object recognition [9]. SIFT features are more robust to image scaling and rotation as well as to illumination changes.

Texture image classification has been studied for years. Many research works on texture classification with well-known features have been published [10], [11]. In this work, we propose a classification framework that utilizes SIFT features for texture image classification. Texture images from the same category taken by cameras are easily affected by scaling, rotation and illumination. However, features extracted using the SIFT algorithm are expected to be similar for images within each texture category. This motivates us to exploit local features from the SIFT algorithm for texture classification and retrieval.

Texture recognition is concerned with detecting similarity within a texture database using some key features. To generate key features (hereafter called textons) for each category, we first extract the appearance-based local features of the whole training set, containing a group of monochrome texture images, with the SIFT algorithm. A number of textons representing the main characteristics of each texture category are defined by the centers of the clusters formed from all local features generated by the SIFT algorithm. The most widely used clustering technique, k-means, has two inherent limitations: the clusters are constrained to be spherically symmetric and their number has to be known a priori. Therefore, we use the adaptive mean shift clustering algorithm [12] to avoid these weaknesses of k-means clustering. On top of the SIFT algorithm, a framework for texture classification is proposed.

The paper is organized as follows. A brief introduction of SIFT is given in Section 2. Section 3 describes the feature extraction and texton dictionary creation based on local features from SIFT. A classification framework is proposed in Section 4, while experiments and results are presented in Section 5. Final concluding remarks and suggestions for future study are made in Section 6.

II. SCALE INVARIANT FEATURE TRANSFORM

The SIFT algorithm consists of four main stages: scale-space extrema detection, keypoint localization, orientation assignment, and keypoint descriptor generation. In this section, the algorithm comprising these four steps is briefly introduced. The details of the SIFT algorithm can be found in [1].

A. Scale-space Extrema Detection

The scale space L(x, y, σ) is defined by the following function:

    L(x, y, σ) = G(x, y, σ) * I(x, y)    (1)

where '*' denotes the convolution operator, G(x, y, σ) is a variable-scale Gaussian kernel and I(x, y) is the intensity of the pixel whose coordinates are x and y.
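As a concrete illustration of Eq. (1), the following minimal Python sketch builds a small Gaussian scale space with SciPy; the sigma values are illustrative assumptions, not the paper's settings.

    # Sketch of Eq. (1): build a Gaussian scale space L(x, y, sigma) = G(x, y, sigma) * I(x, y).
    # Assumes a grayscale image supplied as a 2-D NumPy array; the sigma list is illustrative.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def gaussian_scale_space(image, sigmas=(1.6, 2.26, 3.2, 4.53)):
        """Return a list of Gaussian-blurred copies of `image`, one per sigma."""
        image = image.astype(np.float32)
        return [gaussian_filter(image, sigma=s) for s in sigmas]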



SIFT is a technique that locates scale-space extrema in the difference-of-Gaussian images D(x, y, σ), given by:

    D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)    (2)

where k = 1, 2, 3, ... is used to represent the different scale spaces. To detect the local maxima or minima of D(x, y, σ), each pixel is compared with its eight neighbors at the same scale and with its nine neighbors at each of the scales above and below. If its value is larger than all 26 neighbors it is a maximum; if it is smaller than all of them, it is a minimum.

B. Keypoint Localization

Unstable extrema that have low contrast or are poorly localized along an edge need to be rejected. The offset z of an extremum from the sample point is given by:

    z = −(∂²D/∂x²)⁻¹ (∂D/∂x)    (3)

Points where the function value at z is below a threshold are discarded; this removes extrema with low contrast.
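A small sketch of Eq. (3) and the contrast check follows; the finite-difference inputs and the 0.03 threshold are common SIFT choices taken as assumptions here, not values stated in the paper.

    # The offset z is obtained from the gradient and Hessian of D at the sample point;
    # keypoints whose interpolated |D(z)| falls below a contrast threshold are discarded.
    import numpy as np

    def refine_and_check_contrast(grad, hessian, d_value, contrast_threshold=0.03):
        """grad: 3-vector dD/d(x, y, sigma); hessian: 3x3 matrix of second derivatives of D."""
        z = -np.linalg.solve(hessian, grad)          # offset from the sampled extremum, Eq. (3)
        d_at_z = d_value + 0.5 * grad.dot(z)         # interpolated value of D at the offset
        return z, abs(d_at_z) >= contrast_threshold  # keep the keypoint only if contrast is high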

Using the 2x2 Hessian matrix H computed at the location and scale of the keypoint, the principal curvatures, which are proportional to the eigenvalues of H, can be computed:

    H = [ Dxx  Dxy
          Dxy  Dyy ]    (4)

The criterion for eliminating poorly localized edge responses can be constructed as follows:

    Tr(H)² / Det(H) < (r + 1)² / r,   r = α/β    (5)

where α is the eigenvalue with the larger magnitude and β is the eigenvalue with the smaller magnitude. Keypoints for which this inequality does not hold are rejected.
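The edge test of Eqs. (4)-(5) can be sketched as follows; r = 10 is the value suggested in [1] and is only an illustrative default here.

    # Reject keypoints lying on edges by testing the ratio of principal curvatures through
    # the trace and determinant of the 2x2 Hessian of D at the keypoint.
    def passes_edge_test(dxx, dyy, dxy, r=10.0):
        trace = dxx + dyy
        det = dxx * dyy - dxy * dxy
        if det <= 0:                      # curvatures of different signs: not a stable keypoint
            return False
        return trace * trace / det < (r + 1.0) ** 2 / r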

C. Orientation Assignment

This stage aims to assign a consistent orientation to each keypoint based on local image properties. The keypoint descriptor is represented relative to this orientation so that it is invariant to rotation of the keypoint neighborhood. The approach taken to find an orientation consists of the following steps:

Step 1: Use the keypoint scale to select the Gaussian-smoothed image L. Compute the gradient magnitude m(x, y) and orientation θ(x, y) with the following two equations:

    m(x, y) = √( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )    (6)

    θ(x, y) = tan⁻¹( (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) )    (7)

Step 2: Form an orientation histogram from the gradient orientations of the sample points.

Step 3: Locate the highest peak in the histogram.

Step 4: Assign to the keypoint the orientation corresponding to the highest peak, as well as any local peaks within 80% of the highest peak. Some points may therefore be assigned multiple orientations. A Python sketch of the whole orientation-assignment procedure follows this list.
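The sketch below covers Eqs. (6)-(7) and Steps 2-4 under stated assumptions: the 36-bin histogram resolution follows [1], the patch radius is illustrative, and the plain (unweighted) histogram is a simplification of the Gaussian-weighted one.

    # Compute gradient magnitude and orientation on the selected Gaussian level L, accumulate
    # an orientation histogram around the keypoint, and keep every peak within 80% of the
    # highest one.
    import numpy as np

    def assign_orientations(L, y, x, radius=8, num_bins=36):
        patch = L[y - radius:y + radius + 1, x - radius:x + radius + 1].astype(np.float32)
        dx = patch[1:-1, 2:] - patch[1:-1, :-2]          # L(x+1, y) - L(x-1, y)
        dy = patch[2:, 1:-1] - patch[:-2, 1:-1]          # L(x, y+1) - L(x, y-1)
        magnitude = np.sqrt(dx ** 2 + dy ** 2)           # Eq. (6)
        orientation = np.arctan2(dy, dx)                 # Eq. (7), in radians
        bins = ((orientation + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
        hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=num_bins)
        peak = hist.max()
        return [2 * np.pi * b / num_bins - np.pi
                for b in range(num_bins) if hist[b] >= 0.8 * peak]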

D. Keypoint Descriptor

The local gradient data used above is also used to create the keypoint descriptors. The gradient information is rotated to line up with the orientation of the keypoint and then weighted by a Gaussian kernel with a variance of 1.5 times the keypoint scale. This data is then used to create a set of histograms over a window centered on the keypoint. Keypoint descriptors typically use a set of 16 histograms, aligned in a 4x4 grid, each with eight orientation bins: one for each of the main compass directions and one for each of the mid-points between them. This results in a feature vector containing 128 elements.
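A minimal sketch of this 4x4-grid, 8-bin layout is given below; the Gaussian weighting and trilinear interpolation used in [1] are omitted, and the 16x16 input patch is assumed to be already rotated to the keypoint orientation.

    # Split a 16x16 gradient patch into a 4x4 grid of cells, build an 8-bin orientation
    # histogram per cell, and concatenate into a 128-element vector.
    import numpy as np

    def descriptor_128(magnitude, orientation):
        """magnitude, orientation: 16x16 arrays around the keypoint (orientation in radians)."""
        bins = ((orientation + np.pi) / (2 * np.pi) * 8).astype(int) % 8
        cells = []
        for gy in range(4):
            for gx in range(4):
                cell_m = magnitude[4 * gy:4 * gy + 4, 4 * gx:4 * gx + 4]
                cell_b = bins[4 * gy:4 * gy + 4, 4 * gx:4 * gx + 4]
                cells.append(np.bincount(cell_b.ravel(), weights=cell_m.ravel(), minlength=8))
        vec = np.concatenate(cells)                     # 16 cells x 8 bins = 128 values
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec          # normalisation as in [1]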

III. FEATURE EXTRACTION AND TEXTON DICTIONARY

This section is concerned with how to generate a texton dictionary for a texture image category. A texton dictionary contains a list of key features. The features in a texton dictionary are supposed to be distinctive from those of other texton dictionaries. This distinctness helps to distinguish one texture image category from another.

As previously mentioned, SIFT features are used in this paper. The SIFT features for a monochrome image are generated by applying the SIFT algorithm. The output of the algorithm is a list of features: 128-dimensional vectors carrying location, weight and orientation information.
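One possible way to obtain such 128-dimensional features is the OpenCV implementation of SIFT; the paper does not state which implementation was used, so the snippet below is only an illustrative stand-in and requires opencv-python version 4.4 or newer.

    # Extract SIFT keypoints and 128-dimensional descriptors from a monochrome texture image.
    import cv2

    def extract_sift_features(image_path):
        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)   # monochrome texture image
        sift = cv2.SIFT_create()
        keypoints, descriptors = sift.detectAndCompute(image, None)
        # descriptors is an (N, 128) array; the location, scale and orientation of each
        # feature are available through the corresponding entry of `keypoints`.
        return keypoints, descriptors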

Fig. 1. Schematic of texton dictionary creation. AMSC stands for adaptive mean shift clustering.

Suppose that p texture image categories are considered in this context. To build a texton dictionary for the i-th texture image category, the SIFT algorithm is applied to several selected texture images in order to generate a set of SIFT features. Since the number of generated features is large and the targeted texton dictionary should be compact and contain only the most important key features, the adaptive mean shift clustering algorithm [12] is applied to generate c_i centroid feature vectors (referred to as textons) from the set of local features generated by SIFT. Hence, the c_i textons are representative of the category, and this collection of c_i textons is called the 'texton dictionary' of the category. The value c_i is proportional to the number of local features generated by SIFT. The texton dictionary creation procedure is illustrated in Fig. 1.
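The dictionary-building step of Fig. 1 can be sketched as follows. scikit-learn's MeanShift is used here as a stand-in for the adaptive mean shift algorithm of [12], and the bandwidth estimation is an assumption rather than the paper's setting.

    # Cluster all SIFT descriptors of one category and keep the cluster centres as textons.
    from sklearn.cluster import MeanShift, estimate_bandwidth

    def build_texton_dictionary(descriptors):
        """descriptors: (N, 128) array of SIFT features from the training images of one category."""
        bandwidth = estimate_bandwidth(descriptors, quantile=0.2)
        ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
        ms.fit(descriptors)
        return ms.cluster_centers_          # the c_i textons of this category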

IV. CLASSIFICATION FRAMEWORK

To classify a testing texture image into its corresponding category, the pattern classification framework utilizing the local feature vectors and the texton dictionaries consists of the following five steps, visually presented in Fig. 2; a code sketch of the matching and voting steps follows the list.

Step 1: Apply the SIFT algorithm to the testing texture image to generate a set of 128-dimensional local feature vectors.

Step 2: Calculate the Euclidean distance between each local feature vector and each texton in the texton dictionary of each texture category. For two 128-dimensional vectors V1 and V2 with coordinates {V1^1, V1^2, ..., V1^128} and {V2^1, V2^2, ..., V2^128}, respectively, the Euclidean distance is calculated as follows:

    D12 = √( Σ_{i=1..128} (V1^i − V2^i)² )    (8)

Step 3: Find the category containing the texton that minimizes the distance. This texton is called the 'match' vector.

Step 4: Repeat steps 2 and 3 for all local feature vectors of the testing image. Generate a histogram of the number of 'matches' for each texture category.

Step 5: Classify the texture image into the corresponding category based on the created histogram. The testing texture image belongs to the category with the highest number of 'match' vectors.
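The matching and voting of Steps 2-5 can be sketched as follows; `dictionaries` (mapping category name to its (c_i, 128) texton array) and `test_descriptors` (the SIFT descriptors from Step 1) are assumed inputs, and this is only an illustrative reading of the framework, not the authors' code.

    # Every SIFT feature of the test image votes for the category whose dictionary contains
    # its nearest texton (Euclidean distance, Eq. (8)); the image is assigned to the
    # category with the most votes.
    import numpy as np

    def classify_texture(test_descriptors, dictionaries):
        votes = {category: 0 for category in dictionaries}
        for feature in test_descriptors:                      # Step 4: loop over all features
            best_category, best_distance = None, np.inf
            for category, textons in dictionaries.items():    # Steps 2-3: nearest texton
                distance = np.linalg.norm(textons - feature, axis=1).min()
                if distance < best_distance:
                    best_distance, best_category = distance, category
            votes[best_category] += 1                         # histogram of 'match' vectors
        return max(votes, key=votes.get)                      # Step 5: most votes wins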

Fig. 2. Schematic of the classification procedure

V. EXPERIMENTS AND RESULTS

In this work, SIFT was implemented with the number of scale spaces set to four and the number of differences of Gaussians (DoG) set to five. In the scale space the image was enlarged by a factor of two and scaled down by a factor of two. Since a texture image can generate many features in small areas, enlarging the image in the first step is necessary for finding local features even in small areas.

The first experiment was targeted at demonstrating how distinctive the SIFT local features are from one texture image category to another. The SIFT algorithm was used to extract local features from 27 texture images representing 27 different texture categories (i.e. one picture per category). In this experiment, the adaptive mean shift clustering algorithm was not used, so each local feature was treated as a single texton in the texton dictionary. Ten of the example texture images are shown in Fig. 3.

The classification framework proposed in Section 4, without the mean shift clustering algorithm, was demonstrated with ten randomly selected images belonging to ten of the 27 texture image categories. The final outcome of the classification shows that all ten texture images were classified into their exact corresponding categories. Fig. 4 shows the output histograms for two of the ten texture images, A and B. According to the histograms, textures A and B should belong to the categories containing texture images number 6 and 16, respectively. Indeed, texture image A is clearly similar to texture number 6, and texture B is similar to texture number 16 in the database, as shown in Fig. 5. Since only one image per category is used, the retrieval accuracy is ensured to be 100%.

Fig. 3. An example of texture image training set

Fig. 4. Histogram of features for 2 textures (i.e. A and B)

The high classification accuracy in the previous experiment indicates the potential of exploiting all extracted SIFT features for classification. However, texture images generally generate a huge number of features of high dimensionality (128 in the previous experiment), so the classification framework would suffer from a large computational complexity. One potential way to avoid this is to narrow down the number of features with the adaptive mean shift clustering algorithm.

In the second experiment, the proposed classification framework with the adaptive mean shift clustering algorithm was tested with the texture images from the Ponce database. The database contains 25 texture image categories with four texture images each (i.e. 100 images in total). To build texton dictionaries for all categories, three texture images from each category were randomly selected. The texton dictionaries were built using the model described in Section 3. This leaves 25 texture images, one from each category, for testing the proposed framework.
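A minimal sketch of this split is given below; the per-category file layout and the helper name are assumptions used only for illustration.

    # For each of the 25 Ponce categories (4 images each), three randomly chosen images go
    # into dictionary building and the remaining one is held out for testing.
    import random

    def split_category(image_paths, n_train=3, seed=None):
        """image_paths: list of the 4 file paths for one category."""
        rng = random.Random(seed)
        paths = list(image_paths)
        rng.shuffle(paths)
        return paths[:n_train], paths[n_train:]   # (dictionary images, test image)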

The final outcome indicates that 23 of the 25 texture images were classified correctly with the proposed classification framework; the classifications of 'Bark3' and 'Glass2' failed. For retrieval, each of the 75 images used to create the texton dictionaries was treated as a single testing image. The final result shows that all 75 images were retrieved successfully. Fig. 6 shows an example of a correct classification, in which the testing texture image belongs to the category 'Bark1' containing textures number 1, 2 and 3. However, in Fig. 7 and Fig. 8, unsuccessful classifications can be observed. The testing texture image belonging to 'Glass2' was classified into the 'Bark2' category and, conversely, the texture image belonging to 'Bark2' was classified into the 'Glass2' category.

Fig. 5. Texture A and B are similar to textures number 6 and 16 in the training set, respectively.

Fig. 6. Three image samples of category ‘Bark1’ and the testing texture that belongs to the same category.

Fig. 7. Three image samples of category ‘Bark3’ and the testing texture that belongs to the ‘Glass2’ category.

Fig. 8. Three image samples of category ‘Glass2’ and the testing texture that belongs to the ‘Bark2’ category.

Our observation is that when using the SIFT algorithm on texture images, we obtained hundreds or even thousands of feature vectors for the testing texture images that were classified correctly. Hence, hundreds of textons were created in each of those texton dictionaries, confirming that they contain enough key features. That is the reason why all testing images for those categories were classified correctly. In contrast, the testing texture images for 'Bark2' and 'Glass2' generated only tens of features; therefore, their texton dictionaries contain few textons, i.e. not enough key features, causing the wrong classifications.

TABLE 1. EXPERIMENTAL RESULTS OF TEXTURE CLASSIFICATION AND RETRIEVAL

Database          Images                                              Classification   Retrieval
Our database      27 images for texton dictionary, 10 for testing     100%             100%
Ponce database    75 images for texton dictionary, 25 for testing     92%              100%

Table 1 summarizes the experiments. It indicates the potential of using SIFT features in the proposed classification framework for texture classification. The accuracies are relatively high with two different data sets for both classification and retrieval. A better understanding of the capabilities and limitations of the framework would naturally require more comprehensive testing.

VI. CONCLUSION AND FUTURE WORK

A method for texture classification and retrieval using SIFT local features has been proposed in this paper. The method includes texton dictionary creation and a classification framework. The proposed method was applied to two data sets, showing classification accuracies that indicate the potential of exploiting local features in the challenging task of texture classification.

In future work, we intend to supplement the proposed framework with the pseudo two-dimensional Hilbert-Huang transform (PHHT) for texture images. With this technique, a number of scale components representing different frequency band components are generated. The scale invariant feature transform (SIFT) is then applied to the scale component images to extract local features.

REFERENCES

[1] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” IJCV, vol. 60, no. 2, pp. 91-110, 2004.

[2] D. Charalampidis, “Steerable Weighted Median Filters,” IEEE Transactions on Image Processing, vol. 19, no. 4, pp. 882-894, 2010.

[3] M. Muhlich, D. Friedrich and T. Aach, “Design and Implementation of Multisteerable Matched Filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 279-291, 2012.

[4] S. Lazebnik, C. Schmid, and J. Ponce, “Sparse Texture Representation Using Affine-Invariant Neighborhoods,” Proc. Conf. Computer Vision and Pattern Recognition, pp. 319-324, 2003.

[5] J. Krizaj, V. Struc, and N. Pavesic, “Adaptation of SIFT Features for Robust Face Recognition,” Proceedings of ICIAR, vol. 6111, pp. 394-404, 2010.

[6] K.E.A. van de Sande, T. Gevers, and C.G.M. Snoek, “Evaluating Color Descriptors for Object and Scene Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1582-1596, 2010.

[7] Y. Liang, L. Liu, Y. Xu, Y. Xiang and B. Zou, “Multi-task GLOH feature selection for human age estimation,” IEEE International Conference on Image Processing (ICIP), pp. 565-568, 2011.

[8] X. Zhang, N. Fan, X. Chen, L. Ran and J. Niu, “Feature extraction of three dimensional (3D) facial landmarks using Spin Image,” IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 537-541, 2010.

[9] K. Mikolajczyk and C. Schmid, “A Performance Evaluation of Local Descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1629, 2005.

[10] J. Zhang , S. Lazebnik and C. Schmid, “Local features and kernels for classification of texture and object categories: a comprehensive study,” International Journal of Computer Vision, vol. 73, no. 2, pp. 213-238, 2007.

[11] M. Varma and A. Zisserman, “A Statistical Approach to Texture Classification from Single Images,” International Journal of Computer Vision, vol. 62, no. 1-2, pp. 61-81, 2005.

[12] B. Georgescu, I. Shimshoni, and P. Meer, “Mean Shift Based Clustering in High Dimensions: A Texture Classification Example,” Proc. 9th IEEE International Conference on Computer Vision (ICCV), vol. 1, pp. 456-463, 2003.
