Human Head-Shoulder Segmentation

Hai Xin, Haizhou Ai

Computer Science and Technology

Tsinghua University

Beijing, China

[email protected]

Hui Chao, Daniel Tretter

Hewlett-Packard Labs

1501 Page Mill Rd. Palo Alto

CA, USA

{Hui.chao, dan.tretter}@hp.com

Abstract—In this paper, we present an automatic head-shoulder segmentation method for human photos, based on graph cut with a shape sketch constraint and border detection through learning. We propose a new shape constraint method based upon graph cut for head-shoulder photos. First, a watershed algorithm is used to over-segment the photo into superpixels; next, an iterative shape-mask-guided graph cut algorithm with a sketch constraint is applied to the superpixel-level graph to obtain a border that separates the head-shoulder region from its background; finally, a border detector trained by AdaBoost is used to refine the border. Experiments on consumer photos demonstrate the effectiveness of the method.

Keywords: Human Segmentation, Shape Sketch, Graph Cut, AdaBoost, Border Detection

I. INTRODUCTION

Automatic and efficient human head-shoulder segmentation is of great practical importance in human photo analysis and editing. It is an important part of face contextual region analysis for human recognition and tracking. General contextual information, such as clothing and hair style, is very useful for identifying people, especially when the facial features alone do not provide sufficient information. Gallagher et al. [1] have demonstrated that clothing information combined with time stamps can be used effectively to recognize people in family photo collections. For editing, segmented head-shoulder images are often used in new image and video compositions by replacing or modifying the foreground or background. In this paper, we present a non-interactive human face contextual region segmentation method based on graph cut with a shape sketch constraint and boundary detection through learning.

Object segmentation is one of the basic problems in image processing and computer vision, and segmentation algorithms have been studied extensively in the literature. Among them, GrabCut [2], built on graph cut [3], has proved to be a powerful interactive method that requires much less human interaction. After the user roughly marks the object location, it runs the graph cut algorithm iteratively to obtain a more precise segmentation. Graph cut in its original form builds a graph at the pixel level, in which each pixel is a vertex and every two neighboring pixels are joined by an edge. Each pixel has a foreground probability, called the data cost, and each edge has an edge cost based on the brightness and distance of the two pixels. Graph cut translates the segmentation problem into a min-cut problem, which can be solved by network flow algorithms.

Learning-based border and edge detection using low-level features has been shown to be highly adaptive and effective [4, 5, 6, 7]. Shahrokni et al. [4] trained a detector with AdaBoost for texture boundary detection. Their results showed accurate object borders, identified as texture transitions, in complex scenes.

Our basic idea in developing a robust algorithm for automatic, non-interactive head-shoulder region segmentation is to incorporate domain knowledge that is specific to consumer face photos rather than to general objects. Starting with an extended face area as the region of interest, we first apply watershed segmentation to convert this region into a superpixel image, which speeds up graph cut; graph cut with a head-shoulder constraint is then applied to the superpixel-level graph, followed by special hair region processing, to obtain a border that separates the head-shoulder region from its background. For boundary refinement, we explore border detection through machine learning, as in [4]. A border detector is trained using AdaBoost, where ribbon patches along the segmented border in the normal direction are used as positive samples and patches away from the border as negative samples. We use this detector to check ribbon patches in the normal direction of the border produced by graph cut in order to refine the boundary.

The rest of the paper is organized as follows. Section II reviews related work. Section III introduces over-segmentation and the superpixel graph. Section IV describes the shape sketch constraint. Section V presents segmentation with an iterative mask. Section VI covers border patches and border detector training. Section VII reports the experimental results, and Section VIII concludes the paper.

II. RELATED WORK

Graph cut was originally developed for interactive image segmentation by Boykov and Jolly [3]. The approach segments an image by labeling each pixel as 0 (background) or 1 (foreground). For a given image, it builds a graph G=(V,E), where V is a set of vertices and E is a set of edges. Each vertex vi represents a pixel in the original image, and edges connect adjacent pixels, with weights based on brightness, color, texture, spatial distance, or a combination of them. Moreover, the graph has two special terminal vertices, called source and sink, which represent foreground and background. The algorithm tries to minimize the global cost energy E:

Figure 1. Head-shoulder part

$$E(A) = \sum_{p \in P} R_p(A_p) + \sum_{(p,q) \in N,\ A_p \neq A_q} B_{p,q} \tag{1}$$

where

$$B_{p,q} \propto \exp\left(-\frac{\Delta feature^2}{2\sigma^2}\right) \cdot \frac{1}{dist(p,q)} \tag{2}$$

where P is the set of pixels and N is the set of neighboring pixel pairs. A is the segmentation labeling: A(p)=0 if p is background, and A(p)=1 if p is foreground. Rp(Ap) is the data term derived from the prior likelihood of assigning p to the foreground, which comes from interactive marking in [3]. Bp,q is the image connection cost, which reflects how likely p and q are to be assigned to the same class. The energy is therefore high when p and q have similar features but different labels. The problem can then be solved in polynomial time with a graph min-cut algorithm.
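To make the link between the energy in (1) and a minimum cut concrete, the following is a minimal sketch on a toy 1-D "image" of five pixels. It assumes that Rp(Ap) is given as a cost (lower is better) for each label and uses a plain Edmonds-Karp max-flow; the function names and the toy costs are hypothetical and are not taken from [3].

```python
from collections import deque
from itertools import product

def energy(labels, data_cost, edges):
    """E(A): data costs plus B_pq for every neighbor pair with different labels."""
    e = sum(data_cost[p][a] for p, a in enumerate(labels))
    return e + sum(w for (p, q), w in edges.items() if labels[p] != labels[q])

def min_cut_labels(data_cost, edges):
    """Minimize the energy exactly via a source/sink minimum cut."""
    n = len(data_cost)
    s, t = n, n + 1                      # source = foreground, sink = background
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for p in range(n):
        cap[s][p] = data_cost[p][0]      # cut if p ends up background (label 0)
        cap[p][t] = data_cost[p][1]      # cut if p ends up foreground (label 1)
    for (p, q), w in edges.items():
        cap[p][q] += w
        cap[q][p] += w
    while True:
        parent = {s: None}               # breadth-first search in the residual graph
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in range(n + 2):
                if v not in parent and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break                        # no augmenting path left: flow is maximal
        flow, v = float("inf"), t
        while parent[v] is not None:     # bottleneck capacity along the path
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:     # push the flow along the path
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u
    # Vertices still reachable from the source are labeled foreground.
    return [1 if p in parent else 0 for p in range(n)]

# Toy costs (assumed): (cost of label 0, cost of label 1) per pixel.
data_cost = [(9, 1), (8, 2), (4, 5), (2, 8), (1, 9)]
edges = {(0, 1): 3, (1, 2): 3, (2, 3): 1, (3, 4): 3}
labels = min_cut_labels(data_cost, edges)
brute = min(product([0, 1], repeat=5), key=lambda a: energy(a, data_cost, edges))
print(labels, energy(labels, data_cost, edges), energy(brute, data_cost, edges))
```

On this toy input the cut falls on the weakest edge, between pixels 2 and 3, and the min-cut labeling reaches the same energy as the exhaustive search over all 32 labelings.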

Shape priors are a useful tool for segmentation. Borenstein and Malik [15] use a top-down shape template to guide the merging of low-level pixels. Some researchers use level-set active contour algorithms along with principal component analysis (PCA) [11, 12] to segment with shape priors. Freedman and Zhang [8, 9] introduced a level-set active contour template into traditional graph cut to incorporate shape priors. Their method merges the shape prior information with the original image into a new edge weight: they add a term φ[mid(p, q)] to Bp,q in (1), where φ is the shape prior function, evaluated at the midpoint of p and q, which are adjacent pixels with different labels. The energy is minimal when the midpoint of p and q lies on the shape prior, and it increases as the midpoint moves away from the shape prior. Moreover, they apply the Procrustes method [9, 10] for template rotation and a Gaussian pyramid for scale. Chang, Yang and Parvin [14] improved Freedman and Zhang's method by designing a Bayesian approach to handle scaling, translation and rotation of the shape template.

Figure 2. Over-segmentation into superpixels.

Border detection is another approach to segmentation. Some researchers use various features to train detectors on border patches [5, 6, 13]. Dollar, Tu and Belongie [5] use a large number of generic features across different scales to train a boosting tree for detection. Shahrokni et al. [4] use some simple features on ribbon patches.

III. OVER SEGMENTATION AND SUPERPIXEL GRAPH

Our approach starts with face localization. For a given frontal human photo, we use a face detector to find the position and size of the human face. Centered on the face, we extend the face area to three times the size of the face and normalize it to a 400×400 image. We use this region as the head-shoulder part of a human image. An example is shown in Fig. 1.
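The region extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the face detector is taken to return a center and a square size, "three times the size" is read as three times the side length, and the function name and return format are hypothetical. Handling of regions that extend past the image border is left out.

```python
def head_shoulder_roi(face_cx, face_cy, face_size, scale=3.0, out_size=400):
    """Return the square region centered on the face and the zoom factor
    that maps it onto an out_size x out_size normalized image."""
    side = scale * face_size                 # assumed: 3x the face side length
    left = face_cx - side / 2.0
    top = face_cy - side / 2.0
    zoom = out_size / side                   # resize factor for normalization
    return (left, top, left + side, top + side), zoom

box, zoom = head_shoulder_roi(face_cx=200, face_cy=150, face_size=100)
print(box, zoom)
```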

The normalized photo is 400×400, that is, 160000 pixels, which makes running graph cut directly on a pixel-level graph too slow. Thus, a watershed segmentation algorithm is applied first to split the photo into superpixels. Examples are shown in Fig. 2, where about 2700 superpixels are obtained. Working at the superpixel level greatly improves the speed because the graph has far fewer vertices, while the border information is still preserved in the superpixel graph.

An image graph is then constructed based on superpixels. Each superpixel is represented by the mean RGB color and the histogram of its pixels. The edge weight of two connected or close superpixels is calculated by the histogram distance and the boundary length. The formula of edge weight of two superpixels A and B is as follows:

$$D_{AB} = \left( |p(A) - p(B)| + \sum_i \frac{(A_i - B_i)^2}{A_i + B_i} \right) \cdot \frac{t_{AB}(A) + t_{AB}(B)}{2} \tag{3}$$

where $\sum_i \frac{(A_i - B_i)^2}{A_i + B_i}$ is the Chi-Square distance between the histograms of A and B, with Ai and Bi denoting their i-th bins; p(A) is the most frequent histogram color of A; tAB(A) is the ratio of the length of the boundary shared by A and B to the whole border of A; and, similarly, tAB(B) is the ratio of the length of the shared boundary to the whole border of B.
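As a small illustration of the Chi-Square histogram distance used in the edge weight, the sketch below computes it for two histograms given as equal-length lists of bin values. The function name is hypothetical, and skipping bins that are empty in both histograms is an assumption made here to avoid division by zero.

```python
def chi_square_distance(hist_a, hist_b):
    """Sum over bins of (A_i - B_i)^2 / (A_i + B_i)."""
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        if a + b > 0:                    # assumed: bins empty in both are skipped
            total += (a - b) ** 2 / (a + b)
    return total

print(chi_square_distance([0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))   # identical histograms
print(chi_square_distance([1.0, 0.0], [0.0, 1.0]))             # disjoint histograms
```

For normalized histograms the distance ranges from 0 (identical) to 2 (no overlap).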

Figure 3. Head shape sketch

IV. SHAPE SKETCH CONSTRAINT

In this section, we introduce an energy term that represents the difference between the segmentation result and a prior shape model. The term depends on the shape of the model but not on its scale.

A. Head-Shoulder Shape

Human head-shoulder parts all have a similar Omega-like shape, as shown in Fig. 3. Since the face detector guarantees that the head is at the center of the normalized image, we compute the head shape sketch as the statistical average position. The shoulder part, however, has various shapes and changes a lot between photos. Thus, we only constrain the head border with the shape sketch. The goal is to limit the resulting shape to a plausible head shape while allowing various scales.

B. Sketch Energy Function

The shape sketch constraint is high-level knowledge for object segmentation, whereas lower-level superpixels do not carry this information. We therefore provide the following method to add the shape information to the graph cut framework. We keep the data energy Rp(Ap) the same as in (1) and add a sketch energy term Sp,q to the global energy equation:

$$E(A) = \sum_{p \in P} R_p(A_p) + \sum_{(p,q) \in N,\ A_p \neq A_q} B_{p,q} + \sum_{(p,q) \in S,\ A_p \neq A_q} S_{p,q} \tag{4}$$

As in traditional graph cut, Bp,q is the pixel connection energy, where the feature can be brightness, color, texture, etc. Since we work on a superpixel-level graph, Bp,q is the histogram distance in (3). Sp,q is the sketch energy and S is the sketch neighborhood set, both defined as follows.

In Fig. 4(a) and 4(b), the green line l0 is the average position of the head shape sketch synthesized from ground truth. Each pair of contiguous superpixels on the green line is added to the neighborhood set S and assigned a shape energy El. As in Fig. 4(c), superpixels A and B are neighbors on the shape sketch line l, so we put the pair (A, B) into the set S and add the energy El(A, B) to Sp,q:

$$S_{p,q} \leftarrow S_{p,q} + E_l(p,q) \tag{5}$$

Figure 4. Sketch energy: (a) Average sketch position. (b, c) Superpixels on sketch line.

If a segmentation result crosses the shape sketch, it necessarily separates some pair (P, Q) in the set S. The energy El(P, Q) is then added to E, which increases the global energy. Therefore, the globally optimal result will avoid crossing this green shape sketch.

To force the segmentation result to follow the shape sketch, we add more sketch lines besides the green one. As shown in Fig. 5(a), we resize the sketch and add all the resulting shape lines to the model, repeating the above method. For each resized sketch line l, we add the energy El(A, B) to each adjacent neighbor pair. The sketch energy S therefore becomes:

$$S_{p,q} = \sum_{l \in L} E_l(p,q) \tag{6}$$

$$E_l(p,q) \propto dist(l, l_0) \tag{7}$$

where L is the set of resized sketch lines. El grows as l moves farther away from l0.

This energy penalizes segmentation results that cross the shape sketch, but the penalty is smaller for sketch lines near the average position. The farther the result is from the original shape, the more shape lines the segmentation border crosses, which produces a large energy. The energy graph is shown in Fig. 5(b), where darker regions correspond to a larger cost.
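The behavior of the sketch energy can be illustrated with a simplified toy model. The sketch below is not the superpixel implementation: it assumes each resized sketch line is identified by its scale factor relative to l0 (scale 1.0), takes dist(l, l0) to be the absolute scale difference, and describes a border by its scale at successive positions along the contour. All names are hypothetical.

```python
def sketch_energy(border_scales, line_scales, l0_scale=1.0):
    """Sum E_l over every sketch line crossed between consecutive border samples,
    with E_l taken as |scale(l) - scale(l0)| (an assumed distance)."""
    energy = 0.0
    for r1, r2 in zip(border_scales, border_scales[1:]):
        lo, hi = min(r1, r2), max(r1, r2)
        for s in line_scales:
            if lo < s < hi:                  # border crosses sketch line s here
                energy += abs(s - l0_scale)
    return energy

lines = [0.8, 0.9, 1.0, 1.1, 1.2]
print(sketch_energy([1.05, 1.05, 1.05], lines))   # follows one scale: no crossing
print(sketch_energy([0.85, 1.15], lines))         # crosses three lines
print(sketch_energy([0.75, 1.25], lines))         # crosses all five lines
```

A border that stays at one scale crosses no sketch line and pays nothing, while a border that swings across many scales accumulates a cost that grows with the distance of the crossed lines from l0.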

Figure 5. Sketch model. (a) Resized sketch lines. (b) sketch line energy.

Figure 6. Shape Masks

V. ITERATIVE MASK SEGMENTATION

In a human photo, the head can be located accurately by a face detector, although its size remains uncertain. The shoulder and body parts, however, vary widely in shape. We use the K-means algorithm to cluster all ground truth shapes. Experiments show that dividing them into three classes works best, as shown in Fig. 6. We then use the following algorithm to find the segmentation result by iteratively improving the shape mask.

A. Shape mask guided graph cut

Given a shape mask synthesized from ground truth, as in Fig. 6(b), each superpixel has a probability, p, to belong to foreground or background. By choosing a probability threshold τ, we can divide all superpixels into three classes: foreground (where p > τ), background (p < 1 − τ), or unknown (otherwise). In our implementation, we use τ = 1%.

Based on the mask, we assign each superpixel to foreground, background or unknown and build Gaussian Mixture Models (GMMs) for foreground and background respectively. We then compute the probabilities of each unknown superpixel X under the foreground and background models.

In (4), the data energy is expressed by Rp(Ap), which is based on the probability that p is in the foreground. We merge the shape mask with the GMMs to obtain the following formulas:

$$d(X \in fore) = \frac{GMM_{fore}(X)\, P(X \in fore)}{GMM_{fore}(X)\, P(X \in fore) + GMM_{back}(X)\, P(X \in back)} \tag{8}$$

$$d(X \in back) = \frac{GMM_{back}(X)\, P(X \in back)}{GMM_{fore}(X)\, P(X \in fore) + GMM_{back}(X)\, P(X \in back)} \tag{9}$$

where GMMfore(X) is the probability of X under the foreground GMM and GMMback(X) is the probability of X under the background GMM. P(X∈fore) is the probability that X is in the foreground of the shape mask and P(X∈back) is the probability that X is in the background of the shape mask.
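The combination in (8) and (9) amounts to normalizing the product of the color likelihood and the mask prior. The sketch below illustrates this for a single superpixel; it assumes P(X∈back) = 1 − P(X∈fore) and falls back to equal values when both products are zero, neither of which is stated in the text. The function name is hypothetical.

```python
def data_terms(gmm_fore, gmm_back, mask_fore):
    """Return (d(X in fore), d(X in back)) from GMM likelihoods and the mask prior."""
    mask_back = 1.0 - mask_fore              # assumed complement of the mask prior
    num_fore = gmm_fore * mask_fore
    num_back = gmm_back * mask_back
    total = num_fore + num_back
    if total == 0.0:
        return 0.5, 0.5                      # assumed fallback: no evidence either way
    return num_fore / total, num_back / total

d_fore, d_back = data_terms(gmm_fore=0.6, gmm_back=0.2, mask_fore=0.5)
print(d_fore, d_back)
```

With an uninformative mask prior of 0.5 the result reduces to the normalized GMM likelihoods, and a confident mask value pulls the data term toward the mask even when the colors are ambiguous.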

B. Iterative mask update

Human shoulder and body parts differ a lot from each other, and even the clustered average shape model is not accurate because of the large data space. To match photos with various shoulder shapes, we update the shape mask through iterative graph cut.

Figure 7. Iterative shape mask. (a) Result from first time segmentation. (b) Generated shape mask. (c) Final result.

First, we merge similar superpixels and choose the closest shape mask in Fig. 6. Based on the mask, we apply graph cut to (4) and obtain a segmentation result. We then blur the border of the result to create a new shape mask and segment again. The iteration continues until the result is stable. As shown in Fig. 7, the iterative shape improvement brings the segmentation result closer to the ground truth step by step.

The whole segmentation algorithm is as follows:

1) Detect the head-shoulder part and normalize the photo size.

2) Apply the watershed algorithm to segment the photo into superpixels and find the closest mask.

3) Apply graph cut to the image based on the sketch model and shape mask, according to (4).

4) Regenerate the shape mask and sketch, and repeat step 3 until the result is stable.

The algorithm converges to the final result quickly; in practice it iterates 2 to 4 times.
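The control flow of steps 3 and 4 can be sketched as a fixed-point loop. The code below only illustrates the iteration structure: the segmentation and mask regeneration are passed in as functions, and the toy stand-ins (a thresholded 1-D signal and a box blur) are hypothetical placeholders for the graph cut of (4) and the border blurring described above.

```python
def iterate_until_stable(segment, make_mask, initial_mask, max_iters=10):
    """Alternate segmentation and mask regeneration until the labels repeat."""
    mask, previous = initial_mask, None
    for iteration in range(1, max_iters + 1):
        result = segment(mask)
        if result == previous:               # stable: same labels as the last round
            return result, iteration
        previous, mask = result, make_mask(result)
    return previous, max_iters

# Toy stand-ins (assumed): per-pixel foreground evidence on a 1-D "image".
evidence = [0.1, 0.2, 0.9, 0.9, 0.8, 0.9, 0.2, 0.1]

def toy_segment(mask):
    # Stand-in for graph cut: foreground where evidence times mask is large.
    return [1 if e * m > 0.25 else 0 for e, m in zip(evidence, mask)]

def toy_make_mask(labels):
    # Stand-in for blurring the result border: 3-tap box blur of the labels.
    n = len(labels)
    blurred = []
    for i in range(n):
        window = labels[max(0, i - 1):min(n, i + 2)]
        blurred.append(sum(window) / len(window))
    return blurred

labels, rounds = iterate_until_stable(toy_segment, toy_make_mask, [0.5] * 8)
print(labels, rounds)
```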

VI. BORDER DETECTION

To refine the segmentation border further, we explore the border detection method proposed in [4]. We extract image patches along the normal direction of the borders in the ground truth, use AdaBoost to train a head-shoulder border detector, and use this detector to improve the boundary of the segmentation result.

A. Patch Training

We collect image patches on the border of the ground truth. Each patch is a 64×16 rectangle along the normal direction of the border. Since the direction is not precise, we apply several small angular distortions to each direction and obtain several patches. We scale the patches down to 32×8 and use them as positive samples. The white rectangles in Fig. 9(a) show the positive samples. The negative samples come from ribbon patches obtained by sliding away from each positive sample along the normal direction of the border, as shown by the yellow and green rectangles in Fig. 9(b). Similarly, we extract 64×16 rectangular images, scale them down to 32×8, and use these as negative samples. The samples are shown in Fig. 9(c) and (d).

Figure 8. Patch feature

Figure 9. Patch samples

For a 32×8 ribbon patch, the differences between the left and right average RGB colors at different window sizes and locations are used as weak features. As shown in Fig. 8, the patch is divided into 32 vertical bars. Let the average RGB values of the bars on the left be l1, l2,…, l16 and those on the right be r1, r2,…, r16. The following features are computed:

$$f_{a,b,c,d} = \frac{\sum_{i=a}^{b} l_i}{b - a + 1} - \frac{\sum_{k=c}^{d} r_k}{d - c + 1}, \qquad 1 \le a \le b \le 16,\ 1 \le c \le d \le 16 \tag{10}$$
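A window feature of this kind can be sketched as the difference between the mean of a window of left bars and the mean of a window of right bars. The code below is an illustration only: it treats each bar value as a scalar (for example one color channel), uses 1-based inclusive window indices, and the function name is hypothetical.

```python
def window_feature(left, right, a, b, c, d):
    """Mean of left bars a..b minus mean of right bars c..d (1-based, inclusive)."""
    mean_left = sum(left[a - 1:b]) / (b - a + 1)
    mean_right = sum(right[c - 1:d]) / (d - c + 1)
    return mean_left - mean_right

# Toy patch (assumed values): bright left half, dark right half.
left_bars = [200.0] * 16
right_bars = [50.0] * 16
print(window_feature(left_bars, right_bars, 1, 16, 1, 16))
print(window_feature(left_bars, right_bars, 13, 16, 1, 4))
```

Enumerating all valid (a, b, c, d) combinations yields the pool of weak features from which boosting can select.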

We collect about 3,000 positive samples and 74,000 negative ones, and use the AdaBoost algorithm to train a detector. On the training set, the detector has a 99% positive pass rate and a 3.7% negative pass rate.
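For readers unfamiliar with boosting, the following is a minimal sketch of discrete AdaBoost with threshold stumps as weak classifiers. It is a generic textbook version on toy one-dimensional features, not the detector trained here; the sample values, function names and number of rounds are all assumptions.

```python
import math

def stump(x, feature, threshold, polarity):
    """Weak classifier: +polarity if the feature reaches the threshold, else -polarity."""
    return polarity if x[feature] >= threshold else -polarity

def train_adaboost(samples, labels, rounds=5):
    """Return a list of (alpha, feature, threshold, polarity) weak classifiers."""
    n = len(samples)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for f in range(len(samples[0])):                 # exhaustive stump search
            for thr in sorted({x[f] for x in samples}):
                for pol in (1, -1):
                    err = sum(w for x, y, w in zip(samples, labels, weights)
                              if stump(x, f, thr, pol) != y)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol)
        err, f, thr, pol = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)          # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, f, thr, pol))
        # Re-weight: misclassified samples gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * stump(x, f, thr, pol))
                   for x, y, w in zip(samples, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    score = sum(alpha * stump(x, f, thr, pol) for alpha, f, thr, pol in model)
    return 1 if score >= 0 else -1

# Toy data (assumed): border patches show a large left/right color difference.
samples = [[0.9], [0.8], [0.7], [0.2], [0.1], [0.3]]
labels = [1, 1, 1, -1, -1, -1]
model = train_adaboost(samples, labels)
print([predict(model, x) for x in samples])
```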

B. Refine Border by Detection

For the segmentation result, we check the border along the normal direction with the border detector. As in training, we extract 64×16 rectangular patches with several distortions and scale them down to 32×8 for the detector. The detector finds several possible border points, illustrated by the red points in Fig. 10(b), and the confidence-weighted average position is then chosen as the new border. The refined border, shown as the white outline in Fig. 10(b), is better than the previous segmentation result in Fig. 10(a).

VII. EXPERIMENTS

Experiments were carried out on 1000 frontal human photos with various hair styles and different skin colors. The photos are from our own photo collections. The segmentation result is compared with the ground truth, and the overlap ratio is used as the evaluation criterion:

$$overlap = \frac{|Ground \cap Segment|}{|Ground \cup Segment|} \tag{11}$$

where Ground is the ground truth region of the image, and Segment is the result of our algorithm.

Figure 10. Refine border: (a) previous; (b) refined.

TABLE I. EXPERIMENT RESULTS

Overlap Ratio               95-100%   90-95%   80-90%   <80%
Our approach                45%       29%      24%      2%
Without border refine       42%       30%      26%      2%
Without sketch constraint   39%       32%      27%      2%
GrabCut [2]                 15%       24%      26%      35%
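The overlap ratio in (11) is the intersection of the two regions divided by their union. A minimal sketch, assuming each region is given as a collection of pixel coordinates and treating two empty regions as a perfect match (an assumption), is:

```python
def overlap_ratio(ground, segment):
    """Intersection over union of two pixel sets."""
    ground, segment = set(ground), set(segment)
    union = ground | segment
    if not union:
        return 1.0                           # assumed: two empty regions match
    return len(ground & segment) / len(union)

ground = {(0, 0), (0, 1), (1, 0), (1, 1)}
segment = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(overlap_ratio(ground, segment))
```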

Experiments show that 74% of the photos have an overlap of more than 90%. We compare our results with the GrabCut [2] algorithm, for which we manually performed interactive segmentation on 100 arbitrarily chosen photos from the same dataset. Details are shown in Table I. Some examples are shown in Fig. 11.

Thanks to the shape priors, our algorithm correctly produces a rough Omega-like head-shoulder border on complex or confusing backgrounds, where general GrabCut sometimes fails. With the help of the sketch constraint, our algorithm fixes the head part in most cases. Moreover, the border refinement step further improves the result locally in places where the shape mask disturbs the local superpixel feature information in the global energy. However, the algorithm fails when the ground truth is far from the trained shape masks.

VIII. CONCLUSION

In this paper, we propose an iterative segmentation algorithm with a shape prior constraint and apply it to human head-shoulder photos. A graph cut algorithm with a shape sketch constraint and mask guidance is applied to superpixels generated by a watershed algorithm. An AdaBoost-trained border detector is then used to refine the segmentation. Experimental results on human photos of different styles demonstrate its effectiveness. The new method improves the precision of graph cut, and it can be extended to the segmentation of other objects with shape priors.

ACKNOWLEDGMENT

This work is supported by a grant from Hewlett-Packard Company, and it is also supported in part by National Science Foundation of China under grant No.61075026.

REFERENCES

[1] Andrew C. Gallagher, Tsuhan Chen, Clothing Segmentation for Recognizing People, IEEE Conference, Computer Vision and Pattern Recognition, 2008.

[2] Carsten Rother, Vladimir Kolmogorov and Andrew Blake, GrabCut - Interactive Foreground Extraction using Iterated Graph Cuts, ACM SIGGRAPH 2004.

[3] Y. Boykov, M. Jolly. Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images. International Conference on Computer Vision, 2001.

[4] A. Shahrokni, T. Drummond, F. Fleuret, and P. Fua, Classification-Based Probabilistic Modeling of Texture Transition for Fast Line Search Tracking and Delineation, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 3, March 2009.

[5] P. Dollar, Z. Tu, and S. Belongie, Supervised Learning of Edges and Object Boundaries, Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, June 2006.

[6] D. Martin, C. Fowlkes, and J. Malik, Learning to Detect Natural Image Boundaries Using Local Brightness, Color and Texture Cues, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 5, May 2004.

[7] E. Borenstein and S. Ullman, Class-Specific, Top-Down Segmentation, European Conference on Computer Vision, 2002

[8] D. Freedman and T. Zhang, Interactive graph cut based segmentation with shape priors, IEEE Conference on Computer Vision and Pattern Recognition, 2005

[9] T. Zhang and D. Freedman. Tracking objects using density matching and shape priors. IEEE International Conference on Computer Vision, 2003.

[10] F. L. Bookstein, Landmark methods for forms without landmarks: localizing group differences in outline shape. Medical Image Analysis, 1996.

[11] S. Pizer, G. Gerig, S. Joshi, and S. Aylward. Multiscale medial shape-based analysis of image objects. Proceedings of the IEEE, 91(10):1670–1679, 2003.

[12] A. Tsai, W. Wells, C. Tempany, E. Grimson, and A. Willsky. Mutual information in coupled multi-shape model for medical image segmentation. Medical Image Analysis 8 (2004) 429–445.

[13] A. Shahrokni, T. Drummond, P. Fua. Texture Boundary Detection for Real-Time Tracking. European Conference on Computer Vision, 2004.

[14] H. Chang, Q. Yang and B. Parvin. A Bayesian Approach for Image Segmentation with Shape Priors. IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[15] E. Borenstein and J. Malik. Shape Guided Object Segmentation, Computer Vision and Pattern Recognition, 2006.

Figure 11. Head-shoulder segmentation results. (a, d) Results from GrabCut [2] algorithm. (b, e) Shape mask guided graph cut algorithm without sketch constraint. (c, f) Our approach results, with both shape mask and sketch constraint.
