
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). SIGGRAPH 2015 Talks, August 09 – 13, 2015, Los Angeles, CA. ACM 978-1-4503-3636-9/15/08. http://dx.doi.org/10.1145/2775280.2792540

Blendshapes from Commodity RGB-D Sensors

Dan Casas1, Oleg Alexander∗1, Andrew W. Feng†1, Graham Fyffe‡1, Ryosuke Ichikari1, Paul Debevec§1, Rhuizhe Wang¶2, Evan Suma‖1, and Ari Shapiro∗∗1

1Institute for Creative Technologies, University of Southern California    2University of Southern California

CR Categories: I.3.6 [Computer Graphics]: Methodology and Techniques—Interaction Techniques; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Animation;

Keywords: face animation, RGB-D, depth sensors, blendshapes

1 Exposition

Creating and animating a realistic 3D human face is an important task in computer graphics. The capability to capture the 3D face of a human subject and reanimate it quickly has many applications in games, training simulations, and interactive 3D graphics. We demonstrate a system that captures photorealistic 3D faces and generates blendshape models automatically using only a single commodity RGB-D sensor. Our method can rapidly generate a set of expressive facial poses from a single depth sensor, such as a Microsoft Kinect version 1, and requires no artistic expertise to process those scans. The system takes only a matter of seconds to capture and produce a 3D facial pose and only a few minutes of processing time to transform it into a blendshape-compatible model. Our main contributions are an end-to-end pipeline for capturing and generating face blendshape models automatically, and a registration method that solves dense correspondences between two face scans using facial landmark detection and optical flow. We demonstrate the effectiveness of the proposed method by capturing different human subjects and puppeteering their 3D faces in an animation system with real-time facial performance retargeting.
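As background for the real-time retargeting step, the following is a minimal sketch of the standard delta-blendshape evaluation that such a system typically performs on the captured shapes; the NumPy arrays, function name, and toy data are illustrative assumptions rather than the paper's actual runtime.

```python
import numpy as np

def evaluate_blendshapes(neutral, shapes, weights):
    """Evaluate a delta blendshape rig.

    neutral : (V, 3) neutral-pose vertex positions.
    shapes  : (K, V, 3) expression shapes registered to the same topology
              (e.g. the scans produced by the capture pipeline).
    weights : (K,) per-expression weights, e.g. streamed from a face tracker.

    Returns (V, 3) deformed vertices:  v = n + sum_k w_k * (b_k - n)
    """
    deltas = shapes - neutral[None, :, :]               # per-shape offsets
    return neutral + np.tensordot(weights, deltas, axes=1)

# Toy usage: 4 vertices, 2 expression shapes, a half-activated first shape.
neutral = np.zeros((4, 3))
shapes = np.random.rand(2, 4, 3)
deformed = evaluate_blendshapes(neutral, shapes, np.array([0.5, 0.0]))
print(deformed.shape)  # (4, 3)
```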

Since our method is low cost, fast, and requires little or no expertise on the part of the operator, it has the potential to expand the number of photoreal, individualized faces available for simulation. Our method is data agnostic and can utilize scan input from depth sensors; as the scan quality of such components improves, the quality of the resulting blendshapes will also improve. For example, we demonstrate photorealistic results suitable for close-up encounters when using the Occipital Structure Sensor instead of the Kinect v1.

A key to our method is the proper alignment of both texture and geometry. Rather than storing our scans as geometry and textures, we choose instead to store our scans as images.


Figure 1: Blendshape results using our method with data from a commodity RGB-D sensor (Occipital Structure Sensor).

Figure 2: Real-time puppeteering of a digital face. Our shapes preserve the photorealistic facial quality of the original scans.

Each one of our scans is stored as a 32-bit float EXR texture map image and a high-resolution point cloud. Our texture and geometry alignment algorithm has three steps and runs in 2D. First, we construct a Delaunay triangulation between the user-supplied points and apply affine triangles to roughly pre-warp the source diffuse texture to the target. Second, we use GPU-accelerated optical flow to compute a dense warp field from the pre-warped source diffuse texture to the target. Finally, we apply the dense warp to each one of our source texture maps, including diffuse, specular, specular normals, and the point cloud. The result is the source scan warped into the target UV space. The sub-millimeter correspondence is able to align individual pores across the majority of the face.
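The 2D alignment described above can be prototyped with off-the-shelf tools. The sketch below uses OpenCV and scikit-image with synthetic stand-in data: the landmark coordinates, file names, and the Farnebäck flow are illustrative assumptions, not the user-supplied points or GPU-accelerated flow of the actual pipeline, and the flow is computed with the target as the reference frame so it can be fed directly to cv2.remap as a backward map.

```python
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"   # recent OpenCV builds gate EXR I/O
import cv2
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

H = W = 512

# Stand-ins for real scan data in a shared UV layout (hypothetical values).
src_diffuse = np.random.rand(H, W, 3).astype(np.float32)   # source expression
tgt_diffuse = np.random.rand(H, W, 3).astype(np.float32)   # target expression
src_points  = np.random.rand(H, W, 3).astype(np.float32)   # source XYZ point-cloud map
# In practice something like: cv2.imread("source_points.exr", cv2.IMREAD_UNCHANGED)

# Sparse corresponding landmarks (x, y) in UV space; image corners are appended
# so the Delaunay mesh of the piecewise-affine warp covers the whole texture.
corners = np.float32([[0, 0], [W - 1, 0], [0, H - 1], [W - 1, H - 1]])
src_lmk = np.vstack([np.float32([[200, 180], [312, 180], [256, 300]]), corners])
tgt_lmk = np.vstack([np.float32([[205, 190], [318, 185], [258, 310]]), corners])

# Step 1: piecewise-affine pre-warp over a Delaunay triangulation.
# warp() maps output (target) coordinates back into the input (source) image,
# so the transform is estimated from target landmarks to source landmarks.
tform = PiecewiseAffineTransform()
tform.estimate(tgt_lmk, src_lmk)
def prewarp(img):
    return warp(img, tform, preserve_range=True).astype(np.float32)

pre_diffuse = prewarp(src_diffuse)

# Step 2: dense optical flow between the target and the pre-warped source.
def to_gray8(img):
    return (cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) * 255).astype(np.uint8)

# Args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(to_gray8(tgt_diffuse), to_gray8(pre_diffuse),
                                    None, 0.5, 5, 31, 5, 7, 1.5, 0)

# Turn the flow into a backward map for cv2.remap: each target pixel samples
# the pre-warped source at its flow-displaced location.
grid_x, grid_y = np.meshgrid(np.arange(W, dtype=np.float32),
                             np.arange(H, dtype=np.float32))
map_x, map_y = grid_x + flow[..., 0], grid_y + flow[..., 1]
def densewarp(img):
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)

# Step 3: apply the composed warp to every source map (diffuse shown here;
# specular, specular normals, and the point cloud are warped the same way).
aligned_diffuse = densewarp(pre_diffuse)
aligned_points  = densewarp(prewarp(src_points))   # XYZ map now in target UV space
print(aligned_diffuse.shape, aligned_points.shape)
```

Because the same composed warp is applied to every channel, the diffuse, specular, and point-cloud maps stay in exact per-texel correspondence after alignment.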

Recent template-based face scanning technologies deform a generic template model toward raw face scans, so the resulting face model represents only an approximation of the original geometry rather than an exact reconstruction. Moreover, since the template model is deformed under geometric constraints, there are no guarantees of exact texture alignment between different facial expressions. By contrast, our method starts from raw face scans and directly extracts a consistent mesh from the scans, so the geometric representations are exact to the original shapes. Furthermore, our method unwraps both geometric and texture information into 2D images and solves for alignment in UV space using facial feature detection and optical flow. Template-based methods therefore tend to produce results that are similar across scan subjects, since their template is shared, while our method preserves the unique appearance of each face, resulting in greater photorealism.