Learning Temporal Pose Estimation from Sparsely Labeled Videos
Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

I. Introduction

[Teaser figure: a sparsely labeled multi-person video, with a labeled frame (time t) between unlabeled frames (time t-1 and t+1).]

• Goal: Pose detection in video.
• Challenge: Densely labeling every frame with multi-person pose annotations is costly and time consuming.
• Idea: Leverage sparsely annotated videos, i.e., pose annotations are given only every k frames.
• Training: Given a pair of frames from the same video, a labeled Frame A and an unlabeled Frame B, we train our model to detect pose in Frame A using the features from Frame B.
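A minimal sketch of this training setup, assuming a PyTorch-style pipeline; the pair-sampling helper, the model(frame_a, frame_b) interface, and the MSE heatmap loss are illustrative assumptions rather than the authors' released code:

```python
import random
import torch
import torch.nn.functional as F

def sample_training_pair(frames, labeled_indices, max_gap=3):
    """Pick a labeled Frame A and a nearby (typically unlabeled) Frame B.

    `frames` is a list of image tensors for one video; `labeled_indices`
    are the frame indices that carry ground-truth pose annotations
    (e.g. every k-th frame in a sparsely labeled video).
    """
    t_a = random.choice(labeled_indices)              # labeled Frame A
    delta = random.randint(-max_gap, max_gap)         # temporal offset to Frame B
    t_b = min(max(t_a + delta, 0), len(frames) - 1)   # clamp to video bounds
    return frames[t_a], frames[t_b]

def training_step(model, frame_a, frame_b, gt_heatmaps_a, optimizer):
    """One step: predict Frame A's pose from Frame B's features, supervised by A's labels."""
    pred = model(frame_a.unsqueeze(0), frame_b.unsqueeze(0))
    loss = F.mse_loss(pred, gt_heatmaps_a.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```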

II. The PoseWarper Network

• We use dilated deformable convolutions to warp pose heatmaps from an unlabeled Frame B for pose prediction in a labeled Frame A.

• Our model implicitly learns motion cues between Frame A and Frame B (no ground truth alignment is available between frames).
• Application Stage: After training, we can apply our model in reverse order to propagate pose information across the training video from sparse ground truth poses.
[Architecture figure: features relating the labeled Frame A (time t+δ) and the unlabeled Frame B (time t) drive a feature warping module; in the application stage (labeled to unlabeled), the ground truth pose in Frame A is propagated to Frame B.]
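A minimal sketch of the heatmap-warping idea built on torchvision's DeformConv2d; the channel sizes, the single dilation rate (the poster's dilated deformable convolutions use several rates in parallel), and the offset head are simplifying assumptions, not the released PoseWarper architecture:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class HeatmapWarper(nn.Module):
    """Warp pose heatmaps from Frame B toward Frame A.

    Offsets are predicted from features that relate the two frames
    (here, simply their difference), so the warping is learned
    implicitly, without any ground-truth motion supervision.
    """
    def __init__(self, feat_channels=256, num_joints=17, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size // 2)
        # Offset head: 2 (x, y) offsets per kernel sampling location.
        self.offset_head = nn.Conv2d(feat_channels, 2 * kernel_size * kernel_size,
                                     kernel_size=3, padding=1)
        # Deformable conv that resamples Frame B's heatmaps with those offsets.
        self.warp = DeformConv2d(num_joints, num_joints, kernel_size,
                                 padding=pad, dilation=dilation)

    def forward(self, feats_a, feats_b, heatmaps_b):
        rel = feats_a - feats_b           # features relating the two frames
        offsets = self.offset_head(rel)   # learned per-location sampling offsets
        return self.warp(heatmaps_b, offsets)
```

Since the offsets are trained only through the pose loss on Frame A, motion cues between the two frames can be picked up without any ground truth alignment.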

III. Applications of the PoseWarper Network

1. Video Pose Propagation
• Goal: Propagate ground truth pose annotations across the entire video from only a few labeled frames.
[Qualitative comparison: Ground Truth Reference Frame (time t), PoseWarper (time t+1), FlowNet2 (time t+1).]
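A minimal sketch of how propagation could be run over one video, assuming a hypothetical warper(src_frame, dst_frame, src_heatmaps) interface and ground-truth heatmaps stored per labeled frame:

```python
import torch

def propagate_poses(warper, frames, labeled_heatmaps):
    """Propagate sparse annotations to every frame of one video.

    `labeled_heatmaps` maps a labeled frame index to its ground-truth
    heatmaps (J, H, W); `warper` is assumed to return heatmaps warped
    from the source frame into the target frame's coordinates.
    """
    propagated = {}
    for t in range(len(frames)):
        if t in labeled_heatmaps:
            propagated[t] = labeled_heatmaps[t]
            continue
        s = min(labeled_heatmaps, key=lambda i: abs(i - t))   # nearest labeled frame
        with torch.no_grad():
            propagated[t] = warper(frames[s].unsqueeze(0),
                                   frames[t].unsqueeze(0),
                                   labeled_heatmaps[s].unsqueeze(0))[0]
    return propagated

def heatmaps_to_keypoints(heatmaps):
    """Convert (J, H, W) heatmaps to (J, 2) integer (x, y) keypoints via argmax."""
    j, h, w = heatmaps.shape
    flat = heatmaps.reshape(j, -1).argmax(dim=1)
    return torch.stack([flat % w, flat // w], dim=1)
```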

2. Data Augmentation with PoseWarper
• Goal: Augment sparsely labeled video data with our propagated poses. Then, train an HRNet-W48 pose detector on this joint data.
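A minimal sketch of assembling such a joint training set; the dictionary layout and field names are hypothetical:

```python
def build_joint_training_set(videos):
    """Merge sparse ground truth with propagated pseudo-labels.

    Each video dict is assumed to hold `frames`, `gt` (manual annotations
    on every k-th frame) and `pgt` (poses propagated by the warper to the
    remaining frames). The flat sample list can then drive a standard
    top-down pose detector training loop (e.g. an HRNet-W48 configuration).
    """
    samples = []
    for video in videos:
        for t, pose in video["gt"].items():
            samples.append({"frame": video["frames"][t], "pose": pose, "is_pseudo": False})
        for t, pose in video["pgt"].items():
            samples.append({"frame": video["frames"][t], "pose": pose, "is_pseudo": True})
    return samples
```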

3. Temporal Pose Aggregation at Inference
• Challenge: Detecting pose in video can be difficult due to nuisance effects such as occlusion, motion blur, and video defocus.
• Goal: Improve detection by using pose information from adjacent frames.
• We train our PoseWarper model on the full PoseTrack dataset, i.e., when frames in videos are densely labeled.
• During inference, we use our temporal pose aggregation scheme to aggregate information from 5 nearby frames.
[Figure: pose heatmaps from Frame A (time t-1) and Frame C (time t+1) are warped to Frame B (time t) and combined with Pose Heatmap B into the final Pose B.]
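A minimal sketch of this aggregation step, assuming a per-frame backbone callable and the same hypothetical warper interface as above; averaging the warped heatmaps is one simple choice of aggregation:

```python
import torch

def aggregate_pose(warper, backbone, frames, t, radius=2):
    """Aggregate pose evidence for frame t from its temporal neighbours.

    backbone(frame) returns per-frame heatmaps of shape (J, H, W); warper
    follows the hypothetical interface warper(src_frame, dst_frame,
    src_heatmaps). With radius=2 this pools information from 5 frames.
    """
    warped = []
    for dt in range(-radius, radius + 1):
        s = min(max(t + dt, 0), len(frames) - 1)   # clamp to video bounds
        heatmaps_s = backbone(frames[s])            # pose estimate on frame s
        if s == t:
            warped.append(heatmaps_s)
        else:
            # Warp the neighbour's heatmaps into frame t's coordinate frame.
            with torch.no_grad():
                warped.append(warper(frames[s].unsqueeze(0),
                                     frames[t].unsqueeze(0),
                                     heatmaps_s.unsqueeze(0))[0])
    return torch.stack(warped).mean(dim=0)          # simple average over frames
```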

Comparison to State-of-the-Art

IV. Additional Experiments

• Every 7th frame in a training video is labeled. We propagate pose annotations from manually-labeled frames to all unlabeled frames in the same video.
• Our temporal pose aggregation scheme during inference is another effective way to maintain strong performance in a sparsely-labeled video setting.
[Plots: Accuracy (mAP) as a function of # Labeled Frames Per Video and of # Training Videos, comparing GT (7x), GT (1x) + pGT (6x), GT (1x) + ST-Agg., and GT (1x).]

Interpreting our Learned Offsets
• It appears that different offset maps encode different motion, thus performing a sort of motion decomposition of discriminative regions in the video.
[Figure: Frame t, Frame t+5, and offset magnitude maps for Channel 99 (x,y) and Channel 123 (x,y).]
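A minimal sketch of how such per-channel offset magnitudes could be computed for visualization, assuming the predicted offsets are stored as interleaved (x, y) pairs:

```python
import torch

def offset_magnitudes(offsets):
    """Per-channel offset magnitude maps for visualization.

    `offsets` is assumed to have shape (2*C, H, W) with interleaved
    (x, y) displacement pairs; the result has shape (C, H, W), one
    magnitude map per offset channel (e.g. channel 99 or 123).
    """
    h, w = offsets.shape[-2], offsets.shape[-1]
    xy = offsets.view(-1, 2, h, w)                   # (C, 2, H, W)
    return torch.sqrt(xy[:, 0] ** 2 + xy[:, 1] ** 2)
```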

V. Conclusions

• Our approach reduces the need for densely labeled video data, while producing strong pose detection performance.

• PoseWarper is useful even when training videos are densely labeled.
• The source code and our trained models will be made publicly available at: https://github.com/facebookresearch/PoseWarper





