Online Multi-Object Tracking with Dual Matching Attention Networks
Ji Zhu, Hua Yang, Shanghai Jiao Tong University
Nian Liu, Northwestern Polytechnical University
Minyoung Kim, Massachusetts Institute of Technology
Wenjun Zhang, Shanghai Jiao Tong University
Ming-Hsuan Yang, University of California, Merced
ECCV 2018
Pipeline
• STEP 1: Apply a single object tracker to keep tracking each target.
• STEP 2: If the tracking result becomes unreliable, suspend the tracker.
• STEP 3: Do data association between lost targets and detections.
• STEP 4: Update results.
Xu Gao, Peking University 2
Single Object Tracking
• Baseline Method. “ECO: Efficient Convolution Operators for Tracking.” 2017 CVPR.
• $x = \{(x^1)^T, \dots, (x^D)^T\}$ is a feature map with $D$ feature channels extracted from an image patch.
• Aim to learn a multi-channel convolution filter $f = \{f^1, \dots, f^D\}$.
• $E(f) = \sum_{j=1}^{M} \alpha_j \| S_f\{x_j\}(t) - y_j(t) \|_{L^2}^2 + \sum_{d=1}^{D} \| w(t) f^d(t) \|_{L^2}^2$.
• Where $S_f\{x_j\}(t) = f * P^T x_j$, $P$ is a $D \times C$ matrix, $y_j(t)$ is the desired confidence map, and $M$ is the number of training samples.
• $\| g(t) \|_{L^2}^2 = \frac{1}{T} \int_0^T |g(t)|^2 \, dt$.
[Figure: desired confidence map; score map predicted by ECO]
Cost-Sensitive Tracking Loss
• Drawback of ECO: as shown in the figure, the center of an object next to the target also gets a high confidence score.
• Analysis: such neighboring objects are hard negative samples, so they should be penalized more heavily to prevent the tracker from drifting.
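• $E(f) = \sum_{j=1}^{M} \alpha_j \| q(t) \left( S_f\{x_j\}(t) - y_j(t) \right) \|_{L^2}^2 + \sum_{d=1}^{D} \| w(t) f^d(t) \|_{L^2}^2$.
• Where $q(t) = \left| \frac{S_f\{x_j\}(t) - y_j(t)}{\max_t |S_f\{x_j\}(t) - y_j(t)|} \right|^2$.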
[Figure: desired confidence map; score map predicted by ECO]
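A minimal sketch of the weighting term q(t) on a discrete grid, assuming the score map and the desired confidence map are sampled as NumPy arrays of the same shape. The function names are hypothetical, and the continuous norm (1/T)∫|g(t)|² dt is approximated here by a plain mean over grid points; this is an illustration, not the authors' implementation.

```python
import numpy as np

def cost_sensitive_weights(score_map, desired_map):
    """q(t) = (|S - y| / max_t |S - y|)^2, evaluated at every grid point."""
    residual = np.abs(score_map - desired_map)
    peak = residual.max()
    if peak == 0:
        # Perfect prediction: nothing to penalize.
        return np.zeros_like(residual, dtype=float)
    return (residual / peak) ** 2

def weighted_data_term(score_map, desired_map):
    """Discrete stand-in for ||q(t) (S(t) - y(t))||^2 in the L2 sense."""
    q = cost_sensitive_weights(score_map, desired_map)
    return float(np.mean((q * (score_map - desired_map)) ** 2))

# Toy 1-D example: the target peak is at index 1,
# a distractor produces a spurious response at index 3.
desired = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
score = np.array([0.0, 1.0, 0.0, 0.8, 0.1])
q = cost_sensitive_weights(score, desired)
```

In the toy example the distractor at index 3 receives the largest weight, while the small error at index 4 is almost ignored, which matches the intent of penalizing high-scoring negatives more heavily.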
Preparation for Data Association
• When the tracking process becomes unreliable, suspend the tracker and set the target to a lost state.
• $\text{state} = \begin{cases} \text{tracked}, & \text{if } s > \tau_s \text{ and } o_{mean} > \tau_o \\ \text{lost}, & \text{otherwise} \end{cases}$
• $s$ is the tracking score (the highest value in the confidence map).
• $o_{mean}$ is the mean, over frames $l$, of the maximum IoU between the tracked target $t_l$ and the detections $D_l$ at frame $l$.
• $o_{mean} > \tau_o$ is required because a false-alarm detection is prone to be consistently tracked with high confidence.
• I think the setting of $o_{mean}$ needs to be reconsidered.
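The state rule above can be sketched as follows, assuming boxes are given as (x1, y1, x2, y2) tuples. The helper names and the threshold values used in the example are hypothetical; the slides do not state the values of the two thresholds.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_max_iou(track_boxes, detections_per_frame):
    """Mean over frames of the best IoU between the target and that frame's detections."""
    per_frame = [
        max((iou(t, d) for d in dets), default=0.0)
        for t, dets in zip(track_boxes, detections_per_frame)
    ]
    return sum(per_frame) / len(per_frame) if per_frame else 0.0

def target_state(score, o_mean, tau_s, tau_o):
    """'tracked' only if both the tracking score and the mean overlap pass their thresholds."""
    return "tracked" if score > tau_s and o_mean > tau_o else "lost"

# Example: the target matches a detection in frame 1 but has no detection in frame 2.
track = [(0, 0, 10, 10), (0, 0, 10, 10)]
dets = [[(0, 0, 10, 10)], []]
o_mean = mean_max_iou(track, dets)
state = target_state(score=0.9, o_mean=o_mean, tau_s=0.2, tau_o=0.4)  # thresholds are made up
```

A target with a high tracking score is still marked lost when the detector rarely confirms it, which is the false-alarm case the second condition is meant to catch.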
Data Association with DMAN
• Data association is performed between lost trajectories and candidate detections.
• Candidate detections are detections surrounding the predicted location that are not covered by any tracked target.
• The predicted location is extrapolated from the lost trajectory with a linear motion model.
• The matching is done by Dual Matching Attention Networks (DMAN) with spatial and temporal attention.
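A rough sketch of candidate selection under stated assumptions: the linear motion model is taken to be a constant velocity estimated from the recent centers of the lost trajectory, "surrounding" is taken to mean within a fixed radius of the predicted center, and "covered" is taken to mean an IoU of at least 0.5 with a tracked box. The slides give none of these specifics, so all names and parameters here are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def predict_center(centers, frames_lost):
    """Extrapolate the last known center with an average per-frame velocity."""
    centers = np.asarray(centers, dtype=float)
    if len(centers) < 2:
        velocity = np.zeros(2)
    else:
        velocity = (centers[-1] - centers[0]) / (len(centers) - 1)
    return centers[-1] + velocity * frames_lost

def candidate_detections(det_boxes, predicted, radius, tracked_boxes, cover_iou=0.5):
    """Indices of detections near the predicted center and not covered by a tracked target."""
    keep = []
    for i, box in enumerate(det_boxes):
        center = np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])
        near = np.linalg.norm(center - predicted) <= radius
        covered = any(box_iou(box, t) >= cover_iou for t in tracked_boxes)
        if near and not covered:
            keep.append(i)
    return keep

# Example: a target moving 2 px per frame to the right, lost for 3 frames.
predicted = predict_center([(0, 0), (2, 0), (4, 0)], frames_lost=3)
dets = [(5, -5, 15, 5), (100, 100, 110, 110), (12, -8, 22, 2)]
tracked = [(12, -8, 22, 2)]
candidates = candidate_detections(dets, predicted, radius=10, tracked_boxes=tracked)
```

Only the first detection survives: the second is too far from the predicted location, and the third already belongs to a tracked target.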
Pipeline of DMAN
Spatial Attention Network (SAN)
• Intuition: pay more attention to common local patterns of the two feature maps.
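• Matching layer: compute the cosine similarity between each $x_i^\alpha$ and $x_j^\beta$: $S_{ij} = (x_i^\alpha)^T x_j^\beta$, $x_i \in \mathbb{R}^C$ (the dot product equals the cosine similarity when the vectors are L2-normalized).
• In matrix form, $S = (x^\alpha)^T x^\beta$, $S \in \mathbb{R}^{N \times N}$, $N = H \times W$.
• Reshape $S \in \mathbb{R}^{N \times N}$ into $X_S^\alpha \in \mathbb{R}^{H \times W \times N}$.
• Reshape $S^T \in \mathbb{R}^{N \times N}$ into $X_S^\beta \in \mathbb{R}^{H \times W \times N}$.
• Training loss: identification loss and verification loss.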
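A small NumPy sketch of the matching layer, assuming each feature map is stored as an (H, W, C) array and locations are flattened in row-major order. The function name and the explicit L2 normalization step are assumptions made for illustration; the layers that follow the matching layer in the SAN are not shown.

```python
import numpy as np

def matching_layer(feat_a, feat_b, eps=1e-12):
    """Cosine similarity between every pair of locations of two (H, W, C) feature maps."""
    H, W, C = feat_a.shape
    xa = feat_a.reshape(H * W, C)
    xb = feat_b.reshape(H * W, C)
    # L2-normalize each location so that the dot product is a cosine similarity.
    xa = xa / (np.linalg.norm(xa, axis=1, keepdims=True) + eps)
    xb = xb / (np.linalg.norm(xb, axis=1, keepdims=True) + eps)
    S = xa @ xb.T                          # (N, N), S[i, j] = cos(x_i^a, x_j^b)
    X_S_a = S.reshape(H, W, H * W)         # location i of image a vs. all locations of b
    X_S_b = S.T.reshape(H, W, H * W)       # location j of image b vs. all locations of a
    return S, X_S_a, X_S_b

rng = np.random.default_rng(0)
feat = rng.normal(size=(2, 3, 4))          # toy map with H=2, W=3, C=4
S, X_S_a, X_S_b = matching_layer(feat, feat)
```

Each spatial position of `X_S_a` holds that location's similarities to all N locations of the other image, so common local patterns show up as large entries along the last axis.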
Temporal Attention Network (TAN)
• Intuition: the tracklet may contain noisy observations, hence average pooling is unreliable.
• Training strategy: first train the SAN on randomly generated image pairs and fix its weights; then train the TAN with the extracted features as input.
• Reason for the strategy: the sequence of each identity is highly redundant as a source of image pairs, so the network overfits easily.
• MOT16 is used for training.
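To illustrate why average pooling is fragile, here is a toy comparison with attention-weighted pooling. The slides do not describe how the TAN produces its weights, so the per-frame attention logits below are simply given as inputs; the function name and all numbers are hypothetical.

```python
import numpy as np

def attention_pool(features, logits):
    """Softmax the per-frame logits and return (weights, weighted sum of features)."""
    logits = np.asarray(logits, dtype=float)
    weights = np.exp(logits - logits.max())   # subtract the max for numerical stability
    weights = weights / weights.sum()
    pooled = weights @ np.asarray(features, dtype=float)
    return weights, pooled

# Three observations of a tracklet; the last one is a noisy outlier.
features = np.array([[1.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 10.0]])
logits = [2.0, 2.0, -10.0]                    # assumed attention scores, low for the outlier

weights, pooled = attention_pool(features, logits)
averaged = features.mean(axis=0)
```

The plain average is pulled far toward the outlier, while the weighted result stays close to the two clean observations.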
Datasets
• MOT16: 14 sequences, including 7 for training and 7 for testing.
• MOT17: same video sequences as MOT16, but with three sets of detections (DPM, Faster-RCNN, SDP).
Visualization of the Spatial and Temporal Attention
[Figure: positive and negative examples]
More Visualization Results
Experiment
Ablation Study
Conclusion
• Integrates the merits of single object tracking and data association methods in a unified online MOT framework.
• + Combines data association with single object tracking results.
• + The spatial attention network seems to be useful.
• - Results are not the best.
• - Not too much innovation.