Post on 17-Oct-2020
Localization using Faster R-CNN and Multi-Frame Fusion
Ryosuke Yamamoto, Nakamasa Inoue, Koichi Shinoda
Tokyo Institute of Technology
Outline
Motivation: detect an action concept “Sitting Down”
Our method: Faster R-CNN + LSTM + Re-scoring
Annotation: Frame-wise annotation for Sitting Down, Key-frame annotation for other concepts
Results: 2nd among 3 teams, best result at Sitting Down
[Figure: F-score plot (axis 0 to 0.5) with series iframe_fscore and mean_pixel_fscore]
Motivation
・Localization task focuses not only on static objects, but also on action concepts
・We focus on Sitting Down, one of the action concepts
・How to distinguish between Sitting and Sitting Down?
  → Dynamic information is important for precise detection
[Images: Sitting / Sitting Down]
Our Method
・Faster R-CNN (Ren 2015)
  - Efficient object localization
・LSTM (Donahue 2015)
  - Precise action localization
  - Applied to Sitting Down
・Re-scoring (Yamamoto 2015)
  - Multi-Frame Score Fusion
  - Multi-Shot Score Boosting
[Diagram: pipeline over the time sequence with stages Faster R-CNN, LSTM, Prediction, Fusion, Boost]
Faster R-CNN (Ren 2015)
Efficient End-to-End object localization
1. Generate region proposals by a network
2. Predict scores for each region by using CNN features
Example CNNs:
- ZFNet (Zeiler 2014) (we use)
- VGG-16 (Simonyan 2014)
- GoogLeNet (Szegedy 2015)
- ResNet (He 2016)
[Diagram: Faster R-CNN architecture with blocks CNN, Region proposals, ROI Pooling, DNN]
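Step 2 above scores each region from CNN features, and the ROI pooling stage is what turns a variable-size region into a fixed-size feature. The following is a minimal pure-Python sketch of max ROI pooling, assuming a single-channel feature map stored as a list of rows and a region given in feature-map cells with half-open bounds. The function name `roi_max_pool` and the toy sizes are illustrative and are not taken from the slides.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of interest to a fixed out_h x out_w grid.

    feature_map: list of rows (single channel), an assumption for brevity.
    roi: (x0, y0, x1, y1) in feature-map cells, half-open on x1 and y1.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    if h <= 0 or w <= 0:
        raise ValueError("roi must cover at least one cell")
    pooled = []
    for i in range(out_h):
        # Bin i covers rows [floor(i*h/out_h), ceil((i+1)*h/out_h)).
        ys = y0 + (i * h) // out_h
        ye = y0 + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = x0 + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4x4 feature map with values 0..15, pooled to 2x2.
toy_map = [[4 * r + c for c in range(4)] for r in range(4)]
pooled = roi_max_pool(toy_map, (0, 0, 4, 4), 2, 2)
```

Whatever the size of the region, the output grid has the same shape, which is why a fixed-size classifier can follow it.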
[Diagram: at each step of the time sequence, Faster R-CNN, LSTM, Prediction]
Long Short-Term Memory (LSTM)
An LSTM layer is introduced to Faster R-CNN
- memorizes long- and short-term information
- applied only to Sitting Down
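To make "memorizes long- and short-term information" concrete, here is a minimal sketch of one step of a standard LSTM cell, reduced to scalars. It is not the layer used in the slides (which report 4096 or 64 units); the `params` dictionary, its weights, and the helper names are hypothetical, and how the layer is wired into Faster R-CNN is not shown here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each gate name ("i", "f", "o", "g") to hypothetical
    (w_x, w_h, b) weights; a real layer uses weight matrices instead.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))    # input gate: how much new information to write
    f = sigmoid(pre("f"))    # forget gate: how much old memory to keep
    o = sigmoid(pre("o"))    # output gate: how much memory to expose
    g = math.tanh(pre("g"))  # candidate value to write
    c = f * c_prev + i * g   # cell state: memory carried across frames
    h = o * math.tanh(c)     # hidden state: output for the current frame
    return h, c


def run_sequence(features, params):
    """Feed one feature per frame through the cell, in time order."""
    h, c = 0.0, 0.0
    outputs = []
    for x in features:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


# With all-zero weights every gate is 0.5 and the candidate is 0.
zero_params = {gate: (0.0, 0.0, 0.0) for gate in "ifog"}
h1, c1 = lstm_step(1.0, 0.0, 2.0, zero_params)
```

The cell state `c` is what lets evidence from earlier frames influence the prediction for the current one.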
Multi-Frame and Multi-Shot (Yamamoto 2015)
・Multi-Frame Score Fusion
  Average pooling of scores over 5 frames in a shot
・Multi-Shot Score Boosting
  Add adjacent shot scores
[Diagram labels: Key-frame (I-frame), Average]
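The two re-scoring steps can be sketched as follows. This is a simplified illustration that assumes one scalar score per frame and per shot. The `weight` parameter and the function names are hypothetical, since the slides only state that adjacent shot scores are added, and the example scores are made up.

```python
def fuse_frames(frame_scores):
    """Multi-frame score fusion: average the scores of the frames in a shot."""
    if not frame_scores:
        raise ValueError("a shot needs at least one frame score")
    return sum(frame_scores) / len(frame_scores)


def boost_shots(shot_scores, weight=0.5):
    """Multi-shot score boosting: add weighted neighbour scores to each shot.

    weight is a hypothetical parameter; the slides only say that adjacent
    shot scores are added.
    """
    boosted = []
    for k, score in enumerate(shot_scores):
        left = shot_scores[k - 1] if k > 0 else 0.0
        right = shot_scores[k + 1] if k + 1 < len(shot_scores) else 0.0
        boosted.append(score + weight * (left + right))
    return boosted


# Three shots, five frame scores each (made-up numbers).
shots = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 1.0, 0.5, 0.0],
    [0.25, 0.25, 0.25, 0.25, 0.25],
]
fused = [fuse_frames(s) for s in shots]
boosted = boost_shots(fused, weight=0.5)
```

In this toy case the first shot has no evidence of its own but still receives a non-zero score from its neighbour, which is the intended effect of boosting.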
Key-Frame Annotations
Bounding-box annotation on the representative key-frame for each shot labeled as positive in collaborative annotation
Concept          #frames   #boxes
Animal            11,545    9,155
Bicycling            599    1,355
Boy                1,848    2,492
Dancing            2,118    5,199
Explosion Fire     2,483    2,402
Inst. Musician     4,923    7,229
Running              945    1,394
Sitting Down           -        -
Baby                 898      895
Skier                320      521
I-Frame Annotations for Sitting Down
・I-Frame annotation for Sitting Down to train LSTM
・Annotation results
  #shots = 92, #frames = 481, #bounding-boxes = 515
* We found Sitting Down in only 92 shots in the 3K shots labeled as positive in collaborative annotation
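As a sanity check on the annotation counts, the per-concept box counts from the key-frame table can be added to the 515 Sitting Down boxes. The numbers below are read from the flattened table in the transcript, so the digit grouping is an assumption. The total comes to about 31K, which agrees with the figure quoted in the conclusion.

```python
# Bounding-box counts per concept, as read from the key-frame table
# (digit grouping recovered from the flattened transcript: an assumption).
key_frame_boxes = {
    "Animal": 9155,
    "Bicycling": 1355,
    "Boy": 2492,
    "Dancing": 5199,
    "Explosion Fire": 2402,
    "Inst. Musician": 7229,
    "Running": 1394,
    "Baby": 895,
    "Skier": 521,
}
sitting_down_boxes = 515  # I-frame annotation for Sitting Down

total_boxes = sum(key_frame_boxes.values()) + sitting_down_boxes
```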
Results
[Figure: F-score plot (axis 0 to 0.5) with series iframe_fscore and mean_pixel_fscore]
TokyoTech Runs
ID   Method                                     Run ID
1*   Faster R-CNN + Multi-Frame Score Fusion    fusion
2*   1 + Multi-Shot Score Boosting              boost
3*   1 + LSTM (4096 units) for Sitting Down     fusion.lstm
4*   2 + LSTM (4096 units) for Sitting Down     boost.lstm
5    2 + LSTM (64 units) for Sitting Down       (post exp.)
・2nd among 3 teams
Results for Sitting Down
ID   Method                  I-Frame F-score   Pixel F-score
2*   Fusion + Boosting       0.63              0.22
4*   2 + LSTM (4096 units)   0.00              0.00
5    2 + LSTM (64 units)     11.96             4.51
Best result for Sitting Down with run #2
LSTM with 4096 units (run #4) did not work
→ LSTM with 64 units (run #5) avoided over-fitting and worked in post-submission experiment
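For readers unfamiliar with the two measures, the sketch below shows the standard F-score (harmonic mean of precision and recall) applied at pixel level to a single pair of boxes. It assumes precision is the overlap divided by the system box area and recall is the overlap divided by the ground-truth box area. The official evaluation may aggregate differently, and the function names are illustrative.

```python
def box_area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); empty boxes give 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection_area(a, b):
    """Area shared by two axis-aligned boxes."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return box_area((x0, y0, x1, y1))


def f_score(precision, recall):
    """Harmonic mean of precision and recall, 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pixel_f_score(system_box, truth_box):
    """Pixel-level F-score for one system box against one ground-truth box."""
    overlap = intersection_area(system_box, truth_box)
    system_area = box_area(system_box)
    truth_area = box_area(truth_box)
    precision = overlap / system_area if system_area else 0.0
    recall = overlap / truth_area if truth_area else 0.0
    return f_score(precision, recall)


# Two 10x10 boxes that share half of their area.
half_overlap = pixel_f_score((0, 0, 10, 10), (5, 0, 15, 10))
```

An I-frame-level score uses the same `f_score` formula, with precision and recall counted over frames instead of pixels.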
Sitting Down
[Images: System output vs. Ground truth; Good cases and Bad cases]
Image captions: Moving but not sitting down; Moving around a chair; Sitting down
Re-trained network with LSTM 64 units
Animal, Good Results
System output Ground truth
Faster R-CNN Score Fusion
Cat (no movement)
Score Boosting
Dog (walking)
Animal, Bad Results
System output Ground truth
Faster R-CNN Score Fusion
Many animals
Score Boosting
Bird (flying fast)
Others
Faster R-CNN Score Fusion Score Boosting
System output Ground truth
Bicycling
Boy
Others
Faster R-CNN Score Fusion Score Boosting
System output Ground truth
Dancing
Explosion Fire
Others
Faster R-CNN Score Fusion Score Boosting
System output Ground truth
Instrumental Musician
Running
Others
Faster R-CNN Score Fusion Score Boosting
System output Ground truth
Baby
Skier
Conclusion & Future Work
・We proposed a localization system
  - Faster R-CNN + LSTM + Re-scoring
・Manual annotation
  - 31K bounding boxes
・Results
  - 2nd among 3 teams, best result at Sitting Down
  - LSTM with 64 units was effective for Sitting Down
・Future work
  - Find a better way to localize action