Performance Tuning on Multicore Systems for Feature Matching within Image Collections
Xiaoxin Tang*, Steven Mills, David Eyers, Zhiyi Huang, Kai-Cheung Leung and Minyi Guo*
Department of Computer Science, University of Otago, New Zealand
* Department of Computer Science, Shanghai Jiao Tong University, China
Contents
• Motivation
• Our work
• Evaluation
• Conclusion
Similarity Search
• Definition:
– To preprocess a database of N objects so that, given a query object, one can effectively determine its nearest neighbors in the database.
• Applications:
– Pattern recognition, chemical similarity analysis, statistical classification, etc.
The problem – KNN Search
• K Nearest Neighbor Search:
– Feature: an array of D elements
• $f = [e_1, e_2, \ldots, e_D]$
– Feature Space: a set of features
• $F_s = \{f_1, f_2, \ldots, f_N\}$
– Feature Similarity: Euclidean distance
• $d(f_i, f_j) = \sqrt{\sum_{m=1}^{D} (f_i^m - f_j^m)^2}$, where $f_i^m$ is the m-th element of $f_i$
– Search: given a query feature $f_q$, find the k features in $F_s$ that have the shortest distances to $f_q$.
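The search defined above can be illustrated with a minimal brute-force sketch. It is not the implementation used in this work (the experiments rely on OpenCV); the names `euclidean` and `knn_search` and the toy data are assumptions made for illustration only.

```python
import heapq
import math

def euclidean(fa, fb):
    """Euclidean distance between two features of equal length D."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fa, fb)))

def knn_search(feature_space, fq, k):
    """Return the k (distance, index) pairs in feature_space closest to fq."""
    scored = ((euclidean(f, fq), i) for i, f in enumerate(feature_space))
    return heapq.nsmallest(k, scored)

# Toy feature space with D = 2 (hypothetical data; standard SIFT
# descriptors have 128 elements).
Fs = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [6.0, 8.0]]
nearest = knn_search(Fs, [0.0, 0.0], k=2)
print(nearest)
```

Every query scans the whole feature space, which is why this linear search becomes expensive as the number of features grows.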
Our Case Study
• Feature Matching: a fundamental problem in many computer vision tasks
– Use the SIFT algorithm to generate features for each image;
– Use a k-Nearest Neighbors (k-NN) algorithm to find similar features between images.
Challenges
• Very time-consuming:
– Datasets are becoming larger:
• hundreds or thousands of images;
– Image resolution is increasing:
• 2300×1500 pixels, or higher.
• New platforms: HPC is moving into the multi-/many-core age:
• AMD 16-core and 64-core machines.
Motivation
• Performance evaluation:
– Find common problems that may limit the performance of feature matching on multi-/many-core platforms.
• Performance tuning:
– Find general methods to solve the identified problems.
Data Distribution
[Figure: histogram over feature size range (x-axis: 10000 to 40000), showing the number of images (left axis, 0 to 30) and the total number of features (right axis, 0 to 700000). Data labels as extracted: images 26, 26, 26, 3; features 181124, 420008, 660949, 146180.]
Data Size
[Figure: size in MB (y-axis, 0 to 45) by image id (x-axis, 0 to 80). Series: data size, kd-tree size, total.]
Problems
• Unbalanced workload:
– Levels of parallelism;
– Scheduling policy.
• Poor last-level cache utilization:
– Memory architecture.
Levels of parallelism
[Diagram: levels of parallelism in feature matching. Labels: Reference Images, Query Images, Features; Level_1, Level_2, Level_3, Level_4, and the combined Level_1&2; search methods: Linear, KD-tree, Kmeans, LSH, Others.]
Scheduling policy
• OpenMP scheduling policy:
– Static: the scheduler assigns an equal number of tasks to each thread (not used);
– Dynamic: when a thread finishes its current task, it takes new tasks from the global task queue;
– Guided: the chunk size is adjusted dynamically as tasks are requested from the task queue.
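The effect of the three policies on an unbalanced workload can be sketched with a small simulation. This is a simplified model rather than OpenMP itself: the task costs are hypothetical, the function names are assumptions, and the guided chunk rule (remaining tasks divided by thread count) is only an approximation of what OpenMP runtimes do.

```python
import heapq

def static_makespan(costs, num_threads):
    """Static: split tasks into num_threads contiguous blocks of equal count."""
    block = max(1, -(-len(costs) // num_threads))  # ceiling division
    return max((sum(costs[i:i + block]) for i in range(0, len(costs), block)),
               default=0)

def dynamic_makespan(costs, num_threads, chunk=1):
    """Dynamic: the first idle thread takes the next `chunk` tasks."""
    finish = [0.0] * num_threads  # min-heap of per-thread finish times
    for i in range(0, len(costs), chunk):
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + sum(costs[i:i + chunk]))
    return max(finish)

def guided_makespan(costs, num_threads, min_chunk=1):
    """Guided (approximation): chunk size shrinks as the queue drains."""
    finish = [0.0] * num_threads
    i = 0
    while i < len(costs):
        chunk = max(min_chunk, (len(costs) - i) // num_threads)
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + sum(costs[i:i + chunk]))
        i += chunk
    return max(finish)

# Hypothetical per-image matching costs: two expensive images, six cheap ones.
costs = [8, 8, 1, 1, 1, 1, 1, 1]
static_t = static_makespan(costs, 2)
dynamic_t = dynamic_makespan(costs, 2, chunk=1)
guided_t = guided_makespan(costs, 2)
print(static_t, dynamic_t, guided_t)
```

In this toy case the expensive tasks come first, so both the static split and the large initial guided chunk place them on one thread, while dynamic scheduling with a chunk size of 1 spreads them out. Other orderings of the same costs can give different rankings.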
Memory architecture
• More cores share the memory and the last-level cache:
– Memory bandwidth:
• AMD 16-core 12.8 GB/s
• AMD 64-core 25.6 GB/s
– Last-level cache:
• AMD 16-core 6 MB
• AMD 64-core 16 MB
• Large images may not fit in the cache and cause many memory accesses, so execution hits the memory wall.
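A rough back-of-the-envelope check shows why large images can exceed the last-level cache. The sketch assumes 128-element SIFT descriptors stored as 32-bit floats (assumptions, not stated on the slide) and uses hypothetical per-image feature counts; it counts only raw descriptors, not index structures such as a kd-tree.

```python
DESCRIPTOR_DIM = 128    # assumption: standard SIFT descriptor length
BYTES_PER_ELEMENT = 4   # assumption: 32-bit float elements

def descriptor_mib(num_features):
    """Raw descriptor footprint of one image, in MiB."""
    return num_features * DESCRIPTOR_DIM * BYTES_PER_ELEMENT / 2**20

# Last-level cache sizes quoted on the slide, in MiB.
LLC_MIB = {"AMD 16-core": 6, "AMD 64-core": 16}

for n in (10000, 20000, 40000):  # hypothetical feature counts per image
    size = descriptor_mib(n)
    fits = [name for name, cap in LLC_MIB.items() if size <= cap]
    print(f"{n} features: {size:.2f} MiB, fits in LLC of: {fits or 'neither'}")
```

Because the last-level cache is shared between cores, the share available to each thread is smaller than these totals suggest.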
Divide-and-Merge
• We propose Divide-and-Merge:
– The whole feature space is split into several smaller sub-spaces;
– Each sub-space is searched independently;
– Their results are merged.
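The three steps can be sketched for the exact (brute-force) case as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the equal-sized contiguous split, and the toy data are all hypothetical.

```python
import heapq
import math

def knn(features, fq, k, offset=0):
    """Brute-force k-NN over one (sub-)space; returns (distance, global index)."""
    scored = ((math.dist(f, fq), offset + i) for i, f in enumerate(features))
    return heapq.nsmallest(k, scored)

def divide_and_merge(features, fq, k, num_parts):
    """Split the feature space, search each part, then merge the partial results."""
    size = max(1, -(-len(features) // num_parts))  # ceiling division
    partial = []
    for start in range(0, len(features), size):
        # Each sub-space search is independent, so it could run on its own thread.
        partial.extend(knn(features[start:start + size], fq, k, offset=start))
    # Merge: the global top-k is contained in the union of the per-part top-k.
    return heapq.nsmallest(k, partial)

# Hypothetical toy data with D = 2.
Fs = [[float(i), float(i % 3)] for i in range(10)]
fq = [4.2, 1.0]
merged = divide_and_merge(Fs, fq, k=3, num_parts=4)
print(merged)
```

For exact search the merged result equals a search over the whole space. With approximate methods such as a randomized KD-tree, the result of searching sub-spaces may differ from that of a single search over the whole space.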
Time complexity
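• Accurate algorithms:
– Brute force: [formula not preserved in transcript]
– Apply DM: [formula not preserved in transcript]
• Approximate algorithms:
– Randomized KD-Tree: [formula not preserved in transcript]
– Apply DM: [formula not preserved in transcript]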
Hardware and Software configuration
Name | CPU | Cache | Memory | OS | Compiler
AMD 16-core (AMD16) | AMD Opteron Processor 8380, 4 cores × 4 @ 2.5 GHz | L1: 128 KB, L2: 512 KB, L3: 6144 KB | 16 GiB, DDR2 800 MHz, 12.8 GB/s | Ubuntu 12.04.1 | g++-4.4
AMD 64-core (AMD64) | AMD Opteron Processor 6276, 8 cores × 8 @ 2.3 GHz | L1: 48 KB, L2: 1000 KB, L3: 16384 KB | 64 GiB, DDR3 1333 MHz, 21.32 GB/s | Ubuntu 12.04.1 | g++-4.4
Environment: OpenCV + OpenMP, one of the setups most frequently used by computer vision researchers to exploit parallel platforms.
Levels of parallelism
[Figure: "Scalability". x-axis: 1 to 16; y-axis: 0 to 12. Series: Level_1, Level_2, Level_3, Level_1&2.]
Scheduling policy (on Level_1&2)
[Figure: "Scalability". x-axis: 1 to 16; y-axis: 0 to 12. Series: d1, d2, d4, guided.]
Scheduling policy (on Level_3)
[Figure: "Scalability". x-axis: 1 to 16; y-axis: 0 to 14. Series: d1, d2, d4, guided.]
Memory architecture
1. Original Execution
2. Apply Divide-and-Merge
Evaluation on Manawatu Dataset
[Figure: "Scalability". x-axis: 1 to 64; y-axis: 0 to 50. Series: Level_3, Level_3_DM, Level_1&2, Level_1&2_DM.]
[Figure: "Speedup". x-axis: 1 to 64; y-axis: 0 to 25. Series: Level_3, Level_3_DM, Level_1&2, Level_1&2_DM.]
Evaluation on Manawatu Dataset
[Figure: "Scalability". x-axis: 1 to 64; y-axis: 0 to 50. Series: Level_3, Level_3_DM, Level_1&2, Level_1&2_DM.]
[Figure: "Speedup". x-axis: 1 to 64; y-axis: 0 to 14. Series: Level_3, Level_3_DM, Level_1&2, Level_1&2_DM.]
Conclusion
• We have shown that performance tuning is demanding on modern multicore systems.
• We have comprehensively evaluated the impact of three factors (levels of parallelism, scheduling policy, and memory architecture) on large-scale image feature matching.
• We have proposed a Divide-and-Merge algorithm that can greatly improve the speedup and scalability of feature matching algorithms on multicore machines.