Posted on 13-Dec-2014
Mysql Story in POI Dedup
Outline
• Problem
• Proposal
• Test & Verify
Problem
Master DB: 23 million POI
Daily incremental: 1 million POI
Deduping → Add or Update
Problem
• Process POI (target)
  1) Get candidates {POI: distance < 100 meters} from Master DB
     a. Use grid index
  2) Compare target with candidates
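The deck does not show how the grid index works; a minimal sketch of how candidate nodes could be collected for the later `node_index in (?, ?, …)` query. `CELL_DEG`, the bit packing in `nodeIndex`, and the one-cell search ring are all assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class GridIndex {
    // Hypothetical cell size: ~0.001 degrees of latitude is roughly 100 m.
    static final double CELL_DEG = 0.001;

    // Pack the (row, col) of the containing cell into one node index (assumed scheme).
    static long nodeIndex(double lat, double lon) {
        long row = (long) Math.floor(lat / CELL_DEG);
        long col = (long) Math.floor(lon / CELL_DEG);
        return (row << 21) ^ (col & ((1L << 21) - 1));
    }

    // Cells that can contain a POI within one cell width of the target;
    // these become the "node_index in (?, ?, ...)" parameters of the query.
    static List<Long> candidateNodes(double lat, double lon) {
        List<Long> nodes = new ArrayList<>();
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++)
                nodes.add(nodeIndex(lat + dr * CELL_DEG, lon + dc * CELL_DEG));
        return nodes;
    }

    public static void main(String[] args) {
        // 3x3 ring of cells around the target
        System.out.println(candidateNodes(37.4005, -122.1005).size()); // 9
    }
}
```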
Problem
• DB queries are time-consuming, according to Content Team experience
  – At 10 ms/POI, 1 million POI need 2.7 hours (DB query)
  – At 100 ms/POI, 1 million POI need 27 hours (DB query) – and it's a daily update!
Proposal
• Build a local cache
• Multi-threading (multiple boxes, Map-Reduce)
• Separate DB query from dedup computation
• Single SQL tuning
Single SQL Running: DAL vs JDBC

// DAL
CpPoiWorkDao dao = CpPoiWorkDao.getInstance();
List<IPoi> poi = dao.getAllPois(PoiDataSet.CS, Configs.isActive);

// JDBC
Statement statement = connect.createStatement();
ResultSet rs = statement.executeQuery("select * from cs_1");

// running time
com.telenav.content.impl.JdbcPoiLoader  0:00:04.062  42985
com.telenav.content.impl.PoiLoader      0:00:10.969  42985
First Declaration
First Declaration: DAL is slower than JDBC; there is performance loss in DAL.
The truth
• DAL needs a 'warm-up' (one extra query):

select id as id, table_set_name as table_set_name, current_work_suffix as current_work_suffix, current_live_suffix as current_live_suffix, table_set_size as table_set_size, update_time as update_time, create_time as create_time from active_table where table_set_name=?
Run        JDBC         DAL
First run  0:00:04.125  0:00:09.360
2          3187 ms      4797 ms
3          3297 ms      4672 ms
4          3265 ms      4828 ms
5          3297 ms      4828 ms
6          3344 ms      4891 ms
Second SQL Running
select POI_RECORD_ID, POI_ID, LATITUDE, …, locality, locale from us_ta_1 where node_index in ( ?, ?, … ? )
Run        JDBC    DAL
First run  375 ms  1156 ms
2          406 ms  313 ms
3          375 ms  281 ms
4          391 ms  375 ms
5          375 ms  266 ms
6          406 ms  297 ms
First Declaration withdrawn: once DAL is warmed up, its times are comparable to JDBC – the gap was the warm-up query, not a steady-state performance loss.
Benchmark Data
• It's slow – but how slow? A single SQL is only a smoke test; we want real data.
Benchmark Data
• Test case
  – Run 10k POI; for each POI:
    • DedupWorkPoiDao.getAdjacentDedupPois – get candidate POIs for matching
    • IDeDuper.getDuplications(target, candidates) – find matches among the candidates
  – 100 meter radius
  – Repeat the test 3 times

• Test result

Process Time        DB Time  Dedup Time  Dedup candidate POI size  Matched POI percent
0:01:46, 10 ms/POI  4 ms     6 ms        51                        0.63

6387 POI were matched
Second Declaration
Second Declaration: Dedup is the most important factor in the process; the DB is not the bottleneck.
The truth
• DB is fast because of cache

#  Distance    Process Time         DB Parameters  DB Time  Dedup Time  Candidate size  Matched percent
   100         total 2min30s, 14ms  4              4 ms     9 ms        80              0.68
1  500         total 30m, 180ms     37             128 ms   51 ms       474             0.79
2  500         total 11m38s, 69ms                  18 ms    51 ms

(At 500 m: 37 nodes in a single query; each POI must be compared with 474 candidates.)
The second (later) run is much faster than the first run.
The truth
• Clean the MySQL cache & restart MySQL
  – key_buffer_size 500M → 8 bytes
  – query_cache_size 64M → 0
• No effect – the DB query is still fast.
  – The first-run time cannot be reproduced for the same data set.
The truth
• Clean the OS (Linux) file cache
  – echo 3 > /proc/sys/vm/drop_caches
• Test result

Process Time           DB Time  Dedup Time   Dedup candidate POI size  Matched POI percent
0:01:46, 10 ms/POI     4 ms     6 ms         51                        0.63
0:22:58.844 (db only)  137 ms   / (removed)  51                        /
30 times slower when the OS file cache is cleaned. Second Declaration withdrawn: the DB only looked fast because of the OS file cache.
Mysql Index Preloading
• MySQL index preloading
  – key_buffer_size 4096M
  – load index into cache us_ta_1 (INDEX_NODEX_INDEX);
• Nearly no effect – the DB query time is nearly the same.
Data file is bottleneck
• The key index does not seem to help; is the bottleneck in data-file reading (an assumption)?
• Verify:
  1) Reorder the 23 million records along a Hilbert curve, so that neighboring POIs are also adjacent on disk, reducing disk seeks
  2) Build a new table where each row is <node, POIs in the node>, reducing I/O operations when reading one node's POIs
Data file is bottleneck
• Re-order POI in the DB

insert into us_ta_2 (select * from us_ta_1 order by node_index)

• Test result

Run                           Process Time        DB Time  Dedup Time   Candidate size  Matched percent
Baseline (warm cache)         0:01:46, 10 ms/POI  4 ms     6 ms         51              0.63
First run, before reorder     0:22:58.844 (db only)  137 ms  / (removed)  51            /
First run, after reorder      0:03:10.985 (db only)  19 ms   /            51            /
Later run                     0:00:46.360 (db only)  4 ms    / (removed)  51            /
Multiple-Thread
• DB only

           Process Time (db only)  DB Time  Dedup Time  Candidate size  Matched percent
1 thread   0:03:10.985             19 ms    /           51              /
4 threads  0:01:05.406             24 ms    /           51              /
8 threads  0:00:38.328             29 ms    /           51              /

• DB & dedup

                               Process Time  DB Time  Dedup Time  Candidate size  Matched percent
1 thread                       0:04:07.125   18 ms    5 ms        51              0.6387
4 threads db, 2 threads dedup  0:01:11.328   25 ms    9 ms        51              0.6387
4 threads db, 1 thread dedup   0:01:22.953   28 ms    7 ms        51              0.6387
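The "n threads db, m threads dedup" split above can be sketched with two thread pools, so the I/O-bound queries and the CPU-bound matching overlap. `getAdjacentDedupPois` and `isDuplicate` here are hypothetical stand-ins for the real DAO and `IDeDuper` calls:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class DedupPipeline {
    // Hypothetical stand-ins for DedupWorkPoiDao.getAdjacentDedupPois
    // and IDeDuper.getDuplications from the deck.
    static List<Integer> getAdjacentDedupPois(int target) {
        return Arrays.asList(target - 1, target, target + 1);
    }
    static boolean isDuplicate(int target, int cand) { return target == cand; }

    // DB threads fetch candidates and hand them off to dedup threads.
    static int run(int pois, int dbThreads, int dedupThreads) throws Exception {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);
        ExecutorService dedupPool = Executors.newFixedThreadPool(dedupThreads);
        AtomicInteger matched = new AtomicInteger();
        List<Future<?>> dbWork = new ArrayList<>();
        for (int poi = 0; poi < pois; poi++) {
            final int target = poi;
            dbWork.add(dbPool.submit(() -> {
                List<Integer> candidates = getAdjacentDedupPois(target); // DB stage
                dedupPool.submit(() -> {                                 // dedup stage
                    for (int c : candidates)
                        if (isDuplicate(target, c)) matched.incrementAndGet();
                });
            }));
        }
        for (Future<?> f : dbWork) f.get();   // all DB queries (and hand-offs) done
        dedupPool.shutdown();
        dedupPool.awaitTermination(1, TimeUnit.MINUTES);
        dbPool.shutdown();
        return matched.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(1000, 4, 2)); // each POI matches itself once -> 1000
    }
}
```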
Another assumption
Assumption: building a local cache and processing POIs in Hilbert-curve order would help greatly.
Cache: <node, POIs in the node>
DB query: get POIs in the given nodes
Query:
- Pick up the nodes that have a local cache
- DB query only for the nodes that have no local cache
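The cache/DB split just described might look like the sketch below; `queryDb`, the POI representation (`String`), and the `lastMissCount` counter are illustrative assumptions, not the deck's actual code:

```java
import java.util.*;

public class NodePoiCache {
    private final Map<Long, List<String>> cache = new HashMap<>();
    int lastMissCount; // how many nodes had to go to the DB on the last call

    // Hypothetical stand-in for the real "get POI in given nodes" DB query.
    private Map<Long, List<String>> queryDb(List<Long> nodes) {
        Map<Long, List<String>> out = new HashMap<>();
        for (long n : nodes) out.put(n, Collections.singletonList("poi@" + n));
        return out;
    }

    // Serve cached nodes locally; query the DB only for the misses.
    public Map<Long, List<String>> getPois(Collection<Long> nodes) {
        Map<Long, List<String>> result = new HashMap<>();
        List<Long> misses = new ArrayList<>();
        for (long n : nodes) {
            List<String> hit = cache.get(n);
            if (hit != null) result.put(n, hit); else misses.add(n);
        }
        lastMissCount = misses.size();
        if (!misses.isEmpty()) {
            Map<Long, List<String>> fromDb = queryDb(misses);
            cache.putAll(fromDb);
            result.putAll(fromDb);
        }
        return result;
    }

    public static void main(String[] args) {
        NodePoiCache c = new NodePoiCache();
        c.getPois(Arrays.asList(1L, 2L, 3L));
        System.out.println(c.lastMissCount); // 3: cold cache, all from DB
        c.getPois(Arrays.asList(2L, 3L, 4L));
        System.out.println(c.lastMissCount); // 1: only node 4 hits the DB
    }
}
```

Processing POIs in Hilbert order makes consecutive targets share grid nodes, which is what raises the hit ratio measured below.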
Hilbert Curve
A Hilbert curve gives a mapping between 1D and 2D space that preserves locality fairly well.

[Figure: 5k POI plotted in DB ordering vs. Hilbert-curve ordering]
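The 2D→1D mapping can be computed with the standard iterative Hilbert algorithm; sorting POIs (or their grid cells) by this key is one way to produce the reordered table, though the deck does not show its actual implementation:

```java
public class HilbertOrder {
    // Map a cell (x, y) on an n-by-n grid (n a power of two) to its distance
    // along the Hilbert curve. Sorting rows by this key keeps spatially
    // neighboring POIs adjacent on disk.
    static long xy2d(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = ((x & s) > 0) ? 1 : 0;
            int ry = ((y & s) > 0) ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            // rotate the quadrant so the sub-curve is oriented consistently
            if (ry == 0) {
                if (rx == 1) {
                    x = s - 1 - x;
                    y = s - 1 - y;
                }
                int t = x; x = y; y = t;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // On a 2x2 grid the curve visits (0,0), (0,1), (1,1), (1,0) in order.
        System.out.println(xy2d(2, 0, 0) + " " + xy2d(2, 0, 1) + " "
                + xy2d(2, 1, 1) + " " + xy2d(2, 1, 0)); // 0 1 2 3
    }
}
```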
The truth
#          Distance    Total Time  DB Parameters  DB Time  Candidate size  Cache hit ratio
first run  100         47s         4              4.7 ms   80
           100, cache  41s         4              4.1 ms   80              5% (1679/40986)
first run  100, cache  48s         4              4.8 ms   80              5%
           100         41s         4              4.1 ms   80
           500, cache              37             11 ms    474             11%
           500                     37             18 ms    474

(OS file cache was not cleaned.)

Assumption revised: building a local cache and processing POIs in Hilbert-curve order does help some – when the data is not too sparse.
Summary
• The SQL itself is very simple – no tuning point?

select * from us_ta_1 where node_index in ( ?, ?, ? ... )

• Multi-threading is necessary to increase throughput
  – Separate dedup from the DB query (dedup is also time-consuming when the candidate size is big)
Jump out of the box
• A new <node, POI> table
• NoSQL storage with spatial support <node, POI>
• CoSE to search candidates
• Hadoop (Map-Reduce)
Performance Tuning Tips
• Test to verify assumptions
• Make the environment as close to real as possible
  – Do not mock
  – Do not talk to a US DB from CN
• Repeat tests to get a consistent result (results can be reproduced)
• Do not miss any exceptional case (the first run is slower than later runs)
• Consider both the (MySQL) client and server side
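A tiny harness for the "repeat the test" tip: time each run separately so cold-cache (first-run) effects stay visible instead of being averaged away. The workload and run count here are arbitrary placeholders:

```java
import java.util.ArrayList;
import java.util.List;

public class RepeatTimer {
    // Time a task several times and return every per-run duration in ms,
    // so the first (cold) run can be compared against the later (warm) runs.
    static List<Long> timeRuns(Runnable task, int runs) {
        List<Long> millis = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            task.run();
            millis.add((System.nanoTime() - t0) / 1_000_000);
        }
        return millis;
    }

    public static void main(String[] args) {
        List<Long> runs = timeRuns(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        }, 6);
        System.out.println(runs.size()); // 6 timings; inspect run 1 vs the rest
    }
}
```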