Post on 21-Jan-2015
Running a Realtime Stats Service on MySQL
Cybozu Labs, Inc. Kazuho Oku
Background
Apr. 23 2009 Running Realtime Stats Service on MySQL 2
Who am I?
Name: Kazuho Oku (奥 一穂)
Original developer of Palmscape / Xiino
  The oldest web browser for Palm OS
Working at Cybozu Labs since 2005
  Research subsidiary of Cybozu, Inc.
  Cybozu is a leading groupware vendor in Japan
My weblog: tinyurl.com/kazuho
Introduction of Pathtraq
What is Pathtraq?
Started in Aug. 2007
Web ranking service
  One of Japan's largest
  ~10,000 users submit access information
  ~1,000,000 access records per day
  Like Alexa, but semi-realtime and per-page
What is Pathtraq? (cont'd)
Automated social news service
  Finds what's hot
  Like Google News + Digg
  Calculates relevance from access stats
Search by...
  no filtering (the whole Internet)
  category
  keyword
  URL (per-domain, etc.)
How to Provide Real-time Analysis?
Data set (as of Apr. 23 2009)
  # of URLs: 147,748,546
  # of total accesses: 413,272,527
Sharding is not a good option, since we need to join the tables and aggregate
  prefix-search by URL or search by keyword, then join with the access data table
Core tables should be stored in RAM, not on HDD, due to lots of random access
Our Decision was to...
Keep URL and access stats in RAM
  compression for size and speed
Create a new message queue
Limit pre-computation load
Create our own cache, with locks
  to minimize database access
Fulltext-search database on SSD
Our Servers
Main server
  Opteron 2218 x2, 64GB mem
  MySQL, Apache
Fulltext search server
  Opteron 240EE, 2GB mem, Intel SSD
  MySQL (w. Tritonn/Senna)
Helper servers
  for content analysis
  for screenshot generation
The Long Tail of the Internet
$y = C \cdot x^{-0.44}$
# of URLs with 1/10 the hits: ×2.75
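The ×2.75 figure follows directly from the exponent. The sketch below assumes that x is the number of hits per URL and y is the number of URLs with that many hits; the slide does not label the axes, so this reading is an assumption, chosen because it reproduces the slide's number. The function name is hypothetical.

```python
def url_count_ratio(hit_ratio, exponent=-0.44):
    """Factor by which the number of URLs changes when the hit count is
    scaled by hit_ratio, under the power law y = C * x**exponent.
    The constant C cancels out of the ratio."""
    return hit_ratio ** exponent


# URLs with one tenth of the hits are about 2.75 times as numerous.
ratio = url_count_ratio(1 / 10)
print(round(ratio, 2))
```

In other words, each step down the tail by a factor of ten in popularity multiplies the number of URLs to store by roughly 2.75.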
Compressing URLs
The challenges:
  URLs are too short for gzip, etc.
  URLs should be prefix-searchable in compressed form
  How to run a query like 'http://www.mysql.com/%' on a compressed URL?
The answer:
  Static PPM + Range Coder
Static PPM
PPM: Prediction by Partial Matching
  What is the next character after ".co"?
  The answer is "m"!
  PPM is used by 7-zip, etc.
Static PPM is PPM with a static probabilistic model
  Many URLs (or English words) have common patterns
  Suitable for short texts (like URLs)
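To make the idea of a static model concrete, here is a minimal sketch of a fixed-order prediction table built once from a corpus. It is not the Pathtraq implementation: the corpus, the context length of 3, and the function names are assumptions for illustration, and a real PPM model also falls back to shorter contexts when a context is unseen, which this sketch leaves out.

```python
from collections import Counter, defaultdict


def build_table(corpus, order=3):
    """Count which character follows each `order`-character context."""
    table = defaultdict(Counter)
    for text in corpus:
        for i in range(order, len(text)):
            table[text[i - order:i]][text[i]] += 1
    return table


def probability(table, context, ch):
    """Estimated P(ch | context); 0.0 if the context was never seen."""
    counts = table.get(context)
    if not counts:
        return 0.0
    return counts[ch] / sum(counts.values())


# Hypothetical training corpus standing in for the stored URLs.
URLS = [
    "http://www.mysql.com/",
    "http://example.com/",
    "http://www.example.co.jp/",
    "http://dev.mysql.com/doc/",
]
TABLE = build_table(URLS)

# ".co" is followed by "m" three times and by "." once in this corpus.
print(probability(TABLE, ".co", "m"))
```

Because the table is built ahead of time and never updated while coding, every URL is compressed against the same probabilities, which is what makes the scheme usable on very short inputs.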
Range Coder
A fast variant of arithmetic coding
  similar to Huffman coding, but better
  If the probability of the next character being "m" is 75%, it is encoded in 0.42 bits
Compressed strings preserve the sort order of the uncompressed form
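Both claims can be checked with a toy coder. The sketch below uses exact fractions instead of the fixed-width integer arithmetic of a real range coder, and a hypothetical three-symbol order-0 model; it only illustrates why listing the symbols in byte order keeps the encoded values in the same order as the original strings.

```python
from fractions import Fraction
from math import log2

# Hypothetical static model: (symbol, probability), listed in byte order.
# "\0" marks end-of-string and sorts before every real character.
MODEL = [("\0", Fraction(1, 8)), ("a", Fraction(1, 8)), ("m", Fraction(3, 4))]


def cost_bits(p):
    """Ideal code length in bits for a symbol of probability p."""
    return -log2(p)


def interval(text):
    """Narrow [0, 1) one symbol at a time; return (low, width)."""
    low, width = Fraction(0), Fraction(1)
    for ch in text + "\0":
        cumulative = Fraction(0)
        for sym, p in MODEL:
            if sym == ch:
                low += width * cumulative
                width *= p
                break
            cumulative += p
        else:
            raise ValueError(f"symbol {ch!r} is not in the model")
    return low, width


words = ["mam", "a", "mm", "", "am", "ma", "aa"]
by_code = sorted(words, key=lambda w: interval(w)[0])
print(by_code == sorted(words))   # same order as plain string sorting
print(round(cost_bits(0.75), 2))  # cost of a 75%-likely symbol
```

Each string maps to its own sub-interval, and the sub-intervals are laid out in the same order as the symbols, so comparing the encoded values gives the same result as comparing the strings.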
Create Compression Functions
Build prediction table from stored URLs
Implement range coder
  took an open-source impl. and optimized it
  original impl. added some unnecessary bits at the tail
  use SSE instructions for faster operation
  coderepos.org/share/browser/lang/cplusplus/range_coder
Link the coder and the table to create MySQL UDFs
Rewriting the Server Logic
Change schema
  url varchar(255) not null # with unique index
  ↓
  urlc varbinary(767) not null # with unique index
Change prefix-search form
  url like 'http://example.com/%'
  ↓
  url_compress('http://example.com/')<=urlc and urlc<url_compress('http://example.com0')
Note: "0" is the next character after '/'
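The rewritten query turns a prefix match into a half-open range whose upper bound is the prefix with its last byte incremented. A minimal sketch of that bound computation follows; plain byte strings stand in for the output of url_compress, which works only because the compression preserves sort order, and the function names here are hypothetical.

```python
import bisect


def prefix_range(prefix: bytes):
    """Return (low, high) such that low <= key < high matches every key
    starting with prefix. high is None when no upper bound exists
    (empty prefix, or a prefix made only of 0xff bytes)."""
    body = prefix.rstrip(b"\xff")  # 0xff cannot be incremented
    if not body:
        return prefix, None
    return prefix, body[:-1] + bytes([body[-1] + 1])


def prefix_search(sorted_keys, prefix):
    """Range scan over a sorted list, standing in for an index lookup."""
    low, high = prefix_range(prefix)
    start = bisect.bisect_left(sorted_keys, low)
    end = len(sorted_keys) if high is None else bisect.bisect_left(sorted_keys, high)
    return sorted_keys[start:end]


keys = sorted([
    b"http://example.com/",
    b"http://example.com/a",
    b"http://example.com.evil/",
    b"http://example.org/",
])
print(prefix_range(b"http://example.com/"))
print(prefix_search(keys, b"http://example.com/"))
```

The bound is computed on the uncompressed prefix and then compressed, as in the query above, rather than by incrementing the compressed bytes.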
Compression Ratio
Compression ratio: 37%
Size of prediction table: 4MB
Benchmark of the compression functions
  compression: 40MB/sec. (570k URLs/sec.)
  decompression: 19.3MB/sec. (280k URLs/sec.)
  fast enough, since URLs are searchable in compressed form
Prefix-search became faster
  shorter indexes lead to faster operation
Re InnoDB Compression
URL compression can coexist with InnoDB compression
  though we aren't using InnoDB compression in our production environment
Compression           Table Size
N/A                   100%
URL compression       57%
InnoDB compression    50%
using both            33%
Compressing the Stats Table
Used to have two int columns: at, cnt
  it was a waste of space, since...
  most cnt values are very small numbers
  most accesses to each URL occur in a short period (e.g. the day the blog entry was written)
  the at field should be part of the indexes
at (hours since epoch)    cnt (# of hits)
330168                    1
330169                    2
330173                    1
330197                    1
Compressing the Stats Table (cont'd)
Merge the rows into a sparse array
  the example on the prev. page becomes: (offset=330197),1,0(repeated 23 times),1,0(repeated 3 times),2,1
Then compress the array
  the example becomes a blob of 8 bytes
  originally it was 8 bytes x 4 rows, with index
And store the array in a single column
  fewer rows lead to a smaller table and faster access
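The merge step can be sketched as follows. This covers only the row-to-array conversion, not the compression into a blob, and the layout (newest hour as the offset, counts running backwards in time) is inferred from the example; the function name is hypothetical.

```python
def to_sparse_array(rows):
    """Merge (at, cnt) rows into (offset, counts).

    offset is the newest hour; counts[i] is the number of hits at hour
    offset - i, with hours that have no row filled in as zero."""
    by_hour = {}
    for at, cnt in rows:
        by_hour[at] = by_hour.get(at, 0) + cnt
    newest, oldest = max(by_hour), min(by_hour)
    counts = [by_hour.get(hour, 0) for hour in range(newest, oldest - 1, -1)]
    return newest, counts


# The four rows from the example table.
ROWS = [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]
offset, counts = to_sparse_array(ROWS)
print(offset, counts)
```

For these rows the array spans 30 hours. The hours 330170 to 330172 have no row, so they appear as three zeros between the entries for 330173 and 330169. The long runs of zeros are what make the array compress well.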
Compressing the Stats Table (cont'd)
Write MySQL UDFs to access the sparse array
  cnt_add(column,at,cnt) -- adds cnt at the given index (at)
  cnt_between(column,from,to) -- returns # of hits between the given hours
  and more...
We use int[N] arrays for vectorized calc.
  especially when creating access charts
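The behaviour of the two UDFs can be modelled in a few lines. This is a sketch of the semantics only, not the C implementation behind the UDFs: it works on an uncompressed (offset, counts) pair with the newest hour first, and it assumes that both bounds of cnt_between are inclusive, which the slides do not state.

```python
def cnt_add(column, at, cnt):
    """Return a new (offset, counts) with cnt added at hour `at`."""
    offset, counts = column
    if at > offset:
        # A newer hour becomes the new offset; pad the front with zeros.
        counts = [0] * (at - offset) + counts
        offset = at
    index = offset - at
    if index >= len(counts):
        # An older hour than any stored so far; pad the tail with zeros.
        counts = counts + [0] * (index - len(counts) + 1)
    else:
        counts = list(counts)  # copy, so the caller's list is not modified
    counts[index] += cnt
    return offset, counts


def cnt_between(column, start, end):
    """Total hits for hours in [start, end] (inclusive, by assumption)."""
    offset, counts = column
    return sum(c for i, c in enumerate(counts) if start <= offset - i <= end)


column = (330197, [1])
for at, cnt in [(330173, 1), (330169, 2), (330168, 1)]:
    column = cnt_add(column, at, cnt)

print(cnt_between(column, 330168, 330173))
print(cnt_between(column, 330170, 330197))
```

Keeping the counts as a plain integer array is what allows range sums like these to be computed in a single pass, which matches the note about vectorized calculation for access charts.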
Create a new Message Queue
Q4M
A simple, reliable, fast message queue
  runs as a pluggable storage engine of MySQL
  GPL License; q4m.31tools.com
  presented yesterday at MySQL Conference :-p
    slides at tinyurl.com/q4m2009
Used for relaying messages between our servers
Limiting Pre-computation Load
Limit # of CPU-intensive Pre-computations
Use cron & setlock
  setlock is part of daemontools by djb
  setlock serializes processes by using flock
  -n option: use trylock; if locked, do nothing

# use only one CPU core for pre-computation
*/2 * * * * setlock -n /tmp/tasks.lock precompute_hot_entries
5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data
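The trylock behaviour of the -n option can be illustrated with a non-blocking flock call. The sketch below is not setlock itself; it shows the same pattern with Python's fcntl module on a Unix-like system, and the function name and lock path are hypothetical.

```python
import fcntl
import os


def try_lock(path):
    """Take an exclusive flock on path without blocking.

    Returns an open file descriptor that holds the lock, or None if
    another holder already has it (the "do nothing" case of -n)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    fd = try_lock("/tmp/tasks.lock")
    if fd is None:
        print("another task is running; skipping this run")
    else:
        print("lock acquired; run the pre-computation here")
        os.close(fd)  # closing the descriptor releases the lock
```

Dropping LOCK_NB gives the blocking behaviour of the second crontab entry, which waits for the lock instead of skipping the run.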
Limit # of Disk-intensive Pre-computations
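Divide pre-computation into blocks, and sleep in proportion to the elapsed time

use List::Util qw(max);
use Time::HiRes qw(time sleep);  # sub-second timing and fractional sleep

my $LOAD = 0.25;  # target fraction of wall-clock time spent working

while (1) {
    my $start = time();
    precompute_block();
    # sleep long enough that work / (work + sleep) stays at $LOAD
    sleep(max(time() - $start, 0) * (1 - $LOAD) / $LOAD);
}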
Creating our own Cache System
The Problem
Query cache is flushed on table update
  access stats can be (should be) cached for a certain period
Memcached has a thundering-herd problem
  all clients try to read the database when a cached entry expires
  critical for us, since our queries do joins, aggregations, and sort operations
Swifty and KeyedMutex
Swifty is an mmap-based cache
  cached data shared between processes
  lock-free on read, flock on write
  notifies a single client that the accessed entry is going to expire within a few seconds
  notified client can start updating a cache entry before it expires
KeyedMutex
  a daemon used to block multiple clients issuing the same SQL queries
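The early-expiry notification can be sketched with an in-process dictionary. This is not Swifty's API or its mmap layout; the class, its method names, and the explicit `now` argument are assumptions chosen to keep the example self-contained and deterministic.

```python
class EarlyRefreshCache:
    """Cache that tells exactly one reader to refresh an entry shortly
    before it expires, so other readers keep being served the old value."""

    def __init__(self, ttl, refresh_window):
        self.ttl = ttl
        self.refresh_window = refresh_window
        self._entries = {}  # key -> [value, expires_at, refresh_claimed]

    def set(self, key, value, now):
        self._entries[key] = [value, now + self.ttl, False]

    def get(self, key, now):
        """Return (value, should_refresh); value is None on a miss."""
        entry = self._entries.get(key)
        if entry is None or now >= entry[1]:
            return None, True
        value, expires_at, claimed = entry
        if not claimed and expires_at - now <= self.refresh_window:
            entry[2] = True  # only the first reader in the window is notified
            return value, True
        return value, False


cache = EarlyRefreshCache(ttl=60, refresh_window=5)
cache.set("top-pages", "old result", now=0)
print(cache.get("top-pages", now=10))  # fresh entry: no refresh needed
print(cache.get("top-pages", now=56))  # first reader near expiry: refresh
print(cache.get("top-pages", now=57))  # later readers: served as usual
```

On a complete miss every caller is told to refresh, which is the situation a per-key mutex such as KeyedMutex is meant to handle: only one client runs the query while the others wait for its result.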
Swifty and KeyedMutex (cont'd)
Source code is available:
  coderepos.org/share/browser/lang/c/swifty
  coderepos.org/share/browser/lang/perl/Cache-Swifty
  coderepos.org/share/browser/lang/perl/KeyedMutex
Fulltext-search on SSD
Senna / Tritonn
Senna is an FTS engine popular in Japan
  might not work well with European languages
Tritonn is a replacement for MyISAM FTS
  uses Senna as backend
  faster than MyISAM FTS
Wrote patches to support SSD during our transition from RAM to SSD
  patches accepted in Senna 1.1.4 / Tritonn 1.0.12
FTS: RAM-based vs. SSD-based
Size of FTS data: ~20GB
Downgraded hardware to see if SSD-based FTS is feasible
  Speed became ¼
  but latency of searches is well below one second
           Old Hardware                 New Hardware
CPU        Opteron 2218 (2.6GHz) x2     Opteron 240 (1.4GHz)
Memory     32GB                         2GB
Storage    7200rpm SATA HDD             SSD (Intel X25-M)
Summary
Use UDFs for optimization
Sometimes it is easier to scale UP
  esp. when you can estimate your data growth
Use SSD for FTS
  Baidu (China's leading search engine) uses SSD
Most of the things introduced are OSS
  We plan to open-source our URL compression table as well
We are Looking for...
If you are interested in localizing Pathtraq to your country, please contact us
  we do not have resources outside of Japan
    to translate the web interface
    to ask people to install our browser extension
    to follow local regulations, etc.
Thank you for listening
tinyurl.com/kazuho