running a realtime stats service on my sql

Post on 21-Jan-2015

1.922 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

 

TRANSCRIPT

Running a Realtime Stats Service on MySQL

Cybozu Labs, Inc. Kazuho Oku

Background

Apr. 23 2009 Running Realtime Stats Service on MySQL 2

Who am I?

 Name: Kazuho Oku (奥 一穂)  Original Developer of Palmscape / Xiino

 The oldest web browser for Palm OS

 Working at Cybozu Labs since 2005  Research subsidiary of Cybozu, Inc.  Cybozu is a leading groupware vendor in Japan  My weblog: tinyurl.com/kazuho

Apr. 23 2009 Running Realtime Stats Service on MySQL 3

Introduction of Pathtraq

Apr. 23 2009 Running Realtime Stats Service on MySQL 4

What is Pathtraq?

 Started in Aug. 2007  Web ranking service

 One of Japan’s largest  ~10,000 users submit access information  ~1,000,000 access infomation per day

 like Alexa, but semi-realtime, and per-page

Apr. 23 2009 Running Realtime Stats Service on MySQL 5

What is Pathtraq? (cont'd)

 Automated Social News Service  find what's hot  like Google News + Digg  calculate relevance from access stats

 Search by...  no filtering (all the Internet)  by category  by keyword  by URL (per-domain, etc.)

Apr. 23 2009 Running Realtime Stats Service on MySQL 6

How to Provide Real-time Analysis?

 Data Set (as of Apr. 23 2009)  # of URLs: 147,748,546  # of total accesses: 413,272,527

 Sharding is not a good option  since we need to join the tables and aggregate

 prefix-search by URL, search by keyword, then join with access data table

 core tables should be stored on RAM  not on HDD, due to lots of random access

Apr. 23 2009 Running Realtime Stats Service on MySQL 9

Our Decision was to...

 Keep URL and access stats on RAM  compression for size and speed

 Create a new message queue  Limit Pre-computation Load  Create our own cache, with locks

 to minimize database access

 Fulltext-search database on SSD

Apr. 23 2009 Running Realtime Stats Service on MySQL 10

Our Servers

 Main Server  Opteron 2218 x2, 64GB Mem  MySQL, Apache

 Fulltext Search Server  Opteron 240EE, 2GB Mem, Intel SSD  MySQL (w. Tritonn/Senna)

 Helper Servers  for Content Analysis  for Screenshot Generation

Apr. 23 2009 Running Realtime Stats Service on MySQL 11

The Long Tail of the Internet

y=C・x-0.44

# of URLs with 1/10 hits: x2.75

Apr. 23 2009 Running Realtime Stats Service on MySQL 12

Compressing URLs

Apr. 23 2009 Running Realtime Stats Service on MySQL 13

Compressing URLs

 The Challenges:  URLs are too short for gzip, etc.  URLs should be prefix-searchable in compressed

form  How to run like 'http://www.mysql.com/%' on a

compressed URL?

 The Answer:  Static PPM + Range Coder

Apr. 23 2009 Running Realtime Stats Service on MySQL 14

Static PPM

 PPM: Prediction by Partial Matching  What is the next character after ".co"?

 The answer is "m"!

 PPM is used by 7-zip, etc.

 Static PPM is PPM with static probabilistic model  Many URLs (or English words) have common

patterns  Suitable for short texts (like URLs)

Apr. 23 2009 Running Realtime Stats Service on MySQL 15

Range Coder

 A fast variant of arithmetic compression  similar to huffmann encoding, but better  If probability of next character being "m" was

75%, it will be encoded into 0.42 bit

 Compressed strings preserve the sort order of uncompressed form

Apr. 23 2009 Running Realtime Stats Service on MySQL 16

Create Compression Functions

 Build prediction table from stored URLs  Implement range coder

 took an open-source impl. and optimized it  original impl. added some bits unnecessary at the tail  use SSE instructions for faster operation  coderepos.org/share/browser/lang/cplusplus/range_coder

 Link the coder and the table to create MySQL UDFs

Apr. 23 2009 Running Realtime Stats Service on MySQL 17

Rewriting the Server Logic

 Change schema url varchar(255) not null # with unique index ↓ urlc varbinary(767) not null # with unique index

 Change prefix-search form url like 'http://example.com/%' ↓ url_compress('http://example.com/')<=urlc and

urlc<url_compress('http://example.com0')

Note: "0" is next character of '/' Apr. 23 2009 Running Realtime Stats Service on MySQL 18

Compression Ratio

 Compression ratio: 37%  Size of prediction table: 4MB

 Benchmark of the compression functions  compression: 40MB/sec. (570k URLs/sec.)  decompression: 19.3MB/sec. (280k URLs/sec.)  fast enough since searchable in compressed form

 Prefix-search became faster  shorter indexes lead to faster operation

Apr. 23 2009 Running Realtime Stats Service on MySQL 19

Apr. 23 2009 Running Realtime Stats Service on MySQL 20

Re InnoDB Compression

 URL Compression can coexist with InnoDB compression

 though we aren't using InnoDB compression on our production environment

Compression Table Size N/A 100% URL compression 57% InnoDB compression 50% using both 33%

Compressing the Stats Table

 Used to have two int columns: at, cnt  it was waste of space, since...

 most cnt values are very small numbers  most accesses to each URL occur on a short period (ex.

the day the blog entry was written)  at field should be part of the indexes

Apr. 23 2009 Running Realtime Stats Service on MySQL 21

at (hours since epoch) cnt (# of hits)

330168 1

330169 2

330173 1

330197 1

Compressing the Stats Table (cont'd)

 Merge the rows into a sparse array  example on the prev. page becomes: (offset=330197),1,0(repeated 23 times),1,2,1

 Then compress the array  the example becomes a blob of 8 bytes  originally was 8 bytes x 4 rows with index

 And store the array in a single column  fewer rows lead to smaller table, faster access

Apr. 23 2009 Running Realtime Stats Service on MySQL 22

Compressing the Stats Table (cont'd)

 Write MySQL UDFs to access the sparse array

 cnt_add(column,at,cnt) -- adds cnt on given index (at)  cnt_between(column,from,to) -- returns # of hits between given hours  and more...

 We use int[N] arrays for vectorized calc.  especially when creating access charts

Apr. 23 2009 Running Realtime Stats Service on MySQL 23

Create a new Message Queue

Apr. 23 2009 Running Realtime Stats Service on MySQL 24

Q4M

 A simple, reliable, fast message queue  runs as a pluggable storage engine of MySQL  GPL License; q4m.31tools.com  presented yesterday at MySQL Conference :-p

 slides at tinyurl.com/q4m2009

 Used for relaying messages between our servers

Apr. 23 2009 Running Realtime Stats Service on MySQL 25

Limiting Pre-computation Load

Apr. 23 2009 Running Realtime Stats Service on MySQL 26

Limit # of CPU-intensive Pre-computations

 Use cron & setlock  setlock is part of daemontools by djb

 setlock  serializes processes by using flock  -n option: use trylock; if locked, do nothing

# use only one CPU core for pre-computation */2 * * * * setlock –n /tmp/tasks.lock precompute_hot_entries 5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data

Apr. 23 2009 Running Realtime Stats Service on MySQL 27

Limit # of Disk-intensive Pre-computations

 Divide pre-computation to blocks and sleep depending on the elapsed time

my $LOAD = 0.25;

while (true) { my $start = time(); precompute_block(); sleep(min(time - $start, 0) * (1 - $LOAD) / $LOAD); }

Apr. 23 2009 Running Realtime Stats Service on MySQL 28

Creating our own Cache System

Apr. 23 2009 Running Realtime Stats Service on MySQL 29

The Problem

 Query cache is flushed on table update  access stats can be (should be) cached for a

certain period

 Memcached has a thundering-herd problem  all clients try to read the database when a

cached-entry expires  critical for us since our queries does joins,

aggregations, and sort operations

Apr. 23 2009 Running Realtime Stats Service on MySQL 30

Swifty and KeyedMutex

 Swifty is a mmap-based cache  cached data shared between processes  lock-free on read, flock on write  notifies a single client that the accessed entry is

going to expire within few seconds  notified client can start updating a cache entry

before it expires

 KeyedMutex  a daemon used to block multiple clients issuing

same SQL queries Apr. 23 2009 Running Realtime Stats Service on MySQL 31

Swifty and KeyedMutexd (cont'd)

 Source codes are available:  coderepos.org/share/browser/lang/c/swifty  coderepos.org/share/browser/lang/perl/Cache-Swifty  coderepos.org/share/browser/lang/perl/KeyedMutex

Apr. 23 2009 Running Realtime Stats Service on MySQL 32

Fulltext-search on SSD

Apr. 23 2009 Running Realtime Stats Service on MySQL 33

Senna / Tritonn

 Senna is a FTS engine popular in Japan  might not work well with European languages

 Tritonn is a replacement of MyISAM FTS  uses Senna as backend  faster than MyISAM FTS

 Wrote patches to support SSD  during our transition from RAM to SSD  patches accepted in Senna 1.1.4 / Tritonn 1.0.12

Apr. 23 2009 Running Realtime Stats Service on MySQL 34

FTS: RAM-based vs. SSD-based

 Size of FTS data: ~ 20GB  Downgraded hardware to see if SSD-

based FTS is feasible  Speed became ¼

 but latency of searches are well below one second

Apr. 23 2009 Running Realtime Stats Service on MySQL 35

Old Hardware New Hardware

CPU Opteron 2218 (2.6GHz) x2 Opteron 240 (1.4GHz)

Memory 32GB 2GB

Storage 7200rpm SATA HDD SSD (Intel X25-M)

Summary

Apr. 23 2009 Running Realtime Stats Service on MySQL 36

Summary

 Use UDFs for optimization  Sometime it is easier to scale UP

 esp. when you can estimate your data growth

 Use SSD for FTS  Baidu (China's leading search engine) uses SSD

 Most of the things introduced are OSS  We plan to open-source our URL compression

table as well

Apr. 23 2009 Running Realtime Stats Service on MySQL 37

We are Looking for...

 If you are interested in localizing Pathtraq to your country, please contact us  we do not have resources outside of Japan

 to translate the web interface  to ask people to install our browser extension  to follow local regulations, etc.

Apr. 23 2009 Running Realtime Stats Service on MySQL 38

Thank you for listening

tinyurl.com/kazuho

Apr. 23 2009 Running Realtime Stats Service on MySQL 39

top related