Bullet Cache

Uploaded by ivan-voras on 14-Apr-2018

TRANSCRIPT

  • 7/30/2019 Bullet Cache

    1/40

    Bullet Cache

Balancing speed and usability in a cache server

    Ivan Voras

  • Slide 2/40

    What is it?

    People know what memcached is... mostly

    Example use case:

So you have a web page which is just dynamic enough so you can't cache it completely as an HTML dump

    You have a SQL query on this page which is 99.99% always the same (same query, same answer)

    ...so you cache it

  • Slide 3/40

    Why a cache server?

Sharing between processes on different servers

    In environments which do not implement application persistence:

    CGI, FastCGI, PHP

    Or you're simply lazy and want something which works...

  • Slide 4/40

    A little bit of history...

This started as my pet project...

    It's so ancient that when I first started working on it, Memcached was still single-threaded

    It's gone through at least one rewrite and a whole change of concept

    I made it because of the frustration I felt while working with Memcached

    Key-value databases are so very basic

    I could do better than that :)

  • Slide 5/40

    Now...

Used in production in my university's project

    Probably the fastest memory cache engine available (in specific circumstances)

    Available in FreeBSD ports (databases/mdcached)

    Has a 20-page User Manual :)

  • Slide 6/40

    What's wrong with memcached?

Nothing much; it's solid work

    The classic problem: cache expiry / invalidation

    memcached accepts a list of records to expire (inefficient, need to maintain this list)

    It's fast but is it fast enough?

    Does it really make use of multiple CPUs as efficiently as possible?

  • Slide 7/40

    Introducing the Bullet Cache

1. Offers a smarter data structure to the user side than a simple key-value pair

    2. Implements interesting internal data structures

    3. Some interesting bells & whistles

  • Slide 8/40

    User-visible structure

Traditional (memcached) style:

    Key-value pairs

    Relatively short keys (255 bytes)

    ASCII-only keys (?) (ASCII-only protocol)

    Multi-record operations only with a list of records

    Simple atomic operations (relatively inefficient - atoi())

  • Slide 9/40

    Introducing record tags

    They are metadata

    Constraints:

Must be fast (they are NOT db indexes)

    Must allow certain types of bulk operations

    The implementation:

    Both key and value are signed integers

    No limit on the number of tags per record

    Bulk queries: (tkey X) && (tval1, [tval2...])

  • Slide 10/40

    Record tags

I heard you like key-value records so I've put key-value records into your key-value records...

    [Diagram: each record carries, besides its key and value, a list of (k, v) tag pairs as generic metadata]

  • Slide 11/40

    Metadata queries (1)

Use case example: a web application has a page /contacts which contains data from several SQL queries as well as other sources (LDAP)

    Tag all cached records with (tkey, tval) = (42, hash(/contacts))

    When constructing the page, issue query: get_by_tag_values(42, hash(/contacts))

    When expiring all data, issue query: del_by_tag_values(42, hash(/contacts))

  • Slide 12/40

    Metadata queries (2)

Use case example: application objects are stored (serialized, marshalled) into the cache, and there's a need to invalidate (expire) all objects of a certain type

    Tag records with (tkey, tval) = (object_type, instance_id)

    Expire with del_by_tag_values(object_type, instance_id)

    Also possible: tagging object interdependence

  • Slide 13/40

    Under the hood

    It's interesting...

Started as a C project, now mostly converted to C++ for easier modularization

    Still uses C-style structures and algorithms for the core parts, i.e. not std:: containers

    Contains tests and benchmarks within the main code base

    C and PHP client libraries

  • Slide 14/40

    The main data structure

A forest of trees, anchored in hash table buckets

    Buckets are directly addressed by hashing record keys

    Buckets are protected by rwlocks

    [Diagram: a hash table whose entries (hash value = H1, H2, ...) each hold an RW lock and the root of a red-black tree of nodes]

  • Slide 15/40

    Basic operation

1. Find h = Hash(key)
    2. Acquire lock #h
    3. Find record in RB tree indexed by key
    4. Perform operation
    5. Release lock #h

    [Diagram: the five steps traced over the hash table of rwlock-protected red-black trees]

  • Slide 16/40

Record tags follow a similar pattern

    The tags index the main structure and are maintained (almost) independently

    [Diagram: a second hash table of tag buckets (hash value = H1, H2, ...), each protected by an RW lock and holding tag entries TK1 TK2 TK3 TK4]

  • Slide 17/40

    Concurrency and locking

Concurrency is great: the default configuration starts 256 record buckets and 64 tag buckets

    Locking is without ordering assumptions

    *_trylock() for everything

    rollback-and-retry

    No deadlocks

    Livelocks, on the other hand, need to be investigated

  • Slide 18/40

Two-way linking between records and tags

    [Diagram: the record hash table (RW lock + red-black tree of nodes per bucket) and the tag hash table (RW lock + TK1 TK2 TK3 TK4 per bucket), with links running both ways between records and tags]

  • Slide 19/40

    Concurrency

Scenario 1: a record is referenced; need to hold N tag bucket locks

    Scenario 2: a tag is referenced; need to hold M record bucket locks

    [Chart: percentage of uncontested lock acquisitions (U) vs. number of hash table buckets (H = 4 to 256), for NCPU=64 and NCPU=8, shared and exclusive acquisitions]

  • Slide 20/40

    Multithreading models

    Aka which thread does what

Three basic tasks:

    T1: Connection acceptance

    T2: Network IO

    T3: Payload work

    The big question: how to distribute these into threads?

  • Slide 21/40

    Multithreading models

SPED: Single process, event driven

    SEDA: Staged, event-driven architecture

    AMPED: Asymmetric, multi-process, event-driven

    SYMPED: Symmetric, multi-process, event driven

    Model   | New connection handler | Network IO handler   | Payload work
    --------+------------------------+----------------------+----------------------
    SPED    | 1 thread               | In connection thread | In connection thread
    SEDA    | 1 thread               | N1 threads           | N2 threads
    SEDA-S  | 1 thread               | N threads            | N threads
    AMPED   | 1 thread               | 1 thread             | N threads
    SYMPED  | 1 thread               | N threads            | In network thread

  • Slide 22/40

    All the models are event-driven

The dumb model: thread-per-connection

    Not really efficient (FreeBSD has experimented with KSE and M:N threading but that didn't work out)

    IO events: via kqueue(2)

    Inter-thread synchronization: queues signalled with CVs

  • Slide 23/40

    SPED

    Single-threaded, event-driven

Very efficient on single-CPU systems

    Most efficient if the operation is very fast (compared to network IO and event handling)

    Used in efficient Unix network servers

  • Slide 24/40

    SEDA

    Staged, event-driven

Different task threads instantiated in different numbers

    Generally, N1 != N2 != N3

    The most queueing

    The most separation of tasks; most CPUs used

    [Diagram: one T1 thread feeding several T2 threads, which feed several T3 threads]

  • Slide 25/40

    AMPED

    Asymmetric multi-process event-driven

    Asymmetric: N(T2) != N(T3)

Assumes network IO processing is cheap compared to operation processing

    Moderate amount of queuing

    Can use an arbitrary number of CPUs

    [Diagram: one T1 thread and one T2 thread feeding several T3 threads]

  • Slide 26/40

    SYMPED

    Symmetric multi-process event-driven

    Symmetric: grouping of tasks

Assumes network IO and operation processing are similarly expensive but uniform

    Sequential processing inside threads

    Similar to multiple instances of SPED

    [Diagram: one T1 thread feeding several combined T2+T3 threads]

  • Slide 27/40

Multithreading models in Bullet Cache

    Command-line configuration:

    n: number of network threads
    t: number of payload threads

    n=0, t=0: SPED
    n=1, t>0: AMPED
    n>0, t=0: SYMPED
    n>1, t>0: SEDA
    n>1, t>1, n=t: SEDA-S (symmetrical)

  • Slide 28/40

    How does that work?

SEDA: the same network loop accepts connections and network IO

    Others: the network IO threads accept messages, then either:

    process them in-thread, or

    queue them on worker thread queues

    Response messages are either sent in-thread from whichever thread generates them, or finished with the IO event code

  • Slide 29/40

    Performance of various models

[Chart: thousands of transactions/s (0-500) vs. number of clients (20-140) for SPED, SEDA, SEDA-S, AMPED and SYMPED]

    Except in special circumstances, SYMPED is best

  • Slide 30/40

    Why is SYMPED efficient?

The same thread receives the message and processes it

    No queueing

    No context switching

    In the optimal case: no (b)locking delays of any kind

    Downsides:

    Serializes network IO and processing within the thread (which is OK if per-CPU)

  • Slide 31/40

Notable performance optimizations

    Zero-copy operation

    Queries which do not involve complex processing or record aggregation are satisfied directly from data structures

    Zero-malloc operation

    The code re-uses memory buffers as much as possible; the fast path is completely malloc()- and memcpy()-free

    Adaptive dynamic buffer sizes

    malloc() usage is tracked to avoid realloc()

  • Slide 32/40

    State of the art

[Chart: performance in TPS (200,000-2,000,000) vs. average record data size (92-3723 bytes) for System A (June 2010), System B (Jan 2011), System B (Jun 2011), System C (Dec 2011) and System D (Mar 2012)]

  • Slide 33/40

    State of the art

[Chart: performance (600,000-2,000,000 TPS) on a Xeon 5675, 6-core, 3 GHz, with and without HTT; FreeBSD 9-stable, March 2012]

  • Slide 34/40

...under certain conditions

    The optimal, fast path (zero-memcpy, zero-malloc, optimal buffers)

    Which is actually less important; we know that these algorithms are fast...

    Using Unix domain sockets

    Much more important

    FreeBSD's network stack (the TCP path) is currently basically non-scalable to SMP(?)

    UDP path is more scalable (WIP)

  • Slide 35/40

    TCP vs Unix sockets

[Chart: performance (0-1,000,000 TPS) vs. average record size (104-4204 bytes) on System D for mdcached, memcached and redis]

  • Slide 36/40

    NUMA Effects

[Chart: performance (0-2,000,000 TPS) vs. number of clients (1-40) for: System D, cpuset(4)-bound to a single socket; System D / NUMA, client & server cpuset(4)-bound to separate sockets; System D / ULE, no cpuset(4), only ULE]

    It's unlikely that better NUMA support would help at all...

  • Slide 37/40

    Scalability wrt number of records

[Chart: mdcached performance (300,000-440,000 TPS) vs. number of records (1,000-10,000,000)]

  • Slide 38/40

    Bells & whistles

Binary protocol (endian-dependent)

    Extensive atomic operation set:

    cmpset, add, fetchadd, readandclear

    Stack operations:

    Every tag (tk, tv) identifies a stack

    Push and pop operations on records

    Periodic data dumps / checkpoints

    Cache pre-warm (load from file)

  • Slide 39/40

    Usage ideas

Application data cache, database cache

    Semantically tag cached records

    Efficient retrieval and expiry (deletion)

    Primary data storage

    High-performance ephemeral storage

    Optional periodic checkpoints

    Data sharing between app server nodes

    Esoteric: distributed lock manager, stack

  • Slide 40/40

    Bullet Cache

Balancing speed and usability in a cache server

    http://www.sf.net/projects/mdcached

    Ivan Voras