
Capriccio: Scalable Threads for Internet Services (von Behren)

• Non-blocking I/O, async I/O
  – Non-blocking (NB) I/O
    • Usually doesn’t work well for disks.
  – Async I/O
    • Issue a request, get completion.
• epoll()/poll()
• convoy: tendency for threads to “bunch up”
• priority inversion
• call graph
• average, weighted moving average
• capriccio: improvisatory style, free form

The Problem

• Web “transactions” involve a number of steps which must be performed in sequence.

• For high throughput, we want to service many of these requests concurrently.
  – When does concurrency help? When does it not?

• If we use a single thread per request, we will have too many threads.

• If we multiplex requests on a small set of threads, programming becomes more difficult, because each request’s progress must be tracked explicitly.

Read two numbers and add

while (true) {
    fd = get_read_ready();
    state = lookup(fd);
    if (state.step == READING_FIRST) {
        c = read(fd, …, bytes_left);
        if (have enough) {
            state.step = READING_SECOND;
        }
    } else if (state.step ==


while (true) {
    int n1, n2;
    readexact(fd, &n1, 4);
    readexact(fd, &n2, 4);
    printf("%d\n", n1 + n2);
}
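To make the contrast concrete, here is a minimal sketch of the per-connection state that the event-driven version has to carry by hand. It is an illustration under stated assumptions, not code from the slides: the names `conn_state`, `conn_init`, and `feed_bytes` are hypothetical, and bytes are fed from a buffer instead of `read()` so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical explicit state for one connection. */
enum step { READING_FIRST, READING_SECOND, DONE };

struct conn_state {
    enum step step;        /* which number is currently being read */
    unsigned char buf[4];  /* bytes of the current number received so far */
    size_t have;           /* how many bytes of buf are filled */
    int32_t n1, n2;        /* the two operands, valid once read */
};

void conn_init(struct conn_state *s) {
    assert(s != NULL);
    memset(s, 0, sizeof *s);
    s->step = READING_FIRST;
}

/* Consume whatever bytes happen to be available, then return to the
   event loop. Returns 1 once both numbers are complete, 0 otherwise. */
int feed_bytes(struct conn_state *s, const unsigned char *data, size_t len) {
    size_t i = 0;
    while (i < len && s->step != DONE) {
        s->buf[s->have++] = data[i++];
        if (s->have == 4) {            /* one full 4-byte number arrived */
            int32_t v;
            memcpy(&v, s->buf, 4);     /* host byte order, as in the slide code */
            s->have = 0;
            if (s->step == READING_FIRST) {
                s->n1 = v;
                s->step = READING_SECOND;
            } else {
                s->n2 = v;
                s->step = DONE;
            }
        }
    }
    return s->step == DONE;
}
```

Everything the threaded version keeps implicitly on its stack (which number it is reading, how many bytes have arrived) shows up here as a field of `conn_state`.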

Thread Design and Scalability

The Case for User-Level Threads

• Flexibility
  – Level of indirection between applications and the kernel, which helps decouple the two.
  – Kernel-level thread scheduling must handle all applications. User-level scheduling can be tailored.
  – Lightweight, which means we can use zillions of them.

• Performance
  – Cooperative scheduling is nearly free.
  – Do not require a kernel crossing for uncontended locks. (Why do contended locks require kernel crossings?)

• Disadvantages
  – Non-blocking I/O requires an additional system call. (Why?)
  – Harder to take advantage of SMPs.
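One plausible answer to the "(Why?)" above can be sketched in code: a blocking `read()` is a single system call, whereas a non-blocking `read()` that finds no data must be followed by a readiness query (`poll()` here) and then a second `read()`. The wrapper name `nb_read`, the `yield` hook, and the helper `write_one_byte` are hypothetical and only stand in for what a user-level thread package would do; this is not Capriccio's code.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

typedef void (*yield_fn)(void *arg);

/* Put a descriptor into non-blocking mode. Returns 0 on success. */
int make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical replacement for a blocking read(). Counts the system calls
   it issues in *syscalls. The yield hook models "let other threads run". */
ssize_t nb_read(int fd, void *buf, size_t len,
                yield_fn yield, void *arg, int *syscalls) {
    for (;;) {
        ssize_t n = read(fd, buf, len);   /* fails with EAGAIN if no data */
        ++*syscalls;
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;
        if (yield) yield(arg);            /* another thread may produce data */
        struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
        int r = poll(&p, 1, -1);          /* the additional system call */
        ++*syscalls;
        if (r < 0 && errno != EINTR) return -1;
    }
}

/* Stand-in for "another thread": writes one byte to the fd in *arg. */
void write_one_byte(void *arg) {
    char x = 'x';
    ssize_t w = write(*(int *)arg, &x, 1);
    assert(w == 1);
}
```

When data is already available the wrapper costs one system call, the same as a blocking read; the extra calls appear only on the path where the read would have blocked.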


• Context switches
  – Built on coroutine library.

• I/O
  – Intercept blocking system calls, use epoll() and AIO for disk.
  – Can be less efficient.

• Scheduling
  – Main scheduling loop looks very much like an event-driven application. (What is an EDA?)
  – Makes it relatively easy to switch schedulers.

• Synchronization
  – Cooperative threading on a uniprocessor (UP).

• Efficiency
  – All O(1), except sleep queue.
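The points above about coroutine-based context switches and a scheduler loop that resembles an event loop can be illustrated with a small sketch. It assumes the POSIX `ucontext` calls (`getcontext`, `makecontext`, `swapcontext`) as a stand-in for the coroutine library; the slides do not say which library Capriccio uses, and names such as `thread_yield`, `worker`, and `run_round_robin` are hypothetical.

```c
#include <assert.h>
#include <ucontext.h>

#define NTHREADS 2
#define STACK_SIZE (64 * 1024)

static ucontext_t sched_ctx;                 /* the scheduler's own context */
static ucontext_t thread_ctx[NTHREADS];      /* one context per user thread */
static char stacks[NTHREADS][STACK_SIZE];
static int finished[NTHREADS];
static int current;                          /* index of the running thread */
static char trace[32];                       /* records who ran, in order */
static int trace_len;

/* Cooperative yield: save this thread and resume the scheduler loop. */
static void thread_yield(void) {
    swapcontext(&thread_ctx[current], &sched_ctx);
}

/* Each thread records its letter three times, yielding in between. */
static void worker(int id) {
    for (int i = 0; i < 3; i++) {
        trace[trace_len++] = (char)('A' + id);
        thread_yield();
    }
    finished[id] = 1;   /* returning resumes uc_link, i.e. the scheduler */
}

/* Round-robin scheduler: a plain loop that resumes each live thread. */
const char *run_round_robin(void) {
    trace_len = 0;
    for (int i = 0; i < NTHREADS; i++) {
        finished[i] = 0;
        int rc = getcontext(&thread_ctx[i]);
        assert(rc == 0);
        thread_ctx[i].uc_stack.ss_sp = stacks[i];
        thread_ctx[i].uc_stack.ss_size = STACK_SIZE;
        thread_ctx[i].uc_link = &sched_ctx;
        makecontext(&thread_ctx[i], (void (*)(void))worker, 1, i);
    }
    int live = NTHREADS;
    while (live > 0) {
        for (current = 0; current < NTHREADS; current++) {
            if (finished[current]) continue;
            swapcontext(&sched_ctx, &thread_ctx[current]);
            if (finished[current]) live--;
        }
    }
    trace[trace_len] = '\0';
    return trace;
}
```

Calling `run_round_robin()` returns the string "ABABAB": the two threads alternate purely through user-level switches. Swapping the scheduling policy only means changing the loop in `run_round_robin`, which is the sense in which the scheduler is easy to replace.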


• 2 × 2.4 GHz Xeon, 1 GB memory, 2 × 10K RPM SCSI, GigE.
  – 2 × 1.2 GHz UltraSPARC III (US III)

• Linux 2.5.70, epoll(), AIO.
  – Solaris 8

• Capriccio, LinuxThreads, NPTL

Thread Primitives (latencies in µs)

                         Capriccio  Capriccio (notrace)  LinuxThreads  NPTL   Solaris
Thread creation          21.5       21.5                 37.9          17.7   32
Thread context switch    0.56       0.24                 0.71          0.65   -
Uncontended mutex lock   0.04       0.04                 0.14          0.15   0.08

Thread Scalability

• Producer-consumer

• The drop between 100 and 1000 threads is due to cache footprint.

I/O Performance

• pipetest
  – Pass a number of tokens among a set of pipes.


• Disk scheduling
  – A number of threads perform random 4 KB reads from a 1 GB file.

• Disk I/O through buffer cache
  – 200 threads reading with a fixed miss rate.

• When concurrency is low, performance is poorer.

• Benefits of disk head scheduling.

• I/O out of buffer.

• Performance is lower due to AIO.

Linked Stack Management

Thread Stacks

• With a lot of threads, the cumulative stack space can be quite large.

• Solution: Use a dynamic allocation policy and allocate on demand. Link stack chunks together.

• Problem: How do you link stack chunks together? How do you know when to link a new one?

Weighted Call Graph

• Use static analysis to create a weighted call graph.
• Each node is weighted by the maximum stack space that its function might consume. (Why is it maximum, and not exact?)

• Now what?


• Most real-world programs use recursion.

• Even without recursion, a static bound wastes too much space.

• Instead insert checkpoints at key places to link in new stack chunks.

• Chunks are switched right before arguments are pushed.
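The checkpoint idea can be sketched as bookkeeping code. This is a simplified model under stated assumptions, not Capriccio's implementation: it only tracks chunk sizes and links and does not switch the real stack pointer, and `CHUNK_SIZE`, `struct lstack`, `checkpoint`, `push_frame`, and `pop_frame` are hypothetical names. The `need` argument stands for the statically computed bound on stack use up to the next checkpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096   /* hypothetical default chunk size in bytes */

struct chunk {
    struct chunk *prev;   /* link back to the chunk we came from */
    size_t used;          /* bytes of this chunk already consumed */
    size_t size;          /* total bytes in this chunk */
};

struct lstack {
    struct chunk *top;    /* chunk currently in use */
    int chunks_linked;    /* how many chunks were ever allocated */
};

static struct chunk *chunk_new(struct chunk *prev, size_t size) {
    struct chunk *c = malloc(sizeof *c);
    assert(c != NULL);
    c->prev = prev;
    c->used = 0;
    c->size = size;
    return c;
}

void lstack_init(struct lstack *s) {
    s->top = chunk_new(NULL, CHUNK_SIZE);
    s->chunks_linked = 1;
}

/* Checkpoint: run before a call. If the current chunk cannot hold `need`
   more bytes, link a new chunk. Returns 1 if a chunk was linked. */
int checkpoint(struct lstack *s, size_t need) {
    if (s->top->size - s->top->used >= need) return 0;
    s->top = chunk_new(s->top, need > CHUNK_SIZE ? need : CHUNK_SIZE);
    s->chunks_linked++;
    return 1;
}

/* Model a function call consuming `bytes` of stack. */
void push_frame(struct lstack *s, size_t bytes) {
    assert(s->top->size - s->top->used >= bytes);
    s->top->used += bytes;
}

/* Model a return; unlink a chunk once it is empty (never the first one). */
void pop_frame(struct lstack *s, size_t bytes) {
    assert(s->top->used >= bytes);
    s->top->used -= bytes;
    if (s->top->used == 0 && s->top->prev != NULL) {
        struct chunk *dead = s->top;
        s->top = dead->prev;
        free(dead);
    }
}
```

With a 4096-byte first chunk, a frame of 3000 bytes fits, while a following `checkpoint(&s, 2000)` links a second chunk because only 1096 bytes remain. Returning from that call empties the second chunk and unlinks it.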

Placing Checkpoints

• Make sure there is at least one checkpoint in every cycle by inserting checkpoints on back edges. (How?) (Is this efficient?)

• Then make sure the sum of node weights along each path between checkpoints is not too large.

• Function B is executing.
• Function D, both ways.
• Recursion.
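One common way to find back edges, and so to answer the "(How?)" for the cycle step, is a depth-first search that marks nodes as in-progress: an edge to a node still on the current search path is a back edge. The sketch below applies that to a small call graph stored as an adjacency matrix. The slides do not specify the algorithm Capriccio uses, so this is an assumption, and `struct callgraph` and `place_cycle_checkpoints` are hypothetical names. It covers only the cycle step, not the path-length step.

```c
#include <assert.h>

#define MAXN 8   /* hypothetical upper bound on functions in the graph */

struct callgraph {
    int n;                       /* number of functions */
    int edge[MAXN][MAXN];        /* edge[u][v] = 1 if function u calls v */
    int checkpoint[MAXN][MAXN];  /* set to 1 on edges that get a checkpoint */
};

enum { WHITE = 0, GRAY, BLACK };  /* unvisited, on current path, finished */

static void dfs(struct callgraph *g, int u, int *color, int *count) {
    color[u] = GRAY;
    for (int v = 0; v < g->n; v++) {
        if (!g->edge[u][v]) continue;
        if (color[v] == GRAY) {          /* v is on the current path: back edge */
            g->checkpoint[u][v] = 1;
            (*count)++;
        } else if (color[v] == WHITE) {
            dfs(g, v, color, count);
        }
    }
    color[u] = BLACK;
}

/* Put a checkpoint on every back edge, so that every cycle (including
   direct recursion) contains at least one. Returns the number placed. */
int place_cycle_checkpoints(struct callgraph *g, int root) {
    int color[MAXN] = {0};
    int count = 0;
    assert(g->n <= MAXN && root >= 0 && root < g->n);
    dfs(g, root, color, &count);
    for (int u = 0; u < g->n; u++)       /* also cover unreachable functions */
        if (color[u] == WHITE) dfs(g, u, color, &count);
    return count;
}
```

The search visits each node once, so with an adjacency-list representation the cost would be linear in the size of the call graph; the matrix used here trades that for brevity.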

Special Cases

• Function pointers
  – Difficult, but they try to analyze.

• External functions
  – Allow annotations.
  – Alternatively, link in a large chunk.

• Variable length arrays
  – C99


• What kind of a problem is this?

• Is it being solved at the right level?

Resource-Aware Scheduling

Admission Control

• We’ve seen many graphs where performance degrades as some variable increases.

• The goal of scheduling in Capriccio is to keep performance in the “good” part of the curve.

Blocking Graph

• Each node is a location where the program blocked.
  – Location is call chain.

• Generated at run time.
• Annotate with resource usage:
  – Average running time (with exponentially-weighted “moving” average), memory, stack, sockets, etc.

• Maintain a run queue for each node. Admit threads till resources reach maximum capacity.


• Too many non-linear effects to predict.

• One solution is to use some kind of instrumentation, plus feedback control.
  – But even detecting that is hard.

Web Server Test


• Control flow maintains state. Control flow can be swapped for explicit maintenance.

• Threads perform two functions:
  – Maintain state (logical threads of programming model)
  – Allow concurrency (kernel)

• Should separate the two, since the overhead of concurrency is not necessary when we just want to maintain state.

• Cooperative multitasking has been denigrated before, but can be good.

