Big Data Applications and Tuning in Ceph
Noah Watkins, Red Hat
Who am I?
● Noah Watkins
● Red Hat engineer
  ○ Big Data applications on Ceph
● PhD candidate at UC Santa Cruz
  ○ Use Ceph as a research platform
● Contact
Today’s Agenda
● Non-technical talk about technical stuff
● Less visible projects that deserve attention
● Ceph is a big ecosystem
  ○ Running Hadoop on Ceph
  ○ Tracing and debugging features
  ○ Custom object interfaces
Big Data with Hadoop and Ceph
Big Data with Ceph and Hadoop
● Do you Hadoop?
● Are you running a Ceph cluster?
● Combined, they work. End of talk.
National System Administrator Appreciation Day
Why should you care? Consolidation
[Figure sequence with labels FooStore!, FooApp!, and $$$; the last frame shows FooApp! twice]
How does it work?
● A shim layer translates file system APIs
  ○ CephFS <-> Hadoop Common File System
● Opens up the entire Hadoop ecosystem
  ○ MapReduce
  ○ Spark
  ○ Storm
  ○ Impala
  ○ HBase
  ○ The list goes on and on
HDFS vs CephFS, 1TB Terasort
http://www.mellanox.com/related-docs/whitepapers/wp_hadoop_on_cephfs.pdf
1 Year Old Results!
What Works and What Doesn’t (yet)
● Locality-aware scheduling
  ○ The rumors aren’t true :)
● Variable replication and erasure coding
  ○ Select from existing pools
● Snapshots
● How’s the stability?
  ○ Terasort, HBase, DFSIO
  ○ Bug fixes and performance tuning (HDFS isn’t strict!)
  ○ Gets better with each new MDS update
Test driving Hadoop on Ceph
● GitHub
  ○ https://github.com/ceph/cephfs-hadoop
● Tutorial
  ○ http://ceph.com/docs/master/cephfs/hadoop/
  ○ Streamlined installation and new docs coming soon!
● Mailing list
  ○ Best resource right now
  ○ http://tracker.ceph.com
Tracing and Debugging with LTTng
Things always go according to plan
[Figure: FooApp! diagram]
Complex Systems Can’t Be Grokked
[Figure: FooApp! diagram]
What is tracing & why should I care?
● Tracing allows us to see exactly what happens inside the system
RBD:
[11:41:53.226668003] (+0.000270968) issdm-45 librbd:aio_read_enter: { cpu_id = 0 }, { pthread_id = 140267464296512 }, { imagectx = 0x7F929308F600, name = "kubuntu", snap_name = "", read_only = 0, offset = 2078900224, length = 31232, completion = 0x7F92935AE190 }
[11:41:53.226730019] (+0.000062016) issdm-45 librbd:aio_read_exit: { cpu_id = 0 }, { pthread_id = 140267464296512 }, { retval = 31232 }
[11:41:53.228001617] (+0.001271598) issdm-45 librbd:aio_complete_enter: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { completion = 0x7F92935AE190, rval = 31232 }
[11:41:53.228007204] (+0.000005587) issdm-45 librbd:aio_get_return_value_enter: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { completion = 0x7F92935AE190 }
[11:41:53.228009718] (+0.000002514) issdm-45 librbd:aio_get_return_value_exit: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { retval = 31232 }
[11:41:53.228016702] (+0.000006984) issdm-45 librbd:aio_complete_exit: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { }
With tracing anything is possible
[Figures: Queue Depth over Time; Latency vs Sector Size]
Example: Ceph Request Latency
Trace processing pipeline
● Processing step examines trace events
● Typically written in Python
● Looking for pairs is a common pattern
  ○ Time spent in queue
  ○ Time spent in I/O
  ○ Client processing time
● Requires knowledge of internal workings
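The "looking for pairs" pattern can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the talk: it assumes the text output format shown on the RBD tracing slide, the sample lines are copied from that slide with their payloads trimmed to the fields used here, and the helper names (parse_line, pair_latencies) are made up for this example.

```python
import re

# Sample events in the text format shown on the RBD tracing slide.
# Payloads are trimmed to the fields this sketch uses.
SAMPLE = """\
[11:41:53.226668003] (+0.000270968) issdm-45 librbd:aio_read_enter: { cpu_id = 0 }, { pthread_id = 140267464296512 }, { completion = 0x7F92935AE190 }
[11:41:53.226730019] (+0.000062016) issdm-45 librbd:aio_read_exit: { cpu_id = 0 }, { pthread_id = 140267464296512 }, { retval = 31232 }
[11:41:53.228001617] (+0.001271598) issdm-45 librbd:aio_complete_enter: { cpu_id = 1 }, { pthread_id = 140266098906880 }, { completion = 0x7F92935AE190, rval = 31232 }
"""

# [HH:MM:SS.nnnnnnnnn] (+delta) hostname provider:event: payload
LINE_RE = re.compile(
    r"\[(\d+):(\d+):(\d+)\.(\d{9})\]\s+\(\+[^)]*\)\s+\S+\s+(\w+):(\w+):\s*(.*)"
)
# key = value pairs inside the payload; values may be quoted strings.
FIELD_RE = re.compile(r'(\w+) = ("[^"]*"|[^,}\s]+)')


def parse_line(line):
    """Return (timestamp_ns, event_name, fields) or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    hours, minutes, seconds, nanos, _provider, event, payload = m.groups()
    whole_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
    ts_ns = whole_seconds * 1_000_000_000 + int(nanos)
    return ts_ns, event, dict(FIELD_RE.findall(payload))


def pair_latencies(lines, start_event, end_event, key):
    """Return the gap in nanoseconds between each start event and the next
    end event that carries the same value for `key`."""
    pending = {}  # key value -> timestamp of the unmatched start event
    gaps = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts_ns, event, fields = parsed
        if key not in fields:
            continue
        if event == start_event:
            pending[fields[key]] = ts_ns
        elif event == end_event and fields[key] in pending:
            gaps.append(ts_ns - pending.pop(fields[key]))
    return gaps


lines = SAMPLE.splitlines()
# Time spent inside the aio_read call itself, matched on the calling thread.
call_ns = pair_latencies(lines, "aio_read_enter", "aio_read_exit", "pthread_id")
# Time from submitting the read until its completion fires, matched on the
# completion pointer because the callback runs on a different thread.
io_ns = pair_latencies(lines, "aio_read_enter", "aio_complete_enter", "completion")
print(call_ns, io_ns)
```

Choosing the right matching key is where knowledge of the internals comes in: the enter/exit pair shares a thread, while the submit/complete pair only shares the completion pointer. The timestamps here are wall-clock times of day, so a real script would also need to handle traces that cross midnight.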
Zipkin, Blkin, and LTTng
● Dapper is a Google system
  ○ Traces causal relationships
● Zipkin implemented by Twitter
  ○ Look at the pretty GUI
  ○ Ignores data sources
● Huge number of raw LTTng tracepoints in Ceph
  ○ LTTng → Zipkin (Blkin)
    ■ Marios Kogias
  ○ Andrew Shewmaker
  ○ Adam Crume
Getting started with tracing!
● Lots of tracepoints exist!
  ○ Adding new points is easy :)
● RBD-Replay
  ○ Collect and replay RBD traces
  ○ http://ceph.com/docs/master/rbd/rbd-replay/
● Adding points and discussion
  ○ http://noahdesu.github.io/2014/06/01/tracing-ceph-with-lttng-ust.html
TRACEPOINT_EVENT(librados, rados_write_enter,
    TP_ARGS(
        rados_ioctx_t, ioctx,
        const char*, oid,
        const void*, buf,
        size_t, len,
        uint64_t, off),
    TP_FIELDS(
        ctf_integer_hex(rados_ioctx_t, ioctx, ioctx)
        ctf_string(oid, oid)
    )
)
[Scripting] Storage and Compute with RADOS
A different version of a better talk
● The objects in RADOS can have arbitrary code associated with them
  ○ Think: “remotely compress object ‘foo’, please.”
● “Distributed Storage and Compute with Ceph’s librados”
  ○ Great talk by Sage Weil
  ○ Check out YouTube
● Scripting compute with librados
How does an OSD handle a request?
[Diagram: Client (librados) sends read-object(foo) to the OSD; the OSD runs read-object inside a transaction]
● Client reads an object from an OSD
● The OSD executes a “read” operation
  ○ The read operation knows how to access data managed by the OSD
● All operations are executed in a transactional context
● Exact function can be swapped out
What are RADOS object classes?
[Diagram: Client (librados) sends get-md5(foo) to the OSD; the MD5-Hash plugin, installed out of band, runs inside a transaction]
● Client writes some C++ code
● Compiles it into an OSD plugin
● After installing, this code can be invoked
● Avoids data transfer
  ○ Can cache results
● Can simplify application design
Example RADOS object class plugin
int compute_md5(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  // Input provided by client, and output returned to client.

  // Stat the object to query its size (cls_cxx_stat expects a uint64_t*).
  uint64_t size;
  int ret = cls_cxx_stat(hctx, &size, NULL);
  if (ret < 0)
    return ret;

  // Read the entire object into a buffer (cls_cxx_read takes a bufferlist*).
  bufferlist data;
  ret = cls_cxx_read(hctx, 0, size, &data);
  if (ret < 0)
    return ret;

  // Pass this data buffer to the MD5 algorithm.
  // An MD5 digest is 16 bytes; MD5::DIGESTSIZE says so directly.
  byte digest[MD5::DIGESTSIZE];
  MD5().CalculateDigest(digest, (byte*)data.c_str(), data.length());

  // Return the MD5 digest to the client.
  out->append((const char*)digest, sizeof(digest));
  return 0;
}
All in a transactional context.
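On the client side, an installed object class is invoked by name. The following is a minimal sketch using the Python librados bindings, not code from the talk. The class name "md5" and method name "compute_md5" are assumptions and must match whatever the plugin actually registered; the pool name, object name, and config path are placeholders too.

```python
def fetch_md5(ioctx, oid, cls_name="md5", method="compute_md5"):
    """Ask the OSD that stores `oid` to run an object class method and
    return the digest as a hex string.

    `ioctx` is expected to behave like rados.Ioctx. The default class and
    method names are hypothetical.
    """
    # Ioctx.execute returns (length, output bytes); failures raise an exception.
    _length, digest = ioctx.execute(oid, cls_name, method, b"")
    return bytes(digest).hex()


def fetch_md5_from_cluster(pool, oid, conffile="/etc/ceph/ceph.conf"):
    """Connect to a cluster, call the object class, and clean up."""
    import rados  # Python bindings shipped with Ceph; needs a reachable cluster

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            return fetch_md5(ioctx, oid)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A call might look like fetch_md5_from_cluster("data", "foo"). Only the 16-byte digest crosses the network; the object itself stays on the OSD, which is the "avoids data transfer" point from the earlier slide.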
Dynamic object classes with Lua
[Diagram: Client (librados) sends call-lua(script, foo) to the OSD; a LuaJIT VM with a dynamically generated interface runs inside a transaction]
● Lua is great as an embedded language
● LuaJIT is a high-performance implementation
● Allow clients to construct and modify object classes without compiling or restarting OSDs
Example: Lua Thumbnail Generator
function thumb(input, output)
  -- apply thumbnail spec to original image
  local spec_string = input:str()
  local blob = get_orig_img()
  local img = assert(magick.load_image_from_blob(blob:str()))
  img = magick.thumb(img, spec_string)

  -- append thumbnail to object
  local obj_size = cls.stat()
  local img_bl = bufferlist.new()
  img_bl:append(img)
  cls.write(obj_size, #img_bl, img_bl)

  -- save location in leveldb
  local loc_spec = #img_bl .. "@" .. obj_size
  local loc_spec_bl = bufferlist.new()
  loc_spec_bl:append(loc_spec)
  cls.map_set_val(spec_string, loc_spec_bl)
end
[Diagram: object layout (Original, Ver.1, Ver.2, Ver.3) with a Thumbnail Index]
● Read object and apply ImageMagick transformation
● Append (cache) the new version of the image to the object
● Save the location of the version indexed by its specification
● Write a smart read function to consult the cache
● Application can dynamically alter the transformation applied
App-specific Object Interface
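The "smart read" is not shown on the slide. As a rough model of the idea, here is a Python sketch over an in-memory object (a bytes value plus a dict standing in for the key/value index). It assumes the "<length>@<offset>" entry format that the Lua handler writes; the function names and the spec strings are made up for illustration, and a real handler would run inside the OSD rather than on the client.

```python
def parse_loc_spec(loc_spec):
    """Split a "<length>@<offset>" index entry into integers (length, offset)."""
    length, sep, offset = loc_spec.partition("@")
    if not sep:
        raise ValueError("malformed location spec: %r" % (loc_spec,))
    return int(length), int(offset)


def append_thumbnail(obj_data, index, spec, thumb):
    """Mirror the Lua handler: append the thumbnail to the object and record
    where it landed, keyed by its specification."""
    offset = len(obj_data)
    index[spec] = "%d@%d" % (len(thumb), offset)
    return obj_data + thumb


def read_thumbnail(obj_data, index, spec):
    """Smart read: return the cached thumbnail for `spec`, or None on a miss."""
    loc_spec = index.get(spec)
    if loc_spec is None:
        return None  # caller would generate the thumbnail, then retry
    length, offset = parse_loc_spec(loc_spec)
    return obj_data[offset:offset + length]


# Stand-in data: real objects would hold image bytes.
obj = b"ORIGINAL-IMAGE-BYTES"
index = {}
obj = append_thumbnail(obj, index, "100x100", b"thumb-a")
obj = append_thumbnail(obj, index, "50x50", b"tb")
print(read_thumbnail(obj, index, "100x100"), read_thumbnail(obj, index, "25x25"))
```

Because each version is appended and indexed by its spec, a repeated request for the same spec is a lookup plus a ranged read instead of another ImageMagick run.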
Getting started with scripted RADOS
● Buyer beware!
  ○ Experimental code
  ○ Works and fairly stable
● Code available on GitHub
  ○ http://github.com/ceph/ceph
  ○ branch: cls-lua
● In-depth explanation and examples
  ○ http://ceph.com/rados/dynamic-object-interfaces-with-lua/
That’s it!
● Lots of interesting development
● Ceph is a great platform for experimentation
● Q&A