TRANSCRIPT
Building a company-wide data pipeline on Apache Kafka: engineering for 150 billion messages per day
Yuto Kawamura, LINE Corp
Speaker introduction
• Yuto Kawamura
• Senior software engineer, LINE server development
• Apache Kafka contributor
• Joined April 2015 (about 3 years ago)
About LINE
• Messaging service
• More than 200 million active users¹ in countries with top market share, such as Japan, Taiwan and Thailand
• Many family services
  • News
  • Music
  • LIVE (video streaming)
¹ As of June 2017. Sum of 4 countries: Japan, Taiwan, Thailand and Indonesia.
Agenda
• Introducing LINE server
• Data pipeline w/ Apache Kafka
LINE Server Engineering is about …
• Scalability
  • Many users, many requests, a lot of data
• Reliability
  • LINE is already a communications infrastructure in several countries
Scale metric: message delivery
• 25 billion messages/day (API calls: 80 billion/day)
Scale metric: accumulated data (for analysis)
• 40 PB
Messaging System Architecture Overview
[Diagram: LINE apps connect over Thrift RPC/HTTP to LEGY gateways in multiple regions (JP, DE, SG), which proxy to talk-server, backed by a distributed data store and distributed async task processing]
LEGY (LINE Event Delivery Gateway)
• API gateway/reverse proxy
• Written in Erlang
• Deployed to many data centers all over the world
• Features focused on the needs of implementing a messaging service
  • Zero-latency code hot swapping without closing client connections
  • Durability thanks to Erlang processes and message passing
• A single instance holds 100K+ connections => a single-instance failure has huge impact
talk-server
• Java-based web application server
• Implements most of the messaging functionality plus some other features
• Java 8 + Spring + Thrift RPC + Tomcat 8
Datastore with Redis and HBase
• LINE's hybrid datastore = Redis (in-memory DB, home-brew clustering) + HBase (persistent distributed key-value store)
• Cascading failure handling:
  • Async write in the app
  • Async write from a background task processor
  • Data correction batch
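The cascading-failure handling above can be sketched in plain Java. This is a hypothetical illustration of the idea, not LINE's implementation; the class, fields and the simulated failure flag are all invented for the example:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of dual write with async fallback: write the cache (Redis-like)
// synchronously; if the persistent store (HBase-like) write fails, hand the
// write to a background task processor for retry instead of failing the request.
public class DualWriteSketch {
    final Map<String, String> cache = new HashMap<>();      // stands in for Redis
    final Map<String, String> persistent = new HashMap<>(); // stands in for HBase
    final Queue<String[]> retryQueue = new ArrayDeque<>();  // background retry tasks
    boolean persistentStoreDown = false;                    // simulated failure flag

    void write(String key, String value) {
        cache.put(key, value); // cache/primary write always happens first
        try {
            writePersistent(key, value);
        } catch (RuntimeException e) {
            // Don't fail the user request: defer to async task processing.
            retryQueue.add(new String[] {key, value});
        }
    }

    void writePersistent(String key, String value) {
        if (persistentStoreDown) throw new RuntimeException("store unavailable");
        persistent.put(key, value);
    }

    // The background task processor drains the queue once the store recovers.
    void drainRetries() {
        String[] task;
        while ((task = retryQueue.poll()) != null) {
            writePersistent(task[0], task[1]);
        }
    }
}
```

A data-correction batch would additionally diff the two stores periodically to repair writes lost by both paths.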
[Diagram: talk-server dual-writes to Redis (cache/primary) and HBase (primary/backup)]
Message Delivery
[Diagram: Alice sends a message to Bob via LEGY, talk-server and storage]
1. Alice finds the nearest LEGY
2. sendMessage("Bob", "Hello!")
3. LEGY proxies the request to talk-server
4. talk-server writes the message to storage
5. talk-server notifies Bob's LEGY of message arrival
6. Bob's fetchOps() request is proxied by LEGY
7. talk-server reads the message from storage
8. fetchOps() returns with the message
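The notify-then-fetch delivery pattern above can be sketched with a blocking queue per user. This is a hypothetical illustration, not LINE's code; the method names mirror the slide, and the queue stands in for LEGY's held connection plus the storage read:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of message delivery: sendMessage() stores the message and wakes the
// receiver, whose long-polling fetchOps() call then returns with it.
public class DeliverySketch {
    private final Map<String, BlockingQueue<String>> mailboxes = new ConcurrentHashMap<>();

    private BlockingQueue<String> mailbox(String user) {
        return mailboxes.computeIfAbsent(user, u -> new LinkedBlockingQueue<>());
    }

    // Steps 2-5: write the message and notify the receiver of its arrival.
    public void sendMessage(String to, String message) {
        mailbox(to).add(message);
    }

    // Steps 6-8: the receiver's long-polling fetchOps() blocks until a
    // message arrives, or returns null when the timeout expires.
    public String fetchOps(String user, long timeoutMillis) throws InterruptedException {
        return mailbox(user).poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```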
There is a lot of internal communication involved in processing a user's request
[Diagram: a request to talk-server fans out to a threat detection system, the Timeline server, data analysis and background task processing]
Communication between internal systems
• Communication for querying and transactional updates:
  • Querying authentication/permissions
  • Synchronous updates
• Communication for data synchronization and update notification:
  • Notifying about a user's relationship update
  • Synchronizing data updates with another service
[Diagram: talk-server talks to Auth, Analytics and another service over HTTP/REST/RPC]
Apache Kafka
• A distributed streaming platform
• (In a narrower sense) a distributed, persistent message queue supporting the pub-sub model
• Built-in load distribution
• Built-in failover on both the server (broker) and client side
How it works
[Diagram: producers publish records to topics on the brokers; consumers subscribe to topics and consume them]
AuthEvent event = AuthEvent.newBuilder()
    .setUserId(123)
    .setEventType(AuthEventType.REGISTER)
    .build();
producer.send(new ProducerRecord<>("events", userId, event));

consumer = new KafkaConsumer<>(config); // config sets "group.id" -> "group-A"
consumer.subscribe(Collections.singletonList("events"));
consumer.poll(100); // => Record(key=123, value=...)
Pub-Sub
[Diagram: two consumer groups (group A and group B) each consume the same topics from the brokers, and each group receives the full stream Records[A, B, C…]]
• Multiple consumer “groups” can independently consume a single topic
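The multi-group semantics can be sketched in plain Java. This is an in-memory analogy of the behavior, not the Kafka API; every group keeps its own offset into the topic's append-only log, so each group independently sees all records:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory pub-sub analogy: consuming advances only the calling group's
// offset, leaving every other group's position untouched.
public class PubSubSketch {
    private final List<String> topic = new ArrayList<>(); // append-only log
    private final Map<String, Integer> groupOffsets = new HashMap<>();

    public void produce(String record) {
        topic.add(record);
    }

    // Returns the records this group has not yet consumed and advances
    // only this group's offset.
    public List<String> poll(String groupId) {
        int offset = groupOffsets.getOrDefault(groupId, 0);
        List<String> records = new ArrayList<>(topic.subList(offset, topic.size()));
        groupOffsets.put(groupId, topic.size());
        return records;
    }
}
```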
Example: UserActivityEvent
Scale metric: Events produced into Kafka
[Diagram: many services produce events into the shared Kafka cluster]
150 billion msgs / day
(3 million msgs / sec)
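A quick sanity check of these numbers: 150 billion messages per day averages out to roughly 1.7 million messages per second, so the 3 million msgs/sec figure presumably reflects peak rather than average load (an assumption; the slide does not say):

```java
// Back-of-the-envelope conversion of a per-day message count to an
// average per-second rate.
public class ThroughputMath {
    public static long averagePerSecond(long perDay) {
        return perDay / 86_400L; // 86,400 seconds in a day
    }

    public static void main(String[] args) {
        System.out.println(averagePerSecond(150_000_000_000L)); // prints 1736111
    }
}
```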
Our Kafka needs to be highly performant
• Some usages are sensitive to delivery latency
• Broker latency impacts throughput as well
  • because a Kafka topic is a queue
… but high performance wasn't a built-in property
• KAFKA-4614 Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex
Performance Engineering Kafka
• Application Level:
• Read and understand code
• Patch it to eliminate bottlenecks
• JVM Level:
• JVM profiling
• GC log analysis
• JVM parameters tuning
• OS Level:
• Linux perf
• Delay Accounting
• SystemTap
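As one concrete illustration of the JVM-level items, GC log analysis and parameter tuning on a Java 8-era broker might start from flags like these (the values are illustrative assumptions, not LINE's production settings):

```
# Heap sizing and the G1 collector with a low pause-time target
-Xms8g -Xmx8g
-XX:+UseG1GC -XX:MaxGCPauseMillis=20
# GC log for the analysis step above (Java 8 logging flags)
-Xloggc:/path/to/gc.log
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
```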
e.g., investigating slow sendfile(2)
• Observe the sendfile syscall's duration
• => found that sendfile was blocking Kafka's event loop
• => patched Kafka to eliminate the blocking sendfile
stap -e '
    ...
    probe syscall.sendfile { d[tid()] = gettimeofday_us() }
    probe syscall.sendfile.return {
        if (d[tid()]) {
            st <<< gettimeofday_us() - d[tid()]
            delete d[tid()]
        }
    }
    probe end { print(@hist_log(st)) }
'
… and we contribute the patches back
More interested?
• Kafka Summit SF 2017
• One Day, One Data Hub, 100 Billion Messages: Kafka at LINE
• https://youtu.be/X1zwbmLYPZg
• Google “kafka summit line”
Summary
• Large scale + high reliability = difficult and exciting engineering!
• LINE's architecture will keep evolving with OSS
• … and there are more challenges
  • Multi-IDC deployment
  • More performance and reliability improvements
End of presentation. Any questions?