TRANSCRIPT
Building a company-wide data pipeline on Apache Kafka: engineering for 150 billion messages per day
Yuto Kawamura, LINE Corp
Speaker introduction
• Yuto Kawamura
• Senior software engineer, LINE server development
• Apache Kafka contributor
• Joined April 2015 (about 3 years ago)
About LINE
• Messaging service
• More than 200 million active users¹ in countries with top market share, such as Japan, Taiwan and Thailand
• Many family services
  • News
  • Music
  • LIVE (video streaming)
¹ As of June 2017. Sum of 4 countries: Japan, Taiwan, Thailand and Indonesia.
Agenda
• Introducing LINE server
• Data pipeline w/ Apache Kafka
LINE Server Engineering is about …
• Scalability
  • Many users, many requests, a lot of data
• Reliability
  • LINE is already a communications infrastructure in several countries
Scale metric: message delivery
• 25 billion messages/day (API calls: 80 billion/day)
Scale metric: accumulated data (for analysis)
• 40 PB
Messaging System Architecture Overview
[Diagram: LINE apps connect over Thrift RPC/HTTP to LEGY gateways in multiple regions (JP, DE, SG), which proxy to talk-server, backed by a distributed data store and distributed async task processing]
LEGY (LINE Event Delivery Gateway)
• API gateway/reverse proxy
• Written in Erlang
• Deployed to many data centers all over the world
• Features focused on the needs of implementing a messaging service
  • Zero-latency code hot swapping without closing client connections
  • Durability thanks to Erlang processes and message passing
• A single instance holds 100K+ connections => a single-instance failure has huge impact
talk-server
• Java-based web application server
• Implements most of the messaging functionality plus some other features
• Java 8 + Spring + Thrift RPC + Tomcat 8
Datastore with Redis and HBase
• LINE's hybrid datastore = Redis (in-memory DB, home-brew clustering) + HBase (persistent distributed key-value store)
• Cascading failure handling:
  • Async write in the app
  • Async write from a background task processor
  • Data correction batch
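The cascading-failure handling above can be sketched in plain Java. This is a hypothetical illustration of the idea, not LINE's implementation; the class, fields and the simulated failure flag are all invented for the example:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of dual write with async fallback: write the cache (Redis-like)
// synchronously; if the persistent store (HBase-like) write fails, hand the
// write to a background task processor for retry instead of failing the request.
public class DualWriteSketch {
    final Map<String, String> cache = new HashMap<>();      // stands in for Redis
    final Map<String, String> persistent = new HashMap<>(); // stands in for HBase
    final Queue<String[]> retryQueue = new ArrayDeque<>();  // background retry tasks
    boolean persistentStoreDown = false;                    // simulated failure flag

    void write(String key, String value) {
        cache.put(key, value); // cache/primary write always happens first
        try {
            writePersistent(key, value);
        } catch (RuntimeException e) {
            // Don't fail the user request: defer to async task processing.
            retryQueue.add(new String[] {key, value});
        }
    }

    void writePersistent(String key, String value) {
        if (persistentStoreDown) throw new RuntimeException("store unavailable");
        persistent.put(key, value);
    }

    // The background task processor drains the queue once the store recovers.
    void drainRetries() {
        String[] task;
        while ((task = retryQueue.poll()) != null) {
            writePersistent(task[0], task[1]);
        }
    }
}
```

A data-correction batch would additionally diff the two stores periodically to repair writes lost by both paths.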
[Diagram: talk-server dual-writes to Redis (cache/primary) and HBase (primary/backup)]
Message Delivery
[Diagram: Alice sends a message to Bob via LEGY, talk-server and storage]
1. Alice finds the nearest LEGY
2. sendMessage("Bob", "Hello!")
3. LEGY proxies the request to talk-server
4. talk-server writes the message to storage
5. talk-server notifies Bob's LEGY of message arrival
6. Bob's fetchOps() request is proxied by LEGY
7. talk-server reads the message from storage
8. fetchOps() returns with the message
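The notify-then-fetch delivery pattern above can be sketched with a blocking queue per user. This is a hypothetical illustration, not LINE's code; the method names mirror the slide, and the queue stands in for LEGY's held connection plus the storage read:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of message delivery: sendMessage() stores the message and wakes the
// receiver, whose long-polling fetchOps() call then returns with it.
public class DeliverySketch {
    private final Map<String, BlockingQueue<String>> mailboxes = new ConcurrentHashMap<>();

    private BlockingQueue<String> mailbox(String user) {
        return mailboxes.computeIfAbsent(user, u -> new LinkedBlockingQueue<>());
    }

    // Steps 2-5: write the message and notify the receiver of its arrival.
    public void sendMessage(String to, String message) {
        mailbox(to).add(message);
    }

    // Steps 6-8: the receiver's long-polling fetchOps() blocks until a
    // message arrives, or returns null when the timeout expires.
    public String fetchOps(String user, long timeoutMillis) throws InterruptedException {
        return mailbox(user).poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```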
There is a lot of internal communication involved in processing a user's request
[Diagram: a request to talk-server fans out to a threat detection system, the Timeline server, data analysis and background task processing]
Communication between internal systems
• Communication for querying and transactional updates:
  • Querying authentication/permissions
  • Synchronous updates
• Communication for data synchronization and update notification:
  • Notifying about a user's relationship update
  • Synchronizing data updates with another service
[Diagram: talk-server talks to Auth, Analytics and another service over HTTP/REST/RPC]
Apache Kafka
• A distributed streaming platform
• (In a narrower sense) a distributed, persistent message queue supporting the pub-sub model
• Built-in load distribution
• Built-in failover on both the server (broker) and client side
How it works
[Diagram: producers publish records to topics on the brokers; consumers subscribe to topics and consume them]
AuthEvent event = AuthEvent.newBuilder()
    .setUserId(123)
    .setEventType(AuthEventType.REGISTER)
    .build();
producer.send(new ProducerRecord<>("events", userId, event));

consumer = new KafkaConsumer<>(config); // config sets "group.id" -> "group-A"
consumer.subscribe(Collections.singletonList("events"));
consumer.poll(100); // => Record(key=123, value=...)
Pub-Sub
[Diagram: two consumer groups (group A and group B) each consume the same topics from the brokers, and each group receives the full stream Records[A, B, C…]]
• Multiple consumer “groups” can independently consume a single topic
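The multi-group semantics can be sketched in plain Java. This is an in-memory analogy of the behavior, not the Kafka API; every group keeps its own offset into the topic's append-only log, so each group independently sees all records:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory pub-sub analogy: consuming advances only the calling group's
// offset, leaving every other group's position untouched.
public class PubSubSketch {
    private final List<String> topic = new ArrayList<>(); // append-only log
    private final Map<String, Integer> groupOffsets = new HashMap<>();

    public void produce(String record) {
        topic.add(record);
    }

    // Returns the records this group has not yet consumed and advances
    // only this group's offset.
    public List<String> poll(String groupId) {
        int offset = groupOffsets.getOrDefault(groupId, 0);
        List<String> records = new ArrayList<>(topic.subList(offset, topic.size()));
        groupOffsets.put(groupId, topic.size());
        return records;
    }
}
```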
Example: UserActivityEvent
Scale metric: Events produced into Kafka
[Diagram: many services produce events into the shared Kafka cluster]
150 billion msgs / day
(3 million msgs / sec)
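A quick sanity check of these numbers: 150 billion messages per day averages out to roughly 1.7 million messages per second, so the 3 million msgs/sec figure presumably reflects peak rather than average load (an assumption; the slide does not say):

```java
// Back-of-the-envelope conversion of a per-day message count to an
// average per-second rate.
public class ThroughputMath {
    public static long averagePerSecond(long perDay) {
        return perDay / 86_400L; // 86,400 seconds in a day
    }

    public static void main(String[] args) {
        System.out.println(averagePerSecond(150_000_000_000L)); // prints 1736111
    }
}
```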
Our Kafka needs to be highly performant
• Some usages are sensitive to delivery latency
• Broker latency impacts throughput as well
  • because a Kafka topic is a queue
… but high performance wasn't a built-in property
• KAFKA-4614 Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex
Performance Engineering Kafka
• Application Level:
• Read and understand code
• Patch it to eliminate bottlenecks
• JVM Level:
• JVM profiling
• GC log analysis
• JVM parameters tuning
• OS Level:
• Linux perf
• Delay Accounting
• SystemTap
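As one concrete illustration of the JVM-level items, GC log analysis and parameter tuning on a Java 8-era broker might start from flags like these (the values are illustrative assumptions, not LINE's production settings):

```
# Heap sizing and the G1 collector with a low pause-time target
-Xms8g -Xmx8g
-XX:+UseG1GC -XX:MaxGCPauseMillis=20
# GC log for the analysis step above (Java 8 logging flags)
-Xloggc:/path/to/gc.log
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
```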
e.g., investigating slow sendfile(2)
• Observe the sendfile syscall's duration
• => found that sendfile was blocking Kafka's event loop
• => patched Kafka to eliminate the blocking sendfile
stap -e '
    ...
    probe syscall.sendfile { d[tid()] = gettimeofday_us() }
    probe syscall.sendfile.return {
        if (d[tid()]) {
            st <<< gettimeofday_us() - d[tid()]
            delete d[tid()]
        }
    }
    probe end { print(@hist_log(st)) }
'
… and we contribute the patches back
More interested?
• Kafka Summit SF 2017
• One Day, One Data Hub, 100 Billion Messages: Kafka at LINE
• https://youtu.be/X1zwbmLYPZg
• Google “kafka summit line”
Summary
• Large scale + high reliability = difficult and exciting engineering!
• LINE's architecture will keep evolving with OSS
• … and there are more challenges
  • Multi-IDC deployment
  • More performance and reliability improvements
End of presentation. Any questions?