is large-scale dns over tcp practical? · dns-over-tcp is feasible on a large scale i with 6...

24

Upload: others

Post on 03-Aug-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Is large-scale DNS over TCP practical?

Baptiste JonglezPhD student, Univ. Grenoble Alpes, France

14-18 May 2018RIPE 76

1/24

Page 2: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

DNS over UDP

Issues with DNS over UDP

Well-known issues:

I no source address validation: huge DDoS attacks byre�ection/ampli�cation

I requires IP fragmentation for large messages (DNSSEC)

I no privacy

I unreliable transport

Each individual issue can be worked-around: RRL, Ed25519 forDNSSEC, DTLS or DNSCurve for privacy. . .

Solution to all four problems: TCP+TLS (see DPRIVE, RFC 7858)

2/24

Page 3: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Switch to DNS over TCP by default?

Scenario

Use persistent TCP connections between home routers andrecursive resolvers, for all customers of an ISP.

3/24

Page 4: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Switch to DNS over TCP by default?

Why is this situation useful to study?

I The resolver could stop supporting UDP queries (no re�ectionattacks possible!)

I First step towards TCP+TLS for privacy

I Using persistent TCP connections can improve responsivenesson lossy networks (not shown here)

4/24

Page 5: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Switch to DNS over TCP by default?

Objection

�But wait, our recursive resolvers won't handle that much load!�

Goals: performance analysis

I Develop a methodology to measure resolver performance

I Experiment with lots of clients (millions) to assess whether arecursive resolver can handle that much TCP connections

I See if resolver performance depends on the number of clients5/24

Page 6: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Challenges

Why would performance depend on the number of clients?

I performance of select-like event noti�cation facilities(bitmap of �le descriptors, linear search)

I the kernel has to manage millions of timers (retransmission oneach TCP connection)

I memory usage, CPU cache

6/24

Page 7: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Experimental challenges

Practical challenges

I How to spawn millions of DNS clients?

I Realistic query generator?

Solution

I Use Grid'5000: a �Hardware-as-a-Service� research platform,with lots of powerful servers: 32 cores, 128 GB RAM, 10GNICs;

I One dedicated server for unbound on Linux, everything servedfrom cache;

I Lots of Virtual Machines acting as clients;

I On each VM, open 30k persistent TCP connections towardsthe server and send DNS queries with custom client in C withlibevent;

7/24

Page 8: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Experimental setup: high-level

8/24

Page 9: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Methodology: how to measure performance?

70620

0

50

100

0 10 20 30 40 50

Time (seconds)

Que

ry r

ate

(bla

ck)

/ ans

wer

rat

e (r

ed)

in k

QP

S

20180502_lille−chetemi−1thread−24VM_1000tcp_1500qps_qps−increase−85−50s

9/24

Page 10: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Methodology: how to measure performance?

113750

0

50

100

0 10 20 30 40

Time (seconds)

Que

ry r

ate

(bla

ck)

/ ans

wer

rat

e (r

ed)

in k

QP

S

20180502_lille−chetemi−1thread−24VM_175tcp_1500qps_qps−increase−95−45s

10/24

Page 11: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

UDP/TCP comparison

●●

●●

●● ● ●

0

100

200

300

0 5000 10000 15000 20000 25000

Number of TCP or UDP connections

Pea

k se

rver

per

form

ance

(K

qps)

mode● tcp

udp

11/24

Page 12: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

UDP/TCP comparisonInterpretation

Resolver performance analysis

I settings: unbound runs on 1 thread

I UDP performance does not really depend on the number ofclients, as expected (stateless)

I performance over TCP is good with very few clients, but thendrops rapidly

I it then reaches a plateau: stable 50k to 60k qps even for 6.5million TCP clients!

Hypotheses for performance drop

I more clients → lower query rate per client, so less potential foraggregation (in TCP, select(), . . . )

I TCP data structures may not �t anymore in CPU cache?

12/24

Page 13: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Large-scale experiment

Result

Experimented with up to 6.5 million TCP clients:

I required 216 client VM running on 18 physical machines

I each VM opened 30k TCP connections to resolver

I server had 128 GB of RAM, peak usage: 51.4 GB (kernel +unbound)

I server performance: around 50k queries per second

Memory usage breakdown per connection: 4 KB for unboundbu�er, 3.7 KB for the rest (unbound, libevent, kernel)

13/24

Page 14: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

What about client query delay?Medium-high load: 43 kQPS from 4.3M TCP clients

14/24

Page 15: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

What about client query delay?Increasing to overload, 360k TCP clients

15/24

Page 16: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

What about multi-threading?

200

400

600

800

0 5 10 15 20

Resolver threads

Pea

k re

solv

er p

erfo

rman

ce (

kQP

S)

Impact of resolver threads on peak performance (300 TCP/VM, 48 VM, dual 10−core server)

16/24

Page 17: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Assumptions and outlooks

Some assumptions we made

I everything was served from static zone in unbound (= cache)

I we currently open all TCP connections beforehand → cost ofclient churn? what about TLS?

I client queries modelled as Poisson processes → any bettermodel?

I could we somehow experiment with constant query rate perclient?

17/24

Page 18: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Setup

Detailed setup

I Linux 4.9 (Debian stretch)

I Unbound 1.6.7, with 4 KB of bu�er per TCP connection, andno disconnection timeout

I custom libevent-based client:https://github.com/jonglezb/tcpscaler

I experiment orchestration:https://github.com/jonglezb/dns-server-experiment

I Grid'5000: https://www.grid5000.fr

I Hardware details (mostly used Chetemi, Chi�et, Grisou):https:

//www.grid5000.fr/mediawiki/index.php/Hardware

18/24

Page 19: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Conclusions

DNS-over-TCP is feasible on a large scale

I with 6 million TCP clients, unbound can still handle around50k queries per second per CPU core

I apparently unlimited number of TCP clients (requires OStweaking and enough RAM)

Remaining work

I better understanding of the server performance drop

I measure impact of client churn

I performance when not serving from DNS cache?

I apply methodology to more recursive resolver software

I experiment with TLS, QUIC, SCTP

19/24

Page 20: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Bonus slides

20/24

Page 21: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Aside: unreliable transport?

Queries or responses can be lost.

Retransmission timeout

Large retransmission timeout when a DNS query is lost!

Retransmission timeouts in stub resolvers:

I Linux/glibc: 5 seconds, con�gurable down to 1 second

I Android/bionic: identical (but there is a local cache)

I Windows: 1 second (since Vista)

21/24

Page 22: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Why not just lower retransmission timeouts?

22/24

Page 23: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Experimental setup, details

23/24

Page 24: Is large-scale DNS over TCP practical? · DNS-over-TCP is feasible on a large scale I with 6 million TCP clients, unbound can still handle around 50k queries per second per CPU core

Experimental setup, more details

Setup

I all queries are answered directly by unbound (100% cache hit)

I unbound was modi�ed to allow in�nite connections (very largetimeout)

I everything scripted with execo, fully reproducible:https://github.com/jonglezb/dns-server-experiment

https://github.com/jonglezb/tcpscaler

Gotcha

I generating queries according to a fast Poisson process is tricky!

I epoll() has very low timeout resolution compared to poll()

or select(). . .

I Linux has several limits regarding the number of �ledescriptors, but they can all be con�gured at runtime (thanksGoogle. . . )

24/24