Technical Debt: An Anycast Story. Tom Strickx, Cloudflare, London. RIPE 77, Amsterdam.


Post on 03-Jul-2020


Technical Debt: An Anycast Story

Tom Strickx
Cloudflare, London

RIPE 77
Amsterdam


Tom Strickx

● Network Hooligan at Cloudflare (Network Software Engineer)
● Contributor at NAPALM Automation and Saltstack
● https://tom.strickx.com

Ichabond
@tstrickx2


Cloudflare

● How big?
  ○ 7+ million zones/domains
  ○ 100+ billion DNS queries/day
    ■ Largest
    ■ Fastest
    ■ 35% of the Internet queries
    ■ Now also a resolver (1.1.1.1)
  ○ 10% of the web requests
● 150+ anycast locations globally
  ○ 74 countries (and growing)
  ○ Many hundreds of network devices


Agenda

● Anycast introduction
● Our technical debt
● Configuration changes using Saltstack
● Change monitoring


Our Anycast Network

● ± 250 IPv4 prefixes
● ± 15 IPv6 prefixes
● Announced globally (150+ locations)


Technical Debt


Technical Debt: History

● Few Tier 1 transit providers
● Prepends to steer traffic to proper location (± 10 PoPs)
● Eventually normalized globally


Technical Debt: Issues


Technical Debt: Incidents

● Authoritative DNS targeted (e.g. 173.245.58.0/24)
● Single location attracts all traffic due to missing prepends
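The "missing prepends" incident follows from BGP preferring the shortest AS path when earlier attributes tie. A minimal sketch, assuming a deliberately simplified model in which a remote network chooses among announcements on AS-path length alone; the location names, transit AS numbers (documentation range), and prepend counts are made up for illustration, and 13335 is Cloudflare's AS number.

```python
# Simplified model: a remote network compares announcements of the same
# prefix on AS-path length only (real BGP checks other attributes first).
ORIGIN_AS = 13335  # Cloudflare's AS number


def as_path(transit_as, prepends):
    """AS path seen behind one transit: the transit AS, then the origin AS
    once plus one extra copy per prepend."""
    return [transit_as] + [ORIGIN_AS] * (1 + prepends)


def best_location(announcements):
    """Return the location whose announcement has the shortest AS path."""
    return min(announcements, key=lambda loc: len(announcements[loc]))


# Hypothetical locations: every site prepends twice, so paths tie on length.
normal = {
    "ams": as_path(64500, prepends=2),
    "lhr": as_path(64501, prepends=2),
    "sin": as_path(64502, prepends=2),
}

# Same network, but one location is announced without its prepends.
broken = dict(normal, sin=as_path(64502, prepends=0))

winner = best_location(broken)
print(winner)  # the location with the missing prepends
```

In this model every remote network makes the same choice, which is why one misconfigured location can end up attracting traffic for the whole prefix.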


Technical Debt: Solutions

● RPKI
● Shorter AS-path
● /24 everything
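One possible reading of "/24 everything" is announcing each covering prefix as its component /24s, the longest IPv4 prefix length generally accepted in the global routing table. A minimal sketch of that split using Python's `ipaddress` module; the /22 below is a private example prefix, not one of Cloudflare's, and the interpretation itself is an assumption.

```python
import ipaddress


def to_slash24s(prefix):
    """Split an IPv4 prefix into /24s.
    A prefix that is already /24 or longer is returned unchanged."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen >= 24:
        return [net]
    return list(net.subnets(new_prefix=24))


subnets = to_slash24s("10.0.0.0/22")
for subnet in subnets:
    print(subnet)
```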


Technical Debt: Resolution

● Staggered deployment (6 stages)
● As quickly as possible globally
● Extensive internal and external monitoring
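A staggered deployment can be sketched as splitting the list of locations into ordered batches and only continuing while monitoring looks healthy. The six stages come from the slide; the location names, the near-equal stage sizes, and the `apply_change` and `healthy` callbacks are assumptions for illustration.

```python
def make_stages(locations, stage_count=6):
    """Split locations into `stage_count` ordered batches of near-equal size."""
    base, extra = divmod(len(locations), stage_count)
    stages, start = [], 0
    for i in range(stage_count):
        size = base + (1 if i < extra else 0)
        stages.append(locations[start:start + size])
        start += size
    return stages


def rollout(locations, apply_change, healthy, stage_count=6):
    """Apply the change stage by stage and stop after the first unhealthy
    stage. Returns the locations that were changed."""
    done = []
    for stage in make_stages(locations, stage_count):
        for loc in stage:
            apply_change(loc)
            done.append(loc)
        if not healthy(stage):
            break
    return done


# Hypothetical location names.
pops = [f"pop{n:03d}" for n in range(1, 15)]  # 14 locations
stages = make_stages(pops)
changed = rollout(pops, apply_change=lambda loc: None, healthy=lambda stage: True)
print([len(stage) for stage in stages])
```

Returning the list of changed locations makes it straightforward to roll back exactly what was touched if a stage fails its health check.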


Technical Debt: Change


Global Rollout


Global Rollout: Saltstack

● Automation and orchestration
● Open source
● Python, Jinja2 & YAML
● Highly scalable
● Very fast
● Vendor neutral
● Across our fleet: servers and network equipment


Global Rollout: Prechecks

● Make sure we know what we’re changing
● Adjust configuration if needed
● Add confidence
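Knowing what is about to change usually comes down to diffing the running configuration against the candidate before committing. A minimal sketch with Python's `difflib`; the route-map name and configuration lines are invented, and this is not the tooling shown in the talk.

```python
import difflib


def config_diff(running, candidate):
    """Return a unified diff between two configurations given as strings."""
    return list(difflib.unified_diff(
        running.splitlines(), candidate.splitlines(),
        fromfile="running", tofile="candidate", lineterm="",
    ))


# Hypothetical EOS-style route-map, before and after dropping the prepend.
running = "route-map TRANSIT-OUT permit 10\n   set as-path prepend 13335 13335\n"
candidate = "route-map TRANSIT-OUT permit 10\n"

diff = config_diff(running, candidate)
# Lines the change would remove (skipping the '---' file header).
removed = [line[1:] for line in diff
           if line.startswith("-") and not line.startswith("---")]
print("\n".join(diff))
```

A precheck could refuse to proceed if `removed` contains anything other than the expected prepend statements.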


Global Rollout: Actual Change

● All in Python (config generation)
● Both Junos and EOS
● Concurrently
● Done globally within ± 2 minutes
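Generating configuration for two platforms concurrently can be sketched as one data model rendered through per-platform functions on a thread pool. The slide only states that generation was done in Python for Junos and EOS; the policy name, term name, device names, and rendering functions below are hypothetical, and the emitted lines are only meant to resemble each platform's prepend syntax.

```python
from concurrent.futures import ThreadPoolExecutor

ORIGIN_AS = 13335  # Cloudflare's AS number


def prepend_string(prepends):
    """Origin AS repeated `prepends` times, space separated."""
    if prepends < 1:
        raise ValueError("this sketch only renders one or more prepends")
    return " ".join([str(ORIGIN_AS)] * prepends)


def render_junos(policy, prepends):
    """Junos 'set'-style line for an export policy term (hypothetical names)."""
    return [f"set policy-options policy-statement {policy} term PREPEND "
            f'then as-path-prepend "{prepend_string(prepends)}"']


def render_eos(policy, prepends):
    """EOS route-map lines (hypothetical names)."""
    return [f"route-map {policy} permit 10",
            f"   set as-path prepend {prepend_string(prepends)}"]


RENDERERS = {"junos": render_junos, "eos": render_eos}


def render(device_os, policy, prepends):
    """Dispatch to the renderer for the device's operating system."""
    return RENDERERS[device_os](policy, prepends)


# Hypothetical inventory: (device name, operating system).
devices = [("edge1.example", "junos"), ("edge2.example", "eos")]

with ThreadPoolExecutor(max_workers=8) as pool:
    rendered = pool.map(lambda d: render(d[1], "TRANSIT-OUT", 1), devices)
    configs = dict(zip((name for name, _ in devices), rendered))

for name, lines in configs.items():
    print(name, lines)
```

Pushing the rendered lines to devices is left out; in the talk that step is handled by Saltstack.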


Global Rollout: Metrics

● Internal metrics
● External metrics


Global Rollout: Internal metrics

● Stored in Clickhouse or Prometheus
● Visualized with Grafana
● Flows
● SNMP data
● Request data


Global Rollout: Clickhouse

● Developed at Yandex
● Column-oriented DBMS
● Open source
● 3 PB on disk
● 100 Gbps insertion


Global Rollout: Clickhouse

● Stores flow data
● Stores request data

[Figure: Clickhouse query and its result]


Global Rollout: Prometheus

● Time-series database
● Monitoring platform
● Open source


Global Rollout: Prometheus

● Prepend length
● PromQL

[Figure: all external looking-glasses, annotated "Begin rollout" and "Rollout complete"]


Global Rollout: Grafana

● Analytics
● Time-series visualization
● Multitude of plugins
● Open source


Global Rollout: Internal Metrics

● RPS / colo
● Traffic shift during change
● Real-time information
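The traffic shift during a change can be expressed as the change in each colo's share of total requests per second (RPS). A minimal sketch; the colo names and RPS figures are invented, and in practice the inputs would come from the stored request data.

```python
def share(rps_by_colo):
    """Each colo's fraction of total requests per second."""
    total = sum(rps_by_colo.values())
    return {colo: rps / total for colo, rps in rps_by_colo.items()}


def traffic_shift(before, after):
    """Change in traffic share per colo, in percentage points."""
    b, a = share(before), share(after)
    return {colo: round(100 * (a.get(colo, 0.0) - b.get(colo, 0.0)), 1)
            for colo in sorted(set(b) | set(a))}


# Hypothetical RPS per colo, sampled before and after the change.
before = {"ams": 500, "lhr": 300, "sin": 200}
after = {"ams": 400, "lhr": 300, "sin": 300}

shift = traffic_shift(before, after)
print(shift)
```

Working in shares rather than raw RPS keeps the comparison meaningful when overall load changes between the two samples.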


Global Rollout: External metrics

● Stored in Prometheus or raw ingestion
● Visualized with Grafana
● RIPE Atlas
● Looking glasses


Global Rollout: RIPE Atlas

● Global probes
● Ping, Traceroute, DNS query
● REST API
● Determine routing before and after change
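Determining routing before and after the change can be sketched as fetching the same measurement's results for two time windows and comparing where each probe ended up. The results URL follows the RIPE Atlas REST API (v2); the measurement ID, the timestamps, and the already-reduced probe-to-location mappings are placeholders, and no request is sent here.

```python
from urllib.parse import urlencode

ATLAS_API = "https://atlas.ripe.net/api/v2"


def results_url(measurement_id, start, stop):
    """URL for a measurement's results between two Unix timestamps."""
    query = urlencode({"start": start, "stop": stop, "format": "json"})
    return f"{ATLAS_API}/measurements/{measurement_id}/results/?{query}"


def moved_probes(before, after):
    """Probes whose serving location differs between the two windows,
    mapped to (old location, new location)."""
    return {probe: (before[probe], after[probe])
            for probe in sorted(before.keys() & after.keys())
            if before[probe] != after[probe]}


# Placeholder measurement ID and time window.
url = results_url(1234567, start=1539590400, stop=1539594000)

# Hypothetical probe ID -> location, as derived from each window's results.
before = {1001: "ams", 1002: "sin", 1003: "lhr"}
after = {1001: "ams", 1002: "lhr", 1003: "lhr"}

moved = moved_probes(before, after)
print(url)
print(moved)
```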


Global Rollout: Looking Glasses

● Routeviews
● AS57335 (http://dfz.watch/) looking glasses (Thanks Aaron!)
● IX looking glasses (We need more! With APIs!)
● Collect / scrape into Prometheus
● Visualize with Grafana


Global Rollout: Looking Glasses

● Scrape metrics
● BeautifulSoup for scraping
● Aggregate data
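Once a looking-glass page has been scraped to text (the slide names BeautifulSoup for that step), the aggregation can be sketched as counting origin-AS prepends per AS path. This sketch uses only the standard library and assumes the AS paths have already been extracted as strings; the sample paths are invented.

```python
from collections import Counter

ORIGIN_AS = "13335"  # Cloudflare's AS number


def prepend_count(as_path):
    """Number of extra copies of the origin AS at the end of an AS path."""
    repeats = 0
    for asn in reversed(as_path.split()):
        if asn != ORIGIN_AS:
            break
        repeats += 1
    return max(repeats - 1, 0)


def aggregate(as_paths):
    """Histogram: prepend count -> number of paths seen with that count."""
    return Counter(prepend_count(path) for path in as_paths)


# Hypothetical AS paths as they might appear in looking-glass output.
paths = [
    "64500 13335 13335 13335",
    "64501 64502 13335",
    "64503 13335 13335 13335",
]
histogram = aggregate(paths)
print(dict(histogram))
```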


Global Rollout: Looking Glasses

● Insert into Prometheus
● python_client
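The slide's `python_client` presumably refers to the Prometheus Python client library. To stay dependency-free, this sketch instead renders the Prometheus text exposition format by hand, which is what a scrape target serves; the metric name `lg_prepend_length`, its labels, and the looking-glass hostnames are hypothetical, and label-value escaping is not handled.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in Prometheus text exposition format.
    `samples` maps a tuple of (label, value) pairs to a number."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples: prepends observed for one prefix per looking glass.
samples = {
    (("looking_glass", "lg1.example"), ("prefix", "173.245.58.0/24")): 0,
    (("looking_glass", "lg2.example"), ("prefix", "173.245.58.0/24")): 2,
}

text = render_gauge(
    "lg_prepend_length",
    "Origin-AS prepends seen per looking glass",
    samples,
)
print(text, end="")
```

Serving this text over HTTP lets Prometheus scrape it like any other target, after which the prepend length per looking glass can be graphed over time.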


Global Rollout: Looking Glasses

● Grafana visualization
● Track change in near-real-time


Global Rollout: Combined

● Traffic
● Prepends

[Figure: combined traffic and prepend graphs, annotated "Perturbance" and "Delay"]


Global Rollout: Takeaways


Global Rollout: Takeaways

● Negligible customer impact
● Route fluctuations for ± 2 minutes
● Took 1 hour to complete change, 2 days to prepare
● Instantly detected and resolved minor issues
● Heavily reliant on open source tooling and community
