About me
Infrastructure at Spotify since 2013
• Service Discovery (Nameless)
• Service Framework (http://spotify.github.io/apollo/)
• System-Z
• And some more
About this talk
• Why model microservices?
• Our solution: System-Z
• Design
• Learnings and Impact
=> Ideas about running microservices at scale
Why model microservices?
Z axis: Sharding
Y axis: Splitting
X axis: Cloning
http://artofscalability.com/
~14k servers
~1600 things
plus ~100 teams writing code
Problems to Solve
Discovering and understanding:
• What things are “out there”
• Deployments and configurations
• The system as a whole: how do things fit together?
• How to get more information: ownership
• What’s broken and how to fix it
Some terminology
• Component - a thing
  • microservice
  • data store
  • data pipeline
  • client component
• System - some components
These terms, like most things Z, are intentionally vague
As anyone
Overviews of ‘everything’, e.g.
• Container versions used
  • 30, by 240 components
• Various kinds of metrics
  • 212 experimental components
  • 14 squads don’t indicate a Slack channel
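Overviews like these are simple aggregations once component metadata sits in one place. The following is only a sketch of that idea: the `Component` fields, the squad names and the `slack_channels` mapping are all hypothetical and not taken from System-Z.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    # Hypothetical metadata fields, chosen only for this illustration.
    name: str
    squad: str
    container_version: Optional[str] = None
    experimental: bool = False


def container_version_usage(components):
    """Count how many components use each container version."""
    return Counter(
        c.container_version for c in components if c.container_version is not None
    )


def squads_without_slack(components, slack_channels):
    """List squads that own a component but have no Slack channel recorded."""
    return sorted({c.squad for c in components if not slack_channels.get(c.squad)})


# Made-up sample data.
components = [
    Component("playlist", "squad-a", "1.4"),
    Component("search", "squad-b", "1.4", experimental=True),
    Component("ads", "squad-c", "1.2"),
    Component("client-lib", "squad-a"),
]
slack_channels = {"squad-a": "#squad-a", "squad-b": ""}

usage = container_version_usage(components)
missing = squads_without_slack(components, slack_channels)
experimental_count = sum(c.experimental for c in components)
```

With the sample data, `usage` counts two components on version 1.4 and one on 1.2, and `missing` lists the two squads with an empty or absent channel.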
As a component user
For some component:
• Figure out who owns it
• Understand its API
• See its state
• Find documentation
As an owner
• Understand deployment status/configuration
• Provision servers
• Manage deployments
• Understand dependencies
And...
• Create new components
• Configure routing between data centres
• Check activity (logins, system updates, etc.)
• Set up service monitoring dashboards
• Check historical build information
• Configure data processing pipeline monitoring
• Store miscellaneous component metadata
• …
Core data model
• Many many-to-many relations
• Features add specific data
• Discovery names as indirection
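One way to picture these three points is the sketch below. It is not the actual System-Z schema: the `Model` class, its field names and the sample names are assumptions made for illustration, and treating a discovery name as a lookup key that resolves to a component is only one plausible reading of "indirection".

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Model:
    # Many-to-many relations: a component can belong to several systems
    # and have several owners, and vice versa.
    systems_of: dict = field(default_factory=lambda: defaultdict(set))
    owners_of: dict = field(default_factory=lambda: defaultdict(set))
    # Indirection: discovery name -> component name.
    discovery: dict = field(default_factory=dict)
    # Feature-specific data: feature name -> component name -> payload.
    feature_data: dict = field(default_factory=lambda: defaultdict(dict))

    def add_component(self, name, systems=(), owners=(), discovery_names=()):
        self.systems_of[name].update(systems)
        self.owners_of[name].update(owners)
        for discovery_name in discovery_names:
            self.discovery[discovery_name] = name

    def resolve(self, discovery_name):
        """Map a discovery name to its component, or None if unknown."""
        return self.discovery.get(discovery_name)

    def components_in(self, system):
        """All components that are part of the given system."""
        return sorted(c for c, s in self.systems_of.items() if system in s)


# Made-up sample data.
model = Model()
model.add_component(
    "playlist-service",
    systems=["playlists", "social"],
    owners=["squad-a", "squad-b"],
    discovery_names=["playlist"],
)
model.add_component("playlist-db", systems=["playlists"], owners=["squad-a"])
# A hypothetical "deployment" feature attaches its own data to a component.
model.feature_data["deployment"]["playlist-service"] = {"hosts": 12}
```

Keeping feature data in a separate mapping means a new feature can attach its own data without changing the core relations.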
[Architecture diagram; labels: Frontend, Backend, UI Platform, Base, Capacity, Deployment, Cortana, Helios, Sysmodel, RDC, Squab, Barge, ServerDB, Git repos]
Dirty Data
• Organisational change => ownership confusion
• Infrastructure evolution => runtime confusion
• Owners don’t benefit from metadata quality
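Ownership confusion of this kind can at least be surfaced automatically. The sketch below is a hypothetical check, not something the talk describes: it flags components with no recorded owner or with an owner that is no longer a current squad, and all names in it are made up.

```python
def ownership_problems(owners_by_component, current_squads):
    """Return (component, problem) pairs for missing or stale owners."""
    problems = []
    for component in sorted(owners_by_component):
        owners = owners_by_component[component]
        if not owners:
            problems.append((component, "no owner recorded"))
            continue
        for owner in sorted(owners):
            if owner not in current_squads:
                problems.append(
                    (component, f"owner '{owner}' is not a current squad")
                )
    return problems


# Made-up sample data: one squad was renamed, one component was never claimed.
owners_by_component = {
    "playlist": {"squad-a"},
    "search": {"squad-old"},
    "ads": set(),
}
current_squads = {"squad-a", "squad-b"}

report = ownership_problems(owners_by_component, current_squads)
```

A report like this finds stale entries, but it does not address the third point: owners still need a reason to fix them.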
What users say
• The administration panel of the entire Spotify backend.
• Cthulhu
• The IO product to rule them all!
• Hydra-headed
• Kickin rad.
• A source of truth to discover services and libraries at Spotify
• A service without a definition of done.
Impact
• Teams integrating features, making them easier to find
• Teams talking about features => more consistent
• System-Z mentioned as ‘great’ in 2016 ‘What sucks’
• Swiss Army Knife (good?)
Conclusions
• Microservices => many small things, big picture is hard
• Metadata about microservices helps understand the system
• Our metadata is dirty; this is probably unavoidable
• Combining many tools => better collaboration and consistency
• The metadata is useful in different ways for different users