![Page 1: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/1.jpg)
Apache Mesos Ecosystem at Allegro - First Year of Production Use
Wojciech Lesicki - Product Manager
Tomasz Ziarko - Software Engineer
Allegro
![Page 2: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/2.jpg)
Agenda
● What we do in Allegro?
● Our Mesos Ecosystem
● How we deploy apps?
● Problems we’ve had
● Q&A
![Page 3: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/3.jpg)
What is Allegro?
![Page 4: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/4.jpg)
Allegro
● 16 years on the market
● Started as an auction site and now the biggest e-commerce company in Poland and one of the biggest in Central and Eastern Europe
● 50% of e-commerce market and 80% of m-commerce market in Poland
● 623 items sold every minute
![Page 5: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/5.jpg)
● 14 million users (37% of the population of Poland)
● 201 million visits and 3 billion page views per month
![Page 6: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/6.jpg)
Our infrastructure and IT
● Two data centers
● OpenStack: 510 hosts, 20128 CPUs, 5537 VMs, plus bare metal as a service (BaaS) with OpenStack Ironic
● Monolith (PHP) and microservices
● Around 500 people in IT, most of them software engineers
![Page 7: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/7.jpg)
OK, so why do we need Mesos?
![Page 8: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/8.jpg)
![Page 9: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/9.jpg)
Our deployment before Mesos
● No standards, no procedures
● Every team did deployment their own way
● Inefficient
![Page 10: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/10.jpg)
Architecture
![Page 11: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/11.jpg)
Implementation
[Diagram labels: OpenStack; Mesos Slave with Mesos Executor; Mesos Slave with Docker Executor; Mesos Master; Discovery Agent (twice); ZooKeeper; Marathon; Discovery (Consul)]
- 100% OpenStack (VMs + bare metal),
- Marathon as the scheduler,
- ZooKeeper for sync, state, and election,
- Consul for service discovery,
- Separate Mesos and Docker containerizers.
![Page 12: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/12.jpg)
Implementation
- Multiple clusters,
- Each spanning two data centers,
- Separate ecosystems,
- Fair-share distribution between data centers.
- Prod (105 slaves, 1000 CPUs)
- Test (96 slaves, 368 CPUs)
- Dev (30 slaves, 120 CPUs)
[Diagram: dc1 and dc2, with the Prod Mesos Cluster on the Prod Network, the Test Mesos Cluster on the Test Network, and the Dev Mesos Cluster on the Dev Network]
![Page 13: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/13.jpg)
Implementation
$ terraform apply -var "buildnr=setup234" \
    -var "branch=mesoscon2016" \
    -var "marathon_version=0.15.3-1.ubuntu1404" \
    -var "mesos_version=0.28.0-1boost+glog+protobuf" \
    -var 'masters.dc1=1' \
    -var 'slaves.dc1=2' \
    -var 'slaves.dc2=1'
openstack_compute_instance_v2.mesos-master-dc1: Refreshing state... (ID: ce86ab7a-3660-4702-bba0-5825ae2350b1)
openstack_compute_instance_v2.mesos-slave-dc1.1: Refreshing state... (ID: 39bfd9c1-f6b0-4056-a3ac-28b0136cb220)
openstack_compute_instance_v2.mesos-slave-dc1.0: Refreshing state... (ID: acfb2e86-b4d1-44bd-b9e0-2eb4685a76ff)
openstack_compute_instance_v2.mesos-slave-dc2.0: Creating...
...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
![Page 14: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/14.jpg)
Implementation
[Diagram labels: Mesos; Discovery; Config Service; SSL Service; MaaS; LBaaS; AppEngine Console (e.g. Bamboo, Stash, Artifactory)]
![Page 15: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/15.jpg)
Service Discovery
- Registration inside the cluster,
- Automatic or manual registration,
- Failure detection and change detection,
- DC-aware services.
![Page 16: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/16.jpg)
Service Discovery
marathon-consul
https://github.com/allegro/marathon-consul
[Diagram labels: Marathon Leader (Marathon, Marathon Consul); Event bus; Subscription; Consul Agent]
- Event-based registration of Marathon apps in Consul,
- Forwards data to the appropriate Consul agents,
- Leader aware,
- Cyclic resyncs of all information.
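The event-based registration listed on this slide can be pictured as a mapping from a Marathon event to a Consul service definition. The following is a minimal Python sketch of that idea, not the actual marathon-consul code. The event fields (`eventType`, `taskStatus`, `appId`, `taskId`, `host`, `ports`) follow Marathon's `status_update_event`. The naming rule and the `marathon` tag are assumptions made for illustration.

```python
from typing import Optional


def consul_registration_from_event(event: dict) -> Optional[dict]:
    """Map a Marathon status update event to a Consul service definition.

    Returns None for events that should not lead to a registration.
    """
    if event.get("eventType") != "status_update_event":
        return None
    if event.get("taskStatus") != "TASK_RUNNING":
        return None
    ports = event.get("ports") or []
    if not ports:
        return None
    # Hypothetical naming rule: "/shop/cart" becomes "shop-cart".
    name = event["appId"].strip("/").replace("/", "-")
    return {
        "ID": event["taskId"],
        "Name": name,
        "Tags": ["marathon"],  # illustrative tag, not taken from the talk
        "Address": event["host"],
        "Port": ports[0],
    }
```

The returned dictionary has the same shape as the service definition registered on a Consul agent elsewhere in this deck (ID, Name, Tags, Address, Port). A real integration would also need to deregister services when tasks stop, which this sketch leaves out.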
![Page 17: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/17.jpg)
Service Discovery
[Diagram: Slave 1, Slave 2, Slave 3 ... Slave n, each with a Slave Process and a Consul Agent; Marathon Leader (Marathon, Marathon Consul); Mesos Master. Arrow labels: Schedule; Register running tasks]
![Page 18: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/18.jpg)
Service Discovery
[Diagram, production of discovery events: Conwatch polls Consul (Consul Master, Consul Server) and publishes events to Hermes (Kafka); Marathon Leader]
[Diagram, discovery lookup: a running app does service lookup over DNS or REST through the Consul agent; Consul Master (aka Discovery Service)]
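The REST variant of the discovery lookup can be sketched in a few lines of Python. This is an illustration under assumptions, not Allegro's client code. It reads Consul's catalog endpoint (`/v1/catalog/service/<name>`) and uses the node address when an entry has no service address. The agent URL and the function names are hypothetical.

```python
import json
import urllib.request
from typing import List, Optional, Tuple


def endpoints_from_catalog(entries: list,
                           required_tag: Optional[str] = None) -> List[Tuple[str, int]]:
    """Turn Consul catalog entries into (host, port) pairs."""
    endpoints = []
    for entry in entries:
        tags = entry.get("ServiceTags") or []
        if required_tag is not None and required_tag not in tags:
            continue
        # An empty ServiceAddress means the service uses the node address.
        host = entry.get("ServiceAddress") or entry["Address"]
        endpoints.append((host, entry["ServicePort"]))
    return endpoints


def lookup(service: str, agent: str = "http://127.0.0.1:8500",
           required_tag: Optional[str] = None) -> List[Tuple[str, int]]:
    """Query the local Consul agent over REST (needs a reachable agent)."""
    with urllib.request.urlopen(f"{agent}/v1/catalog/service/{service}") as resp:
        return endpoints_from_catalog(json.load(resp), required_tag)
```

The DNS variant needs no client code at all: an app resolves a name of the form `<service>.service.consul` against the agent's DNS interface.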
![Page 19: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/19.jpg)
Discovery Service
{
  "ID": "mesoscon2016",
  "Name": "mesoscon2016",
  "Tags": [
    "std-srv",
    "v1"
  ],
  "Address": "127.0.0.1",
  "Port": 8000
}
$ curl -X POST -d @register_service_on_agent.json 127.0.0.1:8500/v1/agent/service/register
$ curl 127.0.0.1:8500/v1/agent/services | python -m json.tool
"mesoscon2016": {
    "Address": "127.0.0.1",
    "EnableTagOverride": false,
    "ID": "mesoscon2016",
    "ModifyIndex": 0,
    "Port": 8000,
    "Service": "mesoscon2016",
    "Tags": [
        "std-srv",
…..
![Page 20: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/20.jpg)
SSL Service
- Custom Mesos hook,
- Part of the microservice contract,
- Vault as the CA solution,
- Short-term certificates/keys,
- Generated for each instance.
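To make "short-term certificates generated for each instance" concrete, here is a hedged Python sketch of asking Vault for a certificate. The talk does not say how the hook talks to Vault. This sketch assumes Vault's PKI secrets backend mounted at `pki` and its `issue` endpoint. The role name, token, and common name are hypothetical, and the real hook is a Mesos module rather than a Python script.

```python
import json
import urllib.request


def build_issue_request(vault_addr: str, token: str, role: str,
                        common_name: str, ttl: str = "24h") -> urllib.request.Request:
    """Build (but do not send) a request for a short-lived certificate."""
    body = json.dumps({"common_name": common_name, "ttl": ttl}).encode("utf-8")
    return urllib.request.Request(
        url=f"{vault_addr}/v1/pki/issue/{role}",  # assumes the backend is mounted at "pki"
        data=body,
        method="POST",
        headers={"X-Vault-Token": token},
    )


def issue_certificate(request: urllib.request.Request) -> dict:
    """Send the request; the response "data" holds the certificate and key."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["data"]
```

A short `ttl` keeps each certificate tied to the lifetime of one instance, so a leaked key expires quickly without any revocation step.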
![Page 21: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/21.jpg)
SSL Service
[Diagram, application environment setup: Slave 1; Slave Process; certhook; Vault; Storage; Consul service; Extend env; Executor]
[Diagram, application usage: app_x and app_y in SSL mutual mode]
![Page 22: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/22.jpg)
Config Service
- Secure storage,
- Fetched over mutual SSL,
- Version-controlled config,
- Authorized apps only,
- Easy to use,
- Peer review of changes.
![Page 23: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/23.jpg)
Config Service
[Diagram, fetch config data: a starting app talks to the Config Service over mutual SSL; the Config Service gets the revision and environment config from a Git repository (Revision X, Revision Y, Revision Z) holding encrypted data]
[Diagram, push configuration: encrypted valuable data goes to the configured Git repo via git push]
![Page 24: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/24.jpg)
MAAS
- Metrics collected,
- Dashboards set,
- Service owners get notified,
- Triggers, not mandatory,
- Multiple monitoring solutions.
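As a small illustration of the "metrics collected" step, the sketch below formats and sends one data point in Graphite's plaintext protocol (one `path value timestamp` line per metric, by default on TCP port 2003). The deck shows Diamond collectors feeding Graphite. This Python code only illustrates the wire format and is not what Diamond runs. The metric path and host name are made up.

```python
import socket
import time
from typing import Optional


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one data point in Graphite's plaintext protocol."""
    return f"{path} {value} {timestamp}\n"


def send_metric(path: str, value: float, host: str = "graphite.example",
                port: int = 2003, timestamp: Optional[int] = None) -> None:
    """Send a single data point over TCP (needs a reachable Graphite)."""
    if timestamp is None:
        timestamp = int(time.time())
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value, timestamp).encode("ascii"))
```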
![Page 25: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/25.jpg)
MAAS
[Diagram: Diamond Collector on the Mesos Master and on the Mesos Slave; Graphite; Grafana; Cabot with checks definitions from a Git repo; Triggers; Notifications; Mesos Cluster Events via Kafka (Hermes); Subscription; Email and PagerDuty reaching the Developer. Arrow labels: Metric; Events; Notify]
![Page 26: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/26.jpg)
LBAAS
- Based on discovery,
- Available through discovery tags,
- HAProxy at the core.
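One way to picture "based on discovery, HAProxy at the core" is to render an HAProxy backend from the instances that discovery returns. The Python sketch below is an assumption about the general technique, not Allegro's LBaaS implementation. The service name, server naming scheme, and the `balance roundrobin` choice are illustrative.

```python
from typing import List, Tuple


def render_backend(service: str, instances: List[Tuple[str, int]]) -> str:
    """Render an HAProxy backend section for the given (host, port) instances."""
    lines = [f"backend {service}", "    balance roundrobin"]
    for index, (host, port) in enumerate(instances):
        # "check" enables HAProxy's own health checking for this server.
        lines.append(f"    server {service}-{index} {host}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_backend("cart", [("10.0.0.1", 8000), ("10.0.0.2", 8001)]))
```

Regenerating this section whenever the set of tagged instances changes, then reloading HAProxy, keeps the load balancer in step with discovery.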
![Page 27: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/27.jpg)
LBAAS
[Diagram: Consul service catalog (Service X information and Service Y information, each with Instance x and Instance y); Disco; Register instance and Unregister instance events over Kafka/Hermes (pub/sub); REST config; HAProxy; Varnish (VAAS)]
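The register and unregister events in this diagram can be derived by comparing two successive snapshots of the service catalog, which is roughly what a poller such as Conwatch has to do before publishing. The deck does not describe Conwatch's internals. The Python sketch below assumes instances are represented as `(service, host, port)` tuples and that events are plain dictionaries.

```python
from typing import List, Set, Tuple

Instance = Tuple[str, str, int]  # (service, host, port), an assumed representation


def diff_instances(previous: Set[Instance], current: Set[Instance]) -> List[dict]:
    """Compare two catalog snapshots and return the events to publish."""
    events = []
    for instance in sorted(current - previous):
        events.append({"type": "register", "instance": instance})
    for instance in sorted(previous - current):
        events.append({"type": "unregister", "instance": instance})
    return events
```

Subscribers such as a load balancer then only need to react to the events instead of polling the catalog themselves.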
![Page 28: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/28.jpg)
Implementation
[Diagram labels: Mesos Master; Mesos Agent (twice); Marathon Consul; Consul Server; Consul Agent; Conwatch; Vault; Kafka; Graphite; MAAS; VAAS]
![Page 29: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/29.jpg)
Demo
![Page 30: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/30.jpg)
Figures
![Page 31: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/31.jpg)
What our Mesos Ecosystem gives our devs:
![Page 32: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/32.jpg)
What our Mesos Ecosystem gives our devs:
1. Fast and easy deployment of new applications
![Page 33: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/33.jpg)
What our Mesos Ecosystem gives our devs:
1. Fast and easy deployment of new applications
2. Standardization (e.g. out-of-the-box monitoring tools)
![Page 34: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/34.jpg)
What our Mesos Ecosystem gives our devs:
1. Fast and easy deployment of new applications
2. Standardization (e.g. out-of-the-box monitoring tools)
3. Automation
![Page 35: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/35.jpg)
What our Mesos Ecosystem gives our devs:
1. Fast and easy deployment of new applications
2. Standardization (e.g. out-of-the-box monitoring tools)
3. Automation
4. Self-healing
![Page 36: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/36.jpg)
The bumpy road // solved
![Page 37: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/37.jpg)
Netisolation killing slaves
![Page 38: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/38.jpg)
Netisolation killing slaves
- Enabled network isolation,
- Many cyclic deploys on the test environment,
- Consulted our fellow Mesos developers,
- Decided to disable it,
- Problem solved.
![Page 39: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/39.jpg)
Marathon registers multiple times
- On an error while getting znode data,
- Marathon registers with a different framework ID,
- Exhausting resources in the cluster,
- After version 0.14 the behaviour changed,
- Now Marathon just waits,
- Maybe a ZooKeeper problem, maybe not; solved anyway.
![Page 40: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/40.jpg)
Deploy constraints
![Page 41: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/41.jpg)
Deploy constraints
- We want instances spread across DCs/zones,
- Constraints worked unpredictably,
- They took into account application instances that were about to be shut down,
- Multi-constraint definitions were prone to unpredictable results,
- Solved in the newest version, so far.
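For readers unfamiliar with Marathon constraints, the Python sketch below builds an app definition that asks Marathon to spread instances evenly across data centers with the `GROUP_BY` operator. The slides do not show Allegro's actual definitions. The agent attribute name `dc`, the app id, the command, and the resource values are all assumptions.

```python
import json


def app_definition(app_id: str, instances: int, dc_count: int) -> dict:
    """Build a Marathon app definition spread across data centers."""
    return {
        "id": app_id,
        "cmd": "sleep 3600",  # placeholder command
        "cpus": 0.5,
        "mem": 256,
        "instances": instances,
        # Each constraint is [attribute, operator, optional value]. GROUP_BY
        # distributes tasks evenly over the values of the attribute, and the
        # value tells Marathon how many distinct groups to expect.
        "constraints": [["dc", "GROUP_BY", str(dc_count)]],
    }


if __name__ == "__main__":
    print(json.dumps(app_definition("/shop/cart", 4, 2), indent=2))
```

Combining several constraints in one definition (for example `GROUP_BY` on a data center attribute plus `UNIQUE` on `hostname`) is the multi-constraint case the slide flags as unpredictable.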
![Page 42: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/42.jpg)
Readiness checks ...
![Page 43: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/43.jpg)
Readiness checks ...
- Applications are deployed and upgraded on the blue-green principle,
- Recently started instances are not ready to handle load,
- No standard mechanism for checking that applications are really running,
- Check passed? Not passed? It doesn't matter,
- We developed a custom service wrapper.
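The slides do not show the custom service wrapper, so the following is only a sketch of the general idea in Python: poll a readiness probe until it succeeds or a timeout expires, and only then let the instance take traffic. The function name, parameters, and defaults are hypothetical. The `sleep` and `clock` arguments exist only to make the loop easy to test.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # A failing probe (for example connection refused) means "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A wrapper built this way could, for example, register the instance in discovery only after `wait_until_ready` returns True, so that freshly started instances do not receive load too early.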
![Page 44: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/44.jpg)
The bumpy road # occurring
- DC failure, AWS standby master for quorum,
- Application scaling, usage vs allocation (we are trying to build our own autoscaling),
- User authorization, per-user quotas,
- Graceful shutdown,
- Various open endpoints without authorization.
![Page 45: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/45.jpg)
In a nutshell - you have seen
● Our Mesos Ecosystem
● Our deployment
● Our bumpy road
![Page 46: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/46.jpg)
Mesos - it takes some time and effort
![Page 47: Apache Mesos Ecosystem at Allegro First Year of Production Use](https://reader036.vdocuments.site/reader036/viewer/2022070520/58f1c2531a28aba64f8b460b/html5/thumbnails/47.jpg)
Mesos - it takes some time and effort, but it's worth it.