OpenStack Swift usage at Turkcell
Doruk Aksoy, Orhan Bıyıklıoglu, Christian Schwede
OpenStack Summit Barcelona, October 2016
About Turkcell
Leading mobile operator in Turkey and the region
● 9 countries
● 66+ million subscribers
[Map: Ankara, Istanbul]
Turkcell's public cloud storage offering
Meet Lifebox (Akıllı Depo)
● Store and share photos, videos & music in the cloud
● https://mylifebox.com
● Can be used on mobile and desktop
● Open to everyone
Turkcell's public cloud storage offering
Meet Lifebox (Akıllı Depo)
● Started in 2014 with a legacy solution
● Migrated in 2015 to a solution developed in-house, using OpenStack Swift as the storage backend
● Today: 3.3 PB of storage, over 3M users
Planning the Swift deployment
Using Swift
Why Swift is a good fit in this case
● Unstructured data: object storage makes sense
○ Metadata stored separately (filenames, directories, tags)
● Availability, durability, scalability, flexibility
○ Resilience to failures
○ Use existing hardware
○ Run customized middleware
● Swift can be deployed standalone
Things to keep in mind
● Distribute objects and containers
○ Billions of objects in a single container doesn’t scale
● Keep eventual consistency in mind
● Estimate your growth
○ … and choose your partition powers wisely
● Know your failure domains
○ … and design your rings around them
Plan well and avoid future worries
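A minimal sketch of two of the planning points above, under stated assumptions. The partition power estimate uses a commonly cited rule of thumb (about 100 partitions per drive at the largest size the cluster is expected to reach, rounded up to a power of two); the drive count is hypothetical, not Turkcell's. The container helper shows one possible way to spread objects over many containers; the prefix, shard count, and hashing scheme are illustrative choices, not the scheme Lifebox uses.

```python
import hashlib


def partition_power(max_drives, partitions_per_drive=100):
    """Smallest power p such that 2**p >= max_drives * partitions_per_drive."""
    wanted = max_drives * partitions_per_drive
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, without float rounding.
    return (wanted - 1).bit_length()


def container_for(object_name, shards=256, prefix="photos"):
    """Map an object name to one of `shards` containers (hypothetical naming)."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    return f"{prefix}-{int(digest, 16) % shards:04d}"


if __name__ == "__main__":
    # Hypothetical growth estimate: at most 5000 drives over the cluster lifetime.
    print(partition_power(5000))
    print(container_for("user42/holiday/img_0001.jpg"))
```

Because the partition power is hard to change once data has been written, the estimate is best based on the largest cluster size that seems plausible rather than on the initial deployment.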
App architecture
[Diagram components: App, Swift Proxy, RabbitMQ, Oracle DB, Elasticsearch, ImageMagick, ffmpeg, Keystone, MySQL DB]
Swift deployment & monitoring
Initial architecture
8 identical servers, 3 storage systems
● Load balancer
● Region 1: swift01, swift02, swift03, Storage 1
● Region 2: swift04, swift05, swift06, Storage 2
● Region 3: swift07, swift08, Storage 3
● Statsd / grafana
Initial architecture
● Load balancer
● Swift Proxy, Keystone
● Swift Account, Swift Container, Swift Object
● MySQL
● Disks
Deploying Swift
Red Hat Enterprise Linux & OpenStack Platform
● Customized standalone Swift deployment
● Bare-metal server deployment using Kickstart
● Manual ring management
● Ansible to install & configure Swift
○ Started using the manual install guide
○ Tuned settings later on based on metrics
Customized Ansible playbook
● Single Ansible playbook using tags for:
○ Repository management & RPM installation
○ Installation of customized middleware
○ Configuration & tuning of Swift & Keystone
○ Ring deployment
○ Enabling & restarting of services
Monitoring
The usual suspects: statsd, grafana, recon, ...
● Separate INFO & WARN log files for each service
● statsd metrics collected and visualized using Grafana
● swift-recon to collect important metrics and trigger alarms
● swift-dispersion-report to monitor rebalance progress
● healthcheck middleware queried by existing monitoring system
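As an illustration of the last bullet, a monitoring probe might look roughly like the sketch below. Swift's healthcheck middleware answers GET /healthcheck with status 200 and the body "OK" when the service is up; the host, port, and function names here are hypothetical and not taken from Turkcell's monitoring system.

```python
import urllib.error
import urllib.request


def is_healthy(status, body):
    """Interpret a healthcheck response: 200 with body 'OK' means healthy."""
    return status == 200 and body.strip() == "OK"


def check_proxy(url, timeout=2.0):
    """Return True if the Swift endpoint at `url` reports healthy.

    Any connection error, timeout, or HTTP error status counts as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return is_healthy(resp.status, body)
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    # Hypothetical proxy address; replace with a real endpoint.
    print(check_proxy("http://192.168.10.1:8080/healthcheck"))
```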
swift-dispersion-report
Monitor rebalance progress
swift-dispersion-report --object-only
Queried 8192 objects for dispersion reporting, 25s, 0 retries
There were 3190 partitions missing 0 copy.
There were 5002 partitions missing 1 copy.
79.65% of object copies found (19574 of 24576)
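The percentage in the report can be recomputed from the other lines, which is handy when feeding the numbers into an alerting system. The sketch below parses the sample output shown on the slide; the replica count of 3 is an assumption (it is consistent with 24576 = 8192 x 3), and the parsing is tied to this sample text rather than to any guaranteed output format.

```python
import re

REPORT = """\
Queried 8192 objects for dispersion reporting, 25s, 0 retries
There were 3190 partitions missing 0 copy.
There were 5002 partitions missing 1 copy.
79.65% of object copies found (19574 of 24576)
"""


def summarize(report, replicas=3):
    """Return (copies_found, copies_expected, percent_found) for a report."""
    queried = int(re.search(r"Queried (\d+) objects", report).group(1))
    # Each match is (number of partitions, copies missing per partition).
    rows = re.findall(r"There were (\d+) partitions missing (\d+) cop", report)
    missing_copies = sum(int(count) * int(missing) for count, missing in rows)
    expected = queried * replicas
    found = expected - missing_copies
    return found, expected, round(100.0 * found / expected, 2)


if __name__ == "__main__":
    print(summarize(REPORT))  # (19574, 24576, 79.65)
```

During a rebalance the percentage is expected to climb back toward 100% as replication catches up, so tracking it over time gives a rough progress indicator.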
swift-recon
Querying metrics directly from Swift
curl http://192.168.10.1:6002/recon/load
{"5m": 0.18, "15m": 0.35, "processes": 16105, "tasks": "1/131", "1m": 0.11}
swift-recon --replication
[replication_time] low: 2863, high: 53089, avg: 24440.5, total: 195523, Failed: 0.0%, no_result: 0, reported: 8
Oldest completion was 2016-07-27 21:12:36 (2 days ago) by...
Most recent completion was 2016-07-29 21:19:34 (3 hours ago)
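One way to turn the recon load output into an alarm is sketched below. It works on the JSON payload shown on the slide; the core count and per-core threshold are hypothetical values chosen for illustration, and the function name is not part of swift-recon.

```python
import json

SAMPLE = ('{"5m": 0.18, "15m": 0.35, "processes": 16105, '
          '"tasks": "1/131", "1m": 0.11}')


def load_alarms(payload, cores=16, per_core_limit=1.0):
    """Return the load averages in `payload` that exceed cores * per_core_limit."""
    load = json.loads(payload)
    limit = cores * per_core_limit
    return {key: load[key] for key in ("1m", "5m", "15m") if load[key] > limit}


if __name__ == "__main__":
    # With the assumed 16 cores, the sample load is far below the limit.
    print(load_alarms(SAMPLE))  # {}
```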
Challenges
Challenges
Rebalancing, write_affinity & inodes
● Started with 8 servers, added 5 new servers
○ 40% of data needed to be redistributed evenly across all nodes
● write_affinity: write to two regions initially, replicate to the third afterwards
○ Requires more space in primary regions
● Data grew as fast as new disks/servers were added
○ Running replicators with handoffs_first helped
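The roughly 40% figure is in line with a back-of-the-envelope estimate: if all servers have the same capacity (an assumption; the slides only say the initial 8 were identical), an even distribution after adding 5 servers to 8 puts 5/13 of the data on the new servers, so about that share has to move.

```python
from fractions import Fraction


def fraction_to_move(old_servers, new_servers):
    """Share of existing data that ends up on the new servers once the
    cluster is evenly balanced, assuming equally sized servers."""
    return Fraction(new_servers, old_servers + new_servers)


if __name__ == "__main__":
    share = fraction_to_move(8, 5)
    print(share, f"{float(share):.1%}")  # 5/13 38.5%
```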
Tuning Swift
Process concurrency, timeouts, cache pressure
● Increased thread concurrency / workers
○ replicator workers affect IOPS
● Increased object-replicator timeout settings
○ node_timeout
○ http_timeout
○ rsync_io_timeout
○ rsync_timeout
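The settings above live in the [object-replicator] section of object-server.conf. The sketch below renders such a section with Python's configparser so the result can be reviewed or templated; every value is a placeholder chosen for illustration, not the value Turkcell settled on, and suitable numbers depend on hardware and measured metrics.

```python
import configparser
import io

# Hypothetical values; only the option names come from Swift's replicator config.
EXAMPLE_SETTINGS = {
    "concurrency": 4,
    "node_timeout": 30,
    "http_timeout": 120,
    "rsync_timeout": 1800,
    "rsync_io_timeout": 60,
    "handoffs_first": "true",
}


def replicator_section(settings):
    """Render an [object-replicator] section as INI text."""
    cfg = configparser.ConfigParser()
    cfg["object-replicator"] = {key: str(value) for key, value in settings.items()}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(replicator_section(EXAMPLE_SETTINGS))
```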
Things are seldom what they seem
When something's broken, it's likely not Swift's fault
● Know your load balancer well
○ Especially when streaming data
● Closely monitor other moving parts
○ Keystone response times
○ low-level IO stats
■ inode cache misses slowed down replication a lot
■ vfs_cache_pressure = 1
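A small check for the kernel setting mentioned above, as a sketch: on Linux the current value is exposed at /proc/sys/vm/vfs_cache_pressure (the kernel default is 100; lower values make the kernel prefer keeping inode and dentry caches). The target of 1 comes from the slide; the function names are hypothetical.

```python
from pathlib import Path

SYSCTL_PATH = "/proc/sys/vm/vfs_cache_pressure"


def read_vfs_cache_pressure(path=SYSCTL_PATH):
    """Read the current vm.vfs_cache_pressure value as an integer."""
    return int(Path(path).read_text().strip())


def needs_tuning(current, target=1):
    """True if the running value differs from the desired one."""
    return current != target


if __name__ == "__main__":
    if Path(SYSCTL_PATH).exists():
        value = read_vfs_cache_pressure()
        print(value, "needs tuning" if needs_tuning(value) else "ok")
```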
Outlook
Growing usage
More servers, distributed services, and more clusters
● Growth is actually higher than initially expected
● Expand server and storage capacity 5x by the end of 2017
● Upgrade RHEL and OpenStack Platform
○ While in production
○ Add Elasticsearch/Kibana/Logstash
Next steps
More servers, distributed services, and more clusters
● Run services separately
○ A few dedicated Keystone servers
○ Dedicated object storage nodes
○ Use storage policies to keep disk usage balanced
● A second Swift cluster for a different app is already in place
Questions?