Storage security in a critical enterprise OpenStack environment
Danny Al-Gaaf (Deutsche Telekom AG), Sage Weil (Red Hat)
OpenStack Summit 2015 - Vancouver
Overview
● Secure NFV cloud at DT
● Attack surface
● Proactive countermeasures
  ○ Setup
  ○ Vulnerability prevention
  ○ Breach mitigation
● Reactive countermeasures
  ○ 0-days, CVEs
  ○ Security support SLA and lifecycle
● Conclusions
Secure NFV Cloud @ DT
NFV Cloud @ Deutsche Telekom
● Datacenter design
  ○ BDCs
    ■ few, but classic DCs
    ■ high SLAs for infrastructure and services
    ■ for private/customer data and services
  ○ FDCs
    ■ many, but small
    ■ near to the customer
    ■ lower SLAs, can fail at any time
    ■ services:
      ● spread over many FDCs
      ● failures are handled by the services, not the infrastructure
High Security Requirements
● Multiple security placement zones (PZ)
  ○ e.g. EHD, DMZ, MZ, SEC, Management
  ○ TelcoWG “Security Segregation” use case
● Separation required for:
  ○ compute
  ○ networks
  ○ storage
● Protect against many attack vectors
● Enforced and reviewed by security department
● Run telco core services on OpenStack/KVM/Ceph
Ceph and OpenStack
Ceph Architecture
Solutions for telco services
● Separation between security zones needed
● Physical separation
  ○ Large number of clusters (>100)
  ○ Large hardware demand (compute and storage)
  ○ High maintenance effort
  ○ Less flexibility
● RADOS pool separation
  ○ Much more flexible
  ○ Efficient use of hardware
● Question:
  ○ Can we get the same security as physical separation?
Placement Zones
● Separate RADOS pool(s) for each security zone
  ○ Limit access using Ceph capabilities
● OpenStack AZs as PZs
● Cinder
  ○ Configure one backend/volume type per pool (each with its own key)
  ○ Need to map between AZs and volume types via policy
● Glance
  ○ Lacks separation between control and compute/storage layer
  ○ Separate read-only vs management endpoints
● Manila
  ○ Currently not planned for production use with CephFS
  ○ May use RBD via NFS
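The per-zone Cinder setup above can be sketched as a multi-backend configuration with one RADOS pool and one CephX user per volume type; all pool, user, and zone names below are illustrative, not from the talk.

```ini
# cinder.conf (sketch): one RBD backend per placement zone,
# each with its own pool and CephX key. Names are illustrative.
[DEFAULT]
enabled_backends = rbd-mz,rbd-dmz

[rbd-mz]
volume_backend_name = rbd-mz
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes-mz
rbd_user = cinder-mz
rbd_secret_uuid = <libvirt secret UUID for the cinder-mz key>

[rbd-dmz]
volume_backend_name = rbd-dmz
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes-dmz
rbd_user = cinder-dmz
rbd_secret_uuid = <libvirt secret UUID for the cinder-dmz key>
```

Each backend is then exposed as its own volume type (matched via the `volume_backend_name` extra spec) and mapped to the corresponding AZ by policy.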
Attack Surface
RadosGW attack surface
● S3/Swift
  ○ Network access to gateway only
  ○ No direct access for consumers to other Ceph daemons
● Single API attack surface
RBD attack surface
● Protection from hypervisor block layer
  ○ No network access or CephX keys needed at guest level
● Issue:
  ○ hypervisor is software and therefore not 100% secure…
    ■ e.g., Venom!
Host attack surface
● If KVM is compromised, the attacker ...
  ○ has access to neighbor VMs
  ○ has access to local Ceph keys
  ○ has access to the Ceph public network and Ceph daemons
● Firewalls, deep packet inspection (DPI), ...
  ○ partly impractical due to the protocols used
  ○ implications for performance and cost
● Bottom line: Ceph daemons must resist attack
  ○ C/C++ is harder to secure than e.g. Python
  ○ Homogeneous: if one daemon is vulnerable, all in the cluster are!
  ○ Risk of denial of service
Network attack surface
● Client/cluster sessions are not encrypted
  ○ Sniffer can recover any data read or written
● Sessions are authenticated
  ○ Attacker cannot impersonate clients or servers
  ○ Attacker cannot mount man-in-the-middle attacks
Denial of Service
● Scenarios
  ○ Submit many / large / expensive IOs
    ■ use qemu IO throttling!
  ○ Open many connections
  ○ Use flaws to crash Ceph daemons
  ○ Identify non-obvious but expensive features of the client/OSD interface
Proactive Countermeasures
Deployment and Setup
● Network
  ○ Always use separated cluster and public networks
  ○ Always separate your control nodes from other networks
  ○ Don’t expose to the open internet
  ○ Encrypt inter-datacenter traffic
● Avoid hyper-converged infrastructure
  ○ Isolate compute and storage resources
  ○ Scale them independently
  ○ Risk mitigation if daemons are compromised or DoS’d
  ○ Don’t mix
    ■ compute and storage
    ■ control nodes (OpenStack and Ceph)
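The cluster/public network split above maps to a few lines of ceph.conf; the subnets here are illustrative:

```ini
# ceph.conf (sketch): separate client-facing and replication traffic.
[global]
public network  = 10.0.10.0/24   # monitors and client IO
cluster network = 10.0.20.0/24   # OSD replication and recovery only
```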
Deploying RadosGW
● Big and easy target through the HTTP(S) protocol
● Small appliance per tenant with
  ○ Separate network
  ○ SSL-terminating proxy forwarding requests to radosgw
  ○ WAF (mod_security) to filter
  ○ Placed in secure/managed zone
● Don’t share buckets/users between tenants
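A per-tenant appliance along these lines could combine SSL termination, mod_security filtering, and reverse proxying in one Apache vhost; hostnames, certificate paths, and the radosgw backend port are illustrative assumptions:

```apache
# Sketch of a per-tenant SSL/WAF proxy in front of radosgw.
# Hostnames, paths, and backend port are illustrative.
<VirtualHost *:443>
    ServerName s3.tenant-a.example.com

    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/tenant-a.crt
    SSLCertificateKeyFile /etc/ssl/private/tenant-a.key

    # mod_security WAF filtering before requests reach radosgw
    SecRuleEngine On

    ProxyPass        / http://radosgw.internal:7480/
    ProxyPassReverse / http://radosgw.internal:7480/
</VirtualHost>
```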
Ceph security: CephX
● Monitors are trusted key servers
  ○ Store copies of all entity keys
  ○ Each key has an associated “capability”
    ■ Plaintext description of what the key user is allowed to do
● What you get
  ○ Mutual authentication of client + server
  ○ Extensible authorization w/ “capabilities”
  ○ Protection from man-in-the-middle and TCP session hijacking
● What you don’t get
  ○ Secrecy (encryption over the wire)
Ceph security: CephX take-aways
● Monitors must be secured
  ○ Protect the key database
● Key management is important
  ○ Separate key for each Cinder backend/AZ
  ○ Restrict capabilities associated with each key
  ○ Limit administrators’ power
    ■ use ‘allow profile admin’ and ‘allow profile readonly’
    ■ restrict role-definer or ‘allow *’ keys
  ○ Careful key distribution (Ceph and OpenStack nodes)
● To do:
  ○ Thorough CephX code review by security experts
  ○ Audit OpenStack deployment tools’ key distribution
  ○ Improve security documentation
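The key-management rules above translate into capability strings along these lines; client and pool names are illustrative:

```shell
# One key per Cinder backend, restricted to its own pool:
ceph auth get-or-create client.cinder-mz \
    mon 'allow r' \
    osd 'allow rwx pool=volumes-mz'

# Restricted administrative keys instead of broad 'allow *':
ceph auth get-or-create client.operator mon 'allow profile admin'
ceph auth get-or-create client.auditor  mon 'allow profile readonly'
```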
Preventing Breaches - Defects
● Static Code Analysis (SCA)
  ○ Buffer overflows and other code flaws
  ○ Regular Coverity scans
    ■ 996 fixed, 284 dismissed, 420 outstanding
    ■ defect density 0.97
  ○ cppcheck
  ○ LLVM: clang/scan-build
● Runtime analysis
  ○ valgrind memcheck
● Plan
  ○ Reduce backlog of low-priority issues (e.g., issues in test code)
  ○ Automated reporting of new SCA issues on pull requests
  ○ Improve code reviewer awareness of security defects
Preventing Breaches - Hardening
● Pen testing
  ○ human attempt to subvert security, generally guided by code review
● Fuzz testing
  ○ computer attempt to subvert or crash, by feeding garbage input
● Hardened build
  ○ -fpie -fpic
  ○ -D_FORTIFY_SOURCE=2 -O2 (?)
  ○ -fstack-protector-strong
  ○ -Wl,-z,relro,-z,now
  ○ Check for performance regressions!
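The flags above could be wired into a build roughly as follows; the `./configure` invocation is a stand-in for whatever build system is in use:

```shell
# Hardening flags from the slide; measure performance before adopting.
CFLAGS="-fpie -fpic -D_FORTIFY_SOURCE=2 -O2 -fstack-protector-strong"
LDFLAGS="-pie -Wl,-z,relro,-z,now"
./configure CFLAGS="$CFLAGS" LDFLAGS="$LDFLAGS"
```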
Mitigating Breaches
● Run non-root daemons
  ○ Prevent escalating privileges to get root
  ○ Run as ‘ceph’ user and group
  ○ Pending for Infernalis
● MAC
  ○ SELinux / AppArmor
  ○ Profiles for daemons and tools planned for Infernalis
● Run (some) daemons in VMs or containers
  ○ Monitor and RGW - less resource intensive
  ○ MDS - maybe
  ○ OSD - prefers direct access to hardware
● Separate mon admin network
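Once the Infernalis work lands, dropping root could look roughly like this; the flag names are assumed from the pending change, and the data path is illustrative:

```shell
# Run an OSD as the unprivileged 'ceph' user instead of root.
chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
ceph-osd -i 0 --setuser ceph --setgroup ceph
```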
Encryption: Data at Rest
● The ceph-disk tool supports dm-crypt
  ○ Encrypt raw block device (OSD and journal)
  ○ Allow disks to be safely discarded if the key remains secret
● Key management is still very simple
  ○ Encryption key stored on disk via LUKS
  ○ LUKS key stored in /etc/ceph/keys
● Plan
  ○ Petera, a new key escrow project from Red Hat
    ■ https://github.com/npmccallum/petera
  ○ Alternative: simple key management via monitor
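Preparing an encrypted OSD with ceph-disk is a single call; the device name here is illustrative:

```shell
# Create an OSD whose data and journal live on dm-crypt (LUKS) devices.
# The LUKS key lands on the local filesystem (see the key-management
# caveat above), so protect that directory.
ceph-disk prepare --dmcrypt /dev/sdb
```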
Encryption: On Wire
● Goal
  ○ Protect data from someone listening in on the network
  ○ Protect administrator sessions configuring client keys
● Plan
  ○ Generate per-session keys based on existing tickets
  ○ Selectively encrypt monitor administrator sessions
Denial of Service attacks
● Limit load from clients
  ○ Use qemu IO throttling features - set a safe upper bound
● To do:
  ○ Limit max open sockets per OSD
  ○ Limit max open sockets per source IP
    ■ handle in Ceph or in the network layer?
  ○ Throttle operations per session or per client (vs. just globally)?
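One way to set that safe upper bound is through Nova flavor extra specs, which Nova passes down to qemu's IO throttling; the flavor name and limits are illustrative:

```shell
# Cap disk IO for every VM built from this flavor.
nova flavor-key m1.tenant set quota:disk_total_iops_sec=500
nova flavor-key m1.tenant set quota:disk_total_bytes_sec=104857600  # 100 MiB/s
```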
CephFS
● No standard virtualization layer (unlike block)
  ○ Proxy through gateway (NFS?)
  ○ Filesystem passthrough (9p/virtfs) to host
  ○ Allow direct access from tenant VM
● Granularity of access control is harder
  ○ No simple mapping to RADOS objects
● Work in progress
  ○ root_squash
  ○ Restrict mount to subtree
  ○ Restrict mount to user
Reactive Countermeasures
Reactive Security Process
● Community
  ○ Single point of contact: [email protected]
    ■ Core development team
    ■ Red Hat, SUSE, Canonical security teams
  ○ Security-related fixes are prioritized and backported
  ○ Releases may be accelerated on an ad hoc basis
  ○ Security advisories to [email protected]
● Red Hat Ceph
  ○ Strict SLA on issues raised with the Red Hat security team
  ○ Escalation process to Ceph developers
  ○ Red Hat security team drives the CVE process
  ○ Hot fixes distributed via Red Hat’s CDN
Detecting and Preventing Breaches
● Brute force attacks
  ○ Good logging of any failed authentication
  ○ Monitoring is easy via existing tools, e.g. Nagios
● To do:
  ○ Automatic blacklisting of IPs/clients after n failed attempts at the Ceph level
● Unauthorized injection of keys
  ○ Monitor the audit log
    ■ trigger alerts for auth events -> monitoring
  ○ Periodic comparison with a signed backup of the auth database?
Conclusions
Summary
● Reactive processes are in place
  ○ [email protected], CVEs, downstream product updates, etc.
● Proactive measures in progress
  ○ Code quality improving (SCA, etc.)
  ○ Unprivileged daemons
  ○ MAC (SELinux, AppArmor)
  ○ Encryption
● Progress defining security best practices
  ○ Document best practices for security
● Ongoing process
Get involved !
● OpenStack
  ○ Telco Working Group
    ■ #openstack-nfv
  ○ Cinder, Glance, Manila, ...
● Ceph
  ○ https://ceph.com/community/contribute/
  ○ [email protected]
  ○ IRC: OFTC
    ■ #ceph
    ■ #ceph-devel
  ○ Ceph Developer Summit
[email protected]@redhat.com
dalgaafsage
Danny Al-Gaaf Senior Cloud TechnologistSage Weil Ceph Principal Architect
IRC
THANK YOU!