
Cloud Storage Security

K. Gopinath, IISc

A Very Brief History of Cloud

● मेघदूतम् (“The Cloud Messenger”) – a cloud messaging platform by काळीदास (Kālidāsa)

● AWS in 2006

– IaaS
● S3: A Cloud Storage Service

● Now Microsoft and others

– PaaS, AaaS

2014 celebrity photo leaks

● images obtained via the online storage

– offered by Apple's iCloud platform for automatically backing up photos from iOS devices, such as iPhones

● (Apple) victims' iCloud account info obtained using "a very targeted attack on user names, passwords and security questions", such as phishing and brute-force guessing

– no specific vulnerability in the iCloud service itself?

● Aggregation issue...

Outline

● “Regular” security?

– “Cloud Ecosystem” security (vuln. wrt certificates, DNS, etc.)
● Includes all “systems” insecurities

– Poor key mgmt (but connected with cloud)
● Amazon EC2 keys stored in open-source projects?
● But may need better solutions (RBAC/RBE)

– DDoS

● vs “Cloud” security?

– Assume all pkts encrypted in the cloud
– How much info is still leaking? Or, exploitable?
– Security/Privacy in the context of data aggregation

● Correlation Analysis, Traffic analysis

Ecosystem Issues

● Bugs in Infrastructure

– Recent examples (ShellShock, HeartBleed, USB)

– Very basic issues in IPMI 2.0
● In servers for LOM (“lights-out mgmt”)
● Similar firmware insecurity in other components

● Trust wrt Certificates

● DDoS Mitigation

Trusting Certificates

● Added trust with “public-key pinning” where possible

– mechanism for sites to specify which certificate authorities have issued valid certs for that site, and for user-agents to reject TLS connections to those sites if the certificate is not issued by a known-good CA

– 2011: attempted SSL man-in-the-middle (MITM) attacks against Google users, whereby someone tried to get between users (primarily located in Iran) and encrypted Google services.

● attacker used a fraudulent SSL cert issued by DigiNotar, a root cert auth that should not issue certs for Google (and has since revoked it)

– Firefox 32 has supported pinning since ’14; Chrome since ’11 (see the pin-computation sketch below)

● Also, stolen private keys of device vendors (“Stuxnet”)

– Issue during update of drivers, etc.

– Need to establish trust wrt certificates (CMU work ~’08)
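A minimal sketch (not from the talk) of computing a public-key pin for a site: base64(SHA-256 over the certificate's SubjectPublicKeyInfo), which a pinning-aware client can compare against a known-good value before accepting the TLS connection. The host name is only an example, and the Python "cryptography" package is an assumed dependency.

import base64, hashlib, ssl
from cryptography import x509
from cryptography.hazmat.primitives import serialization

pem = ssl.get_server_certificate(("www.google.com", 443))   # example host
cert = x509.load_pem_x509_certificate(pem.encode())
spki = cert.public_key().public_bytes(
    serialization.Encoding.DER,
    serialization.PublicFormat.SubjectPublicKeyInfo)
print("pin-sha256:", base64.b64encode(hashlib.sha256(spki).digest()).decode())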

“Cloud” Security

● Secrecy of “sustained communication” across cloud

– Is information leaking and how much?

– Discussed in this talk!
● Computation on Encrypted Objects?

– Homomorphic Encryption (HE) suitable for cloud!?
● Fully HE: a cryptosystem that supports arbitrary computation on ciphertexts
● Partially HE: only certain ops (det., +, *, order-preserving, ...)
– Std, AHE (Paillier) / MHE (El Gamal) / ...
– Fully HE “impractical” while Partially HE may be feasible (additive-HE sketch below)
● The 1st worries about how to plug the leaks; the 2nd worries about how to exploit the structure of the “crypto” (a “regulated” leak!)
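A minimal sketch (not from the talk) of the additive homomorphism in Paillier, with toy parameters, purely to illustrate the “partially HE” idea; not a secure or efficient implementation.

import math, random

def keygen(p=293, q=433):                        # toy demo primes only
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                                    # standard simple choice of g
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)   # mu = L(g^lam mod n^2)^-1 mod n
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

pub, priv = keygen()
c1, c2 = encrypt(pub, 133), encrypt(pub, 47)
c_sum = (c1 * c2) % (pub[0] ** 2)   # multiplying ciphertexts ...
assert decrypt(priv, c_sum) == 180  # ... adds the plaintexts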

Program Analysis and Partially Homomorphic Encryption

● Recent work uses program analysis on Map-Reduce programs to select specific Partially Homomorphic Encryption (PHE) schemes

– “Practical Confidentiality Preserving Big Data Analysis”, Stephen et al (HotCloud'14)

● Based on Pig Latin, a high level data flow language in the Hadoop system

● Generate Data Flow Graph (DFG) from source

● Identify encryption scheme

● Transform

– Generate new DFG using available encryption schemes

– Replace operations by their cryptographic equivalents

● RESULTS:

– PigMix benchmark: 3× overhead (avg) in terms of latency

– PHE overhead extremely low compared to FHE.

Example (from Stephen et al, HotCloud’14)

doc1: (3,X), (133,Y), ...
doc2: (56,Q), (344,R), (47,Y), ...

group: ((133,Y), (47,Y)), ...
add: (180,Y), ...

AWS

● Compute & Networking

– Amazon EC2, Auto Scaling, Elastic Load Balancing (ELB), Amazon VPC, Amazon Route 53, AWS Direct Connect

● Storage & Content Delivery Netw – Amazon S3, Amazon Glacier, Amazon EBS, AWS Import/Export,

AWS Storage Gateway, Amazon Cloud Front● Databases

– Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon Redshift

● Application Services
– Amazon CloudSearch, Amazon SWF, Amazon SQS, Amazon SES, Amazon SNS, Amazon FPS, Amazon Elastic Transcoder
● Deployment & Management
– AWS Management Console, AWS Identity and Access Management (IAM), Amazon CloudWatch, AWS Elastic Beanstalk, AWS CloudFormation, AWS Data Pipeline, AWS OpsWorks, AWS CloudHSM

● AWS Marketplace & Software

Storage APIs

● POSIX: read(fd, buffer, count)

– Partial writes to a file OK (appends, overwrites, etc)

– mmap available
● NFS: read(fd, offset, buffer, count)

– Partial writes and mmap available

● Amazon S3: “storage” service

– Key Value store; no features like partial write or mmap!

S3 Interface: Key-Value Store

● Amazon S3 stores data in named buckets

– Each bucket is a flat namespace, containing keys associated with objects (but not another bucket)

– Max obj size 5 GB. Partial writes to objects not allowed (objects must be uploaded in full), but partial reads OK

● Storage API (see the boto3 sketch after this list)
– create bucket

– put bucket, key, object

– get bucket, key

– delete bucket, key

– delete bucket

– list keys in bucket: $ aws s3 ls s3://mybucket

– list all buckets
● $ aws s3 cp myfolder s3://mybucket/myfolder --recursive
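As a hedged illustration (not from the talk), the same key-value style API through the boto3 SDK; bucket and key names are hypothetical, and create_bucket may need a region configuration outside us-east-1.

import boto3

s3 = boto3.client("s3")
s3.create_bucket(Bucket="mybucket")
s3.put_object(Bucket="mybucket", Key="myfolder/doc1", Body=b"hello")  # objects uploaded in full
obj = s3.get_object(Bucket="mybucket", Key="myfolder/doc1")           # a Range arg would give a partial read
print(obj["Body"].read())
for entry in s3.list_objects_v2(Bucket="mybucket").get("Contents", []):
    print(entry["Key"])                                               # list keys in bucket
s3.delete_object(Bucket="mybucket", Key="myfolder/doc1")
s3.delete_bucket(Bucket="mybucket")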

Security Model

● Auth betw (EC2) Machine Instance M and Storage Bucket Y

– Admin, e.g., creates a role X s.t. only instances (such as M) can assume role X with RO perms for some bucket Y

– Programmer creates instance M with role X

– Appl. P (running on M) retrieves temporary credentials from M

● after expiry, renewed automatically if developed with the SDK (an IAM policy sketch follows)
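A hedged sketch (not from the talk) of the admin step with boto3: create a role X that only EC2 instances can assume, with read-only permissions on a bucket Y; role, policy and bucket names are hypothetical.

import json, boto3

iam = boto3.client("iam")
trust = {  # who may assume role X: the EC2 service, i.e. instances such as M
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow",
                   "Principal": {"Service": "ec2.amazonaws.com"},
                   "Action": "sts:AssumeRole"}]}
perms = {  # what role X may do: read-only access to bucket Y
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow",
                   "Action": ["s3:GetObject", "s3:ListBucket"],
                   "Resource": ["arn:aws:s3:::bucket-y", "arn:aws:s3:::bucket-y/*"]}]}
iam.create_role(RoleName="role-x", AssumeRolePolicyDocument=json.dumps(trust))
iam.put_role_policy(RoleName="role-x", PolicyName="ro-bucket-y",
                    PolicyDocument=json.dumps(perms))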

RBAC and RBE

● AWS uses a form of RBAC
● Better: crypto-enhanced RBAC models

● e.g., “Achieving Secure Role-Based Access Control on Encrypted Data in Cloud Storage”, Lan Zhou, Vijay Varadharajan, and Michael Hitchens. Uses the Weil pairing.

– Role-based encryption (RBE)
● Design req.: users only need to keep a single key for decryption, and system operations are efficient regardless of the complexity of the role hierarchy and user membership in the system

● Add an information flow model?

– e.g., all network pkts are at a particular “net” label (low) level

– encrypted info can further be declassified (in crypto-enhanced RBAC models) by those with decryption keys (high level) instead of keeping it at “net” label level

What could be a deeper “cloud security” problem?

● (Ecosystem problems)

● Analysis on streams of cmds and responses between instances and storage objects, aka “traffic analysis”

– Can we infer something purely from the cmd streams?
● Compression: the LBX extension to X reduces BW required

– Even with encrypted streams? Comparable with crypto security
● “crypto” considered broken with 2^40 or so ops

● Split Problem to 2 parts

– Workload Identification based on cmds/responses

– Identification of cmds/responses even if encrypted

Workload Identification

● “Discovery of Application Workloads from Network File Traces,” Yadwadkar, Bhattacharyya, Gopinath, Niranjan, Susarla, FAST 2010

– Work at IISc/NetApp. Uses only cmd name info and no other fields

Variability in the traces belonging to the same class

● Additions, deletions and mutations
● Issue with conventional HMMs
● Need profile HMMs

Analogy with a Problem in Computational Biology

● In computational biology

– Need for multiple Alignment

– Divergence due to chance mutations

– Conserving critical parts

● Workload identification is essentially statistical
– Probabilistic models appropriate

– With Markov property due to statistical regularities: Markov Models with Hidden States (HMMs)

– Need a profile to represent multiple alignment of variable traces

– Proposal: use profile HMM for representing profile of a workload

Pairwise Alignments

● Global Alignment

– compute two equal-length sequences such that matches are maximized and insertions/deletions are minimized (see the sketch after this slide)

● Local Alignment

– locate two subsequences, one from each string, that are very similar
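A minimal sketch (not from the paper) of global alignment of two command traces, Needleman–Wunsch style, with assumed scores (+1 match, -1 mismatch/gap); the two traces below are hypothetical.

def global_align_score(a, b, match=1, mismatch=-1, gap=-1):
    # score[i][j] = best score for aligning prefixes a[:i] and b[:j]
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # deletion (gap in b)
                              score[i][j - 1] + gap)    # insertion (gap in a)
    return score[n][m]

t1 = ["LOOKUP", "GETATTR", "READ", "READ", "ACCESS"]
t2 = ["LOOKUP", "GETATTR", "ACCESS", "READ", "READ", "READ"]
print(global_align_score(t1, t2))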

An Example Multiple Alignment

● Need for multiple alignment of sequences

– Detecting similarity between more than two sequences

● Multiple alignment of 10 `edit' traces

● Unfortunately,
– Extending DP to r sequences has complexity (time & space) O(n^r)

Transition Structure of a profile HMM

● Essentially L-R HMMs

Workload Identification

● Pre-train a profile HMM for each workload

– globally align the profile with the unknown sequence

● Trace Annotation
– compute a local alignment of each profile with the test trace

Automated Learning on Real Traces

● Small snippets of traces are sufficient for identifying many workloads

Workload Identification: Summary

● Profile HMM approach successful at discovering the application-level behavioural characteristics
● A diverse, representative sample of workloads required for training
● Not able to handle higher levels of concurrency
● Multiple-kernel SVM methods can improve accuracy (HotStorage’12)
– Uses all fields instead of only the cmd name

● Pankaj Pipada, Achintya Kundu, K Gopinath, Chiranjib Bhattacharyya, Sai Susarla, Nagesh P. C., "LoadIQ: Online learning to label program phases using storage traces," HotStorage Jun 2012

– But encryption again makes it not that useful!

Encrypted Command Streams: How secure are they?

• Here, we use NFS instead of S3
– Also, encrypted NFS commands

• Extraction of feature vectors from Network Traces to form training data

• Application of supervised and unsupervised machine learning techniques on the training data to predict the NFS command

Approach

• Set up NFS client and server using SSH tunneling
• Tap encrypted/decrypted network trace using Wireshark
• Extract feature vectors for each NFS command from the encrypted NFS trace
• Feature vectors considered (see the extraction sketch after this list):

– Request Packet Size

– Reply Packet Size

– Round Trip Time

– Ratio of Reply and Request Size
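A hedged sketch (not from the talk) of extracting the four features from a capture of the tunnelled traffic with scapy, assuming a strictly synchronous request/reply pattern and a known client address; the capture file and address are hypothetical.

from scapy.all import rdpcap, IP, TCP

CLIENT = "10.0.0.2"                         # hypothetical NFS client address
pkts = [p for p in rdpcap("nfs_tunnel.pcap")
        if IP in p and TCP in p and len(p[TCP].payload) > 0]

features = []
i = 0
while i + 1 < len(pkts):
    req, rep = pkts[i], pkts[i + 1]
    if req[IP].src == CLIENT and rep[IP].dst == CLIENT:    # request then its reply
        req_size, rep_size = len(req[TCP].payload), len(rep[TCP].payload)
        rtt = float(rep.time) - float(req.time)
        features.append([req_size, rep_size, rtt, rep_size / req_size])
        i += 2
    else:
        i += 1
print(len(features), "feature vectors, 4 features each")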

Trace Scenario

[Figure: the NFS client and NFS server communicate over a secure (SSH-tunnelled) channel; the client sees the decrypted trace, while an intruder tapping the network sees only the encrypted trace]

Trace Extraction

• Some challenges involved in extracting the feature vectors from the encrypted dump, e.g.
– Non-consecutive request and reply sequences

• Extraction of true labels (NFS command) from the decrypted dump for
1. Feature set validation using supervised learning
2. Finding classification error in unsupervised learning

NFS Commands in Traces

i. GetAttr

ii. Access

iii. Read

iv. ReadDirPlus

v. FSStat

vi. Lookup

No. of features per vector = 4
No. of output classes = 6
No. of training examples = 7103

Machine Learning Techniques Used

Feature set validation using supervised learning (needs labelled examples, i.e. both encrypted and decrypted pkts, to learn the mapping function):

1. Hashing and analysis
2. Decision tree

Unsupervised learning: needed as only encrypted pkts are available in practice

3. K-Means

4. Hidden Markov Model

5. Online Hidden Markov Model

Relative comparison of accuracy of the methods

Supervised Techniques

Technique 1: Hashing and Analysis
• Analyze the pattern
• Find similarity among the data corresponding to the same command in the decrypted trace
• Hard-code the rules of classification
• Accuracy: 99+%

Technique 2: Decision Tree (a scikit-learn sketch follows)
• Generates a classification tree based on feature vectors to predict NFS cmds at the leaves of the tree
• Uses the principle of “minimization of entropy”
• Accuracy: 94%

Decision Tree
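A hedged scikit-learn sketch (not from the talk) of this step, assuming the feature matrix and the true NFS-command labels from the decrypted trace are already available in the hypothetical names "features" and "labels".

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3)
clf = DecisionTreeClassifier(criterion="entropy")   # "minimization of entropy"
clf.fit(X_train, y_train)                           # learn rules predicting the NFS cmd
print("accuracy:", clf.score(X_test, y_test))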

Unsupervised learning

• No access to labels from decrypted data
• Only the timestamp and packet size of encrypted commands
• Assumptions:

1. The number of files and their sizes in a directory on the server are distributed according to a Gaussian distribution.

2. Naïve Bayes model of the feature vector (the observed effects are conditionally independent given the cause)

3. Data is generated from Gaussian Distribution of feature vectors.

4. Synchronous Request from Single Client Server Environment

Technique 3: K-Means

• Clustering-based technique
• Classifies the training data into 6 clusters corresponding to NFS commands
– Use offline access to decrypted data to infer properties of pkts that can be used to assign labels to clusters

• Accuracy: 87%
• “Hard clustering” (a scikit-learn sketch follows)
• Converges in a finite number of iterations
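A hedged scikit-learn sketch (not from the talk) of the K-Means step, again assuming a hypothetical "features" matrix built from the encrypted trace only.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=6, n_init=10)       # one cluster per NFS command
cluster_ids = km.fit_predict(features)     # hard assignment of each vector to a cluster
# Cluster ids are then mapped to commands using properties inferred offline
# from a decrypted trace (e.g. typical reply sizes per command).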

Technique 4: Hidden Markov Model

• The actual cause (NFS command) is unobserved, but the effect (encrypted n/w trace) is observed

• Hidden variable: NFS command
• Observed variables: packet size and RTT
• Why use an HMM?
– Because there is correlation between successive commands; no independence assumptions

• “Soft clustering”
• Accuracy: 82% on training data (an hmmlearn sketch follows)
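A hedged sketch (not from the talk) of the HMM step using the hmmlearn package, with the six hidden states standing in for the six NFS commands; the "features" matrix is hypothetical.

import numpy as np
from hmmlearn.hmm import GaussianHMM

X = np.asarray(features, dtype=float)
hmm = GaussianHMM(n_components=6, covariance_type="diag", n_iter=50)
hmm.fit(X)                  # Baum-Welch over the observation sequence
states = hmm.predict(X)     # Viterbi decoding: most likely hidden command per step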

Technique 5: Online HMM

• Why online?
– Realistic way to model online generation of traces in an NFS network

– Prevents over-application of the Markov structure in the data

• A sequence of n/w trace for 1000 consecutive NFS commands is collected.

• Repeated for every new 1000 n/w traces
• Accuracy: 84%

Comparison of Models

Technique Accuracy %

Hashing 99

Decision Tree 94

K-Means 87

HMM 82

Online HMM 84

Summary (Leakage of Info)

Unsupervised more useful:
• Realistic scenario in most cases
• Robust (soft clustering)
• More general and easily extended to other netw frameworks
• But accuracy lower
• Can handle missing data or labels
• Useful? In new Android attacks that can predict pwd input

Since accuracy is 80% and above, there is a need to obfuscate:
• Add variable padding (a small padding sketch follows this slide)

– But Goldwasser et al prove no “secure” obfuscation!

– No “perfect” crime
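A minimal sketch (not from the talk) of variable padding: round each payload up to a bucket size and occasionally add a whole extra bucket, so that observed sizes leak less about the underlying command. The bucket size is an assumed tuning parameter, and a real scheme must also encode the original length so the receiver can strip the padding.

import os, secrets

BUCKET = 256  # bytes; assumed tuning parameter

def pad(payload: bytes) -> bytes:
    target = -(-len(payload) // BUCKET) * BUCKET      # round up to next bucket
    target += secrets.randbelow(2) * BUCKET           # sometimes add an extra bucket
    return payload + os.urandom(target - len(payload))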

Conclusions

● Cloud storage security: “cloudy” with some chances of rain!

● “Ecosystem” security and “real” cloud security aspects are intertwined
– Detect pwd input through side channels

– Aggregation attacks

– Traffic analysis attacks
