Azure + DataStax Enterprise Powers Office 365 Per User Store

Azure + DSE Powers O365 Per-User Store

1 Introduction

2 What We Built

3 How We Built It Using Cassandra + Spark

4 Why We Built It On Azure + DSE

5 How Can You Build It

6 Wrap Up

© 2015. All Rights Reserved.

Introduction


Sean Usher

Office 365

Email: [email protected]

Twitter: @seanushermsft

Silvano Coriani

Azure


Twitter: @scoriani

Introduction – Office 365


Email

Collaboration

Document Authoring

Social Networking

Calendaring

File Storage

Business Intelligence

Etc…

Introduction – Azure


Azure is Microsoft’s cloud computing platform, a growing collection of integrated services (analytics, computing, database, mobile, networking, storage, and web) for moving faster, achieving more, and saving money.

What We Built


A way to understand our users and organizations at a deeper level!

Questions:

• Are users happy with the service they are receiving?

• Are users fully utilizing the services they are paying us for?

• Are users hitting issues that we can proactively help them with?

• How has a user’s experience been over their lifetime?

• Can we discover insights that we aren’t even aware of?

This requires ingesting and storing a lot of data. We need to be able to perform fast, scalable analytics on that data, or we will discover issues too late!

What We Built (contd)


Platform Requirements:

• Running in the cloud

• Highly scalable

• Initial ingestion of 50k events/sec, growing rapidly

• Millisecond latency for reads/writes

• High Availability

• Tunable Consistency

• Real-time and batch analytics

• Machine learning

• Storing aggregated data for 1+ years

One month to get it built…
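As a back-of-the-envelope check on the ingestion requirement, the sketch below converts the stated initial rate of 50k events/sec into a daily event count. Only the 50k figure comes from the requirements; the function name is illustrative, and a constant rate is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(events_per_second: int) -> int:
    """Daily event count, assuming a constant ingestion rate."""
    return events_per_second * SECONDS_PER_DAY


initial_rate = 50_000  # events/sec, from the platform requirements
daily_events = events_per_day(initial_rate)
print(f"{daily_events:,} events/day")  # 4,320,000,000 events/day
```

At this rate the store takes in several billion events per day, which helps explain why only aggregated data is kept for 1+ years.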

Topology:

1 Physical Datacenter (Eventually will be geo-replicated)

2 Logical Datacenters (Cassandra, Analytics[Spark])

All machines are in a virtual network (VNet) and are assigned internal static IPs.

No inbound access to Cassandra machines from outside the VNet.

How We Built It Using Cassandra + Spark (contd)


[Diagram: one physical datacenter containing a VNet, which hosts two logical DCs: Spark and C* (Cassandra).]

Configuration (Azure G4 machines):

• Ubuntu

• 16 cores

• 224 GB RAM

• 3TB local SSD

Snitch: GossipingPropertyFileSnitch

Replication: All keyspaces use NetworkTopologyStrategy with replication factor=3 in each DC.
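To make the link between replication factor and tunable consistency concrete, the sketch below applies Cassandra's quorum rule (floor(RF / 2) + 1) to this topology. The deck does not say which consistency levels the service uses, so this is only arithmetic for the stated replication settings.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


RF_PER_DC = 3      # from the deck: replication factor 3 in each DC
LOGICAL_DCS = 2    # from the deck: Cassandra DC and Analytics (Spark) DC

# LOCAL_QUORUM counts replicas in the coordinator's DC only.
local_quorum = quorum(RF_PER_DC)
# QUORUM counts replicas across all DCs (sum of replication factors).
global_quorum = quorum(RF_PER_DC * LOGICAL_DCS)

print(local_quorum, global_quorum)  # 2 4
```

With three replicas per DC, a LOCAL_QUORUM read or write can still succeed when one replica in that DC is unavailable.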



Node Type   Node Count   Heap Size   GC Type
Cassandra   12           12 GB       G1
Spark       12           20 GB       G1



Ingestion:

[Diagram components: REST API, O365, Queue (event hub), Ingestion Worker (Azure worker role using DataStax C# driver), Cassandra.]

All data is ingested into the Cassandra DC

All read APIs read from the Cassandra DC

All ingested data is PII scrubbed

APIs are the only way to get data in or out of Cassandra.



What Data Do We Ingest?

CREATE TABLE userdatasetraw (
    userid text,
    timepk timestamp,
    device text,
    createdtime timestamp,
    errorcode text,
    errordetail text,
    omstid text,
    useragent text,
    PRIMARY KEY ((userid, timepk, device), createdtime)
) WITH CLUSTERING ORDER BY (createdtime DESC) AND
    COMPACTION={'base_time_seconds':'50','class':'DateTieredCompactionStrategy','max_sstable_age_days':'0.25'…

Sample row:

userid:       102033ffa4a7079e7c
timepk:       1441152000000
device:       8000000A
createdtime:  1441215133000
errorcode:    Failure
errordetail:  InvalidOperation
omstid:       15321c64d-0A92-4f4e7-bcc8-2aeb89354ff26
useragent:    null
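Reading the sample row's createdtime as 1441215133000 and its timepk as 1441152000000 (both epoch milliseconds), timepk looks like createdtime truncated to midnight UTC, which would keep each (userid, timepk, device) partition bounded to one day of events. That is an inference from the sample row, not something the deck states. A minimal sketch of the bucketing, with an illustrative function name:

```python
from datetime import datetime, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000


def day_bucket_ms(created_ms: int) -> int:
    """Truncate an epoch-millisecond timestamp to midnight UTC of the same day."""
    return created_ms - (created_ms % MS_PER_DAY)


created = 1441215133000           # createdtime from the sample row
bucket = day_bucket_ms(created)   # candidate timepk value
print(bucket)                     # 1441152000000
print(datetime.fromtimestamp(bucket / 1000, tz=timezone.utc).isoformat())
```

The computed bucket matches the sample row's timepk, which is consistent with (but does not prove) a one-day bucketing scheme.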



What Data Do We Ingest?

CREATE TABLE tenantsubscription (
    omstid text,
    skuid text,
    city text static,
    comculture text static,
    exchgtid text static,
    exousercnt int static,
    geoareacd text static,
    isconcierge boolean static,
    istrial boolean,
    liccount int,
    lyousercnt int static,
    numtrailsubs int,
    orgname text static,
    sku text,
    spousercnt int static,
    statechangedt timestamp,
    subautorenew boolean,
    subcreated timestamp,
    subexpiry timestamp,
    substate text,
    tdssynctime timestamp,
    usedliccount int,
    PRIMARY KEY ((omstid), skuid)
)

Sample row (selected columns):

omstid:        129001cd1-21dc-4706-96cfe-1f632522d365ae
skuid:         6fe22a85e-b296-42f0-b187-1b91e9394b900
city:          BANGKOK
exousrcnt:     1000
liccnt:        1500
subautorenew:  false
sku:           OFFICE 365 ENTERPRISE E3



Aggregations

Spark batch jobs roll up UserDatasetRaw into 1-hour and 24-hour aggregates.

Other jobs join the 1-hour and 24-hour aggregates with the tenant subscription and dataset tables to calculate insights.

Result:

1. Great feedback from customers and support agents!

2. Saved customers money!

3. Saved customers from being locked out of their service!

4. Proactively fixed the user experience (detected customer misconfiguration)!
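The hourly roll-up described above can be illustrated in plain Python. This is a stand-in for the idea only, not the Spark job itself: the deck does not show what the aggregates contain, so counting events per (userid, hour, errorcode) is an assumption, and the events are hypothetical.

```python
from collections import Counter

MS_PER_HOUR = 60 * 60 * 1000


def hourly_error_counts(events):
    """Count events per (userid, hour bucket in epoch ms, errorcode)."""
    counts = Counter()
    for event in events:
        # Truncate createdtime to the start of its hour.
        hour = event["createdtime"] - event["createdtime"] % MS_PER_HOUR
        counts[(event["userid"], hour, event["errorcode"])] += 1
    return counts


# Hypothetical raw events using column names from userdatasetraw.
events = [
    {"userid": "u1", "createdtime": 1441215133000, "errorcode": "Failure"},
    {"userid": "u1", "createdtime": 1441215200000, "errorcode": "Failure"},
    {"userid": "u1", "createdtime": 1441219000000, "errorcode": "Failure"},
]

for key, count in sorted(hourly_error_counts(events).items()):
    print(key, count)
```

The first two events fall in the same hour and collapse into one aggregate row; a 24-hour roll-up would use the same pattern with a day-sized bucket.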



What Problems Did We Run Into?

• Bad Data Modelling:
  • Partitions getting too large (1-2 GB), which raised the “compacting large row” warning and led to OOM errors (Cassandra 2.0).
  • Not using DateTieredCompactionStrategy for time-series data.

• Bad Configuration:
  • OS limits configured too low (ulimit, nofiles, etc…)
  • Number of concurrent compactors and flush writers too low.
  • Not using G1 garbage collection with large heaps.

• Not paying enough attention to blocked flush writers and dropped mutations.

• Allowing SSTable count to get too high, causing OOM errors.

Why We Built It On Azure + DSE


Why Azure?

• We didn’t want to manage bare metal and the overhead it brings.

• Easy to add capacity without ordering hardware and rack space.

• We have used Azure for other services for ~5 years.

• We have built tools for deploying and managing services.

• Great track record with Azure support.

• We love to try out new things on the Azure platform!

Why We Built It On Azure + DSE (contd)


Why Cassandra?

The Good

• Low Latency ✓

• Linear Scale ✓

• Highly Available ✓

• Aggregations (Spark/Spark Streaming) ✓

• Machine Learning (Spark ML) ✓

• No Enforcement of Full Consistency ✓ ✓ ✓

The Not-So-Good

• No Hosted Option in Azure ✗

• Have to Install and Configure it Ourselves ✗

Why We Built It On Azure + DSE (contd)


Why DataStax Enterprise?

Training:

Cassandra can be complex, and its success depends on various design decisions. Getting training from the experts is invaluable to ensuring our success.

Integration:

DataStax has built integration between Cassandra and Spark (as well as other products) and provides a tested package that we can depend on. The package also includes the OpsCenter UI.

Support:

Cassandra is new to us. There is nothing better than being able to send off a message when something goes wrong, and getting experts to help solve the problem.

How Can You Build It


- Marketplace

- Simplified set of deployment options

- Bring Your Own License

- Production Cluster or Dev Sandbox

- 4, 12, 36 or 90 nodes

- Pick your VM type and size

- Single VNET

- OpsCenter:

http://{cluster}.{region}.cloudapp.azure.com:8888

- Define your own deployment

1. Group cluster resources based on common lifecycle
   1. E.g. separate infrastructure components from compute nodes
2. Define compute and storage options for nodes in the cluster
   1. Pick your VM type and size
   2. Ephemeral vs persistent disks (Standard/Premium)
   3. Snapshots
3. Define networking options
   1. VNETs configuration
   2. Cross-DC (VNET to VNET) connectivity
4. Performance considerations
   1. Compute
   2. Storage
   3. Networking

Azure Resource Manager principles

AZURE RESOURCE MANAGER API

• RBAC-based

• Template-driven

• Declarative and imperative

• Idempotent

• Multi-service

• Multi-region

• Extensible

Resource Group: container for multiple resources

• resources exist in one* resource group
• resource groups can span regions
• resource groups can span services

Deployment: tracks template execution

• created within a resource group
• allows nested deployments

Inside the Box vs. Outside the Box

• Template describes the topology (outside the box)

• Template extensions can initiate state configuration (inside the box)

• Multiple extensions available for Windows and Linux VMs
  – DSC
  – Chef
  – Puppet
  – Custom Scripts
  – AppService + WebDeploy
  – SQLDB + BACPAC

Common Use Cases for ARM Templates

• Enterprises and System Integrators
  – Delivering a capability or cloud capacity (building block templates, e.g. DSE)
  – Delivering an end to end application (solution templates)

• Cloud Service Vendors (CSVs)
  – Support different multi-tenancy approaches
    • Distinct deployments per customer
      – Within the CSV’s subscription
      – “Bring Your Own Subscription” model that uses customer subscriptions
    • Scale units within a central multi-tenant system
    • Marketplace integration

• All deploy known configurations/skus/t-shirt sizes
  – Lots of variables makes free form less desirable
  – T-shirt Sizes / SKUs are the common approach
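The t-shirt-size approach maps naturally onto an ARM template parameter restricted by `allowedValues`. The sketch below builds a skeleton template as a Python dict and prints it as JSON. The parameter name `nodeCount` is hypothetical, the sizes reuse the 4/12/36/90 node options listed for the Marketplace offering, and the `resources` list is left empty, so this is not the DataStax template.

```python
import json


def building_block_template(allowed_node_counts):
    """Return a skeleton ARM template that limits cluster size to fixed options."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            # Hypothetical parameter name; only the listed sizes are accepted.
            "nodeCount": {
                "type": "int",
                "allowedValues": list(allowed_node_counts),
                "defaultValue": allowed_node_counts[0],
            }
        },
        "variables": {},
        "resources": [],  # VNet, storage, and VM resources would be declared here
        "outputs": {},
    }


template = building_block_template([4, 12, 36, 90])
print(json.dumps(template, indent=2))
```

Constraining the parameter up front means a deployment with an untested cluster size is rejected at validation time rather than failing partway through.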

Design and deploy a building block template

Go to http://github.com/azure/azure-quickstart-templates to find 100s of quick start deployment templates for finished solutions.

DataStax is evolving ARM deployment templates in this github repo to include DSE specific capabilities (e.g. multi-region topology) for those who want to manage their own deployment.

Deploying DataStax with the Azure CLI

Deploying DataStax with Azure Marketplace

http://github.com/azure/azure-quickstart-templates

https://www.youtube.com/watch?v=vacp267zLBA

https://www.youtube.com/watch?v=tmXdSEMjwCE

Compute and storage options for nodes in the cluster

• Compute families for production clusters
  – D-Series, G-Series (Xeon® E5 v3)
    • Local SSD disks
  – DS-Series, GS-Series
    • Premium Storage optimized, host caching for reads

• Storage options for nodes
  – Maintain data and logs on local ephemeral SSD disks
    • ~100k IOPS and 1.5 GB/sec on G5
  – Leverage Premium Storage Disks for persistent data and logs
    • P10, P20, P30 (128 GB to 1 TB, up to 5000 IOPS and 200 MB/sec)
    • Striped volumes to balance storage size, throughput and costs
    • Max 64 TB, 80000 IOPS and 1 GB/sec per node
  – Use Standard Storage for backup snapshots
    • Low cost, geo-replicated
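To see how striping trades off size, throughput and cost, the sketch below sums the per-disk figures quoted above for P30 disks and caps the result at the quoted per-node maximums. It assumes the "up to" numbers (1 TB, 5000 IOPS, 200 MB/sec) describe a single P30 disk, treats 1 TB as 1024 GB and 1 GB/sec as 1000 MB/sec, and ignores VM-size limits.

```python
# Per-disk figures for P30 as quoted in the slide (assumed to be per disk).
P30 = {"size_gb": 1024, "iops": 5000, "mb_per_sec": 200}
# Per-node ceilings as quoted in the slide.
NODE_MAX = {"size_gb": 64 * 1024, "iops": 80000, "mb_per_sec": 1000}


def striped_volume(disk, disk_count, node_max=NODE_MAX):
    """Aggregate capacity of a striped volume, capped at the per-node limits."""
    return {key: min(disk[key] * disk_count, node_max[key]) for key in disk}


for count in (4, 16):
    print(count, striped_volume(P30, count))
```

Under these assumptions four P30 disks give 4 TB at 20000 IOPS and 800 MB/sec, while sixteen disks reach the IOPS ceiling and are throughput-capped at 1000 MB/sec, so extra disks beyond that point add only capacity.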

Networking deployment options

• Supporting your replication topology (NetworkTopologyStrategy), including geo-replication, for disaster recovery or workload segregation purposes

• Within a VNET, bandwidth is a function of VM type/size
  – Up to 20Gbps for G5

• Cross-region VNET gateways
  – Standard (100Mbps) or High Performance (200Mbps), No-Crypto option
  – Latency impact proportional to distance
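When sizing a cross-region gateway, a quick estimate of replication bandwidth is useful. The sketch below compares a raw event stream against the two gateway tiers listed above. The 300-byte event size is purely hypothetical, the 50k events/sec rate is the initial ingestion figure from the platform requirements, and protocol overhead, compression and replication fan-out are ignored.

```python
GATEWAY_MBPS = {"Standard": 100, "High Performance": 200}  # from the slide


def required_mbps(events_per_second: int, bytes_per_event: int) -> float:
    """Raw bandwidth in megabits per second for a stream of events."""
    return events_per_second * bytes_per_event * 8 / 1_000_000


rate = 50_000       # events/sec, initial ingestion rate from the requirements
event_bytes = 300   # hypothetical average event size
need = required_mbps(rate, event_bytes)

fits = {name: need <= cap for name, cap in GATEWAY_MBPS.items()}
print(need, fits)
```

With these assumed numbers the stream needs about 120 Mbps, which exceeds the Standard tier and fits within High Performance; real sizing would need measured event sizes and headroom for growth.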

Contact Us


Sean Usher

Office 365


Twitter: @seanushermsft

Silvano Coriani

Azure


Twitter: @scoriani

Thank you
