WebRTC infrastructures in the large (with experiences on real cloud deployments)

WebRTC infrastructures in the large (with experiences from real deployments)

Luis [email protected]

IIT RTC Conference & Expo, October 2015

http://www.kurento.org

Speaker

• Coordinator of Kurento.org
  – FOSS project
  – WebRTC Media Server
  – WebRTC Media APIs
  – WebRTC Cloud Infrastructure
• Software developer
• Software trainer
• Software learner
• FOSS enthusiast

http://www.kurento.org

http://twitter/@kurentoms

https://www.youtube.com/channel/UCFtGhWYqahVlzMgGNtEmKug


WebRTC infrastructures

Peer-to-Peer WebRTC Application (without media infrastructure)

WebRTC video stream

WebRTC Application based on media infrastructure


Function of WebRTC infrastructures

Processing

VP8 H.264

Group Communications

Archiving


WebRTC infrastructures in the large

From the hundreds to the millions: the scalability problem

WebRTC Cloud


WebRTC cloud models

[Figure: spectrum of WebRTC cloud models, from IaaS through PaaS and APIaaS to SaaS]
• IaaS end (computing resources): high flexibility, complex development, low hourly costs
• SaaS end: low flexibility, simple development, high hourly costs
• Annotation: no WebRTC-specific players here


WebRTC cloud architectures

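[Figure: layers of a WebRTC cloud architecture and the cloud model each one maps to]
• Virtual infrastructure: IaaS
• WebRTC Platform: PaaS
• WebRTC API: APIaaS
• WebRTC Application: SaaS
• Annotations: "No new science here"; "The science for the scalability problem is here"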


WebRTC vs. traditional WWW platforms: the three tiers

Application Server Container

Service Layer

Application 1 … Application N

WebRTC Media Server

DB Server

Signaling


Vertical scalability on monolithic WebRTC platforms

[Diagram: one Application Server Instance (Application 1 … Application N) and one Media Server Instance]

[Figure: typical scalability curve for SFU media servers. Vertical axis: quality of service. Horizontal axis: number of WebRTC legs. Annotation: ~500 to 1000 on commodity hardware.]

The bottleneck is here


Horizontal scalability of WebRTC Media Servers

[Diagram: a Load Balancer, several Application Servers (…), a Media Resource Broker (RFC 6917) and several Media Servers (…)]


Media Resource Broker

• Functions
  – MS registration
    • MS instances register on the MRB
  – MS brokering
    • Query model
      – AS instances query the MRB for locating a MS instance
      – MRB is explicit for the AS
    • In-line model
      – MRB routes signaling (control requests)
      – MRB is transparent for the AS
• MRB does not hold state about MS instances
  – MS instances are independent
  – MS instances are equivalent
  – We say it’s stateless
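The query model above can be sketched in a few lines. This is only an illustration, not Kurento code: the class name `StatelessMRB`, its methods and the round-robin choice are assumptions. The broker keeps nothing but the list of registered MS addresses, and because all MS instances are taken to be independent and equivalent, any of them can serve the next request.

```python
class StatelessMRB:
    """Hypothetical broker: knows which MS instances exist, nothing about their load."""

    def __init__(self):
        self._servers = []   # registered MS addresses, in registration order
        self._cursor = 0     # round-robin position

    def register(self, ms_address):
        # MS registration: an MS instance announces itself to the MRB.
        if ms_address not in self._servers:
            self._servers.append(ms_address)

    def unregister(self, ms_address):
        if ms_address in self._servers:
            self._servers.remove(ms_address)

    def locate(self):
        # Query model: the AS asks explicitly for an MS instance.
        # Round-robin is enough because instances are assumed equivalent.
        if not self._servers:
            raise LookupError("no media server registered")
        ms = self._servers[self._cursor % len(self._servers)]
        self._cursor += 1
        return ms


mrb = StatelessMRB()
mrb.register("ms-1")
mrb.register("ms-2")
print([mrb.locate() for _ in range(4)])
```

In the in-line model the AS would not call `locate()` itself; the MRB would sit on the signaling path and forward each control request to the MS it picked.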


Stateless MRB use cases

• Independent MS
  – B2B calls
  – WebRTC GW
  – Room servers
  – Media recording
  – Etc.

[Diagram: a stateless MRB between an Application Server Instance and four Media Server Instances; two calls are shown]


Deploying in public and private clouds

• Amazon Web Services EC2
  – Most popular public cloud
• OpenStack
  – Popular public clouds (e.g. RackSpace)
  – Popular for private clouds
• Deployment
  – Cloud deployment templates
    • CloudFormation (Amazon)
    • Heat (OpenStack)


Templates

– Declarative language for
  • Declaration of resources and relationships
    – Images, Computing Nodes, Networks, Volumes, Load Balancers, Autoscaling groups, etc.
  • Deployment
    – Instantiation of resources
  • Runtime
    – Provisioning
    – Autoscaling
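To make the idea of a declarative template concrete, the sketch below builds a small CloudFormation-style document as a Python dict and prints it as JSON. The resource type names and `Ref` / `Fn::GetAZs` are standard CloudFormation constructs, but everything else is an assumption for illustration: the function name, the logical resource names, the placeholder image id and the instance type are hypothetical, and the output has not been validated against the service.

```python
import json


def build_stack_template(ms_image_id, instance_type, min_size, max_size):
    """Return a CloudFormation-style template declaring an MS autoscaling group."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative media server autoscaling group",
        "Resources": {
            # What to launch: image and instance size for each MS instance.
            "MediaServerLaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": ms_image_id,
                    "InstanceType": instance_type,
                },
            },
            # How many to keep running; the relationship to the launch
            # configuration is declared with Ref.
            "MediaServerGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "LaunchConfigurationName": {"Ref": "MediaServerLaunchConfig"},
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                },
            },
        },
    }


# "ami-PLACEHOLDER" and "c4.large" are placeholders, not real deployment values.
template = build_stack_template("ami-PLACEHOLDER", "c4.large", 2, 10)
print(json.dumps(template, indent=2))
```

A Heat template for OpenStack would declare the same kinds of resources with its own resource type names.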


Deploying in public clouds

[Diagram: deployment pipeline, elements as labelled on the slide]
• Source code, built into images with Chef + Packer
• Images (AWS AMI / OpenStack Glance): Media Server Image, Application Server Image, Broker Image
• Stack definition template (CloudFormation / Heat): autoscaling rules, launch configurations
• Running stack (AWS EC2 / OpenStack Nova): Elastic Load Balancer, autoscaling group of Application Server Instances, Broker Instance, autoscaling group of Media Server Instances


Experiences deploying large WebRTC infrastructures in public clouds

• Lessons learnt: fault-resilience is hard
  – AS & MRB layers
    • Are stateless => use distributed cache systems
  – MS layer
    • Is stateful => lots of problems

[Diagram: Application Servers (…), a Media Resource Broker and Media Servers (…)]


Lessons learnt: avoid single points of failure

[Diagram: two ways of placing the MRB in front of MS instances running on computing nodes]
• The wrong way (single point of failure): a single MRB
• The right way (fault-tolerant MRB): an Elastic Load Balancer, several MRB instances and a distributed cache


Lessons learnt: fault-recovery at the MS layer

• Fault-tolerance on the MS layer
  – Stateful problem
    • MS instances hold specific resources that cannot be “serialized” to a distributed cache:
      – Specific sockets
    • Machine failure => session failure
  – Our proposed solution
    • Reconstruct the session
      – Detect failure
      – Notify failure
      – Reconnect

[Diagram: an MRB between an Application Server Instance and four Media Server Instances carrying two calls, annotated with three steps: failure detection, failure notification, session reconnection]
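The detect / notify / reconnect cycle can be sketched as follows. This is a toy model under stated assumptions, not the mechanism used in the deployments described here: failure detection is done with heartbeat timeouts, notification with a plain callback, and "reconnect" only re-binds the session to another MS (a real system would trigger a new WebRTC negotiation at that point). All class and method names are hypothetical.

```python
class FailureDetector:
    """Hypothetical heartbeat-based detector living in the MRB."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}   # ms_id -> timestamp of last heartbeat
        self.listeners = []   # callbacks to notify on failure

    def heartbeat(self, ms_id, now):
        self.last_seen[ms_id] = now

    def subscribe(self, callback):
        self.listeners.append(callback)

    def check(self, now):
        # 1. Detect: any MS silent for longer than the timeout is failed.
        failed = [ms for ms, t in self.last_seen.items()
                  if now - t > self.timeout_s]
        for ms in failed:
            del self.last_seen[ms]
        # 2. Notify: tell every subscriber about each failed MS.
        for ms in failed:
            for callback in self.listeners:
                callback(ms)
        return failed


class ApplicationServer:
    def __init__(self, detector):
        self.detector = detector
        self.sessions = {}    # session_id -> ms_id hosting it
        detector.subscribe(self.on_ms_failure)

    def _pick_ms(self):
        alive = sorted(self.detector.last_seen)
        if not alive:
            raise LookupError("no media server alive")
        return alive[0]

    def start_session(self, session_id):
        self.sessions[session_id] = self._pick_ms()
        return self.sessions[session_id]

    def on_ms_failure(self, failed_ms):
        # 3. Reconnect: rebuild every affected session on a surviving MS.
        for session_id, ms in list(self.sessions.items()):
            if ms == failed_ms:
                self.sessions[session_id] = self._pick_ms()


detector = FailureDetector(timeout_s=5.0)
detector.heartbeat("ms-1", now=0.0)
detector.heartbeat("ms-2", now=0.0)
app = ApplicationServer(detector)
app.start_session("call-1")
detector.heartbeat("ms-2", now=10.0)   # ms-1 stays silent
detector.check(now=10.0)
print(app.sessions)
```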


Autoscaling


Lessons learnt: lack of optimal scale-out events and metrics

• Lessons learnt: firing scale-out events
  – Which metric?
  – Bottleneck depends on applications: network, CPU, memory, etc.
  – Our recommendation: define a synthetic metric (i.e. scaling points) and be conservative

[Figure: typical scalability curve for SFU media servers. Vertical axis: quality of service. Horizontal axis: number of WebRTC legs. Annotations: CPU load 50%, Memory 40%.]
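One way to read the "scaling points" recommendation is sketched below. The weights, the per-instance capacity and the threshold are made-up numbers for illustration, as are all names; in practice they would be calibrated per application. Each kind of WebRTC leg costs a number of points, and a scale-out event fires when the fleet's used points reach a conservative fraction of its capacity, well before quality of service degrades.

```python
# Hypothetical cost of one WebRTC leg, by what the MS does with it.
POINTS_PER_LEG = {"forward": 1, "record": 2, "transcode": 10}

CAPACITY_POINTS = 500          # assumed capacity of one MS instance
SCALE_OUT_THRESHOLD_PCT = 60   # conservative: fire well before saturation


def used_points(legs):
    """legs maps a leg kind to the number of such legs on one MS."""
    return sum(POINTS_PER_LEG[kind] * count for kind, count in legs.items())


def should_scale_out(fleet):
    """fleet is a list with one `legs` dict per running MS instance."""
    if not fleet:
        return True  # nothing running: at least one instance is needed
    total_used = sum(used_points(legs) for legs in fleet)
    total_capacity = CAPACITY_POINTS * len(fleet)
    # Integer comparison avoids floating point surprises at the boundary.
    return total_used * 100 >= SCALE_OUT_THRESHOLD_PCT * total_capacity


fleet = [
    {"forward": 200, "transcode": 10},   # 300 points
    {"forward": 100, "record": 50},      # 200 points
]
print(should_scale_out(fleet))           # 500 of 1000 points used
fleet[1]["transcode"] = 10               # 100 more points on the second MS
print(should_scale_out(fleet))           # 600 of 1000 points used
```

The synthetic metric hides whether the real bottleneck is network, CPU or memory; the weights are where that knowledge is encoded.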


Lessons learnt: scaling-in is harder than scaling-out

• The options (none good)
  – Expose # sessions as a metric
    • Depends on cloud capabilities
    • AS needs to be made cloud aware
  – Session migration
    • AS needs to be made cloud aware
    • Renegotiations
  – Retain period
    • Sub-optimal utilization
    • The simplest

[Diagram: an Application Server Instance and an MRB in front of MS1, MS2, MS3 and MS4]

Which one would you remove?
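The retain-period option can be sketched as below, under one possible reading of it: the MS chosen for removal stops receiving new sessions, is kept alive for a fixed period so running sessions can end, and is terminated afterwards (sessions still running at that point would be dropped). Picking the MS with the fewest sessions as the victim is an assumption, as are all names.

```python
class ScaleInController:
    """Hypothetical scale-in helper implementing a retain period."""

    def __init__(self, retain_period_s):
        self.retain_period_s = retain_period_s
        self.sessions = {}   # ms_id -> number of active sessions
        self.draining = {}   # ms_id -> time at which draining started

    def add_server(self, ms_id, sessions=0):
        self.sessions[ms_id] = sessions

    def _active(self):
        # MS instances still eligible for new sessions.
        return [ms for ms in self.sessions if ms not in self.draining]

    def assign_session(self):
        # New sessions never land on a draining MS.
        ms_id = min(self._active(), key=lambda ms: self.sessions[ms])
        self.sessions[ms_id] += 1
        return ms_id

    def start_scale_in(self, now):
        # Victim choice: the MS that would lose the fewest sessions.
        victim = min(self._active(), key=lambda ms: self.sessions[ms])
        self.draining[victim] = now
        return victim

    def removable(self, now):
        # Instances whose retain period has elapsed can be terminated.
        return [ms for ms, since in self.draining.items()
                if now - since >= self.retain_period_s]


ctl = ScaleInController(retain_period_s=600)
ctl.add_server("MS1", sessions=5)
ctl.add_server("MS2", sessions=1)
ctl.add_server("MS3", sessions=3)
print(ctl.start_scale_in(now=0.0))   # victim
print(ctl.assign_session())          # new session avoids the victim
print(ctl.removable(now=600.0))
```

The sub-optimal utilization mentioned above is visible here: the draining instance is paid for during the whole retain period even if it empties early.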


Limits of the (stateless) MRB

[Diagram: a one-to-many media stream]


Stateful MRB

[Diagram: a stateful MRB between an Application Server Instance and five Media Server Instances]


Why?


Stateful because …

• MRB
  – Must be aware of media topology
    • Stateful information about MS relationships
  – Request routing depends on topology
    • Where to place a new viewer?
  – Request routing depends on internal state
    • CPU load
    • QoS
    • Memory
    • Etc.
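The "where to place a new viewer?" question can be illustrated with a toy one-to-many tree. The sketch below is an assumption-laden model, not the deployed algorithm: each MS has a fixed capacity in WebRTC legs, the broker remembers which MS relays to which, and a new MS is attached as a relay when every existing one is nearly full. Real routing would also weigh CPU load, QoS and memory. All names are hypothetical.

```python
class MediaServerNode:
    def __init__(self, name, capacity):
        assert capacity >= 3, "needs room for input, one viewer and one relay"
        self.name = name
        self.capacity = capacity   # max WebRTC legs on this MS
        self.viewers = 0
        self.children = []         # MS instances fed by this one
        self.parent = None

    def free_legs(self):
        # One leg for the incoming stream, one per viewer, one per child relay.
        return self.capacity - (1 + self.viewers + len(self.children))


class StatefulMRB:
    """Hypothetical broker that tracks the media topology."""

    def __init__(self, root):
        self.nodes = [root]        # creation order, root first

    def place_viewer(self):
        # Prefer an existing MS, always keeping one leg spare for a relay.
        for node in self.nodes:
            if node.free_legs() > 1:
                node.viewers += 1
                return node.name
        # Everyone is down to the spare leg: use it to grow the tree.
        parent = next(n for n in self.nodes if n.free_legs() >= 1)
        child = MediaServerNode(f"ms-{len(self.nodes) + 1}", parent.capacity)
        child.parent = parent
        parent.children.append(child)
        self.nodes.append(child)
        child.viewers += 1
        return child.name


mrb = StatefulMRB(MediaServerNode("ms-1", capacity=4))
print([mrb.place_viewer() for _ in range(5)])
```

The broker is stateful precisely because `place_viewer` cannot be answered without the parent/child relationships and the per-node leg counts.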


Experiences with stateful MRB in AWS EC2 & OpenStack

• Lessons learned: beware of WebRTC internals
  – Differentiated quality
    • SVC is the solution
      – But it's not ready
    • Plain SFU forwarding models are not an option
      – RTCP feedback of viewers with bad connectivity destroys QoE
    • Simulcast may be an option
      – Suppress feedback of viewers with really bad connectivity
    • Layered transcoding works nicely
      – But it's expensive
  – Churn and the generation of key-frames
    • Periodic key-frame generation is an option
      – In VP8 expect significant increase in BW consumption
    • Layered transcoding works nicely
      – But it's again expensive


Experiences with stateful MRB in AWS EC2 & OpenStack

• Lessons learned: the cloud is evil
  – Placement of incoming WebRTC legs
    • New science required here
      – Ideas?
    • Our solutions
      – Count number of WebRTC legs (points mechanism)
      – Ad-hoc, hard and error prone
  – Fault-resilience
    • New science required here
      – Ideas?
    • Our solution
      – Re-construct internal parts of the tree, but never leaves
      – Requires client renegotiation
      – Ad-hoc, hard and error prone
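A minimal sketch of re-constructing internal parts of the tree follows, under one possible reading of the bullets above: when an internal relay MS fails, the MS instances it was feeding are re-attached to the failed node's parent, while viewers connected directly to the failed MS have to renegotiate. The function name and the dict-based tree are assumptions, and capacity checks on the new parent are deliberately left out.

```python
def repair_tree(parent_of, viewers_on, failed):
    """Re-attach the children of a failed internal MS to its parent.

    parent_of:  dict mapping each MS to its parent MS (the root maps to None)
    viewers_on: dict mapping an MS to the viewers connected directly to it
    Returns the repaired parent mapping and the viewers that must renegotiate.
    """
    new_parent = parent_of[failed]
    if new_parent is None:
        raise ValueError("root failure is not handled in this sketch")
    repaired = {}
    for ms, parent in parent_of.items():
        if ms == failed:
            continue  # the failed MS disappears from the tree
        repaired[ms] = new_parent if parent == failed else parent
    # Viewers whose legs terminated on the failed MS lost their sockets.
    to_renegotiate = list(viewers_on.get(failed, []))
    return repaired, to_renegotiate


parent_of = {"root": None, "relay": "root", "leaf-a": "relay", "leaf-b": "relay"}
viewers_on = {"relay": ["v0"], "leaf-a": ["v1", "v2"], "leaf-b": ["v3"]}
print(repair_tree(parent_of, viewers_on, "relay"))
```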


Thanks

Luis [email protected]
