WebRTC infrastructures in the large (with experiences on real cloud deployments)
TRANSCRIPT
WebRTC infrastructures in the large (with experiences from real deployments)
Luis [email protected]
IIT RTC Conference & Expo, October 2015
http://www.kurento.org
Speaker
• Coordinator of Kurento.org
  – FOSS project
  – WebRTC Media Server
  – WebRTC Media APIs
  – WebRTC Cloud Infrastructure
• Software developer
• Software trainer
• Software learner
• FOSS enthusiast
http://www.kurento.org
http://twitter/@kurentoms
https://www.youtube.com/channel/UCFtGhWYqahVlzMgGNtEmKug
WebRTC infrastructures
Peer-to-Peer WebRTC Application (without media infrastructure)
WebRTC video stream
WebRTC Application based on media infrastructure
Media infrastructure
Functions of WebRTC infrastructures
Processing
VP8 H.264
Group Communications
Archiving
WebRTC infrastructures in the large
From the hundreds to the millions: the scalability problem
WebRTC Cloud
WebRTC cloud models
High flexibility, complex development, low hourly costs
Low flexibility, simple development, high hourly costs
IaaS
PaaS
APIaaS
SaaS
No WebRTC-specific players here
Computing Resources
WebRTC cloud architectures
Virtual infrastructure
WebRTC Platform
WebRTC API
WebRTC Application
IaaS
PaaS
APIaaS
SaaS
No new science here
The science for the scalability problem is here
WebRTC vs. traditional WWW platforms: the three tiers
Application Server Container
Service Layer
Application 1 Application N…
WebRTC Media Server
Database Server
Signaling
Vertical scalability on monolithic WebRTC platforms
Application Server Instance
Media Server Instance
Application 1 Application N…
Quality of service
Number of WebRTC legs
Typical scalability curve for SFU media servers
~500 to 1000 in commodity hardware
The bottleneck is here
Horizontal scalability of WebRTC Media Servers
Application Server
Application Server
Application Server
Media Server
Media Server
Media Server
Media Server
Media Resource Broker
…
…
RFC 6917
Load Balancer
Media Resource Broker
• Functions
  – MS registration
    • MS instances register on the MRB
  – MS brokering
    • Query model
      – AS instances query the MRB to locate an MS instance
      – MRB is explicit for the AS
    • In-line model
      – MRB routes signaling (control requests)
      – MRB is transparent for the AS
• MRB does not hold state about MS instances
  – MS instances are independent
  – MS instances are equivalent
  – We say it's stateless
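The query model above can be sketched in a few lines. This is only an illustration, not Kurento's broker: the class name `QueryModelBroker`, its methods and the instance names are hypothetical. The broker keeps nothing but the list of registered instances, and because MS instances are independent and equivalent, plain round-robin is enough to answer an AS query.

```python
class QueryModelBroker:
    """Hypothetical MRB for the query model (names are illustrative only)."""

    def __init__(self) -> None:
        self._instances: list[str] = []  # registered MS instances, nothing else
        self._next = 0                   # round-robin cursor

    def register(self, ms_name: str) -> None:
        # MS registration: a media server announces itself to the broker.
        if ms_name not in self._instances:
            self._instances.append(ms_name)

    def unregister(self, ms_name: str) -> None:
        # Remove an instance that is shutting down or has failed.
        if ms_name in self._instances:
            self._instances.remove(ms_name)

    def locate(self) -> str:
        # MS brokering: any instance will do, so rotate through the list.
        if not self._instances:
            raise LookupError("no media server registered")
        ms_name = self._instances[self._next % len(self._instances)]
        self._next += 1
        return ms_name
```

An AS instance would call `locate()` once per new call and then talk to the returned media server directly; in the in-line model the same choice would be made inside the broker while it forwards the control request.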
Stateless MRB use cases
• Independent MS
  – B2B calls
  – WebRTC GW
  – Room servers
  – Media recording
  – Etc.
[Diagram: an Application Server Instance, a stateless MRB, and four independent Media Server Instances hosting calls]
Deploying in public and private clouds
• Amazon Web Services EC2
  – Most popular public cloud
• OpenStack
  – Popular public clouds (e.g. Rackspace)
  – Popular for private clouds
• Deployment
  – Cloud deployment templates
    • CloudFormation (Amazon)
    • Heat (OpenStack)
Templates
– Declarative language for
  • Declaration of resources and relationships
    – Images, Computing Nodes, Networks, Volumes, Load Balancers, Autoscaling groups, etc.
  • Deployment
    – Instantiation of resources
  • Runtime
    – Provisioning
    – Autoscaling
Deploying in public clouds
Source code
Chef + Packer
AWS AMI / OpenStack Glance: Media Server Image, Application Server Image, Broker Image
CloudFormation / Heat: Stack definition template, Autoscaling Rules, Launch configurations
AWS EC2 / OpenStack Nova: Elastic Load Balancer, Autoscaling Groups, Application Server Instances, Broker Instance, Media Server Instances
Experiences deploying large WebRTC infrastructures in public clouds
• Lessons learnt: fault-resilience is hard
  – AS & MRB layers
    • Are stateless => use distributed cache systems
  – MS layer
    • Is stateful => lots of problems
[Diagram: Application Servers, Media Resource Broker, Media Servers]
Lessons learnt: avoid single points of failure
The wrong way (single point of failure): one MRB in front of the MS instances, each MS on its own Computing Node
The right way (fault-tolerant MRB): Elastic Load Balancer, several MRB instances with a distributed cache, each MS on its own Computing Node
Lessons learnt: fault-recovery at the MS layer
• Fault-tolerance on the MS layer
  – Stateful problem
    • MS instances hold specific resources that cannot be "serialized" to a distributed cache:
      – Specific sockets
    • Machine failure => session failure
  – Our proposed solution
    • Reconstruct the session
      – Detect failure
      – Notify failure
      – Reconnect
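The detect / notify / reconnect sequence can be illustrated with a small heartbeat-based sketch. Everything here is an assumption made for illustration: the class names, the heartbeat mechanism and the timeout are not taken from the talk, and a real reconnection would also involve a fresh WebRTC negotiation with the client, which the code only represents as a reassignment.

```python
import time
from typing import Callable


class FailureDetector:
    """Hypothetical heartbeat-based detector (detect + notify)."""

    def __init__(self, timeout_s: float, on_failure: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.on_failure = on_failure        # notification callback, e.g. to the AS
        self.clock = clock                  # injectable clock, handy for testing
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, ms_id: str) -> None:
        # Called whenever a media server reports that it is alive.
        self.last_seen[ms_id] = self.clock()

    def check(self) -> list[str]:
        # Detect: any instance silent for longer than the timeout is failed.
        now = self.clock()
        failed = [ms for ms, seen in self.last_seen.items()
                  if now - seen > self.timeout_s]
        for ms in failed:
            del self.last_seen[ms]          # report each failure only once
            self.on_failure(ms)             # notify
        return failed


class ApplicationServer:
    """Hypothetical AS that tracks which MS hosts each session."""

    def __init__(self) -> None:
        self.sessions: dict[str, str] = {}  # session id -> media server id

    def reconnect(self, failed_ms: str, pick_ms: Callable[[], str]) -> list[str]:
        # Reconnect: rebuild every session of the failed MS on another instance.
        moved = []
        for session_id, ms_id in self.sessions.items():
            if ms_id == failed_ms:
                self.sessions[session_id] = pick_ms()  # stands in for renegotiation
                moved.append(session_id)
        return moved
```

The session itself is not migrated: the sockets on the failed machine are gone, so the sketch simply creates the session again elsewhere.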
[Diagram: MRB, Application Server Instance and four Media Server Instances hosting calls; steps shown: failure detection, failure notification, session reconnection]
Autoscaling
Lessons learnt: lack of optimal scale-out events and metrics
• Lessons learnt: firing scale-out events
  – Which metric?
  – The bottleneck depends on the application: network, CPU, memory, etc.
  – Our recommendation: define a synthetic metric (i.e. scaling points) and be conservative
Quality of service
Number of WebRTC legs
Typical scalability curve for SFU media servers
CPU load 50%
Memory 40%
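A synthetic "scaling points" metric might look like the sketch below. The point costs per leg type, the capacity figure and the 50% threshold are invented for illustration; in practice they would come from load tests of the specific application, since the bottleneck differs from one application to another.

```python
from dataclasses import dataclass

# Hypothetical cost, in scaling points, of one WebRTC leg of each kind.
POINTS_PER_LEG = {"forwarded": 1, "recorded": 2, "transcoded": 8}


@dataclass
class ScaleOutPolicy:
    capacity_points: int    # points one MS instance is assumed to sustain
    threshold: float = 0.5  # conservative: fire well before the assumed limit

    def load(self, legs: dict[str, int]) -> int:
        # Synthetic metric: weighted count of the legs on one instance.
        return sum(POINTS_PER_LEG[kind] * n for kind, n in legs.items())

    def should_scale_out(self, legs_per_instance: list[dict[str, int]]) -> bool:
        if not legs_per_instance:
            return True  # no media server at all: one is needed
        total = sum(self.load(legs) for legs in legs_per_instance)
        budget = self.threshold * self.capacity_points * len(legs_per_instance)
        return total >= budget
```

Because the metric is computed from what the application knows (how many legs of which kind), it does not depend on whether CPU, memory or network saturates first.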
Lessons learnt: scaling-in is harder than scaling-out
• The options (none good)
  – Expose # sessions as a metric
    • Depends on cloud capabilities
    • AS needs to be made cloud aware
  – Session migration
    • AS needs to be made cloud aware
    • Renegotiations
  – Retain period
    • Sub-optimal utilization
    • The simplest
MRB
Application Server Instance
MS1 MS2 MS3 MS4
Which one would you remove?
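One possible reading of the retain-period option is sketched below: pick the least-loaded instance, stop sending it new sessions, and only terminate it once it is empty or the retain period has elapsed. This interpretation, the names and the numbers are assumptions, not the talk's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaServer:
    name: str
    sessions: int
    draining_since: Optional[float] = None  # None means it still takes new sessions


def start_draining(pool: list[MediaServer], now: float) -> MediaServer:
    """Mark the least-loaded active instance; the MRB should stop routing to it."""
    candidates = [ms for ms in pool if ms.draining_since is None]
    victim = min(candidates, key=lambda ms: ms.sessions)  # ValueError if none left
    victim.draining_since = now
    return victim


def removable(pool: list[MediaServer], now: float, retain_s: float) -> list[MediaServer]:
    """Draining instances that are empty or whose retain period has expired."""
    return [
        ms for ms in pool
        if ms.draining_since is not None
        and (ms.sessions == 0 or now - ms.draining_since >= retain_s)
    ]
```

The sub-optimal utilization mentioned above is visible here: a draining instance keeps running, and being paid for, while it serves only a handful of remaining sessions.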
Limits of the (stateless) MRB
Media stream
One to MANY
Stateful MRB
[Diagram: an Application Server Instance, a stateful MRB, and five Media Server Instances]
Why?
Stateful because …
• MRB
  – Must be aware of media topology
    • Stateful information about MS relationships
  – Request routing depends on topology
    • Where to place a new viewer?
  – Request routing depends on internal state
    • CPU load
    • QoS
    • Memory
    • Etc.
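The "where to place a new viewer?" question can be made concrete with a toy one-to-many tree. The sketch assumes each media server has a fixed budget of outgoing WebRTC legs (viewers plus relays to child servers) and keeps one leg in reserve so the tree can still grow. The class, the function names and the policy are hypothetical; a real broker would also weigh CPU load, QoS and memory.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    name: str
    capacity: int                    # max outgoing legs: viewers + child relays
    viewers: int = 0
    children: list["TreeNode"] = field(default_factory=list)

    def used(self) -> int:
        return self.viewers + len(self.children)


def walk(root: TreeNode):
    # Breadth-first, so viewers land as close to the source as possible.
    queue = [root]
    while queue:
        node = queue.pop(0)
        yield node
        queue.extend(node.children)


def place_viewer(root: TreeNode, new_capacity: int) -> TreeNode:
    """Return the media server that will serve the new viewer."""
    # First choice: a server that still has room beyond its reserved leg.
    for node in walk(root):
        if node.used() < node.capacity - 1:
            node.viewers += 1
            return node
    # Otherwise spend a reserved leg on a new relay server and use that.
    for node in walk(root):
        if node.used() < node.capacity:
            relay = TreeNode(f"{node.name}.{len(node.children) + 1}",
                             new_capacity, viewers=1)
            node.children.append(relay)
            return relay
    raise RuntimeError("tree is saturated")
```

The broker cannot answer the request without knowing the tree and the occupancy of every node, which is exactly the state a stateless MRB does not keep.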
Experiences with stateful MRB in AWS EC2 & OpenStack
• Lessons learned: beware of WebRTC internals
  – Differentiated quality
    • SVC is the solution
      – But it's not ready
    • Plain SFU forwarding models are not an option
      – RTCP feedback from viewers with bad connectivity destroys QoE
    • Simulcast may be an option
      – Suppress feedback of viewers with really bad connectivity
    • Layered transcoding works nicely
      – But it's expensive
  – Churn and the generation of key-frames
    • Periodic key-frame generation is an option
      – In VP8 expect a significant increase in BW consumption
    • Layered transcoding works nicely
      – But it's again expensive
Experiences with stateful MRB in AWS EC2 & OpenStack
• Lessons learned: the cloud is evil
  – Placement of incoming WebRTC legs
    • New science required here
      – Ideas?
    • Our solutions
      – Count number of WebRTC legs (points mechanism)
      – Ad hoc, hard and error-prone
  – Fault-resilience
    • New science required here
      – Ideas?
    • Our solution
      – Re-construct internal parts of the tree, but never leaves
      – Requires client renegotiation
      – Ad hoc, hard and error-prone