microsoft powerpoint - ccnc10_voip

CCNC 2010 Tutorial: Towards Glitch Free VoIP and Video Conferencing 1/12/2010

Jin Li, Microsoft Research

TOWARDS GLITCH-FREE VOIP AND VIDEO CONFERENCING

JIN LI

MICROSOFT RESEARCH

Outline

• Introduction

• Anatomy of VoIP and Video Conferencing Systems

• Audio/Video Components

• Network Components

• Summary



Introduction

Booming of IP Based Communication

• Advanced voice over IP (VoIP)

• Web-, audio-, video-conferencing

• Tele-presence

• Instant messaging

• Calendar and other PIM functions

• Email, fax and voice mail



Worldwide VoIP subscribers

• Worldwide VoIP service revenue was $24.1B in 2007, up 52% over 2006.

• Worldwide VoIP service revenue is expected to more than double over the next 4 years, to $61.3B in 2011, an annual growth rate of 26%.

Source: 2008 Infonetics Research Inc.

US Broadband Telephony Forecast, 2007-2013

The VoIP subscriber base is predicted to double from 2007 to 2013.

Source: Jupiter Research, US Broadband Telephony Forecast, 2008 to 2013



VoIP Trend

• IP networks are the next-generation networks for all forms of communication.

• Broadband penetration is a key driver of VoIP expansion

  • Worldwide DSL subscriptions were at 205.9M at the end of 2007, up 23% from 2006. They are predicted to increase to 363.6M in 2011.

  • Cable subscriptions were up 15% annually to 68M at the end of 2007, climbing to 97.3M in 2011.

  • Passive Optical Network (PON) subscribers were at 10.9M in 2007

  • Ethernet FTTH subscribers were at 1.7M in 2007

• 2004/2005 were breakthrough years for VoIP adoption

High End Systems – Tele-Presence

• Cisco Telepresence: $299K

• Tandberg Experia: $225K

• HP Halo: $425K + $18K/mo

• Polycom RPX210M: $269K + $18.5K/mo



Worldwide Tele-presence Forecast (2006-2012)

[Figure: number of end points and revenue forecast]

Source: 2008 IDC Research

Desktop Video Conferencing

• Multiple solutions, often acting as an add-on to VoIP

• Benefits

  • See faces of people you may not have met before

  • See facial expressions & gestures

  • Easier to follow a conversation

  • More interactive than phone

  • Get the general mood and ambience

  • See and show documents/objects

• Drawbacks

  • Difficult to set up and plan

  • Network reliability

    • Without (or with poor) video, people talk; without (or with poor) audio, people walk.

  • Interpersonal factors



Anatomy of VoIP and Video Conferencing Systems

Infrastructure vs. P2P

• Infrastructure based

  • Microsoft Unified Communication

  • Cisco

  • Gtalk

• P2P based

  • Skype



Infrastructure Based VoIP: Microsoft Unified Communication

Unified Communication: Architecture



Unified Communication: P2P Call

Key Steps

• Alice calls Bob

• Find Bob’s registered SIP endpoints



Unified Communication: To VoiceMail

Key Steps

• Alice calls Bob

• Find Bob’s registered SIP endpoints

• Bob doesn’t answer after a certain period, so the call is re-routed

• The voicemail system plays a greeting, records Alice’s message, sends the message to Bob’s email, and uses the speech server to transcribe the message



Unified Communication: PSTN → UC

Key Steps

• PSTN user Alice calls Bob

• IP-PSTN gateway terminates the call

• MS/Gateway routes the call to the mediation server, which performs transcoding, ICE, etc.

• Through the director, the proper UC client is found



P2P VoIP: Skype

P2P VoIP: Skype

• Information

  • Debut: 08/2003, by N. Zennstrom and J. Friis, who founded KaZaA

  • A P2P overlay network for VoIP and other applications

  • Free in-network (Skype-to-Skype) VoIP and fee-based SkypeOut/SkypeIn



Skype Usage (Apr. 2008)

• 11 million concurrent Skype users online at peak time (180,000+ simultaneous calls)

• 309 million registered users worldwide, the largest registered user base within the eBay portfolio (33 million users added in Q1 FY08)

• $126M revenue in Q1 FY08 (61% YoY growth; 5.6 billion SkypeOut minutes in FY2007)

• 100 billion cumulative Skype-to-Skype minutes

Skype Share of International VoIP Traffic



Skype Gadget

Netgear Skype Wi-Fi Phone

Motorola CN620 WiFi Cellphone

IPEVO Free-1 USB Skype Phone

USB Mouse with Phone

IPDRUM Mobile Skype Cable

50 hardware partners, 150+ Skype-certified devices.

Skype vs. VoIP

• Public VoIP standards

  • H.323, SIP

• Skype is a proprietary VoIP solution

  • Relies on a P2P network for the user directory

  • Scalable without costly infrastructure

  • Routes calls through supernodes in Skype

  • Universal firewall/NAT traversal

  • Encrypted traffic (but you have to trust eBay/Skype)



Skype Ingredient (1)

• Skype server: authentication; the user retrieves an ID from a Skype server

Skype Network

• Supernode overlay: any computer with sufficient CPU, memory & network bandwidth that is not behind a firewall

  • For distributed directory service

  • Relays traffic for computers behind NAT/firewall



NAT Traversal (Skype)

• NAT/Firewall detection

  • Try UDP connection

  • Try TCP connection (arbitrary port, 80 (HTTP), 443 (HTTPS))

• Traversal

  • Direct connection if a) both clients have no NAT, or b) one client has no NAT and the other is behind a cone NAT

  • Relay by supernode otherwise

• Since Skype doesn’t need to pay for relay cost

  • High bitrate wideband voice codec (>24 kbps)
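The traversal rule on this slide can be written as a small decision function. This is only a sketch of the rule as stated here: the `Nat` categories and the function name are illustrative assumptions, not part of any Skype API, and real NAT classification is more fine-grained.

```python
from enum import Enum


class Nat(Enum):
    """Coarse NAT categories (hypothetical names, for illustration only)."""
    NONE = "none"    # public address, no NAT
    CONE = "cone"    # cone NAT
    OTHER = "other"  # anything stricter, e.g. symmetric NAT or a firewall


def connection_mode(a: Nat, b: Nat) -> str:
    """Return 'direct' or 'relay' following the rule on the slide."""
    # a) both clients have no NAT
    if a is Nat.NONE and b is Nat.NONE:
        return "direct"
    # b) one client has no NAT and the other is behind a cone NAT
    if {a, b} == {Nat.NONE, Nat.CONE}:
        return "direct"
    # otherwise the call is relayed by a supernode
    return "relay"
```

For example, `connection_mode(Nat.NONE, Nat.CONE)` gives a direct connection, while two clients that are both behind a cone NAT fall through to the supernode relay under this rule.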

Skype: Call Routing Through Supernode

[Figure: Skype server (authentication) and supernode overlay]

• Route call through supernodes

• High bitrate wideband voice codec (>24 kbps)



Skype Encryption

• 256-bit AES over 128-bit data blocks

• 1536/2048-bit RSA for key negotiation (2048/2048 for paid service)

[Figure: Peer 1 and Peer 2]

Skype: Complete Black Box (Security by Obfuscation)

• Almost everything is obfuscated

• Many protections, anti-debugging tricks, ciphered code

  • Avoid static disassembly: XOR the binary with a hard-coded key, erase the beginning of the code, own packer

  • Code integrity check: use checksums to detect breakpoints

  • Anti-debugging techniques: anti-SoftICE, integrity check

  • Code obfuscation

  • Network obfuscation



Audio/Video Component

Audio/Video Component

• Audio Codec

• Video Codec

• Acoustic Echo Cancellation

Audio Codec

G.711 (PCM)

• Still widely used today: PSTN interface

• If uniform quantization

  • 12 bits * 8 k/sec = 96 kbps

• Non-uniform quantization

  • 8 bits * 8 k/sec = 64 kbps (DS0 rate)

  • North America: µ-law (µ = 255)

  • Other countries: A-law (A = 87.6)

  • MOS of about 4.3
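The effect of non-uniform quantization can be sketched with the continuous µ-law companding curve, F(x) = sgn(x) ln(1 + µ|x|) / ln(1 + µ), with µ = 255. This is a minimal illustration, not the bit-exact segmented G.711 encoder; the 8-bit uniform quantizer and the function names are assumptions made for the example.

```python
import math

MU = 255.0  # mu-law parameter from the slide


def mulaw_compress(x: float, mu: float = MU) -> float:
    """Map a sample in [-1, 1] through the mu-law compression curve."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y: float, mu: float = MU) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_8bit(v: float) -> int:
    """Uniform 8-bit quantizer on [-1, 1] (illustrative, not the G.711 bit layout)."""
    return int(round((v + 1.0) / 2.0 * 255))


def dequantize_8bit(code: int) -> float:
    return code / 255 * 2.0 - 1.0


def companded_roundtrip(x: float) -> float:
    """Compress, quantize to 8 bits, dequantize, expand."""
    return mulaw_expand(dequantize_8bit(quantize_8bit(mulaw_compress(x))))


def uniform_roundtrip(x: float) -> float:
    """Quantize to 8 bits without companding, for comparison."""
    return dequantize_8bit(quantize_8bit(x))


BIT_RATE = 8 * 8000  # 8 bits per sample at 8 kHz = 64 kbps
```

For a quiet sample such as `x = 0.01`, `companded_roundtrip` lands much closer to the input than `uniform_roundtrip`, which is why 8 companded bits can stand in for roughly 12 uniform bits on speech.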



G.722.1: Siren

• Audio bandwidth: 14 kHz

• Sample rate: 32 kHz

• Bit rate: 24, 32, and 48 kbit/s

• Algorithm: Transform coding (Siren14™)

• Frame size: 20 ms

• Algorithmic delay: 40 ms

• Complexity: <11 WMOPS (enc/dec)

• Available on royalty-free licensing terms (from Polycom)

Siren Encoder

Siren Decoder

Siren Codec

• Audio sampled at 32 kHz

• Operates on frames of 20 ms, corresponding to 640 samples

• Based on transform coding, using a Modulated Lapped Transform (MLT)

• A look-ahead of 20 ms due to 50% overlap between frames

• Total algorithmic delay of 40 ms



MLT - Modulated Lapped Transforms

[Figure panels: Spatial Response; Frequency Domain]

Categorization & SQVH

Expected # of Bits For Each Category

Quantization Used by SQVH

Vector Property Used in SQVH



AMR-WB Basics

• “Wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (AMR-WB)”

• Adopted as ITU-T G.722.2, and also as 3GPP spec TS 26.190.

• “Foreseen applications are: VoIP and internet applications, Mobile Com., PSTN app, ISDN wideband telephony, ISDN videophone and videoconf.”

• Sampling rate: 16 kHz

• Bitrate: 6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05, and 23.85 kbit/s

• 20 ms frame

• ACELP (algebraic code excited LPC)

Pre-processing

• Sampling rate conversion: 16 to 12.8 kHz (a 20 ms frame now has 256 samples)

• High-pass filter (cut-off at 50 Hz)

• Pre-emphasis filter: $1 - 0.68 z^{-1}$



LP analysis and Quant.

• One 30 ms asymmetric window

  • 5 ms look-ahead

• Obtain LPC coefficients:

  • Compute the autocorrelation;

  • Multiply by a lag window (adds 60 Hz bandwidth expansion);

  • R(0) = 1.0001 R(0) (adds a noise floor at -40 dB);

  • Levinson-Durbin recursion to compute the LP coefficients.

• LP to ISP

• Quantize in ISP q-domain.
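The Levinson-Durbin step listed above can be sketched as follows. This is a generic textbook recursion, not the AMR-WB reference code: the function names are illustrative, the sign convention A(z) = 1 + Σ a[i] z^-i is an assumption, and only the R(0) scaling from the slide is applied (the lag window is omitted).

```python
def add_noise_floor(r, factor=1.0001):
    """Scale R(0) as on the slide; returns a new autocorrelation list."""
    return [r[0] * factor] + list(r[1:])


def levinson_durbin(r, order):
    """Compute LP coefficients from autocorrelation values r[0..order].

    Returns (a, err) where a[0] == 1.0, the prediction filter is
    A(z) = 1 + sum(a[i] * z^-i), and err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current predictor output and lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

As a sanity check, a first-order autoregressive signal with autocorrelation `[1, 0.5, 0.25]` yields `a ≈ [1, -0.5, 0]`: the second coefficient vanishes because one tap already predicts the signal.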

LP analysis and Quant. (2)

• Quantization bottom line:

  • 46 bits/frame in most modes;

  • 36 bits/frame in the 6.60 kbit/s mode;

  • M.A. prediction with 1/3 gain;

  • Quantizer: S-MSVQ (split multistage VQ)

• Both quantized and unquantized coefficients are used in the algorithm.

Subframes

• Each 20 ms frame (256 samples) is divided into 4 sub-frames (64 samples each).

• Interpolated LPC coefficients are obtained for each sub-frame

• Interpolation is done in the ISP q-domain

Perceptual weighting

• Weighting filter is:

$W(z) = A(z/\gamma_1)\, H_{\text{de-emph}}(z)$

• This helps solve the tilt problem, which is worse in wideband speech.



Excitation

• Searched for each 5 ms sub-frame.

• Two components:

  • Adaptive codebook (past excitation)

  • Algebraic codebook

• “Target” signal obtained by filtering the LPC residual (for the sub-frame) through the synthesis LPC filter and the weighting filter.

Adaptive codebook

• Start with “open loop” pitch estimation

  • Based on cross correlation;

  • Low-value bias;

  • ‘Last value’ bias (actually 5-frame median), if voiced.

• Re-compute with “closed loop”, around initial value ±7, and up to ¼ sample precision.

  • “Analysis by synthesis” based;

  • Restrict to values allowed by encoding step.



Algebraic codebook

• Remove the contribution of the (unquantized) adaptive codebook prediction from the “target signal” to obtain a new target.

• Divide the sub-frame into 4 alternating tracks.

Algebraic codebook (2)

• Select best pulses, for a total of 24 (6), 18 (5-4), 16 (4), 12 (3), 10 (3-2), 8 (2), 4 (1), 2 (.5), depending on bitrate.

• Pulses + two filters:

  • Periodicity enhancement: $1/(1 - 0.85 z^{-T})$;

  • Tilt: $1/(1 - \beta_1 z^{-1})$

• Tricks to save bits in encoding pulse positions;

• Tricks to save computation in the pulse search.



Wrap up

• High pass, de-emphasis;

• Upsample back to 16 kHz;

• Add high frequency components.

High Freq. Components

• Random noise used as excitation

• LP filter is extended to 8 kHz.

• Energy of excitation based on energy of base-band residual and voicing info, except in the highest bitrate mode.

• Extension of the LPC filter is equivalent to mapping 5.1-5.6 kHz to 6.4-7.0 kHz;

• Band-pass filtered to 6-7 kHz, and added to the output signal.



Video Codec

H.264/AVC Encoder

H.264/AVC Decoder

Reference Picture Management

• Reference pictures are stored in the decoded picture buffer (DPB)

• Short/long term reference picture: a decoded frame may be marked as

  • unused for reference

  • short term picture

  • long term picture

• “Sliding Window” memory management

  • Keep #(long_term_pic + short_term_pic)

  • Remove short term picture if lack of space

• Adaptive memory control

  • issued by encoder

  • change the type of the ref frame

• IDR (Instantaneous Decoder Refresh)

  • clear ref buffer

  • I frame



Slice Group

• Formerly called “FMO” (Flexible Macroblock Ordering)

• A subset of the macroblocks; may contain one or more slices

• Error resilience

Inter Prediction

• Variable block size

• ¼ pixel motion compensation

• Interpolation



Motion Vector (MV) Prediction

• Efficiently encode correlated MVs

• Other than 16×8 and 8×16: MVp = median(MVA, MVB, MVC)

• 16×8: MVp of the upper partition = MVB; MVp of the lower partition = MVA

• 8×16: MVp of the left partition = MVA; MVp of the right partition = MVC

• For skipped macroblocks, proceed as in 16×16 Inter mode
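The prediction rules above can be sketched in a few lines. For the general case the sketch uses the component-wise median of the three neighbouring vectors, which is what H.264 specifies. Neighbour availability and reference-index checks are left out, and the function and parameter names are illustrative assumptions.

```python
def median3(a, b, c):
    """Median of three scalars."""
    return sorted((a, b, c))[1]


def predict_mv(mv_a, mv_b, mv_c, partition="other", part_index=0):
    """Predict a motion vector from neighbours A (left), B (above), C (above-right).

    partition:  '16x8', '8x16', or 'other' for every other partition size
    part_index: 0 = upper (16x8) or left (8x16), 1 = lower or right
    Vectors are (x, y) tuples.
    """
    if partition == "16x8":
        return mv_b if part_index == 0 else mv_a
    if partition == "8x16":
        return mv_a if part_index == 0 else mv_c
    # General case: component-wise median of the three neighbours.
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))
```

Only the difference between the actual vector and this prediction needs to be coded, which is small when neighbouring blocks move together.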

Intra Prediction

• For luma samples

  • 4×4 block: 9 prediction modes

  • 16×16 block: 4 modes

  • I_PCM: transmit the sample values directly, without prediction or transform



Prediction Modes

4x4 Luma

Intra 16x16

8x8 Chroma is similar to 16x16 luma intra

Signaling of Intra Prediction Modes

• Mode choices need to be signaled to the decoder, but compactly

• The prediction mode for luma coded in Intra-16×16 mode or chroma coded in Intra mode is signaled in the macroblock header

• Intra modes for neighboring 4×4 blocks are often correlated

  • If A and B are both available, C = min(A, B)

  • else if neither A nor B is available, C = 2 (DC)

  • else C = whichever of A and B is available

• Use the prev_intra4x4_pred_mode flag & rem_intra4x4_pred_mode to indicate the mode selected.

[Figure: blocks B, C and A]



Deblocking filter

• Filter 4 vertical/horizontal boundaries of luma

• Filter 2 vertical/horizontal boundaries of chroma

• Affects up to 3 samples on either side of a boundary.

• The filter is stronger where significant blocking distortion is likely

  • e.g., at the boundary of an intra coded macroblock, or at a boundary between blocks that contain coded coefficients.

Transform and Quantisation

• 3 transforms

  • DCT-based transform for all 4×4 residual blocks

  • Hadamard transform for the 4×4 luma DC coefficients (in 16×16 intra)

  • Hadamard transform for the 2×2 chroma DC coefficients

$a = 1/2$, $b = \sqrt{2/5}$, $d = 1/2$
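The DCT-based 4×4 transform can be illustrated with the integer core matrix of H.264, W = Cf · X · Cfᵀ. The scaling factors built from a and b are not applied here, since in H.264 they are absorbed into the quantizer. The helper names are illustrative; this is a sketch, not the reference implementation.

```python
# Integer core transform matrix of the H.264 4x4 forward transform.
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_core_transform(x):
    """Unscaled 4x4 forward transform: W = CF * X * CF^T (integer arithmetic only)."""
    return matmul(matmul(CF, x), transpose(CF))
```

A flat residual block (all samples equal to 1) transforms to a single DC value of 16 with every other coefficient zero, which shows how the transform compacts smooth content into few coefficients.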



Combine Quantization into Scaling of Transform

4x4 DC Intra Luma

• $|Z_D(i,j)| = \left(|Y_D(i,j)| \cdot MF(0,0) + 2f\right) \gg (qbits + 1)$

• $\operatorname{sign}(Z_D(i,j)) = \operatorname{sign}(Y_D(i,j))$

CAVLC: Context-Based Adaptive Variable Length Coding

• Characteristics:

  • Run-level coding to compact strings of zeros

  • Trailing ones: the last nonzero coefficients in scan order are often ±1

  • The number of nonzero coefficients in neighboring blocks is correlated

  • The choice of VLC lookup table for the level parameter adapts to the magnitudes of recently coded levels



CAVLC Encoding

• 1. Encode the number of coefficients and trailing ones (coeff_token)

  • TotalCoeffs: 0 to 16

  • TrailingOnes: 0 to 3

    • if there are more than 3 trailing ones, only the last three are treated as ‘special cases’

  • Four lookup tables

    • Three variable-length, one fixed-length

    • The choice depends on neighboring blocks

• 2. Encode the sign of each TrailingOne, in reverse order

• 3. Encode the levels of the remaining nonzero coefficients

  • level_prefix, level_suffix

  • If there are fewer than 3 TrailingOnes, the first remaining nonzero coefficient is adjusted

• 4. Encode the total number of zeros before the last coefficient

• 5. Encode each run of zeros

  • Zero-runs at the start of the array need not be encoded
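The quantities that the five steps encode can be derived from a zig-zag ordered coefficient array as sketched below. The sketch only extracts the syntax-element values; the VLC table lookups that turn them into bits are omitted, and the function name and dictionary keys are illustrative assumptions.

```python
def cavlc_elements(coeffs):
    """Derive CAVLC syntax-element values from zig-zag ordered coefficients.

    Returns a dict with TotalCoeffs, TrailingOnes, the trailing-one signs and
    remaining levels (both in reverse scan order), total_zeros, and the
    zero run before each nonzero coefficient (reverse order, last one omitted).
    """
    nz = [i for i, c in enumerate(coeffs) if c != 0]
    rev = list(reversed(nz))  # highest frequency first

    # Step 1: count coefficients and up to three trailing +/-1 values.
    trailing_ones = 0
    for i in rev:
        if abs(coeffs[i]) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    # Step 2: signs of the trailing ones, in reverse order.
    t1_signs = ["+" if coeffs[i] > 0 else "-" for i in rev[:trailing_ones]]

    # Step 3: remaining levels, in reverse order.
    levels = [coeffs[i] for i in rev[trailing_ones:]]

    # Step 4: zeros located before the last nonzero coefficient.
    total_zeros = (nz[-1] + 1 - len(nz)) if nz else 0

    # Step 5: run of zeros preceding each coefficient; the run before the
    # lowest-frequency coefficient is implied and therefore not listed.
    runs = [rev[j] - rev[j + 1] - 1 for j in range(len(rev) - 1)]

    return {
        "TotalCoeffs": len(nz),
        "TrailingOnes": trailing_ones,
        "t1_signs": t1_signs,
        "levels": levels,
        "total_zeros": total_zeros,
        "run_before": runs,
    }
```

For the array `[0, 3, 0, 1, -1, -1, 0, 1, 0, ...]` this gives five coefficients, three trailing ones with signs `+, -, -`, remaining levels `1, 3`, three zeros in total, and runs `1, 0, 0, 1`.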

Acoustic Echo Cancellation

Acoustic Echo Cancellation

[Figure: acoustic echo cancellation placed between the signal from the audio decoder and the signal to the audio encoder]

Acoustic Echo Cancellation Module



Adaptive Transversal Filter

• FIR filter: inherently stable

• The length of the filter affects performance: convergence, goodness of fit, and complexity.

• The filter introduces errors since it is trying to model an IIR response.

• Short filters

  • 128-256 coefficients (taps)

  • Faster convergence, but the final solution has more residual error

  • Less complex: O(N).

• Long filters

  • 512-1024 coefficients (taps)

  • Slower convergence, but the final solution has less error.

  • More complex, as the algorithm can be O(N^2)
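One common way to adapt such an FIR filter is the normalized LMS (NLMS) update, which costs O(N) per sample. The slides do not say which adaptation algorithm is used, so NLMS is an assumption here, as are the step size, the tap count and the toy echo path in `demo`.

```python
import random


def simulate_echo(far_end, echo_path):
    """Microphone signal containing only echo: far_end convolved with echo_path."""
    return [sum(h * far_end[n - i] for i, h in enumerate(echo_path) if n - i >= 0)
            for n in range(len(far_end))]


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Adapt an FIR echo model with NLMS; return (residual signal, final weights)."""
    w = [0.0] * taps       # adaptive filter coefficients
    x_buf = [0.0] * taps   # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        x_buf = [x] + x_buf[:-1]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x_buf))
        e = d - echo_estimate                      # echo-cancelled output
        norm = sum(xi * xi for xi in x_buf) + eps  # input power normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x_buf)]
        residual.append(e)
    return residual, w


def demo():
    """Identify a short, noise-free toy echo path from white noise input."""
    rng = random.Random(0)
    far_end = [rng.gauss(0.0, 1.0) for _ in range(3000)]
    echo_path = [0.5, -0.3, 0.2, 0.1]
    mic = simulate_echo(far_end, echo_path)
    residual, w = nlms_echo_cancel(far_end, mic, taps=8)
    return echo_path, residual, w
```

In this idealized setting the weights converge to the echo path and the residual decays toward zero. Near-end speech, noise and a changing echo path make the real problem much harder.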

Challenges

• Dynamic range of the human ear = 120 dB.

  • Even quiet echoes can be heard.

• Longer delays from satellite (300-500 ms), VoIP

  • The ear is more sensitive to echoes with longer delays.

  • More difficult to find the beginning of the echo.

  • Long filters (~1000 taps) are needed (complexity & convergence)

• Near-end noise corrupts the echo, decreasing the canceller's ability to converge.

• Acoustic echo paths can change rapidly

  • More difficult for the AEC to remain converged.

• Nonlinear echo components

  • Speakers driven beyond linear region.



Network Component

IP-based VoIP / Video Conference

Internet Primer

Internet: Grand View



Impact on ISPs

[Figure: sibling, peering and transit links between ISPs, with sibling and peering entity boundaries]

• Economics of ISP relationships

  • sibling relationship

    • several ISPs belong to the same organization

  • peering relationship

    • mutually beneficial free agreement (to a certain extent)

  • transit relationship

    • one ISP pays another

Inside ISP

ISP POP (Point of Presence)

Home Networking

Network Characteristics

Under-provisioned Links

[Figure: Branch, Branch]

Growth Trends

Packet Loss vs. Jitter (vs. Delay?)

The Usual Suspects

Packet Bursts

What kind of Enterprise User?

How QoS can help

QoS helps inside and between branches!

Observation

• IP-based communication in the enterprise is growing

• Empirical results show poor calls for wireless and VPN users

• QoS (DiffServ) is both used and useful!



Available Bandwidth Estimation

What is Available Bandwidth (ABW)?

• ABW is the left-over capacity along an Internet path

Why Is It Useful?

• Maximizing QoE (Quality of Experience) in A/V conferencing

  • Audio prefers minimum delay (high priority)

  • Video prefers maximum rate (low priority)

• One solution: measure ABW, then encode and send video at the ABW rate

One Way Delay (OWD) = propagation delay (constant) + queuing delay (variable)

Typical Targeting Scenario

• First hop is the bottleneck

  • Cable modem, DSL, high-speed link…

• Timescale for the ABW estimation: 2-4 seconds



Why Is Measuring ABW Hard?

• Available bandwidth changes over time → ABW measurements must be quick

• Audio packets (along the same path) should experience minimum delay → measurement must be non-intrusive

Two Models

• Probe Rate Model (PRM) based solutions

  • Pathload, TOPP, Pathchirp, Bfind, PTR …

• Probe Gap Model (PGM) based solutions

  • Spruce, Delphi, IGI, Moseab …



Pathload (PRM) [Jain & Dovrolis]

• Send probe trains at various rates

• ABW is the probe rate at the transition where OWD starts increasing (queuing delay is observed)
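The probe-rate idea can be sketched as a binary search over the sending rate, using an increasing OWD trend as the signal that the rate exceeds the ABW. This is a simplified illustration, not the Pathload tool itself. The `send_train` callback, the 0.66 trend threshold and the toy fluid path model are all assumptions made for the example.

```python
def owd_increasing(owds, threshold=0.66):
    """Pairwise comparison: is the fraction of increasing consecutive OWDs high?"""
    pairs = list(zip(owds, owds[1:]))
    if not pairs:
        return False
    rising = sum(1 for earlier, later in pairs if later > earlier)
    return rising / len(pairs) > threshold


def estimate_abw(send_train, lo, hi, resolution):
    """Binary-search the rate at which one-way delays start to increase.

    send_train(rate) is a hypothetical callback that sends one probe train
    at `rate` and returns the list of measured one-way delays.
    """
    while hi - lo > resolution:
        rate = (lo + hi) / 2
        if owd_increasing(send_train(rate)):
            hi = rate   # queue is building: rate is above the ABW
        else:
            lo = rate   # no queuing trend: rate is at or below the ABW
    return (lo + hi) / 2


def simulated_train(rate, abw=3.0, packets=20, base_owd=0.010):
    """Toy fluid model: OWD grows linearly only when rate exceeds abw."""
    growth = max(0.0, rate - abw) / rate * 0.001
    return [base_owd + i * growth for i in range(packets)]
```

With the toy path, `estimate_abw(simulated_train, 0.0, 10.0, 0.05)` settles close to the configured 3.0 after a handful of trains, which also illustrates why PRM estimation is iterative and comparatively slow.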

Spruce (PGM) [Strauss et al.]

• Send probe pairs/trains at rate Ri (Ri > A), and measure the sending gaps and receiving gaps

• Compute A directly
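The direct computation can be sketched with the gap formula published for Spruce, A = C · (1 − (Δout − Δin) / Δin), where C is the bottleneck capacity, Δin the sending gap and Δout the receiving gap. Knowing C in advance is one of the assumptions of the gap model. The function and parameter names below are illustrative.

```python
def spruce_abw(capacity_bps, gap_in_s, gaps_out_s):
    """Average the per-pair gap-model estimates of available bandwidth.

    capacity_bps: bottleneck capacity C, assumed known
    gap_in_s:     gap between the two probes of a pair at the sender
    gaps_out_s:   measured gaps at the receiver, one per probe pair
    """
    samples = [capacity_bps * (1.0 - (gap_out - gap_in_s) / gap_in_s)
               for gap_out in gaps_out_s]
    return sum(samples) / len(samples)
```

For example, on a 10 Mbps bottleneck with pairs sent 1.2 ms apart (the transmission time of a 1500-byte packet at that rate), a received gap of 1.8 ms means cross traffic occupied half the link, so the estimate is about 5 Mbps.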



Advantages/Disadvantages of the Approaches

Approach             | Advantages                                                   | Disadvantages
PGM based approaches | Fast estimation: estimation can be done in a single probe    | Assumptions are not easy to verify in practice
PRM based approaches | No assumptions                                               | Slow estimation: iterative probes

Forward Error Correction

Block Based Erasure Resilient Coding

[Figure: the original data of k messages (1, 2, 3, …, k) is encoded by ERC into n coded blocks (1, 2, 3, …, k, k+1, …, n); at a certain instance, some blocks (marked X) are lost]

Some of the blocks may be lost in delivery. However, as long as at least k blocks are delivered, the original data can be reconstructed.

ERC in VoIP and Video Conferencing

• VoIP

  • Mainly packet replication, due to small VoIP packet size & low delay requirement

• Video Conferencing

  • Packet loss protection (for I frame or P frame in HD)

  • Each frame is separated into k messages and protected by n-k additional messages. As long as no more than n-k messages are lost, the transmission succeeds



ERC Terms

• Number of original blocks: k

• Number of coded blocks: n

• Rate of ERC: k/n

• MDS: Maximum Distance Separable

  • Any k of the n coded blocks can recover the original

  • The theoretically optimal performance

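Erasure Encoding: Mathematics

Original data: $x_1, x_2, \ldots, x_k$

Coded data: $y_1, y_2, \ldots, y_n$

[Equation: coded data = generator matrix × original data]

$x_i$, $y_j$: vectors over a Galois field.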



Example: ERC of 10MB

Original data (10MB): $x_1, x_2, \ldots, x_k$

Coded data (n=30): $y_1, y_2, \ldots, y_n$

k=10, GF(2^8), each vector is 1MB.

[Figure: a 30 × 10 generator matrix applied to vectors of 1MB each]

Erasure Decoding: Mathematics

Original data: $x_1, x_2, \ldots, x_k$

Coded data: $y_1, y_2, \ldots, y_n$

[Figure: the rows of the code are selected according to the coded blocks that are available]

Original data can be recovered if the sub-generator matrix has full rank k.
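Encoding with a generator matrix and decoding from any k surviving blocks can be sketched as follows. To keep the arithmetic short, the sketch swaps GF(2^8) for the prime field of integers modulo 257 and uses a Vandermonde generator matrix, so that every k-row sub-matrix has full rank. Each block is a single symbol here rather than a 1MB vector, and all names are illustrative.

```python
P = 257  # prime modulus; a stand-in for GF(2^8), chosen for simple arithmetic


def generator_row(j, k):
    """Row j of a Vandermonde generator matrix: powers of the point j + 1."""
    return [pow(j + 1, i, P) for i in range(k)]


def encode(data, n):
    """Encode k data symbols (each 0..256) into n coded symbols."""
    k = len(data)
    return [sum(g * x for g, x in zip(generator_row(j, k), data)) % P
            for j in range(n)]


def solve_mod(matrix, rhs):
    """Solve matrix * x = rhs modulo P by Gauss-Jordan elimination."""
    k = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if aug[r][col] % P), None)
        if pivot is None:
            raise ValueError("sub-generator matrix does not have full rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], P - 2, P)  # modular inverse (Fermat)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(v - f * c) % P for v, c in zip(aug[r], aug[col])]
    return [aug[r][k] for r in range(k)]


def decode(received, k):
    """Recover the k data symbols from a dict {block_index: coded_symbol}."""
    idx = sorted(received)[:k]
    if len(idx) < k:
        raise ValueError("fewer than k blocks received")
    sub_generator = [generator_row(j, k) for j in idx]
    return solve_mod(sub_generator, [received[j] for j in idx])
```

With `k = 3` and `n = 6`, any three of the six coded symbols are enough: the decoder builds the sub-generator matrix from the rows of the surviving blocks and inverts it.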

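Systematic vs Non-Systematic ERC

• Systematic ERC

  • Slightly lower encoding & decoding complexity

  • Even if the data cannot be fully recovered, the original messages that did arrive can still be used

[Figure: original data of k messages (1, 2, 3, …, k); non-systematic ERC output 1, 2, 3, …, k, k+1, …, n; systematic ERC output 1, 2, 3, …, k, k+1, …, n, in which the first k blocks are the original messages]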



Reed-Solomon

• Has been around for decades

• Has systematic form

• Cauchy Reed-Solomon Code

Reed-Solomon Decoding

[Figure: receive, then inverse]



Dejitter Buffer

Variable Delay & Dejitter Buffer

• Queuing delay

• Dejitter buffers

• Variable packet sizes

[Figure: queuing delay at successive hops, followed by the dejitter buffer]



Fixed Dejitter Buffer – Budget For Worst Case

• Total End-to-End Delay

  • Codec delay: 40 ms

  • Propagation delay: 8 ms

  • Dejitter buffer: 50 ms

    • To accommodate queuing delay: 0-50 ms

  • Total delay: 98 ms

[Figure: Site A to Site B over a 128 kbps link: coder delay 40 ms, queuing delay 4-50 ms, propagation delay 8 ms, dejitter buffer 50 ms]

Dejitter Buffer Size & Late Loss

[Figure: playout with jitter; axes: delay and packet loss; annotations: buffering delay and late loss]

Fixed playout deadline and jitter absorption:

• The playout rate is constant

• The tradeoff is between dejitter buffer size and late loss



Adaptive Playout and Dejitter Buffer Adaptation

Adaptive playout and jitter adaptation

• Scaling of voice/video packets in a highly dynamic way

• Playout schedule set according to past delays recorded

  • Usually the dejitter buffer size expands quickly on late packet arrival, and shrinks slowly when jitter reduces

• Improved tradeoff between buffering delay and late loss

• Playout rate is not constant
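The "expand quickly, shrink slowly" behaviour can be sketched with a small state machine that tracks a target buffering delay. The class name, the safety margin and the shrink factor are illustrative assumptions. A real implementation would also time-scale the audio when the target changes.

```python
class AdaptiveDejitter:
    """Tracks a target buffering delay from observed per-packet queuing delays."""

    def __init__(self, initial_ms=20.0, shrink=0.02, margin_ms=5.0):
        self.target_ms = initial_ms  # current dejitter buffer depth
        self.shrink = shrink         # fraction of the excess removed per packet
        self.margin_ms = margin_ms   # headroom kept above the observed delay

    def on_packet(self, queuing_delay_ms):
        """Update the target; return True if this packet missed its deadline."""
        late = queuing_delay_ms > self.target_ms
        wanted = queuing_delay_ms + self.margin_ms
        if wanted > self.target_ms:
            # Expand immediately so the next delayed packets are not lost.
            self.target_ms = wanted
        else:
            # Shrink gradually to win back delay once jitter subsides.
            self.target_ms -= self.shrink * (self.target_ms - wanted)
        return late
```

A single 45 ms delay spike pushes the target from about 20 ms to 50 ms in one step, while a following run of 10 ms packets only lets it drift back toward 15 ms over many packets.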

[Figure: playout with jitter; axes: delay and packet loss; annotation: buffering delay]

Adaptive Play Out

• Packets are pushed into the Adaptive Playout module

• The renderer requests a new waveform segment for playout

• The Playout module passes packets to the audio decoder

[Figure: Audio Adaptive Playout]



Packet Loss Concealment

Audio Packet Loss Concealment

[Figure: packets i-2, i-1, i+1, i+2 on a time axis with packet i lost; packets i-1 and i+1 are stretched (lengths L, ∆L, 1.3 L, 2 L) to cover the gap, with the alignment found by correlation]

• Depends on voiced & unvoiced segments

Voiced segments

Unvoiced segments

Concealment as (bi-directional) stretching

Video Packet Loss Concealment

• Spatial Concealment

  • Use spatial correlation

  • E.g., bilinear interpolation

  • Projection onto convex sets

• Temporal Concealment

  • Use the correlation that exists between consecutive frames

  • Temporal replacement

  • Boundary matching



Spatial-Temporal Concealment

Summary

Summary

• VoIP/Video Conference Systems

  • Infrastructure based

  • P2P based

• Audio/Video Components

  • Audio codec

  • Video codec

  • Acoustic echo cancellation

• Network components

  • Primer of the Internet

  • Network characteristics

  • Available bandwidth estimation

  • Forward error correction (FEC)

  • Dejitter buffer

  • Packet loss concealment
