Spoken Language Translation Information Theory: Source Coding


Page 1: Spoken Language Translation Information Theory: Source Coding

Information Theory:

Source Coding

Dr. Satoshi Nakamura

Honorarprofessor, Karlsruhe Institute of Technology, Germany

Professor, Graduate School of Information Science, Nara Institute of Science and Technology, Japan

Invited Advisor, National Institute of Information and Communications Technology, Japan

Fellow, Advanced Telecommunication Research Institute International, Kyoto, Japan

[email protected]

Spoken Language Translation Research Laboratories

Page 2: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 2

NAIST?

Keihanna Science City (NAIST, ATR, NICT, NTT, NEC, ...)

(Map: NAIST is located more than 700 km from the Fukushima Nuclear Power Plant.)

Page 3: Spoken Language Translation Information Theory: Source Coding

Acknowledgements of appreciation for the help and support received.

2012/10/24 Prof. Satoshi Nakamura 3

Daimler special-purpose car (total 4M Euro); 43 rescue members and 4 rescue dogs

Example: Deutsche Bank $2,640,000

and more.

Deeply, thank you!

Page 4: Spoken Language Translation Information Theory: Source Coding

Congratulations, Shinya Yamanaka

on Nobel Prize in Physiology or Medicine

Education

He received his M.D. at Kobe University in 1987 and his Ph.D. at Osaka City University Graduate School in 1993.

Professional career Between 1987 and 1989, Yamanaka was a Resident in orthopedic surgery at the National

Osaka Hospital.

During 1993–1995, he was a Postdoctoral Fellow at the Gladstone Institute of Cardiovascular Disease, which is affiliated with the University of California, San Francisco.

During 1995–1996, he was a staff research investigator at the UCSF-affiliated Gladstone Institute of Cardiovascular Disease.

Between 1996 and 1999, he was an assistant professor at Osaka City University Medical School.

During 1999–2003, he was an associate professor at the Nara Institute of Science and Technology. During 2003–2005, he was a professor at the Nara Institute of Science and Technology. Between 2004 and 2010, Yamanaka was a professor at the Institute for Frontier Medical Sciences.[9]

Currently Yamanaka is the director and a professor at the Center for iPS Cell Research and Application in Kyoto University, Japan.

In 2006, he and his team generated Induced Pluripotent Stem Cells – pluripotent stem cells from adult mouse fibroblasts. In 2007, he and his team were able to generate Induced Pluripotent Stem Cells from human adult fibroblasts.[1][2][3]

24/10/2012 Satoshi Nakamura, NAIST, all rights reserved. 4

Page 5: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 5

NAIST?

Keihanna Science City (NAIST, ATR, NICT, NTT, NEC, ...)

(Map: NAIST is located more than 700 km from the Fukushima Nuclear Power Plant.)

Page 6: Spoken Language Translation Information Theory: Source Coding

About NAIST ?

Nara Institute of Science and Technology, Japan established 1991.

Japanese national university for basic research and higher education.

Top-ranked research evaluation among Japanese universities in number of papers and grants per faculty member.

Three graduate schools (No undergraduate school)

Information Science

Biological Sciences: Prof. Yamanaka (iPS cells)

Material Science

Sister school: JAIST, Japan Advanced Institute of Science and Technology

Graduate School of Information Science

20 laboratories

10 collaborative laboratories

(ATR, AIST, NEC, Panasonic, NTT, NICT, Fujitsu, Docomo, OMRON)

2012/10/24 Prof. Satoshi Nakamura 6

Page 7: Spoken Language Translation Information Theory: Source Coding

NAIST Ranking

Overall: Ranked 1st

Highest-evaluated university in Japan, based on data in Thomson Reuters' "Essential Science Indicators" and published in "University Ranking 2010" by the leading Japanese newspaper "Asahi Shimbun" (in the top 5%, A+).

Three research areas in the Graduate School of Information Science received top scores in a survey conducted by the Ministry of Economy, Trade and Industry.

Ranked 1st in "Research" and "Education" among all national universities in Japan, as published in the weekly magazine "Toyo Keizai".

Number of Grants-in-Aid for scientific research: ranked 1st per faculty member*

Grants-in-Aid for scientific research: ranked 1st per faculty member*

2012/10/24 Prof. Satoshi Nakamura 7

Page 8: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 8

National Institute of Information and

Communications Technology, NICT

Page 9: Spoken Language Translation Information Theory: Source Coding

About NICT?

History (timeline shown on the slide):

1952: Radio Research Lab (national laboratory)
1988: Communications Research Lab
2001: Communications Research Laboratory (incorporated administrative agency)

1979: Telecommunications and Broadcasting Satellite Organization (certified institution)
1992: Telecommunications Advancement Organization

2004: the two streams merge into the National Institute of Information and Communications Technology (NICT)

2012/10/24 9 KIIT & CDAC Noida

Page 10: Spoken Language Translation Information Theory: Source Coding

Locations

2012/10/24 10 Prof. Satoshi Nakamura

Page 11: Spoken Language Translation Information Theory: Source Coding

NICT Keihanna Research Laboratories

(Opened) April 1, 2008

(Location) Kansai Science City (Kyoto, Nara, Osaka)

(Number of staff) about 160

(Map: Kansai Science City; NICT occupies a part of the ATR site.)

2012/10/24 11 Prof. Satoshi Nakamura

Page 12: Spoken Language Translation Information Theory: Source Coding

Overcome the barriers in the ICT society:

(I) Barriers of language: R&D on multilingual technology

(II) Barriers of ability: spoken language and nonverbal interaction technology

(III) Barriers of information quality: information analysis with information credibility criteria

(IV) Barriers between the real and the cyber world: natural, real-time connections between the two worlds

(V) Barriers of distance: ultra-realistic communications to provide the feeling of "being there" via all five senses, etc.

2012/10/24 12 Prof. Satoshi Nakamura

Page 13: Spoken Language Translation Information Theory: Source Coding

About ATR?

ATR: Advanced Telecommunication Research Institute International

ATR was founded in March 1986.

2012/10/24 Prof. Satoshi Nakamura 13

Page 14: Spoken Language Translation Information Theory: Source Coding

ATR Laboratories

Brain Information Communication Research Labs Group

Computational Neuroscience Lab.

Cognitive Mechanisms Lab.

Neural Information Analysis Lab.

Social Media Research Labs. Group

Intelligent Robotics and Communication Lab.

Hiroshi Ishiguro Lab.

Adaptive Communications Research Lab.

Wave Engineering Lab.

Spoken Language Communication Research Labs.

Media Information Science Labs

2012/10/24 Prof. Satoshi Nakamura 14

Page 15: Spoken Language Translation Information Theory: Source Coding

History of Speech Translation Research

10/24/2012 15

Three eras:

1. Read speech: syntactically correct, clear utterances, limited domain ("Conference Registration")

2. Daily conversation: standard expressions, unclear utterances, limited domain ("Hotel Reservation")

3. Wider and real domain ("International Travel"): realistic expressions, noisy speech, J-E and J-C speech translation

Timeline (1986, 1992, 1999, 2006, 2008, 2010): ATR, NICT/ATR, NICT MASTAR, MIC & NICT & CSTP project; international consortia C-STAR (ATR, CMU, UKA, CLIPS, IRST, ETRI, CAS) and A-STAR (→ U-STAR).

Page 16: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 17

Source Coding

Contents of the lecture

Information Theory: Source Coding + Channel Coding + Encryption

Goal: Understanding of Source Coding by theory and application

Contents: Amount of information, modeling of information source

Zero-memory source, Markov source, hidden Markov source

Source coding theorem, compact codes

Universal coding, rate distortion theory

Source coding of analog signal, vector quantization

Modeling and coding of language and speech

Page 17: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 18

Text book and references

Norman Abramson: "Information Theory and Coding", McGraw-Hill, 1963.

A. Gersho, R. M. Gray: "Vector Quantization and Signal Compression", Kluwer Academic Publishers.

T. C. Bell, J. G. Cleary, I. H. Witten: "Text Compression", Prentice Hall.

Page 18: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 19

Role of information theory

Information theory: a measure of the amount of information, and modeling of information sources.

Claude Shannon, "A Mathematical Theory of Communication" (1948), Bell System Technical Journal.

Shannon entitles his theory a mathematical theory of communication: a theory of the carriers of information.

It is a theory about the carriers of information (symbols), not about the information itself.

"The semantic aspects of communication are irrelevant to the engineering problem."

Page 19: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 20

Transmission model

Efficient usage of transmission channel

Digital channel: Reduction of transmission codes

Analog channel: Reduction of transmission time and frequency bands

Improve reliability

Digital channel: Reduction of transmission errors

Analog channel: Improve Signal to Noise Ratio

(Block diagram: Information source → Transmitter (coder) → Transmission channel → Receiver (decoder) → Decoded information. Messages are encoded into codes and decoded back into messages; a noise source acts on the channel.)

Page 20: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 21

Separate modeling

Separate optimization:

Source coding + Channel coding

(Block diagram: Information source → Source coding → Channel coding → Transmission channel (with noise source) → Channel decoding → Source decoding → Decoded information.)

Page 21: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 22

Amount of information

Amount of information: defined by the statistical properties of an overall set, not by individual events.

Statistical structure, statistically definable sets => memoryless source, Markov source.

Non-statistical sets and sets with unknown structure => universal coding:

Lempel-Ziv coding

Arithmetic coding

Page 22: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 23

Hierarchical model of codes

(Hierarchical model, mirrored on the transmitter (information source) and receiver sides: signal; model; structure, symbol; concept, meaning, intention.)

Coding methods corresponding to each level:

signal → waveform coding
model → parametric coding
structure, symbol → recognition-based coding
concept, meaning, intention → intelligent coding

Page 23: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 24

What is information

Information: messages which reduce uncertainty.

Measurement of body temperature: prediction of whether someone has caught a cold becomes possible.

Weather forecast: prediction of tomorrow's weather becomes possible.

Information theory:

Measurement of information

Higher efficiency and reliability of transmission

Page 24: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 25

Properties of information

Non-negativity: The amount of information is non-negative. If the probability of an event equals 0 or 1, its amount of information is 0.

Events that are certain to happen, or certain not to happen, carry no additional information; their amount of information is 0.

Learning that an event with probability 0 < p < 1 has occurred brings a certain amount of information, since it reduces ambiguity.

Monotonic decrease: The smaller the probability of an event, the larger its amount of information.

The amount of information is larger if the event is unexpected.

Page 25: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 26

Amount of information: Additivity

How much is the amount of information, I(pq), of the joint occurrence of two independent events with probabilities p and q,

where

I(p) is the amount of information of an event with probability p, and

I(q) is the amount of information of an event with probability q?

Additivity, I(pq) = I(p) + I(q),

means that the amount of information is the same whether it is received all at once or piece by piece.

Page 26: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 27

Amount of information

The only functional form which satisfies the above three properties is

I(p) = -\log p.

Now I(p) is defined as the amount of information.

Units of the amount of information:

[bit]: I(p) = -\log_2 p
[nat]: I(p) = -\log_e p
[dit] or [Hartley]: I(p) = -\log_{10} p

(The statement "the information is maximal at p = 0.5" is only valid for the average case, i.e. for the entropy of a binary source.)

The amount of information in [bit] represents the average number of yes/no questions needed to determine which event has happened.
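To make the units concrete, here is a small Python sketch (added for illustration, not part of the original slides) that evaluates the self-information I(p) = -log p in bits, nats, and dits:

```python
import math

def self_information(p, base=2):
    """Self-information I(p) = -log_base(p) of an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p, base)

# base 2 -> bits, base e -> nats, base 10 -> dits (Hartleys)
print(self_information(0.5))              # 1.0 bit
print(self_information(0.5, math.e))      # ~0.693 nat
print(self_information(0.5, 10))          # ~0.301 dit
```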

Page 27: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 28

What is coding?

Binary coding of the decimal digits.

Message Symbols:

0,1,…,9

Code word:

0000,0001,0010,…

Backward decoding is straightforward in this example.

Decimal number   Binary representation

0 0000

1 0001

2 0010

3 0011

4 0100

5 0101

6 0110

7 0111

8 1000

9 1001

Page 28: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 29

What is coding ?

A binary code.

Backward decoding is NOT

straightforward.

111001 can be generated by

“S4S3” and “S4S1S2”

Message symbols   Binary representation

s1 0

s2 01

s3 001

s4 111

Page 29: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 30

What is coding ?

Another binary code

Use “0” as a separator.

Backward decoding is unique and

straightforward.

Message symbols   Binary representation

s1 0

s2 10

s3 110

s4 1110

Page 30: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 31

One problem in coding

Weather in San Francisco

Code alpha:

Two binary digits are used.

“Sunny, Foggy, Foggy, Cloudy,”

comes to “00111101”.

Two binary digits are necessary for backward decoding.

Message symbols   Probability

Sunny 1/4
Cloudy 1/4
Rainy 1/4
Foggy 1/4

Message symbols   Code

Sunny 00
Cloudy 01
Rainy 10
Foggy 11

Page 31: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 32

One problem in coding

Weather in Los Angeles

Code beta:

Two binary digits are used.

Probabilities are non-uniform

“Sunny,Smoggy,Smoggy, Cloudy” comes to “1000110”.

Waiting for 0 is necessary to backward decoding.

Average code length = 1 7/8 < 2 binit.

Message symbols   Probability

Sunny 1/4
Cloudy 1/8
Rainy 1/8
Smoggy 1/2

Message symbols   Code

Sunny 10
Cloudy 110
Rainy 1110
Smoggy 0

L = 2 Pr(Sunny) + 3 Pr(Cloudy) + 4 Pr(Rainy) + 1 Pr(Smoggy)
  = 2 (1/4) + 3 (1/8) + 4 (1/8) + 1 (1/2) = 1 7/8 binits/message
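As a quick numeric check (an added sketch, not from the slides), the average code length of code beta follows directly from the two tables above:

```python
# Probabilities and code words of "code beta" for the Los Angeles weather source
probs = {"Sunny": 1/4, "Cloudy": 1/8, "Rainy": 1/8, "Smoggy": 1/2}
codes = {"Sunny": "10", "Cloudy": "110", "Rainy": "1110", "Smoggy": "0"}

avg_len = sum(probs[s] * len(codes[s]) for s in probs)
print(avg_len)  # 1.875 = 1 7/8 binits per message, less than 2
```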

Page 32: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 33

Amount of information

TV: black, white, and gray dots, with roughly 500 rows and 600 columns. The 500 x 600 = 300,000 dots may each take any one of 10 distinguishable brightness levels, so one picture has probability p = (1/10)^{300,000}.

Radio: an announcer with a 10,000-word vocabulary selects 1,000 words at random, so p = (1/10,000)^{1,000}.

I(E_TV) = 300,000 \log 10 \approx 10^6 bits

I(E_radio) = 1,000 \log 10,000 \approx 1.3 \times 10^4 bits

A TV picture is worth more than 1,000 words.

Page 33: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 34

Average amount of information

The amount of information is defined by

I(p) = -\log p.

The average amount of information of the information source A is defined by

H(A) = \sum_{i=1}^{n} P(e_i) I(e_i) = -\sum_{i=1}^{n} P(e_i) \log_2 P(e_i),

and H(A) satisfies

0 \le H(A) \le \log_2 n (bit).

Entropy = average amount of information.
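A minimal Python sketch of the entropy definition above (added for illustration); the uniform distribution attains the upper bound log2 n:

```python
import math

def entropy(probs, base=2):
    """H = -sum p log p, skipping zero-probability events (0 log 0 := 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (uniform, n = 4)
print(entropy([0.5, 0.25, 0.25]))         # 1.5 bits
print(math.log2(3))                       # upper bound for n = 3 events, ~1.585
```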

Page 34: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 35

Entropy

Entropy represents the ambiguity of the information source.

When one message e_i is received, the ambiguity H(A) of the information source decreases.

This decrease is equal to the amount of information of the message e_i.

Page 35: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 36

Properties of Entropy

Now we have source alphabet {0, 1} with P(0) = \omega and P(1) = 1 - \omega. The entropy function H(\omega) = -\omega \log \omega - (1 - \omega) \log (1 - \omega) is shown on the slide as the binary entropy curve: it is 0 at \omega = 0 and \omega = 1, and reaches its maximum of 1 bit at \omega = 0.5.

Page 36: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 37

Amount of information for multiple events

The amount of information for multiple events can be defined by the decrease of the entropy. Let P(a_i) be the prior probability of a message a_i, and P(a_i|b_j) the posterior probability of a_i given a message b_j. The prior entropy of information source A is defined by

H(A) = \sum_{A} P(a_i) \log \frac{1}{P(a_i)},

and the posterior entropy of information source A given a message b_j is defined by

H(A|b_j) = \sum_{A} P(a_i|b_j) \log \frac{1}{P(a_i|b_j)}.

Therefore, the amount of information of the multiple events is

H(A) - H(A|b_j).

Page 37: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 38

Conditional Entropy

The conditional entropy is the expectation of H(A|b_j):

H(A|B) = \sum_{j=1}^{m} P(b_j) \sum_{i=1}^{n} P(a_i|b_j) \log \frac{1}{P(a_i|b_j)} = -\sum_{i=1}^{n} \sum_{j=1}^{m} P(a_i, b_j) \log P(a_i|b_j).

And the following inequalities hold:

0 \le H(A|B) \le H(A, B) \le H(A) + H(B).

Page 38: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 39

Mutual Information

The amount of information of multiple events was H(A) - H(A|b_j). What is the amount of information if we know the whole information source B, rather than a single message b_j of B?

I(A;B) = H(A) - H(A|B)
       = \sum_i P(a_i) \log \frac{1}{P(a_i)} - \sum_{i,j} P(a_i, b_j) \log \frac{1}{P(a_i|b_j)}
       = \sum_{i,j} P(a_i, b_j) \log \frac{P(a_i, b_j)}{P(a_i) P(b_j)}.

I(A;B) is called the "mutual information".

Page 39: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 40

Joint Entropy

The entropy of the joint information source A and B is defined by

H(A, B) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(a_i, b_j) \log \frac{1}{P(a_i, b_j)}.

Page 40: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 41

Mutual Information

The mutual information I(A;B) satisfies

0 \le I(A;B) \le H(A),

I(A;B) = I(B;A) = H(B) - H(B|A) = H(A) + H(B) - H(A, B).

Page 41: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 42

Mutual Information (example)

A \ B        Play (b1)    Not play (b2)    P(a_i)
Win (a1)     0.42 (0.6)   0.28 (0.93)      0.7
Lose (a2)    0.28 (0.4)   0.02 (0.07)      0.3
P(b_j)       0.7          0.3              1.0
(entries are joint probabilities P(a_i, b_j), with the conditional probabilities P(a_i|b_j) in parentheses)

The initial entropy of A is

H(A) = -0.7 \log 0.7 - 0.3 \log 0.3 = 0.88.

After we learn that he plays, the winning rate becomes 0.6:

H(A|b_1) = -0.6 \log 0.6 - 0.4 \log 0.4 = 0.97.

The entropy increases by knowing b_1. If we learn that he does not play, the winning rate is 0.93; this time the entropy decreases:

H(A|b_2) = -0.93 \log 0.93 - 0.07 \log 0.07 = 0.37.

Now the mutual information is

I(A;B) = H(A) - H(A|B) = 0.88 - (0.7 \times 0.97 + 0.3 \times 0.37) = 0.88 - 0.79 = 0.09.

Page 42: Spoken Language Translation Information Theory: Source Coding

Goal of 1st day

2012/10/24 Prof. Satoshi Nakamura 43

Page 43: Spoken Language Translation Information Theory: Source Coding

ROLE OF SOCIAL MEDIA

2012/10/24 Prof. Satoshi Nakamura 44

Page 44: Spoken Language Translation Information Theory: Source Coding

Credibility Increased information Source

2012/10/24 Prof. Satoshi Nakamura 45

(Survey chart, by information source: NHK, portal sites, social media, academia, government, commercial TV, newspapers.)

Page 45: Spoken Language Translation Information Theory: Source Coding

Credibility Decreased Information Source

2012/10/24 Prof. Satoshi Nakamura 46

(Survey chart, by information source: NHK, portal sites, social media, academia, government, commercial TV, newspapers.)

Page 46: Spoken Language Translation Information Theory: Source Coding

Trends of Social Media Users

2012/10/24 Prof. Satoshi Nakamura 47

(Chart: number of users in thousands, plotted monthly from March through March '11.)

Page 47: Spoken Language Translation Information Theory: Source Coding

Weekly Trends of #users

2012/10/24 Prof. Satoshi Nakamura 48

Page 48: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 49

Models for information sources

Information source
  - Zero-memory information source
  - Information source with memory
      - Ergodic information source
          - Stationary information source
          - Non-stationary information source
      - Non-ergodic information source
          - Stationary information source
          - Non-stationary information source

Page 49: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 50

Models for information sources

Zero-memory information source:

Successive source symbols from S = {s_1, s_2, s_3, ..., s_q} are mutually independent and independent of the past symbols. A zero-memory information source is completely described by the source alphabet S and the symbol probabilities P(s_1), P(s_2), ..., P(s_q).

Markov information source:

The probability of the source symbol s_i depends on the previous m symbols. If m = 1, it is called a 1st-order Markov model. The probabilities of the symbols are described by

P(s_i | s_{j1}, s_{j2}, ..., s_{jm}), \quad i = 1, 2, ..., q; \; j_k = 1, 2, ..., q.

Page 50: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 51

Models of information source

Stationary information source:

The probabilities of the source symbols are invariant to time shifts.

Ergodic information source:

With probability 1, an observed symbol sequence has the same statistics as a representative (typical) sequence, when we observe the source for a long time.

Page 51: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 52

Zero-memory information source

Zero-memory information source: successive symbols emitted from the source are statistically independent. The source is described by the source alphabet S and the probabilities with which the symbols occur:

P(s_1), P(s_2), ..., P(s_q).

The amount of information of one symbol s_i is

I(s_i) = \log \frac{1}{P(s_i)} \; \text{bits}.

The average amount of information of the source S, i.e. the entropy H(S) of the zero-memory source, is

H(S) = \sum_{S} P(s_i) I(s_i) = \sum_{S} P(s_i) \log \frac{1}{P(s_i)}.

Page 52: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 53

Examples

Source S: S = \{s_1, s_2, s_3\}, \; P(s_1) = \frac{1}{2}, \; P(s_2) = P(s_3) = \frac{1}{4}.

H(S) = \frac{1}{2} \log 2 + \frac{1}{4} \log 4 + \frac{1}{4} \log 4 = \frac{3}{2} \; \text{bits}.

If I(s_i) is measured in r-ary units, we have

H_r(S) = \sum_{S} P(s_i) \log_r \frac{1}{P(s_i)} = \frac{H(S)}{\log_2 r} \quad \text{(r-ary units)}.

Page 53: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 54

Some properties of Entropy

The line y = x - 1 lies above y = \ln x:

\ln x \le x - 1, with equality if, and only if, x = 1; equivalently, \ln \frac{1}{x} \ge 1 - x.

Now let x_i \ge 0 and y_j \ge 0 for all i and j, with \sum_{i=1}^{q} x_i = 1 and \sum_{j=1}^{q} y_j \le 1. Using \log x = \ln x / \ln 2,

\sum_{i=1}^{q} x_i \log \frac{y_i}{x_i} = \frac{1}{\ln 2} \sum_{i=1}^{q} x_i \ln \frac{y_i}{x_i} \le \frac{1}{\ln 2} \sum_{i=1}^{q} x_i \left( \frac{y_i}{x_i} - 1 \right) = \frac{1}{\ln 2} \left( \sum_{i=1}^{q} y_i - \sum_{i=1}^{q} x_i \right) \le 0.

2012/10/24 Prof. Satoshi Nakamura 55

Some properties of Entropy

It follows that

\sum_{i=1}^{q} x_i \log \frac{1}{x_i} \le \sum_{i=1}^{q} x_i \log \frac{1}{y_i},

with equality if, and only if, x_i = y_i for all i. This is called Jensen's inequality on the slides (it is also known as Gibbs' inequality).

Page 55: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 56

Some properties of Entropy

H(S) = \sum_{i=1}^{q} P_i \log \frac{1}{P_i}.

Applying the inequality of the previous slide with y_i = 1/q,

H(S) = \sum_{i=1}^{q} P_i \log \frac{1}{P_i} \le \sum_{i=1}^{q} P_i \log q = \log q.

Hence H(S) is always less than or equal to \log q. Equality holds if, and only if, P_i = 1/q for all i.

Page 56: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 57

Properties of Entropy

Again, we have source alphabet \{0, 1\} with P(0) = \omega and P(1) = 1 - \omega; the entropy function H(\omega) = -\omega \log \omega - (1 - \omega)\log(1 - \omega) is the binary entropy curve shown on the slide, maximal (1 bit, = \log 2) at \omega = 0.5.

Page 57: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 58

Extensions of a zero-memory source

Extensions to blocks of symbols: for instance, for a binary source the blocks of two symbols are 00, 01, 10, and 11.

Definition: Let S be a zero-memory information source with source alphabet \{s_1, s_2, ..., s_q\} and with the probability of s_i equal to P_i. Then the n-th extension of S, S^n, is a zero-memory source with q^n symbols \sigma_1, \sigma_2, ..., \sigma_{q^n}. Each \sigma_i corresponds to some sequence of n of the s_i, and P(\sigma_i), the probability of \sigma_i, is just the probability of the corresponding sequence of s_i's. That is, if \sigma_i corresponds to (s_{i1}, s_{i2}, ..., s_{in}), then P(\sigma_i) = P_{i1} P_{i2} \cdots P_{in}.

Page 58: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 59

Extension of zero-memory source

Entropy of the n-th extension:

H(S^n) = \sum_{S^n} P(\sigma_i) \log \frac{1}{P(\sigma_i)}.

Since P(\sigma_i) = P_{i1} P_{i2} \cdots P_{in},

\log \frac{1}{P(\sigma_i)} = \log \frac{1}{P_{i1}} + \log \frac{1}{P_{i2}} + \cdots + \log \frac{1}{P_{in}},

so

H(S^n) = \sum_{S^n} P(\sigma_i) \log \frac{1}{P_{i1}} + \cdots + \sum_{S^n} P(\sigma_i) \log \frac{1}{P_{in}} = n \sum_{S} P_i \log \frac{1}{P_i}.

Hence H(S^n) = n H(S).
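The identity H(S^n) = n H(S) is easy to verify numerically; the following added sketch uses the zero-memory source P = (1/2, 1/4, 1/4) from the earlier example and its third extension:

```python
import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

P = [0.5, 0.25, 0.25]   # zero-memory source of the earlier example
n = 3

# Probabilities of the n-th extension: products over all length-n sequences
ext = [math.prod(seq) for seq in product(P, repeat=n)]

print(entropy(P))        # 1.5 bits
print(entropy(ext))      # 4.5 bits = n * H(S)
```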

Page 59: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 60

Markov Information Source

A more general type of information source with q symbols than the zero-memory

source is one in which the occurrence of a source symbol si may depend on a finite

number m of preceding symbols. Such a source, mth-order Markov source, is

defined by giving the source alphabet S and the set of conditional probabilities.

State: the probability of emitting a given symbol is known if we know the m preceding symbols. We call the m preceding symbols the state of the m-th order Markov source. Such a source is defined by the conditional probabilities

P(s_i | s_{j1}, s_{j2}, ..., s_{jm}), \quad i = 1, 2, ..., q; \; j_k = 1, 2, ..., q.

Page 60: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 61

Markov information source

(Figures: state diagrams of an ergodic Markov source, a non-ergodic Markov source, and a non-stationary source.)

Page 61: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 62

Entropy for Markov source

If we are in the state specified by (s_{j1}, s_{j2}, ..., s_{jm}), then the conditional probability of receiving symbol s_i is P(s_i | s_{j1}, s_{j2}, ..., s_{jm}). The information we obtain if s_i occurs while we are in state (s_{j1}, ..., s_{jm}) is

I(s_i | s_{j1}, ..., s_{jm}) = \log \frac{1}{P(s_i | s_{j1}, ..., s_{jm})}.

The average amount of information per symbol while we are in state (s_{j1}, ..., s_{jm}) is

H(S | s_{j1}, ..., s_{jm}) = \sum_{S} P(s_i | s_{j1}, ..., s_{jm}) \, I(s_i | s_{j1}, ..., s_{jm}).

If we average this quantity over the q^m possible states, weighting each state by its steady-state probability P(s_{j1}, ..., s_{jm}), we obtain the entropy of the m-th order Markov source S:

H(S) = \sum_{S^m} P(s_{j1}, ..., s_{jm}) \, H(S | s_{j1}, ..., s_{jm}).

Page 62: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 63

Entropy of Markov source

The entropy of the m-th order Markov source is therefore given by

H(S) = \sum_{S^{m+1}} P(s_{j1}, ..., s_{jm}) \, P(s_i | s_{j1}, ..., s_{jm}) \log \frac{1}{P(s_i | s_{j1}, ..., s_{jm})}
     = \sum_{S^{m+1}} P(s_{j1}, ..., s_{jm}, s_i) \log \frac{1}{P(s_i | s_{j1}, ..., s_{jm})}.

If S is a zero-memory source, P(s_i | s_{j1}, ..., s_{jm}) = P(s_i), and the expression reduces to the zero-memory entropy.
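As an added sketch of the Markov-source entropy formula, the snippet below weights the per-state entropies by the steady-state distribution; the transition probabilities and steady-state values are those of the two-state example worked out later in the lecture (H(S) ≈ 0.92 bit):

```python
import math

# First-order Markov source over symbols a, b (state = previous symbol):
# P(a|a) = 1/4, P(b|a) = 3/4, P(a|b) = P(b|b) = 1/2
P = {"a": {"a": 0.25, "b": 0.75},
     "b": {"a": 0.5,  "b": 0.5}}

# Steady-state distribution of this chain (derived later in the slides): 2/5, 3/5
pi = {"a": 2/5, "b": 3/5}

H = 0.0
for state, row in P.items():
    H_state = -sum(p * math.log2(p) for p in row.values() if p > 0)
    H += pi[state] * H_state      # H(S) = sum over states of pi_j * H(S | state j)

print(round(H, 2))  # ~0.92 bits/symbol
```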

Page 63: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 64

Example

Probabilities for the Markov source

(Table/state diagram of probabilities for the example Markov source; the entries are fractions with denominator 14: 4/14, 1/14, 1/14, 1/14; 1/14, 1/14, 1/14, 4/14; 5/14, 5/14, 1/14, 1/14; 1/14, 1/14, 5/14, 5/14.)

Page 64: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 65

Adjoint source

Definition: Let S = {s_1, s_2, ..., s_q} be the source alphabet of an m-th order Markov source, and let P_1, P_2, ..., P_q be the first-order symbol probabilities of the source. The adjoint source to S, written \bar{S}, is the zero-memory information source with source alphabet identical to that of S and with symbol probabilities P_1, P_2, ..., P_q.

The following relationship holds:

H(\bar{S}) \ge H(S).

Page 65: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 66

Adjoint source

Let S be a 1st-order Markov source. Then

H(S) = \sum_{S^2} P(s_i, s_j) \log \frac{1}{P(s_j | s_i)}
     \le \sum_{S^2} P(s_i, s_j) \log \frac{1}{P(s_j)} \qquad \text{(by Jensen's inequality, taking } y = P(s_j))
     = \sum_{j} P(s_j) \log \frac{1}{P(s_j)} = H(\bar S).

Equivalently,

\sum_{S^2} P(s_i, s_j) \log \frac{P(s_j)}{P(s_j | s_i)} \le 0.

Page 66: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 67

Extension of a Markov source

Definition: Let S be an m-th order Markov information source with source alphabet (s_1, s_2, ..., s_q) and conditional symbol probabilities P(s_i | s_{j1}, s_{j2}, ..., s_{jm}). Then the n-th extension of S, S^n, is a \mu-th order Markov source with q^n symbols \sigma_1, \sigma_2, ..., \sigma_{q^n}. Each \sigma_i corresponds to some sequence of n of the s_i, and the conditional symbol probabilities of S^n are P(\sigma_i | \sigma_{j1}, \sigma_{j2}, ..., \sigma_{j\mu}). \mu is given by \mu = \lceil m/n \rceil, the smallest integer not less than m/n. The entropy is given by

H(S^n) = \sum_{S^n} \sum_{S^{n\mu}} P(\sigma_j, \sigma_i) \log \frac{1}{P(\sigma_i | \sigma_j)}, \qquad H(S^n) = n\, H(S).

2012/10/24 Prof. Satoshi Nakamura 68

Extension of a Markov source

Example: m = 1, n = 3, so \mu = \lceil m/n \rceil = 1.

P(\sigma_i | \sigma_j) = P(s_{t+1}, s_{t+2}, s_{t+3} | s_{t-2}, s_{t-1}, s_t) = P(s_{t+1}, s_{t+2}, s_{t+3} | s_t),

since for a 1st-order source only the last symbol of the previous block matters. Expanding with the chain rule,

P(\sigma_i | \sigma_j) = P(s_{i1} | s_{jn}) P(s_{i2} | s_{i1}) \cdots P(s_{in} | s_{i,n-1}),

and substituting into the entropy sum gives

H(S^n) = \sum_{S^n} \sum_{S^n} P(\sigma_j, \sigma_i) \log \frac{1}{P(\sigma_i | \sigma_j)} = n\, H(S).

2012/10/24 Prof. Satoshi Nakamura 69

Adjoint source of extended Markov source

Adjoint source of extended Markov source, .

Let be the first-order symbol probabilities of the symbols

of the nth extension of the first-order Markov source. Since corresponds to the

sequence , we see that may also be thought of as the nth-order

joint probability of the .

If S is a first-order Markov source.

ns

)(,),(),( 21 nqPPP

i

i

),,,(21 niii sss )( iP

kis

nn

n

n

S iii

iii

S i

i

n

sssPsssP

PPSH

),,,(

1log),,,(

)(

1log)()(

21

21

)()()()(),,,(12312121

nnn iiiiiiiiii ssPssPssPsPsssP

)]()([)()(

)()1()(

)(

1log

)(

1log

)(

1log),,,()(

1121

21

SHSHSnHSH

orSHnSH

ssPssPsPsssPSH

n

S iiiii

iii

n

nnn

n

Page 69: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 70

Adjoint source of extended Markov source

For an m-th order Markov source,

H(\overline{S^n}) \ge H(S^n) = n H(S), \qquad \frac{H(\overline{S^n})}{n} \ge H(S),

and

\lim_{n \to \infty} \frac{H(\overline{S^n})}{n} = H(S).

This inequality becomes less important as n becomes larger: for larger n, the Markov constraints on the symbols from S^n become less and less important. The adjoint of the n-th extension of S is not the same as the n-th extension of the adjoint of S; since \bar S is a zero-memory source, H((\bar S)^n) = n H(\bar S).

Page 70: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 71

Example

Probabilities for the Markov source

(Table/state diagram of probabilities for the example Markov source, repeated; the entries are fractions with denominator 14: 4/14, 1/14, 1/14, 1/14; 1/14, 1/14, 1/14, 4/14; 5/14, 5/14, 1/14, 1/14; 1/14, 1/14, 5/14, 5/14.)

Page 71: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 72

Examples

H(S) = 0.81 bit
H(\bar S) = 1.00 bit
H(S^2) = 2 H(S) = 1.62 bits

H(\overline{S^2}) = \sum_{S^2} P(s_i, s_j) \log \frac{1}{P(s_i, s_j)} = 1.86 bits
H(\overline{S^3}) = 2.66 bits, \qquad H(\overline{S^4}) = 3.47 bits

Note how the sequence H(\overline{S^n})/n approaches H(S):

H(\bar S) = 1.00 bit, \quad \frac{H(\overline{S^2})}{2} = 0.93 bit, \quad \frac{H(\overline{S^3})}{3} = 0.89 bit, \quad \frac{H(\overline{S^4})}{4} = 0.87 bit.

Page 72: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 73

Example: English

27 symbols: 26 letters + space.

With equiprobable symbols: H = \log 27 = 4.75 bits/symbol.

Using the actual letter probabilities (zero-memory source):

H(\bar S) = \sum_{S} P_i \log \frac{1}{P_i} = 4.03 bits/symbol.

Page 73: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 74

Example: English

1st-order Markov source:

H(S) = \sum_{S^2} P(i, j) \log \frac{1}{P(j|i)} = 3.32 bits/symbol

2nd-order Markov source

Page 74: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 75

Example: English

Word-based zero-memory source

Word-based 1st order Markov source

Page 75: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 76

Estimation of parameters of Markov source

Estimation of P(si/sj) from samples emitted from

the information source.

(Figures: a regular 1st-order Markov source and a non-regular 1st-order Markov source.)

Page 76: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 77

Estimation of parameters of Markov source

For a Markov source, the state transition sequence associated with the emitted output symbols is uniquely determined. We maximize the following probability P, the joint probability of the N observed samples:

P = W_0 \, P_A(a)^{c_1} \, P_A(b)^{c_2} \, P_B(a)^{c_3} \, P_B(b)^{c_4} \, W_F,

where W_0 and W_F are the initial and final state probabilities, respectively, P_A(a) = P(a|A) is the conditional probability of the corresponding state transition, and c_1, ..., c_4 are the observed transition counts.

Now we find the conditional probabilities which maximize \log P under the constraints

\sum_i c_i = N, \quad P_A(a) + P_A(b) = 1, \quad P_B(a) + P_B(b) = 1

by the Lagrangian method. The optimal conditional probabilities are given by

P_A(a) = \frac{c_1}{c_1 + c_2}, \quad P_A(b) = \frac{c_2}{c_1 + c_2}, \quad P_B(a) = \frac{c_3}{c_3 + c_4}, \quad P_B(b) = \frac{c_4}{c_3 + c_4}.

Page 77: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 78

Estimation of parameters of Markov source

The Lagrangian method: our aim is to maximize P under the constraints P_A(a) + P_A(b) = 1 and P_B(a) + P_B(b) = 1. For simplicity we maximize Q = \log P instead:

Q = \log W_0 + c_1 \log P_A(a) + c_2 \log P_A(b) + c_3 \log P_B(a) + c_4 \log P_B(b) + \log W_F
    + \lambda_1 \big(1 - P_A(a) - P_A(b)\big) + \lambda_2 \big(1 - P_B(a) - P_B(b)\big).

Taking the derivative with respect to each parameter and setting it to zero,

\frac{\partial Q}{\partial P_A(a)} = \frac{c_1}{P_A(a)} - \lambda_1 = 0, \quad \frac{\partial Q}{\partial P_A(b)} = \frac{c_2}{P_A(b)} - \lambda_1 = 0,
\frac{\partial Q}{\partial P_B(a)} = \frac{c_3}{P_B(a)} - \lambda_2 = 0, \quad \frac{\partial Q}{\partial P_B(b)} = \frac{c_4}{P_B(b)} - \lambda_2 = 0.

Solving these together with the constraints, we finally obtain

P_A(a) = \frac{c_1}{c_1 + c_2}, \quad P_A(b) = \frac{c_2}{c_1 + c_2}, \quad P_B(a) = \frac{c_3}{c_3 + c_4}, \quad P_B(b) = \frac{c_4}{c_3 + c_4}.

Page 78: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 79

Estimation of parameters of Markov source

These are nothing but the relative frequencies of the symbol sequences observed along the state sequence. Now let N_A be the frequency (count) of state A and N_A(b) the count of the symbol b produced at state A. Then P(b|A) can be calculated by

P(b|A) = \frac{N_A(b)}{N_A}, \qquad P(a|A) = \frac{N_A(a)}{N_A}.

Let P(A, a) be the joint probability of symbol a being produced at state A, and P(A) the probability of state A. Then

P_A(a) = P(a|A) = \frac{P(A, a)}{P(A)}.

Page 79: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 80

State transition matrix

Definition: matrix representation of the conditional probabilities.

Let P(a) and P(b) be the state transition matrices restricted to the emission of symbol a and symbol b, respectively:

P(a) = \begin{pmatrix} p(a|a) & 0 \\ p(a|b) & 0 \end{pmatrix}, \qquad P(b) = \begin{pmatrix} 0 & p(b|a) \\ 0 & p(b|b) \end{pmatrix},

so that their sum is the full state transition matrix

P = P(a) + P(b) = \begin{pmatrix} p(a|a) & p(b|a) \\ p(a|b) & p(b|b) \end{pmatrix}.

Let A be the initial state, W_0 = [1, 0] the initial state probability vector, and W_F = [1, 1] the final state weighting. Now we can calculate the probability of an observed symbol sequence of arbitrary length:

M = W_0 \, P(S_1) \, P(S_2) \cdots P(S_m) \, W_F^{\,t}, \qquad S_i \in \{a, b\}.

Page 80: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 81

State transition matrix

Limit distribution: let W_0 be the initial state probability vector, with an initial probability \pi_i^{(0)} at state i at time n = 0; let P be the state transition matrix, and let

W_n = [\pi_1^{(n)}, \pi_2^{(n)}, ..., \pi_k^{(n)}]

be the state probability vector at time n (\pi_j^{(n)} is the probability of being in state j at time n). The limit distribution is given by

\lim_{n \to \infty} W_n = \lim_{n \to \infty} W_0 P^n.

Page 81: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 82

Regular Markov source

Definition:

P^n converges to a unique matrix as n becomes large.

Each row of P^n converges to a unique state probability vector \bar W, where each element is positive.

The steady-state distribution exists uniquely and is equal to \bar W.

The steady-state distribution Z = (z_1, z_2, ..., z_k) satisfies

Z P = Z, \qquad \sum_{i=1}^{k} z_i = 1.

Example:

P = \begin{pmatrix} 0.7 & 0.3 \\ 0.2 & 0.8 \end{pmatrix}

Z_1 = 0.7 Z_1 + 0.2 Z_2, \quad Z_2 = 0.3 Z_1 + 0.8 Z_2, \quad Z_1 + Z_2 = 1.

The steady-state vector is Z_1 = 0.4, Z_2 = 0.6.
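The steady-state vector of the example can also be found by power iteration, as in this added sketch (using numpy, which is an assumption of the sketch, not of the slides):

```python
import numpy as np

# Row-stochastic transition matrix from the example above
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])

# Power iteration: W_n = W_0 P^n converges to the steady-state distribution
W = np.array([1.0, 0.0])
for _ in range(100):
    W = W @ P

print(W)                                  # ~[0.4 0.6]
print(np.linalg.matrix_power(P, 50)[0])   # each row of P^n also tends to [0.4 0.6]
```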

Page 82: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 83

Example

P(a|A) = P(a|a) = 1/4, \quad P(b|A) = P(b|a) = 3/4, \quad P(a|B) = P(a|b) = 1/2, \quad P(b|B) = P(b|b) = 1/2.

P = \begin{pmatrix} P(a|A) & P(b|A) \\ P(a|B) & P(b|B) \end{pmatrix} = \begin{pmatrix} 1/4 & 3/4 \\ 1/2 & 1/2 \end{pmatrix}

Solving (Z_a, Z_b) P = (Z_a, Z_b) with Z_a + Z_b = 1:

Z_a = \frac{1}{4} Z_a + \frac{1}{2} Z_b, \qquad Z_b = \frac{3}{4} Z_a + \frac{1}{2} Z_b.

Now we have Z_a = P(A) = \frac{2}{5} and Z_b = P(B) = \frac{3}{5}.

The entropy is given by

H(S) = \sum_{j} P(S_j) \sum_{i} P(S_i|S_j) \log \frac{1}{P(S_i|S_j)}
     = P(A)\Big[P(a|A)\log\frac{1}{P(a|A)} + P(b|A)\log\frac{1}{P(b|A)}\Big] + P(B)\Big[P(a|B)\log\frac{1}{P(a|B)} + P(b|B)\log\frac{1}{P(b|B)}\Big]
     = 0.92 \text{ bit}.

Page 83: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 84

Example

The entropy of the adjoint source \bar S (the zero-memory source with the steady-state symbol probabilities) is

H(\bar S) = \frac{2}{5} \log \frac{1}{2/5} + \frac{3}{5} \log \frac{1}{3/5} = 0.97 \text{ bit}.

Page 84: Spoken Language Translation Information Theory: Source Coding

Goal of 2nd day

2012/10/24 Prof. Satoshi Nakamura 85

Page 85: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 86

Hidden Markov information source

An information source with k symbols can be represented by an n-th order Markov source with k^n states.

If we merge states which have similar behavior, we obtain a non-deterministic automaton. This is called a hidden Markov source model.

The hidden Markov source model does not have a unique state sequence for an observed symbol sequence.

Page 86: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 87

Hidden Markov source

Definition: a non-deterministic probabilistic automaton, or hidden Markov source model.

The state sequence cannot be uniquely recovered from the observed symbol sequence.

If we let the initial state be q_1 and the final state be q_3, the symbol sequence abab can be produced by the following state sequences:

Q_1 = q_1 q_1 q_1 q_2 q_3, \quad Q_2 = q_1 q_1 q_2 q_2 q_3, \quad Q_3 = q_1 q_1 q_2 q_3 q_3,
Q_4 = q_1 q_2 q_2 q_2 q_3, \quad Q_5 = q_1 q_2 q_2 q_3 q_3, \quad Q_6 = q_1 q_2 q_3 q_3 q_3.

Page 87: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 88

Hidden Markov source

P(Q_1) can be calculated by multiplying the transition and emission probabilities along the path:

P(Q_1) = 0.3 \times 0.7 \times 0.3 \times 0.3 \times 0.7 \times 0.8 \times 0.5 \times 0.6 = 0.0031752.

Summing over all six state sequences gives the total probability P = \sum_i P(Q_i).

[Forward calculation]: Let the observed symbol sequence for the source be X = x_1, x_2, ..., x_I. We estimate the probability P(X|M) assuming a hidden Markov information source M, with initial states q_i \in I and final states q_k \in F, where I and F are the initial and final state sets.

Page 88: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 89

Probability of observed symbol sequence

The probability of the observed symbol sequence X for the model M is given by

P(X|M) = \sum_{Q^{(k)}} P(X, Q^{(k)} | M) = \sum_{Q^{(k)}} P(X | Q^{(k)}) P(Q^{(k)}).

Now we apply the 1st-order Markov assumption:

P(Q^{(k)}) = \prod_{i=1}^{I} P(q_i^{(k)} | q_{i-1}^{(k)}) \qquad \text{(state transition probabilities)}

P(X | Q^{(k)}) = \prod_{i=1}^{I} P(x_i | q_{i-1}^{(k)}, q_i^{(k)}) \qquad \text{(emission probabilities)}

Now,

P(X|M) = \sum_{Q^{(k)}} \prod_{i=1}^{I} P(q_i^{(k)} | q_{i-1}^{(k)}) \, P(x_i | q_{i-1}^{(k)}, q_i^{(k)}) = \sum_{Q^{(k)}} \prod_{i=1}^{I} a_{q_{i-1} q_i} \, b_{q_{i-1} q_i}(x_i).

Page 89: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 90

Hidden Markov source

Now let \pi_i (q_i \in I) be the initial state probabilities. The probability of the observed symbol sequence given the model is

P(X|M) = \sum_{Q^{(k)}} \prod_{i=1}^{I} a_{q_{i-1} q_i} \, b_{q_{i-1} q_i}(x_i), \qquad \pi_i^{(0)} = \pi_i \text{ for all } q_i \in I.

Defining the forward probabilities

\alpha(i, 0) = \pi_i \quad (i = 1, 2, ..., S),
\alpha(i, t) = \sum_{j} \alpha(j, t-1) \, a_{ji} \, b_{ji}(x_t),

we get

P(X|M) = \sum_{i \in F} \alpha(i, I).

Page 90: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 91

Probability of observed symbol sequence

If we apply a forward probability ,

and if we apply a backward probability ,

)()1,(),( ijiji

j

xbatiti

Fi

IiMXP ),()|(

j

tijij tjxbati )1,()(),(

i

i

iMXP )0,()|(

Page 91: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 92

Trellis calculation

Three paths: abc + dec + dfg (each letter denotes a branch contribution; node indices are (state, time)).

Node (0,1): N1 = a
Node (1,1): N2 = d
Node (1,2): N3 = ab + de
Node (2,2): N4 = df
Node (2,3): N5 = (ab + de)c + dfg = abc + dec + dfg

Page 92: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 93

Parameter estimation of HMM source

The state transition sequence cannot be determined uniquely in the HMM when the symbol sequence is observed. Once the number of transitions between states is obtained, the state transition probabilities and emission probabilities can be estimated easily.

EM (Expectation-Maximization) algorithm: an iterative algorithm for parameter estimation.

Expectation step: find the state sequence distribution for the observed sequence based on the assumed HMM parameters.

Maximization step: estimate the HMM parameters along the state sequences so as to maximize the probability of the observed symbol sequence.

Here the HMM parameters include the state transition probabilities and the emission probabilities.

Page 93: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 94

EM algorithm

Leonard Baum proved the following important inequality:

\hat P(X) \ge P(X), \quad \text{with equality if, and only if, } \hat\theta = \theta, \tag{1}

where \theta is the assumed HMM parameter set and \hat\theta is the HMM parameter set estimated by the EM algorithm.

Let A = \{a_i\} be a state sequence for the observed symbol sequence. We rewrite the objective function as

\hat P(X) = \frac{\hat P(A, X)}{\hat P(A|X)}, \tag{2}

and, by taking logarithms,

\log \hat P(X) = \log \hat P(A, X) - \log \hat P(A|X). \tag{3}

Now we take the expectation E_{A|X}[\cdot] over state sequences estimated with the current parameters:

E_{A|X}[\log \hat P(X)] = \sum_i P(a_i|X) \log \hat P(X) = \log \hat P(X). \tag{4}

Page 94: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 95

EM algorithm

Substituting (3) into (4),

\log \hat P(X) = E_{A|X}[\log \hat P(A, X)] - E_{A|X}[\log \hat P(A|X)]
             = \sum_i P(a_i|X) \log \hat P(a_i, X) - \sum_i P(a_i|X) \log \hat P(a_i|X). \tag{5}

Now we recall Jensen's inequality: for probability density functions f(x) and g(x),

\int_R f(x) \log f(x)\, dx \ge \int_R f(x) \log g(x)\, dx,

with equality if, and only if, f(x) = g(x). Applying it to the second term on the right-hand side,

\sum_i P(a_i|X) \log P(a_i|X) \ge \sum_i P(a_i|X) \log \hat P(a_i|X), \tag{6}

with equality if, and only if, \hat P(a_i|X) = P(a_i|X), that is, \hat\theta = \theta.

Page 95: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 96

EM algorithm

Now we have

\log \hat P(X) = \sum_i P(a_i|X) \log \hat P(a_i, X) - \sum_i P(a_i|X) \log \hat P(a_i|X).

If we choose \hat\theta so that the first term on the right-hand side satisfies

\sum_i P(a_i|X) \log \hat P(a_i, X) \ge \sum_i P(a_i|X) \log P(a_i, X), \tag{7}

namely

E_{A|X}[\log \hat P(A, X)] \ge E_{A|X}[\log P(A, X)], \tag{8}

then equation (5), together with (6), gives

\log \hat P(X) \ge \sum_i P(a_i|X) \log P(a_i, X) - \sum_i P(a_i|X) \log P(a_i|X) = \log P(X). \tag{9}

Page 96: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 97

EM algorithm

In summary,

\log \hat P(X) = \sum_i P(a_i|X) \log \hat P(a_i, X) - \sum_i P(a_i|X) \log \hat P(a_i|X)
             \ge \sum_i P(a_i|X) \log P(a_i, X) - \sum_i P(a_i|X) \log \hat P(a_i|X)
             \ge \sum_i P(a_i|X) \log P(a_i, X) - \sum_i P(a_i|X) \log P(a_i|X)
             = \log P(X).

Thus, if equation (7) holds, we obtain parameters which satisfy \log \hat P(X) \ge \log P(X).

Page 97: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 98

Parameter estimation by EM algorithm

As shown on the previous slides, parameter estimation can be achieved by maximizing E_{A|X}[\log \hat P(A, X)]:

E = E_{A|X}[\log \hat P(A, X)] = \sum_k P(a_k|X) \log \hat P(a_k, X) = \sum_k \frac{P(a_k, X)}{P(X)} \log \hat P(a_k, X),

where P(a_k|X) = P(a_k, X)/P(X) can be calculated using the current parameter set \theta.

The numerator P(a_k, X) is the joint probability of observing X together with the state sequence a_k; the denominator P(X) is the probability of observing X under the HMM.

Page 98: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 99

Parameter estimation by EM algorithm

Counting state transitions along the state sequence a_k, we can write

\hat P(a_k, X) = \pi^{(k)} \prod_{i,j} \hat a_{ij}^{\,c_{ij}^{(k)}} \; \hat b_{ij}(x_t)^{\,d_{ij}^{(k)}},

where c_{ij}^{(k)} and d_{ij}^{(k)} are the counts of the state transition a_{ij} and of the emission b_{ij}(x_t) along a_k, and \pi^{(k)} is the initial-state probability of a_k.

Then E can be rewritten as

E = \sum_k \frac{P(a_k, X)}{P(X)} \Big[ \log \pi^{(k)} + \sum_{i,j} c_{ij}^{(k)} \log \hat a_{ij} + \sum_{i,j} d_{ij}^{(k)} \log \hat b_{ij}(x_t) \Big].

If we let

c'_{ij} = \sum_k \frac{P(a_k, X)}{P(X)} \, c_{ij}^{(k)}, \qquad d'_{ij} = \sum_k \frac{P(a_k, X)}{P(X)} \, d_{ij}^{(k)},

we have E in the form

E = \text{const} + \sum_{i,j} c'_{ij} \log \hat a_{ij} + \sum_{i,j} d'_{ij} \log \hat b_{ij}(x_t).

Page 99: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 100

Parameter estimation by EM algorithm

This is nothing but the likelihood function of a Markov source. Thus we can obtain the parameters by maximizing E, with \partial E / \partial \hat a_{ij} = 0 under the normalization constraints.

For a_{ij}, c'_{ij} = \sum_k \frac{P(a_k, X)}{P(X)} c_{ij}^{(k)} can be thought of as the relative (expected) count of the transition from state i to state j. Thereby, we have

\hat a_{ij} = \frac{c'_{ij}}{\sum_j c'_{ij}}.

If we use

\gamma(i, j, t) = \frac{\alpha(i, t-1) \, a_{ij} \, b_{ij}(x_t) \, \beta(j, t)}{P(X)},

we have

\hat a_{ij} = \frac{\sum_t \gamma(i, j, t)}{\sum_j \sum_t \gamma(i, j, t)}.

Page 100: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 101

Parameter estimation of HMM source

First we define the backward probability \beta(i, t), which is the probability of emitting x_{t+1}, x_{t+2}, ..., x_I starting from state q_i at time t. This probability can be calculated efficiently backwards from the final symbol.

Initialization (for q = 1, ..., Q_n):

\beta(q, I) = 1.0 \text{ if } q \in F, \qquad \beta(q, I) = 0.0 \text{ otherwise}.

Iteration of the backward path, for t = I-1, ..., 1, 0 and q = 1, ..., Q_n:

\beta(q, t) = \sum_{j} a_{qj} \, b_{qj}(x_{t+1}) \, \beta(j, t+1).

The following relationship holds:

\sum_{q_i \in F} \alpha(q_i, I) = \sum_{q_i \in I} \pi_i \, \beta(q_i, 0) = P(X|M).

Page 101: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 102

Parameter estimation of HMM source

Let \gamma(i, j, t) be the probability of producing x_t during a transition from state q_i to q_j. \gamma(i, j, t) can be calculated using the forward probability \alpha(i, t-1) and the backward probability \beta(j, t):

\gamma(i, j, t) = \frac{\alpha(i, t-1) \, a_{ij} \, b_{ij}(x_t) \, \beta(j, t)}{P(x|M)}.

Here \gamma(i, j, t) represents the probability (relative transition count) of producing x_t during a transition from state q_i to state q_j, assuming the HMM \{a_{ij}, b_{ij}(x_t), \pi_i\}.

Page 102: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 103

Parameter estimation of HMM source

Now we have the following re-estimation formulae:

\hat a_{ij} = \frac{\sum_t \gamma(i, j, t)}{\sum_j \sum_t \gamma(i, j, t)} = \frac{\sum_t \alpha(i, t-1) \, a_{ij} \, b_{ij}(x_t) \, \beta(j, t)}{\sum_t \alpha(i, t) \, \beta(i, t)}

\hat b_{ij}(k) = \frac{\sum_{t: x_t = k} \gamma(i, j, t)}{\sum_t \gamma(i, j, t)} = \frac{\sum_{t: x_t = k} \alpha(i, t-1) \, a_{ij} \, b_{ij}(x_t) \, \beta(j, t)}{\sum_t \alpha(i, t-1) \, a_{ij} \, b_{ij}(x_t) \, \beta(j, t)}

Page 103: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 104

Parameter estimation of HMM

The calculations above are iterated until convergence. Parameter estimation is also applied not to a single observation but to many observed symbol sequences:

\hat a_{ij} = \frac{\sum_{n=1}^{N} \sum_t \gamma^{(n)}(i, j, t)}{\sum_{n=1}^{N} \sum_j \sum_t \gamma^{(n)}(i, j, t)}.

\gamma(i, j, t) represents the probability that the information source produces the symbol x_t during a state transition from state q_i to q_j, given that the symbol sequence x is observed, regardless of the state sequence.

Apart from this, the calculation is essentially the same as for the Markov source model.

Page 104: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 105

Entropy for HMM source

The entropy per symbol at a state q_j is given by

H(X|q_j) = -\sum_x p(x|q_j) \log p(x|q_j), \qquad p(x|q_j) = \sum_k a_{jk} \, b_{jk}(x).

We obtain the entropy of the HMM by taking the expectation over all states:

H(X) = \sum_j \pi(q_j) \, H(X|q_j),

where \pi(q_j) are the steady-state probabilities of the HMM states.

Page 105: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 106

An example

Estimate HMM parameters based on observed symbol sequence “ba”.

Step 1:

426.0

0189.01.03.09.07.0

3969.09.07.09.07.0

0021.01.07.01.03.0

0081.09.03.01.03.0

),(.

sum

ABB

ABA

AAB

AAA

XaPabseqState i

   
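Step 1 can be checked by brute-force enumeration of the state paths (an added sketch). The transition and emission probabilities below are the ones implied by the products in the table (a_AA = 0.3, a_AB = 0.7, a_BA = 0.7, a_BB = 0.3; b_AA(a) = b_BA(a) = 0.9, b_AB(a) = b_BB(a) = 0.1, and complementary values for "b"):

```python
from itertools import product

a = {"A": {"A": 0.3, "B": 0.7}, "B": {"A": 0.7, "B": 0.3}}
b = {"A": {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.1, "b": 0.9}},
     "B": {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.1, "b": 0.9}}}

X = "ba"
total = 0.0
for path in product("AB", repeat=len(X)):       # state after each symbol; start in A
    states = ("A",) + path
    p = 1.0
    for t, x in enumerate(X):
        i, j = states[t], states[t + 1]
        p *= a[i][j] * b[i][j][x]
    print("".join(states), p)                    # AAA 0.0081, AAB 0.0021, ABA 0.3969, ABB 0.0189
    total += p
print("P(X) =", total)                           # 0.426, as in Step 1
```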

Page 106: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 107

An example

Step 2: symbol emission probabilities per state, and the entropy of the assumed HMM.

P(a|A) = 0.3 x 0.9 + 0.7 x 0.1 = 0.34
P(b|A) = 0.3 x 0.1 + 0.7 x 0.9 = 0.66
P(a|B) = 0.7 x 0.9 + 0.3 x 0.1 = 0.66
P(b|B) = 0.7 x 0.1 + 0.3 x 0.9 = 0.34

H(X|A) = -0.34 \log 0.34 - 0.66 \log 0.66 = 0.9264
H(X|B) = -0.66 \log 0.66 - 0.34 \log 0.34 = 0.9264
(\log 0.34 = -1.56, \log 0.66 = -0.60)

With P(A) = P(B) = 1/2, the entropy of the HMM is

H(X) = H(X|A) P(A) + H(X|B) P(B) = 0.9264.

Page 107: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 108

An example

Step 3: parameter estimation of the HMM.

\hat a_{AA} = \frac{\text{expected number of A-to-A transitions}}{\text{expected number of transitions out of A}}
           = \frac{(2 \times 0.0081 + 0.0021)/0.426}{(2 \times 0.0081 + 2 \times 0.0021 + 0.3969 + 0.0189)/0.426} = \frac{0.0183}{0.4362} = 0.042

\hat b_{AA}(a) = \frac{\text{prob. of state sequences with an A-to-A transition producing symbol "a"}}{\text{prob. of state sequences with an A-to-A transition}}
             = \frac{0.0081}{2 \times 0.0081 + 0.0021} = 0.4426

Page 108: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 109

An example

There is another way of estimation, using the forward probabilities \alpha and the backward probabilities \beta on the trellis.

Forward probabilities (per time and state):

t = 0:             \alpha_A = 1.0, \alpha_B = 0.0
t = 1 (symbol b):  \alpha_A = 1.0 x 0.3 x 0.1 = 0.03,  \alpha_B = 1.0 x 0.7 x 0.9 = 0.63
t = 2 (symbol a):  \alpha_A = 0.03 x 0.3 x 0.9 + 0.63 x 0.7 x 0.9 = 0.405,  \alpha_B = 0.03 x 0.7 x 0.1 + 0.63 x 0.3 x 0.1 = 0.021

Backward probabilities:

t = 2:             \beta_A = 1.0, \beta_B = 1.0
t = 1:             \beta_A = 0.3 x 0.9 + 0.7 x 0.1 = 0.34,  \beta_B = 0.7 x 0.9 + 0.3 x 0.1 = 0.66
t = 0:             \beta_A = 0.3 x 0.1 x 0.34 + 0.7 x 0.9 x 0.66 = 0.426

Note that \alpha_A(2) + \alpha_B(2) = 0.405 + 0.021 = 0.426 = \beta_A(0) = P(X|M).

Page 109: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 110

An example

Using the trellis quantities,

\hat a_{AA} = \frac{\alpha_A(0) \, a_{AA} b_{AA}(x_1) \, \beta_A(1) + \alpha_A(1) \, a_{AA} b_{AA}(x_2) \, \beta_A(2)}{\alpha_A(0) \, \beta_A(0) + \alpha_A(1) \, \beta_A(1)} = 0.0420,

\hat b_{AA}(a) = \frac{\alpha_A(1) \, a_{AA} b_{AA}(a) \, \beta_A(2)}{\alpha_A(0) \, a_{AA} b_{AA}(x_1) \, \beta_A(1) + \alpha_A(1) \, a_{AA} b_{AA}(x_2) \, \beta_A(2)} = 0.4426.

Page 110: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 111

An example

\hat P(a|A) = 0.0420 x 0.4426 + 0.9580 x 0.0050 = 0.0234   (18)
\hat P(b|A) = 0.0420 x 0.5574 + 0.9580 x 0.9950 = 0.9766   (19)
\hat P(a|B) = 0.0455 x 1.0 + 0.9545 x 1.0 = 1.0             (20)
\hat P(b|B) = 0.0                                           (21)

Page 111: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 112

An example

H(X|A) = -0.0234 \log 0.0234 - 0.9766 \log 0.9766 = 0.1601   (22)(23)
H(X|B) = 0                                                    (24)

The steady-state distribution satisfies Z P = Z, \sum_i Z_i = 1, with

P = \begin{pmatrix} 0.0420 & 0.9580 \\ 0.9545 & 0.0455 \end{pmatrix},

giving Z_A = 0.499, Z_B = 0.501.

H(X) = H(X|A) P(A) + H(X|B) P(B) = 0.1601 x 0.499 = 0.0799   (25)(26)

Page 112: Spoken Language Translation Information Theory: Source Coding

First goal of 3rd day

2012/10/24 Prof. Satoshi Nakamura 113

Page 113: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 114

Some properties of codes

Definition: Let the set of symbols comprising a given alphabet be called

S={s1,s2,…,sq}. Then we define a code as a mapping of all possible sequences of

symbols of S into sequences of symbols of some other alphabet X={x1,x2,…,xr}.

We call S the source alphabet and X the code alphabet.

(Diagram: the information source emits symbols from the source alphabet S (q symbols); the encoder maps each sequence of source symbols into a code word, i.e. a sequence over the code alphabet X (r symbols), for example S_i -> X_i = (x_{i1}, x_{i2}, ..., x_{ij}).)

Page 114: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 115

Classification of coding

Classification of codes:

Non-block codes vs. block codes
Block codes: singular vs. nonsingular
Nonsingular codes: uniquely undecodable vs. uniquely decodable
Uniquely decodable codes: noninstantaneous vs. instantaneous

Page 115: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 116

Block code

Definition: A block code is a code which maps each of the symbols

of the source alphabet S into a fixed sequence of symbols of the

code alphabet X. These fixed sequences of the code alphabet

(sequences of xj) are called code words. We denote the code word

corresponding to the source symbol si by Xi. Note that Xi denotes a

sequence of xj’s.

Source symbols code

S1 0

S2 11

S3 00

S4 01

Page 116: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 117

Nonsingular block code

Definition: A block code is said to be nonsingular if all the words of

the code are distinct.

It is still possible for a given sequence of code symbols to have an

ambiguous origin. For example, the sequence 0011 might represent

either s3s2 or s1s1s2.

Source symbols code

S1 0

S2 11

S3 00

S4 01

Page 117: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 118

Extension of block code

Definition: The nth extension of a block code which maps the

symbols si into the code words Xi is the block code which maps the

sequences of source symbols (si1, si2, …, sin) into the sequences of

code words (Xi1,Xi2,…,Xin).

Source symbols code Source symbols code

S1S1 00 S3S1 000

S1S2 011 S3S2 0011

S1S3 000 S3S3 0000

S1S4 001 S3S4 0001

S2S1 110 S4S1 010

S2S2 1111 S4S2 0111

S2S3 1100 S4S3 0100

S2S4 1101 S4S4 0101
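For illustration (an added sketch, not from the slides), the n-th extension of a block code can be generated by concatenating code words over all length-n source sequences; applied to the code above it reproduces the table:

```python
from itertools import product

# Block code from the preceding slides
code = {"s1": "0", "s2": "11", "s3": "00", "s4": "01"}

def extension(code, n):
    """n-th extension: concatenate code words for every length-n source sequence."""
    return {"".join(seq): "".join(code[s] for s in seq)
            for seq in product(code, repeat=n)}

for src, word in extension(code, 2).items():
    print(src, word)   # s1s1 -> 00, s1s2 -> 011, ..., s4s4 -> 0101
```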

Page 118: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 119

Uniquely decodable code

Definition: A block code is said to be uniquely decodable if, and only if, the n-th extension of the code is nonsingular for every finite n.

Any two sequences of source symbols of the same length are encoded into distinct sequences of code symbols if the code is uniquely decodable.

Two sequences of different lengths must also be encoded distinctly if the code is uniquely decodable: suppose we have two different source symbol sequences S1 and S2, possibly of different lengths, which lead to the same sequence of code symbols X0. Now form two new source symbol sequences S1' = S2S1 and S2' = S1S2. Both S1' and S2' have the same length and are encoded as X0 followed by X0, so some finite extension of the code is singular; thus the code does not satisfy the condition of unique decodability.

Page 119: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 120

Instantaneous code

Code A: This code is uniquely decodable, since all code words have the same length and are distinct.

Code B: This code is also uniquely decodable, since it is nonsingular. It is called a "comma code": the symbol 0 acts as a separator (comma) between code words.

Code C: This code is also uniquely decodable. However, we cannot decode the sequence word by word as it is received; we can decode a word only after receiving the 0 of the next code word.

Source symbol Code A Code B Code C

S1 00 0 0

S2 01 10 01

S3 10 110 011

S4 11 1110 0111

Page 120: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 121

Instantaneous code

Definition: A uniquely decodable code is said to be instantaneous if it is possible to decode each word in a sequence without reference to succeeding code symbols.

Code A and code B are instantaneous. However, code C is not instantaneous. A more general method for deciding whether a code is instantaneous would be helpful.

Definition: Let X_i = x_{i1} x_{i2} ... x_{im} be a word of some code. The sequence of code symbols (x_{i1} x_{i2} ... x_{ij}), where j <= m, is called a prefix of the code word X_i.

Example: 0, 01, 011, 0111 are prefixes of 0111.
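The prefix condition is easy to test mechanically; here is an added Python sketch applied to codes A, B, and C from the previous slide:

```python
def is_instantaneous(words):
    """True if no code word is a prefix of another (prefix condition)."""
    words = list(words)
    return not any(i != j and w2.startswith(w1)
                   for i, w1 in enumerate(words)
                   for j, w2 in enumerate(words))

print(is_instantaneous(["00", "01", "10", "11"]))     # Code A: True
print(is_instantaneous(["0", "10", "110", "1110"]))   # Code B (comma code): True
print(is_instantaneous(["0", "01", "011", "0111"]))   # Code C: False (0 is a prefix of 01)
```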

Page 121: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 122

Instantaneous code

A necessary and sufficient condition for a code to be instantaneous is that no complete word of the code be a prefix of some other code word.

Sufficient part:

If no word is the prefix of some other word, we may decode any

received sequence of code symbols comprised of code words in a

direct manner.

We scan the received sequence of code symbols until we come to a

subsequence which comprises a complete code word.

The subsequence must be this code word since by assumption it is

not the prefix of any other code word.

Page 122: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 123

Instantaneous code

Necessary part:

We assume that there exists some word of our code, say Xi, which is

also a prefix of some other word Xj.

Now, if we scan a received sequence of code symbols and come upon

the subsequence Xi, this subsequence may be a complete word, or it

may be just the first part of word Xj.

We cannot possibly tell which of these alternatives is true, however,

until we examine more code symbols of the main sequence; thus the

code is not instantaneous.

[Figure: classification of codes. Codes split into non-block codes and block codes; block codes into singular and nonsingular codes; nonsingular codes into uniquely decodable and not uniquely decodable codes; uniquely decodable codes into instantaneous and noninstantaneous codes.]

Page 123: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 124

Construction of an Instantaneous code

Example code synthesis (five source symbols s1, ..., s5):

Assign 0 to symbol s1: s1 -> 0.

If we were to assign 1 to symbol s2, this would leave us with no unused prefixes, so we might set s2 -> 10.

This, in turn, requires us to start the remaining code words with 11. If s3 -> 110, then the only three-binit prefix still unused is 111, and we might set s4 -> 1110 and s5 -> 1111.

Other alternatives: if we synthesize another binary instantaneous code and set s1 -> 00, we still have prefixes of length 2 unused, and we may set

    s2 -> 01,  s3 -> 10,  s4 -> 110,  s5 -> 111.

Page 124: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 125

Kraft inequality

Constraints on the size of words of an instantaneous code.

Consider an instantaneous code with source alphabet S = {s1, ..., sq} and code alphabet X = {x1, x2, ..., xr}. Let the code words be X1, X2, ..., Xq and define the length (number of code symbols) of word Xi as li. It is often desirable that the lengths of the code words of our code be as small as possible. Necessary and sufficient conditions for the existence of an instantaneous code with word lengths l1, l2, ..., lq are provided by the Kraft inequality.

Kraft inequality: A necessary and sufficient condition for the existence of an instantaneous code with word lengths l1, l2, ..., lq is that

    \sum_{i=1}^{q} r^{-l_i} \le 1,

where r is the number of different symbols in the code alphabet.

Page 125: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 126

Kraft inequality

For the binary case (r = 2), the Kraft inequality tells us that the lengths li must satisfy

    \sum_{i=1}^{q} 2^{-l_i} \le 1.

Source symbols Code A Code B Code C Code D Code E

S1 00 0 0 0 0

S2 01 100 10 100 10

S3 10 110 110 110 110

S4 11 111 111 11 11

Page 126: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 127

Kraft inequality

Code A:  \sum_{i=1}^{4} 2^{-l_i} = 4 \cdot 2^{-2} = 1.

The Kraft inequality does not tell us that code A is an instantaneous code. The inequality is merely a condition on the word lengths of the code and not on the words themselves.

Code B:  \sum_{i=1}^{4} 2^{-l_i} = 2^{-1} + 3 \cdot 2^{-3} = 7/8 \le 1.

Code C:  \sum_{i=1}^{4} 2^{-l_i} = 2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} = 1.

Code D:  \sum_{i=1}^{4} 2^{-l_i} = 2^{-1} + 2^{-3} + 2^{-3} + 2^{-2} = 1, yet Code D is not an instantaneous code (the word 11 is a prefix of 110).

Code E:  \sum_{i=1}^{4} 2^{-l_i} = 2^{-1} + 2^{-2} + 2^{-3} + 2^{-2} = 9/8 > 1, so Code E is not an instantaneous code.
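
The Kraft sums above are easy to verify mechanically. A minimal sketch (not from the lecture) in Python, using the code words of the table:

    # Sketch (not from the lecture): evaluating the Kraft sum for the binary
    # codes A-E of the table above.  The inequality constrains only lengths.

    from fractions import Fraction

    codes = {
        "A": ["00", "01", "10", "11"],
        "B": ["0", "100", "110", "111"],
        "C": ["0", "10", "110", "111"],
        "D": ["0", "100", "110", "11"],
        "E": ["0", "10", "110", "11"],
    }

    for name, words in codes.items():
        kraft = sum(Fraction(1, 2 ** len(w)) for w in words)
        print(f"Code {name}: Kraft sum = {kraft} ({'<= 1' if kraft <= 1 else '> 1'})")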

Page 127: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 128

One more example

Suppose we wish to encode the outputs of a decimal source, S = {0, 1, 2, ..., 9}, into a binary instantaneous code, and suppose there is some reason for encoding the 0 and 1 symbols of the decimal source into relatively short binary code words. If we were to encode 0s and 1s from the source as

    0 -> 0,   1 -> 10,

and require the remaining eight code words to be of the same length, say l, the Kraft inequality provides a direct answer. The inequality requires

    \sum_{i=0}^{9} 2^{-l_i} \le 1.

By assumption we have l_0 = 1, l_1 = 2, and l_2 = l_3 = ... = l_9 = l. Then

    \frac{1}{2} + \frac{1}{4} + 8 \cdot 2^{-l} \le 1,   or   l \ge 5.

Page 128: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 129

The Kraft inequality - Proof

First we prove that the inequality is sufficient for the existence of

an instantaneous code by actually constructing an instantaneous

code satisfying

    \sum_{i=1}^{q} r^{-l_i} \le 1.    (1)

Let n_i be the number of code words of length i and let L be the largest of the l_i, so that (1) can be written as

    \sum_{i=1}^{L} n_i r^{-i} \le 1.

On multiplying by r^L,

    n_1 r^{L-1} + n_2 r^{L-2} + ... + n_{L-1} r + n_L \le r^L.

Rearranging terms (and using n_L \ge 0),

    n_L \le r^L - n_1 r^{L-1} - n_2 r^{L-2} - ... - n_{L-1} r.

Dividing by r,

    n_{L-1} \le r^{L-1} - n_1 r^{L-2} - ... - n_{L-2} r.

Iterating the operation,

    n_{L-2} \le r^{L-2} - n_1 r^{L-3} - ... - n_{L-3} r,
    ...
    n_2 \le r^2 - n_1 r,
    n_1 \le r.

Page 129: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 130

The Kraft inequality - Proof

Steps:

We assign n_1 words of length 1. There are r possible such words that we may form using an r-symbol code alphabet, and since n_1 \le r we can select these n_1 code words arbitrarily.

We are then left with r - n_1 permissible prefixes of length 1. By adding one symbol to the end of each of these permissible prefixes, we may form as many as

    (r - n_1) r = r^2 - n_1 r

words of length 2.

As before, we choose our n_2 words arbitrarily among our r^2 - n_1 r choices; we are left with

    r^2 - n_1 r - n_2

unused prefixes of length 2, from which we may form

    (r^2 - n_1 r - n_2) r = r^3 - n_1 r^2 - n_2 r

permissible prefixes of length 3.

Page 130: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 131

McMillan’s inequality

Proof of the necessity of the condition for uniquely decodable codes:

Consider the quantity

    ( \sum_{i=1}^{q} r^{-l_i} )^n = \sum r^{-(l_{i1} + l_{i2} + ... + l_{in})};

we have q^n terms, each of the form r^{-k} with k = l_{i1} + l_{i2} + ... + l_{in}. If we let L be the maximum of the word lengths l_i, then n \le k \le nL.

We define N_k as the number of terms of the form r^{-k}; then

    ( \sum_{i=1}^{q} r^{-l_i} )^n = \sum_{k=n}^{nL} N_k r^{-k}.

N_k is also the number of strings of n code words that can be formed so that each string has a length of exactly k code symbols.

If the code is uniquely decodable, N_k must be no greater than r^k, the number of distinct r-ary sequences of length k. Thus we have

    ( \sum_{i=1}^{q} r^{-l_i} )^n \le \sum_{k=n}^{nL} r^k r^{-k} = nL - n + 1 \le nL.    (*)

Bernoulli's inequality: for x > 1 and arbitrarily large n, x^n > nL holds, since an exponential eventually exceeds any linear function of n. Considering this inequality together with (*), we conclude that

    \sum_{i=1}^{q} r^{-l_i} \le 1.

Page 131: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 132

Example

Assume we wish to encode a source with 10 source symbols into a trinary instantaneous code with word lengths 1, 2, 2, 2, 2, 2, 3, 3, 3, 3. Applying the test of the Kraft inequality, we have

    \sum_{i=1}^{10} 3^{-l_i} = \frac{1}{3} + \frac{5}{9} + \frac{4}{27} = \frac{28}{27} > 1.

This does not satisfy the inequality.

Assume we wish to encode symbols from a source with nine symbols into a trinary instantaneous code with lengths 1, 2, 2, 2, 2, 2, 3, 3, 3. Applying the test of the Kraft inequality, we have

    \sum_{i=1}^{9} 3^{-l_i} = \frac{1}{3} + \frac{5}{9} + \frac{3}{27} = 1.

One such code is

    s1 -> 0,    s2 -> 10,   s3 -> 11,
    s4 -> 12,   s5 -> 20,   s6 -> 21,
    s7 -> 220,  s8 -> 221,  s9 -> 222.

Page 132: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 133

Coding information sources

For a given source alphabet and a given code alphabet, however, the fact that we can construct many instantaneous codes forces us to find a criterion by which we may choose among the codes.

Perhaps the most natural criterion for this selection, although by no means the only possibility, is length.

Definition: Let a block code transform the source symbols

s1,s2,…,sq into the code words X1,X2,..,Xq. Let the probabilities of

the source symbols be P1,P2,…,Pq, and let the lengths of the code

words be l1,l2,…,lq. Then we define L, the average length of the

code, by the equation

    L = \sum_{i=1}^{q} P_i l_i.

Page 133: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 134

Coding information source

Average length and Entropy:

Definition: Consider an instantaneously decodable code which

maps the symbols from a source S, s1,s2,…,sq with probabilities

P1,P2,…,Pq into code word composed of symbols from an r-ary

code alphabet. We have the following relationships.

Compact code:

Definition: Consider a uniquely decodable code which maps the

symbols from a source S into code word composed of symbols

from an r-ary code alphabet. This code will be called compact

(for the source S) if its average length is less than or equal to the

average length of all other uniquely decodable codes for the same

source and the same code alphabet.

    H(S) \le L \log r,   i.e.,   H_r(S) = \frac{H(S)}{\log r} \le L.

Page 134: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 135

Compact code

Proof of the relationship:

Consider a zero-memory source S, with symbols s1,s2,…,sq and symbol

probabilities P1,P2,…,Pq, respectively. Let a block code encode these

symbols into a code alphabet of r symbols, and let the length of the

word corresponding to si be li. Then the entropy of this zero-memory

source is

    H(S) = \sum_{i=1}^{q} P_i \log \frac{1}{P_i}.

Let Q_1, Q_2, ..., Q_q be any q numbers such that Q_i \ge 0 for all i and \sum_{i=1}^{q} Q_i = 1. By Jensen's inequality, we know that

    \sum_{i=1}^{q} P_i \log \frac{1}{P_i} \le \sum_{i=1}^{q} P_i \log \frac{1}{Q_i},

with equality if and only if P_i = Q_i for all i. Hence,

    H(S) \le \sum_{i=1}^{q} P_i \log \frac{1}{Q_i}.    (1)

Page 135: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 136

Compact code

Equation (1) is valid for any set of nonnegative numbers Q_i which sum to 1. We may choose

    Q_i = \frac{r^{-l_i}}{\sum_{j=1}^{q} r^{-l_j}}.

Substituting into (1) we obtain

    H(S) \le \sum_{i=1}^{q} P_i \log r^{l_i} + \log ( \sum_{j=1}^{q} r^{-l_j} )
          \le ( \sum_{i=1}^{q} P_i l_i ) \log r = L \log r,    (2)

since \sum_j r^{-l_j} \le 1 by the Kraft inequality, so its logarithm is not positive. Therefore

    \frac{H(S)}{\log r} \le L,   or   H_r(S) \le L.

Page 136: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 137

Compact code

A method of encoding for a special source.

Considering eqns. (1) and (2), a condition for equality in the last inequality is

    \sum_{j=1}^{q} r^{-l_j} = 1.

Then we see that a necessary and sufficient condition for equality is

    P_i = Q_i = \frac{r^{-l_i}}{\sum_{j=1}^{q} r^{-l_j}} = r^{-l_i}   for all i,

or

    l_i = \log_r \frac{1}{P_i}   for all i.    (4-9b)

Page 137: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 138

Compact code

We may say that, for an instantaneous code and a zero-memory

source, L must be greater than or equal to Hr(S). Furthermore, L

can achieve this lower bound if and only if we can choose the word

lengths li equal to logr (1/Pi) for all i. For the equality, therefore,

log r (1/Pi) must be an integer for each i.

In other words, for the equality the symbol probabilities Pi must all

be of the form (1/r)ai, where ai is an integer.

Note that if these conditions are met, we have derived the word

lengths of a compact code. We simply choose li equal to ai.
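
A small sketch (not from the lecture) of this special case, using the example source of the next slide; the only assumption is that every P_i is an integer power of 1/r.

    # Sketch (not from the lecture): when every P_i = (1/r)^{a_i}, choosing
    # l_i = log_r(1/P_i) gives a compact code with L = H_r(S).

    import math

    r = 2
    probs = [1/2, 1/4, 1/8, 1/8]

    lengths = [round(math.log(1/p, r)) for p in probs]        # l_i = log_r(1/P_i)
    avg_len = sum(p * l for p, l in zip(probs, lengths))      # L = sum P_i l_i
    entropy = sum(p * math.log(1/p, r) for p in probs)        # H_r(S)

    print(lengths)              # [1, 2, 3, 3]
    print(avg_len, entropy)     # 1.75 1.75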

Page 138: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 139

Compact code

Source symbol Symbol prob. code

S1 1/2 0

S2 1/4 10

S3 1/8 110

S4 1/8 111

For this code,

    L = \sum_{i=1}^{4} P_i l_i = 1 3/4 = 1.75  binits/symbol,
    H(S) = \sum_{i=1}^{4} P_i \log \frac{1}{P_i} = 1 3/4 = 1.75  bits/symbol.

Page 139: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 140

Example: Compact code

Source symbol Symbol prob. code

S1 1/4 00

S2 1/4 01

S3 1/4 10

S4 1/4 11

Source symbol Symbol prob. code

S1 1/2 0

S2 1/4 10

S3 1/8 110

S4 1/8 111

Page 140: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 141

Example: Compact code

Source symbol Symbol prob. code

S1 1/3 0

S2 1/3 1

S3 1/9 20

S4 1/9 21

S5 1/27 220

S6 1/27 221

S7 1/27 222

Page 141: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 142

Shannon’s first theorem

We now turn to zero-memory source with arbitrary symbol probabilities.

Equation (4-9b) tells us that if logr (1/Pi) is an integer, we should choose

the word length li equal to this integer. If log r (1/Pi) is not an integer, it

might seem reasonable that a compact code could be found by selecting

li as the first integer greater than this value. This tempting conjecture is, in

fact, not valid, but we shall find that selecting li in this manner can lead to

some important results.

If we select l_i as the smallest integer satisfying

    \log_r \frac{1}{P_i} \le l_i < \log_r \frac{1}{P_i} + 1,    (4-10)

then the left-hand inequality gives

    P_i \ge r^{-l_i}.    (4-11)

First, we check that these word lengths satisfy the Kraft inequality. Summing (4-11) over all i, we obtain

    \sum_{i=1}^{q} r^{-l_i} \le \sum_{i=1}^{q} P_i = 1.

Page 142: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 143

Shannon’s first theorem

If we multiply (4-10) by P_i and sum over all i,

    H_r(S) \le L < H_r(S) + 1.    (4-12)

In this way, if we construct the code according to (4-10), we have lower and upper bounds on L. Since this is valid for any zero-memory source, we may apply it to the nth extension of our original source S:

    H_r(S^n) \le L_n < H_r(S^n) + 1.    (4-13)

Here L_n represents the average length of the code words corresponding to symbols from the nth extension of the source S. If l_i is the length of the code word corresponding to symbol \sigma_i and P(\sigma_i) is the probability of \sigma_i, then

    L_n = \sum_{i=1}^{q^n} P(\sigma_i) l_i.    (4-14)

L_n/n is the average number of code symbols used per single symbol from S. Since H_r(S^n) = n H_r(S) for a zero-memory source, dividing (4-13) by n gives

    H_r(S) \le \frac{L_n}{n} < H_r(S) + \frac{1}{n}.    (4-15a)

Page 143: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 144

Shannon’s first theorem

It is possible to make Ln/n as close to Hr(S) as we wish by coding

the nth extension of S rather than S:

Equation (4-15a) is known as Shannon’s first theorem or the

noiseless coding theorem. The price we pay for decreasing Ln/n is

the increased coding complexity caused by the large number (qn) of

source symbols.

    \lim_{n \to \infty} \frac{L_n}{n} = H_r(S).    (4-15b)
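
A minimal sketch (not from the lecture) illustrating the theorem: word lengths are chosen as in (4-10) for the nth extension of an assumed binary source, and L_n/n is compared with H(S).

    # Sketch (not from the lecture): choosing word lengths per (4-10) for the
    # nth extension of a zero-memory source and watching L_n/n approach H(S).

    import math
    from itertools import product

    p = {"a": 0.7, "b": 0.3}                         # an assumed example source
    H = sum(q * math.log2(1/q) for q in p.values())

    for n in (1, 2, 4, 8):
        Ln = 0.0
        for block in product(p, repeat=n):           # all q^n symbols of S^n
            prob = math.prod(p[s] for s in block)
            Ln += prob * math.ceil(math.log2(1/prob))   # l_i per (4-10)
        print(f"n={n}:  L_n/n = {Ln/n:.4f}   H(S) = {H:.4f}")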

Page 144: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 145

Shannon’s first theorem for Markov source

We define the first-order Markov source S, with source symbols s1, s2, ..., sq and conditional symbol probabilities P(si/sj). We also define S^n, the nth extension of S, with symbols \sigma_1, \sigma_2, ..., \sigma_{q^n} and conditional symbol probabilities P(\sigma_i/\sigma_j). We refer to the first-order (unconditional) symbol probabilities of S and S^n as P_i and P(\sigma_i), respectively.

The process of encoding the symbols s1, s2, ..., sq into an instantaneous block code is identical for the source S and its adjoint source \bar S. If the length of the code word corresponding to s_i is l_i, the average length of the code is

    L = \sum_{i=1}^{q} P_i l_i.

Page 145: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 146

Shannon’s first theorem for Markov source

The average length is identical for S and \bar S, since P_i, the first-order symbol probability of s_i, is the same for both these sources. \bar S is a zero-memory source, and we have

    H_r(\bar S) \le L.

This inequality may be augmented to read

    H_r(S) \le H_r(\bar S) \le L,

and, for the nth extension,

    H_r(S^n) \le H_r(\overline{S^n}) \le L_n.

If we now select the l_i according to (4-10), we may bound L_n above and below as in (4-12):

    H_r(\overline{S^n}) \le L_n < H_r(\overline{S^n}) + 1.

Using (2-41) and dividing by n,

    H_r(S) \le \frac{L_n}{n} < H_r(S) + \frac{H_r(\bar S) - H_r(S) + 1}{n}.

Page 146: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 147

Coding without extensions

Shannon’s theorem shows the bound above and below considering

its extension. The theorem doesn’t tell us what value of L (or Ln/n)

we shall obtain. It doesn’t guarantee that choosing the word lengths

according to (4-10) will give us the smallest possible value of L ( or

Ln/n) it is possible to obtain for that fixed n.

Source symbol Pi Log 1/Pi li Code A Code B

S1 2/3 0.58 1 0 0

S2 2/9 2.17 3 100 10

S3 1/9 3.17 4 1010 11

    L_A = 1 \cdot \frac{2}{3} + 3 \cdot \frac{2}{9} + 4 \cdot \frac{1}{9} = 1.78  binits/source symbol,

    H(S) = \sum_{i=1}^{3} P_i \log \frac{1}{P_i} = 1.22  bits/source symbol.

This satisfies  H(S) \le L_A < H(S) + 1.

However, code B gives a shorter average length:

    L_B = 1 \cdot \frac{2}{3} + 2 \cdot \frac{2}{9} + 2 \cdot \frac{1}{9} = 1.33  binits/source symbol.

Page 147: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 148

Binary Compact Codes – Huffman Codes

A compact code for a source S is a code which has the smallest average length possible if we encode the symbols from S one at a time. We develop a method of constructing compact codes for the case of a binary code alphabet.

Consider the source S with symbols s1, s2, ..., sq and symbol probabilities P1, P2, ..., Pq. Let the symbols be ordered so that P1 \ge P2 \ge ... \ge Pq. By regarding the last two symbols of S as combined into one symbol, we obtain a new source from S containing only q-1 symbols. We refer to this new source as a reduction of S.

The symbols of this reduction of S may be reordered, and again we may combine the two least probable symbols to form a reduction of this reduction of S. By proceeding in this manner, we construct a sequence of sources, each containing one fewer symbol than the previous one, until we arrive at a source with only two symbols.

Page 148: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 149

Huffman codes

Original source and reduced sources (probabilities):

Symbols  Prob.   S1     S2     S3     S4
s1       0.4     0.4    0.4    0.4    0.6
s2       0.3     0.3    0.3    0.3    0.4
s3       0.1     0.1    0.2    0.3
s4       0.1     0.1    0.1
s5       0.06    0.1
s6       0.04

Construction of a sequence of reduced sources is the first step in the construction of a compact instantaneous code for the original source S.

The second step is merely the recognition that a binary compact instantaneous code for the last reduced source ( a source with only two symbols) is the trivial code with the two words 0 and 1.

The final step is to construct a compact instantaneous code for the source immediately preceding the reduced source in the sequence of reduced sources.
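
A minimal sketch (not from the lecture) of this construction in Python; the heap keeps the current reduced source, and code words are grown by prepending 0 and 1 as subtrees are merged. The example source is the one in the table above.

    # Sketch (not from the lecture): binary Huffman coding via a heap of
    # subtrees.  The integer counter only breaks ties in tuple comparison.

    import heapq

    def huffman_code(probs):
        """probs: dict symbol -> probability.  Returns dict symbol -> code word."""
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p0, _, c0 = heapq.heappop(heap)      # two least probable subtrees
            p1, _, c1 = heapq.heappop(heap)
            merged = {s: "0" + w for s, w in c0.items()}
            merged.update({s: "1" + w for s, w in c1.items()})
            heapq.heappush(heap, (p0 + p1, counter, merged))
            counter += 1
        return heap[0][2]

    probs = {"s1": 0.4, "s2": 0.3, "s3": 0.1, "s4": 0.1, "s5": 0.06, "s6": 0.04}
    code = huffman_code(probs)
    L = sum(probs[s] * len(w) for s, w in code.items())
    print(code)
    print("average length:", L)   # 2.2 binits/symbol, as on a later slide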

Page 149: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 150

Huffman codes

Huffman codes for two symbols:

Symbols  Prob.  Code
S1       3/4    0
S2       1/4    1

    L = \sum_i P_i l_i = 1  binit/symbol,   H(S) = \sum_i P_i \log \frac{1}{P_i} = 0.8113  bit/symbol.

Page 150: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 151

Huffman codes

The two symbols of Sj-1 (sa0 and sa1) that were combined into sa are given code words formed by adding a 0 and a 1, respectively, to the code word used for sa; every other symbol of Sj-1 keeps the code word used by the corresponding symbol of Sj.

There are other possible ways to decompose a reduced source, in S3 and in S1.

Synthesis of a compact code (reduced sources S1-S4 show probability and code):

Symbols  Prob.  Code    S1          S2          S3         S4
s1       0.4    1       0.4  1      0.4  1      0.4  1     0.6  0
s2       0.3    00      0.3  00     0.3  00     0.3  00    0.4  1
s3       0.1    011     0.1  011    0.2  010    0.3  01
s4       0.1    0100    0.1  0100   0.1  011
s5       0.06   01010   0.1  0101
s6       0.04   01011

Page 151: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 152

Huffman codes

There are three choices in S1. If we choose the first one, we obtain a code with word lengths

    1, 2, 4, 4, 4, 4.

If we choose the second or third, we obtain

    1, 2, 3, 4, 5, 5.

Synthesis of compact codes (first choice; reduced sources S1-S4 show probability and code):

Symbols  Prob.  Code    S1          S2          S3         S4
s1       0.4    1       0.4  1      0.4  1      0.4  1     0.6  0
s2       0.3    00      0.3  00     0.3  00     0.3  00    0.4  1
s3       0.1    0100    0.1  011    0.2  010    0.3  01
s4       0.1    0101    0.1  0100   0.1  011
s5       0.06   0110    0.1  0101
s6       0.04   0111

Page 152: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 153

Huffman codes

Two codes have the same average code lengths. These are shortest

average length codes that can construct.

    L = 1(0.4) + 2(0.3) + 3(0.1) + 4(0.1) + 5(0.06) + 5(0.04) = 2.2  binits/symbol,
    L = 1(0.4) + 2(0.3) + 4(0.1) + 4(0.1) + 4(0.06) + 4(0.04) = 2.2  binits/symbol,

    H(S) = \sum_{i=1}^{6} P_i \log \frac{1}{P_i} = 2.1435  bits/symbol.

Synthesis of a compact code:

Symbols  Prob.   Code    S1: Prob.  Code
s1       0.5     0       0.5        0
s2       0.25    10      0.25       10
s3       0.125   110     0.125      110
s4       0.100   1110    0.125      111
s5       0.025   1111

Compact code:

    L = \sum_{i=1}^{5} P_i l_i = 1.875  binits/symbol,   H(S) = \sum_{i=1}^{5} P_i \log \frac{1}{P_i} = 1.8402  bits/symbol.

Page 153: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 154

Proof of Huffman codes

Assume that we have found a compact code Cj for some reduction, say Sj, of

an original source S. Let the average length of this code be Lj.

One of the symbols of Sj, say sa, is formed from the two least probable

symbols of the preceding reduction Sj-1. Let these two symbols be sa0 and sa1,

and let their probabilities be Pa0 and Pa1, respectively.

The probability of sa is then Pa=Pa0+Pa1. Let the code for Sj-1 formed

according to rule (4-24) be called Cj-1, and let its average length be Lj-1.

Lj-1 is easily related to Lj since the words of Cj and Cj-1 are identical except

that the (two) words for sa0 and sa1 are one binit longer than the (one) word

for sa. Thus we know that

What we want to show is if Cj is compact, then Cj-1 must also be compact. In

other words, if Lj is the smallest possible average length of an instantaneous

code for Sj, then Lj-1 is the smallest possible average length for Sj-1.

    L_{j-1} = L_j + P_{a0} + P_{a1}.    (4-25)

Page 154: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 155

Proof of Huffman codes

    L_{j-1} = \sum_{i \ne a} P_i l_i + P_{a0}(l_a + 1) + P_{a1}(l_a + 1)
            = \sum_{i \ne a} P_i l_i + (P_{a0} + P_{a1}) l_a + P_{a0} + P_{a1}
            = L_j + P_{a0} + P_{a1},

where

    L_j = \sum_i P_i l_i   and   P_a = P_{a0} + P_{a1}.

Page 155: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 156

Proof of Huffman codes

A proof by demonstrating that assuming the contrary leads to a contradiction.

Assume that we have found a compact code for S_{j-1} with average length \tilde L_{j-1} < L_{j-1}. Let the words of the code be \tilde X_1, \tilde X_2, ..., \tilde X_{a0}, \tilde X_{a1}, with lengths \tilde l_1, \tilde l_2, ..., \tilde l_{a0}, \tilde l_{a1}, respectively. We assume that the subscripts are ordered in order of decreasing symbol probabilities, so that

    \tilde l_1 \le \tilde l_2 \le ... \le \tilde l_{a1}.

One of the words of this code (call it \tilde X_{a0}) must be identical with \tilde X_{a1} except in its last digit. If this were not true, we could drop the last digit from \tilde X_{a1} and decrease the average length of the code without destroying its instantaneous property.

Finally, we form \tilde C_j, a code for S_j, by combining \tilde X_{a0} and \tilde X_{a1} and dropping their last binit while leaving all other words unchanged. This gives us an instantaneous code for S_j with average length \tilde L_j, related by

    \tilde L_j = \tilde L_{j-1} - P_{a0} - P_{a1}.

Page 156: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 157

Proof of Huffman codes

If we compare the last equation with (4-25), we see that our assumption \tilde L_{j-1} < L_{j-1} implies that we may construct a code for S_j with average length

    \tilde L_j < L_j.

This is the contradiction we seek, since the code with average length L_j is compact.

Two properties of Huffman codes:

If the probabilities of the symbols of a source are ordered so that P_1 \ge P_2 \ge ... \ge P_q, the lengths of the words assigned to these symbols will be ordered so that l_1 \le l_2 \le ... \le l_q.

The lengths of the last two words (in order of decreasing probability) of a compact code are identical: l_q = l_{q-1}. If there are several symbols with probability P_q, we may assign their subscripts so that the words assigned to the last two symbols differ only in their last digit.

Page 157: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 158

r-ary compact codes

We would like the last source in the sequence to have exactly r symbols. The last source will have r symbols if and only if the original source has r+a(r-1) symbols, where a is an integer. Therefore, if the original source doesn’t have r+a(r-1) symbols, we add “dummy symbols” with probability 0 to the source until this number is reached.

Synthesis of compact codes (quaternary code, r = 4; reduced sources S1-S3 show probability and code):

Symbols  Prob.  Code    S1           S2           S3
s1       0.22   2       0.22  2      0.23  1      0.40  0
s2       0.15   3       0.15  3      0.22  2      0.23  1
s3       0.12   00      0.12  00     0.15  3      0.22  2
s4       0.10   01      0.10  01     0.12  00     0.15  3
s5       0.10   02      0.10  02     0.10  01
s6       0.08   03      0.08  03     0.10  02
s7       0.06   11      0.07  10     0.08  03
s8       0.05   12      0.06  11
s9       0.05   13      0.05  12
s10      0.04   100     0.05  13
s11      0.03   101
(s12)    0.00   102     dummy symbol
(s13)    0.00   103     dummy symbol

Page 158: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 159

Code efficiency and redundancy

Shannon’s first theorem shows that there exists a common measure for any

information source. The value of a symbol from an information source S may be

measured in terms of an equivalent number of binary digits needed to represent

one symbol from that source.

Let the average length of a uniquely decodable r-ary code for the source S be L. L

cannot be less than Hr(S). Accordingly, we define the efficiency \eta of the code by

    \eta = \frac{H_r(S)}{L}.

It is also possible to define the redundancy of a code:

    redundancy = 1 - \eta = \frac{L - H_r(S)}{L}.

Page 159: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 160

Example – nth extension

Huffman code for two symbols:

Symbols  Prob.  Code
S1       3/4    0
S2       1/4    1

    H(S) = \frac{1}{4}\log 4 + \frac{3}{4}\log\frac{4}{3} = 0.811  bit.

The average length of this code is 1 binit, so the efficiency is \eta = 0.811.

To improve the efficiency, we might code S^2, the second extension of S:

Symbols  Prob.  Code
S1       9/16   0
S2       3/16   10
S3       3/16   110
S4       1/16   111

    \eta_2 = \frac{2 H(S)}{L_2} = 0.961.

Extending to higher orders, \eta_3 = 0.985 and \eta_4 = 0.991.
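
A small check (not from the lecture) of these efficiencies, using the word lengths shown in the two tables above:

    # Sketch (not from the lecture): efficiency of the codes above for the
    # source with P = (3/4, 1/4), using the word lengths from the tables.

    import math

    H = 0.25 * math.log2(4) + 0.75 * math.log2(4/3)      # 0.811 bit/symbol

    L1 = 1.0                                             # lengths 1, 1
    L2 = (9/16)*1 + (3/16)*2 + (3/16)*3 + (1/16)*3       # second extension

    print("eta_1 =", H / L1)        # 0.811
    print("eta_2 =", 2 * H / L2)    # about 0.96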

Page 160: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 161

Example – nth extension

Page 161: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 162

Compact codes: Elias codes

Elias code:

The Elias code is a non-block compact code, in contrast to Huffman codes, which are block codes. It is also called an arithmetic code.

The Elias code assigns a sequence of source symbols to a fractional number, obtained by repeatedly dividing a number line according to the symbol probabilities (2-ary coding).

For example, 011 is a point of the region [0.375, 0.50]; the initial symbol is A. 0110 is a point of the region [0.375, 0.4375]; the source symbols are AAB.

Page 162: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 163

Elias code

In Huffman coding it is necessary to consider extensions of the source in order to improve code efficiency; if the block size is large, this becomes difficult. Also, in Huffman codes each code-word length must be an integer number of code symbols.

The Elias code assigns a whole sequence of source symbols to one code. It is not necessary to calculate all the probabilities of the nth extension of the source, and we can decode the code iteratively.

Page 163: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 164

Elias code

Procedure:

Suppose we have binary source symbols s0 and s1 with probabilities P0 and P1.

Divide the region [0, 1) of the number line according to P0 : P1 into regions A0 and A1, where A0 corresponds to [0, P0) and A1 corresponds to [P0, 1).

If the first source symbol S0 is s0, choose region A0; otherwise choose region A1.

If S0 = s0 and region A0 is selected, divide A0 according to P0 : P1 to obtain regions A00 and A01. If the next symbol S1 = s0, choose region A00; otherwise choose A01.

Iterate this procedure until the end of the source symbol sequence, and represent the chosen region by a fractional number, the lower end of the region.
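
A minimal sketch (not from the lecture) of this interval-splitting procedure for an assumed probability P0 = 0.6; real implementations need care with finite precision, which is exactly the problem the L-R code addresses later.

    # Sketch (not from the lecture): Elias-style interval selection for a
    # binary sequence.  Returns the lower end and width of the final region.

    def elias_encode(symbols, p0):
        low, width = 0.0, 1.0
        for s in symbols:
            if s == "0":                 # choose the lower sub-region A...0
                width = width * p0
            else:                        # choose the upper sub-region A...1
                low = low + width * p0
                width = width * (1.0 - p0)
        return low, width

    low, width = elias_encode("010011", p0=0.6)
    print(low, low + width)              # the region representing the sequence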

Page 164: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 165

Elias code

Average code length: the size of the region for a symbol sequence S^N becomes

    P_0^{N_0} P_1^{N_1},

where N_0 and N_1 are the numbers of occurrences of s0 and s1, respectively. The resolution necessary to represent a point in this region with a binary fraction is

    -N_0 \log_2 P_0 - N_1 \log_2 P_1   bits.

If we take a longer source symbol length N, then N_0/N -> P_0 and N_1/N -> P_1, and in this way the average code length per symbol approaches the entropy as the length N grows.

Page 165: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 166

Elias code

This figure depicts the process in which the source symbol sequence 010011... is encoded by the Elias code. First the region [0, 1) is divided into A0 and A1 according to P0 : P1. A0 is chosen since the first symbol is 0. In this way each subregion is divided and chosen in turn.

Page 166: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 167

L-R arithmetic codes

Problems of the Elias code:

One multiplication by a probability is required for each source symbol coded.

The required precision of the calculation increases with N.

The code cannot be output until the last symbol has been received.

L-R arithmetic code: one approach to solving these problems for a binary code.

Approximate the inferior symbol probability by 2^{-Q}.

Assign the value U of the region [U, V) to the symbol sequence.

Prevent bit-reverse propagation by carries by introducing bit-stuffing.

Average code length:

The average code length of the L-R code is given by

    L = -P_1 \log_2 2^{-Q} - P_0 \log_2 (1 - 2^{-Q}).

The coding efficiency becomes 1 if the inferior symbol probability is exactly 2^{-Q}.

Page 167: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 168

L-R arithmetic code

Coding algorithm:

Initialization:

Prepare a register C and a register A of V bits. C is the initial code and A is the initial value of the region:

    C = 00...0,   A = 11...1.

Coding of source symbol Xi:

Divide the register A into A0 and A1 according to P0 : P1 of the superior symbol "0" and the inferior symbol "1", with P1 = 2^{-Q} (Q: an integer, called SKEW):

    A1 = A \cdot P1 = A \cdot 2^{-Q},    (1)
    A0 = A \cdot P0 = A - A1.            (2)

(1) is calculated by a right shift, and (2) by the subtraction A - A1 = A0.

Page 168: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 169

L-R arithmetic code

The code is updated as follows, where C represents the lower bound of the chosen region:

    If Xi = 0:  C is left as it was.
    If Xi = 1:  C <- C + A0.

Then update the region:

    If Xi = 0:  A <- A0.
    If Xi = 1:  A <- A1.

Page 169: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 170

L-R arithmetic code

Decoding algorithm:

Initialization:

C is copied from the received code; A is set to the initial value used by the coding algorithm.

Decoding:

Every time we receive a code, divide the region A:

    A0 = A \cdot P0,   A1 = A \cdot P1.

If C - A0 is negative, keep C as it was and output source symbol 0.
If C - A0 is non-negative, set C <- C - A0 and output source symbol 1.

Next update A:

    If Xi = 0:  A <- A0.
    If Xi = 1:  A <- A1.

Page 170: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 171

Another advantage of L-R arithmetic code

We can change the inferior symbol probability (the SKEW) as the symbol probability changes. If we use the same SKEW sequence in decoding, we can decode in the same way.

Page 171: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 172

Coding example by L-R code

Sym. A A0 A1 Code Output C

0 1111 1100 0011 0000

1 1100 1001 0011 1001

ren. 0011 Shift 2 bit 10 01

0 1100 1001 0011 10 0100

0 1001 0111 0010 10 0100

ren. 0111 Shift 1 bit 100 100

1 1110 1011 0011 101 0011

ren. 0011 Shift 2 bit 10100 11

1 1100 1001 0011 10101 0101

ren. 0011 Shift 2 bit 1010101 01

0 1100 1001 0011 1010101 0100

0 1001 0111 0010 1010101 0100

ren. 0111 Shift 1 bit 10101010 100

0 1110 1011 0011 10101010 1000

1 1011 1001 0010 10101011 0001

Code string = 101010110001

Page 172: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 173

Decoding example of L-R code

A A0 A1 C Code String Sym.

1111 1100 0011 1010 10110001 0

1100 1001 0011 0001 10110001 1

0011 Shift 2 bit 0110 110001 ren.

1100 1001 0011 0110 110001 0

1001 0111 0010 0110 110001 0

0111 Shift 1 bit 1101 10001 ren.

1110 1011 0011 0010 10001 1

0011 Shift 2 bit 1010 001 ren.

1100 1001 0011 0001 001 1

0011 Shift 2 bit 0100 1 ren.

1100 1001 0011 0100 1 0

1001 0111 0010 0100 1 0

0111 Shift 1 bit 1001 ren.

1110 1011 0011 1001 0

1011 1001 0010 0000 1

Decoded symbol string=0100110001

Page 173: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 174

Bit-stuffing- L-R code

(a) Without bit-stuffing: a carry generated while updating register C can propagate into code bits that have already been output (bit reverse by carry).

(b) With bit-stuffing: a "0" bit is inserted into the code output, so the carry is absorbed and there is no influence from the bit reverse.

Page 174: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 175

Coding efficiency of L-R code

[Figure: coding efficiency of the L-R code plotted against the probability of the inferior symbol.]

Page 175: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 176

Universal code

What is a universal code?

A coding scheme that can compress source symbols belonging to a fixed class optimally or very efficiently.

A coding algorithm that is independent of the a priori probabilities of the source symbols, or a coding algorithm for source symbols whose probabilities vary.

Three coding algorithms:

Adaptive Huffman code

Context Modeling

Dictionary code

Page 176: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 177

Adaptive Huffman code

Adaptive Huffman code (1)

Algorithm:

Every time we receive N source symbols (one block), update the probability table of the source symbols, re-synthesize the Huffman code, and send the table to the decoder.

Problems:

If the block size N is large, the probability table is relatively small compared with the data; however, we cannot send any code until N source symbols have been received.

It is very inefficient to re-synthesize the Huffman code every N symbols.

Page 177: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 178

Adaptive Huffman code

Adaptive Huffman code (2) Algorithm:

Code each source symbol and send the code based on the Huffman code designed from the symbol probabilities seen so far.

Let the probabilities of source symbols a0, a1, ..., aM-1 after time N-1 be

    P_{N-1}(a_i) = \frac{n_{N-1}(a_i)}{N-1},

where n_{N-1}(a_i) is the number of occurrences of a_i among the first N-1 symbols. If the source symbol at time N is a, the code for the symbol is synthesized from the Huffman code based on the updated probabilities

    P_N(a_i) = \frac{n_{N-1}(a_i) + 1}{N}   if a_i = a,
    P_N(a_i) = \frac{n_{N-1}(a_i)}{N}       otherwise.

This algorithm does not need to send a probability table of source symbols, since the decoder can update the probability table simultaneously.

Problems: in the worst case an update of the Huffman code is necessary for every symbol, and higher numerical resolution is necessary as N grows.

Page 178: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 179

Adaptive Huffman code

Adaptive Huffman code (3)

Algorithm:

Update the probability table only when the tree of the Huffman code changes. The timing of the table update is based not on the true symbol probabilities but on the following approximated probabilities (the normalization in this equation is not applied in practice):

    P(a_i) = \frac{w_i}{\sum_{k=0}^{M-1} w_k}.

Initialization:

Synthesize the Huffman code and its tree according to the a priori source symbol probabilities, and assign a weight w_i to each symbol according to its a priori probability.

We increment the count w_i by one when receiving source symbol a_i. If this changes the Huffman code tree, re-synthesize the Huffman code tree until it again satisfies the Huffman code property.

Huffman code property:

The nodes of the code tree can be listed in order of non-increasing weight, with each node adjacent to its sibling. This means that the structure of the Huffman code takes the form of a list ordered by probabilities. This property is also called the "Sibling Property".

Page 179: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 180

Adaptive Huffman code

Increment wi when we receive Si={s1,s2,s3,s4} . If the sibling property

doesn’t hold, re-synthesize a partial Huffman code tree.

P_i:  s1 0.5 -> 0;  s2 0.3 -> 10;  s3 0.1 -> 110;  s4 0.1 -> 111
      (merged nodes: 0.1 + 0.1 = 0.2 with prefix 11;  0.2 + 0.3 = 0.5 with prefix 1)

W_i:  s1 50 -> 0;  s2 30 -> 10;  s3 10 -> 110;  s4 10 -> 111
      (merged nodes: 10 + 10 = 20 with prefix 11;  20 + 30 = 50 with prefix 1)

Page 180: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 181

Adaptive Huffman codes

Symbol S1 S2 S3 S4

Frequency 10 7 5 3

If we receive 10 times S4, how does Huffman tree change?

Symbol Freq. Code Freq. Code Freq. Code

S1 10 1 10 1 16 0

S2 7 01 9 00 10 1

S3 5 000 7 01

S4 4 001

#S4 +1

Symbol Freq. Code Freq. Code Freq. Code

S1 10 1 10 1 17 0

S2 7 01 10 00 10 1

S3 5 000 7 01

S4 5 001

#S4 +2

Page 181: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 182

Adaptive Huffman codes

Symbol Freq. Code Freq. Code Freq. Code

S1 10 00 11 1 17 0

S2 7 01 10 00 11 1

S4 6 10 7 01

S3 5 11

#S4 +3

Symbol Freq. Code Freq. Code Freq. Code

S1 10 00 12 1 17 0

S2 7 01 10 00 10 1

S4 7 10 7 01

S3 5 11

#S4 +4

Page 182: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 183

Adaptive Huffman codes

Symbol Freq. Code Freq. Code Freq. Code

S1 10 00 12 1 18 0

S4 8 01 10 00 12 1

S2 7 10 8 01

S3 5 11

#S4 +5

Symbol Freq. Code Freq. Code Freq. Code

S1 10 00 12 1 19 0

S4 9 01 10 00 12 1

S2 7 10 9 01

S3 5 11

#S4 +6

Page 183: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 184

Adaptive Huffman codes

Symbol Freq. Code Freq. Code Freq. Code

S1 10 00 12 1 20 0

S4 10 01 10 00 12 1

S2 7 10 10 01

S3 5 11

#S4 +7

Symbol Freq. Code Freq. Code Freq. Code

S4 11 00 12 1 21 0

S1 10 01 11 00 12 1

S2 7 10 10 01

S3 5 11

#S4 +8

Page 184: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 185

Adaptive Huffman codes

Symbol Freq. Code Freq. Code Freq. Code

S4 12 1 12 1 22 0

S1 10 01 12 00 12 1

S2 7 000 10 01

S3 5 001

#S4 +9

symbol S4 S4 S4 S4 S4 S4 S4 S4 S4 S4

Code 001 001 001 001 10 10 01 01 00 1

Page 185: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 186

Dictionary code

Lempel-Ziv coding:

A coding algorithm using a dictionary (code table) containing source symbol sequences that have appeared so far.

It does not require a priori probability distributions of the source symbols.

Like the arithmetic code, it is a non-block code.

Like the arithmetic code, it is a compact code.

In this method, coding from a source symbol sequence to a code sequence is obtained by the following procedure.

1. Retrieval: look for the source symbol sequence in the dictionary.

2. Coding: code the source symbol sequence into a code sequence according to its position in the dictionary.

3. Update: update the dictionary on the decoding side as well.

Page 186: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 187

LZ77 algorithm

Set an empty sequence into the reference buffer, and set the source symbol sequence into the coding buffer.

Find the longest sub-sequence of the coding buffer that also occurs in the reference buffer: let U be the sub-sequence starting at the left-most side of the coding buffer and U' the matching sub-sequence in the reference buffer. Let u be the next symbol after U, let P be the starting address (pointer) of U', and let l be the length of U.

We encode this part of the source symbol sequence as the triple (P, l, u), then shift the buffers left by l + 1 symbols, and repeat until no source symbols remain.

(The window consists of the reference buffer followed by the coding buffer.)
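
A minimal sketch (not from the lecture) of this procedure in Python; it ignores the finite window length and simply searches everything already seen, which is enough to reproduce the (P, l, u) triple of the example that follows.

    # Sketch (not from the lecture): a simple LZ77-style coder producing
    # (P, l, u) triples, with P given as a negative offset into the part of
    # the sequence already seen.

    def lz77_encode(seq):
        out, i = [], 0
        while i < len(seq):
            best_len, best_off = 0, 0
            for j in range(i):                        # search the reference part
                l = 0
                while i + l < len(seq) - 1 and seq[j + l] == seq[i + l]:
                    l += 1
                if l > best_len:
                    best_len, best_off = l, j - i     # negative offset
            nxt = seq[i + best_len]
            out.append((best_off, best_len, nxt) if best_len else nxt)
            i += best_len + 1
        return out

    print(lz77_encode("abcabcdef"))
    # ['a', 'b', 'c', (-3, 3, 'd'), 'e', 'f']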

Page 187: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 188

LZ77 algorithm

source symbol sequence “abcabcdef”

source symbol code

a a

b b

c c

a (-3,3,d)

b

c

d

e e

f f

Page 188: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 189

LZ77 algorithm

Properties of the LZ77 algorithm:

LZ77 approaches a compact code as the buffer lengths L and Ls become large.

Sending u, the first mismatched symbol, is inefficient.

If l is very short, the code can be longer than the original source symbols; in this case we simply send the original source symbol sequence.

A fixed length can be used for U.

We can also use the relative address from the left-most side of the coding buffer, or use the "Recency-Rank", meaning the number of different types of source symbols, instead of the relative address.

Page 189: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 190

LZ78 algorithm

LZ78 algorithm:

Universal coding based on "incremental parsing".

Let the source symbol sequence be u = u_1, u_2, ..., u_T. Incremental parsing is a decomposition of u into partial sequences U_0, U_1, ..., U_{t-1}, U_t (with U_0 the empty sequence) such that:

    u is the concatenation U_1 U_2 ... U_{t-1} U_t;
    the U_m (1 \le m \le t-1) are all different from each other (except possibly the last part);
    if we remove the last symbol u_m of U_m, the remaining prefix equals some earlier U_s (0 \le s \le m-1).

Page 190: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 191

Example

Each U_m satisfies the three properties above, and U_m = U_s u_m: every part equals an earlier part U_s followed by one new symbol u_m. We can therefore code the source symbol sequence into the pairs (s, u_m), where s (0 \le s \le m-1) is the index of U_s.

In the example on the next slide, the sequence is parsed into the parts 0, 1, 10, 01, 100, 101, 1000, 010, 011.
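
A minimal sketch (not from the lecture) of incremental parsing with (s, u) output; here index 0 is reserved for the explicit empty part U0, so the indices are shifted by one with respect to the encoder table on the next slide.

    # Sketch (not from the lecture): LZ78 incremental parsing with (s, u)
    # pair output.

    def lz78_encode(seq):
        table = {"": 0}                      # index 0 is the empty sequence U0
        out, cur = [], ""
        for u in seq:
            if cur + u in table:             # still a previously seen part
                cur += u
            else:
                out.append((table[cur], u))  # (index of U_s, new symbol u_m)
                table[cur + u] = len(table)  # register the new part U_m = U_s u_m
                cur = ""
        if cur:                              # a possibly repeated final part
            out.append((table[cur[:-1]], cur[-1]))
        return out

    print(lz78_encode("0110011001011000010011"))
    # [(0,'0'), (0,'1'), (2,'0'), (1,'1'), (3,'0'), (3,'1'), (5,'0'), (4,'0'), (4,'1')]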

Page 191: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 192

Example

Time In Out(s,um) Add to

Table Index

0 0 (-, 0) 0 0

1 1 (-, 1) 1 1

2 10 (1, 0) 10 2

3 01 (0, 1) 01 3

4 100 (2, 0) 100 4

5 101 (2, 1) 101 5

6 1000 (4, 0) 1000 6

7 010 (3, 0) 010 7

8 011 (3, 1) 011 8

Time In Out Add to Table Index

0 0 0 0 0

1 1 1 1 1

2 (1, 0) 10 10 2

3 (0, 1) 01 01 3

4 (2, 0) 100 100 4

5 (2, 1) 101 101 5

6 (4, 0) 1000 1000 6

7 (3, 0) 010 010 7

8 (3, 1) 011 011 8

Encoder Decoder

]11001100001000110011001[

Page 192: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 193

Example

Time In Out(s) Add to

Table Index

0 01 0 01 2

1 11 1 11 3

2 10 1 10 4

3 00 0 00 5

4 011 2 011 6

5 100 4 100 7

6 010 2 010 8

7 011 2 011 9

Encoder

Time In Out Add to

Table Index

0 0 0(? 1 ) ? 01 2

1 1 1 (? 1 ) ? 11 3

2 1 1 (? 0) ? 10 4

3 0 0(? 0) ? 00 5

4 2 01(? 1 ) ? 011 6

5 4 10(? 0) ? 100 7

6 2 01(? 0) ? 010 8

7 2 01(? ) ? 9

Decoder

Input String Index

0 0

1 1

Initial code table

Move pointer to the position of next decomposed code -1.

Page 193: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 194

Example

Time Send New Entry Index

0 1 (for 0) (a, 0) 3

1 0 (for 0) (0, 0) 4

2 4 (for 00) (0,0,b) 5

3 2 (for b) (b,0) 6

4 3 (for (a,0)) (a,0,a) 7

Ternary Encoder

In Out Reconstructed Sequence Add to Table

1 a (a)a,?

0 0 (a)a,0 (0)0,? (a, 0) (as 3)

4 ? (a)a,0 (0)0,? ?

Ternary Decoder

Input String Index

0 0

a 1

b 2

Initial code table

Page 194: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 195

LZ78

Problems of LZ78:

Coding by (s, u_m) is inefficient since we have to send u_m as it is. The solution is to send only (s); this method is used in the Unix "compress" command.

Incremental parsing stores all symbol sub-sequences in the dictionary and assigns them addresses.

This may cause the dictionary to overflow memory; in such a case we delete the LRU (Least Recently Used) entry from the dictionary using a self-organizing list.

Page 195: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 196

Other code

Run-length code:

abbbbbbbab: a(b,7)ab
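
A minimal sketch (not from the lecture) in the spirit of this example; the run-length threshold is an arbitrary assumption.

    # Sketch (not from the lecture): run-length coding; runs shorter than the
    # threshold are left as they are.

    from itertools import groupby

    def run_length_encode(seq, min_run=3):
        out = []
        for sym, group in groupby(seq):
            n = len(list(group))
            out.append((sym, n) if n >= min_run else sym * n)
        return out

    print(run_length_encode("abbbbbbbab"))   # ['a', ('b', 7), 'a', 'b']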

Page 196: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 197

Rate Distortion

Coding with distortion:

The average code length per source symbol can be reduced if we allow coding distortion. Here, the distortion includes redundancy and errors which prevent unique decodability.

Distortion measure:

Let x be a source information symbol and let y be the decoded output of the code. The distance between x and y is d(x, y), called the distortion measure. We evaluate the source coding efficiency by the average distortion

    \bar d = \sum_x \sum_y d(x, y) \, p(x, y),

where p(x, y) is the joint probability distribution of the source symbol variable X and the coded symbol variable Y.

Page 197: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 198

Rate distortion

Mutual information:

For a channel without any distortion we can easily know the source

symbol x by knowing the decoded output y. The average amount of

information is H(X). If there is distortion, the average amount of

information is,

Therefore the lower bound of the average code length is the mutual

information I(X;Y).

Distortion will be different while the mutual information is the same.

For this case we try to find codes whose distortion satisfies

Under this condition we try to find codes which minimizes the I(X;Y)

This R(D) is called “Rate-Distortion Function” of the information source..

)()();( YXHXHYXI

.Dd

d

);(min)( YXIDRDd

Page 198: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 199

Rate distortion

Definition:

Under the condition that the average distortion is less than D, there exist codes whose average code length L per source symbol satisfies

    R(D) \le L < R(D) + \frac{1}{n}

for an arbitrary integer n (by coding the nth extension). But there is no code that has a smaller average code length than R(D).

Page 199: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 203

Rate distortion

Derivation of the R(D) function: the mutual information I(X; Y) is written as follows, given P_X(x) and the conditional probabilities P(y|x):

    I(X; Y) = \sum_x \sum_y P(x) P(y|x) \log \frac{P(y|x)}{P(y)},

where

    P(y) = \sum_x P(x) P(y|x).

Next, \bar d is written as

    \bar d = \sum_x \sum_y P(x) P(y|x) d(x, y) \le D,

and the probability constraints require

    P(y|x) \ge 0,   \sum_y P(y|x) = 1.

What we need is to minimize I(X; Y) under the above three constraints by the Lagrangian method.

Page 200: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 204

Rate distortion

Source coding with distortion:

Suppose we choose symbol sequences of length n,

    x_i = (x_{i1}, x_{i2}, ..., x_{in}),   i = 1, 2, ..., k^n,

from an information source S with k symbols. Now we choose m code words

    C_D = { w_j = (w_{j1}, w_{j2}, ..., w_{jn}),  j = 1, 2, ..., m }

that give the minimum average distortion. Here the average distortion is given by

    \bar d_n = \sum_{i=1}^{k^n} d_n(x_i, w_{j(i)}) \, p(x_i),

where j(i) minimizes d_n(x_i, w_j), that is,

    j(i) = \arg\min_j d_n(x_i, w_j).

[Figure: information source -> source coding to C_D (choose the w minimizing d_n(x, w)) -> distortion-less source coding of w.]

Page 201: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 205

Rate distortion

Then apply distortion-less source coding to the chosen code words. This method achieves average distortion \bar d_n per source symbol.

Decoding:

Decoding is obtained by finding the w_j that minimizes the distortion to the received code word for each x_i.

Maximum likelihood decoding:

Find the x_m that satisfies

    p(W_j, x_m) \ge p(W_j, x_{m'})

for all m' except m.

Page 202: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 206

Rate distortion

Maximum a posteriori probability decoding:

Find the x_m that maximizes

    p(x_m | w_j) = \frac{p(x_m) \, p(w_j | x_m)}{p(w_j)}.

However, the a priori probability P(x_m) needs to be given. This method is equivalent to a method that maximizes the mutual information

    I(w_j; x_m) = E\left[ \log \frac{p(w_j | x_m)}{p(w_j)} \right].

Page 203: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 207

Binary source

Suppose we have a binary information source {0, 1} with probabilities p and 1 - p, and let the bit error rate be the distortion measure:

    d(x, y) = 0   if y = x,
    d(x, y) = 1   if y \ne x.

This source coding problem can be viewed as a test transmission channel problem in which the following mutual information is minimized under the distortion constraint \bar d \le D:

    I(X; Y) = H(X) - H(X|Y).

[Figure: test transmission channel - source X with P_X(1) = p, error source E with P_E(1) = \bar d, and output Y = X \oplus E.]

Page 204: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 208

Rate distortion

Y can be thought of as the symbol obtained by adding to the source symbol x an error symbol E that equals 1 with probability \bar d. Since the addition is XOR, Y = X \oplus E is equivalent to X = Y \oplus E, and then

    H(X|Y) = H(Y \oplus E | Y) = H(E|Y) \le H(E).

Furthermore, let \tilde H(p) denote the entropy of a zero-memory binary source,

    \tilde H(p) = -p \log p - (1 - p) \log(1 - p).

If the error source is a zero-memory source, H(E) = \tilde H(\bar d) holds, and even if the error source has memory, H(E) \le \tilde H(\bar d) holds.

Therefore,

    H(X|Y) \le \tilde H(\bar d).

Page 205: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 209

Rate distortion

If 0 \le D \le 0.5, the entropy function is monotone increasing on this range, so \bar d \le D implies

    \tilde H(\bar d) \le \tilde H(D),

and then

    I(X; Y) = \tilde H(p) - H(X|Y) \ge \tilde H(p) - \tilde H(\bar d) \ge \tilde H(p) - \tilde H(D).

Finally, we have the rate-distortion function

    R(D) = \tilde H(p) - \tilde H(D).
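
A small sketch (not from the lecture) evaluating this R(D) for an assumed binary source with p = 0.5:

    # Sketch (not from the lecture): R(D) = H~(p) - H~(D) for a binary source.

    import math

    def h(p):
        """Binary entropy function H~(p) in bits."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def rate_distortion_binary(p, D):
        return max(h(p) - h(D), 0.0)   # R(D) is 0 once D reaches min(p, 1-p)

    for D in (0.0, 0.05, 0.1, 0.25, 0.5):
        print(D, rate_distortion_binary(0.5, D))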

Page 206: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 210

Rate distortion

RD function for a binary information source

Page 207: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 211

Source coding of analog information

Analog source coding:

Here we treat analog source information that can take continuous values rather than discrete symbols (e.g. speech, images, sensory input).

[Block diagram: analog signal -> sampling -> quantization.]

Page 208: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 212

Sampling

If the frequency band is limited to 0-W [Hz], the function f(t) can be written in terms of its samples

    X_k = f\left( \frac{k}{2W} \right),   k = 0, \pm 1, \pm 2, ...

as

    f(t) = \sum_k X_k \frac{\sin \pi (2Wt - k)}{\pi (2Wt - k)}.

Page 209: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 213

Sampling

Let the spectrum of f(t) be F(w); it can be written as

    F(w) = \int_{-\infty}^{\infty} f(t) e^{-iwt} dt.    (1)

If F(w) is band-limited to -2\pi W \le w \le 2\pi W, it can be expanded in a Fourier series on that interval:

    F(w) = \sum_k a_k e^{-i k w / (2W)},    (2)

where

    a_k = \frac{1}{4\pi W} \int_{-2\pi W}^{2\pi W} F(w) e^{i k w / (2W)} dw.    (3)

From (1), the inverse transform restricted to the band gives

    f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(w) e^{iwt} dw = \frac{1}{2\pi} \int_{-2\pi W}^{2\pi W} F(w) e^{iwt} dw.    (4)

Page 210: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 214

Sampling

Now we set t = k/(2W) in (4):

    f\left( \frac{k}{2W} \right) = \frac{1}{2\pi} \int_{-2\pi W}^{2\pi W} F(w) e^{i k w/(2W)} dw.    (5)

Comparing (3) and (5),

    a_k = \frac{1}{2W} f\left( \frac{k}{2W} \right).    (6)

Therefore we get

    F(w) = \frac{1}{2W} \sum_k f\left( \frac{k}{2W} \right) e^{-i k w/(2W)},    (7)

and substituting (7) into (4),

    f(t) = \frac{1}{4\pi W} \sum_k f\left( \frac{k}{2W} \right) \int_{-2\pi W}^{2\pi W} e^{iw(t - k/(2W))} dw
         = \sum_k f\left( \frac{k}{2W} \right) \frac{\sin \pi (2Wt - k)}{\pi (2Wt - k)}.    (8)

Page 211: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 215

Entropy of analog source

The Entropy for a digital source is defined as

    H = -\sum_i p_i \log p_i.

How can we define Entropy for a stochastic variable x that takes continuous values?

Divide the range of x into small intervals of width \Delta x. The probability that x takes a value between x_i and x_i + \Delta x can be approximated by p(x_i) \Delta x, and the smaller \Delta x is, the better the approximation. Then

    H' = \lim_{\Delta x \to 0} \left( -\sum_i p(x_i)\Delta x \log \{ p(x_i)\Delta x \} \right)
       = \lim_{\Delta x \to 0} \left( -\sum_i p(x_i)\Delta x \log p(x_i) \right) + \lim_{\Delta x \to 0} \left( -\sum_i p(x_i)\Delta x \log \Delta x \right)
       = -\int p(x) \log p(x) \, dx - \lim_{\Delta x \to 0} \log \Delta x.

Page 212: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 216

Entropy of analog source

The second term goes to infinity. We use this Entropy only to compare various analog sources, and define the Entropy of an analog source by the first term alone:

    H = -\int p(x) \log p(x) \, dx.

Unit Entropy:

If the analog source has n stochastic variables x1, x2, ..., xn, we define the Entropy by

    H_n = -\int \cdots \int p(x_1, ..., x_n) \log p(x_1, ..., x_n) \, dx_1 \cdots dx_n.

The Entropy per variable is

    H = \lim_{n \to \infty} \frac{1}{n} H_n.

This is called the unit Entropy. The Entropy normalized by the time T is called the Entropy per second, H'. Since n = 2TW, the following relationship holds:

    H' = 2W H.

Page 213: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 217

Conditional Entropy

Definition:

    H(Y|X) = \int p(x) H(Y|x) \, dx = -\iint p(x, y) \log \frac{p(x, y)}{p(x)} \, dx \, dy,

    H(X|Y) = \int p(y) H(X|y) \, dy = -\iint p(x, y) \log \frac{p(x, y)}{p(y)} \, dx \, dy,

where

    p(x) = \int p(x, y) \, dy,   p(y) = \int p(x, y) \, dx

are the marginal probability distributions.

The following relationship holds, as in the digital information source:

    H(X, Y) \le H(X) + H(Y),

with equality if and only if p(x, y) = p(x) p(y).

Page 214: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 218

Entropy of Gaussian distribution

The probability distribution of the Gaussian (normal) distribution is

    p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right).

The Entropy is given by

    H(X) = -\int p(x) \log p(x) \, dx
         = -\int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \log\left\{ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right) \right\} dx
         = \log \sqrt{2\pi\sigma^2} + \frac{1}{2}\log e
         = \log \sqrt{2\pi e \sigma^2}.

Page 215: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 219

Entropy of analog source

Gaussian process:

[Definition] Let probability distribution of variables Xt1, Xt2, …, Xtn at

time t1,t2,…,tn be P(Xt1,Xt2,…,Xtn). If P is subject to multi-dimensional

Gaussian distribution, we call this process as a Gaussian process. If this

process is subject to stationary Markov process, we call it a stationary

Markov process. If a power spectrum density n(w) of Gaussian process

has a constant value regardless to frequencies, we call it a white Gaussian

noise or process.

If a white Gaussian noise is band-limited to the frequency range W,

    n(w) = f_0   (|w| \le 2\pi W),
    n(w) = 0     (|w| > 2\pi W).

Furthermore, if this white Gaussian noise is limited to a time period T, the process is determined by samples taken every 1/2W, i.e. x1, x2, ..., x2TW. Let the power at each sample be N; the Entropy at each sample point is then

    H = \log \sqrt{2\pi e N}.

Page 216: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 220

Entropy of analog source

Therefore the Entropy of all 2TW samples is

    H_{total} = 2TW \log \sqrt{2\pi e N}.

Page 217: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 221

Maximum Entropy

Distribution function with maximum Entropy:

Find the probability distribution function with maximum Entropy under specific conditions. Suppose we have constraints of the form

    \int_a^b \varphi_1(x, p(x)) \, dx = k_1,
    \int_a^b \varphi_2(x, p(x)) \, dx = k_2,
    ...
    \int_a^b \varphi_n(x, p(x)) \, dx = k_n.

We find the p(x) that maximizes an objective function

    I = \int_a^b F(x, p(x)) \, dx

by the Lagrangian method.

Page 218: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 222

Maximum Entropy

[Case where the average power of x is given]

Let the average power be \sigma^2. We maximize

    H(X) = -\int p(x) \log p(x) \, dx

subject to

    \int x^2 p(x) \, dx = \sigma^2,   \int p(x) \, dx = 1.

The p(x) that maximizes H(X) is

    p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right),

and the Entropy with this p(x) is

    H(X) = -\int p(x) \log p(x) \, dx = \log \sqrt{2\pi e \sigma^2}.

Page 219: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 223

Maximum Entropy

Maximum Entropy theorem:

The probability distribution function of average power \sigma^2 that has maximum Entropy is the Gaussian distribution

    p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{x^2}{2\sigma^2} \right).

The Entropy of the Gaussian distribution is given by

    H(X) = -\int p(x) \log p(x) \, dx = \log \sqrt{2\pi e \sigma^2}.

Page 220: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 224

Mutual information

Let the joint probability distribution be p(x, y). Divide the range of x into intervals \Delta x and the range of y into intervals \Delta y. Then p(x_i)\Delta x, p(y_i)\Delta y and p(x_i, y_i)\Delta x \Delta y are the probabilities that x takes a value between x and x + \Delta x, that y takes a value between y and y + \Delta y, and that x and y jointly take values in the corresponding region, respectively. The mutual information is given by

    I(x_i; y_i) = \log \frac{p(x_i, y_i)\Delta x \Delta y}{p(x_i)\Delta x \, p(y_i)\Delta y} = \log \frac{p(x_i, y_i)}{p(x_i) p(y_i)}.

Page 221: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 225

Mutual information

The average mutual information is

    I(X; Y) = \lim_{\Delta x, \Delta y \to 0} \sum_i \sum_j p(x_i, y_j)\Delta x \Delta y \log \frac{p(x_i, y_j)}{p(x_i) p(y_j)}
            = \iint p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy
            = H(Y) - H(Y|X)
            = H(X) - H(X|Y).

I(X; Y) is non-negative,

    I(X; Y) \ge 0,

with equality if and only if p(x, y) = p(x) p(y).

Page 222: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 226

Rate distortion for analog source

Rate-distortion function:

Let the distortion measure be d(x, y); the average distortion is given by

    \bar d = \iint d(x, y) \, p(x, y) \, dx \, dy,

where p(x, y) is the joint probability distribution of a source sample value x and its decoded result y. We define the rate-distortion function in the same manner as in the discrete-symbol case:

    R(D) = \min_{\bar d \le D} I(X; Y)   bits/sample

is the minimum mutual information of X and Y under the condition that the average distortion is smaller than the threshold D. R(D) provides the lower bound on the average code length per source sample when we code the source with binary codes under the condition that \bar d is smaller than D.

Page 223: Spoken Language Translation Information Theory: Source Coding

2012/10/24 Prof. Satoshi Nakamura 227

Rate distortion of Gaussian source

We use the mean square error as the distortion measure,

    d(x, y) = (x - y)^2,

so the average distortion is the mean square error:

    \bar d = \iint p(x) p(y|x) (x - y)^2 \, dx \, dy \le D.

Under the above condition, we minimize I(X; Y) with respect to P(y|x),

    I(X; Y) = \iint p(x) p(y|x) \log \frac{p(y|x)}{p(y)} \, dx \, dy,   where   p(y) = \int p(x) p(y|x) \, dx,

and

    I(X; Y) = H(X) - H(X|Y).

If the information source is Gaussian, we can use

    H(X) = \log \sqrt{2\pi e \sigma^2},

so we may maximize H(X|Y) instead of minimizing I(X; Y).


Rate distortion of Gaussian source

Let Z be the random variable Z = X - Y. Then

H(X \mid Y) = H(X - Y \mid Y) = H(Z \mid Y) \le H(Z),

with equality if and only if Z and Y are independent. Since E[Z^2] = \bar{d} is smaller than D, H(Z) is maximized when the error Z follows a Gaussian distribution with mean 0 and variance D, according to the maximum entropy theorem. Then we have

H(Z) \le \log\sqrt{2\pi e D}.

Therefore,

I(X; Y) \ge \log\sqrt{2\pi e\sigma^2} - \log\sqrt{2\pi e D} = \frac{1}{2}\log\frac{\sigma^2}{D}.

Finally, R(D) is given by

R(D) = \frac{1}{2}\log\frac{\sigma^2}{D} \quad \text{bit/sample}.
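A small worked example of this formula in Python (illustrative numbers only; the bandwidth W used below is an assumed value):

import numpy as np

def rate_distortion_gaussian(sigma2, D):
    # R(D) = 0.5*log2(sigma^2/D) bit/sample for 0 < D <= sigma^2, else 0.
    return 0.5 * np.log2(sigma2 / D) if D < sigma2 else 0.0

sigma2 = 1.0
for D in (0.5, 0.25, 0.1):
    print(D, rate_distortion_gaussian(sigma2, D))
# D = 0.25 needs 1 bit/sample; halving D costs another 0.5 bit/sample.

W = 4000.0   # assumed bandwidth in Hz
D = 0.25
print(2 * W * rate_distortion_gaussian(sigma2, D), "bit/s")   # equals W*log2(sigma^2/D)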


Rate distortion of Gaussian source

When the source signal is band-limited to 0–W [Hz], we can take 2W samples per second, so the rate-distortion function per second is given by

R(D) = W\log\frac{\sigma^2}{D} \quad \text{bit/second}.


Coding of analog signal

Scalar quantization:
Scalar quantization is a discretization of the value of a source sample; the result is called a quantized sample. If we use a B-bit binary representation, each quantized sample takes one of 2^B levels. Therefore, the information necessary for transmission or storage is

I = B F_s \quad \text{bit/second},

where F_s is the sampling frequency. This coding is called PCM (Pulse Code Modulation).
The important point is to reduce the necessary bit rate, so we exploit the probability distribution of the signal amplitude. The quantization that minimizes the mean square error for a fixed number of quantization levels N is called optimal quantization.
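For instance, with CD-style parameters (assumed here purely for illustration), the PCM bit rate I = B·F_s works out as follows:

# PCM bit rate I = B * Fs (the parameters below are assumed CD values).
B = 16          # bits per quantized sample
Fs = 44100      # sampling frequency [Hz]
print(B * Fs, "bit/s per channel")   # 705600 bit/s; a stereo stream is twice this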


Coding of analog source

Signal to Noise ratio:

Let the peak-to-peak range of the target signal be 2X_max. The quantization step of B-bit quantization is

\Delta = \frac{2X_{\max}}{2^B}.

If we assume that the quantization noise amplitude is uniformly distributed over one step, its power is

E[e^2(n)] = \frac{\Delta^2}{12} = \frac{X_{\max}^2}{3\cdot 2^{2B}}.

The SNR will be

\mathrm{SNR} = \frac{E[x^2(n)]}{E[e^2(n)]} = 3\cdot 2^{2B}\,\frac{\sigma_x^2}{X_{\max}^2},

and its representation in dB is

\mathrm{SNR}\,[\mathrm{dB}] = 6B + 4.77 - 20\log_{10}\frac{X_{\max}}{\sigma_x}.
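A quick numerical check of the SNR formula (the loading factor X_max = 4·σ_x is an assumption chosen for illustration; 6.02 is the exact value of 20·log10(2), which the formula above rounds to 6):

import numpy as np

def snr_db(B, xmax_over_sigma):
    # SNR[dB] = 6.02*B + 4.77 - 20*log10(Xmax/sigma_x) for a uniform B-bit quantizer
    return 6.02 * B + 4.77 - 20 * np.log10(xmax_over_sigma)

print(snr_db(8, 4.0))    # about 41 dB
print(snr_db(16, 4.0))   # about 89 dB; each extra bit buys roughly 6 dB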


Coding of analog source

Transform coding:

Suppose two consecutive samples x_1, x_2 have the uniform joint probability density depicted in the figure (a rectangle with side lengths 2a and 2b, rotated by 45 degrees):

p(x_1, x_2) = C \ \text{inside the region},\qquad p(x_1, x_2) = 0 \ \text{outside}.

The range of the x_1 and x_2 values is then

|x_1|,\; |x_2| \le \frac{a+b}{\sqrt{2}}.

With quantization step Δ, the number of quantization levels per axis will be

L_1 = L_2 = \frac{\sqrt{2}\,(a+b)}{\Delta},

so we need

B_x = \log_2 L_1 + \log_2 L_2 = \log_2\frac{2(a+b)^2}{\Delta^2}

bits to quantize x = (x_1, x_2).
If we rotate the axes by 45 degrees to obtain the new basis (u_1, u_2), then u_1 and u_2 are independent, and the necessary quantization levels are L_1 = 2a/Δ for u_1 and L_2 = 2b/Δ for u_2.


Coding of analog source

Namely, we need

B_u = \log_2 L_1 + \log_2 L_2 = \log_2\frac{4ab}{\Delta^2}

bits to quantize u = (u_1, u_2). For example, if a = 2b,

B_x - B_u \approx 1.17 \ \text{bits}.
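To see where the 1.17 comes from (using the level counts reconstructed above):

B_x - B_u = \log_2\frac{2(a+b)^2/\Delta^2}{(2a/\Delta)(2b/\Delta)}
          = \log_2\frac{(a+b)^2}{2ab},

and with a = 2b this becomes \log_2(9/4) \approx 1.17 bits.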


Vector quantization

Vector quantization is a method that quantizes not a single sample but a set of n samples together.
Suppose the source samples are independent of each other and have a uniform distribution. Quantizing the pair (x_0, x_1) scalar-wise is equivalent to assigning the sample to the center point of a square cell obtained by splitting the two-dimensional (x_0, x_1) plane into squares. The cell area is Δ², the quantization error is Δ²/6 per vector, and the average mean square error per sample is Δ²/12, the same as for scalar quantization.
If we change the shape of the cell to a regular hexagon (circumradius Δ), the cell area is 3√3Δ²/2 and the squared quantization error integrated over one cell is 5√3Δ⁴/8, with the same number of representative points.
If we make the areas of the square cell and the hexagonal cell equal, the average error power of the hexagon becomes 5√3/9 ≈ 0.962 times that of the square.
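One way to check the 0.962 figure (a sketch comparing cells of equal area A):

\text{square cell: } E\!\left[\lVert e\rVert^2\right] = \frac{A}{6},\qquad
\text{hexagonal cell: } E\!\left[\lVert e\rVert^2\right] = \frac{5\sqrt{3}}{54}A,\qquad
\frac{5\sqrt{3}A/54}{A/6} = \frac{5\sqrt{3}}{9} \approx 0.962.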


Vector quantization

[Figure: representative points of a square cell (left) and of a hexagonal cell (right) in two-dimensional vector quantization]


Vector quantization

Vector quantization is a quantization method that codes a source vector (x_0, x_1, ..., x_{n-1}), composed of n consecutive samples, to the closest representative code chosen from a set of representative codes in the n-dimensional sample space (X_0, X_1, ..., X_{n-1}).
If we apply vector quantization to the source samples so as to minimize the average distortion and then apply distortion-less source coding, we obtain a code whose average length per sample approaches the lower bound R(D) as n grows.


Vector quantization

Representative points are called code words or code vectors. A set of code words is called a codebook.

Codebook design algorithm:
There is no known optimal algorithm for codebook design. Here we introduce a semi-optimal iterative codebook design algorithm.
Given k training samples x_1, x_2, ..., x_k, the centroid is defined by

\hat{x} = C(x_1, x_2, \ldots, x_k) = \arg\min_{x} \sum_{i=1}^{k} d(x, x_i),

where \arg\min_{n} f(n) means the operation of finding the n that minimizes f(n).


Vector quantization

LBG (Linde, Buzo, Gray) Algorithm

Initialization (Step 1)
Let the training sample set be x_j, j = 0, ..., n-1; let N be the codebook size, m = 0, ε the distortion threshold, and D_{-1} = ∞. Set an initial codebook A_N^{(0)} = \{y_0^{(0)}, \ldots, y_{N-1}^{(0)}\} randomly.

Partitioning (Step 2)
Cluster the x_j into N partial sets S_i, i = 0, ..., N-1, by nearest-neighbour assignment,

\hat{i}(x_j) = \arg\min_{i} d(x_j, y_i^{(m)}),\qquad S_i^{(m)} = \{\,x_j \mid \hat{i}(x_j) = i\,\},

where the average distortion is given by

D_m = \frac{1}{n}\sum_{j=0}^{n-1} \min_i d(x_j, y_i^{(m)}).

Convergence test and update (Step 3)
If (D_{m-1} - D_m)/D_m \le \varepsilon, then stop and take A_N^{(m)} as the codebook; otherwise calculate the new code vectors as the centroids of the partial sets,

y_i^{(m+1)} = C(S_i^{(m)}),\qquad A_N^{(m+1)} = \{y_0^{(m+1)}, \ldots, y_{N-1}^{(m+1)}\},

set m = m + 1 and go to Step 2.
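A minimal Python sketch of the LBG iteration described above (an illustration using squared-error distortion, not the lecture's reference implementation; train is an (n, K) array of training vectors and codebook an (N, K) array of initial code vectors):

import numpy as np

def lbg(train, codebook, eps=1e-3, max_iter=100):
    codebook = codebook.astype(float).copy()
    prev_distortion = np.inf
    for _ in range(max_iter):
        # Partitioning: assign every training vector to its nearest code vector.
        d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        distortion = d2[np.arange(len(train)), nearest].mean()
        # Stop when the relative decrease of the average distortion is small.
        if (prev_distortion - distortion) / distortion <= eps:
            break
        prev_distortion = distortion
        # Centroid update: each code vector becomes the mean of its cluster.
        for i in range(len(codebook)):
            members = train[nearest == i]
            if len(members) > 0:
                codebook[i] = members.mean(axis=0)
    return codebook, distortion

Starting it from N randomly chosen training vectors, e.g. lbg(train, train[np.random.default_rng(0).choice(len(train), 8, replace=False)]), reproduces the random initialization of Step 1.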


Vector quantization

Splitting algorithm:

(Step 1) Initialization:
A_{0,1} = C(x_1, x_2, \ldots, x_n), the centroid of all training samples; ε: an arbitrary vector with small norm; M = 1.

(Step 2) Split each code vector y_i of A_{0,M} = \{y_0, y_1, \ldots, y_{M-1}\} into the two neighbouring vectors y_i + ε and y_i - ε:

A_{0,2M} = \{\,y_0 + \varepsilon,\; y_0 - \varepsilon,\; \ldots,\; y_{M-1} + \varepsilon,\; y_{M-1} - \varepsilon\,\}.

(Step 3) Using A_{0,2M} as the initial values, find a sub-optimal codebook A_{0,2M} = \{\,y_i;\; i = 0, 1, \ldots, 2M-1\,\} by the LBG algorithm. If 2M = N, then stop; otherwise set M = 2M and go to Step 2.

In this way, splitting combined with the LBG algorithm generates a codebook whose size N is a power of two.
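A matching sketch of the splitting procedure, reusing the lbg() function from the previous sketch (the scalar perturbation epsilon stands in for the small-norm vector ε; N is assumed to be a power of two):

import numpy as np

def design_codebook(train, N, epsilon=1e-2, eps=1e-3):
    # Step 1: start from the centroid of all training vectors (codebook size 1).
    codebook = train.mean(axis=0, keepdims=True)
    while len(codebook) < N:
        # Step 2: split every code vector into two neighbours y + epsilon and y - epsilon.
        codebook = np.concatenate([codebook + epsilon, codebook - epsilon])
        # Step 3: refine the doubled codebook with the LBG algorithm.
        codebook, _ = lbg(train, codebook, eps=eps)
    return codebook   # final size N, a power of two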


D(R) function

Distortion-rate function:
Let x be N consecutive samples of x(n), coded into y by vector quantization with a codebook of size L. The distortion-rate function is given by

D_N(R) = \min_{\,y:\;\frac{1}{N}H(y)\le R} E[d(x, y)],\qquad D(R) = \lim_{N\to\infty} D_N(R).

D(R) represents the minimum average distortion attainable for a given rate R. Conversely, R(D) represents the minimum rate, i.e. the minimum average code length, for a given distortion D.


Vector quantization

Tree-search VQ:
Build a tree-structured codebook in which each node of the tree represents a code vector obtained by the splitting algorithm. The computation of tree-search VQ is K log_2 N, compared with K·N for a full search, where K is the parameter dimension. The memory size increases to roughly twice that of full search.

[Figure: tree-search VQ, from the input vector through the tree of code vectors to the code output]


Multi-step VQ

Combine multiple vector quantizers in cascade to reduce computation. The code of each quantizer is sent to the channel. The number of multiplications can be reduced from K·N·M to K·(N + M).

[Figure: multi-step VQ, an input vector passed through two cascaded vector quantizers, each with its own codebook]


Gain/Shape vector quantization

Gain/shape vector quantization:
The codebook is composed of products of N_g scalar gains g_1, g_2, ..., g_{N_g} and N_s unit (shape) vectors u_1, u_2, ..., u_{N_s}:

\{\,g_i u_j \mid i = 1, 2, \ldots, N_g;\; j = 1, 2, \ldots, N_s\,\},

where \{g_1, g_2, \ldots, g_{N_g}\} is called the gain codebook and \{u_1, u_2, \ldots, u_{N_s}\} the shape codebook. The coding algorithm is as follows.

Shape quantization:
Take the inner product between the input vector x and each u in the shape codebook u_1, u_2, ..., u_{N_s}, and find the unit vector u_l that gives the maximum inner product.

Gain quantization:
Find the scalar value in the gain codebook g_1, g_2, ..., g_{N_g} closest to the maximum inner product (x, u_l). Then g_k u_l is the quantized vector, chosen out of N_g·N_s possible quantization vectors. The number of multiplications is therefore reduced from K·N_g·N_s to K·N_s, and the memory size from K·N_g·N_s to K·N_s + N_g.
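A compact sketch of this encoding step (illustrative names and array shapes: shape_cb is an (N_s, K) array of unit vectors, gain_cb an (N_g,) array of gains):

import numpy as np

def gain_shape_encode(x, shape_cb, gain_cb):
    inner = shape_cb @ x                             # K*Ns multiplications
    l = int(np.argmax(inner))                        # shape index with the largest inner product
    k = int(np.argmin(np.abs(gain_cb - inner[l])))   # gain closest to that inner product
    return k, l                                      # quantized vector is gain_cb[k] * shape_cb[l]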


Gain/shape vector quantization

[Figure: gain/shape vector quantization, the output code vector formed as a gain times a shape code vector]


Speech Coding

[Figure: bit rates (bit/sec) of speech coding methods]


Waveform Coding

PCM (Pulse Code Modulation), used in CD and DAT.
If the signal is band-limited to 0–W [Hz], it can be sampled with sampling interval T = 1/(2W) [s].

[Figure: concept of PCM, sampling followed by quantization of each sample]


Waveform coding (PCM)

Quantization
Let the quantization step be Δ, the number of quantization bits be B, and the range of the signal amplitude be L; then Δ = L / 2^B.


Waveform coding

[Figure: speech waveform with non-uniform (µ-law) quantization]


Waveform coding (µ-law)

µ-law companding (µ = 255) is used for ISDN.
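The standard µ-law compression curve (µ = 255) can be written as follows (a minimal sketch for an input normalized to [-1, 1]):

import numpy as np

def mu_law_compress(x, mu=255.0):
    # F(x) = sgn(x) * ln(1 + mu*|x|) / ln(1 + mu)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

print(mu_law_compress(np.array([0.01, 0.1, 0.5, 1.0])))
# small amplitudes are expanded, large amplitudes compressed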


Waveform coding (DPCM)

DPCM (Differential PCM)


Waveform coding (DPCM)

In the example shown, if the quantization step is 1, the required number of quantization bits B is 5.
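Since the block diagram itself is not reproduced here, the following is a minimal first-order DPCM sketch (an illustration assuming a simple previous-sample predictor, not necessarily the lecture's exact scheme):

def dpcm_encode(x, delta=1.0):
    # Quantize the difference between each sample and the previous reconstruction.
    codes, pred = [], 0.0
    for s in x:
        q = int(round((s - pred) / delta))   # quantized prediction error
        codes.append(q)
        pred += q * delta                    # track the decoder-side reconstruction
    return codes

def dpcm_decode(codes, delta=1.0):
    out, pred = [], 0.0
    for q in codes:
        pred += q * delta
        out.append(pred)
    return out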


Waveform coding (APCM)

APCM (Adaptive PCM)


Waveform coding (APCM)

APCM (Adaptive PCM)

[Figure: APCM example showing the required quantization bits]


Waveform coding (ADPCM)

ADPCM (Adaptive Differential PCM)


Waveform coding (ADPCM)

ADPCM (Adaptive Differential PCM)

[Figure: ADPCM example showing the required quantization bits]


Parametric speech coding

[Figure: source-filter model of speech: an excitation signal (pitch frequency) drives a resonance filter representing the vocal tract (phonetic content) to produce the speech waveform]


Parametric speech coding

[Figure: parametric speech coder: the speech waveform is framed; short-term prediction yields linear prediction coefficients, coded with an LPC codebook, and an excitation signal, whose pitch interval is approximated and coded with a pitch codebook; the codes drive the resonance filter at the decoder]


Parametric speech coding

Key points of the parametric speech model:
The excitation signal is approximated by an impulse sequence, so the bit rate can be reduced substantially; however, the speech quality degrades seriously.


Parametric speech coding (CELP)

CELP (Code-Excited Linear Prediction): used in cellular phones.

[Figure: CELP encoder: the framed speech waveform is analysed by short-term prediction (linear prediction coefficients, coded with an LPC codebook) and long-term prediction (pitch interval, coded with a pitch codebook); the residual signal is coded with a residual-signal codebook and a gain codebook, and the four codes are transmitted]


Parametric speech coding (CELP)

CELP (Code-Excited Linear Prediction): used in cellular phones.

[Figure: CELP analysis-by-synthesis loop: each candidate from the residual-signal codebook, scaled by a gain, is passed through the long-term and short-term prediction filters, and the code and gain that minimize the perceptually weighted mean square error against the input speech waveform are selected]


Speech coding

[Figure: speech quality versus bit rate for waveform coding, hybrid coding, and parametric coding]


Music coding

Music coding exploits auditory characteristics rather than a source model.

[Figure: minimum audible limit in a quiet environment, plotted as sound pressure level versus frequency, separating the audible region from the non-audible region]


Music coding

Frequency masking

[Figure: frequency masking effect: a strong component raises the audibility threshold around it, so nearby weaker components become non-audible; sound pressure level versus frequency]


Music coding

Temporal masking

[Figure: temporal masking curve, with backward masking before the masker and forward masking after it]


Music coding

[Figure: music coder block diagram: a subband filter bank splits the signal, the masking threshold is estimated, scale factors are selected, and the subband samples are quantized accordingly]


Music coding

Permissible error estimation


MP3: MPEG-1/L3, MPEG-2


                         MPEG-1/Audio       MPEG-2/Audio
Standard No.             11172-3            13818-3
IS year                  1992               1994
Extensions               -                  Low sampling frequencies; multi-lingual, multi-channel

                         MPEG-1             MPEG-2 (low sampling fq.)   MPEG-2 (multi-channel)
Sampling freq. [kHz]     32, 44.1, 48       16, 22.05, 24               32, 44.1, 48
Layer                    I / II / III       I / II / III                I / II / III
Bit rate min [kbit/s]    32 / 32 / 32       32 / 8 / 8                  32 / 32 / 32
Bit rate max [kbit/s]    448 / 384 / 320    256 / 160 / 160             -
Channel modes            1/0, 2/0           1/0, 2/0                    1/0, 2/0, 3/0, 2/1, 2/2, 3/1, 3/2