end-to-end data outline presentation formatting data compression

End-to-End Data

Outline

Presentation Formatting

Data Compression

Problem

• Making sure the sender and receiver see the same data is often called the presentation format problem.

• Making the encoding efficient involves error detection/correction and data compression.


• The transformation of data from the representation used by the application into a form suitable for transmission over a network is called presentation formatting.

• The sending program encodes data into a message and the receiving application decodes the message into data.

• Encoding is sometimes called argument marshalling, and decoding is called unmarshalling.


• Data types we consider
– integers
– floats
– strings
– arrays
– structs

[Figure: application data → presentation encoding → Message, Message, Message, … → presentation decoding → application data]

• Types of data we do not consider
– images
– video
– multimedia documents

Difficulties

• Representation of base types
– floating point: IEEE 754 versus non-standard
– integer: big-endian versus little-endian (e.g., 34,677,374)

• Compiler layout of structures

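[Figure: byte order for the integer 34,677,374, whose bytes from most to least significant are (2)(17)(34)(126).
Big-endian, low address to high address: (2)(17)(34)(126) = 00000010 00010001 00100010 01111110
Little-endian, low address to high address: (126)(34)(17)(2) = 01111110 00100010 00010001 00000010]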
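The byte-order difference can be sketched in a few lines of C. This is only an illustration: the helper names put_be32, put_le32, and get_be32 are made up for this example. Building the bytes with shifts, rather than copying the integer's memory, gives the same result on any host.

```c
#include <stdint.h>

/* Store a 32-bit value most-significant byte first (big-endian). */
static void put_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Store a 32-bit value least-significant byte first (little-endian). */
static void put_le32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)v;
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}

/* Rebuild the value from big-endian bytes, whatever the host's byte order. */
static uint32_t get_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

For 34,677,374, put_be32 produces the bytes 2, 17, 34, 126 and put_le32 produces 126, 34, 17, 2.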

Taxonomy

• Data types
– base types (e.g., ints, floats); must convert
– flat types (e.g., structures, arrays); must pack
– complex types (e.g., pointers); must linearize

• Conversion Strategy
– canonical intermediate form
– receiver-makes-right (an N x N solution)

[Figure: an application data structure passing through the marshaller]

Taxonomy (cont)

• Tagged versus untagged data

• Stubs
– compiled
– interpreted

[Figure: a tagged integer: type = INT, len = 4, value = 417892]

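[Figure: the stub compiler reads the interface descriptor (specification) for procedure P and generates code for the client stub and the server stub. Call path: Call P → client stub (arguments → marshalled arguments) → RPC → message → RPC → server stub (marshalled arguments → arguments) → P]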

eXternal Data Representation (XDR)

• Defined by Sun for use with SunRPC
• C type system (without function pointers)
• Canonical intermediate form
• Untagged (except array length)
• Compiled stubs

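#include <rpc/rpc.h>

#define MAXNAME 256
#define MAXLIST 100

struct item {
    int  count;
    char name[MAXNAME];
    int  list[MAXLIST];
};

/* Encode or decode one item, depending on the direction set in xdrs. */
bool_t
xdr_item(XDR *xdrs, struct item *ptr)
{
    /* xdr_string and xdr_array take the address of a pointer, so point
       temporaries at the fixed-size arrays inside the structure. */
    char *name = ptr->name;
    char *list = (char *)ptr->list;

    return (xdr_int(xdrs, &ptr->count) &&
            xdr_string(xdrs, &name, MAXNAME - 1) &&
            xdr_array(xdrs, &list, (u_int *)&ptr->count, MAXLIST,
                      sizeof(int), (xdrproc_t)xdr_int));
}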

[Figure: XDR encoding of an item. Count: 3. Name: length 7, then J O H N S O N, padded to a 4-byte boundary. List: length 3, then 497, 8321, 265.]
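To make the wire layout concrete, here is a hand-rolled sketch of the same encoding. It does not use Sun's XDR library; the helpers enc_int, enc_string, enc_int_array, and enc_item are hypothetical names for this example only. It assumes the usual XDR rules: 4-byte big-endian integers, strings sent as a length word plus bytes zero-padded to a multiple of 4, and variable-length arrays sent as a count followed by the elements.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: append a 4-byte big-endian integer; returns the new offset. */
static size_t enc_int(uint8_t *buf, size_t off, int32_t v)
{
    uint32_t u = (uint32_t)v;
    buf[off]     = (uint8_t)(u >> 24);
    buf[off + 1] = (uint8_t)(u >> 16);
    buf[off + 2] = (uint8_t)(u >> 8);
    buf[off + 3] = (uint8_t)u;
    return off + 4;
}

/* Hypothetical helper: length word, then the bytes, zero-padded to a multiple of 4. */
static size_t enc_string(uint8_t *buf, size_t off, const char *s)
{
    size_t len = strlen(s);
    size_t padded = (len + 3) & ~(size_t)3;

    off = enc_int(buf, off, (int32_t)len);
    memset(buf + off, 0, padded);
    memcpy(buf + off, s, len);
    return off + padded;
}

/* Hypothetical helper: element count, then each element as a 4-byte integer. */
static size_t enc_int_array(uint8_t *buf, size_t off, const int32_t *a, int32_t n)
{
    off = enc_int(buf, off, n);
    for (int32_t i = 0; i < n; i++)
        off = enc_int(buf, off, a[i]);
    return off;
}

/* Encode count, name, and list in the same order as xdr_item.
   The caller must supply a buffer that is large enough. */
static size_t enc_item(uint8_t *buf, int32_t count, const char *name,
                       const int32_t *list)
{
    size_t off = 0;
    off = enc_int(buf, off, count);
    off = enc_string(buf, off, name);
    off = enc_int_array(buf, off, list, count);
    return off;
}
```

Encoding count = 3, name = "JOHNSON", list = {497, 8321, 265} gives 32 bytes: 4 for the count, 4 + 8 for the name, and 4 + 12 for the list. No type tags appear anywhere; only the string and array lengths are on the wire.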

Abstract Syntax Notation One (ASN.1)

• An ISO standard
• Essentially the C type system
• Canonical intermediate form
• Tagged
• Compiled or interpreted stubs
• BER: Basic Encoding Rules

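• Each value is encoded as a (tag, length, value) triple; the value may itself contain nested (type, length, value) triples.

[Figure: nested (type, length, value) triples. Example: type = INT, length = 4, value = 4-byte integer]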
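A minimal sketch of a (tag, length, value) encoding for one integer follows. The function name ber_encode_int32 is made up for this example. It assumes the BER conventions for INTEGER: tag byte 0x02, a one-byte length, and the value in the shortest big-endian two's-complement form. The slide's example shows a 4-byte value; small values come out shorter here.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode v as (tag, length, value); returns the bytes written (at most 6). */
static size_t ber_encode_int32(uint8_t *out, int32_t v)
{
    uint32_t u = (uint32_t)v;
    uint8_t be[4];
    size_t skip = 0;

    be[0] = (uint8_t)(u >> 24);
    be[1] = (uint8_t)(u >> 16);
    be[2] = (uint8_t)(u >> 8);
    be[3] = (uint8_t)u;

    /* Drop leading bytes that only repeat the sign bit. */
    while (skip < 3 &&
           ((be[skip] == 0x00 && (be[skip + 1] & 0x80) == 0) ||
            (be[skip] == 0xFF && (be[skip + 1] & 0x80) != 0)))
        skip++;

    out[0] = 0x02;                 /* tag: INTEGER */
    out[1] = (uint8_t)(4 - skip);  /* length in bytes, short form */
    for (size_t i = skip; i < 4; i++)
        out[2 + i - skip] = be[i]; /* value, most-significant byte first */
    return 2 + (4 - skip);
}
```

For 417892 (0x066064) this writes the five bytes 02 03 06 60 64. Because each item carries its own type and length, a receiver can parse the stream without knowing the structure in advance, at the cost of two extra bytes per item.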

Network Data Representation (NDR)

• Defined by DCE
• Essentially the C type system
• Receiver-makes-right (architecture tag)
• Individual data items untagged
• Compiled stubs from IDL
• 4-byte architecture tag
– IntegerRep: 0 = big-endian, 1 = little-endian
– CharRep: 0 = ASCII, 1 = EBCDIC
– FloatRep: 0 = IEEE 754, 1 = VAX, 2 = Cray, 3 = IBM

[Figure: layout of the 4-byte architecture tag. Bits 0-3: IntegrRep; bits 4-7: CharRep; bits 8-15: FloatRep; bits 16-23: Extension 1; bits 24-31: Extension 2]
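Receiver-makes-right can be sketched as follows: the receiver reads the architecture tag once, then interprets each integer in the sender's byte order. The struct and function names are hypothetical, and the field positions follow the layout on the slide, assuming bit 0 is the most significant bit of the first byte.

```c
#include <stdint.h>

enum { INT_BIG_ENDIAN = 0, INT_LITTLE_ENDIAN = 1 };

/* Fields of the 4-byte architecture tag, following the slide's layout. */
struct arch_tag {
    unsigned integer_rep;  /* bits 0-3  */
    unsigned char_rep;     /* bits 4-7  */
    unsigned float_rep;    /* bits 8-15 */
};

/* Split the first two tag bytes into fields (extension bytes are ignored). */
static struct arch_tag parse_arch_tag(const uint8_t tag[4])
{
    struct arch_tag t;
    t.integer_rep = tag[0] >> 4;    /* assumption: bit 0 is the top bit */
    t.char_rep    = tag[0] & 0x0F;
    t.float_rep   = tag[1];
    return t;
}

/* Receiver-makes-right: read the sender's bytes using the sender's byte order. */
static uint32_t read_u32(const uint8_t p[4], unsigned integer_rep)
{
    if (integer_rep == INT_LITTLE_ENDIAN)
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

A sender never converts its data in this scheme. If both ends use the same representation, the receiver's read is a plain copy, which is the main attraction over a canonical intermediate form.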

Compression Overview

• Encoding and Compression
– Huffman codes

• Lossless
– data received = data sent
– used for executables, text files, numeric data

• Lossy
– data received != data sent
– used for images, video, audio

Huffman Codes

• Huffman coding [1952] can be used as a reasonable approximation to the theoretical limit.

1. Write down the symbols and their probabilities:

A B C D

.50 .30 .15 .05

They are the terminal nodes.

2. Find and mark the two smallest nodes. Add a node with arcs to the nodes marked.

3. Set the probability of the new node to the sum of marked nodes.

4. Repeat steps 2 and 3 until every node except one, the root, has been marked.

5. The encoding is found by tracing the path from the root to the symbol, with left=0, right=1.

Huffman Codes

           ()
         0/  \1
         /    ()
        /   0/  \1
       /    /    ()
      /    /   0/  \1
    (A)  (B)  (C)  (D)
    .5   .3   .15  .05
    0    10   110  111
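The five steps can be sketched in C as below. This is a minimal illustration and not a production encoder: probabilities are passed as integer weights (percent) to avoid floating-point ties, the names are made up for this example, and the choice to put the larger of the two marked nodes on the left is an assumption made so the result matches the tree on the slide.

```c
#include <string.h>

#define MAXSYM 16

struct node {
    int weight;       /* probability scaled to an integer, e.g. percent */
    int left, right;  /* child indices, or -1 for a terminal node */
    int marked;
};

/* Step 2: index of the smallest unmarked node (ties go to the newest node). */
static int smallest_unmarked(const struct node *t, int count)
{
    int best = -1;
    for (int i = 0; i < count; i++)
        if (!t[i].marked && (best < 0 || t[i].weight <= t[best].weight))
            best = i;
    return best;
}

/* Step 5: walk down from the root, appending 0 for left and 1 for right. */
static void assign_codes(const struct node *t, int i, char *prefix, int depth,
                         char codes[][MAXSYM])
{
    if (t[i].left < 0) {              /* terminal node: record its code */
        prefix[depth] = '\0';
        strcpy(codes[i], prefix);
        return;
    }
    prefix[depth] = '0';
    assign_codes(t, t[i].left, prefix, depth + 1, codes);
    prefix[depth] = '1';
    assign_codes(t, t[i].right, prefix, depth + 1, codes);
}

/* Build codes for n symbols (2 <= n <= MAXSYM); codes[i] gets a '0'/'1' string. */
static void huffman(const int *weights, int n, char codes[][MAXSYM])
{
    struct node t[2 * MAXSYM];
    char prefix[MAXSYM];
    int count = n;

    for (int i = 0; i < n; i++)       /* step 1: terminal nodes */
        t[i] = (struct node){ weights[i], -1, -1, 0 };

    while (count < 2 * n - 1) {       /* steps 2-4: n - 1 merges */
        int a = smallest_unmarked(t, count);
        t[a].marked = 1;
        int b = smallest_unmarked(t, count);
        t[b].marked = 1;
        /* New node: larger child on the left, smaller on the right. */
        t[count] = (struct node){ t[a].weight + t[b].weight, b, a, 0 };
        count++;
    }
    assign_codes(t, count - 1, prefix, 0, codes);
}
```

With weights {50, 30, 15, 5} for A, B, C, D, this yields the codes 0, 10, 110, 111. The average code length is 0.5(1) + 0.3(2) + 0.15(3) + 0.05(3) = 1.7 bits per symbol, compared with 2 bits for a fixed-length code.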

Lossless Algorithms

• Run Length Encoding (RLE)
– Replace consecutive occurrences of a given symbol with one copy of the symbol plus a count of how many times it occurs.
– example: AAABBCDDDD is encoded as 3A2B1C4D
– good for scanned text (8-to-1 compression ratio)
– can increase size for data with variation (e.g., some images)

• Differential Pulse Code Modulation (DPCM)
– First output a reference symbol and then, for each symbol in the data, output the difference between that symbol and the reference symbol.
– example: AAABBCDDDD is encoded as A0001123333
– change the reference symbol if the delta becomes too large
– works better than RLE for many digital images (1.5-to-1)
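Both encodings can be sketched on text strings. These are simplified illustrations with made-up function names: the RLE version writes counts in decimal (so the input should not contain digits), and the DPCM version writes each difference as a single digit and never switches to a new reference symbol.

```c
#include <stdio.h>
#include <string.h>

/* RLE: write each run as <count><symbol>. The caller supplies an output
   buffer large enough for the result (up to twice the input length for
   runs shorter than 10, plus the terminator). */
static void rle_encode(const char *in, char *out)
{
    size_t n = strlen(in);
    size_t i = 0;

    out[0] = '\0';
    while (i < n) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i])
            run++;
        out += sprintf(out, "%zu%c", run, in[i]);
        i += run;
    }
}

/* DPCM: write the reference symbol, then each symbol's difference from it.
   Assumes every difference is in the range 0..9. */
static void dpcm_encode(const char *in, char *out)
{
    size_t n = strlen(in);

    if (n == 0) {
        out[0] = '\0';
        return;
    }
    char ref = in[0];
    *out++ = ref;
    for (size_t i = 0; i < n; i++)
        *out++ = (char)('0' + (in[i] - ref));
    *out = '\0';
}
```

On AAABBCDDDD these produce 3A2B1C4D and A0001123333. On input without runs, such as ABCD, rle_encode produces 1A1B1C1D, which is twice as long as the input; this is the size increase noted in the RLE bullets.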

Dictionary-Based Methods

• Build dictionary of variable-length strings of common terms

• Transmit index into dictionary for each term
– For example, replace ‘compression’ with 9293, because ‘compression’ is the 9293rd entry in /usr/share/dict/words.

• Lempel-Ziv (LZ)
– the compress command is the best-known example

• Commonly achieves a 2-to-1 ratio on text

• Variation of LZ used to compress GIF images
– first reduce 24-bit color to 8-bit color
– treat common sequences of pixels as terms in the dictionary
– not uncommon to achieve 10-to-1 compression (x3 from the color reduction)
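A dictionary that is built from the data itself can be sketched with LZW, the Lempel-Ziv variant associated with compress and GIF. This is a minimal illustration, not the code of either tool: the names are made up, the dictionary lookup is a slow linear search, and the output is an array of integer codes rather than packed bits.

```c
#include <stddef.h>

#define DICT_MAX 4096

/* One dictionary entry: the string for code `prefix` followed by byte `ch`. */
struct entry {
    int prefix;
    unsigned char ch;
};

/* Linear search for the entry (prefix, ch); returns its code or -1. */
static int find(const struct entry *dict, int size, int prefix, unsigned char ch)
{
    for (int i = 256; i < size; i++)
        if (dict[i].prefix == prefix && dict[i].ch == ch)
            return i;
    return -1;
}

/* Encode n bytes of `in` as dictionary indices in `out`; returns the number
   of codes written. Codes 0-255 stand for single bytes; longer strings get
   codes from 256 up as they are first seen. */
static size_t lzw_encode(const unsigned char *in, size_t n, int *out)
{
    static struct entry dict[DICT_MAX];
    int size = 256;
    size_t count = 0;

    if (n == 0)
        return 0;

    int cur = in[0];                  /* code for the longest match so far */
    for (size_t i = 1; i < n; i++) {
        int next = find(dict, size, cur, in[i]);
        if (next >= 0) {
            cur = next;               /* the match can be extended */
        } else {
            out[count++] = cur;       /* emit the index of the match */
            if (size < DICT_MAX) {    /* add match + next byte as a new term */
                dict[size].prefix = cur;
                dict[size].ch = in[i];
                size++;
            }
            cur = in[i];
        }
    }
    out[count++] = cur;
    return count;
}
```

The 7-byte input ABABABA is sent as the 4 codes 65, 66, 256, 258, where 256 stands for AB and 258 for ABA. The receiver can rebuild the same dictionary from the codes alone, so no dictionary needs to be transmitted.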

Image Compression

• JPEG (Joint Photographic Experts Group) is an ISO/IEC group of experts that develops and maintains standards for a suite of compression algorithms for computer image files.

• JPEG is also a term for any graphic image file produced by using a JPEG standard.

• A JPEG file is created by choosing from a range of compression qualities (actually, from one of a suite of compression algorithms).

• Lossy still-image compression

MPEG

• The Moving Picture Experts Group (MPEG) develops standards for digital video and digital audio compression.

• Lossy compression of video
• First approximation: JPEG on each frame
• Also remove inter-frame redundancy
