Post on 21-Dec-2015
End-to-End Data
Outline
Presentation Formatting
Data Compression
Problem
• The sender and receiver must see the same data; the agreed representation is often called the presentation format.
• Encoding efficiency involves error detection/correction and data compression.
Presentation Formatting
• The transformation of data from the representation used by the application into a form suitable for transmission over a network is called presentation formatting.
• The sending program encodes data into a message and the receiving application decodes the message into data.
• Encoding is sometimes called argument marshalling, and decoding is called unmarshalling.
Presentation Formatting
• Data types we consider
– integers
– floats
– strings
– arrays
– structs
[Figure: application data -> presentation encoding -> message, message, message, … -> presentation decoding -> application data]
• Types of data we do not consider
– images
– video
– multimedia documents
Difficulties
• Representation of base types
– floating point: IEEE 754 versus non-standard
– integer: big-endian versus little-endian (e.g., 34,677,374)
• Compiler layout of structures
[Figure: byte order of the integer 34,677,374, bytes listed from low address to high address]
Big-endian:    (2) (17) (34) (126) = 00000010 00010001 00100010 01111110
Little-endian: (126) (34) (17) (2) = 01111110 00100010 00010001 00000010
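The byte-order difficulty can be sketched in C using the slide's example value 34,677,374 (0x0211227E). This is a minimal illustration, not code from the slides; the helper names `put_be32`, `get_be32`, and `host_is_little_endian` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store v most-significant byte first (big-endian), whatever the host order. */
void put_be32(uint8_t out[4], uint32_t v)
{
    assert(out != NULL);
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Rebuild the integer from four big-endian bytes. */
uint32_t get_be32(const uint8_t in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Report how this host lays out a multi-byte integer in memory. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);   /* look at the byte at the lowest address */
    return first == 1;
}
```

Calling `put_be32(buf, 34677374u)` fills `buf` with the bytes 2, 17, 34, 126 on any host, because shifts operate on values rather than on memory layout.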
Taxonomy
• Data types
– base types (e.g., ints, floats); must convert
– flat types (e.g., structures, arrays); must pack
– complex types (e.g., pointers); must linearize
• Conversion strategy
– canonical intermediate form
– receiver-makes-right (an N x N solution)
[Figure: a marshaller operating on an application data structure]
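As one illustration of "must linearize", the sketch below flattens a linked list (a pointer-based structure) into a contiguous buffer that could then be packed into a message. The `list_node` type and `linearize_list` function are hypothetical, and the output layout (a count followed by the values) is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_node {                 /* hypothetical application data structure */
    int32_t value;
    struct list_node *next;
};

/*
 * Flatten the list into out: out[0] holds the element count and
 * out[1..] hold the values in list order.
 * Returns the number of words written, or 0 if cap is too small.
 */
size_t linearize_list(const struct list_node *head, int32_t *out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL);
    if (cap == 0)
        return 0;
    for (; head != NULL; head = head->next) {
        if (n + 1 >= cap)          /* no room left after the count word */
            return 0;
        out[1 + n] = head->value;
        n++;
    }
    out[0] = (int32_t)n;
    return n + 1;
}
```

The pointers themselves are never sent, since an address is meaningless on the receiving machine. Each word in the buffer would still need base-type conversion before transmission.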
Taxonomy (cont)
• Tagged versus untagged data
– example of a tagged item: type = INT, len = 4, value = 417892
• Stubs
– compiled
– interpreted
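[Figure: stub compiler. An interface descriptor (specification) for procedure P is input to the stub compiler, which generates code for the client stub and the server stub. On a call to P, the client stub marshals the arguments and hands them to RPC; the message carries the marshalled arguments to the server's RPC layer, where the server stub unmarshals them and invokes P.]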
eXternal Data Representation (XDR)
• Defined by Sun for use with SunRPC
• C type system (without function pointers)
• Canonical intermediate form
• Untagged (except array length)
• Compiled stubs
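    #include <rpc/rpc.h>

    #define MAXNAME 256
    #define MAXLIST 100

    struct item {
        int  count;
        char name[MAXNAME];
        int  list[MAXLIST];
    };

    /* Encode or decode one item, depending on the direction set in xdrs. */
    bool_t
    xdr_item(XDR *xdrs, struct item *ptr)
    {
        /* xdr_string and xdr_array take the address of a pointer, not of an array. */
        char *name = ptr->name;
        char *list = (char *)ptr->list;

        /* MAXNAME - 1 leaves room for the terminating NUL when decoding. */
        return (xdr_int(xdrs, &ptr->count) &&
                xdr_string(xdrs, &name, MAXNAME - 1) &&
                xdr_array(xdrs, &list, (u_int *)&ptr->count,
                          MAXLIST, sizeof(int), (xdrproc_t)xdr_int));
    }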
[Figure: XDR encoding of an example item, one 4-byte unit per cell]
Count: 3
Name:  7 | J O H N | S O N (padded)
List:  3 | 497 | 8321 | 265
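To make the wire layout concrete without depending on the RPC library, here is a hand-rolled sketch of an XDR-style encoding: every integer is a 4-byte big-endian unit, and a string is a 4-byte length followed by its bytes, zero-padded to a multiple of four. The function names `put_u32` and `xdrlike_encode_item` are hypothetical; this is not the SunRPC API, and it covers only the item layout discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write one 4-byte big-endian unit; returns the number of bytes written. */
static size_t put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
    return 4;
}

/*
 * Encode count, name, and list in that order.
 * Returns the encoded length in bytes, or 0 if buf is too small.
 */
size_t xdrlike_encode_item(uint8_t *buf, size_t cap, const char *name,
                           const int32_t *list, uint32_t count)
{
    size_t name_len, padded, need, off = 0;
    uint32_t i;

    assert(buf != NULL && name != NULL && (list != NULL || count == 0));
    name_len = strlen(name);
    padded = (name_len + 3) & ~(size_t)3;       /* round up to 4 bytes */
    need = 4 + 4 + padded + 4 + 4 * (size_t)count;
    if (cap < need)
        return 0;

    off += put_u32(buf + off, count);               /* count field */
    off += put_u32(buf + off, (uint32_t)name_len);  /* string length */
    memcpy(buf + off, name, name_len);
    memset(buf + off + name_len, 0, padded - name_len);
    off += padded;
    off += put_u32(buf + off, count);               /* array length */
    for (i = 0; i < count; i++)
        off += put_u32(buf + off, (uint32_t)list[i]);
    return off;
}
```

Encoding the name "JOHNSON" with a three-element list takes 32 bytes: 4 for the count, 4 + 8 for the padded string, and 4 + 12 for the array. The array length is the only length the receiver does not already know from the type definition, which is why XDR counts as untagged "except array length".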
Abstract Syntax Notation One (ASN.1)
• An ISO standard
• Essentially the C type system
• Canonical intermediate form
• Tagged
• Compiled or interpreted stubs
• BER: Basic Encoding Rules
(tag, length, value)
[Figure: nested encoding, in which the value field of one (type, length, value) triple holds further (type, length, value) triples]
[Figure: INT | 4 | 4-byte integer]
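The (tag, length, value) idea can be sketched in C as follows. This mirrors the slide's fixed "INT, 4, 4-byte integer" example and is a simplification rather than strict BER, which encodes integers in the minimum number of content octets. The names `tlv_put_int32` and `tlv_get_int32` are hypothetical; the tag value 0x02 is BER's universal tag for INTEGER.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TAG_INT 0x02   /* BER universal tag number for INTEGER */

/* Encode v as tag, length = 4, then 4 big-endian value bytes.
 * Returns bytes written (6), or 0 if buf is too small. */
size_t tlv_put_int32(uint8_t *buf, size_t cap, int32_t v)
{
    uint32_t u = (uint32_t)v;

    assert(buf != NULL);
    if (cap < 6)
        return 0;
    buf[0] = TAG_INT;
    buf[1] = 4;
    buf[2] = (uint8_t)(u >> 24);
    buf[3] = (uint8_t)(u >> 16);
    buf[4] = (uint8_t)(u >> 8);
    buf[5] = (uint8_t)u;
    return 6;
}

/* Decode one triple; the tag lets the receiver check the type
 * before trusting the bytes. Returns 0 on success, -1 on mismatch. */
int tlv_get_int32(const uint8_t *buf, size_t len, int32_t *out)
{
    uint32_t u;

    assert(buf != NULL && out != NULL);
    if (len < 6 || buf[0] != TAG_INT || buf[1] != 4)
        return -1;
    u = ((uint32_t)buf[2] << 24) | ((uint32_t)buf[3] << 16) |
        ((uint32_t)buf[4] << 8)  |  (uint32_t)buf[5];
    memcpy(out, &u, sizeof u);     /* reinterpret as two's complement */
    return 0;
}
```

The two extra bytes per item are the cost of tagging; in exchange, the receiver can validate or even interpret a message without a compiled description of its layout.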
Network Data Representation (NDR)
• Defined by DCE
• Essentially the C type system
• Receiver-makes-right (architecture tag)
• Individual data items untagged
• Compiled stubs from IDL
• 4-byte architecture tag
– IntegerRep: 0 = big-endian, 1 = little-endian
– CharRep: 0 = ASCII, 1 = EBCDIC
– FloatRep: 0 = IEEE 754, 1 = VAX, 2 = Cray, 3 = IBM
[Figure: 4-byte architecture tag. Bits 0-3: IntegerRep; bits 4-7: CharRep; bits 8-15: FloatRep; bits 16-23: Extension 1; bits 24-31: Extension 2]
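A receiver-makes-right decoder can be sketched as below. The sketch assumes that bit 0 in the tag layout is the most significant bit of the first byte, so IntegerRep is the high nibble of byte 0 and CharRep the low nibble; that reading of the figure is an assumption, as are the names `arch_tag`, `parse_arch_tag`, and `read_u32_as_sent`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { INT_BIG_ENDIAN = 0, INT_LITTLE_ENDIAN = 1 };

struct arch_tag {
    unsigned int_rep;     /* 0 = big-endian, 1 = little-endian */
    unsigned char_rep;    /* 0 = ASCII, 1 = EBCDIC */
    unsigned float_rep;   /* 0 = IEEE 754, 1 = VAX, 2 = Cray, 3 = IBM */
};

/* Split the first two tag bytes into their fields (extensions ignored). */
struct arch_tag parse_arch_tag(const uint8_t tag[4])
{
    struct arch_tag t;

    assert(tag != NULL);
    t.int_rep   = tag[0] >> 4;
    t.char_rep  = tag[0] & 0x0F;
    t.float_rep = tag[1];
    return t;
}

/* The sender transmitted its native format; the receiver converts. */
uint32_t read_u32_as_sent(const uint8_t b[4], unsigned sender_int_rep)
{
    assert(b != NULL);
    if (sender_int_rep == INT_LITTLE_ENDIAN)
        return  (uint32_t)b[0]        | ((uint32_t)b[1] << 8) |
               ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

When sender and receiver share an architecture, no conversion work is needed at all, which is the main argument for receiver-makes-right over a canonical intermediate form.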
Compression Overview
• Encoding and compression
– Huffman codes
• Lossless
– data received = data sent
– used for executables, text files, numeric data
• Lossy
– data received != data sent
– used for images, video, audio
Huffman Codes
• Huffman coding [1952] can be used as a reasonable approximation to the theoretical limit.
1. Write down the symbols and their probabilities. They are the terminal nodes.
Symbol:       A    B    C    D
Probability: .50  .30  .15  .05
2. Find and mark the two smallest nodes. Add a node with arcs to the nodes marked.
3. Set the probability of the new node to the sum of marked nodes.
4. Repeat steps 2 and 3 until all nodes have been marked except one, the root.
5. The encoding is found by tracing the path from the root to the symbol, with left=0, right=1.
Huffman Codes
        ()
      0/  \1
     (A)   ()
         0/  \1
        (B)   ()
            0/  \1
           (C)  (D)

Symbol:       A    B    C    D
Probability: .5   .3   .15  .05
Code:         0    10   110  111
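The five steps can be sketched in C as below. Probabilities are scaled to integer weights (percent) to avoid floating-point ties. The slides only fix left = 0 and right = 1; which child goes left is an assumption here (the heavier node, or the older node on a tie), chosen so that the result matches the tree above. The names `hnode` and `huffman_codes` and the limit `MAX_SYMS` are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMS 16                       /* illustrative limit */
#define MAX_NODES (2 * MAX_SYMS - 1)

struct hnode {
    int weight;   /* probability scaled to an integer, e.g. percent */
    int parent;   /* index of the parent node, -1 while unmarked */
    int bit;      /* 0 if this node is a left child, 1 if right */
};

/* Fill codes[i] with the '0'/'1' string for symbol i. */
void huffman_codes(const int *weights, int n, char codes[][MAX_SYMS])
{
    struct hnode t[MAX_NODES];
    int count, i;

    assert(weights != NULL && codes != NULL && n >= 2 && n <= MAX_SYMS);
    for (i = 0; i < n; i++) {             /* step 1: terminal nodes */
        t[i].weight = weights[i];
        t[i].parent = -1;
        t[i].bit = 0;
    }

    for (count = n; count < 2 * n - 1; count++) {
        int a = -1, b = -1;               /* smallest, second smallest */
        int left, right;

        for (i = 0; i < count; i++) {     /* step 2: find two smallest */
            if (t[i].parent != -1)
                continue;                 /* already marked */
            if (a == -1 || t[i].weight < t[a].weight) {
                b = a;
                a = i;
            } else if (b == -1 || t[i].weight < t[b].weight) {
                b = i;
            }
        }
        /* Assumed rule: heavier node on the left; older node on a tie. */
        if (t[a].weight == t[b].weight) {
            left = a;
            right = b;
        } else {
            left = b;
            right = a;
        }
        t[count].weight = t[a].weight + t[b].weight;   /* step 3 */
        t[count].parent = -1;
        t[count].bit = 0;
        t[left].parent = count;
        t[left].bit = 0;
        t[right].parent = count;
        t[right].bit = 1;
    }                                     /* step 4: loop until one root */

    for (i = 0; i < n; i++) {             /* step 5: read the path */
        char tmp[MAX_SYMS];
        int len = 0, k;

        for (k = i; t[k].parent != -1; k = t[k].parent)
            tmp[len++] = (char)('0' + t[k].bit);
        for (k = 0; k < len; k++)         /* collected leaf-to-root; reverse */
            codes[i][k] = tmp[len - 1 - k];
        codes[i][len] = '\0';
    }
}
```

With weights 50, 30, 15, 5 this yields the codes 0, 10, 110, 111. The average length is 0.5(1) + 0.3(2) + 0.15(3) + 0.05(3) = 1.7 bits per symbol, compared with 2 bits for a fixed-length code over four symbols.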
Lossless Algorithms
• Run Length Encoding (RLE)
– Replace consecutive occurrences of a given symbol with only one copy of the symbol, plus a count of how many times that symbol occurs.
– example: AAABBCDDDD is encoded as 3A2B1C4D
– good for scanned text (8-to-1 compression ratio)
– can increase size for data with variation (e.g., some images)
• Differential Pulse Code Modulation (DPCM)
– First output a reference symbol; then, for each symbol in the data, output the difference between that symbol and the reference symbol.
– example: AAABBCDDDD is encoded as A0001123333
– change the reference symbol if the delta becomes too large
– works better than RLE for many digital images (1.5-to-1)
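Both schemes are small enough to sketch directly. The C functions below reproduce the two slide examples on text input; `rle_encode` and `dpcm_encode` are hypothetical names. The DPCM sketch writes each delta as a single decimal digit and simply fails when a delta falls outside 0..9, where a real coder would switch to a new reference symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Run-length encode in into out as count+symbol pairs.
 * Returns 0 on success, -1 if out is too small.
 * The text form is ambiguous if the input itself contains digits. */
int rle_encode(const char *in, char *out, size_t cap)
{
    size_t used = 0;

    assert(in != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    while (*in != '\0') {
        size_t run = 1;
        int n;

        while (in[run] == in[0])          /* stops at NUL or a new symbol */
            run++;
        n = snprintf(out + used, cap - used, "%zu%c", run, in[0]);
        if (n < 0 || (size_t)n >= cap - used)
            return -1;                    /* output truncated */
        used += (size_t)n;
        in += run;
    }
    return 0;
}

/* Emit the first symbol as the reference, then one digit per symbol
 * giving its difference from the reference.
 * Returns 0 on success, -1 on empty input, small buffer, or large delta. */
int dpcm_encode(const char *in, char *out, size_t cap)
{
    size_t len, i;

    assert(in != NULL && out != NULL);
    len = strlen(in);
    if (len == 0 || cap < len + 2)
        return -1;
    out[0] = in[0];                       /* reference symbol */
    for (i = 0; i < len; i++) {
        int delta = in[i] - in[0];

        if (delta < 0 || delta > 9)
            return -1;                    /* would need a new reference */
        out[i + 1] = (char)('0' + delta);
    }
    out[len + 1] = '\0';
    return 0;
}
```

For the input AAABBCDDDD, RLE produces 8 characters while this text form of DPCM produces 11. DPCM only pays off when each delta is stored in fewer bits than a full symbol, which a one-character-per-delta text rendering does not show.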
Dictionary-Based Methods
• Build dictionary of variable-length strings of common terms
• Transmit index into dictionary for each term
– For example, replace ‘compression’ with 9293; ‘compression’ is the 9293rd word in /usr/share/dict/words.
• Lempel-Ziv (LZ)
– the compress command is the best-known example
• Commonly achieves a 2-to-1 ratio on text
• Variation of LZ used to compress GIF images
– first reduce 24-bit color to 8-bit color
– treat common sequences of pixels as terms in the dictionary
– not uncommon to achieve 10-to-1 compression (x3)
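The basic "transmit an index instead of the term" step can be sketched with a tiny static dictionary. The word list, `dict_index`, and `dict_word` below are hypothetical and stand in for a shared dictionary such as /usr/share/dict/words. This is not Lempel-Ziv, which builds its dictionary adaptively from the data itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dictionary that sender and receiver both hold. */
static const char *const DICT[] = { "compression", "data", "format", "network" };
#define DICT_SIZE (sizeof DICT / sizeof DICT[0])

/* Sender side: return the index to transmit, or -1 if the word is
 * not in the dictionary and must be sent literally. */
int dict_index(const char *word)
{
    size_t i;

    assert(word != NULL);
    for (i = 0; i < DICT_SIZE; i++)
        if (strcmp(DICT[i], word) == 0)
            return (int)i;
    return -1;
}

/* Receiver side: map a received index back to the term. */
const char *dict_word(int index)
{
    if (index < 0 || (size_t)index >= DICT_SIZE)
        return NULL;
    return DICT[index];
}
```

An 11-character word shrinks to one small integer, but only because both ends already share the dictionary. Adaptive schemes avoid that requirement by letting the receiver reconstruct the dictionary as it decodes.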
Image Compression
• JPEG (Joint Photographic Experts Group) is an ISO/IEC group of experts that develops and maintains standards for a suite of compression algorithms for computer image files.
• JPEG is also a term for any graphic image file produced by using a JPEG standard.
• A JPEG file is created by choosing from a range of compression qualities (actually, from one of a suite of compression algorithms).
• Lossy still-image compression
MPEG
• The Moving Picture Experts Group (MPEG) develops standards for digital video and digital audio compression.
• Lossy compression of video
• First approximation: JPEG on each frame
• Also remove inter-frame redundancy