encoding and decoding experiencing one (or more) bytes out of your a’s

16
ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Upload: angela-collins

Post on 24-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

ENCODING AND DECODING

Experiencing one (or more) bytes out of your A’s

Page 2: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Overview

• It’s not your father’s character set– 8 bit characters– ASCII– The rest of the world wakes up to computers

• Unicode– Character codes– Different flavors

• Encoding and Decoding classes• Example

Page 3: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

The Good Old Days

• Focus on unaccented, English letters• Every letter, number, capital, etc • Represented by codes 0-127• Space, 32; “A”, 65; “a”, 97• Used 7 bits, one bit free on most computers• Wordstar and the 8th bit• Below 32 – control bits 7, beep; 12, formfeed

Page 4: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s
Page 5: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

8th bit, values 128-255

• Everybody had their own ideas

• OEM Character sets

• IBM-PC -> graphics (horizontal bars, vertical bars, bars with dangles, etc.)

• Outside U.S. different languages– Code 130

Page 6: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

8th bit, values 128-255

• Everybody had their own ideas

• OEM Character sets

• IBM-PC -> graphics (horizontal bars, vertical bars, bars with dangles, etc.)

• Outside U.S. different languages– Code 130

Page 7: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

8th bit, values 128-255

• Everybody had their own ideas• OEM Character sets• IBM-PC -> graphics (horizontal bars, vertical bars, bars

with dangles, etc.)• Outside U.S. different languages

– Code 130 é in US, Gimel ג character in Israel– Difficult to exchange documents

• Code pages – regional definition of bit values 128-255– Israel: Code page 862– Greek: Code page 737– ISO/ANSI code pages

• Asia – Alphabets had thousands of characters– No way to store in one byte (8 bits)

Page 8: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Unicode

• Not a 16-bit code • A new way of thinking about characters• Old way:

– Character “A” maps to memory or disk bits– A-> 0100 0001

• Unicode way:– Each letter in every alphabet maps to a “code point”– Abstract concept– “A” is Platonic “form” – just floats out there– A -> U+0639 code point

Page 9: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Unicode

• Hello -> U+0048 U+0065 U+006C U+006C U+006F• Storing in 2 bytes each:

– 0048 0065 006C 006C 006F (big endian)– Or 4800 6500 6C00 6C00 6F00 (little endian)

• Need to have a Byte Order Mark (BOM) at beginning of stream

• UTF8 coding system– Stores Unicode points (magic numbers) as 8 bit bytes– Values 0-127 go into byte 1– Values 128+ go into bytes 2, 3, etc.– For characters up to 127, UTF8 looks just like ASCII

Page 10: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

UNICODE Encodings

• UTF-8• UTF-16 – characters stored in 2 byte, 16-bit

(halfword) sequences – also called UTF-2• UTF-32 – characters stored in 4byte, 32 bit

sequences• UTF-7 – forces a zero in high order bit - firewalls• Ascii Encoding – everything above 7 bits is

dropped

Page 11: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Definitions

• .NET uses UTF-16 encoding internally to store text

• Encoding: – transfers a set of Unicode characters into a sequence

of bytes– Send a string to a file or a network stream

• Decoding: – transfers a sequence of bytes into a set of Unicode

characters– Read a string from a file or a network stream

• StreamReader, StreamWriter default to UTF-8

Page 12: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Encoding/Decoding Classes

• UTF32Encoding class– Convert characters to and from UTF-32 encoding

• UnicodeEncoding class – Convert characters to and from UTF-16 encoding

• UTF8Encoding class to convert to and from UTF-8 encoding – 1, 2, 3, or 4 bytes per char

• ASCIIEncoding class to convert to and from ASCII Encoding – drops all values > 127

• System.Text.Encoding supports a wide range of ANSI/ISO encodings

Page 13: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Convert a string into a stream of encoded bytes

1. Get an encoding objectEncoding e = Encoding.GetEncoding(“Korean”);

2. use the encoding object’s GetBytes() method to convert a string into its byte representation

byte[ ] encoded; encoded = e.GetBytes(“I’m gonna be Korean!”);

Demo: D:\_Framework 2.0 Training Kits\70-536\Chapter 03\EncodingDemo

Page 14: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Write a file in encoded formFileStream fs = new FileStream("text.txt", FileMode.OpenOrCreate); ... StreamWriter t = new StreamWriter (fs, Encoding.UTF8); t.Write("This is in UTF8");

Read an encoded fileFileStream fs = new FileStream("text.txt", FileMode.Open); ... StreamReader t = new StreamReader(fs, Encoding.UTF8); String s = t.ReadLine();

Page 15: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

Summary

• ASCII is one of oldest encoding standards.• UNICODE provides multilingual support• System.Text.Encoding has static methods

for encoding and decoding text.• Use an overloaded Stream constructor

that accepts an encoding object when writing a file.

• Not necessary to specify Encoding object when reading, will default.

Page 16: ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

References

• www.unicode.org• Unicode and .Net – what does .NET Provide?

http://www.developerfusion.co.uk/show/4710/3/• Hello Unicode, Goodbye ASCII

http://www.nicecleanexample.com/ViewArticle.aspx?TID=unicode_encoding

• The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) http://www.joelonsoftware.com/articles/Unicode.html