character sets logins & terminal types jarret raim 2004-04-20

28
Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Post on 21-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Character SetsLogins & Terminal Types

Jarret Raim

2004-04-20

Page 2: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Definitions

Character Repertoire A set of characters where no internal

presentation in computers or data transfer is assumed.

Does not define an ordering for the characters.

Usually defined by specifying names of characters and a sample (or reference) presentation of characters in visible form.

Page 3: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Definitions

Character Code Defines a one-to-one correspondence between

characters in a repertoire and a set of nonnegative integers, called a code position.

Aka: code number, code value, code element, code point, code set value - and just code.

Note: The set of nonnegative integers corresponding to characters need not consist of consecutive numbers.

Page 4: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Definitions

Character Encoding A method (algorithm) for presenting characters in

digital form by mapping sequences of code numbers of characters into sequences of octets.

In the simplest case, each character is mapped to an integer in the range 0 - 255 according to a character code and these are used as such as octets.

Eg: 7-bit ASCII, 8-bit ASCII, UCS, Unicode, UTF-6, UTF-16, etc.

Page 5: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

ASCII & Friends

The original ASCII is a 7-bit encoding using 0-127 to define basic US characters

ISO Latin 1 is ASCII with European characters. (8-Bit)

Contain control codes as well as text.

Page 6: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

More ASCII Love

Even basic 7-bit ASCII is not safe Many “national variants” of ASCII

replace some characters with international ones.

Safe ASCII Characters 0-9 A-Z and a-z ! " % & ' ( ) * + , - . / : ; < = > ? Space

Page 7: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Other Ridiculousness

Other 8-Bit ASCII ExtensionsDOS Code PagesMacintosh Character Codes IBM’s EBCDIC (Mainframes)

Windows did not conform to any known standards until NT switched over to using Unicode encoded as UTF-16.

Page 8: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

The Solution: Unicode

Unicode is a practical description of the ISO 10646 standard known as UCS or the Universal Character Set.

Up to 1,114,111 characters can be encoded.

As of Feb 2000, there were 49,194 characters.

The encoding is not defined Several implementations

Page 9: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Encodings For Unicode

Most Common: UTF-8 Character codes less than 128 (effectively, the ASCII repertoire)

are presented "as such", using one octet for each code. All codes with the high bit set to 1 (ie. Not ASCII) link to a

complex mechanism for rendering Unicode character with up to 6 octets.

Allows space savings and compatibility at the cost of implementation complexity.

Page 10: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Unicode Complexity

Characters can be encoded multiple ways. Π is encoded as:

GREEK_CAPITAL_LETTER_PI N_ARY_PRODUCT

Ä can be encoded as: LATIN CAPITAL LETTER A WITH DIAERESIS The symbol A with a link to the umlaut diacritic

All characters can be represented by the U+nnnn notation.

Other Implementations UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4LE,

UCS-4BE, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE

Page 11: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Unicode Implementation

Level 1 Combining characters and Hangul Jamo characters

are not supported. [Hangul Jamo are required to fully support the Korean script including Middle Korean.]

Level 2 Like level 1, except limited combining characters are

supported for some languages. Level 3

All UCS characters are supported, such that, for example, mathematicians can place a tilde or an arrow (or both) on any character.

Page 12: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Programming Languages

Special Data Types for Unicode Ada95, Java, TCL, Perl, Python, C# and others.

ISO C 90 Specifies mechanisms to handle multi-byte encoding

and wide characters. The type wchar_t, usually a signed 32-bit integer, can

be used to hold Unicode characters. ISO C 99

Some problems with backwards compatibility. The C compiler can signal to an application that

wchar_t is guaranteed to hold UCS values in all locales.

Page 13: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Using Unicode in Linux

Most newer distributions have standardized on UTF-8. RedHat 8.0, SuSE 8.1, etc.

Most applications only have a Level 1 implementation. Terminals, output, etc.

Requires hand tuning for UTF-8 Grep without hand tuning was 100x slower in multi-

byte mode than in single-byte mode.

Page 14: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Using Unicode

Libraries must support Unicode formats. New strlen() definitions:

Number of bytes Number of characters Display width (# of cursor positions)

Application must pay attention to the locale setting for UTF-8 activation. Do NOT use command switches (ie. –u8)

Page 15: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Unicode Functions

Setting the locale setlocale (LC_NUMERIC,

"Germany");

Defines all numbers returned from libc to use German notation.

No more reliance on the underlying numerical representation of ASCII.

BAD: l = c - 'A' + 'a';

GOOD: l = tolower(c);

gettext() returns the translations of typical strings.

// get the translation for the "Hello, world\n" string printf(gettext("Hello, world\n"));

Page 16: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Remaining Unicode Issues

Determining Implementation Levels Font Support

No system will be able to display all Unicode characters.

Printing Some conversions for UTF-8 to PS have been written.

Hard Coded Conventions Masking a string to convert to upper-case, etc.

Page 17: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Unicode on the Web

Should be specified in a MIME header for ALL communications internal and external. The header is sent in ASCII (UTF-8)

X-Mailer: Mozilla 4.0 [en] (Win95; I) MIME-Version: 1.0 To: [email protected] Subject: Test X-Priority: 3 (Normal) Content-Type: text/plain; charset=x-UNICODE-2-0-UTF-7 Content-Transfer-Encoding: 7bit

Page 18: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Unicode in DNS

Current DNS only supports ASCII domain names.

iDNS is a program to allow international characters to be used in domains names.Converts native language into roman

characters. iDNS complies with the Row-based ASCII

Compatible Encoding (RACE) and fully supports UTF-5, UTF-8 and other local encodings.

Page 19: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Terminal Emulation

Terminal emulation allows clients on differing operating systems to talk to a central server via a defined language for control characters and text.

Almost all terminal emulators speak ASCIISome still use IBM’s EBCDIC.

As we know, there are major problems with ASCII internationalization.

Page 20: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Serial Devices

Serial ports are one of the most useful ports on a Linux system.

Uses:TerminalsPrintersCustom Connections

Media Changers, Temperature Sensors, Sewing Machines, etc.

Page 21: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Serial Protocols

RS-232 (EIA-232-E) Specification that defines the

meaning of the signals on each wire.

Normally DB-9 or DB-25.

Two Interfaces DCE (Data Communications

Equipment) DTE (Data Terminal Equipment)

Many alternative connectors DIN-8, RJ-45, etc.

Page 22: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Serial In Linux

Character Device Files /dev/ttyS# /dev/cua# (for historical reasons, deprecated)

setserial Sets the serial parameters Eg. Setserial /dev/ttyS1 port 0x02f8 irq 3

In RedHat /etc/rc.d/rc.sysinit reads /etc/rc.serial

Page 23: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Hardwired Terminals

The init process spawns a terminal process (a getty) for each terminal port in /etc/inittab.

# Run gettys in standard runlevels1:2345:respawn:/sbin/mingetty tty12:2345:respawn:/sbin/mingetty tty23:2345:respawn:/sbin/mingetty tty34:2345:respawn:/sbin/mingetty tty45:2345:respawn:/sbin/mingetty tty56:2345:respawn:/sbin/mingetty tty6

Page 24: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Login Sequence

1. getty prints the contents of /etc/issue and a login prompt.

2. getty executes the login program with the username and password from the user.

3. login verifies the un/pass against /etc/shadow, prints the motd and opens a shell.

4. The shell runs startup scripts and waits for input.

Page 25: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Terminal Support

Linux supports many terms through a database of terminal capabilities. termcap - /etc/termcap terminfo - /usr/share/terminfo

Linux looks at the TERM environment variable to determine terminal type.

Special Characters CTRL-? vs. CRTL-H stty command

tset command

Page 26: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

More Terminals

Terminals can be badly broken cat’ing a binary file Vi or emacs crashing and not restoring the terminal

state

Use reset or stty sane to correct. Modems can be used to transmit terminal

information. USB is successor.

Page 27: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Conclusions

Unicode is complex, but using UTF-8 allows a programmer to get many of the benefits of internationalization for free.

Using STL data structures and other Unicode aware libraries will significantly reduce the pain of using Unicode (kiss char* goodbye).

Assume that there will be problems with internationalization.

Page 28: Character Sets Logins & Terminal Types Jarret Raim 2004-04-20

Sources

Quick Overview http://www.linuxjournal.com/article.php?sid=3327

Longer Overview http://turnbull.sk.tsukuba.ac.jp/Tools/I18N/LJ-I18N.html

Sun’s Internationalization Reference http://developers.sun.com/dev/gadc/educationtutorial/creference/

sampfiles/sampfiles.html

Programming for Internationalization FAQ http://www.cs.uu.nl/wais/html/na-dir/internationalizatio

n/programming-faq.html Plus many more.