p. chellappan
8/2/2019 P. Chellappan
http://slidepdf.com/reader/full/p-chellappan 1/27
Tamil Encoding Standards
P. Chellappan
What is Encoding ?
Computers store all data as numbers, not as glyphs (shapes)
A letter is not stored as its shape but as a number
A number should be assigned to every letter of the alphabet
This is called encoding
What is Encoding ?
A Byte is a unit of storage and consists of 8 bits
A Byte usually represents one character
A Byte can contain a maximum of 256 combinations of 1s and 0s
00000000 00000001 ...
11111110, 11111111
Hence numbers from 0 to 255 can be stored in a byte
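
The 256 combinations can be checked directly. The following is a minimal sketch in Python (the variable names are illustrative and not from the slides) that counts the bit patterns of one byte and prints the first and last ones.

```python
# Count the bit patterns that fit in one 8-bit byte.
BITS_PER_BYTE = 8
combinations = 2 ** BITS_PER_BYTE

# The lowest and highest patterns, written as 8 binary digits.
first_patterns = [format(n, "08b") for n in (0, 1)]
last_patterns = [format(n, "08b") for n in (254, 255)]

print(combinations)      # 256
print(first_patterns)    # ['00000000', '00000001']
print(last_patterns)     # ['11111110', '11111111']

# A value outside 0..255 does not fit in a single byte.
try:
    bytes([256])
    fits = True
except ValueError:
    fits = False
print(fits)              # False
```
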
Tamil Encoding
Tamil has 313 characters
Number of slots available is less than 256
Glyphs are encoded and not characters
e.g. £, ª, «, ¬ etc are encoded
. . but « + è + £
This is called Glyph Encoding
English Text input, display and storage
[Diagram: keyboard keys, English driver, English font and screen; the word "and" is stored as 97 110 100]
Tamil Text input, display and storage
[Diagram: keyboard keys, Tamil driver, Tamil font and screen; the word Ü‹ñ£ is stored as 220 139 241 163]
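
The numbers on the two diagrams can be reproduced with a short sketch. It assumes plain ASCII for the English word; the four Tamil values are copied from the slide as raw byte values, since the specific 8-bit Tamil font encoding is not named in the deck.

```python
# English: each letter of "and" is stored as one number.
english = "and"
stored = list(english.encode("ascii"))
print(stored)                          # [97, 110, 100]

# Reading the numbers back gives the original text.
print(bytes(stored).decode("ascii"))   # and

# Tamil (8-bit glyph encoding): the four values from the slide
# each fit in one byte, exactly like the English letters.
tamil_stored = bytes([220, 139, 241, 163])
print(len(tamil_stored))               # 4
print(list(tamil_stored))              # [220, 139, 241, 163]
```
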
8 Bit Encodings - Problems
•Same 256 locations are used by all languages
e.g. location 97 is a in an English font and a different glyph in a Tamil font
•Use of wrong font will display garbage
e.g. "and" displays correctly with an English font but as unrelated glyphs with a Tamil font
•Hence language information is also required
8 Bit Encodings - Problems
•Font change is difficult
e.g. Ü‹ñ£ and mother‹‹ñ£
•Cross platform data exchange is difficult
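
The "wrong font" problem can be sketched by reading the same bytes under different 8-bit code pages. The choice of Windows-1252 and Windows-1251 here is an assumption made for illustration; the slides do not name any code page. The output is printed as code points so that it does not depend on the console font.

```python
def code_points(text):
    """Return the Unicode code points of text in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The four byte values from the Tamil storage slide.
raw = bytes([220, 139, 241, 163])

# Without the Tamil font, a Western code page shows Latin symbols instead.
as_western = raw.decode("cp1252")
print(code_points(as_western))  # ['U+00DC', 'U+2039', 'U+00F1', 'U+00A3']

# One and the same byte value means different letters in different code pages.
single = bytes([0xE0])
print(code_points(single.decode("latin-1")))  # ['U+00E0'] Latin a with grave
print(code_points(single.decode("cp1251")))   # ['U+0430'] Cyrillic a
```

The bytes never change; only the table used to interpret them does, which is why the language (or font) has to be known separately.
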
Solution
•Characters of different languages should be given different numbers
•256 locations are not sufficient
•2 Byte character system can have 65,536 characters
•Sufficient to handle most of the world's languages!
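
The arithmetic behind these bullets is shown below as a small sketch. The figure of 313 Tamil characters is the one quoted earlier in the slides; the variable names are illustrative.

```python
TAMIL_CHARACTERS = 313        # figure quoted on the Tamil Encoding slide

slots_8_bit = 2 ** 8          # one byte
slots_16_bit = 2 ** 16        # two bytes

print(slots_8_bit)                        # 256
print(slots_16_bit)                       # 65536
print(TAMIL_CHARACTERS <= slots_8_bit)    # False
print(TAMIL_CHARACTERS <= slots_16_bit)   # True
```
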
Solution
Moving to a
16 Bit encoding system
is a must
Unicode
•16 Bit character encoding scheme (the code space has since been extended up to U+10FFFF)
•Contains characters of most of the world's languages
•9 Indian Scripts including Tamil are encoded
•One Script could cover more than one language
Unicode
•Primarily encodes characters, not glyphs
•Character to glyph mapping is not necessarily 1:1
•Many characters can be represented by a single glyph
e.g. è + § will be displayed as °
è + ªo£ will be displayed as ªè£
Unicode
Level 1 Implementation
•TTF font is sufficient
Level 2 Implementation (Tamil)
•OTF Fonts are required
Unicode
•Tamil block - U+0B80 to U+0BFF (128 locations)
•All Vowels Ü to å÷ and ayutham ç
•Agaram eriya mei è to ù and ú, û, ü and ý (consonant)
•Vowel Modifiers – o ¢, o£, o ¤, o ¦, o §
¨ ÷
Tamil numerals and symbols are encoded
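
The block range can be inspected with Python's standard `unicodedata` module. This is a minimal sketch; which locations are assigned depends on the Unicode version bundled with the Python interpreter, so no exact count of assigned characters is claimed here.

```python
import unicodedata

TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF

# Size of the block: 128 locations.
block_size = TAMIL_END - TAMIL_START + 1
print(block_size)   # 128

# Not every location is assigned; unassigned ones have no name.
assigned = [
    cp for cp in range(TAMIL_START, TAMIL_END + 1)
    if unicodedata.name(chr(cp), "")
]

# Show the first few assigned code points with their official names.
for cp in assigned[:5]:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

print(unicodedata.name("\u0b85"))   # TAMIL LETTER A
print(unicodedata.name("\u0b95"))   # TAMIL LETTER KA
```
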
Unicode Sequences
•All Uyirmeis are encoded only as consonant + vowel modifiers
è + o ¢ = è¢
è + o£ = è£
è + o § = °
è + ªo£ = ªè£
•þ is encoded as è + o ¢ + û
•ÿ is encoded as ú + o ¢ + ó + o ¦
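
The same idea can be sketched with actual Unicode code points. The examples below use the standard Tamil code points for KA and a few vowel signs, written as escapes; they are chosen for illustration and are not a transcription of the legacy-font glyphs shown on the slide.

```python
import unicodedata

KA = "\u0b95"   # TAMIL LETTER KA

# Each uyirmei is a sequence: consonant followed by a vowel sign.
examples = {
    "ka + i": KA + "\u0bbf",
    "ka + aa": KA + "\u0bbe",
    "ka + uu": KA + "\u0bc2",
    "ka + o": KA + "\u0bca",
    # A conjunct is consonant + virama (pulli) + consonant.
    "ka + virama + ssa": KA + "\u0bcd" + "\u0bb7",
}

for label, text in examples.items():
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)
```

Each entry displays as one visual unit but is stored as two or three code points.
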
Advantages of Unicode
•It is a common encoding being used for all languages by all operating systems
•It is the standard used in the Internet for all data exchange
•Systems automatically detect the language and process information correctly
Disadvantages of Unicode
•File sizes are bigger
•Data processing time is more
•NLP is more complex
– Meis are not encoded
– Character widths are not uniform
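
The file-size point can be illustrated with a rough comparison. This sketch assumes that the four legacy bytes 220 139 241 163 from the earlier storage slide spell the Tamil word for "mother" (அம்மா); that reading is an inference from the slides, not something they state outright.

```python
# The word as a Unicode sequence: A, MA, VIRAMA, MA, VOWEL SIGN AA.
word = "\u0b85\u0bae\u0bcd\u0bae\u0bbe"

# Assumed 8-bit glyph encoding of the same word (values from the slide).
legacy_bytes = bytes([220, 139, 241, 163])

print(len(word))                        # 5 code points
print(len(legacy_bytes))                # 4 bytes in the 8-bit encoding
print(len(word.encode("utf-16-le")))    # 10 bytes in UTF-16
print(len(word.encode("utf-8")))        # 15 bytes in UTF-8
```
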
Problems with Unicode
•Not all applications support complex scripts
•Major DTP applications also do not support complex scripts at present
•Same characters are represented by more than one, not necessarily correct, sequence
e.g. è + ªo£ = ªè£
è + ª + £ = ªè£
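
The two-sequence problem can be reproduced with standard Unicode code points. This is a sketch using KA with the vowel sign O, which also exists as the pair E + AA. Unicode normalization (NFC) from the `unicodedata` module is shown as one way to reconcile the two forms; the slides do not mention normalization, so treat it as an added suggestion.

```python
import unicodedata

single = "\u0b95\u0bca"          # KA + VOWEL SIGN O
split = "\u0b95\u0bc6\u0bbe"     # KA + VOWEL SIGN E + VOWEL SIGN AA

# Both render the same, but a plain comparison says they differ.
print(single == split)           # False
print(len(single), len(split))   # 2 3

# NFC composes E + AA into the single O sign, so the forms become equal.
nfc = unicodedata.normalize("NFC", split)
print(nfc == single)             # True
```

Search, sort and comparison code that skips this step will treat the two spellings as different words.
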
TACE – An alternate encoding
•It is an alternate 16 bit encoding
•It encodes all uyirs, meis and uyirmeis
TACE – Design features
•Every uyirmei has a unique code point
•Since every character is encoded, complex script support is not required
•The mei and uyir component of every character can be easily separated by simple bit operations
e.g. xxxx xxxc cccc vvvv
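
The bit layout above suggests how the separation could work. The sketch below follows only the pattern on the slide (7 prefix bits, 5 consonant bits, 4 vowel bits). The function names and the sample value are hypothetical, and no actual TACE16 code point assignments are implied.

```python
VOWEL_BITS = 4        # vvvv
CONSONANT_BITS = 5    # c cccc
VOWEL_MASK = (1 << VOWEL_BITS) - 1           # 0b1111
CONSONANT_MASK = (1 << CONSONANT_BITS) - 1   # 0b11111

def split_code(code):
    """Return the (consonant, vowel) indices packed in a 16-bit code."""
    vowel = code & VOWEL_MASK
    consonant = (code >> VOWEL_BITS) & CONSONANT_MASK
    return consonant, vowel

def join_code(prefix, consonant, vowel):
    """Pack a 7-bit prefix, a consonant index and a vowel index."""
    return (prefix << (CONSONANT_BITS + VOWEL_BITS)) | (consonant << VOWEL_BITS) | vowel

# Made-up value laid out as xxxx xxxc cccc vvvv (not a real TACE16 code).
sample = 0b1110_0010_0101_0011
print(split_code(sample))                         # (5, 3)
print(join_code(0b1110001, 5, 3) == sample)       # True
```

One shift and two masks recover both components, with no lookup table or rendering engine involved.
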
TACE – Advantages
•Complex Script support is not required
•Compact file sizes
•Fast data processing
•Fast NLP processing
TACE – Disadvantages
•The major bottleneck is that it is encoded in the Private Use Area (PUA)
•It cannot be used for global data interchange
•It can be used only within a closed user group
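
Why the PUA limits interchange can be seen from how Unicode classifies those code points. This is a sketch using the Basic Multilingual Plane private use range U+E000 to U+F8FF; the helper name and the sample value U+E200 are illustrative and are not taken from the TACE16 tables.

```python
import unicodedata

PUA_START, PUA_END = 0xE000, 0xF8FF   # BMP Private Use Area

def in_bmp_pua(code_point):
    """Return True if the code point lies in the BMP Private Use Area."""
    return PUA_START <= code_point <= PUA_END

print(in_bmp_pua(0xE200))   # True
print(in_bmp_pua(0x0B95))   # False (TAMIL LETTER KA)

# Unicode gives PUA code points the category "Co" (private use) and no
# name, so their meaning is known only to parties that agree on it.
print(unicodedata.category(chr(0xE200)))     # Co
print(unicodedata.category("\u0b95"))        # Lo
print(unicodedata.name(chr(0xE200), None))   # None
```
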
Tamil Nadu Government Standards
•The Government of Tamil Nadu has taken a policy decision to convert to the 16 Bit encoding system
•UNICODE will be the primary 16 bit standard
•TACE16 will be the only alternate encoding standard in areas where support for Unicode is not available fully or partially.