Tamil All Character Encoding

Tamil All Character Encoding (TACE16) is a 16-bit, Unicode-based character encoding scheme for the Tamil language.[1][2]

Keyboard drivers and Fonts

The keyboard driver for this encoding scheme is available free of charge on the Tamil Virtual University website.[3] It uses the Tamil99 and Tamil Typewriter keyboard layouts, which are approved by the Tamil Nadu Government, and maps input keystrokes to the corresponding characters of the TACE16 scheme.[2] To read files created using the TACE16 scheme, the corresponding Unicode Tamil fonts for this encoding scheme are also available on the same website.[3] These fonts map glyphs not only for characters in the TACE16 format but also for the present Unicode encoding of both ASCII and Tamil characters, so that they provide backward compatibility for reading existing files created using the present Unicode encoding scheme for Tamil.

Codepage Layout

All characters of this encoding scheme are located in the Basic Multilingual Plane of Unicode's Universal Character Set.

Tamil All Character Encoding(TACE16) Character Set
Consonants→
Vowels
E10 E18 E1A E1F E20 E21 E22 E23 E24 E25 E26 E27 E28 E29 E2A E2B E2C E2D E2E E2F E30 E31 E32 E33 E34 E35 E36 E37 E38 E39 E3A E3B E3C E3D E3E E3F
0 அரைக்கால் க் ங் ச் ஞ் ட் ண் த் ந் ப் ம் ய் ர் ல் வ் ழ் ள் ற் ன் ஜ் ஶ் ஷ் ஸ் ஹ் க்ஷ்
1 கால் க்ஷ
2 அரை கா ஙா சா ஞா டா ணா தா நா பா மா யா ரா லா வா ழா ளா றா னா ஜா ஶா ஷா ஸா ஹா க்ஷா
3 முக்கால் ி கி ஙி சி ஞி டி ணி தி நி பி மி யி ரி லி வி ழி ளி றி னி ஜி ஶி ஷி ஸி ஹி க்ஷி
4 அரைவீசம் கீ ஙீ சீ ஞீ டீ ணீ தீ நீ பீ மீ யீ ரீ லீ வீ ழீ ளீ றீ னீ ஜீ ஶீ ஷீ ஸீ ஹீ க்ஷீ
5 வீசம் கு ஙு சு ஞு டு ணு து நு பு மு யு ரு லு வு ழு ளு று னு ஜு ஶு ஷு ஸு ஹு க்ஷு
6 மூவீசம் கூ ஙூ சூ ஞூ டூ ணூ தூ நூ பூ மூ யூ ரூ லூ வூ ழூ ளூ றூ னூ ஜூ ஶூ ஷூ ஸூ ஹூ க்ஷூ
7 அரைமா கெ ஙெ செ ஞெ டெ ணெ தெ நெ பெ மெ யெ ரெ லெ வெ ழெ ளெ றெ னெ ஜெ ஶெ ஷெ ஸெ ஹெ க்ஷெ
8 பௌர்ணமி ஒருமா கே ஙே சே ஞே டே ணே தே நே பே மே யே ரே லே வே ழே ளே றே னே ஜே ஶே ஷே ஸே ஹே க்ஷே
9 அமாவாசை இரண்டுமா கை ஙை சை ஞை டை ணை தை நை பை மை யை ரை லை வை ழை ளை றை னை ஜை ஶை ஷை ஸை ஹை க்ஷை
A கார்த்திகை மும்மா கொ ஙொ சொ ஞொ டொ ணொ தொ நொ பொ மொ யொ ரொ லொ வொ ழொ ளொ றொ னொ ஜொ ஶொ ஷொ ஸொ ஹொ க்ஷொ
B ராஜ நாலுமா கோ ஙோ சோ ஞோ டோ ணோ தோ நோ போ மோ யோ ரோ லோ வோ ழோ ளோ றோ னோ ஜோ ஶோ ஷோ ஸோ ஹோ க்ஷோ
C முந்திரி கௌ ஙௌ சௌ ஞௌ டௌ ணௌ தௌ நௌ பௌ மௌ யௌ ரௌ லௌ வௌ ழௌ ளௌ றௌ னௌ ஜௌ ஶௌ ஷௌ ஸௌ ஹௌ க்ஷௌ
D அரைக்காணி ஸ்ரீ
E காணி
F முக்காணி
Note:
Newly added. Not present in Unicode_v6.3.
Allocated for research (NLP)
For future use
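
As a minimal sketch of what "located in the Basic Multilingual Plane" means in practice, the snippet below checks that a TACE16 code point fits in a single 16-bit UTF-16 code unit. The range E100 to E3FF is an assumption read off the column headers (E10 to E3F) and row labels (0 to F) of the table above, and the names `TACE16_FIRST`, `TACE16_LAST` and `in_bmp` are hypothetical. These values fall inside what Unicode calls the Private Use Area, which is why Python reports the general category `Co` for them.

```python
import unicodedata

# Assumed range, inferred from the table's column headers (E10..E3F)
# combined with its row labels (0..F); not taken from a specification.
TACE16_FIRST = 0xE100
TACE16_LAST = 0xE3FF


def in_bmp(code_point):
    """Return True if the code point lies in the Basic Multilingual Plane."""
    return 0x0000 <= code_point <= 0xFFFF


# E213 is the code value the article gives for கி.
ki = chr(0xE213)

# Number of 16-bit code units needed to store the character in UTF-16.
utf16_units = len(ki.encode("utf-16-be")) // 2

# Unicode general category of the code point ("Co" means private use).
category = unicodedata.category(ki)

print(in_bmp(TACE16_FIRST), in_bmp(TACE16_LAST), utf16_units, category)
```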

Analysis of TACE16 over present Unicode standard for Tamil language

Issues with the present Unicode for Tamil language

The present Unicode standard for Tamil is considered inadequate for efficient and effective use of Tamil in computers, for the following reasons:[1]

  1. Unicode Tamil has code positions for only 31 of the 247 Tamil characters. These 31 characters comprise 12 vowels, 18 agara-uyirmey and one aytham. Five Grantha agara-uyirmey are also given code space in Unicode Tamil. The other Tamil characters have to be rendered using separate software. Only 10% of the Tamil characters are given code space in the present Unicode Tamil; the other 90%, which are used in general text interchange, are not.
  2. The uyir-meys that are left out of the present Unicode Tamil are simple characters, just as A, B, C, D are characters in English. Uyir-meys are not glyphs, ligatures, or conjunct characters, as Unicode assumes. ka, kA, ki, kI, etc., are characters in Tamil.
  3. In any plain Tamil text, Vowel Consonants (uyir-meys) form 64 to 70%; Vowels (uyir) form 5 to 6% and Consonants (meys) form 25 to 30%. Breaking high frequency letters like vowel-consonants into glyphs is highly inefficient.
  4. This type of encoding, which requires a rendering engine to realize a character while computing, is not suitable for applications such as system software development in Tamil, searching and sorting, and natural language processing (NLP) in Tamil. It consumes extra time and space, making the computing process highly inefficient. Such applications require a Level-1 implementation, in which all the characters of a language have code positions in the encoding, as in English.
  5. This encoding is based on ISCII - 1988 and therefore, the characters are not in the natural order of sequence. It requires a complex collation algorithm for arranging them in the natural order of sequence.
  6. It uses multiple code points to render single characters. Multiple code points lead to security vulnerabilities and ambiguous combinations, and require the use of normalization.
  7. Simple operations such as counting letters, sorting and searching are inefficient.
  8. It requires hidden characters of the ZWJ/ZWNJ type.
  9. It needs an exception table to prevent illegal combinations of code points.
  10. The Unicode Indic block is built on an enormous, complex, error-prone edifice, based on an encoding that is NOT built to last.
  11. The very first code point says "Tamil Sign Anusvara - Not used in Tamil".
  12. Collation was assumed to be the same as for Devanagari; the encoding incorrectly uses ambiguous sequences to render the same character.
  13. It encodes 23 vowel-consonants (23 consonants + அ) and calls them consonants, against Tamil grammar.
  14. It is unnatural for speech-to-text and text-to-speech.
  15. It is inefficient to store, transmit and retrieve (for example, file reading and writing, Internet, etc.).
  16. Complex processing hinders development.
  17. It needs normalization for string comparison.
  18. A sequence of characters may correspond to a single glyph, that is, ச + ◌ெ + ◌ா = சொ. Characters are not graphemes. According to Unicode, சொ is a grapheme, but ச, ◌ெ and ◌ா are characters.
  19. Requires Dynamic Composition - a text element encoded as a sequence of a base character followed by one or more combining marks.
  20. There are two methods of rendering the Vowel Consonants. This leads to ambiguity in rendering characters.
  21. The present Unicode is not efficient for parsing. For example, let us count the letters in the name திருவள்ளுவர். Even a Tamil child in primary school can say that this name has seven letters. According to Unicode, this name has twelve characters: த ◌ி ர ◌ு வ ள ◌் ள ◌ு வ ர ◌்
  22. To properly count the letters in this name, an expert developer had to write a complex program and present it as a technical paper at a Tamil computing conference. By comparison, counting letters in an English word is an exercise left to a beginning programmer. Such problems arise because a simple script such as Tamil is treated as a complex script by Unicode. Letter counting of this kind is provided, for example, by the function tamil.utf8.get_letters in the Python library open-tamil (https://ezhillang.wordpress.com/2014/01/26/open-tamil-text-processing-%E0%AE%89%E0%AE%B0%E0%AF%88-%E0%AE%AA%E0%AE%95%E0%AF%81%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AE%BE%E0%AE%AF%E0%AF%8D%E0%AE%B5%E0%AF%81/).
  23. The Unicode standard policy is to encode only characters, not glyphs. However, the Unicode Tamil standard includes the vowel signs as combining characters. These signs, which have no meaning on their own to a Tamil reader, are displayed as they are by character shaping engines that detect a blank space between them and a base character. Thus Unicode introduces the dotted circle as a Tamil character.
  24. Unicode Tamil is not fully supported on many platforms, primarily because Tamil is treated as a complex script that requires complex processing.
  25. Because all of the above inefficiencies consume more processor cycles (and hence more electricity) than necessary, they increase the lifetime power usage of any machine that processes Unicode Tamil and might reduce the lifetime of that machine. For example, to process the single Tamil character kI (கீ), a machine has to process both the consonant and the vowel modifier, which doubles the processing cycles consumed (and hence the electricity). Across all the machines and servers in the world that process Unicode Tamil characters, the extra power consumption will be huge.
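
The counting and normalization points in the list above (items 17, 18, 20 and 21) can be illustrated with a short sketch that uses only Python's standard `unicodedata` module. The helper `split_letters` is a hypothetical name, not the open-tamil implementation: it simply attaches every combining mark to the preceding base character, which is enough for திருவள்ளுவர் but does not treat sequences such as க்ஷ or ஸ்ரீ as single letters. The strings are written as explicit code point escapes so that the counts do not depend on how an editor stores the text.

```python
import unicodedata


def split_letters(text):
    """Group each base character with the combining marks that follow it.

    Simplified illustration only: conjuncts such as க்ஷ are not merged.
    """
    letters = []
    for ch in text:
        # Tamil vowel signs and the pulli have general category Mc or Mn.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


# திருவள்ளுவர் spelled out as Unicode code points.
name = ("\u0ba4\u0bbf\u0bb0\u0bc1\u0bb5\u0bb3\u0bcd"
        "\u0bb3\u0bc1\u0bb5\u0bb0\u0bcd")

code_point_count = len(name)              # what len() sees
letter_count = len(split_letters(name))   # what a Tamil reader sees

# Two encodings of சொ: ச + ◌ொ, and ச + ◌ெ + ◌ா.
so_single_sign = "\u0b9a\u0bca"
so_two_signs = "\u0b9a\u0bc6\u0bbe"

# A raw comparison treats them as different strings ...
same_raw = so_single_sign == so_two_signs
# ... so both sides must be normalized before comparing.
same_normalized = (unicodedata.normalize("NFC", so_single_sign)
                   == unicodedata.normalize("NFC", so_two_signs))

print(code_point_count, letter_count, same_raw, same_normalized)
```

In an encoding where every letter has its own code value, as TACE16 proposes, the letter count would simply be the string length and no normalization step would be needed for comparison.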

Analysis of TACE16 over Unicode Tamil

The following data compare the current Unicode encoding for Tamil with TACE16 in e-governance and browsing applications:[1]

  1. TACE16 is more efficient than Unicode Tamil by about 5.46 to 11.94 percent in data storage applications.
  2. TACE16 is more efficient than Unicode Tamil by about 18.69 to 22.99 percent in sorting index data.
  3. TACE16 is more efficient than Unicode Tamil by about 25.39 percent when the entire data is in Tamil. The default (binary) collation sequence obtained from the code space values in the new TACE16 is not in Tamil dictionary order. Some of the uyir-meys (agara-uyirmeys) take precedence over vowels and other uyir-meys in the new TACE16, the vowels and agara-uyirmeys being in the 0B80 - 0B8F block and the other uyir-meys being in 0800 to 08FF. For this reason, sorted Unicode data looks better than sorted TACE16 data.
  4. TACE16 is faster than Unicode Tamil in sorting by about 0.31 to 16.96 percent.
  5. Index creation on TACE16 data is 36.7 percent faster than on Unicode data.
  6. For full-key search on indexed fields, TACE16 performed better than Unicode Tamil by up to 24.07 percent. On non-indexed fields, TACE16 also performed better than Unicode Tamil, by up to 20.9 percent.
  7. Rendering of static Tamil Data was fine with TACE16.

Advantages of TACE16 over Unicode Tamil

The TACE16 character encoding scheme not only overcomes all the issues with the present Unicode encoding standard for Tamil mentioned above, but also offers major improvements in both processing time and processing space, the main factors in the efficient and speedy execution of any computer program. The system has the following additional advantages:[1]

  1. The encoding is Universal since it encompasses all characters that are found in general Tamil text interchange.
  2. The Collation is sequential in accordance with the code value.
  3. The encoding is unambiguous.
  4. Any given code point always represents the same character.
  5. There is no ambiguity as in the present Unicode Tamil.

This system also has advantages for computer programming. For example, a vowel-consonant (uyir-mey) can be derived from its consonant and vowel directly from the code values:

Method 1 (by simple arithmetic operations):
 க் + இ = கி
 E210(க்) + E203(இ) = 1C413
 1C413 - E200(Constant) = E213(கி)
Method 2 (by bitwise operations):
 க்(E210) + இ(E203) = கி(E213)
 E210(க்) | ( E203(இ) & 000F(Constant) ) = E213(கி)
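
The two methods above can be sketched in Python as follows. The code values E210 (க்), E203 (இ) and E213 (கி) and the constants E200 and 000F come from the text; the function names are hypothetical. `decompose` is not described in the article: it is an assumed inverse that follows from the same layout, in which the low four bits of a vowel-consonant select the vowel and the remaining bits select the consonant row, so it only applies to cells that follow that pattern.

```python
VOWEL_BASE = 0xE200   # constant subtracted in Method 1
VOWEL_MASK = 0x000F   # constant used as a mask in Method 2


def compose_add(consonant, vowel):
    """Method 1: add the two code values, then subtract the constant."""
    return consonant + vowel - VOWEL_BASE


def compose_mask(consonant, vowel):
    """Method 2: OR the consonant with the low four bits of the vowel.

    Relies on the consonant's code value ending in hex digit 0.
    """
    return consonant | (vowel & VOWEL_MASK)


def decompose(letter):
    """Assumed inverse: split a vowel-consonant into (consonant, vowel)."""
    consonant = letter & ~VOWEL_MASK & 0xFFFF
    vowel = VOWEL_BASE | (letter & VOWEL_MASK)
    return consonant, vowel


K = 0xE210    # க்
I = 0xE203    # இ

print(hex(compose_add(K, I)))    # 0xe213
print(hex(compose_mask(K, I)))   # 0xe213
print([hex(v) for v in decompose(0xE213)])
```

Both methods give the same result here because the consonant's low four bits are zero, so adding the vowel offset and OR-ing it in are equivalent.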

Alternatives

The open-tamil project provides many of the common operations, e.g. extracting letters from a UTF-8 encoded Unicode string, sorting and searching, and thereby achieves Level-1 compliance of Tamil text processing without using TACE16.

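   #!/usr/bin/env python3
   # -*- coding: utf-8 -*-
   # Requires the open-tamil package, which provides the tamil.utf8 module.
   import tamil.utf8 as utf8

   # Split the sentence into Tamil letters, then print each letter and
   # write it on its own line of the UTF-8 encoded file 'singl'.
   with open('singl', 'w', encoding='utf-8') as ff:
       letters = utf8.get_letters("கூவிளம் என்பது என்ன சீர்")
       for letter in letters:
           ff.write(letter)
           print(letter)
           ff.write('\n')
   # The with block closes the file, so no explicit close() is needed.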

This generates the output: கூ வி ள ம் எ ன் ப து எ ன் ன சீ ர்

References

  1. Report on the final recommendations of the task force on TACE16
  2. Tamil Nadu Government's Tender Document for development of Tamil fonts and Tamil keyboard driver for 16-bit encodings (Unicode and TACE16)
  3. Tamil Nadu Government's Order (G.O.), Keyboard Drivers and Fonts
This article is issued from Wikipedia - version of the 6/30/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.