Keynote delivered at WAVES2012, UMASSD, July 14, 2012
Current Progress in Developing
Resources for Sanskrit and other Indian
languages
Girish Nath Jha
Associate Professor, Computational Linguistics
Special Center for Sanskrit Studies, J.N.U., New Delhi – 110067
&
Mukesh and Priti Chatter Distinguished Professor of History of Science,
University of Massachusetts Dartmouth
What is a “resource” ?
Language data and corpora in standard formats for computer processing, for direct or indirect use by humans
India is considered a “resource-poor” country because it does not have enough standard resources.
What does it mean for Sanskrit ?
-electronic texts, dictionaries
-digital libraries
-parallel corpora
-search engines
-language processing tools (MT, speech, OCR, online handwriting recognition (OLHWR), etc.)
-second Indology revolution in the making?
Why Sanskrit?
Language     Script(s)               Family
Hindi        Devanagari              Indo-Aryan
Sanskrit     Devanagari              Indo-Aryan
Marathi      Devanagari              Indo-Aryan
Konkani      Devanagari              Indo-Aryan
Maithili     Devanagari              Indo-Aryan
Nepali       Devanagari              Indo-Aryan
Sindhi       Devanagari              Indo-Aryan
Bodo         Devanagari              Tibeto-Burman
Dogri        Devanagari              Indo-Aryan
Santhali     Devanagari, Ol Chiki    Austro-Asiatic
Bengali      Bengali                 Indo-Aryan
Assamese     Bengali                 Indo-Aryan
Manipuri     Bengali, Meithei        Tibeto-Burman
Gujarati     Gujarati                Indo-Aryan
Kannada      Kannada                 Dravidian
Malayalam    Malayalam               Dravidian
Oriya        Oriya                   Indo-Aryan
Punjabi      Gurumukhi               Indo-Aryan
Tamil        Tamil                   Dravidian
Telugu       Telugu                  Dravidian
Urdu         Perso-Arabic            Indo-Aryan
Kashmiri     Perso-Arabic            Indo-Aryan
Indian constitution on languages
448 articles, 12 schedules, 107 amendments (so far)
Part III – Fundamental Rights
Part IV A – Fundamental Duties
Part XVII – Official Language
Part XVII – Regional Languages
Part XVII – Language of the Supreme Court and High Courts
Part XVII – Special Directives
Sanskrit Commission, 1956
Sanskrit in digital age
Computer for Sanskrit
Sanskrit for Computer
major e-contents
Sanskrit Wikipedia (Sanskrit-medium Wikipedia)
http://sa.wikipedia.org
Sanskrit Wikisource (Sanskrit e-texts)
Sanskrit Wiktionary (Sanskrit dictionary)
Sanskrit Wikibooks (Sanskrit e-library)
major e-contents
Digital libraries
DLI project (http://dli.iiit.ac.in/): 1022 Sanskrit books (IISc, CMU, NSF, ERNET, MCIT)
NSF funded, Brown Univ (http://www.sanskritlibrary.org/)
Clay’s project (http://www.claysanskritlibrary.org), JJC Foundation, NYU Press
INRIA, Paris (technical texts, tools)
IGNCA (http://ignca.nic.in/sanskrit.htm)
J-TESS (JNU Text Encoding and Search for Sanskrit)
major e-contents
Sanskrit e-documents
Maharshi Mahesh Yogi
(http://sanskrit.safire.com/Sanskrit.html)
Avinash Sathaye - Sanskrit documents list (http://sanskritdocuments.org/)
Srinivas Varkhedi – Sanskrit corpus (http://rsvidyapeetha.ac.in/)
Oliver Hellwig (Univ of Berlin)
Anand Mishra
(http://sanskrit.sai.uni-heidelberg.de/)
http://sanskrit.jnu.ac.in
major e-contents
Sanskrit documents
Tirupati Vidyapeeth
ASR Melkote
CDAC- heritage computing group
Sanskrit blogs
JNU students
Others (http://sanskritlinks.blogspot.com)
Sanskrit corpora and tagset
JNU, LDC, Univ. of Pennsylvania, Univ. of Hyderabad
major e-contents: static
Himanshu Pota (http://learnsanskrit.wordpress.com/)
http://www.ee.adfa.edu.au/staff/hrp/personal/sanskrit/
American Sanskrit Institute
(http://www.americansanskrit.com/)
Acharya, IITM
(http://acharya.iitm.ac.in/sanskrit/tutor.php)
Vasudev Bhatt (http://www.ourkarnataka.com/learnsanskrit/sanskrit_main.htm)
Sanskrit Bharati (http://www.samskrita-bharati.org/newsite/index.php)
http://sanskritbhasha.blogspot.com/
major e-contents: dynamic
Tutorials
Sudhir Kaicker (http://www.sanskrit-lamp.org/)
Prof. G.V.Singh (CASTLE project of DoE)
Peter Scharf
Avinash Sathaye
Sanskrit CD (Mahesh Kulkarni, CDAC Pune)
Language processing tools
Gerard Huet
Amba Kulkarni
Peter Scharf
Girish N Jha
Anand Mishra
Editor:
Girish Nath Jha
Work done at Jawaharlal Nehru University (JNU), New Delhi
Special Center for Sanskrit Studies, JNU
Linking traditional scholarship with modern methods
Exploring Science & Technology in Sanskrit
Developing language technology resources and tools for Sanskrit and other Indian languages
Collaboration with universities
Collaboration with industry
SATIAIT: Science And Technology In Ancient Indian Texts
Center for Indic Studies, UMASSD initiative
Thanks to the initiative and efforts of Prof. Bal Ram Singh, we are carrying out the following activities:
Identifying key S&T texts
Digitizing them, providing computer help
Translating
Lab experiments
Documenting…
Editors:
Bal Ram Singh
Girish Nath Jha
Umesh Kumar Singh
Diwakar Mishra
Editors:
Girish Nath Jha
Bal Ram Singh
R P Singh
Diwakar Mishra
Editors:
Angela Marcantonio
Girish Nath Jha
Technology Development for Indian Languages
Building Blocks of Language Technology Development:
Localization
Linguistic Resources
Standards
Certification
Software/Tools
Training, Awareness
Technologies
Near Future initiatives
Localization R & D Center
(JNU, CDAC, IIT Delhi)
NME-ICT center at JNU
(MHRD, JNU)
Machine Translation
SHMT (Dept of IT, Govt. of India)
SaHiT (unfunded)
Microsoft Translator Hub
English-Hindi (Microsoft)
English-Urdu (Microsoft)
English-Gujarati (Microsoft)
Sanskrit-English (unfunded)
English-Maithili (unfunded)
SHMT (DIT)
A consortium of 7 universities/institutes:
University of Hyderabad
JNU
IIIT Hyderabad
Tirupati Vidyapeeth
Sanskrit Academy Hyderabad
Poornaprajna Vidyapeeth Bangalore
Rajasthan Sanskrit University, Jaipur
Duration: 3 yrs (2008 – 2012)
MT system to be hosted on http://tdil-dc.in very soon
Phase 2 (2012-2015)
Indian Languages Corpora Initiative (ILCI)
A consortium of Indian universities has been formed under my leadership: 17 languages, with the remaining 6 to join later
Parallel tagged corpora of 100,000 sentences in all Indian languages in the tourism, health, agriculture, and entertainment domains
Funded by the TDIL program of the Ministry of C & IT
Phase 1: 2009-12 (a consortium of 12 languages including English); corpora to be hosted on http://tdil-dc.in very soon
Phase 2: 2012-2015
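To make the idea of a parallel tagged corpus concrete, here is a minimal sketch in Python of how one aligned sentence might be stored. It is an illustration only and does not describe the actual ILCI file format or tagset: the class `ParallelEntry`, its fields, the sentence ID, the example sentences, and the tags `V`, `ADJ`, `N` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A token is a (word, tag) pair. The tags used below are placeholders,
# not the tagset used in the ILCI project.
Token = Tuple[str, str]


@dataclass
class ParallelEntry:
    """One sentence of a parallel corpus, tagged in several languages."""
    sentence_id: str   # hypothetical identifier shared by all versions
    domain: str        # e.g. "health" or "tourism"
    versions: Dict[str, List[Token]] = field(default_factory=dict)

    def add(self, language: str, tokens: List[Token]) -> None:
        """Store the tagged tokens for one language."""
        self.versions[language] = list(tokens)

    def text(self, language: str) -> str:
        """Return the untagged sentence for one language."""
        return " ".join(word for word, _ in self.versions[language])

    def pair(self, source: str, target: str) -> Tuple[str, str]:
        """Return an aligned (source, target) sentence pair."""
        return self.text(source), self.text(target)


# Hypothetical example entry in the health domain.
entry = ParallelEntry(sentence_id="health-0001", domain="health")
entry.add("en", [("Drink", "V"), ("clean", "ADJ"), ("water", "N")])
entry.add("hi", [("साफ", "ADJ"), ("पानी", "N"), ("पिएँ", "V")])

print(entry.pair("en", "hi"))
```

Keeping every language version under one shared sentence ID is what makes the corpus "parallel": any two languages can be paired for translation work without a separate alignment step.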
Languages & Consortia partners
Consortium of universities
Server-based corpora development and management (the server is called “sanskrit”)
Limited crowdsourcing
Shallow parsing tools for Indian languages
Under a consortium project led by Univ. of Hyderabad
Morph analyzers for 11 Indian languages
Duration: 2012-15
Consultancies
Online Handwriting Recognition for Devanagari-based languages
-Microsoft
Indic languages tagset and annotation
-Microsoft Research India
Multimodal data in 8 security-sensitive languages (Indian English, Hindi, Urdu, Tamil, Bangla, Punjabi, Pushto, Dari)
-LDC, University of Pennsylvania
English-(major) Indian languages Machine Translation (English-Hindi, English-Urdu, English-Gujarati, Sanskrit-English, English-Maithili)
Started this summer
-Microsoft
Some of the recent R&D with the help of research students
Sanskrit Speech Synthesizer (in collaboration with Microsoft Research India; prototype by next year)
Named Entity Recognizer for Sanskrit (prototype finished)
J-TESS: JNU Text Encoding & Search for Sanskrit
Tools
Server-based corpora creation and annotation application called ILCIANN
Sanskrit and other Indian languages processing tools
Multimedia animation, e-learning tools
Lexical resources and search
Indian language Transliterator
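As an illustration of what an Indian language transliterator can do at its simplest, here is a sketch in Python. It relies on the fact that the Unicode blocks for the major Indic scripts share a parallel layout, so a letter can often be mapped to its counterpart by moving its code point into the target block. This is an assumption made for the example and not a description of the JNU tool: the names `SCRIPT_BASE`, `SHARED`, and `transliterate` are hypothetical, and the naive shift does not handle letters that have no counterpart in the target script.

```python
# Start of the Unicode block for each script (each block spans 0x80 code points).
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

# Danda and double danda live in the Devanagari block but are shared
# punctuation for all these scripts, so they are never shifted.
SHARED = {0x0964, 0x0965}


def transliterate(text: str, source: str, target: str) -> str:
    """Shift every character of the source script into the target block.

    Characters outside the source block (Latin letters, digits, spaces)
    are copied unchanged. No check is made that the shifted code point
    is actually assigned in the target script.
    """
    src = SCRIPT_BASE[source]
    tgt = SCRIPT_BASE[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE and cp not in SHARED:
            out.append(chr(cp - src + tgt))
        else:
            out.append(ch)
    return "".join(out)


# The letter "ka" in Devanagari, shown in three other scripts.
for script in ("bengali", "gujarati", "telugu"):
    print(script, transliterate("\u0915", "devanagari", script))
```

A production transliterator would need per-script exception tables on top of this, since the scripts do not have identical letter inventories.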
Demo
http://sanskrit.jnu.ac.in
Thank you! (धन्यवाद) Questions?
91-11-26741308