Chapter 1. Introduction

'Where shall I begin, please, your Majesty?' he asked.
'Begin at the beginning,' the King said gravely, 'and go on till you come to the end: then stop.'
(Carroll, 2003)
1. Introduction
It is a well-established fact that the development of NLP tools and
applications requires a substantial amount of linguistic knowledge, which can
come either in the form of a computational grammar1 or in the form of a syntactically
annotated, machine-readable corpus known as a treebank.2 This research is an effort
to create a treebank for the Kashmiri language [KashTreeBank].3 It investigates the
theoretical as well as the practical issues involved in the creation of a small-scale
dependency treebank of Kashmiri, using a simple grammar formalism for syntactic
parsing and annotation.
Treebank creation is a Promethean task which requires different types of
resources and enormous funding for the development or acquisition of corpora and
tools, as well as for labor-intensive annotation, expert opinion, and validation.
The current research is an initiative to build language resources for Kashmiri so
that a baseline syntactic parser of Kashmiri can be developed. The findings of
this research can serve as a basis for carrying out treebanking for Kashmiri on a large
scale.
The next section discusses the motivations behind pursuing the current
research and highlights its social relevance. Section three provides a brief
introduction to the Kashmiri language. Section four, on the research problem, introduces
a whole spectrum of issues associated with treebanking in general and the development
of KashTreeBank in particular. Section five, on theoretical preliminaries,
elaborates the theoretical framework used in this research work.
1 An assembly of an e-dictionary and formal representations of the word-, phrase- and sentence-formation rules of a language.
2 Well-known treebanks include the Penn English Treebank (Marcus et al., 1993), the Penn Arabic Treebank (Maamouri et al., 2004), the Penn Chinese Treebank (Xue et al., 2004), the Prague Dependency Treebank of Czech (Hajičová & Hajič, 1998; Böhmová et al., 2003), the HyDTB Hindi Treebank (Begum et al., 2008), etc.
3 KashTreeBank was initially conceived as a summer school project in IASNLP 2011.
2. Motivation
A treebank is a rich language resource for research on grammar development and
grammar engineering. Grammar engineering is the practice of building elaborate
linguistic models on computers. It has been used for practical purposes for many
years, for instance in developing grammar checkers such as the
Microsoft grammar checker and Boeing's Simplified English grammar checker,
but contemporary grammar engineering involves the extraction and induction of
probabilistic grammars. Besides, if a treebank is created as a reference work
rather than an application-oriented repository, it can serve multiple functions in
various subfields of linguistics as well as in language technology. Theoretical
linguists can use a treebank to search for illustrations of the different
syntactic phenomena under investigation, whereas psycholinguists can use it to
find the relative frequencies of various possible PP attachments or relative clauses
(Abeillé, 2003). Similarly, formal and computational linguists can evaluate the
correctness and coverage of grammars and lexicons against the analyses stored in
a treebank and, at a more general level, assess the adequacy of linguistic theories and
formalisms.
Further, treebanking is not a goal in itself; rather, treebank-driven parsers are
used as important components of artificial intelligence (AI) systems such as machine
translation systems, question-answering systems and grammar checkers. A treebank
is therefore a valuable resource not only for Computational Linguistics (CL) and Natural
Language Processing (NLP) tasks, such as automatic syntactic parsing, grammar
induction4 and grammar extraction5, but also for non-technological academic
research such as experimental syntax. Evaluation of NLP systems or their
components is yet another field which is currently very active; these days,
treebanks are in much demand for the testing and optimization of syntactic parsers.
Treebanks can also be used for pedagogical purposes, both in the teaching of
language and of linguistic theory. For example, the Visual Interactive Syntax Learning (VISL)
project, established at the University of Southern Denmark, has developed
teaching treebanks for twenty-two languages with a number of different teaching
4 During the last decade treebanks have been used for the induction of probabilistic grammars for syntactic parsing (see Collins, 1999; Charniak, 2000), but currently they are used in data-driven parsing (see Bod, 1998; Nivre, 2009), which eliminates the traditional notion of grammar completely and uses a probabilistic model defined directly on the treebank.
5 Besides the optimization of syntactic parsers, treebanks are used to induce other linguistic phenomena relevant to NLP, e.g. the extraction of subcategorization frames (Briscoe, 1997).
tools, including interactive games such as Syntris6. Treebanks are also being
used for empirical linguistic research in theoretical syntax and historical
linguistics. For instance, the creation of historical treebanks such as those for Middle English
(Kroch & Taylor, 2000), Old English (Taylor et al., 2003) and Early New High
German (Demske et al., 2004) has revolutionized historical linguistics and
comparative philology in the last decade or two. Given the versatility of
treebanks in holding vast amounts of empirical grammatical knowledge, and given
their commercial utility as language data in research and
development, it is the need of the hour to develop large-scale treebanks for all
resource-poor languages, and Kashmiri is one of them.
3. Kashmiri Language
Kashmiri, locally known as "Koshur," is one of the 22 scheduled languages of
the Indian Union, as per the 8th Schedule of its Constitution. It is mainly spoken in the
greater region called "Kashmir," which includes the State of Jammu and Kashmir (JK) and
Pakistan-administered Kashmir. JK is located at a strategically important geographical
point, bordered by Tibet in the east, China in the north, Pakistan in
the west and south-west, and the rest of India in the south-east (Hussein, 1987).
There are approximately six million Kashmiri speakers scattered across India,
Pakistan, the UK, the USA and the Gulf countries (Ethnologue, 2006). Kashmiri is a Dardic
language, once considered genealogically distinct from the Indo-Aryan and Indo-Iranian
languages (Grierson, 1915), but it was later classified under the Dardic group
within the Indo-Aryan language family (Morgenstierne, 1961). It is closely related to
Shina and some other languages of the North-West Frontier (Koul, 2006).
It is a highly inflectional language with a predominant V2 phenomenon and
pronominal clitics, like the Germanic languages. Kashmiri is the only Dardic language
with a written tradition. It is written in a modified Perso-Arabic script, with
additional diacritics to capture its peculiar phonetic features. Like Urdu, its
writing convention is from right to left. Although Perso-Arabic is the officially
approved script, Kashmiri is also written in Devanagari; moreover, the Sharda and Roman
scripts have also been used for it from time to time. However, it is mainly written
in the modified Perso-Arabic script. The
script uses an additional distinguishing set of diacritic markers and letters for
6 See http://visl.edu.dk
representing a system of central vowels and secondary articulations, e.g.
palatalization in token-initial, medial and final positions. The script is therefore
fully capable of representing all the sounds of Kashmiri. It has two writing styles,
Naskh and Nastaliq. Kashmiri is mainly written in the Nastaliq style, either manually
by calligraphers (kA:tib) or by using a word processor such as InPage-Urdu. It
can also be input directly in Microsoft Word, where it will be displayed in the Naskh
style, like Arabic, as the available Unicode fonts are only in the Naskh style. It is
worth mentioning that readers of Kashmiri are not normally used to this style
and find it difficult to read.
4. The Research Problem
Kashmiri is a highly inflectional language with relatively variable word order,
extensive pronominal cliticization and a predominant V2 phenomenon. As far as
computational resources are concerned, it is a resource-poor language, lagging far
behind other Indian languages such as Hindi, Urdu, Punjabi, Bengali, Telugu and
Tamil. Several kinds of resources are needed for developing a treebank, such as
annotation guidelines, which state the conventions that guide the annotators
throughout their work, and a software tool to aid the annotation work. Since
constructing syntactic trees manually is a very slow and error-prone process,
semi-automatic annotation can be opted for, but semi-automated treebank
annotation needs a whole battery of NLP modules: a tokenizer, a POS tagger,
a morphological analyzer, a chunker (shallow parser) and a syntactic parser.
The development of KashTreeBank involves many challenges, ranging
from preliminary decision-making regarding the selection of a framework and the
associated formalism to the actual syntactic annotations and their representation in
a certain format. This multi-dimensional problem of creating
KashTreeBank can therefore be better addressed by describing the wide spectrum of smaller
problems related to its design and development: the choice of corpus, the
selection of a framework and the associated grammar formalism, the choice of
annotation scheme, the nature of the annotation process, the representation of the treebank and the
choice of annotation tool.
4.1. Choice of Corpus
A treebank can't be created out of a vacuum. One needs some primary source
data (machine-readable text) to work on and to annotate with the required linguistic
information. Either corpus resources already created under various projects can
be used, or new resources can be created for this purpose. But the choice
governing the acquisition of old resources or the development of new
ones should be a principled one. The principles necessary to determine the
choice of corpus for treebanking are as follows:
a) The corpus should be freely available for research and development, with
an easy licensing policy.
b) The licensing policy for the distribution of the corpus should not undermine
one's rights over the treebank.
c) The corpus should have been developed following certain encoding
standards, preferably Unicode for character encoding and XML for text
encoding.
d) The corpus should be sanitized and normalized, i.e. with no
typographical errors, tokenization problems or missing diacritics (crucial
ones).
e) The corpus should be balanced (with samples from all the possible
existing domains).
f) The corpus should represent almost all types of constructions of a
language, for wider treebank coverage, to produce a robust annotation
scheme and parsing model.
g) The corpus should preferably be annotated with morph, POS and chunk
information, so that one can directly start parsing sentences.
h) A sufficient quantity of corpus should be available (at least 1,500-2,000
sentences). Less than this quantity may be sufficient for developing an
annotation scheme and guidelines, but it won't be sufficient to train a
baseline parser.
It is not obligatory to follow all the above criteria strictly. They can vary from
language to language, but one certainly has to think along these lines when
acquiring or developing a corpus for treebank creation. It is important to
mention that there is a need to use a corpus of shorter sentences (with considerable
complexity) in the initial stage of the research, to lay down a basic annotation
scheme.
4.2. Treebanks and Linguistic Theory
The choice of a suitable framework, as well as an implementable formalism, is of
paramount importance in any treebanking endeavor, as it determines the nature of
all data (trees) in the treebank and consequently the value and utility
of the entire treebank. Since a number of grammatical frameworks and
formalisms exist worldwide, it has become imperative to choose one among the
existing models to be implemented on the selected sets of Kashmiri corpus. The
choice of annotation scheme for a treebank is influenced by different factors. One
of the most central considerations is its relationship with linguistic theory. It
must be decided whether the annotation scheme should be theory-specific or theory-neutral.
If the first alternative is taken, then which theoretical
framework should be adopted? If the second is opted for, then how do we achieve
broader consensus on framework selection, given that true theory-neutrality
is almost impossible? Although it has been argued that theoretical neutrality
should be maintained while creating a treebank (Xia, 2008), in reality
a theoretically neutral treebank is a myth. However, if theory-neutrality is
interpreted as NLP-friendliness, one can choose, for preparing
annotation guidelines, the framework that is most advantageous for Natural Language Processing.
The solution to the problem of framework selection and the design of the
annotation scheme comes from the interaction between the different factors that
govern treebanking, in particular from the nature of the language (configurational
or non-configurational) being analyzed. Also, researchers, particularly those
in resource-poor scenarios, cannot afford to disregard already created
resources and tools for automatic and interactive annotation. The following
criteria can be posited to help in the selection of a grammar formalism vis-à-vis an annotation
scheme:
a) The formalism should be simple and elegant, with fewer abstractions, i.e. it
should be NLP-friendly.
b) The associated resources (tools and schemes) should be accessible.
c) It should suit the nature of the language under investigation.
d) It shouldn't disregard the grammatical tradition of the language.
e) It should have some cognitive reality.
Two types of frameworks, constituency and dependency, have been used in
framing annotation schemes for different treebanks. A constituency-based
annotation scheme posits the structure of a sentence as hierarchically organized
phrases (IP = Spec + X' and X' = X + Comp) where the annotations are confined
to phrasal tags (such as S, JJP, NP, PP, VP, etc.). Such schemes do not explicitly represent
the grammatical relations between and within the constituents. On the other
hand, a dependency-based annotation scheme posits a sentence as a
dependency graph, i.e. a structure consisting of a head and a dependent with a
labeled arc (which can also be a directed arc) denoting the grammatical
relation (GR) between them. The relations in the syntactic structure can be labeled
not only with GRs but also with other specifications of the function of the
dependent. The syntactic units are words in the more lexicalized dependency frameworks
(Hudson, 1984; Mel'čuk, 1988), but dependency annotation schemes sometimes
rely on units of several words or word clusters, e.g. chunks in the case of Abney
(1991) and Bharati et al. (1994).
The annotation schemes used in different treebanks can be compared and
contrasted on the basis of the following parameters, proposed in Bosco and
Lombardo (2004):
a) The number of layers involved
b) The number and the nature of relations annotated
c) The richness of the annotation
d) The explicit representation of semantic information
On the one hand, the Penn Treebank (PTB) (Marcus et al., 1993) uses a mono-stratal
(single-layered) annotation scheme that combines the annotation of syntax and
semantics at the same level of representation. The syntactic annotation is based
on constituency, but it has been enriched with the annotation of a small set of
grammatical relations and semantic information. On the other hand, the
Prague Dependency Treebank (PDT) uses a multi-stratal
annotation scheme that consists of three separate layers: morphological, analytical
and tecto-grammatical (or semantic). The NEGRA treebank, in turn, uses a
mono-stratal annotation scheme which combines phrase-structure and dependency
representations, allowing for the direct representation both of phrases for fixed-
word-order constructions and of syntactic dependencies (predicate-argument
structures). The PDT uses a richer annotation of the relational structure than
the others; since the number of relations annotated in the NEGRA treebank and the PTB
is quite low, their representation of the relational structure is quite poor.
Nevertheless, the relational structure can be recovered all at once more easily in
mono-stratal representations such as NEGRA and the PTB than in multi-stratal
representations such as the PDT, where the information is spread over several structurally
different layers. The major limits of mono-stratal representation concern the
representation, at one level, of phenomena which require structurally
different levels, e.g. the representation of semantics and syntax as coordinated rather
than disjoint.
Some scholars claim that dependency-based annotation is more suitable for
relatively free-word-order languages (Hudson, 1984; Mel'čuk, 1988; Covington,
1990; Bharati et al., 1995), while others make their choice on the basis of
application requirements, and in some cases the annotation scheme follows the
linguistic tradition. To annotate corpora of relatively fixed-word-order
languages like English, the principle of constituency is usually employed. However,
in treebanks like the TIGER Treebank for German (Brants et al., 2002) and the Quranic
Arabic Treebank (Dukes & Buckwalter, 2010),
dependency is combined with PSG. Also, efforts have recently been made to annotate
relatively free-word-order languages like Hindi-Urdu with dependency structure,
lexical predicate structure and phrase structure in a coordinated manner (Palmer et
al., 2009). Further, a treebank can have multiple representations rooted in different
linguistic theories, maintaining theory-equality rather than theory-neutrality. For
instance, the Multi-Representational and Multi-Layered Treebank for Hindi-Urdu
(Bhatt et al., 2009) has both phrase-structure (PS) and dependency-structure
(DS) representations. In fact, multiple representations are the current
state of the art in treebanking, but one still has to start with one type of
representation.
4.3. Nature of the Annotation Process
Treebanking primarily involves the syntactic parsing and annotation of a POS-tagged
corpus, which can be done in different ways. The most commonly used method for
developing a treebank is a combination of automatic and manual processing.
However, some treebanks have been created completely manually, even with taggers
and parsers available to automate some of the work; such a method is rarely
employed in state-of-the-art treebanking. There are three main techniques for carrying
out the annotation process, viz.:
a) Supervised technique: the annotation process is carried
out manually by human annotators, preferably by syntacticians.
b) Unsupervised technique: the annotation process is
carried out automatically by an intelligent system called a syntactic parser
(developed without any training data).
c) Semi-supervised technique: the annotation process is
partly done automatically by a trained parser, and the parses are partly completed
or corrected by human intervention.
Traditionally, parsing or syntactic annotation was mostly confined to
manual methods, but after the development of more sophisticated grammar
formalisms, such as context-free grammars like PSG, it became possible to
automate the process of syntactic annotation, either on the basis of a computational
grammar, in which hand-crafted grammar rules (morphological, phrasal and
sentential) are used to develop a parser, or on the basis of statistical modeling, in
which a syntactically annotated electronic corpus is used to train a parser. Hybrid
techniques, involving both grammar rules and statistical modeling, are also
used to develop parsers. Treebank creation on the basis of automatic parsing,
using a probabilistic grammar or statistical modeling (Bod, 1998; Collins, 1999;
Charniak, 2000), is desirable for both practical and theoretical reasons, since manual
annotation has the disadvantage of being time-consuming, labor-intensive, costly
and error-prone. Also, it is difficult to achieve satisfactory consistency both within
and between human annotators (Van der Beek et al., 2002). However, for any
resource-poor language like Kashmiri, a fully automatic approach to treebank creation
is impractical. Therefore, sticking to the old method of manual annotation
is the only choice, and the treebank so obtained serves as data for training and
testing state-of-the-art parsers like the Stanford Parser (Klein &
Manning, 2003), MaltParser (Nivre et al., 2006) or the MST parser (McDonald,
2006). Training results in the induction of a language model, which in turn yields
a baseline Kashmiri parser. Once the baseline parser for Kashmiri is ready, it
can be employed to parse more and more Kashmiri corpus automatically and to learn
more and more structures by boot-strapping7, and only then can the labor-intensive
manual annotation be avoided. Nevertheless, the validation of the
automatically annotated corpus needs to be done manually. Since automatic
syntactic parsing is currently predominantly the domain of machine learning
(engineering), where consistency in annotation matters more than granularity,
7 Parse a little, learn a little
i.e. the depth of analysis, annotation guidelines need to be prepared and
followed strictly during the annotation process to avoid frequent inconsistencies
and the propagation of errors to other annotation layers.
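The boot-strapping loop described above ("parse a little, learn a little") can be sketched as follows. This is only an illustrative sketch: the `train` and `parse` functions here are toy stand-ins for a real trainable parser such as MaltParser, and the romanized Kashmiri tokens are made-up examples, not data from KashTreeBank.

```python
from collections import Counter, defaultdict

def train(treebank):
    """Induce a trivial model: for each POS tag, the most frequent
    relative head offset observed in the annotated sentences."""
    offsets = defaultdict(Counter)
    for sent in treebank:
        for i, (word, pos, head) in enumerate(sent, start=1):
            offsets[pos][head - i] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in offsets.items()}

def parse(model, tagged_sent):
    """Attach each token using the learned offset; unknown tags
    attach to the artificial root (head 0)."""
    parsed = []
    for i, (word, pos) in enumerate(tagged_sent, start=1):
        offset = model.get(pos, -i)
        head = max(0, min(len(tagged_sent), i + offset))
        parsed.append((word, pos, head))
    return parsed

def bootstrap(seed, raw_corpus, validate):
    """Parse a little, learn a little: retrain on all data so far,
    parse the next raw sentence, validate it manually, and fold the
    validated parse back into the treebank."""
    treebank = list(seed)
    for tagged_sent in raw_corpus:
        model = train(treebank)
        auto = parse(model, tagged_sent)
        treebank.append(validate(auto))
    return treebank

# Seed: one hand-annotated toy sentence (token, POS, head index; 0 = root).
seed = [[("shur", "NN", 2), ("chu", "VM", 0), ("gindan", "VM", 2)]]
raw = [[("kitab", "NN"), ("chai", "VM")]]
bank = bootstrap(seed, raw, validate=lambda s: s)  # identity "validator"
print(len(bank))  # prints 2: the treebank grew by one validated sentence
```

In practice the `validate` step is the manual correction pass the text insists on; only the corrected parses, never the raw automatic ones, are folded back into the treebank.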
4.4. Representation of Treebank
Treebanking involves not only the deep syntactic analysis of a natural-language corpus
(sentences) according to a particular grammar formalism, but also the representation
of the syntactic analyses (trees) in a certain format, so that the annotated information
can be read by an algorithm during the training process. A format is generally a sort
of matrix which represents the various levels of annotated grammatical information
in different data types or fields (columns), in such a way that a link is
maintained between them. Many such formats have been devised so far, for
instance CoNLL-X (see Table 2) and the Shakti Standard Format (SSF) (see Table
1). SSF was originally devised for the Shakti Machine Translation System for Indian
languages and is mostly used in India, whereas the CoNLL-X standard is a widely used
format. It has ten data types (fields), of which seven are utilized in the analysis.
Recently, algorithms have been developed to convert SSF into CoNLL so that
experiments can be done on a wider range of parsers.
In the CoNLL-X format, each word-form and punctuation mark is presented on
a separate line. Each word has a numerical address (NA) within the sentence in
Column 1. The next column is the actual word-form (WF), followed
by its base form (BF) in Column 3. The morphological description is given in
both a short, coarse-grained manner (POS) in Column 4 and a fine-grained
analysis (Morph) in Column 5. The dependency relations (dRel) are marked in
Column 7 by indicating the governing word (head/root/regent) using the
sentence-internal numerical address of Column 1. The dependency functions
(dFn) of the word-forms are presented in Column 8. Columns 6, 9 and 10 are
unused and are marked with an underscore (_).
In the present work, the annotated data is represented in SSF (Bharati et
al., 2007). SSF consists of four columns: Column 1 (C1) carries
the address of the token (1, 2, 3, ..., n); Column 2
(C2) carries the actual tokens, one token per line (see Fig. 1);
Column 3 (C3) carries the POS category of the node; and Column 4 (C4)
carries other features, such as the dependency relations. Any further information,
such as morph information, can be represented in this column using attribute-value
pairs. Therefore, the POS and chunk information of the tokens appears in C3, and
the morph, dependency and any other information pertaining to a node
appears in C4 (see Table 1).
<Sentence id="22">
C1    C2       C3     C4
1     ((       NP     <fs name='NP' drel='k2:VGF'>
1.1   سفید     JJ     <fs name='سفید'>
1.2   پلو      NN     <fs name='پلو'>
      ))
2     ((       VGF    <fs name='VGF'>
2.1   ٲسۍ      VAUX   <fs name='ٲسۍ'>
2.2   آسان     VM     <fs name='آسان'>
      ))
3     ((       NP     <fs name='NP2' drel='k4a:VGF'>
3.1   حضورن    NNP    <fs name='حضورن'>
      ))
4     ((       NP     <fs name='NP3' drel='pof:VGF'>
4.1   سٮٹھا    INTF   <fs name='سٮٹھا'>
4.2   پسند     NN     <fs name='پسند'>
4.3   ۔        SYM    <fs name='۔'>
      ))
</Sentence>
Table 1: An eight-token Kashmiri sentence in SSF
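The C4 feature structures in Table 1 are plain attribute-value pairs and can be read mechanically. The following sketch assumes only the simple `<fs attr='value' ...>` syntax shown above, not the full SSF specification:

```python
import re

def parse_fs(fs_string):
    """Extract an SSF feature structure such as
    <fs name='NP' drel='k2:VGF'> into a dict of attribute-value pairs."""
    return dict(re.findall(r"(\w+)='([^']*)'", fs_string))

fs = parse_fs("<fs name='NP' drel='k2:VGF'>")
print(fs["name"])                   # NP
rel, head = fs["drel"].split(":")   # dependency label and head chunk
print(rel, head)                    # k2 VGF
```

The `drel` value encodes both the dependency label (here the Paninian relation k2) and the name of the head chunk (VGF), separated by a colon, which is why the split on ":" recovers the arc of Table 1's first noun phrase.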
1.NA  2.WF    3.BF     4.POS  5.Morph    6.  7.dRel  8.dFn  9.  10.
1     سفید    سفید     JJ     JJ.0.0.0   _   2       Adj    _   _
2     پلو     پلو      NN     NN.0.0.0   _   3       Obj    _   _
3     ٲس      ٲس       VA     VM.0.0.0   _   0       Root   _   _
4     آسان    آسان     VM     VA.0.0.0   _   3       Aux    _   _
5     حضو‘ر   حضو‘رن   NNP    NNP.0.0.0  _   3       Subj   _   _
6     سٮٹھا   سٮٹھا    INT    INT.0.0    _   7       Intf   _   _
7     پسند    پسند     NN     NN.0.0.0   _   3       pRoot  _   _
8     ۔       ۔        SYM    SYM.0      _   _       _      _   _
Table 2: An eight-token Kashmiri sentence in CoNLL-X format
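A CoNLL-X token line, as described above, is simply ten tab-separated fields. The sketch below reads one such line; the field names follow the column labels of Table 2 (a real CoNLL-X reader would also handle whole files, where sentences are separated by blank lines), and the romanized example row is hypothetical, mirroring row 2 of Table 2:

```python
# Field names following the column labels used in Table 2;
# columns 6, 9 and 10 are unused and hold '_'.
FIELDS = ["na", "wf", "bf", "pos", "morph", "f6", "drel", "dfn", "f9", "f10"]

def parse_conll_line(line):
    """Split one CoNLL-X token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

# A hypothetical romanized row mirroring Table 2, row 2.
tok = parse_conll_line("2\tpholav\tpholav\tNN\tNN.0.0.0\t_\t3\tObj\t_\t_")
print(tok["drel"], tok["dfn"])  # 3 Obj  (head address and dependency function)
```

Note that, as in the prose description, the head is given not as a word but as the sentence-internal numerical address of the governing token, so recovering the governor means looking up the token whose `na` equals this `drel` value.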
4.5. Choice of Annotation Interface
The annotation process for the development of a treebank can't be accomplished
effectively unless some user-friendly annotation interface is available. The
annotation interface is generally customized on the basis of the requirements of the
annotation scheme. Given the specifications of the treebank to be built, one can
search for open-source tools instead of wasting resources on developing new
ones. In fact, there are many open-source syntactic and syntacto-semantic
annotation tools available, developed under various research
projects throughout the world. Such tools include:
i. Dependency Grammar Annotator (DGA)
This tool has been developed to facilitate the syntactic annotation of text corpora
within the formal framework of Dependency Grammar (Tesnière, 1959). DGA is
a user-friendly graphical interface which allows the efficient creation and
manipulation of syntactic structures. It was developed by Marius Popescu of
the University of Bucharest under the BALRIC-LING project.
ii. Syntactic Tree Viewer
This is an easy-to-use interface for visualizing or creating simple linguistic trees. It
allows the creation and editing of syntactic trees and the viewing of the output in
string format. It supports the visualization of parse trees produced by various parsers,
including the Stanford parser and the Charniak parser, and can also visualize Penn
Treebank trees with slight modification.
iii. Sanchay
Sanchay is an open-source platform for carrying out various NLP tasks for South Asian
languages (SALs). It has been extensively used for Indian languages (ILs) at
various NLP research labs, especially at the LTRC lab, for research projects
such as ILMT8, the treebank projects (Hindi, Urdu, Telugu and Bangla), PropBank9
and the Bengali Treebank10. The tool has thus been very instrumental in creating language
resources and carrying out various NLP tasks for ILs. It is generally
assumed that Sanchay is exclusively devised to implement PCG, but in fact
it can be customized and used irrespective of grammatical framework and
formalism; it is also true, however, that Panini's PCG was first experimented with and
implemented in Sanchay for ILs under the ILMT project.
iv. Cornerstone
Cornerstone is a PropBank frameset editor developed at the University of Colorado at
Boulder. It runs platform-independently and supports multiple languages, including
Arabic, Chinese, English, Hindi and Korean. It is worth mentioning that before the
development of Cornerstone, Sanchay was used for PropBank (Palmer et al.,
2005) to annotate predicate-argument structure. However, Cornerstone is not sufficient
for treebanking, where one also needs to annotate beyond predicate-argument
structure, e.g. coordinated and embedded clause constructions, sentential
modifiers, the internal structure of complex predicates, serial verb constructions, and
subject- and object-complement constructions.
Besides the above-mentioned tools, there is also the GATE architecture, which can be
customized and used for syntactic annotation and other NLP tasks. Finally, it
8 Indian Languages Machine Translation, a consortium project at the LTRC lab, IIIT Hyderabad
9 PropBank (Palmer M, Kingsbury P, Gildea D, 2005) at University of Colorado
10 LDCIL-IIT Kharagpur Bangla Treebank (Sanjay C, Praveen S, Sudeshna S, Devshri R, 2009)
is worth mentioning that if the demands of an annotation scheme are not fulfilled by
such open-source tools (as in the case of Cornerstone), even after their
customization, new annotation tools can be developed, provided funding and
technical support are available. Usually, however, annotation schemes are never alien:
they are developed in consonance with pre-existing schemes and tools. So the second
situation barely arises, and it is hardly necessary to strive to build new
annotation tools.
5. Theoretical Preliminaries
Lucien Tesnière, a French linguist, developed in the 1930s a relatively formal and
sophisticated theory of dependency grammar (DG), Éléments de syntaxe structurale, for
pedagogical purposes. It was first drafted in 1939 but published later, posthumously,
in 1959. Tesnière puts forward his notion of dependency in the following
lines:
"[I] La phrase est un ensemble organisé dont les éléments constituants sont les
mots. [II] Tout mot qui fait partie d'une phrase cesse par lui-même d'être isolé
comme dans le dictionnaire. Entre lui et ses voisins, l'esprit aperçoit des
connexions, dont l'ensemble forme la charpente de la phrase. [III] Les connexions
structurales établissent entre les mots des rapports de dépendance. Chaque connexion
unit en principe un terme supérieur à un terme inférieur. [IV] Le terme supérieur
reçoit le nom de régissant. Le terme inférieur reçoit le nom de subordonné. Ainsi
dans la phrase Alfred parle [...], parle est le régissant et Alfred le subordonné."
(Tesnière, 1959, pp. 11-13)
"[I] The sentence is an organized whole, the constituent elements of which are
words. [II] Every word that belongs to a sentence ceases by itself to be isolated as
in the dictionary. Between the word and its neighbors, the mind perceives
connections, the totality of which forms the structure of the sentence. [III] The structural
connections establish dependency relations between the words. Each connection
in principle unites a superior term and an inferior term. [IV] The superior term
receives the name governor. The inferior term receives the name subordinate. Thus,
in the sentence Alfred parle11 [...], parle is the governor and Alfred the
subordinate."12 Here parle is also the root (the head of the whole clause Alfred parle) of
the structural diagram (dependency graph) called a 'stemma', which is widely used
in different formalisms of the dependency framework.
11 The French clause “Alfred parle” means “Alfred speaks”
12 Translated from Tesnière (1959, pp. 11-13) by Joakim Nivre (2009)
Dependency relations belong to the structural order, which is different from the linear
order of the spoken or written string of words (Nivre, 2009).
A dependency relation holds between a head (H) and a dependent (D) in a clause or
sentence, and is represented by a labeled arc13 (arrow) projecting from the H to the
Ds. Therefore, the criteria for establishing dependency relations and for
distinguishing between the H and the D are of paramount importance, not only in the
dependency framework, but also within other frameworks where the notion of
syntactic head plays a pivotal role, including all constituency-based frameworks
belonging to some version of X-bar theory (Chomsky, 1970; Jackendoff, 1977).
Zwicky (1985) has proposed the following criteria to distinguish between
an H and a D in a construction (C)14:
i. H determines the semantic category of C, D gives semantic specification.
ii. H determines the syntactic category of C and can often substitute C.
iii. H is obligatory, D is optional.
iv. H selects D and determines whether D is obligatory or optional.
v. The form of D depends on H (government or agreement/concord).
vi. The linear position of D is specified with reference to H.
It is very important to distinguish between syntactic dependencies in endocentric
and exocentric constructions (Bloomfield, 1933). For illustration, consider the
structure of the following sentence, taken from the Wall Street Journal part of
Penn Treebank:
Figure.4: Dependency structure for English sentence15
The attribute (ATT) relation holding between the H (noun “markets”) and the D
(adjective “financial”) is an endocentric construction in which the head can substitute for
13 The notational convention used in the dependency graph above is that arrows point from the H to the Ds, but there is a competing tradition in the literature according to which
arrows point from the Ds to the H (Nivre, 2009).
14 Taken from Hudson (1990, pp. 106-7).
15 One peculiarity of the dependency structure in Figure.4 is that there is an artificial word root before the first word of the sentence. This is a mere technicality, which simplifies both
formal definitions and computational implementations. In particular, it is assumed that every real word of the sentence has a syntactic head. Thus, instead of saying that the verb had
lacks a syntactic head, it can be said that it is a dependent of the artificial word root (Nivre, 2009).
the entire group of words “financial markets” (phrase or chunk) without
impacting the overall syntactic structure of the sentence. Endocentric
constructions generally satisfy all of the above criteria. However,
criterion (iv) is usually considered less relevant, as dependents are always
optional in such constructions.
In contrast, the prepositional complement (PC) relation holding between the H
(preposition “on”) and the D (noun “markets”) is an exocentric
construction, in which the head cannot substitute for the entire phrase (“on financial
markets”). Such constructions fail to meet criterion (ii), at least with
respect to the substitutability of the head for the whole construction (phrase or
chunk), but they may satisfy the rest of the criteria. Further, the subject (SBJ) and
object (OBJ) relations are clearly exocentric, while the remaining ATT relations
(effect → little, effect → on) have a more unclear status.
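To make the arc notation concrete, the dependency structure discussed above can be sketched as a small data structure. This is an illustrative sketch, not part of the dissertation's toolchain; the token indices, the PRED label on the root arc, and the helper function are assumptions for exposition.

```python
# Illustrative sketch (not from the dissertation) of the dependency structure
# for "Economic news had little effect on financial markets".
# Each arc runs from head (H) to dependent (D), as in the convention of
# footnote 13; the artificial root of footnote 15 is token 0, so every real
# word has exactly one head.

ROOT = 0
tokens = ["<root>", "Economic", "news", "had", "little",
          "effect", "on", "financial", "markets"]

# (head index, dependent index, relation label)
arcs = [
    (2, 1, "ATT"),      # news -> Economic      (endocentric)
    (3, 2, "SBJ"),      # had -> news           (exocentric)
    (ROOT, 3, "PRED"),  # root -> had
    (5, 4, "ATT"),      # effect -> little
    (3, 5, "OBJ"),      # had -> effect         (exocentric)
    (5, 6, "ATT"),      # effect -> on
    (6, 8, "PC"),       # on -> markets         (exocentric)
    (8, 7, "ATT"),      # markets -> financial  (endocentric)
]

def heads_of(dep):
    """Return the heads assigned to a dependent token index."""
    return [h for (h, d, _) in arcs if d == dep]

# Single-head constraint: every real word has exactly one syntactic head.
assert all(len(heads_of(d)) == 1 for d in range(1, len(tokens)))
```

The same single-head check underlies well-formedness in most dependency treebanks.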
The contrast between endocentric and exocentric constructions is also
related to the contrast between head-complement and head-adjunct relations: the
former relations (preposition-noun) are exocentric while the latter relations
(adjective-noun) are endocentric. A third type, the head-specifier relation
(determiner-noun), is also exocentric, like head-complementation,
but there is no clear selection of the dependent element by the head. The contrast
between complements and adjuncts (modifiers) is often defined in terms of
valency, which is the central notion in the theoretical tradition of dependency
grammar. The notion of valency was originally borrowed from chemistry and is
usually related to the argument structure16. The idea is that the verb (H) imposes
certain requirements on its syntactic dependents that reflect its interpretation as a
semantic predicate. The nouns (Ds) which are the arguments of a predicate (and
can be obligatory or optional in surface syntax) can occur only once with each
predicate, whereas the Ds which are adjuncts (and tend to be optional) can occur
more than once with a single predicate. The valency frame of the verb (predicate)
is generally considered to include only those Ds which are arguments, not adjuncts.
Therefore, in Figure.4, the SBJ “news” and the OBJ “effect” would
generally be considered valency-bound Ds of the H “had”, while the adjectival
16 Argument Structure is inherent property of certain classes of lexemes, particularly verbs (also for nouns and adjectives). The argument structure of verb is called predicate argument
structure.
modifiers of the Hs “news” (“economic”) and “markets” (“financial”) would be
considered valency-free Ds.
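The valency-bound versus valency-free distinction drawn above can be sketched as a simple check: arguments may fill each slot of the frame at most once, while adjuncts may recur. The frame for “had” and the relation labels are illustrative assumptions, not the dissertation's annotation scheme.

```python
# Hypothetical sketch of the valency distinction: valency-bound dependents
# (arguments) fill each frame slot at most once per predicate, while
# valency-free dependents (adjuncts) may recur freely. The frame below is
# an assumption for illustration.

VALENCY_FRAMES = {"had": {"SBJ", "OBJ"}}  # valency-bound relations only

def valency_ok(verb, dependents):
    """dependents: (relation, word) pairs attached to the verb."""
    frame = VALENCY_FRAMES[verb]
    bound = [rel for rel, _ in dependents if rel in frame]
    # each argument slot of the frame may be filled at most once
    return all(bound.count(rel) <= 1 for rel in frame)

assert valency_ok("had", [("SBJ", "news"), ("OBJ", "effect")])
assert not valency_ok("had", [("SBJ", "news"), ("SBJ", "it")])
```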
While head-complement and head-modifier structures have a fairly
straightforward analysis in dependency grammar, there are also many
constructions that have a relatively unclear status. This group includes
constructions that involve function words, such as articles, complementizers and
auxiliary verbs, and apart from that, structures involving prepositional phrases.
For these constructions, there is no general consensus in the tradition of
dependency grammar as to whether they should be analyzed as head-dependent
relations at all and if so, what should be regarded as the head and what should be
regarded as the dependent. For example, some theories regard auxiliary verbs as
heads taking lexical verbs as dependents; other theories make the opposite
assumption; and yet other theories assume that verb chains are connected by
relations that are not dependencies in the usual sense. Another kind of
construction that is problematic for dependency grammar (as for most theoretical
traditions) is coordination. According to Bloomfield (1933), coordination is an
endocentric construction, since it contains not only one but several heads that can
replace the whole construction syntactically. However, this characterization raises
the question of whether coordination can be analyzed in terms of binary
asymmetrical relations holding between a head and a dependent.
6. Chapterization
This dissertation consists of the following seven chapters:
Chapter.2: Review of Existing Literature
The chapter surveys the existing literature on various grammar formalisms and
treebanking. It presents a historical view on dependency parsing, tracing its roots
in Indian, Semitic and Hellenic traditions. Some brief history of treebanking is
also traced out. It attempts to link these old grammatical traditions with the
contemporary practice of natural language parsing & treebanking.
Chapter.3: Developing KashCorpus
The chapter begins by introducing the philosophical grounds that underlie
the current corpus-based research. It gives a brief account of language and other
computational resources that have been developed for Kashmiri. Finally, the
chapter investigates the problems of KashCorpus collection, development,
sanitization & normalization.
Chapter.4: POS Tagging of KashCorpus
The chapter discusses the building of the fundamental layer of annotation for the
dependency treebank of Kashmiri, i.e. parts-of-speech tagging of the selected
portion of Kashmiri corpus. Further, a brief review of various POS tagging
frameworks and tagsets that have been developed for English and Indian
Languages is given. The various issues that have been encountered in the
annotation process and the empirical results are presented in this chapter.
Chapter.5: Chunking of KashCorpus
The chapter discusses the second layer of annotation for building the dependency
treebank of Kashmiri, i.e. the chunking of POS annotated KashCorpus. It presents
the detailed description of various chunks found in Kashmiri. It further gives a
detailed account of various issues and also presents the empirical results.
Chapter.6: Syntactic parsing of KashCorpus
The chapter discusses dependency annotation of the chunked KashCorpus in
detail and presents a detailed account of the dependency treebank of Kashmiri
(KashTreeBank). Further, the language-related issues which arose
during the annotation process are discussed. Finally, the results of inter-
annotator agreement are presented.
Chapter.7: Conclusion
It presents a conclusion of all the research presented in this dissertation.
Chapter.2 Review of Existing Literature
‘Would you tell me, please, which way I ought to go from here?'
‘That depends a good deal on where you want to get to,' said the Cat.
‘I don't much care where,' said Alice.
‘Then it doesn't matter which way you go,' said the Cat.
‘So long as I get somewhere,' Alice added as an explanation.
‘Oh, you're sure to do that,' said the Cat, ‘if you only walk long enough.'
Carroll, 2003
1. Introduction
This chapter surveys the existing literature on grammar formalisms,
dependency parsing and treebanking. The chapter is organized in nine sections.
Section two presents various relational structure (dependency) based grammar
formalisms for treebanking. Section three discusses various modifications in the
notion of VP to account for the non-configurationality and to justify the use of
dependency based formalisms. Section four tries to view dependency grammar
from the historical perspective, tracing its roots in ancient & medieval times.
Section five presents the rationale for using DG. Section six describes the notion
of treebanking. Section seven presents the principles involved in treebanking.
Section eight gives a brief account of some dependency treebanks. Finally,
section nine summarizes the chapter.
2. Grammar Formalisms
There is a very close relationship between grammar formalism, syntactic parsing,
syntactic annotation and treebanking. In fact, a treebank is the product of syntactic
parsing and annotation of a natural language corpus, based on a given grammar
formalism or, simply, grammatical model. The syntactic annotation for building a
treebank can be carried out manually, automatically or semi-automatically.
The term ‘parsing' derives from the Latin phrase pars orationis, meaning
“part of speech”. The term refers to both the synthetic (bottom-up)17 and
the analytical (top-down)18 approaches of inquiry into natural language syntax.
In CL and NLP literature, the former is commonly known as dependency based
parsing (DBP), which addresses the following research questions: a. How do
words combine to form sentences? b. How does bottom-up approach to parsing
17 This bottom-up approach is widely used in Europe (by linguists in Germany, France, Scandinavia, Czechoslovakia, Russia), and by Russians and Slavists in the USA (Mel'cuk,
Shaumyan, Nichols): from concrete data (empirical) to abstract categories (rational), or simply from data to theory.
18 In the 1930s Leonard Bloomfield in the USA developed a top-down approach, Immediate-Constituent Analysis (which turned into PSG, TGG, X-bar Syntax and Minimalism),
largely inspired by the German psychologist Wundt (Percival, “On the Historical Source of Immediate Constituent Analysis”): from abstract categories (rational) to concrete data
(empirical), or simply from theory to data.
help in understanding the nature of language? c. How does bottom-up approach
facilitate the annotation and capturing of the grammatical knowledge and ensure
its role in developing real world computational tools and applications? The latter
approach is known as constituency based parsing (CBP), which addresses
similar research questions: a. How is a sentence broken into smaller units
like clauses, phrasal nodes and then into the terminal nodes (words)? b. How does
top-down approach of analysis help in understanding the nature of language and
how does it ensure its role in developing real world computational tools and
applications? Both approaches include some notion of relational structure,
but describe it in different ways (Bosco & Lombardo, 2004). Since the
notions of dependency and relational structure are used in the current work,
constituency-based formalisms such as PSG, GB & Minimalism are not dealt with
here. However, the notions of grammatical relations and predicate argument
structure are given a proper treatment.
There are several approaches in the literature to explain the grammatical
relationships (GRs) in a clause. These approaches posit GRs as semantic roles which
include Verb-specific roles e.g. Runner, Killer, and Bearer; Thematic Roles e.g. Agent,
Patient, Theme, Instrument and Experiencer, and Generalized Roles like Actor and
Undergoer (Dowty, 1982; Van Valin, 1999). Marantz (1984) describes that GRs are the
syntactic counterparts of certain Logico-semantic relations such as the predicate-subject
and modifier-modifiee relations. Rappaport & Levin (1988) describe GRs in terms of
purely syntactic relations (SUBJ, DOBJ, and IOBJ) and thematic roles. However, the
status of thematic roles (as purely semantic or syntacto-semantic) and the identification of
an appropriate inventory of semantic GRs are not very clear (Leech et al., 1996). It gets
more complicated when we see that purely syntactic relations may bear thematic roles. For
instance, in the sentence “the garden is swarming with vipers,” the subject bears a
locative thematic role instead of the more expected agent relation (Renzi, 1988). There are no clear
one-to-one correspondences between syntactic relations and semantic roles, and most theories
of grammatical relations make a distinction between purely syntactic relations and semantic roles.
The distinction between syntactic and semantic relations with some independence from
morphology is not new and can be traced back to Panini’s Karaka theory. The six Karakas are
semantic relations (agent, object, instrument, destination, source and locus) which are assigned to
the nouns governed by a Verb. However, an inventory of universally accepted semantic relations,
also known as thematic or theta-roles ceases to exist.
2.1. Dependency Grammar (DG)
In contrast to constituency, dependency is a vertical organizational principle
that expresses binary asymmetrical relations between a head and its dependents19
(Kruijff, 2002). The basic idea of dependency grammar is that the syntactic
structure is a flat (non-terminal-free), rooted structure called a stemma,
which consists of lexical elements linked by binary asymmetrical relations called
dependencies. The variants of DG briefly reviewed here are Structural
Syntax (SS), Functional Dependency Grammar (FDG), Word Grammar (WG),
Meaning-Text Theory (MTT) and Paninian Computational Grammar (PCG).
These variants share the major tenets of dependency and propose relation-based
structures for language representation.
i. Structural Syntax (Tesnière, 1959)
It adheres to the long standing notion that syntax is a matter of combinatory
requirements or capabilities of words (i.e. their valency). The fundamental
syntactic building block of the sentence is considered to be a word (token) which
is linked to other words (directly or indirectly) by means of the dependency
relations (for details see Chapter 1).
The main idea behind Tesnière’s model is the notion of dependency which
identifies the syntactic relation existing between two elements within a sentence,
one of them taking the role of governor (or head) and the other of dependent
(régissant and subordonné in the original terminology). He schematizes this
syntactic relation using a tree diagram called Stemma.
In his scheme all words are divided into two classes: full content words
(e.g., nouns, verbs, adjectives, etc), and empty functional words (e.g. determiners,
prepositions, etc). Each full word forms a block which may additionally include
one or more empty words and it is on blocks that operations are applied. He
distinguishes four block categories (or functional labels); nouns, adjectives, verbs
and adverbs. Also, a distinction is made between Actants and Circumstants. The
verb represents the process or state expressed by the clause and all its actants
(representing the participants) are determined by the valence of the verb and have
the functional labels of nouns. On the other hand, the verb’s Circumstants
(representing the circumstances under which the process is taking place, i.e. time,
19 The alternative terms used in the literature are modifier or child for dependent, and modified, governor, regent or parent for head.
manner, location, etc) have the functional labels of adverbs. There are two
operations, Junction and Transference, by means of which it is possible to
construct more complex clauses from simple ones. The junction is employed to
group blocks which are at the same level, i.e. Conjuncts, into a unified entity by
itself attaining the status of a block. The conjuncts belong to the same category
and are horizontally connected (though not always) by means of empty words called
the conjunctions. There are two types of transference operations. The first degree
transference is a changing process which makes a block to change its original
category. This process occurs by means of one or more empty words belonging to
the same block, called transferrers. For instance, the word ‘rotten' in
the construction “rotten food” is transferred from verb to the functional label of
an adjective through the transferrer, the participle suffix -en. The second degree
transference occurs when a simple clause becomes an actant or a circumstant of
another clause, maintaining all its previous lower connections, but changing its
functional label within the main clause.
For example:
1. She believes that he knows it.
2. The man I saw yesterday is here today.
3. You will see him when he comes.
In sentence 1, we have a verb-noun transference by means of the transferrer
‘that.' The embedded clause in italics takes the functional label of a noun and
becomes the object of the verb. The embedded clause in sentence 2 is a verb-
adjective transference without any transferrer. The temporal clause in sentence
3 is an example of verb-adverb transference, where the transferrer is ‘when.'
Actants (arguments) are immediately dominated by the verb and represent the
entities involved in the event described by the verb (they are obligatory, filling the
valence frame of the verb). The circumstants (adjuncts), instead, express the
circumstances of the event (optional). The first actant corresponds to Arg-1 (SUBJ), the
second to Arg-2 (DOBJ), and the third to Arg-3 (IOBJ), as in RG. In SS, the
verbal valency also motivates this sorting of actants. The first actant can be found
in mono-, bi- and trivalent verbal nodes (those taking one, two or three actants), the
second only in bi- and trivalent nodes (those taking two or three actants), and the third
only in trivalent nodes (those taking three actants). Dependency relations are
annotated to make the function of the nodes explicit. The words of a sentence
together with their dependency relations form the dependency graph, in which the
information regarding the dependency structure is explicit while the
information regarding the constituent structure is implicit, e.g. a node X with the
sub-tree attached to it can represent the constituent headed by X (X-phrase) and
can express all the important properties of the constituent. Therefore, a sentence
structure can be described as consisting of structural nodes organized
hierarchically by the nodal functions and held together by structural connections.
A structural node is a group of words consisting of only one head and one or more
sub-ordinate words. It is this head of the structural node which carries out the
nodal function. The structure and the meaning of the sentence are theoretically
independent but parallel as the structural connections match with the semantic
connections to negotiate the meaning. In fact, a structural connection is usually
motivated by a semantic connection, i.e. two words are linked by a structural
connection in order to make their semantic connection explicit. Just as the head of
the structural node bears the nodal function, the head of the semantic node bears
the semantic function.
ii. Functional DG (Tapanainen & Jarvinen 1997, Tapanainen 1999)
It is a computational implementation of Tesnière (1959), describing Structural
Syntax (SS) through formal rules. FDG posits that the basic elements of the
syntactic structure are nuclei, which have mutual connections, and every
nucleus has only one head. The relationship between structure (i.e. syntax) and
semantics is evident in the notion of the nucleus20, which encompasses both the
structural and the semantic node. Since there is a close parallelism between
syntax and semantics, i.e. the syntactic structure depends on the semantic
interpretation rather than on word order or morphological marking, variation
in word order does not affect the structural analysis of the sentence. The basic
element of FDG is the nucleus, which consists of tokens that are words or parts
of words of the input sentence. Here, a distinction is made between valency
functions, which are unique in the nucleus (actants), and ambiguous functions
(circumstants).
iii. Word Grammar (Hudson 1990, Hudson 1984)
20 The nucleus is a unique head which coincides with the head of the structural as well as the semantic node and consequently bears both the semantic and the nodal functions.
Otherwise the nucleus is said to be dissociated. The most typical example of a dissociated nucleus is the verb group consisting of an auxiliary verb and a main verb; the former
bears the nodal function, while the latter bears the semantic function.
WG, primarily developed for English, is a monostratal, non-transformational
approach which uses word-to-word dependencies to show grammatical relations/
functions by explicit labels, e.g. SUBJ and OBJ. It includes two main inheritance
hierarchies: the system of word classes, which also includes all lexemes and
inflections, and the system of dependency types or grammatical functions. WG
presents language as a network of knowledge, where all areas of knowledge
are included with no clear-cut boundaries between the ‘internal' and ‘external'
facts about words.
iv. Meaning-Text Theory (Mel'cuk 1988)
MTT was primarily developed for Russian. It provides a rich representation and
analysis of a variety of aspects of language. Natural language is posited as a
logical device that establishes correspondences between the infinite set of possible
meanings and the infinite set of possible texts. The representation of a sentence
consists of several separate components, in particular a semantic component and
a deep-syntactic component. By performing several operations, the semantic
component establishes the correspondence between a sentence and all its
synonymous sentences. The deep-syntactic component establishes the
correspondence between the various syntactic realizations of a sentence.
v. Paninian Computational Grammar (Bharati et al., 1993)
PCG, which is a variant of dependency grammar (Kiparsky & Staal, 1969; Shastri,
1973), has been used for syntactic annotation in the current work. This
model helps to capture the syntacto-semantic relations in a sentence. A sentence is
treated as a series of modifier-modified relations whose primary modified
element, the main verb, is the root of the dependency tree. The elements which
modify the verb are its arguments, which participate in the action specified by the
verb. The relations of these participants (arguments) with the verb are called karakas.
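As an illustration of the Paninian analysis just described, a sentence can be sketched as a set of karaka-labeled modifier-modified pairs rooted in the main verb. The English example and the k1/k2 labels (karta and karma, as used in AnnCorra-style schemes) are assumptions for exposition, not drawn from KashTreeBank.

```python
# Illustrative sketch of a Paninian-style analysis: the main verb is the root,
# and each participant is linked to it by a karaka relation. The sentence and
# the k1 (karta/agent) and k2 (karma/theme) labels are assumptions.

karaka_relations = [
    ("reads", "Mary", "k1"),   # karta: the agent participant
    ("reads", "books", "k2"),  # karma: the theme participant
]

heads = {h for h, _, _ in karaka_relations}
dependents = {d for _, d, _ in karaka_relations}

# the primary modified element (the main verb) is the root: it heads
# everything and is itself unheaded
root = heads - dependents
assert root == {"reads"}
```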
2.2. Relational Grammar (RG)
RG is basically motivated by the idea that positing grammatical relations in terms
of linear constituent order and domination by the VP node is
inadequate for VSO languages like Welsh, in which there can be no VP node
(Perlmutter, 1983), and for free word-order languages like Czech (Dowty, 1982).
RG is primarily concerned with capturing the pure grammatical relations
that constitute predicate argument structure (syntactic) and other relations
(semantic) that are not related to the core arguments (Perlmutter, 1980). The
former comprises the three pure grammatical relations: S (subject), DO (direct
object) and IO (indirect object), or 1, 2 and 3, respectively. The numbers (1, 2 & 3)21
are assigned to posit a hierarchical organization which is motivated by the
behavioral properties of the relations (head vs. dependent). The latter includes a
set of impure grammatical relations (oblique objects, OO) that have independent
semantic content, such as Instrumental, Locative, Benefactive, etc. The NPs which are
labeled with pure grammatical relations are called Terms, while the other
NPs/PPs which are labeled with impure grammatical relations (i.e. with their
semantic functions) are called Non-Terms. Figure.1 is a relational network which
posits the relational structure of a clause as an abstract universal representation
that remains constant in spite of cross-linguistic morpho-syntactic variations,
i.e. a clause in another language, involving the same predicate and the same
participants, will be represented by the same relational network.
Figure 1: The Relational Network Showing Dative Shift
RG assumes a universal mapping between thematic and grammatical relations,
known as the Universal Alignment Hypothesis (UAH): the agent maps onto argument 1
(John), the patient or theme onto argument 2 (the book), and the recipient onto
argument 3 (Mary). However, the surface form of the clause does not always correspond
to the UAH, for instance in passive and dative-shift constructions. In these cases
several syntactic layers (strata) are proposed, and the surface syntactic form of the
clause is derived through a series of transformations that generate a syntactic form
consistent with the UAH.
21 Johnson (1977) used the terms S, DO and IO to describe grammatical relations, but Perlmutter (1980) used the simple numbers 1, 2 & 3.
Figure 2: The Relational Network of Passive Construction
The initial stratum in the Figure.2 represents the underlying syntactic structure,
which corresponds exactly to the active form of the clause (John eats the apple)
where the UAH holds because the Agent is 1 (John) and the Theme is 2 (the
apple). But when the passive rule is applied, there is a transformation
which produces a second stratum where the initial 1 loses its syntactic role,
becoming a chomeur22, and the initial 2 becomes 1 (2-to-1 advancement). Since the
semantic relations remain unchanged from the initial stratum, the agent is mapped
in this final stratum to a chomeur, whilst the theme is mapped to 1, thus contrasting
with the UAH. In the representation of dative shift (see Figure.1), by contrast,
comparing the initial stratum with the final one, we observe a phenomenon referred
to as 3-to-2 advancement: the recipient, which corresponds in the initial stratum to 3,
becomes 2 in the final stratum; consequently, 2 loses its role and becomes a chomeur.
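The 2-to-1 advancement described above can be sketched as an operation on strata. The dictionary representation and the function name are illustrative assumptions; RG itself is not a computational formalism.

```python
# Minimal sketch of the strata described above: each stratum maps nominals to
# relations (1 = subject, 2 = direct object, "cho" = chomeur). The passive
# rule is modelled as 2-to-1 advancement, demoting the initial 1 to chomeur.

def passivize(stratum):
    """Apply 2-to-1 advancement to a stratum {nominal: relation}."""
    new = {}
    for nominal, rel in stratum.items():
        if rel == 2:
            new[nominal] = 1       # the initial 2 advances to 1
        elif rel == 1:
            new[nominal] = "cho"   # the initial 1 loses its role (chomeur)
        else:
            new[nominal] = rel
    return new

initial = {"John": 1, "the apple": 2}   # "John eats the apple"
final = passivize(initial)              # "The apple is eaten (by John)"
assert final == {"John": "cho", "the apple": 1}
```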
2.3. Lexical Functional Grammar (LFG)
LFG posits a flat (VP-less) structure for many VSO languages (Kroeger, 1993)
and “free-word order” or “non-configurational” languages like Warlpiri (Simpson
1991). In LFG, grammatical relations are termed functions (Bresnan, 1982;
Bresnan and Kaplan, 1982). Since LFG does not adhere to the notion of an
underlying abstract syntactic representation and transformational rules, it posits a
representation where the lexicon plays a key role. It postulates three distinct but
interrelated levels of grammar which co-occur in a single representation: Lexical
Structure (LS), Functional Structure (FS) and Constituency Structure (CS). The
LS captures the information about the meaning of the lexical items, semantic
roles, constituting predicate argument structure and the grammatical functions like
Subject (SUBJ) and Object (OBJ) that are associated with the arguments through
the Lexical Assignment (LA). The LA states that each argument is assigned with
22 The term ‘chomeur' is French for a jobless person; a chomeur is indicated by * in the clause.
a unique grammatical function (GF). GFs are assigned at the Lexicon-Syntax in-
terface. For instance, the transitive verb (kick) has a predicate argument structure
that consists of an Agent associated with SUBJ-function and a Theme associated
with OBJ-function. The other levels of the representation, as shown in the Figure
3, are called f-structure and c-structure, respectively. Constituency relations vary
cross-linguistically and across constructions within a single language while the
syntactic functions are universal (invariant) and are represented in a universal for-
mat. Therefore, a number of different c-structures can have a single f-structure
and it would be possible to derive an f-structure from a c-structure but not vice
versa. The inventory of grammatical functions is different from that of RG. LFG
distinguishes between sub-categorizable functions (governable), which can be
part of a verb sub-categorization, like SUBJ, OBJ1 (direct object), OBJ2 (indirect
object), OBL (oblique) and POSS (possessor) and non-sub-categorizable (non-
governable) functions like AJT (adjunct), syntactic FOCUS and TOPIC. Among
sub-categorizable functions, SUBJ, OBJ1 and OBJ2 are semantically unrestricted,
i.e. they can bear a variety of semantic functions, while OBL and POSS are
semantically restricted and can bear only some particular semantic function. The
non-sub-categorizable functions are used to refer to adjuncts, to the discourse
functions indicating an entity that has already been established in the discourse
context (topic), or to the information about some topical participant that is new in
the context (focus).
Figure. 3: The f-structure of the sentence “A boy handed the teacher a gift.”
LFG represents the f-structure of the sentence in terms of an attribute-value ma-
trix (as shown in Figure 3) and the c-structure as an augmented constituency tree.
The relationship between c-structure and f-structure is represented by adding
functional information on the tree edges.
The notion of grammatical relation occupies a central role in LFG in
determining which of the arguments semantically selected by a predicate are
syntactically realized, and how. In particular, the lexical level plays a central role
through the mechanism of sub-categorization. The similarities between LFG and
RG include the inventory of basic syntactic relations, the relevance of the semantic
level of the sentence, and the use of a form of relational structure where
grammatical relations are the interface between the syntactic and semantic levels
of the sentence. There are also many differences between the two approaches:
LFG is monostratal, non-transformational and gives a central role to lexical
sub-categorization. Moreover, LFG explicitly represents the syntactic interrelation
between relational structure and constituent structure, by assigning a c-structure to
the sentence and assuming that there is some f-structure associated with each node
in the c-structure.
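The attribute-value matrix of Figure 3 can be sketched as a nested mapping. Since Figure 3 itself is not reproduced here, the feature names and values below (following the OBJ1/OBJ2 inventory given above) are assumptions for illustration.

```python
# Rough sketch of the f-structure of "A boy handed the teacher a gift" as a
# nested attribute-value matrix. Feature names follow the inventory given
# above (OBJ1 = direct object, OBJ2 = indirect object); the exact values are
# assumptions, since Figure 3 is not reproduced here.

f_structure = {
    "PRED": "hand<SUBJ, OBJ1, OBJ2>",
    "TENSE": "PAST",
    "SUBJ": {"PRED": "boy", "SPEC": "a", "NUM": "SG"},
    "OBJ1": {"PRED": "gift", "SPEC": "a", "NUM": "SG"},
    "OBJ2": {"PRED": "teacher", "SPEC": "the", "NUM": "SG"},
}

# the sub-categorizable (governable) functions selected by the predicate
governable = {"SUBJ", "OBJ1", "OBJ2"}
assert governable <= set(f_structure)
```

Different c-structures of the same sentence would map to this one f-structure, which is the point made above about the many-to-one c-structure/f-structure relation.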
3. (Non-) Configurationality and DG
Although the tradition of using syntactic models in linguistics can be traced back
to Panini’s work (3rd century BC), the discussion about which grammatical model
one should use is still an open issue. For addressing this problem, i.e. which
framework/formalism one should use for treebank annotation, certain typological
features of a language are necessary to be taken into account. Such features have
very strong repercussions on the encoding of grammatical relations in different
languages. Adhering to a similar view, Chomsky (1981) and Hale (1982, 1983)
divided the languages of the world into configurational and non-
configurational languages (Covington, 1990). Hale (1982, 1983) put forward the
following general diagnostic criteria to check whether a language is non-
configurational:
i) Variable word-order
ii) Lack of pleonastic NPs (expletives)
iii) Extensive null anaphora (pro-drop)
iv) Syntactically discontinuous constituents
v) Lack of NP movement (passive, raising, etc)
vi) Use of rich case-system
These criteria were later attested by Farmer (1984), Jelinek (1984), Mohanan
(1983), Webelhuth (1984) and Speas (1990). On the basis of these criteria, several
languages were claimed to be non-configurational, at least to some extent. These
languages include Japanese (Chomsky, 1981), Warlpiri (Hale, 1982), German
(Haider, 1982), Hungarian (Kiss, 1987), Hindi (Mohanan, 1983), Kashmiri (Raina,
1991), etc. Such languages share most of the aforementioned criteria, if not all. On
the other hand, some languages like English and French do not meet any of the
above criteria. It can be argued that English is totally devoid of such properties
while as Warlpiri possesses most or all of these properties. However, it appears
that there is not a clear cut division between languages as configurational and
non-configurational rather languages tend to form a continuum from completely-
fixed-word-order languages to completely-flexible-word-order languages with no
sharp transition from one type to another (Siewierska, 1988). In this continuum,
Warlpiri exists at one extreme (non-configurational) and English at the other
(configurational) while the rest of languages lie in between, possessing more or
less of configurationality and non-configurationality. It is not only the case with
the entire concept of non-configurationality but also with aforementioned
individual properties as well. For instance, the nature of word-order (i.e.
completely free/completely rigid) is not cross-lingually a categorical property. If
we correlate one of these six properties (typological variables) with other
properties, we can find high correlation measures between “variable-word-order”
and “rich case system”, indicating that these properties go hand in hand with each
other to characterize a language. It is in fact the rich overt case system that allows
flexibility in word-order. Therefore, some languages are found to be with fixed-
word-order and others with flexible-word-order but this fixity and rigidity in
word-order is itself relative rather than categorical property. However, one thing
is clear that the division does exist, though not very sharp, as some languages tend
to be fixed word-order languages while others tend to be more variable.
Nevertheless, it has been argued that most languages have partly variable word-order (Covington, 1990). Fixed word-order languages resist any kind of scrambling (change in word-order) that leads to information distortion (change in propositional semantics), as is evident from sentences i, ii and iii below. Sentence i carries the proper semantic information. Sentence ii is syntactically anomalous, violating the default SVO word-order of English, but it can be read as a stylistic or pragmatic variant (in terms of topic-focus or information structure) of sentence i, as the propositional information is still intact. Sentence iii, however, is syntactically well-formed but semantically anomalous, violating the selectional restrictions on its arguments (a football cannot kick) and leading to information distortion.
For example:
i) S [[The fat boy] [kicked [a football]]]
ii) S [[kicked [a football]] [The fat boy]]
iii) S [[a football] [kicked [The fat boy]]] **
On the other hand, a completely free-word-order language (e.g. Warlpiri) does not conform to the typical English-type hierarchical clause structure (SUBJ-OBJ asymmetry), i.e. SUBJ as the external (higher) argument and OBJ as the internal (lower) argument, as shown below in a PS rule:
1. S = [NP1 + VP], where VP = V+(OBJ NP2)
We can say that, unlike English, such a language lacks a VP and has a flat clause structure (SUBJ-OBJ symmetry), as shown in the following PS rule:
2. S = [NP1 + V + NP2], where both the NPs (1 & 2) are symmetrical.
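To make the contrast between the two rules concrete, the following sketch (an invented encoding for illustration, not a formal implementation of either grammar) represents each rule as a nested (label, children) structure and measures its depth of embedding:

```python
# PS rule 1 (configurational, English-type): OBJ sits inside VP, SUBJ outside it.
hierarchical = ("S", [("NP1", "SUBJ"), ("VP", [("V",), ("NP2", "OBJ")])])

# PS rule 2 (non-configurational, Warlpiri-type): flat, symmetrical NPs.
flat = ("S", [("NP1", "SUBJ"), ("V",), ("NP2", "OBJ")])

def depth(node):
    """Depth of embedding of a tree encoded as (label, [children])."""
    label, *rest = node
    children = rest[0] if rest and isinstance(rest[0], list) else []
    return 1 + max((depth(c) for c in children), default=0)

# The SUBJ-OBJ asymmetry of rule 1 shows up as an extra layer of embedding.
print(depth(hierarchical), depth(flat))  # 3 2
```

Under this toy encoding, NP1 and NP2 are sisters of V in the flat rule but not in the hierarchical one, which is exactly the asymmetry at issue.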
Moreover, VSO languages (e.g. Semitic and Celtic languages) are also considered problematic for the universal appeal of the notion of VP (Perlmutter, 1983; Dowty, 1982). There are two approaches to explaining word-order: the Parameter Approach (Chomsky 1991, 1993, 1994, 1995), which holds that languages are partly defined by the head parameter setting the position of the head of a phrase as either initial or final, and the Universal Base Hypothesis (Kayne, 1994; Zwart, 1997), which posits that SVO is the basic canonical word-order underlying even VSO and other free-word-order languages, since it is in consonance with the basic tenets of the X-bar schema (or, put differently, the X-bar schema has been optimized for SVO or SOV languages). The notion of basic order refers to the ordering of elements in the representation that expresses the basic meaning relations between the elements in the deep structure (Chomsky 1957). These relations are expressed by an interaction of theta theory and X-bar theory, in which the complement of a verb appears as the sister of V and the subject appears as the sister of VP. The perceived word-order in the surface structure of a sentence often deviates from this basic ordering (ibid.). If a language has only VSO sentences, the basic word-order never surfaces, but all the variations in word-order can be accounted for by a series of movements. It has been argued that the importance of the X-bar schema is not only that it regularizes structure but also that the structure it defines conveys meaning. In the traditional and intuitive sense, a verb has a complement; likewise, the combination of a verb and a complement (VP) is a predicate, requiring a subject. These notions of complement and subject are defined in structural terms: a complement is a sister of a head (V), and a subject is a sister of a predicate (VP). The hypothesis that the function of a noun phrase is defined by its hierarchical position in the syntactic structure is part of the theta theory of generative grammar. Nevertheless, one can raise many questions about this hierarchical structure and the SUBJ-OBJ asymmetry: why is only SVO considered a basic word-order? Why can't we consider VSO basic and then explain constituent structure? Why can't there be a clause structure like the following?
3. S = [VP + NP2], where VP = V+SUBJ.NP1
Many such questions have been addressed quite elaborately in various formulations that subscribe in one way or another to X-bar theory. For instance, a distinction was made between the internal (OBJ) and external (SUBJ) arguments of a verb (Williams, 1981). Internal arguments were further divided into direct and indirect objects (Marantz, 1984). The external argument was considered to be generated outside the base (i.e. not dominated by the VP projection), and the internal arguments were the only arguments generated by the base (i.e. dominated by the VP projection). This raised the question of whether this distinction between arguments (external vs. internal) is legitimate, given that both are arguments of the verb. Consider Marantz's famous idiom argument for the internal argument: idioms are typically formed with the OBJ and not with the SUBJ, e.g. the VP kick the bucket. The VP-internal Subject Hypothesis (VISH) (Fukui & Speas 1986,
Kitagawa 1986, Kuroda 1988, Koopman & Sportiche 1991) showed that these
problems could be overcome if we assume that subjects are base-generated as a
specifier in the VP and then raised to the specifier of IP. According to VISH, the
external argument would be like other arguments of the verb in that it is generated
like other arguments in the domain of its Theta-licenser. As VISH thoroughly revised the previous notion of VP, VP-shells were introduced (Larson 1988). In the VP-shell formulation, the thematic elements are generated in the lower VP, and an empty 'shell' VP is generated on top of the thematic VP. This theory also helps maintain binary branching for the dative shift/to-dative constructions and double object constructions (DOC) in English. The Minimalist Program likewise maintains that if a verb has several internal arguments, a Larsonian VP-shell must be postulated (Chomsky 1995).
Moreover, if all phrases are required by X-bar notation to have a specifier, why was VP exempted? Why did the Spec of IP receive a dual characterization, sometimes as a Case position in object-raising (as in passives) and sometimes as a Theta-position (a base-generated position) for the external argument? Even setting X-bar notation aside, phrases tend to be homogeneous, discrete, perceptually compact and closed syntactic units that pass the substitution test of constituency (e.g. NP, AdjP, AdvP), but VP, as shown in PS rule (1), is a heterogeneous, perceptually non-compact and open syntactic unit with one or two NPs embedded in it. Further, adhering to the notion of constituent structure (with or without X-bar notation) amounts to ignoring potential semantic cues in the constructions, even in variable word-order languages, where there is a more or less well-defined system of case markers or pre/post-positions to represent semantic roles. In such cases, a single-layered representation of syntax and semantics is quite possible. However, the representation of syntax
(case relations) and semantics (thematic relations) has been a long standing issue
in theoretical syntax, as clearly mentioned below:
“One of the most important research questions in the history of generative
grammar has been the determination of the domains in which Case and theta
theory apply as distinct, related or disjoint. The main concern is whether
Case is parasitic or derivative of thematic configurations or whether Case
and thematic relations involve different projections/configurations
altogether. Although the research tradition has settled for the disassociation
approach, it has met with variable degrees of success in achieving a complete
severance of the domains in which theta and Case are assigned.” (Richa
2011)
Finally, it can be argued that non-configurational or relatively variable word-order languages can be explained even without positing a VP (as in X-bar), by positing instead a bare Verb (V) or Verb Group (VG) with symmetrical arguments (SUBJ.NP1 & OBJ.NP2) organized in a flat structure, as shown above in PS rule 2. Such a treatment is given to these languages by dependency theory, which posits a flat organization of verbal arguments and does not invoke any notion of deep structure, surface structure or derivation through movements. As the name indicates, variable word-order languages are non-positional languages, and their arguments and adjuncts mostly carry overt case markers. Hence, the position of the verbal arguments/adjuncts in the sentence does not matter: their relation with the verb is determined by the morpho-syntactic or semantic cues carried by the case markers or pre/post-positions, not by their position in the construction. For
instance, Indian languages (e.g. Hindi, Urdu, Gujarati, Punjabi, Bangla and
Kashmiri), Semitic languages (e.g. Arabic and Hebrew) and Slavic languages
(Czech and Russian) are relatively variable word-order languages. They allow
scrambling of their constituents without impacting the propositional information
of the sentence.
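As an illustration of this point, the following sketch uses invented romanized tokens and a toy marker-to-relation inventory (hypothetical, not an actual analysis of any of these languages) to show that overt case markers, rather than position, determine the relations:

```python
# Toy marker-to-relation inventory (hypothetical, for illustration only).
MARKERS = {"ne": "SUBJ (ergative)", "ko": "OBJ (accusative/dative)", "se": "INSTRUMENT"}

def roles(tokens):
    """Assign grammatical relations from overt case markers, ignoring word position."""
    return {w: MARKERS[m] for w, m in tokens if m in MARKERS}

# Two word-order variants of the same invented clause (word, case-marker) pairs.
sov = [("larke", "ne"), ("gend", "ko"), ("maara", None)]  # SOV order
osv = [("gend", "ko"), ("larke", "ne"), ("maara", None)]  # scrambled variant

print(roles(sov) == roles(osv))  # True: same relations despite scrambling
```

Because the relations are read off the markers, every scrambled permutation yields the same role assignment, which is the sense in which the propositional information survives scrambling.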
4. A Historical View of DG
Although the interdisciplinary fields of Computational Linguistics (CL) and Natural Language Processing (NLP) are gifts of the modern technological era, the origin of the syntactic parsing (syntactic analysis) of natural language, which forms the backbone of various NLP systems, can be traced back to antiquity. The present notions of syntactic parsing and the existing grammar formalisms are actually the outcome of the vast grammatical knowledge accumulated in ancient, medieval and modern grammatical traditions all over the world. Butt (2005) gives an elaborate account of various grammatical traditions. The next subsections give a brief history of the notion of dependency analysis and its roots in different grammatical traditions, based on Miriam Butt's and Svetoslav Marinov's accounts:
4.1. Indian Tradition (350-250 B.C)
The earliest traces of syntactic analysis can be found in Panini's grammatical
sketch of Sanskrit (350-250 B.C), which was based on a long-standing linguistic thought in India, rooted in the Vedic works of about 500 years earlier (Kruijff, 2002). It falls within the realm of dependency grammar. Panini's grammar
consists of four modules that account for different aspects of language separately
as given in Table 1.
The module called Ashtaadhyaayii deals with the derivation of sentence
structure. The derivation of a sentence starts from the semantic level and ends
with the formation of the phonological form (Itkonen, 1991). The lexicon contains verbal and nominal stems. The derivation of a sentence begins with choosing the lexical items from the lexicon and deciding on the karaka-relations that hold between the verbal root and the nominal roots. So, only verbs and nouns play the primary role in sentence construction, and the rest of the parts-of-speech (POS) play secondary or tertiary roles. This is, in fact, the simplest way of representing the sentence structure of a language, particularly for Indian languages. Therefore, in the Paninian perspective, to construct the skeletal structure of a sentence we primarily need events (actions/states) and entities. Other elements, like verbal and nominal modifiers, can also be incorporated in the construction, but only to add different semantic shades to the primary predication; as such, they play the least role in the basic syntactic skeleton of a sentence. It is evident that the Paninian relational view is primarily focused on Verb-Noun relations and the linking case markers/vibhakti.
S.NO.  MODULE           DESCRIPTION                                      COVERAGE
01     Ashtaadhyaayii   Describes syntactic rules                        4000 (approx.)
02     Dhaatupaatha     Describes verbal roots with their morpho-
                        phonemic and morpho-syntactic properties         2000 (approx.)
03     Ganapaatha       An inventory of lexical items                    261 (approx.)
04     Shivasuutras     Describes the segmental phonology
Table 1. Four Modules of Paninian Grammar (Kiparsky 2002)
The karakas are actually the six primary syntacto-semantic roles that the nominal
roots (arguments) play for their verbal root in well-formed sentential
constructions (Kiparsky, 2002). The six karakas include karta (Agent), karma
(Goal), sampradaana (Recipient), karana (Instrument), adhikarana (Locative), and
apaadaana (Source). The well-formedness of a sentence is assured only when each of the participating nominal entities is assigned a syntacto-semantic role. Karakas act as mediators between the semantic level and the morpho-syntactic level of the sentence structure through the following two constraints (Kiparsky 2002, p. 16):
i) Every karaka must be morpho-syntactically realized (in the form of vibhakti/case marker/postposition).
ii) No karaka may be realized by more than one morphological form/element.
These two constraints can play a pivotal role in establishing a simple mapping schema between karakas and vibhakti, which can prove instrumental while developing any formalism based on the Paninian view.
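By way of illustration only, such a mapping schema can be sketched as a small dictionary from karakas to the markers that realize them, with the two constraints checked programmatically; the marker values below are hypothetical placeholders, not an actual vibhakti analysis:

```python
# A minimal sketch of a karaka-to-vibhakti mapping schema.
# The marker strings are hypothetical placeholders for illustration.
KARAKA_VIBHAKTI = {
    "karta (Agent)": "NOM/ERG marker",
    "karma (Goal)": "ACC marker",
    "sampradaana (Recipient)": "DAT marker",
    "karana (Instrument)": "INS marker",
    "adhikarana (Locative)": "LOC marker",
    "apaadaana (Source)": "ABL marker",
}

def check_constraints(mapping):
    """Check Kiparsky's two constraints on the karaka-vibhakti mapping:
    (i) every karaka is morpho-syntactically realized;
    (ii) no karaka is realized by more than one morphological form."""
    every_karaka_realized = all(marker for marker in mapping.values())
    one_form_per_karaka = all(not isinstance(m, (list, tuple)) or len(m) == 1
                              for m in mapping.values())
    return every_karaka_realized and one_form_per_karaka

print(check_constraints(KARAKA_VIBHAKTI))  # True for the sketch above
```

A mapping with an unrealized karaka, or one realized by several competing forms, would fail the check, mirroring the two constraints stated above.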
To sum up, the Paninian perspective of syntactic analysis (traditional parsing) highlighted some key notions/relations prevailing in contemporary parsing formalisms that fall within the dependency framework, like Meaning Text Theory (Mel'cuk, 1988). Such notions include the binary relations holding between a verb root and nominal roots (Itkonen, 1991; Kiparsky, 2002), the rootedness of a sentence, which is due to the central role of the verb (Itkonen, 1991), and the labeled relations (six syntacto-semantic roles; k1, k2, ..., k6), which are binary in nature (Misra, 1966; Itkonen, 1991; Kiparsky, 2002). It is worth mentioning that in Mel'cuk's Meaning Text Theory there are likewise six syntacto-semantic relations (actants; a1, a2, ..., a6), labeled with the digits 1-6, like the six Paninian karakas.
4.2. Hellenic Tradition (100 B.C)
The traces of syntactic analysis can also be found in the Greek grammatical tradition (GrGT). In this period, there were two schools, the Logicians and the Grammarians, involved in the study of the word-classes or parts-of-speech of Greek, and consequently there were two different views regarding the POS of Greek. The Logicians (Plato, Aristotle, etc.) were involved in the analysis of the proposition into its logical parts (subject and predicate), so they recognized only two POS categories (V and N). The Grammarians, on the other hand, like Dionysius Thrax, who wrote the Techne, a grammatical sketch of Greek (100 B.C), recognized eight POS categories (Noun, Verb, Participle, Article, Pronoun, Preposition, Adverb and Conjunction), a classification which is still a role model for POS tag-sets or word-class classifications across the grammatical traditions of the world.
In the works of the Stoics (300-150 B.C) we find traces of the modern notion of dependency. The Stoics were concerned with the analysis of the spoken utterance, the lekton, 'the thing said' (Lepschy, 1994). They considered a predicate like graphei, 'writes', an incomplete lekton which requires a nominal of some sort to perform the act of writing and become a complete lekton, or an axioma. Ineke Sluiter writes in Auroux et al. (2000):
“The predicate was called an ‘incomplete lekton’ with a number of slots that
need filling ....” (ibid. p. 378) and “... they (the Stoics) describe interaction of
bodies as occurring in relation to lekta …” (ibid. p. 384).
In the works of Apollonius Dyscolus (200 A.D) we find a more straightforward reference to the notion of dependency. In his view, adverbs complement or diminish the meaning of the verb and are attached to verbs. While adverbs require the verb, verbs do not necessarily require an adverb (Percival, 1990). Both authors distinguish between major word-classes (verbs and nouns) and minor word-classes, where the latter serve the purpose of supporting or circumscribing the former. Apollonius, for example, regarded some words as naturally more closely related than others: prepositions preceded nouns and had to be construed with them; articles related to nouns, and nouns related to verbs; conjunctions could not bind a noun and a verb. "In some of these relations there are clear indications of what we now call dependency." (Lepschy, 1994, p. 99).
The logician Boethius (480-524/6 A.D) was the first to introduce a special term for the supportive function of the minor word-classes (Percival, 1990). In his work on Aristotle's On Interpretation, he referred to quantifiers, syncategorematic words, as determinations (specifiers). In his De Divisione, he developed the notion of specification further to include not only quantifiers but also words from other word-classes. His term determinatio is generic and refers to the relation of all minor word-classes with the corresponding major word-classes, adding an idea of semantic specification.
In Priscian’s Latin Grammar (500 A.D) which is based on Appolonius’
ideas, rudiments of dependency analysis have been found. According to him, lexis
or diction-words are ‘the smallest part of a connected sentence’ (Lepschy, 1994).
One word is put in construction with (construitur cum) or requires (exigit) another
(Covington, 1984). Given that, a very long sentence can be diminished or
collapsed to a very short one sentence, consisting only of a noun and a verb.
The question which worried the grammarians was which of the two
elements; Noun or Verb, is logically prior. Ancient grammarians generally
considered the noun as prior to the verb but many Greek and Latin verbs in first
and second person singular mark morphologically the subject. Both Percival
(1990) and Lepschy (1994) find support for the view that the verb being prior to
the noun since one could omit the subject in the cases like above.
To sum up, many of the ideas discussed in the works of the ancient grammarians and logicians can be subsumed under the modern understanding of dependency. These include rootedness (i.e. the priority of either the noun or the verb), head-modifier relations (e.g. the adverb-verb relation), analysis in terms of words only, as well as a term for the head-dependent relation (determinatio).
4.3. Arabic Tradition (798-928 A.D)
It is in the Arabic Linguistic Tradition (ArLT) that we find the first systematic treatment of syntax based on the concepts that form the core of contemporary dependency grammar (Bohas et al. 1990; Owens, 1988). Siibawaihi (793 A.D) was the main grammarian of the ArLT, and his seminal work, Al-Kitaab (The Book), is considered the core grammatical thought of Arabia (Itkonen, 1991). He recognized only three parts-of-speech: Nouns (which include adjectives, pronouns, and active and passive participles), Verbs and Particles. According to him:
i. Verbs are primarily governors but can be governed by particles.
ii. Particles (i.e. prepositions) can be non-governors or governors of Nouns or Verbs, but they can never be governed.
iii. Nouns can never govern but they can be governed by Verbs or Particles.
The governor-dependent scheme proposed by Siibawaihi accounts for many verbal sentences (i.e. verb + noun), with the general principle being that "A unit may govern more than one unit; but it can be governed only by one unit" (Itkonen, 1991, p. 136). It is worthwhile to mention that nominal sentences (i.e. noun + noun) are not analyzed explicitly in terms of dependency but rather in terms of Topic-Comment. For example:
1) zayd-un rajul-un
   Topic   Comment
   'Zayd is a man.'
2) kaana zayd-un  rajul-an
   was   Zayd-NOM man-ACC
   'Zayd was a man.'
However, a covert auxiliary has been proposed by Siibawaihi to account for the dependency structure of nominal sentences, as in example 2. As far as the positing of a covert element is concerned, Itkonen (1991) considers this to support a transformational-grammar approach. Mel'cuk (1988) similarly assumes an empty category/element to be the head in copula constructions (N+N cases) of Russian.
Some scholars have found support only for dependency analyses in the ArLT (Owens, 1988), but others stick to the idea that the syntactic analysis of both nominal and verbal sentences, as proposed by Siibawaihi and his followers, is essentially a Bloomfieldian type of IC analysis (Carter, 1973). However, Itkonen (1991) maintains a moderate stance by assuming that Siibawaihi and his followers operated with the two notions known today as dependency and constituency, depending on the type of the given structure. Following Owens (1988) and Itkonen (1991), it can be said that the Arab grammarians anticipated the modern definition of dependency. They differentiated between aamil (head) and macmuul (dependent), and single-headedness and projectivity were two main principles explicitly present in their grammatical analyses.
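Both principles can be stated precisely. The following sketch (a modern illustration, not drawn from the Arab grammarians' own formalism) checks single-headedness and projectivity for a dependency tree encoded as (dependent, head) word-index pairs, with 0 standing for an artificial root:

```python
def is_single_headed(arcs, n):
    """Every word 1..n has exactly one head in the arc list."""
    heads = {}
    for dep, head in arcs:
        if dep in heads:  # a second head for the same word
            return False
        heads[dep] = head
    return len(heads) == n

def is_projective(arcs):
    """No two arcs cross when drawn above the sentence.
    An arc spans the interval between its two endpoints; two arcs cross
    when their spans overlap without one being nested inside the other."""
    for d1, h1 in arcs:
        lo1, hi1 = sorted((d1, h1))
        for d2, h2 in arcs:
            lo2, hi2 = sorted((d2, h2))
            if lo1 < lo2 < hi1 < hi2:  # overlapping, non-nested spans
                return False
    return True

# "kaana zayd-un rajul-an": kaana (1) heads both nominals (2, 3).
arcs = [(1, 0), (2, 1), (3, 1)]
print(is_single_headed(arcs, 3), is_projective(arcs))  # True True
```

An arc list such as [(1, 3), (2, 4)] would fail the projectivity check, since the two spans cross.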
4.4. European Tradition (1260-1310 A.D)
According to Covington (1984), the earliest rudiments of dependency analysis in the European Linguistic Tradition (EuLT) are found in the works of the Modistae (1260-1310 A.D). The modistic grammar, known as "Grammatica Speculativa", describes how whole sentences can be built up by concatenating words together. The terms Suppositum (subject) and Appositum (predicate) were used to denote the syntactic function of the two parts of a basic sentence, the nominal and the verbal (Robins, 1997). The process of the formation of a sentence is divided into three successive steps (Covington, 1984):
i) Constructio: - It involves establishing links between the words.
ii) Congruitas: - It involves application of three well-formedness conditions
on the links.
a. There should be compatibility (agreement) of the modes of
signifying.
b. Every dependens should have a terminans.
c. A suppositum and appositum of finite mood should appear in the
sentence.
iii) Perfectio: - It involves a final check on whether the sentence is complete.
Within each construction there are two grammatical relations:
a) Primum-to-Secundum: Secundum presupposes the presence of Primum.23
b) Dependens-to-Terminans: Terminans presupposes the presence of Dependens.24 Dependens is an 'unsaturated' element, while a terminans is the element which 'saturates' it (Lepschy, 1994).
It has been argued that the Dependens-Terminans relation is an extension of Petrus Helias' concept of Regimen, which according to Law (2003) is actually the concept of Government, where one word forces another to be in a particular form (Covington, 1984).
23 For Covington (1984) the Primum-to-Secundum relation corresponds to the current notion of dependency.
Still, one big difference from modern dependency theories is the fact that the root node of a dependency graph would typically be the subject nominal for the Modistae, while according to contemporary formalizations it should be the finite verb of the clause.
For the Modist Martin of Dacia (1304 A.D), there was only one Primum in the whole sentence, and this was the subject (Covington, 1984). Later on, this idea was replaced with a model in which a Primum and a Secundum were identified in every construction, although the criteria for differentiating between the two were not entirely clear (Covington, 1984). For instance, the verb was considered Secundum in subject-verb constructions but Primum in verb-object constructions. Entities and substances were considered prior to their attributes and therefore a Primum. Certain constructions, like coordination and subordinate clauses, however, posed problems for the Modistae, as it is difficult there to identify a single element as being the Primum. According to Svetoslav Marinov (MS.), the two sets of relations, that of Primum-to-Secundum and that of Dependens-to-Terminans, receive contradictory interpretations in the literature. It is nonetheless clear that dependency-like analyses were central to the syntactic theory of the Modistae: as mentioned above, some sort of head-dependent dichotomy was present, along with the notion of a root node in the sentence.
Later on in the European grammatical tradition, from the mid-14th century AD up to the mid-20th century AD, there was hardly any grammatical work relating to dependency; the only references available in the literature are based on Kruijff (2002) and Percival (1990). However, the foundational work of modern dependency grammar, "Éléments de syntaxe structurale" (Lucien Tesnière, 1959), was published posthumously in French. A number of scholars, e.g. Mel'cuk (1988), Graffi (2001) and Nivre (2005), have summarized the key notions of his work in English; these summaries are instrumental in understanding a work that is otherwise very difficult to access.
24 Covington (1984) and Robins (1997) explicitly point out that the Dependens-to-Terminans relation should not be confused with the present-day notion of dependency. Percival (1990), for example, considers the Dependens-to-Terminans dichotomy to correspond to the modern notion of dependent-head asymmetry. But he also considers this relation to be another way of capturing Boethius' notion of "Determination".
5. Why Dependency Grammar?
According to Covington (2001), the constituency-based approach appears to have been invented only once, by the ancient Stoics, and has been passed through formal logic to modern linguists. The dependency-based approach, on the other hand, appears to have been invented many times in many places (Covington 2001). Nevertheless, the constituency-based discourse has overshadowed every other view of syntactic representation. Mel'cuk argues that the constituency-based approach is particularly suitable for English, which was the mother tongue of its founding fathers (Mel'cuk, 1988). Furthermore, Mel'cuk summarizes a few reasons why the dependency model is preferable:
i. A phrase-structure tree focuses on the grouping of words, i.e. which words go together in the sentence, but does not give a representation of the relations between the words.
ii. A dependency tree is based on relations. It shows which words are related
and in what way. The sentence is “built out of words, linked by
dependencies”. The relations could be described in more detail by giving
them meaningful labels.
iii. A dependency tree also represents grouping. A phrase is represented by a
word and its entire sub-tree of dependents.
iv. In a phrase-structure tree usually most nodes are nonterminal, representing
intermediate groupings. A dependency tree consists of only terminal
nodes. There is no need for abstract representation of grouping.
v. In a phrase-structure tree the linear order of the nodes is relevant and must be kept to retain the meaning of the sentence. In a dependency tree this is not important: all information is preserved in the (possibly labeled) connections.
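These points can be made concrete with a small sketch (an illustrative encoding, not Mel'cuk's own notation): a dependency tree stored as labeled dependent-head pairs contains only the words themselves, yet the grouping corresponding to any phrase can be recovered as a subtree:

```python
# Dependency tree for "The fat boy kicked a football", encoded as
# dependent -> (head, relation); the finite verb is the root.
tree = {
    "The": ("boy", "det"), "fat": ("boy", "mod"), "boy": ("kicked", "subj"),
    "kicked": (None, "root"), "a": ("football", "det"),
    "football": ("kicked", "obj"),
}
order = ["The", "fat", "boy", "kicked", "a", "football"]

def subtree(word):
    """Recover the 'phrase' headed by word: the word plus all its
    direct and indirect dependents, returned in surface order."""
    words = {word}
    for dep, (head, _) in tree.items():
        if head == word:
            words |= set(subtree(dep).split())
    return " ".join(w for w in order if w in words)

print(subtree("boy"))       # The fat boy
print(subtree("football"))  # a football
```

Note that every node in `tree` is a word (a terminal): the phrases NP and VP are not stored anywhere, yet they fall out of the subtrees, illustrating points iii and iv above.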
6. Notion of Treebanking
The improvement in natural language parsing during the last two decades has
been generally attributed to the emergence of statistical and machine learning
approaches (Collins, 1999; Charniak, 2000). However, such approaches became possible only with the availability of large-scale machine-readable handcrafted or
automatically generated (manually corrected) syntactic trees. The art or science of
crafting or generating and organising machine readable syntactic trees is called
treebanking. In the next sub-sections, the concept of treebank, principles of
treebanking and review of various dependency treebanks are given.
6.1. Some Background
The term ‘treebank’ was probably introduced by Geoffrey Leech (Sampson 2003). The pioneering work in treebanking started in the early 1970s in Sweden with the inception of Talbanken25 (Teleman 1974; Einarsson, 1976), which was developed at Lund University by manually annotating a Swedish corpus with phrase structure and grammatical functions. However, the serious work in this area started in the late 1980s, as recalled by Frederick Jelinek of IBM in his ACL (Association for Computational Linguistics) Lifetime Achievement talk:
“We were not satisfied with the crude n-gram language model we were using
and were “sure” that an appropriate grammatical approach would be better.
Because we wanted to stick to our data-centric philosophy, we thought that
what was needed as training material was a large collection of parses of
English sentences. We found out that researchers at the University of
Lancaster had hand-constructed a “treebank” under the guidance of
Professors Geoffrey Leech and Geoffrey Sampson (Garside, Leech, and
Sampson 1987). Because we wanted more of this annotation, we
commissioned Lancaster in 1987 to create a treebank for us. Our view was
that what we needed above all was quantity, possibly at some expense of
quality …… We wanted to extract the grammatical language model
statistically and so a large amount of data was required.” (Marcus, 1995)
Actually, it was the Linguistic Data Consortium (LDC), established at the University of Pennsylvania, that started massive and sophisticated efforts in developing treebanks for European languages. These efforts were later extended to non-European languages as well. So, there are Penn treebanks for various languages, like the Penn English Treebank (Marcus et al. 1993), the Penn Arabic Treebank (Maamouri et al. 2004), the Penn Chinese Treebank (Xue et al. 2004), etc. The Penn English Treebank is one of the largest and most widely used English-language treebanks and has contributed greatly to the creation of important English NLP resources. Moreover, it is well-documented and its documentation is freely available; consequently, it provides a solid template methodology for researchers attempting to produce treebanks in other languages. Similar efforts were made at Charles University in Prague, and various treebanks were created elsewhere as well. They include the Prague Dependency Treebank of Czech (Hajicova & Hajic, 1998; Böhmova et al. 2003), the Turkish Treebank (Oflazer et al. 2003), the Danish Dependency Treebank (Kromann, 2003), the Turin University Treebank of Italian (Bosco & Lombardo 2004), etc. The AnnCorra Treebank (Bharati et al. 1995) is a similar effort made for Indian languages at the LTRC Lab, and work is presently going on for some major Indian languages, e.g. Hindi, Telugu and Urdu (R. Begum et al., 2008; Vempaty et al., 2010; R. Bhat, 2012). Moreover, a Bangla Treebank26 has been constructed at IIT Kharagpur (S. Chatterji et al., 2009), and some efforts towards developing a dependency treebank for Kashmiri (KashTreeBank27) have already been initiated (S. Bhat, 2012).
25 Talbanken was recently reconstructed into Talbanken-05 (Nivre et al. 2006).
6.2. What is a Treebank?
A treebank is a set of corpora annotated with skeletal syntactic information, such as POS labels at the word level and syntactic labels beyond the word level (Kristin Jacque, 2006). A treebank is a text corpus annotated with syntactic, semantic and sometimes even inter-sentential relations (Hajicova et al., 2010). It is essentially a machine-readable repository of annotated syntactic structures of a language that predominantly serves as a bank of training and testing data for the development of various computational tools and applications that use some form of supervised learning, e.g. deep syntactic parsers, chunkers, POS taggers, etc. Although the term ‘treebank’ initially referred to a bare collection of syntactic trees, its contemporary usage has been extended to corpora with all kinds of structural annotations, such as constituent structure, functional structure, or predicate-argument structure (Nivre 2005; Smedt & Volk 2005). Currently, treebanks are augmented with different types of structural representations, and restricting a treebank to a particular type of structural representation is no longer the state of the art.
However, a basic skeletal treebank is perquisite for any kind of further
augmentation like multiple representations. Earlier treebanking efforts were based
on manual annotations which are laborious, time-consuming and error-prone.
Such limitations in the manual annotations have led to the development of several
26 Funded by Linguistic Data Consortium for Indian Languages (LDCIL)
27 KashTreeBank started as a summer school project in IIIT Hyderabad Advanced Summer School for Natural Language Processing (IASNLP 2011)
alternative approaches, like automatic annotation or automatic conversion, but
these alternatives do not work for resource-poor languages like Kashmiri.
We have to rely on manual annotation first, as no previous treebank resources
are available for Kashmiri; starting from scratch with manual methods
is unavoidable until sufficient resources are created to train a parsing system
for automatic annotation. Moreover, a large number of treebanks have been
developed and many are currently under construction. Many treebanks implement
formats similar to those of the major treebanks, and new models are rarely
devised. For instance, the English dependency treebank (Rambow et al., 2002)
follows the model of the Prague Dependency Treebank but uses a mono-layered
representation centered on the notion of predicate-argument structure instead of
the multi-layered approach of Prague. Similarly, the Spanish Treebank adheres to the
model of the Penn Treebank (Moreno et al., 2000).
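The skeletal annotations described above are commonly serialized in simple tabular formats. The following sketch shows how one dependency-annotated sentence might be stored and read back; the sentence, POS tags and relation labels are invented for illustration, and the format here loosely follows the CoNLL style rather than any particular treebank's actual scheme:

```python
# A minimal sketch of how a dependency-annotated sentence can be stored
# and read back. Columns: ID, FORM, POS, HEAD, DEPREL (CoNLL-like).
sentence = """\
1\tbirds\tNOUN\t2\tsubj
2\tfly\tVERB\t0\troot
3\tsouth\tADV\t2\tmod"""

def read_tokens(block):
    """Parse one tab-separated sentence block into token records."""
    tokens = []
    for line in block.splitlines():
        idx, form, pos, head, rel = line.split("\t")
        tokens.append({"id": int(idx), "form": form, "pos": pos,
                       "head": int(head), "rel": rel})
    return tokens

tokens = read_tokens(sentence)
# The token whose head is 0 is attached to the artificial root node.
root = next(t["form"] for t in tokens if t["head"] == 0)
print(root)  # → fly
```

Even this toy record carries the three layers most treebanks share: word forms, POS labels, and labeled head attachments.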
7. Dependency Treebanks: A Brief Review
Most languages have a relatively free word order, and for treebanking
free-word-order languages dependency-based annotation schemes
are generally used. It is because of this fact that there is an ever-expanding number of
dependency treebanks across the world. Many of these dependency treebanks28 are
briefly reviewed here:
7.1. Prague Dependency Treebank (PDT)
The PDT for Czech is the largest of the existing dependency treebanks. Its
corpus has been annotated on the basis of a multi-layer annotation scheme,
consisting of a morphological layer, an analytical (i.e. syntactic) layer and a
tectogrammatical (i.e. semantic) layer (Hajic, 1998; Bohmova and Hajikova, 1999;
Böhmova et al., 2003). It consists of approximately 90,000 sentences from newspaper
articles on diverse topics (e.g. politics, sports, culture) and texts from popular
science magazines, selected from the Czech National Corpus (T. Kakkonen,
2006). There are 3,030 morphological tags in the morphological tagset (Hajic,
1998). The syntactic annotation comprises 23 dependency relations. The
annotation for the three levels has been done separately, by different groups of
annotators. The morphological tagging was performed by two human annotators
28 Treebanks given here are mainly taken from (Kakkonen, 2006)
selecting the appropriate tag from a list proposed by a tagging system. A third
annotator then resolved any differences between the two annotations. The
syntactic annotation was at first done completely manually, with the help of
the morphological tags and a graphical user interface. After the annotation
of about 19,000 sentences, the Collins Lexicalized Stochastic Parser (Nelleke et al.,
1999) was trained on the data, reaching 80% accuracy. Thereafter, the annotators'
task changed from building trees from scratch to post-editing
(checking and correcting) the parses assigned by the parser, except for the
analytical functions, which still had to be assigned manually. There are other
treebank projects that use the same framework developed for the PDT. For
instance, the Prague Arabic Dependency Treebank (Hajic et al., 2004) is a treebank of
Modern Standard Arabic, consisting of around 49,000 tokens of newswire texts
from the Arabic Gigaword and the Penn Arabic Treebank. The Slovene Dependency
Treebank consists of around 500 annotated sentences obtained from the
MULTEXT-East Corpus (Erjavec, 2005).
7.2. Russian Dependency Treebank
The Dependency Treebank for Russian is based on the Uppsala University Corpus
(Lonngren, 1993). The texts have been collected from contemporary Russian
prose, newspapers, and magazines (Boguslavsky et al., 2000; 2002). The treebank
consists of about 12,000 annotated sentences. There are 78 syntactic relations
(divided into 6 subgroups, e.g. attributive, quantitative, and coordinative). The
annotation is layered, in the sense that the different levels of annotation are
independent and can be extracted or processed separately. The treebank has
been developed automatically with the help of a morphological analyzer and a
syntactic parser (Apresjan et al., 1992), followed by manual post-editing.
7.3. Italian Dependency Treebank
The Italian Dependency Treebank is known as the Turin University Treebank. It consists
of 1,500 sentences, divided into 4 sub-corpora (Bosco, 2000; Lesmo et al., 2002;
Bosco and Lombardo, 2003). The majority of the text is from the civil law code and
newspaper articles. The annotation format is based on the Augmented Relational
Structure (ARS). The POS tagset consists of 16 categories and 51 subcategories.
There are around 200 dependency types, organized in 5 levels. The scheme
provides the annotator with the possibility of marking a relation as under-
specified if a correct relation type cannot be determined. The annotation process
44
consists of automatic tokenization, morphological analysis, POS disambiguation
and syntactic parsing (Lesmo et al., 2002).
7.4. German Treebank
The German treebank is known as the TIGER Treebank (Brants et al., 2002). It was
developed on the basis of the NEGRA Corpus (Skut et al., 1998) and consists of
complete articles covering diverse topics, collected from a German newspaper. It
contains approximately 50,000 sentences. It combines both phrase structure and
dependency, organizing them in such a way that phrase categories are marked on
non-terminals, POS information on terminals and syntactic functions on the edges.
The syntactic annotation is rather simple and flat29.
7.5. English Dependency Treebank
The Dependency Treebank of English consists of dialogues between a travel
agent and customers (Rambow et al., 2002), and is the only dependency treebank
with spoken-language annotation. The treebank has about 13,000 words. The
annotation is a direct representation of lexical predicate-argument structure; thus
arguments and adjuncts are dependents of their predicates and all function words
are attached to their lexical heads. The annotation is done at a single syntactic
level, without a separate representation for surface syntax, the aim being to keep the
annotation process as simple as possible. The trained annotators have access to an
on-line manual and work off the transcribed speech without access to the speech
files. The dialogues are parsed with a dependency parser, the Supertagger and
Lightweight Dependency Analyzer (Bangalore and Joshi, 1999). The annotators
correct the output of the parser using a graphical tool developed by the Prague
Dependency Treebank project.
7.6. Basque and Danish Dependency Treebanks
The Basque Dependency Treebank (Aduriz et al., 2003) consists of 3,000
manually annotated sentences from newspaper articles. The syntactic tags are
organized as a hierarchy. The annotation is done with the aid of an annotation
tool offering tree visualization and automatic checking of tag syntax.
The annotation of the Danish Dependency Treebank is based on the
Discontinuous Grammar formalism, which is closely related to Word Grammar
(Kromann, 2003). The treebank consists of 5,540 sentences covering a wide range
of topics. The morpho-syntactically annotated corpus is obtained from the PAROLE
29 Note that hierarchical structure has been avoided and flat structure been preferred to reduce the amount of attachment ambiguities.
Corpus (Keson and Norling-Christensen, 2005); thus, no morphological analyzer
or POS tagger is applied. The dependency links are marked manually, using a
command-line interface with a graphical parse view.
7.7. Turkish Dependency Treebank
The Turkish treebank is known as the METU-Sabanci Turkish Treebank30. It consists
of 5,000 morphologically and syntactically annotated sentences. The treebank is
represented in the XML-based Corpus Encoding Standard format (Anne and Romary,
2003). Due to the morphological complexity of Turkish, morphological information
is encoded as sequences of inflectional groups (IGs). An IG is a sequence of
inflectional morphemes, divided by derivation boundaries. The dependencies between
IGs are annotated with the following ten link types: subject, object, modifier,
possessor, classifier, determiner, dative adjunct, locative adjunct, ablative
adjunct, and instrumental adjunct. The annotation is done in a semi-automated
fashion, though a lot of manual work is also involved. First, a morphological
analyzer based on the two-level morphology model (Oflazer, 1994) is applied to the
texts. The morphologically analyzed and pre-processed text is input to an
annotation tool. The tagging process requires two steps: morphological
disambiguation and dependency tagging. The annotator selects the correct tag from
the list of tags proposed by the morphological analyzer. After the whole sentence
has been disambiguated, dependency links are specified manually.
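The splitting of a morphological analysis into inflectional groups can be illustrated with a small sketch. The `^DB` derivation-boundary marker follows the convention commonly used with Oflazer-style two-level analyzers, but the particular analysis string below is a made-up toy example, not actual treebank output:

```python
# Sketch: splitting a morphological analysis into inflectional groups (IGs).
# "^DB" marks a derivation boundary; the analysis string is a toy example.
analysis = "ev+Noun+A3sg+Pnon+Nom^DB+Verb+Become+Pos+Past+A3sg"

def inflectional_groups(s):
    """Each IG is the run of inflectional morphemes between derivation
    boundaries."""
    return s.split("^DB")

igs = inflectional_groups(analysis)
print(len(igs))  # → 2  (a nominal IG and a derived verbal IG)
```

In the treebank, dependency links then hold between such IGs rather than between whole word forms.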
7.8. Danish, Portuguese and Estonian Treebanks
The Danish, Portuguese and Estonian treebanks are called Arboretum, Floresta
Sintactica and Arborest, respectively. These are all sibling treebanks, of which
Arboretum is the oldest. The treebanks are hybrids with both constituent and
dependency annotation, organized into two separate levels. The levels share the
same morphological tagset. The dependency annotation is based on Constraint
Grammar (CG) (Karlsson, 1990) and consists of 28 dependency types. For
creating each of the treebanks, a CG-based morphological analyzer and
parser has been applied. The annotation process consists of CG parsing of the
texts, followed by conversion to constituent format and manual checking of the
structures. The Danish treebank (Bick, 2003; Bick, 2005) has around 21,600 sentences
annotated with dependency tags, and of those, 12,000 sentences have also been
marked with constituent structures. The annotation is in both the TIGER-XML and
30 The corpus for the treebank was obtained from the METU Turkish Corpus (Atalay et al., 2003), hence, the name of the treebank.
PENN export formats. The Portuguese treebank (Afonso et al., 2002) consists of
around 9,500 manually checked and around 41,000 fully automatically annotated
sentences obtained from a newspaper corpus. The Estonian treebank (Bick et al.,
2005) consists of 149 sentences from newspaper articles. The morpho-syntactic
and CG-based surface syntactic annotations are obtained from an existing corpus,
which is converted semi-automatically to the Arboretum-style format.
7.9. AnnCorra: Treebanks for Indian Languages
AnnCorra (the Hyderabad treebanks) for Indian languages (ILs) comprises
dependency treebanks which use an indigenous karaka-theory-based grammatical
scheme, known as Paninian Computational Grammar, for syntactic annotation (Bharati
et al., 1996; Begum et al., 2008). Currently, treebanks of four ILs, namely Hindi,
Urdu, Bangla and Telugu, following this grammatical scheme, are under development.
The Hindi dependency treebank consists of 20,705 sentences, the Urdu dependency
treebank consists of 3,226 sentences from a newspaper corpus, the Bangla dependency
treebank consists of 1,279 sentences and the Telugu dependency treebank consists of
1,635 sentences (Bhat & Sharma, 2012; Vempaty et al., 2010), annotated with
linguistic information at the morpho-syntactic (morphological, part-of-speech and
chunk information) and syntactico-semantic (dependency) levels. No published
reference is available for the sizes of the Hindi and Bangla treebanks. The
annotation schemes of all these treebanks consider the verb as the root of the
sentence. The relationship between a participant and the event/activity/state
denoted by the verb is marked using relations that are called karaka. It has been
shown that the notion of karaka incorporates the local semantics of a verb in a
sentence and that it is syntactico-semantic. Indian languages are morphologically
rich and have a relatively free constituent order. Unlike karaka relations,
structural relations like subject and object are considered less relevant for the
grammatical description of ILs due to the less configurational nature of these
languages (Bhat, 1991; Begum et al., 2008).
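A verb-rooted karaka-style analysis can be sketched as a toy structure in which nominal chunks attach to the verb with karaka-like labels. The labels `k1` (karta, roughly the agent) and `k2` (karma, roughly the theme) follow the AnnCorra convention, but the sentence and the minimal data structure are invented for illustration:

```python
# Toy sketch of verb-rooted karaka-style dependencies. Labels follow the
# AnnCorra k1/k2 convention; the sentence and structure are illustrative.
# Each entry: chunk -> (head chunk, relation). The verb is the root.
tree = {
    "mohan": ("khaata-hai", "k1"),   # karta: the one who eats
    "seb":   ("khaata-hai", "k2"),   # karma: the thing eaten
    "khaata-hai": (None, "root"),    # the verb, root of the sentence
}

def dependents_of(head, t):
    """Collect the chunks attached to a given head, in sorted order."""
    return sorted(c for c, (h, _) in t.items() if h == head)

print(dependents_of("khaata-hai", tree))  # → ['mohan', 'seb']
```

Note that the structure says nothing about word order, which is what makes this style of annotation attractive for free constituent order languages.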
8. Principles of Treebanking
According to Huang et al. (2003), there are four general principles that have been
considered important for the design and development of a treebank. These
principles31 are given below:
8.1 Maximal Resource Sharing
31 Taken from Huang et al. (2003), "Sinica Treebank: Design Criteria, Annotation Guidelines, and On-line Interface"
The resources for developing a treebank include corpora, tools, annotation
schemes, guidelines and human annotators. Since developing these resources
from scratch can be a very expensive and time-consuming process, one should
make maximum use of existing resources, if available at all. For instance, in
order to achieve maximal resource sharing, the Sinica Treebank (Chen et al. 1996)
was bootstrapped from existing Chinese computational linguistic resources.
The textual material was extracted from the tagged Sinica Corpus (ibid).
Moreover, the same research team that carried out the POS annotation of the Sinica
Corpus also annotated the Sinica Treebank, to ensure consistency in the
interpretation of texts and tags.
8.2 Minimal Structural Complexity
The criterion of minimal structural complexity is motivated by the idea that the
annotated structural information should be sharable regardless of users'
theoretical orientations. It is observed that theory-internal motivations often
require abstract intermediate phrasal levels, like the intermediate phrasal
category X' in X-bar theory, and abstract covert categories, like INFL in GB
theory. Although such phrasal categories are well motivated within a theory, their
significance cannot be maintained across theoretical frameworks. Since minimal
basic-level structures are shared by all theories, it is better to annotate the
information that is most commonly shared among theories, such as the canonical
phrasal categories.
8.3 Optimal Semantic Information
The most critical issue, for treebanking as well as for the theories related to
NLP, is how much semantic information should be incorporated. The original
Penn Treebank used a purely syntactic approach. A purely semantic approach is yet
to be attempted. A third approach involves the annotation of partial
semantic information, especially as encoded in argument relations. It is this third
approach that is shared by most treebanks; e.g. the Prague Dependency
Treebank (Bohmova and Hajikova, 1999), the AnnCorra treebanks (Bharati et al.,
1994), etc. use a syntactico-semantic approach. In this approach, the thematic
relation between a predicate and an argument is marked in addition to the
grammatical category. This allows optimal semantic information to be incorporated
in a treebank and, subsequently, in an NLP system like a syntactic parser.
8.4 Minimal Granularity
Another important parameter is the granularity (depth) of analysis in a treebank.
While some of the earliest syntactically annotated corpora contain information on
syntactic boundaries only, others contain constituent structures (Abeille,
Clement, and Toussenel, 2003), functional dependency structures (Hajic, 1998) or,
in addition to the syntactic structures, also predicate-argument structures (Marcus
et al., 1994). The present KashTreeBank (S. Bhat, 2012) contains inter-chunk
dependency relations in addition to POS and chunk labels.
9. Summary
In this chapter, a review of the literature was presented, with a focus on dependency
grammar (DG), dependency parsing and treebanking. First, different grammar formalisms
that are considered close to DG were briefly presented, in order to compare their
fundamental notions with those of DG and to understand the common ground between them,
since they, like DG, all contrast with constituency-based formalisms. Sample
representations for each of these formalisms, i.e. for DG, RG and LFG, were also given. A
review of PSG-based formalisms was deliberately avoided, as their representations and
notions are hardly required once it is established that DG-based formalisms are preferred
for treebanking relatively variable word order languages, for many reasons, some of which
are given in section five of this chapter. After the grammar formalisms, the notion of
non-configurationality was elaborated, along with some modifications made to the original
PSG-based formalisms in order to minimize the operational apparatus and incorporate
notions of dependency, e.g. the incorporation of the VP shell. This was done to justify
the suitability of DG for inflectionally rich languages. Next, the history of
dependency-based representations was charted out, its roots were traced and its
development in different grammatical traditions was described. In the next section, the
notion of a treebank was introduced, along with some background on what triggered the
wave of treebank creation. Further, some important dependency treebanks were introduced,
and finally the principles that should govern treebanking efforts were presented.
Chapter.3 Creating Corpus for Kashmiri Treebank
"There are and can exist but two ways of investigating and discovering truth.
The one hurries on rapidly from the senses and particulars to the most general
axioms, and from them derives and discovers the intermediate axioms. The other
constructs its axioms from the senses and particulars, by ascending continually
and gradually, till it finally arrives at the most general axioms."
Francis Bacon, Novum Organum, Book 1.19 (1620)
1. Introduction
In rationalistic discourse, competence, the underlying ideal grammatical system in
the mind of a native speaker, is considered the only legitimate source of
grammatical knowledge, which can be accessed only through the grammatical
intuitions of the native speaker (Chomsky, 1956). Although performance,
the actual real-world utterances of a native speaker, is also a source
of grammatical information, it is not considered a legitimate source; it has
only been considered an inferior copy of the tacit knowledge, the competence.
In empirical discourse, however, an alternative stance is taken: the real-world,
observable and verifiable language, the actual writings, speech and signs,
which come under the purview of performance, is given prime legitimacy
for building a linguistic theory. Since a corpus is a real-world linguistic artifact
(written, spoken or signed) that stores linguistic knowledge, it is extensively used
in empirical research in linguistics, CL and NLP. As mentioned earlier, in Chapter
1, the linguistic knowledge that exists in a corpus is crucial for creating various
NLP tools and applications. Such knowledge can be captured either by building a
computational grammar (hand-crafted linguistic rules) or by annotating a large
electronic corpus to create a treebank. It is from such treebanks that grammatical
knowledge can be induced in a machine by statistical modeling. Hence, the
need for treebanks as an empirical basis for research on grammar is well
established.
Further, corpus-based empirical research, which was not much in practice
for quite a long time after the late 50s, was almost completely marginalized by the
strong rationalistic discourse and the formalisms subsequently developed. For instance,
one of the pioneers of the Brown Corpus (BC) shares the response they got for the
development of the BC: it was considered "a useless and foolhardy enterprise", as the
intuition of the native speaker was considered the only legitimate source of grammatical
knowledge of a language, which could not be obtained from a corpus (Francis, 1992).
However, with progress in corpus linguistics itself and the achievements in Speech
Recognition and NLP, particularly in Statistical Machine Translation (SMT), it is now a
well-established
fact that corpus-based empirical grammar products like treebanks are of crucial
importance not only for linguistic research and language technology (Nivre et al., 2005)
but also for cognitive and historical linguistic studies.
The next section introduces the notion of corpus
linguistics and provides its background. Section three discusses
the status of Kashmiri text corpora. Section four discusses the methodology for
developing a Kashmiri text corpus. Section five looks into various problems
of corpus development, like corpus sanitization, corpus normalization and
tokenization, in general and for creating the Kashmiri corpus (KashCorpus) in
particular. Finally, section six summarizes the chapter.
2. Notion of Corpus Linguistics
The Latin term 'corpus' means 'body'. It was traditionally applied to various
collections of linguistic or non-linguistic items. In linguistics, however, the term
'corpus' refers to finite collections of naturally occurring utterances. A corpus is,
in practice, a machine-readable, principled and organized collection of text, speech
and sign samples that represent a particular language or a variety of that language
(Leech, 1992; Sinclair, 1996). Corpus linguistics, however, is not a branch of
linguistics like the inter-disciplinary branches (psycholinguistics,
sociolinguistics, etc.) or the core branches (morphology, syntax, etc.); rather, it
is an alternative empirical (corpus-based) methodology that percolates through all
the branches of linguistics.
2.1. Some Background
The term corpus linguistics was not much in practice up to the early 1980s; it
came into the limelight with the publication of The Recent Trends in English Corpus
Linguistics (Aarts & Meijs, 1984). Corpus-based linguistic research actually
predates the rationalistic generative era (late 50s), when it was practiced by many
linguists32. Although they used hard copies of texts for manual analysis and paper
slips or cardboards for data storage33, their methodology was purely empirical,
based on real-world data. As mentioned above, the underlying notion of
language in corpus linguistics is an empiricist and probabilistic one, where
language is considered a real-life object which can only be probabilistically
32 Linguists like Firth (1930s), Jesperson (1940s), Franz Boas (1940s), Sapir (1950s), Bloomfield (1950s), Harris (1950s), Fries (1950s), etc were practicing this empirical brand of linguistic research (See Biber and Finegan 1991: 207)
33 Unsophisticated (Pen & Paper technology)
modelled, i.e. the correspondence between linguistic structures and grammatical
rules is a matter of frequency vis-à-vis probability.
“If it is correct to describe linguistic behavior as rule-governed, this is much
more like the sense in which car-drivers’ behavior is governed by the
Highway Code than the sense in which the behavior of material objects is
governed by the laws of physics, which can never be violated” (Sampson,
1992).
This period, prior to the 1950s, is considered the golden era of old-fashioned
corpus linguistics, which has been termed Early Corpus Linguistics (ECL)
(McEnery & Wilson, 1996). In ECL, corpora were collected, stored and
analyzed by linguists by hand, using pen and paper as the aids. Consequently,
corpora were hardly as large as today's and rarely faultless. The corpus-based
methodology required data storage (memory devices) and processing abilities
which were not available at that time. In the 50s, under the influence of logical
positivism and behaviorism, several linguists regarded the corpus as the primary
source of linguistic information. The corpus was deemed both necessary and
sufficient for the task at hand, and intuitive evidence was sometimes rejected
altogether. A small number of researchers applying corpus-based
methodology did make weaker claims, suggesting that the purpose of the linguist's
work is not simply to account for the utterances included in the corpus but rather
also to account for the ones which are not in the corpus at a given time (Leech,
1992). In spite of its intrinsic limitations (theoretical and technological), the
corpus-based approach was considered a scientific methodology for language
study. ECL was widespread among linguists until the early 1950s (McEnery
& Wilson, 2001). At the end of the 1950s, the corpus-based empirical method was
severely criticized and almost overshadowed by rationalistic discourse (ibid). The
criticism was partly genuine, given the crude techniques available at that time.
Finally, with the advent of computing machines and their usage in corpus
processing, owing to their large storage and computing capacities, modern corpus
linguistics came into existence, and in the early 60s the first modern corpus, known
as the Brown Corpus, was compiled for American English. Modern corpus
linguistics, known as Computerized Corpus Linguistics (CCL), received further
impetus from the ground-breaking successes in automatic speech recognition and
machine translation using various techniques of statistical language
modelling. The success in building various NLP applications based on different
modern-day corpora rekindled hope in empiricism, and by the early 90s the magic
spell of rationalism was almost reversed.
2.2. Text and Grammatical Knowledge
Writers write without being conscious that, apart from their intents, they carve
their grammatical knowledge and mastery of the language into the patterns of the
text. It is well established that this grammatical knowledge can be
harnessed. The grammatical information in a text corpus needs to be annotated
at various levels in order to be used in developing real-world NLP tools and
applications. This involves direct induction (learning) of linguistic knowledge from
annotated corpora. The annotated corpora being used are treebanks, in which the
implicit linguistic information has been made explicit through various levels of
annotation. Several NLP modules, like part-of-speech taggers, chunkers, parsers,
etc., and various NLP application systems, like Machine Translation, Question
Answering or Information Extraction, are trained and tested on treebanks, i.e. the
aforementioned systems learn linguistic knowledge from the treebank samples
and their performance is also evaluated on those samples. Training of a
system consists of two stages: (a) classifying the linguistic structures (i.e. words
and chunks) occurring in the corpus, and (b) assigning them probabilities of
occurrence according to a probabilistic language model.
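The two training stages can be illustrated with a deliberately tiny sketch: classify each word occurrence by its tag, then estimate tag probabilities from relative frequencies. The miniature tagged corpus and tag names are invented, and real systems use far richer models (e.g. HMMs or discriminative classifiers):

```python
from collections import Counter, defaultdict

# A miniature POS-tagged corpus (word, tag) -- invented for illustration.
tagged = [("birds", "N"), ("fly", "V"), ("the", "D"),
          ("birds", "N"), ("fly", "N")]  # "fly" also occurs as a noun

# Stage (a): classify -- count how often each word carries each tag.
counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word][tag] += 1

# Stage (b): turn the counts into probabilities of occurrence.
def p_tag_given_word(tag, word):
    total = sum(counts[word].values())
    return counts[word][tag] / total

print(p_tag_given_word("V", "fly"))  # → 0.5: "fly" is a verb half the time
```

A tagger trained this way would simply pick the most probable tag for each word, which is exactly why larger treebanks yield better estimates.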
3. Status of Kashmiri Text Corpus
The Kashmiri language presents unique challenges to descriptive, theoretical and
historical linguistics. It is a fascinating language not only for linguists who base
their research on rationalism but also for the corpus linguists and NLP
practitioners who base their research on empiricism. Though Kashmiri is fairly well
explored from a rationalistic orientation, it is yet to be explored from an
empirical perspective. A brief overview of the existing corpus resources for
Kashmiri is given in this section.
Since a corpus is the primary source of data for empirical research34, corpus
building is to be seen as part and parcel of corpus linguistics, which has become an
essential enterprise for the quantitative analysis and technological development of
34 The method used in empirical research is totally quantitative in nature which, in addition to documenting structural and functional analysis, also stores some numbers like frequency counts and probability weights with the items of analysis. This augmentation with statistical information makes linguistic data more information rich. Information richness and machine readability of such data makes it more preferred data for language technology & NLP research.
any language in the post-1980 scenario. This has led to the development of huge
corpora in many languages of the world, like English, French, German, Arabic,
Chinese, etc., which are hence loosely called resource-rich languages. Some
languages, however, still lack such resources on a large scale, like most of the
South Asian languages (SALs), and are hence called resource-poor languages. For
instance, Indian languages present a good example of the resource-poor scenario.
The work of corpus building in ILs first started at the individual level
thirty-three years ago at Kolhapur University, when the KCIE35 (Shastri, 1988) came
into being; it consists of approximately one million words of Indian English in
ISCII encoding. The next initiative in this direction was taken by the Department
of Electronics, Govt. of India, in the form of a project, TDIL36, in 1991. The
project was launched to develop three-million-word text corpora for all ILs
included under the 8th Schedule (cf. Ganesan, 1999). The corpora were compiled from
text materials published between 1981 and 1990. For Kashmiri, Urdu and Sindhi the
initiatives were taken at AMU, and similar efforts were made at different
institutions throughout the country. Another effort was made under the EMILLE
project to build multilingual corpora for South Asian languages (McEnery et al.,
2000), which released a 200,000-word parallel corpus of English, Urdu, Bengali,
Gujarati, Hindi and Punjabi. However, ILs still need large-scale language resources
in order to develop sufficient language technology and thereby enhance their online
representation. To this end, many corpora projects have been launched recently, for
instance LDCIL and ILCI37, which are still going on. The former aims at producing
quality annotated corpora for all 22 scheduled ILs, whereas the latter aims at
producing a parallel corpus in the tourism domain for all major ILs, keeping the
Hindi corpus as the pivot, which is translated into the other languages.
The efforts of corpus building started post-80s in Europe and America on a large
scale, with a considerable amount of standardization, whereas these efforts started
only a decade back in India, on a small scale and in isolated projects, with less
emphasis on standardization. As a result, resources were created only for a few
major languages, and without proper standardization.
Consequently, until 2008-2009, in spite of the efforts made under TDIL, there
35 Kolhapur Corpus of Indian English
36 Technology Development of Indian Languages
37 Indian Languages Corpora Initiative
were hardly any language resources for Kashmiri, and hence no corpus-based
research on Kashmiri was possible before. It was only after some initial efforts
in this direction, first at the Central Institute of Indian Languages (CIIL)
and then at Kashmir University (KU)38, that some corpus-based studies were made.
These corpus-building efforts resulted in some basic language resources and
computational tools, like a Unicode-compatible font, a text corpus, a POS-annotated
corpus, a speech corpus, annotation tools, transliteration tools and some lexical
resources, like a trilingual dictionary and a frequency dictionary for Kashmiri.
Besides, C-DAC Pune, which is also involved in the localization of various software
like OpenOffice, has developed a software package for all Indian languages,
including Kashmiri, which consists of a word processor, a browser, a spreadsheet,
etc. Although a considerable amount of Kashmiri text corpus was built at AMU, more
than one million words of text corpus have been built under LDC-IL39 at CIIL, and
about 2-5 lakh words at KU (Bhat, 2012), no existing corpus is open to researchers
till now. Therefore, instead of trying to get the existing corpora, new small-scale
resources were created for developing the KashTreeBank. The next section describes
the methodology used in building the Kashmiri corpus.
4. Methodology for Building Kashmiri Text Corpus (KashCorpus)
Theoretically, text corpora can be developed by typing in printed texts, by using
OCR, or through speech recognition. However, OCR and speech technologies are far
from perfect, especially for ILs, and the only workable method is to key in the texts
manually. For the development of the Kashmiri Text Corpus (KashCorpus) too, raw
text was collected and digitized by inputting the data manually into Microsoft Word
(.doc format). After certain procedures like cleaning and normalization, the corpus
was deemed fit for linguistic scrutiny and for different types of annotation. The
entire procedure adopted for the development of KashCorpus is explained below,
along with the associated issues.
4.1 Planning Corpus
Planning is a very important stage, in fact a decision-making one, in corpus
building. It is at this stage that the source and the nature of the text, and the purpose
for which the corpus needs to be built, are decided upon. Once the purpose of the
38 In the Department of Linguistics, Kashmir University, under the DIT-funded project Development of Kashmiri Language Technology Tools (see kashmirizaban.com)
39 LDCIL stands for Linguistic Data Consortium for Indian Languages, which is set up at CIIL. It is a scheme of MHRD, Govt. of India, with the goal of creating annotated language resources for all ILs (for details see ldcil.org)
corpus is clear, other specifications like character encoding, text encoding, and the
format for storage and usage are also laid down. The general practice in
treebanking is to use newspapers as the primary source of data, because they are
easily available and can be freely downloaded, e.g. the Wall Street Journal
(WSJ) part of the Penn Treebank. However, digitization is yet to be achieved for
newspapers in most of the ILs, and where newspapers are digitized at all, they are
mostly in image format, which cannot be used directly as a corpus, though one can
download or copy them and key them in. For the current work the situation was even
more difficult, as only a few newspapers are available in Kashmiri, and those are
rare and not digitized.
4.2 Selecting Text Domains
Theoretically, one can identify different domains of text, such as Aesthetics
(Literature and Fine Arts), Natural, Physical and Professional Sciences, Social
Sciences, Commerce, Government Documents, etc., which are very important for
creating a balanced corpus, but the availability of such domains varies from
language to language. Moreover, certain domains have more day-to-day relevance
than others, like government documents and medical and tourism texts. These
domains are more useful in developing technology for e-governance and hence are
much in demand these days for commercial purposes in developing
various NLP applications. However, such text domains, whether important for
building a balanced corpus or important for commercial purposes, are not available
in Kashmiri. This is because Kashmiri has never been used as an official language or
as the medium of instruction40; currently too, the official language of the state is
Urdu and the alternative official language is English. Therefore, text
production is confined to limited domains41, predominantly literature. Since, as
mentioned earlier, the current corpus is meant for developing KashTreeBank, it was
decided that newspaper text should be used. The rationale for using a newspaper
corpus was not only to be in tandem with the general practice in the field of
treebanking; an additional reason was that the textual material collected from books
(Bhat 2012) shows the least grip of standardization, whereas newspapers use
comparatively standard forms. However,
40 It is worth mentioning that Kashmiri was intermittently introduced into and taken out of the school curriculum, and recently it has been re-introduced as a subject taught in schools. This is probably the reason that most young people are unable to read and write Kashmiri, while elders and children are well versed in it.
41 It was observed during fieldwork that there is hardly any text from domains other than Aesthetics (Bhat 2012). The fieldwork was done for LDCIL, in which data was collected from 270 books for developing a balanced corpus for Kashmiri.
when the newspaper corpus was initially used on an experimental basis, it was found
very difficult to annotate at the sentence level, as the sentences were very complex
and lengthy. Consequently, it became very hard to lay down the first version of the
annotation guidelines. To mitigate this difficulty, some short-story text was
also selected and added to the existing corpus. The current KashCorpus consists of the
following domains:
S. No Domain Word Count (WC) %age
01 Short Stories (SS) 3384 7.29
02 News Articles Political (NAP) 14395 31.02
03 News Articles (NA) 7001 15.09
04 News Report Political (NRP) 14263 30.74
05 News Report (NR) 2997 6.45
06 Editorials (ED) 4354 9.38
Total WC 46394
Table.1 Text Domains
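The percentage column of the table above can be recomputed directly from the word counts; the following minimal sketch (Python) uses the counts copied from the table:

```python
# Word counts per domain, copied from Table 1.
domains = {
    "SS": 3384, "NAP": 14395, "NA": 7001,
    "NRP": 14263, "NR": 2997, "ED": 4354,
}

total = sum(domains.values())
shares = {d: round(100 * wc / total, 2) for d, wc in domains.items()}
print(total)                        # 46394, matching the Total WC row
print(shares["SS"], shares["ED"])   # 7.29 9.38
```

The small discrepancies against the published column (e.g. 31.03 vs. 31.02 for NAP) suggest that the original figures were truncated rather than rounded.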
4.3 Data Collection
For building KashCorpus, data collection was carried out through fieldwork. As
mentioned above, it was not possible to collect newspaper data online, as can be
done for English, Urdu, Hindi, Tamil, etc., which have fairly good online
representation. It was decided to use the text of Sangarmaal, the only well-known
newspaper in Kashmiri, which has recently started daily publication after
previously being a weekly. The other Kashmiri newspapers - kAhvaTh,
soon miiraas, arnimaal, miiraas & kA:shur times - have very limited circulation.
Sangarmaal too is not a widely read paper, as there are very few people who can
read and write Kashmiri, while English and Urdu newspapers are widely read in
Kashmir. Therefore, it became necessary to go to the field for newspaper
collection. Some issues of Sangarmaal (spanning six months) were purchased,
and news items, editorials and articles, mainly from the political domain, were
marked up. Besides these, short stories were also included in the corpus, most of
which were taken from an anthology of prose used to teach Kashmiri at
NRLC. The decision to add short stories was taken at the last stage because,
as aforementioned, the average sentence length in the newspaper corpus was
found to be high, approx. 27 words, and the sentences were also quite complex. On the other
hand, the average sentence length of the short-story corpus is approx. 12 words,
though still with considerable complexity. Moreover, data collection should ideally
follow a proper sampling scheme, as is done at LDCIL for building text corpora (each
nth page from an n-page book/magazine/journal) for all scheduled languages; for the
current case, however, random sampling was done, in which no explicit criterion was
followed to choose the text. Care was taken, though, to use the smallest possible number
of newspapers to avoid wastage. The sample details of the newspaper data
collected during the field trip in 2011 are given in the table below.
File ID. Metadata Words Domain
KashCorp 01 Sangarmaal, Srinagar, vol. 5, issue 21, 14-20 June 2010 204 ED
KashCorp 02 Sangarmaal, Srinagar, vol. 5, issue 17, 3-9 May 2010 132 NA
KashCorp 04 Sangarmaal, Srinagar, vol. 5, issue 12, 29 March-4 April 2010 172 NAP
KashCorp 06 Sangarmaal, Srinagar, vol. 4, issue 90, 16-22 November 2009 216 NRP
KashCorp 19 Sangarmaal, Srinagar, vol. 5, issue 19, 24-30 May 2010 202 NR
(The headline of each item, in Perso-Arabic script, is not reproduced here, as it was garbled in extraction.)
Table.1 Metadata of Sample Newspaper Corpus
4.4 Character Encoding
These days, unicode has become the prime choice in character encoding for text
corpora creation. Unicode is the universal character encoding standard which
defines a consistent scheme for encoding multilingual text and assigns a numeric
value (code point) and a name for each of its characters. Unicode characters are
represented in three forms of UTF42; 32-bit form, 16-bit form or an 8-bit form
(UTF-8). UTF-8 has been designed for ease of use with existing ASCII and ISCII-
based systems. The Unicode Standard specifies a code point and a name for each
of its characters. It contains more than 1 million code points, most of which are
available for the encoding of characters (Allen at al., 2009). The availability of
unicode compatible font is a prerequisite for the development of corpus with
unicode compatibility. As mentioned earlier, Kashmiri has only one unicode
compatible font with least issues, i.e. Afan Koshur Naksh (Aadil 2011) and is
being used for major NLP related works for many projects. It has been also used
for the development of the current corpus. The table 2 shows the encoding of
Kashmiri characters employed in developing KashCorpus.
S. No Characters Unicode
Values
S. No Characters Unicode
Values
1 ا 0627 30 ل 0644
2 ب 0628 31 م 0645
3 پ 067E 32 ن 0646
4 ت 062A 33 و 0648
5 ٹ 0679 34 ہ 06C1
6 ث 062B 35 ھ 06BE
7 ج 062C 36 ء 0621
8 چ 0686 37 ی 06CC
9 ح 062D 38 ے 06D2
10 خ 062E 39 064E
11 د 062F 40 آ 0622
42 UTF stands for Unicode Transformation Format; see http://www.unicode.org, http://www.unicode.org/versions/Unicode 5.2.0
12 ڈ 0688 41 ٲ 0672
13 ذ 0630 43 0650
14 ر 0631 44 ی 0656
15 ڑ 0691 45 064F
16 ز 0632 46 0657
17 ژ 0698 47 0654
18 س 0633 48 0655
19 ش 0634 49 065A
20 ص 0635 50 ن 065B
21 ض 0636 51 � 064D
22 ط 0637 52 ، 061B
23 ظ 0638 53 ۔ 06D4
24 ع 0639 54 ؟ 061F
25 غ 063A 55 * ۄ 1732
26 ف 0641 56 س * 1773
27 ق 0642 57 * ۍ 1741
28 ک 06A9 58 ٮ� * 1646 + 1770
29 گ 06AF
Table.2 Kashmiri Unicode Chart
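The code-point values in the chart can be verified programmatically, since modern programming languages expose Unicode code points directly; a minimal sketch (Python):

```python
# A few (character, code point) pairs copied from Table 2.
chart = [("ا", 0x0627), ("پ", 0x067E), ("ک", 0x06A9), ("گ", 0x06AF)]

for ch, cp in chart:
    # ord() returns the Unicode code point of a one-character string.
    assert ord(ch) == cp, (ch, hex(ord(ch)))

# The asterisked, Kashmiri-specific rows of the chart appear to list
# decimal rather than hexadecimal values: 1732 is 0x06C4, i.e. ۄ.
assert ord("ۄ") == 1732 == 0x06C4

# UTF-8, the serialization chosen for the corpus, encodes each code
# point as one to four bytes:
print("ا".encode("utf-8"))   # b'\xd8\xa7'
```

If the chart is reused, it would be worth normalizing the asterisked rows to hexadecimal (e.g. 1741 = 0x06CD for ۍ) so that all values are in one notation.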
4.5 Text Encoding
The term text encoding refers to the practice of representing textual and linguistic
data in a certain format in a corpus. A standard encoding format provides the greatest
possible generality and flexibility (McEnery & Wilson, 1996). XML43 is the
emerging standard for data representation and exchange on the World Wide Web
(Bray, Paoli & Sperberg-McQueen, 1998). At the fundamental level, XML is a
document markup language directly derived from SGML, with various additional
features that make it a far more powerful tool for data representation and access.
Therefore, the natural choice these days is to store a corpus in an XML format. An
XML format provides the needed standardization so that a user who is not familiar
with the corpus but is familiar with XML DTDs can easily interface with it.
For the current KashCorpus, however, no markup language or XML DTDs were
43 Extensible Markup Language
used; instead, the entire corpus has been rendered in plain document (.doc) format,
since the corpus serves only one purpose, i.e. to be used for syntactic
annotation, and for that purpose it was not necessary to have the corpus in XML
format; a plain-text (.txt) format in UTF-8 was sufficient. However, the corpus
can easily be converted into XML format.
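Such a conversion can be sketched in a few lines; the element and attribute names below (doc, s, id, domain, n) are hypothetical, as no DTD was actually defined for KashCorpus:

```python
import xml.etree.ElementTree as ET

def to_xml(file_id, domain, sentences):
    """Wrap one plain-text corpus file in a simple XML envelope."""
    root = ET.Element("doc", {"id": file_id, "domain": domain})
    for n, s in enumerate(sentences, start=1):
        el = ET.SubElement(root, "s", {"n": str(n)})
        el.text = s
    return ET.tostring(root, encoding="unicode")

xml = to_xml("KashCorp01", "ED", ["First sentence.", "Second sentence."])
print(xml)
```

A wrapper like this preserves the file-level metadata of Table 1 inside the corpus file itself, so that tools no longer need the external metadata table.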
4.6 Data Entry
Data entry is the cornerstone of any corpus-building endeavour. It is a time-consuming
task, especially for a language in which people are accustomed to word
processors that use different encoding standards and are not compatible with
Unicode (like InPage), and are not yet familiar with using Microsoft Word, as is the
case for Kashmiri. Finally, the manually marked-up news items and articles from
Sangarmaal and the short stories from the anthology were typed in. It took a
professional data inputter 8 days to input 46394 words of Kashmiri newspaper text
in Microsoft Word, an average of about 5,800 words per day (5-7 hrs). The resulting
corpus was unclean, i.e. it contained a lot of typos and spacing problems, and was
still unfit for the next level of processing. It is a well-established general practice
that a corpus needs to be sanitized and preprocessed before being put to actual
use. A sample of the unclean corpus is given below in Table 3. It contains three
parts: a) Metadata (information about the data available in the corpus), b) Data
(the text on which the actual work is done), and c) Word Count (the number of words
of the actual data, excluding metadata).
Metadata
File ID No.: KashCorp11
Newspaper Details: Sangarmaal, Srinagar, vol. 5, issue 17, 3-9 May 2010
News Item Title: (Perso-Arabic; garbled in extraction)
Item Type: Sangarmaal analysis (سنگرمال تجزیہ)
Data: (Kashmiri news analysis in Perso-Arabic script concerning India-Pakistan relations and the SAARC summit; the text is too garbled in this extraction to reproduce reliably)
Word Count: W-241
Table.3 Sample of Unclean KashCorpus
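Word counts such as the W-241 figure above can be recomputed once the files are stored as UTF-8 plain text; a minimal sketch (Python), counting whitespace-delimited tokens:

```python
def word_count(text: str) -> int:
    """Count whitespace-delimited tokens, the unit used in Tables 1 and 3."""
    return len(text.split())

# Any UTF-8 Kashmiri string works; the romanization is irrelevant here.
print(word_count("اکھ دۄ ترے"))   # 3
```

Because of the split-orthography discussed in section 5.3, a whitespace-based count slightly overstates the number of linguistic words; the counts reported here are token counts in that sense.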
4.7 Copyright Issue
Copyright legislation is one of the serious problems in the building and use of
large-scale text corpora, as authors and publishers protect their rights over their
texts through copyright laws. The main concern for the corpus builder is that any
text to be digitized and included in the corpus may be under copyright
protection, and permission has to be obtained for its use. If the corpus is going
to be used in the development of systems or applications for commercial
purposes, one has to obtain permission and enter into an agreement with the
authors or publishers in which some royalty is fixed for each text. However,
if the corpus is to be used for research purposes, there is hardly any need to obtain
permission or enter into an agreement. Since the current KashCorpus is not
for any commercial purpose but only for research, permission for using the
texts has not been sought. The next section describes the procedures involved in
polishing the corpus.
5. Preprocessing
Once the inputting of the text is finished and the corpus is ready, it cannot be
used directly for annotation purposes; rather, it has to go through some further
manipulation. For instance, the current KashCorpus was manually sanitized,
normalized and tokenized before being used for POS annotation. All the
manipulations applied (manually or automatically) to the corpus prior to
annotation can be collectively called preprocessing.
5.1. Corpus Sanitization
Corpus cleaning involves proofreading, i.e. checking the digitized corpus files
for typos, errors, and spelling and grammatical mistakes. During this process it is
necessary to be faithful to the text, as whatever one may take to be a mistake on the
part of a writer could in fact be a variation. The causes of the errors and mistakes,
and how they were corrected in the process of sanitization, are given below:
a. The inputter's limited expertise in the Kashmiri script, in spite of good typing
experience, resulted in many errors and mistakes and, consequently, a more
unclean corpus. Moreover, it was found that the highest-scoring day (in
terms of words input per day) was also the day on which the most errors and
mistakes were committed: the percentage of erroneous words was higher
than on days with an average word count. Taking the required time
therefore seems to be a good strategy, as haste to finish more and more
words per day can increase the percentage of erroneous words.
b. Sometimes the bad quality of the print, or errors in the original text, would
lead to wrong judgment of letters/words and, consequently, mistakes on the
part of the inputter.
c. The Kashmiri script uses many diacritics to represent different phonetic
subtleties of the language. Sometimes one of these diacritics appears on the
top or bottom of one character when it is actually part of the preceding or
following character. It is therefore sometimes hard to decide from the text
which letter a diacritic belongs to, unless the native speaker's intuitions are
taken into consideration. These apparently misplaced diacritics generally
confuse the data inputter and result in a lot of spelling mistakes, i.e. an
unclean corpus. For instance, in the word مخالفتکُ (mukhaalfatku) given above
in Table 3, the 'pesh' diacritic appears misplaced, i.e. on the final letter of the
word (ک), when it actually belongs on the preceding letter (ت), the actual
word being مخالفتُک (mukhaalfatuk). Such mistakes are regular and hence
quite predictable.
d. Since various key combinations are used to input different characters, errors
arise when only part of a combination is pressed: e.g. pressing Shift+P types
one character, but if only one key is pressed (e.g. only 'P'), an entirely
different character (e.g. پ) gets typed in. This has resulted in various errors,
which are more or less predictable.
e. Some diacritics were used in a consistent way, despite variations in the text.
When two diacritics are typed contiguously (one after the other), where the
first joins two consonants so that they function as a unit and the second
represents the vowel on that unit, it was decided to type them in this
sequence: first the consonant, second the diacritic representing the vowel,
third the next consonant, and fourth the conjoining diacritic. Splitting occurs
when the two diacritics come contiguously; to avoid this, the vowel-representing
diacritic is typed after the first consonant, which then forms a unit with the
following consonant with the help of the linking and shortening diacritic.
This pattern was followed throughout the entire process of cleaning.
f. It was also decided not to treat an aspirated consonant as a unit and place the
diacritic after that unit; instead, the diacritic is put on or under the letter
representing the consonant. For instance, in a word beginning with aspirated
چھ, the vowel-representing diacritic is actually associated not with the
word-initial letter (چ) alone but with the unit (چھ); nevertheless, it is
consistently written after the first letter (چ), and this form is typed everywhere.
g. It was also maintained to put the diacritic representing a vowel under the
unpronounced pseudo-character44 instead of directly under the preceding
letter representing a consonant, which would go against the writing
convention. In such words the final letter remains unpronounced and is used
merely as a supporting character for the diacritic, so that the diacritic is
44 In Kashmiri a letter is used at word-final position just for the support of the preceding diacritic, which can't stand on its own. In such cases it is a pseudo-character, as it doesn't represent anything of the phonological word.
written under it. These are mere orthographic conventions and have nothing to
do with phonological rules.
All these types of errors were rectified during the course of cleaning, keeping in
mind the principle of faithfulness to the text, along with some additional decisions to
maintain consistency. The next sub-section describes the normalization of the
corpus.
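Regular, predictable errors of the kinds listed above lend themselves to semi-automatic correction with an ordered substitution table; a minimal sketch (Python). The pair below is a romanized placeholder, since the actual diacritic sequences cannot be reproduced reliably here:

```python
# Ordered (wrong, right) substitutions for the regular, predictable
# errors described above; romanized placeholder for a real pattern.
CORRECTIONS = [
    ("mukhaalfatku", "mukhaalfatuk"),   # item c: pesh typed on the wrong letter
]

def sanitize(text: str) -> str:
    for wrong, right in CORRECTIONS:
        text = text.replace(wrong, right)
    # Collapse stray runs of whitespace introduced during data entry.
    return " ".join(text.split())
```

A pass like this handles only the predictable cases; judgments that require native-speaker intuition, as in item c, were still made manually.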
5.2. Corpus Normalization
Corpus normalization involves all the necessary manipulations of the corpus not
covered under cleaning and tokenization. It primarily involves filling in left-out
diacritics. As mentioned above, the Kashmiri script uses fourteen diacritics to
represent different phonetic subtleties of the language. Urdu also uses a
modified Perso-Arabic script but drops the three crucial vowel-representing
diacritics, namely zer, zabar and pesh. The same tendency can also be seen in
Kashmiri texts, but it is not as severe as in Urdu texts, where all diacritics are
left out. Like Urdu writers, Kashmiri writers tend to drop these three crucial
diacritics; the remaining ones, however, are essential, specific to Kashmiri, and
cannot be inferred from context. Dropping the diacritics creates a big text-normalization
problem that needs to be taken care of: all diacritics need to be restored in the text,
or at least where they are crucial for word identification and disambiguation. The
same has been done with the current KashCorpus; all the crucial diacritics have
been inserted manually. In practice, corpus cleaning and normalization were done
simultaneously.
5.3. Tokenization
A token is a string of characters delimited by unit character spaces, and
tokenization is a preprocessing procedure by which this disparity is removed,
achieving a one-to-one mapping between tokens and words (or major
grammatical categories), either by concatenation (joining) or by segmentation,
respectively. A natural one-to-one correspondence between a word (simple or
complex) and a token has generally been observed in isolating45 and inflectional46
45 Isolating languages have a low morpheme-per-word ratio; the lower this ratio, the more isolating the language is said to be. Purely isolating languages have a 1:1 word-morpheme ratio, e.g. Mandarin. Therefore, languages with a one-to-one correspondence between words and morphemes are said to be isolating.
46 Inflectional languages have a high morpheme-per-word ratio, in contrast to isolating languages; the higher this ratio, the more inflectional the language is said to be, e.g. the Indo-European languages (also known as low synthetic languages).
languages, but hardly in agglutinating languages47, where one token usually
corresponds to many grammatical words (POS categories). However, in the case of
some inflectional languages, particularly those which use a modified
Perso-Arabic script and borrow heavily from Persian, e.g. Urdu and Kashmiri, no
one-to-one correspondence between words and tokens is observed for many complex
words (bound + free morphemes). The root of such a word is written as one token
and the affix as another, separate token. This is common practice in Kashmiri and
Urdu and mostly occurs with Persian-borrowed affixes, as given in Table 4. In
Kashmiri, this practice has been observed even in some simple words, where the two
parts of the word are written as two separate tokens, with the blank space between
them possibly, but not necessarily, representing a morphemic boundary. Moreover,
case markers, if added to such words, give rise to three-token words, as in
examples 4, 5 & 6 of Table 5. Thus, the second part of the word may or may not be
a bound morpheme, but the third token surely is. This orthographic convention of
writing bound morphemes or parts of words as separate tokens, to avoid
unacceptable word shapes due to the context-sensitive48 script, is called
split-orthography49. The Kashmiri-specific examples of split-orthography are
mostly taken from the corpus sample given in Table 3.
The concept of space as a word boundary is weak in the Urdu script
(Durrani and Hussain, 2010), but it is far weaker in the Kashmiri script. A zero-width
non-joiner (a space-like character, as can be seen between Roman letters) is primarily
required to generate acceptable word shapes on the one hand, and to join the various
parts of a word and rectify the tokenization problem on the other. It has been
47 Agglutinating languages have the highest morpheme-per-word ratio but, additionally, a low degree of fusion of the major grammatical categories, e.g. Turkish, Tamil, Malayalam, Telugu, etc. (also known as polysynthetic languages).
48 In such scripts some letters (joiners, but not non-joiners) take on different shapes upon joining with the adjacent letters. There are three possible shapes a letter can take, at initial, medial and final positions (contexts), in a concatenated sequence of letters of a word. The letters assuming these three shapes according to context are called joiners. Another set of letters, called non-joiners, do not change their shape according to context; they join only with the letter immediately preceding them and thus have only word-final and isolated variants. Examples of joiners are the Arabic letters 'be, te, miim, ye, siin' (ب ت م ی س), and examples of non-joiners are the Arabic letters 'vaav' and 're' (و ر).
49 The term split-orthography is used due to the unavailability of any technical term in the existing literature to denote this splitting tendency in the Perso-Arabic script (an orthographic convention), due to which affixes and roots are written separately, and even some roots are written as two tokens, forming multi-token words. The term is thus a new coinage to describe the tokenization problem of Kashmiri, Urdu, etc. (S. Bhat, 2010 & 2012).
already implemented for Urdu (G. Lehal, 2010), but for Kashmiri it has been
implemented only very recently and is compatible with Windows-08 only. However,
instead of a zero-width non-joiner, an underscore (_) was used in the tokenization
of the Urdu Dependency Treebank (R. Bhat 2012) and in the manual preprocessing of
the Urdu and Kashmiri corpora at LDCIL (S. Bhat 2012), whereas for the current work
a dash (-) has been used, instead of an underscore or a zero-width unit character
space, to join the parts of a word, as shown in examples 1-3 of Table 4 and 10-15 of Table 5.
S. No. Root (Token I) Affix (Token II) Words Urdu Kashmiri
1 aqIl (عقل) -mand (مند) aqIl–mand عقل_مند نند چم عقل-
2 mazmoon -nigar mazmoon–nigaar مضمو�_Yگا� چمضمو�-Yگا�
3 tA:liim -yaaftah tA:liim–yaaftah تعلیم_یا}تھ Kہ ییم-یا}ت Eتٲ
4 khatIm -shudah khatIm–shudah حاصل_شد حٲصل شد
5 hA:sil -kardah hA:sil–kardah حاصل_Hرد ہد حٲصل Hر
6 gonah -gaar gonah–gaar گنا_گا� گۄKY گا�
7 qosuur -vaar qosuur–vaar قصو�_Iا� قصو� Iا�
8 khosh -go khosh–go خوش_گو خوش گو
9 tarqii -paziir tarqii–paziir ترق/_پزیر ییر ترق/ پز
Table.4: Tokenization Problem Common in Kashmiri & Urdu
S. No. Root (Token I) Affix (Token II) Affix (Token III) Words Kashmiri
1 butaan Chi - buuTaan-chi K5ہ بھوٹا�
2 iisvii k’n - isvii-k’n �Cٮ H عیسو�
3 sekretrii Yan - sekretrii-yan چی� سیکریٹر�
4 zimI dA:rii yan zimI-dA:rii-yan چی� ہK دٲ�� ہgم
5 tariiqI Kaar k’n tAriiqI-kaar-k’n �Cٮ H ا�H Kہ طریق
6 fal safI kis fal-safI-kis ہHس Kہ }ل سف
7 paanI van’ - paanI van’ ہYI Kۍ Yپا
8 mukhaal fAts - mukhaal-fAts �}ژ مخال
9 misrI Kis - misrI-kis ہHس ہر مص
10 sapIz mIts - sapIz-mIts ہمژ سپز-
11 vAr’ Yas - vAr’-yas چیس و�ۍ-
12 Ak’ sIY - Ak’-sIY ہس] �اHۍ-
13 pAt’ Mis - pAt’-mis ہمس �پتۍ-
14 anan vA:l’ - anan-vA:l’ ل ۍانن-و ٲ
15 yithI pA:Th’ - yithI-pA:Th’ ٹھ -پ ۍیتھ ٲ ہ
Table.5 Tokenization Problem Specific to Kashmiri
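The dash-joining convention described above can be sketched as a dictionary-driven pass over the token stream; the affix list below is a tiny hypothetical sample, not the full inventory used for KashCorpus:

```python
# Hypothetical sample of bound morphemes (and second parts of words)
# written as separate tokens under split-orthography; romanized here.
BOUND = {"mand", "nigaar", "yaaftah", "chi", "yan", "mIts", "dA:rii"}

def join_split_tokens(tokens):
    """Attach each bound-morpheme token to the preceding token with a dash."""
    out = []
    for tok in tokens:
        if out and tok in BOUND:
            out[-1] = out[-1] + "-" + tok   # aqIl + mand -> aqIl-mand
        else:
            out.append(tok)
    return out

print(join_split_tokens(["aqIl", "mand", "zimI", "dA:rii", "yan"]))
# ['aqIl-mand', 'zimI-dA:rii-yan']
```

Note how the three-token case (example 4 of Table 5) folds up in two steps. In practice such a lookup needs contextual checks, since some of these strings can also occur as free words, which is why the joining for KashCorpus was done manually.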
6. Summary
In the wake of the current corpus-linguistics scenario and the boom of empirical
studies, the development of a Kashmiri corpus is the need of the hour, not only to
feed data-hungry research and development initiatives for the technological
enhancement of the Kashmiri language, but also to carry out various quantitative
studies and discover realities that have remained unexplored so far due to
the unavailability of a corpus. Though in this chapter the building of
KashCorpus is described from a specific point of view, i.e. for developing
KashTreeBank, the corpus can also be used for different types of studies. This
work is the most basic part of a larger resource-creation effort to put the Kashmiri
language on the map of current language technology. Like any other corpus-building
endeavor, the creation of KashCorpus was not a straightforward
process; there were many issues, such as the selection of text domains and the
representativeness of the language in the selected samples, which were properly
scrutinized and resolved before starting the actual work. The other major problems
include the unavailability of any online resource from which data could have been
obtained, the total vacuum of commercially important text domains like medical and
tourism text, and the lack of well-trained data inputters who are well versed not only
with the Perso-Arabic script in general but particularly with the Kashmiri script and
its Unicode-based inputting setup. Usually, data inputters use InPage rather than
Microsoft Office for Kashmiri input. Finally, many processes were carried out to make
the corpus fit for adding further value through various types of annotation. These
preprocesses include corpus cleaning, normalization and tokenization. Though
tokenization is sometimes treated as a separate problem between corpus
building and corpus annotation, in this work it is included as part of corpus
building, as it was carried out manually along with cleaning and
normalization. In its present form, KashCorpus is ready to be used for the
future work.
Chapter.4 POS Tagging of KashCorpus
"The definitions of the parts-of-speech are very far from having attained the degree of exactitude found in Euclidean geometry."
Otto Jespersen, The Philosophy of Grammar, 1924
1. Introduction
Part-of-speech (POS) tagging constitutes the fundamental layer of annotation in
treebanking, on the basis of which further annotation layers are built. The next
layer of annotation is called chunking, which is important for determining
dependency relations, the most crucial task in building a dependency treebank.
The POS category which forms the head of a chunk can be further augmented
with crucial morphological information like PNGC50 and TAM51, but in the
current work adding morphological information has been avoided, in order to
concentrate on inter-chunk dependencies and get the skeletal dependency trees
ready. It is important to mention that morphological information can easily be
added later in order to get better results in automatic syntactic annotation.
This chapter describes the first level of annotation of KashCorpus, i.e. POS tagging and
chunking, and the associated resources, technicalities and manipulations of the data that
were required to start POS annotation. Section two explains the notion of POS
tagging. Section three briefly discusses the important annotation standards. Section four
presents the POS tagsets developed, mainly, for English and the Indian languages, and
50 Person, Number, Gender, Case
51 Tense, Aspect, Mood
elaborates only the most relevant ones. Section five describes the Kashmiri BIS tagset.
Sections six, seven, eight and nine discuss the requirements, the process, the issues and the
guidelines of POS tagging, respectively. Section ten provides the statistical results, and finally
section eleven summarizes the chapter.
2. The Notion of POS Tagging
The notion of parts-of-speech (POS) tagging has been put very elegantly by Daniel Jurafsky and James H. Martin (2000):
“Words are traditionally grouped into equivalence classes called parts-of-speech
(POS), word classes, morphological classes, or lexical tags. In traditional grammars
there were generally only a few parts of speech (noun, verb, adjective, preposition,
adverb, conjunction, etc.). More recent models have much larger numbers of word
classes (45 for the Penn Treebank (Marcus et al., 1993), 87 for the Brown corpus
(Francis, 1979; Francis and Kučera, 1982), and 146 for the C7 tagset (Garside et al.,
1997).
The part of speech for a word gives a significant amount of information about
the word and its neighbors. This is clearly true for major categories, (verb
versus noun), but is also true for the many fine distinctions. For example
these tagsets distinguish between possessive pronouns (my, your, his, her, its)
and personal pronouns (I, you, he, me). Knowing whether a word is a
possessive pronoun or a personal pronoun can tell us what words are likely to
occur in its vicinity (possessive pronouns are likely to be followed by a noun,
personal pronouns by a verb). This can be useful in a language model for
speech recognition.”
POS tagging is a process of assigning part-of-speech tags to each and every word
used in continuous text after the morphological analysis and grammatical
interpretation (Garside, 1995). A set of specially designed tags, carrying
grammatical information are assigned to words to indicate their parts-of-speech
category with regard to their use in the text (Leech and Garside, 1982). POS
tagging is actually the process of labeling words in a running corpus with their
grammatical categories (optionally with the morpho-syntactic features), based on
both their form as well as their contextual function. It is essentially a classification
problem in which words are classified on the basis of a predefined inventory of
parts-of-speech categories called POS tagset. For morphologically rich languages,
it plays a limited role of syntactic category disambiguation in the entire pipeline
of NLP modules where morphological analyzer provides all possible POS
categories for a word and POS tagger just disambiguates the category of the given
word by selecting only one according to its context. It is the fundamental level of
corpus annotation; in fact, it is the first stage to proceed for the syntactic
annotation in order to develop a treebank. Apart from its role in treebanking, POS
annotated corpus alone can be used in a wide range of NLP applications like
information extraction, information retrieval, parsing (shallow as well as deep),
machine translation, speech synthesis and speech recognition.
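The classification view of POS tagging described above can be sketched in a few lines. Everything in this fragment is invented for illustration (the tiny lexicon, the single contextual rule, the tokens); it is not a description of any real tagger, but it shows the core idea of selecting one tag per token from a predefined inventory using context.

```python
# Hypothetical sketch: POS tagging as classification over a predefined tagset.
# The lexicon and disambiguation rule are invented for illustration only.

AMBIGUOUS_LEXICON = {
    "book": ["N_NN", "V_VM"],   # ambiguous: "a book" vs "to book"
    "the": ["DM_DMD"],
    "I": ["PR_PRP"],
}

def tag(tokens):
    """Assign exactly one tag per token, using the previous tag as context."""
    tags = []
    for tok in tokens:
        candidates = AMBIGUOUS_LEXICON.get(tok, ["RD_UNK"])
        if len(candidates) == 1:
            tags.append(candidates[0])
        else:
            # toy contextual rule: after a demonstrative, prefer the noun reading
            prev = tags[-1] if tags else None
            tags.append("N_NN" if prev == "DM_DMD" else candidates[-1])
    return list(zip(tokens, tags))

print(tag(["the", "book"]))   # the noun reading is chosen after "the"
```

A real tagger replaces the toy rule with a statistical or rule-based model, but the input/output contract, tokens in, one tag per token out, is the same.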
3. Annotation Standards
POS standards provide a framework in which a tagset can be designed to annotate
corpora. Therefore, choosing a standard from the existing ones, or laying down a
new one by taking inputs from them, is the first and foremost task in corpus
annotation.
Standardization in POS tagset designing is not only important to achieve
consistency in the annotation across related languages and research projects but
also to ensure maximum resource sharing and least wastage of annotated language
resources, particularly, in resource poor scenarios like Indian Languages. For
European languages such steps had been taken more than a decade ago in the
form of EAGLES52, ELRA53, and ISLE54, but for Indian Languages it is a quite
recent development (only 3-4 years old), which came into being in the form of the BIS
scheme, though there were earlier efforts in this direction in the form of ILPOSTS
& ILMT. EAGLES and BIS POS annotation schemes can be seen as instrumental
in bringing consensus among NLP groups with divergent interests and approaches
to take up annotation projects and solve various CL, NLP or LT problems. These
two standard frameworks are briefly described below from the POS annotation point
of view.
3.1. EAGLES Framework
It is a widely used framework for POS tagset design, with the main aim of
standardisation of the POS tagsets used for the annotation of corpora of various
European Languages. Standardisation of tagsets is a very important process, as
pointed out by Leech and Wilson (1999: 55-56):
“In the interests of interchangeability and re-usability of annotated corpora, it
is important to avoid a ‘free-for-all’ or ‘re-invention of the wheel’ every time
52 The EAGLES (Expert Advisory Group for Language Engineering Standards) guidelines provide recommendations for standardization of a range of language engineering resources. The recommendations actually refer solely to the guidelines on morpho-syntactic annotation of texts.
53 The European Language Resources Association (ELRA)
54 International Standards for Language Engineering (ISLE)
a new project begins … At the cross-linguistic level, annotations used
for one language should as far as possible be compatible with annotations
used for another. Compatibility here means that where there are descriptive
categories common in between different languages, these should be
recognised in the annotation scheme and recoverable from the annotations
applied to texts in different languages.”
The EAGLES guidelines provide a set of features and an encoding scheme which
different tagsets were supposed to include. The EAGLES guidelines for morpho-
syntactic annotation include: 1) what is obligatory 2) what is recommended 3)
what are optional extensions for morphosyntactic annotation. At each level, tags
are defined as morphosyntactic Attribute-Value (A-V) Pairs e.g. gender is an
attribute that can have the values, masculine, feminine or neuter. These A-V pairs
are structured as a hierarchy but need not be so, strictly. The property suggested
by the EAGLES guidelines as obligatory to any POS tagset is that of thirteen
major word classes which include: noun, verb, adjective, pronoun/determiner,
article, adverb, adposition, conjunction, numeral, interjection, unassigned/unique,
residual, and punctuation. The recommended properties are then organised
according to these major word classes, e.g. the attribute Type with values;
Common, Proper, etc, is for nouns but Person with values First, Second and
Third, is for verbs and Degree with values Positive, Comparative, Superlative is
for adverbs. The recommended attributes also include number, gender, case,
finiteness, tense, voice, and other sub-categorisation features. The optional
recommendations consist of similar attributes of lesser applicability, and some
additional language specific values for the recommended attributes.
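The attribute-value encoding described above can be made concrete with a small data sketch. The dictionary layout below is our own illustrative choice, not an official EAGLES serialisation; the class names and attribute values are taken from the examples in the text (noun, Type=Common, gender values, etc.).

```python
# Sketch of an EAGLES-style attribute-value (A-V) analysis for one word.
# The dict layout is illustrative; attribute names follow the text above.

eagles_noun = {
    "pos": "noun",            # obligatory major word class
    "Type": "Common",         # recommended attribute for nouns
    "Number": "Singular",
    "Gender": "Feminine",
    "Case": "Nominative",
}

def describe(entry):
    """Render an A-V analysis as 'pos[attr=value, ...]'."""
    avs = [f"{a}={v}" for a, v in entry.items() if a != "pos"]
    return entry["pos"] + "[" + ", ".join(avs) + "]"

print(describe(eagles_noun))
# noun[Type=Common, Number=Singular, Gender=Feminine, Case=Nominative]
```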
The value of this framework is that it promotes consistency and reusability
of linguistic resources for different languages and discourages “wheel
reinvention”. The main drawback to the EAGLES guidelines, however, is that
they cover only a tiny fraction of the world’s languages. As a project of the
European Union, it covers only English, Dutch, German, Danish, French,
Spanish, Portuguese, Italian and Greek: nine languages of Western Europe which
are moreover typologically similar. It is worth mentioning that the ILPOSTS, on
the basis of which the LDCIL POS tagsets were made for the annotation of Indian
Language Corpora, was based on EAGLES. We can say it was an Indian extension
of EAGLES. As pointed out by Leech and Wilson (1999: 58):
“It remains to be seen how far these guidelines can be extended, without
substantial revision, to other languages”.
3.2. BIS Framework
It is the latest framework for the annotation of Indian Languages and is
recognised by the Bureau of Indian Standards (BIS). Its foundation was laid
down by the first meeting of POS tagset standardization committee, held at
Department of IT, New Delhi, on 19th Nov. 2009. It evolved by taking
insights from earlier efforts (ILPOSTS, ILMT, etc.) to bring consensus among
different NLP groups in India. It incorporated the set of POS labels from ILMT
POS tagset (Bharati et al., 2006) and the notion of hierarchical structure from
ILPOSTS (Baskaran et al., 2008) but avoided fine granularity proposed by
ILPOSTS.
In line with the ILMT tagset, it assumes separate layers for morphological
analysis and POS annotation for efficient capturing of grammatical information
and better results in manual as well as automatic annotation. It, further, holds that
the input to the POS tagger (text corpus) should already have undergone
pre-processing. Thus, every token (word) to be assigned a POS tag is a single
lexical item and is not a token which internally contains more or less than one
lexical item as can be seen in agglutinative languages and in the languages with
split-orthography (Bhat, 2010), respectively. It also sticks to the assumption that
there must be a MWE identifier layer after POS tagging. Since POS tagging is a
lexical level annotation process, any unit that involves more than one lexical item,
such as a conjunct verb or a compound, will not be captured at the POS level.
Therefore, BIS proposes hierarchical and coarse grained tagsets for all Indian
Languages. These tagsets have three-levels of hierarchy, including Type,
Subtype-I and Subtype-II. The first level (type) includes 11 main categories:
Noun, Pronoun, Demonstrative, Adjective, Quantifier, Verb, Adverb,
Postposition, Conjunction, Particle, and Residual. The second level (subtype-I)
includes 32 subcategories and the third level (subtype-II) includes 3 sub-
subcategories only for verb but the third level is optional. The main principles55
55 The principles are given in ‘Linguistic Resource Standards for POS Tag Set for Indian Languages’. Documentation by D. M. Sharma in May 2010. MS
that were taken into consideration while developing the POS tagsets for the
annotation of Indian Language Corpora are as follows:
i. The scheme should be generic, i.e. it should work for all the Indian Languages
and shouldn’t be oriented towards any one language or a group of languages.
ii. A layered approach should be followed for annotating various types of
linguistic information available in a text. Each type of information like
morphological, POS and chunk information should be annotated in separate layer.
iii. The scheme should be flexible to incorporate or drop a category either at the
top level of hierarchy or as a sub-category of an existing type so that the scheme
can be extended from one language to another.
iv. The annotation scheme should be annotator friendly by avoiding ambiguous
tags which put cognitive load on the annotators and lead to inconsistency in the
annotation.
v. The scheme should be mappable with pre-existing annotation schemes of
Indian Languages to avoid the wastage of the resources.
vi. The scheme should support all types of NLP research efforts independent of a
particular technology and development approach.
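The layered approach of principle (ii) can be sketched as separate annotation layers keyed by token index. The sample tokens and the analyses in each layer are invented for illustration; the point is only that morphological, POS and chunk information live in independent layers over the same token sequence.

```python
# Illustrative sketch of layered annotation (principle ii): each kind of
# information is a separate layer indexed by token position.
# Tokens and analyses below are invented examples.

tokens = ["yi", "kitaab"]

layers = {
    "morph": {0: {"root": "yi"}, 1: {"root": "kitaab"}},  # layer 1: morphology
    "pos":   {0: "DM_DMD", 1: "N_NN"},                     # layer 2: POS
    "chunk": {0: "B-NP", 1: "I-NP"},                       # layer 3: chunks
}

def column(token_index):
    """Collect every layer's annotation for one token."""
    return {name: layer[token_index] for name, layer in layers.items()}

print(column(1))
```

Because the layers are independent, a layer can be added (e.g. an MWE layer) or dropped without disturbing the others, which is exactly the flexibility the principles call for.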
4. POS Tagsets
The POS tagsets that have been designed for English and Indian Languages are
described below.
4.1. POS Tagsets for English
“POS tagging has been a hot research topic since the early 1980s” (Voutilainen,
1999), but the research actually originated in the 1960s for European Languages.
However, research in POS tagging is quite recent in India and,
therefore, the concept of tagset designing and its standardization is also very
recent as compared to its European and American counterparts. The main efforts in
POS tagging resulted in various POS tagsets such as Brown, CLAWS1, and U-
Penn (mainly designed for English) but these tagsets are mostly simple
inventories of tags corresponding to the morpho-syntactic features, and varied
greatly in terms of their granularity (Hardie, 2004). The CLAWS 2 & 7 tagsets
are considered landmarks in the history of tagset designing (Leech 1997).
CLAWS7 marked an important change in the structure of tagsets, from a flat-
structure56 to a hierarchical-structure57.
According to Daniel Jurafsky and James H. Martin (1999) “There are a small
number of popular tagsets for English, many of which evolved from the 87-tag
tagset used for the Brown corpus (Francis, 1979; Francis and Kučera, 1982).
Three of the most commonly used are the small 45-tag Penn Treebank tagset
(Marcus et al., 1993), the medium-sized 61 tag C5 tagset used by the Lancaster
UCREL project's CLAWS (the Constituent Likelihood Automatic Word-tagging
System) tagger to tag the British National Corpus (BNC) (Garside et al., 1997),
and the larger 146-tag C7 tagset (Leech et al., 1994).” However, irrespective of
the popularity, a brief description of several POS tagsets for English and Indian
Languages is given below.
4.1.1. CGC58 Tagset
The earliest work on POS tagging started with the CGC of Klein and Simmons
(1963) for English in the USA. The tagset consists of thirty tags, of which only
pronoun tags are decomposable59 but the rest are not. Their CGC-program also
outputs information, external to the main tag, on the number of nouns and verbs;
it is also noted if a noun is possessive, so that the actual number of categories
distinguished is considerably greater (Hardie, 2004). It also incorporates tags for
punctuation marks, which are treated as words; it has been pointed out that the
treatment of punctuation marks in this manner can be a significant aid in the
tagging of other nearby words (Leech, 1997).
4.1.2. TAGGIT Tagset
Klein and Simmons’s work inspired the work of Greene and Rubin (1971)60. The
tagset contains 77 POS tags, but their TAGGIT program displays information
56 Simple inventory of unrelated POS tags
57 The term “hierarchical”, when used for a tagset, means that the categories in that tagset are structured
relative to one another. Rather than a large number of independent categories, a hierarchical tagset will contain a small number of categories, each of which contains a number of sub-categories, each of which may contain sub-sub-categories, and so on, in a tree-like structure (ibid).
58 Computational Grammar Coder-CGC (Klein and Simmons 1963) was designed as a component of a parser (in turn a component of a system to synthesize human language behaviour).
59 A tag is considered to be “decomposable” if the string that represents that tag consists of one or more characters that represent the same features elsewhere in the same tagset as they represent in the original tag. For example, any noun tag which combines an N for “noun” with other characters to indicate other features of the word is decomposable (N.SG.MAS.dir).
60 It was Greene & Rubin’s POS tagset which was used in annotating the Brown Corpus, and it was refined slightly in a later stage of this project (see Francis and Kučera 1982: 3-15); this refined version came to be known as the Brown tagset. It consists of 87 tags; allowing for compound tags, the number of potential analyses for any given orthographic form is 179 (Sampson 1987).
regarding the number as an integral part of the main tag itself (ibid). The CGC
and TAGGIT tagsets display some consistent design features. Both incorporate tags for
punctuation marks, which are treated as words. They based the definition of their
tags on the syntactic functions that a given word form performs in a particular
context. The tags display more of a tendency to be decomposable. For example, in
the tag WPO, W is Wh-word, P is pronoun and O is objective form. However,
unlike some later tagsets, this tagset was not hierarchical. The earlier Klein and
Simmons’ (1963) tagset was not hierarchical either. Both these early projects also
had some means of dealing with ambiguity. Some of the TAGGIT tags were
exclusively for dealing with ambiguous words. For example, the CI tag marks a
word which is either a subordinating conjunction or a preposition, such as
‘before.’ There are also tags for subordinating conjunction (CS) and preposition
(IN). Only CS and IN tags are needed for an exhaustive classification, but CI is
necessary on pragmatic grounds.
4.1.3. CLAWS61 Tagset
As mentioned above, POS tagging has been a well-known research topic since the
early 1980s; a number of tagsets were devised for English at Lancaster
University during the 1980s and 1990s for use in the CLAWS Tagger
(Garside 1987). The C1 tagset was used in the annotation of the LOB Corpus
(also known as the LOB tagset). Since this corpus was designed to parallel the
structure of the Brown Corpus, the tags were also parallel, and C1 is very similar
to the later version of the Brown tagset (Francis and Kučera 1982). The
development of the C262 tagset was motivated by:
“providing distinct codings for all classes of words, having distinct
grammatical behavior, and making the tagset more systematic, in the way,
that tags are built up from individual characters” (Sampson 1987).
This means that greater decomposability and a hierarchical nature were brought into the C2 tagset
(166 tags). For example, all verbal tags have V as their first character and as their
second character either V again (for a main verb) or another character (for
auxiliary verb). The major subsequent developments in the CLAWS tagset were
the C5 and C7 tagsets, developed for the annotation of the BNC Corpus (see
Leech, Garside & Bryant 1994, Leech 1997b, Garside and Smith 1997). The C7
61 Constituent Likelihood Automatic Word Tagging System
62 The CLAWS2 tagset was the basis for the much larger, much finer-grained SUSANNE Word-tag Set
(Sampson 1995: 79-149; circa 360 tags).
tagset (146 tags) is the more fine-grained of the two and can be regarded as a
further refinement of the CLAWS2 tagset but the C5 tagset is something of a
departure from the others, since it has fewer tags (61 tags) – this was in order to
make it useful to the largest number of end users (Hardie, 2004). On the other
hand, the C5 tagset has been characterized as a flat tagset (Cloeren, 1999). In fact,
although none of the CLAWS tagsets are laid out in the hierarchical fashion
described by Cloeren, the C7 tagset is hierarchical in conceptual terms (Leech,
1997). Furthermore, both C5 and C7 are largely decomposable – the C7, again, to
a greater extent. For example, in the tag PPHO2, the first ‘P’ is pronoun, the second
‘P’ is personal, ‘H’ is third person, ‘O’ is objective case and ‘2’ is plural.
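Decomposability can be demonstrated mechanically. The sketch below takes the PPHO2 example apart character by character; the per-position feature tables are reconstructed from that one example and are only a tiny illustrative fragment, not the actual C7 specification.

```python
# Decomposing the C7 tag PPHO2 character by character, following the
# example above. The feature tables are a partial, illustrative fragment.

FEATURES = [          # one mapping per character position in the tag
    {"P": "pronoun"},
    {"P": "personal"},
    {"H": "third person"},
    {"O": "objective case"},
    {"2": "plural"},
]

def decompose(tag):
    """Map each character of a decomposable tag to its feature name."""
    return [table.get(ch, "?") for ch, table in zip(tag, FEATURES)]

print(decompose("PPHO2"))
```

This is exactly what makes a hierarchical, decomposable tagset searchable: a query for all pronoun tags reduces to matching the first character.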
4.1.4. UPenn63 Tagset
The POS tagset used in Penn Treebank (Marcus et al., 1993) is also based on the
Brown Corpus tagset. However, it has been modified in terms of simplification,
rather than complexity, as is the case with the CLAWS tagsets (Hardie, 2004). Thus,
there are considerably fewer tags (36). It makes fewer of what have been described as
“lexically recoverable distinctions” (Marcus et al, 1993), i.e. the distinction
between lexical verbs and the auxiliary verbs (be, do and have) is not retained in
this tagset, as the distinction can be recovered from the forms of the words. Also,
information that could be recovered from the parsing information has been
excluded from the tagset to avoid the risk of inconsistency in tagging. “It is clear
that reducing the size of the tagset reduces the chances of such tagging
inconsistencies” (ibid).
4.1.5. Lund Tagset
The tagset, designed for the annotation of the London-Lund Corpus of Spoken
English, represents a tagset significantly different from the Brown
Corpus/CLAWS tagset tradition (Svartvik 1990). It is more fine-grained,
consisting of just over 200 tags. It has been designed for spoken texts and
includes tags for a variety of discourse-element-type adverbs not usually
distinguished in the tagging of written texts, as well as tags for other features of
speech such as swearing. Being designed for speech, it lacks punctuation tags. Moreover, this tagset
is also hierarchical and decomposable into single characters (or 2-3 character
strings) that indicate given features.
63 University of Pennsylvania, USA
78
4.2. POS Tagsets for Indian Languages
Despite being a relatively new field, research on POS annotation in Indian
Languages has also produced a number of tagsets and common frameworks.
These include AU-KBC tagset for Tamil (2001), Hardie's tagset for Urdu (Hardie,
2005), IIIT-ILMT tagset for Hindi (Bharati et al., 2006), MSRI-JNU tagset for
Sanskrit (Chandra Shekhar, 2007), MSRI-ILPOSTS for Hindi & Bangla
(Baskaran et al., 2008), CSI-HCU tagset for Telugu (Sree R.J et al., 2008),
Nelralec tagset for Nepali (Hardie et al., 2005), LDCIL tagsets for ILs (Malikarjun
et al., 2010; Bhat et al., 2010), BIS tagsets for all ILs (Ms. 2009), etc. Some of the
important POS tagsets relevant to the current work are briefly given below.
4.2.1. EMILLE64 Tagset for Urdu
Urdu, written in the Perso-Arabic script, offers a different set of challenges in POS
tagset design. Hardie (2005) designed the Urdu tagset based on the Urdu grammar
of Schmidt (1999), in accordance with the EAGLES guidelines, for the EMILLE
project. However, designing a tagset for Urdu was not a straightforward task,
particularly with respect to the orthographic convention, and the presence of
Arabic and Persian borrowed forms, which are structurally quite distinct from the
Indo-Aryan forms. Some of the issues that were highlighted in Hardie (2005) are
tokenisation and idiosyncratic features of Urdu. It has been found that in Urdu
orthography, many elements described as suffixes in traditional grammars are
actually written as independent tokens. Hence, the arbitrary decision was taken to
treat every orthographic space as a word break even if it occurs within a lexical
item. However, this necessitated some means of tagging those elements which
do not constitute free forms (words). For example, “zimmah daar" (responsible)
consists of two tokens - a root and a derivational suffix. The same suffix appears
fused to the root in other contexts like "samajhdaar" (sensible), and further
suffixation can take place like "zimmah daarii" (responsibility). In the background
of such orthographic conventions, a syntactically null tag has been introduced
which is dependent for its grammar on the subsequent token, e.g. samajhdaar\JJU
and zimmah\LL daar\JJU. The major categories in Urdu tagset are virtually
identical to the equivalent categories defined in EAGLES: Nouns,
Pronouns, Verbs, Adjectives, Adverbs, Postpositions and Conjunctions. The
tagset handles the tokenization problem (for details see chapter 3) at the POS level and
64 Enabling Minority Language Engineering
thus tries to deal with two separate problems, tokenization and POS tagging,
simultaneously.
4.2.2. ILMT65 POS Tagset for Hindi
The ILMT POS tagset has been developed by the Akshar Bharati group for annotating
Hindi corpus. It is based on the principle of simplicity with a motivation to extend
it as a framework for all ILs. Another important dimension that has been taken
into account in its design is the division of labour between POS tagger and Morph
Analyser. POS tagger is supposed to merely disambiguate the multiple tags
generated by the Morph Analyser. Finer distinctions have been avoided in order
to have a smaller number of tags to facilitate efficient machine learning vis-à-vis
accuracy in automatic annotation. This has resulted in a flat tagset comprising
21 POS tags, but other inflectional information associated with the tokens can
be obtained from the Morph Analyser. Form-Function duality is one of the crucial
issues in tagset designing. However, this is mainly a form-based tagset, as pointed out
in (Bharati et al., 2006) “the syntactic function of a word is not considered for
POS tagging.....the word is tagged always according to its lexical category...”
Hence, pragmatic function of a token in the context is not considered as the
primary basis for POS tagging. As far as tags are concerned, the UPenn tags along
with the newly devised tags have been used. The most important point is that the
tagset has innovatively left finiteness to be dealt with at the next level of annotation,
i.e. at the word group or chunk level, not at the level of the token. Participles and
gerunds are tagged as VM (though they function differently), and all auxiliaries are
tagged accordingly as VAUX. A variable tag (XC) has also been introduced,
where (X) stands for the category of a part of a compound and (C) stands for
compound. Finally, it is worth mentioning that although form has been chosen as
the primary basis for POS annotation, adherence to semantic as well as syntactic
functions is often evident from the tagset.
4.2.3. ILPOSTS66
It is a POS tagset framework designed to cover the fine-grained morphosyntactic
details of Indian Languages. It proposes a three-level hierarchy of categories,
types and attributes. It has been developed by Microsoft Research India, on the
basis of EAGLES guidelines (Leech & Wilson 1999). Language specific POS
65 Indian Language Machine Translation is a consortium project for developing MT systems for major ILS pairs. It has been set at IIIT Hyderabad and is funded by DIT.
66 Indian Language POS Tagset
tagsets have been customised on the basis of it. First Sanskrit (C. Shekhar, 2007),
Hindi and Bangla (Baskaran et al., 2008) tagsets were customised, but later the
scheme was further refined and tagsets for all ILs were developed at LDCIL
(Malikarjun et al., 2010; Bhat et al., 2010). These tagsets are hierarchical in nature
and consist of decomposable tags.
A general guiding principle has been formulated to handle form-function duality.
A set of ‘Attributes’ have been devised on the basis of morpho-syntactic or
simply orthographic practices and the attributes are marked according to their
form while the ‘Types’ are marked on the basis of their function. It is worth
mentioning that on the one hand ‘Attributes’ are tagged according to their
morphological visibility (like tense, aspect, etc) as well as the semantics (like
number, gender, etc). On the other hand, ‘Types’ are exclusively based on
semantics (like common noun, proper noun, etc). A combination of the form and
the function based on distribution is applied for tagging categories like
demonstrative (DEM), Pronoun (P), Quantifier (JQ), Noun (NC), Noun denoting
Space & Time (NST) and Adverb of Location (ALC), and the orthographic
convention is taken as the basis to annotate Postposition and Case marker. Although
finiteness is defined on the basis of inflection for person, number, gender,
tense, aspect and mood, the verb is not dealt with as neatly as it was later.
Further, with respect to similar forms, a distributional basis is considered
for distinguishing and annotating categories like pronoun and demonstrative, or
between pronoun and quantifier. A token is to be tagged as a demonstrative if it
is followed by an adjective or a noun, and as a pronoun if it is not followed by a
noun or another part of speech. Similarly, a token is tagged as a nominal modifier
if it is followed by a noun, and as a noun if it is not. Case marker and
Postposition are assumed to be an instance of the same phenomenon of marking
dependents. However, due to orthographic conventions, the dependent marker is
written in two ways: together and separately. These two are tagged as case
marker and postposition, respectively.
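The distributional criterion for demonstrative versus pronoun can be written as an executable rule. The sample word (Hindi yah as a stand-in ambiguous form) and the nominal test set are invented for illustration; the decision logic follows the text: demonstrative when a nominal follows, pronoun otherwise.

```python
# The ILPOSTS distributional rule as code: a form is a demonstrative (DEM)
# when the next token is nominal, and a pronoun (P) otherwise.
# The example word and the nominal set are illustrative only.

def dem_or_pronoun(tokens, i, is_nominal):
    """Tag token i as DEM if followed by a nominal, else as pronoun."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    return "DEM" if nxt is not None and is_nominal(nxt) else "P"

nominals = {"kitaab", "ghar"}

print(dem_or_pronoun(["yah", "kitaab"], 0, nominals.__contains__))  # nominal follows
print(dem_or_pronoun(["yah"], 0, nominals.__contains__))            # nothing follows
```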
4.2.4. BIS67 Tagset
POS tagset designing and development is a prerequisite for any POS annotation work,
whether carried out in isolation or as an integral part of a larger annotation
pipeline like as involved in building a treebank. As mentioned above (see the
67 Bureau of Indian Standards
second section), BIS is an annotation framework, recognized by Bureau of Indian
Standards. The framework has not adopted the Indic system of descriptive
categories; rather, like most of the annotation schemes of the world, it has
relied on the descriptive categories of the Techne. Dionysius Thrax's
Techne (c. 100 B.C.), a grammatical sketch of Greek, has not only served as a
role model for contemporary POS descriptions in European Languages but also
for the POS descriptions of South Asian Languages. The Techne includes an inventory
of eight POS categories (noun, verb, pronoun, preposition, adverb, conjunction,
particle, and article).
The BIS-recommended POS tagsets for Indian Languages also use the same basic set
of POS categories, which were also used by earlier tagsets like ILPOSTS and
ILMT. The 32 parts-of-speech categories recommended by BIS for Kashmiri are
given in Table.1 (for the detailed tagset see Appendix-I). It is worth mentioning
that at the POS level, the verb subcategories of Kashmiri have been kept in line with
Hindi-Urdu, i.e. the fine-grained distinction (finite, non-finite, infinitive) has
been avoided and the category verb has been further sub-divided into main verb and
auxiliary.
 1. Noun Common               N_NN
 2. Noun Proper               N_NNP
 3. Noun Locative             N_NST
 4. Pronoun Pronominal        PR_PRP
 5. Pronoun Reflexive         PR_PRF
 6. Pronoun Relative          PR_PRL
 7. Pronoun Reciprocal        PR_PRC
 8. Pronoun WH                PR_PRQ
 9. Pronoun Indefinite        PR_PID
10. Demonstrative Deictic     DM_DMD
11. Demonstrative Relative    DM_DMR
12. Demonstrative WH          DM_DMQ
13. Demonstrative Indefinite  DM_DMI
14. Verb Main                 V_VM
15. Verb Auxiliary            V_VAUX
16. Conjunction Coordinating  CC_CCD
17. Conjunction Subordinating CC_CCS
18. Particle Default          RP_RPD
19. Particle Interjection     RP_INJ
20. Particle Intensifier      RP_INTF
21. Particle Negation         RP_NEG
22. Quantifier General        QT_QTF
23. Quantifier Cardinals      QT_QTC
24. Quantifier Ordinals       QT_QTO
25. Residual Foreign-word     RD_RDF
26. Residual Symbol           RD_SYM
27. Residual Punctuation      RD_PUNC
28. Residual Unknown          RD_UNK
29. Residual Echo-words       RD_ECH
30. Adverb Manner             RB_RB
31. Adjective                 JJ_JJ
32. Postposition              PP_PSP
Table.1. BIS POS Tagset of Kashmiri68
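Since every BIS tag in Table.1 encodes its hierarchy in the tag string itself, the levels can be recovered by splitting on the underscore. The sketch below does this for two-level tags; the readable type names are copied from the table, and the helper name is our own.

```python
# Splitting a BIS tag from Table.1 into its hierarchy levels: the part
# before the underscore is the top-level type, the part after it the
# subtype. Type names are copied from the table (partial list).

TYPE_NAMES = {"N": "Noun", "PR": "Pronoun", "DM": "Demonstrative",
              "V": "Verb", "RD": "Residual", "QT": "Quantifier"}

def parse_bis(tag):
    """Return (type, subtype) for a two-level BIS tag like 'N_NST'."""
    top, _, sub = tag.partition("_")
    return TYPE_NAMES.get(top, top), sub

print(parse_bis("N_NST"))   # locative noun
print(parse_bis("RD_UNK"))  # residual unknown
```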
5. Description of Kashmiri BIS POS Tagset
POS categories and subcategories as given in the tagset (Appendix-I) are briefly
discussed below with reference to KashTreeBank Dataset-4.
i. Noun (N)
Noun is an open-class item or content word that refers to people, places, animals,
objects, substances, ideas, concepts, feelings etc. Nouns have the inherent
characteristics of number, gender and case, and they are usually inflected for such
information. Noun is the first top-level category in the BIS tagset, with three sub-types
belonging to level-1 of the hierarchy. Its sub-types include Common Noun (NN),
Proper Noun (NNP) and Spatio-temporal Noun (NST).
NN is the first subtype of the noun which includes those nouns that are classes but
not particular instances. Most of the nouns are common nouns which can be easily
quantified or pluralized, e.g. kitaab (book), gagur (rat), insaan (human), etc. The
common nouns extracted from the dataset-4 are given in Appendix-II. It not only
includes simple nouns but also other multiword expressions like compound nouns
and Izaafe. NNP is the second subtype which includes nouns that are particular
68 Note: Two tagsets with considerable differences were developed for Kashmiri on the basis of the BIS format; one was developed at KU (which proposed a fine-grained distinction in verb classification) and the other at LDCIL (which avoided the fine-grained distinction, like Hindi-Urdu). They were later combined in the National Workshop on BIS & ILCI (2011), held at the LTRC Lab, IIIT Hyderabad. The present tagset of Kashmiri is the same unpublished collaborative work proposed from the side of LDCIL, and it has been used here for the first time in any research.
instances (like person, place and institution names) but not general classes. These
can’t be quantified or pluralized, e.g. zA:kir hussain (Zakir Hussain), Jiil-i-Dal
(Dal Lake), ladaakh (Ladak), Kashmir University, Cufewed Night, etc. In some
languages like English, common and proper nouns can be identified with the help
of orthographic cues like initial letter capitalization but many languages like
Indian Languages lack such luxury. Moreover, proper nouns are also used as
common nouns; hence, extraction of either of them is a very tough task. The proper
nouns that have been extracted from the dataset-4 are given in Appendix-II. It not
only includes simple single token proper nouns but also the multi-word/token
expressions like Compound Nouns, Izafe & other Named Entities, e.g. company
names, institution names, book names, person names, etc.
NSTs are the third subtype of nouns, which are also called Nouns of Location
(Nloc). This subcategory was actually introduced in ILMT tagset to register the
distinctive nature of some of the locational nouns which also function as part of
complex postpositions (e.g. ke uupar, ke niiche, etc) in Indian Languages but in
the current tagset the notion has been used little differently. Here, NSTs have
been treated as equivalent to the traditional adverbs of time and place. Since, there
is no place for traditional adverb of time and place in this tagset; these have been
classified under NST which basically refers to particular points in space or time.
For example, hoteyth/tateyth (there), yeteyth (here), bronThI (front), peyThI (top),
etc (also see Appendix-II). The Fig.1 shows frequency distribution of
subcategories of noun and reveals which subcategory is the most frequent in
Kashmiri.
[Bar chart over N_NN, N_NNC, N_NNP, N_NNPC and N_NST; frequency axis 0 to 1400]
Figure.1 Subtype Frequency of Noun
ii. Pronoun (PR)
Pronoun is a closed class item which, like the noun, has the inherent property of being
inflected for PNGC and can substitute for a noun or a noun phrase. The question
whether pronominals should be introduced as a separate category or as a subtype of
noun has been well explored, and it has been decided that a separate tag for pronouns
will be helpful for anaphora resolution. Moreover, a pronoun is not a subtype of noun
but rather a variable which need not necessarily refer to a noun. The top-level category of
pronouns (PR) includes Pronominal69-PRP, e.g. bI (I), tsI (you), su (he), sw (she),
yi (this), ti/hu (that/it), etc; Reflexive-PRF, e.g. paanI (herself/ himself); Relative-
PRL, e.g. yus (who), yi (which), yeli (when), etc; Reciprocal-PRC, e.g. paanIvan’
(each other); WH or Interrogative-PRQ, e.g. kus (who/which), kyaa (what), kar
(when), and Indefinite-PRI, e.g. kahn (someone), kuni (somewhere), as six sub-types. It
is important to mention that, unlike other traditional pronominal sub-classes, the
possessive pronoun has not been introduced as a sub-type in this tagset. The reason
is that possession, as an attribute (genitive), can be inflected on other sub-types
as well, as in his, whose, etc. The pronouns that have been extracted from the
dataset-4 are given in Appendix-II. Fig.2 shows the frequency distribution of the
subcategories of pronoun and reveals which subcategory is the most frequent in
Kashmiri.
[Bar chart over PR_PRC, PR_PRF, PR_PRI, PR_PRL, PR_PRP and PR_PRQ; frequency axis 0 to 200]
69 It is a cover term that was originally used in the LDCIL tagsets. It includes personal pronouns (I, you, he, etc.) that have persons (+human) as antecedents, pronouns (this, that, it) that have animates (-human) or inanimates as antecedents, and discourse deictic pronouns (this, that, it) that have a whole proposition as antecedent, e.g. John abused Mary. It was clearly a violation.
Figure.2 Subtype Frequency of Pronoun
iii. Demonstrative (DM)
Demonstratives are closed class items that perform a deictic70 function for a noun.
A demonstrative is always followed by a noun, a pronoun, an intensifier or an
adjective. Demonstratives are a distinct category of determiners and can neither
substitute for a noun nor specify a noun, but can point out a noun. Therefore, one
must not confuse them with nouns or adjectives, though they resemble pronouns in
form and are traditionally treated as adjectives. This is why demonstratives are
being posited as a separate top-level category in this tagset. It consists of
Deictic or Default Demonstratives (DMD), Relative Demonstratives (DMR),
WH-Demonstratives (DMQ) and Indefinite Demonstratives (DMI), e.g. yi in yi laDkI
(this boy), kahn in kahn chiiz (something), kus in kus insaan (which man), etc.
The demonstratives that have been extracted from the dataset-4 are given in
Appendix-II. Fig.3 shows the frequency distribution of the subcategories of
demonstratives and reveals which subcategory is the most frequent in Kashmiri.
[Bar chart over DM_DMD, DM_DMI and DM_DMR; frequency axis 0 to 140]
Figure.3 Subtype Frequency of Demonstrative
iv. Verb (V)
Verb is an open class item that refers to actions, events, occurrences or states.
Verbs have the inherent properties of Tense, Aspect, Mood and Voice and are
inflected with such information. They also show inflections for Person, Number,
Gender and Case due to their agreement properties. In the present tagset Verbs are
70 It literally means pointing out.
a top-level category with two subtypes: Verb Main (VM) and Verb Auxiliary
(VAUX). As mentioned above, the finer distinctions of finite, nonfinite and
infinitival forms have been postponed to be tackled at the chunk level. The
rationale for using these underspecified tags is that the morphosyntactic
information that determines the status of a verb as finite or nonfinite is
distributed over two or three tokens. Therefore, it is impossible to decide upon
the status of the verb unless all the constituent tokens are taken into
consideration. For example: kheyvaan (eating), shong (slept), chu (is), os (was),
etc. The verbs extracted from the dataset-4 are given in Appendix-II. Fig.4 shows
the frequency distribution of the subcategories of verbs and reveals which
subcategory is the most frequent in Kashmiri.
[Bar chart over V_VAUX and V_VM; frequency axis 0 to 800]
Figure.4 Subtype Frequency of Verb
v. Adjective (JJ)
Adjective is an open class item that modifies a noun or pronoun by representing
one of its properties. Adjectives agree in number, gender and case with the nouns
they modify. Therefore, Adjectives (both attributive and predicative) are inflected
for PNGC. In the present tagset, there is no further distinction of subtypes, but a
distinction has been made between simple adjectives and those adjectives which are
constituents of compound words or izafe. The tag for a simple adjective is JJ while
the tag for a constituent adjective is JJC. For example: zyuuTh (tall), asIl
(fine), byuuTh (waste), vozul (red), etc. The adjectives extracted from the
dataset-4 are given in Appendix-II. Fig.5 shows the frequency distribution of the
subcategories of adjective and reveals which subcategory is the most frequent in
Kashmiri.
[Bar chart over JJ_JJ and JJ_JJC; frequency axis 0 to 400]
Figure.5 Subtype Frequency of Adjective
vi. Adverb (RB)
Adverb is an open class item that modifies a verb. Adverbs form an important
top-level category of this tagset. Unlike adjectives, adverbs do not agree with
the verb they modify. They are indeclinable, i.e. they do not have any inflectional
property. They are floating elements in the sentence and do not necessarily occur
adjacent to the verb they modify; their distribution in a sentence varies
considerably. In the present tagset, only the adverb of manner (RB) has been taken
into consideration, as adverbs of time and place have already been classified under
noun as Nloc. The adverbs that have been extracted from the dataset-4 are given in
Appendix-II. Fig.6 shows the frequency distribution of the subcategories of adverbs
and reveals which subcategory is the most frequent in Kashmiri.
[Bar chart over RB_RB and RB_RBC; frequency axis 0 to 50]
Figure.6 Subtype Frequency of Adverb
vii. Postposition (PSP)
Postpositions are closed class items which, like prepositions, represent case
relations between a verb and its dependent nouns in a sentence. The forms that
represent case relations are either free forms or bound forms. The free forms are
called pre/postpositions while the bound forms (inflectional categories) are
called case-markers. Postpositions, as their name suggests, are always preceded
by nominals (noun or pronoun) and always trigger obliqueness either in their head
nominals (common in Indo-Aryan languages) or in the entire noun phrase (as in
Kashmiri). However, in the literature, the notion of pre/postposition is conflated
with the notion of case or case marker. For instance, the case-marker is considered
to be a purely syntactic inflectional category while the pre/postposition is taken
to be an independent word representing semantic relations, but the fact is that
orthographic conventions defy these norms: a purely syntactic form may be bound
(inflectional) in one language and free (independent) in another. To simplify, all
the free forms that represent some sort of relation (not necessarily semantic)
between nominals and a verb, or between two nominals, are considered as
postpositions. It is worth mentioning that Kashmiri has very few
relation-representing forms occurring before nouns, e.g. bamutaabiq Farooq
(according to Farooq). Such forms could be considered prepositions, but in the
current tagset they are classified as postpositions; there is no further
sub-division in this category because the frequency of prepositions is negligible
while postpositions are far more frequent, e.g. sund (of), sI:t’ (with), khA:trI
(for), etc. Fig.7 shows the frequency of pre/postpositions in the dataset-4.
[Bar chart for PP_PSP; frequency axis 0 to 700]
Figure.7 Type Frequency of Postposition
viii. Conjunction (CC)
Conjunctions are closed-class items or function words that conjoin two or more
lexical items, phrases or clauses. In the current tagset, conjunctions have been
introduced as a top-level category with two sub-types: coordinators (CCD) and
subordinators (CCS). If the conjoining operation is symmetrical, the conjunction
is a coordinator; if the conjoining operation is asymmetrical, the conjunction is
a subordinator. Coordinators form compound sentences while subordinators form
complex sentences. In the former, the constituent clauses are symmetrical (both
are independent in nature) while in the latter, the constituent clauses are
asymmetrical (one is the principal, independent or matrix clause and the other
one, introduced by the subordinator, is the subordinate, dependent or embedded
clause).
Since conjunctions are indeclinable in nature, they were classified under particles
in the ILMT and ILPOST as well as the LDCIL tagsets, as far as the definition of
particle is concerned. However, given their key syntactic functions, unlike other
particles, they have been introduced as a top-level category in the BIS tagset and
are tagged as CC, e.g. tI (and), zi (that), etc. It is worth mentioning that this
decision may be helpful in the conversion of a dependency treebank into a
phrase-structure treebank. The CCs that have been extracted from the dataset-4 are
given in Appendix-II. Fig.8 shows the frequency distribution of the subcategories
of conjunction and reveals which subcategory is the most frequent in Kashmiri.
[Bar chart over CC_CCD and CC_CCS; frequency axis 0 to 250]
Figure.8 Subtype Frequency of Conjunction
ix. Particle (RP)
Particles are closed-class items or function words which are generally
indeclinable in nature and have the least significance in a construction. Particle
constitutes a top-level category in the current tagset and has Default (RPD),
Intensifier (INTF), Interjection (INJ) and Negation (NEG) as its sub-types. There
is an elaborate list of particles (Emphatic, Similative, Dedative, Inclusive,
Exclusive, etc.) which have been assigned a single underspecified label ‘default’,
given the fact that their finer distinction is not very significant at this level.
Particles generally have a limited syntactic function but encode key semantic and
pragmatic information, e.g. seThaa (very), na (no), hata (hey), etc. Fig.9 shows
the frequency distribution of the subcategories of particles and reveals which
subcategory is the most frequent in Kashmiri.
[Bar chart over RP_INJ, RP_INTF, RP_NEG and RP_RPD; frequency axis 0 to 100]
Figure.9 Subtype Frequency of Particles
x. Quantifier (QT)
Quantifiers are also closed-class items or function words which quantify
nominals. Quantifier is a top-level category in the current tagset with General
(QTF), Cardinal (QTC) and Ordinal (QTO) as sub-types, e.g. akh (one), pI:ntsin
(fifth), vaariyaa (lot), etc. General quantifiers include non-numeric quantifiers
that show highness or lowness in the quantum of countable nouns or simply show
the quantity of mass nouns, while Cardinals are numeral quantifiers that specify
the quantum of countable nouns numerically. The former are less precise in
quantification as compared to the latter. Ordinals, on the other hand, do not
quantify at all; rather, they specify the position of an item in a series. They
modify nominals and can occur both at attributive as well as predicative position
like adjectives. However, by form they are generally derivatives of the numerals
(cardinals) of a language. In Kashmiri, QTF and QTO show agreement properties
with their phrasal heads in terms of case, like their adjective or demonstrative
counterparts. Fig.10 shows the frequency distribution of the subcategories of
quantifiers and reveals which subcategory is the most frequent in Kashmiri.
[Bar chart over QT_QTC, QT_QTF and QT_QTO; frequency axis 0 to 140]
Figure.10 Subtype Frequency of Quantifier
xi. Residual (RD)
Residual is not a POS category as such; it has been introduced as a separate
top-level category with five sub-types in the present tagset to accommodate the
remaining elements of the corpus (text) which do not fit in the already discussed
scheme. Its sub-types include Foreign Word (RDF), Unknown Word (UNK), Echo Word
(ECH), Symbol (SYM) and Punctuation (PUNC). RDF includes words which are given in
another script, while UNK includes words which we do not know, are confused about,
or which apparently do not fit anywhere. UNK is thus a kind of baggage where we
dump the words that we are unable to classify. ECH includes partially reduplicated
non-words that play a definite grammatical role. Symbols are neither words nor
punctuation but elements of a text which encode certain information about some
entities, which can prove crucial for named entity recognition (NER). Punctuation
marks are closed-class items, though not words, that play a crucial grammatical
function in organizing a discourse. They mark phrase, clause and sentence
boundaries and sometimes play the role of coordinators. Fig.11 shows the frequency
distribution of the subcategories of residuals and reveals which subcategory is
the most frequent in Kashmiri.
[Bar chart over RD_ECH, RD_PUNC and RD_UNK; frequency axis 0 to 350]
Figure.11 Subtype Frequency of Residual
6. Requirements for POS Tagging
There are two main requirements for POS annotation besides the availability of
the corpus and the POS tagset: an annotation interface and a storage format,
which are elaborated below.
6.1. POS Annotation Interface
The best way to perform consistent and error-free POS annotation is to use a
specialized, user-friendly interface designed for this purpose. Many POS
annotation interfaces have been developed in India, e.g. one developed by MSRI
and another developed by LDCIL, but these are POS annotation interfaces only:
other levels of annotation cannot be carried out with them, nor can a link be
maintained between two or more levels. Since the current POS annotation is an
integral part of KashTreeBank, being its first level of annotation, we need a
specialized platform that can consolidate all the levels of annotation in a
single format. One such platform is Sanchay71, which has been developed by writing
approximately 300,000 lines of Java code over many years. Sanchay is a collection
of tools and APIs (Application Programming Interfaces) for various language
processing purposes (Singh 2006). It is an open-source platform for carrying out
various NLP tasks for South Asian Languages (SALs). So far, it has been
extensively used for Indian Languages (ILs) at various NLP research labs for
various research projects. The background information on Sanchay has been given
nicely as:
“It has already been used for the creation of POS tagged corpora for several
Indian languages. In fact, the beginning of treebank creation work in India
coincides with that of the beginning of the development of this interface and
71 http://sanchay.co.in
much of the treebank annotation work for Indian languages has been
accomplished on various versions of this interface.” Singh (2011)
The Sanchay Syntactic Annotation (SA) interface, as shown in Fig.12.a and Fig.12.b,
is a specialized interface for syntactic annotation, but it has been generalized
for various kinds of annotation: morphological annotation, POS tagging, chunking,
PSG annotation, dependency annotation, named entity annotation and PropBank
annotation. It was first developed when the preparations for creating a Hindi
treebank were started at LTRC. Actually, the work on developing the whole platform
started with this interface, as pointed out by A. K. Singh, the developer of
Sanchay: “It was not just the first annotation interface, but also the first
graphical user interface in Sanchay” (ibid 2011).
The same interface with the same mechanisms can be used for these
different kinds of annotations. This is made possible by a data representation that
is in terms of threaded trees with feature structures (multiple and/or nested). The
different threads in the base tree allow different layers of annotation (for details
see Singh, 2011).
Sanchay is a generalized platform which needs customization to work for
a particular language which is yet to be included. Customization related to
encoding and fonts had already been done, but the same needed to be done for the
tagsets. The BIS tagset is a recent development and Sanchay was customized for the
previous tagsets only; for the current work, it needed to be customized for the
BIS scheme. The properties files (pos-tags.txt, pos-tags-ben.txt,
non-terminals.txt) located in the directory <Sanchay/workspace/syn-annotation>,
which contain the lists of tags for POS tagging and chunking, need to be
customized. These plain text files contain simple listings of tags in alphabetical
order, with one tag per line. In these files, the ILMT tags were simply replaced
with the BIS tags. The tags have been sorted alphabetically so that Method-3 can
be employed for POS tagging conveniently.
6.2. Storage Format
As mentioned in Chapter one, the data has to be converted into the storage format
before starting actual POS tagging. The format encodes the threaded-tree
representation, which allows multiple layers of annotation to be stored in a
single structure or a single file that is readable by various algorithms or
convertible into a format which in turn is readable. The default format which
Sanchay uses for storage and linking of the various levels of grammatical
information is called SSF72 (for details, see section-3.2, Chapter one). However,
the interface also supports XML and several other formats that are commonly used
for computational purposes such as preparing input data for Machine Learning
tools.
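For orientation, the following sketch shows what a minimal four-column SSF body could look like. The column layout (address, token, tag, feature structure) follows the description above, while the sample tokens, tags and the empty feature-structure field are illustrative assumptions rather than excerpts from the actual treebank files.

```python
# Sketch: emit a minimal four-column SSF block for one POS-tagged sentence.
# Columns: address, token, POS tag, feature structure (left empty in this
# simplified sketch). The sample tokens and tags are invented for illustration.

def to_ssf(tagged_tokens, sent_id=1):
    lines = ['<Sentence id="%d">' % sent_id]
    for addr, (token, tag) in enumerate(tagged_tokens, start=1):
        # tab-separated columns: address, token, tag, feature structure
        lines.append("%d\t%s\t%s\t" % (addr, token, tag))
    lines.append("</Sentence>")
    return "\n".join(lines)

print(to_ssf([("su", "PR_PRP"), ("kitaab", "N_NN"), ("chu", "V_VAUX")]))
```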
For converting the corpus into SSF, the text first needs to be split into
separate sentences in such a way that each sentence occupies a separate line.
This was done in MS Word by using a special case of the “Find and Replace”
(CTRL+H) option, in which the sentence delimiters (۔ and ؟) were replaced with
paragraph markers (^p), achieving the one-sentence-per-line arrangement.
Secondly, the doc-files need to be converted to plain txt-files by saving the
content in a plain text editor (Notepad) with UTF-8 encoding. Finally, the
resultant text file needs to be loaded/opened in the SA-interface of Sanchay and
then saved there. Clicking on the save button automatically converts the raw text
into four-column SSF, provided the sentences are arranged in the
one-sentence-per-line fashion.
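The same Find-and-Replace step can also be done programmatically; a minimal sketch, assuming the text uses only the two delimiters mentioned above and that the output is to be saved as UTF-8 plain text:

```python
import re

# Sketch mirroring the Find-and-Replace step: split raw Perso-Arabic text
# into one sentence per line, keeping each delimiter with its sentence.
# U+06D4 is the full stop (۔) and U+061F the question mark (؟).

DELIMS = "\u06d4\u061f"

def one_sentence_per_line(text):
    # Split immediately after each delimiter; drop empty fragments.
    parts = re.split("(?<=[%s])" % DELIMS, text)
    return "\n".join(p.strip() for p in parts if p.strip())

# The doc-to-txt step is then just a UTF-8 save of the result:
# open("corpus.txt", "w", encoding="utf-8").write(one_sentence_per_line(raw))
```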
The following screenshots, Figures 12.a to 12.f, depict the step-wise opening of
the corpus file <kashmiri_treebank_IASNLP_286.txt> as well as of Sanchay itself.
Step-I: Open the Sanchay folder and double-click on the Sanchay.bat file, an
executable file which starts running and results in the opening of the Sanchay
Shell, as shown in Figures 12.a and 12.b.
72 http://shakti.iiit.ac.in
Figure.12.a. Opening of Sanchay Shell
Figure.12.b. Sanchay Shell with Multiple API Tabs
Step-II: Clicking on the SA-button in Sanchay Shell results in the opening of SA-
interface as shown in Figure.12.c. It is this API only which is needed in the entire
course of this work.
Figure.12.c. Sanchay Syntactic Annotation (SA) Interface
Step-III: Clicking on the Open button in the SA-interface results in the opening
of a small Browsing window whose Browse button is used to browse for the required
task-file. In this window, one needs to set the language in the language drop-down
list and also set the encoding in the encoding drop-down list, as shown below in
Figure.12.d.
Figure.12.d. SA-Interface with Browsing Window
Step-IV: Clicking on the Browse button results in the opening of a small Open
window which lists the files in a particular directory that are in text format
and can be opened. One needs to select the required file by clicking on it and
then click on the Open button so that the path of the file is selected in the
Browsing window, as shown in Figures 12.e and 12.f.
Figure.12.e. SA-Interface Showing Browsing Window
Step-V: The path (C:\ Users\ Shanu\ Desktop\ Sanchay-16-02-11\
KashDTreeBabk_03-Aug 2013\ 1.kashmiri_treebank_IASNLP_286.tx) of the required
file is selected, as shown in Figure 12.f. The selected file opens as soon as one
clicks on the OK button. The file opens in the interface in such a way that only
one sentence is displayed at a time, as shown in Figure.13.a.
Figure.12.f. SA-Interface Showing Annotation Task-Setup
Figure.13.a. SA-Interface Showing a Sentence (Before POS Tagging)
7. POS Tagging of KashCorpus
POS tagging can be performed with the help of the SA-interface by four methods
which differ in ease of use. As shown in Figure.13.a, one sentence is displayed
at a time in a vertical order so that the first word of a horizontal
configuration (right-to-left or left-to-right) corresponds to the topmost word in
the vertical configuration. In SSF, each word is represented by a node. Once a
node has been selected, either by clicking on it or by moving the cursor with the
keyboard, one of the following methods can be employed to tag it: (i) selecting
the tag from a drop-down list, as shown in Figure.13.b; (ii) right-clicking to
get a context menu, then selecting ‘Node Name’ from the sub-menu and then
selecting the tag from the sub-sub-menu; (iii) typing the first letter of the tag
one or more times on the keyboard; (iv) clicking on a button with the intended
tag as its label.
Figure.13.b. SA-Interface Showing a Method of POS Tagging
For POS tagging, 812 sentences of KashCorpus were taken and converted into SSF,
of which 226 sentences were taken from the newspaper domain, 286 sentences from
short stories and 300 sentences from literary criticism. All 812 sentences were
tagged with POS tags in four phases: twenty-nine POS tags were assigned across
the three domains, divided into four data sets. The results are given in the next
section.
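The per-tag counts behind figures like those above can be recomputed with a few lines once (token, tag) pairs have been extracted from the tagged files; the sample data in this sketch is invented for illustration:

```python
from collections import Counter

# Sketch: compute the per-tag frequency distribution of the kind shown in
# Figures 1-11, from (token, tag) pairs extracted out of the tagged corpus.
# The sample data below is invented, not drawn from the actual datasets.

def tag_frequencies(tagged_tokens):
    return Counter(tag for _, tag in tagged_tokens)

sample = [("kitaab", "N_NN"), ("su", "PR_PRP"), ("chu", "V_VAUX"),
          ("gagur", "N_NN"), ("vozul", "JJ")]
freqs = tag_frequencies(sample)
print(freqs.most_common())  # most frequent tag first
```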
Figure.13.c. SA-Interface Showing a POS Annotated Sentence
8. POS Tagging Issues
The POS annotation of the four samples of the Kashmiri corpus resulted in the
raising, understanding and solving of various linguistic issues. The main issues
are discussed below and their solutions are given in the form of annotation
guidelines. The statistical information about the various POS categories is also
a byproduct of this work.
There are some general decisions that need to be taken at the time of tagset
design. These decisions relate to whether the tagset should be flat or
hierarchical, fine-grained or coarse-grained, form-based or function-based, etc.
Though one can proceed with the customization of a tagset for a particular
language only after deciding upon these dualities, not everything can be decided
at the time of customization, and certain things cannot be decided categorically,
in a binary yes-no manner, at all. Therefore, some things need to be decided at
the time of actual corpus annotation, and these decisions need to be documented
to form what is called an annotation guideline. One must keep in mind that the
decisions taken to solve some issues may not be theoretically appealing but mere
shallow ad hoc solutions, adopted either to postpone the immediate problem to the
next level or to provide the best possible solution that prevents further
problems. The issues that have been raised and addressed at this level of corpus
annotation have been classified under the following headings:
8.1. Fuzzy Items in Complex Predicates (FI)
POS categories are hardly like the elements of the periodic table, in that they
do not always retain their unique identity. They lose their grammatical identity,
i.e. their morphosyntactic features, in certain contexts, either due to a
neighborhood effect or due to grammaticalisation. For instance, in some complex
predicates (see Butt for explanation), it is hard to decide upon the grammatical
category of the words other than the light verb (in V2 or V-final position), as
in the examples kor dafah, kor hA:sil, kor pA:dI, darguzar korun, kor tabdiil,
kor fanah, etc. In these examples, the bold words (dafah, hA:sil, pA:dI, etc.)
are most likely to be either adjectives or nouns, although their nominal features
like number, gender and case have been bleached; nevertheless, they are not as
clear as the bold words in the following complex predicates: gov khosh, tuj’ dav,
dits kreykh, nyuv kheyth, etc.
Here it is easy to decide whether these words are adjectives, nouns or something
else, as they clearly retain their nominal features either at the morphological
level or at the semantic level: khosh is an adjective (mas/fem, agreeing), dav is
a noun (fem), kreykh is a noun (fem) and kheyth is a verb (participle).
8.2. Zwitter Ion of Natural Language (ZI)
The term Zwitter Ion has been taken from chemistry to illustrate the dual nature
of gerunds. Usually, chemical particles are either positively charged or
negatively charged at one instant of time, but zwitterions, unlike other
particles, are of a dual nature and carry both positive and negative charges
simultaneously. Analogically, nouniness and verbiness are two polar opposites,
like positive and negative charges: if a word tends to be a noun, its verbal
properties have declined, and vice versa. The gerund is the only class of words
that simultaneously retains nominal as well as verbal properties. Gerunds, on the
one hand, take postpositions and case markers and function like nominals, but on
the other hand retain their predicate-argument structure properties like a
typical verb.
Now the question arises: how should a gerund be tagged? Should its form be taken
into consideration in order to classify it, or its function? If form is taken
into account, it is a verb, though a nonfinite one; but if function is taken into
consideration, it is a noun. It should be noted that in the ILPOST tagsets it was
placed under the category Noun as Verbal Noun, perhaps because the focus was on
the function; but in ILMT and subsequently in BIS, it has been placed under the
category Verb as gerund, given the fact that by form it is a verb and its
predicate-argument structure frame remains intact, though it can never be
inflected for other typical verbal features like tense, aspect, mood or voice,
e.g. kheyn-I sI:t’, vandn-I kin’, cheyn-as peyTh, marn-an, etc.
Here, on the one hand, kheyn-I, vandn-I and cheyn-as are gerunds in oblique form,
followed by postpositions like nominals; however, the gerund marn-an is not in
oblique form but is inflected with a case marker (-an). On the other hand, the
(transitive) gerunds like kheyn-I and cheyn-as can also take their arguments, as
in batI kheyn-I sI:t’ or chai cheyn-as peyTh.
8.3. Izaafat Constructions (IC)
These constructions are multiword expressions borrowed from Persian, like
Compounds and Named Entities but with a more coherent internal structure.
Usually, two nouns, or a noun and an adjective, are combined by means of a marker
called “izaafe” to form an izaafat construction. The izaafe behaves like a
genitive in Urdu, but in Persian it behaves more like a linker (see Butt).
However, in Kashmiri, the construction seems to behave more like a compound with
a less conspicuous internal structure, e.g. aab-i hayaat, vaziir-i aazam,
hoquuq-o frA:yiz, dast-i shafaa, habiibi paakh, hoquumat-i hind, shariiq-i
hayaat, etc.
The diacritic marker that represents the izaafe in Kashmiri is mostly zer, unlike
in Persian and Urdu where hamzah (ء), vaav (و) and badii ye (ے) also represent
the izaafe. Izaafat constructions are thus either NN-NN combinations or NN-JJ
combinations. In NN-NN combinations, the diacritic is on the first element (NN)
but it seems to belong to the second element (NN) when it is simplified
(nativized) for interpretation. The Kashmiri newspaper corpus is replete with
such expressions. Given the writing conventions of Arabic, i.e. the omission of
diacritics, and their influence on Urdu and thereby on Kashmiri writing, such
markers may or may not be there in the written expressions but are intact in the
spoken forms.
Here, the problem is how to tag the two constituent words of an izaafat
construction. Should the words be tagged separately (like aabi/NN hayaat/NN)? Or
should the constituents be joined and then tagged together as a unit (like
aabi-hayaat/NN)? In the first case, however, there will be less clarity in
determining the POS category of the first word, the one marked with the izaafe.
8.4. Identification of Proper Nouns (PNI)
As such the noun as a POS category doesn’t pose any problem but the inclusion
of common noun, proper noun and Nloc as subcategories have proved confusing
and thus, the noun came to be the most debatable category in the tagset. At times,
it becomes very difficult to distinguish between NN and NNP by relying on the
traditional notions. For instance, mevIh (fruit) is not referring to any specific fruit
or is not the name of any fruit, hence, it is NN. Then, by this logic, amb (mango)
or tsuunTh (an apple), names of specific fruits, should be NNP but these are
considered as NN. In order to address this issue, properly, one needs to go by
some concrete standards that can be generalized with least exceptions. Therefore,
it has been posited that NN is the noun that denotes a class of things, concrete or
103
abstract, (set or sub-set) while as NNP denotes an instance of a class (member of
set or sub-set), e.g. mango is a name but of a class of different varieties or
instances like Alphanso, Baadaamii, etc. Similarly, different varieties of apples
like Amriican, Chomuuriyah, Deylshas, etc are instances of the class apple, hence,
‘Chomuuriyah’ is NNP and ‘tsuunTh’ is NN. This position solves the problem to
some extent but raises other questions like, whether zuun (moon) is NN or NNP,
given the fact that other planets also have moons with specific names, and hence,
zuun is a class not an instance, likewise, in the above examples, Alphanso
mangeos or Chomuuriya apples are also names of classes rather than the
particular instances. By the same logic, it can be said that Alphanso mango tree is
also a class of different Alphanso mango trees but not an instance. Actually,
determining the status of a thing as a class or an instance is very tough ontological
problem. By looking from the top to bottom of an ontological tree, it seems an
object, like Alphanso mango, is a class but looking at the same object from the
bottom to top, it seems that the thing or object is an instance. It is hard to
determine, where one should stop dividing a class into subclasses, sub-subclass,
etc? in order to take one level as an instance. The problem of indeterminacy
comes to fore as soon as the definiteness issue creeps in the already vexed
problem of looking for instances. One can ask question whether the notion of
NNP incorporates the notion definiteness or is it independent of it, e.g. a person
name, “Umar” is no doubt an NNP but is not definite as the “Umar” can be Umar
Farooq (hurriyat leader), Umar Abdullah (CM), Baba Umar (Journalist), Umar
ibni Khataab (Second Khaliifah), Umar Gull (cricketer) or any other Umar.
Hence, the person names can be themselves a class (indefinite) rather than an
instance (definite). In order to ease out the problem, one can keep the definiteness
at bay from the notion of an instance while looking for ‘instances’ within a class.
Only then, one may be able to distinguish between NN and NNP otherwise the
status of person names or some place names as NNP can be objectionable.
Nevertheless, one can propose various diagnostic features, as given below, to help
in determining whether a noun is NN, NNP or NST.
1. If a noun can’t be pluralized or quantified, it is likely to be NNP.
2. If there is room for asking a question about the thing under consideration, like which thing?, then the thing is likely to be NN; if there is no room for asking such a question, the thing is likely to have NNP as its POS category, e.g. aaftaab (sun) is a specific instance of the class of stars and hence NNP; there is no room for asking the question, which sun?
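The two diagnostics above can be sketched as a toy rule. A minimal sketch follows, assuming hypothetical lexical sets that stand in for real morphological tests; the sets and function names below are illustrative and not part of the annotation toolchain.

```python
# Hypothetical lexical sets standing in for morphological diagnostics.
PLURALIZABLE = {"tsuunTh", "kul"}   # nouns that can be pluralized/quantified
QUESTIONABLE = {"tsuunTh", "kul"}   # nouns admitting a "which X?" question

def guess_nn_or_nnp(noun: str) -> str:
    """Diagnostic 1: a noun that can't be pluralized or quantified is
    likely NNP. Diagnostic 2: if no 'which X?' question is possible,
    it is likely NNP. Otherwise, treat it as NN."""
    if noun in PLURALIZABLE and noun in QUESTIONABLE:
        return "NN"
    return "NNP"

print(guess_nn_or_nnp("tsuunTh"))   # common noun 'apple' -> NN
print(guess_nn_or_nnp("aaftaab"))   # unique 'sun' -> NNP
```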
8.5. Named Entities (NEs)
As the name itself suggests, NEs include the names of companies, institutions, persons, places and things, and they are multiword in nature, for example: vaziir-i aazam manmohan singh, islamic university of science and technology, Microsoft India Private Limited, etc.
The problem with named entities is that they form long chains of words which in isolation refer to nothing specific but as a whole refer to specific entities. Therefore, as a whole they are multiword proper nouns, though their constituent words can be of any category. Izaafat constructions can also be their constituent elements, as in vaziir-i aazam manmohan singh.
The question arises: how should they be tagged at this level? There are two options: one is to tag each constituent word with its respective POS category, and the second is to tag all the constituent words with the tag used for proper nouns, because as a whole they are proper nouns.
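The two options can be contrasted in a minimal sketch. The per-word tags assigned to the izaafat constituents below are assumptions for illustration only, not gold annotations.

```python
# A multiword NE with illustrative (assumed) per-word POS tags.
ne_tokens = [("vaziir-i", "NN"), ("aazam", "JJ"),
             ("manmohan", "NNP"), ("singh", "NNP")]

# Option 1: keep each constituent's own POS category.
option1 = [f"{w}/{t}" for w, t in ne_tokens]

# Option 2: tag every constituent NNP, since the NE as a whole
# is a proper noun.
option2 = [f"{w}/NNP" for w, _ in ne_tokens]

print(" ".join(option1))
print(" ".join(option2))   # vaziir-i/NNP aazam/NNP manmohan/NNP singh/NNP
```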
8.6. Compound Words (CW)
Compound words are also the problematic multiword expressions but they are not
comparatively simpler than izaafat constructions and named entities in that the
number of the constituent words can’t exceed more than two like named entities
and there is no internal linker to them like Izafe. However, they have their own
complexities. They can be endocentric with compositional meaning or exocentric
with non-compositional meaning. Like the outward drift in the meaning of
exocentric compounds, there can be also heterogeneity in their POS
compositionality, i.e. words of two different POS categories can form a new word
which may or may not have the category of one of its constituent words. For
example: Akis/QT akh/QT (pronoun), pA:n’/?? paanai/PR (pronoun), shinI/??
baal/N (noun), gA:r/JJ zimIdaar/JJ (adjective), kheyn/V cheyn/V (noun),khosh/JJ
nasiib/N (adjective),zorI/RB zorI/RB (adverb), As’/?? As’/?? (adverb), heyokun/V
kheyth/V (verb) , etc.
Of all the compounds, compound nouns and verbs are far more productive in Kashmiri and pose more challenges to the annotator. In some compounds shown above, the constituent words without POS tags (marked ??) are intuitively difficult to classify, as their original form has been changed and reduced to a sort of bound form (like pA:n’/?? and As’/??). However, some words (like shinI/??) seem to have assumed the oblique form, the form which a noun assumes under the influence of a following postposition, as in (shinI/N peyTh’/PSP), where “shiin” changes to “shinI”. Therefore, such forms have an independent existence, unlike pA:n’ and As’, whose existence is bound to these contexts (compounds) only and which do not occur outside such contexts. Such forms have been classified and tagged on the basis of their original form vis-à-vis category.
In addition to such problems, compounds in general are like multiword expressions, and it is therefore important to decide whether the constituent words of a compound should be joined together by some convention, e.g. a dash (-), to form a single token, or kept as such (two tokens) without joining. If the former approach is followed, they need to be tagged as a whole unit, which ignores the category of the constituent words. If the latter approach is followed, they need to be tagged separately as per their respective categories, which ignores the category of the entire compound. It is also important to consider whether the POS information of the individual constituent words of a compound or the POS information of the entire compound is more important at this level of annotation. If one thinks both are important, it must also be kept in mind whether both can be achieved at this level.
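The trade-off between the two conventions can be illustrated with a small sketch, assuming the exocentric compound khosh nasiib (JJ + NN acting as an adjective, as listed above); the tags are illustrative.

```python
# An exocentric compound: constituents JJ + NN, but the whole acts as JJ.
compound = [("khosh", "JJ"), ("nasiib", "NN")]

# Convention A: one dash-joined token, tagged with the category of the
# whole compound (constituent categories are lost).
joined = "-".join(w for w, _ in compound) + "/JJ"

# Convention B: two tokens, each with its own constituent tag
# (the category of the whole compound is lost).
separate = " ".join(f"{w}/{t}" for w, t in compound)

print(joined)     # khosh-nasiib/JJ
print(separate)   # khosh/JJ nasiib/NN
```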
8.7. Numeric Dates (NDs)
It has been observed that there occur various instances of dates in the corpus.
They are also like named entities. As they represent particular points in time, it is
quite possible to label them as Nloc but it is a debatable issue whether to classify
them under Nloc or not. Date is the name for particular point of time like the
name of a place or a person and they are unnamed like the typical temporal Nloc
(adverbs of time). Therefore, they are classified under proper nouns and not under
Nloc. However, problem arises when they are followed by a case marker which
occurs as separate token, e.g. 16 January 1950 has, 1847 huk, 1947 yas (۱۹۴۷ ، چ777س۱۹۵٠ ہ777ک،۱۹۴۷یس ،). In these examples, -has, -huk and -yas are
basically bound forms but can’t be attached with numerals naturally. Similarly,
sometimes the dates, e.g. 1950 (۱۸۵٠) occur with symbol (ء) for issvi (AD) and
the case-marker (یس) yas, e.g. in 1850 iisvi yas (۱۸۵ یس ٠ ء ) manz. It has been
106
also observed that there are some occurrences of the dates in the corpus where the
initial and final dates representing a period of time are kept in brackets and the
case marker occurs outside the brackets, e.g. (چ777س تام ۱۹۴۷- ۱۹۵٠( or (1842-
1857) as taam. Such cases, though a typical tokenization problem, have been
handled at POS level as they had been left as such at the time of tokenization
because they came to fore during annotation process.
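The dash-joining of bound case markers to numeric dates, later adopted in the annotation guidelines, could be sketched as follows. The token list and the set of bound forms below are assumptions for illustration, not the tool actually used.

```python
import re

# Illustrative set of bound case markers that follow numeric dates.
BOUND = {"has", "huk", "yas"}

def join_date_markers(tokens):
    """Attach a bound case marker to an immediately preceding
    four-digit date with a dash, leaving other tokens untouched."""
    out = []
    for tok in tokens:
        if tok in BOUND and out and re.fullmatch(r"\d{4}", out[-1]):
            out[-1] = out[-1] + "-" + tok   # e.g. 1845 + has -> 1845-has
        else:
            out.append(tok)
    return out

print(join_date_markers(["1845", "has", "manz"]))   # ['1845-has', 'manz']
```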
8.8. Underspecified Verbs (UVs)
As mentioned in subsection 2.2.6 (d), the fine-grained sub-classification of verbs has been avoided, given that it is based on the notion of finiteness, i.e. finite, non-finite and infinite verbs, and the notion itself is controversial at a deeper level. It is not only tense that contributes to finiteness; sometimes aspect, mood and agreement determine finiteness. The morpho-syntactic information that constitutes finiteness is usually distributed over two or three verb tokens (auxiliary and main verbs). For instance, in the example su chu batI khevaan (he is eating rice), the auxiliary verb (chu) carries tense (present) and PNG agreement (mas.SG.3rd), and the main verb (khevaan) carries aspectual information (progressive) in addition to the lexical semantics. In such sentences, it would be absurd to say that the auxiliary verb (chu) is finite and the main verb (khevaan) is nonfinite. If only tense determined finiteness, then all tenseless verbs would be nonfinite. A very important question then arises: are perfective sentences like “arshidan kheyo batI” (Arshid ate rice) and imperative sentences like “tsI khe batI” (you eat rice) basically nonfinite clauses? Isn’t it the case that only de-verbal verb forms like participles or gerunds are basically non-finite?
It is obvious that the main verb in the sentence su chu batI khevaan is also finite despite the fact that it carries no tense (-tense) or PNG information (-PNG). It is verbal in nature and plays a key role in the sentence by providing the lexical semantics of the action and its aspectual information, unlike the nonfinite verbs (such as its khey-th form), which are de-verbal in nature and thus play a marginal (modifying) role in the sentences in which they occur. Therefore, finiteness needs to be determined by taking into account all the verb tokens (except de-verbal ones) of a sentence, irrespective of whether the verb tokens are contiguous or non-contiguous with relation to each other. Keeping in view this complicated nature of finiteness, verb classification has been kept underspecified at this stage and only two types (main and auxiliary) have been posited, just to avoid resolving the finiteness puzzle at this stage and to postpone it to the next level of annotation, i.e. the chunking level.
8.9. Non-manner Adverbs (NMVs)
The notion of adverb has been simplified by restricting it to manner adverbs only and putting time and location adverbs under noun as Nloc. However, there are many words which seem to be adverbs other than manner and locative adverbs. Since only the manner adverb has been posited in the current POS tagset, the label needs to be neutralized and expanded to accommodate both manner and non-manner adverbs, which signify reason, frequency, some quantification, and sentential modification, e.g. kyaazi (why), beyi (again), dohdish/dohai/rozaanI (everyday), hameshI (forever), zyadI (more), vaariya (a lot), kam (less), shaayad (perhaps), yaqiinan (surely), lA:ziman (necessarily), etc.
The rationale for including reason, frequency, unique quantification and sentential adverbs among adverbs is well grounded. The reason word kyaazi (why) modifies a whole clause, like the sentential modifiers shaayad (perhaps), yaqiinan (surely) and lA:ziman (necessarily). Frequency words like dohdish/dohai/rozaanI (everyday) and hameshI (forever) sound like manner adverbs, a sort of temporal manner. Not all quantifiers modify verbs, but surely some are verb modifiers. This role of some quantifiers is more evident when they are used with intransitive verbs, e.g. zyadI (more) in the sentence su shong zyadI (he slept more); kam (less) in the sentence kam osun (s/he laughed less); vaariya (a lot) in the sentence vaariya kheyvun (s/he ate a lot), etc.
Another problem related to adverbs is that of being multiword like compounds, though it is far from compounding. It mostly arises out of writing convention and can be handled like other multiword expressions or taken care of at the time of tokenization. For instance, certain adverbs are composed of two tokens in which the first token is an adjective or noun and the second one mostly a postposition, e.g. Thiikh pA:Th’ (nicely), khOsh pA:Th’ (happily), vaarI pA:Th’ (safely), dor pA:Th’ (strongly), khushii saan (with happiness), etc.
Not all the multiword, rather multi-token, adverbs are problematic for this level of annotation, as they can be handled like any other multiword expression, but some, in which the POS status of both the constituent tokens is unclear, are really challenging; e.g. the status of pA:Th’ in the expressions Thiikh pA:Th’ (nicely), khOsh pA:Th’ (happily), vaarI pA:Th’ (safely) and dor pA:Th’ (strongly) is not clear. Although it has been treated as a postposition in some previous annotation works, it is more likely to be a bound form. It is, no doubt, a separate token in the corpus but, intuitively speaking, it is not a word; rather, it is a part of the preceding word and is more like an adverbial morpheme, except in instances like misaali pA:Th’ (for example), where it is clearly a postposition but its frequency in the corpus is very low. The instances in which it appears to be a bound form have a high frequency in the corpus and have not been handled at the time of tokenization, where a bound form is usually attached to the preceding token (see chapter-III). The reasons it has been left as such in the tokenization process are its high frequency and unclear status.
8.10. Paradox in POS Annotation
As aforementioned, form-function is one of the important dualities. It is very crucial for tagset design as well as corpus annotation. Theoretically, one needs to stick to only one aspect and carry out the entire task of corpus annotation on the basis of the same principle, without occasionally switching to the alternative dictum. Practically, however, this seems implausible, as annotations are not carried out in isolation just for the sake of annotation; the product of annotation needs to be used for some bigger task ahead, and thus one cannot ignore the formal aspect of a word and focus entirely on its functional aspect, or vice versa, as demanded by theory. Somehow, both aspects need to be taken into account, and one needs to consider the use of one aspect or the other in the particular task at hand as well as in the tasks ahead, so that a particular aspect can be ignored if it is not very important. For instance, on the one hand, demonstratives would not have been a POS category if only the formal approach had been taken into account, as by form demonstratives are actually pronouns but by function they are demonstratives; e.g. the word su (he) is a pronoun in the sentence su aav (he came), but the same word is a demonstrative in the sentence su shur aav (that kid came). On the other hand, gerunds would have been nouns rather than verbs if their formal aspect had been ignored and only the functional aspect taken into account. It therefore seems contradictory that at one point, to posit demonstrative as a POS category, the functional aspect of a word has been taken into account, but to posit gerunds as a subclass of verbs the same functional approach has been defied. It is important to mention that such decisions are a matter of expertise and experience, and one need not follow theory strictly as long as doing so does not undermine the goal of the task in hand. In the present task, i.e. POS annotation, the goal is to lay down the foundation of a dependency treebank which can be further augmented with anaphoric or other discourse-level information. Thus, this dual or hybrid approach to corpus annotation is justified. Nevertheless, it can be said that opting for hybridity in corpus annotation under the influence of some practical usage is indeed paradoxical, as capturing the functional aspect of words in the corpus is an alternative way of looking at data and, thus, the essence of corpus linguistics vis-à-vis corpus annotation.
9. Guidelines for POS Annotation
Some important guidelines that were framed and followed for POS tagging of
KashCorpus are given below:
i. All NEs are essentially proper nouns (NNPs), as they refer to specific entities that have been named; however, the name is composed of more than one word with different POS categories. Actually, NEs are phrases rather than words but need to be handled like words at this level. Since NEs as a whole are NNPs, all the words composing an NE are tagged as NNPs, e.g. the NE “vaziiri aazam manmohan singh” is tagged as “vaziiri/NNP aazam/NNP manmohan/NNP singh/NNP”, so that a chain of NNPs is obtained which can be easily identified in the annotated corpus. This might look like an absurd decision, and one can argue that the original POS information is suppressed, but as aforementioned, it is a strategy to evade the problem at this level and to keep track of the problem items so that they can be handled at another intermediate level dealing with multiword expressions (MWEs).
ii. CWs are handled slightly differently, though like NEs they too are treated as MWEs. These are composed of only two words and mostly include compound nouns, compound adjectives, compound adverbs (reduplications), etc. The words which form a compound are assigned their respective POS tags, but specialized ones with ‘C’ to indicate a compound; e.g. compound nouns like “shinI baal, zaril zaal, masI vaal” and compound adjectives like “khOsh qIsmath, gA:r zoruurii” are tagged as “shinI/NNC baal/NNC, zaril’/NNC zaal/NNC, masI/NNC vaal/NNC” and “khOsh/JJC qIsmat/NNC, gA:r/JJC zoruurii/JJC”, respectively. The overall POS information of the compound word has been suppressed, unlike the treatment of NEs, but the ‘C’ marker has been added to the tag to make the compounds identifiable or extractable. It must be noted that capturing the compounding information of verbs, like the above, has been avoided, given that there are other dreaded complications associated with verbs which are handled at the chunking level. It has also been avoided for pronouns, as there are very few compound forms among them.
iii. ICs are also handled like CWs: the words which are linked together by the izaafe are tagged with their respective POS categories, ignoring the change brought about by the izaafe in the word to which it is bound; e.g. the ICs “aabi hayaat, khuuni jigar, habiibi paakh, hoquuqo faraayiz” are tagged as “aabi/NNC hayaat/NNC, khuuni/NNC jigar/NNC, habiibi/NNC paakh/JJC, hoquuqo/NNC faraayiz/NNC”. Here, information about the izaafat has been suppressed since, as in many other cases, it is not much needed for sentence parsing, and ICs behave more like compound words.
iv. NDs are actually a kind of numeric NEs and hence tagged as NNPs, but the other complications associated with them have been handled by joining the bound forms to the numeric date with a dash (for details, see the above discussion); e.g. numeric dates like 1845 has manz, 1845 ء yas manz and (1845 ء) yas manz are tagged as 1845-has/NNP manz/PSP, 1845/NNP ء-yas/NNP manz/PSP and (/PUNC 1845/NNP ء-yas/NNP )/PUNC manz/PSP, respectively. It should be noted that some unexpected tokenization problems, like the ones discussed among the issues above, have been tackled even at this stage.
v. As aforementioned, the de-verbal non-finite forms like perfective participles (-ith forms) such as kheyth, pArith, shongith, bihith, etc.; progressive participles (-vun forms) such as zeyvvun, shongvun, natsvun, asvun, gindvun, khasvun, bozvun, etc.; and gerundial forms such as shongun, shongnI, shongas, shongnan, shongnuk, vothnI, natsnas, etc. have been assigned the underspecified POS tag VM. Verbal non-finite forms (infinitives like shongun) and verbal finite forms (main verbs) have also been assigned the same tag, i.e. VM. It is important to mention that no distinction has been maintained at this level; the reasons for this are discussed above.
vi. The pronominal forms which are followed by a verb or postposition have been assigned the POS tag PRP, but if the same form is followed by an intensifier, quantifier, adjective or noun, it is a demonstrative and is tagged as DMD.
vii. Adverbs which are not essentially manner adverbs, like those representing frequency, quantification and reason (beyi, vaariyah, kyaazi, etc.), have also been tagged as RB, like the manner adverbs.
viii. In addition to traditional adverbs of time and space, some vague time-words like subhas (in the morning), shaamas (in the evening), dohli (in the day) and ‘roth’ in “roth kyuth” are also potential NSTs as long as they provide a temporal location for an action/event. But when they are inflected with a genitive marker, like subhuk (of the morning), shaamuk (of the evening), etc., and cease to provide a temporal location for an action/event, they cease to be NSTs and become NNs.
ix. Usually, NSTs do not take demonstratives, are marked with locative, ablative or terminative case markers, and can’t be pluralized or quantified.
x. Besides, words like “kA:shur or kashmiri” are most likely to be NNP when used in isolation or with other words to refer to a language, but are likely to be NN when used in isolation to refer to a people, as in “koshur, PunjA:b’, BangA:l’, etc.” However, they are likely to be JJ when used with other words, as in koshur saqaafat (kashmiri culture), koshur geyav (kashmiri ghee), etc.
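Several of these guidelines are simple context rules. Guideline (vi), for instance, can be sketched as below, assuming a hypothetical oracle that supplies the POS tag of the following word; tag names such as INTF and VM are assumptions where the text does not fix them.

```python
# Tags of a following word that make a pronominal form a demonstrative:
# intensifier, quantifier, adjective or noun (set is illustrative).
DM_TRIGGERS = {"INTF", "QT", "JJ", "NN"}

def tag_pronominal(next_pos: str) -> str:
    """Guideline (vi): PRP before a verb or postposition,
    DMD before an intensifier, quantifier, adjective or noun."""
    return "DMD" if next_pos in DM_TRIGGERS else "PRP"

print(tag_pronominal("VM"))   # su aav: 'su' before a verb -> PRP
print(tag_pronominal("NN"))   # su shur aav: 'su' before a noun -> DMD
```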
So far, linguistic information, one outcome of the analysis-cum-annotation process, has been discussed and put forward in the form of a small guideline for problematic cases. The next sub-section presents statistical information, yet another kind of outcome of the annotation process.
10. Statistical Results
As aforementioned, the data has been divided into four sets for annotation. In each dataset, the words have been classified into eleven classes. Table.1 shows the cumulative frequency of each POS category across all four datasets, while Fig.14.a shows the total quantity of each POS category in terms of percentage. It has been prepared from the frequency Table.1 to show the contribution of each POS category to KashCorpus and to compare the percentages of the categories, in order to identify the most frequent and the least frequent POS categories.
Figure.14.a. Total Quantum of POS in terms of (%): N 34.689, V 18.242, PP 9.077, RD 7.934, JJ 6.434, PR 6.313, CC 6.038, RP 3.476, DM 2.945, QT 2.501, RB 2.348
S.No  POS Type (x)  Data Set-1 (f1)  Data Set-2 (f2)  Data Set-3 (f3)  Data Set-4 (f4)  Grand Total fx = (f1+f2+f3+f4)
1     N                  953             2296              868             1042             5159
2     V                  793             1045              597              278             2713
3     PP                 190              665              210              285             1350
4     RD                 394              345              251              190             1180
5     JJ                 176              384              212              185              957
6     PR                 333              234              176              196              939
7     CC                 169              313              208              208              898
8     RP                 207              115               99               96              517
9     DM                  50              146              119              123              438
10    QT                  68              183               64               57              372
11    RB                  48               48               96              157              349
      Total f =         3381             5774             2900             2817            14872

Table.1. Cumulative frequencies (fx) of POS
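The percentages in Figure.14.a can be recomputed from Table.1. The short sketch below derives each category's share of the 14872 tokens; small third-decimal differences against the figure (as for JJ and RB) are rounding artifacts.

```python
# Grand totals (fx) of each POS category, copied from Table.1.
totals = {"N": 5159, "V": 2713, "PP": 1350, "RD": 1180, "JJ": 957,
          "PR": 939, "CC": 898, "RP": 517, "DM": 438, "QT": 372, "RB": 349}

grand = sum(totals.values())
assert grand == 14872   # matches the "Total f" row of Table.1

# Each category's percentage share of KashCorpus, as in Figure.14.a.
for pos, f in totals.items():
    print(f"{pos}: {100 * f / grand:.3f}%")
```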
11. Summary
In this chapter, the fundamental layer of annotation, i.e. POS tagging, of the dependency treebank of Kashmiri has been explored with reference to the four datasets taken from KashCorpus, discussed in chapter-III. First, the task to be handled in this chapter has been introduced in section.1, and then the notion of POS tagging has been explained at the beginning of section.2. In the same section, some important corpus annotation standards have been discussed and the various existing POS tagsets have been reviewed briefly. Further, not only has the category-wise description of the Kashmiri POS tagset (used in the current work) been given in this section, but also the comparative statistical information about the various sub-categories involved. In section.3, the prerequisites for actual POS tagging have been discussed first, including the annotation interface and a particular data storage format. The SA-Interface of the Sanchay platform has been used for the current task, and the procedure for using it has been given in this section with various snapshots. The storage format, called SSF, has also been discussed, along with the need to rely on such a format in any annotation pipeline. Later, the actual POS annotation has been discussed along with the results, in the form of the various linguistic issues raised and their solutions. The solutions have been presented in the form of a mini-guideline. Finally, statistical results like the frequency and cumulative frequency of the various POS categories have been given in the same section.
Overall, the chapter has explored and discussed various annotation schemes and tools found relevant to the present work and has laid down the foundations for building the dependency treebank (KashTreeBank), using four samples of data taken from KashCorpus. The next chapter will address two further layers of annotation, which revolve around syntactic dependencies.
Chapter.5. Chunking of KashCorpus

‘Judgments are inherently unreliable because of their unavoidable meta-cognitive overtones, because grammaticality is better described as a graded quantity, and for a host of other reasons.’
Edelman and Christianson (2003)
1. Introduction
Chunking is the second level of annotation in developing a dependency treebank based on the HTB guidelines (Bharati et al., 2012). It involves annotating clusters of words based on local dependencies with predefined chunk labels. The chunk layer encodes the intermediate level of linguistic information between the POS level and the dependency level. In fact, it covers all those dependency relations which dependents form with their head, except with the verbal head. Although it covers all lower-level dependencies which do not belong to the argument-adjunct level, these dependencies are not overtly labeled. Nevertheless, the chunk layer is very crucial for the annotation of inter-chunk dependency relations.
This chapter is mainly concerned with describing the second layer of annotation of KashTreeBank. The second section deals with the notion of chunk, the third discusses the rationale behind chunking, the fourth gives a description of the chunk tagset, section five describes the process of manual chunking carried out with the help of the Sanchay SA-Interface, section six discusses the issues encountered during the annotation process, and section seven presents the results, both statistical and theoretical. Section eight presents the guidelines and section nine summarizes the chapter. The next section discusses the notion of chunk.
2. The Notion of Chunk
The term ‘chunk’ appears similar to the term ‘phrase’, but a chunk and a phrase differ considerably, though both refer to a group of words. The former is a general term which has been widely used across various disciplines for a perceptually compact group of entities; in linguistics, it refers to a non-recursive group of words. The latter is purely a syntactic term which refers to constituents that are often recursive in nature. According to Abney (1991), a chunk consists of a single content word surrounded by a constellation of function words which matches a fixed template; e.g. in the Kashmiri noun chunk [huth/DM baagas/NN manz/PP].NC (‘in that garden’), the content word baagas/NN (garden) is surrounded by the function words huth/DM (that) and manz/PP (in).
Abney (1995) also defines a chunk as “the non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head but not including post-head dependents.” There is psychological evidence for the existence of chunks. Gee and Grosjean (1983) show that these are performance structures of word clustering that emerge from a variety of types of psychological experimental data, such as pause durations in reading and naive sentence diagramming. They argued that performance structures are best predicted by what they called Ø-phrases, which are created by breaking the input string after each syntactic head that is a content word. They do not assign syntactic structure to chunks and assume that pre-nominal adjectives do not qualify as syntactic heads; otherwise, phrases like a big dog would comprise not one chunk but two. Contrary to that, Abney (1994) argued that a chunk has syntactic structure, comprising a connected sub-graph73 of the global parse-tree of a sentence, and that chunks are represented in terms of major heads, which are all content words except those that appear between a function word and a content word; e.g. ‘proud’ is a major head in ‘a man proud of his son’ but not in ‘the proud man’, because there it appears between the function word ‘the’ and the content word ‘man’.
However, practical considerations in implementing a framework on corpus samples can lead to a variety of word constellations that may or may not be psychologically real chunks as discussed above. Therefore, chunks may not comply with the well-known definitions of a chunk but may be merely ad-hoc solutions to more practical problems, e.g. non-contiguity. Thus, a chunk is a sub-tree within a syntactic phrase-structure tree corresponding to a nominal, prepositional, adjectival, adverbial or verbal phrase (Abney, 1991, 1992, 1995), or simply a word group based only on local surface information, e.g. the noun group and the verb group (Bharati et al. 1995). Sometimes, even the simplest notion of chunk as a word group may be problematic while handling discontinuity (see Bhat, 2012).
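Abney-style chunking as described above can be caricatured in a few lines. This is a toy sketch, not the chunker used in this work; the tag inventory (the PRE_HEAD and POST_HEAD sets) and the one-pass strategy are assumptions for illustration.

```python
PRE_HEAD = {"DM", "QT", "JJ"}   # function words attaching to a following head
POST_HEAD = {"PP"}              # postpositions attach to the preceding chunk

def chunk(tagged):
    """Greedy one-pass chunker: accumulate pre-head words until a content
    word, then absorb any immediately following postposition."""
    chunks, current = [], []
    for i, (word, tag) in enumerate(tagged):
        current.append(f"{word}/{tag}")
        nxt = tagged[i + 1][1] if i + 1 < len(tagged) else None
        # Close the chunk after a non-pre-head word, unless a
        # postposition follows and should still join this chunk.
        if tag not in PRE_HEAD and nxt not in POST_HEAD:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

print(chunk([("huth", "DM"), ("baagas", "NN"), ("manz", "PP")]))
# first chunk groups all three tokens, as in [huth baagas manz]
```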
3. Rationale for Chunking
It has already been discussed in Chapter-I and Chapter-II that dependency relations involve asymmetrical grammatical relations, i.e. head-dependent or modifier-modified relations between words. These relations hold at two levels: one at the chunk level74, between the words of a minor POS class (secondary dependents) and the word of a major POS class (secondary head), and the other at the sentential level, between the secondary heads (primary dependents) and the primary head, i.e. the finite verb. The rationale behind the division of dependency relations into two levels is that it allows incorporating the popular notion of phrase, though crudely, and thereby permits a division of labor in order to achieve consistency in syntactic annotation. Moreover, at the POS level the focus had been more on the form of words than on the function they perform in a sentence. Therefore, positing the intermediate chunk level regains the scope of function, which is the key force constructing a sentence. However, at the first level, the dependency relations between dependent words and their head word, constituting secondary modifier-modified relations, have not been labeled explicitly; instead, the cluster of dependent words and the head word has been annotated with chunk tags, which have been devised based on the notion of head. Therefore, the relations can be easily predicted by head computation based on the chunk label; e.g. in an NP chunk, N, but not JJ, DM, QT, RP or PP, will be the head. Similarly, in an RBP, RB will be the head. So there is no need to label relations explicitly at this level, as they can be easily computed from the information encoded in the tags.
73 The parse-tree of a chunk is a sub-graph of the global parse-tree (ibid, 1994).
74 It can be considered equivalent to the popular phrasal level.
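The head-computation idea can be sketched as follows. The label-to-head-POS mapping is illustrative: only the NP and RBP cases are fixed by the text above, and the tag names used in the example tokens are assumptions.

```python
# Chunk label -> POS category of the head word inside the chunk.
# Only NP and RBP are stated in the text; the mapping is extensible.
HEAD_POS = {"NP": "NN", "RBP": "RB"}

def chunk_head(label, tagged_words):
    """Return the first word inside the chunk whose POS matches the
    head POS predicted by the chunk label, or None if absent."""
    want = HEAD_POS[label]
    for word, tag in tagged_words:
        if tag == want:
            return word
    return None

print(chunk_head("NP", [("huth", "DM"), ("baagas", "NN"), ("manz", "PP")]))
# -> baagas
```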
Further, it is worth mentioning that the chunk tag is assigned not only to clusters of words formed by dependency relations but also to clusters formed by non-dependency relations; e.g. the clusters JJ + N, DM + N and QT + N are clearly formed by dependency relations and have been tagged as NP chunks, but the clusters of N and PP, PRP and PP, N and RP, and V and RP are not formed by modifier-modified relations (hence non-dependency) and have been tagged with the chunk labels NP and VGF, respectively. Similarly, predicative adjectives and quantifiers, non-contiguous adverbs, conjuncts and other discontinuous elements like the tensed and non-tensed verbal elements have been assigned chunk tags despite the fact that they don’t essentially form a cluster of words through some dependency or non-dependency relation; rather, they are solitary elements and have been treated at par with clusters of words. This flexibility of treating solitary words at par with clusters of words is actually to account for the discontinuity and flexibility in surface word order which is the hallmark of sentences taken from a corpus. Therefore, positing the chunk level is important not only to deal with one set of dependency relations but also to settle most of the problems of the surface level and to smooth the ground for the next level of annotation. This also brings the notion of chunk closer to the performance structures proposed in Gee and Grosjean (1983) than to the standard notion of phrase, which is visible only in NP chunks.
Just as POS tagging is a prerequisite for chunking, chunking is a prerequisite for deep syntactic parsing vis-à-vis annotation, i.e. for annotating inter-chunk dependency relations. However, in order to chunk POS-annotated data consistently, a chunk tagset and a chunking interface are a must. The chunk tagset used in the current work is described in the next section.
4. Description of the Chunk Tagset
Though parsing by chunking is common practice in IL treebanks, there is as yet no standardized tagset for chunks and other higher dependency relations for ILs, as there is for POS tagging in the BIS standards. However, there has been some work on chunking for Indian languages, particularly for Hindi, Bangla, Urdu and Telugu (see Bharati et al. 1995; Ray et al. 2003; Singh et al. 2005; Das et al. 2005; Bharati et al. 2006). The current chunk tagset is based on the POS tagging and chunking guidelines used in ILMT (Bharati et al. 2006), but the notion of verb group as posited in those guidelines has been rejected in the current work as non-applicable to Kashmiri. It has been found that the POS-annotated words of the Kashmiri corpus can be grouped and classified into ten chunk categories (Bhat, 2012 & 2013). These ten chunk categories, along with chunk tags75, are given in Table.1 and their description is given below.
4.1 Noun Chunk (NP)
Noun Chunk is the name assigned to the cluster of words which a noun forms with its dependents such as JJ, DM and QT, or even with PP, which is also considered its dependent though it is not a modifier like the other dependents. The notion of noun chunk is similar to that of the noun phrase except that it is a single entity and can’t be recursive, i.e. can’t embed any sub-phrase in it, e.g. kwr-i hund (of the girl), su boDbaarI bag (that big orchard), Ak-is bAd-is maqaan-as manz (in one big house), su ti (he too), etc. Further examples of NPs are given in Table.1 and Table.3, and the proportion of NPs in the Kashmiri corpus is given in Figure.3.b.
4.2 Auxiliary Chunk (AUXP)
In Kashmiri, as in other Indo-Aryan (IA) languages, the tense, aspectual and
lexical information of verbs is distributed over three verb tokens known as the
tense auxiliary, the modal auxiliary and the main verb, respectively; but
unlike in those languages, these three verb tokens are non-contiguous, with
other elements, particularly the arguments, intervening between them. Thus,
Auxiliary Chunk is the name assigned to a solitary tensed auxiliary or a
cluster of tense and aspectual auxiliaries, both tagged as VAUX at POS level,
rather than to the cluster of three verb tokens forming a Verb Group (see
Bharati, 2006) in Urdu and Hindi. The AUXP tag has been assigned to such
solitary tense auxiliaries or clusters of auxiliary tokens away from the main
verb, e.g. aasi (will), Os-nI (was not), chi-nI aasaan (do not keep), chi
heykaan (can), etc. Further examples of AUXP are given in Table.1 and Table.3,
and the proportion of AUXP in the Kashmiri corpus is given in Figure.3.b.
75 It must be noted that some of the chunks, though conceptually different from those of other ILs, have been assigned the same tags as in other IL treebanks, with the understanding that tags, like words, are arbitrary in nature and there is no point in raising objections such as why Verb Chunk Finite is not tagged VCF instead of VGF, or why Noun Chunks are tagged NP instead of NC. This was done purely to keep the door open for easy resource sharing.
4.3 Verb Chunk Finite (VGF)
Verb Chunk Finite is the name assigned to solitary tense-less or tensed main
verbs, tagged as VM at POS level, or to clusters of the form RB-VM, VM-RP or
RP-RB-VM. When a VM is tense-less, it is either the lexical part of a verbal
complex whose auxiliary occupies the V2 position (or whose auxiliaries occupy
the V2 and V3 positions) in the sentence, or it is itself a full-fledged verb
with both the lexical part and the mood information condensed in a single
token, e.g. gatsh (go), khey (eat), chey (drink), etc. When it is tensed, the
tense is either clearly inflected, as in khe-yi (will eat), che-yi (will drink)
and shongi (will sleep), or it is not inflected at all, i.e. the tense
information is morphologically unmarked or underspecified and is encoded
contextually, with aspect providing the most crucial cue. Another possibility
is that aspectual information is conflated with tense and both are expressed
through a single inflection (a portmanteau morpheme). For instance, there are
two perfective forms of finite verbs in Kashmiri, the '-mut' form and the '-ov'
form. The '-mut' forms, e.g. khey-mut (eat-prf), chey-mut (drink-prf),
shong-mut (sleep-prf), etc., co-occur with a tense auxiliary that occupies the
V2 position in the sentence. One can therefore easily determine whether a
'-mut' form is a present or a simple past perfective by looking at the V2
position, where its tense information is located; the tense and aspectual
information is disjunct in such cases. The '-ov' forms, however, e.g. khey-ov
(ate), chey-ov (drank), shong (slept), etc., neither co-occur with a tense
auxiliary at the V2 position nor are themselves inflected with tense
information. Since tense information is underspecified in such forms, it should
have been difficult to determine whether they are present or past perfectives,
but by default the native speaker perceives them as past perfectives. It is
thus evident that in the '-ov' forms either '-ov' carries tense information in
addition to aspectual information (hence, a portmanteau) or it merely provides
a cue to tense, which is encoded in the context. Whatever the convincing
explanation for this case, such forms have been tagged as VGF, e.g. natsaan
(dances), khe' (eat), shong (slept), etc. Further examples of VGF are
given in Table.1 and Table.3, and the proportion of VGF in the Kashmiri corpus
is given in Figure.3.b.
4.4 Verb Chunk Non-finite (VGNF)
Verb Chunk Non-finite is the name assigned to solitary participle forms, '-vol'
forms and clusters of reduplicated progressive forms, which are essentially
de-verbal in nature and function as either event or entity modifiers. Such
forms are generally known as non-finite verbs, but non-finite verbs also
include gerunds and infinitives. However, as mentioned in Chapter-IV, the task
of determining finiteness has been avoided at POS level, as the grammatical
information of verbs is distributed over multiple tokens rather than condensed
in a single token. The task thus becomes very complex if one goes by the
standard definition of finiteness, but it has been found that the notion of
'de-verbal' forms simplifies it; this is addressed further in the forthcoming
section on issues. It is important to mention that gerunds and infinitives,
though de-verbal in nature, do not play a modifying role and thus are not
tagged as VGNF, unlike the other de-verbal forms mentioned above, e.g.
shong-ith (sleeping), bih-ith (sitting), pakaan pakaan (while walking), etc.
Further examples of VGNF are given in Table.1 and Table.3, and the proportion
of VGNF in the Kashmiri corpus is given in Figure.3.b.
4.5 Verb Chunk Gerund (VGNN)
Verb Chunk Gerund is the name assigned to those de-verbal forms which function
as nominals. These include solitary direct gerundial forms and clusters of
oblique gerundial forms with postpositions. Infinitives have also been tagged
as VGNN, as they form arguments of finite verbs just like gerunds. As
aforementioned, such forms are distinguished from the other de-verbal forms
only in terms of their function; otherwise the constituent verbs of both VGNF
and VGNN are devoid of any verbal feature except the argument structure, which
remains intact even when they play non-verbal roles in the sentence, e.g.
shong-nI sI:t' (because of sleeping), nats-un (dancing), asn-an (laugh-ERG),
etc. Further examples of VGNN are given in Table.1 & Table.3, and the
proportion of VGNN in the Kashmiri corpus is given in Figure.3.b.
4.6 Conjunct Chunk (CCP)
Conjunct Chunk is the name assigned to conjunctions, both coordinating and
subordinating, which have been tagged as CCD and CCS, respectively, at POS
level. Most of the sentences in the corpus are compound, complex or
compound-complex in nature, in which conjunctions play a key structural role,
and thus the frequency of conjunctions in the corpus is high. Since
conjunctions neither have any modifier-modified relation nor bear any
part-whole relation with any other POS category, they cannot, unlike
postpositions, be part of any other chunk. They are therefore solitary and are
projected as separate chunks, e.g. tI (and), ya (or), kinI (or), zi (that).
Further examples of CCP are given in Table.1 & Table.3, and the proportion of
CCP in the Kashmiri corpus is given in Figure.3.b.
4.7 Adjectival Chunk (JJP)
The name Adjectival Chunk has been given to solitary adjectives and
quantifiers, or to adjectival or quantifier clusters like RP-JJ and RP-QT,
which cannot be part of any noun chunk. It is worth mentioning here that
although all adjectives have been tagged as JJ and all quantifiers as QT at POS
level, not all adjectives and quantifiers can be raised to chunk level as JJP.
Adjectives and quantifiers occur either in attributive position, as part of an
NP, or in predicative position, as solitary elements or clusters. It is only in
predicative position that adjectives and quantifiers have the status of head,
as they do not there constitute what are popularly known as discontinuous
phrases; they can then be easily posited as adjectival chunks and have been
tagged as JJP, e.g. su chu rut (he is nice). Further examples of JJP are given
in Table.1 & Table.3, and the proportion of JJP in the Kashmiri corpus is given
in Figure.3.b.
4.8 Adverbial Chunk (RBP)
The name Adverbial Chunk has been assigned to solitary adverbs or adverbial
clusters (RB-RP) which cannot be part of any verb chunk. It must be noted that
although all adverbs are tagged as RB at POS level, not all can be raised to
chunk level and tagged as RBP: sometimes they are adjacent to their head and
can be part of a VGF, but mostly they occur non-contiguously with their head
and are then tagged as RBP, e.g. su os vaarI vaarI garI kun pakaan (he was
moving towards home slowly). Further examples of RBP are given in Table.1 &
Table.3, and the proportion of RBP in the Kashmiri corpus is given in
Figure.3.b.
4.9 Negation Chunk (NEGP)
The name Negation Chunk has been given to those negative particles that occur
as solitary elements without an obvious head and hence may themselves be
treated as heads and projected as chunks, e.g. na su yiyi-nI vaapas (no, he
won't come back). Further examples of NEGP are given in Table.1, and the
proportion of NEGP in the Kashmiri corpus is given in Figure.3.b.
4.10 Other Chunk (BLK)
The name Other Chunk is reserved for all those solitary POS-tagged words or
clusters of POS-tagged words which do not fit into the aforementioned chunks.
It acts as a bag into which all elements can be put that do not conform to the
chunking scheme, either because they are unrelated to the sentence structure,
e.g. serial numbers, or because they belong to the discourse level, connecting
one sentence with another, e.g. khA:r tAm' vAn'-nI zahn ti titsh kath (however,
he never said anything like that). Further examples of BLK are given in
Table.1, and the proportion of BLK in the Kashmiri corpus is given in
Figure.3.b.
S. No — Chunk Name (Tag): Examples
I — Noun Chunk (NP): [su/DM badI/RP rut/JJ shakhIts/NN] NP (that very big man), [farooq/NNP ti/RP] NP (farooq also), [farooq/NNP nI/RP] NP (not farooq), [Akis/QT bADis/JJ palas/NN peTh/PP] NP (on one big rock)
II — Auxiliary Chunk (AUXP): [chu/VAUX] AUXP (is), [chu/VAUX aasaan/VAUX] AUXP (keeps), [Os/VAUX] AUXP (was), [Os/VAUX rOzaan/VAUX] AUXP (used to), [aav/VAUX] AUXP (was), [yiyi/VAUX] AUXP (will be)
III — Verb Chunk Finite (VGF): [kheyovum/VM] VGF (ate + 1st person clitic), [vaarI/RB parihaa/VM] VGF (should have read nicely), [dav/VM haz/RP] VGF (run + honorific), [variyaa/RP zorI/RB pakh/VM] VGF (walk very fast), [yiyi-nI/VM kehn/RP] VGF (will not come + emphasis), *[chu/VAUX vonmut/VM] VGF (has said)
IV — Verb Chunk Non-Finite (VGNF): [kheyth/VM] VGNF (after eating), [kheyth/VM cheyth/VM] VGNF (after eating and drinking), [pakaan/VM pakaan/VM] VGNF (while walking), [kheynIvol/VM] VGNF (eater)
V — Verb Chunk Gerund (VGNN): [paknas/VM peyTh/PP] VGNN (for walking), [kheynI/VM sI:t'/PP] VGNN (with eating), [natsnI/VM kin'/PP] VGNN (due to dancing), [khenIch/VM] VGNN (of eating), [kheyon/VM] VGNN (eating / to eat), [kheynIvol/VM] VGNN (one who eats)
VI — Conjunct Chunk (CCP): [tI/CCD] CCP (and), [yaa/CCD] CCP (or), [kinI/CCD] CCP (or), [natI/CCD] CCP (or), [ki/CCS] CCP (that), [zi/CCS] CCP (that), [yodvai/CCS] CCP (if), [agar/CCS] CCP (if), [magar/CCD] CCP (but)
VII — Adjectival Chunk (JJP): [variyaa/INT rut/JJ] JJP (very good), [pantsah/QC kiluu/NN] JJP (fifty kilos), [pandhA:yim/QO] JJP (fifteenth), [zyuuTh/JJ] JJP (tall)
VIII — Adverbial Chunk (RBP): [teyz/RB teyz/RB] RBP (quickly), [zorI/RB ti/RP] RBP (loudly), [lot/RB] RBP (slowly), [ti/RP kyaazi/RB] RBP (because), [chunki/CCS] RBP (because), [tawai/RB] RBP (because of that), [teli/RB] RBP (then)
IX — Negation Chunk (NEGP): [na/RP] NEGP (no), [na/RP saa/RP na/RP] NEGP (no + honorific + not)
X — Other Chunk (BLK): [khA:r/RP] BLK (however), [teli/RP] BLK (so)
Table.1. Kashmiri Chunk Tagset
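The ten tags of Table.1 can be kept as a simple lookup for checking chunk labels in annotated files. The following is an illustrative Python sketch only (the names and the validation function are not part of the actual annotation toolchain):

```python
# The ten chunk tags of the Kashmiri tagset (Table.1) as a lookup table,
# usable e.g. for validating chunk labels during annotation.
CHUNK_TAGS = {
    "NP": "Noun Chunk",
    "AUXP": "Auxiliary Chunk",
    "VGF": "Verb Chunk Finite",
    "VGNF": "Verb Chunk Non-finite",
    "VGNN": "Verb Chunk Gerund",
    "CCP": "Conjunct Chunk",
    "JJP": "Adjectival Chunk",
    "RBP": "Adverbial Chunk",
    "NEGP": "Negation Chunk",
    "BLK": "Other Chunk",
}

def is_valid_chunk_tag(tag):
    """True if the tag belongs to the Kashmiri chunk tagset."""
    return tag in CHUNK_TAGS

# The tagset deliberately reuses IL tag names (see footnote 75):
print(is_valid_chunk_tag("VGF"))  # → True
print(is_valid_chunk_tag("VCF"))  # → False
```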
5. Chunking POS Tagged Corpus Samples
As aforementioned, chunking is the labeling of a cluster of POS-annotated words
(with an obvious head), or of a solitary POS-annotated word (which itself acts
as head), with a higher-level tag. During chunking, words have been clustered
together and assigned a particular chunk tag, keeping in view their POS tags,
their adjacency and the dependency relations between them that make them
perceptually close entities. This has been done in such a way that each chunk
has a definite internal structure, i.e. the words constituting a chunk are
asymmetrically related to each other, with one word as head and the remaining
words as its dependents, or, in the case of a solitary word, the word itself is
head with no dependents. However, there are certain cases where a word that has
been given chunk status is neither a head nor a dependent as far as semantic
dependency is concerned, e.g. AUXP, as discussed above. The chunking process
has been carried out using the same interface that was used for POS tagging.
The chunk layer has been built on the POS layer, as illustrated below in three
steps for sentences 43 and 42 (taken from the corpus), given in Table.2 along
with their English translations and chunk information. The POS-annotated file
in SSF format can be opened in the Sanchay SA Interface (GUI), as shown in
Fig.1.a, in order to carry out manual chunking.
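The clustering logic described above (one head per chunk, with adjacent dependents absorbed into it) can be sketched as a toy program. The actual chunking in this work was done manually in the Sanchay interface; the rules and romanized tokens below are deliberately simplified illustrations, not the annotation procedure itself:

```python
# Toy chunker over (word, POS) pairs: groups a noun with its adjacent
# dependents (DM/QT/JJ) and a following postposition into an NP, clusters
# VAUX runs into AUXP, and projects solitary VMs as VGF. Illustrative only.
def chunk(tagged):
    """Group (word, pos) pairs into (chunk_tag, tokens) spans."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        pos = tagged[i][1]
        if pos in ("DM", "QT", "JJ", "NN", "NNP", "PRP"):
            j = i
            while j < n and tagged[j][1] in ("DM", "QT", "JJ"):
                j += 1                      # pre-nominal dependents
            if j < n and tagged[j][1] in ("NN", "NNP", "PRP"):
                j += 1                      # the nominal head
            if j < n and tagged[j][1] == "PSP":
                j += 1                      # postposition joins the NP
            chunks.append(("NP", tagged[i:j]))
            i = j
        elif pos == "VAUX":
            j = i
            while j < n and tagged[j][1] == "VAUX":
                j += 1                      # tense + aspectual auxiliaries
            chunks.append(("AUXP", tagged[i:j]))
            i = j
        elif pos == "VM":
            chunks.append(("VGF", [tagged[i]]))
            i += 1
        else:
            chunks.append(("BLK", [tagged[i]]))
            i += 1
    return chunks

# farooq chu batI khevaan — the V2 auxiliary ends up in its own AUXP chunk
sent = [("farooq", "NNP"), ("chu", "VAUX"), ("batI", "NN"), ("khevaan", "VM")]
print(chunk(sent))
```

Note how the V2 auxiliary *chu* and the clause-final main verb *khevaan* land in separate chunks (AUXP and VGF), which is exactly the treatment the text argues for.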
Kashmiri Sentence 43: [Perso-Arabic original damaged in extraction; the sentence is given token by token in SSF in Table.3]
Translation: He said that the atomic powers want a monopoly on the technology for their own benefit and do not let other countries use atomic energy for peaceful purposes.
Chunks: [chunk-annotated Perso-Arabic text damaged in extraction; the chunk annotation of this sentence is given in SSF in Table.3]
Kashmiri Sentence 42: [Perso-Arabic original damaged in extraction; the sentence is given token by token in SSF in Table.3]
Translation: The President of Iran, Mahmoud Ahmadinejad, has said that the United Nations' resolution against his country has no significance and that the Security Council has become an 'instrument' in the hands of America.
Chunks: [chunk-annotated Perso-Arabic text damaged in extraction; the chunk annotation of this sentence is given in SSF in Table.3]
Table.2. Showing Example Sentences 42 and 43
Figure.1.a. SA Interface Showing POS Tagged Sentence
Step-1: In this step, the contiguous words which form a chunk are selected by holding the control key and clicking on the nodes, so that all the contiguous nodes are selected simultaneously, as shown in Fig.1.a. Although the first three chunks (NP, VGF and CCP) consist of solitary words, they have also been chunked following the same steps as shown for the fourth chunk (highlighted in Fig.1.b), i.e. by selecting the nodes, adding a layer and changing the name of the layer (the chunk name) for the selected nodes.
Figure.1.b. SA Interface Showing Step-1 of Chunking
Step-2: In this step, one can right-click on the selected chunk so that a drop-down list of actions opens, in which the 'Add Layer' option can be selected and a new chunk layer can be added in the format shown in Fig.1.c.
Figure.1.c. SA Interface Showing Step-2 in Chunking
Step-3: In this step, the newly added layer has a default tag (NP), which can easily be changed by clicking on the chunk tag itself and pressing the first letter key of the desired chunk tag on the keyboard. One can keep pressing the letter key until the desired chunk tag is assigned to the newly added chunk layer, as shown in Fig.1.d.
Figure.1.d. SA Interface Showing Step-3 in Chunking
As shown above, sentence-43 has 26 tokens, which have been grouped into 18
chunks, and sentence-42 has 29 tokens, which have been grouped into 16 chunks.
The ratio between tokens/words and chunks is not very large (approx. 1.6),
which indicates a high frequency of solitary words that have been given the
status of chunk. The 18 chunks of sentence-43, as viewed in the tree viewer of
the interface, are shown in Fig.2.a and Fig.2.b.
Figure.2.a. SA Interface Showing Chunks in Sentence 43
Figure.2.b. SA Interface Showing Chunks in Sentence 43
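The token-to-chunk ratio quoted above can be checked with simple arithmetic over the two example sentences:

```python
# Token-to-chunk ratio for the two example sentences (counts from the text)
tokens = 26 + 29   # sentence-43 + sentence-42
chunks = 18 + 16
ratio = tokens / chunks
print(round(ratio, 2))  # → 1.62, i.e. approx. 1.6 as stated
```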
<Sentence id='42'>
1 (( NP 1.1 ۍایرانک N_NNP ))
2 (( NP 2.1 صدر N_NNPC 2.2 محمود N_NNPC 2.3 احمدی N_NNPC 2.4 نژادن N_NNPC ))
3 (( VGF 3.1 چھ V_VAUX 3.2 نمت نIو V_VM ))
4 (( CCP 4.1 ز CC_CCS ))
5 (( NP 5.1 ۍتم PR_PRP 5.2 دس ن�س PP_PSP ))
6 (( NP 6.1 ملکس N_NN 6.2 خالف PP_PSP ))
7 (( VGF 7.1 ہچھن V_VAUX ))
8 (( NP 8.1 اقوام N_NNPC 8.2 متحد-کس N_NNPC ))
9 (( NP 9.1 تاز JJ_JJ 9.2 قراردادس N_NN ))
10 (( NP 10.1 نYہKکا DM_DMI 10.2 میت ا N_NN ))
11 (( CCP 11.1 ہت CC_CCD ))
12 (( NP 12.1 سالمتی N_NNPC 12.2 کونسل N_NNPC ))
13 (( VGF 13.1 ےچھ V_VAUX ))
14 (( NP 14.1 ک - ہامریک N_NNP ))
15 (( NP 15.1 ‘ RD_PUNC 15.2 ہآل N_NN ))
16 (( VGF 16.1 بنیمژ V_VM 16.2 ’ RD_PUNC 16.3 ۔ RD_PUNC ))
</Sentence>
<Sentence id='43'>
1 (( NP 1.1 ۍتم PR_PRP ))
2 (( VGF 2.1 ن نIو V_VM ))
3 (( CCP 3.1 ز CC_CCS ))
4 (( NP 4.1 ری جو JJ_JJ 4.2 ہطاقت N_NN ))
5 (( VGF 5.1 ےچھ V_VAUX ))
6 (( NP 6.1 ہپنن PR_PRF ))
7 (( NP 7.1 ید ٲف N_NN 7.2 ٲخطر PP_PSP ))
8 (( NP 8.1 ٹیکنالوجی N_NN 8.2 ٹھ ٮ�پ PP_PSP ))
9 (( NP 9.1 ری ٲاجار-د N_NN ))
10 (( VGF 10.1 یژھان V_VM ))
11 (( CCP 11.1 ہت CC_CCD ))
12 (( VGF 12.1 ہن RP_NEG 12.2 چھ V_VAUX ))
13 (( NP 13.1 بیین JJ_JJ 13.2 ملکن N_NN ))
14 (( NP 14.1 امن N_NNC 14.2 مقصدو N_NNC 14.3 ٲخطر PP_PSP ))
15 (( NP 15.1 ری جو JJ_JJ 15.2 یی ٲتوان N_NN ))
16 (( VGF 16.1 پراونس V_VM ))
17 (( NP 17.1 اجازت N_NN ))
18 (( VGF 18.1 دوان V_VM 18.2 ۔ RD_PUNC ))
</Sentence>
Table.3. Showing Chunked Sentences in SSF
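Chunked sentences stored in this form can also be read back programmatically. The following is a minimal sketch of a reader for the bracketed chunk lines; the field layout ("index (( TAG index.sub token POSTAG ))") is assumed from the printed example, and tokens are assumed to be single whitespace-free strings (real SSF files carry additional tab-separated columns such as feature structures):

```python
import re

# Sketch of a reader for SSF-style chunk lines as in Table.3.
def parse_chunks(lines):
    """Return a list of (chunk_tag, [(token, pos_tag), ...]) pairs."""
    chunks = []
    for line in lines:
        # e.g. "1 (( NP 1.1 farooq N_NNP ))"
        m = re.match(r"^\d+\s+\(\(\s+(\w+)\s+(.*?)\s*\)\)\s*$", line)
        if not m:
            continue  # skip <Sentence> wrappers and unparseable lines
        tag, body = m.groups()
        tokens = re.findall(r"\d+\.\d+\s+(\S+)\s+(\S+)", body)
        chunks.append((tag, tokens))
    return chunks

sample = [
    "<Sentence id='1'>",
    "1 (( NP 1.1 farooq N_NNP ))",
    "2 (( VGF 2.1 khevaan V_VM ))",
    "</Sentence>",
]
for tag, toks in parse_chunks(sample):
    print(tag, toks)
```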
6. Chunking Issues
As discussed above, chunking covers half of the dependency relations, though
they are not explicitly marked with relational labels, which is the general
practice in dependency treebanks and can be clearly seen in the work of Nivre
(2009), who has translated Tesnière's (1959) seminal work on dependency
grammar. However, in applying the framework designed for treebanking in ILs
(Bharati et al. 1995, 2006) to Kashmiri, entirely new issues come to the fore.
Such issues have partly stemmed from the underlying theory and partly from the
peculiar morphosyntactic or syntactic properties of Kashmiri that distinguish
it from the rest of the ILs and bring it closer to Germanic languages like
German and Yiddish. The main issues encountered during the manual chunking of
the Kashmiri corpus are briefly given below.
6.1. V2 and V3 Phenomena
It has been found that the notion of verb group that was proposed for ILs does
not hold for the Kashmiri corpus because of a unique syntactic feature of the
Kashmiri language known as the V2 phenomenon. The V2 phenomenon occurs in all
tensed clauses, be it a matrix clause or an embedded clause, in both active and
passive configurations. It is due to this phenomenon that the tense auxiliary
and the main verb cease to be contiguous. The tense auxiliary (VAUX) occurs at
the second (V2) position and the main verb (VM) at the final position of the
sentence; if a modal auxiliary is also present, it occupies the third (V3)
position. For example:
farooq/NNP chu/VAUX batI/NN khevaan/VM
Farooq is eating rice.
farooq/NNP chu/VAUX aasaan/VAUX batI/NN khevaan/VM
Farooq keeps eating rice.
However, in interrogative sentences the tense auxiliary can also occur at the
third (V3) position; if an auxiliary carrying aspectual information is also
present, it occurs at the fourth (V4) position. For example:
farooq/NNP kya/WH chu/VAUX reyaazas/NNP divaan/VM
What is Farooq giving to Riyaz?
farooq/NNP kya/WH chu/VAUX aasaan/VAUX reyaazas/NNP
divaan/VM
What does Farooq keep giving to Riyaz?
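The discontinuity can be made concrete by checking tag adjacency in the examples above. This is a toy check over the sentences reduced to their POS-tag sequences:

```python
# V2 discontinuity: the tensed auxiliary (VAUX) and the main verb (VM)
# are separated by intervening material in the examples above.
ex1 = ["NNP", "VAUX", "NN", "VM"]         # farooq chu batI khevaan
ex2 = ["NNP", "WH", "VAUX", "NNP", "VM"]  # farooq kya chu reyaazas divaan
for tags in (ex1, ex2):
    gap = tags.index("VM") - tags.index("VAUX")
    print(gap > 1)  # → True: VAUX and VM are never adjacent here
```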
The problem with finite clauses in Kashmiri is that they cannot easily be
chunked as in other IA languages, e.g. Hindi, Urdu or Punjabi, due to the V2
phenomenon: the tensed auxiliary stands discontinuous from its main verb, as
shown in the above examples. Usually, a group or cluster of words is assigned a
chunk label if the words are adjacent or contiguous and also stand in an
asymmetric relation of dependence, or simply have unequal category status, so
that the one with the higher status can be projected as the head. In this case,
however, the VAUXs and the VMs are neither adjacent to each other nor is the
relationship between them a dependency relation in the real sense. Dependencies
are essentially modifier-modified relations, and between the discontinuous
VAUX and VM in finite clauses there is hardly any modifier-modified relation
but definitely a part-whole kind of relationship. Therefore, it is impossible
to posit the verb as a chunk in the way noun and adjective chunks have been
posited. Some ad hoc decisions need to be taken to tackle the V2 problem, as
the language data is far from ideal for our perceived notion of chunk.
6.2. Headless Head
Adverbs are considered the most floating or movable elements in a sentence.
They frequently occur discontinuously, away from their heads (VMs), at the
beginning, at the final position or elsewhere in the sentence. Sometimes,
however, they occur adjacent to the VM they modify and thus become parts of
VGF, VGNF or VGNN. When adverbs (RB) occur discontinuously, they have no
governing or influencing head adjacent to them and are an authority in
themselves. Under such circumstances, RBs can be considered heads, though they
are pseudo-heads and still have a clear-cut dependency relation with their
far-away head, which is also the ultimate head, the root. In effect, a
dependency relation of the lower level (chunk level) has been promoted to a
dependency relation of the higher level (argument-structure level) to handle
the discontinuous verb chunk.
6.3. PP No More a Head
Since there is a well-known notion of functional heads in both
constituency-based and dependency-based frameworks, at least for exocentric
constructions, a distinction is usually maintained between the cases where a
cluster of words (N-PP) has the noun as its head and where it has the
post/preposition as its head; in other words, between when an N-PP cluster is a
noun phrase/chunk and when it is a post/prepositional phrase/chunk. However, no
such distinction has been drawn here on a functional basis, i.e. on the basis
of functioning as an argument or an adjunct. In all clusters of words
containing a noun, the noun has been treated as the head and never the
pre/postposition, irrespective of the fact that some of them perform core
functions (subject or object) and many perform merely subsidiary functions
(adverbial) in the sentence. This uniformity has been maintained at this level
because the underlying notion of head in PCG (Bharati, 1995) is essentially a
semantic notion, with few exceptions76. Function words are devoid of semantic
content and cannot be treated as heads under the underlying theory. Therefore,
there is no possibility of pre/postpositional phrases or chunks, unlike what
was originally posited in Bloomfieldean and post-Bloomfieldean literature for
exocentric constructions, as already given in chapter two, and what later
appeared in PSG (Chomsky, 1956) and DG (Tesnière, 1959). It is worth mentioning
that in these works NPs are generally arguments and PPs are adjuncts, but this
distinction has been avoided here for the sake of the theory and has been
encoded at the next level of annotation.
6.4. Junction Still a Head
The notion of dependency does not always provide unambiguous solutions when it
comes to exocentric constructions. The dependency representation is at a loss
when it comes to representing notoriously paratactic linguistic phenomena such
as coordination, whose nature is symmetric (two or more conjuncts play the same
role), as opposed to the head-modifier asymmetry of dependencies (Popel et al.,
2013). In other words, coordination is a pending problem of natural language,
and both PSG and DG struggle with it (Hudson, 1988; Covington, 1980).
Conjunctions also form exocentric constructions, but they have not been recast
as endocentric constructions the way pre/postpositional phrases/chunks have
been; given their crucial role in the structural organization of sentences,
they have been retained as heads.
6.5. Negation and Double Negation
Kashmiri has negative elements in free form, like na and nI (no and not), as
well as in bound form, like -nI (not) in khe-yi-nI (will not eat); sometimes
there is also double negation, e.g. khe-yi-nI kehn (will not eat + emphasis)
and na saa na (no + honorific + not). The bound negative markers do not belong
to this level, and certain negative particles in double negation constructions
(see the above examples) which do have obvious heads are of no concern here and
cannot be projected as chunks. However, some negative particles, either
solitary or in clusters (RP-RP), do not have any obvious head and themselves
have the potential of being heads.
6.6. Discourse Elements
Discourse elements are the particles that have been tagged as particle default
(RPD) at POS level. They conjoin sentences at the semantic or discourse level
to bring cohesion to the text. Since they are extraneous to the existing set of
chunks and, like conjunctions, do not seem to be dependents of any existing
semantic head despite being function words, they have been projected as
separate chunks. It must be noted that discourse elements have also been
treated as heads (connectives) in discourse treebanks.
76 The strict notion that only lexical items can be heads seems to be diluted by projecting certain chunks from function words, e.g. CCP, NEGP and BLK.
6.7. Relational Confusion
As aforementioned, at chunk level one needs to handle two kinds of grammatical
relations: lower-level dependencies, e.g. between JJ and N or RB and VM, and a
kind of part-whole relation, e.g. between N and PP, N and RP, or VAUX and VM.
It is more productive to focus on one type of relation at a time. One therefore
needs to keep track of the kind of relation one is handling, without confusing
dependencies with part-whole relations.
7. Statistical Results
The quantitative results are given in terms of chunk statistics and the
qualitative results in terms of a miniature guideline.
The four datasets that had been used for POS tagging have been reduced to three
by merging the second and third. These three datasets have been used in
chunking, which has been carried out using the SA Interface of Sanchay, as
aforementioned, and the chunk frequency of each dataset has been obtained with
the help of the same interface. The frequency distribution table so obtained
has later been used to calculate the cumulative frequency and the percentage of
the chunks. The same data is represented through the bar charts given in
Fig.3.a and 3.b.
(Bar-chart values: AUXP 407, BLK 50, CCP 794, JJP 272, NEGP 9, NP 4556, RBP 159, VGF 1473, VGNF 215, VGNN 190; total 8125)
Figure.3.a. Showing Cumulative Frequency of Chunks
The three datasets consist of 682 POS-annotated sentences, which in turn
consist of 8125 chunks classified into ten chunk categories. It has been found
that the most frequent chunk is NP and the least frequent is NEGP, as given in
Fig.3.a, where the height of each bar is directly proportional to the frequency
of the item it represents. The ascending order of frequency of the chunks is as
follows:
NEGP < BLK < RBP < VGNN < VGNF < JJP < AUXP < CCP < VGF < NP
That NP is the most frequent and VGF the second most frequent chunk is expected
from the empirical facts about POS categories given in Chapter-IV. The chunk
statistics reveal an important empirical fact: 27.630% of finite clauses show
the V2 phenomenon, while 72.369% are devoid of it, tense being condensed in the
main verb itself. However, it must be noted that not only tensed verbs have
been considered finite: all verbs which have not become de-verbal and which
possess aspectual or modal information have been considered finite, and it is
for this reason that a comparatively smaller percentage of finite clauses has
been found with the V2 phenomenon, which otherwise could have been larger. The
statistical results reveal another empirical fact: 78.434% of verbs in Kashmiri
are finite and 21.565% are non-finite. Of the non-finite forms, 46.913% are
gerunds and the remaining 53.086% are other non-finite forms.
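These percentages can be re-derived from the chunk frequencies of Fig.3.a. The snippet below is a sketch of that arithmetic; small rounding differences from the figures quoted in the text may remain:

```python
# Re-deriving the reported percentages from the chunk frequencies (Fig.3.a)
freq = {"AUXP": 407, "BLK": 50, "CCP": 794, "JJP": 272, "NEGP": 9,
        "NP": 4556, "RBP": 159, "VGF": 1473, "VGNF": 215, "VGNN": 190}

total = sum(freq.values())                       # all chunks
v2_pct = 100 * freq["AUXP"] / freq["VGF"]        # finite clauses showing V2
nonfinite = freq["VGNF"] + freq["VGNN"]
finite_pct = 100 * freq["VGF"] / (freq["VGF"] + nonfinite)
gerund_pct = 100 * freq["VGNN"] / nonfinite

print(total)                 # → 8125
print(round(v2_pct, 2))      # → 27.63
print(round(finite_pct, 2))  # → 78.43
print(round(gerund_pct, 2))  # → 46.91
```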
The bar diagram in Fig.3.b shows the data of Fig.3.a in terms of percentages.
This was done to reveal the striking quantitative similarities among the three
datasets and to put forward a numerical generalization about the percentage of
various chunks in the corpus, so that one can reliably claim that NPs
constitute more than 50% of the chunks in Kashmiri.
Figure.3.b. Showing Relative Proportion of Chunks
8. Chunking Guidelines
The chunking guidelines comprise the various decisions that have been taken to
resolve the chunking issues raised while chunking the data. These guidelines
can be followed in order to achieve consistency in future chunking tasks.
i. The auxiliaries and main verbs need to be independently projected as
chunks (AUXP and VGF), so that the non-adjacency problem can be settled at
the next level by positing a relation between them in which the VM is the
head of the VAUX. The solution may sound odd to anyone preoccupied with the
popular notions of syntax, but it is the surface form that is being
accounted for here, through surface-level manipulations, without positing
abstract layers and categories as has been the popular practice. Moreover,
the purpose here is not to contribute to or challenge any theoretical
paradigm but simply to produce a well-grounded data-driven grammar which a
parser can learn or from which a probabilistic grammar can be extracted.
(Figure 3.b values, in %: AUXP 5.009, BLK 0.615, CCP 9.772, JJP 3.347, NEGP 0.11, NP 56.073, RBP 1.956, VGF 18.129, VGNF 2.646, VGNN 2.338)
ii. Though conjunctions can’t be semantic head, it has been worked out that
conjunction should be treated as the head and be projected as a chunk
under the label CCP.
iii. The negative particles have scope on the entire sentence rather than on the
single word or phrase. Therefore, it can be said that they are involved in
sentential negation. Such particles should be projected as chunks under the
label NEGP.
iv. Though discontinuous adverbs have quite high frequency but as
aforementioned, in spite of occurring at long distances from the semantic
head, they are still the dependents of verb at lower level. They need to be
projected as chunks under the label RBP only to handle the discontinuity.
v. Since, discourse particles have no role in the internal organization of a
sentence; they can not belong to any other chunk proposed in the tagset
which are essential to account the internal organization of a clause or a
sentence. Therefore, they must be projected as separate chunks under the
label BLK.
vi. MWEs, which include named entities, compound words and izaafat
constructions, are POS-level problems that have been handled by
concatenating ‘C’ with the tag, but they are still separate tokens, which
can be potentially confusing. Care must be taken that all adjacent or
contiguous POS-tagged tokens with the ‘C’-marked tag are considered
one word, so that together they are either a head or a dependent. It should
not be seen as a problem that they apparently give rise to very big chunks.
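Guideline (vi) is essentially a token-grouping rule, and can be sketched as follows. The tag shapes (e.g. "NNPC" for a ‘C’-marked proper noun) and the example tokens are assumptions about how ‘C’ was concatenated with the tags in KashCorpus; the grouping logic itself is what the guideline prescribes.

```python
# Sketch of guideline (vi): runs of adjacent tokens whose POS tags carry
# the MWE marker 'C' are merged into one chunk-internal word, so that the
# whole MWE is either a head or a dependent together.

def group_mwe_tokens(tagged):
    """Merge runs of adjacent 'C'-marked tokens into one multi-word unit."""
    groups, run = [], []
    for word, tag in tagged:
        if tag.endswith("C"):
            run.append(word)
        else:
            if run:                       # close an open MWE run
                groups.append(" ".join(run))
                run = []
            groups.append(word)
    if run:                               # run may end at sentence boundary
        groups.append(" ".join(run))
    return groups

# Hypothetical input: a two-token named entity followed by an auxiliary.
tagged = [("sheyri", "NNPC"), ("kashmir", "NNPC"), ("chu", "VAUX")]
print(group_mwe_tokens(tagged))  # ['sheyri kashmir', 'chu']
```

A real implementation would additionally need to guard against tags that happen to end in C (e.g. a coordinating-conjunction tag), which this sketch ignores.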
vii. It has been found that discontinuous noun phrases are a rarity, unlike
discontinuous verb phrases. However, adjectives do occur either as
predicative adjectives or as the adjectival component of complex
predicates, where they are genuinely heads; they should be projected as
chunks and assigned the label JJP.
9. Summary
In this chapter, the first part of syntactic annotation, i.e. chunking, has been
described. First of all, the notion of a chunk was discussed, which seems quite
similar to the popular notion of a phrase; the distinction between the two has been
neatly drawn. Since chunking is also an annotation task, it requires a tagset and an
annotation tool, just like POS annotation. Both the tagset and the tool have been
described at length. Each tag has been explained with the help of examples, and
the entire process of manual chunking of the previously built POS-annotated
KashCorpus has been illustrated with the help of example sentences 42 and 43.
Snapshots of the tree viewer have also been given, along with the chunked
data of the example sentences 42 and 43 in SSF, to show how chunk projections
are created in the interface and how they are actually stored at the back end in
SSF. Further, the linguistic issues raised during the process have been elaborated
with sufficient examples. Finally, the results of the annotation work have been
presented. The empirical results have been given in the form of bar charts, which
have also been briefly interpreted. The theoretical results have been given in the
form of guidelines which cover the main decisions taken to resolve various issues.
Chapter.6 Dependency Parsing of KashCorpus
“Unfortunately or luckily, no language is tyrannically consistent . . . All grammars leak.”
Edward Sapir, Language (1921)
1. Introduction
As already mentioned in Chapters three and four, a treebank is a set of
machine-readable parse trees of natural language, encoding syntactic, semantic
or both types of linguistic information. Dependency treebanks are multi-layered
annotation pipelines: at each layer, a separate but related set of linguistic
information is annotated, in such a manner that the tags at the lower level
facilitate the annotation at the higher level. The obligatory annotation layers of a
dependency treebank include a POS layer, a chunk layer and a relational layer.
However, further layers of linguistic information, such as morphological or
discourse-level information, can also be added, depending upon the intended
utility of the layer in a treebank.
For the current dependency treebank, only three layers of linguistic information
have been taken into consideration. The first layer contains coarse-grained
hierarchical POS labels for each token/word of a sentence, as discussed in
Chapter four. The second layer contains chunk labels for clusters of words, which
were dealt with in Chapter five. The third layer contains labels for inter-chunk
dependency and non-dependency relations, which are dealt with in this chapter.
This chapter discusses the dependency parsing/annotation of the already built,
chunked KashCorpus. Dependency parsing involves labeling head-dependent
relations at the lower level as well as at the higher level. The dependency
annotation at the lower level, i.e. intra-chunk, has been covered under chunking
and was dealt with in the previous chapter. The dependency annotation at the
higher level, i.e. at the level of predicate-argument structure, hence inter-chunk,
is the sole concern of this chapter.
Section two introduces the notion of (deep) syntactic parsing. Section three
describes the grammar formalism used as the parsing model. Section four
describes the GRs. Section five deals with the annotation of dependencies.
Section six is concerned with the issues raised during the annotation process.
Section seven provides the statistical results of the dependency annotation.
Section eight discusses inter-annotator agreement and the results obtained in the
concerned experiment. Finally, section nine summarizes the chapter.
2. Notion of Syntactic Parsing
Generally, parsing refers to the syntactic analysis of an input string, and a parser
is a program that parses an input string automatically. According to Grune and
Jacobs (2008), parsers are already being used extensively in a number of
disciplines: in computer science for compiler construction, database interfaces and
artificial intelligence; in linguistics for text analysis, corpora analysis, machine
translation and stylistic analysis; in document preparation and conversion; in
typesetting chemical formulae; in chromosome recognition, etc. Although the
term parsing has been derived from the Latin phrase pars orationis, meaning
parts-of-speech, it is a technical term used for manual or automatic grammatical
analysis. When the grammatical analysis involves word-level analysis, it is called
morphological parsing; when it involves phrase- or chunk-level analysis, it is
shallow syntactic or simply shallow parsing; and when it involves clause- or
sentence-level analysis, it is deep syntactic parsing or simply syntactic parsing.
Similarly, if the analysis belongs to the discourse level, it can be called discourse
parsing. However, in general terms, parsing is a cognitive or computational
process of taking an input string and generating some sort of structure for it, e.g.
the generation of a parse tree for an input sentence. As far as the end product of
syntactic parsing is concerned, it is clear that parsing lies at the heart of
treebanking, where the syntactic trees are produced by manual or semi-supervised
methods. The notion of syntactic parsing is closely linked to the parsing model,
which provides the grammar formalism determining the nature of the output
syntactic trees or graphs. As already mentioned in chapters one and two, there are
two main approaches to syntactic parsing. One is based on the popular syntactic
notion known as constituency and the other on the relatively obscure notion of
syntax known as dependency. However, the last decade has shown renewed
interest in various varieties of dependency grammar, particularly for parsing text
corpora and developing dependency treebanks and parsers. The current work is in
line with this resurgent dependency wave. The next section gives a brief account
of the Indian version of dependency grammar.
3. Paninian Computational Grammar (PCG)
The goal of the Paninian approach is to construct a theory of human
communication, i.e. of how natural language is used to convey information to the
hearer and how the hearer arrives at the intended meaning. Therefore, grammar is
seen as the system of rules that establishes correspondence between what the
speaker intends to say and the corresponding utterance s/he produces, and also
between what the hearer listens to and the meaning s/he extracts from it. Paninian
Grammar (500 B.C.) was originally written for Sanskrit, and PCG is actually an
attempt to interpret Paninian Grammar in a new light and apply it to all modern
IA languages. According to Kiparsky & Staal (1969), PCG (Bharati et al., 1993)
is a variant of dependency grammar. It has been used as the parsing model for all
treebanks that are being built in India. It is for the same reason that it has also
been used in the current syntactic annotation, which is the final level of
annotation in building the dependency treebank of Kashmiri. This model helps to
capture the syntacto-semantic relations which are instrumental in constructing a
sentence. A sentence is considered as a series of modifier-modified relations with
a primary modified, the main verb (VM), which is the root of the dependency
stemma (graph or tree). The elements which modify the main verb are its
arguments and adjuncts that participate in the action specified by the verb. The
relations of these participants with the main verb are called karaka. Since
Kashmiri is a highly inflectional language, there are clear-cut case markers or
postpositions (vibhaktis) on the arguments and adjuncts that participate in an
action/event. Such morpho-syntactic cues can be very instrumental in identifying
the relation of arguments and adjuncts with their root. To some extent there is a
one-to-one relation between the karakas and the case markers/postpositions.
However, many constructions found in the corpus defy this expected
correspondence. It has been found that such correspondences between karaka and
vibhakti, along with TAM features, are very helpful in the syntactic annotation of
Indian languages, which are relatively free word-order in nature (ibid). For
illustration consider the following sentence:
raath dits library nish bAshiir-an farooq-as neelofer-as khA:trI akh kitaab.
Yesterday give-PRF library near Bashir-ERG Farooq-DAT Neelofer-DAT for one book
Yesterday Bashir gave a book to Farooq for Neelofer near the library.
Figure.1. Paninian Dependency Graph
In the above sentence, there is an action represented by the finite verb dits (gave),
which is also the root of the Paninian stemma shown in Figure.1. Since the verb is
ditransitive in nature, it has three valency slots for arguments. Therefore, there are
three arguments represented by three NPs: Bashir is the SUB, which has the
agentive role and has the kartaa (k1) relation with the root; Farooq is the IO,
which has the semantic role of recipient or beneficiary and has the sampradaana
(k4) relation with the root; and kitaab (book) is the DO, which has the semantic
role of patient and has the karma (k2) relation with the root. Besides these
participating NPs, which fill the valency slots of the verb and play the core roles
directed by it, there are additional NPs which are external to the predicate-argument
or sub-categorization frame of the verb and hence play secondary,
non-participatory roles. Some of the NPs (which project from NSTs) provide
location for the action diyun (to give). The NP raath (yesterday) provides
temporal location and therefore has the kaala adhikarana (k7t) relation with the
root, and library nish (near the library) provides spatial location and therefore has
the desha adhikarana (k7p) relation with the root. However, the NP Neelofer is
neither part of the sub-categorization frame nor does it stem from an NST; hence,
it does not provide any information related to direct participation or the location
of an action or event but represents an indirect participant which is the purpose of
the action. Therefore, Neelofer is a purpose NP which has the Taadarthya (rt)
relation with the root. The dependency labels that have been devised based on the
karakas are given in Figure.2. The description of these karakas is given in section
four of this chapter.
Figure.2. Grammatical Relations Shown in HTB Guidelines
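As a data structure, the stemma in Figure.1 amounts to a flat head-relation table. The sketch below records each chunk's head and relation label as analyzed above; the chunk strings are simplified renderings of the example sentence and are illustrative only.

```python
# A minimal encoding of the Paninian stemma in Figure.1: every chunk is
# mapped to its (head, relation) pair, with the finite verb dits as root.

deps = {
    "bAshiiran":          ("dits", "k1"),   # kartaa: agent
    "farooqas":           ("dits", "k4"),   # sampradaana: recipient
    "kitaab":             ("dits", "k2"),   # karma: patient
    "raath":              ("dits", "k7t"),  # kaala adhikarana: time
    "library nish":       ("dits", "k7p"),  # desha adhikarana: place
    "neeloferas khA:trI": ("dits", "rt"),   # taadarthya: purpose
}

# Every chunk depends directly on the verb, so this stemma has depth one:
root = {head for head, _ in deps.values()}
print(root)  # {'dits'}
```

Nothing in the format forces depth-one trees, of course; embedded clauses simply introduce chunks whose head is something other than the root.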
Keeping aside the limitations and strengths of dependency grammar in general,
the criticism leveled at PCG is that it lacks tight formalism and does not
distinguish between arguments and adjuncts. It is a fact that there are hardly any
syntactic notions like transitivity or the argument-adjunct distinction either in the
original Paninian grammar or in the current PCG. This is because it is essentially
a syntacto-semantic theory that has hardly anything to do with the syntactic
notions of sub-categorization, argumenthood and adjunction; but it must be noted
that syntactic categories like argument and adjunct can be easily extracted from
the dependency labels themselves. Further, the notion of karaka is roughly
equivalent to the notion of semantic role, but the karaka relations are identified
through the notions of semantic role, subject and object; otherwise, unless one
has a complete hold on Sanskrit, it is impossible to know what a particular karaka
is all about.
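The claim that argument/adjunct status can be read off the labels can be made concrete with a lookup. The partition below is a hedged assumption, not part of PCG or of these guidelines: verb-selected karaka labels are taken as argument-like and the locational/causal/purposive labels as adjunct-like; borderline cases (k7 in particular) would need a principled decision.

```python
# One plausible heuristic for recovering an argument/adjunct split from
# the dependency labels themselves. The two sets are assumptions for
# illustration, not an official partition.

ARGUMENT_LABELS = {"k1", "k1s", "k2", "k2g", "k2p", "k2s", "k3", "k4", "k5"}
ADJUNCT_LABELS  = {"k7", "k7t", "k7p", "rh", "rt", "rd", "rad", "rsp"}

def classify(label):
    if label in ARGUMENT_LABELS:
        return "argument"
    if label in ADJUNCT_LABELS:
        return "adjunct"
    return "other"   # e.g. ccof, pof, fragof are structural, not either

print(classify("k2"))   # argument
print(classify("k7t"))  # adjunct
```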
4. Description of Relational Labels
So far, the parts-of-speech tagset and the chunk tagset have been described in
chapters four and five, respectively; each was a prerequisite for carrying out
annotation at the respective level. Similarly, in order to carry out annotation at
this level, i.e. syntactic annotation, an inventory of grammatical relations (GRs) is
needed. The following table presents the set of GRs that have been used in
developing the current treebank. These GRs, essentially Sanskritic, are given
along with their interpretations and the attachment labels, and their variants, that
have been used at this level of dependency annotation.
S.NO | Name of the Relation | Interpretation | Relational Label | Variants
1 | Karta | SUB, Agent, Doer | k1 | pk1, jk1, mk1
2 | Karta Samanadhikarana | SUB complement, predicative JJ | k1s | **
3 | Karma | OBJ, Patient, Goal, Destination | k2 | k2g, k2p
4 | Karma Samanadhikarana | OBJ complement, predicative JJ | k2s | **
5 | Samanadhikarana | Noun Elaboration | rs | rs-k1, rs-k2
6 | Karana | Instrumental | k3 | **
7 | Sampradaana/Anubhava Karta | Recipient, Experiencer, Possessor | k4 | k4a, k4v
8 | Apaadaana | Source, Departure from the source | k5 | k5prk
9 | Vishaya/Kaal/Desha Adhikarana | Time/Space/Elsewhere Locational | k7 | k7t, k7p
10 | Shashthi | Genitive/Possessive | r6 | r6k1, r6k2
11 | Prati | Directional | rd | **
12 | Hetu | Reason/Cause | rh | **
13 | Taadarthya | Purposive | rt | **
14 | Saadrishya | Comparative/Similative | k*u | k1u, k2u, rsm
15 | Upapada Sahakaarakatwa | Associative | ras-* | ras-k1, ras-k2, ras-neg
16 | *** | Duratives | rsp | **
17 | *** | Address Terms | rad | **
18 | Kriyaa Visheshana | Adverbs/Sentential | adv | sent-adv
19 | *** | Participlised N-Modifiers | nmod | **
20 | *** | Participlised/Gerundial V-Modifiers | vmod* | vmod_Rh, vmod_Inst
21 | *** | Yus/Yuth/Yeli Relative Clauses | *mod_Relc | nmod_Relc, jjmod_Relc, rbmod_Relc
22 | *** | Conjunct of Co-ordination | ccof | **
23 | *** | N/JJ Part of Complex Predicate | pof | **
24 | *** | Tense/Aspectual Fragment of Verb | fragof | **
25 | *** | Enumerator | enm | **
Table.1. Grammatical Relations in KashTreeBank
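An inventory like Table.1 can double as a machine-readable resource for validating the labels used during annotation. The sketch below encodes only a few rows (extending it to all twenty-five is mechanical); the dict layout is an assumption about how one might store the table, not an existing tool.

```python
# A few rows of Table.1 as a base-label -> variants mapping, usable as a
# simple label validator during dependency annotation.

GR_INVENTORY = {
    "k1": {"pk1", "jk1", "mk1"},
    "k2": {"k2g", "k2p"},
    "k4": {"k4a", "k4v"},
    "k5": {"k5prk"},
    "k7": {"k7t", "k7p"},
    "r6": {"r6k1", "r6k2"},
}

def is_valid_label(label):
    """A label is valid if it is a base GR or one of its listed variants."""
    return label in GR_INVENTORY or any(
        label in variants for variants in GR_INVENTORY.values()
    )

print(is_valid_label("k5prk"))  # True
print(is_valid_label("k9"))     # False
```

Running such a check over the annotated SSF files would catch typo labels before they reach the inter-annotator-agreement stage.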
The twenty-five main GRs given in Table.1, though used in developing the
current dependency treebank of Kashmiri as well as treebanks of other ILs
(Hindi, Urdu, Telugu and Bangla), are not all dependency relations. Many of
them are non-dependency in nature and yet very important for accounting for the
structure of a sentence. These GRs, dependency or non-dependency, can be
divided into eight types, depending upon the nature of the relation they represent.
The type-wise description of the GRs (dependency and non-dependency, karaka
and non-karaka) is given below with detailed examples, so that this description
may also serve as guidelines for the dependency annotation:
4.1. Type-one GRs
These include all six karaka relations, which are the core of the Ashtaadhyaayii.
They include Karta, labeled as k1, pk1, jk1 and mk1; Karma, labeled as k2, k2g
and k2p; Karana, labeled as k3; Sampradaana or Anubhava Karta, labeled as k4,
k4a and k4v; Apaadaana, labeled as k5 and k5prk; and Adhikarana, labeled as k7,
k7t and k7p. Type-one GRs also include three non-karaka relations: Karta-samanadhikarana
and Karma-samanadhikarana, labeled as k1s and k2s,
respectively, and Samanadhikarana, labeled as rs-k1 and rs-k2. The description of
each Type-one grammatical relation (GR) is given below:
i. Karta <k1>
It is the most independent of all karakas. The chunk or clause having the k1
relation with the finite chunk is generally a subject with an agentive role, but
there are non-agentive instances also. Therefore, karta can be either primary
karta, which is volitional in nature, or secondary karta, which is non-volitional in
nature. In nominative constructions, it is the SUB <k1> which agrees with the
AUXP in terms of number and gender. In short, the dependency relations which
nominative and ergative marked subjects hold with their respective heads in
non-causative active constructions are all karta relations. For example:
bI chu-s batI khey-vaan. (1)
I-NOM be-PRS rice eat-PROG
I am eating rice.
mea khe-yo batI. (2)
I-ERG eat-PRF.SG.MAS rice
I ate rice.
However, there are exceptional cases which may not fall under the
aforementioned criteria. In such cases the SUB may be marked with a case which
is dative by form but not by function. For example:
feroz-as pazi-nI zyaadI davIdav karin’* (3.a)
Feroz-DAT need-NEG more struggle do-INF.SG.FEM
Feroz need not struggle hard.
*farooq-as chi yi kitaab parIn’ (3.b)
Farooq-DAT be-PRS-SG-FEM this book read-INF.SG.FEM
Farooq has to read this book.
Here also, the DREL between the SUB and the verb is marked as k1, irrespective
of the fact that it is non-volitional, since, as mentioned above, karta can be
volitional or non-volitional. Therefore, it must be noted that volitionality
(agentiveness) is not the sole criterion for being the karta of a verb, though it is a
fact that it is the strongest criterion.
In Kashmiri, passives are formed by a combination of an infinitival oblique
verbal form -nI and a periphrastic auxiliary yun (to come) in perfective form, in
which the internal argument of the transitive verb surfaces as the subject of the
sentence. The agent of the action is not overtly realized and is preferably omitted;
therefore, the agentive phrase is optional. However, if the agent is realized, it is
either in the form of a -zaryi or an -athi phrase (a kind of by-phrase).
farooq-an khuul kuluf. (ACTIVE VOICE) (4)
Farooq-ERG open-PRF lock
Farooq opened the lock.
farooq-ni zAryi aav kuluf khol-nI. (PASSIVE VOICE) (5)
Farooq-GEN by come-PRF lock open-PASS
The lock was opened by Farooq.
ii. Prayojaka, Prayojaya and Madhyastha Karta <pk1, jk1 and mk1>
As in any other morphologically rich language, causative or double-causative
verbs in Kashmiri are formed by a morphological process: by suffixing -Inaav in
single causatives, where there is a causer and a causee, and by doubling the suffix
-Inaav in double causatives, where, in addition to the causer and causee
arguments, there is one more argument (NP chunk) called the mediator causer.
The Prayojaka Karta is the causer NP, the Prayojaya Karta is the causee NP, and
the Madhyastha Karta is the mediator causer. The dependency relations which the
causer, causee and mediator-causer NPs hold with the causative verb (root/head)
are labelled as pk1, jk1 and mk1, respectively, as shown below in (6), (7) and (8).
For example:
arshid-an dyaav-InA:v feroz-as athi aijaaz-as kitaab. (6)
Arshid-ERG give-PRF-CAU.SG.FEM feroz-DAT by aijaaz-ACC book
Arshid made Feroz give the book to Aijaaz.
arshid-an dyaav-InA:v feroz-as athi aijaaz-as kitaab. (7)
Arshid-ERG give-PRF-CAU.SG.FEM feroz-DAT by aijaaz-ACC book
Arshid made Feroz give the book to Aijaaz.
arshid-an dyaav-Inaav-InA:v’ feroz-ni zaryi shaanuv-as athi aijaaz-as akh kitaab. (8)
Arshid-ERG give-PRF-CAU-CAU.SG.FEM feroz-GEN through shaanuv-DAT by aijaaz-ACC a book
Arshid made Feroz give Aijaaz a book through Shaanuv.
It is evident from the above examples that pk1 is the DREL of the ergative-marked
NP chunk (causer), jk1 is the DREL of the -athi marked NP chunk
(causee) and mk1 is the DREL of the -zAryi marked NP chunk (mediator causer)
with the root of the sentence, i.e. the causative verb.
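The marker-to-label correspondence just stated can be written down as a cue table. This is a sketch of the surface heuristic only: the marker names are taken from the examples above, and real annotation would still need human judgement wherever the markers are ambiguous.

```python
# Surface cues for the karta variants in causative constructions, as
# described in the text: ergative causer -> pk1, -athi causee -> jk1,
# -zAryi mediator causer -> mk1.

CAUSATIVE_CUES = {
    "ERG":   "pk1",  # Prayojaka Karta (causer)
    "athi":  "jk1",  # Prayojaya Karta (causee)
    "zAryi": "mk1",  # Madhyastha Karta (mediator causer)
}

def guess_karta_variant(case_marker):
    # Fall back to a flag for manual checking when no cue matches.
    return CAUSATIVE_CUES.get(case_marker, "k1?")

print(guess_karta_variant("athi"))   # jk1
print(guess_karta_variant("zAryi"))  # mk1
```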
iii. Karta-samanadhikarana <k1s>
It is the DREL which a predicative JJ chunk holds with the verb. These JJ chunks
function as SUB complements. The NP chunks at the predicative position in
copular constructions can also be considered SUB complements and can be
labelled as k1s. For example:
farooq chu baDi neyk. (9)
Farooq be-PRS-SG-MAS very pious
Farooq is very pious.
farooq chu scholar. (10)
Farooq be-PRS-SG-MAS scholar
Farooq is a scholar.
iv. Karma <k2>
The OBJ NP chunks which have the semantic role of patient are the Karma, and
the DREL they bear with their heads is labelled as k2, irrespective of whether the
construction is in active or passive configuration. Finer distinctions have also
been drawn within karma and have been labelled as k2p and k2g, which are dealt
with separately. In ergative constructions (active voice), it is the OBJ (k2) which
agrees with the verb, as shown in (12). Also, in passive configurations, the
agreeing NP chunk is karma, as shown in (15). In short, the agreeing NP chunks
in ergative constructions are karma. These are actually unmarked (accusative)
OBJs in both nominative and ergative constructions, and the DRELs they hold
with the root are labelled as k2. However, there are instances where a whole
clause introduced by a subordinating particle can be treated as the OBJ. It is
called Vakya Karma or clausal OBJ and is also labelled as k2, as shown in (13).
For example:
farooq chu kitaab par-aan. (11)
Farooq be-PRS-SG-MAS book read-PROG
Farooq is reading a book.
farooq-an chi yim-I kitaab-I pArm-ItsI. (12)
Farooq-ERG be-PRS-SG-FEM this-Pl book-Pl read-PRF-PL.FEM
Farooq has read these books.
farooq-an von zi ta-s chu-nI kahn ti vyatsaan-Iy. (13)
Farooq-ERG say-PRF that he-DAT be-PRS-SG-MAS-NEG no one impress-PROG-EMP
Farooq said that no one impresses him.
yuh-us A:y’ maah-i-ramzaan-as manz vaariyaa khAzIr bAgraavnI. (15)
This year come-PRF-PL-MAS ramazaan-DAT in lot of dates distribute-PASS
This year in Ramadan a lot of dates have been distributed.
v. Karma <k2p>
The OBJ NP chunks which have the semantic role of goal or destination are also
Karma, but the DREL they bear with their heads is labelled as k2p, irrespective of
whether the construction is in active or passive configuration. For example:
farooq-as chu garI gatsh-un. (16)
Farooq-DAT be-PRS-SG-MAS home go-INF
Farooq has to go home.
farooq gatsh-i garI. (17)
Farooq go-FUT home
Farooq will go home.
vi. Karma <k2g>
In sentences with ditransitive verbs where there are no giver-recipient roles, the
second OBJ is called secondary karma and bears the k2g DREL with the root. For
example:
yim pagal lukh chi gulzaaras piir sA:b vanaan. (18)
These insane people be-PRS-Pl gulzaar-DAT saint call-HAB
These insane people call Gulzar a saint.
It must be noted that the semantics of the verb vanun (to call) presupposes that
there is a person who calls, a thing/person that is to be called, and a name by
which the thing has to be called. All three presupposed elements are nominal, not
attributive, in nature.
vii. Karma-samanadhikarana <k2s>
The difference between k2g and k2s may seem very confusing, given that in both
cases the verbs involved, e.g. vanun (to say) and maanun (to believe), are
ditransitive in nature. However, in the latter case, as illustrated in (19) and (20),
the predicative JJP or NP (in copula constructions) cannot be treated as an
argument but as an OBJ complement. The reason for treating them as
complements is that they are attributive in nature, carrying the attributes of the
OBJ, rather than being nominal in nature like the OBJ, which would allow them
to be treated as arguments. They are just like SUB complements. For example:
arshid chu shaanuv-as rut samjaan. (19)
Arshid be-PRS.SG.MAS shaanuv-ACC nice think-HAB
Arshid thinks that Shanuv is nice.
chiin chu hindostaan-as akh taaqathvar muluk maanaan. (20)
China be-PRS-MAS hindustaan-ACC a strong country consider-HAB
China considers India a strong country.
viii. Karana <k3>
In sentences with a transitive verb, Karana is the NP chunk through which the
action is carried out by the agent, or which is instrumental in carrying out the
action. The instrumental role exists irrespective of the type of sentence
configuration, i.e. whether the sentence is in active, passive or WH configuration.
However, it should be noted that it is not part of the argument structure, unlike
the causee in causative constructions. The DREL of instrumentally marked NP
chunks has been labelled as k3. In short, -sI:t’ marked NPs are Karana, but -sI:t’
is ambiguous and shows syncretism between the instrumental and associative
roles. For example:
farooq-an khuul kunz-i sI:t’ kuluf. (21)
Farooq-ERG open-PRF key-ABL with lock
Farooq opened the lock with a key.
maaji aaprov bachch-as chamchi sI:t’ batI (21)
Mother-ERG feed-PRF.SG.MAS kid-ACC spoon-ABL with rice
The mother fed the kid rice with a spoon.
kuluf aav kunz-i si:t’ khol-nI. (22)
Lock come-PRF.SG.MAS key-ABL with open-PASS
The lock was opened with a key.
ix. Sampradaana <k4>
The OBJ NP chunks which have the semantic role of recipient or beneficiary, or
which represent the final destination of an action, are the Sampradaana. In
sentences with ditransitive verbs, it is the dative-marked NP (IO) which is the
recipient, beneficiary or final destination and holds the k4 DREL with the root. It
must be noted that the semantic role of receiver is not constrained by the animacy
feature in Kashmiri, and even inanimate OBJs can be marked with dative case, as
shown in (26). For example:
farooq-an di-ts Suhail-as kitaab parnI khA:trI. (23)
Farooq-ERG give-PRF.SG.FEM Suhail-DAT book for reading
Farooq gave Suhail a book for reading.
tse vAn-ith mea raath rIts kath. (24)
you-ERG tell-PRF-2PC me-DAT yesterday nice talk
You told me a nice thing yesterday.
mea van-iy tse raath rIts kath. (25)
I-ERG tell-PRF-1PC.SG.FEM you-DAT yesterday nice talk
I told you a nice thing yesterday.
darvaaz-as diut-ukh kuluf (26)
door-DAT give-PRF-3PC.PL lock
They locked the door.
sw kitaab chi dAh-an rop-yan yiv-aan. (27)
That book be-PRS.SG.FEM ten-DAT rupee-DAT.PL.FEM come-PROG
That book costs ten rupees.
x. Anubhava Karta <k4a>:
The SUB NP chunk which has the semantic role of a passive experiencer, who
perceives through a process represented by the (intransitive) verb, is the
Anubhava Karta. In clauses with perception verbs, it is the perceiving entity.
However, it must be noted that perceiving entities in Kashmiri are not
constrained by the animacy feature, as shown in (32). In short, it is the
dative-marked SUB NP chunk which is the experiencer and has the k4a DREL
with the root. For example:
mea lAj bochi. (28)
I-DAT.SG hurt-PRF.SG.FEM hunger
I felt hungry.
lADk-as peyi nindIr. (28)
boy-DAT.SG.MAS fall-PRF sleep
The boy fell asleep.
lADk-an tor fiqri (29)
boy-DAT.PL.MAS cross-PRF.SG.MAS understand
The boys understood.
kor-i gov shakh. (30)
girl-DAT.SG.FEM go-PRF.SG.MAS doubt
The girl became doubtful.
kor-eyn aav mushuk. (31)
girl-DAT.PL.FEM come-PRF.SG.MAS smell
The girls caught the smell.
makaan-as peyi traTh (32)
house-DAT.SG.MAS fall-PRF.SG.FEM lightning
The house was struck by lightning.
dukaan-as log naar. (33)
shop-DAT catch-PRF.SG.MAS fire
The shop caught fire.
tsuunT-is log daag. (34)
apple-DAT catch-PRF.SG.MAS stain
The apple got a stain.
makaan-as peyov pash vAs’ (35)
house-DAT fall-PRF.SG.MAS roof
The roof of the house fell down.
xi. Shashthi Karta <k4v>
It is a unique kind of Anubhava Karta specific to Kashmiri, with a semantic role
entirely different from that of the recipient/beneficiary/destination or experiencer.
This role is a sort of possessor-experiencer, with a clause structure like that of
k1s, i.e. the clause structure of predicative adjectives and of copulas, which
simply state an existential state of affairs. The DREL labelled as k4v actually
holds between a dative SUB, expressing some existential state of affairs, and the
root. For example:
farooq-as chi sath neychiv (36)
Farooq-DAT be-PRS.PL seven sons
Farooq has seven sons.
bAshiir-as chu sharaarath. (37)
Bashir-DAT be-PRS.SG.MAS anger
Bashir is angry.
farooq-as chu-nI heys-Iy. (38)
Farooq-DAT be-PRS.SG.MAS-NEG consciousness-EMP
Farooq is unconscious.
bAshiir-as chu safeyd mas. (39)
Bashir-DAT be-PRS.SG.MAS white hair
Bashir has white hair.
xii. Apaadaana <k5>
The NP chunk which is the source of an activity, i.e. the point of departure or
starting point of an action or activity, is the Apaadaana. In Kashmiri, it is the
ablative-marked NP followed by an ablative-marked locative postposition which
represents the source or starting/departure point of an activity. It should be noted
that cases like (41) and (42) might appear confusing. However, the former,
ablative-marked one is a genuine example of k5, but not the latter, dative-marked
one. The difference lies in the ablative-marked versus unmarked locative
postpositions, i.e. peyTh-I and peyTh, which is enough to recognize that the
former case is an instance of Apaadaana while the latter is an instance of a simple
locative. Semantically, it is obvious that in the former case there is a sense of
departure (ablative effect) which is lacking in the latter case; yet it would be
semantically anomalous to consider the latter case a simple spatial locative
adverbial, as the ‘plate’ in (42) can be instrumental or even ablative but cannot be
the space where the action of eating takes place. The DREL which such
ablative-marked NPs hold with the root of the clause is labelled as k5. For
example:
kul-i peyTh-I pyov Duun pathar.* (40)
Tree-ABL on-ABL fall-PRF.SG.MAS walnut down
A walnut fell down from the tree.
farooq-an cheyi kawl-i manz-I treysh. (41)
Farooq-ERG drink-PRF.SG.FEM bowl-ABL in-ABL water
Farooq drank water from the bowl.
farooq chu kuTh-is manz paleT-as manz batI kheyvaan.* (42)
Farooq be-PRS.SG.MAS room-DAT.SG.MAS in plate-DAT.SG.MAS in
rice eat-PROG
Farooq eats rice in plate in the room.
farooq-as nish-I draav su raath. (43)
Farooq-DAT near-ABL left he yesterday
He left from Farooq (Farooq’s place) yesterday.
su nafar tsol az militaryvaaly-an nish-I (44)
That man run-PRF.SG.MAS today military man-DAT.SG.MAS from-ABL
That man ran away from the military men today.
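The diagnostic discussed above, distinguishing k5 from a plain locative by the form of the postposition, can be sketched as a one-line rule. Treating a trailing "-I" on the postposition (peyTh-I, manz-I, nish-I versus peyTh, manz, nish) as the ablative exponent is an assumption about the transliteration used in this thesis, and real data would need lexicon-aware handling.

```python
# Sketch of the k5 vs. locative diagnostic: an ablative-marked locative
# postposition signals departure (Apaadaana, k5), an unmarked one signals
# plain spatial location (Desha Adhikarana, k7p).

def source_or_location(postposition):
    return "k5" if postposition.endswith("-I") else "k7p"

print(source_or_location("peyTh-I"))  # k5  (cf. ex. 40: fell from the tree)
print(source_or_location("manz"))     # k7p (cf. ex. 42: eats in the room)
```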
There are cases where the activity represents a process through which a change
of state of a substance occurs. In such cases, the source point from which that
change starts is the NP representing one substance (the raw/natural material),
which changes into another substance (the product), represented by another NP.
The former NP, which undergoes the change, is said to be the Prakruti
Apaadaana, and its DREL with the root is represented by a variant of k5, i.e.
k5prk. The dative-marked NP which represents the base substance or raw
material is the Prakruti Apaadaana. For example:
guur chu dod-as tsaaman banaavaan. (45)
Milkman be-PRS.SG.MAS milk-DAT.SG.MAS cheese make-PROG
The milkman makes cheese out of milk.
sIts chu kapr-as palav suvvaan. (46)
Tailor be-PRS.SG.MAS cloth-DAT.SG.MAS dress-Pl sew-PROG
The tailor makes dresses out of cloth.
In the above examples (45) and (46), it is clear that there are two states of a
substance: the first is the natural or original state and the second is the finished or
changed state of the substance. The former is called Prakruti and the latter
Vikruti.
xiii. Desha Adhikarana <k7p>
The NP chunk which denotes a location in space for an action, involving
different participants like Karta, Karma or Karana, is called Desha Adhikarana.
These include not only the cases which are typical spatial locatives and are
tagged as NST at the POS level, e.g. yetyth (here), hotyth (there), etc., but also
those cases which are typical nouns and are tagged as NN/NNP at the POS level,
e.g. garI (home), saDakh (road), kuTh (room), etc. However, both are NPs at the
chunk level; therefore, any NP which provides a spatial location for an event,
action or state is Desha Adhikarana. It is generally a dative-marked NP followed
by a postposition indicating spatial location. The combination of a -DAT marker
and a locative postposition in Kashmiri, e.g. -as peyTh, has a corresponding
complex postposition in Urdu/Hindi, e.g. ke uupar. It has been found that
whatever role the genitive (ke) performs in the formation of Hindi/Urdu complex
postpositions (where it ceases to be genitive and the postposition projects itself in
a non-compositional way), the dative markers (-i/-yan/-as/-av) in Kashmiri
perform the same role; however, Kashmiri cannot form a complex postposition,
as the dative itself is merely a marker (a bound form), unlike the genitive of
Urdu/Hindi, which occurs as a postposition (a free form). However, the -DAT in
Kashmiri also ceases to be dative when it is followed by a locative postposition,
and the relation it represents has a totally locative interpretation, as if the markers
-i, -yan, -as and -av were no longer dative markers but mere obliqueness
markers. It must be noted that the dative markers are quite controversial in
Kashmiri: Koul and Wali (2006) treat them as -DAT, but Emily Manetta (2008*)
treats them as obliqueness markers. The DREL of Desha Adhikarana with the
root of the clause is labelled as k7p. For example:
tati Os Farooq kursy-i peyTh bih-ith. (47)
there be-PST.SG.MAS Farooq chair-DAT.SG.FEM on sit-PART
There Farooq was sitting on the chair.
su chu bon-i tal aaraam karaan. (48)
He be-PRS.SG.MAS chinaar tree-DAT.SG.FEM under relax do-PROG
He is relaxing under the Chinar tree.
farooq chu kuTh-is manz paleT-as manz batI kheyvaan.* (49)
Farooq be-PRS.SG.MAS room-DAT.SG.MAS in plate-DAT.SG.MAS in
rice eat-PROG
Farooq eats rice in a plate in the room.
mysuur-as manz chu mosam baDi jaan rozaan. (50)
Mysore-DAT.SG.MAS in be-PRS.SG.MAS weather very
pleasant remain-PROG
In Mysore weather remains very pleasant.
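The k7p surface pattern described above, a dative/oblique-marked noun followed by a locative postposition, can be sketched as a simple heuristic. This is an illustrative sketch, not the thesis's actual tooling: the token-list representation of an NP chunk and the surface variant -is are assumptions, while the postpositions are taken from the examples.

```python
# Hedged sketch of the Desha Adhikarana (k7p) surface pattern: a
# dative/oblique-marked noun followed by a locative postposition.
# "-is" is an assumed surface variant of the oblique markers listed above.

OBLIQUE_MARKERS = ("-i", "-is", "-as", "-yan", "-av")
LOCATIVE_POSTPOSITIONS = {"peyTh", "manz", "tal"}   # on, in, under

def looks_like_k7p(np_tokens):
    """np_tokens: surface tokens of one NP chunk, e.g. ['kuTh-is', 'manz']."""
    if len(np_tokens) < 2:
        return False
    noun, postp = np_tokens[-2], np_tokens[-1]
    return any(noun.endswith(m) for m in OBLIQUE_MARKERS) \
        and postp in LOCATIVE_POSTPOSITIONS

print(looks_like_k7p(["kuTh-is", "manz"]))   # (49) room-DAT in   -> True
print(looks_like_k7p(["bon-i", "tal"]))      # (48) tree-DAT under -> True
print(looks_like_k7p(["batI"]))              # bare NP             -> False
```

A real annotation pass would of course consult the POS/chunk tags rather than raw suffix strings; the sketch only makes the –DAT + locative-postposition combination concrete.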
xiv. Kaal Adhikarana <k7t>
The NP chunk which denotes location in time for an event, action or an activity
involving various participants is called Kaal Adhikarana. These not only include
the cases which are typical temporal locatives and are tagged as NST at POS
level, e.g. vun’ (now), patI (latter), etc but also those cases which are typical
nouns and are tagged as NN/NNP at POS level, e.g. shaam-as (in the evening),
pagah (tomorrow), 1950-has manz (in 1990), etc. Both the cases will be NPs at
the chunk level which have adverbial function. Such chunks which provide the
temporal location of an event or an action are usually dative marked NPs. For
example:
subh-as gayizi garI panun. (51)
Morning-DAT go-2PC home own
In the morning go to your own home.
1947-has manz gov heyndoshtaan aazaad. (52)
1947-DAT in go-PRF.SG.MAS India free
In 1947 India got freedom.
raath vot farooq garI paantsi baji. (53)
Yesterday reach-PRF.SG.MAS Farooq home five O’clock
Farooq reached home yesterday at five O’clock.
patI khot tamm-is taph. (54)
then rise-PRF.SG.MAS he-DAT fever
Then he caught fever.
xv. Vishaya Adhikarana <k7>
The NP chunk which denotes the location of an event, action or activity elsewhere, i.e. other than in concrete space and time, is called Vishaya Adhikarana. Generally, the NPs which are non-spatial and non-temporal in nature are marked with dative and followed by a locative postposition, like Desha and Kaal Adhikarana. Such NPs include abstract entities that capture some notion of abstract space. The DREL such an NP holds with the root is labelled as k7. For example:
kejriiwaal chu az kal surkhi-yan manz aasaan. (55)
Kejriwal be-PRS.SG.MAS today tomorrow headline-DAT.Pl.FEM in
remain-PROG
These days Kejriwal continuously remains in the headlines.
myaa-ni kath-i peyTh khot tAmm-is sakh sharaarath. (56)
My talk-DAT.SG.FEM on climb-PRF he-DAT.SG very anger
He became very angry on my argument.
siyaast-as manz vasun chu-nI Thiekh. (57)
Politics-DAT in enter-INF be-PRS.SG.MAS -NEG good
To enter into politics is not good.
tAm’-sInd-is khayaal-as manz chi sA:rii panni-panni shaayi Thiekh. (59)
He-GEN-DAT.SG.MAS idea-DAT.SG.MAS in be-PRS.Pl.MAS all own-own place good
In his opinion everyone is correct at his place.
9.2. Type Two GRs
It includes only two GRs, r6 and rsp. These are non-karaka dependency GRs which hold between two nominals, i.e. between two nouns or between a pronoun and a noun; they rarely occur between two pronouns. The description of the r6 and rsp DRELs is given below.
i. Shashthi <r6: r6-k1, r6-k2>
The NP chunks which have a genitive relation with other NP chunks are the Shashthi. In the Shashthi GR there are, as in any other dependency relation, a dependent and a head: the possessor and the possessed. The first NP is usually the possessor and the second NP the possessed, and it is the possessed NP which is the head and the possessor the dependent. This is the only GR that holds between two nominals and not between a nominal and a verb root.
For example:
farooq-un boD bOy chu polices-as manz. (60)
Farooq-GEN elder brother be-PRS.SG.MAS police-DAT in
Farooq’s elder brother is working in the police.
tAm’ sInz majbuurii ma kariv nazar andaaz. (62)
He of-SG.FEM predicament not do-Pl ignore
Do not ignore his predicament.
kaam-i hInz jaldii kin’ draav su garI. (63)
work-ABL.FEM of-SG.FEM hastiness because of leave-PRF he home
Because of urgency of work he went home.
yeti-ch sadakh chi vaariyah kharaab. (64)
here-GEN road be-PRS.SG.MAS very bad
The road of this place is badly damaged.
makaan-uk pash peyov vAs’. (65)
house-GEN roof fall-PRF down
The roof of the house fell down.
The label <r6> is an underspecified one which has two realizations or variants, i.e. r6-k1 and r6-k2. <r6-k1> is assigned if the possessed is k1, i.e. if the head NP is k1 and the dependent NP is attached to it. Similarly, if an NP is in a genitive relation with another NP and the head NP is k2, the dependency relation which the dependent NP has with it is labelled as r6-k2.
For example:
insaan-I sund athI chu vaariyah qismI-chi qaami heyqaan kArith. (66)
Human-ABL.MAS of-SG.MAS hand be-PRS.SG.MAS lot of
type-GEN.Pl.FEM work-Pl able do-NF
Human hand can do many types of work.
In the above sentence (66), the first genitive-marked NP is dependent on the NP which has a k1 relation with the root of the sentence, and the second genitive-marked NP is dependent on the NP which has a k2 relation with the root.
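The specialisation of the underspecified <r6> label can be written out as a tiny rule; the function below is a toy sketch under assumed label strings, not part of the Sanchay scheme itself.

```python
# Toy sketch of specialising the underspecified <r6> label: the genitive
# dependent's label copies the karaka of its head NP when that head bears
# k1 or k2 with the root; otherwise plain r6 is kept.

def specialise_r6(head_label):
    if head_label in ("k1", "k2"):
        return f"r6-{head_label}"
    return "r6"   # no specialisation for other heads

# sentence (66): first genitive NP hangs off the k1 NP, second off the k2 NP
print(specialise_r6("k1"))   # r6-k1
print(specialise_r6("k2"))   # r6-k2
print(specialise_r6("k7p"))  # r6
```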
ii. Duratives <rsp>
The NP chunks which indicate the duration (temporal) or span (spatial) of an event, an action or a state are durative expressions. There are two points in durative expressions, viz. a starting point and an end point. The duratives, consisting of two NPs, may function as temporal, spatial or manner adverbs. The starting-point NP depends upon the end-point NP and the DREL between them is labelled as <rsp>. In Kashmiri, the initiating postposition is the ablative-marked locative (peyTh-I) and the terminating postposition is the dative-marked locative (taam). For example:
1982(-I) peyTh-I 2012-as taam ruudus bI baDI khosh. (67)
1982(-ABL) on-ABL 2012-DAT to remain-PRF-1PC.SG.MAS I very
happy
From 1982 to 2012 I remained very happy.
Kashmiri-I peyTh-I kanyaakumaarii taam cha-nI kahn-ti train. (68)
Kashmir-ABL on-ABL Kanyakumari to be-PRS.SG.Fem-NEG any-Emp
train
There is no train from Kashmir to Kanyakumari.
9.3. Type-Three GRs
These include the rd, rh, rt, ku, ras, rad, rac, rab and rin GRs, which are also non-karaka dependency relations holding between a dependent NP and the root of the clause, like the Type-one GRs but unlike the Type-two GRs which hold between a dependent NP and a non-root head. The description of these relations is given below:
i. Prati <rd>
The NP chunk which indicates the direction of an activity is the Prati or the directional NP. In Kashmiri, kun is the directional postposition and any NP containing the directional postposition is Prati. It has been observed that, like the NPs containing locative postpositions, the NPs which contain directional postpositions are marked with dative. Also, like the locative NPs, the directional NPs are mere adjuncts. It must be noted that certain directionals do not really show the direction of an activity but rather a metaphorical direction, as shown in (72). The GR which such NPs bear with the root is labelled as <rd>. For example:
farooq draav darvaaz-as kun. (69)
Farooq go-PRF door-DAT towards
Farooq went towards the door.
su chu mea kun vuchaan. (70)
He be-PRS.SG.MAS me towards see-PROG
He is looking towards me.
su chu dochun kun pakaan. (71)
He be-PRS.SG.MAS right towards walk-PROG
He moves towards the right.
tim luukh A:s’ puliis-as mukhA:lif naarI bA:zii karaan. (72)
Those people be-PST.Pl.MAS police-DAT against protest do-PROG
Those people were protesting against the police.
ii. Hetu <rh>
The NP chunk which indicates the reason or cause of an activity is the Hetu or the reason NP. In Kashmiri, sI:t’ and kiny are the reason postpositions and any NP containing a reason postposition is Hetu. However, these NPs need to be clearly distinguished from the instrument NPs, which also use the sI:t’ postposition; sI:t’ is actually an instance of case syncretism in Kashmiri. The GR which such NPs bear with the root is labelled as <rh>. For example:
tami vajah kiny heyok-nI police-an su band thA:vith. (73)
That reason because of can-PRF.SG.MAS-NEG police-ERG he close keep-PRF
Because of that reason the police couldn’t keep him locked up.
tami sabb-I gov mea hana tseyr. (74)
That reason-ABL go-PRF.SG.MAS I little late
Because of that reason I got a little late.
ami sardii sI:t’ pevos bI ti beymaar. (75)
This cold due to fall-PRF.1P.SG.MAS I too ill
I too have fallen ill due to/because of this cold.
ShoTh Os dagi sI:t’ chwrI chwrI karaan. (76)
I be.PST.SG.MAS pain due to shiver-PROG
I was shivering due to pain.
iii. Taadarthya <rt>
The NP chunk which indicates the purpose of an activity is the Taadarthya or the purpose NP. In Kashmiri, khA:trI, mokhI, muujuub and baapath (for) are the purpose postpositions and any NP containing a purpose postposition is Taadarthya. However, sometimes a VGNN also performs the purposive role, as shown in (80), in which a gerund is marked with the purposive case. The GR which such NPs bear with the verb root is labelled as <rt>. For example:
farooq-ni khA:trI pyov yi soorui karun. (77)
Farooq-GEN for fall-PRF.SG.MAS this all do-INF
All this needed to be done for Farooq.
kaam-i baapath aas bI yor. (78)
Work-ABL for come-PRF.SG.MAS I here
I came here for work.
chaan-i mokhI/muujuub gatshi sw shahar. (79)
You-GEN for go-FUT she Srinagar
For you she will go to Srinagar.
batI khe-yth draav su shong-ni. (80)
Rice eat-PART go-PRF he sleep-GER-PUR
Having eaten rice he went for sleeping.
iv. Saadrishya <ku: k1u, k2u>
The NP chunk which indicates the similarity or comparison (expressed through predication) between two entities is the Saadrishya or the comparand NP. In Kashmiri, khotI and nish are the comparand postpositions while pA:Th’ and hiuv or hish are the similative postpositions. Hence, any NP containing a comparand or similative postposition is Saadrishya. However, the NPs with the pA:Th’ postposition must not be confused with adverbials like Thiekh-pA:Th, and the nish postposition must not be confused with the locative postposition. The forms nish and pA:Th’ are ambiguous and can perform two functions in two contexts; these are actually two more instances of case syncretism in Kashmiri. The GR which such NPs bear with the root can be labelled as k*u, but if the comparison or similarity is with the SUB NP, the GR of the comparand NP is labelled as k1u and, similarly, if it is with the OBJ NP, the GR is labelled as k2u. Actually, the star mark (*) can be seen as a variable which is substituted with any karaka, depending upon the comparee. For example:
koshur treebank chu vuni hindi tI urdu treebank-av khotI/nish vaariyaa lokut. (81)
Kashmiri treebank be-PRS.SG.MAS yet Hindi and Urdu treebank-ABL.Pl.MAS as compared to very small
The Kashmiri treebank is as yet very small as compared to the Hindi and Urdu treebanks.
farooq chu bAshiir-In’ pA:Th’ rut insaan. (82)
Farooq be-PRS.SG.MAS basher-GEN like good person
Farooq is a good person like Bashir.
bAshiir oos deyv hiuv insaan. (83)
Bashir be-PST.SG.MAS giant like-SG.MAS person
Bashir was a giant-like man.
faaroq-as baasey sw kuur nowshiin-as hish. (84)
Farooq-DAT feel-PRF.SG.FEM that-SG.FEM girl Nowsheen-DAT like-SG.FEM
That girl looked like Nowsheen to Farooq.
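The star-as-variable reading of <k*u> described above can be made concrete in a few lines. The sketch below uses the postposition sets from the description; the function itself and its string representation of labels are illustrative assumptions, not the annotation tool's logic.

```python
# Sketch of the '*' variable in <k*u>: the star is filled in with the
# karaka of the comparee, so comparison with the SUB (k1) yields k1u and
# comparison with the OBJ (k2) yields k2u.

COMPARAND = {"khotI", "nish"}
SIMILATIVE = {"pA:Th'", "hiuv", "hish"}

def saadrishya_label(postposition, comparee_karaka):
    """Return the resolved k*u label, or None for a non-Saadrishya NP."""
    if postposition in COMPARAND | SIMILATIVE:
        return comparee_karaka + "u"
    return None

print(saadrishya_label("khotI", "k1"))   # (81): compared with the SUB -> k1u
print(saadrishya_label("hish", "k2"))    # compared with the OBJ -> k2u
print(saadrishya_label("manz", "k1"))    # locative, not Saadrishya -> None
```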
v. Upapaada-sahakaaraktawa <ras: ras-k1, ras-k2, ras-neg>
The NP chunk which indicates the association of an entity with another entity in performing an activity is the Upapaada-sahakaaraktawa or the associative NP. In Kashmiri, sI:t’, saan, heyth, bagA:r and varA:y are the associative postpositions and any NP containing an associative postposition is Upapaada-sahakaaraktawa. sI:t’ is a positive associative postposition while bagA:r and varA:y are negative associative postpositions. The GR which such NPs bear with the verb root can be labelled as <ras>, but when the NP is an associative of the SUB it is labelled as <ras-k1>, and when it is an associative of an OBJ it is labelled as <ras-k2>. However, when there is a negative associative postposition it is labelled as <ras-neg>.
For example:
su draav pann-is mA:l-is sI:t’ chakr-as. (85)
He go-PRF.SG.MAS own-DAT father-DAT with walk-DAT
He went on a walk with his father.
hindostaan chu chiin-as sI:t’ kath baath karn-I baapath tayaar.* (86)
India be-PRS.SG.MAS China-DAT with talk do-GER-GEN for ready
India is ready to talk with China.
farooq-an kheyov tshuunTh deyl heyth/deyl-I saan. (87)
Farooq-ERG eat-PRF.SG.MAS apple peel with/along with
Farooq ate apple with/along with peel.
miltry-vA:l’ A:s’ saaman-I varA:y/ bagA:r natsaan. (88)
Military men be-PST.Pl.MAS weapons-ABL without roaming
Soldiers were roaming around without weapons.
vi. Address Terms <rad>
The NP chunk which indicates the addressing of some person bears a DREL with the verb root which is labelled as <rad>. Some of the address terms are overtly marked with vocative case while others are inherently vocative. For example:
moj-ai, mea di-tay batI. (89)
Mother-VOC.SG.MAS me give-2PC.SG.FEM rice.
Mother, give me rice.
farooq-aa, tala yuur’ yi. (90)
Farooq-VOC.SG.MAS tala-MOOD here come
Farooq, you come here.
jinaab, tAm’ von zi bI gatsh-I garI. (91)
Sir, he say-PRF that I go-FUT.1PC.SG home
Sir, he said that he will go home.
hayaa, so kitaab maqlA:v-tha pAr-ith. (92)
Hey-SG.MAS that book finish-WH.SG.MAS read-PART
Hey! Did you finish that book?
vii. Information Source <rac>
The NP chunk which indicates the source of a piece of information or a point of view, which may or may not be a person, is the information-source NP. In Kashmiri, the NP consisting of a genitive-marked nominal and an informational postposition (mutA:bik or hisaabI) bears a DREL with the verb root which is labelled as <rac>. For example:
example:
farooq-ni mutA:bik pazi mea garI vaapas yun. (93)
Farooq-GEN according to should I home back come-GER
According to Farooq I should come back home.
chaa-ni hisaab-I os su apuz vanaan. (94)
You-GEN way-ABL be-PST.SG.MAS he lie tell-PROG
According to you he was telling a lie.
viii. Information Target <rab>
The NP chunk which indicates an entity towards which all the information of a proposition is directed is the information-target NP. It can also be a clause. In Kashmiri, the NP consisting of a dative-marked nominal and the target postposition mutliq (about) bears a DREL with the verb root which is labelled as <rab>. For example:
farooq-an vAn’ mea bAshiir-as mutliq akh kath. (95)
Farooq-ERG tell-PRF.SG.FEM me Bashir-DAT about one talk
Farooq told me something about Bashir.
ix. Hurdle <rin>
The NP chunk which indicates a hurdle in an activity that has been overcome is the hurdle NP. The hurdle can also be a whole clause. In Kashmiri, the NP consisting of a genitive-marked nominal and the hurdle postposition baavojuud (in spite of, despite) bears a DREL with the verb root which is labelled as <rin>. For example:
tAm’sIndi inkaar karnI baavojuud pyo mea tor gatshun. (96)
He-GEN denial in spite of fall-PRF.SG.MAS I there go-INF
In spite of his denial, I had to go there.
pareshA:ni-yav baavojuud ruud-us su panIn’ kA:m karaan. (97)
Difficulty-Pl.FEM in spite of keep-PRF.SG.MAS-1PC he own work do-PROG
In spite of difficulties, he kept doing his work.
9.4. Type Four GRs
Type-Four relations essentially capture the adverbial relations of inherent adverbs and participles. Since the same participles also modify nouns, participial noun modifiers are also included in this class of GRs. It includes adverbials (adv and sent-adv), vmod (vmod_Rh and vmod_Inst) and nmod.
i. Adverbial <adv and sent-adv>
The manner adverbs which are discontinuous and hence do not form a chunk with the verb but project their own chunks, tagged as RBPs, depend on the verbal root, and the DREL they bear with the root is labelled as <adv>. Similarly, the discourse particles which form BLK chunks are considered to be sentential adverbs, and the DRELs they bear with the verb root are labelled as <sent-adv>. It must be noted that the other traditional adverbs, i.e. adverbs of time and place, do not form RBPs and hence do not bear the adv DREL with the root. Examples of RBP and BLK chunks that bear the adv and sent-adv DRELs, respectively, with the root are given below:
su draav vaarI vaarI garI kun. (98)
He leave-PRF.SG.MAS slowly towards home.
He left towards home slowly.
tim chi yim-I kath-I baar baar karaan. (99)
They be-PRS.Pl.MAS these talk-Pl again do-PROG
They talk about these things again and again.
yithI-pA:Th’ gatsh-an-nI yim-I kath-I karIn-i. (100)
This like should-2PC-NEG these talk-Pl do-INF.Pl
You shouldn’t talk like this.
ii. Participial Verb Modifier <vmod>
The participial forms, which may or may not constitute non-finite clauses, are projected as chunks and are attached to the root in order to show that they bear a modifier relation with it. In Kashmiri, the –ith and –aan forms of the verb are participial forms and constitute VGNF chunks. The –ith forms like shongith (having slept), bihith (having sat), tsA:p’ith (having chewed), etc. may be sequential, consequential or instrumental in nature as far as their role is concerned, while the –aan forms like pakaan pakaan (while walking), gindaan gindaan (while playing), etc. are actually progressive/habitual forms but express simultaneity on reduplication and function as verb modifiers. The –ith forms are also reduplicated, but when reduplicated they change their form: shongith (having slept) changes into shong’ shong’ to encode simultaneity and sequentiality together in a more complex way in order to express manner. The DRELs which all these variants of VGNF bear with the root are labelled with an underspecified label <vmod>. For example:
batI khe-yth draav su shong-ni. (101)
Rice eat-PART go-PRF he sleep-GER-PUR
Having eaten rice he went for sleeping.
mAr’-mAr’ chu su kA:m karaan. (102)
die-PART-RED be-PRS-SG.MAS he work do-HAB
He works slowly.
pakaan-pakaan os su malaayi kulfii kheyvaan. (103)
Walk-HAB-RED be-PST.SG.MAS he ice cream eat-PROG
While walking he was eating ice cream.
iii. Participial Noun Modifier <nmod>
Some participial forms are projected as VGNF chunks like the verb-modifier participials mentioned above, but are attached to the modified noun instead of the root to show their modifier relation. In Kashmiri, the –vun, –vol and –ith forms are noun-modifier participials. The –vun forms like asvun (laughing) and vudvun (flying), the –vol forms like asanvol (laughing) and natsanvol (dancing) and the –ith forms like bihith (sitting) and shongith (sleeping) are the noun-modifier participials, and the DREL they bear with the nouns is labelled as <nmod>.
For example:
asvun insaan chu saarinIy khosh kar-aan. (104)
smile-PART person be-PRS.SG.MAS all happy do-HAB
All people like a smiling person.
gindan-vol insaan chu chust dusrust rozaan. (105)
Play-PART person be-PRS.SG.MAS healthy remain-HAB
The person who plays remains healthy.
kullis peyTh bihith kaav chu Taav Taav kar-aan. (106)
tree-DAT on sit-PART crow Taav Taav do-PROG
The crow sitting on the tree is crowing.
9.5. Type Five GRs
It includes clausal modifications brought about by relative clauses. Relative clauses are embedded clauses introduced by relative pronouns. The relative pronouns have corresponding pronouns in the matrix clause and, therefore, these clauses are called relative–correlative constructions. In Kashmiri, yus–su, yuth–tyut, and yithIpA:Th’–tithIpA:Th’ are relative–correlative elements. Relative clauses modify nouns/pronouns, adjectives and adverbs. The description of such GRs is given below:
i. Relative clause Nominal Modification <nmod_Relc>
In Kashmiri, the relative element which introduces a relative clause to modify a nominal is yus, and its correlative is su or any other noun. These relative elements are either relative pronouns or relative demonstratives. The DREL which these relative clauses bear with the non-root nominal head is that of noun description, which is different from the normal nominal modification brought about by inherent adjectives or participials. Although the name used for noun description is still nominal modification, it is labelled differently, as <nmod_Relc>. For example:
For example:
su nafar yus otrI gov garI yiyi pagah vaapas. (107)
That man who day before yesterday go-PRF.SG.MAS home be-FUT.SG
tomorrow return
The man who went home day before yesterday will return tomorrow.
su nafar yiyi pagah vaapas yus otrI garI gov. (108)
That man be-FUT.SG tomorrow return who day before yesterday home go-PRF.SG.MAS
The man will return tomorrow who went home day before yesterday.
bI chus sw kitaab paraan yath peyTh bavaal os voth-mut. (109)
I be-PRS.SG.MAS .1PC that-SG.FEM book read-PROG
which-DAT on hue and cry be-PST.SG.MAS stand-PRF.SG.MAS
I am reading that book on which there was a hue and cry.
tAm’ dits tas lADkI-as kitaab yAm’ tas mAnj’. (110)
He give-PRF.SG.FEM that-DAT boy-DAT book who-ERG
ask-PRF.SG.MAS him
He gave the book to the boy who asked him for it.
farooq-an lyokh tami qalmI sI:t’ zindagii hund falsafI yami sI:t’
tAm’ mea kA:trI akh chiTh leych-mIts A:s. (111)
Farooq-ERG write-PRF that pen with life of philosophy which with
he-ERG I for one letter write-PRF.SG.FEM be-PST.SG.FEM
Farooq wrote philosophy of life with that pen with which he had
written a letter to me.
ii. Relative clause Adjectival Modification <jjmod_Relc>
The relative elements which introduce relative clauses in order to modify deictic adjectives are yuth, yith’, yitsh and yitshI, and their corresponding correlatives are tyuth, tith’, titsh and titshI, respectively. For example:
su chu huu-ba-huu tiyuth-ui yuth tas panun mwl os. (112)
He be-PRS.SG.MAS exactly like-that like-which he-DAT
his father be-PST.SG.MAS
He is exactly like that like which his father used to be.
tiyuth-ui jacket An’-zi yuth mea raath on. (113)
like-that-EMP jacket buy-IMP.2PC like-which I buy-PRF yesterday
You buy that kind of jacket like which I bought yesterday.
iii. Relative clause Adverbial Modification <rbmod_Relc>
The relative elements which introduce relative clauses in order to modify an adverb or a deictic adverb are yIthI-pA:Th’, yithI-kIn’ and yami tAriiqI, and their corresponding correlatives are tithI-pA:Th’, tithI-kIn’ and tami tAriiqI. For example:
tithay-pA:Th’ kAri-zi az ti rut dance yithIpA:Th’ raath kor-uth. (114)
Like-that do-IMP.2PC today also nice dance like-which yesterday
do-PRF.SG.2PC
You dance as nicely today also as you did yesterday.
yemi tAriiqI tami mea von tami tAriiqI pyov mea karun. (115)
Like-which she I-DAT tell-PRF like-that had I-DAT do-INF
As she told me I had to do like that.
9.6. Type Six GRs
It includes a set of non-dependency relations that hold between two conjoined elements or clauses of equal status, and between two elements or clauses of unequal status. The former is the coordination relation and the latter the subordination relation. These are basically structural relations that keep the organization of compound, complex, compound-complex and complex-compound clauses intact; they have nothing to do with the modification of the finite verb but everything to do with tying together one or more chunks at the intra-clausal level or one or more VGF chunks at the inter-clausal level. The description of these relations is given below.
i. Coordinating Conjunct <ccof>
The CCP chunk which conjoins two chunks within a clause is the head of the conjoined chunks and is attached to the root of the clause in which they occur; the two conjoined chunks are in a symmetrical relation with each other and both bear the ccof relation with the CCP. However, the CCP chunk which occurs at the inter-clausal level is considered the root of compound and compound-complex sentences that involve many finite clauses. In Kashmiri, CCPs include tI (and), kinI/yaa (or), etc. It must be noted that sometimes even commas can function as CCPs. The GR which the two chunks or clauses bear with their respective heads, which in both cases is the CCP, is labelled as <ccof>, as shown in Fig.3 and Fig.4 for examples (116) and (117). For example:
farooq tI bAshiirI chi dohdish paanIvan’ chob chob karaan. (116)
Farooq and Bashir be-PRS.Pl everyday each other hit hit do-HAB
Everyday Farooq and Bashir fight with each other.
farooq chu lamaan dwchun tI bAshiirI chu lamaan khovur. (117)
Farooq be-PRS.SG.MAS pull-HAB towards right and Bashir
be-PRS.SG.MAS pull-HAB towards left.
Farooq pulls towards right and Bashir pulls towards left.
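The two attachment patterns for coordination can be sketched as dependency triples. This is an illustrative encoding invented for the sketch (the tuples and the placeholder `dep` label for the CCP's own attachment are assumptions, not the SSF representation): intra-clausally the CCP heads the conjuncts and itself depends on the clause root, while inter-clausally the CCP is the root and the finite VGF chunks attach to it.

```python
# Sketch of the two <ccof> configurations as (dependent, head, label)
# triples. 'dep' is a placeholder for the CCP's own relation to the root.

def ccof_edges(ccp, conjuncts, clause_root=None):
    """Return dependency triples for one coordination."""
    edges = [(c, ccp, "ccof") for c in conjuncts]
    if clause_root is not None:          # intra-clausal: CCP depends on root
        edges.append((ccp, clause_root, "dep"))
    return edges                         # inter-clausal: CCP is the root

# (116): 'farooq tI bAshiirI' conjoined inside one clause headed by 'karaan'
print(ccof_edges("tI", ["farooq", "bAshiirI"], clause_root="karaan"))
# (117): two finite clauses (VGF chunks) joined by 'tI', which becomes root
print(ccof_edges("tI", ["VGF-1", "VGF-2"]))
```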
ii. Sub-ordinating Conjunct <ccof>
Instead of symmetrically conjoining chunks and clauses, some CCP chunks introduce new chunks and new clauses and thus enter into embedding by joining two chunks or clauses asymmetrically. The embedded clauses introduced by complementizers are considered OBJs, as already discussed, but the rest of the rooted complementizer clause attaches to the complementizer, which acts as the syntactic head of the complement clause, not the VGF. Therefore, the complementizer, e.g. zi or ki (that), is the dependent (OBJ) of the root of the matrix clause but, simultaneously, it is also the head of the complementizer clause, as shown in Fig.5, which is the graphic representation of (118). The GR which the VGF of a complementizer clause bears with the complementizer is also labelled as ccof, as shown in Fig.5. For example:
farooq-an chu vanaan zi tas chu-nI kahn-ti veytsaan. (118)
Farooq-ERG be-PRS.SG.MAS say-HAB that he-DAT
be-PRS.SG.MAS-NEG no one-EMP impress-HAB
Farooq says that no one impresses him.
Figure.3 Showing Intra-clausal ccof in (116)
Figure.4 Showing Inter-clausal ccof in (117)
Figure.5. Showing Sub-ordinating ccof in (118)
9.7. Type Seven GRs
It includes an entirely different set of GRs, pof and fragof, which are actually innovations to handle certain crucial phenomena like complex predication and the V2 phenomenon. Without these relations, it would have been difficult to account for the structures involving such phenomena. The description of these relations is given below:
i. Part of Verb <pof>
Some nouns, adjectives and participle forms combine with certain verbs which are bleached of their original semantics due to grammaticalisation and are hence called light verbs. Such combinations with a light verb give a pure sense of predication in South Asian languages and are called complex predicates or conjunct verbs (see Butt 2005). A generalized internal structure of these complex predicates is (Noun/Adjective/Participle + verbalizer). As in Hindi/Urdu, complex predicates are productive in Kashmiri, in which participles are also involved in complex-predicate formation in addition to nouns and adjectives. In Kashmiri, the commonly occurring light verbs are karun (to do), niyun (to take), tshunun (to enter), etc. As illustrated above in Figure.3, the GR of these nominals, adjectivals and participles with the light verb, which is projected as VGF, is a non-dependency relation and is labelled as <pof>. For example:
tAm’ kyaa chu vunyuk taam hA:sil kormut. (119)
He what be-PRS.SG.MAS now till achieved
What has he achieved till now?
achaanak gov su mea broThI kani pA:dI. (120)
Suddenly go-PRF he I front in appear
Suddenly he appeared in front of me.
tAm’ tuj’ su vuchith vwTh. (121)
He lift-FEM he see-PART jump
Having seen him he jumped.
tAm’ kor ti saarinIy bronTh kani zA:hir. (122)
he do-PRF.SG.MAS that everyone front in reveal
He revealed it before everyone
tAm’ diyut ath savaal-as akh rut javaab. (123)
He give-PRF that question-DAT a nice answer
He gave a nice answer to that question.
sw chi farooq-In sakh tA:riif karaan. (124)
She be-PRS.SG.FEM Farooq-GEN very praise do-HAB
She praises Farooq very much.
su chu pann’ galtiyi qwbuul karaan. (125)
He be-PRS.SG.MAS his mistake-Pl accept do-HAB
He accepts his mistakes.
Although there are various diagnostics for identifying complex predicates (Mohanan 1994; Butt 2004; Chakrabarty et al. 2007; Bhatt 2008), identifying them is still not an easy task and, hence, their annotation is also a confusing job. The problem in identifying them is that sometimes it is difficult to figure out whether the nominal part is an OBJ or not. Intuitively, it appears that the nominal is a part of the complex predicate, but syntactically, as far as the sub-categorisation frame is concerned, it appears to be an OBJ, as can be seen in example (123) above.
ii. Fragment of Verb <fragof>
It has been observed that in finite clauses the tensed verbal element (VAUX), which is projected as an AUXP chunk, occurs at the second position, while the un-tensed lexical part (VM), which is projected as a VGF chunk, occurs at the final position of the clause. This disjunctive or discontinuous occurrence of the tensed and lexical verbal elements is due to the fact that Kashmiri exhibits the V2 phenomenon, like German. Since such elements of the finite verb do not occur contiguously, as they do in other Indo-Aryan languages, they do not form a single verb chunk; instead they form two chunks, AUXP and VGF. The VGF is the root of a clause, and most of the other chunks are its dependents and are attached to it as per the current scheme. However, the AUXP is not a modifier of the VGF in any sense but a tensed fragment of it which has fallen apart. This ‘fragment-of’ GR is shown by attaching the AUXP chunk to the root like the dependents and labelling the relation as <fragof>. For example:
farooq chu tsuunTh kheyvaan. (Active Voice) (126)
Farooq be-PRS.SG.MAS apple eat-PROG
Farooq is eating an apple.
farooq-ni zAryi aav tsuunTh khey-nI . (Passive Voice) (127)
Farooq-GEN by come-PRF apple eat-PASS
An apple was eaten by Farooq.
farooq ch-aa tsuunTh kheyvaan? (Interrogative) (128)
Farooq be-PRS-WH apple eat-HAB/PROG
Is Farooq eating an apple/does Farooq eat an apple?
As aforementioned, Kashmiri exhibits the verb-second (V2) phenomenon, which has been argued by Raina (1991) to be a PF-level constraint. In Kashmiri, tensed clauses are subject to the verb-second constraint, due to which the finite verbal element always occurs in the second position, i.e. the position following the first constituent. At the surface level Kashmiri shows V2 like German, except that in Kashmiri V2 appears in both main and embedded clauses; at the deep level, it is argued that the underlying word order of Kashmiri is SOV, like German, for which the evidence comes from non-finite and relative clauses.
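The V2-driven split and the resulting <fragof> attachment can be sketched as a small heuristic. The (text, tag) pair representation of chunks is an assumption made for illustration; the tags AUXP and VGF are those used in the scheme above.

```python
# Heuristic sketch of the fragof attachment: the tensed auxiliary (AUXP) in
# second position is attached to the clause-final lexical verb (VGF, the
# clause root) with the label <fragof>.

def fragof_edge(chunks):
    """chunks: ordered (text, tag) pairs; return (AUXP, VGF, 'fragof') or None."""
    auxp = next((text for text, tag in chunks if tag == "AUXP"), None)
    vgf = chunks[-1][0] if chunks and chunks[-1][1] == "VGF" else None
    if auxp and vgf:
        return (auxp, vgf, "fragof")
    return None

# (126) farooq chu tsuunTh kheyvaan: 'chu' is V2-position AUXP,
# 'kheyvaan' is the clause-final VGF root
chunks = [("farooq", "NP"), ("chu", "AUXP"),
          ("tsuunTh", "NP"), ("kheyvaan", "VGF")]
print(fragof_edge(chunks))   # ('chu', 'kheyvaan', 'fragof')
```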
9.8. Type Eight GRs
It includes a non-grammatical relation which, though of no significance in accounting for the structure of a clause, is part of the sentence in the corpus. It covers enumerators, or serial numbers for sentences, which are non-structural parts. Even though enumerators are of no grammatical significance, they are important to account for, as they are integral elements of the corpus. Enumerators can be projected as BLK chunks and can be attached to the root of the clause. So far these relations have not been labelled, as the enumerator elements were not present in the corpus. However, the relation between the enumerator BLK and the root can be labelled as <emn>. For example:
1. akh nafrah chu vat-i pakaan. (129)
1 one man be-PRS.SG.MAS road-DAT walk-PROG
1. One man is walking on the road.
5. Annotating Inter-chunk GR Relations
Marking inter-chunk grammatical relations (dependencies or non-dependencies) involves syntactic parsing and its annotation. The chunked corpus, a set of GRs and the SA Interface are prerequisites for carrying out the annotation of inter-chunk GRs. The chunked Kashmiri corpus, the development of which has been described in chapter five, is used for the current task. The set of GRs, given in section four of this chapter, provides all the necessary relational labels along with their description and illustration. The same annotation interface (the Sanchay SA Interface) which was used for POS-level and chunk-level annotation has also been used for the current syntactic annotation. The process of syntactic annotation has been carried out manually in the SA Interface of Sanchay. The entire process of annotation is illustrated with reference to the following example sentence taken from the corpus.
kAshiir-i manz haalI-keyn doh-an manz shAhrii halaaqts-an hund silsilI
teyzn-I kin’ chu salaamtii maahol mutA:sir sapud-mut.
Kashmir-DAT.SG.Fe in recent-GEN.Pl.MAS day-DAT.Pl.MAS in
civilian death-GEN.Pl.FEM of-SG.MAS spree intensity-ABL towards be-
PRS.SG.MAS security condition affect-PRF.SG.MAS
The security conditions have been affected in Kashmir because of increase
in recent death sprees of civilians.
The various steps that were involved in the annotation of the above sentence are
given below:
Step-1 Opening Chunked Data in SA Interface
The chunked corpus file is opened in the interface which shows POS level nodes
as well as chunk level nodes in SSF format.
Figure.6. SA Interface Showing a Chunked Sentence
Step-2 Opening in Tree Viewer Window
In order to attach the various types of chunks to the root and other non-root heads, the sentence needs to be opened in the tree viewer by clicking on the ‘View Dependency Tree’ button on the right side of the window, indicated by the arrow. Once this is done, the chunks are displayed as below. Each chunk has already been automatically assigned an ID number according to its position in the sentence. Here the chunks are displayed in the same order but from left to right. The displayed sentence is constituted of two clauses: one finite clause with the VGF (root) as its ultimate head and one non-finite modifying clause with VGNN as its ultimate head.
Figure.7. SA Interface Displaying Various Chunks
Step-3 Finding Root and Target Chunk
Once the chunks are displayed in the tree viewer, the root of the sentence, i.e. the VGF chunk, needs to be identified so that the rest of its dependents and their relations with it can be annotated. In tensed clauses the root occurs at the final position of the sentence, as shown below in Fig.8 by the arrow. After finding the root of the sentence, the target chunk which can be attached to the root needs to be identified. First of all, the closest element of the verb root, i.e. AUXP, needs to be identified and attached if it is a tensed clause, and then the NP, JJP, or VGNF needs to be identified and attached if the root is the light verb of a complex predicate projected as VGF. This needs to be done with first priority in order to get the complete information (inflectional and lexical) about the root, as shown in Fig.9, so as to decide upon its sub-categorization frame. Therefore, the first target chunk in the finite clause of the given sentence (chu salaamtii maahol mutA:sir sapud-mut) was AUXP, which is the tensed part of the VGF, and the second target chunk would be JJP, which is the adjectival part of the VGF, a complex predicate.
Figure.8. SA Interface Showing the Identified Root Chunk (VGF)
Figure.9. SA Interface Showing the Identified Target Chunk (AUXP)
Step-4 Drag and Drop of Target Chunk
Once the target chunk (AUXP) is identified, it can be attached to the VGF root by the drag-and-drop method, as shown in Fig.10 by an arc. This creates an undefined relation between AUXP and the root, as shown by the arrow. The relation needs to be identified and labeled according to the set of labels given in the table.
Figure.10. SA Interface Showing Attaching of the Target Chunk to the Root
Step-5 Choosing Relational Label
In this step, a dropdown list of relational labels can be opened by simply left-clicking on the dependent node, i.e. AUXP. Clicking on the node will open a dialog box, as shown in Fig.11, and clicking on the OK button of the dialog box will open a dropdown list of relational labels, as shown in Fig.12, from which an appropriate label can be chosen.
Figure.11. SA Interface Showing Undefined GR between AUXP and VGF
Figure.12. SA Interface Showing Selected GR between AUXP and the Root
Once the OK button of the dropdown list is clicked, the selected label, i.e. fragof, gets assigned to the previously undefined relation between AUXP and VGF, as shown in Fig.13 by an arrow.
Figure.13. SA Interface Showing FRAGOF GR between AUXP and the Root
The same procedure is applied to the next target, i.e. the JJP chunk, which is attached to the root with the relational label pof, as shown in Fig.14 with the help of an arrow. Once the complete information about the root is available, it is easy to identify and attach the other dependents, both arguments and adjuncts, and to decide upon their DRELs.
Figure.14. SA Interface Showing partof and fragof Attachments to the Root
Step-6 Annotating Rooted Dependencies
With the sub-categorization frame of the complex predicate, which is the root of the finite clause, in hand, it becomes obvious that the next NP is its argument, though there was a little confusion over whether it bears the k1 or the k2 DREL with the root; initially it seemed to be k2. Therefore, it was attached to the root by the same drag-and-drop method which was used to attach the other chunks. The DREL it holds with the root was annotated as k2, as shown in Fig.15 with the help of an arrow.
Figure.15. SA Interface Showing k2, partof and fragof Attachments to the Root
In this way, the syntactic annotation of the finite clause of the given sentence (chu salaamtii maahol mutA:sir sapud-mut) was completed, in which the three chunks AUXP, JJP and NP have been attached to the root VGF with the attachment labels fragof, pof and k2 respectively.
Step-7 Annotating Non-Rooted Dependencies
In this step, the annotation of the non-finite clause (kAshiir-i manz haalI-keyn doh-an manz shAhrii halaaqts-an hund silsilI teyz-nI kin’) of the sentence was taken up. The first point was to identify the head of the entire non-finite clause, i.e. VGNN, and the target chunk that needs to be attached first. However, there were some other dependents which, instead of depending on VGNN, were dependents of NPs which in turn were dependents of VGNN. Such cases needed to be taken care of first, so that one could later fully concentrate on the attachments of VGNN and avoid errors in the annotation. Therefore, attaching to VGNN was postponed and instead the next genitive-marked dependent NP was attached to its head NP, and the attachment was labeled r6, as shown in Fig.16. Having finished this, the head NP along with its own attachment was itself attached to the ultimate head of the clause, i.e. VGNN, and k1 was assigned as the attachment label, as shown below in Fig.16 by an arc and in Fig.17 by an arrow.
Figure.16. SA Interface Showing r6 Attachment to the first NP Head
Figure.17. SA Interface Showing k1 Attachment to VGNN Head
Immediately, another genitive-marked NP was encountered, which was also attached to its head NP and assigned the r6 attachment label, as shown in Fig.18 by an arrow. Next, the first NP of the sentence was attached to VGNN, as shown in Fig.18 by an arc, and the attachment label k7p was assigned to the attachment, as shown in Fig.19 by an arrow.
Figure.18. SA Interface Showing r6 Attachment to the Second NP
Figure.19. SA Interface showing Another k7p Attachment to VGNN Head
Finally, the leftover NP, along with its genitive attachment, was attached to VGNN, as shown in Fig.19 by an arc, and was assigned the attachment label k7t, as shown in Fig.20 by an arrow.
Figure.20. SA Interface showing k7t Attachment to VGNN Head
Step-8 Annotating Inter-clausal Dependencies
By now, there are two parsed clauses, one finite and the other non-finite. As already mentioned several times, the root of a sentence lies in the finite clause, i.e. VGF, and the non-finite clause merely modifies the root. Therefore, the VGNN chunk, with all its attachments, is attached to the root and assigned the rh attachment label, as shown in Fig.21 by an arrow.
Figure.21. Showing Inter-clausal DREL (rh) between VGNN and the Root
Once the annotation of the entire sentence is complete, the dependency tree is saved and the tree viewer window is closed. The saved annotated sentence is displayed in the interface as a threaded structure in SSF format, as shown in Fig.22.
Figure.22. Showing Threaded Structure of Syntactically Annotated Sentence
Finally, on opening the threaded structure in the tree viewer in collapsed form, i.e. with collapsed nodes, the dependency tree is displayed as shown in Fig.23. Evaluating the various attachment labels once more before moving on to the next sentence is essential, as the relations can be seen more clearly at this point. That was done in this case as well: the NP attachment to the root actually bears the k1 relation but was mistakenly labeled k2. Errors like this can be easily rectified at this stage.
Figure.23. Showing a Complete Dependency Tree in Collapsed Form
The corresponding expanded form of the dependency tree will be as shown in
Fig.24, with all its nodes or sub-trees completely expanded.
Figure.24. Showing a Complete Dependency Tree in Expanded Form
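The complete set of attachments produced in Steps 1–8 can be summarized as a small head–dependent table. The following is an illustrative reconstruction in Python of the tree described above (with the NP relation corrected from k2 to k1); the exact chunk spans are an assumption, not output of the Sanchay tool:

```python
# Each entry: dependent chunk -> (head chunk, attachment label), reconstructing
# the dependency tree built in Steps 1-8 for the example sentence.
tree = {
    "AUXP(chu)":        ("VGF(sapud-mut)", "fragof"),  # tensed fragment of the finite verb
    "JJP(mutA:sir)":    ("VGF(sapud-mut)", "pof"),     # adjectival part of the complex predicate
    "NP(maahol)":       ("VGF(sapud-mut)", "k1"),      # corrected from k2 on re-evaluation
    "VGNN(teyz-nI)":    ("VGF(sapud-mut)", "rh"),      # non-finite clause modifying the root
    "NP(silsilI)":      ("VGNN(teyz-nI)",  "k1"),
    "NP(halaaqts-an)":  ("NP(silsilI)",    "r6"),      # genitive NP under a non-root NP head
    "NP(doh-an)":       ("VGNN(teyz-nI)",  "k7t"),
    "NP(kAshiir-i)":    ("VGNN(teyz-nI)",  "k7p"),
}

# Sanity check: the tree is single-rooted at the finite verb (VGF).
heads = {head for head, _ in tree.values()}
root = heads - set(tree)
print(root)
```

Non-rooted dependencies such as the r6 attachment show up naturally here as entries whose head is itself a dependent of VGNN.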
6. Issues of Syntactic Annotation
The crucial issues that have been encountered while annotating the data are
summarized below:
6.1. V2 Phenomenon
The V2 phenomenon is the most crucial issue in annotating Kashmiri data. The issue is discussed with reference to the following example sentence taken from the current Kashmiri Treebank.
Asi [A:s]AUXP doshvun’ bA:ts-an tam’-sInz seyThaa nikhath [gA:mIts]VGF.
(1.a)
we be-PST.SG.MAS two-DAT.EMP husbandwife-DAT.Pl it-GEN lot hatred
go-PRF.SG.Fem
We both husband-wife had developed lot of hatred of it.
In examples like the above, the finite verb group [A:s gA:mIts]VGF (had gone) occurs discontinuously as AUXP (A:s) and VGF (gA:mIts), with three intervening NP chunks. As mentioned before, the tense auxiliary occurs at the second position in Kashmiri and the main verb at the final position of the sentence. This discontinuous occurrence of AUXP is called the V2 phenomenon, which is similar, with some variation, to that of German and Yiddish. Since the root of the sentence is the VGF chunk, the main issue was whether to posit AUXP or VGF as the root of the sentence, given the discontinuity in the finite verb group (VGF). Initially, the FRAGP chunk label (used to handle occasional discontinuity in the Hindi treebank) was used for the tense auxiliary, and it was treated as the root of the sentence, given the fact that most treebanks consider the finite verb as the head and also because in the generative framework, too, the finite clause is treated as a tensed phrase. Later on, the decision was taken to change the nomenclature and replace the FRAGP label with AUXP, to mark that this is a regular phenomenon and that the notion of verb group, as posited for treebanking in Indian Languages, is problematic with respect to Kashmiri data, which is replete with the V2 phenomenon. Also, the previous notion of head vis-à-vis root of the sentence was revised, and VGF instead of AUXP was taken as the head/root of the sentence. This decision was made in consonance with the basic tenet governing PCG, which is that only content words can be heads. Further, it is considered that the grammatical information that gives the impression of finiteness is distributed over two or three tokens, and a single tensed token without its lexical part cannot be considered a finite verb. Therefore, AUXP is considered the tensed part of the lexical element, and together they constitute the finite verb. Since only lexical elements can be the root, the lexical part of the verb has been assigned VGF, and the AUXP is attached to it like any other dependent, though it is not a dependent but a tensed fragment of the lexical verb, and is assigned the fragof attachment label, considering that the AUXP-VGF complex together gives the sense of a single VGF. In short, the V2 phenomenon was tackled by a simple attachment technique, assuming that what cannot be grouped together during chunking can at least be attached, with the attachment label indicating its status, since there is no notion of hierarchical organization of the sentence.
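As a sketch of this decision, the discontinuous finite verb group of example (1.a) can be represented as follows in Python; the chunking shown is an illustrative assumption, but the attachment mirrors the fragof analysis described above:

```python
# Example (1.a): "Asi [A:s]AUXP ... [gA:mIts]VGF"
# The lexical part of the verb (gA:mIts) is the VGF root; the V2 tense
# auxiliary (A:s) is a tensed fragment of it, attached with fragof.
chunks = ["NP(Asi)", "AUXP(A:s)", "NP(doshvun' bA:ts-an)",
          "NP(tam'-sInz)", "NP(seyThaa nikhath)", "VGF(gA:mIts)"]

root = "VGF(gA:mIts)"
attachment = ("AUXP(A:s)", root, "fragof")

# AUXP sits in second position, while the root is clause-final,
# with three NP chunks intervening:
print(chunks.index("AUXP(A:s)"), chunks.index(root))  # 1 5
```

The design choice is thus that discontinuity which cannot be resolved at the chunk level is recorded at the attachment level, where the fragof label signals that the two chunks jointly realize a single finite verb.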
6.2. Complex Predicates
The problems related to complex predicates in Kashmiri are discussed with reference to the following example sentence, taken from the current Kashmiri Treebank, which contains the complex predicate pasand aasun (to like).
zA:hir chu ki Akis [aasi]VGF akh kitaab pasand tI beykis
aasi byaakh kitaab [pasand]NP. (1.b)
obvious is that one will-have one book like and other
will-have other book like
It is obvious that one would like one book and other
would like other book.
Identification of complex predicates (CPs) and their extraction is already a complex problem, in which at times it becomes very difficult to identify whether a [light verb + Noun] combination is simply a verb + OBJ combination or a complex predicate, as mentioned before. The following criteria have been used, in addition to native speakers’ intuitions, to recognize CPs in Kashmiri.
i. The first is that the verbal element is semantically bleached and does not retain its original lexical semantics. It is for this reason that it is also called a light verb and more or less functions metaphorically.
ii. The second criterion is that if the (NN/JJ/VM + VM) combination has a single lexical item, a verb, as its translation equivalent in English, it is most likely to be a CP.
iii. Pondering on the sub-categorization frame of the light verb will reveal a lot about whether the nominal element is an argument, an adjunct or something else. If it is something else, the combination is more likely to be a complex predicate.
iv. The fourth criterion is that the nominal, adjectival or participial part of a CP cannot be easily conjoined, whereas an OBJ or a complement can be easily conjoined.
v. Further, some CPs can be identified by just looking at the non-verbal part to see whether it is devoid of any agreement features like PNG. If one can perceive no such features there, it most likely forms a complex predicate.
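A rough heuristic combining two of these criteria might look like the following Python sketch. The light-verb list and feature names here are purely illustrative assumptions, not part of the thesis resources:

```python
# Illustrative heuristic only: flag a (nonverbal, verb) pair as a candidate
# complex predicate when the verb is a known light verb (approximating
# criterion i) and the nonverbal part carries no agreement features
# (criterion v).
LIGHT_VERBS = {"aasun", "karun", "sapdun"}   # assumed list, for illustration

def is_cp_candidate(nonverbal_features: set, verb_lemma: str) -> bool:
    agreement = {"person", "number", "gender"}           # PNG features
    bleached = verb_lemma in LIGHT_VERBS                 # criterion i
    no_agreement = not (nonverbal_features & agreement)  # criterion v
    return bleached and no_agreement

# "pasand aasun" (to like): the nominal part bears no PNG features.
print(is_cp_candidate(set(), "aasun"))       # True
print(is_cp_candidate({"number"}, "aasun"))  # False
```

The remaining criteria (translation equivalents, conjoinability) require bilingual resources or native-speaker judgments and are not easily automated, which is why manual verification remains essential.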
This problem is even more complicated in Kashmiri, where discontinuous CPs are a hallmark of finite clauses. The noun/adjective/participial part of a CP occurs apart from the light verb, which takes the second position due to the V2 phenomenon. The light verb carries only the grammatical features, while the lexical semantics is provided by the noun/adjective/participial part. However, the light verb, though a tensed element, is the only verbal element, and there is no main verb to provide the lexical semantics for predication. Therefore, the light verb is assigned the VGF tag and not AUXP. The noun/adjective/participial parts are simply attached to VGF with the attachment label pof (Part-of). In the above example, the nominal part of the CP, pasand, is attached to the light verb aasi with the pof attachment label, just as AUXP was attached to VGF. Here, again, the discontinuity of the complex predicate is solved through the attachment technique.
6.3. Pronominal Cliticisation
The problems related to pronominal cliticisation in Kashmiri are discussed with reference to the following example sentences taken from the current Kashmiri Treebank.
yAmi-is yi Ø behtar zon-un ti thov-n-as Ø lekh-ith. (1.c)
who-DAT this Ø better know-PRF.3PC.SG.MAS that
keep-PRF-3PC.DAT Ø write-PART
For whom whatever s/he deemed better s/he kept that in his/her destiny.
yAmi-is yi tAm’ behtar zon-un ti thov-n-as tAm’ le’kh-ith.* (1.d)
who-DAT this better know-PRF.3PC.SG.MAS that
keep-PRF-3PC.DAT write-PART
For whom whatever s/he deemed better s/he kept that in his/her destiny.
bI chus-ai tse vuchaan. (1.e)
I be-PRS.SG.MAS -2PC you see-PROG
I am watching you.
bI chus-ai Ø vuchaan. (1.f)
I be-PRS.SG.MAS -2PC Ø see-PROG
I am watching you.
bI chus-Ø tse vuchaan.* (1.g)
I be-PRS.SG.MAS-Ø you see-PROG
I am watching you.
Pronominal clitics are a characteristic morpho-syntactic feature of Kashmiri verbs, as of Punjabi, Lahnda and Sindhi. There are two types of pronominal clitics in Kashmiri. One type includes those which simply act as agreement markers and do not replace arguments, as shown in examples (1.e) and (1.f). In such cases, the presence or absence of pronominal arguments hardly matters in the presence of the clitics; both can co-exist without making a construction sound odd, but in the absence of the clitic, the pronominal argument makes the construction sound odd, as shown in (1.g). This indicates that in the slot of PRO drop in such clauses, an artificial argument can be introduced, even though the information about the argument can be extracted from the clitic itself. However, there are other cases in which argument replacing takes place and the clitic and the pronominal argument cannot co-exist. If artificial pronominal arguments are introduced to fill the slot of PRO drop triggered by the clitics, the presence of the argument sounds redundant and the construction looks odd, as shown in (1.c) and (1.d). In example (1.d), the clitic and the argument are simultaneously present in the clause “yAmi-is yi (tAm’) behtar zon-un”, and this is the reason the clause sounds odd. Therefore, in such cases, introducing pronominal arguments artificially is not of much importance. However, in the cases where the pronominal argument and the clitics are mutually compatible and can coexist, they can be introduced.
7. Statistical Results
As mentioned in the chapter five, the three datasets that have been used for the
current task consist of 682 POS annotated sentences of varying lengths, taken
from three different text domains, i.e. newspaper editorials (ASL = 25 Ws or 15
Cs), short-stories (ASL = 11 Ws or 8 Cs) and critical discourse (ASL = 16 Ws or
10Cs), are partially parsed into 8125 chunks. The task with which this chapter is
concerned is to deduce, annotate and find score for each inter-chunk GR, holding
among 8125 chunks in 682 structures. In aggregate, 4287 GRs have been found
holding under 682 dependency structures among 8125 chunks. The 4287 GRs are
further classified under 25 labels, each with its frequency count in three different
domains and also in aggregate, as shown in the Table.2. However, the score for
each attachment label is given in underspecified manner, i.e. no separate
frequency score of the variants is given.
S.No  Label       Variants                             f1    f2   f3    fx
1     k1          pk1, jk1, mk1                        294   80   230   604
2     k1s         **                                   49    14   64    127
3     k2          k2g, k2p                             213   49   262   524
4     k2s         **                                   7     2    16    25
5     rs          rs-k1, rs-k2                         3     1    17    21
6     k3          **                                   31    3    24    58
7     k4          k4a, k4v                             55    2    102   159
8     k5          k5prk                                6     0    6     12
9     k7          k7t, k7p                             166   45   120   331
10    r6          r6k1, r6k2                           93    55   151   299
11    rd          **                                   23    0    3     26
12    rh          **                                   10    1    16    27
13    rt          **                                   9     8    18    35
14    k*u         k1u, k2u                             4     0    0     5
15    ras         ras-k1, ras-k2, ras-neg              6     4    5     15
16    rsp         **                                   6     8    5     19
17    rad         **                                   8     0    0     8
18    adv         sent-adv                             134   9    68    211
19    nmod        **                                   14    7    31    52
20    vmod        vmod_Rh, vmod_Inst                   78    7    66    151
21    *mod_Relc   nmod_Relc, jjmod_Relc, rbmod_Relc    27    4    25    56
22    ccof        **                                   334   76   455   865
23    pof         **                                   134   49   126   309
24    fragof      **                                   113   33   203   349
25    enm         **                                   0     0    0     0
      Total                                            1817  457  2013  4287
Table.2. Showing Frequency Distribution of GRs
The empirical facts given in the pie chart in Fig.24 reveal that ccof is the most frequent GR, covering 20% of the total GRs. Therefore, co-ordination and sub-ordination form the bulk of the grammatical operations occurring in Kashmiri text. Fragof constitutes 8% of the total GRs found in Kashmiri, indicating the strength of the V2 phenomenon. Pof constitutes 7% of the total GRs, showing the significant occurrence of complex predicates in Kashmiri. Similarly, k1 constitutes 14% and k2 constitutes 12% of the relational bulk of Kashmiri text, indicating that SUBs and OBJs constitute 26% of GRs in aggregate, which is quite significant. It is interesting to see that, quantitatively, k1, k2, ccof and fragof together cover more than half of the total relational bulk. These facts further reveal that 39-40% of GRs in Kashmiri are karakas and the rest, about 60%, are non-karakas; 65% of GRs are dependency relations while 35% are non-dependencies, of which 6% are non-rooted dependencies, i.e. attachments made with non-root heads (in genitive, participial and relative clause modifiers). 16% of GRs are adverbial modifiers and only 1% of GRs are relative clause modifiers. Finally, it is important to point out that only 30% of GRs belong to the sub-categorization frame and thus represent argument relations, while 61% of GRs fall outside the sub-categorization frame and thus represent adjunct relations.
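The headline percentages above can be reproduced directly from the aggregate counts in Table.2, for instance with a few lines of Python:

```python
# Aggregate frequencies (fx column of Table.2) for the most frequent labels.
totals = {"ccof": 865, "k1": 604, "k2": 524, "fragof": 349, "pof": 309}
all_grs = 4287  # total number of GRs in the treebank

shares = {label: round(100 * n / all_grs) for label, n in totals.items()}
print(shares)  # ccof 20%, k1 14%, k2 12%, fragof 8%, pof 7%

# k1, k2, ccof and fragof together cover more than half the relational bulk.
print(sum(totals[l] for l in ("k1", "k2", "ccof", "fragof")) / all_grs > 0.5)
```

This is a simple sanity check of the reported proportions, not part of the annotation pipeline itself.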
Figure.24 Showing Proportion of Each GR
8. Inter-annotator Agreement
One of the biggest challenges to a treebank project is maintaining consistency in
annotations. It includes both, achieving significant inter-annotator and intra-
annotator agreement. To check the inter-annotator agreement, two independent
annotators need to annotate the same data with while as intra-annotator agreement
can be achieved if an annotator encounters the same constructions or phenomenon
many times during the course of annotation, the annotator annotates them
consistently by sticking to the previous decisions regarding. Since, consistency
increases the usefulness of the data for training or testing automatic methods for
linguistic investigations. The understanding of various linguistic phenomena and
the annotation guidelines is also often reflected in inter-annotator agreement
studies. In order to check the consistency in the annotations of the current treebank,
a dataset of 200 sentences was annotated by two annotators who had proper
understanding of various issues and the guidelines for Kashmiri treebank. When the
two annotated datasets were compared, a confusion matrix was formulated as
shown in the Table.6. The matrix shows for which label and for how many times
there is confusion. For example: in the first row of the table, there is confusion of
adv with rt one times, with vmod two times, k7p two times, nmod one times, k2
one times, k7 one times, k7t one times and pof one times.
Inter-annotator agreement was measured using Cohen’s kappa (Cohen, 1960), which is the most widely used agreement coefficient for annotation tasks with categorical data. Kappa was introduced to the field of computational linguistics by Carletta et al. (1997), and since then many linguistic resources have been evaluated using the metric (Uria et al., 2009; Bond et al., 2008; Yong and Foo, 1999). The kappa statistics show the agreement between the annotators and the reproducibility of their annotated datasets. However, a good inter-annotator agreement does not necessarily ensure accuracy of attachment labels, as the annotators can make similar kinds of mistakes and errors.
The kappa coefficient k is calculated as:

k = (Pr(a) − Pr(e)) / (1 − Pr(e))

where Pr(a) is the observed agreement between the annotators and Pr(e) is the expected
agreement, i.e. the probability that the annotators agree by chance. Based on the interpretation scale for kappa values proposed by Landis and Koch (1977), shown in Table.3, the agreement between the two annotators on the dataset used for the evaluation is reliable, as given in Table.4. There is a substantial amount of inter-annotator agreement, which implies a similar understanding of the annotation guidelines and of the linguistic phenomena found in the data. The label attachment score, the agreement on labels only and the agreement on attachments only are given in Table.5.
S.No  Kappa Statistic  Strength of Agreement
1     < 0.00           Poor
2     0.00-0.20        Slight
3     0.21-0.40        Fair
4     0.41-0.60        Moderate
5     0.61-0.80        Substantial
6     0.81-1.00        Almost Perfect
Table.3. Coefficients for the Agreement Rate
Observed Agreement Expected Agreement Kappa Value
0.77738515901060079 0.089149258949418789 0.75559679434126129
Table.4. Kappa Statistics
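The kappa value in Table.4 follows directly from the observed and expected agreement via the formula above, as this short Python check illustrates:

```python
def cohen_kappa(observed: float, expected: float) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    return (observed - expected) / (1.0 - expected)

# Values from Table.4.
kappa = cohen_kappa(0.77738515901060079, 0.089149258949418789)
print(round(kappa, 4))  # 0.7556
```

The low expected agreement (about 0.089) reflects the large label inventory: with 25-odd labels, chance agreement is small, so the kappa value stays close to the raw observed agreement.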
Label Attachment Score (LAS): 0.5177619893428064
Agreement on Labels (LA): 0.7380106571936057
Agreement on Attachments (UAS): 0.6341030195381883
No Match (NM): 0.15008880994671403
Table.5. Label and Attachment Agreement Scores
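The scores in Table.5 can be computed from two parallel annotations by comparing, for each chunk, the assigned head and label. A minimal sketch follows; the data here is invented for illustration, not drawn from the 200-sentence evaluation set:

```python
# Each annotation maps chunk -> (head, label). LAS counts full matches,
# UAS head-only matches, LA label-only matches, and NM cases where
# neither the head nor the label matches.
def agreement_scores(ann1: dict, ann2: dict):
    n = len(ann1)
    las = sum(ann1[c] == ann2[c] for c in ann1) / n
    uas = sum(ann1[c][0] == ann2[c][0] for c in ann1) / n
    la = sum(ann1[c][1] == ann2[c][1] for c in ann1) / n
    nm = sum(ann1[c][0] != ann2[c][0] and ann1[c][1] != ann2[c][1]
             for c in ann1) / n
    return las, la, uas, nm

a1 = {"NP1": ("VGF", "k1"), "NP2": ("VGF", "k2"), "AUXP": ("VGF", "fragof")}
a2 = {"NP1": ("VGF", "k2"), "NP2": ("VGF", "k2"), "AUXP": ("VGF", "fragof")}
print(agreement_scores(a1, a2))
```

By construction, LAS can never exceed either UAS or LA, since a full match requires agreement on both the head and the label; this ordering is visible in Table.5 as well.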
S.No  Label        Confusions
1     adv          {'rt': 1, 'vmod': 2, 'k7p': 2, 'sent-adv': 1, 'nmod': 1, 'k2': 1, 'k7': 1, 'k7t': 1, 'pof': 1}
2     ccof         {'k1s': 2, 'rt': 1, 'vmod': 2, 'nmod__relc': 1, 'k2': 1, 'k1': 1, 'pof': 1}
3     fragof       {'pof': 1, 'ccof': 3, 'nmod': 1}
4     k1           {'k1s': 3, 'r6': 1, 'vmod': 1, 'k1u': 1, 'ccof': 1, 'k4v': 7, 'nmod': 2, 'k2': 14, 'pof': 2, 'k4a': 5}
5     k1s          {'k2s': 1, 'nmod': 1, 'k3': 1, 'k2': 12, 'k1': 3, 'k7t': 1, 'pof': 2}
6     k2           {'adv': 2, 'r6': 1, 'k4v': 4, 'k3': 1, 'ccof': 1, 'k1': 8, 'k4': 2, 'pof': 6, 'k4a': 3}
7     k2p          {'k7': 1, 'rh': 1, 'k2g': 1}
8     k2s          {'k2': 2}
9     k4           {'k2': 1, 'k4v': 6, 'k4a': 4, 'k1': 2}
10    k4a          {'k2': 1, 'k1': 2, 'k4': 1}
11    k4v          {'k1': 1, 'k4': 1}
12    k5           {'rd': 1, 'k7p': 1}
13    k7           {'vmod': 1, 'k2': 1, 'k2p': 1, 'k7p': 1, 'k1': 2, 'k7t': 2, 'k5': 1, 'rsp': 3}
14    k7p          {'rd': 1, 'k2p': 1, 'k7': 3, 'k7t': 1}
15    k7t          {'adv': 1, 'k7p': 1, 'k7': 1, 'vmod': 3}
16    nmod         {'vmod': 2, 'rs': 1, 'ccof': 1, 'k2': 1, 'k1': 1, 'k7': 2, 'k5': 1}
17    nmod__k1inv  {'nmod': 1}
18    nmod__k2inv  {'nmod': 1}
19    nmod__relc   {'fragof': 1, 'nmod': 1}
20    pk1          {'k1': 1}
21    pof          {'k2': 7, 'k1': 1, 'vmod': 1}
22    r6           {'k4v': 1, 'r6-k2': 1, 'k1': 1}
23    r6-k2        {'r6': 5}
24    r6v          {'k1s': 1, 'k7p': 1, 'k4v': 1}
25    rad          {'k7p': 2}
26    ras-k1       {'r6': 1, 'k7': 1, 'k4': 1}
27    rbmod        {'ccof': 1}
28    rh           {'k3': 2, 'ccof': 1}
29    rs           {'k2': 3, 'vmod': 1, 'k2s': 2}
30    rt           {'sent-adv': 1, 'rh': 4}
31    vmod         {'adv': 1, 'ras-neg': 1, 'sent-adv': 1, 'ccof': 3, 'pof': 1}
Table.6. Confusion Matrix Showing Disagreement Labels
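A confusion listing of this kind can be derived mechanically from the two annotators' label sequences; the following Python sketch uses toy data, the real input being the 200-sentence evaluation set:

```python
from collections import defaultdict

# Build a Table.6-style confusion listing from two parallel label sequences:
# for each position where the annotators disagree, count the label pair.
def confusion(labels1, labels2):
    conf = defaultdict(lambda: defaultdict(int))
    for l1, l2 in zip(labels1, labels2):
        if l1 != l2:
            conf[l1][l2] += 1
    return {label: dict(counts) for label, counts in conf.items()}

ann1 = ["k1", "k2", "adv", "k1", "ccof"]
ann2 = ["k2", "k2", "vmod", "k1", "ccof"]
print(confusion(ann1, ann2))  # {'k1': {'k2': 1}, 'adv': {'vmod': 1}}
```

Such a listing makes it easy to spot systematic disagreements, e.g. the k1/k2 confusion that is also the most frequent cell in Table.6.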
9. Summary
In this chapter, the most important annotation layer of the dependency treebank of Kashmiri, i.e. syntactic parsing and annotation, has been discussed. First of all, the notion of parsing was introduced, as it forms the key syntactic operation for producing dependency trees out of input sentences carrying some amount of previously annotated grammatical information. Since the Paninian grammatical sketch for Sanskrit has been adopted for sentence parsing in IL treebanks, the basic tenets of Paninian Computational Grammar (PCG) were introduced in order to show what kind of syntactic parsing would be involved in developing a dependency treebank for Kashmiri. As already mentioned, PCG is a reinterpretation of Paninian grammar by one of the leading NLP groups in India, known as Bharati. PCG appears to be a blend of ideas flourishing throughout the world in the dependency tradition. It is not purely Paninian, as the name suggests; some key notions like Noun Group and Verb Group also appear in Abney (1996). Moreover, the reinterpretation has been made in terms of modern notions of grammar, either by complete equivalence or by mere approximation, e.g. Karta is interpreted as roughly equivalent to agent or SUB. So it is through the popular notions of agent or SUB that annotators equipped with modern linguistic jargon interpret terms like Karta. It is for this reason that PCG sometimes appears to be a matter of ancient labels, which, however, is not the case. The fundamental ideas at the heart of PCG have been taken from the ancient Sanskritic tradition, which is more semantics-oriented. It is essentially a syntactico-semantic model which incorporates more semantics than syntax, and this is the reason it lacks the popular notions like SUB, OBJ, argument, adjunct, etc. It must be noted, however, that the elements corresponding to such notions can be easily extracted from the treebank, as the semanto-syntactic attachment labels can be classified in terms of the popular notions of syntax, e.g. the k1 attachment always attaches an argument, a SUB.
Next, the inventory of grammatical relations that need to be annotated has been given, with their original Paninian terms, their interpretation in modern terms, their labels and their variant labels. The description of each GR mentioned in the inventory has been given with a variety of examples, in such a manner that the description also serves as the guidelines. Then, the procedure for annotating the various inter-chunk GRs has been given with graphic illustrations, so that it becomes obvious how all the dependency structures have been produced using the Sanchay SA Interface. The various annotation issues that were encountered while annotating the Kashmiri corpus were also discussed and illustrated, particularly the V2 phenomenon, which brings Kashmiri closer to the Germanic languages; it is due to the V2 factor in particular that Kashmiri dependency structures differ from the dependency structures of other ILs.
Finally, the notion of inter-annotator agreement has been introduced and an experiment for measuring the inter-annotator agreement vis-à-vis consistency in the treebank annotation has been presented. The confusion matrix has shown the disagreement or conflicting labels, and the rest of the tables in this section show that the inter-annotator agreement is substantial as per the interpretation scale of Landis and Koch (1977). The observed agreement was found to be 0.777 and the kappa value 0.7555. In short, the experiment conducted to check the inter-annotator agreement has shown that the annotators agree quite considerably on labels as well as on attachments, which means both have a similar understanding of the issues and the guidelines. It also indicates that there will be quite considerable consistency in the syntactic annotations of the current treebank.
Chapter.7 Conclusion
“Computers are incredibly fast, accurate and stupid!!! Human beings are incredibly slow, inaccurate and brilliant!!! Together they are powerful beyond imagination.”
Albert Einstein
Corpus-based investigation of natural languages has become a hallmark of
contemporary linguistic research, one which not only presents an alternative to the popular introspection-based investigations, particularly of natural language syntax, but also adds a more interdisciplinary and applied orientation to the research. The contemporary age is considered an age of information, in which knowledge creation, dissemination and acquisition are no longer restricted to traditional means and privileged persons; with the advent of the computer, the internet, the world wide web and social media, under the force of globalization, even underdogs can share, produce and disseminate knowledge. Since the vehicles of knowledge, whether at the representational (cognitive) level or at the transactional (communicative) level, are concrete natural languages, rather than an ideal natural language which provides space for notions like universalism and lingua franca and undermines the creative potential of individual languages vis-à-vis their communities, the need for online representation of all languages has been severely felt in the last few decades. This thirst on the part of speech communities cannot be quenched through introspection-based research alone; it needs a boom of empirical research on natural language so that probabilistic methods can be harnessed for technological purposes. Further, the need for human-machine interaction through natural languages has also increased considerably, which has been met by an increase in resource creation (both linguistic and computational) and in interdisciplinary research on natural language, resulting in the inception of entirely new fields of inquiry like computational linguistics (CL), natural language processing (NLP) and language technology (LT). The current research is one such effort to create a small-scale syntactically annotated corpus, i.e. a treebank for Kashmiri, and to lay down the basic methodology for creating a large-scale syntactically annotated corpus which can be used for training various NLP algorithms like syntactic parsers.
However, for creating a syntactically annotated corpus, the grammar formalism is of paramount importance, and one is astonished to see the wide range of competing grammatical models/formalisms. It becomes very difficult to prefer one model over another, as all the models claim flexibility and universality in catering to a wide range of language data. In fact, the choice of framework vis-à-vis grammar formalism is, in itself, an interesting and less explored research area of experimental syntax which is beyond the scope of this dissertation. Nevertheless, dependency-based representations have been considered more suitable for inflectionally rich (relatively free-word-order) languages, i.e. those which are less configurational or positional, e.g. Czech, Turkish, Hindi, Urdu, Kashmiri, etc. Since dependency relations are essentially syntactico-semantic in nature and directly encode the predicate-argument structure, i.e. the participatory relations of the various arguments and adjuncts, it has been argued that dependency-based representations are more suitable for annotated resource creation. This is not only because they cater to free word order but also because they are considered more suitable for a number of NLP applications (Covington 1995, Culotta & Sorensen 2004, Reichartz et al., 2009). Further, most of the languages of the world are inflectionally rich vis-à-vis relatively free-word-order languages (Covington 1995), and thus, in most treebanks, there is a tendency to take into account the morpho-syntactic cues, i.e. obliqueness, overt case markings or relational words (pre/postpositions), during sentence parsing. This also provides clear semantic information, crucial for various NLP applications like MT. Since Kashmiri is an inflectionally rich language, there are also clear-cut morpho-syntactic cues associated with NPs or VGNNs, i.e. with entities which are the participants of an action or an event, and these are crucial in sentence parsing. It has been observed that although there is no hundred-percent one-to-one correspondence between case relations and the case markers/pre/postpositions which mark the dependents, such morpho-syntactic cues, along with TAM features, are definitely very helpful for syntactic parsing in relatively free-word-order languages. On the other hand, constituency-based representations have been considered better for fixed-word-order languages, where there are few morpho-syntactic cues but the positions of the constituents dictate their grammatical relations, e.g. English and French.
Finally, it is very important to recognize the advantages and disadvantages of
both frameworks while applying them to the data of a particular language. It is
equally important to approach the problem of framework selection on the basis
of certain research questions: what are these formalisms essentially able to
capture? Are they complementary to each other? How can they be helpful in
developing treebanks and, subsequently, robust syntactic parsers? However, the
choice of framework or formalism in this research is determined not so much by
theoretical motivations and other technicalities related to the formalism as by
the unavailability of the required resources in the Indian scenario. For
instance, if one wishes to use the annotation scheme of the Prague or TIGER
Treebank, one needs to be in constant touch with the people who are actually
working in that area in order to obtain resources and expert opinions. It is
thus quite impractical to use a grammar formalism for which resources are
unavailable or not easily accessible. Further, there is no need to reinvent the
wheel, as other representations can be added later, and algorithms are now
available to convert a dependency treebank into a phrase-structure one.
Therefore, on practical considerations as well as on the basis of the
principles for treebanking given in the second chapter, the model of the
AnnCorra Treebanks, i.e. those of Hindi, Urdu, Telugu and Bengali, has been
followed.
Apart from the grammatical model, the most important requirement for developing
the Kashmiri treebank was the primary source data, i.e. a Kashmiri text corpus.
The major bottleneck in obtaining the corpus was the unavailability of any
online resource, such as a newspaper, from which data could have been obtained.
There is a complete vacuum of commercially important text domains (like
medicine and tourism) in Kashmiri. Therefore, KashCorpus was built for
developing KashTreeBank. The selected sets of the corpus were manually
pre-processed, i.e. sanitized, normalized and finally tokenized.
The selected sets of the corpus were converted into the Shakti Standard
Format (SSF) with the help of the Sanchay platform. Its SA interface was used
to build three annotation layers, i.e. the parts-of-speech layer, the chunk
layer and the inter-chunk dependency layer. What is important is not merely
adding the annotation layers to the corpus but the arrangement in which a lower
layer of annotation facilitates the higher layer. This arrangement is provided
by the SSF, which is also important for the machine readability of the
dependency trees created during the annotation process. The fundamental
annotation layer added to KashCorpus was the POS layer. Each word in a sentence
was assigned a POS tag according to the BIS tagset for Kashmiri, which is
coarse-grained and hierarchical, consisting of 11 top-level categories and 32
type-level tags. In the process, a total of 14852 words were classified into 11
POS categories with the frequency order N > V > PP > RD > JJ > PR > CC > RP >
DM > QT > RB. During the annotation process, several issues were raised and
resolved, and annotation guidelines were laid down to achieve consistency and
intra-annotator agreement. Finally, the frequency of each category was
calculated and cumulative frequencies were obtained.
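The per-category frequency ranking and the cumulative frequencies described
above can be sketched as follows. This is a minimal illustration, assuming a
hypothetical list of top-level BIS tags rather than the actual KashCorpus
annotations.

```python
from collections import Counter

# Hypothetical sample of top-level BIS categories assigned to tokens
# (illustrative only; not the actual KashCorpus data).
tags = ["N", "V", "N", "PP", "N", "JJ", "V", "N", "RD", "PR"]

freq = Counter(tags)

# Frequency order, most to least frequent, analogous to the
# N > V > PP > RD > JJ > ... ordering reported for KashTreeBank.
order = [cat for cat, _ in freq.most_common()]

# Cumulative relative frequencies over the ranked categories.
total = sum(freq.values())
cumulative = []
running = 0
for cat in order:
    running += freq[cat]
    cumulative.append((cat, freq[cat], running / total))

print(order)
print(cumulative)
```

The cumulative column lets one read off, for any rank, what proportion of the
corpus the top categories jointly cover.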
In the second layer of annotation, the same interface was used to add
chunk-level information to the same sets of the POS-annotated KashCorpus.
Earlier there were four sets of corpus from three domains, of which the two
sets belonging to the same domain were combined. Therefore, three sets of
POS-annotated corpus were chunked on the basis of local dependencies and
discontinuities. A cluster of POS-tagged words which were contiguous and stood
in a dependency or part-whole relation with each other was assigned a single
chunk label. Not only groups of words were assigned chunk labels but also some
solitary or discontinuous elements which defy the intuitive notion of a chunk.
The V2 phenomenon, which constitutes 5.009% of the grammatical phenomena at the
chunk level in the Kashmiri data, was also handled by positing an AUXP chunk.
The most crucial issues, related to finiteness and complex predication, were
resolved, and annotation guidelines were laid down for consistency in chunking.
All 14852 POS-annotated words were chunked and classified into 10 chunk types.
The increasing frequency order of the chunks is NEGP < BLK < RBP < VGNN <
VGNF < JJP < CCP < VGF < NP.
Finally, the third layer of linguistic information was added to the
three datasets. The 682 POS-annotated sentences of varying lengths from three
domains, i.e. newspaper editorials (ASL77 = 25 Ws or 15 Cs), short stories
(ASL = 11 Ws or 8 Cs) and critical discourse (ASL = 16 Ws or 10 Cs), were
partially parsed into 8125 chunks. The inter-chunk GRs holding among the 8125
chunks were then annotated. In aggregate, 4287 GRs were found to hold within
the 682 dependency structures, and these 4287 GRs were further classified under
25 labels. Inter-annotator agreement was also measured for the syntactic
annotations and was found to be substantial as per the interpretation matrix of
Landis and Koch (1977): the observed agreement was 0.777 and the kappa value
was 0.7555. In short, the experiment conducted to check inter-annotator
agreement has shown that the annotators agree quite considerably on labels as
well as on attachments, which indicates that both annotators have a similar
understanding of the issues and the guidelines.
77 Average Sentence Length
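The relation between the observed agreement (0.777) and the kappa value
(0.7555) reported above follows Cohen's kappa, which discounts chance
agreement: κ = (p_o − p_e) / (1 − p_e). A minimal sketch, assuming two
hypothetical annotators' label sequences rather than the actual KashTreeBank
annotations:

```python
from collections import Counter

def cohen_kappa(ann1, ann2):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(ann1) == len(ann2) and ann1
    n = len(ann1)
    # Observed agreement: proportion of items given identical labels.
    p_o = sum(a == b for a, b in zip(ann1, ann2)) / n
    # Expected chance agreement from each annotator's label distribution.
    c1, c2 = Counter(ann1), Counter(ann2)
    p_e = sum(c1[lbl] * c2[lbl] for lbl in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy dependency labels (not the dissertation's data):
a1 = ["k1", "k2", "k1", "k7", "k1", "k2"]
a2 = ["k1", "k2", "k2", "k7", "k1", "k2"]
print(round(cohen_kappa(a1, a2), 3))  # → 0.739
```

Here p_o = 5/6 but the corrected kappa is lower (17/23 ≈ 0.739), illustrating
why a kappa of 0.7555 alongside an observed agreement of 0.777 still counts as
substantial on the Landis and Koch scale.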
Appendix-I: BIS POS Tagset for Kashmiri
S. No.   Category Name (top level / subtype, level 1)   Annotation Convention   Examples
1 Noun N
1.1 Common N__NN gu:r (milk man), kul (tree), ku:r (girl)
1.2 Proper N__NNP gulI marg (Gulmarg), pahal gha:m (Pahalgam), huzaif (Huzaif)
1.3 Nloc N__NST heri (in upper storey), bonI (in lower storey)
2 Pronoun PR
2.1 Pronominal PR__PRP su (he-nom), bI (I-nom), tse (you-erg), homis (he-dat), yi (this), ti (that)
2.2 Reflexive PR__PRF panun (self’s-MAS ), panIn’ (self’s-FEM)
2.3 Relative PR__PRL yus (who-SG), yim (who-pl)
2.4 Reciprocal PR__PRC akh Akis (to one another),
pa:nIvan’ (amongst each other)
2.5 WH PR__PRQ kus (who-SG), kIm (who-pl)
2.6 Indefinite PR__PID kenh (something), kanh (someone)
3 Demonstrative DM
Deictic DM__DMD hu (he), so` (she), hum (those), yi (this)
Relative DM__DMR yus (who-SG), yim (who-pl), yAmy` (who-erg), yimav (who-pl-erg)
WH DM__DMQ kus (who), kIm (who)
Indefinite DM__DMI kenh (something), kanh (someone)
4 Verb V
4.1 Main V__VM paka:n (walks/walking), thovmut (kept)
pari (will read), gindun (to play), tulun
(to lift), tsalun/davun (to run), gatshith
(having gone), gindnuk (of playing)
4.2 Auxiliary V__VAUX chi/chu (is), Os/A:s (was), a:si (will)
5 Adjective JJ
tshoT (dwarf), z’u:Th (tall), zabar (nice), asIl (good)
6 Adverb RB
jaldi: (quickly), va:rI va:rI (slowly)
7 Postposition PSP
peTh (on), manz (in), tal (under), nish (near)
8 Conjunction CC
8.1 Co-ordinator CC__CCD tI (and), ya:/natI (or), magar (but)
8.2 Subordinator CC__CCS zi/ ki (that), agar (if), zanti (as if), teli
/adI (then)
9 Particles RP
9.1 Default RP__RPD ti (too), sirif/ mAhaz (only), hish/hiuv
(like)
9.3 Interjection RP__INJ alie! Oho!
9.4 Intensifier RP__INTF seTha: (very), va:riyah (very)
9.5 Negation RP__NEG na (no), ma (don’t), kehn (not)
10 Quantifiers QT
10.1 General QT__QTF kam (little), zya:dI (more), kehn (some)
10.2 Cardinals QT__QTC akh (one), zI (two), tsor hath (4 hundred)
10.3 Ordinals QT__QTO Akim (first), doyim (second)
11 Residuals RD
11.1 Foreign
word
RD__RDF It is fine
11.2 Symbol RD__SYM ، ، ء ، ،،
11.3 Punctuation RD__PUNC ؛ ! ( ) “ ؟ ، : ۔
11.4 Unknown RD__UNK
ڑگ ،فاین ،از ،اٹ
11.5 Echo words RD__ECH tre:lI ve:lI (apple and the like), cha:y va:y (tea and the like), batI vatI (rice and the like), ma:z va:z (meat and the like)
Appendix-II: Additional POS Examples Extracted from Dataset-4
The dataset-4 examples are organized under the following tags: N_NN, N_NNC,
N_NNP, N_NNPC, N_NST, PR_PRC, PR_PRF, PR_PRI, PR_PRL, PR_PRP, DM_DMD, DM_DMI,
DM_DMR, V_VAUX, V_VM, JJ_JJ, JJ_JJC, RB_RB, RB_RBC, PP_PSP, CC_CCD, CC_CCS,
RP_INJ, RP_INTF, RP_NEG, RP_RPD, QT_QTC, QT_QTF, QT_QTO, RD_ECH, RD_PUNC,
RD_UNK. [The Perso-Arabic word lists under each tag are not reproduced here,
as the script could not be recovered from the extracted text.]
Appendix-III: A Sample of Syntactically Annotated Sentences in SSF
<document id="">
<head></head>
<Sentence id='1'>
1 (( NP <fs name='NP' drel='r6:NP4'>
1.1 وزیر N_NNPC <fs name='وزیر' cat='n'>
1.2 اعظم N_NNPC <fs name='اعظم' cat='n'>
1.3 ڈاکٹر N_NNPC <fs name='ڈاکٹر' cat=''''>
1.4 نمنمٮ�ہن N_NNPC <fs name='نمنمٮ�ہن ' cat=''''>
1.5 ہ� ہ�نگھن N_NNPC <fs name='ہ� ہ�نگھن ' cat=''''>
))
2 (( NP <fs name='NP2' drel='r6:NP4'>
2.1 وادی N_NN <fs name='وادی' cat='n'>
2.2 ہ� نن ہہ PP_PSP <fs name='ہ� نن ہہ ' cat=''''>
))
3 (( NP <fs name='NP3' drel='r6:NP4'>
3.1 ہ�- ک ہ� حال N_NN <fs name=' ہ�- ک ہ� <'cat='nst 'حال
))
4 (( NP <fs name='NP4' drel='k3:VGF'>
4.1 دور N_NN <fs name='دور' cat='n'>
4.2 س�تۍ PP_PSP <fs name='س�تۍ ' cat=''''>
))
5 (( AUXP <fs name='AUXP' drel='fragof:VGF'>
5.1 ہ� ی#ھن V_VAUX <fs name='ہ� ی#ھن ' cat='v'>
))
6 (( NP <fs name='NP5' drel='k7p:VGNN'>
6.1 ہ%�ا�تس N_NN <fs name='ہ%�ا�تس ' cat='n'>
6.2 م���ن�ز PP_PSPن''''''& <fs name=' م���ن�ز <''''=cat 'ن''''''&
))
7 (( NP <fs name='NP6' drel='k2:VGNN'>
7.1 یمی ٲد JJ_JJ <fs name='یمی ٲد ' cat='adj'>
7.2 یمن نا N_NN <fs name='یمن نا ' cat=''''>
))
8 (( JJP <fs name='JJP' drel='pof:VGNN'>
8.1 ٲقیم JJ_JJ <fs name='ٲقیم ' cat='adj'>
))
9 (( VGNN <fs name='VGNN' drel='r6v:NP7'>
9.1 ن#ن- ہ� کر� V_VM <fs name=' ن#ن- ہ� <''''=cat 'کر�
))
10 (( NP <fs name='NP7' drel='k2:VGF'>
10.1 ن*ن ہ+ ک� N_NN <fs name='ن*ن ہ+ <'cat='n 'ک�
))
11 (( NP <fs name='NP8' drel='pof:VGF'>
11.1 ن�ہ� کا DM_DMI <fs name='ہ��ن <'cat='pn 'کا
11.2 خاص JJ_JJC<fs name='خاص' cat=''''>
11.3 پوچھر N_NN <fs name='پوچھر' cat=''''>
))
12 (( VGF <fs name='VGF' drel='ccof:CCP'>
12.1 یمت و�گ ل V_VM <fs name='یمت و�گ <'cat='v 'ل
))
13 (( CCP <fs name='CCP'>
13.1 ہ� ب0لک CC_CCD <fs name='ہ� ب0لک ' cat='avy'>
))
14 (( AUXP <fs name='AUXP2' drel='fragof:NP10'>
14.1 ہ#ھ V_VAUX <fs name='ہ#ھ ' cat='v'>
))
15 (( NP <fs name='NP9' drel='k1:NP10'>
15.1 اکثر QT_QTF <fs name='اکثر' cat='avy'>
15.2 نگار- ہتجزی N_NN <fs name=' نگار- ہتجزی ' cat=''''>
))
16 (( NP <fs name='NP10' drel='ccof:CCP'>
16.1 ما�ا& V_VM <fs name='&ا�ما' cat='v'>
))
17 (( CCP <fs name='CCP2' drel='csof:NP10'>
17.1 ہز CC_CCS <fs name='ہز ' cat='avy'>
))
18 (( NP <fs name='NP11' drel='k7:VGNN2'>
18.1 ہ�- ک مستقبل N_NN <fs name=' ہ�- ک <'cat='n 'مستقبل
18.2 ہ� ح�ال PP_PSP <fs name='ہ� <''''=cat 'ح�ال
))
19 (( NP <fs name='NP12' drel='ccof:CCP3'>
19.1 ن�ہ� کا DM_DMI <fs name=' ن�ہ� 2کا ' cat='pn'>
19.2 نو JJ_JJ <fs name='نو' cat=''''>
19.3 وۄمی� N_NN <fs name='وۄمی�' cat=''''>
))
20 (( CCP <fs name='CCP3' drel='k1:VGNN2'>
20.1 یا CC_CCD <fs name='یا' cat='avy'>
))
21 (( NP <fs name='NP13' drel='ccof:CCP3'>
21.1 ہامکا& N_NN <fs name='&ہامکا ' cat='n'>
))
22 (( JJP <fs name='JJP2' drel='pof:VGNN2'>
22.1 ہ; پٲ JJ_JJ <fs name=';ہ <'cat='adj 'پٲ
))
23 (( VGNN <fs name='VGNN2' drel='vmod:VGF2'>
23.1 ہ� ن�پ�� V_VM <fs name='ہ� ن�پ�� ' cat=''''>
23.2 ہ� 0جا� PP_PSP <fs name='ہ� <''''=cat '0جا�
))
24 (( AUXP <fs name='AUXP3' drel='fragof:VGF2'>
24.1 وھے # V_VAUX <fs name='وھے #' cat='v'>
))
25 (( NP <fs name='NP14' drel='ccof:CCP4'>
25.1 ہ� نام DM_DMD <fs name='ہ� نام ' cat='pn'>
25.2 ہ% ;و N_NN <fs name='%ہ <''''=cat ';و
))
26 (( CCP <fs name='CCP4' drel='k3:VGF2'>
26.1 ہ� ت CC_CCD <fs name='ہ� <'cat='avy 'ت
))
27 (( NP <fs name='NP15' drel='k7t:VGNF'>
27.1 ناتھ PR_PRP <fs name='ناتھ ' cat='pn'>
27.2 من�ز ن''''''& PP_PSP <fs name=' من�ز <''''=cat 'ن''''''&
))
28 (( NP <fs name='NP16' drel='UNDEF:VGNF'>
28.1 ن�ہ� نرو 0 N_NST<fs name='ہ��ن نرو 0' cat='nst'>
28.2 یکن PP_PSP <fs name='یکن ' cat=''''>
))
29 (( VGNF <fs name='VGNF' drel='nmod:NP17'>
29.1 ہ� آامت V_VM <fs name='ہ� آامت ' cat='v'>
))
30 (( NP <fs name='NP17' drel='ccof:CCP4'>
30.1 ہ� 0یا� N_NN <fs name='ہ� <'cat='n '0یا�
30.2 س�تۍ PP_PSP <fs name=' 2س�تۍ ' cat=''''>
))
31 (( NP <fs name='NP18' drel='ccof:CCP5'>
31.1 مویویسی N_NN <fs name='مویویسی' cat='n'>
))
32 (( CCP <fs name='CCP5' drel='k1:VGF2'>
32.1 ہ� ت CC_CCD <fs name=' ہ� 2ت ' cat='avy'>
))
33 (( NP <fs name='NP19' drel='ccof:CCP5'>
33.1 بددلی N_NN <fs name='بددلی' cat='n'>
))
34 (( VGF <fs name='VGF2' drel='rsv:CCP2'>
34.1 ہمژ یہر� V_VM <fs name='ہمژ یہر� ' cat='v'>
34.2 ۔ RD_PUNC <fs name='۔' cat=''''>
))
</Sentence>
<Sentence id='2'>
1 (( NP <fs name='NP' drel='UNDEF:NP2'>
1.1 Eی ہحز N_NNPC <fs name='Eی ہحز ' cat='n'>
1.2 المجاہ��ن N_NNPC <fs name='المجاہ��ن' cat=''''>
1.3 �ا& RP_RPD <fs name='&ا�' cat=''''>
))
2 (( NP <fs name='NP2' drel='k1:VGF'>
2.1 تمام QT_QTF <fs name='تمام' cat='avy'>
2.2 علحیدگی N_NNC <fs name='علحیدگی' cat=''''>
2.3 نن� نپس N_NNC <fs name='نن� نپس ' cat=''''>
2.4 ژو ٲجم N_NN <fs name='ژو ٲجم ' cat=''''>
))
3 (( VGF <fs name='VGF'>
3.1 ی#ھ V_VAUX <fs name='ی#ھ ' cat='v'>
3.2 ی�مت نو و V_VM <fs name='مت�ی نو <''''=cat 'و
))
4 (( CCP <fs name='CCP' drel='rsv:VGF'>
4.1 ہز CC_CCS <fs name='ہز ' cat='avy'>
))
5 (( NP <fs name='NP3' drel='ras-k2:VGNN'>
5.1 تمام QT_QTF <fs name=' 2تمام ' cat='avy'>
5.2 نپن نرو گ N_NN <fs name='نپن نرو <''''=cat 'گ
5.3 س�تۍ PP_PSP <fs name='س�تۍ ' cat=''''>
))
6 (( NP <fs name='NP4' drel='pof:VGNN'>
6.1 0اتھ- نکتھ N_NN <fs name=' 0اتھ- نکتھ ' cat='n'>
))
7 (( VGNN <fs name='VGNN' drel='r6v:NP6'>
7.1 ہ�چ کر V_VM <fs name='چ�ہ <'cat='v 'کر
))
8 (( NP <fs name='NP5' drel='r6:NP6'>
8.1 وزیراعظم N_NN <fs name='وزیراعظم' cat='v'>
8.2 ن�ز س��� ن''''''& PP_PSP <fs name=' ن�ز س��� <''''=cat 'ن''''''&
))
9 (( NP <fs name='NP6' drel='k1:VGF3'>
9.1 آاوے- خ�ش N_NN <fs name=' آاوے- <'cat='psp 'خ�ش
))
10 (( VGF <fs name='VGF2' drel='csof:CCP'>
10.1 وھے # V_VM <fs name='وھے #' cat='v'>
))
11 (( NP <fs name='NP7' drel='k2:VGF3'>
11.1 محض RP_RPD <fs name='محض' cat='avy'>
11.2 ناکھ QT_QTC <fs name='ناکھ ' cat=''''>
11.3 ن0یا& N_NN <fs name='&ن0یا ' cat=''''>
))
12 (( NP <fs name='NP8' drel='nmod__Relc:NP7'>
12.1 نیک یم وی � PR_PRL <fs name='نیک یم وی �' cat='pn'>
))
13 (( NP <fs name='NP9' drel='r6:NP8'>
13.1 نمقص� N_NN <fs name='نمقص� ' cat='n'>
))
14 (( JJP <fs name='JJP' drel='ccof:CCP2'>
14.1 محض RP_RPD <fs name=' 2محض ' cat='avy'>
14.2 ملکی JJ_JJ <fs name='ملکی' cat='adj'>
))
15 (( CCP <fs name='CCP2' drel='CCNmod:NP10'>
15.1 ہ� ت CC_CCD <fs name='ہ� <'cat='avy 'ت
))
16 (( JJP <fs name='JJP2' drel='ccof:CCP2'>
16.1 عالمی JJ_JJ <fs name='عالمی' cat='adj'>
))
17 (( NP <fs name='NP10' drel='k2:VGF3'>
17.1 عام�- %اے N_NN <fs name=' عام�- <'cat='n '%اے
))
18 (( JJP <fs name='JJP3' drel='pof:VGF3'>
18.1 یگمراہ JJ_JJ <fs name='یگمراہ ' cat='adj'>
))
19 (( VGF <fs name='VGF3' drel='k1:NP9'>
19.1 یر& ک V_VM <fs name='&یر <'cat='v 'ک
19.2 ی#ھ V_VAUX <fs name=' 2ی#ھ ' cat='v'>
19.3 ۔ RD_PUNC <fs name='۔' cat='s'>
))
</Sentence>
<Sentence id='3'>
1 (( NP <fs name='NP' drel='k2:VGF4'>
1.1 پزر N_NN <fs name='پزر'>1.2 ہ� ت RP_RPD <fs name='ہ� <'ت
))
2 (( VGF <fs name='VGF'>
2.1 ی#ھ V_VAUX <fs name='ی#ھ '>
))
3 (( NP <fs name='NP2' drel='k1:VGF4'>
3.1 ہ� � PR_PRP <fs name='ہ� �'>))
4 (( CCP <fs name='CCP' drel='ras-k1:NP2'>
4.1 ہز CC_CCS <fs name='ہز '>))
5 (( NP <fs name='NP3' drel='k1:VGF4'>
5.1 وز�راعظمن N_NN <fs name='وز�راعظمن'>))
6 (( VGF <fs name='VGF2' drel='csof:CCP'>
6.1 نو& و V_VM <fs name='&نو <'و))
7 (( NP <fs name='NP4' drel='r6:NP5'>
7.1 شیر N_NNPC <fs name='شیر'>
7.2 کشمی�ر O''''''ی N_NNPC <fs name=' کشمی�ر O''''''ی'>7.3 زرعی N_NNPC <fs name='زرعی'>
7.4 یونیورسٹی N_NNPC <fs name='یونیورسٹی'>7.5 ہ�س نن ہہ PP_PSP <fs name='ہ�س نن ہہ '>
))
8 (( NP <fs name='NP5' drel='k2:VGNF'>
8.1 کن�کی*نس N_NN <fs name='کن�کی*نس'>
))
9 (( NP <fs name='NP6' drel='pof:VGNF'>
9.1 خطاب N_NN <fs name='خطاب'>
))
10 (( VGNF <fs name='VGNF' drel='vmod:VGF4'>
10.1 کرا& V_VM <fs name='&کرا'>
))
11 (( JJP <fs name='JJP' drel='UNDEF:VGF4'>
11.1 ی�تے QT_QTF <fs name='ی�تے '>))
12 (( CCP <fs name='CCP2' drel='rs:JJP'>
12.1 ہز CC_CCS <fs name=' 2ہز '>
))
13 (( NP <fs name='NP7' drel='k1:VGF4'>
13.1 با�ۍ PR_PRP <fs name='با�ۍ '>))
14 (( VGF <fs name='VGF3' drel='csof:CCP2'>
14.1 ہ#ھ V_VM <fs name='ہ#ھ '>))
15 (( NP <fs name='NP8' drel='k2:VGNN'>
15.1 نمن ہت DM_DMD <fs name='نمن ہت '>15.2 تمام QT_QTF <fs name='تمام'>15.3 گروپن N_NN <fs name='گروپن'>
15.4 ہ�تۍ PP_PSP <fs name='ہ�تۍ '>))
16 (( NP <fs name='NP9' drel='pof:VGNN'>
16.1 نکتھ N_NN <fs name='نکتھ '>
))
17 (( VGNN <fs name='VGNN' drel='UNDEF:VGF4'>
17.1 ہ� کر� V_VM <fs name='ہ� <'کر�
17.2 نپتھ 0ا PP_PSP <fs name='نپتھ <'0ا))
18 (( JJP <fs name='JJP2' drel='pof:VGF4'>
18.1 تیار JJ_JJ <fs name='تیار'>))
19 (( NP <fs name='NP10' drel='nmod__k1inv:NP8'>
19.1 یم PR_PRP <fs name='یم'>))
20 (( NP <fs name='NP11' drel='x:VGF4'>
20.1 Rہ 0ق� PP_PSP <fs name='Rہ <'0ق�20.2 ہ� ہتہن PR_PRP <fs name='ہ� ہتہن '>
))
21 (( NP <fs name='NP12' drel='k2:VGF4'>
21.1 گردی- شت ہد N_NN <fs name=' گردی- شت ہد '>
21.2 خالف PP_PSP <fs name='خالف'>
))
22 (( VGF <fs name='VGF4' drel='k1:NP10'>
22.1 ن�ن با V_VM <fs name='ن�ن با '>22.2 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
<Sentence id='4'>
1 (( NP <fs name='NP' drel='k1:VGF'>
1.1 بتمۍ PR_PRP <fs name='بتمۍ '>))
2 (( VGF <fs name='VGF'>
2.1 نو& و V_VM <fs name='&نو <'و))
3 (( NP <fs name='NP2' drel='k1:VGF2'>
3.1 با�ۍ PR_PRP <fs name='با�ۍ '>))
4 (( AUXP <fs name='AUXP' drel='fragof:VGF2'>
4.1 ہ#ھ V_VAUX <fs name='ہ#ھ '>))
5 (( NP <fs name='NP3' drel='ras-k1:VGF2'>
5.1 ن�س پٲکستا N_NNP <fs name='س�ن <'پٲکستا5.2 س�تۍ PP_PSP <fs name='س�تۍ '>
))
6 (( NP <fs name='NP4' drel='ccof:CCP'>
6.1 دوستی N_NN <fs name='دوستی'>
))
7 (( CCP <fs name='CCP' drel='k7:VGNF'>
7.1 ہ� ت CC_CCD <fs name='ہ� <'ت))
8 (( NP <fs name='NP5' drel='ccof:CCP'>
8.1 نکس ہا+ترا N_NN <fs name='نکس ہا+ترا '>8.2 ٮٮٹھ پ PP_PSP <fs name='ٮٮٹھ <'پ
))
9 (( VGNF <fs name='VGNF' drel='nmod:NP6'>
9.1 ہ%تھ ب; V_VM <fs name='ہ%تھ ب; '>))
10 (( NP <fs name='NP6' drel='k2:VGF2'>
10.1 تعلقات N_NN <fs name='تعلقات'>))
11 (( VGF <fs name='VGF2'>
11.1 ن�ژھا& V_VM <fs name='&ن�ژھا '>11.2 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
<Sentence id='5'>
1 (( CCP <fs name='CCP' drel='csof:CCP2'>
1.1 نہرگ� CC_CCS <fs name='نہرگ� '>))
2 (( NP <fs name='NP' drel='k1:VGF'>
2.1 پٲکستا& N_NNP <fs name='&پٲکستا'>))
3 (( NP <fs name='NP2' drel='r6:NP3'>
3.1 ہننۍ نپ PR_PRF <fs name='ہننۍ نپ '>))
4 (( NP <fs name='NP3' drel='k2:VGF'>
4.1 زمین- ہر ن� N_NN <fs name=' زمین- ہر ن� '>))
5 (( NP <fs name='NP4' drel='rd:VGF'>
5.1 0ھا%ت N_NNP <fs name='0ھا%ت'>5.2 ٲمخلف PP_PSP <fs name='ٲمخلف '>
))
6 (( NP <fs name='NP5' drel='rh:VGF'>
6.1 ہ�- گر;ا� ب;ہ*ت JJ_JJ <fs name=' ہ�- گر;ا� ب;ہ*ت '>6.2 سرگرمیٮو N_NN <fs name='سرگرمیٮو'>
6.3 نپتھ 0ا PP_PSP <fs name='نپتھ <'0ا))
7 (( NP <fs name='NP6' drel='pof:VGF'>
7.1 استعمال N_NN <fs name='استعمال'>))
8 (( VGF <fs name='VGF' drel='ccof:CCP2'>
8.1 ن�& �پ V_VM <fs name='&ن� <'�پ
8.2 ہ�- � ہ� ہ;� V_VAUX <fs name=' ہ�- � ہ� ہ;� '>))
9 (( CCP <fs name='CCP2'>
9.1 ہ� ت CC_CCD <fs name='ہ� <'ت))
10 (( NP <fs name='NP7' drel='k7:VGF2'>
10.1 ناتھ DM_DMD <fs name='ناتھ '>10.2 نلس ہس ہ�ل N_NN <fs name='نلس ہس ہ�ل '>10.3 م���ن�ز PP_PSPن''''''& <fs name=' م���ن�ز <'ن''''''&
))
11 (( NP <fs name='NP8' drel='r6:NP9'>
11.1 ہ� نپنن PR_PRF <fs name='ہ� نپنن '>))
12 (( NP <fs name='NP9' drel='k2:VGF2'>
12.1 ;ہٲ�ۍ- ن�قین N_NN <fs name=' ;ہٲ�ۍ- ن�قین '>))
13 (( JJP <fs name='JJP' drel='pof:VGF2'>
13.1 ہ% و� پ JJ_JJ <fs name='%ہ و� <'پ))
14 (( VGF <fs name='VGF2' drel='ccof:CCP2'>
14.1 ہر نک V_VM <fs name='ہر نک '>
14.2 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='6'>
1 (( NP <fs name='NP' drel='k1:VGF2'>
1.1 ک���شی�ر O''''''ی N_NNP <fs name=' ک���شی�ر O''''''ی'>))
2 (( VGF <fs name='VGF' drel='ccof:CCP'>
2.1 وھے # V_VM <fs name='وھے #'>
))
3 (( NP <fs name='NP2' drel='r6:NP3'>
3.1 ہ� نپنن PR_PRF <fs name='ہ� نپنن '>))
4 (( NP <fs name='NP3' drel='rh:VGF2'>
4.1 خوبصورتی N_NN <fs name='خوبصورتی'>
4.2 ہکنۍ PP_PSP <fs name='ہکنۍ '>
))
5 (( NP <fs name='NP4' drel='k7p:VGF2'>
5.1 ن%س ی; QT_QTF <fs name='ن%س ی; '>
5.2 نہس- � ی;�ۍ N_NN <fs name=' نہس- � ی;�ۍ '>5.3 م���ن�ز PP_PSPن''''''& <fs name=' م���ن�ز <'ن''''''&
))
6 (( JJP <fs name='JJP' drel='k1s:VGF2'>
6.1 شور ہم JJ_JJ <fs name='شور ہم '>
6.2 ، RD_PUNC <fs name='،'>))
7 (( CCP <fs name='CCP'>
7.1 نت�ے CC_CCS <fs name='نت�ے '>))
8 (( AUXP <fs name='AUXP' drel='fragof:VGF2'>
8.1 ہ#ھ V_VAUX <fs name='ہ#ھ '>))
9 (( NP <fs name='NP5' drel='rsp:VGF2'>
9.1 ہ� نپت N_NST<fs name='ہ� نپت '>9.2 ہ� نوت N_NNC <fs name='ہ� نوت '>9.3 ٮٮٹھ پ PP_PSP <fs name='ٮٮٹھ <'پ
))
10 (( NP <fs name='NP6' drel='k1:VGF2'>
10.1 �ٲلٲ�ۍ N_NN <fs name='ۍ�ٲلٲ�'>
))
11 (( NP <fs name='NP7' drel='k7p:VGF2'>
11.1 یور N_NST<fs name='یور'>))
12 (( VGF <fs name='VGF2' drel='rh:CCP'>
12.1 ہ��ا& V_VM <fs name='&ہ��ا '>12.2 ہمتۍ آا V_VAUX <fs name='ہمتۍ آا '>12.3 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
<Sentence id='7'>
1 (( NP <fs name='NP' drel='k1:VGF'>
1.1 ہ� � PR_PRP <fs name='ہ� �'>))
2 (( AUXP <fs name='AUXP' drel='fragof:VGF'>
2.1 وھے # V_VM <fs name='وھے #'>
))
3 (( NP <fs name='NP2' drel='r6:NP3'>
3.1 ہ�- # ہمالی� N_NNP <fs name=' ہ�- # <'ہمالی�))
4 (( NP <fs name='NP3' drel='k7p:VGF'>
4.1 ہ#ھ کۄ N_NN <fs name='ہ#ھ <'کۄ
4.2 م���ن�ز PP_PSPن''''''& <fs name=' م���ن�ز <'ن''''''&))
5 (( VGF <fs name='VGF'>
5.1 ہمژ ن�پھلی پھا V_VM <fs name='ہمژ ن�پھلی <'پھا5.2 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
<Sentence id='8'>
1 (( NP <fs name='NP' drel='r6:NP2'>
1.1 ہر یی بک* N_NNP <fs name='ہر یی بک* '>
1.2 ہز نن ہہ PP_PSP <fs name='ہز نن ہہ '>))
2 (( NP <fs name='NP2' drel='k4:VGNF'>
2.1 خوبصورتی N_NN <fs name='خوبصورتی'>
2.2 س�تۍ PP_PSP <fs name='س�تۍ '>))
3 (( JJP <fs name='JJP' drel='pof:VGNF'>
3.1 ثر ٲمت JJ_JJ <fs name='ثر ٲمت '>
))
4 (( VGNF <fs name='VGNF' drel='vmod:VGF'>
4.1 ہے ہ�تھ ہپ ن� V_VM <fs name='ہے ہ�تھ ہپ ن� '>))
5 (( AUXP <fs name='AUXP' drel='fragof:VGF'>
5.1 ی#ھ V_VM <fs name='ی#ھ '>
))
6 (( NP <fs name='NP3' drel='k1:VGF'>
6.1 مغل N_NNPC <fs name='مغل'>
6.2 0ا;+اہ N_NNPC <fs name='0ا;+اہ'>6.3 جہا�گیر& N_NNPC <fs name='&گیر�جہا'>
))
7 (( NP <fs name='NP4' drel='k1:VGNN'>
7.1 ن�تھ PR_PRP <fs name='ن�تھ '>))
8 (( NP <fs name='NP5' drel='k1s:VGNN'>
8.1 جنت N_NNC <fs name='جنت'>
8.2 بنظی�ر- O''''''ےی JJ_JJC<fs name=' بنظی�ر- O''''''ےی '>
))
9 (( VGNN <fs name='VGNN' drel='r6:NP6'>
9.1 ینک آا� V_VM <fs name='ینک آا� '>))
10 (( NP <fs name='NP6' drel='pof:VGF'>
10.1 خطاب N_NN <fs name='خطاب'>
))
11 (( VGF <fs name='VGF'>
11.1 یمت ن�ت ; V_VM <fs name='یمت ن�ت ;'>11.2 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
<Sentence id='9'>
1 (( NP <fs name='NP' drel='r6:NP2'>
1.1 ہر یی بک* N_NNP <fs name='ہر یی بک* '>
1.2 نن� یہ PP_PSP <fs name='نن� یہ '>))
2 (( NP <fs name='NP2' drel='k1s:VGF'>
2.1 دل N_NN <fs name='دل'>
))
3 (( VGF <fs name='VGF'>
3.1 ی#ھ V_VM <fs name='ی#ھ '>
))
4 (( NP <fs name='NP3' drel='k1:VGF'>
4.1 س���ری�نگر O''''''ی N_NNP <fs name=' س���ری�نگر O''''''ی'>4.2 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
<Sentence id='10'>
1 (( NP <fs name='NP' drel='k1:VGF'>
1.1 ہ� � PR_PRP <fs name='ہ� �'>))
2 (( VGF <fs name='VGF'>
2.1 ی#ھ V_VM <fs name='ی#ھ '>
))
3 (( NP <fs name='NP2' drel='r6:NP3'>
3.1 ک���شی�ر Oی'''''''''''ی N_NNP <fs name=' ک���شی�ر Oی'''''''''''ی'>3.2 نن� یہ PP_PSP <fs name='نن� یہ '>
))
4 (( NP <fs name='NP3' drel='ccof:CCP'>
4.1 دل N_NN <fs name='دل'>
))
5 (( CCP <fs name='CCP' drel='k1s:VGF'>
5.1 ہ� ت CC_CCD <fs name='ہ� <'ت))
6 (( NP <fs name='NP4' drel='ccof:CCP'>
6.1 ناکھ QT_QTC <fs name='ناکھ '>6.2 م ہا JJ_JJ <fs name='م ہا '>
6.3 ہ� حص N_NN <fs name='ہ� <'حص
6.4 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='11'>
1 (( NP <fs name='NP' drel='ccof:CCP'>
1.1 ہل جھی N_NNPC <fs name='ہل <'جھی
1.2 ڈل N_NNPC <fs name='ڈل'>
))
2 (( CCP <fs name='CCP' drel='k1:VGF'>
2.1 ہ� ت CC_CCD <fs name='ہ� <'ت))
3 (( NP <fs name='NP2' drel='ccof:CCP'>
3.1 ییین ہ�گ N_NNP <fs name='ییین ہ�گ '>
))
4 (( AUXP <fs name='AUXP' drel='fragof:VGF'>
4.1 ی#ھ V_VM <fs name='ی#ھ '>
))
5 (( NP <fs name='NP3' drel='r6:NP4'>
5.1 ہ�- # ن+ہر N_NN <fs name=' ہ�- # ن+ہر '>))
6 (( NP <fs name='NP4' drel='k7:VGF'>
6.1 خوبصٮورتی N_NN <fs name='خوبصٮورتی'>
6.2 م���ن�ز PP_PSPن''''''& <fs name=' م���ن�ز <'ن''''''&))
7 (( NP <fs name='NP5' drel='k2:VGF'>
7.1 ننس اۄگ QT_QTF <fs name='ننس <'اۄگ7.2 ہ� ;ۄگن QT_QTF <fs name='ہ� <';ۄگن7.3 ���رٮ��ر ''''''ہٮ''''''\ ی N_NN <fs name=' ���رٮ��ر ''''''ہٮ''''''\ ی '>
))
8 (( VGF <fs name='VGF'>
8.1 کرا& V_VM <fs name='&کرا'>
8.2 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='12'>
1 (( NP <fs name='NP'>
1.1 نمن م�� N_NN <fs name='نمن <'م��1.2 نن� یہ PP_PSP <fs name='نن� یہ '>
))
2 (( VGNF <fs name='VGNF'>
2.1 یلن �0 V_VM <fs name='یلن �0'>))
3 (( CCP <fs name='CCP'>
3.1 ہ� ت CC_CCD <fs name='ہ� <'ت))
4 (( NP <fs name='NP2'>
4.1 ہ;لن N_NN <fs name='ہ;لن '>))
5 (( JJP <fs name='JJP'>
5.1 0راہ JJ_JJC<fs name='0راہ'>))
6 (( VGNF <fs name='VGNF2'>
6.1 وول- ہ''''''ہین��� V_VM <fs name=' وول- ہ''''''ہین��� '>
))
7 (( NP <fs name='NP3'>
7.1 آب N_NNC <fs name='آب'>7.2 ہوا N_NNC <fs name='ہوا'>
))
8 (( AUXP <fs name='AUXP'>
8.1 ی#ھ V_VAUX <fs name='ی#ھ '>
))
9 (( NP <fs name='NP4'>
9.1 ٮٮن �ٲلا� N_NN <fs name='ٮٮن <'�ٲلا�
))
10 (( NP <fs name='NP5'>
10.1 یور N_NST<fs name='یور'>10.2 یکن PP_PSP <fs name='یکن '>
))
11 (( VGNF <fs name='VGNF3'>
11.1 ننس ہ� V_VM <fs name='ننس ہ� '>))
12 (( NP <fs name='NP6'>
12.1 رز N_NN <fs name='رز'>
))
13 (( VGF
13.1 کرا& V_VM <fs name='&کرا'>
13.2 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='13'>
1 (( CCP <fs name='CCP'>
1.1 مگر CC_CCD <fs name='مگر'>
))
2 (( NP <fs name='NP' drel='k7t:VGF'>
2.1 کل- از N_NST<fs name=' کل- <'از
))
3 (( AUXP <fs name='AUXP' drel='fragof:VGF'>
3.1 ی#ھ V_VAUX <fs name='ی#ھ '>
))
4 (( NP <fs name='NP2' drel='k1:VGF'>
4.1 ر ہش N_NN <fs name='ر ہش '>
))
5 (( NP <fs name='NP3' drel='k7p:VGF'>
5.1 ٮ�تھ نر پ QT_QTF <fs name='ٮ�تھ نر <'پ5.2 ہ� نا� N_NN <fs name='ہ� نا� '>
))
6 (( NP <fs name='NP4' drel='k1s:VGF'>
6.1 ی�ہے DM_DMD <fs name='ی�ہے '>6.2 ص�%ت- ن�0 JJ_JJ <fs name=' ص�%ت- ن�0 '>6.3 ہ� علاق N_NN <fs name='ہ� <'علاق
6.4 و� ہی RP_RPD <fs name='و� <'ہی))
7 (( VGF <fs name='VGF' drel='ccof:CCP'>
7.1 یمت و� گ V_VM <fs name='یمت و� <'گ
7.2 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='14'>
1 (( NP <fs name='NP' drel='modn:NP2'>
1.1 یحکمرا& N_NNC <fs name='&یحکمرا '>
1.2 جماعت N_NNC <fs name='جماعت'>
))
2 (( NP <fs name='NP2' drel='k1:VGF'>
2.1 نیشنل N_NNPC <fs name='نیشنل'>2.2 کا�فر�سن N_NNPC <fs name='سن�فر�کا'>
))
3 (( VGF <fs name='VGF'>
3.1 ی#ھ V_VAUX <fs name='ی#ھ '>
3.2 یتمت نو و V_VM <fs name='یتمت نو <'و))
4 (( CCP <fs name='CCP' drel='rsv:VGF'>
4.1 ہز CC_CCS <fs name='ہز '>))
5 (( NP <fs name='NP3' drel='r6:NP4'>
5.1 ک���شی�ر O''''''ی N_NNP <fs name=' ک���شی�ر O''''''ی'>5.2 نن� یہ PP_PSP <fs name='نن� یہ '>
))
6 (( NP <fs name='NP4' drel='k1:VGF2'>
6.1 ہ� نمسل N_NN <fs name='ہ� نمسل '>))
7 (( VGF <fs name='VGF2'>
7.1 ہز %و V_VM <fs name='ہز <'%و))
8 (( NP <fs name='NP5' drel='rsp:VGF2'>
8.1 تام- تو N_NST<fs name=' تام- <'تو))
9 (( JJP <fs name='JJP' drel='pof:VGF2'>
9.1 cہ گا N_NNC <fs name='cہ <'گا
9.2 ی%ے نو; ا JJ_JJC<fs name='ی%ے نو; <'ا))
10 (( NP <fs name='NP6' drel='rsp:VGF3'>
10.1 - ہ�- � تام و� � N_NST<fs name=' - ہ�- � تام و� �'>))
11 (( NP <fs name='NP7' drel='k1:NP8'>
11.1 بندوق N_NN <fs name='بندوق'>))
12 (( JJP <fs name='JJP2' drel='pof:NP8'>
12.1 ختم JJ_JJ <fs name='ختم'>
))
13 (( NP <fs name='NP8'>
13.1 نگژھن V_VM <fs name='نگژھن '>
13.2 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='15'>
1 (( NP <fs name='NP' drel='r6:NP2'>
1.1 ک���شی�ر O''''''ی N_NNP <fs name=' ک���شی�ر O''''''ی'>1.2 نن� یہ PP_PSP <fs name='نن� یہ '>
))
2 (( NP <fs name='NP2' drel='k1:VGNN'>
2.1 ہ� نمسل N_NN <fs name='ہ� نمسل '>))
3 (( VGNN <fs name='VGNN' drel='k1:VGF'>
3.1 یو& با�ز%ا V_VM <fs name='&یو با�ز%ا '>))
4 (( VGF <fs name='VGF'>
4.1 ی#ھ V_VM <fs name='ی#ھ '>
))
5 (( NP <fs name='NP3' drel='rsp:VGF'>
5.1 تیلی N_NST<fs name='تیلی'>))
6 (( NP <fs name='NP4' drel='k1s:VGF'>
6.1 یممکن JJ_JJ <fs name='یممکن '>))
7 (( NP <fs name='NP5'>
7.1 ہ� ویل � N_NST<fs name='ہ� ویل �'>))
8 (( NP <fs name='NP6' drel='k1:VGF2'>
8.1 بندوق N_NN <fs name='بندوق'>))
9 (( NP <fs name='NP7' drel='pof:VGF2'>
9.1 ہ� ژھۄپ N_NN <fs name='ہ� <'ژھۄپ))
10 (( VGF <fs name='VGF2'>
10.1 ہر ک V_VM <fs name='ہر <'ک
10.2 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='16'>
1 (( NP <fs name='NP' drel='k1:VGF'>
1.1 بندوق N_NN <fs name='بندوق'>))
2 (( AUXP <fs name='AUXP' drel='fragof:VGF'>
2.1 ی#ھ V_VAUX <fs name='ی#ھ '>
))
3 (( NP <fs name='NP2' drel='k2:VGF'>
3.1 ی ٲہتب N_NN <fs name='ی ٲہتب '>
))
4 (( VGF <fs name='VGF' drel='ccof:CCP'>
4.1 نا�ا& V_VM <fs name='&ا�نا '>))
5 (( CCP <fs name='CCP'>
5.1 ہ� ت CC_CCD <fs name='ہ� <'ت))
6 (( NP <fs name='NP3' drel='k2:VGF3'>
6.1 بندوق N_NN <fs name=' 2بندوق '>
))
7 (( AUXP <fs name='VGF2' drel='fragof:VGF3'>
7.1 اوس V_VAUX <fs name='اوس'>))
8 (( NP <fs name='NP4' drel='k7p:VGF3'>
8.1 ک���شی�ر O''''''ی N_NNP <fs name=' ک���شی�ر O''''''ی'>8.2 من�ز ن''''''& PP_PSP <fs name=' من�ز <'ن''''''&
))
9 (( NP <fs name='NP5' drel='k2:VGNN'>
9.1 نیشنل N_NNPC <fs name='نیشنل'>9.2 کا�فر�س N_NNPC <fs name='س�فر�کا'>
))
10 (( NP <fs name='NP6' drel='k7:VGNN'>
10.1 ٲسیسی JJ_JJ <fs name='ٲسیسی '>
10.2 ہ� مٲ;ا� N_NN <fs name='ہ� <'مٲ;ا�10.3 ہز من PP_PSP <fs name='ہز <'من
))
11 (( NP <fs name='NP7' drel='adv:VGNN'>
11.1 نا�� N_NN <fs name='��نا '>11.2 ٮٮتھ ہ PP_PSP <fs name='ٮٮتھ <'ہ
))
12 (( VGNN <fs name='VGNN' drel='rh:VGF3'>
12.1 ہ� ہٹاو� V_VM <fs name='ہ� <'ہٹاو�12.2 ہر خٲط PP_PSP <fs name='ہر <'خٲط
))
13 (( VGF <fs name='VGF3' drel='ccof:CCP'>
13.1 ہ� نا�ن V_VM <fs name='ہ� نا�ن '>13.2 یمت آا V_VAUX <fs name='یمت آا '>13.3 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
<Sentence id='17'>
1 (( NP <fs name='NP' drel='k1:VGF'>
1.1 سی- این N_NNP <fs name=' سی- <'این))
2 (( AUXP <fs name='AUXP' drel='fragof:VGF'>
2.1 ی#ھ V_VAUX <fs name='ی#ھ '>
))
3 (( NP <fs name='NP2' drel='rsp:VGF'>
3.1 ہ� پت N_NST<fs name='ہ� <'پت3.2 ہ� نوت N_NNC <fs name='ہ� نوت '>3.3 ٮٮٹھے پ PP_PSP <fs name='ٮٮٹھے <'پ
))
4 (( NP <fs name='NP3' drel='r6:NP4'>
4.1 ک���شی�ر O''''''ی N_NNP <fs name=' ک���شی�ر O''''''ی'>4.2 ہ� نن ہہ PP_PSP <fs name='ہ� نن ہہ '>
))
5 (( NP <fs name='NP4' drel='r6:NP5'>
5.1 یلک مس N_NN <fs name='یلک <'مس))
6 (( NP <fs name='NP5' drel='k2:VGF'>
6.1 یو&- ہ� پ�+ JJ_JJ <fs name=' یو&- ہ� <'پ�+6.2 حل N_NN <fs name='حل'>
))
7 (( VGF <fs name='VGF'>
7.1 ن�ژھا& V_VM <fs name='&ن�ژھا '>7.2 یمت آا V_VAUX <fs name='یمت آا '>
7.3 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='18'>
1 (( NP <fs name='NP' drel='r6:NP2'>
1.1 نمن ہت DM_DMD <fs name='نمن ہت '>1.2 بکتھن N_NN <fs name='بکتھن '>
1.3 ���ن�ز ''''''ہن''''''ن ہ PP_PSP <fs name=' ���ن�ز ''''''ہن''''''ن ہ '>
))
2 (( NP <fs name='NP2' drel='pof:VGF'>
2.1 نوتھ 0ا N_NN <fs name='نوتھ <'0ا))
3 (( VGF <fs name='VGF'>
3.1 کر V_VM <fs name='کر'>
))
4 (( NP <fs name='NP3' drel='r6:CCP'>
4.1 نیشنل N_NNPC <fs name='نیشنل'>4.2 ہسکۍ N_NNPCکا�فر� <fs name='ہسکۍ <'کا�فر�
))
5 (( NP <fs name='NP4' drel='ccof:CCP'>
5.1 سین�ر ''''''�ی''''''ن ی JJ_JJ <fs name=' سین�ر ''''''�ی''''''ن ی '>
5.2 نما ہر N_NN <fs name='نما ہر '>
))
6 (( CCP <fs name='CCP' drel='modnc:NP8'>
6.1 ہ� ت CC_CCD <fs name='ہ� <'ت))
7 (( NP <fs name='NP5' drel='r6:NP7'>
7.1 ہتکۍ %�ا� N_NN <fs name='ہتکۍ <'%�ا�))
8 (( NP <fs name='NP6' drel='r6:NP7'>
8.1 قونٮونی JJ_JJ <fs name='قونٮونی'>
8.2 ام�%& N_NN <fs name='&%ام�'>8.3 نن�ۍ ہ PP_PSP <fs name='نن�ۍ <'ہ
))
9 (( NP <fs name='NP7' drel='ccof:CCP'>
9.1 وزیر N_NNC <fs name='وزیر'>
))
10 (( NP <fs name='NP8' drel='k1:VGF'>
10.1 علی N_NNPC <fs name='علی'>
10.2 محم� N_NNPC <fs name='محم�'>10.3 �اگر& N_NNPC <fs name='&اگر�'>
))
11 (( NP <fs name='NP9' drel='r6:NP10'>
11.1 پارٹی N_NN <fs name='پارٹی'>11.2 ٮ�ن نن� ہہ PP_PSP <fs name='ٮ�ن نن� ہہ '>
))
12 (( NP <fs name='NP10' drel='k2:VGF2'>
12.1 کا%کنن N_NN <fs name='کا%کنن'>
))
13 (( NP <fs name='NP11' drel='pof:VGF2'>
13.1 خطاب N_NN <fs name='خطاب'>
))
14 (( VGNF <fs name='VGF2' drel='vmod:VGF'>
14.1 کرا& V_VM <fs name='&کرا'>
14.2 ۔ RD_PUNC <fs name='۔'>))
</Sentence>
<Sentence id='19'>
1 (( NP <fs name='NP' drel='k1:VGF'>
1.1 بتمۍ PR_PRP <fs name='بتمۍ '>))
2 (( VGF <fs name='VGF'>
2.1 نو& و V_VM <fs name='&نو <'و))
3 (( CCP <fs name='CCP' drel='rsv:VGF'>
3.1 ہز CC_CCS <fs name='ہز '>))
4 (( NP <fs name='NP2' drel='k1:VGF3'>
4.1 تم DM_DMD <fs name='تم'>4.2 یلکھ N_NN <fs name='یلکھ '>
))
5 (( NP <fs name='NP3' drel='nmod__k1inv:NP2'>
5.1 یم PR_PRL <fs name='یم'>))
6 (( NP <fs name='NP4' drel='k2:VGF2'>
6.1 نکتھ N_NN <fs name='نکتھ '>
6.2 ہتھ 0ا RD_ECH <fs name='ہتھ <'0ا))
7 (( NP <fs name='NP5' drel='pof:VGF2'>
7.1 تھۄس N_NN <fs name='تھۄس'>))
8 (( VGF <fs name='VGF2' drel='k1:NP3'>
8.1 نا�ا& V_VM <fs name='&ا�نا '>8.2 ہ#ھ V_VAUX <fs name='ہ#ھ '>
))
9 (( AUXP <fs name='AUXP' drel='pof:VGF3'>
9.1 ہ�- � ٮٮکن ہ V_VM <fs name=' ہ�- � ٮٮکن <'ہ))
10 (( NP <fs name='NP6' drel='r6:NP7'>
10.1 ٮ�ن کٲ+ر N_NNP <fs name='ٮ�ن <'کٲ+ر
10.2 نن�ۍ ہہ PP_PSP <fs name='نن�ۍ ہہ '>))
11 (( NP <fs name='NP7' drel='k1s:VGF3'>
11.1 در- ۍژک N_NN <fs name=' در- ۍژک '>
))
12 (( VGF <fs name='VGF3' drel='csof:CCP'>
12.1 ہ�تھ با V_VM <fs name='ہ�تھ با '>12.2 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
<Sentence id='20'>
1 (( NP <fs name='NP' drel='ccof:CCP'>
1.1 علحیدگی N_NNC <fs name='علحیدگی'>
1.2 پسن�& N_NNC <fs name='&پسن�'>))
2 (( CCP <fs name='CCP' drel='k1:CCP2'>
2.1 ہ� ت CC_CCD <fs name='ہ� <'ت
))
3 (( NP <fs name='NP2' drel='ccof:CCP'>
3.1 جنگجٮ�ہن N_NN <fs name='جنگجٮ�ہن'>
))
4 (( VGF <fs name='VGF' drel='fragof:CCP2'>
4.1 ہز پ V_VM <fs name='ہز <'پ))
5 (( VGF <fs name='VGF2' drel='pof_idiom:NP3'>
5.1 نو- ےال V_VM <fs name=' نو- ےال '>
5.2 نو- ےڈل V_VM <fs name=' نو- ےڈل '>
))
6 (( NP <fs name='NP3' drel='pof_idiom:VGF3'>
6.1 ہ� نپنن PR_PRF <fs name='ہ� نپنن '>6.2 ہ� ہن N_NN <fs name='ہ� <'ہن
))
7 (( VGF <fs name='VGF3' drel='k2u:NP4'>
7.1 نو- ےژل V_VM <fs name=' نو- ےژل '>
7.2 ہ*� ہہ RP_RPD <fs name='�*ہ ہہ '>))
8 (( NP <fs name='NP4' drel='k2:VGNF'>
8.1 پالیسی N_NN <fs name='پالیسی'>))
9 (( NP <fs name='NP5' drel='pof:VGNF'>
9.1 نلتھ N_NN <fs name='نلتھ '>))
10 (( VGNF <fs name='VGNF' drel='vmod:CCP2'>
10.1 ہ;تھ V_VM <fs name='ہ;تھ '>))
11 (( NP <fs name='NP6' drel='k2:VGF4'>
11.1 مذاکراتن N_NN <fs name='مذاکراتن'>))
12 (( NP <fs name='NP7' drel='pof:VGF4'>
12.1 ن�گ N_NN <fs name='ن�گ '>))
13 (( VGF <fs name='VGF4' drel='ccof:CCP2'>
13.1 ن�ن ی; V_VM <fs name='ن�ن ی; '>
))
14 (( CCP <fs name='CCP2'>
14.1 ہ� ت CC_CCD <fs name=' ہ� 2ت '>
))
15 (( NP <fs name='NP8' drel='r6:NP9'>
15.1 ہر یی بک* N_NN <fs name='ہر یی بک* '>
15.2 نن� یہ PP_PSP <fs name='نن� یہ '>))
16 (( NP <fs name='NP9' drel='k2:VGF5'>
16.1 ہ�یاے N_NN <fs name='یاے�ہ '>))
17 (( VGF <fs name='VGF5' drel='ccof:CCP2'>
17.1 یو& با�ز%ا V_VM <fs name='&یو با�ز%ا '>17.2 ۔ RD_PUNC <fs name='۔'>
))
</Sentence>
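Each sentence above is encoded in SSF as whitespace- or tab-separated columns — token index, word form, POS tag, and an attribute-value feature structure — with chunk boundaries marked by '((' and '))', and the drel attribute naming the dependency relation and head chunk. A minimal sketch of a reader for this simplified layout (the function name and the assumption of a strict four-column line shape are illustrative, not part of any official SSF tooling):

```python
import re

def parse_ssf_sentence(text):
    """Read one SSF <Sentence> body into a list of chunks.

    A chunk opens on a line whose second column is '((' and closes on
    '))'; token lines carry index, word form, POS tag, and a feature
    structure.  The drel attribute, when present, gives the dependency
    relation and the head chunk it attaches to.
    """
    chunks = []
    current = None
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "))":                        # chunk boundary closes
            if current is not None:
                chunks.append(current)
            current = None
            continue
        cols = line.split("\t") if "\t" in line else line.split(None, 3)
        if len(cols) >= 3 and cols[1] == "((":  # chunk boundary opens
            fs = cols[3] if len(cols) > 3 else ""
            m = re.search(r"drel='([^':]+):([^']+)'", fs)
            current = {"label": cols[2],
                       "drel": (m.group(1), m.group(2)) if m else None,
                       "tokens": []}
        elif current is not None:               # token line: (form, POS tag)
            current["tokens"].append((cols[1], cols[2]))
    return chunks

# Toy two-chunk sentence in the same layout as the appendix.
sample = (
    "1\t((\tNP\t<fs name='NP' drel='k1:VGF'>\n"
    "1.1\tپزر\tN_NN\t<fs name='پزر' cat='n'>\n"
    "))\n"
    "2\t((\tVGF\t<fs name='VGF'>\n"
    "2.1\tکر\tV_VM\t<fs name='کر' cat='v'>\n"
    "))"
)

for chunk in parse_ssf_sentence(sample):
    print(chunk["label"], chunk["drel"], [form for form, tag in chunk["tokens"]])
```

The reader keeps the intra-chunk token order and treats a missing drel (as on root chunks like the finite verb group) as None, mirroring how the appendix leaves the head chunk unattached.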
Bibliography
Aarts, J. & Meijs, M. 1984. Corpus Linguistics: Recent Developments in the Use of
Computer Corpora in English Language Research. Amsterdam: Rodopi.
Abeille, A. (ed.). 2000. Building and Using Syntactically Annotated Corpora.
Kluwer, Dordrecht.
Abney, S. 1989. A Computational Model of Human Parsing. In The Journal of
Psycholinguistic Research, Vol. 18.1. Bell Communications Research: Morristown,
NJ.
Abney, S. 1991. Chunks and Dependencies: Bringing Processing Evidence to Bear
on Syntax. MS. University of Tubingen.
Abney, S. 1991. Parsing by Chunks. In Berwick, R., Abney, S. & Tenny, C. (eds.),
Principle-Based Parsing: Computation and Psycholinguistics. Dordrecht: Kluwer, pages 257-278.
Abney, S. 1996. A Grammar of Projections. MS. University of Tubingen
Abney, S. 1996. Partial Parsing via Finite-State Cascades. In John Carroll (ed.), ESSLLI
Workshop on Robust Parsing. Prague, pages 8-15.
Abney, S. 1996. Chunk Stylebook. MS. University of Tubingen.
Khan, A. J. 2006. Urdu/Hindi: An Artificial Divide. Algora Publishing: New York.
Aduriz, I. Aranzabe, M. J. Arriola, J. M. Atutxa, A. Diaz de Ilarraza, A. Garmendia,
A. Oronoz, M. 2003. Construction of Basque Dependency Treebank. In:
Nivre/Hinrichs 2003, page 201-204.
Afonso, S. Bick, E. Haber, R. Santos, D. 2002. A Treebank for Portuguese. In
Proceedings of the Third International Conference on Language Resources and
Evaluation. Las Palmas, Spain, 1698-1703.
Ambati, Bharat Ram, Samar Husain, Joakim Nivre & Rajeev Sangal. 2010. On the
Role of Morphosyntactic Features in Hindi Dependency Parsing. MS. Language
Technologies Research Centre, IIIT-Hyderabad, India & Department of Linguistics
and Philology, Uppsala University, Sweden.
Ambati, Bharat Ram, Pujitha Gade, Chaitanya GSK & Samar Husain. 2009. Effect
of Minimal Semantics on Dependency Parsing. MS. LTRC, IIIT-Hyderabad.
245
Arppe, Antti, Gaëtanelle Gilquin, Dylan Glynn, Martin Hilpert & Arne Zeschel. 2010.
Cognitive Corpus Linguistics: Five Points of Debate on Current Theory and
Methodology. Corpora Vol. 5.1:1-27. Edinburgh University Press.
Atkins, Sue, Jeremy Clear & Nicholas Ostler. 1991. Corpus Design Criteria. Literary
& Linguistic Computing 7:1-16.
Baker, P. et al. 2004. Corpus Linguistics and South Asian Languages: Corpus
Creation and Tool Development. Literary and Linguistic Computing. Vol. 19 (4),
pages 509-524.
Baerman, Matthew & Brown, D. 2013. Case Syncretism. World Atlas of Language
Structures, Eds. Bernard Comrie, Matthew Dryer, David Gil and Martin Haspelmath.
Munich: Max Planck Digital Library.
Bamman, D. & Crane, G. 2006. The Design and Use of a Latin Dependency
Treebank. In Proceedings of TLT, 67-78. FAL MFF UK, Prague.
Bank. 2003. In Proceedings of the 4th International Workshop on Linguistically
Interpreted Corpora (LINC). Budapest, Hungary.
Barnbrook, Geoff. 1996. Language and Computers: A Practical Introduction to the
Computer Analysis of Language. Edinburgh University Press: Edinburgh.
Baskaran, S. et al. 2007. Framework for a Common Parts-of-Speech Tagset for Indic
Languages. (Draft) http://research.microsoft.com/~baskaran/POSTagset/
Bayer, Josef. 2008. What is Verb Second? MS. University of Konstanz.
Begum, R., Husain, S., Sharma, D.M., Bai, L. 2008. Developing Verb Frames in
Hindi. In Proceedings of LREC. Marrakech, Morocco.
Begum, R., Husain, S., Dhwaj, A., Sharma, D. M., Bai, L. and R. Sangal. 2008.
Dependency Annotation Scheme for Indian Languages. In Proceedings of IJCNLP.
Citeseer.
Begum, R. Jindal, K., Jain, A., Husain, S. and Sharma D.M. 2011. Identification of
Conjunct Verbs in Hindi and its Effect on Parsing Accuracy. In Computational
Linguistics and Intelligent Text Processing: 29-40.
Becker, D. & Riaz, K. A Study in Urdu Corpus Construction. University of St.
Thomas & Department of Computer Science, University of Minnesota-Twin Cities,
U.S.A. Ms.
246
Bharati, A., Chaitanya, V., Sangal, R. and KV Ramakrishnamacharyulu. 1995.
Natural Language Processing: A Paninian Perspective. Prentice-Hall of India.
Bharati A, D. M. Sharma, L. Bai and R. Sangal. 2006. AnnCorra: Annotating
Corpora Guidelines For POS And Chunk Annotation For Indian Languages. LTRC
Technical Report-31
Bharati, A., D. M. Sharma, S. Husain, L. Bai, R. Begum, R. Sangal. 2012. AnnCorra:
Treebanks for Indian Languages: Guidelines for Annotating Hindi Treebank. LTRC
Technical Report.
Bharati, A., Bhatia, M., Chaitanya, V. and R. Sangal. 1996. Paninian Grammar
Framework Applied to English. Technical Report TRCS-96-238, CSE, IIT Kanpur.
Bharati, A., Sangal, R. and D. M. Sharma. 2007. SSF: Shakti Standard Format Guide.
Technical Report, IIIT Hyderabad.
Bharati, A., Sharma, D.M., Husain, S., Bai, L., Begum, R. and R. Sangal. 2009.
Anncorra: Treebanks for Indian Languages Guidelines for Annotating Hindi
Treebank (version–2.0).
Bharati, A., Husain, S., Sharma, D.M., Sangal, R. 2008. A Two-Stage Constraint
Based Dependency Parser for Free Word Order Languages. In Proceedings of the
COLIPS IALP. Chiang Mai, Thailand.
Bhat, D.N.S. 1991. Grammatical Relations: the Evidence Against their Necessity and
Universality. Psychology Press
Bhat, S. M. 2012. Building Large Scale POS Annotated Corpus for Hindi & Urdu
(co-authored). In Proceedings of Workshop on Indian Language & Data: Resources &
Evaluation (WILDRE), LREC 2012. Istanbul, Turkey.
Bhat, S. M. 2010. Developing Fine-grained Hierarchical POS Tagset for Kashmiri. In
Proceedings of International Conference on Language Development & Computing
Methods (ICLDCM), Coimbatore.
Bhat, S. M. & Richa, S. 2011. Case Syncretism and Disambiguating Algorithms for
Urdu-Hindi POS Tagging. Interdisciplinary Journal of Linguistics 4: 187-194.
University of Kashmir, Srinagar.
Bhat, S.M. 2012. Introducing Kashmiri Dependency Treebank. In Workshop on
Machine Translation and Parsing of Indian Languages (MTPIL), COLING 2012, IIT
Mumbai, Mumbai
Bhat, S.M. 2011. Developing Small Scale Treebank for Kashmiri. In SCONLI-06, at
Banaras Hindu University, Varanasi. Ms.
Bhat, S. M. 2012. Empirical Method of Language Documentation: A Case Study of
Compiling Kashmiri Corpus. In National Seminar on Endangered and Lesser Known
Languages: Issues and Responses, 2012, Lucknow University, Lucknow. Ms.
Bhat, S.M. 2013. Manual Chunking and Parsing Kashmiri Text Corpus. In
International Conference of Linguistic Society of India (ICOLSI), Central Institute of
Indian Languages (CIIL), Mysore. Ms.
Bhat, R. A., Bhat, S. M. & D. M. Sharma. 2014. Towards Building a Kashmiri
Treebank: Setting up a Treebanking Pipeline. Ms.
Bhat, R. A. & D. M. Sharma. 2012. In Proceedings of the 6th Linguistic Annotation
Workshop: 157-165. Jeju, Republic of Korea. Association for Computational
Linguistics.
Bhat, R. A. & D. M. Sharma. 2013. Non-projective Structure in Indian Language
Treebanks. Ms.
Bhat, R. N. ---. Dardic: What Does the Label Denote? BHU, Varanasi. Ms.
Bhatt, R., B. Narasimhan, M. Palmer, O. Rambow, D. M. Sharma, and F. Xia. 2009. A
Multi-representational and Multi-layered Treebank for Hindi/Urdu. In Proceedings of
the Third Linguistic Annotation Workshop: 186-189. Association for Computational
Linguistics.
Biber, Douglas. 1993. Representativeness in Corpus Design. Literary and Linguistic
Computing 8.4.
Blake, Barry J. 2004. Case. Cambridge University Press: Cambridge.
Bloomfield, L. 1933. Language. The University of Chicago Press.
Bögel, T., M. Butt, and S. Sulger. 2008. Urdu Ezafe and the Morphology-Syntax
Interface. In Proceedings of LFG '08.
Bond, F. S. Fujita, and T. Tanaka. 2008. The Hinoki Syntactic and Semantic
Treebank of Japanese. Language Resources and Evaluation 42(2):243–251.
Bod, R. and Scha, R. 1997. Data-oriented Language Processing. In Young and
Bloothooft, pages 137–173.
Bod, R. 2003. Is there Evidence for a Probabilistic Language Faculty? Ms.
Bod, R. Hay, J. & Jannedy, S. (Eds.). 2003. Probabilistic Linguistics. Cambridge,
Massachusetts: MIT Press.
Bod, R. & Smets, M. 2012. Empiricist Solutions to Nativist Puzzles by Means of
Unsupervised TSG. In Proceedings of Workshop on Computational Models of
Language Acquisition and Loss. EACL. Association of Computational Linguistics.
Bosco, C. & Lombardo, V. 2004. Dependency and Relational Structure in Treebank
Annotation. In Proceedings of Workshop on Recent Advances in Dependency
Grammar at COLING.
Bosco, C. & Lombardo, V. 2003. A Relation-based Schema for Treebank
Annotation. In Proceedings of the Advances in Artificial Intelligence, 8th Congress
of the Italian Association for Artificial Intelligence, Pisa, Italy.
Bosco, C. & Lombardo, V. 2000. An Annotation Schema for an Italian Treebank. In
Proceedings of the Student Session, 12th European Summer School in Logic,
Language and Information, Birmingham, UK.
Brants, S., S. Dipper, P. Eisenberg, S. Hansen, E. König, W. Lezius, C. Rohrer, G.
Smith & H. Uszkoreit. 2004. TIGER: Linguistic Interpretation of a German Corpus.
In E. Hinrichs and K. Simov (eds.), Research on Language and Computation, Special
Issue, Vol. 2: 597-620.
Brants, S., S. Dipper, S. Hansen, W. Lezius, and G. Smith. 2002. The TIGER
Treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories: 24-41.
Brants, T., Wojciech Skut and Hans Uszkoreit. 1999. Syntactic Annotation of a German
Newspaper Corpus. In Proceedings of the ATALA Treebank Workshop. Paris, France.
Burkhardt, Petra. 2005. The Syntax–Discourse Interface: Representing and
Interpreting Dependency. John Benjamins Publishing Company:
Amsterdam/Philadelphia
Butt, Miriam. 2005. Theories of Case. Cambridge University Press: Cambridge.
Bhatt, Rajesh. ---. Verb Movement in Kashmiri. Ms.
Buchholz, S. & Marsi, E. 2006. CoNLL-X Shared Task on Multilingual Dependency
Parsing. In Proceedings of Tenth Conference on Computational Language Learning.
Association for Computational Linguistics.
Carroll, John, Guido Minnen & Ted Briscoe. 1999. In Proceedings of the EACL
Workshop on Linguistically Interpreted Corpora (LINC). Bergen, Norway.
Carletta, J. S. Isard, G. Doherty-Sneddon, A. Isard, J.C. Kowtko, and A.H. Anderson.
1997. The Reliability of a Dialogue structure Coding Scheme. Computational
linguistics 23.1:13–31.
Chatterji, Sanjay, Tanaya Mukherjee Sarkar, Sudeshna Sarkar & Jayshree
Chakraborty. 2009. Karak Relations in Bengali. In Proceedings of 31st All-India
Conference of Linguists (AICL), Hyderabad, India, pp 33-36.
Chatterji, Sanjay, Praveen Sonare, Sudeshna Sarkar & Devshri Roy. 2009. Grammar
driven rules for hybrid Bangla dependency parsing. In Proceedings of ICON09 NLP
Tools Contest: Indian Language Dependency Parsing, Hyderabad, India, pp. 37-41.
Chaudhry, H. and D.M. Sharma. 2011. Annotation and Issues in Building an English
Dependency Treebank.
Chater, N. & Manning, C.D. 2006. Probabilistic Models of Language Processing and
Acquisition. Trends in Cognitive Sciences, 10, pages 335-344.
Chen, K.-J., Luo, C. C., Gao, Z. M., Chang, M. C., Chen, F. Y. & Chen, C. J. 1999. The
CKIP Chinese Treebank. In Journées ATALA sur les corpus annotés pour la
syntaxe: 85-96. Talana, Paris VII.
Charniak, E. 1993. Statistical Language Learning. MIT Press, Cambridge,
Massachusetts.
Chen, K. J. et al. 2003. Building and Using Parsed Corpora (A. Abeillé, ed.).
Kluwer: Dordrecht.
Chomsky, N. 1981. Lectures on Government and Binding: The Pisa Lectures.
Holland: Foris Publications.
Cloeren, J. 1999. Tagsets. In Syntactic Wordclass Tagging, Hans van Halteren (ed.),
Dordrecht: Kluwer Academic.
Cohen, J. 1960. A Coefficient of Agreement for Nominal Scales. Educational
and Psychological Measurement 20 (1): 37-46.
Collins, M., Jan Hajič, L. Ramshaw and C. Tillmann. 1999. A Statistical Parser for
Czech. In Proceedings of ACL: 505-512.
Collins, M. 1999. Head-driven Statistical Models for Natural Language Parsing.
Ph.D. thesis, University of Pennsylvania.
Corbett, G., N. M. Fraser, and S. McGlashan, 1993. Heads in Grammatical Theory.
Cambridge University Press, Cambridge.
Covington, M. A. 1984. Syntactic Theory in the High Middle Ages. Cambridge
University Press.
Covington, M. A. 1990a. A Dependency Parser for Variable-Word-Order Languages.
Technical Report AI-1990-01, University of Georgia.
Covington, M. A. 1990b. Parsing Discontinuous Constituents in Dependency
Grammar. Computational Linguistics 16: 234-236.
Covington, M. A. 1994. Discontinuous Dependency Parsing of Free and Fixed Word
Order. Research Report AI-1994-02, Artificial Intelligence Programs, University
of Georgia, Athens, Georgia 30602, U.S.A.
Covington, M. A. 2001. A Fundamental Algorithm for Dependency Parsing. In
Proceedings of the 39th Annual ACM Southeast Conference, pages 95-102.
Cowper, Elizabeth. 2002. Finiteness. MS. University of Toronto
Culotta, A. and J. Sorensen. 2004. Dependency Tree Kernels for Relation Extraction. In
Proceedings of the 42nd Annual Meeting of the Association for Computational
Linguistics: 423. Association for Computational Linguistics.
Dandapat, Sandipan. 2008. Part-of-Speech Tagging for Bengali. Unpublished
Dissertation. IIT Kharagpur.
Durrani, N. and S. Hussain. 2010. Urdu word segmentation. In Human Language
Technologies, The 2010 Annual Conference of the North American Chapter of the
Association for Computational Linguistics: 528–536. Association for Computational
Linguistics.
Dash, N. S. 2010. Corpus Linguistics: A General Introduction. Paper presented at
CIIL, Mysore.
Chakrabarty, D., V. Sarma and P. Bhattacharyya. 2007. Complex Predicates in Indian
Language Wordnets. Lexical Resources and Evaluation Journal 40 (3-4).
Debusmann, R. 2004. A Declarative Grammar Formalism for Dependency Grammar.
Dissertation, Universität des Saarlandes.
Dipper, S. 2008. Theory-driven and Corpus-driven Computational Linguistics, and
the Use of Corpora. In Anke Lüdeling and Merja Kytö (eds.), Corpus Linguistics: An
International Handbook. Handbooks of Linguistics and Communication Science, pp.
68-96.Mouton de Gruyter: Berlin.
Dowty, D. 1982. Grammatical Relations and Montague Grammar. In Jacobson, P.
and Pullum, G., Editors, The Nature of Syntactic Representation, pages 79-130. D.
Reidel Publishing Company.
Eide, Kristin M. 2007. Finiteness. Paper presented at 3rd ScanDiaSyn Grand
Meeting, Iceland.
Fillmore, C. 1968. The Case for Case. In Universals in Linguistic Theory, E. Bach
and R. T. Harms (eds.). New York: Holt, Rinehart and Winston.
Fillmore, Charles J. 1992. Corpus Linguistics or Computer-aided Armchair
Linguistics. In: Directions in Corpus Linguistics. Proceedings of Nobel Symposium
82, 4-8 August 1991. Ed. by Jan Svartvik. Berlin, New York: Mouton de Gruyter.
Fong, S. Robert C.B. ---. Treebank Parsing and Knowledge of Language: A
Cognitive Perspective. Department of Linguistics and Computer Science, University
of Arizona, Department of EECs, Brain and Cognitive Science, MIT. Ms.
Garside, R. 1987. The CLAWS Word-tagging System. In The Computational
Analysis of English, Garside, Leech and Sampson, (eds). London: Longman.
Garside, R., Leech, G. & McEnery, T. 1997. Corpus Annotation: Linguistic
Information from Computer Text Corpora. London and New York: Longman.
Glynn, Dylan. 2010. Corpus-driven Cognitive Linguistics. A case study in polysemy.
MS. Lund University.
Gries, Stefan. 2011. Corpus data in usage-based linguistics: What’s the right degree
of granularity for the analysis of argument structure constructions? In Mario Brdar,
Stefan Th. Gries, & Milena Žic Fuchs (eds.), Cognitive linguistics: convergence and
expansion, 237-256. John Benjamins: Amsterdam & Philadelphia.
Gries, Stefan. 2012. Corpus Linguistics: Quantitative Methods. In Carol A.
Chapelle (ed.), The Encyclopedia of Applied Linguistics, 1380-1385. Wiley-
Blackwell: Oxford.
Gruber, J. S. 1965. Studies in Lexical Relations. Ph.D. thesis, MIT.
Gupta, Mridul, Vineet Yadav, Samar Husain & Dipti M Sharma. 2008. A Rule
Based Approach for Automatic Annotation of a Hindi TreeBank. In Proceedings of
the 6th International Conference on Natural Language Processing (ICON-08).
Gupta, Swati. 2004. Aligning Hindi and Urdu Bilingual Corpora for Robust
Projection. M.Sc. Report.
Habash, N. and Owen Rambow. 2005. Arabic Tokenization, Morphological
Analysis, and Part-of-Speech Tagging in One Fell Swoop. In Proceedings of the
Conference of the Association for Computational Linguistics (ACL05).
Hajič, J. 1998. Building a Syntactically Annotated Corpus: The Prague Dependency
Treebank. In Issues of Valency and Meaning: 106-132.
Hajič, J., E. Hajičová, M. Holub, P. Pajas, P. Sgall, B. Vidová-Hladká, and V.
Řezníčková. 2001. The Current Status of the Prague Dependency Treebank. Lecture
Notes in Artificial Intelligence (LNAI) 2166: 11-20. NY.
Hajičová, E. and M. Ceplová. 2000. Deletions and Their Reconstruction in
Tectogrammatical Syntactic Tagging of Very Large Corpora. In Proceedings of
COLING: 278-284.
Hajičová, E. 1998. Prague Dependency Treebank: From Analytic to
Tectogrammatical Annotation. In Proceedings of TSD '98: 45-50.
Hajičová, E., A. Abeillé, J. Hajič, J. Mírovský, and Z. Urešová. 2010. Treebank
Annotation. In Nitin Indurkhya and Fred J. Damerau (eds), Handbook of Natural
Language Processing, Second Edition. CRC Press, Taylor and Francis Group, Boca
Raton, FL.
Hammond, Michael. 2003. Programming for Linguists: Perl for Language
Researchers. Blackwell Publishing: Oxford.
Hardie, A. 2003. Developing a Tagset for Automated Part-of-speech Tagging in
Urdu. In Proceedings of the Corpus Linguistics ‘03.
Hardie, A. 2004. The Computational Analysis of Morpho-syntactic Categories in
Urdu. PhD Dissertation, Lancaster University.
Haspelmath, M. 1997. From Space to Time: Temporal Adverbials in the World's
Languages. LINCOM Studies in Theoretical Linguistics 03. LINCOM EUROPA:
München-Newcastle.
Herrera, Jesus. 2007. Building Corpora for the Development of a Dependency Parser
for Spanish Using MaltParser. Procesamiento del Lenguaje Natural 39: 181-186.
Fillmore, C., P. Kay & M. O'Connor. 1988. Regularity and Idiomaticity in
Grammatical Constructions: The Case of Let Alone. Language 64: 501-538.
Hudson, R. 1984. Word Grammar. Basil Blackwell, Oxford and New York.
Hudson, R. 1990. English Word Grammar. Basil Blackwell, Oxford and Cambridge.
Hudson, R. 2003. The Psychological Reality of Syntactic Dependency Relations.
MTT, Paris. Ms.
Hudson, R. --- . Discontinuous Phrases in Dependency Grammar. Ms.
Järvinen, T. 2000. Bank of English and Beyond: Hand-crafted Parsers for Functional
Annotation. In Abeillé, 2000, pages 43-59.
Kingsbury, P. and Palmer, M. 2002. From TreeBank to PropBank. In Proceedings of
LREC, Las Palmas, Spain.
Kingsbury, P., Palmer, M., and Marcus, M. 2002. Adding Semantic Annotation to the
Penn TreeBank. In Proceedings of the Human Language Technology Conference,
San Diego, California.
Husain, Samar, Phani Chaitanya, Ganeshwar Rao Dulam, Tariq Khan & Dipti M.
Sharma. 2009. Using Levin's Verb Classification for Preposition Sense Selection in
English to Indian Language MT. In Proceedings of the Conference on Language and
Technology 2009 (CLT09), Lahore, Pakistan.
Hussain, M. 1987. Geography of Jammu and Kashmir. Delhi: Rajesh Publications.
Jackendoff, R. 1972. Semantic Interpretation in Generative Grammar. MIT Press:
Cambridge.
Jacque, Kristin. 2006. Analysis of a Potential Latin Treebank. MS.
Jurafsky, Daniel & James H. Martin. 1999. Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics and
Speech Recognition. Prentice Hall, Englewood Cliffs, New Jersey.
Kahane, Sylvain. ---. Why to Choose Dependency Rather Than Constituency for
Syntax: A Formal Point of View. Ms. Modyco-Université Paris Ouest Nanterre &
CNRS.
Kakkonen, T. 2006. DepAnn - An Annotation Tool for Dependency Treebanks. In
Proceedings of the Eleventh ESSLLI Student Session. Janneke Huitink & Sophia
Katrenko (eds.).
Kakkonen, T. 2006. Dependency Treebanks: Methods, Annotation Schemes and
Tools. arXiv:cs/0610124v1 [cs.CL], 20 Oct 2006.
Allan, Keith. 2007. The Western Classical Tradition in Linguistics. Equinox Publishing
Ltd, London.
Kidwai, Ayesha. 2007. A Handbook for Research Scholars. URL:
www.jnu.ac.in/SLLCS/SLLCS%20Research%20Manual.pdf
King, T. H., R. Crouch, S. Riezler, M. Dalrymple and R. Kaplan. 2003. The
PARC700 Dependency Bank. In Proceedings of the 4th International Workshop on
Linguistically Interpreted Corpora (LINC). Budapest, Hungary.
Kiparsky, P. ---. On the Architecture of Panini's Grammar. Stanford University. Ms.
Kiparsky, P. 1994. Paninian Linguistics. In Asher, R. E. (ed.), Encyclopedia of
Language and Linguistics. Oxford, New York.
Kiparsky, P. ---. Panini is Slick But He is not Mean. Stanford University. Ms.
Kiparsky, P. 2007. Panini's Razor. Paris. Ms.
Kiparsky, P. 1979. Panini as a Variationist. MIT Press and Poona University Press.
Kiparsky, P. 1991. On Paninian Studies. Journal of Indian Philosophy, Vol. 19:
189-225.
Kiparsky, P. ---. Dvandvas, Blocking, and the Associative: The Bumpy Ride from
Phrase to Word. Ms.
Kiparsky, P. ---. Event Structure and the Perfect. Ms.
Kiparsky, P. ---. The Shift to Head-Initial VP in Germanic. Ms.
Kiparsky, P. ---. Towards a Null Theory of the Passive. Ms.
Kiparsky, P. ---. Grammaticalization as Optimization. Ms.
Klein, D. and C. D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings
of ACL. Sapporo, Japan.
Kolachina, Prasanth, Sudheer Kolachina, Anil Kumar Singh, Samar Husain,
Viswanatha Naidu, Rajeev Sangal & Akshar Bharati. ---- . Grammar Extraction from
Treebanks for Hindi and Telugu. MS. Language Technologies Research Centre, IIIT-
Hyderabad, India
Koul, Omkar N. 2006. Modern Kashmiri Grammar. USA: McNeil Technologies,
Inc.
Krifka, Manfred. 2006. Basic Notions of Information Structure. Interdisciplinary
Studies on Information Structure 06, Féry, Fanselow and Krifka (Eds.)
Kroch, A. & Taylor, A. ---. Verb Movement in Old and Middle English: Dialect
Variation and Language Contact. Ms.
Kroch, A. & Taylor, A. 2000. Verb-Object Order in Early Middle English. Ms.
Kübler, Sandra, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing.
Synthesis Lectures on Human Language Technologies. Graeme Hirst (ed.). Morgan &
Claypool Publishers.
Kucera, H. 1992. The Odd Couple: The Linguist and the Software Engineer. The
Struggle for High Quality Computerized Language Aids. In Svartvik, pages 401–424.
Kuhlmann, M. and M. Möhl. 2007. Mildly Context-Sensitive Dependency Languages.
In Proceedings of ACL. Prague, Czech Republic.
Landis, J.R. and G.G. Koch. 1977. The Measurement of Observer Agreement for
Categorical Data. Biometrics: 159–174.
Lawey, Aadil A. & Nazima, Mehdi. 2011. Development of Unicode Compliant
Kashmiri Font: Issues and Resolutions. Interdisciplinary Journal of Linguistics
4: 195-200. University of Kashmir: Srinagar.
Lee, H., C. N. Huang, J. Gao and X. Fan. 2004. Chinese Chunking with Another
Type of Spec. In Proceedings of SIGHAN: 41-48. Barcelona.
Leech, G. & Wilson, A. 1996. Recommendations for the Morpho-syntactic
Annotation of Corpora. EAGLES Report EAG-TCWG-MAC/R.
Leech, G and Wilson, A. 1999. Standards for Tag-sets. In Syntactic Wordclass
Tagging, Hans van Halteren (ed.), Dordrecht: Kluwer Academic.
Leech, G. 1991. The State of the Art in Corpus Linguistics. In Aijmer, K. and
Altenberg, B., Editors, English Corpus Linguistics: Studies in Honour of Jan
Svartvik, pages 8–29. Longman, London.
Leech, G. 1992. Corpora and Theories of Linguistic Performance. In Svartvik,
1992b, pages 105–122.
Leech, G., Barnett, R., and Kahrel, P. 1996. EAGLES Recommendations for the
Syntactic Annotation of Corpora, eag-tcwg-sasg/1.8 version of 11th march 1996.
http://www.ilc.pi.cnr.it/EAGLES96/segsasg1/segsasg1.html.
Lehal, G.S. 2010. A Word Segmentation System for Handling Space Omission
Problem in Urdu Script. In Proceedings of 23rd International Conference on
Computational Linguistics: 43.
Lesmo, L. and Lombardo, V. 2000. Automatic Assignment of Grammatical
Relations. In Proceedings of LREC, pages 475-482, Athens, Greece.
Litkowski, K. 1999. Question-answering Using Semantic Relation Triples. In
Proceedings of TREC-8, pages 349–356, Gaithersburg MD.
Lombardo, V. and Lesmo, L. 1998. Unit Coordination and Gapping in Dependency
Theory. In Processing of Dependency-based Grammars, COLING-ACL.
Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation.
University of Chicago Press.
Lindquist, Hans. 2009. Corpus Linguistics and the Description of English. Edinburgh
University Press: Edinburgh.
Liberman, M. 2000. Legal, Ethical and Policy Issues Concerning the Recording and
Publication of Primary Language Materials. In Steven Bird and Gary Simons,
(editors).
Lüdeling, Anke & Merja Kytö (eds.). 2009. Corpus Linguistics: An International
Handbook Vol.2. Walter de Gruyter: Berlin.
Manning, C. and H. Schütze. 1999. Foundations of Statistical Natural Language
Processing. MIT Press.
Steedman, M. 2011. Romantics and Revolutionaries: What Theoretical and
Computational Linguists Need to Know About Each Other But Were Afraid to Ask.
Linguistic Issues in Language Technology (LILT). CSLI Publications.
Marcus, M. P., M. A. Marcinkiewicz, and B. Santorini. 1993. Building a Large
Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19 (2):
313-330.
Marantz, A. P. 1984. On the Nature of Grammatical Relations. MIT Press,
Cambridge.
Marcus, M., Kim, G., Marcinkiewicz, M., MacIntyre, R., Bies, A., Ferguson, M.,
Katz, K. and Schasberger, B. 1994. The Penn Treebank: Annotating Predicate
Argument structure. In Proceedings of The Human Language Technology Workshop,
San Francisco. Morgan-Kaufmann.
Butt, M. 2004. The Light Verb Jungle. In G. Aygen, C. Bowern & C. Quinn (eds.),
Papers from the GSAS/Dudley House Workshop on Light Verbs: 1-50. Cambridge:
Harvard Working Papers in Linguistics.
McEnery, T. and Wilson, A. 1996. Corpus Linguistics. Edinburgh University Press,
Edinburgh.
Masica, C.P. 1993. The Indo-Aryan Languages. Cambridge University Press.
Cambridge, UK
Matthews, P.H. 2007. Syntactic Relations: A Critical Survey. Cambridge University
Press, Cambridge, UK
McDonald, R. F. Pereira, K. Ribarov and J. Hajič. 2005. Non-Projective Dependency
Parsing using Spanning Tree Algorithms. In Proceedings of HLTEMNLP.
McEnery, A., Baker, J. P., Gaizauskas, R. & Cunningham, H. 2000. EMILLE:
Building a Corpus of South Asian Languages. Vivek, A Quarterly in Artificial
Intelligence 13 (3): 23-32.
Mel'čuk, I. 1979. Studies in Dependency Syntax. Karoma Publishers, Inc.
Mel'čuk, I. A. 1988. Dependency Syntax: Theory and Practice. State University of
New York Press.
Meyers, A., R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R.
Grishman. 2004. The NomBank Project: An Interim Report. In NAACL/HLT 2004
Workshop Frontiers in Corpus Annotation.
Meyers, A. 1995. The NP Analysis of NP. In Papers from the 31st Regional Meeting
of the Chicago Linguistic Society: 329-342.
Meyer, Charles F. 2002. English Corpus Linguistics: An Introduction. Cambridge
University Press: Cambridge.
Mitchell, T. M. 1997. Machine Learning. McGraw-Hill Higher Education.
Milicevic, Jasmina. 2006. A Short Guide to the Meaning-Text Linguistic Theory.
Journal of Koralex, vol. 8: 187-233.
Mohanan, T. 1990. Arguments in Hindi. Ph.D. Thesis, Stanford University.
Neumann, Günter. 1994. A Uniform Computational Model for Natural Language
Parsing and Generation. Doctoral Dissertation, Universität des Saarlandes.
Nilsson, Peter. ---. An Experimental Study of Nivre's Parser. Diploma thesis,
Department of Computer Science, Faculty of Science, Lund University.
Nivre, J. 2003. An Efficient Algorithm for Projective Dependency Parsing. In
Proceedings of the 8th International Workshop on Parsing Technologies (IWPT 03),
pages 149–160, Nancy.
Nivre, J. 2005. Inductive Dependency Parsing of Natural Language Text. PhD thesis,
School of Mathematics and System Engineering, Växjö University.
Nivre, J. and Nilsson, J. 2005. Pseudo-projective Dependency Parsing. In
Proceedings of the 43rd Annual Meeting of the Association for Computational
Linguistics (ACL'05), pages 99-106, Ann Arbor.
Nivre, J. ---. Dependency Grammar and Dependency Parsing. Ms.
Oepen, S., K. Toutanova, S. M. Shieber, C. D. Manning, D. Flickinger, and T. Brants.
2002. The LinGO Redwoods Treebank: Motivation and Preliminary Applications. In
Proceedings of COLING. Taipei, Taiwan.
O'Keeffe, A. & M. McCarthy (eds.). 2010. The Routledge Handbook of Corpus
Linguistics. Routledge: London.
Oflazer, K., B. Say, D. Z. Hakkani-Tür, and G. Tür. 2003. Building a Turkish
Treebank. In Abeillé (ed.): 261-277.
Palmer, M., D. Gildea, P. Kingsbury. 2005. The Proposition Bank: An Annotated
Corpus of Semantic Roles. Computational Linguistics 31(1):71-106.
Palmer, M., R. Bhatt, B. Narasimhan, O. Rambow, D. M. Sharma, and F. Xia. 2009.
Hindi Syntax: Annotating Dependency, Lexical Predicate-Argument Structure, and
Phrase Structure. In Proceedings of 7th International Conference on Natural
Language Processing: 14-17.
Perlmutter, D. M. and P. M. Postal. 1984. The 1-Advancement Exclusiveness Law.
In Studies in Relational Grammar 2, D. M. Perlmutter & C. G. Rosen (eds.).
University of Chicago Press.
Perlmutter, D. 1983. Studies in Relational Grammar. University of Chicago Press.
Poesio, M. 1999. Coreference. In MATE Deliverable 2.1.
http://www.ims.unistuttgart.de/projekte/mate/mdag/cr/cr_1.html
Piwek, Paul & Kees van Deemter. 2006. Constraint-based Natural Language
Generation: A Survey. Technical Report. The Open University, UK.
Phillips, C. ---. Should We Impeach Armchair Linguists? To appear in S. Iwasaki
(ed.), Japanese/Korean Linguistics 17. CSLI Publications. Special Section of Papers
from a Workshop on 'Progress in Generative Grammar'. Ms.
Poesio, M. 2004. The MATE/GNOME Scheme for Anaphoric Annotation, Revisited.
In Proceedings of SIGDIAL.
Poesio, M. and R. Artstein. 2005. The Reliability of Anaphoric Annotation,
Reconsidered: Taking Ambiguity into Account. In Proceedings of ACL Workshop on
Frontiers in Corpus Annotation.
Polguère, A. & Mel'čuk, I. A. (eds.). 2009. Dependency in Linguistic Description. John
Benjamins.
Pustejovsky, J., A. Meyers, M. Palmer, and M. Poesio. 2005. Merging PropBank,
NomBank, TimeBank, Penn Discourse Treebank and Coreference. In ACL
Workshop: Frontiers in Corpus Annotation II: Pie in the Sky.
Rambow, O., Creswell, C., Szekely, R., Taber, H., Walker, M. 2002. A Dependency
Treebank for English. In Proceedings of LREC.
Bhatt, Rajesh. 2008. A Lecture at EFLU, Hyderabad.
http://people.umass.edu/bhatt/papers/eflu-aug18.pdf
Reddy, Prashanth, Aswarth Abhilash & Akshar Bharati. 2009. LTAG-spinal
Treebank and Parser for Hindi. In International Conference on Natural Language
Processing (ICON2009).
Reichartz, F., H. Korte, and G. Paass. 2009. Dependency Tree Kernels for Relation
Extraction from Natural Language Text. Machine Learning and Knowledge
Discovery in Databases: 270–285.
Renouf, A. 2002. The Time Dimension in Modern English Corpus Linguistics. In B.
Kettemann & G. Marko (eds.), Teaching and Learning by Doing Corpus Analysis:
Papers from the Fourth International Conference on Teaching and Language
Corpora, Graz 2000. Amsterdam.
Richa. 2011. Hindi Verb Classes & Their Argument Structure Alternations.
Cambridge Scholars Publishing: UK.
Ross, J. R. 1967. Constraints on Variables in Syntax. Doctoral dissertation, MIT.
Robins, R. H. 1967. A Short History of Linguistics. Longman.
Robinson, J. J. 1970. Dependency Structures and Transformational Rules.
Language 46: 259-285.
Sag, I. A. and J. D. Fodor, 1994. Extraction without Traces. In R. Aranovich, W.
Byrne, S.
Sampson, G. 2005. Quantifying the Shift Towards Empirical Methods. International
Journal of Corpus Linguistics 10 (1)
Sampson, G. 2007. Grammar without Grammaticality. Corpus Linguistics and
Linguistic Theory 3 (1)
Schneider, G. 1998. A Linguistic Comparison of Constituency, Dependency and
Link Grammar. ExtrAns Research Report: Dependency vs. Constituency.
Bird, S. and Simons, G. 2001. The OLAC Metadata Set and Controlled
Vocabularies. In Proceedings of ACL/EACL Workshop on Sharing Tools and
Resources for Research and Education. http://arXiv.org/abs/cs/0105030.
Bird, S. and Simons, G. 2001. Seven Dimensions of Portability for Language
Documentation and Description. LDC UPenn http://arxiv.org/abs/cs/0204020v1
Salmon-Alt, S. and L. Romary. 2004. RAF: Towards a Reference Annotation
Framework, LREC.
Santorini, B. 1990. Part-of-speech Tagging Guidelines for the Penn Treebank
Project. Technical Report MS-CIS-90-47, Department of Computer and Information
Science, University of Pennsylvania.
Sharma, D. M., R. Sangal, L. Bai, R. Begum, and K. V. Ramakrishnamacharyulu.
2007. AnnCorra: TreeBanks for Indian Languages, Annotation Guidelines
(manuscript), IIIT, Hyderabad, India.
Shaumyan, S. 1977. Applicative Grammar as a Semantic Theory of Natural
Language. Chicago Univ. Press.
Shieber, S. M. 1985. Evidence Against the Context-freeness of Natural Language.
Linguistics and Philosophy 8 (3): 333-343.
Singh, A. K. 2008. A Mechanism to Provide Language-encoding Support and an
NLP Friendly Editor. In Proceedings of the Third International Joint Conference on
Natural Language Processing (IJCNLP). Hyderabad, India: Asian Federation of
Natural Language Processing.
Singh, A. K. 2011. A Concise Query Language with Search and Transform
Operations for Corpora with Multiple Levels of Annotation. CoRR,
http://arxiv.org/abs/1108.1966.
Singh, A. K. & Ambati, B. 2010. An Integrated Digital Tool for Accessing Language
Resources. In The Seventh International Conference on Language Resources and
Evaluation (LREC). Malta: The European Language Resources Association (ELRA).
Singh, A. K. 2011. Part-of-Speech Annotation with Sanchay. In Proceedings of
National Seminar on POS Annotation: Issues and Perspectives. LDCIL, CIIL,
Mysore.
Skut, Wojciech, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit, 1997. An
Annotation Scheme for Free Word Order Languages. In Proceedings of the Fifth
Conference on Applied Natural Language Processing ANLP-97. Washington, DC.
Simkova, Maria (ed.). 2006. Insight into the Slovak and Czech Corpus Linguistics.
Publishing House of Slovak Academy of Sciences: Bratislava.
Sinclair, John & Ronald Carter (eds.). 2004. Trust the Text: Language, Corpus &
Discourse. Routledge: London.
Singh, Anil Kumar, Samar Husain, Harshit Surana, Jagadeesh Gorla, Chinnappa
Guggilla & Dipti Misra Sharma. 2007. Disambiguating Tense, Aspect and Modality
Markers for Correcting Machine Translation Errors. In Proceedings of the
Conference on Recent Advances in Natural Language Processing (RANLP).
Borovets, Bulgaria.
Sinha, Mahesh K. 2009. A Journey from Indian Scripts Processing to Indian
Language Processing. IEEE Annals of the History of Computing: 8-31. IEEE
Computer Society.
Sampson, G. 1992. Probabilistic Parsing. In Svartvik, 1992b, pages 105–122.
Sampson, G. 2000. Thoughts on Two Decades of Drawing Trees. In Abeillé, 2000,
pages 23-41.
Taylor, A., Marcus, M., and Santorini, B. 2000. The Penn Treebank: An Overview.
In Abeillé, 2000, pages 5-22.
Telljohann, H. E. Hinrichs, S. Kübler and H. Zinsmeister. 2005. Stylebook of the
Tübinger Treebank of Written German (TüBa-D/Z).
Teubert, Wolfgang. 2001. Corpus Linguistics and Lexicography. International
Journal of Corpus Linguistics Vol. 6:125-153.
Teubert, Wolfgang. 2005. My Version of Corpus Linguistics. International Journal
of Corpus Linguistics 10.1: 1-13.
Thielen, C. and A. Schiller. 1996. Ein kleines und erweitertes Tagset fürs Deutsche
[A small and extended tagset for German]. In Feldweg (ed.). Technical Report,
University of Tübingen.
Tsai, J. L. 2005. A Study of Applying BTM Model on the Chinese Chunk
Bracketing. In LINC-2005, IJCNLP-2005: 21-30.
Uria, L., A. Estarrona, I. Aldezabal, M. Aranzabe, A. Díaz de Ilarraza, and M.
Iruskieta. 2009. Evaluation of the Syntactic Annotation in EPEC, the Reference
Corpus for the Processing of Basque. Computational Linguistics and Intelligent Text
Processing: 72-85.
Uszkoreit, H. 1986. Constraints on Order. Linguistics 24.
Vaidya, A., S. Husain, P. Mannem, and D. Sharma. 2009. A Karaka Based
Annotation Scheme for English. Computational Linguistics and Intelligent Text
Processing: 41–52.
Klimeš, Václav. 2006. Analytical and Tectogrammatical Analysis of a Natural
Language. Ph.D. Thesis. Charles University, Prague.
Van Deemter, K. and R. Kibble. 2001. On Coreferring: Coreference in MUC and
Related Annotation Schemes. Computational Linguistics 26(4): 629-637.
Van Der Beek, L., G. Bouma, R. Malouf, and G. Van Noord. 2002. The Alpino
Dependency Treebank. Language and Computers 45(1):8-22.
Van Valin, R. D. 1999. Generalized Semantic Roles and the Syntax-Semantics
Interface. In Corblin, F., Dobrovie-Sorin, C., and Marandin, J. M. (eds.), Empirical
Issues in Formal Syntax and Semantics 2, pages 373-389. Thesus, The Hague.
Van Valin, R. D. 2001. An Introduction to Syntax. Cambridge University Press,
Cambridge.
Vempaty, Chaitanya, Viswanatha Naidu, Samar Husain, Ravi Kiran, Lakshmi Bai,
Dipti M. Sharma & Rajeev Sangal. 2010. Issues in Analyzing Telugu Sentences
Towards Building a Telugu Treebank. In Proceedings of CICLing. Language
Technologies Research Centre, IIIT-Hyderabad, India. Pages 50-59.
Volodina, Elena. 2008. From Corpus to Language Classroom: Reusing Stockholm
Umeå Corpus in a Vocabulary Exercise Generator SCORVEX. Master's Thesis.
University of Gothenburg.
Wenger, Neven. 2009. The Syntax of Finiteness. Frankfurt a. M.
Woolford, Ellen. 1997. Four-Way Case Systems: Ergative, Nominative, Objective
and Accusative. Natural Language & Linguistic Theory 15: 181-227. Kluwer
Academic Publishers: Netherlands.
Xia, F., O. Rambow, R. Bhatt, M. Palmer and D. Sharma. 2009. Towards a Multi-
Representational Treebank. In Proceedings of the 7th International Workshop on
Treebanks and Linguistic Theories (TLT-7).
Xia, F., M. Palmer, N. Xue, M. E. Okurowski, J. Kovarik, F.-D. Chiou, S. Huang,
T. Kroch, and M. Marcus. 2000. Developing Guidelines and Ensuring Consistency
for Chinese Text Annotation. In Proceedings of LREC. Greece.
Xia, F. 2001. Automatic Grammar Generation from Two Different Perspectives. PhD
Thesis, University of Pennsylvania.
Xia, F. and M. Palmer. 2001. Converting Dependency Structures to Phrase
Structures. In Proceedings of the Human Language Technology Conference (HLT-
2001), San Diego, CA.
Xue, N., F.-D. Chiou and M. Palmer. 2002. Building a Large-Scale Annotated
Chinese Corpus. In Proceedings of COLING. Taipei, Taiwan.
Xue, N., F. Xia, F.-D. Chiou and M. Palmer. 2005. The Penn Chinese TreeBank:
Phrase Structure Annotation of a Large Corpus. Natural Language Engineering
11(2): 207-238.
Yong, C. and S.K. Foo. 1999. A Case Study on Inter-annotator Agreement for Word
Sense Disambiguation. Ms.
Zeldes, Amir & Anke Lüdeling (eds.). 2011. Proceedings of Quantitative
Investigations in Theoretical Linguistics 4. Humboldt-Universität zu Berlin.
Zwicky, A. M. 1985. Heads. Journal of Linguistics 21: 1-29.