

Page 1: Lingloss - Artificial Language

Lingloss

Welcome to The Lingloss Project Page!

In 1967, I designed what was meant to be an international auxiliary language called Lingloss. Like many other such projects, it was never really ready enough to inflict upon the public. In 2012, Lingloss still remains a work in progress; however, I believe I have recently made some progress on one aspect of the overall problem. The reasons for this belief are more fully detailed at

http://www.richardsandesforsyth.net/docs/bunnies.pdf .

So I am using this webpage to share some software which, when more fully developed, may help designers of the coming international auxiliary language. (Yes, there will have to be one eventually: the human race can always be relied upon to do the right thing, as Churchill said of the Americans, once they have exhausted the alternatives.) The software is concerned with the problem of establishing a suitable core vocabulary. This is an obstacle that prior efforts have never convincingly overcome.

What you will find when you download and unzip

[glossoft.zip]

is a pair of programs written in Python3 (along with various ancillary files) which address the following aspects of the vocabulary-building problem:

1. How to choose a core collection of lexical items, i.e. what Hogben (1963) calls a "list of essential semantic units" (LESU), which is concise enough to be learnt in a matter of weeks and at the same time extensive enough to support the great majority of essential communicative functions;

2. How to choose a suitable international word for each of the items in the LESU.

Towards a Core Vocabulary

The program corevox1.py takes in several lists of essential semantic units (formatted one item per line) and produces a consensus list consisting of all the items that occur in at least minfreq of the input lists, where minfreq is an integer from 1 (in which case the output is all the items that occur in any of the input lists) to N, the number of input lists (in which case the output is only those items common to all the input lists).
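The filtering at the heart of this step can be sketched in a few lines of Python. This is a hedged reconstruction from the description above, not the actual code of corevox1.py (which also handles parameter files and case-folding); the function name is mine.

```python
from collections import Counter

def consensus_list(word_lists, minfreq=2):
    """Keep every item appearing in at least minfreq of the input lists.

    Each input list is treated as a set, so repeats within a single
    list count only once.
    """
    counts = Counter()
    for words in word_lists:
        counts.update(set(words))
    return sorted(w for w, n in counts.items() if n >= minfreq)
```

With minfreq=1 this yields the union of all the lists; with minfreq equal to the number of input lists it yields their intersection, exactly as described above.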

Where do the input lists come from? Well, to test the program, four files containing previous attempts to come up with a LESU are provided (baslist, hoglist, longlist and maclist). These are, respectively: the Basic English wordlist (Ogden, 1937); the LESU of "Essential World English" (Hogben, 1963); the defining vocabulary of the Longman English Dictionary (Longman, 2003); and the defining vocabulary of the MacMillan English Dictionary for Advanced Learners (MacMillan, 2002). [subfolder: lexicons]

Ogden and Hogben were trying to establish minimal subsets of words needed for the majority of communicative purposes in simplified versions of English. Compilers of the Longman and MacMillan dictionaries were trying to establish basic word lists in terms of which all the other entries in their dictionaries could be defined. Thus all four lists represent principled attempts to create concise but


effective vocabularies. They didn't all settle on the same words, but any term that appears in more than one of these sets is likely to have a strong claim for inclusion in anyone's core vocabulary.

Note that, although most of the entries in these lists are relatively common, they are not mere frequency lists. They result from attempts to cover the most commonly used concepts without redundancy. Therefore some high-frequency terms will be excluded if they are redundant.

I should perhaps apologize for the anglocentric bias here, although in mitigation it should be noted that there is nothing in this software that limits it to the English language. I am most at home with English examples, but I would hope that others could apply the same methods to other languages: the comparisons would be instructive.

Towards an International Vocabulary

The second program, avwords3.py, is more innovative, as far as the field of interlinguistics is concerned. It finds the 'verbal average' of a number of different words. As far as I know, nobody has ever defined what a verbal average might be; so, to be a little more specific, the heart of this program is a function that takes in a number of strings (usually words, though they could be short phrases) and produces a string which is, in a certain sense, the most typical representative of those input strings. As currently implemented, it works in two stages. Firstly, using a string-similarity scoring function, the string in the group which is most similar to all the others of that group is chosen. Secondly, certain manipulations, such as dropping a character or swapping two adjacent characters, are tried to see if they increase the similarity score of that string in relation to the rest and, if so, the modified string is accepted.

For example, given the following inputs

['cheval', 'caballo', 'cavallo', 'cavalo', 'cal', 'equus', 'cavall']

which are the French, Spanish, Italian, Portuguese, Romanian, Latin and Catalan words for 'horse', the program computes that

'cal'

is the most central or typical item. In this case, no deletions or letter-exchanges make it more typical, so it is retained.
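The two-stage procedure can be sketched as follows. The actual string-similarity scoring function used by avwords3.py is not given here, so this sketch substitutes Python's difflib.SequenceMatcher ratio as a stand-in; with a different scorer the chosen 'average' can differ (this stand-in does not necessarily reproduce 'cal' for the horse example), so treat it as an illustration of the method rather than the author's implementation.

```python
from difflib import SequenceMatcher

def score(candidate, others):
    # Total similarity of one string to all the other strings in the
    # group (SequenceMatcher is a stand-in for the program's own scorer).
    return sum(SequenceMatcher(None, candidate, w).ratio() for w in others)

def verbal_average(words):
    # Stage 1: choose the input string most similar to all the rest.
    idx = max(range(len(words)),
              key=lambda i: score(words[i], words[:i] + words[i + 1:]))
    best, others = words[idx], words[:idx] + words[idx + 1:]
    # Stage 2: hill-climb using single-character deletions and
    # adjacent-character swaps, keeping any variant that scores higher.
    improved = True
    while improved:
        improved = False
        variants = [best[:i] + best[i + 1:] for i in range(len(best))]
        variants += [best[:i] + best[i + 1] + best[i] + best[i + 2:]
                     for i in range(len(best) - 1)]
        for v in variants:
            if score(v, others) > score(best, others):
                best, improved = v, True
                break
    return best
```

Because every accepted variant strictly increases the score, the hill-climbing loop always terminates.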

The program works by reading in several (utf8) files in the format exemplified below.

young	giovane
you	voi
yes	sì
yellow	giallo
year	anno
would	sarebbe
work	lavorare
word	parola
wool	lana
woods	bosco
wood	legno
woman	donna
with	con
wire	filo
wing	ala
wine	vino
window	finestra

This is an extract from a simple English-Italian lexicon: each line consists of a source-language term followed by a target-language equivalent, with a tab character separating them.
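Reading a lexicon in this format takes only a few lines; this is a hedged sketch (the function name is mine, not taken from the distributed code):

```python
def read_lexicon(path):
    """Read a tab-separated bilingual lexicon into a dict mapping
    source-language terms to target-language equivalents."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue  # skip blank or malformed lines
            source, target = line.split("\t", 1)
            lexicon[source] = target
    return lexicon
```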

Each of these input lexicons uses the same source language (English in the examples provided) with a different target language (various Romance languages in the examples provided). These sample bilingual lexicons can be found in the lexicons folder after you have unzipped the software.

Incidentally, the part that hasn't been automated is going from the LESU produced as output by corevox1.py to the several lexicons needed as input by avwords3.py. There are lots of public-domain bilingual lexicons, so it would be possible to write software that took a LESU and an existing lexicon (English-to-target-language in the present case) and produced suitable input for avwords3.py, but to do it properly would, I suspect, require human scrutiny anyway, so that task is left as "an exercise for the reader".
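That un-automated step, cutting an existing bilingual lexicon down to the terms in a LESU and flagging the gaps for human attention, might look like this (a sketch under stated assumptions: the function name and data shapes are mine):

```python
def restrict_to_lesu(lesu_terms, lexicon):
    """Keep only lexicon entries whose source term appears in the LESU.

    lesu_terms : iterable of source-language terms (one LESU item each)
    lexicon    : dict mapping source terms to target-language terms
    Returns (found, missing): the restricted lexicon, plus the LESU
    terms absent from the lexicon, which would need human attention.
    """
    wanted = set(lesu_terms)
    found = {s: t for s, t in lexicon.items() if s in wanted}
    missing = sorted(wanted - set(found))
    return found, missing
```

The `found` dict can then be written out one tab-separated pair per line, giving a file in exactly the input format avwords3.py expects.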

The output of avwords3.py is a lexicon in the same format as the inputs, where each source-language item is associated with the 'verbal average' of the terms in the various target languages -- intended as a first approximation to an English-Lingloss dictionary. Example output produced from the seven small example inputs in the lexicons folder follows below.

Mon Dec 24 16:28:24 2012
window	fenestra
wine	vin
wing	ala
wire	fil
with	con
woman	mulier
wood	lea
woods	bos
wool	lana
word	parala
work	trabaar
would	voudrais
year	ano
yellow	gallo
yes	si
you	voi
young	jove

On the basis of the example data provided here, Lingloss, if it ever gets into circulation, would look very much like a Romance language, a kind of simplified, modernized Latin. However, that decision is by no means set in stone. The main point of computerizing parts of the process is to permit exploration of alternative design decisions.


The English word 'would' isn't expressed by a single word in these languages, which illustrates the need for human pre-processing or post-processing. In fact, avwords3.py also produces a listing file in which the quality of the 'verbal averages' is shown. This is meant to provide serious users with information to enable them to decide which of the proposed term equivalents need further attention.

These programs are prototypes, intended to illustrate a particular methodology, which I believe is novel. Much work remains to be done. For example, comparison of alternative string-similarity scoring functions would be a good idea; as would a test of whether each target word should be rendered into a common phonetic representation or just taken as spelled; and so on. The main point is to stimulate such work.

Running the programs

To execute the programs you will have to obtain Python (version 3 not 2) if you don't already have it. This can be found at

www.python.org

I have tested these programs under Windows 7, but I believe they should run without alteration under Linux as well.

Then you will have to unzip the file

glossoft.zip

preferably at your top-level directory. It will have the following subfolders.

lexicons - sample LESUs and small-scale bilingual lexicons
libs     - common routines and variables for the programs in p3
op       - default directory to receive output
p3       - Python3 programs
parapath - directory to hold parameter files

Each program requires certain input parameters, which are put into a text file that can be edited with Notepad, Notepad++ or another text editor. Example parameter files for using the example data provided will be found in the parapath folder once the zipped file has been unpacked. Each line of a parameter file starts with a parameter name, then one or more spaces, then the value for that parameter. Unknown parameters are ignored. Parameters not given a value in the parameter file receive a default value.
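A parser for this format needs only a few lines; this sketch follows the rules just stated (one name-value pair per line, unknown names ignored, absent names defaulted) but is not the code actually shipped in the libs folder:

```python
def read_params(path, defaults):
    """Parse a parameter file of 'name value' lines.

    Lines naming a parameter not present in `defaults` are ignored;
    parameters absent from the file keep their default values.
    """
    params = dict(defaults)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(None, 1)  # name, then the rest as the value
            if len(parts) == 2 and parts[0] in params:
                params[parts[0]] = parts[1].strip()
    return params
```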

A table of parameters used by the programs follows.

parameter  type                        default                  description
casefold   0 .. 1                      1                        whether to fold upper case to lower case on input; 1 = yes, 0 = no
jobname    alphanumeric string         same name as program     name to link output files
minfreq    integer                     2                        minimum number of input LESU files in which a term must appear to be kept for output
outgloss   Windows or Linux file-spec  avwords_glos             output file for consensus lexicon
vocfile    Windows or Linux file-spec  corevox_vocs             output file for consensus LESU
voclists   Windows or Linux file-spec  lesu.dat / lexicons.txt  input text file containing a list of input file-specs, 1 per line
withkey    0 .. 1                      0                        whether to include the source-language term along with the target-language equivalents in avwords (1) or not (0)

The content of coretest.txt, a simple initial parameter file for corevox1.py, is copied below.

voclists c:\glossoft\parapath\lesu.txt
vocfile c:\glossoft\op\corelist.txt
minfreq 2

The content of wordavs.txt, a starter parameter file for avwords3.py, is copied below.

voclists c:\glossoft\parapath\glossies.txt
outgloss c:\glossoft\op\glossout.txt
withkey 0

Pretty simple, eh?

References

Hogben, L. (1943). Interglossa. Harmondsworth: Penguin Books.

Hogben, L. (1963). Essential World English. London: Michael Joseph Ltd.

Longman (2003). Dictionary of Contemporary English. Harlow: Pearson Educational Ltd.

Macmillan (2002). MacMillan English Dictionary for Advanced Learners. Oxford: MacMillan Education.

Ogden, C.K. (1937). The ABC of Basic English. London: Kegan Paul, Trench, Trubner & Co. Ltd.


Appendix

Constructed Auxiliary Languages

Year  Language                     Surname        Forename(s)
1661  Universal Character          Dalgarno       George
1668  Real Character               Wilkins        Bishop
1699  Characteristica Universalis  Leibniz        Gottfried
1765  Nouvelle Langue              de Villeneuve  Faiguet
1866  Solresol                     Sudre          Francois
1868  Universalglot                Pirro          Jean
1880  Volapuk                      Schleyer       Martin
1886  Pasilingua                   Steiner        Paul
1887  Bopal                        de Max         Saint
1887  Esperanto                    Zamenhof       Lazarus
1888  Lingua                       Henderson      George
1888  Spelin                       Bauer          Georg
1890  Mundolingue                  Lott           Julius
1892  Latinesce                    Henderson      George
1893  Balta                        Dormoy         Emile
1893  Dil                          Fieweger       Julius
1893  Orba                         Guardiola      Jose
1896  Veltparl                     von Arnim      Wilhelm
1899  Langue Bleu                  Bollack        Leon
1902  Idiom Neutral                Rosenberger    Waldemar
1903  Latino sine Flexione         Peano          Giuseppe
1906  Ro                           Foster         Edward
1907  Ido                          de Beaufront   Louis
1913  Esperantido                  de Saussure    Rene
1922  Occidental                   de Wahl        Edgar
1928  Novial                       Jespersen      Otto
1943  Interglossa                  Hogben         Lancelot
1944  Mondial                      Heimer         Helge
1951  Interlingua                  Gode           Alexander
1957  Frater                       Thai           Pham Xuan
1961  Loglan                       Brown          James
1967  Lingloss                     Forsyth        Richard
1983  Uropi                        Landais        Joel
1996  Unish                        Jung           Young Hee
1998  Lingua Franca Nova           Boeree         George
2002  Mondlango                    Yafu           He
2011  Angos                        Wood           Benjamin

 

In Praise of Fluffy Bunnies

Copyright © 2012, Richard Forsyth.

Background

Reading John Lanchester's Whoops!, an entertaining account of how highly paid hotshot traders in a number of prestigious financial institutions brought the world to the brink of economic collapse, I was struck by the following sentence:

"In an ideal world, one populated by vegetarians, Esperanto speakers and fluffy bunny wabbits, derivatives would be used for one thing only: reducing levels of risk." (Lanchester, 2010: 37).

What struck me about this throwaway remark, apart from the obvious implication that derivatives were actually used to magnify risk rather than reduce it (doubtless by carnivores ignorant of Esperanto), was its presumption that right-thinking readers would take it for granted that Esperanto symbolizes well-meaning futility -- thus highlighting the author's status as a tough-minded realist.

This is just one illustration that disdain for Esperanto in particular, and auxiliary languages in general, pervades intellectual circles in Britain today, as in many other countries. And if you dare to raise the subject of constructed international languages with a professional translator or interpreter, be prepared not just for disdain but outright hostility. Of course professional interpreters are among the most linguistically gifted people on the planet, and can't see why the rest of us shouldn't become fluent in half a dozen natural languages in our spare time. (Not to mention the fact that a widespread adoption of Esperanto, or one of its competitors, would have a seriously negative impact on their opportunities for gainful employment.) Thus Esperanto has become a symbol of lost causes, to be dismissed out of hand by practical folk. Yet those risk-junkies busily trading complex derivatives who brought us to the brink of ruin also thought of themselves as supremely practical, hard-headed folk. It turned out that they were in the grip of a collective delusion whose effects have impoverished us all. Perhaps they have something to learn from vegetarians and Esperanto speakers.

In the world of supposedly practical folk today, during an intercontinental recession, the European Union spends vast sums of money each year on translating thousands of tonnes of documents into 23 different official languages. The demand for simultaneous interpreters in Brussels, Luxembourg, Strasbourg and at the UN consistently outstrips supply. Meanwhile in the UK, cohort after cohort of schoolchildren emerge from secondary education unable to understand any language other than their own, often after years of instruction in French, German or Spanish. "Never mind," retort the anglophone triumphalists, "English is the international language these days."

If you really believe that English is an adequate lingua franca for Europe, let alone the world, try working in a multi-national research project. I spent 2 years as the only native English speaker in an EU project, with English as its official working language, and have been scarred by the experience. At first glance, this would seem to represent a triumph for the language of Shakespeare and Churchill: our native tongue has conquered the world! Sitting in a meeting, listening to colleagues conversing in Euro-globish heavily laden with mispronounced English jargon, trying to understand and make one's self understood, one starts to realize that this is not the triumph of English after all. It seems more like a devious kind of linguistic ju-jitsu, in which the world takes its revenge for being forced to accommodate monoglot English-speakers by twisting their language into a barbarous dialect which they find awkward and unfamiliar.

Admittedly, English began as a creole, the offspring of a shotgun marriage between Anglo-Saxon and Norman French, but it has come a long way since then, and I personally am very fond of it. The anglicized pidgin that passes for English as an international language isn't the language I love, and it isn't a very effective medium of international communication either. As it happens, the most eloquent exponent of English as a means of communication that I have ever heard was a Hungarian. But most of us have neither the talent nor the dedication to reach such a height in our mother tongue, still less in a foreign language. We do, however, have sufficient ability to achieve communicative competence in Esperanto within three months; and when we employ it we'll be communicating with others in the same position as ourselves, i.e. second-language users. There won't be the fertile soil for misunderstanding that exists when a native speaker instinctively exploits the quirks of the language or a non-native speaker makes a small slip of syntax with serious consequences.

Why then does Esperanto remain a fringe cult? Why doesn't the EU insist that all children in Europe spend even a single term learning Esperanto? Part of the answer must be that, once you accept the idea of a constructed language, there is always the seductive possibility of doing better. At certain points during a course on Esperanto you will come across a construction (such as using the so-called accusative after a preposition to indicate motion) that makes you ask: why did Zamenhof do it that way -- surely that wasn't a good idea? If I want to learn Chinese, I may be daunted by the tonal system, or the thousands of unfamiliar characters, but I have to accept them: that's the way it is. But with an artificial language I'm tempted to think "that should be changed" whenever I come across a difficult or unappealing aspect. Esperanto was in several respects superior to Volapuk, and the Idists think that Ido is better in many respects than Esperanto. Not everyone agrees. Jespersen -- no mere dabbler, he -- believed that Novial was better than either. So it goes on.

Hundreds, perhaps thousands, of artificial languages have been proposed in the past couple of centuries. Most never get used in action. In fact, the second most widely used artificial language, after Esperanto, is probably Klingon, which was deliberately designed to sound harsh and be hard to learn! Only Esperanto, for all its perceived imperfections, has ever sustained a community of users numbering more than a few thousand for more than a few decades. Other international language projects, apparently more elegant in concept (e.g. Interglossa, Lingua Franca Nova), have remained on the drawing board. A list of those that have attracted at least some serious attention is given in the Appendix to this essay.

Thus, early in the 21st century, we arrive at a situation where Esperanto stands as a proof of concept, but has failed to take off. In spelling it approaches the ideal of one character for one phoneme more closely than almost any natural language; consequently it is easy to pronounce from the page. Its grammar is far more regular than that of most natural languages; consequently it can be mastered in a month. Its vocabulary contains a large number of roots found in the major European languages; consequently it doesn't impose a forbidding memory load on adult learners -- provided that their first language is Indo-European. Above all, it has demonstrated repeatedly that international meetings can proceed smoothly without banks of interpreters sitting in cubicles and wires leading into everyone's ears. Nevertheless it is generally viewed as merely a hobby for cranks. Linguists sneer at it. EU policy-makers would rather pour rivers of taxpayers' money into translation agencies and an endless stream of machine-translation projects that never quite achieve their desired objectives than attempt to introduce Esperanto into the workings of the EU.

Personally, I believe this situation is highly unsatisfactory. I am motivated to attempt to do something about it for two primary reasons:

1. In today's globalized civilization, the need for a common international medium of communication is more urgent than ever before;

2. The strain placed on English in its role as de facto international language is turning it into a monstrosity.

Therefore I intend part of my website to play host to yet another effort to devise a constructed auxiliary language for international communication. I plan to kick off the process and with luck enlist some support. Why should such a quixotic enterprise succeed, when hundreds before it have failed? Well, it might not; but there is one advantage that neither Zamenhof nor any of the early pioneers enjoyed, and which none of the more recent interlinguists seem to have exploited -- the computer.

Take my Word for it!

An international language needs (1) a simple orthography, (2) a regular grammar, and (3) an easily learned vocabulary. Typical interlanguage projects tend to emphasize the first two points but leave the third in the background. Yet choice of lexical units is the most important of the three. It is normal for proponents of an auxiliary language to claim that its vocabulary is 'international' in some sense, but the foundation for this claim is almost invariably subjective.

Zamenhof's approach to Esperanto vocabulary-building can be described as 'eclectic'. It has been said that Esperanto sounds like a Czech speaking Italian. He selected a motley collection of roots from the Germanic, Romance and Slavic languages of Europe. The effect is not unpleasing, but it is hardly systematic. What he didn't do was employ a clearly stated method to create a concise but effective core vocabulary, as Ogden (1937) and Hogben (1943) pointed out long ago. Most subsequent projects are open to the same criticism.

When it comes to creating a vocabulary, constructed languages take one of two main approaches:

Eclectic, where the designers pick from a variety of linguistic sources, sometimes with a small admixture of completely made-up items. Examples include: Esperanto, Novial, Loglan, Unish.

Coherent, where the vocabulary is drawn predominantly from a single source. Examples include: Latino sine Flexione (from Latin), Interglossa (from Greek), Interlingua (from the Romance languages), Lingua Franca Nova (from the Romance languages, apparently using Catalan as a kind of tie-breaker).

With the notable exception of Hogben's Interglossa (1943), none of these projects paid much attention to word economy, i.e. to establishing a minimal necessary core vocabulary. Indeed, the Interlingua English Dictionary (IALA, 1951) boasts of having 27,000 entries; while the Unish website (www.unish.org) has a section soliciting suggested new words from interested readers. In other cases the designers appear to have relied on their intuitions to decide how many and which words were necessary.

A Manifesto for Vegetarians, Esperantists & Other Cute Animals

My contention is twofold: firstly, that the world does need an international language; secondly, that it is possible to create a language that is superior for this purpose, in terms of learnability and usability, to either English or Esperanto.

1. Orthography: it is very easy to improve on English in this respect, and not difficult to improve on Esperanto, where the accented consonants are an irritant. Several projects have already shown this, e.g. Lingua Franca Nova.

2. Grammar: English grammar is a minefield for the unwary, and Esperanto also contains some unnecessary pitfalls. Again, ways of improving on this have already been demonstrated by Lingua Franca Nova among other projects.

3. Lexis: Esperanto vocabulary is too large and disorderly, English much more so.

It is the third item that is really crucial, and that is where all previous projects have fallen down. I believe the time is ripe for a more systematic approach, with the aid of computer processing.