using language examples in an introductory sas programming class

Upload: bilisoly

Post on 07-Apr-2018


  • 8/4/2019 Using Language Examples in an Introductory SAS Programming Class

    1/12

    Using Language Examples in an Introductory SAS Programming Class

    USCOTS, Ohio State University
    Saturday, June 27th, 2009

    Roger Bilisoly, PhD
    Department of Mathematical Sciences
    Central Connecticut State University


    Why analyze language in a SAS class?

    There are several excellent sources of free texts on the Web. For example,

    Project Gutenberg at http://www.gutenberg.org/wiki/Main_Page

    Google books at http://books.google.com/

    VIRGObeta at http://virgobeta.lib.virginia.edu/

    There are several sources of free word lists on the Web. For example,

    Moby word lists for English, German, Spanish, French, Italian, and Japanese at Gutenberg.org.

    The American Cryptogram Association has lists for many additional languages. See http://cryptogram.org/cdb/words/words.html.

    The National Puzzlers' League has many types of word lists for English. See http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start.

    Adding variety to the types of data used could broaden the appeal of a statistics class.

    Many examples of statistical analyses of text have already been developed by linguists and computer scientists.

    Corpus linguists use computers to analyze text samples designed to be representative of a certain aspect of a language. For example, the million-word Brown Corpus was created to be representative of American English in 1961.


    Homework Problem: Find the Proportion of Each Letter of the Alphabet in Dickens's A Christmas Carol.

    Are there any letter frequency anomalies? For example, does the letter J appear more often than average due to the name Jacob Marley?

    This novel was originally published in 1843. How do its letter frequencies compare to American English in 1961, i.e., to the Brown Corpus?

    How do its letter frequencies compare to German frequencies, e.g., to Goethe's Die Leiden des jungen Werther?

    Complications: Other languages using the Latin alphabet often employ diacritical marks (e.g., German has umlauts) and sometimes add new letters (e.g., German has ß, the Eszett, which stands for a double s). Hence alphabets are more complex than one might first suppose.


    This SAS Code Introduces both Character Data and Frequency Tables.

    data carol;
        infile "C:\A_Christmas_Carol.txt";
        input char $1. @@;          /* read characters one at a time */
        lowchar = lowcase(char);    /* count upper- and lowercase together */
    run;

    data letters_carol;
        set carol;
        if anyalpha(lowchar) > 0;   /* keep alphabetic characters only */
    run;

    proc freq data=letters_carol order=freq;
        tables lowchar / out=carolfreq;
    run;

    Slide callouts: "Read characters one at a time." "SAS v9 has many character functions."

    The above code can be introduced early in a programming class, and

    the ability to read in external files is important for applications.


    Letter Frequencies for A Christmas Carol with some comparisons.

    The FREQ Procedure

                                       Cumulative    Cumulative
    lowchar    Frequency    Percent     Frequency       Percent
    e              14869      12.27         14869         12.27
    t              10890       8.99         25759         21.26
    o               9696       8.00         35455         29.26
    a               9315       7.69         44770         36.95
    h               8378       6.91         53148         43.86
    i               8309       6.86         61457         50.72
    n               7962       6.57         69419         57.29
    s               7916       6.53         77335         63.82
    r               7038       5.81         84373         69.63
    d               5676       4.68         90049         74.31
    l               4555       3.76         94604         78.07
    u               3335       2.75         97939         80.82
    w               3096       2.55        101035         83.38
    c               3036       2.51        104071         85.88
    g               2980       2.46        107051         88.34
    m               2841       2.34        109892         90.68
    f               2438       2.01        112330         92.70
    y               2299       1.90        114629         94.59
    p               2122       1.75        116751         96.35
    b               1943       1.60        118694         97.95
    k               1031       0.85        119725         98.80
    v               1029       0.85        120754         99.65
    x                131       0.11        120885         99.76
    j                113       0.09        120998         99.85
    q                 97       0.08        121095         99.93
    z                 84       0.07        121179        100.00
    ö                  1       0.00        121180        100.00

    Top 12 letters in frequency order for several sources:

    A Christmas Carol               ETOAHI NSRDLU
    Brown Corpus                    ETAOIN SRHLDU
    Die Leiden des jungen Werther   ENIRSH TADULC
    Rule of thumb                   ETAOIN SHRDLU

    The letter j (as a proportion of all letters):

    Dickens: 0.0009
    Brown: 0.0020

    The single ö in the table comes from the word Laocoön, a figure from Greek mythology.
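The proportions quoted for j can be read off the data set that PROC FREQ writes out. The following is a minimal sketch, assuming the carolfreq data set created by the earlier OUT= option (which carries the variables COUNT and PERCENT); the names carolprop and proportion are illustrative and do not appear in the original slides.

```sas
/* Convert PROC FREQ percentages to proportions (sketch; carolprop and
   proportion are hypothetical names introduced for illustration) */
data carolprop;
    set carolfreq;
    proportion = percent / 100;   /* PERCENT is on a 0-100 scale */
    format proportion 6.4;
run;

/* Show only the row for the letter j */
proc print data=carolprop;
    where lowchar = 'j';
    var lowchar count proportion;
run;
```

Running the same steps on a second text, such as a Brown Corpus file, would give a data set that could be merged with carolprop by lowchar for a side-by-side comparison.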


    Homework Problem: Find Initial Consonant Clusters.

    How do languages differ in their use of consonants? As noted earlier, diacritical marks and additional letters make this complicated. In addition, the same sound can be represented in quite different ways in different languages.

    Sounds in a language are restricted in practice: these are called phonotactic constraints in linguistics. For example, English has a ts sound (as in cats), but it doesn't appear at the beginning of words, except for loanwords like tsar (from the Russian царь, where ts = ц). German does have an initial ts sound, but it's represented with the letter z (as in Zimmer).

    However, ts can also appear where t ends a syllable and s starts the next syllable, as in pantsuit. In this case the sound is not the ts appearing in cats or tsar.

    Studying initial consonant clusters restricts attention to one syllable, so boundaries are not a problem.

    Let's compare English and German.
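One way to tabulate initial consonant clusters is sketched below. It is not necessarily the method used to produce the tables in these slides. It assumes a lowercase word list with one word per line (the path C:\crosswd.txt is borrowed from the puzzle example later in the slides), treats only a, e, i, o, u as vowels, and uses the illustrative names clusters, pos, and clusterfreq.

```sas
/* Extract the letters that precede the first vowel of each word (sketch) */
data clusters;
    length word $30 start $30;
    infile "C:\crosswd.txt";          /* assumed word list, one word per line */
    input word;
    word = lowcase(word);
    pos = findc(word, 'aeiou');       /* position of the first vowel, 0 if none */
    if pos > 1;                       /* keep words that begin with a consonant */
    start = substr(word, 1, pos - 1); /* the initial consonant cluster */
run;

/* Count the clusters, most frequent first */
proc freq data=clusters order=freq;
    tables start / out=clusterfreq;
run;
```

Treating y as a consonant is a simplification, and a German word list would also need a decision about umlauts and ß, as noted in the earlier slide on complications.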


    Initial Consonant Clusters: English vs. German

    English:

    Obs    start    COUNT    PERCENT
      1    c         7333    8.01202
      2    r         7012    7.66129
      3    m         6261    6.84075
      4    d         5831    6.37094
      5    s         5540    6.05299
      6    p         5398    5.89784
      7    b         5265    5.75253
      8    h         4079    4.45671
      9    t         3713    4.05682
     10    l         3706    4.04917
     11    f         3426    3.74324
     12    g         2473    2.70199
     13    w         2284    2.49549
     14    n         2206    2.41027
     15    pr        2150    2.34908
     16    v         1924    2.10216
     17    st        1364    1.49030
     18    ch        1330    1.45315
     19    tr        1311    1.43240
     20    j         1104    1.20623
     21    k         1017    1.11117
     22    sh         987    1.07839

    German:

    Obs    start    COUNT    PERCENT
      1    v        11356    9.43581
      2    g         9310    7.73577
      3    b         9282    7.71251
      4    w         8208    6.82011
      5    h         6851    5.69256
      6    z         6444    5.35438
      7    k         6214    5.16327
      8    s         4849    4.02908
      9    m         4847    4.02742
     10    f         4836    4.01828
     11    r         4035    3.35272
     12    d         3978    3.30536
     13    l         3501    2.90902
     14    t         3456    2.87162
     15    st        3144    2.61238
     16    n         2977    2.47362
     17    sch       2669    2.21770
     18    p         2521    2.09472
     19    tr        1724    1.43249
     20    pr        1681    1.39676
     21    sp        1348    1.12007
     22    fr        1296    1.07686
     23    gr        1258    1.04528

    First, note that English and German phonology (the sounds the letters make) differ. For example, a German v is pronounced like the English f. Second, these two languages have different constraints on initial letters. For example, almost no words in German start with c, but z, which is pronounced like ts, is a common starting letter (it ranks 6th above) in German. Third, the frequencies of initial letters do not match the overall letter frequencies found earlier.


    Analyzing Word Games and Language

    Many language games require finding words given specific letter constraints. Crossword puzzles and hangman are two examples.

    In linguistics, morphology, the study of the structure of words, can be analyzed in similar ways. Words are broken into morphemes, which are the smallest units of a word that have meaning. For example, in English, many adverbs are formed by adding the morpheme -ly to an existing word.

    Compare: "Scoot is quick" and "Scoot runs quickly." Quick is an adjective in the former sentence, and quickly is an adverb in the latter.

    "Run here, Scoot, and be quick about it." Here quick is used as an adverb. However, a rule with exceptions can still be useful.

    This adverb example is from Section 6.4.3 of Practical Text Mining with Perl (Bilisoly, 2008).


    Can you solve the following word puzzles?

    1. Find all the words that fit the following crossword puzzle pattern: ___b__u

    2. Find all the words that fit the following hangman pattern: _e____s, where t, a, o, i, n don't appear.

    3. How useful is the idea that most adverbs in English can be formed by adding -ly to an existing word? Unfortunately, there are many complications:

    Happy becomes happily (y changes to i).

    Seasonable becomes seasonably (e is dropped).

    Automatic becomes automatically (-al- is added).

    Hill becomes hilly (only y is added).

    And there are words ending in -ly that are not adverbs: anomaly, apply, fly, etc.


    Here are the SAS solutions to the crossword and hangman problems.

    data one;
        length word $30;
        infile "C:\crosswd.txt";
        input word;
        len = length(word);
    run;

    /* Crossword pattern ___b__u: 7 letters, b in position 4, u in position 7 */
    data two;
        set one;
        if len = 7;
        if substr(word,4,1) = 'b';
        if substr(word,7,1) = 'u';
    run;

    proc print data=two; run;

    SAS output:

    Obs    word       len
      1    jambeau      7

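    /* Hangman pattern _e____s: 7 letters, none of t, a, o, i, n;
       the first and last e are both in position 2, and the first
       and last s are both in position 7 (a negative start position
       makes FINDC search from right to left) */
    data three;
        set one;
        if len = 7;
        if findc(word,'taoin') = 0;
        if findc(word,'e') = 2 and findc(word,'e',-30) = 2;
        if findc(word,'s') = 7 and findc(word,'s',-30) = 7;
    run;

    proc print data=three; run;

    SAS output:

    Obs    word       len
      1    bedbugs      7
      2    bedrugs      7
      3    bedumbs      7
      4    begulfs      7
      5    ferrums      7
      6    peplums      7
      7    rebuffs      7
      8    redbuds      7
      9    redbugs      7
     10    regulus      7
     11    vellums      7
     12    zephyrs      7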
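The same two puzzles can also be phrased as regular expressions. The sketch below is an alternative to the slide code, not a replacement for it. It assumes the data set one created above and uses the illustrative output names crossword_rx and hangman_rx. The trailing \s* in each pattern allows for the blanks that pad word out to its declared length of 30.

```sas
/* Regular-expression versions of the two puzzles (sketch) */
data crossword_rx hangman_rx;
    set one;
    /* ___b__u : any three characters, b, any two characters, u */
    if prxmatch('/^...b..u\s*$/', word) then output crossword_rx;
    /* _e____s : e in position 2, s in position 7, and no
       t, a, o, i, n, e, or s in the remaining positions */
    if prxmatch('/^[^taoines]e[^taoines]{4}s\s*$/', word) then output hangman_rx;
run;

proc print data=crossword_rx; run;
proc print data=hangman_rx; run;
```

Each pattern is anchored with ^ and $ so that it must describe the whole word; this plays the role of the len = 7 test in the SUBSTR and FINDC versions.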


    Word Inflections

    A complete analysis of adverbs would be quite complicated. However, the exceptions noted earlier (happily, etc.) were easy to find by reading in a word list and then checking each word that ends in -ly to see if it is still a word after removing -ly.

    There is a methodology called regular expressions that finds general text patterns. This is implemented in version 9 of SAS using functions such as PRXPARSE and PRXMATCH.
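The -ly check described above might be sketched as follows. This is one possible approach rather than the code used for the talk. It assumes the word-list data set one from the puzzle example (variables word and len), and it uses a hash object for the lookup, which the slides do not mention. The names ly_words, stem, stem_is_word, and words are illustrative.

```sas
/* Flag -ly words whose stem is also in the word list (sketch) */
data ly_words;
    length word $30 stem $30;
    /* Load the whole word list into a hash table once, for stem lookups */
    if _n_ = 1 then do;
        declare hash words(dataset: "one");
        words.defineKey("word");
        words.defineDone();
    end;
    set one;
    /* Keep words ending in ly; \s* allows for trailing blanks */
    if prxmatch('/ly\s*$/', word) > 0 and len > 2;
    stem = substr(word, 1, len - 2);               /* word with ly removed */
    stem_is_word = (words.check(key: stem) = 0);   /* 1 if the stem is a word */
run;

/* How many -ly words have a stem that is itself a word? */
proc freq data=ly_words;
    tables stem_is_word;
run;
```

Words such as happily would be flagged with stem_is_word = 0 here, because happi is not a word, which is how exceptions to the simple rule surface.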

    English is not very inflected, but this varies from language to language. For example, English is less inflected than German, and Finnish is heavily inflected.

    Moreover, there are many other word structures (morphemes) to analyze: plurals, verb conjugations, compound nouns, etc.


    Current Status

    I used language examples in CCSU's STAT 456 (Fundamentals of SAS) for the first time in Spring 2009.

    Initial feedback is mixed. The language examples were difficult for non-native speakers of English.

    Would this be helpful in an introductory class? I plan to ask my future classes about their interest in word games to judge whether this is worth pursuing at the introductory level.