Digital Text and Data Processing Tokenisation

Upload: hubert-james

Post on 17-Jan-2016

Today’s class

□ Tokenisation and creation of frequency lists

□ Keyword in context lists

□ Moretti and distant reading

□ Research projects and assignment 1

Revision

□ Regular expressions

□ Simple sequences of characters

□ Character classes, e.g. \w , \d or .

□ Quantifiers, e.g. {2,4} or ?, +, *

□ Anchors, e.g. \b , ^ , $
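The constructs revised above can be combined in a single pattern. The following is a minimal sketch in Perl; the helper name describe_token and the sample tokens are assumptions made for illustration and do not come from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a token using character classes,
# quantifiers and anchors.
sub describe_token {
    my ($token) = @_;
    return "year"       if $token =~ /^\d{4}$/;      # exactly four digits
    return "number"     if $token =~ /^\d+$/;        # one or more digits
    return "hyphenated" if $token =~ /^\w+-\w+$/;    # word, hyphen, word
    return "word"       if $token =~ /^\w+$/;        # word characters only
    return "other";
}

foreach my $t ( "1922", "42", "quick-thinking", "music", "?!" ) {
    print $t . "\t" . describe_token($t) . "\n";
}

# \b marks a word boundary, so this matches "the" but not "then" or "other".
my $line = "If music be the food of love, play on";
print "whole word found\n" if $line =~ /\bthe\b/;
```

The order of the tests matters: "1922" also matches /^\w+$/, so the more specific patterns are tried first.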

Match variables

□ Parentheses capture the part of the text matched by the enclosed sub-pattern

□ In Perl, the text captured by the first pair of parentheses is stored in the variable $1

□ Example:

$keyword = "quick-thinking" ;

if ( $keyword =~ /(\w+)-\w+/ ) {
    print $1 ;    # This will print "quick"
}
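A pattern may contain more than one pair of parentheses; the second pair is then available as $2. The sketch below assumes a hypothetical helper named split_compound, which is not part of the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: return both halves of a hyphenated compound,
# or an empty list if the keyword contains no such compound.
sub split_compound {
    my ($keyword) = @_;
    if ( $keyword =~ /(\w+)-(\w+)/ ) {
        return ( $1, $2 );    # $1 = first group, $2 = second group
    }
    return ();
}

my ( $first, $second ) = split_compound("quick-thinking");
print $first . "\n";     # prints "quick"
print $second . "\n";    # prints "thinking"
```

The match variables are only meaningful after a successful match, which is why they are read inside the if block.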

Three types of variables

□ Scalars: a single value; start with $

□ Arrays: multiple values; start with @

@titles = ("Ulysses", "Dubliners", "Finnegans Wake") ;

□ Hashes: multiple values which can be referenced with 'keys'; start with %

my %isbn ;
$isbn{"9782070439713"} = "Ulysses" ;
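As a rough illustration of how the three variable types are read back, the sketch below accesses an array element by position and a hash value by key. The helper title_for is hypothetical and only added to show a lookup that may fail.

```perl
use strict;
use warnings;

my @titles = ( "Ulysses", "Dubliners", "Finnegans Wake" );

my %isbn;
$isbn{"9782070439713"} = "Ulysses";

# A single element of an array or hash is a scalar, so it is written with $.
print $titles[1] . "\n";                # prints "Dubliners"
print scalar(@titles) . "\n";           # prints 3, the number of elements
print $isbn{"9782070439713"} . "\n";    # prints "Ulysses"

# Hypothetical helper: look up a title, with a fallback for unknown keys.
sub title_for {
    my ($key) = @_;
    return exists $isbn{$key} ? $isbn{$key} : "unknown";
}

print title_for("0000000000000") . "\n";    # prints "unknown"
```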

Basic tokenisation

$line = "If music be the food of love, play on" ;

@array = split(" " , $line ) ;

# $array[0] contains "If"
# $array[4] contains "food"
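Splitting on spaces keeps capitalisation and punctuation, so "If" and "if" would later be counted as different words. A minimal sketch of one possible refinement, lower-casing before splitting, is given below; the helper name simple_tokens is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a line and split it on whitespace.
# split(" ", ...) splits on runs of whitespace and ignores leading spaces.
sub simple_tokens {
    my ($line) = @_;
    return split( " ", lc $line );
}

my @array = simple_tokens("If music be the food of love, play on");
print $array[0] . "\n";    # prints "if"
print $array[6] . "\n";    # prints "love," (the comma stays attached)
```

Punctuation is still part of the token here, which is the problem the later slides address.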

Looping through an array

foreach my $w ( @words ) {

print $w ;

}

Creating a hash

my %freq ;

Assigning / updating a value

$freq{"if"}++ ; $freq{"music"}++ ;

print $freq{"if"} . "\n" ;

Calculation of frequencies

my %freq ;

foreach my $w ( @words ) {

$freq{ $w }++ ;

}
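The loop above can be tried on a short sample. The sketch below wraps it in a hypothetical helper named count_words; the sample sentence is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each word occurs in a list.
sub count_words {
    my (@words) = @_;
    my %freq;
    foreach my $w (@words) {
        $freq{$w}++;    # a missing key starts at 0 and becomes 1
    }
    return %freq;
}

my @words = split( " ", "the cat and the dog and the bird" );
my %freq  = count_words(@words);

print $freq{"the"} . "\n";    # prints 3
print $freq{"and"} . "\n";    # prints 2
```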

Looping through a hash

foreach my $f ( keys %freq ) {
    print $f . "\t" . $freq{$f} . "\n" ;
}

Sorting a hash

# Sort the keys by their frequency, highest first
foreach my $f ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    print $f . "\t" . $freq{$f} . "\n" ;
}
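Words with the same frequency come out in no fixed order when only the counts are compared. One possible refinement, sketched below, adds an alphabetical tie-break with cmp; the helper name sorted_words and the sample counts are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words ordered by descending frequency,
# with ties broken alphabetically.
sub sorted_words {
    my (%freq) = @_;
    return sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
}

my %freq = ( the => 3, and => 2, cat => 1, dog => 1, bird => 1 );

foreach my $f ( sorted_words(%freq) ) {
    print $f . "\t" . $freq{$f} . "\n";
}
# Output order: the, and, bird, cat, dog
```

The numeric comparison <=> returns 0 for equal counts, so the string comparison after || is only used for ties.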

But she returned to the writing-table, observing, as she passed her son, "Still page 322?" Freddy snorted, and turned over two leaves. For a brief space they were silent. Close by, beyond the curtains, the gentle murmur of a long conversation had never ceased.

Is it actually a word?

foreach my $w ( @words ) {
    # Count only the run of word characters, ignoring surrounding punctuation
    if ( $w =~ /(\w+)/ ) {
        $freq{ $1 }++ ;
    }
}
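An alternative to testing each token after splitting is to collect the words directly with a global match. The sketch below is one way to do this, not the method of the slides; the helper name words_in is hypothetical, and treating hyphenated and apostrophised forms as single words is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return all lower-cased words in a text.
# The pattern keeps internal hyphens and apostrophes ("writing-table")
# but drops surrounding punctuation and quotation marks.
sub words_in {
    my ($text) = @_;
    my @tokens = ( lc($text) =~ /\w+(?:[-']\w+)*/g );
    return @tokens;
}

my @words = words_in('she returned to the writing-table, observing, "Still page 322?"');

foreach my $w (@words) {
    print $w . "\n";
}
```

Because \w also matches digits, "322" is kept as a token; whether numbers should count as words depends on the research question.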