Digital Text and Data Processing
Tokenisation
Today’s class
□ Tokenisation and creation of frequency lists
□ Keyword in context lists
□ Moretti and distant reading
□ Research projects and assignment 1
Revision
□ Regular expressions
□ Simple sequences of characters
□ Character classes, e.g. \w , \d or .
□ Quantifiers, e.g. {2,4} or ?, +, *
□ Anchors, e.g. \b , ^ , $
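As a quick refresher, the sketch below applies one pattern from each category to a sample string. The sample sentence and the variable names are invented for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $text = "In 1922 Joyce published Ulysses";

# Simple sequence of characters: matches the literal string "Joyce".
my $has_joyce = ( $text =~ /Joyce/ ) ? 1 : 0;

# Character class with a quantifier: \d{2,4} matches two to four digits.
my ($year) = $text =~ /(\d{2,4})/;

# Anchors: ^ ties the match to the start of the string,
# \b marks a word boundary.
my ($first_word) = $text =~ /^(\w+)\b/;

# The anchor $ ties the match to the end of the string.
my ($last_word) = $text =~ /(\w+)$/;

print "$has_joyce\n$year\n$first_word\n$last_word\n";
```

Here the parentheses are used in list context, so each match is assigned directly to a variable.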
Match variables
□ Parentheses create substrings within a regular expression
□ In Perl, the substring matched by the first pair of parentheses is stored in the variable $1
□ Example:
$keyword = "quick-thinking" ;
if ( $keyword =~ /(\w+)-\w+/ ) {
    print $1 ;    # This will print "quick"
}
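The same idea extends to more than one pair of parentheses. The sketch below is an illustrative variation on the example above, with a second group added: the first group fills $1 and the second fills $2. The variable names $first and $second are hypothetical.

```perl
use strict;
use warnings;

my $keyword = "quick-thinking";
my ( $first, $second );

# Two pairs of parentheses: the first fills $1, the second fills $2.
if ( $keyword =~ /(\w+)-(\w+)/ ) {
    ( $first, $second ) = ( $1, $2 );
    print "$first\n";     # the part before the hyphen
    print "$second\n";    # the part after the hyphen
}
```

Copying $1 and $2 into named variables straight away is a common precaution, since the next successful match overwrites them.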
Three types of variables
□ Scalars: a single value; start with $
□ Arrays: multiple values; start with @
@titles = ("Ulysses", "Dubliners", "Finnegans Wake") ;
□ Hashes: Multiple values which can be referenced with ‘keys’; start with %
my %isbn ;
$isbn{"9782070439713"} = "Ulysses" ;
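To show how the three types relate, the sketch below reads single values back out of an array and a hash. It reuses the titles and the key/value pair from the slide; the names $first, $count and $title are hypothetical.

```perl
use strict;
use warnings;

my @titles = ( "Ulysses", "Dubliners", "Finnegans Wake" );
my %isbn;
$isbn{"9782070439713"} = "Ulysses";    # key/value pair taken from the slide

my $first = $titles[0];                # one element of an array is a scalar, hence $
my $count = scalar @titles;            # number of elements in the array
my $title = $isbn{"9782070439713"};    # look up a value by its key

print "$first\n$count\n$title\n";
```

Note that a single element is always written with $, whether it comes from an array (square brackets) or a hash (curly brackets).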
Basic tokenisation
$line = "If music be the food of love, play on" ;
@array = split(" " , $line ) ;
# $array[0] contains "If"
# $array[4] contains "food"
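Splitting on whitespace leaves punctuation attached to the neighbouring word. The sketch below wraps the split in a hypothetical helper named tokenise, which also lower-cases the line first (an assumption, not something the slide does), and shows that the comma stays on the token.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a line and split it on whitespace.
sub tokenise {
    my ($line) = @_;
    return split( " ", lc $line );
}

my @words = tokenise("If music be the food of love, play on");

print scalar(@words), "\n";    # number of tokens
print "$words[6]\n";           # the comma is still attached to this token
```

Lower-casing means that "If" and "if" are later counted as the same word; whether that is desirable depends on the research question.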
Looping through an array
foreach my $w ( @words ) {
    print $w . "\n" ;    # one word per line
}
Creating a hash
my %freq ;    # creating a hash
$freq{"if"}++ ;    # assigning / updating a value
$freq{"music"}++ ;
print $freq{"if"} . "\n" ;
Calculation of frequencies
my %freq ;
foreach my $w ( @words ) {
$freq{ $w }++ ;
}
Looping through a hash
foreach my $f ( keys %freq ) {
    print $f . "\t" . $freq{$f} . "\n" ;
}
Sorting a hash
foreach my $f ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    print $f . "\t" . $freq{$f} . "\n" ;
}
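Putting the steps together, a frequency list could be sketched as follows. The helper count_tokens and the sample sentence are hypothetical, and the alphabetical tie-break in the sort is an addition that is not on the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each token occurs.
sub count_tokens {
    my (@tokens) = @_;
    my %counts;
    $counts{$_}++ foreach @tokens;
    return %counts;
}

# Hypothetical sample text.
my @words = split( " ", "the cat and the dog and the bird" );
my %freq  = count_tokens(@words);

# Highest frequency first; equal frequencies are ordered alphabetically.
my @ranked = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;

foreach my $w (@ranked) {
    print $w . "\t" . $freq{$w} . "\n";
}
```

Without the tie-break, words with the same frequency come out in no particular order, because Perl does not keep hash keys in a fixed sequence.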
Is it actually a word?
But she returned to the writing-table, observing, as she passed her son, "Still page 322?" Freddy snorted, and turned over two leaves. For a brief space they were silent. Close by, beyond the curtains, the gentle murmur of a long conversation had never ceased.
foreach my $w ( @words ) {
    if ( $w =~ /(\w+)/ ) {
        $freq{ $1 }++ ;    # count only the first run of word characters in the token
    }
}
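An alternative to cleaning each token after splitting is to pull the words out of the text directly with a global match. The sketch below is one possible approach, not the method from the slides: the helper word_tokens is hypothetical, and the decisions to keep hyphenated compounds together and to drop numbers are assumptions that a project may want to make differently.

```perl
use strict;
use warnings;

# Hypothetical helper: return lower-cased tokens made of letters only,
# keeping hyphenated compounds such as "writing-table" together.
sub word_tokens {
    my ($text) = @_;
    my @tokens = lc($text) =~ /[a-z]+(?:-[a-z]+)*/g;
    return @tokens;
}

# Opening of the passage quoted above.
my $text = 'But she returned to the writing-table, observing, as she passed her son, "Still page 322?"';
my @tokens = word_tokens($text);

print scalar(@tokens), "\n";    # number of tokens found
print "$tokens[5]\n";           # the hyphenated compound is kept as one token
```

Because the pattern only accepts the letters a to z, the page number and all punctuation are left out; texts with accented characters would need a broader character class.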