Text Extraction using Regular Expressions
Shih-Pei Chen
Project Manager, China Biographical Database, Harvard University
The Digitization in the Humanities Workshop @ Rice University
April 5-7, 2013
Downloads for today
• Slides and sample texts (a package)
– On OWL-Space
• Text editor(s)
– Mac users: please install TextWrangler
– PC users: EmEditor, UltraEdit, or both
• CBDB Regex Machine
– http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page515758 -- download the CBDBRegexMachine_July2012.zip on this page
The China Biographical Database – Modeling Life Histories – from anecdote to data
Biography Prosopography
Social Network Analysis
Geospatial Analysis
Big Data
• What are you going to do with the great amount of text on the Web?
– Is there information you want to search for?
– Is there something you want to analyze?
• CBDB experience: we use regular expressions to extract biographical data from the full texts of thousands of historical records
What regular expressions can do for you
• Beyond keyword search
– Search for written variations
– Search for patterns
– Search and replace => tagging
• You don’t have to learn programming in order to use regular expressions
– Just use a text editor that supports regex
Today
• Part 1: Learn regular expressions
– Hands-on exercises: matching regexes against some texts in a text editor.
• Part 2: A real play
– Using regexes + Search and Replace in a text editor
• Part 3: CBDB Regex Machine
– Using a graphical user interface to design regexes and test them against a text. Tagging the matches with XML tags.
UNDERSTANDING REGULAR EXPRESSIONS
Regular expressions
• A powerful way of describing patterns in strings
• You describe the pattern, and the machine matches it against the text (a string of letters, digits, and symbols)
Automata
• Imagine a belt sending characters in line:
• The string in line (the input): abcde
– It can match this pattern: abcde
– It can also match this pattern: bc (but only the substring “bc” in the input will be matched: a partial match)
Belt: a b c d e
abcde?
Comparing the input against the regex character by character
Input: a b c d e
Regex: a b c d e   Match!
Comparing the input against the regex character by character
Input: a b c d e
Regex: b c   Match!
Behind the scenes: The robot picks up a in the input and finds that a does not match b, the first character in the regex. The robot then throws a out and picks up the next character in the input, which is b. This time the robot finds that the two b’s match each other.
Comparing the input against the regex character by character
Input: a b c d e
Regex: b d   No match! (×)
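The belt-and-robot picture above can be tried outside a text editor as well. The following is a minimal sketch in Python (an assumption: the workshop itself uses only text editors), using the standard `re` module; `re.search` scans the input from left to right and reports the first place where the whole pattern fits.

```python
import re

text = "abcde"  # the input string from the slides

# Each call scans the input left to right, like the robot on the belt.
full = re.search(r"abcde", text)    # pattern covers the entire input
partial = re.search(r"bc", text)    # pattern covers only the substring "bc"
missing = re.search(r"bd", text)    # in the input, "b" is followed by "c", not "d"

print(full.group(), full.span())
print(partial.group(), partial.span())
print(missing)
```

A successful search returns a match object that records where the match sits in the input; a failed search returns `None`.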
Switch to a good text editor
• Text editors which support regex
– Windows: EmEditor or UltraEdit (both not free)
– Mac: TextWrangler (free)
Regular expressions – the syntax
What can you describe using regular expressions?
Characters
• Literal characters
– abcde , bc , bd (string match)
• Non-Printable Characters
– \t (tab), \r (carriage return), \n (line feed)
– Line breaks: \r (classic Mac OS), \n (Unix, including modern macOS), or \r\n (Windows)
• Special characters (reserved characters / metacharacters)
– [ ] \ ^ $ . | ? * + ( )
Examples come from: Regular-expressions.info
Exercise #1
• Download and install one of the above text editors.
• Download the “regex text.txt” file. Open it in your text editor.
• Call up the “Search” or “Find” function in your editor, and try the regexes in Exercise#1 to see which regexes can be matched.
Character Classes – what can appear at a certain position?
• gr[ae]y can match gray or grey
– Characters in [ ] form a class (bag of characters)
– gr[ae]y will match neither graay nor graey!
• Common character classes
– [a-z] , [A-Z] , [a-zA-Z] , [0-9]
• Exercise#2
Input: g r a e y
Regex: g r [ae] y
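A character class fills exactly one position in the input. The sketch below, written in Python as an assumption (the slides use a text editor), checks the four spellings mentioned above against gr[ae]y using `re.fullmatch`, which requires the whole word to fit the pattern.

```python
import re

pattern = re.compile(r"gr[ae]y")
words = ["gray", "grey", "graay", "graey"]

# fullmatch succeeds only if the entire word fits the pattern.
results = {word: bool(pattern.fullmatch(word)) for word in words}

for word, matched in results.items():
    print(word, "->", "match" if matched else "no match")
```

The class [ae] consumes a single character, so the two five-letter spellings fail.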
Shorthand Character Classes
• \d (digit): shorthand of [0-9]
• \D (non-digit character)
• \w (word character): [A-Za-z0-9_]
• \W (non-word character)
• \s (whitespace character): [ \t\r\n] (white space, tab, carriage return, line feed)
• \S (non-whitespace character)
Negated Character Classes
• Any character except these
– [^aeiou] : not one of a, e, i, o, u
– [^\d] : not digit
– [^\s] : not white space
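The shorthand and negated classes can be compared side by side. This is a small Python sketch (Python and the sample strings are assumptions made for illustration). Note that in Python 3 the shorthands also accept Unicode digits and letters unless `re.ASCII` is passed, so the flag is used here to stay close to the definitions on the slide.

```python
import re

sample = "Room 42, Floor 7"  # hypothetical sample text

# re.ASCII restricts \d and \w to the ASCII ranges listed on the slide.
digits = re.findall(r"\d", sample, flags=re.ASCII)    # single digits
words = re.findall(r"\w+", sample, flags=re.ASCII)    # runs of word characters
chunks = re.findall(r"\S+", sample, flags=re.ASCII)   # runs of non-whitespace

# Negated classes: everything except the listed characters.
not_vowels = re.findall(r"[^aeiou]", "regex")
only_digits = re.sub(r"[^\d]", "", "tel. 555-0100")   # delete every non-digit

print(digits, words, chunks, not_vowels, only_digits)
```

The difference between `words` and `chunks` is the comma: it is not a word character, but it is a non-whitespace character.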
Dot .
• . can match any single character (almost)
– Except the newline character => . is shorthand for [^\n] (Unix), [^\r] (classic Mac OS), [^\r\n] (Windows)
• Exercise#3
Optional and Repeat operators
• 3 operators for expressing repetition
– ? : zero or one time (optional)
– + : repeat one or more times
– * : repeat zero or more times
• Repeat a certain number of times:
– \d{1,4} : one to four digits
– \d{1,} : one digit or more (equivalent to \d+ )
• Exercise#4
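To see the repeat operators at work, here is a short Python sketch (Python and all sample strings are assumptions, not part of the exercise file). Each line exercises one operator from the list above.

```python
import re

# ? : the "u" may appear zero or one time
optional = re.findall(r"colou?r", "color colour colouur")

# + : one or more digits in a row
one_or_more = re.findall(r"\d+", "a1b22c333")

# * : an "a" followed by zero or more "b"
zero_or_more = re.findall(r"ab*", "a ab abbb")

# {1,4} : between one and four digits per match
bounded = re.findall(r"\d{1,4}", "123456")

print(optional, one_or_more, zero_or_more, bounded)
```

The operators are greedy: \d{1,4} takes four digits when it can, so the six-digit input is split into a match of four digits and a match of two.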
Alternation (list of words)
• Useful when you have a list of words, and you want to find the occurrence of each
– cat|dog|mouse|fish : find any one of the four
– regex|regular expression : find either regex or regular expression
• Exercise#5
Examples come from: Regular-expressions.info
Capturing writing variations
• Suppose you want to find all the occurrences mentioning regular expressions, but the term can be written as “regular expression(s)” or “regex(es)”.
• Use this pattern to find them all: reg(ular expressions?|ex(es)?)
Examples come from: Regular-expressions.info
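The alternation and variation patterns from the last two slides can be checked with a few lines of Python (an assumption; the sample sentences are made up for illustration). Because the second pattern contains groups, `re.finditer` is used so that the whole matched string is collected rather than the group contents.

```python
import re

animals = re.findall(r"cat|dog|mouse|fish",
                     "The cat chased the mouse past a catfish.")

VARIANTS = r"reg(ular expressions?|ex(es)?)"
sentence = ("A regex is a regular expression; "
            "regexes and regular expressions are the same thing.")

# group(0) is the entire matched string, whatever the inner groups captured.
variants = [m.group(0) for m in re.finditer(VARIANTS, sentence)]

print(animals)
print(variants)
```

Note that alternation has no notion of word boundaries: “catfish” yields both cat and fish.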
What can regular expressions do for you
• Provide better full-text search
– Find a word without worrying about its variations
– Find specific info written in regular forms: dates, phone numbers, email addresses, HTML/XML tags, quotes, all-capital abbreviations…
– Find two words near each other
• Perform formatting tasks on a text
• Automate tagging
Find information written in regular forms
• Exercise #6: finding dates in the form mm/dd/yy
– \d\d.\d\d.\d\d
– \d\d[- /.]\d\d[- /.]\d\d
– [0-1]\d[- /.][0-3]\d[- /.]\d\d
– (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d
• Exercise #7: finding texts within double quotes
– ".*"
– "[^"\r\n]*"
– "[^"]*"
Examples come from: Regular-expressions.info
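Exercises #6 and #7 both move from a loose pattern to a stricter one. The Python sketch below (Python and both sample strings are assumptions) compares the first and last date patterns, and the greedy and negated-class quote patterns.

```python
import re

LOOSE = r"\d\d.\d\d.\d\d"
STRICT = r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d"

def whole_matches(pattern, text):
    """Return every full match as a string, ignoring capturing groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

text = "Held 04/07/13; not a date: 13/45/99; serial 12345678."
loose_hits = whole_matches(LOOSE, text)
strict_hits = whole_matches(STRICT, text)

quoted = 'He said "yes" and she said "no".'
greedy = whole_matches(r'".*"', quoted)        # runs to the last quote on the line
per_quote = whole_matches(r'"[^"]*"', quoted)  # stops at the next quote

print(loose_hits, strict_hits)
print(greedy, per_quote)
```

Even the strict pattern checks only the shape of the date, so it would still accept an impossible date such as 02/31/13.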
Grouping and back references
• Exercise #8: finding HTML/XML tags
– <([a-z]+)\b[^>]*>.*?</\1>
– <date format="mmddyy">04/07/13</date>
• () : capturing group
• \1 : back reference to the 1st captured group
– If there is more than one pair of (), use \2, \3, etc.
– The whole matched string is referenced as \0
Examples come from: Regular-expressions.info
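A back reference forces the closing tag to repeat whatever name the first group captured. A minimal Python sketch follows (Python and the second, mismatched element are assumptions). In Python the whole match is `group(0)` on the match object.

```python
import re

TAG = r"<([a-z]+)\b[^>]*>.*?</\1>"

# The second element is deliberately broken: <b> is closed by </i>.
html = '<date format="mmddyy">04/07/13</date> <b>bold</i>'

elements = [m.group(0) for m in re.finditer(TAG, html)]
first = re.search(TAG, html)

print(elements)
print(first.group(1))  # the tag name captured by the first pair of ()
```

Only the well-formed element is found, because \1 must equal the captured opening name.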
Formatting task
• Trimming unnecessary white spaces
– Replace [ \t]{2,} with a single space
– Delete leading whitespace within a line: replace ^[ \t]+ with nothing (empty string)
– Trim trailing whitespace of a line: replace [ \t]+$ with nothing (empty string)
• Transform a text to a list of words
– Append a line break after each word
– Replace uppercase letters -> lowercase
– Replace punctuation symbols with nothing
– Count the frequency of each word in MS Excel
Examples come from: Regular-expressions.info
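The two formatting tasks above can be sketched as a chain of search-and-replace steps. This Python version is an illustration under assumptions: the sample text is invented, and `collections.Counter` stands in for the counting step that the slide does in MS Excel.

```python
import re
from collections import Counter

raw = "   The  quick   brown fox.  \nThe lazy dog!   "  # hypothetical input

def tidy(text):
    """Apply the three whitespace replacements from the slide, in order."""
    text = re.sub(r"[ \t]{2,}", " ", text)                   # runs -> one space
    text = re.sub(r"^[ \t]+", "", text, flags=re.MULTILINE)  # leading whitespace
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)  # trailing whitespace
    return text

clean = tidy(raw)

# Word list: lowercase, drop punctuation, one word per line.
no_punct = re.sub(r"[^\w\s]", "", clean.lower())
one_per_line = re.sub(r"\s+", "\n", no_punct)

# Counter replaces the Excel step here (an assumption, not the slide's method).
counts = Counter(one_per_line.split("\n"))

print(repr(clean))
print(counts.most_common(3))
```

`re.MULTILINE` makes ^ and $ apply to every line, which is how most text editors behave by default.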
Automate tagging
• Idea: Find dates via some regex, and then surround each of the matches with tags: <date>some date</date>
• Replace our date pattern: (0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d
  with: <date>\0</date>
• Try it in the date exercise
• Once you can tag useful info in a text, it will be easy to pull it out.
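The tag-then-extract idea can be sketched in Python (an assumption; the sample sentence is invented). Python writes the whole match as `\g<0>` in a replacement string, where the text editors in this workshop use \0.

```python
import re

DATE = r"(0[1-9]|1[012])[- /.]([012]\d|3[01])[- /.]\d\d"
text = "The workshop ran from 04/05/13 to 04/07/13."

# Step 1: surround every match with <date>...</date>.
tagged = re.sub(DATE, r"<date>\g<0></date>", text)

# Step 2: once tagged, the dates are easy to pull out again.
dates = re.findall(r"<date>(.*?)</date>", tagged)

print(tagged)
print(dates)
```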
Resources for regular expressions
• Regular-expressions.info– http://www.regular-expressions.info/
• Profhacker article: “Finding the Women of Heimskringla with Regular Expressions”
– http://chronicle.com/blogs/profhacker/finding-the-women-of-heimskringla-with-regular-expressions/38631
• <oo>→<dh> Digital humanities article:– http://dh.obdurodon.org/regex.html
PLAY WITH TEXTS
Our texts today
• Get familiar with it
• Use regex to do some search
• Search in files
• Then use the techniques to prepare the text for Regex Machine
Texts for today: Old Bailey Proceedings
• You can find samples in today’s package under “Old Bailey Proceedings”
• Or, you can download them on your own:
– select all and copy
– paste it to a text editor
– save it as UTF-8 without BOM (byte order mark)
Old Bailey’s Proceeding: the HTML presentation
Text form
Try some search
• Search for: t\d{8}-\d{3}
• Replace it with: <refNo>\0</refNo>
Exercise: Preparing your text in a specific format (to feed to some software)
How to convert? Observe!
• Goal: to make each case a single line
• Patterns?
• Every case begins with a line of “Reference Number” and ends before the next “Reference Number”
• We have to remove all the line breaks
• Tricky things: does the text contain XML reserved characters such as &, <, >?
Conversion Steps:Search and Replace + regexes
• Replace the XML reserved characters (replace & first, so that the & inside the other entities is not escaped again):
– & => &amp;   % => &#37;
– < => &lt;   > => &gt;
• Get rid of “285.”: ^\d{3}\. => nothing (empty string)
• Replace all the line breaks (\r, \n, \r\n) with nothing
• Reassign the line breaks by “Reference number:”
– Reference number: => \rReference number:
• Optional: Get rid of “See original”
• The order above is crucial
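The conversion steps above can be sketched as one function. Everything specific here is an assumption made for illustration: the language (Python), the sample records, the spelling of the marker as “Reference Number:”, and the use of \n rather than \r for the restored line breaks. Only the three characters &, <, > are escaped in this sketch.

```python
import re

MARKER = "Reference Number:"  # assumed spelling of the first line of each case

def one_case_per_line(raw):
    """Follow the slide's order: escape, drop numbering, join, re-split."""
    text = raw.replace("&", "&amp;")                          # & must come first
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    text = re.sub(r"^\d{3}\.", "", text, flags=re.MULTILINE)  # e.g. "285."
    text = re.sub(r"\r\n|\r|\n", "", text)                    # remove all line breaks
    text = text.replace(MARKER, "\n" + MARKER)                # one break per case
    return text.lstrip("\n")

# Hypothetical records, invented for this sketch.
sample = (
    "Reference Number: t18500107-285\r\n"
    "285. JOHN DOE, stealing 1 coat & 2 hats.\r\n"
    "NOT GUILTY.\r\n"
    "Reference Number: t18500107-286\r\n"
    "286. MARY ROE, stealing 3 spoons.\r\n"
    "GUILTY. Aged 20.\r\n"
)

cases = one_case_per_line(sample).split("\n")
for case in cases:
    print(case)
```

The order matters in two places: the leading case number can only be found with ^ while the original line breaks still exist, and the marker can only split the cases after every other line break is gone.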
What does the Regex Machine do?
• A graphical user interface (GUI) that enables people who do not have programming skills to
– graphically design patterns
– match them against a corpus of texts
– see results immediately via a user-friendly color-coding scheme (quick feedback)
– export to XML => automates (part of) the tagging procedure
Credit: Elif Yamagil
Downloading CBDB RegexMachine
• Regex Machine (on CBDB website)
– http://isites.harvard.edu/icb/icb.do?keyword=k16229&pageid=icb.page515758 -- download the CBDBRegexMachine_July2012.zip on this page
• Prerequisites:
– Make sure your machine has Java Runtime Environment (JRE) installed. If not, you can download it here: http://www.java.com/en/download/
Run the Regex Machine
• Double click the CBDBRegexMachine.jar
• In the “Select Your User Directory” window, select the folder where you put your text files.
– Tip: don’t double click the folder! Single click is all you need.
GUI
List of active regex
List of “terms”
Your Text
Info Box
Open the text we just prepared
• File > Open. Select your text file.
Create Active Regexes
• First regex: capture the reference number
– Example: t18500107-285
– Pattern: t\d{8}-\d{3}
– It’s always good to test it first in a text editor
• Create it in Regex Machine
– Think first: is it one unit? Does it contain different parts?
1. Click
2. Click
3. Fill in your regex and give it a name
4. Give the whole regex a name. Then choose a color!
5. Click on the Regex. Matches are highlighted!
Export to XML
6. File > Export
7. Set records per file to 1000
8. Then an XML file should be generated in the same folder as the text file!
XML header added.
Each line is surrounded by the tag <bio> with line number.
The number is now tagged with the Handle you specified!
Try another regex
• Second regex: capture the “Reference number:” and the number
– Example: Reference Number: t18500107-285
– Pattern: Reference Number: t\d{8}-\d{3}
• Create it in Regex Machine
– Think first: Do you want it to be tagged as a whole? Should the match contain different parts?
Using multiple groups in an Active Regex
• Add another Active Regex. Create two groups:
• Group #1: Reference Number:
• Group #2: t\d{8}-\d{3}
Group #1
Group #2: Capture this group!
Click on the new one to highlight the matched strings. Then click Move Up. Export to XML.
The whole string is tagged, and the number part is “captured” as an attribute!
What else to capture?
• Name of defendant(s)
• Verdict: guilty or not guilty, age, punishment
• Any patterns observed?
• Pattern for verdicts
– If NOT GUILTY, normally nothing more.
– If GUILTY, normally followed by Aged \d{2} and then the punishment.
– There can be more than one verdict in each record (if there is more than one defendant)
NOT GUILTY
1: Give the whole regex a name. It will become the XML tag name surrounding the entire matched string
2: Give the pattern as the exact text “NOT GUILTY”
Handle: give it a name
Capturing group: The name here will be used as the attribute name of the XML tag. The captured value will become the value of the attribute.
GUILTY
• GUILTY.*Aged ?\d{1,3}[^—]*—.*
– Group #1: guilty or not => GUILTY
– Group #2: age => \d{1,3}
– Group #3: punishment => .*
– Something in between the desired groups
   • Between group 1 & 2: .*Aged ?
   • Between group 2 & 3: [^—]*—
• Need to create 5 groups!
Export to XML
You can then use a browser to open it (more readable). You can further use an XML editor to correct mistakes (validation).
Open the XML in Excel
*Please note that not every XML file can be interpreted well in Excel. This is due to the different data structures involved: Excel is for tabular data, while XML is for trees, which are much more flexible.
*Also, the Mac version of MS Excel doesn’t read XML!
One last thing
• How about the names of the defendants?
• What is the pattern?
– The names are right after the reference number.
– They are all in capital letters.
– There can be more than one name. In that case, a mixture of spaces, commas, and “and” is used to connect the names.
• Test this pattern in a text editor:
– Reference Number: ?[a-z]\d{8}-\d{3}\s+([A-Z' ]+),?(?: ?([A-Z' ]+),)*(?: ?and([A-Z' ]+))?
– What does it capture?
• Break into groups:
– refNo: [a-z]\d{8}-\d{3}
– First defendant: ([A-Z' ]+)
– Second (or more) defendant: (?: ?([A-Z' ]+),)*
– Last defendant: (?: ?and([A-Z' ]+))?
Good!
Some problems
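To see what the defendant pattern captures, it can be assembled from the four pieces in the breakdown above and run on a few lines. This Python sketch rests on assumptions: the pieces are joined without extra spaces, and the sample records are invented.

```python
import re

DEFENDANTS = re.compile(
    r"Reference Number: ?[a-z]\d{8}-\d{3}\s+"  # refNo (not captured here)
    r"([A-Z' ]+),?"                            # first defendant
    r"(?: ?([A-Z' ]+),)*"                      # second (or more) defendant
    r"(?: ?and([A-Z' ]+))?"                    # last defendant
)

def defendants(line):
    """Return the captured names, trimmed of the spaces the class lets in."""
    m = DEFENDANTS.search(line)
    if m is None:
        return []
    return [g.strip() for g in m.groups() if g is not None and g.strip()]

three = "Reference Number: t18500107-285 JOHN SMITH, MARY JONES, and ANN LEE were indicted"
one = "Reference Number: t18500107-286 JOHN SMITH was indicted"

print(defendants(three))
print(defendants(one))
```

Two limitations show up quickly. The class [A-Z' ] includes the space, so captured names carry leading or trailing spaces and need trimming. In Python, a capturing group inside a repeated (...)* keeps only its last iteration, so with four or more defendants some of the middle names are lost.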
A real extraction project on local gazetteers – by Adam Mitchell
Columns shown: Raw descriptions written in the gazetteers (extracted), Source, Date, Disaster type, Location
Disaster types: Earthquakes and fires; Epidemics and Insect Plagues; Snow, Ice, and Tempests; Floods and Droughts; Famines, Hyperinflation, and Relief Efforts.
Collect data at the local levels and then aggregate
Reflections on using the Regex Machine
• Design your regexes and groups carefully
• Think ahead about what you want in the XML
• Tuning regexes can take dozens of hours
• It’s difficult to find regexes that capture everything -- there are always omissions, exceptions, etc.
• Keep in mind the cost of tuning “perfect” regexes.
Put regular expressions in a bigger context
• Using regex to search / capture data of interest works only when the piece of information is written in regular patterns
• What if there are no regular patterns? How can we teach machines to identify important information in a corpus of texts?
– If it’s location names, person names => Named Entity Recognition (NER)
– If it’s concepts => topic modeling, …
– Text mining, machine learning, … and more
Conclusion
• Hope to let you understand what regex is
• Hope to give you some hands-on experience in using regexes against some texts
• Hope to give you some sense of what machines can do with texts
• => Your imagination: you can begin to think about what texts are available and what you can do with them.
ENJOY PLAYING!