Statistical Computing (36-350)
Importing Data from the Web II
Cosma Shalizi and Vincent Vu
November 21, 2011
Agenda
• Regular expressions
• Construction
• Debugging
• Example: Continuation from Friday’s Lab
Agenda
• Continuation from Friday’s Lab
• Forbes.com: Celebrity 100 List
• The World’s Most Powerful Celebrities
• http://www.forbes.com/wealth/celebrities
• http://www.forbes.com/wealth/celebrities/list
• Goal: Scrape the list into a data frame in R
First steps
• Read the webpage into R or a text editor
• Identify two cases (say, “Tiger Woods” and “Lady Gaga”)
• How do we find these cases in the html?
• How are these cases coded in the html?
html <- readLines('http://www.forbes.com/wealth/celebrities/list')
i <- grep('Tiger Woods', html)
print(html[(i-4):(i+11)])
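To try this step without a network connection, here is a small sketch on a made-up vector of lines. The `toy` vector is hypothetical and much shorter than the real page; it only illustrates how grep() returns line indices that can then be used to print a window of neighbouring lines.

```r
# Hypothetical stand-in for the lines returned by readLines()
toy <- c('<table>',
         '<tr>',
         '<td class="rank">6</td>',
         '<h3>Tiger Woods</h3>',
         '<td>$75 M</td>',
         '</tr>',
         '</table>')

# grep() returns the indices of the elements containing the text;
# keep only the first hit in case the name appears on several lines
i <- grep('Tiger Woods', toy, fixed = TRUE)[1]

# Clamp the window so it never runs off either end of the vector
lo <- max(1, i - 2)
hi <- min(length(toy), i + 2)
context <- toy[lo:hi]
print(context)
```

Clamping the window is a defensive choice for short inputs; on a long page a fixed offset around the hit is usually enough.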
<tr>
<td class="rank">6</td>
<td class="company"><a rel="/profile/tiger-woods"></a>
<img src=" http://images.forbes.com/media/lists/people/tiger-woods_50x50.jpg " alt="Tiger Woods" />
<h3>Tiger Woods </h3></td>
<td>$75 M</td>
<td class="smallrank">14</td>
<td class="smallrank">6</td>
<td class="smallrank">5</td>
<td class="smallrank">40</td>
<td class="smallrank">38</td>
</tr>
Next steps
• Construct a regular expression to match these strings
• Use capture groups to specify the parts that you want
• Hint: Do it in pieces – i.e. combine smaller, simpler regular expressions
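As a rough illustration of building a pattern in pieces and pulling out capture groups, the sketch below works on a made-up one-line record. The string `s` and the names `rank.piece`, `name.piece`, and `pat.toy` are assumptions for the example, not part of the lecture code.

```r
# Hypothetical one-line record, much simpler than the real page
s <- '<td class="rank">6</td> <h3>Tiger Woods</h3>'

# Build the pattern from small pieces; each piece can be tested alone
rank.piece <- '<td class="rank">(\\d+)</td>'
name.piece <- '<h3>([[:alpha:][:space:]]+)</h3>'
pat.toy <- paste(rank.piece, name.piece, sep = '\\s*')

# regexec() records the whole match and each capture group;
# regmatches() then returns them as a character vector
m <- regexec(pat.toy, s)
parts <- regmatches(s, m)[[1]]
print(parts)
```

The first element of `parts` is the whole match; the parenthesized groups follow in order, so the rank is `parts[2]` and the name is `parts[3]`.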
Problem?
• A single case is split over multiple lines, and so multiple strings
• Solution: Paste lines together, separated by ‘\n’ (newline)
html <- paste(html, collapse = '\n')
Make one long string
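A minimal sketch of why collapsing helps, using a made-up three-line record (`lines.toy` and `pat.row` are hypothetical names): a pattern that spans several lines cannot match any single element of the vector, but it can match the collapsed string.

```r
# Hypothetical record spread over three lines, as readLines() would return it
lines.toy <- c('<tr>', '<td class="rank">6</td>', '</tr>')
pat.row <- '<tr>\\s*<td class="rank">(\\d+)</td>\\s*</tr>'

# No single element contains the whole record, so nothing matches
n.before <- length(grep(pat.row, lines.toy))

# After collapsing, \s* can absorb the newlines between the tags
one.string <- paste(lines.toy, collapse = '\n')
n.after <- length(grep(pat.row, one.string))

print(c(before = n.before, after = n.after))
```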
pat <- paste(
  '<tr>',
  '<td class="rank">(\\d+)</td>',
  '<td class="company"><a rel="[^"]*"></a>',
  '<img src="[^"]*" alt="[^"]*" />',
  '<h3>([[:alpha:][:space:]]+)</h3></td>',
  '<td>\\$([[:digit:]]+)M</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '</tr>',
  sep = '\\s*')

# Make one long string
html <- paste(html, collapse = '\n')

# First pass: extract disjoint cases
m <- gregexpr(pat, html, ignore.case = TRUE)
x <- regmatches(html, m)
x <- do.call(c, x)
Debugging a Regular Expression
• Test subsets of the regular expression to ensure that they work correctly
• Easier if we use paste() to construct the regular expression.
• Comment out some parts and then test the regex
Test small subsets of the regular expression
pat <- paste(
  '<tr>',
  '<td class="rank">(\\d+)</td>',
  '<td class="company"><a rel="[^"]*"></a>',
  # '<img src="[^"]*" alt="[^"]*" />',
  # '<h3>([[:alpha:][:space:]]+)</h3></td>',
  # '<td>\\$([[:digit:]]+)M</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '</tr>',
  sep = '\\s*')
print(gregexpr(pat, html, ignore.case = TRUE))
pat <- paste(
  '<tr>',
  '<td class="rank">(\\d+)</td>',
  '<td class="company"><a rel="[^"]*"></a>',
  '<img src="[^"]*" alt="[^"]*" />',
  '<h3>([[:alpha:][:space:]]+)</h3></td>',
  '<td>\\$([[:digit:]]+)M</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '</tr>',
  sep = '\\s*')
print(gregexpr(pat, html, ignore.case = TRUE))
<tr>
<td class="rank">6</td>
<td class="company"><a rel="/profile/tiger-woods"></a>
<img src=" http://images.forbes.com/media/lists/people/tiger-woods_50x50.jpg " alt="Tiger Woods" />
<h3>Tiger Woods </h3></td>
<td>$75 M</td>
<td class="smallrank">14</td>
<td class="smallrank">6</td>
<td class="smallrank">5</td>
<td class="smallrank">40</td>
<td class="smallrank">38</td>
</tr>
Space! The pay cell reads “$75 M”, with a space between the digits and the M, so the piece '<td>\\$([[:digit:]]+)M</td>' cannot match.
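The effect of that space can be checked in isolation. The sketch below compares the pay piece with and without an optional-whitespace term; `cell`, `strict`, and `loose` are names made up for this example.

```r
# Pay cell copied from the row above
cell <- '<td>$75 M</td>'

strict <- '<td>\\$([[:digit:]]+)M</td>'       # no space allowed before M
loose  <- '<td>\\$([[:digit:]]+)\\s*M</td>'   # optional whitespace before M

strict.ok <- grepl(strict, cell)
loose.ok  <- grepl(loose, cell)
print(c(strict = strict.ok, loose = loose.ok))

# The relaxed piece still accepts the form without a space
nospace.ok <- grepl(loose, '<td>$75M</td>')
print(nospace.ok)
```

Using \s* rather than a literal space keeps the piece tolerant of cells written with no space, one space, or several.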
pat <- paste(
  '<tr>',
  '<td class="rank">(\\d+)</td>',
  '<td class="company"><a rel="[^"]*"></a>',
  '<img src="[^"]*" alt="[^"]*" />',
  '<h3>([[:alpha:][:space:]]+)</h3></td>',
  '<td>\\$([[:digit:]]+)\\s*M</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '<td class="smallrank">(\\d+)</td>',
  # '</tr>',
  sep = '\\s*')
Continue testing increasingly larger subsets of the regular expression until the entire regular expression is verified.
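Instead of commenting lines in and out by hand, the same check could be automated. The following is a sketch only: `count.by.prefix` is a hypothetical helper, and the text and pay values are made up. It counts matches for each growing prefix of the pieces, so the first piece where the count drops is the one to inspect.

```r
# Hypothetical helper: count matches for each growing prefix of the pieces
count.by.prefix <- function(pieces, text, sep = '\\s*') {
  sapply(seq_along(pieces), function(k) {
    pat.k <- paste(pieces[1:k], collapse = sep)
    m <- gregexpr(pat.k, text, ignore.case = TRUE)[[1]]
    if (m[1] == -1) 0L else length(m)   # gregexpr() returns -1 for no match
  })
}

# Toy input with two made-up records
text <- paste('<tr> <td class="rank">1</td> <td>$290 M</td> </tr>',
              '<tr> <td class="rank">2</td> <td>$90 M</td> </tr>',
              sep = '\n')

# The third piece has a deliberate bug: no \s* before the M
pieces <- c('<tr>',
            '<td class="rank">(\\d+)</td>',
            '<td>\\$([[:digit:]]+)M</td>',
            '</tr>')

counts <- count.by.prefix(pieces, text)
print(counts)
```

Here the count falls from 2 to 0 at the third piece, which points at the pay cell as the part that needs fixing.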
Final steps
• Extract the matches and capture groups
• Convert to a data frame
• Use ldply() or do.call(rbind, ...)
# Second pass: extract capture groups
m <- regexec(pat, x, ignore.case = TRUE)
celebs <- regmatches(x, m)

# Put it all together
library(plyr)
df <- ldply(celebs, function(x) data.frame(
  rank = as.numeric(x[2]),
  name = x[3],
  pay = as.numeric(x[4]),
  money.rank = as.numeric(x[5]),
  tvradio.rank = as.numeric(x[6]),
  press.rank = as.numeric(x[7]),
  web.rank = as.numeric(x[8]),
  social.rank = as.numeric(x[9]),
  stringsAsFactors = FALSE))
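For readers without plyr installed, the do.call(rbind, ...) route can be sketched in base R. The list `celebs.toy` below is a made-up stand-in with only three capture groups per record; the names and numbers in it are placeholders, not scraped values.

```r
# Hypothetical stand-in for `celebs`: a list of character vectors,
# each holding the full match followed by the capture groups
celebs.toy <- list(c('full match', '1', 'Person A', '100'),
                   c('full match', '2', 'Person B', '90'))

# Turn each record into a one-row data frame
rows <- lapply(celebs.toy, function(x) data.frame(
  rank = as.numeric(x[2]),
  name = x[3],
  pay  = as.numeric(x[4]),
  stringsAsFactors = FALSE))

# rbind() stacks the one-row data frames into a single data frame
df.toy <- do.call(rbind, rows)
print(df.toy)
```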
Problems?
• Did we miss anybody?
• Who?
• Go back to the regular expression and fix it
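One possible way to answer "Who?" is to compare the scraped ranks with the ranks that should be there. The sketch assumes the list is numbered consecutively from 1 with no ties, which may not hold for the real page; the vectors are made up and kept short.

```r
# Hypothetical scraped ranks; a complete list would contain 1..5 here
scraped.rank <- c(1, 2, 4, 5)
expected.rank <- 1:5

# Ranks that should be present but were never matched
missing.rank <- setdiff(expected.rank, scraped.rank)
print(missing.rank)
```

Each missing rank can then be searched for in the html (for example with the rank cell as the search text) to see how that row differs from the rows that matched.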
pat <- paste(
  '<tr>',
  '<td class="rank">(\\d+)</td>',
  '<td class="company"><a rel="[^"]*"></a>',
  '<img src="[^"]*" alt="[^"]*" />',
  '<h3>([[:alpha:][:space:]]+)</h3></td>',
  '<td>\\$([[:digit:]]+)\\s*M</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '</tr>',
  sep = '\\s*')
Need to allow digits and punctuation marks too
pat <- paste(
  '<tr>',
  '<td class="rank">(\\d+)</td>',
  '<td class="company"><a rel="[^"]*"></a>',
  '<img src="[^"]*" alt="[^"]*" />',
  '<h3>([^>]+)</h3></td>',
  '<td>\\$([[:digit:]]+)\\s*M</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '<td class="smallrank">(\\d+)</td>',
  '</tr>',
  sep = '\\s*')
Fixed: the name group ([^>]+) accepts any character other than '>', so digits and punctuation marks are allowed too.
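The difference between the two name groups can be seen on a few test strings. Apart from the first, the names below are invented for the example and are not taken from the Forbes page; `narrow` and `wide` are hypothetical variable names.

```r
# Test strings: one plain name, one with a digit, one with punctuation
h3 <- c('<h3>Tiger Woods</h3>',
        '<h3>Made Up 2nd</h3>',
        '<h3>J. Q. Example</h3>')

narrow <- '<h3>([[:alpha:][:space:]]+)</h3>'  # letters and whitespace only
wide   <- '<h3>([^>]+)</h3>'                  # anything except '>'

narrow.ok <- grepl(narrow, h3)
wide.ok   <- grepl(wide, h3)
print(rbind(narrow = narrow.ok, wide = wide.ok))

# Pull out the captured names with the wider group
captured <- sapply(regmatches(h3, regexec(wide, h3)), `[`, 2)
print(captured)
```

The narrow group only matches the first string, while the wide group matches all three and still captures just the text between the tags.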
Summary
• Construct regular expressions in parts
• Use paste()
• Test subsets of the regular expression
• Use comment marker #
• Next: Reshaping data and databases