query breakdown
http://www.imdb.com/chart
(sc.parallelize(Seq(null)) +>
Wget("http://www.imdb.com/chart") !==)
.joinBySlice("div#boxoffice tbody tr")
.selectInto(
"rank" -> (_.ownText1("tr td.titleColumn").replaceAll("\"","").trim),
"name" -> (_.text1("tr td.titleColumn a")),
"year" -> (_.text1("tr td.titleColumn span")),
"box_weekend" -> (_.text("tr td.ratingColumn")(0)),
"box_gross" -> (_.text("td.ratingColumn")(1)),
"weeks" -> (_.text1("tr td.weeksColumn"))
)
.wgetJoin("tr td.titleColumn a")
http://www.imdb.com/title/tt2015381/?ref_=cht_bo_1
.selectInto(
"score" -> (_.text1("td#overview-top div.titlePageSprite")),
"rating_count" -> (_.text1("td#overview-top span[itemprop=ratingCount]")),
"review_count" -> (_.text1("td#overview-top span[itemprop=reviewCount]"))
)
.wgetLeftJoin("div#maindetails_quicklinks a:contains(Reviews)")
http://www.imdb.com/title/tt2015381/reviews?ref_=tt_ql_8
.wgetInsertPagination("div#tn15content a:has(img[alt~=Next])",500)
.joinBySlice("div#tn15content div:has(h2)")
.selectInto(
"review_rating" -> (_.attr1("img[alt]","alt")),
"review_title" -> (_.text1("h2")),
"review_meta" -> (_.text("small").toString())
)
.wgetLeftJoin("a")
http://www.imdb.com/user/ur23582121/
.selectInto(
"user_name" -> (_.text1("div.user-profile h1")),
"user_timestamp" -> (_.text1("div.user-profile div.timestamp")),
"user_post_count" -> (_.ownText1("div.user-lists div.see-more")),
"user_rating_count" -> (_.text1("div.ratings div.see-more")),
"user_review_count" -> (_.text1("div.reviews div.see-more")),
"user_rating_histogram" -> (_.attr("div.overall div.histogram-horizontal a","title").toString())
)
.asTsvRDD() //Output as TSV file
.collect()
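The query above mixes `wgetJoin` and `wgetLeftJoin`. Judging only by the names, the difference is presumably the usual inner-join versus left-join behavior: whether a row that yields no matching link is dropped or kept. The sketch below illustrates that assumed behavior with plain Scala collections. It does not use or reproduce the SpookyStuff API; `Row`, `linksByMovie`, `innerJoin`, and `leftJoin` are hypothetical names invented for this illustration.

```scala
// Conceptual sketch only: plain Scala collections, not the SpookyStuff API.
object JoinSketch {
  type Row = Map[String, String]

  // Hypothetical stand-in for "links found on the page behind each row".
  val linksByMovie: Map[String, Seq[String]] = Map(
    "Movie A" -> Seq("/reviews/1", "/reviews/2"),
    "Movie B" -> Seq.empty
  )

  // Inner-join style: a row with no matching link produces no output rows.
  def innerJoin(rows: Seq[Row]): Seq[Row] =
    for {
      row  <- rows
      link <- linksByMovie.getOrElse(row("name"), Seq.empty)
    } yield row + ("link" -> link)

  // Left-join style: a row with no matching link is kept, without a "link" column.
  def leftJoin(rows: Seq[Row]): Seq[Row] =
    rows.flatMap { row =>
      val links = linksByMovie.getOrElse(row("name"), Seq.empty)
      if (links.isEmpty) Seq(row) else links.map(l => row + ("link" -> l))
    }

  val rows: Seq[Row] = Seq(Map("name" -> "Movie A"), Map("name" -> "Movie B"))
}
```

Under this reading, a movie without a Reviews link would still appear in the output of the real query, with its review columns left empty, because the review step uses the left-join variant.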
How to test
1. Go to: http://ec2-54-88-40-125.compute-1.amazonaws.com:8888/notebooks/all_inclusive_demo.ipynb# in your browser.
2. Find IMDB review extraction
3. Execute! And wait to see the results.
4. Go to: http://ec2-54-88-40-125.compute-1.amazonaws.com:4040/stages/ to see your progress
http://www.rottentomatoes.com/
(sc.parallelize(Seq(null)) +>
Wget("http://www.rottentomatoes.com/") !==)
.wgetJoin("table.top_box_office tr.sidebarInTheaterTopBoxOffice a", indexKey = "rank")
http://www.rottentomatoes.com/m/guardians_of_the_galaxy/
.selectInto(
"name" -> (_.text1("h1.movie_title")),
"meter" -> (_.text1("div#all-critics-numbers span#all-critics-meter")),
"rating" -> (_.text1("div#all-critics-numbers p.critic_stats span")),
"review_count" -> (_.text1("div#all-critics-numbers p.critic_stats span[itemprop=reviewCount]"))
)
.wgetJoin("div#contentReviews h3 a")
http://www.rottentomatoes.com/m/guardians_of_the_galaxy/reviews/
.wgetInsertPagination("div.scroller a.right", indexKey = "page") // grab all pages by using right arrow button
.joinBySlice("div#reviews div.media_block") //slice into review blocks
.selectInto(
"critic_name" -> (_.text1("div.criticinfo strong a")),
"critic_org" -> (_.text1("div.criticinfo em.subtle")),
"critic_review" -> (_.text1("div.reviewsnippet p")),
"critic_score" -> (_.ownText1("div.reviewsnippet p.subtle"))
)
.wgetJoin("div.criticinfo strong a")
http://www.rottentomatoes.com/critic/sean-means/
.selectInto(
"total_reviews_ratings" -> (_.text("div.media_block div.clearfix dd").toString())
)
.asJsonRDD()
.collect()
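The `wgetInsertPagination` step in this query is commented as grabbing all pages by using the right arrow button. One way to picture that is a loop that keeps following a "next" link and tags each visited page with an index. The sketch below shows the idea in plain Scala against an in-memory map. It is not the SpookyStuff implementation; `nextLink`, `allPages`, and the `limit` parameter are hypothetical names, and treating a numeric argument as a page cap is an assumption.

```scala
// Conceptual sketch only: plain Scala, not the SpookyStuff API.
object PaginationSketch {
  // Hypothetical in-memory "site": each page URL maps to an optional next-page URL.
  val nextLink: Map[String, Option[String]] = Map(
    "/reviews?page=1" -> Some("/reviews?page=2"),
    "/reviews?page=2" -> Some("/reviews?page=3"),
    "/reviews?page=3" -> None
  )

  // Follow "next" links from `start`, visiting at most `limit` pages.
  // Returns (page index, url) pairs in visiting order, starting at index 1.
  def allPages(start: String, limit: Int): List[(Int, String)] = {
    @annotation.tailrec
    def loop(url: String, page: Int, acc: List[(Int, String)]): List[(Int, String)] = {
      val visited = (page, url) :: acc
      nextLink.getOrElse(url, None) match {
        case Some(next) if page < limit => loop(next, page + 1, visited)
        case _                          => visited.reverse
      }
    }
    if (limit <= 0) Nil else loop(start, 1, Nil)
  }
}
```

A cap of this kind guards against a "next" link that never runs out; the page index plays a role comparable to the `indexKey = "page"` column in the query.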
How to test
1. Go to: http://ec2-54-88-40-125.compute-1.amazonaws.com:8888/notebooks/all_inclusive_demo.ipynb# in your browser.
2. Find Rotten Tomatoes Review Extraction
3. Execute! And wait to see the results.
4. Go to: http://ec2-54-88-40-125.compute-1.amazonaws.com:4040/stages/ to see your progress