Query breakdown

Peng Cheng

http://www.imdb.com/chart

(sc.parallelize(Seq(null)) +>

Wget("http://www.imdb.com/chart") !==)

.joinBySlice("div#boxoffice tbody tr")

.selectInto(

"rank" -> (_.ownText1("tr td.titleColumn").replaceAll("\"","").trim),

"name" -> (_.text1("tr td.titleColumn a")),

"year" -> (_.text1("tr td.titleColumn span")),

"box_weekend" -> (_.text("tr td.ratingColumn")(0)),

"box_gross" -> (_.text("td.ratingColumn")(1)),

"weeks" -> (_.text1("tr td.weeksColumn"))

)

.wgetJoin("tr td.titleColumn a")

http://www.imdb.com/title/tt2015381/?ref_=cht_bo_1

.selectInto(

"score" -> (_.text1("td#overview-top div.titlePageSprite")),

"rating_count" -> (_.text1("td#overview-top span[itemprop=ratingCount]")),

"review_count" -> (_.text1("td#overview-top span[itemprop=reviewCount]"))

)

.wgetLeftJoin("div#maindetails_quicklinks a:contains(Reviews)")

http://www.imdb.com/title/tt2015381/reviews?ref_=tt_ql_8

.wgetInsertPagination("div#tn15content a:has(img[alt~=Next])",500)

.joinBySlice("div#tn15content div:has(h2)")

.selectInto(

"review_rating" -> (_.attr1("img[alt]","alt")),

"review_title" -> (_.text1("h2")),

"review_meta" -> (_.text("small").toString())

)

.wgetLeftJoin("a")

http://www.imdb.com/user/ur23582121/

.selectInto(

"user_name" -> (_.text1("div.user-profile h1")),

"user_timestamp" -> (_.text1("div.user-profile div.timestamp")),

"user_post_count" -> (_.ownText1("div.user-lists div.see-more")),

"user_rating_count" -> (_.text1("div.ratings div.see-more")),

"user_review_count" -> (_.text1("div.reviews div.see-more")),

"user_rating_histogram" -> (_.attr("div.overall div.histogram-horizontal a","title").toString())

)

.asTsvRDD() //Output as TSV file

.collect()
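The query above mixes wgetJoin and wgetLeftJoin. The slides do not spell out the difference, so the following is only a sketch of the usual inner-join versus left-join distinction that the names suggest, written with plain Scala collections. Movie, JoinSketch, innerJoin and leftJoin are hypothetical names for illustration and are not part of the library.

```scala
// Hypothetical stand-in for a scraped row: not a SpookyStuff type.
final case class Movie(name: String, reviewLinks: Seq[String])

object JoinSketch {
  // Inner-join flavour (assumed behaviour of wgetJoin):
  // a movie with no matching link produces no output rows.
  def innerJoin(movies: Seq[Movie]): Seq[(String, Option[String])] =
    movies.flatMap(m => m.reviewLinks.map(link => (m.name, Option(link))))

  // Left-join flavour (assumed behaviour of wgetLeftJoin):
  // a movie with no matching link still produces one row, with an empty link.
  def leftJoin(movies: Seq[Movie]): Seq[(String, Option[String])] =
    movies.flatMap { m =>
      if (m.reviewLinks.isEmpty) Seq((m.name, Option.empty[String]))
      else m.reviewLinks.map(link => (m.name, Option(link)))
    }
}
```

Under this assumption, a left join keeps movies that have no review page in the output, with the review columns left empty.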

How to test

1. Go to: http://ec2-54-88-40-125.compute-1.amazonaws.com:8888/notebooks/all_inclusive_demo.ipynb# in your browser.

2. Find IMDB review extraction

3. Execute! And wait to see the results.

4. Go to: http://ec2-54-88-40-125.compute-1.amazonaws.com:4040/stages/ to see your progress


rottentomatoes

http://www.rottentomatoes.com/

(sc.parallelize(Seq(null)) +>

Wget("http://www.rottentomatoes.com/") !==)

.wgetJoin("table.top_box_office tr.sidebarInTheaterTopBoxOffice a", indexKey = "rank")

http://www.rottentomatoes.com/m/guardians_of_the_galaxy/

.selectInto(

"name" -> (_.text1("h1.movie_title")),

"meter" -> (_.text1("div#all-critics-numbers span#all-critics-meter")),

"rating" -> (_.text1("div#all-critics-numbers p.critic_stats span")),

"review_count" -> (_.text1("div#all-critics-numbers p.critic_stats span[itemprop=reviewCount]"))

)

.wgetJoin("div#contentReviews h3 a")


http://www.rottentomatoes.com/m/guardians_of_the_galaxy/reviews/

.wgetInsertPagination("div.scroller a.right", indexKey = "page") // grab all pages by using right arrow button

.joinBySlice("div#reviews div.media_block") //slice into review blocks

.selectInto(

"critic_name" -> (_.text1("div.criticinfo strong a")),

"critic_org" -> (_.text1("div.criticinfo em.subtle")),

"critic_review" -> (_.text1("div.reviewsnippet p")),

"critic_score" -> (_.ownText1("div.reviewsnippet p.subtle"))

)

.wgetJoin("div.criticinfo strong a")

http://www.rottentomatoes.com/critic/sean-means/

.selectInto(

"total_reviews_ratings" -> (_.text("div.media_block div.clearfix dd").toString())

)

.asJsonRDD()

.collect()
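Both queries use wgetInsertPagination to follow a "next page" link. As a rough illustration of that idea only, the sketch below walks a chain of next links over a plain Map and tags each page with an index. It assumes the numeric argument (500 in the IMDB query) is an upper bound on pages and that indexKey records the page position; neither is confirmed by the slides. PaginationSketch, crawl and its parameters are hypothetical names.

```scala
import scala.annotation.tailrec

object PaginationSketch {
  // `next` is a hypothetical site model: page URL -> URL behind its "next" link.
  // Returns (pageIndex, url) pairs; the 0-based index is an assumption.
  def crawl(start: String, next: Map[String, String], limit: Int): Seq[(Int, String)] = {
    @tailrec
    def loop(url: String, acc: Vector[String]): Vector[String] = {
      val visited = acc :+ url
      next.get(url) match {
        // Continue only while under the limit and not revisiting a page.
        case Some(n) if visited.size < limit && !visited.contains(n) => loop(n, visited)
        case _ => visited
      }
    }
    loop(start, Vector.empty).zipWithIndex.map { case (u, i) => (i, u) }
  }
}
```

The limit and the revisit check both guard against a "next" link that never ends or loops back on itself.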

How to test

1. Go to: http://ec2-54-88-40-125.compute-1.amazonaws.com:8888/notebooks/all_inclusive_demo.ipynb# in your browser.

2. Find Rotten Tomatoes Review Extraction

3. Execute! And wait to see the results.

4. Go to: http://ec2-54-88-40-125.compute-1.amazonaws.com:4040/stages/ to see your progress

