prediction of box office revenue of movies using hype analysis of twitter data

Post on 22-Jul-2015


PREDICTION OF BOX OFFICE SUCCESS OF MOVIES USING HYPE ANALYSIS OF TWITTER DATA (PREDICTING THE FUTURE)

By

SAMEER THIGALE, TUSHAR PRASAD

MIT COLLEGE OF ENGINEERING, PUNE

Internal Guide:

PROF. REENA PAGARE

Sponsored Organization:

PERSISTENT SYSTEMS LIMITED

A BRIEF OUTLINE

• Presence of “rich insights” in social networks

• The Hypothesis:

“A Movie Well Talked About is Well Watched”

• Pre-release buzz: a success factor


LITERATURE SURVEY


REFERENCES AND DESCRIPTIONS

[1] Forecasting: Methods and Applications, by Spyros M., Steven W., Rob H., 3rd Edition, Wiley Publication (book)

– Basic concepts of statistics, such as correlation

– Study of forecasting models

– Linear regression

– Time series regression

[2] Predicting the Future with Social Media, S. Asur, B. Huberman, HP Labs, HP Journal, Jan 2012

– The factors that could be considered for calculating the success rate might be attention seeking, distribution, polarity, type of film, etc.

– Prediction can be made using linear regression.

EXISTING MODELS

• HOLLYWOOD STOCK EXCHANGE (HSX.COM)

– Uses Virtual Stocks to predict revenue

– Accuracy 90%, confidence: medium

• INTERNET MOVIE DB (IMDB.COM)

– Uses clicks, reviews, blogs, star casts to predict

• BoxOfficeMojo.com

– Uses clicks, reviews, blogs, star casts to predict


But none of the leading movie database sites uses social media to make predictions. Why?

PROBLEM DEFINITION

• To demonstrate that the amount of attention a subject receives correlates strongly with its future ranking.

• To show that a simple regression model built from Twitter chatter can outperform market-based predictions.

• To demonstrate how the model can also be extended to other products of consumer interest.


Technical Keywords: Statistical prediction, Social network analysis, Regression

THE DATASET

• 100,000+ unique users

• Dataset of 6 weeks, 4 million tweets


MOVIE NAME

Jupiter Ascending

Shamitabh

SpongeBob: Sponge out of water

LoveSick

Fifty Shades of Grey

Birdman

American Sniper

Foxcatcher

Hot Tub Time Machine 2

Chappie Movie

Badlapur

MODEL EMPLOYED

• MULTIPLE LINEAR REGRESSION

– BASED ON FINDING “A STRAIGHT LINE PREDICTING Y (INCOME)”


MODEL EMPLOYED

A: AVERAGE COUNT OF TWEETS PER HOUR

P: CALCULATED USING SENTIMENT ANALYSIS. RANGE: 0 TO 4 (0: VERY NEGATIVE, 4: VERY POSITIVE)

D: NUMBER OF THEATRES THE MOVIE IS RELEASED IN

C: CATEGORY OF MOVIE: ACTION, THRILLER, COMEDY, ANIMATION, ROMANCE

E: STAR CAST, DIVIDED INTO 3 CATEGORIES DEPENDING ON TWITTER FOLLOWER COUNT

S: SEQUEL. RANGE: 0 IF NOT A SEQUEL, 1 IF A SEQUEL
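The slides name multiple linear regression as the model but do not show how it was fitted. As a minimal sketch only, the block below fits an ordinary least-squares model to four of the numeric inputs listed above (A, P, D, S) using the normal equations in pure Python. The function names, the synthetic rows, and the "true" coefficients are all assumptions made up for illustration; they are not the authors' data or code. The categorical inputs C and E would need a numeric encoding first and are left out here.

```python
# Minimal ordinary-least-squares sketch in pure Python (no third-party libraries).
# All numbers below are synthetic and exist only to exercise the code.

def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augment the matrix so row operations apply to both sides at once.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent; cannot fit")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_linear_regression(features, targets):
    """Return [intercept, b1, b2, ...] minimising the squared error."""
    rows = [[1.0] + list(f) for f in features]  # prepend the intercept column
    width = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(width)]
           for i in range(width)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(width)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, feature_row):
    """Evaluate intercept + sum(b_i * x_i) for one feature row."""
    return coefficients[0] + sum(
        b * v for b, v in zip(coefficients[1:], feature_row)
    )


# Synthetic rows in the order (A, P, D, S). The target is generated from a
# known formula so that the fit can be checked; it is not real revenue.
synthetic_features = [
    (120.0, 2.5, 2000.0, 0.0),
    (300.0, 3.1, 3500.0, 1.0),
    (45.0, 1.8, 800.0, 0.0),
    (210.0, 2.9, 2900.0, 0.0),
    (500.0, 3.6, 3800.0, 1.0),
    (80.0, 2.2, 1500.0, 1.0),
    (150.0, 3.3, 2500.0, 0.0),
]
hypothetical_coefficients = [1000.0, 50.0, 2000.0, 3.0, 500.0]
synthetic_revenue = [
    predict(hypothetical_coefficients, row) for row in synthetic_features
]

coefficients = fit_linear_regression(synthetic_features, synthetic_revenue)
fitted = [predict(coefficients, row) for row in synthetic_features]
```

Because the synthetic targets are an exact linear function of the inputs, the fitted values should reproduce them up to floating-point error. Real tweet data would be noisy, and a library routine would normally be preferred over hand-rolled normal equations for numerical stability.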


CONTRIBUTION

• Our model uses multiple linear regression for forecasting, which gives a better, more accurate outcome than using complicated neural networks, pattern recognition, and other AI concepts.

• The model is robust and can be extended to other consumer products just by changing the regression parameters.


DEMO


SYSTEM ARCHITECTURE


PLATFORM AND TECHNOLOGY

• OPERATING SYSTEM AND ARCHITECTURE INDEPENDENT

– TESTED ON WINDOWS XP+, UBUNTU 12.04 LTS+

– BOTH 32-BIT AND 64-BIT ARCHITECTURE

• SOFTWARE REQUIREMENTS (MINIMUM):

– JDK 8

– MYSQL 5+


SALIENT FEATURES

• Client-server architecture

• Accurate prediction

• Displays

– Sentiment of tweets

– Tag cloud of tweets

– Location of tweet

– Rate of tweets per hour
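The hourly tweet rate appears both as a displayed feature and as the model input A, but the slides do not say how it is computed. One plausible way, sketched below under stated assumptions, is to bucket tweet timestamps by clock hour and divide the total by the number of hours spanned. The function names and the sample timestamps are hypothetical, and whether the authors counted empty hours in the average is not stated.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Count tweets in each clock hour; returns {hour_start: count}."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return dict(buckets)


def average_hourly_rate(timestamps):
    """Total tweets divided by the hours spanned, counting empty hours too."""
    if not timestamps:
        return 0.0
    first = min(timestamps).replace(minute=0, second=0, microsecond=0)
    last = max(timestamps).replace(minute=0, second=0, microsecond=0)
    hours_spanned = int((last - first).total_seconds() // 3600) + 1
    return len(timestamps) / hours_spanned


# Made-up timestamps for illustration only.
sample_timestamps = [
    datetime(2015, 2, 13, 9, 5),
    datetime(2015, 2, 13, 9, 40),
    datetime(2015, 2, 13, 10, 15),
    datetime(2015, 2, 13, 12, 1),
]

hourly_counts = tweets_per_hour(sample_timestamps)
average_rate = average_hourly_rate(sample_timestamps)
```

Here the four sample tweets span the hours 09:00 to 12:00 inclusive, so the average is one tweet per hour even though the 11:00 bucket is empty.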

PROUDLY BUILT ON THE OPEN SOURCE MODEL. ALL OPEN-SOURCE TOOLS USED. SOFTWARE LICENSED UNDER GNU GPL.

RESULTS

Features                          R²

Avg tweet rate                    0.02

Avg tweet rate + theatre count    0.91
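For readers unfamiliar with the R² column, the sketch below shows the standard coefficient of determination, 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the toy numbers are assumptions for illustration; the slides do not state whether a plain or adjusted R² was reported.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Toy values, not taken from the study.
toy_actual = [1, 2, 3, 4]
perfect_fit = r_squared(toy_actual, [1, 2, 3, 4])        # exact predictions
mean_only = r_squared(toy_actual, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
```

A model that predicts every value exactly scores 1, while one that always predicts the mean scores 0, which is the scale on which the two rows of the table can be compared.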


Movie Name                  Release Date   What we predicted (in USD)   What actually happened!

Fifty Shades of Grey        13-Feb-2015    80,214,910                   85,043,000

Shamitabh                   06-Feb-2015    243,661                      241,720

Kingsman: Secret Service    13-Feb-2015    34,345,613                   36,225,000

Hot Tub Time Machine 2      20-Feb-2015    30,255,168                   ???? (IMDB SAYS 25M)
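To make the gap between predicted and actual figures easier to read, the sketch below computes the signed percentage error for the three rows of the table that list an actual figure. Hot Tub Time Machine 2 is left out because its actual revenue is not given. The variable and function names are assumptions; the numbers are copied from the table.

```python
# (movie, predicted USD, actual USD) as listed in the results table.
results = [
    ("Fifty Shades of Grey", 80_214_910, 85_043_000),
    ("Shamitabh", 243_661, 241_720),
    ("Kingsman: Secret Service", 34_345_613, 36_225_000),
]


def percentage_error(predicted, actual):
    """Signed error as a percentage of the actual value."""
    return (predicted - actual) / actual * 100.0


errors = {name: percentage_error(p, a) for name, p, a in results}
mean_abs_error = sum(abs(e) for e in errors.values()) / len(errors)

for name, error in errors.items():
    print(f"{name}: {error:+.2f}%")
print(f"Mean absolute percentage error: {mean_abs_error:.2f}%")
```

On these three rows the predictions fall roughly 5.7% and 5.2% below the actual figures for Fifty Shades of Grey and Kingsman, and about 0.8% above for Shamitabh.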

APPLICATIONS

• Forecasting products of consumer interest, given the chatter

– Movies

– Elections

– ICC World Cup

– Epidemiology (Google Flu trends)

• For theatre owners, to predict the number of shows to be scheduled

– Similarly, for retailers of the respective products


LIMITATIONS

• Data cleaning limitations

– Presence of references to two or more movies

– Presence of sarcastic tweets

– Emoticons

• CONSTRAINTS:

– Due to Twitter API limitations, only 1% of tweets can be caught (can be improved with Firehose access)

– Only tweets in the English language are accepted
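The slides do not describe how tweets that mention several movies were handled. One simple approach, sketched below purely as an assumption, is to match hashtags against a list of tracked movies and keep only tweets that map to exactly one of them. The hashtag-to-movie mapping, the function names, and the sample tweets are hypothetical.

```python
import re

HASHTAG = re.compile(r"#\w+")


def movies_mentioned(text, tracked):
    """Return the set of tracked movies whose hashtag appears in the tweet."""
    tags = {tag.lower() for tag in HASHTAG.findall(text)}
    return {tracked[tag] for tag in tags if tag in tracked}


def keep_unambiguous(tweets, tracked):
    """Keep (movie, text) pairs for tweets that mention exactly one tracked movie."""
    kept = []
    for text in tweets:
        movies = movies_mentioned(text, tracked)
        if len(movies) == 1:
            kept.append((next(iter(movies)), text))
    return kept


# Hypothetical mapping from lowercase hashtag to movie name.
tracked_hashtags = {
    "#shamitabh": "Shamitabh",
    "#badlapur": "Badlapur",
}

# Made-up tweets for illustration.
sample_tweets = [
    "I <3 d mve #Shamitabh",
    "#Shamitabh or #Badlapur this weekend?",
    "Nothing good in theatres today",
]

clean_tweets = keep_unambiguous(sample_tweets, tracked_hashtags)
```

This filter only addresses the multi-movie case. Sarcasm and emoticons, the other two cleaning problems listed, need sentiment-level handling that a hashtag match cannot provide.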


Example tweets:

– “Such a wonderful movie #Humshakal is!”

– “I <3 d mve #Shamitabh”

FUTURE SCOPE

• Estimating from “negative hype”

– E.g. revenue of #PK increased due to the #PKDebate

• Correlating the success of songs with the success of the movie

– Famous example of the song “Tum Hi Ho”

• Correlating the “structure” of retweets and “favorited” tweets


THANK YOU!
