Modeling and prediction for movies

# Load libraries and data
library(ggplot2)
library(dplyr)
library(statsr)
load("movies.Rdata")

Part 1: Data

How the data was collected

Our Coursera instructors provided us with a random sample of movies from the web sites IMDb and Rotten Tomatoes. Although they do not tell us how this random selection process was chosen we place our faith in our instructors that it was done well.

Survey methodologies

We are given scores and ratings of the randomly selected films from three sources:

Critics reviews gathered by IMDb
Audience reviews from IMDb
Audience reviews from Rotten Tomatoes

Both IMDb and Rotten Tomatoes build their scores based on user reviews, which suffer from biases similar to online polls. This quote from the New York Times pertains to election polling done on news websites and blogs, but the concerns are the same for our audience scores:

“[Online polls] do a good job of engaging audiences online, and they do a good job of letting you know how other people who have come to the webpage feel about whatever issue,” said Mollyann Brodie, the executive director for public opinion and survey research at the Kaiser Family Foundation. “But they’re not necessarily good at telling you, in general, what people think, because we don’t know who’s come to that website and who’s taken it.”

Professional pollsters use scientific statistical methods to make sure that their small random samples are demographically appropriate to indicate how larger groups of people think. Online polls do nothing of the sort, and are not random, allowing anyone who finds the poll to vote.
(“Why You Shouldn’t Trust ‘Polls’ Conducted Online,” New York Times, September 28, 2016)

The critics scores are derived by an expert judgement of the critics reviews by people at Rotten Tomatoes. We do not know the process of human judgement that produces a score for a given critic’s review – how it might be 45% versus 50% – but we won’t be using critics reviews here, since our boss at Paramount is concerned about what makes a movie popular, as opposed to critically acclaimed. In fact, a critic’s score might actually be a predictor of the audience score.

Implications for inference

I believe it was the computer security guru Bruce Schneier who once said, “For marketing purposes, 30% accuracy can be good enough.” Perhaps it was a wry comment, but he meant that when you are trying to market a product to people having a 30% accuracy in reaching the correct audience is better than marketing to purely randomly selected people.

In that spirit, despite the bias problems with the audience scoring addressed above, at the very least our inferences will tell us what kinds of films IMDb or Rotten Tomatoes reviewers will like. We will treat the audience scores, for this report, as having veracity for generalizing to the greater population.

Note that we can only correlate movie parameters to audience score. We cannot infer causality even if we were guaranteed the audience scores came from randomly selected reviewers.

For me personally, the interest here is general. I’ve been a movie lover pretty much all of my life. I lived in New York City for twelve years, where going to the movies borders on being a sport. As a kid I saw “Star Wars” twelve times in the theater in its original run in 1977. I have been an avid fan of “Mystery Science Theater 3000”, have seen countless films in art house cinemas, and even sat through the film adaptation of “Battlefield Earth,” which gets a 3% rating on Rotten Tomatoes (which sounds a bit high).

Part 2: Research question

Our boss here at Paramount Pictures poses two questions for us:

She is interested in learning what attributes make a movie popular
She is also interested in learning something new about movies

We will use multiple linear regression to answer her first question.

As to learning something new: consider that a movie has a theatrical release date and a DVD release date. If we measure the amount of time from the theatrical release to the DVD release, does that length of time correlate to the popularity of the movie?

To speculate why this might be, we can think about a given movie having a “hype cycle.” A film may have an enormous amount of hype leading up to its release. It may do well at the box office. If the movie studio waits too long to release the DVD then perhaps both sales of the DVD and “online hype” (which probably generates more iMDB and Rotten Tomatoes responses) may suffer. We might see this in a negative correlation between the number of days and audience score.

One concern here may be: “Does this question call for time-series data?” This is one of the conditions for the least squares line we will plot. Because we deal, ultimately, with an integer (number of days from theatrical release to DVD release) this should not pose a problem.

Part 3: Exploratory data analysis

For our first question: cleaning up the data for the multiple linear regression

Our instructions list several things to remove: actor1 through actor5 and the URIs for the films. We’ll also leave out the director and title_type. At first I left in “studio” thinking my boss would want to know which studios netted films with higher scores, all other predictors held equal; but this made the output messy and the data looked somewhat questionable:

## studioWalt Disney Home Entertainment            38.31529   20.15666
## studioWalt Disney Pictures                      13.97832   10.88395
## studioWalt Disney Productions                    3.59856   13.25061

Walt Disney Home Entertainment does not create releases for theaters, I believe. Also Paramount Pictures (my boss) cannot control for ‘movie studio’ so I decided to omit it.

# Keep 'title' in case we want to identify a single film
movies_for_lm <- movies %>% select(title, genre, runtime, mpaa_rating, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)

str(movies_for_lm)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  16 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

For our second question: data analysis about the number of days between release dates

We need to compute the number of days from the date of the theatrical release to the release date of the DVD. First, let’s remove all films that have ‘NA’ for any release year.

movies_rel <- movies %>% filter(!is.na(dvd_rel_year)) %>% filter(!is.na(thtr_rel_year))
dim(movies_rel)

## [1] 643  32

That removed eight entries. Let’s compute the number of days between theatrical release date and DVD release date.

# Build a new column of type Date for the theatrical release date
movies_rel <- movies_rel %>% mutate(theatrical_release_date = as.Date(sprintf("%d-%d-%d", thtr_rel_year, thtr_rel_month, thtr_rel_day)))

# Build another new column of type Date for the DVD release date
movies_rel <- movies_rel %>% mutate(dvd_release_date = as.Date(sprintf("%d-%d-%d", dvd_rel_year, dvd_rel_month, dvd_rel_day)))

# Finally, calculate the number of days from theatrical release to DVD release
movies_rel <- movies_rel %>% mutate(days_to_dvd_release = as.numeric(dvd_release_date - theatrical_release_date, units="days"))

And finally let’s do a linear regression with our new predictor, days_to_dvd_release, and our response variable audience_score from Rotten Tomatoes:

does_dvd_matter <- lm(audience_score ~ days_to_dvd_release, data = movies_rel)
summary(does_dvd_matter)

## 
## Call:
## lm(formula = audience_score ~ days_to_dvd_release, data = movies_rel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.883 -15.859   3.146  17.148  33.917 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.181e+01  9.849e-01  62.761   <2e-16 ***
## days_to_dvd_release 2.779e-04  2.537e-04   1.095    0.274    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.13 on 641 degrees of freedom
## Multiple R-squared:  0.001869,   Adjusted R-squared:  0.0003114 
## F-statistic:   1.2 on 1 and 641 DF,  p-value: 0.2737

Wow, talk about “no correlation!” I wonder, then, if something is wrong with our assumption. Let’s look at the years for the releases:

ggplot(data = movies_rel, aes(x = thtr_rel_year)) + geom_bar(fill="darkslateblue") + labs(title = "Theatrical Release Year")

Ah. We have data for movies released in the 1970s, so the DVD release date will be pretty long after the theatrical release date! Let’s look at the distribution of years for DVD release date:

ggplot(data = movies_rel, aes(x = dvd_rel_year)) + geom_bar(fill="darkslateblue") + labs(title = "DVD Release Year")

Quite a difference. What we really want, then, is to compare the theatrical release date to the DVD release date for films whose theatrical release date corresponds to the popularity of the DVD format. I had not taken into account three things:

We have films whose theatrical release date come from a time when there simply was no “home release” (prior to the VHS format)
DVDs didn’t actually make it to the consumer market until after 1995 (per Wikipedia)
The Blu-Ray format was introduced in 2003 and saw significant sales starting in 2006, at the expense of the DVD format (again, per Wikipedia)

With these realizations we can narrow our data to account for these facts. Let’s assume the years of 2000 to 2010 as the “heyday” of DVD sales and limit our model to films with theatrical releases in those years:

movies_dvd <- movies_rel %>% filter(thtr_rel_year >= 2000, thtr_rel_year <= 2010)
ggplot(data = movies_dvd, aes(x = factor(thtr_rel_year))) + geom_bar(fill="darkslateblue") + labs(title = "Theatrical Release Year 2000-2010")

Now we should have better data to make our inference.

does_dvd_matter <- lm(audience_score ~ days_to_dvd_release, data = movies_dvd)
summary(does_dvd_matter)

## 
## Call:
## lm(formula = audience_score ~ days_to_dvd_release, data = movies_dvd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.344 -14.543   3.084  17.008  31.659 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         61.792877   1.655776  37.320   <2e-16 ***
## days_to_dvd_release  0.006231   0.004530   1.376     0.17    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.22 on 252 degrees of freedom
## Multiple R-squared:  0.007453,   Adjusted R-squared:  0.003514 
## F-statistic: 1.892 on 1 and 252 DF,  p-value: 0.1702

While we still see a slope for days_to_dvd_release that is pretty much zero at least it’s not astronomically tiny as it was before. More on this in our “Inference” section below. Note we have three films whose DVDs came out on or before the theatrical release:

ggplot(data = movies_dvd, aes(x = days_to_dvd_release, y = audience_score)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

and those films are:

direct_to_dvd_maybe <- movies_dvd[movies_dvd$days_to_dvd_release < 0,]
direct_to_dvd_maybe[c("title", "genre", "thtr_rel_year", "dvd_rel_year")]

## # A tibble: 3 × 4
##                      title              genre thtr_rel_year dvd_rel_year
##                      <chr>             <fctr>         <dbl>        <dbl>
## 1 Dirty Sanchez: The Movie             Comedy          2007         2007
## 2       London to Brighton Mystery & Suspense          2008         2005
## 3                  Bookies              Drama          2004         2004

but this doesn’t qualify them as outliers and they probably have little influence on the model.

Part 4: Modeling

For our multiple linear regression we will use Adjusted R-squared, which is better suited for making predictions. Let’s start by looking at the full model using the audience score from Rotten Tomatoes:

full_model <- lm(audience_score ~ genre + runtime + mpaa_rating + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movies_for_lm)
summary(full_model)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + top200_box, data = movies_for_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.877  -8.763   0.515   9.113  40.712 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     30.20448    5.00242   6.038 2.68e-09 ***
## genreAnimation                   5.35017    5.46689   0.979 0.328132    
## genreArt House & International   6.32227    4.21886   1.499 0.134488    
## genreComedy                      0.01512    2.33917   0.006 0.994844    
## genreDocumentary                10.92760    3.17288   3.444 0.000611 ***
## genreDrama                       2.24704    2.04338   1.100 0.271901    
## genreHorror                     -8.40856    3.48101  -2.416 0.015997 *  
## genreMusical & Performing Arts  10.69044    4.49506   2.378 0.017693 *  
## genreMystery & Suspense         -3.79941    2.62451  -1.448 0.148211    
## genreOther                       1.45588    3.96310   0.367 0.713476    
## genreScience Fiction & Fantasy  -7.02810    4.96991  -1.414 0.157821    
## runtime                          0.07012    0.03332   2.105 0.035711 *  
## mpaa_ratingNC-17               -11.29602   10.58749  -1.067 0.286419    
## mpaa_ratingPG                   -1.98305    3.85925  -0.514 0.607542    
## mpaa_ratingPG-13                -2.99617    3.97551  -0.754 0.451339    
## mpaa_ratingR                    -1.35467    3.84036  -0.353 0.724397    
## mpaa_ratingUnrated              -3.06938    4.37962  -0.701 0.483667    
## critics_score                    0.43523    0.02290  19.004  < 2e-16 ***
## best_pic_nomyes                 10.00938    3.62296   2.763 0.005900 ** 
## best_pic_winyes                 -1.95947    6.36506  -0.308 0.758301    
## best_actor_winyes               -1.58429    1.66193  -0.953 0.340817    
## best_actress_winyes             -2.23071    1.83976  -1.213 0.225778    
## best_dir_winyes                 -0.07536    2.40966  -0.031 0.975060    
## top200_boxyes                    4.25460    3.79858   1.120 0.263121    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.93 on 626 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5426, Adjusted R-squared:  0.5258 
## F-statistic: 32.29 on 23 and 626 DF,  p-value: < 2.2e-16

To save space (and scrolling) I’ll just print out the Adjusted R-squared value.

After performing backwards elimination for Adjusted R-squared for the Rotten Tomatoes audience scores, we wind up with an Adjusted R-squared value of about 0.526.

rt_model <- lm(audience_score ~ genre + runtime + mpaa_rating + critics_score + best_pic_nom + best_dir_win + top200_box, data = movies_for_lm)
summary(rt_model)$adj.r.squared

## [1] 0.5260853

Plotting the residuals, the scatter and distribution look good:

plot(rt_model$residuals)

hist(rt_model$residuals)

If we use IMDb’s ratings as our predicted value, again for Adjusted R-squared using backwards elimination we arrive at:

imdb_model <- lm(imdb_rating ~ genre + runtime + mpaa_rating + critics_score + best_pic_nom + best_dir_win + top200_box, data = movies_for_lm)
summary(imdb_model)$adj.r.squared

## [1] 0.6273571

which gets a higher Adjusted R-squared for the same parameters. Plotting the residuals here looks good too:

plot(imdb_model$residuals)

hist(imdb_model$residuals)

so both models appear to meet the requirements for multiple linear regression.

Part 5: Prediction

We’ll test our model with the Disney animated film Zootopia. We’ll run our predictions for both Rotten Tomatoes and IMDb.

Zootopia <- data.frame(genre="Comedy", runtime=108, mpaa_rating="PG", critics_score=98, best_pic_nom="no", best_dir_win="no", top200_box="yes")

First we predict with our Rotten Tomatoes model:

predict(rt_model, newdata=Zootopia, interval="confidence", level=.95)

##        fit      lwr      upr
## 1 81.76869 73.60684 89.93053

And then our IMDb model:

predict(imdb_model, newdata=Zootopia, interval="confidence", level=.95)

##        fit    lwr      upr
## 1 7.330328 6.9423 7.718356

Our Rotten Tomatoes model predicts a score of 82% with a 95% confidence interval of (74%, 90%). Zootopia might be an outlier: the actual Rotten Tomatoes audience score is 92%.

Our IMDb model predicts a score of 7.3 with a 95% confidence interval of (6.9, 7.7). The actual score on IMDb is 8.1.

Since our data consists primarily of live action films (as opposed to animated films like Zootopia) let’s try something else: Deadpool:

Deadpool <- data.frame(genre="Action & Adventure", runtime=108, mpaa_rating="R", critics_score=84, best_pic_nom="no", best_dir_win="no", top200_box="yes")

Again we first use our Rotten Tomatoes model:

predict(rt_model, newdata=Deadpool, interval="confidence", level=.95)

##        fit     lwr      upr
## 1 76.80159 68.7703 84.83288

And again with our IMDb model:

predict(imdb_model, newdata=Deadpool, interval="confidence", level=.95)

##        fit      lwr      upr
## 1 7.235894 6.854073 7.617716

We see a Rotten Tomatoes prediction of 77% with a 95% CI of (69%, 85%). The actual score is 84%.

For our IMDb prediction we get 7.2 with a 95% CI of (6.9, 7.6). The actual IMDb score for Deadpool is 8.0.

Part 6: Conclusion

We were tasked with two items from our boss at Paramount Pictures:

What attributes make a movie popular
Find out something new about movies

What makes a movie popular? There are far more variables than we were provided. I can imagine more complex models that account for the combination of actors and actresses, the producers, the screenwriters, holiday releases, and so on. In fact, there are probably models we can write that are specific to genres like animated films. One model (or two) cannot work for films as diverse as Zootopia and Deadpool, which are quite different. Our two models came pretty close to predicting our two movies; in fact the Rotten Tomatoes model correctly predicted Deadpool’s audience rating.

For the second item we have shown that there is no correlation between the number of days between the theatrical release and the DVD release of a film and the audience score. In data science it can be just as important to discover “There is no correlation” as it is to find a correlation.

Note, though, that the number of days between release dates is probably very important in terms of DVD sales, something we are not measuring. We can guess that the movie studios know very well that waiting too long to bring a DVD to market after the theatrical release can negatively impact sales.