# Load libraries and data
library(ggplot2)
library(dplyr)
library(statsr)
load("movies.Rdata")

Part 1: Data

How the data was collected

Our Coursera instructors provided us with a random sample of movies drawn from the websites IMDb and Rotten Tomatoes. Although they do not tell us how the random selection was carried out, we place our faith in our instructors that it was done well.

Survey methodologies

We are given scores and ratings of the randomly selected films from three sources:

  • Critics reviews gathered by Rotten Tomatoes
  • Audience reviews from IMDb
  • Audience reviews from Rotten Tomatoes

Both IMDb and Rotten Tomatoes build their audience scores from user reviews, which suffer from biases similar to those of online polls. This quote from the New York Times pertains to election polling done on news websites and blogs, but the concerns are the same for our audience scores:

“[Online polls] do a good job of engaging audiences online, and they do a good job of letting you know how other people who have come to the webpage feel about whatever issue,” said Mollyann Brodie, the executive director for public opinion and survey research at the Kaiser Family Foundation. “But they’re not necessarily good at telling you, in general, what people think, because we don’t know who’s come to that website and who’s taken it.”

Professional pollsters use scientific statistical methods to make sure that their small random samples are demographically appropriate to indicate how larger groups of people think. Online polls do nothing of the sort, and are not random, allowing anyone who finds the poll to vote.

(“Why You Shouldn’t Trust ‘Polls’ Conducted Online,” New York Times, September 28, 2016)
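
The concern in the quote can be illustrated with a small simulation. This is only a sketch on made-up data: the population of scores, the response probabilities, and every variable name below are assumptions for illustration, not anything taken from IMDb or Rotten Tomatoes.

```r
# Hypothetical population: 100,000 viewers, each with a "true" score from 0 to 100
set.seed(1)
population <- runif(100000, min = 0, max = 100)

# Assumption: viewers who liked the film more are more likely to post a review
p_respond <- 0.01 + 0.09 * population / 100
responded <- runif(length(population)) < p_respond

mean_random <- mean(sample(population, 1000))  # simple random sample of viewers
mean_self   <- mean(population[responded])     # self-selected reviewers only

c(random_sample = mean_random, self_selected = mean_self)
```

Under these assumptions the self-selected average lands well above the random-sample average, even though both are drawn from the same population of viewers.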

The critics scores are derived from an expert judgement of the critics' reviews by people at Rotten Tomatoes. We do not know the process of human judgement that produces a score for a given critic’s review (how it might be 45% versus 50%), but we won’t be using critics reviews here, since our boss at Paramount is concerned about what makes a movie popular, as opposed to critically acclaimed. In fact, a critic’s score might actually be a predictor of the audience score.

Implications for inference

I believe it was the computer security guru Bruce Schneier who once said, “For marketing purposes, 30% accuracy can be good enough.” Perhaps it was a wry comment, but he meant that when you are trying to market a product to people, having 30% accuracy in reaching the correct audience is better than marketing to people selected purely at random.

In that spirit, despite the bias problems with the audience scoring addressed above, at the very least our inferences will tell us what kinds of films IMDb or Rotten Tomatoes reviewers will like. We will treat the audience scores, for this report, as having veracity for generalizing to the greater population.

Note that we can only correlate movie parameters with audience score. This is an observational study with no random assignment, so we cannot infer causality, even if we were guaranteed that the audience scores came from randomly selected reviewers.

For me personally, the interest here is general. I’ve been a movie lover pretty much all of my life. I lived in New York City for twelve years, where going to the movies borders on being a sport. As a kid I saw “Star Wars” twelve times in the theater in its original run in 1977. I have been an avid fan of “Mystery Science Theater 3000”, have seen countless films in art house cinemas, and even sat through the film adaptation of “Battlefield Earth,” which gets a 3% rating on Rotten Tomatoes (which sounds a bit high).


Part 2: Research question

Our boss here at Paramount Pictures poses two questions for us:

  1. What attributes make a movie popular?
  2. Can we learn something new about movies?

We will use multiple linear regression to answer her first question.

As to learning something new: consider that a movie has a theatrical release date and a DVD release date. If we measure the amount of time from the theatrical release to the DVD release, does that length of time correlate to the popularity of the movie?

To speculate why this might be, we can think about a given movie having a “hype cycle.” A film may have an enormous amount of hype leading up to its release. It may do well at the box office. If the movie studio waits too long to release the DVD, then perhaps both sales of the DVD and “online hype” (which probably generates more IMDb and Rotten Tomatoes responses) may suffer. We might see this as a negative correlation between the number of days and audience score.

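One concern here may be: “Does this question involve time-series data?” Least squares regression assumes independent observations, and sequential measurements over time can violate that condition. Here each film contributes a single observation, and our predictor is simply a number of days (the span from theatrical release to DVD release) rather than a series tracked over time, so this should not pose a problem.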
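
One rough way to probe the independence condition is to look at whether consecutive residuals are correlated. The sketch below runs on simulated data only; the variable names, the sample size, and the choice of a lag-1 correlation as the check are all assumptions for illustration, not part of the course instructions.

```r
# Simulated stand-in data: 200 hypothetical films with unrelated score and delay
set.seed(42)
n     <- 200
days  <- round(runif(n, min = 90, max = 400))  # hypothetical days to DVD release
score <- 60 + rnorm(n, sd = 20)                # hypothetical audience scores

fit <- lm(score ~ days)
res <- resid(fit)

# Correlation between each residual and the one before it (in row order).
# A value near zero is what we would hope to see for independent observations.
lag1_cor <- cor(res[-1], res[-n])
lag1_cor
```

A value far from zero would suggest that neighboring rows are related, which is the pattern time-series data tends to produce.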


Part 3: Exploratory data analysis

For our first question: cleaning up the data for the multiple linear regression

Our instructions list several things to remove: actor1 through actor5 and the URIs for the films. We’ll also leave out the director and title_type. At first I left in “studio” thinking my boss would want to know which studios netted films with higher scores, all other predictors held equal; but this made the output messy and the data looked somewhat questionable:

## studioWalt Disney Home Entertainment            38.31529   20.15666
## studioWalt Disney Pictures                      13.97832   10.88395
## studioWalt Disney Productions                    3.59856   13.25061

Walt Disney Home Entertainment does not create releases for theaters, I believe. Also, Paramount Pictures (where my boss works) cannot change which studio makes a film, so I decided to omit ‘studio’.
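
The messy output comes from how R encodes a factor predictor: every level beyond the reference level gets its own indicator column and its own coefficient. A minimal sketch with invented studio names and scores (none of these values come from the movies data):

```r
# Toy data: three hypothetical studios, two films each
toy <- data.frame(
  audience_score = c(70, 55, 81, 64, 90, 47),
  studio = factor(c("Studio A", "Studio B", "Studio C",
                    "Studio A", "Studio B", "Studio C"))
)

# model.matrix() shows the columns lm() would build for this formula
design <- model.matrix(audience_score ~ studio, data = toy)
colnames(design)

# One indicator column per level, minus the reference level
n_dummy <- ncol(design) - 1
n_dummy
```

With many distinct studios and only a few films per studio, this expansion produces a long list of coefficients with large standard errors, much like the lines shown above.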

# Keep 'title' in case we want to identify a single film
movies_for_lm <- movies %>%
  select(title, genre, runtime, mpaa_rating,
         imdb_rating, imdb_num_votes,
         critics_rating, critics_score,
         audience_rating, audience_score,
         best_pic_nom, best_pic_win, best_actor_win,
         best_actress_win, best_dir_win, top200_box)

str(movies_for_lm)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  16 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

For our second question: data analysis about the number of days between release dates

We need to compute the number of days from the date of the theatrical release to the release date of the DVD. First, let’s remove all films that have ‘NA’ for any release year.

movies_rel <- movies %>% filter(!is.na(dvd_rel_year)) %>% filter(!is.na(thtr_rel_year))
dim(movies_rel)
## [1] 643  32

That removed eight entries. Let’s compute the number of days between theatrical release date and DVD release date.

# Build a new column of type Date for the theatrical release date
movies_rel <- movies_rel %>% mutate(theatrical_release_date = as.Date(sprintf("%d-%d-%d", thtr_rel_year, thtr_rel_month, thtr_rel_day)))

# Build another new column of type Date for the DVD release date
movies_rel <- movies_rel %>% mutate(dvd_release_date = as.Date(sprintf("%d-%d-%d", dvd_rel_year, dvd_rel_month, dvd_rel_day)))

# Finally, calculate the number of days from theatrical release to DVD release
movies_rel <- movies_rel %>% mutate(days_to_dvd_release = as.numeric(dvd_release_date - theatrical_release_date, units="days"))
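
To sanity-check the date arithmetic, here is a small sketch in base R on two invented films. The helper `to_date()` and all the values are hypothetical; it uses a zero-padded string and an explicit `format` so the parsing does not depend on `as.Date()` guessing the layout.

```r
# Hypothetical release columns for two films (values invented for illustration)
toy <- data.frame(
  thtr_rel_year = c(2004, 2005), thtr_rel_month = c(6, 12), thtr_rel_day = c(18, 9),
  dvd_rel_year  = c(2004, 2006), dvd_rel_month  = c(11, 4), dvd_rel_day  = c(23, 4)
)

# Hypothetical helper: build a Date from separate year, month, and day columns
to_date <- function(y, m, d) {
  as.Date(sprintf("%04d-%02d-%02d", y, m, d), format = "%Y-%m-%d")
}

theatrical <- to_date(toy$thtr_rel_year, toy$thtr_rel_month, toy$thtr_rel_day)
dvd        <- to_date(toy$dvd_rel_year,  toy$dvd_rel_month,  toy$dvd_rel_day)

# Subtracting two Dates gives a difftime; convert it to a plain number of days
days_to_dvd <- as.numeric(dvd - theatrical, units = "days")
days_to_dvd
```

Counting by hand, 18 June 2004 to 23 November 2004 is 158 days and 9 December 2005 to 4 April 2006 is 116 days, which is what the subtraction should return.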

And finally let’s do a linear regression with our new predictor, days_to_dvd_release, and our response variable audience_score from Rotten Tomatoes:

does_dvd_matter <- lm(audience_score ~ days_to_dvd_release, data = movies_rel)
summary(does_dvd_matter)
## 
## Call:
## lm(formula = audience_score ~ days_to_dvd_release, data = movies_rel)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.883 -15.859   3.146  17.148  33.917 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.181e+01  9.849e-01  62.761   <2e-16 ***
## days_to_dvd_release 2.779e-04  2.537e-04   1.095    0.274    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.13 on 641 degrees of freedom
## Multiple R-squared:  0.001869,   Adjusted R-squared:  0.0003114 
## F-statistic:   1.2 on 1 and 641 DF,  p-value: 0.2737
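
The numbers in this summary can be unpacked with a little arithmetic. The sketch below only reuses values printed above; the variable names are my own, and the rounding in the printed output means the results match the summary only approximately.

```r
# Values copied from the summary() output above
slope <- 2.779e-04   # estimated change in audience_score per extra day
se    <- 2.537e-04   # standard error of the slope
df    <- 641         # residual degrees of freedom
r_sq  <- 0.001869    # multiple R-squared

t_stat  <- slope / se                              # test statistic for H0: slope = 0
p_value <- 2 * pt(-abs(t_stat), df)                # two-sided p-value
ci_95   <- slope + c(-1, 1) * qt(0.975, df) * se   # 95% confidence interval
per_year <- slope * 365                            # change in score per year of delay
r        <- sqrt(r_sq)                             # correlation (slope is positive)

list(t = t_stat, p = p_value, ci = ci_95, per_year = per_year, r = r)
```

The interval for the slope straddles zero, and even the point estimate amounts to only about a tenth of a point of audience score per additional year of delay.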

Wow, talk about “no correlation!” I wonder, then, if something is wrong with our assumption. Let’s look at the years for the releases:

ggplot(data = movies_rel, aes(x = thtr_rel_year)) + geom_bar(fill="darkslateblue") + labs(title = "Theatrical Release Year")

Ah. We have data for movies released in the 1970s, so the DVD release date will be pretty long after the theatrical release date! Let’s look at the distribution of years for DVD release date:

ggplot(data = movies_rel, aes(x = dvd_rel_year)) + geom_bar(fill="darkslateblue") + labs(title = "DVD Release Year")

Quite a difference. What we really want, then, is to compare the theatrical release date to the DVD release date for films whose theatrical release falls within the years when the DVD format was popular. I had not taken into account three things:
  1. We have films whose theatrical release dates come from a time when there simply was no “home release” (prior to the VHS format)
  2. DVDs didn’t actually make it to the consumer market until after 1995 (per Wikipedia)
  3. The Blu-ray format was introduced in 2003 and saw significant sales starting in 2006, at the expense of the DVD format (again, per Wikipedia)
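
To get a feel for how much these points distort the predictor, compare two hypothetical films: one released in theaters long before DVDs existed and one released in the middle of the DVD era. The dates below are invented for illustration and are not rows from the movies data.

```r
# Hypothetical release dates: a pre-DVD-era film and a DVD-era film
theatrical <- as.Date(c("1977-05-25", "2005-06-15"))
dvd        <- as.Date(c("2004-09-21", "2005-10-18"))

gap_days  <- as.numeric(dvd - theatrical, units = "days")
gap_years <- gap_days / 365.25

data.frame(theatrical, dvd, gap_days, gap_years = round(gap_years, 1))
```

The older film's gap runs to decades while the newer film's gap is a matter of months, so for older films the predictor mostly reflects when the format became available rather than any decision about how long to wait.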

With these realizations we can narrow our data to account for these facts. Let’s assume the years of 2000 to 2010 as the “heyday” of DVD sales and limit our model to films with theatrical releases in those years:

movies_dvd <- movies_rel %>% filter(thtr_rel_year >= 2000, thtr_rel_year <= 2010)
ggplot(data = movies_dvd, aes(x = factor(thtr_rel_year))) + geom_bar(fill="darkslateblue") + labs(title = "Theatrical Release Year 2000-2010")