# Load libraries and data
library(ggplot2)
library(dplyr)
library(statsr)
load("movies.Rdata")
Our Coursera instructors provided us with a random sample of movies drawn from the websites IMDb and Rotten Tomatoes. They do not tell us how the random selection was carried out, so we place our faith in our instructors that it was done well.
We are given scores and ratings of the randomly selected films from three sources:
Both IMDb and Rotten Tomatoes build their scores based on user reviews, which suffer from biases similar to online polls. This quote from the New York Times pertains to election polling done on news websites and blogs, but the concerns are the same for our audience scores:
“[Online polls] do a good job of engaging audiences online, and they do a good job of letting you know how other people who have come to the webpage feel about whatever issue,” said Mollyann Brodie, the executive director for public opinion and survey research at the Kaiser Family Foundation. “But they’re not necessarily good at telling you, in general, what people think, because we don’t know who’s come to that website and who’s taken it.”
Professional pollsters use scientific statistical methods to make sure that their small random samples are demographically appropriate to indicate how larger groups of people think. Online polls do nothing of the sort, and are not random, allowing anyone who finds the poll to vote.
(“Why You Shouldn’t Trust ‘Polls’ Conducted Online,” New York Times, September 28, 2016)
The critics’ scores are derived from an expert judgement of the critics’ reviews by people at Rotten Tomatoes. We do not know the process of human judgement that produces a score for a given critic’s review (how it might be 45% versus 50%), but we won’t be using critics’ reviews here, since our boss at Paramount is concerned about what makes a movie popular, as opposed to critically acclaimed. In fact, a critic’s score might actually be a predictor of the audience score.
I believe it was the computer security guru Bruce Schneier who once said, “For marketing purposes, 30% accuracy can be good enough.” Perhaps it was a wry comment, but he meant that when you are trying to market a product to people, having 30% accuracy in reaching the correct audience is better than marketing to purely randomly selected people.
In that spirit, despite the bias problems with the audience scoring addressed above, at the very least our inferences will tell us what kinds of films IMDb or Rotten Tomatoes reviewers will like. We will treat the audience scores, for this report, as having veracity for generalizing to the greater population.
Note that we can only correlate movie parameters to audience score. We cannot infer causality even if we were guaranteed the audience scores came from randomly selected reviewers.
For me personally, the interest here is general. I’ve been a movie lover pretty much all of my life. I lived in New York City for twelve years, where going to the movies borders on being a sport. As a kid I saw “Star Wars” twelve times in the theater in its original run in 1977. I have been an avid fan of “Mystery Science Theater 3000”, have seen countless films in art house cinemas, and even sat through the film adaptation of “Battlefield Earth,” which gets a 3% rating on Rotten Tomatoes (which sounds a bit high).
Our boss here at Paramount Pictures poses two questions for us:
We will use multiple linear regression to answer her first question.
As to learning something new: consider that a movie has a theatrical release date and a DVD release date. If we measure the amount of time from the theatrical release to the DVD release, does that length of time correlate to the popularity of the movie?
To speculate why this might be, we can think about a given movie having a “hype cycle.” A film may have an enormous amount of hype leading up to its release. It may do well at the box office. If the movie studio waits too long to release the DVD, then perhaps both sales of the DVD and “online hype” (which probably generates more IMDb and Rotten Tomatoes responses) may suffer. We might see this as a negative correlation between the number of days and audience score.
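One way to write this speculation down formally is as a simple linear model with a test on the slope. This is only a sketch: the symbols β0, β1 and ε are my own notation, and days_to_dvd_release is the predictor we construct later in this report.

```latex
\text{audience\_score}_i = \beta_0 + \beta_1 \cdot \text{days\_to\_dvd\_release}_i + \varepsilon_i

H_0 : \beta_1 = 0 \qquad \text{versus} \qquad H_A : \beta_1 \neq 0
```

If the “hype cycle” idea holds, we would expect the estimate of β1 to be negative and distinguishable from zero.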
One concern here may be: “Does this question call for time-series data?” Least squares regression requires independent observations, and time-series data would violate that condition. Because each film contributes a single number (the count of days from theatrical release to DVD release) rather than a sequence of measurements over time, this should not pose a problem.
Our instructions list several things to remove: actor1 through actor5 and the URIs for the films. We’ll also leave out the director and title_type. At first I left in “studio” thinking my boss would want to know which studios netted films with higher scores, all other predictors held equal; but this made the output messy and the data looked somewhat questionable:
## studioWalt Disney Home Entertainment 38.31529 20.15666
## studioWalt Disney Pictures 13.97832 10.88395
## studioWalt Disney Productions 3.59856 13.25061
Walt Disney Home Entertainment does not create releases for theaters, I believe. Also, Paramount Pictures (my boss) cannot change which studio a film comes from, so I decided to omit ‘studio’.
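To illustrate why a factor with many levels clutters regression output, here is a minimal sketch on a made-up data frame. The studio names and scores are hypothetical and are not taken from movies.Rdata. R expands a factor with k levels into k - 1 indicator columns, each of which gets its own coefficient row next to the intercept.

```r
# Toy data: hypothetical studios and scores, for illustration only
toy <- data.frame(
  studio = factor(c("Studio A", "Studio B", "Studio C", "Studio D",
                    "Studio A", "Studio B", "Studio C", "Studio D")),
  audience_score = c(70, 55, 62, 48, 74, 51, 66, 45)
)

# model.matrix() shows the design matrix lm() would build:
# one intercept column plus one indicator column per non-baseline level
design <- model.matrix(audience_score ~ studio, data = toy)
n_coefficients <- ncol(design)

# Each studio coefficient is a comparison against the baseline (first) level
toy_fit <- lm(audience_score ~ studio, data = toy)
length(coef(toy_fit))
```

With only four levels the output stays readable; a factor with dozens of sparsely populated levels would add a coefficient row for nearly every one of them.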
# Keep 'title' in case we want to identify a single film
movies_for_lm <- movies %>%
  select(title, genre, runtime, mpaa_rating,
         imdb_rating, imdb_num_votes,
         critics_rating, critics_score, audience_rating, audience_score,
         best_pic_nom, best_pic_win, best_actor_win, best_actress_win,
         best_dir_win, top200_box)
str(movies_for_lm)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 16 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
We need to compute the number of days from the date of the theatrical release to the release date of the DVD. First, let’s remove all films that have ‘NA’ for any release year.
movies_rel <- movies %>% filter(!is.na(dvd_rel_year)) %>% filter(!is.na(thtr_rel_year))
dim(movies_rel)
## [1] 643 32
That removed eight entries. Let’s compute the number of days between theatrical release date and DVD release date.
# Build a new column of type Date for the theatrical release date
movies_rel <- movies_rel %>% mutate(theatrical_release_date = as.Date(sprintf("%d-%d-%d", thtr_rel_year, thtr_rel_month, thtr_rel_day)))
# Build another new column of type Date for the DVD release date
movies_rel <- movies_rel %>% mutate(dvd_release_date = as.Date(sprintf("%d-%d-%d", dvd_rel_year, dvd_rel_month, dvd_rel_day)))
# Finally, calculate the number of days from theatrical release to DVD release
movies_rel <- movies_rel %>% mutate(days_to_dvd_release = as.numeric(dvd_release_date - theatrical_release_date, units="days"))
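As a quick sanity check of the date arithmetic, the sketch below applies the same steps to a single made-up film. The dates are hypothetical and not from the dataset. Subtracting two Date objects in R yields a difftime, which as.numeric() converts to a plain count of days.

```r
# Hypothetical release dates for one film (illustration only)
thtr_year <- 2000; thtr_month <- 12; thtr_day <- 25
dvd_year  <- 2001; dvd_month  <- 3;  dvd_day  <- 1

# Zero-padded ISO 8601 strings parse unambiguously with as.Date()
theatrical <- as.Date(sprintf("%04d-%02d-%02d", thtr_year, thtr_month, thtr_day))
dvd        <- as.Date(sprintf("%04d-%02d-%02d", dvd_year, dvd_month, dvd_day))

# Date - Date gives a difftime; convert it to a number of days
days_between <- as.numeric(dvd - theatrical, units = "days")
days_between
```

A negative value would mean the DVD date precedes the theatrical date, which would be worth inspecting before fitting a model.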
Now let’s fit a linear regression with our new predictor, days_to_dvd_release, and our response variable, audience_score from Rotten Tomatoes:
does_dvd_matter <- lm(audience_score ~ days_to_dvd_release, data = movies_rel)
summary(does_dvd_matter)
##
## Call:
## lm(formula = audience_score ~ days_to_dvd_release, data = movies_rel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.883 -15.859 3.146 17.148 33.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.181e+01 9.849e-01 62.761 <2e-16 ***
## days_to_dvd_release 2.779e-04 2.537e-04 1.095 0.274
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.13 on 641 degrees of freedom
## Multiple R-squared: 0.001869, Adjusted R-squared: 0.0003114
## F-statistic: 1.2 on 1 and 641 DF, p-value: 0.2737
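To put the slope in more tangible terms, here is a short worked calculation using only the estimate, standard error, and degrees of freedom printed in the summary above. It rescales the per-day slope to a per-year change and builds a 95% confidence interval. The variable names are mine, and the inputs are the rounded values from the printout, so results may differ slightly from what confint() would report on the fitted model.

```r
# Values copied from the summary() output above
slope    <- 2.779e-04   # change in audience_score per extra day
slope_se <- 2.537e-04   # standard error of the slope
df_resid <- 641         # residual degrees of freedom

# Rescale: expected change in audience_score per additional 365 days
slope_per_year <- slope * 365

# t statistic and two-sided p-value, recomputed from the rounded inputs
t_stat  <- slope / slope_se
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = slope - t_crit * slope_se,
        upper = slope + t_crit * slope_se)

slope_per_year
t_stat
p_value
ci
```

The point estimate works out to roughly a tenth of a point of audience score per year of delay, and the interval spans zero, which agrees with the p-value of about 0.27 in the summary.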
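Wow, talk about “no correlation!” The slope is statistically indistinguishable from zero (p = 0.274) and the model explains about 0.2% of the variance in audience score. I wonder, then, if something is wrong with our assumption. Let’s look at the years for the releases: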
ggplot(data = movies_rel, aes(x = thtr_rel_year)) + geom_bar(fill="darkslateblue") + labs(title = "Theatrical Release Year")
Ah. We have data for movies released in the 1970s, long before the DVD format reached the market in the late 1990s, so for those films the DVD release date will come decades after the theatrical release date! Let’s look at the distribution of years for DVD release date:
ggplot(data = movies_rel, aes(x = dvd_rel_year)) + geom_bar(fill="darkslateblue") + labs(title = "DVD Release Year")
With this in mind, we can narrow our data. Let’s assume the years 2000 to 2010 were the “heyday” of DVD sales and limit our model to films with theatrical releases in those years:
movies_dvd <- movies_rel %>% filter(thtr_rel_year >= 2000, thtr_rel_year <= 2010)
ggplot(data = movies_dvd, aes(x = factor(thtr_rel_year))) + geom_bar(fill="darkslateblue") + labs(title = "Theatrical Release Year 2000-2010")