library(ggplot2)
library(dplyr)
library(statsr)
load("gss.Rdata")

Part 1: Data

The data for the General Social Survey (GSS) are collected by researchers at the National Opinion Research Center (NORC) at the University of Chicago. According to Wikipedia:

The survey was conducted every year from 1972 to 1994 (except in 1979, 1981, and 1992). Since 1994, it has been conducted every other year.

Thus, the survey has collected data for over four decades.

How the data is collected

Survey respondents are adults aged 18 and over living in households in the United States. They are randomly selected from a mix of urban, suburban, and rural areas using area probability sampling, a form of random sampling.

Survey methodology

The survey is conducted face-to-face by NORC at the University of Chicago. (source)

Implications for inference

Because the respondents are randomly selected and make up less than ten percent of the U.S. population, the observations can reasonably be treated as independent and the results generalized to the adult household population of the United States. Note that we may identify associations in the data, but because the GSS is an observational study with no random assignment, we cannot draw causal conclusions from it.
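The "less than ten percent" condition can be checked with simple arithmetic. The sketch below uses the 2,207 rows reported by the str() output in Part 3 as the sample size. The population figure is only a rough assumption for illustration, not a number taken from the GSS or from this report.

```r
# Sample size: the number of rows shown in the str() output in Part 3.
n_sample <- 2207

# ASSUMPTION: a rough figure for the U.S. adult population,
# used here for illustration only.
adult_population <- 240e6

# Fraction of the population that ended up in the sample.
sampling_fraction <- n_sample / adult_population

# The condition holds when the sample is under 10% of the population.
meets_ten_percent_condition <- sampling_fraction < 0.10
print(meets_ten_percent_condition)
```

Any plausible population figure gives the same answer, since the sample is several orders of magnitude smaller than ten percent of the population.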


Part 2: Research question

For the past few decades the newspaper industry in the United States has seen declines in revenue and circulation. The Pew Research Center tracks revenue and circulation numbers for the newspaper industry and publishes its data online.

The GSS asks the question “How often do you read the newspaper - every day, a few times a week, once a week, less than once a week, or never?”

Our research question: If we make inferences to the U.S. population based on the GSS data for news readership, will our results agree with or contradict Pew’s results showing declines in newspaper circulation and revenue?

We might expect the self-reported frequency of newspaper reading done by respondents to show declines as well.

The decline of the newspaper industry has a personal fascination for me since I began my career as a software developer first for the Cleveland Plain Dealer’s web site, then the New York Times and later Forbes. I have many colleagues still working in the news business. The struggle of the industry to find profitability in the digital age has a special drama for me.

A look at Pew Research’s public data

According to Pew Research Center, the total number of daily newspapers has decreased by more than 100 since 2004, from 1,457 down to 1,331:

Pew Research Center: Total number of newspapers
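As a quick check on the figures quoted above, the drop can be expressed both as a count and as a percentage of the 2004 total. This is a small sketch using only the two numbers cited from Pew; the variable names are my own.

```r
# Daily newspaper counts quoted from Pew Research Center above.
papers_2004   <- 1457
papers_latest <- 1331

# Absolute and relative decline since 2004.
absolute_decline <- papers_2004 - papers_latest
percent_decline  <- 100 * absolute_decline / papers_2004

cat("Decline:", absolute_decline, "newspapers,",
    round(percent_decline, 1), "percent of the 2004 total\n")
```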


Declining revenues

Pew Research also gives us numbers for the revenue declines. Note that every point on this graph that lies below zero represents a decline in revenue from the previous year. Advertising revenue has declined every year, while circulation revenue has stayed largely flat:

Pew Research: Advertising revenue sees biggest drop since 2009


Declining circulation

Similarly, Pew reports circulation declines for most years since 2004:

Pew Research: Newspaper circulation declines for second consecutive year in 2015


We might expect to see a decline between 2004 and 2012 in how often GSS respondents report reading the newspaper.


Part 3: Exploratory data analysis

First let’s filter the data set down to the rows where the year is either 2004 or 2012 (2012 being the most recent year in the GSS data). We’ll drop rows where “news” is NA and omit all columns except “year” and “news.”

gssInterestingYears <- filter(gss, year == 2004 | year == 2012)
gssNewsResponded <- filter(gssInterestingYears, !is.na(news))
gssScoped <- gssNewsResponded[c("year","news")] # could have used select() here too, same result
str(gssScoped)
## 'data.frame':    2207 obs. of  2 variables:
##  $ year: int  2004 2004 2004 2004 2004 2004 2004 2004 2004 2004 ...
##  $ news: Factor w/ 5 levels "Everyday","Few Times A Week",..: 1 1 4 2 1 1 2 1 1 1 ...
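For readers who want to try the filtering steps without the gss.Rdata file, here is a minimal sketch of the same logic in base R on a made-up data frame. The data frame toy, its values, and the extra age column are hypothetical and exist only for illustration. subset() is used here so the example runs without any packages; it is not the approach used in the code above.

```r
# Hypothetical stand-in for the gss data frame; all values are made up.
toy <- data.frame(
  year = c(2004, 2004, 2008, 2012, 2012, 2012),
  news = factor(c("Everyday", NA, "Never", "Once A Week", NA, "Never")),
  age  = c(34, 51, 27, 45, 62, 19)
)

# Keep rows from 2004 or 2012 with a non-missing "news" answer,
# and keep only the "year" and "news" columns.
toyScoped <- subset(
  toy,
  year %in% c(2004, 2012) & !is.na(news),
  select = c(year, news)
)

str(toyScoped)
```

The %in% test does the same job as chaining == comparisons with |, and it stays readable if more years are added later.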

Now let’s have a look at those values for 2004:

ggplot(data = filter(gssScoped, year == 2004), aes(x = news)) + geom_bar() + labs(title = "2004") +  xlab("Year 2004 responses: 'How often do you read the newspaper'")

Let’s do the same for 2012:

ggplot(data = filter(gssScoped, year == 2012), aes(x = news)) + geom_bar() + labs(title = "2012") + xlab("Year 2012 responses: 'How often do you read the newspaper'")
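If the two years have different numbers of respondents, raw bar heights are hard to compare directly. One option is to look at the share of each answer within each year. The sketch below shows the idea in base R on a small made-up data frame; toy and its values are hypothetical, and the same two calls could be applied to gssScoped.

```r
# Hypothetical responses; the real data has five answer levels.
toy <- data.frame(
  year = c(2004, 2004, 2004, 2004, 2012, 2012),
  news = factor(
    c("Everyday", "Everyday", "Never", "Everyday", "Everyday", "Never"),
    levels = c("Everyday", "Never")
  )
)

# Cross-tabulate year by answer, then convert each row (year)
# to proportions so that every row sums to 1.
counts <- table(toy$year, toy$news)
props  <- prop.table(counts, margin = 1)

print(round(props, 2))
```

With margin = 1 the proportions are computed within each year, which is the comparison the research question calls for.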