Ramsay Lewis

Python Project: Movie Data Correlations


I completed this project to demonstrate my data analysis skills in Python. It's a simple project: I use Python to load the data, clean it, and run a few correlation analyses. You can find the full code for the project on GitHub.


Methods and Tools

Data source: I used movie data from Kaggle.


Python environment: I wrote this code in a Jupyter notebook.


Libraries: In this project, I used the following libraries: pandas, numpy, seaborn, scipy, and matplotlib.

 # Import libraries
 
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import pearsonr

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
plt.style.use('ggplot')
from matplotlib.pyplot import figure

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8) #adjusts configuration of the plots we'll create

pd.options.mode.chained_assignment = None
 

Step 1: Read the data

The first step was to read the data as a data frame.

#Read in the data
#Downloaded from Kaggle: https://www.kaggle.com/datasets/danielgrijalvas/movies 

df = pd.read_csv(r'/Users/ramsay/Documents/Coding/5. Python Movie Database/movies.csv') 
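
Before cleaning anything, it can help to take a quick look at what was loaded. Here's a minimal sketch, assuming the same df as above:

#Quick preview of the data
print(df.shape)    #number of rows and columns
df.head()          #first few rows of the data frame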

Step 2: Look for missing data

Next, I looked for missing data using a for loop.

#Here, I'm looking for missing data using a For loop. 

for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, pct_missing))

I found missing data in several columns. There are several ways to deal with missing data; in this case, I decided to simply drop the rows with missing values. This is the easiest approach, although it risks biasing the results (an alternative is sketched after the code below).

#Here, I'm getting rid of rows with missing data

df = df.dropna()

#Now check again

for col in df.columns:
    pct_missing = np.mean(df[col].isnull())
    print('{} - {}%'.format(col, pct_missing))
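
As mentioned above, dropping rows is only one option. For reference, here is an alternative I didn't use in this project: imputing missing numeric values with each column's median instead of calling dropna().

#Alternative (not used in this project): fill missing numeric values with the column median
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())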

Step 3: Change data types

I noticed that some of the numeric values had decimals on the end, so I checked each column's data type.

print(df.dtypes)

I decided to change the three numeric variables I was going to use—gross revenue, budget, and votes—to integers, getting rid of the decimal in the process.

#This changes the data type of the Budget and Gross columns

df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].astype('int64')
df['votes'] = df['votes'].astype('int64')
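
Running print(df.dtypes) again confirms the change. Note that astype('int64') only works here because the missing values were dropped first; it raises an error if NaNs are still present.

#Confirm that the three columns are now integers
print(df[['budget', 'gross', 'votes']].dtypes)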

Step 4: Correlations

I was interested in whether budget, votes, and runtime were associated with gross revenue. I decided to test the hypothesis that each of these three variables was positively correlated with gross revenue.


Is a movie's budget related to its gross revenue?


I started by plotting budget against gross revenue in a scatter plot.

plt.scatter(x=df['budget'], y=df['gross'])
plt.title('Budget vs. Gross Revenue')
plt.xlabel('Budget ($)')
plt.ylabel('Gross Revenue ($)')
plt.show()

Then I added a line of best fit.

sns.regplot(x='budget', y='gross', data=df, scatter_kws={"color":"red"}, line_kws={"color":"blue"})


Next, I created a heatmap of the correlation matrix for all the numeric variables.

correlation_matrix = df.corr(method='pearson', numeric_only=True)  #numeric_only keeps just the numeric columns
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix for Numeric Features')
plt.xlabel('Movie Features')
plt.ylabel('Movie Features')
plt.show()
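
Since gross revenue is the outcome I care about, the relevant column of the matrix can also be pulled out directly; a quick sketch using the correlation_matrix computed above:

#Correlation of each numeric feature with gross revenue, strongest first
print(correlation_matrix['gross'].sort_values(ascending=False))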

It appears that the relationship between a movie’s budget and its revenue is quite strong—it has a Pearson correlation coefficient of 0.74 in this data.


Next, I wanted to test the statistical significance of that coefficient. I specified that the null hypothesis I was testing was that there wasn’t a relationship between budget and gross revenue. The alternative hypothesis is that there really is a relationship between budget and gross revenue. I specified an alpha of 0.05 and chose to use a two-tailed test.

from scipy import stats
pearson_coef1, p_value1 = stats.pearsonr(df["budget"], df["gross"])
print("Pearson Correlation Coefficient: ", round(pearson_coef1, 2), "and a P-value of:", "{0:.2f}".format(p_value1,))

The test found that the relationship was significant (r = 0.74, p < 0.01).


Because the p-value was less than alpha, I concluded that it would have been very unlikely to observe a relationship so strong under a true null hypothesis (i.e. if there wasn’t really a relationship between budget and gross revenue). So, I rejected the null hypothesis and concluded that there is a relationship between budget and gross revenue.
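
That decision rule can be made explicit in code. A small sketch, using the p_value1 computed above and the alpha of 0.05 I specified:

#Reject the null hypothesis if the p-value is below alpha
alpha = 0.05
if p_value1 < alpha:
    print("Reject the null hypothesis: budget and gross revenue are correlated.")
else:
    print("Fail to reject the null hypothesis.")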


Are IMDB votes related to gross revenue?

Next, I looked to see if “votes”—the number of user votes a movie has received on IMDB—were also related to gross revenue. I looked at the correlation matrix heat map and saw a correlation coefficient of 0.61.


I conducted a hypothesis test to see if it was significantly different from 0. I specified the null hypothesis to be no relationship between votes and gross revenue. The alternative hypothesis is that there really is a relationship between votes and gross revenue. I used an alpha of 0.05 and a two-tailed test.

pearson_coef2, p_value2 = stats.pearsonr(df["votes"], df["gross"])
print("Pearson Correlation Coefficient: ", round(pearson_coef2, 2), "and a P-value of:", "{0:.2f}".format(p_value2))

The test found that the relationship was significant (r = 0.61, p < 0.01).


Because the p-value was less than alpha, I concluded that it would have been very unlikely to observe a relationship this strong under a true null hypothesis, so I rejected the null hypothesis and accepted the alternative: that there is a relationship between votes and gross revenue.


Is a movie's runtime related to its gross revenue?

Next, I looked to see if “runtime”—the length of a movie—was related to gross revenue. I looked at the correlation matrix heat map and saw a correlation coefficient of 0.28.


I conducted a hypothesis test to see if it was significantly different from 0. I specified the null hypothesis to be no relationship between runtime and gross revenue. The alternative hypothesis is that there really is a relationship between runtime and gross revenue. As before, I used an alpha of 0.05 and a two-tailed test.

pearson_coef3, p_value3 = stats.pearsonr(df["runtime"], df["gross"])
print("Pearson Correlation Coefficient: ", round(pearson_coef3, 2), "and a P-value of:", "{0:.2f}".format(p_value3))

The test found that the relationship was significant (r = 0.28, p < 0.01).


Because the p-value was less than alpha, I concluded that it would have been very unlikely to observe a relationship so strong under a true null hypothesis, so I rejected the null hypothesis and accepted the alternative—that there is a relationship between runtime and gross revenue.
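
All three tests follow the same pattern, so they could also be run in a single loop. A minimal sketch using the same df and the stats module imported above:

#Run the same Pearson correlation test for each predictor of gross revenue
for predictor in ['budget', 'votes', 'runtime']:
    coef, p_value = stats.pearsonr(df[predictor], df['gross'])
    print('{}: r = {:.2f}, p = {:.3g}'.format(predictor, coef, p_value))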


 

Need help preparing your data for analysis? Get in touch and let’s see if I can help.
