## TL;DR

You can use a point-biserial correlation if the nominal/categorical variable has just two levels (i.e. it’s binary). But no correlation coefficient is appropriate for a continuous variable and a nominal/categorical variable with more than two levels. Instead, consider a test that looks at mean differences, such as an ANOVA.

Recently, I was watching data analysis tutorial videos.

The YouTuber was looking at __movie data__ and was explaining how to do __correlations in Python__. The first relationship he looked at was between a movie’s budget and gross revenue. As you might expect, the more a company spent on a movie, the more it tended to make in revenue.

So far so good.

But next, this analyst wanted to explore whether the company mattered. His question was, “*Did some movie companies tend to make more revenue than others?*”

His strategy was to “numerize” (assign numbers to, or measure) the categorical data—essentially, he assigned numbers randomly to each of the companies. Then he conducted __Pearson Product-Moment correlations__, looking at the relationship between those randomly assigned numbers and gross revenue.

I’m not going to link the video because I’m not out here to shame anyone. But that strategy shows a fundamental misunderstanding about what a correlation is and the questions it can help you answer.

And since I’ve taught statistics and I know that this is a common question—“What type of correlation can you do with categorical (nominal) data?”—I’m hoping this article can help clear up some of that misunderstanding and help us improve our data analysis skills.

## You can’t do a correlation with nominal/categorical data

The question I pose in the title suggests an essential misunderstanding of what correlations are and what they do.

At its most basic level, a correlation coefficient helps you understand what happens to the value of one variable as the value of another one changes.

It was a useful tool to use to investigate the question about movie budgets: As budgets increase, what happens to gross revenue? Does it increase? Does it decrease? Does it stay the same?

A correlation coefficient indicates the direction and quantifies the strength of that relationship. Correlation coefficients range between -1 and +1:

Variables that are perfectly positively related (i.e. as one goes up, the other also goes up) would get a correlation coefficient of +1.

Variables that are perfectly negatively related (i.e. as one goes up, the other goes down) would get a correlation coefficient of -1.

Variables that aren’t related (i.e. as one goes up, nothing happens to the other) would get a correlation coefficient of 0.
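To make those three cases concrete, here is a minimal sketch using NumPy. The budget figures and the three comparison variables are made up for illustration; they are not from the movie data set in the video.

```python
import numpy as np

# Hypothetical budgets in millions of dollars (made-up values for illustration).
budget = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

rises_with_budget = 2.0 * budget + 5.0          # goes up whenever budget goes up
falls_with_budget = 100.0 - 3.0 * budget        # goes down whenever budget goes up
no_trend = np.array([3.0, 1.0, 2.0, 1.0, 3.0])  # no overall upward or downward trend


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


r_positive = pearson_r(budget, rises_with_budget)  # perfectly positive relationship
r_negative = pearson_r(budget, falls_with_budget)  # perfectly negative relationship
r_zero = pearson_r(budget, no_trend)               # no linear relationship

print(round(r_positive, 3), round(r_negative, 3), round(r_zero, 3))
```

Real data almost never lands exactly on +1, -1, or 0. These values were chosen so the three extremes are easy to see.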

What’s critical here is that the question about relationships requires the values in the original variables to *mean* something. In other words, the numbers given to different levels of the variables have to represent a quantity: an amount of something.

With budget and revenue, they do. Each data point for “budget” represents some amount of dollars spent on the film. A higher number represents a larger quantity of dollars. Similarly, each point of “revenue” data represents a quantity of dollars earned from the film. Because the values used in the correlation both represent actual quantities, it’s meaningful to say, “As budget increases, so does revenue.”

The problem with assigning numbers to the companies is that the numbers *don’t* represent values or quantities. They’re just being used as a kind of *name*. If we randomly assigned “Universal Pictures” a 1, “Paramount Pictures” a 2, and “Warner Brothers” a 3, those numbers wouldn’t represent a quantity. They’re just a sort of numerical name tag.

And because they don’t represent a quantity, it’s not meaningful to ask what happens as we go up from 1 to 3. And so, it’s also not meaningful to look at correlation coefficients because correlations are fundamentally asking what happens as a variable increases or decreases.
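One way to see the problem is to run the “numerize” strategy twice with two different, equally arbitrary labelings. The sketch below uses made-up revenue figures and hypothetical codings (`coding_a`, `coding_b`); nothing here comes from the actual video or data set.

```python
import numpy as np

# Made-up revenues (in millions) for two films per company.
companies = ["Universal", "Universal", "Paramount", "Paramount", "Warner", "Warner"]
revenue = np.array([100.0, 120.0, 50.0, 60.0, 200.0, 210.0])

# Two equally arbitrary ways to turn company names into numbers.
coding_a = {"Universal": 1, "Paramount": 2, "Warner": 3}
coding_b = {"Warner": 1, "Universal": 2, "Paramount": 3}


def r_for_coding(coding):
    """Pearson r between the numeric company codes and revenue."""
    codes = np.array([coding[name] for name in companies], dtype=float)
    return float(np.corrcoef(codes, revenue)[0, 1])


r_a = r_for_coding(coding_a)
r_b = r_for_coding(coding_b)

# Same films, same revenues, but the sign of the "relationship" flips.
print(f"coding A: r = {r_a:.2f}")
print(f"coding B: r = {r_b:.2f}")
```

Nothing about the movies changed between the two runs. Only the name tags did, so the coefficient is telling us about our labeling, not about the companies.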

More generally, there are no correlation coefficients to use with categorical variables because they’re not measured on the right scale of measurement.

## What are scales of measurement?

Scales of measurement, or levels of measurement, are classifications that describe the nature of the information within the values assigned to variables. “Measurement” in this context means assigning a numerical value to levels of a variable.

We can measure variables at different scales of measurement. The four scales are:

**Nominal.** Here, the assigned values are essentially just names for different categories. For example, we can assign car brands a random number where “Toyota” = 1, “Honda” = 2, “GMC” = 3, etc.

**Ordinal.** Here, the assigned values represent order. For example, we could __rank car brands__ on how many sales they made in the US. “Ford” would be 1, “Toyota” would be 2, “Chevrolet” would be 3, etc.

**Interval.** Here, the values represent order *and* the distance between the values is the same. I can’t think of a good car example, but time on a 24-hour clock is a good one. So is temperature when it’s measured in degrees Celsius or Fahrenheit.

**Ratio.** Here, the values represent order, the distance between values is the same, and “0” represents a *true* 0, i.e. it represents an absence of the variable you’re measuring. For example, we could measure car brands on how many cars they sold in Q3 of 2022 in the US. Ford would be “441,487”, Toyota would be “411,957”, Chevrolet would be “371,415”, and so on.
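The difference between interval and ratio scales can be hard to picture, so here is a small sketch of what a “true 0” buys you. It reuses the Ford and Toyota sales figures from the ratio example; the two temperatures are made up. The idea is that ratios only stay meaningful when the scale has a true zero.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


# Interval scale: 0 degrees is not "no temperature", so ratios depend on the unit.
cold_c, warm_c = 10.0, 20.0  # made-up temperatures
ratio_celsius = warm_c / cold_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)

# Ratio scale: 0 cars sold really means no sales, so ratios survive a unit change.
ford_sales, toyota_sales = 441_487, 411_957
ratio_in_cars = ford_sales / toyota_sales
ratio_in_thousands = (ford_sales / 1000) / (toyota_sales / 1000)

print(ratio_celsius, round(ratio_fahrenheit, 2))            # the two ratios disagree
print(round(ratio_in_cars, 4), round(ratio_in_thousands, 4))  # the two ratios agree
```

So it isn’t meaningful to call 20 degrees “twice as hot” as 10 degrees, but it is meaningful to say Ford sold about 7% more cars than Toyota.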

You can learn more about scales of measurement __here__.

The important point is that you have to use the right statistical methods for the level of measurement you have.

## The types of correlations and the data that they work for

There are several different types of correlations, some of the most common of which are:

**Pearson product-moment correlation.** This is the most common type of correlation coefficient. It’s appropriate for assessing linear relationships between two variables measured at the __interval or ratio levels__.

**Spearman’s rank-order correlation (or “Spearman’s rho”).** This correlation coefficient is appropriate for looking at the relationship between two variables, both measured at the __ordinal, interval, or ratio__ level.

**Kendall’s rank correlation coefficient (or “Kendall’s Tau”).** This correlation coefficient is appropriate for looking at the relationship between two variables, both measured at the __ordinal, interval, or ratio level__.

**Point-Biserial Correlation.** This correlation coefficient is appropriate for looking at the relationship between two variables when one is measured at the interval or ratio level, and the other is __dichotomous__, i.e. there are only two levels.
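If you work in Python, the first three of these are available in `scipy.stats`. The sketch below assumes SciPy is installed and uses made-up budget and revenue values where revenue always rises with budget, but along a curve rather than a straight line.

```python
import numpy as np
from scipy import stats

# Made-up data: revenue always increases with budget, but not in a straight line.
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
revenue = budget ** 3

# Each function returns the coefficient and a p-value; we only keep the coefficient.
pearson_r, _ = stats.pearsonr(budget, revenue)      # measures linear association
spearman_rho, _ = stats.spearmanr(budget, revenue)  # measures association between ranks
kendall_tau, _ = stats.kendalltau(budget, revenue)  # also rank-based

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall tau  = {kendall_tau:.3f}")
```

Because the ordering of the values is perfectly preserved, the two rank-based coefficients come out at 1, while Pearson’s r falls short of 1 because the relationship isn’t linear.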

## What Type of Correlation is Appropriate for Nominal and Continuous Data?

This last type of correlation, the point-biserial correlation, is the only one that you can use for categorical data.

But you couldn’t just use any categories. You would have to set the categories up to have exactly two levels.

For example, you could look at the extent to which movie revenue was related to location by assigning movies a “0” if they were made in the US and a “1” if they were made somewhere else.

Or, you could look at the relationship between revenue and company, but you would have to do something like Paramount Pictures gets a 0 and Warner Brothers gets a 1 (and the rest you wouldn’t include).
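Here is a rough sketch of the US versus non-US example using `scipy.stats.pointbiserialr`. The revenue numbers are made up, and `made_outside_us` is just a name I’ve chosen for the 0/1 variable.

```python
import numpy as np
from scipy import stats

# 0 = made in the US, 1 = made somewhere else (made-up data for illustration).
made_outside_us = np.array([0, 0, 0, 0, 1, 1, 1, 1])
revenue = np.array([120.0, 95.0, 150.0, 110.0, 60.0, 80.0, 45.0, 70.0])

r_pb, p_pb = stats.pointbiserialr(made_outside_us, revenue)

# The point-biserial coefficient is Pearson's r computed on a 0/1 variable.
r_pearson, p_pearson = stats.pearsonr(made_outside_us, revenue)

# Swapping which group gets 0 and which gets 1 only flips the sign.
r_swapped, _ = stats.pointbiserialr(1 - made_outside_us, revenue)

print(f"point-biserial r = {r_pb:.3f}")
print(f"Pearson r        = {r_pearson:.3f}")
print(f"swapped coding r = {r_swapped:.3f}")
```

With only two levels, the arbitrary choice of which group is “1” changes the sign but not the strength, and the sign is easy to interpret as “which group has the higher mean.” That’s why two levels work and three or more don’t.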

## What can you do instead of a correlation?

That’s obviously not super useful. We probably aren’t interested in looking at a correlation between just Paramount and Warner Brothers. More generally, what kinds of analyses can we do if we’re interested in knowing how the levels of a categorical/nominal variable perform on a continuous variable?

One option is to compare means or averages. If you have two levels of the variable, you can do a __t-test__. But with several different levels, you would probably do some sort of __Analysis of Variance (ANOVA)__.

To do these tests, you would have to change your question a little bit. Instead of asking, “Are country and gross revenue associated or correlated?” you would ask, “Do some countries earn more revenue than others?”

The ANOVA would tell you if there was a significant difference between countries in terms of the average movie revenue that they produce.
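The same approach answers the original question about companies. Below is a minimal sketch of a one-way ANOVA with `scipy.stats.f_oneway`, using made-up revenue figures for three companies. It skips the assumption checks (such as roughly equal variances) that you would want to do on real data.

```python
from scipy import stats

# Made-up revenues (in millions) for four films per company.
revenue_by_company = {
    "Universal": [100.0, 120.0, 110.0, 130.0],
    "Paramount": [50.0, 60.0, 55.0, 65.0],
    "Warner": [200.0, 210.0, 190.0, 205.0],
}

# One-way ANOVA: is there a difference in mean revenue across the companies?
groups = list(revenue_by_company.values())
f_stat, p_value = stats.f_oneway(*groups)

# Listing the companies in a different order gives the same answer,
# because the test never treats the company as a number.
f_reordered, p_reordered = stats.f_oneway(*reversed(groups))

print(f"F = {f_stat:.2f}, p = {p_value:.5f}")
print(f"F (reordered) = {f_reordered:.2f}")
```

Notice that no numbers were ever assigned to the companies. A significant result only says that at least one company’s mean differs; finding out which ones takes follow-up comparisons.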

## The bottom line: Let’s be careful how we apply statistical tests

The main point I’m making here is just that we have to be thoughtful about what tests we use. Just because we *can* do a test with our statistical program or library doesn’t mean we *should*.

In this case, we can’t simply take a categorical variable and assign numbers to it randomly, and then use those numbers in a Pearson Product Moment Correlation (or any other type of correlation). It doesn’t work that way.

Instead, let’s make sure we know the assumptions behind our statistical tests and check that our particular data set meets them.

But if a point-biserial correlation is mathematically the same as Pearson’s r, could you not still run Pearson’s (or Spearman’s, if the data call for a non-parametric measure), as long as the researcher explains their reasoning for running a Pearson in place of a point-biserial correlation?