Load packages

library(tidyverse)
library(MASS)

Motivation

It is well known that the (Pearson) correlation coefficient cannot exceed 1 in absolute value, that is

$- 1 \leq r \leq + 1$

$| r | \leq 1$

However, proving this fact is less straightforward. A classical way of proving the above inequality uses the Cauchy-Schwarz (CS) inequality. From a teacher’s perspective, the CS inequality may not be ideal, because students may lack some of the knowledge necessary for appreciating this proof. For teachers (or anyone else, for that matter), this post provides an alternative way, one that does not demand much more than basic algebra and some knowledge of descriptive statistics (in particular z-scores and correlation). This post builds on this paper.

Definitions

Assume there are two sets of measurements (values), where $x_{i}$ and $y_{i}$ denote the $i$-th measurement, $i = 1, 2, \dots, n$ . The (empirical) correlation $r$ can then be defined as

$r_{x y} = r = \frac{s_{x y}}{s_{x} s_{y}}$

where $s_{x}$ denotes the standard deviation of $x$ (and $s_{y}$ that of $y$), and $s_{x y}$ the covariance of $x$ and $y$ .

Let’s denote $x_{i} - \bar{x}$ by $d x_{i}$ (and define $d y_{i}$ analogously).

Note the analogous definitions of $s_{x}^{2}$ and $s_{x y}$ :

$\begin{aligned} s_{x}^{2} & = \frac{1}{n} \sum d x_{i}^{2} = \frac{1}{n} \sum (d x_{i} \cdot d x_{i}) \\ s_{x y} & = \frac{1}{n} \sum (d x_{i} \cdot d y_{i}) \end{aligned}$

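Hence, for $y = x$ the two definitions coincide:

$\begin{aligned} s_{x x} = \frac{1}{n} \sum (d x_{i} \cdot d x_{i}) = s_{x}^{2} \end{aligned}$

In other words: The variance $s_{x}^{2} = \frac{1}{n} \sum (d x_{i} \cdot d x_{i})$ turns into the covariance if we replace the second $d x_{i}$ by $d y_{i}$ .

The denominator of $r$ is built from the same pieces:

$\begin{aligned} s_{x} s_{y} & = \sqrt{\frac{1}{n} \sum (d x_{i})^{2} \cdot \frac{1}{n} \sum (d y_{i})^{2}} \end{aligned}$

In general, this product differs from $s_{x y} = \frac{1}{n} \sum (d x_{i} \cdot d y_{i})$ ; by the definition of $r$, the two are equal exactly when $r = 1$ .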

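As the z-score is defined as

$z_{x_{i}} = \frac{x_{i} - \bar{x}}{s_{x}} = \frac{d x_{i}}{s_{x}}$

we may also define $r$ as

$r = \frac{s_{x y}}{s_{x} s_{y}} = \frac{1}{n} \sum \left( \frac{d x_{i}}{s_{x}} \cdot \frac{d y_{i}}{s_{y}} \right) = \frac{1}{n} \sum (z_{x_{i}} \cdot z_{y_{i}})$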
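The two definitions of $r$ can be checked numerically. The following is a minimal sketch in base R; the data vectors `x` and `y` are made up for illustration and are not part of the original post.

```r
# Hypothetical example data (an assumption, chosen only for illustration)
x <- c(2, 4, 4, 5, 7, 9)
y <- c(1, 3, 2, 5, 6, 8)
n <- length(x)

# Deviations from the mean
dx <- x - mean(x)
dy <- y - mean(y)

# Empirical (1/n) covariance and standard deviations
s_xy <- sum(dx * dy) / n
s_x  <- sqrt(sum(dx^2) / n)
s_y  <- sqrt(sum(dy^2) / n)

# Definition 1: covariance divided by the product of the standard deviations
r_cov <- s_xy / (s_x * s_y)

# Definition 2: average product of the z-scores
z_x <- dx / s_x
z_y <- dy / s_y
r_z <- sum(z_x * z_y) / n

# cor() is based on n - 1 instead of n, but that factor cancels in the ratio
r_base <- cor(x, y)

all.equal(r_cov, r_z)
all.equal(r_cov, r_base)
```

Both comparisons should return `TRUE`, which suggests that it does not matter for $r$ whether the variance is computed with $n$ or $n - 1$, as long as the same divisor is used throughout.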

Intuition about the magnitude of $r$

Let’s simulate some correlated data.

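# Simulate 100 points whose empirical correlation is exactly 0.7
d <- mvrnorm(n = 100, mu = c(0, 0), Sigma = matrix(c(1, 0.7, 0.7, 1), nrow = 2), empirical = TRUE)

# Name the columns explicitly so that as_tibble() does not have to repair them
colnames(d) <- c("V1", "V2")
d <- as_tibble(d)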

Let’s color the dots according to the sign of the product of $V1$ and $V2$ .

d <- d %>% 
  mutate(dot_sign = ifelse(V1*V2 > 0, "pos", "neg"))

ggplot(d, aes(V1, V2, color = dot_sign)) +
  geom_point() +
  geom_vline(xintercept = mean(d$V1), linetype = "dashed") +
  geom_hline(yintercept = mean(d$V2), linetype = "dashed")

We see four “regions”, two with positive and two with negative sign.

Observe that in the two “positive” regions the product of V1 and V2 is positive, and in the two negative regions, their product is negative.
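The link between these regions and $r$ can be made explicit: since $r$ is the average product of the z-scores, it is the positive products minus the (absolute) negative ones, each divided by $n$. The sketch below is independent of the tibble `d` above; the seed and the names `sim`, `v1`, `v2`, `z`, and `prod_z` are assumptions introduced here for illustration.

```r
library(MASS)  # for mvrnorm()

set.seed(42)  # arbitrary seed (assumption), only for reproducibility
sim <- mvrnorm(n = 100, mu = c(0, 0),
               Sigma = matrix(c(1, 0.7, 0.7, 1), nrow = 2),
               empirical = TRUE)
v1 <- sim[, 1]
v2 <- sim[, 2]

# z-scores based on the 1/n standard deviation
z <- function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))
prod_z <- z(v1) * z(v2)

# Contribution of the "positive" and the "negative" regions, each divided by n
pos_part <- sum(prod_z[prod_z > 0]) / length(prod_z)
neg_part <- sum(prod_z[prod_z < 0]) / length(prod_z)

pos_part + neg_part  # the average product of z-scores, i.e. r
cor(v1, v2)          # same value
```

Because the simulated correlation is positive, the contribution of the two “positive” regions outweighs that of the two “negative” regions.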

Some properties of z-scores and their sums and products

Further note that - as squares cannot be negative - this term must be nonnegative:

$(z_{x_{i}} + z_{y_{i}})^{2}$

for each $i$ .

Similarly,

$\frac{1}{n} \sum (z_{x_{i}} + z_{y_{i}})^{2} \geq 0 \qquad \text{(eq. 3)}$

Read the above inequality as “the average of the squared sums of two z-scores must be nonnegative.”

Now multiply out the binomial part of the last step:

$\begin{aligned} \frac{1}{n} \sum [z_{x}^{2} + 2 z_{x} z_{y} + z_{y}^{2}] & \geq 0 \\ \frac{1}{n} \sum z_{x}^{2} + 2 \frac{1}{n} \sum z_{x} z_{y} + \frac{1}{n} \sum z_{y}^{2} & \geq 0 \end{aligned}$

Note that $\frac{1}{n} \sum z^{2} = 1$ , so

$\begin{aligned} 1 + 2 r_{x y} + 1 & \geq 0 \\ 2 + 2 r_{x y} & \geq 0 \\ r_{x y} & \geq - 1 \end{aligned}$
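The steps of this expansion can be checked numerically. The following base R sketch uses simulated data; the seed, the sample size, and the helper `z()` are assumptions made for illustration only.

```r
set.seed(1)  # arbitrary seed (assumption)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# z-scores based on the 1/n standard deviation
z <- function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))
z_x <- z(x)
z_y <- z(y)

# r as the average product of z-scores
r <- mean(z_x * z_y)

mean(z_x^2)  # 1, up to floating-point error
mean(z_y^2)  # 1, up to floating-point error

lhs <- mean((z_x + z_y)^2)  # left-hand side of eq. 3
rhs <- 2 + 2 * r            # result of multiplying out

all.equal(lhs, rhs)
lhs >= 0
```

Note that the identity only holds if the z-scores are computed with the $\frac{1}{n}$ standard deviation; with R’s `sd()`, which divides by $n - 1$, the mean of the squared z-scores would be $\frac{n - 1}{n}$ instead of 1.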

If we change the plus sign in eq. 3 into a minus sign, we get $r_{x y} \leq 1$ . In sum:

$- 1 \leq r_{x y} \leq + 1$

Hence, we have found that $r$ cannot exceed an absolute value of 1.
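For completeness, the minus-sign case mentioned above can be written out along the same lines as the plus-sign case. This is a sketch that reuses the two facts from the derivation, namely $\frac{1}{n} \sum z^{2} = 1$ and $\frac{1}{n} \sum z_{x} z_{y} = r_{x y}$ :

```latex
\begin{aligned}
\frac{1}{n} \sum (z_{x_{i}} - z_{y_{i}})^{2} & \geq 0 \\
\frac{1}{n} \sum z_{x}^{2} - 2 \frac{1}{n} \sum z_{x} z_{y} + \frac{1}{n} \sum z_{y}^{2} & \geq 0 \\
1 - 2 r_{x y} + 1 & \geq 0 \\
2 - 2 r_{x y} & \geq 0 \\
r_{x y} & \leq 1
\end{aligned}
```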

Simple proof that the correlation coefficient cannot exceed abs(1)
