Prove of a local optimum of k-means (exercise in Witten et al., 2013)

Load packages

library(tidyverse)

The K-Means optimization reduces the variance in each iteration. To illuminate on that Witten et al. in An Introduction to Statistical Learning (2013) present the following entity (p. 388, chap. 10):

\[\frac{1}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p (x_{ij} - x_{i^\prime j})^2 = 2 \sum\limits_{i \in C_k} \sum\limits_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2\]

A proof can be found here; I’ll add some explanations.

Note 1. Note that \(\sum\limits_{i,i^{\prime} \in C_k}(\dots)\) essentially amounts to \(\sum\limits_{i \in C_k}\sum\limits_{i^{\prime} \in C_k}(\dots)\), when the order of summation does not matter.

Define \(TSS\) (total sum of deviation squares) in the following way: \(TSS \equiv \frac{1}{|C_k|} \sum\limits_{i, \in C_k} \sum\limits_{j=1}^p (x_{ij} - x_{i^\prime j})^2\).

Define \(ISS\) (individual sum of deviation squares) in the following way: \(ISS \equiv (x_{ij} - x_{i^\prime j})^2\), for some \(i\).

Note 2. Note that the mean sum of deviation squares amounts to \(MSS \equiv \frac{1}{|C_k|} TSS\). Note further that \(TSS = |C_K| \cdot MSS\).

Add \(-\bar{x}_{kj}+ \bar{x}_{kj}\)

As \(-\bar{x}_{kj}+ \bar{x}_{kj} = 0\) we can safely add this term to the equation.

Expand the binomial formula

\[= \frac{1}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p ((x_{ij} - \bar{x}_{kj}) - (x_{i^\prime j} - \bar{x}_{kj}))^2\]

Expand the sum

We can make use of note 2 in order to get rid of one summation.

\[= \frac{1}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p ((x_{ij} - \bar{x}_{kj})^2 - 2 (x_{ij} - \bar{x}_{kj})(x_{i^\prime j} - \bar{x}_{kj}) + (x_{i^\prime j} - \bar{x}_{kj})^2)\]

Deviation from mean sum to to zero

Note that the third term sums up to zero for uncorrelated variables; this term is analogous to the covariance.

\[= \frac{|C_k|}{|C_k|} \sum\limits_{i \in C_k} \sum\limits_{j=1}^p (x_{ij} - \bar{x}_{kj})^2 + \frac{|C_k|}{|C_k|} \sum\limits_{i^{\prime} \in C_k} \sum\limits_{j=1}^p (x_{i^\prime j} - \bar{x}_{kj})^2 - \frac{2}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p (x_{ij} - \bar{x}_{kj})(x_{i^\prime j} - \bar{x}_{kj}) \\ = 2 \sum\limits_{i \in C_k} \sum\limits_{j=1}^p (x_{ij} - \bar{x}_{kj})^2 + 0 \]