# Prove of a local optimum of k-means (exercise in Witten et al., 2013)

library(tidyverse)

The K-Means optimization reduces the variance in each iteration. To illuminate on that Witten et al. in An Introduction to Statistical Learning (2013) present the following entity (p. 388, chap. 10):

$\frac{1}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p (x_{ij} - x_{i^\prime j})^2 = 2 \sum\limits_{i \in C_k} \sum\limits_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2$

A proof can be found here; I’ll add some explanations.

Note 1. Note that $$\sum\limits_{i,i^{\prime} \in C_k}(\dots)$$ essentially amounts to $$\sum\limits_{i \in C_k}\sum\limits_{i^{\prime} \in C_k}(\dots)$$, when the order of summation does not matter.

Define $$TSS$$ (total sum of deviation squares) in the following way: $$TSS \equiv \frac{1}{|C_k|} \sum\limits_{i, \in C_k} \sum\limits_{j=1}^p (x_{ij} - x_{i^\prime j})^2$$.

Define $$ISS$$ (individual sum of deviation squares) in the following way: $$ISS \equiv (x_{ij} - x_{i^\prime j})^2$$, for some $$i$$.

Note 2. Note that the mean sum of deviation squares amounts to $$MSS \equiv \frac{1}{|C_k|} TSS$$. Note further that $$TSS = |C_K| \cdot MSS$$.

## Add $$-\bar{x}_{kj}+ \bar{x}_{kj}$$

As $$-\bar{x}_{kj}+ \bar{x}_{kj} = 0$$ we can safely add this term to the equation.

## Expand the binomial formula

$= \frac{1}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p ((x_{ij} - \bar{x}_{kj}) - (x_{i^\prime j} - \bar{x}_{kj}))^2$

## Expand the sum

We can make use of note 2 in order to get rid of one summation.

$= \frac{1}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p ((x_{ij} - \bar{x}_{kj})^2 - 2 (x_{ij} - \bar{x}_{kj})(x_{i^\prime j} - \bar{x}_{kj}) + (x_{i^\prime j} - \bar{x}_{kj})^2)$

## Deviation from mean sum to to zero

Note that the third term sums up to zero for uncorrelated variables; this term is analogous to the covariance.

$= \frac{|C_k|}{|C_k|} \sum\limits_{i \in C_k} \sum\limits_{j=1}^p (x_{ij} - \bar{x}_{kj})^2 + \frac{|C_k|}{|C_k|} \sum\limits_{i^{\prime} \in C_k} \sum\limits_{j=1}^p (x_{i^\prime j} - \bar{x}_{kj})^2 - \frac{2}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p (x_{ij} - \bar{x}_{kj})(x_{i^\prime j} - \bar{x}_{kj}) \\ = 2 \sum\limits_{i \in C_k} \sum\limits_{j=1}^p (x_{ij} - \bar{x}_{kj})^2 + 0$