Load packages

library(tidyverse)
library(mosaic)

What’s a Monte Carlo simulation?

A Monte Carlo Simulation is a numeric approach to solving difficult problems. Instead of having an analytic way of solving the problem, one just says “ok, let’s try it out and see what happens”.

Coin flip distribution

Simalatin a single coin flip (Bernoulli) distribution can be achieved like this:

rflip()
#> 
#> Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
#> 
#> T
#> 
#> Number of Heads: 0 [Proportion Heads: 0]

Normal distribution

rnorm(n = 10, mean = 0, sd = 1)
#>  [1] -1.0042797 -0.2713608  0.1446305  0.2200710  1.5630991  0.9739326
#>  [7] -2.9633715  2.0108904 -1.2170970 -1.4196704

And so on

For example, the log-norm distribution

rlnorm(n = 10)
#>  [1] 1.2862901 2.1063759 1.1456381 1.3393608 0.2507752 0.6209446 0.2133026
#>  [8] 0.4866470 0.3102503 0.8412159

Sampling distribution

Let’s draw 1000 samples from a (standard) Normal of size $n = 30$ .

samples1 <- do(1000) * rnorm(30)
samples1 %>% 
  head()
#>           V1           V2         V3         V4          V5          V6
#> 1  1.2863219 -0.274807049  1.0814329 -0.8267866 -1.50808569  0.13985435
#> 2 -0.8653548  1.140980377 -1.8949788 -1.0499773 -1.90784044  0.07065417
#> 3 -0.2310586  0.253997058  0.7136598 -0.1655298  0.08595934 -3.08765525
#> 4  0.6528163 -1.249222885 -0.9352459  0.6503552 -0.74092661 -0.94769574
#> 5 -0.2707257  0.001270702 -0.1759433 -0.5839333  0.30712417 -1.73348404
#> 6 -1.0663559 -0.503252790 -1.0056853  1.0801417  0.76640245 -1.34019155
#>           V7         V8         V9        V10        V11        V12         V13
#> 1  0.1028167 -1.2830022 -0.4498292  1.6148086 -0.6883792  1.6613820  0.06266613
#> 2 -1.7558073  0.1281623  0.6946798  0.9496056  0.6863822 -2.5996756  0.31201328
#> 3 -1.8591652  0.3190632  0.9354181 -0.7850001  2.0188404 -0.4131814 -1.02457662
#> 4 -0.2410525 -0.2379943  0.7073258 -1.8916074 -0.6208275 -0.7190065 -1.46339937
#> 5 -2.7324921  0.7023635  0.1180570  0.8523584  1.4635444  0.2715944 -1.75585981
#> 6  0.9910696  2.1352831 -0.7963178  0.7092296 -0.4589264  0.3246102 -0.46823924
#>          V14         V15        V16        V17        V18        V19
#> 1  1.1267859  0.10243489 -1.3773886 -0.2814393 -0.1218417 -0.7257556
#> 2 -1.4729182  0.57464962  1.1551708  0.4054580  1.9600981  0.1663946
#> 3  1.1332261  0.16570618 -1.0176817  0.6032048  1.2973820 -1.3559251
#> 4  0.2896663 -0.01344070 -0.3911377  0.1381243 -1.1510956  0.9459299
#> 5  0.5196629 -0.31279824  1.3103833  0.6014260 -0.7721067 -0.2173477
#> 6 -1.4481882 -0.05837044 -1.5427164 -0.2887681  2.0625063 -0.4895346
#>           V20        V21         V22        V23        V24          V25
#> 1 -0.17175860  2.5391486 -0.33733035 -0.6260165  0.8665592 -0.009655187
#> 2  2.09774909  0.6320319 -1.25315758  0.6768155 -1.0781442 -2.848880452
#> 3  1.47956152 -0.4123445 -0.41993393  0.7708856 -1.0675232  0.977194197
#> 4  0.54044542 -2.1823161 -0.05885838 -0.9025076 -0.1109677 -0.452467283
#> 5 -0.08705375 -1.7143591  0.57834199 -0.4412453 -0.9552024  1.264503219
#> 6  0.18169065 -1.5062575 -0.17430273 -0.5276682 -0.2932246  1.017935755
#>          V26         V27         V28       V29        V30
#> 1  0.9337085  1.36870646 -0.05470128 1.3706746  1.6609479
#> 2 -0.0705131 -0.04933185 -1.18124717 0.9545253  0.4226119
#> 3  0.1970384  0.26549223  0.19628074 0.2728573  0.4841887
#> 4  0.2354287  0.38256938  0.78410706 0.9474460 -0.9227730
#> 5  0.1292691  0.29589709 -1.15487477 0.4467893  0.2350930
#> 6  1.1253956 -1.59282980 -1.03756564 0.8328620  0.0220449

It’s probably sufficient to store only the means:

samples2 <- do(1000) * (rnorm(30) %>% mean())
samples2 %>% head()
#>         result
#> 1  0.059098945
#> 2  0.001258059
#> 3 -0.089272458
#> 4  0.153151420
#> 5 -0.130776400
#> 6 -0.132753366

And plot it:

gf_histogram( ~ result, data = samples2)

Practical example

Here’s a dataset with the height and sex (and shoe size) of some students.

data(wo_men, package = "pradadata")

According to this source, the average height among German men is about 1.75m. Unfortunately, there is no SD given, so we estimate from the SD from some other countries, at 0.07m (7cm).

Now, if our sample were drawn from this “typial German” distribution, how likely would it be to observe our empirical value?

wo_men %>% 
  filter(sex == "man") %>% 
  drop_na(height) %>% 
  summarise(height_avg = mean(height),
            n_men = n())
#> # A tibble: 1 x 2
#>   height_avg n_men
#>        <dbl> <int>
#> 1       183.    18

It may appear silly to draw samples in this case where the distribution is so simple and well-known. That’s true. However, in more complex scenarios – such as multivariate Normals – it’s a different story. What’s more, in some cases, we will not know the shape of the distribution, but we may well draw samples form it.

samples3 <- do(1000) * mean(rnorm(n = 18, mean = 175.4, sd = 7))

gf_histogram(~ mean, data = samples3) %>% 
  gf_vline(xintercept = 183.11)

As can be seen, our value of 183cm occurs infrequently, hence we are in a position to reject the null.

Simulating values according to some distribution