Load packages
library(tidyverse)
library(mosaic)
What’s a Monte Carlo simulation?
A Monte Carlo Simulation is a numeric approach to solving difficult problems. Instead of having an analytic way of solving the problem, one just says “ok, let’s try it out and see what happens”.
Coin flip distribution
Simalatin a single coin flip (Bernoulli) distribution can be achieved like this:
rflip()
#>
#> Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
#>
#> T
#>
#> Number of Heads: 0 [Proportion Heads: 0]
Normal distribution
rnorm(n = 10, mean = 0, sd = 1)
#> [1] -1.0042797 -0.2713608 0.1446305 0.2200710 1.5630991 0.9739326
#> [7] -2.9633715 2.0108904 -1.2170970 -1.4196704
And so on
For example, the log-norm distribution
rlnorm(n = 10)
#> [1] 1.2862901 2.1063759 1.1456381 1.3393608 0.2507752 0.6209446 0.2133026
#> [8] 0.4866470 0.3102503 0.8412159
Sampling distribution
Let’s draw 1000 samples from a (standard) Normal of size \(n=30\).
samples1 <- do(1000) * rnorm(30)
samples1 %>%
head()
#> V1 V2 V3 V4 V5 V6
#> 1 1.2863219 -0.274807049 1.0814329 -0.8267866 -1.50808569 0.13985435
#> 2 -0.8653548 1.140980377 -1.8949788 -1.0499773 -1.90784044 0.07065417
#> 3 -0.2310586 0.253997058 0.7136598 -0.1655298 0.08595934 -3.08765525
#> 4 0.6528163 -1.249222885 -0.9352459 0.6503552 -0.74092661 -0.94769574
#> 5 -0.2707257 0.001270702 -0.1759433 -0.5839333 0.30712417 -1.73348404
#> 6 -1.0663559 -0.503252790 -1.0056853 1.0801417 0.76640245 -1.34019155
#> V7 V8 V9 V10 V11 V12 V13
#> 1 0.1028167 -1.2830022 -0.4498292 1.6148086 -0.6883792 1.6613820 0.06266613
#> 2 -1.7558073 0.1281623 0.6946798 0.9496056 0.6863822 -2.5996756 0.31201328
#> 3 -1.8591652 0.3190632 0.9354181 -0.7850001 2.0188404 -0.4131814 -1.02457662
#> 4 -0.2410525 -0.2379943 0.7073258 -1.8916074 -0.6208275 -0.7190065 -1.46339937
#> 5 -2.7324921 0.7023635 0.1180570 0.8523584 1.4635444 0.2715944 -1.75585981
#> 6 0.9910696 2.1352831 -0.7963178 0.7092296 -0.4589264 0.3246102 -0.46823924
#> V14 V15 V16 V17 V18 V19
#> 1 1.1267859 0.10243489 -1.3773886 -0.2814393 -0.1218417 -0.7257556
#> 2 -1.4729182 0.57464962 1.1551708 0.4054580 1.9600981 0.1663946
#> 3 1.1332261 0.16570618 -1.0176817 0.6032048 1.2973820 -1.3559251
#> 4 0.2896663 -0.01344070 -0.3911377 0.1381243 -1.1510956 0.9459299
#> 5 0.5196629 -0.31279824 1.3103833 0.6014260 -0.7721067 -0.2173477
#> 6 -1.4481882 -0.05837044 -1.5427164 -0.2887681 2.0625063 -0.4895346
#> V20 V21 V22 V23 V24 V25
#> 1 -0.17175860 2.5391486 -0.33733035 -0.6260165 0.8665592 -0.009655187
#> 2 2.09774909 0.6320319 -1.25315758 0.6768155 -1.0781442 -2.848880452
#> 3 1.47956152 -0.4123445 -0.41993393 0.7708856 -1.0675232 0.977194197
#> 4 0.54044542 -2.1823161 -0.05885838 -0.9025076 -0.1109677 -0.452467283
#> 5 -0.08705375 -1.7143591 0.57834199 -0.4412453 -0.9552024 1.264503219
#> 6 0.18169065 -1.5062575 -0.17430273 -0.5276682 -0.2932246 1.017935755
#> V26 V27 V28 V29 V30
#> 1 0.9337085 1.36870646 -0.05470128 1.3706746 1.6609479
#> 2 -0.0705131 -0.04933185 -1.18124717 0.9545253 0.4226119
#> 3 0.1970384 0.26549223 0.19628074 0.2728573 0.4841887
#> 4 0.2354287 0.38256938 0.78410706 0.9474460 -0.9227730
#> 5 0.1292691 0.29589709 -1.15487477 0.4467893 0.2350930
#> 6 1.1253956 -1.59282980 -1.03756564 0.8328620 0.0220449
It’s probably sufficient to store only the means:
samples2 <- do(1000) * (rnorm(30) %>% mean())
samples2 %>% head()
#> result
#> 1 0.059098945
#> 2 0.001258059
#> 3 -0.089272458
#> 4 0.153151420
#> 5 -0.130776400
#> 6 -0.132753366
And plot it:
gf_histogram( ~ result, data = samples2)
Practical example
Here’s a dataset with the height and sex (and shoe size) of some students.
data(wo_men, package = "pradadata")
According to this source, the average height among German men is about 1.75m. Unfortunately, there is no SD given, so we estimate from the SD from some other countries, at 0.07m (7cm).
Now, if our sample were drawn from this “typial German” distribution, how likely would it be to observe our empirical value?
wo_men %>%
filter(sex == "man") %>%
drop_na(height) %>%
summarise(height_avg = mean(height),
n_men = n())
#> # A tibble: 1 x 2
#> height_avg n_men
#> <dbl> <int>
#> 1 183. 18
It may appear silly to draw samples in this case where the distribution is so simple and well-known. That’s true. However, in more complex scenarios – such as multivariate Normals – it’s a different story. What’s more, in some cases, we will not know the shape of the distribution, but we may well draw samples form it.
samples3 <- do(1000) * mean(rnorm(n = 18, mean = 175.4, sd = 7))
gf_histogram(~ mean, data = samples3) %>%
gf_vline(xintercept = 183.11)
As can be seen, our value of 183cm occurs infrequently, hence we are in a position to reject the null.