```
library(ggplot2)
<- data.frame(
df x = c(0, 1),
density = 1
)
<- ggplot(df, aes(x = x, y = density)) +
dens0_1 geom_line(colour = '#67a9cf') +
annotate("text", x = 0, y = 1, label = "density(0, 1)",
colour = '#67a9cf', hjust = 0, vjust = 1.2) +
scale_x_continuous(breaks = seq(0, 1, 0.1)) +
theme_classic()
dens0_1
```

# Recreating density plots ‘by hand’

Where I attempt to understand density plots and get stuck on Fast Fourier Transforms.

Often, in my visualisations, I use density plots in one way or another, for example through violin plots. I roughly understand what they mean, but I want to be able to calculate a density plot by hand. Just the once, so I will never have to do it again, and know that *in principle* I will be able to do it. Learning about them also helps in understanding the distinctions between `dunif`

, `punif`

, `qunif`

, and `runif`

(or `dnorm`

, `pnorm`

, `qnorm`

, and `rnorm`

or `dexp`

, `pexp`

, `qexp`

, and `rexp`

et cetera)

## Understanding the uniform (density) distribution

### density(min = 0, max = 1)

When generating generating random numbers we often do so from a uniform distribution between 0 and 1. Any number between 0 and 1 occurs equally frequently. As we learn from `?qunif`

, the density function of the uniform distribution is described by:

\[f(x) = \frac{1}{max(x) - min(x)}\]

Thus, the density function in this case, for all x’s is \(f(\text{for all x}) = \frac{1}{1 - 0} = \frac{1}{1} = 1\) (I am pretty sure there exists a mathematical symbol for “for all x”, but I do not know it). Let’s draw the function:

By definition, for any density curve, the sum under the area of the “curve” adds up to 1. Let’s check: the total area under the curve is the width, which equals 1 (from 0 to 1) times the height, which also equals 1 (for all x); thus, 1 x 1 is 1. Score!

`R`

could have told us this through:

`punif(1, min = 0, max = 1)`

`[1] 1`

Now, something *slightly* more complex. What is the chance of having a number between 0.7 and 0.8?

```
<- dens0_1 +
dens0_1 geom_area(aes(x = c(0.7, 0.8), y = c(1, 1)), fill = '#67a9cf', alpha = 0.9)
+
dens0_1 geom_segment(aes(x = 0.7, xend = 0.8, y = 0.02, yend = 0.02),
arrow = arrow(ends = "both", type = "closed",
length = unit(0.2, "cm")), colour = "grey") +
geom_segment(aes(x = 0.82, xend = 0.82, y = 0, yend = 1),
arrow = arrow(ends = "both", type = "closed",
length = unit(0.2, "cm")), colour = "grey") +
annotate("text", x = 0.75, y = 0, label = "0.1", vjust = 1) +
annotate("text", x = 0.82, y = 0.5, label = "1", hjust = 0)
```

The area under the curve is 0.1 * 1 = 0.1. Through `R`

, this could have been calculated via:

`punif(0.8, min = 0, max = 1) - punif(0.7, min = 0, max = 1)`

`[1] 0.1`

### density(min = 1, max = 5)

Ok, so what if we had used a density function from 1 to 5? Thus, the density function in this case, for all x’s is \(f(\text{for all x}) = \frac{1}{5 - 1} = \frac{1}{4} = 0.25\). The area under the curve should be 1: the area under the curve from density(min = 1, max = 5) is the width (from 1 to 5 is a width of 4) times height (0.25 for all); thus, 4 x 0.25 = 1. Check.

Let’s draw this function in the same graph, and let’s also draw the area under the curve between x = 3 and x = 4:

```
<- data.frame(
df2 x = c(1, 5),
density = c(0.25, 0.25)
)
+
dens0_1 geom_line(data = df2, colour = '#ef8a62') +
annotate("text", x = 1, y = 0.25, label = "density(1, 5)",
colour = '#ef8a62', hjust = 0, vjust = 1.2) +
geom_area(aes(x = c(3, 4), y = c(0.25, 0.25)), fill = '#ef8a62', alpha = 0.9) +
geom_segment(aes(x = 3, xend = 4, y = 0.02, yend = 0.02),
arrow = arrow(ends = "both", type = "closed",
length = unit(0.2, "cm")), colour = "grey") +
geom_segment(aes(x = 4.1, xend = 4.1, y = 0, yend = 0.25),
arrow = arrow(ends = "both", type = "closed",
length = unit(0.2, "cm")), colour = "grey") +
annotate("text", x = 3.5, y = 0, label = "1", vjust = 1) +
annotate("text", x = 4.1, y = 0.125, label = "0.25", hjust = 0) +
scale_x_continuous(breaks = seq(0, 5, 1)) +
scale_y_continuous(breaks = seq(0, 1, 0.25),
limits = c(-0.015, 1))
```

The area under the curve from density(min = 1, max = 5), and the relative frequency of having a number between 3 and 4, is the width (from 3 to 4 is a width of 1) times height (0.25 for all); thus, 1 x 0.25 = 0.25. 25%. Makes sense.

`R`

would have gotten this for us through:

`punif(4, min = 1, max = 5) - punif(3, min = 1, max = 5)`

`[1] 0.25`

### dunif vs punif vs qunif vs runif

For the uniform function, we calculated the density ourselves which is identical for each “x”. We could have let `R`

do this for us using `dunif`

:

For the uniform(min = 0, max = 1) distribution:

```
<- seq(0, 1, 0.1) # for each step of 0.1 between values 0 and 1
x # calculate density
dunif(x, min = 0, max = 1)
```

` [1] 1 1 1 1 1 1 1 1 1 1 1`

For the uniform(min = 1, max = 5) distribution:

```
<- seq(0, 5, 1) # for each step of 1 between values 0 and 5
x # calculate density
dunif(x, min = 1, max = 5)
```

`[1] 0.00 0.25 0.25 0.25 0.25 0.25`

0 for x = 0, as it should!

We’ve already seen `punif`

in action, but let’s see it again. Let’s find out the probability that a number is between 0 and 0.35 for the uniform(min = 0, max = 1) distribution:

`punif(q = 0.35, min = 0, max = 1)`

`[1] 0.35`

And let’s find out the probability that a number is between 2.5 and 4.2 for the uniform(min = 1, max = 5) distribution:

```
punif(q = 4.2, min = 1, max = 5) - # probability for numbers between 1 and 4.2
punif(q = 3.5, min = 1, max = 5) # probability for numbers between 1 and 2.5
```

`[1] 0.175`

The `qunif`

function goes from a probability back to the “x”. For example, which “x” corresponds to a probability of 0.5. Thus, which area under the curve, starting from the minimum of x to the x we are looking for equals 0.5.

For the unif(min = 0, max = 1) distribution:

`qunif(p = 0.5, min = 0, max = 1)`

`[1] 0.5`

For the unif(0.5, min = 1, max = 5) distribution:

`qunif(p = 0.5, min = 1, max = 5)`

`[1] 3`

Or let’s say we are interested in the top 1% of x’s from the `unif(min = 1, max = 5)`

distribution:

`qunif(p = 0.99, min = 1, max = 5)`

`[1] 4.96`

4.96 and higher! This *might* make more sense when we talk about the normal distribution (`qnorm`

).

`runif`

lets you draw random numbers from said distribution. Let’s draw 11 numbers from the `unif(min = 1, max = 5)`

distribution:

`runif(n = 11, min = 1, max = 5)`

```
[1] 2.201886 3.594314 1.285670 1.266898 1.843133 3.455094 1.605378 3.898504
[9] 4.929507 3.570989 3.762690
```

Alright, easy enough. Let’s move to the (standard) normal distribution, because that is the one that is used for violinplots and the like.

## Understanding the normal (density) distribution

The normal distribution is defined by the mean (\(\mu\)) and the standard deviation (\(\sigma\)), and its density distribution (better yet: probability density function) is defined by:

\[f(x) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{{{ -( {x - \mu })^2} / 2\sigma^2} }\] With the standard normal distribution, \(\mu = 0\) and \(\sigma = 1\), in which case the above function reduces to:

\[f(x) = \frac{1}{{\sqrt {2\pi } }}e^{{{ -\frac{1}{2}x} } }\] Let’s turn the probability density function for a normal distribution into an `R`

-function:

```
<- function(x, mean = 0, sd = 1) {
pdf_norm 1 / (sd * sqrt( 2 * pi ))) * exp( (-1 * (x - mean)^2) / (2 * sd^2) )
( }
```

Let’s see if it works:

`pdf_norm(0, mean = 0, sd = 1) # obviously mean = 0, sd = 1 not necessary`

`[1] 0.3989423`

Let’s draw it:

```
library(tibble)
# I use tibble here so that y can be calculated immediately
<- tibble(x = seq(-5, 5, 0.01), density = pdf_norm(x))
df_norm
<- ggplot(df_norm, aes(x = x, y = density)) +
dens_norm geom_path(colour = '#67a9cf', size = 3) +
theme_classic()
```

```
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
```

` dens_norm`

`# c(, '#ef8a62')`

We’ve created our own probability density function for the normal distribution in `pdf_norm`

, but we did not have to because this is exactly what is stored in the function `dnorm`

. Let’s see:

```
# Create y-values on the base of dnorm
<- tibble(x = seq(-5, 5, 0.01),
df_norm2 density = dnorm(x, mean = 0, sd = 1))
+
dens_norm geom_path(data = df_norm2,
aes(y = density), colour = '#ef8a62', linetype = "dashed", size = 3)
```

Now, through `dnorm`

, we can calculate the highest density value which lies at x = 0:

`dnorm(x = 0, mean = 0, sd = 1)`

`[1] 0.3989423`

Let’s visualise:

```
# We're using the results from dnorm here (y2)
<- ggplot(df_norm2, aes(x = x, y = density)) +
dens_norm2 geom_path(colour = '#ef8a62', size = 2) +
scale_x_continuous(breaks = seq(-5, 5, 1)) +
theme_classic()
+
dens_norm2 annotate("point", x = 0, y = dnorm(x = 0, mean = 0, sd = 1),
colour = '#67a9cf') +
annotate("text", x = 0, y = dnorm(x = 0, mean = 0, sd = 1),
label = dnorm(x = 0, mean = 0, sd = 1), vjust = 0, colour = '#67a9cf')
```

Now, we can look for probabilities. This time, probabilities under the curve are not as easily calculated by hand, but we’ve seen above that we could get these probabilities through `punif`

for the uniform distribution, and in the same way we can use `pnorm`

to find probabilities for the normal distribution! Let’s try finding the probability of finding an x equal to or below -1:

`pnorm(q = -1, mean = 0, sd = 1)`

`[1] 0.1586553`

```
+
dens_norm2 geom_area(data = subset(df_norm2, x <= -1),
fill = '#ef8a62', alpha = 0.7) +
annotate("text", x = -1, y = 0.05,
label = round(pnorm(q = -1, mean = 0, sd = 1), 4),
hjust = 1, colour = "white")
```

Or between 0 and 2:

```
<- pnorm(q = 2, mean = 0, sd = 1) - pnorm(q = 0, mean = 0, sd = 1)
p_0_2 p_0_2
```

`[1] 0.4772499`

```
+
dens_norm2 geom_area(data = subset(df_norm, x >= 0 & x <= 2),
fill = '#ef8a62', alpha = 0.7) +
annotate("text", x = 1, y = 0.05,
label = round(p_0_2, 4), colour = "white")
```

## Density plots

We’ve did the ground work for the density plots. Now we try to understand how we get something like this:

```
<- data.frame(x = c(0, 0.3, 1))
df_dens ggplot(df_dens, aes(x = x), colour = '#67a9cf', size = 2) +
geom_density() +
theme_classic()
```

It’s important to realise that under hood something like this happens:

```
ggplot(df_dens, aes(x = x)) +
geom_density(
kernel = "gaussian",
bw = "nrd0",
colour = '#67a9cf',
size = 2
+
) theme_classic()
```

These arguments (`kernel = "gaussian"`

and `bw = "nrd0"`

) are passed to the function `density()`

:

`density(x = df_dens$x, kernel = "gaussian", bw = "nrd0")`

```
Call:
density.default(x = df_dens$x, bw = "nrd0", kernel = "gaussian")
Data: df_dens$x (3 obs.); Bandwidth 'bw' = 0.2696
x y
Min. :-0.8087 Min. :0.005535
1st Qu.:-0.1544 1st Qu.:0.103430
Median : 0.5000 Median :0.447735
Mean : 0.5000 Mean :0.381337
3rd Qu.: 1.1544 3rd Qu.:0.545436
Max. : 1.8087 Max. :0.848818
```

`# same as density(x = df_dens$x)`

Just to show that `density`

leads to the same graph as `geom_density`

:

```
<- density(x = df_dens$x)
dens_calc <- data.frame(x = dens_calc$x, density = dens_calc$y)
df_dens2
ggplot(df_dens, aes(x = x)) +
geom_density(colour = '#67a9cf', size = 2) +
geom_density(data = df_dens2, stat = "identity",
aes(x = x, y = density),
colour = '#ef8a62', size = 2, linetype = "dashed") +
theme_classic()
```

To arrive at these results, it is important to understand the `kernel = "gaussian"`

and `bw = "nrd0"`

arguments.

## Kernel density distributions

The density estimation happens through something called a a kernel density distribution. This is defined by:

\[ \hat{f}_{h}(x)=\frac{1}{nh}\sum_{i=1}^nK(\frac{x-x_i}{h}) \]

In the above function, `K`

, is the kernel, some non-negative density distribution function, like the ones we have defined above. `h`

is the bandwidth, a positive number that effectively defines the smoothness of the graph. The argument `kernel =`

refers to `K`

and the argument `bw =`

refers to the bandwidth `h`

. As you might guess, the `kernel = gaussian"`

says that we are using the gaussian or normal density distribution as the kernal `K`

!

At this point, I was still a bit lost, but following this great blogpost I understood more. Let’s first assume a bandwidth of 1, and a sample of three numbers from a population {5, 7, 8}.

To build a kernel density estimation (using the normal density distribution) for a sample, we need to perform two steps:

For each element in the sample, draw a normal distribution where the sample element is the mean, and the variance is defined by the square of the bandwidth, \(h^2\).

Sum up all the normal distributions from step 1 and divide the sum by

`nh`

, which is equal to`n`

in this case, because we defined the bandwidth as 1.

So, below we’ll plot three normal distributions: \(\mathcal{N}(5, 1)\), \(\mathcal{N}(7, 1)\), and \(\mathcal{N}(8, 1)\), plus its sum divided by three (`nh`

) to arrive at the overall density distribution.

```
<- tibble(
df x = seq(0, 15, 0.01),
norm_5 = dnorm(x, mean = 5, sd = 1),
norm_7 = dnorm(x, mean = 7, sd = 1),
norm_8 = dnorm(x, mean = 8, sd = 1),
norm_sum = (norm_5 + norm_7 + norm_8) / 3
)
<- ggplot(df, aes(x = x)) +
kernel geom_line(aes(y = norm_5), colour = '#ef8a62', size = 2) +
geom_line(aes(y = norm_7), colour = "#a83299", size = 2) +
geom_line(aes(y = norm_8), colour = '#67a9cf', size = 2) +
geom_line(aes(y = norm_sum), colour = "darkgrey", size = 2) +
labs(y = "density") +
theme_classic()
kernel
```

How does this one stack up to the `geom_density’-distribution?

```
+
kernel geom_density(
data = data.frame(x = c(5, 7, 8)),
bw = 1, # remember we used a bandwidth of 1!
colour = '#f7f7f7', linetype = "dashed", size = 2
)
```

SUCCESS!

However, normally we would have `geom_density`

calculate the bandwidth (`bw`

) for us. Let’s see

```
ggplot(data = data.frame(x = c(5, 7, 8)), aes(x = x)) +
geom_density(bw = 1, colour = "darkgrey", size = 2) +
geom_density(size = 2) + # currently unknown bandwidth
theme_classic() +
scale_x_continuous(limits = c(0, 15))
```

Similar but not the same, and that’s because a different binwidth is used. Which binwidth? `geom_density`

uses `density`

. So let’s see what density gives us:

`density(x = c(5, 7, 8))`

```
Call:
density.default(x = c(5, 7, 8))
Data: c(5, 7, 8) (3 obs.); Bandwidth 'bw' = 0.8087
x y
Min. : 2.574 Min. :0.001845
1st Qu.: 4.537 1st Qu.:0.033991
Median : 6.500 Median :0.151694
Mean : 6.500 Mean :0.127113
3rd Qu.: 8.463 3rd Qu.:0.185661
Max. :10.426 Max. :0.273176
```

So a bandwidth equal to 0.8087322. Let’s validate:

```
ggplot(data = data.frame(x = c(5, 7, 8)), aes(x = x)) +
geom_density(bw = density(x = c(5, 7, 8))$bw,
colour = "darkgrey", size = 2) +
geom_density(size = 2, linetype = "dashed") + # currently unknown bandwidth
theme_classic() +
scale_x_continuous(limits = c(0, 15))
```

Success, yet again.

How did `geom_density`

, or better yet `density`

arrive at 0.8087322? This is where the other argument `bw = "nrd0"`

comes in. When using `density(kernel = "gaussian")`

, the default way to estimate the bandwidth is through the argument `bw = "nrd0"`

. The help function further tells us that:

“bw.nrd0 implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It defaults to 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power (= Silverman’s ‘rule of thumb’, Silverman (1986, page 48, eqn (3.31))) unless the quartiles coincide when a positive result will be guaranteed.”

Not exactly beautiful prose. Anyway, Silverman’s rule of thumb is:

\[b = 0.9 \cdot min(\hat{\sigma}, \frac{IQR}{1.35})n^{-\frac{1}{5}}\] Let’s see if we can calculate it ourselves.

```
<- sd(c(5, 7, 8))
sd <- quantile(c(5, 7, 8), 0.75)
Q3 <- quantile(c(5, 7, 8), 0.25)
Q1 <- Q3 - Q1
IQR <- IQR / 1.34
IQR_div <- (0.9 * IQR_div) * 3^(-1/5) h_calc
```

The (sample) standard deviation of 5, 7, and 8 is 1.5275252.

The IQR is Q3 - Q1 = 7.5 - 6 = 1.5. [in R there are nine different ways to produce quantiles via the `quantile`

function. Sigh.]. The IQR divided by 1.34 is 1.119403. Thus, the IQR/1.34 is smaller than the standard deviation. Thus,

\(b = 0.9 \cdot 1.119403 \cdot n^{-\frac{1}{5}}\) = 0.8087322.

[ Let us completely ignore that using the Silverman’s rule of thumb is ill-advised ]

### Are we there yet?

NO. We’ve calculated the overall density distributions through the density function of the normal distribution, and we seem to be getting the exact same results. *However*, under the hood, the `density()`

function is using Fast Fourier Transform [signal processing course nightmares]. This great video (which I understand let’s say 73%) explains that a Fast Fourier Transform of a normal distribution is a normal distribution. So I guess it is not weird when we get the same (similar?) results when using the normal density distributions to calculate use as a Kernel. I have reached the limits of my cognitive abilities, so I will not explore this further currently.

I would have not been able to write this post that nobody reads without this blog.