Do not over-think about ‘outliers’, use a student-t distribution instead | by Daniel Manrique-Castano

A Student’s t-distribution is nothing more than a Gaussian distribution with heavier tails. In other words, we can say that the Gaussian distribution is a special case of the Student’s t-distribution. The Gaussian distribution is defined by the mean (μ) and the standard deviation (σ). The Student t distribution, on the other hand, adds an additional parameter, the degrees of freedom (df), which controls the “thickness” of the distribution. This parameter assigns greater probability to events further from the mean. This feature is particularly useful for small sample sizes, such as in biomedicine, where the assumption of normality is questionable. Note that as the degrees of freedom increase, the Student t-distribution approaches the Gaussian distribution. We can visualize this using density plots:

# Load necessary libraries
library(ggplot2)

# Set seed for reproducibility
set.seed(123)
# Define the distributions
x <- seq(-4, 4, length.out = 200)
y_gaussian <- dnorm(x)
y_t3 <- dt(x, df = 3)
y_t10 <- dt(x, df = 10)
y_t30 <- dt(x, df = 30)
# Create a data frame for plotting
df <- data.frame(x, y_gaussian, y_t3, y_t10, y_t30)
# Plot the distributions
ggplot(df, aes(x)) +
geom_line(aes(y = y_gaussian, color = "Gaussian")) +
geom_line(aes(y = y_t3, color = "t, df=3")) +
geom_line(aes(y = y_t10, color = "t, df=10")) +
geom_line(aes(y = y_t30, color = "t, df=30")) +
labs(title = "Comparison of Gaussian and Student t-Distributions",
x = "Value",
y = "Density") +
scale_color_manual(values = c("Gaussian" = "blue", "t, df=3" = "red", "t, df=10" = "green", "t, df=30" = "purple")) +
theme_classic()

Figure 1: Comparison of Gaussian and Student t-Distributions with different degrees of freedom.

Note in Figure 1 that the hill around the mean gets smaller as the degrees of freedom decrease as a result of the probability mass going to the tails, which are thicker. This property is what gives the Student’s t-distribution a reduced sensitivity to outliers. For more details on this matter, you can check this blog.

We load the required libraries:

library(ggplot2)
library(brms)
library(ggdist)
library(easystats)
library(dplyr)
library(tibble)
library(ghibli)

So, let’s skip data simulations and get serious. We’ll work with real data I have acquired from mice performing the rotarod test.

First, we load the dataset into our environment and set the corresponding factor levels. The dataset contains IDs for the animals, a groping variable (Genotype), an indicator for two different days on which the test was performed (day), and different trials for the same day. For this article, we model only one of the trials (Trial3). We will save the other trials for a future article on modeling variation.

As the data handling implies, our modeling strategy will be based on Genotype and Day as categorical predictors of the distribution of Trial3.

In biomedical science, categorical predictors, or grouping factors, are more common than continuous predictors. Scientists in this field like to divide their samples into groups or conditions and apply different treatments.

data <- read.csv("Data/Rotarod.csv")
data$Day <- factor(data$Day, levels = c("1", "2"))
data$Genotype <- factor(data$Genotype, levels = c("WT", "KO"))
head(data)

Let’s have an initial view of the data using Raincloud plots as shown by Guilherme A. Franchi, PhD in this great blog post.

edv <- ggplot(data, aes(x = Day, y = Trial3, fill=Genotype)) +
scale_fill_ghibli_d("SpiritedMedium", direction = -1) +
geom_boxplot(width = 0.1,
outlier.color = "red") +
xlab('Day') +
ylab('Time (s)') +
ggtitle("Rorarod performance") +
theme_classic(base_size=18, base_family="serif")+
theme(text = element_text(size=18),
axis.text.x = element_text(angle=0, hjust=.1, vjust = 0.5, color = "black"),
axis.text.y = element_text(color = "black"),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position="bottom")+
scale_y_continuous(breaks = seq(0, 100, by=20), 
limits=c(0,100)) +
# Line below adds dot plots from {ggdist} package 
stat_dots(side = "left", 
justification = 1.12,
binwidth = 1.9) +
# Line below adds half-violin from {ggdist} package
stat_halfeye(adjust = .5, 
width = .6, 
justification = -.2, 
.width = 0, 
point_colour = NA)
edv

Figure 2: Exploratory data visualization.

Figure 2 looks different from the original by Guilherme A. Franchi, PhD because we are plotting two factors instead of one. However, the nature of the plot is the same. Pay attention to the red dots, these are the ones that can be considered extreme observations that tilt the measures of central tendency (especially the mean) toward one direction. We also observe that the variances are different, so modeling also sigma can give better estimates. Our task now is to model the output using the brms package.

Do not over-think about ‘outliers’, use a student-t distribution instead | by Daniel Manrique-Castano | Mar, 2024

Related Post

You Missed

Spotify set to increase prices and add more plans soon

Oura preempts Samsung’s Galaxy Ring with new features for its rings

14 Years of iPad, and “Trucks” Continue To Dominate

Apple Vision Pro Spatial Personas Hands-On: A Step-Change For Telepresence