During the last seminar I gave for students and professionals in biology, I was surprised when one of the attendees told me that she only knew statistical tests and that regression was completely unknown to her. With this catchy title, I want to show that regression is always a better option than statistical tests for the complex problems commonly encountered in biology. I hope that by the end of this short post you will understand the advantages of regression over statistical tests for doing inference, particularly in the presence of confounding.
A common question asked by researchers is: “Are my groups significantly different?” Indeed, biologists are often interested in comparing two or more populations under different conditions, for example healthy and ill individuals. If you are unfamiliar with the concepts of significance testing and the related p-value, here is my first post on the subject: https://statsxomics.blog/2024/01/17/understand-p-values-a-guide-for-researchers/. A common (and valid) approach is to use the parametric t-test or the non-parametric Wilcoxon test to compare two populations. I argue that this choice is valid if, and only if, there is no confounding factor distorting the association between the groups and the variable of interest (see here to learn more about confounding: https://en.wikipedia.org/wiki/Confounding). Although there exist stratification strategies to account for confounding, regression seems a more natural choice in this scenario. Let’s take two examples to illustrate my point: a design with independent individuals and a paired design.
Comparing two unrelated populations
In our first example, we are interested in studying the association between a disease status and the expression levels of a given gene in two independent populations. In essence, our question is: is my gene’s expression modified by the disease? With enough individuals, we can easily plan to use a Wilcoxon test to compare affected and unaffected people. But the tricky thing here is that I know that BMI and sex are confounding factors. So distinguishing the true effect of the disease from those of BMI or sex could be difficult if we do not have a valid way to adjust for these confounding factors.
For the sake of pedagogy, I simulated data from a Poisson distribution to compare the Wilcoxon test with Poisson regression.
In the hypothetical scenario where I do not know that regression exists, I can naively apply a two-sided Wilcoxon rank-sum test to my gene expression and my disease status and obtain something like this:
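For reference, this is the one-line call producing the output below (the full simulation code is given at the end of the post):

wilcox.test(expression ~ status)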
Wilcoxon rank sum test with continuity correction
data: expression by status
W = 763, p-value = 0.001023
alternative hypothesis: true location shift is not equal to 0
Now the important question is: “How can I interpret this result?”
And the answer is: with difficulty. Indeed, unlike the t-test, interpretation of the Wilcoxon rank-sum test is essentially restricted to the interpretation of the p-value (which is why understanding the latter is critical here). This is biologically unsatisfying, since the test lacks a measure of association that would convey important information such as the direction of the effect. Also, we did not control for the known confounding factors, i.e., BMI and sex. So a legitimate question is: “Is my result driven by the confounding factors or by the effect of the status alone?” Let’s find out using a regression model.
In the first scenario I will simply regress the expression on the status, and in the second I will also account for BMI and sex.
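In R, both models are fitted with glm() and a Poisson family (these are the same calls as in the full code at the end of the post):

model1 = glm(expression ~ status, family = poisson()) # unadjusted
model2 = glm(expression ~ status + sex + BMI, family = poisson()) # adjusted for confounders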
The first model gives:
Call:
glm(formula = expression ~ status, family = poisson())
Deviance Residuals:
Min 1Q Median 3Q Max
-256.52 -153.26 -94.87 56.88 583.95
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 9.630983 0.001093 8814.4 <2e-16 ***
status 0.900135 0.001337 673.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3687853 on 99 degrees of freedom
Residual deviance: 3201524 on 98 degrees of freedom
AIC: 3202630
Number of Fisher Scoring iterations: 6
From this, we have a bunch of additional interpretation elements compared to the Wilcoxon test, such as the strength of the association (estimate), the uncertainty of this association (standard error) and the corresponding p-value. As we can see, when we do not account for BMI and sex, the association between disease status and gene expression is clearly overestimated compared to the true value (0.9 estimated vs 0.7 true).
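Because Poisson regression works on the log scale, a coefficient is easiest to read after exponentiation, which turns it into a rate ratio. A quick sketch, using the model1 object from the code at the end:

# exponentiate the coefficients to obtain rate ratios
exp(coef(model1))
# exp(0.90) is about 2.46: this (confounded) model predicts roughly 2.5 times
# higher expression in affected individuals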
As stated above, regression models are an intuitive way to measure associations in the presence of confounding factors. Indeed, when we adjust for BMI and sex, the model recovers the right answer for the association between disease status and gene expression (0.699 estimated vs 0.7 true). Practically speaking, BMI and sex are directly incorporated into the model as adjustment variables.
Call:
glm(formula = expression ~ status + sex + BMI, family = poisson())
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0643 -0.6230 0.1099 0.5836 2.7005
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8124559 0.0064786 125.4 <2e-16 ***
status 0.6988531 0.0013407 521.2 <2e-16 ***
sex 0.2003951 0.0012663 158.2 <2e-16 ***
BMI 0.2996372 0.0002005 1494.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3.6879e+06 on 99 degrees of freedom
Residual deviance: 9.3996e+01 on 96 degrees of freedom
AIC: 1203.6
Number of Fisher Scoring iterations: 3
But do there exist regression models for more complex scenarios, such as paired designs?
N.B.: A curious reader might ask: why did we not use the t-test here, since it provides interesting information for biologists in this case? The reason I used Poisson regression instead of a t-test is the flexibility offered by regression models in terms of data distribution. The t-test, while quite robust, performs best with normally distributed responses, whereas there exist many parametric and non-parametric regression models that can accommodate very complex underlying data structures, making regression more versatile and often more powerful than the t-test.
Comparing two paired populations
Another scenario arises when you measure the amount of a biomarker in the same individual at two timepoints, before and after a certain exposure. This scenario violates the underlying assumptions of standard regression models, such as the Poisson regression presented above. Indeed, linear, logistic or any other generalized linear regression assumes that individuals are independent, so that residuals are uncorrelated. In essence, we want to separate the effect of the exposure from the individual-related effect, by accounting for the correlation between measurements taken on the same individual.
Assuming a Gaussian response measured for 100 individuals at two timepoints (before and after the treatment), we are still able to take advantage of regression models.
Once again, we can apply either a two-sided Wilcoxon signed-rank test or, since the data are normally distributed, a two-sided paired t-test.
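Both tests are one-liners in R (same calls as in the full code at the end of the post):

wilcox.test(levels ~ treatment, df, paired = TRUE)
t.test(levels ~ treatment, df, paired = TRUE)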
Wilcoxon signed rank test
Wilcoxon signed rank test with continuity correction
data: levels by treatment
V = 1711, p-value = 0.005157
alternative hypothesis: true location shift is not equal to 0
Paired t-test
Paired t-test
data: levels by treatment
t = -2.9951, df = 99, p-value = 0.003467
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-0.6906590 -0.1402098
sample estimates:
mean difference
-0.4154344
Fortunately, to address the limitations of the two previous tests, we have a very powerful option: linear mixed models, which make it possible to separate the individual-related variance (between-individual variance) from the residual variance (within-individual variance), in addition to offering the possibility of adjusting for confounding factors.
However, if linear mixed models are completely unknown to you, you might be tempted to use a standard linear regression; but, as we can see below, the treatment effect then comes out as non-significant. This happens because the residual variance absorbs the between-individual variation, inflating the standard errors in a difficult-to-anticipate way.
Call:
lm(formula = levels ~ treatment + BMI + sex, data = df)
Residuals:
Min 1Q Median 3Q Max
-5.8346 -1.2508 -0.0488 1.3366 5.3086
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.46068 0.87140 12.004 <2e-16 ***
treatment 0.41543 0.28252 1.470 0.143
BMI 1.19187 0.02716 43.882 <2e-16 ***
sex 2.65424 0.28759 9.229 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.998 on 196 degrees of freedom
Multiple R-squared: 0.9098, Adjusted R-squared: 0.9084
F-statistic: 658.8 on 3 and 196 DF, p-value: < 2.2e-16
As announced at the beginning of this section, we have a very good option for scenarios with repeated measures, or whenever correlation between measurements of the same individual is expected: the linear mixed model, a.k.a. the final boss of generalized linear regressions.
The easiest way is to tell the model that only the intercept varies across individuals (this is called a random intercept). This is what we do here, and this is what we obtain.
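The model is fitted with lmer() from the lme4 package, where (1 | ind) encodes the per-individual random intercept (same call as in the full code at the end):

library(lme4)
mixed_model = lmer(levels ~ (1 | ind) + sex + BMI + treatment, data = df)
summary(mixed_model)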
Linear mixed model fit by REML ['lmerMod']
Formula: levels ~ (1 | ind) + sex + BMI + treatment
Data: df
REML criterion at convergence: 764
Scaled residuals:
Min 1Q Median 3Q Max
-2.01847 -0.41129 -0.03703 0.45401 2.46802
Random effects:
Groups Name Variance Std.Dev.
ind (Intercept) 3.060 1.7493
Residual 0.962 0.9808
Number of obs: 200, groups: ind, 100
Fixed effects:
Estimate Std. Error t value
(Intercept) 10.46068 1.03543 10.103
sex 2.65424 0.38311 6.928
BMI 1.19187 0.03618 32.941
treatment 0.41543 0.13871 2.995
Correlation of Fixed Effects:
(Intr) sex BMI
sex -0.200
BMI -0.951 0.051
treatment -0.201 0.000 0.000
What is the difference between this output and the previous one? In other words, what additional information do we get?
The two main differences are the additional random-effects section and the values of the standard errors.
Let’s start with the former. The random-effects section is composed of at least two elements, called the variance components: one related to the random effect and one to the residuals. The individual variance (ind) is the part of the data variability explained by differences between individuals: the higher this value, the more of the total variability is captured by the individuals, and conversely. The Intra-Class Correlation (ICC), computed as the ratio of the between-individual variance to the total variance, is an informative metric that tells us how much of the variance is explained by the clustering.
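Using the variance components from the output above, the ICC can be computed by hand:

# ICC = between-individual variance / total variance
3.060 / (3.060 + 0.962)
# about 0.76: roughly 76% of the unexplained variability is due to
# differences between individuals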
By accounting for this additional source of variability through the random effect, we obtain more accurate standard errors, while the estimates remain the same. This has a major impact on the test statistic and hence on the statistical significance.
N.B.: Yes, we do not get p-values when using mixed models in R (at least with lme4), which pushes many researchers to avoid these methods. The reason is that there is no consensus on how to accurately compute p-values in this context, and the debate across the community is intense. My practical suggestions are: 1) if you have enough data (individuals and repeated measures per individual), you can obtain a crude approximation of the asymptotic p-value using a normal distribution; 2) if you believe your sample size is too small, the bootstrap is always a good friend. If you cannot wait for my future post on the subject, see this excellent introduction in R: https://bookdown.org/jgscott/DSGI/the-bootstrap.html. But keep in mind that we can live comfortably even without p-values!
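A minimal sketch of these two suggestions, reusing the mixed_model object fitted above: the t values reported by lmer() can be compared against a standard normal for a crude two-sided p-value, and lme4 provides parametric-bootstrap confidence intervals through confint():

# 1) crude asymptotic p-values from the reported t values (reasonable with enough data)
tvals = coef(summary(mixed_model))[, "t value"]
2 * pnorm(abs(tvals), lower.tail = FALSE)
# 2) parametric bootstrap confidence intervals (slower, but safer with small samples)
confint(mixed_model, method = "boot", nsim = 500)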
Conclusion
In this post we have reviewed one of the most important statistical models: regression. Through two quick examples, we have seen the advantages of regression over statistical tests in accounting for confounding, while providing additional interpretation aids. I also presented an elegant alternative to traditional tests for paired designs, the mixed model, for dealing with repeated measures and situations where the assumptions of standard models are not met. Even though the use case presented here focuses on a paired design with a Gaussian response, mixed models easily extend to more complex designs and other outcome distributions (binomial, Poisson, negative binomial). Also, in many situations we have unmeasured confounding factors, and mixed models can be an interesting way to adjust for them: incorporating random effects can be seen as a proxy for these unobserved confounding variables.
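As an illustration of this extensibility, here is a hypothetical sketch of a random-intercept Poisson model using glmer() from lme4 (the data frame mydata and its variables counts, group and ind are made up for the example):

library(lme4)
# random-intercept Poisson regression for repeated count measurements
# (mydata, counts, group and ind are hypothetical)
glmer(counts ~ group + (1 | ind), family = poisson, data = mydata)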
Now you cannot say that you do not know that regression exists! You are welcome.
The R code to reproduce results:
set.seed(1234) # for reproducibility
# Simulate gene expression data for 100 individuals,
# with a balanced design for disease status and sex, while BMI is drawn from a uniform distribution
status = rbinom(100, 1, 0.5)
sex = rbinom(100, 1, 0.5)
BMI = ceiling(runif(100, 17, 35))
# Gene expression is simulated from a Poisson distribution (true effects: status 0.7, sex 0.2, BMI 0.3)
expression = rpois(100, exp(0.8 + 0.7*status + 0.2*sex + 0.3*BMI))
wilcox.test(expression ~ status)
model1 = glm(expression ~ status, family = poisson()) # unadjusted model
summary(model1)
plot(model1) # diagnostic plots
model2 = glm(expression ~ status + sex + BMI, family = poisson()) # adjusted model
summary(model2)
plot(model2)
# Simulate data for the paired design
# Thanks to: https://aosmith.rbind.io/2018/04/23/simulate-simulate-part-2/
ind = rep(paste0("Ind", 1:100), each = 2) # 100 individuals, 2 measurements each
sex_paired = rep(rbinom(100, 1, 0.5), each = 2)
BMI_paired = rep(ceiling(runif(100, 17, 35)), each = 2)
ind_effect = rep(rnorm(100, 0, 1.8), each = 2) # individual-level (random) effect
pop_effect = rnorm(200, 0, 1) # residual noise
treatment = rep(1:2, 100) # two timepoints per individual
df = data.frame("ind" = ind, "ind_effect" = ind_effect, "pop_effect" = pop_effect,
                "sex" = sex_paired, "BMI" = BMI_paired, "treatment" = treatment)
df$levels = 10 + 2.8*df$sex + 1.2*df$BMI + 0.6*df$treatment + df$ind_effect + df$pop_effect
t.test(levels ~ treatment, df, paired = TRUE)
wilcox.test(levels ~ treatment, df, paired = TRUE)
library(lme4) # to fit linear mixed models
summary(lm(levels ~ treatment + BMI + sex, df)) # naive linear regression, ignoring pairing
summary(lmer(levels ~ (1 | ind) + sex + BMI + treatment, df)) # random-intercept mixed model