Power and resolution

If we discretise a continuous variable (e.g. split it into high and low groups), do we lose statistical power in downstream analysis? This study is motivated by the choice between summarising patient responses as positive or negative and keeping them as a continuous measurement.

We will look at several cases (sketched in code below):

  • First, a single normally distributed variable contributes to a response term.
  • Then, a categorical latent variable generates an observed normal variable whose mean depends on the latent class, so the observed predictor is bimodal; the response depends on this observed variable.
  • Finally, a similar setup, but the response depends directly on the latent variable, so the observed predictor is informative only through its association with the latent class.
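
For orientation, the three data-generating processes can be written down compactly; this is a minimal sketch, with n, b, and mdif taking the same values as in the simulation cells below.

In [ ]:
%%R
# Sketch of the three generative models compared in this notebook
n <- 100; b <- 0.3; mdif <- 1
C  <- rep(0:1, length.out=n)                     # discrete latent class
x1 <- rnorm(n);         y1 <- rnorm(n, b*x1)     # case 1: y depends on continuous x
x2 <- rnorm(n, C*mdif); y2 <- rnorm(n, b*x2)     # case 2: x is bimodal; y depends on x
x3 <- rnorm(n, C*mdif); y3 <- rnorm(n, C*mdif)   # case 3: x and y both depend on C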

The figure below shows the first case, with patient responses split at the mean.

In [1]:
%load_ext rpy2.ipython
from rpy2.robjects import r
In [3]:
%%R
packages <- c("ggplot2", "dplyr", "diosR")
lapply(packages, require, character.only=T)
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

In [22]:
%%R
# df comes from the simulation cell below (cells were run out of order)
ggplot(df, aes(x=x, fill=c))+
geom_histogram() + theme_dios()

Discretised continuous

x is a unimodal normal variable on which y depends.

In [29]:
%%R
n <- 100
b <- 0.3

mp <- c()
np <- c()

for(i in 1:500){
    x <- rnorm(n)                  # continuous predictor
    c <- x > mean(x)               # discretised predictor: above/below the mean
    y <- rnorm(n, b*x)             # response depends on the continuous predictor

    df <- data.frame(x,y,c)

    m1 <- lm(y~x, data=df)                   # model with the continuous predictor
    mp[i] <- summary(m1)$coefficients[2,4]   # p-value of the slope
    n1 <- lm(y~c, data=df)                   # model with the discretised predictor
    np[i] <- summary(n1)$coefficients[2,4]
}

# print(ggplot(df, aes(x=x, y=y))+geom_point())

# ggplot(df, aes(x=c, y=y))+geom_boxplot()
In [8]:
%%R
power <- data.frame(type=c('cont','discrete'), power=c(mean(mp<0.05), mean(np<0.05)))
# power is a proportion over the 500 simulations, so use the binomial standard error
power$se <- sqrt(power$power*(1-power$power)/length(mp))
ggplot(power, aes(x=type, y=power))+
geom_errorbar(aes(ymin=power-se, ymax=power+se), width=.2)+
geom_point(size=5, colour="orange")+
theme_dios()
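
Each power estimate is a proportion over 500 simulated tests, so an exact binomial confidence interval can be placed on it as a sanity check on the error bars above (a minimal sketch using base R's binom.test):

In [ ]:
%%R
# Exact 95% CI for the power of the continuous-predictor model:
# each simulation is one Bernoulli trial (reject / fail to reject at 0.05)
binom.test(sum(mp < 0.05), length(mp))$conf.int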

Bimodal continuous

x is normally distributed, but its mean depends on the latent variable C, so its marginal distribution is bimodal. y depends on x.

In [71]:
%%R
n <- 100
b <- 0.3
mdif <- 1

mp <- c()
np <- c()

for(i in 1:500){
    C <- rep(0:1, length.out=n)    # latent class, balanced
    x <- rnorm(n, C*mdif)          # observed predictor, mean shifted by class
    c <- x > mean(x)               # discretised predictor
    y <- rnorm(n, b*x)             # response depends on the continuous predictor

    df <- data.frame(x,y,c,C)

    m1 <- lm(y~x, data=df)
    mp[i] <- summary(m1)$coefficients[2,4]
    n1 <- lm(y~c, data=df)
    np[i] <- summary(n1)$coefficients[2,4]
}
In [72]:
%%R
power <- data.frame(type=c('cont','discrete'), power=c(mean(mp<0.05), mean(np<0.05)))
# binomial standard error of each estimated power over the 500 simulations
power$se <- sqrt(power$power*(1-power$power)/length(mp))
ggplot(power, aes(x=type, y=power))+
geom_errorbar(aes(ymin=power-se, ymax=power+se), width=.2)+
geom_point(size=5, colour="orange")+
theme_dios()

Y ~ latent variable

Both y and x are normally distributed with means that depend on the discrete latent variable; x no longer influences y directly.

In [60]:
%%R
n <- 100
b <- 0.3
mdif <- 1

mp <- c()
np <- c()
op <- c()
for(i in 1:500){
    C <- rep(0:1, length.out=n)    # latent class, balanced
    x <- rnorm(n, C*mdif)          # observed predictor: a noisy proxy for C
    c <- x > mean(x)               # discretised proxy
    y <- rnorm(n, C*mdif)          # response depends on the latent class, not on x

    df <- data.frame(x,y,c,C)

    m1 <- lm(y~x, data=df)         # continuous proxy
    mp[i] <- summary(m1)$coefficients[2,4]
    n1 <- lm(y~c, data=df)         # discretised proxy
    np[i] <- summary(n1)$coefficients[2,4]
    o1 <- lm(y~C, data=df)         # oracle: the true latent class
    op[i] <- summary(o1)$coefficients[2,4]
}

How well does the discretisation capture the true classification? In the figure below, the true class is shown by colour, the assigned class by shape, and the cutoff by the dashed vertical line. Playing with the mdif variable suggests that power rises rapidly for both models as mdif grows; at large values of mdif power is high in both cases, but the discretisation remains slightly worse.

In [61]:
%%R
ggplot(df, aes(x=x,y=y,colour=as.factor(C), shape=c))+
geom_point(size=4)+
geom_vline(xintercept=mean(x), lty=2)+
scale_shape_manual(values=c(1,19))+
theme_dios()
In [51]:
%%R

power <- data.frame(type=c('cont','discrete','latent'), power=c(mean(mp<0.05), mean(np<0.05), mean(op<0.05)))
# binomial standard error of each estimated power over the 500 simulations
power$se <- sqrt(power$power*(1-power$power)/length(mp))
ggplot(power, aes(x=type, y=power))+
geom_errorbar(aes(ymin=power-se, ymax=power+se), width=.2)+
geom_point(size=5, colour="orange")+
theme_dios()
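
To put the mdif observation on a firmer footing, the simulation can be swept over a grid of separation values (a sketch reusing n from above, with fewer iterations per point to keep it quick):

In [ ]:
%%R
# Hypothetical sweep: power of the continuous vs discretised model
# as the class separation (mdif) grows, in the y ~ latent-variable case
sweep <- do.call(rbind, lapply(c(0.5, 1, 2, 3), function(mdif){
    mp <- np <- numeric(200)
    for(i in 1:200){
        C <- rep(0:1, length.out=n)
        x <- rnorm(n, C*mdif)
        c <- x > mean(x)
        y <- rnorm(n, C*mdif)      # response driven by the latent class
        mp[i] <- summary(lm(y~x))$coefficients[2,4]
        np[i] <- summary(lm(y~c))$coefficients[2,4]
    }
    data.frame(mdif=mdif, cont=mean(mp<0.05), discrete=mean(np<0.05))
}))
sweep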

Conclusion

Discretising variables results in a significant loss of power. Even when y depends on an unobserved discrete variable, using the full distribution of the observed data preserves power. This is consistent with the classic "cost of dichotomization" result: splitting a normal predictor at its mean attenuates its correlation with the outcome by a factor of √(2/π) ≈ 0.8.

In these simulations the discretisation is very crude: above or below the mean. A more sensitive procedure may limit the power loss, and if the latent variable can be accurately reconstructed, power improves substantially.
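
For instance, instead of splitting at the mean, the latent class could be estimated by clustering x into two groups; this is a minimal sketch using base R's kmeans on the last simulated df (a two-component mixture model would be the fuller treatment, and chat is a hypothetical name for the recovered class):

In [ ]:
%%R
# Hypothetical refinement: recover the latent class by clustering x,
# then test y against the recovered labels instead of the mean split
km <- kmeans(df$x, centers=2, nstart=10)
df$chat <- factor(km$cluster)
summary(lm(y ~ chat, data=df))$coefficients[2,4]   # p-value for the recovered class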

If the difference between positive and negative cases is clear, the loss of power is limited; but where the classification is ambiguous, preserving the full distribution is beneficial.