Estimating Mean and Standard Deviation: Range vs. IQR for Skewed Data Analysis

This article delves into methods for estimating the sample mean and standard deviation when only summary statistics are available, a common scenario in clinical trial studies and meta-analyses. Specifically, we compare the use of the range and interquartile range (IQR) in these estimations, particularly considering situations where data might be skewed. Accurate estimation of these fundamental statistical measures is crucial for data interpretation and further analysis, even when raw data is inaccessible. We enhance existing methodologies to provide more robust and precise estimations, especially when dealing with datasets that may not perfectly conform to a normal distribution.

I. Scenario C1: Estimation from Minimum, Maximum, Median, and Sample Size

Scenario C1 is prevalent in research, where studies often report the median, minimum, and maximum values together with the sample size. This is the foundational assumption used by Hozo et al. in their established method. To estimate the sample mean ($\bar{X}$) and standard deviation ($S$) under this scenario, we will first examine the Hozo et al. method, identify its limitations, particularly in standard deviation estimation, and then propose improvements that leverage sample size information for greater accuracy.

Let $X_1, X_2, \ldots, X_n$ represent a random sample of size $n$ drawn from a normal distribution $N(\mu, \sigma^2)$, and let $X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$ be the order statistics of this sample. For simplicity, we assume $n = 4Q + 1$, where $Q$ is a positive integer. This allows us to define specific order statistics corresponding to quartiles and the median. Thus:

$a = X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(Q+1)} = q_1 \leq X_{(Q+2)} \leq \cdots \leq X_{(2Q+1)} = m \leq X_{(2Q+2)} \leq \cdots \leq X_{(3Q+1)} = q_3 \leq X_{(3Q+2)} \leq \cdots \leq X_{(4Q+1)} = X_{(n)} = b$. (1)

Here, $a$ is the minimum, $q_1$ is the first quartile, $m$ is the median, $q_3$ is the third quartile, and $b$ is the maximum of the sample. Our objective is to estimate the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ and the sample standard deviation $S = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)}$, given only $a$, $m$, $b$, and $n$.

Hozo et al.’s Method for Scenario C1

Let $M = 2Q + 1 = (n+1)/2$. Hozo et al. estimated the mean by applying inequalities to the ordered statistics. They reasoned that each data point falls within a certain range defined by the known statistics:

$a \leq X_{(1)} \leq a$
$a \leq X_{(i)} \leq m$ for $i = 2, \ldots, M-1$
$m \leq X_{(M)} \leq m$
$m \leq X_{(i)} \leq b$ for $i = M+1, \ldots, n-1$
$b \leq X_{(n)} \leq b$

Summing these inequalities and dividing by $n$ yields a lower bound ($LB_1$) and an upper bound ($UB_1$) for the sample mean $\bar{X}$: $LB_1 \leq \bar{X} \leq UB_1$, where

$LB_1 = \frac{a + m}{2} + \frac{2b - a - m}{2n}$, $UB_1 = \frac{m + b}{2} + \frac{2a - m - b}{2n}$.

Hozo et al.’s estimate for the sample mean is the midpoint of these bounds:

$\frac{LB_1 + UB_1}{2} = \frac{a + 2m + b}{4} + \frac{a - 2m + b}{4n}$. (2)

For large sample sizes, the second term in (2) becomes negligible, leading to a simplified mean estimation:

$\bar{X} \approx \frac{a + 2m + b}{4}$. (3)
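Formulas (2) and (3) are easy to check numerically. The Python sketch below is illustrative only: the function name hozo_mean and the example values are assumptions introduced here, not part of Hozo et al.'s work.

```python
def hozo_mean(a, m, b, n=None):
    """Mean estimate from the minimum a, median m, and maximum b.

    With n supplied, this is the midpoint of the bounds, formula (2).
    Without n, it is the large-sample simplification, formula (3).
    """
    base = (a + 2 * m + b) / 4
    if n is None:
        return base
    return base + (a - 2 * m + b) / (4 * n)


# Hypothetical summary statistics, chosen only for illustration.
a, m, b, n = 2.0, 10.0, 30.0, 25
print(hozo_mean(a, m, b, n))  # formula (2), with the sample-size correction
print(hozo_mean(a, m, b))     # formula (3), large-n simplification
```

The correction term is zero when the median sits exactly midway between the minimum and the maximum, and it shrinks at rate $1/n$ otherwise.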

To estimate the sample standard deviation, Hozo et al. assumed non-negative data and used inequalities for the squared values of the ordered statistics:

$aX_{(1)} \leq X_{(1)}^2 \leq aX_{(1)}$
$aX_{(i)} \leq X_{(i)}^2 \leq mX_{(i)}$ for $i = 2, \ldots, M-1$
$mX_{(M)} \leq X_{(M)}^2 \leq mX_{(M)}$
$mX_{(i)} \leq X_{(i)}^2 \leq bX_{(i)}$ for $i = M+1, \ldots, n-1$
$bX_{(n)} \leq X_{(n)}^2 \leq bX_{(n)}$ (4)

Through algebraic manipulation and approximations, they derived a lower bound ($LSB_1$) and an upper bound ($USB_1$) for the sum of squared values $\sum_{i=1}^{n} X_i^2$. Using (3) and approximating $\sum_{i=1}^{n} X_i^2 \approx (LSB_1 + USB_1)/2$, the sample variance $S^2$ is estimated, and the standard deviation $S$ is obtained by taking the square root. For large $n$, this simplifies to the well-known range rule of thumb:

$S \approx \frac{b - a}{4}$. (5)

This range rule of thumb (5) is independent of sample size, which can be a significant limitation, particularly for very small or very large samples. To address this, Hozo et al. proposed an adaptive range rule of thumb, adjusting for different sample sizes:

$S \approx \begin{cases} \frac{1}{\sqrt{12}} \sqrt{(b-a)^2 + \frac{(a-2m+b)^2}{4}} & n \leq 15 \\ \frac{b-a}{4} & 15 < n \leq 70 \\ \frac{b-a}{6} & n > 70 \end{cases}$ (6)

The formula for $n \leq 15$ is based on equidistantly spaced data, and the formula for $n > 70$ is suggested by Chebyshev’s inequality [5]. For symmetric data where $a + b \approx 2m$, the formula for small $n$ simplifies to approximately $(b-a)/\sqrt{12}$. Hozo et al. demonstrated that this adaptive formula (6) generally outperforms the original range rule (5).
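The piecewise rule (6) can be written as a short function. This is a minimal sketch assuming the thresholds exactly as stated above; the name hozo_sd and the test values are hypothetical.

```python
import math


def hozo_sd(a, m, b, n):
    """Adaptive range rule of thumb, formula (6)."""
    if n <= 15:
        # Small samples: formula based on equidistantly spaced data.
        return math.sqrt((b - a) ** 2 + (a - 2 * m + b) ** 2 / 4) / math.sqrt(12)
    if n <= 70:
        # Moderate samples: the classical range / 4 rule, formula (5).
        return (b - a) / 4
    # Large samples: range / 6.
    return (b - a) / 6


# Same hypothetical range and median, three different sample sizes.
for n in (10, 50, 100):
    print(n, hozo_sd(0.0, 5.0, 10.0, n))
```

Running the loop makes the discontinuity visible: the estimate jumps at $n = 15$ and $n = 70$ even though the reported range is unchanged.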

Improved Estimation of S for Scenario C1: Incorporating Sample Size and Addressing Skewness Concerns

While the adaptive formula (6) is an improvement, it still has limitations. The thresholds of 15 and 70 are somewhat arbitrary. Furthermore, for normally distributed data $N(\mu, \sigma^2)$ with a finite $\sigma > 0$, the range $b - a$ grows without bound as $n$ increases, so the estimate $(b-a)/6$ used for $n > 70$ tends to infinity as $n \rightarrow \infty$, which contradicts the assumption of a finite $\sigma$. Additionally, the non-negative data assumption in Hozo et al.’s method is restrictive. Skewness in data can also significantly impact the range, making it a less reliable measure of dispersion compared to the IQR in skewed distributions.

We propose a new estimator to refine (6) and remove the non-negative data restriction. Let $Z_1, \ldots, Z_n$ be independent and identically distributed (i.i.d.) random variables from the standard normal distribution $N(0,1)$, and let $Z_{(1)} \leq \cdots \leq Z_{(n)}$ be their order statistics. Then $X_i = \mu + \sigma Z_i$ and $X_{(i)} = \mu + \sigma Z_{(i)}$. Consequently, $a = \mu + \sigma Z_{(1)}$ and $b = \mu + \sigma Z_{(n)}$. Since $E(Z_{(1)}) = -E(Z_{(n)})$, we have $E(b - a) = 2\sigma E(Z_{(n)})$. Let $\xi(n) = 2E(Z_{(n)})$. We propose the following estimator for the sample standard deviation:

$S \approx \frac{b - a}{\xi(n)}$. (7)

Here, $\xi(n)$ is crucial for adjusting the range based on sample size. If $\xi(n) \equiv 4$, we get the original rule of thumb (5). If $\xi(n)$ is set adaptively as $\sqrt{12}$ for $n \leq 15$, 4 for $15 < n \leq 70$, and 6 for $n > 70$, it reduces to the adaptive rule (6), with the $n \leq 15$ case matching for symmetric data where $a + b \approx 2m$.

To approximate $xi(n)$, we use David and Nagaraja’s method [6] for the expected value of $Z_{(n)}$:

$E(Z_{(n)}) = n \int_{-\infty}^{\infty} z [\Phi(z)]^{n-1} \phi(z) dz$,

where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ and $\Phi(z) = \int_{-\infty}^{z} \phi(t) dt$ are the PDF and CDF of the standard normal distribution, respectively. We computed $\xi(n)$ numerically and provided values in Table 1 for $n \leq 50$. Table 1 shows that $\xi(n)$ increases steadily with $n$, whereas Hozo et al.’s adaptive formula (6) uses only three fixed divisors, which makes it less accurate and less flexible.
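The integral for $E(Z_{(n)})$ can be evaluated with ordinary quadrature. The sketch below uses composite Simpson's rule on $[-10, 10]$; the integration limits, the number of steps, and the function names are assumptions made for illustration and are not taken from the source.

```python
import math


def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_max(n, lo=-10.0, hi=10.0, steps=4000):
    """E(Z_(n)) = n * integral of z * Phi(z)^(n-1) * phi(z) dz (Simpson's rule)."""
    def f(z):
        return n * z * Phi(z) ** (n - 1) * phi(z)

    h = (hi - lo) / steps  # steps must be even for Simpson's rule
    total = f(lo) + f(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3.0


def xi(n):
    """xi(n) = 2 * E(Z_(n)), the divisor in formula (7)."""
    return 2.0 * expected_max(n)


for n in (2, 5, 10, 25, 50):
    print(n, round(xi(n), 3))
```

For $n = 2$ the closed form $E(Z_{(2)}) = 1/\sqrt{\pi}$ gives a convenient sanity check on the quadrature.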

Table 1 Values of $\xi(n)$ in formula (7) and formula (12) for $n \leq 50$

For larger $n$ ($n > 50$), we can use Blom’s approximation [7] for $E(Z_{(r)})$:

$E(Z_{(r)}) \approx \Phi^{-1}\left(\frac{r - \alpha}{n - 2\alpha + 1}\right)$, for $r = 1, \ldots, n$, (8)

where $\Phi^{-1}(z)$ is the inverse CDF of the standard normal distribution, i.e., the quantile function evaluated at probability $z$. Blom suggested $\alpha = 0.375$ as a compromise value for practical use, although the optimal $\alpha$ varies with $n$ [8, 9]. Using (7) and (8) with $r=n$ and $\alpha = 0.375$, we estimate the standard deviation as:

$S \approx \frac{b - a}{2\Phi^{-1}\left(\frac{n - 0.375}{n + 0.25}\right)}$. (9)

In R, $\Phi^{-1}(z)$ can be computed using qnorm(z).
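For readers working in Python rather than R, the standard library's statistics.NormalDist().inv_cdf plays the role of qnorm. The following sketch of formula (9) is illustrative; the function name sd_from_range and the example inputs are assumptions.

```python
from statistics import NormalDist


def sd_from_range(a, b, n, alpha=0.375):
    """Formula (9): range divided by Blom's approximation to xi(n)."""
    if n < 2:
        raise ValueError("n must be at least 2; the divisor is zero for n = 1")
    p = (n - alpha) / (n - 2 * alpha + 1)  # equals (n - 0.375) / (n + 0.25)
    return (b - a) / (2 * NormalDist().inv_cdf(p))


# Hypothetical range of 10 units reported at several sample sizes.
for n in (20, 100, 1000):
    print(n, sd_from_range(0.0, 10.0, n))
```

Unlike rule (6), the divisor here changes smoothly with $n$, so the same reported range yields a smaller standard deviation estimate as the sample grows.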

II. Scenario C2: Incorporating Quartiles for Enhanced Estimation

Scenario C2 expands on C1 by including the first quartile ($q_1$) and third quartile ($q_3$) in addition to the minimum ($a$), maximum ($b$), median ($m$), and sample size ($n$). Bland’s method [10] builds upon Hozo et al.’s work by utilizing the interquartile range (IQR = $q_3 - q_1$). Bland argued that these new estimators outperform those of Hozo et al. We will review Bland’s method, identify its limitations, and propose improvements by incorporating sample size. The IQR is inherently more robust to outliers and skewness than the range, which makes Scenario C2 advantageous when the data may be skewed.

Bland’s Method for Scenario C2

With $n = 4Q + 1$, Bland’s method uses the following inequalities to estimate the mean:

$a \leq X_{(1)} \leq a$
$a \leq X_{(i)} \leq q_1$ for $i = 2, \ldots, Q$
$q_1 \leq X_{(Q+1)} \leq q_1$
$q_1 \leq X_{(i)} \leq m$ for $i = Q+2, \ldots, 2Q$
$m \leq X_{(2Q+1)} \leq m$
$m \leq X_{(i)} \leq q_3$ for $i = 2Q+2, \ldots, 3Q$
$q_3 \leq X_{(3Q+1)} \leq q_3$
$q_3 \leq X_{(i)} \leq b$ for $i = 3Q+2, \ldots, n-1$
$b \leq X_{(n)} \leq b$

Summing and dividing by $n$ leads to bounds $LB_2 \leq \bar{X} \leq UB_2$, where

$LB_2 = \frac{a + q_1 + m + q_3}{4} + \frac{4b - a - q_1 - m - q_3}{4n}$, $UB_2 = \frac{q_1 + m + q_3 + b}{4} + \frac{4a - q_1 - m - q_3 - b}{4n}$.

Bland’s mean estimate is the average of these bounds. For large $n$, neglecting the second terms, the simplified mean estimation is:

$\bar{X} \approx \frac{a + 2q_1 + 2m + 2q_3 + b}{8}$. (10)

For standard deviation, Bland used inequalities similar to (4) and derived bounds $LSB_2 \leq \sum_{i=1}^{n} X_i^2 \leq USB_2$. Approximating $\sum_{i=1}^{n} X_i^2 \approx (LSB_2 + USB_2)/2$, the variance $S^2$ is estimated as:

$S^2 \approx \frac{1}{16} (a^2 + 2q_1^2 + 2m^2 + 2q_3^2 + b^2) + \frac{1}{8} (aq_1 + q_1m + mq_3 + q_3b) - \frac{1}{64} (a + 2q_1 + 2m + 2q_3 + b)^2$. (11)

Bland’s method estimates $S$ by taking the square root of $S^2$. However, estimator (11) is independent of sample size $n$, which can be limiting, especially for varying sample sizes.
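Bland's large-sample estimators can be sketched in a few lines of Python. The function names and example values are hypothetical. One assumption deserves emphasis: the cross-product term is given the coefficient $1/8$, the sample-size-free form implied by the remark that estimator (11) does not depend on $n$. With that coefficient the variance is exactly zero when all five summary values coincide, as it should be.

```python
import math


def bland_mean(a, q1, m, q3, b):
    """Large-sample mean estimate, formula (10)."""
    return (a + 2 * q1 + 2 * m + 2 * q3 + b) / 8


def bland_variance(a, q1, m, q3, b):
    """Large-sample variance estimate, formula (11).

    Assumption: the cross-product term carries the coefficient 1/8.
    """
    squares = (a * a + 2 * q1 * q1 + 2 * m * m + 2 * q3 * q3 + b * b) / 16
    cross = (a * q1 + q1 * m + m * q3 + q3 * b) / 8
    return squares + cross - bland_mean(a, q1, m, q3, b) ** 2


def bland_sd(a, q1, m, q3, b):
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(bland_variance(a, q1, m, q3, b), 0.0))


# Hypothetical five-number summary.
print(bland_mean(0.0, 2.0, 5.0, 8.0, 10.0))
print(bland_sd(0.0, 2.0, 5.0, 8.0, 10.0))
```

Note that $n$ appears nowhere in these functions, which is precisely the limitation discussed above.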

Improved Estimation of S for Scenario C2: Combining Range and IQR with Sample Size Adjustment

In Scenario C1, the range ($b-a$) was used for standard deviation estimation. In Scenario C2, with the IQR ($q_3 - q_1$) available, we can also estimate $S$ using $(q_3 - q_1)/\eta(n)$, where $\eta(n)$ is a function of $n$. Given that the IQR is less affected by skewness than the range, using the IQR in combination with the range can provide a more balanced and robust estimate, especially for potentially skewed datasets. We propose a combined estimator:

$S \approx \frac{1}{2} \left( \frac{b - a}{\xi(n)} + \frac{q_3 - q_1}{\eta(n)} \right)$. (12)

From Scenario C1, $\xi(n) = 2E(Z_{(n)})$. To determine $\eta(n)$, we note $q_1 = \mu + \sigma Z_{(Q+1)}$ and $q_3 = \mu + \sigma Z_{(3Q+1)}$, so $q_3 - q_1 = \sigma (Z_{(3Q+1)} - Z_{(Q+1)})$. Given $E(Z_{(Q+1)}) = -E(Z_{(3Q+1)})$, we have $E(q_3 - q_1) = 2\sigma E(Z_{(3Q+1)})$. This suggests:

$\eta(n) = 2E(Z_{(3Q+1)})$.

Using David and Nagaraja’s method [6],

$E(Z_{(3Q+1)}) = \frac{(4Q+1)!}{Q!(3Q)!} \int_{-\infty}^{\infty} z [\Phi(z)]^{3Q} [1 - \Phi(z)]^Q \phi(z) dz$.
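This integral can also be evaluated numerically. The sketch below again uses composite Simpson's rule on $[-10, 10]$; the limits, step count, and function names are illustrative assumptions. The factorial ratio is computed as $(4Q+1)\binom{4Q}{Q}$, which is algebraically identical to $(4Q+1)!/(Q!(3Q)!)$ and avoids forming large factorials.

```python
import math


def phi(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def eta(Q, lo=-10.0, hi=10.0, steps=4000):
    """eta(n) = 2 * E(Z_(3Q+1)) for n = 4Q + 1, by Simpson's rule."""
    coef = (4 * Q + 1) * math.comb(4 * Q, Q)  # (4Q+1)! / (Q! (3Q)!)

    def f(z):
        # Phi(-z) equals 1 - Phi(z) and is more accurate in the upper tail.
        return z * Phi(z) ** (3 * Q) * Phi(-z) ** Q * phi(z)

    h = (hi - lo) / steps  # steps must be even for Simpson's rule
    total = f(lo) + f(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return 2.0 * coef * total * h / 3.0


for Q in (1, 2, 5, 10, 50):
    print(Q, 4 * Q + 1, round(eta(Q), 3))
```

The printed values rise toward $2\Phi^{-1}(0.75)$ as $Q$ grows, which is the behaviour expected of an IQR divisor under normality.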

Table 2 provides numerical values for $\eta(n) = 2E(Z_{(3Q+1)})$ for $Q \leq 50$. For large $n$, we approximate $\eta(n)$ using formula (8) with $r = 3Q + 1$ and $\alpha = 0.375$: $\eta(n) \approx 2\Phi^{-1}((0.75n - 0.125)/(n + 0.25))$. Thus, for Scenario C2, the standard deviation estimate becomes:

$S \approx \frac{1}{2} \left( \frac{b - a}{2\Phi^{-1}\left(\frac{n - 0.375}{n + 0.25}\right)} + \frac{q_3 - q_1}{2\Phi^{-1}\left(\frac{0.75n - 0.125}{n + 0.25}\right)} \right)$. (13)
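Formula (13) translates directly into code. The following Python sketch is a hedged illustration; the helper names xi_approx, eta_approx, and sd_c2 are assumptions, as is the synthetic example used to exercise them.

```python
from statistics import NormalDist

_inv = NormalDist().inv_cdf  # standard normal quantile function


def xi_approx(n):
    """Blom-based approximation to xi(n), the range divisor."""
    return 2 * _inv((n - 0.375) / (n + 0.25))


def eta_approx(n):
    """Blom-based approximation to eta(n), the IQR divisor."""
    return 2 * _inv((0.75 * n - 0.125) / (n + 0.25))


def sd_c2(a, q1, q3, b, n):
    """Formula (13): average of the range-based and IQR-based estimates."""
    return 0.5 * ((b - a) / xi_approx(n) + (q3 - q1) / eta_approx(n))


# Synthetic check: build a summary whose range and IQR equal their
# approximate expectations for mu = 10, sigma = 2, n = 101.
n, mu, sigma = 101, 10.0, 2.0
a, b = mu - sigma * xi_approx(n) / 2, mu + sigma * xi_approx(n) / 2
q1, q3 = mu - sigma * eta_approx(n) / 2, mu + sigma * eta_approx(n) / 2
print(sd_c2(a, q1, q3, b, n))  # recovers sigma up to rounding
```

Averaging the two components with equal weight follows (12); a summary whose range and IQR disagree strongly about the spread will show up as a large gap between the two terms.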

Table 2 Values of $\eta(n)$ in formula (12) and formula (15) for $Q \leq 50$, where $n = 4Q + 1$

Formula (13) is more concise than (11). We will compare these formulas numerically in a simulation study.

III. Scenario C3: Estimation from Quartiles, Median, and Sample Size – IQR Focused

Scenario C3 reports the first and third quartiles, median, and sample size, but not the minimum and maximum values. This scenario is particularly relevant when dealing with datasets where outliers or skewness are a concern, as the IQR is less sensitive to extreme values than the range. Hozo et al.’s and Bland’s methods, as presented, are not directly applicable to C3. Following their inequality-based approach leads to unbounded intervals and thus fails to provide useful estimates.

The common, but flawed, approach in literature [11, 12] involves imputing range from IQR and median, which performs poorly in simulations (not shown).

A Quantile Method for Estimating Mean and Standard Deviation in Scenario C3

We propose a quantile-based method for estimating the mean and standard deviation in Scenario C3. Starting from the mean estimation in Scenario C2 (10):

$\bar{X} \approx \frac{a + 2q_1 + 2m + 2q_3 + b}{8} = \frac{a + b}{8} + \frac{q_1 + m + q_3}{4}$.

In Scenario C3, $a$ and $b$ are unknown. Removing them and adjusting the denominator of the second term, we propose a mean estimator of the form $\bar{X} \approx (q_1 + m + q_3)/C$. By symmetry of the standard normal distribution, $E(Z_{(Q+1)}) = -E(Z_{(3Q+1)})$ and $E(Z_{(2Q+1)}) = 0$, so $E(q_1 + m + q_3) = 3\mu + \sigma E(Z_{(Q+1)} + Z_{(2Q+1)} + Z_{(3Q+1)}) = 3\mu$. We therefore set $C = 3$:

$\bar{X} \approx \frac{q_1 + m + q_3}{3}$. (14)

For standard deviation, following the approach in (12), and recognizing that in Scenario C3, IQR becomes the primary indicator of data dispersion, especially if skewness is suspected, we propose using only the IQR for estimation:

$S \approx \frac{q_3 - q_1}{\eta(n)}$, (15)

where $\eta(n) = 2E(Z_{(3Q+1)})$, as before. Since $E(q_3 - q_1) = \sigma \eta(n)$, the ratio in (15) has expectation $\sigma$, which makes it a natural estimator of $S$. Values of $\eta(n)$ are in Table 2. For large $n$, using the approximation for $E(Z_{(3Q+1)})$, we have:

$S \approx \frac{q_3 - q_1}{2\Phi^{-1}\left(\frac{0.75n - 0.125}{n + 0.25}\right)}$. (16)

The Cochrane Handbook [13] provides a similar IQR-based estimator:

$S \approx \frac{q_3 - q_1}{1.35}$. (17)

Estimator (17) is also independent of sample size and may be less accurate for small samples. From Table 2, $\eta(n)$ in (15) approaches approximately 1.35 as $n$ increases, and the denominator in (16) converges to $2\Phi^{-1}(0.75) \approx 1.34898$ as $n \rightarrow \infty$. For small sample sizes, our estimators (15) and (16) offer more accurate standard deviation estimates than the fixed denominator in formula (17).
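The Scenario C3 estimators (14), (16), and (17) can be compared side by side. The Python sketch below is illustrative only; the function names and the quartile values are hypothetical.

```python
from statistics import NormalDist


def mean_c3(q1, m, q3):
    """Formula (14): average of the two quartiles and the median."""
    return (q1 + m + q3) / 3


def sd_c3(q1, q3, n):
    """Formula (16): IQR divided by a sample-size-dependent divisor."""
    p = (0.75 * n - 0.125) / (n + 0.25)
    return (q3 - q1) / (2 * NormalDist().inv_cdf(p))


def sd_cochrane(q1, q3):
    """Formula (17): IQR divided by the fixed constant 1.35."""
    return (q3 - q1) / 1.35


# Hypothetical quartiles reported for studies of different sizes.
q1, m, q3 = 4.0, 5.0, 9.0
print(mean_c3(q1, m, q3))
for n in (9, 41, 401):
    print(n, sd_c3(q1, q3, n), sd_cochrane(q1, q3))
```

For small $n$ the divisor in (16) is noticeably below 1.35, so (16) returns a larger estimate than (17); the two agree closely once $n$ reaches the hundreds.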

Conclusion

This article has presented and refined methods for estimating the mean and standard deviation from summary statistics, focusing on scenarios relevant to meta-analyses and clinical research. We highlighted the limitations of range-based estimations, especially in the context of varying sample sizes and potential data skewness. By incorporating sample size-adjusted factors and emphasizing the use of IQR, particularly in Scenario C3, we offer more robust and accurate estimation techniques. While range can be influenced by outliers and skewness, IQR provides a more stable measure of dispersion, making it especially valuable when dealing with datasets that may deviate from normality. The proposed methods provide practical tools for researchers to extract meaningful statistical insights even when individual-level data is unavailable, improving the utility of summary data in statistical analysis.