| Title: | Modern Resampling Methods: Bootstraps, Wild, Block, Permutation, and Selection Guidance |
|---|---|
| Description: | Implements modern resampling and permutation methods for robust statistical inference without restrictive parametric assumptions. Provides bias-corrected and accelerated (BCa) bootstrap (Efron and Tibshirani (1993) <doi:10.1201/9780429246593>), wild bootstrap for heteroscedastic regression (Liu (1988) <doi:10.1214/aos/1176351062>, Davidson and Flachaire (2008) <doi:10.1016/j.jeconom.2008.08.003>), block bootstrap for time series (Politis and Romano (1994) <doi:10.1080/01621459.1994.10476870>), and permutation-based multiple testing correction (Westfall and Young (1993) <ISBN:0-471-55761-7>). Methods handle non-normal data, heteroscedasticity, time series correlation, and multiple comparisons. |
| Authors: | Ibrahim Kholil Rakib [aut, cre] |
| Maintainer: | Ibrahim Kholil Rakib <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-05-19 08:16:58 UTC |
| Source: | https://github.com/ikrakib/modernboot |
Inspects data structure and recommends an appropriate resampling method.
auto_select_method(data, cluster = NULL)auto_select_method(data, cluster = NULL)
data |
numeric vector, matrix, or time series object representing data to analyze. For vectors: univariate sample. For matrices/data.frames: multivariate data with rows=observations, columns=variables. For ts objects: automatically detected as time series. |
cluster |
optional vector of cluster IDs (integer or factor). Length must equal nrow(data) for matrices or length(data) for vectors. Identifies hierarchical grouping (e.g., repeated measures within subjects, multiple samples from same source). If NULL (default), clustering is not considered. If provided, triggers clustered bootstrap recommendation. |
Decision logic: - If clusters provided: use clustered bootstrap - If time series detected: use block or stationary bootstrap - If multivariate matrix: use permutation maxT - Otherwise (IID data): use standard nonparametric bootstrap
A list with two character string elements:
method |
recommended resampling method name. Values: "clustered_boot" (if cluster provided), "block_boot" (if time series detected), "perm_maxT" (if multivariate matrix detected), "nonparametric_boot" (default for IID univariate data) |
rationale |
human-readable explanation of recommendation. Describes why chosen method is appropriate and what structure the data exhibits. |
# IID data x <- rnorm(50) auto_select_method(x) # Time series ts_data <- arima.sim(n = 100, list(ar = 0.7)) auto_select_method(ts_data)# IID data x <- rnorm(50) auto_select_method(x) # Time series ts_data <- arima.sim(n = 100, list(ar = 0.7)) auto_select_method(ts_data)
Computes BCa confidence interval for the mean using the boot package.
bca_ci(x, R = 2000, conf = 0.95)bca_ci(x, R = 2000, conf = 0.95)
x |
numeric vector of data. |
R |
integer number of bootstrap replicates (default 2000). |
conf |
numeric confidence level between 0 and 1 (default 0.95). |
A list with elements boot (object) and ci (matrix).
set.seed(42) x <- rnorm(50, mean = 5, sd = 2) result <- bca_ci(x, R = 500) result$ciset.seed(42) x <- rnorm(50, mean = 5, sd = 2) result <- bca_ci(x, R = 500) result$ci
Compute a bootstrap distribution for the sample mean.
bs_mean(x, R = 2000, conf = 0.95)bs_mean(x, R = 2000, conf = 0.95)
x |
numeric vector of data. |
R |
integer number of bootstrap replicates (default 2000). |
conf |
numeric confidence level between 0 and 1 (default 0.95). |
A list with elements stat (mean), boot (replicates), and ci (interval).
set.seed(42) x <- rnorm(50, mean = 5, sd = 2) result <- bs_mean(x, R = 500) result$ciset.seed(42) x <- rnorm(50, mean = 5, sd = 2) result <- bs_mean(x, R = 500) result$ci
Runs a simulation study comparing different bootstrap and permutation methods. Useful for empirical evaluation of method performance and coverage.
compare_methods_sim(data_generator, Rsim = 100, Rboot = 1000, parallel = TRUE)compare_methods_sim(data_generator, Rsim = 100, Rboot = 1000, parallel = TRUE)
data_generator |
function taking no arguments and returning numeric vector
of data. Called Rsim times to generate independent datasets.
Example: |
Rsim |
integer number of simulated datasets to generate (default 100). Larger values (500+) provide more stable coverage rate estimates but increase computation time. Each simulation applies both bs_mean and bca_ci methods independently. Must be >= 1. |
Rboot |
integer number of bootstrap replicates used within each method per dataset (default 1000). Matches R parameter passed to bs_mean and bca_ci. Larger values (2000-5000) improve interval accuracy but increase total computation (total bootstrap samples = Rsim * Rboot). Must be >= 1. |
parallel |
logical enable parallel computation across simulations
(default TRUE). Uses |
This function repeatedly generates data from data_generator and applies bs_mean and bca_ci methods. Results can be used to study coverage rates, interval width, and method robustness.
A list of length Rsim. Each element is a list with two components:
bs |
result from |
bca |
result from |
Use to assess coverage rates: did true parameter fall within returned CIs? Interval width variation indicates method stability.
# Fast example (runs in < 5 sec) - UNWRAPPED set.seed(42) generator <- function() rnorm(20) # Very small simulation for demonstration results_fast <- compare_methods_sim(generator, Rsim = 5, Rboot = 100) # Realistic simulation (takes > 5 sec) - WRAPPED in \donttest set.seed(42) generator <- function() rnorm(50, mean = 5, sd = 2) sim_results <- compare_methods_sim(generator, Rsim = 50, Rboot = 500) # Analyze coverage: does true mean (5) fall in each CI? coverage_bs <- mean(sapply(sim_results, function(res) { res$bs$ci[1] <= 5 && 5 <= res$bs$ci[2] })) coverage_bs# Fast example (runs in < 5 sec) - UNWRAPPED set.seed(42) generator <- function() rnorm(20) # Very small simulation for demonstration results_fast <- compare_methods_sim(generator, Rsim = 5, Rboot = 100) # Realistic simulation (takes > 5 sec) - WRAPPED in \donttest set.seed(42) generator <- function() rnorm(50, mean = 5, sd = 2) sim_results <- compare_methods_sim(generator, Rsim = 50, Rboot = 500) # Analyze coverage: does true mean (5) fall in each CI? coverage_bs <- mean(sapply(sim_results, function(res) { res$bs$ci[1] <= 5 && 5 <= res$bs$ci[2] })) coverage_bs
Performs moving block bootstrap resampling for dependent data (time series). Preserves temporal dependence structure within blocks.
moving_block_boot(x, block_size = 10, R = 1000)moving_block_boot(x, block_size = 10, R = 1000)
x |
numeric vector or time series object. Should be univariate time series with temporal dependence (autocorrelation). If using ts object, inherits frequency information. Length n >= block_size required. |
block_size |
integer length of consecutive observations to keep together in bootstrap samples (default 10). Rule of thumb: approximately sqrt(n) where n is series length. Must be >= 1 and <= length(x). Larger blocks preserve longer-range dependence; smaller blocks reduce distortion but may not capture autocorrelation structure. |
R |
integer number of bootstrap replicates (default 1000). Each replicate is a complete time series of length n obtained by concatenating randomly selected blocks. Must be >= 1. |
The moving block bootstrap divides the time series into overlapping blocks of length block_size and resamples these blocks with replacement. This preserves short-range dependence while allowing the empirical sampling distribution to reflect dependence.
A list of length R. Each element is a numeric vector of length n (same as original series length), representing one bootstrap replicate of the time series. Replicates preserve block structure and local dependence within blocks, though global autocorrelation structure may be altered.
set.seed(42) x <- arima.sim(n = 100, list(ar = 0.7)) result <- moving_block_boot(x, block_size = 10, R = 100) length(result) # 100 bootstrap replicatesset.seed(42) x <- arima.sim(n = 100, list(ar = 0.7)) result <- moving_block_boot(x, block_size = 10, R = 100) length(result) # 100 bootstrap replicates
Performs permutation-based multiple testing correction using the maxT method. Controls family-wise error rate (FWER) while testing multiple hypotheses.
perm_maxT(data_matrix, groups, R = 2000)perm_maxT(data_matrix, groups, R = 2000)
data_matrix |
numeric matrix of dimensions (n x p) where n = number of observations (samples), p = number of variables (features/genes/voxels). Rows are observations, columns are variables. No NAs allowed; remove or impute before calling. Example: gene expression matrix with n = 50 samples, p = 10000 genes. |
groups |
factor or vector of group labels (length n). Should have exactly 2 unique levels representing group membership. Numeric (0/1) or character ("control"/"treatment") both acceptable. Order corresponds to data_matrix rows. |
R |
integer number of permutation replicates (default 2000). Larger values (5000-10000) recommended for stable p-values. Computational cost scales linearly with R. Must be >= 1. |
The maxT method conducts individual t-tests for each variable, then corrects for multiple comparisons using the distribution of the maximum absolute t-statistic under permutation. This maintains FWER at the specified level while preserving power.
A list with two elements:
obs |
numeric vector of length p, observed t-statistics for each variable. Positive/negative values indicate direction of difference. Names preserved from column names of data_matrix if present. |
p.values |
numeric vector of length p, FWER-adjusted p-values. Computed as proportion of permutation replicates where max(|t_permuted|) >= |t_observed|. Controls family-wise error rate at level approximately R/(R+1). Values automatically sorted to match obs vector. |
set.seed(42) data <- matrix(rnorm(200), nrow = 50, ncol = 4) groups <- rep(0:1, each = 25) result <- perm_maxT(data, groups, R = 500) result$p.valuesset.seed(42) data <- matrix(rnorm(200), nrow = 50, ncol = 4) groups <- rep(0:1, each = 25) result <- perm_maxT(data, groups, R = 500) result$p.values
Performs exact or approximate permutation test for comparing two independent samples. Tests null hypothesis that two groups have equal distributions.
perm_test_2sample(x, y, R = 5000, stat = function(a, b) mean(a) - mean(b))perm_test_2sample(x, y, R = 5000, stat = function(a, b) mean(a) - mean(b))
x |
numeric vector, first sample (group A). May contain NAs which are removed. Minimum 2 observations required. Length need not equal length(y). |
y |
numeric vector, second sample (group B). May contain NAs which are removed. Minimum 2 observations required. Compared against x for differences. |
R |
integer number of permutation replicates (default 5000). Larger values (e.g., 10000) improve p-value accuracy. Must be >= 1. Exact test feasible when R = C(n1+n2, n1) (factorial complexity), otherwise uses approximate permutation distribution. |
stat |
function taking two arguments (a, b) and returning numeric scalar
test statistic. Default: |
Under the null hypothesis of equal distributions, exchanging group labels should be equally likely. The p-value is the proportion of permutations with test statistic at least as extreme as observed (in absolute value).
A list with three elements:
obs |
numeric scalar, the observed test statistic computed on original data |
reps |
numeric vector of length R, test statistics under random permutations of group labels. Represents null distribution assuming equal distributions between groups. |
p.value |
numeric scalar between 0 and 1, two-sided p-value computed as proportion of permutation replicates with |stat| >= |obs|. Value 1/R is smallest achievable p-value. |
set.seed(42) group1 <- rnorm(20, mean = 0) group2 <- rnorm(20, mean = 0.5) result <- perm_test_2sample(group1, group2, R = 1000) result$p.valueset.seed(42) group1 <- rnorm(20, mean = 0) group2 <- rnorm(20, mean = 0.5) result <- perm_test_2sample(group1, group2, R = 1000) result$p.value
Performs stationary bootstrap for dependent data with random block lengths. More flexible than fixed-block bootstrap for time series with variable dependence.
stationary_boot(x, p = 0.1, R = 1000)stationary_boot(x, p = 0.1, R = 1000)
x |
numeric vector or time series object. Should be univariate time series with temporal dependence. Length n >= 2 required. If ts object, frequency information is not preserved in output. |
p |
numeric probability parameter controlling average block length (default 0.1). Must satisfy 0 < p <= 1. Average block length approximately 1/p: set p = 0.1 for average blocks of ~10 observations, p = 0.05 for ~20 observations. Smaller p values preserve longer-range dependence; larger p values reduce distortion. |
R |
integer number of bootstrap replicates (default 1000). Each replicate is complete time series of length n with random block structure. Must be >= 1. |
The stationary bootstrap uses random block lengths drawn from a geometric distribution. This avoids artificial periodicity inherent in fixed-block methods. Set p = 1/m to have average block length approximately m.
A list of length R. Each element is a numeric vector of length n (matching original series). Unlike moving block bootstrap, block lengths vary randomly following geometric distribution, avoiding artificial periodicity. Replicates preserve local dependence structure more flexibly than fixed-block methods.
set.seed(42) x <- arima.sim(n = 100, list(ar = 0.7)) # Average block length of 10: p = 0.1 result <- stationary_boot(x, p = 0.1, R = 100) length(result) # 100 bootstrap replicatesset.seed(42) x <- arima.sim(n = 100, list(ar = 0.7)) # Average block length of 10: p = 0.1 result <- stationary_boot(x, p = 0.1, R = 100) length(result) # 100 bootstrap replicates
Computes a studentized (t-based) bootstrap confidence interval for quantiles. More accurate than percentile intervals, especially for extreme quantiles.
studentized_ci(x, q = 0.5, R = 1000, Rinner = 200, conf = 0.95)studentized_ci(x, q = 0.5, R = 1000, Rinner = 200, conf = 0.95)
x |
numeric vector of data. May contain NAs which are removed. Minimum 2 observations required. Works well for skewed distributions and extreme quantiles where percentile bootstrap performs poorly. |
q |
numeric quantile level between 0 and 1 (default 0.5 for median). 0.25 = first quartile, 0.75 = third quartile, 0.95 = 95th percentile. Method particularly effective for extreme quantiles (q < 0.1 or q > 0.9). |
R |
integer number of outer bootstrap replicates (default 1000). This is primary bootstrap sample for t-distribution estimation. Recommended: 500-1000 for exploration, 2000+ for publication. Must be >= 1. |
Rinner |
integer number of inner bootstrap replicates for standard error estimation of each outer replicate (default 200). Larger values increase accuracy but also computation time (total iterations = R * Rinner). Recommended: 100-200. Must be >= 1. |
conf |
numeric confidence level between 0 and 1 (default 0.95). Standard: 0.95 or 0.99. Higher values give wider intervals. |
Studentized bootstrap uses the bootstrap t-distribution to construct intervals. This method often provides better coverage than percentile bootstrap, especially for skewed distributions or extreme quantiles.
Computation is intensive: O(R * Rinner) bootstrap samples are generated.
A numeric vector of length 2 with names c("lower", "upper"):
lower |
numeric, lower confidence limit for the q-th quantile |
upper |
numeric, upper confidence limit for the q-th quantile |
Uses studentized bootstrap t-distribution to construct interval, providing better coverage probability than percentile method, especially for skewed distributions and extreme quantiles.
set.seed(42) x <- rexp(30) # Smaller sample: 30 instead of 50 # Fast example with reduced replications (< 5 sec) - UNWRAPPED studentized_ci(x, q = 0.75, R = 100, Rinner = 20) # Larger, more accurate example (takes > 5 sec) - WRAPPED in \donttest set.seed(42) x <- rexp(50) studentized_ci(x, q = 0.75, R = 1000, Rinner = 200)set.seed(42) x <- rexp(30) # Smaller sample: 30 instead of 50 # Fast example with reduced replications (< 5 sec) - UNWRAPPED studentized_ci(x, q = 0.75, R = 100, Rinner = 20) # Larger, more accurate example (takes > 5 sec) - WRAPPED in \donttest set.seed(42) x <- rexp(50) studentized_ci(x, q = 0.75, R = 1000, Rinner = 200)
Performs wild bootstrap resampling for linear regression models to handle heteroscedasticity. Supports Rademacher and Mammen weight schemes.
wild_boot_lm(fit, R = 2000, type = c("rademacher", "mammen"))wild_boot_lm(fit, R = 2000, type = c("rademacher", "mammen"))
fit |
an object of class 'lm' from |
R |
integer number of bootstrap replicates (default 2000). Larger values (5000-10000) recommended for confidence intervals in publications. Must be >= 1 and a whole number. |
type |
character string specifying weight distribution scheme. Options: "rademacher" (default, faster, robust) generates weights as +1/-1 with equal probability. "mammen" (asymptotically optimal) uses empirically calibrated Mammen distribution with golden ratio. Choice has minimal practical impact on results. |
The wild bootstrap works by resampling residuals with random signs/weights while keeping predictors fixed. This is particularly useful for heteroscedastic data.
A list with two elements:
coef |
numeric vector of original fitted model coefficients, including intercept if present. Names preserved from original model. |
boot |
numeric matrix of dimensions (p x R) where p is number of coefficients. Each column contains one bootstrap replicate of coefficients. Row names are coefficient names; column names are bootstrap iteration numbers. |
set.seed(42) x <- rnorm(50) y <- 2 + 1.5 * x + rnorm(50, sd = abs(x)) # heteroscedastic fit <- lm(y ~ x) result <- wild_boot_lm(fit, R = 500, type = "rademacher") head(result$boot, 3)set.seed(42) x <- rnorm(50) y <- 2 + 1.5 * x + rnorm(50, sd = abs(x)) # heteroscedastic fit <- lm(y ~ x) result <- wild_boot_lm(fit, R = 500, type = "rademacher") head(result$boot, 3)