MECfda: An R package for bias correction due to measurement error in functional and scalar covariates in scalar-on-function regression models

Introduction

Functional data analysis (FDA) is an essential statistical approach for handling high-dimensional data that appear in the form of functions or images [1–3]. In many applications, data are collected continuously or at frequent time intervals, resulting in complex functions over time. When the objective is to study the relationship between a scalar response and both functional and scalar covariates, scalar-on-function regression models become a natural choice.

Although self-reported measures (e.g., dietary intake) are known to suffer from measurement error [4], recent studies have demonstrated that even device-based measurements (such as those from wearable devices) can be error prone [5, 6]. Failing to correct for measurement error in either scalar or function-valued covariates may lead to biased parameter estimates.

The goal of the MECfda package is to provide solutions for a range of scalar-on-function regression models—including multi-level generalized scalar-on-function regression models and functional quantile regression models—while incorporating several bias-correction methods for measurement error in scalar-on-function regression problem.

This package is developed in R.

library(MECfda)

Scalar-on-Function Linear Regression Models

The general form of the scalar-on-function regression model is given by $$T(F_{Y_i|X_i,Z_i}) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) dt + (1,Z_i^T)\gamma$$ where

  • Yi is the scalar response variable;
  • Xli(t) denotes the lth function-valued covariate defined on the domain Ωl;
  • βl(t) is the corresponding function-valued regression coefficient, and both βl and Xli belong to L2(Ωl);
  • Zi = (Z1i, …, Zqi)T represents the scalar covariates;
  • γ = (γ0, γ1, …, γq)T is the parameter vector for the scalar covariates;
  • FYi ∣ Xi, Zi is the cumulative distribution function of Yi conditional on Xi and Zi;
  • T(⋅) is a statistical functional.

A statistical functional is defined as a mapping from a probability distribution to a real number. Statistical functionals represent quantifiable properties of probability distributions, (e.g., an expectation, a variance or a quantile).

Two common model types supported in this package are:

  1. Multi-level Generalized Scalar-on-Function Regression:
    This model is formulated as $$ T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = g\Bigl\{E(Y_i\mid X_i,Z_i)\Bigr\} = g\left\{\int_\mathbb{R}ydF_{Y_i|X_i,Z_i}(y)\right\}, $$ where g(⋅) is a strictly monotonic link function analogous to that used in generalized linear models.

  2. Scalar-on-Function Quantile Regression:
    This model is expressed as T(FYi ∣ Xi, Zi) = QYi ∣ Xi, Zi(τ) = FYi ∣ Xi, Zi−1(τ), where τ ∈ (0, 1) denotes the quantile of interest.

Representation of Functional Data

Function-valued variables (functional variables) are typically recorded as measurements at specific time points, resulting in an n × m data matrix (xij), where each row corresponds to an observation (subject) and each column to a measurement time.

Functional Data

Function-valued variables (functional variables) are typically recorded as the measurements of the values of the functions at specific (time) points in the domain. The data of a function-valued variable are often presented in the form of a matrix, (xij)n × m, where xij = fi(tj), represents the value of fi(tj), each row corresponds to an observation (subject) and each column to a measurement time.

S4 Class functional_variable

In the MECfda package, the s4 class functional_variable represents function-valued data stored in matrix form. The main slots include:

  • X: The data matrix (xij);
  • t_0 and period: The left endpoint and length of the function’s domain, respectively (The class only support the domain in the format of interval);
  • t_points: A vector of the time points at which the measurements are recorded.

A dedicated method, dim, returns the number of subjects and the number of measurement points for a functional_variable object.

fv = functional_variable(
  X = matrix(rnorm(10*24),10,24),
  t_0 = 0,
  period = 1,
  t_points = (0:9)/10
)
dim(fv)
#>     subject time_points 
#>          10          24

Basis Expansion

Let {ρk}k = 1 be a complete basis for L2(Ω). For an arbitrary function β(t) ∈ L2(Ω), there exists a sequence of (real) number {ck} such that $$ \beta(t) = \sum_{k=1}^\infty c_k \, \rho_k(t). $$

For the function-valued variable Xi(t), we have $$ \int_\Omega \beta(t) X_i(t) \, dt = \int_\Omega X_i(t) \sum_{k=1}^\infty c_{k}\rho_{k}(t) dt = \sum_{k=1}^\infty c_k \int_\Omega \rho_k(t) X_i(t) \, dt. $$

In practice, for Ωβ(t)Xi(t)dt in statistical models, we truncate the infinite series to p terms (p < ∞), using ckΩXi(t)ρk(t)dt to approximate Ωβ(t)Xi(t)dt so that the problem reduces to estimating a finite number of parameters c1, …, cp, and treating Ωρk(t)Xi(t) dt as new covariates. The number of basis, p, can be related to the sample size, n, and the p is in a form of p(n).

In practice, we may not necessarily use the truncated complete basis of L2 function space. Instead, we can just use a finite sequence of linearly independent functions as the basis function. Common choices for the basis include the Fourier basis, b-spline basis, and eigenfunction basis.

In numerical computing level, performing basis expansion methods on function-valued variable data in matrix form as we mentioned is to compute (bik)n × p, where bik = ∫Ωf(t)ρk(t)dt.

Fourier Basis

On the interval [t0, t0 + T], the Fourier basis consists of

$$ \frac{1}{2}, \quad \cos\left(\frac{2\pi}{T}k(x-t_0)\right), \quad \sin\left(\frac{2\pi}{T}k(x-t_0)\right), \quad k = 1, 2, \dots $$

S4 Class Fourier_series

In the MECfda package, s4 class Fourier_series represents a Fourier series of the form

$$ \frac{a_0}{2} + \sum_{k=1}^{p_a} a_k \cos\left(\frac{2\pi k (x-t_0)}{T}\right) + \sum_{k=1}^{p_b} b_k \sin\left(\frac{2\pi k (x-t_0)}{T}\right),\quad x \in [t_0, t_0+T]. $$

Its main slots include:

  • double_constant: The value of a0;
  • cos and sin: The coefficients ak and bk for the cosine and sine terms, respectively;
  • k_cos and k_sin: The frequency values corresponding to the cosine and sine coefficients;
  • t_0 and period: The left endpoint, t0, and length, T, of the domain (interval [t0, t0 + T]), respectively.

Additional methods such as plot, FourierSeries2fun, and extractCoef are provided for visualization, function evaluation, and coefficient extraction.

fsc = Fourier_series(
  double_constant = 3,
  cos = c(0,2/3),
  sin = c(1,7/5),
  k_cos = 1:2,
  k_sin = 1:2,
  t_0 = 0,
  period = 1
)
plot(fsc)

FourierSeries2fun(fsc,seq(0,1,0.05))
#>  [1]  2.1666667  3.1712610  3.6252757  3.4344848  2.7346112  1.8333333
#>  [7]  1.0888125  0.7715265  0.9623175  1.5254623  2.1666667  2.5532270
#> [13]  2.4497052  1.8164508  0.8324982 -0.1666667 -0.8133005 -0.8465074
#> [19] -0.2132530  0.9074283  2.1666667
extractCoef(fsc)
#> $a_0
#> 0 
#> 3 
#> 
#> $a_k
#>         1         2 
#> 0.0000000 0.6666667 
#> 
#> $b_k
#>   1   2 
#> 1.0 1.4

The object fsc represents the summation $$\frac32 + \frac23 \cos(2\pi\cdot2x) + \sin(2\pi x) + \frac75\sin(2\pi\cdot2x).$$

B-spline Basis

A b-spline basis {Bi, p(x)}i = −pk on the interval [t0, tk + 1] is defined recursively as follows: - For p = 0, $$ B_{i,0}(x) = \begin{cases} I_{(t_i, t_{i+1}]}(x), & i = 0, 1, \dots, k, \\ 0, & \text{otherwise}. \end{cases} $$ - For p > 0, the recursion is given by $$ B_{i,p}(x) = \frac{x-t_i}{t_{i+p}-t_i} B_{i,p-1}(x) + \frac{t_{i+p+1}-x}{t_{i+p+1}-t_{i+1}} B_{i+1,p-1}(x). $$

At discontinuities in the interval (t0, tk), the basis function is defined as its limit, Bi, p(x) = limt → xBi, p(t), to ensure continuity.

S4 Classes bspline_basis and bspline_series

The s4 class bspline_basis represents a b-spline basis, {Bi, p(x)}i = −pk, on the interval [t0, tk + 1]. Its key slots include:

  • Boundary.knots: The boundary [t0, tk + 1], (default is [0, 1]);
  • knots: The internal spline knots, (t1, …, tk), (automatically generated as equally spaced, $t_j = t_0 + j\cdot\frac{t_{k+1}-t_0}{k+1}$, if not assigned);
  • intercept: Indicates whether an intercept is included (default is TRUE);
  • df: Degrees of freedom of the basis, which is the number of the splines, equal to p + k + 1 (by default k = 0 and df = p + 1);
  • degree: The degree of the spline, which is the degree of piecewise polynomials, p, (default is 3).

The s4 class bspline_series represents the linear combination

$$ \sum_{i=0}^{k} b_i B_{i,p}(x), $$

where the slot coef stores the coefficients bi (i = 0, …, k) and bspline_basis specifies the associated basis (represented by a bspline_basis object).

Methods such as plot and bsplineSeries2fun are provided for visualization and function evaluation.

bsb = bspline_basis(
  Boundary.knots = c(0,24),
  df             = 7,
  degree         = 3
)
bss = bspline_series(
  coef = c(2,1,3/4,2/3,7/8,5/2,19/10),
  bspline_basis = bsb
)
plot(bss)

bsplineSeries2fun(bss,seq(0,24,0.5))
#>  [1] 2.0000000 1.7677509 1.5690908 1.4011502 1.2610597 1.1459499 1.0529514
#>  [8] 0.9791948 0.9218107 0.8779297 0.8446824 0.8191993 0.7986111 0.7805266
#> [15] 0.7644676 0.7504340 0.7384259 0.7284433 0.7204861 0.7145544 0.7106481
#> [22] 0.7087674 0.7089120 0.7110822 0.7152778 0.7216857 0.7312404 0.7450629
#> [29] 0.7642747 0.7899969 0.8233507 0.8654574 0.9174383 0.9804145 1.0555073
#> [36] 1.1438380 1.2465278 1.3634786 1.4897151 1.6190430 1.7452675 1.8621942
#> [43] 1.9636285 2.0433759 2.0952418 2.1130317 2.0905511 2.0216053 1.9000000

The object bsb represents {Bi, 3(x)}i = −30, and the object bss represents
$$2B_{i,-3}(x)+B_{i,-2}(x)+\frac34B_{i,-1}(x)+\frac23B_{i,0}(x) + \frac78B_{i,1}(x) + \frac52B_{i,2}(x) +\frac{19}{10}B_{i,3}(x),$$ where x ∈ [t0, t4] and t0 = 0, tk = tk − 1 + 6 ,k = 1, 2, 3, 4.

Eigenfunction basis

Suppose K(s, t) ∈ L2(Ω × Ω), f(t) ∈ L2(Ω). Then K induces an linear operator 𝒦 by (𝒦f)(x) = ∫ΩK(t, x)f(t)dt If ξ ∈ L2(Ω) such that 𝒦ξ = λξ where λ ∈ ℂ, we call ξ an eigenfunction/eigenvector of 𝒦, and λ an eigenvalue associated with ξ.

For a stochastic process {X(t), t ∈ Ω} we call the orthogonal basis, {ξk}k = 1 corresponding to eigenvalues {λk}k = 1 (λ1 ≥ λ2 ≥ …), induced by K(s, t) = Cov (X(t), X(s)) a functional principal component (FPC) basis for L2(Ω).

The eigenfunction basis relies on the covariance function K(s, t). And we usually cannot analytically express the equation of the basis funcitons in eigenfunction basis. In practice, the covariance function is estimated from the data and the eigenfunctions are computed numerically.

Numerical Basis

Sometimes, analytical expressions for the basis functions are not available. In these cases, they can be represented numerically. Let {ρk}k = 1 denote a basis of the function space. We can numerically approximate the basis by storing the values of a finite subset of these functions at a finite set of points in the domain, i.e., ρk(tj) for k = 1, …, p and j = 1, …, m.

S4 Class numeric_basis and numericBasis_series

In this package, the s4 class numeric_basis is designed to represent a finite sequence of functions {ρk}k = 1p by their values at a finite set of points within their domain. In this context, all functions share the same domain, which is assumed to be an interval. The key slots include:

  • basis_function: The matrix (ζjk)m × p, where each element is defined as ζjk = ρk(tj) for j = 1, …, m and k = 1, …, p. In this matrix, each row corresponds to a point in the domain, and each column corresponds to a specific basis function.
  • t_points: A numeric vector that represents the points in the domain at which the basis functions are evaluated. The j-th element of this vector corresponds to the j-th row of the basis_function matrix.
  • t_0 and period: The left endpoint and length of the domain, respectively.

Additionally, the package provides an s4 class numericBasis_series, which represents a linear combination of the basis functions represented by a numeric_basis object. The key slots include:

  • coef: The linear coefficients for the summation series.
  • numeric_basis: Function basis as represented by a numeric_basis object.

Function basis2fun

The generic function basis2fun is provided for Fourier_series, bspline_series, and numericBasis_series objects. For a Fourier_series object, it is equivalent to FourierSeries2fun; for a bspline_series object, it is equivalent to bsplineSeries2fun; For a numericBasis_series object, it is equivalent to numericBasisSeries2fun.

basis2fun(fsc,seq(0,1,0.05))
#>  [1]  2.1666667  3.1712610  3.6252757  3.4344848  2.7346112  1.8333333
#>  [7]  1.0888125  0.7715265  0.9623175  1.5254623  2.1666667  2.5532270
#> [13]  2.4497052  1.8164508  0.8324982 -0.1666667 -0.8133005 -0.8465074
#> [19] -0.2132530  0.9074283  2.1666667
basis2fun(bss,seq(0,24,0.5))
#>  [1] 2.0000000 1.7677509 1.5690908 1.4011502 1.2610597 1.1459499 1.0529514
#>  [8] 0.9791948 0.9218107 0.8779297 0.8446824 0.8191993 0.7986111 0.7805266
#> [15] 0.7644676 0.7504340 0.7384259 0.7284433 0.7204861 0.7145544 0.7106481
#> [22] 0.7087674 0.7089120 0.7110822 0.7152778 0.7216857 0.7312404 0.7450629
#> [29] 0.7642747 0.7899969 0.8233507 0.8654574 0.9174383 0.9804145 1.0555073
#> [36] 1.1438380 1.2465278 1.3634786 1.4897151 1.6190430 1.7452675 1.8621942
#> [43] 1.9636285 2.0433759 2.0952418 2.1130317 2.0905511 2.0216053 1.9000000

Basis expansion methods for the functional_variable class

The MECfda pakcage provide the methods fourier_basis_expansion, bspline_basis_expansion, FPC_basis_expansion, and numeric_basis_expansion for the class functional_variable, which perform basis expansion using Fourier basis, b-spline basis, functional principal component basis, and numerical basis respectively.

data(MECfda.data.sim.0.0)
fv = MECfda.data.sim.0.0$FC[[1]]
BE.fs = fourier_basis_expansion(fv,5L)
BE.bs = bspline_basis_expansion(fv,5L,3L)

Numerical Computation of Integrals

The package approximates the integral

$$ \int_{\Omega} \rho_{k}(t) X_{i}(t) \, dt \approx \frac{\mu(\Omega)}{|T|} \sum_{t \in T} \rho_{k}(t) X_{i}(t), $$

where μ(⋅) denote the Lebesgue measure. T denotes the set of measurement time points for Xi(t) and |T| represents the cardinal number of T.

Scalar-on-Function Linear Regression in MECfda

fcRegression

fcRegression(Y, FC, Z, formula.Z, family = gaussian(link = "identity"),
             basis.type = c("Fourier", "Bspline"), basis.order = 6L, 
             bs_degree = 3)

The MECfda package provides a function, fcRegression, for fitting generalized linear mixed-effect models—including ordinary linear models and generalized linear models with fixed and random effects—by discretizing function-valued covariates using basis expansion. The function can solve a linear model of the form

$$ g\bigl(E(Y_i\mid X_i,Z_i)\bigr) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) \, dt + (1,Z_i^T)\gamma. $$

It supports one or multiple function-valued covariates as fixed effects, and zero, one, or multiple scalar covariates, which can be specified as either fixed or random effects. The response variable, function-valued covariates, and scalar covariates are supplied separately via the arguments Y, FC, and Z, respectively, allowing great flexibility in data format.

  • Response Variable:
    The response can be provided as an atomic vector, a one-column matrix, or a data frame. However, a one-column data frame or matrix with a column name is recommended so that the response variable’s name is clearly specified.

  • Function-Valued Covariates:
    The function-valued covariates can be input as a functional_variable object, a matrix, a data frame, or a list of these objects. If a single functional_variable object (or matrix or data frame) is provided via FC, the model includes one function-valued covariate. If a list is provided, the model accommodates multiple function-valued covariates, with each element in the list representing a distinct covariate.

  • Scalar Covariates:
    Scalar covariates can be provided as a matrix, data frame, atomic vector, or even NULL if no scalar covariates are present. If Z is omitted or set to NULL, no scalar covariate is included and the argument formula.Z should also be NULL or omitted. When an atomic vector is used for Z, it represents a single scalar covariate (without a name). Hence, even when only one scalar covariate is included, it is recommended to supply it as a matrix or data frame with a column name. The formula.Z argument specifies which parts of Z to use and whether scalar covariates should be treated as fixed or random effects.

Other key arguments include:

  • family: Specifies the response variable’s distribution and the link function to be used.
  • basis.type: Indicates the type of basis function for the expansion. Options include 'Fourier', 'Bspline', and 'FPC', corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively.
  • basis.order: Specifies the number of basis functions to use. For the Fourier basis (i.e., $\frac{1}{2},\ \sin(kt),\ \cos(kt)$ for k = 1, …, pf), basis.order equals pf. For the b-spline basis {Bi, p(x)}i = −pk, basis.order is set to k + p + 1. For the FPC basis, basis.order represents the number of functional principal components.
  • bs_degree: Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis.

The function returns an s3 object of class fcRegression, which is a list containing the following elements:

  1. regression_result: The result of the regression analysis.
  2. FC.BasisCoefficient: A list of Fourier_series, bspline_series, or numericBasis_series objects representing the function-valued linear coefficients of the covariates.
  3. function.basis.type: The type of basis used.
  4. basis.order: The number of basis functions used, as specified in the input.
  5. data: The original data provided to the model.
  6. bs_degree: The degree of the b-spline basis, same as specified in the input (included only when the b-spline basis is used).

Additionally, the predict method can be used to obtain predicted values from the fitted model, and the fc.beta method is available to evaluate the estimated function-valued coefficient parameters βl(t).

data(MECfda.data.sim.0.0)
res = fcRegression(FC = MECfda.data.sim.0.0$FC, 
                   Y=MECfda.data.sim.0.0$Y, 
                   Z=MECfda.data.sim.0.0$Z,
                   family = gaussian(link = "identity"),
                   basis.order = 5, basis.type = c('Bspline'),
                   formula.Z = ~ Z_1 + (1|Z_2))
t = (0:100)/100
plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t)))

plot(x = t, y = fc.beta(res,2,t), ylab = expression(beta[2](t)))

data(MECfda.data.sim.1.0)
predict(object = res, newData.FC = MECfda.data.sim.1.0$FC,
        newData.Z = MECfda.data.sim.1.0$Z)
#>        1        2        3        4        5 
#> 6.500129 5.690171 2.388979 5.441011 4.821000

fcQR

fcQR(Y, FC, Z, formula.Z, tau = 0.5, basis.type = c("Fourier", "Bspline"),
     basis.order = 6L, bs_degree = 3)

The MECfda package provides a function, fcRegression, for fitting quantile linear regression models by discretizing function-valued covariates using basis expansion. The function can solve a linear model of the form

$$Q_{Y_i|X_i,Z_i}(\tau) = \sum_{l=1}^L\int_{\Omega_l} \beta_l(\tau,t) X_{li}(t) dt + (1,Z_i^T)\gamma.$$ The function supports one or multiple function-valued covariates, as well as zero, one, or multiple scalar-valued covariates. Data input follows the same guidelines as for the fcRegression function, and the treatment of scalar covariates is determined by the formula.Z argument. The primary difference between fcQR and fcRegression is that the quantile linear regression model does not incorporate random effects. The quantile τ is specified by the argument tau, and the type and parameters of the basis functions are defined by the basis.type, basis.order, and bs_degree arguments, just as in fcRegression.

The function returns an s3 object of class fcQR, which is a list containing the following elements:

  1. regression_result: The result of the regression analysis.
  2. FC.BasisCoefficient: A list of Fourier_series, bspline_series, or numericBasis_series objects representing the function-valued linear coefficients of the covariates.
  3. function.basis.type: The type of basis used.
  4. basis.order: The number of basis functions used, as specified in the input.
  5. data: The original data provided to the model.
  6. bs_degree: The degree of the b-spline basis, same as specified in the input (included only when the b-spline basis is used).

Additionally, the predict method can be used to obtain predicted values from the fitted model, and the fc.beta method is available to evaluate the estimated function-valued coefficient parameters βl(t).

data(MECfda.data.sim.0.0)
res = fcQR(FC = MECfda.data.sim.0.0$FC, 
           Y=MECfda.data.sim.0.0$Y, 
           Z=MECfda.data.sim.0.0$Z,
           tau = 0.5,
           basis.order = 5, basis.type = c('Bspline'),
           formula.Z = ~ .)
t = (0:100)/100
plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t)))

plot(x = t, y = fc.beta(res,2,t), ylab = expression(beta[2](t)))

data(MECfda.data.sim.1.0)
predict(object = res, newData.FC = MECfda.data.sim.1.0$FC,
        newData.Z = MECfda.data.sim.1.0$Z)
#>        1        2        3        4        5 
#> 6.497759 5.682573 2.404464 5.440699 4.830085

Measurement Error Models and Bias Correction Estimation Methods

Data collected in real-world settings often include measurement error, especially function-valued data. Measurement error in a data set may result in estimation bias. The *MECfda** package also provides bias correction estimation method functions for certain linear regression models for use with data containing measurement error.

ME.fcRegression_MEM

ME.fcRegression_MEM(
  data.Y,
  data.W,
  data.Z,
  method = c("UP_MEM", "MP_MEM", "average"),
  t_interval = c(0, 1),
  t_points = NULL,
  d = 3,
  family.W = c("gaussian", "poisson"),
  family.Y = "gaussian",
  formula.Z,
  basis.type = c("Fourier", "Bspline"),
  basis.order = NULL,
  bs_degree = 3,
  smooth = FALSE,
  silent = TRUE
)

Wearable monitoring devices permit the continuous monitoring of biological processes, such as blood glucose metabolism, and behaviors, such as sleep quality and physical activity.
Continuous monitoring often collect data in 60-second epochs over multiple days, resulting in high-dimensional multi-level longitudinal curves that are best described and analyzed as multi-level functional data.
Although researchers have previously addressed measurement error in scalar covariates prone to error, less work has been done for correcting measurement error in multi-level high dimensional curves prone to heteroscedastic measurement error. Therefore, Luan et. al. proposed mixed effects-based-model-based (MEM) methods for bias correction due to measurement error in multi-level functional data from the exponential family of distributions that are prone to complex heteroscedastic measurement error.

They first developed MEM methods to adjust for biases due to the presence of measurement error in multi-level generalized functional regression models.
They assumed that the distributions of the function-valued covariates prone to measurement error belong to the exponential family. This assumption allows for a more general specification of the distributions of error-prone functional covariates compared to current approaches that often entail normality assumptions for these observed measures. The approach adopted by Luan et al. allows a nonlinear association between the true measurement and the observed measurement prone to measurement error.
Second, they treated the random errors in the observed measures as complex heteroscedastic errors from the Gaussian distribution with covariance error functions. Third, these methods can be used to evaluate relationships between multi-level function-valued covariates with complex measurement error structures and scalar outcomes with distributions in the exponential family.
Fourth, they treat the function-valued covariate as an observed measure of the true function-valued unobserved latent covariate.
Additionally, these methods employ a point-wise method for fitting the multi-level functional MEM approach, avoiding the need to compute complex and intractable integrals that would be required in traditional approaches for reducing biases due to measurement error in multi-level functional data [7].

Their statistical model is as follows: where the response variable Yi and scalar-valued covariates Zi are measured without error, function-valued covariate Xi(t) is repeatedly measured with error as Wij(t). The model also includes the following assumptions:

  1. Yi|Xi, Zi ∼ EF(⋅), EF refers to an exponential family distribution.
  2. g(⋅) and h(⋅) are known to be strictly monotone, twice continuously differentiable functions.
  3. Cov{Xi(t), Wij(t)} ≠ 0,
  4. Wij(t)|Xi(t) ∼ EF(⋅)
  5. fYi|Wij(t), Xi(t)(⋅) = fYi|Xi(t)(⋅), f refers to probability density function.
  6. Xi(t) ∼ GP{μx(t), Σxx}, GP refers to the Gaussian process.

They proposed a MEM estimation method to correct for bias caused by the presence of measurement error in the function-valued covariate. This method allows for the investigation of a nonlinear measurement error model, in which the relationship between the true and observed measurements is not constrained to be linear, and the distribution assumption for the observed measurement is relaxed to encompass the exponential family rather than being limited to a Gaussian distribution.

The MEM approach is a two stage method that employs functional mixed-effects models. The first stage of the MEM approach involves using a functional mixed-effects model
to approximate the true measure Xi(t) with the repeated observed measurement Wij(t). The MEM approach is primarily based on the assumptions that h[E{Wij(t)|Xi(t)}] = Xi(t) and Xi(t) = μx(t) + εxi(t). That is, in the functional mixed-effects model containing one fixed intercept and one random intercept, the random intercept is assumed to to be the subject-wise deviation of Xi(t) from the mean process μx(t), and the fixed intercept is assumed to represent μx(t). The second stage involves replacing Xi(t) in the regression model with the resulting approximation of Xi(t) from the first stage. The MEM approach employs point-wise (UP_MEM) and multi-point-wise (MP_MEM) estimation procedures to avoid potential computational complexities caused by analyses of multi-level functional data and computations of potentially intractable and complex integrals.

The MECfda package provides the function ME.fcRegression_MEM for applying bias-correction estimation methods. In this function, the response variable, function-valued covariates, and scalar covariates are supplied separately via the arguments data.Y, data.W, and data.Z, respectively.

  • data.Y (Response Variable):
    The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. However, a one-column data frame with a column name is recommended so that the response variable is explicitly identified.

  • data.W (Measurement Data):
    This argument represents W, the observed measurement of the true function-valued covariate X in the statistical model. It should be provided as a three-dimensional array where each row corresponds to a subject, each column to a measurement (time) point, and each layer to a separate observation.

  • data.Z (Scalar Covariates):
    The scalar covariates can be input as a matrix, data frame, or atomic vector. If the model does not include any scalar covariates, data.Z can be omitted or set to NULL. For a single scalar covariate, although an atomic vector is acceptable, a data frame or matrix with a column name is recommended. For multiple scalar covariates, use a matrix or data frame with named columns.

Other key arguments include:

  • method: Specifies the method for constructing the substitute for X; available options are 'UP_MEM', 'MP_MEM', and 'average'.
  • t_interval: Defines the domain of the function-valued covariate as a two-element vector (default is c(0,1), representing the interval [0, 1]).
  • t_points: Specifies the sequence of measurement time points (default is NULL).
  • d: When using the MP_MEM method, this argument sets the number of measurement points involved (default is 3, which is also the minimum value).
  • family.W: Specifies the distribution of W given X; available options are 'gaussian' and 'poisson'.
  • family.Y: Describes the error distribution and link function for the response variable (see stats::family for details).
  • formula.Z: Indicates which parts of data.Z to include in the model and whether to treat the scalar covariates as fixed or random effects.
  • basis.type: Indicates the type of basis function for the expansion. Options include 'Fourier', 'Bspline', and 'FPC', corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively.
  • basis.order: Specifies the number of basis functions to use. For the Fourier basis (i.e., $\frac{1}{2},\ \sin(kt),\ \cos(kt)$ for k = 1, …, pf), basis.order equals pf. For the b-spline basis {Bi, p(x)}i = −pk, basis.order is set to k + p + 1. For the FPC basis, basis.order represents the number of functional principal components.
  • bs_degree: Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis.
  • smooth: Specifies whether to smooth the substitution for X (default is FALSE).
  • silent: Controls whether the function displays its progress during execution (default is TRUE).

The function ME.fcRegression_MEM returns an object of class fcRegression.

The package also provide a function, MEM_X_hat. This function can return the (t) in this bias-correction method.

data(MECfda.data.sim.0.1)
res = ME.fcRegression_MEM(data.Y = MECfda.data.sim.0.1$Y,
                          data.W = MECfda.data.sim.0.1$W,
                          data.Z = MECfda.data.sim.0.1$Z,
                          method = 'UP_MEM',
                          family.W = "gaussian",
                          basis.type = 'Bspline')

ME.fcQR_IV.SIMEX

ME.fcQR_IV.SIMEX(
  data.Y,
  data.W,
  data.Z,
  data.M,
  tau = 0.5,
  t_interval = c(0, 1),
  t_points = NULL,
  formula.Z,
  basis.type = c("Fourier", "Bspline"),
  basis.order = NULL,
  bs_degree = 3
)

Health a major concerns for many people, and as technology advances, wearable devices have become the mainstream method for monitoring and evaluating individual physical activity levels. Despite personal preferences in brands and feature design, the accuracy of the data presented is what makes the device a great product. These functional data collected from devices are generally considered more accurate and subjective compared to other objective methods like questionnaires and self-reports. Because physical activity levels are not directly observable, algorithms are used generate corresponding summary reports of different activity level using complex data (e.g. steps, heart rate).
However, these results may differ depending on which device is used. And aside from variation in hardware, data collected could also vary by the combinations between how the device is worn and the activity of interest. Although current devices may be sufficiently accurate to monitor general physical activity levels, more precise data may enable additional functions such as detecting irregular heart rhythms or radiation exposures that would contribute toward improving the health of the general public and elevating the overall well-being of society.

Quantile regression is a tool that was developed to meet the need for modeling complex relationships between a set of covariates and quantiles of an outcome. In obesity studies, the effects of interventions or covariates on body mass index (BMI) are most commonly estimated using ordinary least square methods. However, mean regression approaches are unable to address these effects for individuals whose BMI values fall in the upper quantile of the distribution. Compared with traditional linear regression approaches, quantile regression approaches make no assumptions regarding the distribution of residuals and are robust to outlying observations.

Tekwe et. al.  proposed a simulation extrapolation (SIMEX) estimation method that uses an instrumental variable to correct for bias due to measurement error in scalar-on-function quantile linear regression. They demonstrated the usefulness of this method by evaluating the association between physical activity and obesity using the National Health and Nutrition Examination Survey (NHANES) dataset [8]. Their statistical model is as follows: where the response variable Yi and scalar-valued covariates Zi are measured without error, the function-valued covariate Xi(t) is measured with error as Wi(t), and Mi(t) is a measured instrumental variable. The model also includes the following assumptions:

  1. Cov{Xi(t), Ui(s)} = 0,
  2. Cov{Mi(t), Ui(s)} = 0,
  3. E(Wi(t)|Xi(t)) = Xi(t)
  4. Ui(t) ∼ GP{0, Σuu}, GP refers to Gaussian process.

for t, s ∈ [0, 1] and i = 1, …, n.

Their bias correction estimation method uses a two-stage strategy to correct for measurement error in a function-valued covariate and then fits a linear quantile regression model. In the first stage, an instrumental variable is used to estimate the covariance matrix associated with the measurement error. In the second stage, SIMEX is used to correct for measurement error in the function-valued covariate.

The MECfda package provides the function ME.fcQR_IV.SIMEX for applying its bias-correction estimation method in scalar-on-function quantile regression. The arguments are described as follows:

  • data.Y (Response Variable):
    The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. A one-column data frame with a column name is recommended for clarity.

  • data.W (Measurement Data):
    This argument corresponds to W, the observed measurement of X in the statistical model. It should be provided as a data frame or matrix where each row represents a subject and each column corresponds to a measurement (time) point.

  • data.Z (Scalar Covariates):
    The scalar covariate data can be omitted (or set to NULL) if no scalar covariates are included in the model. Alternatively, it can be provided as an atomic vector (for a single scalar covariate) or as a matrix/data frame (for multiple scalar covariates). A data frame with column names is recommended.

  • data.M (Instrumental Variable):
    This argument provides the data for M, the instrumental variable. It should be supplied as a data frame or matrix where each row represents a subject and each column corresponds to a measurement (time) point.

  • tau (Quantile Level):
    Specifies the quantile level τ ∈ (0, 1) for the quantile regression model. The default is 0.5.

  • t_interval (Domain of the Function-Valued Covariate):
    A two-element vector defining the interval over which the function-valued covariate is defined (default is c(0,1), representing the interval [0, 1]).

  • t_points (Measurement Time Points):
    Specifies the sequence of time points at which measurements are taken. The default is NULL.

  • formula.Z (Scalar Covariate Formula):
    This argument determines which components of data.Z are included in the model and how the scalar covariates are treated (i.e., as fixed or random effects).

  • basis.type: Indicates the type of basis function for the expansion. Options include 'Fourier', 'Bspline', and 'FPC', corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively.

  • basis.order: Specifies the number of basis functions to use. For the Fourier basis (i.e., $\frac{1}{2},\ \sin(kt),\ \cos(kt)$ for k = 1, …, pf), basis.order equals pf. For the b-spline basis {Bi, p(x)}i = −pk, basis.order is set to k + p + 1. For the FPC basis, basis.order represents the number of functional principal components.

  • bs_degree: Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis.

The function ME.fcQR_IV.SIMEX returns an object of class ME.fcQR_IV.SIMEX, which is a list containing the following elements:

  1. coef.X: A Fourier_series or bspline_series object representing the function-valued coefficient parameter of the function-valued covariate.
  2. coef.Z: The estimated linear coefficients of the scalar covariates.
  3. coef.all: The original estimate of the linear coefficients.
  4. function.basis.type: The type of function basis used.
  5. basis.order: The number of basis functions used (as specified in the input).
  6. t_interval: A two-element vector representing the interval, which defines the domain of the function-valued covariate.
  7. t_points: The sequence of measurement (time) points.
  8. formula: The regression model.
  9. formula.Z: A formula object containing only the scalar covariates.
  10. zlevels: The levels of any categorical (non-continuous) scalar covariates.

This comprehensive structure ensures that users can flexibly input their data and clearly specify the modeling framework for bias-correction in scalar-on-function quantile regression.

rm(list = ls())
data(MECfda.data.sim.0.2)
res = ME.fcQR_IV.SIMEX(data.Y = MECfda.data.sim.0.2$Y,
                       data.W = MECfda.data.sim.0.2$W,
                       data.Z = MECfda.data.sim.0.2$Z,
                       data.M = MECfda.data.sim.0.2$M,
                       tau = 0.5,
                       basis.type = 'Bspline')

ME.fcQR_CLS

ME.fcQR_CLS(
  data.Y,
  data.W,
  data.Z,
  tau = 0.5,
  t_interval = c(0, 1),
  t_points = NULL,
  grid_k,
  grid_h,
  degree = 45,
  observed_X = NULL
)

Zhang et. al.  proposed a corrected loss score estimation method for scalar-on-function quantile linear regression to correct for bias due to measurement error [9]. Their statistical model is as follows: where the response variable Yi and the scalar-valued covariates Zi are measured without error, the function-valued covariate Xi(t) is measured with error as Wi(t).

  1. E[Ui(t)] = 0.
  2. Cov{Ui(t), Ui(s)} = ΣU(s, t), where ΣU(s, t) is unknown.
  3. Ui(t) are i.i.d for different i.

Zhang et al. proposed a new corrected loss function for a partially functional linear quantile model with functional measurement error in this manuscript. They established a corrected quantile objective function for an observed measurement, which is an unbiased estimator of the quantile objective function that would be obtained if the true measurements were available. The estimators of the regression parameters are obtained by optimizing the resulting corrected loss function. The resulting estimator of the regression parameters is shown to be consistent.

The MECfda package provides a function, ME.fcQR_CLS, for implementing their bias-correction estimation method. Below is a description of its arguments and return values:

  • data.Y (Response Variable):
    The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. The recommended format is a one-column data frame with a column name.

  • data.W (Measurement Data for W):
    This argument represents W, the observed measurement of X in the statistical model. It should be provided as a three-dimensional array where each row corresponds to a subject, each column to a measurement (time) point, and each layer to an observation.

  • data.Z (Scalar Covariates):
    Scalar covariates can be omitted (or set to NULL) if no scalar covariates are included in the model, provided as an atomic vector for a single scalar covariate, or as a matrix or data frame for multiple covariates. The recommended format is a data frame with column names.

  • tau (Quantile Level):
    Specifies the quantile τ ∈ (0, 1) to be used in the quantile regression model. The default value is 0.5.

  • t_interval (Domain of the Function-Valued Covariate):
    A two-element vector that specifies the domain (interval) of the function-valued covariate. The default is c(0,1), representing the interval [0, 1].

  • t_points (Measurement Time Points):
    Specifies the sequence of measurement (time) points. The default is NULL.

  • grid_k (Candidate Basis Numbers):
    An atomic vector where each element is a candidate number for the basis functions.

  • grid_h (Candidate Tuning Parameter Values):
    A non-zero atomic vector where each element is a candidate value for the tuning parameter.

  • degree (Degree for Derivative and Integral Calculations):
    This argument is used when computing derivatives and integrals, with a default value of 45—which is sufficient for most scenarios.

  • observed_X (Observed X Data for Variance Estimation):
    Used for estimating parametric variance; the default value is NULL.

The function returns an object of class ME.fcQR_CLS (a list) containing the following elements:

  1. estimated_beta_hat: Estimated coefficients from the corrected loss function (including the functional component).
  2. estimated_beta_t: The estimated functional curve.
  3. SE_est: The estimated parametric variance (returned only if observed_X is not NULL).
  4. estimated_Xbasis: The basis matrix used in the estimation.
  5. res_naive: Results from the naive (uncorrected) method.
rm(list = ls())
data(MECfda.data.sim.0.1)
res = ME.fcQR_CLS(data.Y = MECfda.data.sim.0.1$Y,
                  data.W = MECfda.data.sim.0.1$W,
                  data.Z = MECfda.data.sim.0.1$Z,
                  tau = 0.5,
                  grid_k = 4:7,
                  grid_h = 1:2)

ME.fcLR_IV

ME.fcLR_IV(
  data.Y,
  data.W,
  data.M,
  t_interval = c(0, 1),
  t_points = NULL,
  CI.bootstrap = F
)

Tekwe et. al.  proposed a bias correction estimation method for scalar-on-function linear regression model with measurement error using an instrumental variable [6]. Their statistical model is as follows: where the response variable Yi and are measured without error, the function-valued covariate Xi(t) is measured with error as Wi(t), and Mi(t) is an measured instrumental variable. They included the following additional assumptions:

  1. Eεi(t) = 0,
  2. EUi(t) = 0,
  3. Eηi(t) = 0,
  4. Cov{Xi(t), εi} = 0,
  5. Cov{Mi(t), εi} = 0,
  6. Cov{Mi(t), Ui(s)} = 0,

for t, s ∈ [0, 1] and i = 1, …, n.

The MECfda package provides the function ME.fcLR_IV for implementing their bias-correction estimation method. The arguments are as follows:

  • data.Y (Response Variable):
    The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. The recommended format is a one-column data frame with a column name.

  • data.W (Measurement Data for W):
    This argument represents W, the observed measurement of X in the statistical model. It should be provided as a data frame or matrix, where each row represents a subject and each column corresponds to a measurement (time) point.

  • data.M (Instrumental Variable):
    This argument provides the data for M, the instrumental variable. It should be provided as a data frame or matrix, with each row representing a subject and each column representing a measurement (time) point.

  • t_interval (Domain of the Function-Valued Covariate):
    A two-element vector that specifies the domain (interval) of the function-valued covariate. The default is c(0,1), representing the interval [0, 1].

  • t_points (Measurement Time Points):
    Specifies the sequence of measurement (time) points. The default is NULL.

  • CI.bootstrap (Bootstrap Confidence Interval):
    A logical flag indicating whether to return a confidence interval using the bootstrap method. The default is FALSE.

The function returns an object of class ME.fcLR_IV, which is a list containing the following elements:

  1. beta_tW: Parameter estimates.
  2. CI: Confidence interval, which is returned only if CI.bootstrap is TRUE.
rm(list = ls())
data(MECfda.data.sim.0.3)
res = ME.fcLR_IV(data.Y = MECfda.data.sim.0.3$Y,
                 data.W = MECfda.data.sim.0.3$W,
                 data.M = MECfda.data.sim.0.3$M)

Simulated Data Generation

The package MECfda includes some functions to generate simulated data to test the functions in the package.

References

1. Wang J-L, Chiou J-M, Müller H-G. Functional data analysis. Annual Review of Statistics and its application. 2016;3:257–95.
2. Ramsay JO, Silverman BW. Applied functional data analysis: Methods and case studies. Springer; 2002.
3. Ramsay JO, Dalzell C. Some tools for functional data analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1991;53:539–61.
4. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models: A modern perspective. Chapman; Hall/CRC; 2006.
5. Crainiceanu CM, Staicu A-M, Di C-Z. Generalized multilevel functional regression. Journal of the American Statistical Association. 2009;104:1550–61.
6. Tekwe CD, Zoh RS, Yang M, Carroll RJ, Honvoh G, Allison DB, et al. Instrumental variable approach to estimating the scalar-on-function regression model with measurement error with application to energy expenditure assessment in childhood obesity. Statistics in medicine. 2019;38:3764–81.
7. Luan Y, Zoh RS, Cui E, Lan X, Jadhav S, Tekwe CD. Scalable regression calibration approaches to correcting measurement error in multi-level generalized functional linear regression models with heteroscedastic measurement errors. arXiv preprint arXiv:230512624. 2023.
8. Tekwe CD, Zhang M, Carroll RJ, Luan Y, Xue L, Zoh RS, et al. Estimation of sparse functional quantile regression with measurement error: A SIMEX approach. Biostatistics. 2022;23:1218–41.
9. Zhang M, Xue L, Tekwe CD, Bai Y, Qu A. Partially functional linear quantile regression model with measurement errors. Statistica Sinica. 2023;33:2257–80.
10. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2017.