Equivalence testing first appeared in the field of pharmacokinetics (Metzler, 1974) and finds its most common application in generic drug manufacturing (Senn, 2021), where the aim is to determine whether a parameter, such as the mean variation of treatment response between a reference and a generic drug, falls within a predetermined equivalence range that we denote $(-c, c)$, indicating that the drugs are equivalent. Deviations within this region are considered negligible, as they are too small to reflect a meaningful dissimilarity between the therapeutic effects of the compared treatments. In contrast to standard hypothesis testing for equality of means, where the null hypothesis assumes that both means are equal and the alternative assumes that they are not, equivalence testing reverses the roles of the hypotheses and considers a region rather than a point. Specifically, it defines the alternative as the equivalence region within which the parameter of interest must lie for the treatments to be considered equivalent, and the null hypothesis as its complement. This paradigm puts the burden of proof on equivalence, rather than on non-equality, and emphasizes the importance of assessing the similarity of treatments in addition to their differences.
More formally, the hypotheses of interest are defined as
$$H_0: \theta \notin \Theta_1 \quad \text{vs.} \quad H_1: \theta \in \Theta_1 := (\theta_L, \theta_U),$$
where $\Theta_1 = (\theta_L, \theta_U)$ denotes the range of equivalence, whose limits are known constants. These limits are usually symmetric, so we define $c := \theta_U = -\theta_L$.
Existing approaches, such as the state-of-the-art Two One-Sided Tests (TOST) (Schuirmann, 1987), are known to be conservative and to lose power, particularly for highly variable responses. This can be seen by evaluating the power function at different points of the parameter space. In particular, evaluating it at the equivalence bounds yields the size of the test and shows that it is strictly smaller than the significance level $\alpha$. To see this more clearly, consider a canonical form of the average equivalence problem in the univariate framework, which consists of two independent random variables $\hat{\theta}$ and $\hat{\sigma}_\nu$ with distributions
$$\hat{\theta} \sim \mathcal{N}(\theta, \sigma_\nu^2), \quad \text{and} \quad \nu \frac{\hat{\sigma}_\nu^2}{\sigma_\nu^2} \sim \chi^2_\nu,$$
where $\theta$ is the parameter of interest, $\sigma_\nu$ is the standard error, and $\nu$ is the number of degrees of freedom.
The TOST is based on two test statistics
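$$T_L := \frac{\hat{\theta} + c}{\hat{\sigma}_\nu} \sim t_\nu\!\left(\frac{\theta + c}{\sigma_\nu}\right), \quad \text{and} \quad T_U := \frac{\hat{\theta} - c}{\hat{\sigma}_\nu} \sim t_\nu\!\left(\frac{\theta - c}{\sigma_\nu}\right).$$
$H_0$ is rejected in favor of equivalence if both tests simultaneously reject their marginal null hypotheses, i.e.,
$$T_L \geq t_{1-\alpha,\nu}, \quad \text{and} \quad T_U \leq -t_{1-\alpha,\nu}.$$
By rearranging the terms, we can define a rejection region for the TOST:
$$C_1 := \left\{ \hat{\theta} \in \mathbb{R},\ \hat{\sigma}_\nu \in \mathbb{R}_+ \;\middle|\; |\hat{\theta}| \leq c - t_{1-\alpha,\nu}\,\hat{\sigma}_\nu \right\}.$$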
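The equivalence between the two-statistic rule and the rejection region can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the margin `C = log(1.25)`, and the example inputs are illustrative assumptions, and the critical value $t_{1-\alpha,\nu}$ is passed in by the caller (for instance from tables, or from `scipy.stats.t.ppf(1 - alpha, nu)` if SciPy is available).

```python
import math


def tost_rejects(theta_hat, sigma_hat, c, t_crit):
    """TOST decision written with the two one-sided statistics T_L and T_U."""
    t_lower = (theta_hat + c) / sigma_hat  # T_L
    t_upper = (theta_hat - c) / sigma_hat  # T_U
    return t_lower >= t_crit and t_upper <= -t_crit


def in_rejection_region(theta_hat, sigma_hat, c, t_crit):
    """Same decision written as membership in the region C_1."""
    return abs(theta_hat) <= c - t_crit * sigma_hat


# Illustrative inputs (assumptions, not taken from the text).
C = math.log(1.25)   # a commonly used margin on the log scale
T_CRIT = 1.724718    # t_{0.95, 20}, i.e. alpha = 0.05 and nu = 20

for theta_hat, sigma_hat in [(0.05, 0.08), (0.05, 0.12), (-0.20, 0.08)]:
    print(theta_hat, sigma_hat,
          tost_rejects(theta_hat, sigma_hat, C, T_CRIT),
          in_rejection_region(theta_hat, sigma_hat, C, T_CRIT))
```

Both functions encode the same rule; the second form makes explicit that equivalence can never be declared once $\hat{\sigma}_\nu t_{1-\alpha,\nu}$ exceeds $c$.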
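Given $\alpha$, $\theta$, $\sigma_\nu$, $\nu$ and $c$, the power function of the TOST is the probability of rejecting $H_0$, i.e., the integral of the joint density of $(\hat{\theta}, \hat{\sigma}_\nu)$ over the rejection region $C_1$ (see also Phillips (1990)):
$$p(\alpha, \theta, \sigma_\nu, \nu, c) := \Pr\left(T_L \geq t_{1-\alpha,\nu} \ \text{and} \ T_U \leq -t_{1-\alpha,\nu} \mid \alpha, \theta, \sigma_\nu, \nu, c\right) = \int_0^\infty I\left(\hat{\sigma}_\nu t_{1-\alpha,\nu} < c\right) \left\{ \Phi\!\left(\frac{c + t_{\alpha,\nu}\hat{\sigma}_\nu - \theta}{\sigma_\nu}\right) - \Phi\!\left(-\frac{c + t_{\alpha,\nu}\hat{\sigma}_\nu + \theta}{\sigma_\nu}\right) \right\} f_W(\hat{\sigma}_\nu \mid \sigma_\nu, \nu)\, d\hat{\sigma}_\nu,$$
where $I(\cdot)$ is the indicator function, $\Phi$ is the standard normal distribution function, $t_{\alpha,\nu} = -t_{1-\alpha,\nu}$, and $f_W$ is the density of $\hat{\sigma}_\nu$. Noting that the vector $(T_L, T_U)$ has a bivariate non-central $t$-distribution with non-centrality parameters $(\theta + c)/\sigma_\nu$ and $(\theta - c)/\sigma_\nu$ respectively, we can express the power function in terms of Owen's Q-function (Owen, 1965):
$$p(\alpha, \theta, \sigma_\nu, \nu, c) = Q_\nu\!\left(-t_{1-\alpha,\nu}, \frac{\theta - c}{\sigma_\nu}, \frac{c\sqrt{\nu}}{\sigma_\nu t_{1-\alpha,\nu}}\right) - Q_\nu\!\left(t_{1-\alpha,\nu}, \frac{\theta + c}{\sigma_\nu}, \frac{c\sqrt{\nu}}{\sigma_\nu t_{1-\alpha,\nu}}\right).$$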
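To make the integral representation of the power function concrete, it can be approximated by simple quadrature. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses a midpoint rule over $\hat{\sigma}_\nu \in (0, c/t_{1-\alpha,\nu})$, derives the density of $\hat{\sigma}_\nu$ from $\nu\hat{\sigma}_\nu^2/\sigma_\nu^2 \sim \chi^2_\nu$, and takes the critical value `t_crit` $= t_{1-\alpha,\nu}$ as an input. All names and numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def sigma_hat_density(w, sigma, nu):
    """Density of sigma_hat at w > 0 when nu * sigma_hat^2 / sigma^2 ~ chi^2_nu."""
    log_f = (math.log(2.0)
             + 0.5 * nu * math.log(nu / (2.0 * sigma ** 2))
             + (nu - 1) * math.log(w)
             - nu * w ** 2 / (2.0 * sigma ** 2)
             - math.lgamma(nu / 2.0))
    return math.exp(log_f)


def tost_power(theta, sigma, nu, c, t_crit, n_steps=4000):
    """Midpoint-rule approximation of Pr(|theta_hat| <= c - t_crit * sigma_hat)."""
    upper = c / t_crit  # the indicator removes sigma_hat values above this
    h = upper / n_steps
    total = 0.0
    for i in range(n_steps):
        w = (i + 0.5) * h
        half_width = c - t_crit * w  # equals c + t_{alpha,nu} * w
        prob = (PHI((half_width - theta) / sigma)
                - PHI(-(half_width + theta) / sigma))
        total += prob * sigma_hat_density(w, sigma, nu)
    return total * h


# Illustrative inputs (assumptions): alpha = 0.05, nu = 20, sigma_nu = 0.12.
C = math.log(1.25)
T_CRIT = 1.724718  # t_{0.95, 20}

print("power at theta = 0:", tost_power(0.0, 0.12, 20, C, T_CRIT))
print("power at theta = c:", tost_power(C, 0.12, 20, C, T_CRIT))
```

Evaluating the function at $\theta = c$ gives the rejection probability at the equivalence bound, which can be compared directly with the nominal level $\alpha$.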
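This formulation allows us to show that, for $\sigma_\nu > 0$, the TOST is not size-$\alpha$ in finite samples, since
$$\omega_\nu(\alpha, c, \sigma_\nu) := \sup_{\theta \in \Theta_0} p(\alpha, \theta, \sigma_\nu, \nu, c) < Q_\nu\!\left(-t_{1-\alpha,\nu}, 0, \frac{c\sqrt{\nu}}{\sigma_\nu t_{1-\alpha,\nu}}\right) < \Pr\left(T_\nu \leq -t_{1-\alpha,\nu}\right) = \alpha,$$
where $\Theta_0 := \mathbb{R} \setminus (-c, c)$ corresponds to the parameter space under the null.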
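As $\omega_\nu(\gamma, \delta, \sigma_\nu)$ is continuously differentiable and strictly increasing in $\gamma$ and $\delta$, solving either of the following matching problems yields a finite-sample correction to the TOST that ensures a size-$\alpha$ test:
$$\alpha^* := \operatorname{argzero}_{\gamma \in [\alpha, 0.5)} \left[\omega_\nu(\gamma, c, \sigma_\nu) - \alpha\right], \qquad c^* := \operatorname{argzero}_{\delta \in [1, \infty)} \left[\omega_\nu(\alpha, \delta, \sigma_\nu) - \alpha\right].$$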
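As an illustration of the second matching problem, the adjusted bound $c^*$ can be located by bisection, since the size increases with $\delta$. The sketch below rests on several assumptions that are not stated in the text: the size $\omega_\nu(\alpha, \delta, \sigma_\nu)$ is taken to be the rejection probability at $\theta = c$ of the test that uses bounds $\pm\delta$, $\sigma_\nu$ is treated as known, the search is bracketed starting from $\delta = c$, and the critical value `t_crit` $= t_{1-\alpha,\nu}$ is supplied by the caller. Function names and inputs are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def rejection_prob(theta, sigma, nu, delta, t_crit, n_steps=2000):
    """Pr(|theta_hat| <= delta - t_crit * sigma_hat), midpoint rule over sigma_hat."""
    h = (delta / t_crit) / n_steps
    log_const = (math.log(2.0) + 0.5 * nu * math.log(nu / (2.0 * sigma ** 2))
                 - math.lgamma(nu / 2.0))
    total = 0.0
    for i in range(n_steps):
        w = (i + 0.5) * h
        # Density of sigma_hat implied by nu * sigma_hat^2 / sigma^2 ~ chi^2_nu.
        dens = math.exp(log_const + (nu - 1) * math.log(w)
                        - nu * w ** 2 / (2.0 * sigma ** 2))
        half_width = delta - t_crit * w
        total += dens * (PHI((half_width - theta) / sigma)
                         - PHI(-(half_width + theta) / sigma))
    return total * h


def corrected_bound(c, sigma, nu, alpha, t_crit, tol=1e-10):
    """Bisection for delta such that the size (assumed attained at theta = c) equals alpha."""
    def size(delta):
        return rejection_prob(c, sigma, nu, delta, t_crit)

    lo, hi = c, 2.0 * c
    while size(hi) < alpha:  # expand the bracket until the size exceeds alpha
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if size(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative inputs (assumptions): alpha = 0.05, nu = 20, sigma_nu = 0.12.
C = math.log(1.25)
T_CRIT = 1.724718  # t_{0.95, 20}
c_star = corrected_bound(C, 0.12, 20, 0.05, T_CRIT)
print("c =", C, "c* =", c_star)
```

The first matching problem (solving for $\alpha^*$) could be handled the same way, but it additionally requires the $t$ quantile as a function of $\gamma$, which is why only the $\delta$ version is sketched here.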
The conditions under which the solutions are singletons and provide uniformly more powerful inference are studied at length in Boulaguiem et al. (2023), where they are also compared to existing methods.