%\VignetteEngine{knitr::knitr} %\VignetteDepends{ggplot2} %\VignetteDepends{gridExtra} %\VignetteDepends{performance} %\VignetteDepends{merDeriv} %\VignetteDepends{DHARMa} %\VignetteIndexEntry{Fitting Generalized Linear Mixed-Effects Models using lme4} \documentclass[nojss]{jss} %% need no \usepackage{Sweave.sty} \usepackage[T1]{fontenc}% for correct hyphenation and T1 encoding \usepackage[utf8]{inputenc} \usepackage{lmodern}% latin modern font \usepackage[american]{babel} %% for texi2dvi ~ bug \usepackage{bm,amsmath,amsfonts} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\VEC}{vec} \newcommand{\bbeta}{{\bm \beta}} \newcommand{\btheta}{{\bm \theta}} \DeclareUnicodeCharacter{0394}{$\Delta$} % \shortcites{lesnoff_within-herd_2004} %% BMB: what does JSS assume? %\bibliographystyle{chicago} \author{ Anna Ly\\McMaster University \And Rune Haubo Bojesen Christensen\\Copenhagen Research Centre \\for Biological and Precision Psychiatry\AND Douglas Bates\\University of Wisconsin - Madison \And Martin M\"achler\\ETH Zurich\AND Benjamin M. Bolker\\McMaster University } \Plainauthor{A. Ly, R. H. B. Christensen, D. M. Bates, M. M\"achler, B. M. Bolker} \title{Fitting generalized linear mixed-effects models using \pkg{lme4}} \Plaintitle{Fitting generalized linear mixed models using lme4} \Shorttitle{GLMMs with lme4} \Abstract{% The \pkg{lme4} \proglang{R} package can be used to fit generalized linear mixed models (GLMMs), which extend the class of linear mixed models (LMMs). The two main extensions provided by GLMMs are (1) allowing for the conditional distribution of the response given the random effects to be non-Gaussian (e.g. binomial, Poisson) and (2) allowing the conditional mean to be a nonlinear function of a linear combination of the fixed and random effect coefficients, via an inverse link function. The conditional mode of the random effects given the observed data, the variance-covariance matrix of the random effects, and the fixed effect parameters are determined using penalized iteratively reweighted least squares (PIRLS). We compute an approximation of the integral over the distributions of the conditional modes to compute the MLE for a given set of parameters (by default we use the Laplace approximation or, alternatively, the more computationally expensive adaptive Gauss-Hermite quadrature). The package provides all the standard features available for GLMs in base \proglang{R}, including the the standard set of accessor functions as well as the possibility of user-specified distributions (within the exponential dispersion family) and link functions.} % \Keywords{% sparse matrix methods, generalized linear mixed models, penalized least squares, Cholesky decomposition} \Address{ Anna Ly\\ Department of Mathematics \& Statistics \\ McMaster University \\ 1280 Main Street W \\ Hamilton, ON L8S 4K1, Canada \\ \par\bigskip Rune Haubo Bojesen Christensen \\ Copenhagen Research Centre for Biological and Precision Psychiatry \\ Mental Health Centre Copenhagen, Copenhagen University Hospital - Bispebjerg and Frederiksberg \\ Copenhagen, Denmark \\ Email: \email{Rune.Haubo@pm.me} \par\bigskip Douglas Bates\\ Department of Statistics, University of Wisconsin - Madison\\ 1205 University Ave. \\ Madison, WI 53706, U.S.A.\\ E-mail: \email{bates@stat.wisc.edu} \par\bigskip Martin M\"achler\\ Seminar f\"ur Statistik, HG G~16\\ ETH Zurich\\ 8092 Zurich, Switzerland\\ E-mail: \email{maechler@stat.math.ethz.ch}\\ % URL: \url{http://stat.ethz.ch/people/maechler} \par\bigskip Benjamin M. Bolker\\ Departments of Mathematics \& Statistics and Biology \\ McMaster University \\ 1280 Main Street W \\ Hamilton, ON L8S 4K1, Canada \\ E-mail: \email{bolker@mcmaster.ca} } \newcommand{\Var}{\operatorname{Var}} \newcommand{\abs}{\operatorname{abs}} \newcommand{\bLt}{\ensuremath{\bm\Lambda_\theta}} \newcommand{\mc}[1]{\ensuremath{\mathcal{#1}}} \newcommand{\trans}{\ensuremath{^\prime}} \newcommand{\yobs}{\ensuremath{\bm y_{\mathrm{obs}}}} \newcommand{\Lsat}{\ensuremath{L_{\mathrm{sat}}}} \newcommand*{\eq}[1]{eqn.~\ref{#1}}% or just {(\ref{#1})} % \usepackage{etoolbox} \usepackage{lineno} %% line numbering hack from Claude \makeatletter \let\@@par@orig\@@par \renewcommand{\@@par}{% \@@par@orig \ifnum\currentgrouplevel=0 \linenumbers \fi } \makeatother % <>= options(width=69, show.signif.stars=FALSE, str=strOptions(strict.width="cut")) library(knitr) library(lme4) library(ggplot2) theme_set(theme_bw()) revguide <- guide_legend(reverse = TRUE) library(grid) library(gridExtra) zmargin <- theme(panel.margin=unit(0,"lines")) library(lattice) library(DHARMa) library(performance) opts_chunk$set(engine='R',dev='pdf',fig.width=10, fig.height=6.5,prompt=TRUE, ## strip.white=all, error=FALSE, ## stop on error tidy=FALSE,comment=NA) ## using knitr::knit2pdf() may lead to xcolor 'incompatible color definition' ## warnings. The code below resolves that, but simultaneously breaks ## vignette compilation in other contexts ... ## https://tex.stackexchange.com/questions/148188/knitr-xcolor-incompatible-color-definition ## knit_hooks$set(document = function(x) {sub('\\usepackage[]{xcolor}', '\\usepackage[dvipsnames]{xcolor}', x, fixed = TRUE)}) @ \begin{document} \linenumbers \section{Introduction} \label{sec:intro} FIXME: check digits of output? The \pkg{lme4} package for \proglang{R} can be used to fit a broad range of mixed-effects models. One major advantage of \pkg{lme4} over its predecessor, \pkg{nlme}, is that it can be used to fit generalized linear mixed models (GLMMs), which combine the flexibility of linear mixed models (LMMs) and generalized linear models (GLMs). In a companion paper, we have described the facilities in \pkg{lme4} for fitting linear mixed models (LMMs). Here we describe the facilities for fitting GLMMs. \section{Generalized Linear Mixed Models} \label{sec:GLMMdef} Generalized linear mixed models extend the class of generalized linear models by allowing for both fixed and random effects. In a GL(M)M, the length-$n$ vector-valued response variable, $\mc Y$, has a conditional distribution in the exponential dispersion family (e.g. Normal, binomial, Poisson).\footnote{The \code{lme4} package supports fitting negative binomial GLMMs, which are an extension of GLMMs; the negative binomial family is within the exponential dispersion family if the dispersion parameter is fixed. The \code{glmer.nb} function uses a one-dimensional optimizer to estimate the dispersion parameter, fitting a negative binomial GLMM with a fixed dispersion parameter at each trial value. In practice this implementation is slower than those in other R packages (e.g., the \code{glmmTMB} package).} The mean, $\bm\mu_{\mc Y }$, of $\mc Y$ depends on a linear predictor, \begin{equation} \label{eq:GLMlinpred} \bm\eta=\bm X\bm\beta . \end{equation} where $\bm\beta$ is a $p$-dimensional coefficient vector and $\bm X$ is an $n \times p$ model matrix. The mapping from $\bm\mu_{\mc Y }$ to $\bm\eta$, which is called the \emph{link function} and written, \begin{equation} \label{eq:linkfun} \bm X\bm\beta=\bm\eta=\bm g\left(\bm\mu_{\mc Y }\right) , \end{equation} is a \emph{diagonal mapping} in the sense that there is a scalar function, $g$, such that the $i$th component of $\bm\eta$ is $g$ applied to the $i$th component of $\bm\mu_{\mc Y }$. (The name ``diagonal'' reflects the fact that the Jacobian matrix, $\frac{d\eta}{d\mu\trans}$, of such a mapping will be diagonal.) The scalar link function must be invertible and differentiable over its range. The vector-valued \emph{inverse link} function, $\bm g^{-1}$, will be the scalar inverse link, $g^{-1}$, applied component-wise to $\bm\eta$. % The scalar link function must be invertible over its range. % The vector-valued \emph{inverse link} function, $\bm g^{-1}$, will be % the scalar inverse link, $g^{-1}$, applied component-wise to % $\bm\eta$. % commented out the repetition - DMB In the GLMM case, the vector of means of the exponential dispersion family distribution of $\mc Y$ depends on an unobserved random vector, $\mc B$, of length $q$, called the random-effects coefficient vector. In particular, the conditional mean of $\mc Y$ given that $\mc B = \bm b$, written $\bm\mu_{\mc Y|\mc B = \bm b}$, depends on the linear predictor, \begin{equation} \label{eq:GLMMlinpred} \bm\eta=\bm Z\bm b+\bm X\bm\beta . \end{equation} where $\bm Z$ is an $n \times q$ random-effects model matrix. Similar to the GLM case, the mapping from the conditional mean, $\bm\mu_{\mc Y|\mc B = \bm b}$, to the linear predictor, $\bm\eta$, is \begin{equation} \label{eq:linkfunii} \bm Z\bm b + \bm X\bm\beta= \bm\eta= \bm g\left(\bm\mu_{\mc Y|\mc B =\bm b}\right) , \end{equation} The random vector $\mc B$ is assumed to be distributed multivariate normally, \begin{equation} \label{eq:2} \mc B \sim \mc N(\bm 0, \bm\Sigma_\theta) \end{equation} where $\bm\Sigma_\theta$ is the covariance matrix of $\mc B$, which depends on a vector of covariance parameters, $\bm\theta$. The optimization routines of \pkg{lme4} never actually compute $\bm\Sigma_\theta$ directly, and instead use the elements of the covariance factor, $\bm\Lambda_\theta$, which is a matrix square root of $\bm\Sigma_\theta$ (in practice a Cholesky factor), \begin{equation} \label{eq:GLMMSigma} \bm\Sigma_\theta=\bm\Lambda_\theta\bm\Lambda_\theta\trans . \end{equation} This characterization of the random-effects covariance structure allows us to write the linear predictor as \begin{equation} \label{eq:GLMMlinpredii} \bm\eta=\bm Z\bm\Lambda_\theta\bm u+\bm X\bm\beta , \end{equation} where the spherical random effects vector, $\bm u$ (see \cite{bates2015fitting}) is a realization of the random vector, $\mc U$, \begin{equation} \label{eq:UdistGLMM} \mc U\sim\mc N(\bm0,\bm I_q) \end{equation} where $\bm I_q$ is the identity matrix. Common forms of the conditional distribution are Bernoulli, for binary responses, binomial for binary responses that are recorded as the number of trials and the number of successes, and Poisson, for count data. The combination of a distributional form and a link function is called a \emph{family}. For distributional forms in the exponential dispersion family there is a \emph{canonical link}. For Bernoulli or binomial forms the canonical link is the \emph{logit} link function \begin{equation} \label{eq:logitLink} \eta_i=\log\left(\frac{\mu_i}{1-\mu_i}\right); \end{equation} for the Poisson distribution the canonical link is the natural logarithm. The form of the distribution determines the conditional variance, $\Var(\mc Y|\mc U=\bm u)$, as a function of the conditional mean and, possibly, a separate scale factor. (In the most common cases the conditional variance is completely determined by the conditional mean.) Assuming that the conditional distribution belongs to the exponential dispersion family, we can weight observations differently by scaling the dispersion parameter, $\phi$. These \textbf{prior weights}, $\bm w$, a vector of size $n$, are known positive constants (equal to the number of observations per trial in the particular case of a binomial GLMM). By scaling the dispersion parameters, we also modify the conditional variance. Consider the form, where $i = 1, 2, \dots, n$: \begin{equation} \text{Var}(\mc Y_i|\mc U=\bm u) = \frac{\phi}{w_i} \Var(\mu_i) \end{equation} where $\Var(\mu_i)$ is the family-specific variance function. Higher weights indicate a smaller variance. If prior weights are not specified, by default $\bm w$ is just a vector of 1's. Another common modification when dealing with generalized linear mixed models is to allow for an \textbf{offset}, which is an efficient way to include known scaling factors (like population or area) without introducing additional parameters to the model. In particular, the inclusion of an offset would modify the linear predictor as follows: \begin{equation} \label{eq:GLMMoffset} \bm\eta=\bm Z\bm b+\bm X\bm\beta + \text{offset}. \end{equation} For GLMMs, the scale parameter is not typically estimated as part of the nonlinear minimization of the negative likelihood over the $\beta$ and $\theta$ parameters. As is common in generalized linear model contexts, it is set equal to the (residual) deviance divided by the residual degrees of freedom. (Except for the Binomial and Poisson families, where the scale parameter is fixed at 1.) The deviance itself is estimated using the method of moments estimator, which is the Pearson residual sum of squares. This approach is preferred by \citet{McCullaghNelder1989} as it is less sensitive to small errors or model misspecification than the maximum likelihood estimator. We therefore estimate the deviance using the penalized weighted Pearson residual sum of squares, as described in Section 3.2 of \cite{bates2015fitting}. The likelihood of the parameters, given the observed data, is now \begin{equation} \label{eq:GLMMlike} L(\bm\beta,\bm\theta|\yobs)=\int_{\mathbb{R}^q}f_{\mc Y,\mc U}(\yobs,\bm u)\,d\bm u \end{equation} where, as in the case of linear mixed models, $f_{\mc Y,\mc U}(\yobs,\bm u)$ is the unscaled conditional density of $\mc U$ given $\mc Y=\yobs$. The notation here is a bit blurred because, although the joint distribution of $\mc Y$ and $\mc U$ is always continuous with respect to $\mc U$, it can be (and often is) discrete with respect to $\mc Y$. However, when we condition on the observed value $\mc Y=\yobs$, the resulting function is continuous with respect to $\bm u$ so the unscaled conditional density is indeed well-defined as a density, up to a scale factor. To evaluate the integrand in (\ref{eq:GLMMlike}) we need the unscaled conditional density $f_{\mc Y,\mc U}$, which depends on the likelihood of $\yobs$ conditional on $\bm u$: $f(\yobs, \bm u) = f(\yobs|\bm u) f(\bm u)$. As usual in likelihood estimation we will eventually be working, for convenience, with negative log likelihoods rather than likelihoods. However, base R's machinery for defining GLM models, or ``families'', does not provide a convenient expression for the negative log-likelihood: instead, objects of class \code{family} include a \code{dev.resids} function that returns a vector of observation-specific deviances, i.e. $-2(\log L(\yobs|\bm u) - \log \Lsat)$ where $\Lsat$ is the likelihood corresponding to a saturated model with as many parameters as observations. % \footnote{There is some confusion in \proglang{R} (and in its predecessor, \proglang{S}) about the exact definition of the deviance residuals of a family. As indicated above, we will use this name for the value of the \code{dev.resids} function for the family. The signed square root of this vector, using the signs of $\yobs-\mu$, is returned when the \code{residuals} method is applied to a fitted model of class \code{"glm"} when \code{type="deviance"}, the default, is specified. Both are called ``deviance residuals'' at different points in the documentation.} % For example, the log-likelihood for a Poisson observation $y$ with mean $\lambda$ is (omitting the normalization constant) $\log L(y, \lambda) = y \log \lambda - \lambda$; the deviance is $$ -2(\log L(y, \lambda) - \log L(y, y)) = -2(y(\log \lambda - \log y) - (\lambda-y)) $$ (with appropriate adjustments if $y=0$). Because deviances differ from (negative) log likelihoods only by a parameter-independent constant (the log-likelihood of the saturated model) and a change of scale (multiplication by 2), this change does not affect our optimization procedure. One advantage of using the pre-existing GLM family structure is that the software can thus fit models with user-specified families and link functions (although common families and link functions are hard-coded in C++ for computational speed), in contrast to early versions of \code{lme4} and some other \proglang{R} packages for GLMM fitting. If \code{dev.resids} returns $\bm d(\yobs|\bm u)$, with elements $d_i(\yobs|\bm u), i=1,\dots,n$, the full deviance of a generalized linear model is \begin{displaymath} f(\yobs, \bm u) = \sum_{i=1}^n d_i(\yobs|\bm u) + \|\bm u\|^{2}. \end{displaymath} The likelihood can now be expressed as \begin{equation} \label{eq:GLMMlike1} L(\bm\beta,\bm\theta|\yobs)= \int_{\mathbb{R}^q}\exp\left(-\frac{\sum_{i=1}^nd_i(\yobs|\bm u)+\|\bm u\|^2}{2}\right)\,(2\pi)^{-q/2}\,d\bm u \end{equation} As for linear mixed models, we simplify evaluation of the integral (\ref{eq:GLMMlike}) by determining the value, $\tilde{\bm u}_{\beta,\theta}$, that maximizes the integrand. When the conditional density, $\mc U|\mc Y=\yobs$, is multivariate Gaussian, this conditional mode will also be the conditional mean. However, for most families used in GLMMs, the mode and the mean need not coincide so we use the more general term and call $\tilde{\bm u}_{\beta,\theta}$ the \emph{conditional mode}. We first describe the numerical methods for determining the conditional mode using the Penalized Iteratively Reweighted Least Squares (PIRLS) algorithm then return to the question of evaluating the integral (\ref{eq:GLMMlike}). \subsection{Determining the conditional mode} \label{sec:conditionalMode} The iteratively reweighted least squares (IRLS) algorithm is an efficient method of determining the maximum likelihood estimates of the coefficients in a generalized linear model. We extend it to a \emph{penalized iteratively reweighted least squares} (PIRLS) algorithm for determining the conditional mode, $\tilde{\bm u}_{\beta,\theta}$. This algorithm has the form \begin{enumerate} \item Given parameter values, $\bm\beta$ and $\bm\theta$, and starting estimates, $\bm u_0$, evaluate the linear predictor, $\bm\eta$, the corresponding conditional mean, $\bm\mu_{\mc Y|\mc U=\bm u}$, and the conditional variance. Establish the weights as the inverse of the variance. We write these weights in the form of a diagonal weight matrix, $\bm W$, although they are stored and manipulated as a vector. \item Let \begin{equation} Q(\bm u) = \left\|\bm W^{1/2}\left(\yobs-\bm\mu_{\mc Y|\mc U=\bm u}\right)\right\|^2+\|\bm u\|^2. \end{equation} Solve the penalized, weighted, nonlinear least squares problem \begin{equation} \label{eq:weightedNLS} \arg\min_{\bm u}\left( Q(\bm u) \right). \end{equation} \item Update the weights, $\bm W$, and check for convergence. If not converged, go to step 2. \end{enumerate} We use a Gauss-Newton algorithm with an orthogonality convergence criterion~\citep[\S 2.2.3]{bateswatts88:_nonlin} to solve the penalized, weighted, nonlinear least squares problem in step 2. At the $i$th iteration we determine an increment, $\bm\delta_i$, as the solution to the penalized, weighted, linear least squares problem \begin{equation} \label{eq:incr} \bm\delta_i=\arg\min_{\bm\delta}\left\| \begin{bmatrix} \bm W^{1/2}\left(\yobs-\bm\mu_i\right)\\ \bm u_i \end{bmatrix}- \begin{bmatrix} \bm W^{1/2}\bm M_i\bm Z\bm\Lambda_\theta\\ \bm I_q \end{bmatrix}\bm u\right\|^2 \end{equation} where $\bm u_i$ is current value of $\bm u$, $\bm\mu_i$ is the corresponding conditional mean of $\mc Y|\mc U=\bm u_i$ and $\bm M_i$ is the Jacobian matrix of the vector-valued inverse link, evaluated at $\bm\mu_i$. That is \begin{equation} \label{eq:Jacobian} \bm M_i=\left.\frac{d\bm\mu}{d\bm\eta\trans}\right|_{\bm\eta_i}, \end{equation} which will be a diagonal matrix so, as for the weights, we store and manipulate the Jacobian as a vector. When solving for the minimum of $Q(\bm u)$, the proposed update for $\bm u$ may increase rather than decrease the value of the objective function $Q(\bm u)$, if the step size estimated by the Gauss-Newton step is too large. Let $Q(\bm u_i)$ represent the value of the objective function $Q(\bm u)$ at the current iterate $\bm u_{i}$; similarly, let $Q(\bm u_{i-1})$ represent the the value of the objective function of the previous iteration. If $Q(\bm u_i) > Q(\bm u_{i-1})$, we proceed with a \emph{step-halving} procedure. That is, we successively consider scaled updates of the form \begin{equation} \bm u_i^{(j)} = \bm u_{i-1} + \frac{\bm \delta_i}{2^j}, \quad j = 1, 2, \ldots, \end{equation} computing $Q(\bm u_i^{(j)})$ at each step. The step-halving procedure continues until a value of $j$ is found such that $Q(\bm u_i^{(j)}) < Q(\bm u_{i-1})$, which the iterate is then updated by setting $\bm u_{i} = \bm u_{i}^{(j)}$. If no such $j$ is found after a predetermined number of halvings, then the step-halving procedure has failed and the algorithm terminates. The minimizer, $\bm\delta_i$, of (\ref{eq:incr}) satisfies \begin{equation} \label{eq:incrEq} \bm P\left(\bLt\trans\bm Z\trans\bm M_i\bm W\bm M_i\bm Z\bLt+\bm I_q\right)\bm P\trans \bm\delta_i=\bLt\trans\bm Z\trans\bm M_i\bm W(\yobs-\bm\mu_i) - \bm u_i \end{equation} which we solve using the sparse Cholesky factor. At convergence, the factor, $\bm L_{\beta,\theta}$, satisfies \begin{equation} \label{eq:CholFactorGLMM} \bm L_{\beta,\theta}\bm L_{\beta,\theta}\trans = \bm P\left(\bLt\trans\bm Z\trans\bm M\bm W\bm M\bm Z\bLt+\bm I_q\right)\bm P\trans \end{equation} As we show in the next section, the matrix $(\bm L_{\beta,\theta}\bm L_{\beta,\theta}\trans)^{-1}$ is a Laplace approximation of the covariance matrix for the spherical random effects, conditional on the observed data. This fact is useful for constructing a nonlinear objective function for finding the approximate maximum likelihood estimates of $\theta$ and $\beta$. \subsection{Evaluating the likelihood for GLMMs using the Laplace approximation} \label{sec:Laplace} Evaluating the likelihood for generalized linear mixed models requires approximating an intractable integral over the random effects distribution. The \code{glmer} function offers several approximations, controlled by the \code{nAGQ} argument. The default value of \code{nAGQ=1} specifies the \emph{Laplace approximation} \citep{madsen2011}. A second-order Taylor series approximation to $-2\log[f_{\mc Y,\mc U}(\yobs,\bm u)]$ based at $\tilde{\bm u}$ provides an approximation of the unscaled conditional density as a multiple of the density for the multivariate Gaussian $\mathcal{N}(\tilde{\bm u},\bm L\bm L\trans)$. The change of variable \begin{equation} \label{eq:LaplaceChg} \bm u = \tilde{\bm u} + \bm L\bm z \end{equation} provides \begin{equation} \label{eq:GLMMLaplace} \begin{aligned} L(\bm\beta,\bm\theta|\yobs)&=\int_{\mathbb{R}^q}f_{\mc Y,\mc U}(\yobs,\bm u)\,d\bm u\\ &\approx \tilde{f}\,|\bm L|\, \int_{\mathbb{R}^q}e^{-\|\bm z\|^2/2}\,(2\pi)^{-q/2}\,d\bm z\\ &=\tilde{f}\, |\bm L| \end{aligned} \end{equation} or, on the deviance scale, \begin{equation} \label{eq:LaplaceDev} -2\ell(\bm\beta,\bm\theta|\yobs)\approx\sum_{i=1}^nd_i(\yobs,\tilde{\bm u}) + \|\tilde{\bm u}\|^2 + \log(|\bm L|^2)+\frac{q}{2}\log(2\pi) \end{equation} The Laplace approximation normally conditions on both the fixed effects $\beta$ and the variance-covariance parameters $\theta$. A further approximation, which is denoted in \code{glmer} by \code{nAGQ=0}, profiles out the fixed effects by minimizing $\beta$ and the conditional modes $\bm u$ simultaneously in eq.~\ref{eq:weightedNLS}. This approximation is exact when (1) $\partial(\log L)/\partial \beta$ is a linear function of the conditional modes $\bm u$ and (2) when the conditional mode is equal to the conditional mean (typically, although not necessarily, implying a symmetric conditional distribution). Both assumptions hold for linear mixed models (although a Laplace approximation is not necessary there), consistent with \cite{bates2015fitting} showing that the fixed effects can be profiled out of the log-likelihood for LMMs. The Julia \code{MixedModels.jl} package offers the same approximation as the \code{fast} argument to the \code{pirls!} function (\url{https://juliastats.org/MixedModels.jl/stable/optimization/}); Template Model Builder \citep{kristensenTMB2016}, and downstream packages such as \code{glmmTMB}, provide this function via a \code{profile} argument. By default, \code{glmer} uses a two-stage optimization procedure (described below) with \code{nAGQ=0} in the first stage; users can also specify \code{nAGQ=0} for faster, approximate model fits. \subsubsection{Decomposing the deviance for simple models} \label{sec:simplescalar} A common special case of mixed models is those where scalar (typically intercept) random effects are associated with levels of a single grouping factor, $\bm h$. In this case the dimension, $q$, of the random effects is the number of levels of $\bm h$ --- i.e.{} there is exactly one random effect associated with each level of $\bm h$. We will write the vector of variance-covariance parameters, which is one-dimensional, as a scalar, $\theta$. The matrix $\bm\Lambda_{\bm\theta}$ is a multiple of the identity, $\theta\bm I_q$, and $\bm Z$ is the $n\times q$ matrix of indicators of the levels of $\bm f$. The permutation matrix, $\bm P$, can be set to the identity and $\bm L$ is diagonal, although not necessarily homogeneous (i.e., a scalar multiple of the identity matrix). Because each element of $\bm\mu$ depends on only one element of $\bm u$ and the elements of $\mc Y$ are conditionally independent, given $\mc U=\bm u$, the conditional densities of the $u_j,j=1,\dots,q$ given $\mc Y=\yobs$ are independent. We partition the indices $1,\dots,n$ as $\mathbb{I}_j,j=1,\dots,q$ according to the levels of $\bm h$. That is, the index $i$ is in $\mathbb{I}_j$ if $h_i=j$. This partitioning also applies to the deviance residuals in that the $i$th deviance residual depends only on $u_j$ when $i\in\mathbb{I}_j$. Writing the univariate conditional densities as \begin{equation} \label{eq:univariateCondDens} f_j(\yobs,u_j)=\exp\left(-\frac{\sum_{i\in\mathbb{I}_j}d_i(\yobs, u_j)+u_j^2}{2}\right)(2\pi)^{-1/2} \end{equation} we have \begin{equation} \label{eq:vectorCondDens} f_{\mc Y,\mc U}(\yobs,\bm u)=\prod_{j=1}^q f_j(\yobs,u_j) \end{equation} and \begin{equation} \label{eq:ssLike} \begin{aligned} L(\bm\beta,\bm\theta|\yobs)=\prod_{j=1}^q\int_{\mathbb{R}}f_j(\yobs,u)\,du \end{aligned} \end{equation} We consider this special case both because it occurs frequently and because, for some software, it is the only type of GLMM that can be fit. Also, in this particular case we can graphically assess the quality of the Laplace approximation by comparing the actual integrand to its approximation. Consider the \code{cbpp} data on contagious bovine pleuropneumonia (CBPP) incidence according to season and herd, available in the \pkg{lme4} package (see \ref{sec:cbpp} for more details), and the model <>= print(m1 <- glmer(cbind(incidence, size-incidence) ~ period + (1|herd), cbpp, binomial), corr=FALSE) @ This model has been fit by minimizing the Laplace approximation to the deviance. We can assess the quality of this approximation by evaluating the unscaled conditional density at $u_j(z)=\tilde{u_j} + z/{\bm L_{j,j}}$ and comparing the ratio, $f_j(\yobs,u)/(\tilde{f_j}\sqrt{2\pi})$, to the standard normal density, $\phi(z)=e^{-z^2/2}/\sqrt{2\pi}$, as shown in Figure~\ref{fig:densities}. \begin{figure}[tbp] \centering <>= zeta <- function(m, zmin=-3, zmax=3, npts=301L, rank = FALSE) { stopifnot (is(m, "glmerMod"), length(m@flist) == 1L, # single grouping factor length(m@cnms[[1]]) == 1L) # single column for that grouping factor pp <- m rr <- m@resp u0 <- getME(pp,"u") sd <- 1/getME(pp,"L")@x ff <- as.integer(getME(pp,"flist")[[1]]) fc <- getME(pp,"X") %*% getME(pp,"beta") # fixed-effects contribution to linear predictor ZL <- t(getME(pp,"Lambdat") %*% getME(pp,"Zt")) dc <- function(z) { # evaluate the unscaled conditional density on the deviance scale uu <- u0 + z * sd rr$updateMu(fc + ZL %*% uu) unname(as.vector(tapply(rr$devResid(), ff, sum))) + uu * uu } zvals <- seq(zmin, zmax, length.out = npts) d0 <- dc(0) # because this is the last evaluation, the model is restored to its incoming state # signed square root sqrtmat <- t(sqrt(vapply(zvals, dc, d0, USE.NAMES=FALSE) - d0)) * array(ifelse(zvals < 0, -1, 1), c(npts, length(u0))) if (rank) { ## order by 'badness' (variance of normalized profile) nvals <- exp(-0.5*sqrtmat^2)/sqrt(2*pi)/dnorm(zvals) colnames(sqrtmat) <- seq(ncol(sqrtmat)) sqrtmat <- sqrtmat[,order(apply(sqrtmat, 2, var), decreasing = TRUE)] } list(zvals=zvals, sqrtmat= sqrtmat ) } plot_zeta <- function(z, norm = FALSE, ylab = if (!norm) "density" else "t(z)", which = NULL, layout = c(5,3)) { dmat <- exp(-0.5*z$sqrtmat^2)/sqrt(2*pi) if (!is.null(which)) { dmat <- dmat[, which] } if (norm) dmat <- dmat/dnorm(z$zvals) xyplot(as.vector(dmat) ~ rep.int(z$zvals, ncol(dmat))|gl(ncol(dmat), nrow(dmat)), type=c("g","l"), aspect=0.6, layout=layout, xlab="z", ylab=ylab, panel=function(...){ if (!norm) panel.lines(zm$zvals, dnorm(zm$zvals), lty=2) panel.xyplot(...)} ) } zm <- zeta(m1, -3.750440, 3.750440) plot_zeta(zm) @ \caption{Comparison of univariate integrands (solid line) and standard normal density function (dashed line)} \label{fig:densities} \end{figure} As Figure~\ref{fig:densities} shows, the univariate integrands are very close to the standard normal density, indicating that the Laplace approximation to the deviance is a good approximation in this case. \section{Adaptive Gauss-Hermite quadrature for GLMMs} \label{sec:aGQ} When the integral (\ref{eq:GLMMlike}) can be expressed as a product of low-dimensional integrals, we can use Gauss-Hermite quadrature to provide a closer approximation to the integral. Univariate Gauss-Hermite quadrature evaluates the integral of a function that is multiplied by a ``kernel'' where the kernel is a multiple of $e^{-z^2}$ or $e^{-z^2/2}$. For statisticians the natural candidate is the standard normal density, $\phi(z)=e^{-z^2/2}/\sqrt(2\pi)$. A $k$th-order Gauss-Hermite formula provides knots, $z_i,i=1,...,k$, and weights, $w_i,i=1,\dots,k$, such that \begin{displaymath} \int_{\mathbb{R}}t(z)\phi(z)\,dz\approx\sum_{i=1}^kw_it(z_i) \end{displaymath} The function \code{GHrule} in \pkg{lme4} (based on code in the \pkg{SparseGrid} package) provides knots and weights relative to the standard normal kernel for orders $k$ from 1 to 100. For example, <>= GHrule(5) @ where \code{z} is the vector of knots, \code{w} is the vector of weights, and \code{dlnorm} is the log-density of the standard normal distribution at $z$. The choice of the value of $k$ depends on the behavior of the function $t(z)$. If $t(z)$ is a polynomial of degree $2k-1$ then the Gauss-Hermite formula for orders $k$ or greater provides an exact answer. The fact that we want $t(z)$ to behave like a low-order polynomial is often neglected in the formulation of a Gauss-Hermite approximation to a quadrature. The quadrature knots on the $u$ scale are chosen as \begin{equation} \label{eq:quadraturepts} u_{i,j}(z)=\tilde{u_j} + z_i/{\bm L_{j,j}},\quad i=1,\dots,k;\;j=1,\dots,q \end{equation} exactly so that the function $t(z)$ should behave like a low-order polynomial over the region of interest, which is to say the region where quadrature knots with large weights are located. The term ``adaptive Gauss-Hermite quadrature'' reflects the fact that the approximating Gaussian density is scaled and shifted to provide a second order approximation to the logarithm of the unscaled conditional density. Figure~\ref{fig:tfunc} shows $t(z)$ for each of the unidimensional integrals in the likelihood for the model \code{m1} at the parameter estimates. \begin{figure}[tbp] \centering <>= plot_zeta(zm, norm = TRUE) @ \caption{The function $t(z)$, which is the ratio of the normalized unscaled conditional density to the standard normal density, for each of the univariate integrals in the evaluation of the deviance for model \code{m1}. These functions should behave like low-order polynomials.} \label{fig:tfunc} \end{figure} <>= data("toenail", package = "lme4") ttab <- with(toenail, table(table(patientID))) ## sum(as.numeric(names(ttab))*ttab/sum(ttab)) @ The CBPP data set is a relatively well-behaved data set, where Laplace approximation works well. In contrast, a widely used data set on toenail onychomycosis \citep{debacker+1998}, which has a very low effective sample size per cluster --- an average of about 6.5 binary observations (``moderate or severe'' vs ``none or mild'' disease) per patient --- represents an example where Gauss-Hermite quadrature is necessary for reliable results. In this case the conditional densities depart much more clearly from the standard normal (Figure~\ref{fig:toenailplot}; note the scale of the density ratios goes from 0 to 10, in contrast the maximum of 4 in Figure~\ref{fig:tfunc}). \cite{stringerAsymptoticsNumericalIntegration2022} and \cite{stringerExactGradientEvaluation2024} further explore the limitations of Laplace approximation. \begin{figure}[tbp] \centering <>= data("toenail", package = "lme4") toemod <- glmer(outcome ~ time*treatment + (1 | patientID), data = toenail, family = binomial(link = "logit")) zm <- zeta(toemod) set.seed(101); s <- sample(ncol(zm$sqrtmat), size = 9) gridExtra::grid.arrange(plot_zeta(zm, which = s, layout = c(3,3)), plot_zeta(zm, which = s, norm = TRUE, layout = c(3,3)), nrow = 1) @ \caption{Normalized unscaled conditional density (left) and ratio of density to the standard normal density (right) for a random sample of 9 patients from the toenail onychomycosis data set.} \label{fig:toenailplot} \end{figure} To use adaptive Gauss-Hermite quadrature for model fitting in \code{glmer} models, set the argument \code{nAGQ}, the number of quadrature points, to a value greater than 1. Increasing the number of nodes generally improves the accuracy of the likelihood approximation at the expense of computation time --- although rounding errors may accumulate when using large numbers of quadrature points. At present, AGQ is only available for models with a single scalar random effect. (The \code{GLMMadaptive} package implements AGQ for vector-valued random effect models, although it is still restricted to models with a single random effect.) \section{Model fitting} Once we can calculate the deviance by PIRLS for specified values of $\btheta$ and $\bbeta$ (or only $\btheta$ if profiling out the fixed effects via \code{nAGQ=0}), we then estimate the parameters by nonlinear optimization. This procedure largely follows the description in \cite{bates2015fitting}, using derivative-free optimizers with box constraints to prevent non-positive-(semi)definite covariance matrices. Specifically, the elements of $\btheta$ corresponding to the diagonal of $\bm\Lambda_\theta$ are currently constrained to be non-negative. The only difference is that by default \code{glmer} uses a two-step fitting procedure, using \code{nAGQ=0} at the first stage to get preliminary estimates which are then used as starting points for a second optimization with Laplace approximation or Gauss-Hermite quadrature as specified by the user. Different nonlinear optimizers can be used at each stage: the current default, as specified in \code{glmerControl}, is to use Powell's BOBYQA followed by a box-constrained variant of the Nelder-Mead simplex algorithm. For faster, approximate fitting, the second stage can be omitted; in the rare cases where the initial \code{nAGQ=0} fit gives poor results, the first stage can be skipped via \code{glmerControl(nAGQ0initStep = FALSE)}. However, because the profiled log-likelihood is an even function of the diagonal elements of $\btheta$ and $\bm\Sigma_\theta=\bm\Lambda_\theta\bm\Lambda_\theta\trans$ depends on them only through their squares, positive and negative values yield identical likelihoods. This symmetry means that constrained optimization is not strictly necessary --- an unconstrained optimizer will converge to a correct solution, approaching zero from either side in the boundary case of a singular random-effects covariance matrix. (Once a solution is found, we can map it to a unique solution where the diagonal elements are all non-negative.) A similar approach to removing constraints could work for structured covariance matrices where correlation parameters are constrained to $(-1,1)$, e.g. by parameterizing the model in terms of a phase parameter $p$ where $\rho = \sin(p)$ (and then mapping $p$ to $(0, 2\pi)$). Removing these constraint would permit the use of a broader class of unconstrained optimizers, though this feature has not yet been incorporated into \code{lme4}. \section{Examples} \subsection{CBPP} \label{sec:cbpp} The \code{?cbpp} help page describes the CBPP data set \citep{lesnoff_within-herd_2004} as follows: \begin{quote} Contagious bovine pleuropneumonia (CBPP) is a major disease of cattle in Africa, caused by a mycoplasma. This dataset describes the serological incidence of CBPP in zebu cattle during a follow-up survey implemented in 15 commercial herds located in the Boji district of Ethiopia. The goal of the survey was to study the within-herd spread of CBPP in newly infected herds. Blood samples were quarterly collected from all animals of these herds to determine their CBPP status. These data were used to compute the serological incidence of CBPP (new cases occurring during a given time period). Some data are missing (lost to follow-up). \end{quote} \cite{lesnoff_within-herd_2004} estimated the effects of different treatments using (1) ordinary logistic regression incorporating a variance-inflation factor, also known as a quasi-binomial model (``logistic regression'' is sometimes used specifically to describe analyses of Bernoulli responses, but in this case there are multiple trials per observation [cows that could become seropositive], and so a dispersion or scale parameter can be estimated); (2) a GLMM implemented in \code{lme4}; and a (3) Markov chain Monte Carlo algorithm \citet{zeger1991generalized}, which as they state allows for a non-parametric rather than a Normal model for the random effects. The authors did not find any significant effects of treatment, ascribing the null results to ``a lack of power in the statistical analyses or to a quality problem for the medications used (and more generally, for health-care delivery in the Boji district).'' <>= cbpp2 <- read.csv(system.file("vignette_data", "cbpp2.csv", package = "lme4")) cbpp2 <- transform(cbpp2,period=factor(period), treatment=factor(treatment, levels=c("Partial/null","Complete","Unknown"))) @ (Note that Table 1 of \cite{lesnoff_within-herd_2004} contains a known typographical error for herd 6. Consequently, results obtained using the \code{cbpp} data set may not exactly reproduce some of the findings reported in that paper.) The \code{lme4} package includes two variants of the \code{cbpp} data set. The second variant, \code{cbpp2}, contains corrected values corresponding to Table~1 of \cite{lesnoff_within-herd_2004}. Although the precise provenance of these data sets is unclear, the \code{cbpp} data set matches the version held by the corresponding author of \cite{lesnoff_within-herd_2004}. <>= g0 <- ggplot(cbpp2,aes(period, incidence/size, colour=treatment, shape = treatment)) + ## scale_y_continuous(limits = c(0, 1))+ labs(x="Period",y="Proportional incidence")+ scale_colour_brewer(palette="Dark2") ## faceted plot ## g0 + ## geom_point(aes(size = size)) + ## facet_wrap(~herd) + ## zmargin ## g0 + geom_point(aes(size = size),alpha=0.4) + geom_line(aes(group=herd),alpha=0.3) ## geom_boxplot(fill="black",alpha=0.1,width=0.5, ## aes(group=interaction(period,treatment))) ## ## boxplot(incidence/size ~ period, data = cbpp, las = 1, ## xlab = 'Period', ylab = 'Probability of sero-positivity') @ We model proportional incidence as a binomial response depending on the additive fixed effects of period, treatment and average herd size; to account for repeated measures we fit a model with a random effect of herd. In practice we might also be interested in a period by treatment interaction, but we neglect that term here. As in \code{glm}, we can specify a binomial response as a proportion and use the \code{weights} argument to specify the sample size, instead of the more typical two-column \code{cbind(successes,failures)} format: <>= gm1 <- glmer(incidence/size ~ period + treatment + avg_size + (1 | herd), family = binomial, data = cbpp2, weights = size) @ <>= ## diagnostic junk about N-M vs bobyqa differences tmpf <- function(x) c(deviance(x), unlist(getME(x,c("theta","beta")))) gm1 <- glmer(incidence/size ~ period + treatment + avg_size + (1 | herd), family = binomial, data = cbpp2, weights = size) gm1B <- update(gm1,control=glmerControl(optimizer="bobyqa")) cbind(tmpf(gm1),tmpf(gm1B)) dd1 <- update(gm1,devFunOnly=TRUE) dd1B <- update(gm1B,devFunOnly=TRUE) gm1@optinfo$derivs$gradient gm1B@optinfo$derivs$gradient gm1@optinfo$derivs$gradient/gm1B@optinfo$derivs$gradient library(numDeriv) grad(dd1,tmpf(gm1)[-1]) grad(dd1B,tmpf(gm1B)[-1]) @ It is also worth considering adding an observation-level random effect to the model, which we can do by creating a new factor based on observation number and using \code{update()} on the previous model: <>= cbpp2 <- transform(cbpp2,obs=factor(seq(nrow(cbpp2)))) gm2 <- update(gm1,.~.+(1|obs)) ## herd and observation-level REs gm3 <- update(gm1,.~.-(1|herd)+(1|obs)) ## observation-level REs only @ \subsubsection{Model summary} The first part of the summary reiterates the family and link function used, the model formula, and gives various summary statistics (log-likelihood etc.), as well as quantiles of the scaled (Pearson) residuals: <>= ss <- summary(gm1) cc <- capture.output(print(summary(gm1))) reRow <- grep("^Random effects",cc) cat(cc[1:(reRow-2)],sep="\n") @ These quantities are also accessible via standard accessors (\code{AIC()}, \code{BIC()}, \code{logLik()}). The next chunk of \code{summary()} describes the random effects and the number of levels associated with each grouping factor (the latter is useful for checking that random-effects formulae have been specified correctly): <>= feRow <- grep("^Fixed effects",cc) cat(cc[reRow:(feRow-2)],sep="\n") @ This information is also accessible via \code{VarCorr()}, which returns a list of variance-covariance matrices (the \code{print} method for \code{VarCorr} objects allows control of whether the variance, or standard deviation, or both, are printed). Next come the estimates of the fixed effects, along with Wald estimates of the standard error, $Z$ statistic, and $p$-value: <>= corRow <- grep("^Correlation",cc) cat(cc[feRow:(corRow-2)],sep="\n") @ One can use \code{coef(summary())} to retrieve this information, and optionally format it with \code{printCoefmat()}. The last component of \code{summary()} gives the estimated correlations among the fixed-effect parameters, which can be useful for assessing multicollinearity (it can also be overwhelming: it is suppressed by default for models with more than 20 fixed-effect parameters, and can also be suppressed by using \code{print(summary(.),correlation=FALSE)}). <>= cat(cc[corRow:length(cc)],sep="\n") @ \subsubsection{Diagnostics} A range of graphical diagnostic tools is available for \code{merMod} objects. The plot methods in the \code{lme4} package are inspired by those in the \code{nlme} package, using \code{lattice} plots to provide a reasonable blend of convenience and flexibility. \code{merMod} objects are also compatible with the \code{performance} package and the \code{DHARMa} package, both commonly used for model checking. The following code produces a standard range of diagnostic plots (Figure~\ref{fig:glmerDiag}), similar to the ones in base R's \code{plot.lm} method. These diagnostics will generally be useful for models where the conditional density is approximately normal (but heteroscedastic) --- e.g., Poisson responses with large mean or binomial responses with large numbers of successes --- and less so otherwise, e.g. for binary responses. <>= ## basic residual plot plot(gm1) ## scale-location plot plot(gm1,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth")) ## boxplot of residuals grouped by a categorical predictor plot(gm1,period~resid(.)) ## Q-Q plot qqmath(gm1) @ <>= p1 <- plot(gm1,type=c("p","smooth"), main="Default (Pearson resid vs. fitted)") p2 <- plot(gm1,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"), main="Scale-location") p3 <- plot(gm1,period~resid(.), main = "Grouped residuals") p4 <- qqmath(gm1,main="Quantile-quantile plot",type=c("p","r")) grid.arrange(p1,p2,p3,p4,nrow=2) @ The \code{ranef()} accessor extracts the conditional modes; the argument \code{condVar=TRUE} additionally extracts the variances of the conditional modes, which are stored as an attribute labelled \code{"postVar"} --- a three-dimensional array that gives the variance-covariance matrix of the conditional modes for each level of the grouping variable. The plotting methods \code{dotplot()} and \code{qqmath()} return lists of graphical objects showing \emph{caterpillar plots} (ordered values of the random effects with confidence bars); in the case of the Q-Q plot (\code{qqmath}) the $y$-axis shows corresponding values of the standard normal quantiles (Figure~\ref{fig:glmerRanefplot}). <>= rr <- ranef(gm1,condVar=TRUE) dd <- dotplot(rr) qq <- qqmath(rr,type=c("p","r")) grid.arrange(dd$herd,qq$herd,ncol=2) @ \code{performance::check\_model()} checks a variety of classical assumptions of GLMs. For mixed-effect models, it also assesses the normality of the distribution of conditional modes (Figure~\ref{fig:perf_chk_mod_cbpp}). <>= check_model(gm1, check = c("pp_check", "binned_residuals", "qq", "reqq")) @ The \code{DHARMa} package is also compatible with \code{merMod} objects. \code{DHARMa} generates simulation-based residuals for generalized linear (mixed) models and uses them for graphical and statistical tests of model assumptions (Figure~\ref{fig:dharma_mod_cbpp}). <>=$ simulationOutput <- simulateResiduals(fittedModel = gm1) plot(simulationOutput) @ Having checked the diagnostics, we would now like to compare the three models we have fitted. Inspecting the \code{VarCorr} components, we see that when we fit both herd- and observation-level random effects, the among-herd variance is estimated as zero. The appropriate procedure at this point (e.g. whether one drops non-significant terms, or those with small scaled magnitudes, or those that worsen the AIC or BIC of the model) depends on the goals of the analysis and one's philosophy of model-building \citep{barrRandomEffectsStructure2013,matuschekBalancingTypeError2017,scandola2024}. One might either stick with the full model, or continue with the reduced model with observation-level random effects only (as it has exactly the same likelihood as the full model but uses an additional parameter, it would be chosen according to either an information-theoretic or a hypothesis-testing model selection framework). Here we will start by computing likelihood profiles and confidence intervals for the model incorporating both random effects; although it has the same point estimates and maximum likelihood as the reduced model, confidence intervals that incorporate non-local information (i.e. profile- or parametric bootstrap-based) will give different, more conservative results for the full model. %% lme4 wishlist: allow [Ww]ald in confint.merMod? <>= cbpp_batch_vars <- load(system.file("vignette_data", "cbpp_batch.rda", package = "lme4")) contr_batch_vars <- load(system.file("vignette_data", "Contraception_batch.rda", package = "lme4")) @ The \code{profile} method computes profile likelihoods. The computation can be slow, since complete profiling for a model with $p$ random- and fixed-effect parameters requires fitting $p$ profiles, each of which requires many $p-1$-dimensional optimizations. The machinery for generating likelihood profiles for GLMMs is similar to that for LMMs (see \cite{bates2015fitting}, \S\ 5.1). The \code{profile} method returns an object of class \code{thpr} --- a data frame containing the profiles, augmented with attributes containing interpolation splines for each parameter profile and their inverses (using \code{splines::interpSpline} and \code{splines::backSpline}); the latter are used for plotting profiles and computing confidence intervals. An \code{as.data.frame} method adds \code{.focal} and \code{.par} variables to the data frame, useful for customized plots. Profiles can be used for univariate (\code{xyplot}) and bivariate (\code{splom}) profile plots, and to compute profile confidence intervals (\code{confint}). (\code{confint} applied to a \code{glmer} fit will first fit the profile, then use it to compute profile confidence intervals. Given the computational cost of profiling, it makes sense to compute and save the profile as an intermediate step if one plans to do anything other than computing confidence intervals.) Two other common methods for computing confidence intervals are parametric bootstrapping (\code{method = "boot"}) and the classical Wald approximation (\code{method = "Wald"}). Parametric bootstrapping is much slower, but more accurate (and, via the \code{FUN} argument, can generate confidence intervals for any quantity that can be derived from a fitted model). The Wald approximation is faster and less accurate than profile confidence intervals. By default \code{glmer} only returns estimates for the fixed-effects parameters, as the assumptions of the Wald approximation are often violated badly for random-effects (co)variances and correlations. In the examples below we use the finite-difference Hessian (second derivative matrix of the estimated parameters) and the delta method to compute Wald confidence intervals for random-effects standard deviations and correlations, when possible. Figure~\ref{fig:cbppcompplot} compares all three of these confidence intervals across all three of the models fitted. <>= des_ord <- c("sd_(Intercept)|herd", "sd_(Intercept)|obs", "avg_size", "treatmentUnknown", "treatmentComplete") combCI <- subset(cbpp_combCI,var!="(Intercept)" & !grepl("^period",var)) combCI <- transform(combCI, var = factor(var, levels = rev(des_ord))) combCI <- transform(combCI, vartype = factor(ifelse(grepl("^sd", var), "random effects", "fixed effects")) ) @ <>= mnames <- cbpp_df_name$mnames pd <- position_dodge(width=0.6) ggplot(combCI, aes(var, est, colour=type, shape=model)) + geom_pointrange(aes(ymin=lwr, ymax=upr), position=pd) + labs(x="", y="Estimate (log-odds of seropositivity)") + geom_hline(yintercept=0, lty=2) + coord_flip() + scale_colour_brewer(palette="Dark2", guide = revguide) + scale_shape(guide=revguide, labels = mnames) + facet_wrap(~ vartype, ncol = 1, scale = "free") @ If the default optimizers (Nelder-Mead followed by BOBYQA) do not perform well, one could attempt to re-fit a [g]lmer model with a variety of different optimizers using \code{allFit()}. To see which optimizers \code{lme4} currently supports, one can use the command \code{allFit(show.meth.tab=TRUE)} to show all of the different methods. To use \code{allFit()}, supply the initial fitted model as input. Useful results are obtained from \code{summary(allFit())}: \begin{itemize} \item \code{\$which.OK} returns which optimizers worked. Below only applies to optimizers that were successful. \item \code{\$llik} returns log-likelihoods \item \code{\$fixef} fixed-effect estimates \item \code{\$sdcor} random-effect standard deviations and correlations \item \code{\$theta} random-effect parameters on the Cholesky scale \end{itemize} We will show the summary results for \code{\$sdcor}; the other components are similar. <>= gm_all <- allFit(gm1) @ <>= ss <- summary(gm_all) ss$sdcor @ As of version 2.0, \code{lme4} can also specify structured variance-covariance matrices for (generalized) linear mixed models. \code{lme4} now supports unstructured (general positive definite), diagonal, compound symmetry, and first-order autoregressive (AR1) structures. By default, AR1 models assume a homogeneous-variance model (the variance is the same for all time steps), while the other models assume heterogeneous-variance models (variances differ for every level of the varying term); users can adjust this with the \code{hom} argument (e.g. \code{ar1(..., hom = FALSE)}). The unstructured covariance structure is the default for mixed models. Here we illustrate fitting an AR1 model; the next example will show diagonal and compound symmetric models. <>= gm.ar1 <- glmer(incidence/size ~ ar1(1 + herd | period), family = binomial, data = cbpp, weights = size) print(VarCorr(gm.ar1)) @ For this particular model, a heterogeneous AR1 model (\code{ar1(..., hom = FALSE)}) results in a singular fit. See the \href{https://CRAN.R-project.org/package=lme4/vignettes/covariance_structures.html}{lme4 covariance structures vignette} for more detail. \subsection{Contraception} \citet{huq1990bangladesh} use multilevel models to analyze data from a fertility survey of women in Bangladesh. These data are available as the \code{Contraception} object in the \pkg{mlmRev} package. The response variable is binary and indicates whether or not each woman was using contraception at the time of the survey. Covariates included the woman's age, the number of live children she had, whether she lived in an urban or rural setting, and the district in which she lived. <>= library(mlmRev) data(Contraception) @ <>= Contraception <- transform(Contraception, facet_urban = ifelse(urban=="Y","Urban","Rural")) gg0 <- ggplot(Contraception) + geom_smooth(aes(age, ifelse(use == "Y",1,0), colour = livch, fill = livch), method = "gam", alpha = 0.1, method.args = list(gamma = 0.6), formula = y ~ s(x, bs = "cs")) + facet_wrap(~facet_urban) + scale_x_continuous("Centered age") + scale_y_continuous("Proportion", limits = c(0, 1), expand = c(0,0)) + scale_colour_brewer("Number\nof living\nchildren", palette="Dark2") + scale_fill_brewer("Number\nof living\nchildren", palette="Dark2") ## add marginal density ribbons gg0 + geom_density( data = subset(Contraception, use == "N"), aes(x = age, y = after_stat(density), fill = livch), alpha = 0.3, colour = NA) + geom_ribbon( data = subset(Contraception, use == "Y"), stat = "density", aes( x = age, ymin = 1 - after_stat(density) * 0.5, ymax = 1, fill = livch ), alpha = 0.3, colour = NA ) @ <>= Contraception <- transform(Contraception, ch = factor(livch != 0, labels = c("N","Y")), age_s = age/(2*sd(age))) @ Figure~\ref{fig:graphContraception} shows exploratory smooth curves of contraceptive use as a function of centered age, stratified by number of living children and urban/rural residence. In rural areas, women with two living children show the highest rates of contraceptive use, peaking near 50\% at the average age, while women in urban areas with three or more living children show the highest rates near the average age but with a more pronounced decline at older ages. In both settings, women with no living children consistently show the lowest rates of contraceptive use. The nonmonotonic effect of age on contraceptive use and the apparent dependence of this age trend on child status motivate the model specifications explored below. (These nonmonotonic trends were not noticed in the original analysis of the data, which concluded that there was no significant effect of age.) We construct six \code{glmer} models with varying choices of explanatory variables and random effects structure, comparing them via \code{anova}. The models considered different variables. For instance, there are two different ways to encode the number of living children: \code{livch}, a four-level factor distinguishing \code{0}, \code{1}, \code{2}, or \code{3+} children, while other models use \code{ch} instead (later we label as \code{binary\_child}), which is a binary indicator for whether the woman has any living children. Second, we vary whether child status interacts with the woman's age: two models include \code{ch} (or \code{livch}) and \code{age} as additive terms, while the other four include a \code{ch} and \code{age} interaction. A quadratic effect of age (\code{I(age\^{}2)}) was added to account for the nonlinear effect of age. Third, we explore different random effects structures at the district level. Three of them use a single random intercept per district \code{(1 | district)}. One of them extends this with a random slope for urban status \code{(urban | district)}, allowing the urban/rural difference to vary by district. The next uses nested random intercepts for district and site within district (\code{1 | district/urban}) separating district-level from urban-within-district variation. (\cite{scandola2024} refer to this formulation as a ``complex random intercepts'' model). The remaining model uses only the \code{urban:district} grouping. To reduce convergence warnings and facilitate interpretability, age (already approximately centered in the data set) was standardized by scaling by twice the standard deviation \citep{gelman2008scaling}. <>= cm1 <- glmer(use ~ age_s + I(age_s^2) + urban + livch + (1|district), Contraception, binomial) ## switch from livch (ordinal) to ch (binary) cm2 <- update(cm1, . ~ . - livch + ch) ## add age by children interaction cm3 <- update(cm2, . ~ . + age_s:ch) ## allow urban effect to vary across districts (correlated) cm4 <- update(cm3, . ~ . - (1|district) + (1+urban|district)) ## compound symmetric/nested formulation cm5 <- update(cm3, . ~ . - (1|district) + (1 | district/urban)) ## as above but drop district effect cm6 <- update(cm3, . ~ . - (1|district) + (1 | district:urban)) @ <>= ## copy of bbmle::AICtab AICtab <- function(..., mnames) { aic_df <- do.call(AIC, list(...)) LLvec <- sapply(list(...), function(x) c(logLik(x))) rownames(aic_df) <- mnames aic_df <- transform(aic_df, Δnegloglik = -LLvec - min(-LLvec), ΔAIC = AIC - min(AIC)) aic_df <- subset(aic_df, select = -AIC) aic_df <- aic_df[order(aic_df$ΔAIC),] aic_df } mod_list <- c(cm1, cm2, cm3, cm4, cm5, cm6) op <- options(digits = 3) mnamevec <- c("int_child + age + (1 | district)", "binary_child + age + (1 | district)", "binary_child × age + (1 | district)", "binary_child × age + (1 + urban | district)", "binary_child × age + (1 | district/urban)", "binary_child × age + (1 | district:urban)") aictab <- do.call(AICtab, c(mod_list, list(mnames = mnamevec))) @ \begin{table} <>= knitr::kable(aictab, digits = 3) @ \caption{Model comparison for Contraception fits. Note that for likelihood ratio tests, or for AIC comparisons restricted to nested models \citep{ripleySelectingAmongstLarge2004a}, the nesting sequence for the random-effects models is \code{(1+urban|district)} $>$ \code{(1|district/urban)} $>$ \{% \code{(1|district)}, \code{(1|district:urban)}% \}. } \label{tab:contraictab} \end{table} Table~\ref{tab:contraictab} shows that the top three models, all of which include a child-by-age interaction and some effect of urbanization, fit approximately equally well ($\Delta$ negative log-likelihood $<0.5$). The \code{(1|district/urban)} model barely improves on the fit of \code{(1|district:urban)} (0.005 log-likelihood units), at the cost of an extra variance parameter, so it is almost 2 AIC units worse. \code{(1 + urban | district)} is a bit better ($\approx$ 0.5 log-likelihood units), but includes a covariance parameter. Models that drop the interaction between child status and age or replace the binary child indicator with an integer count perform substantially worse ($\Delta \textrm{AIC} > 10$). Overall, the results suggest that accounting for urban/rural variation at the district level and including the age and child interaction are both important, but the precise random effects structure for urban/rural variation is relatively unimportant. <>= agevec <- seq(-15, 20) pframe <- with(Contraception, expand.grid(age = agevec, urban = levels(urban), ch = levels(ch))) pframe <- transform(pframe, age_s = age/(2*sd(Contraception$age)), facet_urban = ifelse(urban=="Y","Urban","Rural")) pred <- predict(cm5, newdata = pframe, se.fit = TRUE, re.form = NA) pframe$use <- plogis(pred$fit) pframe$lwr <- plogis(pred$fit-1.96*pred$se.fit) pframe$upr <- plogis(pred$fit+1.96*pred$se.fit) pframe2 <- with(Contraception, expand.grid(age = agevec, urban = levels(urban), ch = levels(ch), district = levels(district))) pframe2 <- transform(pframe2, age_s = age/(2*sd(Contraception$age)), facet_urban = ifelse(urban=="Y","Urban","Rural"), dist_urb = paste(urban, district, sep = ":")) c_urb <- unique(with(Contraception, paste(urban, district, sep = ":"))) pframe2 <- subset(pframe2, dist_urb %in% c_urb) pred2 <- predict(cm5, newdata = pframe2, re.form = NULL) pframe2$use <- plogis(pred2) ggplot(pframe, aes(age)) + geom_line(aes(y = use, colour = ch)) + geom_ribbon(aes(ymin = lwr, ymax = upr, fill = ch), color = NA, alpha = 0.3) + facet_wrap(~facet_urban) + scale_x_continuous("Centered age") + scale_y_continuous("Proportion", limits = c(0, 1), expand = c(0,0)) + scale_colour_brewer("Living\nchildren", palette="Dark2") + scale_fill_brewer("Living\nchildren", palette="Dark2") + geom_line(data = pframe2, aes(group = interaction(dist_urb, ch), colour = ch, y = use), alpha = 0.2) @ As with the CBPP data set, we can also compute profile, Wald, and parametric bootstrap confidence intervals for all of the models to understand the effects of each variable and visualize the among-model variation. <>= contr_combCI <- subset(contr_combCI, var != "(Intercept)") combCI <- transform(contr_combCI, vartype = factor(ifelse(grepl("^(sd|cor)", var), "random effects", "fixed effects")), model = factor(model, levels = rev(levels(model))) ## flip AND flip order of guide below ) @ <>= ggplot(combCI, aes(var, est, colour=type, shape=model)) + geom_pointrange(aes(ymin=lwr, ymax=upr), position=position_dodge(width=0.6)) + labs(x="", y="Estimate (log-odds of seropositivity)") + geom_hline(yintercept=0, lty=2) + coord_flip() + scale_colour_brewer(palette="Dark2", guide = revguide) + scale_shape(guide=revguide) + facet_wrap(~vartype, ncol = 1, scale = "free", space = "free_y") + theme(legend.position = "inside", legend.justification = c("left", "top"), legend.box = "horizontal", legend.background = element_rect(fill = "white", colour = "grey50"), #legend.margin = margin(t = 5, r = 5, b = 5, l = 5, unit = "pt"), legend.box.margin = margin(t = 10, l = 10, unit = "pt")) @ The point estimates and confidence intervals of the explanatory variables are similar across the six models, and across different methods for confidence interval construction, with a few exceptions. Urban residence, presence of living children, and age were all strong predictors of contraceptive use among women in Bangladesh (because we are fitting a quadratic model for the effect of age, the non-significant effect of \code{age\_s} simply means that the marginal effect of age \emph{at the mean age} is not clearly negative or positive). With \code{lme4} versions greater than 2.0, we can also set up models with structured random effects covariance matrices. In the models below \code{diag} specifies a diagonal variance-covariance structure, while \code{cs} specifies a compound symmetric model. The \code{hom} argument specifies whether the model should allow different variances for each varying effect (e.g. for intercepts and slopes in the models below, which allow the effects of the continuous covariate of age to vary across districts). Thus instead of (e.g.) \code{(1+urban|district)}, we can specify the random effect with diagonal, heterogeneous variances (\code{diag(1+urban|district)}); diagonal, homogeneous variances (\code{diag(1+urban|district, hom = TRUE)}); compound symmetric, heterogeneous variances (\code{cs(1+urban|district)}); or compound symmetric, homogeneous variances (\code{cs(1+urban|district, hom = TRUE)}). <>= cc <- glmerControl(check.conv.grad = .makeCC("warning", tol = 1e-2), check.conv.hess = "ignore") base_form <- use ~ age*ch + I(age^2) + urban cm.diag <- glmer(update(base_form, . ~ . + diag(1 + urban|district)), Contraception, binomial, control = cc ) cm.homdiag <- glmer(update(base_form, . ~ . + diag(1 + urban|district, hom = TRUE)), Contraception, binomial, control = cc) cm.cs <- glmer(update(base_form, . ~ . + cs(1 + urban|district)), Contraception, binomial, control = cc) cm.homcs <- glmer(update(base_form, . ~ . + cs(1 + urban|district, hom = TRUE)), Contraception, binomial, control = cc) @ We can use \code{VarCorr} and other accessor methods as we would for models with default, unstructured covariance matrices, e.g.: <>= VarCorr(cm.cs) @ \bibliography{glmer} \section{Package versions used} <>= pkglist <- c("lme4", "performance", "DHARMa", "see") pp <- sapply(pkglist, function(x) sprintf("%s: %s", x, format(packageVersion(x)))) ppp <- paste(pp, collapse = "; ") @ Compiled with \Sexpr{R.version.string} and package versions \Sexpr{pp}. \section{Appendix: derivation of PIRLS} We seek to maximize the unscaled conditional log density for a GLMM over the conditional modes, $\bm u$. This problem is very similar to maximizing the log-likelihood for a GLM, which is a very thoroughly studied problem \citep[e.g.][]{McCullaghNelder1989}. The standard algorithm for dealing with this kind of problem is iteratively reweighted least squares (IRLS). Here we modify IRLS by incorporating a penalty term that accounts for variation in the random effects; we call the resulting algorithm penalized iteratively reweighted least squares (PIRLS). The unscaled conditional log-density takes the form, \begin{equation} f(\bm u) = \log p(\bm y, \bm u | \bm\beta, \bm\theta) = \bm\psi^\top \bm A \bm y - \bm a^\top \bm \phi + \bm c - \frac{1}{2}\bm u^\top \bm u - \frac{q}{2}\log{2\pi} \end{equation} where $\bm\psi$ is the $n$-by-$1$ canonical parameter of an exponential family, $\bm\phi$ is the $n$-by-$1$ vector of cumulant functions, $\bm c$ an $n$-by-$1$ vector of normalizing constants, and $\bm A$ is an $n$-by-$n$ diagonal matrix of prior weights, $\bm a$. Both $\bm a$ and $\bm c$ could depend on a dispersion parameter, although we ignore this possibility for now. The canonical parameter, $\bm\psi$, and vector of cumulant functions, $\bm\phi$, depend on a linear predictor, \begin{equation} \bm\eta = \bm o + \bm X \bm\beta + \bm Z \bm\Lambda_\theta \bm u \end{equation} where $\bm o$ is an $n$-by-$1$ vector of \emph{a priori} offsets. The specific form of this dependency is specified by the choice of the exponential family. The mean of this distribution, $\bm\mu$, is the \emph{inverse link function} $g^{-1}$ applied to $\bm\eta$. Our goal is to find the values of $\bm u$ that maximize the unscaled conditional density, for given $\bm\theta$ and $\bm\beta$ vectors. These maximizers are the conditional modes, which we require for the Laplace approximation and adaptive Gauss-Hermite quadrature. To do this maximization we use a variant of the Fisher scoring method, which is the basis of the iteratively reweighted least squares algorithm for generalized linear models. Fisher scoring is itself based on Newton's method, which we apply first. \subsection{Newton's method} To apply Newton's method, we need the gradient and the Hessian of the unscaled conditional log-likelihood. Following standard GLM theory \citep{McCullaghNelder1989}, we use the chain rule, \begin{displaymath} \frac{d L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u} = \frac{d L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm\psi} \frac{d \bm\psi}{d \bm\mu} \frac{d \bm\mu}{d \bm\eta} \frac{d \bm\eta}{d \bm u} \end{displaymath} The first derivative in this chain follow from basic results in GLM theory, \begin{displaymath} \frac{d L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm\psi} = (\bm y - \bm\mu)^\top \bm A \end{displaymath} Again from standard GLM theory, the next two derivatives define the inverse diagonal variance matrix, \begin{displaymath} \frac{d \bm\psi}{d \bm\mu} = \bm V^{-1} \end{displaymath} and the diagonal Jacobian matrix, \begin{displaymath} \frac{d \bm\mu}{d \bm\eta} = \bm M \quad . \end{displaymath} Finally, because $\bm\beta$ affects $\bm\eta$ only linearly, \begin{displaymath} \frac{d \bm\eta}{d \bm u} = \bm Z \bm\Lambda_\theta \end{displaymath} Therefore we have, \begin{equation} \frac{d L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u} = (\bm y - \bm\mu)^\top \bm A \bm V^{-1} \bm M \bm Z \bm\Lambda_\theta + \bm u^\top \quad . \label{eq:dPDEVdu} \end{equation} This is very similar to the gradient for GLMs with respect to fixed effects coefficients, $\bm\beta$. The only difference induced by differentiating with respect to the random effects, $\bm u$, is the addition of the $\bm u^\top$ term. Again we apply the chain rule to take the Hessian, \begin{equation} \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} = \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm\mu} \frac{d \bm\mu}{d \bm\eta} \frac{d \bm\eta}{d \bm u} + \bm I_q \end{equation} which leads to, \begin{equation} \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} = \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm\mu}\bm M \bm Z \bm\Lambda_\theta + \bm I_q \end{equation} The first derivative in this chain can be expressed as, \begin{equation} \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm\mu} = -\bm\Lambda_\theta^\top \bm Z^\top \bm M \bm V^{-1} \bm A + \bm\Lambda_\theta^\top \bm Z^\top \left[ \frac{d \bm M \bm V^{-1}}{d \bm\mu} \right] \bm A \bm R \end{equation} where $\bm R$ is a diagonal residuals matrix with $\bm y-\bm\mu$ on the diagonal. The two terms arise from a type of product rule, where we first differentiate the residuals, $\bm y-\bm\mu$, and then the diagonal matrix, $\bm M \bm V^{-1}$, with respect to $\bm\mu$. The Hessian can therefore be expressed as, \begin{equation} \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u d \bm u} = -\bm \Lambda_\theta^\top \bm Z^\top \bm M \bm A^{1/2}\bm V^{-1/2}\left( \bm I_n - \bm V \bm M^{-1}\left[ \frac{d \bm M \bm V^{-1}}{d \bm\mu} \right] \bm R \right) \bm V^{-1/2}\bm A^{1/2} \bm M \bm Z \bm\Lambda_\theta + \bm I_q \label{eq:betaHessian} \end{equation} This result can be simplified by expressing it in terms of a weighted random-effects design matrix, $\bm U = \bm A^{1/2}\bm V^{-1/2}\bm M \bm Z \bm\Lambda_\theta$, \begin{equation} \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} = -\bm U^\top\left( \bm I_n - \bm V \bm M^{-1}\left[ \frac{d \bm V^{-1}\bm M}{d \bm\mu} \right] \bm R \right) \bm U + \bm I_q \quad . \label{eq:betaHessiansimp} \end{equation} \subsection{Fisher-like scoring} There are two ways to further simplify this expression for $\bm U^\top \bm U$. The first is to use the canonical link function for the family being used. Canonical links have the property that $\bm V = \bm M$, which means that for canonical links, \begin{equation} \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} = -\bm U^\top\left( \bm I_n - \bm I_n \left[ \frac{d \bm I_n}{d \bm\mu} \right] \bm R \right) \bm U + \bm I_q = \bm U^\top \bm U + \bm I_q \end{equation} The second way to simplify the Hessian is to take its expectation with respect to the distribution of the response, conditional on the current values of the spherical random effects coefficients, $\bm u$. The diagonal residual matrix, $\bm R$, has expectation 0. Therefore, because the response only enters into the expression for the Hessian via $\bm R$, we have that, \begin{equation} E\left(\frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} | \bm u \right) = -\bm U^\top\left( \bm I_n - \bm U \bm M^{-1}\left[ \frac{d \bm V^{-1}\bm M}{d\mu} \right] E(\bm R) \right) \bm U + \bm I_q = \bm U^\top \bm U + \bm I_q \label{eq:simphesstwo} \end{equation} \end{document}