%\VignetteEngine{knitr::knitr}
%\VignetteDepends{ggplot2}
%\VignetteDepends{gridExtra}
%\VignetteDepends{performance}
%\VignetteDepends{merDeriv}
%\VignetteDepends{DHARMa}
%\VignetteIndexEntry{Fitting Generalized Linear Mixed-Effects Models using lme4}
\documentclass[nojss]{jss}
%% need no \usepackage{Sweave.sty}
\usepackage[T1]{fontenc}% for correct hyphenation and T1 encoding
\usepackage[utf8]{inputenc}
\usepackage{lmodern}% latin modern font
\usepackage[american]{babel}  %% for texi2dvi ~ bug
\usepackage{bm,amsmath,amsfonts}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator{\VEC}{vec}
\newcommand{\bbeta}{{\bm \beta}}
\newcommand{\btheta}{{\bm \theta}}

\DeclareUnicodeCharacter{0394}{$\Delta$}
%
\shortcites{lesnoff_within-herd_2004}
%% BMB: what does JSS assume?
%\bibliographystyle{chicago}
\author{
  Anna Ly\\McMaster University \And
  Rune Haubo Bojesen Christensen\\Copenhagen Research Centre \\for Biological and Precision Psychiatry\AND
  Douglas Bates\\University of Wisconsin - Madison \And
  Martin M\"achler\\ETH Zurich\AND
  Benjamin M. Bolker\\McMaster University
}
\Plainauthor{A. Ly, R. H. B. Christensen, D. M. Bates, M. M\"achler, B. M. Bolker}
\title{Fitting generalized linear mixed-effects models using \pkg{lme4}}
\Plaintitle{Fitting generalized linear mixed models using lme4}
\Shorttitle{GLMMs with lme4}
\Abstract{%
  The \pkg{lme4} \proglang{R} package can be used to fit generalized linear
  mixed models (GLMMs), which extend the class of linear mixed models
  (LMMs). The two main extensions provided by GLMMs are (1)
  allowing for the conditional distribution of the response given the
  random effects to be non-Gaussian (e.g. binomial,
  Poisson) and (2) allowing the conditional mean to be
  a nonlinear function of a linear combination of the fixed
  and random effect coefficients, via an inverse link
  function. The conditional mode of the random effects given the
  observed data, the variance-covariance matrix of the random 
  effects, and the fixed effect parameters
  are determined using penalized iteratively reweighted
  least squares (PIRLS). We compute an approximation of the integral
  over the distributions of the conditional modes to compute
  the MLE for a given set of parameters (by default we use the
  Laplace approximation or, alternatively, the more computationally
  expensive adaptive Gauss-Hermite quadrature). The package provides
  all the standard features available for GLMs in base \proglang{R},
  including the the standard set of accessor functions as
  well as the possibility of user-specified distributions 
  (within the exponential dispersion family) and link
  functions.}
%
\Keywords{%
  sparse matrix methods,
  generalized linear mixed models,
  penalized least squares,
  Cholesky decomposition}
\Address{
  Anna Ly\\
  Department of Mathematics \& Statistics \\
  McMaster University \\
  1280 Main Street W \\
  Hamilton, ON L8S 4K1, Canada \\
  \par\bigskip
  Rune Haubo Bojesen Christensen \\
  Copenhagen Research Centre for Biological and Precision Psychiatry \\
  Mental Health Centre Copenhagen, Copenhagen University Hospital - Bispebjerg and Frederiksberg \\
  Copenhagen, Denmark \\
  Email: \email{Rune.Haubo@pm.me}
  \par\bigskip
  Douglas Bates\\
  Department of Statistics, University of Wisconsin - Madison\\
  1205 University Ave. \\
  Madison, WI 53706, U.S.A.\\
  E-mail: \email{bates@stat.wisc.edu}
  \par\bigskip
  Martin M\"achler\\
  Seminar f\"ur Statistik, HG G~16\\
  ETH Zurich\\
  8092 Zurich, Switzerland\\
  E-mail: \email{maechler@stat.math.ethz.ch}\\
  % URL: \url{http://stat.ethz.ch/people/maechler}
  \par\bigskip
  Benjamin M. Bolker\\
  Departments of Mathematics \& Statistics and Biology \\
  McMaster University \\
  1280 Main Street W \\
  Hamilton, ON L8S 4K1, Canada \\
  E-mail: \email{bolker@mcmaster.ca}
}
\newcommand{\Var}{\operatorname{Var}}
\newcommand{\abs}{\operatorname{abs}}
\newcommand{\bLt}{\ensuremath{\bm\Lambda_\theta}}
\newcommand{\mc}[1]{\ensuremath{\mathcal{#1}}}
\newcommand{\trans}{\ensuremath{^\prime}}
\newcommand{\yobs}{\ensuremath{\bm y_{\mathrm{obs}}}}
\newcommand{\Lsat}{\ensuremath{L_{\mathrm{sat}}}}
\newcommand*{\eq}[1]{eqn.~\ref{#1}}% or just {(\ref{#1})}
%
\usepackage{etoolbox}
\usepackage{lineno}
%% line numbering hack from Claude
\makeatletter
\let\@@par@orig\@@par
\renewcommand{\@@par}{%
  \@@par@orig
  \ifnum\currentgrouplevel=0
    \linenumbers
  \fi
}
\makeatother
%
<<preliminaries,include=FALSE,cache=FALSE,message=FALSE,echo=FALSE>>=
options(width=69, show.signif.stars=FALSE, str=strOptions(strict.width="cut"))
library(knitr)
library(lme4)
library(ggplot2)
theme_set(theme_bw())
revguide <- guide_legend(reverse = TRUE)
library(grid)
library(gridExtra)
zmargin <- theme(panel.margin=unit(0,"lines"))
library(lattice)
library(DHARMa)
library(performance)
opts_chunk$set(engine='R',dev='pdf',fig.width=10,
               fig.height=6.5,prompt=TRUE,
               ## strip.white=all,
               error=FALSE, ## stop on error
               tidy=FALSE,comment=NA)
## using knitr::knit2pdf() may lead to xcolor 'incompatible color definition'
## warnings. The code below resolves that, but simultaneously breaks
## vignette compilation in other contexts ...
## https://tex.stackexchange.com/questions/148188/knitr-xcolor-incompatible-color-definition               
## knit_hooks$set(document = function(x) {sub('\\usepackage[]{xcolor}', '\\usepackage[dvipsnames]{xcolor}', x, fixed = TRUE)})

@ 
\begin{document}
\linenumbers
\section{Introduction}
\label{sec:intro}

FIXME: check digits of output?

The \pkg{lme4} package for \proglang{R} can be used to fit a broad
range of mixed-effects models.  One major advantage of
\pkg{lme4} over its predecessor, \pkg{nlme}, is that it can be used to
fit generalized linear mixed
models (GLMMs), which combine the flexibility of
linear mixed models (LMMs) and generalized
linear models (GLMs). In a companion paper, we have described the
facilities in \pkg{lme4} for fitting linear mixed models (LMMs). Here
we describe the facilities for fitting GLMMs.

\section{Generalized Linear Mixed Models}
\label{sec:GLMMdef}

Generalized linear mixed models extend the class of
generalized linear models by allowing for both fixed and random
effects. In a GL(M)M, the length-$n$ vector-valued response variable,
$\mc Y$, has a conditional distribution in the exponential dispersion family (e.g. Normal,
binomial, Poisson).\footnote{The \code{lme4} package supports fitting negative binomial GLMMs,
which are an extension of GLMMs; the negative binomial family is within the
exponential dispersion family if the dispersion parameter is fixed.
The \code{glmer.nb} function uses a one-dimensional optimizer
to estimate the dispersion parameter, fitting a negative binomial GLMM with
a fixed dispersion parameter at each trial value. In practice this
implementation is slower than those in other R packages (e.g., the \code{glmmTMB} package).}
The mean, $\bm\mu_{\mc Y }$, of $\mc Y$ depends on
a linear predictor,
\begin{equation}
  \label{eq:GLMlinpred}
  \bm\eta=\bm X\bm\beta .
\end{equation}
where $\bm\beta$ is a $p$-dimensional coefficient vector and $\bm X$
is an $n \times p$ model matrix. The mapping from $\bm\mu_{\mc Y }$ to
$\bm\eta$, which is called the \emph{link function} and written,
\begin{equation}
  \label{eq:linkfun}
  \bm X\bm\beta=\bm\eta=\bm g\left(\bm\mu_{\mc Y }\right) ,
\end{equation}
is a \emph{diagonal mapping} in the sense that there is a scalar
function, $g$, such that the $i$th component of $\bm\eta$ is $g$
applied to the $i$th component of $\bm\mu_{\mc Y }$.  (The name
``diagonal'' reflects the fact that the Jacobian matrix,
$\frac{d\eta}{d\mu\trans}$, of such a mapping will be diagonal.) The
scalar link function must be invertible and differentiable over its range.  The
vector-valued \emph{inverse link} function, $\bm g^{-1}$, will be the
scalar inverse link, $g^{-1}$, applied component-wise to
$\bm\eta$.
% The scalar link function must be invertible over its range.
% The vector-valued \emph{inverse link} function, $\bm g^{-1}$, will be
% the scalar inverse link, $g^{-1}$, applied component-wise to
% $\bm\eta$.  % commented out the repetition - DMB

In the GLMM case, the vector of means of the exponential dispersion family distribution of
$\mc Y$ depends on an unobserved random vector, $\mc B$, of length
$q$, called the random-effects coefficient vector. In particular, the
conditional mean of $\mc Y$ given that $\mc B = \bm b$, written
$\bm\mu_{\mc Y|\mc B = \bm b}$, depends on the linear predictor,
\begin{equation}
  \label{eq:GLMMlinpred}
  \bm\eta=\bm Z\bm b+\bm X\bm\beta .
\end{equation}
where $\bm Z$ is an $n \times q$ random-effects model matrix. Similar
to the GLM case, the mapping from the conditional mean, $\bm\mu_{\mc
  Y|\mc B = \bm b}$, to the linear predictor, $\bm\eta$, is
\begin{equation}
  \label{eq:linkfunii}
  \bm Z\bm b + \bm X\bm\beta=
  \bm\eta=
  \bm g\left(\bm\mu_{\mc Y|\mc B =\bm b}\right) ,
\end{equation}
The random vector $\mc B$ is assumed to be distributed multivariate
normally,
\begin{equation}
  \label{eq:2}
  \mc B \sim \mc N(\bm 0, \bm\Sigma_\theta)
\end{equation}
where $\bm\Sigma_\theta$ is the covariance matrix of $\mc B$, which
depends on a vector of covariance parameters, $\bm\theta$.

The optimization routines of \pkg{lme4} never actually compute
$\bm\Sigma_\theta$ directly, and instead use the elements of the covariance factor,
$\bm\Lambda_\theta$, which is a matrix square root of $\bm\Sigma_\theta$ (in practice a Cholesky factor),
\begin{equation}
  \label{eq:GLMMSigma}
  \bm\Sigma_\theta=\bm\Lambda_\theta\bm\Lambda_\theta\trans .
\end{equation}
This characterization of the random-effects covariance structure
allows us to write the linear predictor as
\begin{equation}
  \label{eq:GLMMlinpredii}
  \bm\eta=\bm Z\bm\Lambda_\theta\bm u+\bm X\bm\beta ,
\end{equation}
where the spherical random effects vector, $\bm u$
(see \cite{bates2015fitting}) is a realization
of the random vector, $\mc U$,
\begin{equation}
  \label{eq:UdistGLMM}
  \mc U\sim\mc N(\bm0,\bm I_q)
\end{equation}
where $\bm I_q$ is the identity matrix.

Common forms of the conditional distribution are Bernoulli, for binary
responses, binomial for binary responses that are recorded as the
number of trials and the number of successes, and Poisson, for count
data.  The combination of a distributional form and a link function is
called a \emph{family}.  For distributional forms in the exponential
dispersion family there is a \emph{canonical link}.  For Bernoulli or binomial
forms the canonical link is the \emph{logit} link function
\begin{equation}
  \label{eq:logitLink}
  \eta_i=\log\left(\frac{\mu_i}{1-\mu_i}\right);
\end{equation}
for the Poisson distribution the canonical link is the natural
logarithm.

The form of the distribution determines the conditional variance,
$\Var(\mc Y|\mc U=\bm u)$, as a function of the conditional mean and,
possibly, a separate scale factor. (In the most common cases the conditional
variance is completely determined by the conditional mean.) 

Assuming that the conditional distribution belongs to the exponential 
dispersion family, we can weight observations differently by scaling the dispersion parameter, 
$\phi$. These \textbf{prior weights},  $\bm w$, a vector of size $n$, are known 
positive constants (equal to the number of observations per trial in the particular case of a binomial GLMM).

By scaling the dispersion parameters, we also modify the conditional variance. 
Consider the form, where $i = 1, 2, \dots, n$:
\begin{equation}
  \text{Var}(\mc Y_i|\mc U=\bm u) = \frac{\phi}{w_i} \Var(\mu_i)
\end{equation}
where $\Var(\mu_i)$ is the family-specific variance function. 
Higher weights indicate a smaller variance.

If prior weights are not specified, by default $\bm w$ is just a vector of 1's.

Another common modification when dealing with generalized linear mixed models
is to allow for an \textbf{offset}, which is an efficient way to include known scaling 
factors (like population or area) without introducing additional parameters to 
the model.

In particular, the inclusion of an offset would modify the linear predictor 
as follows:
\begin{equation}
  \label{eq:GLMMoffset}
  \bm\eta=\bm Z\bm b+\bm X\bm\beta + \text{offset}.
\end{equation}


For GLMMs, the scale parameter is not typically estimated as part of the nonlinear minimization of the negative likelihood over the $\beta$ and $\theta$ parameters.
As is common in generalized linear model contexts, it is 
set equal to the (residual) deviance
divided by the residual degrees of freedom. 
(Except for the Binomial and Poisson families,
where the scale parameter is fixed at 1.)

The deviance itself is estimated using the method of moments estimator, which 
is the Pearson residual sum of squares. This approach is preferred by \citet{McCullaghNelder1989} as it is less sensitive to small errors or model 
misspecification than the maximum likelihood estimator. 

We therefore estimate the deviance using the penalized weighted Pearson 
residual sum of squares, as described in Section 3.2 of \cite{bates2015fitting}.

The likelihood of the parameters, given the observed data, is now
\begin{equation}
  \label{eq:GLMMlike}
  L(\bm\beta,\bm\theta|\yobs)=\int_{\mathbb{R}^q}f_{\mc Y,\mc U}(\yobs,\bm u)\,d\bm u
\end{equation}
where, as in the case of linear mixed models, $f_{\mc Y,\mc
  U}(\yobs,\bm u)$ is the unscaled conditional density of $\mc U$
given $\mc Y=\yobs$.  The notation here is a bit blurred because,
although the joint distribution of $\mc Y$ and $\mc U$ is always
continuous with respect to $\mc U$, it can be (and often is) discrete
with respect to $\mc Y$. However, when we condition on the observed
value $\mc Y=\yobs$, the resulting function is continuous with respect
to $\bm u$ so the unscaled conditional density is indeed well-defined
as a density, up to a scale factor.

To evaluate the integrand in (\ref{eq:GLMMlike}) we need the unscaled conditional density $f_{\mc Y,\mc U}$, which depends on the likelihood of $\yobs$ conditional on $\bm u$: $f(\yobs, \bm u) = f(\yobs|\bm u) f(\bm u)$.
As usual in likelihood estimation we will eventually be working, for convenience, with negative log likelihoods rather than likelihoods.
However, base R's machinery for defining GLM models, or ``families'', does not provide a convenient expression for the negative log-likelihood: instead, objects of class \code{family} include a \code{dev.resids} function that returns a vector of observation-specific deviances, i.e. $-2(\log L(\yobs|\bm u) - \log \Lsat)$ where $\Lsat$ is the likelihood corresponding to a saturated model with as many parameters as observations.
%
\footnote{There is
some confusion in \proglang{R} (and in its predecessor,
\proglang{S}) about the exact definition of the deviance residuals
of a family.  As indicated above, we will use this name for the
value of the \code{dev.resids} function for the family.  The signed
square root of this vector, using the signs of $\yobs-\mu$, is returned
when the \code{residuals} method is applied to a fitted model
of class \code{"glm"} when \code{type="deviance"}, the
default, is specified.  Both are called ``deviance residuals''
at different points in the documentation.}
%
For example, the log-likelihood for a Poisson observation $y$ with mean $\lambda$ is (omitting the normalization constant) $\log L(y, \lambda) = y \log \lambda - \lambda$; the deviance is
$$
-2(\log L(y, \lambda) - \log L(y, y)) = -2(y(\log \lambda - \log y) - (\lambda-y))
$$
(with appropriate adjustments if $y=0$).
Because deviances differ from (negative) log likelihoods only by a parameter-independent constant (the log-likelihood of the saturated model) and a change of scale (multiplication by 2), this change does not affect our optimization procedure.

One advantage of using the pre-existing GLM family structure is
that the software can thus fit
models with user-specified families and link functions 
(although common families and link functions are
hard-coded in C++ for computational speed),
in contrast to early versions of \code{lme4} and some other
\proglang{R} packages for GLMM fitting.

If \code{dev.resids} returns 
$\bm d(\yobs|\bm u)$, with elements $d_i(\yobs|\bm u), i=1,\dots,n$,
the full deviance of a generalized linear model is
\begin{displaymath}
f(\yobs, \bm u) = \sum_{i=1}^n d_i(\yobs|\bm u) + \|\bm u\|^{2}.
\end{displaymath}
The likelihood can now be expressed as
\begin{equation}
  \label{eq:GLMMlike1}
  L(\bm\beta,\bm\theta|\yobs)=
  \int_{\mathbb{R}^q}\exp\left(-\frac{\sum_{i=1}^nd_i(\yobs|\bm u)+\|\bm u\|^2}{2}\right)\,(2\pi)^{-q/2}\,d\bm u
\end{equation}

As for linear mixed models, we simplify evaluation of the integral
(\ref{eq:GLMMlike}) by determining the value, $\tilde{\bm
  u}_{\beta,\theta}$, that maximizes the integrand.  When the
conditional density, $\mc U|\mc Y=\yobs$, is multivariate Gaussian,
this conditional mode will also be the conditional mean.  However, for
most families used in GLMMs, the mode and the mean need not coincide
so we use the more general term and call $\tilde{\bm
  u}_{\beta,\theta}$ the \emph{conditional mode}.  We first describe
the numerical methods for determining the conditional mode using the
Penalized Iteratively Reweighted Least Squares (PIRLS) algorithm then
return to the question of evaluating the integral (\ref{eq:GLMMlike}).

\subsection{Determining the conditional mode}
\label{sec:conditionalMode}

The iteratively reweighted least squares (IRLS) algorithm is an
efficient method of determining the maximum likelihood
estimates of the coefficients in a generalized linear model.  We
extend it to a \emph{penalized iteratively reweighted least squares}
(PIRLS) algorithm for determining the conditional mode, $\tilde{\bm
  u}_{\beta,\theta}$.   This algorithm has the form
\begin{enumerate}
\item Given parameter values, $\bm\beta$ and $\bm\theta$, and starting
  estimates, $\bm u_0$, evaluate the linear predictor, $\bm\eta$, the
  corresponding conditional mean, $\bm\mu_{\mc Y|\mc U=\bm u}$, and
  the conditional variance.  Establish the weights as the inverse of
  the variance.  We write these weights in the form of a diagonal
  weight matrix, $\bm W$, although they are stored and manipulated as
  a vector.
\item Let 
  \begin{equation}
  Q(\bm u) = \left\|\bm W^{1/2}\left(\yobs-\bm\mu_{\mc
    Y|\mc U=\bm u}\right)\right\|^2+\|\bm u\|^2.
  \end{equation}

  Solve the penalized, weighted, nonlinear least squares problem
  \begin{equation}
    \label{eq:weightedNLS}
    \arg\min_{\bm u}\left( Q(\bm u) \right).
  \end{equation}
\item Update the weights, $\bm W$, and check for convergence.  If not
  converged, go to step 2.
\end{enumerate}

We use a Gauss-Newton algorithm with an orthogonality convergence
criterion~\citep[\S 2.2.3]{bateswatts88:_nonlin} to solve the
penalized, weighted, nonlinear least squares problem in step 2.  At
the $i$th iteration we determine an increment, $\bm\delta_i$, as the
solution to the penalized, weighted, linear least squares problem
\begin{equation}
  \label{eq:incr}
  \bm\delta_i=\arg\min_{\bm\delta}\left\|
    \begin{bmatrix}
      \bm W^{1/2}\left(\yobs-\bm\mu_i\right)\\
      \bm u_i
    \end{bmatrix}-
    \begin{bmatrix}
      \bm W^{1/2}\bm M_i\bm Z\bm\Lambda_\theta\\
      \bm I_q
    \end{bmatrix}\bm u\right\|^2
\end{equation}
where $\bm u_i$ is current value of $\bm u$, $\bm\mu_i$ is the
corresponding conditional mean of $\mc Y|\mc U=\bm u_i$ and $\bm M_i$ is
the Jacobian matrix of the vector-valued inverse link, evaluated at
$\bm\mu_i$.  That is
\begin{equation}
  \label{eq:Jacobian}
  \bm M_i=\left.\frac{d\bm\mu}{d\bm\eta\trans}\right|_{\bm\eta_i},
\end{equation}
which will be a diagonal matrix so, as for the weights, we store and
manipulate the Jacobian as a vector.

When solving for the minimum of $Q(\bm u)$, the proposed update
  for $\bm u$ may increase rather than decrease the value of the objective 
  function $Q(\bm u)$, if the step size estimated by the Gauss-Newton step
  is too large. Let $Q(\bm u_i)$ represent the value of the objective 
  function $Q(\bm u)$ at the current iterate $\bm u_{i}$; similarly, let 
  $Q(\bm u_{i-1})$ represent the the value of the objective function of the 
  previous iteration. 

  If $Q(\bm u_i) > Q(\bm u_{i-1})$, we proceed with a \emph{step-halving}
  procedure. That is, we successively consider scaled updates of the form
  \begin{equation}
  \bm u_i^{(j)} = \bm u_{i-1} + \frac{\bm \delta_i}{2^j}, \quad j = 1, 2, \ldots,
\end{equation}
computing $Q(\bm u_i^{(j)})$ at each step. The step-halving procedure 
continues until a value of $j$ is found such that 
$Q(\bm u_i^{(j)}) < Q(\bm u_{i-1})$,
which the iterate is then updated by setting $\bm u_{i} = \bm u_{i}^{(j)}$.

If no such $j$ is found after a predetermined number of halvings, then the 
step-halving procedure has failed and the algorithm terminates. 

The minimizer, $\bm\delta_i$, of (\ref{eq:incr}) satisfies
\begin{equation}
  \label{eq:incrEq}
  \bm P\left(\bLt\trans\bm Z\trans\bm M_i\bm W\bm M_i\bm Z\bLt+\bm I_q\right)\bm P\trans
      \bm\delta_i=\bLt\trans\bm Z\trans\bm M_i\bm W(\yobs-\bm\mu_i) - \bm u_i
\end{equation}
which we solve using the sparse Cholesky factor.  At convergence, the
factor, $\bm L_{\beta,\theta}$, satisfies
\begin{equation}
  \label{eq:CholFactorGLMM}
  \bm L_{\beta,\theta}\bm L_{\beta,\theta}\trans =
  \bm P\left(\bLt\trans\bm Z\trans\bm M\bm W\bm M\bm Z\bLt+\bm I_q\right)\bm P\trans
\end{equation}
As we show in the next section, the matrix $(\bm L_{\beta,\theta}\bm
L_{\beta,\theta}\trans)^{-1}$ is a Laplace approximation of the
covariance matrix for the spherical random effects, conditional on the
observed data. This fact is useful for constructing a nonlinear
objective function for finding the approximate maximum likelihood
estimates of $\theta$ and $\beta$.


\subsection{Evaluating the likelihood for GLMMs using the Laplace approximation}
\label{sec:Laplace}

Evaluating the likelihood for generalized linear mixed models
requires approximating an intractable 
integral over the random effects distribution. The \code{glmer} function offers 
several approximations, controlled by the \code{nAGQ} argument. 
The default value of \code{nAGQ=1} specifies the
\emph{Laplace approximation} \citep{madsen2011}.

A second-order Taylor series approximation to $-2\log[f_{\mc Y,\mc
  U}(\yobs,\bm u)]$ based at $\tilde{\bm u}$ provides an approximation
of the unscaled conditional density as a multiple of the density for
the multivariate Gaussian $\mathcal{N}(\tilde{\bm u},\bm L\bm
L\trans)$.  The change of variable
\begin{equation}
  \label{eq:LaplaceChg}
  \bm u = \tilde{\bm u} + \bm L\bm z
\end{equation}
provides
\begin{equation}
  \label{eq:GLMMLaplace}
  \begin{aligned}
    L(\bm\beta,\bm\theta|\yobs)&=\int_{\mathbb{R}^q}f_{\mc Y,\mc U}(\yobs,\bm u)\,d\bm u\\
    &\approx \tilde{f}\,|\bm L|\, \int_{\mathbb{R}^q}e^{-\|\bm z\|^2/2}\,(2\pi)^{-q/2}\,d\bm z\\
    &=\tilde{f}\, |\bm L|
  \end{aligned}
\end{equation}
or, on the deviance scale,
\begin{equation}
  \label{eq:LaplaceDev}
  -2\ell(\bm\beta,\bm\theta|\yobs)\approx\sum_{i=1}^nd_i(\yobs,\tilde{\bm u}) +
    \|\tilde{\bm u}\|^2 + \log(|\bm L|^2)+\frac{q}{2}\log(2\pi)
\end{equation}

The Laplace approximation normally conditions on both the fixed effects $\beta$ and the variance-covariance parameters $\theta$. A further approximation, which is denoted in \code{glmer} by \code{nAGQ=0}, profiles out the fixed effects by minimizing $\beta$ and the conditional modes $\bm u$ simultaneously in eq.~\ref{eq:weightedNLS}. This approximation is exact when (1) $\partial(\log L)/\partial \beta$ is a linear function of the conditional modes $\bm u$ and (2) when the conditional mode is equal to the conditional mean (typically, although not necessarily, implying a symmetric conditional distribution). Both assumptions hold for linear mixed models (although a Laplace approximation is not necessary there), consistent with \cite{bates2015fitting} showing that the fixed effects can be profiled out of the log-likelihood for LMMs. The Julia \code{MixedModels.jl} package offers the same approximation as the \code{fast} argument to the \code{pirls!} function (\url{https://juliastats.org/MixedModels.jl/stable/optimization/}); Template Model Builder \citep{kristensenTMB2016}, and downstream packages such as \code{glmmTMB}, provide this function via a \code{profile} argument.

By default, \code{glmer} uses a two-stage optimization procedure (described below) with \code{nAGQ=0} in the first stage; users can also specify \code{nAGQ=0} for faster, approximate model fits.

\subsubsection{Decomposing the deviance for simple models}
\label{sec:simplescalar}

A common special case of mixed models is those where scalar (typically
intercept) random effects are
associated with levels of a single grouping factor, $\bm h$.  In this
case the dimension, $q$, of the random effects is the number of levels
of $\bm h$ --- i.e.{} there is exactly one random effect associated
with each level of $\bm h$.  We will write the vector of
variance-covariance parameters, which is one-dimensional, as a scalar,
$\theta$.  The matrix $\bm\Lambda_{\bm\theta}$ is a multiple of the
identity, $\theta\bm I_q$, and $\bm Z$ is the $n\times q$ matrix of
indicators of the levels of $\bm f$.  The permutation matrix, $\bm
P$, can be set to the identity and $\bm L$ is diagonal, although not
necessarily homogeneous (i.e., a scalar multiple of the identity matrix).

Because each element of $\bm\mu$ depends on only one element of $\bm
u$ and the elements of $\mc Y$ are conditionally independent, given
$\mc U=\bm u$, the conditional densities of the $u_j,j=1,\dots,q$
given $\mc Y=\yobs$ are independent.  We partition the indices
$1,\dots,n$ as $\mathbb{I}_j,j=1,\dots,q$ according to the levels of
$\bm h$.  That is, the index $i$ is in $\mathbb{I}_j$ if $h_i=j$.
This partitioning also applies to the deviance residuals in that
the $i$th deviance residual depends only on $u_j$ when $i\in\mathbb{I}_j$.

Writing the univariate conditional densities as
\begin{equation}
  \label{eq:univariateCondDens}
  f_j(\yobs,u_j)=\exp\left(-\frac{\sum_{i\in\mathbb{I}_j}d_i(\yobs, u_j)+u_j^2}{2}\right)(2\pi)^{-1/2}
\end{equation}
we have
\begin{equation}
  \label{eq:vectorCondDens}
  f_{\mc Y,\mc U}(\yobs,\bm u)=\prod_{j=1}^q f_j(\yobs,u_j)
\end{equation}
and
\begin{equation}
  \label{eq:ssLike}
  \begin{aligned}
    L(\bm\beta,\bm\theta|\yobs)=\prod_{j=1}^q\int_{\mathbb{R}}f_j(\yobs,u)\,du
  \end{aligned}
\end{equation}

We consider this special case both because it occurs frequently and
because, for some software, it is the only type of GLMM that can be
fit.  Also, in this particular case we can graphically assess the
quality of the Laplace approximation by comparing the actual integrand
to its approximation.

Consider the \code{cbpp} data on contagious bovine pleuropneumonia
(CBPP)
incidence according to season and herd, available in the \pkg{lme4} package
(see \ref{sec:cbpp} for more details), and the model
<<m1>>=
print(m1 <- glmer(cbind(incidence, size-incidence) ~ period + (1|herd),
      cbpp, binomial), corr=FALSE)
@
This model has been fit by minimizing the Laplace approximation to the
deviance.  We can assess the quality of this approximation by
evaluating the unscaled conditional density at $u_j(z)=\tilde{u_j} +
z/{\bm L_{j,j}}$ and comparing the ratio,
$f_j(\yobs,u)/(\tilde{f_j}\sqrt{2\pi})$, to the standard normal
density, $\phi(z)=e^{-z^2/2}/\sqrt{2\pi}$, as shown in Figure~\ref{fig:densities}.
\begin{figure}[tbp]
  \centering
<<densities,echo=FALSE,fig.height=5>>=
zeta <- function(m, zmin=-3, zmax=3, npts=301L, rank = FALSE) {
    stopifnot (is(m, "glmerMod"),
               length(m@flist) == 1L,    # single grouping factor
              length(m@cnms[[1]]) == 1L) # single column for that grouping factor
    pp <- m
    rr <- m@resp
    u0 <- getME(pp,"u")
    sd <- 1/getME(pp,"L")@x
    ff <- as.integer(getME(pp,"flist")[[1]])
    fc <- getME(pp,"X") %*% getME(pp,"beta") # fixed-effects contribution to linear predictor
    ZL <- t(getME(pp,"Lambdat") %*% getME(pp,"Zt"))
    dc <- function(z) { # evaluate the unscaled conditional density on the deviance scale
        uu <- u0 + z * sd
        rr$updateMu(fc + ZL %*% uu)
        unname(as.vector(tapply(rr$devResid(), ff, sum))) + uu * uu
    }
    zvals <- seq(zmin, zmax, length.out = npts)
    d0 <- dc(0) # because this is the last evaluation, the model is restored to its incoming state
    # signed square root
    sqrtmat <- t(sqrt(vapply(zvals, dc, d0, USE.NAMES=FALSE) - d0)) *
      array(ifelse(zvals < 0, -1, 1), c(npts, length(u0)))
    if (rank) {
      ## order by 'badness' (variance of normalized profile)
      nvals <- exp(-0.5*sqrtmat^2)/sqrt(2*pi)/dnorm(zvals)
      colnames(sqrtmat) <- seq(ncol(sqrtmat))
      sqrtmat <- sqrtmat[,order(apply(sqrtmat, 2, var), decreasing = TRUE)]
    }
    list(zvals=zvals,
         sqrtmat= sqrtmat 
         )
}
plot_zeta <- function(z, norm = FALSE,
                      ylab = if (!norm) "density" else "t(z)",
                      which = NULL, layout = c(5,3)) {
  dmat <- exp(-0.5*z$sqrtmat^2)/sqrt(2*pi)
  if (!is.null(which)) {
    dmat <- dmat[, which]
  }
  if (norm) dmat <- dmat/dnorm(z$zvals)
  xyplot(as.vector(dmat) ~ rep.int(z$zvals, ncol(dmat))|gl(ncol(dmat), nrow(dmat)),
         type=c("g","l"), aspect=0.6, layout=layout,
         xlab="z", ylab=ylab,
         panel=function(...){
           if (!norm) panel.lines(zm$zvals, dnorm(zm$zvals), lty=2)
           panel.xyplot(...)}
         )
}
zm <- zeta(m1, -3.750440, 3.750440)
plot_zeta(zm)                      
@
  \caption{Comparison of univariate integrands (solid line) and standard normal density function (dashed line)}
  \label{fig:densities}
\end{figure}
As Figure~\ref{fig:densities} shows, the univariate integrands are very
close to the standard normal density, indicating that the Laplace
approximation to the deviance is a good approximation in this case.

\section{Adaptive Gauss-Hermite quadrature for GLMMs}
\label{sec:aGQ}
When the integral (\ref{eq:GLMMlike}) can be expressed as a product of
low-dimensional integrals, we can use Gauss-Hermite quadrature to
provide a closer approximation to the integral.  Univariate
Gauss-Hermite quadrature evaluates the integral of a function that is
multiplied by a ``kernel'' where the kernel is a multiple of
$e^{-z^2}$ or $e^{-z^2/2}$.  For statisticians the natural candidate
is the standard normal density, $\phi(z)=e^{-z^2/2}/\sqrt(2\pi)$.
A $k$th-order Gauss-Hermite formula provides knots, $z_i,i=1,...,k$,
and weights, $w_i,i=1,\dots,k$, such that
\begin{displaymath}
  \int_{\mathbb{R}}t(z)\phi(z)\,dz\approx\sum_{i=1}^kw_it(z_i)
\end{displaymath}
The function \code{GHrule} in \pkg{lme4} (based on code in the
\pkg{SparseGrid} package) provides knots and weights relative to the
standard normal kernel for orders $k$ from 1 to 100.  For example,
<<GHrule5>>=
GHrule(5)
@
where \code{z} is the vector of knots, \code{w} is the vector
of weights, and \code{dlnorm} is the log-density of the standard
normal distribution at $z$.

The choice of the value of $k$ depends on the behavior of the function
$t(z)$.  If $t(z)$ is a polynomial of degree $2k-1$ then the
Gauss-Hermite formula for orders $k$ or greater provides an exact
answer.  The fact that we want $t(z)$ to behave like a low-order
polynomial is often neglected in the formulation of a Gauss-Hermite
approximation to a quadrature.  The quadrature knots on the $u$ scale
are chosen as
\begin{equation}
  \label{eq:quadraturepts}
  u_{i,j}(z)=\tilde{u_j} + z_i/{\bm L_{j,j}},\quad i=1,\dots,k;\;j=1,\dots,q
\end{equation}
exactly so that the function $t(z)$ should behave like a low-order
polynomial over the region of interest, which is to say the region
where quadrature knots with large weights are located.  The term
``adaptive Gauss-Hermite quadrature'' reflects the fact that the
approximating Gaussian density is scaled and shifted to provide a
second order approximation to the logarithm of the unscaled
conditional density.

Figure~\ref{fig:tfunc}
shows $t(z)$ for each of the unidimensional integrals in the
likelihood for the model \code{m1} at the parameter estimates.

\begin{figure}[tbp]
  \centering
<<tfunc,echo=FALSE>>=
plot_zeta(zm, norm = TRUE)
@
  \caption{The function $t(z)$, which is the ratio of the normalized
    unscaled conditional density to the standard normal density, for
    each of the univariate integrals in the evaluation of the deviance
    for model \code{m1}.  These functions should behave like low-order
    polynomials.}
  \label{fig:tfunc}
\end{figure}

<<toenail, echo = FALSE>>=
data("toenail", package = "lme4")
ttab <- with(toenail, table(table(patientID)))
## sum(as.numeric(names(ttab))*ttab/sum(ttab))
@

The CBPP data set is a relatively well-behaved data set, where Laplace approximation works well. In contrast,
a widely used data set on toenail onychomycosis \citep{debacker+1998}, which has a very low effective sample size per cluster --- an
average of about 6.5 binary observations (``moderate or severe'' vs ``none or mild'' disease) per patient --- represents
an example where Gauss-Hermite quadrature is necessary for reliable results. In this case the
conditional densities depart much more clearly from the standard normal (Figure~\ref{fig:toenailplot}; note the scale of the
density ratios goes from 0 to 10, in contrast the maximum of 4 in Figure~\ref{fig:tfunc}).
\cite{stringerAsymptoticsNumericalIntegration2022} and \cite{stringerExactGradientEvaluation2024} further explore the limitations of Laplace approximation.

\begin{figure}[tbp]
  \centering
<<toenailplot,echo=FALSE>>=
data("toenail", package = "lme4")
toemod <- glmer(outcome ~ time*treatment + (1 | patientID), 
                     data = toenail, family = binomial(link = "logit"))
zm <- zeta(toemod)
set.seed(101); s <- sample(ncol(zm$sqrtmat), size = 9)
gridExtra::grid.arrange(plot_zeta(zm, which = s, layout = c(3,3)),
             plot_zeta(zm, which = s, norm = TRUE, layout = c(3,3)), nrow = 1)
@
\caption{Normalized unscaled conditional density (left) and ratio of density to the standard normal density (right)
for a random sample of 9 patients from the toenail onychomycosis data set.}
\label{fig:toenailplot}
\end{figure}

To use adaptive Gauss-Hermite quadrature for model fitting 
in \code{glmer} models, set the argument \code{nAGQ}, the 
number of quadrature points, to a value greater than 1.
Increasing the number of nodes generally improves the accuracy of the likelihood 
approximation at the expense of computation time --- although rounding errors
may accumulate when using large numbers of quadrature points. At present, AGQ is only available 
for models with a single scalar random effect. (The \code{GLMMadaptive} package implements
AGQ for vector-valued random effect models, although it is still restricted to models with
a single random effect.)

\section{Model fitting}

  Once we can calculate the deviance by PIRLS for specified values of $\btheta$ and $\bbeta$ (or only $\btheta$ if profiling out the fixed effects via \code{nAGQ=0}), we then estimate the parameters by nonlinear optimization. This procedure largely follows the description in \cite{bates2015fitting}, using derivative-free optimizers with box constraints to prevent non-positive-(semi)definite covariance matrices. Specifically, the elements of $\btheta$ corresponding to the diagonal of $\bm\Lambda_\theta$ are currently constrained to be non-negative.
  
The only difference is that by default \code{glmer} uses a two-step fitting procedure, using \code{nAGQ=0} at the first stage to get preliminary estimates which are then used as starting points for a second optimization with Laplace approximation or Gauss-Hermite quadrature as specified by the user. Different nonlinear optimizers can be used at each stage: the current default, as specified in \code{glmerControl}, is to use Powell's BOBYQA followed by a box-constrained variant of the Nelder-Mead simplex algorithm. For faster, approximate fitting, the second stage can be omitted; in the rare cases where the initial \code{nAGQ=0} fit gives poor results, the first stage can be skipped via \code{glmerControl(nAGQ0initStep = FALSE)}.
  
However, because the profiled log-likelihood is an even function of the diagonal elements of $\btheta$
and $\bm\Sigma_\theta=\bm\Lambda_\theta\bm\Lambda_\theta\trans$ depends on them only through their squares, positive and negative values yield identical likelihoods. This symmetry means that constrained optimization is not strictly necessary --- an unconstrained optimizer will converge to a correct solution, approaching zero from either side in the boundary case of a singular random-effects covariance matrix. (Once a solution is found, we can map it to a unique solution where the diagonal elements are all non-negative.)
A similar approach to removing constraints could work for structured covariance matrices where correlation parameters are constrained to $(-1,1)$, e.g. by parameterizing the model in terms of a phase parameter $p$ where $\rho = \sin(p)$ (and then mapping $p$ to $(0, 2\pi)$).
Removing these constraint would permit the use of a broader class of unconstrained optimizers, though this feature has not yet been incorporated into \code{lme4}.

\section{Examples}

\subsection{CBPP}
\label{sec:cbpp}

The \code{?cbpp} help page describes the CBPP data set
\citep{lesnoff_within-herd_2004} as follows:
\begin{quote}
Contagious bovine pleuropneumonia (CBPP) is a major disease of
cattle in Africa, caused by a mycoplasma.  This dataset describes
the serological incidence of CBPP in zebu cattle during a
follow-up survey implemented in 15 commercial herds located in the
Boji district of Ethiopia.  The goal of the survey was to study
the within-herd spread of CBPP in newly infected herds. Blood
samples were quarterly collected from all animals of these herds
to determine their CBPP status.  These data were used to compute
the serological incidence of CBPP (new cases occurring during a
given time period).  Some data are missing (lost to follow-up).
\end{quote}
\cite{lesnoff_within-herd_2004} estimated
the effects of different treatments using (1) ordinary logistic regression
incorporating a variance-inflation factor, also known as
a quasi-binomial model (``logistic regression'' is sometimes
used specifically to describe analyses of Bernoulli responses, but
in this case there are multiple trials per observation [cows that
could become seropositive], and so a dispersion or scale parameter
can be estimated); (2) a GLMM implemented in \code{lme4}; and a
(3) Markov chain Monte Carlo algorithm \citet{zeger1991generalized}, 
which as they state allows for a non-parametric rather than a Normal model 
for the random effects.
The authors did not find any significant effects of treatment,
ascribing the null results to 
``a lack of power in the statistical analyses or to a quality problem for the medications used (and more generally, for health-care delivery in the Boji district).''

<<newcbpp2,echo=FALSE>>=
cbpp2 <- read.csv(system.file("vignette_data", "cbpp2.csv",
            package = "lme4"))
cbpp2 <- transform(cbpp2,period=factor(period),
                   treatment=factor(treatment,
                   levels=c("Partial/null","Complete","Unknown")))
@

(Note that Table 1 of \cite{lesnoff_within-herd_2004} contains a known
typographical error for herd 6. Consequently, results obtained using
the \code{cbpp} data set may not exactly reproduce some of the findings
reported in that paper.)

The \code{lme4} package includes two variants of the \code{cbpp}
data set. The second variant, \code{cbpp2}, contains corrected values
corresponding to Table~1 of \cite{lesnoff_within-herd_2004}. Although
the precise provenance of these data sets is unclear, the \code{cbpp}
data set matches the version held by the corresponding author of
\cite{lesnoff_within-herd_2004}.

<<cbppPlot,echo=FALSE,fig.cap="Incidence (proportion of cows becoming seropositive per observation period) vs. period. Colours show treatment category for each herd; point sizes reflect the number of seronegative cows at the start of each period. Lines connect the sets of observations from each herd.",fig.scap="Incidence">>=
g0 <- ggplot(cbpp2,aes(period, incidence/size, colour=treatment, shape = treatment)) +
    ## scale_y_continuous(limits = c(0, 1))+
    labs(x="Period",y="Proportional incidence")+
    scale_colour_brewer(palette="Dark2")
## faceted plot
## g0 + 
##     geom_point(aes(size = size)) +
##     facet_wrap(~herd) +
##     zmargin
## 
g0 +  geom_point(aes(size = size),alpha=0.4) +
    geom_line(aes(group=herd),alpha=0.3)
## geom_boxplot(fill="black",alpha=0.1,width=0.5,
## aes(group=interaction(period,treatment)))
## 
## boxplot(incidence/size ~ period, data = cbpp, las = 1,
##        xlab = 'Period', ylab = 'Probability of sero-positivity')
@

We model proportional incidence as a binomial response depending
on the additive fixed effects of period, treatment and average herd size;
to account for repeated measures we fit a model with a random effect
of herd. In practice we might also be interested in a period by treatment
interaction, but we neglect that term here.

As in \code{glm}, we can specify 
a binomial response as a proportion and use the \code{weights}
argument to specify the sample size, instead of the
more typical two-column \code{cbind(successes,failures)} format:
<<cbppModelI>>=
gm1 <- glmer(incidence/size ~ period + treatment + avg_size + (1 | herd),
             family = binomial,
             data = cbpp2, weights = size)
@

<<cbpp_diag,eval=FALSE,echo=FALSE>>=
## diagnostic junk about N-M vs bobyqa differences
tmpf <- function(x) c(deviance(x), unlist(getME(x,c("theta","beta"))))
gm1 <- glmer(incidence/size ~ period + treatment + avg_size + (1 | herd),
             family = binomial,
             data = cbpp2, weights = size)
gm1B <- update(gm1,control=glmerControl(optimizer="bobyqa"))
cbind(tmpf(gm1),tmpf(gm1B))
dd1 <- update(gm1,devFunOnly=TRUE)
dd1B <- update(gm1B,devFunOnly=TRUE)
gm1@optinfo$derivs$gradient
gm1B@optinfo$derivs$gradient
gm1@optinfo$derivs$gradient/gm1B@optinfo$derivs$gradient
library(numDeriv)
grad(dd1,tmpf(gm1)[-1])
grad(dd1B,tmpf(gm1B)[-1])
@

It is also worth considering
adding an observation-level random effect to the model,
which we can do by creating a new factor based on observation
number and using \code{update()} on the previous model:
<<cbppModelII, warning=FALSE>>=
cbpp2 <- transform(cbpp2,obs=factor(seq(nrow(cbpp2))))    
gm2 <- update(gm1,.~.+(1|obs))  ## herd and observation-level REs
gm3 <- update(gm1,.~.-(1|herd)+(1|obs))  ## observation-level REs only
@

\subsubsection{Model summary}

The first part of the summary reiterates
the family and link function used,
the model formula, and gives various summary statistics
(log-likelihood etc.), as well as quantiles of the
scaled (Pearson) residuals:
<<summary1,echo=FALSE>>=
ss <- summary(gm1)
cc <- capture.output(print(summary(gm1)))
reRow <- grep("^Random effects",cc)
cat(cc[1:(reRow-2)],sep="\n")
@
These quantities are also
accessible via standard accessors
(\code{AIC()}, \code{BIC()}, \code{logLik()}).

The next chunk of \code{summary()} describes the random effects
and the number of levels associated with each grouping factor
(the latter is useful for checking that random-effects formulae
have been specified correctly):
<<summary2,echo=FALSE>>=
feRow <- grep("^Fixed effects",cc)
cat(cc[reRow:(feRow-2)],sep="\n")
@
This information is also accessible via \code{VarCorr()}, which returns
a list of variance-covariance matrices (the \code{print} method
for \code{VarCorr} objects allows control of whether the variance,
or standard deviation, or both, are printed).

Next come the estimates of the fixed effects, along with Wald
estimates of the standard error, $Z$ statistic, and $p$-value:
<<summary3,echo=FALSE>>=
corRow <- grep("^Correlation",cc)
cat(cc[feRow:(corRow-2)],sep="\n")
@
One can use \code{coef(summary())} to retrieve this information,
and optionally format it with \code{printCoefmat()}.

The last component of \code{summary()} gives the
estimated correlations among the fixed-effect parameters,
which can be useful for assessing multicollinearity
(it can also be overwhelming: it is suppressed by default
for models with more than 20 fixed-effect parameters, and
can also be suppressed by using \code{print(summary(.),correlation=FALSE)}).
<<summary4,echo=FALSE>>=
cat(cc[corRow:length(cc)],sep="\n")
@

\subsubsection{Diagnostics}

A range of graphical diagnostic tools is available
for \code{merMod} objects. The plot methods in the \code{lme4}
package are inspired by those in the \code{nlme} package, using
\code{lattice} plots to provide a reasonable blend
of convenience and flexibility. 

\code{merMod} objects are also compatible with the 
\code{performance} package and the \code{DHARMa} package, both commonly 
used for model checking.

The following code produces a standard range
of diagnostic plots (Figure~\ref{fig:glmerDiag}), similar to the ones
in base R's \code{plot.lm} method. These diagnostics will generally be
useful for models where the conditional density is approximately normal
(but heteroscedastic) --- e.g., Poisson responses with large mean or
binomial responses with large numbers of successes --- and less so otherwise,
e.g. for binary responses.

<<glmerDiagFake,eval=FALSE>>=
## basic residual plot
plot(gm1)
## scale-location plot
plot(gm1,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"))
## boxplot of residuals grouped by a categorical predictor
plot(gm1,period~resid(.))
## Q-Q plot
qqmath(gm1)
@    

<<glmerDiag,fig.cap="Graphical diagnostics",warning=FALSE,echo=FALSE>>=
p1 <- plot(gm1,type=c("p","smooth"),
    main="Default (Pearson resid vs. fitted)")
p2 <- plot(gm1,sqrt(abs(resid(.)))~fitted(.),type=c("p","smooth"),
     main="Scale-location")
p3 <- plot(gm1,period~resid(.), main = "Grouped residuals")
p4 <- qqmath(gm1,main="Quantile-quantile plot",type=c("p","r"))
grid.arrange(p1,p2,p3,p4,nrow=2)
@    

The \code{ranef()} accessor extracts the conditional modes; the argument
\code{condVar=TRUE} additionally extracts the variances of
the conditional modes, which are stored as an attribute
labelled \code{"postVar"} --- a three-dimensional array that
gives the variance-covariance matrix of the conditional modes
for each level of the grouping variable. The plotting methods
\code{dotplot()} and \code{qqmath()} return lists of graphical
objects showing \emph{caterpillar plots} (ordered values of
the random effects with confidence bars); in the case of
the Q-Q plot (\code{qqmath}) the $y$-axis shows corresponding 
values of the standard normal quantiles (Figure~\ref{fig:glmerRanefplot}).

<<glmerRanefplot,fig.cap="Graphical display of random effects. \\emph{Left}: conditional modes $\\pm 1.96 \\times \\text{conditional standard deviation}$, ordered by magnitude. \\emph{Right}: quantile-quantile plot, with linear regression line overlaid.",echo=FALSE>>=
rr <- ranef(gm1,condVar=TRUE)
dd <- dotplot(rr)
qq <- qqmath(rr,type=c("p","r"))
grid.arrange(dd$herd,qq$herd,ncol=2)
@

\code{performance::check\_model()} checks a variety of classical assumptions 
of GLMs. For mixed-effect models, it also assesses
the normality of the distribution of conditional modes (Figure~\ref{fig:perf_chk_mod_cbpp}).

<<perf_chk_mod_cbpp,fig.cap="Model diagnostics using \\code{performance::check\\_model}. The plot showcasing the normality of random effects (lower right panel) is the transpose of the right panel in Figure 6.",fig.height=10,echo=FALSE>>=
check_model(gm1, check = c("pp_check", "binned_residuals",
                           "qq", "reqq"))
@

The \code{DHARMa} package is also compatible with \code{merMod} objects.
\code{DHARMa} generates simulation-based residuals for
generalized linear (mixed) models and uses them for graphical
and statistical tests of model assumptions (Figure~\ref{fig:dharma_mod_cbpp}).

<<dharma_mod_cbpp,fig.cap="Model assumption checks using the DHARMa package. \\code{n.s.} denotes ``not significant''. Other exploration (such as \\code{plot(simulationOutput, form = cbpp2\\$period)}) does not reveal any obvious problems such as nonlinear reponses, so we proceed with the analysis despite the significant adjusted quantile test.",echo=FALSE>>=$
simulationOutput <- simulateResiduals(fittedModel = gm1)
plot(simulationOutput)
@

Having checked the diagnostics, we would now like to compare
the three models we have fitted.  Inspecting the \code{VarCorr}
components, we see that when we fit both herd- and observation-level
random effects, the among-herd variance is estimated as zero.
The appropriate procedure
at this point (e.g. whether one
drops non-significant terms, or those with small scaled magnitudes,
or those that worsen the AIC or BIC of the model)
depends on the goals of the analysis and one's
philosophy of model-building \citep{barrRandomEffectsStructure2013,matuschekBalancingTypeError2017,scandola2024}.

One might either stick with the full model, or continue
with the reduced model with observation-level random effects
only (as it has exactly the same likelihood as the full
model but uses an additional parameter, it would be chosen
according to either an information-theoretic or a
hypothesis-testing model selection framework).

Here we will start by
computing likelihood profiles and confidence intervals
for the model incorporating both random effects;
although it has the same point estimates and maximum likelihood
as the reduced model, confidence intervals that incorporate
non-local information (i.e. profile- or parametric bootstrap-based)
will give different, more conservative results for the full model.

%% lme4 wishlist: allow [Ww]ald in confint.merMod?
<<loadbatch,echo=FALSE>>=
cbpp_batch_vars <- load(system.file("vignette_data", "cbpp_batch.rda", package = "lme4"))
contr_batch_vars <- load(system.file("vignette_data", "Contraception_batch.rda", package = "lme4"))
@

The \code{profile} method computes profile likelihoods.
The computation can be slow, since complete profiling for
a model with $p$ random- and fixed-effect parameters requires fitting $p$
profiles, each of which requires many $p-1$-dimensional optimizations.  The machinery for generating likelihood profiles for GLMMs is similar to that for LMMs (see \cite{bates2015fitting}, \S\ 5.1).

The \code{profile} method returns an
object of class \code{thpr} --- a data frame
containing the profiles, augmented with attributes containing
interpolation splines for each parameter profile and their inverses
(using \code{splines::interpSpline} and \code{splines::backSpline});
the latter are used for plotting profiles and computing confidence
intervals.  An \code{as.data.frame} method adds \code{.focal} and
\code{.par} variables to the data frame, useful for customized
plots.

Profiles can be used for univariate (\code{xyplot}) and
bivariate (\code{splom}) profile plots, and to compute
profile confidence intervals (\code{confint}). (\code{confint}
applied to a \code{glmer} fit will first fit the profile,
then use it to compute profile confidence intervals.  Given
the computational cost of profiling, it makes sense to compute
and save the profile as an intermediate step if one plans to
do anything other than computing confidence intervals.)

Two other common methods for computing confidence intervals
are parametric bootstrapping (\code{method = "boot"}) and the
classical Wald approximation (\code{method = "Wald"}).
Parametric bootstrapping is much slower, but more accurate
(and, via the \code{FUN} argument, can generate confidence intervals
for any quantity that can be derived from a fitted model).
The Wald approximation is faster and less accurate than profile
confidence intervals.
By default \code{glmer} only returns estimates for the fixed-effects
parameters, as the assumptions of the Wald approximation are often
violated badly for random-effects (co)variances and correlations.
In the examples below we use the finite-difference Hessian
(second derivative matrix of the estimated parameters) and the
delta method to compute Wald confidence intervals for random-effects
standard deviations and correlations, when possible.

Figure~\ref{fig:cbppcompplot} compares all three of these confidence intervals
across all three of the models fitted.

<<cbppConfidence,echo=FALSE>>=
des_ord <- c("sd_(Intercept)|herd", "sd_(Intercept)|obs",
             "avg_size", "treatmentUnknown", "treatmentComplete")
combCI <- subset(cbpp_combCI,var!="(Intercept)" & !grepl("^period",var))
combCI <- transform(combCI,  var = factor(var, levels = rev(des_ord)))
combCI <- transform(combCI,
            vartype = factor(ifelse(grepl("^sd", var), "random effects", "fixed effects"))
)                    
@

<<cbppcompplot,fig.cap="CBPP example: comparison of point and confidence interval estimation for different methods. Wald CIs are missing for random-effects parameters when the fitted model is singular.",fig.scap="CBPP comparison",echo=FALSE,fig.width=7,fig.height=5,warning=FALSE>>=
mnames <- cbpp_df_name$mnames
pd <- position_dodge(width=0.6)
ggplot(combCI, aes(var, est, colour=type, shape=model)) +
  geom_pointrange(aes(ymin=lwr, ymax=upr),
                  position=pd) +
  labs(x="", y="Estimate (log-odds of seropositivity)") +
  geom_hline(yintercept=0, lty=2) +
  coord_flip() +
  scale_colour_brewer(palette="Dark2", guide = revguide) +
  scale_shape(guide=revguide, labels = mnames) +
  facet_wrap(~ vartype, ncol = 1, scale = "free")
@    

If the default optimizers (Nelder-Mead followed by BOBYQA) do not
perform well, one could attempt to re-fit a [g]lmer model with a variety of 
different optimizers using \code{allFit()}. To see which optimizers \code{lme4}
currently supports, one can use the command \code{allFit(show.meth.tab=TRUE)} 
to show all of the different methods. 

To use \code{allFit()}, supply the initial fitted model as input. Useful 
results are obtained from \code{summary(allFit())}: 
\begin{itemize}
  \item \code{\$which.OK} returns which optimizers worked. Below only applies to
  optimizers that were successful.
  \item \code{\$llik} returns log-likelihoods
  \item \code{\$fixef} fixed-effect estimates 
  \item \code{\$sdcor} random-effect standard deviations and correlations
  \item \code{\$theta} random-effect parameters on the Cholesky scale
\end{itemize}

We will show the summary results for \code{\$sdcor}; the other components are similar.

<<allfit_cbpp1,message=FALSE,warning=FALSE>>=
gm_all <- allFit(gm1)
@

<<allfit_cbpp2>>=
ss <- summary(gm_all)
ss$sdcor
@


As of version 2.0, \code{lme4} can also specify structured variance-covariance 
matrices for (generalized) linear mixed models.
\code{lme4} now supports unstructured (general positive definite), diagonal, 
compound symmetry, and first-order autoregressive (AR1) structures.
By default, AR1 models assume a homogeneous-variance model (the variance is the same for all time steps), while the other models assume heterogeneous-variance models (variances differ for every level of the varying term); users can adjust this with the \code{hom} argument (e.g. \code{ar1(..., hom = FALSE)}).

The unstructured covariance structure is the default for mixed models. 
Here we illustrate fitting an AR1 model; the next example will show diagonal and compound symmetric
models.

<<covar_cbpp>>=
gm.ar1 <- glmer(incidence/size ~ ar1(1 + herd | period),
                family = binomial,
                data = cbpp, weights = size)
print(VarCorr(gm.ar1))
@

For this particular model, a heterogeneous AR1 model (\code{ar1(..., hom = FALSE)}) results in a singular fit.

See the \href{https://CRAN.R-project.org/package=lme4/vignettes/covariance_structures.html}{lme4 covariance structures vignette}
for more detail.

\subsection{Contraception}

\citet{huq1990bangladesh} use multilevel models to analyze data from a
fertility survey of women in Bangladesh. These data are available as
the \code{Contraception} object in the \pkg{mlmRev} package. The
response variable is binary and indicates whether or not each woman
was using contraception at the time of the survey. Covariates included
the woman's age, the number of live children she had, whether she
lived in an urban or rural setting, and the district in which she
lived.

<<loadContraception, echo=FALSE>>=
library(mlmRev)
data(Contraception)
@ 

<<graphContraception,echo=FALSE,fig.cap="Contraception example: proportion of contraceptive use by centered age, number of living children (line type), and urban/rural residence (facet). Curves fitted using generalized additive models with cubic spline smoothing. Density plots along bottom and top margins show the age distributions of women not using contraception (bottom) and using contraception (top).">>=
Contraception <- transform(Contraception, 
                        facet_urban = ifelse(urban=="Y","Urban","Rural"))
gg0 <- ggplot(Contraception) +
    geom_smooth(aes(age, ifelse(use == "Y",1,0), colour = livch, fill = livch),
                method = "gam", alpha = 0.1,
                method.args = list(gamma = 0.6),
                formula = y ~ s(x, bs = "cs")) +
    facet_wrap(~facet_urban) +
    scale_x_continuous("Centered age") + 
    scale_y_continuous("Proportion", limits = c(0, 1), expand = c(0,0)) +
    scale_colour_brewer("Number\nof living\nchildren", palette="Dark2") +
  scale_fill_brewer("Number\nof living\nchildren", palette="Dark2")

## add marginal density ribbons
gg0 + geom_density(
        data = subset(Contraception, use == "N"),
        aes(x = age, y = after_stat(density), fill = livch),
        alpha = 0.3, colour = NA) +
  geom_ribbon(
    data = subset(Contraception, use == "Y"),
    stat = "density",
    aes(
      x = age,  
      ymin = 1 - after_stat(density) * 0.5,
      ymax = 1,
      fill = livch
    ),
    alpha = 0.3, colour = NA
  )

@ 

<<ch,echo=FALSE>>=
Contraception <- transform(Contraception,
   ch = factor(livch != 0, labels = c("N","Y")),
   age_s = age/(2*sd(age)))
@  

Figure~\ref{fig:graphContraception} shows exploratory smooth curves of
contraceptive use as a function of centered age, stratified by number
of living children and urban/rural residence. In rural areas, women
with two living children show the highest rates of contraceptive use,
peaking near 50\% at the average age, while women in urban areas with
three or more living children show the highest rates near the average
age but with a more pronounced decline at older ages. In both
settings, women with no living children consistently show the lowest
rates of contraceptive use. The nonmonotonic effect of age on contraceptive use
and the apparent dependence of this age trend on child status motivate the
model specifications explored below. (These nonmonotonic trends were not
noticed in the original analysis of the data, which concluded that there
was no significant effect of age.)

We construct six \code{glmer} models with varying choices of explanatory
variables and random effects structure, comparing them via \code{anova}. The
models considered different variables.

For instance, there are two different ways to encode the number of living 
children: \code{livch}, a four-level factor distinguishing \code{0}, \code{1}, 
\code{2}, or \code{3+} children, while other models use \code{ch} instead (later 
we label as \code{binary\_child}), which is a binary indicator for whether the 
woman has any living children. 

Second, we vary whether child status interacts with the woman's age: two models
include \code{ch} (or \code{livch}) and \code{age} as additive terms, while the 
other four include a \code{ch} and \code{age} interaction. A quadratic effect of
age (\code{I(age\^{}2)}) was added to account for the nonlinear effect of age.

Third, we explore different random effects structures at the district level.
Three of them use a single random intercept per district \code{(1 | district)}.
One of them extends this with a random slope for urban status 
\code{(urban | district)}, allowing the urban/rural difference to vary by district. 
The next uses nested random intercepts for district and site within district (\code{1 | district/urban})
separating district-level from urban-within-district variation.
(\cite{scandola2024} refer to this formulation as a ``complex random intercepts'' model).
The remaining model uses only the \code{urban:district} grouping. 
To reduce convergence warnings and facilitate interpretability, age (already approximately centered
in the data set) was standardized by scaling by twice the standard deviation \citep{gelman2008scaling}.

<<cm1,message=FALSE>>=
cm1 <- glmer(use ~ age_s + I(age_s^2) + urban + livch + (1|district), 
                           Contraception, binomial)
## switch from livch (ordinal) to ch (binary)
cm2 <- update(cm1, . ~ . - livch + ch)
## add age by children interaction
cm3 <- update(cm2, . ~ . + age_s:ch)
## allow urban effect to vary across districts (correlated)
cm4 <- update(cm3, . ~ . - (1|district) + (1+urban|district))
## compound symmetric/nested formulation
cm5 <- update(cm3, . ~ . - (1|district) + (1 | district/urban))
## as above but drop district effect
cm6 <- update(cm3, . ~ . - (1|district) + (1 | district:urban))
@

<<anovaContraception, echo = FALSE>>=
## copy of bbmle::AICtab
AICtab <- function(..., mnames) {
  aic_df <- do.call(AIC, list(...))
  LLvec <- sapply(list(...), function(x) c(logLik(x)))
  rownames(aic_df) <- mnames
  aic_df <- transform(aic_df,
                        Δnegloglik = -LLvec - min(-LLvec),
                        ΔAIC = AIC - min(AIC))
  aic_df <- subset(aic_df, select = -AIC)
  aic_df <- aic_df[order(aic_df$ΔAIC),]
  aic_df
}

mod_list <- c(cm1, cm2, cm3, cm4, cm5, cm6)
op <- options(digits = 3)
mnamevec <- c("int_child    + age  + (1 | district)",
              "binary_child + age  + (1 | district)",
                "binary_child × age + (1 | district)",
                "binary_child × age + (1 + urban | district)",
                "binary_child × age + (1 | district/urban)",
                "binary_child × age + (1 | district:urban)")
aictab <- do.call(AICtab, c(mod_list, list(mnames = mnamevec)))
@

\begin{table}
<<aictab, echo  = FALSE, results = "asis">>=
knitr::kable(aictab, digits = 3)
@
\caption{Model comparison for Contraception fits. Note that for likelihood ratio tests, or for AIC comparisons restricted to nested models \citep{ripleySelectingAmongstLarge2004a}, the nesting sequence for the random-effects models is \code{(1+urban|district)} $>$ \code{(1|district/urban)} $>$ \{%
\code{(1|district)}, \code{(1|district:urban)}%
\}.
}
\label{tab:contraictab}
\end{table}

Table~\ref{tab:contraictab} shows that
the top three models, all of which include a child-by-age interaction and some effect of urbanization, fit approximately equally well ($\Delta$ negative log-likelihood $<0.5$). The \code{(1|district/urban)} model barely improves on the fit of \code{(1|district:urban)} (0.005 log-likelihood units), at the cost of an extra variance parameter, so it is almost 2 AIC units worse. \code{(1 + urban | district)} is a bit better ($\approx$ 0.5 log-likelihood units), but includes a covariance parameter.
Models that drop the interaction between child status and age or replace the binary child indicator with an integer count perform substantially worse ($\Delta \textrm{AIC} > 10$).

Overall, the results suggest that accounting for urban/rural variation at the district level and including the age and child interaction are both important, but the precise random effects structure for urban/rural variation is relatively
unimportant.

<<contr-predplot, warning = FALSE, fig.cap="Predictions for contraception fit (\\code{(1|district/urban)} model). Heavy lines and ribbons show population-level predictions; light lines show district-level predictions.", echo = FALSE>>=
agevec <- seq(-15, 20)
pframe <- with(Contraception, expand.grid(age = agevec, urban = levels(urban), ch = levels(ch)))
pframe <- transform(pframe, age_s = age/(2*sd(Contraception$age)),
                    facet_urban = ifelse(urban=="Y","Urban","Rural"))
pred  <- predict(cm5, newdata = pframe, se.fit = TRUE, re.form = NA)
pframe$use <- plogis(pred$fit)
pframe$lwr <- plogis(pred$fit-1.96*pred$se.fit)
pframe$upr <- plogis(pred$fit+1.96*pred$se.fit)
pframe2 <- with(Contraception, expand.grid(age = agevec, urban = levels(urban), ch = levels(ch),
                 district = levels(district)))
pframe2 <- transform(pframe2, age_s = age/(2*sd(Contraception$age)),
            facet_urban = ifelse(urban=="Y","Urban","Rural"),
            dist_urb = paste(urban, district, sep = ":"))
c_urb <- unique(with(Contraception, paste(urban, district, sep = ":")))
pframe2 <- subset(pframe2, dist_urb %in% c_urb)
pred2  <- predict(cm5, newdata = pframe2, re.form = NULL)
pframe2$use <- plogis(pred2)

ggplot(pframe, aes(age)) + 
    geom_line(aes(y = use, colour = ch)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr, fill = ch), color = NA, alpha = 0.3) +
    facet_wrap(~facet_urban) +
    scale_x_continuous("Centered age") + 
    scale_y_continuous("Proportion", limits = c(0, 1), expand = c(0,0)) +
    scale_colour_brewer("Living\nchildren", palette="Dark2") +
    scale_fill_brewer("Living\nchildren", palette="Dark2") +
    geom_line(data = pframe2, aes(group = interaction(dist_urb, ch), colour = ch, y = use), alpha = 0.2)
@

As with the CBPP data set, we can also compute
profile, Wald, and parametric bootstrap confidence intervals for all of the models to
understand the effects of each variable and visualize the among-model variation.

<<ContraceptionBatch,echo=FALSE>>=
contr_combCI <- subset(contr_combCI, var != "(Intercept)")
combCI <- transform(contr_combCI,
    vartype = factor(ifelse(grepl("^(sd|cor)", var), "random effects", "fixed effects")),
    model = factor(model, levels = rev(levels(model)))  ## flip AND flip order of guide below
    )
@

<<contracompplot,fig.cap="Contraception example: comparison of point and confidence interval estimation for different methods. (Note that the Wald CIs for variation in the intercept across districts, in the \\code{(1|district/urban)} model, include negative values.)",fig.scap="Contraception comparison",echo=FALSE,fig.width=10,fig.height=10,warning=FALSE,fig.lab="fig:contracomp">>=
ggplot(combCI, aes(var, est, colour=type, shape=model)) +
  geom_pointrange(aes(ymin=lwr, ymax=upr),
                  position=position_dodge(width=0.6)) +
  labs(x="", y="Estimate (log-odds of seropositivity)") +
  geom_hline(yintercept=0, lty=2) +
  coord_flip() +
  scale_colour_brewer(palette="Dark2", guide = revguide) +
  scale_shape(guide=revguide) +
  facet_wrap(~vartype, ncol = 1, scale = "free", space = "free_y") +
  theme(legend.position = "inside",
  legend.justification = c("left", "top"),
  legend.box = "horizontal",
      legend.background = element_rect(fill = "white", colour = "grey50"),
      #legend.margin = margin(t = 5, r = 5, b = 5, l = 5, unit = "pt"),
      legend.box.margin = margin(t = 10, l = 10, unit = "pt"))
@

The point estimates and confidence intervals of the explanatory variables are similar across the six models, and across different methods for confidence interval construction, with a few exceptions. 

Urban residence, presence of living children, and age were all strong predictors of contraceptive use among women in Bangladesh (because we are fitting a quadratic model for the effect of age, the non-significant effect of \code{age\_s} simply means that the marginal effect of age \emph{at the mean age} is not clearly negative or positive).

With \code{lme4} versions greater than 2.0, we can
also set up models with structured random effects
covariance matrices.
In the models below \code{diag} specifies a diagonal 
variance-covariance structure, while \code{cs} specifies a compound symmetric model. The \code{hom} argument specifies
whether the model should allow different variances for
each varying effect (e.g. for intercepts and slopes in
the models below, which allow the effects of the
continuous covariate of age to vary across districts).

Thus instead of (e.g.) \code{(1+urban|district)}, we can specify the random effect with
diagonal, heterogeneous variances (\code{diag(1+urban|district)});
 diagonal, homogeneous variances (\code{diag(1+urban|district, hom = TRUE)});
compound symmetric, heterogeneous variances (\code{cs(1+urban|district)}); or
compound symmetric, homogeneous variances (\code{cs(1+urban|district, hom = TRUE)}).

<<cm_varcorr, echo = FALSE>>=
cc <- glmerControl(check.conv.grad = .makeCC("warning", tol = 1e-2), check.conv.hess = "ignore")
base_form <- use ~ age*ch + I(age^2) + urban
cm.diag <- glmer(update(base_form, . ~ . + diag(1 + urban|district)), Contraception, binomial, control = cc )
cm.homdiag <- glmer(update(base_form, . ~ . + diag(1 + urban|district, hom = TRUE)), Contraception, binomial, control = cc)
cm.cs <- glmer(update(base_form, . ~ . + cs(1 + urban|district)), Contraception, binomial, control = cc)
cm.homcs <- glmer(update(base_form, . ~ . + cs(1 + urban|district, hom = TRUE)), Contraception, binomial, control = cc)
@

We can use \code{VarCorr} and other accessor methods as we would for models with default, unstructured covariance matrices, e.g.:

<<cm_varcorr_view>>=
VarCorr(cm.cs)
@

\bibliography{glmer}

\section{Package versions used}

<<pkgs, echo=FALSE>>=
pkglist <- c("lme4", "performance", "DHARMa", "see")
pp <- sapply(pkglist, function(x)
  sprintf("%s: %s", x, format(packageVersion(x))))
ppp <- paste(pp, collapse  = "; ")
@ 

Compiled with \Sexpr{R.version.string} and package versions \Sexpr{pp}.

\section{Appendix: derivation of PIRLS}

We seek to maximize the unscaled conditional log density for a GLMM
over the conditional modes, $\bm u$. This problem is very similar to
maximizing the log-likelihood for a GLM, which is a very
thoroughly studied problem \citep[e.g.][]{McCullaghNelder1989}.
The standard algorithm for dealing with this kind of
problem is iteratively reweighted least squares (IRLS). Here we modify
IRLS by incorporating a penalty term that accounts for variation in
the random effects; we call the resulting algorithm
penalized iteratively reweighted
least squares (PIRLS).

The unscaled conditional log-density takes the form,
\begin{equation}
  f(\bm u) = \log p(\bm y, \bm u | \bm\beta, \bm\theta) = 
  \bm\psi^\top \bm A \bm y - 
  \bm a^\top \bm \phi  + 
  \bm c -
  \frac{1}{2}\bm u^\top \bm u -
  \frac{q}{2}\log{2\pi}
\end{equation}
where $\bm\psi$ is the $n$-by-$1$ canonical parameter of an
exponential family, $\bm\phi$ is the $n$-by-$1$ vector of cumulant
functions, $\bm c$ an $n$-by-$1$ vector of normalizing constants, and
$\bm A$ is an $n$-by-$n$ diagonal matrix of prior weights, $\bm
a$. Both $\bm a$ and $\bm c$ could depend on a dispersion parameter,
although we ignore this possibility for now.

The canonical parameter, $\bm\psi$, and vector of cumulant functions,
$\bm\phi$, depend on a linear predictor,
\begin{equation}
  \bm\eta = \bm o + \bm X \bm\beta + \bm Z \bm\Lambda_\theta \bm u
\end{equation}
where $\bm o$ is an $n$-by-$1$ vector of \emph{a priori} offsets. The
specific form of this dependency is specified by the choice of the
exponential family. The mean of this distribution, $\bm\mu$,
is the \emph{inverse link function} $g^{-1}$ applied to $\bm\eta$.

Our goal is to find the values of $\bm u$ that maximize the unscaled
conditional density, for given $\bm\theta$ and $\bm\beta$
vectors. These maximizers are the conditional modes, which we require
for the Laplace approximation and adaptive Gauss-Hermite
quadrature. To do this maximization we use a variant of the Fisher
scoring method, which is the basis of the iteratively reweighted least
squares algorithm for generalized linear models. Fisher scoring is
itself based on Newton's method, which we apply first.

\subsection{Newton's method}

To apply Newton's method, we need the gradient and the Hessian of the
unscaled conditional log-likelihood. Following standard GLM theory
\citep{McCullaghNelder1989}, we use the chain rule,
\begin{displaymath}
  \frac{d L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u} = 
  \frac{d L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm\psi}
  \frac{d \bm\psi}{d \bm\mu}
  \frac{d \bm\mu}{d \bm\eta}
  \frac{d \bm\eta}{d \bm u}
\end{displaymath}
The first derivative in this chain follow from basic results in GLM
theory,
\begin{displaymath}
  \frac{d L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm\psi} = 
  (\bm y - \bm\mu)^\top \bm A
\end{displaymath}
Again from standard GLM theory, the next two derivatives define the
inverse diagonal variance matrix,
\begin{displaymath}
  \frac{d \bm\psi}{d \bm\mu} = \bm V^{-1}
\end{displaymath}
and the diagonal Jacobian matrix,
\begin{displaymath}
\frac{d \bm\mu}{d \bm\eta} = \bm M \quad .
\end{displaymath}
Finally, because $\bm\beta$ affects $\bm\eta$ only linearly,
\begin{displaymath}
  \frac{d \bm\eta}{d \bm u} = \bm Z \bm\Lambda_\theta
\end{displaymath}
Therefore we have,
\begin{equation}
  \frac{d L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u} = 
  (\bm y - \bm\mu)^\top \bm A
  \bm V^{-1}
  \bm M
  \bm Z \bm\Lambda_\theta +
  \bm u^\top \quad .
\label{eq:dPDEVdu}
\end{equation}
This is very similar to the gradient for GLMs with respect to fixed
effects coefficients, $\bm\beta$. The only difference induced by
differentiating with respect to the random effects, $\bm u$, is the
addition of the $\bm u^\top$ term.

Again we apply the chain rule to take the Hessian,
\begin{equation}
  \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} = 
  \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm\mu}
  \frac{d \bm\mu}{d \bm\eta}
  \frac{d \bm\eta}{d \bm u} + \bm I_q
\end{equation}
which leads to,
\begin{equation}
  \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} = 
  \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm\mu}\bm
  M \bm Z \bm\Lambda_\theta 
  + \bm I_q
\end{equation}
The first derivative in this chain can be expressed as,
\begin{equation}
  \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm\mu} =
  -\bm\Lambda_\theta^\top \bm Z^\top \bm M \bm V^{-1} \bm A  + 
  \bm\Lambda_\theta^\top \bm Z^\top \left[ \frac{d \bm M \bm V^{-1}}{d \bm\mu} \right] \bm A \bm R
\end{equation}
where $\bm R$ is a diagonal residuals matrix with $\bm y-\bm\mu$ on
the diagonal. The two terms arise from a type of product rule, where
we first differentiate the residuals, $\bm y-\bm\mu$, and then the
diagonal matrix, $\bm M \bm V^{-1}$, with respect to $\bm\mu$.

The Hessian can therefore be expressed as,
\begin{equation}
  \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u d \bm u} = 
  -\bm \Lambda_\theta^\top \bm Z^\top \bm M \bm A^{1/2}\bm V^{-1/2}\left( 
    \bm I_n - 
    \bm V \bm M^{-1}\left[ \frac{d \bm M \bm V^{-1}}{d \bm\mu} \right] \bm R
  \right) \bm V^{-1/2}\bm A^{1/2} \bm M \bm Z \bm\Lambda_\theta + \bm I_q
\label{eq:betaHessian}
\end{equation}
This result can be simplified by expressing it in terms of a weighted
random-effects design matrix, $\bm U = \bm A^{1/2}\bm V^{-1/2}\bm M
\bm Z \bm\Lambda_\theta$,
\begin{equation}
  \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} = 
  -\bm U^\top\left( 
    \bm I_n - 
    \bm V \bm M^{-1}\left[ \frac{d \bm V^{-1}\bm M}{d \bm\mu} \right] \bm R
  \right) \bm U + \bm I_q \quad .
\label{eq:betaHessiansimp}
\end{equation}

\subsection{Fisher-like scoring}

There are two ways to further simplify this expression for $\bm U^\top
\bm U$. The first is to use the canonical link function for the family
being used. Canonical links have the property that $\bm V = \bm M$,
which means that for canonical links,
\begin{equation}
  \frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm u} = 
  -\bm U^\top\left( 
    \bm I_n - 
    \bm I_n \left[ \frac{d \bm I_n}{d \bm\mu} \right] \bm R
  \right) \bm U + \bm I_q = \bm U^\top \bm U + \bm I_q
\end{equation}
The second way to simplify the Hessian is to take its expectation with
respect to the distribution of the response, conditional on the
current values of the spherical random effects coefficients, $\bm
u$. The diagonal residual matrix, $\bm R$, has expectation
0. Therefore, because the response only enters into the expression for
the Hessian via $\bm R$, we have that,
\begin{equation}
  E\left(\frac{d^2 L(\bm\beta, \bm\theta | \bm y, \bm u)}{d \bm u \, d \bm
      u} | \bm u \right) = 
  -\bm U^\top\left( 
    \bm I_n - 
    \bm U \bm M^{-1}\left[ \frac{d \bm V^{-1}\bm M}{d\mu} \right] E(\bm R)
  \right) \bm U + \bm I_q = \bm U^\top \bm U + \bm I_q
\label{eq:simphesstwo}
\end{equation}

\end{document}