Bayesian inference methods • jSDM

1 Bernoulli distribution with probit link function

1.1 Model definition

According to the article Albert & Siddhartha (1993), a possible model is to assume the existence of an underlying latent variable related to our observed binary variable using the following proposition :

Proposition

\[ \begin{aligned} &z_{ij} = \alpha_i X_i'\beta_j+ W_i'\lambda_j + \epsilon_{ij},\\ &\text{ with } \epsilon_{ij} \sim \mathcal{N}(0,1) \ \forall ij \text{ and such as : } \\ &y_{ij}= \begin{cases} 1 & \text{ if } z_{ij} > 0 \\ 0 & \text{ otherwise.} \end{cases} \end{aligned} \Rightarrow \begin{cases} y_{ij}| z_{ij} \sim \mathcal{B}ernoulli(\theta_{ij}) \text{ with } \\ \theta_{ij} = \Phi(\alpha_i + X_i'\beta_j+ W_i'\lambda_j) \\ \text{where } \Phi \text{ correspond to the repartition function} \\ \text{of the reduced centred normal distribution.} \end{cases} \]

Proof

\[\begin{aligned} \mathbb{P}(y_{ij}=1) & = \mathbb{P}(z_{ij} > 0)\\ & = \mathbb{P}(\alpha_i + X_i'\beta_j + W_i'\lambda_j + \epsilon_{ij} > 0)\\ & = \mathbb{P}(\epsilon_{ij} > - (\alpha_i + X_i'\beta_j + W_i'\lambda_j) \ ) \\ & = \mathbb{P}(\epsilon_{ij} \leq \alpha_i + X_i'\beta_j + W_i'\lambda_j) \\ & = \Phi( \alpha_i + X_i'\beta_j + W_i'\lambda_j) \\ \end{aligned}\]

In the same way:

\[\begin{aligned} \mathbb{P}(y_{ij}=0) & = \mathbb{P}(z_{ij} \leq 0)\\ & = \mathbb{P}(\epsilon_{ij} \leq - (\alpha_i + X_i'\beta_j + W_i'\lambda_j) \ ) \\ & = \mathbb{P}(\epsilon_{ij} > \alpha_i + X_i'\beta_j + W_i'\lambda_j) \\ & = 1 - \Phi( \alpha_i + X_i'\beta_j + W_i'\lambda_j) \\ \end{aligned}\]

with the following parameters and priors :

Latent variables: $W_i=(W_{i1},\ldots,W_{iq})$ where $q$ is the number of latent variables considered, which has to be fixed by the user (by default q=2). We assume that $W_i \sim \mathcal{N}(0,I_q)$ and we define the associated coefficients: $\lambda_j=(\lambda_{j1},\ldots, \lambda_{jq})'$. We use a prior distribution $\mathcal{N}(\mu_{\lambda},V_{\lambda})$ for each lambda not concerned by constraints to $0$ on upper diagonal and to strictly positive values on diagonal.
Explanatory variables:
- bioclimatic data about each site. $X=(X_i)_{i=1,\ldots,nsite}$ with $X_i=(x_{i0},x_{i1},\ldots,x_{ip})\in \mathbb{R}^{p+1}$ where $p$ is the number of bioclimatic variables considered and $\forall i, x_{i0}=1$.
- traits data about each species. $T=(T_j)_{j=1,\ldots,nspecies}$ with $T_j=(t_{j0},t_{j1},\ldots,t_{jq},\ldots,t_{jnt})\in \mathbb{R}^{nt+1}$ where $nt$ is the number of species specific traits considered and $t_{j0}=1,\forall j$.
The corresponding regression coefficients for each species $j$ are noted : $\beta_j=(\beta_{j0},\beta_{j1},\ldots,\beta_{jp})'$ where $\beta_{j0}$ represents the intercept for species $j$ which is assume to be a fixed effect.
- In the absence of data on species traits, the effect of species $j$: $\beta_j$; follows the same a priori Gaussian distribution such that $\beta_j \sim \mathcal{N}_{p+1}(\mu_{\beta},V_{\beta})$, for each species.
- If species traits data are provided, the effect of species $j$: $\beta_j$; follows an a priori Gaussian distribution such that $\beta_j \sim \mathcal{N}_{p+1}(\mu_{\beta_j},V_{\beta})$, where $\mu_{\beta_{jk}} = \sum_{r=0}^{nt} t_{jr}.\gamma_{rk}$ for $k=0,\ldots,p$, takes different values for each species. We assume that $\gamma_{rk} \sim \mathcal{N}(\mu_{\gamma_{rk}},V_{\gamma_{rk}})$ as prior distribution.
$\alpha_i$ represents the random effect of site $i$ such as $\alpha_i \sim \mathcal{N}(0,V_{\alpha})$ and we assumed that $V_{\alpha} \sim \mathcal {IG}(\text{shape}=0.5, \text{rate}=0.005)$ as prior distribution by default.

1.2 Conjugate priors

1.2.1 Fixed species effects

Proposition

We go back to a model of the form: $Z' = X\beta + \epsilon$ to estimate the posterior distributions of betas, lambdas and latent variables $W_i$ of the model. For example concerning $\lambda_j$, we define $Z'_{ij} = Z_{ij} - \alpha_i - X_i'\beta_j$ such as $Z'_{ij} = W_i'\lambda_j + \epsilon_{ij}$ so $Z'_{ij} \ | \ W_i \ , \ \lambda_j \ \sim \mathcal{N}( W_i'\lambda_j, 1)$.

In this case we can use the following proposition:

\[\begin{cases} Y \ | \ \beta &\sim \mathcal{N}_n ( X\beta, I_n) \\ \beta &\sim \mathcal{N}_p (m,V) \end{cases} \Rightarrow \begin{cases} \beta|Y &\sim \mathcal{N}_p (m^*,V^*) \text{ with } \\ m^* &= (V^{-1} + X'X)^{-1}(V^{-1}m + X'Y)\\ V^*&=(V^{-1} + X'X)^{-1} \end{cases}\].

Proof

\[\begin{aligned} p(\beta \ | \ Y) & \propto p(Y \ | \ \beta) \ p(\beta) \\ & \propto \frac{1}{(2\pi)^{\frac{n}{2}}}\exp\left(-\frac{1}{2}(Y-X\beta)'(Y-X\beta)\right)\frac{1}{(2\pi)^{\frac{p}{2}}|V|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\beta-m)'V^{-1}(\beta-m)\right) \\ & \propto \exp\left(-\frac{1}{2}\left((\beta-m)'V^{-1}(\beta-m) + (Y-X\beta)'(Y-X\beta)\right)\right) \\ & \propto \exp\left(-\frac{1}{2}\left(\beta'V^{-1}\beta + m'V^{-1}m - m'V^{-1}\beta -\beta'V^{-1}m + Y'Y + \beta'X'X\beta - Y'X\beta - \beta'X'Y\right)\right) \\ & \propto \exp\left(-\frac{1}{2}\left(\beta'(V^{-1}+X'X)\beta -\beta'(V^{-1}m + X'Y) - (Y'X + m'V^{-1})\beta + m'V^{-1}m + Y'Y \right)\right) \\ & \propto \exp\left(-\frac{1}{2}\left(\beta'(V^{-1}+X'X)\beta -\beta'(V^{-1}m + X'Y) - (X'Y + V^{-1}m)'\beta + m'V^{-1}m + Y'Y \right)\right) \\ & \propto \exp(-\frac{1}{2}\left(\beta - (V^{-1}+X'X)^{-1}(V^{-1}m + X'Y)\right)'(V^{-1}+X'X)\left(\beta - (V^{-1}+X'X)^{-1}(V^{-1}m + X'Y)\right)\\ & \quad -(V^{-1}m + X'Y)'(V^{-1}+X'X)^{-1}(V^{-1}m + X'Y) +m'V^{-1}m + Y'Y)\\ & \propto \exp\left(-\frac{1}{2}\left(\beta - \underbrace{(V^{-1}+X'X)^{-1}(V^{-1}m + X'Y)}_{m^*}\right)'\underbrace{(V^{-1}+X'X)}_{{V^*}^{-1}}\left(\beta - (V^{-1}+X'X)^{-1}(V^{-1}m + X'Y)\right)\right) \end{aligned}\]

Actually, we use that proposition to estimate betas, lambdas and gammas if species traits data are provided.

1.2.2 Random site effects

Proposition

About the posterior distribution of the random site effects $(\alpha_i)_{i=1,\dots,nsite}$, we can use a transformation of the form $Z'_{ij} = \alpha_i + \epsilon_{ij}$, with $Z'_{ij} = Z_{ij} - W_i'\lambda_j - X_i'\beta_j$ so $Z'_{ij} \ | \ W_i, \ \lambda_j, \ \beta_j, \ \alpha_i \ \sim \mathcal{N}(\alpha_i,1)$. We then use the following proposition:

\[\begin{cases} x \ | \ \theta & \sim \mathcal{N}(\theta, \ \sigma^2) \\ \theta & \sim \mathcal{N}(\mu_0,{\tau_0}^2) \\ \sigma^2 & \text{ known} \end{cases} \Rightarrow \begin{cases} \theta | \ x &\sim \mathcal{N}(\mu_1,{\tau_1}^2) \text{ with }\\ \mu_1 &= \dfrac{{\tau_0}^2\mu_0 + x\sigma^2}{{\tau_0}^{-2}+\sigma^{-2}} \\ {\tau_1}^{-2} &={\tau_0}^{-2}+\sigma^{-2} \end{cases}\].

Proof

\[\begin{aligned} p(\theta \ | \ x) & \propto p(x \ | \ \theta) \ p(\theta) \\ & \propto \frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}\exp\left(-\frac{1}{2\sigma^2}(x-\theta)^2\right)\frac{1}{(2\pi{\tau_0}^2)^{\frac{1}{2}}}\exp\left(-\frac{1}{2{\tau_0}^2}(\theta-\mu_0)^2\right) \\ & \propto \exp\left(-\frac{1}{2{\tau_0}^2}(\theta-\mu_0)^2-\frac{1}{2\sigma^2}(x-\theta)^2\right) \\ & \propto \exp\left(-\frac{1}{2{\tau_0}^2}(\theta^2-2\mu_0\theta)-\frac{1}{2\sigma^2}(\theta^2-2x\theta)\right)\\ & \propto \exp\left(-\frac{1}{2}\left(\theta^2 ({\tau_0}^{-2}+\sigma^{-2})-2\mu_0\theta{\tau_0}^{-2}-2x\theta\sigma^{-2}\right)\right)\\ & \propto \exp\left(-\frac{1}{2({\tau_0}^{-2}+\sigma^{-2})^{-1}}\left(\theta^2 -2\theta \frac{\mu_0{\tau_0}^{-2}+ x\sigma^{-2}}{{\tau_0}^{-2}+\sigma^{-2}}\right)\right)\\ \end{aligned}\]

1.2.3 Random site effect variance

Proposition

Concerning posterior distribution of $V_{\alpha}$, the variance of random site effects $(\alpha_i)_{i=1,\dots,nsite}$, we use the following proposition :
If \[\begin{cases} x \ | \ \sigma^2 & \sim \mathcal{N}_n (\theta, \ \sigma^2I_n) \\ \sigma^2 & \sim \mathcal{IG} (a,b) \\ \theta & \text{ known} \end{cases} \Rightarrow \begin{cases} \sigma^2|x \sim \mathcal{IG}(a',b') \text{ with } \\ a' = a + \frac{n}{2} \\ b' = \frac{1}{2}\sum\limits_{i=1}^n(x_i-\theta)^2 + b. \end{cases}\]

Proof

\[\begin{aligned} p(\sigma^2 \ | \ x) & \propto p(x \ | \ \sigma^2) \ p(\sigma^2) \\ & \propto \frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}}\exp\left(-\frac{1}{2\sigma^2}(x-\theta)'(x-\theta)\right)\frac{b^a}{\Gamma(a)}{(\sigma^2)}^{-(a+1)}\exp\left(-\frac{b}{\sigma^2}\right) \\ & \propto {(\sigma^2)}^{-\left(\underbrace{\frac{n}{2}+a}_{a'}+1\right)}\exp\left(-\frac{1}{\sigma^2}\underbrace{\left(b+\frac{1}{2}\sum\limits_{i=1}^n(x_i-\theta)^2\right)}_{b'}\right) \end{aligned}\]

1.3 Gibbs sampler principle

In the Bayesian framework, Gibbs’ algorithm produces a realization of the parameter $\theta=(\theta_1,\ldots,\theta_m)$ according to the a posteriori law $\Pi(\theta \ | \ x)$ as soon as we are able to express the conditional laws: $\Pi(\theta_i | \theta_1,\dots,\theta_{i-1},\theta_{i+1},\ldots,\theta_m, x)$ for $i =1,\ldots,m$.

Gibbs sampling consists of:

Initialization : arbitrary choice of $\theta^{(0)}= (\theta_1^{(0)},\dots,\theta_m^{(0)})$.
Iteration $t$ : Genererate $\theta^{(t)}$ as follows :
- $\theta_1^{(t)} \sim \Pi\left(\theta_1 \ | \theta_2^{(t-1)},\dots, \theta_m^{(t-1)}, x \right)$
- $\theta_2^{(t)} \sim \Pi\left((\theta_2 \ | \ (\theta_1^{(t)}, \theta_3^{(t-1)},\ldots,\theta_m^{(t-1)},x\right)$
- $\theta_m^{(t)} \sim \Pi\left(\theta_m \ | \ \theta_1^{(t)}, \ldots, \theta_{m-1}^{(t)},x\right)$

Successive iterations of this algorithm generate the states of a Markov chain $\{\theta^{(t)}, t > 0\}$ to values in $\mathbb{R}^{m}$, we show that this chain admits an invariant measure which is the a posteriori law.

For a sufficiently large number of iterations, the vector $\theta$ obtained can thus be considered as a realization of the joint a posteriori law $\Pi(\theta \ | \ x)$.

Consequently, the implementation of a Gibbs sampler requires the knowledge of the a posteriori distributions of each of the parameters conditionally to the other parameters of the model, which can be deduced from the conjugated priors formulas in the case of the probit model but are not explicitly expressible in the case where a logit or log link function is used.

1.4 Gibbs sampler using conjuate priors

The algorithm used in jSDM_binomial_probit() function to estimate the parameters of the probit model is therefore as follows:

Define the constants $N_{Gibbs}$, $N_{burn}$, $N_{thin}$ such that $N_{Gibbs}$ corresponds to the number of iterations performed by the Gibbs sampler, $N_{burn}$ to the number of iterations required for burn-in or warm-up time and $N_{samp} = \dfrac{N_{Gibbs}-N_{burn}}{N_{thin}}$ to the number of estimated values retained for each parameter. Indeed, the estimated parameters are recorded at certain iterations, in order to obtain a sample of $N_{samp}$ values distributed according to the $a \ posteriori$ distribution for each of the parameters.

Initialize all parameters to $0$ for example, except the diagonal values of $\Lambda$ initialized at $1$ and $V_{\alpha}^{(0)}=1$.

Gibbs sampler: at each iteration $t$ for $t=1,\ldots,N_{Gibbs}$ we repeat each of these steps :
- Generate the latent variable $Z^{(t)}=\left(Z_{ij}^{(t)}\right)_{i=1,\ldots,I}^{j=1,\ldots,J}$ such that \[Z_{ij}^{(t)} \sim \begin{cases} \mathcal{N}\left(\alpha_i^{(t−1)} + X_i\beta_j^{(t−1)} + W_i^{(t−1)}\lambda_j^{(t−1)}, \ 1 \right) \text{ right truncated by } 0 & \text{ if } y_{ij } =0 \\ \mathcal{N}\left(\alpha_i^{(t−1)} + X_i\beta_j^{(t−1)} + W_i^{(t−1)}\lambda_j^{(t−1)}, \ 1 \right) \text{ left truncated by } 0 & \text{ if } y_{ij} =1 \end{cases}\] , the latent variable is thus initialized at the first iteration by generating it according to these centered normal laws.
- If species traits data are provided, generate the effects of species-specific traits on species’ responses $\gamma^{(t)}=\left(\gamma_{rk}^{(t)}\right)^{r=0,\ldots,nt}_{k=0,\ldots,p}$ such as : \[\gamma_{rk}^{(t)} \ | \beta_{1k}^{(t-1)}, \ldots, \beta_{Jk}^{(t-1)} \sim \mathcal{N}(m^\star,V^\star) \text{, with }\] \[m^\star = (V_{\gamma_{rk}}^{-1} + T_r'T_r)^{-1}(V_{\gamma_{rk}}^{-1}\mu_{\gamma_{rk}} + T_r\left(\beta_k^{(t-1)} - \sum\limits_{r' \neq r} T_{r'} \gamma_{r'k}^{(t-1)} \right) \text{ and } V^\star = \left(V_{\gamma_{rk}}^{-1}+ T_r'T_r\right)^{-1}.\]
- Generate the fixed species effects $\beta_j^{(t)}=(\beta_{j0}^{(t)},\beta_{j1}^{(t)}, \ldots, \beta_{jp}^{(t)})'$ for $j=1,\ldots,J$ such as : \[\beta_j^{(t)} \ | \ Z^{(t)}, W_1^{(t-1)}, \alpha_1^{(t-1)}, \ldots, W_I^{(t−1)}, \alpha_I^{(t-1), ,\lambda_{j1}^{(t-1)},\ldots, \lambda_{jq}^{(t-1)}} \sim \mathcal{N}_{p+1}(m^\star,V^\star) \text{, with }\] \[m^\star = (V_{\beta}^{-1} + X'X)^{-1}(V_{\beta}^{-1}\mu_{\beta_j} + X'Z^\star_j) \text{ and } V^\star = \left(V_{\beta}^{-1}+ X'X\right)^{-1},\] \[\text{ where } Z_j^\star =(Z_{1j}^\star,\ldots,Z_{Ij}^\star)' \text{ such as } Z^\star_{ij} = Z_{ij}^{(t)} - W_i^{(t−1)}\lambda_j^{(t−1)} - \alpha_i^{(t-1)}.\]
- Generate the the loading factors related to latent variables $\lambda_j^{(t)}=(\lambda_{j1}^{(t)},\ldots, \lambda_{jq}^{(t)})'$ for $j=1,\ldots,J$ such as : \[\lambda_{jl}^{(t)} \ | \ Z^{(t)}, \beta_j^{(t)}, \alpha^{(t-1)}, W^{(t-1)}, \lambda_1^{(t-1)}, \ldots, \lambda_{l-1}^{(t−1)},\lambda_{l+1}^{(t−1)},\ldots, \lambda_q^{(t-1)} \sim \mathcal{N}(m^\star,V^\star) \text{, with }\] \[m^\star = (V_{\lambda}^{-1} + {W_l^{(t-1)}}'W_l^{(t-1)})^{-1}(V_{\lambda}^{-1}\mu_{\lambda} + {W_l^{(t-1)}}'Z^\star_j) \text{ and } V^\star = \left(V_{\lambda}^{-1}+{W_l^{(t-1)}}'W_l^{(t-1)}\right)^{-1},\] \[\text{ where } Z_j^\star =(Z_{1j}^\star,\ldots,Z_{Ij}^\star)' \text{ such as } Z^\star_{ij} = Z_{ij}^{(t)}-X_i\beta_j^{(t)}-\alpha_i^{(t-1)}-\sum\limits_{l'\neq l}W_{il'}\lambda_{jl'}.\] In order to constrain the diagonal values of $\Lambda =\left(\lambda_{jl}\right)_{j=1,\ldots,J}^{l=1,\ldots,q}$ to positive values and make the matrix lower triangular, the values of $\lambda_j^{(t)}$ are simulated according to the following conditions: \[ \lambda_{jl}^{(t)} \sim \begin{cases} P \text{ such as } \mathbb{P}(\lambda_{jl} = 0)=1 & \text{ if } l>j, \\ \mathcal{N}(m^\star,V^\star) \text{ left truncated by } 0 & \text{ if } l=j, \\ \mathcal{N}(m^\star,V^\star) & \text{ if } l<j. \end{cases}\]
- Generate the latent variables (or unmeasured predictors) $W_i^{(t)}$ for $i=1,\ldots,I$ according to : \[W_i^{(t)} \ | \ Z^{(t)}, \lambda^{(t)}, \beta^{(t)}, \alpha_i^{(t-1)} \sim \mathcal{N}_{q} \left((I_q + {\Lambda^{(t)}}'\Lambda^{(t)})^{-1}({\Lambda^{(t)}}'Z_i^{\star}),(I_q + {\Lambda^{(t)}}'\Lambda^{(t)})^{-1}\right),\] \[\text{ where } Z_i^{\star} =(Z_{i1}^{\star},\ldots,Z_{iJ}^{\star}) \text{ such as } Z_{ij}^{\star} = Z_{ij}^{(t)}-\alpha_i^{(t-1)} - X_i\beta_j^{(t)}.\]
- Generate the random site effects $\alpha_i^{(t)}$ for $i=1,\ldots,I$ selon : \[ \alpha_i | \ Z^{(t)}, \lambda^{(t)}, \beta^{(t)}, W_i^{(t)} \sim \mathcal{N}\left(\dfrac{ \sum_{j=1}^J Z_{ij}^{(t)} - X_i\beta_j^{(t)} - W_i^{(t)}\lambda_j^{(t)}}{{V_{\alpha}^{(t-1)}}^{-1} + J} , \left( \frac{1}{V_{\alpha}^{(t-1)}}+ J \right)^{-1} \right)\]
- Generate the variance of random site effects $V_\alpha^{(t)}$ according to: \[V_\alpha^{(t)} \ | \ \alpha_1^{(t)},\ldots,\alpha_I^{(t)} \sim \mathcal{IG}\left( \text{shape}=0.5 + \frac{I}{2}, \text{rate}=0.005 + \frac{1}{2}\sum\limits_{i=1}^I \left(\alpha_i^{(t)}\right)^2\right)\]

2 Binomial distribution with logit link function

2.1 Model definition

In the same way as for the probit model, the logit model can be defined by means of a latent variable: $Z_{ij}= \alpha_i + X_i\beta_j + W_i\lambda_j + \epsilon_{ij}$ for $i=1,\ldots,I$ et $j=1,\ldots,J$, with $\epsilon_{ij} \sim \mathrm{logistique}(0,1)$ iid and such as: \[y_{ij}= \begin{cases} 1 & \text{ if } Z_{ij} > 0 \\ 0 & \text{ else } \end{cases}\] However in this case the a priori distributions of the latent variable and the parameters are not conjugated, we are not able to use the properties of the conjugated priors, so modelling using a latent variable is irrelevant.
In this case it is assumed that \[y_{ij} \ | \theta_{ij} \sim \mathcal{B}inomial(n_i,\theta_{ij})\], with $\mathrm{probit(\theta_{ij})} = \alpha_i + X_i\beta_j+ W_i\lambda_j$ and $n_i$ the number of visits to the site $i$.
Therefore, the parameters of this model will be sampled by estimating their conditional a posteriori distributions using an adaptive Metropolis algorithm.

2.2 Priors used

An a priori distribution is determined for each parameter of the model :
\[\begin{array}{lll} V_{\alpha} & \sim & \mathcal {IG}(\text{shape}=0.5, \text{rate}=0.005) \text{ with } \mathrm{rate}=\frac{1}{\mathrm{scale}}, \\ \beta_{jk} & \sim & \begin{cases} \mathcal{N}(\mu_{\beta_{jk}},V_{\beta_{k}}) \text{ for } j=1,\ldots,J \text{ and } k=0,\ldots,p, & \text{if species traits data are provided} \\ \text{ where } \mu_{\beta_{jk}} = \sum_{r=0}^{nt} t_{jr}.\gamma_{rk} \text{ and } \gamma_{rk} \sim \mathcal{N}(\mu_{\gamma_{rk}},V_{\gamma_{rk}}) & \\ \text{ for } r=0,\ldots,nt \text{ and } k=0,\ldots,p. & \\ \mathcal{N}(\mu_{\beta_{k}},V_{\beta_{k}}) \text{ for } j=1,\ldots,J \text{ and } k=0,\ldots,p, & \text{if species traits data are not provided} \\ \end{cases} \\ \lambda_{jl} & \sim & \begin{cases} \mathcal{N}(\mu_{\lambda_{l}},V_{\lambda_{l}}) & \text{if } l < j \\ \mathcal{N}(\mu_{\lambda_{l}},V_{\lambda_{l}}) \text{ left truncated by } 0 & \text{if } l=j \\ P \text{ such as } \mathbb{P}(\lambda_{jl} = 0)=1 & \text{if } l>j \end{cases} \\ \quad & \quad & \text{ for } j=1,\ldots,J \text{ and } l=1,\ldots,q. \end{array}\]

2.3 Adaptive Metropolis algorithm principle

This algorithm belongs to the MCMC methods and allows to obtain a realization of the parameter $\theta=(\theta_1,\ldots,,\theta_m)$ according to their conditional a posteriori distributions $\Pi(\theta_i | \theta_1,\dots,\theta_{i-1},\theta_{i+1},\ldots,\theta_m, x)$, for $i =1,\ldots,m$ known to within a multiplicative constant.
It is called adaptive because the variance of the conditional instrumental density used is adapted according to the number of acceptances in the last iterations.

Initialization : $\theta^{(0)}= (\theta_1^{(0)},\ldots,\theta_m^{(0)})$ arbitrarily set, the acceptance numbers $(n^A_{i})_{i=1,\ldots,m}$ are initialized at $0$ and the variances $(\sigma^2_i)_{i=1,\ldots,m}$ are initialized at $1$.
Iteration t : for $i=1,\ldots,m$
- Generate $\theta_i^\star \sim q(\theta_i^{(t-1)},.)$, with conditional instrumental density $q(\theta_i^{(t-1)},\theta_i^\star)$ symmetric, we will choose a law $\mathcal{N}(\theta_i^{(t-1)},{\sigma^2_{i}})$ for example.
- Calculate the probability of acceptance : \[\gamma= min\left(1,\dfrac{\Pi\left(\theta_i^\star \ | \ \theta_1^{(t-1)},\dots,\theta_{i-1}^{(t-1)},\theta_{i+1}^{(t-1)},\ldots,\theta_m^{(t-1)}, x \right)}{\Pi\left(\theta_i^{(t-1)} \ | \ \theta_1^{(t-1)},\dots,\theta_{i-1}^{ (t-1)},\theta_{i+1}^{(t-1)},\ldots,\theta_m^{(t-1)},x\right)}\right)\].
- \[\theta_i^{(t)} = \begin{cases} \theta_i^\star & \text{ with probability } \gamma \\ &\text{ if we are in this case the acceptance number becomes : } n^A_{i} \leftarrow n^A_{i} +1 \\ \theta_i^{(t-1)} & \text{ with probability } 1-\gamma. \\ \end{cases}\]
During the burn-in, every $\mathrm{DIV}$ iteration, with \[\mathrm{DIV} = \begin{cases} 100 & \text{ if } N_{Gibbs} \geq 1000 \\ \dfrac{N_{Gibbs}}{10}& \text{ else } \\ \end{cases}\] , where $N_{Gibbs}$ is the total number of iterations performed.
The variances are modified as a function of the acceptance numbers as follows for $i=1,\ldots,m$ :
- The acceptance rate is calculated : $r^A_{i} = \dfrac{ n^A_i}{\mathrm{DIV}}$.
- The variances are adapted according to the acceptance rate and a fixed constant $R_{opt}$ : \[\sigma_i \leftarrow \begin{cases} \sigma_i\left(2-\dfrac{1-r^A_i}{1-R_{opt}}\right) & \text{ if } r^A_{i} \geq R_{opt} \\ \\ \dfrac{\sigma_i}{2-\dfrac{1-r^A_i}{1-R_{opt}}} & \text{ else } \end{cases}\]
- We reset the acceptance numbers : $n^A_i \leftarrow 0$.
Every $\dfrac{N_{Gibbs}}{10}$ iteration, average acceptance rates are calculated and displayed $m^A = \dfrac{1}{m}\sum\limits_{i=1,\ldots,m}r^A_i$.

2.4 Gibbs sampler using adaptative Metropolis algorithm

An adaptive Metropolis algorithm is used to sample the model parameters according to their conditional a posteriori distributions estimated to within one multiplicative constant.

First we define the $f$ function that calculates the likelihood of the model as a function of the estimated parameters:
\[ f : \lambda_j,\beta_j,\alpha_i, W_i, X_i, y_{ij},n_i \rightarrow f(\lambda_j,\beta_j,\alpha_i, W_i, X_i, y_{ij},n_i)=\mathrm{L}(\theta_{ij})\] - Compute $\mathrm{logit}(\theta_{ij})= \alpha_i + X_i\beta_j + W_i\lambda_j$.

Compute $\theta_{ij}= \dfrac{1}{1+\exp\left(-\mathrm{logit}(\theta_{ij})\right)}$.
Return $\mathrm{L}(\theta_{ij})= p(y_{ij} \ | \ \theta_{ij},n_i)= \dbinom{n_i}{y_{ij}}(\theta_{ij})^{y_{ij}}(1-\theta_{ij})^{n_i-y_{ij}}$.

We repeat those steps for $i=1,\ldots,I$ et $j=1,\ldots,J$, and then we define $\theta = \left(\theta{ij}\right)_{i=1,\ldots I}^{j= 1,\ldots,J}$.
This allows us to calculate the likelihood of the model: $\mathrm{L}(\theta)= \prod\limits_{\substack{1\leq i\leq I \\ 1 \leq j\leq I}}\mathrm{L}(\theta_{ij})$.

According to Bayes’ formula we have \[\mathrm{p}(\theta \ | \ Y) \propto \Pi(\theta) \mathrm{L}(\theta).\] We thus use the following relations to approach the conditional a posteriori densities of each of the parameters with $\Pi(.)$ the densities corresponding to their a priori laws. \[\begin{aligned} & p(\beta_{jk} \ | \ \beta_{j0},\beta_{j1},\ldots,\beta_{jk-1},\beta_{jk+1},\ldots,\beta_{jp}, \lambda_j,\alpha_1,\ldots,\alpha_I, W_1,\ldots,W_I,Y) \propto \Pi(\beta_{jk})\prod\limits_{1\leq i\leq I} \mathrm{L}(\theta_{ij})\\ &p(\lambda_{jl} \ | \ \lambda_{j1},\ldots,\lambda_{jl-1},\lambda_{jl+1},\ldots,\lambda_{jq}, \beta_j,\alpha_1,\ldots,\alpha_I, W_1,\ldots,W_I,Y) \propto \Pi(\lambda_{jl}) \prod\limits_{1\leq i \leq I}\mathrm{L}(\theta_{ij})\\ &p(W_{il} \ | \ W_{i1},\ldots,W_{il-1},W_{il+1},\ldots,W_{iq},\alpha_i,\beta_1,\ldots,\beta_J,\lambda_1,\ldots, \lambda_J,Y) \propto \Pi(W_{il}) \prod\limits_{1\leq j\leq J}\mathrm{L}(\theta_{ij})\\ &p(\alpha_i \ | \ W_i,\beta_1,\ldots,\beta_J,\lambda_1,\ldots, \lambda_j,V_{\alpha},Y) \propto \Pi(\alpha_i \ | \ V_{\alpha}) \prod\limits_{1\leq j\leq J}\mathrm{L}(\theta_{ij})\\ & \text{, for $i=1,\ldots,I$, $j=1,\ldots,J$, $k=1,\ldots,p$ and $l=1,\ldots,q$. } \end{aligned}\]

The algorithm implemented in jSDM_binomial_logit() on the basis of Rosenthal (2009) and Roberts & Rosenthal (2001) articles, to estimate the parameters of the logit model is the following :

Definition of constants $N_{Gibbs}$, $N_{burn}$, $N_{thin}$ and $R_{opt}$ such that $N_{Gibbs}$ corresponds to the number of iterations performed by the algorithm, $N_{burn}$ to the number of iterations required for the burn-in or warm-up time,
$N_{samp}= \dfrac{N_{Gibbs}-N_{burn}}{N_{thin}}$ corresponding to the number of estimated values retained for each parameter. Indeed we record the estimated parameters at certain iterations in order to obtain $N_{samp}$ values, allowing us to represent a $a \ posteriori$ distribution for each parameter.
We set $R_{opt}$ the optimal acceptance ratio used in the adaptive Metropolis algorithms implemented for each parameter of the model.
Initialize all parameters to $0$ for example, except the diagonal values of $\Lambda$ initialized at $1$ and $V_{\alpha}^{(0)}=1$. The acceptance number of each parameter is initialized to $0$ and the variances of their conditional instrumental densities take the value $1$.
Gibbs sampler at each iteration $t$ for $t=1,\ldots,N_{Gibbs}$ we repeat each of these steps:
- Generate the random site effects $\alpha_i^{(t)}$ for $i=1,\ldots,I$ according to an adaptive Metropolis algorithm that simulates $\alpha_i^\star \sim \mathcal{N}(\alpha_i^{(t-1)},\sigma_{\alpha_i}^2)$ and then calculates the acceptance rate as follows:
  \[\gamma =min\left(1, \ \dfrac{\Pi\left(\alpha_i^\star \ | \ V_{\alpha}^{(t-1)}\right)\prod\limits_{1\leq j\leq J}\left(\alpha_i^\star, W_i^{(t-1)},\beta_j^{(t-1)}, \lambda_j^{(t-1)}, X_i,y_{ij},n_i\right)}{\Pi\left(\alpha_i^{(t-1)} \ | \ V_{\alpha}^{(t-1)}\right)\prod\limits_{1\leq j\leq J}f\left(\alpha_i^{(t-1)}, W_i^{(t-1)},\beta_j^{(t-1)}, \lambda_j^{(t-1)}, X_i,y_{ij},n_i\right)}\right).\]
- Generate the variance of random site effects $V_\alpha^{(t)}$ according to: \[V_\alpha^{(t)} \ | \ \alpha_1^{(t)},\ldots,\alpha_I^{(t)} \sim \mathcal{IG}\left( \text{shape}=0.5 + \frac{I}{2}, \text{rate}=0.005 + \frac{1}{2}\sum\limits_{i=1}^I \left(\alpha_i^{(t)}\right)^2\right)\]
- Generate the latent variables (or unmeasured predictors) $W_{il}^{(t)}$ for $i=1,\ldots,I$ and $l=1,\ldots,q$ according to an adaptive Metropolis algorithm that simulates $W_{il}^\star \sim \mathcal{N}(W_{il}^{(t-1)}, \sigma_{W_{il}}^2)$and then calculates the acceptance rate as follows:

\[\gamma = min\left(1,\ \dfrac{\Pi\left(W_{il}^\star\right)\prod\limits_{1\leq j\leq J}f\left(W_{il}^\star, \alpha_i^{(t)},\beta_j^{(t-1)}, \lambda_j^{(t-1)},X_i,y_{ij},n_i\right)} {\Pi\left(W_{il}^{(t-1)}\right)\prod\limits_{1\leq j\leq J}f\left(W_{il}^{(t-1)}, \alpha_i^{(t)},\beta_j^{(t-1)}, \lambda_j^{(t-1)}, X_i,y_{ij},n_i\right)}\right).\] * If species traits data are provided, generate the effects of species-specific traits on species’ responses $\gamma^{(t)}=\left(\gamma_{rk}^{(t)}\right)^{r=0,\ldots,nt}_{k=0,\ldots,p}$ such as : \[\gamma_{rk}^{(t)} \ | \beta_{1k}^{(t-1)}, \ldots, \beta_{Jk}^{(t-1)} \sim \mathcal{N}(m^\star,V^\star) \text{, with }\] \[m^\star = (V_{\gamma_{rk}}^{-1} + T_r'T_r)^{-1}(V_{\gamma_{rk}}^{-1}\mu_{\gamma_{rk}} + T_r\left(\beta_k^{(t-1)} - \sum\limits_{r' \neq r} T_{r'} \gamma_{r'k}^{(t-1)} \right) \text{ and } V^\star = \left(V_{\gamma_{rk}}^{-1}+ T_r'T_r\right)^{-1}.\]

Generate the fixed species effects $\beta_{jk}^{(t)}$ for $j=1,\ldots,J$ and $k=0,\ldots,p$ using an adaptive Metropolis algorithm that simulates $\beta_{jk}^\star \sim \mathcal{N}(\beta_{jk}^{(t-1)}, \sigma_{\beta_{jk}}^2)$ and then calculates the acceptance rate as follows:

\[\gamma = min\left(1,\dfrac{\Pi\left(\beta_{jk}^\star\right)\prod\limits_{1\leq i\leq I}f\left(\beta_{j0}^{(t)},\small{\ldots},\beta_{jk-1}^{(t)},\beta_{jk}^\star,\beta_{jk+1}^{(t-1)},\small{\ldots}, \beta_{jp}^{(t-1)},\lambda_j^{(t-1)}, \alpha_1^{(t)},W_1^{(t)},\small{\ldots},\alpha_I^{(t)}, W_I^{(t)},X_i,y_{ij},n_i\right)} {\Pi\left(\beta_{jk}^{(t-1)}\right)\prod\limits_{1\leq i\leq I}f\left(\beta_{j0}^{(t)},\small{\ldots},\beta_{jk-1}^{(t)},\beta_{jk}^{(t-1)},\beta_{jk+1}^{(t-1)},\small{\ldots}, \beta_{jp}^{(t-1)},\lambda_j^{(t-1)}, \alpha_1^{(t)},W_1^{(t)}, \small{\ldots},\alpha_I^{(t)}, W_I^{(t)},X_i,y_{ij},n_i\right)}\right).\]

Generate the loading factors related to latent variables $\lambda_{jl}^{(t)}$ for $j=1,\ldots,J$ and $l=1,\ldots,q$ according to an adaptive Metropolis algorithm for $l \leq j$, simulating $\lambda_{jl}^\star \sim \mathcal{N}(\lambda_{jl}^{(t-1)},\sigma_{\lambda_{jl}}^2)$ and then calculating the acceptance rate as follows: : \[\gamma = min\left(1,\dfrac{\Pi\left(\lambda_{jl}^\star\right)\prod\limits_{1\leq i\leq I}f\left(\lambda_{j1}^{(t)},\small{\ldots},\lambda_{jl-1}^{(t)},\lambda_{jl}^\star,\lambda_{jl+1}^{(t-1)},\small{\ldots}, \lambda_{jq}^{(t-1)},\beta_j^{(t)}, \alpha_1^{(t)},W_1^{(t)},\small{\ldots},\alpha_I^{(t)}, W_I^{(t)},X_i,y_{ij},n_i\right)} {\Pi\left(\lambda_{jl}^{(t-1)}\right)\prod\limits_{1\leq i\leq I}f\left(\lambda_{j1}^{(t)},\small{\ldots},\lambda_{jl-1}^{(t)},\lambda_{jl}^{(t-1)},\lambda_{jl+1}^{(t-1)},\small{\ldots}, \lambda_{jq}^{(t-1)},\beta_j^{(t)}, \alpha_1^{(t)},W_1^{(t)},\small{\ldots},\alpha_I^{(t)}, W_I^{(t)},X_i,y_{ij},n_i\right)}\right).\] In the case of $l>j$, we put $\lambda_{jl}^{(t)} = 0$.

3 Poisson distribution with log link function

3.1 Model definition

According to the article Hui (2016), we can use the Poisson distribution for the analysis of multivariate abundance data, with estimation performed using Bayesian Markov chain Monte Carlo methods.

In this case, it is assumed that \[y_{ij} \sim \mathcal{P}oisson(\theta_{ij})\], with $\mathrm{log}(\theta_{ij}) = \alpha_i + X_i\beta_j+ W_i\lambda_j$.

We therefore consider abundance data with a response variable noted : $Y=(y_{ij})^{i=1,\ldots,nsite}_{j=1,\ldots,nsp}$ such as :

\[y_{ij}=\begin{cases} 0 & \text{if species $j$ has been observed as absent at site $i$}\\ n & \text{if $n$ individuals of the species $j$ have been observed at the site $i$}. \end{cases}\]

3.2 Gibbs sampler using adaptative Metropolis algorithm

In this case, we cannot use the properties of the conjugate priors, therefore, the parameters of this model will be sampled by estimating their conditional a posteriori distributions using an adaptive Metropolis algorithm in the Gibbs sampler, in the same way as for the logit model.

We use the same algorithm as before by replacing the logit link function by a log link function and the binomial distribution by a poisson’s law to calculate the likelihood of the model in the function jSDM_poisson_log().

References

Albert, J.H. & Siddhartha, C. (1993) Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679.

Hui, F.K.C. (2016) Boral – Bayesian Ordination and Regression Analysis of Multivariate Abundance Data in r. Methods in Ecology and Evolution, 7, 744–750.

Roberts, G.O. & Rosenthal, J.S. (2001) Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16, 351–367.

Rosenthal, S. (2009) Optimal Proposal Distributions and Adaptive MCMC. Handbook of Markov Chain Monte Carlo.