## 1 Model definition

Referring to the models used in the articles Warton et al. (2015) and Albert & Siddhartha (1993), we define the following model :

$\mathrm{probit}(\theta_{ij}) =\alpha_i + \beta_{0j}+X_i.\beta_j+ W_i.\lambda_j$

• Link function probit: $$\mathrm{probit}: q \rightarrow \Phi^{-1}(q)$$ where $$\Phi$$ correspond to the repartition function of the reduced centred normal distribution.

• Response variable: $$Y=(y_{ij})^{i=1,\ldots,nsite}_{j=1,\ldots,nsp}$$ with:

$y_{ij}=\begin{cases} 0 & \text{ if species j is absent on the site i}\\ 1 & \text{ if species j is present on the site i}. \end{cases}$

• Latent variable $$z_{ij} = \alpha_i + \beta_{0j} + X_i.\beta_j + W_i.\lambda_j + \epsilon_{i,j}$$, with $$\forall (i,j) \ \epsilon_{ij} \sim \mathcal{N}(0,1)$$ and such that:

$y_{ij}=\begin{cases} 1 & \text{if} \ z_{ij} > 0 \\ 0 & \text{otherwise.} \end{cases}$

It can be easily shown that: $$y_{ij} \sim \mathcal{B}ernoulli(\theta_{ij})$$.

• Latent variables: $$W_i=(W_i^1,\ldots,W_i^q)$$ where $$q$$ is the number of latent variables considered, which has to be fixed by the user (by default $$q=2$$). We assume that $$W_i \sim \mathcal{N}(0,I_q)$$ and we define the associated coefficients: $$\lambda_j=(\lambda_j^1,\ldots, \lambda_j^q)'$$. We use a prior distribution $$\mathcal{N}(0,10)$$ for all lambdas not concerned by constraints to $$0$$ on upper diagonal and to strictly positive values on diagonal.

• Explanatory variables: bioclimatic data about each site. $$X=(X_i)_{i=1,\ldots,nsite}$$ with $$X_i=(x_i^1,\ldots,x_i^p)\in \mathbb{R}^p$$ where $$p$$ is the number of bioclimatic variables considered. The corresponding regression coefficients for each species $$j$$ are noted : $$\beta_j=(\beta_j^1,\ldots,\beta_j^p)'$$.

• $$\beta_{0j}$$ correspond to the intercept for species $$j$$ which is assumed to be a fixed effect. We use a prior distribution $$\mathcal{N}(0,10)$$ for all betas.

• $$\alpha_i$$ represents the random effect of site $$i$$ such as $$\alpha_i \sim \mathcal{N}(0,V_{\alpha})$$ and we assume that $$V_{\alpha} \sim \mathcal {IG}(\text{shape}=0.1, \text{rate}=0.1)$$ as prior distribution.

## 2 Occurrence data-set

This data-set is available in the jSDM-package R package. It can be loaded with the data() command. The frogs dataset is in “wide” format: each line is a site and the occurrence data (from Species_1 to Species_9) are in columns. A site is characterized by its x-y geographical coordinates, one discrete covariate and two other continuous covariates.

library(jSDM)
#> ##
#> ## jSDM R package
#> ## For joint species distribution models
#> ## https://ecology.ghislainv.fr/jSDM
#> ##
# frogs data
data(frogs, package="jSDM")
#>   Covariate_1 Covariate_2 Covariate_3 Species_1 Species_2 Species_3 Species_4
#> 1    3.870111           0    0.045334         1         0         0         0
#> 2    3.326950           1    0.115903         0         0         0         0
#> 3    2.856729           1    0.147034         0         0         0         0
#> 4    1.623249           1    0.124283         0         0         0         0
#> 5    4.629685           1    0.081655         0         0         0         0
#> 6    0.698970           1    0.107048         0         0         0         0
#>   Species_5 Species_6 Species_7 Species_8 Species_9        y        x
#> 1         0         0         0         0         0 66.41479 9.256424
#> 2         0         1         0         0         0 67.03841 9.025588
#> 3         0         1         0         0         0 67.03855 9.029416
#> 4         0         1         0         0         0 67.04200 9.029745
#> 5         0         1         0         0         0 67.04439 9.026514
#> 6         0         0         0         0         0 67.03894 9.023580

We rearrange the data in two data-sets: a first one for the presence-absence observations for each species (columns) at each site (rows), and a second one for the site characteristics.

We also normalize the continuous explanatory variables to facilitate MCMC convergence.

# data.obs
PA_frogs <- frogs[,4:12]

# Normalized continuous variables
Env_frogs <- cbind(scale(frogs[,1]),frogs[,2],scale(frogs[,3]))
colnames(Env_frogs) <- colnames(frogs[,1:3])

## 3 Parameter inference

We use the jSDM_binomial_probit() function to fit the JSDM (increase the number of iterations to achieve convergence).

mod_frogs_jSDM_probit <- jSDM_binomial_probit(
# Chains
burnin=1000, mcmc=1000, thin=1,
# Response variable
presence_data = PA_frogs,
# Explanatory variables
site_formula = ~.,
site_data = Env_frogs,
# Model specification
n_latent=2, site_effect="random",
# Starting values
alpha_start=0, beta_start=0,
lambda_start=0, W_start=0,
V_alpha=1,
# Priors
shape_Valpha=0.1,
rate_Valpha=0.1,
mu_beta=0, V_beta=1,
mu_lambda=0, V_lambda=1,
# Various
seed=1234, verbose=1)
#>
#> Running the Gibbs sampler. It may be long, please keep cool :)
#>
#> **********:10.0%
#> **********:20.0%
#> **********:30.0%
#> **********:40.0%
#> **********:50.0%
#> **********:60.0%
#> **********:70.0%
#> **********:80.0%
#> **********:90.0%
#> **********:100.0%

## 4 Analysis of the results

np <- nrow(mod_frogs_jSDM_probit$model_spec$beta_start)

## beta_j of the first two species
par(mfrow=c(2,2))
for (j in 1:2) {
for (p in 1:np) {
coda::traceplot(coda::as.mcmc(mod_frogs_jSDM_probit$mcmc.sp[[j]][,p])) coda::densplot(coda::as.mcmc(mod_frogs_jSDM_probit$mcmc.sp[[j]][,p]),
main = paste(colnames(mod_frogs_jSDM_probit$mcmc.sp[[j]])[p], ", species : ",j)) } }  ## lambda_j of the first two species n_latent <- mod_frogs_jSDM_probit$model_spec$n_latent par(mfrow=c(2,2)) for (j in 1:2) { for (l in 1:n_latent) { coda::traceplot(coda::as.mcmc(mod_frogs_jSDM_probit$mcmc.sp[[j]][,np+l]))
coda::densplot(coda::as.mcmc(mod_frogs_jSDM_probit$mcmc.sp[[j]][,np+l]), main = paste(colnames(mod_frogs_jSDM_probit$mcmc.sp[[j]])
[np+l],", species : ",j))
}
}


## Latent variables W_i for the first two sites
par(mfrow=c(2,2))
for (l in 1:n_latent) {
for (i in 1:2) {
coda::traceplot(mod_frogs_jSDM_probit$mcmc.latent[[paste0("lv_",l)]][,i], main = paste0("Latent variable W_", l, ", site ", i)) coda::densplot(mod_frogs_jSDM_probit$mcmc.latent[[paste0("lv_",l)]][,i],
main = paste0("Latent variable W_", l, ", site ", i))
}
}


## alpha_i of the first two sites
plot(coda::as.mcmc(mod_frogs_jSDM_probit$mcmc.alpha[,1:2]))  ## V_alpha par(mfrow=c(2,2)) coda::traceplot(mod_frogs_jSDM_probit$mcmc.V_alpha)
coda::densplot(mod_frogs_jSDM_probit$mcmc.V_alpha) ## Deviance coda::traceplot(mod_frogs_jSDM_probit$mcmc.Deviance)
coda::densplot(mod_frogs_jSDM_probit$mcmc.Deviance)  ## probit_theta par (mfrow=c(2,1)) hist(mod_frogs_jSDM_probit$probit_theta_latent,
main = "Predicted probit theta", xlab ="predicted probit theta")
hist(mod_frogs_jSDM_probit$theta_latent, main = "Predicted theta", xlab ="predicted theta") ## 5 Matrice of correlations After fitting the jSDM with latent variables, the full species residual correlation matrix $$R=(R_{ij})^{i=1,\ldots, nspecies}_{j=1,\ldots, nspecies}$$ can be derived from the covariance in the latent variables such as : $\Sigma_{ij} = \lambda_i^T .\lambda_j$, then we compute correlations from covariances : $R_{i,j} = \frac{\Sigma_{ij}}{\sqrt{\Sigma _{ii}\Sigma _{jj}}}$. We use the plot_residual_cor() function to compute and display the residual correlation matrix : plot_residual_cor(mod_frogs_jSDM_probit) ## 6 Predictions We use the predict.jSDM() S3 method on the mod_frogs_jSDM_probit object of class jSDM to compute the mean (or expectation) of the posterior distributions obtained and get the expected values of model’s parameters. # Sites and species concerned by predictions : ## 50 sites among the 104 Id_sites <- sample.int(nrow(PA_frogs), 50) ## All species Id_species <- colnames(PA_frogs) # Simulate new observations of covariates on those sites simdata <- matrix(nrow=50, ncol = ncol(mod_frogs_jSDM_probit$model_spec$site_data)) colnames(simdata) <- colnames(mod_frogs_jSDM_probit$model_spec$site_data) rownames(simdata) <- Id_sites simdata <- as.data.frame(simdata) simdata$Covariate_1 <- rnorm(50)
simdata$Covariate_3 <- rnorm(50) simdata$Covariate_2 <- rbinom(50,1,0.5)

# Predictions
theta_pred <- predict(mod_frogs_jSDM_probit, newdata=simdata, Id_species=Id_species,
Id_sites=Id_sites, type="mean")
hist(theta_pred, main="Predicted theta with simulated data", xlab="predicted theta")

## References

Albert, J.H. & Siddhartha, C. (1993) Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679.
Warton, D.I., Blanchet, F.G., O’Hara, R.B., Ovaskainen, O., Taskinen, S., Walker, S.C. & Hui, F.K.C. (2015) So many variables: Joint modeling in community ecology. Trends in Ecology & Evolution, 30, 766–779.
Wilkinson, D.P., Golding, N., Guillera-Arroita, G., Tingley, R. & McCarthy, M.A. (2019) A comparison of joint species distribution models for presence-absence data. Methods in Ecology and Evolution, 10, 198–211.