This paper by Emily Fox and François Caron has been on arXiv for a while, but a fortnight ago it was read at an ordinary meeting of the RSS. “Sparse graphs using exchangeable random measures” (J. R. Stat. Soc. Series B, 2017) enables simulation and Bayesian analysis of graphs with hundreds of thousands of nodes and over a million edges. This represents a major breakthrough for computationally tractable inference on substantial datasets. My thoughts on the paper and some preliminary experimental results are below.

The scalability of this approach is due to the representation of the discrete graph structure using a latent continuous model, the generalised gamma process (GGP). The sociability parameters $w_i$ of each node can be estimated using Hamiltonian Monte Carlo (HMC), since the gradient of the conditional log-posterior is available in closed form:

$\left[ \nabla_{\omega_{1:N_\alpha}} \log\{p(\omega_{1:N_\alpha}, w_* \mid D_\alpha )\} \right]_i = m_i - \sigma - w_i\left(\tau + 2\sum_{j=1}^{N_\alpha}w_j + 2w_* \right)$

where $\omega_i = \log\{w_i\}$; $N_\alpha$ is the number of nodes with degree $\ge 1$; $w_*$ is the leftover sociability of the zero-degree nodes; $D_\alpha$ is the degree matrix; $m_i = \sum_{j=1}^{N_\alpha} n_{i, j} + n_{j,i}$ is the degree of node $i$; and $\alpha, \sigma, \tau$ are the hyperparameters of the GGP.

Full details of the algorithm are provided in Appendix F of the paper and the supplementary material contains a reference implementation. Further improvements in scalability might be achieved by taking advantage of parallelism, as mentioned by the authors. The source code is written in pure MATLAB, without any use of .mex files (compiled C code). It only depends on the Stats toolbox for generating random variables. It would be interesting to see how Stan would perform on this model, or alternatively a C++ implementation using Rcpp. For an idea of what is possible, check out the R package BradleyTerryScalable by Ella Kaye and David Firth.

The pair potentials of the GGP are symmetric:

$\Pr(z_{i, j} = 1 \mid w_i, w_j) = \Pr(n_{i, j} + n_{j,i} > 0 \mid w_i, w_j) = 1 - \exp\{-2 w_i w_j\}$

where $i \ne j$. This implies a very different type of generative process than the adversarial graphs that are typical of Bradley-Terry or Plackett-Luce models, where nodes compete with each other for edges. Instead, the in-degree and out-degree are assumed to be independent Poisson counts.

To illustrate the difference between these two approaches, I computed $w_i$ for each node using the GGP as well as the PageRank algorithm (Brin & Page, 1998). Famously, this is the original ranking algorithm that was used by Google. It is a spectral method, since the ranks correspond to the principal eigenvector of the degree matrix $D_\alpha$, where the nodes are websites and the edges are hyperlinks. Alternatively, PageRank can be derived from the stationary distribution of a Markov chain that describes random walks through the World-Wide Web.

Symmetrised adjacency matrix $Z_\alpha$ for the WWW graph

PageRank has proven to be highly successful for datasets such as the WWW graph from the University of Notre Dame (Albert, Jeong & Barabási, 1999). This graph has 325,729 nodes, far larger than was previously feasible for computation with MCMC. The matrix $D_\alpha$ has 1,497,134 directed edges, while the symmetric adjacency matrix $Z_\alpha$ has 1,090,108 $\times 2$ nonzero entries. The GGP can be used for directed multigraphs, but this makes little difference due to the symmetric pair potentials shown above.

MCMC trace plot for $\sigma$, indicating 20k iterations discarded as burn-in

I ran the MATLAB code for 100k iterations with 2 parallel chains, which took 4 hours per chain on my iMac (3.5GHz Intel Core i7 with 16GB RAM). The chains appeared to converge after 20k iterations, as shown in the traceplot for one of the GGP parameters, $\sigma$. The 99% posterior credible interval was [0.368; 0.375], which differs significantly from the estimate in Table 2 of the paper (pg. 29). This might be due to mixing and convergence issues with a smaller number of iterations (40k, discarding 20k as burn-in). Memory usage climbed to over 30GB, which required a significant amount of swapping to disk. It would definitely be worthwhile running this code for longer on a Linux cluster with more available RAM, to see whether the MCMC has in fact converged as it appears from the traceplots. The estimated runtime of 132min that was reported in the paper seems to be quite optimistic for this dataset. This might also explain why the authors did not consider any graphs larger than this one.

95% posterior predictive interval for the degree distribution

PageRank can be computed using the centrality(..) function in MATLAB, or using the igraph package in R. The 3D histogram below shows that the relationship between $\log(w_i)$ and log(PR) is highly nonlinear. Many nodes are assigned a PageRank less than 6e-6 (log(PR) < -12), whereas $w_i$ is more spread out between 4.5e-5 and 2e-9.

log-log 3D histogram of PageRank against the sociability parameter $w_i$

It would be interesting to see whether the Bradley-Terry model (or equivalently Plackett-Luce with m=2) would show a similar relationship. A nonparametric Plackett-Luce model was proposed in an earlier paper (Caron, Teh & Murphy, 2014). Since this is similarly based on a marked Poisson process with intensity defined by a gamma process, it would also be interesting to compare the scalability between the two models. MATLAB code for nonparametric Plackett-Luce is also available from François Caron’s homepage.

From → MCMC, Network Data

Ella Kaye on Ella Kaye

Computational Bayesian statistics

Bayes' Food Cake

A bit of statistics, a bit of cakes.

RWeekly.org - Blogs to Learn R from the Community

Computational Bayesian statistics

Richard Everitt's blog

Computational Bayesian statistics

Let's Look at the Figures

David Firth's blog

Nicholas Tierney

Computational Bayesian statistics

Musings, useful code etc. on R and data science

Another Astrostatistics Blog

The random musings of a reformed astronomer ...

Darren Wilkinson's research blog

Statistics, computing, data science, Bayes, stochastic modelling, systems biology and bioinformatics

Computational Bayesian statistics

Igor Kromin

Computational Bayesian statistics

Statisfaction

I can't get no

Xi'an's Og

an attempt at bloggin, nothing more...

Sam Clifford

Postdoctoral Fellow, Bayesian Statistics, Aerosol Science