# Scalable nonparametric inference for random graphs

This paper by Emily Fox and François Caron has been on arXiv for a while, but a fortnight ago it was read at an ordinary meeting of the RSS. “Sparse graphs using exchangeable random measures” (J. R. Stat. Soc. Series B, 2017) enables simulation and Bayesian analysis of graphs with hundreds of thousands of nodes and over a million edges. This represents a major breakthrough for computationally tractable inference on substantial datasets. My thoughts on the paper and some preliminary experimental results are below.

The scalability of this approach is due to the representation of the discrete graph structure using a latent continuous model, the generalised gamma process (GGP). The sociability parameter $w_i$ of each node can be estimated using Hamiltonian Monte Carlo (HMC), since the gradient of the conditional log-posterior is available in closed form:

$$
\frac{\partial}{\partial \log w_i} \log p(w_1, \dots, w_N, w_* \mid D, \alpha, \sigma, \tau)
= m_i - \sigma - w_i \Big( \tau + 2 w_* + 2 \sum_{j=1}^{N} w_j \Big)
$$

where $N$ is the number of nodes with degree at least one; $w_*$ is the leftover sociability of the zero-degree nodes; $D$ is the degree matrix; $m_i$ is the degree of node $i$; and $(\alpha, \sigma, \tau)$ are the hyperparameters of the GGP.
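For intuition, here is a Python/NumPy sketch of that closed-form gradient (my own transcription, with `m`, `w_star`, `sigma` and `tau` following the notation above; the toy numbers are arbitrary):

```python
import numpy as np

def grad_log_post(log_w, m, w_star, sigma, tau):
    """Gradient of the conditional log-posterior w.r.t. log w_i.

    m[i] is the degree of node i; w_star is the leftover sociability
    of the zero-degree nodes; sigma and tau are GGP hyperparameters.
    """
    w = np.exp(log_w)
    return m - sigma - w * (tau + 2.0 * w_star + 2.0 * w.sum())

# Toy example: 3 nodes with degrees 2, 1, 1.
m = np.array([2.0, 1.0, 1.0])
g = grad_log_post(np.log([0.5, 0.2, 0.1]), m, w_star=0.3, sigma=0.2, tau=1.0)
print(g)  # [0.2, 0.16, 0.48]
```

Working in $\log w_i$ keeps the sociabilities positive without any constrained-sampling machinery, which is what makes plain HMC applicable here.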

Full details of the algorithm are provided in Appendix F of the paper and the supplementary material contains a reference implementation. Further improvements in scalability might be achieved by taking advantage of parallelism, as mentioned by the authors. The source code is written in pure MATLAB, without any use of .mex files (compiled C code). It only depends on the Stats toolbox for generating random variables. It would be interesting to see how Stan would perform on this model, or alternatively a C++ implementation using Rcpp. For an idea of what is possible, check out the R package BradleyTerryScalable by Ella Kaye and David Firth.

The pair potentials of the GGP are symmetric:

$$
\lambda_{ij} = 2 w_i w_j = \lambda_{ji}
$$

where $\lambda_{ij}$ is the expected number of edges between nodes $i$ and $j$. This implies a very different type of generative process from the adversarial graphs that are typical of Bradley-Terry or Plackett-Luce models, where nodes compete with each other for edges. Instead, the in-degree and out-degree are assumed to be independent Poisson counts.

To illustrate the difference between these two approaches, I computed the sociability $w_i$ for each node using the GGP, as well as the PageRank algorithm (Brin & Page, 1998). Famously, this is the original ranking algorithm that was used by Google. It is a spectral method: the ranks correspond to the principal eigenvector of a normalised link matrix, where the nodes are websites and the edges are hyperlinks. Alternatively, PageRank can be derived from the stationary distribution of a Markov chain that describes random walks through the World-Wide Web.
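The random-walk formulation lends itself to a simple power-iteration sketch (toy Python on a hypothetical 3-page web; real implementations use sparse matrices and a convergence check rather than a fixed iteration count):

```python
import numpy as np

def pagerank(A, damping=0.85, n_iter=100):
    """PageRank by power iteration on a dense adjacency matrix A,
    where A[i, j] = 1 if page i links to page j."""
    n = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    # Row-normalise; dangling nodes (no outlinks) jump uniformly at random.
    P = np.where(out_deg > 0, A / np.maximum(out_deg, 1), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        r = damping * r @ P + (1 - damping) / n
    return r

# Tiny 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 (a cycle).
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(pagerank(A))  # uniform: each page gets 1/3
```

The damping term is what makes the chain irreducible, so the stationary distribution (and hence the ranking) is unique.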

PageRank has proven to be highly successful for datasets such as the WWW graph from the University of Notre Dame (Albert, Jeong & Barabási, 1999). This graph has 325,729 nodes, far larger than was previously feasible for computation with MCMC. It has 1,497,134 directed edges, while the symmetric adjacency matrix has 1,090,108 nonzero entries. The GGP can be used for directed multigraphs, but this makes little difference here due to the symmetric pair potentials shown above.
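The gap between the directed edge count and the number of nonzeros in the symmetrised adjacency matrix comes down to reciprocal links collapsing into a single undirected edge; a toy sketch (dense NumPy purely for clarity; a graph of this size needs sparse storage):

```python
import numpy as np

# Toy directed edge list (source, target); the real dataset has
# 1,497,134 such pairs over 325,729 nodes.
edges = np.array([[0, 1], [1, 0], [1, 2], [3, 2]])
n = 4

A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] += 1  # directed multigraph: one count per directed edge

# An undirected edge exists if either direction does; note that the
# reciprocal pair (0, 1) / (1, 0) becomes two symmetric nonzeros.
S = ((A + A.T) > 0).astype(int)

print(A.sum(), S.sum())  # 4 directed edges -> 6 symmetric nonzeros
```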

I ran the MATLAB code for 100k iterations with 2 parallel chains, which took 4 hours per chain on my iMac (3.5GHz Intel Core i7 with 16GB RAM). The chains appeared to converge after 20k iterations, as shown in the traceplot for one of the GGP parameters, . The 99% posterior credible interval was [0.368; 0.375], which differs significantly from the estimate in Table 2 of the paper (pg. 29). This might be due to mixing and convergence issues with a smaller number of iterations (40k, discarding 20k as burn-in). Memory usage climbed to over 30GB, which required a significant amount of swapping to disk. It would definitely be worthwhile running this code for longer on a Linux cluster with more available RAM, to see whether the MCMC has in fact converged as it appears from the traceplots. The estimated runtime of 132min that was reported in the paper seems to be quite optimistic for this dataset. This might also explain why the authors did not consider any graphs larger than this one.
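For reference, the interval computation itself is just a percentile of the pooled post-burn-in draws; a sketch with synthetic draws standing in for the actual sampler output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for two chains of 100k draws of a scalar GGP parameter;
# synthetic normal draws here, not real MCMC output.
chains = rng.normal(loc=0.37, scale=0.002, size=(2, 100_000))

burn_in = 20_000
pooled = chains[:, burn_in:].ravel()

lo, hi = np.percentile(pooled, [0.5, 99.5])  # central 99% credible interval
print(f"99% CI: [{lo:.3f}, {hi:.3f}]")
```

In practice one would also check between-chain agreement (e.g. the Gelman-Rubin statistic) before pooling, since a narrow interval from unconverged chains is exactly the failure mode suspected above.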

PageRank can be computed using the `centrality()` function in MATLAB, or using the igraph package in R. The 3D histogram below shows that the relationship between $\log(w_i)$ and $\log(\mathrm{PR}_i)$ is highly nonlinear. Many nodes are assigned a PageRank less than 6e-6 ($\log(\mathrm{PR}) < -12$), whereas $w_i$ is more spread out, between 2e-9 and 4.5e-5.
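The comparison itself is a joint histogram on log scales; a sketch with synthetic heavy-tailed scores standing in for the real per-node estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the two per-node scores (heavy-tailed),
# not the actual GGP posterior means or Notre Dame PageRanks.
w = rng.pareto(1.5, size=10_000) * 1e-7
pr = rng.pareto(1.2, size=10_000) * 1e-6

# Joint histogram on log scales: the 2D analogue of the 3D bar plot.
H, xedges, yedges = np.histogram2d(np.log(w), np.log(pr), bins=30)
print(H.shape, H.sum())
```

Binning on the log scale is essential here: both scores span several orders of magnitude, so a linear-scale histogram would collapse nearly all nodes into a single bin.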

It would be interesting to see whether the Bradley-Terry model (or equivalently Plackett-Luce with m=2) would show a similar relationship. A nonparametric Plackett-Luce model was proposed in an earlier paper (Caron, Teh & Murphy, 2014). Since this is similarly based on a marked Poisson process with intensity defined by a gamma process, it would also be interesting to compare the scalability between the two models. MATLAB code for nonparametric Plackett-Luce is also available from François Caron’s homepage.
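The equivalence is simple to verify: under Plackett-Luce the probability that an item is ranked first is proportional to its score, which for $m = 2$ items reduces exactly to the Bradley-Terry win probability. A sketch:

```python
import numpy as np

def plackett_luce_first(scores, winner):
    """Probability that `winner` is ranked first among all items."""
    return scores[winner] / scores.sum()

# With only m = 2 items, this is exactly Bradley-Terry:
# P(i beats j) = w_i / (w_i + w_j).
w = np.array([2.0, 1.0])
print(plackett_luce_first(w, 0))  # = 2 / (2 + 1)
```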