The first surprise was that the Xcode command-line tools were no longer working:

$ g++ --version
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun

These needed to be reinstalled from scratch:

xcode-select --install

The GNU toolchain was still working, since it was correctly installed in /usr/local/bin, but it was now giving a worrying warning:

gfortran: warning: couldn’t understand kern.osversion ‘16.4.0’

I decided it was better to reinstall these as well. This gave me version 6.3.0 of GCC, g++ and gfortran.

R still ran from inside RStudio, but it was no longer on the PATH. I would advise reinstalling R from scratch, so that the R executable will be located in /usr/local/bin where it belongs, rather than wherever the macOS upgrade scripts have relocated it to! I upgraded to the latest RStudio as well, for good measure.

Likewise with LaTeX, /Library/TeX/texbin was hosed. I could no longer compile any TeX documents, even from inside TeXShop or RStudio. Happily, this did not require a reinstall, just a modification of the PATH and the TeXShop preferences to point to the new location. RStudio ignores any .profile or .bashrc in your home directory, so you need to edit the system-wide PATH:

sudo vi /etc/paths

As for MATLAB, the news is not good:

If you are running MATLAB R2013b or earlier a patch for these releases is not available. Please update to MATLAB R2014a or later to use MATLAB on macOS Sierra.

Java apps (such as JabRef) gave an interesting error message:

To open "JabRef" you need to install the legacy Java SE 6 runtime.

…which is yet another reason to abandon Java! But alas, JabRef remains the best reference manager for BibTeX files. Certainly, the one that comes with MacTeX is pretty awful. If only they had used Qt, like RStudio does.


Two brilliant slides from Philip Dawid responding to Hennig & Gelman pic.twitter.com/UXaD7CY00X

— Robert Grant (@robertstats) 12 April 2017

Statistics is an essential element of modern science, and has been for quite some time. As such, statistical procedures should be evaluated with regard to the philosophy of science. Towards this goal, the authors propose seven statistical virtues that could serve as a guide for authors (and reviewers) of scientific papers. The chief of these is transparency: thorough documentation of the choices, assumptions and limitations of the analysis. These choices need to be justified within the context of the scientific study. Given the ‘no free lunch’ theorems (Wolpert, 1996), such contextual dependence is a necessary property of any useful method.

The authors argue that “subjective” and “objective” are ambiguous terms that harm statistical discourse. No methodology has an exclusive claim to objectivity, since even null hypothesis significance testing (NHST) involves choice of the sampling distribution, as well as the infamous α=0.05. The use of default priors, as in Objective Bayes, requires ignoring any available information about the parameters of interest. This can conflict with other goals, such as identifiability and regularisation. The seven virtues are intended to be universal and can apply irrespective of whether the chosen methodology is frequentist or Bayesian. Indeed, the authors advocate a methodology that combines features of both.

There have been many other attempts to reconcile frequentist and Bayesian approaches to produce a grand unified theory of statistics. The main feature of the methodology in Section 5.5 is iterative refinement of the model (including priors and tuning parameters) to better fit the observed data. Rather than Bayesian updating or model choice, the suggested procedure involves graphical summaries of model fit (Gelman et al. 2013). This has connections with well-calibrated Bayes (Dawid 1982) and hypothetico-deductive Bayes (Gelman & Shalizi, 2013). I think that this is a good approach, albeit saddled with an unfortunate misnomer.

The term “falsificationist” might be slightly less clumsy than “hypothetico-deductive,” but nevertheless seems misleading. Leaving aside the question of whether statistical hypotheses are falsifiable at all, except in the limit of infinite data, falsification in the Popperian sense is really not the goal. This would imply abandoning an inadequate model and starting again from scratch. As stated by Gelman (2007),

“…the purpose of model checking (as we see it) is not to reject a model but rather to understand the ways in which it does not fit the data.”

Furthermore, this approach is not limited to posterior predictive distributions. It could be applied to any generative model, not necessarily a Bayesian one. Thus, falsificationist Bayesianism as presented in this paper is neither falsificationist nor Bayesian, but it is an excellent approach nevertheless.

For other viewpoints, check out the blog posts by Xian and Nick Horton, as well as this interview on YouTube.


The C++ standard library in the OpenCSW version of GCC 5.2.0 is incompatible with GCC on other platforms (such as Windows, Linux and Mac OS). This causes errors like the following:

smcPotts.cpp: In function ‘Rcpp::IntegerVector resample_resid(Rcpp::NumericVector&, arma::vec&, Rcpp::NumericMatrix&)’:
smcPotts.cpp:219:48: error: call of overloaded ‘log(const unsigned int&)’ is ambiguous
     int tW = (int)trunc(exp(log_wt(i) + log(n)));
                                                ^
In file included from /usr/include/math.h:15:0,
                 from /opt/csw/include/c++/5.2.0/cmath:44,
                 from /home/ripley/R/Lib32/Rcpp/include/Rcpp/platform/compiler.h:100,
                 from /home/ripley/R/Lib32/Rcpp/include/Rcpp/r/headers.h:48,
                 from /home/ripley/R/Lib32/Rcpp/include/RcppCommon.h:29,
                 from /home/ripley/R/Lib32/RcppArmadillo/include/RcppArmadilloForward.h:26,
                 from /home/ripley/R/Lib32/RcppArmadillo/include/RcppArmadillo.h:31,
                 from smcPotts.h:4,
                 from smcPotts.cpp:20:
/opt/csw/lib/gcc/sparc-sun-solaris2.10/5.2.0/include-fixed/iso/math_iso.h:200:21: note: candidate: long double std::log(long double)
  inline long double log(long double __X) { return __logl(__X); }
                     ^
/opt/csw/lib/gcc/sparc-sun-solaris2.10/5.2.0/include-fixed/iso/math_iso.h:168:15: note: candidate: float std::log(float)
  inline float log(float __X) { return __logf(__X); }
               ^
/opt/csw/lib/gcc/sparc-sun-solaris2.10/5.2.0/include-fixed/iso/math_iso.h:68:15: note: candidate: double std::log(double)
 extern double log __P((double));
*** Error code 1
make: Fatal error: Command failed for target `smcPotts.o'
Current working directory /tmp/RtmpPtailS/R.INSTALL5859270b1fe0/bayesImageS/src
ERROR: compilation failed for package ‘bayesImageS’

The Writing R Extensions manual describes this as an “overloading ambiguity,” in the section on Portable C and C++ code. Given that this code compiles fine with GNU g++, LLVM clang++ (Xcode) and icpc (Intel Parallel Studio XE) on Windows, Mac OS and Linux, somehow I don’t think the portability problem is at my end!

The fix is quite straightforward, albeit painstaking. It involves searching through the source code for every use of math.h functions log(), exp(), pow(), etc. for implicit casts from int/uint/long to float/double and making them explicit to resolve the “ambiguity.” For example, the fix for the code above was as follows:

< int tW = (int)trunc(exp(log_wt(i) + log(n)));
---
> int tW = (int)trunc(exp(log_wt(i) + log((double)n)));

Refer to checkin 4d1a7ec on BitBucket for details. I was pleased to see that the Solaris build on CRAN is now working:

using R Under development (unstable) (2017-03-21 r72378)
using platform: sparc-sun-solaris2.10 (32-bit)

I’m glad that I don’t have any issues with 32- vs. 64-bit builds; that really *would* be annoying! If compile errors in OpenCSW on SPARC Solaris are considered vital issues (i.e. your R package is scheduled for archival unless you fix them), then I hope that services like R-hub and Travis CI will eventually provide support for building on this platform. It’s been more than 15 years since I last installed Solaris myself. Back then, I was paid to care about customers with obscure taste in operating systems. In the UK, I don’t have a spare computer that I would want to sacrifice to such a pointless task.

When I came to submit this fix to CRAN, I saw that my build status had an additional NOTE that wasn’t there when the package was first submitted:

Found no calls to: ‘R_registerRoutines’, ‘R_useDynamicSymbols’

It is good practice to register native routines and to disable symbol
search. See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.

Although this has only recently been added to the CRAN package checks, it has been a feature of R for a very long time. Duncan Temple Lang explained the technical details in an article for *R News* back in 2001 (Vol. 1/3, pp. 20-23). The idea is that you register every C, C++ or Fortran function explicitly with R, so that it doesn’t have to search the dynamic symbol table every time you make a call. This is also safer, because R checks to make sure that .C() isn’t used for functions that require .Call() and vice-versa. All of the relevant code is in bayesImageS_init.cpp, which I created in checkin 60d8fa5. Note that this isn’t an issue for anyone who uses Rcpp attributes, since the generated RcppExports.cpp handles all of this for you (since Rcpp version 0.10.0). This gives me a further reason to rewrite this package to use Rcpp attributes, since it will be easier to maintain in the long run.

I also had a lot of headaches with the LaTeX setup on my new laptop. Fortunately I could still refer to my old iMac to figure out what the problem was. One issue that I never managed to fix was that knitr wouldn’t convert EPS files to PDF. In the end, I just included the *-eps-converted-to.pdf files in the R package. This is also something that I’ll probably need to revisit in future, or it will keep coming back to bite me on the arse.


M. Moores, A. N. Pettitt & K. Mengersen (2015) Scalable Bayesian inference for the inverse temperature of a hidden Potts model. arXiv:1503.08066 [stat.CO]

M. Moores, C. C. Drovandi, K. Mengersen & C. P. Robert (2015) Pre-processing for approximate Bayesian computation in image analysis. *Statistics & Computing* **25**(1): 23-33.

M. Falk, C. Alston, C. McGrory, S. Clifford, E. Heron, D. Leonte, M. Moores, C. Walsh, A.N. Pettitt & K. Mengersen (2015) Recent Bayesian approaches for spatial analysis of 2-D images with application to environmental modelling. *Envir. Ecol. Stat.* **22**(3): 571-600.

M. Moores & K. Mengersen (2014) Bayesian approaches to spatial inference: modelling and computational challenges and solutions. In *Proc. 33rd MaxEnt*, AIP Conf. Proc. **1636**: 112-117.


Note that this approach is not the same as the Runge-Kutta (RK45) and backward-differentiation formula (BDF) methods for solving ODEs in Stan. In the case of the dugongs data (and the other examples that I consider), the times are not strictly ordered (i.e. there are multiple observations at a single timepoint). This results in the following error message from the integrate_ode function:

[1] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"
[2] "Exception thrown at line 32: integrate_ode_rk45: times is not a valid ordered vector. The element at 3 is 1.5, but should be greater than the previous element, 1.5"

The model from Carlin & Gelfand (1991) is as follows:

Which is the solution to the following ODE:

We don’t need to solve the equation numerically, since the solution is already available in closed form. Our interest is primarily in parameter estimation, particularly for ill-posed problems when the parameters are not well-identified.
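In the parameterisation used by the nls() call below (a reconstruction from the code, not necessarily the paper’s exact notation), the growth curve and its corresponding ODE can be written as:

```latex
y_i = \alpha - \beta \lambda^{x_i} + \varepsilon_i,
\qquad \alpha, \beta > 0, \quad 0 < \lambda < 1,
```

which satisfies

```latex
\frac{dy}{dx} = -\log(\lambda)\,(\alpha - y),
```

since $\alpha - y = \beta \lambda^{x}$ and $\frac{d}{dx}\lambda^x = \lambda^x \log\lambda$.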

Base R includes the nls() function, which stands for nonlinear least squares. By default, it uses the Gauss-Newton algorithm to search for parameter values that fit the observed data. In this case, repeated observations at the same timepoint are not an issue:

dat <- list("N" = 27,
            "x" = c(1, 1.5, 1.5, 1.5, 2.5, 4, 5, 5, 7, 8, 8.5, 9, 9.5, 9.5, 10,
                    12, 12, 13, 13, 14.5, 15.5, 15.5, 16.5, 17, 22.5, 29, 31.5),
            "Y" = c(1.8, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.47, 2.19,
                    2.26, 2.4, 2.39, 2.41, 2.5, 2.32, 2.32, 2.43, 2.47, 2.56,
                    2.65, 2.47, 2.64, 2.56, 2.7, 2.72, 2.57))
plot(dat$x, dat$Y, xlab="Age", ylab="Length")
nlm <- nls(Y ~ alpha - beta * lambda^x, data=dat,
           start=list(alpha=1, beta=1, lambda=0.9))
summary(nlm)
nlm_fn <- predict(nlm, newdata=dat$x)
lines(dat$x, nlm_fn, col=6, lty=2)

Formula: Y ~ alpha - beta * lambda^x

Parameters:
       Estimate Std. Error t value Pr(>|t|)    
alpha   2.65807    0.06151   43.21  < 2e-16 ***
beta    0.96352    0.06968   13.83  6.3e-13 ***
lambda  0.87146    0.02460   35.42  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09525 on 24 degrees of freedom

Number of iterations to convergence: 6 
Achieved convergence tolerance: 3.574e-06

If you compare the posterior modes from Carlin & Gelfand (1991), the results are almost identical:

> exp(0.975)
[1] 2.651167
> exp(-0.014)
[1] 0.9860975
> inv.logit(1.902)
[1] 0.8701177

The high posterior density (HPD) regions are asymmetric, due to the reparameterisation of the model.
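Judging from the back-transformations above (exp and inv.logit), the reparameterisation appears to be:

```latex
\alpha = e^{\theta_1}, \qquad
\beta = e^{\theta_2}, \qquad
\lambda = \operatorname{logit}^{-1}(\theta_3) = \frac{1}{1 + e^{-\theta_3}},
```

which enforces $\alpha, \beta > 0$ and $0 < \lambda < 1$ while allowing unconstrained sampling of $\theta_1, \theta_2, \theta_3$.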

Now let’s look at a slightly more complex problem:

where r is a rate parameter and L is the limit (horizontal asymptote). The solution to this ODE for an initial value is:
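In the notation of the R function ft() below (a reconstruction from the code, since r, L, t_C and y_0 match its arguments), this is the logistic growth model:

```latex
\frac{dy}{dt} = r\, y \left(1 - \frac{y}{L}\right),
\qquad
y(t) = \frac{y_0\, L\, e^{r(t - t_C)}}{L + y_0\left(e^{r(t - t_C)} - 1\right)},
```

with initial condition $y(t_C) = y_0$.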

Using these equations, we can simulate data from a heteroskedastic, truncated normal distribution where the variance is equal to the gradient of the mean:

library(truncnorm)
t <- seq(-4, 4, by=0.05)
ft <- function(t, r, y0, tC, L) {
  y0*L*exp(r*(t-tC))/(L + y0*(exp(r*(t-tC)) - 1))
}
plot(t, ft(t,2,4,0,12), type='l', ylim=c(0,12), col=4, ylab='Y')
abline(v=0, lty=3)
abline(h=12, lty=3, col=2)
abline(h=0, lty=3)
dfdt <- function(t, r, tC, L) {
  r*L*exp(r*(t+tC))/((exp(r*tC) + exp(r*t))^2)
}
simt <- rnorm(25, sd=1.5)
simy <- matrix(nrow=length(simt), ncol=20)
for (i in 1:length(simt)) {
  simy[i,] <- rtruncnorm(ncol(simy), a=0, b=12, mean=ft(simt[i],2,4,0,12),
                         sd=sqrt(dfdt(simt[i],2,0,12)))
  points(rep(simt[i], ncol(simy)), simy[i,], pch='*')
}

The Gauss-Newton method underestimates the true parameter value (r = 2) by several standard errors:

dat <- list(Y=as.vector(simy), t=rep(simt,ncol(simy)), y0=4, maxY=12, tcrit=0)
nlm <- nls(Y ~ y0*maxY*exp(r*(t-tcrit))/(maxY + y0*(exp(r*(t-tcrit)) - 1)),
           data=dat, start=list(r=1))
summary(nlm)
nlm_fn <- predict(nlm, newdata=list(t=t))
lines(t, nlm_fn, col=6, lty=2, lwd=2)

Formula: Y ~ y0 * maxY * exp(r * (t - tcrit))/(maxY + y0 * (exp(r * (t - tcrit)) - 1))

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
r  1.76053    0.04219   41.73   <2e-16 ***

Although we were unable to recover the true parameter value using nls(), the approximation (magenta curve) is nevertheless quite close to the true function (in blue). Given that there are 500 observations at 25 distinct timepoints, this suggests that the data are not very informative about the rate parameter. The problem is exacerbated when we introduce additional parameters with the aim of obtaining a more flexible function. Next time, we will look at analysing these data using Stan.

Bates & Watts (1988) “Nonlinear Regression Analysis and Its Applications.” Wiley-Interscience

Carlin & Gelfand (1991) “An iterative Monte Carlo method for nonconjugate Bayesian analysis.” *Stat. Comp.* **1**(2), 119-128.

Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li & Riddell (2017) “Stan: A Probabilistic Programming Language.” *J. Stat. Soft.* **76**(1)

Ratkowsky (1983) “Nonlinear Regression Modeling: A Unified Practical Approach.” Marcel Dekker

Richards (1959) “A Flexible Growth Function for Empirical Use.” *J. Experimental Botany* **10** (2): 290–300.

Thomas, Best, Lunn & Spiegelhalter (2012) “The BUGS Book: A Practical Introduction to Bayesian Analysis.” Chapman & Hall/CRC Press


The most significant change is the increased security features that were introduced in El Capitan, in an attempt to make OS X less vulnerable to rootkits and other script kiddie exploits. Upgrading caused a lot of headaches for users of R and LaTeX, since folders were magically moved to different locations on the hard drive. This is one of the reasons that I chose to skip Yosemite and El Capitan (OK, mostly it was sheer laziness…)

I use the Outlook email client because my work email is hosted on an Exchange server and I’ve found Mac mail very flaky for that. I also have co-authors that use Excel and Word, so having Office installed makes it easier to collaborate with them. Wherever possible, I use R Markdown with pandoc to generate Word documents programmatically, which saves me a lot of time in the long run (reproducible research FTW!). Other commercial software that I have installed includes MATLAB, Mathematica, Dropbox, and Skype.

I was very pleased that I was able to install the Xcode command-line tools and the GNU compilers without any problems. I can switch between clang++ and g++ by editing my ~/.R/Makeconf as described here. I also use XQuartz for X11 support. Of course, I have R and RStudio installed, using vecLib instead of the default R BLAS library. The instructions for switching BLAS libraries on OS X in Sect. 10.5 of the installation guide are out of date. On OS X Mavericks or later, type the following commands in Terminal:

cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib

I have limited space on my solid state drive, so I decided to install the minimal BasicTeX with TeXShop for editing and compiling LaTeX documents. Overall, I think this is a better approach. I wouldn’t install all 10,000 packages from CRAN just in case I might need them, so I see no reason why CTAN should be any different. So far, I’ve had to install the following additional packages from CTAN using tlmgr:

- preprint
- algorithms
- subfigure
- bbm-macros
- latexmk
- multirow
- morefloats
- framed
- titling
- inconsolata
- chicago
- courier
- helvetic

Even on Sierra, you need to run tlmgr as root, using:

sudo /usr/local/texlive/2016basic/bin/x86_64-darwin/tlmgr install algorithms

As with MS Word, you can generate LaTeX programmatically using knitr, but you still need to have pdfLaTeX installed to compile the PDF. I still use JabRef to manage all of my BibTeX files, since I haven’t found anything that does the job better. The installer for JabRef isn’t signed, so Gatekeeper will complain when you try to install it.

Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/disk1     465Gi   54Gi  411Gi      12%  /
devfs          182Ki  182Ki    0Bi     100%  /dev
map -hosts       0Bi    0Bi    0Bi     100%  /net
map auto_home    0Bi    0Bi    0Bi     100%  /home

No major changes from my previous list. I’ve dropped Stata, since the R package foreign does a good enough job of importing .dta files. Instead, I’ve got Mathematica installed to help out with my poor algebra skills. I did consider getting a 15-inch MacBook Pro instead of the Air, which could have had up to 16GB of RAM and a 2TB solid-state drive. The Thunderbolt 3 (USB-C) ports and touch bar would have been annoying, but the real deal breaker was Radeon instead of NVIDIA graphics. If I couldn’t use CUDA, then it just wasn’t worth the extra expense (plus, I’m not sure I’d trust myself with such an expensive laptop!). I’ve got a GeForce GTX 970M in my Windows machine if I need to run any CUDA code. ASUS laptops offer a lot more bang for your buck, but Mac OS is still nicer to use (particularly for developing software and writing academic papers). I think it’s well worth the extra money, if you can afford it.


`3006`

of the meta-data. Shortly after I wrote that original post, Reid F. Thompson made his R package

The example DICOM-RT image used in this code is available from here. The code runs fine on Windows, but in order to run it on OS X you will need to install an X-Windows server such as XQuartz. I’m currently running Mac OS 10.9.5 (**Mavericks**) with XQuartz 2.7.11. You will require administrator privileges to install this on your Mac. If you get an error message about `/opt/X11/lib/libGLU.1.dylib`, that will be why.

library(rgl)
library(RadOnc)

`## Loading required package: geometry`

`## Loading required package: magic`

`## Loading required package: abind`

`## Loading required package: oro.dicom`

`## ## oro.dicom: Rigorous - DICOM Input / Output (version = 0.5.0)`

`## Loading required package: ptinpoly`

`## Loading required package: misc3d`

rtstruct <- read.DICOM.RT(path="~/Downloads/ITK4SampleDCMRT/", modality="MR", verbose=TRUE, DVH=FALSE)

## Reading 61 DICOM files from path: '~/Downloads/ITK4SampleDCMRT/' ... FINISHED
## Extracting MR data ... FINISHED [60 slices, 1.5x1.5x3.0mm res]

## Warning in read.DICOM.RT(path = "~/Downloads/ITK4SampleDCMRT/", verbose =
## TRUE, : Unable to extract DVH data from DICOM-RT (no dose grid available)

`## Reading structure set from file: '/Users/stsrjs/Downloads/ITK4SampleDCMRT//RS.1.2.246.352.71.4.886768594.5257.20090622110825.dcm' ... (7 structures identified)`

## Warning in read.DICOM.RT(path = "~/Downloads/ITK4SampleDCMRT/", verbose =
## TRUE, : Structure(s) 'GS1', 'GS2', 'GS3' are empty

## FINISHED
## Processing (7) structures:
## JP: GS1 [EMPTY] ... FINISHED
## JP: MRI_RECTUM_RT [32 axial slice(s), 5784 point(s)] ... FINISHED
## JP: MRI_BONEOUTER_RT [182 axial slice(s), 77052 point(s)] ... FINISHED
## JP: GS2 [EMPTY] ... FINISHED
## JP: GS3 [EMPTY] ... FINISHED
## JP: MRI_BLADDER_RT [23 axial slice(s), 11058 point(s)] ... FINISHED
## JP: MRI_PROSTATE_RO [12 axial slice(s), 3024 point(s)] ... FINISHED

print(rtstruct)

`## [1] "RT data '~/Downloads/ITK4SampleDCMRT/' containing CT image (256x256x60), 7 structure(s)"`

print(rtstruct$structures)

`## [1] "List containing 7 structure3D objects (GS1 JP, MRI_RECTUM_RT JP, MRI_BONEOUTER_RT JP, GS2 JP, GS3 JP, MRI_BLADDER_RT JP, MRI_PROSTATE_RO JP)"`

plot(rtstruct$structures[[3]])

The interactive 3D visualisation using **rgl** is very nice, but it would also be helpful to be able to scroll through the axial slices one at a time. For example, see my earlier post about **misc3d**:

library(tkrplot)

`## Loading required package: tcltk`

slices3d(rtstruct$CT)

`## <environment: 0x7fa20396f898>`

The `slices3d` function does support overlays, so this seems like it would be fairly straightforward to do.

The following code demonstrates how to calculate the Hausdorff distance between the bladder and prostate and plot a visualisation of the contours:

compareStructures(rtstruct$structures[c(6,7)], method="hausdorff", hausdorff.method="mean")

`## Analyzing structure 1/2 (MRI_BLADDER_RT JP) ...`

## FINISHED
## Analyzing structure 2/2 (MRI_PROSTATE_RO JP) ... FINISHED

##                    MRI_BLADDER_RT JP MRI_PROSTATE_RO JP
## MRI_BLADDER_RT JP            0.00000           29.46258
## MRI_PROSTATE_RO JP          29.46258            0.00000

Previously, the best way to quantify volumetric differences would have been to use something like the SlicerRT extension for 3D Slicer. However, you then still needed to import that quantification into R for statistical modelling. Being able to do everything in R makes it much easier to follow best practices for reproducible research (such as this blog post).

It is also possible to plot the contours, slice by slice:

compareStructures(rtstruct$structures[c(6,7)], method="axial", plot=TRUE)

Admittedly, this is quite an artificial example, since the usual use case would be to compare two contours of the same patient. For example, to quantify volumetric changes over time. The **RadOnc** package has an excellent vignette that goes into more detail than what I’ve done here. There is also the 2014 journal article by Reid Thompson (details below). My original code is useful if you want to study the DICOM standard itself, but of course you need to use version 0.3-7 of **oro.dicom** to get it to work. The R package **RadOnc** is actively maintained and compatible with the latest versions of other packages on CRAN. It also offers additional features that are very handy for anyone who wants to perform statistical analysis of radiotherapy contours and dosimetry.

Dice, L.R. (1945) “Measures of the amount of ecologic association between species” *Ecology* **26**(3): 297-302.

Dowling, J.; Malaterre, M; Greer, P.B. & Salvado, O. (2009) “Importing Contours from DICOM-RT Structure Sets” *The Insight Journal*

Hargrave, C.E.; Mason, N.; Guidi, R.; Miller, J-A; Becker, J.; Moores, M.T.; Mengersen, K.; Poulsen, M. & Harden, F. (2016) “Automated replication of cone beam CT-guided treatments in the Pinnacle³ treatment planning system for adaptive radiotherapy” *J. Med. Rad. Sci.* **63**(1): 48-58.

Hausdorff, F. (1914) “*Grundzüge der Mengenlehre*” Veit and Company, Leipzig.

Moores, M.T.; Hargrave, C.E.; Deegan, T.; Poulsen, M.; Harden, F. & Mengersen, K. (2015) “An external field prior for the hidden Potts model with application to cone-beam computed tomography” *Comput. Stat. Data Anal.* **86**: 27-41.

Pinter, C.; Lasso, A.;Wang, A.; Jaffray, D. & Fichtinger, G. (2012) “SlicerRT: Radiation therapy research toolkit for 3D Slicer” *Med. Phys.* **39**(10): 6332–6338.

Thompson, R.F. (2014) “RadOnc: An R Package for Analysis of Dose-Volume Histogram and Three-Dimensional Structural Data” *J. Rad. Onc. Info.* **6**(1): 98-100.

Whitcher, B.; Schmid, V.J. & Thornton, A. (2011) “Working with the DICOM and NIfTI Data Standards in R” *J. Statist. Soft.* **44**(6): 1-28.


The Potts model has a doubly-intractable likelihood, so its expectation and variance cannot be computed exactly. Instead, we can use Markov chain Monte Carlo (MCMC) algorithms such as SW or Gibbs sampling to simulate from its distribution for a given value of β. However, we need to know how many MCMC iterations to use, so that the chain will have converged to a steady state. Otherwise, any inference using the MCMC samples will be biased.

In the following, the labels **z** of the Potts model can take *k* different values. This state space is not ordered, so algorithms such as perfect sampling (Propp & Wilson, 1996; Huber, 2016) cannot be applied. The Potts model is a member of the exponential family, so it has a sufficient statistic S(**z**) which is the count of like neighbours. The maximum value of S(**z**), which we will call *M*, is equal to 2(n − √n) for a regular, square lattice. For example, *M* = 112 for an 8×8 lattice; *M* = 31,000 for 125×125; and *M* = 1,998,000 for 1000×1000.

There are two exceptions where the distribution of the Potts model can be computed exactly. When β=0 the labels **z** are independent, hence the sufficient statistic S(**z**) follows a Binomial distribution with expectation *M/k* and variance *M(1/k)(1 – 1/k)*. For an 8×8 lattice with *k*=3, the expectation is 37.33 with a variance of 24.89. As β approaches infinity, all of the labels have the same value almost surely. This means that the expectation approaches *M* asymptotically, while the variance approaches 0.

We can use the endpoints of the distribution to estimate how long the SW and Gibbs algorithms take to converge. The algorithm is initialised at one endpoint, then we monitor S(**z**) at each iteration until the distribution of the samples has converged to the known expectation and variance. First, let’s look at chequerboard Gibbs sampling for an 8×8 lattice with *k*=3:

library(PottsUtils)
k <- 3
n <- 8*8
mask <- matrix(1, nrow=sqrt(n), ncol=sqrt(n))
neigh <- getNeighbors(mask, c(2,2,0,0))
block <- getBlocks(mask, 2)
edges <- getEdges(mask, c(2,2,0,0))
print(paste(sum(mask),"pixels"))

## [1] "64 pixels"

print(paste("maximum sufficient statistic S(z) =",nrow(edges)))

## [1] "maximum sufficient statistic S(z) = 112"

library(bayesImageS)
res.Gibbs <- mcmcPottsNoData(beta=5, k=3, neigh, block, niter=50)
ts.plot(res.Gibbs$sum, ylim=c(nrow(edges)/3, nrow(edges)))
abline(h=nrow(edges), col=2, lty=3)

summary(res.Gibbs$sum[26:50])

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 112 112 112 112 112 112

var(res.Gibbs$sum[26:50])

## [1] 0

We can see that it only takes around 25 iterations for the Gibbs sampler to converge for a lattice of that size. Now for a 125×125 lattice:

n <- 125*125
mask <- matrix(1, nrow=sqrt(n), ncol=sqrt(n))
neigh <- getNeighbors(mask, c(2,2,0,0))
block <- getBlocks(mask, 2)
edges <- getEdges(mask, c(2,2,0,0))
print(paste(sum(mask),"pixels"))

## [1] "15625 pixels"

print(paste("maximum sufficient statistic S(z) =",nrow(edges)))

## [1] "maximum sufficient statistic S(z) = 31000"

res.Gibbs <- mcmcPottsNoData(beta=5, k=3, neigh, block, niter=2000)
ts.plot(res.Gibbs$sum, ylim=c(nrow(edges)/3, nrow(edges)))
abline(h=nrow(edges), col=2, lty=3)

summary(res.Gibbs$sum[1001:2000])

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 30740 30780 30810 30810 30830 30890

var(res.Gibbs$sum[1001:2000])

## [1] 1440.661

Even after 1000 iterations, the distribution of S(**z**) has still not converged to the known value. Now let’s see how Swendsen-Wang performs for the same lattice:

res.SW <- swNoData(beta=5, k=3, neigh, block, niter=50)
ts.plot(res.SW$sum, ylim=c(nrow(edges)/3, nrow(edges)))
abline(h=nrow(edges), col=2, lty=3)

summary(res.SW$sum[26:50])

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 31000 31000 31000 31000 31000 31000

var(res.SW$sum[26:50])

## [1] 0

After 25 iterations, SW has already converged to the exact distribution. Even though this algorithm is much more expensive for each iteration, it more than makes up for that in efficiency when β is large. Now let’s see what happens when we go in the other direction: initialising the lattice with all labels set to the same value, then updating with β=0:

res2.Gibbs <- mcmcPottsNoData(beta=0, k=3, neigh, block, niter=500, random=FALSE)
ts.plot(res2.Gibbs$sum, ylim=range(c(res2.Gibbs$sum, nrow(edges))))
abline(h=nrow(edges), col=2, lty=3)
abline(h=nrow(edges)/3, col=4, lty=3)

summary(res2.Gibbs$sum)

## V1 ## Min. : 9988 ## 1st Qu.:10281 ## Median :10340 ## Mean :10334 ## 3rd Qu.:10389 ## Max. :10522

var(res2.Gibbs$sum)

## [,1] ## [1,] 6710.494

The distribution of all 500 samples is very close to the exact distribution, which has mean 10333.33 and variance 6888.89:

hist(res2.Gibbs$sum, freq=FALSE, breaks=50, col=3)
abline(v=nrow(edges)/3, col=4, lty=3, lwd=3)
curve(dnorm(x, mean=nrow(edges)/3, sd=sqrt(nrow(edges)*(1/3)*(2/3))),
      col="darkblue", lwd=2, add=TRUE, yaxt="n")

Now for Swendsen-Wang:

res2.SW <- swNoData(beta=0, k=3, neigh, block, niter=500, random=FALSE)
ts.plot(res2.SW$sum, ylim=range(c(res2.SW$sum, nrow(edges))))
abline(h=nrow(edges), col=2, lty=3)
abline(h=nrow(edges)/3, col=4, lty=3)

summary(res2.SW$sum)

## V1 ## Min. :10080 ## 1st Qu.:10273 ## Median :10327 ## Mean :10328 ## 3rd Qu.:10382 ## Max. :10542

var(res2.SW$sum)

## [,1] ## [1,] 6740.577

hist(res2.SW$sum, freq=FALSE, breaks=50, col=3)
abline(v=nrow(edges)/3, col=4, lty=3, lwd=3)
curve(dnorm(x, mean=nrow(edges)/3, sd=sqrt(nrow(edges)*(1/3)*(2/3))),
      col="darkblue", lwd=2, add=TRUE, yaxt="n")

The distribution of the SW samples with β=0 is almost identical to what we obtained from the chequerboard Gibbs sampler and matches the exact distribution very closely. Based on these results, I would be confident in using 500 iterations of SW to simulate images of this size for any value of β. One might reasonably ask if there is any scenario where Gibbs sampling outperforms SW. The answer lies in the “NoData” part of the function name: in the presence of an external field, such as when fitting the Potts model to an observed image, the Gibbs sampler will have much better performance. This is due to the inhomogeneity of the distributions of each pixel.

Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images *IEEE Trans. PAMI* **6**: 721-741.

Huber, M. (2016) *Perfect Simulation* Chapman & Hall/CRC Press

Moores, M.T.; Pettitt, A.N. & Mengersen, K. (2015) Scalable Bayesian Inference for the Inverse Temperature of a Hidden Potts Model *arXiv preprint* arXiv:1503.08066 [stat.CO]

Moores, M.T. & Mengersen, K. (2016) bayesImageS: Bayesian Methods for Image Segmentation using a Potts Model *R package* v0.3-4

Propp, J. G. & Wilson, D. B. (1996) Exact sampling with coupled Markov chains and applications to statistical mechanics *Random Struct. Algor.* **9**(1-2): 223-252.

Swendsen, R.H. & Wang, J-S (1987) Nonuniversal critical dynamics in Monte Carlo simulations *Phys. Rev. Lett.* **58**(2): 86–88.


There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm, approximate Bayesian computation (ABC), thermodynamic integration, and composite likelihood. These approaches vary in accuracy as well as scalability for datasets of significant size. The Potts model is an example where such methods are required, due to its intractable normalising constant. This model is a type of Markov random field, which is commonly used for image segmentation. The dimension of its parameter space increases linearly with the number of pixels in the image, making this a challenging application for scalable Bayesian computation. My talk will introduce various algorithms in the context of the Potts model and describe their implementation in C++, using OpenMP for parallelism. I will also discuss the process of releasing this software as an open source R package on the CRAN repository.


The dependency issue with **callr** that I mentioned in my previous post seems to occur with any version of R prior to 3.3.x, including Microsoft R Open (MRO). As with OS X, upgrading to the latest release of R fixes this issue.

I’ve rewritten my package vignette to use **listings** instead of **algorithm2e** or **algorithmicx** to format the pseudocode for intractable likelihoods. I’m not completely happy with how this looks, but following some advice from StackExchange I’ve managed to produce something readable. If you compare the current vignette with the paper on arXiv, you’ll see what I mean. I’ve also used **subcaption** for the figures and **amsfonts** for equations. Note that the **subfigure** and **subfig** packages on CTAN are now considered obsolete.
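For anyone wanting to try the same thing, a minimal sketch of a listings setup for pseudocode (the style name and caption here are my own invention, not the actual vignette code):

```latex
\usepackage{listings}
\lstdefinestyle{pseudocode}{
  basicstyle=\ttfamily\small,
  mathescape=true,   % allow $...$ for symbols such as $\beta$
  numbers=left,
  numberstyle=\tiny,
  frame=single,
  columns=fullflexible
}
% ... in the document body ...
\begin{lstlisting}[style=pseudocode,caption={Swendsen-Wang update}]
for $t = 1, \ldots, T$
    draw bond variables $u \mid z, \beta$
    relabel each connected cluster uniformly at random
\end{lstlisting}
```

The mathescape option is what makes listings usable for pseudocode, since algorithm-style notation leans heavily on inline maths.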

The DESCRIPTION and NAMESPACE files for **bayesImageS** were originally generated by RcppArmadillo::package.skeleton, but it seems that the preferred method for linking to the Armadillo linear algebra library has changed over the years. When I run R CMD check --as-cran I get the following note:

* checking CRAN incoming feasibility ... NOTE
New submission
* checking package dependencies ... NOTE
Package in Depends/Imports which should probably only be in LinkingTo: ‘RcppArmadillo’
Status: 2 NOTEs

On CRAN, I can see that there are a ton of packages listed in the reverse LinkingTo, but only 4 packages in reverse Imports and 3 in reverse Depends. However, if I remove RcppArmadillo from Depends/Imports I get an ERROR instead of a NOTE:

* checking package dependencies ... ERROR
Namespace dependency not required: ‘RcppArmadillo’

This was really frustrating. It looks like the bug has now been fixed in both RcppArmadillo and RcppEigen, but there are no clear instructions on how to produce a working DESCRIPTION and NAMESPACE if you were unlucky enough to run package.skeleton using an old version of either of those packages. Removing RcppArmadillo from LinkingTo just resulted in a compile error:

In file included from PottsUtil.cpp:20:0:
PottsUtil.h:23:27: fatal error: RcppArmadillo.h: No such file or directory
 #include <RcppArmadillo.h>
                           ^
compilation terminated.
make: *** [PottsUtil.o] Error 1
ERROR: compilation failed for package ‘bayesImageS’

I submitted my package to CRAN with this NOTE, but it was rejected. I ended up fixing the problem by trial and error:

DESCRIPTION

Imports: Rcpp (>= 0.10.2)
LinkingTo: Rcpp, RcppArmadillo

NAMESPACE

exportPattern("^[[:alpha:]]+")
importFrom(Rcpp,evalCpp)
useDynLib(bayesImageS)

By deleting RcppArmadillo from both Imports and NAMESPACE, I managed to produce a working version of my R package that passes check_for_cran with flying colours.

Perhaps they should add a new function to RcppArmadillo and RcppEigen, something like fix_broken_package?
