To briefly recap, the mean function is characterised by the following differential equation:

$$\frac{d\mu}{dt} = r\,\mu\left(1 - \frac{\mu}{L}\right),$$

which is a logistic curve with rate parameter $r$ and a horizontal asymptote at $L$. The solution for an initial value $\mu(t_C) = y_0$ is:

$$\mu(t) = \frac{y_0 L e^{r(t - t_C)}}{L + y_0\left(e^{r(t - t_C)} - 1\right)}$$
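As a quick numerical sanity check (an illustrative sketch, not part of the original analysis), we can verify in R that the closed-form solution satisfies the logistic ODE:

```r
# Closed-form solution of the logistic growth curve, with mu(tC) = y0
mu <- function(t, r, y0, tC, L) {
  y0 * L * exp(r * (t - tC)) / (L + y0 * (exp(r * (t - tC)) - 1))
}
r <- 2; y0 <- 4; tC <- 0; L <- 12   # example values used later in this post
t <- seq(-2, 2, by = 0.1)
h <- 1e-6
# central finite-difference approximation to dmu/dt
grad <- (mu(t + h, r, y0, tC, L) - mu(t - h, r, y0, tC, L)) / (2 * h)
# right-hand side of the logistic ODE: r * mu * (1 - mu/L)
rhs <- r * mu(t, r, y0, tC, L) * (1 - mu(t, r, y0, tC, L) / L)
max(abs(grad - rhs))  # numerically negligible
```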
The observations are generated from a heteroskedastic, truncated normal distribution where the variance is equal to the gradient of the mean:

$$y \sim \mathcal{N}_{[0,\,L]}\!\left(\mu(t),\ \frac{d\mu}{dt}\right)$$
Simulating data from this model is straightforward, for example using the rtruncnorm function in the R package **truncnorm**:
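A sketch of the simulation, using the same mean function and gradient as the Stan model below, with illustrative parameter values r = 2, y0 = 4, tC = 0 and L = 12:

```r
library(truncnorm)  # provides rtruncnorm()

# mean function of the logistic growth curve
ft <- function(t, r, y0, tC, L) {
  y0*L*exp(r*(t - tC)) / (L + y0*(exp(r*(t - tC)) - 1))
}
# gradient of the mean, used as the variance of the observations
dfdt <- function(t, r, tC, L) {
  r*L*exp(r*(t + tC)) / ((exp(r*tC) + exp(r*t))^2)
}

simt <- rnorm(25, sd = 1.5)                      # 25 random timepoints
simy <- matrix(nrow = length(simt), ncol = 20)   # 20 replicates per timepoint
for (i in seq_along(simt)) {
  simy[i,] <- rtruncnorm(ncol(simy), a = 0, b = 12,
                         mean = ft(simt[i], 2, 4, 0, 12),
                         sd   = sqrt(dfdt(simt[i], 2, 0, 12)))
}
```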

This is what it looks like in the Stan modelling language (a descendant of the BUGS language):

functions {
  vector ft(vector t, real r, real y0, real tC, real L) {
    vector[num_elements(t)] mu;
    vector[num_elements(t)] exprt;
    exprt = exp(r*(t-tC));
    mu = y0*L*exprt ./ (L + y0*(exprt - 1));
    return mu;
  }
  vector dfdt(vector t, real r, real tC, real L) {
    vector[num_elements(t)] dmu;
    vector[num_elements(t)] sddenom;
    for (i in 1:num_elements(t)) {
      sddenom[i] = ((exp(r*tC) + exp(r*t[i]))^(-2));
    }
    dmu = r*L*exp(r*(t+tC)) .* sddenom;
    return dmu;
  }
}
data {
  int<lower = 1> M;
  int<lower = 1> N;
  real<lower = 0> mu0;
  real<lower = 1> maxY;
  real tcrit;
  matrix<lower=0, upper=maxY>[M,N] y;
  vector[M] t;
}
parameters {
  real<lower = 0> r;
}
transformed parameters {
  vector[M] curr_mu;
  vector[M] curr_sd;
  curr_mu = ft(t, r, mu0, tcrit, maxY);
  curr_sd = dfdt(t, r, tcrit, maxY);
}
model {
  for (i in 1:M) {
    for (j in 1:N) {
      y[i,j] ~ normal(curr_mu[i], curr_sd[i]) T[0,maxY];
    }
  }
}

When we try to fit this model, Stan gives the following error:

[1] "The following numerical problems occured the indicated number of times after warmup on chain 1"
                                                                                 count
Exception thrown at line 28: normal_log: Scale parameter is 0, but must be > 0!     35

This indicates that there are problems with numerical precision in the tails of the distribution, as the gradient of the mean (and hence the variance) approaches zero. To avoid this, I added a lower bound on the standard deviation:

curr_sd = dfdt(t, r, tcrit, maxY) + eps;

Even after fixing this problem, I still had the dreaded *divergent transitions*:

Warning messages:
1: There were 2155 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help.
See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

I tried increasing adapt_delta as suggested, but I still got divergent transitions even at adapt_delta=0.99. Even adding a prior on r didn’t really help. In the end, I dropped the truncated normal distribution:

model {
  for (i in 1:M) {
    y[i,] ~ normal(curr_mu[i], curr_sd[i]);
  }
}

Even though this meant that the model was slightly misspecified, the true value of r was now well within the region of highest posterior density:

Inference for Stan model: GrowthCurve4.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

  mean se_mean   sd 2.5%  25%  50% 75% 97.5% n_eff Rhat
r 1.98       0 0.02 1.95 1.97 1.98   2  2.02  3079    1

Samples were drawn using NUTS(diag_e) at Wed Apr 19 15:17:27 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains
(at convergence, Rhat=1).

I can modify the model so that both r and mu0 are treated as unknown parameters:

parameters {
  real<lower = 0> r;
  real<lower = 0> mu0;
}

In this case, with r=2.5 and mu0=4, the true values are within the region of posterior support:

Inference for Stan model: GrowthCurve4.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
r    2.43    0.00 0.04 2.35 2.41 2.43 2.45  2.50  1761    1
mu0  3.80    0.01 0.30 3.24 3.59 3.79 3.99  4.42  1702    1

Samples were drawn using NUTS(diag_e) at Wed Apr 19 16:00:23 2017.


Next Monday (June 5), I will be giving an introduction to splines, based on chapter 5 of *The Elements of Statistical Learning* (Hastie, Tibshirani & Friedman, 2009, 2nd ed.), supplemented by assorted other references:

- Eilers & Marx (1996) Flexible smoothing with B-splines and penalties. *Statist. Sci.* **11**(2): 89-121.
- Ruppert, Wand & Carroll (2003) *Semiparametric Regression*, ch. 16. CUP.
- Lang & Brezger (2004) Bayesian P-splines. *JCGS* **13**(1).
- Wood (2006, 2nd ed. 2017) *Generalized Additive Models: An Introduction with R*. Chapman & Hall/CRC Press. https://CRAN.R-project.org/package=mgcv
- Ventrucci & Rue (2016) Penalized complexity priors for degrees of freedom in Bayesian P-splines. *Statist. Mod.* **16**(6): 429-453.
- Wood, Pya & Säfken (2016) Smoothing Parameter and Model Selection for General Smooth Models. *JASA* **111**(516).
- Wood (2016) P-splines with derivative based penalties and tensor product smoothing of unevenly distributed data. *Stat. Comput.* **27**(4).

Obviously, I’m not going to cover all of that in a one hour talk. The main ideas that I’ll aim to get across are Gibbs sampling and other estimation methods for the smoothing parameter. This talk will be part of the reading group in statistical machine learning.

The last talk will be an introduction to approximate Bayesian computation (ABC) for the Warwick ML Club. This talk will largely be based on a previous talk that I gave at ABC in Sydney.

The summer is a busy time for conferences in the UK. I will be presenting a poster at the Workshop on New mathematical methods in computational imaging at Heriot-Watt University, Edinburgh, on June 30:

**Approximate Posterior Inference for the Inverse Temperature of a Hidden Potts Model**

There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm and approximate Bayesian computation (ABC). A serious drawback of these algorithms is that they do not scale well for models with a large state space. Markov random fields, such as the Potts model and exponential random graph model (ERGM), are particularly challenging because the number of discrete variables increases linearly with the size of the image or graph. The likelihood of these models cannot be computed directly, due to the presence of an intractable normalising constant. In this context, it is necessary to employ algorithms that provide a suitable compromise between accuracy and computational cost.

Bayesian indirect likelihood (BIL) is a class of methods that approximate the likelihood function using a surrogate model. This model can be trained using a pre-computation step, utilising massively parallel hardware to simulate auxiliary variables. We review various types of surrogate model that can be used in BIL. In the case of the Potts model, we introduce a parametric approximation to the score function that incorporates its known properties, such as heteroskedasticity and critical temperature. We demonstrate this method on 2D satellite remote sensing and 3D computed tomography (CT) images. We achieve a hundredfold improvement in the elapsed runtime, compared to the exchange algorithm or ABC. Our algorithm has been implemented in the R package “bayesImageS,” which is available from CRAN.

During July, I will be attending the programme “Scalable inference; statistical, algorithmic, computational aspects” at the Isaac Newton Institute (INI), Cambridge. This includes (at least) two workshops that I am aware of so far: “Scalable Statistical Inference” (July 3-7) and “Sampling methods in statistical physics and Bayesian inference” (July 18).

My abstract has also been accepted for the RSS International Conference in Glasgow, September 4-7. According to the conference programme, my talk has been scheduled for contributed session 6.5 Big Data, after lunch on the Wednesday. Hope to see you there!


The scalability of this approach is due to the representation of the discrete graph structure using a latent continuous model, the generalised gamma process (GGP). The sociability parameters of each node can be estimated using Hamiltonian Monte Carlo (HMC), since the gradient of the conditional log-posterior is available in closed form:

where $N_j$ is the number of nodes with degree $j$; $w_*$ is the leftover sociability of the zero-degree nodes; $D$ is the degree matrix; $d_i$ is the degree of node $i$; and $\alpha, \sigma, \tau$ are the hyperparameters of the GGP.

Full details of the algorithm are provided in Appendix F of the paper and the supplementary material contains a reference implementation. Further improvements in scalability might be achieved by taking advantage of parallelism, as mentioned by the authors. The source code is written in pure MATLAB, without any use of .mex files (compiled C code). It only depends on the Stats toolbox for generating random variables. It would be interesting to see how Stan would perform on this model, or alternatively a C++ implementation using Rcpp. For an idea of what is possible, check out the R package BradleyTerryScalable by Ella Kaye and David Firth.

The pair potentials of the GGP are symmetric:

$$\Pr(z_{ij} = 1 \mid w_i, w_j) = 1 - \exp(-2 w_i w_j), \qquad i \neq j,$$

where $w_i$ is the sociability parameter of node $i$. This implies a very different type of generative process than the adversarial graphs that are typical of Bradley-Terry or Plackett-Luce models, where nodes compete with each other for edges. Instead, the in-degree and out-degree are assumed to be independent Poisson counts.

To illustrate the difference between these two approaches, I computed the sociability $w_i$ for each node using the GGP, as well as the PageRank algorithm (Brin & Page, 1998). Famously, this is the original ranking algorithm that was used by Google. It is a spectral method, since the ranks correspond to the principal eigenvector of the (suitably normalised) link matrix of the graph, where the nodes are websites and the edges are hyperlinks. Alternatively, PageRank can be derived from the stationary distribution of a Markov chain that describes random walks through the World-Wide Web.
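To make the random-walk interpretation concrete, here is a minimal sketch of PageRank by power iteration in base R, on a hypothetical three-node toy graph (not the implementation used for the results below):

```r
# Toy directed adjacency matrix: node 1 -> {2,3}, node 2 -> {1}, node 3 -> {2}
A <- matrix(c(0, 1, 1,
              1, 0, 0,
              0, 1, 0), nrow = 3, byrow = TRUE)

pagerank <- function(A, d = 0.85, tol = 1e-10) {
  n <- nrow(A)
  # column-stochastic transition matrix of the random walk over outgoing links
  P <- t(A / pmax(rowSums(A), 1))
  pr <- rep(1/n, n)
  repeat {
    # damped update: teleport with probability (1 - d), follow a link otherwise
    pr_new <- (1 - d)/n + d * P %*% pr
    if (max(abs(pr_new - pr)) < tol) break
    pr <- pr_new
  }
  as.vector(pr / sum(pr))
}

pr <- pagerank(A)
```

Node 2 receives links from both other nodes, so it ends up with the highest rank.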

PageRank has proven to be highly successful for datasets such as the WWW graph from the University of Notre Dame (Albert, Jeong & Barabási, 1999). This graph has 325,729 nodes, far larger than was previously feasible for computation with MCMC. It has 1,497,134 directed edges, while the symmetric adjacency matrix has 1,090,108 nonzero entries. The GGP can be used for directed multigraphs, but this makes little difference here due to the symmetric pair potentials shown above.

I ran the MATLAB code for 100k iterations with 2 parallel chains, which took 4 hours per chain on my iMac (3.5GHz Intel Core i7 with 16GB RAM). The chains appeared to converge after 20k iterations, as shown in the traceplot for one of the GGP parameters. The 99% posterior credible interval was [0.368; 0.375], which differs significantly from the estimate in Table 2 of the paper (pg. 29). This might be due to mixing and convergence issues with a smaller number of iterations (40k, discarding 20k as burn-in). Memory usage climbed to over 30GB, which required a significant amount of swapping to disk. It would definitely be worthwhile running this code for longer on a Linux cluster with more available RAM, to see whether the MCMC has in fact converged as it appears from the traceplots. The estimated runtime of 132min that was reported in the paper seems to be quite optimistic for this dataset. This might also explain why the authors did not consider any graphs larger than this one.

PageRank can be computed using the centrality(..) function in MATLAB, or using the igraph package in R. The 3D histogram below shows that the relationship between $w_i$ and log(PR) is highly nonlinear. Many nodes are assigned a PageRank less than 6e-6 (log(PR) < -12), whereas $w_i$ is more spread out, between 2e-9 and 4.5e-5.

It would be interesting to see whether the Bradley-Terry model (or equivalently Plackett-Luce with m=2) would show a similar relationship. A nonparametric Plackett-Luce model was proposed in an earlier paper (Caron, Teh & Murphy, 2014). Since this is similarly based on a marked Poisson process with intensity defined by a gamma process, it would also be interesting to compare the scalability between the two models. MATLAB code for nonparametric Plackett-Luce is also available from François Caron’s homepage.


The first surprise was that the Xcode command-line tools were no longer working:

$ g++ --version
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun

These needed to be reinstalled from scratch:

xcode-select --install

Unfortunately, the latest (January 2017) CUDA Toolkit 8.0.61 is incompatible with Apple LLVM 8.1.0:

`nvcc fatal : The version ('80000') of the host compiler ('Apple clang') is not supported`

So after upgrading the Xcode command-line tools, I then needed to *downgrade* again.

The Gnu toolchain was still working, since it was correctly installed in /usr/local/bin, but it was now giving a worrying error message:

gfortran: warning: couldn’t understand kern.osversion ‘16.4.0’

I decided it was better to reinstall these as well. This gave me version 6.3.0 of GCC, g++ and gfortran.

R still ran from inside RStudio, but it was no longer on the PATH. I would advise reinstalling R from scratch, so that the R executable will be located in /usr/local/bin where it belongs, rather than wherever the macOS upgrade scripts have relocated it to! I upgraded to the latest RStudio as well, for good measure.

Likewise with LaTeX, /Library/TeX/texbin was hosed. I could no longer compile any TeX documents, even from inside TeXShop or RStudio. Happily, this did not require a reinstall, just a modification of the PATH and the TeXShop preferences to point to the new location. RStudio ignores any .profile or .bashrc in your home directory, so you need to edit the system-wide PATH:

sudo vi /etc/paths

Java apps (such as JabRef) gave an interesting error message:

To open "JabRef" you need to install the legacy Java SE 6 runtime.

…which is yet another reason to abandon Java! But alas, JabRef remains the best reference manager for BibTeX files. Certainly, the one that comes with MacTeX is pretty awful. If only they had used Qt, like RStudio does.

As for MATLAB, the news is not good:

If you are running MATLAB R2013b or earlier a patch for these releases is not available. Please update to MATLAB R2014a or later to use MATLAB on macOS Sierra.

I tried installing R2014a as suggested above, but it just crashes with a segfault immediately upon launch. R2017a appears to be running fine, but of course this can create compatibility issues with existing code. For all of the idiosyncrasies in R that are due to backwards compatibility with S, at least they are careful not to arbitrarily break things every couple of years! If you’re distributing your research code in MATLAB, you might as well be writing in Fortran and storing it on punch cards in a shoebox.


Two brilliant slides from Philip Dawid responding to Hennig & Gelman pic.twitter.com/UXaD7CY00X

— Robert Grant (@robertstats) 12 April 2017

Statistics is an essential element of modern science, and has been for quite some time. As such, statistical procedures should be evaluated with regard to the philosophy of science. Towards this goal, the authors propose seven statistical virtues that could serve as a guide for authors (and reviewers) of scientific papers. The chief of these is transparency: thorough documentation of the choices, assumptions and limitations of the analysis. These choices need to be justified within the context of the scientific study. Given the ‘no free lunch’ theorems (Wolpert, 1996), such contextual dependence is a necessary property of any useful method.

The authors argue that “subjective” and “objective” are ambiguous terms that harm statistical discourse. No methodology has an exclusive claim to objectivity, since even null hypothesis significance testing (NHST) involves choice of the sampling distribution, as well as the infamous α=0.05. The use of default priors, as in Objective Bayes, requires ignoring any available information about the parameters of interest. This can conflict with other goals, such as identifiability and regularisation. The seven virtues are intended to be universal and can apply irrespective of whether the chosen methodology is frequentist or Bayesian. Indeed, the authors advocate a methodology that combines features of both.

There have been many other attempts to reconcile frequentist and Bayesian approaches to produce a grand unified theory of statistics. The main feature of the methodology in Section 5.5 is iterative refinement of the model (including priors and tuning parameters) to better fit the observed data. Rather than Bayesian updating or model choice, the suggested procedure involves graphical summaries of model fit (Gelman et al. 2013). This has connections with well-calibrated Bayes (Dawid 1982) and hypothetico-deductive Bayes (Gelman & Shalizi, 2013). I think that this is a good approach, albeit saddled with an unfortunate misnomer.

The term “falsificationist” might be slightly less clumsy than “hypothetico-deductive,” but nevertheless seems misleading. Leaving aside the question of whether statistical hypotheses are falsifiable at all, except in the limit of infinite data, falsification in the Popperian sense is really not the goal. This would imply abandoning an inadequate model and starting again from scratch. As stated by Gelman (2007),

“…the purpose of model checking (as we see it) is not to reject a model but rather to understand the ways in which it does not fit the data.”

Furthermore, this approach is not limited to posterior predictive distributions. It could be applied to any generative model, not necessarily a Bayesian one. Thus, falsificationist Bayesianism as presented in this paper is neither falsificationist nor Bayesian, but it is an excellent approach nevertheless.

For other viewpoints, check out the blog posts by Xian and Nick Horton, as well as this interview on YouTube.


The C++ standard library in the OpenCSW version of GCC 5.2.0 is incompatible with GCC on other platforms (such as Windows, Linux and Mac OS). This causes errors like the following:

smcPotts.cpp: In function ‘Rcpp::IntegerVector resample_resid(Rcpp::NumericVector&, arma::vec&, Rcpp::NumericMatrix&)’:
smcPotts.cpp:219:48: error: call of overloaded ‘log(const unsigned int&)’ is ambiguous
   int tW = (int)trunc(exp(log_wt(i) + log(n)));
                                                ^
In file included from /usr/include/math.h:15:0,
                 from /opt/csw/include/c++/5.2.0/cmath:44,
                 from /home/ripley/R/Lib32/Rcpp/include/Rcpp/platform/compiler.h:100,
                 from /home/ripley/R/Lib32/Rcpp/include/Rcpp/r/headers.h:48,
                 from /home/ripley/R/Lib32/Rcpp/include/RcppCommon.h:29,
                 from /home/ripley/R/Lib32/RcppArmadillo/include/RcppArmadilloForward.h:26,
                 from /home/ripley/R/Lib32/RcppArmadillo/include/RcppArmadillo.h:31,
                 from smcPotts.h:4,
                 from smcPotts.cpp:20:
/opt/csw/lib/gcc/sparc-sun-solaris2.10/5.2.0/include-fixed/iso/math_iso.h:200:21: note: candidate: long double std::log(long double)
 inline long double log(long double __X) { return __logl(__X); }
                    ^
/opt/csw/lib/gcc/sparc-sun-solaris2.10/5.2.0/include-fixed/iso/math_iso.h:168:15: note: candidate: float std::log(float)
 inline float log(float __X) { return __logf(__X); }
               ^
/opt/csw/lib/gcc/sparc-sun-solaris2.10/5.2.0/include-fixed/iso/math_iso.h:68:15: note: candidate: double std::log(double)
 extern double log __P((double));
*** Error code 1
make: Fatal error: Command failed for target `smcPotts.o'
Current working directory /tmp/RtmpPtailS/R.INSTALL5859270b1fe0/bayesImageS/src
ERROR: compilation failed for package ‘bayesImageS’

The Writing R Extensions manual describes this as an “overloading ambiguity,” in the section on Portable C and C++ code. Given that this code compiles fine on Gnu g++, LLVM clang++ (Xcode) and icpc (Intel Parallel Studio XE) on Windows, Mac OS and Linux, somehow I don’t think the problem with portability is at my end!

The fix is quite straightforward, albeit painstaking. It involves searching through the source code for every use of math.h functions log(), exp(), pow(), etc. for implicit casts from int/uint/long to float/double and making them explicit to resolve the “ambiguity.” For example, the fix for the code above was as follows:

< int tW = (int)trunc(exp(log_wt(i) + log(n)));
---
> int tW = (int)trunc(exp(log_wt(i) + log((double)n)));

Refer to checkin 4d1a7ec on BitBucket for details. I was pleased to see that the Solaris build on CRAN is now working:

using R Under development (unstable) (2017-03-21 r72378)
using platform: sparc-sun-solaris2.10 (32-bit)

I’m glad that I don’t have any issues with 32 vs. 64bit OS, that really *would* be annoying! If compile errors in OpenCSW on SPARC Solaris are considered as vital issues (i.e. your R package is scheduled for archival unless you fix them) then I hope that services like R-hub and Travis CI will eventually provide support for building on this platform. It’s been more than 15 years since the last time that I installed Solaris myself. Back then, I was paid to care about customers with obscure taste in operating systems. In the UK I don’t have a spare computer that I would want to sacrifice in such a pointless task.

When I came to submit this fix to CRAN, I saw that my build status had an additional NOTE that wasn’t there when the package was first submitted:

Found no calls to: ‘R_registerRoutines’, ‘R_useDynamicSymbols’

It is good practice to register native routines and to disable symbol search.
See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.

Although this has only recently been added to the CRAN package checks, it has been a feature of R for a very long time. Duncan Temple Lang explained the technical details in an article for *R News* back in 2001 (Vol. 1/3, pp. 20-23). The idea is that you register every C, C++ or Fortran function explicitly with R so that it doesn’t have to search the dynamic symbol table every time you make a call. This is also safer, because R checks to make sure that .C() isn’t used for functions that require .Call() and vice-versa. All of the relevant code is in bayesImageS_init.cpp, which I created in checkin 60d8fa5. Note that this isn’t an issue for anyone who uses Rcpp attributes, since the RcppExports.cpp handles all of this for you (since Rcpp version 0.10.0). This gives me a further reason to rewrite this package to use annotations, since it will be easier to maintain in the long run.
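If you are registering routines by hand, recent versions of R (3.4.0 and later) can generate most of this boilerplate for you; a sketch, assuming the package source directory sits under the working directory:

```r
# Generate a C skeleton that registers a package's native routines;
# the output can be copied into src/<pkgname>_init.c
if (dir.exists("bayesImageS")) {
  tools::package_native_routine_registration_skeleton("bayesImageS")
}
```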

I also had a lot of headaches with the LaTeX setup on my new laptop. Fortunately I could still refer to my old iMac to figure out what the problem was. One issue that I never managed to fix was that knitr wouldn’t convert EPS files to PDF. In the end, I just included the *-eps-converted-to.pdf files in the R package. This is also something that I’ll probably need to revisit in future, or it will keep coming back to bite me on the arse.


M. Moores, A. N. Pettitt & K. Mengersen (2015) Scalable Bayesian inference for the inverse temperature of a hidden Potts model. arXiv:1503.08066 [stat.CO]

M. Moores, C. C. Drovandi, K. Mengersen & C. P. Robert (2015) Pre-processing for approximate Bayesian computation in image analysis. *Statistics & Computing* **25**(1): 23-33.

M. Falk, C. Alston, C. McGrory, S. Clifford, E. Heron, D. Leonte, M. Moores, C. Walsh, A.N. Pettitt & K. Mengersen (2015) Recent Bayesian approaches for spatial analysis of 2-D images with application to environmental modelling. *Envir. Ecol. Stat.* **22**(3): 571-600.

M. Moores & K. Mengersen (2014) Bayesian approaches to spatial inference: modelling and computational challenges and solutions. In *Proc. 33rd MaxEnt*, AIP Conf. Proc. **1636**: 112-117.


Note that this approach is not the same as the Runge-Kutta (RK45) and backward-differentiation formula (BDF) methods for solving ODEs in Stan. In the case of the dugongs data (and the other examples that I consider), the times are not strictly ordered (i.e. there are multiple observations at a single timepoint). This results in the following error message from the integrate_ode function:

[1] "Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:"
[2] "Exception thrown at line 32: integrate_ode_rk45: times is not a valid ordered vector. The element at 3 is 1.5, but should be greater than the previous element, 1.5"

The model from Carlin & Gelfand (1991) is as follows:

$$y_i \sim \mathcal{N}(\mu_i, \sigma^2), \qquad \mu_i = \alpha - \beta \lambda^{x_i},$$

which is the solution to the following ODE:

$$\frac{d\mu}{dx} = (\mu - \alpha)\log\lambda$$
We don’t need to solve the equation numerically, since the solution is already available in closed form. Our interest is primarily in parameter estimation, particularly for ill-posed problems when the parameters are not well-identified.

Base R includes the nls() function, which stands for nonlinear least squares. By default, it uses the Gauss-Newton algorithm to search for parameter values that fit the observed data. In this case, repeated observations at the same timepoint are not an issue:

dat <- list("N" = 27,
            "x" = c(1, 1.5, 1.5, 1.5, 2.5, 4, 5, 5, 7, 8, 8.5, 9, 9.5, 9.5, 10,
                    12, 12, 13, 13, 14.5, 15.5, 15.5, 16.5, 17, 22.5, 29, 31.5),
            "Y" = c(1.8, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.47, 2.19,
                    2.26, 2.4, 2.39, 2.41, 2.5, 2.32, 2.32, 2.43, 2.47, 2.56,
                    2.65, 2.47, 2.64, 2.56, 2.7, 2.72, 2.57))
plot(dat$x, dat$Y, xlab="Age", ylab="Length")
nlm <- nls(Y ~ alpha - beta * lambda^x, data=dat,
           start=list(alpha=1, beta=1, lambda=0.9))
summary(nlm)
nlm_fn <- predict(nlm, newdata=dat$x)
lines(dat$x, nlm_fn, col=6, lty=2)

Formula: Y ~ alpha - beta * lambda^x

Parameters:
       Estimate Std. Error t value Pr(>|t|)    
alpha   2.65807    0.06151   43.21  < 2e-16 ***
beta    0.96352    0.06968   13.83  6.3e-13 ***
lambda  0.87146    0.02460   35.42  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09525 on 24 degrees of freedom

Number of iterations to convergence: 6
Achieved convergence tolerance: 3.574e-06

If you compare the posterior modes from Carlin & Gelfand (1991), the results are almost identical:

> exp(0.975)
[1] 2.651167
> exp(-0.014)
[1] 0.9860975
> inv.logit(1.902)
[1] 0.8701177

The high posterior density (HPD) regions are asymmetric, due to the reparameterisation of the model.

Now let’s look at a slightly more complex problem:

$$\frac{d\mu}{dt} = r\,\mu\left(1 - \frac{\mu}{L}\right),$$

where r is a rate parameter and L is the limit (horizontal asymptote). The solution to this ODE for an initial value $\mu(t_C) = y_0$ is:

$$\mu(t) = \frac{y_0 L e^{r(t - t_C)}}{L + y_0\left(e^{r(t - t_C)} - 1\right)}$$
Using these equations, we can simulate data from a heteroskedastic, truncated normal distribution where the variance is equal to the gradient of the mean:

library(truncnorm)

t <- seq(-4, 4, by=0.05)
ft <- function(t, r, y0, tC, L){
  y0*L*exp(r*(t-tC))/(L + y0*(exp(r*(t-tC)) - 1))
}
plot(t, ft(t,2,4,0,12), type='l', ylim=c(0,12), col=4, ylab='Y')
abline(v=0, lty=3)
abline(h=12, lty=3, col=2)
abline(h=0, lty=3)
dfdt <- function(t, r, tC, L){
  r*L*exp(r*(t+tC))/((exp(r*tC) + exp(r*t))^2)
}
simt <- rnorm(25, sd=1.5)
simy <- matrix(nrow=length(simt), ncol=20)
for (i in 1:length(simt)) {
  simy[i,] <- rtruncnorm(ncol(simy), a=0, b=12, mean=ft(simt[i],2,4,0,12),
                         sd = sqrt(dfdt(simt[i],2,0,12)))
  points(rep(simt[i], ncol(simy)), simy[i,], pch='*')
}

The Gauss-Newton method underestimates the true parameter value by almost 4 standard errors:

dat <- list(Y=as.vector(simy), t=rep(simt,ncol(simy)), y0=4, maxY=12, tcrit=0)
nlm <- nls(Y ~ y0*maxY*exp(r*(t-tcrit))/(maxY + y0*(exp(r*(t-tcrit)) - 1)),
           data=dat, start=list(r=1))
summary(nlm)
nlm_fn <- predict(nlm, newdata=list(t=t))
lines(t, nlm_fn, col=6, lty=2, lwd=2)

Formula: Y ~ y0 * maxY * exp(r * (t - tcrit))/(maxY + y0 * (exp(r * (t - tcrit)) - 1))

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
r  1.76053    0.04219   41.73   <2e-16 ***

Although we were unable to recover the true parameter value using nls(), the approximation (magenta curve) is nevertheless quite close to the true function (in blue). With 500 observations at 25 different timepoints, this suggests that the data are not very informative about the rate parameter. This problem is exacerbated when we introduce additional parameters with the aim of obtaining a more flexible function. Next time, we will look at analysing this data using Stan.

Bates & Watts (1988) “Nonlinear Regression Analysis and Its Applications.” Wiley-Interscience

Carlin & Gelfand (1991) “An iterative Monte Carlo method for nonconjugate Bayesian analysis.” *Stat. Comp.* **1**(2), 119-128.

Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li & Riddell (2017) “Stan: A Probabilistic Programming Language.” *J. Stat. Soft.* **76**(1)

Ratkowsky (1983) “Nonlinear Regression Modeling: A Unified Practical Approach.” Marcel Dekker

Richards (1959) “A Flexible Growth Function for Empirical Use.” *J. Experimental Botany* **10** (2): 290–300.

Thomas, Best, Lunn & Spiegelhalter (2012) “The BUGS Book: A Practical Introduction to Bayesian Analysis.” Chapman & Hall/CRC Press


The most significant change is the increased security features that were introduced in El Capitan, in an attempt to make OS X less vulnerable to rootkits and other script kiddie exploits. Upgrading caused a lot of headaches for users of R and LaTeX, since folders were magically moved to different locations on the hard drive. This is one of the reasons that I chose to skip Yosemite and El Capitan (OK, mostly it was sheer laziness…)

I use the Outlook email client because my work email is hosted on an Exchange server and I’ve found Mac mail very flaky for that. I also have co-authors that use Excel and Word, so having Office installed makes it easier to collaborate with them. Wherever possible, I use R Markdown with pandoc to generate Word documents programmatically, which saves me a lot of time in the long run (reproducible research FTW!). Other commercial software that I have installed includes MATLAB, Mathematica, Dropbox, and Skype.

I was very pleased that I was able to install the Xcode command-line tools and the Gnu compilers without any problems. I can switch between clang++ and g++ by editing my ~/.R/Makeconf as described here. I also use XQuartz for X/Windows support. Of course, I have R and RStudio installed, using vecLib instead of the default R BLAS library. The instructions for switching BLAS libraries on OS X in Sect. 10.5 of the installation guide are out of date. On OS X Mavericks or later, type the following commands in Terminal:

cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib
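Going back to the compiler setup: the relevant lines in ~/.R/Makeconf look something like this (a sketch only; the exact compiler names depend on how and where GCC was installed):

```
CC = gcc-6
CXX = g++-6
FC = gfortran-6
F77 = gfortran-6
```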

I have limited space on my solid state drive, so I decided to install the minimal BasicTeX with TeXShop for editing and compiling LaTeX documents. Overall, I think this is a better approach. I wouldn’t install all 10,000 packages from CRAN just in case I might need them, so I see no reason why CTAN should be any different. So far, I’ve had to install the following additional packages from CTAN using tlmgr:

- preprint
- algorithms
- subfigure
- bbm-macros
- latexmk
- multirow
- morefloats
- framed
- titling
- xstring
- pgfopts
- inconsolata
- chicago
- courier
- helvetic

Even on Sierra, you need to run tlmgr as root, using:

sudo /usr/local/texlive/2016basic/bin/x86_64-darwin/tlmgr install algorithms

I also needed to install Ghostscript from here, otherwise epstopdf doesn’t work. As with MS Word, you can generate LaTeX programmatically using knitr, but you still need to have pdfLatex installed to compile the PDF. I still use JabRef to manage all of my BibTeX files, since I haven’t found anything that does the job better. The installer for JabRef isn’t signed, so Gatekeeper will complain when you try to install it.

Filesystem      Size   Used  Avail Capacity  Mounted on
/dev/disk1     465Gi   54Gi  411Gi      12%  /
devfs          182Ki  182Ki    0Bi     100%  /dev
map -hosts       0Bi    0Bi    0Bi     100%  /net
map auto_home    0Bi    0Bi    0Bi     100%  /home

No major changes from my previous list. I’ve dropped Stata since the R package foreign does a good enough job of importing .dta files. Instead, I’ve got Mathematica installed to help out with my poor algebra skills. I did consider getting a 15in MacBook Pro instead of the Air, which could have had up to 16GB of RAM and 2TB solid state drive. The Thunderbolt 3 (USB-C) ports and touch bar would have been annoying, but the real deal breaker was Radeon instead of nVidia graphics. If I couldn’t use CUDA, then it just wasn’t worth the extra expense (plus, I’m not sure I’d trust myself with such an expensive laptop!). I’ve got a GeForce GTX 970M in my Windows machine if I need to run any CUDA code. ASUS laptops offer a lot more bang for your buck, but Mac OS is still nicer to use (particularly for developing software and writing academic papers). I think it’s well worth the extra money, if you can afford it.


Shortly after I wrote that original post, Reid F. Thompson made his R package **RadOnc** available on CRAN.

The example DICOM-RT image used in this code is available from here. The code runs fine on Windows, but in order to run it on OS X you will need to install an X-Windows server such as XQuartz. I’m currently running Mac OS 10.9.5 (**Mavericks**) with XQuartz 2.7.11. You will require administrator privileges to install this on your Mac. If you get an error message about `/opt/X11/lib/libGLU.1.dylib`, that will be why.

library(rgl)
library(RadOnc)

`## Loading required package: geometry`

`## Loading required package: magic`

`## Loading required package: abind`

`## Loading required package: oro.dicom`

`## ## oro.dicom: Rigorous - DICOM Input / Output (version = 0.5.0)`

`## Loading required package: ptinpoly`

`## Loading required package: misc3d`

rtstruct <- read.DICOM.RT(path="~/Downloads/ITK4SampleDCMRT/", modality="MR", verbose=TRUE, DVH=FALSE)

`## Reading 61 DICOM files from path: '~/Downloads/ITK4SampleDCMRT/' ... FINISHED`

`## Extracting MR data ... FINISHED [60 slices, 1.5x1.5x3.0mm res]`

`## Warning in read.DICOM.RT(path = "~/Downloads/ITK4SampleDCMRT/", verbose =`

`## TRUE, : Unable to extract DVH data from DICOM-RT (no dose grid available)`

`## Reading structure set from file: '/Users/stsrjs/Downloads/ITK4SampleDCMRT//RS.1.2.246.352.71.4.886768594.5257.20090622110825.dcm' ... (7 structures identified)`

`## Warning in read.DICOM.RT(path = "~/Downloads/ITK4SampleDCMRT/", verbose =`

`## TRUE, : Structure(s) 'GS1', 'GS2', 'GS3' are empty`

`## FINISHED`

`## Processing (7) structures:`

`## JP: GS1 [EMPTY] ... FINISHED`

`## JP: MRI_RECTUM_RT [32 axial slice(s), 5784 point(s)] ... FINISHED`

`## JP: MRI_BONEOUTER_RT [182 axial slice(s), 77052 point(s)] ... FINISHED`

`## JP: GS2 [EMPTY] ... FINISHED`

`## JP: GS3 [EMPTY] ... FINISHED`

`## JP: MRI_BLADDER_RT [23 axial slice(s), 11058 point(s)] ... FINISHED`

`## JP: MRI_PROSTATE_RO [12 axial slice(s), 3024 point(s)] ... FINISHED`

print(rtstruct)

`## [1] "RT data '~/Downloads/ITK4SampleDCMRT/' containing CT image (256x256x60), 7 structure(s)"`

print(rtstruct$structures)

`## [1] "List containing 7 structure3D objects (GS1 JP, MRI_RECTUM_RT JP, MRI_BONEOUTER_RT JP, GS2 JP, GS3 JP, MRI_BLADDER_RT JP, MRI_PROSTATE_RO JP)"`

plot(rtstruct$structures[[3]])

The interactive 3D visualisation using **rgl** is very nice, but it would also be helpful to be able to scroll through the axial slices one at a time. For example, see my earlier post about **misc3d**:

library(tkrplot)

`## Loading required package: tcltk`

slices3d(rtstruct$CT)

`## <environment: 0x7fa20396f898>`

The `slices3d` function does support overlays, so this seems like it would be fairly straightforward to do.

The following code demonstrates how to calculate the Hausdorff distance between the bladder and prostate and plot a visualisation of the contours:

compareStructures(rtstruct$structures[c(6,7)], method="hausdorff", hausdorff.method="mean")

`## Analyzing structure 1/2 (MRI_BLADDER_RT JP) ...`

`## FINISHED`

`## Analyzing structure 2/2 (MRI_PROSTATE_RO JP) ... FINISHED`

`##                    MRI_BLADDER_RT JP MRI_PROSTATE_RO JP`

`## MRI_BLADDER_RT JP            0.00000           29.46258`

`## MRI_PROSTATE_RO JP          29.46258            0.00000`
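To make the "mean" variant of the Hausdorff distance concrete: for each point on one contour, find the distance to the nearest point on the other, then average the two directed means. Here is a minimal sketch of that idea in Python (for illustration only — `mean_hausdorff` is a hypothetical helper, not RadOnc's implementation, which may differ in detail):

```python
import numpy as np

def mean_hausdorff(A, B):
    """Mean Hausdorff distance between two point sets A and B (n x d arrays).

    Computes the pairwise distance matrix, takes the nearest-neighbour
    distance from each point of A to B and from each point of B to A,
    and averages the two directed means.
    """
    # D[i, j] = Euclidean distance between A[i] and B[j], via broadcasting
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())

# Example: one point at the origin vs one point at (3, 4)
print(mean_hausdorff(np.array([[0.0, 0.0]]), np.array([[3.0, 4.0]])))  # 5.0
```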

Previously, the best way to quantify volumetric differences would have been to use something like the SlicerRT extension for 3D Slicer. However, you then still needed to import that quantification into R for statistical modelling. Being able to do everything in R makes it much easier to follow best practices for reproducible research (such as this blog post).
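Besides distance measures, volumetric overlap is often summarised with the Dice coefficient (Dice, 1945; see references), defined as 2|A ∩ B| / (|A| + |B|) over the two segmented volumes. A minimal sketch on binary voxel masks, again in Python purely for illustration (`dice_coefficient` is a hypothetical helper, not part of RadOnc):

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two boolean volumes:
    2*|A intersect B| / (|A| + |B|). Returns 1.0 for identical
    non-empty masks and 0.0 for disjoint ones."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    # Convention: two empty masks are treated as a perfect match
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

# Example: half-overlapping masks give a coefficient of 0.5
print(dice_coefficient([1, 1, 0, 0], [0, 1, 1, 0]))  # 0.5
```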

It is also possible to plot the contours, slice by slice:

compareStructures(rtstruct$structures[c(6,7)], method="axial", plot=TRUE)

Admittedly, this is quite an artificial example, since the usual use case would be to compare two contours of the same patient. For example, to quantify volumetric changes over time. The **RadOnc** package has an excellent vignette that goes into more detail than what I’ve done here. There is also the 2014 journal article by Reid Thompson (details below). My original code is useful if you want to study the DICOM standard itself, but of course you need to use version 0.3-7 of **oro.dicom** to get it to work. The R package **RadOnc** is actively maintained and compatible with the latest versions of other packages on CRAN. It also offers additional features that are very handy for anyone who wants to perform statistical analysis of radiotherapy contours and dosimetry.

Dice, L.R. (1945) “Measures of the amount of ecologic association between species” *Ecology* **26**(3): 297-302.

Dowling, J.; Malaterre, M.; Greer, P.B. & Salvado, O. (2009) “Importing Contours from DICOM-RT Structure Sets” *The Insight Journal*

Hargrave, C.E.; Mason, N.; Guidi, R.; Miller, J-A; Becker, J.; Moores, M.T.; Mengersen, K.; Poulsen, M. & Harden, F. (2016) “Automated replication of cone beam CT-guided treatments in the Pinnacle³ treatment planning system for adaptive radiotherapy” *J. Med. Rad. Sci.* **63**(1): 48-58.

Hausdorff, F. (1914) “*Grundzüge der Mengenlehre*” Veit and Company, Leipzig.

Moores, M.T.; Hargrave, C.E.; Deegan, T.; Poulsen, M.; Harden, F. & Mengersen, K. (2015) “An external field prior for the hidden Potts model with application to cone-beam computed tomography” *Comput. Stat. Data Anal.* **86**: 27-41.

Pinter, C.; Lasso, A.; Wang, A.; Jaffray, D. & Fichtinger, G. (2012) “SlicerRT: Radiation therapy research toolkit for 3D Slicer” *Med. Phys.* **39**(10): 6332-6338.

Thompson, R.F. (2014) “RadOnc: An R Package for Analysis of Dose-Volume Histogram and Three-Dimensional Structural Data” *J. Rad. Onc. Info.* **6**(1): 98-100.

Whitcher, B.; Schmid, V.J. & Thornton, A. (2011) “Working with the DICOM and NIfTI Data Standards in R” *J. Statist. Soft.* **44**(6): 1-28.
