Tuesday, June 25 at 5:15 for a 6pm start

College of Business and Economics, Australian National University, Canberra, ACT

The planned Mars 2020 mission to Jezero Crater will include a rover equipped with 2 Raman spectrometers: SuperCam and SHERLOC. This would be the first time that this type of spectroscopy has been performed on the Martian surface, which will enable new kinds of analysis of minerals and organic molecules. In the meantime, the Mars Science Laboratory Curiosity rover continues to build on the massive dataset of laser-induced breakdown spectroscopy (LIBS) that it has been accumulating since 2012. These data pose particular challenges for statistical signal processing, since pre-flight calibration on Earth can only approximate Martian environmental conditions. Analytical methods must be robust to artefacts and other changes in the spectral profile, such as nonlinear interactions between signals. This talk will introduce a Bayesian method for source separation of spectroscopy. We derive informative priors from online databases of known reference spectra, as well as quantum-mechanical computer models. The components of the combined spectrum are identified and quantified using a sequential Monte Carlo algorithm. An open-source implementation of our method is available in the R package ‘**serrsBayes**.’

Tuesday, July 9 at 12pm

Business School, University of Technology, Sydney, NSW

The Potts model is commonly used for classification, where the labels are spatially correlated. The strength of spatial association is governed by a smoothing parameter, known as the inverse temperature. A difficulty arises from the dependence of an intractable normalising constant on the value of this parameter, and thus there is no closed-form solution for sampling from the posterior distribution directly. There are a variety of Markov chain Monte Carlo methods for sampling from the posterior without evaluating the normalising constant, including the exchange algorithm and approximate Bayesian computation (ABC). A serious drawback of these algorithms is that they do not scale well for models with a large state space, such as images with a million or more pixels. In this talk, I will introduce the parametric functional approximate Bayesian (PFAB) algorithm, which uses an integral curve to approximate the score function of the Potts model. PFAB incorporates known properties of the likelihood, such as heteroskedasticity and critical temperature. I will demonstrate this method using synthetic data as well as remotely sensed imagery from the Landsat-8 satellite. The proposed algorithm achieves up to a hundredfold improvement in the elapsed runtime, compared to the exchange algorithm or ABC. An open source implementation of PFAB is available in the R package ‘**bayesImageS**.’

We are pleased to announce two upcoming talks by Dr Anthony Lee (Senior Lecturer from the University of Bristol): Tuesday, July 2 at QUT and Thursday, July 18 at Monash University. The call for abstracts has now opened for Bayes on the Beach. 250-word abstracts can be submitted by email to bob.admin@qut.edu.au before August 16. We also mention some other upcoming conferences: MCM 2019, July 8-12 in Sydney; EAC-ISBA 2019, July 13-14 in Kobe, Japan; BayesComp 2020, January 7-10 in Florida, USA; and ABC in Grenoble, March 19-20 in France.

There are now two example Raman spectra available in the R package, methanol and TAMRA, as well as a new vignette.

The vignette explains the main differences between the 3 functions fitSpectraMCMC, fitSpectraSMC, and fitVoigtPeaksSMC, and how to choose which function to use in a given situation. The methanol spectrum was kindly provided by my co-author, Professor Karen Faulds. It only has 4 (maybe 5?) peaks, so it is a bit easier to see what is going on than the TAMRA example in the first vignette.

I’m currently working on a new function that uses the iterated batch importance sampling (IBIS) algorithm of Chopin (2002). Expect that to be available in the R package later in the year.

]]>

**When:** 11am, Wednesday December 5

**Where:** Building 39A, Room 208, University of Wollongong, NSW (main campus)

**Speaker:** A/Prof Mirko Draca, Department of Economics, University of Warwick, UK

**Abstract:**

Strong evidence has been emerging that major democracies have become more politically polarized, at least according to measures based on the ideological positions of political elites. We ask: have the general public (‘citizens’) followed the same pattern? Our approach is based on unsupervised machine learning models as applied to issue-position survey data. This approach first indicates that coherent, latent ideologies are strongly apparent in the data, with a number of major, stable types that we label as: Liberal Centrist, Conservative Centrist, Left Anarchist and Right Anarchist. Using this framework, and a resulting measure of ‘citizen slant’, we are then able to decompose the shift in ideological positions across the population over time. Specifically, we find evidence of a ‘disappearing center’ in a range of countries, with citizens shifting away from centrist ideologies into anti-establishment ‘anarchist’ ideologies over time. This trend is especially pronounced for the US.

This is joint work with Carlo Schwarz (University of Warwick).

**When:** Wednesday December 12

**Where:** P504, QUT Gardens Point Campus, George St, Brisbane QLD

**Speaker:** Dr Matt Moores, Lecturer in Statistical Science, University of Wollongong

**Abstract:**

The hidden Potts model can be used for image segmentation, where the pixels are assumed to be noisy observations of some hidden states. The inverse temperature parameter governs the strength of spatial cohesion between neighbours in the image lattice. A difficulty arises from the dependence of an intractable normalising constant on the value of this parameter, and thus there is no closed-form solution for sampling from the posterior distribution directly. There are a variety of computational approaches for sampling from the posterior without evaluating the normalising constant, including the exchange algorithm and approximate Bayesian computation (ABC). A serious drawback of these algorithms is that they do not scale well for models with a large state space, such as images with a million or more pixels. In this talk, I will introduce the parametric functional approximate Bayesian (PFAB) algorithm, which uses an integral curve to approximate the score function. PFAB incorporates known properties of the likelihood, such as heteroskedasticity and critical temperature. I will demonstrate this method using synthetic data as well as remotely sensed imagery from the Landsat-8 satellite. The proposed algorithm achieves up to a hundredfold improvement in the elapsed runtime, compared to the exchange algorithm or ABC. An open source implementation of PFAB is available in the R package ‘**bayesImageS**’.

This is joint work with Kerrie Mengersen & Tony Pettitt (QUT) and Geoff Nicholls (Oxford).

]]>

The spectral signature of a molecule can be predicted using a quantum-mechanical model, such as time-dependent density functional theory (TD-DFT). However, there are no uncertainty estimates associated with these predictions, and matching with peaks in observed spectra is often performed by eye. This talk introduces a model-based approach for baseline estimation and peak fitting, using TD-DFT predictions as an informative prior. The peaks are modelled as a mixture of Lorentzian, Gaussian, or pseudo-Voigt broadening functions, while the baseline is represented as a penalised cubic spline. We fit this model using a sequential Monte Carlo (SMC) algorithm, which is robust to local maxima and enables the posterior distribution to be incrementally updated as more data becomes available. We apply our method to multivariate calibration of Raman-active dye molecules, enabling us to estimate the limit of detection (LOD) of each peak.

]]>

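The Potts (1952) model is an example of a Gibbs random field on a regular lattice, where each node $i$ can take values in the set $z_i \in \{1, \dots, q\}$. The Ising model can be viewed as a special case, when *q*=2. The size of the configuration space $\#\mathcal{Z}$ is therefore $q^n$, where *n* is the number of nodes. The dual lattice $\mathcal{E}$ defines undirected edges between neighbouring nodes $i \sim \ell$. If the nodes in a 2D lattice with *c* columns are indexed row-wise, the nearest (first-order) neighbours of node $i$ are $\{i-c,\; i-1,\; i+1,\; i+c\}$, except at the boundary. Nodes situated on the boundary of the domain have fewer than four neighbours. The total number of unique edges is thus $\#\mathcal{E} = 2(n - \sqrt{n})$ for a square lattice, or $\#\mathcal{E} = 2n - r - c$ if the lattice is rectangular with *r* rows and *c* columns.

The sufficient statistic of the Potts model is the sum of all like neighbour pairs:

$$S(\mathbf{z}) = \sum_{i \sim \ell \in \mathcal{E}} \delta(z_i, z_\ell)$$

where $\delta(a, b)$ is the Kronecker delta function, which equals 1 if a = b and equals 0 otherwise. $S(\mathbf{z})$ ranges from 0, when all of the nodes form a chequerboard pattern, to $\#\mathcal{E}$ when all of the nodes have the same value. The likelihood of the Potts model is thus:

$$p(\mathbf{z} \mid \beta) = \frac{\exp\{\beta\, S(\mathbf{z})\}}{\mathcal{C}(\beta)}$$

The normalising constant of the Potts model is intractable for any non-trivial lattice, since it requires a sum over the configuration space:

$$\mathcal{C}(\beta) = \sum_{\mathbf{z} \in \mathcal{Z}} \exp\{\beta\, S(\mathbf{z})\}$$

When the inverse temperature $\beta = 0$, $p(\mathbf{z} \mid \beta)$ simplifies to $q^{-n}$, hence the labels are independent and uniformly distributed.

**Theorem 1.** The sum over configuration space of the sufficient statistic of the q-state Potts model on a rectangular 2D lattice is $\sum_{\mathbf{z} \in \mathcal{Z}} S(\mathbf{z}) = q^{n-1}\, \#\mathcal{E}$.

**Proof.** For a *q*=2 state Potts model on a lattice with *n*=4 nodes and $\#\mathcal{E} = 4$ edges, $\mathcal{Z}$ contains 16 possible configurations: 2 where all four edges join like neighbours, 12 with two like pairs, and 2 chequerboards with none. Hence $\sum_{\mathbf{z} \in \mathcal{Z}} S(\mathbf{z}) = 2 \times 4 + 12 \times 2 + 2 \times 0 = 32$. This can also be written as $q^{n-1}\, \#\mathcal{E} = 2^3 \times 4$.

Now consider a rectangular lattice with *r* > 1 rows and *c* > 1 columns, so that $n = rc$ and the dual lattice $\#\mathcal{E} = 2rc - r - c$. The size of the configuration space is $q^{rc}$. Assume that the sum over configuration space is equal to $q^{rc-1}(2rc - r - c)$. This sum can be decomposed into $q^{rc-1}\, r(c-1)$ for the edges within the rows, plus $q^{rc-1}\, c(r-1)$ for the edges between rows.

If this lattice is extended by adding another row (or equivalently, another column), then $n' = (r+1)c$ (or otherwise, $r(c+1)$) and the dual lattice $\#\mathcal{E}' = 2(r+1)c - (r+1) - c = \#\mathcal{E} + 2c - 1$. The nodes in this new row can take $q^c$ possible values, so the size of the configuration space is now $q^{(r+1)c}$. $\sum_{\mathbf{z}} S(\mathbf{z})$ will increase proportional to $(c-1)$ for the new row, plus $c$ for the connections with its adjacent row:

$$\sum_{\mathbf{z} \in \mathcal{Z}'} S(\mathbf{z}) = q^{c}\, q^{rc-1}(2rc - r - c) + q^{(r+1)c-1}(c - 1) + q^{(r+1)c-1}\, c = q^{(r+1)c-1}\left(2(r+1)c - (r+1) - c\right)$$

Q.E.D.

**Theorem 2.** The expectation of the sufficient statistic of the q-state Potts model on a rectangular 2D lattice is $\mathbb{E}[S(\mathbf{z})] = \#\mathcal{E} / q$ when the inverse temperature $\beta = 0$.

**Proof.** The proof follows from Theorem 1 by noting that $p(\mathbf{z} \mid \beta = 0) = q^{-n}$ and hence:

$$\mathbb{E}_{\mathbf{z} \mid \beta = 0}[S(\mathbf{z})] = \sum_{\mathbf{z} \in \mathcal{Z}} S(\mathbf{z})\, q^{-n} = \frac{q^{n-1}\, \#\mathcal{E}}{q^{n}} = \frac{\#\mathcal{E}}{q}$$

Q.E.D.

**Theorem 3.** The sum over configuration space of the square of the sufficient statistic of the q-state Potts model on a rectangular 2D lattice is $\sum_{\mathbf{z} \in \mathcal{Z}} S(\mathbf{z})^2 = q^{n-2}\, \#\mathcal{E}\left(\#\mathcal{E} + q - 1\right)$.

**Proof.** For a *q*=2 state Potts model on a lattice with *n*=4 nodes and $\#\mathcal{E} = 4$ edges, $\sum_{\mathbf{z} \in \mathcal{Z}} S(\mathbf{z})^2 = 2 \times 4^2 + 12 \times 2^2 + 2 \times 0^2 = 80$. This can also be written as $q^{n-2}\, \#\mathcal{E}\left(\#\mathcal{E} + q - 1\right) = 2^2 \times 4 \times 5$.

Now assume for a rectangular lattice with *r* > 1 rows and *c* > 1 columns that

$$\sum_{\mathbf{z} \in \mathcal{Z}} S(\mathbf{z})^2 = q^{rc-2}\left(2rc - r - c\right)\left(2rc - r - c + q - 1\right)$$

This can be decomposed into $q^{rc-2}\, \#\mathcal{E}^2 + q^{rc-2}(q-1)\, \#\mathcal{E}$.

If we extend the lattice by adding another row, then the existing sum is multiplied by $q^c$, and the $2c - 1$ new edges contribute a cross term with the existing edges plus their own sum of squares:

$$\sum_{\mathbf{z} \in \mathcal{Z}'} S(\mathbf{z})^2 = q^{(r+1)c-2}\left[\#\mathcal{E}^2 + (q-1)\,\#\mathcal{E} + 2(2c-1)\,\#\mathcal{E} + (2c-1)(2c+q-2)\right] = q^{(r+1)c-2}\, \#\mathcal{E}'\left(\#\mathcal{E}' + q - 1\right)$$

where $\#\mathcal{E}' = \#\mathcal{E} + 2c - 1$.

Q.E.D.

**Theorem 4.** The variance of the sufficient statistic of the q-state Potts model on a rectangular 2D lattice is $\mathbb{V}[S(\mathbf{z})] = \#\mathcal{E}\,(q-1) / q^2$ when the inverse temperature $\beta = 0$.

**Proof.** The proof follows from Theorems 1 and 3:

$$\mathbb{V}_{\mathbf{z} \mid \beta = 0}[S(\mathbf{z})] = \frac{q^{n-2}\, \#\mathcal{E}\left(\#\mathcal{E} + q - 1\right)}{q^{n}} - \left(\frac{\#\mathcal{E}}{q}\right)^2 = \frac{\#\mathcal{E}\,(q-1)}{q^2}$$

Q.E.D.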
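The closed-form sums can be checked numerically on lattices small enough to enumerate. The following is a minimal sketch in base R, not part of any package: the function name `potts_moments` and its arguments are hypothetical. It lists all $q^n$ labellings, counts like neighbour pairs, and compares the totals with $q^{n-1} E$ and $q^{n-2} E (E + q - 1)$, where $E$ is the number of edges.

```r
# Brute-force enumeration of a q-state Potts model on an nr x nc lattice.
# Only feasible for tiny lattices, since there are q^(nr*nc) configurations.
potts_moments <- function(nr, nc, q) {
  stopifnot(nr > 1, nc > 1, q >= 2)
  n <- nr * nc
  idx <- matrix(seq_len(n), nrow = nr, ncol = nc, byrow = TRUE)  # row-wise node index
  # Edge list: each node paired with its right-hand and lower neighbour
  horiz <- cbind(as.vector(idx[, -nc, drop = FALSE]), as.vector(idx[, -1, drop = FALSE]))
  vert  <- cbind(as.vector(idx[-nr, , drop = FALSE]), as.vector(idx[-1, , drop = FALSE]))
  edges <- rbind(horiz, vert)
  # One row per configuration, one column per node
  configs <- as.matrix(expand.grid(rep(list(seq_len(q)), n)))
  # Sufficient statistic: number of like neighbour pairs in each configuration
  S <- rowSums(configs[, edges[, 1], drop = FALSE] == configs[, edges[, 2], drop = FALSE])
  list(n_edges = nrow(edges), sum_S = sum(S), sum_S2 = sum(S^2),
       mean_S = mean(S), var_S = mean(S^2) - mean(S)^2)
}

m <- potts_moments(2, 3, 3)            # 729 configurations, 7 edges
E <- m$n_edges
c(enumerated = m$sum_S,  closed_form = 3^(6 - 1) * E)
c(enumerated = m$sum_S2, closed_form = 3^(6 - 2) * E * (E + 3 - 1))
```

Because every configuration has probability $q^{-n}$ when the inverse temperature is zero, `mean_S` and `var_S` are the corresponding expectation and variance of the sufficient statistic.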

]]>

There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm, approximate Bayesian computation (ABC), thermodynamic integration, and composite likelihood. These approaches vary in accuracy as well as scalability for datasets of significant size. The Potts model is an example where such methods are required, due to its intractable normalising constant. This model is a type of Markov random field, which is commonly used for image segmentation. The dimension of its parameter space increases linearly with the number of pixels in the image, making this a challenging application for scalable Bayesian computation. My talk will introduce various algorithms in the context of the Potts model and describe their implementation in C++, using OpenMP for parallelism. I will also discuss the process of releasing this software as an open source R package on the CRAN repository.

]]>

- Google Chrome
- Dropbox
- TeX Live 2018 (including TeXworks)
- Java SE JDK 8u171 with NetBeans 8.2 IDE (64 bit)
- JabRef (64 bit)
- Microsoft R Open 3.5.0 (including MKL)
- Rtools34
- RStudio Desktop 1.1.453
- Microsoft Office Home & Student 2016 (64 bit)
- IBM SPSS Statistics 23
- Adobe Acrobat Reader DC
- PuTTY 0.7 (64 bit)
- WinSCP 5.13.3
- Git 2.18.0 for Windows (64 bit)

With all of this installed and my Dropbox synced, I have 197 GB of free space on my 446 GB solid-state drive.

]]>

This will be my farewell tour of the UK, as I’ll be relocating back to Australia after an amazing four years as a postdoc at the University of Warwick. After UseR!, I’ll be taking up a lectureship in the School of Mathematics and Statistics and the National Institute for Applied Statistics Research Australia (NIASRA) at the University of Wollongong.

ABC in Edinburgh, Sunday June 24

The inverse temperature parameter of the Potts model governs the strength of spatial cohesion and therefore has a major influence over the resulting model fit. A difficulty arises from the dependence of an intractable normalising constant on the value of this parameter, and thus there is no closed-form solution for sampling from the posterior distribution directly. There are a variety of computational approaches for sampling from the posterior without evaluating the normalising constant, including the exchange algorithm and approximate Bayesian computation (ABC). A serious drawback of these algorithms is that they do not scale well for models with a large state space, such as images with a million or more pixels. We introduce a parametric surrogate model, which approximates the score function using an integral curve. Our surrogate model incorporates known properties of the likelihood, such as heteroskedasticity and critical temperature. We demonstrate this method using synthetic data as well as remotely sensed imagery from the Landsat-8 satellite. We achieve up to a hundredfold improvement in the elapsed runtime, compared to the exchange algorithm or ABC. An open source implementation of our algorithm is available in the R package ‘**bayesImageS**’.

Moores, Pettitt & Mengersen (2015; v2 2018) “Scalable Bayesian inference for the inverse temperature of a hidden Potts model” arXiv:1503.08066 [stat.CO]

ISBA World Meeting, University of Edinburgh, Monday June 25

Raman spectroscopy can be used to identify molecules by the characteristic scattering of light from a laser. Each Raman-active dye label has a unique spectral signature, comprising the locations and amplitudes of the peaks. The presence of a large, non-uniform background presents a major challenge to analysis of these spectra. We introduce a sequential Monte Carlo (SMC) algorithm to separate the observed spectrum into a series of peaks plus a smoothly varying baseline, corrupted by additive white noise. The peaks are modelled as Lorentzian, Gaussian, or pseudo-Voigt functions, while the baseline is estimated using a penalised cubic spline. Our model-based approach accounts for differences in resolution and experimental conditions. We incorporate prior information to improve identifiability and regularise the solution. By utilising this representation in a Bayesian functional regression, we can quantify the relationship between molecular concentration and peak intensity, resulting in an improved estimate of the limit of detection. The posterior distribution can be incrementally updated as more data becomes available, resulting in a scalable algorithm that is robust to local maxima. These methods have been implemented as an R package, using RcppEigen and OpenMP.

Moores, Gracie, Carson, Faulds, Graham & Girolami (2016; v2 2018) “Bayesian modelling and quantification of Raman spectroscopy” arXiv:1604.07299 [stat.AP]

]]>

Depending on your configuration, you might need to edit the following file:

/Library/Frameworks/R.framework/Resources/etc/Makeconf

and change this line:

MAIN_LDFLAGS = -fopenmp

to something like this (depending where you installed CUDA):

MAIN_LDFLAGS = -L/usr/local/cuda/lib

This fixes the following error from nvcc:

    ** arch -
    /usr/local/cuda/bin/nvcc -shared -fopenmp -L/usr/local/lib -F/Library/Frameworks/R.framework/.. -framework R -lpcre -llzma -lbz2 -lz -licucore -lm -liconv -lpcre -llzma -lbz2 -lz -licucore -lm -liconv -lcublas -lnvrtc -lcuda rinterface.o mi.o sort.o granger.o qrdecomp.o correlation.o hcluster.o distance.o matmult.o lsfit.o kendall.o cuseful.o -o gputools.so
    nvcc fatal : Unknown option 'fopenmp'
    make: *** [gputools.so] Error 1
    ERROR: compilation failed for package ‘gputools’

**Note**: this is probably why the package was removed from CRAN…

You might also need to edit ~/.R/Makevars if you followed my previous instructions on how to compile parallel OpenMP code on macOS.

There is a second line that also causes problems with nvcc:

LIBR = -F/Library/Frameworks/R.framework/.. -framework R

Thanks to this post on StackExchange, which references this post in the NVIDIA forum, this line should be changed to:

LIBR = -Xlinker -framework,R

Finally, remember to set the following environment variables:

    export CUDA_HOME=/usr/local/cuda
    export DYLD_LIBRARY_PATH=/usr/local/cuda/lib/:$DYLD_LIBRARY_PATH

**Final note**: system-wide changes to Makeconf are generally a *very* bad idea. The instructions above are likely to break compilation for any other (non-CUDA) R packages. Therefore, I would recommend reverting all of these changes once **gputools** has been successfully installed. Alternatively, you might want to investigate other R packages that provide CUDA support…
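One way to make the revert step less error-prone is to keep a copy of the file before touching it. The sketch below uses only base R file utilities; the helper names `backup_config` and `restore_config` and the `.bak` suffix are hypothetical choices, not part of R or **gputools**, and writing to the system-wide Makeconf may require administrator rights.

```r
# Keep a copy of a build configuration file before editing it by hand.
backup_config <- function(path) {
  bak <- paste0(path, ".bak")
  if (!file.exists(bak)) {              # never overwrite an existing backup
    stopifnot(file.copy(path, bak))
  }
  invisible(bak)
}

# Put the original file back and remove the backup copy.
restore_config <- function(path) {
  bak <- paste0(path, ".bak")
  stopifnot(file.exists(bak))
  stopifnot(file.copy(bak, path, overwrite = TRUE))
  unlink(bak)
  invisible(path)
}

# Location of the system-wide Makeconf for the running R installation
makeconf <- file.path(R.home("etc"), "Makeconf")
# backup_config(makeconf)     # before editing MAIN_LDFLAGS and LIBR
# ... edit the file, then install gputools ...
# restore_config(makeconf)    # afterwards, so that other packages still compile
```

The same pair of calls can be applied to `~/.R/Makevars` if that file was changed as well.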

More details about the model and SMC algorithm are available in my preprint on arXiv (Moores et al., 2016; v2 2018). The following gives an example of applying **serrsBayes** to surface-enhanced Raman spectroscopy (SERS) from a previous paper (Gracie et al., 2016).

This is a type of functional data analysis (Ramsay et al., 2009), since the discretised spectrum is represented using latent (unobserved), continuous functions. The background fluorescence is estimated using a penalised B-spline (Wood, 2017), while the peaks can be modelled as Gaussian, Lorentzian, or pseudo-Voigt functions.

The Voigt function is a *convolution* of a Gaussian and a Lorentzian: $f_V(\nu_j) = (f_G * f_L)(\nu_j)$. It has an additional parameter $\eta_p$ that equals 0 for pure Gaussian and 1 for Lorentzian:

$$s(\nu_j) = \sum_{p=1}^{P} A_p\, f_V(\nu_j;\, \ell_p, \varphi_p, \eta_p)$$

where $A_p$ is the amplitude of peak $p$; $\ell_p$ is the peak location; and $\varphi_p$ is the broadening. The horizontal axis of a Raman spectrum is measured in wavenumbers $\nu_j$, with units of inverse centimetres ($\mathrm{cm}^{-1}$). The vertical axis is measured in arbitrary units (a.u.), since the intensity of the Raman signal depends on the properties of the spectrometer.
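To make the role of the mixing parameter concrete, here is a minimal base-R sketch of a pseudo-Voigt peak written as a weighted sum of a Gaussian and a Lorentzian with a shared width. This is a simplifying assumption for illustration: the function names, the shared `width` argument and the unit-height scaling are hypothetical, and the parameterisation used inside **serrsBayes** may differ.

```r
# Pseudo-Voigt peak: eta = 0 gives a pure Gaussian, eta = 1 a pure Lorentzian.
pseudo_voigt <- function(nu, location, width, eta, amplitude = 1) {
  stopifnot(eta >= 0, eta <= 1, width > 0)
  d <- nu - location
  gauss   <- exp(-d^2 / (2 * width^2))     # Gaussian with height 1 at the location
  lorentz <- width^2 / (d^2 + width^2)     # Lorentzian with height 1 at the location
  amplitude * (eta * lorentz + (1 - eta) * gauss)
}

# Spectral signature: the sum of all peaks, evaluated at each wavenumber.
spectrum_signature <- function(nu, locations, widths, etas, amplitudes) {
  peaks <- mapply(function(l, w, e, a) pseudo_voigt(nu, l, w, e, a),
                  locations, widths, etas, amplitudes)
  rowSums(peaks)                            # one column per peak
}

nu <- seq(600, 1800, by = 0.5)              # wavenumbers in inverse centimetres
sig <- spectrum_signature(nu, locations = c(1000, 1500), widths = c(10, 20),
                          etas = c(0, 1), amplitudes = c(5, 3))
plot(nu, sig, type = "l", xlab = "Raman offset", ylab = "Intensity (a.u.)")
```

The Lorentzian peak at 1500 has visibly heavier tails than the Gaussian peak at 1000, which is why the choice of broadening function affects how much signal is attributed to the baseline.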

We can download some SERS spectra in a zip file:

    tmp <- tempfile()
    download.file("https://pure.strath.ac.uk/portal/files/43595106/Figure_2.zip", tmp)
    tmp2 <- unzip(tmp, "Figure 2/T20 SERS spectra/T20_1_ REP1 Well_A1.SPC")

trying URL 'https://pure.strath.ac.uk/portal/files/43595106/Figure_2.zip'

downloaded 270 KB

This data is in the binary SPC file format used by Grams/AI. Fortunately, we can use the R package **hyperSpec** to read this file and plot the spectrum:

    library(hyperSpec)
    spcT20 <- read.spc(tmp2)
    plot(spcT20[1,], col=4, wl.range=600~1800,
         title.args=list(main="Raman Spectrum of TAMRA+DNA"))
    spectra <- spcT20[1,,600~1800]
    wavenumbers <- wl(spectra)
    nWL <- length(wavenumbers)

We will use the same priors that were described in the paper (Moores et al., 2016), including the TD-DFT peak locations from Watanabe et al. (2005):

    peakLocations <- c(615, 631, 664, 673, 702, 705, 771, 819, 895, 923, 1014,
                       1047, 1049, 1084, 1125, 1175, 1192, 1273, 1291, 1307,
                       1351, 1388, 1390, 1419, 1458, 1505, 1530, 1577, 1601,
                       1615, 1652, 1716)
    nPK <- length(peakLocations)
    priors <- list(loc.mu=peakLocations, loc.sd=rep(50,nPK),
                   scaG.mu=log(16.47) - (0.34^2)/2, scaG.sd=0.34,
                   scaL.mu=log(25.27) - (0.4^2)/2, scaL.sd=0.4,
                   noise.nu=5, noise.sd=50, bl.smooth=1, bl.knots=121)
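The expressions of the form `log(m) - s^2/2` in the priors look like the usual way of choosing the log-scale location of a log-normal distribution so that its mean on the original scale is `m`. Assuming that interpretation (the helper name `lognormal_mu` is hypothetical), the arithmetic can be checked in base R:

```r
# Log-scale location that gives a log-normal distribution the requested mean.
lognormal_mu <- function(mean, sdlog) log(mean) - sdlog^2 / 2

mu_G <- lognormal_mu(16.47, 0.34)     # same expression as scaG.mu
mu_L <- lognormal_mu(25.27, 0.40)     # same expression as scaL.mu

# The mean of a log-normal is exp(mu + sd^2 / 2), which recovers the targets
exp(mu_G + 0.34^2 / 2)                # 16.47, up to floating-point error
exp(mu_L + 0.40^2 / 2)                # 25.27, up to floating-point error

# Monte Carlo check of the first value using simulated draws
set.seed(42)
draws <- rlnorm(1e5, meanlog = mu_G, sdlog = 0.34)
mean(draws)
```

Without the `- s^2/2` correction, the prior mean on the original scale would be larger than the value inside the logarithm.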

Now we run the SMC algorithm to fit the model:

    library(serrsBayes)
    tm <- system.time(result <- fitVoigtPeaksSMC(wavenumbers, as.matrix(spectra), priors, npart=2000))
    result$time <- tm
    save(result, file="Figure 2/result.rda")

[1] "SMC with 1 observations at 1 unique concentrations, 2000 particles, and 2401 wavenumbers."

[1] "Step 0: computing 125 B-spline basis functions (r=10) took 0.28sec."

[1] "Mean noise parameter sigma is now 60.3304671005565"

[1] "Mean spline penalty lambda is now 1"

[1] "Step 1: initialization for 32 Voigt peaks took 24.959 sec."

[1] "Reweighting took 1.208sec. for ESS 1800.80025019536 with new kappa 0.00096893310546875."

[1] "Iteration 2 took 253.487sec. for 10 MCMC loops (acceptance rate 0.3053)"

[1] "Reweighting took 1.07499999999999sec. for ESS 1621.343255666 with new kappa 0.00144911924144253."

. . .

[1] "Iteration 239 took 250.380000000005sec. for 10 MCMC loops (acceptance rate 0.2247)"

[1] "Reweighting took 0.0559999999968568sec. for ESS 1270.7842854632 with new kappa 1."

[1] "Iteration 240 took 249.332999999999sec. for 10 MCMC loops (acceptance rate 0.2313)"

The default values for the number of particles, Markov chain steps, and learning rate can be somewhat conservative, depending on the application. Unfortunately, the new function fitVoigtPeaksSMC has not been parallelised yet, so it only runs on a single core. Thus, it can take a long time to fit the model with 32 peaks and 2401 wavenumbers:

print(paste(result$time["elapsed"]/3600,"hours for",length(result$ess),"SMC iterations."))

[1] "16.4389 hours for 240 SMC iterations."

The downside of choosing smaller values for these tuning parameters is that you run the risk of the SMC collapsing. The quality of the particle distribution deteriorates with each iteration, as measured by the effective sample size (ESS):

    plot.ts(result$ess, ylab="ESS", main="Effective Sample Size", xlab="SMC iteration")
    abline(h=length(result$sigma)/2, col=4, lty=2)
    abline(h=0,lty=2)

Note: this is very bad! The variance of the importance sampling estimator is unbounded in this case. The resampling step is intended to refresh the particles, but this introduces duplicates into the population. The Metropolis-Hastings (M-H) steps move some of the particles, but the bandwidths of the random walk proposals are chosen adaptively, based on the particle distribution. If this degenerates too far, then the M-H acceptance rate will also fall to zero.
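For readers unfamiliar with the diagnostic, the ESS of a weighted particle population is commonly computed as the reciprocal of the sum of squared normalised weights. The base-R sketch below illustrates that standard formula together with multinomial resampling. It is not the code used inside **serrsBayes**; the function name is hypothetical, and the threshold of half the number of particles is an assumption based on the dashed line in the ESS plot.

```r
# Effective sample size from unnormalised log-weights.
effective_sample_size <- function(log_weights) {
  w <- exp(log_weights - max(log_weights))   # subtract the max for numerical stability
  w <- w / sum(w)                            # normalise so the weights sum to 1
  1 / sum(w^2)
}

n <- 2000
ess_equal <- effective_sample_size(rep(0, n))                 # equal weights
ess_single <- effective_sample_size(c(0, rep(-Inf, n - 1)))   # one dominant particle

# Unequal weights: particles drawn from N(0,1), reweighted towards N(3,1)
set.seed(1)
particles <- rnorm(n)
lw <- dnorm(particles, mean = 3, log = TRUE) - dnorm(particles, log = TRUE)
ess <- effective_sample_size(lw)

# Multinomial resampling when the ESS drops below the threshold
n_unique <- n
if (ess < n / 2) {
  idx <- sample.int(n, n, replace = TRUE, prob = exp(lw - max(lw)))
  n_unique <- length(unique(idx))   # duplicates reduce the number of distinct particles
}
c(ess = ess, n_unique = n_unique)
```

With equal weights the ESS equals the number of particles, and when a single particle carries all of the weight it equals 1. The count of distinct particles after resampling shows the duplication that the M-H steps are meant to repair.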

If SMC collapses, the best solution is to increase the number of particles and run it again. Thus, choosing a conservative number to begin with is a sensible strategy. With 2000 particles and 10 M-H steps per SMC iteration, the algorithm converges to the target distribution.

A subsample of particles can be used to plot the posterior distribution of the baseline and peaks:

    samp.idx <- sample.int(length(result$weights), 50, prob=result$weights)
    samp.mat <- resid.mat <- matrix(0, nrow=length(samp.idx), ncol=nWL)
    samp.sigi <- samp.lambda <- numeric(length=nrow(samp.mat))
    spectra <- as.matrix(spectra)
    plot(wavenumbers, spectra[1,], type='l', xlab="Raman offset", ylab="intensity")
    for (pt in 1:length(samp.idx)) {
      k <- samp.idx[pt]
      samp.mat[pt,] <- mixedVoigt(result$location[k,], result$scale_G[k,],
                                  result$scale_L[k,], result$beta[k,], wavenumbers)
      samp.sigi[pt] <- result$sigma[k]
      samp.lambda[pt] <- result$lambda[k]

      Obsi <- spectra[1,] - samp.mat[pt,]
      g0_Cal <- length(Obsi) * samp.lambda[pt] * result$priors$bl.precision
      gi_Cal <- crossprod(result$priors$bl.basis) + g0_Cal
      mi_Cal <- as.vector(solve(gi_Cal, crossprod(result$priors$bl.basis, Obsi)))

      bl.est <- result$priors$bl.basis %*% mi_Cal # smoothed residuals = estimated baseline
      lines(wavenumbers, bl.est, col="#C3000020")
      lines(wavenumbers, bl.est + samp.mat[pt,], col="#0000C30F")
      resid.mat[pt,] <- Obsi - bl.est[,1]
    }
    title(main="Baseline for TAMRA")

Notice that the uncertainty in the baseline is greatest where the peaks are bunched close together, which is exactly what we would expect. This is also reflected in uncertainty of the spectral signature:

    plot(range(wavenumbers), range(samp.mat), type='n', xlab="Raman offset", ylab="Intensity")
    abline(h=0,lty=2)
    for (pt in 1:length(samp.idx)) {
      lines(wavenumbers, samp.mat[pt,], col="#0000C330")
      lines(wavenumbers, resid.mat[pt,] + samp.mat[pt,], col="#00000020")
    }
    title(main="Spectral Signature")
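The baseline step inside the plotting loop is a penalised least-squares solve: the spline coefficients minimise the residual sum of squares plus a roughness penalty. The standalone sketch below reproduces that idea on simulated data using the **splines** package that ships with R. The second-difference penalty, the number of basis functions, the function name `fit_baseline` and the simulated curve are all assumptions for illustration; the basis and precision matrix stored in `result$priors` may be constructed differently.

```r
library(splines)

# Penalised cubic B-spline smoother (a sketch of a P-spline fit).
fit_baseline <- function(x, y, lambda = 1, n_basis = 20) {
  B <- bs(x, df = n_basis, degree = 3, intercept = TRUE)   # cubic B-spline basis
  D <- diff(diag(n_basis), differences = 2)                # second-difference operator
  P <- crossprod(D)                                        # roughness penalty matrix
  G <- crossprod(B) + length(x) * lambda * P               # penalised normal equations
  coefs <- as.vector(solve(G, crossprod(B, y)))
  list(coef = coefs, fitted = as.vector(B %*% coefs))
}

# Simulated smooth background plus white noise (hypothetical values)
set.seed(1)
x <- seq(600, 1800, length.out = 400)
truth <- 1000 + 0.5 * (x - 600) - 2e-4 * (x - 600)^2
y <- truth + rnorm(length(x), sd = 50)

fit <- fit_baseline(x, y, lambda = 1)
plot(x, y, col = "grey", xlab = "Raman offset", ylab = "Intensity")
lines(x, truth, lty = 2)
lines(x, fit$fitted, col = 4, lwd = 2)
```

Larger values of `lambda` shrink the second differences of the coefficients towards zero and so give a stiffer baseline, while smaller values let the spline follow the data more closely.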

Del Moral, Pierre, Arnaud Doucet, and Ajay Jasra. 2006. “Sequential Monte Carlo Samplers.” *J. R. Stat. Soc. Ser. B* 68 (3): 411–36. doi:10.1111/j.1467-9868.2006.00553.x.

Gracie, K., M. Moores, W. E. Smith, Kerry Harding, M. Girolami, D. Graham, and K. Faulds. 2016. “Preferential Attachment of Specific Fluorescent Dyes and Dye Labelled DNA Sequences in a SERS Multiplex.” *Anal. Chem.* 88 (2): 1147–53. doi:10.1021/acs.analchem.5b02776.

Jacob, Pierre E., Lawrence M. Murray, and Sylvain Rubenthaler. 2015. “Path Storage in the Particle Filter.” *Stat. Comput.* 25 (2): 487–96. doi:10.1007/s11222-013-9445-x.

Lee, Anthony, and Nick Whiteley. 2015. “Variance Estimation in the Particle Filter.” *arXiv Preprint arXiv:1509.00394 [Stat.CO]*. https://arxiv.org/abs/1509.00394.

Moores, M., K. Gracie, J. Carson, K. Faulds, D. Graham, and M. Girolami. 2016. “Bayesian Modelling and Quantification of Raman Spectroscopy.” *arXiv Preprint arXiv:1604.07299 [Stat.AP]*. http://arxiv.org/abs/1604.07299.

Ramsay, Jim O., Giles Hooker, and Spencer Graves. 2009. *Functional Data Analysis with R and MATLAB*. Use R! New York: Springer. doi:10.1007/978-0-387-98185-7.

Watanabe, Hiroyuki, Norihiko Hayazawa, Yasushi Inouye, and Satoshi Kawata. 2005. “DFT Vibrational Calculations of Rhodamine 6G Adsorbed on Silver: Analysis of Tip-Enhanced Raman Spectroscopy.” *J. Phys. Chem. B* 109 (11): 5012–20. doi:10.1021/jp045771u.

Wood, Simon N. 2017. *Generalized Additive Models: An Introduction with R*. 2nd ed. Boca Raton, FL, USA: Chapman & Hall/CRC Press. https://people.maths.bris.ac.uk/~sw15190/igam/index.html.

]]>