R package serrsBayes

February 19, 2018

My second R package, serrsBayes, is now available on CRAN. serrsBayes uses a sequential Monte Carlo (SMC) algorithm to separate an observed spectrum into 3 components: the peaks s_i(\tilde\nu); baseline \xi_i(\tilde\nu); and additive white noise \epsilon_{i,j} \sim \mathcal{N}(0, \sigma^2_\epsilon):

\mathbf{y}_i = \xi_i(\tilde\nu) + s_i(\tilde\nu) + \boldsymbol\epsilon_i

More details about the model and the SMC algorithm (Del Moral et al., 2006) are available in my preprint on arXiv (Moores et al., 2016; v2 2018). The following gives an example of applying serrsBayes to surface-enhanced Raman spectroscopy (SERS) data from a previous paper (Gracie et al., 2016).

This is a type of functional data analysis (Ramsay et al., 2009), since the discretised spectrum \mathbf{y}_i is represented using latent (unobserved), continuous functions. The background fluorescence \xi_i(\tilde\nu) is estimated using a penalised B-spline (Wood, 2017), while the peaks can be modelled as Gaussian, Lorentzian, or pseudo-Voigt functions.

The Voigt function is a convolution of a Gaussian and a Lorentzian: f_V(\nu_j) = (f_G * f_L)(\nu_j). The pseudo-Voigt approximation replaces this convolution with a weighted sum, introducing an additional parameter \eta_p that equals 0 for a pure Gaussian and 1 for a pure Lorentzian:

s_i(\nu_j) = \sum_{p=1}^P A_{i,p} f_V(\nu_j ; \ell_p, \varphi_p, \eta_p)

where A_{i,p} is the amplitude of peak p; \ell_p is the peak location; and \varphi_p is the broadening. The horizontal axis of a Raman spectrum is measured in wavenumbers \nu_j \in \Delta\tilde\nu, with units of inverse centimetres (\mathrm{cm}^{-1}). The vertical axis is measured in arbitrary units (a.u.), since the intensity of the Raman signal depends on the properties of the spectrometer.
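To get a feel for the peak model, here is a minimal sketch that evaluates a single pseudo-Voigt peak on a toy wavenumber grid using the mixedVoigt function from serrsBayes (the same function used for posterior inference below). The location, broadening, and amplitude values are purely illustrative, and I am assuming mixedVoigt accepts length-one vectors for a single peak:

library(serrsBayes)
# one illustrative pseudo-Voigt peak on a toy grid of wavenumbers
wn <- seq(600, 1800, by=0.5)
peak <- mixedVoigt(1615, 10, 15, 1000, wn)  # location, scale_G, scale_L, amplitude
plot(wn, peak, type='l', xlab="Raman offset", ylab="intensity (a.u.)")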

Data

We can download some SERS spectra in a zip file:

tmp <- tempfile()
download.file("https://pure.strath.ac.uk/portal/files/43595106/Figure_2.zip", tmp)
tmp2 <- unzip(tmp, "Figure 2/T20 SERS spectra/T20_1_ REP1  Well_A1.SPC")


trying URL 'https://pure.strath.ac.uk/portal/files/43595106/Figure_2.zip'
downloaded 270 KB

This data is in the binary SPC file format used by Grams/AI. Fortunately, we can use the R package hyperSpec to read this file and plot the spectrum:

library(hyperSpec)
spcT20 <- read.spc(tmp2)
plot(spcT20[1,], col=4, wl.range=600~1800,
     title.args=list(main="Raman Spectrum of TAMRA+DNA"))
spectra <- spcT20[1,,600~1800]
wavenumbers <- wl(spectra)
nWL <- length(wavenumbers)

[Figure: Raman spectrum of TAMRA+DNA]

Priors

We will use the same priors that were described in the paper (Moores et al., 2016), including the TD-DFT peak locations from Watanabe et al. (2005):

peakLocations <- c(615, 631, 664, 673, 702, 705, 771, 819, 895, 923, 1014,
                   1047, 1049, 1084, 1125, 1175, 1192, 1273, 1291, 1307, 1351,
                   1388, 1390, 1419, 1458, 1505, 1530, 1577, 1601, 1615, 1652, 1716)
nPK <- length(peakLocations)
priors <- list(loc.mu=peakLocations, loc.sd=rep(50,nPK),
               scaG.mu=log(16.47) - (0.34^2)/2, scaG.sd=0.34,
               scaL.mu=log(25.27) - (0.4^2)/2, scaL.sd=0.4,
               noise.nu=5, noise.sd=50, bl.smooth=1, bl.knots=121)
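Assuming the broadening parameters have lognormal priors, as in the paper, the offsets of \sigma^2/2 in scaG.mu and scaL.mu simply centre the prior means of the broadening at 16.47 and 25.27 \mathrm{cm}^{-1}, since E[X] = \exp(\mu + \sigma^2/2) for X \sim \mathrm{LogNormal}(\mu, \sigma^2). A quick sanity check:

# prior means implied by the lognormal parametrisation (assuming lognormal scale priors)
exp(priors$scaG.mu + 0.34^2/2)  # 16.47, Gaussian broadening (cm^-1)
exp(priors$scaL.mu + 0.4^2/2)   # 25.27, Lorentzian broadening (cm^-1)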

SMC

Now we run the SMC algorithm to fit the model:

library(serrsBayes)
tm <- system.time(result <- fitVoigtPeaksSMC(wavenumbers,
                  as.matrix(spectra), priors, npart=2000))
result$time <- tm
save(result, file="Figure 2/result.rda")


[1] "SMC with 1 observations at 1 unique concentrations, 2000 particles, and 2401 wavenumbers."
[1] "Step 0: computing 125 B-spline basis functions (r=10) took 0.28sec."
[1] "Mean noise parameter sigma is now 60.3304671005565"
[1] "Mean spline penalty lambda is now 1"
[1] "Step 1: initialization for 32 Voigt peaks took 24.959 sec."
[1] "Reweighting took 1.208sec. for ESS 1800.80025019536 with new kappa 0.00096893310546875."
[1] "Iteration 2 took 253.487sec. for 10 MCMC loops (acceptance rate 0.3053)"
[1] "Reweighting took 1.07499999999999sec. for ESS 1621.343255666 with new kappa 0.00144911924144253."
. . .
[1] "Iteration 239 took 250.380000000005sec. for 10 MCMC loops (acceptance rate 0.2247)"
[1] "Reweighting took 0.0559999999968568sec. for ESS 1270.7842854632 with new kappa 1."
[1] "Iteration 240 took 249.332999999999sec. for 10 MCMC loops (acceptance rate 0.2313)"

The default values for the number of particles, Markov chain steps, and learning rate can be somewhat conservative, depending on the application. Unfortunately, the new function fitVoigtPeaksSMC has not been parallelised yet, so it only runs on a single core. Thus, it can take a long time to fit the model with 32 peaks and 2401 wavenumbers:

print(paste(result$time["elapsed"]/3600,"hours for",length(result$ess),"SMC iterations."))


[1] "16.4389 hours for 240 SMC iterations."

The downside of choosing smaller values for these tuning parameters is that you run the risk of the SMC collapsing. The quality of the particle distribution deteriorates with each iteration, as measured by the effective sample size (ESS):
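For normalised importance weights w_k, the ESS is 1/\sum_k w_k^2. A minimal sketch of this calculation for the final iteration, assuming result$weights holds the normalised weights of the particles:

# effective sample size from the normalised importance weights
ess.final <- 1/sum(result$weights^2)
ess.final/length(result$weights)  # proportion of "effective" particles at the last iteration

The ESS at every SMC iteration is stored in result$ess, so the whole trajectory can be plotted directly: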

plot.ts(result$ess, ylab="ESS", main="Effective Sample Size", xlab="SMC iteration")
abline(h=length(result$sigma)/2, col=4, lty=2)
abline(h=0,lty=2)

[Figure: effective sample size collapsing towards zero over the SMC iterations]
Note: this is very bad! The variance of the importance sampling estimator is unbounded in this case. The resampling step is intended to refresh the particles, but this introduces duplicates into the population. The Metropolis-Hastings (M-H) steps move some of the particles, but the bandwidths of the random walk proposals are chosen adaptively, based on the particle distribution. If this degenerates too far, then the M-H acceptance rate will also fall to zero:
[Figure: M-H acceptance rate falling towards zero as the particle distribution degenerates]
If SMC collapses, the best solution is to increase the number of particles and run it again. Thus, choosing a conservative number to begin with is a sensible strategy. With 2000 particles and 10 M-H steps per SMC iteration, the algorithm converges to the target distribution:
[Figure: ESS converging with 2000 particles and 10 M-H steps per SMC iteration]
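If a run does collapse, re-running is straightforward; a minimal sketch where only npart changes from the call above (the value 5000 is illustrative, not a recommendation):

# re-run the SMC sampler with a larger particle population, same data and priors
result <- fitVoigtPeaksSMC(wavenumbers, as.matrix(spectra), priors, npart=5000)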

Posterior Inference

A subsample of particles can be used to plot the posterior distribution of the baseline and peaks:

samp.idx <- sample.int(length(result$weights), 50, prob=result$weights)
samp.mat <- resid.mat <- matrix(0,nrow=length(samp.idx), ncol=nWL)
samp.sigi <- samp.lambda <- numeric(length=nrow(samp.mat))
spectra <- as.matrix(spectra)
plot(wavenumbers, spectra[1,], type='l', xlab="Raman offset", ylab="intensity")
for (pt in 1:length(samp.idx)) {
  k <- samp.idx[pt]
  samp.mat[pt,] <- mixedVoigt(result$location[k,], result$scale_G[k,],
                              result$scale_L[k,], result$beta[k,], wavenumbers)
  samp.sigi[pt] <- result$sigma[k]
  samp.lambda[pt] <- result$lambda[k]

  Obsi <- spectra[1,] - samp.mat[pt,]
  g0_Cal <- length(Obsi) * samp.lambda[pt] * result$priors$bl.precision
  gi_Cal <- crossprod(result$priors$bl.basis) + g0_Cal
  mi_Cal <- as.vector(solve(gi_Cal, crossprod(result$priors$bl.basis, Obsi)))

  bl.est <- result$priors$bl.basis %*% mi_Cal # smoothed residuals = estimated baseline
  lines(wavenumbers, bl.est, col="#C3000020")
  lines(wavenumbers, bl.est + samp.mat[pt,], col="#0000C30F")
  resid.mat[pt,] <- Obsi - bl.est[,1]
}
title(main="Baseline for TAMRA")

[Figure: posterior samples of the baseline for TAMRA]

Notice that the uncertainty in the baseline is greatest where the peaks are bunched close together, which is exactly what we would expect. This is also reflected in the uncertainty of the spectral signature:

plot(range(wavenumbers), range(samp.mat), type='n', xlab="Raman offset", ylab="Intensity")
abline(h=0,lty=2)
for (pt in 1:length(samp.idx)) {
  lines(wavenumbers, samp.mat[pt,], col="#0000C330")
  lines(wavenumbers, resid.mat[pt,] + samp.mat[pt,], col="#00000020")
}
title(main="Spectral Signature")

[Figure: posterior samples of the spectral signature for TAMRA]
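The same subsample can also be summarised numerically. Here is a minimal sketch of a pointwise 95% credible band for the signature; with only 50 subsampled particles the quantiles are rough, so this is indicative rather than definitive:

# pointwise posterior summary of the spectral signature from the particle subsample
sig.mean  <- colMeans(samp.mat)
sig.lower <- apply(samp.mat, 2, quantile, probs=0.025)
sig.upper <- apply(samp.mat, 2, quantile, probs=0.975)
plot(wavenumbers, sig.mean, type='l', xlab="Raman offset", ylab="intensity (a.u.)",
     main="Spectral Signature with 95% credible band")
lines(wavenumbers, sig.lower, lty=2, col="#0000C3")
lines(wavenumbers, sig.upper, lty=2, col="#0000C3")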

References

Del Moral, Pierre, Arnaud Doucet, and Ajay Jasra. 2006. “Sequential Monte Carlo Samplers.” J. R. Stat. Soc. Ser. B 68 (3): 411–36. doi:10.1111/j.1467-9868.2006.00553.x.

Gracie, K., M. Moores, W. E. Smith, Kerry Harding, M. Girolami, D. Graham, and K. Faulds. 2016. “Preferential Attachment of Specific Fluorescent Dyes and Dye Labelled DNA Sequences in a SERS Multiplex.” Anal. Chem. 88 (2): 1147–53. doi:10.1021/acs.analchem.5b02776.

Jacob, Pierre E., Lawrence M. Murray, and Sylvain Rubenthaler. 2015. “Path Storage in the Particle Filter.” Stat. Comput. 25 (2): 487–96. doi:10.1007/s11222-013-9445-x.

Lee, Anthony, and Nick Whiteley. 2015. “Variance Estimation in the Particle Filter.” arXiv Preprint arXiv:1509.00394 [Stat.CO]. https://arxiv.org/abs/1509.00394.

Moores, M., K. Gracie, J. Carson, K. Faulds, D. Graham, and M. Girolami. 2016. “Bayesian Modelling and Quantification of Raman Spectroscopy.” arXiv Preprint arXiv:1604.07299 [Stat.AP]. http://arxiv.org/abs/1604.07299.

Ramsay, Jim O., Giles Hooker, and Spencer Graves. 2009. Functional Data Analysis with R and MATLAB. Use R! New York: Springer. doi:10.1007/978-0-387-98185-7.

Watanabe, Hiroyuki, Norihiko Hayazawa, Yasushi Inouye, and Satoshi Kawata. 2005. “DFT Vibrational Calculations of Rhodamine 6g Adsorbed on Silver: Analysis of Tip-Enhanced Raman Spectroscopy.” J. Phys. Chem. B 109 (11): 5012–20. doi:10.1021/jp045771u.

Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. 2nd ed. Boca Raton, FL, USA: Chapman & Hall/CRC Press. https://people.maths.bris.ac.uk/~sw15190/igam/index.html.
