
Haar wavelets in Google docs (and R)

June 20, 2016

Igor recently posted some interesting data, showing the change in temperature over time with two different types of fan, or without any cooling. This data exhibits some unusual features: the underlying trend is smooth over a long timeframe (minutes), but there is also periodic switching between “low” and “high” signal, which is sometimes accompanied by jitter. I’ll first show how to generate some similar data in R, then how we might compute the discrete wavelet transform (DWT) in Google Sheets. Of course, the simplest solution is often the best (unless you work in research…), so check out Igor’s post for a quicker way to filter this data.

Here I use a sine wave for the overall trend, but this could just as easily be a polynomial or cubic spline. I’m modelling the long-term switching as a deterministic function, although in Igor’s data this appears to be random. I’ve added a small amount of white noise as well as the jitter. I’m assuming a Poisson distribution for the number of time points of jitter following a switching event. A zero-inflated Poisson would be a better model, since jitter only occurs at some of the switches in Igor’s data. Finally, I assemble the four additive components of the signal to produce the red curve in the plot below (the original trend is in black).

t <- 1:2400                                   # time in seconds
trend <- 28 + 6*sin(-0.3 - t/500)             # smooth long-term trend
plot.ts(trend, ylab="Temperature (degrees C)", xlab="Elapsed Time (seconds)")
sw <- as.integer(cos(1.7 + t/50) > 0)         # deterministic low/high switching
lines(trend + sw, col=3)
noise <- rnorm(length(t))/20                  # white noise
lines(trend + sw + noise, col=4)
sw_idx <- which(diff(sw) > 0)                 # upward switching events
jit_len <- rpois(length(sw_idx), lambda=30)   # Poisson-distributed jitter durations
jit <- rep(0, length(t))
for (i in 1:length(sw_idx)) {
  sw_idx2 <- seq(sw_idx[i]+1, to=sw_idx[i] + jit_len[i], by=2)
  jit[sw_idx2] <- -1                          # drop by 1 at alternating time points after the switch
}
lines(trend + sw + noise + jit, col=2)        # the full simulated signal, in red

[Plot: the simulated temperature data, with the original trend in black and the full signal in red]

Now that I’ve added the jitter in, I’ll show some ways to filter it out. I’m not using a model-based approach, since ideally this is something one could implement using a spreadsheet. Instead, I threshold the lag 1 difference between successive values to detect the jumps, where jitter is assumed to be two jumps in opposite directions.

temp <- trend + sw + noise + jit
idx1 <- which(diff(temp) > 0.8)        # upward jumps
idx2 <- which(diff(temp) < -0.8)       # downward jumps
idx3 <- which(idx2 %in% (idx1 + 1))    # downward jumps immediately after an upward jump
abline(v=idx2[idx3], lty=3, col=3)     # mark the detected jitter on the previous plot
filter.temp <- temp
for (i in idx2[idx3]) {
  filter.temp[i+1] <- temp[i]          # replace the jittered value with the preceding one
}
plot.ts(filter.temp, ylab="Temperature (degrees C)", xlab="Elapsed Time (seconds)")

[Plot: the temperature series after filtering out the jitter]

Due to the Gaussian noise, a threshold of 0.8 is not guaranteed to catch every instance of jitter. I haven’t set the random seed in this example, so your mileage may vary. Now, to demonstrate how to filter this data in Google Sheets, we could just export a CSV file and upload it. But of course, there is also the R package googlesheets:

library(googlesheets)
temp_df <- data.frame(cbind(t, temp))
temp_gs <- gs_new(title="Simulated Temperature", verbose = TRUE, input=temp_df)
Sheet "Simulated Temperature" created in Google Drive.
Range affected by the update: "A1:B2401"
Accessing worksheet titled 'Sheet1'.
Sheet successfully identified: "Simulated Temperature"
Accessing worksheet titled 'Sheet1'.
Worksheet "Sheet1" dimensions changed to 2401 x 26.
Worksheet "Sheet1" successfully updated with 4802 new value(s).
Worksheet dimensions: 2401 x 26.
There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: 'xml2::xml_find_one' is deprecated.
Use 'xml_find_first' instead.
See help("Deprecated")
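If googlesheets misbehaves (those deprecation warnings suggest the package was in flux at the time), the no-frills fallback mentioned above is to write a CSV and import it via File > Import in Sheets. A one-line sketch, where the file name is just a placeholder:

write.csv(data.frame(t, temp), "simulated_temperature.csv", row.names=FALSE)   # then import by hand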

[Screenshots: the simulated temperature data uploaded to a new Google Sheet]

To create the lag 1 difference (and the forward lag), which play the role of idx1 & idx2 in my R code above, I created two new columns with the formulae “=B3-B2” & “=B2-B3”. Note that the first row of column C is 0 and so is the final row of column D. Filters in Google Sheets are cumulative, which means that we don’t have the same flexibility that we had in R. We can filter by either “Lag Diff <= 0.8” or “Fwd Lag >= -0.8” but not by the intersection of these sets.

[Screenshot: filtering on the Lag Diff and Fwd Lag columns in Google Sheets]
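For reference, here is roughly what those two helper columns look like back in R (a sketch only; the names lag_diff and fwd_lag are mine, purely for illustration):

lag_diff <- c(0, diff(temp))    # column C: current value minus previous value, first row 0
fwd_lag <- c(-diff(temp), 0)    # column D: current value minus next value, last row 0
table(lag_diff <= 0.8)          # rows kept by the "Lag Diff <= 0.8" filter
table(fwd_lag >= -0.8)          # rows kept by the "Fwd Lag >= -0.8" filter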

The resulting plot replaces the jitter with a smooth transition between values. In the plot below, I’ve also added a cubic polynomial trend line:

[Plot: the filtered data in Google Sheets with a cubic polynomial trend line]
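Back in R, a rough analogue of that cubic trend line is a polynomial regression on the filtered series. This is a sketch only, fitted here to filter.temp rather than inside the spreadsheet, and the name cubic_fit is mine:

cubic_fit <- lm(filter.temp ~ poly(t, 3))   # cubic polynomial fit to the filtered series
plot.ts(filter.temp, ylab="Temperature (degrees C)", xlab="Elapsed Time (seconds)")
lines(t, fitted(cubic_fit), col=2, lwd=2)   # overlay the fitted trend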

Haar wavelets are very useful functions for changepoint detection in data such as this. There are two excellent reference books on this topic, “Wavelet Methods in Statistics with R” by Guy Nason (part of Springer’s Use R! series) and “A Wavelet Tour of Signal Processing” by Stéphane Mallat. Guy Nason has also created the R package wavethresh, so there is no need to shell out for the MATLAB toolbox.

[Figure: the Haar wavelet]

The discrete wavelet transform (DWT), at least as implemented in wavethresh, requires a vector whose length is a power of 2. Usually we would pad the data to avoid edge effects, but in this case I just truncated it to the last 2048 seconds of temperature measurements.
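For completeness, padding to the next power of two might look something like the sketch below (repeating the final value is just one option, and it is not what I do here; the names n_pad and temp_padded are mine). With 2400 observations the next power of two is 4096, so truncation is arguably the simpler choice.

n <- length(temp)
n_pad <- 2^ceiling(log2(n))                        # 4096 for n = 2400
temp_padded <- c(temp, rep(temp[n], n_pad - n))    # pad by repeating the last value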

library(wavethresh)
draw.default(filter.number = 1, family = "DaubExPhase")   # draw the Haar wavelet
temp_wd <- wd(temp, filter.number = 1, family = "DaubExPhase")
Error in wd(temp, filter.number = 1, family = "DaubExPhase") : 
 Data length is not power of two
length(temp) - 2^11 + 1   # 353: the first index of the final 2048 observations
temp_wd <- wd(temp[353:2400], filter.number = 1, family = "DaubExPhase")
plot(temp_wd)

[Plot: Haar wavelet coefficients of the temperature series, by resolution level]

The DWT is a multiscale transform of the data. The Haar wavelet coefficients correspond to the locations of jumps in the data at various resolutions. We have 2^11 time points, so there are 11 resolution levels. Section 2.1 of Nason (2008) describes the DWT algorithm step by step. The complexity of this transform is O(N), and in the case of Haar wavelets it is simple enough to perform in a spreadsheet. At level 10 (the finest resolution) there are 1024 detail coefficients, also known as “mother wavelet” coefficients. Computing these is a little tricky, since each coefficient is the difference between the second and first values of a non-overlapping pair of rows:

[Screenshot: computing the Haar detail and smoothing coefficients in Google Sheets]

So the first value is B355 - B354, the second is B357 - B356, and so on. Column F contains the coarsening or smoothed coefficients: B355 + B354, B357 + B356, etc. Column G is the next level of resolution: now we take the difference of the smoothed coefficients, so that G354 = F355 - F354. At each level of resolution we have half as many coefficients as the previous level: 1024 at level 10, 512 at level 9, and so on. At level 0, we end up with only a single coefficient. Of course, there is no need to compute these by hand when we can just import the data into R, but sometimes it can be beneficial to step through an algorithm in a spreadsheet to understand what it is doing at each iteration.
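The first couple of pyramid steps are also easy to check in R. This is only a sketch using the unnormalised sums and differences from the spreadsheet (the names x, d10, s10 and d9 are mine); wavethresh itself applies a factor of 1/sqrt(2) at each level, so up to that normalisation and a sign convention the results should agree with accessD(temp_wd, level = 10) and accessD(temp_wd, level = 9).

x <- temp[353:2400]                  # the 2048 values passed to wd() above
even <- seq(2, length(x), by=2)      # positions 2, 4, ..., 2048
d10 <- x[even] - x[even - 1]         # 1024 detail coefficients at level 10
s10 <- x[even] + x[even - 1]         # 1024 smoothed values
d9 <- s10[seq(2, length(s10), by=2)] - s10[seq(1, length(s10), by=2)]   # 512 details at level 9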

The point of all of this is that we can filter out noise in our data by shrinking the wavelet coefficients. As a crude example, we can simply set to zero any coefficient whose absolute value is less than 0.7:

temp_thresh <- threshold(temp_wd, verbose=TRUE, policy="manual", value=0.7)
threshold.wd:
Argument checking
Manual policy
Level:  3  there are  0  zeroes
Level:  4  there are  1  zeroes
Level:  5  there are  9  zeroes
Level:  6  there are  49  zeroes
Level:  7  there are  119  zeroes
Level:  8  there are  251  zeroes
Level:  9  there are  510  zeroes
Level:  10  there are  973  zeroes
temp_wr <- wr(temp_thresh)   # invert the DWT on the thresholded coefficients
plot.ts(temp_wr, ylab="Temperature (degrees C)", xlab="Elapsed Time (seconds)")

[Plot: the reconstructed temperature series after thresholding]
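The manual value of 0.7 was picked by eye. A less crude option, which I haven’t shown here, is the universal threshold of Donoho and Johnstone; wavethresh will compute it for you, along the lines of this sketch (temp_thresh2 and temp_wr2 are just illustrative names):

temp_thresh2 <- threshold(temp_wd, policy="universal")   # sqrt(2*log(n)) times the estimated noise sd
temp_wr2 <- wr(temp_thresh2)
plot.ts(temp_wr2, ylab="Temperature (degrees C)", xlab="Elapsed Time (seconds)")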
