R with GotoBLAS on Windows 10
In my experience, the default libRblas gives horrible performance across all of the platforms that I regularly use (Windows, Mac OS & Linux). The R package that I’m currently developing uses RcppEigen, which is not dependent on an efficient BLAS or LAPACK library. However, many other R packages do have this dependency. Therefore, I would recommend following the instructions in the R Installation and Administration guide to switch over to a more efficient implementation. Avraham Adler, Tony Fischetti and Zachary Mayer have written similar blog posts on this topic. I use the Accelerate Umbrella Framework (vecLib) on OS X and Intel MKL with icc on Linux. The following instructions describe how (and why) to install GotoBLAS for R on Microsoft Windows.
As described on pp. 191-192 of “Seamless R and C++ Integration with Rcpp” (DOI: 10.1007/978-1-4614-6868-4), the lmBenchmark script can be used as a rough performance measurement for dense matrix algebra on any system. You need to install the packages RcppEigen and rbenchmark, then run:
Rscript -e "source(system.file(\"examples\", \"lmBenchmark.R\", package = \"RcppEigen\"))"
The output should look something like this (with default Rblas.dll):
lm benchmark for n = 1650 and p = 875: nrep = 20
   user  system elapsed
2021.83   21.27 2043.75
      test relative elapsed user.self sys.self
3     LDLt    1.000    4.70      4.54     0.17
7       QR    1.374    6.46      6.25     0.20
8      LLt    1.436    6.75      6.44     0.32
1   lm.fit    3.853   18.11     18.03     0.03
6  SymmEig    5.700   26.79     26.57     0.22
2    PivQR    9.783   45.98     26.75    19.20
9     arma   19.155   90.03     89.72     0.29
4    GESDD   19.313   90.77     90.53     0.22
5      SVD  184.362  866.50    865.94     0.39
10     GSL  188.672  886.76    886.19     0.21
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
These timings are for a 1650 × 875 matrix, rather than the default of 100,000 × 40. The results are highly dependent on the matrix dimensions, so you should use a size that is representative of the data that you are working with. The benchmark was run on a 2 GHz Intel Core i7-4750HQ with Windows 10. You can compare these results to Table 12.2 on p. 191 of Eddelbuettel (2013).
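For a quick check at your own matrix dimensions, without running the full lmBenchmark script, something like the following base R sketch may be enough. The helper name time_dense and its arguments are hypothetical, and the data are random; crossprod exercises the BLAS directly, while lm.fit is the same base R routine that appears in the tables in this post.

```r
# Hypothetical helper: time a cross-product and a least-squares fit
# for a random n-by-p matrix, repeating each operation nrep times.
time_dense <- function(n, p, nrep = 3L, seed = 1L) {
  set.seed(seed)
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  y <- rnorm(n)
  c(
    crossprod = unname(system.time(for (i in seq_len(nrep)) crossprod(X))["elapsed"]),
    lm.fit    = unname(system.time(for (i in seq_len(nrep)) lm.fit(X, y))["elapsed"])
  )
}

# Dimensions used in this post; substitute sizes that match your own data.
timings <- time_dense(n = 1650L, p = 875L)
print(timings)
```

Running this once before and once after switching the BLAS gives a rough before/after comparison. It is not a substitute for lmBenchmark, which covers many more decompositions.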
I installed OpenBLAS from SourceForge (the usual caveats for downloading binaries from SourceForge notwithstanding). Unfortunately, the usual method of replacing libRblas with libopenblas (or a softlink) is a one-way trip to DLL hell on Windows. If you start getting an error message that libgcc_s_seh-1.dll is missing, switch back to the original Rblas.dll (you did make a backup copy, right?). There are drop-in replacements using SurviveGotoBLAS 3.14 for some CPU architectures available here, but unfortunately not for the 4th generation Haswell with SSE 4.2 and AVX 2.0 instructions.
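The backup step can be done from R itself before you touch anything. This is only a sketch: it assumes that Rblas.dll lives in the directory reported by R.home("bin"), the ".orig" suffix is an arbitrary name of my choosing, and writing under Program Files may require an elevated session.

```r
# Sketch: back up the reference BLAS before replacing it (Windows layout assumed).
blas_path   <- file.path(R.home("bin"), "Rblas.dll")
backup_path <- paste0(blas_path, ".orig")  # hypothetical backup name

if (file.exists(blas_path) && !file.exists(backup_path)) {
  # file.copy returns FALSE rather than failing if the copy is not permitted
  ok <- file.copy(blas_path, backup_path, overwrite = FALSE)
  message(if (ok) "Backup written to " else "Could not write backup to ", backup_path)
} else {
  message("Nothing copied: Rblas.dll was not found here, or a backup already exists.")
}
```

Restoring the original is best done with R closed (for example from Explorer or a command prompt), since a running session will typically have the DLL loaded.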
These are the results for the SurviveGotoBLAS binary (Nehalem architecture):
lm benchmark for n = 1650 and p = 875: nrep = 20
   user  system elapsed
2012.86  115.13 2072.31
      test relative elapsed user.self sys.self
3     LDLt    1.000    4.71      4.45     0.25
7       QR    1.465    6.90      6.59     0.30
8      LLt    1.643    7.74      7.47     0.27
4    GESDD    4.662   21.96     38.75     5.33
6  SymmEig    6.270   29.53     29.24     0.28
9     arma    6.696   31.54     61.69    11.65
2    PivQR    9.688   45.63     26.45    19.10
1   lm.fit   25.607  120.61     36.31    77.37
5      SVD  190.735  898.36    897.09     0.43
10     GSL  191.970  904.18    903.67     0.13
RcppArmadillo (“arma”) improved from 90s elapsed time to 31.5s, almost a 3× speedup. Likewise, GESDD improved from 90.8s to 22s. However, lm.fit became much slower, rising from 18.1s to 120.6s elapsed. The 77s spent in sys.self suggests that threading overhead is the likely cause. Clearly, this is far from the desired outcome in switching BLAS implementations. There is a workaround for this issue, which is to install the R package RhpcBLASctl. This offers a function blas_set_num_threads(..) that you can use to force BLAS to be single-threaded.
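A minimal sketch of that workaround, assuming RhpcBLASctl has already been installed from CRAN. The matrix here is random data at the dimensions used in this post, so the timing is only indicative.

```r
# Random test problem at the dimensions used in this post.
set.seed(42)
X <- matrix(rnorm(1650 * 875), nrow = 1650)
y <- rnorm(1650)

if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  # Force the BLAS to use a single thread for the rest of this session.
  RhpcBLASctl::blas_set_num_threads(1)
} else {
  message("RhpcBLASctl is not installed; BLAS thread count left unchanged.")
}

# Time a single least-squares fit with the current thread setting.
elapsed <- system.time(lm.fit(X, y))[["elapsed"]]
print(elapsed)
```

The setting applies only to the current R session, so it has to be repeated (for instance in a startup script) each time R is launched.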
Results for single-threaded SurviveGotoBLAS were as follows:
lm benchmark for n = 1650 and p = 875: nrep = 20
   user  system elapsed
1894.76   23.33 1919.85
      test relative elapsed user.self sys.self
3     LDLt    1.000    4.67      4.46     0.21
7       QR    1.355    6.33      6.08     0.25
8      LLt    1.422    6.64      6.43     0.22
1   lm.fit    1.623    7.58      7.50     0.08
6  SymmEig    5.728   26.75     26.39     0.33
9     arma    6.503   30.37     29.89     0.40
4    GESDD    7.017   32.77     32.45     0.32
2    PivQR   10.107   47.20     26.33    20.86
5      SVD  186.285  869.95    868.83     0.45
10     GSL  189.852  886.61    885.47     0.19
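To put the three runs side by side, the elapsed times of the BLAS-dependent tests can be collected into a small data frame and turned into speedups relative to the default Rblas.dll. The numbers are copied from the tables in this post; the column names are my own.

```r
# Elapsed times (seconds, nrep = 20) copied from the three benchmark runs above.
elapsed <- data.frame(
  test        = c("lm.fit", "arma", "GESDD"),
  reference   = c(18.11, 90.03, 90.77),   # default Rblas.dll
  goto_multi  = c(120.61, 31.54, 21.96),  # SurviveGotoBLAS, multi-threaded
  goto_single = c(7.58, 30.37, 32.77)     # SurviveGotoBLAS, single-threaded
)

# Speedup relative to the default BLAS; values below 1 mean a slowdown.
elapsed$speedup_multi  <- round(elapsed$reference / elapsed$goto_multi, 2)
elapsed$speedup_single <- round(elapsed$reference / elapsed$goto_single, 2)
print(elapsed)
```

Single-threading is not a pure win: it fixes lm.fit (7.6s instead of 120.6s), leaves arma roughly unchanged, and makes GESDD slower than the multi-threaded run (32.8s versus 22.0s), although still well ahead of the default BLAS.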
As described by Avraham Adler, the alternative is to install MSYS2 & MINGW64, then compile both OpenBLAS & R from source. Once again, Windows is the redheaded stepchild of platforms for running R.
Note that the GNU Scientific Library (GSL) is also available for Windows. If you want to run lmBenchmark for RcppGSL, then you will also need to install it. However, GSL uses its own libgslcblas.dll, so it won’t benefit from installing OpenBLAS as described above. I don’t know why gsl_multifit_linear(..) is so slow in comparison to all of the other implementations. I’ve observed similar RcppGSL performance when I ran lmBenchmark on Linux and OS X.
I just updated those instructions for R3.3+ and Rtools34 so they are a bit different.