Distance matrix calculations in R

Can anyone recommend good parallel processing libraries for distance matrix calculations in R?

Here is a link to the CRAN Task View with several suggestions:



To compute large dense matrices in parallel in R, I usually use the foreach and doSNOW packages like this:


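library(foreach)
library(doSNOW)  # also loads snow, which provides makeCluster()
cluster <- makeCluster(ncpu, type = "SOCK")
registerDoSNOW(cluster)  # without this, %dopar% runs sequentially
D <- foreach(j = 1:n, .combine = 'cbind') %:%
      foreach(i = 1:n, .combine = 'c') %dopar% {
        compute_distance(data[i,], data[j,])
      }
stopCluster(cluster)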

Hi @jkh1,

Thank you for the response.
I have been using the same libraries, but because of memory constraints (or perhaps some other reason I am not aware of) my cluster stops after each iteration. Also, all the packages and variables are copied to each worker every time, which increases the run time. Any comments on that?

If memory is an issue because of data duplication in the workers, the solution is to reduce the amount of data given to the workers. Typically each variable referenced in the foreach loop is sent to each worker. In my example, I reference data so the whole data matrix is copied to each worker. One way to avoid this is to rewrite the loop to iterate over the rows only with something like

foreach(rowi = t(data)....
 foreach(rowj = t(data)....
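To make the idea concrete, here is a minimal sketch of the row-iterating version. It is only an illustration under some assumptions: the toy data matrix and the Euclidean compute_distance are made up for the example, and I use iter(data, by = "row") from the iterators package as a stand-in for looping over t(data). The point is that the loop body only mentions rowi and rowj, not data.

```r
library(foreach)
library(doSNOW)     # also loads snow, which provides makeCluster()
library(iterators)  # provides iter()

# Hypothetical toy data and distance function, for illustration only.
data <- matrix(rnorm(20), nrow = 5)
compute_distance <- function(x, y) sqrt(sum((x - y)^2))

cluster <- makeCluster(2, type = "SOCK")
registerDoSNOW(cluster)

# Iterate over rows directly; the loop body never references `data`.
D <- foreach(rowj = iter(data, by = "row"), .combine = "cbind") %:%
  foreach(rowi = iter(data, by = "row"), .combine = "c") %dopar% {
    compute_distance(rowi, rowj)
  }

stopCluster(cluster)
```

For a Euclidean distance like the one assumed here, the result can be compared against base R with all.equal(unname(D), unname(as.matrix(dist(data)))).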

Another option could be to use the bigmemory package.
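A rough sketch of how that might look, assuming all workers run on the same machine so they can attach to the same shared memory. The toy data and the Euclidean distance are assumptions for the example; as.big.matrix, describe and attach.big.matrix are the bigmemory functions I would reach for, but treat this as a starting point rather than a tuned solution.

```r
library(bigmemory)
library(foreach)
library(doSNOW)  # also loads snow, which provides makeCluster()

# Hypothetical toy data; a real matrix would be much larger.
data <- matrix(rnorm(20), nrow = 5)
n <- nrow(data)

# Copy the data once into shared memory and keep a small descriptor of it.
shared <- as.big.matrix(data, type = "double")
desc <- describe(shared)

cluster <- makeCluster(2, type = "SOCK")
registerDoSNOW(cluster)

# Workers receive the descriptor, not the matrix, and attach to shared memory.
D <- foreach(j = 1:n, .combine = "cbind", .packages = "bigmemory") %dopar% {
  x <- attach.big.matrix(desc)
  # Euclidean distance from every row i to row j (assumed metric).
  vapply(seq_len(n), function(i) sqrt(sum((x[i, ] - x[j, ])^2)), numeric(1))
}

stopCluster(cluster)
```

Each task reads only the two rows it needs from the shared matrix, so the full data set is not duplicated in every worker.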

EDIT: Another thing to pay attention to is that you should probably not use all the cores available.
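For example, one common convention (my assumption here, not a hard rule) is to leave one core free for the operating system and the master R process. Fewer workers also means fewer copies of the data in memory. A small sketch using the parallel package that ships with R:

```r
library(parallel)  # ships with base R

# detectCores() can return NA on some platforms, so fall back to 1.
available <- detectCores()
ncpu <- if (is.na(available)) 1L else max(1L, available - 1L)
```

The resulting ncpu can then be passed to makeCluster() as in the earlier example.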

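For parallel computing tasks such as distance matrix calculations in R, see the CRAN Task View on High-Performance and Parallel Computing (HighPerformanceComputing). It is a curated list of relevant packages rather than a package itself.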