Speed up computation on lists of lists in R (scalability issues)
up vote
1
down vote
favorite
My goal is to take a source list of vectors and apply a mapping to each vector. The mapping creates a list of lists. When the vector size in each list element is large (~20000), and the number of elements in the source list is large (~5000), my program does not seem to be efficient. How can I optimize my code?
I've tried two things: implementation in Rcpp and mclapply. The Rcpp implementation is comparable as expected, and the mclapply implementation may not be that efficient because I'm running my function in a larger function that is already parallelized over multiple cores.
Basic example
source.list = rep(list(seq(6)),3)
target.list = list(c(1,2),c(3,4),c(5,6))
result = map_partition(source.list,target.list)
> result
[[1]]
[[1]][[1]]
[1] 1 2
[[1]][[2]]
[1] 3 4
[[1]][[3]]
[1] 5 6
[[2]]
[[2]][[1]]
[1] 1 2
[[2]][[2]]
[1] 3 4
[[2]][[3]]
[1] 5 6
[[3]]
[[3]][[1]]
[1] 1 2
[[3]][[2]]
[1] 3 4
[[3]][[3]]
[1] 5 6
R implementation
map.partition.R <- function(invec, partitionlist) {
  lst <- lapply(partitionlist, function(x) invec[invec %in% x])
  return(lst)
}
result = lapply(X=source.list, FUN=map.partition,
                               partitionlist=target.list)
performance r
add a comment |
up vote
1
down vote
favorite
My goal is to take a source list of vectors and apply a mapping to each vector. The mapping creates a list of lists. When the vector size in each list element is large (~20000), and the number of elements in the source list is large (~5000), my program does not seem to be efficient. How can I optimize my code?
I've tried two things: implementation in Rcpp and mclapply. The Rcpp implementation is comparable as expected, and the mclapply implementation may not be that efficient because I'm running my function in a larger function that is already parallelized over multiple cores.
Basic example
source.list = rep(list(seq(6)),3)
target.list = list(c(1,2),c(3,4),c(5,6))
result = map_partition(source.list,target.list)
> result
[[1]]
[[1]][[1]]
[1] 1 2
[[1]][[2]]
[1] 3 4
[[1]][[3]]
[1] 5 6
[[2]]
[[2]][[1]]
[1] 1 2
[[2]][[2]]
[1] 3 4
[[2]][[3]]
[1] 5 6
[[3]]
[[3]][[1]]
[1] 1 2
[[3]][[2]]
[1] 3 4
[[3]][[3]]
[1] 5 6
R implementation
map.partition.R <- function(invec, partitionlist) {
  lst <- lapply(partitionlist, function(x) invec[invec %in% x])
  return(lst)
}
result = lapply(X=source.list, FUN=map.partition,
                               partitionlist=target.list)
performance r
 
 
 
 
 
 
 Is- function(x) invec[invec %in% x]what you really intend to apply or is it just here for illustration purposes? If it is really what you intend to do, I might have an idea using a data.table merge.
 – flodel
 Nov 17 at 1:55
 
 
 
 
 
 
 
 
 
 @flodel - it is indeed what I want to use. I want to "match" the elements of each vector in source.list and and each vector in target.list
 – stats134711
 Nov 17 at 13:03
 
 
 
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
My goal is to take a source list of vectors and apply a mapping to each vector. The mapping creates a list of lists. When the vector size in each list element is large (~20000), and the number of elements in the source list is large (~5000), my program does not seem to be efficient. How can I optimize my code?
I've tried two things: implementation in Rcpp and mclapply. The Rcpp implementation is comparable as expected, and the mclapply implementation may not be that efficient because I'm running my function in a larger function that is already parallelized over multiple cores.
Basic example
source.list = rep(list(seq(6)),3)
target.list = list(c(1,2),c(3,4),c(5,6))
result = map_partition(source.list,target.list)
> result
[[1]]
[[1]][[1]]
[1] 1 2
[[1]][[2]]
[1] 3 4
[[1]][[3]]
[1] 5 6
[[2]]
[[2]][[1]]
[1] 1 2
[[2]][[2]]
[1] 3 4
[[2]][[3]]
[1] 5 6
[[3]]
[[3]][[1]]
[1] 1 2
[[3]][[2]]
[1] 3 4
[[3]][[3]]
[1] 5 6
R implementation
map.partition.R <- function(invec, partitionlist) {
  lst <- lapply(partitionlist, function(x) invec[invec %in% x])
  return(lst)
}
result = lapply(X=source.list, FUN=map.partition,
                               partitionlist=target.list)
performance r
My goal is to take a source list of vectors and apply a mapping to each vector. The mapping creates a list of lists. When the vector size in each list element is large (~20000), and the number of elements in the source list is large (~5000), my program does not seem to be efficient. How can I optimize my code?
I've tried two things: implementation in Rcpp and mclapply. The Rcpp implementation is comparable as expected, and the mclapply implementation may not be that efficient because I'm running my function in a larger function that is already parallelized over multiple cores.
Basic example
source.list = rep(list(seq(6)),3)
target.list = list(c(1,2),c(3,4),c(5,6))
result = map_partition(source.list,target.list)
> result
[[1]]
[[1]][[1]]
[1] 1 2
[[1]][[2]]
[1] 3 4
[[1]][[3]]
[1] 5 6
[[2]]
[[2]][[1]]
[1] 1 2
[[2]][[2]]
[1] 3 4
[[2]][[3]]
[1] 5 6
[[3]]
[[3]][[1]]
[1] 1 2
[[3]][[2]]
[1] 3 4
[[3]][[3]]
[1] 5 6
R implementation
map.partition.R <- function(invec, partitionlist) {
  lst <- lapply(partitionlist, function(x) invec[invec %in% x])
  return(lst)
}
result = lapply(X=source.list, FUN=map.partition,
                               partitionlist=target.list)
performance r
performance r
asked Nov 16 at 17:13
stats134711
1374
1374
 
 
 
 
 
 
 Is- function(x) invec[invec %in% x]what you really intend to apply or is it just here for illustration purposes? If it is really what you intend to do, I might have an idea using a data.table merge.
 – flodel
 Nov 17 at 1:55
 
 
 
 
 
 
 
 
 
 @flodel - it is indeed what I want to use. I want to "match" the elements of each vector in source.list and and each vector in target.list
 – stats134711
 Nov 17 at 13:03
 
 
 
add a comment |
 
 
 
 
 
 
 Is- function(x) invec[invec %in% x]what you really intend to apply or is it just here for illustration purposes? If it is really what you intend to do, I might have an idea using a data.table merge.
 – flodel
 Nov 17 at 1:55
 
 
 
 
 
 
 
 
 
 @flodel - it is indeed what I want to use. I want to "match" the elements of each vector in source.list and and each vector in target.list
 – stats134711
 Nov 17 at 13:03
 
 
 
Is
function(x) invec[invec %in% x] what you really intend to apply or is it just here for illustration purposes? If it is really what you intend to do, I might have an idea using a data.table merge.– flodel
Nov 17 at 1:55
Is
function(x) invec[invec %in% x] what you really intend to apply or is it just here for illustration purposes? If it is really what you intend to do, I might have an idea using a data.table merge.– flodel
Nov 17 at 1:55
@flodel - it is indeed what I want to use. I want to "match" the elements of each vector in source.list and and each vector in target.list
– stats134711
Nov 17 at 13:03
@flodel - it is indeed what I want to use. I want to "match" the elements of each vector in source.list and and each vector in target.list
– stats134711
Nov 17 at 13:03
add a comment |
                                1 Answer
                                1
                        
active
oldest
votes
up vote
1
down vote
accepted
You have two lapply calls in there, so essentially two hidden for loops contributing to the slow execution.
Here, one way to tap into faster compiled functions is to rely on the data.table package for merging (a.k.a. a database join) your two data sets.
First, we turn your two lists into two data.tables:
idx1 <- rep(seq_along(source.list), lengths(source.list))
idx2 <- rep(seq_along(target.list), lengths(target.list))
library(data.table)
X <- data.table(id = unlist(source.list), idx1 = idx1,
                pos = seq_along(idx1), key = "id")
Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
where idx1 and idx2 are the variables telling us in which element of source.list and target.list a specific item (id) belongs. Also pos is just a temporary row number we will use later to sort the data back, as the next statement below (the merge) shuffles things around:
Z <- Y[X, allow.cartesian=TRUE]
Z <- Z[order(Z$pos)]
At this point, if it's possible for you, I would recommend you stop here. By that, I mean to make the next steps of your analysis use this Z data.table rather than a nested list, as to avoid further slow processing loops (for/lapply/Map/etc.). However, if you really need the nested list output, you can do:
no_name_split <- function(...) unname(split(...))
gp1 <- factor(Z$idx1, seq_along(source.list))
gp2 <- factor(Z$idx2, seq_along(target.list))
res <- Map(no_name_split, no_name_split(Z$id, gp1),
                          no_name_split(gp2,  gp1))
Below is a simulation with a larger dataset, comparing computation times and checking that the results are identical:
set.seed(632)
source.list <- replicate(500, sample(100000, 1000), simplify = FALSE)
target.list <- replicate(555, sample(100000, 1055), simplify = FALSE)
system.time({
  idx1 <- rep(seq_along(source.list), lengths(source.list))
  idx2 <- rep(seq_along(target.list), lengths(target.list))
  library(data.table)
  X <- data.table(id = unlist(source.list), idx1 = idx1,
                  pos = seq_along(idx1), key = "id")
  Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
  Z <- Y[X, allow.cartesian = TRUE]
  Z <- Z[order(Z$pos)]
  no_name_split <- function(...) unname(split(...))
  gp1 <- factor(Z$idx1, seq_along(source.list))
  gp2 <- factor(Z$idx2, seq_along(target.list))
  res <- Map(no_name_split, no_name_split(Z$id, gp1),
                            no_name_split(gp2,  gp1))
})
#    user  system elapsed 
#   3.394   0.382   3.646 
system.time({
  result <- lapply(X = source.list, FUN = map.partition.R,
                   partitionlist = target.list)
})
#    user  system elapsed 
#  23.943   5.329  36.999 
identical(res, result)
# [1] TRUE
 
 
 
 
 
 
 how do you do it? that is a very clever approach with- data.table. So in general is it because you are using a lookup table to speed up the computation?
 – stats134711
 2 days ago
 
 
 
 
 
add a comment |
                                1 Answer
                                1
                        
active
oldest
votes
                                1 Answer
                                1
                        
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
You have two lapply calls in there, so essentially two hidden for loops contributing to the slow execution.
Here, one way to tap into faster compiled functions is to rely on the data.table package for merging (a.k.a. a database join) your two data sets.
First, we turn your two lists into two data.tables:
idx1 <- rep(seq_along(source.list), lengths(source.list))
idx2 <- rep(seq_along(target.list), lengths(target.list))
library(data.table)
X <- data.table(id = unlist(source.list), idx1 = idx1,
                pos = seq_along(idx1), key = "id")
Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
where idx1 and idx2 are the variables telling us in which element of source.list and target.list a specific item (id) belongs. Also pos is just a temporary row number we will use later to sort the data back, as the next statement below (the merge) shuffles things around:
Z <- Y[X, allow.cartesian=TRUE]
Z <- Z[order(Z$pos)]
At this point, if it's possible for you, I would recommend you stop here. By that, I mean to make the next steps of your analysis use this Z data.table rather than a nested list, as to avoid further slow processing loops (for/lapply/Map/etc.). However, if you really need the nested list output, you can do:
no_name_split <- function(...) unname(split(...))
gp1 <- factor(Z$idx1, seq_along(source.list))
gp2 <- factor(Z$idx2, seq_along(target.list))
res <- Map(no_name_split, no_name_split(Z$id, gp1),
                          no_name_split(gp2,  gp1))
Below is a simulation with a larger dataset, comparing computation times and checking that the results are identical:
set.seed(632)
source.list <- replicate(500, sample(100000, 1000), simplify = FALSE)
target.list <- replicate(555, sample(100000, 1055), simplify = FALSE)
system.time({
  idx1 <- rep(seq_along(source.list), lengths(source.list))
  idx2 <- rep(seq_along(target.list), lengths(target.list))
  library(data.table)
  X <- data.table(id = unlist(source.list), idx1 = idx1,
                  pos = seq_along(idx1), key = "id")
  Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
  Z <- Y[X, allow.cartesian = TRUE]
  Z <- Z[order(Z$pos)]
  no_name_split <- function(...) unname(split(...))
  gp1 <- factor(Z$idx1, seq_along(source.list))
  gp2 <- factor(Z$idx2, seq_along(target.list))
  res <- Map(no_name_split, no_name_split(Z$id, gp1),
                            no_name_split(gp2,  gp1))
})
#    user  system elapsed 
#   3.394   0.382   3.646 
system.time({
  result <- lapply(X = source.list, FUN = map.partition.R,
                   partitionlist = target.list)
})
#    user  system elapsed 
#  23.943   5.329  36.999 
identical(res, result)
# [1] TRUE
 
 
 
 
 
 
 how do you do it? that is a very clever approach with- data.table. So in general is it because you are using a lookup table to speed up the computation?
 – stats134711
 2 days ago
 
 
 
 
 
add a comment |
up vote
1
down vote
accepted
You have two lapply calls in there, so essentially two hidden for loops contributing to the slow execution.
Here, one way to tap into faster compiled functions is to rely on the data.table package for merging (a.k.a. a database join) your two data sets.
First, we turn your two lists into two data.tables:
idx1 <- rep(seq_along(source.list), lengths(source.list))
idx2 <- rep(seq_along(target.list), lengths(target.list))
library(data.table)
X <- data.table(id = unlist(source.list), idx1 = idx1,
                pos = seq_along(idx1), key = "id")
Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
where idx1 and idx2 are the variables telling us in which element of source.list and target.list a specific item (id) belongs. Also pos is just a temporary row number we will use later to sort the data back, as the next statement below (the merge) shuffles things around:
Z <- Y[X, allow.cartesian=TRUE]
Z <- Z[order(Z$pos)]
At this point, if it's possible for you, I would recommend you stop here. By that, I mean to make the next steps of your analysis use this Z data.table rather than a nested list, as to avoid further slow processing loops (for/lapply/Map/etc.). However, if you really need the nested list output, you can do:
no_name_split <- function(...) unname(split(...))
gp1 <- factor(Z$idx1, seq_along(source.list))
gp2 <- factor(Z$idx2, seq_along(target.list))
res <- Map(no_name_split, no_name_split(Z$id, gp1),
                          no_name_split(gp2,  gp1))
Below is a simulation with a larger dataset, comparing computation times and checking that the results are identical:
set.seed(632)
source.list <- replicate(500, sample(100000, 1000), simplify = FALSE)
target.list <- replicate(555, sample(100000, 1055), simplify = FALSE)
system.time({
  idx1 <- rep(seq_along(source.list), lengths(source.list))
  idx2 <- rep(seq_along(target.list), lengths(target.list))
  library(data.table)
  X <- data.table(id = unlist(source.list), idx1 = idx1,
                  pos = seq_along(idx1), key = "id")
  Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
  Z <- Y[X, allow.cartesian = TRUE]
  Z <- Z[order(Z$pos)]
  no_name_split <- function(...) unname(split(...))
  gp1 <- factor(Z$idx1, seq_along(source.list))
  gp2 <- factor(Z$idx2, seq_along(target.list))
  res <- Map(no_name_split, no_name_split(Z$id, gp1),
                            no_name_split(gp2,  gp1))
})
#    user  system elapsed 
#   3.394   0.382   3.646 
system.time({
  result <- lapply(X = source.list, FUN = map.partition.R,
                   partitionlist = target.list)
})
#    user  system elapsed 
#  23.943   5.329  36.999 
identical(res, result)
# [1] TRUE
 
 
 
 
 
 
 how do you do it? that is a very clever approach with- data.table. So in general is it because you are using a lookup table to speed up the computation?
 – stats134711
 2 days ago
 
 
 
 
 
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
You have two lapply calls in there, so essentially two hidden for loops contributing to the slow execution.
Here, one way to tap into faster compiled functions is to rely on the data.table package for merging (a.k.a. a database join) your two data sets.
First, we turn your two lists into two data.tables:
idx1 <- rep(seq_along(source.list), lengths(source.list))
idx2 <- rep(seq_along(target.list), lengths(target.list))
library(data.table)
X <- data.table(id = unlist(source.list), idx1 = idx1,
                pos = seq_along(idx1), key = "id")
Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
where idx1 and idx2 are the variables telling us in which element of source.list and target.list a specific item (id) belongs. Also pos is just a temporary row number we will use later to sort the data back, as the next statement below (the merge) shuffles things around:
Z <- Y[X, allow.cartesian=TRUE]
Z <- Z[order(Z$pos)]
At this point, if it's possible for you, I would recommend you stop here. By that, I mean to make the next steps of your analysis use this Z data.table rather than a nested list, as to avoid further slow processing loops (for/lapply/Map/etc.). However, if you really need the nested list output, you can do:
no_name_split <- function(...) unname(split(...))
gp1 <- factor(Z$idx1, seq_along(source.list))
gp2 <- factor(Z$idx2, seq_along(target.list))
res <- Map(no_name_split, no_name_split(Z$id, gp1),
                          no_name_split(gp2,  gp1))
Below is a simulation with a larger dataset, comparing computation times and checking that the results are identical:
set.seed(632)
source.list <- replicate(500, sample(100000, 1000), simplify = FALSE)
target.list <- replicate(555, sample(100000, 1055), simplify = FALSE)
system.time({
  idx1 <- rep(seq_along(source.list), lengths(source.list))
  idx2 <- rep(seq_along(target.list), lengths(target.list))
  library(data.table)
  X <- data.table(id = unlist(source.list), idx1 = idx1,
                  pos = seq_along(idx1), key = "id")
  Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
  Z <- Y[X, allow.cartesian = TRUE]
  Z <- Z[order(Z$pos)]
  no_name_split <- function(...) unname(split(...))
  gp1 <- factor(Z$idx1, seq_along(source.list))
  gp2 <- factor(Z$idx2, seq_along(target.list))
  res <- Map(no_name_split, no_name_split(Z$id, gp1),
                            no_name_split(gp2,  gp1))
})
#    user  system elapsed 
#   3.394   0.382   3.646 
system.time({
  result <- lapply(X = source.list, FUN = map.partition.R,
                   partitionlist = target.list)
})
#    user  system elapsed 
#  23.943   5.329  36.999 
identical(res, result)
# [1] TRUE
You have two lapply calls in there, so essentially two hidden for loops contributing to the slow execution.
Here, one way to tap into faster compiled functions is to rely on the data.table package for merging (a.k.a. a database join) your two data sets.
First, we turn your two lists into two data.tables:
idx1 <- rep(seq_along(source.list), lengths(source.list))
idx2 <- rep(seq_along(target.list), lengths(target.list))
library(data.table)
X <- data.table(id = unlist(source.list), idx1 = idx1,
                pos = seq_along(idx1), key = "id")
Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
where idx1 and idx2 are the variables telling us in which element of source.list and target.list a specific item (id) belongs. Also pos is just a temporary row number we will use later to sort the data back, as the next statement below (the merge) shuffles things around:
Z <- Y[X, allow.cartesian=TRUE]
Z <- Z[order(Z$pos)]
At this point, if it's possible for you, I would recommend you stop here. By that, I mean to make the next steps of your analysis use this Z data.table rather than a nested list, as to avoid further slow processing loops (for/lapply/Map/etc.). However, if you really need the nested list output, you can do:
no_name_split <- function(...) unname(split(...))
gp1 <- factor(Z$idx1, seq_along(source.list))
gp2 <- factor(Z$idx2, seq_along(target.list))
res <- Map(no_name_split, no_name_split(Z$id, gp1),
                          no_name_split(gp2,  gp1))
Below is a simulation with a larger dataset, comparing computation times and checking that the results are identical:
set.seed(632)
source.list <- replicate(500, sample(100000, 1000), simplify = FALSE)
target.list <- replicate(555, sample(100000, 1055), simplify = FALSE)
system.time({
  idx1 <- rep(seq_along(source.list), lengths(source.list))
  idx2 <- rep(seq_along(target.list), lengths(target.list))
  library(data.table)
  X <- data.table(id = unlist(source.list), idx1 = idx1,
                  pos = seq_along(idx1), key = "id")
  Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
  Z <- Y[X, allow.cartesian = TRUE]
  Z <- Z[order(Z$pos)]
  no_name_split <- function(...) unname(split(...))
  gp1 <- factor(Z$idx1, seq_along(source.list))
  gp2 <- factor(Z$idx2, seq_along(target.list))
  res <- Map(no_name_split, no_name_split(Z$id, gp1),
                            no_name_split(gp2,  gp1))
})
#    user  system elapsed 
#   3.394   0.382   3.646 
system.time({
  result <- lapply(X = source.list, FUN = map.partition.R,
                   partitionlist = target.list)
})
#    user  system elapsed 
#  23.943   5.329  36.999 
identical(res, result)
# [1] TRUE
answered Nov 17 at 23:34


flodel
3,2021915
3,2021915
 
 
 
 
 
 
 how do you do it? that is a very clever approach with- data.table. So in general is it because you are using a lookup table to speed up the computation?
 – stats134711
 2 days ago
 
 
 
 
 
add a comment |
 
 
 
 
 
 
 how do you do it? that is a very clever approach with- data.table. So in general is it because you are using a lookup table to speed up the computation?
 – stats134711
 2 days ago
 
 
 
 
 
how do you do it? that is a very clever approach with
data.table. So in general is it because you are using a lookup table to speed up the computation?– stats134711
2 days ago
how do you do it? that is a very clever approach with
data.table. So in general is it because you are using a lookup table to speed up the computation?– stats134711
2 days ago
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f207821%2fspeed-up-computation-on-lists-of-lists-in-r-scalability-issues%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
 
Is
function(x) invec[invec %in% x]what you really intend to apply or is it just here for illustration purposes? If it is really what you intend to do, I might have an idea using a data.table merge.– flodel
Nov 17 at 1:55
@flodel - it is indeed what I want to use. I want to "match" the elements of each vector in source.list and and each vector in target.list
– stats134711
Nov 17 at 13:03