Speed up computation on lists of lists in R (scalability issues)
My goal is to take a source list of vectors and apply a mapping to each vector. The mapping creates a list of lists. When the vector size in each list element is large (~20,000) and the number of elements in the source list is large (~5,000), my program slows down dramatically. How can I optimize my code?
I've tried two things: an implementation in Rcpp and mclapply. The Rcpp implementation performs comparably, as expected, and the mclapply implementation may not be a good fit because I'm running my function inside a larger function that is already parallelized over multiple cores.
Basic example
source.list = rep(list(seq(6)),3)
target.list = list(c(1,2),c(3,4),c(5,6))
result = map_partition(source.list, target.list)  # map_partition stands in for the implementation shown below
> result
[[1]]
[[1]][[1]]
[1] 1 2
[[1]][[2]]
[1] 3 4
[[1]][[3]]
[1] 5 6
[[2]]
[[2]][[1]]
[1] 1 2
[[2]][[2]]
[1] 3 4
[[2]][[3]]
[1] 5 6
[[3]]
[[3]][[1]]
[1] 1 2
[[3]][[2]]
[1] 3 4
[[3]][[3]]
[1] 5 6
R implementation
# For each partition vector x, keep the elements of invec that also occur in x.
map.partition.R <- function(invec, partitionlist) {
  lapply(partitionlist, function(x) invec[invec %in% x])
}
result = lapply(X = source.list, FUN = map.partition.R,
                partitionlist = target.list)
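For reference, the mclapply attempt mentioned above would look roughly like this (a sketch assuming the parallel package; mc.cores = 4 is an illustrative value, not from the original post):
library(parallel)
# Same call as above, but forked over mc.cores workers. If the enclosing
# function is already parallelized, this nested parallelism can
# oversubscribe the CPU and hurt rather than help.
result.mc <- mclapply(source.list, map.partition.R,
                      partitionlist = target.list, mc.cores = 4)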
Tags: performance, r
asked Nov 16 at 17:13
stats134711
Is function(x) invec[invec %in% x] what you really intend to apply, or is it just here for illustration purposes? If it is really what you intend to do, I might have an idea using a data.table merge.
– flodel
Nov 17 at 1:55
@flodel - it is indeed what I want to use. I want to "match" the elements of each vector in source.list with each vector in target.list.
– stats134711
Nov 17 at 13:03
1 Answer
You have two lapply calls in there, so essentially two hidden for loops contributing to the slow execution.
Here, one way to tap into faster compiled functions is to rely on the data.table package for merging (a.k.a. a database join) your two data sets. First, we turn your two lists into two data.tables:
idx1 <- rep(seq_along(source.list), lengths(source.list))
idx2 <- rep(seq_along(target.list), lengths(target.list))
library(data.table)
X <- data.table(id = unlist(source.list), idx1 = idx1,
                pos = seq_along(idx1), key = "id")
Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
where idx1 and idx2 are the variables telling us in which element of source.list and target.list a specific item (id) belongs. Also, pos is just a temporary row number we will use later to sort the data back, as the next statement below (the merge) shuffles things around:
Z <- Y[X, allow.cartesian=TRUE]
Z <- Z[order(Z$pos)]
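On the toy example, inspecting Z shows the long format directly (illustrative; the exact column order can vary across data.table versions):
# Each row of Z is one matched value: id is the value itself, idx1 the
# source.list element it came from, idx2 the target.list element it falls
# into (NA if it matched no partition), and pos its original position.
head(Z)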
At this point, if it's possible for you, I would recommend you stop here. By that, I mean make the next steps of your analysis use this Z data.table rather than a nested list, so as to avoid further slow processing loops (for/lapply/Map/etc.). However, if you really need the nested list output, you can do:
no_name_split <- function(...) unname(split(...))
gp1 <- factor(Z$idx1, seq_along(source.list))
gp2 <- factor(Z$idx2, seq_along(target.list))
res <- Map(no_name_split, no_name_split(Z$id, gp1),
           no_name_split(gp2, gp1))
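To see what the unnamed split is doing, here is a tiny standalone illustration (toy values, not taken from the post):
# split() groups a vector by a factor; unname() drops the "1", "2" group
# names so the result matches the plain nested-list structure required.
no_name_split(c(1, 2, 3, 4), factor(c(1, 1, 2, 2)))
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 3 4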
Below is a simulation with a larger dataset, comparing computation times and checking that the results are identical:
set.seed(632)
source.list <- replicate(500, sample(100000, 1000), simplify = FALSE)
target.list <- replicate(555, sample(100000, 1055), simplify = FALSE)
system.time({
  idx1 <- rep(seq_along(source.list), lengths(source.list))
  idx2 <- rep(seq_along(target.list), lengths(target.list))
  library(data.table)
  X <- data.table(id = unlist(source.list), idx1 = idx1,
                  pos = seq_along(idx1), key = "id")
  Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
  Z <- Y[X, allow.cartesian = TRUE]
  Z <- Z[order(Z$pos)]
  no_name_split <- function(...) unname(split(...))
  gp1 <- factor(Z$idx1, seq_along(source.list))
  gp2 <- factor(Z$idx2, seq_along(target.list))
  res <- Map(no_name_split, no_name_split(Z$id, gp1),
             no_name_split(gp2, gp1))
})
# user system elapsed
# 3.394 0.382 3.646
system.time({
  result <- lapply(X = source.list, FUN = map.partition.R,
                   partitionlist = target.list)
})
# user system elapsed
# 23.943 5.329 36.999
identical(res, result)
# [1] TRUE
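For more stable timings than a single system.time() call, one could repeat the measurement with the microbenchmark package (an assumption on my part, not used in the original answer; times = 5 is an illustrative value):
library(microbenchmark)
# Wrap the data.table pipeline in a function so both approaches can be
# timed repeatedly under identical conditions.
dt_approach <- function(source.list, target.list) {
  idx1 <- rep(seq_along(source.list), lengths(source.list))
  idx2 <- rep(seq_along(target.list), lengths(target.list))
  X <- data.table(id = unlist(source.list), idx1 = idx1,
                  pos = seq_along(idx1), key = "id")
  Y <- data.table(id = unlist(target.list), idx2 = idx2, key = "id")
  Z <- Y[X, allow.cartesian = TRUE]
  Z <- Z[order(Z$pos)]
  no_name_split <- function(...) unname(split(...))
  gp1 <- factor(Z$idx1, seq_along(source.list))
  gp2 <- factor(Z$idx2, seq_along(target.list))
  Map(no_name_split, no_name_split(Z$id, gp1), no_name_split(gp2, gp1))
}
microbenchmark(
  datatable = dt_approach(source.list, target.list),
  naive = lapply(source.list, map.partition.R, partitionlist = target.list),
  times = 5
)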
answered Nov 17 at 23:34 (accepted)
flodel

How do you do it? That is a very clever approach with data.table. So in general, is it because you are using a lookup table to speed up the computation?
– stats134711
2 days ago