Stitching together badly formatted log files
I needed to stitch together a large number of badly formatted log files. I [painfully] developed a script for it in R, one that successfully dealt with the various flaws in the dataset. However, this script is rather slow. Considering I'm a neophyte to the field (and don't know what I do not know), I was hoping I could get some suggestions on how to speed this code up - using either R, or any other Windows-based/CLI tools.
#Packages used: data.table (fread, rbindlist) and stringr (string helpers).
library(data.table)
library(stringr)

#Getting the path for all files within a particular directory.
files = list.files(path = dirs[j], full.names = T)

#The files are in csv format and in the ideal case have exactly 5 columns. However, the 5th column can contain an arbitrary number of commas. If I try to fread with sep = ",", certain rows can be of arbitrarily high length. If I use select = 1:5 to subset each row, I lose data.
#My solution was to read each line into a single column and then separate it into columns within the script, based on the location of the first 4 commas.
data <- rbindlist(lapply(files, fread, sep = "\n", fill = T, header = F))

#Removing empty rows.
retain <- unlist(lapply(data, function(x) {
  str_detect(x, ".")
}))
data[retain, ] -> data

#Removing rows where there is no data in the 5th column.
retain <- unlist(lapply(data, function(x) {
  str_detect(trimws(x, which = 'both'), ".+,.+,.*,.*,.+")
}))
data[retain, ] -> data

#This replaces the first 4 commas with a tab delimiter.
for (i in 1:4) {
  data <- data.frame(lapply(data, function(x) {
    str_replace(x, ",", "\t")
  }), stringsAsFactors = F)
}

#This splits each row into 5 separate columns, always.
data <- unlist(lapply(data, function(x) {
  unlist(strsplit(x, "\t", fixed = T))
}))

#Changes the format from a character vector to a data table.
data = data.frame(matrix(data, ncol = 5, byrow = T), stringsAsFactors = F)
performance beginner csv r
asked Oct 18 at 15:14 by Abs
edited Oct 18 at 16:44 by 200_success
Welcome to Code Review. The focus of this site is to review working code. From your question, it looks like the code does work, but it's slow, and you're not sure why. That's fine, though. However, you should edit your post to show that your code does indeed work and describe the log files a little bit. Also, the current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How to Ask for examples, and revise the title accordingly.
– Zeta
Oct 18 at 15:39
Welcome to Code Review! The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code. Please see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles.
– Toby Speight
Oct 18 at 16:36
@Abs can you supply a first data example?
– minem
Oct 19 at 6:17
@Abs why don't you read in the data with sep = ','?
– minem
Oct 19 at 6:42
2 Answers
Without knowing exactly your data structure, I have to guess that it looks like this:

library(data.table)
library(stringr)

data <- data.table(a = c('1,1,1,4,6',
                         '1,2,3,4,',
                         '',
                         '1,2,3,4,',
                         '1,2,3,4,5'))

If that is correct, then your desired operations could be done like this:

#Removing empty rows.
data <- data[data[[1]] != '']

#Removing rows where there is no data in the 5th column.
retain <- str_detect(trimws(data[[1]], which = 'both'), ".+,.+,.*,.*,.+")
data <- data[retain, ]

#This replaces the first 4 commas with a tab delimiter.
for (i in 1:4) data[[1]] <- str_replace(data[[1]], ",", "\t")

#This splits the row into 5 separate columns, always.
dNew <- as.data.table(tstrsplit(data[[1]], "\t", fixed = T))

dNew
#    V1 V2 V3 V4 V5
# 1:  1  1  1  4  6
# 2:  1  2  3  4  5

The main problem is that you are using lapply where it is not needed.
answered Oct 19 at 6:39 by minem
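For context, a rough sketch of how this vectorized cleanup could slot into the file-reading step from the question - assuming the same dirs/j setup and that each file is read as a single column, as in the original script; treat it as an illustration rather than a tested drop-in:

library(data.table)
library(stringr)

#Assumed from the question: 'dirs' is a vector of directories and 'j' an index into it.
files <- list.files(path = dirs[j], full.names = TRUE)
#Read every line of every file into one character column, as in the original script.
data <- rbindlist(lapply(files, fread, sep = "\n", fill = TRUE, header = FALSE))
#Vectorized cleanup on that single column, as above.
data <- data[data[[1]] != '']
retain <- str_detect(trimws(data[[1]]), ".+,.+,.*,.*,.+")
data <- data[retain, ]
for (i in 1:4) data[[1]] <- str_replace(data[[1]], ",", "\t")
dNew <- as.data.table(tstrsplit(data[[1]], "\t", fixed = TRUE))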
Using R for such a task is overkill and wasteful. This is easier and quicker done with some common command line tools, and afterwards it's trivial to work in R.
You can quickly fix these files using sed.
There is a detailed description of how to adjust the first occurrences in this question. The idea would be to select a non-offending separator - say ;. Similarly, lines can be deleted as shown in this question.
So this turns the first 4 commas into semicolons (4 times the same instruction), deletes all lines that don't contain 4 semicolons, and creates a backup of each file with the extension .old. The *.csv does it for all files in the directory.
sed -i.old -e 's/,/;/' -e 's/,/;/' -e 's/,/;/' -e 's/,/;/' -e "/^\(.*;\)\{4\}.*$/!d" *.csv
Of course, run this over a copy! It might need a few tweaks for Windows.
answered Oct 19 at 8:03 by bdecaf
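Following up on "afterwards it's trivial to work in R", a minimal sketch of reading the sed-fixed files back into R with data.table, assuming the same dirs/j setup as the question (the pattern argument is illustrative):

library(data.table)

#Assumed from the question: 'dirs' is a vector of directories and 'j' an index into it.
#The pattern excludes the .old backups created by sed.
files <- list.files(path = dirs[j], pattern = "\\.csv$", full.names = TRUE)
#After the sed pass the first four delimiters are ';', so fread can split on them directly;
#any remaining commas stay inside the 5th column.
data <- rbindlist(lapply(files, fread, sep = ";", header = FALSE, fill = TRUE))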