Stitching together badly formatted log files
I needed to stitch together a large number of badly formatted log files. I [painfully] developed a script for it in R, one that successfully dealt with the various flaws in the dataset. However, this script is rather slow. Considering I'm a neophyte to the field (and don't know what I do not know), I was hoping I could get some suggestions on how to speed this code up - using either R, or any other Windows-based/CLI tools.
#Packages used: data.table (fread, rbindlist) and stringr (string helpers).
library(data.table)
library(stringr)

#Getting the path for all files within a particular directory.
files = list.files(path = dirs[j], full.names = T)

#The files are in csv format and in the ideal case have exactly 5 columns. However, the 5th column can contain an arbitrary number of commas. If I try to fread with sep = ",", certain rows can be of arbitrarily high length. If I use select = 1:5 to subset each row, I lose data.
#My solution was to read each line into a single column and then separate it into columns within the script, based on the location of the first 4 commas.
data <- rbindlist(lapply(files, fread, sep = "\n", fill = T, header = F))

#Removing empty rows.
retain <- unlist(lapply(data, function(x) {
  str_detect(x, ".")
}))
data[retain, ] -> data

#Removing rows where there is no data in the 5th column.
retain <- unlist(lapply(data, function(x) {
  str_detect(trimws(x, which = 'both'), ".+,.+,.*,.*,.+")
}))
data[retain, ] -> data

#This replaces the first 4 commas with a tab delimiter.
for (i in 1:4) {
  data <- data.frame(lapply(data, function(x) {
    str_replace(x, ",", "\t")
  }), stringsAsFactors = F)
}

#This splits each row into 5 separate columns, always.
data <- unlist(lapply(data, function(x) {
  unlist(strsplit(x, "\t", fixed = T))
}))

#Changes the format from a character vector to a data table.
data = data.frame(matrix(data, ncol = 5, byrow = T), stringsAsFactors = F)
performance beginner csv r
asked Oct 18 at 15:14 by Abs
edited Oct 18 at 16:44 by 200_success
Welcome to Code Review. The focus of this site is to review working code. From your question, it looks like the code does work, but it's slow, and you're not sure why. That's fine, though. However, you should edit your post to show that your code does indeed work and describe the log files a little bit. Also, the current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How to Ask for examples, and revise the title accordingly.
– Zeta
Oct 18 at 15:39
Welcome to Code Review! The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code. Please see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles.
– Toby Speight
Oct 18 at 16:36
@Abs can you supply a first data example?
– minem
Oct 19 at 6:17
@Abs why don't you read in the data with sep = ','?
– minem
Oct 19 at 6:42
2 Answers
Without knowing exactly your data structure, I have to guess that it looks like this:

library(data.table)
library(stringr)

data <- data.table(a = c('1,1,1,4,6',
                         '1,2,3,4,',
                         '',
                         '1,2,3,4,',
                         '1,2,3,4,5'))

If that is correct, then your desired operations could be done like this:

#Removing empty rows.
data <- data[data[[1]] != '']

#Removing rows where there is no data in the 5th column.
retain <- str_detect(trimws(data[[1]], which = 'both'), ".+,.+,.*,.*,.+")
data <- data[retain, ]

#This replaces the first 4 commas with a tab delimiter.
for (i in 1:4) data[[1]] <- str_replace(data[[1]], ",", "\t")

#This splits the row into 5 separate columns, always.
dNew <- as.data.table(tstrsplit(data[[1]], "\t", fixed = T))

dNew
#    V1 V2 V3 V4 V5
# 1:  1  1  1  4  6
# 2:  1  2  3  4  5

The main problem is that you are using lapply where it is not needed.
answered Oct 19 at 6:39 by minem
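For context, a rough sketch of how this vectorized cleanup could slot into the file-reading step from the question - assuming the same dirs/j setup and that each file is read as a single column, as in the original script; treat it as an illustration rather than a tested drop-in:

library(data.table)
library(stringr)

#Assumed from the question: 'dirs' is a vector of directories and 'j' an index into it.
files <- list.files(path = dirs[j], full.names = TRUE)
#Read every line of every file into one character column, as in the original script.
data <- rbindlist(lapply(files, fread, sep = "\n", fill = TRUE, header = FALSE))
#Vectorized cleanup on that single column, as above.
data <- data[data[[1]] != '']
retain <- str_detect(trimws(data[[1]]), ".+,.+,.*,.*,.+")
data <- data[retain, ]
for (i in 1:4) data[[1]] <- str_replace(data[[1]], ",", "\t")
dNew <- as.data.table(tstrsplit(data[[1]], "\t", fixed = TRUE))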
Using R for such a task is overkill and wasteful. This is easier and quicker done with some common command line tools, and afterwards it's trivial to work in R.
You can quickly fix these files using sed.
There is a detailed description of how to adjust the first occurrences in this question. The idea would be to select a non-offending separator - say ;. Similarly, lines can be deleted as shown in this question.
So this turns the first 4 commas into semicolons (4 times the same instruction), deletes all lines that don't contain 4 semicolons, and creates a backup of each file with the extension .old. The *.csv does it for all files in the directory.
sed -i.old -e 's/,/;/' -e 's/,/;/' -e 's/,/;/' -e 's/,/;/' -e "/^\(.*;\)\{4\}.*$/!d" *.csv
Of course, run this over a copy! It might need a few tweaks for Windows.
answered Oct 19 at 8:03 by bdecaf
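Following up on "afterwards it's trivial to work in R", a minimal sketch of reading the sed-fixed files back into R with data.table, assuming the same dirs/j setup as the question (the pattern argument is illustrative):

library(data.table)

#Assumed from the question: 'dirs' is a vector of directories and 'j' an index into it.
#The pattern excludes the .old backups created by sed.
files <- list.files(path = dirs[j], pattern = "\\.csv$", full.names = TRUE)
#After the sed pass the first four delimiters are ';', so fread can split on them directly;
#any remaining commas stay inside the 5th column.
data <- rbindlist(lapply(files, fread, sep = ";", header = FALSE, fill = TRUE))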