How to find a duplicate field in over 100 files

I have about 120 files, each with more than 1000 lines.

Each line has its own key in it. The columns are | separated.

Here is an example line. The key column (always column 11) is: 201075ITE854075_RECardProtectionlogi.msg

Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known

Is there a way to find all lines that have a matching key (column 11) value? The whole line won't match.
Can I do this on the command line or in a script?
I will be using Cygwin.

I have no idea how to even start, so even if you only feel willing to give me suitable commands to look up, that would be gratefully received.

Each line has its own key, so there are potentially as many keys as there are lines.

I just want the script to run on an entire directory and report duplicate keys among all the files, without any other user input.

What defines a key is being in column 11.

edited Nov 21 at 17:47

asked Nov 21 at 16:00

WendyG

1063

New contributor

I'm confused by your question. Are you simply looking for the literal string 201075ITE854075_RECardProtectionlogi.msg in all files or will the search key be dynamic in some way? If you are only looking for that one string you can probably just use grep -r '201075ITE854075_RECardProtectionlogi.msg' /directory/with/files
– Jesse_b
Nov 21 at 16:18

@Jesse_b I believe this was easier to understand until it was edited; the key was in bold in the middle of the row. I have added extra detail.
– WendyG
Nov 21 at 16:36

Knowing the key is not the issue. I don't understand whether the key is dynamic or not.
– Jesse_b
Nov 21 at 16:56

Do you want the user of your script to enter a key to search for? Or do you want to extract all of the keys and find out whether any are duplicated (as the title suggests)? Are the keys the 4th pipe-delimited field in all the files?
– Jeff Schaller
Nov 21 at 17:14

shell-script files


2 Answers

Assuming that by "key" you mean "column", you could use something like this:

You will probably have to change the pattern in the $(find -iname) to whatever extension your log files have (or just remove it if the only files in the directory are log files). This will recursively find all log files and match them.

The output for some test data looks like this:

test_data.txt:1:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
test_data.txt:5:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
test_data_2.txt:2:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
test_data.txt:3:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
test_data_2.txt:4:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known
test_data.txt:7:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known

Those are all the lines in the files whose field 11 is duplicated.

Explanation of what the command does.

cut -f 11 -d "|" Get 11th field (as delimited by |)

find . -type f -iname "*.txt" consider any files ending in .txt in current directory (recursively)

sort | uniq -d show all duplicated "field 11s"

sed /\/./g' This is a hack because messes up bash. We replace it with ., which grep matches as any character.

while read duplicate; do grep -rHn "|$duplicate|" *; done - iterate over the list of duplicates and find all occurrences of them, printing the file name and line number where each duplicate occurred.
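The steps explained above could be assembled roughly as follows. This is only a sketch, not the exact command from this answer: the function name find_duplicate_lines and the *.txt pattern are assumptions, and it swaps the sed hack for grep -F (fixed-string matching), so no character in a key needs escaping. It also adds cut -s so that lines without any | are skipped.

```shell
#!/usr/bin/env bash
# Sketch only: print every line whose 11th |-separated field occurs more
# than once across all *.txt files below a directory.
# find_duplicate_lines and the *.txt pattern are hypothetical choices.
# Usage: find_duplicate_lines /path/to/logs
find_duplicate_lines() {
    local dir=${1:-.}
    # Extract field 11 from every file, then keep only values seen twice or more.
    find "$dir" -type f -iname '*.txt' -exec cut -s -d '|' -f 11 {} + |
        sort | uniq -d |
        while IFS= read -r duplicate; do
            # Lines with fewer than 11 fields yield an empty key; ignore those.
            [ -n "$duplicate" ] || continue
            # -F: fixed string, -H: print file name, -n: print line number.
            find "$dir" -type f -iname '*.txt' \
                -exec grep -HnF -- "|$duplicate|" {} +
        done
}
```

Like the original loop, the grep step matches the key anywhere between two | characters, not only in field 11, so a key that also shows up in another column would be reported too.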

answered Nov 21 at 17:18

f41lurizer

1091

New contributor

Yes, I did mean column. These are errors from a DB migration, and these are the keys in this DB table (filename).
– WendyG
Nov 21 at 17:31


It's not clear what you are trying to do, but I'll give it a try.

First, what is your line? You gave this as a line:

Error: null, Data:|862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known

If your lines look like that, then your key is in field 11:

201075ITE854075_RECardProtectionlogi.msg

But what defines your key? Is it just being in field 11?

If so, you can do something like this in the directory with your target files:

sort --field-separator='|' --key=11 <(\grep --recursive --line-number --color=always --with-filename '' *)

This will give you colorized output with the name of the file, followed by the line number in that file, and then the line itself, all sorted by key field 11. In the output, all matching keys from any file appear one on top of the other.

I think this will give you a clue, at least.

Note: the backslash in front of grep is there to bypass any grep aliases.
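If you only want the lines whose key repeats, rather than every line sorted, a variation on the same idea is to pipe the prefixed grep output through awk and count field 11. This is a rough sketch under a few assumptions: the function name report_duplicate_keys and the *.txt pattern are made up for illustration, file names must not contain |, and all keyed lines are held in memory (fine for roughly 120 files of 1000 lines).

```shell
#!/usr/bin/env bash
# Sketch only: print, with file name and line number, every line whose
# 11th |-separated field occurs more than once across the *.txt files.
# report_duplicate_keys and the *.txt pattern are hypothetical choices.
# Usage: report_duplicate_keys /path/to/logs
report_duplicate_keys() {
    local dir=${1:-.}
    # grep -Hn '' prefixes every line with "file:line:"; the prefix contains
    # no |, so the key is still field 11 when awk splits on |.
    find "$dir" -type f -iname '*.txt' -exec grep -Hn '' {} + |
        awk -F'|' '
            NF >= 11 && $11 != "" {
                count[$11]++                      # occurrences of this key
                lines[$11] = lines[$11] $0 "\n"   # remember the prefixed lines
            }
            END {
                for (key in count)
                    if (count[key] > 1)
                        printf "%s", lines[key]
            }
        '
}
```

Unlike sorting on --key=11, which compares from field 11 to the end of the line, this compares field 11 alone. The order in which awk visits the keys in the END block is unspecified, so pipe the result through sort if you need stable output.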

edited Nov 21 at 17:37

answered Nov 21 at 17:29

matsib.dev

14613
