how to find duplicate field in over 100 files























I have about 120 files, each with more than 1000 lines.

Each line has its own key in it. The columns are separated by |.

Here is an example line; the key column (always column 11) is: 201075ITE854075_RECardProtectionlogi.msg



Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known



Is there a way to find all lines that have a matching key (column 11) value? The whole line won't match.
Can I do this on the command line or in a script?
I will be using Cygwin.



I have no idea how to even start, so even if you only feel willing to give me suitable commands to look up, that would be gratefully received.





Each line has its own key, so there are potentially as many keys as there are lines.



I just want the script to run on an entire directory and report duplicate keys amongst all the files, without any other user input.



What defines a key is being in column 11.






























  • I'm confused by your question. Are you simply looking for the literal string 201075ITE854075_RECardProtectionlogi.msg in all files or will the search key be dynamic in some way? If you are only looking for that one string you can probably just use grep -r '201075ITE854075_RECardProtectionlogi.msg' /directory/with/files
    – Jesse_b
    Nov 21 at 16:18










  • @Jesse_b I believe this was easier to understand until it was edited, the key was in bold in the middle of the row. I have added extra detail
    – WendyG
    Nov 21 at 16:36










  • Knowing the key is not the issue. I don't understand whether the key is dynamic or not.
    – Jesse_b
    Nov 21 at 16:56










  • do you want the user of your script to enter a key to search for? Or do you want to extract all of the keys and find if any are duplicated (title)? Are the keys the 4th pipe-delimited field in all the files?
    – Jeff Schaller
    Nov 21 at 17:14















shell-script files














edited Nov 21 at 17:47






























asked Nov 21 at 16:00









WendyG


2 Answers






























Assuming that by "key" you mean "column", you could use something like this:

cut -f 11 -d "|" $(find . -type f -iname "*.txt") | sort | uniq -d | sed 's/\\/./g' | while read duplicate; do grep -rHn "|$duplicate|" * ; done

You will probably have to change the pattern in $(find ... -iname "*.txt") to whatever extension your log files have (or just remove the -iname test if the only files in the directory are log files). This will recursively find all log files and match them.

The output for some test data looks like this:

test_data.txt:1:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
test_data.txt:5:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
test_data_2.txt:2:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
test_data.txt:3:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
test_data_2.txt:4:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known
test_data.txt:7:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known

Those are all the lines in the files whose field 11 value is duplicated.

Explanation of what the command does:

cut -f 11 -d "|" gets the 11th field (as delimited by |).

find . -type f -iname "*.txt" considers any files ending in .txt under the current directory (recursively).

sort | uniq -d shows each "field 11" value that occurs more than once.

sed 's/\\/./g' is a hack, because a backslash messes up bash. We replace it with ., which grep matches as any character.

while read duplicate; do grep -rHn "|$duplicate|" * ; done iterates over the list of duplicates and finds all occurrences of them, outputting the filename and line number of each occurrence.
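As an alternative sketch (not part of the command above), the same report can be produced with a single awk program that reads the files twice, so no grep loop or escaping hack is needed. The function name find_dup_keys is made up for illustration, and it assumes, as in the question, that the key is always the 11th |-separated field. Lines with fewer than 11 fields are skipped.

```shell
# Hypothetical helper: print file:line:content for every line whose
# 11th |-separated field also occurs on at least one other line.
find_dup_keys() {
    awk -F'|' '
        NF < 11       { next }                  # no key column on this line
        pass == 1     { seen[$11]++; next }     # first read: count each key
        seen[$11] > 1 { print FILENAME ":" FNR ":" $0 }  # second read: report repeats
    ' pass=1 "$@" pass=2 "$@"
}
```

It could be called as find_dup_keys ./*.txt from the directory holding the files. The output is in file order; piping it through sort -t'|' -k11,11 would group lines that share a key.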






answered Nov 21 at 17:18

f41lurizer


















  • yes i did mean column, these are errors from a DB migration, and these are the keys in this DB table (filename)
    – WendyG
    Nov 21 at 17:31































It's not clear what you are trying to do, but I'll give it a try.

First, what is your line? You gave this as a line:

Error: null, Data:|862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known

If your lines look like that, then your key is in field 11:

201075ITE854075_RECardProtectionlogi.msg

But what defines your key? Is it just being in field 11?

If so, you can do something like this in the directory with your target files:

sort --field-separator='|' --key=11,11 <(\grep --recursive --line-number --color=always --with-filename '' *)

This will give you colorized output: the name of the file, followed by the line number in that file, and then the line itself, all sorted by key field 11. In the output, lines with matching keys from any file therefore appear directly above and below each other.

I think this will give you a clue, at least.

Note: the backslash in front of grep is there to bypass any grep aliases.
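To avoid scanning the sorted output by eye, a tally of the keys that occur more than once may help. This is only a sketch under the same assumption that the key is field 11; the function name dup_key_counts is invented for illustration. It prints each duplicated key followed by a tab and the number of lines it appears on.

```shell
# Hypothetical helper: list each field-11 value that occurs more than
# once across the given files, as "key<TAB>count", sorted by key.
dup_key_counts() {
    awk -F'|' '
        NF >= 11 { count[$11]++ }               # tally keys on well-formed lines
        END {
            for (k in count)
                if (count[k] > 1)
                    print k "\t" count[k]       # report only repeated keys
        }
    ' "$@" | sort
}
```

Running dup_key_counts ./* in the directory gives a short list of keys to look up in the sorted output above.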






















































    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    0
    down vote













    Assuming that by "key" you mean "column", you could use something like this:



    cut -f 11 -d "|" $(find . -type f -iname "*.txt") | sort | uniq -d | sed 's/\/./g' | while read duplicate; do grep -rHn "|$duplicate|" * ; done



    You will probably have to change the contents of the $(find -iname) to whatever extension your log files have (or just remove it if the only files in the directory are log files. This will recursively find all log files and match them.



    The output for some test data looks like this:




    test_data.txt:1:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
    test_data.txt:5:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
    test_data_2.txt:2:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
    test_data.txt:3:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
    test_data_2.txt:4:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known
    test_data.txt:7:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known



    Where those are all lines within the files with field 11 duplicated.



    Explanation of what the command does.



    cut -f 11 -d "|" Get 11th field (as delimited by |)



    find . -type f -iname "*.txt" consider any files ending in .txt in current directory (recursively)



    sort | uniq -d show all duplicated "field 11s"



    sed /\/./g' This is a hack because messes up bash. We replace it with ., which grep matches as any character.



    while read duplicate; do grep -rHn "|$duplicate|" *; done - iterate over list of duplicates and find all occurrences of them, outputting filename and line numbers of where duplicates occured.






    share|improve this answer








    New contributor




    f41lurizer is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.


















    • yes i did mean column, these are errors from a DB migration, and these are the keys in this DB table (filename)
      – WendyG
      Nov 21 at 17:31















    up vote
    0
    down vote













    Assuming that by "key" you mean "column", you could use something like this:



    cut -f 11 -d "|" $(find . -type f -iname "*.txt") | sort | uniq -d | sed 's/\/./g' | while read duplicate; do grep -rHn "|$duplicate|" * ; done



    You will probably have to change the contents of the $(find -iname) to whatever extension your log files have (or just remove it if the only files in the directory are log files. This will recursively find all log files and match them.



    The output for some test data looks like this:




    test_data.txt:1:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
    test_data.txt:5:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
    test_data_2.txt:2:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
    test_data.txt:3:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
    test_data_2.txt:4:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known
    test_data.txt:7:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known



    Where those are all lines within the files with field 11 duplicated.



    Explanation of what the command does.



    cut -f 11 -d "|" Get 11th field (as delimited by |)



    find . -type f -iname "*.txt" consider any files ending in .txt in current directory (recursively)



    sort | uniq -d show all duplicated "field 11s"



    sed /\/./g' This is a hack because messes up bash. We replace it with ., which grep matches as any character.



    while read duplicate; do grep -rHn "|$duplicate|" *; done - iterate over list of duplicates and find all occurrences of them, outputting filename and line numbers of where duplicates occured.






    share|improve this answer








    New contributor




    f41lurizer is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.


















    • yes i did mean column, these are errors from a DB migration, and these are the keys in this DB table (filename)
      – WendyG
      Nov 21 at 17:31













    up vote
    0
    down vote










    up vote
    0
    down vote









    Assuming that by "key" you mean "column", you could use something like this:



    cut -f 11 -d "|" $(find . -type f -iname "*.txt") | sort | uniq -d | sed 's/\/./g' | while read duplicate; do grep -rHn "|$duplicate|" * ; done



    You will probably have to change the contents of the $(find -iname) to whatever extension your log files have (or just remove it if the only files in the directory are log files. This will recursively find all log files and match them.



    The output for some test data looks like this:




    test_data.txt:1:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
    test_data.txt:5:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
    test_data_2.txt:2:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
    test_data.txt:3:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
    test_data_2.txt:4:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known
    test_data.txt:7:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known



    Where those are all lines within the files with field 11 duplicated.



    Explanation of what the command does.



    cut -f 11 -d "|" Get 11th field (as delimited by |)



    find . -type f -iname "*.txt" consider any files ending in .txt in current directory (recursively)



    sort | uniq -d show all duplicated "field 11s"



    sed /\/./g' This is a hack because messes up bash. We replace it with ., which grep matches as any character.



    while read duplicate; do grep -rHn "|$duplicate|" *; done - iterate over list of duplicates and find all occurrences of them, outputting filename and line numbers of where duplicates occured.






    share|improve this answer








    New contributor




    f41lurizer is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.









    Assuming that by "key" you mean "column", you could use something like this:



    cut -f 11 -d "|" $(find . -type f -iname "*.txt") | sort | uniq -d | sed 's/\/./g' | while read duplicate; do grep -rHn "|$duplicate|" * ; done



    You will probably have to change the contents of the $(find -iname) to whatever extension your log files have (or just remove it if the only files in the directory are log files. This will recursively find all log files and match them.



    The output for some test data looks like this:




    test_data.txt:1:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
    test_data.txt:5:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msg|Country not known
    test_data_2.txt:2:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
    test_data.txt:3:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIDONTMATCH|Country not known
    test_data_2.txt:4:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known
    test_data.txt:7:Error: null, Data: |862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201075ITE854075_RECardProtectionlogi.msgIlikecake|Country not known



    Where those are all lines within the files with field 11 duplicated.



    Explanation of what the command does.



    cut -f 11 -d "|" Get 11th field (as delimited by |)



    find . -type f -iname "*.txt" consider any files ending in .txt in current directory (recursively)



    sort | uniq -d show all duplicated "field 11s"



    sed /\/./g' This is a hack because messes up bash. We replace it with ., which grep matches as any character.



    while read duplicate; do grep -rHn "|$duplicate|" *; done - iterate over list of duplicates and find all occurrences of them, outputting filename and line numbers of where duplicates occured.







    share|improve this answer








    New contributor




    f41lurizer is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.









    share|improve this answer



    share|improve this answer






    New contributor




    f41lurizer is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.









    answered Nov 21 at 17:18









    f41lurizer

    1091




    1091




    New contributor




    f41lurizer is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.





    New contributor





    f41lurizer is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.






    f41lurizer is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.












    • yes i did mean column, these are errors from a DB migration, and these are the keys in this DB table (filename)
      – WendyG
      Nov 21 at 17:31


















    • yes i did mean column, these are errors from a DB migration, and these are the keys in this DB table (filename)
      – WendyG
      Nov 21 at 17:31
















    yes i did mean column, these are errors from a DB migration, and these are the keys in this DB table (filename)
    – WendyG
    Nov 21 at 17:31




    yes i did mean column, these are errors from a DB migration, and these are the keys in this DB table (filename)
    – WendyG
    Nov 21 at 17:31












    up vote
    0
    down vote













    It's no clear what are you trying to do, but, I'll give a try:



    First, what is your line? You gave this as a line:



    Error: null, Data:|862799|00318070L|EMA|EMAIL|null|20100705|2010-07-05 14:59:39.0|null|AUTO_20100705|201175ITE854075_RECardProtectionlogi.msg|Country not known


    If your lines looks like that, then, your key is in field 11



    201175ITE854075_RECardProtectionlogi.msg


    but, what define your key? Is it just to be in field 11?



    If that so, you can do something like this, in the directory with your target files:



    sort --field-separator='|' --key=11 <(grep --recursive --line-number --color=always --with-filename '' *)


    this will give you a colorized output of the name of the file, followed by the line number in that file, and then the line itself, all sorted by key field 11; so, in the output, all matching keys in any file, appears one on top of each other...



    I think that this will give you a clue, at least





    Note: the backslash in front of grep it's to prevent any grep aliases.














        edited Nov 21 at 17:37

























        answered Nov 21 at 17:29









        matsib.dev






















