Is there a Unix command that searches for similar strings, based mostly on how they sound when spoken?












7















I have a file of names, and I want to search within it, not caring too much about whether I have spelled the name (that I am searching for) correctly. I know that grep has quite a bit of functionality to search for a whole slew of similar strings within a file or stream, but as far as I am aware, it does not have functionality to correct for spelling errors, and even if it did, since these are names of people, they wouldn't be found inside a standard dictionary.



Perhaps I can make my file of names into a special dictionary, and then use some standard spell checking tool? Of particular importance in this application is the ability to match similarly sounding words.



For example: "jacob" should return "Jakob". Even better would be if inter-language similarities were also accounted for, so that "miguel" should match "Michael".



Is this something that has been implemented already, or will I have to build my own?






























  • 3





    agrep does approximate grep (though not by sound or language). zsh pattern matching also has (#a3), which allows up to 3 mistakes.

    – Stéphane Chazelas
    Jun 14 '13 at 9:36








  • 7





    Take a look at the Text::Soundex perl core module too: pastebin.com/UbeVFBQA

    – manatwork
    Jun 14 '13 at 10:40






  • 1





    @manatwork - you should write that up as an answer!

    – slm
    Jun 14 '13 at 14:01






  • 2





    @slm, not sure how useful that can be in practice. For example “miguel”, “Michael”, “michelle”, “majkul” and “mysql” all have soundex code “M240”. When I tried to use it, I discovered that it's too broad to be useful for most tasks. So I'd better leave it to someone capable of fine-tuning it to turn it into an answer.

    – manatwork
    Jun 14 '13 at 14:17
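
To illustrate the approximate-matching idea from the first comment, here is a minimal Python sketch. It is not agrep's algorithm or interface: it compares whole names using plain Levenshtein distance, and the names `edit_distance`, `fuzzy_search` and `max_errors` are hypothetical, chosen only for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_search(query, names, max_errors=1):
    """Return the names within max_errors edits of query, ignoring case."""
    q = query.lower()
    return [n for n in names if edit_distance(q, n.lower()) <= max_errors]


names = ["jacob", "Jakob", "miguel", "Michael"]
print(fuzzy_search("jacob", names))                 # ['jacob', 'Jakob']
print(fuzzy_search("miguel", names))                # ['miguel']
print(fuzzy_search("miguel", names, max_errors=3))  # ['miguel', 'Michael']
```

Spelling distance handles "jacob" and "Jakob" well, but "miguel" and "Michael" are three edits apart, so a purely spelling-based match only pairs them at a threshold that would likely admit many unrelated names in a larger file.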
















search text pattern-matching natural-language














edited Jan 8 at 7:07







gabkdlly

















asked Jun 14 '13 at 9:20









gabkdlly












1 Answer


















5














@manatwork has it right: Soundex is probably the tool you're looking for.



Install the Perl Text::Soundex module from CPAN:



$ sudo cpan Text::Soundex
CPAN: Storable loaded ok (v2.27)
....
Text::Soundex is up to date (3.04).


Create a test file of names called names.txt:



jacob
Jakob
miguel
Michael


Now write a Perl script that uses the Soundex module, soundslike.pl:



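#!/usr/bin/perl
use strict;
use warnings;
use Text::Soundex;

# The name to look for is the first command-line argument.
my $query = shift @ARGV;
die "Usage: $0 NAME\n" unless defined $query;

my $targetSoundex = soundex($query) // '';
print "Target soundex of $query is $targetSoundex\n";

open(my $fh, '<', 'names.txt') or die "Cannot open names.txt: $!\n";

# Compare the Soundex code of each name in the file with the target code.
while (my $name = <$fh>) {
    chomp $name;
    next unless length $name;    # skip blank lines
    my $code = soundex($name) // '';
    print "Soundex of $name is $code";
    if ($targetSoundex eq $code) {
        print " (match).\n";
    } else {
        print " (no match).\n";
    }
}
close($fh);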


Make it executable and run some examples:



$ chmod +x soundslike.pl 
$ ./soundslike.pl michael
Target soundex of michael is M240
Soundex of jacob is J210 (no match).
Soundex of Jakob is J210 (no match).
Soundex of miguel is M240 (match).
Soundex of Michael is M240 (match).
$ ./soundslike.pl jagub
Target soundex of jagub is J210
Soundex of jacob is J210 (match).
Soundex of Jakob is J210 (match).
Soundex of miguel is M240 (no match).
Soundex of Michael is M240 (no match).
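
For readers without the Perl module, the same idea can be sketched in plain Python. This is a simplified variant of American Soundex in which every uncoded letter (vowels, h, w, y) separates repeated digits; Soundex variants differ on how h and w are treated, so it is not guaranteed to agree with Text::Soundex on every input, though it gives the same codes for the names used here. The helpers `build_index` and `lookup` are hypothetical names for this example; they turn the name file into the kind of "special dictionary" the question asks about.

```python
from collections import defaultdict

# Soundex digit for each coded consonant; other letters carry no digit.
_CODES = {ch: str(digit)
          for digit, group in enumerate(
              ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1)
          for ch in group}


def soundex(name: str) -> str:
    """Return a 4-character Soundex code, or "" if name has no ASCII letters."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # an uncoded letter resets prev, so a repeat after it counts
    return (code + "000")[:4]  # pad with zeros or truncate to 4 characters


def build_index(names):
    """Group names by Soundex code so a lookup is a single dict access."""
    index = defaultdict(list)
    for n in names:
        index[soundex(n)].append(n)
    return index


def lookup(query, index):
    """Return every indexed name whose Soundex code equals the query's."""
    return index.get(soundex(query), [])


names = ["jacob", "Jakob", "miguel", "Michael", "michelle", "mysql"]
index = build_index(names)
print(lookup("jagub", index))    # ['jacob', 'Jakob']
print(lookup("michael", index))  # ['miguel', 'Michael', 'michelle', 'mysql']
```

The second lookup shows the broadness noted in the comments: "michelle" and "mysql" share the code M240 with "miguel" and "Michael", so the Soundex matches may need a second filter before they are useful.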





        answered Jun 17 '13 at 13:04









        Nate from Kalamazoo





























