How do I update this recursive directory file search for input and name outputs to handle the below case


























I am updating a script that recursively walks a directory, OCRs each PDF, and updates the PDF in place.



In its simple version, it works.



ocrmypdf -l vie --deskew --clean --force-ocr --sidecar vietnamese_website.txt 'Vietnamese Website.jpg' 'Vietnamese Website.pdf' --verbose 1


I would like to make it recursively go through a folder and consume all sorts of file types so I am expanding find to:



find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \)


The example of batch and parallel processing is below:



find .  -name '*.pdf' | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --verbose 1 '{}' '{}'


My question is in two parts:



'languages' is an alias for the full list of supported Tesseract training data. Typed into the shell on macOS, alias expands to: alias languages='eng+rus+vie+ukr+fra+spa+afr+amh+ara+asm+aze+aze_cyrl+bel+ben+bod+bos+bre+bul+cat+ceb+ces+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert+chr+cos+cym+dan+dan_frak+deu+deu_frak+div+dzo+ell+eng+enm+epo+equ+est+eus+fao+fas+fil+fin+fra+frk+frm+fry+gla+gle+glg+grc+guj+hat+heb+hin+hrv+hun+hye+ik... and so on. ocrmypdf takes the word languages itself to be the language, so that isn't working.

I'd also like to write out a text file with --sidecar, but '{}.txt' complains that there is no such file. Here is where I am at.



find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --sidecar '{}.txt' '{}' '{}' --verbose 1


find gets what I need, but --sidecar is unhappy. So how do I deal with the alias and '$1.txt'?










      find alias ocr














      edited Dec 16 at 11:58 by Rui F Ribeiro
      asked Sep 26 at 2:45 by markephillips






















          2 Answers
































          I think there are two points.




          • Alias expansion applies only to the first word of a command, not to an option's argument.

          • You need some modification to the names provided by find.
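The first point can be checked without ocrmypdf at all. The sketch below is only an illustration: the three-language list and the variable name langs are placeholders, not the asker's real list. It shows that a word in argument position is passed through literally, while a variable is expanded wherever it appears.

```shell
#!/bin/sh
# Placeholder three-language list; the real list from the question is much longer.
alias languages='eng+rus+vie'
langs='eng+rus+vie'

# In argument position the word "languages" is never alias-expanded,
# so the command receives it literally.
as_argument=$(printf '%s %s' -l languages)

# A variable is expanded anywhere on the command line.
with_variable=$(printf '%s %s' -l "$langs")

printf '%s\n' "$as_argument" "$with_variable"
```

Run as a script, this prints -l languages followed by -l eng+rus+vie, which is why storing the list in a variable is the more dependable choice here.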


          While it is possible to do everything in the find command line, I think it is easier to create a script for this purpose, let's call it ocrmypdf.sh:



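          #!/bin/bash

          # Full language list as a variable; an alias would not be expanded here.
          languages='eng+rus+vie+...'
          # Strip the extension so the same base name serves the sidecar and the output PDF.
          base="${1%.*}"
          ocrmypdf -l "$languages" --deskew --clean --force-ocr --sidecar "$base.txt" "$1" "$base.pdf" --verbose 1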


          Then you can run it with



          find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf.sh '{}'
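To see which names the wrapper would derive, the suffix-stripping expansion can be tried on its own. This is a minimal sketch; the helper name derive_names and the sample paths are made up for illustration. It also shows one consequence worth knowing: for an input that is already a PDF, the derived output name is the input itself, so ocrmypdf would be asked to write over it.

```shell
#!/bin/sh
# derive_names is a hypothetical helper, not part of ocrmypdf or parallel.
# It prints the sidecar and output names the wrapper script would use.
derive_names() {
    base=${1%.*}            # strip the shortest trailing ".something"
    printf '%s\n' "$base.txt" "$base.pdf"
}

derive_names './scans/Vietnamese Website.jpg'
derive_names './scans/report.v2.tiff'
derive_names './scans/already.pdf'
```

Only the last extension is removed, so report.v2.tiff becomes report.v2.txt and report.v2.pdf, and spaces in the path survive because the expansion is quoted. GNU parallel also offers the {.} replacement string (the input with its extension removed), which may cover the naming part without a wrapper; the language list would still need to be a variable.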


























          • I was also told to add -printf '%P\n' to the find command, but that didn't work; it doesn't even seem to be supported on macOS. :(
            – markephillips
            Sep 26 at 7:10










          • find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 /User/markphillips/.ocrmypdf.sh '{}' gives: ./Vietnamese Website.jpg /bin/bash: /User/markphillips/.ocrmypdf.sh: No such file or directory. At first I forgot to chmod the script, but it still fails?
            – markephillips
            Sep 26 at 7:13










          • In the script as first posted, are the missing second quote on the base assignment and the missing quote on the last ocrmypdf argument on purpose?
            – markephillips
            Sep 26 at 7:18










          • Thanks for your direction. I was able to get it working and will publish the details tomorrow!
            – markephillips
            Sep 26 at 8:17










          • Hi, I ran what was basically my update of your suggestions, and every file ended up skipping OCR analysis despite the work it appeared to be doing. I should also have stated that I ran the command by hand using the 4.0 engine, and that I then rebuilt the engine from source as 3.05, which let me pull in the 102-language list; that gives the annoyingly long language list, but it is very powerful. My point is that I have a solution to publish for this shell question, which you answered, and I am grateful.
            – markephillips
            Sep 26 at 21:11
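On the -printf comment above: the BSD find shipped with macOS does not provide -printf. A portable way to get the same relative names, sketched here under the assumption that only the leading ./ needs to go, is to strip it in a small sh -c helper. The function name list_relative is hypothetical.

```shell
#!/bin/sh
# list_relative is a hypothetical helper name. It prints every regular file
# under the given directory without the leading "./", similar in spirit to
# GNU find's -printf '%P\n', using only POSIX find and sh.
list_relative() {
    ( cd "$1" && find . -type f -exec sh -c '
        for f do
            printf "%s\n" "${f#./}"
        done' sh {} + )
}

list_relative .
```

Passing the names as arguments to sh -c, rather than splicing them into the script text, keeps file names with spaces intact.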

































          With direction from user RalfFriedl, the following works with the newest LSTM-based Tesseract 4.0 on macOS.



          Updated: I was able to figure out how to put all of this into .profile or .bashrc, which is where I wanted it in the first place. The following doesn't need variables for the txt file.



          function do_ocr () {
              # find_all_formats is not defined in this post; presumably it wraps the find command below.
              #find . -name '*.pdf' -o -name '*.jpg' -o -name '*.tif' -o -name '*.png' -o -name '*.jpeg' -o -name '*.tiff'
              find_all_formats | parallel --tag -j 2 \
                  ocrmypdf -l ori+por+srp+hin+chi_sim+spa+uzb_cyrl+mar+swa+ces+urd+nep+cat+mya+lit+dan+mlt+enm+bod+tir+tgl+tha+fas+hrv+ukr+lao+ben+eus+eng+dzo+nld+vie+ita+kir+pus+msa+heb+slv+kaz+rus+eng+vie+ukr+spa \
                  --clean --deskew --rotate-pages --image-dpi 300 --jpeg-quality 75 --png-quality 75 \
                  -i -f -O 2 --sidecar - --force-ocr '{}' '{}' --verbose 1
          }
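The function above relies on find_all_formats, which the post does not show. One possible definition, given here purely as an assumption about what it might look like, wraps the find expression from the question. The -type f test and the case-insensitive -iname (available in GNU and BSD find, though not in POSIX) are additions of this sketch, not something the original states.

```shell
#!/bin/sh
# Hypothetical definition of find_all_formats; the original post does not show it.
# Prints every regular file under the current directory whose extension is one
# of the formats listed in the question.
find_all_formats() {
    find . -type f \( -iname '*.pdf' -o -iname '*.jpg' -o -iname '*.jpeg' \
        -o -iname '*.tif' -o -iname '*.tiff' -o -iname '*.png' \)
}

find_all_formats
```

The names come out one per line, so a file name containing a newline would still be split; if that matters, find's -print0 combined with parallel's -0 option is the usual way around it.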



          Note: You have to rebuild each of the training sets for 4.0 by hand; see "brew install Tesseract 4.0 - GitHub link to instructions to install 4.0 traineddata".



          Update: There is a Dockerfile for Tesseract 4.0, to which you have to add the language data, and there are step-by-step macOS instructions for the installation; make sure you have Java 8 co-installed and in your environment for the ScrollViewer.jar. Once that is in place, the above function lets you use all the languages as a kind of "auto-detect": it OCRs images where possible, converts them to PDF, and produces a sidecar txt file of the contents (in the original language).



          My next effort will be to make something that takes foreign-language Office documents and translates them, using machine learning and adding more data to the text files by OCRing the images.






















































            First answer: answered Sep 26 at 5:41 by RalfFriedl












            • I was also told that -printf '%Pn' was to be added on the find command and that didn't work - or even was supported at all on macosx. :(
              – markephillips
              Sep 26 at 7:10










            • find . ( -name '.pdf' -o -name '.jpg' -o -name '.tiff' -o -name '.jpeg' -o -name '.tif' -o -name '.png' ) | parallel --tag -j 2 /User/markphillips/.ocrmypdf.sh '{}' ./Vietnamese Website.jpg /bin/bash: /User/markphillips/.ocrmypdf.sh: No such file or directory - at first I forgot to chmod on the script but?
              – markephillips
              Sep 26 at 7:13










            • Is base assignment 2nd quote and ocrmypdf last argument base quote on purpose?
              – markephillips
              Sep 26 at 7:18










            • Thanks for your direction. I was able to get it working and will publish the details tomorrow!
              – markephillips
              Sep 26 at 8:17










            • Hi, so I ran what basically I updated from your suggestions and the result of all the files was skipped ocr analysis despite the work it was doing. I should have also stated that I hand executed the command using the 4.0 engine and then this engine I rebuilt from source to be the 3.05 which allowed me to pull in the 102 language list which gives that annoying long lang list but very powerful. So my point is I have a solution to publish to this shell question which you answered and I am grateful.
              – markephillips
              Sep 26 at 21:11


















            • I was also told that -printf '%Pn' was to be added on the find command and that didn't work - or even was supported at all on macosx. :(
              – markephillips
              Sep 26 at 7:10










            • find . ( -name '.pdf' -o -name '.jpg' -o -name '.tiff' -o -name '.jpeg' -o -name '.tif' -o -name '.png' ) | parallel --tag -j 2 /User/markphillips/.ocrmypdf.sh '{}' ./Vietnamese Website.jpg /bin/bash: /User/markphillips/.ocrmypdf.sh: No such file or directory - at first I forgot to chmod on the script but?
              – markephillips
              Sep 26 at 7:13










            • Is base assignment 2nd quote and ocrmypdf last argument base quote on purpose?
              – markephillips
              Sep 26 at 7:18










            • Thanks for your direction. I was able to get it working and will publish the details tomorrow!
              – markephillips
              Sep 26 at 8:17










            • Hi, so I ran what basically I updated from your suggestions and the result of all the files was skipped ocr analysis despite the work it was doing. I should have also stated that I hand executed the command using the 4.0 engine and then this engine I rebuilt from source to be the 3.05 which allowed me to pull in the 102 language list which gives that annoying long lang list but very powerful. So my point is I have a solution to publish to this shell question which you answered and I am grateful.
              – markephillips
              Sep 26 at 21:11
















            I was also told that -printf '%Pn' was to be added on the find command and that didn't work - or even was supported at all on macosx. :(
            – markephillips
            Sep 26 at 7:10




            I was also told that -printf '%Pn' was to be added on the find command and that didn't work - or even was supported at all on macosx. :(
            – markephillips
            Sep 26 at 7:10












            find . ( -name '.pdf' -o -name '.jpg' -o -name '.tiff' -o -name '.jpeg' -o -name '.tif' -o -name '.png' ) | parallel --tag -j 2 /User/markphillips/.ocrmypdf.sh '{}' ./Vietnamese Website.jpg /bin/bash: /User/markphillips/.ocrmypdf.sh: No such file or directory - at first I forgot to chmod on the script but?
            – markephillips
            Sep 26 at 7:13




            find . ( -name '.pdf' -o -name '.jpg' -o -name '.tiff' -o -name '.jpeg' -o -name '.tif' -o -name '.png' ) | parallel --tag -j 2 /User/markphillips/.ocrmypdf.sh '{}' ./Vietnamese Website.jpg /bin/bash: /User/markphillips/.ocrmypdf.sh: No such file or directory - at first I forgot to chmod on the script but?
            – markephillips
            Sep 26 at 7:13












            Is base assignment 2nd quote and ocrmypdf last argument base quote on purpose?
            – markephillips
            Sep 26 at 7:18




            Is base assignment 2nd quote and ocrmypdf last argument base quote on purpose?
            – markephillips
            Sep 26 at 7:18












            Thanks for your direction. I was able to get it working and will publish the details tomorrow!
            – markephillips
            Sep 26 at 8:17




            Thanks for your direction. I was able to get it working and will publish the details tomorrow!
            – markephillips
            Sep 26 at 8:17












            Hi, so I ran what basically I updated from your suggestions and the result of all the files was skipped ocr analysis despite the work it was doing. I should have also stated that I hand executed the command using the 4.0 engine and then this engine I rebuilt from source to be the 3.05 which allowed me to pull in the 102 language list which gives that annoying long lang list but very powerful. So my point is I have a solution to publish to this shell question which you answered and I am grateful.
            – markephillips
            Sep 26 at 21:11




            Hi, so I ran what basically I updated from your suggestions and the result of all the files was skipped ocr analysis despite the work it was doing. I should have also stated that I hand executed the command using the 4.0 engine and then this engine I rebuilt from source to be the 3.05 which allowed me to pull in the 102 language list which gives that annoying long lang list but very powerful. So my point is I have a solution to publish to this shell question which you answered and I am grateful.
            – markephillips
            Sep 26 at 21:11













            0














            So with direction from user-ralfiedl the following works with the newest LSTM based Tessearct 4.0 on MacOSX.



            Updated: I was able to figure out how to shove all this into the .profile or . bashrc which is where I wanted it in the first place...the following doesn't need variables for the txt file.



            function do_ocr () {
            #find . -name '*.pdf' -o -name '*.jpg' -o -name '*.tif' -o -name '*.png' -o -name '*.jpeg' -o -name '*.tiff'
            find_all_formats | parallel --tag -j 2
            ocrmypdf -l ori+por+srp+hin+chi_sim+spa+uzb_cyrl+mar+swa+ces+urd+nep+cat+mya+lit+dan+mlt+enm+bod+tir+tgl+tha+fas+hrv+ukr+lao+ben+eus+eng+dzo+nld+vie+ita+kir+pus+msa+heb+slv+kaz+rus+eng+vie+ukr+spa
            --clean --deskew --rotate-pages --image-dpi 300 --jpeg-quality 75 --png-quality 75
            -i -f -O 2 --sidecar - --force-ocr '{}' '{}' --verbose 1


            }



            Note: You have to hand rebuild each of the training sets for 4.0 which As brew install Tessearact 4.0 - Github Link to Instructions to Install 4.0 traineddata



            Update: There is a docker file of the Tesseract 4.0 that you have to add the language data and MacOSX step-by-step instructions for the installation - which make sure you have Java 8 co-installed and in your environment for the ScrollViewer.jar. If you get this, then the above function lets you use all the languages "auto-detect" and then ocr images if possible, convert to PDF, and producing a sidecar txt file of the contents (in the original language).



            My next effort will be to making something that takes language Office documents and translate them and using Machine Learning by adding more data to the text files OCRing the images.






            share|improve this answer




























              0














              So with direction from user-ralfiedl the following works with the newest LSTM based Tessearct 4.0 on MacOSX.



              Updated: I was able to figure out how to shove all this into the .profile or . bashrc which is where I wanted it in the first place...the following doesn't need variables for the txt file.



              function do_ocr () {
              #find . -name '*.pdf' -o -name '*.jpg' -o -name '*.tif' -o -name '*.png' -o -name '*.jpeg' -o -name '*.tiff'
              find_all_formats | parallel --tag -j 2
              ocrmypdf -l ori+por+srp+hin+chi_sim+spa+uzb_cyrl+mar+swa+ces+urd+nep+cat+mya+lit+dan+mlt+enm+bod+tir+tgl+tha+fas+hrv+ukr+lao+ben+eus+eng+dzo+nld+vie+ita+kir+pus+msa+heb+slv+kaz+rus+eng+vie+ukr+spa
              --clean --deskew --rotate-pages --image-dpi 300 --jpeg-quality 75 --png-quality 75
              -i -f -O 2 --sidecar - --force-ocr '{}' '{}' --verbose 1


              }



              Note: You have to hand rebuild each of the training sets for 4.0 which As brew install Tessearact 4.0 - Github Link to Instructions to Install 4.0 traineddata



              Update: There is a docker file of the Tesseract 4.0 that you have to add the language data and MacOSX step-by-step instructions for the installation - which make sure you have Java 8 co-installed and in your environment for the ScrollViewer.jar. If you get this, then the above function lets you use all the languages "auto-detect" and then ocr images if possible, convert to PDF, and producing a sidecar txt file of the contents (in the original language).



              My next effort will be to making something that takes language Office documents and translate them and using Machine Learning by adding more data to the text files OCRing the images.






              share|improve this answer


























                0












                0








                0






                So with direction from user-ralfiedl the following works with the newest LSTM based Tessearct 4.0 on MacOSX.



                Updated: I was able to figure out how to shove all this into the .profile or . bashrc which is where I wanted it in the first place...the following doesn't need variables for the txt file.



                function do_ocr () {
                # find_all_formats is equivalent to:
                #find . -name '*.pdf' -o -name '*.jpg' -o -name '*.tif' -o -name '*.png' -o -name '*.jpeg' -o -name '*.tiff'
                # The trailing backslashes keep the whole ocrmypdf invocation as one command for parallel.
                find_all_formats | parallel --tag -j 2 \
                ocrmypdf -l ori+por+srp+hin+chi_sim+spa+uzb_cyrl+mar+swa+ces+urd+nep+cat+mya+lit+dan+mlt+enm+bod+tir+tgl+tha+fas+hrv+ukr+lao+ben+eus+eng+dzo+nld+vie+ita+kir+pus+msa+heb+slv+kaz+rus+eng+vie+ukr+spa \
                --clean --deskew --rotate-pages --image-dpi 300 --jpeg-quality 75 --png-quality 75 \
                -i -f -O 2 --sidecar - --force-ocr '{}' '{}' --verbose 1
                }
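
The function above calls find_all_formats, which is not defined in this part of the answer. A minimal sketch, assuming it simply wraps the find command from the commented-out line (the body below is a guess for illustration, not the author's definition):

```shell
# Hypothetical helper: print every PDF or image file under the current directory.
# The escaped parentheses group the -name tests so that -type f applies to all of them.
find_all_formats() {
    find . -type f \( -name '*.pdf' -o -name '*.jpg' -o -name '*.jpeg' \
        -o -name '*.tif' -o -name '*.tiff' -o -name '*.png' \)
}
```

If it is placed in the same file as do_ocr (for example .bashrc), both become available in new shells, and running do_ocr from a directory processes everything beneath it.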



                Note: With Tesseract 4.0 installed via brew, you have to rebuild each of the training sets for 4.0 by hand (see the GitHub instructions for installing the 4.0 traineddata).



                Update: There is a Dockerfile for Tesseract 4.0, to which you have to add the language data, and there are step-by-step Mac OS X installation instructions. Make sure you have Java 8 co-installed and in your environment for ScrollView.jar. Once that is set up, the above function lets you pass all the languages at once (a rough "auto-detect"), OCR images where possible, convert them to PDF, and produce a sidecar txt file of the contents (in the original language).
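
Instead of typing the full -l list by hand, the list could be assembled from whatever language data is installed. This is only a sketch: join_langs is a hypothetical helper name, and the commented usage assumes that tesseract --list-langs prints one header line followed by one language code per line, which is worth checking against your version.

```shell
# Hypothetical helper: join language codes (one per line on stdin)
# into the a+b+c form that ocrmypdf's -l option expects.
join_langs() {
    paste -sd+ -
}

# Assumed usage (needs tesseract installed; tail skips the header line):
# langs=$(tesseract --list-langs | tail -n +2 | join_langs)
# ocrmypdf -l "$langs" input.pdf output.pdf
```

Unlike an alias, a variable such as $langs is expanded when it appears as an argument, which is why the alias in the question was passed to ocrmypdf as the literal word "languages".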



                My next effort will be to build something that takes foreign-language Office documents and translates them, and that uses machine learning, adding more data to the text files by OCRing the images.














                edited Sep 28 at 1:04

























                answered Sep 26 at 21:24









                markephillips






























