How do I update this recursive directory file search for input and name outputs to handle the below case
I am updating a script that recursively walks a directory, OCRs each PDF it finds, and updates that PDF.
In its simple version, it works.
ocrmypdf -l vie --deskew --clean --force-ocr --sidecar vietnamese_website.txt "Vietnamese Website.jpg" "Vietnamese Website.pdf" --verbose 1
I would like to make it recursively go through a folder and consume all sorts of file types so I am expanding find to:
find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \)
The example for batch and parallel processing is below:
find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --verbose 1 '{}' '{}'
My question is in two parts:
'languages' is an alias for the full list of supported Tesseract training data. Typed into the shell on macOS, it expands to:
alias languages='eng+rus+vie+ukr+fra+spa+afr+amh+ara+asm+aze+aze_cyrl+bel+ben+bod+bos+bre+bul+cat+ceb+ces+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert+chr+cos+cym+dan+dan_frak+deu+deu_frak+div+dzo+ell+eng+enm+epo+equ+est+eus+fao+fas+fil+fin+fra+frk+frm+fry+gla+gle+glg+grc+guj+hat+heb+hin+hrv+hun+hye+ik...
and so on. ocrmypdf takes the literal word 'languages' as the language, so that isn't working. I'd also like to use --sidecar to write out a text file, but '{}.txt' complains that there is no such file. Here is where I am:
find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --sidecar '{}.txt' '{}' '{}' --verbose 1
find gets what I need, but --sidecar is unhappy. So how do I deal with the alias and '$1.txt'?
find alias ocr
edited Dec 16 at 11:58
Rui F Ribeiro
asked Sep 26 at 2:45
markephillips
2 Answers
I think there are two points.
- Alias expansion works only on the first word, not on an option.
- You need some modification to the names provided by find.
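To illustrate the first point, here is a minimal sketch. The short language list and the show_args helper are made up for the demonstration; they are not part of the original setup. It shows that a bare word in an argument position is passed through as-is, while a quoted variable is expanded wherever it appears.

```shell
#!/bin/bash
# Hypothetical stand-in for the real, much longer language list.
languages='eng+rus+vie'

# Print each argument on its own line, to show what a command would receive.
show_args() {
  printf '[%s]\n' "$@"
}

# The bare word is passed through untouched: aliases are only considered
# for the first word of a command, never for an option argument.
show_args -l languages

# A quoted variable expands in any argument position.
show_args -l "$languages"
```

The first call prints the literal word languages as its second argument; the second call prints the expanded list, which is what ocrmypdf needs after -l.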
While it is possible to do everything in the find command line, I think it is easier to create a script for this purpose; let's call it ocrmypdf.sh:
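#!/bin/bash
# Language list joined with "+" (shortened here).
languages='eng+rus+vie+...'
# Input path ($1) with its last extension removed.
base="${1%.*}"
ocrmypdf -l "$languages" --deskew --clean --force-ocr --sidecar "$base.txt" "$1" "$base.pdf" --verbose 1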
Then you can run it with
find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf.sh '{}'
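For comparison, here is a rough sketch of the "everything in the find command line" variant mentioned above. It is an illustration under assumptions, not a tested replacement: the ocr_tree function name and the short language list are made up, and it runs the files one after another instead of through parallel.

```shell
#!/bin/bash
# Hypothetical short list; exported so the inner bash started by find can read it.
export languages='eng+rus+vie'

# ocr_tree DIR: run ocrmypdf on every matching file under DIR, writing
# NAME.pdf and NAME.txt next to each input. find hands the file names over
# as arguments, so names containing spaces arrive intact.
ocr_tree() {
  find "$1" -type f \( -name '*.pdf' -o -name '*.jpg' -o -name '*.jpeg' \
      -o -name '*.tif' -o -name '*.tiff' -o -name '*.png' \) \
    -exec bash -c '
      for f in "$@"; do
        base=${f%.*}   # strip the last extension
        ocrmypdf -l "$languages" --deskew --clean --force-ocr \
          --sidecar "$base.txt" "$f" "$base.pdf"
      done' _ {} +
}
```

Calling ocr_tree . would process the current directory. With GNU parallel, the {.} replacement string (the input line without its extension) should allow the same naming directly in the parallel command, for example --sidecar '{.}.txt' '{}' '{.}.pdf'.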
answered Sep 26 at 5:41
RalfFriedl
I was also told to add -printf '%P\n' to the find command, but that didn't work; it doesn't even seem to be supported on macOS. :(
– markephillips
Sep 26 at 7:10
find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 /User/markphillips/.ocrmypdf.sh '{}' prints: ./Vietnamese Website.jpg /bin/bash: /User/markphillips/.ocrmypdf.sh: No such file or directory. At first I forgot to chmod the script, but?
– markephillips
Sep 26 at 7:13
Are the unbalanced quotes in the base assignment and in the last ocrmypdf argument ("$base.pdf) on purpose?
– markephillips
Sep 26 at 7:18
Thanks for your direction. I was able to get it working and will publish the details tomorrow!
– markephillips
Sep 26 at 8:17
Hi, I ran a version updated from your suggestions, and every file was reported as skipped for OCR analysis despite the work it was doing. I should also have stated that I hand-executed the command using the 4.0 engine, whereas this engine I rebuilt from source as 3.05, which let me pull in the 102-language list; that gives the annoyingly long language list but is very powerful. My point is that I have a solution to publish for this shell question, which you answered, and I am grateful.
– markephillips
Sep 26 at 21:11
With direction from user RalfFriedl, the following works with the newest LSTM-based Tesseract 4.0 on macOS.
Updated: I was able to figure out how to put all this into .profile or .bashrc, which is where I wanted it in the first place. The following doesn't need variables for the txt file.
function do_ocr () {
  # find_all_formats is defined elsewhere (not shown); presumably it wraps a find like this one:
  #find . -name '*.pdf' -o -name '*.jpg' -o -name '*.tif' -o -name '*.png' -o -name '*.jpeg' -o -name '*.tiff'
  find_all_formats | parallel --tag -j 2 \
    ocrmypdf -l ori+por+srp+hin+chi_sim+spa+uzb_cyrl+mar+swa+ces+urd+nep+cat+mya+lit+dan+mlt+enm+bod+tir+tgl+tha+fas+hrv+ukr+lao+ben+eus+eng+dzo+nld+vie+ita+kir+pus+msa+heb+slv+kaz+rus+eng+vie+ukr+spa \
    --clean --deskew --rotate-pages --image-dpi 300 --jpeg-quality 75 --png-quality 75 \
    -i -f -O 2 --sidecar - --force-ocr '{}' '{}' --verbose 1
}
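The find_all_formats helper is not shown in the post. As a guess at what it might look like, based only on the commented-out find line, a hypothetical definition could be:

```shell
#!/bin/bash
# Hypothetical definition; the post does not show the real one.
# Lists regular files under the current directory (or under $1 if given)
# whose names end in one of the extensions from the commented-out find.
find_all_formats() {
  find "${1:-.}" -type f \( -name '*.pdf' -o -name '*.jpg' -o -name '*.jpeg' \
    -o -name '*.tif' -o -name '*.tiff' -o -name '*.png' \)
}
```

The escaped parentheses matter once another test such as -type f is added, because -o binds more loosely than the implicit "and" between tests; without the grouping, -type f would apply only to the first -name.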
Note: You have to rebuild each of the training sets for 4.0 by hand when Tesseract 4.0 comes from brew install (GitHub link to instructions to install 4.0 traineddata).
Update: There is a Dockerfile for Tesseract 4.0 to which you have to add the language data, and there are step-by-step macOS instructions for the installation; make sure you have Java 8 co-installed and in your environment for ScrollViewer.jar. With that in place, the function above lets you use all the languages ("auto-detect"), OCR images where possible, convert them to PDF, and produce a sidecar txt file of the contents (in the original language).
My next effort will be to make something that takes foreign-language Office documents and translates them, using machine learning and adding more data to the text files by OCRing the images.
edited Sep 28 at 1:04
answered Sep 26 at 21:24
markephillips