How do I update this recursive directory file search for input and name outputs to handle the below case

I am updating a script that recursively goes through a directory, OCRs each PDF, and updates the PDF.

In its simple version, it works.

ocrmypdf -l vie --deskew --clean --force-ocr --sidecar vietnamese_website.txt 'Vietnamese Website.jpg' 'Vietnamese Website.pdf' --verbose 1

I would like to make it recursively go through a folder and consume all sorts of file types so I am expanding find to:

find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \)

The batch and parallel processing example is below:

find . -name '*.pdf' | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --verbose 1 '{}' '{}'

My question is in two parts:

'languages' is an alias for the full list of supported Tesseract training data. Typed into the shell on Mac OS X, it expands to: alias languages='eng+rus+vie+ukr+fra+spa+afr+amh+ara+asm+aze+aze_cyrl+bel+ben+bod+bos+bre+bul+cat+ceb+ces+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert+chr+cos+cym+dan+dan_frak+deu+deu_frak+div+dzo+ell+eng+enm+epo+equ+est+eus+fao+fas+fil+fin+fra+frk+frm+fry+gla+gle+glg+grc+guj+hat+heb+hin+hrv+hun+hye+ik...and so on. However, ocrmypdf treats the literal word languages as the language name, so that isn't working.

I'd also like to write out a text file with --sidecar, but '{}.txt' complains that there is no such file.

Here is where I am at:

find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf -l languages --deskew --clean --force-ocr --sidecar '{}.txt' '{}' '{}' --verbose 1

find gets what I need, but --sidecar is unhappy. So how do I deal with the alias and '$1.txt'?

edited Dec 16 at 11:58

Rui F Ribeiro

asked Sep 26 at 2:45

markephillips

find alias ocr


2 Answers

I think there are two points.

1. Alias expansion applies only to the first word of a command, not to an option's argument.

2. The names provided by find need some modification.
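The first point can be checked in a few lines. This is only a minimal sketch: the shortened language list and the variable name languages_var are stand-ins for illustration, not the full list from the question. A word in argument position is passed through literally even when an alias of that name exists, whereas a variable is expanded wherever it appears.

```shell
#!/bin/bash
# Stand-in values for illustration; the real list is far longer.
shopt -s expand_aliases            # scripts do not expand aliases by default
alias languages='eng+rus+vie'
languages_var='eng+rus+vie'

# 'languages' is an argument here, not the command word, so it is not expanded.
literal=$(printf '%s\n' languages)

# A variable is expanded wherever it appears on the command line.
expanded=$(printf '%s\n' "$languages_var")

echo "alias as argument:    $literal"
echo "variable as argument: $expanded"
```

The first line of output shows the bare word languages, which is what ocrmypdf received after -l in the question; the second shows the expanded list.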

While it is possible to do everything in the find command line, I think it is easier to create a script for this purpose, let's call it ocrmypdf.sh:

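#!/bin/bash

# Language list as a shell variable (an alias is not expanded in argument position)
languages='eng+rus+vie+...'

# Strip the extension from the input name, e.g. "./a.jpg" becomes "./a"
base="${1%.*}"

ocrmypdf -l "$languages" --deskew --clean --force-ocr --sidecar "$base.txt" "$1" "$base.pdf" --verbose 1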

Then you can run it with

find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 ocrmypdf.sh '{}'
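The suffix stripping that the script relies on can be tried without running any OCR. The sketch below uses a hypothetical helper, derive_names, that only prints the sidecar and PDF names the script would use for a given input; the second file name is made up to show what happens when the name contains more than one dot.

```shell
#!/bin/bash
# Hypothetical helper: print the sidecar and PDF names derived from one input path.
derive_names() {
    local input=$1
    local base=${input%.*}     # remove the shortest trailing ".something"
    printf '%s\n' "$base.txt" "$base.pdf"
}

derive_names './Vietnamese Website.jpg'
derive_names './scans/page.01.tiff'
```

Only the last extension is removed, so spaces and earlier dots in the name are preserved. For an input that is already a PDF, the derived output name equals the input name, which matches the '{}' '{}' in-place usage in the question.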

answered Sep 26 at 5:41

RalfFriedl

I was also told to add -printf '%P\n' to the find command, but that didn't work; it doesn't even seem to be supported on Mac OS X. :(
– markephillips
Sep 26 at 7:10

I ran: find . \( -name '*.pdf' -o -name '*.jpg' -o -name '*.tiff' -o -name '*.jpeg' -o -name '*.tif' -o -name '*.png' \) | parallel --tag -j 2 /User/markphillips/.ocrmypdf.sh '{}' and got: ./Vietnamese Website.jpg /bin/bash: /User/markphillips/.ocrmypdf.sh: No such file or directory. At first I forgot to chmod the script, but what is wrong now?
– markephillips
Sep 26 at 7:13

Are the unbalanced quotes in the base assignment and in the last ocrmypdf argument (the one built from base) on purpose?
– markephillips
Sep 26 at 7:18

Thanks for your direction. I was able to get it working and will publish the details tomorrow!
– markephillips
Sep 26 at 8:17

Hi. I ran a version updated from your suggestions, and OCR was skipped for every file despite the work it appeared to be doing. I should also have mentioned that I ran the command by hand using the 4.0 engine, and then rebuilt the engine from source as 3.05, which let me pull in the 102-language list; that gives the annoyingly long language list, but it is very powerful. In any case, I have a solution to publish for this shell question, which you answered, and I am grateful.
– markephillips
Sep 26 at 21:11


With direction from RalfFriedl, the following works with the newest LSTM-based Tesseract 4.0 on Mac OS X.

Updated: I was able to figure out how to put all of this into .profile or .bashrc, which is where I wanted it in the first place. The following doesn't need variables for the txt file.

function do_ocr () {
    # find_all_formats is defined elsewhere; presumably it wraps a find like this one:
    #find . -name '*.pdf' -o -name '*.jpg' -o -name '*.tif' -o -name '*.png' -o -name '*.jpeg' -o -name '*.tiff'
    find_all_formats | parallel --tag -j 2 \
    ocrmypdf -l ori+por+srp+hin+chi_sim+spa+uzb_cyrl+mar+swa+ces+urd+nep+cat+mya+lit+dan+mlt+enm+bod+tir+tgl+tha+fas+hrv+ukr+lao+ben+eus+eng+dzo+nld+vie+ita+kir+pus+msa+heb+slv+kaz+rus+eng+vie+ukr+spa \
    --clean --deskew --rotate-pages --image-dpi 300 --jpeg-quality 75 --png-quality 75 \
    -i -f -O 2 --sidecar - --force-ocr '{}' '{}' --verbose 1
}
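The function above calls find_all_formats without showing it. As a sketch only, assuming it simply wraps the commented-out find, it might look like the following; the optional directory argument and the -type f test are additions for illustration and are not part of the original answer.

```shell
#!/bin/bash
# Hypothetical definition of the helper; the answer does not show the real one.
# The parentheses are escaped so the shell passes them to find as grouping operators.
find_all_formats() {
    find "${1:-.}" -type f \( -name '*.pdf' -o -name '*.jpg' -o -name '*.jpeg' \
        -o -name '*.tif' -o -name '*.tiff' -o -name '*.png' \)
}

# Small self-check in a scratch directory: two of the three files should match.
scratch=$(mktemp -d)
touch "$scratch/a.pdf" "$scratch/b c.jpg" "$scratch/notes.txt"
matches=$(find_all_formats "$scratch" | wc -l | tr -d ' ')
echo "matched $matches files"
rm -rf "$scratch"
```

Grouping the -name tests matters as soon as another test such as -type f is added, because without the parentheses the implicit -a binds more tightly than -o and the extra test would apply to only one of the patterns.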

Note: You have to rebuild each of the training sets for 4.0 by hand. See: brew install Tesseract 4.0 - GitHub link to instructions to install the 4.0 traineddata.

Update: There is a Dockerfile for Tesseract 4.0 to which you have to add the language data, and there are step-by-step Mac OS X instructions for the installation. Make sure you have Java 8 co-installed and in your environment for ScrollView.jar. Once that is in place, the above function lets you use all the languages as a kind of "auto-detect": it OCRs images where possible, converts them to PDF, and produces a sidecar txt file of the contents (in the original language).

My next effort will be to make something that takes Office documents in other languages and translates them, using machine learning and adding more data to the text files by OCRing the images.

edited Sep 28 at 1:04

answered Sep 26 at 21:24

markephillips
