Choose OCR Software for Chinese PDFs and Images
Thread starter: Kirill Loktionov
Hello!
I've seen some old topics where people asked about OCR software, but what does the market look like today? Which application works best for recognizing scanned Chinese documents (including drawings) or documents saved as images? We use ABBYY FineReader 15 for most languages, but its Chinese OCR quality is really bad. It often misrecognizes characters if they are blurred or simply expanded/condensed. Are there any alternatives? I tried ReadIris, and it has far fewer options for setting areas on a page. Online resources like 2ocr work great, though the result is just plain text, so they can only serve as a backup for the parts where FineReader fails.
[Edited at 2023-06-29 06:41 GMT]
[Edited at 2023-06-29 11:25 GMT]
Sakshi Garg (United States) German > Italian + ...
Hi,
I hope you are doing well! Many programs on the market nowadays offer extensive OCR support. For Chinese, I personally prefer Tesseract.
Tesseract is an open-source OCR engine that supports numerous languages, including Chinese. It can be a bit technical to set up and use, but it is known for its high accuracy.
You may want to try it to see how accurately it recognizes the characters.
I hope this helps!
Thank you.
Regards
S
Kirill Loktionov (Hungary) English > Russian + ... THREAD STARTER
Tesseract Settings | Jul 1, 2023
Hi Sakshi,
Thank you for the tip! But how can I use a GUI with Tesseract? Unfortunately, the machine still cannot work out by itself which areas to recognize. Is there documentation for such a setting?
Kind regards,
Kirill
Mr. Satan (X) English > Bahasa Indonesia
Kirill Loktionov wrote:
But how can I use a GUI with Tesseract?
You have several choices:
https://tesseract-ocr.github.io/tessdoc/User-Projects-–-3rdParty.html
That being said, I don't work with the Chinese language in any capacity, so I don't know whether any of them is any good for Hanzi.
Is there documentation for such a setting?
The man page for Tesseract:
https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
HTH, FWIW.
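On the question of telling Tesseract which areas to recognize: the man page linked above documents a page segmentation option, `--psm`, and output "config" names such as `pdf` or `hocr` that go at the end of the command. Below is a minimal Python sketch that only assembles such a command line. It assumes Tesseract 4 or later; the function name `build_tesseract_command` and its parameters are made up for illustration and are not part of Tesseract.

```python
from typing import List, Optional

# Page segmentation modes as listed in the Tesseract man page (--psm).
PSM_AUTO = 3          # fully automatic page segmentation (the default)
PSM_SINGLE_BLOCK = 6  # assume a single uniform block of text
PSM_SPARSE_TEXT = 11  # sparse text, found in no particular order


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: str = "chi_sim",
    psm: Optional[int] = None,
    output_format: Optional[str] = None,
) -> List[str]:
    """Return the argument list for one Tesseract run (hypothetical helper).

    output_format is a Tesseract config name such as "pdf", "hocr" or "tsv";
    when it is None, Tesseract writes plain text to output_base.txt.
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if output_format is not None:
        # Config names come last on the command line.
        cmd.append(output_format)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so nothing needs to be installed.
    print(" ".join(build_tesseract_command(
        "drawing.png", "drawing", psm=PSM_SPARSE_TEXT, output_format="pdf")))
```

Sparse-text mode may be worth trying on drawings, where labels are scattered around the page, and the `pdf` config produces a searchable PDF rather than plain text. Whether either helps with Hanzi is something only a test on real scans can tell.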
Milan Condak wrote:
Tesseract is part of several SWs that have user interfaces.
I don't think this description is particularly accurate. The other programs you are referring to are simply graphical front-ends for the Tesseract program itself. The situation is similar to ffmpeg or espeak. You can use these programs from the command-line interface, which I usually prefer.
[Edited at 2023-07-02 01:55 GMT]
Kirill Loktionov (Hungary) English > Russian + ... THREAD STARTER
I am a bit flabbergasted that I am writing the following words… Thank you Mr. Satan!
I have tried several of the products mentioned here: https://tesseract-ocr.github.io/tessdoc/User-Projects-–-3rdParty.html
particularly,
Rescribe (unfortunately, I couldn't even see a way to connect to a server to open a document, only local folders),
normcap (I did not understand how to launch it; is it for Windows PCs?),
Free-Ocr-Windows-Desktop (it looks like no languages apart from En/De/Es are available, and I found no settings for this. Also, it is a plain-text OCR tool, and its quality in English is quite low: e.g. it read 'MACHINE MAINTENANCE INSTRUCTIONS' as 'uacanwc zxmn-ru\'An'cc msnaucnorvs').
I guess there is nothing as flexible as FineReader (except that it is proprietary software). Perhaps there are some Chinese competitors built to work with logograms. Time will tell.
Mr. Satan (X) English > Bahasa Indonesia
Using Tesseract from the Command-line Interface | Jul 5, 2023
This is why I prefer using Tesseract from the command-line interface. It worked quite nicely for me when I had to deal with scanned English documents. Here are the commands to use it without a GUI. Please note that I adapted them to your specific use case by adding the language parameter for Chinese, in both traditional and simplified variants. The language parameter is not required if the source document is in English, since Tesseract defaults to it. Feel free to pick the one that suits your needs.
tesseract INPUT_FILENAME OUTPUT_FILENAME -l chi_tra
tesseract INPUT_FILENAME OUTPUT_FILENAME -l chi_sim
You will need the Chinese language packages installed. The same is true for any language you are working with. I'm using Linux, so it's easy to get them, as they are available in the official repository. My apologies, but I can't help you if you're using Windows.
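To check which language packages are actually installed, Tesseract has a `--list-langs` option, and several languages can be combined with `+` (for example `chi_sim+eng`). Here is a rough Python sketch of such a check. The helper names are hypothetical, and the parsing assumes the usual output layout: one "List of available languages" header line followed by one language code per line.

```python
import subprocess
from typing import List, Set


def parse_list_langs(output: str) -> Set[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. 'List of available languages in "..." (3):'
    return {line for line in lines if not line.startswith("List of")}


def missing_languages(wanted: str, installed: Set[str]) -> List[str]:
    """Return the codes from a '+'-joined -l value that are not installed."""
    return [code for code in wanted.split("+") if code not in installed]


def installed_languages() -> Set[str]:
    """Ask the local Tesseract which languages it has (requires Tesseract)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Assumption: some older builds print the list to stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    missing = missing_languages("chi_sim+chi_tra", installed_languages())
    print("Missing language packages:", missing or "none")
```

If something is reported as missing, the fix is to install the matching package from your distribution (or the corresponding traineddata file) before running the commands above.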
I should mention that Tesseract by itself doesn't support PDF as an input file format. For this, I'd use GIMP with the Export Layers plugin to convert the PDF document into separate image files. Then I'd extract the text with Tesseract using the commands above.
https://www.gimp.org/
https://github.com/kamilburda/gimp-export-layers
https://www.linuxuprising.com/2019/03/how-to-convert-pdf-to-image-png-jpeg.html
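For what it's worth, the PDF-to-images-to-text steps can also be scripted. The sketch below is only one possible approach: it swaps GIMP for the `pdftoppm` tool from Poppler (an assumption, it must be installed separately) and then runs the same Tesseract command on each page image. The function names, the `page` file prefix and the 300 dpi default are illustrative choices.

```python
import subprocess
from pathlib import Path
from typing import List


def pdftoppm_command(pdf: str, prefix: str, dpi: int = 300) -> List[str]:
    """Command that renders every PDF page to PNG files named <prefix>-N.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf, prefix]


def tesseract_commands(pages: List[Path], lang: str) -> List[List[str]]:
    """One Tesseract command per page image; output goes next to the image."""
    return [
        ["tesseract", str(page), str(page.with_suffix("")), "-l", lang]
        for page in sorted(pages)
    ]


def ocr_pdf(pdf: str, workdir: str, lang: str = "chi_sim") -> List[Path]:
    """Render the PDF, OCR each page, and return the expected .txt paths.

    Requires both pdftoppm and tesseract to be available on the PATH.
    """
    out_dir = Path(workdir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, str(out_dir / "page")), check=True)
    pages = sorted(out_dir.glob("page-*.png"))
    for cmd in tesseract_commands(pages, lang):
        subprocess.run(cmd, check=True)
    return [page.with_suffix(".txt") for page in pages]


if __name__ == "__main__":
    for txt in ocr_pdf("scan.pdf", "ocr_output", lang="chi_sim"):
        print(txt)
```

Rendering at a higher resolution usually gives Tesseract more to work with on small characters, at the cost of larger image files, so the dpi value is worth experimenting with.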
[Edited at 2023-07-05 01:04 GMT]