Choose OCR Software for Chinese PDFs and Images (Software applications)

Topic starter: Kirill Loktionov

Kirill Loktionov
Hungary
English > Russian

Jun 29, 2023

Hello!
I've seen some old topics where people asked about OCR software, but what about the current market? Which application works best for recognizing scanned Chinese documents (including drawings) or documents saved as images? We use ABBYY FineReader 15 for most languages, but its Chinese OCR quality is really bad: it often misrecognizes characters if they are blurred or simply expanded or condensed. Are there any alternatives? I tried Readiris, but it has far fewer options for setting areas on a page. Online resources like 2ocr work great, but the result is just plain text, so they can only serve as a fallback for the parts where FineReader fails.

[Edited at 2023-06-29 06:41 GMT]

[Edited at 2023-06-29 11:25 GMT]

Sakshi Garg

United States
Member
German > Italian

Tesseract

Jun 30, 2023

Hi,

I hope you are doing well! There are many programs on the market nowadays with extensive OCR support. For Chinese, I personally prefer Tesseract.

Tesseract is an open-source OCR engine that supports numerous languages, including Chinese. It can be a bit technical to set up and use, but it is known for its high accuracy.

You could give it a try to see how accurately it recognizes the characters.

I hope this helps!

Thank you.

Regards
S

Milan Condak

English > Czech

PDF24

Jun 30, 2023

Sakshi Garg wrote:

Tesseract is an open-source OCR engine that supports numerous languages, including Chinese.

Tesseract is part of several programs that have user interfaces. One of them is the PDF24 suite; look for its OCR tool.

https://www.pdf24.org/zh/

https://www.pdf24.org/en/

https://www.pdf24.org/cs/

Milan

Kirill Loktionov
Hungary
English > Russian

TOPIC STARTER

Tesseract Settings

Jul 1, 2023

Sakshi Garg wrote:

Hi,

I hope you are doing well! There are many programs on the market nowadays with extensive OCR support. For Chinese, I personally prefer Tesseract.

Tesseract is an open-source OCR engine that supports numerous languages, including Chinese. It can be a bit technical to set up and use, but it is known for its high accuracy.

You could give it a try to see how accurately it recognizes the characters.

I hope this helps!

Thank you.

Regards
S

Hi Sakshi,

Thank you for the tip! But how can I use a GUI with Tesseract? Unfortunately, the software still cannot work out by itself which areas to recognize. Is there documentation for such a setting?

Kind regards,
Kirill

Mr. Satan (X)
English > Indonesian

Choices

Jul 2, 2023

Kirill Loktionov wrote:
But how can I use a GUI with Tesseract?

You have several choices:
https://tesseract-ocr.github.io/tessdoc/User-Projects-–-3rdParty.html

That being said, I don't work with the Chinese language in any capacity, so I don't know if it is any good for Hanzi.

Is there documentation for such a setting?

The man page for Tesseract:
https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
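
The setting most relevant to "which areas to recognize" is probably the page segmentation mode, which the man page documents as `--psm`. It only changes how Tesseract segments the page automatically; it is not a tool for drawing areas by hand. A minimal sketch, assuming a few made-up layout names mapped to the documented mode numbers (`psm_for` is a hypothetical helper, not part of Tesseract):

```shell
# psm_for LAYOUT: print a Tesseract --psm value for a rough layout description.
# The layout names are illustrative; the numbers follow `tesseract --help-psm`.
psm_for() {
    case $1 in
        auto)     echo 3 ;;   # fully automatic page segmentation (the default)
        column)   echo 4 ;;   # single column of text of variable sizes
        vertical) echo 5 ;;   # single uniform block of vertically aligned text
        block)    echo 6 ;;   # single uniform block of text
        line)     echo 7 ;;   # single text line
        sparse)   echo 11 ;;  # sparse text, e.g. labels scattered over a drawing
        *)        echo 3 ;;   # unknown layout name: fall back to the default
    esac
}

# Example call (needs tesseract and the chi_sim data installed):
#   tesseract drawing.png out -l chi_sim --psm "$(psm_for sparse)"
```

For scanned drawings it may be worth comparing the `sparse` and `auto` results on the same page before settling on one.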

HTH, FWIW.

Milan Condak wrote:
Tesseract is part of several programs that have user interfaces.

I don't think this description is particularly accurate. The other programs you are referring to are simply graphical front ends for the Tesseract program itself. The situation is similar to ffmpeg or espeak. You can use these programs from the command-line interface, which I usually prefer.

[Edited at 2023-07-02 01:55 GMT]

Kirill Loktionov
Hungary
English > Russian

TOPIC STARTER

Interesting

Jul 4, 2023

I am a bit flabbergasted that I am writing the following words… Thank you, Mr. Satan!
I have tried several of the products listed here: https://tesseract-ocr.github.io/tessdoc/User-Projects-–-3rdParty.html
In particular:
Rescribe (unfortunately, I could not even find a way to connect to a server to open a document, only local folders),
normcap (I could not work out how to launch it; is it for Windows PCs?),
Free-Ocr-Windows-Desktop (it looks like no languages other than En/De/Es are available, and I found no setting for that; besides, it is a plain-text OCR and its quality in English is quite low, e.g. it read 'MACHINE MAINTENANCE INSTRUCTIONS' as 'uacanwc zxmn-ru\'An'cc msnaucnorvs').
I guess there is nothing as flexible as FineReader (apart from it being proprietary software). Perhaps there are some Chinese competitors designed to work with logograms. Time will tell.

Mr. Satan (X)
English > Indonesian

Using Tesseract from the Command-line Interface

Jul 5, 2023

This is why I prefer using Tesseract from the command-line interface. It worked quite nicely for me when I had to deal with scanned English documents. Here are the commands to use it without a GUI. Please note that I adapted them to your use case by adding the language parameter for Chinese, in both Traditional and Simplified variants. The language parameter is not required if the source document is in English, since Tesseract defaults to it. Feel free to pick one that suits y...

tesseract INPUT_FILENAME OUTPUT_FILENAME -l chi_tra
tesseract INPUT_FILENAME OUTPUT_FILENAME -l chi_sim

You will need Chinese language packages installed. The same is true for any languages you are working with. I'm using Linux, so it's easy to get them as they are available in the official repository. My apologies, but I can't help you if you're using Windows.
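
Before running the commands above, it can help to confirm that the Chinese data is actually installed. A small sketch, assuming the listing comes from `tesseract --list-langs` (one language code per line after a header line); `missing_langs` is a hypothetical helper name. On Debian or Ubuntu the packages are, as far as I know, `tesseract-ocr-chi-sim` and `tesseract-ocr-chi-tra`; names on other distributions may differ.

```shell
# missing_langs CODE...: read a language listing on stdin (one code per line)
# and print every requested code that is not in the listing.
missing_langs() {
    listing=$(cat)
    for lang in "$@"; do
        # -x matches whole lines only, so the header line and codes such as
        # chi_sim_vert are not mistaken for chi_sim.
        printf '%s\n' "$listing" | grep -qx "$lang" || printf '%s\n' "$lang"
    done
}

# Example call (needs tesseract on PATH; 2>&1 because older releases
# print the listing on stderr):
#   tesseract --list-langs 2>&1 | missing_langs chi_sim chi_tra
```

If the example call prints nothing, both language codes are available and the `-l chi_sim` and `-l chi_tra` commands should find their data.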

I should mention that Tesseract by itself doesn't support PDF as an input file format. For this, I'd use GIMP with the Export Layers plugin to convert the PDF document into separate image files. Then I'd extract the text with Tesseract using the commands above.

https://www.gimp.org/
https://github.com/kamilburda/gimp-export-layers
https://www.linuxuprising.com/2019/03/how-to-convert-pdf-to-image-png-jpeg.html
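
A command-line alternative to the GIMP route can be scripted as well. This is only a sketch, assuming `pdftoppm` from poppler-utils is installed alongside Tesseract; `ocr_pdf`, the 300 DPI resolution and the `page` prefix are illustrative choices, not anything prescribed by either tool.

```shell
# ocr_pdf PDF LANG OUTDIR: rasterize every page of PDF to PNG, then OCR each
# page image with the given Tesseract language code.
ocr_pdf() {
    pdf=$1
    lang=$2
    outdir=$3
    mkdir -p "$outdir" || return 1
    # pdftoppm writes OUTDIR/page-1.png, page-2.png, ... (zero-padded for
    # longer documents); -r sets the resolution in DPI.
    pdftoppm -r 300 -png "$pdf" "$outdir/page" || return 1
    for img in "$outdir"/page-*.png; do
        [ -e "$img" ] || continue          # no pages were produced
        # The second argument is an output base name; tesseract appends .txt.
        tesseract "$img" "${img%.png}" -l "$lang"
    done
}

# Example call:
#   ocr_pdf scan.pdf chi_sim ./scan_ocr
```

Each page ends up as a separate text file next to its image, which makes it easy to re-run a single bad page with different settings.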

[Edited at 2023-07-05 01:04 GMT]
