OCR-ing graphics embedded in Word? (Software applications)

Thread poster: pj-ffm

pj-ffm
Local time: 21:03
German > English

Jun 13, 2011

Hi all,

does anyone know of a product (or if it is at all possible) for extracting the text from graphics directly within Word?

Situation:
The document to be translated contains a large number of screenshots which need to be translated. I obviously don't want to have to type a glossary of source-target words by hand for each...

Due to the large number of graphics (several hundred) I also don't want to have to save/name/OCR each one separately. Managing that would be a nightmare.

Ideally I would be able to select the graphic in question, start a macro/application and end up with the text in another window, or inserted as a text block below the graphic in Word.

Any chance of this?

cheers,
Peter.

Jorge Payan

Colombia
Local time: 15:03
Member (2002)
German > Spanish

CodeZapper might do the job

Jun 14, 2011

Among other things, it lets you extract the images from a Word file, without any text; you can then use FineReader to OCR the extracted graphics-only file.

You can get CodeZapper here: http://asap-traduction.com/CodeZapper

saludos

pj-ffm
Local time: 21:03
German > English

TOPIC STARTER

Will try out a copy

Jun 14, 2011

Hi Jorge,

thanks for the suggestion. It seems like a useful tool to have generally so I've requested a copy from the site you linked to.

I'm guessing I'll have to brush up my VBA skills and try and write a macro for kicking off the OCR, grabbing the graphic, naming etc...

cheers,
Peter.

István Hirsch

Local time: 21:03
English > Hungarian

... or try this...

Jun 14, 2011

I think it is also possible to convert the whole Word file (as is) into a PDF file (with a free PDF maker or Adobe), OCR this PDF file, and translate the OCR output in Word.

pj-ffm
Local time: 21:03
German > English

TOPIC STARTER

Still no luck finding auto-graphic-grab-ocr-from-Word-macro, but...

Jun 14, 2011

Well, I've had a helpful email exchange with Dave (CodeZapper); however, it seems it won't really help me with what I want to do.

He also mentioned the possibility of using a dictation solution, but I'm not so sure about how I'd integrate that into my workflow.

Another tip he gave me was a script someone has written here:
http://www.autohotkey.com/forum/topic11186.html
The description says it takes a screen grab, calls an external OCR and generates a text file, pretty close to what I want conceptually and I'll try and investigate when I have time.
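The "call an external OCR and generate a text file" step could be sketched roughly as follows in Python. This is only an illustration, not the AutoHotkey script linked above: it assumes the open-source Tesseract command-line tool is installed and on the PATH (the thread does not name an OCR engine for this step), it assumes German source text (`deu`), and the function names are made up for the example.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image_path, lang="deu"):
    """Return the Tesseract command line for one image.

    Tesseract is invoked as: tesseract IMAGE OUTPUTBASE -l LANG
    and writes the recognised text to OUTPUTBASE.txt.
    """
    image = Path(image_path)
    output_base = image.with_suffix("")  # e.g. image1.png -> image1
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_image(image_path, lang="deu"):
    """Run the external OCR tool (assumed: Tesseract) and return its text."""
    cmd = build_ocr_command(image_path, lang)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if OCR fails
    return Path(cmd[2] + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_image("image1.png")` would leave an `image1.txt` next to the image and return its contents; any other command-line OCR engine could be swapped in by changing `build_ocr_command`.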

István, thanks for your input. It would be an idea if the document were nearly all graphics, but in this case doing it that way would present an even greater challenge in reformatting the document afterwards (there are lots of awkward tables etc. in addition to the graphics...)

cheers,
Pete.

p.s. what do other translators do when a significant part of a large document is graphics?
I mean, if it's a few graphics I just retype the source and target and charge a supplement, but it makes for horrible workflow and the CAT doesn't really benefit...

Natalie

Poland
Local time: 21:03
Member (2002)
English > Russian

Moderator of this forum

SITE LOCALIZER

Hi Pete

Jun 14, 2011

I, for example, own FineReader and use it to OCR images (it gives perfect results). I have never worked with images embedded in Word files. Aren't you able to obtain the images in their native format?

Please also take a look at http://www.abbyy.com/screenshot_reader/ - maybe this would be what you need. However, I doubt you would be able to use the embedded images with it.

Michael Beijer

United Kingdom
Local time: 20:03
Member
Dutch > English

.doc -> .docx -> .zip

Jun 14, 2011

one way of getting at just the images is:

re-save the Word .doc as a .docx, then rename it to a .zip

this will convert it into a zip folder containing all of the images in your Word document
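For anyone who prefers to script this step, here is a minimal Python sketch. It relies on the fact that a .docx file is a ZIP archive in which embedded pictures are normally stored under `word/media/`; Python's `zipfile` module can open the .docx directly, so no renaming is needed. The function name and the output folder are just examples.

```python
import zipfile
from pathlib import Path


def extract_embedded_images(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx into out_dir.

    Returns the extracted file names in archive order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything except the embedded media files.
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name
                target.write_bytes(archive.read(name))
                extracted.append(target.name)
    return extracted
```

For example, `extract_embedded_images("manual.docx", "images")` would fill an `images` folder with the pictures (here `manual.docx` is a placeholder name). Note that archive order is not guaranteed to match the order in which the pictures appear in the text.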

Natalie

Poland
Local time: 21:03
Member (2002)
English > Russian

Moderator of this forum

SITE LOCALIZER

???

Jun 14, 2011

Michael J.W. Beijer wrote:

one way of getting at just the images is:

re-save the Word .doc as a .docx, then rename it to a .zip

this will convert it into a zip folder containing all of the images in your Word document

1) you cannot make a ZIP file by renaming anything
2) if the images are embedded they are part of the doc file
3) what would you expect from a ZIP? It is just an archive

pj-ffm
Local time: 21:03
German > English

TOPIC STARTER

The OCR software is not really the issue

Jun 14, 2011

Hi Natalie,

Thanks for your suggestion. The OCR software is not really the issue here (I've heard good things about Finereader too), it's more a workflow issue.

The images are embedded in the document, not linked to, i.e. I can click on an image and copy it into the clipboard and paste into a graphics program, but I don't have access to the originals.
Unfortunately, some of them are literally just a few lines in height, so rather than a manageable number of large screen-shots there are hundreds of little ones... D'oh!

- Obviously I want to avoid having to re-type each source word by reading each graphic and then translating (time consuming and error prone)

- I also don't want to have to go through the document and, for each graphic: select it, copy, paste into an empty file, name and save, open the OCR software, convert to text, save the result file, open it in Word, translate, and re-import into the original doc. (Even if I can save a few steps by cutting/pasting rather than saving as a file each time, it's still an awfully time-consuming process.)

Michael, I'm also not quite sure I understand your approach. If I could get the images out in one go, it might help, but it would still represent a pretty nasty workflow (how would they be named and how would I locate each in the original document to re-insert the translated text after OCRing..?)

Hmm... I guess my utopian idea of a macro for doing this is not so straightforward...

cheers,
Peter.

Michael Beijer

United Kingdom
Local time: 20:03
Member
Dutch > English

let me clarify

Jun 14, 2011

I was only trying to point out a way of accessing all of the images embedded in a Word document in a simple way. It really does work*, just try it.

In your specific case, however, if the document is decently formatted, you could try:

saving it as a PDF from within Word,
then importing it into ABBYY,
then saving it back out to a Word doc, ...

ABBYY will then run OCR on the images inside the document.

Michael

* "With Word 2007, Microsoft introduced the XML-based .docx file format. The new format is essentially a ZIP container, which contains a series of XML files and any embedded images. To access the embedded images in a .docx file, use the following steps:
If it's not already a .docx file, Open the file in Word 2007 and save the file as a Word Document (*.docx).
Change the file extension on the original file from .docx to .zip."

(http://www.techrepublic.com/photos/save-images-in-microsoft-word-documents-as-separate-files/206113?seq=4)

[Edited at 2011-06-14 12:52 GMT]

Peter Linton (X)

Local time: 20:03
Swedish > English

Tell the customer

Jun 14, 2011

Tell the customer about the problem, explain that you are a translator, not a graphics specialist, and would they please send you an editable file.

They may not like it, but they have not fulfilled their side of the bargain. Time to educate the client as diplomatically as possible.

pj-ffm
Local time: 21:03
German > English

TOPIC STARTER

Re-educating customers...

Jun 14, 2011

Peter Linton wrote:

Tell the customer about the problem, explain that you are a translator, not a graphics specialist, and would they please send you an editable file.

They may not like it, but they have not fulfilled their side of the bargain. Time to educate the client as diplomatically as possible.

Indeed, I have explained that I would charge 50% more or by the hour for doing the graphics.

If it weren't for the fact that I have translated many other documents for this project and have built up a useful TM, I would have refused the job...

I just thought that now, faced with hundreds of them, would be a good time to look for an efficient and consistent way of dealing with graphics in my workflow.

cheers,
Peter.

p.s. I tried Michael's suggestion about saving as .docx and renaming to .zip, and indeed, in the "\word\media" folder there are all the images (all 292 of them!) saved as "image###.png". I'd have to think how I could make an efficient workflow from here though...
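One possible building block for such a workflow is to recover the order in which the image files appear in the text, so each OCR result can be matched back to its place in the document. The Python sketch below is a rough, untested-on-real-files illustration: it reads the relationship table (`word/_rels/document.xml.rels`) and walks `word/document.xml` looking for picture references. It only looks at the main document part (not headers, footers or footnotes), and the function name is made up for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
R_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
A_BLIP = "{http://schemas.openxmlformats.org/drawingml/2006/main}blip"
V_IMAGEDATA = "{urn:schemas-microsoft-com:vml}imagedata"


def images_in_document_order(docx_path):
    """Return media targets in the order they occur in the main text."""
    with zipfile.ZipFile(docx_path) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(archive.read("word/document.xml"))

    # Relationship id (e.g. "rId7") -> target (e.g. "media/image3.png")
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(REL_NS + "Relationship")
    }

    ordered = []
    for element in body.iter():  # iter() walks the tree in document order
        if element.tag == A_BLIP:          # DrawingML picture
            rel_id = element.get(R_NS + "embed")
        elif element.tag == V_IMAGEDATA:   # older VML picture
            rel_id = element.get(R_NS + "id")
        else:
            continue
        if rel_id in targets:
            ordered.append(targets[rel_id])
    return ordered
```

The returned list (entries such as `media/image3.png`) could be used to number the OCR results in reading order. The same image may be listed more than once, for instance if it is reused or if Word stores a fallback copy of a shape.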

István Hirsch

Local time: 21:03
English > Hungarian

This works for my sample file

Jun 15, 2011

1. If there are tab characters in the text, temporarily Replace All of them with a character that does not otherwise occur in the document, for example #.
2. Replace ^g with ^t^& (that is: insert a tab in front of each graphic element).
3. Select All.
4. Go to Table/Convert, select Text to Table, and choose: Number of columns: 2, Cell separator: tab (this pushes the graphic elements into a 2nd column).
5. Now you have all the graphic elements in the 2nd column. Take this column to OCR (keeping its column structure), then replace it with the OCR-ed column. Now you have a table with 2 columns (of course, the first column can contain embedded tables).
6. Select the table, go to Table/Convert, select Table to Text, check paragraph mark as the cell separator and uncheck „Embedded tables…” to keep the embedded tables untouched (this restores the original layout).
7. Replace All # with a tab (this restores the original tabs).
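To make the logic of the steps above easier to follow, here is a small Python model of the round trip. It does not touch Word at all: paragraphs are plain strings, an embedded graphic is represented by a made-up token such as `[IMG:a]`, and the OCR results are supplied as a dictionary. It assumes at most one graphic per paragraph, with the graphic at the end of that paragraph.

```python
import re

GRAPHIC = re.compile(r"\[IMG:\w+\]")  # stand-in for an embedded graphic (^g in Word)


def split_into_columns(paragraphs, placeholder="#"):
    """Steps 1, 2 and 4: build one (text, graphic) row per paragraph."""
    rows = []
    for para in paragraphs:
        para = para.replace("\t", placeholder)  # step 1: protect real tabs
        # Step 2: put a tab in front of the graphic.
        para = GRAPHIC.sub(lambda m: "\t" + m.group(0), para, count=1)
        text, _, graphic = para.partition("\t")  # step 4: tab = column break
        rows.append((text, graphic))
    return rows


def merge_columns(rows, ocr_text, placeholder="#"):
    """Steps 5 to 7: swap graphics for OCR text, then flatten the table."""
    paragraphs = []
    for text, graphic in rows:
        # Step 5: the second column is replaced by its OCR-ed counterpart.
        for cell in (text, ocr_text.get(graphic, graphic)):
            if cell:  # step 6: each non-empty cell becomes a paragraph
                # Step 7: restore the original tabs.
                paragraphs.append(cell.replace(placeholder, "\t"))
    return paragraphs
```

For example, `merge_columns(split_into_columns(["Click\tOK", "[IMG:a]"]), {"[IMG:a]": "Datei speichern"})` gives back the first paragraph with its tab intact, followed by the recognised text. Word itself may leave empty paragraphs for empty cells, which this model simply skips.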

pj-ffm
Local time: 21:03
German > English

TOPIC STARTER

Sounds interesting...

Jun 15, 2011

István Hirsch wrote:

1. If there are tab characters in the text, temporarily Replace All of them with a character that does not otherwise occur in the document, for example #.
2. Replace ^g with ^t^& (that is: insert a tab in front of each graphic element).
3. Select All.
4. Go to Table/Convert, select Text to Table, and choose: Number of columns: 2, Cell separator: tab (this pushes the graphic elements into a 2nd column).
5. Now you have all the graphic elements in the 2nd column. Take this column to OCR (keeping its column structure), then replace it with the OCR-ed column. Now you have a table with 2 columns (of course, the first column can contain embedded tables).
6. Select the table, go to Table/Convert, select Table to Text, check paragraph mark as the cell separator and uncheck „Embedded tables…” to keep the embedded tables untouched (this restores the original layout).
7. Replace All # with a tab (this restores the original tabs).

Hi István and thanks for the suggestion!

So if I understand correctly, the entire document will be put into a new, all-encompassing giant table with two columns: the second will contain just the graphics, and the first will contain every other document element (text, tables, text boxes, TOCs, links, etc.).

- Just the second column is copy/pasted into the OCR?

- The post-OCR result will be a single column with the text from the graphics in a one-column Word-compatible table

- This column is then pasted over the "graphics" column 2 in the doc? (ideally it needs to be aggregated, so that the text is below, or in some other way, associated with the corresponding graphic, but I guess there could be an enhanced solution involving further cunning search/replace steps...)

If it doesn't mess with the formatting, internal refs etc. and plays nice with Wordfast's segmentation it looks like a big step in the right direction!

I will give it a try when I have a moment and see what it does to the formatting in the rest of the document.

cheers,
Pete.

István Hirsch

Local time: 21:03
English > Hungarian

That's it

Jun 15, 2011

Absolutely correct. That is what I suggest and tried out on a sample file which was a mixture of some sentences, a 3 x 3 table and 3 embedded pictures. First, the pictures went one step right into a second column to be OCR-ed as a batch. Then the 2nd column was deleted, and the OCR-ed column was inserted. Then the OCR-ed cells went one step left into their original position.
Of course, this file is far from the complexity of a „real” file, so preliminary trials with a file similar to mi...
