OCR-ing graphics embedded in Word? | Thread poster: pj-ffm
pj-ffm | Local time: 21:03 | German > English
Hi all,
Does anyone know of a product for extracting the text from graphics directly within Word (or whether this is possible at all)?
Situation:
The document to be translated contains a large number of screenshots which need to be translated. I obviously don't want to have to type a glossary of source-target words by hand for each...
Due to the large number of graphics (several hundred) I also don't want to have to save/name/OCR each one separately. Managing that would be a nightmare.
Ideally I would be able to select the graphic in question, start a macro/application and end up with the text in another window, or inserted as a text block below the graphic in Word.
Any chance of this?
cheers,
Peter.

Jorge Payan | Colombia | Local time: 15:03 | Member (2002) | German > Spanish + ... | CodeZapper might do the job | Jun 14, 2011
Among other things, it lets you extract the images from a Word file, without any text; you can then use FineReader to OCR the extracted graphics-only file.
You can get CodeZapper here: http://asap-traduction.com/CodeZapper
saludos

pj-ffm | Local time: 21:03 | German > English | THREAD STARTER | Will try out a copy | Jun 14, 2011
Hi Jorge,
thanks for the suggestion. It seems like a useful tool to have generally so I've requested a copy from the site you linked to.
I'm guessing I'll have to brush up my VBA skills and try and write a macro for kicking off the OCR, grabbing the graphic, naming etc...
cheers,
Peter.

István Hirsch | ... or try this... | Jun 14, 2011
I think it is also possible to convert the whole Word file (as is) into a PDF file (with a free PDF maker or Adobe), OCR this PDF file, and translate the OCR output in Word.

pj-ffm | Local time: 21:03 | German > English | THREAD STARTER | Still no luck finding auto-graphic-grab-ocr-from-Word-macro, but... | Jun 14, 2011
Well, I've had a helpful email exchange with Dave (CodeZapper); however, it seems it won't really help me with what I want to do.
He also mentioned the possibility of using a dictation solution, but I'm not so sure about how I'd integrate that into my workflow.
Another tip he gave me was a script someone has written here:
http://www.autohotkey.com/forum/topic11186.html
The description says it takes a screen grab, calls an external OCR and generates a text file, pretty close to what I want conceptually and I'll try and investigate when I have time.
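The same idea (take images, call an external OCR program, collect text files) can be sketched in Python. This is only an illustration, not the AutoHotkey script linked above: it assumes the open-source Tesseract command-line tool is installed and on the PATH, that the graphics have already been saved as PNG files in one folder, and that the source language is German (`deu`). The function names and folder layout are made up for the example.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(image_path, output_base, language="deu"):
    """Return the command `tesseract IMAGE OUTPUT_BASE -l LANG` as a list."""
    # Tesseract appends ".txt" to OUTPUT_BASE by itself.
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def ocr_folder(image_dir, text_dir, language="deu"):
    """OCR every PNG in image_dir; write one text file per image into text_dir."""
    # Assumption: Tesseract is the external OCR tool; fail early if it is missing.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    out = Path(text_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    for image in sorted(Path(image_dir).glob("*.png")):
        base = out / image.stem  # e.g. text_dir/image1
        subprocess.run(build_ocr_command(image, base, language), check=True)
        results.append(Path(str(base) + ".txt"))
    return results
```

Calling `ocr_folder("images", "ocr_text")` (hypothetical folder names) would leave one text file per image, named after the image, so each OCR result can be traced back to its graphic.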
István, thanks for your input; it would be an idea if the document were nearly all graphics, but doing it like that in this case would present an even greater challenge in reformatting the document afterwards (there are lots of awkward tables etc. in addition to the graphics...)
cheers,
Pete.
p.s. what do other translators do when a significant part of a large document is graphics?
I mean, if it's a few graphics I just retype the source and target and charge a supplement, but it makes for a horrible workflow and the CAT doesn't really benefit...

Natalie | Poland | Local time: 21:03 | Member (2002) | English > Russian + ... | Moderator of this forum | SITE LOCALIZER
I, for example, own FineReader and use it to perform OCR of images (it gives perfect results). I have never used any images embedded into Word files. Aren't you able to obtain the images in their native format?
Please also take a look at http://www.abbyy.com/screenshot_reader/ - maybe this would be what you need. However, I doubt you would be able to use the embedded images with it.
Natalia

Michael Beijer | United Kingdom | Local time: 20:03 | Member | Dutch > English + ... | .doc -> .docx -> .zip | Jun 14, 2011
one way of getting at just the images is:
re-save the Word .doc as a .docx, then rename it to a .zip
this will convert it into a zip folder containing all of the images in your Word document

Natalie | Poland | Local time: 21:03 | Member (2002) | English > Russian + ... | Moderator of this forum | SITE LOCALIZER
Michael J.W. Beijer wrote:
one way of getting at just the images is:
re-save the Word .doc as a .docx, then rename it to a .zip
this will convert it into a zip folder containing all of the images in your Word document
1) you cannot make a ZIP file by renaming anything
2) if the images are embedded, they are part of the doc file
3) what would you expect from a ZIP? It is just an archive

pj-ffm | Local time: 21:03 | German > English | THREAD STARTER | The OCR software is not really the issue | Jun 14, 2011
Hi Natalie,
Thanks for your suggestion. The OCR software is not really the issue here (I've heard good things about FineReader too); it's more a workflow issue.
The images are embedded in the document, not linked to, i.e. I can click on an image and copy it into the clipboard and paste into a graphics program, but I don't have access to the originals.
Unfortunately, some of them are literally just a few lines in height, so rather than a manageable number of large screen-shots there are hundreds of little ones... D'oh!
- Obviously I want to avoid having to re-type each source word by reading each graphic and then translating (time consuming and error prone)
- I also don't want to have to go through the document and select each graphic-copy-paste into empty file-name/save-open OCR s/w-convert to text-save result file-open in Word-translate-reimport into original doc. (Even if I can save a few steps by cutting/pasting rather than saving as a file each time, it's still an awfully time-consuming process.)
Michael, I'm also not quite sure I understand your approach. If I could get the images out in one go, it might help, but it would still represent a pretty nasty workflow (how would they be named and how would I locate each in the original document to re-insert the translated text after OCRing..?)
Hmm... I guess my utopian idea of a macro for doing this is not so straightforward...
cheers,
Peter.

Michael Beijer | United Kingdom | Local time: 20:03 | Member | Dutch > English + ... | let me clarify | Jun 14, 2011
I was only trying to point out a simple way of accessing all of the images embedded in a Word document. It really does work*, just try it.
In your specific case, however, if the document is decently formatted, you could try:
saving it as a PDF from within Word,
then importing it into ABBYY,
then saving it back out to a Word doc, ...
ABBYY will then run OCR on the images inside the document.
Michael
*"With Word 2007, Microsoft introduced the XML-based .docx file format. The new format is essentially a ZIP container, which contains a series of XML files and any embedded images. To access the embedded images in a .docx file, use the following steps:
If it's not already a .docx file, Open the file in Word 2007 and save the file as a Word Document (*.docx).
Change the file extension on the original file from .docx to .zip."
(http://www.techrepublic.com/photos/save-images-in-microsoft-word-documents-as-separate-files/206113?seq=4)
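Because a .docx file is a ZIP container, the renaming step can also be skipped and the images pulled out with a short script. The following is a minimal sketch using only Python's standard `zipfile` module; it assumes the embedded images sit under `word/media/` inside the container, and the function name and the file names in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_embedded_images(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx into out_dir; return the saved paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    # zipfile opens a .docx directly; no renaming to .zip is needed.
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                target = out / Path(name).name  # keep the original file name
                target.write_bytes(archive.read(name))
                saved.append(target)
    return sorted(saved)
```

For example, `extract_embedded_images("manual.docx", "images")` would write the embedded graphics into an `images` folder, leaving the original document untouched.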
[Edited at 2011-06-14 12:52 GMT]

Peter Linton (X) | Local time: 20:03 | Swedish > English + ... | Tell the customer | Jun 14, 2011
Tell the customer about the problem, explain that you are a translator, not a graphics specialist, and would they please send you an editable file.
They may not like it, but they have not fulfilled their side of the bargain. Time to educate the client as diplomatically as possible.

pj-ffm | Local time: 21:03 | German > English | THREAD STARTER | Re-educating customers... | Jun 14, 2011
Peter Linton wrote:
Tell the customer about the problem, explain that you are a translator, not a graphics specialist, and would they please send you an editable file.
They may not like it, but they have not fulfilled their side of the bargain. Time to educate the client as diplomatically as possible.
Indeed, I have explained that I would charge 50% more or by the hour for doing the graphics.
If it weren't for the fact that I have translated many other documents for this project and have built up a useful TM, I would have refused the job...
I just thought that now, when faced with hundreds of them, would be a good time to look for an efficient and consistent way of dealing with graphics in my workflow.
cheers,
Peter.
p.s. I tried Michael's suggestion about saving as .docx and renaming to .zip, and indeed, in the "\word\media" folder there are all the images (all 292 of them!) saved as "image###.png". I'd have to think how I could make an efficient workflow from here though...
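One possible starting point for such a workflow is to list the image files in the order in which they appear in the document, so that each `image###.png` can be matched to its position in the text. The sketch below is an assumption-laden illustration, not a tested tool: it relies on the usual .docx layout (`word/document.xml` plus `word/_rels/document.xml.rels`), only looks at the main document body (not headers, footers or text boxes stored elsewhere), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML names used by Word for embedded pictures (DrawingML and older VML markup).
A_BLIP = "{http://schemas.openxmlformats.org/drawingml/2006/main}blip"
V_IMAGEDATA = "{urn:schemas-microsoft-com:vml}imagedata"
R_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def images_in_document_order(docx_path):
    """Return image targets (e.g. 'media/image3.png') in order of appearance."""
    with zipfile.ZipFile(docx_path) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(archive.read("word/document.xml"))
    # Map relationship ids (rId...) to the files they point at.
    targets = {rel.get("Id"): rel.get("Target") for rel in rels.iter(REL)}
    ordered = []
    for element in body.iter():  # iter() walks the tree in document order
        if element.tag == A_BLIP:
            rel_id = element.get(R_NS + "embed")
        elif element.tag == V_IMAGEDATA:
            rel_id = element.get(R_NS + "id")
        else:
            continue
        if rel_id in targets:
            ordered.append(targets[rel_id])
    return ordered
```

The resulting list could serve as the first column of a glossary table, with the OCR text and the translation for each graphic filled in next to its file name.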

István Hirsch | This works for my sample file | Jun 15, 2011
1. If there are tab characters in the text, temporarily replace them all (Replace All) with a character that does not otherwise occur in the text, for example #.
2. Replace ^g with ^t^& (that is: insert a tab in front of each graphic element).
3. Select All.
4. Go to Table/Convert, select Text to Table, and choose: Number of columns: 2, Cell separator: tab (to push the graphic elements into a 2nd column).
5. Now you have all the graphic elements in the 2nd column. Take this column to OCR (keeping its column structure), then replace it with the OCR-ed column. Now you have a table with 2 columns (of course, the first column can contain embedded tables).
6. Select the table, go to Table/Convert, select Table to Text, choose paragraph mark as the cell separator and uncheck „Embedded tables…” (to keep the embedded tables untouched); this restores the original layout.
7. Replace All # with a tab character (to restore the original tabs).

pj-ffm | Local time: 21:03 | German > English | THREAD STARTER | Sounds interesting... | Jun 15, 2011
István Hirsch wrote:
1. If there are tab characters in the text, temporarily replace them all (Replace All) with a character that does not otherwise occur in the text, for example #.
2. Replace ^g with ^t^& (that is: insert a tab in front of each graphic element).
3. Select All.
4. Go to Table/Convert, select Text to Table, and choose: Number of columns: 2, Cell separator: tab (to push the graphic elements into a 2nd column).
5. Now you have all the graphic elements in the 2nd column. Take this column to OCR (keeping its column structure), then replace it with the OCR-ed column. Now you have a table with 2 columns (of course, the first column can contain embedded tables).
6. Select the table, go to Table/Convert, select Table to Text, choose paragraph mark as the cell separator and uncheck „Embedded tables…” (to keep the embedded tables untouched); this restores the original layout.
7. Replace All # with a tab character (to restore the original tabs).
Hi István and thanks for the suggestion!
So if I understand correctly, the entire document will be put into a new, all-encompassing giant table with two columns: the second will contain just the graphics, and the first will contain every other document element (text, tables, text boxes, TOCs, links, etc.).
- Just the second column is copy/pasted into the OCR?
- The post-OCR result will be a single column with the text from the graphics in a one-column Word-compatible table
- This column is then pasted over the "graphics" column 2 in the doc? (ideally it needs to be aggregated, so that the text is below, or in some other way, associated with the corresponding graphic, but I guess there could be an enhanced solution involving further cunning search/replace steps...)
If it doesn't mess with the formatting, internal refs etc. and plays nice with Wordfast's segmentation it looks like a big step in the right direction!
I will give it a try when I have a moment and see what it does to the formatting in the rest of the document.
cheers,
Pete.

István Hirsch
Absolutely correct. That is what I suggested, and I tried it out on a sample file which was a mixture of some sentences, a 3 x 3 table and 3 embedded pictures. First, the pictures went one step right into a second column to be OCR-ed as a batch. Then the 2nd column was deleted, and the OCR-ed column was inserted. Then the OCR-ed cells went one step left into their original position.
Of course, this file is far from the complexity of a „real” file, so preliminary trials with a file similar to mine are suggested.