Latin-1 trouble with unicode (UTF-8) on XP
Thread poster: Dirk Bayer
Dirk Bayer
Local time: 09:48
English to German
Feb 23, 2012

I have recently installed and set up OmegaT 2.3.0_1 for English-German translations on Windows XP (Service Pack 3). I made a glossary (".tab" file) following the instructions I found, which said to use UTF-8 encoding and carriage return/linefeed pairs for line breaks.

However, the glossary and edit panes do not show the intended German characters (glyphs) but gobbledygook double-byte strings (e.g. "abhÃ¤ngig" instead of "abhängig"). Installing ostensibly true Unicode fonts like Bitstream Vera Sans and then setting them as OmegaT's font instead of the "Dialog" font made no difference. Applying those fonts in OpenOffice when viewing the target files likewise made no difference. A CP 1252 version of the same glossary displays correctly.

It seems OmegaT, OpenOffice, etc. only display CP 1252 on my PC and typing through the familiar US International Keyboard also only produces CP 1252 encoding. How can I produce UTF-8 encoding if clients ask for it?

I would be most grateful for help with this.


 
Didier Briel (Identity Verified)
France
Local time: 15:48
English to French
Use the right extension for UTF-8
Feb 23, 2012

Dirk Bayer wrote:

I have recently installed and set up OmegaT 2.3.0_1 for English-German translations on Windows XP (Service Pack 3). I made a glossary (".tab" file) following the instructions I found, which said to use UTF-8 encoding and carriage return/linefeed pairs for line breaks.

However, the glossary and edit panes do not show the intended German characters (glyphs) but gobbledygook double-byte strings (e.g. "abhÃ¤ngig" instead of "abhängig").

The documentation says (Glossaries > File format):
Glossary files can be either in system default encoding (and indicated by the extension .tab) or in UTF-8 (the extension .utf8).

So, simply rename your glossary with a .utf8 extension.

Didier
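
The symptom described above is what appears when UTF-8 bytes are decoded with the system default code page. Here is a minimal Python sketch, assuming the Windows default is CP 1252. The helper `glossary_encoding` is a hypothetical name that only mimics the two-extension rule quoted from the documentation; it is not OmegaT's actual code and it ignores other extensions such as .txt.

```python
import os

# Hypothetical helper: mimics the documented rule (.utf8 -> UTF-8,
# .tab -> system default). Assumes the system default is CP 1252.
def glossary_encoding(filename, system_default="cp1252"):
    ext = os.path.splitext(filename)[1].lower()
    return "utf-8" if ext == ".utf8" else system_default

def read_glossary_line(raw_bytes, filename):
    """Decode raw glossary bytes the way the extension rule implies."""
    return raw_bytes.decode(glossary_encoding(filename))

# One tab-separated glossary entry, saved as UTF-8 with a CR/LF line break.
raw = "dependent\tabhängig\r\n".encode("utf-8")

wrong = read_glossary_line(raw, "glossary.tab")    # decoded as CP 1252
right = read_glossary_line(raw, "glossary.utf8")   # decoded as UTF-8
print(repr(wrong))
print(repr(right))
```

The bytes are identical in both calls; only the assumed encoding differs, which is why renaming the file is enough.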


 
esperantisto (Identity Verified)
Local time: 16:48
Member (2006)
English to Russian
SITE LOCALIZER
Or .txt
Feb 23, 2012

I have no problem with OmegaT 2.3.x/2.5.x using .txt extension for my glossaries in UTF-8.

Dirk Bayer wrote:
the glossary and edit panes do not show the intended German characters (glyphs) but gobbledygook double-byte strings (e.g. "abhÃ¤ngig" instead of "abhängig").


Your glossary is actually in UTF-8 and has the correct format; simply change the extension.


 
Dirk Bayer
Local time: 09:48
English to German
TOPIC STARTER
Oops and thanks!
Feb 24, 2012

Oops! How did I miss that? Next question: where can I get a good egg remover for my face?

Seriously, thanks a million for the prompt and excellent responses!

Simply using an ".utf8" extension on the UTF-8 encoded glossary seems to have worked. Even typing non-English characters in OmegaT using my preferred method (US International Keyboard) now seems to produce the same results as insertion from the glossaries, no matter whether I use the CP 1252 version of the glossary (with ".tab" file name extension) or the UTF-8 version (with ".utf8" file name extension). Exporting an OmegaT-produced odt file to a UTF-8 cleartext file from OpenOffice now also produces the same output as OmegaT does from a ".utf8" source file to a UTF-8 target file. It seems I could even use good old CP 1252 encoded glossaries (with ".tab" extension!) and leave the Unicode worries entirely to OmegaT...

Only my previous attempt to use the UTF-8 version of the glossary with an erroneous ".tab" file name extension seems to produce the weird results I saw previously: garbled displays in both OmegaT and OpenOffice, plus a 4-byte string for a single non-English character upon exporting to UTF-8 cleartext from the OpenOffice odt file created by OmegaT.
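
The 4-byte string is consistent with double encoding: the two UTF-8 bytes of one character are misread as two CP 1252 characters, and each of those is then written out as UTF-8 again. The following Python sketch reproduces that chain under the assumption that the misreading used CP 1252; the variable names are illustrative only.

```python
original = "ä"

utf8_bytes = original.encode("utf-8")        # 2 bytes: C3 A4
misread = utf8_bytes.decode("cp1252")        # 2 characters: "Ã¤"
double_encoded = misread.encode("utf-8")     # 4 bytes: C3 83 C2 A4

print(utf8_bytes.hex(), "->", double_encoded.hex())

# Undoing the two steps in reverse order recovers the original character.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")
print(repaired)
```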


Remaining questions for any takers:

1.) I can't say I understand how inserting the same code sequences from glossaries with different file name extensions creates such different results (in the days before Unicode, a code sequence was what it was and you only had to apply the matching font to it). Still, consistent glyph displays and consistent underlying codes (as revealed in the cleartext files on a Unicode-ignorant classic Mac platform) at least suggest that the intended output is now being created. I wonder if there is a good way to verify UTF-8 vs CP 1252 encoding in the OpenOffice files, since they seem to react identically to font changes no matter what glossary was used in their creation, as long as the glossary file name matched the glossary's encoding.

2.) I wonder whether normalizing glossaries to use only straight quotes will now result in matches whether or not the source files contain straight or curly quotes, or whether such normalizing will even be necessary. I previously had mixed results.
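
On question 1, the encoding of a plain-text file (a glossary or a cleartext export, not the odt itself) can be checked from its raw bytes. Below is a rough Python heuristic, with `guess_encoding` as a hypothetical helper name: text with accented characters saved as CP 1252 is almost never valid UTF-8, so a strict UTF-8 decode separates the two cases in practice.

```python
def guess_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'not utf-8'."""
    try:
        data.decode("ascii")
        return "ascii"        # pure ASCII is identical in UTF-8 and CP 1252
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict decoding rejects invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8"    # e.g. CP 1252 or another single-byte code page

print(guess_encoding("abhängig".encode("utf-8")))
print(guess_encoding("abhängig".encode("cp1252")))
print(guess_encoding(b"dependent"))
```

To check a real file, read it in binary mode and pass the bytes to the function. This is only a heuristic, since a byte sequence that happens to be valid UTF-8 could in principle have been intended as CP 1252.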

This is an amazing forum.


 
esperantisto (Identity Verified)
Local time: 16:48
Member (2006)
English to Russian
SITE LOCALIZER
OOo files…
Feb 24, 2012

Dirk Bayer wrote:
I wonder if there is a good way to verify UTF-8 vs CP 1252 encoding in the OpenOffice files


OOo files are and have always been Unicode-based; there’s nothing to verify.


 
Didier Briel (Identity Verified)
France
Local time: 15:48
English to French
UTF-8 is UTF-8
Feb 24, 2012

Dirk Bayer wrote:
Simply using an ".utf8" extension on the UTF-8 encoded glossary seems to have worked. Even typing non-English characters in OmegaT using my preferred method (US International Keyboard) now seems to produce the same results as insertion from the glossaries, no matter whether I use the CP 1252 version of the glossary (with ".tab" file name extension) or the UTF-8 version (with ".utf8" file name extension). Exporting an OmegaT-produced odt file to a UTF-8 cleartext file from OpenOffice now also produces the same output as OmegaT does from a ".utf8" source file to a UTF-8 target file. It seems I could even use good old CP 1252 encoded glossaries (with ".tab" extension!) and leave the Unicode worries entirely to OmegaT...

OmegaT handles everything in UTF-8 internally. So, if the encoding of the input is correctly identified, the output will be correct, provided the target encoding can represent the required characters. For example, you cannot produce CP 1252 files containing Japanese.
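
The limitation on the output side can be shown with Python's built-in codecs. This is a small sketch, with `can_encode` as a hypothetical helper name; it says nothing about how OmegaT itself reports such cases.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("abhängig", "cp1252"))  # German fits in CP 1252
print(can_encode("日本語", "cp1252"))     # Japanese does not
print(can_encode("日本語", "utf-8"))      # UTF-8 covers all of Unicode
```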


I wonder if there is a good way to verify UTF-8 vs CP 1252 encoding in the OpenOffice files, since they seem to react identically to font changes no matter what glossary was used in their creation, as long as the glossary file name matched the glossary's encoding.

There's nothing to verify: all OpenOffice.org files are in UTF-8.
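
For anyone who wants to look anyway: an odt file is a ZIP package, and the document text lives in an XML member named content.xml. The Python sketch below reads that member and decodes it as UTF-8. To stay self-contained it builds a tiny stand-in package in memory, which is not a complete or valid odt; `read_odt_content` is a hypothetical helper name.

```python
import io
import zipfile

def read_odt_content(odt_bytes: bytes) -> str:
    """Return content.xml from an ODF package, decoded as UTF-8."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        return package.read("content.xml").decode("utf-8")

# Stand-in package for demonstration only (a real odt has more members).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    xml = '<?xml version="1.0" encoding="UTF-8"?><text>abhängig</text>'
    package.writestr("content.xml", xml.encode("utf-8"))

content = read_odt_content(buffer.getvalue())
print(content)
```

With a real file, read it in binary mode and pass the bytes to `read_odt_content`; the first line of the result shows the declared encoding.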

I wonder whether normalizing glossaries to use only straight quotes will now result in matches whether or not the source files contain straight or curly quotes, or whether such normalizing will even be necessary. I previously had mixed results.

It depends on plenty of things.
In short, OmegaT has no specific function to understand that a straight quote is the same as a curly one.
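
Since OmegaT does not treat the two as equal, one possible workaround is to normalize quotes in a preprocessing step outside OmegaT. Here is a minimal Python sketch, assuming only the four common English curly quotes need mapping; `normalize_quotes` and `QUOTE_MAP` are hypothetical names, and other quote styles (such as German low quotes) would need their own entries.

```python
# Map curly quotes to their straight ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text: str) -> str:
    """Replace curly quotes with straight quotes; leave everything else alone."""
    return text.translate(QUOTE_MAP)

sample = "the client\u2019s \u201cfinal\u201d draft"
print(normalize_quotes(sample))
```

Normalizing only the glossary would not be enough for matching; the source text would need the same treatment.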


This is an amazing forum.

For advanced discussion on OmegaT, the Yahoo support group would still be more suitable.

Didier


 
Dirk Bayer
Local time: 09:48
English to German
TOPIC STARTER
Thanks again!
Feb 24, 2012

These are very useful and reassuring confirmations.

Many thanks to both of you.


 

