A CAT tool for translators only? (CAT Tools Technical Help)

Thread poster: Selcuk Akyuz

Selcuk Akyuz

Türkiye
Local time: 13:05
English to Turkish
+ ...

Jul 14, 2012

I want a CAT tool that analyses segments and calculates matches only if the segment has at least 10 words.

Not clear?!

Some years ago an agency sent me a TTX project with almost 70% exact matches and 20% fuzzies. But it was an InDesign project, badly segmented everywhere. Most segments were 2 or 3 words, divided by a tag, mostly by carriage returns (hard or soft).

It was almost impossible to translate that file without joining segments (extra work and som...

(segment 1) This is not specific
(segment 2) to
(segment 3) Trados,
(segment 4) you can get such files
(segment 5) in any
(segment 6) CAT
(segment 7) tool.

Possibly you have some exact matches for each of the above segments(?), but a segment is not always a sentence and it is a nightmare when you get such a file (not only extra work but also free work).

In localisation you can have 1 or 2-word segments, but CAT tools are not localisation tools (like Passolo or Catalyst). In every CAT tool, matches should be calculated (as an option) only if the segment is longer than 10 (or 7) words.

Instead, CAT tools are now calculating even subsegment matches. No, I want a CAT tool with proper analysis and I will buy it even if its price is 1000 EUR.
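The requested option can be sketched in a few lines of Python. This is only an illustration, not how any existing CAT tool works: the `Segment` class, the `match_percent` field, the 75% fuzzy threshold and the choice to count short segments as new words are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    match_percent: int  # hypothetical: best TM match for this segment, 0-100


def word_count(text: str) -> int:
    """Count whitespace-separated words (a simplification for many languages)."""
    return len(text.split())


def analyse(segments, min_words=10, fuzzy_threshold=75):
    """Return weighted word counts per band.

    Segments shorter than min_words are never credited as matches;
    their words are counted as new (assumption for this sketch).
    """
    report = {"exact": 0, "fuzzy": 0, "new": 0}
    for seg in segments:
        words = word_count(seg.source)
        if words < min_words:
            band = "new"
        elif seg.match_percent == 100:
            band = "exact"
        elif seg.match_percent >= fuzzy_threshold:
            band = "fuzzy"
        else:
            band = "new"
        report[band] += words
    return report
```

With such an option, a one-word segment like "to" would never show up as an exact match in the report, whatever the TM says.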

MikeTrans
Germany
Local time: 12:05
Italian to German
+ ...

Not the perfect solution but...

Jul 14, 2012

Hi Selcuk,

I'm still trying to build my perfect CAT tool myself. Well, every so often I try, but give up 10 minutes later... That is to say: the most useful features are spread across various CAT tools, which in the end gives you no advantage.

That said, since I know you're also using DVX2, do the following, though I can't believe you don't already know it:

In DVX2, just filter the segments, use SQL if necessary to get only x words in a sentence; export to external view; open & select all source columns and import as a new document.
Then, analyze.
It's not a perfect solution, I know, it takes time, but if I see a better solution I'll tell you immediately.

As for *any* segmentation problem: get your docs imported as tmx files, this avoids any segmentation problems, it ensures all the codes will remain in the tmx files because DVX2 does a wonderful job with this filter.

I hope this will be at least a quick help for you.

Good luck,
Mike

Selcuk Akyuz

Türkiye
Local time: 13:05
English to Turkish
+ ...

TOPIC STARTER

SQL filter is a solution in DVX but

Jul 15, 2012

Hi Mike,

I know about the InStr VBA function which can be used in SQL filters in DVX.

For example (instr(1, source, " ", 1)= 0) displays segments with 1 word only.

To display segments with 10 or fewer words a longer statement is needed:

(instr(1, source, " ", 1)= 0) OR (instr(instr(1, source, " ", 1)+1, source, " ",1)= 0) OR (instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)= 0) OR (instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0) OR (instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0) OR (instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0) OR (instr(instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0) OR (instr(instr(instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0) OR (instr(instr(instr(instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0) OR (instr(instr(instr(instr(instr(instr(instr(instr(instr(instr(1, source, " ", 1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)+1, source, " ",1)= 0)
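To see why every OR term is needed, the nested InStr logic can be replayed in Python. This is a rough stand-in, not DVX code: `instr` below only imitates the 1-based behaviour of VBA's InStr (the compare argument is dropped), and `nth_space` and `at_most_n_words` are names invented for this sketch.

```python
def instr(start: int, text: str, needle: str) -> int:
    """Imitate VBA InStr: 1-based position of needle at or after start, 0 if absent."""
    return text.find(needle, start - 1) + 1


def nth_space(text: str, depth: int) -> int:
    """Evaluate the InStr call nested `depth` levels deep, as in the filter above."""
    pos = instr(1, text, " ")
    for _ in range(depth - 1):
        # When pos is 0 the next search restarts at position 1,
        # so a deeper term can be non-zero again.
        pos = instr(pos + 1, text, " ")
    return pos


def at_most_n_words(text: str, n: int) -> bool:
    """OR together the terms for depths 1..n, mirroring the long filter."""
    return any(nth_space(text, depth) == 0 for depth in range(1, n + 1))
```

For the two-word segment "a b", the second term is 0 but the third term finds the first space again, which is why the deepest term alone is not enough.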

Then I can lock these segments, go to All Except Locked Segments view, change their status to Pending and export these Pending segments longer than 10 words to an External View file. Hide all except the Source column and create a new project for these segments.

Not so difficult for DVX users, but what about other CAT tools?

I normally do not receive projects which require use of a specific CAT tool, so I am lucky. But for those who receive CAT projects, this feature should be included in the analysis.

Matches for segments with fewer than 10 words are not reliable. See http://www.proz.com/forum/sdl_trados_support/225082-the_lord_of_the_rings.html

That is why CAT tool developers should implement a feature to exclude short segments from analysis reports. But then agencies will use another CAT tool.

MikeTrans wrote:
As for *any* segmentation problem: get your docs imported as tmx files, this avoids any segmentation problems, it ensures all the codes will remain in the tmx files because DVX2 does a wonderful job with this filter.

How can I import a doc file as tmx file?! Perhaps you can answer it in the DVX forum, it is better to discuss that short segments issue here.

Selcuk

Samuel Murray

Netherlands
Local time: 12:05
Member (2006)
English to Afrikaans
+ ...

Another idea

Jul 15, 2012

Selcuk Akyuz wrote:
I want a CAT tool that analyses segments and calculates matches only if the segment has at least 10 words.

I don't think it would be easy to come up with a method of catching all the instances you refer to. However, I think segment length is possibly not the best way to do it. How about something that excludes segments that do not start on a start-of-sentence indicator (e.g. a capital letter) and do not end on an end-of-sentence indicator (e.g. full stop, question mark or exclamation mark)?
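A minimal sketch of that heuristic in Python, assuming the "and" is meant literally (a segment is excluded only when it has neither indicator). The function names are made up for the example, and the check is only meaningful for scripts that use capital letters and these punctuation marks.

```python
SENTENCE_END = (".", "?", "!")


def starts_like_sentence(segment: str) -> bool:
    """True if the first non-blank character is an uppercase letter."""
    stripped = segment.strip()
    return bool(stripped) and stripped[0].isupper()


def ends_like_sentence(segment: str) -> bool:
    """True if the segment ends with a full stop, question mark or exclamation mark."""
    return segment.strip().endswith(SENTENCE_END)


def is_fragment(segment: str) -> bool:
    """True when the segment has neither a sentence start nor a sentence end."""
    return not starts_like_sentence(segment) and not ends_like_sentence(segment)
```

Applied to the seven example segments from the first post, this flags "to", "you can get such files", "in any" and "CAT"-style fragments only when they lack both indicators; "Trados," and "tool." are kept, so a stricter "or" reading may be closer to what is wanted.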

Jaroslaw Michalak

Poland
Local time: 12:05
Member (2004)
English to Polish

SITE LOCALIZER

MemoQ

Jul 15, 2012

In MemoQ you can check the segments against specified character length (not number of words), based on the comment field. However, as you cannot set several comments at the same time, this requires a roundtrip through rtf column export (and Excel to make the filling out easier)...

I am not sure if character length suits your purpose, though.

On the other hand, if you roundtrip through Excel, it should be easy enough to check the number of words there and then filter tho...

Selcuk Akyuz

Türkiye
Local time: 13:05
English to Turkish
+ ...

TOPIC STARTER

good segmentation is another issue

Jul 15, 2012

Hi Samuel,

Are there punctuation marks in all languages, and what about capital letters? On the other hand, a word is not always a word in all languages. Japanese, I think, is a good example of both.

Words or characters?
The number of words in a sentence differs even in European languages, long words in German, plenty of 'la' and 'le' in a French text, and then there are some agglutinative languages. That is why character (or line) count is preferred in some countries.
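A small Python illustration of the point, using a naive whitespace word count (an assumption; real tools use their own counting rules). The sample strings are chosen only for the example.

```python
def word_count(text: str) -> int:
    """Naive count: words are whatever whitespace separates."""
    return len(text.split())


def char_count(text: str) -> int:
    """Count characters, ignoring spaces."""
    return len(text.replace(" ", ""))


samples = {
    "English": "motor vehicle liability insurance",
    "German": "Kraftfahrzeughaftpflichtversicherung",
    "Japanese": "これはペンです",
}

for language, text in samples.items():
    print(language, word_count(text), char_count(text))
```

The German compound and the Japanese sentence each count as a single "word" here, so a threshold expressed in words would treat them very differently from the English phrase.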

Some CAT tools, e.g. MemoQ, can sort segments based on character length. I think any CAT tool can make a filter to display segments with fewer than n words.

But is it sufficient? No! This should also be implemented (as a standard) in analysis reports. A Trados, MemoQ or Deja Vu analysis should not consider short segments because they are not reliable!

A simple one-word segment, e.g. 'parts' can have several translations in French (and in all languages). How can we consider it an exact match?

Selcuk Akyuz

Türkiye
Local time: 13:05
English to Turkish
+ ...

TOPIC STARTER

MemoQ

Jul 15, 2012

Hi Jarosław,

I know how to sort segments based on number of words in MemoQ. I can lock them and make an analysis not including locked segments. But is there a setting in the analysis window to exclude segments shorter than n words?

There are solutions to exclude them in every CAT tool but are we translators aware of the useless exact and fuzzy short segments?

Heinrich Pesch

Finland
Local time: 13:05
Member (2003)
Finnish to German
+ ...

Translate in Word

Jul 15, 2012

If the document is full of tags you must request a text in Word and translate using Wordfast or the like. After translation someone else may reformat the text as they like. I would not count any matches at all but bill according to the Word word count.

[Edited at 2012-07-15 23:19 GMT]

Selcuk Akyuz

Türkiye
Local time: 13:05
English to Turkish
+ ...

TOPIC STARTER

in real world there are discounts

Jul 15, 2012

Heinrich Pesch wrote:

If the document is full of tags you must request a text in Word and translate using Wordfast or the like. After translation someone else may reformat the text as they like. I would not count any matches at all but bill according to the Word word count.

Hi Heinrich,

The problem is not tags or any specific CAT tool or document types here. And I generally work for direct clients who do not ask for discounts for repetitions.

But many translators get TMs from agencies and are asked for discounts, right?

Translators failed to unite against such requests, but at least they should ask that short segments not be included in match analyses. And this could be achieved only if supported by CAT tool developers.

So simple, make a setting in CAT A or B so that (discount) analysis will exclude segments shorter than n words.

MikeTrans
Germany
Local time: 12:05
Italian to German
+ ...

answer about SQL + segmentation

Jul 15, 2012

Selcuk Akyuz wrote:

To display segments with 10 or fewer words a longer statement is needed:
... (long SQL statement taken out)...

To express an SQL statement which contains x words, you don't focus on the words, but on the spaces separating the words, so it should look like that:

Sentence like "* * * *" etc...
This will give you 3 word sentences. If you want any numbers or non-words excluded, this should work:

((Sentence not like "[^a-z]*[0-9]*") OR (Sentence not like "[0-9]*[^a-z]*")) AND Sentence like "* * * *"

This will work for 3 words; for x words add asterisks followed by spaces in the last Like statement ending with an asterisk.

How can I import a doc file as tmx file?! Perhaps you can answer it in the DVX forum, it is better to discuss that short segments issue here.

With your consent I will copy my answer to the Yahoo group. I do the following, based on FarkasAndras' advice in LF Aligner:

- Import your doc in DVX2; select all rows and F5 (copy source to target)
- Create a new TM and send file to TM
- Export new TM as tmx
- Reimport tmx as file to translate in your project

This has several advantages, especially reducing codes, and you can handle ALL sorts of documents with it, while also being able to join/split at will.
This is particularly helpful when you have to deliver in legacy CAT tool formats: use the same procedure, but instead of importing into DVX2 in the first place, import, copy source to target, and pretranslate in the CAT tool you need for delivery in order to build a tmx file.
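For readers who have not looked inside such a file, here is a hedged Python sketch of what a TMX with "source copied to target" contains. It is not what DVX2 does internally; `build_tmx`, the header values and the `en`/`tr` language codes are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute as xml:lang
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(segments, src_lang="en", tgt_lang="tr"):
    """Build a minimal TMX 1.4 document where each target equals its source."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch",        # placeholder values for this example
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for text in segments:
        tu = ET.SubElement(body, "tu")
        for lang in (src_lang, tgt_lang):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text  # source copied to target
    return ET.tostring(tmx, encoding="unicode")


print(build_tmx(["This is not specific to Trados."]))
```

Each translation unit is independent of the original document layout, which is why joining and splitting become possible once the text lives in the TMX.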

Mike

[Edited at 2012-07-15 21:04 GMT]

Selcuk Akyuz

Türkiye
Local time: 13:05
English to Turkish
+ ...

TOPIC STARTER

SQL filter

Jul 15, 2012

MikeTrans wrote:

To express an SQL statement which contains x words, you don't focus on the words, but on the spaces separating the words, so it should look like that:

Sentence like "* * * *" etc...
This will give you 3 word sentences.

Hi Mike,

In fact that was the first SQL filter I tried but there was a mistake in my filter, NOT was forgotten. And in your example it should be "Source", "Sentence" is used in TM not in Project files.

So it should be Source NOT LIKE "* * * *"

Otherwise, if I don't add the NOT operator, even a segment with 20 words will be displayed because "* * *" exists in a long segment as well.
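The effect of the missing NOT can be checked outside DVX with a rough Python imitation. `fnmatchcase` from the standard library treats `*` as "any run of characters", which is only an approximation of the LIKE operator, and `like` is a helper name invented here.

```python
from fnmatch import fnmatchcase


def like(text: str, pattern: str) -> bool:
    """Approximate LIKE with shell-style wildcards (* matches any characters)."""
    return fnmatchcase(text, pattern)


long_segment = "one two three four five six seven eight"
short_segment = "one two three"

# The pattern only demands three spaces somewhere, so long segments match too.
print(like(long_segment, "* * * *"))       # True
print(like(short_segment, "* * * *"))      # False
print(not like(short_segment, "* * * *"))  # True
```

So LIKE with this pattern selects segments with at least three spaces, while NOT LIKE selects the short ones. Since `*` may match nothing, the filter counts spaces rather than words.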

Enough SQL for today

MikeTrans wrote:
I do the following, based on FarkasAndras' advice in LF Aligner:

- Import your doc in DVX2; select all rows and F5 (copy source to target)
- Create a new TM and send file to TM
- Export new TM as tmx
- Reimport tmx as file to translate in your project

This has several advantages, especially reducing codes, and you can handle ALL sorts of documents with it, while also being able to join/split at will.
This is particularly helpful when you have to deliver in legacy CAT tool formats: use the same procedure, but instead of importing into DVX2 in the first place, import, copy source to target, and pretranslate in the CAT tool you need for delivery in order to build a tmx file.

But you can't export as a doc file, right? Possibly, after finishing translation of the TMX file, you pretranslate the doc file with the dedicated TM.

Number of codes will not change in this method, you can join/split segments but then when pretranslating the doc file you will not get exact matches for joined/split segments.

It seems that CAT tool support personnel are not interested in this topic.

[Edited at 2012-07-15 21:55 GMT]

MikeTrans
Germany
Local time: 12:05
Italian to German
+ ...

And the winner, ehem, the right formula is:

Jul 15, 2012

Sentence shorter than 10 words:
Equivalent to a sentence which doesn't have at least 10 words:

Source not like "*[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]* *[a-z]"

For segments ending with a space, just add a space after the last [a-z] , correct and repeat.

Hope this helps,
Mike

[Edited at 2012-07-15 23:16 GMT]

Heinrich Pesch

Finland
Local time: 13:05
Member (2003)
Finnish to German
+ ...

You are right

Jul 15, 2012

Selcuk Akyuz wrote:

The problem is not tags or any specific CAT tool or document types here. And I generally work for direct clients who do not ask for discounts for repetitions.

But many translators get TMs from agencies and are asked for discounts, right?

Translators failed to unite against such requests, but at least they should ask that short segments not be included in match analyses. And this could be achieved only if supported by CAT tool developers.

So simple, make a setting in CAT A or B so that (discount) analysis will exclude segments shorter than n words.

Very short segments are questionable matches, but often even one-word segments are acceptable 100% matches. I doubt that one can draw a reasonable line there.
On the other hand, I often get jobs where most of the matches below 99% are in fact 100% matches; the author has merely reformatted the segment or corrected a spelling mistake since the previous edition.
All work should be priced according to the effort it takes to complete. For rough formatting there must be a penalty.

[Edited at 2012-07-16 05:29 GMT]

Selcuk Akyuz

Türkiye
Local time: 13:05
English to Turkish
+ ...

TOPIC STARTER

I don't need SQL filters :)

Jul 15, 2012

Mike, another good try but it fails

Test it and see for yourself.

I can use the loooong filter with InStr function or the shorter one: Source NOT LIKE "* * * * * * * * * * *" They both work.
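The shorter filter is easy to generate for any threshold. A sketch in Python, assuming the intent is "keep segments with at most n words" and again using `fnmatchcase` as a stand-in for LIKE; `not_like_pattern` and `passes_filter` are hypothetical names.

```python
from fnmatch import fnmatchcase


def not_like_pattern(max_words: int) -> str:
    """Pattern with max_words + 1 asterisks, i.e. max_words spaces."""
    return " ".join(["*"] * (max_words + 1))


def passes_filter(source: str, max_words: int = 10) -> bool:
    """Emulate: Source NOT LIKE <pattern>, keeping segments of at most max_words words."""
    return not fnmatchcase(source, not_like_pattern(max_words))


print('Source NOT LIKE "%s"' % not_like_pattern(10))
```

A segment of ten words has nine spaces and passes; an eleven-word segment has ten spaces, matches the pattern and is filtered out.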

I like using or creating SQL filters but what I want here is a tool that can exclude shorter segments from analysis. That is all! And I want it for all CAT users, especially for those who receive CAT projects with exa...

MikeTrans
Germany
Local time: 12:05
Italian to German
+ ...

I see your point now - a little late...

Jul 15, 2012

Selcuk,

yes, I understand your argument now. Well, what I do is give my *good* clients a tolerance margin of +/- words in a text if they tell me "the text has 1344 words". But the problem with short segments when counting fuzzy matches is still a tricky one; I agree with Heinrich.

Personally I don't accept discounts for fuzzy matches and for any new client I make it very clear: they have to take out what's not to be translated. This will help the relationship in the future.

Also, where there are rules there are exceptions. I think I once ran into a translation of technical drafts with segments 1-2 words in length, very hard to translate, a lot of research to be done, and I needed to contact the author more than once. But that's part of my job; I cannot ask for higher rates just because I take more time to translate. The rate was appropriate for the subject, but it was a nightmare for me, I remember...

DVX2, Trados Studio, MemoQ etc.: I would not be surprised if all those CAT tools count words and output their analyses differently. I think they have just added this feature to keep the translator busy.
I've heard that in Studio 2009 the Analyze function is broken due to a serious bug; I don't know about Studio 2011.

Mike

[Edited at 2012-07-16 00:04 GMT]

[Edited at 2012-07-16 00:11 GMT]
