LF Aligner questions (CAT Tools Technical Help)

Thread poster: MikeTrans

MikeTrans
Germany
Local time: 17:56
Italian to German
+ ...

Jun 29, 2012

Hi,

The open-source aligner from FarkasAndras is best suited to very large documents and produces very good alignment results.
As input, it can handle txt, doc, docx, xls, tmx and pdf files.
The output formats are tab-delimited text, tmx and xls (features that most commercial products don't have!).
It also offers special filters for reviewing the alignment in Excel.

If you are interested, give it a try and download it at

http://sourceforge.net/projects/aligner/

My questions about LF Aligner:

I have 2 very large source and target documents in txt. LF Aligner can build a tab-delimited text from them, but I would like to tell LF Aligner that every line ending in a carriage return in the source document corresponds *exactly* to the same line in the target document. In fact, I don't want to align at all, just to merge these 2 huge documents into a single tab-delimited document. How can I do that so that LF Aligner doesn't try to align and skip some sentences?

Note: if these docs were somewhat smaller, I would use UltraEdit, but with these giants I only get an "Out of Memory" error.

Thanks very much for your feedback,
Mike

[Edited at 2012-06-29 17:29 GMT]

FarkasAndras

Local time: 17:56
English to Hungarian
+ ...

generate tabbed

Jun 29, 2012

Well, the aligner has no such feature. It's not a "normal" use case.
I already have the code necessary for this, so here you go:
https://dl.dropbox.com/u/16377950/maketabbed.exe

It's obviously pretty spartan, but it should work. It handles UTF-8 only and doesn't do any preprocessing: if your input files contain tab characters, the output file will include them as well.

By the way, I use Notepad++ for large text files. It has worked really well for me up to a couple of hundred MB, which is as far as I've needed it to go, although I don't know whether it offers an easy way to merge two files into a single tabbed text file.
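
For readers who cannot run the .exe, the same idea can be sketched in a few lines of Python. This is not the code behind maketabbed.exe (which is a Perl script); the function name `make_tabbed` and its parameters are made up for illustration. It pairs line N of the source file with line N of the target file, reads both files as a stream so memory use stays flat, and, like the tool described above, passes any tab characters in the input straight through.

```python
from itertools import zip_longest


def make_tabbed(source_path, target_path, output_path):
    """Join line N of the source with line N of the target, separated by a tab.

    Both files are read line by line, so even very large files never have
    to fit into memory. Returns the number of line pairs written.
    """
    pairs = 0
    # "utf-8-sig" reads plain UTF-8 and also drops a leading BOM if present.
    with open(source_path, encoding="utf-8-sig") as src, \
         open(target_path, encoding="utf-8-sig") as tgt, \
         open(output_path, "w", encoding="utf-8", newline="\n") as out:
        for src_line, tgt_line in zip_longest(src, tgt):
            if src_line is None or tgt_line is None:
                # One file ran out of lines before the other did.
                raise ValueError(f"files differ in length after {pairs} lines")
            out.write(src_line.rstrip("\n") + "\t" + tgt_line.rstrip("\n") + "\n")
            pairs += 1
    return pairs
```

Calling `make_tabbed("source.txt", "target.txt", "tabbed.txt")` (hypothetical file names) would write one tab-separated pair per line and stop with an error if the two files do not have the same number of lines.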

MikeTrans
Germany
Local time: 17:56
Italian to German
+ ...

TOPIC STARTER

Thanks!

Jun 30, 2012

Hi Farkas,

thanks very much. The problem with UltraEdit when creating tabbed documents is that I must first create columns, for which UE pads the lines with trailing spaces. This can blow a document up to 20x its size or more, not to mention the RAM requirements. I had to kill everything with the Task Manager and purge all my temp directories, which had grown to several gigabytes...

The point of all this:
When I build a very large database (EMA, DGT, EuroParl etc., or even chunks extracted from them), I always build 2 versions of it:

1) One purged of duplicates and other garbage, to be used by a CAT tool
2) One with the complete plain text, not necessarily well aligned, which I send to XBench.

XBench has the great feature of showing search results in context (it displays +/- 10 TM segments around any hit). This helps a lot with DBs that are not well aligned or where the original text is broken because of conversion limits (the EMEA medical DB is a disaster in this regard, but still a big help for me).

So, for point 2) it's enough to create a tab-delimited text that can be sent to XBench. What's important is to check that both files have the same number of lines (in my case, 1.116.368 lines for both language texts of EMEA). Note that LF Aligner drops about 150.000 segments if I align these files, but the result is very good for my purpose 1). Once purged, only 372.000 segments remain!
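
The line-count check mentioned above can be done without opening the files in an editor. The following is only a sketch: `count_lines` and `same_length` are illustrative names, and it assumes the files use LF or CRLF line endings (bare CR endings would not be counted).

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count the lines of a file by scanning its raw bytes in chunks."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines


def same_length(source_path, target_path):
    """Return (lengths_match, source_line_count, target_line_count)."""
    src = count_lines(source_path)
    tgt = count_lines(target_path)
    return src == tgt, src, tgt
```

Because the files are read as bytes in fixed-size chunks, the encoding does not matter for the count and memory use does not depend on file size.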

Anyhow, I very much appreciate your file link and your time. LF will surely be a big help if used together with my Olifant TM manager or even when preparing TMX files to be used for CAT hopping. I still have to read your docs and do some experiments for such scenarios.

Thank you very much!
Mike

[Edited at 2012-06-30 10:34 GMT]

FarkasAndras

Local time: 17:56
English to Hungarian
+ ...

TMX -> tabbed

Jun 30, 2012

I use xbench in a similar way, so I get the use case.
I'm not sure how you end up with two separate txt files, though. You can take a tmx file such as the ones from OPUS, the DGT-TM etc. and convert it to a tabbed txt file in one step with a tool from the grab bag on sourceforge.
I don't use the EMEA corpus but I'd assume they also provide either tabbed text or TMX...?
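
The TMX-to-tabbed conversion can also be sketched in Python for those who want to see what it involves. This is not the SourceForge tool mentioned above; `tmx_to_tabbed` is a hypothetical name, and the sketch assumes a well-formed TMX file whose `tuv` elements carry the language code in `xml:lang` (or in the older `lang` attribute). Language codes are compared exactly after lowercasing, so the codes passed in must match the ones used in the file (for example "en-gb" rather than "en").

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_tabbed(tmx_path, output_path, source_lang, target_lang):
    """Write one 'source<TAB>target' line per translation unit.

    Units lacking either language are skipped. Returns the number of
    lines written.
    """
    source_lang, target_lang = source_lang.lower(), target_lang.lower()
    written = 0
    with open(output_path, "w", encoding="utf-8", newline="\n") as out:
        # iterparse reads the file incrementally instead of loading it whole.
        for _event, elem in ET.iterparse(tmx_path, events=("end",)):
            if elem.tag != "tu":
                continue
            segments = {}
            for tuv in elem.iter("tuv"):
                lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
                seg = tuv.find("seg")
                if seg is not None:
                    # itertext() keeps text inside inline tags; the split/join
                    # collapses tabs and line breaks so each unit stays on one line.
                    segments[lang] = " ".join("".join(seg.itertext()).split())
            if source_lang in segments and target_lang in segments:
                out.write(segments[source_lang] + "\t" + segments[target_lang] + "\n")
                written += 1
            elem.clear()  # release the unit's children once it has been written
    return written
```

A TMX file that is not well-formed XML will make the parser stop with an error that includes the line number, which at least shows where the file needs repair.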

MikeTrans
Germany
Local time: 17:56
Italian to German
+ ...

TOPIC STARTER

@Farkas,

Jun 30, 2012

I've just tried maketabbed.exe with about 40 lines in each UTF-8 txt file.
I'm getting the error:

Undefined subroutine &main::abort called at script/maketabbed.pl line 11, line 3.

Possible cause:
Does maketabbed have to be placed in a special directory? Maybe in "other tools" of LF Aligner?
I had to take some paragraph bullets out of my txt, but there are still some slashes (web page addresses). Can maketabbed handle these, or special characters like \ @ (c) ö etc.?
If not, no problem: I can convert the text with a macro of mine that gets rid of all such characters and restore them afterwards.

In LF Aligner, I think that if I choose "Revert to paragraph segments" in the dialog box, each carriage return should be treated as the end of a segment (which would give me a tabbed text). Or is that unrelated?

Thanks,
Mike

FarkasAndras

Local time: 17:56
English to Hungarian
+ ...

error handling

Jun 30, 2012

The problem is that the input file you specified can't be opened for some reason. Maybe it doesn't exist, or it has accented letters, special characters or spaces in the file name or path.
I coded this tool in 5 minutes and the error handling is... not very robust, which is why you didn't get a more sensible error message.
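
As an illustration of what a more sensible message could look like, here is a small Python sketch that checks an input file before any processing starts. It is not part of maketabbed (a Perl script); `check_input_file` is a made-up helper, and the wording of the messages is only an example.

```python
import os


def check_input_file(path):
    """Return None if the file can be opened, else a human-readable reason."""
    if not os.path.exists(path):
        return f"'{path}' does not exist (check the spelling and the folder)."
    if not os.path.isfile(path):
        return f"'{path}' is a folder, not a file."
    try:
        # Reading a single byte is enough to prove the file can be opened.
        with open(path, "rb") as handle:
            handle.read(1)
    except OSError as error:
        return f"'{path}' could not be opened: {error.strerror}"
    return None
```

A tool would call this for each input path and print the returned reason instead of failing later with an unrelated error.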

MikeTrans
Germany
Local time: 17:56
Italian to German
+ ...

TOPIC STARTER

Opus download: tmx or txt language files

Jun 30, 2012

FarkasAndras wrote:

I use xbench in a similar way, so I get the use case.
I'm not sure how you end up with two separate txt files, though. You can take a tmx file such as the ones from OPUS, the DGT-TM etc. and convert it to a tabbed txt file in one step with a tool from the grab bag on sourceforge.
I don't use the EMEA corpus but I'd assume they also provide either tabbed text or TMX...?

Long ago I downloaded from the OPUS corpora a TMX compilation of the EMEA (European Medicines Agency) corpus for En-De and Fr-De. You can choose TMX, but this requires a lot of editing work to correct line errors that prevent the file from importing into anything. Fortunately, XBench displayed the errors with their line numbers, so I was able to correct the mistakes and get 303.000+ segments for Fr-De. I remember that it took me more than 4 hours, even with intelligent search/replace operations in the TMX.

After this experience, I would strongly recommend downloading the native txt files, which come separately for each language in the so-called "Moses format". These can then easily be aligned with LF Aligner after conversion to UTF-8.

Without LF Aligner, I just wouldn't know what to do with these 2 huge separate files! Splitting them into chunks of 65.000 segments that Excel can handle would still leave me with 40 files (20 per language), and without Unicode support. Not ideal, but I don't see any alternative.
Once you have TMX or tab-delimited text, XBench can import/export such files without problems.
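
The conversion to UTF-8 can be done without loading the whole file into an editor. The sketch below is only an illustration: `convert_to_utf8` is a hypothetical name, and the default source encoding of cp1252 is an assumption that has to be replaced with whatever encoding the files actually use.

```python
def convert_to_utf8(input_path, output_path, source_encoding="cp1252"):
    """Re-encode a text file to UTF-8 line by line.

    source_encoding is an assumption and must match the real encoding of
    the input. Line endings are copied unchanged. Returns the number of
    lines converted.
    """
    lines = 0
    # newline="" switches off newline translation on both sides.
    with open(input_path, encoding=source_encoding, newline="") as src, \
         open(output_path, "w", encoding="utf-8", newline="") as out:
        for line in src:
            out.write(line)
            lines += 1
    return lines
```

If the stated encoding cannot represent a byte in the file, Python raises a UnicodeDecodeError rather than silently writing wrong characters, which is a useful hint that the encoding guess was wrong.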

[EDITED]
maketabbed.exe:
My files are in a path with a long name (> 8 characters); I will try putting them in a directory with a short name directly under C: and shortening the file names as well.
It works now!

Mike

[Edited at 2012-06-30 18:05 GMT]

mikhailo
Local time: 18:56
English to Russian
+ ...

re

May 4, 2015

FarkasAndras

Because different source files require slightly different segmentation, I prefer to segment text with sets of regexes in a text editor. When segmenting, I separate paragraph numbering, bullets etc. from the text but do not remove them, because they are good markers for manual alignment.

Is there any way to keep such files from being segmented (for example, by adding a special extension such as stx(t), for "segmented text")?
And how can I continue working with files created by the aligner (open an existing project)?

Another good idea: add a similarity index to each segment in Excel.
This would make it possible to quickly extract only the good segments.
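
The kind of regex-based pre-segmentation described in this post, where numbering and bullets are split off but kept, might look roughly like this in Python. The pattern is a guess at typical markers ("1.", "1.2.3", "(a)", "b)", "•", "-", "*") and is not the poster's actual regex set; `MARKER` and `split_marker` are illustrative names. False positives are possible: a paragraph starting with "3.5 mg", for example, would lose its "3.5".

```python
import re

# Assumed marker shapes: dotted numbers, short labels closed by ")",
# or a single bullet character, followed by whitespace and then text.
MARKER = re.compile(r"^\s*((?:\d+\.)+\d*|\(?[a-zA-Z0-9]{1,3}\)|[•*\-])\s+(?=\S)")


def split_marker(paragraph):
    """Split a leading numbering or bullet marker from a paragraph.

    Returns [marker, text] when a marker is found, else [text]. The marker
    is kept as its own segment so it can serve as an alignment anchor.
    """
    match = MARKER.match(paragraph)
    if match is None:
        return [paragraph.strip()]
    return [match.group(1), paragraph[match.end():].strip()]
```

Writing each returned item on its own line in both the source and target files gives the aligner (or a manual reviewer) matching anchor lines in the two languages.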

FarkasAndras

Local time: 17:56
English to Hungarian
+ ...

easy

May 4, 2015

mikhailo wrote:

FarkasAndras

Because different source files require slightly different segmentation, I prefer to segment text with sets of regexes in a text editor. When segmenting, I separate paragraph numbering, bullets etc. from the text but do not remove them, because they are good markers for manual alignment.

Is there any way to keep such files from being segmented (for example, by adding a special extension such as stx(t), for "segmented text")?
And how can I continue working with files created by the aligner (open an existing project)?

Another good idea: add a similarity index to each segment in Excel.
This would make it possible to quickly extract only the good segments.

1) Just reject sentence segmenting. You can also disable it in setup IIRC.
2) Depends on what you mean by continue to work with files. Launch other_tools/alignedit.exe to review/edit a tabbed txt.
3) Set 'Remove match confidence value' to n in the setup

mikhailo
Local time: 18:56
English to Russian
+ ...

re

May 5, 2015

FarkasAndras wrote:
1) Just reject sentence segmenting. You can also disable it in setup IIRC.
2) Depends on what you mean by continue to work with files. Launch other_tools/alignedit.exe to review/edit a tabbed txt.
3) Set 'Remove match confidence value' to n in the setup

Thanks a lot for your answers.
