Downloading the Acquis Communautaire
Thread poster: Parrot
| Parrot Spain Local time: 01:25 Spanish to English + ...
Colleagues who often deal with texts of EU Directives and jurisprudence may be interested in downloading the corpora of the Acquis Communautaire as *.tmx files. They come as zipped volumes that can be extracted into the language pairs of interest using the tool made available on the same page: http://langtech.jrc.it/DGT-TM.html Once you have obtained the *.tmx files of interest, they can be converted for use with most CAT tools. Please note the Conditions for Use.
Very useful | | | Maria Amorim (X) Sweden Local time: 01:25 Swedish to Portuguese + ... |
If you're posting about the DGT-TM, we might as well mention the europarl corpus as well. http://www.statmt.org/europarl/ These are autoaligned corpora of the transcripts of EP plenaries. The easiest way to get a TMX out of them is probably the following:
- Download and extract a corpus file pair from the statmt site
- Download the grab bag (1.5) and the aligner package from sourceforge.net/projects/aligner/files
- Use generate_tabbed.exe from the grab bag to make a tabbed txt out of the two files
- Use the tmx maker from the aligner package to generate a tmx
You can also try and shove the files into your preferred aligner instead of using these command-line tools, but that could go badly wrong in a number of ways.
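For readers who would rather script the last two steps, here is a minimal Python sketch of the same idea. It is not the code of generate_tabbed.exe or of the tmx maker; it only assumes that the two corpus files are line-aligned (line N of one file translates line N of the other). The function names and the header values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pair_lines(src_lines, tgt_lines):
    """Pair line-aligned source/target lines, skipping pairs with an empty side."""
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def to_tabbed(pairs):
    """Render pairs as tab-delimited text, one translation unit per line."""
    rows = []
    for src, tgt in pairs:
        # A stray tab inside a segment would break the two-column layout.
        rows.append(src.replace("\t", " ") + "\t" + tgt.replace("\t", " "))
    return "\n".join(rows)


def to_tmx(pairs, src_lang, tgt_lang):
    """Build a small TMX 1.4 document from the pairs and return it as a string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch",        # placeholder values, not a real tool
        "creationtoolversion": "0",
        "segtype": "sentence",
        "o-tmf": "tabbed-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    demo = pair_lines(["Hello\n", "World\n"], ["Hola\n", "Mundo\n"])
    print(to_tabbed(demo))
    print(to_tmx(demo, "en", "es"))
```

For a real corpus pair, the two line lists would come from opening the extracted files as UTF-8 text. Very large files may need to be written out in chunks rather than built in memory as done here.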
Parrot Spain Local time: 01:25 Spanish to English + ... TOPIC STARTER
Now this is what I call return on taxes | | |
Hi, another interesting link with several TMs available in many languages. http://www.globalization-group.com/edge/2010/05/download-translation-memory/

Government Translation Memory
- European Commission (millions of TUs in 22 EU languages)
- EU Constitution (thousands of TUs in 21 EU languages)
- European Parliament (millions of TUs in 11 EU languages)
- Stockholm Parallel Corpora (thousands of TUs in English, Greek, and Chinese)

Localization and Technical Translation Memory
- OpenOffice.org (tens of thousands of TUs in German, English, Spanish, French, Japanese, and Swedish)
- KDE (hundreds of thousands of TUs in 92 languages)
- PHP Manuals (thousands of TUs in 22 languages)
- European Medicines Agency (millions of TUs in 22 EU languages)

Media Translation Memory
- OpenSubtitles.org (millions of TUs in 30 languages)
- SETimes.com (millions of TUs in 9 Southeastern European languages)

Enjoy!! Christophe
"Autoaligned" is a synonym for GIGO | Oct 19, 2011 |
I consider autoaligned corpora or TMX files a waste of our tax money. The alignment is pretty useless.
And you base that on... | Oct 19, 2011 |
Siegfried Armbruster wrote: I consider autoaligned corpora or TMX files a waste of our tax money. The alignment is pretty useless.

Might I ask what data you're basing this on? I'd also love to hear your ideas about alternative solutions for making, say, a corpus of 1 million sentences translated into 27 languages available for the general public - and searchable - for less than it costs to run it through an autoaligner. Autoalignments are remarkably good - but even if they were only mediocre, they are the only option we have for mining this huge dataset.

By the way, the data I'm basing the above statement on is the numerical data from various academic researchers who tested various aligners on real-world texts, comparing the results to a manually prepared perfect alignment. Good aligners consistently produce around 95% correct alignments on mixed texts, and they can easily reach 98% or more if you tell them to automatically discard dubious sentence pairs. Of course they often exceed 99% on good quality input texts even without this "filtering".
http://papers.ldc.upenn.edu/LREC2006/Champollion.pdf
http://www.lrec-conf.org/proceedings/lrec2008/pdf/126_paper.pdf
http://utkl.ff.cuni.cz/~rosen/public/slovko05.pdf
ftp://ontologia.hu/Hunglish/doc/ranlp05.pdf

My own experience backs this result: I use autoaligned texts daily and rarely come across misaligned sentences - but I would never dream of making categorical statements based on anecdotal personal experience, of course.
[Edited at 2011-10-19 17:34 GMT] | |
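To make the "percent correct" figures above concrete, here is a small Python sketch of how an aligner's output might be scored against a hand-made reference alignment, and how discarding low-confidence pairs can raise the share of correct ones. The data, the confidence scores and the 0.5 threshold are made up for illustration; the linked papers may define their metrics differently.

```python
def precision(predicted, gold):
    """Share of predicted sentence pairs that also appear in the gold alignment."""
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


def drop_dubious(scored_pairs, threshold):
    """Keep only the pairs whose (hypothetical) confidence score reaches the threshold."""
    return [pair for pair, score in scored_pairs if score >= threshold]


# Toy data: each pair is (source sentence index, target sentence index).
gold = [(0, 0), (1, 1), (2, 2), (3, 3)]
scored = [((0, 0), 0.9), ((1, 1), 0.8), ((2, 3), 0.2), ((3, 3), 0.7)]

everything = [pair for pair, _ in scored]
filtered = drop_dubious(scored, threshold=0.5)

print(precision(everything, gold))  # 3 of 4 pairs are correct -> 0.75
print(precision(filtered, gold))    # the wrong pair was discarded -> 1.0
```

The trade-off is that filtering also shrinks the corpus, so the threshold decides how much material is given up for cleaner output.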
Parrot Spain Local time: 01:25 Spanish to English + ... TOPIC STARTER Easier said than done | Oct 20, 2011 |
FarkasAndras wrote: The easiest way to get a TMX out of them is probably the following: I'm no good with sourceforge tools on such large, unparsed and somehow distorted files. Fortunately, Christophe's link provides us with the ready-made Europarl *.tmxs. Thanks to everyone! | | |
Parrot wrote: FarkasAndras wrote: The easiest way to get a TMX out of them is probably the following: I'm no good with sourceforge tools on such large, unparsed and somehow distorted files. Fortunately, Christophe's link provides us with the ready-made Europarl *.tmxs. Thanks to everyone!

That's an older release. The ready-made TMXes are based on version 3 of the corpus, which is now already at version 6. Since version 3, many new languages and more recent texts were added, and the quality improved somewhat as well.

The command-line tools I wrote are not what I'd call user friendly, but it's all fairly straightforward once you get started. The source files are large - that's why you can't use Excel or something like that to process them. Other than that, they are nice and neat. It would be fairly easy to write a tool that generates the TMXes completely automatically after you launch it, but it shouldn't be necessary.
My opinion is based on the content of the TMX files | Oct 20, 2011 |
The following screenshots are just more or less random screenshots of parts of one of "autoaligned" TMX files from one of the sources mentioned above. I guess everybody will agree that this alignment is crap. ...
jacana54 (X) Uruguay English to Spanish + ... Thank you, Parrot! | Oct 20, 2011 |
Poor input files | Oct 20, 2011 |
Siegfried Armbruster wrote: The following screenshots are just more or less random screenshots of parts of one of "autoaligned" TMX files from one of the sources mentioned above. I guess everybody will agree that this alignment is crap. From the 365.000 segments in the file, I already deleted > 40.000 and the file is still full of crap. Perhaps my approach might be completely wrong, and I would be really interested how the "experts" use the uncleaned autoaligned TMX files and get something useful out of it.

There are a couple of things going on there. I agree that 40,000 useless TUs out of a total of 365,000 is too much, but that's not due to the alignment or autoaligners as such. It's due to the original input files, which obviously don't have the same content in your screenshots. The GIGO principle applies, of course - if the source files are crap, there is only so much an automated system can do to fix them. The source files in the Europarl corpus and the DGT-TM corpus are very good in my experience, so I wouldn't expect much crap like this in those. Certainly not 10+ percent.

Your excerpts look like they might be from the EMEA corpus, which I'm not familiar with. It looks like they should have done a better job of cleaning the files. There are a couple of things they could have done with automated solutions, from throwing out dodgy source files to throwing out individual dodgy TUs or even cleaning the source files themselves (removing footnotes and purely numerical sections etc.). At the very least, low-quality alignments like this should be made available as tab delimited text as well, which is easier to evaluate, process and clean up than TMX. Either way, one saving grace is that these won't come up as concordance hits often, as there isn't much text in there.
It also wouldn't be too difficult to do some automated post-production on this material: throw out every TU where one language or the other doesn't contain at least one word (3 or more consecutive letter characters), for instance. | | | Parrot Spain Local time: 01:25 Spanish to English + ... TOPIC STARTER Comments FWIW | Oct 20, 2011 |
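The post-production rule suggested in the preceding post (throw out every TU where either language lacks a word of 3 or more consecutive letters) could be sketched in Python roughly as follows. This assumes the material is available as tab-delimited text with exactly two columns per line; the function names are made up for illustration.

```python
import re

# Three or more consecutive letter characters (Unicode-aware, so accented
# letters count; digits and underscores do not).
WORD = re.compile(r"[^\W\d_]{3,}")


def has_word(text):
    """True if the text contains at least one run of 3+ letters."""
    return WORD.search(text) is not None


def clean_tabbed(lines):
    """Keep only well-formed source<TAB>target lines where both sides contain a word."""
    kept = []
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 2:
            continue  # malformed line: not exactly two columns
        source, target = parts
        if has_word(source) and has_word(target):
            kept.append(source + "\t" + target)
    return kept


if __name__ == "__main__":
    sample = [
        "12.5 mg\t12,5 mg\n",                                 # no 3-letter word
        "Take one tablet daily\tTomar un comprimido al día\n",
        "3.1\tSección\n",                                     # source is numeric only
    ]
    for row in clean_tabbed(sample):
        print(row)
```

A filter like this only removes segments with no real text on one side. It does nothing about pairs that both contain words but do not match each other.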
No expert, but on Studio 2011, the Acquis doesn't present problems. Europarl, on the other hand, is a bit of a nightmare. The raw files show distorted characters and need a lot of pre-processing. On the other hand, the old-release *.tmx files from Christophe's processed site just seem to require some legacy conversion before Studio admits them. I used to work with EMEA as a reference, and the texts never quite tallied. I moved out of the field to concentrate on law, so I'm no longer sure about their status; still, I can imagine it was not initially intended for alignment.
Michael Beijer United Kingdom Local time: 00:25 Member (2009) Dutch to English + ... generate_tabbed.exe doesn't seem to be working | Feb 26, 2015 |
Hi András, Any tips on how to get it to work? 2 simple txt files, UTF-8 without signature. I followed all the instructions, but nothing’s happening. Right before the cmd.exe window closes, I briefly see "Can't open…"
[Edited at 2015-02-26 20:59 GMT]
Weird. It doesn't work if I run it from: C:\Users\michaelbeijer\Desktop\PatTR Patent Translation Resource\de-en\pattr\description\1 - Copy (1) But it does work if I run it from: C:\Users\michaelbeijer\Desktop\ Is it the spaces or brackets in the path maybe?
[Edited at 2015-02-26 21:05 GMT]