How to handle large XML file
Thread poster: Peter Sass
Peter Sass
Peter Sass
Germany
Local time: 11:41
Member
English to German
+ ...
Jan 22, 2015

Hi there,

From a client I've received a single large XML file (700,000 words according to Trados Studio) containing a whole website.
Surely this must be split up using an XML splitting programme.

1) Should the splitting preferably be done on the client side, to make sure they can piece it together again from the translated files, or could I do this just as well?

2) Which XML splitting programme (preferably freeware or shareware) would you recommend?

3) Is there any other way to 'shrink' the file?

Thanks for your advice in advance!


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 11:41
Member (2006)
English to Afrikaans
+ ...
Post in the Trados forum Jan 22, 2015

Peter Sass wrote:
From a client I've received a single large XML file (700,000 words according to Trados Studio) containing a whole website. For sure, this must be split up using an XML split programme.


If your CAT tool can handle it, why do you need to split it? I suggest you post this question also in the Trados forum.

That said, if it was me, I would try to split it by section or by page, since it is a web site with (presumably) separate pages. Sorry, I know of no XML splitter (yet... as I would have been googling like crazy and installing a whole range of programs just to try it out).

Are you sure about the word count?


 
Sergei Leshchinsky
Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 12:41
Member (2008)
English to Russian
+ ...
... Jan 22, 2015

Sam, most of today's websites are database-driven, so it is quite difficult to split a solid mass of data. However, any database can be exported to a flat file (ttx, xml), and the topic starter appears to have such an export at hand. I think any CAT tool can handle it today; all you need is a fast PC (an i5 or i7 with an SSD and plenty of RAM, which is a must these days to avoid latency, as TMs are quite large). It will take time and the PC will appear to hang, but it will get there (you can have a coffee or walk the dog in the meantime). Once imported, the content becomes a database again inside the CAT tool and will work much faster than the flat source file. It can also be split into smaller CAT files (there is a corresponding tool for SDL Studio).

[Edited at 2015-01-22 21:21 GMT]


 
Sergei Leshchinsky
Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 12:41
Member (2008)
English to Russian
+ ...
... Jan 22, 2015

Samuel Murray wrote:
That said, if it was me, I would try to split it by section or by page, since it is a web site with (presumably) separate pages. Sorry, I know of no XML splitter (yet... as I would have been googling like crazy and installing a whole range of programs just to try it out).

Are you sure about the word count?

XML is a text file; it can be split into smaller text files even with a DOS command, and then merged back into one file afterwards.

As for the word count, it may turn out to be smaller in the end. I would not judge before having the file at hand.


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 11:41
Member (2006)
English to Afrikaans
+ ...
DOS command won't split XML smartly Jan 22, 2015

Sergei Leshchinsky wrote:
Samuel Murray wrote:
Sorry, I know of no XML splitter...

XML is a text file; it can be split into smaller text files even with a DOS command, and then merged back into one file afterwards.


No, a DOS command might split a piece of translatable text right down the middle (in fact, the DOS commands that I know will happily split a word in two). Or it might split a tag in two, which would cause the CAT tool to misinterpret the tag (or worse: try to fix it). And even if it doesn't split a segment or a tag in two, it might not split nested tags cleanly, which may also affect the way the CAT tool interprets the XML.
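To make the problem concrete, here is a minimal Python sketch of what a fixed-size text split does to XML. The sample markup and the function names are invented for illustration; the structure of the actual export is not known.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a website export (not the real file).
SAMPLE = (
    "<site>"
    "<page id='1'><title>Home</title><body>Welcome to our shop</body></page>"
    "<page id='2'><title>About</title><body>We sell <b>fine</b> tea</body></page>"
    "</site>"
)

def naive_split(text, size):
    """Cut the text every `size` characters, as a plain text splitter would."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def is_well_formed(fragment):
    """Return True if the fragment parses as a standalone XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

chunks = naive_split(SAMPLE, 60)
for number, chunk in enumerate(chunks, 1):
    # Show each chunk and whether an XML parser accepts it on its own.
    print(number, is_well_formed(chunk), repr(chunk))

# Gluing the chunks back together restores the original text exactly,
# but that does not make the individual chunks usable as XML files.
assert "".join(chunks) == SAMPLE
```

In this sample the first cut lands inside the word "shop", and none of the chunks parses on its own, which is the situation a CAT tool's XML filter would be faced with.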


 
Peter Sass
Peter Sass
Germany
Local time: 11:41
Member
English to German
+ ...
TOPIC STARTER
Thanks so far Jan 23, 2015

..for all your comments!
Actually, the problem is that Trados Studio cannot process the file properly because it is simply too big (and yes I do have a proper PC with i5 processor + 8 GB RAM).

From previous website translations I recollect that there would normally be a set of separate translation files that followed the structure of the website.
From what I have found so far, one needs a proper XML splitting programme to preserve this structure (header tags etc.), so I couldn't just split the XML file in a text editor.
I'll see what the client thinks.
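For reference, a structure-preserving split of the kind described here could be sketched in Python roughly as follows. This is only an illustration under assumptions: it assumes the export has one root element whose direct children are the units to distribute (for example one element per page), and the function names are made up. Python's standard ElementTree parser also drops comments, the XML declaration and any DOCTYPE, and may rewrite namespace prefixes, so the merged file would not be byte-identical to the original.

```python
import copy
import xml.etree.ElementTree as ET

def split_by_children(xml_text, per_chunk):
    """Split a document into smaller well-formed documents.

    Each output repeats the root element (tag and attributes) and holds
    up to `per_chunk` of the root's direct children.
    """
    root = ET.fromstring(xml_text)
    children = list(root)
    parts = []
    for start in range(0, len(children), per_chunk):
        # Rebuild an empty copy of the root to wrap this group of children.
        shell = ET.Element(root.tag, dict(root.attrib))
        shell.text = root.text
        shell.extend([copy.deepcopy(c) for c in children[start:start + per_chunk]])
        parts.append(ET.tostring(shell, encoding="unicode"))
    return parts

def merge_chunks(parts):
    """Rebuild one document from chunks produced by split_by_children."""
    first = ET.fromstring(parts[0])
    merged = ET.Element(first.tag, dict(first.attrib))
    merged.text = first.text
    for part in parts:
        # Append every child of every chunk, in order.
        merged.extend(list(ET.fromstring(part)))
    return ET.tostring(merged, encoding="unicode")
```

Because every chunk carries its own copy of the root element, each one is a complete XML document that a CAT tool filter can open, and merging simply collects the children under a single root again.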


 
Sergei Leshchinsky
Sergei Leshchinsky  Identity Verified
Ukraine
Local time: 12:41
Member (2008)
English to Russian
+ ...
... Jan 23, 2015

http://www.hongkiat.com/blog/split-large-xml-for-wordpress/
https://www.npmjs.com/package/xml-splitter
http://www.xponentsoftware.com/XmlSplit.aspx


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 11:41
English to Hungarian
+ ...
Oof Jan 23, 2015

I'd be wary about splitting a huge XML with a random tool off the internet. After you stitch it back together at the end (I presume you plan to do that), it may not be exactly the same as before. The client's software might complain about it. Perhaps you could ask the client to export the site in several reasonably-sized chunks, and note that the alternative option is for you to use xml splitter XXX.
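One way to check for this before returning anything to the client is to split and re-merge the untranslated file and compare the result with the original. A minimal Python sketch, assuming Python 3.8 or later (for `xml.etree.ElementTree.canonicalize`); the helper name `same_xml` is made up for illustration:

```python
import xml.etree.ElementTree as ET

def same_xml(original_text, merged_text):
    """Compare two XML documents after canonicalization (C14N 2.0).

    Canonicalization normalizes attribute order, quote style and empty-element
    syntax, so only differences in actual content and structure remain.
    """
    return ET.canonicalize(original_text) == ET.canonicalize(merged_text)

# Same content, different surface form: treated as equal.
print(same_xml("<a x='1' y='2'><b/></a>", '<a y="2" x="1"><b></b></a>'))

# Different text content: treated as different.
print(same_xml("<a><b>one</b></a>", "<a><b>two</b></a>"))
```

Note that this is a looser test than a byte-for-byte comparison: by default it ignores comments, and it says nothing about the XML declaration or encoding. If the client's system is picky about those, comparing the raw files is the stricter check.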

 
RWS Community
RWS Community
United Kingdom
Local time: 11:41
English
I can recommend... Jan 24, 2015

... this program for splitting XML files: http://www.xponentsoftware.com/XmlSplit.aspx

It is not freeware, but it is inexpensive and very capable. I used it to split the IATE TBX files here: http://multifarious.filkin.com/2014/07/13/what-a-whopper/

I guess 700,000 words is a lot for one file! I've never tried to handle anything that large in the Studio Editor, but I can imagine it would be a fruitless and frustrating exercise.

We also have the Split and Merge tool on the OpenExchange, but you'd have to process the XML first in order to split the SDLXLIFF, and I don't know how much success you'd have handling that, even without opening it in the Editor.

Regards

Paul