How do you speed up your term/phrase search process (for TM, glossary, termbases)?
Thread poster: Alex Aruj
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 14:54
Member (2009)
Dutch to English
+ ...
@Dan: Oct 23, 2014

One more question:

What settings would you suggest here?

[screenshot of the index settings; image not preserved]

To cache or not to cache? I would be storing my index(es) on a separate SSD. I have 2 big SSDs on my laptop and a 1TB hybrid drive. One of the SSDs is reserved for indexes, my CafeTran Total Recall database and a few VMs.

Michael


 
Dan Lucas
Dan Lucas  Identity Verified
United Kingdom
Local time: 14:54
Member (2014)
Japanese to English
Cache text only Oct 23, 2014

Michael Beijer wrote:
One more question: What settings would you suggest here

Michael, I would cache text but not documents. The index covering 130 GB of data that I mentioned earlier is slightly less than 5 GB in size, so very manageable. I've never tried caching the documents, but logically you should end up with an index as large as the data itself plus a few percent on top for the indexing overhead. So you'd use 135 GB to index and cache 130 GB of data. That doesn't sound like a good idea to me unless you want a portable library of your data that you can carry around with you on an external drive.
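
The arithmetic behind that estimate can be written out as a small sketch. The two sizes are the figures quoted in this post ("slightly less than 5" is rounded up to 5); the function names are made up for illustration and have nothing to do with the indexing tool's own settings.

```python
# Sizes in gigabytes, taken from the post above.
DATA_GB = 130.0
TEXT_INDEX_GB = 5.0  # "slightly less than 5", rounded up here


def index_overhead(index_gb: float, data_gb: float) -> float:
    """Return the text-only index size as a fraction of the indexed data."""
    return index_gb / data_gb


def cached_documents_total(index_gb: float, data_gb: float) -> float:
    """Estimate the index size when full documents are cached as well:
    a copy of the data plus the indexing overhead on top."""
    return data_gb + index_gb


overhead = index_overhead(TEXT_INDEX_GB, DATA_GB)
total = cached_documents_total(TEXT_INDEX_GB, DATA_GB)

print(f"text-only index: {TEXT_INDEX_GB:.0f} GB ({overhead:.1%} of the data)")
print(f"index with cached documents: about {total:.0f} GB")
```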

As I deal mostly in English and Japanese, I have never needed to use the accent-sensitive option. Likewise, case sensitivity isn't a big deal for me although I can see it being useful in some contexts. Keep your logging as Summary only, otherwise any useful information gets lost in huge lists of files.

Dan


[Edited at 2014-10-24 07:29 GMT]


 
Alex Aruj
Alex Aruj  Identity Verified
United States
Local time: 07:54
Spanish to English
+ ...
TOPIC STARTER
plenty of functionality, now to review query habits and see what works and what doesn't Oct 24, 2014

port the code to other languages

Why? Just for fun? Other platforms might be a target, but other languages?
Multifultor is platform-dependent, anyway. Any porting would have to be done from scratch.


Yes, I thought it would be fun. Also, I figured there could be some application for an NLP library like the Natural Language Toolkit (NLTK), which comes with methods that can identify named entities and frequent part-of-speech tag combinations. It's a layer of complexity on top of query building, which is normally an intensive, multi-step process.


add more functionality

Please feel free to propose additional features that fit Multifultor's basic concept. Most users never provide any feedback.


I'm having some difficulty dragging my new data source from the main window into the icon palette. I will get back to you on more functionality, but after going through some of the readme, I see it does much more than just crawl over selected domains!

I've spent time learning about natural language processing techniques, and what I can say is that they do offer a helpful view of a document when it comes to showing frequency distributions of terms and n-grams. I think they could be useful in research too: apply the algorithms to larger datasets to model large TMs, label and classify them, and then find some measures of similarity (at a basic level, this could be term overlap or n-gram overlap; there is also something called a skip-gram, which afaik is an n-gram whose words need not be adjacent, for example every other word, possibly skipping over stopwords too).
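
A minimal sketch of those ideas in plain Python, without any NLP library: a frequency distribution of n-grams, k-skip-n-grams, and n-gram overlap measured as Jaccard similarity. The whitespace tokenizer, the function names, and the two sample segments are all assumptions made for illustration; a real setup would probably use a proper tokenizer.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def ngrams(tokens, n):
    """Contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, k):
    """n-grams that may skip up to k tokens in total between their words."""
    grams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i + 1:i + n + k]
        for rest in combinations(window, n - 1):
            grams.append((tokens[i], *rest))
    return grams


def jaccard(a, b):
    """Overlap of two collections: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


seg_a = tokenize("the pump must be switched off")
seg_b = tokenize("the pump must be switched on")

bigram_counts = Counter(ngrams(seg_a, 2) + ngrams(seg_b, 2))
bigram_overlap = jaccard(ngrams(seg_a, 2), ngrams(seg_b, 2))
skips = skipgrams(tokenize("the quick brown fox"), 2, 1)

print(bigram_counts.most_common(3))
print(round(bigram_overlap, 3))
print(skips)
```

With 2-grams and one allowed skip, "the quick brown fox" yields the ordinary bigrams plus pairs such as ("the", "brown") that jump over one word.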

For the time being, I will look into classifying my search queries by mining my browser history, and see what research approaches could be designed to help the user craft a query and extract relevant search hits and usage examples of the text in question.
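
As a rough sketch of that first step, the snippet below pulls search queries out of a list of visited URLs and sorts them into crude buckets. It assumes the history has already been exported as plain URLs and that the search engine puts the query in a `q` parameter; the sample URLs and the three category names are invented for illustration.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def extract_query(url, param="q"):
    """Return the decoded search query from a URL, or None if there is none."""
    values = parse_qs(urlparse(url).query).get(param)
    return values[0] if values else None


def classify(query):
    """Crude query typology: quoted phrase, single term, or several words."""
    if '"' in query:
        return "exact phrase"
    if len(query.split()) == 1:
        return "single term"
    return "multi-word"


# Hypothetical export of visited URLs.
history = [
    "https://www.google.com/search?q=%22total+recall%22+cafetran",
    "https://www.google.com/search?q=skip-gram",
    "https://www.google.com/search?q=tmx+import+h2+database",
    "https://www.example.com/some/page",
]

queries = [q for q in (extract_query(url) for url in history) if q]
counts = Counter(classify(q) for q in queries)

for q in queries:
    print(f"{classify(q):12} {q}")
```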



[Edited at 2014-10-24 17:17 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 14:54
Member (2009)
Dutch to English
+ ...
Thanks Dan! Oct 24, 2014

Dan Lucas wrote:

Michael Beijer wrote:
One more question: What settings would you suggest here

Michael, I would cache text but not documents. The index covering 130 GB of data that I mentioned earlier is slightly less than 5 GB in size, so very manageable. I've never tried caching the documents, but logically you should end up with an index as large as the data itself plus a few percent on top for the indexing overhead. So you'd use 135 GB to index and cache 130 GB of data. That doesn't sound like a good idea to me unless you want a portable library of your data that you can carry around with you on an external drive.

As I deal mostly in English and Japanese, I have never needed to use the accent-sensitive option. Likewise, case sensitivity isn't a big deal for me although I can see it being useful in some contexts. Keep your logging as Summary only, otherwise any useful information gets lost in huge lists of files.

Dan


[Edited at 2014-10-24 07:29 GMT]


Thanks Dan!

Will try these settings.

Michael


 
Meta Arkadia
Meta Arkadia
Local time: 21:54
English to Indonesian
+ ...
SQL query? Oct 25, 2014

Just a suggestion, Alex; I'm not an expert on databases at all. It seems you can use SQL queries to search databases for all your purposes: concordance, n-grams, skip-grams, you name it. One of the latest major CafeTran features is indexed databases. The indexing process takes forever, but after that, regular searches yield results blisteringly fast, even if you still don't have an SSD, like poor me.


searching 2 million segments in zero seconds

Since you can import TMX files in the external database, and have them indexed, I suppose you can speed up your queries considerably from within a CAT tool. I haven't tried SQL though, also because CafeTran at this moment only supports H2 databases, whereas I want to use (AppleScriptable) MySQL, which the developer will implement soon, together with other database flavours.
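
To give an idea of what such a query might look like, here is a small concordance search over a translation memory table. It uses SQLite through Python's built-in `sqlite3` module purely as a stand-in for H2 or MySQL; the table name, column names, and sample segments are invented, and none of this reflects how CafeTran stores or queries its data.

```python
import sqlite3

# In-memory stand-in for an external TM database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments (id INTEGER PRIMARY KEY, source TEXT, target TEXT)"
)
conn.executemany(
    "INSERT INTO segments (source, target) VALUES (?, ?)",
    [
        ("Switch off the pump.", "Schakel de pomp uit."),
        ("The pump must be serviced annually.",
         "De pomp moet jaarlijks worden onderhouden."),
        ("Close the valve.", "Sluit de klep."),
    ],
)


def concordance(conn, phrase):
    """Return every (source, target) pair whose source contains the phrase.

    Note: % and _ inside the phrase are treated as LIKE wildcards here.
    """
    pattern = f"%{phrase}%"
    return conn.execute(
        "SELECT source, target FROM segments WHERE source LIKE ? ORDER BY id",
        (pattern,),
    ).fetchall()


hits = concordance(conn, "pump")
for source, target in hits:
    print(f"{source}  ->  {target}")
```

A substring search with a leading wildcard like this generally has to scan the whole table, so how much an index helps depends on the kind of index the database engine builds.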

FWIW,

Hans


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 14:54
Member (2009)
Dutch to English
+ ...
H2 uses Java SQL Oct 25, 2014

Not sure what this means, but H2 (used by CafeTran in its Total Recall databases) is apparently "Java SQL".

See e.g.: http://www.h2database.com/html/grammar.html + http://www.h2database.com/html/main.html

Michael


 
Meta Arkadia
Meta Arkadia
Local time: 21:54
English to Indonesian
+ ...
Total Recall Oct 25, 2014

Michael Beijer wrote:
used by CafeTran in its Total Recall databases

More often than not, I'm not happy with CT's terminology. Since the introduction of CT, you have been able to use relational databases (H2, MySQL, SQLite, among others) for search. I used it quite often. What is new (very new, like a couple of weeks old) is the possibility to index the database, and the possibility to extract the results of a search of the document against the indexed database (CT calls it "pretranslation") to a TMX file that, once the process finishes, is automatically "connected" to the project. I don't think "Java SQL" or "SQL" has anything to do with it, that is, for the user.

Cheers,

Hans


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 15:54
English to Hungarian
+ ...
SQL Oct 25, 2014

SQL isn't a specific piece of database software; it is basically a database query language. H2, MySQL, SQLite and a bunch of other database engines implement SQL to varying degrees. I'm not sure how much intercompatibility that provides in practice.

 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 14:54
Member (2009)
Dutch to English
+ ...
Come again? Oct 25, 2014

Meta Arkadia wrote:

Michael Beijer wrote:
used by CafeTran in its Total Recall databases

More often than not, I'm not happy with CT's terminology. Since the introduction of CT, you have been able to use relational databases (H2, MySQL, SQLite, among others) for search. I used it quite often. What is new (very new, like a couple of weeks old) is the possibility to index the database, and the possibility to extract the results of a search of the document against the indexed database (CT calls it "pretranslation") to a TMX file that, once the process finishes, is automatically "connected" to the project. I don't think "Java SQL" or "SQL" has anything to do with it, that is, for the user.

Cheers,

Hans


I'm not really sure what you are trying to say, Hans.

Michael


 
Meta Arkadia
Meta Arkadia
Local time: 21:54
English to Indonesian
+ ...
DBMS+ SQL+ some sort of script Oct 26, 2014

Michael Beijer wrote:
I'm not really sure what you are trying to say Hans.


“Total Recall” is a fantasy name for an indexed database. The database functionality in CafeTran has been around forever, it was called Rendezvous Memory Server, or External Database. It supported H2 databases out of the box, and many more (including MySQL, Oracle 10g, HSQLDB 2.0 and Derby Java DB) after some fiddling with *.jar files.

The new part is the indexing (and the “Pretranslation”, but that’s probably not important for Alex - the OP). If the CafeTran developer can create an indexed database, so can we (don’t count on me, though). And I checked several.

I think an indexed database, preferably on SSD, combined with SQL queries and if possible automated versions of those queries, would solve Alex’ problem. For mere mortals, the solution CafeTran offers is more than satisfying.
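
A rough sketch of that combination: an index on the source column, plus a small helper that runs the same exact-match query for a whole batch of terms. SQLite (via Python's `sqlite3`) stands in for whatever engine is actually used; the schema, the index name `idx_source`, and the helper names are hypothetical.

```python
import sqlite3


def build_tm(rows):
    """Create an in-memory TM table with an index on the source column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE segments (source TEXT, target TEXT)")
    conn.executemany("INSERT INTO segments VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_source ON segments (source)")
    return conn


def lookup_many(conn, terms):
    """Automated lookup: run the same exact-match query for each term."""
    results = {}
    for term in terms:
        rows = conn.execute(
            "SELECT target FROM segments WHERE source = ?", (term,)
        ).fetchall()
        results[term] = [row[0] for row in rows]
    return results


def uses_index(conn):
    """Ask SQLite whether its plan for the exact-match query uses the index."""
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT target FROM segments WHERE source = ?",
        ("x",),
    ).fetchall()
    return any("idx_source" in row[-1] for row in plan)


conn = build_tm([("pump", "pomp"), ("valve", "klep"), ("valve", "ventiel")])
found = lookup_many(conn, ["valve", "pump", "gearbox"])
print(found)
print("index used:", uses_index(conn))
```

The term list could just as well come from the current document or from a glossary, which is where the automation would pay off.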

Cheers,

Hans


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 14:54
Member (2009)
Dutch to English
+ ...
(What is) Total Recall (?) Oct 26, 2014

Meta Arkadia wrote:

Michael Beijer wrote:
I'm not really sure what you are trying to say Hans.


“Total Recall” is a fantasy name for an indexed database. The database functionality in CafeTran has been around forever, it was called Rendezvous Memory Server, or External Database. It supported H2 databases out of the box, and many more (including MySQL, Oracle 10g, HSQLDB 2.0 and Derby Java DB) after some fiddling with *.jar files.

The new part is the indexing (and the “Pretranslation”, but that’s probably not important for Alex - the OP). If the CafeTran developer can create an indexed database, so can we (don’t count on me, though). And I checked several.

I think an indexed database, preferably on SSD, combined with SQL queries and if possible automated versions of those queries, would solve Alex’ problem. For mere mortals, the solution CafeTran offers is more than satisfying.

Cheers,

Hans


Hmm, as far as I understand the new feature called "Total Recall", what is new is not that the database is indexed. As far as I understand a database, it already is "indexed", in the sense that a db is a table containing data.

As you mentioned, CafeTran already had external databases, but the new part is that Igor connected these databases to the pre-translation system. Previously, you could import TMXs into an H2 database, and then search for terms and phrases in this db from inside CT. Now, you can also pre-translate an entire document against these databases. That is what Total Recall is. Total Recall analyses the current document and looks for translations for pieces of it in the H2 database.

Michael


[Edited at 2014-10-26 15:32 GMT]


 
Meta Arkadia
Meta Arkadia
Local time: 21:54
English to Indonesian
+ ...
Total Recall (revisited) Oct 26, 2014

Michael Beijer wrote:
Previously, you could import TMXs into an H2 database, and then search for terms and phrases in this db from inside CT. Now, you can also pre-translate an entire document against these databases. That is what Total Recall is. Total Recall analyses the current document and looks for translations for pieces of it in the H2 database.

That's correct. The Menu item only a few weeks ago read "External DB". That was the correct name, and should have been continued. You could and still can connect to a database - locally or elsewhere - and search it.

The three recent changes:
- Locally stored databases will be indexed. No choice. You can't have an "unindexed" DB anymore. Search, of course, is still possible. Thank heavens, because, as with tab-delimited TXT files, there's no fuzziness available in H2 databases.
- You can use the new Recall feature to leverage the DB against your document. CafeTran will then automagically show a TMX file with the matches. Again, no fuzzy matches.
- You can use the resulting TMX file to "pretranslate" your document. With fuzzy matches from within the TMX file.
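
The last two steps can be pictured as a two-stage process: an exact (non-fuzzy) recall that pulls candidate entries out of the big database, followed by fuzzy matching inside the small resulting memory. The sketch below is only an illustration under assumptions of my own: recall by shared word trigrams, and fuzzy scoring with `difflib.SequenceMatcher`. It is not CafeTran's actual algorithm, and the names, threshold, and sample data are made up.

```python
from difflib import SequenceMatcher


def word_ngrams(text, n=3):
    """Set of lowercase word n-grams in a segment."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def recall(document, database, n=3):
    """Stage 1: keep database entries that share an exact word n-gram
    with any segment of the document (no fuzziness involved)."""
    doc_grams = set()
    for segment in document:
        doc_grams |= word_ngrams(segment, n)
    return {src: tgt for src, tgt in database.items()
            if word_ngrams(src, n) & doc_grams}


def pretranslate(segment, small_tm, threshold=0.7):
    """Stage 2: best fuzzy match within the recalled entries, or None."""
    scored = [
        (SequenceMatcher(None, segment.lower(), src.lower()).ratio(), tgt)
        for src, tgt in small_tm.items()
    ]
    if not scored:
        return None
    score, target = max(scored)
    return target if score >= threshold else None


database = {
    "Switch off the pump before servicing.": "Schakel de pomp uit voor onderhoud.",
    "Close the valve.": "Sluit de klep.",
}
document = ["Switch off the pump before cleaning."]

small_tm = recall(document, database)
suggestion = pretranslate(document[0], small_tm)
print(list(small_tm))
print(suggestion)
```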

I already mentioned I'm not always happy with Igor's terminology choices (to put it mildly), and I think the Menu item Total Recall is wrong, because it seems to refer only to the new Recall features, not the basic workings of the database.

Cheers,

Hans

[Edited at 2014-10-26 23:23 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 14:54
Member (2009)
Dutch to English
+ ...
I like the new name. Oct 26, 2014

Meta Arkadia wrote:

Michael Beijer wrote:
Previously, you could import TMXs into an H2 database, and then search for terms and phrases in this db from inside CT. Now, you can also pre-translate an entire document against these databases. That is what Total Recall is. Total Recall analyses the current document and looks for translations for pieces of it in the H2 database.

That's correct. The Menu item only a few weeks ago read "External DB". That was the correct name, and should have been continued. You could and still can connect to a database - locally or elsewhere - and search it.

The three recent changes:
- Locally stored databases will be indexed. No choice. You can't have an "unindexed" DB anymore. Search, of course, is still possible. Thank heavens, because, as with tab-delimited TXT files, there's no fuzziness available in H2 databases.
- You can use the new Recall feature to leverage the DB against your document. CafeTran will then automagically show a TMX file with the matches. Again, no fuzzy matches.
- You can use the resulting TMX file to "pretranslate" your document. With fuzzy matches from within the TMX file.

I already mentioned I'm not always happy with Igor's terminology choices (to put it mildly), and I think the Menu item Total Recall is wrong, because it seems to refer only to the new Recall features, not the basic workings of the database.

Cheers,

Hans

[Edited at 2014-10-26 23:23 GMT]


I actually like the new name up in the menu bar. "Total Recall" indicates that it is a feature to use to recall everything. Kind of a super memory machine. "External database" or "External DB" (the former name) sounded kind of boring and would currently no longer really convey what the new feature can do.

Now that Igor has connected these previously external databases to the AA and pre-translation system, they are not so "external" anymore.

Michael


 
Meta Arkadia
Meta Arkadia
Local time: 21:54
English to Indonesian
+ ...
Pars pro toto Oct 26, 2014

Michael Beijer wrote:
Now that Igor has connected these previously external databases to the AA and pre-translation system, they are not so "external" anymore.

But you can still connect to non-local databases, and I suppose they will not be indexed. And you can search the locally stored ones just like before (but indexed). "Total Recall" in the Menu is a pars pro toto, that's what it is.

Cheers,

Hans


 
Rolf Keller
Rolf Keller
Germany
Local time: 15:54
English to German
Multifultor has become Omni-Lookup Oct 16, 2016

Rolf Keller wrote:

Use the URL in my profile and download multifultor.zip or its Readme.pdf.


I correct myself: Go to www.omni-lookup.de


 