LC7 and 8 - Non responsive processing large text files

Re: LC7 and 8 - Non responsive processing large text files

In 2009 someone on the forum wanted to index Wikipedia by title. A single indexing run over a 23 GB XML file took him 3 days.

I helped with that, and after some optimizations it was down to 30 minutes.

When Roland started this thread I "unearthed" the old code, adapted it (mostly replacing char by byte) and tweaked it a bit.

I now use a subset of the German Wikipedia.

The file is 1.74 GB and contains 395,621 occurrences of <Title>text of title</Title>. I extract those titles together with the byte position where each "record" starts in the original file.

I write that information to an index file, which comes to 23 MB.

I get a throughput of roughly 20,000 "records" or "hits" per second. It takes 20 seconds to gather all 395,621 records including writing out to the index file.
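
For readers who do not use LiveCode, the kind of chunked binary scan described above can be sketched in Python. This is not the LiveCode script from this post, which is not shown here. The function name, the tab-separated index format, the 0-based offsets, and the choice to treat the position of the <Title> tag as the start of the "record" are all assumptions made for illustration.

```python
import re

OPEN_TAG = b"<Title>"
TITLE_RE = re.compile(rb"<Title>(.*?)</Title>", re.DOTALL)


def build_title_index(xml_path, index_path, chunk_size=80_000):
    """Scan xml_path in binary chunks and write 'offset<TAB>title' lines.

    Assumptions (not from the original post): offsets are 0-based and
    point at the '<' of each <Title> tag; titles contain no newlines.
    Returns the number of titles found.
    """
    count = 0
    buffer = b""
    buffer_start = 0  # absolute file offset of buffer[0]
    with open(xml_path, "rb") as src, open(index_path, "wb") as out:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            buffer += chunk
            consumed = 0
            for match in TITLE_RE.finditer(buffer):
                offset = buffer_start + match.start()
                out.write(b"%d\t%s\n" % (offset, match.group(1)))
                count += 1
                consumed = match.end()
            # Keep only what could still become a match in the next chunk:
            # an unfinished <Title>... element, or a partial opening tag.
            tail = buffer[consumed:]
            keep_from = tail.find(OPEN_TAG)
            if keep_from == -1:
                keep_from = max(0, len(tail) - (len(OPEN_TAG) - 1))
            buffer_start += consumed + keep_from
            buffer = tail[keep_from:]
    return count
```

Carrying the unmatched tail over to the next read is what lets a title that straddles two chunks still be found, while the buffer stays small between titles.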

I am using an SSD.

As Richard says, this needs a little tweaking. I found that in LC 8 RC1 roughly 80,000 bytes per file access gives the best performance on my system, a mid-2010 MacBook Pro. In LC 6 it is about 1 MB per file access. (LC 6.7.10 is twice as fast, whereas LC 7.1.3 is about 30% slower.)
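
The best read size has to be found by measurement on each system. A rough way to do that, sketched here in Python rather than LiveCode, is to time one full pass over the file for a few candidate sizes. The function names and the candidate sizes are assumptions for illustration. Operating system file caching can distort repeated runs over the same file, so the numbers are only a rough guide.

```python
import time


def time_chunked_read(path, chunk_size):
    """Read the whole file in binary mode, chunk_size bytes per access.

    Returns (elapsed_seconds, total_bytes_read).
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            total += len(chunk)
    return time.perf_counter() - start, total


def compare_chunk_sizes(path, sizes=(8_000, 80_000, 1_000_000)):
    """Return {chunk_size: elapsed_seconds} for each candidate size."""
    results = {}
    for size in sizes:
        seconds, _ = time_chunked_read(path, size)
        results[size] = seconds
    return results
```

Calling compare_chunk_sizes on the data file and picking the size with the smallest time gives a starting point, and it is worth repeating after an engine or hardware change.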

And every 1000 records, when writing data out, I throw in a "wait 0 milliseconds with messages".

I can even type in a field without problems while indexing is running.
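
The same idea of giving control back at regular intervals can be shown outside LiveCode. The Python sketch below uses asyncio, where "await asyncio.sleep(0)" is only a loose analogue of "wait 0 milliseconds with messages" and not an equivalent. The names write_records, other_work and demo are hypothetical.

```python
import asyncio


async def write_records(records, sink, yield_every=1000):
    """Append each record to sink, yielding control every yield_every records."""
    for n, record in enumerate(records, start=1):
        sink.append(record)
        if n % yield_every == 0:
            # Give other pending tasks (for example UI handling) a turn.
            await asyncio.sleep(0)


async def demo(total=5000):
    """Run the writer next to a second task; return (records, other_task_turns)."""
    sink, turns = [], []

    async def other_work():
        # Stands in for whatever else should stay responsive.
        while len(sink) < total:
            turns.append(len(sink))
            await asyncio.sleep(0)

    await asyncio.gather(write_records(range(total), sink), other_work())
    return len(sink), len(turns)
```

Running asyncio.run(demo()) shows the second task getting turns while the records are still being written. Without the periodic yield it would only run after the writer had finished.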

All of this is done using "binary read"; a simple "read" more than doubles the time needed. Of course, whether binary read is OK for you depends on your data.

So one can definitely process huge data files in LC without problems if one adapts the code to the problem.

Doing this, I discovered that LC 8 does not return "EOF" in the result when attempting to read past the end of the file.
I reported the bug:

2016-04-15 09:35 BST

2016-04-15 11:38 BST

This must be one of the fastest bug fixes on record: 2 hours from reporting to "awaiting merge".

Hats off to Mark Waddingham and the team.

It will be fixed in LC 8 RC2.

Kind regards