When Roland started this thread I "unearthed" the old code, adapted it (mostly replacing char by byte) and tweaked it a bit.
I now use a subset of the German Wikipedia.
The file is 1.74 GB and contains 395,621 occurrences of <Title>text of title</Title>. I extract those "Titles" and the byte offset where each "record" starts in the original file.
I write that information to an index file, which is 23 MB in size.
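The extraction step can be illustrated outside LiveCode. The following is only a rough Python sketch of the idea, not the LiveCode script described here: it reads a binary stream in fixed-size chunks and reports the byte offset and text of each <Title>...</Title> pair, carrying unfinished tags over to the next chunk. The function name `index_titles`, the default chunk size, and the choice to record the offset of the opening tag as the "record" start are all assumptions made for illustration.

```python
import io

OPEN_TAG = b"<Title>"
CLOSE_TAG = b"</Title>"


def index_titles(stream, chunk_size=80_000):
    """Yield (byte_offset, title_bytes) for every <Title>...</Title> found.

    byte_offset is the position of the opening tag in the stream
    (an assumption about where a "record" starts).
    """
    buffer = b""
    base = 0  # stream offset of buffer[0]
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        pos = 0
        while True:
            start = buffer.find(OPEN_TAG, pos)
            if start == -1:
                # Keep a short tail in case an opening tag is split
                # across two chunks.
                keep_from = max(pos, len(buffer) - (len(OPEN_TAG) - 1))
                break
            end = buffer.find(CLOSE_TAG, start + len(OPEN_TAG))
            if end == -1:
                # Title not complete yet: keep it and read more data.
                keep_from = start
                break
            yield base + start, buffer[start + len(OPEN_TAG):end]
            pos = end + len(CLOSE_TAG)
        base += keep_from
        buffer = buffer[keep_from:]


sample = b"xx<Title>A</Title>yyyy<Title>Berlin</Title>z"
for offset, title in index_titles(io.BytesIO(sample), chunk_size=5):
    print(offset, title.decode("utf-8"))
```

The tiny chunk size in the sample is deliberate: it forces tags to straddle chunk boundaries, which is the case that usually goes wrong when reading a large file piece by piece.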
I get a throughput of roughly 20,000 "records" or "hits" per second. It takes 20 seconds to gather all 395,621 records including writing out to the index file.
I am using an SSD.
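As a quick sanity check on those figures, the arithmetic can be spelled out. This Python snippet assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes), which the numbers above do not specify.

```python
records = 395_621
seconds = 20
file_bytes = 1.74e9   # assumes decimal gigabytes
index_bytes = 23e6    # assumes decimal megabytes

records_per_second = records / seconds
scan_mb_per_second = file_bytes / seconds / 1e6
index_bytes_per_record = index_bytes / records

print(f"{records_per_second:,.0f} records per second")
print(f"{scan_mb_per_second:.0f} MB of source file scanned per second")
print(f"{index_bytes_per_record:.0f} bytes of index per record")
```

Under those assumptions this works out to about 19,800 records per second, about 87 MB of source file scanned per second, and about 58 bytes of index per record.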
As Richard says, this needs a little tweaking. I found that in LC 8 RC1 roughly 80,000 bytes per file access gives the best performance on my system, a mid-2010 MacBook Pro. In LC 6 it is about 1 MB per file access. (LC 6.7.10 is twice as fast, whereas LC 7.1.3 is about 30% slower.)
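Finding the best read size is a matter of measuring. A minimal Python sketch of such a measurement follows (again an illustration, not the LiveCode code); the helper names `read_in_chunks` and `time_chunk_sizes` are made up for this example, and a serious comparison would repeat each run several times and account for the operating system's file cache.

```python
import io
import time


def read_in_chunks(stream, chunk_size):
    """Read a binary stream to the end in fixed-size chunks; return bytes read."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += len(chunk)


def time_chunk_sizes(open_stream, sizes):
    """Return {chunk_size: seconds} for one full read per candidate size.

    open_stream is a zero-argument callable that returns a fresh
    binary stream positioned at the start.
    """
    timings = {}
    for size in sizes:
        with open_stream() as stream:
            started = time.perf_counter()
            read_in_chunks(stream, size)
            timings[size] = time.perf_counter() - started
    return timings


# In-memory stand-in for a real file, so the sketch runs anywhere.
data = bytes(5_000_000)
for size, secs in time_chunk_sizes(lambda: io.BytesIO(data), [80_000, 1_000_000]).items():
    print(f"{size:>9} bytes per read: {secs:.4f} s")
```

For a real file, the stream factory could be something like `lambda: open(path, "rb")`, where `path` is whatever file is being indexed.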
Every 1000 records, when writing data out, I throw in a "wait 0 milliseconds with messages".
That way I can even type in a field without problems while indexing is running.
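The pattern of writing in batches and briefly handing control back to the UI can be sketched in Python as well. This is a hedged illustration only: `write_index` and its `pump_events` callback are hypothetical names, with the callback standing in for LiveCode's "wait 0 milliseconds with messages", and the tab-separated output format is an assumption, not the actual index format.

```python
import io


def write_index(entries, out, batch_size=1000, pump_events=lambda: None):
    """Write (offset, title_bytes) pairs to a binary stream in batches.

    After each full batch, pump_events() is called so a UI event loop
    gets a chance to run. Returns the number of entries written.
    """
    batch = []
    written = 0
    for offset, title in entries:
        batch.append(b"%d\t%s\n" % (offset, title))
        if len(batch) >= batch_size:
            out.write(b"".join(batch))
            written += len(batch)
            batch.clear()
            pump_events()  # yield to the UI between batches
    if batch:
        # Flush whatever is left after the last full batch.
        out.write(b"".join(batch))
        written += len(batch)
    return written


out = io.BytesIO()
entries = [(i * 100, b"Title %d" % i) for i in range(2500)]
count = write_index(entries, out, pump_events=lambda: print("UI gets a turn"))
print(count, "entries written")
```

Batching keeps the number of write calls low, while the periodic callback is what keeps the interface responsive during a long run.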
All of this is done using "binary read"; a simple "read" more than doubles the time needed. Of course, whether binary read is OK for you depends on your data.
So one can definitely process huge data files in LC without trouble if one adapts the code to the problem.