Looking for parser for Email (MIME)

RH
Hello all

Something else: I managed to download all my mail from past years in MBOX
format from Gmail. The file is over 38 GB and contains more than 120,000
messages.

There is no way to simply open and read such a large file into memory, at
least not on my computer with limited RAM. Usual text processors also do
not open such large files. LiveCode simply does not read the file, and "it"
remains empty. (There should be an error message in "the result", though.)

But it was possible using "open file <filename> for binary read" and
crawling through the file one message at a time with "read from <file> at
<position> until <string>", calculating the new starting position in each
loop.

Now, having extracted each message (over 120,000, which I then store in a
database), I want to parse each one. Email messages are usually in MIME
format, with single-part or multipart bodies. Parsing the header fields is
not difficult, but there have been some difficulties correctly decoding
other parts with encoded pictures, sound, and so on.

Also, the HTML parts do not display correctly in LiveCode fields when
assigned to the field's htmlText property.

There are numerous different text encodings in different messages.
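For comparison, Python's standard email package shows the shape of the decoding work involved (undoing the transfer encoding, then applying each part's declared charset); a LiveCode parser would need an equivalent of each step. This is an illustrative sketch, not a LiveCode solution:

```python
# Sketch of the decoding work a MIME parser must do, using Python's stdlib
# email package for illustration; a LiveCode library would need equivalents.
from email import policy
from email.parser import BytesParser

def parse_message(raw_bytes):
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue                      # containers only group other parts
        ctype = part.get_content_type()   # e.g. text/plain, image/jpeg
        payload = part.get_payload(decode=True)  # undo base64/quoted-printable
        if ctype.startswith("text/"):
            charset = part.get_content_charset() or "utf-8"
            payload = payload.decode(charset, errors="replace")
        parts.append((ctype, payload))
    return msg["subject"], parts
```

The per-part charset lookup is exactly where the "numerous different text encodings" bite: each text part declares its own charset, and binary parts must stay as bytes.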

My question: has anybody already developed a parser in LiveCode
accomplishing such a task? Otherwise I have to put more time in here and
figure it all out myself... )

I am using Windows 10 and LC 8.0.0 DP 16.

Roland
_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
Roland Huettmann - Babanin GmbH - Switzerland www.babanin.com / roh@babanin.com

Re: Looking for parser for Email (MIME)

Alejandro Tejada
Hi,

Check if WordLib could help you
with this particular task:

http://curryk.com/wordlib.html

Does Unicode handle all the different encodings?
http://livecode.byu.edu/unicode/unicodeInRev.html

By the way, some years ago I posted
a Mailbox reader; check this stack in
case you find something useful:
http://andregarzia.on-rev.com/alejandro/stacks/Mailbox_browser.zip

Have a nice weekend!

Alejandro

Re: Looking for parser for Email (MIME)

Ben Rubinstein
In reply to this post by RH
On 20/03/2016 10:56, Roland Huettmann wrote:
> There is no way of just opening and reading such last file into memory, at
> least not on my computer with limited RAM. Usual text processors also do
> not open such large files. LiveCode simply does not read such file and "it"
> remains empty. (There should be an error message in "the result" though.)

http://quality.livecode.com/show_bug.cgi?id=2772

bah!

Ben


Re: Looking for parser for Email (MIME)

Richard Gaskin
Ben Rubinstein wrote:
> On 20/03/2016 10:56, Roland Huettmann wrote:
>> There is no way of just opening and reading such last file into memory, at
>> least not on my computer with limited RAM. Usual text processors also do
>> not open such large files. LiveCode simply does not read such file and "it"
>> remains empty. (There should be an error message in "the result" though.)
>
> http://quality.livecode.com/show_bug.cgi?id=2772

That's a useful enhancement request, for making sure LC degrades
gracefully in low-memory situations.

But that seems different from what Roland was asking about.  He needs to
work with a 38GB file, beyond the memory address space of LC, and
impractical for many programs.

When faced with a file that large most apps will read it in chunks, and
as of several versions ago LC can do this gracefully:  "seek" and "read
at" were enhanced to allow locations as large as the host file system
permits.  With this we can traverse files far bigger than would be
practical in RAM.
<http://quality.livecode.com/show_bug.cgi?id=11828>

Oddly enough, when I was experimenting with chunked reading I started
with 10 MB chunks, thinking the less I touched the disk the better.  But
I found my routines got faster as I tried smaller reads, all the way
down to about 130k, where the comparison leveled off.  Apparently memcpy
is not without its own overhead.

--
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  [hidden email]                http://www.FourthWorld.com


Re: Looking for parser for Email (MIME)

[-hh]
In reply to this post by RH
Why don't you use a professional eMail client?
For example Thunderbird?

https://backupify.zendesk.com/hc/en-us/articles/203099108-How-to-view-Gmail-Exports-locally-on-your-computer

Re: Looking for parser for Email (MIME)

RH
*Reading very large files*
===================

I ran some tests on my large file.

Using LCS it is not possible to know the file size in advance, since the
file cannot be read into memory in one piece. It is possible, though, to
read a file's properties using a shell command, and one of those
properties is the size in bytes. This is my length: size on
disk: 28'875'927'552 bytes.

It is not a problem to go to any position in the file and read from there -
as long as memory can hold the chunk that is read.

How do we know how much can be read into memory? Is there any function for
this? Is there a size limit for variables?

It is possible to read until eof when approaching the end of the file. So
"read from file fName at 28875927500 until eof" works; 52 bytes are read.

It is not possible to read backwards - which could be a nice way of
reading a file in some special cases. So "read from file fName at eof
until -1000" does not work.

So the only way to read a very large file is to read a chunk of n bytes
(whatever memory allows), process it, then read the next chunk, until the
remaining part of the file is small enough to be read until eof.

*Why not use Thunderbird or other mail clients?*
======================================

To answer the question of why not use another email client such as
Thunderbird or Eudora, or a pure mailbox reader such as "Free MBOX File
Viewer": I have them all. I tried them all.

I love LiveCode. That is the answer. I do not want to just save data. I
want to add annotations the way I like, add tags, print selected lists of
messages, and do calculations and statistics, increasing the usefulness of
years of work. I also use this to write invoices to clients for work done,
listing all the messages for a given task and project. My messages are
also a kind of work report: they give me an idea of what was done, when,
and on what occasion, because they serve as meeting minutes, hold my
notes, etc.

I store a lot of information in messages, even ideas. In many cases it is
my container for data. I also transfer my Skype conversations to email,
along with other data from my phone.

The message format qualifies as a format for any activity and
communication. So why not use it everywhere as a multi-purpose format?

But to organize messages it is not enough to sort and filter them by
keywords. People do not use the Subject field the way it is meant to be
used. Too often the Subject gives no idea of what the messages contain; it
is just "carried over" from another message. That is awful, but
unavoidable. Organizing messages into threads by Subject is also not
useful, because people simply do not take care to keep the Subject field
consistent. So I introduce a second subject field for my own keywords,
which then correctly identify the content of the message (or messages).
LiveCode is a way of doing this, with a database for storage.

So I like to add and manipulate data, which can then be filtered correctly
according to personal needs, without the limitations that other email
readers or clients impose.

My message system shall be more global and unified, including messages
from all kinds of devices and mail programs, even from social media, phone
conversations, etc. And I want to make use of the data, not just store and
back it up.

Just storing is boring. )

The alternative would be something like Python, but we are in LiveCode,
not Python or a lower-level language. And then - why do we use LiveCode?

I want LiveCode to be my workhorse for all such tasks, and to find out
whether it can really fulfill such promises.

Roland

P.S.: Right now I do not know whether we should start using LCB in the
future, or continue with LCS, for tasks which would qualify for a library
in their own right.


Re: Looking for parser for Email (MIME)

Mark Waddingham-2
On 2016-03-22 12:45, Roland Huettmann wrote:
> How to know how much we can read into memory? Is there any function to
> know
> this? Is there a size limit for variables?

LiveCode has a limit of 2Gb characters for strings but that depends on
how much memory a single process can have on your system.

On 32-bit systems, you're generally limited to 768Mb-1Gb contiguous
block of memory (32-bit Windows has an address space of 3Gb for a user
process which also has to include all mapped resources such as
executables and shared libraries; Mac has a user process address space
of 4Gb which also has to include all mapped resources so you can
generally get up to around 1.5Gb contiguous allocated memory block).

On 64-bit systems you should be able to have many 2Gb strings (or
similar) in LiveCode, although obviously how fast they operate will
depend on the amount of physical RAM in the machine, with disk-paged
virtual memory taking up the slack.

> It is not possible to read backwards - which could be a nice way
> reading a
> file in some special cases. So "read from file fName at eof until
> -1000"
> does not work.

Well, reading backwards in that way is equivalent to knowing how long
the file is:

    read ... at -1000 until EOF

is the same as

    read ... at (fileSize - 1000) until EOF

> So, the only way reading very large file is reading a chunk of data of
> n
> bytes (whatever is allowed in memory), processing this, and then
> reading
> the next chunk until the remaining part of the file is small enough to
> be
> read until eof.

For such a large file (38gb) your only solution is to read and parse it
in chunks. MBOX files are a sequence of records, so you need to use a
process which reads in blocks from the file when there is not enough
data left to find the current record boundary - that way you only load
into memory (at any one time) enough of the file to process completely
the next record.

In terms of finding the size of a file in LiveCode you can use 'the
detailed files'.

It is worth pointing out that 'open file' and 'read from file' are
*stream*-based in approach. From memory, the MBOX format is essentially
line-based, so you should be able to write a relatively simple parsing
loop with that in mind:

open file ...
repeat forever
   read from file ... until return
   if the result is not empty then
     exit repeat
   end if
   if *it is a new message boundary* then
     ... finish processing current message ...
     ... start processing new boundary ...
   else
     ... append line to current message ...
   end if
end repeat

Of course, one thing to bear in mind, is that with a 38Gb file you are
never going to fit all of that into memory; so the best approach would
probably be to parse your mail messages and then store them into a
storage scheme which doesn't require everything to appear in memory at
once - e.g. an sqlite db or a more traditional dbms, or even lots of
discrete files in a filesystem in some suitable hierarchy.

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Buffer size (was Looking for parser for Email (MIME))

Richard Gaskin
Mark Waddingham wrote:

> open file ...
> repeat forever
>    read from file ... until return
>    if the result is not empty then
>      exit repeat
>    end if
>    if *it is a new message boundary* then
>      ... finish processing current message ...
>      ... start processing new boundary ...
>    else
>      ... append line to current message ...
>    end if
> end repeat

What is the size of the read buffer used when reading until <char>?

I'm assuming it isn't reading a single char per disk access, probably at
least using the file system's block size, no?

I ask because some months ago I needed to parse a 6GB file and
"read...until CR" was slower than I preferred, so I experimented with a
complicated routine that reads into a buffer of about 128k and then
parses the buffer.

If I can turn up the code it may be mildly interesting, but the main
question it raised for me was:

Given that the engine is probably already doing pretty much the same
thing, would it make sense to consider a readBufferSize global property
which would govern the size of the buffer the engine uses when executing
"read...until <char>"?

In my experiments I was surprised to find that larger buffers (>10MB)
were slower than "read...until <char>", but the sweet spot seemed to be
around 128k.  Presumably this has to do with the overhead of allocating
contiguous memory, and if you have any insights on that it would be
interesting to learn more.

I recognize this sort of thing may seem like mere performance
fetishism, but I believe it has useful application in making LC an
ever better solution for working with large amounts of data.

Pretty much any program will read big files in chunks, and if LC can do
so optimally with all the grace and ease of "read...until <char>" it
makes one more strong set of use cases where choosing LC isn't a
tradeoff but an unquestionable advantage.

--
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  [hidden email]                http://www.FourthWorld.com


Re: Looking for parser for Email (MIME)

RH
In reply to this post by Mark Waddingham-2
Hello Mark,

Thank you for the explanation. It is very helpful.

--- MBOX file format

Yes, as you also suggest, I am already reading the MBOX file in chunks,
which are separated by the string CR & "From " as defined for that file
format.

So, it goes "read from file <filename> at <position> until <string>".

The only drawback is that at the end of the file there is no such string,
so the last record needs to be read another way, but that is possible.
Another way, as you suggest, is reading line by line and checking for that
string value to separate messages. I do not yet know which will be more
efficient in terms of speed. I will be testing.

--- Checking available physical memory (in RAM, not on disk)

A good approach would also be to check the amount of available *physical*
memory. This way one could limit the chunks read into memory, and
processing would be pretty straightforward and fast, also knowing the
limitations of the OS (32-bit, 64-bit, available RAM, etc., all you
suggested).

Is there a function to get the available physical memory in LiveCode? I
could not find one yet.

--- Reading backwards in a file

> Well, reading backwards in that way is equivalent to knowing how long the
> file is:
>     read ... at -1000 until EOF
> is the same as
>     read ... at (fileSize - 1000) until EOF

By reading backwards I meant starting from EOF (or any position) and
having the pointer move backwards, char by char, to some earlier position.
The syntax could be: "read from file <filename> at <position> down to
<position>". But I am not sure there are many use cases for this.

--- Storing large number of messages

You are right about storing the retrieved messages in a database. It is
the best way, and it is what I was preparing to do, as it is obviously the
only solution that makes sense for such large amounts of data. Only then
does it allow all kinds of post-processing the easy way. I will use both
SQLite and, later, a remote database system.

--- The detailed files

I was not aware of "the detailed files" function. Something new I learned;
again, thank you. I checked the dictionary, and it could be much more
explicit about this function: searching for "detailed" only finds the
keyword "detailed", and searching for "detailed files" finds nothing.

But I found something in the forums with a good explanation. Maybe it is
worth writing an enhancement request to document this function in the
LiveCode dictionary.

Cheers to all ), Roland


Re: Looking for parser for Email (MIME)

[-hh]
> Roland H. wrote
> --- The detailed files
>
> I was not aware about the "the detailed files" function. Something new I
> learned. Again thank you. I checked the dictionary. It could be much more
> explicit about such function. With "detailed". It only finds the keyword
> "detailed." Searching for "detailed files" I finds nothing.
>
> But I found something in the Forums with good explanation. Maybe it is
> worth writing an enhancement request to document this function the
> dictionary of LiveCode.

Search for "files"? Very "detailed" there ;-)

Re: Looking for parser for Email (MIME)

Ben Rubinstein
In reply to this post by Richard Gaskin
On 21/03/2016 23:03, Richard Gaskin wrote:

> Ben Rubinstein wrote:
>> On 20/03/2016 10:56, Roland Huettmann wrote:
>>> There is no way of just opening and reading such last file into memory, at
>>> least not on my computer with limited RAM. Usual text processors also do
>>> not open such large files. LiveCode simply does not read such file and "it"
>>> remains empty. (There should be an error message in "the result" though.)
>>
>> http://quality.livecode.com/show_bug.cgi?id=2772
>
> That's a useful enhancement request, for making sure LC degrades gracefully in
> low-memory situations.
>
> But that seems different from what Roland was asking about.

Absolutely - I was responding narrowly to the particular point from Roland's
email which I quoted: namely that LiveCode should return an error message when
it fails to read a file because it doesn't have enough memory, rather than
returning exactly the same results as it would for an empty file... which is
what the cited RQCC report has been complaining about for at least seven years.

Ben




Re: Looking for parser for Email (MIME)

RH
Re: HH:

"Search for "files"? Very "detailed" there ;-)"

Stupid me!

I was searching for "file", "detailed file", and "detailed files", but not
for "files". And "files" is listed when searching for "file", but I would
not have thought that this capability would be found under "files", which
lists the details of all the files in a given folder.

Anyway, I would appreciate a function that returns the same details for
just one selected file (not all files). I will use my own custom function
for that.

Re: Buffer size (was Looking for parser for Email (MIME))

Mark Waddingham-2
In reply to this post by Richard Gaskin
On 2016-03-22 15:24, Richard Gaskin wrote:
> What is the size of the read buffer used when reading until <char>?
>
> I'm assuming it isn't reading a single char per disk access, probably
> at least using the file system's block size, no?

Well, the engine will memory map files if it can (if there is available
address space) so for smaller (sub 1Gb) files they are essentially all
buffered. For larger files, the engine uses the stdio FILE abstraction
so will get buffering from that.

> Given that the engine is probably already doing pretty much the same
> thing, would it make sense to consider a readBufferSize global
> property which would govern the size of the buffer the engine uses
> when executing "read...until <char>"?

Perhaps - the read until routines could potentially be made more
efficient. For some streams, buffering is inappropriate unless
explicitly stated (which isn't an option at the moment). For example,
for serial port streams and process streams you don't want to read any
more than you absolutely need to as the other end can block if you ask
it for more data than it has available. At the moment the engine favours
the 'do not read any more than absolutely necessary' approach as the
serial/file/process stream processing code is the same.

> In my experiments I was surprised to find that larger buffers (>10MB)
> were slower than "read...until <char>", but the sweet spot seemed to
> be around 128k.  Presumably this has to do with the overhead of
> allocating contiguous memory, and if you have any insights on that it
> would be interesting to learn more.

My original reasoning on this was a 'working set' argument. Modern CPUs
heavily rely on various levels of memory cache, access getting more
expensive as the cache is further away from the processor. If you use a
reasonable sized buffer to implement processing in a stream fashion,
then the working set is essentially just that buffer which means less
movement of blocks of memory from physical memory to/from the processors
levels of cache.

However, having chatted to Fraser, he pointed out that Linux tends to
have a file read ahead of 64kb-128kb 'builtin'. This means that the OS
will proactively prefetch the next 64-128kb of data after it has
finished fetching the one you have asked for. The result is that data is
being read from disk by the OS whilst your processing code is running
meaning that things get done quicker. (In contrast, if you have a 10Mb
buffer then you have to wait to read 10Mb before you can do anything
with it, and then do that again when the buffer is empty).

> Pretty much any program will read big files in chunks, and if LC can
> do so optimally with all the grace and ease of "read...until <char>"
> it makes one more strong set of use cases where choosing LC isn't a
> tradeoff but an unquestionable advantage.

If you have the time to submit a report in the QC with a sample stack
measuring the time of a simple 'read until cr' type loop with some data
and comparing it to the more efficient approach you found then it is
something we (or someone else) can do some digging into at some point to
see what we can do to improve its performance.

As I said initially, for smaller files I'd be surprised if we could do
that much since those files will be memory mapped; however, it might be
there are some improvements which could be made for larger (non memory
mappable) files.

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps



Re: Buffer size (was Looking for parser for Email (MIME))

Alex Tweedly
In reply to this post by Richard Gaskin


On 22/03/2016 14:24, Richard Gaskin wrote:

> Given that the engine is probably already doing pretty much the same
> thing, would it make sense to consider a readBufferSize global
> property which would govern the size of the buffer the engine uses
> when executing "read...until <char>"?
>
> In my experiments I was surprised to find that larger buffers (>10MB)
> were slower than "read...until <char>", but the sweet spot seemed to
> be around 128k.  Presumably this has to do with the overhead of
> allocating contiguous memory, and if you have any insights on that it
> would be interesting to learn more.
>
Rather than a settable global property, it may be better to have a
readable global property which suggests an optimal (or near optimal)
size for reading.

Also, I'd point out that it is NOT "read ... until <char>", it is "read
... until <string>" (according to the dictionary - haven't tried it yet).

This means that for reading MBOX format, you could do something like
(untested)

put CR & "From " into tTerminator -- note the space at the end of the string

repeat forever -- !!
   read from file tFilePath until tTerminator
   if it is empty then exit repeat
   put it into tOneMailMessage
   -- and process that whatever way you want
end repeat

-- Alex.


Re: Looking for parser for Email (MIME)

[-hh]
In reply to this post by RH
> Roland H. wrote:
> I was searching for "file", "detailed file", "detailed files". But I was
> not searching for "files". And "files" is listed searching for "file", but
> I would not have thought about that this function would be available in
> "files" listing the details of all the files in the given folder.

The dictionary search has its own logic. I'm also still learning that...
(But it's pretty fast.)

Re: Buffer size (was Looking for parser for Email (MIME))

Richard Gaskin
In reply to this post by Mark Waddingham-2
Very helpful info - thanks!

I'll see if I can dig up my old experiment code and submit a tidy
version with an enhancement request.

My hope was that it might be as simple as "Aha, yes, use a bigger buffer
size!", but few things in life are that simple. :)

--
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  [hidden email]                http://www.FourthWorld.com


Mark Waddingham wrote:

> On 2016-03-22 15:24, Richard Gaskin wrote:
>> What is the size of the read buffer used when reading until <char>?
>>
>> I'm assuming it isn't reading a single char per disk access, probably
>> at least using the file system's block size, no?
>
> Well, the engine will memory map files if it can (if there is available
> address space) so for smaller (sub 1Gb) files they are essentially all
> buffered. For larger files, the engine uses the stdio FILE abstraction
> so will get buffering from that.
>
>> Given that the engine is probably already doing pretty much the same
>> thing, would it make sense to consider a readBufferSize global
>> property which would govern the size of the buffer the engine uses
>> when executing "read...until <char>"?
>
> Perhaps - the read until routines could potentially be made more
> efficient. For some streams, buffering is inappropriate unless
> explicitly stated (which isn't an option at the moment). For example,
> for serial port streams and process streams you don't want to read any
> more than you absolutely need to as the other end can block if you ask
> it for more data than it has available. At the moment the engine favours
> the 'do not read any more than absolutely necessary' approach as the
> serial/file/process stream processing code is the same.
>
>> In my experiments I was surprised to find that larger buffers (>10MB)
>> were slower than "read...until <char>", but the sweet spot seemed to
>> be around 128k.  Presumably this has to do with the overhead of
>> allocating contiguous memory, and if you have any insights on that it
>> would be interesting to learn more.
>
> My original reasoning on this was a 'working set' argument. Modern CPUs
> heavily rely on various levels of memory cache, access getting more
> expensive as the cache is further away from the processor. If you use a
> reasonable sized buffer to implement processing in a stream fashion,
> then the working set is essentially just that buffer which means less
> movement of blocks of memory from physical memory to/from the processors
> levels of cache.
>
> However, having chatted to Fraser, he pointed out that Linux tends to
> have a file read ahead of 64kb-128kb 'builtin'. This means that the OS
> will proactively prefetch the next 64-128kb of data after it has
> finished fetching the one you have asked for. The result is that data is
> being read from disk by the OS whilst your processing code is running
> meaning that things get done quicker. (In contrast, if you have a 10Mb
> buffer then you have to wait to read 10Mb before you can do anything
> with it, and then do that again when the buffer is empty).
>
>> Pretty much any program will read big files in chunks, and if LC can
>> do so optimally with all the grace and ease of "read...until <char>"
>> it makes one more strong set of use cases where choosing LC isn't a
>> tradeoff but an unquestionable advantage.
>
> If you have the time to submit a report in the QC with a sample stack
> measuring the time of a simple 'read until cr' type loop with some data
> and comparing it to the more efficient approach you found then it is
> something we (or someone else) can do some digging into at some point to
> see what we can do to improve its performance.
>
> As I said initially, for smaller files I'd be surprised if we could do
> that much since those files will be memory mapped; however, it might be
> there are some improvements which could be made for larger (non memory
> mappable) files.
>
> Warmest Regards,
>
> Mark.
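Mark's working-set and read-ahead argument can be demonstrated with a small, language-agnostic timing sketch (shown here in Python for convenience; the 5 MB scratch file and the two buffer sizes are arbitrary assumptions, and relative timings will vary by OS and disk cache):

```python
import os
import tempfile
import time

# Make a ~5 MB scratch file to read back (size is an arbitrary assumption).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (5 * 1024 * 1024))
    path = f.name

def read_in_chunks(path, chunk_size):
    """Stream the file in fixed-size chunks; return total bytes read."""
    total = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total

# Compare a small buffer (which cooperates with OS read-ahead) to a
# single large buffer that must be filled before any processing starts.
for chunk_size in (128 * 1024, 5 * 1024 * 1024):
    t0 = time.perf_counter()
    n = read_in_chunks(path, chunk_size)
    print(f"{chunk_size:>8}-byte buffer: {n} bytes in {time.perf_counter() - t0:.4f}s")

os.remove(path)
```

On a warm cache the difference is small; the point is that the streaming loop's working set is just the buffer, which is where the ~128 KB sweet spot comes from.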



Re: Buffer size (was Looking for parser for Email (MIME))

Alejandro Tejada
In reply to this post by Alex Tweedly
Hi All,

Just wondering: using LiveCode, could you split and compress
this 38 GB text file into 10,000 smaller files?
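One way to sketch that splitting step (in Python for illustration; the chunk size and the "<prefix>00000.gz" naming are arbitrary assumptions, and note that fixed-size splits will cut messages mid-stream):

```python
import gzip

def split_and_compress(src_path, out_prefix, chunk_bytes=4 * 1024 * 1024):
    """Stream src_path into numbered gzip pieces of at most chunk_bytes each.

    A sketch only: splitting at fixed byte offsets cuts messages in half,
    which is exactly why the article-gathering logic below is needed --
    a real MBOX splitter would break on "From " separator lines instead.
    """
    count = 0
    with open(src_path, "rb") as src:
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            with gzip.open(f"{out_prefix}{count:05d}.gz", "wb") as out:
                out.write(chunk)
            count += 1
    return count
```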

Some years ago, when I was interested in creating
an Offline Wikipedia Reader (using LiveCode),
I ran into the same problem when gathering all the parts
of an article from compressed files.

A Wikipedia article could start in the middle of one
compressed file and end at the beginning of the next.

The script to gather all parts of an article did this:
1) decompress the file where the article starts;
2) if the end tag of the article is not in the
decompressed data, then
3) decompress the next file, search for the end of the
article, and append it to the previously decompressed data.

This simple algorithm would fail if a really large
Wikipedia article spanned three or more compressed
files, but even today no Wikipedia article is
that large.
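For illustration only, those three steps might be sketched in Python like this (the gzip chunk list, the tag arguments, and the two-file limit are assumptions taken from the description, not Alejandro's actual script):

```python
import gzip

def gather_article(chunk_paths, start_index, start_tag, end_tag):
    """Steps 1-3: decompress the chunk where the article starts; if the
    end tag is not there, decompress the next chunk and append its head."""
    with gzip.open(chunk_paths[start_index], "rt", encoding="utf-8") as f:
        data = f.read()
    begin = data.index(start_tag)
    end = data.find(end_tag, begin)
    if end != -1:  # article fits inside this one chunk
        return data[begin:end + len(end_tag)]
    # Step 3: the article spills over into the next compressed file.
    with gzip.open(chunk_paths[start_index + 1], "rt", encoding="utf-8") as f:
        tail = f.read()
    end = tail.index(end_tag)
    return data[begin:] + tail[:end + len(end_tag)]
```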

Alejandro

Speaking of package managers...

Richard Gaskin
A cautionary tale as we explore package dependency management:


"How one developer just broke Node, Babel and thousands of projects in
11 lines of JavaScript"
http://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/

--
  Richard Gaskin
  Fourth World Systems


Re: Speaking of package managers...

mwieder
On 03/22/2016 09:48 PM, Richard Gaskin wrote:
> A cautionary tale as we explore package dependency management:
>
>
> "How one developer just broke Node, Babel and thousands of projects in
> 11 lines of JavaScript"
> http://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/
>

Well, yes, but this seems like an npm registry problem. If you're going
to allow something silly like "unpublish" after something's already out
in the wild, and then not allow republishing the same version, then
that's just asking for trouble.

--
  Mark Wieder
  [hidden email]


Re: Speaking of package managers...

Monte Goulding-2

> On 23 Mar 2016, at 4:39 PM, Mark Wieder <[hidden email]> wrote:
>
> Well, yes, but this seems like an npm registry problem. If you're going to allow something silly like "unpublish" after something's already out in the wild, and then not allow republishing the same version, then that's just asking for trouble.


I suspect there would need to be some kind of takedown procedure. None of us need LiveCode Ltd. to be on the hook for someone’s copyright infringement.
