Need Help With String Pattern Matching

Gregory Lypny
Hello everyone,

I’ve just come back to LiveCode and I'm a little rusty. I used to do some basic text analysis of files where the lines containing strings of interest were consistent and therefore easy to spot. I am now working on files where the chunk of text that contains the data I want is more ambiguous. I figure I should be using MatchChunk and was wondering if anyone might give me some tips on how to do the following. The chunk that I want to extract will have a certain word or phrase near its start and a certain word or phrase near its end. There may be many such chunks in the document, but the best candidate contains certain other strings. Here’s an example:

The chunk starts with the word *owner* or the phrase *beneficial owner*.

The chunk ends with *all directors* or *less than one percent*.

The chunk contains all of the following:
- At least four or five big numbers, e.g., 234,879
- At least two percentages, e.g., 3.4%, or percentage signs

If you are curious, this would more or less identify an ownership table in a proxy statement filed at the Securities and Exchange Commission. These are archived at the SEC in text and html (in vintages going back to about 1994).

Any tips or examples would be much appreciated.

Regards,

Gregory




_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: Need Help With String Pattern Matching

Quentin Long
Message: 14
Date: Sat, 11 Jun 2016 15:48:00 -0400
From: Gregory Lypny <[hidden email]>
To: LiveCode Discussion List <[hidden email]>
Subject: Need Help With String Pattern Matching
Message-ID: <[hidden email]>
Content-Type: text/plain; charset=utf-8

Hello everyone,

> I used to do some basic text analysis of files where the lines containing strings of interest were consistent and therefore easy to spot. I am now working on files where the chunk of text that contains the data I want is more ambiguous.…

>The chunk starts with the word *owner* or the phrase *beneficial owner*.
>
>The chunk ends with *all directors* or *less than one percent*.
>
>The chunk contains all of the following:
>- At least four or five big numbers, e.g., 234,879
>- At least two percentages, e.g., 3.4%, or percentage signs
MatchChunk uses regular expressions ("regex" for short). I don't claim to be a master of regex, but hopefully the following will be of some help to you.

First off, "owner" or "beneficial owner". That would be like so:

(owner|beneficial owner)

Since that's the start of the chunks you're interested in, you'll put that at the beginning of your regex filter. Next is "all directors" or "less than one percent". That's going to be similar:

(all directors|less than one percent)
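
A note on the notation, sketched in Python's re module as a stand-in (as far as I know LiveCode's matchChunk is built on PCRE, and the two agree on these basics): alternatives between whole phrases are grouped with parentheses, while square brackets define a character class, which matches just one character out of the set. The sample string is made up for illustration.

```python
import re

# Parentheses group alternatives: either whole phrase may match.
start_pat = re.compile(r"(beneficial owner|owner)")
end_pat = re.compile(r"(all directors|less than one percent)")

# Square brackets are a character class: ONE character from the set.
char_class = re.compile(r"[owner|beneficial owner]")

sample = "Name of beneficial owner ... all directors as a group"

start_match = start_pat.search(sample).group(0)   # the whole start phrase
class_match = char_class.search(sample).group(0)  # only a single character
end_match = end_pat.search(sample).group(0)       # the whole end phrase
```

Here the bracketed version matches a single letter, not a phrase, so the grouped form is the one to use for both the start phrases and the end phrases.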

And *that* bit goes at the *end* of your regex filter. In between the start-bit and the end-bit, you have "four or five big numbers", and "percentages" or "percentage signs". "Big number" isn't really a well-defined concept, but here's one way to go for "big numbers":

[0-9][0-9],[0-9][0-9][0-9]

In regex, that bit will match any string that contains *at least* two digits, a comma, and at least three more digits. It'll match XX,XXX (where "X" is any digit at all); it'll match XXX,XXX (because if you can match *two* digits in a row, you can certainly match *three* digits in a row); it'll match XX,XXXX (if you can match 3 in a row, you can match 4 in a row); and so on. Note that this bit *will not* match XXXXX, which is a string of five digits in a row *without* any commas, and it will not match X,XXX, which has only one digit before the comma. As for percentages, this will work for matching a percent sign:

%

And this will work for matching a single digit followed by a percent sign:

[0-9]%
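
To see what the big-number and percentage patterns above actually pick up, here is a small check in Python's re module (a stand-in for LiveCode's regex engine; the sample strings are invented). STRICT_BIG is a hypothetical tighter pattern of my own, not something from the LiveCode documentation, which captures a whole comma-grouped number.

```python
import re

BIG = re.compile(r"[0-9][0-9],[0-9][0-9][0-9]")      # pattern from the post
PCT = re.compile(r"[0-9]%")                          # one digit, then a percent sign
STRICT_BIG = re.compile(r"\b\d{1,3}(?:,\d{3})+\b")   # hypothetical stricter variant

samples = ["234,879", "12,345", "1,234", "12345", "1,234,567"]

# True where the post's pattern finds a match somewhere in the string.
big_hits = {s: BIG.search(s) is not None for s in samples}

# Only the digit right before each percent sign is captured.
pct_hits = PCT.findall("3.4% of shares, up from 12%")

# The stricter variant returns each number in full.
strict_hits = STRICT_BIG.findall("Smith 234,879 3.4% Jones 1,234,567")
```

The simple pattern is enough for a yes/no test or for counting; the stricter one is only worth the trouble if you need the numbers themselves.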

I'm going to assume that you don't know exactly where the "big number"s or "percentage"s will be within the chunks you're interested in, or how many characters will occur in between the bits of interest. If you want your regex filter to ignore what occurs between the bits of interest, this will do the trick:

.*

The period will match any character (except a newline character), and the asterisk is regex for "at least 0 of that thing just previous". So if you want to match Big Number followed by Percentage, this should do the trick:

[0-9][0-9],[0-9][0-9][0-9].*[0-9]%
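
One thing to watch with the period-asterisk spacer: because the period stops at a newline, the combined pattern only matches when the number and the percentage sit on the same line. In PCRE-style engines, putting (?s) at the front of the pattern lets the period cross line breaks. A quick sketch in Python's re module, which I believe behaves the same way as LiveCode on this point (sample text invented):

```python
import re

pattern = r"[0-9][0-9],[0-9][0-9][0-9].*[0-9]%"

same_line = "Smith  234,879  3.4%"
split_lines = "Smith  234,879\n  3.4%"

# Number and percentage on one line: the spacer bridges them.
on_one_line = re.search(pattern, same_line) is not None

# A newline in between: the period refuses to cross it.
across_lines = re.search(pattern, split_lines) is not None

# With the (?s) flag the period matches newlines too.
with_dotall = re.search("(?s)" + pattern, split_lines) is not None
```

Since an ownership table runs over many lines, the (?s) form is probably the one you want when matching a whole chunk.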

If you at least know what order your Big Numbers and Percentages are going to be found in, you can build a regex filter for that sequence by fitting the bits together like Lego bricks, with the period-asterisk "spacer" in between the important bits.
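
If you don't know the order, one alternative is to split the job in two: use a regex only to cut out each candidate chunk from a start phrase to the nearest end phrase, then count big numbers and percent signs inside it. Below is a minimal sketch of that idea in Python's re module, as a stand-in for a matchChunk loop in LiveCode. The function name, the thresholds, and the sample text are all hypothetical, and alternatives are grouped with parentheses.

```python
import re

# Start phrase, shortest possible filler, end phrase.
# (?i) ignores case, (?s) lets the period cross line breaks.
CHUNK = re.compile(
    r"(?is)(?:beneficial owner|owner).*?(?:all directors|less than one percent)"
)
BIG_NUMBER = re.compile(r"\d{2,3},\d{3}")
PERCENT = re.compile(r"%")

def ownership_tables(text, min_numbers=4, min_percents=2):
    """Return candidate chunks with enough big numbers and percent signs."""
    hits = []
    for m in CHUNK.finditer(text):
        chunk = m.group(0)
        enough_numbers = len(BIG_NUMBER.findall(chunk)) >= min_numbers
        enough_percents = len(PERCENT.findall(chunk)) >= min_percents
        if enough_numbers and enough_percents:
            hits.append(chunk)
    return hits

sample = (
    "The owner of the building is unknown. See all directors.\n"
    "Name of beneficial owner   Shares    Percent\n"
    "A. Smith   234,879   3.4%\n"
    "B. Jones   112,004   1.6%\n"
    "C. Wu       98,120   1.4%\n"
    "D. Patel    45,310   *\n"
    "All directors as a group   490,313   7.0%\n"
)
tables = ownership_tables(sample)
```

In the sample, the first "owner ... all directors" stretch has no numbers and is dropped, while the table passes both counts. Note that each chunk stops at the end phrase itself, so the totals on the "All directors" row are cut off; you would need to extend the match to the end of that line to keep them.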
   
"Bewitched" + "Charlie's Angels" - Charlie = "At Arm's Length"
   
Read the webcomic at [ http://www.atarmslength.net ]!
   
If you like "At Arm's Length", support it at [ http://www.patreon.com/DarkwingDude ].


Re: Need Help With String Pattern Matching

Gregory Lypny
Hello Quentin,

Thank you for the tips on string pattern matching. I’m used to Mathematica’s string pattern syntax, which is probably built on regex, but I can see the similarities in your nice examples, particularly the use of alternatives [Joe|Anges]. While Mathematica’s string functions are insanely extensive and their implementation far more powerful than those in LiveCode, they can become arbitrarily slow if used repeatedly in loops, and unfortunately, my procedure requires repeating the functions over tens of thousands of files. That is why I want to build an alternative procedure in LiveCode.

Thanks again,

Gregory

Re: Need Help With String Pattern Matching

Quentin Long
sez Gregory Lypny:

> Thank you for the tips on string pattern matching. I'm
> used to Mathematica's string pattern syntax, which is
> probably built on regex, but I can see the similarities in
> your nice examples, particularly the using of
> alternatives [Joe|Anges]. While Mathematica's string
> functions are insanely extensive and their implementation
> far more powerful that those in LiveCode, they can
> become arbitrarily slow if used repeatedly in loops, and
> unfortunately, my procedure requires repeating the
> functions over tens of thousands of files. That is why I
> want to build an alternative procedure in LiveCode.
Hold it. You're saying you don't want to use regex in *LiveCode* because a more-complex feature is too slow in *Mathematica*? I'm not sure how to connect those dots, myself. Why not give regex in LiveCode a shot anyway?

Re: Need Help With String Pattern Matching

Mike Bonner
If you don't want to write a regex loop, you can also use regex with filter.
I haven't done speed comparisons though.
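
As I understand it, LiveCode's filter command works line by line, keeping the lines that match the pattern. For anyone who wants to try the idea outside LiveCode, a rough Python equivalent looks like this (filter_lines is a made-up helper name and the sample text is invented):

```python
import re

def filter_lines(text, pattern):
    """Keep only the lines in which the regex matches somewhere."""
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]

sample = (
    "Name  Shares  Percent\n"
    "A. Smith 234,879 3.4%\n"
    "Footnotes\n"
    "B. Jones 112,004 1.6%"
)

# Keep lines holding a big number followed later by a percent sign.
rows = filter_lines(sample, r"\d{2,3},\d{3}.*%")
```

This pulls out table rows one line at a time, so it suits row-level checks better than finding where a whole table starts and ends.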
