There should be a "unique" option on sort . . .


There should be a "unique" option on sort . . .

Dr. Hawkins
I've seen many times when this would be useful.

The sort command should be able to specify "unique" as an argument.  This
would be far faster than looping through the data.
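
In hypothetical syntax, purely to illustrate the request (no such option
exists in today's sort command):

   -- hypothetical, for illustration only:
   sort lines of tData ascending text unique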

--
Dr. Richard E. Hawkins, Esq.
(702) 508-8462

Re: There should be a "unique" option on sort . . .

mwieder
> The sort command should be able to specify "unique" as an argument.  This
> would be far faster than looping through the data.

Do you mean as in "remove duplicates while you sort"?

--
-Mark Wieder
 [hidden email]



Re: There should be a "unique" option on sort . . .

Dr. Hawkins
On Sat, Jan 4, 2014 at 1:18 PM, Mark Wieder <[hidden email]> wrote:

> Do you mean as in "remove duplicates while you sort"?
>

Yes, or to take only the first line that matches the sort key when sorting
by something other than the full record.
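
Again hypothetically, for the keyed form ("each" is real LiveCode;
"unique" is the imagined option):

   -- hypothetical: keep only the first line for each distinct key
   sort lines of tData unique by word 1 of each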


--
Dr. Richard E. Hawkins, Esq.
(702) 508-8462

Re: There should be a "unique" option on sort . . .

Peter Haworth
That seems way outside the scope of a sort operation.  For a start, you say
"take the first that matches the sort key" - what if I want the last, or
some other intermediate entry determined by other data criteria?  To
implement this, you'd have to come up with something akin to the SQL GROUP
BY and HAVING keywords in addition to sorting.
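
To make that ambiguity concrete, here is a sketch of a handwritten version,
with item 1 of each line standing in for the sort key: with an array,
keeping the last occurrence is just letting later lines overwrite earlier
ones, and keeping the first means storing only unseen keys.

   -- keeps the LAST line seen for each key (item 1 of each line);
   -- to keep the FIRST instead, store only when the key is absent
   repeat for each line tLine in tData
      put tLine into tA[item 1 of tLine]
   end repeat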

Pete
lcSQL Software <http://www.lcsql.com>



Re: There should be a "unique" option on sort . . .

mwieder
In reply to this post by Dr. Hawkins
> Yes, or to take only the first line that matches the sort key when
> sorting by something other than the full record.

I can see it being slightly useful in certain cases, but it leaves me
feeling a bit queasy. I think it's unsettling enough that the sort
command sorts in place instead of being a function that returns a
sorted copy, and of course it's way too late to change that now. So
deleting items from a dataset while sorting them seems one more step
down that ladder. I do realize that you'd have to specify "unique"
explicitly, but still... if it didn't mess with the original data set
I'd be all over this.
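
Handler parameters are passed by value unless declared with @, so a
handwritten function already gives that non-destructive behavior; a
minimal sketch:

   function sortedLines pData
      -- pData holds a copy, so the caller's variable is untouched
      sort lines of pData
      return pData
   end sortedLines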

--
-Mark Wieder
 [hidden email]



Re: There should be a "unique" option on sort . . .

dunbarxx
I am with Mark.


And in any case it is simple and fast to delete duplicates, if that is what is desired, in a few lines of code.


The idea of singling out one instance seems more like a job for "filter".
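
For reference, filter works by pattern rather than by duplicate detection,
so it only covers the "single instance" case when the keeper can be
expressed as a pattern; a sketch:

   -- keep only the lines starting with "A" (wildcard pattern)
   filter tData with "A*"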


Craig




Re: There should be a "unique" option on sort . . .

Dr. Hawkins
On Sat, Jan 4, 2014 at 8:54 PM, <[hidden email]> wrote:

> And in any case it is simple and fast to delete duplicates, if that is
> what is desired, in a few lines of code.
>
>
> The idea of singling out one instance seems more like a job for "filter".
>
>
I'd agree with that, too.

But "a few lines of code" is almost always going to be slower than
something actually built in.

What looks fastest to me (not tested) is

assuming sorted data:

put empty into oldLn
repeat with i = the number of lines of stuff down to 1
   put line i of stuff into newLn
   if newLn = oldLn then delete line i of stuff
   put newLn into oldLn
end repeat

you could also


unsorted data:

repeat for each line theLn in stuff
    if theLn is not among the lines of newStuff then put theLn & cr after newStuff
end repeat

Each of these, though, does a search on every pass, even though the
information was already at hand while the intrinsic sort ran.

nice would be

repeat for each line theLn in stuff BACKWARDS

which would allow a single pass through, appending only the unique
lines (with sorted data).
--
Dr. Richard E. Hawkins, Esq.
(702) 508-8462

Re: There should be a "unique" option on sort . . .

Dr. Hawkins
And in hindsight, I suppose I could loop through each line going forward on
sorted data--but it will still be slower than an intrinsic.

Re: There should be a "unique" option on sort . . .

dunbarxx
In reply to this post by Dr. Hawkins
Richard.


Did you mean something like this, for sorted data:

repeat with i = the number of lines of stuff down to 2
   if line i of stuff = line (i-1) of stuff then delete line i of stuff
end repeat

Another way would be to make an array, where placing an element of "stuff" into the array would simply overwrite any earlier placement. After that was all over, only one element of each kind would be in the array.
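
A minimal sketch of that array approach (untested, like the others in this
thread):

   repeat for each line tLine in stuff
      put empty into tArray[tLine] -- later duplicates simply overwrite
   end repeat
   put the keys of tArray into stuff
   sort lines of stuff -- the keys come back in no particular order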


Not as fast, for sure, as some native implementation.


Craig




Re: There should be a "unique" option on sort . . .

Peter Haworth
In reply to this post by Dr. Hawkins
Or (assuming a simple line delimited list):

put theList into theNewList
split theNewList by return and return
put the keys of theNewList into theNewList
sort theNewList

Haven't done any performance testing but previous posts have suggested that
array manipulation is usually lightning fast.
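
Why the split works as a de-duper: with return as both the element
delimiter and the key delimiter, each line becomes an array key, and an
array can hold a given key only once. A tiny demonstration:

   put "a" & return & "b" & return & "a" into tList
   split tList by return and return
   put the keys of tList -- "a" and "b": the duplicate "a" has collapsed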


Pete
lcSQL Software <http://www.lcsql.com>



Re: There should be a "unique" option on sort . . .

Geoff Canyon-4
In reply to this post by Dr. Hawkins
   put line 1 of tSortedData is false into lastLine
   repeat for each line L in tSortedData
      if L is lastLine then next repeat
      put L & cr after tUniqueData
      put L into lastLine
   end repeat

Either that or some form of Peter's de-dupe and then sort.
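
One reading of that first put: the expression "line 1 of tSortedData is
false" evaluates to true or false, and whichever it is, the result can
never equal line 1 itself - if line 1 were the string "false", the
expression yields true, and vice versa - so the first line is never
wrongly skipped as a duplicate. A minimal check of that reading:

   put "false" into tSortedData
   put line 1 of tSortedData is false into lastLine
   -- lastLine is now "true", which differs from line 1 ("false"),
   -- so the duplicate test cannot fire on the first line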

Re: There should be a "unique" option on sort . . .

dunbarxx
Or, (untested)


repeat for each line tLine in tData --unsorted
  if tLine is not among the lines of tFinal then put tLine & return after tFinal
end repeat
sort tFinal --if desired


But Peter's "split" approach is the most elegant.


Craig




Re: There should be a "unique" option on sort . . .

Alejandro Tejada
Did you test the handler created by Scott Raney
for his stack "MetaTalk Examples"?

Maybe you could adapt this handler for your purpose.

Jacque posted a variant of this handler
some time ago:

On Mar 07, 2007, Jacqueline Landman Gay wrote:

> The old MetaCard IDE shipped with a very efficient word-counting script
> as one of its examples. Here it is with superfluous stuff removed and
> reformatted as a function:

function wordCount pText
   repeat for each word w in pText
     add 1 to wordCount[w]
   end repeat
   put keys(wordCount) into keyWords
   sort keyWords -- optional
   repeat for each line l in keyWords
     put l & tab & wordCount[l] & return after tResult
   end repeat
   return tResult
end wordCount
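
A quick usage sketch (tab-separated counts, as the handler returns them):

   put wordCount("the cat sat on the mat") into tCounts
   -- tCounts now holds, one word per line:
   -- cat 1
   -- mat 1
   -- on  1
   -- sat 1
   -- the 2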

Have a nice week! :D

Al

Re: There should be a "unique" option on sort . . .

Dr. Hawkins
In reply to this post by dunbarxx
On Sun, Jan 5, 2014 at 9:36 AM, <[hidden email]> wrote:

> Did you mean something like this, for sorted data:
>
> repeat with i = the number of lines of stuff down to 1
>    if line i of stuff = line (i-1) of stuff then delete line i of stuff
> next repeat
>

Yes.  I use the extra variables to avoid the repeated searches of stuff.

In Fortran or C, I'd just assume that the multiple references would be
caught by the compiler's optimizer, but I don't think LiveCode is
even looking at such things . . .


--
Dr. Richard E. Hawkins, Esq.
(702) 508-8462

Re: There should be a "unique" option on sort . . .

Geoff Canyon-4
In reply to this post by Geoff Canyon-4
I ran a brief speed test of the code I posted earlier against using an
array to de-dupe and then sort the keys, and couldn't find a data set that
made the loop code slower. It's not always much faster, but for some data
sets it is.



Re: There should be a "unique" option on sort . . .

Richard Gaskin
Geoff Canyon wrote:

> I ran a brief speed test of the code I posted earlier against using an
> array to de-dupe and then sort the keys, and couldn't find a data set
> that made the loop code slower. It's not always much faster, but for
> some data sets it is.

One of the challenges with this sort of comparative benchmarking is
accounting for scale, and doing so in all relevant dimensions.

In this scenario, two scaling factors come to mind:
- Number of lines in the data set
- Number of duplicates among those lines

The test script below has variables at the top which can be altered to
account for each of those, with an additional variable for number of
iterations (small data sets can be too quick to measure well).

The size of the data set is simply the number of lines it contains (n).
The percentage of duplicates is likely higher when the strings are
shorter, so tMaxLineSize handles that.

The results suggest two things:

1. Larger data sets favor arrays.  At 100k lines this was roughly a
four-fold boost, with no apparent penalty for smaller data sets.

2. A larger percentage of duplicates favors arrays, but only slightly,
far less so than the total size of the data set.


Here are the results, run for data sets of 100, 1,000, 10,000, and
100,000 lines, once each for strings with a maximum of 5 chars, and
again with a maximum of 250 chars:


10 iterations on 100 lines of 5 or fewer chars:
Array: 2 ms (0.2 ms per iteration)
Chunks: 3 ms (0.3 ms per iteration)
Results match - Each list has 91 lines

10 iterations on 100 lines of 250 or fewer chars:
Array: 5 ms (0.5 ms per iteration)
Chunks: 5 ms (0.5 ms per iteration)
Results match - Each list has 100 lines


10 iterations on 1000 lines of 5 or fewer chars:
Array: 13 ms (1.3 ms per iteration)
Chunks: 17 ms (1.7 ms per iteration)
Results match - Each list has 629 lines

10 iterations on 1000 lines of 250 or fewer chars:
Array: 34 ms (3.4 ms per iteration)
Chunks: 36 ms (3.6 ms per iteration)
Results match - Each list has 735 lines


10 iterations on 10000 lines of 5 or fewer chars:
Array: 61 ms (6.1 ms per iteration)
Chunks: 150 ms (15 ms per iteration)
Results match - Each list has 993 lines

10 iterations on 10000 lines of 250 or fewer chars:
Array: 181 ms (18.1 ms per iteration)
Chunks: 414 ms (41.4 ms per iteration)
Results match - Each list has 1493 lines


10 iterations on 100000 lines of 5 or fewer chars:
Array: 492 ms (49.2 ms per iteration)
Chunks: 1674 ms (167.4 ms per iteration)
Results match - Each list has 987 lines

10 iterations on 100000 lines of 250 or fewer chars:
Array: 1514 ms (151.4 ms per iteration)
Chunks: 5502 ms (550.2 ms per iteration)
Results match - Each list has 1494 lines


Here's the code - thrown together in a few minutes, so it may well not be
a complete measure of what we're looking for; if you see ways to improve
it, please feel free to revise:

on mouseUp
    put 100000 into n -- number of lines of data
    put 10 into x -- number of iterations
    put 250 into tMaxLineSize -- Max length of each string
    --
    -- Make n lines of random data:
    put empty into tData
    -- First make one line of data:
    repeat with i = 1 to 256 -- make a random string 256 chars long
       -- in the ASCII range 49 through 121
       put numToChar(random(73)+48) after s
    end repeat
    -- Make n lines from that data, each at most tMaxLineSize chars:
    put 256-tMaxLineSize into tStartMax
    repeat n
       put random(tStartMax) into tStart
       put tStart + random(tMaxLineSize-1) into tEnd
       put char tStart to tEnd of s &cr after tData
    end repeat
    --
    -- Test 1: Array
    put the millisecs into t
    repeat x
       put UniqueListFromArray(tData) into r1
    end repeat
    put the millisecs - t into t1
    --
    -- Test 2: Chunks
    put the millisecs into t
    repeat x
       put UniqueListFromChunks(tData) into r2
    end repeat
    put the millisecs - t into t2
    --
    -- Display results:
    if (r1=r2) then
       put "Results match - Each list has "&\
       the number of lines of r1 & " lines" into tSanityCheck
    else
       put "RESULTS DON'T MATCH" into tSanityCheck
    end if
    put x & " iterations on "& n &" lines of "& tMaxLineSize &\
    " or fewer chars:"&cr\
    &"Array: "& t1 &" ms (" & t1/x &" ms per iteration)"&cr\
    &"Chunks: "& t2 &" ms (" & t2/x &" ms per iteration)"&cr\
    & tSanityCheck
end mouseUp


function UniqueListFromArray pData
    repeat for each line tLine in pData
       add 1 to tA[tLine]
    end repeat
    put the keys of tA into tKeys
    sort lines of tKeys
    return tKeys
end UniqueListFromArray


function UniqueListFromChunks pData
    sort lines of pData
    # put line 1 of pData is false into lastLine -- what does this do?
    put empty into tLastLine
    repeat for each line tLine in pData
       if tLine is tLastLine then next repeat
       put tLine & cr after tResult
       put tLine into tLastLine
    end repeat
    delete last char of tResult
    return tResult
end UniqueListFromChunks


--
  Richard Gaskin
  Fourth World
  LiveCode training and consulting: http://www.fourthworld.com
  Webzine for LiveCode developers: http://www.LiveCodeJournal.com
  Follow me on Twitter:  http://twitter.com/FourthWorldSys


Re: There should be a "unique" option on sort . . .

Peter Haworth
Hi Richard,
Thanks for the speed tests.

Your handler to use an array isn't quite the same as mine. You have a
repeat loop adding 1 to the key to create the array, whereas I used split.
Be interesting to see if that makes any significant difference.

Pete
lcSQL Software

Re: There should be a "unique" option on sort . . .

Richard Gaskin
Peter Haworth wrote:
> Your handler to use an array isn't quite the same as mine. You have a
> repeat loop adding 1 to the key to create the array, whereas I used split.
> Be interesting to see if that makes any significant difference.

Good catch.  The arithmetic operation isn't needed there (I just blindly
copied-and-pasted from some old MetaCard stuff and revised it only to
use my var names), so conceivably it could be slightly faster to omit
the arithmetic and just put empty into that array value.

Or so it would seem....

I couldn't resist trying this, so I just ran another test with two array
functions, one being the original and the other replacing this:

       add 1 to tA[tLine]

...with this:

       put empty into tA[tLine]

Here's the results:

10 iterations on 100000 lines of 5 or fewer chars:
Non-Arithmetic Array: 499 ms (49.9 ms per iteration)
Original Array: 470 ms (47 ms per iteration)
Results match - Each list has 996 lines


I'm guessing this has to do with a certain amount of overhead associated
with strings.

So then I tried another variant, in which it puts a value that has been
coerced to a number into the array slot:

function UniqueListFromArray2 pData
    put 0+0 into y
    repeat for each line tLine in pData
       put y into tA[tLine]
    end repeat
    put the keys of tA into tKeys
    sort lines of tKeys
    return tKeys
end UniqueListFromArray2

And the results:

10 iterations on 100000 lines of 5 or fewer chars:
"Add 1" Array: 479 ms (47.9 ms per iteration)
"Put Y" Array: 530 ms (53 ms per iteration)
Results match - Each list has 993 lines

Hmmm.....

--
  Richard Gaskin
  Fourth World
  LiveCode training and consulting: http://www.fourthworld.com
  Webzine for LiveCode developers: http://www.LiveCodeJournal.com
  Follow me on Twitter:  http://twitter.com/FourthWorldSys


Re: There should be a "unique" option on sort . . .

Alejandro Tejada
Richard Gaskin wrote:
[snip]
> 10 iterations on 100000 lines of 5 or fewer chars:
> "Add 1" Array: 479 ms (47.9 ms per iteration)
> "Put Y" Array: 530 ms (53 ms per iteration)
> Results match - Each list has 993 lines
>
> Hmm.....
Interesting. Really interesting results.
Makes one wonder:

What did Scott Raney know about
string manipulation that we don't?

Oh, wait... ;)

Re: There should be a "unique" option on sort . . .

Peter Haworth
In reply to this post by Richard Gaskin
Hi Richard,
It's not so much the arithmetic operation or putting a value into the array
element, it's the repeat loop itself which I don't think is needed.  It can
be replaced with "split pData by return and return".  That one line
gets rid of the duplicates.  So the function would look like this:

function UniqueListFromArray2 pData
   local tKeys
   split pData by return and return
   put the keys of pData into tKeys
   sort tKeys
   return tKeys
end UniqueListFromArray2
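
A quick usage sketch, assuming a simple line-delimited list:

   put "beta" & return & "alpha" & return & "beta" into tList
   put UniqueListFromArray2(tList) into tUnique
   -- tUnique is now "alpha" & return & "beta"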

I came up with that because the OP mentioned that an engine-embedded way to
de-dup would be faster than any user-written handler, which seems like a
reasonable assumption.

Pretty academic really because it seems like all the solutions work
acceptably fast but nevertheless interesting to compare.

Pete
lcSQL Software <http://www.lcsql.com>

