What is LC's internal text format?


What is LC's internal text format?

Ben Rubinstein via use-livecode
This is something that I've been wondering about for a while.

My unexamined assumption had been that in the 'new' fully unicode LC, text was
held in UTF-8. However when I saved some text strings in binary I got
something like UTF-8 - but not quite. And the recent experiments with offset
suggested that LC at the least is able to distinguish between a string which
is fully represented as single-byte (or perhaps ASCII?). And the reports of
the ingenious investigators using UTF-32 to speed up offsets, and discovering
that offset somehow managed to be case-insensitive in this case, made me
wonder whether after using textEncode(xt, "UTF-32") LC marks the string in
some way to give a clue about how to interpret it as text?

So could someone who is familiar with this bit of the engine enlighten us? In
particular:
- What is the internal format?
- Is it different on different platforms?
- Given that it appears to include a flag to indicate whether it is
single-byte text or not, are there any other attributes?
- Does saving a string in 'binary' file faithfully report the internal format?

TIA,

Ben

_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: What is LC's internal text format?

Monte Goulding via use-livecode
Text strings in LiveCode are native encoded (MacRoman or ISO 8859) where possible, and where you don’t explicitly tell the engine the text is unicode (via textDecode), so that they can follow faster single-byte code paths. If you use textDecode then the engine will first check whether the text can be native encoded, and use native if so; otherwise it will use UTF-16 encoding.

For what it’s worth using `offset` is the wrong thing to do if you have textEncoded your strings into binary data. You want to use `byteOffset` otherwise the engine will convert your data to a string and assume native encoding. This is probably why you are getting some case insensitivity.
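To make that concrete, a hedged sketch (the "-- 3" assumes the little-endian byte layout that "UTF-16LE" pins down):

on mouseUp
   put textEncode("hello", "UTF-16LE") into tData
   -- offset() would first turn tData back into a native string, one
   -- char per byte; byteOffset() searches the bytes directly:
   put byteOffset(textEncode("ell", "UTF-16LE"), tData) -- 3, a byte index
end mouseUp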

I haven’t been following along the offset discussion. I’ll have to take a look to see if there were some speed comparisons between offset and codepointOffset.

Cheers

Monte

> On 13 Nov 2018, at 9:35 am, Ben Rubinstein via use-livecode <[hidden email]> wrote:
>
> [quoted text trimmed]



Re: What is LC's internal text format?

Geoff Canyon via use-livecode
On Mon, Nov 12, 2018 at 3:50 PM Monte Goulding via use-livecode <
[hidden email]> wrote:

> For what it’s worth using `offset` is the wrong thing to do if you have
> textEncoded your strings into binary data. You want to use `byteOffset`
> otherwise the engine will convert your data to a string and assume native
> encoding. This is probably why you are getting some case insensitivity.
>

Unless I'm misunderstanding, this hasn't been my observation. Using offset
on a string that has been textEncode()'d to UTF-32 returns values that are
4 * (the character offset - 1) + 1 -- if it were re-encoded, wouldn't it
return the actual offsets (except when it fails)? Also, 𐀁 encodes to
0x00010001, and routines that convert to UTF-32 and then use offset will find
five instances of that character in the UTF-32 encoding because of improper
boundaries. To see this, run this code:

on mouseUp
   put textencode("𐀁","UTF-32") into X
   put textencode("𐀁𐀁𐀁","UTF-32") into Y
   put offset(X,Y,1)
end mouseUp

That will return 2, meaning that it found the encoding for X starting at
character 2 + 1 = 3 of Y. In other words, it found X using the last half of
the first "𐀁" and the first half of the second "𐀁".
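Spelled out in bytes (assuming "UTF-32" gives the little-endian order on this platform):

-- X = textEncode("𐀁","UTF-32")   -> bytes 01 00 01 00
-- Y = textEncode("𐀁𐀁𐀁","UTF-32") -> bytes 01 00 01 00 01 00 01 00 01 00 01 00
-- Read as one native char per byte, the 4-byte pattern occurs at
-- positions 1, 3, 5, 7 and 9; offset(X, Y, 1) skips char 1, matches at
-- position 3, and reports 3 - 1 = 2.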

Re: What is LC's internal text format?

Mark Waddingham via use-livecode
On 2018-11-13 07:15, Geoff Canyon via use-livecode wrote:

> [quoted text trimmed]

The textEncode function generates binary data which is composed of
bytes. When you use binary data in a text function (which offset is),
the engine uses a compatibility conversion which treats the sequence of
bytes as a sequence of native characters (this preserves what happened
pre-7.0 when strings were only ever native, and as such binary and
string were essentially the same thing).

So if you textEncode a 1 (native) character string as UTF-32, you will
get a four byte string, which will then turn back into a 4 (native)
character string when passed to offset.
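A quick way to watch that conversion happen (sketch; little-endian byte order assumed):

on mouseUp
   put textEncode("a", "UTF-32") into tData
   -- tData is 4 bytes: <97,0,0,0>
   put offset("a", tData) -- 1: the bytes were read as 4 native chars
end mouseUp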

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: What is LC's internal text format?

Geoff Canyon via use-livecode
So then why does put textEncode("a","UTF-32") into X;put chartonum(byte 1
of X) put 97? That implies that "byte" 1 is "a", not 1100001. Likewise, put
textEncode("㍁","UTF-32") into X;put chartonum(byte 1 of X) puts 65.

I've looked in the dictionary and I don't see anything that comes close to
describing this.

gc

On Mon, Nov 12, 2018 at 10:21 PM Mark Waddingham via use-livecode <
[hidden email]> wrote:

> [quoted text trimmed]

Re: What is LC's internal text format?

Mark Waddingham via use-livecode
On 2018-11-13 08:35, Geoff Canyon via use-livecode wrote:
> So then why does put textEncode("a","UTF-32") into X;put
> chartonum(byte 1 of X) put 97?

Because:

   1) textEncode("a", "UTF-32") produces the byte sequence <97,0,0,0>
   2) byte 1 of <97,0,0,0> is <97>
   3) charToNum(<97>) first converts the byte <97> into a native string
which is "a" (as the 97 is the code for 'a' in the native encoding
table), then converts that (native) char to a number -> 97
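The two code paths, side by side (sketch; byte order as above):

put textEncode("a", "UTF-32") into tData
put byteToNum(byte 1 of tData) -- 97: byte -> number directly
put charToNum(byte 1 of tData) -- 97 again, but via byte -> "a" -> 97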

> That implies that "byte" 1 is "a", not 1100001.

1100001 is 97 but printed in base-2.

FWIW, I think you are confusing 'binary string' with 'binary number' -
these are not the same thing.

A 'binary string' (internally the data type is 'Data') is a sequence of
bytes (just as a 'string' is a sequence of
characters/codepoints/codeunits).

A 'binary number' is a number which has been rendered to a string with
base-2.

Bytes are like characters (and codepoints, and codeunits) in that they
are 'abstract' things - they aren't numbers, and have no direct
conversion to them - which is why we have byteToNum, numToByte,
nativeCharToNum, numToNativeChar, codepointToNum and numToCodepoint.
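Each pair round-trips cleanly (sketch):

put byteToNum(numToByte(255))             -- 255
put nativeCharToNum(numToNativeChar(233)) -- 233
put codepointToNum(numToCodepoint(65537)) -- 65537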

The charToNum and numToChar functions are actually deprecated /
considered legacy - as their function (when useUnicode is set to true)
depends on processing unicode text as binary data - which isn't how
unicode works post-7 (indeed, there was no way to fold their behavior
into the new model - hence the deprecation, and replacement with
nativeCharToNum / numToNativeChar).

You'll notice that there is no modern 'charToNum'/'numToChar' - just
'codepointToNum'/'numToCodepoint'. A codepoint is an index into the
(large - 21-bit) Unicode code table; Unicode characters can be composed
of multiple codepoints (e.g. [e,combining-acute] and thus don't have a
'number' per-se.
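For instance (sketch; LC's character chunk is a grapheme cluster):

-- "e" followed by U+0301 COMBINING ACUTE ACCENT
put numToCodepoint(101) & numToCodepoint(769) into tText
put the number of codepoints in tText  -- 2
put the number of characters in tText  -- 1: one visible character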

Warmest Regards,

Mark.


--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: What is LC's internal text format?

Geoff Canyon via use-livecode
I don't *think* I'm confusing binary string/data with binary numbers -- I
was just trying to illustrate that when a Latin Small Letter A (U+0061)
gets encoded, somewhere there is stored (four bytes, one of which is) a
byte 97, i.e. the bit sequence 1100001, unless computers don't work that
way anymore.

What I now see tripping me up is the implicit cast to a character you're
saying that charToNum supports, without the corresponding cast to a number
supported in numToChar -- i.e. this fails:

put textEncode("a","UTF-32") into X;put numtochar(byte 1 of X)

while this works:

put textEncode("a","UTF-32") into X;put numtochar(bytetonum(byte 1 of X))

Thanks for the insight,

Geoff

On Tue, Nov 13, 2018 at 12:03 AM Mark Waddingham via use-livecode <
[hidden email]> wrote:

> [quoted text trimmed]

Re: What is LC's internal text format?

Mark Waddingham via use-livecode
On 2018-11-13 11:06, Geoff Canyon via use-livecode wrote:
> I don't *think* I'm confusing binary string/data with binary numbers -- I
> was just trying to illustrate that when a Latin Small Letter A (U+0061)
> gets encoded, somewhere there is stored (four bytes, one of which is) a
> byte 97, i.e. the bit sequence 1100001, unless computers don't work that
> way anymore.

Yes - a byte is not a number, a char is not a number, a bit sequence is
not a number.

Chars have never been numbers in LC - when LC sees a char it sees a
string, and so when such a thing is used in number context it converts it
to the number it *looks* like, i.e. "1" -> 1, but "a" -> error in number
context (bearing in mind the code for "1" is not 1).

i.e. numToChar(charToNum("1")) + 0 -> 1

The same is true for 'byte' in LC7+ (indeed, prior to that byte was a
synonym for char).

> What I now see tripping me up is the implicit cast to a character
> you're saying that charToNum supports, without the corresponding cast
> to a number supported in numToChar -- i.e. this fails:
>
> put textEncode("a","UTF-32") into X;put numtochar(byte 1 of X)

Right so that shouldn't work - byte 1 of X here is <97> (a byte), bytes
get converted to native chars in string context, so numToChar(byte 1 of X)
-> numToChar(<97> as char) -> numToChar("a"), and "a" is not a number.

You'd get exactly the same result if you did put numToChar(char 1 of "a").

As I said, bytes are not numbers, just as chars are not numbers - bytes do
implicitly convert to (native) chars though - so when you use a binary
string in number context, it gets treated as a numeric string.

Put another way, just as the code for a char is not used in conversion in
number context, the code of a byte is not used either.
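A small sketch of that last point:

put numToByte(49) into tByte   -- the byte <49>
put tByte + 1                  -- 2: <49> -> native char "1" -> number 1
put byteToNum(tByte) + 1       -- 50: explicit byte -> number conversion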

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: What is LC's internal text format?

Ben Rubinstein via use-livecode
I'm grateful for all the information, but _outraged_ that the thread that I
carefully created separate from the offset thread was so quickly hijacked for
the continuing (useful!) detailed discussion on that topic.

From recent contributions on both threads I'm getting some more insights, but
I'd really like to understand clearly what's going on. I do think that I
should have asked this question more broadly: how does the engine represent
values internally?


I believe from what I've read that the engine can distinguish the following
kinds of value:
        - empty
        - array
        - number
        - string
        - binary string

From Monte I get that the internal encoding for 'string' may be MacRoman, ISO
8859 (I thought it would be CP1252), or UTF-16 - presumably with some attribute
to tell the engine which one in each case.

So then my question is whether a 'binary string' is a pure blob, with no clues
as to interpretation; or whether in fact it does have some attributes to
suggest that it might be interpreted as UTF-8, UTF-32 etc?

If there are no such attributes, how does codepointOffset operate when passed
a binary string?

If there are such attributes, how do they get set? Evidently if textEncode is
used, the engine knows that the resulting value is the requested encoding. But
what happens if the program reads a file as 'binary' - presumably the result
is a binary string - how does the engine treat it?

Is there any way at LiveCode script level to detect what a value is, in the
above terms?

And one more question: if a string, or binary string, is saved in a 'binary'
file, are the bytes stored on disk a faithful rendition of the bytes that
composed the value in memory, or an interpretation of some kind?

TIA,

Ben


Re: What is LC's internal text format?

Mark Waddingham via use-livecode
On 2018-11-13 12:43, Ben Rubinstein via use-livecode wrote:
> I'm grateful for all the information, but _outraged_ that the thread
> that I carefully created separate from the offset thread was so
> quickly hijacked for the continuing (useful!) detailed discussion on
> that topic.

The phrase 'attempting to herd cats' springs to mind ;)

> From recent contributions on both threads I'm getting some more
> insights, but I'd really like to understand clearly what's going on. I
> do think that I should have asked this question more broadly: how does
> the engine represent values internally?

The engine uses a number of distinct types 'behind the scenes'. The ones
pertinent to LCS (there are many many more which LCS never sees) are:

   - nothing: a type with a single value (nothing/null)
   - boolean: a type with two values true/false
   - number: a type which can either store a 32-bit integer *or* a double
   - string: a type which can either store a sequence of native (single
byte) codes, or a sequence of unicode (two byte - UTF-16) codes
   - name: a type which stores a string, but uniques the string so that
caseless and exact equality checking is constant time
   - data: a type which stores a sequence of bytes
   - array: a type which stores (using a hashtable) a mapping from
'names' to any other storage value type

The LCS part of the engine then sits on top of these core types,
providing
various conversions depending on context.

All LCS syntax is actually typed - meaning that when you pass a value to
any
piece of LCS syntax, each argument is converted to the type required.

e.g. nativeCharToNum() has signature 'integer nativeCharToNum(string)',
meaning that it expects a string as input and will return a number as
output.

Some syntax is overloaded - meaning that it can act in slightly
different (but always consistent) ways depending on the type of the
arguments.

e.g. & has signatures 'string &(string, string)' and 'data &(data,
data)'.

In simple cases where there is no overload, type conversion occurs
exactly as required:

e.g. In the case of nativeCharToNum() - it has no overload, so it always
expects a string, which means that the input argument will always undergo a
'convert to string' operation.

The convert to string operation operates as follows:

    - nothing -> ""
    - boolean -> "true" or "false"
    - number -> decimal representation of the number, using numberFormat
    - string -> stays the same
    - name -> uses the string the name contains
    - data -> converts to a string using the native encoding
    - array -> converts to empty (a very old semantic which probably does
more harm than good!)
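The data case is the one that bites (sketch):

put textEncode("abc", "UTF-16") into tData -- data: 6 bytes
-- passing data where a string is expected applies the native-encoding
-- conversion, so the interleaved zero bytes become chars too:
put the number of chars in tData -- 6, not 3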

In cases where syntax is overloaded, type conversion generally happens
in syntax-specific sequence in order to preserve consistency:

e.g. In the case of &, it can either take two data arguments, or two
string arguments. In this case,
if both arguments are data, then the result will be data. Otherwise both
arguments will be converted
to strings, and a string returned.
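A sketch of that overload resolution:

put textEncode("ab", "UTF-8") into tA -- data, 2 bytes
put the number of bytes in (tA & tA)  -- 4: data & data -> data
put the number of chars in (tA & "c") -- 3: one side is a string, so both convert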

> From Monte I get that the internal encoding for 'string' may be
> MacRoman, ISO 8859 (I thought it would be CP1252), or UTF16 -
> presumably with some attribute to tell the engine which one in each
> case.

Monte wasn't quite correct - on Mac it is MacRoman or UTF-16, on Windows it
is CP1252 or UTF-16, on Linux it is ISO 8859-1 or UTF-16. There is an
internal flag in a string value which says whether its character sequence
is single-byte (native) or double-byte (UTF-16).

> So then my question is whether a 'binary string' is a pure blob, with
> no clues as to interpretation; or whether in fact it does have some
> attributes to suggest that it might be interpreted as UTF8, UTF132
> etc?

Data (binary string) values are pure blobs - they are sequences of bytes
with no knowledge of where they came from. Indeed, carrying such knowledge
would generally be a bad idea as you wouldn't get repeatable semantics
(i.e. a value from one codepath which is data might have a different effect
in context from one which is fetched from somewhere else).

That being said, the engine does store some flags on values - but purely
for optimization, i.e. to save later work. For example, a string value can
store its (double) numeric value in it - which saves multiple 'convert to
number' operations performed on the same (pointer-wise) string (due to the
copy-on-write nature of values, and the fact that all literals are unique
names, pointer-wise equality of values occurs a great deal).

> If there are no such attributes, how does codepointOffset operate when
> passed a binary string?

codepointOffset has signature 'integer codepointOffset(string)', so when
you pass a binary string (data) value to it, the data value gets converted
to a string by interpreting it as a sequence of bytes in the native
encoding.

> If there are such attributes, how do they get set? Evidently if
> textEncode is used, the engine knows that the resulting value is the
> requested encoding. But what happens if the program reads a file as
> 'binary' - presumable the result is a binary string, how does the
> engine treat it?

There are no attributes of that ilk. When you read a file as binary you
get data (binary
string) values - which means when you pass them to string taking
functions/commands that
data gets interpreted as a sequence of bytes in the native encoding.
This is why you must
always explicitly textEncode/textDecode data values when you know they
are not representing
native encoded text.
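For example (sketch; tPath is a hypothetical path to a UTF-8 file):

put URL ("binfile:" & tPath) into tData   -- data: raw bytes, no decoding
put textDecode(tData, "UTF-8") into tText -- now a string, correctly decoded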

> Is there any way at LiveCode script level to detect what a value is,
> in the above terms?

Yes - the 'is strictly' operators:

   is strictly nothing
   is strictly a boolean
   is strictly an integer - a number which has internal rep 32-bit int
   is strictly a real - a number which has internal rep double
   is strictly a string
   is strictly a binary string
   is strictly an array

It should be noted that 'is strictly' reports only how that value is
stored and not anything based on the value itself. This only really
applies to 'an integer' and 'a real' - you can store an integer in a
double and all LCS arithmetic operators act on doubles.

e.g. (1+2) is strictly an integer -> false
      (1+2) is strictly a real -> true

In contrast, though, *some* syntax will return numbers which are stored
internally as integers:

e.g. nativeCharToNum("a") is strictly an integer -> true

I should point out that what 'is strictly' operators return for any
given context is not stable in the sense that future engine versions
might return different things. e.g. We might optimize arithmetic in the
future (if we can figure out a way to do it without performance
penalty!) so that things which are definitely integers, are stored as
integers (e.g. 1 + 2 in the above).

> And one more question: if a string, or binary string, is saved in a
> 'binary' file, are the bytes stored on disk a faithful rendition of
> the bytes that composed the value in memory, or an interpretation of
> some kind?

What happens when you read or write data or string values to a file
depends on how you opened the file.

If you opened the file for binary (whether reading or writing), when you
read you will get data, when you write string values will be converted
to data via the native encoding (default rule).

If you opened the file for text, then the engine will try and determine
(using a BOM) the existing text encoding of the file. If it can't
determine it (if for example, you are opening a file for write which
doesn't exist), it will assume it is encoded as native.

Otherwise the file will have an explicit encoding associated with it
specified by you - reading from it will interpret the bytes in that
explicit encoding; while writing to it will expect string values which
will be encoded appropriately. In the latter case if you write data
values, they will first be converted to a string (assuming native
encoding) and then written as strings in the file's encoding (i.e.
default type conversion applies).

Essentially you can view files as typed streams - if you opened for
binary, read/write give/take data; if you opened for text, then read/write
give/take strings and default type conversion rules apply.
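A sketch of two ways to keep the bytes faithful (file names are mine):

```livecode
-- binary mode: data in, data out, no conversion
open file "blob.bin" for binary write
write tData to file "blob.bin"
close file "blob.bin"

-- unicode text via binary mode: encode explicitly, then write the bytes
open file "out.txt" for binary write
write textEncode(tString, "UTF-8") to file "out.txt"
close file "out.txt"
```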

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps

_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: What is LC's internal text format?

Pi Digital via use-livecode
There is a quest in World of Warcraft where the objective is actually to herd cats. It can be done, but only one cat at a time. :-)

Bob S


> On Nov 13, 2018, at 05:31 , Mark Waddingham via use-livecode <[hidden email]> wrote:
>
> On 2018-11-13 12:43, Ben Rubinstein via use-livecode wrote:
>> I'm grateful for all the information, but _outraged_ that the thread
>> that I carefully created separate from the offset thread was so
>> quickly hijacked for the continuing (useful!) detailed discussion on
>> that topic.
>
> The phrase 'attempting to herd cats' springs to mind ;)



Re: What is LC's internal text format?

Pi Digital via use-livecode
In reply to this post by Pi Digital via use-livecode
On Tue, Nov 13, 2018 at 3:43 AM Ben Rubinstein via use-livecode <
[hidden email]> wrote:

> I'm grateful for all the information, but _outraged_ that the thread that
> I
> carefully created separate from the offset thread was so quickly hijacked
> for
> the continuing (useful!) detailed discussion on that topic.
>

Nothing I said in this thread has anything to do with optimizing the
allOffsets routines; I only used examples from that discussion because they
illustrate my puzzlement on the exact topic you (in general) raised: how
data types are handled by the engine. I'd generalize the responses, to say
that it seems how the engine stores data and how it presents that data are
not identical in all cases.

Separately, it's interesting to hear that the engine (can) store(s) numeric
values for strings, as an optimization.

The above notwithstanding: sorry I outraged you; I'll exit this thread.

Re: What is LC's internal text format?

Pi Digital via use-livecode
On 2018-11-13 18:21, Geoff Canyon via use-livecode wrote:

> Nothing I said in this thread has anything to do with optimizing the
> allOffsets routines; I only used examples from that discussion because
> they
> illustrate my puzzlement on the exact topic you (in general) raised:
> how
> data types are handled by the engine. I'd generalize the responses, to
> say
> that it seems how the engine stores data and how it presents that data
> are
> not identical in all cases.

The best way to think about it is that the engine stores data pretty
much in the form it is presented with; however, what script sees of
data is in the form it requests. In particular, if data has been through
some operation, or mutated, then there is a good chance it won't be in
the same form it was before.

e.g. put tVar + 1 into tVar

Here tVar could start off as a string, but would end up as a number by
virtue of the fact you've performed an arithmetic operation on it.

> The above notwithstanding: sorry I outraged you; I'll exit this thread.

Obviously I'm not Ben, but I *think* it was 'faux outrage' (well I hope
it was - hence my jocular comment about herding cats!) - so I don't
think there's a reason to exit...

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: What is LC's internal text format?

Pi Digital via use-livecode
In reply to this post by Pi Digital via use-livecode
That's really helpful - and in parts eye-opening - thanks Mark.

I have a few follow-up questions.

Does textEncode _always_ return a binary string? Or, if invoked with "CP1252",
"ISO-8859-1", "MacRoman" or "Native", does it return a string?

 > CodepointOffset has signature 'integer codepointOffset(string)', so when you
 > pass a binary string (data) value to it, the data value gets converted to a
 > string by interpreting it as a sequence of bytes in the native encoding.

OK - so one message I take is that in fact one should never invoke
codepointOffset on a binary string. Should it actually throw an error in this
case?

By the same token, probably one should only use 'byte', 'byteOffset',
'byteToNum' etc with binary strings - would it be better, to avoid confusion,
if char, offset, charToNum should refuse to operate on a binary string?

> e.g. In the case of &, it can either take two data arguments, or two
> string arguments. In this case, if both arguments are data, then the result
> will be data. Otherwise both arguments will be converted to strings, and a
> string returned.
The second message I take is that one needs to be very careful, if operating
on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by
concatenating with a simple quoted string, as this may cause it to be silently
converted to a non-binary string. (I presume that 'put "simple string"
after/before pBinaryString' will cause a conversion in the same way as "&"?
What about 'put "!" into char x of pBinaryString?)

The engine can tell whether a string is 'native' or UTF16. When the engine is
converting a binary string to 'string', does it always interpret the source as
the native 8-bit encoding, or does it have some heuristic to decide whether it
would be more plausible to interpret the source as UTF16?

Thanks again for all the detail!

Ben

On 13/11/2018 13:31, Mark Waddingham via use-livecode wrote:

> On 2018-11-13 12:43, Ben Rubinstein via use-livecode wrote:
>> I'm grateful for all the information, but _outraged_ that the thread
>> that I carefully created separate from the offset thread was so
>> quickly hijacked for the continuing (useful!) detailed discussion on
>> that topic.
>
> The phrase 'attempting to herd cats' springs to mind ;)
>
>> From recent contributions on both threads I'm getting some more
>> insights, but I'd really like to understand clearly what's going on. I
>> do think that I should have asked this question more broadly: how does
>> the engine represent values internally?
>
> The engine uses a number of distinct types 'behind the scenes'. The ones
> pertinent to LCS (there are many many more which LCS never sees) are:
>
>    - nothing: a type with a single value nothing/null)
>    - boolean: a type with two values true/false
>    - number: a type which can either store a 32-bit integer *or* a double
>    - string: a type which can either store a sequence of native (single byte)
> codes, or a sequence of unicode (two byte - UTF-16) codes
>    - name: a type which stores a string, but uniques the string so that
> caseless and exact equality checking is constant time
>    - data: a type which stores a sequence of bytes
>    - array: a type which stores (using a hashtable) a mapping from 'names' to
> any other storage value type
>
> The LCS part of the engine then sits on top of these core types, providing
> various conversions depending on context.
>
> All LCS syntax is actually typed - meaning that when you pass a value to any
> piece of LCS syntax, each argument is converted to the type required.
>
> e.g. charToNativeNum() has signature 'integer charToNativeNum(string)' meaning
> that it
> expects a string as input and will return a number as output.
>
> Some syntax is overloaded - meaning that it can act in slightly different (but
> always consistent) ways depending on the type of the arguments.
>
> e.g. & has signatures 'string &(string, string)' and 'data &(data, data)'.
>
> In simple cases where there is no overload, type conversion occurs exactly as
> required:
>
> e.g. In the case of charToNativeNum() - it has no overload, so always expects
> a string
> which means that the input argument will always undergo a 'convert to string'
> operation.
>
> The convert to string operation operates as follows:
>
>     - nothing -> ""
>     - boolean -> "true" or "false"
>     - number -> decimal representation of the number, using numberFormat
>     - string -> stays the same
>     - name -> uses the string the name contains
>     - data -> converts to a string using the native encoding
>     - array -> converts to empty (a very old semantic which probably does more
> harm than good!)
>
> In cases where syntax is overloaded, type conversion generally happens in
> syntax-specific sequence in order to preserve consistency:
>
> e.g. In the case of &, it can either take two data arguments, or two string
> arguments. In this case,
> if both arguments are data, then the result will be data. Otherwise both
> arguments will be converted
> to strings, and a string returned.
>
>> From Monte I get that the internal encoding for 'string' may be
>> MacRoman, ISO 8859 (I thought it would be CP1252), or UTF16 -
>> presumably with some attribute to tell the engine which one in each
>> case.
>
> Monte wasn't quite correct - on Mac it is MacRoman or UTF-16, on Windows it
> is CP1252 or UTF-16, on Linux it is ISO 8859-1 or UTF-16. There is an internal
> flag in a string value which says whether its character sequence is
> single-byte (native)
> or double-byte (UTF-16).
>
>> So then my question is whether a 'binary string' is a pure blob, with
>> no clues as to interpretation; or whether in fact it does have some
>> attributes to suggest that it might be interpreted as UTF8, UTF-32
>> etc?
>
> Data (binary string) values are pure blobs - they are sequences of bytes - it has
> no knowledge of where it came from. Indeed, that would generally be a bad idea
> as you
> wouldn't get repeatable semantics (i.e. a value from one codepath which is
> data, might
> have a different effect in context from one which is fetched from somewhere
> else).
>
> That being said, the engine does store some flags on values - but purely for
> optimization.
> i.e. To save later work. For example, a string value can store its (double)
> numeric value in
> it - which saves multiple 'convert to number' operations performed on the same
> (pointer wise) string (due to the copy-on-write nature of values, and the fact
> that all literals are unique names, pointer-wise equality of values occurs a
> great deal).
>
>> If there are no such attributes, how does codepointOffset operate when
>> passed a binary string?
>
> CodepointOffset has signature 'integer codepointOffset(string)', so when you
> pass a binary string (data) value to it, the data value gets converted to a
> string
> by interpreting it as a sequence of bytes in the native encoding.
>
>> If there are such attributes, how do they get set? Evidently if
>> textEncode is used, the engine knows that the resulting value is the
>> requested encoding. But what happens if the program reads a file as
>> 'binary' - presumably the result is a binary string, how does the
>> engine treat it?
>
> There are no attributes of that ilk. When you read a file as binary you get
> data (binary
> string) values - which means when you pass them to string taking
> functions/commands that
> data gets interpreted as a sequence of bytes in the native encoding. This is
> why you must
> always explicitly textEncode/textDecode data values when you know they are not
> representing
> native encoded text.
>
>> Is there any way at LiveCode script level to detect what a value is,
>> in the above terms?
>
> Yes - the 'is strictly' operators:
>
>    is strictly nothing
>    is strictly a boolean
>    is strictly an integer - a number which has internal rep 32-bit int
>    is strictly a real - a number which has internal rep double
>    is strictly a string
>    is strictly a binary string
>    is strictly an array
>
> It should be noted that 'is strictly' reports only how that value is stored
> and not anything based on the value itself. This only really applies to 'an
> integer' and 'a real' - you can store an integer in a double and all LCS
> arithmetic operators act on doubles.
>
> e.g. (1+2) is strictly an integer -> false
>       (1+2) is strictly a real -> true
>
> In contrast, though, *some* syntax will return numbers which are stored
> internally as integers:
>
> e.g. nativeCharToNum("a") is strictly an integer -> true
>
> I should point out that what 'is strictly' operators return for any given
> context is not stable in the sense that future engine versions might return
> different things. e.g. We might optimize arithmetic in the future (if we can
> figure out a way to do it without performance penalty!) so that things which
> are definitely integers, are stored as integers (e.g. 1 + 2 in the above).
>
>> And one more question: if a string, or binary string, is saved in a
>> 'binary' file, are the bytes stored on disk a faithful rendition of
>> the bytes that composed the value in memory, or an interpretation of
>> some kind?
>
> What happens when you read or write data or string values to a file depends on
> how you opened the file.
>
> If you opened the file for binary (whether reading or writing), when you read
> you will get data, when you write string values will be converted to data via
> the native encoding (default rule).
>
> If you opened the file for text, then the engine will try and determine (using
> a BOM) the existing text encoding of the file. If it can't determine it (if
> for example, you are opening a file for write which doesn't exist), it will
> assume it is encoded as native.
>
> Otherwise the file will have an explicit encoding associated with it specified
> by you - reading from it will interpret the bytes in that explicit encoding;
> while writing to it will expect string values which will be encoded
> appropriately. In the latter case if you write data values, they will first be
> converted to a string (assuming native encoding) and then written as strings
> in the file's encoding (i.e. default type conversion applies).
>
> Essentially you can view files as typed streams - if you opened for binary
> read/write give/take data; if you opened for text then read/write give/take
> strings and default type conversion rules apply.
>
> Warmest Regards,
>
> Mark.
>


Re: What is LC's internal text format?

Pi Digital via use-livecode
In reply to this post by Pi Digital via use-livecode

> On Nov 13, 2018, at 2:52 AM, Mark Waddingham via use-livecode <[hidden email]> wrote:
>
> Yes - a byte is not a number, a char is not a number, a bit sequence is not a number.

Reminds of a clever sig line from somebody on this list.
I can’t remember who, so author please step up and take credit.
Paraphrasing The Prisoner:

“I am not a number, I am a free NaN”.

Re: What is LC's internal text format?

Pi Digital via use-livecode
In reply to this post by Pi Digital via use-livecode
For the avoidance of doubt, all my outrage is faux outrage.
Public life on both sides of the Atlantic (and around the world) has
completely exhausted capacity for real outrage.

Come back Geoff!

Ben

On 13/11/2018 17:29, Mark Waddingham via use-livecode wrote:

> On 2018-11-13 18:21, Geoff Canyon via use-livecode wrote:
>> Nothing I said in this thread has anything to do with optimizing the
>> allOffsets routines; I only used examples from that discussion because they
>> illustrate my puzzlement on the exact topic you (in general) raised: how
>> data types are handled by the engine. I'd generalize the responses, to say
>> that it seems how the engine stores data and how it presents that data are
>> not identical in all cases.
>
> The best way to think about it is that the engine stores data pretty much in
> the form it is presented with it; however, what script sees of data is in the
> form it requests. In particular, if data has been through some operation, or
> mutated, then there is a good chance it won't be in the same form it was before.
>
> e.g. put tVar + 1 into tVar
>
> Here tVar could start off as a string, but would end up as a number by virtue
> of the fact you've performed an arithmetic operation on it.
>
>> The above notwithstanding: sorry I outraged you; I'll exit this thread.
>
> Obviously I'm not Ben, but I *think* it was 'faux outrage' (well I hope it was
> - hence my jocular comment about herding cats!) - so I don't think there's a
> reason to exit...
>
> Warmest Regards,
>
> Mark.
>


Re: What is LC's internal text format?

Pi Digital via use-livecode
I never left, I just went silent.

But since I'm "back", I'm curious to know what the engine-types think of
Bernd's solution for fixing the UTF-32 offsets code. It seems that when
converting both the stringToFind and stringToSearch to UTF-32 and then
searching the binary with byteOffset, you won't find "Reykjavík" in
"Reykjavík er höfuðborg"

But if you first append "せ" to each string, then do the textEncode, then
strip the last 4 bytes, the match will work. That seems like strange voodoo
to me.
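For reference, the workaround as I understand it (variable names are mine):

```livecode
-- append a non-native char so textEncode takes the unicode path,
-- then strip the 4 UTF-32 bytes of that padding char again
put textEncode(tNeedle & "せ", "UTF-32") into tNeedleBytes
delete byte -4 to -1 of tNeedleBytes
put textEncode(tHaystack & "せ", "UTF-32") into tHaystackBytes
delete byte -4 to -1 of tHaystackBytes
put byteOffset(tNeedleBytes, tHaystackBytes) into tWhere
```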

On Tue, Nov 13, 2018 at 12:54 PM Ben Rubinstein via use-livecode <
[hidden email]> wrote:

> For the avoidance of doubt, all my outrage is faux outrage.
> Public life on both sides of the Atlantic (and around the world) has
> completely exhausted capacity for real outrage.
>
> Come back Geoff!
>
> Ben
>
> On 13/11/2018 17:29, Mark Waddingham via use-livecode wrote:
> > On 2018-11-13 18:21, Geoff Canyon via use-livecode wrote:
> >> Nothing I said in this thread has anything to do with optimizing the
> >> allOffsets routines; I only used examples from that discussion because
> they
> >> illustrate my puzzlement on the exact topic you (in general) raised: how
> >> data types are handled by the engine. I'd generalize the responses, to
> say
> >> that it seems how the engine stores data and how it presents that data
> are
> >> not identical in all cases.
> >
> > The best way to think about it is that the engine stores data pretty
> much in
> > the form it is presented with it; however, what script sees of data is
> in the
> > form it requests. In particular, if data has been through some
> operation, or
> > mutated, then there is a good chance it won't be in the same form it was
> before.
> >
> > e.g. put tVar + 1 into tVar
> >
> > Here tVar could start off as a string, but would end up as a number by
> virtue
> > of the fact you've performed an arithmetic operation on it.
> >
> >> The above notwithstanding: sorry I outraged you; I'll exit this thread.
> >
> > Obviously I'm not Ben, but I *think* it was 'faux outrage' (well I hope
> it was
> > - hence my jocular comment about herding cats!) - so I don't think
> there's a
> > reason to exit...
> >
> > Warmest Regards,
> >
> > Mark.
> >
>

Re: What is LC's internal text format?

Pi Digital via use-livecode
In reply to this post by Pi Digital via use-livecode


> On 14 Nov 2018, at 6:33 am, Ben Rubinstein via use-livecode <[hidden email]> wrote:
>
> That's really helpful - and in parts eye-opening - thanks Mark.
>
> I have a few follow-up questions.
>
> Does textEncode _always_ return a binary string? Or, if invoked with "CP1252", "ISO-8859-1", "MacRoman" or "Native", does it return a string?

Internally we have different types of values. So we have MCStringRef which is the thing which either contains a buffer of native chars or a buffer of UTF-16 chars. There are others. For example, MCNumberRef will either hold a 32 bit signed int or a double. These are returned by numeric operations where there’s no string representation of a number. So:

put 1.0 into tNumber # tNumber holds an MCStringRef
put 1.0 + 0 into tNumber # tNumber holds an MCNumberRef

The return type of textEncode is an MCDataRef. This is a byte buffer, buffer size & byte count.

So:
put textEncode(“foo”, “UTF-8”) into tFoo # tFoo holds MCDataRef

Then if we do something like:
set the text of field “foo” to tFoo

tFoo is first converted to MCStringRef. As it’s an MCDataRef we just move the buffer over and say it’s a native encoded string. There’s no checking to see if it’s a UTF-8 string and decoding with that etc.

Then the string is put into the field.

If you remember that mergJSON issue you reported, where mergJSON returns UTF-8 data and you were putting it into a field and it looked funny - this is why.
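In other words (the field name is illustrative):

```livecode
put textEncode("naïve", "UTF-8") into tData
-- looks funny: the UTF-8 bytes are reinterpreted as native chars
set the text of field "foo" to tData
-- correct: decode back to a string first
set the text of field "foo" to textDecode(tData, "UTF-8")
```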
>
> > CodepointOffset has signature 'integer codepointOffset(string)', so when you
> > pass a binary string (data) value to it, the data value gets converted to a
> > string by interpreting it as a sequence of bytes in the native encoding.
>
> OK - so one message I take is that in fact one should never invoke codepointOffset on a binary string. Should it actually throw an error in this case?

No, as mentioned above values can move to and from different types according to the operations performed on them, and this is largely opaque to the scripter. If you do a text operation on a binary string then there’s an implicit conversion to a native encoded string. You generally want to use codepoint in 7+ where previously you used char, unless you know you are dealing with a binary string, in which case use byte.
>
> By the same token, probably one should only use 'byte', 'byteOffset', 'byteToNum' etc with binary strings - would it be better, to avoid confusion, if char, offset, charToNum should refuse to operate on a binary string?

That would not be backwards compatible.
>
>> e.g. In the case of &, it can either take two data arguments, or two
>> string arguments. In this case, if both arguments are data, then the result
>> will be data. Otherwise both arguments will be converted to strings, and a
>> string returned.
> The second message I take is that one needs to be very careful, if operating on UTF8 or other binary strings, to avoid 'contaminating' them e.g. by concatenating with a simple quoted string, as this may cause it to be silently converted to a non-binary string. (I presume that 'put "simple string" after/before pBinaryString' will cause a conversion in the same way as "&"? What about 'put "!" into char x of pBinaryString?)

When concatenating if both left and right are binary strings (MCDataRef) then there’s no conversion of either to string however we do not currently have a way to declare a literal as a binary string (might be nice if we did!) so you would need to:

put textEncode("simple string”, “UTF-8”) after pBinaryString
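Per the conversion rules Mark described, you can watch the 'contamination' happen with the 'is strictly' operators (a sketch):

```livecode
put textEncode("abc", "UTF-8") into tData
put tData is strictly a binary string   -- true: tData is data
put "!" after tData    -- string literal: both sides convert to string
put tData is strictly a binary string   -- false: tData is now a string
```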

>
> The engine can tell whether a string is 'native' or UTF16. When the engine is converting a binary string to 'string', does it always interpret the source as the native 8-bit encoding, or does it have some heuristic to decide whether it would be more plausible to interpret the source as UTF16?

No, it does not try to interpret. ICU has a charset detector that will give you a list of possible charsets along with a confidence. It could be implemented as a separate api:

get detectedTextEncodings(<binary string>, [<optional hint charset>]) -> array of charset/confidence pairs

get bestDetectedTextEncoding(<binary string>, [<optional hint charset>]) -> charset

Feel free to feature request that!

Cheers

Monte



Re: What is LC's internal text format?

Pi Digital via use-livecode


> On 14 Nov 2018, at 10:44 am, Monte Goulding via use-livecode <[hidden email]> wrote:
>
> You generally want to use codepoint in 7+ where previously you used char, unless you know you are dealing with a binary string, in which case use byte.

Sorry! I have written codepoints here when I was thinking codeunits! Use codeunits rather than codepoints as they are a fixed number of bytes (2). Codepoints may be 2 or 4 bytes so there is a cost in figuring out the number of codepoints or the exact byte codepoint x refers to. So for chunk expressions on unicode strings use `codeunit x to y`.
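A sketch of the chunk form Monte means (variable names are mine):

```livecode
-- codeunit indexing is constant-time; codepoint (and char/grapheme)
-- indexing may have to scan for surrogate pairs first
put codeunit 5 to 12 of tUnicodeText into tSlice
put the number of codeunits in tUnicodeText into tLength
```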

Cheers

Monte



Re: What is LC's internal text format?

Pi Digital via use-livecode


> On 14 Nov 2018, at 11:39 am, Monte Goulding via use-livecode <[hidden email]> wrote:
>
>> You generally want to use codepoint in 7+ where previously you used char, unless you know you are dealing with a binary string, in which case use byte.
>
> Sorry! I have written codepoints here when I was thinking codeunits! Use codeunits rather than codepoints as they are a fixed number of bytes (2). Codepoints may be 2 or 4 bytes so there is a cost in figuring out the number of codepoints or the exact byte codepoint x refers to. So for chunk expressions on unicode strings use `codeunit x to y`.

Argh… sorry again… codeunits are a fixed number of bytes but that fixed number depends on whether the string is native encoded (1 byte) or UTF-16 (2 bytes)!

And for completeness codeunit/codepoint is not equivalent to char. If you really need to count graphemes then you will need to use char.

Cheers

Monte