byteLen()?


Sannyasin Brahmanathaswami via use-livecode
the len() function returns a character count, but with Unicode this may
be very different than the byte size.

Do we have a size() or byteLen() function?

--
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  [hidden email]                http://www.FourthWorld.com

_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: byteLen()?

Sannyasin Brahmanathaswami via use-livecode
Hi Richard,

No, there is no such function as the byte length is a property of text which has been encoded with a specific encoding, not text in general.

the number of bytes in textEncode(tText, kEncoding)

Should give you what you need.
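For anyone who wants the byteLen() asked about in the subject line, the expression above can be wrapped in a small function. This is only a sketch: the name byteLen and the UTF-8 default are assumptions made for illustration, not something the engine provides.

```livecode
-- Hypothetical helper; not a built-in function.
-- Returns the number of bytes pText occupies once encoded as pEncoding.
function byteLen pText, pEncoding
   -- Assumption: fall back to UTF-8 when no encoding is given
   if pEncoding is empty then put "UTF-8" into pEncoding
   return the number of bytes in textEncode(pText, pEncoding)
end byteLen
```

With this in a script, put byteLen("Hello", "UTF-16") should report 10, since UTF-16 uses two bytes for each of the five codeunits.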

Warmest Regards,

Mark.

Sent from my iPhone

> On 9 Mar 2017, at 02:37, Richard Gaskin via use-livecode <[hidden email]> wrote:
>
> the len() function returns a character count, but with Unicode this may be very different than the byte size.
>
> Do we have a size() or byteLen() function?


Re: byteLen()?

Sannyasin Brahmanathaswami via use-livecode
Mark Waddingham wrote:

 > On 9 Mar 2017, at 02:37, Richard Gaskin wrote:
 >>
 >> the len() function returns a character count, but with Unicode this
 >> may be very different than the byte size.
 >>
 >> Do we have a size() or byteLen() function?
 >
 >
 > No, there is no such function as the byte length is a property of
 > text which has been encoded with a specific encoding, not text in
 > general.
 >
 > the number of bytes in textEncode(tText, kEncoding)
 >
 > Should give you what you need.

Thanks. I don't mind the verbosity, but I could use some clarity:

There's been talk of LC using UTF-16 internally, but when I do this:

on mouseUp
    put "Hello" into s
    put the number of bytes of s
end mouseUp

...I get "5".

When does LC use UTF-16, and when it's not UTF-16, is it still ISO-8859-1
or UTF-8?

--
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  [hidden email]                http://www.FourthWorld.com


Re: byteLen()?

Sannyasin Brahmanathaswami via use-livecode
On 2017-03-09 19:06, Richard Gaskin via use-livecode wrote:

> Thanks. I don't mind the verbosity, but I could use some clarity:
>
> There's been talk of LC using UTF-16 internally, but when I do this:
>
> on mouseUp
>    put "Hello" into s
>    put the number of bytes of s
> end mouseUp
>
> ...I get "5".
>
> When does LC use UTF-16, and when it's not UTF-16, is it still
> ISO-8859-1 or UTF-8?

Internally strings are stored as either UTF-16 or in the native encoding
(MacRoman, Latin-1) depending on the content of the string and such (the
engine transparently switches internal encoding as necessary). However,
this is an internal implementation detail - it might do something
completely different in the future...

Before 7, the ideas of 'byte' and 'char' were synonymous - if you used a
string in the context of something expecting text it was interpreted as
being a string encoded in the native encoding; if you used a string in
the context of something expecting binary data it was interpreted as
being just plain bytes.

With the advent of 7 it is necessary to treat text and binary separately
- they aren't the same thing at all for the simple reason that text only
becomes binary when you choose a specific text encoding and apply it to
the unicode string.

In order to ensure that code written prior to 7 worked identically in 7
it was necessary to add an automatic conversion between text and binary
which preserved the previous behavior which (essentially) viewed text
and binary strings as being the same thing.

Indeed, in the above code what is actually happening is this:

on mouseUp
   put "Hello" into s
   put the number of bytes of <implicit-text-to-data>(s)
end mouseUp

One important property which existed before 7 was that:

    the number of bytes in s == the number of chars in s

However, the definition of 'char' changed in 7 to mean a Unicode
grapheme - something which will often require many bytes to encode in
any encoding (e.g. [e, combining-acute] is a perfectly valid way to
express e-acute in Unicode - taking two codepoints, and not one). In
order to keep the above equivalence (which would break many things if it
were not kept) the implicit-text-to-data conversion is defined as
follows:

   repeat for each char tChar in tString
     get textEncode(tChar, "native")
     if textDecode(it, "native") is tChar then
        put it after tData
     else
        put "?" after tData
     end if
   end repeat

(Note: The engine does work quite hard to keep things as equivalent as
possible - it normalizes tString to NFC first so that it doesn't matter
if the string has passed through a process which has happened to
decompose it, or if it has come from a source which favours decomposed
representations - most notably Mac HFS filenames).

This approach means that any multi-codepoint character in Unicode still
maps to a single byte - and any non-updated code which manipulates
strings as if they are data will still work (albeit with some data loss
in regards the original Unicode string - which it wasn't written to
understand anyway).
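A small experiment makes the difference between the implicit and the explicit conversion visible. This is a sketch under the assumptions stated in the comments; the variable names are illustrative only, and the values reported depend on the engine behaving as described above.

```livecode
on mouseUp
   -- "e" followed by U+0301 COMBINING ACUTE ACCENT (decimal 769)
   put "e" & numToCodepoint(769) into tText

   -- One char (grapheme) built from two codepoints
   put the number of chars in tText into tChars
   put the number of codepoints in tText into tCodepoints

   -- Implicit text-to-data conversion: one byte per char
   put the number of bytes in tText into tImplicit

   -- Explicit conversion: the encoding is named, nothing is guessed
   put the number of bytes in textEncode(tText, "UTF-8") into tExplicit

   -- Show all four counts as a comma-separated list
   put tChars, tCodepoints, tImplicit, tExplicit
end mouseUp
```

Going by the description above, the first three values should be 1, 2 and 1, while the explicit UTF-8 byte count will be larger because the combining accent alone needs two bytes in UTF-8.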

In the future, it is entirely possible that we will make it a runtime
error to implicitly convert between data and string (don't worry, it
wouldn't be the default behavior) because if you aren't clear about how
you are doing the conversion (i.e. which conversion you are using) it is
a potential source of hard to find errors in code.

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: byteLen()?

Sannyasin Brahmanathaswami via use-livecode
Thanks for that background, Mark.  I always appreciate your informal
tech notes.

I'm copying only the most relevant parts here - others looking for a
good read will want the full post if they missed it:
http://lists.runrev.com/pipermail/use-livecode/2017-March/235278.html


Mark Waddingham wrote:

 > This approach means that any multi-codepoint character in Unicode
 > still maps to a single byte - and any non-updated code which
 > manipulates strings as if they are data will still work (albeit with
 > some data loss in regards the original Unicode string - which it
 > wasn't written to understand anyway).

I'm not sure I follow that, but it almost sounds like no matter what the
encoding each char is mapped to one byte, so a 5-char string like
"hello" will take up 5 bytes - is that right?

Doesn't feel right, but there's so much to both Unicode and how LC
handles it that I've lost my confidence with things like this.

Your guidance is appreciated, and perhaps it may help if I describe the
use-case at hand:

I have some large files I want to open and read as binary (for speed
mostly; if there's a reason I should be doing that as text let me know),
then I'll work my way through it looking for substrings, keeping track
of the byte offsets within the data where those can be found.

Once I have my list of byte offsets, I can save that as a sort of index
file, and use "seek" or "read at" to go directly to that portion of the
larger files whenever I need to access that data.

The data files may use a variety of encodings, mostly UTF-8 but I can
expect Latin-ISO or perhaps even UTF-16.  In short, encoding will may be
known in advance.

But since I'm working with binary data the whole time, the encoding
shouldn't matter, should it?

Earlier you wrote:

   the number of bytes in textEncode(tText, kEncoding)

...which implies that I would need to know the encoding (kEncoding), but
do I really need textEncode for the use-case described here?

--
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  [hidden email]                http://www.FourthWorld.com


Re: byteLen()?

Sannyasin Brahmanathaswami via use-livecode
On 2017-03-09 22:24, Richard Gaskin via use-livecode wrote:
> I'm not sure I follow that, but it almost sounds like no matter what
> the encoding each char is mapped to one byte, so a 5-char string like
> "hello" will take up 5 bytes - is that right?

In the case of the implicit conversion the engine does between text
and binary data - yes it is. The number of bytes in the generated
data will be the same as the number of chars in the original text.

However that only relates to the implicit 'compatibility' conversion
the engine does. In new code, it is better to make sure the conversion
is explicit by using textEncode / textDecode.

> I have some large files I want to open and read as binary (for speed
> mostly; if there's a reason I should be doing that as text let me
> know), then I'll work my way through it looking for substrings,
> keeping track of the byte offsets within the data where those can be
> found.
>
> Once I have my list of byte offsets, I can save that as a sort of
> index file, and use "seek" or "read at" to go directly to that portion
> of the larger files whenever I need to access that data.
>
> The data files may use a variety of encodings, mostly UTF-8 but I can
> expect Latin-ISO or perhaps even UTF-16.  In short, encoding will may
> be known in advance.
>
> But since I'm working with binary data the whole time, the encoding
> shouldn't matter, should it?

It depends on whether you need to convert a text string into a byte
sequence to search for, and whether you want an exact text match or a
caseless text match.

If the file you are searching is just a text file which you want to
search as binary then you need to know the encoding of said text file so
you can encode the text you are searching for in the same way. For
example, if you are searching for "foó" and encode it as UTF-16 (which
would generate 6 bytes) and the (text) file you are searching is UTF-8
encoded then it won't work. The UTF-8 encoding of "foó" is different
from the UTF-16 encoding.
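The point about matching encodings can be sketched as a small search helper. The function name findEncoded and its parameters are hypothetical, and pData is assumed to hold the file contents read as binary (for example via a binfile: URL).

```livecode
-- Hypothetical helper; not a built-in function.
-- Returns the 1-based byte offset of pNeedle within the binary data
-- pData, or 0 if it is not found.
-- Assumption: pEncoding names the encoding the file was written in.
function findEncoded pNeedle, pData, pEncoding
   -- Encode the search text the same way the file is encoded
   put textEncode(pNeedle, pEncoding) into tNeedleBytes
   -- Compare bytes with bytes; no text interpretation is involved
   return byteOffset(tNeedleBytes, pData)
end findEncoded
```

Calling findEncoded("foó", tData, "UTF-16") on UTF-8 data would be expected to return 0 even when the word is present, which is the mismatch described above. Being a byte comparison, it only suits exact matches, not caseless ones.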

If the file you are searching is some binary file containing text then
things are decidedly more tricky as to do the search accurately you need
to know the exact format of the binary file so you know precisely where
the (encoded) text strings within it sit. This is presuming you are not
happy with 'false positives'.

(A stackfile, for example, contains encoded text and sequences of bytes
which never were and never will be text - however, it is perfectly
possible for the latter to match encoded text, just by chance.)

If you are wanting a caseless match rather than an exact match then you
pretty much have to treat the file as text - you can't do caseless
matching on arbitrary bytes as it makes no sense (as they are just bytes
with no meaning).

> Earlier you wrote:
>
>   the number of bytes in textEncode(tText, kEncoding)
>
> ...which implies that I would need to know the encoding (kEncoding),
> but do I really need textEncode for the use-case described here?

Strictly speaking that depends on the encoding:

For native encoding - number of bytes == number of codeunits

For UTF-16 - number of bytes = 2 * number of codeunits

For UTF-32 - number of bytes = 4 * number of codeunits

However, UTF-8 is a multibyte encoding based on the codepoints in the
text. A single codepoint can be encoded as 1, 2, 3 or 4 bytes.

The point here being, in order to compute the byte length of a piece of
text encoded as UTF-8 you need to look at each character. Since
textEncode does that, it is a reasonably clear way of working such
things out.
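The per-encoding byte counts can be checked with a short loop. This is a sketch: the variable names are illustrative, and it assumes the "ó" in the script is the single precomposed codepoint U+00F3.

```livecode
on mouseUp
   put "foó" into tText
   -- Encode the same text four ways and record each byte count
   repeat for each item tEncoding in "native,UTF-8,UTF-16,UTF-32"
      put the number of bytes in textEncode(tText, tEncoding) into tCount
      put tEncoding & ": " & tCount & return after tReport
   end repeat
   put tReport
end mouseUp
```

Under that assumption the counts should come out as 3, 4, 6 and 12: the fixed-width encodings follow the multipliers listed above, while UTF-8 spends one byte each on "f" and "o" and two on "ó".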

By the way, here I've mentioned three things - codeunit, codepoint and
char:

   - a codeunit is the smallest element in UTF-16 and represents unicode
     codepoints 0-65535 (i.e. fits in a 16-bit unsigned int).

   - a codepoint is the natural 'unit' of Unicode - a 21-bit quantity
     which indexes into the Unicode char tables. (UTF-16 encodes the
     21-bit quantity by using 'surrogate' pairs of codeunits - meaning
     that, in that encoding a codepoint can take 1 or 2 codeunits).

   - a char is a sequence of codepoints which are generally considered to
     be a single (human-processable) character.
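The three units can be counted separately in script, using the codeunit, codepoint and char chunk types. A minimal sketch, with a codepoint outside the 16-bit range chosen purely for illustration:

```livecode
on mouseUp
   -- U+1F600 (decimal 128512) does not fit in a single 16-bit codeunit
   put numToCodepoint(128512) into tText

   put the number of codeunits in tText into tUnits
   put the number of codepoints in tText into tPoints
   put the number of chars in tText into tChars

   -- Show the three counts as a comma-separated list
   put tUnits, tPoints, tChars
end mouseUp
```

By the definitions above, this one codepoint should count as two codeunits (a surrogate pair), one codepoint and one char.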

I'm not sure if the above helps or not - it might be helpful to explain
the problem you are trying to solve more deeply. I still can't quite see
how the byte length of a piece of text (encoded in a particular
encoding) is useful since surely you need the byte sequence to search
for anyway, in which case the number of bytes is the length of that byte
sequence that you already have...

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps
