the mouseText and Unicode: a 3-char puzzle


Slava Paperno (Bridge)
Following Tariel's report, here is a puzzle:

Make a text entry field and set its font to Arial,Unicode.

Put these three characters in the field and lock it:

«—»

The first one is decimal 171, the last one is decimal 187; they are called
Double Angle Quotation Marks.

The one in the middle is called Em-Dash, decimal 8212.

Give the field this mouseDown script:

on mouseDown
   -- list the bytes of the field's unicodeText (UTF-16)
   put "FIELD"
   repeat with i = 1 to length(the unicodeText of field "TextToClick")
      put cr & byteToNum(byte i of the unicodeText of field "TextToClick") after msg
   end repeat
   
   -- same bytes, read from a variable
   put the unicodeText of field "TextToClick" into locEntireText --this is UTF16
   put cr & "VAR UTF-16" after msg
   repeat with i = 1 to length(locEntireText)
      put cr & byteToNum(byte i of locEntireText) after msg
   end repeat
   
   -- convert the variable to UTF-8 and list its bytes
   put uniDecode(locEntireText, "UTF8") into locEntireText --this is UTF8
   put cr & "VAR UTF-8" after msg
   repeat with i = 1 to length(locEntireText)
      put cr & byteToNum(byte i of locEntireText) after msg
   end repeat
end mouseDown

When I click the field in LC 4.6.1 on my Windows 7 machine, I get this
display in the Message box:

FIELD
171
0
20
32
187
0
VAR UTF-16
171
0
20
32
187
0
VAR UTF-8
194
171
226
128
148
194
187

The FIELD and the VAR UTF-16 reports are entirely predictable, but the VAR
UTF-8 list is puzzling to me. I expected six bytes, not seven.
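
For reference, the same three characters can be cross-checked outside LiveCode. The following is a minimal Python sketch, not part of the original stack; it assumes the field's unicodeText on Windows is little-endian UTF-16, which matches the 171 0 / 20 32 / 187 0 pattern reported above. The variable names are illustrative only.

```python
# The three characters from the puzzle:
# left guillemet (U+00AB), em dash (U+2014), right guillemet (U+00BB).
text = "\u00ab\u2014\u00bb"

# Assumption: little-endian UTF-16, two bytes per character here.
utf16_bytes = list(text.encode("utf-16-le"))

# UTF-8 uses a variable number of bytes per character.
utf8_bytes = list(text.encode("utf-8"))

print(utf16_bytes)                       # [171, 0, 20, 32, 187, 0]
print(utf8_bytes)                        # [194, 171, 226, 128, 148, 194, 187]
print(len(utf16_bytes), len(utf8_bytes)) # 6 7
```

The UTF-8 list has seven entries because the em dash alone takes three bytes there, while each guillemet takes two.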

There is a practical reason for trying to solve this puzzle: these three
characters throw off the byte count that I used in the workaround for the
"clickedUnicodeText" problem that was discussed under this Subject line the
other day. I feel obliged to restore order in this chaotic universe, or fall
asleep trying.

Thanks, Tariel, and thank you all for reading this,

Slava

> -----Original Message-----
> From: Tariel Gogoberidze [mailto:[hidden email]]
> Sent: Monday, June 20, 2011 11:58 AM
> To: [hidden email]
> Subject: Re: the mouseText and Unicode: CONCLUSION
>
>
> Hi Slava,
>
> Tried your script (nice job), but with text I copied from some Russian
> website it breaks on the word "dikanky" and all words after that.
> Try the attached stack; you will see on which char it breaks further word
> selection, and removing this char will allow correct selection again.
>
> regards
> Tariel




_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: the mouseText and Unicode: a 3-char puzzle

BNig
Slava,

no contribution to the puzzle but maybe to more sleep: the HTMLText works for selecting «—» and copying it. And all words afterwards.

No order in the chaotic universe, just managing it, word by word :)


Kind regards

Bernd


the mouseText and Unicode: a 3-char puzzle

Slava Paperno (Bridge)
Bernd,

Thanks for the good news, but it doesn't work for me. I must be doing something wrong. I am using the Russian text from Gogol that Tariel found. It is in the attached txt file, with the three problem chars. And I'm using the handler you posted on 6/16. It is in the other attached file. Do these work for you? Mind the field names, please.

Slava



Attachments: Gogol.txt (560 bytes), SelectWordByHTMLText.txt (1K)

Re: the mouseText and Unicode: a 3-char puzzle

BNig
Slava,

It worked on a self-assembled Russian text including an em-dash and guillemets, but not on your text.

This works on your text:

------------------------------------------
on mouseUp
   lock screen
   -- character position of the click
   get word 4 of the clickCharChunk
   put it into tSelPos
   put 0 into tStartSel
   
   -- scan backwards for the delimiter that precedes the clicked word
   repeat with i = tSelPos down to 1
      put the htmlText of char i of field "textToClickB" into tHTML
      if (tHTML contains "<p> <" or tHTML is "<p></p>" or tHTML contains ">,<" or tHTML contains ">.<" \
            or tHTML contains "> </font" or tHTML contains ">&nbsp;<" or tHTML contains ">&laquo;<" or tHTML contains ">&raquo;<") then
         put i into tStartSel
         exit repeat
      end if
   end repeat
   
   -- scan forwards for the delimiter that follows the clicked word
   put the number of chars of field 1 into tEndSel
   repeat with i = tSelPos to the number of chars of field 1
      put the htmlText of char i of field "textToClickB" into tHTML
      if (tHTML contains "<p> <" or tHTML is "<p></p>" or tHTML contains ">,<" or tHTML contains ">.<" \
            or tHTML contains "> </font" or tHTML contains ">&nbsp;<" or tHTML contains ">&laquo;<" or tHTML contains ">&raquo;<") then
         put i into tEndSel
         exit repeat
      end if
   end repeat
   
   -- select the word between the two delimiters and copy it with its styling
   select char tStartSel + 1 to tEndSel -1 of me
   
   put the htmlText of  the selectedtext into tWordClicked
   set the htmlText of field "ClickedWord" to tWordClicked
   unlock screen
end mouseUp
-------------------------------------------

I am posting via Nabble, so please watch out for line breaks when copying the script.


I am sure there are more gotchas, and it is not getting any nicer this way, but it is a poor man's Unicode :)

Kind regards

Bernd

Re: the mouseText and Unicode: a 3-char puzzle

BNig
In reply to this post by Slava Paperno (Bridge)
Slava,

I made a slider; make it rather wide and add a field to put the htmlText into.

The slider has this code:

-----------------------------------
on mouseDown
   set the endValue of me to the length of field "TextToClick"
end mouseDown

on scrollbarDrag tValue
   put round (tValue) into tThumb
   select char tThumb of field "textToClick"
   put the htmlText of the selection into field "fRes"
end scrollbarDrag
--------------------------------------

In my previous post I accidentally left the name of the field as "TextToClickB" which is my field name for the htmlText version of selecting a word. But you probably have noticed that already.

Kind regards

Bernd

RE: the mouseText and Unicode: a 3-char puzzle

Slava Paperno (Bridge)
In reply to this post by BNig
I like your definition... "a poor man's Unicode." Cute.

Thanks for everything,

Slava





Re: the mouseText and Unicode: a 3-char puzzle

Dave Cragg-2
In reply to this post by Slava Paperno (Bridge)

On 21 Jun 2011, at 07:40, Slava Paperno wrote:

> VAR UTF-8
> 194
> 171
> 226
> 128
> 148
> 194
> 187
>
> The FIELD and the VAR UTF-16 reports are entirely predictable, but the VAR
> UTF-8 list is puzzling to me. I expected six bytes, not seven.

I didn't follow the earlier thread, so apologies if I'm not helping here.

You said you were puzzled by the UTF-8 list having seven bytes. But Unicode characters in UTF-8 may be from 1 to 4 bytes long. The values of the bytes give a hint to what they represent. A byte value between 192 and 223 is the first byte in a 2-byte character, a byte value between 224 and 239 is the first byte in a 3-byte character, and byte values from 128 to 191 are continuation bytes. So in this case, the 226 value is the beginning of the 3-byte sequence for em-dash (226 128 148), and each quotation mark takes two bytes (194 171 and 194 187), which gives 2 + 3 + 2 = 7 bytes.
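
The lead-byte rule can be sketched in a few lines of Python (not LiveCode; the function names are made up for illustration). It assumes well-formed UTF-8 input and only looks at the first byte of each sequence to decide how many bytes belong to the character. The 4-byte range (240 to 247) is not mentioned in the thread and is included for completeness.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with lead_byte uses."""
    if lead_byte < 128:
        return 1                      # plain ASCII
    if 192 <= lead_byte <= 223:
        return 2                      # first byte of a 2-byte character
    if 224 <= lead_byte <= 239:
        return 3                      # first byte of a 3-byte character
    if 240 <= lead_byte <= 247:
        return 4                      # first byte of a 4-byte character
    # 128..191 are continuation bytes and cannot start a character.
    raise ValueError(f"{lead_byte} is not a valid UTF-8 lead byte")


def split_utf8(byte_values):
    """Group a list of byte values into one sub-list per character."""
    groups = []
    i = 0
    while i < len(byte_values):
        n = utf8_sequence_length(byte_values[i])
        groups.append(byte_values[i:i + n])
        i += n
    return groups


# The seven values reported in the VAR UTF-8 list.
reported = [194, 171, 226, 128, 148, 194, 187]
groups = split_utf8(reported)
decoded = [bytes(g).decode("utf-8") for g in groups]
print(groups)   # [[194, 171], [226, 128, 148], [194, 187]]
print(decoded)
```

A byte count kept per character this way comes to 2 + 3 + 2 for the three problem characters, which is where a workaround that assumes two bytes per character would drift.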

Cheers
Dave