Help converting Hex UTF-8 bytes to character

Trevor DeVore via use-livecode
Hi,

I have a text file that contains hex UTF-8 bytes encoded in the following
manner:

```
\xC3\xB3
```

This particular sequence represents the following character:

```
ó
```

I need to read this file in, converting these hex bytes to the proper
character. For example, the following string:

```
versi\xC3\xB3n HTML5
```

should be read in as:

```
versión HTML5
```

Does anybody know how to use the C3 B3 hex values to generate the proper
character?

Thanks,

--
Trevor DeVore
ScreenSteps
www.screensteps.com
_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
Re: Help converting Hex UTF-8 bytes to character

Paul Dupuis via use-livecode
As a general approach:

1) use offset() looking for "\x" (or you could use regex) to find the start
2) if the value returned by offset is not zero (call it tOS) put char
tOS+2 to tOS+2 into tByte1 and char tOS+6 to tOS+7 into byte2 to get the
2 hex values
3) use the formula put
baseConvert(byte1,16,10)*256+baseconvert(byte2,16,10) into tCodePoint
4) lastly put numToCodepoint(tCodePoint) into char tOS to tOS+7 of the
original string

Off the top of my head and (obviously) not tested.
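
The steps above can be sketched as follows, with one adjustment. C3 B3 are the two UTF-8 bytes of ó (code point U+00F3), so combining them as byte1*256+byte2 would give 0xC3B3 rather than the code point. This sketch instead turns each \xHH into the single raw byte it names and leaves the UTF-8 decoding to textDecode. The handler name `unescapeHexBytes` and its variable names are made up for illustration, it assumes LiveCode 7 or later (for `byteOffset`, `numToByte` and `textDecode`), and like the steps above it is untested.

```
function unescapeHexBytes pData
   # pData: binary data, e.g. read via url ("binfile:" & tPath)
   local tSkip, tOS, tHex
   put 0 into tSkip
   repeat forever
      put byteOffset("\x", pData, tSkip) into tOS
      if tOS is 0 then exit repeat
      # byteOffset is relative to tSkip, so convert to an absolute position
      add tSkip to tOS
      # the "\" is at tOS and the "x" at tOS+1, so the digits are at tOS+2 to tOS+3
      put byte (tOS + 2) to (tOS + 3) of pData into tHex
      if matchText(tHex, "^[0-9A-Fa-f]{2}$") then
         # replace the four bytes of \xHH with the one byte they name
         put numToByte(baseConvert(tHex, 16, 10)) into byte tOS to (tOS + 3) of pData
      end if
      # continue after the new byte, or after a "\" that was not a valid escape
      put tOS into tSkip
   end repeat
   # decode the whole buffer once so multi-byte sequences stay together
   return textDecode(pData, "UTF-8")
end unescapeHexBytes
```

Decoding once at the end, rather than per escape, means sequences of two, three or four bytes are all handled by the same call.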





Re: Help converting Hex UTF-8 bytes to character

Bob Sneidar via use-livecode
You meant tOS+1 to tOS+2?

> On May 31, 2018, at 13:39 , Paul Dupuis via use-livecode <[hidden email]> wrote:
>
> tOS+2 to tOS+2 into tByte1 and char tOS+6 to tOS+7 into byte2 to get the



Re: Help converting Hex UTF-8 bytes to character

Paul Dupuis via use-livecode
Actually, I think I meant tOS+2 to tOS+3. If the find starts on \x##,
then the \ is tOS and tOS+1 is the x.

That's what I get for rushing a reply.

On 5/31/2018 4:47 PM, Bob Sneidar via use-livecode wrote:

> You meant tOS+1 to tOS+2?
>
>> On May 31, 2018, at 13:39 , Paul Dupuis via use-livecode <[hidden email]> wrote:
>>
>> tOS+2 to tOS+2 into tByte1 and char tOS+6 to tOS+7 into byte2 to get the

Re: Help converting Hex UTF-8 bytes to character

Monte Goulding via use-livecode
Hi Trevor

I’m pretty sure that the following will do what you want here:

textDecode(format(<file content>),"utf-8")

Cheers

Monte
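
A minimal sketch of that one-liner applied to the example string from the original post. LiveCode string literals do not process backslash escapes themselves, so `tEscaped` holds the literal characters \, x, C, 3 and so on, and `format` is what expands them. The handler and variable names are made up for illustration. Because `format` also treats % as the start of a directive and \ as the start of other escapes, this assumes the data contains neither apart from the \x sequences.

```
command showDecodedExample
   local tEscaped, tBytes, tText
   put "versi\xC3\xB3n HTML5" into tEscaped
   # format() expands each \xHH escape into the character with that byte value
   put format(tEscaped) into tBytes
   # textDecode() then reads the bytes C3 B3 as the UTF-8 encoding of one character
   put textDecode(tBytes, "UTF-8") into tText
   answer tText
end showDecodedExample
```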




Re: Help converting Hex UTF-8 bytes to character

Trevor DeVore via use-livecode
On Thu, May 31, 2018 at 5:20 PM, Monte Goulding via use-livecode <
[hidden email]> wrote:

>
> I’m pretty sure that the following will do what you want here:
>
> textDecode(format(<file content>),"utf-8")
>

Yes it does! `format` is my new best friend.

Thanks for everyone’s tips. Here is what I came up with which has worked on
the four files I’ve thrown at it.

```
on mouseUp
  answer file "Select UTF-8 File"
  if it is empty then exit mouseUp # user cancelled the dialog
  put url("binfile:" & it) into tData
  put 0 into tSkip

  repeat forever
    # Find next occurrence of \x
    put offset("\x", tData, tSkip) into tStartOffset
    if tStartOffset > 0 then
      # offset() is relative to tSkip, so convert to an absolute position
      add tSkip to tStartOffset
      put tStartOffset + 3 into tEndOffset

      # Extend the range over any \x sequences that follow immediately
      repeat forever
        if char (tEndOffset + 1) to (tEndOffset + 2) of tData is "\x" then
          add 4 to tEndOffset
        else
          exit repeat
        end if
      end repeat

      try
        # Let format() turn the \xHH escapes into raw bytes
        put format(char tStartOffset to tEndOffset of tData) into tNewString
        put tNewString into char tStartOffset to tEndOffset of tData
        # Resume searching just after the converted bytes
        put tStartOffset - 1 + the number of chars of tNewString into tSkip
      catch e
        breakpoint
        # format() couldn't handle this \x instance, so skip over it
        put tStartOffset + 1 into tSkip
      end try
    else
      exit repeat
    end if
  end repeat

  set the clipboardData to textDecode(tData, "utf8")
  beep
end mouseUp
```

--
Trevor DeVore
ScreenSteps
www.screensteps.com

Re: Help converting Hex UTF-8 bytes to character

Monte Goulding via use-livecode


> On 1 Jun 2018, at 2:18 pm, Trevor DeVore via use-livecode <[hidden email]> wrote:
>
> Yes it does! `format` is my new best friend.

Hmm… why not just throw the whole thing at format? If it has one escape sequence it might have others and you can’t put one in there and expect a single `\` to be literal.

Cheers

Monte

Re: Help converting Hex UTF-8 bytes to character

Mark Waddingham via use-livecode
On 2018-06-01 06:21, Monte Goulding via use-livecode wrote:
>> On 1 Jun 2018, at 2:18 pm, Trevor DeVore via use-livecode
>> <[hidden email]> wrote:
>>
>> Yes it does! `format` is my new best friend.
>
> Hmm… why not just throw the whole thing at format? If it has one
> escape sequence it might have others and you can’t put one in there
> and expect a single `\` to be literal.

@Trevor : Monte makes a good point here - \x is the standard C escape
for a single byte char. If a string format does \x, then it also has to
escape \ as \\ - unless it requires \ to be encoded in \x form (which is
certainly plausible - URL encoding requires that of % which is the
escape character).

Do you have a spec for the escaped strings you are processing?

Warmest Regards,

Mark.

--
Mark Waddingham ~ [hidden email] ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: Help converting Hex UTF-8 bytes to character

Trevor DeVore via use-livecode
On Fri, Jun 1, 2018 at 2:06 AM, Mark Waddingham via use-livecode <
[hidden email]> wrote:

> Do you have a spec for the escaped strings you are processing?
>

I tried passing the entire string in but it makes `format` barf. The data
is coming from a database field where ActiveRecord has serialized some
JSON-encoded data. The data contains some ActiveRecord (Ruby) encoding
information. Years ago a bug somewhere in our system caused some data to be
encoded incorrectly in a few cases. I'm not sure how it happened or if
there are any rules I can count on. My search and replace solution has
worked properly thus far. Fortunately I don't have many more records to
process.

--
Trevor DeVore
ScreenSteps
www.screensteps.com