Help Wrapping HTMLTidy in LCB

Trevor DeVore via use-livecode
Hello,

While looking at solutions for converting HTML into XHTML that can be
parsed by revXML I decided to test HTMLTidy which has an option to output
the input as XHTML. While I could bundle up the tidy command line tool and
include it with my app, I prefer to wrap things up in LCB if possible.

Unfortunately I haven't gotten very far with HTMLTidy and I'm
hoping someone else might be able to figure out what I'm doing wrong. If
you are up for loading up an LCB project in LC 9 on macOS and looking at
some C files then please read on:

RESOURCES

- Github repo with LCB file, a test stack, and compiled HTMLTidy dylib for
testing on macOS: https://github.com/trevordevore/lc-htmltidy
- HTMLTidy github repo where source files are located:
https://github.com/htacg/tidy-html5

WHAT WORKS

In the htmltidy.lcb file I've wrapped some of the simple APIs that return
strings: tidyReleaseDate(), tidyLibraryVersion(), and tidyPlatform(). Those
all work.

WHAT DOESN'T WORK

tidyHTMLToXHTML() in the htmltidy.lcb file has some test code in it that
isn't working. As a test I want to call `tidyOptGetIdForName()` from the
htmltidy C library and get a valid value returned. I expect the following
code to log `0` but it is logging `104`. I don't think I am creating
the ctmbstr pointer properly but I don't really know. Here is code from the
htmltidy.lcb file along with links to the ctmbstr definition in the
HTMLTidy source code:

```
variable tCStr as Pointer

-- Attempting to create a ctmbstr from a LiveCode string
-- ctmbstr: https://github.com/htacg/tidy-html5/blob/next/include/tidyplatform.h#L607
MCStringConvertToCString("TidyUnknownOption", tCStr)
-- The next handler is logging `104` which is N_TIDY_OPTIONS (error)
-- Appears that tCStr is not the right format.
log c_tidyOptGetIdForName(tCStr)
```

MCStringConvertToCString is defined as follows in the htmltidy.lcb file:

```
foreign handler MCStringConvertToCString(in pString as String, out rCString as Pointer) returns CBool binds to "<builtin>"
```
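
One thing worth ruling out here, offered as an assumption rather than anything confirmed in this thread: as the code comment above notes, `104` is `N_TIDY_OPTIONS`, the value that comes back when no option matches. If the lookup is keyed on configuration names such as `output-xhtml` rather than on enum identifiers such as `TidyUnknownOption`, then `104` could be a correct "not found" answer even when the C string is well formed. The C sketch below mimics that contract with a hypothetical table; `mock_opt_get_id_for_name`, the `MOCK_*` IDs, and the local `ctmbstr` typedef are stand-ins invented for illustration, not libtidy code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for tidy's ctmbstr: a read-only, NUL-terminated C string. */
typedef const char *ctmbstr;

/* Hypothetical option IDs; the real TidyOptionId enum is much larger. */
enum { MOCK_UNKNOWN_OPTION = 0, MOCK_INDENT, MOCK_XHTML_OUT, MOCK_N_OPTIONS };

/* Hypothetical table keyed on configuration names, not enum identifiers. */
static const struct { ctmbstr name; int id; } mock_options[] = {
    { "indent",       MOCK_INDENT },
    { "output-xhtml", MOCK_XHTML_OUT },
};

/* Returns the ID for a known option name, or MOCK_N_OPTIONS if none matches. */
static int mock_opt_get_id_for_name(ctmbstr name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof mock_options / sizeof mock_options[0]; ++i) {
        if (strcmp(mock_options[i].name, name) == 0)
            return mock_options[i].id;
    }
    return MOCK_N_OPTIONS;
}
```

If that assumption holds, passing a name such as "output-xhtml" through the same string conversion would be a quick way to tell a lookup miss apart from a badly formed pointer.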

If anyone can provide some pointers or a PR I would really appreciate it.

--
Trevor DeVore
ScreenSteps
www.screensteps.com
_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: Help Wrapping HTMLTidy in LCB

Richard Gaskin via use-livecode
Trevor DeVore wrote:

 > While looking at solutions for converting HTML into XHTML that can be
 > parsed by revXML I decided to test HTMLTidy which has an option to
 > output the input as XHTML. While I could bundle up the tidy command
 > line tool and include it with my app, I prefer to wrap things up in
 > LCB if possible.

Is conversion to XHTML the way to go?

I've tried using the XML external to parse even RSS files -- ostensibly
pure XML -- only to find it choke on some of them.  I've gone back to
hand-crafted pull-parsers.

--
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  [hidden email]                http://www.FourthWorld.com

Re: Help Wrapping HTMLTidy in LCB

Trevor DeVore via use-livecode
On Fri, Nov 22, 2019 at 2:25 PM Richard Gaskin via use-livecode <
[hidden email]> wrote:

> Trevor DeVore wrote:
>
>  > While looking at solutions for converting HTML into XHTML that can be
>  > parsed by revXML I decided to test HTMLTidy which has an option to
>  > output the input as XHTML. While I could bundle up the tidy command
>  > line tool and include it with my app, I prefer to wrap things up in
>  > LCB if possible.
>
> Is conversion to XHTML the way to go?
>
> I've tried using the XML external to parse even RSS files -- ostensibly
> pure XML -- only to find it choke on some of them.  I've gone back to
> hand-crafted pull-parsers.
>

There are definitely other ways to approach the problem I'm trying to
solve. In fact, in other areas of my app I extract parts of HTML without
relying on revXML.

In this particular case I already have some LC code that parses HTML placed
on the clipboard and converts it into the data structure used by the
application. This was originally implemented using the revXML callback
feature (no tree is created in memory) and that API has worked well for the
conversions I need to make. HTML may be placed on the clipboard when
copying text and images from web browsers or by our good friend Microsoft
Word. Microsoft Word places some very "interesting" HTML on the clipboard
that needs to be massaged quite a bit before running it through revXML.
Some of the regex patterns I run on the Word HTML to strip out markup and
do things such as add quotes around attributes carry a speed hit.
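
To make the attribute-quoting step concrete, here is a minimal C sketch of that kind of cleanup, written as a single pass over the input instead of a regex. It is not the code from the application described above; `quote_attributes` and `put_char` are hypothetical names, and the sketch assumes simple markup: no spaces around `=`, no stray `<` in text, and no unquoted value directly followed by `/>`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Appends one character, keeping room for the terminating NUL. */
static int put_char(char *out, size_t out_size, size_t *n, char ch)
{
    if (*n + 1 >= out_size)
        return -1;
    out[(*n)++] = ch;
    return 0;
}

#define PUT(ch) do { if (put_char(out, out_size, &n, (ch)) != 0) return -1; } while (0)

/* Copies `in` to `out`, wrapping unquoted attribute values that appear
   inside tags in double quotes. Returns 0 on success, or -1 if `out`
   is too small. Hypothetical helper, not part of HTMLTidy or revXML. */
static int quote_attributes(const char *in, char *out, size_t out_size)
{
    size_t n = 0;
    int in_tag = 0;

    assert(in != NULL && out != NULL);
    if (out_size <= strlen(in))   /* output is never shorter than input */
        return -1;

    while (*in != '\0') {
        char c = *in++;
        PUT(c);
        if (c == '<') {
            in_tag = 1;
        } else if (c == '>') {
            in_tag = 0;
        } else if (in_tag && c == '=') {
            if (*in == '"' || *in == '\'') {
                /* Already quoted: copy through the closing quote. */
                char q = *in++;
                PUT(q);
                while (*in != '\0' && *in != q)
                    PUT(*in++);
                if (*in == q)
                    PUT(*in++);
            } else if (*in != '\0' && *in != '>' && !isspace((unsigned char)*in)) {
                /* Unquoted: wrap the value in double quotes. */
                PUT('"');
                while (*in != '\0' && *in != '>' && !isspace((unsigned char)*in))
                    PUT(*in++);
                PUT('"');
            }
        }
    }
    out[n] = '\0';
    return 0;
}
```

Every edge case left out of a pass like this is one more "gotcha" to patch by hand, which is the maintenance cost a mature library is meant to absorb.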

Given the code that I have in place already, I would prefer to leverage
HTMLTidy rather than fix every potential "gotcha" or spend time trying to
optimize the code. I'm betting that HTMLTidy can do it better and faster
given how mature it is.

--
Trevor DeVore
ScreenSteps
www.screensteps.com

Re: Help Wrapping HTMLTidy in LCB

Richard Gaskin via use-livecode
Trevor DeVore wrote:

 > HTML may be placed on the clipboard when copying text and images
 > from web browsers or by our good friend Microsoft Word. Microsoft
 > Word places some very "interesting" HTML on the clipboard that
 > needs to be massaged quite a bit before running it through revXML.

Are you suggesting Microsoft has trouble reading open and
well-documented standards?  Why, I never! ;)

--
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  [hidden email]                http://www.FourthWorld.com

Re: Help Wrapping HTMLTidy in LCB

Trevor DeVore via use-livecode
On Fri, Nov 22, 2019 at 5:31 PM Richard Gaskin via use-livecode <
[hidden email]> wrote:

> Trevor DeVore wrote:
>
>  > HTML may be placed on the clipboard when copying text and images
>  > from web browsers or by our good friend Microsoft Word. Microsoft
>  > Word places some very "interesting" HTML on the clipboard that
>  > needs to be massaged quite a bit before running it through revXML.
>
> Are you suggesting Microsoft has trouble reading open and
> well-documented standards?  Why, I never! ;)


It’s not them, it’s me. Clearly I’m expecting too much.

--
Trevor DeVore
ScreenSteps

Re: Help Wrapping HTMLTidy in LCB

hh via use-livecode
Is it really worth the work to do that from LCB?

A while ago I installed HTML tidy 5.6.0 from here
http://binaries.html-tidy.org (the Mac .dmg)

Then I copied the "tidy" binary from /usr/local/bin,
compressed, into my stack (= 231 KByte).
Now I use it from there, running it in the temporary
folder via shell(). Works fine (on any 64-bit Mac).


Re: Help Wrapping HTMLTidy in LCB

Trevor DeVore via use-livecode
On Sat, Nov 23, 2019 at 10:52 AM hh via use-livecode <
[hidden email]> wrote:

> Is it really worth the work to do that from LCB?


In my opinion, yes. If for no other reason than that with each library that
is wrapped in LCB I learn what the limitations are in LCB or I learn how to
do something that I didn’t know how to do before. There is a lot of code
out in the world that we could benefit from in LiveCode. Not all have as
nice a command line tool as HTMLTidy. Some don’t have a command line tool
at all. Having lots and lots of examples of wrapping C, Objective-C, etc.
will help more people wrap libraries and contribute to the community in the
future.

--
Trevor DeVore
ScreenSteps

Re: Help Wrapping HTMLTidy in LCB

Trevor DeVore via use-livecode
On Fri, Nov 22, 2019 at 10:30 AM Trevor DeVore <[hidden email]>
wrote:

> While looking at solutions for converting HTML into XHTML that can be
> parsed by revXML I decided to test HTMLTidy which has an option to output
> the input as XHTML. While I could bundle up the tidy command line tool and
> include it with my app, I prefer to wrap things up in LCB if possible.
>
> Unfortunately I haven't gotten very far with HTMLTidy and I'm
> hoping someone else might be able to figure out what I'm doing wrong. If
> you are up for loading up an LCB project in LC 9 on macOS and looking at
> some C files then please read on:
>

UPDATE:

I made some progress on the HTMLTidy project and this morning Mark
Waddingham and Brian Milby helped me over the last hurdle. The code base
now has a tidyHTMLToXHTML() function which works on macOS. You can try it
out using the test stack included in the repo. The code may also be of
interest to those trying to wrap other libraries.

https://github.com/trevordevore/lc-htmltidy

I will be adding the Windows DLL so that the extension works on Windows and
then trying to create a sensible API around HTMLTidy for my current needs.
I don't plan on making it feature complete at the moment as I just need it
for my own work. If someone else wants to take that up, they are welcome to.

--
Trevor DeVore
ScreenSteps
www.screensteps.com

Re: Help Wrapping HTMLTidy in LCB

Trevor DeVore via use-livecode
On Mon, Dec 9, 2019 at 12:28 PM Trevor DeVore <[hidden email]>
wrote:

> UPDATE:
>
> I made some progress on the HTMLTidy project and this morning Mark
> Waddingham and Brian Milby helped me over the last hurdle. The code base
> now has a tidyHTMLToXHTML() function which works on macOS. You can try it
> out using the test stack included in the repo. The code may also be of
> interest to those trying to wrap other libraries.
>
> https://github.com/trevordevore/lc-htmltidy
>
> I will be adding the Windows DLL so that the extension works on Windows
> and then trying to create a sensible API around HTMLTidy for my current
> needs. I don't plan on making it feature complete at the moment as I just
> need it for my own work. If someone else wants to take that up, they are
> welcome to.
>

I've added Windows support and the ability to pass boolean options so that
you can control the behavior of htmltidy when it cleanses the HTML input.
I've added some sample settings to the test stack for testing.

If anybody would like to add Linux support, it should just be a matter of
following the build instructions provided by htmltidy and adding the
resulting library to an `x86-linux` folder in the code folder:

https://github.com/trevordevore/lc-htmltidy/tree/master/code

You can find a link to the htmltidy build instructions in the project
README:

https://github.com/trevordevore/lc-htmltidy

--
Trevor DeVore
ScreenSteps
www.screensteps.com