Parsing (scraping) OpenGraph Tags from html HEAD


** Clarence P Martin ** via use-livecode
given that

a) trying to instantiate an XML tree from any given web page is likely to fail 85% of the time because they simply are never built to that strict a standard


and


b) you want to extract from the <head> of the document  the openGraph  tags

<meta property="og:site_name" content="YouTube">
<meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam">
<meta property="og:title" content="Kauai's Hindu Monastery">
<meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg">
<meta property="og:description" content="{where hinduism meets the future}">

c) you also cannot depend on the output being line delimited, because some CMS's delivery "agents" will minify this to

<meta property="og:site_name" content="YouTube"><meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam"><meta property="og:title" content="Kauai's Hindu Monastery"><meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg"><meta property="og:description" content="{where hinduism meets the future}">

Has anyone rolled up a parser/scraper for this?   Looks like "idiot simple text extraction"  but I'm trying to wrap my head around how to extract the name=value pairs, and not getting anything easy…  these are space delimited, but then we also have spaces inside quoted strings.  Maybe it's easier to target "<meta (.*?)>" using regEx with matchText, get ALL the meta tags in the HEAD, push them to an array, then just check whether the key contains "og:"; if it does, we have an openGraph value.

I'll sleep on this, but before I wake up and write 50 lines to get this done…  I see the other thread on scraping pages generated by JS and suspect perhaps some wizard among us already has this done…would save a bit of time here.

BR




_______________________________________________
use-livecode mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
Hi Swami, I know you can do this in Javascript, but you will have to enumerate through a JavaScript object to get all of the properties:

https://www.w3schools.com/jsref/prop_meta_content.asp


Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
Here's where it's handy that delimiters can now be more than a single
character. This should extract the lines you need regardless of whether
they contain carriage returns or not:


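on parseHeader pData
   local tList
   -- split the text at each meta property tag instead of at line endings
   set the lineDel to "<meta property="
   repeat for each line l in pData
      -- keep og: entries only, trimmed at the closing ">" of the tag
      if l contains "og:" then put char 1 to offset(">",l)-1 of l & cr after tList
   end repeat
   -- do something with tList
end parseHeader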




--
Jacqueline Landman Gay         |     [hidden email]
HyperActive Software           |     http://www.hyperactivesw.com



Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
" delimiters can now be more than a single character."

Hmm, that completely did not cross my mind… awesome..  

 


Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
2017-07-29 22:16 GMT+02:00 Sannyasin Brahmanathaswami:

> Has anyone rolled up a parser/scraper for this?
> Looks like "idiot simple text extraction"

Hi,

Here is a quickly written piece of code, tested only on your URL.
I wrote this regex based on the data you provided in your email.

> I see the other thread on scraping pages generated by JS and suspect
> perhaps some wizard among us already has this done…would save a bit of time
> here.

Every time you see any kind of scraping/search/extraction/transformation
in JS, you can be sure it's possible to do it in LiveCode.

So, here is the code:

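   local Rx, Rslt, _Html, OG, p1, p2, p3, p4

   put empty into Rslt
   put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html

   -- \x{22} is a double quote; capture 1 is the og: key, capture 2 its content
   put "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>" into Rx

   -- p1,p2 delimit the key and p3,p4 the value of each match;
   -- the matched text is deleted so the next pass finds the next tag
   repeat while matchChunk( _Html, Rx, p1, p2, p3, p4 )
      put char p3 to p4 of _Html into OG[ char p1 to p2 of _Html ]
      delete char 1 to p4 of _Html
   end repeat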



and you can test it this way:

   combine OG using return and ":"
   put OG into fld 1





HTH and feel free to ask any question...

Kind regards,

Thierry

--
------------------------------------------------
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage

Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
On 07/29/2017 01:16 PM, Sannyasin Brahmanathaswami via use-livecode wrote:

> <meta property="og:site_name" content="YouTube">

LOL. I guess Brahmanathaswami's been around these parts long enough by
now to have OG status.

--
  Mark Wieder
  [hidden email]


Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
Thanks Thierry

though I'm not yet sure whether using regEx is better than using Jacque's method:


on parseHeader pData
   set the lineDel to "<meta property="
   repeat for each line l in pData
      if l contains "og:" then put char 1 to offset(">",l)-1 of l & cr after tList
   end repeat
   -- do something with tList
end parseHeader

Either way it would seem prudent to extract the head first before processing

put the htmlText of widget "youtubes" into _HTML # interesting convention of underscore usage for var declaration
put  char ( offset("<head>",_HTML)) to  ( ( offset("</head>",_HTML))+6) of _html  into tHead

Jacque's method just gets the list, and we need to do more coding to get the array we need.
It returns:

"og:site_name" content="YouTube"
"og:url" content="https://www.youtube.com/user/kauaiaadheenam"
"og:title" content="Kauai&#39;s Hindu Monastery"
"og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg"
"og:description" content="{where hinduism meets the future}"
"og:type" content="profile"
"og:video:tag" content="kauai"
"og:video:tag" content="hawaii"
"og:video:tag" content="hindu"
"og:video:tag" content="hinduism"
"og:video:tag" content="siva"
# And many more tags total of 39 tags…
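
The extra coding mentioned above might look something like the following. This is only a sketch: the function name ogListToArray and its variables are made up for illustration, and it assumes every line of the list has the shape shown above, with the property name and the content value each wrapped in double quotes.

```livecode
-- Hypothetical helper, not part of Jacque's handler.
-- Expects lines shaped like:  "og:title" content="Some Title"
function ogListToArray pList
   local tArray, tLine
   -- splitting on the quote character puts the property name in item 2
   -- and the content value in item 4
   set the itemDelimiter to quote
   repeat for each line tLine in pList
      if tLine is empty then next repeat
      put item 4 of tLine into tArray[item 2 of tLine]
   end repeat
   return tArray
end ogListToArray
```

A repeated property such as og:video:tag keeps only its last value here, because each new value overwrites the same key.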

But your method can only handle 1 tag.

description:{where hinduism meets the future}
image:https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg
site_name:YouTube
title:Kauai&#39;s Hindu Monastery
type:profile
url:https://www.youtube.com/user/kauaiaadheenam
video:tag:scriptural  

# The rest of the video tags, all 38 preceding ones, are lost; "scriptural" was the last one
# and so stands as the final value for the key, because the loop effectively
# retains the single key "og:video:tag" and replaces its value 39 times,
# leaving us with only the value of the 39th tag.
# So we would need an ordered multi-dimensional array like

OG["site_name"]
# and the other top keys, then:
OG["video"]["tags"][1]  
OG["video"]["tags"][2]  
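
One way to keep every value is to number the values under each key. The following is a minimal sketch, not tested against the YouTube page; the function name ogTagsToArray and the t-prefixed variables are hypothetical, and it stores values as tOG[key][n] (for example tOG["video:tag"][1]) rather than splitting the key into the separate "video" and "tags" levels shown above.

```livecode
-- Hypothetical helper: collects every og: value, numbered per key.
function ogTagsToArray pHtml
   local tRx, tOG, tKey, tCount, p1, p2, p3, p4
   -- same pattern as in Thierry's post; \x{22} stands for a double quote
   put "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>" into tRx
   repeat while matchChunk(pHtml, tRx, p1, p2, p3, p4)
      put char p1 to p2 of pHtml into tKey
      -- count the values already stored under this key (0 for a new key)
      put the number of lines of the keys of tOG[tKey] into tCount
      put char p3 to p4 of pHtml into tOG[tKey][tCount + 1]
      -- drop the text up to the end of this match before searching again
      delete char 1 to p4 of pHtml
   end repeat
   return tOG
end ogTagsToArray
```

With this layout a single-valued property ends up at tOG["title"][1], and the nth video tag at tOG["video:tag"][n].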

But I'm not sure we need tags for the particular use case in question which is to create a robust "history" of web viewing with more detail.    OTOH, since we are coding for "Oh God" data, we may as well get all the tags into the array. This could be useful later to have this code in the toolbox for when we *do* want all the tags from the OG set… God does not like to see partial metadata, because S/He Knows All the Metadata.

BR






Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
2017-08-02 6:45 GMT+02:00 Sannyasin Brahmanathaswami:

Hi Brahmanathaswami,

> Thanks Thierry
>
> though I'm yet sure when using regEx this is better than using Jacque's
> method

Those are two different ways,
but with the regex one you have the exact key and value of each tag,
with nothing more to do.

> Either way it would seem prudent to extract the head first before processing

Mmm, I don't really see why, but I've added a line of code for this too
below.

> Using jacques method just gets the list..
> and we need to do more coding to get the array we need.
>
> But your method can only handle 1 tag.

I was aware of that but didn't know what you wanted to achieve, so I
left it to the reader.
However, this has nothing to do with the regex but with the code inside the
repeat loop.

Here is another way to do it, changing only *1* line of code inside the loop
with the same regex as before:



   -- to please BR's wishes, but not necessary:
   -- erase everything from </head> to the end of the text
   -- ((?s) lets the dot span line breaks; the greedy .* runs to the end)
   put replaceText( _Html, "(?s)</head>.*", empty) into _Html

   repeat while matchChunk( _Html, Rx, p1,p2,p3,p4 )
      put char p1 to p2 of _Html & tab & char p3 to p4 of _Html & cr after Rslt
      delete char 1 to p4 of _Html
   end repeat
   delete last char of Rslt -- extra cr

   put Rslt into fld 1
   answer "Got " & the number of lines of Rslt & " og: meta tags!"


To build a multi-dimensional array after the extraction,
a bit more work inside the repeat loop will be needed,
but the extraction part is still valid.

Finally, if you are not at ease with regex, go with Jacque's way and
everything will be fine.
Fundamentally, there is not much difference between the two ways.


Kind regards,

Thierry







Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
Responding on top

Jacque's method only gets us a list, not an array, so one ends up having to write more code to parse the list anyway; your method is more efficient.

"not comfortable with RegEx"  Ha, right. But it's worth the effort to keep the little grey cells green! I will have to study the regEx… things like ?ms are "brand new" to me.


re: extracting the head first: I was under the impression your repeat loop would have to work through the entire text of _HTML unnecessarily and that extracting the head would reduce processing time. OTOH, Andre tells me that for this kind of operation, even cell phones have CPUs that are more powerful than some desktop machines, so perhaps the time to loop through the entire html source is too trivial to consider at all.

Thanks for the effort you put into this. We are adding OG tags to all the media on our web site (eventually) and our apps will need to parse that out in various contexts.

BR



 


Re: Parsing (scraping) OpenGraph Tags from html HEAD

** Clarence P Martin ** via use-livecode
2017-08-02 17:54 GMT+02:00 Sannyasin Brahmanathaswami via use-livecode <
[hidden email]>:

> Responding on top
>
> Jacque's method only gets us a  list, not an array, so one ends up having
> to write more code to parse the list anyway, your method is more efficient.
>
> "not comfortable with RegEx"  Ha,, right. but it worth the effort to keep
> the little grey cells green! I will have to study the regEx… things like ?ms
> are "brand new" to me.

So, you win your first regex training :)

(?ms) sets two regex options:

m means multi-line: ^ and $ also match at the start and end of each line.
s means the dot ( '.' ) can also match a return/cr/lf char.
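
A small sketch of what the s option changes, using matchText on a tag whose two attributes sit on separate lines. The variables tTag, tWithout and tWith and the sample text are invented for illustration.

```livecode
local tTag, tWithout, tWith

-- a meta tag with a line break between its two attributes
put "<meta property=" & quote & "og:title" & quote & return into tTag
put "content=" & quote & "Example" & quote & ">" after tTag

-- without (?s) the dot stops at the line break, so this gives false
put matchText(tTag, "og:title.+content") into tWithout

-- with (?s) the dot also matches the line break, so this gives true
put matchText(tTag, "(?s)og:title.+content") into tWith
```

In the og: pattern from this thread, \s+ already spans a line break between the two attributes, and there is no ^ or $ for the m option to act on; the s option matters for the (.+?) captures when a value itself contains a line break.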



>
>
> re: extracting the head first: I was under the impression your repeat loop
> would have to work through the entire text of _HTML unnecessarily and that
> extracting the heads would reduce processing time.

Well, you are right, but only for the last pass, when the regex tries to
match after the last valid pattern.

What is most costly is the delete inside the loop, so working only with the
<head>...</head> of your html might be more efficient in this case. But
this is more of an LC thing.




> OTOH, Andre tells me that for this kind of operation, even cell phones
> have CPU's that are more powerful than some desktop machines and so perhaps
> the time to loop through the entire html source is too trivial to consider
> at all.

Yep. As I said, only after the last match does the regex scan through to the
end of the html, and only one time. As for quality concerns, restricting the
regex to the <head> part is a good idea, as you never know what some html
might contain in the future...
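
Restricting the work to the head could be sketched as below. The function name extractHead is hypothetical; it searches for "<head" rather than "<head>" on the assumption that the opening tag may carry attributes, and it falls back to the full text when either tag is missing.

```livecode
-- Hypothetical helper: returns the text from the opening <head tag up to
-- and including </head>, or the whole text if either tag cannot be found.
function extractHead pHtml
   local tStart, tEnd
   put offset("<head", pHtml) into tStart
   put offset("</head>", pHtml) into tEnd
   if tStart is 0 or tEnd is 0 or tEnd < tStart then return pHtml
   -- "</head>" is 7 chars long, so its last char sits at tEnd + 6
   return char tStart to (tEnd + 6) of pHtml
end extractHead
```

The result can then be passed to the repeat loop in place of the full page, so each delete works on a much shorter text.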



>
> Thanks for the effort you put into this.


You're welcome.

Kind regards,

Thierry


