Indexing mail list messages

Indexing mail list messages

capellan
Hi Developers,

I've started building indexes for searching
(from a CD-RW) for keywords and phrases within the
200 MB of mail list messages.

Many of you suggested third-party software,
but I'm sure that RR is able to search for
phrases and keywords within these text files.

The files (for the RR mail list messages) range from
543 KB to 4.8 MB, and my first idea is to create
two indexes for each of the 45 mail list
text files.

The first index has a list of each
message subject submitted in that month,
followed by the line or lines where this subject
is found in the text. For example:

message subject         lines where this text appears

Subject: Gif animation  75,124,257,310,358,

Creating this index took only a few minutes for
all the files.
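
A minimal sketch of such a subject index, written in Python rather than Transcript so it can be run outside Revolution. The function names and the sample text are made up for illustration, and it assumes every subject sits on a single line starting with "Subject: ".

```python
from collections import defaultdict

def build_subject_index(text):
    """Map each 'Subject: ...' line to the 1-based line numbers where it occurs."""
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        if line.startswith("Subject: "):
            index[line.rstrip()].append(line_no)
    return dict(index)

def format_subject_index(index):
    """Render one 'subject<TAB>n,n,n' row per subject, sorted by subject."""
    return "\n".join(
        subject + "\t" + ",".join(str(n) for n in lines)
        for subject, lines in sorted(index.items())
    )

# Hypothetical three-message sample, not taken from the real archive.
sample = "\n".join([
    "From alice",
    "Subject: Gif animation",
    "How do I play a gif?",
    "From bob",
    "Subject: Re: Gif animation",
    "Use a player object.",
    "From carol",
    "Subject: Gif animation",
    "Thanks!",
])
subject_index = build_subject_index(sample)
print(format_subject_index(subject_index))
```

Because only the subject lines are stored, this index stays small no matter how long the message bodies are.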

The second index is for keywords within each
text file, using the same approach.
Unfortunately, pairing words with line numbers
this way created, in some cases, index files
bigger than the mail archive itself! :-(
For example, the June 2005 text file is only
4.8 MB, but its index is more than 5.3 MB...

After I deleted the stop words from the index
(search in Google for: "google stop words"),
it was "reduced" to 3.5 MB. Still too big for
my taste.

Which approach could I take to build a smaller
but still accurate word index for mail list archives?

Thanks in advance.

al




Visit my site:
http://www.geocities.com/capellan2000/


               
_______________________________________________
use-revolution mailing list
[hidden email]
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution

Re: Indexing mail list messages

Alex Tweedly
Alejandro Tejada wrote:

>The second index is for keywords within each
>text file, using the same approach.
>Unfortunately, pairing words with line numbers
>this way created, in some cases, index files
>bigger than the mail archive itself! :-(
>For example, the June 2005 text file is only
>4.8 MB, but its index is more than 5.3 MB...
>
>After I deleted the stop words from the index
>(search in Google for: "google stop words"),
>it was "reduced" to 3.5 MB. Still too big for
>my taste.
>
>Which approach could I take to build a smaller
>but still accurate word index for mail list archives?
>
Are you indexing every line where the word exists?
Could you instead index only the message number (or id, or first line of
the message)?

Or could you post the code / a stack to save me asking you another 50
questions...? :-)

Are you keeping the whole mbox format? Or discarding the headers you
don't need?
How many different words remain after the stop words are discarded?
How many lines are in the file? How many entries per word (min, max,
avg, mean, std dev)?
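
The figures asked for here can be read straight off a line-based index. The following is a rough Python sketch (not Transcript) under stated assumptions: words are split on whitespace only, the text contains at least one word, and the names `build_line_index` and `index_stats` are invented for this example.

```python
import statistics
from collections import defaultdict

def build_line_index(text):
    """Word -> list of 1-based line numbers, one entry per occurrence."""
    index = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            index[word].append(line_no)
    return index

def index_stats(text):
    """Summarise the number of index entries per word (text must not be empty)."""
    index = build_line_index(text)
    counts = [len(entries) for entries in index.values()]
    return {
        "lines": len(text.splitlines()),
        "distinct_words": len(counts),
        "total_entries": sum(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "std_dev": statistics.pstdev(counts),  # population standard deviation
    }

# Tiny made-up sample; a real run would read one monthly archive file.
stats = index_stats("the card script\nthe button script\nthe card")
for name, value in stats.items():
    print(name, value)
```

A large gap between the maximum and the mean usually points at a handful of very common words that are good stop-word candidates.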


--
Alex Tweedly       http://www.tweedly.net




Re: Indexing mail list messages

Brian Yennie
In reply to this post by capellan
Alejandro,

Some off-the-cuff thoughts:

* Add synonyms for common xTalk terms (cd => card, btn => button, etc)
and combine their indices
* Support some sort of stemming (or at least, combine words with their
plurals)
* Create a stop word threshold: any term which occurs in more than X%
of messages becomes a stop word and is discarded from the index.
* Index by message, not by line. You could always find the line in the
message on the fly.
* Don't index all message headers
* Don't index message footers and/or signatures
* Remove dups (i.e. if a word appears twice on a line or twice in a
message)
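
Three of the points above (index by message, remove dups, stop word threshold) can be sketched together. This is Python rather than Transcript, and it rests on assumptions that are not from the thread: each message starts at a line beginning with "From ", words are lowercased and split on whitespace, headers are not skipped, and the names `split_messages`, `build_message_index` and `max_fraction` are hypothetical.

```python
from collections import defaultdict

def split_messages(text):
    """Split mbox-style text into lists of lines, one list per message."""
    messages, current = [], []
    for line in text.splitlines():
        if line.startswith("From ") and current:
            messages.append(current)
            current = []
        current.append(line)
    if current:
        messages.append(current)
    return messages

def build_message_index(text, max_fraction=0.5):
    """Word -> sorted message numbers, each message listed once per word.

    Words found in more than max_fraction of the messages are treated
    as stop words and left out of the result.
    """
    messages = split_messages(text)
    index = defaultdict(set)
    for msg_no, lines in enumerate(messages, start=1):
        for line in lines:
            for word in line.lower().split():
                index[word].add(msg_no)  # a set drops duplicates
    limit = max_fraction * len(messages)
    return {w: sorted(ids) for w, ids in index.items() if len(ids) <= limit}

# Made-up four-message sample.
sample = "\n".join([
    "From a at x", "revolution card card script",
    "From b at x", "revolution button",
    "From c at x", "revolution card",
    "From d at x", "revolution stack",
])
message_index = build_message_index(sample)
print(message_index["card"])
```

In this sample "revolution" occurs in every message, so it falls over the 50% threshold and disappears from the index, while "card" is stored once per message even though it appears twice in the first one.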

Hope these give you some ideas.

Of course I also have a high-level question: what's wrong with just a
5 MB index on a CD-ROM? If it is just for disk space, you could compress
the index and probably get significant savings.
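
As a rough illustration of the compression idea, here is a Python sketch (not Transcript) using the standard `zlib` module. The index text is synthetic and the function names are made up, so the ratio it prints says nothing about the real archive; it only shows the round trip.

```python
import zlib

def compress_index(index_text):
    """Compress the index text with zlib at the highest compression level."""
    return zlib.compress(index_text.encode("utf-8"), 9)

def decompress_index(blob):
    """Restore the original index text from the compressed bytes."""
    return zlib.decompress(blob).decode("utf-8")

# Synthetic index: 500 made-up words, each followed by a list of numbers.
index_text = "".join(
    f"word{i}\t" + ",".join(str(n) for n in range(i, i + 200, 7)) + "\n"
    for i in range(500)
)
blob = compress_index(index_text)
print(len(index_text), "->", len(blob), "bytes")
```

The trade-off is that the index has to be decompressed before it can be searched, so this helps disk space but not search speed.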

- Brian


Re: Indexing mail list messages

capellan
In reply to this post by capellan
Alex Tweedly wrote:

> Are you indexing every line where the word exists ?

Oh yes, in my first try I was guilty of that... :-(

> Could you instead index only the message number (or
> id, or first line of the message) ?

Ah! The msg id... this is a good choice because
this specific line is not repeated when
developers reply to a message.
So there is only one msg id for every msg. :-)
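
A possible way to collect one msg id per message, sketched in Python rather than Transcript. It assumes that each message starts at a line beginning with "From ", that the header block ends at the first blank line, and that the header is spelled "Message-ID:" in any letter case; the function name and the sample are invented.

```python
def message_ids(text):
    """Return the Message-ID header of each message, in file order.

    Only the header block of each message is scanned, so IDs that are
    quoted in a body or listed under In-Reply-To are ignored.
    """
    ids = []
    in_headers = False
    for line in text.splitlines():
        if line.startswith("From "):
            in_headers = True
            ids.append(None)  # placeholder until the header is seen
        elif in_headers and line == "":
            in_headers = False  # blank line ends the header block
        elif in_headers and line.lower().startswith("message-id:"):
            ids[-1] = line.split(":", 1)[1].strip()
    return ids

# Made-up two-message sample; the second message is a reply to the first.
sample = "\n".join([
    "From a at x  Fri Jul 22 2005",
    "Message-ID: <1@x>",
    "Subject: Hi",
    "",
    "body",
    "From b at x  Fri Jul 22 2005",
    "In-Reply-To: <1@x>",
    "Message-ID: <2@x>",
    "",
    "Message-ID: <1@x> quoted in body",
])
print(message_ids(sample))
```

The position of an id in the returned list doubles as a compact message number, which is shorter to store in the word index than the id string itself.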

> Or could you post the code / a stack to save me
> asking you another 50 questions ... ? :-)

Here is the first iteration, which produced
an index larger than the indexed file.
(I'll change it to work with message numbers
instead of line numbers.)

-- start script --
on mouseUp
  -- based on Scott Raney's example code
  -- comments that start with "#" are his...

  answer file "Select a mail message text file for input:"
  if it is empty then exit mouseUp
  # let user know we're working on it
  set the cursor to watch
  put it into inputFile
  -- build the index file path by swapping the input file's
  -- 4-character extension (e.g. ".txt") for ".wndx"
  put inputFile into zvn
  put ".wndx" into char -4 to -1 of zvn
  put "file:" before zvn

  open file inputFile for read
  read from file inputFile until eof
  put it into fileContent
  close file inputFile

  -- for every word, collect the numbers of the lines it appears on
  put 0 into mylinecount
  repeat for each line w in fileContent
    add 1 to mylinecount
    repeat for each word z in w
      put mylinecount & comma after wordCount[z]
    end repeat
  end repeat

  # copy all the indexes that are in the wordCount associative array
  put keys(wordCount) into keyWords
  # sort the indexes -- keyWords contains a list of elements in array
  sort keyWords
  repeat for each line l in keyWords
    put l & tab & wordCount[l] & return after displayResult
  end repeat
  put displayResult into URL zvn

  -- look for a file with the extension *.wndx
  -- in the same location as the selected text file.
  -- This *.wndx file contains the index.

end mouseUp

-- end script --

> Are you keeping the whole mbox format ?  

Yes, completely.

> Or discarding the headers you don't need ?

No, the file is complete without change.

> How many different words remain after the stop words
> are discarded ?

Not too many words, but there are a lot
of similar words that change a little
in their endings.

> How many lines in the file ?  
> How many entries per word ?
> (min, max, avg, mean, std dev) .. ?

With the code above, and this file:
<http://mail.runrev.com/pipermail/use-revolution/2005-June.txt.gz>
the answers to these questions are available at a glance. ;-)

I'll keep building on these new ideas!
Thanks a lot for your help!

al


Re: Indexing mail list messages

capellan
In reply to this post by capellan
Hi Brian, :-)

Brian Yennie wrote:

> Some off-the-cuff thoughts:
> * Add synonyms for common xTalk terms (cd => card,
> btn => button, etc) and combine their indices

Interesting idea, I'll give more thought
to this possibility.

> * Support some sort of stemming (or at least, combine
> words with their plurals)

Yes, this is a must.
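
A very crude sketch of plural folding combined with the synonym idea, in Python rather than Transcript. It is not a real stemmer: the two suffix rules and the `SYNONYMS` table are illustrative assumptions, and the rules over-stem some words (for example "this" becomes "thi", which matters less if such words are stop words anyway).

```python
# Illustrative abbreviation table; extend it with whatever terms matter.
SYNONYMS = {"cd": "card", "btn": "button", "fld": "field"}

def fold_plural(word):
    """Strip a simple English plural ending from words longer than 3 letters."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(word):
    """Lowercase, fold plurals, then expand known abbreviations."""
    word = fold_plural(word.lower())
    return SYNONYMS.get(word, word)

def merge_index(index):
    """Combine the entries of all words that normalize to the same term."""
    merged = {}
    for word, ids in index.items():
        merged.setdefault(normalize(word), set()).update(ids)
    return {w: sorted(ids) for w, ids in merged.items()}

# Made-up index fragment: word -> message numbers.
merged = merge_index({"card": [1], "cards": [2, 3], "cd": [3], "Btn": [4]})
print(merged)
```

Merging this way shrinks the number of keys in the index, and a query word has to be passed through the same `normalize` step before lookup.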

> * Create a stop word threshold: any term which occurs
> in more than X% of messages becomes a stop word and
> is discarded from the index.

This is a good recommendation. For example,
the word "revolution" should be a stop word. :-)

> * Index by message, not by line. You could always
> find the line in the message on the fly.

Yes, Alex Tweedly made this recommendation too.

> * Don't index all message headers
> * Don't index message footers and/or signatures

The headers contain some useful info... No?

> * Remove dups (i.e. if a word appears twice on a line
> or twice in a message)

Yes, this is a must too.

> Hope these give you some ideas.

Sure they do! These are mind-opening
ideas. You can be sure that many other
ideas, probably unrelated to this task,
will come to life while working on this... :-)

Today I stumbled on an interesting idea for
a new educational game. Let's keep hoping
to raise the resources to make this game a reality!

> Of course I also have a high-level question: what's
> wrong with just a 5 MB index on a CD-ROM? If it is
> just for disk space, you could compress
> the index and probably get significant savings.

Space is not the problem; fast searching in optimized
indexes is the goal. ;-)
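
To make the speed goal concrete, here is a sketch of how a message-level index could answer keyword and phrase queries, in Python rather than Transcript. Everything in it is an assumption for illustration: the index maps lowercase words to message numbers, `messages` maps message numbers to their text, and the phrase check is a plain substring test, so it would also match inside longer words.

```python
def search_all(index, query):
    """Return the message numbers that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # Start from the rarest word so the running intersection stays small.
    postings = sorted((set(index.get(w, ())) for w in words), key=len)
    result = postings[0]
    for other in postings[1:]:
        result &= other
        if not result:
            break
    return sorted(result)

def search_phrase(index, messages, phrase):
    """Narrow down with the index, then confirm the phrase in the text."""
    needle = " ".join(phrase.lower().split())
    return [
        n for n in search_all(index, phrase)
        if needle in " ".join(messages[n].lower().split())
    ]

# Made-up messages and a small index built from them.
messages = {
    1: "How to play a gif animation on a card",
    2: "Animation of a gif is slow",
    3: "Card script question",
}
index = {}
for msg_no, text in messages.items():
    for word in set(text.lower().split()):
        index.setdefault(word, []).append(msg_no)

print(search_all(index, "gif animation"))
print(search_phrase(index, messages, "gif animation"))
```

Only the few candidate messages returned by the index are scanned for the phrase, which is where the time saving over a full-text scan of the archive would come from.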

Thanks again for your help, Brian!

al


Re: Indexing mail list messages

Dave Cragg
In reply to this post by capellan

On 22 Jul 2005, at 02:36, Alejandro Tejada wrote:

Just one small observation:

>
>
>   repeat for each line w in fileContent
>     add 1 to mylinecount
>     repeat for each word z in w
>       put mylinecount & comma after wordCount[z]
>     end repeat
>   end repeat


z will include any punctuation attached to words, so
"script", "script,", "script?", etc. will be indexed separately.
(Unless I missed the point where you accounted for this.)

Cheers
Dave

Re: Indexing mail list messages

capellan
In reply to this post by capellan
on Fri, 22 Jul 2005
Dave Cragg wrote:

> Just one small observation:
> >   repeat for each line w in fileContent
> >     add 1 to mylinecount
> >     repeat for each word z in w
> >       put mylinecount & comma after wordCount[z]
> >     end repeat
> >   end repeat
> z will include any punctuation attached to words, so
> "script", "script,", "script?", etc. will be indexed
> separately.
> (Unless I missed the point where you accounted for
> this.)

You are right. This first version of the handler
includes words with any punctuation attached.
This is wrong. :-(

I'm working on applying the advice that Alex and Brian
generously provided last night.
When I have a complete handler that implements
all their recommendations, I'll post the results.

Do you have a regex that could handle these
words with punctuation?

Thanks in advance.

al
 


Re: Indexing mail list messages

xbury.cs
Al,

To remove the RevWordDeficiency, just replace "," with " " in yourtext.

Do the same for any other punctuation, and for all the non-word characters
the RevWordDeficiency implies!

cheers
Xavier


Re: Indexing mail list messages

Dave Cragg
In reply to this post by capellan

On 22 Jul 2005, at 13:55, Alejandro Tejada wrote:
>
> Did you have a regex that could handle these
> words with punctuation?

No regex, but I've used this sequence of replace calls in the past.

     replace quote with space in tData
     replace "(" with space in tData
     replace "[" with space in tData
     replace "{" with space in tData
     replace ")" with space in tData
     replace "]" with space in tData
     replace "}" with space in tData

     replace "," with space in tData
     replace ":" with space in tData
     replace ";" with space in tData
     replace "." with space in tData

     replace "?" with space in tData
     replace "!" with space in tData

     replace "*" with space in tData
     replace "#" with space in tData
     replace "/" with space in tData
     replace "`" with space in tData


However, you should probably give this some thought. For example, if
you replace "." and "/", you will break up URLs, which may not be a
good idea in all situations. In that case, it might be better to
replace only a "." that is followed by a space or return.

   replace ". " with space & space in tData
   replace "." & return with space & return in tData

The reason for replacing with an equal number of characters is that I  
understand it's much faster. (From previous discussion on the list.)
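
The same clean-up can be sketched in Python (not Transcript) for comparison. Two choices here go beyond what is described above and are assumptions of this sketch: "/" is left alone entirely, and ":" is treated like "." (blanked only before whitespace or at the end of the text) so that "http://" survives. The names `PUNCTUATION`, `clean` and `words` are made up.

```python
import re

# Characters that are always blanked out. "." and ":" are handled by the
# regex below, and "/" is kept, so that URLs stay in one piece.
PUNCTUATION = '"()[]{},;?!*#`'
TO_SPACES = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))

def clean(text):
    """Replace punctuation with spaces, one space per character removed."""
    text = text.translate(TO_SPACES)
    # Blank a "." or ":" only when whitespace or the end of text follows.
    return re.sub(r"[.:](?=\s|$)", " ", text)

def words(text):
    """Split cleaned text into words on whitespace."""
    return clean(text).split()

sample = 'Is the "script" slow? See http://lists.runrev.com/x. Thanks:'
print(words(sample))
```

Every replacement swaps one character for one space, so the cleaned text keeps its original length and character positions still line up with the source file. "?" and "#" can also occur inside URLs, and this sketch does not protect them.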

Cheers
Dave