Harvester Roadmap

Sat Apr 14 18:35:25 EDT 2012

Hi,

> 
> 1. Extend the PipermailHarvester to track any number of forums and not just
> one.
> 2. Track the state properly for each forum (internal to PipermailHarvester).
> 3. Make this state persistent on disk (internal to PipermailHarvester). 

Done.

> 4. Design a proper SQL table layout for the gathered messages (internal to
> DiffMessageTable).

Yesterday Mike had a few questions on IRC; I reply to them here:

<mcallan> 
http://whiletaker.homeip.net/votorola/harvester/javadoc/votorola/s/wap/HarvestWAP.html
<mcallan> (1) hPoll is mandatory, but feed can show bites without filtering by 
poll.  how can it request that?
Ok, poll is not mandatory then. 

<mcallan> (2) the purpose of "parsed date" is unclear.  don't clients only 
care about time of posting?
That makes sense from the application/client perspective, but parsed date is 
much better for efficient access and caching. Usually the parsed date will be 
close to the sent date; only the initial back-crawl of an archive inverts this 
order. This means new clients on a freshly harvested feed archive will see the 
oldest messages first during this short period of initial feed loading (if 
they fetch back over hundreds of messages in multiple requests, the fetches 
arrive in the wrong order). But as harvesting continues, new messages will be 
parsed close to their sent_date, so they appear first for all clients across 
multiple JSON fetches.
Parsed date is the primary key, used to query sequentially for new messages 
and cache them locally on the client. Sent date is not consistent, as older 
messages can be added by a later back-crawl of a newly harvested archive.
The client simply keeps a newest_parsed_date marker for each poll and can then 
request all messages from then until now, caching the data locally at the same 
time.
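
To make the marker idea concrete, here is a minimal sketch in Java. All names 
in it (Bite, FeedStore, FeedClient, parsedAfter, fetchNew) are made up for 
illustration and are not the actual HarvestWAP API; FeedStore is only an 
in-memory stand-in for the servlet/database side. It also assumes parsed dates 
are unique, as a primary key would require.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical message record; fields follow the columns discussed in this mail.
final class Bite {
    final long parsedDate; // when the harvester parsed the message (millis)
    final long sentDate;   // when the message was posted (millis)
    final String pollname;
    final String summary;

    Bite(long parsedDate, long sentDate, String pollname, String summary) {
        this.parsedDate = parsedDate;
        this.sentDate = sentDate;
        this.pollname = pollname;
        this.summary = summary;
    }
}

// In-memory stand-in for the server side (assumption, not the real table).
final class FeedStore {
    private final List<Bite> bites = new ArrayList<>();

    void add(Bite b) { bites.add(b); }

    // All bites of a poll parsed strictly after the marker, oldest parse first.
    List<Bite> parsedAfter(String pollname, long marker) {
        List<Bite> result = new ArrayList<>();
        for (Bite b : bites) {
            if (b.pollname.equals(pollname) && b.parsedDate > marker) result.add(b);
        }
        result.sort((x, y) -> Long.compare(x.parsedDate, y.parsedDate));
        return result;
    }
}

// Client that keeps one newest_parsed_date marker per poll and a local cache.
final class FeedClient {
    private final Map<String, Long> newestParsedDate = new HashMap<>();
    private final List<Bite> cache = new ArrayList<>();

    // Fetches only what was parsed since the last fetch; returns the count.
    int fetchNew(FeedStore store, String pollname) {
        long marker = newestParsedDate.getOrDefault(pollname, Long.MIN_VALUE);
        List<Bite> fresh = store.parsedAfter(pollname, marker);
        for (Bite b : fresh) {
            cache.add(b);
            marker = Math.max(marker, b.parsedDate);
        }
        newestParsedDate.put(pollname, marker);
        return fresh.size();
    }

    // The feed is still displayed by sent date, newest post first.
    List<Bite> displayOrder() {
        List<Bite> sorted = new ArrayList<>(cache);
        sorted.sort((x, y) -> Long.compare(y.sentDate, x.sentDate));
        return sorted;
    }
}
```

The point of the sketch is that a message added late by a back-crawl still 
gets a fresh parsed date, so the client picks it up on the next fetch, while 
the display order stays by sent date.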

<mcallan> (3) the names are not consistent, which is confusing.  between wap 
and table there are hUser, username, and maillish-username (spelling mistake 
there) - and all designate the same information.  should all be same name.  
likewise for hPoll, pollname, and the dates
<mcallan> (4) "db will be ok, if we only request the newest parsed bites".  
almost always client will request new messages, but why is requesting old ones 
a problem?
The client can cache in the same history "dimension" as the harvesting process 
and will still be largely in sync. (Note again: the newest posts still appear 
first on the client feed, sorted by sentDate.)
I am just worried about keeping this beast scalable, but maybe I am wrong? It 
is potentially the biggest load of the whole architecture, so we should give 
this some thought imo.
<mcallan> (5) HarvestWAP is a "web API for the harvested messages", but that 
is unclear.  javadoc maybe needs a link to the harvest package

Ok, will add that.

<mcallan> (6) "seperate table for the usernames and pollnames, because many of 
them will be duplicates".  i don't understand.  what problem is caused by 
single table for harvest cache?
The database won't store duplicate entries efficiently. Our data is in fact a 
combination of a pollname, username, base_url and difference table, each of 
which has only an index and the attribute. If we join the tables, the data 
stored in these lookup tables won't grow with the constantly increasing number 
of messages, but will stay proportional to the number of users and polls. It 
adds the integer index though, of course. Still, this is far less data than 
these string literals repeated over and over again. (1) For the differences, 
which change with each message, a separate table won't be as beneficial, but 
it still makes sense imo. Or is this somehow built into Postgres with some 
special parameters? I have had a look at views to give a common interface for 
different queries, but this doesn't make the db commands more compact.
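
As a rough illustration of the lookup-table idea, here is an in-memory sketch 
in Java. NameTable is a hypothetical class, not part of the harvester; it only 
mimics what a separate username or pollname table with an integer index would 
do, so that each message row carries a small integer instead of the repeated 
string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one lookup table (e.g. usernames or pollnames).
final class NameTable {
    private final Map<String, Integer> idByName = new HashMap<>();
    private final List<String> nameById = new ArrayList<>();

    // Returns the existing index for a name, or assigns the next free one.
    int idFor(String name) {
        Integer id = idByName.get(name);
        if (id == null) {
            id = nameById.size();
            idByName.put(name, id);
            nameById.add(name);
        }
        return id;
    }

    // Resolves an index back to the string, as a join would.
    String nameFor(int id) { return nameById.get(id); }

    // Grows with the number of distinct names, not with the number of messages.
    int size() { return nameById.size(); }
}
```

However many messages a user posts, the table holds that user's name once; 
only the integer is repeated per message.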

<mcallan> (ad 3) actually it's ok to have nice short "hUser" and elsewhere 
more explicit "username" (likewise "hPoll" and "pollname"), but don't include 
"mailish" in formal name.  all votorola usernames are mailish
Ok. So username and pollname stay, and the URL parameters are the shorter 
hPoll and hUser.

<mcallan> (7) it's maybe better to store email addresses not usernames in 
table.  usernames are not the canonical identifiers.  
http://zelea.com/project/votorola/_/javadoc/votorola/a/voter/IDPair.html#email()
Ok, agree.

<mcallan> column name might be "sender_email" or simply "sender"
Ok.

Do we really need the difference? It is core to the information, but not 
really necessary on the client side, right? We then have

TIMESTAMP: parsed_date
TIMESTAMP: sent_date
VARCHAR(150): summary
VARCHAR(255): sender
VARCHAR(255): addressee (opposite of difference, or should this be calculated 
in the servlet through the voteserver from the difference and the sender only? 
Do we need to know who is who? I am not sure because of the Biters and 
filtering.)
VARCHAR(255): pollname
VARCHAR(255): url (relative to base_url)
VARCHAR(255): base_url (unique for each archive usually)
INTEGER[2][2]: difference

> 5. Configure the forums by querying the wiki on startup instead of
> hardcoding them (PipermailHarvester). This should also happen from time to
> time during runtime.

I thought about this and think it is best if the Harvesters automatically 
configure new forums on an event from a Controller, which reads the Wiki 
config and emits a one-time NewForumKick once it finds new forums. Does that 
sound reasonable to you? The Controller would also start the Detectors. It 
could also be a boot routine only, plus some separate background updater, but 
I think this does not diminish the case for a live-config controller.
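
Roughly what I have in mind, as a minimal Java sketch. ForumController, 
NewForumListener and update() are hypothetical names, and reading the wiki is 
left out entirely: the caller is assumed to pass in the forum URLs it found in 
the Wiki config, both at boot and on later periodic checks.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Implemented by a Harvester (or Detector) that wants to hear about new forums.
interface NewForumListener {
    void newForumKick(String forumUrl);
}

// Hypothetical controller: remembers known forums and kicks once per new one.
final class ForumController {
    private final Set<String> known = new HashSet<>();
    private final List<NewForumListener> listeners = new ArrayList<>();

    void addListener(NewForumListener listener) { listeners.add(listener); }

    // Call at boot and from time to time with the forums read from the config.
    // Returns the number of forums that were new in this round.
    int update(List<String> configuredForums) {
        int kicks = 0;
        for (String url : configuredForums) {
            if (known.add(url)) { // add() is true only the first time a URL is seen
                for (NewForumListener l : listeners) l.newForumKick(url);
                kicks++;
            }
        }
        return kicks;
    }
}
```

The same update() serves the boot routine and the background updater, so a 
forum that is already known never triggers a second kick.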

> 6. Make HarvestCache configurable. (the URL to the difference bridge)
> 7. Properly reuse TCP sessions in the IOReactor (internal to HarvestRunner).
> 

conseo

(1) "and that it is free of repeating groups." 
http://en.wikipedia.org/wiki/First_normal_form Actually 3NF is recommended. If 
somebody wants to have a look, fine; I will quickly go over it to avoid any big 
mistakes, but I think it is still simple enough in 4-5 tables.