Harvester Roadmap
conseo
4consensus at web.de
Sat Apr 14 18:35:25 EDT 2012
Hi,
>
> 1. Extend the PipermailHarvester to track any number of forums and not just
> one.
> 2. Track the state properly for each forum (internal to PipermailHarvester).
> 3. Make this state persistent on disk (internal to PipermailHarvester).
Done.
> 4.
> Design a proper SQL table layout for the gathered messages (internal to
> DiffMessageTable).
Yesterday Mike had a few questions on IRC; I reply to them here:
<mcallan>
http://whiletaker.homeip.net/votorola/harvester/javadoc/votorola/s/wap/HarvestWAP.html
<mcallan> (1) hPoll is mandatory, but feed can show bites without filtering by
poll. how can it request that?
Ok, poll is not mandatory then.
<mcallan> (2) the purpose of "parsed date" is unclear. don't clients only
care about time of posting?
That makes sense from the application/client perspective, but parsed date is
much better for efficient access and caching. Usually the parsed date will be
close to the sent date; only the initial back-crawl of an archive inverts this
order. This means that new clients of a freshly harvested archive will see the
oldest messages first during the short period of initial feed loading (if a
client needs multiple fetches to go back over hundreds of messages, the
fetches arrive in the wrong order). But as harvesting continues, new messages
will be parsed close to their sent date and so appear first for all clients
across multiple JSON fetches.
Parsed date is the primary key, used to query sequentially for new messages
and cache them locally on the client. Sent date is not a consistent ordering,
because older messages can be added later by the back-crawl of a newly
harvested archive.
The client simply keeps a newest_parsed_date marker for each poll and can then
request all messages from that marker up to now, caching the data locally at
the same time.
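A minimal sketch of that marker-based fetching, in Python for brevity. The names Feed, Client, store, newer_than and refresh are made up for illustration, the feed is an in-memory list standing in for the real table and JSON API, parsed_date is a plain counter rather than a timestamp, and only one poll is handled instead of one marker per poll.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    parsed_date: int  # when the harvester parsed the message (assumed unique and increasing)
    sent_date: int    # when the message was originally posted
    summary: str


class Feed:
    """Hypothetical in-memory stand-in for the harvest table."""

    def __init__(self):
        self._rows = []

    def store(self, sent_date, summary):
        # A monotonic counter stands in for the parse timestamp.
        parsed_date = len(self._rows) + 1
        self._rows.append(Message(parsed_date, sent_date, summary))

    def newer_than(self, marker):
        # Sequential query on the primary key: everything parsed after the marker.
        return [m for m in self._rows if m.parsed_date > marker]


class Client:
    def __init__(self):
        self.newest_parsed_date = 0  # marker; 0 means nothing cached yet
        self.cache = []

    def refresh(self, feed):
        fresh = feed.newer_than(self.newest_parsed_date)
        if fresh:
            self.cache.extend(fresh)
            self.newest_parsed_date = max(m.parsed_date for m in fresh)
        return len(fresh)

    def display_order(self):
        # The cache is filled by parsed_date, but shown newest-sent first.
        ordered = sorted(self.cache, key=lambda m: m.sent_date, reverse=True)
        return [m.summary for m in ordered]


feed = Feed()
feed.store(sent_date=300, summary="newest post")
feed.store(sent_date=100, summary="old post found by back-crawl")

client = Client()
client.refresh(feed)

# A later back-crawl adds an older message; the marker still picks it up.
feed.store(sent_date=200, summary="another back-crawled post")
client.refresh(feed)
```

The point of the sketch is that a back-crawled message with an old sent_date is never missed, because the client asks by parsed_date and only sorts by sent_date for display.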
<mcallan> (3) the names are not consistent, which is confusing. between wap
and table there are hUser, username, and maillish-username (spelling mistake
there) - and all designate the same information. should all be same name.
likewise for hPoll, pollname, and the dates
<mcallan> (4) "db will be ok, if we only request the newest parsed bites".
almost always client will request new messages, but why is requesting old ones
a problem?
The client can cache along the same history "dimension" as the harvesting
process and will still be largely in sync. (Note again: the newest posts, by
sentDate, still appear first in the client feed.)
I am just worried about keeping this beast scalable, but maybe I am wrong? It
is potentially the biggest load in the whole architecture, so we should give
it some thought, imo.
<mcallan> (5) HarvestWAP is a "web API for the harvested messages", but that
is unclear. javadoc maybe needs a link to the harvest package
Ok, will add that.
<mcallan> (6) "seperate table for the usernames and pollnames, because many of
them will be duplicates". i don't understand. what problem is caused by
single table for harvest cache?
A single table won't store the duplicate entries efficiently. Our data is in
fact a combination of a pollname, username, base_url and difference table,
each holding only an index and the attribute. If we join these tables, the
data stored in them will not grow with the constantly increasing number of
messages, but will stay proportional to the number of users and polls. Each
message row adds an integer index, of course, but that is still far less data
than the string literals repeated over and over. (1) For the differences,
which vary with the messages, a separate table won't be as beneficial, but it
still makes sense imo. Or is this somehow built into Postgres with some
special parameters? I have had a look at views to give a common interface for
different queries, but they don't make the db commands more compact.
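To illustrate the separate-table idea, here is a rough sketch using Python's built-in sqlite3 module (not Postgres, and not the actual DiffMessageTable layout). The table names, column names, helper functions, addresses and the poll name are all assumptions made up for the example, and integer counters stand in for the timestamps.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sender (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL);
CREATE TABLE poll   (id INTEGER PRIMARY KEY, pollname TEXT UNIQUE NOT NULL);
CREATE TABLE message (
    parsed_date INTEGER PRIMARY KEY,
    sent_date   INTEGER NOT NULL,
    summary     TEXT NOT NULL,
    sender_id   INTEGER NOT NULL REFERENCES sender(id),
    poll_id     INTEGER NOT NULL REFERENCES poll(id)
);
""")


def lookup_id(table, column, value):
    # table and column are fixed identifiers from this sketch, never user input.
    db.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    row = db.execute(f"SELECT id FROM {table} WHERE {column} = ?", (value,)).fetchone()
    return row[0]


def store(parsed_date, sent_date, summary, sender, pollname):
    # Each message row carries two small integers instead of two repeated strings.
    db.execute(
        "INSERT INTO message VALUES (?, ?, ?, ?, ?)",
        (parsed_date, sent_date, summary,
         lookup_id("sender", "email", sender),
         lookup_id("poll", "pollname", pollname)),
    )


def count(table):
    return db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def fetch(parsed_date):
    # The join restores the flat view a client would expect.
    return db.execute(
        """SELECT m.summary, s.email, p.pollname
           FROM message m
           JOIN sender s ON s.id = m.sender_id
           JOIN poll p ON p.id = m.poll_id
           WHERE m.parsed_date = ?""",
        (parsed_date,),
    ).fetchone()


# 100 messages from two made-up senders in one made-up poll.
for i in range(1, 101):
    sender = "alice@example.org" if i % 2 else "bob@example.org"
    store(i, i, f"message {i}", sender, "example-poll")
```

After the loop the message table has 100 rows, while the sender and poll tables hold only one row per distinct value, which is the scaling behaviour described above.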
<mcallan> (ad 3) actually it's ok to have nice short "hUser" and elsewhere
more explicit "username" (likewise "hPoll" and "pollname"), but don't include
"mailish" in formal name. all votorola usernames are mailish
Ok. So it is username and pollname, while the URL parameters are shorter:
hPoll, hUser.
<mcallan> (7) it's maybe better to store email addresses not usernames in
table. usernames are not the canonical identifiers.
http://zelea.com/project/votorola/_/javadoc/votorola/a/voter/IDPair.html#email()
Ok, agree.
<mcallan> column name might be "sender_email" or simply "sender"
Ok.
Do we really need the difference? It is core to the information, but not
really necessary on the client side, right? We would then have:
TIMESTAMP: parsed_date
TIMESTAMP: sent_date
VARCHAR(150): summary
VARCHAR(255): sender
VARCHAR(255): addressee (opposite of difference, or should this be calculated
in the servlet through the voteserver from the difference and the sender only?
Do we need to know who is who? I am not sure because of the Biters and
filtering.)
VARCHAR(255): pollname
VARCHAR(255): url (relative to base_url)
VARCHAR(255): base_url (unique for each archive usually)
INTEGER[2][2]: difference
> 5. Configure the forums by querying the wiki on startup instead of
> hardcoding them (PipermailHarvester). This should also happen from time to
> time during runtime.
I thought about this and think it is best if the Harvesters automatically
configure new forums in response to an event from a Controller, which reads
the Wiki-Config and emits a one-time NewForumKick once it finds new forums.
Does that sound reasonable to you? The controller would also start the
Detectors. It could also be a boot routine only, plus some separate background
updater, but I think this does not diminish the case for the live-config
controller.
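Roughly, the controller idea could look like the following Python sketch. Everything here is hypothetical: the class and method names (Controller, Harvester, poll_config, on_new_forum), the callable that stands in for reading the Wiki-Config, and the example URLs. The real Harvesters and the NewForumKick event would of course live in the Java code.

```python
class Harvester:
    """Hypothetical harvester that just records the forums it was told about."""

    def __init__(self):
        self.forums = []

    def on_new_forum(self, forum_url):
        self.forums.append(forum_url)


class Controller:
    def __init__(self, read_wiki_config, harvesters):
        self._read_wiki_config = read_wiki_config  # callable returning forum URLs
        self._harvesters = harvesters
        self._known = set()

    def poll_config(self):
        """Re-read the config and kick harvesters once per newly found forum."""
        configured = set(self._read_wiki_config())
        new_forums = sorted(configured - self._known)
        for url in new_forums:
            for harvester in self._harvesters:
                harvester.on_new_forum(url)  # the one-time "NewForumKick"
        self._known |= configured
        return new_forums


# Made-up config source: a list that can change at runtime.
wiki = ["http://example.org/pipermail/a/"]
harvester = Harvester()
controller = Controller(lambda: list(wiki), [harvester])

controller.poll_config()   # startup: forum "a" is new
controller.poll_config()   # nothing changed, so no kick
wiki.append("http://example.org/pipermail/b/")
controller.poll_config()   # runtime: forum "b" is new
```

Calling poll_config once at boot and then periodically covers both the startup case and the "from time to time during runtime" case with the same code path.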
> 6. Make HarvestCache configurable. (the URL to the difference bridge)
> 7. Properly reuse TCP sessions in the IOReactor (internal to HarvestRunner).
>
conseo
(1) "and that it is free of repeating groups."
http://en.wikipedia.org/wiki/First_normal_form Actually 3NF is recommended, if
somebody wants to have a look. I will quickly go over it to avoid any big
mistakes, but I think it is still simple enough in 4-5 tables.