Wednesday, November 26, 2008

Heritrix vs Nutch


Following on from the previous post, here is this one.
As I said, Heritrix made a very good impression on me. That, combined with the fact that I am no fan of Java, initially made me think Heritrix was the better product... but that is not the case.
As I noted in the previous post, the way Heritrix stores its data has its drawbacks.
So I went looking for comparisons between the two on the web; in this post I collect what I found.

From [1] (note: from 2005):
"The primary aim of Heritrix is to be an "archival crawler" --
obtaining complete, accurate, deep copies of websites. This
includes getting graphical and other non-textual content.
Resources are stored exactly as they were received -- no
truncation, encoding changes, header changes, etc.


Recrawls of the same URLs do not replace prior crawls in any
sort of running page database.


The focus has usually been dozens to hundreds of chosen
websites, but has begun to shift to tens of thousands or
hundreds of thousands of websites (entire national domains).


Crawls are launched, monitored, and adjusted via a (fairly
complex) web user interface, allowing flexible (and sometimes
downright idiosyncratic) definition of what URLs should be
visited and which should not.


My understanding is that with the Nutch crawler's alternate
aims, some of the things it does differently are:
- only retrieves and saves indexable content
- may truncate or format-shift content as needed
- saves content into a database format optimized for
later indexing; refetches replace older fetches
- run and controlled from a command-line
- emphasizes volume of collection under default conditions,
rather than exact satisfaction of custom parameters


I'm not up-to-date on the Nutch crawler, I could be missing
important features or distinctions.


It'd be nice to converge parts of the crawlers' architectures,
for example to share link-extraction or trap-detection
techniques."

What worries me is what Heritrix stores that Nutch does not. I need to run some tests on this...

From [2] (from 2008):
"Here are a (few) dimensions that may help
with your investigation:

+ The Nutch crawler operates in a stepped, batch fashion. It runs
through a list of generated URLs fetching each individual URL until the
list is done. You then generate the next list to run after running
analysis of the most recent fetch. Heritrix just runs fetching until it
runs out of URLs that fit its defined scope.
+ The Nutch crawler is a MapReduce job. This means that distribution is
just a matter of adding nodes and tasks (Tasks are retried if they fail,
etc.). Heritrix distribution is the divvying up of the crawl-space
outlined above (but from what I've heard, on big crawls folks just don't
bother with the sort-and-insert step reasoning that eventually every
individual crawler will trip over its URLs if its let run long enough).
If a heritrix instance crashes, recovery is manual involving either the
rerunning of a pseudo-crawl 'journal' or revivification using the last
'checkpoint'.
+ Heritrix has more crawling features -- knobs and switches -- than the
crawler in Nutch and it is more 'dogged' about fetching content than
Nutch given its roots in an Archiving organization.

The latter feature you may not want and regards the former, its easy
enough adding whats missing given Nutch is pluggable, open source."
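The stepped, batch cycle described in the first point can be sketched as a toy loop. This is hypothetical in-memory data, not Nutch code: the real thing runs each step as a MapReduce job over an on-disk crawl db, but the generate → fetch → update shape is the same.

```python
# Toy sketch of Nutch's generate -> fetch -> update batch cycle.
# The crawl db and link graph below are made up for the example.

# crawl db: url -> fetched?
crawl_db = {"http://example.com/": False}

# toy link graph standing in for the live web
links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def generate(db):
    """Generate the next fetch list: every known, not-yet-fetched URL."""
    return [url for url, fetched in db.items() if not fetched]

def fetch(fetch_list):
    """'Fetch' each URL, returning the outlinks found on each page."""
    return {url: links.get(url, []) for url in fetch_list}

def update(db, fetch_results):
    """Mark fetched URLs and add newly discovered ones to the crawl db."""
    for url, outlinks in fetch_results.items():
        db[url] = True
        for out in outlinks:
            db.setdefault(out, False)

# Run batches until a generate step produces an empty list --
# unlike Heritrix, which keeps fetching continuously until its
# frontier runs dry.
rounds = 0
while True:
    fetch_list = generate(crawl_db)
    if not fetch_list:
        break
    update(crawl_db, fetch(fetch_list))
    rounds += 1

print(rounds, sorted(crawl_db))
```

Note how analysis of one batch (here, the `update` step) is what produces the next fetch list; that is the "stepped" character the quote describes.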

As they point out, the way the two crawlers distribute work is completely different.
I will close the post with two diagrams showing the architectures of both crawlers:


[Figure: Nutch architecture]

[Figure: Heritrix architecture]

Notes:
  • Browsing the Heritrix API, I can see there is more than one writer 'processor'. Besides the default one, ARCWriterProcessor, we have: Kw3WriterProcessor, MirrorWriterProcessor and WARCWriterProcessor (the WARC format apparently has not made it past the experimental stage).
  • If we really want to access the crawl store from Python, Heritrix may be the better option, even if we have to write a custom WriterProcessor.
  • The following link explains how Heritrix works. Taken from the Developer Manual.
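As a rough check of the Python-access idea: MirrorWriterProcessor lays fetched pages out as ordinary files whose directory structure mirrors the URL structure, so a store in that format can be walked with the standard library alone. A minimal sketch (the directory layout and file contents below are made up for the example):

```python
import os
import tempfile

# Build a hypothetical mirror-style crawl store: directories mirror
# URL paths, the way Heritrix's MirrorWriterProcessor writes to disk.
store = tempfile.mkdtemp()
os.makedirs(os.path.join(store, "example.com", "docs"))
with open(os.path.join(store, "example.com", "index.html"), "w") as f:
    f.write("<html>home</html>")
with open(os.path.join(store, "example.com", "docs", "a.html"), "w") as f:
    f.write("<html>doc a</html>")

def iter_pages(root):
    """Yield (pseudo-URL, content) pairs for every file in the mirror."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            # Reconstruct a URL-ish key from the relative file path.
            url = "http://" + rel.replace(os.sep, "/")
            with open(path) as f:
                yield url, f.read()

pages = dict(iter_pages(store))
print(sorted(pages))
```

With the ARC format, by contrast, we would need an ARC parser on the Python side, which is part of why a custom (or mirror-style) WriterProcessor looks attractive here.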
