Register Now for ApacheCon Europe 2014 - November 17-21 in Budapest, Hungary. 

Tuesday, November 18 • 2:30pm - 3:20pm
Web Crawling With Apache Nutch - Sebastian Nagel, Exorbyte GmbH

Apache Nutch is an extensible and scalable web crawler based on Hadoop. This talk gives an overview of the crawler workflow, its main components, job execution, the underlying data structures, and how it integrates with other Apache projects (Hadoop, Gora, Solr, Tika and HBase). The extensible plugin architecture is demonstrated with examples of how plugins help adapt the crawler to specific use cases.
The history of the Apache Nutch project and its recent and future developments are outlined, as well as the two branches under active development: the stable 1.x branch and the 2.x branch, which is based on Apache Gora to abstract from storage back-ends.

Speakers
Sebastian Nagel

Crawl Engineer, commoncrawl.org
Sebastian Nagel works as a crawl engineer at Common Crawl, a non-profit organization that makes web data freely accessible to everyone. Prior to joining Common Crawl, he implemented search and data quality solutions at Exorbyte. Sebastian is a committer and PMC member of Apache Nutch, a scalable web crawler, and presented the project at ApacheCon 2014.


Tuesday November 18, 2014 2:30pm - 3:20pm
Elod/Ond
