ApacheCon EU 2014 has ended
ApacheCon Europe 2014, November 17-21, Budapest, Hungary.
Tuesday, November 18 • 2:30pm - 3:20pm
Web Crawling With Apache Nutch - Sebastian Nagel, Exorbyte GmbH

Apache Nutch is an extensible and scalable web crawler based on Hadoop. This talk gives an overview of the crawl workflow, the crawler's main components, job execution, the underlying data structures, and how Nutch integrates with other Apache projects (Hadoop, Gora, Solr, Tika and HBase). Examples demonstrate the extensible plugin architecture and show how plugins help adapt the crawler to specific use cases.
The talk also outlines the history of the Apache Nutch project, its recent and future developments, and the two branches under active development: the stable 1.x branch and the 2.x branch, which is based on Apache Gora to abstract from storage back-ends.

Sebastian Nagel

Crawl Engineer, commoncrawl.org
Sebastian Nagel works as a crawl engineer at Common Crawl, a non-profit organization that makes web data freely accessible to everyone. Prior to joining Common Crawl, he implemented search and data quality solutions at Exorbyte. Sebastian is a committer and PMC member of Apache Nutch, a scalable...

Tuesday November 18, 2014 2:30pm - 3:20pm CET
