ApacheCon EU 2014 has ended
ApacheCon Europe 2014 - November 17-21 in Budapest, Hungary.
Lucene / Solr
Tuesday, November 18

2:30pm CET

Web Crawling With Apache Nutch - Sebastian Nagel, Exorbyte GmbH
Apache Nutch is an extensible and scalable web crawler based on Hadoop. This talk gives an overview of the crawler workflow, its main components, job execution, the underlying data structures, and how it integrates with other Apache projects (Hadoop, Gora, Solr, Tika and HBase). The extensible plugin architecture is demonstrated with examples of how plugins help adapt the crawler to specific use cases.
The history of the Apache Nutch project and its recent and future developments are outlined, as well as the two branches under active development: the stable 1.x branch and the 2.x branch, which is based on Apache Gora to abstract from storage back-ends.

Sebastian Nagel

Crawl Engineer, commoncrawl.org
Sebastian Nagel works as a crawl engineer at Common Crawl, a non-profit organization that makes web data freely accessible to everyone. Prior to joining Common Crawl, he implemented search and data quality solutions at Exorbyte. Sebastian is a committer and PMC member of Apache Nutch, a scalable...

Tuesday November 18, 2014 2:30pm - 3:20pm CET
