ApacheCon EU 2014 has ended
Register Now for ApacheCon Europe 2014 - November 17-21 in Budapest, Hungary. 

Big Data
Monday, November 17

10:30am CET

Apache Tez - A New Chapter In Hadoop Data Processing - Hitesh Shah, Hortonworks
Apache Tez is a modern data processing engine designed for YARN on Apache Hadoop 2. Tez aims to provide high performance and efficiency out of the box, across the spectrum of low latency queries and heavy-weight batch processing. It provides a sophisticated topology API, advanced scheduling and concurrency control & fault tolerance. With a clear separation between the logical app layer and the physical data movement layer, Tez is designed from the ground up to be a platform on top of which a variety of domain specific applications can be built. Tez has pluggable control and data planes that allow users to plug in custom data transfer technologies, concurrency-control and scheduling policies to meet their exact requirements.

The talk will cover real use cases from adopters like Hive, Pig and Cascading and provide data to show the performance of Tez.


Hitesh Shah

Hortonworks Inc.
Hitesh Shah currently works on various things related to Apache Hadoop at Hortonworks with his primary focus on Apache Tez and Apache Hadoop YARN. He is a PMC member and committer for the Apache Hadoop, Tez and Ambari projects. Prior to that, he spent close to a decade at Yahoo...

Monday November 17, 2014 10:30am - 11:20am CET

2:40pm CET

Long-Lived YARN Services: The Future Of YARN Applications - Steve Loughran, Hortonworks
Apache Hadoop clusters are generally viewed as data analysis systems, running short- to medium-life analysis applications, or installations of a single large application such as Apache HBase or Apache Accumulo.

There's no reason for this to be the case: you can deploy long-lived services into a Hadoop cluster, gaining access to the HDFS filesystem, availability from a fault-resilient infrastructure, shared use driven by scheduling, and the ability to integrate with other services running in the YARN cluster.

In this talk I will look at the needs of long-lived services, where YARN is today in supporting them, and where we are going next. In particular, I will explore the JIRA issue YARN-896, which is the focal point for evolving YARN's support of long-lived services, addressing needs such as security, logging and service discovery, and demonstrate some of this in action.


Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache...

Monday November 17, 2014 2:40pm - 3:30pm CET

3:50pm CET

Building A Better Test Platform: A Case Study Of Improving Apache HBase Testing With Docker - Aleks Shulman & Dima Spivak, Cloudera
Cloudera Engineering has heavily incorporated Docker, an extension of Linux Containers, into our integration testing framework for Apache HBase, a distributed "NoSQL" datastore. Through the use of Docker images, we have succeeded in parameterizing the Apache Hadoop environment on which our tests are deployed. This allows us to test functionality and compatibility across a wider range of platforms in a much shorter amount of time, resulting in dramatic improvements in utilization of our computational resources. In this talk, we will present how we use Docker during test development to reduce the time it takes to write and run functional tests and to include more test configurations. We will then go in-depth on a particularly novel use case: compatibility testing. Attendees will come away with a perspective on Docker that will help them adopt it into their own test frameworks.


Aleks Shulman

Software Engineer, Cloudera
Aleks is a Software Engineer in Test, specializing in Apache HBase and running Apache Hadoop in the Cloud. He has been at Cloudera for two years. Previously, he was at Salesforce.com, working on test automation for the Force.com Platform APIs. Before Salesforce.com, Aleks attended...

Dima Spivak

Software Engineer, Cloudera
Dima Spivak is a Software Engineer in Test, where he works on Apache HBase in particular and test frameworks in general. Before joining Cloudera, Dima was a Research Assistant in the School of Physics and Astronomy at the University of Minnesota, where he received his MS in Physi...

Monday November 17, 2014 3:50pm - 4:40pm CET

4:50pm CET

Secure Your Hadoop Cluster With Apache Sentry - Xuefu Zhang, Cloudera
Apache Hadoop users can drive adoption within their organization by implementing Apache Sentry Role Based Access Control (RBAC). This talk will discuss the current security problems in the Apache Hadoop ecosystem, with a focus on authorization primitives in the context of Apache Hive and Apache Solr, and how Apache Sentry addresses those problems. Apache Sentry's architecture and internals will also be discussed, and the Apache Sentry roadmap and latest developments will be presented.


Xuefu Zhang

Software Engineer, Uber Technologies
Xuefu Zhang has over 10 years' experience in software development. Earlier this year he joined Uber as a software engineer from Cloudera, where he focused mainly on Apache Hive and Pig. He also worked in the Hadoop team at Yahoo when the majority of the development on...

Monday November 17, 2014 4:50pm - 5:40pm CET
Tuesday, November 18

9:00am CET

State Of Apache HBase, 1.0 Release - Nick Dimiduk, Hortonworks
The pace of innovation in HBase is rapidly increasing together with its popularity. In this talk, we will look at the development that happened over the last year, giving a user-level overview of the recently added features and releases in HBase. We will talk about the upcoming 1.0 release, which is expected to arrive in summer 2014. We will cover which release to choose, binary / wire and source compatibility considerations, and how to upgrade between releases. Specifically, we will talk about the long list of new features in recent releases, including client API changes, new PB-based Filter and Coprocessor interfaces, namespaces, per-cell ACLs, region replicas and many other features.


Nick Dimiduk

Nick Dimiduk is a committer and PMC member on both Apache HBase and Apache Phoenix. He's Release Manager for the HBase 1.1 branch and an author of the book HBase in Action, on Manning Press. Nick has also contributed to a number of Apache projects around HBase, including, HTrace...

Tuesday November 18, 2014 9:00am - 9:50am CET

11:20am CET

NDFS: A Native Client For The Hadoop Distributed Filesystem - Colin McCabe, Cloudera
As the main filesystem for Hadoop, the Hadoop Distributed Filesystem is an important part of the big data ecosystem. Until now, however, non-Java Hadoop clients have had to deal with the JNI interface when communicating with HDFS. NDFS, our new project to create a native client for HDFS, offers many operational, performance, and practical advantages for these clients. In this presentation, I'll talk about the architecture of NDFS, the problems we solved when developing it, and our plans for the future.


Colin McCabe

Software Engineer, Cloudera
Colin McCabe is a Platform Software Engineer at Cloudera, where he works on HDFS and related technologies. He is a committer on HDFS. Prior to joining Cloudera, he worked on the Ceph Distributed Filesystem, and the Linux kernel, among other things. He studied Computer Science and...

Tuesday November 18, 2014 11:20am - 12:10pm CET

1:30pm CET

Scalable Stream Processing With Apache Samza And Apache Kafka - Martin Kleppmann, LinkedIn
Samza, an Apache Incubator project, is a framework for processing and analysing high-volume data streams. It is built upon Apache Kafka and YARN (Hadoop 2.0). You can think of Samza as a real-time, continuously running version of MapReduce.

In this talk, Martin will show why stream processing is becoming an important part of the architecture of data-intensive applications, alongside storage and batch processing. We will explore how Samza works, and show how it reliably processes millions of messages per second. We will also examine what kinds of applications would benefit from using Samza.


Martin Kleppmann

Researcher, University of Cambridge

Tuesday November 18, 2014 1:30pm - 2:20pm CET

2:30pm CET

The Flink Big Data Analytics Platform - Márton Balassi, Hungarian Academy of Sciences & Gyula Fóra
Apache Incubator Flink is a next-generation platform for big data analysis originating from the Stratosphere project (www.stratosphere.eu). Flink offers an alternative runtime engine to Hadoop MapReduce, but uses HDFS for data storage and runs on top of YARN. Flink's runtime streams data rather than processing it in batches, and uses out-of-core implementations for data-parallel processing tasks, degrading to disk if main memory is not sufficient. Flink is programmable via a Java or Scala API that includes functional operators like map, reduce, join, cogroup, and cross. Analysis logic is specified without the need of linking user-defined functions. Flink includes a cost-based program optimizer that picks data shipping strategies. Finally, Flink features support for iterative programs and graph processing programs. As a consequence, Flink is currently witnessing its first commercial use cases.


Márton Balassi

Solutions Architect, Cloudera
Márton Balassi is a Solution Architect at Cloudera and a PMC member at Apache Flink. He focuses on Big Data application development, especially in the streaming space. Márton is a regular contributor to open source and has been a speaker at a number of Big Data related conferences...

Gyula Fóra

Researcher, Distributed Systems, SICS
Gyula is a committer and PMC member for the Apache Flink project, currently working as a researcher at the Swedish Institute of Computer Science. His main expertise and interest is real-time distributed data processing frameworks, and their connections to other big data applications...

Tuesday November 18, 2014 2:30pm - 3:20pm CET

3:50pm CET

Improving Spark Application Performance - William Benton, Red Hat
Apache Spark presents an elegant and powerful set of high-level abstractions for developing distributed data-processing applications.  Analysts who use Spark can rapidly prototype applications and experiment with new techniques at scale.  However, to make the most of Spark, developers need to understand both the abstractions and how Spark will schedule and execute their code.  

This talk will show you how to improve Spark application performance by working with, not against, Spark's operational model.  We'll start with a real prototype Spark application and apply several simple, generally applicable transformations to make it more efficient and scalable.  For each transformation, we'll look both at *why* it works, considering the relevant details of Spark's internals, and *how well* it works, considering its impact on overall application performance.  You'll leave this talk with an improved understanding of how Spark runs your code and some additional tools to make your big data apps even more efficient.


William Benton

Engineering Manager and Senior Principal Software Engineer, Red Hat

Tuesday November 18, 2014 3:50pm - 4:40pm CET

3:50pm CET

Time Series Data With Apache Cassandra - Eric Evans, OpenNMS Group
Whether it's statistics, astronomy, finance, or network management, time series data plays a critical role in analytics and forecasting. Yet, while many tools exist, few are able to scale past memory limits. For those challenged by large volumes of data, there is much room for improvement.

Apache Cassandra is a fully distributed second-generation database. Cassandra stores data in key-sorted order making it ideal for time series, and its high throughput and linear scalability make it well suited to very large data sets.

This talk will cover some of the requirements and challenges of large-scale time series storage and analysis. Cassandra data and query modeling for this use case will be discussed, and Newts, an open source Cassandra-based time series store under development at The OpenNMS Group, will be introduced.
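
To illustrate why the key-sorted storage mentioned in this abstract suits time series, here is a minimal in-memory sketch. It is not Cassandra or Newts code: the `SortedSeriesStore` class, its method names, and the `cpu.load` metric are hypothetical, and a sorted Python list stands in for a partition whose rows are ordered by timestamp. The point is only that, once samples are kept in sorted order, a time-range query becomes two binary searches plus one contiguous slice.

```python
import bisect
from collections import defaultdict


class SortedSeriesStore:
    """Toy stand-in (hypothetical): one 'partition' per metric, rows sorted by timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # metric -> sorted timestamps
        self._values = defaultdict(list)      # metric -> values, same order

    def insert(self, metric, timestamp, value):
        # Keep each partition sorted on write, so reads never need to sort.
        ts = self._timestamps[metric]
        idx = bisect.bisect_left(ts, timestamp)
        ts.insert(idx, timestamp)
        self._values[metric].insert(idx, value)

    def range_query(self, metric, start, end):
        # Two binary searches locate the slice covering [start, end].
        ts = self._timestamps[metric]
        lo = bisect.bisect_left(ts, start)
        hi = bisect.bisect_right(ts, end)
        return list(zip(ts[lo:hi], self._values[metric][lo:hi]))


store = SortedSeriesStore()
# Samples arrive out of order; the store keeps them sorted by timestamp.
for t, v in [(30, 0.7), (10, 0.5), (20, 0.6), (40, 0.9)]:
    store.insert("cpu.load", t, v)

result = store.range_query("cpu.load", 15, 30)
print(result)
```

In this sketch the metric name plays the role of a partition key and the timestamp the role of the sort key within it; a real data model would also have to bound partition size, which the toy version ignores.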


Eric Evans

Senior Software Engineer, Wikimedia Foundation
Eric has more than a decade of experience with the engineering and operations of large-scale distributed systems. He joined Rackspace as a startup, and implemented a global DNS infrastructure utilizing IP anycast (possibly the first), and a novel data-center-wide IDS for which a patent...

Tuesday November 18, 2014 3:50pm - 4:40pm CET
Wednesday, November 19

10:40am CET

What's With The 1s And 0s? Making Sense Of Binary Data At Scale With Tika And Friends - Nick Burch, Quanticate
If you have one or two files, you can take the time to manually work out what they are, what they contain, and how to get the useful bits out (probably....). However, this approach really doesn't scale, mechanical turks or no! Luckily, there are Apache projects out there which can help!

In this talk, we'll first look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We'll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We'll see how to do all of this with Apache Tika, and how to dive down to the underlying libraries (including its Apache friends like POI and PDFBox) for specialist cases. Finally, we'll look a little bit at how to roll this all out on a Big Data or Large-Search case.
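
As a rough illustration of the first step described above, working out what a blob of bytes actually is, the sketch below checks a blob against a few well-known file signatures ("magic numbers"). It does not use Apache Tika or reproduce its detection logic: the `detect_type` function and the `MAGIC_SIGNATURES` table are hypothetical names chosen for this example, and the fallback of treating any valid UTF-8 as plain text is a simplifying assumption.

```python
# Well-known leading byte signatures for a handful of formats.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def detect_type(blob: bytes) -> str:
    """Guess a media type from the first bytes of a blob (hypothetical helper)."""
    for magic, mime in MAGIC_SIGNATURES:
        if blob.startswith(magic):
            return mime
    # Assumption for this sketch: anything that decodes as UTF-8 is text.
    try:
        blob.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


pdf_guess = detect_type(b"%PDF-1.4\n...")
text_guess = detect_type("plain old text".encode("utf-8"))
unknown_guess = detect_type(b"\xff\xfe\x00\x01")
print(pdf_guess, text_guess, unknown_guess)
```

A signature table like this only identifies the outer container: a ZIP signature, for instance, does not distinguish a plain archive from a format packaged as a ZIP, which is one reason a dedicated detection library is worth using at scale.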


Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! Most of the projects Nick has worked in belong in the "Content" space, such as Apache POI (ex-PMC Chair), Apache Tika and Apache Chemistry. As well as coding projects, Nick is also involved in a number...

Wednesday November 19, 2014 10:40am - 11:30am CET

2:00pm CET

Introduction To Apache Slider - Steve Loughran, Hortonworks
With YARN, Apache Hadoop can deploy distributed applications: applications which can dynamically expand or contract their size based on demand or other factors. It allows the application to choose the placement of distributed components within the cluster, as well as their resource requirements such as CPUs and memory. As YARN evolves to support long-lived services, YARN applications can become services supporting broader uses.

Taking advantage of these features has required the application to be rewritten as a YARN application, central to which is the Application Master, a process which manages the allocation of components across the cluster, the deployment of these components, and failure handling.

The Apache Slider project can deploy distributed applications without requiring them to be ported to YARN: Slider provides the Application Master and the allocation, deplo


Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache...

Wednesday November 19, 2014 2:00pm - 2:50pm CET
