ApacheCon EU 2014 has ended
ApacheCon Europe 2014 - November 17-21 in Budapest, Hungary.


Big Data
Monday, November 17
 

10:30am

Apache Tez - A New Chapter In Hadoop Data Processing - Hitesh Shah, Hortonworks
Apache Tez is a modern data processing engine designed for YARN on Apache Hadoop 2. Tez aims to provide high performance and efficiency out of the box, across the spectrum from low-latency queries to heavyweight batch processing. It provides a sophisticated topology API, advanced scheduling, concurrency control, and fault tolerance. With a clear separation between the logical application layer and the physical data movement layer, Tez is designed from the ground up to be a platform on which a variety of domain-specific applications can be built. Tez has pluggable control and data planes that allow users to plug in custom data transfer technologies, concurrency control, and scheduling policies to meet their exact requirements.

The talk will cover real use cases from adopters like Hive, Pig and Cascading and provide data to show the performance of Tez.

Speakers

Hitesh Shah

Hortonworks Inc.
Hitesh Shah currently works on various things related to Apache Hadoop at Hortonworks with his primary focus on Apache Tez and Apache Hadoop YARN. He is a PMC member and committer for the Apache Hadoop, Tez and Ambari projects. Prior to that, he spent close to a decade at Yahoo...


Monday November 17, 2014 10:30am - 11:20am
Arany

11:30am

ETL Made Simple Using Spark - Mayur Rustagi, Sigmoid Analytics
Apache Spark is growing to be the most active project in the Apache Big Data ecosystem. It unlocks the ability to perform analytics in memory and in an iterative fashion. In this talk I will highlight several customer case studies where we used various aspects of Apache Spark, from streaming and warehousing to ML. Furthermore, I will show how the seamless integration of streaming, ML and warehousing yields new opportunities for businesses to reach their data faster.

Speakers

Mayur Rustagi

CTO & Co-Founder, Sigmoid Analytics
Mayur Rustagi is the CTO and co-founder of Sigmoid Analytics. His areas of expertise include real-time Big Data analytics using open source technologies like Apache Spark, Shark and Apache Hadoop. Sigmoid Analytics has worked with over 25 customers in the Big Data space including several...


Monday November 17, 2014 11:30am - 12:20pm
Arany

1:40pm

Accelerating Big Data Application Development With Cascading - Supreet Oberoi, Concurrent, Inc.
Cascading is a Java-based application development framework for building Big Data applications on Apache Hadoop. This open source framework allows developers to leverage their existing skill sets, such as Java and SQL, to create enterprise-grade applications without having to think in MapReduce. The framework separates business logic from integration logic so that developers can quickly build and test data applications locally on their laptop and then deploy them on Hadoop. While typical enterprise data applications must cross through multiple departments and frameworks, Cascading allows multiple departments to integrate their application components into one single data processing application. In this presentation, developers will get an introduction to Cascading and how it works, and then dive into how one can build applications with Cascading.

Speakers

Supreet Oberoi

VP of Field Engineering, Concurrent, Inc.
Supreet Oberoi is a hands-on, entrepreneurial, technology leader with over two decades of experience in successfully developing transformative information technologies, and working in leadership roles at Concurrent Inc., American Express, Oracle, Microsoft and many privately-held...


Monday November 17, 2014 1:40pm - 2:30pm
Arany

2:40pm

Long-Lived YARN Services: The Future Of YARN Applications - Steve Loughran, Hortonworks
Apache Hadoop clusters are generally viewed as data analysis systems, running short- to medium-lived analysis applications, or installations of a single large application such as Apache HBase or Apache Accumulo.

There's no reason for this to be the case: you can deploy long-lived services into a Hadoop cluster, gaining access to the HDFS filesystem, availability from a fault-resilient infrastructure, shared use driven by scheduling, and the ability to integrate with other services running in the YARN cluster.

In this talk I will look at the needs of long-lived services, where YARN is today in supporting them, and where we are going next. In particular, I will explore the JIRA issue YARN-896, which is the focal point for evolving YARN's support of long-lived services, addressing needs such as security, logging and service discovery, and I will demonstrate some of this in action.

Speakers

Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache...


Monday November 17, 2014 2:40pm - 3:30pm
Arany

3:50pm

Building A Better Test Platform: A Case Study Of Improving Apache HBase Testing With Docker - Aleks Shulman & Dima Spivak, Cloudera
Cloudera Engineering has heavily incorporated Docker, an extension of Linux Containers, into our integration testing framework for Apache HBase, a distributed "NoSQL" datastore. Through the use of Docker images, we have succeeded in parameterizing the Apache Hadoop environment on which our tests are deployed. This allows us to test functionality and compatibility across a wider range of platforms in a much shorter amount of time, resulting in dramatic improvements in utilization of our computational resources. In this talk, we will present how we use Docker during test development to reduce the time it takes to write and run functional tests and to include more test configurations. We will then go in-depth on a particularly novel use case: compatibility testing. Attendees will come away with a perspective on Docker that will help them adopt it into their own test frameworks.

Speakers

Aleks Shulman

Software Engineer, Cloudera
Aleks is a Software Engineer in Test, specializing in Apache HBase and running Apache Hadoop in the Cloud. He has been at Cloudera for two years. Previously, he was at Salesforce.com, working on test automation for the Force.com Platform APIs. Before Salesforce.com, Aleks attended...

Dima Spivak

Software Engineer, Cloudera
Dima Spivak is a Software Engineer in Test, where he works on Apache HBase in particular and test frameworks in general. Before joining Cloudera, Dima was a Research Assistant in the School of Physics and Astronomy at the University of Minnesota, where he received his MS in Physi...


Monday November 17, 2014 3:50pm - 4:40pm
Arany

4:50pm

Secure Your Hadoop Cluster With Apache Sentry - Xuefu Zhang, Cloudera
Apache Hadoop users can drive adoption within their organization by implementing Apache Sentry Role Based Access Control (RBAC). This talk will discuss the current security problems in Apache Hadoop's ecosystem, with a focus on authorization primitives in the context of Apache Hive and Apache Solr, and how Apache Sentry addresses those problems. Apache Sentry's architecture and internals will also be discussed, and the Apache Sentry roadmap and latest developments will be presented.

Speakers

Xuefu Zhang

Software Engineer, Uber Technologies
Xuefu Zhang has over 10 years' experience in software development. Earlier this year he joined Uber as a software engineer from Cloudera, where he spent his main efforts on Apache Hive and Pig. He also worked in the Hadoop team at Yahoo when the majority of the development on...


Monday November 17, 2014 4:50pm - 5:40pm
Arany
 
Tuesday, November 18
 

9:00am

Cassandra (And Hadoop) Case Studies From FINN.no - Mick Semb Wever, FINN.no
FINN.no is the leading classifieds website in Norway and the country's busiest website.

This session will go through various product developments where Cassandra has proven to be the best choice, focusing on the primary use case: a tracking solution that collects raw time-series data in C* and aggregates it in near real time using Hadoop into various new datasets, from advert-centric statistics to user-centric behavioural analysis.

Mick will cover the final technical design chosen after three years of development iterations, touching on the technologies used (Scribe, Thrift, Kafka, Hadoop, Pig, Mahout), the hurdles faced along the way, the integration improvements made between Cassandra and Hadoop, and the throughput and performance of today's systems.

Speakers

Mick Semb Wever

Team Member, The Last Pickle
Mick Semb Wever works at The Last Pickle helping customers deliver and improve Apache Cassandra based solutions. Prior to TLP he spent seven years at FINN.no building their Microservices platform utilizing Apache Cassandra, Hadoop, Spark and Kafka. He is the PMC Chair for Apache Tiles...


Tuesday November 18, 2014 9:00am - 9:50am
Tohotom

9:00am

State Of Apache HBase, 1.0 Release - Nick Dimiduk, Hortonworks
The pace of innovation in HBase is rapidly increasing together with its popularity. In this talk, we will look at the development that happened over the last year, giving a user-level overview of the recently added features and releases in HBase. We will talk about the upcoming 1.0 release, which is expected to arrive in summer 2014. We will cover which release to choose, binary, wire and source compatibility considerations, and how to upgrade between releases. Specifically, we will talk about the long list of new features in recent releases, including client API changes, new PB-based Filter and Coprocessor interfaces, namespaces, per-cell ACLs, region replicas and many other features.

Speakers

Nick Dimiduk

Hortonworks
Nick Dimiduk is a committer and PMC member on both Apache HBase and Apache Phoenix. He's Release Manager for the HBase 1.1 branch and an author of the book HBase in Action, on Manning Press. Nick has also contributed to a number of Apache projects around HBase, including, HTrace...


Tuesday November 18, 2014 9:00am - 9:50am
Arany

10:00am

Introduction to Cassandra - Duy Hai, DataStax

During this talk, you'll be given a high level introduction to Cassandra and to the database mechanism. Summary of topics discussed:

  •    Architecture
  •    Data model
  •    Replication
  •    Consistency model
  •    Read/Write path
  •    Failure handling

Speakers

Duy Hai

Technical Advocate, Datastax
DuyHai is a Cassandra technical advocate. He splits his time between technical presentations and meetups on Cassandra, coding on open source projects to support the community, and helping companies using Cassandra to make their projects successful. Should you have any question on Cassandra...


Tuesday November 18, 2014 10:00am - 10:50am
Arany

11:20am

DataType API by Example - Nick Dimiduk, HBase in Action
HBase has traditionally been a simple "byte-bucket", in strict homage to the BigTable paper. HBase 0.96 introduced a new API for making HBase "data type aware". This API provides necessary encodings that preserve serialized order and have first-class support for complex rowkeys. It's also user-extensible. This session will introduce the API to developers with examples, including how to implement your own data types for HBase.

Speakers

Nick Dimiduk

Hortonworks, Inc
Nick Dimiduk is an HBase committer and an author of HBase in Action. He works on the HBase team at Hortonworks where his focus is on usability and performance. His involvement in Hadoop and HBase communities started in 2008 when his nightly ETL jobs were taking 20+ hours. Since...


Tuesday November 18, 2014 11:20am - 12:10pm
Tas

11:20am

NDFS: A Native Client For The Hadoop Distributed Filesystem - Colin McCabe, Cloudera
As the main filesystem for Hadoop, the Hadoop Distributed Filesystem (HDFS) is an important part of the big data ecosystem. Until now, however, non-Java Hadoop clients have had to deal with the JNI interface when communicating with HDFS. NDFS, our new project to create a native client for HDFS, offers many operational, performance, and practical advantages for these clients. In this presentation, I'll talk about the architecture of NDFS, the problems we solved when developing it, and our plans for the future.

Speakers

Colin McCabe

Software Engineer, Cloudera
Colin McCabe is a Platform Software Engineer at Cloudera, where he works on HDFS and related technologies. He is a committer on HDFS. Prior to joining Cloudera, he worked on the Ceph Distributed Filesystem, and the Linux kernel, among other things. He studied Computer Science and...


Tuesday November 18, 2014 11:20am - 12:10pm
Arany

1:30pm

Scalable Stream Processing With Apache Samza And Apache Kafka - Martin Kleppmann, LinkedIn
Samza, an Apache Incubator project, is a framework for processing and analysing high-volume data streams. It is built upon Apache Kafka and YARN (Hadoop 2.0). You can think of Samza as a real-time, continuously running version of MapReduce.

In this talk, Martin will show why stream processing is becoming an important part of the architecture of data-intensive applications, alongside storage and batch processing. We will explore how Samza works, and show how it reliably processes millions of messages per second. We will also examine what kinds of applications would benefit from using Samza.

Speakers

Martin Kleppmann

Software Engineer, LinkedIn
Martin is a committer on Apache Samza and Apache Avro, software engineer at LinkedIn, and author at O'Reilly (currently writing a book on designing data-intensive applications). Previously he co-founded and sold two startups, Rapportive and Go Test It. His technical blog is at http://martin.kleppmann.com...


Tuesday November 18, 2014 1:30pm - 2:20pm
Arany

2:30pm

The Flink Big Data Analytics Platform - Márton Balassi, Hungarian Academy of Sciences & Gyula Fóra
Apache Incubator Flink is a next-generation platform for big data analysis originating from the Stratosphere project (www.stratosphere.eu). Flink offers an alternative runtime engine to Hadoop MapReduce, but uses HDFS for data storage and runs on top of YARN. Flink's runtime streams data rather than processing it in batch, and uses out-of-core implementations for data-parallel processing tasks, degrading to disk if main memory is not sufficient. Flink is programmable via a Java or Scala API that includes functional operators like map, reduce, join, cogroup, and cross. Analysis logic is specified without the need of linking user-defined functions. Flink includes a cost-based program optimizer that picks data shipping strategies. Finally, Flink features support for iterative programs and graph processing programs. As a consequence, Flink is currently witnessing its first commercial use cases.

Speakers

Márton Balassi

Solutions Architect, Cloudera
Márton Balassi is a Solution Architect at Cloudera and a PMC member at Apache Flink. He focuses on Big Data application development, especially in the streaming space. Márton is a regular contributor to open source and has been a speaker at a number of Big Data related conferences...

Gyula Fóra

Researcher, Distributed Systems, SICS
Gyula is a committer and PMC member for the Apache Flink project, currently working as a researcher at the Swedish Institute of Computer Science. His main expertise and interest is real-time distributed data processing frameworks, and their connections to other big data applications...


Tuesday November 18, 2014 2:30pm - 3:20pm
Arany

3:50pm

Improving Spark Application Performance - William Benton, Red Hat
Apache Spark presents an elegant and powerful set of high-level abstractions for developing distributed data-processing applications.  Analysts who use Spark can rapidly prototype applications and experiment with new techniques at scale.  However, to make the most of Spark, developers need to understand both the abstractions and how Spark will schedule and execute their code.  

This talk will show you how to improve Spark application performance by working with, not against, Spark's operational model.  We'll start with a real prototype Spark application and apply several simple, generally applicable transformations to make it more efficient and scalable.  For each transformation, we'll look both at *why* it works, considering the relevant details of Spark's internals, and *how well* it works, considering its impact on overall application performance.  You'll leave this talk with an improved understanding of how Spark runs your code and some additional tools to make your big data apps even more efficient.

Speakers

William Benton

Senior Principal Software Engineer, Red Hat


Tuesday November 18, 2014 3:50pm - 4:40pm
Kond

3:50pm

Time Series Data With Apache Cassandra - Eric Evans, OpenNMS Group
Whether it's statistics, astronomy, finance, or network management, time series data plays a critical role in analytics and forecasting. Yet, while many tools exist, few are able to scale past memory limits; for those challenged by large volumes of data, there is much room for improvement.

Apache Cassandra is a fully distributed second-generation database. Cassandra stores data in key-sorted order making it ideal for time series, and its high throughput and linear scalability make it well suited to very large data sets.

This talk will cover some of the requirements and challenges of large-scale time series storage and analysis. Cassandra data and query modeling for this use case will be discussed, and Newts, an open source Cassandra-based time series store under development at The OpenNMS Group, will be introduced.

Speakers

Eric Evans

Senior Software Engineer, Wikimedia Foundation
Eric has more than a decade of experience with the engineering and operations of large-scale distributed systems. He joined Rackspace as a startup, and implemented a global DNS infrastructure utilizing IP anycast (possibly the first), and a novel data-center-wide IDS for which a patent...


Tuesday November 18, 2014 3:50pm - 4:40pm
Arany

4:50pm

Apache Giraph: Start Analyzing Graph Relationships In Your Big Data In 45 Minutes (Or Your Money Back)! - Roman Shaposhnik, Pivotal
The genesis of Hadoop was in analyzing massive amounts of data with a MapReduce framework. SQL-on-Hadoop followed shortly after, paving the way to the whole schema-on-read notion. Discovering graph relationships in your data is the next logical step. Apache Giraph (modeled on Google's Pregel) lets you apply the power of the BSP approach to unstructured data. In this talk we will focus on practical advice on how to get up and running with Apache Giraph, start analyzing simple data sets with built-in algorithms, and finally implement your own graph processing applications using the APIs provided by the project. We will then dive into how Giraph integrates with the Hadoop ecosystem (Hive, HBase, Accumulo, etc.) and will also provide a whirlwind tour of Giraph architecture.

Speakers

Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels


Tuesday November 18, 2014 4:50pm - 5:40pm
Arany
 
Wednesday, November 19
 

9:30am

OSv: Probably The Best OS For Cloud Workloads You've Never Heard Of - Roman Shaposhnik, Pivotal
OSv is a new open source technology that combines the power of virtualization and micro-services architecture. This combination allows unmodified applications deployed in a virtualized environment to outperform bare-metal deployments. Yes, you heard that right: for the first time ever we can stop asking how much performance we would lose by virtualizing. OSv lets you ask a different question: how much would my application gain in performance if I virtualize it? This talk will start by looking into the architecture of OSv and the kind of optimizations it makes possible for native, unmodified applications. We will then focus on JVM-specific optimizations, and specifically on speedups available to ASF projects when they are deployed on OSv.

Speakers

Roman Shaposhnik

Director of Open Source, Linux Foundation
Apache Software Foundation and Data, oh but also unikernels


Wednesday November 19, 2014 9:30am - 10:20am
Arany

10:40am

What's With The 1s And 0s? Making Sense Of Binary Data At Scale With Tika And Friends - Nick Burch, Quanticate
If you have one or two files, you can take the time to manually work out what they are, what they contain, and how to get the useful bits out (probably...). However, this approach really doesn't scale, mechanical turks or no! Luckily, there are Apache projects out there which can help!

In this talk, we'll first look at how we can work out what a given blob of 1s and 0s actually is, be it textual or binary. We'll then see how to extract common metadata from it, along with text, embedded resources, images, and maybe even the kitchen sink! We'll see how to do all of this with Apache Tika, and how to dive down to the underlying libraries (including its Apache friends like POI and PDFBox) for specialist cases. Finally, we'll look a little at how to roll this all out on a Big Data or Large-Search case.

Speakers

Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! Most of the projects Nick has worked in belong in the "Content" space, such as Apache POI (ex-PMC Chair), Apache Tika and Apache Chemistry. As well as coding projects, Nick is also involved in a number...


Wednesday November 19, 2014 10:40am - 11:30am
Arany

2:00pm

Introduction To Apache Slider - Steve Loughran, Hortonworks
With YARN, Apache Hadoop can deploy distributed applications: applications which can dynamically expand or contract their size based on demand or other factors. It allows the application to choose the placement of distributed components within the cluster, as well as their resource requirements such as CPUs and memory. As YARN evolves to support long-lived services, YARN applications can become services supporting broader uses.

Taking advantage of these features has required the application to be rewritten as a YARN application, central to which is the Application Master, a process which manages the allocation of components across the cluster, the deployment of these components, and failure handling.

The Apache Slider project can deploy distributed applications without requiring them to be ported to YARN: Slider provides the Application Master and the allocation, deplo

Speakers

Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache...


Wednesday November 19, 2014 2:00pm - 2:50pm
Arany

3:00pm

The Other Apache Technologies Your Big Data Solution Needs - Nick Burch, Quanticate
In this talk, we'll take a look at a range of projects from the Apache Software Foundation, looking at those which complement the "headline projects" to build out your big data solution. While we can't cover every project at Apache (there are just too many these days!), we'll take a tour through some of the up-and-coming and lesser-known established projects out there, those that should prove very helpful to you in building your big data solution. We'll see that Apache is more than just the webserver, Hadoop and Lucene, and with any luck point you at projects that'll save you time and effort!

Speakers

Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! Most of the projects Nick has worked in belong in the "Content" space, such as Apache POI (ex-PMC Chair), Apache Tika and Apache Chemistry. As well as coding projects, Nick is also involved in a number...


Wednesday November 19, 2014 3:00pm - 3:50pm
Arany