We'll be adding the sessions below (which are mostly in alpha order) to the agenda grid really soon!
Introduction to HBase Session
- > Welcome to HBaseCon 2014!
- > Bigtable at Google: Yesterday, Today, and Tomorrow
- > HBase @ Salesforce.com
- > HydraBase: Facebook's Highly Available and Strongly Consistent Storage Service Based on Replicated HBase Instances
- > From MongoDB to HBase in Six Easy Months
- > Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
- > HBase Backups
- > Real-time HBase: Lessons from the Cloud
- > Smooth Operators Panel
- > The State of HBase Replication
- > Tales from the Cloudera Field
Features & Internal Track
- > Bulk Loading in the Wild: Ingesting the World's Energy Data
- > HBase at Xiaomi
- > HBase: Extreme Makeover
- > HBase Read High Availability Using Timeline-Consistent Region Replicas
- > HBase: Where Online Meets Low Latency
- > New Security Features in Apache HBase 0.98: An Operator's Guide
- > State of HBase: Meet the Release Managers
- > Cross-Site BigTable using HBase
- > Design Patterns for Building 360-degree Views with HBase and Kiji
- > HBase Data Modeling and Access Patterns with Kite SDK
- > OpenTSDB 2.0
- > Presto + HBase: A Distributed SQL Query Execution Engine on Top of HBase
- > Tasmo: Building HBase Applications From Event Streams
- > Taming HBase with Apache Phoenix and SQL
Case Studies Track
- > A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase
- > A Survey of HBase Application Archetypes
- > Blackbird: Storing Billions of Rows a Couple of Milliseconds Away
- > Content Identification using HBase
- > Data Evolution in HBase
- > Digital Library Collection Management using HBase
- > HBase at Bloomberg: High Availability Needs for the Financial Industry
- > HBase Design Patterns @ Yahoo! (20-minute session)
- > Large-scale Web Apps @ Pinterest
Introduction to HBase Session
HBase: Just the Basics - Jesse Anderson (Cloudera)
As optional pre-conference prep for attendees who are new to HBase, this talk will offer a brief Cliff's Notes-level overview of architecture, API, and schema design. The architecture section will cover the daemons and their functions; the API section will cover HBase's GET, PUT, and SCAN classes; and the schema design section will cover how HBase differs from an RDBMS and the amount of effort to place on schema and row-key design.
Welcome to HBaseCon 2014! - Michael Stack and Amr Awadallah (Cloudera)
The hosts of HBaseCon welcome the Apache HBase community to the conference and preview the day ahead.
Bigtable at Google: Yesterday, Today, and Tomorrow - Avtandil Garakanidze and Carter Page (Google)
Bigtable is the world's largest multi-purpose database, supporting 90% of Google's applications around the world. This talk provides a brief overview of Bigtable evolution since it was originally described in an OSDI '06 paper, its current use cases at Google, and future directions.
HBase @ Salesforce.com - Lars Hofhansl (Salesforce.com)
Lars explains how Salesforce.com's scalability requirements led it to HBase and the multiple use cases for Apache HBase there today. You'll also learn how Salesforce.com works with the HBase community, and get a detailed look into its operational environment.
HydraBase: Facebook's Highly Available and Strongly Consistent Storage Service Based on Replicated HBase Instances - Liyin Tang (Facebook)
HBase powers multiple mission-critical online applications at Facebook. However, providing a highly available online storage system on top of a single HDFS cluster has been challenging. HydraBase is built to provide a highly available, strongly consistent online storage service. It allows Facebook to synchronously replicate transactions across multiple geographically dispersed HBase instances and support seamless failover among HBase instances at Region-level granularity. This talk will cover the design of HydraBase, including the replication protocol, an analysis of failure scenarios, and a contribution plan to HBase.
From MongoDB to HBase in Six Easy Months - Shreeganesh Ramanan and Mike Davis (Optimizely)
Pushing well past MongoDB's limits (2TB of data every week) is an interesting exercise in operational frustration. It also severely hampers flexibility of design for new use cases. This talk covers the architectural journey from MongoDB/Redis to HBase at Optimizely -- including the performance, design flexibility, speed of implementation, and other gains made. It also covers the operational setup needed to monitor and maintain the system as well as lessons learned from the migration process itself.
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity - Dheeraj Kapur, Rajiv Chittajallu & Anish Mathew (Yahoo!)
In early 2013, Yahoo! introduced multi-tenancy to HBase to offer it as a platform service for all Hadoop users. A certain degree of customization per tenant (a user or a project) was achieved through RegionServer groups, namespaces, and customized configs for each tenant. This talk covers how to accommodate the diverse needs of individual tenants on the cluster, as well as operational tips and techniques that allow Yahoo! to automate the management of multi-tenant clusters at petabyte scale without errors.
HBase Backups - Jesse Yates (Salesforce.com), Demai Ni, Richard Ding & Jing Chen He (IBM)
This talk provides an overview of enterprise-scale backup strategies for HBase: Jesse Yates will describe how Salesforce.com runs backup and recovery on its multi-tenant, enterprise-scale HBase deployments; Demai Ni, Songqinq Ding, and Jing Chen of the IBM InfoSphere BigInsights development team will then follow with a description of IBM's recently open-sourced disaster recovery solution based on HBase snapshots and replication.
Real-time HBase: Lessons from the Cloud - Bryan Beaudreault (HubSpot)
Running HBase in real time in the cloud provides an interesting and ever-changing set of challenges -- instance types are not ideal, neighbors can degrade your performance, and instances can randomly die in unanticipated ways. This talk will cover what HubSpot has learned about running in production on Amazon EC2, how it handles DR and redundancy, and the tooling the team has found to be the most helpful.
Smooth Operators Panel - Moderated by Eric Sammer (Cloudera)
Includes Jeremy Carroll (Pinterest), Adam Frank (Flurry), and Paul Tuckfield (Facebook).
The State of HBase Replication - Jean-Daniel Cryans (Cloudera)
HBase Replication has come a long way since its inception in HBase 0.89 almost four years ago. Today, master-master and cyclic replication setups are supported; many bug fixes and new features like log compression, per-family peers configuration, and throttling have been added; and a major refactoring has been done. This presentation will recap the work done during the past four years, present a few use cases that are currently in production, and take a look at the roadmap.
Tales from the Cloudera Field - Kevin O'Dell, Aleksandr Shulman & Kathleen Ting (Cloudera)
From supporting the 0.90.x, 0.92, 0.94, and 0.96 HBase installations on clusters ranging from tens to hundreds of nodes, Cloudera has seen it all. Having automated the upgrade paths from the different Apache releases, we have developed a smooth path that can help the community with upcoming upgrades. In addition to automation best practices, in this talk you'll also learn proactive configuration tweaks and operational best practices to keep your HBase cluster always up and running. We'll also walk through how to contain an application bug let loose in production, how to minimize the impact of faulty hardware on HBase, and the direct correlation between inefficient schema design and HBase performance.
Features & Internal Track
Bulk Loading in the Wild: Ingesting the World's Energy Data - Eric Chang (Opower) and Jean-Daniel Cryans (Cloudera)
HBase is designed to store your big data and provide low latency random access to that data. One of its most compelling features is Bulk Loading, which enables the generation of HFiles that can then be passed to the RegionServers. Opower's energy insights platform uses it to ingest the hundreds of millions of meter reads it receives daily from its partner utility companies. This presentation will walk you through the HBase Bulk Loading process and Opower's adoption of it as an important piece of its HBase ecosystem.
HBase at Xiaomi - Liang Xie and Honghua Feng (Xiaomi)
This talk covers the HBase environment at Xiaomi, including thoughts and practices around latency, hardware/OS/VM configuration, GC tuning, the use of a new write thread model and reverse scan, and block index optimization. It will also include some discussion of planned JIRAs based on these approaches.
HBase: Extreme Makeover - Vladimir Rodionov (bigbase.org)
This talk introduces a totally new implementation of multilayer caching in HBase called BigBase. BigBase has a big advantage over HBase 0.94/0.96 because of its ability to utilize all available server RAM in the most efficient way, and because of a novel implementation of an L3-level cache on fast SSDs. The talk will show that different types of caches in BigBase work best for different types of workloads, and that a combination of these caches (L1/L2/L3) increases the overall performance of HBase by a very wide margin.
HBase Read High Availability Using Timeline-Consistent Region Replicas - Enis Soztutar and Devaraj Das (Hortonworks)
HBase has ACID semantics within a row that make it a perfect candidate for a lot of real-time serving workloads. However, single homing a region to a server implies some periods of unavailability for the regions after a server crash. Although the mean time to recovery has improved a lot recently, for some use cases, it is still preferable to do possibly stale reads while the region is recovering. In this talk, you will get an overview of our design and implementation of region replicas in HBase, which provide timeline-consistent reads even when the primary region is unavailable or busy.
HBase: Where Online Meets Low Latency - Nick Dimiduk (Hortonworks) and Nicolas Liochon (Scaled Risk)
HBase is an online database so response latency is critical. This talk will examine sources of latency in HBase, detailing steps along the read and write paths. We'll examine the entire request lifecycle, from client to server and back again. We'll also look at the different factors that impact latency, including GC, cache misses, and system failures. Finally, the talk will highlight some of the work done in 0.96+ to improve the reliability of HBase.
New Security Features in Apache HBase 0.98: An Operator's Guide - Andrew Purtell and Ramkrishna Vasudevan (Intel)
HBase 0.98 introduces several new security features: visibility labels, cell ACLs, transparent encryption, and coprocessor framework changes. This talk will cover the new capabilities available in HBase 0.98+, the threat models and use cases they cover, how these features stack up against other data stores in the Apache big data ecosystem, and how operators and security architects can take advantage of them.
State of HBase: Meet the Release Managers
HBase release managers Lars Hofhansl, Andrew Purtell, Enis Soztutar, Michael Stack, and Liyin Tang jointly present highlights from their releases, and take your questions throughout.
Cross-Site BigTable using HBase - Jingcheng Du and Ramkrishna Vasudevan (Intel)
As HBase continues to expand in application and enterprise or government deployments, there is a growing demand for storing data across geographically distributed datacenters for improved availability and disaster recovery. The Cross-Site BigTable extends HBase to make it well-suited for such deployments, providing the capabilities of creating and accessing HBase tables that are partitioned and asynchronously backed up over a number of distributed datacenters. This talk reveals how the Cross-Site BigTable manages data access over multiple datacenters and removes the datacenter itself as a single point of failure in geographically distributed HBase deployments.
Design Patterns for Building 360-degree Views with HBase and Kiji - Jonathan Natkins (WibiData)
Many companies aspire to have 360-degree views of their data. Whether they're concerned about customers, users, accounts, or more abstract things like sensors, organizations are focused on developing capabilities for analyzing all the data they have about these entities. This talk will introduce the concept of entity-centric storage, discuss what it means, what it enables for businesses, and how to develop an entity-centric system using the open-source Kiji framework and HBase. It will also compare and contrast traditional methods of building a 360-degree view on a relational database versus building against a distributed key-value store, and why HBase is a good choice for implementing an entity-centric system.
HBase Data Modeling and Access Patterns with Kite SDK - Adam Warrington (Cloudera)
The Kite SDK is a set of libraries and tools focused on making it easier to build systems on top of the Hadoop ecosystem. HBase support has recently been added to the Kite SDK Data Module, which allows a developer to model and access data in HBase consistent with how they would model data in HDFS using Kite. This talk will focus on Kite's HBase support by covering Kite basics and moving through the specifics of working with HBase as a data source. This feature overview will be supplemented by specifics of how that feature is being used in production applications at Cloudera.
OpenTSDB 2.0 - Chris Larsen (Limelight Networks) and Benoit Sigoure (Arista Networks)
The OpenTSDB community continues to grow, with users looking to store massive amounts of time-series data in a scalable manner. In this talk, we will discuss a number of use cases and best practices around naming schemas and HBase configuration. We will also review OpenTSDB 2.0's new features, including the HTTP API, plugins, annotations, millisecond support, and metadata, as well as what's next in the roadmap.
Presto + HBase: A Distributed SQL Query Execution Engine on Top of HBase - Manukranth Kolloju (Facebook)
Presto is a distributed SQL query engine optimized for ad hoc analysis at interactive speed in use at Facebook. At Facebook scale, having ad hoc SQL query capabilities for high-volume NoSQL data stores has been a very valuable asset, and Presto enabled this by supporting connectors on top of HDFS and other data providers. To effectively process the Presto SQL-based workload, HBase needs to be able to efficiently support a critical set of data access patterns over large data sets with high performance. This talk covers the improvements we've made to enhance scan performance and optimize the read path, as well as a number of other new features that help push down the work from the query execution to the database.
Tasmo: Building HBase Applications From Event Streams - Pete Matern and Jonathan Colt (Jive Software)
Tasmo is a system that enables application development on top of event streams and HBase. Its functionality is similar to a materialized view in a relational database, where data is maintained at write time in the forms it is needed at read time for display and indexing. Tasmo is designed for significantly read-heavy applications that display the same underlying data in multiple forms, where repeatedly performing the required selects and joins at read time can be prohibitively expensive. In this talk, we'll explore the features and roadmap for Tasmo.
Taming HBase with Apache Phoenix and SQL - Eli Levine, James Taylor (Salesforce.com) & Maryann Xue (Intel)
HBase is the Turing machine of the Big Data world. It's been scientifically proven that you can do *anything* with it. This is, of course, a blessing and a curse, as there are so many different ways to implement a solution. Apache Phoenix (incubating), the SQL engine over HBase, to the rescue. Come learn about the fundamentals of Phoenix and how it hides the complexities of HBase while giving you optimal performance, and hear about new features from our recent release, including updatable views that share the same physical HBase table and n-way equi-joins through a broadcast hash join mechanism. We'll conclude with a discussion about our roadmap and plans to implement cost-based query optimization to dynamically adapt query execution based on your data sizes.
Case Studies Track
A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase - Chris Huang and Scott Miao (Trend Micro)
Trend Micro collects lots of threat knowledge data for clients containing many different threat (web) entities. Most threat entities will be observed along with relations, such as malicious behaviors or interaction chains among them. So, we built a graph model on HBase to store all the known threat entities and their relationships, allowing clients to query threat relationships via any given threat entity. This presentation covers the problems we are trying to solve, the design decisions we made and how we made them, how we designed the graph model, and the graph computation tasks involved.
A Survey of HBase Application Archetypes - Lars George and Jon Hsieh (Cloudera)
Today, there are hundreds of production HBase clusters running a multitude of applications and use cases. Many well-known implementations exercise opposite ends of HBase's capabilities, emphasizing either entity-centric schemas or event-based schemas. This talk presents these archetypes and others based on a use-case survey of clusters conducted by Cloudera's development, product, and services teams. By analyzing the data from the nearly 20,000 HBase cluster nodes Cloudera has under management, we'll categorize HBase users and their use cases into a few simple archetypes, describe workload patterns, and quantify the usage of advanced features. We'll also explain what an HBase user can do to alleviate pressure points from these fundamentally different workloads, and these results will provide insight into what lies in HBase's future.
Blackbird: Storing Billions of Rows a Couple of Milliseconds Away - Ishan Chhabra, Shrijeet Paliwal & Abhijit Pol (Rocket Fuel)
Would you use HBase to make billions of rows available for real-time lookup under 10 ms with a 99% guarantee? We, at Rocket Fuel, do just that. Blackbird, our system built on top of HBase, makes billions of rich user profiles available for AI-based optimization under the tight latency requirements of real-time auctions. It relies on our novel collections API, a constrained yet useful append-only model that is sympathetic to HBase internals and allows us to scale our writes easily while keeping strict read performance guarantees. In this talk, we describe the key abstractions Blackbird exposes, the utilities we built over time to support our use cases, and the hardware and software configuration (including HBase configs) that helps us achieve our strict latency guarantees. We also share the key challenges and lessons learned in scaling the system tenfold in a short span of time, and some common beginner mistakes that we made and later fixed, which you should avoid.
Content Identification using HBase (20-minute session) - Daniel Nelson (Nielsen)
The motivation behind content identification is to determine the media people are consuming (via TV shows, movies, or streaming). Nielsen collects that data via its Fingerprints system, which generates significant amounts of structured data that is stored in HBase. This presentation will review the options a developer has for HBase querying and retrieval of hash data. Also covered is the use of wire protocols (Protocol Buffers), and how they can improve network efficiency and throughput, especially when combined with an HBase coprocessor.
Data Evolution in HBase - Eric Czech and Alec Zopf (Next Big Sound)
Managing the evolution of data within HBase over time is not easy: Data resulting from Hadoop processing pipelines or otherwise placed in HBase is subject to the same kinds of oversights, bugs, and faulty assumptions inherent to the software that creates it. While the development of this software is often effectively managed through revision control systems, data itself is rarely modeled in a way that affords the same flexibility. In this session, we'll talk about how to build a versioned, time-series data store using HBase that can provide significantly greater adaptability and performance than similar systems.
Digital Library Collection Management using HBase (20-minute session) - Ron Buckley (OCLC)
OCLC has been working over the last year to move its massive repository to HBase. This talk will focus on the impetus behind the move, implementation details and technology choices we've made (key design, shredding PDFs and other digital objects into HBase, scaling), and the value-add that HBase brings to digital collection management.
HBase at Bloomberg: High Availability Needs for the Financial Industry (20-minute session) - Sudarshan Kadambi and Matthew Hunt (Bloomberg LP)
Bloomberg is a financial data and analytics provider, so data management is core to what we do. There's tremendous diversity in the type of data we manage, and HBase is a natural fit for many of these datasets, from the perspective of the data model as well as in terms of a scalable, distributed database. This talk covers data and analytics use cases at Bloomberg and operational challenges around HA. We'll explore the work currently being done under HBASE-10070, further extensions to it, and how this solution is qualitatively different from how failover is handled by Apache Cassandra.
HBase Design Patterns @ Yahoo! (20-minute session) - Francis Liu (Yahoo!)
HBase's introduction into the Yahoo! Grid has provided our users with new ways to process and store data. A year after its availability, there have been varied uses: event processing for personalization, incremental processing for ingestion, time-based aggregations for analytics, etc. All these were possible thanks to features HBase brings beyond working with HDFS files. This talk will review some recurring HBase design patterns at Yahoo! as well as share our learnings and experiences.
Large-scale Web Apps @ Pinterest - Varun Sharma (Pinterest)
Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.