What Happened at Data Platforms 2017?

Wednesday May 24, 2017

11:00 am – 6:00 pm
Registration & Networking
1:00 pm – 5:00 pm
On-Site office hours with a Solutions Architect (Aztec C)
1:00 pm – 6:00 pm
QuMbaya: Networking Lounge
1:00 pm – 3:00 pm
Using Qubole Data Service 101
2:00 pm – 4:00 pm
Qubole Data Service Workshop – From Ingestion to Insights in 120
3:30 pm – 5:30 pm
Using Qubole Data Service 201 – Choosing your Big Data SQL Engine
6:30 pm – 8:30 pm
Welcome Reception

Qubole Co-Founders Ashish Thusoo and Joydeep Sen Sarma welcome you to Data Platforms 2017 to kick off this inaugural event. Pick up a copy of their new book, “Creating a Data Driven Enterprise with DataOps: Insights from Facebook, Uber, LinkedIn, Twitter and eBay,” published by O’Reilly Media and released May 24!

Enjoy cocktails & conversation.

Thursday May 25, 2017

7:00 am – 8:30 am
8:30 am – 9:30 am
DataOps & The Modern (Big) Data Platform

Ashish Thusoo will discuss his role in creating the first modern big data platform at Facebook, as well as insights from the new book. Joining Ashish will be book contributors and big data pioneers Shrikanth Shankar of LinkedIn and Karthik Ramasamy, Cofounder of Streamlio, formerly at Twitter, who will share how they led their organizations through similar transformations to become data-driven businesses. They will cover what they did, how they did it, and the lessons learned on their journeys.

9:30 am – 10:15 am
Building The Modern Data Platform

When we were growing up….

From startups to enterprises, industry leaders will discuss the growth aspects and challenges from various stages along the way. Which of the challenges led to technological innovation within the organization and/or adoption of newer tools and technologies and why?


Karthik Subramaniam, Data Platform Lead, Data Science & Engineering, Under Armour Connected Fitness

Oskar Austegard, Director, Data Solutions, Gannett

Colin Riddell, Senior Data Architect, Epic Games

Wade Warren, VP Engineering, Wikia

Tripp Smith, Clarity Insights

Rakesh Soni, Intersys Consulting

Moderated by: Andy Sautins, Technical Manager, Google

10:15 am – 10:30 am
10:30 am – 11:00 am
The Intersection of Cloud and Big Data… What’s Next?

Experience the future! Join Joydeep Sen Sarma and team for some exciting announcements and cool demos!

11:00 am – 12:00 pm
Big Data in the Clouds

Industry visionaries from Amazon, Microsoft and Oracle share their views on the future and promise of the next wave of cloud computing. Hear from:

John Alioto, CTO, DX Technology Evangelism & Startups, Microsoft
Jeff Barr, Chief Evangelist, Amazon Web Services
Vinay Kumar, Senior Director of Product Management, Oracle

12:00 pm – 1:15 pm
1:30 pm – 4:30 pm
Tech Talks

Practitioners share best practices, techniques, challenges and solutions in these deep dive sessions on the technical, organizational and cultural aspects of building modern, big data platforms.

1:30 pm – 2:15 pm Session I
Playing Offense with The Data Platform — From On-Premise to The Cloud

Speaker: Santanu Dey, Director of Data Science and Engineering, Fanatics Inc.

Over the last two years, Fanatics Inc., the global leader in licensed sports merchandise, has gone through major technology transformations, especially in data: it moved from on-premises to the cloud and changed how data is strategically used to power its e-commerce and back-end supply chain systems. From the very start, we expected the data platform to be exposed via a set of data and analytical web services that act as a brain to provide a delightful customer experience, whether by ranking the relevant products or recommending the most interesting ones.

As web visitors browse pages and transact, all of the events they generate flow through Fanflow, a Kafka messaging system. These events are then aggregated by Flink and Spark Streaming consumers, and machine-learned models adapt and react to the changes in behavior evident in these events. Setting up data as services and blurring the line with the application stack to provide business metrics and metadata, in addition to storing them in a traditional warehouse or data lake, made all the difference. In this session, we will dive deep into a couple of these data services and discuss how you may benefit from implementing similar patterns.

1:30 pm – 2:15 pm Session I
Industrializing Data Science Workflows

Speaker: Sean Downes, Senior Data Scientist, Expedia

Discover the evolution of data science workflows implemented at Expedia with a special emphasis on Learning to Rank problems. This session will explore the process of industrializing the data science workflow and best practices on how to keep your data productive, or even pull your organization out of the data swamp.

1:30 pm – 2:15 pm Session I
Virtualizing Big Data in the Cloud

Speaker: Kellyn Pot’Vin-Gorman, Technical Intelligence Manager of CTO, Delphix

Big Data encompasses a large landscape, and building in the cloud can introduce additional, unique challenges. Two of the primary ones are cost and storage. Join Kellyn as she discusses how virtualizing multiple tiers of the big data landscape can save costs. She will review real use cases, methods of discovery for achieving success, and the technical specifications behind different big data platforms when engaging virtualization where data is big and platforms are vast.

1:30 pm – 2:15 pm Session I
How We Built a Scalable, Real-time User Targeting System

Speaker: Sriranjan Manjunath, CTO and Head of Engineering @ Saavn

Saavn is India’s leading music streaming service. Since context is key to music, we have built a system called Sniper that lets us identify cohorts of users in real time and target them for marketing, advertising and recommendation purposes. This system allows us to understand user behavior by quantifying engagement characteristics such as stream consumption, affinities or ads. Speed and scalability are critical to its design. This talk will cover our motivations for building such a system and how big data technologies have helped us architect it.

1:30 pm – 2:15 pm Session I
Power your Big Data Infrastructure with Data Intelligence for Analytics and Data Operations

Speaker: Balaji Mohanam, Product Manager @ Qubole

Discover the newly launched features in Qubole, powered by Data Intelligence, that automate mundane Data Model performance appraisal and simplify Data Ops. This session will provide a detailed walkthrough of Qubole’s latest offering in Data Intelligence, which includes Data Model insights and recommendations on Partitioning, Formatting and Sorting that help optimize data models for improved performance and use of computing resources. In addition, learn about Qubole’s latest offering in self-service analytics and how it can improve analysts’ productivity by making data discovery easy through column and table name auto-suggestion and completion, and insights preview.

2:30 pm – 3:15 pm Session II
Designing and Optimizing Big Data Sets for Visualization, Analysis and Exploration

Speaker: Steve Gotlieb, Data Pipeline Architect, Product Delivery and Analytics Services Team, Autodesk

This presentation will discuss tips and tricks on how to build data sets optimized for Apache Spark. You will learn different aggregation strategies, granularity required for a data set, optimal file formats, partitioning strategies, and when to leverage caching to improve performance.

2:30 pm – 3:15 pm Session II
Data Science Stack in the Cloud

Speaker: Evan Harris, Data Scientist, Return Path

Journey from exploration and visualization to machine learning and natural language processing. Discover how Return Path built a cloud-based, production-ready, enterprise-scale data solution without a dedicated DevOps team. Leveraging modern distributed computing frameworks like Spark and managed services like EMR and Qubole was key to the process.

2:30 pm – 3:15 pm Session II
A Data Platform To Enable Intelligent Features

Speaker: Karthik Subramaniam, Data Platform Lead, Data Science & Engineering, Under Armour Connected Fitness

Under Armour built the world’s most comprehensive health and wellness application: the Connected Fitness Data Platform. It consists of event streaming pipelines and processing using big data technologies like Hive, Presto and Spark to derive the insights needed to keep its users fit and healthy. Discover the step-by-step process.

2:30 pm – 3:15 pm Session II
A Case Study – Oracle Data Cloud & Heterogeneous Clusters

Speaker: Justin Wainwright, Systems Analyst, Oracle

Infrastructure planning and implementing cluster best practices can lead to significant cost savings. In this in-depth session, discover the benefits Oracle has reaped from using heterogeneous vs. homogeneous clusters.

2:30 pm – 3:15 pm Session II
Go Further, Faster with a Data Lake in Amazon Web Services

Speaker: Mick Bass, CEO, 47Lining

Increasingly, valuable customer data sources are dispersed among on-premises systems, SaaS providers, partners, third-party data providers and public data sets. Data Lakes in the cloud are the foundation for storing on-premises, third-party and public data sets at attractive price/performance. Atop this foundation, a portfolio of descriptive, predictive and real-time agile analytics can empower customers to answer their most important business questions. In this talk, 47Lining CEO Mick Bass will walk attendees through a best-practices data lake reference architecture in AWS and share real-world customer use cases such as predicting customer churn and propensity to buy, detecting fraud, optimizing industrial processes and recommending content.

2:30 pm – 3:15 pm Session II
Data Operations: Or How I Learned to Stop Data Wrangling and Love Machine Learning

Speaker: Saket Saurabh, Co-Founder and CEO, Nexla

Behind all the glory of AI and machine learning advancements is the work of data operations. Ensuring that the right data is available to the right system in the right form can be as much as 80% of the overall workload. But don’t take our word for it: come hear the results of the first-ever DataOps professionals survey. In this session you’ll learn how the world’s leading companies are organizing their DataOps teams, what systems they are using, and how managers can better support their efforts. Learn how Nexla uses machine learning to automate common tasks such as data source monitoring, schema management, and data quality management. Leave the session knowing how you’ll spend more time analyzing your data and less time wrangling it.

3:30 pm – 4:15 pm Session III
The Oracle Data Cloud: A Journey from Hive to Spark / Lessons Learned

Speaker: Alex Sadovsky, Director of Data Science, Oracle Data Cloud

At the Oracle Data Cloud, petabytes are processed to create custom, targeted, online advertising. Speed, throughput, and scalability are the three core metrics upon which architecture effectiveness is measured.

The Oracle Data Cloud has moved from on-premises, single-machine data processing to cloud-based Hive and, ultimately, Spark solutions. This talk covers the challenges along the way: successes and failures, and where the easy Hive to Spark SQL (and beyond) translations did not work quite as advertised. The Oracle Data Cloud has improved the speed of processes by over 300% and eliminated some gotchas that caused months of confusion.

3:30 pm – 4:15 pm Session III
Building a Culture, Organization, and Infrastructure to Support a Self-Service Big Data Model in the Cloud

Speaker: Dale Treece, Senior Solutions Architect and Engineering Lead, Digital Data Services, Scripps Networks Interactive

For years IT has been tasked to produce, gather, and store large volumes of internal and external data. We’re now engaged in developing the infrastructure to support analysis of that data. A cloud-based, self-service, big-data model can be the answer and provide numerous benefits and efficiencies. But with those benefits, there are cultural, organizational and architectural hurdles to clear. We will discuss the challenges we faced at Scripps Networks Interactive, and the successful team and architectural outcomes that emerged.

3:30 pm – 4:15 pm Session III
Billion Dollar Waste: Using ORC files in places where most of the tech world is still stuck with bundles of TSV/CSV

Speaker: Charles Pritchard, Big Data Expert, Jumis

Simple decisions in data lead to extra work for developers. Ingesting files (in CSV, TSV and similar formats) and getting it right without losing data is an expensive proposition. Compare this to receiving a file that has been organized into a schema and an optimized file format ahead of time. If industry and government were transporting files in a better way, we’d all save a lot of time and money, and the tools are widely available.

3:30 pm – 4:15 pm Session III
Using Machine Learning to Manage User Access

Speaker: Jon Austin Osborne, Machine Learning Engineer, Capital One

In any enterprise, one of the most prevalent security risks revolves around who has access to which resources. Whether data is stored in a cloud solution or on-premises, knowing how to provide the correct privileges to associates is a major challenge. By using machine learning and clustering algorithms like the Louvain Method, we can group similar users in the Capital One network and create two valuable features: (1) automated onboarding and (2) automated “rogue access” detection. With machine learning, we have helped Capital One become a more well-managed company and have reduced a major cybersecurity threat. This talk will be a deep dive into the model, data engineering and productionization of the web application interface.

4:15 pm – 4:45 pm
4:45 pm – 5:30 pm
The Big (Data) Picture: How Your Choices are Changing the World Beyond Tech

R. David Edelman, President Obama’s “Geek in Chief”

The United States is the global leader in big data technology – so how is the United States Government using big data today and how can it be used in the future? What are the key policy debates that will shape the big data landscape?

7:00 pm – 10:00 pm
Deep Dive in the Data Lake Pool Party & BBQ

Enjoy cocktails and a casual barbecue dinner, and play some bocce or volleyball, poolside at the Wigwam.

Friday May 26, 2017

8:00 am – 9:00 am
9:00 am – 10:15 am
Engineering the Future: What will you say and do differently on Monday as a result of attending DP 2017?

Ashish will kick off the morning with highlights from Thursday’s session. Joining him in conversation is book contributor and former VP of Commerce Platform Infrastructure at eBay, Debashis Saha, who will share his insights and lead a discussion around charting your data platform’s next steps. From there we’ll break into facilitated small group discussions designed to help you hone your action plans.

10:30 am – 11:30 am
11:30 am – 12:00 pm
It’s a Wrap!

Closing session.