What's on the Agenda

Note: The activities scheduled on Wednesday are optional hands-on training sessions for practitioners.

9:00 am - 6:00 pm

Registration, Networking

9:30 am - 11:30 am

Training – Qubole Enterprise User

This session will answer the questions: What is Qubole? What features are available to me as a user? How does it interact with the cloud on my behalf? How do I pick the appropriate SQL engine for my needs?
11:30 am - 1:00 pm

Lunch Break

1:00 pm - 3:00 pm

Training – Qubole Enterprise Admin (AWS)

This session will address how Qubole clusters work, how to administer a Qubole cluster, and how to decide which cluster is appropriate for a given scenario. This session will be specific to AWS.
1:00 pm - 3:00 pm

Training – Qubole Enterprise (Azure)

This session will address how Qubole clusters work, how to administer a Qubole cluster, and how to decide which cluster is appropriate for a given scenario. This session will be specific to Azure.
3:30 pm - 5:30 pm

Training – Spark for Data Scientists

This session will cover Spark Notebooks, Spark Functionality, Qubole Features, Notebook API Execution, Notebook Dashboards, Notebook Tuning, Interpreter Configuration, Executor Management, and Troubleshooting.
6:30 pm - 8:30 pm

Welcome Reception

Get your fill of fun, food, and drinks as we kick off Data Platforms 2018. Watch the sunset on the patio of a beautiful adobe building, and network with other leaders and innovators in the big data space.
7:30 am - 8:45 am


9:00 am - 9:45 am

Opening Keynote – Ashish Thusoo, Co-founder & CEO, Qubole

The next battleground for Big Data is activation. Data (and how it is used) has become a strategic differentiator for many companies, and getting as many users as possible productive with data assets has become a competitive battleground. But most companies are stuck in the “Activation Gap”: unable to onboard new users, use cases, and data sets quickly and affordably. In his keynote address, Ashish Thusoo will talk about the Activation Gap, why it represents the biggest threat to big data success, and the practical steps you can take to eliminate it.
9:45 am - 10:15 am

Panel: Activating Big Data Across the Enterprise

Every CEO aspires to create a data-driven culture that can activate hundreds or thousands of users and petabyte-scale data to continuously deliver true business value. This keynote panel will explore the journeys of four companies (Comcast, Turner Broadcasting, Fanatics, and MediaMath) that have chronicled their successes and challenges in two upcoming books published by O’Reilly Media: Creating a Data-Driven Enterprise in Media and Creating a Data-Driven Enterprise in Retail. The panelists will talk not just about their technology strategy and choices but also about how data-driven insights are powering their businesses and transforming the competitive dynamics of their industries.
10:15 am - 11:00 am
Session I

Tech Talk – HiveServer2 Or: How I Learned to Stop Worrying and Love the Bomb (Lyft)

This presentation covers Hive use cases at Lyft, along with the challenges and lessons learned from a recent Hive upgrade and from handling ETL at scale.
Speaker: Puneet Jaiswal,
Software Engineer,
Lyft
10:15 am - 11:00 am
Session I

Tech Talk – The Story of Building a Scalable Data Trust Playbook at Optimizely (Optimizely)

At Optimizely, we receive billions of user clickstream events every day for the thousands of A/B experiments we run for our customers. Customer inquiries about discrepancies between raw data and the key experiment metrics shown on their Experiment Results pages used to require expensive engineering analysis, because our tooling lacked the scale and flexibility for such investigations. In this talk, I will walk the audience through the journey of how we created a Playbook that strengthens customer trust in Optimizely in these situations, and how we made it scalable enough for the entire organization to self-serve.
Speaker: Vignesh Sukumar,
Senior Manager, Data Engineering,
Optimizely
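For readers who want a concrete picture, here is a minimal sketch, assuming hypothetical S3 paths and column names, of one such reconciliation step: recomputing an experiment metric from raw events and diffing it against the reported figure. This is an illustration of the idea, not Optimizely's actual tooling.

```python
# A hypothetical sketch of one "playbook" step: recomputing an experiment
# metric from raw clickstream events so it can be reconciled against the
# number shown on the Experiment Results page. Column names are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("metric-reconciliation").getOrCreate()

raw = spark.read.parquet("s3://example-bucket/clickstream/")

# Unique converting visitors per experiment variation, from raw data.
recomputed = (raw.filter(F.col("event_type") == "conversion")
                 .groupBy("experiment_id", "variation_id")
                 .agg(F.countDistinct("visitor_id").alias("conversions_raw")))

# Compare against the metrics the customer sees in the product.
reported = spark.read.parquet("s3://example-bucket/results_page_metrics/")
diff = (recomputed.join(reported, ["experiment_id", "variation_id"])
                  .withColumn("delta",
                              F.col("conversions_raw") - F.col("conversions_reported")))
diff.filter(F.abs(F.col("delta")) > 0).show()
```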
10:15 am - 11:00 am
Session I

Tech Talk – Structured Streaming + ML and Avoiding Chernobyl (Blueprint Technologies)

Lessons and guidance on using Structured Streaming (on Qubole and in general) with Spark ML to score and serve models in real time. We'll discuss how to build a structured streaming app, why to use it as opposed to DStreams, and what to consider when building one.
Speaker: Garren Staubli,
Sr. Data Engineer,
Blueprint Technologies
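To make the moving parts concrete, here is a minimal sketch, assuming hypothetical S3 paths and an invented event schema, of scoring a pre-trained Spark ML model over a structured stream; it illustrates the pattern discussed, not the speaker's code.

```python
# A minimal sketch (not the talk's code) of scoring a pre-trained
# Spark ML model over a structured stream; paths and schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("stream-scoring").getOrCreate()

# Schema of incoming events (assumed for illustration).
schema = (StructType()
          .add("user_id", StringType())
          .add("feature_1", DoubleType())
          .add("feature_2", DoubleType()))

# Read newline-delimited JSON files as they land (a stand-in for Kafka).
events = spark.readStream.schema(schema).json("s3://example-bucket/events/")

# Load a model fitted offline and apply it to the live stream.
model = PipelineModel.load("s3://example-bucket/models/scorer")
scored = model.transform(events)

# Continuously emit predictions; the console sink is for demonstration only.
query = (scored.select("user_id", "prediction")
         .writeStream.outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
```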
10:15 am - 11:00 am
Session I

Tech Talk – Democratizing the Data Platform (Nextdoor)

Learn how the data team at Nextdoor.com stopped writing queries all day and built a platform that empowered their entire company to build their own data pipelines.
10:15 am - 11:00 am
Session I

Tech Talk – The BIG Picture: The Journey from On-Premises to Cloud to ML Marketplace for a Geospatial Data Platform (DigitalGlobe)

In a world driven by location intelligence, DigitalGlobe is creating a place where everyone can access geospatial data and use it to derive truth and, in turn, knowledge. DigitalGlobe’s Geospatial Big Data Platform (GBDX) is empowering an ecosystem of location intelligence, cataloging hundreds of petabytes of geospatial information and executing tens of millions of hours of cloud compute. This session will cover the migration of GBDX from on-premises to the cloud, involving 100 PB of satellite image data; how it has become a key platform for analytics and machine learning applications across industries; and how it is evolving into a marketplace for imagery-related ML algorithms.
Speaker: Ori Elkin,
Chief Product Officer,
DigitalGlobe
11:15 am - 12:00 pm
Session II

Tech Talk – Packaging, Deploying and Running Apache Spark Applications in Production (Mapbox)

Apache Spark has proven itself indispensable due to its endless applications and use cases; developers, data scientists, engineers, and analysts alike can benefit from its power. However, deterministically managing dependencies, packaging, testing, scheduling, and deploying a Spark application can be challenging. As organizations grow, these individuals become dispersed across multiple teams and departments, and team-specific solutions no longer apply. So what type of tooling do you need to let these individuals focus solely on writing a Spark application? And more importantly, how do you enforce development best practices such as unit testing, continuous integration, version control, and deployment environments? The data engineering team at Mapbox has developed tooling and infrastructure to address these challenges and enable individuals across the organization to build and deploy Spark applications. This talk will walk you through our solution to packaging, deploying, scheduling, and running Spark applications in production. It will also address some of the problems we’ve faced and how the adoption process is evolving across the team.
Speaker: Saba El-Hilo,
Data Engineer,
Mapbox
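As an illustration of the unit-testing practice the talk advocates, here is a hedged sketch: a hypothetical transformation and a pytest test that exercises it on a local SparkSession. The function and its data are invented for the example.

```python
# A hedged sketch of unit-testing a Spark transformation with pytest;
# `dedupe_latest` is a hypothetical example transformation.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def dedupe_latest(df, key_col, ts_col):
    """Keep only the most recent row per key (example transformation)."""
    w = Window.partitionBy(key_col).orderBy(F.col(ts_col).desc())
    return (df.withColumn("_rn", F.row_number().over(w))
              .filter(F.col("_rn") == 1)
              .drop("_rn"))

@pytest.fixture(scope="module")
def spark():
    # Local session so the test runs in CI without a cluster.
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_dedupe_keeps_latest(spark):
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 1)], ["id", "ts"])
    out = dedupe_latest(df, "id", "ts").collect()
    assert {(r["id"], r["ts"]) for r in out} == {("a", 2), ("b", 1)}
```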
11:15 am - 12:00 pm
Session II

Tech Talk – From Zero to Activated Big Data in the Cloud – the First Year’s Journey (Auris Surgical Robotics)

What would you do with one year, a blank slate, the knowledge that “big data” is on the way, and a company that knows its data is a corporate asset? A quick tour of the goals, practices, architectural constraints, and technologies used to get from that starting point to tested systems with engaged users. Then: how have we geared up for the multiple use cases downstream of the data lake, fueling APIs, guided exploration, streaming, and integration with external systems? Can you do this with 1 FTE? Yes you can!
Speaker: Brian Greene,
Cloud Data Architect,
Auris Surgical Robotics
11:15 am - 12:00 pm
Session II

Tech Talk – Lighthouse Related Product, an Efficient Cross-boundary Product Recommendation Platform on Qubole Ecosystem (Fanatics)

Authors (in alphabetical order by last name): Santanu Dey, Gillian Lam, Gaurav Mehta, Jing Pan, Weian Sheng, Di Zhao. Fanatics, Inc.* will introduce Lighthouse Related Product (LRP), an item-to-item recommendation service platform in production. LRP crosses the boundary between supervised and unsupervised learning and is extensible, flexible, and lightweight. It implements a modeling architecture that fuses heterogeneous features from (1) unsupervised user-item matrices, (2) self-supervised word2vec, and (3) supervised xgboost or deep learning into one system. This architecture extends naturally to user-item recommendation and is flexible enough for both offline and online use cases. It is lightweight and efficient enough to handle item-to-item recommendations for nearly 1 million products across over 400 affiliated sites. The platform relies on Qubole's Spark clusters for both feature extraction and distributed prediction, applying a map procedure over a pre-trained supervised model; tasks on the Qubole Spark cluster are seamlessly integrated into the rest of Fanatics' workflows via a third-party scheduling service, Stone Branch. LRP passed the real-life load tests of the 2017 holiday season and Super Bowl LII, and an earlier predecessor of the current version outperformed an industry-standard third-party recommendation service provider on all measures, such as click-through rate and average order volume. *Fanatics is an omni-channel global leader in licensed sports merchandise.
Speaker: Jing Pan,
User Experience Researcher,
Fanatics
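For a concrete flavor of the self-supervised word2vec component, here is a minimal sketch using Spark ML's Word2Vec, treating each browsing session as a "sentence" of product IDs; the data and column names are toy values, and this is not Fanatics' production code.

```python
# A minimal, hypothetical sketch of the self-supervised word2vec piece:
# learning item embeddings from sessions of product IDs with Spark ML.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("item2vec").getOrCreate()

# Each row is one session: an ordered list of product IDs (toy data).
sessions = spark.createDataFrame(
    [(["sku1", "sku2", "sku3"],),
     (["sku2", "sku3", "sku4"],),
     (["sku1", "sku3"],)],
    ["items"])

w2v = Word2Vec(vectorSize=64, minCount=1, inputCol="items", outputCol="embedding")
model = w2v.fit(sessions)

# Nearest neighbors in embedding space serve as item-to-item candidates.
model.findSynonyms("sku3", 2).show()
```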
11:15 am - 12:00 pm
Session II

Tech Talk – A Framework for Assessing the Quality of Product Usage Data (Autodesk)

11:15 am - 12:00 pm
Session II

Tech Talk – A Lap Around Azure Data Lake (Insight)

Azure Data Lake is one of the most powerful PaaS services that Microsoft Azure offers for managing Big Data. Built on well-known projects such as HDFS and YARN, it lets you focus on the design of the solution rather than on administration. A new language, U-SQL, combines SQL and C# to work with data of any type and size. During the session we will explore Azure Data Lake Store and Azure Data Lake Analytics, the core of the Azure Data Lake offering.
Speaker: Francesco Diaz,
Regional Solutions Manager Alps, Nordics & Southern Europe,
Insight
12:00 pm - 1:15 pm


1:30 pm - 2:15 pm
Session III

Tech Talk – Enterprise Fabric – a Concept for the Essential Thread in Your Transformational Journeys (IBM)

This session will provide an overview of the enterprise fabric and an encapsulated view of the capabilities it requires. Key components of the fabric include data and cognitive technologies. We will dive into the enterprise fabric-based architecture and why it is the core foundation for business transformation.
Speaker: Dan Sutherland,
Distinguished Engineer & CTO - Data Platforms, Global Business Services,
IBM
1:30 pm - 2:15 pm
Session III

Tech Talk – Building Data Functions at Poshmark: From KPI Monitoring to Enabling Social Graphs (Poshmark)

Poshmark is the largest social marketplace for fashion in the U.S., where anyone can buy, sell, and share their personal style. With users engaging in 300MM+ activities every day, data is a core asset at Poshmark. We began our data journey with very basic usage of data: monitoring high-level business KPIs. Four years later, we are deploying data applications such as enabling a balanced social graph and driving our homepage content based on real-time community activity. Join me in this session to hear key highlights from this incredible journey of building a data function at Poshmark, along with insights from the development of our people-matching algorithm and real-time, user-driven homepage content.
Speaker: Barkha Saxena,
Vice President, Data & Analytics,
Poshmark
1:30 pm - 2:15 pm
Session III

Tech Talk – Highly Scalable and Flexible ETL Tool Built on Top of Cascading Framework (BloomReach)

At BloomReach we have over 100 e-commerce customers sending us product catalogs ranging from a few megabytes to hundreds of gigabytes, which then need to be parsed and transformed. In this presentation we will talk about how we built a custom ETL transformation tool on top of the Cascading framework that handles many custom transformations, joins, and more, all at scale and speed.
Speaker: Navin Agarwal,
Principal Engineer,
BloomReach
1:30 pm - 2:15 pm
Session III

Tech Talk – Team Data Science Process (TDSP) and Azure Machine Learning (Microsoft)

TDSP is an agile data science process meant to keep data science and business teams working together. In this session, we'll explore the Team Data Science Process and walk through an example using Azure Machine Learning Services.
Speaker: Erik Zweifel,
Advanced Analytics and AI Architect,
Microsoft
1:30 pm - 2:15 pm
Session III

Tech Talk – Self Regulating Streaming Capabilities in Apache Heron (Streamlio)

Many enterprises now produce data not only at high volume but also at high velocity, and many daily business operations depend on real-time insights, so real-time staging and processing of data is gaining significance. This creates the need for scalable infrastructure that can continuously ingest and process billions of events per day the instant the data is acquired. To achieve real-time performance at scale, Twitter designed Heron for stream data processing. In production for more than four years, Heron has faced crucial operational challenges: the manual, time-consuming, and error-prone tuning of configuration knobs to achieve service-level objectives (SLOs), and the maintenance of those SLOs in the face of sudden, unpredictable load variation and hardware or software performance degradation. To address these issues, we conceived and implemented several innovative methods and algorithms that bring self-regulating capabilities to these systems, reducing the number of manual interventions. In this talk, we will briefly introduce these issues; enumerate the challenges we faced in production, such as slow hosts, unpredictable spikes, network slowness, and network partitioning; and describe how we made the systems self-regulating, minimizing operational overhead.
Speaker: Karthik Ramasamy,
Streamlio / Twitter
2:30 pm - 3:15 pm
Session IV

Tech Talk – Building Your Data Lake on Amazon S3: Architecture and Best Practices (AWS)

As organizations aim to become more data-driven, data engineering teams have to build architectures that cater to the needs of diverse users, from developers to business analysts to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. Learn how to build and architect a data lake on AWS where different teams within your organization can publish and consume data in a self-service manner, and learn best practices for data curation, normalization, and analysis on Amazon object storage services.
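As a small illustration of the curation pattern described, here is a hedged PySpark sketch, with hypothetical bucket and column names, that lands raw JSON events into a date-partitioned Parquet layout that downstream consumers can query selectively.

```python
# A sketch, under assumed bucket names and columns, of curating raw
# events into columnar Parquet partitioned for cheap selective scans.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate").getOrCreate()

raw = spark.read.json("s3://example-raw/events/2018/04/")

# Normalize: derive a date column and drop duplicate events.
curated = (raw
           .withColumn("event_date", F.to_date("event_ts"))
           .dropDuplicates(["event_id"]))

# Hive-style partitioning lets consumers prune scans by date.
(curated.write
        .mode("append")
        .partitionBy("event_date")
        .parquet("s3://example-curated/events/"))
```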
2:30 pm - 3:15 pm
Session IV

Tech Talk – The 3S Method for Cluster Architecture Design (Oracle)

This session highlights the model used within Oracle Data Cloud (ODC) for Hadoop 2 and Spark clusters. Take the guesswork out of cluster design and learn the keys to balancing cost and performance while minimizing administration overhead. Audience: Cloud Engineers/Architects/Admins, Managers, New and Established Qubole Customers, AWS Cloud Customers, Hadoop/Spark Users.
Speaker: Justin Wainright,
Systems Analyst, Oracle Data Cloud
3:15 pm - 3:45 pm


3:45 pm - 4:30 pm
Session V

Tech Talk – Session V

4:45 pm - 5:15 pm

Data Science Keynote – IBM

6:30 pm - 9:30 pm

Dinner & Evening Event

You've spent the day learning how the wild west of big data is being won; now don your ten-gallon hat and join us for a farm-fresh dinner and ice-cold saloon drinks. Watch the sunset on the patio of a beautiful adobe building, and network with other leaders and innovators in the big data space. This is a party not to be missed.
7:30 am - 8:45 am


9:00 am - 9:30 am

General Session

9:45 am - 10:30 am
Session V

Tech Talk – Kubernetes for Data Engineers (Google)

The talk will give an introduction to Kubernetes in general and then focus on topics relevant to Data Engineers: in particular, how to run stateful workloads on Kubernetes and how to run machine learning workloads that use GPUs.
Speaker: Rohit Agarwal,
Software Engineer,
Google
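For a concrete taste of the GPU topic, here is a minimal sketch using the official Kubernetes Python client to request a GPU for a training pod; the image name is hypothetical, and this is an illustration rather than material from the talk.

```python
# A hedged illustration of requesting a GPU for an ML workload via the
# Kubernetes Python client; the image name is hypothetical. GPUs are
# requested through the nvidia.com/gpu extended resource.
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

container = client.V1Container(
    name="trainer",
    image="example.com/ml/trainer:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}))  # schedule onto a GPU node

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-pod"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"))

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```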
9:45 am - 10:30 am
Session V

Tech Talk – Key Objectives and Principles for Building Predictive Models on Big Data (Neustar)

Business analysts spend a lot of time today looking at what happened in the past, but what about trying to grasp what will happen in the future? For example, what if you are given 10% more budget for next quarter’s marketing spend? Do you know how you’ll use that extra money, and do you know what impact it will create? Or suppose you want an increase in your budget but need to show what you expect that increase to do; then what? Many of today’s data applications are simply “decision support systems” designed to be useful in the aforementioned scenarios. They help business professionals use data to better understand their environment and to make better decisions. But with larger volumes of data and the increased ambitions of competitive businesses, the end goals become tougher to achieve. As the VP of Engineering for MarketShare DecisionCloud at Neustar, which provides planning and analytics capabilities for marketers, Satya Ramachandran has taken on these challenges by leveraging big data technologies. In this talk, Satya will discuss some of the high expectations he’s faced at MarketShare, as well as some of his successes. For example, despite the fact that data has grown significantly in recent years, business users still want faster results. This phenomenon led to efforts that supported orders of magnitude more data within his organization and demanded orders-of-magnitude speed improvements, going from several minutes to sub-second responses. Satya will share some guiding principles that helped him successfully develop and deploy the systems his customers needed to be successful with their big data projects.
Speaker: Satya Ramachandran,
Vice President, Engineering,
Neustar
9:45 am - 10:30 am
Session V

Tech Talk – Using Qubole as the Data Lake for Programmatic Advertising (Adobe)

Qubole has been the data warehouse for our DSP for the last 6+ years, and it has grown and evolved as our business has. Qubole was an ideal partner for mobilizing the considerable amount of diagnostic and base truth data contained within S3. From these origins, Qubole now powers our custom reporting infrastructure, machine learning algorithms, and user mapping reports, along with its evolving role in supporting diagnosis and audit of our systems. This presentation will touch on several use cases across the organization that demonstrate the flexibility and power of Qubole in democratizing data, including complex fact-table JOINs for bid funnel metrics, machine learning training data, and scaling custom reports using Presto.
Speaker: Tom Silverstrim,
Sr. Manager, Adobe Media Optimizer,
Adobe
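As a rough illustration of the Presto use case, here is a hedged sketch that issues a bid-funnel-style fact-table join from Python via PyHive; the host, tables, and columns are invented stand-ins, not Adobe's schema.

```python
# A hedged illustration (not Adobe's code) of running one of the described
# Presto queries from Python via PyHive; all names are hypothetical.
from pyhive import presto

conn = presto.connect(host="presto.example.com", port=8080, username="analyst")
cur = conn.cursor()

# Complex fact-table JOIN for bid funnel metrics: bids joined to wins.
cur.execute("""
    SELECT b.campaign_id,
           COUNT(*)                                 AS bids,
           COUNT(w.impression_id)                   AS wins,
           COUNT(w.impression_id) * 1.0 / COUNT(*)  AS win_rate
    FROM bids b
    LEFT JOIN wins w
      ON b.bid_id = w.bid_id
    GROUP BY b.campaign_id
""")
for row in cur.fetchall():
    print(row)
```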
9:45 am - 10:30 am
Session V

Tech Talk – An Interactive Discussion on Building a Data Driven Culture (Wikia/FANDOM)

Examples include:
1. Verisign: from the "SiteFinder" debacle to the "Internet Threat Tracking Service".
2. Netflix: how we know more about what you love to watch than you do.
3. Wikia/Fandom: how to get you the most meaningful content and engaging experience.
Speaker: Wade Warren,
Senior Vice President, Global Engineering & TechOps,
Wikia/FANDOM
9:45 am - 10:30 am
Session V

Tech Talk – Where We’re Going We Don’t Need Computers: End-to-End Serverless Data Science (Oracle)

Data scientists are expected to wear many hats in an organization. From ingesting and cleaning data to managing data storage, creating scalable machine learning models, and publishing APIs that expose and schedule services for end users, many tasks fall in the realm of data science. This talk focuses on how to create end-to-end data science products that allow data scientists to focus on business logic, all while embracing nearly infinitely horizontally scaling data platforms. To do this, we’ll explore serverless cloud technologies at multiple levels of the data science pipeline, such as serverless compute, workflow, containerized workloads, distributed on-demand machine learning, metrics tracking, and API gateway access. At the end of this talk we'll have a prototype for an end-to-end machine learning system, on a scalable cloud platform, capable of processing petabytes of data and thousands of requests without the need for any freestanding servers.
Speaker: Alex Sadovsky,
Senior Director of Data Science,
Oracle Data & Cloud
10:45 am - 11:30 am
Session VI

Tech Talk – Velocity Versus Volume (Expedia)

For a technology company, there is an inherent tension between streaming and batch processing. Real-time data streams can transform a small input signal into an immediate response, but machine learning is most effective in batch. Modern data platforms can easily handle both streaming and batch jobs simultaneously, so balancing these two paradigms becomes a matter of design, and right now this interplay is thriving at the intersection of Product and Data Science. We discuss these dualities in the context of recommendation systems, some of our core products at Expedia. We’ll sketch the design, architecture, tools, and metrics, as well as share our experience with our attempts at personalization. We’ll merge the ideas behind multi-armed bandits and learning-to-rank to develop a novel recommendation system and give you the background needed to start building products in this rapidly evolving space.
Speaker: Sean Downes,
Data Scientist,
Expedia
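To hint at the bandit/learning-to-rank interplay, here is a toy epsilon-greedy sketch that occasionally explores instead of always trusting the ranker's scores; it is purely illustrative and not Expedia's system.

```python
# A toy epsilon-greedy layer over a learned ranker: exploit the ranker's
# order most of the time, explore occasionally to gather fresh feedback.
import random

def epsilon_greedy_rank(scored_items, epsilon=0.1):
    """scored_items: list of (item, ranker_score); returns a display order."""
    ranked = sorted(scored_items, key=lambda x: x[1], reverse=True)  # exploit
    if random.random() < epsilon:
        random.shuffle(ranked)  # explore: collect feedback on other orderings
    return [item for item, _ in ranked]

# Example: hotel candidates with learned relevance scores (toy values).
candidates = [("hotel_a", 0.91), ("hotel_b", 0.74), ("hotel_c", 0.55)]
print(epsilon_greedy_rank(candidates))
```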
10:45 am - 11:30 am
Session VI

Tech Talk – Optimize for Reduced Big Data Partitioning Costs (inMarket media, LLC)

Businesses that collect and process data can benefit greatly from partitioning their tables. Partitioning improves query performance and reduces the effort of rebuilding tables, and single-partition queries can be used to reduce query load and avoid scanning an entire table. However, transitioning large existing tables into partitioned tables can be cost-prohibitive. At InMarket media LLC, we load billions of location records into multiple tables in our database and process these records through a pipeline of transformations. The resulting tables were not originally partitioned, and as time went on the decision not to partition them became increasingly expensive to maintain. We decided to partition our large existing tables to improve performance and reduce the cost of our queries; initially, we thought that partitioning at this stage might be too expensive. In this talk, I will give a brief overview of the partitioning feature, explain the advantages and drawbacks of several different implementations of the partitioning process, and show how we were able to reduce its cost. We needed to partition large tables by date and came across several issues. The challenge was to direct each row to its correct partition without scanning the entire table for each day. Our data volume of ~12 TB represents events spanning 700 days, and we encountered two main problems with the standard approach: inefficiency (manually partitioning each day is tedious and prone to human error) and cost (the price of scanning the entire table to partition each day came out to about $42k). We needed a way to partition the table while avoiding full scans, to save money and automate the process. I decided to see whether we could represent the rows differently: I created a staging table that compressed the rows into an array, so each row became a column.
Speaker: Waad Aljaradt,
Data Engineer,
InMarket media LLC
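For context, here is a minimal PySpark sketch of the standard single-pass approach to backfilling a date-partitioned table, the baseline the talk improves upon with its array-compression staging trick; paths and columns are hypothetical.

```python
# A minimal sketch of the standard single-pass backfill: rewrite an
# unpartitioned table into a date-partitioned one in one scan, rather
# than one full scan per day. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("backfill-partitions").getOrCreate()

events = spark.read.parquet("s3://example-bucket/location_events/")

# One scan of the source; the writer routes each row to its dt= partition.
(events.withColumn("dt", F.to_date(F.col("event_ts")))
       .write.mode("overwrite")
       .partitionBy("dt")
       .parquet("s3://example-bucket/location_events_partitioned/"))
```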
10:45 am - 11:30 am
Session VI

Tech Talk – Email Text Classification: Building an End-to-End Data Product (Return Path)

Sasha will tell the story of building an end-to-end data product that feeds various parts of the Return Path business to optimize email programs for marketers. We will cover the discovery, development, and productionization of an email classification model that uses Spark to fit classifiers such as Random Forests and Support Vector Machines to read email text and classify the content. We will discuss the different methods of hyperparameter tuning and ensembling used, and will describe different stages of production, from batch jobs in the Qubole Scheduler and Apache Airflow to streaming in Kafka. We will also reflect on what it means to be a full-stack data scientist and how data science teams can be empowered to own their own data products.
Speaker: Sasha Mushovic,
Data Scientist,
Return Path
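As a concrete sketch of the kind of pipeline described, here is a hedged Spark ML example: tokenized email text fed through TF-IDF features into a random forest, with cross-validated hyperparameter tuning; the toy data and parameter values are invented.

```python
# A hedged sketch of a Spark ML text-classification pipeline with
# cross-validated tuning; data, labels, and grid values are toy examples.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("email-clf").getOrCreate()

# Toy emails: label 1.0 = promotional, 0.0 = transactional.
train = spark.createDataFrame(
    [("20% off everything this weekend", 1.0),
     ("your receipt for order 1234", 0.0),
     ("flash sale ends tonight", 1.0),
     ("your package has shipped", 0.0),
     ("exclusive deal just for you", 1.0),
     ("password reset requested", 0.0)],
    ["text", "label"])

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    HashingTF(inputCol="tokens", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    rf])

# Grid search over tree count, scored by F1 via 3-fold cross-validation.
grid = ParamGridBuilder().addGrid(rf.numTrees, [20, 50]).build()
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(metricName="f1"),
                    numFolds=3)
model = cv.fit(train)
```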
10:45 am - 11:30 am
Session VI

Tech Talk – The Dismal and Uncomfortable Science of Data Engineering: Building Out Big Data with Your Analytics Team

While software services catch up, many Big Data projects rely on engineering resources: people with programming skills and the culture around software development. Meanwhile, analysis teams are often oriented to serve customer engagement and finance managers, with an ethos and engagement style quite distinct from their counterparts from the school of computer science. As the business and engineering sides of the house clash and cooperate, it's important to remember the human side of things, both in the data and in the delivery of insights. Let's talk about ways these capable Business Analysts and Data Engineers can work together, and the ideals they can align toward.
Speaker: Charles Pritchard,
Data Janitor
11:30 am