Join Our Sessions
and Learn From The Best

10:30 am - 11:15 am
Thursday, April 12
Session I

Tech Talk – The Story of Building a Scalable Data Trust Playbook at Optimizely (Optimizely)

At Optimizely, we receive billions of clickstream events every day across the thousands of A/B experiments we run for our customers. Customer inquiries about discrepancies between the raw data and the key experiment metrics shown on their Experiment Results pages used to require expensive engineering analysis, because our tooling lacked the scale and flexibility for such investigations. In this talk, I will walk the audience through how we created a playbook that strengthens customer trust in Optimizely in these situations and made it scalable enough for the entire organization to self-serve.
Speaker: Vignesh Sukumar,
Senior Manager, Data Engineering,
Optimizely
10:30 am - 11:15 am
Thursday, April 12
Session I

Tech Talk – The BIG Picture: The Journey from On-Premises to Cloud to ML Marketplace for a Geospatial Data Platform (DigitalGlobe)

In a world driven by location intelligence, DigitalGlobe is creating a place where everyone can access geospatial data and use it to derive truth and, in turn, knowledge. DigitalGlobe’s Geospatial Big Data Platform (GBDX) powers an ecosystem of location intelligence, cataloging hundreds of petabytes’ worth of geospatial information and executing tens of millions of hours’ worth of cloud compute. This session covers the migration of GBDX from on-premises to the cloud, involving 100 petabytes of satellite image data. We will also discuss GBDX as a key platform for analytics and machine learning applications across industries, and how it is evolving into a marketplace for imagery-related machine learning algorithms.
Speaker: Ori Elkin,
Chief Product Officer,
DigitalGlobe
10:30 am - 11:15 am
Thursday, April 12
Session I

Tech Talk – HiveServer2 Or: How I Learned to Stop Worrying and Love the Bomb (Lyft)

This presentation covers Apache Hive use cases at Lyft, including the challenges and lessons learned from the recent Hive upgrade and from handling ETL at scale.
Speaker: Puneet Jaiswal,
Software Engineer,
Lyft
10:30 am - 11:15 am
Thursday, April 12
Session I

Tech Talk – Building a Real-Time Decision Engine Using Machine Learning on Apache Spark Structured Streaming (Blueprint Technologies)

Real-time decision making using ML/AI is the holy grail of customer-facing applications. It’s no longer a long-shot dream; it’s our new reality. The real-time decision engine leverages the latest features in Apache Spark 2.3, including stream-to-stream joins and Spark ML, to directly improve the customer experience. We will discuss the architecture at length, including data source features and technical intricacies, as well as model training and serving dynamics. Critically, real-time decision engines that directly affect customer experience require production-level SLAs and/or reliable fallbacks to avoid meltdowns, which this talk will also address.
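The stream-to-stream joins the talk highlights can be illustrated with a minimal, self-contained sketch. This is plain Python showing the join-within-a-watermark semantics, not the Spark API; the ad-style event names and the 10-minute window are hypothetical:

```python
from datetime import datetime, timedelta

# Hypothetical events: (event_id, timestamp). A stream-to-stream join keeps a
# bounded buffer of each stream and matches events whose timestamps fall
# within a watermark window, so neither buffer grows without bound.
WATERMARK = timedelta(minutes=10)

def windowed_join(clicks, impressions, watermark=WATERMARK):
    """Join click events to impression events with the same id, provided the
    click arrives within `watermark` of the impression (illustrative only)."""
    by_id = {}
    for imp_id, imp_ts in impressions:
        by_id.setdefault(imp_id, []).append(imp_ts)
    joined = []
    for click_id, click_ts in clicks:
        for imp_ts in by_id.get(click_id, []):
            if timedelta(0) <= click_ts - imp_ts <= watermark:
                joined.append((click_id, imp_ts, click_ts))
    return joined
```

In Spark itself the same time bound is expressed as a watermark plus a time-range join condition, which is what lets the engine discard old state and keep the join running indefinitely.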
Speaker: Garren Staubli,
Sr. Data Engineer,
Blueprint Technologies
11:30 am - 12:15 pm
Thursday, April 12
Session II

Tech Talk – A Lap Around Azure Data Lake (Insight)

Azure Data Lake is one of the most powerful PaaS services Microsoft Azure offers for managing big data. Built on well-known projects such as HDFS and YARN, it lets you focus on designing the solution rather than on administration. A new language, U-SQL, combines SQL and C# to work with data of any type and size. During the session, we will explore Azure Data Lake Store and Azure Data Lake Analytics, the core components of the Azure Data Lake offering.
Speaker: Francesco Diaz,
Regional Solutions Manager Alps, Nordic & Southern Europe,
Insight
11:30 am - 12:15 pm
Thursday, April 12
Session II

Tech Talk – A Framework for Assessing the Quality of Product Usage Data (Autodesk)

This presentation discusses the importance of data quality and outlines an approach to assessing and measuring the quality of product usage event logs. A data quality assessment framework helps build trust in our data and enables analysts to develop a deep understanding of product usage patterns, product stability, and how customers utilize the assets they have purchased. Unlocking value from this data depends on high-quality, complete data sets that make it possible to link product usage events with back-office account and entitlement data.
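A minimal sketch of the kind of checks such a framework might run over a batch of usage events. The field names and metrics (per-field completeness, duplicate rate) are hypothetical examples, not Autodesk's actual framework:

```python
# Required fields for a hypothetical product usage event log.
REQUIRED_FIELDS = ("event_id", "user_id", "timestamp", "event_type")

def assess_quality(events):
    """Return per-field completeness and a duplicate rate for a batch of
    event dicts (illustrative quality metrics only)."""
    total = len(events)
    completeness = {
        f: sum(1 for e in events if e.get(f) not in (None, "")) / total
        for f in REQUIRED_FIELDS
    }
    unique_ids = {e.get("event_id") for e in events}
    duplicate_rate = 1 - len(unique_ids) / total
    return {"completeness": completeness, "duplicate_rate": duplicate_rate}
```

Metrics like these, tracked over time, are what turn "is this data trustworthy?" from a judgment call into a measurable threshold.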
Speaker: David Oh,
Data Engineer for the ADP (Autodesk Data Platform),
Autodesk
11:30 am - 12:15 pm
Thursday, April 12
Session II

Tech Talk – Presto: Fast SQL on Everything (Facebook)

Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. This talk introduces a selection of Facebook use cases, which range from user-facing reporting applications to multi-hour ETL jobs, then explains the architecture, implementation, features and performance optimizations that enable Presto to support these use cases.
Speaker: David Phillips,
Software Engineer,
Facebook
11:30 am - 12:15 pm
Thursday, April 12
Session II

Tech Talk – Packaging, Deploying and Running Apache Spark Applications in Production (Mapbox)

Apache Spark has proven itself indispensable thanks to its wide range of applications and use cases; developers, data scientists, engineers, and analysts alike can benefit from its power. However, deterministically managing dependencies, packaging, testing, scheduling, and deploying a Spark application can be challenging. As organizations grow, these individuals become dispersed across multiple teams and departments, so a team-specific solution no longer works. What tooling do you need so these individuals can focus solely on writing a Spark application? And more importantly, how do you enforce development best practices such as unit testing, continuous integration, version control, and deployment environments? The data engineering team at Mapbox has developed tooling and infrastructure to address these challenges and enable individuals across the organization to build and deploy Spark applications. This talk will walk you through our solution for packaging, deploying, scheduling, and running Spark applications in production, along with some of the problems we’ve faced and how adoption is evolving across the team.
Speaker: Saba El-Hilo,
Data Engineer,
Mapbox
11:30 am - 12:15 pm
Thursday, April 12
Session II

Tech Talk – Lighthouse Related Product, an Efficient Cross-boundary Product Recommendation Platform on Qubole Ecosystem (Fanatics)

Fanatics, Inc. will introduce Lighthouse Related Product (LRP), an item-to-item recommendation service platform in production. LRP crosses the boundary between supervised and unsupervised learning and is extensible, flexible, and lightweight. Its modeling architecture fuses into one system heterogeneous features from modern machine learning techniques: (1) unsupervised user-item matrices, (2) self-supervised Word2Vec, and (3) supervised XGBoost or deep learning. This architecture extends naturally to user-item recommendations and supports both offline and online use cases. It is lightweight and efficient enough to serve item-to-item recommendations for nearly one million products across more than 400 affiliated sites. The platform relies on the Apache Spark cluster in Qubole for both feature extraction and distributed prediction via a map procedure over a pre-trained supervised model; tasks on the Spark cluster in Qubole are seamlessly integrated into the rest of Fanatics’ workflows through a third-party scheduling service, Stonebranch. LRP successfully passed the real-life load tests of the 2017 holiday season and Super Bowl LII, and an earlier predecessor of the current version outperformed an industry-standard third-party recommendation service on all measures, including click-through rate and average order volume.
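The fusion idea behind an architecture like LRP's can be sketched in a few lines: scores from heterogeneous sources (an unsupervised co-occurrence signal, an embedding similarity, a supervised model score) are combined into one ranking input. Everything here, including the blend weights, is a made-up illustration, not the production system:

```python
def fuse_features(cooccurrence_score, embedding_sim, supervised_score):
    """Concatenate heterogeneous per-item scores into one feature vector."""
    return [cooccurrence_score, embedding_sim, supervised_score]

def rank_candidates(candidates):
    """candidates: {item_id: (cooc, emb_sim, sup)} -> item ids sorted by a
    simple weighted blend of the fused features (weights are hypothetical)."""
    weights = [0.2, 0.3, 0.5]
    def blend(feats):
        return sum(w * f for w, f in zip(weights, fuse_features(*feats)))
    return sorted(candidates, key=lambda i: blend(candidates[i]), reverse=True)
```

In a real system the blend would itself be learned (e.g., by the supervised XGBoost stage), but the shape of the pipeline, heterogeneous signals in, one ranking out, is the same.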
Speaker: Jing Pan,
User Experience Researcher,
Fanatics
11:30 am - 12:15 pm
Thursday, April 12
Session II

Tech Talk – From Zero to Activated Big Data in the Cloud – the First Year’s Journey (Auris Surgical Robotics)

What would you do in this scenario: a blank slate, one year to prepare for “big data is on the way,” and your company’s acknowledgement that data is a strategic corporate asset? Attend this session to explore the goals, best practices, architectural constraints, and technologies that shaped the journey from that starting point to continuously delivered, live systems with engaged users. You will also hear about the multiple use cases downstream from the data lake, such as APIs, guided exploration, streaming, and integration with external systems, and learn how all of this can be accomplished with one full-time employee and strategic partnerships.
Speaker: Brian Greene,
Cloud Data Architect,
Auris Surgical Robotics
2:30 pm - 3:15 pm
Thursday, April 12
Session III

Tech Talk – Self Regulating Streaming Capabilities in Apache Heron (Streamlio)

Enterprises are producing data not only at high volume but also at high velocity. Many daily business operations depend on real-time insights, so real-time staging and processing of data is gaining significance. This creates a need for scalable infrastructure that can continuously ingest and process billions of events per day the moment the data is acquired. To achieve real-time performance at scale, Twitter designed Heron for stream data processing. In production for more than four years, Heron has faced crucial operational challenges: the manual, time-consuming, and error-prone tuning of configuration knobs to achieve service-level objectives (SLOs), and the maintenance of those SLOs in the face of sudden, unpredictable load variation and hardware or software performance degradation. To address these issues, we conceived and implemented several innovative methods and algorithms that bring self-regulating capabilities to these systems, reducing the number of manual interventions. In this talk, we will introduce these issues and enumerate the challenges we faced in production, such as slow hosts, unpredictable spikes, network slowness, and network partitioning. We will also describe how we made the systems self-regulating to minimize overhead and operations.
Speaker: Karthik Ramasamy,
Co-Founder,
Streamlio / Twitter
2:30 pm - 3:15 pm
Thursday, April 12
Session III

Tech Talk – Highly Scalable and Flexible ETL Tool Built on Top of Cascading Framework (BloomReach)

At BloomReach we have 100+ e-commerce customers sharing product catalogs that range from a few megabytes to hundreds of gigabytes, all of which need to be parsed and transformed. In this presentation we will talk about how we built a custom ETL transformation tool on top of the Cascading framework that handles custom transformations and joins at scale and speed.
Speaker: Navin Agarwal,
Principal Engineer,
BloomReach
2:30 pm - 3:15 pm
Thursday, April 12
Session III

Tech Talk – Building Data Functions at Poshmark: From KPI Monitoring to Enabling Social Graphs (Poshmark)

Poshmark is the largest social marketplace for fashion in the U.S., where anyone can buy, sell, and share their personal style. With users engaging in 300M+ activities every day, data is a core asset at Poshmark. We began our data journey with very basic uses of data: monitoring high-level business KPIs. Four years later, we are deploying data applications for tasks such as maintaining a balanced social graph and driving our homepage content from real-time community activity. Join me in this session for key highlights from this incredible journey of building a data function at Poshmark, along with insights from the development of a people-matching algorithm and real-time, user-driven homepage content.
Speaker: Barkha Saxena,
Vice President, Data & Analytics,
Poshmark
2:30 pm - 3:15 pm
Thursday, April 12
Session III

Tech Talk – Using Qubole as the Data Lake for Programmatic Advertising (Adobe Ad Cloud)

Qubole has been the data warehouse for our demand-side platform (DSP) for more than six years, and was selected as the ideal partner for mobilizing the considerable amount of diagnostic and base-truth data stored in Amazon S3. From these origins, Qubole now powers our custom reporting infrastructure, machine learning algorithms, and user-mapping reports, along with an evolving role in supporting system diagnoses and audits. We will touch on several use cases that demonstrate the flexibility and power of Qubole in democratizing data across the organization.
Speaker: Tom Silverstrim,
Sr. Manager, Adobe Media Optimizer,
Adobe Ad Cloud
2:30 pm - 3:15 pm
Thursday, April 12
Session III

Tech Talk – Key Objectives and Principles for Building Predictive Models on Big Data (Neustar)

Business analysts spend a lot of time looking at what happened in the past, but what about grasping what will happen in the future? For example, what if you are given 10 percent more budget for next quarter’s marketing spend? Do you know how you’ll use that extra money, and what impact it will create? Or suppose you want to increase your budget but need to show what you expect that increase to accomplish: then what? Many of today’s data applications are “decision support systems” designed for exactly these scenarios: they help business professionals use data to better understand their environment and make better decisions. But with larger volumes of data and the growing ambitions of competitive businesses, those goals become tougher to achieve. As the VP of Engineering for MarketShare DecisionCloud at Neustar, which provides planning and analytics capabilities for marketers, Satya Ramachandran has taken on these challenges by leveraging big data technologies. In this talk, Satya will discuss some of the high expectations he’s faced at MarketShare, as well as some of his successes. For example, even as data volumes have grown significantly in recent years, business users still want faster results. This drove efforts within his organization to support larger amounts of data while delivering dramatic speed improvements, going from several-minute to sub-second responses. Satya will share some guiding principles that helped him successfully develop and deploy the systems his customers needed to succeed with their big data projects.
Speaker: Satya Ramachandran,
Vice President, Engineering,
Neustar