20/20: Human-in-the-Loop Data Exploration

Overview

We propose to build a new class of database systems designed for Human-In-the-Loop (HIL) operation. We target an ever-growing set of data-centric applications in which data scientists of varying skill levels manipulate, analyze, and explore large data sets, often using complex analytics and machine learning techniques. Enabling these applications with ease of use and at “human speeds” is key to democratizing data science and maximizing human productivity. Traditional database technologies are ill-suited to serve this purpose. Historically, databases assumed (1) text-based input (e.g., SQL) and output, (2) a point (i.e., stateless) query-response paradigm, (3) batch results, and (4) simple analytics. We will drop these fundamental assumptions and build a system that instead supports visual input and output, “conversational” interaction, early and progressive results, and complex analytics. Building a system that integrates these features requires a complete rethinking of the full database stack, from the interface to the “guts”, as well as incorporating pertinent algorithms.

The proposed work will make the following technical contributions:

Visual interactive data exploration and manipulation: Interactive visualizations are an effective and user-friendly means of accessing and manipulating data. In our model, users interact with visualizations to express data operations (input) and also consume visualizations as results (output), which requires profound changes to traditional data system design and optimization. On the input side, the PIs propose a visual interaction language that operates over visualizations, taking a visualization as input and producing another as output. On the output side, the PIs will study the notion of visual approximate query answering, which brings approximate query processing to the world of visualization to enable real-time analysis of large data sets. The PIs will also develop techniques for optimizing visualization-specific operations that are not well supported by existing databases. These interaction and visualization techniques will be investigated in the context of advanced analytics workloads such as anomaly detection.

Natural-language-based, conversational query processing: The PIs will build a natural language interface to relational database systems. Using learning techniques, we will automatically translate natural language queries to SQL. Furthermore, the PIs will develop techniques that facilitate a “conversation” between the database and the user. In data exploration, users interact with the system using a sequence of queries (a.k.a. a query session), each building on the previous one. This radically departs from the point interaction model of traditional query processing, in which no relationships between queries are assumed. The PIs will introduce a session-aware processing model where the system expects users to engage in a long-running “conversation” and optimizes its execution accordingly.
Interactive query steering: A fluid, engaging conversation requires that users have the ability to get representative results early and progressively, and as they learn from these results, to exercise control by rapidly interrupting queries or changing their behavior (e.g., query parameters). The PIs propose novel online query processing and steering techniques that are designed for visual data manipulation involving complex analytics and machine learning. Below, we discuss different initial projects towards achieving these goals.
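
As a small illustration of early, progressive results with a user-steering hook, consider a mean query answered from a uniformly shuffled sample, batch by batch. This is a minimal sketch under simplifying assumptions (in-memory data, a normal-approximation error bound); the function and parameter names are illustrative and not part of any of the systems described below.

```python
import random

def progressive_mean(data, batch_size=1000, stop=None):
    """Yield a running mean estimate plus a rough 95% error bound after
    each batch, so a UI can render early results and let the user
    interrupt or re-parameterize the query at any time."""
    n, total, sq_total = 0, 0.0, 0.0
    shuffled = random.sample(data, len(data))  # uniform random processing order
    for i in range(0, len(shuffled), batch_size):
        for x in shuffled[i:i + batch_size]:
            n += 1
            total += x
            sq_total += x * x
        mean = total / n
        var = max(sq_total / n - mean * mean, 0.0)
        ci = 1.96 * (var / n) ** 0.5  # approximate 95% confidence half-width
        yield mean, ci
        if stop is not None and stop(mean, ci):  # steering hook: stop early
            return
```

A consumer can break out of the generator at any point (user interruption) or pass a `stop` callback (e.g., stop once the bound is tight enough), which is the essence of steering a long-running query.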

People

  • Carsten Binnig
  • Ugur Cetintemel
  • Tim Kraska
  • Andries van Dam
  • Stan Zdonik

Vizdom: Interactive Analytics through Pen and Touch

Machine learning (ML) and advanced statistics are important tools for drawing insights from large datasets. However, these techniques often require human intervention to steer computation towards meaningful results. In this project, we build a new system for interactive analytics through pen and touch called Vizdom. Vizdom’s frontend allows users to visually compose complex workflows of ML and statistics operators on an interactive whiteboard, and the backend leverages recent advances in workflow compilation techniques to run these computations at interactive speeds. Additionally, we are exploring approximation techniques for quickly visualizing partial results that incrementally refine over time. Unlike existing approximate query processing techniques, Vizdom takes the perception of the user into account to avoid unnecessary computation when results would not be perceivable by the user.
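
The perception-aware stopping idea can be illustrated with a small sketch: a histogram is refined batch by batch, and refinement stops once no bar would move by a perceivable amount (here, one pixel). The one-pixel threshold, the fixed-bin histogram, and all names are illustrative assumptions; Vizdom’s actual perception model is more involved.

```python
def refine_until_stable(sample_batches, bins=10, pixel_height=200):
    """Incrementally refine a histogram from batches of samples in [0, 1)
    and stop early once no bar moves by more than one pixel between
    refinements, i.e., once further computation is not perceivable."""
    counts = [0] * bins
    total = 0
    prev = [0.0] * bins
    for batch in sample_batches:
        for x in batch:
            b = min(int(x * bins), bins - 1)
            counts[b] += 1
        total += len(batch)
        heights = [pixel_height * c / total for c in counts]
        if max(abs(h - p) for h, p in zip(heights, prev)) < 1.0:
            return heights, total  # visually stable: stop consuming batches
        prev = heights
    return prev, total
```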

Vizdom: Interactive Analytics through Pen and Touch from Emanuel Zgraggen on Vimeo.

Publications

  • Emanuel Zgraggen, Zheguang Zhao, Robert Zeleznik, and Tim Kraska. Investigating the Effect of the Multiple Comparisons Problem in Visual Analysis. ACM CHI 2018
  • Yue Guo, Carsten Binnig, Tim Kraska: What you see is not what you get!: Detecting Simpson’s Paradoxes during Data Exploration. HILDA@SIGMOD 2017: 2:1-2:5
  • Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, Tim Kraska: Controlling False Discoveries During Interactive Data Exploration. SIGMOD Conference 2017: 527-540
  • Zheguang Zhao, Emanuel Zgraggen, Lorenzo De Stefani, Carsten Binnig, Eli Upfal, Tim Kraska: Safe Visual Data Exploration. SIGMOD Conference 2017: 1671-1674
  • Muhammad El-Hindi, Zheguang Zhao, Carsten Binnig and Tim Kraska. VisTrees: Fast Indexes for Interactive Data Exploration. Research Paper, HILDA 2016 (SIGMOD 2016)
  • Andrew Crotty, Alexander Galakatos, Emanuel Zgraggen, Carsten Binnig and Tim Kraska. The Case for Interactive Data Exploration Accelerators (IDEAs). Research Paper, HILDA 2016 (SIGMOD 2016)
  • Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Carsten Binnig, Ugur Çetintemel, Stan Zdonik: An Architecture for Compiling UDF-centric Workflows. PVLDB 8(12): 1466-1477 (2015)
  • Andrew Crotty, Alex Galakatos, Emanuel Zgraggen, Carsten Binnig, Tim Kraska: Vizdom: Interactive Analytics through Pen and Touch. PVLDB 8(12): 2024-2035 (2015) (Best Demo)

EchoQuery and DBPal: Chatting with Your Relational Database

Recent advances in automatic speech recognition and natural language processing have led to a new generation of robust voice-based interfaces. Yet, there is very little work on using voice-based interfaces to query database systems. In fact, one might even wonder who in their right mind would want to query a database system using natural-language-based voice commands! With this project, we make the case for querying database systems using an NL-based interface, a new querying and interaction paradigm we call Query-by-Natural-Language (QbNL). The aim of this project is to demonstrate the practicality and utility of QbNL for relational DBMSs using a proof-of-concept system called EchoQuery. To achieve a smooth and intuitive interaction, the query interface of EchoQuery is inspired by casual human-to-human conversations.

EchoQuery uses DBPal, an NL-based interface to RDBMSs that focuses on robust translation of natural language statements to SQL using deep learning techniques.

The main features of EchoQuery are:

  • Hands-free Access: EchoQuery does not require the user to press a button or start an application using a gesture or a mouse-click. Instead, users can interact with the database solely by voice, at any time.
  • Dialogue-based Querying: While traditional database systems provide a one-shot (i.e., stateless) query interface, natural language conversations are incremental (i.e., stateful) in nature. To that end, EchoQuery provides a stateful, dialogue-based query interface between the user and the database where (1) users can start a conversation with an initial query and refine that query incrementally over time, and (2) EchoQuery can ask for clarification if a query is incomplete or contains ambiguities that need to be resolved.
  • Personalizable Vocabulary: Domain experts often use their own terms to formulate queries, which might differ from the schema elements (i.e., table and column names) of a database. Learning a user’s terminology and its translation to the underlying schema is similar to the problem of constructing a schema mapping in data integration. EchoQuery constructs these mappings incrementally on a per-user basis by issuing clarification questions through its dialogue-based query interface.
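
The stateful, dialogue-based querying model can be sketched in a few lines: each follow-up refines the previous query instead of starting from scratch. The class, the table and predicate strings, and the refinement API are hypothetical placeholders for illustration only; the NL-to-SQL translation step itself is handled by a learned model in DBPal and is omitted here.

```python
class QuerySession:
    """Sketch of a stateful query session: follow-up utterances
    accumulate predicates on top of the initial query."""
    def __init__(self, table):
        self.table = table
        self.filters = []

    def start(self, predicate):
        """Begin a conversation, e.g., 'show me patients with flu'."""
        self.filters = [predicate]
        return self.to_sql()

    def refine(self, predicate):
        """Follow-up refinement, e.g., 'only from 2017'."""
        self.filters.append(predicate)
        return self.to_sql()

    def to_sql(self):
        where = " AND ".join(self.filters)
        return f"SELECT * FROM {self.table}" + (f" WHERE {where}" if where else "")

session = QuerySession("patients")
q1 = session.start("diagnosis = 'flu'")
q2 = session.refine("year = 2017")  # extends q1 rather than replacing it
```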

EchoQuery: Chatting with Your Relational Database from Vinh Tran on Vimeo.


[DBPal: A Natural Language Interface for SQL Databases](https://vimeo.com/251178010)

Publications:

  • Fuat Basik, Benjamin Hättasch, Amir Ilkhechi, Arif Usta, Shekar Ramaswamy, Prasetya Utama, Nathaniel Weir, Carsten Binnig, Ugur Cetintemel. DBPal: A Learned Natural Language Interface for Database Systems. ACM SIGMOD 2018.
  • Prasetya Utama, Nathaniel Weir, Fuat Basik, Carsten Binnig, Ugur Cetintemel, Benjamin Hättasch, Amir Ilkhechi, Shekar Ramaswamy, Arif Usta. An End-to-end Neural Natural Language Interface for Databases. arXiv:1804.00401.
  • Carsten Binnig, Ugur Cetintemel, Nathaniel Weir, et al.: Voice-based Data Exploration: Chatting with your Database. SCAI@ICTIR 2017.
  • Gabriel Lyons, Vinh Tran, Carsten Binnig et al.: Making the Case for Query-by-Voice with EchoQuery. Demo Paper, SIGMOD 2016

HashStash – Reuse for Interactive Data Exploration

Modern database workloads present ample opportunities for intermediate result reuse. For example, exploration-oriented applications such as Vizdom typically generate workloads where each query serves as a jumping-off point for the next, which is obtained through incremental modifications (e.g., by refining filters, adding joins, drilling down, etc.). Various techniques have been developed to profitably reuse intermediates in DBMSs. These solutions typically require intermediate results of individual operators to be materialized into temporary tables in order to be considered for reuse later. However, such approaches are fundamentally ill-suited for modern main memory databases, which are typically limited by the bandwidth of the memory bus; query execution is thus heavily optimized to keep tuples in the CPU caches and registers. Adding operations to a query plan that materialize intermediates into a temporary in-memory data structure not only adds traffic to the memory bus but, more importantly, destroys this cache- and register-locality, resulting in high performance penalties.

To that end, the goal of this project is to revisit “reuse” in the context of modern main memory databases. The main idea is to leverage the internal data structures that are materialized anyway by pipeline breakers during query execution; this way, reuse is possible without any additional materialization cost. The focus of this work is on the most common such data structure, hash tables (HTs), as found in hash-join and hash-aggregate operations. We leave other operators and data structures (e.g., trees for sorting) for future work. Our experiments show performance gains of up to 50x compared to execution strategies without reuse, and up to 10x compared to traditional materialization-based reuse approaches, without adding any materialization overhead.
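
The core reuse idea can be sketched as follows: the hash table that a hash-aggregate builds anyway is stashed under a plan signature and handed back to a matching follow-up query, so reuse costs nothing extra. The class and the string-based signature scheme are illustrative assumptions; HashStash’s actual plan matching, cache eviction, and update handling are more involved.

```python
class HashStash:
    """Sketch: cache the internal hash table of a hash-aggregate
    (a pipeline breaker) keyed by a subplan signature."""
    def __init__(self):
        self.cache = {}  # plan signature -> hash table

    def hash_aggregate(self, signature, rows, key_fn, val_fn):
        ht = self.cache.get(signature)
        if ht is None:
            # Build the hash table once, as the operator must do anyway.
            ht = {}
            for row in rows:
                k = key_fn(row)
                ht[k] = ht.get(k, 0) + val_fn(row)
            self.cache[signature] = ht  # stash it at zero extra cost
        return ht

stash = HashStash()
rows = [("a", 1), ("b", 2), ("a", 3)]
ht1 = stash.hash_aggregate("sum(v) group by k", rows, lambda r: r[0], lambda r: r[1])
ht2 = stash.hash_aggregate("sum(v) group by k", rows, lambda r: r[0], lambda r: r[1])
# The second call returns the stashed table instead of rebuilding it.
```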

Publications:

  • Kayhan Dursun, Carsten Binnig, Ugur Cetintemel and Tim Kraska. Revisiting Reuse in Modern Main Memory Databases. SIGMOD 2017.

Interactive Anomaly Exploration

We have been interested in interactive interfaces for real-time anomaly detection over time-series data. Our approach uses zero-positive learning, in which training is done only over non-anomalous data, so anomalies need not be enumerated a priori. Anomalies are then detected as subsequences whose statistical properties do not match those of the training set. This is convenient for the user, but it can make the semantics of a reported anomaly difficult or impossible to understand. We have therefore designed a visual tool for interacting with anomalies to help users understand their source, including novel uses of correlation, what-if tools, and display techniques for time series.
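
The zero-positive scheme can be sketched as follows: fit bounds on a window statistic using only anomaly-free data, then flag subsequences whose statistic falls outside those bounds. The window-mean statistic and the 3-sigma threshold are simplifying assumptions for illustration; Greenhouse uses learned models rather than this fixed rule.

```python
import statistics

def train(normal_series, window=10):
    """Fit window-mean bounds on anomaly-free (zero-positive) data."""
    means = [statistics.fmean(normal_series[i:i + window])
             for i in range(len(normal_series) - window + 1)]
    return statistics.fmean(means), statistics.pstdev(means)

def detect(series, model, window=10, k=3.0):
    """Flag start indices of windows whose mean deviates more than
    k standard deviations from the training distribution."""
    mu, sigma = model
    return [i for i in range(len(series) - window + 1)
            if abs(statistics.fmean(series[i:i + window]) - mu) > k * sigma]
```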

Publications:

  • Tae Jun Lee, Justin Gottschlich, Nesime Tatbul, Eric Metcalf, Stan Zdonik. Greenhouse: A Zero-Positive Machine Learning System for Time-Series Anomaly Detection. SysML 2018
  • Tae Jun Lee, Justin Gottschlich, Nesime Tatbul, Eric Metcalf, Stan Zdonik. Precision and Recall for Range-Based Anomaly Detection. SysML 2018

Acknowledgements

We are grateful to NSF for supporting this work (in part) through IIS Award #1514491.

Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.