20/20: Human-in-the-Loop Data Exploration

Overview

We propose to build a new class of database systems designed for Human-In-the-Loop (HIL) operation. We target an ever-growing set of data-centric applications in which data scientists of varying skill levels manipulate, analyze, and explore large data sets, often using complex analytics and machine learning techniques. Enabling these applications with ease of use and at “human speeds” is key to democratizing data science and maximizing human productivity. Traditional database technologies are ill-suited to serve this purpose. Historically, databases assumed (1) text-based input (e.g., SQL) and output, (2) a point (i.e., stateless) query-response paradigm, (3) batch results, and (4) simple analytics. We will drop these fundamental assumptions and build a system that instead supports visual input and output, “conversational” interaction, early and progressive results, and complex analytics. Building a system that integrates these features requires a complete rethinking of the full database stack, from the interface to the “guts”, as well as incorporating pertinent algorithms.

The proposed work will make the following technical contributions:

Visual interactive data exploration and manipulation: Interactive visualizations are an effective and user-friendly means of accessing and manipulating data. In our model, users interact with visualizations to express data operations (input) and also consume visualizations as results (output), which requires profound changes to traditional data system design and optimization. On the input side, the PIs propose a visual interaction language that operates over visualizations, taking a visualization as input and producing another as output. On the output side, the PIs will study the notion of visual approximate query answering, which brings approximate query processing to the world of visualization to enable real-time analysis of large data sets. The PIs will also develop techniques for optimizing visualization-specific operations that are not well supported by existing databases. These interaction and visualization techniques will be investigated in the context of advanced analytics workloads such as anomaly detection.

Natural-language-based, conversational query processing: The PIs will build a natural language interface to relational database systems. Using learning techniques, we will automatically translate natural language queries to SQL. Furthermore, the PIs will develop techniques that facilitate a “conversation” between the database and the user. In data exploration, users interact with the system using a sequence of queries (a.k.a. a query session), each building on the previous one. This radically departs from the point interaction model of traditional query processing, in which no relationships between queries are assumed. The PIs will introduce a session-aware processing model where the system expects users to engage in a long-running “conversation” and optimizes its execution accordingly.
Interactive query steering: A fluid, engaging conversation requires that users have the ability to get representative results early and progressively, and as they learn from these results, to exercise control by rapidly interrupting queries or changing their behavior (e.g., query parameters). The PIs propose novel online query processing and steering techniques that are designed for visual data manipulation involving complex analytics and machine learning. Below, we discuss different initial projects towards achieving these goals.
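
As a small illustration of early, progressive results with a user-steering hook, consider a mean query answered from a uniformly shuffled sample, batch by batch. This is a minimal sketch under simplifying assumptions (in-memory data, a normal-approximation error bound); the function and parameter names are illustrative and not part of any of the systems described below.

```python
import random

def progressive_mean(data, batch_size=1000, stop=None):
    """Yield a running mean estimate plus a rough 95% error bound after
    each batch, so a UI can render early results and let the user
    interrupt or re-parameterize the query at any time."""
    n, total, sq_total = 0, 0.0, 0.0
    shuffled = random.sample(data, len(data))  # uniform random processing order
    for i in range(0, len(shuffled), batch_size):
        for x in shuffled[i:i + batch_size]:
            n += 1
            total += x
            sq_total += x * x
        mean = total / n
        var = max(sq_total / n - mean * mean, 0.0)
        ci = 1.96 * (var / n) ** 0.5  # approximate 95% confidence half-width
        yield mean, ci
        if stop is not None and stop(mean, ci):  # steering hook: stop early
            return
```

A consumer can break out of the generator at any point (user interruption) or pass a `stop` callback (e.g., stop once the bound is tight enough), which is the essence of steering a long-running query.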

People

  • Carsten Binnig
  • Ugur Cetintemel
  • Tim Kraska
  • Andries van Dam
  • Stan Zdonik

Vizdom: Interactive Analytics through Pen and Touch

Machine learning (ML) and advanced statistics are important tools for drawing insights from large datasets. However, these techniques often require human intervention to steer computation towards meaningful results. In this project, we build a new system for interactive analytics through pen and touch called Vizdom. Vizdom’s frontend allows users to visually compose complex workflows of ML and statistics operators on an interactive whiteboard, and the backend leverages recent advances in workflow compilation techniques to run these computations at interactive speeds. Additionally, we are exploring approximation techniques for quickly visualizing partial results that incrementally refine over time. Unlike existing approximate query processing techniques, Vizdom takes the perception of the user into account to avoid unnecessary computation when results would not be perceivable by the user.
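
The perception-aware stopping idea can be illustrated with a small sketch: a histogram is refined batch by batch, and refinement stops once no bar would move by a perceivable amount (here, one pixel). The one-pixel threshold, the fixed-bin histogram, and all names are illustrative assumptions; Vizdom’s actual perception model is more involved.

```python
def refine_until_stable(sample_batches, bins=10, pixel_height=200):
    """Incrementally refine a histogram from batches of samples in [0, 1)
    and stop early once no bar moves by more than one pixel between
    refinements, i.e., once further computation is not perceivable."""
    counts = [0] * bins
    total = 0
    prev = [0.0] * bins
    for batch in sample_batches:
        for x in batch:
            b = min(int(x * bins), bins - 1)
            counts[b] += 1
        total += len(batch)
        heights = [pixel_height * c / total for c in counts]
        if max(abs(h - p) for h, p in zip(heights, prev)) < 1.0:
            return heights, total  # visually stable: stop consuming batches
        prev = heights
    return prev, total
```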

Vizdom: Interactive Analytics through Pen and Touch from Emanuel Zgraggen on Vimeo.

Publications

  • Emanuel Zgraggen, Zheguang Zhao, Robert Zeleznik, and Tim Kraska. Investigating the Effect of the Multiple Comparisons Problem in Visual Analysis. ACM CHI 2018
  • Yue Guo, Carsten Binnig, Tim Kraska: What you see is not what you get!: Detecting Simpson’s Paradoxes during Data Exploration. HILDA@SIGMOD 2017: 2:1-2:5
  • Zheguang Zhao, Lorenzo De Stefani, Emanuel Zgraggen, Carsten Binnig, Eli Upfal, Tim Kraska: Controlling False Discoveries During Interactive Data Exploration. SIGMOD Conference 2017: 527-540
  • Zheguang Zhao, Emanuel Zgraggen, Lorenzo De Stefani, Carsten Binnig, Eli Upfal, Tim Kraska: Safe Visual Data Exploration. SIGMOD Conference 2017: 1671-1674
  • Muhammad El-Hindi, Zheguang Zhao, Carsten Binnig and Tim Kraska. VisTrees: Fast Indexes for Interactive Data Exploration. Research Paper, HILDA 2016 (SIGMOD 2016)
  • Andrew Crotty, Alexander Galakatos, Emanuel Zgraggen, Carsten Binnig and Tim Kraska. The Case for Interactive Data Exploration Accelerators (IDEAs). Research Paper, HILDA 2016 (SIGMOD 2016)
  • Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Carsten Binnig, Ugur Çetintemel, Stan Zdonik: An Architecture for Compiling UDF-centric Workflows. PVLDB 8(12): 1466-1477 (2015)
  • Andrew Crotty, Alex Galakatos, Emanuel Zgraggen, Carsten Binnig, Tim Kraska: Vizdom: Interactive Analytics through Pen and Touch. PVLDB 8(12): 2024-2035 (2015) (Best Demo)

EchoQuery and DBPal: Chatting with Your Relational Database

Recent advances in automatic speech recognition and natural language processing have led to a new generation of robust voice-based interfaces. Yet, there is very little work on using voice-based interfaces to query database systems. In fact, one might even wonder who in their right mind would want to query a database system using natural-language-based voice commands! With this project, we make the case for querying database systems using an NL-based interface, a new querying and interaction paradigm we call Query-by-Natural-Language (QbNL). The aim of this project is to demonstrate the practicality and utility of QbNL for relational DBMSs using a proof-of-concept system called EchoQuery. To achieve a smooth and intuitive interaction, the query interface of EchoQuery is inspired by casual human-to-human conversations.

EchoQuery uses DBPal, an NL-based interface to RDBMSs that focuses on robust translation of natural language statements to SQL using deep learning techniques.

The main features of EchoQuery are:

  • Hands-free Access: EchoQuery does not require the user to press a button or start an application using a gesture or a mouse-click. Instead, users can interact with the database solely by voice, at any time.
  • Dialogue-based Querying: While traditional database systems provide a one-shot (i.e., stateless) query interface, natural language conversations are incremental (i.e., stateful) in nature. To that end, EchoQuery provides a stateful, dialogue-based query interface between the user and the database where (1) users can start a conversation with an initial query and refine that query incrementally over time, and (2) EchoQuery can ask for clarification if a query is incomplete or contains ambiguities that need to be resolved.
  • Personalizable Vocabulary: Domain experts often use their own terms to formulate queries, which might differ from the schema elements (i.e., table and column names) of a database. Learning a user’s terminology and its translation to the underlying schema is similar to the problem of constructing a schema mapping in data integration. EchoQuery constructs these mappings incrementally on a per-user basis by issuing clarification questions through its dialogue-based query interface.
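
The stateful, dialogue-based querying model can be sketched in a few lines: each follow-up refines the previous query instead of starting from scratch. The class, the table and predicate strings, and the refinement API are hypothetical placeholders for illustration only; the NL-to-SQL translation step itself is handled by a learned model in DBPal and is omitted here.

```python
class QuerySession:
    """Sketch of a stateful query session: follow-up utterances
    accumulate predicates on top of the initial query."""
    def __init__(self, table):
        self.table = table
        self.filters = []

    def start(self, predicate):
        """Begin a conversation, e.g., 'show me patients with flu'."""
        self.filters = [predicate]
        return self.to_sql()

    def refine(self, predicate):
        """Follow-up refinement, e.g., 'only from 2017'."""
        self.filters.append(predicate)
        return self.to_sql()

    def to_sql(self):
        where = " AND ".join(self.filters)
        return f"SELECT * FROM {self.table}" + (f" WHERE {where}" if where else "")

session = QuerySession("patients")
q1 = session.start("diagnosis = 'flu'")
q2 = session.refine("year = 2017")  # extends q1 rather than replacing it
```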

EchoQuery: Chatting with Your Relational Database from Vinh Tran on Vimeo.


[DBPal: A Natural Language Interface for SQL Databases](https://vimeo.com/251178010)

Publications:

  • Fuat Basik, Benjamin Hättasch, Amir Ilkhechi, Arif Usta, Shekar Ramaswamy, Prasetya Utama, Nathaniel Weir, Carsten Binnig, Ugur Cetintemel. DBPal: A Learned Natural Language Interface for Database Systems. ACM SIGMOD 2018.
  • Prasetya Utama, Nathaniel Weir, Fuat Basik, Carsten Binnig, Ugur Cetintemel, Benjamin Hättasch, Amir Ilkhechi, Shekar Ramaswamy, Arif Usta. An End-to-end Neural Natural Language Interface for Databases. arXiv:1804.00401.
  • Carsten Binnig, Ugur Cetintemel, Nathaniel Weir, et al.: Voice-based Data Exploration: Chatting with your Database. SCAI@ICTIR 2017.
  • Gabriel Lyons, Vinh Tran, Carsten Binnig et al.: Making the Case for Query-by-Voice with EchoQuery. Demo Paper, SIGMOD 2016

HashStash – Reuse for Interactive Data Exploration

Modern database workloads present ample opportunities for intermediate result reuse. For example, exploration-oriented applications such as Vizdom typically generate workloads where each query serves as a jumping-off point for the next, which is obtained through incremental modifications (e.g., by refining filters, adding joins, drilling down, etc.). Various techniques have been developed to profitably reuse intermediates in DBMSs. These solutions typically require intermediate results of individual operators to be materialized into temporary tables in order to be considered for reuse later. However, such approaches are fundamentally ill-suited for modern main memory databases, which are typically limited by the bandwidth of the memory bus; query execution is thus heavily optimized to keep tuples in the CPU caches and registers. Adding operations to a query plan that materialize intermediates into a temporary in-memory data structure not only adds traffic to the memory bus but, more importantly, destroys this cache- and register-locality, resulting in high performance penalties.

To that end, the goal of this project is to revisit “reuse” in the context of modern main memory databases. The main idea is to leverage the internal data structures that are materialized anyway by pipeline breakers during query execution; this way, reuse is possible without any additional materialization cost. The focus of this work is on the most common such data structure, hash tables (HTs), as found in hash-join and hash-aggregate operations. We leave other operators and data structures (e.g., trees for sorting) for future work. Our experiments show performance gains of up to 50x compared to execution strategies without reuse, and up to 10x compared to traditional materialization-based reuse approaches, without adding any materialization overhead.
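
The core reuse idea can be sketched as follows: the hash table that a hash-aggregate builds anyway is stashed under a plan signature and handed back to a matching follow-up query, so reuse costs nothing extra. The class and the string-based signature scheme are illustrative assumptions; HashStash’s actual plan matching, cache eviction, and update handling are more involved.

```python
class HashStash:
    """Sketch: cache the internal hash table of a hash-aggregate
    (a pipeline breaker) keyed by a subplan signature."""
    def __init__(self):
        self.cache = {}  # plan signature -> hash table

    def hash_aggregate(self, signature, rows, key_fn, val_fn):
        ht = self.cache.get(signature)
        if ht is None:
            # Build the hash table once, as the operator must do anyway.
            ht = {}
            for row in rows:
                k = key_fn(row)
                ht[k] = ht.get(k, 0) + val_fn(row)
            self.cache[signature] = ht  # stash it at zero extra cost
        return ht

stash = HashStash()
rows = [("a", 1), ("b", 2), ("a", 3)]
ht1 = stash.hash_aggregate("sum(v) group by k", rows, lambda r: r[0], lambda r: r[1])
ht2 = stash.hash_aggregate("sum(v) group by k", rows, lambda r: r[0], lambda r: r[1])
# The second call returns the stashed table instead of rebuilding it.
```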

Publications:

  • Kayhan Dursun, Carsten Binnig, Ugur Cetintemel and Tim Kraska. Revisiting Reuse in Modern Main Memory Databases. SIGMOD 2017.

Interactive Anomaly Exploration

We have been interested in interactive interfaces for real-time anomaly detection over time-series data. Our approach uses zero-positive learning, in which training is done only over non-anomalous data, so anomalies need not be enumerated a priori. Anomalies are then detected as subsequences whose statistical properties do not match those of the training set. This is convenient for the user, but it can make the semantics of a reported anomaly difficult or impossible to understand. We have therefore designed a visual tool for interacting with anomalies to help users understand their source, including novel uses of correlation, what-if tools, and display techniques for time series.
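
The zero-positive scheme can be sketched as follows: fit bounds on a window statistic using only anomaly-free data, then flag subsequences whose statistic falls outside those bounds. The window-mean statistic and the 3-sigma threshold are simplifying assumptions for illustration; Greenhouse uses learned models rather than this fixed rule.

```python
import statistics

def train(normal_series, window=10):
    """Fit window-mean bounds on anomaly-free (zero-positive) data."""
    means = [statistics.fmean(normal_series[i:i + window])
             for i in range(len(normal_series) - window + 1)]
    return statistics.fmean(means), statistics.pstdev(means)

def detect(series, model, window=10, k=3.0):
    """Flag start indices of windows whose mean deviates more than
    k standard deviations from the training distribution."""
    mu, sigma = model
    return [i for i in range(len(series) - window + 1)
            if abs(statistics.fmean(series[i:i + window]) - mu) > k * sigma]
```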

Publications:

  • Tae Jun Lee, Justin Gottschlich, Nesime Tatbul, Eric Metcalf, Stan Zdonik. Greenhouse: A Zero-Positive Machine Learning System for Time-Series Anomaly Detection. SysML 2018
  • Tae Jun Lee, Justin Gottschlich, Nesime Tatbul, Eric Metcalf, Stan Zdonik. Precision and Recall for Range-Based Anomaly Detection. SysML 2018

Acknowledgements

We are grateful to NSF for supporting this work (in part) through IIS Award #1514491.

Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.