Like many startups, Airbnb has seen its number of employees grow significantly over the past several years. In parallel, we have seen explosive growth in both the amount of data and the number of internal data resources: data tables, dashboards, reports, metrics definitions, etc. On one hand, the growth in data resources is healthy and reflects our heavy investment in data tooling to promote data-informed decision making. On the other hand, it creates a new challenge: effectively navigating a sea of data resources of varying quality, complexity, relevance, and trustworthiness. In this post we describe our observations of this problem and the Dataportal, a novel data resource search and discovery tool that addresses it.
A data catalog helps companies organize and find data that's stored in their many systems. It works a lot like a fashion catalog. But instead of detailing swimsuits or shoes, it has information about tables, files, and databases from a company's ERP, HR, Finance, and E-commerce systems (as well as social media feeds). The catalog also shows where all the data entities are located. A data catalog contains lots of critical information about each piece of data, such as the data's profile (statistics or informative summaries about the data), lineage (how the data is generated), and what others say about it.
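To make the pieces of a catalog entry concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the schema of any particular catalog product: the names `CatalogEntry`, `DataCatalog`, `search`, and `upstream`, as well as the sample assets, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset as a catalog might describe it (hypothetical schema)."""
    name: str
    location: str                                   # where the asset lives
    source_system: str                              # e.g. "ERP", "HR", "Finance"
    profile: dict = field(default_factory=dict)     # summary statistics
    lineage: list = field(default_factory=list)     # names of upstream assets
    comments: list = field(default_factory=list)    # what others say about it


class DataCatalog:
    """In-memory catalog with a naive keyword search."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return names of entries whose name or comments mention the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if kw in e.name.lower() or any(kw in c.lower() for c in e.comments)
        )

    def upstream(self, name):
        """Walk lineage transitively and return every upstream asset found."""
        seen, stack = [], list(self._entries[name].lineage)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                stack.extend(self._entries[current].lineage)
        return seen


catalog = DataCatalog()
catalog.register(CatalogEntry("hr.employees", "warehouse/hr/employees", "HR",
                              profile={"row_count": 1200}))
catalog.register(CatalogEntry("finance.payroll", "warehouse/finance/payroll",
                              "Finance", lineage=["hr.employees"],
                              comments=["Trusted source for payroll reporting"]))

print(catalog.search("payroll"))
print(catalog.upstream("finance.payroll"))
```

A real catalog would persist these entries and fill the profile and lineage fields automatically from the source systems; the sketch only shows how location, profile, lineage, and user commentary sit side by side on one record.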
Not only does this provide useful information to users in the moment, but it has also helped raise awareness and increase the adoption of Lexikon. Since launching the Lexikon Slack Bot, we've seen a sustained 25% increase in the number of Lexikon links shared on Slack per week.
You just listened to a track by a new artist on your Discover Weekly and you're hooked. You want to hear more and learn about the artist. So, you go to the artist page on Spotify where you can check out the most popular tracks across different albums, read an artist bio, check out playlists where people tend to discover the artist, and explore similar artists.
Unstructured enterprise data such as reports, manuals and guidelines often contain tables. The traditional way of integrating data from these tables is through a two-step process of table detection/extraction and mapping the table layouts to an appropriate schema. This can be an expensive process. In this paper we show that by using semantic technologies (RDF/SPARQL and database dependencies) paired with a simple but powerful way to transform tables with non-relational layouts, it is possible to offer query answering services over these tables with minimal manual work or domain-specific mappings. Our method enables users to exploit data in tables embedded in documents with little effort, not only for simple retrieval queries, but also for structured queries that require joining multiple interrelated tables.
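The general idea of flattening a non-relational table layout into triples and then joining across tables can be sketched in a few lines of Python. This is not the paper's method: plain tuples stand in for RDF, list comprehensions stand in for SPARQL, and the function names and the sales and managers tables are invented for illustration.

```python
def cross_tab_to_triples(name, keys, col_labels, rows):
    """Flatten a cross-tab (row label x column label -> cell) into triples.

    Each cell becomes one node carrying three properties: its row label,
    its column label, and its value.
    """
    row_key, col_key, value_key = keys
    triples = []
    for row_label, *cells in rows:
        for col_label, cell in zip(col_labels, cells):
            node = f"{name}/{row_label}/{col_label}"
            triples += [(node, row_key, row_label),
                        (node, col_key, col_label),
                        (node, value_key, cell)]
    return triples


def relational_to_triples(name, header, rows):
    """Turn an ordinary header-plus-rows table into triples, one node per row."""
    triples = []
    for i, row in enumerate(rows):
        node = f"{name}/{i}"
        triples += [(node, col, val) for col, val in zip(header, row)]
    return triples


def records(triples):
    """Group triples by subject into one dict per node."""
    grouped = {}
    for subject, predicate, obj in triples:
        grouped.setdefault(subject, {})[predicate] = obj
    return list(grouped.values())


# A cross-tab: regions down the side, years across the top (made-up data).
sales = cross_tab_to_triples("sales", ("region", "year", "revenue"),
                             [2020, 2021],
                             [["North", 10, 12], ["South", 7, 9]])
# A second, ordinary table that shares the "region" property.
managers = relational_to_triples("managers", ["region", "manager"],
                                 [["North", "Ada"], ["South", "Grace"]])


def revenue_by_manager(year):
    """Join the two tables on region and filter by year."""
    manager_of = {r["region"]: r["manager"] for r in records(managers)}
    return sorted((manager_of[r["region"]], r["revenue"])
                  for r in records(sales) if r["year"] == year)


print(revenue_by_manager(2021))
```

Once both tables share a uniform triple representation, the join needs no table-specific schema mapping, which is the property the abstract attributes to its approach.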
Bridging the information retrieval gap between non-technical database end-users and those with knowledge of formal query languages has been an active area of data management and analytics research. Natural language interfaces for querying databases offer an opportunity to overcome the communication barrier between end-users and systems that use formal query languages. Previous research efforts mainly focused on developing structured query interfaces to relational databases. However, the growth of unstructured big data such as text, images, and video has exposed the limitations of traditional structured query interfaces. While existing web search tools demonstrate the popularity and usability of natural language queries, they return complete documents and web pages instead of focused query responses and are not applicable to database systems. This paper reports our study on the design and development of a natural language query interface to a backend relational database. The novelty of the study lies in defining a graph database as a middle layer that stores the metadata needed to transform a natural language query into a Structured Query Language (SQL) statement that can be executed on backend databases. We implemented and evaluated our approach using a restaurant dataset. The translation results for some sample queries yielded a 90% accuracy rate.
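The role of a metadata middle layer can be illustrated with a small Python sketch. It is an assumption-laden toy, not the authors' implementation: a plain dictionary stands in for the graph database, and the edge labels (`REFERS_TO_TABLE`, `VALUE_OF`, `COLUMN_OF`), the `restaurant` table, its columns, and the sample rows are all hypothetical.

```python
import re
import sqlite3

# Hypothetical metadata graph: node -> list of (relation, target) edges.
GRAPH = {
    "restaurants": [("REFERS_TO_TABLE", "restaurant")],
    "italian": [("VALUE_OF", "cuisine")],
    "thai": [("VALUE_OF", "cuisine")],
    "boston": [("VALUE_OF", "city")],
    "cuisine": [("COLUMN_OF", "restaurant")],
    "city": [("COLUMN_OF", "restaurant")],
}


def translate(question):
    """Map a question to (sql, params) by following edges in the metadata graph."""
    tokens = re.findall(r"[a-z]+", question.lower())
    table, candidates = None, []
    for token in tokens:
        for relation, target in GRAPH.get(token, []):
            if relation == "REFERS_TO_TABLE":
                table = target
            elif relation == "VALUE_OF":
                candidates.append((target, token))
    if table is None:
        raise ValueError("no table recognised in question")
    # Keep only filters whose column the graph attaches to the chosen table.
    filters = [(col, val) for col, val in candidates
               if ("COLUMN_OF", table) in GRAPH.get(col, [])]
    sql = f"SELECT name FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(f"LOWER({col}) = ?" for col, _ in filters)
    return sql + " ORDER BY name", [val for _, val in filters]


# Toy backend database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurant (name TEXT, cuisine TEXT, city TEXT)")
conn.executemany("INSERT INTO restaurant VALUES (?, ?, ?)", [
    ("Luigi's", "Italian", "Boston"),
    ("Bangkok House", "Thai", "Boston"),
    ("Trattoria Roma", "Italian", "Chicago"),
])

sql, params = translate("Show me Italian restaurants in Boston")
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Table and column names come only from the metadata graph, while user-supplied values are passed as bound parameters, so the question text never reaches the SQL string directly. A production system would need far richer linguistic handling than this keyword lookup.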