
NoSQL Data Models

By: Olivier Pivert

Overview of this book

Most modern applications must now operate in Big Data environments, and this book addresses the latest issues and hurdles encountered in such environments. The book begins by presenting an overview of NoSQL languages and systems. Then, you’ll evaluate SPARQL queries over large RDF datasets and devise a solution that uses the MapReduce framework to process SPARQL graph patterns. Next, you’ll handle the production of web data, generate a set of links between two different datasets and overcome various heterogeneity problems. Moving ahead, you’ll take a multi-graph-based approach to the challenges faced by the RDF data management community. Finally, you’ll deal with the flexible querying of graph databases and with textual data management. By the end of this book, you’ll have gathered essential information on the big data challenges faced by NoSQL databases.

1.4. New challenges for database research

Since their appearance around the year 2000, NoSQL databases have become ubiquitous and collectively store a vast amount of data. In contrast with XML databases, which rose in popularity in the mid-1990s before settling into specific, document-centric applications, it seems safe to assume that NoSQL databases are here to stay, alongside relational ones. After a first decade of fruitful research in several directions, the time has come to unify these research efforts.

First and foremost, in our view, a formal model of NoSQL databases and queries has yet to be defined. It should play the same role that relational algebra played as a foundation for SQL. In particular, this model should allow us to:

  • describe the data model precisely;
  • express complex queries;
  • reason about queries and their semantics, and in particular about query equivalence;
  • describe the cost model of queries;
  • reason about meta-properties of queries (type soundness and security properties such as non-interference, access control or data provenance);
  • characterize high-level optimizations.
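As a tiny illustration of the kind of equivalence reasoning such a model should support, consider the classic commutation of selection and projection. The sketch below is illustrative only: the operators are hand-rolled over Python dictionaries, not taken from any existing system.

```python
# Two expressions in a toy relational algebra over dictionaries.
# A formal model should let us *prove* they are equivalent, so that an
# optimizer may pick the cheaper plan (selecting before projecting).

users = [
    {"id": 1, "name": "Ada", "age": 36},
    {"id": 2, "name": "Grace", "age": 45},
]

def select(pred, rel):
    """sigma: keep the tuples satisfying the predicate."""
    return [t for t in rel if pred(t)]

def project(attrs, rel):
    """pi: keep only the listed attributes."""
    return [{a: t[a] for a in attrs} for t in rel]

# sigma_{age>40}(pi_{name,age}(users))
q1 = select(lambda t: t["age"] > 40, project(["name", "age"], users))

# pi_{name,age}(sigma_{age>40}(users)): same result, but the selection
# runs first, so fewer tuples reach the projection.
q2 = project(["name", "age"], select(lambda t: t["age"] > 40, users))

assert q1 == q2  # the rewrite preserves the query's semantics
```

With a formal model, such rewrites can be validated once and for all instead of being re-justified informally for each engine.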

Finding such a model is challenging in many ways. First, it must allow us to model data as it exists in current – and future – NoSQL systems, from the simple key-value store to the more complex document store, while retaining compatibility with the relational model. While at first sight the nested relational algebra seems an ideal candidate (see, for instance, [ABI 84, FIS 85, PAR 92]), it does not easily model the heterogeneous collections that are common in NoSQL data stores. Perhaps an algebra based on nested data types with extensible records, similar to [BEN 13], could be of use; it has already been used successfully to model collections of (nested) heterogeneous JSON objects.
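To make the heterogeneity problem concrete, here is a minimal sketch (our own example data, not from the cited works) of the kind of document-store collection that defeats a single flat relational schema:

```python
import json

# A heterogeneous collection, typical of a document store: records share
# some fields but not others, and some fields are themselves nested,
# so no single flat relational schema fits all of them.
docs = json.loads("""[
  {"id": 1, "name": "Ada",   "emails": ["ada@example.org"]},
  {"id": 2, "name": "Grace", "address": {"city": "Arlington"}},
  {"id": 3, "tags": ["draft"]}
]""")

# Queries must tolerate absent fields. An algebra with extensible record
# types would instead track statically which fields each record may carry.
names = [d.get("name", "<unknown>") for d in docs]
# → ['Ada', 'Grace', '<unknown>']
```

The dynamic `.get` with a default is exactly the kind of ad hoc handling that a typed model of extensible records would discharge statically.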

Second, if a realistic cost model is to be devised, the model might have to make the distributed nature of data explicit. This distribution happens at several levels: first, collections are stored in a distributed fashion, and second, computations may also be performed in a distributed fashion. While process calculi have existed for a long time (for instance, the one introduced by Milner et al. [MIL 92]), they do not seem to tackle the data aspect of the problem at hand.

Another challenge to be overcome is the interaction with high-level programming languages. For database-oriented applications (such as Web applications), programmers still favor using a query language (such as SQL) directly through a language API (such as Java’s JDBC), or through higher-level abstractions such as Object-Relational Mappings. Data-analytics applications, however, favor idiomatic R or Python code [GRE 15, VAN 17, BES 17], which leads to inefficient idioms (such as retrieving the bulk of the data on the client side only to filter it with R or Python code). Defining efficient, truly language-integrated queries remains an unsolved problem. One critical aspect is the server-side evaluation of user-defined functions, written in Python or R, close to the data and in a distributed fashion. Frameworks such as Apache Spark [ZAH 10], which enable data scientists to write efficient idiomatic R or Python code, do not allow us to easily reason about security, provenance or performance; in other words, they lack formal foundations.

A first step toward a unifying solution may be the work of Benzaken et al. [BEN 18]. Following the tradition of compiler design, this work formally defines an intermediate representation for queries: an extension of the λ-calculus (or, equivalently, of a small, pure functional programming language) with data operators (e.g. joins and grouping). This intermediate representation serves as a common compilation target for high-level languages (such as Python and R), and intermediate terms are then translated to various back-ends, ranging from SQL to MapReduce-based databases. This preliminary work seems to provide a good framework for exploring the design space and addressing the problems mentioned in this conclusion.
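The multi-back-end idea can be sketched as follows. This toy intermediate representation is our own illustration of the general compiler-design pattern, not the actual calculus of [BEN 18]: a term is built once, then either compiled to SQL text or interpreted over in-memory data.

```python
from dataclasses import dataclass

# A toy query IR: terms are built once, then lowered to different
# back-ends. Real IRs would carry typed predicates, not SQL strings.

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    pred_sql: str    # the predicate as SQL text (for the SQL back-end)
    pred_py: object  # the same predicate as a Python function
    child: object

def to_sql(node):
    """Back-end 1: compile the IR term to SQL text."""
    if isinstance(node, Scan):
        return f"SELECT * FROM {node.table}"
    if isinstance(node, Filter):
        return f"SELECT * FROM ({to_sql(node.child)}) t WHERE {node.pred_sql}"
    raise TypeError(node)

def evaluate(node, db):
    """Back-end 2: interpret the same IR term over in-memory collections."""
    if isinstance(node, Scan):
        return list(db[node.table])
    if isinstance(node, Filter):
        return [t for t in evaluate(node.child, db) if node.pred_py(t)]
    raise TypeError(node)

query = Filter("age > 40", lambda t: t["age"] > 40, Scan("users"))
db = {"users": [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]}
print(to_sql(query))      # SQL target
print(evaluate(query, db))  # in-memory target
```

Because both back-ends interpret the same terms, properties proved once on the IR (equivalences, cost bounds, provenance) hold for every target.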

Finally, while some progress has been made in implementing high-level operators on top of distributed primitives such as MapReduce, and while these approaches seem to follow a similar template (in the case of join: prune non-joinable items early, then regroup likely candidates while avoiding duplication as much as possible), further work is needed to unify and formally describe such low-level algorithms, and to express their cost in a way that high-level optimizers can reuse.
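The join template just described can be sketched in a few lines of single-process Python (a didactic stand-in for a real distributed runtime): the map phase tags each record with its side and prunes records lacking the join key, the shuffle regroups candidates by key, and the reduce phase pairs them up.

```python
from collections import defaultdict

def map_phase(left, right, key):
    """Map: tag each record with its side; prune non-joinable items early."""
    for rec in left:
        if key in rec:            # no join key -> cannot join, drop now
            yield rec[key], ("L", rec)
    for rec in right:
        if key in rec:
            yield rec[key], ("R", rec)

def join(left, right, key):
    """A minimal reduce-side equi-join following the MapReduce template."""
    groups = defaultdict(lambda: ([], []))   # shuffle: regroup by key
    for k, (side, rec) in map_phase(left, right, key):
        groups[k][0 if side == "L" else 1].append(rec)
    out = []
    for ls, rs in groups.values():           # reduce: local cross product
        for l in ls:
            for r in rs:
                out.append({**l, **r})
    return out

orders = [{"uid": 1, "item": "book"}, {"item": "pen"}]  # 2nd lacks the key
users = [{"uid": 1, "name": "Ada"}]
print(join(orders, users, "uid"))
# → [{'uid': 1, 'item': 'book', 'name': 'Ada'}]
```

Expressing the shuffle volume and per-group work of such algorithms in a reusable cost formalism is precisely the open problem the paragraph above points to.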

In conclusion, while relational databases started from both a formal foundation and solid implementations, NoSQL databases have developed rapidly as implementation artifacts. This situation exposes their limits, and database and programming language research now aims to ‘correct’ it by supplying the missing foundations.