
A Comparison of NoSQL Time Series Databases

Term paper, 2015, 44 pages

Excerpt

Table of Contents

Abstract

Introduction

Background
Distributed Systems and NoSQL Databases
Time Series and Time Series Data
Time Series Databases (TSDBs)
Time Series Data Implications
Drawbacks of Relational Databases
Benefits of NoSQL Databases
Architecture of Time Series Databases

Related Work

Comparison Frameworks
Feature-oriented
Quality-oriented
NoSQL-specific characteristics
TSDB-specific characteristics

Application
Feature-oriented
Applied to open-source TSDBs

Quality-oriented
Applied to InfluxDB
Applied to OpenTSDB
Comparison Overview

Results

Conclusions

References

Abstract

In recent years, NoSQL databases have been developed to meet demands for high performance, reliability and horizontal scalability. NoSQL time series databases (TSDBs) have emerged to combine these NoSQL properties with the characteristics of time series data, which occur in many use cases. These solutions handle the data volume and frequency typical of time series efficiently. Developers and decision makers struggle to choose a TSDB from the large variety of solutions. To date, no comparison exists that focuses on the specific features and qualities of these heterogeneous applications.

This paper delivers two frameworks for the comparison of TSDBs, the first with a focus on features and the second with a focus on quality. Furthermore, we apply the frameworks to up to seven open-source TSDBs, such as InfluxDB and OpenTSDB, and evaluate them.

We find that the investigated TSDBs differ mainly in support- and extension-related points. They share performance-enhancing techniques, time-related query capabilities and data schemas optimized for handling time series data. OpenTSDB seems to be the most advanced solution. The feature-oriented approach is well suited to distinguishing TSDBs by their general characteristics. The quality-oriented comparison framework also helps to discover distinctive features of TSDBs. However, the latter approach lacks validity for quality-related figures because it includes no comparative measurement approaches.

Future work should extend the quality-oriented comparison framework with evaluation techniques, such as benchmarks using use-case-specific workloads.

Keywords: NoSQL. Time Series Database (TSDB). Comparison Framework. OpenTSDB. InfluxDB. System Architecture. Distributed System.

Introduction

For centuries, data ordered by time and enriched by event data has played a key role in science and society [34]. Use cases began with written records of experimental observations in philosophical fields and have led to today’s monitoring of huge and volatile service networks, for example in video streaming, e-commerce and the smart grid.

Methods and storage techniques have changed dramatically. Whereas traditional databases used to suffice for storing time series data at large scale, their capabilities are now exhausted for some use cases, and new technologies are arising to meet current and future needs. There is a growing need to store, process and view a tremendous amount of time-based data coming from a variety of network agents, and there is an audience willing to analyze historical or recently collected data. Advances in technology in recent years have made it attractive to use distributed systems of commodity hardware, as the number of network agents has increased and the costs of storage, processing performance and network bandwidth have decreased exponentially (Moore’s Law [40]).

In the last few years there has been a growing interest in NoSQL databases. NoSQL databases such as Cassandra [37], HBase [25] and BigTable [9] facilitate flexible and reliable data storage and can adapt to business needs in a very short time with high performance. Implementing such systems can lead to business advantages.

The capabilities of NoSQL databases match the requirements of today’s use cases for managing time series data. However, time series data has special characteristics that demand careful adaptations of NoSQL database designs, such as strict data schemas and specific query functionalities. TSDBs have been developed that use NoSQL databases as persistent storage for time series data. A current use case for TSDBs is the smart grid domain. Imagine the challenges of monitoring the behavior of local and resilient energy, gas and heat networks. A distributed and resilient time series database is necessary to store, provide and monitor the time series data of various agents and devices, such as domestic and commercial consumers and producers.

However, TSDB solutions differ in various characteristics such as expandability, support and performance. It is very important to choose carefully and to match application needs with TSDB capabilities. Many solutions are available (Table 1), but no literature can be found that compares TSDBs comprehensively on the basis of either their offered features or their quality attributes. These solutions include a variety of open-source TSDBs, e.g. Druid and InfluxDB. Many commercial solutions are also available, such as Geras and IBM Informix. Furthermore, some solutions and services are used internally by big players, such as Facebook with its in-memory TSDB Scuba [1] and Amazon with its monitoring service Amazon CloudWatch.

illustration not visible in this excerpt

Table 1. A variety of NoSQL time series databases (TSDBs)

In this paper we contribute to the current literature in two ways. First, to find out how TSDBs differ, we develop two comparison frameworks for feature- and quality-oriented analyses. Second, we apply the first framework to seven open-source TSDBs and the second to two. This also serves as a test scenario to evaluate the validity of the frameworks for meaningful comparisons.

The remainder of this paper is organized as follows: The Background section explains the essential background of distributed systems, NoSQL databases and time series databases. The Related Work section is devoted to related work in the field of time series databases and identifies the gap in the literature. In the Comparison Frameworks section we present two frameworks for comparing time series databases: the feature-oriented and the quality-oriented comparison framework. The Application section applies those concepts to several open-source TSDBs, in particular InfluxDB and OpenTSDB. The final section summarizes the preceding results and gives an outlook on further work.

Background

In the following we review the essentials of distributed systems, especially the CAP theorem and architectural styles. Furthermore, we define time series data in detail and describe the characteristics and features of time series databases.

Distributed Systems and NoSQL Databases

A distributed system is a network of computers that communicate with each other in order to deliver results. Together they can form a whole information system that appears as a single system to agents outside it.

Brewer’s CAP theorem is a fundamental concept that points out important design trade-offs in building distributed systems: any distributed system can give only two of the three guarantees “Consistency”, “Availability” and “Partition tolerance”. “Consistency” means that all participants see the same state of data. “Availability” refers to the guarantee that every request made to the system receives a response. Finally, “Partition tolerance” means that the system is able to continue its operations even if elements of the system fail or messages are lost [26].

For distributed systems, architectural styles involve important design and evaluation questions with various implications for system characteristics. The two general architectural styles are master-slave systems and peer-to-peer systems. Furthermore, information systems consist of various layers, e.g. a 3-layer system with presentation layer, application logic layer and database layer. They can also be designed with a component-based system architecture. Component-based architectures that run as application servers are often seen as a distributed computing environment [57].

NoSQL (Not only SQL) databases are a specialization of distributed systems in the field of databases. They differ fundamentally from traditional relational databases in that they are not restricted to strong ACID properties and are optimized to handle large datasets in a distributed manner. This leads to performance-related advantages such as lower latencies and higher throughput, as well as extensive possibilities to scale and thus to keep systems stable. Popular classes of NoSQL databases are key-value stores, wide-column stores and document stores [51].

Time Series and Time Series Data

A time series has special characteristics and can be applied in various fields. Essentially, a time series is a sequence of events recorded and stored in time order. Field-specific terms exist, e.g. price curves, load profiles and temperature traces. Common fields of application are physics, medicine, economics, IT infrastructure, finance and the results of other scientific experiments.

Time series data is composed of metrics and tags. A metric is an arrangement of numerical data in successive time order, consisting of a title and several time-value pairs. Usually time series data is enriched by tags, i.e. metadata with additional information, e.g. a CPU ID. A typical goal is to store the series in a table and to plot it as a graph for monitoring. Other goals lie in time series analysis, which investigates past data to gather meaningful statistics, and in time series forecasting, which predicts future developments based on such analytics [7]. Below we show how time series data is constructed:

- Time series data = metrics + tags
- Metric: an arrangement of numerical data in a successive time order
─ Metric title e.g. cpu.freq.1s
─ Data points time, value e.g. 09/03/2015 8:12:50, 900
- Tag: metadata to structurally enrich a metric with additional information
─ Tag tagtitle, value e.g. cpu.num, 542
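
The structure listed above can be sketched as a small data type. The following Python sketch is illustrative only: the class name TimeSeries and its field names are assumptions made for this example and are not taken from any particular TSDB.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass
class TimeSeries:
    """One metric plus its tags (hypothetical illustration type)."""
    metric: str                                            # metric title
    tags: Dict[str, str] = field(default_factory=dict)     # tag title -> tag value
    points: List[Tuple[datetime, float]] = field(default_factory=list)

    def append(self, timestamp: datetime, value: float) -> None:
        """Add one data point (time, value) at the end of the series."""
        self.points.append((timestamp, value))


# Example values from the list above. The date is read as 9 March 2015
# (day/month/year), which is an assumption about the original notation.
series = TimeSeries(metric="cpu.freq.1s", tags={"cpu.num": "542"})
series.append(datetime(2015, 3, 9, 8, 12, 50), 900)
```

Keeping the tags separate from the data points reflects the idea that tags describe the series as a whole, while the points carry only time and value.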

Time Series Databases (TSDBs)

Time Series Data Implications

TSDBs have a few unique characteristics that must be considered when designing and implementing such systems. They follow from the properties of time series data [22]:

- High volume of time-ordered data (data points usually <10 bytes, high frequency)
- Tags stored as metadata
- Operations: Write-append only; sequential reads; no updates; bulk uploads/deletes
- Queries: Filters; aggregations; down-sampling; custom queries
- Ability to create and visualize custom graphs

Query functionalities of TSDBs are very specific. Filters select particular time series by metric name, tag title-value pairs or time range. Aggregations apply specific functions to data points, e.g. sum and average. Down-sampling reduces the displayed time resolution and leads to faster plotting because of the reduced data volume. Some TSDBs also offer automated queries, e.g. to automate query calculations and write the results to separate time series, or even to fan out time series by tag values.
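
To make the three query types concrete, the following Python sketch implements a filter, an aggregation and a down-sampling step on in-memory data. It is a minimal illustration under stated assumptions: the sample data, the function names and the use of integer timestamps in seconds are all hypothetical and do not reproduce the query API of any specific TSDB.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: each series is (metric, tags, [(seconds, value), ...]).
SERIES = [
    ("cpu.load", {"host": "a"}, [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0)]),
    ("cpu.load", {"host": "b"}, [(0, 2.0), (30, 4.0), (60, 6.0), (90, 8.0)]),
    ("mem.free", {"host": "a"}, [(0, 10.0), (60, 20.0)]),
]


def filter_series(series, metric, tags=None, start=None, end=None):
    """Select series by metric name and tag pairs, then cut to [start, end)."""
    tags = tags or {}
    selected = []
    for name, series_tags, points in series:
        if name != metric:
            continue
        if any(series_tags.get(k) != v for k, v in tags.items()):
            continue
        kept = [(t, v) for t, v in points
                if (start is None or t >= start) and (end is None or t < end)]
        selected.append((name, series_tags, kept))
    return selected


def aggregate(series, func=sum):
    """Combine several series into one by applying func per timestamp."""
    by_time = defaultdict(list)
    for _, _, points in series:
        for t, v in points:
            by_time[t].append(v)
    return [(t, func(values)) for t, values in sorted(by_time.items())]


def downsample(points, interval, func=mean):
    """Reduce the time resolution to one value per bucket of `interval` seconds."""
    buckets = defaultdict(list)
    for t, v in points:
        buckets[t - t % interval].append(v)
    return [(start, func(values)) for start, values in sorted(buckets.items())]


cpu = filter_series(SERIES, "cpu.load")    # filter by metric name
summed = aggregate(cpu, sum)               # sum across hosts per timestamp
coarse = downsample(summed, 60, mean)      # 60-second averages of the sum
```

In this sketch the down-sampled series has half as many points as the summed one, which is the reduction in data volume that makes plotting faster.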

An important factor is the cardinality of metrics. Storing time series data with a number of tags of different categories can enable meaningful selections and aggregations. For example, storing the CPU workloads of various servers together with tags for cluster region, server rack, manufacturer, etc. can lead to high cardinality and many combinations, and thus enrich analyses by combining and drilling down into aggregations and selections. Other design questions include how to deal with automated data purging [8], [22].
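
A short calculation shows how quickly tag combinations multiply. The tag categories and values in this Python sketch are invented for illustration; the product is an upper bound, since not every combination has to occur in practice.

```python
from math import prod

# Hypothetical tag categories for one server CPU workload metric.
tag_values = {
    "region": ["eu", "us", "asia"],
    "rack": ["r1", "r2", "r3", "r4"],
    "manufacturer": ["vendor_a", "vendor_b"],
}

# Upper bound on distinct tag combinations, i.e. distinct series of this metric.
max_series = prod(len(values) for values in tag_values.values())
print(max_series)  # 3 * 4 * 2 = 24
```

Adding one more tag category with ten values would raise the bound to 240, which is why cardinality matters for schema design.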

Drawbacks of Relational Databases

Traditional time series databases are database management systems that are optimized to manage time series data and are implemented in relational databases. This is done by creating a data schema that includes ID, metric and dateTime columns. Depending on the data model, this approach can also perform well. Because of their ACID characteristics, however, relational databases cannot scale easily. Problems often occur when storage limits are reached, because data points are small but usually large in number. Furthermore, time series-specific queries such as groupings of data and other statistics-related requests can become very resource-intensive and difficult for relational databases to handle [22].
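
The relational approach can be sketched with Python's built-in sqlite3 module. The table name, the additional value column and the sample rows are assumptions for this example; only the ID, metric and dateTime columns are named in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE datapoints (
        id        INTEGER PRIMARY KEY,
        metric    TEXT NOT NULL,
        date_time TEXT NOT NULL,   -- ISO 8601 timestamp
        value     REAL NOT NULL    -- assumed column, not named in the text
    )
    """
)
rows = [
    ("cpu.load", "2015-03-09T08:12:50", 40.0),
    ("cpu.load", "2015-03-09T08:12:51", 60.0),
    ("cpu.load", "2015-03-09T08:13:10", 90.0),
]
conn.executemany(
    "INSERT INTO datapoints (metric, date_time, value) VALUES (?, ?, ?)", rows
)

# Group by minute: a time-based grouping of the kind that becomes
# resource-intensive once the table holds very many small rows.
per_minute = conn.execute(
    """
    SELECT substr(date_time, 1, 16) AS minute_bucket, AVG(value)
    FROM datapoints
    WHERE metric = 'cpu.load'
    GROUP BY minute_bucket
    ORDER BY minute_bucket
    """
).fetchall()
```

Every data point occupies its own row here, so the row count grows with the sampling frequency, which is one way the storage and query problems described above can arise.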

Benefits of NoSQL Databases

NoSQL-based time series databases have been developed to replace traditional TSDBs because traditional systems no longer fulfill the increased requirements [22], [51]:

- Scalability: In a distributed system, scalability means that the system is able to grow automatically by adding machines in order to handle increased traffic, data volume or user population at stable response times. This is called horizontal scalability, and NoSQL systems achieve it in a simple and inexpensive manner.
- Extendibility: Various plugins, client libraries and APIs are usually available to interact with the systems in many ways.
- Reliability: Tolerance of node failures makes the system very reliable. High reliability can be guaranteed even with commodity hardware, whereas earlier systems needed expensive hardware to guarantee reliable operation. Furthermore, scalability supports replication mechanisms and thus increases reliability.
- Performance: Scalability and flexibility increase performance. Such systems can handle large and changing user populations.

These features and abilities are addressed differently by the various TSDBs. The fact that NoSQL-based solutions are beneficial for storing and processing large-scale time series data also motivates a deeper analysis and is reflected predominantly in the quality-oriented comparison framework.

Architecture of Time Series Databases

The architecture of TSDBs is based on the structure of 3-layer information systems [57], with presentation layer, application logic layer and database layer (Fig. 1). As the presentation layer is the only connection to the user, it handles API requests arriving over a transport protocol such as HTTP. Graphical user interfaces (GUIs) are often implemented, e.g. for administration. A fairly distinctive feature is graph visualization, which is often embedded in the GUI. The application logic layer serves as the core coordination unit and provides coordination services such as bootstrapping, replication and consensus. This layer also handles data collection and contains the query engine and the distribution services. The third layer is the database layer, which includes the persistent data storage as well as the data structure.

illustration not visible in this excerpt

Fig. 1. 3-layer architecture of time series databases [57]
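
The separation of the three layers can be illustrated with a toy Python sketch. All class names, the request format and the in-memory dictionary that stands in for a NoSQL store are assumptions made for this example; coordination and distribution services are left out entirely.

```python
class DatabaseLayer:
    """Persistent storage; a plain dict stands in for a NoSQL store."""

    def __init__(self):
        self._rows = {}  # metric -> list of (timestamp, value)

    def append(self, metric, timestamp, value):
        self._rows.setdefault(metric, []).append((timestamp, value))

    def scan(self, metric, start, end):
        return [(t, v) for t, v in self._rows.get(metric, []) if start <= t < end]


class ApplicationLogicLayer:
    """Data collection and query engine; coordination services are omitted."""

    def __init__(self, db):
        self._db = db

    def collect(self, metric, timestamp, value):
        self._db.append(metric, timestamp, value)

    def query_average(self, metric, start, end):
        points = self._db.scan(metric, start, end)
        return sum(v for _, v in points) / len(points) if points else None


class PresentationLayer:
    """Entry point for users; a real system would expose this over HTTP."""

    def __init__(self, logic):
        self._logic = logic

    def handle(self, request):
        if request["op"] == "put":
            self._logic.collect(request["metric"], request["time"], request["value"])
            return {"status": "ok"}
        if request["op"] == "avg":
            result = self._logic.query_average(
                request["metric"], request["start"], request["end"])
            return {"status": "ok", "result": result}
        return {"status": "error", "reason": "unknown operation"}


api = PresentationLayer(ApplicationLogicLayer(DatabaseLayer()))
api.handle({"op": "put", "metric": "cpu.load", "time": 0, "value": 2.0})
api.handle({"op": "put", "metric": "cpu.load", "time": 10, "value": 4.0})
response = api.handle({"op": "avg", "metric": "cpu.load", "start": 0, "end": 60})
```

Each layer talks only to the layer directly below it, so the storage backend could be exchanged without touching the request handling.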

Related Work

In the following we review literature on traditional time series databases and TSDBs, as well as similar work on the comparison of databases, in particular NoSQL databases.

Some work covers the basic theory of traditional and modern time series databases. In [22] the authors explore the theory of modern TSDBs and best practices for implementing them. Furthermore, they concentrate on OpenTSDB and its modification for increased performance. Castillejos [8] has written a very comprehensive dissertation on time series data management. He concentrates on various methodologies for mapping time series data to relational databases and develops grouping mechanisms and integrated management frameworks. In [4] the authors present the use of OpenTSDB for smart meter analytics as an example. Further, they develop a prototype optimized to handle energy-related time series data and compare its advantages to relational database solutions. Many authors concentrate on optimizing, building and testing adaptations of TSDBs (see [2], [53]), or even self-built time series databases (see [15], [48]). Some research has also been done on optimizing visualization tools, which serve as essential plugins (see [50]).

Some investigations have been made in the field of TSDB comparisons. In [27] the authors compare and evaluate a set of open-source TSDBs such as OpenTSDB, KairosDB and Energy DataBus. They compare basic TSDB details such as version, storage and commit activity. Based on a conceptual architecture for a cloud-native monitoring system for industrial processes, they create a workload profile for TSDB benchmarking. The load tests are evaluated for KairosDB and Energy DataBus, but for technical reasons they cannot be analyzed for OpenTSDB. KairosDB dominates Energy DataBus in terms of scalability, with much better read and write performance for up to 36 machines in one cluster. Unfortunately, the authors do not describe their smart meter-oriented workload and data schema design in a satisfactory level of detail. Wlodarczyk [64] analyzes and compares four time series databases of different types, both open-source and commercial. He concludes that OpenTSDB seems to be the most advanced, popular and promising solution. However, his results are outdated: some of the TSDBs have already been forced out of the market, and an even larger set of new TSDBs has been developed in recent years. Also, his approach focuses on only a few higher-level comparison elements. Shafer et al. [53] also include a set of SQL-based time series databases in their comparison with a variety of TSDBs. They evaluate databases for their fitness for extremely large numeric time series data, based on design principles and storage usage.

Rather more has been achieved regarding comparisons of NoSQL databases. Cryans et al. [12] present a comparison of PostgreSQL and HBase with a variety of comparison elements such as data structure, scalability and software architecture. However, we adopt only a small number of the comparison elements suggested there and expand the spectrum for our frameworks. In [59] the authors compare NoSQL databases (Cassandra and HBase) with MySQL based on several parameters. Furthermore, they benchmark the latencies of all compared systems and conclude that Cassandra dominates the rest. Difallah et al. [16] suggest a variety of comparison elements for modern databases, some of which we adopt in our comparison framework for TSDBs. Additionally, they benchmark MySQL and PostgreSQL. They do not concentrate on a specific workload; rather, they investigate how workloads evolve. A variety of other comparisons of NoSQL databases based on benchmarks exist, e.g. [13] and [36].

Comparison Frameworks

In this section we establish and present the frameworks for the comparison of TSDBs and describe every comparison element in detail. We start with the feature-oriented comparison framework and then present the quality-oriented comparison framework.

We divide the characteristics of TSDBs into features and qualities. Feature-oriented characteristics are those that the end user or developer can perceive directly. They include, for example, general information about the TSDB or its expandability with plugins or client libraries. A more TSDB-oriented feature is the availability of visualization tools. Characteristics that affect the qualities of TSDBs are those that implicitly support higher-level competencies based on specific information system elements, for example the ability to scale out a system in response to increased demand. Various software elements and technologies can support this in different ways.

Feature-oriented

The feature-oriented comparison framework (Table 2) focuses on the various capabilities of TSDBs. We choose five categories: General information, Support and Community, Development, TSDB-specific and Miscellaneous. A few comparison elements are taken from [27] and [59]. We strictly separate TSDB-specific elements from the rest, so that the framework can easily be adapted for comparing other NoSQL database categories.

illustration not visible in this excerpt

Table 2. Feature-oriented comparison framework for TSDBs [27] [59]

General Information contains information that helps to compare TSDBs by generic attributes such as coding language, current stable version, release date/version, license and developer. The underlying programming language can indicate how the system is built and what development difficulties may have occurred. Furthermore, it helps to make assumptions about the development progress and expandability of the system. The license states the legal conditions under which the system may be used. Although open-source code is free to use, the terms and conditions of the individual licenses differ. Options to modify, distribute and sublicense the code can also be handled differently.

The section Support and Community captures how much information is currently available about the corresponding TSDB. Such information is an essential resource for developers, users, scientists and decision makers who want to use the full potential of those systems in line with their needs. Documentation, blogs, FAQs and tutorials help to understand the requirements, potential capabilities and limitations of the system. A social media presence helps to communicate intended future developments to a broad audience in a timely manner. Information about usage in production, as well as publicly available benchmarks and scientific papers, indicates a very mature system. We include a reference to a paper only if the authors implement the corresponding TSDB, e.g. to build prototypes for higher-level systems, or analyze the TSDB to a sufficient extent. Sandboxes and demos give potential users a quick first hands-on experience.
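
One possible way to work with such a framework is to hold it as a simple data structure and list where two systems differ. The Python sketch below is an assumption-laden illustration: it contains only the comparison elements named in the text above, leaves the remaining categories empty because Table 2 is not visible in this excerpt, and uses placeholder profiles that do not describe any real TSDB.

```python
# Categories from the text; elements only where the text names them.
FRAMEWORK = {
    "General information": [
        "coding language", "current stable version",
        "release date/version", "license", "developer",
    ],
    "Support and Community": [
        "documentation", "blogs", "FAQs", "tutorials", "social media presence",
        "usage in production", "benchmarks", "scientific papers",
        "sandboxes and demos",
    ],
    "Development": [],      # elements not visible in this excerpt
    "TSDB-specific": [],    # elements not visible in this excerpt
    "Miscellaneous": [],    # elements not visible in this excerpt
}


def differences(profile_a, profile_b, framework=FRAMEWORK):
    """Return (category, element, value_a, value_b) wherever two profiles differ."""
    result = []
    for category, elements in framework.items():
        for element in elements:
            a = profile_a.get(element)
            b = profile_b.get(element)
            if a != b:
                result.append((category, element, a, b))
    return result


# Placeholder profiles with invented values, not data about real systems.
tsdb_a = {"coding language": "language X", "license": "license L", "documentation": True}
tsdb_b = {"coding language": "language Y", "license": "license L", "documentation": True}
diff = differences(tsdb_a, tsdb_b)
```

Because the TSDB-specific category is a separate key, it could be replaced when the structure is reused for another class of NoSQL databases.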

[...]

Details

Pages
44
Year
2015
ISBN (eBook)
9783656965756
ISBN (Book)
9783656965763
File size
1.5 MB
Language
English
Institution / University
Technische Universität Berlin – Wirtschaftsinformatik - Information Systems Engineering (ISE)
Publication date
2015 (May)
Grade
1.0
Keywords
NoSQL Time Series Database (TSDB) Comparison Framework OpenTSDB InfluxDB System Architecture Distributed System
