A Definitive Guide to Apache ShardingSphere

By: Trista Pan, Zhang Liang, Yacine Si Tayeb

Overview of this book

Apache ShardingSphere is a new open source ecosystem for distributed data infrastructures based on pluggability and cloud-native principles that helps enhance your database. This book begins with a quick overview of the main challenges faced by database management systems (DBMSs) in production environments, followed by a brief introduction to the software's kernel concept. After that, using real-world examples of distributed database solutions, elastic scaling, DistSQL, synthetic monitoring, database gateways, and SQL authority and user authentication, you’ll fully understand ShardingSphere's architectural components, how they’re configured and can be plugged into your existing infrastructure, and how to manage your data and applications. You’ll also explore ShardingSphere-JDBC and ShardingSphere-Proxy, the ecosystem’s clients, and how they can work either concurrently or independently to address your needs. You’ll then learn how to customize the plugin platform to define personalized user strategies and manage multiple configurations seamlessly. Finally, the book enables you to get up and running with functional and performance tests for all scenarios. By the end of this book, you’ll be able to build and deploy a customized version of ShardingSphere, addressing the key pain points encountered in your data management infrastructure.
Table of Contents (18 chapters)

Section 1: Introducing Apache ShardingSphere
Section 2: Apache ShardingSphere Architecture, Installation, and Configuration
Section 3: Apache ShardingSphere Real-World Examples, Performance, and Scenario Tests

The evolution of DBMSs

With the rapid adoption of the cloud, SaaS delivery models, and open source repositories that are driving innovation, the volume of data has exploded over the past 10 years. These large datasets have made it mandatory for organizations that want an optimal customer experience to deploy effective and reliable database management systems (DBMSs). Nevertheless, this renewed focus on DBMSs and their requirements has created not only multiple opportunities for new technologies and new players in the industry but also numerous challenges. If you are reading this book, you are probably looking to upskill and to improve or expand your knowledge of how to manage DBMSs effectively.

Databases exist to store and access information. As a result, organizations now find it crucial to understand the latest techniques, technologies, and best practices for storing and retrieving large volumes of data and for handling the resulting traffic. The shift to cloud-based storage has also led to the expanded use of data clusters and to the related data science around data storage strategies. In addition, the amount of data that applications use rises and falls throughout a typical day.

Reliable and scalable databases are required to collect and process this data, and one way to achieve that is to break large datasets into smaller ones. This need gave rise to concepts such as database sharding and partitioning, both of which split extensive datasets into smaller ones while preserving performance and uptime. These concepts will be discussed in Chapter 3, Key Features and Use Cases – Your Distributed Database Essentials, in the Understanding data sharding section, and Chapter 10, Testing Frequently Encountered Application Scenarios.
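To make the idea of sharding concrete, here is a minimal sketch of modulo-based routing, assuming a hypothetical orders dataset keyed by user_id and four shards. The names shard_for, distribute, and SHARD_COUNT are illustrative assumptions and are not part of ShardingSphere's API; real sharding strategies are covered in the chapters referenced above.

```python
from collections import defaultdict

# Hypothetical number of shards; chosen only for illustration.
SHARD_COUNT = 4


def shard_for(user_id: int, shard_count: int = SHARD_COUNT) -> int:
    """Return the index of the shard that owns user_id (modulo sharding)."""
    return user_id % shard_count


def distribute(rows, shard_count: int = SHARD_COUNT):
    """Group rows by the shard selected from each row's user_id."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for(row["user_id"], shard_count)].append(row)
    return dict(shards)


# A small, made-up dataset: five orders placed by five users.
orders = [
    {"order_id": n, "user_id": uid}
    for n, uid in enumerate([7, 12, 21, 34, 40], start=1)
]

placement = distribute(orders)
for shard_index in sorted(placement):
    print(shard_index, [row["order_id"] for row in placement[shard_index]])
```

With this data, users 12 and 40 both map to shard 0, so orders 2 and 5 land on the same shard, while the remaining orders are spread over shards 1 to 3. Each shard holds only part of the dataset, which is the property that lets a large dataset be served by several smaller databases.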

Let's summarize what open source means according to The Open Source Definition (https://opensource.org/osd): when we talk about open source, we refer to software released under a license in which the copyright holder grants you, and any other user, the rights to use, change, and distribute the software and its source code to anyone and for any purpose.

When it comes to databases, the role of open source is significant, and its extent may come as a surprise to many. As of June 2021, over 50% of database management systems worldwide use an open source license (DB-Engines, Statista 2021). If we consider the recent developments in open source database software, we'll notice a proliferation of initiatives and communities dedicated to cloud-native database software.

Cloud-native databases have become increasingly important with the arrival of the cloud computing era. Their benefits include elasticity and the ability to meet on-demand application usage needs. This development creates the need for cloud migration capabilities and skills as businesses migrate workloads to different cloud platforms.

Currently, hybrid and multi-cloud environments are the norm, with nearly 75% of organizations reporting usage of a multi-cloud environment (https://www.lunavi.com/blog/multi-cloud-survey-72-using-multiple-cloud-providers-but-56-have-no-multi-cloud-strategy). The data that remains stored on-premises is, more often than not, composed of sensitive information that organizations are wary of migrating, or data that is connected to legacy applications or environments that make it too challenging to migrate.

This changed the concept of databases as we used to understand them, creating a new concept that includes data that is on-premises and in the cloud, with workloads running across various environments. The next big thing in terms of databases and infrastructure is the distributed cloud, which can be defined as an architecture where multiple clouds are used concurrently and managed centrally from a public cloud. It brings cloud-based services to organizations and blurs the lines between the cloud and on-premises systems.

The next section will introduce you to the challenges that are currently considered significant pain points in the industry. You may be familiar with some or all of them. If you are not, that is OK, as each of them is explained there.

These pain points will then be followed by equally important needs that have not yet been met and that are creating new opportunities in the industry.

Industry pain points

Because of the ever-expanding number of database types, engineers have to dedicate more of their time to learning SDKs and SQL dialects, and less time to development. For an enterprise, technology selection is hard because tech stacks are more complex and must match its application frameworks, which can lead to an oversized architecture.

The next few sections will introduce you to the most notable industry pain points, followed by new industry needs that are creating new opportunities for DBMSs.

Low-efficiency database management

Database administrators (DBAs) need to dedicate much of their time to surveying and using new databases to identify the differences in cooperation and monitoring methods, as well as to understand how to optimize performance.

The peripheral services built around a particular database, and the experience gained with it, are neither universal nor replicable. In production, the usage and maintenance costs of databases rise. The more database types a company deploys, the more investment is required. If an enterprise adopts new databases suitable for new scenarios without a second thought, that investment is bound to grow exponentially sooner or later.

New demands and increasingly frequent iteration

Different code is required to meet what could seem to be similar demands, with the only difference being the database type and the type of code that it supports. At the time of writing, iteration frequency is expected to rise sharply, yet developer response capability falls as the number of database types grows. The exponential growth of common demands and database types slows down iteration significantly: the larger the number of databases, the slower the iteration pace and the lower the iteration performance.
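As a small illustration of the same demand requiring different code, consider fetching one page of rows. The sketch below assumes a hypothetical helper named paginate and a made-up orders table; it only builds SQL strings and is not tied to any driver or to ShardingSphere. The two clause styles shown are the LIMIT ... OFFSET form accepted by MySQL and PostgreSQL and the OFFSET ... FETCH form accepted by SQL Server 2012 and later and by Oracle 12c and later.

```python
def paginate(dialect: str, table: str, order_by: str, limit: int, offset: int) -> str:
    """Build a paginated SELECT for the given dialect (illustrative helper only)."""
    base = f"SELECT * FROM {table} ORDER BY {order_by}"
    if dialect in ("mysql", "postgresql"):
        # LIMIT/OFFSET style.
        return f"{base} LIMIT {limit} OFFSET {offset}"
    if dialect in ("sqlserver", "oracle"):
        # OFFSET/FETCH style (SQL Server 2012+, Oracle 12c+).
        return f"{base} OFFSET {offset} ROWS FETCH NEXT {limit} ROWS ONLY"
    raise ValueError(f"unsupported dialect: {dialect}")


# The same business demand (third page, ten rows per page) for each database type.
for name in ("mysql", "postgresql", "sqlserver", "oracle"):
    print(name, "->", paginate(name, "orders", "order_id", limit=10, offset=20))
```

Every additional database type adds another branch of this kind, and the same happens for each common demand, which is why the cost grows with both the number of demands and the number of database types.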

If, for example, the desired outcome is to encrypt all sensitive data at once, but this cannot be done in one step across many databases, the only possible solution is to modify the code on the business application side. Large firms frequently operate dozens or even hundreds of systems, which poses great challenges for developers who must encrypt the data of every system. Data encryption is only one of many challenges of this kind that developers may face: other common demands, such as permissions control and auditing, are also frequently encountered in heterogeneous databases.

Lack of database inter-compatibility

We know for a fact that heterogeneous databases currently co-exist and will continue to do so for a long time, but without a common standard, we cannot use databases collaboratively. By common standard, we mean a technology reference that is accepted universally, or at least by a majority, in the way that USB 2.0 or USB-C is for external hardware peripherals. If you are looking for a software example, look no further than the SDKs that have been released to make apps for iOS or Android.

For databases, as you will learn throughout this book, we at the Apache ShardingSphere community are proponents of what we call Database Plus – which in simple terms means software that allows you to manage and improve any type of database, even to integrate different database types into the same system.

In terms of data computing, demands for a collaborative query engine and transaction management plans across heterogeneous databases are increasing. Nevertheless, at the moment, developers can only contribute to the development on the application side, making it difficult for their contribution to be developed into an infrastructure.

The new industry needs are creating new opportunities for DBMSs

The changing landscape within which enterprises operate is bound to affect their business decisions and operating procedures. This can be traced back to the expanding amounts of data and the internet argument mentioned in the Industry pain points section.

This section will give you an insight into what enterprises are looking to get from their database management systems across different industrial sectors. After that, we will look at the evolving role of a DBA, which some of you will be expected to step into.

Querying and storing enormous chunks of data

A large volume of data can crash standalone databases. We need more storage and more servers to house the current enormous amount of data, which will only increase in the future. A single database is unable to accommodate this wealth of data.

Achieving prompt query data response time

Even though a DBMS has to accommodate enormous amounts of data, the experience and response time expected by customers and users leave no room for DBMS downtime in which to organize the data little by little. Retrieving the requested data promptly from the data lake will therefore be one of the top issues.

Querying and storing fragmented data types

Furthermore, the relational data structure is now only one of many data types. Documents, JSON, graphs, and key-value pairs are all attracting people's attention. This is reasonable, since all of them come from varying business scenarios that involve keeping the world moving smoothly and efficiently.

All these changes and requirements will inevitably bring new challenges and needs to databases and to their operation and maintenance.

You may already be aware of, or may even have encountered, some of these expectations in your professional experience. If you are just stepping into the professional world, you are bound to encounter them, no matter your future industry. This is because the role of the database administrator has changed. More precisely, it has evolved, and the next section will tell you how.