Scaling your Databases

We have already discussed the general ideas behind high availability architecture. Database scaling is a crucial component of implementing high availability. In this post, we will explore various techniques for scaling databases. The distribution models discussed here apply to both relational and non-relational (NoSQL) databases.

As with the other components of high availability architecture that I will discuss in subsequent posts, it is imperative that the database solution you implement matches the needs of your product or application. Let us first look at why databases need to be scaled.

Why do we need to scale databases?

Razi Sharir, the CEO of Xeround, a cloud database company, identifies four key considerations for scaling your databases in a post on Mashable: traffic, downtime, performance, and investment.

Your database must be scaled so that it can handle traffic at any point in time, bearing in mind that traffic varies considerably over time and is highly unpredictable. You must be able to sustain the load during peaks.

The scaling process should not cause downtime in your database. Since records can change while scaling is in progress, you must ensure data consistency between the pre-scaling and post-scaling databases.

In addition, you need to figure out what type of scaling you require: scaling up or scaling out. Scaling up means upgrading to a bigger, more powerful database server, whereas scaling out means adding more nodes. Each has its own advantages and is useful in different situations.

Distribution Models:

1. Single Server:

The simplest distribution model is no distribution at all: a single server handles all reads and writes. This is very simple to operate and eliminates most of the complexity of the other models.

The problem with a single server is that you have to scale up if you want better performance. Large servers, however, are not cost efficient, since they run all the time even though the load is high only for short periods.

2. Shards:

When different users access different parts of the dataset, we can employ a technique called sharding. Sharding distributes the database horizontally: the rows of a table are partitioned across nodes, and each partition forms a shard.

Such a technique reduces the size of the database on each node and hence improves performance. The data should be divided so that related data is grouped together into units called aggregates; we design aggregates to combine data that is commonly accessed together. Further, if a shard is based on some real-world segmentation of the data (European and American customers, for instance), it is easier to decide where to direct a query. Another consideration is keeping the load even across shards.
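As a rough illustration, here is what such an aggregate might look like; all the field names below are hypothetical:

    # A sketch of an aggregate: an order and the line items that are read
    # and written together, stored as a single unit on one shard.
    order_aggregate = {
        "order_id": "ord-1001",
        "customer_id": "customer:42",  # a natural shard key
        "region": "EU",                # real-world segmentation aids routing
        "line_items": [
            {"sku": "A17", "qty": 2},
            {"sku": "B03", "qty": 1},
        ],
    }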

[Diagram: sharding data across multiple nodes. Source: MSDN Blog]

Many NoSQL databases support auto-sharding, where the database itself decides how to allocate data to shards, ensuring that the conditions above are satisfied and each record goes to the right shard.
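When sharding by hand instead, the application needs a routing function that maps each aggregate's key to a shard. Here is a minimal sketch in Python, assuming hypothetical shard names; hashing the key keeps the load roughly even, at the cost of giving up real-world segmentation:

    import hashlib

    SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical nodes

    def shard_for(key: str) -> str:
        # Hash the shard key (e.g. a customer ID) so that keys spread
        # evenly across shards; every read and write for the same key
        # is directed to the same shard.
        digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
        return SHARDS[digest % len(SHARDS)]

    print(shard_for("customer:42"))  # always the same shard for this key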

Sharding has its disadvantages too. It adds complexity to the whole structure, and each unreplicated shard is a single point of failure for its portion of the data: the corruption of a single shard makes part of the table unavailable.

3. Master-Slave Replication:

In master-slave replication, you make copies of the same data across many nodes. One of the nodes is designated the master, which is responsible for propagating updates to the other nodes, called the slaves.

Such a system is useful when you have a read-intensive workload. You can add more slaves and route read queries to them, while writes and the subsequent updates are performed only by the master. Hence, if the master fails, your read queries continue to work normally, a property known as read resilience, but writes fail until the master is restored or a new one is appointed (which can be done automatically).
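As a minimal sketch, the application-side routing could look like this; the hostnames are hypothetical placeholders, and a real setup would open connections through your database driver:

    import itertools

    MASTER = "db-master:5432"                        # hypothetical hosts
    SLAVES = ["db-slave-1:5432", "db-slave-2:5432"]
    _next_slave = itertools.cycle(SLAVES)            # round-robin over slaves

    def route(query: str) -> str:
        # Writes must go to the master; reads can be spread across slaves.
        is_write = query.lstrip().upper().startswith(
            ("INSERT", "UPDATE", "DELETE"))
        return MASTER if is_write else next(_next_slave)

    print(route("SELECT * FROM orders"))  # one of the slaves
    print(route("DELETE FROM orders"))    # always the master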

[Diagram: master-slave replication. Source: MySQL Performance Blog]

Master-slave replication can lead to inconsistency. Reading from different slaves can return different values if an update from the master has not yet propagated. Worse, a client may be unable to read back its own recent write. Also, if the master fails, recent updates may not have been passed on to the slaves and can be lost.

4. Peer-to-Peer Replication:

Peer-to-peer replication takes care of the problems of the master-slave distribution model. It removes the master, so that every node accepts both reads and writes. This means the loss of any one node does not cut off access to the data set.

[Diagram: peer-to-peer replication. Source: MSDN]

The biggest problem here is consistency. Writing the same data on two different nodes at the same time can lead to a conflict. Write conflicts can be resolved by merging the inconsistent writes, but discussing the ways to do that is beyond the scope of this post.
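For a flavour of what conflict resolution involves, here is a sketch of the simplest strategy, last-write-wins, which keeps the write with the later timestamp; the types below are hypothetical, and note that the older write is silently discarded, so real systems often prefer more careful schemes such as version vectors:

    from dataclasses import dataclass

    @dataclass
    class VersionedValue:
        value: str
        timestamp: float  # wall-clock time recorded by the writing peer

    def resolve(a: VersionedValue, b: VersionedValue) -> VersionedValue:
        # Last-write-wins: the newer write survives, the older one is lost.
        return a if a.timestamp >= b.timestamp else b

    peer1 = VersionedValue("alice@old.example", 1000.0)
    peer2 = VersionedValue("alice@new.example", 1005.0)
    print(resolve(peer1, peer2).value)  # alice@new.example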

5. Combining Sharding and Replication:

Replication and sharding are ideas that can be combined. Using master-slave replication together with sharding means there can be multiple masters, but each data item has only a single master. When you combine peer-to-peer replication with sharding, each shard is replicated across a number of peers, and in case of a failure the data can be rebuilt on other nodes.
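Putting the two ideas together, a sketch of application-side routing might first pick a shard by key and then apply master-slave routing within that shard's replica set; all the hostnames below are hypothetical:

    import hashlib

    # Each shard is a small replica set: one master plus its slaves.
    REPLICA_SETS = {
        0: {"master": "s0-master", "slaves": ["s0-slave-1", "s0-slave-2"]},
        1: {"master": "s1-master", "slaves": ["s1-slave-1", "s1-slave-2"]},
    }

    def node_for(key: str, is_write: bool) -> str:
        # A stable hash picks the shard; within it, writes hit the master.
        shard = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) \
            % len(REPLICA_SETS)
        replicas = REPLICA_SETS[shard]
        return replicas["master"] if is_write else replicas["slaves"][0]

    print(node_for("customer:42", is_write=True))   # that shard's master
    print(node_for("customer:42", is_write=False))  # one of its slaves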

In the final analysis, you have to determine how far you should scale, because the process costs money. You should chalk out a detailed plan and analyse whether it is worth the extra dollars you would be spending.

For further reading on the emergence of NoSQL databases and why they are not here to take over relational databases, I suggest you go through the book NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler.
