Why should we use NoSQL databases?
What is NoSQL?
NoSQL refers to non-relational databases designed for flexibility and scalability. There are different types of NoSQL databases, each suited for specific use cases. For instance, key-value stores like Redis are great for caching, document stores like MongoDB handle JSON-like data efficiently, wide-column stores like Cassandra excel at handling large-scale, distributed datasets, and graph databases like Neo4j specialize in relationship-based data.
Flexibility in schemas
One of the advantages of NoSQL databases is their flexibility in schemas, something that traditional SQL databases do not have. In SQL databases like MySQL or PostgreSQL, you must design and define the schema upfront, specifying the fields, their data types, and constraints. Once the schema is set, all data inserted into the database must strictly adhere to this predefined structure.
In contrast, NoSQL databases offer a schema-less or dynamic schema model. For example, in a NoSQL database like MongoDB, you can store JSON documents of varying structures within the same collection. This means you can add new fields or remove existing ones without altering the schema or affecting other documents. This flexibility is particularly useful for applications where data requirements evolve over time or when dealing with unstructured or semi-structured data.
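To make this concrete, here is a minimal sketch of two documents that could coexist in the same MongoDB collection; the field names are illustrative, not from a real application:

```python
# Two documents with different shapes that could live in the same
# MongoDB collection -- no schema migration is needed to add or drop fields.
user_v1 = {
    "name": "Alice",
    "email": "alice@example.com",
}

user_v2 = {
    "name": "Bob",
    "email": "bob@example.com",
    "preferences": {"theme": "dark"},  # nested field added later;
    # no migration touched user_v1, old documents keep their shape
}

# With a driver such as pymongo, both would be inserted the same way:
#   db.users.insert_many([user_v1, user_v2])
```

In an SQL database, adding the preferences column would require an ALTER TABLE that affects every existing row.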
Scaling a NoSQL Database
One of the key strengths of NoSQL databases is their focus on horizontal scalability. Unlike traditional SQL databases that often rely on vertical scaling—adding more memory or processing power to a single machine—NoSQL databases are designed to scale out. This means increasing the number of machines in a distributed system to accommodate more data and handle higher loads effectively.
To achieve this, NoSQL databases implement data partitioning by splitting the data across multiple shards. Each shard is stored on a different machine, enabling the system to distribute the workload and ensure efficient data access. There are several strategies for partitioning data, including:
- Hash-based Partitioning: Data is distributed across shards using a hash function, ensuring even distribution of data.
Disadvantage: All records sharing the same partition-key value are hashed to the same shard. For example, if we use user_id as the partition key and a few users account for most of the records, the shards holding those users become hot spots while others sit mostly idle.
- Range-based Partitioning: Data is partitioned based on value ranges, such as splitting records by date or alphabetically.
Disadvantage: Range-based partitioning can produce imbalanced shards, for example when time-ordered keys send all new writes to the latest range. Many databases mitigate this by automatically splitting hot ranges and redistributing them across nodes to keep the partitions balanced.
- Geographic Partitioning: Data is stored closer to where it is most frequently accessed, improving latency for region-specific queries.
Disadvantage: A potential issue with geographic partitioning is the risk of imbalanced shards. For instance, partitioning by city_id could result in cities with a high volume of records overloading certain shards, while cities with fewer records leave others underutilized. This imbalance can lead to performance bottlenecks.
Solution: To mitigate this, you can partition on a composite key (combining multiple fields) so that records for a large city are spread across several shards.
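The hot-spot problem and the composite-key fix can be sketched in a few lines; the key names (user_id, order id) are hypothetical:

```python
import hashlib

NUM_SHARDS = 8

def shard(key: str) -> int:
    # Hash the partition key and map it onto a fixed number of shards.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_SHARDS

# Plain key: every record for user 42 lands on the same shard (a hot spot).
single = {shard("user:42") for _ in range(100)}

# Composite key: appending another field (here a hypothetical order id)
# spreads the same user's records across many shards.
composite = {shard(f"user:42:order:{i}") for i in range(100)}

print(len(single), len(composite))  # 1 shard vs. up to NUM_SHARDS shards
```

The trade-off is that a query for all of one user's records now has to fan out to every shard instead of hitting just one.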
This horizontal scaling approach makes NoSQL databases an excellent choice for handling massive datasets while maintaining high performance and availability.
The need for consistent hashing
Once we partition our data across multiple shards, we are faced with this challenge: how do we handle adding or removing nodes? In traditional hash partitioning, changing the number of nodes requires replacing the hash function. This means all existing data must be rehashed and redistributed across the shards, resulting in significant data reshuffling and high network I/O costs.
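A quick way to see that cost: with naive hash-modulo partitioning, growing a cluster from 4 to 5 nodes moves the vast majority of keys. A minimal sketch (key names are illustrative):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # A stable hash (Python's built-in hash() is salted per process),
    # reduced modulo the shard count -- the traditional scheme.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(shard_for(k, 4) != shard_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys change shards")  # roughly 80%
```

In expectation only 1 in 5 keys happens to map to the same shard under both moduli, so almost every key must be copied over the network.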
Consistent hashing offers a solution to this problem. With consistent hashing, the amount of data that needs to be reshuffled is minimized whenever a node is added or removed, making the system more efficient and scalable.

In this example, we have a circular ring containing a few nodes at various positions. Each data point is hashed to a position on the ring (e.g. hash(100) = 10 or hash(101) = 25) and is allocated to the first node encountered moving clockwise around the ring.
To ensure even distribution of data, each node is represented by multiple markers (virtual nodes) in the ring. This is to ensure that when we add or remove nodes, data will not all be assigned to one node but rather be spread out across the ring.
Adding nodes: When a new node is added, it takes over only the keys that fall between it and its anti-clockwise predecessor, reducing the load on the adjacent node while leaving all other data in place.
Node Removal: If a node is removed, its keys are reassigned to the next clockwise node, so only that node's share of the data needs to be reshuffled.
Data Replication: To ensure fault tolerance and high availability, each node's data is replicated to the next two nodes in the clockwise direction within the hash ring. This ensures that even if a node fails, the data can still be accessed from its replicas, maintaining system reliability and minimizing downtime.
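The ring described above can be sketched in a short class; the node names and virtual-node count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (a sketch, not production code)."""

    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs

    def _pos(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        # Each physical node gets `vnodes` markers on the ring, so load
        # shifts to/from many neighbours instead of a single one.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._pos(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        # First marker clockwise from the key's position, wrapping around.
        i = bisect.bisect(self._ring, (self._pos(key), chr(0x10FFFF)))
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHashRing()
for n in ("node-a", "node-b", "node-c"):
    ring.add_node(n)

keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}
ring.add_node("node-d")  # only keys now owned by node-d should move
moved = sum(before[k] != ring.get_node(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved")  # roughly a quarter
```

Note that every key that moves after adding node-d moves *to* node-d; the rest of the ring is untouched, which is exactly the property that naive modulo hashing lacks.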