Writing Java: Picking the right NoSQL solution

I'm working on a new project and I am deliberately choosing a NoSQL solution as the database backend. NoSQL databases have become real contenders to relational databases like PostgreSQL because they are very good at certain web-centric features:

Replication - Create copies of your data easily, either for backup or to improve performance.
Horizontal scaling - Allow an application to grow by adding inexpensive commodity machines.
Reliability - Web apps require extremely high uptime, and NoSQL solutions have good strategies for dealing with failures.
(In)consistency - NoSQL solutions assume that failure is the expected scenario; data inconsistency is treated as the norm, and strategies for managing inconsistent data are built into the solution.

Note that I am not saying that PostgreSQL and other relational databases do a bad job at replication, scaling, etc., but they were not designed with the use cases of modern web apps in mind, and in some cases a relational database is not the appropriate solution. Yes, you might be able to move furniture using a convertible, but that doesn't mean you should!

Some of the major NoSQL solutions out there:

MongoDB - A mostly memory-driven architecture. It supports map/reduce operations and claims to be VERY fast... until your data exceeds the size of your RAM (or if you can't tolerate data corruption).

So I'm not sure MongoDB is really a NoSQL database; to me it looks more like a great caching solution, closer to the likes of memcached on the spectrum of datastores.

CouchDB - JSON-based document store. Supports map/reduce. CRUD operations are performed using HTTP methods, e.g., GET, PUT, POST and DELETE. It is easy to learn because the access operations and data types are familiar to developers. Requires read before write; in other words, to change just one property of a large document, you need to download the entire document, make the change, then send the entire document back to the database. This makes it unsuitable for apps with lots of updates (inserting new documents is fine though).
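To make the read-before-write cycle concrete, here is a minimal Java sketch. It does not talk to CouchDB at all: RevisionedStore, Doc and their methods are hypothetical names for an in-memory stand-in that only imitates the rule that an update must carry the document's current revision and replace the whole document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a revision-checked document store.
// It imitates the "present the current revision to update" rule only;
// it is not CouchDB's API or implementation.
class RevisionedStore {
    // A stored document: a revision counter plus the full set of properties.
    static final class Doc {
        final int rev;
        final Map<String, Object> fields;

        Doc(int rev, Map<String, Object> fields) {
            this.rev = rev;
            this.fields = new HashMap<>(fields); // defensive copy
        }
    }

    private final Map<String, Doc> docs = new HashMap<>();

    // Inserting needs no prior read: a new document starts at revision 1.
    int insert(String id, Map<String, Object> fields) {
        if (docs.containsKey(id)) {
            throw new IllegalStateException("conflict: document exists");
        }
        docs.put(id, new Doc(1, fields));
        return 1;
    }

    // Reading returns a copy of the whole document, including its revision.
    Doc read(String id) {
        Doc d = docs.get(id);
        return d == null ? null : new Doc(d.rev, d.fields);
    }

    // Updating replaces the whole document and is rejected unless the
    // caller supplies the revision that is currently stored.
    int update(String id, int expectedRev, Map<String, Object> fields) {
        Doc current = docs.get(id);
        if (current == null) {
            throw new IllegalStateException("not found");
        }
        if (current.rev != expectedRev) {
            throw new IllegalStateException("conflict: stale revision");
        }
        int next = current.rev + 1;
        docs.put(id, new Doc(next, fields));
        return next;
    }

    public static void main(String[] args) {
        RevisionedStore store = new RevisionedStore();
        Map<String, Object> profile = new HashMap<>();
        profile.put("name", "Ann");
        profile.put("visits", 1);
        store.insert("user1", profile);

        // To change one property: fetch everything, edit, send everything back.
        Doc current = store.read("user1");
        Map<String, Object> edited = new HashMap<>(current.fields);
        edited.put("visits", 2);
        int newRev = store.update("user1", current.rev, edited);
        System.out.println("new revision: " + newRev);
    }
}
```

Every update costs a full read plus a full write, and a second writer holding the old revision is rejected and has to start over. That is why update-heavy workloads are a poor fit, while inserts skip the read entirely.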

Cassandra - Highly distributed column-based data store with eventual consistency. Has no built-in map/reduce (though it can be integrated with Hadoop) but does support writes without a prior read. Cassandra offers many tunable options for data replication and for dealing with failures. It is great for large scale but has a steep learning curve. Does not support range queries over row keys (unless you use an order-preserving partitioner, but that defeats the highly distributed aspect of Cassandra).
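The trade-off between the two kinds of partitioner can be sketched in a few lines of Java. This is an illustration only, not Cassandra's code: PartitionerSketch, its methods, the four-node cluster and the split points are all made up, and String.hashCode() stands in for the much stronger hash a real random partitioner would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of two ways to assign row keys to nodes.
class PartitionerSketch {
    // Hash partitioning: the node is derived from a hash of the key, so
    // neighbouring keys end up spread over the cluster.
    static int hashNode(String key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    // Order-preserving partitioning: each node owns a contiguous key range.
    // splitPoints must be sorted; node i owns the keys below splitPoints[i]
    // that no earlier node owns, and the last node owns everything else.
    static int orderedNode(String key, String[] splitPoints) {
        for (int i = 0; i < splitPoints.length; i++) {
            if (key.compareTo(splitPoints[i]) < 0) {
                return i;
            }
        }
        return splitPoints.length;
    }

    // Nodes that hold at least one of the keys under hash partitioning.
    static Set<Integer> nodesForHash(List<String> keys, int nodeCount) {
        Set<Integer> nodes = new TreeSet<>();
        for (String key : keys) {
            nodes.add(hashNode(key, nodeCount));
        }
        return nodes;
    }

    // Nodes that hold at least one of the keys under ordered partitioning.
    static Set<Integer> nodesForOrdered(List<String> keys, String[] splitPoints) {
        Set<Integer> nodes = new TreeSet<>();
        for (String key : keys) {
            nodes.add(orderedNode(key, splitPoints));
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Ten adjacent row keys, e.g. the result of one range query.
        List<String> keys = new ArrayList<>();
        for (int i = 100; i < 110; i++) {
            keys.add("user:" + i);
        }
        String[] splitPoints = {"user:250", "user:500", "user:750"};
        System.out.println("hash partitioning touches nodes " + nodesForHash(keys, 4));
        System.out.println("ordered partitioning touches nodes " + nodesForOrdered(keys, splitPoints));
    }
}
```

With hashing, the ten adjacent keys are spread over all four nodes, so a range query would have to ask every node. With ordered ranges they all sit on one node, which makes the range query cheap but also concentrates the load for that key range on a single machine.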

HBase - Column-based data store with map/reduce support. Supports range queries on the PK. Has excellent write performance. Used by Facebook to drive their real-time analytics solution. It potentially suffers from a single point of failure because of the underlying HDFS file system (the NameNode is that single point), and multiple solutions have been proposed. Because of the complexity of HDFS and the outstanding issues with setting up a highly available NameNode, HBase is not yet ready for wide adoption.
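Range queries on the PK come naturally when rows are kept sorted by key. As a rough sketch of the idea (not the HBase client API), a TreeMap can stand in for the sorted table; SortedRowScan, its methods and the date-prefixed keys are invented for the example, and the scan treats the start key as inclusive and the stop key as exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for a table sorted by row key.
class SortedRowScan {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Returns the values whose keys fall in [startKey, stopKey), in key order.
    List<String> scan(String startKey, String stopKey) {
        return new ArrayList<>(rows.subMap(startKey, true, stopKey, false).values());
    }

    public static void main(String[] args) {
        SortedRowScan table = new SortedRowScan();
        // Prefixing the key with the date makes "all events for one day"
        // a single contiguous range.
        table.put("2011-07-10#evt1", "login");
        table.put("2011-07-11#evt1", "search");
        table.put("2011-07-11#evt2", "purchase");
        table.put("2011-07-12#evt1", "logout");
        System.out.println(table.scan("2011-07-11", "2011-07-12"));
    }
}
```

The design point is the row key: whatever you want to range over has to be the leading part of the key, because that is the only order the store maintains.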

That's it for now. I will add more later.

Monday, July 11, 2011
