SQL or NoSQL? How about both?

Wed, Oct 25, 2017       8-minute read

Hands up if you grew up with structured databases. Did you learn some flavour of SQL? Maybe MS Access in school and then MySQL when you started dabbling in code.

SQL has spent many years as the go-to type of database management system. If you work professionally as a software engineer it’s a safe bet that you have worked on dozens or maybe even hundreds of projects that are supported by a SQL back end. Most likely MySQL, but if not then almost certainly MSSQL or PostgreSQL.

Although SQL has a reputation for sometimes being slow, centralised and just old, it is undeniably stable and has formed a core part of most web-based software over the last two decades or so. And yet in recent years there has been a remarkable trend towards NoSQL database management systems.

If you have ever worked for a startup, there is a good chance you’ll have worked with a NoSQL database. Startups often have little in the way of legacy code to support, employ young keen engineers and encourage exploration of all those next big thing technologies that might just help them stand out from the crowd. With the recent explosion in tech startups, it was inevitable that something like NoSQL would gain traction.

Additionally, the advent of widely available cloud infrastructure has hastened adoption of NoSQL: I can set up distributed database clusters with a few clicks. I can then use tools like NPM and Composer to obtain an easy-to-use SDK within seconds. Clever IDEs utilise type hinting to write half the code for me, and if I do get stumped I can always turn to a set of comprehensive docs or a community of fellow users who are evangelists for the technology.

All of this was impossible back in the day, when the best support came from the largest companies (albeit often paid), and rarely was it wise to bet on the little guy. That is why SQL rose to the heights that it did then and NoSQL is doing much the same now.

But just because a good old technology (or a trendy new one) can be used for a project, does that mean that it should?

Picking the right tool for the job

As a professional software engineer I have always been wary of subscribing blindly to development trends, however fun it might be to learn a new stack and magically prototype something in an insanely short amount of time. Whilst it is true that everything that sticks started out as the next big thing, most of the projects that purport to be the next big thing eventually disappear into obscurity.

It is a subtle balancing act to make the correct decisions at the appropriate time.

SQL evangelists will be quick to point out the benefits of structured, relational data. They will cite the stability of software libraries that have been actively maintained for over two decades.

NoSQL evangelists will be keen to point out the freedom that your data is afforded when it no longer has to adhere to a schema. They may point to the increased abilities to query nested data that often come with NoSQL databases.

Both camps are right, of course. And both are also usually wrong.

Some projects (most public-facing web applications, for instance) have very structured data. Sometimes it does not even need to be searchable; numeric primary keys are enough to display records to only the person who created them.

Likewise, some projects may need to ingest large amounts of free-form data to perform analysis on at a later date. Any tool that deals with logging might fall into this category. Here a SQL database would be absolutely useless.

It is important to always look at the use case of a project and leave your prejudices and favourite stacks at the door.

What about something in-between?

The above should be common sense. But what if you have a use case that is somewhere in the middle? What if your project needs to store structured data in a deeply nested manner (say, a JSON object)? Or what if it needs to ingest unstructured data, sanitise it and then store it in a structured manner? Or do that and then search on subfields of subfields?

Here is where it can start to get messy.

You could arrange the data as a JSON object, stringify it and store it as a text field. More recent SQL servers may even allow you to search within JSON strings. But how do you browse that manually? Are you able to easily update only a subset of that data if it changes?
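
As a rough sketch of that first option (SQLite here purely because it ships with Python; the table and field names are made up for illustration):

```python
import json
import sqlite3

# The whole payload is stringified and stored in a single TEXT column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (id INTEGER PRIMARY KEY, payload TEXT)")

doc = {"user": {"name": "Ada", "country": "UK"}, "tags": ["alpha", "beta"]}
conn.execute("INSERT INTO submissions (payload) VALUES (?)", (json.dumps(doc),))

# Recent engines can look inside the string: json_extract needs an SQLite
# build with the JSON1 extension, and MySQL 5.7+ and PostgreSQL have their
# own JSON operators. Updating one nested field, or browsing by eye, is
# still awkward.
rows = conn.execute(
    "SELECT id FROM submissions WHERE json_extract(payload, '$.user.country') = ?",
    ("UK",),
).fetchall()
```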

The SQL mindset says to standardise the incoming data, process it into useful information and separate it out into a number of tables linked together with foreign keys. This is efficient and easy to manually browse — you can start a search from any part of the data and follow the links between the tables.
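
A minimal sketch of that approach, again with illustrative names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Each facet of the incoming data gets its own table, linked by foreign keys.
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE submissions (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    title   TEXT NOT NULL
);
CREATE TABLE tags (
    id            INTEGER PRIMARY KEY,
    submission_id INTEGER NOT NULL REFERENCES submissions(id),
    label         TEXT NOT NULL
);
""")

# From any one table you can follow the keys to everything related to it.
rows = conn.execute("""
    SELECT users.name, submissions.title, tags.label
    FROM tags
    JOIN submissions ON submissions.id = tags.submission_id
    JOIN users ON users.id = submissions.user_id
""").fetchall()
```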

But how would you perform a keyword search on something like that? You could do a contains query, but once you get beyond tens of thousands of records the computational cost will likely be prohibitive.
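
Continuing the previous sketch, the "contains" query in question is a LIKE with wildcards on both sides, which no ordinary index can help with:

```python
keyword = "report"

# The leading wildcard defeats a normal B-tree index, so every row's text is
# scanned on every search; fine at small scale, painful at large scale.
rows = conn.execute(
    "SELECT id, title FROM submissions WHERE title LIKE ?",
    (f"%{keyword}%",),
).fetchall()
```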

So you think that maybe you should just store the data in a NoSQL database. This is, after all, the kind of thing that they are suitable for, right?

Querying the data may prove to be easier, but managing it becomes more of a task. And with no well-defined tables and schema, the database ends up feeling like document soup if you have to dive into it manually. You will be retrieving all of the data on every query, which may prove too heavy; you could limit the fields that come back, but with no strict schema to lean on there is nothing to tell you which fields those should be. And what about records that are related in some way to one another?
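
To make the field-limiting point concrete, here is a sketch using Elasticsearch (which this post settles on later) via the elasticsearch-py client. Exact call signatures vary between client versions, and the index and field names are made up:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

# Nothing stops a query from dragging back every field of every matching
# document; _source filtering has to be remembered on each call.
resp = es.search(
    index="submissions",
    body={
        "query": {"match": {"title": "report"}},
        "_source": ["title", "user.name"],
    },
)
```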

The old problems are back, and it seems not only that both options are viable solutions, but also that neither of them is.

The solution

Take this advice with a grain of salt — every project’s needs will, after all, be different.

In a lot of cases, applications spend more time retrieving and displaying information than they do updating it. We are not talking about chatrooms or blockchain applications, but your standard web application.

There may be a steady flow of incoming data, but it does not need to be accessible in the same instant that it is submitted. It is usually perfectly acceptable for it to be placed onto a queue and churned through by a microservice some seconds later. If the application is particularly large this may even be a hard requirement of the application design.
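
A minimal in-process sketch of that pattern; a real deployment would use a proper broker such as RabbitMQ, SQS or Redis, with the consumer running as its own microservice:

```python
import queue
import threading

incoming: queue.Queue = queue.Queue()

def handle_request(payload: dict) -> None:
    # The web handler acknowledges immediately and defers the heavy lifting.
    incoming.put(payload)

def worker() -> None:
    # The consumer churns through submissions a few seconds after they arrive.
    while True:
        payload = incoming.get()
        # ...sanitise, normalise and write to the internal database here...
        incoming.task_done()

threading.Thread(target=worker, daemon=True).start()
```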

In this kind of set-up, it is useful to consider two different database roles: an internal database, which the application and its related microservices communicate with to carry out work that the user does not see, and an external database, which directly feeds requests made by the user.

The internal database needs to be easy to traverse. It will be accessed by internal services that know the relevant primary keys and have no problem dealing with highly structured data. The external database needs to be able to cope with whatever the user asks of it; maybe a simple keyword search, or maybe a complex boolean and keyword search with misspelt keywords.

The internal database can exist as a regular SQL server, taking standardised data and separating it out into however many tables are necessary to fully describe its complex nature. Queries will rarely be run against it, so storing information in a way that can be joined with minimal overhead need not be a primary concern. It should rely heavily on foreign key constraints to link information that came from the same initial dataset.

It will also trigger an action on every CUD (create, update, delete) operation.

The triggered action will gather the information together into a single document, probably a nested JSON object. It will then insert or update the document into the external database; your choice of NoSQL server. I have personally had great experience with Elasticsearch for this purpose.
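
A sketch of that action, reusing the illustrative schema from earlier and the elasticsearch-py client (exact signatures differ between client versions):

```python
import sqlite3
from elasticsearch import Elasticsearch, NotFoundError

conn = sqlite3.connect("internal.db")            # the internal SQL database
es = Elasticsearch("http://localhost:9200")      # the external search database

def sync_submission(submission_id: int) -> None:
    """Invoked after any create, update or delete touching this record."""
    row = conn.execute(
        "SELECT s.id, s.title, u.name FROM submissions s "
        "JOIN users u ON u.id = s.user_id WHERE s.id = ?",
        (submission_id,),
    ).fetchone()

    if row is None:
        # Deleted internally, so remove the external document as well.
        try:
            es.delete(index="submissions", id=submission_id)
        except NotFoundError:
            pass
        return

    tags = [r[0] for r in conn.execute(
        "SELECT label FROM tags WHERE submission_id = ?", (submission_id,))]

    # Flatten the relational rows into one nested document and upsert it.
    doc = {"id": row[0], "title": row[1], "user": {"name": row[2]}, "tags": tags}
    es.index(index="submissions", id=submission_id, body=doc)
```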

Requests from the user, no matter how complex, can then form a query against the NoSQL database that should be very quick to execute. If performance is not great, it is simply a matter of tweaking the document’s structure (and, if using something like Elasticsearch, the index’s mapping) and reindexing the data.
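
For example, a boolean filter combined with a fuzzy keyword match covers the misspelling case; again a sketch using elasticsearch-py and made-up field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="submissions",
    body={
        "query": {
            "bool": {
                # Hard constraint: only this user's records.
                "filter": [{"term": {"user.name.keyword": "Ada"}}],
                # Fuzzy matching lets a misspelt keyword still score hits.
                "must": [{"match": {"title": {"query": "reprot", "fuzziness": "AUTO"}}}],
            }
        }
    },
)
hits = [hit["_source"] for hit in resp["hits"]["hits"]]
```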

The external database is completely disposable and can be optimised for reading. If you want to alter the application’s search in a way that is not backwards compatible, you can index the data side-by-side into a new database and seamlessly switch over once finished.
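
One hedged sketch of that switchover, assuming Elasticsearch index aliases and illustrative index names: the application only ever queries the alias, so the new index can be built alongside the old one and swapped in atomically.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Build the new index with the revised mapping alongside the live one.
es.indices.create(
    index="submissions_v2",
    body={"mappings": {"properties": {"title": {"type": "text"}}}},
)

# Copy the documents across (or re-run the sync job against the new index).
es.reindex(
    body={"source": {"index": "submissions_v1"}, "dest": {"index": "submissions_v2"}},
    wait_for_completion=True,
)

# Atomically repoint the alias that the application queries.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "submissions_v1", "alias": "submissions"}},
        {"add": {"index": "submissions_v2", "alias": "submissions"}},
    ]
})
```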

The end result is a near-real-time NoSQL database that is structured and easy to update. It will be blazingly fast to query and easy to horizontally scale as throughput increases.