Scalability - Why most web applications can not scale horizontal
Last week one of the greatest cloud computing events in the world was going on: The DockerCon2016. And one of the things docker want to bring to people is scalability for their applications. Especially horizontal scalability.
Docker makes a great job here but if your application does something trivial wrong it doesn’t matter. It then won’t be able to scale without faults, which may imply data loss or data corruption.
The regular Setup
One of the favourite example for setup a scalable web application is WordPress. Many people like it and many people use it.
Good news for WordPress: The WordPress core(!) can scale on that level. WordPress has no problem with 100 running instances facing content to your users. As long as you have no bad written plugins (Hehe…), your database backend, your storage systems and your network can handle the traffic - everything works well.
But to be honest: Who needs to scale a blog? And why is this important for a web application you are writing then?
The WordPress way of requests
Anyway. WordPress is a CMS for bloggers. This means you have bloggers on one hand and readers on the other. Hopefully more readers than blogger.
Let’s have a look to the requests which are generated by this type of application:
As you see there is a cluster where the services are running on. (Maybe a cloud or what ever) There are 3 instances of an application like WordPress and a database server.
The blue arrows are requests that cause reads against the database (like a HTTP-GET
triggering a select
statement). So your readers visit your blog and get some “important” articles shown.
The red arrow is a request which cause a write against the database (like a HTTP-POST
triggering a update
statement). It’s a blogger writing/editing an article or a reader posting a comment.
Everything works fine here.
The Problem
How does WordPress or most of the other web applications handle request? They use them to generate a query against the database which contains the content.
While most GET
requests triggers only select statements against the database the point of a POST
request is storing data inside your application which results a database entry in most cases.
What if your requests looks like this:
Like in the first diagram you also have your cluster at the bottom, the database and 3 instances of your application.
But you have a lot more write traffic on your application servers. So may all instances fire their write queries against your database simultaneously.
You point out “Yes, against the database and everything is fine :)” but in most cases you are wrong.
It’s the common problem of parallel computing: The Producer–consumer problem!
Heh?! Why? My database takes care about all the data and their integrity! You might think.
A little example
Let’s say you want to write a view counter and for every 100th user you do something crazy. Like generate a random picture you show or pay someone a dollar. It’s not important what you do just do something.
To show the user-count in your website your code may look like this:
<?php
function showViewCounter() {
// known variables:
// $conn -> the ready MySQLi object (connected to the database)
$sql = "Select counter from site_stats;";
$result = $conn->query($sql);
if ($result->num_rows > 0) {
$row = $result->fetch_assoc();
echo "<span>" . $row["counter"] . "</span>";
}
}
?>
But showing the number without increase the number is dump. So let’s do this, too. We now use a absolute stupid version just to make the problem obvious:
function showViewCounterWithDumpUpdate($id) {
// known variables:
// $conn -> the ready MySQLi object (connected to the database)
// Please notice: THIS IS NOT SECURE NEVER USE VARIABLES IN YOUR QUERIES!!!
$sql = "Select counter from site_stats where id=" . $id . ";";
$result = $conn->query($sql);
if ($result->num_rows > 0) {
$row = $result->fetch_row();
if ($row["counter"] % 100 == 0) {
// do some very special code
} else {
echo "<span>" . $row["counter"] . "</span>";
}
}
$sql = "Update site_stats SET (counter=" . $row["counter"] + 1 . " where id=" . $id . ";";
$result = $conn->query($sql);
}
The problem you are running in: What happens if there are more than 1000 requests in a minute so your cluster sends multiple requests for different users to the database servers while your showViewCounterWithDumpUpdate()
-function is running?
RIGHT! The update will probably overwrite the already increased number in your database.
Shouldn’t the database prevent something like that? Why doesn’t those transaction things work?
Yes, it should and it does! The transactions are working, but not when you are performing two of them. One for the query and one for writing. The problem is the value you stored in PHP and write back to the database.
How to prevent bad database updates
In my example it’s trivial and really simple to fix by a query like this:
UPDATE site_stats SET counter=counter+1;
This way the database increases the real value of the row instead of using your cached version. Another positive aspect is that you can use this query together with your select statement. So you perform a single database transaction for both queries.
Second option is just use a table and run insert instead of updating a row. Inserts won’t overwrite each other. This may let your database grow up rapidly so it’s not a perfect solution.
Another possible solution is creating a trigger. Triggers allow you to customize your database behaviour. This Link should help you to understand how to work with triggers.
At least and worst you can also lock tables. BUT YOU SHOULD NEVER DO THIS! It’s not trivial and for sure your database will run into a deadlock. Just NEVER EVER DO THIS!
Conclusion
Create a scalable application that only reads is easy but if you run into the problem having massive writes to the same table row you run into trouble and your application only scales as good as your database master(s).
Especially updates can make a lot of trouble so try to avoid them. Adding a table and appending lines instead of updating a row is a possible solution.
So in your next application think about the variables you use to store database values and what kind of data you place in your database.