forge

Performance and Scalability in Today’s Age – Horizontal vs. Vertical Scaling

In early 2006 I wrote an article for TechRepublic. That article was titled “The Declining Importance of Performance Should Change Your Priorities.” Of all of the articles that I have ever written this article has been the most criticized. People thought I was misguided. People thought I was crazy. People believed that I was disingenuous in my beliefs. What I didn’t say then is what I can share now. I believed – and believe – every word. The piece that the readers missed was my background coming to writing the piece – and the awareness that you must have some testing. The title wasn’t “Stop Performance Testing” nor was it “Design your Applications Poorly” it was just about the declining importance.

In 2009, I wrote a series of articles for Developer.com on performance from a development perspective – to cover in detail my awareness of the need to do good performance design. The articles were: Performance Improvement: Understanding, Performance Improvement: Session State, Performance Improvement: Caching, and Performance Improvement: Bigger and Better. In these articles I was talking about strategies, techniques, and approaches to learn how to do high performance systems – and how to measure them. I wrote these articles from the fundamental point of view of scalability that it is mostly horizontal – that adding more servers is how you maintained performance and increased scalability through the introduction of additional servers.

Today, I’m struck by how we scale environments so I want to take a step back to review these articles, I want to explain performance management, and I want to move forward about how we scale in today’s world.

The first thing you have to know about the article from 2006 is that I was writing quite a bit back then. One year was 75 articles on top of my “day” job in consulting. As a result, I took my writing topics from my consulting. At this point, I was at a client where there was so great an investment in performance testing that it was starving out important development needs. What I didn’t say is there was nearly zero correlation between the results of the detailed performance analysis really said nothing about how the system would really perform. That’s the reason I wrote the article, it was an attempt to soften the fervor for performance testing – not because I didn’t believe in knowing more about what you’re building. Rather, it was because I believed that things were changing – and I still believe they are. To this day I recommend performance, scalability, and load testing. I just do so with realistic expectations and realistic budgets.

[Note: I’ll use performance testing as shorthand for talking about performance, load, and scalability testing. I’m aware they are different related topics, but it’s just easier to simplify them to performance testing for conversational purposes – if we get into detail we can split one type of testing from another.]

It used to be that when we thought about performance (and scalability which is inextricably linked) we think about creating faster computers with more memory and faster hard disks – strike that let’s use SSDs. We believe that we had to grow the computer up and make it faster. However, in today’s world we’ve discovered that we have to have to apply multiple computers to the problem to solve whatever problem we have. In effect we’ve moved from vertical scaling to horizontal scaling. I need to explain this but before I go there I have to talk about the fundamentals of performance management. This is a conversation that Paul Culmsee and I’ve been having lately and for which he’s been writing blog articles.

Oversimplification of Performance Monitoring

Over the years I’ve written probably a dozen or so chapters on performance for various books. Honestly, I was very tired of it before I gave it up. Why? Well, it’s painfully simple and delightfully difficult. Simple in that there are really only four things you need to pay attention to: Memory, Disk, Processing, and Communication. It’s delightfully difficult because of the interplay between these items and the problem of setting up a meaningful baseline but more on that in a moment.

Perfect Processing

When I get a chance to work with folks on performance (which is more frequent than most folks would expect), I make it simple. I look at two things first. I look at processing – CPU because it’s easy. It’s the one everyone thinks about. I look for two things. Is there a single core in the CPU that’s completely saturated? If so I’ve got a simple problem with either hardware events or single threaded software. However, in today’s world of web requests the problem of single threading is much less likely than it was years ago. The second thing I look for is to see if all the cores are busy – that’s a different problem – but a solvable one. Whether it’s making the code more efficient by doing code profiling and fixing the bad regions of code or adding more cores to the server doesn’t matter from our perspective at the moment.

I Forgot About Memory

The next thing I look at is memory. Why memory? I look at memory because memory can cover a great many sins. Give me enough memory and I’ll only use the disks twice. Once to read it into memory and once to write it back out when it has changed or when I want to reboot the computer. If you have too little memory it will leak into disk performance issues due to the fun of virtual memory.

Let me go on record here about virtual memory. Stop using virtual memory. Turn off your paging files in Windows. This is a concept that has outlived its usefulness. Before the year 2000 when memory was expensive it was necessary evil. It’s not appropriate any longer. Memory should be used for caching – making thing faster. If your machine isn’t saying that it’s doing a lot of caching – you don’t have enough memory. (For SQL Server, look for SQL Buffer Manager: Page Life Expectancy – if it’s 300 or more … you’re probably OK.)

Dastardly Disks

So then, why don’t I start with disks – particularly since I’m about to tell you it’s the largest issue in computer performance today? The answer is that they’re hard to look at for numerous reasons. We’ve created so much complexity on top of hard disks. We have hard disk controllers that leverage RAID to create virtual disks from physical ones. We have SANs that take this concept and pull it outside of the computer and share it with other computers.

We have built file systems on top of the sectors of the disk. The sectors are allocated in groups called clusters. We sometimes read and write randomly. We sometimes read and write sequentially. Because of all of the structures we occasionally see “hot spots” where we’re exercising one of the physical disks too much. Oh and then there’s the effect of caching and how it changes disk access patterns. So ultimately I ask about the IOPS and I ask about latency (average time per read and average time per write). However, the complexity is so great and the patterns so diverse that it’s sometimes hard to see the real problem amongst all the noise. (If you want more on computer disk performance, check out my previous blog post “Computer Hard Disk Performance – From the Ground Up“)

With experience you learn to notice what is wrong with disk performance numbers and what the root cause is but often explaining this to a SAN engineer, SAN company, or even a server manager becomes a frustrating exercise. Gary Klein in Sources of Power talks about decisions and how experts make decisions and the finding is that they make decisions based on complex mental models of how things work and upon expectancies. In the book Lost Knowledge David DeLong addresses the difficulty of a highly experienced person teaching a less experienced person. The mental models of the expert are so well understood by the expert – and not by the novice – that communicating the knowledge about why something is an issue is difficult. Finally, in the book Efficiency in Learning Ruth Clark, Frank Nguyen, and John Sweller discuss how complex schemas (their word for the mental models) can be trained – but the approaches they lay out are time consuming and complex. This is my subtle way of saying that if the person you’re speaking with gets it – great. If not, well, you are going to be frustrated – and so are they.

Constant Communications

The final performance area, communication, is the most challenging to understand and to test. That is communications is the real struggle because it is always the bottleneck. Whether it’s the 300 baud modems that I had when I first started with a computer or the ten gigabit Ethernet switching networks that exist today – it’s never enough. It’s pretty common for me to walk into an environment where the network connectivity is assumed to be good or enough. Firewalls slip in between servers and their database server. We put in 1GB links between servers – and even when we bond them together two or four at a time we route them through switches that only have a 1GB backplane – meaning that they can only route 1GB worth of packets anyway. There are many points of failure in the connections between servers and I think I’ve seen them all fail – including the network cable itself.

One of my “favorite” problems involved switch diversity. That’s where the servers are connected to two different switches to minimize the possibility of a single switch failing and disconnecting the server. We had bonded four one gigabit connections two on each switch and we were seeing some very interesting (read: poor) performance. Upon further review it was discovered that the switches only had a single one gigabit connection between them so they couldn’t keep up with all the synchronization traffic between the two switches. Oh, and just for fun neither of the switches were connected to the core network with more than a one gigabit connection either. We fixed the local connectivity problem to the switches by doing bonding. We solved the single point of failure problem with switch diversity – but we didn’t fix the upstream issues that prevented the switches from performing appropriately.

With the complexity of communication networks it’s hard to figure out what exactly how to know there’s a performance problem. You can look at how much the computer transmits across the network or you can monitor a switch but there are so many pieces to measure it may be find to find the real bottleneck. I’m not saying that you shouldn’t take the effort to understand where the performance issues are, I’m just saying there are ways to quickly and easily find potential issues –and there are harder ways.

Communications are likely to always be the bottleneck. Even in my conversations today one of our biggest concerns is WAN performance and latency. We spend a great deal of time with client-side caching to minimize the amount of data that must be pulled across the WAN. We buy WAN accelerators that do transparent compression and block caching. It’s all to the goal of addressing the slowest part of the performance equation – the client connectivity. It’s almost always a part of the problem. (If you want a practical example, take a look at my blog post “SharePoint Search across the Globe.”)

Perhaps the most challenging part of working with communications as a bottleneck is the complexity. While we can dedicate a server to a task and therefore isolate the variables we’re trying to test, most communications resources are shared and because of this present the problem of needing to isolate the traffic we’re interested in – and it means that we’ll sometimes see performance issues which have nothing to do with our application.

One of the great design benefits of the Internet is its resiliency. Designed from the ground up to accommodate failures (due to nuclear attacks or what-have-you) buys you a lot of automatic recovery. However, automatic recovery takes time – and when it’s happening frequently, it may take too much time. One frequent problem is packet loss. There’s a normal amount of packet loss on any network due to sun spots, quarks, neutrinos, or whatever. However, when this packet loss even approaches 1% is can create very noticeable and real performance problems. This is particularly true of WAN connections. They’re the most susceptible to loss and because of the latencies involved they can have the greatest impact on performance. Things like packet loss rarely show up in test cases but show up in real life.

Bottlenecks

Here’s the problem with performance (scalability, and load) testing. When you’re looking for a performance problem, you’re looking for a bottleneck. A bottleneck is a rate limiter. It’s the thing that says you can’t go any faster because of me. It gets its name because a bottle’s neck limits the amount of fluid you can poor into or out of the bottle.

With finding bottlenecks, you’re looking for something that you won’t find until it becomes a problem. You have to make it a problem. You have to drive the system to the point where the rate limiting kicks in and figure out what caused it – that’s what performance testing is about. One would appropriately assume that we use performance testing to find bottlenecks. Well, no, at least not exactly. We use performance testing to find a single bottleneck. Remember a bottleneck is a rate limiter so the first bottleneck limits the system. Once you hit that rate you can’t test any more. You have to remove the bottleneck and then test again. Thus we see the second issue with performance testing.

The first issue that I discussed casually was complexity. Now the second issue is we can only find performance problems one at a time. We only get to identify one bottleneck at a time. That means that we have to make sure that we’re picking the right approach to finding bottle necks, so we find the right one. As it turns out, that’s harder than it first appears.

What Problem Are We Solving Again?

We’ve seen two issues with performance testing but I’ve not exposed the real problem yet. The real problem is in knowing how the users will use the system. I know. We’ve got use cases and test cases and requirements and all sorts of conversations about what the users will do – but what will they REALLY do? Even for systems where there are established patterns of use, the testing usually involves a change. We do testing because we want to know how the change in the code or the environment will impact the performance of the system. Those changes are likely to change behavior in unpredictable ways. (If you don’t agree read Diffusion of Innovations.) You won’t be able to know how a change in approach will impact the way that users actually use the system. Consider that Amazon.com knows that even small changes in performance of their site can have a dramatic impact on the number of shopping carts that are abandoned – that doesn’t really make any sense given the kinds of transactions that we’re talking about – but it’s reflected in the data.

I grew up in the telecommunications area in my early career. We were talking about SS7 – Signaling System 7. This is the protocol with which telephone switches communicate with one another to form up a call from one place to another. I became aware of this protocol when I was writing the program to read the switches – I was reading the log of what the messages were. The really crazy thing is that SS7 allowed for a small amount of data traffic across the voice network for switching. In 1984 Friedhelm Hillebrand and Bernard Ghillebaert realized that this small side-band data signaling area could be leveraged to allow users to send short messages to one another. This led to the development of the Short Message Service (SMS) that many mobile phone users use today. (It got a name change to texting somewhere along the way.) From my perspective there’s almost no possible way that the original designers would have been able to see that SS7 would be used by consumers to send “short” messages.

Perhaps you would like a less esoteric example. There’s something you may have heard of. It was originally designed to connect defense contractors and universities. It was designed to survive disaster. You may have guessed that the technology I am talking about is the Internet. Who could have known back then that we’d be using the Internet to order pizza, schedule haircuts, and generally solve any information or communication need?

My point with this is that if you can’t predict the use then you can’t create a synthetic load – the artificial load that performance testing uses to identify the bottleneck. You can guess at what people do but it’s that it’s a guess – it’s not even an educated guess. If you guess wrong about how the system is used you’ll reach the wrong conclusion about what the bottleneck will be. In articles, I have talked about reflections and exception handling. Those are certainly two key things that do perform more slowly – what if you believed that the code that did some reflections was the most frequently used code – instead of the least frequently used code? You’re likely to see CPU as the key bottleneck and ignore the disk performance issues that become apparent with heavy reporting access.

So the key is to make sure that the load you’re using to simulate the load of users is close to what they’ll actually do. If you don’t get the simulation right you may – or may not – see the real bottlenecks that prevent the system from meeting the real needs of the users. (This is one of the reasons why the DevOps movement is entertaining. You can do a partial deployment to production and observe how things perform for real.)

Baselining

Before I leave the topic of performance and performance testing, I have to include one more key part of performance management. That is, baselining the system. That is recording performance data when things are working as expected so that it’s possible to see which metric (or metrics) are out of bounds when a problem occurs. It is one thing to know that performance has dropped from 1 second per page to 3 seconds per page – and quite another to identify which performance counters are off. What often happens is that the system is running and then after a while there’s a problem. Rarely does anyone know how fast the system was performing when it was good – much less what the key performance counters were.

Establishing a baseline for performance is invaluable if you do find a performance problem in the future – and in most cases you will run into a performance problem in the future. Interestingly, however, rarely do organizations do a baseline because doing a baseline requires a time-disconnection from launch and remembering to do something a month after a new solution is launched is hard to do because you’re on to the next urgent project – or you simply forget. Further, without a periodic rebaselining of the system, you’ll never be able to see the dramatic change that happens right before the performance problem.

With baselining you’re creating a snapshot of “normal” that is, this is what normal is for the system at this moment. We leverage baselines to allow us to compare the abnormal condition against the known simplification of the normal condition. We can then look for values that are radically different than the normal that we recorded – and then work backwards to find the root cause of the discrepancy – the root cause of the performance issue. Consider performance dropping from 1 second per page to 3 seconds per page. Knowing a normal CPU range and an abnormal one can lead to an awareness that there’s much more CPU time in use. This might lead us to suspect new code changes are causing more processing, or perhaps there’s much more data in the system. Similarly if we see our average hard disk request latency jumps from 11ms to 23ms we can look at the disk storage system to see if it is coping with a greater number of requests – or perhaps a drive has failed in a RAID array and it hasn’t been noticed so the system is coping with fewer drive arms.

Vertical vs. Horizontal Scaling

There was a day when we were all chasing vertical scaling to deal with our scalability problems. That is we were looking for faster disks, more memory, and faster CPUs. That is we wanted to take the one box that we had and make it faster – fast enough to support the load we want to put on it. Somewhere along the way we became aware that no box – no matter how large – could survive the onslaught of a million web users. We decided – rightfully so – that we couldn’t build one monolithic box that would handle everything but instead we had to start building a small army of boxes willing to support the single cause of keeping a web site operational. However, as we’ve already stated, there are new problems that emerge as you try to scale horizontally.

First, it’s possible to scale horizontally inside of a computer – we do this all the time. Multiple CPUs with multiple cores and multiple disk arms all working towards a goal. We already know how this works –and what the problems are. At some point we hit a wall with processor speeds, we could only get the electrons to cycle so quickly. We had to start packing more and more of the CPUs with more and more cores. Which led us to a problem with is the problem of latency and throughput.

The idea is that you can do anything instantly if you have enough throughput. However, that doesn’t take into account latency – the minimum amount of time that it will take for a single operation to take. Consider the simple fact that nine women can have nine babies in nine months, however, no women can have a baby in a month. The latency of the transaction – having a baby – is nine months. No amount of additional women (or prompting by parents) will increase the rate at which babies are made. Overall throughput can be increased by adding more women but the inherent latency is just that – inherent. When we scale horizontally we confront this problem.

A 2Ghz CPU may complete a thread of execution twice as fast (more or less) than a 1Ghz CPU – however, a two core 2Ghz CPU won’t necessarily complete the same task any faster – unless the task can be split into multiple threads of execution. In that case we may see some – or dramatic – improvement in the performance. Our greatest enemy when considering horizontal scaling (inside a single computer or across multiple computers) is the problem of a single large task. The good news is that most of the challenges we face in computing today are not one single unbreakable task. Most tasks are inherently many smaller tasks – or can be reconstructed this way with a bit of work. Certainly when we consider the role of a web server we realize that it’s rarely one user consuming all the resources. Rather it’s the aggregation of all of the users that cause the concern.

The good news is that horizontal scaling is very effective at problems which are “death by a thousand cuts.” That is they are good at problems which can be broken down, distributed, and processed in discrete chunks. In fact, this is the process that is most central to horizontal scaling.

Horizontal Scaling as breaking things – down

My favorite example of horizontal scaling are massively distributed systems. A project called SETI@Home leverages “spare” computing cycles from thousands of computers to process radio telescope data looking for intelligent life beyond our planet. It used to be that the SETI project (Search for Extra Terrestrial Intelligence) used expensive super computers to process the data from the radio telescopes. This process was converted so that “normal” computers could process chunks of the data and return the results to centralized servers. There are the appropriate safeguards built into the system to prevent folks from falsifying the data.

Putting on my developer hat for a moment, there are some parts of a problem that simply can’t be broken down and far more where it simply isn’t cost effective/efficient to even try to break the problem down. Take for instance a recent project where I needed to build a list of the things to process. This took roughly 30 seconds to complete in production. The rest of the process, performing the migration, took roughly 40 hours of time if processed in a single thread. I didn’t have that much time for the migration. So we created a pool of ten threads that each pulled out one work item at a time out of the work list and worked on it (with the appropriate locking). The main thread of the application simply waited for the ten threads to complete. It wasn’t done in 4 hours – but it was done in 8 hours. Well within the required window because I was able to break the tasks into small chunks – one item at a time.

Admittedly without some development help breaking the workload down isn’t going to be easy – but anytime I’m faced with what folks tell me is a vertical scaling problem – where we need one bigger box – I have to immediately think about how to scale it horizontally.

Call to Action

If you’re facing scalability or performance challenges… challenge your thinking about how the problem is put together and what you can do to break the problem down.