
October 31, 2012


Book Review-Humilitas: A Lost Key to Life, Love, and Leadership

If you’ve been reading my blog – or even just the book reviews – for a while, you may have realized that I’m on a voyage of self-discovery – and it’s a voyage that isn’t even close to its end. I’m constantly trying to improve my understanding of psychology, sociology, and leadership – among other skills. My backlog of reading on these (and other) topics can feel overwhelming at times. I bring this up because I can already hear some of the potential comments about reading the book Humilitas – “So now that you’ve read the book you probably know everything there is to know about humility.” Um, no. The truth is that this is the third book I’ve read about humility. The other two, Humility and Humility: True Greatness, didn’t speak to me strongly enough to write a full book review on them. It’s not that they’re not good books – they just didn’t speak to me clearly.

I read these books because I realize that a lack of humility has been a barrier in my life. It’s like the pervasive smell that is in your clothes, in your hair, and in your world after going to a bonfire. Every time you believe you’ve rooted out the smell, there’s another whiff of it. Humilitas ends with a quote from C.S. Lewis: “If anyone would like to acquire humility, I can, I think, tell him the first step. The first step is to realize that one is proud. And a biggish step, too. At least, nothing whatever can be done before it. If you think you are not conceited, it means you are very conceited indeed.” In short, the moment you believe you’ve obtained humility, you haven’t.

I remember years ago I was playing a game, Ultima IV: Quest of the Avatar, the goal of which was to attain eight virtues (honesty, compassion, valor, justice, honor, sacrifice, spirituality, and humility). I was asked a question in one of the dialogs: “Art thou the most humble person on the planet?” (My apologies if the quote isn’t exact.) I can remember pausing and thinking that this was an awful question (which means it was brilliant). I had already earned the humility portion of the avatar, so I was humble. Being at the time 13 years old or so, I answered yes. The game responded with “Thou hast lost an eighth” a few times, because I lost humility and honesty, which in turn caused me to lose several of the other eighths that required truth. It was profound because I had been working quite hard to complete the game and in one moment I had lost half of my progress (four eighths). For all the evils that people find in role-playing games – this was a very memorable moment for me about humility (and still not enough to teach it to me).

Since then I’ve been accused of arrogance more than once by a variety of people – and I’ll accept that it’s a struggle, particularly when someone challenges what I know. (Strangely enough, this happens more frequently than one would expect.) That’s why I was surprised to see the book say, very early on, that contemporary Western culture often confuses conviction with arrogance. While that’s by no means an excuse for the lack of caring in my responses at times, it softened the blow for me. If I know that the technology is configured wrong, that it won’t work or won’t work right, it may not be arrogance to share this belief – in a caring way.

One of the really difficult things for me has been trying to rapidly establish my credentials in front of new audiences and clients. Conventional wisdom says that you have to introduce yourself to a client with your certifications, achievements, and accolades in an environment where you’re selling. At a corporate level, look at how nearly every technology company has their “brand name clients” slide that lists who they’ve worked with. On a personal level nearly every conference I’ve spoken at suggests or requires a bio slide. It’s the “about the speaker” slide that is supposed to help the attendees realize why the speaker is someone to listen to – and by extension why the conference organizers did their job correctly by selecting such great speakers to speak to you.

In a lot of ways, I’m a slow learner. It’s been relatively recently – and at the specific suggestion of some friends (thank you) – that I’ve stopped doing the “about me” slide. I sometimes have a slide that shares my background – but it does so for the purposes of establishing how I think about the topic – not for the purposes of sharing my credentials. My audiences deserve to know where I’m coming from – but they don’t need to know about my certifications.

I’ve been “taught” this in many different ways. Back at the NSA Conference (click for my experiences) I attended a session by Connie Podesta where she said something simple and profound: “the privilege of the platform.” Her talk was about how she made her business work. She was talking about her openness to learning, and how she was grateful for the privilege of speaking to another group of people. She never used the word humility – that I can remember – but she did demonstrate it.

Humilitas is the root of our word humble, and it is one word with two meanings: “to be humiliated” and “to be humble.” The book makes the point that being humble is not self-effacing. It’s not a solitary virtue. It’s not something that is forced on you. Humility is others-serving and a choice. Humility is more than modesty or dignified restraint (modesty comes from a different root, modestia).

I find it somewhat humorous that John Dickson spends so many words iterating and reiterating his perspective – that Jesus the Nazarene dramatically changed the course of human events – without asserting Christian beliefs. As a student of history he asserts that the change happened whether Jesus was the Christ, a prophet, or “just” a carpenter. It’s humorous to me because, even without any belief, there is historical evidence of the impact of Jesus.

It seems like nearly everyone struggles with humility. Thomas Gilovich did a study of 1 million (million!) high school seniors, and 70 percent thought they were “above average in leadership ability.” I don’t know these students, but I do know that at least 20% of them are wrong. It’s not just students who struggle with humility. Gilovich found that 94 percent of college professors believe they are doing a “better-than-average job.” Hmm, again, at least 44% of these professors have to be wrong – statistically speaking.

Dickson makes the point that humility is a useful virtue. That it serves not only the community at large – because humility is defined as using power to help others – but also helps the individual. Humble individuals know that they don’t know everything and so they’re always trying to learn from others, to understand someone else’s expertise.

Sometimes while presenting I have students in the classroom who want to demonstrate their knowledge by the questions that they ask. They ask about some specific situation that they already know the answer to. When I can’t answer their question, they provide their own answer. The humble person doesn’t sit quietly and shy away from asking questions; rather, they ask questions from the perspective of applying the talk to their situation – and of integrating the presentation into their way of thinking.

One last random thought, which comes by way of Amanda Gore. I was sent a link to one of her YouTube videos by a colleague. She was talking about a concept in Australia called “tall poppies” – basically a negative way of talking about those who stand out because of their merit or success. The connection is that if you hold your head up high – deserved or not – someone may come along and try to cut it off. Not that you shouldn’t own what you know, just that humility may be the right answer at times.

I invite you to join me on this journey by reading Humilitas.

Performance and Scalability in Today’s Age – Horizontal vs. Vertical Scaling

In early 2006 I wrote an article for TechRepublic titled “The Declining Importance of Performance Should Change Your Priorities.” Of all the articles I have ever written, this one has been the most criticized. People thought I was misguided. People thought I was crazy. People believed I was disingenuous in my beliefs. What I didn’t say then is what I can share now: I believed – and believe – every word. The piece the readers missed was the background I brought to writing it – and the awareness that you must do some testing. The title wasn’t “Stop Performance Testing,” nor was it “Design Your Applications Poorly”; it was just about the declining importance.

In 2009, I wrote a series of articles for Developer.com on performance from a development perspective – to cover in detail my awareness of the need to do good performance design. The articles were: Performance Improvement: Understanding, Performance Improvement: Session State, Performance Improvement: Caching, and Performance Improvement: Bigger and Better. In these articles I was talking about strategies, techniques, and approaches for building high-performance systems – and for measuring them. I wrote them from the fundamental point of view that scalability is mostly horizontal – that adding more servers is how you maintain performance and increase scalability.

Today, I’m struck by how we scale environments, so I want to take a step back to review these articles, explain performance management, and then look at how we scale in today’s world.

The first thing you have to know about the article from 2006 is that I was writing quite a bit back then – one year it was 75 articles on top of my “day” job in consulting. As a result, I took my writing topics from my consulting. At that point, I was at a client where there was so great an investment in performance testing that it was starving out important development needs. What I didn’t say is that there was nearly zero correlation between the results of the detailed performance analysis and how the system would really perform. That’s the reason I wrote the article: it was an attempt to soften the fervor for performance testing – not because I didn’t believe in knowing more about what you’re building, but because I believed that things were changing – and I still believe they are. To this day I recommend performance, scalability, and load testing. I just do so with realistic expectations and realistic budgets.

[Note: I’ll use performance testing as shorthand for talking about performance, load, and scalability testing. I’m aware they are different related topics, but it’s just easier to simplify them to performance testing for conversational purposes – if we get into detail we can split one type of testing from another.]

It used to be that when we thought about performance (and scalability, which is inextricably linked) we thought about creating faster computers with more memory and faster hard disks – strike that, let’s use SSDs. We believed that we had to grow the computer up and make it faster. However, in today’s world we’ve discovered that we have to apply multiple computers to whatever problem we have. In effect, we’ve moved from vertical scaling to horizontal scaling. I need to explain this, but before I go there I have to talk about the fundamentals of performance management. This is a conversation that Paul Culmsee and I have been having lately and for which he’s been writing blog articles.

Oversimplification of Performance Monitoring

Over the years I’ve written probably a dozen or so chapters on performance for various books. Honestly, I was very tired of it before I gave it up. Why? Well, it’s painfully simple and delightfully difficult. Simple in that there are really only four things you need to pay attention to: memory, disk, processing, and communication. It’s delightfully difficult because of the interplay between these items and the problem of setting up a meaningful baseline – but more on that in a moment.

Perfect Processing

When I get a chance to work with folks on performance (which is more frequent than most folks would expect), I make it simple. I look at processing – the CPU – first because it’s easy; it’s the one everyone thinks about. I look for two things. Is there a single core in the CPU that’s completely saturated? If so, I’ve got a simple problem with either hardware events or single-threaded software. However, in today’s world of web requests, the problem of single threading is much less likely than it was years ago. The second thing I look for is whether all the cores are busy – that’s a different problem, but a solvable one. Whether the fix is making the code more efficient by profiling it and fixing the bad regions, or adding more cores to the server, doesn’t matter from our perspective at the moment.
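To make that concrete, here’s a rough sketch of the kind of quick check I’m describing, in Python using the psutil library. The 95% and 85% thresholds are illustrative assumptions, not hard rules:

```python
import psutil

# Sample per-core utilization over a one-second window.
per_core = psutil.cpu_percent(interval=1, percpu=True)
average = sum(per_core) / len(per_core)

if max(per_core) > 95 and average < 50:
    # One core pegged while the others sit mostly idle: suspect a
    # single-threaded hot path or a hardware interrupt problem.
    print("Single-core saturation:", per_core)
elif all(core > 85 for core in per_core):
    # Every core busy: profile the code or add cores.
    print("All cores busy:", per_core)
else:
    print("CPU does not look like the bottleneck right now:", per_core)
```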

I Forgot About Memory

The next thing I look at is memory. Why memory? I look at memory because memory can cover a great many sins. Give me enough memory and I’ll only use the disks twice. Once to read it into memory and once to write it back out when it has changed or when I want to reboot the computer. If you have too little memory it will leak into disk performance issues due to the fun of virtual memory.

Let me go on record here about virtual memory: stop using virtual memory. Turn off your paging files in Windows. This is a concept that has outlived its usefulness. Before the year 2000, when memory was expensive, it was a necessary evil. It’s not appropriate any longer. Memory should be used for caching – making things faster. If your machine isn’t saying that it’s doing a lot of caching, you don’t have enough memory. (For SQL Server, look at the Buffer Manager: Page Life Expectancy counter – if it’s 300 or more, you’re probably OK.)
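If you’d rather pull that Page Life Expectancy number with a script than with Performance Monitor, a minimal sketch in Python via pyodbc might look like the following. The DMV and counter name are standard SQL Server objects, but the connection string is a hypothetical placeholder – adjust the driver and server for your environment:

```python
import pyodbc

# Hypothetical connection string - swap in your own server and driver.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=YOURSQLSERVER;Trusted_Connection=yes"
)

row = conn.execute(
    """
    SELECT cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%Buffer Manager%'
      AND counter_name = 'Page life expectancy'
    """
).fetchone()

ple_seconds = row[0]
print("Page Life Expectancy:", ple_seconds, "seconds")
if ple_seconds < 300:
    print("Pages are being evicted quickly - the server may be starved for memory")
```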

Dastardly Disks

So then, why don’t I start with disks – particularly since I’m about to tell you they’re the largest issue in computer performance today? The answer is that they’re hard to look at, for numerous reasons. We’ve created so much complexity on top of hard disks. We have disk controllers that leverage RAID to create virtual disks from physical ones. We have SANs that take this concept, pull it outside of the computer, and share it with other computers.

We have built file systems on top of the sectors of the disk. The sectors are allocated in groups called clusters. We sometimes read and write randomly. We sometimes read and write sequentially. Because of all of these structures, we occasionally see “hot spots” where we’re exercising one of the physical disks too much. Oh, and then there’s the effect of caching and how it changes disk access patterns. So ultimately I ask about IOPS and I ask about latency (average time per read and average time per write). However, the complexity is so great and the patterns so diverse that it’s sometimes hard to see the real problem amongst all the noise. (If you want more on computer disk performance, check out my previous blog post “Computer Hard Disk Performance – From the Ground Up.”)
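As a rough illustration of the two numbers I ask for, here’s a small Python sketch using psutil that samples the system-wide disk counters for a minute and works out IOPS and average latency per request. The exact meaning of the time counters varies a bit by platform, so treat the output as a first approximation rather than a SAN-grade measurement:

```python
import time
import psutil

# Sample the system-wide disk counters over a one-minute window.
before = psutil.disk_io_counters()
time.sleep(60)
after = psutil.disk_io_counters()

reads = after.read_count - before.read_count
writes = after.write_count - before.write_count
iops = (reads + writes) / 60.0

# read_time/write_time are cumulative milliseconds spent servicing requests.
avg_read_ms = (after.read_time - before.read_time) / reads if reads else 0.0
avg_write_ms = (after.write_time - before.write_time) / writes if writes else 0.0

print(f"IOPS: {iops:.0f}")
print(f"Average read latency: {avg_read_ms:.1f} ms")
print(f"Average write latency: {avg_write_ms:.1f} ms")
```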

With experience you learn to notice what is wrong with disk performance numbers and what the root cause is, but explaining this to a SAN engineer, a SAN company, or even a server manager often becomes a frustrating exercise. Gary Klein, in Sources of Power, talks about how experts make decisions, and the finding is that they decide based on complex mental models of how things work and upon expectancies. In the book Lost Knowledge, David DeLong addresses the difficulty of a highly experienced person teaching a less experienced person: the mental models are so well understood by the expert – and not by the novice – that communicating why something is an issue is difficult. Finally, in the book Efficiency in Learning, Ruth Clark, Frank Nguyen, and John Sweller discuss how complex schemas (their word for the mental models) can be trained – but the approaches they lay out are time consuming and complex. This is my subtle way of saying that if the person you’re speaking with gets it – great. If not, well, you are going to be frustrated – and so are they.

Constant Communications

The final performance area, communication, is the most challenging to understand and to test. Communication is the real struggle because it is always the bottleneck. Whether it’s the 300-baud modem I had when I first started with a computer or the ten-gigabit Ethernet switching networks that exist today – it’s never enough. It’s pretty common for me to walk into an environment where the network connectivity is simply assumed to be good enough. Firewalls slip in between servers and their database server. We put in 1Gb links between servers – and even when we bond them together two or four at a time, we route them through switches that only have a 1Gb backplane – meaning they can only move 1Gb worth of packets anyway. There are many points of failure in the connections between servers, and I think I’ve seen them all fail – including the network cable itself.

One of my “favorite” problems involved switch diversity. That’s where the servers are connected to two different switches to minimize the possibility of a single switch failing and disconnecting the server. We had bonded four one-gigabit connections, two on each switch, and we were seeing some very interesting (read: poor) performance. Upon further review it was discovered that the switches had only a single one-gigabit connection between them, so they couldn’t keep up with all the traffic crossing between the two switches. Oh, and just for fun, neither of the switches was connected to the core network with more than a one-gigabit connection either. We fixed the local connectivity problem to the switches by bonding. We solved the single-point-of-failure problem with switch diversity – but we didn’t fix the upstream issues that prevented the switches from performing appropriately.

With the complexity of communication networks, it’s hard to figure out exactly how to know there’s a performance problem. You can look at how much the computer transmits across the network, or you can monitor a switch, but there are so many pieces to measure that it may be hard to find the real bottleneck. I’m not saying that you shouldn’t take the effort to understand where the performance issues are; I’m just saying there are ways to quickly and easily find potential issues – and there are harder ways.
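For the quick-and-easy end of that spectrum, here’s a small Python sketch with psutil that watches the host’s own network counters for ten seconds. It won’t tell you what the switch sees, but it’s often enough to spot a saturated NIC or dropped packets:

```python
import time
import psutil

before = psutil.net_io_counters()
time.sleep(10)
after = psutil.net_io_counters()

# Convert byte deltas over the window into megabits per second.
sent_mbps = (after.bytes_sent - before.bytes_sent) * 8 / 10 / 1_000_000
recv_mbps = (after.bytes_recv - before.bytes_recv) * 8 / 10 / 1_000_000
drops = (after.dropin - before.dropin) + (after.dropout - before.dropout)
errors = (after.errin - before.errin) + (after.errout - before.errout)

print(f"Sent: {sent_mbps:.1f} Mbps, Received: {recv_mbps:.1f} Mbps")
print(f"Dropped packets: {drops}, Errors: {errors}")
```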

Communications are likely to always be the bottleneck. Even in my conversations today, one of our biggest concerns is WAN performance and latency. We spend a great deal of time on client-side caching to minimize the amount of data that must be pulled across the WAN. We buy WAN accelerators that do transparent compression and block caching. It’s all in service of addressing the slowest part of the performance equation – the client connectivity. It’s almost always a part of the problem. (If you want a practical example, take a look at my blog post “SharePoint Search across the Globe.”)

Perhaps the most challenging part of working with communications as a bottleneck is the complexity. While we can dedicate a server to a task and therefore isolate the variables we’re trying to test, most communications resources are shared and because of this present the problem of needing to isolate the traffic we’re interested in – and it means that we’ll sometimes see performance issues which have nothing to do with our application.

One of the great design benefits of the Internet is its resiliency. Being designed from the ground up to accommodate failures (due to nuclear attacks or what-have-you) buys you a lot of automatic recovery. However, automatic recovery takes time – and when it’s happening frequently, it may take too much time. One frequent problem is packet loss. There’s a normal amount of packet loss on any network due to sun spots, quarks, neutrinos, or whatever. However, when packet loss even approaches 1% it can create very noticeable and real performance problems. This is particularly true of WAN connections. They’re the most susceptible to loss, and because of the latencies involved they can have the greatest impact on performance. Things like packet loss rarely show up in test cases but show up in real life.
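To get a feel for why 1% loss hurts so much on a WAN, here’s a back-of-the-envelope sketch in Python using the well-known Mathis approximation for TCP throughput. The inputs (1460-byte segments, an 80 ms round trip) are illustrative assumptions, not measurements from any particular network:

```python
import math

def tcp_throughput_mbps(mss_bytes=1460, rtt_ms=80.0, loss_rate=0.01):
    """Rough upper bound for one TCP flow: throughput ~ MSS / (RTT * sqrt(loss))."""
    rtt_s = rtt_ms / 1000.0
    bytes_per_second = (mss_bytes / rtt_s) / math.sqrt(loss_rate)
    return bytes_per_second * 8 / 1_000_000

# Roughly 1.5 Mbps: 1% loss on an 80 ms WAN link strangles a single flow,
# no matter how big the pipe itself is.
print(f"{tcp_throughput_mbps():.1f} Mbps")
```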

Bottlenecks

Here’s the problem with performance (and scalability and load) testing. When you’re looking for a performance problem, you’re looking for a bottleneck. A bottleneck is a rate limiter. It’s the thing that says you can’t go any faster because of me. It gets its name because a bottle’s neck limits the amount of fluid you can pour into or out of the bottle.

In finding bottlenecks, you’re looking for something that you won’t find until it becomes a problem. You have to make it a problem. You have to drive the system to the point where the rate limiting kicks in and figure out what caused it – that’s what performance testing is about. One would appropriately assume that we use performance testing to find bottlenecks. Well, no, at least not exactly. We use performance testing to find a single bottleneck. Remember, a bottleneck is a rate limiter, so the first bottleneck limits the system. Once you hit that rate you can’t test any further. You have to remove the bottleneck and then test again. Thus we see the second issue with performance testing.

The first issue, which I discussed casually, was complexity. The second issue is that we can only find performance problems one at a time – we only get to identify one bottleneck at a time. That means we have to make sure we’re picking the right approach to finding bottlenecks, so we find the right one. As it turns out, that’s harder than it first appears.

What Problem Are We Solving Again?

We’ve seen two issues with performance testing, but I’ve not exposed the real problem yet. The real problem is in knowing how the users will use the system. I know. We’ve got use cases and test cases and requirements and all sorts of conversations about what the users will do – but what will they REALLY do? Even for systems where there are established patterns of use, the testing usually involves a change. We do testing because we want to know how the change in the code or the environment will impact the performance of the system. Those changes are likely to change behavior in unpredictable ways. (If you don’t agree, read Diffusion of Innovations.) You won’t be able to know how a change in approach will impact the way that users actually use the system. Consider that Amazon.com knows that even small changes in the performance of its site can have a dramatic impact on the number of shopping carts that are abandoned – that doesn’t really make any sense given the kinds of transactions we’re talking about – but it’s reflected in the data.

I grew up in telecommunications in my early career. We were talking about SS7 – Signaling System 7, the protocol with which telephone switches communicate with one another to set up a call from one place to another. I became aware of this protocol when I was writing a program to read the switches – I was reading the log of the messages. The really crazy thing is that SS7 allowed for a small amount of data traffic across the voice network for switching. In 1984 Friedhelm Hillebrand and Bernard Ghillebaert realized that this small side-band signaling area could be leveraged to allow users to send short messages to one another. This led to the development of the Short Message Service (SMS) that many mobile phone users use today. (It got a name change to texting somewhere along the way.) From my perspective, there’s almost no possible way that the original designers could have seen that SS7 would be used by consumers to send “short” messages.

Perhaps you would like a less esoteric example. There’s something you may have heard of. It was originally designed to connect defense contractors and universities. It was designed to survive disaster. You may have guessed that the technology I am talking about is the Internet. Who could have known back then that we’d be using the Internet to order pizza, schedule haircuts, and generally solve any information or communication need?

My point with this is that if you can’t predict the use, then you can’t create a synthetic load – the artificial load that performance testing uses to identify the bottleneck. You can guess at what people will do, but it’s just that – a guess; it’s not even an educated guess. If you guess wrong about how the system is used, you’ll reach the wrong conclusion about what the bottleneck will be. In past articles I have talked about reflection and exception handling. Those are certainly two things that perform more slowly – but what if you believed that the code that did reflection was the most frequently used code, instead of the least frequently used? You’re likely to see CPU as the key bottleneck and ignore the disk performance issues that become apparent with heavy reporting access.

So the key is to make sure that the load you use to simulate users is close to what they’ll actually do. If you don’t get the simulation right you may – or may not – see the real bottlenecks that prevent the system from meeting the real needs of the users. (This is one of the reasons why the DevOps movement is entertaining. You can do a partial deployment to production and observe how things perform for real.)
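Here’s a toy sketch in Python of what I mean by a workload mix. The request types and weights are made-up assumptions about what users do – and that’s exactly the point: if these weights are wrong, the test surfaces the wrong bottleneck:

```python
import random

# Hypothetical guess at what users actually do. If these weights are wrong,
# the load test will exercise the wrong code paths.
WORKLOAD_MIX = {
    "view_page": 0.70,
    "search": 0.20,
    "run_report": 0.09,
    "reflection_heavy_admin": 0.01,  # the rarely used, slow code path
}

def next_request():
    """Pick the next synthetic request according to the assumed mix."""
    actions = list(WORKLOAD_MIX)
    weights = list(WORKLOAD_MIX.values())
    return random.choices(actions, weights=weights, k=1)[0]

# Generate a synthetic session of 10,000 requests and show the realized mix.
sample = [next_request() for _ in range(10_000)]
print({action: sample.count(action) for action in WORKLOAD_MIX})
```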

Baselining

Before I leave the topic of performance and performance testing, I have to include one more key part of performance management: baselining the system. That is, recording performance data when things are working as expected so that it’s possible to see which metric (or metrics) are out of bounds when a problem occurs. It is one thing to know that performance has dropped from 1 second per page to 3 seconds per page – and quite another to identify which performance counters are off. What often happens is that the system is running and then, after a while, there’s a problem. Rarely does anyone know how fast the system was performing when it was good – much less what the key performance counters were.

Establishing a baseline for performance is invaluable if you do find a performance problem in the future – and in most cases you will run into a performance problem in the future. Interestingly, however, organizations rarely capture a baseline, because a baseline has to be captured some time after launch, and remembering to do something a month after a new solution goes live is hard – you’re on to the next urgent project, or you simply forget. Further, without periodically re-baselining the system, you’ll never be able to see the dramatic change that happens right before the performance problem.

With baselining you’re creating a snapshot of “normal” – that is, what normal is for the system at this moment. We leverage baselines to compare the abnormal condition against that recorded picture of the normal condition. We can then look for values that are radically different from the normal we recorded – and work backwards to find the root cause of the discrepancy, the root cause of the performance issue. Consider performance dropping from 1 second per page to 3 seconds per page. Knowing a normal CPU range and an abnormal one can lead to an awareness that there’s much more CPU time in use. This might lead us to suspect that new code changes are causing more processing, or perhaps that there’s much more data in the system. Similarly, if we see our average hard disk request latency jump from 11ms to 23ms, we can look at the disk storage system to see whether it is coping with a greater number of requests – or whether perhaps a drive has failed in a RAID array without being noticed, so the system is coping with fewer drive arms.
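A baseline doesn’t have to be elaborate. Here’s a minimal sketch in Python (psutil again, with made-up file names and sample counts) that records a handful of counters to a CSV you can pull out later when someone says “it’s slow.” A real baseline would also include application-level numbers such as seconds per page:

```python
import csv
import datetime
import statistics

import psutil

FIELDS = ["timestamp", "cpu_pct", "mem_pct", "disk_read_ms", "disk_write_ms"]

def snapshot(window_seconds=30):
    """Capture one sample of system counters over a short window."""
    before = psutil.disk_io_counters()
    cpu = psutil.cpu_percent(interval=window_seconds)
    after = psutil.disk_io_counters()
    reads = max(after.read_count - before.read_count, 1)
    writes = max(after.write_count - before.write_count, 1)
    return {
        "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
        "cpu_pct": cpu,
        "mem_pct": psutil.virtual_memory().percent,
        "disk_read_ms": (after.read_time - before.read_time) / reads,
        "disk_write_ms": (after.write_time - before.write_time) / writes,
    }

def record_baseline(path="baseline.csv", samples=20):
    """Write a series of 'normal' samples to compare against later."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for _ in range(samples):
            writer.writerow(snapshot())

def compare_to_baseline(path="baseline.csv"):
    """Print how a current sample stacks up against the recorded normal."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    now = snapshot()
    for field in FIELDS[1:]:
        normal = statistics.mean(float(r[field]) for r in rows)
        print(f"{field}: baseline {normal:.1f}, now {now[field]:.1f}")
```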

Vertical vs. Horizontal Scaling

There was a day when we were all chasing vertical scaling to deal with our scalability problems. That is, we were looking for faster disks, more memory, and faster CPUs; we wanted to take the one box we had and make it faster – fast enough to support the load we wanted to put on it. Somewhere along the way we became aware that no box – no matter how large – could survive the onslaught of a million web users. We decided – rightfully so – that we couldn’t build one monolithic box to handle everything; instead we had to start building a small army of boxes willing to support the single cause of keeping a web site operational. However, as we’ve already seen, new problems emerge as you try to scale horizontally.

First, it’s possible to scale horizontally inside of a computer – we do this all the time: multiple CPUs with multiple cores and multiple disk arms all working toward a goal. We already know how this works – and what the problems are. At some point we hit a wall with processor speeds; we could only get the electrons to cycle so quickly. We had to start packing more and more cores into more and more CPUs, which led us to the problem of latency and throughput.

The idea is that you can do anything instantly if you have enough throughput. However, that doesn’t take into account latency – the minimum amount of time a single operation takes. Consider the simple fact that nine women can have nine babies in nine months; however, no woman can have a baby in a month. The latency of the transaction – having a baby – is nine months. No amount of additional women (or prompting by parents) will shorten the time a single baby takes. Overall throughput can be increased by adding more women, but the inherent latency is just that – inherent. When we scale horizontally we confront this problem.

A 2GHz CPU may complete a thread of execution twice as fast (more or less) as a 1GHz CPU – however, a two-core 2GHz CPU won’t necessarily complete the same task any faster – unless the task can be split into multiple threads of execution. In that case we may see some – or dramatic – improvement in performance. Our greatest enemy when considering horizontal scaling (inside a single computer or across multiple computers) is the problem of a single large task. The good news is that most of the challenges we face in computing today are not one single unbreakable task. Most tasks are inherently many smaller tasks – or can be restructured that way with a bit of work. Certainly when we consider the role of a web server, we realize that it’s rarely one user consuming all the resources; rather, it’s the aggregation of all of the users that causes the concern.
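There’s a classic formalization of that “single large task” enemy – Amdahl’s law – and a few lines of Python show how quickly the part of a job that can’t be split caps what horizontal scaling can buy you. The 95% figure below is just an assumed example, not a measurement:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work can be split (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# If 95% of the work can be spread out, 8 workers give about 5.9x (not 8x),
# and no number of workers ever gets past 20x.
for workers in (2, 8, 64, 1024):
    print(workers, round(amdahl_speedup(0.95, workers), 1))
```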

The good news is that horizontal scaling is very effective for problems that are “death by a thousand cuts” – that is, problems which can be broken down, distributed, and processed in discrete chunks. In fact, this breaking down is the process most central to horizontal scaling.

Horizontal Scaling as Breaking Things – Down

My favorite example of horizontal scaling is massively distributed systems. A project called SETI@Home leverages “spare” computing cycles from thousands of computers to process radio telescope data, looking for intelligent life beyond our planet. It used to be that the SETI project (Search for Extra-Terrestrial Intelligence) used expensive supercomputers to process the data from the radio telescopes. The process was converted so that “normal” computers could process chunks of the data and return the results to centralized servers. There are appropriate safeguards built into the system to prevent folks from falsifying the data.

Putting on my developer hat for a moment, there are some parts of a problem that simply can’t be broken down, and far more where it simply isn’t cost-effective to even try. Take, for instance, a recent project where I needed to build a list of the things to process. Building the list took roughly 30 seconds in production. The rest of the process, performing the migration, took roughly 40 hours if processed in a single thread. I didn’t have that much time for the migration. So we created a pool of ten threads that each pulled one work item at a time out of the work list and worked on it (with the appropriate locking). The main thread of the application simply waited for the ten threads to complete. It wasn’t done in 4 hours – but it was done in 8 hours, well within the required window, because I was able to break the task into small chunks – one item at a time.
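For the curious, the shape of that approach looks roughly like the following Python sketch. The migrate_item function is a hypothetical stand-in for the real per-item work, and a thread-safe queue takes care of the locking I mentioned:

```python
import queue
import threading

def migrate_item(item):
    # Placeholder for the real per-item migration work (hypothetical).
    pass

def run_migration(work_items, worker_count=10):
    # Load every item into a thread-safe queue; the queue handles the locking.
    work = queue.Queue()
    for item in work_items:
        work.put(item)

    def worker():
        while True:
            try:
                item = work.get_nowait()  # pull one work item at a time
            except queue.Empty:
                return  # nothing left - this worker is done
            try:
                migrate_item(item)
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the main thread simply waits for the pool to finish
```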

Admittedly, without some development help, breaking the workload down isn’t going to be easy – but anytime I’m faced with what folks tell me is a vertical scaling problem – where we need one bigger box – I immediately think about how to scale it horizontally.

Call to Action

If you’re facing scalability or performance challenges… challenge your thinking about how the problem is put together and what you can do to break the problem down.
