
    SQL University: Parallelism Week – Introduction

    Welcome to Parallelism Week at SQL University. My name is Adam Machanic, and I’m your professor.

    Imagine having 8 brains, or 16, or 32. Imagine being able to break up complex thoughts and distribute them across your many brains, so that you could solve problems faster. Now quit imagining that, because you’re human and you’re stuck with only one brain, and you only get access to the entire thing if you’re lucky enough to have avoided abusing too many recreational drugs.

    For your database server, on the other hand, that multi-brain dream is reality. It’s common for newer servers to have 16 or more logical processors, and numbers above 64–once the hallmark of extremely high-end hardware–are today no reason to raise an eyebrow. SQL Server is designed to take advantage of servers with a high number of processors, utilizing two or more at the same time to help it solve big problems faster. This is called parallelism, and understanding how and why it works is one of the keys to making your database applications scale.

    This first post for Parallelism Week details the foundation material you need to understand before we dive in and investigate some of the query processing components in more detail. The next post will discuss how parallel query processing works and the various components you’ll see when you read parallel query plans, and the final post of the week will get into some of the means by which you can control parallelism at both the server and query level.

    Figure 1: A four-brained computer, frying an egg.

    Background: Threads, Threading, and Other Sewing Analogies

    Each CPU in your computer can do only one thing at a time, albeit very quickly. But if it were forced to finish an entire task before moving on to the next, certain tasks would seem choppy and sluggish even on extremely fast processors. To get around this issue, modern operating systems are designed to do lots of things at once (concurrently). They accomplish this by taking advantage of the fact that the CPU can switch back and forth between tasks almost as quickly as it can complete them. Each program on your computer runs under a process space–a virtual construct to which resources such as memory and CPU time can be granted–and each process can use one or more logical threads to do work on the CPU. By spreading its work across multiple logical threads, a program is able to appear to do more than one thing at once. In reality, the way this works is via time slicing.

    The Windows operating system includes a component called a scheduler, whose job is to control which programs get CPU time and when. Operating system designers have come up with various methodologies for controlling CPU scheduling, and Windows uses a system called preemption. The idea behind preemption (or preemptive scheduling) is that by default each active thread (i.e., a thread that actually has some work to do) gets an equal slice of time on the CPU, with the CPU switching back and forth between the various tasks vying for its attention. Each of these CPU time slices is called a quantum. In reality, some threads are more important than others, and these are said to have a higher priority. In a preemptive environment, higher-priority threads are able to force lower-priority threads to yield, so that the higher-priority threads can get more of the CPU time and finish their work more quickly.

    Imagine yourself sitting at your desk, and imagine 10 of your co-workers standing around your desk, each asking you to help them with a different project. Now imagine that each minute you abruptly stop whatever conversation you were having and turn to the next co-worker, and then the next, over and over, until all of the projects are complete. This would quickly become confusing and overwhelming for us, but it’s effectively what the CPU does as it works on multiple concurrent threads using time slicing. Each change from one thread to another is called a context switch, and although the CPU can do so much more effectively than the human brain, it’s not free. If a CPU is overloaded with logical threads and is doing too many context switches, it is said to be thrashing, and can spend more time switching back and forth than getting actual work done.

    SQL Server runs on top of the Windows operating system, and so it must work within the confines of the operating system’s preemption model. However, internally SQL Server uses its own scheduler, which is one of the components of the SQL Server Operating System (SQLOS for short). Unlike Windows, the SQLOS scheduler uses a cooperative (also called non-preemptive) scheduling model. In this model each thread (abstracted within SQL Server by an entity called a worker) can spend as much time as it needs on the CPU until its task is finished. The idea is that preemption should not be necessary because threads will eventually have to stop and acquire non-CPU resources (e.g. data from the disk system) and will enter wait states, thereby liberating the CPU for other tasks. SQL Server’s cooperative model is somewhat limited in its flexibility since again, as a Windows application, it must also abide by the operating system’s preemptive model. This means that SQL Server’s threads will still be subject to time slicing if there are competing processes running on the server. But in optimally-configured environments where that’s not the case, SQL Server’s cooperative scheduler will effectively be in control of the full lifetime of its threads, keeping them active until they are either finished or need to enter a wait state.
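
    If you want to see this machinery on your own instance, here is a minimal sketch against the sys.dm_os_schedulers DMV (assuming SQL Server 2005 or later and VIEW SERVER STATE permission) that lists each scheduler along with its worker and task counts:

        SELECT
            scheduler_id,
            cpu_id,
            status,
            current_workers_count,   -- workers currently associated with this scheduler
            active_workers_count,    -- workers that are running or waiting to run
            runnable_tasks_count,    -- tasks waiting for their turn on the CPU
            context_switches_count   -- context switches performed by this scheduler
        FROM sys.dm_os_schedulers
        WHERE status = 'VISIBLE ONLINE';  -- only the schedulers that handle user requests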

    Note that for brevity I’ve simplified descriptions of the inner workings of SQLOS and scheduling in general. I have also completely bypassed much of the background on wait states. For more information on these topics, especially on the various wait states and how they impact performance, I highly recommend reading Tom Davidson’s white paper, “SQL Server 2005 Waits and Queues.”
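
    As a small taste of that methodology, the following sketch lists the top waits accumulated since the instance last started, straight from the sys.dm_os_wait_stats DMV:

        SELECT TOP (10)
            wait_type,
            waiting_tasks_count,
            wait_time_ms,
            signal_wait_time_ms   -- time spent waiting for CPU after the awaited resource became available
        FROM sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;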

    Multitasking and Parallel Processing

    Much of the detail around the execution models described in the previous section involves interaction between multiple programs, which the operating system makes us believe are running on the same CPU at the same time. This is known as multitasking, and it enables me to write this post in my Web browser while two instances of SQL Server, PowerPoint, Word, and a number of other background processes all do work on my laptop. The quanta are so short that I don’t even notice, as I type this sentence, that the Web browser is being switched on and off of the CPU to which it’s assigned.

    SQL Server internally participates in a form of multitasking. Each worker within the process space is bound to a scheduler, which handles the cooperative scheduling for a single logical processor. Each scheduler can have many associated workers, but just like in the operating system only one worker at a time can actively work with the CPU. Therefore, it’s possible on a dual-core server to have, e.g., 10 concurrent queries running. What’s really happening is that the various queries–by virtue of the workers actually doing the processing–are switching in and out of wait states, so only up to two queries are actually consuming CPU time at any given moment. This assumes that each query uses only a single worker–which we’ll see is not a safe assumption.
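
    You can watch this juggling act as it happens: sys.dm_exec_requests reports each request’s status, and at any given moment only a handful will show as “running” while the rest sit “runnable” (waiting for a scheduler) or “suspended” (in a wait state). A rough sketch follows; the session_id filter is just a common heuristic for excluding system sessions:

        SELECT
            status,                -- running, runnable, suspended, etc.
            COUNT(*) AS request_count
        FROM sys.dm_exec_requests
        WHERE session_id > 50      -- rough filter for user sessions
        GROUP BY status;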

    Above and beyond simple multitasking, SQL Server can break certain queries into component parts, each of which can be processed on a separate logical CPU. This is known as parallel processing, or, more simply, parallelism. When the query optimizer determines that a query–or, more accurately, one or more components of a query–can participate in parallelism (or can “be parallelized” or “go parallel“), two or more workers are used to do the processing necessary to satisfy the query. The various workers for all of the queries running concurrently within SQL Server are each subject to the cooperative scheduling model, and go on and off the schedulers and in and out of wait states as necessary. This means that large systems can often support hundreds or thousands of workers serving tens or hundreds of concurrent queries, even with comparatively few logical processors actually doing the work.
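
    One way to observe this on a busy server is to count the tasks behind each active request; requests whose plans have gone parallel show more than one. Here is a sketch against the sys.dm_os_tasks DMV (again, VIEW SERVER STATE permission is required):

        SELECT
            session_id,
            request_id,
            COUNT(*) AS task_count,                          -- more than 1 means the request has gone parallel
            COUNT(DISTINCT scheduler_id) AS scheduler_count  -- how many logical CPUs are involved
        FROM sys.dm_os_tasks
        GROUP BY session_id, request_id
        HAVING COUNT(*) > 1
        ORDER BY task_count DESC;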

    The Pros and Cons of Parallel Processing

    Parallelism, as it turns out, is quite the double-edged sword.

    On one hand, taking all of the work that a query has to do and doing it in parallel on several CPUs rather than on a single one can mean extreme performance gains. Certain queries can scale almost linearly as more CPUs are introduced. This means that for each additional CPU, run-time decreases proportionally, so a query running on 10 CPUs will run in approximately 1/10th as much time as the same query running on one CPU.

    Certain operations, such as sorts, are especially well-suited to parallel processing due to the way their algorithms work. The best sort algorithms operate on a predictable scale of N * log(N) operations per set, where N is the number of elements in the set. We can show mathematically that parallel sorts can in fact use less overall CPU time than the same sort on a single processor. If C is the number of CPUs across which a sort is being processed, then C * ((N/C) * log(N/C)) simplifies to N * log(N) - N * log(C), which is less than N * log(N) for any value of C greater than 1. This means that as long as the cost of merging the sorted sub-ranges from each worker thread is not too high, the overall sort operation will require fewer clock ticks done in parallel than done in serial. More on this in a future post.
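
    To put some illustrative numbers on that formula (back-of-the-envelope arithmetic only, not the optimizer’s actual cost model), here is the serial versus eight-way estimate for a million-element sort:

        DECLARE @N float = 1000000, @C float = 8;
        SELECT
            serial_cost   = @N * LOG(@N),                      -- N * log(N)
            parallel_cost = @C * ((@N / @C) * LOG(@N / @C));   -- C * ((N/C) * log(N/C))

    The parallel figure comes out roughly 15 percent lower, before accounting for the cost of the final merge.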

    Taking the opposite stance on the performance question, consider that splitting up work and binding a single query to multiple workers does require some overhead. There is memory overhead in maintaining state for each worker, CPU overhead due to the greater number of context switches that must occur, and wait state overhead at the points at which worker threads must join, or synchronize with, one another. In addition to these basic costs of processing queries in parallel, SQL Server has a limited number of pooled (reusable) threads to work with, and if too many queries go parallel at the same time the thread pool can be exhausted, forcing new queries to wait until a thread becomes available.
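
    You can check both the size of the thread pool and whether any scheduler currently has tasks queued waiting for a worker; non-zero work queues (typically accompanied by THREADPOOL waits) are the classic symptom of exhaustion. A quick sketch:

        SELECT max_workers_count             -- size of the worker thread pool
        FROM sys.dm_os_sys_info;

        SELECT scheduler_id, work_queue_count
        FROM sys.dm_os_schedulers
        WHERE status = 'VISIBLE ONLINE'
          AND work_queue_count > 0;          -- tasks waiting for a worker to become available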

    To combat these situations, SQL Server’s query optimizer only creates parallel plans when parts of the query exceed a given cost threshold, meaning that the optimizer believes that there should be sufficient return to warrant the parallelism investment. This system eliminates unnecessary parallelism in many–but certainly not all–cases, but also inhibits parallelism in some situations where it may be beneficial. A DBA or developer concerned with creating a well-tuned system must understand parallelism well enough to monitor for its proper use, and must know how to modify settings appropriately to guide the decisions made by the optimizer. In this way, a balance can be achieved such that parallelism is used to make the biggest queries run faster, and not used for small queries for which there would be no gain.
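
    As a small preview of Friday’s post, these are the two server-wide knobs involved, along with the query-level hint that overrides them. They’re shown here for orientation only; please don’t change the defaults without testing:

        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'cost threshold for parallelism';   -- minimum estimated plan cost before parallelism is considered
        EXEC sp_configure 'max degree of parallelism';        -- maximum number of CPUs a single query may use

        -- Query-level override: cap this query at a single worker
        SELECT COUNT(*)
        FROM sys.objects
        OPTION (MAXDOP 1);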

    Looking Forward

    This concludes the first of three parts of parallelism week here at SQL University. Part two will cover parallel query plans, and part three will get into the various ways that we, as end-users, can control parallelism in the query engine. If you have any questions on what’s covered in the post, please leave a comment below and I’ll answer as soon as possible.

    Enjoy!

    13 COMMENTS

    1. This community could benefit from some thoughtful prose on parallelism.  Great intro.  Look forward to the rest of the series.  The graphic is a nice touch, BTW.
      Jimmy

    2. Great article, especially the Background section. One of the clearest explanations I have read about "how it works". Thanks!
      Erin

    3. Hey Adam,
      Great write-up! I have a couple of questions about applying this knowledge to my environment, and I wish I could find more information on this in the form of best practices.
      How should DBAs manage their parallelism settings in an environment where they have multiple instances of SQL Server running on the same physical server? As far as I know, each SQL Server instance is blind to the schedulers of the other instances, so my assumption is that turning on parallelism would cause a big increase in context switches and CX-related waits.
      Is there any good documentation on how to set the cost threshold for parallelism? It’s like a Microsoft mystery knob to me. All I’ve read about it is that, when the optimizer is looking to generate a plan, turning that knob up makes the optimizer prefer non-parallel plans over parallel ones. I haven’t come across documentation telling me whether I should turn it up by 1s, by 10s, or by 100s, or anywhere that says to watch a particular statistic and turn it up until that statistic drops below a threshold. All we know about the cost threshold for parallelism option is that it can be set to any value from 0 through 32767, that the default value is 5, and, cryptically, that the cost refers to an estimated elapsed time in seconds required to run the serial plan on a specific hardware configuration. That last part sounds like the estimated cost shown in an execution plan, but the documentation doesn’t explicitly state that. How would someone set the CTOP to an optimal value?
      Thanks for your contributions to the SQL Community Adam!  Keep up the great work!
      -Mike

    4. Hi Mike,
      Great questions! But answering them would require a big chunk of a blog post and… There is one on those very topics (and related issues) scheduled for Friday. So hold those thoughts, and let me know if Friday’s post doesn’t get you where you need to be.

    5. Seems like you should have posted all 3 parts at the same time?
      When we had Conor Cunningham talking to us in our MCM class, he stated that the optimizer always creates a parallel plan and a non-parallel plan. But then the subject of cost threshold never came up, so perhaps he was giving us a simplified description.

    6. Hi Robert,
      SQL University topics are supposed to span a week. Had I posted all three at once, it would have only spanned a single day.
      What I understood from Conor is not that the optimizer always creates both plans but rather that it -considers- both plans. Which makes a lot of sense, and from what I can see in my server’s plan cache it’s correct. But I’ll ping Conor for the decisive answer.

    7. Really good article, succinct and informative, thanks.
      I have found that on my OLTP systems, the best way is just to set maxdop to 1. Even with a high threshold it often – 2008 more than 2000 – seems to choose parallelism when the equivalent non-parallel query is both dramatically faster and 10s or 100s of times less intensive on the CPU.
      It might not be the most efficient at all times, but it sure is safest on a high-throughput OLTP system!
      Rich

    8. Rich: I tend to agree with you that for most OLTP workloads a MAXDOP of 1 is appropriate. But we’ll get into that just a bit more in Friday’s post 🙂

    9. Great article, made a complex subject make sense to someone with little experience in the topic.
      A quick note on the parallelism of the brain, though… The human brain has tons of parallelism. When you look at an image, different parts of the brain identify movement, color, and contrast, then combine them into one image. Different parts of the brain run different ‘processes’ independently all the time. Your analogy makes sense in that we don’t consciously assign different parts of a math problem to different parts of our brain, but while we’re doing a math problem, another part of our brain is keeping our heart pumping, another part is managing our digestive system, and another part of our conscious brain can be writing the problem down, all without interfering significantly with the conscious math problem.
      The parallelism of the brain, and its ability to run hundreds or even thousands of ‘processes’ simultaneously, is part of what allows us to excel over a computer in certain tasks, such as image recognition and instantly picking out the important part of an image, while a computer may struggle to analyze the meaning and individual parts of an image.
      Again, I know you were just using an analogy and it makes sense in how we don’t consciously divide tasks, but I just thought this would be interesting information for a few of your readers.

    10. AdamP: A very good point indeed. I guess we can extend the analogy by aligning with SQL Server system vs user processes. We could say that keeping the heart beating, etc, is much like a system process, and that our brains can do lots of these in parallel. But something like a math problem is more of a user process, and although our brains can perhaps turn on a form of intra-query parallelism to help solve the math problem, we can’t really multi-task and do many math problems concurrently. At least, most of us can’t. I’ve met a few people along the way who break the mold. Proving once again that nature is a lot more powerful than technology in many ways.

    11. Great article, concise descriptive and a good entry point even for those that haven’t had an exposure to parallelism.

    12. Great article. Indeed, we are only beginning to understand the parallel processing capabilities of the human brain. However, at this time, it is generally acknowledged that uniprocessing is a limitation peculiar to the male brain.
