Sunday, August 22, 2010

Tech-Ed-2009-Peschka-Capacity Planning with SharePoint 2007

One major consideration for a successful SharePoint deployment is knowing how much to request in the way of resources. In some cases this can be by guess and by golly. However there are some effective principles to use in planning and testing capacity. In this rich, deep presentation, Steve Peschke covers them. This is an exceptionally valuable session for anyone looking to make their SharePoint deployment a success. It’s the best kind of presentation: meat and potatoes all the way through.

How to determine the necessary capacity for planned expansion/new deployments?

What’s the goal of your testing?

Most important for most of us is RPS (Requests per Second). We might also be interested in a specific operation, in terms of the way it affects the farm or the way the farm affects it. Perhaps even just a single page, in terms of its TTLB (time to last byte). Usually capacity planning testing means verifying that an existing approach/plan is viable, or it means proving a concept.

Once we know what we want to do, we know what we need to do: what we want to measure. Most tests are based on RPS, since so many things are based on it. However TTLB is also crucial, and many tests might include both. Peschke gives the example of a farm that needs to satisfy 100 RPS, with pages loading within 5 seconds.

Also worth measuring: crawl time (how long a crawl takes and how much material there is to index). Document indexing rate is more complicated.

Determining Throughput Requirements, or RPS

Can be complicated. It must reflect not necessarily what can theoretically be done, but what your farm’s customers need. Here he makes one of his most important points: the raw number of users, by itself, means nothing. So, naturally enough, ascertaining actual throughput requirements is imperative, and Peschke offers several ways:

- Historical data, from IIS logs and Log Parser, Web Trends, etc.

- Start with the number of users, divide them into profiles, multiply the number of users in each profile by the number of operations for that profile, and base your RPS on the peak concurrency of the result.
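The first bullet, mining IIS logs for historical throughput, can be sketched in a few lines of Python. This is only an illustrative stand-in for Log Parser, assuming default W3C Extended log fields where each entry begins with a date and a time:

```python
from collections import Counter

def peak_rps(log_lines):
    """Estimate peak requests-per-second from W3C-style IIS log lines.

    Assumes each non-comment line starts with 'date time ...' (the
    default W3C Extended field order). An illustrative sketch, not a
    replacement for Log Parser.
    """
    hits = Counter()
    for line in log_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip W3C directive/comment lines
        date, time = line.split()[:2]
        hits[(date, time)] += 1  # bucket requests by one-second timestamp
    return max(hits.values()) if hits else 0

sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2009-05-11 09:00:01 GET /default.aspx 200",
    "2009-05-11 09:00:01 GET /style.css 200",
    "2009-05-11 09:00:02 GET /default.aspx 200",
]
print(peak_rps(sample))  # -> 2 (two requests land in the 09:00:01 second)
```

In practice you would scan a full day of logs and look at the busiest seconds or minutes, not a four-line sample.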

Peschke gives an example. It’s such a great example of his approach and the merits of this presentation that I reproduce it below.

1. Contoso has 80,000 users; during any 8 hours, up to 40,000 may be at work. So we have 80,000 users, 40,000 active, and concurrency of 5% to 10%. Concurrency means active at the same time, and can be estimated.

2. Of these, 10% are light, 70% are typical, 15% are heavy, and 5% are extreme. This is a best guess.

3. Let’s say light users do 20 RPH (requests per hour). Collectively, this means 80,000 RPH.

4. Let’s say typical users do 36 RPH (requests per hour). Collectively, this means 1,008,000 RPH.

5. Let’s say heavy users do 60 RPH (requests per hour). Collectively, this means 360,000 RPH.

6. Let’s say extreme users do 120 RPH (requests per hour). Collectively, this means 240,000 RPH.

7. This means 1,688,000 RPH, or 469 RPS for these 40,000 users.

8. When we factor in peak concurrency (10%), this comes to 46.9 RPS. That’s the target we need for this farm.
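The eight steps above reduce to simple arithmetic. A small helper (the function is my own framing, not from the session) makes the Contoso calculation reusable for other user mixes:

```python
def required_rps(total_active, profile_mix, peak_concurrency):
    """Reproduce the session's throughput estimate.

    profile_mix maps a profile name to (share_of_users, requests_per_hour).
    Shares and concurrency are fractions (0.10 == 10%).
    """
    total_rph = sum(share * total_active * rph
                    for share, rph in profile_mix.values())
    # Convert requests/hour to requests/second, then apply concurrency.
    return total_rph / 3600 * peak_concurrency

mix = {
    "light":   (0.10, 20),
    "typical": (0.70, 36),
    "heavy":   (0.15, 60),
    "extreme": (0.05, 120),
}
print(round(required_rps(40_000, mix, 0.10), 1))  # -> 46.9
```

With concurrency set to 1.0 the helper returns the intermediate 469 RPS figure from step 7, which is a useful sanity check on the mix percentages.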

Excellent, logical analysis. Now that we have some idea of throughput, we need to consider what test mixes to prepare. What activities should these test mixes reflect? Historical information, if available, helps with this; otherwise Peschke refers us to test mixes in “Planning for Software Boundary” documents on TechNet. Yet inevitably some educated guesses will be necessary.

Once we know what our throughput is, and we know what sort of transactional mix is needed, we can design a test environment. Peschke notes that few people invest the necessary time in this; he recommends two months, including three weeks in a test lab.

The test environment needs to reflect crucial infrastructure factors, including AD: how many forests and domains, and how many user accounts and groups, will be needed? Peschke mentioned he aims for one DC for every 3-4 web front ends. How will load balancing be implemented for this?

Hardware must also be considered. A Visual Studio Team Test controller will be needed, along with several VSTT agents, as well as a separate SQL server (so it does not impact the MOSS SQL server, which is, after all, one of the main choke points). He also recommended turning off anti-virus software on the load test controller and its agents.

Certain configuration changes should also be made, such as stopping the timer and admin services, as well as profile imports and crawls. All pages should be published, and none should be checked out. A wide variety of pages should be included. He also points out that each Write scenario will change the content database, and this database should be restored from backup after a test run which includes writes; this assures a consistent, uniform baseline. Again, he recommends stopping anti-virus software unless you want to measure performance when using the SharePoint-integrated anti-virus capability.

Other, account-related tasks to consider include how many users and in what roles, how these will be populated, what audiences/profile imports/search content sources will be needed (if any), and whether crawl content or profiles will need to be imported for testing (he gave the example of one crawl and profile import which took three days to complete).

More on Test Design

Sample data are a major stumbling block for many implementations. The sample content should be varied, not merely numerous iterations of a single document; otherwise search query test results will be meaningless. Using a backup of an existing farm is the best option for sample data. Peschke recommended the tools at www.codeplex.com/sptdatapop for populating test environment data. Even so, he noted that in his experience you will almost always have to write some tools of your own.
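As a rough illustration of what "varied" sample content might mean, here is a sketch that gives every generated document its own word mix plus a unique marker term that search queries can target. The function and the marker-term scheme are my own, not part of the sptdatapop tools:

```python
import random

def make_sample_docs(n, vocabulary, words_per_doc=50, seed=0):
    """Generate n distinct text documents for test data population.

    Identical copies of one file make search-query testing meaningless,
    so each document gets a random word mix plus a unique, searchable
    marker term (e.g. 'uniqueterm00042').
    """
    rng = random.Random(seed)  # deterministic, so test runs are repeatable
    docs = {}
    for i in range(n):
        words = [rng.choice(vocabulary) for _ in range(words_per_doc)]
        words.append(f"uniqueterm{i:05d}")  # per-document search target
        docs[f"doc{i:05d}.txt"] = " ".join(words)
    return docs

docs = make_sample_docs(100, ["contoso", "budget", "proposal", "review", "draft"])
```

Querying for a marker term like `uniqueterm00042` should then return exactly one hit, which gives you a cheap correctness check on the search index as well.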

Web Test Best Practices. A mixed bag of good insights. The system behaves differently for different roles, so a variety of roles should figure in testing; do not simply use the farm admin account for everything. Test with typical client apps such as RSS readers, Outlook, etc. In addition, don’t neglect time-based operations, such as Outlook syncs. Validation rules should also be used. Test the web test itself: does it work for all your URLs, and does it work for all users? He also recommended setting Parse Dependent Requests to false.

Load Test Best Practices. Make sure your planned test reflects a good mix of tasks. Restore from a backup before each test run, remembering to defrag the indices. Use iisreset. Remember a warm-up period, since the first test after a restore will invoke many things which are not characteristic of regular operations. He briefly discussed “think times”, mainly to say that such user behavior is almost impossible to accurately model.
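The warm-up advice can be made concrete: compute RPS and average TTLB only from the steady-state window, discarding everything before the cutoff. A minimal sketch, assuming per-request timing samples (the function name and numbers are hypothetical):

```python
def steady_state_stats(samples, warmup_seconds, test_seconds):
    """Compute (RPS, average TTLB), excluding the warm-up window.

    samples: (elapsed_seconds, ttlb_seconds) per completed request.
    The first requests after a restore/iisreset hit cold caches and
    page compilation, so they would skew the steady-state numbers.
    """
    steady = [ttlb for t, ttlb in samples if t >= warmup_seconds]
    window = test_seconds - warmup_seconds
    rps = len(steady) / window
    avg_ttlb = sum(steady) / len(steady) if steady else 0.0
    return rps, avg_ttlb

# Two slow cold-start requests are ignored; only the last 5 seconds count.
samples = [(1, 5.0), (2, 4.0),                      # warm-up traffic
           (6, 0.1), (7, 0.1), (8, 0.1), (9, 0.1)]  # steady state
rps, avg_ttlb = steady_state_stats(samples, warmup_seconds=5, test_seconds=10)
```

Including the two cold-start samples here would roughly double the apparent average TTLB, which is exactly the distortion the warm-up period exists to avoid.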

Sample Tests and Data Population. Peschke again referred to www.codeplex.com/sptdatapop as a great source of sample tests for many basic operations, to be adapted for your own use. Other tools: a script to create users in AD; a tool to scrape a list of webs, lists, libraries, list items, etc., for use in web tests; tools to create webs, lists, libraries, and list items; and a tool to create My Sites. These are fairly generic tasks.

Demo of a Test. Peschke next crossed his fingers and did a demo to add a web test and run a load test. What struck me most was that this testing was all done from within Visual Studio, in a very straightforward manner. Even so, it will take a few viewings, preferably with the software running on your own machine, for this to all make sense.

Questions to Consider. Good wisdom here about testing. Always assume there is a bottleneck, ask yourself where it might be and how it might be alleviated, and ask whether it sits in an unexpected location. Is the throughput spiky? Are many errors appearing? And be as concerned about tests which are extremely good as about tests which are extremely bad.
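One crude way to put a number on "spiky" throughput is the coefficient of variation of the per-interval RPS series. The 0.5 threshold below is my own illustration, not a rule from the session:

```python
from statistics import mean, pstdev

def is_spiky(rps_series, threshold=0.5):
    """Flag a throughput series as spiky when its coefficient of
    variation (stddev / mean) exceeds a chosen threshold.

    The default threshold is arbitrary; tune it to what 'smooth'
    looks like for your own farm's baseline runs.
    """
    m = mean(rps_series)
    if m == 0:
        return False  # no traffic at all is a different problem
    return pstdev(rps_series) / m > threshold

print(is_spiky([50, 52, 49, 51, 50]))  # steady -> False
print(is_spiky([50, 5, 90, 2, 80]))    # spiky  -> True
```

A flag like this is only a prompt to investigate; the interesting work is finding out what the farm was doing during the dips.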

Post-Testing Investigation

Investigation Techniques: There were more ideas here than the standard troubleshooting ideas, and this reflected real, painful experience. Make sure the configuration is correct for hardware, the load balancer, SP settings, etc. Try running just portions of the test to divide and conquer. Simplify the farm topology. Isolate the workloads/operations/pages. Use Read scenarios rather than read/write ones. Avoid taking things for granted. This also helps illustrate why more time will be needed for testing than you might expect.

Investigating with Visual Studio Team Tests. Peschke presented a demo of this. He gave examples of table and graph views for seeking RPS and TTLB patterns and analyzing perf counters. He also gave recommendations for perf counters to focus on; this was fairly standard for machines (proc, memory, disk IO, network) and counters from VSTT default sets, adding SharePoint, Search, and Excel specific ones. One good example of this showed WFE CPU dropping while SQL CPU surged and the SQL Lock Wait Time went up, meaning that the bottleneck in this case was SQL. Peschke gave some good examples of reading test results and root-causing them.

Scaling Points. A loooong list of things to consider when scaling once capacity is understood; too many to simply list here, so I’ll include some highlights. Put data files and log files on separate spindles. Does custom code need to be revisited? For the various databases, should additional data files be added, up to one file per processor core, spread across multiple drives? The SQL disk design is critical for high-user or high-write implementations. Perhaps have separate farms: one just for My Sites, one just for publishing, one just for search, etc. Be sure to run on x64. In virtual environments, are VMs optimally configured? Are you monitoring the object cache and sizing it accordingly? Does your farm have two VLANs, one for page traffic and one for a backend channel for inter-server communication and SQL? Is there a dedicated WFE for indexing? There were more, but these seemed the most germane.

Conclusion: Exceptional granularity and depth, goes far beyond mere scenarios or demos. Worth absorbing, and worth coming back to in the future, regardless of which version of SharePoint you deploy. This is why we go to Tech-Ed.
