I/O Characteristics of Microsoft SQL Server – Part I

Databases are among the most common applications running in data centers. Any time you think of storing data for later use, you have to consider some form of data management system, and a database is one such system, offering sophisticated mechanisms to store, access, and protect data. Various queries can be run against a database to retrieve data at different granularities (an individual record, a group of records, or all the data in the database).

Databases come in many flavors – commercial (Oracle, Microsoft SQL Server, IBM DB2) and open source (PostgreSQL, MySQL). Irrespective of the flavor, the goal of a database is the same – save data for later consumption. With databases becoming the conduit for accessing and storing data, and the amount of data they handle exploding to insane proportions, it became necessary to redesign the IT infrastructure supporting these data management systems. Separating the storage tier from the compute tier gave architects a lot of flexibility in designing and managing the infrastructure for these systems. Storing the data managed by a database in a separate layer offers interesting new possibilities – a high-speed transport layer that can move data in and out at very high rates, and data protection schemes independent of the data management software (storage replication, snapshots, disaster recovery applications, etc.).

By now, it should be no surprise that databases are among the heaviest consumers of the data services offered by external storage (SAN, NAS). Quite naturally, data center architects want to treat the I/O requirements of databases as one of the key criteria when designing a new storage infrastructure. But what are the characteristics of a database's I/O? Why are they interesting? How can one find the characteristics of a database's data access? In this blog, I will provide some answers to the first two questions. This white paper should be useful for answering the last question.

The I/O profiles shown in this blog are those of one of the most popular data management systems – Microsoft SQL Server (2012). Before diving in, a quick look at the components of SQL Server:

  1. The predominant tenants are user databases (there can be multiple of them in a single SQL Server instance) used by individual applications, groups of applications, or users.
  2. Database system components (e.g., master database, temp database) required for the functioning of SQL Server.
  3. Log files for database recovery.

Some common database operations:

  1. Queries executed by users, directly or via applications, which mostly access a very small set of data.
  2. Queries executed by owners of a database, which mostly access a large subset of the data or the entire dataset.
  3. Operations executed by database administrators to build and manage data structures such as indexes, load data into the database, or back up/restore the database.

This blog focuses on the I/O characteristics of a user database. The database that was profiled supported an Online Transaction Processing (OLTP) system. It must be noted that there are databases that support other types of operations – analytical processing, data aggregation, etc. The I/O characteristics of those databases are different from the one discussed here.

The very first thing to find out is the read/write mix of user operations that ran against the database.

Figure 1. Read-Write ratio of data access

As shown in figure 1, the I/O accesses were predominantly reads with a few writes. This shouldn't come as a big surprise: in most transactional applications, users browse a lot of information (movie titles on Netflix, items on Amazon.com) before purchasing a few items. The absolute read and write percentages may vary depending on the individual application or on user behavior, but it is fair to say that the accesses are skewed towards reads.

The next interesting trait of database accesses is the size of a data request. Figure 2 shows the distribution of request sizes over the active period of the database.


Figure 2. Block Size Percentage

As shown in figure 2, the predominant request sizes are 8KB and <32KB (Microsoft literature states that operations on user databases in SQL Server use an 8KB request size; figure 2 simply confirms it). This relative count of request sizes is based on a large sample of data accesses per second, as shown in figure 3.


Figure 3. Block Size Count
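If you want to derive charts like figures 1–3 from your own environment, the arithmetic behind them is simple counting. Below is a minimal sketch that assumes a hypothetical trace file (trace.csv) with one request per line in the form op,size_bytes; the capture tool you use (vscsiStats, esxtop, the array's own analytics, etc.) dictates the real format, so treat the file name and layout as placeholders.

```python
# Minimal sketch: derive the read/write mix (figure 1) and the request-size
# distribution (figures 2 and 3) from a hypothetical trace file "trace.csv"
# containing one request per line as "op,size_bytes" (e.g. "read,8192").
import csv
from collections import Counter

def profile(trace_path: str) -> None:
    op_counts = Counter()      # read vs. write mix
    size_counts = Counter()    # request-size distribution
    with open(trace_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 2:
                continue       # skip blank or malformed lines
            op, size = row
            op_counts[op.strip().lower()] += 1
            size_counts[int(size)] += 1

    total = sum(op_counts.values())
    for op, n in op_counts.items():
        print(f"{op:6s}: {100.0 * n / total:5.1f}%")
    for size, n in sorted(size_counts.items()):
        print(f"{size // 1024:4d}KB: {n} requests ({100.0 * n / total:5.1f}%)")

if __name__ == "__main__":
    profile("trace.csv")   # hypothetical trace capture
```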

At this stage, the only question that remains unanswered is the number of outstanding requests issued by SQL Server. This variable is known by different names – number of outstanding requests, number of concurrent requests, number of requests in flight, to name a few. It is the number of requests in flight to the storage system at any given time, and it is capped by the shortest of the queues that exist along the I/O path. The shortest queue is usually the one advertised by the storage to the hosts. Choosing an appropriate value for the number of outstanding requests during storage benchmarking hinges on the following:

  1. Do you want to measure the max IOPS the storage can support with the above I/O profile of SQL Server?
  2. Do you want to ensure that the given storage can service I/O requests within the agreed SLA time limits?

Most POCs intend to focus on #2 but actually end up measuring #1, because it is relatively easy to measure. That's a discussion for another blog. If your focus is on #1, use a larger value for the number of outstanding requests (32, 64); if your focus is on #2 (it should be), start small (4, 8) and stop when you observe latency increasing without any change in IOPS.
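That stopping rule is easy to automate. Here is a minimal sketch of the idea; measure() is a stand-in for whatever actually drives the benchmark (Iometer, fio, vdbench) and returns the observed IOPS and latency at a given queue depth, so both the function and the simulated numbers inside it are assumptions for illustration.

```python
# Sketch of the "start small and stop" rule: increase the number of
# outstanding requests and stop once latency keeps climbing while IOPS
# stays flat. measure() is a stand-in for your real benchmark harness.
def measure(queue_depth: int) -> tuple[float, float]:
    """Placeholder: run the benchmark at this queue depth and return
    (iops, latency_ms). Replace with a call to Iometer/fio/etc."""
    # Simulated storage that saturates around 40K IOPS, for illustration.
    iops = min(40000, queue_depth * 5000)
    latency_ms = queue_depth / (iops / 1000.0)
    return iops, latency_ms

def find_knee(depths=(4, 8, 16, 32, 64), iops_gain_floor=0.05):
    prev_iops, prev_latency, knee = None, None, depths[0]
    for qd in depths:
        iops, latency = measure(qd)
        print(f"QD={qd:3d}  IOPS={iops:8.0f}  latency={latency:6.2f} ms")
        if prev_iops is not None:
            gained = (iops - prev_iops) / prev_iops
            if gained < iops_gain_floor and latency > prev_latency:
                break          # latency rising with no IOPS gain: stop here
        knee, prev_iops, prev_latency = qd, iops, latency
    return knee

if __name__ == "__main__":
    print("useful queue depth:", find_knee())
```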

To recap, user databases managed by SQL Server access data at 8KB granularity and read far more data than they write (the split may vary depending on the applications using the database or the operations running on it, but it is safe to say that reads significantly outnumber writes).

It should be noted that the I/O profile described here is only one facet of SQL Server. In the next blog, I will talk about another interesting component of SQL Server – the temp database – and discuss its I/O characteristics, which are vastly different from what we saw in this blog.

One Size Fits All?

Designing storage for enterprise applications such as databases, mail servers, and VDI involves benchmarking various storage devices during the proof-of-concept (POC) phase. One of the intents of the POC is to evaluate the behavior of storage under various load conditions, against various I/O profiles that are supposed to represent or simulate the actual applications. The storage that stands out during the benchmarking becomes a prime candidate for consideration.

The ability of the selected storage to meet the needs of the applications really hinges on selecting the right I/O profiles – the ones that most closely simulate the applications' I/O behavior. Unfortunately, there is neither an easy way to determine an application's I/O characteristics nor readily available I/O requirements spelled out clearly by the application vendor, as is done for other resources (CPU, memory, and network). Because of this, most data center architects end up doing one of the following –

  1. Use an I/O profile that a storage vendor tells them to use
  2. Use an I/O profile that is perceived to simulate an application (with no real proof to validate the assumption)
  3. Use a completely random profile that has nothing to do with the application

In 2009, a blog was published on microsoft.com that attempted to disclose the I/O profiles of many enterprise applications. Although this information is much better than random guesses, it does not capture all the phases of an application. Benchmarking with these profiles may result in the selection of storage that is sub-optimal for handling all the needs of an application. Various storage vendors have published their perceived view of applications and recommend certain I/O profiles to use during a benchmark. A few vendors have started relying on infrastructure-level analytics to learn how various applications behave as seen by their storage devices, and have started publishing the learnings (on their cloud portals and blogs). Using these machine-learned profiles is certainly a good start.

One of the most important characteristics of an application's I/O is the block size it uses when moving its data. My colleague Pete Koehler has done a very nice job of explaining what it is and why it matters in his blog. He also explains how one can get the information needed to understand the I/O profile of an application in this blog. As part of my day-to-day work, using the methods Pete explains, I have extracted the I/O profiles of many applications you run in your data centers. I thought these profiles could be useful to all those who benchmark their storage systems but want to do so with realistic I/O profiles, want to understand their applications, or just want to look at some cool graphs.

What I will do is publish these profiles in separate blogs and discuss each a bit – why they are interesting, why some of them could be devastating to your storage, and so on. Here is the list of applications I have profiled:

  1. Microsoft SQL Server 2012
    1. User Database
  2. VMware Horizon 6.0
  3. Microsoft Exchange
  4. Cassandra
  5. Software Compilation

After looking at the profiles of all the above applications, one thing you will hopefully realize is that there is no one-size-fits-all option when sizing storage for these applications. I know what you will be thinking after reading this – damn it! What should I use to benchmark my storage? I will try to summarize my findings and specify separate profiles that most closely resemble the profile of each application. But, more importantly, I want to educate the good folks who are tasked with running POCs not to fall for blind recommendations that are nowhere close to reality, but to understand their applications using the tools that can help them do so and to benchmark their storage based on what they learn.

Have fun!

Building High Performing EMC ScaleIO based Hyper-converged Environments

Introduction

EMC ScaleIO is a software-based solution that aggregates the storage media (spindles, SSDs) in servers to create a server-based SAN. It is built on vSphere hosts by deploying ScaleIO software in the vSphere hypervisor and in a Linux-based VM running on each host. This allows the vSphere hosts to provide both storage and compute to the virtual machines (VMs) running on them. This converged environment is called a hyper-converged infrastructure (HCI).

Benefits

ScaleIO HCI offers several advantages over traditional SANs. Some of the key benefits are listed below:

  1. Converges the compute and storage resources of commodity hardware into a single layer in vSphere environments
  2. Combines HDDs, SSDs, and PCIe flash cards to create a virtual pool of block storage
  3. Creates a massively parallel and insanely scalable (both capacity and performance) storage system
  4. Enables performance to scale linearly with the infrastructure (as more servers with storage are added)

Memory based Acceleration in ScaleIO HCI

The I/O latency of a ScaleIO-based HCI can be significantly lowered using server DRAM and PernixData FVP software. FVP aggregates the DRAM in the servers that are part of a ScaleIO HCI and creates a massively parallel, linearly scalable data tier (referred to as Distributed Fault Tolerant Memory [DFTM]) that can be used to accelerate data accessed frequently by the VMs as well as new data written by the VMs.

This new accelerated hyperconverged infrastructure was evaluated in a lab on a 4-node ScaleIO HCI shown in Figure 1.

Figure 1: ScaleIO HyperConverged Infrastructure with DFTM

I/O Performance

Here is a snippet of the I/O performance of the new ScaleIO stack with DFTM.

Figure 2. Read Operations/Sec attained from the accelerated ScaleIO HCI

Figure 3. Write Operations/Sec attained from the accelerated ScaleIO HCI

The workload used for the tests observed an 8x increase in read operations/sec and a 1.2x increase in write operations/sec per ScaleIO node. As the HCI scaled (new nodes were added), the performance increase due to I/O acceleration by DFTM scaled proportionally. With 4 nodes, the read and write operations/sec reached significant levels, touching the 150K mark for reads and the 25K mark for writes.

Why is ScaleIO HCI with DFTM Interesting?

FVP decouples the I/O performance of the converged infrastructure from its capacity. While administrators retain all the benefits of ScaleIO, they can manage the converged infrastructure's I/O behavior independently of the underlying commodity hardware. Even if the hardware components vary from node to node in a cluster, the I/O performance experienced by the VMs remains consistent and agnostic to the physical characteristics of the components.

FVP serves as a single data tier for both reads and writes. This means that reads and writes from VMs observe similar I/O latencies, and when the latencies are similar, the rates of the operations become similar as well.

Another interesting feature of this new architecture is that every accelerated VM gets a large, high-speed buffer for writes. Unlike shared storage, where the high-speed buffers (storage cache) are shared across all the VMs connected to the storage, FVP provides equal buffer chunks to all the VMs. The performance of this buffer can be changed easily by changing the underlying high-speed media (from SSDs to DRAM), and the number of VMs utilizing the write buffers can be increased by deploying high-speed media with larger capacities.

You can find more details about the architecture, experiments and results in this white paper. Feel free to leave your comments/questions here.

Destaging Writes from Acceleration Tier to Primary Storage – Part II

In part I of this series, I introduced FVP's asynchronous destaging of data in write-back mode from flash to the primary storage. I discussed the various nuances of destaging and showed how asynchronous destaging helps applications by providing flash-class latency for a typical I/O workload. In this blog, I will discuss the implications of accelerating a write-intensive workload and the impact of asynchronous destaging on the workload's performance.

Accelerating Write-Intensive Workloads

A VM running a bursty write workload was used for this test. During the testing period, the workload issued only writes, which peaked at a very high value periodically. This VM was selected to be accelerated by FVP and was put in write-back mode. Figure 1 shows the write operations observed by the VM during the entire testing period. Writes reached as high as 15K/sec during the peak periods but were only ~250/sec otherwise. All the writes were serviced by the flash device during the entire testing period, including the bursty periods. However, unlike the experiment in part I of this series, during this test the primary storage couldn't service writes at the same rate the VM issued them. As a result, the rate of destaging the VM's data from flash to the primary storage was lower (11K/sec) than the rate of writes issued by the VM (15K/sec), which meant all of the VM's data couldn't be destaged as soon as it arrived during the bursty period. Thanks to FVP, the writes were acknowledged as soon as they arrived, allowing the VM to issue more writes, but they were sent to the primary storage at a rate the storage was comfortable handling. The non-overlapping write peaks in figure 1 illustrate this behavior and highlight the advantage of having an acceleration tier that can service writes as soon as they arrive, but sends the data to its permanent residence asynchronously without overwhelming it.

Fig 1. Write Operations

As the VM started issuing writes, the writes got serviced by flash at flash speed (flash + network speed, when using peers), as shown in fig 2. However, since the rate of writes from the VM outpaced the rate of destaging, the destaging region saw a continuous increase in the amount of data to be destaged. FVP continued to service writes at flash speed until the occupancy of the destaging region reached a threshold. If the occupancy crosses the threshold, FVP starts injecting additional latency when acknowledging a write back to the VM, to throttle new writes. This threshold is a carefully selected value that gives the destager enough cushion to flush the dirtied data even if the primary storage is slow in servicing writes. The injected latency depends on the destaging area occupancy and the SAN latency (the latency experienced by the destager when writing dirty blocks to the primary storage) and is added only when acknowledging the writes that fill the destaging area above the threshold. Thus, the effective write latency (blue line) seen by the VM during the bursty write periods was higher than the flash latency (orange line), but much lower than the datastore latency (green line).

Fig 2. Latency of Write Operations

The throttling aggressiveness is determined using an intelligent algorithm and adjusts dynamically to keep the occupancy of the destaging region under the threshold. If the occupancy doesn't drop, FVP increases the throttling further until the destager is able to flush enough data from the destaging region for the occupancy to fall below the threshold. As soon as the occupancy drops below the threshold, FVP resumes servicing writes at flash speed. In practice, writes from enterprise applications most often occur in short spurts, and the default size chosen for the destaging area is adequate to handle these spurts; in such cases, writes should be serviced at flash speed.
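To make the control loop concrete, here is a highly simplified sketch of the behavior described above. It is not FVP's actual algorithm – the threshold, the latency formula, and the rates are made-up, illustrative values – but it shows how occupancy-driven latency injection throttles a writer that outruns the destager.

```python
# Simplified illustration of write-back throttling: writes land in a
# destaging buffer at flash speed; a background destager drains the buffer
# at SAN speed; once buffer occupancy crosses a threshold, extra latency is
# added to each write acknowledgement to slow the VM down.
# All numbers are illustrative, not FVP internals.

FLASH_LATENCY_MS = 0.2
SAN_LATENCY_MS = 3.0
DESTAGE_AREA_IOS = 10000          # capacity of the destaging area (in I/Os)
THRESHOLD = 0.75                  # occupancy above which throttling begins

def ack_latency_ms(occupied_ios: int) -> float:
    """Latency to report back to the VM for one write acknowledgement."""
    occupancy = occupied_ios / DESTAGE_AREA_IOS
    if occupancy <= THRESHOLD:
        return FLASH_LATENCY_MS                   # flash-speed acknowledgement
    # Scale the injected latency with how far past the threshold we are,
    # up to roughly the latency of the primary storage itself.
    overshoot = (occupancy - THRESHOLD) / (1.0 - THRESHOLD)
    return FLASH_LATENCY_MS + overshoot * SAN_LATENCY_MS

def simulate(write_rate_per_s: int, destage_rate_per_s: int, seconds: int):
    occupied = 0
    for t in range(seconds):
        occupied = max(0, occupied + write_rate_per_s - destage_rate_per_s)
        occupied = min(occupied, DESTAGE_AREA_IOS)
        print(f"t={t:2d}s occupancy={occupied:6d} "
              f"ack latency={ack_latency_ms(occupied):.2f} ms")

if __name__ == "__main__":
    # A burst of 15K writes/sec against a destager that can drain 11K/sec.
    simulate(write_rate_per_s=15000, destage_rate_per_s=11000, seconds=5)
```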

In summary, even for write-intensive workloads, FVP can still provide an SLA that is much better than that promised by the primary storage technologies available today. Even a heavy barrage of writes is easily handled by FVP at flash-like latencies. With its intelligent capabilities, FVP handles the burst even when the primary storage is incapable of handling it.

UP NEXT: Accelerating Write-only Workloads ….

Resources:

  1. Iometer configuration file used for the test: bursty_writes
  2. Destaging Writes from Acceleration Tier to Primary Storage – Part I

Destaging Writes from Acceleration Tier to Primary Storage – Part I

Frank posted a nice article on the write-acceleration policies supported in FVP. It is a great read for anyone looking for a quick intro to the two write-acceleration policies supported in FVP. At the end, some readers asked a few interesting questions regarding 'Write Destaging', answers to which require a deeper dive than simple two-line replies. Hence, I thought of explaining FVP's destager architecture in a multi-part blog series. This blog offers an introduction to asynchronous destaging of a VM's data from flash, using an example.

BTW, kudos to all the readers who raised these questions! It just shows how well these readers understood the technicalities of write acceleration. I tip my hat to you folks, and bow to you, Frank.

Destaging Writes from Flash to Primary Storage

In the write-back mode, FVP acknowledges writes coming from a VM as soon as they are written to the flash. The data is written to the primary storage (the permanent residence of the data) eventually, at a rate the primary storage is comfortable receiving it. This task of destaging the data written by VMs to their primary residence is delegated to what is called a 'Destager', a key component of FVP that runs in the background. Essentially, in the write-back mode, writes from the VMs are acknowledged at flash speed (flash + network speed, when using peers), while they are sent to their permanent residence asynchronously at SAN speed. Note that asynchronous data destaging is relevant only in write-back mode.

Destaging Area

At any given time, FVP uses flash in multiple ways – to host data read frequently by VMs (to accelerate reads), to buffer primary copies of data written by VMs running on the server that houses the flash (to accelerate writes), or to keep replicas of data written by VMs running on remote servers (to provide fault tolerance in write-back mode). In order to accelerate many VMs on a vSphere host, and to accelerate both reads and writes of these VMs, FVP has to manage the flash real estate very efficiently. FVP uses dynamically expanding and shrinking regions on flash to hold the writes coming from the VMs until all the data is moved to its permanent residence. This region is called the 'destaging area'. Each VM that is configured to be in write-back mode gets a separate destaging area.

Destaging Frequency

FVP acknowledges a write issued by a VM in write-back mode as soon as it is written to the VM's destaging region on the flash. In the background, FVP activates the destager to migrate the VM's data to its permanent residence. The migration happens at a rate the primary storage is capable of handling. When multiple VMs are configured to be in write-back mode, all their writes are acknowledged as soon as they are written to the individual destaging regions. In this case, the destager migrates data from the destaging regions of all the VMs simultaneously, but more importantly, without overwhelming the underlying primary storage.

Implications of Destaging on Write-Acceleration: Flash Class Application Latencies! 

Let me illustrate the mechanics of the destager with an example. In this experiment, a Windows VM running Iometer issued writes in bursts to the primary storage. Figure 1 shows the rate of write operations during the experiment. Writes reached as high as 4K/sec during the bursty periods. This VM was selected to be accelerated by FVP and was put in write-back mode. All the writes were serviced by flash, and the written data was destaged to the primary storage asynchronously by the destager. In this experiment, the primary storage was able to service writes at a high rate. Hence, the destager could empty the VM's data as soon as it arrived.

The result: Writes/sec seen by VM = Writes/sec serviced by Flash = Destaging Rate = Writes/sec written to primary storage asynchronously (hence lines representing rate of writes serviced by different components overlap in fig 1).

Fig 1. Write Operations

However, the latency of the write operations seen by the VM tells a different story. Figure 2 shows the latency of the write operations observed by different components during the test. By virtue of write acceleration by FVP, all the writes were serviced by flash at flash speed (orange line showing "Local Flash Write" latency), even during periods of bursty writes. The write latency seen by the VM was almost the same as the flash write latency (blue line showing "Total (Effective)" latency). Flash latency increased by only 200 microseconds during the bursty period. In contrast, the I/O latency witnessed by the destager when destaging the VM's data to the primary storage reached as high as 3ms** (green line showing "Datastore Write" latency). This would be the latency seen by the VM if it were issuing writes directly to the primary storage.

Fig 2. Latency of Write Operations

Most applications exhibit a write behavior similar to that shown in the above illustration. For such workloads, FVP clearly offers an unprecedented boost in I/O QoS. This boost can be realized by the mere addition of an SSD to the vSphere hosts and the creation of a clustered acceleration tier on the SSDs using FVP.

NEXT UP: Accelerating write-intensive workloads…

** The primary storage used for this experiment was an all-flash SAN. In reality, the latency could be even higher (a few tens of milliseconds) if the primary storage device were configured on magnetic disks.

Resources:

  1. Iometer configuration file used for the test: Bursty_writes
  2. Frank’s blog on Write-Back and Write-Through policies in FVP
  3. FVP Writeback policy deep dive whiteboard session

Get Pernix’d

The sudden explosion in the number of solutions built on flash-based storage surprises me. I remember researchers and the industrial community discussing the reliability and longevity of Solid State Disks (SSDs) at the FAST conference not too long ago. Fast forward to today, and these concerns no longer seem to worry solution providers or consumers. I now work for PernixData, a company that is aiming to carry forward the virtualization journey from where hypervisors left off (post CPU and memory virtualization). Flash Virtualization Platform (FVP), the flagship product developed by PernixData, is a clustered flash tier created by virtualizing server-side flash storage to accelerate virtual machines' (VM) I/O access to block-based storage devices. In this blog post, I intend to discuss the motivation behind developing FVP and the key benefits it offers.

Rise of a high-performing, expensive storage tier (SAN, NAS)

Over the years, storage technology has taken an interesting course. Although computing platforms (desktops, servers, laptops) provide persistent storage to the computing units, most IT users don't trust this layer to be robust enough to provide either the performance that meets their SLAs or the technologies that let the data rest in peace (dedupe, compression, encryption, snapshots). As a result, a new storage layer, external to the computing platform and with dedicated expertise to accomplish both, has emerged. Almost all research effort in the storage area continues in this external tier.

Bi-dimensional Problem

However, this external storage tier is plagued by a problem – improving a single layer in two orthogonal dimensions (performance and capacity) is extremely complicated.

The problem is illustrated better in the following graph:


Storage for most data centers is sized along two dimensions – capacity and performance. Most often, storage sized for capacity doesn't meet the performance needs (application SLAs). In that case, additional storage media have to be added to meet the performance needs (this is mostly true for transaction-based applications). With advances in media capacity out-pacing advances in media performance, this almost always leads to over-provisioning, as adding storage media means adding extra giga- and terabytes of unused storage capacity. There is a significant 'Capex' implication of sizing storage this way.

Another complication arises when almost all the processing cycles of the storage have to be dedicated to processing application I/Os to meet their SLAs. This leads to postponing all non-application traffic (mostly administrative in nature, such as snapshots, backups, storage cloning, migration, etc.) to idle periods. This means that storage admins either have to depend on heavy automation and scheduling of these tasks OR burn midnight oil to ensure the success of these operations. This has significant 'Opex' implications.

In summary, meeting performance requirements forces capacity to be over-provisioned, while sticking to capacity needs compromises performance. Hence, it is very hard to achieve both at once.

Where is the local storage?

What happened to local storage? The applications that use local storage effectively can be counted on your fingers – the Hadoops, high performance computing applications, the Googles, the Facebooks, to name a few. But they take a different approach to utilizing local storage capacity: they implement all the afore-mentioned capacity and resiliency features in software on commodity hardware. To obtain the performance their applications need, they use a ridiculous amount of hardware that an average business can't even imagine. Don't forget – they can throw many engineers with specialized expertise at the bi-dimensional problem. Finally, there are industry-standard benchmarks such as the TPC suite that can use local storage to reduce the cost of performance; here, data protection at the hardware level is not given a high priority.

The lack of interest and demand has largely limited innovation in the local storage tier – except for improvements to the media types. Server vendors now support SAS/SATA/PCIe-based Solid State Disks (SSDs) along with traditional SATA/SAS magnetic disks. But the concern remains – who will use them?

Flash Virtualization Platform

Meanwhile, another revolution happened in the IT industry. VMware, with its flagship product vSphere, fork-lifted the compute layer away from the storage layer. This opens up interesting opportunities. One such opportunity is to solve the problem I have been discussing all along – to split the storage tier into two separate dimensions – performance and capacity. PernixData is among the early few who recognized this opportunity. The result of their tireless effort is what you see today – Flash Virtualization Platform, an ethereal storage layer that uses local fast storage (SAS/SATA/PCIe SSDs) to accelerate transient data and SAN storage to rest persistent data.

FVP aims to address the orthogonal challenges that plague the storage layer by intelligently using the two storage tiers. This independent usage of the two tiers leads to a plethora of opportunities for server and storage vendors. Server vendors can focus on providing high-speed local storage for servicing transient data without worrying about the complicated data-resting technologies, while storage vendors can focus on jazzing up their storage devices with attractive capacity-saving and data-protection technologies without worrying about the performance impact of those technologies. Essentially, FVP lets you use storage solutions from your preferred vendor while speeding up data access by utilizing the best flash technology out there.

Let us revisit the problem.

When the persistent tier (external storage) is combined with the transient tier (flash media in local storage), a new solution emerges: the persistent tier can be sized to meet the current capacity requirements (with room for growth), while the transient tier can use the latest flash technologies to adequately meet the I/O performance demands by absorbing bursts of I/O requests from the applications the moment they occur. Even if an emergency administrative task has to be scheduled, application users wouldn't experience any noticeable impact, as their I/O requests are serviced by the transient tier. How does FVP achieve this? Check out the videos here.

This solution has significant capex savings, as the persistent storage doesn't have to be over-sized. The only additional investment will be for procuring flash media (whose cost keeps dropping every day) and the license cost of FVP ;-). There are noticeable opex savings as well – no need to maintain the extra storage (power, space, and cooling savings). Storage admins can breathe easy, as the admin tasks they execute on the external storage are hidden from the application users and the impact is mostly not felt.

Linguistic Lesson

‘Pernix’ means agile, active. The name is very apt for what FVP can do to your IT environments; it can activate your virtual machines. Think of it as the magical spinach that gives Popeye his awesome power. “Protect your investment, Pernix your data”.

As Satyam (CTO, PernixData) likes to ask – do you want to get 'Pernix'd? I do. That's why I decided to join the team. Question is – do you? If the answer is yes, join the beta program today.

Stay tuned as more is yet to come …

Missing in Action

It feels nice to be back after a long hiatus. Wow! So many things happened in life – travel, injuries, vacation (to attend my brother's wedding), longer-term projects, etc. But the biggest of all was the birth of an angel. Yes, we had our first child – a beautiful girl – last year. She kept her daddy busy for most of last year and the first half of this year. Now she understands that daddy has other things to do and has been kind enough to let me do what I love most (well, after her) – share my thoughts and findings.

Although they don't find a mention here, a couple of interesting but long projects kept me busy all this time. I have blogged/published/presented them elsewhere. I will just provide the links here so that you know where to find them.

  1. Achieving 1 million IOps from a single vSphere host – http://blogs.vmware.com/performance/2012/03/a-conversation-about-1-million-iops.html
  2. Storage vMotioning a virtualized SQL Database – http://blogs.vmware.com/performance/2011/11/svmotion-sqlserver.html
  3. Storage vMotioning on an EMC VNX storage array using VAAI – Presentation# USD.40 @ EMC World 2012

I collaborated with my friends Y.P. Chien and Eddie @ Kingston to publish several studies on vSphere memory management. One of them is here:

  1. The Yin and Yang of Memory Overcommitment in Virtualization – http://media.kingston.com/images/usb/pdf/MKP_339_VMware_vSphere4.0_whitepaper.pdf

You may have already seen these. If not, give them a read – you may find them interesting enough to keep your eyes glued to them.

Mem.MinFreePct and Memory Reclamation in vSphere 5

It feels good to be back..

Recently, Frank published a blog about the new sliding-scale-based estimation of the minimum free memory percentage in vSphere 5 – an interesting read for anyone looking to estimate memory capacity for their vSphere-based virtual infrastructure. My good friend YP Chien from Kingston ran some tests to understand the memory reclamation techniques (ballooning, compression, and host swapping) in vSphere 5. He noticed that the host free-memory levels at which the various memory reclamation techniques kicked in were quite different from what they should have been based on the sliding-scale logic mentioned in Frank's blog. YP immediately brought this to my attention (thanks, YP!). I dug into this a bit and found the issue. Instead of commenting on Frank's blog, I thought of offering a deeper explanation here:

I will use the same example that was used in Frank’s blog. Consider a server configured with 96GB of RAM. The MinFreePct threshold will be set at 1597.36MB based on a sliding scale shown in the following table:

Threshold           Range (MB)       Reserved Free Memory (MB)
6%                  0 – 4091         245.76
4%                  4092 – 12287     327.68
2%                  12288 – 28671    327.68
1%                  Remaining        696.32 (in this case)
Total Free Memory                    1597.36

For the host considered in the above example, the various memory reclamation techniques kick in at different thresholds, as explained below:

Free Memory State   Threshold (% of MinFree)   Threshold (MB)       Reclamation Type
Soft to High        64 to 100                  1022.31 – 1597.36    Balloon
Low to Hard         16 to 64                   255.57 – 1022.31     Balloon, Compression and/or Swap
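For those who prefer to script the arithmetic, here is a minimal sketch of the sliding-scale calculation and the state boundaries derived from it. The bracket sizes follow the table above; the exact result may differ by a few MB from the rounded figures quoted in this blog, and the function name is just an illustration.

```python
# A minimal sketch of the sliding-scale arithmetic described above.
# The bracket boundaries follow the table; rounding may differ slightly
# from the figures quoted in the blog.

def min_free_mb(host_mem_mb: float) -> float:
    """Estimate the MinFree reservation (MB) for a vSphere 5 host."""
    brackets = [               # (bracket size in MB, percentage reserved)
        (4096.0, 0.06),        # first ~4GB  -> 6%
        (8192.0, 0.04),        # next  ~8GB  -> 4%
        (16384.0, 0.02),       # next  ~16GB -> 2%
        (float("inf"), 0.01),  # everything else -> 1%
    ]
    reserved, remaining = 0.0, host_mem_mb
    for size, pct in brackets:
        portion = min(size, remaining)
        reserved += portion * pct
        remaining -= portion
        if remaining <= 0:
            break
    return reserved

if __name__ == "__main__":
    min_free = min_free_mb(96 * 1024)          # the 96GB host from the example
    print(f"MinFree reservation : {min_free:.2f} MB")
    # Free-memory states are expressed as a percentage of MinFree
    print(f"Soft/High boundary  : {0.64 * min_free:.2f} MB (ballooning)")
    print(f"Hard/Low  boundary  : {0.16 * min_free:.2f} MB (compression/swap)")
```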

Please note:

  1. There is no separate reclamation target for Memory Compression. It uses ‘Swap Target’ to reclaim memory.
  2. The choice of using memory compression [when enabled] or host-swap is dynamic. vSphere tries to use memory compression, but if it cannot reclaim enough memory soon it will resort to host swapping.
  3. A decrease in memory pressure doesn't mean that the respective reclamation targets are set to zero immediately. vSphere constantly monitors the memory pressure in the host and gradually reduces a reclamation target if it finds that the memory pressure has reduced. On the other hand, memory states can change as soon as the memory pressure in the host changes. Hence, it is possible to see some memory reclamation (ballooning or swapping) for an extended time, until the respective reclamation targets reach zero, even after the memory states indicate no or reduced memory pressure.

Hope this helps you understand when a specific type of memory reclamation kicks in and why you would see it even when you don't expect to. Feel free to throw in your comments or questions 🙂

Storage IO Control and Storage vMotion?

My colleague Duncan posted an article on yellow-bricks regarding storage vMotion (sVMotion) of a virtual disk placed on a Storage IO Control (SIOC) enabled datastore. I thought of providing some more information on this topic.

Yes, sVMotion will be treated as a regular stream of I/O requests coming from a particular VM to a vmdk that is placed on a SIOC-enabled datastore. If the datastore-wide I/O latency exceeds the congestion threshold of the datastore, SIOC kicks in and adjusts the device queue in each host according to the aggregate disk shares of all the VMs on that host that share the datastore. Within a particular host, the I/O requests of each VM are given preferential priority based on the VM's disk shares. The I/O requests can be from an application needing data or from ESX performing an sVMotion of the vmdk on non-VAAI-compatible storage.

How does SIOC treat sVMotion's I/O traffic? When SIOC is active, a VM is allowed to have a certain number of concurrent I/O requests queued in the host for the SIOC-enabled datastore. If sVMotion is initiated on the VM while it is actively issuing I/O requests, the VM's quota of concurrent I/O requests will be shared by both the sVMotion traffic and the other I/O traffic from the VM to the datastore. If the VM is only sparsely issuing I/O requests to the datastore, then its quota of concurrent requests will be dominated by the sVMotion traffic.

Note that in both cases, the total number of concurrent I/O requests (sVMotion + other I/O traffic) is limited to a value proportional to the disk shares of the VM on the datastore.
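As a rough illustration of what "proportional to the disk shares" means, the sketch below splits a host's device-queue slots among VMs according to their shares, and then splits one VM's slots again between its application I/O and an in-flight sVMotion. This is purely conceptual; SIOC's real window-sizing algorithm is more sophisticated, and all the names and numbers here are made up for illustration.

```python
# Conceptual sketch: divide a host's device-queue slots across VMs in
# proportion to their disk shares. When sVMotion runs against one VM's vmdk
# (the non-VAAI case), the migration traffic and the VM's own I/O share that
# VM's slots. Numbers are illustrative, not SIOC internals.

def slots_per_vm(device_queue_depth: int, shares: dict) -> dict:
    total = sum(shares.values())
    return {vm: max(1, round(device_queue_depth * s / total))
            for vm, s in shares.items()}

if __name__ == "__main__":
    # Hypothetical host: device queue of 32, three VMs sharing the datastore.
    shares = {"vm_sql": 2000, "vm_web": 1000, "vm_test": 500}
    allocation = slots_per_vm(32, shares)
    print("per-VM concurrent I/O slots:", allocation)

    # sVMotion is initiated on vm_sql: its slots are shared between
    # application I/O and sVMotion traffic, while the other VMs keep theirs.
    svmotion_fraction = 0.5        # depends on how busy the VM is
    sql_slots = allocation["vm_sql"]
    print("vm_sql application I/O slots:", round(sql_slots * (1 - svmotion_fraction)))
    print("vm_sql sVMotion slots       :", round(sql_slots * svmotion_fraction))
```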

What happens when the storage is VAAI compatible? ESX issues the sVMotion command to the storage. The storage initiates sVMotion on behalf of ESX. ESX doesn’t even see the sVMotion traffic. In this case, the VM is free to use its full quota of concurrent I/O requests.

You will not be able to see the exact number of each I/O request type in the device queue. The good news is that you don't have to worry about them; SIOC is capable of handling these varying traffic conditions for you. If you feel geeky and really want to get into this, your best bet is to monitor the difference in sVMotion's completion time under different load conditions of your VM. But know this – irrespective of the VM's load situation, SIOC will not let sVMotion affect the I/O traffic on the datastore from any other VM. However, the response time of I/O operations in the VM on which sVMotion was initiated will be affected by the sVMotion.

What if the datastore is not congested? SIOC lets sVMotion use its full quota of bandwidth until datastore becomes congested (datastore wide latency > congestion threshold). Then SIOC does what it is designed to do.

Here is a question for you – If you have to sVMotion a vmdk on a SIOC enabled datastore when do you do it? 😉

How cool is vscsiStats? Part-II

I enabled vscsiStats collection on my vSphere host before starting the purge2 operation (check my white paper for more details) on the vCenter database. While the operation ran, I collected 20 samples of vscsiStats output at equal intervals (each interval was 7.5 seconds). The vscsiStats output consists of histograms of various metrics – outstanding I/Os, seek distance, length of a request, arrival time – all split between reads and writes. To obtain the histogram of a given metric at a particular time instant, I divided the difference between the histogram values of the metric collected at successive time intervals by the sampling interval.

Example:

Outstanding Read IOs (=1) at time t(x) = (Outstanding Read IO (=1) until time t(x) – Outstanding Read IO (=1) until time t(x-1))/Sampling interval

Outstanding Read IOs (=2) at time t(x) = (Outstanding Read IO (=2) until time t(x) – Outstanding Read IO (=2) until time t(x-1))/Sampling interval

:

:

Outstanding Read IOs (>64) at time t(x) = (Outstanding Read IO (>64) until time t(x) – Outstanding Read IO (>64) until time t(x-1))/Sampling interval

NOTE: If you are thinking that the above steps are very cumbersome, I agree with you. I have an Excel template that does it for me. All I need is the vscsiStats output of 20 consecutive samples, all saved in a single Excel file. Irfan, in his Virtual Scoop blog, has provided a few links to some neat blogs on visualizing vscsiStats. Check out his blog.
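For those who would rather script the delta computation than maintain a spreadsheet, here is a minimal sketch. It assumes the cumulative histogram of one metric has already been parsed out of each vscsiStats sample into a list of per-bucket counts; the parsing itself depends on how you exported the vscsiStats output, so that part is left out.

```python
# Minimal sketch: turn cumulative vscsiStats histograms into per-interval
# rates. `samples` is a list of histograms (one per collection point), each a
# list of per-bucket cumulative counts for one metric (e.g. outstanding read
# I/Os). Parsing the raw vscsiStats output into this shape is not shown.

SAMPLING_INTERVAL_S = 7.5   # the interval used in this experiment

def per_interval_rates(samples, interval_s=SAMPLING_INTERVAL_S):
    rates = []
    for prev, curr in zip(samples, samples[1:]):
        rates.append([(c - p) / interval_s for p, c in zip(prev, curr)])
    return rates

if __name__ == "__main__":
    # Toy example: three samples of a 4-bucket histogram (cumulative counts).
    samples = [
        [0, 0, 0, 0],
        [15, 30, 8, 2],
        [30, 61, 15, 3],
    ]
    for t, hist in enumerate(per_interval_rates(samples), start=1):
        print(f"t{t}: {[round(v, 2) for v in hist]}")
```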

I followed the above steps for the following histograms in the vscsiStats output.

Outstanding IOs:

Figure 1. Outstanding IOs during purge2 operation.


The graphs in figure 1 show the outstanding I/Os during the purge2 operation. The number of outstanding read I/Os was 64 (tidbit: the pvscsi driver installed in the guest operating system has a default queue depth of 64; during the purge2 operation, the I/O queue in the pvscsi driver was full of read requests, hence the number of outstanding read I/O requests coming from the VM was 64), whereas the number of outstanding write I/Os was zero.

Request Type: Graphs in figure 1 also show that the purge2 operation consisted of only read requests.

NOTE: Since purge2 operation is completely dominated by reads, for the remaining vscsistats histograms I only considered the respective read histograms.

Randomness: To identify the randomness of the purge2 operation I looked at the ‘seek’ histogram in the vscsiStats output.

Figure 2. Seek distance between read requests


The seek distance histogram for reads shows the distance between consecutive read requests in terms of logical blocks. A seek distance of 1 logical block between consecutive requests indicates a purely sequential workload. A seek distance of fewer than 10 logical blocks indicates a quasi-sequential workload. A seek distance of 10+ logical blocks indicates a random workload. In this case, the seek distance between successive read requests was 500,000+ logical blocks, indicating a purely random read access pattern.
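That rule of thumb is easy to capture in a couple of lines; here is a minimal sketch using the same thresholds quoted above (they are this blog's rule of thumb, not an official vscsiStats classification).

```python
# Classify access-pattern randomness from the typical seek distance
# (in logical blocks) between consecutive requests, using the rule of
# thumb quoted above.
def classify_seek(blocks: int) -> str:
    if abs(blocks) <= 1:
        return "sequential"
    if abs(blocks) < 10:
        return "quasi-sequential"
    return "random"

if __name__ == "__main__":
    for d in (1, 6, 500_000):
        print(f"seek distance {d:>7} blocks -> {classify_seek(d)}")
```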

Size of an I/O Read: The last parameter I needed was the size of an I/O read request during the purge2 operation. This was provided by the ‘ioLengthReads’ histogram.

Figure 3. Size of Read Requests during purge2 operation

The size of the read requests seen during the purge2 operation varied from 16KB to 64KB, with some requests as large as 128KB. The variation in I/O size indicates some kind of optimization employed during reads to fetch as much data as possible in one read operation.

Arrival Time for Reads: Another interesting histogram provided by vscsiStats (not required to create an Iometer workload profile, but interesting nonetheless) is the arrival time of the I/O requests (in this case, reads).

Figure 4. Arrival Time for Reads during purge2 operation

An arrival time of ≤100 microseconds indicates that the purge2 operation was very I/O intensive (also evidenced by the 64 outstanding read requests throughout the operation).

With the information I collected from vscsiStats, I created a workload in Iometer with the following parameters (a rough scripted equivalent is sketched after the list):

  • Outstanding IOs: 64
  • Access Type: 100%Read, 100%Random
  • Request size: 48KB (median of 16KB, 32KB, 48KB, 64KB, 128KB)
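Below is a rough, hypothetical equivalent of that profile sketched in Python rather than Iometer: 64 concurrent workers issuing 48KB random reads against a test file. The file path, its size, and the I/O count are placeholders, and this is not the Iometer configuration that was actually used in the experiment.

```python
# Rough sketch of the profile above: 64 outstanding requests, 100% read,
# 100% random, 48KB request size. The test file is assumed to exist already;
# adjust TEST_FILE and IOS_PER_WORKER for your environment.
import os
import random
from concurrent.futures import ThreadPoolExecutor

TEST_FILE = "/tmp/testfile.bin"   # placeholder: a pre-created test file
REQUEST_SIZE = 48 * 1024          # 48KB reads
OUTSTANDING = 64                  # matches the observed queue depth
IOS_PER_WORKER = 1000

def worker(_: int) -> None:
    fd = os.open(TEST_FILE, os.O_RDONLY)
    size = os.fstat(fd).st_size
    try:
        for _ in range(IOS_PER_WORKER):
            # 100% random: pick a request-aligned offset anywhere in the file
            offset = random.randrange(0, max(1, size - REQUEST_SIZE))
            offset -= offset % REQUEST_SIZE
            os.pread(fd, REQUEST_SIZE, offset)   # 100% read
    finally:
        os.close(fd)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=OUTSTANDING) as pool:
        list(pool.map(worker, range(OUTSTANDING)))
```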

Rest, you will know when you read the white paper 😉

Next time you get into troubleshooting I/O problems or planning storage resources for your vSphere environment, remember that you have the secret sauce at your fingertips. Surprise your storage admins by speaking a language they understand – outstanding I/Os, request size, access pattern, and more.

Isn’t vscsiStats cool?