With money like that at stake, you'd think IT architect Eric Ulmer would be conservative when it comes to virtualizing his Minneapolis, London, Sydney, and Cologne data centers. Not so. While most other companies find their servers are maxed out at 10 or 12 virtual machines per server, he's designed ones that run 30 VMs each. And Ulmer's not stopping there. He'll move to 60 VMs per server as soon as he's completely confident that VMware's new vSphere 4 virtualization management software is up to the task.
Ulmer admits, though, that the big numbers make him nervous. "With virtualization, a lot of customers go down if one server fails," he says. "If five customers are down, we don't have enough phone lines to take all their complaints."
But it's not stopping him. His testing shows that VMware's ESX Server can run 60 virtual machines and possibly more. His group once assigned all the VMs intended for two servers to one by mistake. Ulmer was surprised to find the server humming along running 100 VMs without a noticeable performance hit.
Ulmer and his team are among an elite group of data center pioneers who right now are testing the limits of server virtualization, pushing for the next tier of performance even as most companies are just getting comfortable with the technology. They're increasing the number of VMs per server in order to save electricity, capital costs, and even labor when the right management tools are in place. In this deep recession, many IT managers would like to go further with server consolidation. It's these pioneers who are discovering the limits to how many virtual machines can practically be loaded onto any one server, and what problems to watch for as each additional VM stresses the overall system.
Ulmer and others pushing for more performance are getting plenty of encouragement from vendors, whose next generation of virtualization software and servers promises a big jump in productivity and capacity.
VMware has just launched vSphere 4, the upgrade to its data center operating system that shows how vendors are trying to push the state of the art in virtualization. With releases available this month and next, VMware claims companies can get a 30% productivity gain using existing servers--so a company running 10 VMs today should find 13 within reach, says Bogomil Balkansky, VP of product marketing.
Couple that with server makers that are launching their next-generation machines based on Intel's Nehalem, or Xeon 5500, chip, and you're talking major virtualization advances. The 5500 is really the first chip to escape from the personal computing bias of the original x86 chips. It has a memory controller built onto the chip instead of off-loaded to a separate dedicated chip, reducing latencies encountered as a VM's operating system manages the memory that its application is using. The 5500 also has more built-in virtualization support for the hypervisor. It's much more of a multithreaded server processor capable of juggling many assignments across its four cores, making it better equipped to run multiple VMs.
IBM and Hewlett-Packard each say they're seeing gains of just over 60% in benchmark tests of their new Xeon 5500 servers in virtualized environments, compared with previous generations. That means a hardware shift could let the typical 10 VM-per-server company bump up to 16. Taken together, the VMware software upgrade and the new server designs could very well let companies double the number of virtual machines they run per server.
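The arithmetic behind that doubling claim is simple enough to sketch. The percentages below are the vendors' own figures; the combination assumes the software and hardware gains compound independently, which is the optimistic case:

```python
# Back-of-envelope VM-density math using the vendor figures cited above.
# Assumes software and hardware gains multiply independently (optimistic).
def projected_vms(baseline, software_gain, hardware_gain):
    """Apply multiplicative capacity gains to a baseline VMs-per-server count."""
    return baseline * (1 + software_gain) * (1 + hardware_gain)

baseline = 10    # VMs per server at a typical shop today
sw_gain = 0.30   # VMware's claimed vSphere 4 productivity gain
hw_gain = 0.60   # IBM/HP benchmark gain on new Xeon 5500 servers

print(projected_vms(baseline, sw_gain, 0))        # software alone: 13 VMs
print(projected_vms(baseline, 0, hw_gain))        # hardware alone: 16 VMs
print(projected_vms(baseline, sw_gain, hw_gain))  # combined: about 21 VMs, roughly double
```

In practice the gains rarely stack cleanly, since memory and I/O limits tend to bind before CPU does.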
Cisco, for its part, claims its forthcoming servers can go even further. We'll see if it delivers on that claim in the real world. But if it can, going from 10 to 40 VMs per server would put more IT managers in Ulmer's world of extreme virtualization. It also would introduce more of them to the many problems he's dealing with.
The Sun Microsystems Sun Fire X4600 servers he uses, designed by former Sun chief server architect Andy Bechtolsheim, max out what the Advanced Micro Devices Opteron chip can do. Twenty-eight of the servers are loaded with 128 GB of memory, a big number for their vintage; six have 256 GB, going far beyond most of the current generation. In their next generation of Xeon 5500 servers, IBM and HP plan to put a maximum of 128 GB and 144 GB of memory, respectively.
Ulmer is in the process of upgrading his 128-GB servers to 256 GB. "Memory is the weak link. If you suffer memory depletion, it's the endgame for VMware's hypervisor" as it slows to a crawl, he says.
Ulmer's four- and eight-way Sun Fires are each equipped with dual- or quad-core Opteron CPUs; that's 16 or 32 cores per server. One of the few ways to make use of all those CPU cycles is by hosting multiple virtual machines. At 30 VMs per server, Ulmer's discovered that he's still only using 20% of available CPU cycles.
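Taken literally, Ulmer's figures imply enormous CPU headroom. A quick illustrative calculation (the inputs are the article's numbers; the "ceiling" is a naive extrapolation that ignores the memory and I/O limits discussed below):

```python
# Illustrative CPU-headroom math from the figures above: 32 cores,
# 30 VMs, 20% CPU utilization. Memory and I/O, not CPU, bind first.
cores = 32
vms = 30
utilization = 0.20

busy_cores = cores * utilization  # roughly 6.4 cores' worth of work for 30 VMs
cpu_ceiling = vms / utilization   # ~150 VMs if CPU cycles were the only constraint
print(busy_cores, cpu_ceiling)
```

That gap between 30 VMs and a CPU-only ceiling is exactly why the bottleneck hunt moves to memory and I/O.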
He's also bought specialized I/O hardware from a startup, Xsigo, which early on saw I/O as a potential bottleneck in virtualization. Xsigo puts converged network adapters on the server to move network and storage traffic coming from the VMs off the server and into an I/O Director, a hardware device that splits traffic up and sends it to its correct destination using 10-Gbps Ethernet pipes.
Most virtualization users let the hypervisor handle VM networking traffic, and that's a big constraint. VMware customers rely on ESX Server's vSwitch, software in the hypervisor that routes network traffic, and it taxes CPU resources far more heavily than dedicated hardware that bypasses the hypervisor. When network traffic appears, the hypervisor stops what it's doing, clears application instructions and data from the chip's pipelines and buffers, and lets the vSwitch decide where to send the traffic. Frequent packet flows to other virtual machines, network routers, and storage mean frequent interruptions of VM processing and slower operation. Ulmer believes off-loading network traffic from the hypervisor is one of the keys to increasing the number of VMs a server can run.
Because there's so much switching intelligence inside the Xsigo I/O Director, Ulmer uses just two cables from his heavily virtualized servers to the I/O device. Without the hardware device, he says, he'd end up with "a spider's den" of cables. "There'd be so many cables, I believe we'd run into human error," he says.
The I/O Challenge
If Ulmer's right and I/O is the next bottleneck holding back the number of VMs that can run on a server, then Cisco may have stolen a march on IBM, HP, Dell, and Sun as it brings converged network traffic to the virtualized server. In effect, with its Unified Computing System, Cisco is promising to do in Cisco servers and 10 Gigabit Ethernet devices what the Sun servers and Xsigo do with their own combination of hardware.
IBM and HP dispute that Cisco has gained an edge. There's no significant advantage to Cisco's approach, says Gary Thome, director of strategy and architecture for HP's blade group. HP doesn't see the data center as "a network with servers hanging off the end," he says, taking a swipe at Cisco's network orientation.
IBM will announce its own blade architecture upgrade later this year and should be able to provide a converged I/O blade without requiring customers to use nonstandard networking devices. It will do no good to multiply the number of CPU cycles if virtual machines sit idle as the hypervisor laboriously processes Ethernet packets. The goal of any high-powered blade platform is "to build a balanced system," says Rob Sauerwalt, strategic director of marketing at IBM.
For its part, Cisco has worked with VMware to produce VN-Link, a proprietary virtual network link protocol built into firmware in Cisco's UCS 6100 Series Fabric Interconnect or into switching hardware outside the blade. The 6100 Series has the management intelligence to work with VMware's vNetwork Distributed Switch, so a cluster of hypervisors can feed undifferentiated VM network traffic through the distributed switch to the converged network adapters on Cisco's blades. The adapters feed the traffic to the Cisco Fabric Interconnect, where another pre-standard protocol--Cisco's implementation of 10-Gbps Fibre Channel over Ethernet--routes it to storage or data network devices.
Essentially, Cisco has virtualized I/O outside the hypervisor. It's created high-speed Fibre Channel and Ethernet channels that can be shuffled around to meet the needs of high I/O traffic VMs rather than assigning each VM a fixed resource. Cisco's servers should be able to deal with higher volumes of network and storage traffic and VM communications with less impact on core performance. Ultimately, that can help put more VMs on a blade.
There are risks, however, particularly on the management side. VMware's vSphere will need to discover, monitor, and track virtual machines as they're commissioned, provisioned, and decommissioned. If it can't do so effectively, VMs could disappear from view but remain alive and running in the software infrastructure, possibly offering intruders a path into the system. And VMs that are supposed to run only alongside other highly secured VMs could be inadvertently moved to run with less secure ones.
Potential For Savings--Or Failure
Accenture's Ulmer knows the potential for savings from increased virtualization is great if the system can be managed effectively. When he reached 15 VMs per server, he achieved enough savings to make new deployments a wash, he says--deploying the next 15 VMs cost nothing in terms of virtualization software expense.
And by the time Ulmer got to 30 VMs per server, the cost savings paid not only for the software and hardware to virtualize the infrastructure, but also for the added hosting servers and network fabric of Accenture's outsourcing center. Accenture was able to speed up time to deployment and reduce labor costs for server management, Ulmer says. There's also an energy savings that he hasn't been able to calculate. Rapid payback has allowed Accenture to focus more of its spending on the most efficient hardware, letting it retire physical servers that are just 18 months old but not as efficient, Ulmer says.
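The economics Ulmer describes can be sketched with hypothetical numbers. The dollar figures below are invented for illustration, since the article discloses none of Accenture's actual costs:

```python
# Hypothetical consolidation economics. All dollar figures are invented
# for illustration -- the article does not disclose Accenture's costs.
def cost_per_vm(server_cost, license_cost, vms_per_server):
    """Amortized hardware-plus-virtualization cost of each hosted VM."""
    return (server_cost + license_cost) / vms_per_server

server = 40_000     # assumed cost of a big, memory-heavy host
license = 5_000     # assumed per-server virtualization licensing
standalone = 4_000  # assumed cost of one small dedicated physical server

for density in (10, 15, 30):
    print(density, round(cost_per_vm(server, license, density)))
# At low densities a hosted VM can cost more than a cheap physical box;
# past the break-even density, each additional VM is nearly free.
```

The shape of the curve, not the invented numbers, is the point: per-VM cost falls in direct proportion to density, which is why crossing a threshold like 15 VMs can suddenly make further deployments "a wash."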
Any failure in stress testing means Accenture will stick to 30 VMs. Ulmer is very mindful of those stiff SLA penalties, and of the hit Accenture's outsourcing business would take if he experimented with high VM counts only to trip over some unexpected bottleneck.
"Everyone is watching us," he says. "We don't want to fail."
Illustration by Sek Leung