Welcome to the Penguin Blog
In High-Performance Computing, the importance of performance cannot be stressed enough. This is even more crucial when using a pay-per-use HPC cloud service like Penguin Computing on Demand (POD), where better computational efficiency translates directly into cost savings for the end user. Achieving high performance requires an optimal combination of hardware and software components.
Our POD public HPC cloud combines non-virtualized, bare-metal compute nodes, low-latency network and fast storage with an optimized software stack and end-user applications specifically tuned for the hardware platform. Here I describe how we tuned the performance of the Weather Research and Forecasting (WRF) model on our new H30 cluster based on Sandy Bridge compute nodes.
WRF is a mesoscale numerical weather prediction system that is widely used for both atmospheric research and weather forecasting. The computational core of WRF is written in Fortran, and allows parallelism using both the distributed memory and shared memory models. The H30 cluster features compute nodes with dual Intel E5-2670 Sandy Bridge processors (hyperthreading off) and 64GB of memory. An Intel True Scale QDR Fabric provides low-latency connections, and the nodes are connected to a Ceph and NFS infrastructure via 10 Gb Ethernet.
We chose to use the Intel Composer XE compiler suite to build WRF, as this compiler can provide code optimization tailored to the processor features. For instance, the flag ‘-xAVX’ can generate Advanced Vector Extensions instructions that take advantage of the Sandy Bridge architecture.
I built version 3.5.1 of WRF using the Intel 12.1.0 compilers, following the Intel hints for improving performance. For distributed memory parallelism I used our custom build of Open MPI 1.6.4, which is tailored for POD's InfiniBand interconnect and supports the Performance Scaled Messaging (PSM) communication interface. All the other external libraries needed by WRF, for instance NetCDF, were built using the same compiler and MPI stacks. Two WRF executables were built: an MPI version leveraging distributed memory parallelism, and a hybrid MPI-OpenMP version leveraging both the distributed and shared memory models.
For testing and tuning I used the 4dbasic WRF benchmark, a test case designed to represent a typical end-user scenario for regional weather modeling. Jobs were run on 12 of the H30 compute nodes, corresponding to 192 processor cores in total.
A first run of the distributed memory executable completed in 75 minutes of wall clock time. This already represents a nice improvement in WRF performance with respect to our Westmere cluster, M40, which is based on 2.9 GHz Westmere processors. A run of the 4dbasic benchmark on the M40 cluster (16 nodes, 192 processors, optimized with ‘-xSSE4.2’) takes about 110 minutes. Thus, we get a 32% performance improvement right off the bat just by switching from Westmere to Sandy Bridge processors. This is even more impressive considering that our Sandy Bridge processors have a clock rate of only 2.6 GHz.
Having established 75 minutes as the baseline for my tests, I worked on further tuning the WRF runs. A first way to improve the performance is to optimize memory and cache usage. This is usually achieved by pinning the computing processes to a particular CPU core or socket in order to improve the memory access and reduce the cache misses. Open MPI has a number of command line options specially designed for this purpose. I thus repeated the run using the following command line:
mpirun --bycore --bind-to-core --report-bindings ./wrf.exe
This will bind each compute process to a specific core. The last option (--report-bindings) produces a printout of the CPU affinity masks for each MPI process, allowing one to check that the desired effect has been achieved. This run completed in about 65 minutes: a nice 13% improvement on the baseline.
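As a side note, the same information that --report-bindings prints can be cross-checked from /proc on any Linux node. This is a generic sketch, not part of the original run script; the process name wrf.exe is assumed for illustration:

```shell
# Read each WRF rank's CPU affinity directly from /proc to confirm
# that the binding took effect (process name assumed for illustration)
for pid in $(pgrep -f wrf.exe); do
    echo -n "PID $pid: "
    grep Cpus_allowed_list "/proc/$pid/status"
done
```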
I then went to work on the hybrid version, to see whether further gains could be extracted with a careful balance of MPI processes and OpenMP threads. In addition to adjusting the number of MPI processes and threads, the WRF input parameters for evenly distributing the simulation domain among the compute processes can have an impact on performance. Furthermore, in the case of the OpenMP version, the domain of each process can be divided into tiles; ideally, the best performance is achieved when a tile fits inside the processor cache.
The best performance of the hybrid version was obtained using 96 MPI processes, 2 threads per process and 2 tiles, with the following command line (after setting numtiles=2 in namelist.input):
mpirun --npernode 8 --bysocket --bind-to-socket --report-bindings ./wrf.exe
Here the computing processes are pinned to a CPU socket. This run also completed in 65 minutes. No further improvement on the pure distributed memory run, but still 13% better than the baseline. The main advantage of the hybrid WRF over the pure distributed memory version is the reduction in the number of MPI processes, and thus in communication overhead. This advantage probably requires jobs larger than 192 processes to become noticeable, especially when MPI runs on a fast interconnect like the POD InfiniBand fabric.
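One detail the command line above does not show is the OpenMP thread count, which is set through the environment. A minimal launch sketch for the hybrid run (paths and script structure are illustrative, not taken from the original post):

```shell
# Hybrid MPI-OpenMP launch: 8 MPI ranks per node x 2 OpenMP threads each
# fills the 16 cores of a dual 8-core node
export OMP_NUM_THREADS=2
mpirun --npernode 8 --bysocket --bind-to-socket --report-bindings ./wrf.exe
```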
A close inspection of the WRF output revealed a further area for improvement: I/O. WRF reports detailed timing for the computational steps of the run. By comparing the time spent in computations to the total run time, one can deduce the time spent doing I/O. For the 65-minute run the split turned out to be 60 minutes of computation and 5 minutes of I/O. This is already quite good, thanks to the efficiency of the POD Ceph storage, but WRF provides a way of further tuning the performance: I/O quilting.
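The arithmetic behind that split is quick to check:

```shell
# Share of the 65-minute run spent in I/O, given 60 minutes of computation
awk 'BEGIN { total=65; compute=60
             printf "I/O share: %.1f%%\n", 100 * (total - compute) / total }'
# prints "I/O share: 7.7%"
```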
With I/O quilting, some of the compute processes are dedicated to I/O. This allows the operations to proceed asynchronously, as the pure compute processes no longer have to wait on I/O completion to proceed with the computation. Of course, the drawback is that there are now fewer compute processes, so again a delicate balance has to be found between the number of processes devolved to I/O and those dedicated to the computation.
By assigning 2 processes to I/O I was able to further reduce the run time to 62 minutes, a 17% improvement over the baseline. This also required a small modification of the command line, to ensure the two I/O processes were running on different compute nodes:
mpirun --bynode --cpus-per-proc 1 --bind-to-core --report-bindings ./wrf.exe
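For reference, the number of dedicated I/O processes is controlled by the &namelist_quilt section of namelist.input. Here is a sketch consistent with the two-I/O-process run above; these values are one possible combination inferred from the description, so check the WRF documentation for your version:

```fortran
&namelist_quilt
 nio_tasks_per_group = 2,  ! I/O tasks per quilt group
 nio_groups          = 1,  ! total I/O processes = tasks_per_group * groups
/
```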
In conclusion, achieving high-performance takes a multifaceted approach: state-of-the art hardware, optimized software stacks, and fine tuning of the end user application are all important parts of the process. At Penguin Computing we strive to offer an HPC Cloud environment that includes the best of each component.
Today Cumulus Networks came out of stealth mode and presented their product Cumulus Linux, ‘the first true, full featured Linux operating system for datacenter networking’. This launch is yet another indication that the network infrastructure landscape is changing. And while the word ‘dramatically’ is overused, it is probably applicable in this context. What we are seeing is quite likely the beginning of the end of the high-margin business model that the established vendors of networking gear have been pursuing for a long time: buy switch hardware in mass quantities at relatively low cost, develop a solid but closed software stack, offer support around the offering, and charge a huge premium for the bundle.
Customers that run data centers at extremely large scale like Google, Amazon, Microsoft, and Facebook started to turn away from these established vendors a while back. Given the massive amounts of data moved by these organizations, the offerings from tier 1 vendors were too expensive and lacked flexibility and control. While Google seems to have resorted to developing custom network switches from scratch (it was actually quite amusing to see how nobody knew what to make of Google’s Pluto switch that mysteriously appeared somewhere in the Midwest), other organizations operating extremely large scale data centers started purchasing industry-standard, bare metal hardware and added network software such as the OpenFlow based Indigo software stack. To make this all work, a significant customization and software development effort was required that really only made sense at extremely large scale.
Now with the advent of new network software players such as Cumulus and Big Switch, buying standard switch hardware and a network software stack of choice is feasible for a wider range of customers. This is yet another example of how open architectures can lower the cost of computing. We have seen how Linux as an open source OS changed the server industry, which moved from proprietary Unix based offerings to industry standard x86 hardware running Linux. We are seeing how, in the storage world, open storage software systems for the large scale such as Ceph enable customers to move to industry standard servers as storage platforms. Now we are observing the same trend in the world of switches. The open source projects that are the foundation of the new open networking model are backed by commercial organizations that provide support and drive the open source development effort, and while there certainly will be growing pains, an ecosystem is evolving that makes the new paradigm viable for the mainstream datacenter.
For Penguin the open network model represents a great opportunity. Our core expertise is to integrate standard hardware and open software stacks. We have over fifteen years of integration experience with Linux based systems and as an HPC cloud provider we have a first-hand understanding of the challenges of running a data center at very large scale. Within the next month we will be announcing our own line of network switches that we will offer with a choice of a variety of network software stacks from our partners. We will essentially be offering the ‘best of both worlds’ to our customers. Working with our software partners we will be providing a fully supported, thoroughly tested and fully integrated switch offering. This offering will be at a much lower price point than offerings from the established tier 1 vendors, give our customers a choice regarding the network stack they want to deploy and not lock them in.
So what are the high level benefits of this new paradigm of open network stacks for the mainstream?
Lower Capital Expense: Access to industry-standard, bare metal switch hardware will enable customers to leverage economies of scale, driving down capital expense.
Lower Operational Expense: The network software stacks for this hardware are based on open standards like OpenFlow. These standards are conducive to the development of an ecosystem of tools and plug-ins that enable automation and management. Offerings like Cumulus are Linux based and enable system administrators with a Linux skill set to use their expertise to configure network switches without learning a proprietary OS, and to leverage well known automation tools such as Chef or Puppet.
More Flexibility: The open nature of the network software stacks makes it easier for organizations to customize these software stacks to meet their requirements.
The Open Compute Project is taking the ‘industry standard’ approach to switch hardware one step further and is working on a design specification and a reference box for an open, top-of-rack switch that can run a variety of network software stacks. Facebook, the main driver behind OCP, is in essence taking a custom hardware approach like Google and is sharing it with the community to gain momentum and leverage economies of scale.
At Penguin we are excited about the new model of open network architectures. It is a win-win for us and our customers. Our customers can take advantage of the benefits the new model offers while still buying from a single vendor that offers enterprise level support. We at Penguin can leverage our core expertise: Integrate and deliver open systems that work for our customers.
I’ve been deploying and playing with Ceph since before its first stable release, Argonaut, in July of 2012. During that time I’ve seen the project mature and grow in exciting ways. I have since had the opportunity to deploy several production Ceph clusters, most recently a new storage deployment for our public HPC Cloud, Penguin Computing on Demand (POD).
Why is Ceph such a good fit for POD? We wanted a storage solution that was:
Economical - The code is Open Source, and we can run it on our own hardware. We have deployed Ceph on Penguin Computing's Icebreaker 2712 storage server, which accommodates 12 drives in a 2U form factor and is based on Intel's Xeon 5600 processors. Going forward we are planning to use the Icebreaker 2812 storage server or our Relion 2800 server, both based on the Xeon E5-2600. These chassis offer a good drive/core ratio, while offering a failure domain that we are comfortable with.
Easily expandable with scale-out behavior - We can simply add more storage servers to the cluster to gain more space. Ceph automatically rebalances data appropriately, immediately using all resources available. For each additional server, we gain additional performance.
Self-healing - We are using SATA drives, and everyone knows they are going to fail. Often at inconvenient times. With Ceph, failed drives don’t have to be attended to right away. Data is rebalanced early and quickly, and we don’t have to worry nearly as much about being in a “critical window” of time where a second or third failure may lead to data loss, as we do when rebuilding RAID arrays.
Unified - We wanted one storage system that we could focus on. Ceph provides us with both object storage and block storage. It is also tightly integrated with OpenStack, which we use in our latest POD clusters.
No Single Point of Failure - This is critical for our POD customers, and really should be in any storage system. The architecture of Ceph is highly resilient to failures, and it was built from the ground-up as a fully distributed system.
It is worth noting that Ceph also provides a POSIX filesystem layer, CephFS. I’ve been following its progression and think it has real potential in the HPC space. However, it’s not ready for production yet, so I haven’t played with it as much. Once it is stable and supports multiple active metadata servers, we will definitely put it to use on POD.
A good thing to know about Ceph is that it chooses consistency over availability. Ceph tries as hard as it can to make sure that you never lose data, and that you never encounter “split-brain”, a problematic issue with distributed systems where there is conflicting information about the state of the cluster. Ceph prevents this through its use of Paxos within the Ceph monitor daemons. The ultimate reality is that you need a majority of your monitors to be active and consenting for cluster I/O to proceed.
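That majority requirement is easy to make concrete: for n monitors, the quorum size is floor(n/2) + 1, which is why monitor counts are usually odd. A quick sketch:

```shell
# Quorum size (strict majority) for common Ceph monitor counts
for n in 3 5 7; do
    echo "$n monitors -> quorum of $(( n / 2 + 1 ))"
done
```

With 3 monitors you can lose one and keep serving I/O; with 5 you can lose two.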
It is important to be cognizant of your failure domains when planning a Ceph cluster. How much data will the cluster have to replicate if you lose a drive? A chassis? A switch port? Can your network handle the resulting traffic and still leave room for client traffic? For a standard x86 architecture, I recommend not having more than 12 data drives in any one chassis. Beyond that, the impact on the remaining cluster may be too great if a single chassis becomes unavailable. This is one of many trade-offs to consider when designing your Ceph cluster, just as there would be with any storage system.
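To make the failure-domain question concrete, here is a back-of-the-envelope estimate of the re-replication load when a whole chassis fails. The drive size and fill level are assumptions for illustration, not figures from this post:

```shell
# Data to re-replicate if one chassis fails
# (assumed: 12 drives of 2 TB each, cluster ~60% full)
awk 'BEGIN { drives=12; tb_per_drive=2; fill=0.6
             printf "%.1f TB to re-replicate\n", drives * tb_per_drive * fill }'
# prints "14.4 TB to re-replicate"
```

All of that data has to cross the network while client traffic continues, which is exactly why the chassis size matters.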
Ceph truly is a unified storage system. It handles objects natively, with developers able to access objects via C, C++, Java, Python, Ruby, and PHP. Web developers can access Ceph through RADOS Gateway, an HTTP REST webservice that can present both an Amazon S3 and an OpenStack Swift compatible API. The block device is known as RADOS Block Device (RBD), and is a thin-provisioned block device that can be mounted within the Linux kernel (via kernel module), within VMs with KVM/libvirt or Xen (via librbd), or recently with a new FUSE module. An RBD is really just a collection of objects scattered throughout the cluster, so even streaming I/O to a simple disk gets the benefit of accessing multiple storage servers in parallel. Finally, CephFS presents a POSIX filesystem to complete the picture. With such an active open source community, there is a myriad of plug-ins for projects to connect to Ceph. Examples here are the tight integration in KVM/libvirt and OpenStack. There is a Hadoop plugin for CephFS, and work is being done on Ganesha (a user-space NFS server) to enable both NFS and pNFS on top of CephFS.
There is a lot of momentum behind Ceph right now. CephFS isn’t ready for prime time, but the object and block layers are ready for production workloads today. Penguin Computing is excited to be in the mix and active with this fast-moving project. We have the right hardware solutions to build Ceph storage clusters of any size, and are happy to have recently partnered with Inktank to offer full software support for Ceph.
I just got back from the OpenCompute hardware technical session in Boston last week.
There was a proposal for 380V DC that was actually +/- 190V DC. The presenter stated that it is possible to control and manage 190V DC, where 380V DC presents challenges. Either Panduit or Anderson was there making the same point: they had a connector that could keep the arcing contained during a hot plug/unplug event, and it had been approved by someone.
Also present were established power supply vendors. They expressed the sentiment that there didn't seem to be a need for high voltage DC distribution in the data center. They can get similar power supply efficiency using high voltage AC inputs where the safety is much better understood. Copper wire sizes would be similar for high voltage DC as AC for similar voltage and power levels.
High voltage DC is currently used internal to power supplies where the active power factor correction circuitry drives a set of high voltage bulk capacitance and the DC outputs draw from that high voltage source. That design is ripe with opportunities to provide additional value to users of IT equipment without resorting to DC distribution to the rack.
It's a bit of chicken and egg. No equipment can accept 380V DC so no one installs 380V DC infrastructure, so no one makes equipment to use it.
Additionally, the argument that high voltage DC could eliminate UPS output stages assumes that UPSes only use batteries for energy storage. Such an assumption excludes flywheels and other UPS techniques which are equally effective, but don't feature high voltage DC.
Personally, I don't see the point. I think it's a better idea to eliminate 120V and get to 277/480V AC or 240/415V AC inside a data center so we could eliminate significant copper and transformer costs without needing an _entirely new_ infrastructure. Just use the existing power cord standards we already have in place.
That, together with low-voltage (12V) distribution inside a rack as standardized by Open Rack, seems to make the most sense to me. It offers the benefits without needing massive changes in the ecosystem.
It's important to remember that the initial Open Rack proposal of three power zones with 277/480V and 48V feeds was just a Facebook example configuration designed to work with their existing V1.0 data centers. Power supply and rectifier vendors have a number of different products that could be much more interesting to customers with existing 120/208V power distribution with A/B feeds. Penguin Computing intends to work closely with these vendors to bring these products to our Enterprise, Cloud and HPC customers.
This week Penguin announced that it is now an official OCP solution provider. I know … that somebody is ‘excited’ about something is the last resort for marketing guys to convey a boring message … but we at Penguin are really excited … we believe that OCP will be here to stay and have a lasting impact on the hardware landscape (Is Open Compute a Game Changer?). I don’t want to regurgitate all the benefits of the open hardware concept (a great summary can be found at Tech Republic) but talk a little bit about why Penguin believes that OCP presents a great opportunity not only for customers but also for us. At first glance it is not that obvious how a company like Penguin that focuses on delivering hardware solutions would see OCP as an opportunity. One could argue that opening up hardware designs will lead to less potential for vendors to differentiate their offering which will result in lower profits as ‘everybody and their dog’ will be offering the same standardized hardware.
While it is true that efforts like OCP will lead to further ‘commoditization’ of hardware it takes expertise and skill to integrate this open hardware and the corresponding software, particularly in areas that tend to be more complex such as High Performance Computing. So on the one hand further ‘commoditization’ will add pressure on hardware prices … no doubt about that. On the other hand, lower prices drive more and larger deployments with inherently increasing complexity. While there are numerous providers that know how to build servers, the air gets thinner when it comes to solution providers that really know how to make things work. And that is exactly what we at Penguin have been doing for the last 15 years … deliver trusted scalable Linux solutions that work. And because delivering quality solutions rather than pushing boxes is our sweet spot we are excited about the Open Compute Project.
BTW ... If you want to know more about how OCP is expected to change the server market and how Penguin is embracing 'open hardware' ... an insightful article based on an interview with our CEO Charles Wuischpard was just published by The Register
Last week, I learned that old standards never die, they live on to make life more difficult for years to come.
A looong time ago, when I was in college, we had a VAX server running the VMS operating system, which had a file system naming standard that was case insensitive, allowed only a single dot between the name and the extension, and included a version number. It clearly influenced the design of the High Sierra and ISO-9660 standards for CD-ROM filesystems. And last week, while working on some hardware where I couldn't enable PXE due to customer restrictions, it bit me. In this situation, I was working with our IceBreaker 2716 servers and IceBreaker 4772 JBOD chassis configured using Nexenta software to create a cost effective, high performance VM storage subsystem. I needed the IceBreaker 2716s to be configured to meet a customer requirement of PXE being disabled in the BIOS. But to be able to work with the systems, it would have been very helpful to boot Linux with an NFS read-only root and access to a set of custom tools for firmware and diagnostics. Normally, we do that in the lab using PXE and an NFS read-only root. But with PXE disabled, that wasn't an option. There were multiple systems, so the solution needed to be quick, easy and cheap. I could reboot the systems multiple times to go into the BIOS, enable and disable PXE, but with multiple systems that would be a huge waste of time. A pile of USB keys could work, but I didn't have enough USB thumb drives on hand. What I did have were CD-ROM blanks. Cheap! Easy to replicate! Awesome!
Because I already had infrastructure using the excellent PXELINUX support from H. Peter Anvin's SYSLINUX package, it seemed like using ISOLINUX (the equivalent tool for CD-ROMs) would let me just copy the config files, kernel and initrd to a CD and go. ISOLINUX says it supports long names, so I thought it should just work. It would be a tiny image that I could write quickly, and it would leverage the infrastructure I already had configured.
I created the CD, rebooted the machine and it failed to boot because mkisofs had to munge the filenames so that "vmlinuz-2.6.18-308.13.1.el5" became "VMLINUZ_2_6_18_308_13_1.EL5;1" but I had told ISOLINUX to look for the original name. Oops.
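The transformation can be approximated in a couple of lines of shell. This is a rough sketch of what happened to the name, not mkisofs's exact mangling algorithm:

```shell
# Approximate the ISO-9660 name mangling described above: uppercase,
# hyphens to underscores, all dots but the last to underscores, ';1' version
name="vmlinuz-2.6.18-308.13.1.el5"
munged=$(echo "$name" | tr 'a-z' 'A-Z' | tr '-' '_' \
         | sed 's/\./_/g; s/_\([^_]*\)$/.\1/')";1"
echo "$munged"   # VMLINUZ_2_6_18_308_13_1.EL5;1
```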
To avoid the issue altogether I just renamed the kernel to vmlinuz and created an empty file with a file name reflecting the kernel version as a reminder for myself.
- [ppokorny@rps1 ~]$ ls -l _iso
- total 25928
- -rw-rw-r--. 1 ppokorny ppokorny 0 Jan 29 00:30 2_6_32_279.el6
- -rw-r--r--. 1 ppokorny ppokorny 22531009 Jan 28 20:26 initramfs.img
- -rw-r--r--. 1 ppokorny ppokorny 24576 Jan 29 00:53 isolinux.bin
- -rw-r--r--. 1 ppokorny ppokorny 168 Jan 29 00:53 syslinux.cfg
- -rwxr-xr-x. 1 ppokorny ppokorny 3986608 Jan 28 20:26 vmlinuz
and changed syslinux.cfg to read:
- [ppokorny@rps1 ~]$ cat _iso/syslinux.cfg
- default linux
- label linux
- kernel vmlinuz
- append initrd=initramfs.img root=nfs:192.168.54.1:/var/lib/tftpboot/centos6u3 ro readonlyroot rd_NO_MD rd_NO_LVM rd_NO_DM
Now I could get back to doing the real work...
"Software Defined Networking" products are a new breed. One can find early examples of these switches on internet auction sites or by searching for "open source switch". The latest versions of software defined networking (or SDN) give users more control over how their network is put together and how it works. This allows users to make the network an integral part of a flexible infrastructure where resources are allocated and configured in response to services being provisioned at the endpoints.
As an example of what SDN can do for users, consider one of the best and worst parts of Ethernet: Spanning Tree. This protocol allows Ethernet networks to recognize when a loop or multiple paths exist between switches, and disables redundant or parallel links to prevent packets from being repeated endlessly. While this was extremely handy 20 years ago, it now limits network engineers' ability to build high performance networks, because spanning tree (and LACP bonding) place limits on how many parallel paths there can be between switches. Contrast this with InfiniBand, where truly massive fabrics with full bandwidth between all endpoints are trivial to construct and manage.
The difference is in the way the network is managed. Ethernet requires an algorithm that can be evaluated in a distributed environment with only local information, because there is no central agent in an Ethernet network. But InfiniBand has a central agent called a subnet manager that sees all the paths in the network and can distribute and allocate traffic to make use of all the parallel links between endpoints. It does this once at connection setup and then gets out of the way, so there is no performance impact for this central intelligence.
In a similar way, SDN provides that central intelligence for an Ethernet network of switches and allows the network to make global decisions to optimize the network for the workload.
The icing on the cake is that it's cheaper too.
Last week’s fourth Open Compute summit in Santa Clara was accompanied by a huge media buzz. More participants attended than ever before and an increasing number of vendors are jumping on the bandwagon. Why is it such a ‘big deal’ that specifications for hardware are openly available? Don’t we have enough contract manufacturers and large vendors to satisfy what the market needs?
As Penguin's CTO Phil Pokorny pointed out in his AMD guest blog, motherboard designs are often a compromise. Customer requirements can be very specific. For manufacturers, removal of components and customization of motherboards is typically more expensive than supplying a ‘one size fits all’ board with features that some customers don’t need or without features that some customers would like to see. Moreover, motherboards are often radically optimized for cost. This can lead to compromises in power efficiency, reliability and specific features.
The Open AMD 3.0 OCP design specifies a ‘bare bones’ motherboard that can be configured for different use cases. The platform was designed with the input of the financial services industry and is intended to provide a ‘universal, highly re-useable common motherboard that targets 70% to 80% of enterprise infrastructure’ (OCP Project AMD Motherboard Hardware). Even though designed based on feedback from Wall Street the server flavors (HPC, Storage, General Purpose) outlined in the specification are generally applicable and should cover a large percentage of use cases in any enterprise data center.
The design offers benefits on many fronts:
Management: Having one motherboard design as a ‘common denominator’ in an enterprise data center simplifies system provisioning and system management as well as the management of an inventory of spare parts. OS images and drivers can be used across a wider range of servers.
Capital expense: The ‘bare bones’ design approach enables customers to pick and choose rather than ‘bundle purchase’ components that they don’t really need e.g. fully featured BMCs when only a subset of functionality is required. While these cost savings may seem small at the level of the individual server they add up in large scale and hyperscale deployments.
Economies of scale: With a higher level of standardization customers will benefit from better economies of scale.
Compatibility: OCP 1.0 servers deployed at Facebook were built to fit a custom rack design. OCP 2.0 designs were built to be compatible with the Open Rack specification. The Open AMD 3.0 reference design is compatible with industry standard 19’’ racks. While it makes sense to follow a “holistic design process that considers the interdependence of everything from the power grid to the gates in the chips on each motherboard” (OCP Open Rack blog), compatibility with the 19’’ de-facto industry standard will drastically accelerate the mainstream adoption of Open Compute Project server platforms.
The biggest benefit of the Open Compute Project though is its ‘openness’. It is quite likely that with OCP control over hardware designs will shift from large established vendors to a community of users and cooperating manufacturers. Analogous to the way Linux obliterated the market for ‘closed source’ UNIX implementations OCP has the potential to give established vendors a ‘run for the money’. The open design also provides a great opportunity for new players that can now build on existing open specifications and customize these specifications for specific market niches.
At Penguin we realize that OCP has the potential to turn the server market ‘upside down’. We are an active member of the OCP alliance and recently extended our Altus product line with servers built according to version 3 of the OCP specification. For Penguin Computing the bottom line is: ‘Yes, OCP is a game changer’.
At SC’12 we showcased the Micro Data Center (MDC), a new concept that AOL designed in partnership with Penguin Computing. MDCs are small, self-sufficient ‘Data Centers in a Box’ that only require external hookups for networking, power and water. There are two MDC flavors, one for outdoor use and one less ruggedized version for indoors. The outdoor MDC is housed in a 42U rack-size enclosure provided by Elliptical Mobile Solutions that is NEMA 3 rated and protects against fire, water, humidity and vandalism (the first MDC used in production survived ‘Sandy’ without a hitch). The indoor MDC is housed in a 37U rack-size enclosure from AST Modular and was designed for deployments in loosely controlled environments, as for example warehouse spaces. Each MDC contains high density servers and storage nodes from Penguin Computing’s Relion product line as well as PDUs, switches and load balancers. The outdoor MDC is cooled by a direct expansion cooling module that is integrated with the enclosure, and has an option for using air-side economization. AOL’s first outdoor MDC deployed in production is currently handling 30% of the traffic to AOL’s main site, aol.com.
So why is this so exciting? Because moving away from the traditional datacenter deployment approach to a model where capacity can be deployed in small increments wherever it is needed allows for huge cost savings and more flexibility. For applications where the MDC approach is applicable, AOL estimates over 90% cost savings. Beyond cost savings, the new model also inherently enables the distribution of compute capacity so that natural disasters don’t incapacitate an entire operation. MDCs make it easier for providers that want to offer online services in countries where privacy laws require that data is kept in-country. MDCs can also reduce reliance on commercial content delivery networks, as servers can be deployed in local vicinity to content consumers. Of course the MDC deployment model also has limitations. The MDC software architecture has to support self-sufficiency of each MDC. Applications that depend on centralized services that need to be accessible with short latencies are obviously not a good fit. Neither are ‘Big Data’ applications or distributed applications that require a multitude of services with low latency.
While the upside potential is huge, there is of course also a cost. That cost is mostly related to software. To enable the ‘self sufficiency’ of the services running in the MDC, problems like database replication, configuration management and system dependencies need to be solved. ‘Cloudifying’ services may be one way to help address these issues. Overall, MDCs are a promising approach to deploying data center capacity. MDCs could also be a good fit for small-scale HPC deployments, as HPC applications are by nature more self sufficient than large-scale enterprise applications.
HPC efficiency is a measure (expressed as a percentage) of the actual performance of an HPC system against its theoretical peak performance.
Theoretical Peak Performance
The theoretical peak performance (GFLOPS) is calculated by the following equation...
GFLOPS = nodes * ( sockets / node ) * ( cores / socket ) * GHz * ( FLOPs / cycle )
The FLOPs-per-cycle value depends on the CPU microarchitecture. The following table shows the values for some Intel and AMD CPUs.
CPU series                           FLOPs per cycle
-----------------------------------  ---------------
Intel Xeon E5-2600 (Sandy Bridge)    8
Intel Xeon E3-1200 (Ivy Bridge)      8
AMD Opteron 6200 (Bulldozer)         4
AMD Opteron 6300 (Piledriver)        4
For example, a node with two Opteron 6234 processors (2.4 GHz, 12 cores each):

1 node * ( 2 sockets / node ) * ( 12 cores / socket ) * 2.4 GHz * ( 4 FLOPs / cycle ) = 230.4 GFLOPS
The actual performance can be found by running XHPL. For more information, see...
How To Install / Configure / Execute XHPL (ACML) for AMD FMA4
How To Install / Configure / Execute XHPL (MKL) for Intel AVX
For example, the actual performance of an Altus 1804i with dual Opteron 6234 (2.4 GHz, 12 core) and 128GB RAM (with the HPL problem sized to use ~90% of memory) is...
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01C2L2      123378   160     4     6            7469.13              1.776e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0027476 ...... PASSED
================================================================================
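If you run XHPL regularly, pulling the Gflops figure out of the output can be automated. The sketch below assumes the standard HPL result-line layout, where the row starts with the test variant (e.g. ‘WR01C2L2’) and ends with the Gflops value; the function name is my own.

```python
import re

def hpl_gflops(output: str) -> float:
    """Extract the Gflops value from an HPL/XHPL result block.

    Assumes the standard layout: the result row begins with the
    test variant (e.g. 'WR01C2L2') and ends with the Gflops figure.
    """
    for line in output.splitlines():
        if re.match(r"^W[RC]\S*\s", line):
            return float(line.split()[-1])
    raise ValueError("no HPL result line found")

sample = "WR01C2L2  123378  160  4  6  7469.13  1.776e+02"
print(hpl_gflops(sample))  # 177.6
```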
The HPC efficiency is simply...
Efficiency = Actual Performance GFLOPS / Theoretical Peak Performance GFLOPS
Using the previous Altus 1804i example the HPC efficiency calculates to...
177.6 / 230.4 = 77.1 %

To increase the HPC efficiency, increase the actual performance. This can be done by tuning compilers, math libraries, shared/distributed memory settings, numactl, kernel parameters, etc.
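Putting the two formulas together, the whole calculation fits in a few lines. This is just a sketch of the arithmetic above, using the Altus 1804i figures from the example; the function name is my own.

```python
def peak_gflops(nodes, sockets_per_node, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical peak performance in GFLOPS."""
    return nodes * sockets_per_node * cores_per_socket * ghz * flops_per_cycle

# Dual Opteron 6234: 2 sockets x 12 cores at 2.4 GHz, 4 FLOPs/cycle
peak = peak_gflops(1, 2, 12, 2.4, 4)   # 230.4 GFLOPS
actual = 177.6                         # XHPL result (1.776e+02 GFLOPS)
efficiency = actual / peak             # ~0.771, i.e. 77.1 %
print(f"peak = {peak:.1f} GFLOPS, efficiency = {efficiency:.1%}")
```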
Why are so many people interested in this platform? The short answer is power and density. Traditionally, ARM-based processors have been used for mobile devices, where low power consumption is key. At the same time, every data center has power and cooling constraints, and more and more cloud and ‘Big Data’ applications require scale-out architectures. Our partner Calxeda is one of the first organizations to bring low-power ARM technology to the data center. The UDX1 is based on Calxeda EnergyCore SoCs (System on Chip) and can be configured with up to 48 servers and 192 cores in a 2U (we chose a 4U chassis to accommodate a larger number of drives – up to 36 3.5’’ drives). The power consumption per server is around 7W including RAM but excluding HDDs. What makes this super low power envelope possible is Calxeda’s SoC architecture that integrates the entire system logic on a single die: quad-core ARM Cortex-A9 processors including the Neon SIMD engine, dedicated logic for power management, L2 cache, BMC (accessible through SoL), PCI-E, SATA and memory controllers.
Two issues that always pop up in the context of ARM-based systems are the lack of 64-bit support, with the inherent limitation of the addressable RAM to 4GB, and the lack of applications and OSes that run on ARM. The first issue is being worked on. The next-generation processor, code-named Midway and slated for next year, will support 40-bit memory addressing, and a 64-bit architecture built on the ARM V8 architecture is scheduled for 2014. The second issue matters for applications that cannot be recompiled on the platform or for customers that need to run enterprise distributions. As far as the enterprise distribution is concerned, there is an effort to build a RHEL-based distribution for ARM. If applications cannot be recompiled, emulators that facilitate the execution of x86 code through on-the-fly binary translation (and retrieval of already translated code from a cache) could be of interest. While certainly not as fast as native execution, this type of technology could help with the transition to ARM. Also interesting … at the time of writing it looks like AMD is very likely to announce a 64-bit ARM-based micro server based on the SeaMicro platform acquired in March.
Even if the UDX1 may not be the perfect fit for your current workload, it makes sense to deploy a UDX1 now to port your applications, so you are ready when the more powerful platform hits the market.
OK. First, a note about performance testing. You don't want to use the default dd block size of 512 bytes. That's much too small to get the best performance. 1 megabyte (bs=1M) is probably the minimum I would use for streaming copy/read/write testing.
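A simple streaming write/read pass with a 1 MiB block size might look like the sketch below. The scratch path /tmp/ddtest and the 256 MiB size are placeholders; point the commands at the storage you actually want to measure and use a size well beyond your RAM for honest numbers.

```shell
# Streaming write test: 256 MiB in 1 MiB blocks.
# conv=fdatasync makes dd flush to disk before reporting throughput.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=256 conv=fdatasync

# Streaming read test of the same file.
dd if=/tmp/ddtest of=/dev/null bs=1M

# Clean up the scratch file.
rm -f /tmp/ddtest
```

For the read pass, drop the page cache first (as root: echo 3 > /proc/sys/vm/drop_caches), otherwise you are measuring memory bandwidth rather than disk throughput.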
Returning from the AMD Fusion Developer Summit I am sitting here at SeaTac airport pondering the information overload from the last few days … but I should take a step back … the fun leading up to the conference actually started last week when we needed to setup a demo for the show …
As you may know, Penguin released the first rack-mount server based on AMD's APU architecture last year, and we deployed a cluster of over 100 systems at Sandia National Laboratories. So the idea was to show a demo illustrating that the APU's GPU cores can be used for HPC types of workloads. My first, in hindsight admittedly naïve, thought was to modify Octave (an open source alternative to Matlab) to take advantage of AMD's clBLAS libraries. After researching a little bit, I quickly realized that I had been a bit too ambitious ...
Welcome to the Iceberg! As part of our new website and direction and looking at the kind of projects we are working on, we’ve decided to start our official Penguin blog. Going forward, this will be our way of sharing more detail on the cool developments we are working on in our chosen markets: High Performance Computing (HPC) or Supercomputing, what we are calling the Efficient Data Center, and Cloud Computing for the HPC or Big Data user. We will post material not only from our engineers and executives but also from our customers and partners.
It’s hard to believe, but Penguin is now fourteen (14) years old, and I’ve had the pleasure of being a Penguin for over five years, with nearly four in the top job. In that time, the industry has changed dramatically and continues to do so. Traditional older players have exited or have been acquired. IBM has largely exited the x86 business, Sun was acquired by Oracle and exited the HPC market, Rackable Systems was subsumed into SGI, Linux Networx was acquired by SGI, Verari filed for bankruptcy, etc. On the other hand, new players (primarily the largely Asian contract manufacturers) have entered the market targeting the very largest data center opportunities. And more recently, cloud computing threatens to upend all traditional business models. Penguin has benefited from all these changes as customers seek a trusted, established supplier who focuses on and delivers custom-built and optimized turn-key deployments; whether they be one-of-a-kind HPC clusters, custom compute and storage farms, or partial or fully outsourced HPC cloud solutions. These are big long-term trends affecting a $70B market, so we think the world is bright and that our skills and focus are in the right place at the right time.