Why we love Ceph

I’ve been deploying and playing with Ceph since before its first stable release, Argonaut, in July of 2012. During that time I’ve seen the project mature and grow in exciting ways. I have since had the opportunity to deploy several production Ceph clusters, most recently a new storage deployment for our public HPC Cloud, Penguin Computing on Demand (POD).

Why is Ceph such a good fit for POD? We wanted a storage solution that was:

Economical – the code is Open Source, and we can run it on our own hardware. We have deployed Ceph on Penguin Computing’s Icebreaker 2712 storage server, which accommodates 12 drives in a 2U form factor and is based on Intel’s Xeon 5600 processors. Going forward we are planning to use the Icebreaker 2812 storage server or our Relion 2800 server, both based on the Xeon E5-2600. These chassis offer a good drive-to-core ratio and a failure domain that we are comfortable with.

Easily expandable with scale-out behavior – We can simply add more storage servers to the cluster to gain more space. Ceph automatically rebalances data appropriately, immediately putting all available resources to use. With each additional server, we gain additional performance.

Self-healing – We are using SATA drives, and everyone knows they are going to fail, often at inconvenient times. With Ceph, failed drives don’t have to be attended to right away. Data is rebalanced early and quickly, and we don’t have to worry nearly as much about being in a “critical window” where a second or third failure could lead to data loss, as we do when rebuilding RAID arrays.

Unified – We wanted one storage system that we could focus on. Ceph provides us with both object storage and block storage. It is also tightly integrated with OpenStack, which we use in our latest POD clusters.

No Single Point of Failure – This is critical for our POD customers, and really should be a requirement of any storage system. Ceph’s architecture is highly resilient to failures, and it was built from the ground up as a fully distributed system.

It is worth noting that Ceph also provides a POSIX filesystem layer, CephFS. I’ve been following its progress and think it has real potential in the HPC space. However, it’s not ready for production yet, so I haven’t played with it as much. Once it is stable and supports multiple active metadata servers, we will definitely put it to use on POD.

A good thing to know about Ceph is that it chooses consistency over availability. Ceph tries as hard as it can to make sure that you never lose data and that you never encounter “split-brain”, a problematic situation in distributed systems where different parts of the cluster hold conflicting information about the cluster’s state. Ceph prevents this through its use of the Paxos algorithm within the Ceph monitor daemons. The ultimate reality is that a majority of your monitors must be up and in agreement for cluster I/O to proceed.
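To make the majority rule concrete, here is a small sketch (my own illustration, not code from the Ceph project) of how many monitor failures different monitor counts can tolerate:

```python
# Rough sketch of the monitor quorum rule: a majority of monitors
# (floor(n/2) + 1) must be up and in agreement for the cluster to
# accept I/O. Not Ceph code -- just the arithmetic behind the rule.

def quorum_size(num_monitors):
    """Smallest number of monitors that constitutes a majority."""
    return num_monitors // 2 + 1

def tolerated_failures(num_monitors):
    """How many monitors can fail while the cluster keeps running."""
    return num_monitors - quorum_size(num_monitors)

for n in (1, 3, 4, 5):
    print("%d monitors: quorum %d, tolerates %d failure(s)"
          % (n, quorum_size(n), tolerated_failures(n)))

# 3 and 4 monitors both tolerate only a single failure, which is why
# odd-sized monitor sets (3 or 5) are the usual recommendation.
```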

It is important to be cognizant of your failure domains when planning a Ceph cluster. How much data will the cluster have to re-replicate if you lose a drive? A chassis? A switch port? Can your network handle the resulting traffic and still leave room for client traffic? For a standard x86 architecture, I recommend having no more than 12 data drives in any one chassis. Beyond that, the impact on the rest of the cluster may be too great if a single chassis becomes unavailable. This is one of many trade-offs to consider when designing your Ceph cluster, just as there would be with any storage system.
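To put rough numbers on that concern, here is a back-of-envelope sketch; the drive count, drive size, fill level, and network bandwidth are all assumed figures for illustration, not measurements from our clusters:

```python
# Back-of-envelope estimate of re-replication traffic after losing a
# whole chassis. Every figure below is an assumption for illustration.

drives_per_chassis = 12       # assumed chassis size
drive_capacity_tb  = 3.0      # assumed drive size, TB
fill_ratio         = 0.7      # assume drives are 70% full
cluster_net_gbps   = 10.0     # assumed usable network bandwidth, Gbit/s
headroom           = 0.5      # leave half the bandwidth for client I/O

data_to_recover_tb = drives_per_chassis * drive_capacity_tb * fill_ratio
recovery_gbps      = cluster_net_gbps * headroom
seconds = data_to_recover_tb * 8 * 1000 / recovery_gbps   # TB -> Gbit

print("Data to re-replicate: %.1f TB" % data_to_recover_tb)
print("Rough recovery time:  %.1f hours" % (seconds / 3600))
```

With these assumptions, losing one 12-drive chassis means re-replicating roughly 25 TB, which keeps the cluster busy for the better part of a day; doubling the drives per chassis would double that window.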

Ceph truly is a unified storage system. It handles objects natively, and developers can access those objects from C, C++, Java, Python, Ruby, and PHP. Web developers can access Ceph through RADOS Gateway, an HTTP REST web service that presents both an Amazon S3-compatible and an OpenStack Swift-compatible API. The block device is known as RADOS Block Device (RBD): a thin-provisioned block device that can be mounted within the Linux kernel (via a kernel module), within VMs under KVM/libvirt or Xen (via librbd), or, more recently, through a new FUSE module. An RBD image is really just a collection of objects scattered throughout the cluster, so even streaming I/O to a single disk gets the benefit of accessing multiple storage servers in parallel. Finally, CephFS presents a POSIX filesystem to complete the picture. With such an active open source community, there are myriad plug-ins for connecting other projects to Ceph. Examples include the tight integration with KVM/libvirt and OpenStack, a Hadoop plugin for CephFS, and work being done on Ganesha (a user-space NFS server) to enable both NFS and pNFS on top of CephFS.
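As a small illustration of the native object interface mentioned above, here is a minimal sketch using the librados Python bindings; the pool name and object name are placeholders, and a running cluster with a readable ceph.conf and keyring is assumed:

```python
# Minimal sketch of storing and reading an object with the librados
# Python bindings. Assumes a running Ceph cluster, a readable
# /etc/ceph/ceph.conf plus keyring, and a pool named "data".
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
print("Cluster ID: %s" % cluster.get_fsid())

ioctx = cluster.open_ioctx('data')      # pool name is an assumption
try:
    ioctx.write_full('greeting', b'hello from librados')
    print(ioctx.read('greeting'))
finally:
    ioctx.close()
    cluster.shutdown()
```

The same objects written this way could just as easily be reached over HTTP through RADOS Gateway’s S3 or Swift APIs, which is what makes the “unified” label stick.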

There is a lot of momentum behind Ceph right now. CephFS isn’t ready for prime time, but the object and block layers are ready for production workloads today. Penguin Computing is excited to be in the mix and active with this fast-moving project. We have the right hardware solutions to build Ceph storage clusters of any size, and we are happy to have recently partnered with Inktank to offer full software support for Ceph.
