Advanced Rook / Ceph troubleshooting
In my last post about rook & ceph we talked generally about storage options on Kubernetes and how Rook & Ceph work at a high-level. Today I wanted to dive a bit deeper into day-to-day operations, and some of the things we've learned while managing storage on our clusters!
A storage system should primarily be judged based on how resilient it is to failure, and how it behaves during outages. In this case, Rook and Ceph do quite well keeping cluster storage online, with the exception that some manual effort is required to keep the system "healthy" rather than "warning".
Before we get started, we need to set up the ceph tool!
The ceph command line tool and the "tools" container
More or less vital for running a Rook/Ceph storage system, the ceph-toolbox should always be installed.
We use the ceph command-line utility so often that we store an alias for the following:

alias ceph='kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath="{.items[0].metadata.name}") -- ceph'

(Note the single outer quotes: with double quotes, the pod name would be baked in when the alias is defined, and the alias would break as soon as the tools pod was recreated.)
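If aliases give you trouble (quoting in aliases is fiddly), a small shell function does the same job. This is just a sketch of an alternative, not something Rook ships: it looks the toolbox pod up on every call, so it keeps working after the pod is recreated.

```shell
# Shell-function alternative to the alias above: resolve the toolbox pod
# name at call time, then run the ceph CLI inside it with any arguments.
ceph() {
  local toolbox
  toolbox=$(kubectl -n rook-ceph get pod -l 'app=rook-ceph-tools' \
    -o jsonpath='{.items[0].metadata.name}')
  kubectl -n rook-ceph exec -it "$toolbox" -- ceph "$@"
}
```

After sourcing this in your shell, `ceph status` works exactly like the alias.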
This allows us to run the ceph shell inside the ceph-tools container by typing ceph while using the appropriate kubectl context. For this tutorial, I'll be using ceph> to indicate commands given in the ceph shell.

Using the ceph status command, usually we'd see something like this:

ceph> status
  cluster:
    health: HEALTH_OK
  services:
    osd: 2 osds: 2 up (since 1d), 2 in (since 1d)
  data:
    pgs: 400 active+clean
HEALTH_OK really means it - there's nothing to do here and we're all green!
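Since "all green" is the state you want to verify constantly, it can be handy to script a quick check. Here's a rough sketch for a cron job or monitoring hook: it just pulls the HEALTH_* token out of the status output and complains about anything that isn't HEALTH_OK. (The sample output variable stands in for running the live command through the alias above.)

```shell
# Quick health-probe sketch: extract the HEALTH_* token from `ceph status`
# output and flag anything that isn't HEALTH_OK. In real use you'd set
# status_output=$(ceph status) instead of this canned sample.
status_output='cluster:
    health: HEALTH_WARN'
health=$(echo "$status_output" | grep -o 'HEALTH_[A-Z]*' | head -n 1)
if [ "$health" != "HEALTH_OK" ]; then
  echo "ceph needs attention: $health"
fi
```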
OSD Failure:
A common cause of issues is OSD failure. Let's say some catastrophic hardware failure means a server is simply never coming back and needs to be removed from the storage system.
You can see 2 up, 2 in - those are our OSDs, the processes which manage bits of storage in our cluster. They're not always mapped 1-to-1 with servers in your cluster, but for this example let's assume they are. One day, you check on the cluster and notice:

ceph> status
  cluster:
    health: HEALTH_WARN
            Reduced data availability: 60 pgs inactive
            Degraded data redundancy: 642/1284 objects degraded (50.000%), 60 pgs degraded, 60 pgs undersized
  services:
    osd: 2 osds: 1 up (since 2d), 1 in (since 2d)
  data:
    pgs: 100.000% pgs not active
         642/1284 objects degraded (50.000%)
         60 undersized+degraded+peered
Ack! 1 up, 1 in out of 2, and degraded (50.000%)! What happened!? Let's look at our nodes:

> kubectl get nodes
NAME STATUS ROLES AGE VERSION
worker1 Ready <none> 1d v1.16.3
worker2 NotReady <none> 1d v1.16.3
In this case, I unplugged one of my Raspberry Pis, but sometimes that server really is never coming back, so let's replace it! Assuming we can just plug in another Pi, get a glass of water, and then:
> kubectl get nodes
NAME STATUS ROLES AGE VERSION
worker1 Ready <none> 1d v1.16.3
worker2 NotReady <none> 1d v1.16.3
worker3 Ready <none> 5m v1.16.3
Adding a new OSD:
Let's edit our CephCluster resource with kubectl edit cephcluster -n rook-ceph. Under spec -> storage -> nodes, you can add your new node.

You can watch the operator pick up the change in the logs:

kubectl -n rook-ceph logs -l app=rook-ceph-operator

Or watch the namespace for the new OSD pod to come up:

kubectl -n rook-ceph get pods
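For reference, the relevant bit of the CephCluster spec looks roughly like this. The node name and device here are placeholders; use your own node's name and disk:

```yaml
# Fragment of the CephCluster resource (spec -> storage).
# "worker3" and "sda" are placeholders for your hardware.
storage:
  useAllNodes: false
  useAllDevices: false
  nodes:
    - name: worker3
      devices:
        - name: "sda"
```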
Once the OSD starts, you'll see it start to replace the fallen server:
ceph> status
  services:
    osd: 3 osds: 2 up (since 2m), 2 in (since 3m)
  data:
    pgs: 40.000% pgs not active
         177/1288 objects degraded (13.742%)
         36 active+clean
         16 undersized+degraded+remapped+backfill_wait+peered
         4  peering
         3  activating
         1  undersized+degraded+remapped+backfilling+peered
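Rather than re-running status by hand while backfilling churns away, a tiny helper can wait for things to settle. This is a hypothetical sketch, assuming a `ceph` wrapper like the alias earlier in the post:

```shell
# Hypothetical helper: poll `ceph status` until the cluster reports
# HEALTH_OK again. Assumes `ceph` is a wrapper that reaches the toolbox pod.
wait_for_health_ok() {
  until ceph status | grep -q 'HEALTH_OK'; do
    echo "still recovering, checking again in 30s..."
    sleep 30
  done
  echo "cluster is HEALTH_OK"
}
```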
And, eventually, the cluster will go HEALTH_OK once it's finished peering! You'll be left with a situation like we see above: osd: 3 osds: 2 up (since 2m), 2 in (since 3m). So, let's remove that now-dead OSD!

Removing an old OSD:
So our worker2 is never coming back. Let's make sure we know which osds existed on that system:

ceph> osd tree down
ID CLASS WEIGHT TYPE NAME STATUS
-9 0.07570 host worker2
0 ssd 0.07570 osd.0 down
So let's go ahead and remove osd.0 for good. We can do that with the following order of commands:

1. ceph osd out osd.0
2. ceph status - ensure the cluster is healthy and recovery is complete
3. kubectl -n rook-ceph delete deployment rook-ceph-osd-0
4. ceph auth del osd.0
5. ceph osd crush remove osd.0
6. ceph osd rm osd.0
7. kubectl delete node node-with-osd-0
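If you find yourself doing this more than once, the steps above can be sketched as a function. This is hypothetical glue, not an official Rook tool, and the big caveat from step 2 still applies: stop after marking the OSD out and confirm recovery finished before the destructive steps run.

```shell
# Hypothetical sketch of the removal steps for one OSD. Arguments are the
# numeric OSD id and the dead node's name. DANGER: in real use, pause after
# `ceph osd out` and confirm via `ceph status` that recovery is complete
# before running the delete/rm steps below.
remove_osd() {
  local id="$1" node="$2"
  ceph osd out "osd.${id}"
  # <-- pause here: run `ceph status` and wait for recovery to finish!
  kubectl -n rook-ceph delete deployment "rook-ceph-osd-${id}"
  ceph auth del "osd.${id}"
  ceph osd crush remove "osd.${id}"
  ceph osd rm "osd.${id}"
  kubectl delete node "${node}"
}
```

For example, `remove_osd 0 worker2` walks osd.0 and the worker2 node through the sequence above.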
Make sure you're checking ceph status often and making sure the "recovery" io is nice and finished up before moving on to other OSDs. Rook and Ceph are excellent at preventing data loss, but it's important to get comfortable with the platform before you start doing things in parallel!

Wrapping up
Rook and Ceph are fantastic tools for operators. They also serve as the backend for our storage on hosted KubeSail clusters, so you don't need to worry about any of the above! For those of you running your own clusters, here are a couple of my favorite Rook/Ceph resources and videos:
- 10 Commands Every Ceph Administrator Should Know
- Reddit.com/r/ceph
- Ceph overview (Rook Documentation)
- Ceph at CERN: A Year in the Life of a Petabyte-Scale Block Storage (Talk)
- Designing for High Performance Ceph at Scale (YouTube)
Thanks for reading, and as always, feel free to reach out on Discord if you have any questions or comments!
Stay in the loop!
Join our Discord server, give us a shout on Twitter, check out some of our GitHub repos, and be sure to join our mailing list!