Advanced Rook / Ceph troubleshooting

Seandon Mooy
Dec 11, 2019

In my last post about Rook & Ceph, we talked generally about storage options on Kubernetes and how Rook & Ceph work at a high level. Today I wanted to dive a bit deeper into day-to-day operations and some of the things we've learned while managing storage on our clusters!

A storage system should primarily be judged based on how resilient it is to failure, and how it behaves during outages. In this case, Rook and Ceph do quite well keeping cluster storage online, with the exception that some manual effort is required to keep the system "healthy" rather than "warning".

Before we get started, we need to set up the ceph tool!

The ceph command line tool and the "tools" container

The ceph-toolbox is more or less vital for running a Rook/Ceph storage system and should always be installed.

We use the ceph command-line utility so often that we store an alias for the following:

alias ceph='kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath="{.items[0].metadata.name}") -- ceph'

This allows us to enter the ceph-tools container by typing ceph while using the appropriate kubectl context. For this tutorial, I'll be using ceph> to indicate commands given in the ceph shell.

Using the ceph status command, usually we'd see something like this:

ceph> status
  cluster:
    health: HEALTH_OK

  services:
    osd: 2 osds: 2 up (since 1d), 2 in (since 1d)

  data:
    pgs:     400 active+clean

HEALTH_OK really means it - there's nothing to do here and we're all green!
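Since the health line is the first thing to look at, it's handy to check it from a script as well as interactively. Here's a minimal parsing sketch; note it runs against a captured copy of the sample output above rather than a live cluster, so the `status_output` variable just stands in for what you'd pipe in from the real `ceph status` command:

```shell
# Minimal health-check sketch: pull the health line out of `ceph status`
# output. The sample text here is a stand-in for the live command's output.
status_output='  cluster:
    health: HEALTH_OK'

# The health line looks like "health: HEALTH_OK", so field 2 is the state.
health=$(printf '%s\n' "$status_output" | awk '/health:/ {print $2}')

if [ "$health" = "HEALTH_OK" ]; then
  echo "cluster healthy"
else
  echo "cluster needs attention: $health"
fi
```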

OSD Failure:

A common cause of issues is OSD failure. Let's say some catastrophic hardware failure means a server is simply never coming back and needs to be removed from the storage system.

You can see 2 up, 2 in - those are our OSDs, the processes that manage bits of storage in our cluster. They're not always mapped 1-to-1 with servers in your cluster, but for this example let's assume they are. One day, you check on the cluster and notice:

ceph> status
  cluster:
    health: HEALTH_WARN
            Reduced data availability: 60 pgs inactive
            Degraded data redundancy: 642/1284 objects degraded (50.000%), 60 pgs degraded, 60 pgs undersized

  services:
    osd: 2 osds: 1 up (since 2d), 1 in (since 2d)

  data:
    pgs:     100.000% pgs not active
             642/1284 objects degraded (50.000%)
             60 undersized+degraded+peered

Ack! 1 up, 1 in out of 2! degraded (50.000%)! What happened!? Let's look at our nodes:

> kubectl get nodes
NAME      STATUS     ROLES    AGE    VERSION
worker1   Ready      <none>   1d     v1.16.3
worker2   NotReady   <none>   1d     v1.16.3

In this case, I unplugged one of my Raspberry Pis, but let's pretend that server is never coming back and replace it. We plug in another Pi, get a glass of water, and then:

> kubectl get nodes
NAME      STATUS     ROLES    AGE    VERSION
worker1   Ready      <none>   1d     v1.16.3
worker2   NotReady   <none>   1d     v1.16.3
worker3   Ready      <none>   5m     v1.16.3

Adding a new OSD:

Let's edit our CephCluster resource with kubectl edit cephcluster -n rook-ceph. Under spec -> storage -> nodes, you can add your new node.
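For illustration, the nodes list might end up looking something like the fragment below after adding worker3. The device names and per-node layout here are assumptions for this example - your cluster's spec will differ, so treat this as a sketch rather than a drop-in config:

```yaml
# Illustrative CephCluster fragment -- node and device names
# (worker1, worker3, "sda") are assumptions for this example.
spec:
  storage:
    useAllNodes: false
    nodes:
      - name: worker1
        devices:
          - name: "sda"
      - name: worker3   # the replacement node we just added
        devices:
          - name: "sda"
```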

You can watch the operator logs (kubectl -n rook-ceph logs -l app=rook-ceph-operator) or watch the namespace for the new OSD pod to come up (kubectl -n rook-ceph get pods).

Once the OSD starts, you'll see it start to replace the fallen server:

ceph> status
  services:
    osd: 3 osds: 2 up (since 2m), 2 in (since 3m)

  data:
    pgs:     40.000% pgs not active
             177/1288 objects degraded (13.742%)
             36 active+clean
             16 undersized+degraded+remapped+backfill_wait+peered
             4  peering
             3  activating
             1  undersized+degraded+remapped+backfilling+peered

And, eventually, the cluster will go HEALTH_OK once it's finished peering! You'll be left with a situation like we see above: osd: 3 osds: 2 up (since 2m), 2 in (since 3m). So, let's remove that now-dead OSD!
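If you want to wait for that HEALTH_OK from a script rather than re-running status by hand, a simple poll loop works. This is a sketch only: `get_health` is a stub standing in for the real `ceph health` command (which would go through the toolbox pod), so the loop body's timing and command are placeholders:

```shell
# Sketch: poll until the cluster reports HEALTH_OK.
# `get_health` is a stub -- in practice it would run `ceph health`
# via the toolbox pod, as with the alias above.
get_health() { echo "HEALTH_OK"; }

until [ "$(get_health)" = "HEALTH_OK" ]; do
  echo "waiting for recovery..."
  sleep 30
done
echo "cluster is HEALTH_OK"
```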

Removing an old OSD:

So our worker2 is never coming back. Let's make sure we know which OSDs existed on that system:

ceph> osd tree down
ID CLASS WEIGHT  TYPE NAME             STATUS
-9       0.07570 host worker2
 0   ssd 0.07570      osd.0            down
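If you're scripting this, you can pull the IDs of down OSDs out of that tree output. Again, we parse a captured sample here - in practice you'd pipe in the live `ceph osd tree down` output, and the exact column layout can vary between Ceph versions, so treat the field numbers as an assumption:

```shell
# Sketch: extract down OSD names from `ceph osd tree` output.
# The sample text stands in for the live command; field positions
# ($4 = name, $5 = status) match the sample above but may vary.
tree_output='ID CLASS WEIGHT  TYPE NAME             STATUS
-9       0.07570 host worker2
 0   ssd 0.07570      osd.0            down'

down_osds=$(printf '%s\n' "$tree_output" | awk '$4 ~ /^osd\./ && $5 == "down" {print $4}')
echo "down OSDs: $down_osds"
```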

So let's go ahead and remove osd.0 for good. We can do that with the following commands, run in order:

  1. ceph osd out osd.0
  2. ceph status, ensure cluster is healthy and recovery is complete
  3. kubectl -n rook-ceph delete deployment rook-ceph-osd-0
  4. ceph auth del osd.0
  5. ceph osd crush remove osd.0
  6. ceph osd rm osd.0
  7. kubectl delete node node-with-osd-0
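The steps above can be wrapped in a small script. This is a sketch only: it defaults to a dry run that prints each command instead of executing it, because OSD removal is destructive and the health check between steps is something you should do by hand. The OSD ID and node name are placeholders, and note that inside a script the `ceph` calls would need to be the full kubectl exec invocation rather than the interactive alias:

```shell
#!/bin/sh
# Sketch of the OSD removal sequence. DRY_RUN=1 (the default) only
# prints each command; set DRY_RUN=0 to execute them for real, and
# only once `ceph status` shows recovery is complete between steps.
OSD_ID="${OSD_ID:-0}"
NODE="${NODE:-worker2}"   # placeholder node name
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run ceph osd out "osd.$OSD_ID"
# ...check `ceph status` here and wait for recovery before continuing...
run kubectl -n rook-ceph delete deployment "rook-ceph-osd-$OSD_ID"
run ceph auth del "osd.$OSD_ID"
run ceph osd crush remove "osd.$OSD_ID"
run ceph osd rm "osd.$OSD_ID"
run kubectl delete node "$NODE"
```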

Make sure you check ceph status often and confirm the "recovery" io has finished up before moving on to other OSDs. Rook and Ceph are excellent at preventing data loss, but it's important to get comfortable with the platform before you start doing things in parallel!

Wrapping up

Rook and Ceph are fantastic tools for operators. They also serve as the backend for our storage on hosted KubeSail clusters, so you don't need to worry about any of the above! For those of you running your own clusters, here are a couple of my favorite Rook/Ceph resources and videos:

Thanks for reading, and as always, feel free to reach out on gitter if you have any questions or comments!

Stay in the loop!

Give us a shout on twitter or gitter, check out some of our GitHub repos, and be sure to join our mailing list!