Troubleshooting a Private Cloud's Ceph Cluster
Introduction
In this guide, we explain how to get started troubleshooting issues with your Private Cloud's Ceph cluster. The guide collects common troubleshooting scenarios and outlines a method for addressing each of them.
Prerequisites
Root Access to OpenStack Control Plane
Root access to your cloud's control plane nodes is required.
Get Ceph's Status
In most troubleshooting cases, you can get an overview of your Ceph cluster by checking its status with the ceph status command.
For example:
# ceph status
  cluster:
    id:     34fa49b3-fff8-4702-8b17-4e8d873c845f
    health: HEALTH_WARN
            clock skew detected on mon.focused-capybara, mon.lovely-ladybug
            2 daemons have recently crashed

  services:
    mon: 3 daemons, quorum relaxed-flamingo,focused-capybara,lovely-ladybug (age 5d)
    mgr: relaxed-flamingo(active, since 5d), standbys: focused-capybara, lovely-ladybug
    osd: 4 osds: 4 up (since 5d), 4 in (since 13d)
    rgw: 3 daemons active (focused-capybara.rgw0, lovely-ladybug.rgw0, relaxed-flamingo.rgw0)

  task status:

  data:
    pools:   13 pools, 337 pgs
    objects: 110.16k objects, 388 GiB
    usage:   1.1 TiB used, 11 TiB / 12 TiB avail
    pgs:     337 active+clean

  io:
    client:   381 KiB/s rd, 1.2 MiB/s wr, 444 op/s rd, 214 op/s wr
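If you would rather watch the cluster over time than take a single snapshot, the standard ceph -w option prints the same status block and then streams cluster log events as they occur (a minimal sketch; press Ctrl+C to stop watching):

# ceph -w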
Ceph Log Files
Ceph's log files are stored in /var/log/ceph/ on each control plane node.
For example, the following lists all log files for host focused-capybara:
# ls -1 /var/log/ceph/*.log
/var/log/ceph/ceph.audit.log
/var/log/ceph/ceph.log
/var/log/ceph/ceph-mgr.focused-capybara.log
/var/log/ceph/ceph-mon.focused-capybara.log
/var/log/ceph/ceph-osd.1.log
/var/log/ceph/ceph-rgw-focused-capybara.rgw0.log
/var/log/ceph/ceph-volume.log
An OpenMetal Ceph cluster is composed of several services: Ceph's Manager, Monitor, OSD, and RADOSGW.
Ceph has a primary log file, log files for each service, and additional log files.
For example:
- Primary Log File: /var/log/ceph/ceph.log
- Ceph Monitor Log File: /var/log/ceph/ceph-mon.focused-capybara.log
- Ceph RADOSGW Log File: /var/log/ceph/ceph-rgw-focused-capybara.rgw0.log
If you are unsure which Ceph service's log to look through, consider starting with the primary log file, /var/log/ceph/ceph.log.
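For example, to follow the primary log in real time while you reproduce an issue (a minimal sketch using the standard tail command):

# tail -f /var/log/ceph/ceph.log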
Common Issues
Clock Skew
Ceph has a number of health checks, including one for clock skew, called MON_CLOCK_SKEW. For more, see Ceph's Health Checks guide and look for the text MON_CLOCK_SKEW. In our configuration, Ceph uses chronyd to sync each node's clock. Kolla Ansible is responsible for installing and configuring chronyd in a Docker container on each Ceph Monitor node, so to administer chronyd you must do so through Docker.
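For example, to confirm the chrony container is running on a given node (the container name chrony matches the docker commands used later in this guide):

# docker ps --filter name=chrony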
Confirm Ceph's Health
To confirm the status of this health check, execute ceph status and examine the output.
For example:
  cluster:
    id:     34fa49b3-fff8-4702-8b17-4e8d873c845f
    health: HEALTH_WARN
            clock skew detected on mon.focused-capybara, mon.lovely-ladybug
[...output truncated...]
Alternatively, execute ceph health detail to see only the status of health checks.
For example:
HEALTH_WARN clock skew detected on mon.focused-capybara, mon.lovely-ladybug
[WRN] MON_CLOCK_SKEW: clock skew detected on mon.focused-capybara, mon.lovely-ladybug
    mon.focused-capybara clock skew 0.663159s > max 0.05s (latency 0.000399254s)
    mon.lovely-ladybug clock skew 0.368233s > max 0.05s (latency 0.000385143s)
Examine Chrony Logs
From here, you may want to examine the logs of each node's running chrony Docker container.
For example:
docker logs chrony
Alternatively, consider viewing chrony's logs on the local file system, under /var/log/kolla/chrony/.
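For example (a sketch assuming the default Kolla log layout, where chrony's log files live in this directory and end in .log):

# ls /var/log/kolla/chrony/
# tail /var/log/kolla/chrony/*.log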
Addressing Clock Skew
There are a number of ways to address clock skew. In this example, we outline addressing the issue by restarting chrony on each node.
To address the MON_CLOCK_SKEW from the example output in this section, the Docker container chrony was restarted on each node. For example:
# docker restart chrony
chrony
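To restart the container on all three control plane nodes from a single host, you could loop over them with SSH (a sketch; the hostnames are the Monitor names from the earlier ceph status output, and root SSH access between nodes is assumed):

# for host in relaxed-flamingo focused-capybara lovely-ladybug; do ssh "$host" docker restart chrony; done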
Next, perform the same Ceph health check as before to confirm the status. For example:
# ceph health detail
HEALTH_OK
If the clock skew issue is no longer present, you should see the status of HEALTH_OK, assuming there are no other issues with the Ceph cluster.
Note! -- Restarting chrony may be a heavy-handed approach to addressing the issue. Consider alternatively making use of chronyc's tracking, sources, and sourcestats subcommands to diagnose clock skew issues, as shown below.
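For example, each of these subcommands can be run inside the chrony container via docker exec (a sketch using the same container name as the earlier examples):

# docker exec chrony chronyc tracking
# docker exec chrony chronyc sources
# docker exec chrony chronyc sourcestats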
References
- Ceph's Troubleshooting
- Ceph's Troubleshooting Monitors