VScaler: Debugging kolla ceph OSD issues
Check the OSD status
[root@node02-enp175s0f0 ~]# docker exec ceph_mon ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 16.00000 root default
-2 4.00000 host 10.10.20.12
5 1.00000 osd.5 down 0 1.00000
8 1.00000 osd.8 up 1.00000 1.00000
13 1.00000 osd.13 up 1.00000 1.00000
16 1.00000 osd.16 up 1.00000 1.00000
-3 4.00000 host 10.10.20.13
2 1.00000 osd.2 up 1.00000 1.00000
4 1.00000 osd.4 up 1.00000 1.00000
9 1.00000 osd.9 up 1.00000 1.00000
12 1.00000 osd.12 up 1.00000 1.00000
-4 4.00000 host 10.10.20.11
1 1.00000 osd.1 up 1.00000 1.00000
6 1.00000 osd.6 up 1.00000 1.00000
10 1.00000 osd.10 up 1.00000 1.00000
14 1.00000 osd.14 up 1.00000 1.00000
-5 4.00000 host 10.10.20.10
3 1.00000 osd.3 up 1.00000 1.00000
7 1.00000 osd.7 up 1.00000 1.00000
11 1.00000 osd.11 up 1.00000 1.00000
15 1.00000 osd.15 up 1.00000 1.00000
So we can see OSD 5 is down above. Let's look at Docker to see what's happening: the container is stuck restarting with an unexpected error (a fairly useless error message!).
[root@node02-enp175s0f0 ~]# docker ps | grep -i osd_5
5bce4c98e95a registry.vscaler.com:5000/kolla/centos-binary-ceph-osd:4.0.3 "kolla_start" 9 weeks ago Restarting (134) 13 hours ago ceph_osd_5
[root@node02-enp175s0f0 ~]# docker logs ceph_osd_5 | grep -i fail | tail -n 3
os/filestore/FileStore.cc: 2920: FAILED assert(0 == "unexpected error")
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x561f203c7ad5]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x561f203c7cb7]
OK, let's drill down some more. What physical device is backing OSD 5?
[root@node02-enp175s0f0 ~]# docker inspect ceph_osd_5 | grep -i '/var/lib/ceph/osd'
"/var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28:/var/lib/ceph/osd/ceph-5:rw",
"Source": "/var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28",
"Destination": "/var/lib/ceph/osd/ceph-5",
"/var/lib/ceph/osd/ceph-5": {},
[root@node02-enp175s0f0 ~]# df | grep /var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28
/dev/sde2 1869578324 50644684 1818933640 3% /var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28
So now we know OSD 5 is backed by /dev/sde on the host system. Let's check dmesg for I/O errors.
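The exact dmesg invocation isn't shown here; filtering the kernel log on the device (a sketch, assuming the disk is /dev/sde as identified above) pulls out the relevant errors:
dmesg | grep -i sde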
[849146.277876] sd 10:0:0:0: [sde] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[849146.285800] sd 10:0:0:0: [sde] Sense Key : Medium Error [current] [descriptor]
[849146.293209] sd 10:0:0:0: [sde] Add. Sense: Unrecovered read error - auto reallocate failed
[849146.301580] sd 10:0:0:0: [sde] CDB: Read(10) 28 00 70 88 68 b0 00 00 08 00
[849146.308550] blk_update_request: I/O error, dev sde, sector 1887987888
A lot of these errors occurred, so it looks like we have a faulty disk. Depending on the issue, either wipe the disk and re-add it to Ceph, or replace the disk and re-add it; a sketch of the removal steps is below.
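For the remove-and-re-add path, the dead OSD has to be taken out of the cluster and deleted from the CRUSH map before the disk is wiped or swapped. A minimal sketch of that removal (assuming osd.5 and the ceph_mon / ceph_osd_5 containers as above; double-check the OSD ID and let the cluster recover between steps as appropriate):
docker exec ceph_mon ceph osd out 5                 # mark the OSD out so data rebalances away
docker stop ceph_osd_5                              # stop the crash-looping OSD container
docker exec ceph_mon ceph osd crush remove osd.5    # remove it from the CRUSH map
docker exec ceph_mon ceph auth del osd.5            # delete its cephx key
docker exec ceph_mon ceph osd rm 5                  # remove the OSD entry itself
After the disk has been wiped or replaced, re-adding it goes through kolla's normal Ceph OSD bootstrap (typically labelling the disk for OSD bootstrap and re-running the ceph deploy), the same way the OSD was originally created.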
Side note: we can also run SMART self-tests on the disk:
smartctl -t short -a /dev/sde
smartctl -t long -a /dev/sde
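Once a self-test has had time to complete, the results can be read back from the drive's self-test log, and -H gives the overall health verdict (standard smartctl usage, not from the original session):
smartctl -l selftest /dev/sde
smartctl -H /dev/sde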