VScaler: Debugging kolla ceph OSD issues

From Define Wiki

Check the OSD status

[root@node02-enp175s0f0 ~]# docker exec ceph_mon ceph osd tree
ID WEIGHT   TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 16.00000 root default                                           
-2  4.00000     host 10.10.20.12                                   
 5  1.00000         osd.5           down        0          1.00000 
 8  1.00000         osd.8             up  1.00000          1.00000 
13  1.00000         osd.13            up  1.00000          1.00000 
16  1.00000         osd.16            up  1.00000          1.00000 
-3  4.00000     host 10.10.20.13                                   
 2  1.00000         osd.2             up  1.00000          1.00000 
 4  1.00000         osd.4             up  1.00000          1.00000 
 9  1.00000         osd.9             up  1.00000          1.00000 
12  1.00000         osd.12            up  1.00000          1.00000 
-4  4.00000     host 10.10.20.11                                   
 1  1.00000         osd.1             up  1.00000          1.00000 
 6  1.00000         osd.6             up  1.00000          1.00000 
10  1.00000         osd.10            up  1.00000          1.00000 
14  1.00000         osd.14            up  1.00000          1.00000 
-5  4.00000     host 10.10.20.10                                   
 3  1.00000         osd.3             up  1.00000          1.00000 
 7  1.00000         osd.7             up  1.00000          1.00000 
11  1.00000         osd.11            up  1.00000          1.00000 
15  1.00000         osd.15            up  1.00000          1.00000
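With many OSDs, scanning the tree by eye is error-prone. A minimal sketch for filtering the table for down OSDs, shown here against a sample line captured from the output above (on a live cluster you would pipe `docker exec ceph_mon ceph osd tree` into the same awk filter):

```shell
# Print the name of any OSD whose status column reads "down".
# Live: docker exec ceph_mon ceph osd tree | awk '$4 == "down" {print $3}'
echo ' 5  1.00000         osd.5           down        0          1.00000' \
  | awk '$4 == "down" {print $3}'
```

Header, root, and host lines never have "down" in the fourth column, so the filter is safe to run over the whole tree.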

So we can see above that osd.5 is down. Let's look at Docker to see what's happening: the container is stuck restarting with an "unexpected error" (a fairly unhelpful error message!).

[root@node02-enp175s0f0 ~]# docker ps | grep -i osd_5 
5bce4c98e95a        registry.vscaler.com:5000/kolla/centos-binary-ceph-osd:4.0.3                    "kolla_start"       9 weeks ago         Restarting (134) 13 hours ago                       ceph_osd_5

[root@node02-enp175s0f0 ~]# docker logs ceph_osd_5 | grep -i fail | tail -n 3
os/filestore/FileStore.cc: 2920: FAILED assert(0 == "unexpected error")
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x561f203c7ad5]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x561f203c7cb7]
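The exit code in `Restarting (134)` is itself a clue: 134 = 128 + 6, i.e. the process died from SIGABRT, which is exactly what a failed ceph assert produces. As a sketch, the code can be pulled out of the status string shown above (live, you could instead query `docker inspect -f '{{.State.ExitCode}}' ceph_osd_5`):

```shell
# Extract the exit code from a Docker "Restarting (N) ..." status string
status='Restarting (134) 13 hours ago'
echo "$status" | sed -n 's/^Restarting (\([0-9]*\)).*/\1/p'   # 134 = 128 + SIGABRT(6)
```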

OK, let's drill down some more. What physical device is backing osd.5?

[root@node02-enp175s0f0 ~]# docker inspect ceph_osd_5 | grep -i '/var/lib/ceph/osd' 
                "/var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28:/var/lib/ceph/osd/ceph-5:rw",
                "Source": "/var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28",
                "Destination": "/var/lib/ceph/osd/ceph-5",
                "/var/lib/ceph/osd/ceph-5": {},
[root@node02-enp175s0f0 ~]# df | grep /var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28
/dev/sde2      1869578324  50644684 1818933640   3% /var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28
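The two-step lookup (container bind-mount → filesystem → device) can be scripted. As a sketch, here we take the device column from the captured `df` line and strip the partition suffix to get the whole-disk name (live, the mount path would come from `docker inspect` as above):

```shell
# From the df output: take the device column and drop the trailing
# partition number to get the underlying disk
df_line='/dev/sde2      1869578324  50644684 1818933640   3% /var/lib/ceph/osd/1d29b3f9-e8c6-406d-b881-9eb6ca878d28'
echo "$df_line" | awk '{print $1}' | sed 's/[0-9]*$//'
```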

So now we know osd.5 is backed by sde on the host system. Let's check dmesg for I/O errors.

[849146.277876] sd 10:0:0:0: [sde] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[849146.285800] sd 10:0:0:0: [sde] Sense Key : Medium Error [current] [descriptor] 
[849146.293209] sd 10:0:0:0: [sde] Add. Sense: Unrecovered read error - auto reallocate failed
[849146.301580] sd 10:0:0:0: [sde] CDB: Read(10) 28 00 70 88 68 b0 00 00 08 00
[849146.308550] blk_update_request: I/O error, dev sde, sector 1887987888
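The failing sector numbers can be pulled out of dmesg to gauge how widespread the damage is (live, you would run the same sed filter over the full `dmesg` output). A sketch against the captured line:

```shell
# Extract the failing sector number from a blk_update_request error line
line='[849146.308550] blk_update_request: I/O error, dev sde, sector 1887987888'
echo "$line" | sed -n 's/.*I\/O error, dev sde, sector \([0-9]*\).*/\1/p'
```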

A lot of these errors occurred, so it looks like we have a faulty disk. Depending on the severity, either wipe the disk and re-add it to Ceph, or replace the disk and re-add.

Side note: we can also run SMART self-tests on the disk. The -t flag starts a self-test in the background (short takes a few minutes, long can take hours); the -a flag prints the drive's current attributes and test log.

smartctl -t short -a /dev/sde
smartctl -t long -a /dev/sde
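Once a self-test completes, `smartctl -a /dev/sde` (or `smartctl -l selftest /dev/sde`) shows the results, including the overall health verdict. As a sketch, the verdict can be extracted from that line; the sample below is illustrative of smartctl's standard output format, not captured from this disk (a failing drive reads FAILED! instead of PASSED):

```shell
# Illustrative sample of smartctl's health verdict line
echo 'SMART overall-health self-assessment test result: PASSED' \
  | awk -F': ' '{print $2}'
```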