Yes, that is true, I said it: cellcli can lie to you. There are some special cases where the output of cellcli does not reflect reality, and you should double-check its output with your standard OS tools. This is the output of dcli calling cellcli on an Exadata rack from a client:
[root@dm01db01 ~]# dcli -g cell_group -l root cellcli -e list cell attributes cellsrvStatus,msStatus,rsStatus
dm01cel01: running running running
dm01cel02: running running running
dm01cel03: stopped running running
dm01cel04: running running running
dm01cel05: running running running
dm01cel06: running running running
dm01cel07: running running running
dm01cel08: running running running
dm01cel09: running running running
dm01cel10: running running running
dm01cel11: running running running
dm01cel12: running running running
dm01cel13: running running running
dm01cel14: running running running
[root@dm01db01 ~]#
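With fourteen cells, the one line that differs is easy to overlook. As a rough sketch (the function name `flag_cells` is my own and not part of dcli or cellcli), a small awk filter can print only the cells where at least one of the three services is not reported as "running". It assumes the output format shown above: a cell name followed by three status words.

```shell
#!/usr/bin/env bash
# flag_cells: read "cellname: status status status" lines on stdin and
# print only the lines where any status field differs from "running".
# Hypothetical helper; assumes the dcli output layout shown above.
flag_cells() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i != "running") { print; next }
        }
    }'
}

# Example use on a compute node (same dcli call as in the post):
#   dcli -g cell_group -l root cellcli -e list cell attributes \
#       cellsrvStatus,msStatus,rsStatus | flag_cells
```

Keep in mind that this only filters what cellcli reports; it does not tell you whether the report is correct.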
It seems that cellsrv on cel03 has stopped. Let's zoom in, log on to that cell, and verify:
CellCLI> list cell attributes cellsrvStatus,msStatus,rsStatus
stopped running running
Well, that is expected: same command, same output. Now let's double-check this against what is actually happening on the OS and see which processes are really running:
[root@dm01cel03 trace]# ps -ef | grep cellsrv/bin/cell[srv]
root 3143 3087 0 12:00 ? 00:00:00 /opt/oracle/cell18.104.22.168.1_LINUX.X64_130912/cellsrv/bin/cellsrvstat -interval=5 -count=720
root 21040 1 60 Mar26 ? 20:33:37 /opt/oracle/cell22.214.171.124.1_LINUX.X64_130912/cellsrv/bin/cellsrv 100 5000 9 5042
root 25662 1 0 Mar26 ? 00:02:08 /opt/oracle/cell126.96.36.199.1_LINUX.X64_130912/cellsrv/bin/cellrssrm -ms 1 -cellsrv 1
root 25673 25662 0 Mar26 ? 00:00:07 /opt/oracle/cell188.8.131.52.1_LINUX.X64_130912/cellsrv/bin/cellrsbmt -ms 1 -cellsrv 1
root 25674 25662 0 Mar26 ? 00:00:07 /opt/oracle/cell184.108.40.206.1_LINUX.X64_130912/cellsrv/bin/cellrsmmt -ms 1 -cellsrv 1
root 25676 25673 0 Mar26 ? 00:00:01 /opt/oracle/cell220.127.116.11.1_LINUX.X64_130912/cellsrv/bin/cellrsbkm -rs_conf /opt/oracle/cell18.104.22.168.1_LINUX.X64_130912/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell22.214.171.124.1_LINUX.X64_130912/cellsrv/deploy/config/cellrsms.state -cellsrv_conf /opt/oracle/cell126.96.36.199.1_LINUX.X64_130912/cellsrv/deploy/config/cellrsos.state -debug 0
root 25710 25676 0 Mar26 ? 00:00:07 /opt/oracle/cell188.8.131.52.1_LINUX.X64_130912/cellsrv/bin/cellrssmt -rs_conf /opt/oracle/cell184.108.40.206.1_LINUX.X64_130912/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell220.127.116.11.1_LINUX.X64_130912/cellsrv/deploy/config/cellrsms.state -cellsrv_conf /opt/oracle/cell18.104.22.168.1_LINUX.X64_130912/cellsrv/deploy/config/cellrsos.state -debug 0
[root@dm01cel03 trace]#
That is confusing: cellcli tells me that cellsrv is down, but "ye good ol'" ps tells me that cellsrv is up and running as it should be (PID 21040 above). It looks like my cell storage is available, so let's verify at ASM level that all 12 disks from that cell are available and that no repair timers are counting down:
SYS@+ASM1> select count(*), repair_timer from v$asm_disk where path like '%DATA%dm05cel03' group by repair_timer;

  COUNT(*) REPAIR_TIMER
---------- ------------
        12            0

1 row selected.
All 12 disks have a repair timer of 0, meaning that no disks have failed. If there really were a problem with the disks, we would see the repair_timer counting down.
Now we can confirm that cellsrv is available and that the output of cellcli is just plain wrong here. So what is going on? Let's start by checking the cell alert log in $CELLTRACE:
[RS] monitoring process /opt/oracle/cell22.214.171.124.1_LINUX.X64_130912/cellsrv/bin/cellrsomt (pid: 0) returned with error: 126
[RS] Monitoring process for service CELLSRV detected a flood of restarts. Disable monitoring process.
Errors in file /opt/oracle/cell126.96.36.199.1_LINUX.X64_130912/log/diag/asm/cell/dm05cel03/trace/rstrc_25662_4.trc (incident=179):
RS-7445 [CELLSRV monitor disabled] [Detected a flood of restarts]
Incident details in: /opt/oracle/cell188.8.131.52.1_LINUX.X64_130912/log/diag/asm/cell/dm05cel03/incident/incdir_179/rstrc_25662_4_i179.trc
Sweep [inc]: completed
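To check quickly whether a cell has run into this condition, you can search the alert log for the two messages shown above. The following is only a sketch: the function name `rs_flood_events` is my own, and the file name `alert.log` in the usage comment is an assumption about your installation, so adjust it to wherever your cell alert log lives.

```shell
#!/usr/bin/env bash
# rs_flood_events: print alert log lines that mention the RS flood control.
# Hypothetical helper; it matches only the two message texts quoted in the
# log excerpt above ("flood of restarts" and "RS-7445").
rs_flood_events() {
    grep -iE 'flood of restarts|RS-7445' "$@"
}

# Example use on a cell (file name is an assumption):
#   rs_flood_events "$CELLTRACE/alert.log"
```

If this prints nothing, a "stopped" status from cellcli more likely has another cause and deserves a closer look.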
The cellrs processes monitor the cellms and cellsrv processes. There is flood control built in to prevent a loop of restarts, because such a loop could bring down a cell. When the flood control kicks in, RS stops monitoring the problematic service, cellsrv in this case. This also means that RS reports back to cellcli that the process is stopped. Personally, I think this built-in flood control is a good thing, but I would like to see cellcli report it properly. For instance, it would be nice if Oracle let cellcli report the cellsrv status as something like 'intermediate' when RS has stopped monitoring it. It now says 'stopped', which is not true at all. This also means that when you see cellcli reporting that cellsrv is down, you always need to double-check whether this is actually true before you try restarting cellsrv.
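That double check can be scripted. The sketch below compares the status cellcli reports with the number of cellsrv processes ps can see. The function name `classify_cellsrv` and its messages are my own invention, and the parsing assumes that `list cell attributes cellsrvStatus` prints the status as the first word of its output, as in the cellcli output earlier in this post.

```shell
#!/usr/bin/env bash
# classify_cellsrv REPORTED_STATUS PROCESS_COUNT
# Hypothetical helper: compares what cellcli reports with what the OS sees.
classify_cellsrv() {
    local reported="$1" procs="$2"
    if [ "$procs" -gt 0 ]; then
        if [ "$reported" = "running" ]; then
            echo "OK: cellsrv is running"
        else
            echo "MISMATCH: cellcli reports $reported but cellsrv is running"
        fi
    else
        if [ "$reported" = "running" ]; then
            echo "MISMATCH: cellcli reports running but no cellsrv process found"
        else
            echo "DOWN: cellsrv is not running"
        fi
    fi
}

# Only attempt the live check when cellcli exists, i.e. on a storage cell.
if command -v cellcli >/dev/null 2>&1; then
    # Assumption: the status is the first word of the first non-empty line.
    reported=$(cellcli -e "list cell attributes cellsrvStatus" | awk 'NF { print $1; exit }')
    # The [s] keeps grep from matching itself; the trailing space
    # excludes cellsrvstat.
    procs=$(ps -eo args= | grep -c 'cellsrv/bin/cell[s]rv ')
    classify_cellsrv "$reported" "$procs"
fi
```

On the cell from this post, this would report a mismatch rather than a real outage, which is the cue to look at the alert log before restarting anything.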