Cellcli can lie to you…

Yes, that is true, and I said it: cellcli can lie to you. There are some special cases where the output of cellcli does not match reality, and you should double-check its output with your standard OS tools. This is the output of dcli calling cellcli on an Exadata rack from a client:

[root@dm01db01 ~]# dcli -g cell_group -l root cellcli -e list cell attributes cellsrvStatus,msStatus,rsStatus
dm01cel01: running       running         running
dm01cel02: running       running         running
dm01cel03: stopped       running         running
dm01cel04: running       running         running
dm01cel05: running       running         running
dm01cel06: running       running         running
dm01cel07: running       running         running
dm01cel08: running       running         running
dm01cel09: running       running         running
dm01cel10: running       running         running
dm01cel11: running       running         running
dm01cel12: running       running         running
dm01cel13: running       running         running
dm01cel14: running       running         running
[root@dm01db01 ~]#
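
With 14 cells it is easy to overlook the one odd line, so here is a minimal sketch of a filter for that output. The function name find_not_running is my own invention, and it assumes the three columns come back in the order requested above (cellsrvStatus, msStatus, rsStatus):

```shell
# Print "<cell> <service> <status>" for every service not reported as running.
# Expects dcli output lines such as "dm01cel03: stopped running running" on stdin.
# Assumption: columns are in the order cellsrvStatus, msStatus, rsStatus.
find_not_running() {
  awk '
    BEGIN { name[2] = "cellsrv"; name[3] = "ms"; name[4] = "rs" }
    # Only look at lines that start with "<cellname>:" and carry three status fields.
    NF >= 4 && $1 ~ /:$/ {
      cell = $1
      sub(/:$/, "", cell)
      for (i = 2; i <= 4; i++)
        if ($i != "running") print cell, name[i], $i
    }
  '
}
```

You would pipe the dcli command shown above into find_not_running. Keep in mind that this only tells you what cellcli claims, which is exactly the thing this post says you should not trust blindly.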

It seems that cellsrv on cel03 has stopped. Let's zoom in, log on to that cell and verify:

CellCLI> list cell attributes cellsrvStatus,msStatus,rsStatus
         stopped         running         running

Well, that is expected: same command, same output. Now let's double-check this against what is actually happening on the OS and see which processes are running:

[root@dm01cel03 trace]# ps -ef | grep cellsrv/bin/cell[srv]
root      3143  3087  0 12:00 ?        00:00:00 /opt/oracle/cell11. -interval=5 -count=720
root     21040     1 60 Mar26 ?        20:33:37 /opt/oracle/cell11. 100 5000 9 5042
root     25662     1  0 Mar26 ?        00:02:08 /opt/oracle/cell11. -ms 1 -cellsrv 1
root     25673 25662  0 Mar26 ?        00:00:07 /opt/oracle/cell11. -ms 1 -cellsrv 1
root     25674 25662  0 Mar26 ?        00:00:07 /opt/oracle/cell11. -ms 1 -cellsrv 1
root     25676 25673  0 Mar26 ?        00:00:01 /opt/oracle/cell11. -rs_conf /opt/oracle/cell11. -ms_conf /opt/oracle/cell11. -cellsrv_conf /opt/oracle/cell11. -debug 0
root     25710 25676  0 Mar26 ?        00:00:07 /opt/oracle/cell11. -rs_conf /opt/oracle/cell11. -ms_conf /opt/oracle/cell11. -cellsrv_conf /opt/oracle/cell11. -debug 0
[root@dm01cel03 trace]#

That is confusing: cellcli tells me that cellsrv is down, but “ye good ol'” ps tells me that cellsrv is up and running as it should be. It looks like my cell storage is available, so let's verify that all 12 disks from that cell are available at the ASM level and that no repair timers are counting down:

SYS@+ASM1> select count(*), repair_timer from v$asm_disk where path like '%DATA%dm05cel03' group by repair_timer;

  COUNT(*) REPAIR_TIMER
---------- ------------
        12            0

1 row selected.

All disks have a repair timer of 0, meaning that no disks have failed. If there really were a problem with the disks, we would see the repair_timer counting down.
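
The comparison between what cellcli says and what the OS says can be written down as a tiny decision function. This is only a sketch: classify_cellsrv is a hypothetical name, and how you obtain the two inputs (the cellsrvStatus value and the number of live cellsrv processes) is left to you, for example from the cellcli and ps commands shown earlier.

```shell
# classify_cellsrv STATUS PROC_COUNT
#   STATUS     - cellsrvStatus as reported by cellcli (e.g. "running", "stopped")
#   PROC_COUNT - number of cellsrv processes found at OS level (assumed to be
#                counted by the caller, e.g. from ps output)
classify_cellsrv() {
  status=$1
  procs=$2
  if [ "$status" = "running" ] && [ "$procs" -gt 0 ]; then
    echo "consistent: running"
  elif [ "$status" != "running" ] && [ "$procs" -eq 0 ]; then
    echo "consistent: not running"
  elif [ "$procs" -gt 0 ]; then
    # The case from this post: cellcli reports stopped, yet the process is alive.
    echo "mismatch: cellcli says $status but cellsrv is alive"
  else
    echo "mismatch: cellcli says running but no cellsrv process found"
  fi
}
```

For the cell in this post the call would be classify_cellsrv stopped 1, which lands in the first mismatch branch. A mismatch is the signal to go and read the alert log before touching anything.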

Now we can confirm that cellsrv is available and that the output of cellcli is just plain wrong here. So what is going on? Let's start by checking the cell alert log in $CELLTRACE:

[RS] monitoring process /opt/oracle/cell11. (pid: 0) returned with error: 126
[RS] Monitoring process for service CELLSRV detected a flood of restarts. Disable monitoring process.
Errors in file /opt/oracle/cell11.  (incident=179):
RS-7445 [CELLSRV monitor disabled] [Detected a flood of restarts] [] [] [] [] [] [] [] [] [] []
Incident details in: /opt/oracle/cell11.
Sweep [inc][179]: completed

The cellrs processes monitor the cellms and cellsrv processes, but there is flood control built in to prevent a loop of restarts, since such a loop could bring down a cell. When flood control kicks in, RS stops monitoring the problematic service, cellsrv in this case. This also means that it reports back to cellcli that the process is stopped.

Personally I think this built-in flood control is a good thing, but I would like to see cellcli report it properly. For instance, it would be nice if Oracle let cellcli set the cellsrv status to something like ‘intermediate’ when RS has stopped monitoring it; right now it says ‘stopped’, which is not true at all. It also means that when you see cellcli reporting that cellsrv is down, you always need to double-check whether this is actually true before you try restarting cellsrv.
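
Since the flood-control message in the alert log is what gives the game away, a quick scan for it can save some head scratching. Below is a minimal sketch that keys on the exact wording of the log line quoted above. The function name flooded_services is hypothetical, and the alert log file name in the usage note is an assumption on my part, so check it on your own cell.

```shell
# flooded_services FILE
# Print the name of each service for which RS logged that it detected a flood
# of restarts. Matches the wording
#   "Monitoring process for service <NAME> detected a flood of restarts"
# as seen in the alert log excerpt in this post.
flooded_services() {
  sed -n 's/.*Monitoring process for service \([A-Za-z]*\) detected a flood of restarts.*/\1/p' "$1" | sort -u
}
```

Assuming the alert log is called alert.log, you would run flooded_services "$CELLTRACE/alert.log" on the cell. Any service it prints is one that RS has stopped watching, so a ‘stopped’ status from cellcli for that service deserves a second look with ps.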
