Rolling back a failed rootupgrade.sh CRS upgrade

Recently I was upgrading a half-rack Exadata to Grid Infrastructure 11.2.0.4 for a customer who had one node removed from the cluster, or at least so we thought. During the upgrade we ran rootupgrade.sh on the first two nodes without issues, but when running the script on what was supposed to be the third and final node in the cluster, rootupgrade.sh failed with the following error:

CRS-1119: Unable to complete Oracle Clusterware upgrade while nodes dm0104 have not yet upgraded
CRS-1112: Failed to set the Oracle Clusterware operating version 11.2.0.4.0
CRS-4000: Command Set failed, or completed with errors.
/u01/app/11.2.0.4/grid/bin/crsctl set crs activeversion ... failed
Failed to set active version of the Grid Infrastructure at /u01/app/11.2.0.4/grid/crs/install/crsconfig_lib.pm line 9284.
/u01/app/11.2.0.4/grid/perl/bin/perl -I/u01/app/11.2.0.4/grid/perl/lib -I/u01/app/11.2.0.4/grid/crs/install /u01/app/11.2.0.4/grid/crs/install/rootcrs.pl execution failed

So what to do now? The first step should be to find the root cause of the failed upgrade, then fix the problem and re-run rootupgrade.sh if possible; otherwise, roll back the upgrade. A good starting point is to compare the active and software versions across the cluster:

[root@dm0101 ~]# dcli -g dbs_group -l root /u01/app/11.2.0.4/grid/bin/crsctl query crs activeversion
dm0101: Oracle Clusterware active version on the cluster is [11.2.0.3.0]
dm0102: Oracle Clusterware active version on the cluster is [11.2.0.3.0]
dm0103: Oracle Clusterware active version on the cluster is [11.2.0.3.0]
[root@dm0101 ~]# dcli -g dbs_group -l root /u01/app/11.2.0.4/grid/bin/crsctl query crs softwareversion
dm0101: Oracle Clusterware version on node [dm0101] is [11.2.0.4.0]
dm0102: Oracle Clusterware version on node [dm0102] is [11.2.0.4.0]
dm0103: Oracle Clusterware version on node [dm0103] is [11.2.0.4.0]
[root@dm0101 ~]#
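The root cause of a failed rootupgrade.sh is usually spelled out in the rootcrs/rootupgrade logs and the clusterware alert log on the failing node. A minimal sketch of where to look, assuming the default log locations under the new Grid home (exact file names vary by hostname and timestamp):

[root@dm0103 ~]# ls -ltr /u01/app/11.2.0.4/grid/cfgtoollogs/crsconfig/   # rootcrs_<node>.log holds the rootupgrade.sh trace
[root@dm0103 ~]# less /u01/app/11.2.0.4/grid/cfgtoollogs/crsconfig/rootcrs_dm0103.log
[root@dm0103 ~]# less /u01/app/11.2.0.4/grid/log/dm0103/alertdm0103.log  # clusterware alert log for this node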

Next, let's run cluvfy to verify the status of CRS on all nodes:

[oracle@dm0101 [] grid]$ mkdir /tmp/cvudbg
[oracle@dm0101 [] grid]$ export CV_TRACELOC=/tmp/cvudbg
[oracle@dm0101 [] grid]$ export SRVM_TRACE=true
[oracle@dm0101 [] grid]$ export SRVM_TRACE_LEVEL=1
[oracle@dm0101 [] grid]$ ./runcluvfy.sh comp crs -n all

Verifying CRS integrity

Checking CRS integrity...

WARNING:
PRVF-4038 : CRS is not installed on nodes:
dm0104
Verification will proceed with nodes:
dm0103,dm0102,dm0101


ERROR:
PRVG-10605 : Release version [11.2.0.4.0] is consistent across nodes but does not match the active version [11.2.0.3.0].
PRVG-10603 : Clusterware version consistency failed
Check failed on nodes:
dm0103,dm0102,dm0101

CRS integrity check failed

Verification of CRS integrity was unsuccessful.
Checks did not pass for the following node(s):
dm0104
[oracle@dm0101 [] grid]$
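Before rolling anything back, it is worth confirming that the leftover node really is still registered in the cluster configuration. A quick check using the binaries from the new home (dm0104 is the node named in the CRS-1119 message; this is just a sketch):

[root@dm0101 ~]# /u01/app/11.2.0.4/grid/bin/olsnodes -s -t   # lists every registered node with its status
[root@dm0101 ~]# /u01/app/11.2.0.4/grid/bin/crsctl query crs softwareversion dm0104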

In this case the error was caused by the 4th node having been only partially removed from the cluster. The fix was to roll back the upgrade, remove the 4th node properly, and then re-run rootupgrade.sh. Rolling back a failed rootupgrade.sh is done by running rootcrs.pl on the nodes in reverse order, so in our case I start with node 3 and then node 2, using the following command:

[root@dm0103 ~]# /u01/app/11.2.0.4/grid/crs/install/rootcrs.pl -downgrade -oldcrshome /u01/app/11.2.0.3/grid -version 11.2.0.3.0 -force
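The downgrade stops the clusterware stack on the node it runs on; before moving on to the next node I like to make sure nothing is left running. A simple sanity check (a sketch, any equivalent check will do):

[root@dm0103 ~]# ps -ef | egrep 'ohasd|crsd.bin|ocssd.bin' | grep -v grep   # should return nothing once the stack is down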

Node 1 (dm0101) is the last node; there we start rootcrs.pl with the -lastnode parameter. This tells rootcrs.pl to look in the $GI_HOME/cdata directory for the OCR backup that rootupgrade.sh made when it was started on the first node:

[root@dm0101 ~]# /u01/app/11.2.0.4/grid/crs/install/rootcrs.pl -downgrade -lastnode -oldcrshome /u01/app/11.2.0.3/grid -version 11.2.0.3.0 -force
Using configuration parameter file: /u01/app/11.2.0.4/grid/crs/install/crsconfig_params
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'dm0101'
CRS-2673: Attempting to stop 'ora.crsd' on 'dm0101'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'dm0101'
CRS-2673: Attempting to stop 'ora.dm0203-bk-vip.vip' on 'dm0101'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN2.lsnr' on 'dm0101'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'dm0101'
CRS-2673: Attempting to stop 'ora.elpadop.web.svc' on 'dm0101'
CRS-2673: Attempting to stop 'ora.lsfdp.lsfdpdg.svc' on 'dm0101'
CRS-2673: Attempting to stop 'ora.montyp.montypdg.svc' on 'dm0101'
...
..
.
CRS-2673: Attempting to stop 'ora.gpnpd' on 'dm0101'
CRS-2677: Stop of 'ora.gpnpd' on 'dm0101' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'dm0101' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully downgraded Oracle Clusterware stack on this node
Run '/u01/app/11.2.0.3/grid/bin/crsctl start crs' on all nodes
[root@dm0101 ~]#
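If you are curious which OCR backup the -lastnode run restores from, it sits under the new Grid home's cdata directory mentioned above. The cluster-name sub-directory and backup file names differ per environment, so the listing below is only an illustration:

[root@dm0101 ~]# ls -l /u01/app/11.2.0.4/grid/cdata/
[root@dm0101 ~]# ls -l /u01/app/11.2.0.4/grid/cdata/<clustername>/   # contains the OCR backup taken by rootupgrade.sh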

Now, to finalize the rollback, uninstall the 11.2.0.4 USM (ACFS) components and install the 11.2.0.3 ones, in the same order as you ran the rootcrs.pl scripts: start with node 3, then node 2, and end at node 1:

[root@dm0101 ~]# /u01/app/11.2.0.4/grid/bin/acfsroot uninstall
[root@dm0101 ~]# /u01/app/11.2.0.3/grid/bin/acfsroot install
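With the USM components back at 11.2.0.3, the clusterware stack can be restarted from the old home on all nodes, as the downgrade output above instructed, and the active version verified. Using dcli as before (a sketch, adjust the group file to your environment):

[root@dm0101 ~]# dcli -g dbs_group -l root /u01/app/11.2.0.3/grid/bin/crsctl start crs
[root@dm0101 ~]# dcli -g dbs_group -l root /u01/app/11.2.0.3/grid/bin/crsctl query crs activeversion   # should report 11.2.0.3.0 again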

At this point we are back at version 11.2.0.3, we can remove the pesky remains of node 4 that are still there (see the sketch below), and then we can run the rootupgrade.sh scripts again.
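For completeness: cleaning up a half-deleted node from a running 11.2 cluster typically boils down to deleting the node from the clusterware configuration and updating the inventory on the remaining nodes. A rough sketch, assuming node dm0104 and the node list below; depending on how far the original removal got, some of these steps may already have been done, so follow the node-deletion documentation for your exact situation:

[root@dm0101 ~]# /u01/app/11.2.0.3/grid/bin/crsctl delete node -n dm0104
[oracle@dm0101 ~]$ /u01/app/11.2.0.3/grid/oui/bin/runInstaller -updateNodeList ORACLE_HOME=/u01/app/11.2.0.3/grid "CLUSTER_NODES={dm0101,dm0102,dm0103}" CRS=TRUE -silent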
