Can we trust the Exadata ISO image?

Recently I had to reimage a couple of Exadata compute nodes for various reasons, such as an almost simultaneous failure of two hard disks in a compute node.

After the compute node got back online with a fresh image and we had restored settings like cellip.ora, DNS, NTP and cellinit.ora, we got to the point where we needed to reinstall the Oracle components. This cluster has a Grid Infrastructure home, an 11g RDBMS home and a 12c RDBMS home. Adding the new node back to the cluster and installing the homes should not be much of an issue, since the image versions of the compute nodes are the same:

[root@dm01dbadm01 ~]# dcli -g /root/dbs_group -l root imageinfo | grep "Image version: "
dm01dbadm01: Image version:
dm01dbadm02: Image version:
dm01dbadm03: Image version:
dm01dbadm04: Image version:
dm01dbadm05: Image version:
dm01dbadm06: Image version:
dm01dbadm07: Image version:
dm01dbadm08: Image version:
[root@dm01dbadm01 ~]#

Let’s run the cluster verification utility and see if there are indeed no issues:

[oracle@dm01dbadm01 [+ASM1] ~]$ cluvfy comp peer -refnode dm01dbadm02 -n dm01dbadm02 -orainv oinstall -osdba dba | grep -B 3 -A 2 mismatched
Compatibility check: Package existence for "glibc (x86_64)" [reference node: dm04dbadm02]
Node Name Status Ref. node status Comment
------------ ------------------------ ------------------------ ----------
dm01dbadm02 glibc-2.12-1.166.el6_7.3 (x86_64) glibc-2.12-1.166.el6_7.7 (x86_64) mismatched
Package existence for "glibc (x86_64)" check failed

Compatibility check: Package existence for "glibc-devel (x86_64)" [reference node: dm04dbadm02]
Node Name Status Ref. node status Comment
------------ ------------------------ ------------------------ ----------
dm01dbadm02 glibc-devel-2.12-1.166.el6_7.3 (x86_64) glibc-devel-2.12-1.166.el6_7.7 (x86_64) mismatched
Package existence for "glibc-devel (x86_64)" check failed

That is a bit of a nasty surprise: cluvfy has found mismatches in the glibc versions between our reference node and our freshly imaged node. It looks like our re-imaged node has a lower release of glibc installed than the rest of the cluster:

[root@dm01dbadm01 ~]# dcli -g /root/dbs_group -l root "yum info glibc | grep ^Release"
dm01dbadm01: Release : 1.166.el6_7.7
dm01dbadm01: Release : 1.192.el6
dm01dbadm01: Release : 1.192.el6
dm01dbadm02: Release : 1.166.el6_7.3
dm01dbadm03: Release : 1.166.el6_7.7
dm01dbadm03: Release : 1.192.el6
dm01dbadm03: Release : 1.192.el6
dm01dbadm04: Release : 1.166.el6_7.7
dm01dbadm04: Release : 1.192.el6
dm01dbadm04: Release : 1.192.el6
dm01dbadm05: Release : 1.166.el6_7.7
dm01dbadm05: Release : 1.192.el6
dm01dbadm05: Release : 1.192.el6
dm01dbadm06: Release : 1.166.el6_7.7
dm01dbadm06: Release : 1.192.el6
dm01dbadm06: Release : 1.192.el6
dm01dbadm07: Release : 1.166.el6_7.7
dm01dbadm07: Release : 1.192.el6
dm01dbadm07: Release : 1.192.el6
dm01dbadm08: Release : 1.166.el6_7.7
dm01dbadm08: Release : 1.192.el6
dm01dbadm08: Release : 1.192.el6

[root@dm01dbadm01 ~]# dcli -g /root/dbs_group -l root "yum info glibc | grep ^Version | head -1"
dm01dbadm01: Version : 2.12
dm01dbadm02: Version : 2.12
dm01dbadm03: Version : 2.12
dm01dbadm04: Version : 2.12
dm01dbadm05: Version : 2.12
dm01dbadm06: Version : 2.12
dm01dbadm07: Version : 2.12
dm01dbadm08: Version : 2.12
[root@dm01dbadm01 ~]#
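
Spotting the odd one out by eye works for eight nodes, but it can be automated. The following is a minimal sketch, not part of the original troubleshooting: find_outliers is a hypothetical helper that reads dcli-style "host: value" lines on stdin and prints every line whose value differs from the most common one. It assumes one line per host and does not resolve ties in any particular way.

```shell
# find_outliers: hypothetical helper, reads "host: value" lines on stdin
# and prints the lines whose value differs from the most common value.
find_outliers() {
    awk '
        {
            p = index($0, ": ")            # split at the first ": " only
            if (p == 0) next               # skip lines without a host prefix
            n++
            host[n] = substr($0, 1, p - 1)
            val[n]  = substr($0, p + 2)
            count[val[n]]++
        }
        END {
            best = ""; max = 0
            for (v in count)
                if (count[v] > max) { max = count[v]; best = v }
            for (i = 1; i <= n; i++)
                if (val[i] != best)
                    print host[i] ": " val[i] " (majority: " best ")"
        }'
}
```

Assuming the installed package is queried with something like dcli -g /root/dbs_group -l root "rpm -q glibc.x86_64", piping that output into find_outliers should list only the node that deviates from the rest of the rack.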

The glibc release is lower on our new node than on every other node in the rack. This customer has been using Exadata for a long time and is quite experienced in maintaining them. We have good insight into what additional software was installed, and we are sure that no installed component updated glibc to this higher release. The most logical explanation is that it was updated during patching. After some digging around we found out that the new release was indeed installed because of this glibc vulnerability (an Oracle Support account is needed):

glibc vulnerability (CVE-2015-7547) patch availability for Oracle Exadata Database Machine (Doc ID 2108582.1)

After manually installing the updated glibc, cluvfy was happy and we managed to install Grid Infrastructure again. Unfortunately our issues didn’t stop there: after successfully installing GI and the 11g RDBMS home, we tried installing the 12c RDBMS home:

[oracle@dm01dbadm01 [oracle] addnode]$ ./ -silent -waitforcompletion "CLUSTER_NEW_NODES={dm01dbadm02}"
Starting Oracle Universal Installer...

Checking Temp space: must be greater than 120 MB. Actual 13549 MB Passed
Checking swap space: must be greater than 150 MB. Actual 24176 MB Passed
[FATAL] [INS-30160] Installer has detected that the nodes [dm01dbadm02] specified for addnode operation have uncleaned inventory.
ACTION: Ensure that the inventory location /u01/app/oraInventory is cleaned before performing addnode procedure.
[oracle@dm01dbadm01 [oracle] addnode]$

Wait a minute, I am the only person working on this node and 15 minutes ago I managed to install an 11g home, and now the inventory is suddenly corrupted? A check confirmed that we did not have malformed XML in our inventory.xml or any non-printable ASCII characters in it. To completely rule out a corrupted XML file we rebuilt the Oracle Inventory, all without success. Time to debug the OUI:

[oracle@dm01dbadm01 [oracle] addnode]$ ./ -silent -waitforcompletion "CLUSTER_NEW_NODES={dm01dbadm02}" -debug -logLevel finest

Hidden somewhere in the output there are these few lines:

[main] [ 2017-02-08 11:33:17.523 CET ] [RuntimeExec.runCommand:207] runCommand: Waiting for the process
[Thread-158] [ 2017-02-08 11:33:17.523 CET ] [] In
[Thread-157] [ 2017-02-08 11:33:17.523 CET ] [] In
[Thread-158] [ 2017-02-08 11:33:17.601 CET ] [] ERROR>bash: /usr/bin/ksh: No such file or directory
[main] [ 2017-02-08 11:33:17.602 CET ] [RuntimeExec.runCommand:209] runCommand: process returns 127
[main] [ 2017-02-08 11:33:17.603 CET ] [RuntimeExec.runCommand:226] RunTimeExec: output>
[main] [ 2017-02-08 11:33:17.603 CET ] [RuntimeExec.runCommand:235] RunTimeExec: error>
[main] [ 2017-02-08 11:33:17.603 CET ] [RuntimeExec.runCommand:238] bash: /usr/bin/ksh: No such file or directory
[main] [ 2017-02-08 11:33:17.604 CET ] [RuntimeExec.runCommand:257] Returning from RunTimeExec.runCommand
[main] [ 2017-02-08 11:33:17.604 CET ] [UnixSystem.pathExists:1013] pathExists..RuntimeExec returned 127
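
Because these lines are buried in a lot of debug output, a small filter helps to surface them. This is only a sketch: missing_commands is a hypothetical helper, and it assumes the failures show up in the "bash: <path>: No such file or directory" form seen in the excerpt above. Exit status 127 is the shell's usual way of saying that a command could not be found.

```shell
# missing_commands: hypothetical helper, reads OUI debug output on stdin
# and prints each path that bash reported as missing, once.
missing_commands() {
    sed -n 's/.*bash: \(.*\): No such file or directory$/\1/p' | sort -u
}
```

Saving the debug output to a file (any name will do) and running missing_commands on it would, for the excerpt above, print just /usr/bin/ksh.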

The OUI is trying to find the Korn shell at /usr/bin/ksh, and a double check on our new image confirms that the ISO gave us the Korn shell in /bin/ksh:

[root@dm02dbadm02 ~]# ls -l /bin/ksh*
lrwxrwxrwx 1 root root 21 okt 30 02:35 /bin/ksh -> /etc/alternatives/ksh
-rwxr-xr-x 1 root root 1501800 sep 22 2015 /bin/ksh93
[root@dm02dbadm02 ~]# ls -l /etc/alternatives/ksh
lrwxrwxrwx 1 root root 10 okt 30 02:35 /etc/alternatives/ksh -> /bin/ksh93
[root@dm02dbadm02 ~]# ls -l /usr/bin/ksh
ls: cannot access /usr/bin/ksh: No such file or directory

It looks like the 12c OUI is hard-coded to look for ksh in /usr/bin. With ksh only in /bin, the 12c OUI complains about an unclean inventory instead of reporting that it cannot locate ksh. After consulting with Oracle Support, we heard that:

After ol5 to ol6 migration, we need to ensure two things:
a. ol6’s ksh rpm package is installed. And /bin/ksh exists.
b. /usr/bin/ksh exists. Otherwise, create a softlink to /bin/ksh.

Support is referring to an issue that appeared when Exadata moved from OEL5 to OEL6, which is quite a while ago. It is surprising that a fresh image has this bug, which makes it impossible to install 12c without fixing the ksh location. The easy fix is to make a softlink to ksh in /usr/bin:

[root@dm01dbadm02 ~]# ln -s /bin/ksh /usr/bin/ksh
[root@dm01dbadm02 ~]# ls -l /usr/bin/ksh
lrwxrwxrwx 1 root root 8 feb 8 18:37 /usr/bin/ksh -> /bin/ksh
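
If more than one node needs this treatment, the fix can be wrapped in a check that is safe to run repeatedly. The sketch below is an illustration under assumptions: ensure_ksh_link is a hypothetical function name, and the two paths default to the ones from Support's advice but can be overridden, which makes it possible to try it out in a scratch directory first.

```shell
# ensure_ksh_link: hypothetical helper, creates the /usr/bin/ksh softlink
# only when it is missing and the real ksh is present.
ensure_ksh_link() {
    target=${1:-/bin/ksh}        # where the image installed ksh
    link=${2:-/usr/bin/ksh}      # where the 12c OUI looks for it

    if [ -e "$link" ]; then
        echo "ok: $link already exists"
        return 0
    fi
    if [ ! -e "$target" ]; then
        echo "error: $target not found, install the ksh package first" >&2
        return 1
    fi
    ln -s "$target" "$link" && echo "created: $link -> $target"
}
```

Run as root without arguments, it either reports that /usr/bin/ksh is already there or creates the link shown above.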

Et voila… the addnode operation works and we have our 12c home back. What worries me is that we can’t trust the ISO image of an Exadata release to be up-to-date with what is being delivered by the regular patch cycles. Both the CVE issue with glibc and the ksh issue have already been patched, yet the fresh image contained neither fix.

So please, Oracle, make the Exadata ISO great again.
