Oracle Big Data Appliance X4-2: First impressions

For the last two weeks we have been lucky enough to have Oracle's Big Data Appliance (BDA) in our lab/demo environment at VX Company and iRent. In this blog post I am trying to share my first experiences and some general observations. I am coming from an Oracle (Exadata) RDBMS background, so I will probably reflect some of that experience onto the BDA. The BDA we have here is a starter rack, which consists of 6 X4-2 servers, each with two sockets of 8-core Intel Xeon E5-2650 v2 processors:

[root@bda1node01 bin]# cat /proc/cpuinfo | grep "model name" | uniq
model name	: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
[root@bda1node01 bin]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2593.864
BogoMIPS:              5186.76
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
[root@bda1node01 bin]#
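
The 32 logical CPUs reported above follow directly from the topology: 2 sockets times 8 cores times 2 threads. A quick sketch that computes this from the lscpu fields, run here against the captured output above rather than a live node:

```shell
# Logical CPU count = sockets x cores/socket x threads/core,
# computed from the relevant lscpu lines.
lscpu_output='Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2'

printf '%s\n' "$lscpu_output" |
  awk -F: '/Thread/ {t=$2} /Core/ {c=$2} /Socket/ {s=$2}
           END { print t * c * s " logical CPUs" }'
```

On a live node you would simply pipe `lscpu` into the same awk filter.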

Half of the memory banks are filled with 8GB DIMMs, giving each BDA node a total of 64GB of memory:

[root@bda1node01 ~]# dmidecode  --type memory | grep Size
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
[root@bda1node01 ~]#
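
Those readings add up as expected; a quick sketch that sums the populated DIMM sizes, using sample lines from the captured output above (empty banks are skipped automatically because their second field is not numeric):

```shell
# Eight populated 8192 MB DIMMs, as in the dmidecode output above;
# one "No Module Installed" line included to show it is ignored.
dmidecode_output='Size: 8192 MB
Size: No Module Installed
Size: 8192 MB
Size: 8192 MB
Size: 8192 MB
Size: 8192 MB
Size: 8192 MB
Size: 8192 MB
Size: 8192 MB'

printf '%s\n' "$dmidecode_output" |
  awk '/Size: [0-9]+ MB/ { total += $2 }
       END { printf "%d MB (%d GB)\n", total, total / 1024 }'
```

On a live node you would pipe `dmidecode --type memory | grep Size` straight into the same awk filter.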

HDFS writes to the 12 disks in each server, mounted at /u01 through /u12:

[root@bda1node01 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2              459G   40G  395G  10% /
tmpfs                  32G   12K   32G   1% /dev/shm
/dev/md0              184M  116M   60M  67% /boot
/dev/sda4             3.1T  214M  3.1T   1% /u01
/dev/sdb4             3.1T  203M  3.1T   1% /u02
/dev/sdc1             3.6T  4.3G  3.6T   1% /u03
/dev/sdd1             3.6T  4.1G  3.6T   1% /u04
/dev/sde1             3.6T  4.2G  3.6T   1% /u05
/dev/sdf1             3.6T  3.8G  3.6T   1% /u06
/dev/sdg1             3.6T  3.4G  3.6T   1% /u07
/dev/sdh1             3.6T  3.1G  3.6T   1% /u08
/dev/sdi1             3.6T  3.9G  3.6T   1% /u09
/dev/sdj1             3.6T  3.4G  3.6T   1% /u10
/dev/sdk1             3.6T  3.2G  3.6T   1% /u11
/dev/sdl1             3.6T  3.8G  3.6T   1% /u12
cm_processes           32G  8.9M   32G   1% /var/run/cloudera-scm-agent/process
[root@bda1node01 bin]# hdparm -tT /dev/sda

 Timing cached reads:   18338 MB in  2.00 seconds = 9183.19 MB/sec
 Timing buffered disk reads:  492 MB in  3.03 seconds = 162.44 MB/sec
[root@bda1node01 bin]#
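
A similar one-liner totals the raw data-disk capacity per node from the df output (sample lines copied from above):

```shell
# Add up the "Size" column of the /u01../u12 data mounts.
# awk's string-to-number conversion turns "3.1T" into 3.1.
df_output='/dev/sda4             3.1T  214M  3.1T   1% /u01
/dev/sdb4             3.1T  203M  3.1T   1% /u02
/dev/sdc1             3.6T  4.3G  3.6T   1% /u03
/dev/sdd1             3.6T  4.1G  3.6T   1% /u04
/dev/sde1             3.6T  4.2G  3.6T   1% /u05
/dev/sdf1             3.6T  3.8G  3.6T   1% /u06
/dev/sdg1             3.6T  3.4G  3.6T   1% /u07
/dev/sdh1             3.6T  3.1G  3.6T   1% /u08
/dev/sdi1             3.6T  3.9G  3.6T   1% /u09
/dev/sdj1             3.6T  3.4G  3.6T   1% /u10
/dev/sdk1             3.6T  3.2G  3.6T   1% /u11
/dev/sdl1             3.6T  3.8G  3.6T   1% /u12'

printf '%s\n' "$df_output" |
  awk '$NF ~ /^\/u[0-9]+$/ { cnt++; total += $2 }
       END { printf "%d data disks, ~%.1fT raw per node\n", cnt, total }'
```

On a live node, pipe `df -h` directly into the same awk filter.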

Deploying Cloudera CDH on the BDA is done by Mammoth, the BDA's alternative to Exadata's OneCommand. Mammoth is used not only for deploying your rack but also for extending it with additional nodes:

[root@bda1node01 bin]# mammoth -l
INFO: Logging all actions in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20141116155709.log and traces in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20141116155709.trc
The steps in order are...
Step  1 = PreinstallChecks
Step  2 = SetupPuppet
Step  3 = PatchFactoryImage
Step  4 = CopyLicenseFiles
Step  5 = CopySoftwareSource
Step  6 = CreateLogicalVolumes
Step  7 = CreateUsers
Step  8 = SetupMountPoints
Step  9 = SetupMySQL
Step 10 = InstallHadoop
Step 11 = StartHadoopServices
Step 12 = InstallBDASoftware
Step 13 = HadoopDataEncryption
Step 14 = SetupKerberos
Step 15 = SetupEMAgent
Step 16 = SetupASR
Step 17 = CleanupInstall
Step 18 = CleanupSSHroot (Optional)
[root@bda1node01 bin]#

Interestingly, Oracle uses Puppet to deploy CDH and configure the BDA nodes. Deploying a starter rack from scratch is quick: within a few hours we had our BDA installed and running. As an Exadata 'fanboy' I also have to say a few words about the cell server processes running on the BDA:

[root@bda1node01 bin]# ps -ef | grep [c]ell
oracle    8665     1  0 Nov05 ?        00:01:38 /opt/oracle/cell/cellsrv/bin/bdsqlrssrm -ms 1 -cellsrv 1
oracle    8672  8665  0 Nov05 ?        00:02:11 /opt/oracle/cell/cellsrv/bin/bdsqlrsomt -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8673  8665  0 Nov05 ?        00:00:47 /opt/oracle/cell/cellsrv/bin/bdsqlrsbmt -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8674  8665  0 Nov05 ?        00:00:30 /opt/oracle/cell/cellsrv/bin/bdsqlrsmmt -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8675  8673  0 Nov05 ?        00:00:08 /opt/oracle/cell/cellsrv/bin/bdsqlrsbkm -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8678  8674  0 Nov05 ?        00:10:57 /usr/java/default/bin/java -Xms256m -Xmx512m -XX:-UseLargePages -Djava.library.path=/opt/oracle/cell/cellsrv/lib -Ddisable.checkForUpdate=true -jar /opt/oracle/cell/oc4j/ms/j2ee/home/oc4j.jar -out /opt/oracle/cell/cellsrv/deploy/log/ms.lst -err /opt/oracle/cell/cellsrv/deploy/log/ms.err
oracle    8707  8675  0 Nov05 ?        00:00:48 /opt/oracle/cell/cellsrv/bin/bdsqlrssmt -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8710  8672 16 Nov05 ?        1-21:15:17 /opt/oracle/cell/cellsrv/bin/bdsqlsrv 100 5000 9 5042
oracle    8966     1  0 Nov05 ?        02:18:06 /opt/oracle/bd_cell/bd_cellofl- -startup 1 0 1 5042 8710 SYS_1212099_140907 cell
[root@bda1node01 bin]#

This is, of course, Oracle's new product Big Data SQL, which is essentially cellsrv ported to run on top of Hadoop. Unfortunately we could not get Big Data SQL running yet, because the patch that is mandatory for Big Data SQL and needs to be installed on top of the RDBMS/GI homes on the Exadata side conflicts with another patch, so right now we are waiting for development to fix that issue. Strangely enough, bdscli also claims that we have a flash cache in writethrough mode on the BDA, along with a whole lot of other abnormal readings:

[root@bda1node01 bin]# bdscli -e list bdsql detail
	 name:              	 bda1node01
	 bdsVersion:        	 OSS_PT.EXADOOP2_LINUX.X64_140907.2307
	 cpuCount:          	 0
	 diagHistoryDays:   	 7
	 fanCount:          	 0/0
	 fanStatus:         	 normal
	 flashCacheMode:    	 WriteThrough
	 interconnectCount: 	 0
	 kernelVersion:     	 2.6.39-400.215.9.el6uek.x86_64
	 locatorLEDStatus:  	 unknown
	 memoryGB:          	 0
	 metricHistoryDays: 	 7
	 offloadEfficiency: 	 1.0
	 powerCount:        	 0/0
	 powerStatus:       	 normal
	 releaseTrackingBug:	 17885582
	 status:            	 online
	 temperatureReading:	 0.0
	 temperatureStatus: 	 normal
	 upTime:            	 0 days, 0:00
	 bdsqlsrvStatus:    	 running
	 bdsqlmsStatus:     	 running
	 bdsqlrsStatus:     	 running
[root@bda1node01 bin]#

I am quite sure we actually have some CPUs in here, that this machine is powered on, that there is memory installed, and that the temperature is above zero degrees. Apart from the issues with Big Data SQL (which I am sure will be resolved soon) I am impressed with the set of tools that Oracle delivers with the BDA.
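
Scanning for those abnormal readings can be automated; a small sketch that flags zero-valued attributes, run here against a few sample lines from the listing above rather than a live bdscli session:

```shell
# Flag bdscli attributes whose value is plainly zero: "0", "0.0" or "0/0".
# A quick way to spot abnormal readings without eyeballing the whole listing.
bdscli_output='cpuCount:           0
fanCount:           0/0
flashCacheMode:     WriteThrough
memoryGB:           0
temperatureReading: 0.0
upTime:             0 days, 0:00
status:             online'

printf '%s\n' "$bdscli_output" |
  awk '$2 ~ /^0([.]0|\/0)?$/ { print "suspicious zero reading:", $1, $2 }'
```

On a live cell you would feed `bdscli -e list bdsql detail` into the same filter.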


Grid Infrastructure logging changes

Most of the talk about Oracle's latest release is about the In-Memory feature, but more things have changed; for example, some essential things about logging in the Grid Infrastructure. Until now, logging for Grid Infrastructure components was done under $GI_HOME:

oracle@dm01db01(*gridinfra):/home/oracle> cd $ORACLE_HOME/log/`hostname -s`/

There we have the main alert log for GI and several subdirectories for the GI binaries, where they write their logging information into a set of rotating logfiles. After upgrading a cluster on Exadata, you will see that this alert log is empty, or filled with just a couple of lines written at the point the GI stack was upgraded:

oracle@dm01db01(*gridinfra):/u01/app/> cat alertdm01db01.log
2014-08-21 11:59:35.455
[client(9458)]CRS-0036:An error occurred while attempting to open file "UNKNOWN".
2014-08-21 11:59:35.455
[client(9458)]CRS-0004:logging terminated for the process. log file: "UNKNOWN"

So the old directories are still there, but they are not being used anymore; where can we find the logs now? Previously, all GI binaries also wrote directly to the alert.log file. Let's start by finding out where, for example, the crsd.bin process is writing to, by looking at its open file descriptors in the proc filesystem:

[root@dm01db01 ~]# cd /proc/`ps -C crsd.bin -o pid=`/fd
[root@dm01db01 fd]# ls -la *.log
[root@dm01db01 fd]# ls -la *.trc
lrwx------ 1 root root 64 Aug 21 11:41 1 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc
l-wx------ 1 root root 64 Aug 21 11:41 15 -> /u01/app/grid/diag/crs/dm01db01/crs/trace/crsd.trc
lrwx------ 1 root root 64 Aug 21 11:41 2 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc
lrwx------ 1 root root 64 Aug 21 11:41 3 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc

Crsd.bin is not writing to any logfile anymore; it is writing to a trace file in a new location: $ORACLE_BASE/diag/crs/`hostname -s`/crs/trace/, in regular ADR-formatted directories. In the old structure we had all the logfiles nicely divided into subdirectories; in this new structure everything sits in a single directory. That directory contains a lot of files, even on a freshly installed cluster node:

[root@dm01db01 trace]# ls -la | wc -l

The majority of the files are trace files from cluster commands like oifcfg; every execution is logged and traced. These files follow a -.trc naming format. All GI process traces can easily be listed (the example below is from an environment with role separation between GI and RDBMS):

[root@dm01db01 trace]# ls -la | grep [a-z].trc
-rw-rw----  1 grid     oinstall   1695134 Aug 24 15:30 crsd_oraagent_grid.trc
-rw-rw----  1 oracle   oinstall   4555557 Aug 24 15:30 crsd_oraagent_oracle.trc
-rw-rw----  1 root     oinstall   1248672 Aug 24 15:30 crsd_orarootagent_root.trc
-rw-rw----  1 oracle   oinstall   6156053 Aug 24 15:30 crsd_scriptagent_oracle.trc
-rw-rw----  1 root     oinstall   4856950 Aug 24 15:30 crsd.trc
-rw-rw----  1 grid     oinstall  10329905 Aug 24 15:31 diskmon.trc
-rw-rw----  1 grid     oinstall   2825769 Aug 24 15:31 evmd.trc
-rw-rw----  1 grid     oinstall       587 Aug 21 11:19 evmlogger.trc
-rw-rw----  1 grid     oinstall   8994527 Aug 24 15:31 gipcd.trc
-rw-rw----  1 grid     oinstall     12663 Aug 21 11:41 gpnpd.trc
-rw-rw----  1 grid     oinstall     11868 Aug 21 11:36 mdnsd.trc
-rw-rw----  1 grid     oinstall 132725348 Aug 24 15:31 ocssd.trc
-rw-rw----  1 root     oinstall   6321583 Aug 24 15:31 octssd.trc
-rw-rw----  1 root     oinstall     59185 Aug 24 14:02 ohasd_cssdagent_root.trc
-rw-rw----  1 root     oinstall     72961 Aug 24 14:38 ohasd_cssdmonitor_root.trc
-rw-rw----  1 grid     oinstall    804408 Aug 24 15:31 ohasd_oraagent_grid.trc
-rw-rw----  1 root     oinstall   1094709 Aug 24 15:31 ohasd_orarootagent_root.trc
-rw-rw----  1 root     oinstall  10384867 Aug 24 15:30 ohasd.trc
-rw-rw----  1 root     oinstall    169081 Aug 24 15:06 ologgerd.trc
-rw-rw----  1 root     oinstall   5781762 Aug 24 15:31 osysmond.trc
[root@dm01db01 trace]#

If we look at the old cluster alert log, we see that a lot of processes used to write directly to this one file:

[root@dm02db01 ~]# lsof | grep alertdm02db01.log | wc -l

That is a lot of processes writing to one file; in the new environment it is a lot less… just one:

[root@dm01db01 trace]# lsof | grep alert.log
java      23914     root  335w      REG              252,2     29668   10142552 /u01/app/grid/diag/crs/dm01db01/crs/trace/alert.log
[root@dm01db01 trace]# ps -p 23914 -o pid,args
23914 /u01/app/ -Xms128m -Xmx512m -classpath

For now this new logging structure is a lot less organized than the previous structure. It is, however, much more like regular database tracing and logging, where there is one general alert log and the different background processes like pmon, arch and lgwr write to their own trace files.
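
With everything now in one flat directory, a bit of shell helps restore an overview; for example, totalling trace volume per file owner. This sketch runs against a few lines copied from the captured listing above rather than a live node:

```shell
# Sum the size column ($5) of an `ls -la` listing, grouped by owner ($3),
# and report megabytes per owner.
ls_output='-rw-rw---- 1 grid   oinstall   1695134 Aug 24 15:30 crsd_oraagent_grid.trc
-rw-rw---- 1 oracle oinstall   4555557 Aug 24 15:30 crsd_oraagent_oracle.trc
-rw-rw---- 1 root   oinstall   1248672 Aug 24 15:30 crsd_orarootagent_root.trc
-rw-rw---- 1 grid   oinstall 132725348 Aug 24 15:31 ocssd.trc
-rw-rw---- 1 root   oinstall  10384867 Aug 24 15:30 ohasd.trc'

printf '%s\n' "$ls_output" |
  awk '{ bytes[$3] += $5 }
       END { for (u in bytes) printf "%-8s %d MB\n", u, bytes[u] / 1024 / 1024 }' |
  sort
```

In the real trace directory you would run `ls -la | awk ...` the same way.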

Inserts on HCC tables

A lot of blog posts and presentations have already covered Hybrid Columnar Compression, and I am adding one more to that list. Recently I was doing some small tests on HCC and noticed that inserts into an HCC table did not get compressed, even though I was using direct-path loads:

DBA@TEST1> create table hcc_me (text1 varchar2(4000)) compress for archive high;

Table created.

KJJ@TEST1> insert /*+ append */ into hcc_me select dbms_random.string('x',100) from dual;

1 row created.

KJJ@TEST1> commit;

Commit complete.

KJJ@TEST1> select rowid from hcc_me;


DBA@TEST1> @compress
Enter value for schemaname: kjj
Enter value for tablename: hcc_me
Enter value for rowid: AAAWw/AAAAAACXzAAA
old   2:    dbms_compression.get_compression_type(upper('&SchemaName'),upper('&TableName'),'&RowID'),
new   2:    dbms_compression.get_compression_type(upper('kjj'),upper('hcc_me'),'AAAWw/AAAAAACXzAAA'),


So my row did not get compressed; let's insert a little bit more data into our test table:

declare
  sql_stmt varchar2(200);
begin
  for i in 1..1000 loop
    sql_stmt := 'insert /*+ append_values */ into hcc_me select dbms_random.string(''x'',100) from dual';
    execute immediate sql_stmt;
    commit;  -- a direct-path insert must be committed before the next one
  end loop;
end;
/

And let's see what we end up with:

select count(*), compression_type from (
select decode(dbms_compression.get_compression_type('KJJ','HCC_ME',rowid),
       1, 'COMP_NOCOMPRESS',
       2, 'COMP_FOR_OLTP',
       4, 'COMP_FOR_QUERY_HIGH',
       8, 'COMP_FOR_QUERY_LOW',
      16, 'COMP_FOR_ARCHIVE_HIGH',
      32, 'COMP_FOR_ARCHIVE_LOW',
      64, 'COMP_BLOCK',
 1000000, 'COMP_RATIO_MINROWS') compression_type
  from hcc_me)
group by compression_type;

So none of my records got compressed:

---------- ---------------------

Maybe it is size dependent; the rows I am inserting into this HCC table are extremely small. Let's re-create the table and make every row one byte bigger than the previous row:


declare
  sql_stmt  varchar2(200);
  v_random1 varchar2(4000);
begin
  execute immediate 'drop table hcc_me';
  execute immediate 'create table hcc_me (text1 varchar2(4000)) compress for archive high';

  for i in 1..1000 loop
    v_random1 := dbms_random.string('x', i);
    sql_stmt  := 'insert /*+ append_values */ into hcc_me values (:1)';
    execute immediate sql_stmt using v_random1;
    commit;
  end loop;
end;
/

This gives me a table where row 1 is 1 byte and the last row is 1000 bytes. Let's re-run our select statement and see if we have HCC-compressed rows now:

---------- ---------------------

2 rows selected.

Victory! But now let's see where our records start to compress; let's adapt the query a bit:

select vsize(text1) row_bytes,
       decode(dbms_compression.get_compression_type('KJJ','HCC_ME',rowid),
              1, 'COMP_NOCOMPRESS',
              2, 'COMP_FOR_OLTP',
              4, 'COMP_FOR_QUERY_HIGH',
              8, 'COMP_FOR_QUERY_LOW',
             16, 'COMP_FOR_ARCHIVE_HIGH',
             32, 'COMP_FOR_ARCHIVE_LOW',
             64, 'COMP_BLOCK',
        1000000, 'COMP_RATIO_MINROWS') compression_type
  from hcc_me;

---------- ---------------------

1000 rows selected.

Alright, this is unexpected. I ran this test multiple times and ended up with different results: some rows around the 300-byte mark are getting compressed and some are not. So somewhere around 300 bytes Oracle seemingly decides at random whether or not to compress. Oh wait, randomly… I am using DBMS_RANDOM to fill my column with data, so let's take the random factor out of the equation and fill our rows with a fixed character:


declare
  sql_stmt varchar2(200);
begin
  execute immediate 'drop table hcc_me';  -- remove the previous test table
  execute immediate 'create table hcc_me (text1 varchar2(4000)) compress for archive high';

  for i in 1..1000 loop
    sql_stmt := 'insert /*+ append_values */ into hcc_me select lpad(''x'','||i||',''x'') from dual';
    execute immediate sql_stmt;
    commit;
  end loop;
end;
/

Now we end up with much more effective compression:

---------- ---------------------

1000 rows selected.

Moral of this story: be careful with HCC and small inserts. There is logic built into the HCC engine that decides whether or not to compress based on size and estimated compression ratio. If you end up in a situation where not all rows are compressed (the insert was too small, you forgot to do a direct-path load, or whatever the reason), a simple alter table move will compress all rows again.
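
For completeness, that recompression step looks like this, a minimal sketch against the hcc_me test table from above. ALTER TABLE ... MOVE rewrites the segment through a direct-path operation, so the rows end up compressed according to the table's compression setting:

```sql
-- Rewrite the segment; rows are recompressed using the table's
-- current setting (ARCHIVE HIGH for our test table).
alter table hcc_me move;

-- Or be explicit about the target compression level:
alter table hcc_me move compress for archive high;
```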

Cellcli can lie to you…

Yes, that is true, I said it: cellcli can lie to you. There are some special cases where the output of cellcli does not reflect reality, and you should double-check its output with your standard OS tools. This is the output of dcli calling cellcli on an Exadata rack from a client:

[root@dm01db01 ~]# dcli -g cell_group -l root cellcli -e list cell attributes cellsrvStatus,msStatus,rsStatus
dm01cel01: running       running         running
dm01cel02: running       running         running
dm01cel03: stopped       running         running
dm01cel04: running       running         running
dm01cel05: running       running         running
dm01cel06: running       running         running
dm01cel07: running       running         running
dm01cel08: running       running         running
dm01cel09: running       running         running
dm01cel10: running       running         running
dm01cel11: running       running         running
dm01cel12: running       running         running
dm01cel13: running       running         running
dm01cel14: running       running         running
[root@dm01db01 ~]#

It seems that cellsrv on cel03 has stopped; let's zoom in, log on to that cell and verify:

CellCLI> list cell attributes cellsrvStatus,msStatus,rsStatus
         stopped         running         running

Well, that is as expected: same command, same output. Now let's double-check this against what is actually happening on the OS. Let's see which processes are actually running at OS level:

[root@dm01cel03 trace]# ps -ef | grep cellsrv/bin/cell[srv]
root      3143  3087  0 12:00 ?        00:00:00 /opt/oracle/cell11. -interval=5 -count=720
root     21040     1 60 Mar26 ?        20:33:37 /opt/oracle/cell11. 100 5000 9 5042
root     25662     1  0 Mar26 ?        00:02:08 /opt/oracle/cell11. -ms 1 -cellsrv 1
root     25673 25662  0 Mar26 ?        00:00:07 /opt/oracle/cell11. -ms 1 -cellsrv 1
root     25674 25662  0 Mar26 ?        00:00:07 /opt/oracle/cell11. -ms 1 -cellsrv 1
root     25676 25673  0 Mar26 ?        00:00:01 /opt/oracle/cell11. -rs_conf /opt/oracle/cell11. -ms_conf /opt/oracle/cell11. -cellsrv_conf /opt/oracle/cell11. -debug 0
root     25710 25676  0 Mar26 ?        00:00:07 /opt/oracle/cell11. -rs_conf /opt/oracle/cell11. -ms_conf /opt/oracle/cell11. -cellsrv_conf /opt/oracle/cell11. -debug 0
[root@dm01cel03 trace]#

That is confusing: cellcli tells me that cellsrv is down, but good old ps is telling me that cellsrv is up and running as it should. It looks like my cell storage is available; let's verify that all 12 disks from that cell are available at ASM level and that we don't have any repair timers counting down:

SYS@+ASM1> select count(*), repair_timer from v$asm_disk where path like '%DATA%dm05cel03' group by repair_timer;

---------- ------------
        12            0

1 row selected.

All disks have a repair timer of 0, meaning that no disks have failed; if there really were a problem with the disks, we would see the repair_timer counting down.

Now we can confirm that cellsrv is available and the output of cellcli is just plain wrong here. So what is going on? Let's start by checking the cell alert log in $CELLTRACE:

[RS] monitoring process /opt/oracle/cell11. (pid: 0) returned with error: 126
[RS] Monitoring process for service CELLSRV detected a flood of restarts. Disable monitoring process.
Errors in file /opt/oracle/cell11.  (incident=179):
RS-7445 [CELLSRV monitor disabled] [Detected a flood of restarts] [] [] [] [] [] [] [] [] [] []
Incident details in: /opt/oracle/cell11.
Sweep [inc][179]: completed

The cellrs processes monitor the cellms and cellsrv processes; however, there is a flood control built in to prevent a loop of restarts. Such a loop could bring down a cell, so to prevent that from happening this flood control was added. When it kicks in, RS stops monitoring the problematic service, cellsrv in this case. This also means that it reports back to cellcli that the process is stopped. Personally I think this built-in flood control is a good thing, but I would like cellcli to report it properly. For instance, it would be nice if Oracle let cellcli report the cellsrv status as something like 'intermediate' when RS has stopped monitoring it; it now says 'stopped', which is not true at all. It also means that whenever cellcli reports that cellsrv is down, you always need to double-check whether this is actually true before you try restarting cellsrv.
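
That double-check is easy to script; a sketch of the logic follows. The two inputs are hard-coded here to mirror the situation above; in real life cellcli_says would come from `cellcli -e list cell attributes cellsrvStatus` and ps_count from `ps -ef | grep -c 'cellsrv/bin/cell[s]rv'`:

```shell
# Trust ps over cellcli when the two disagree: a "stopped" status
# combined with live cellsrv processes points at an RS monitoring
# lockout, not a real outage.
cellcli_says="stopped"   # what cellcli reports
ps_count=7               # cellsrv processes actually found at OS level

status_note=""
if [ "$cellcli_says" = "stopped" ] && [ "$ps_count" -gt 0 ]; then
  status_note="cellcli says stopped but $ps_count cellsrv processes are running -> likely an RS monitoring lockout, not a real outage"
fi
echo "$status_note"
```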

Rolling back a failed CRS upgrade

Recently I was upgrading the Grid Infrastructure on a half-rack Exadata for a customer who had 1 node removed from the cluster, at least so we thought. We ran the upgrade script on the first 2 nodes without issues, but when running it on what was supposed to be the 3rd and final node in the cluster, it failed with the following error:

CRS-1119: Unable to complete Oracle Clusterware upgrade while nodes dm0104 have not yet upgraded
CRS-1112: Failed to set the Oracle Clusterware operating version
CRS-4000: Command Set failed, or completed with errors.
/u01/app/ set crs activeversion ... failed
Failed to set active version of the Grid Infrastructure at /u01/app/ line 9284.
/u01/app/ -I/u01/app/ -I/u01/app/ /u01/app/ execution failed

So what to do now? The first step should be to find the root cause of your failed upgrade, then fix the problem and re-run the upgrade script if possible; otherwise, roll back the patch:

[root@dm0101 ~]# dcli -g dbs_group -l root /u01/app/ query crs activeversion
dm0101: Oracle Clusterware active version on the cluster is []
dm0102: Oracle Clusterware active version on the cluster is []
dm0103: Oracle Clusterware active version on the cluster is []
[root@dm0101 ~]# dcli -g dbs_group -l root /u01/app/ query crs softwareversion
dm0101: Oracle Clusterware version on node [dm0201] is []
dm0102: Oracle Clusterware version on node [dm0202] is []
dm0103: Oracle Clusterware version on node [dm0203] is []
[root@dm0101 ~]#

Let's run cluvfy to verify the status of CRS on all nodes:

[oracle@dm0101 [] grid]$ mkdir /tmp/cvudbg
[oracle@dm0101 [] grid]$ export CV_TRACELOC=/tmp/cvudbg
[oracle@dm0101 [] grid]$ export SRVM_TRACE=true
[oracle@dm0101 [] grid]$ export SRVM_TRACE_LEVEL=1
[oracle@dm0101 [] grid]$ ./ comp crs -n all

Verifying CRS integrity

Checking CRS integrity...

PRVF-4038 : CRS is not installed on nodes:
Verification will proceed with nodes:

PRVG-10605 : Release version [] is consistent across nodes but does not match the active version [].
PRVG-10603 : Clusterware version consistency failed
Check failed on nodes:

CRS integrity check failed

Verification of CRS integrity was unsuccessful.
Checks did not pass for the following node(s):
[oracle@dm0101 [] grid]$

In this case the error was caused by the 4th node being removed only partially from the cluster. The fix was to roll back the upgrade, remove the 4th node properly, and then re-run the upgrade. Rolling back a failed upgrade is done by running the downgrade on each node in reverse order, so in our case I start with node number 3 and then node 2, using the following command:

[root@dm0101 ~]# /u01/app/ -downgrade -oldcrshome /u01/app/ -version -force

Node 1 (dm0101) is the last node; on it we start the script with the parameter -lastnode. This tells the script to look in the $GI_HOME/cdata directory for the OCR backup that was made when the upgrade started on the first node:

[root@dm0101 ~]# /u01/app/ -downgrade -lastnode -oldcrshome /u01/app/ -version -force
Using configuration parameter file: /u01/app/
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'dm0101'
CRS-2673: Attempting to stop 'ora.crsd' on 'dm0101'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'dm0101'
CRS-2673: Attempting to stop '' on 'dm0101'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN2.lsnr' on 'dm0101'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'dm0101'
CRS-2673: Attempting to stop 'ora.elpadop.web.svc' on 'dm0101'
CRS-2673: Attempting to stop 'ora.lsfdp.lsfdpdg.svc' on 'dm0101'
CRS-2673: Attempting to stop 'ora.montyp.montypdg.svc' on 'dm0101'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'dm0101'
CRS-2677: Stop of 'ora.gpnpd' on 'dm0101' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'dm0101' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully downgraded Oracle Clusterware stack on this node
Run '/u01/app/ start crs' on all nodes
[root@dm0101 ~]#

Now, to finalize the rollback, uninstall and install the USM components in the same order as you ran the downgrade scripts: start with node 3, then 2, and end at node 1:

[root@dm0101 ~]# /u01/app/ uninstall
[root@dm0101 ~]# /u01/app/ install

At this point we are back at the old version, we can remove those pesky remains of node 4 properly, and we can restart the upgrade scripts again.

Info about Exadata’s imageinfo and RPMs

Recently we noticed imageinfo displaying the following information at a customer site. Imageinfo was not showing the data I expected, so here is a quick little blog post about what was going on.

[root@dm01db01 ~]# dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root "imageinfo | grep 'Image version'"
dm01db01: Image version:
dm01db02: Image version:
dm01db03: Image version:
dm01db04: Image version:
dm01db05: Image version:
dm01db06: Image version:
dm01db07: Image version:
dm01db08: Image version:

So what happened here: did those last four nodes not get updated when installing the latest CellOS QDPFE? Well, after a quick investigation it turns out that all the correct packages are installed:

[root@dm0105 ~]# for i in `yum info | grep Name | grep exadata |cut -d: -f2`; do yum info $i ; done | grep 'Version\|Name'
Name       : exadata-applyconfig
Version    :
Name       : exadata-asr
Version    :
Name       : exadata-base
Version    :
Name       : exadata-commonnode
Version    :
Name       : exadata-exachk
Version    :
Name       : exadata-firmware-compute
Version    :
Name       : exadata-ibdiagtools
Version    :
Name       : exadata-ipconf
Version    :
Name       : exadata-onecommand
Version    :
Name       : exadata-oswatcher
Version    :
Name       : exadata-sun-computenode
Version    :
Name       : exadata-validations-compute
Version    :
[root@dm0101 rpm-extraxt]#

So according to yum everything is installed; rpm -qa also confirms this, so no mismatch there:

[root@dm0105 ~]# rpm -qa | grep exadata
[root@dm0101 ~]#

It looks like imageinfo is displaying incorrect information here. How come, and where does it get its information from? Well, imageinfo reads its information from a file under /opt/oracle.cellos:

root@dm01db05 ~]# cat /opt/oracle.cellos/
created:2013-01-09 03:37:26 -0800
f331a2bfa27cc7371d936009289c23b6 *commonos.tbz
22212b4436b57fba6d53be4d7a6fd975 *dbboot.tbz
c7cf016f2180ab820010a787b77c59fb *dbfw.tbz
fd9d8b4f4a30588ca39f66015d5edc4e *dbrpms.tbz
eaa1a7973efcdcd2fe40b09828ba0e01 *debugos.tbz
290909dec5b3995aed2298f20d345aea *exaos.tbz
cfda22b7599e47bdc686acf41fd5d67c *kernel.tbz
18db43d6e94797710ffa771702be33e8 *ofed.tbz
4831ba742464caaa11c1c0153b823886 *sunutils.tbz
activated:2013-11-24 04:41:03 +0100

You can see that this file contains all the information about the current version of your Exadata software stack, as well as a list of md5 checksums of the .tbz files in /opt/oracle.cellos/iso/cellbits. We checked the md5 checksums of these files as well, so now we know for sure that we are on the correct level. So how does this file get updated? It seems to be out-of-date, or at least not in line with our system. It is generated when you patch your compute node to another level, and it is updated by the exadata-sun-computenode RPM. Let's peek a bit into that RPM and what it deploys. First I mount an ISO file as my yum repository:
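
That manual checksum verification can also be scripted with `md5sum -c`. The self-contained sketch below uses a scratch directory and a hypothetical checklist name (image.chk) instead of the real file, whose name is truncated above:

```shell
# Recreate the verification we did by hand: extract the md5 lines from
# a checklist file (metadata plus "md5 *file" lines, like the file shown
# above) and feed them to `md5sum -c`.
workdir=$(mktemp -d)
cd "$workdir"

printf 'dummy cellbits payload\n' > kernel.tbz   # stand-in for a real .tbz

# Build an image.chk checklist: metadata lines plus one checksum line.
{
  echo 'created:2013-01-09 03:37:26 -0800'
  md5sum -b kernel.tbz
  echo 'activated:2013-11-24 04:41:03 +0100'
} > image.chk

# Keep only the checksum lines (they contain " *") and verify them.
result=$(grep ' \*' image.chk | md5sum -c -)
echo "$result"
```

Against the real file you would run the same grep/md5sum pipeline from within /opt/oracle.cellos/iso/cellbits.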

[root@dm0101 ~]# mount -o loop /u01/app/oracle/stage.2288/112_latest_repo_130109.iso /var/www/html/yum/unknown/EXADATA/dbserver/11.2/latest.2288/
[root@dm0101 ~]# cp /var/www/html/yum/unknown/EXADATA/dbserver/11.2/latest.2288/x86_64/exadata-sun-computenode- 

Next step is to extract the files from the RPM and see what we get:

[root@dm0101 ~]# mkdir /tmp/exadata-sun-computenode-
[root@dm0101 ~]# cd /tmp/exadata-sun-computenode-
[root@dm0101 exadata-sun-computenode-]# rpm2cpio ../exadata-sun-computenode- | cpio -idv
194 blocks
[root@dm0101 exadata-sun-computenode-]#

Aha! There is a file under ./opt/oracle.cellos/tmpl/ in that RPM, but it is not the whole story: it still doesn't contain the md5 checksum information:

[root@dm0101 exadata-sun-computenode-]# cat ./opt/oracle.cellos/tmpl/
created:2013-03-02 19:28:04 -0800
[root@dm0101 exadata-sun-computenode-]#

To find out how that gets filled, we need to look at the script that is executed when we install this RPM. The output below is truncated a bit for readability:

[root@dm0101 rpm-extraxt]# rpm --scripts -qp exadata-sun-computenode-
# Don't start post script for the fresh imaging (on zero boot)
# Don't install any files (remove them)
if [ -f /.post.imaging ]; then
  rm -f /opt/oracle.cellos/tmpl/
  rm -f /opt/oracle.cellos/
  /sbin/service exachkcfg postrpm exadata-sun-computenode

So when we install this RPM, it (among other things not shown here) removes my existing file and a shell script, and then executes a new version of that script. That script is quite big: it sets up your compute node properly and then finally reboots it. Somewhere in this huge script a function is called that calculates the md5 checksums of my cellbits directory and places that info into a variable:

md5sum_iso_cellbits ()
{
  local imageid_out=$1

  [ "$imageid_out" == '' ] && imageid_out=$IMAGE_ID_DEF
  print_log "Please wait. Calculating checksums for $imageid_out ..." INFO
  # Remove any existing md5 sums from the output file
  perl -ne '/[a-f\d]\s+\*.+/ or print $_' -i $imageid_out
  local tmp_lst=/tmp/msd5sum.cellbits.lst
  pushd /opt/oracle.cellos/iso/cellbits
  md5sum -b * > $tmp_lst
  cat $tmp_lst >> $imageid_out
  rm -f $tmp_lst
Finally, the function db_update_imageinfo () in that script creates a new, up-to-date version of my file and makes imageinfo happy again. To make a long blog post short, the solution to our misbehaving imageinfo (and also imagehistory, by the way) was to run

yum --enablerepo=exadata_dbserver_11. reinstall exadata-sun-computenode-

and let the nodes reboot. This solved our problem with the imageinfo command; unfortunately, we could not determine the root cause of the problem.

About cellcli calibrate

With cellcli you can run the calibrate command, which measures the performance of your hard disks and flash cards. It eventually reports the throughput and the number of IOPS measured during the calibrate run, and also reports which LUNs are not performing up to par.

Calibrate has a few simple options:

CellCLI> help calibrate

  Usage: CALIBRATE [HARDDISK | FLASHDISK | lun1 [, lun2]] [FORCE]
  Purpose: Runs raw performance tests on disks.
           CELLSRV must be shutdown.  User id root is required.

    [FLASHDISK]: option to specify that only flash LUNs be calibrated.
    [HARDDISK] : option to specify that only hard disk LUNs be calibrated.
    lun1 [, lun2] : option to specify list of LUNs by name
    [FORCE]    : runs test even while CELLSRV is running.

    CALIBRATE '0_8', '0_10' force


So let’s do a test run of calibrate on one of the hard disks in this cell in this V2 half rack:

CellCLI> calibrate 0_2 force
Calibration will take a few minutes...
LUN 0_2 on drive [20:2] random read throughput: 168.21 MBPS, and 419 IOPS
CALIBRATE results are within an acceptable range.


Now this does tell us something about the throughput and the IOPS that were measured, but it does not tell you HOW cellcli managed to measure this. Let’s have a peek into how cellcli works; in order to see that, we have to increase the loglevel and start cellcli in debugging mode:

[root@dm01cel01 ~]# cellcli -vvv -d
CellCLI: Release - Production on Sun Oct 27 15:19:55 CET 2013

Copyright (c) 2007, 2012, Oracle.  All rights reserved.
Cell Efficiency Ratio: 236

What this does is dump a lot of extra information on your screen regarding the calls that cellcli makes to ossmgmt.jar. cellcli is nothing but a perl script that calls that jar file. With cellcli running in debug mode and the loglevel set to the highest level, we can now run our calibrate command again:

CellCLI> calibrate 0_2 force
yyn:0  state:0  yychar:-1
tok: -3 sval: calibrate
 next yychar:289
state 0, reducing 0 by rule 2 ($$1 :)
state 37, reducing 4 by rule 4 (line : $$1 cli $$2 EOL)
CLI command: CALIBRATE(289)
SPOOL arg : NONE(0)
CLI verb: CALIBRATE(289)
CLI object: NONE(0) 0_2
CLI addverb:
attr#1: FORCE(334)
++++Over Value is: 0
ForAll is false
CLI addverb:
attr#1: FORCE(334)
After reduction, shifting from state 0 to state 2
tok: -1 sval: null
state 2, reading 0 (end-of-file)
Getting the state of CELLSRV services...  running
Calibration will take a few minutes...
LUN 0_2 on drive [20:2] random read throughput: 168.33 MBPS, and 420 IOPS
CALIBRATE results are within an acceptable range.


I have truncated the output above a little bit to make it readable. Now this does not tell us a whole lot more, but we can see some interesting info in the cellcli logfile at /var/log/oracle/deploy/cellcli.lst.0*, where the last part can vary on your system depending on how often you have started cellcli. When scrolling through that logfile, among the output generated by cellcli you will find our calibrate command:

Oct 27, 2013 3:21:33 PM readAndProcessLine
INFO: Executing cellcli command: "calibrate 0_2 force"
Oct 27, 2013 3:21:33 PM invokeSingleRSCmd
FINE: Executing RS command: 0.0STATE CELLSRV!!!!!!!!!!!!!!.
Oct 27, 2013 3:21:33 PM invokeSingleRSCmd
FINE: RS comamnd return status: 2.

You see that the class checks whether or not cellsrv is running (I have added the force parameter so I can run calibrate without stopping cellsrv).
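
The pattern here — refuse to run unless the service is down, or the user forces it — can be sketched in shell; note that `pgrep -x cellsrv` below is only a stand-in for the real RS state command, which is not shown in the logs:

```shell
# Sketch of calibrate's "CELLSRV must be shutdown" check with a FORCE
# override. 'pgrep -x cellsrv' is a stand-in for the real RS state
# command; FORCE=1 mimics the force keyword.
FORCE=${FORCE:-0}
if pgrep -x cellsrv > /dev/null && [ "$FORCE" -ne 1 ]; then
    echo "CELLSRV must be shutdown. Use FORCE to override." >&2
    exit 1
fi
echo "starting calibrate run"
```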

Oct 27, 2013 3:21:33 PM fetchLuns
INFO: disknames.size: 1 listluncmd - list lun "0_2" attributes name, devicename, physicaldrives, disktype, lunsize
Oct 27, 2013 3:21:33 PM fetchCliResults
FINER: fetchCli cmd: MOD CALIBRATE :list lun "0_2" attributes name, devicename, physicaldrives, disktype, lunsize
Oct 27, 2013 3:22:14 PM fetchCliResults
FINER: fetchCli result: .	 0_2	 /dev/sdc	 20:2	 HardDisk	 557.861328125G

Next you see that cellcli translates my LUN name 0_2 to a blockdevice name, in my case /dev/sdc. It then goes on to gather the names of all the LUNs:

Oct 27, 2013 3:22:14 PM fetchLuns
INFO: listluns: 	 0_2	 /dev/sdc	 20:2	 HardDisk	 557.861328125G

Oct 27, 2013 3:22:14 PM fetchCliResults
FINER: fetchCli cmd: MOD CALIBRATE :list physicaldisk attributes name, luns, diskMediaGroup
Oct 27, 2013 3:22:20 PM fetchCliResults
FINER: fetchCli result: .	 20:0	 0_0	 DMG_SYSTEM_HDD1
	 20:1	 0_1	 DMG_SYSTEM_HDD1
	 20:2	 0_2	 DMG_DATA_HDD1
	 20:3	 0_3	 DMG_DATA_HDD1
Oct 27, 2013 3:22:20 PM fetchLuns
INFO: listpds: 	 20:0	 0_0	 DMG_SYSTEM_HDD1
	 20:1	 0_1	 DMG_SYSTEM_HDD1
	 20:2	 0_2	 DMG_DATA_HDD1
	 20:3	 0_3	 DMG_DATA_HDD1

The output above is shortened for readability. Next tune_block_io runs, which is also run during cell deployment: tune_block_io /dev/sdc sets nr_requests in /sys/block/sdc/queue/nr_requests to 4096:

Oct 27, 2013 3:22:20 PM calibrate
FINER: Execute: /opt/oracle/cell11. tune_block_io /dev/sdc

Setting nr_requests to 4096 gives the deadline IO scheduler more IO slots, and thus a better chance to merge IO requests. This should already be set to 4096 on all of your cells; the calibrate command simply sets it again anyway.
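
The tuning itself boils down to a one-line sysfs write; a safe-to-run sketch, with a temp file standing in for the real /sys/block/sdc/queue/nr_requests attribute (writing the real one requires root on the cell):

```shell
# Sketch of the nr_requests tuning that tune_block_io performs. On a
# real cell this writes to /sys/block/<dev>/queue/nr_requests as root;
# here a temp file stands in for that sysfs attribute so the snippet
# is safe to run anywhere.
QUEUE=/tmp/nr_requests.demo      # stand-in for the sysfs file
echo 128 > $QUEUE                # a typical default queue depth
echo 4096 > $QUEUE               # what tune_block_io sets
cat $QUEUE
```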

Oct 27, 2013 3:22:21 PM executeOrion
INFO: execute: /opt/oracle/cell11. -run advanced -type rand -matrix point  -disk_start_offset 0 -duration 30 -num_disks 1 -testname rand_0 -disk_end_offset 285184m -size_large 8192 -num_small 0 -num_large 10
Oct 27, 2013 3:22:51 PM executeOrion
INFO: execute: /opt/oracle/cell11. -run advanced -type rand -matrix point  -disk_start_offset 0 -duration 30 -num_disks 1 -testname rand_IOPS_0 -disk_end_offset 285184m -size_small 8 -num_large 0 -num_small 20

Finally calibrate starts the actual measurement by calling Orion, which is located in $OSS_BIN. It does two separate runs: one doing large 8192KB IOs to measure throughput, and one doing small 8KB IOs to measure IOPS.
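
A quick sanity check relates the two run types to the calibrate output above (integer shell arithmetic, rounding down): 168 MBPS of 8MB IOs is only about 21 large IOs per second, while 420 IOPS of 8KB reads delivers only about 3 MBPS — which is exactly why throughput and IOPS have to be measured in separate runs.

```shell
# Relate the large-IO and small-IO Orion runs to each other.
# 168 MBPS with 8192KB (8MB) IOs -> large IOs per second:
echo $((168 / 8))            # prints 21
# 420 IOPS with 8KB IOs -> MBPS delivered by small IOs:
echo $((420 * 8 / 1024))     # prints 3
```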