Vagrant for your RAC test environment

Creating your own test VM environment with RAC is a fun exercise, but after you have rebuilt your environment a couple of times it becomes very tiresome and you want to start automating your deployments. People have already written several blogposts about the GI and RDBMS orchestration and the various tools that are available for this. Within the Oracle community Ansible seems to be a very popular choice for getting your test environment up-and-running. But what about the part of getting your VMs up-and-running, a very repetitive and, to say the least, not very interesting task?

One of your options would be to script the creation of your VMs and the installation of Linux afterwards. If you are using VirtualBox you could do this by writing a script around the VBoxManage command line tool and creating everything from there. For the Linux deployment, PXE boot seems the logical choice to me, but it still involves running a DHCP and TFTP server locally (hint: dnsmasq), getting the ISO, bootloaders etc. A fun exercise to try, but still quite a lot of work to automate and keep up-and-running.
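
If you did want to go down that road, a minimal VBoxManage sketch could look something like this (VM name, sizes, OS type and ISO path are just placeholders for illustration, not part of the setup described below):

#!/bin/bash
# Hand-rolled VM creation with VBoxManage, placeholder values throughout
VM=rac1

VBoxManage createvm --name $VM --ostype Oracle_64 --register
VBoxManage modifyvm $VM --memory 4096 --cpus 2 --nic1 nat
VBoxManage createmedium disk --filename $VM.vdi --size 51200
VBoxManage storagectl $VM --name SATA --add sata --portcount 5
VBoxManage storageattach $VM --storagectl SATA --port 0 --device 0 --type hdd --medium $VM.vdi
VBoxManage storageattach $VM --storagectl SATA --port 1 --device 0 --type dvddrive --medium ./OracleLinux.iso
VBoxManage startvm $VM --type headless

And then you still have to automate the Linux installation itself, which is exactly the part that gets tedious.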

So this is where Vagrant comes in: it can easily configure and deploy virtual machines for you. Basically it is “nothing more” than a Ruby shell around several (VM) providers like VirtualBox, AWS, VMware, Hyper-V and Docker. Vagrant works with so-called boxes, which are nothing more than compressed VMs that get modified for your needs at the moment you spin them up. You can let Vagrant download a box from the Vagrant Cloud or you can make your own box if you want. Running this:

vagrant init hashicorp/precise64

followed by:

vagrant up

This will give you a VirtualBox VM running Ubuntu, downloaded from the Vagrant Cloud. Out of the box Vagrant assumes you have VirtualBox installed. You can then ssh into the box with “vagrant ssh” or, in a multi-host scenario, with “vagrant ssh nodename”. Stopping and starting your Vagrant boxes is done with “vagrant halt” and “vagrant up” respectively. If you are done and want to remove the VM, run “vagrant destroy”.
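
For a multi-node setup like the RAC cluster we will build below, the day-to-day commands boil down to the following (the node names assume the rac1/rac2 naming used later on):

vagrant status       # state of all boxes defined in the Vagrantfile
vagrant ssh rac1     # ssh into a specific node
vagrant halt         # stop all boxes (or a single one: vagrant halt rac1)
vagrant up           # start them again
vagrant destroy -f   # remove the VMs entirely, without asking for confirmation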

What if you want to do something more interesting, like deploying a RAC cluster, which needs shared storage and multiple network interfaces? You need to create a file called Vagrantfile in your working directory. This file contains the code that tells Vagrant how to modify your boxes. A very basic Vagrantfile looks something like this:

Vagrant.configure("2") do
  config.box = "hashicorp/precise64"
end

Let’s assume we want to create 2 VMs with 4 shared disks between them, and for networking we want a management interface, a public interface and an interconnect network. We will end up with this file: Here on GitHub Gist

Let’s break this file down so you get an understanding of what is going on. First off, this file is, just like the rest of Vagrant, written in Ruby, so all Ruby syntax will work in this file as well.

At the top I have defined some variables, such as the number of servers I want to generate, hardware dimensions, shared disks etc. The API version is needed so Vagrant knows what syntax it can expect.

VAGRANTFILE_API_VERSION = "2"
ASM_LOC     = "/pathto/vagrant/rac/asmdisk"
num_disks   = 4
servers     = 2
mem         = 4096
cpu         = 2

The first step is to tell Vagrant how you want to set up your environment for this Vagrantfile. I am telling Vagrant I want to use a box called oel68 (a custom Vagrant box I made) and that I want X11 forwarding enabled for ease of use, in case I need to run DBCA or something similar:

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
config.vm.box = "oel68"
config.ssh.forward_x11 = true
config.ssh.forward_agent = true

Now for the interesting stuff: creating multiple VMs for our RAC cluster. I didn’t want to copy and paste several server configurations and make small adjustments to each of them; instead I wanted to make it a bit more flexible, so I just use an each iterator that loops until it reaches the "servers" variable. I am creating VMs called rac1 through racn.

(1..servers).each do |rac_id|
config.vm.define "rac#{rac_id}" do |config|
config.vm.hostname = "rac#{rac_id}"

The next step is to create a Ruby block that does the VirtualBox configuration. I am adding 2 NICs: NIC 1 is already in the box by default and is an interface connected to my host via a NAT network, NIC 2 is for the interconnects and NIC 3 is my public interface. Furthermore I am setting all the NICs to the Intel PRO/1000 MT Server type, changing the CPU and memory settings and updating the SATA port count to 5 so we can add the shared storage later on.

# Do Virtualbox configuration
config.vm.provider :virtualbox do |vb|
	vb.customize ['modifyvm', :id, '--nic2', 'intnet', '--intnet2', 'rac-priv']
	vb.customize ['modifyvm', :id, '--nic3', 'hostonly', '--hostonlyadapter3', 'vboxnet0']

	# Change NIC type (https://www.virtualbox.org/manual/ch06.html#nichardware)
	vb.customize ['modifyvm', :id, '--nictype1', '82545EM']
	vb.customize ['modifyvm', :id, '--nictype2', '82545EM']
	vb.customize ['modifyvm', :id, '--nictype3', '82545EM']  

	# Change RAC node specific settings
	vb.customize ['modifyvm', :id, '--cpus', cpu]
	vb.customize ['modifyvm', :id, '--memory', mem]  

	# Increase SATA port count
	vb.customize ['storagectl', :id, '--name', 'SATA', '--portcount', 5]

We can now create the shared storage for our RAC cluster. We want to create 4 disks, so we can use the same trick as for the server creation: an each iterator. We do need to take care of a few things here: we don’t want to overwrite an existing disk, and we only want to create and attach it once when we give the “vagrant up” command (this line). To be more precise, I only need one VM to create the disks with VBoxManage createmedium, but I need all VMs to attach these disks. The if block below makes sure that only the first node creates the disks and every other node only attaches the storage.

(1..num_disks).each do |disk|
	if ARGV[0] == "up" && ! File.exist?(ASM_LOC + "#{disk}.vdi")
		if rac_id == 1
			vb.customize ['createmedium',
						'--filename', ASM_LOC + "#{disk}.vdi",
						'--format', 'VDI',
						'--variant', 'Fixed',
						'--size', 5 * 1024]
			vb.customize ['modifyhd',
						 ASM_LOC + "#{disk}.vdi",
						'--type', 'shareable']
		end # End createmedium on rac1

		vb.customize ['storageattach', :id,
				'--storagectl', 'SATA',
				'--port', "#{disk}",
				'--device', 0,
				'--type', 'hdd',
				'--medium', ASM_LOC + "#{disk}.vdi"]
	end  # End if exist
end    # End of EACH iterator for disks

The code below is a workaround for a nasty bug with my CPU which I hit both with VMware Fusion and VirtualBox. It is well documented by Laurent Leturgez and Danny Bryant.

# Workaound for Perl bug with root.sh segmentation fault,
# see this blogpost from Danny Bryant http://dbaontap.com/2016/01/13/vbox5/
vb.customize ['setextradata', :id, "VBoxInternal/CPUM/HostCPUID/Cache/Leaf", "0x4"]
vb.customize ['setextradata', :id, "VBoxInternal/CPUM/HostCPUID/Cache/SubLeaf", "0x4"]
vb.customize ['setextradata', :id, "VBoxInternal/CPUM/HostCPUID/Cache/eax", "0"]
vb.customize ['setextradata', :id, "VBoxInternal/CPUM/HostCPUID/Cache/ebx", "0"]
vb.customize ['setextradata', :id, "VBoxInternal/CPUM/HostCPUID/Cache/ecx", "0"]
vb.customize ['setextradata', :id, "VBoxInternal/CPUM/HostCPUID/Cache/edx", "0"]
vb.customize ['setextradata', :id, "VBoxInternal/CPUM/HostCPUID/Cache/SubLeafMask", "0xffffffff"]

We now have our VMs ready and we can start provisioning them. If we just add a provisioning block like the one below, Vagrant will run the provisioning serially: create VM rac1, provision it, create rac2, provision it, and so on:

# Create disk partitions
if rac_id ==  1
       config.vm.provision "shell", inline: <<-SHELL
if [ -f /etc/SFDISK_CREATE_DATE ]; then
echo "Partition creation already done."
exit 0
fi
for i in `ls /dev/sd* | grep -v sda`;  do echo \\; | sudo sfdisk -q $i; done
date > /etc/SFDISK_CREATE_DATE
       SHELL
end # End create disk partitions

In most cases you want to start the provisioning when all VMs are ready. Vagrant supports several provisioning methods like Ansible, shell scripting, Puppet, Chef etc. If we are installing a $GI_HOME we need both nodes to be up, with all interfaces up and IPs assigned, etc.

if rac_id == servers
	# Start Ansible provisioning
	config.vm.provision "ansible" do |ansible|
		#ansible.verbose = "-v"
		ansible.limit = "all"
		ansible.playbook = "ansible/rac_gi_db.yml"
	end # End of Ansible provisioning
end

Above, I am only starting the provisioning block once my rac_id equals the servers variable, meaning when I have created all my RAC nodes. Ansible can then provision my servers in parallel because the Ansible limit variable is set to all. Vagrant generates an Ansible host file with all the hosts, which you can use for the provisioning. The provisioning of the RAC cluster itself is outside the scope of this blogpost. If you want to give Vagrant a go, you can download it here
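
That generated inventory normally ends up under the .vagrant directory of your project; a quick way to peek at it (the exact path is an assumption based on recent Vagrant versions, so check your own tree):

# Show the inventory Vagrant generated for the Ansible provisioner
cat .vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory

# Reuse the same inventory for an ad-hoc Ansible ping of all RAC nodes
ansible all -i .vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory -m ping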

The link to the full Vagrantfile on Gist is here

Patching the Big Data Appliance

I have been patching engineered systems since the launch of the Exadata V2, and recently I had the opportunity to patch the BDA we have in house. As far as comparisons go, that is where the similarities between patching an Exadata and a Big Data Appliance (BDA) stop.
Our BDA is a so-called starter rack consisting of 6 nodes running a Hadoop cluster; for more information about it read my First Impressions blog post. On Exadata, patching consists of a whole set of different patches and tools like patchmgr and dbnodeupdate.sh; on the Big Data Appliance we patch with a tool called Mammoth. In this blogpost we will use the upgrade from BDA software release 4.0 to 4.1 as an example to describe the patching process. You can download the BDA Mammoth bundle patch through DocID 1485745.1 (Oracle Big Data Appliance Patch Set Master Note). This patchset consists of 2 zipfiles which you should upload to your primary node in the cluster (usually node number 1 or 7). If you are not sure which node that is, bdacli can tell you:

[root@bda1node01 BDAMammoth-ol6-4.1.0]# bdacli getinfo cluster_primary_host
bda1node01

After determining which node to use you can upload the 2 zipfiles from MOS and unzip them somewhere on that specific node:

[root@bda1node01 patch]# for i in `ls *.zip`; do unzip $i; done
Archive:  p20369730_410_Linux-x86-64_1of2.zip
  inflating: README.txt
   creating: BDAMammoth-ol6-4.1.0/
  inflating: BDAMammoth-ol6-4.1.0/BDAMammoth-ol6-4.1.0.run
Archive:  p20369730_410_Linux-x86-64_2of2.zip
   creating: BDABaseImage-ol6-4.1.0_RELEASE/
...
..
.
  inflating: BDABaseImage-ol6-4.1.0_RELEASE/json-select
[root@bda1node01 patch]#

After unzipping these files you end up with a huge file: a shell script with a uuencoded binary payload appended to it:

[root@bda1node01 BDAMammoth-ol6-4.1.0]# ll
total 3780132
-rwxrwxrwx 1 root root 3870848281 Jan 16 10:17 BDAMammoth-ol6-4.1.0.run
[root@bda1node01 BDAMammoth-ol6-4.1.0]# 
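
Staging the new Mammoth software is simply a matter of executing this .run file as root and then checking what it put in place; a quick sketch (the staging directory is hypothetical, the /opt/oracle locations are the ones described in the next paragraph):

# Run the self-extracting bundle on the primary node (as root)
cd /u01/patch/BDAMammoth-ol6-4.1.0        # hypothetical staging location
./BDAMammoth-ol6-4.1.0.run

# Afterwards, check what was staged
ls /opt/oracle/BDAMammoth/                        # the new 4.1.0 Mammoth software
ls /opt/oracle/BDAMammoth/previous-BDAMammoth/    # backup of the old version
ls /opt/oracle/BDAMammoth/bdarepo/                # updated yum repo and Cloudera parcels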

This .run file does a couple of checks, like determining which version of Linux we are running (this BDA is on OEL6), installs a new version of BDAMammoth and moves the previous version to /opt/oracle/BDAMammoth/previous-BDAMammoth/. It also updates the Cloudera parcels and yum repository in /opt/oracle/BDAMammoth/bdarepo and places a new baseimage ISO file on this node, which is needed if you are adding new nodes to the cluster or need to reimage them. Finally it also updates all the puppet files in /opt/oracle/BDAMammoth/puppet/manifests. Oracle uses puppet a lot for deploying software on the BDA, which means that patching the BDA is also done by a collection of puppet scripts. When this is done we can start the patch process by running mammoth -p, so here we go:

[root@bda1node01 BDAMammoth]# ./mammoth -p
INFO: Logging all actions in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20150130103200.log and traces in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20150130103200.trc
INFO: all_nodes not generated yet, skipping check for password-less ssh
ERROR: Big Data SQL is not supported on BDA v4.1.0
ERROR: Please uninstall Big Data SQL before upgrade cluster
ERROR: Cannot continue with operation
INFO: Running bdadiagcluster...
INFO: Starting Big Data Appliance diagnose cluster at Fri Jan 30 10:32:12 2015
INFO: Logging results to /tmp/bda_diagcluster_1422610330.log
SUCCESS: Created BDA diagcluster zipfile on node bda1node01
SUCCESS: Created BDA diagcluster zipfile on node bda1node02
SUCCESS: Created BDA diagcluster zipfile on node bda1node03
SUCCESS: Created BDA diagcluster zipfile on node bda1node04
SUCCESS: Created BDA diagcluster zipfile on node bda1node05
SUCCESS: Created BDA diagcluster zipfile on node bda1node06
SUCCESS: bdadiagcluster_1422610330.zip created
INFO: Big Data Appliance diagnose cluster complete at Fri Jan 30 10:32:54 2015
INFO: Please get the Big Data Appliance cluster diagnostic bundle at /tmp/bdadiagcluster_1422610330.zip
Exiting...
[root@bda1node01 BDAMammoth]# 

This error is obvious: we have Big Data SQL installed on our cluster (it was introduced in the BDA 4.0 software), but the version we are running is not supported on BDA 4.1. Unfortunately we are running BDSQL 1.0, which is the only version of BDSQL there is at this point; this is also a known bug described in MOS Doc ID 1964471.1. So we have 2 options: wait for a new release of BDSQL and postpone the upgrade, or remove BDSQL and continue with the upgrade. We decided to continue with the upgrade and deinstall BDSQL for now; it was not working on the Exadata side anyway due to conflicting patches with the latest RDBMS 12.1.0.2 patch. Removing BDSQL can be done with bdacli or mammoth-reconfig; bdacli calls mammoth-reconfig, so as far as I know it doesn’t really matter which one you use. So let's give that a try:

[root@bda1node01 BDAMammoth]# ./mammoth-reconfig remove big_data_sql
INFO: Logging all actions in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20150130114222.log and traces in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20150130114222.trc
INFO: This is the install of the primary rack
ERROR: Version mismatch between mammoth and params file
ERROR: Mammoth version: 4.1.0, Params file version: 4.0.0
ERROR: Cannot continue with install #Step -1#
INFO: Running bdadiagcluster...
INFO: Starting Big Data Appliance diagnose cluster at Fri Jan 30 11:42:25 2015
INFO: Logging results to /tmp/bda_diagcluster_1422614543.log
SUCCESS: Created BDA diagcluster zipfile on node bda1node01
SUCCESS: Created BDA diagcluster zipfile on node bda1node02
SUCCESS: Created BDA diagcluster zipfile on node bda1node03
SUCCESS: Created BDA diagcluster zipfile on node bda1node04
SUCCESS: Created BDA diagcluster zipfile on node bda1node05
SUCCESS: Created BDA diagcluster zipfile on node bda1node06
SUCCESS: bdadiagcluster_1422614543.zip created
INFO: Big Data Appliance diagnose cluster complete at Fri Jan 30 11:43:03 2015
INFO: Please get the Big Data Appliance cluster diagnostic bundle at /tmp/bdadiagcluster_1422614543.zip
Exiting...
[root@bda1node01 BDAMammoth]#

Because we had already extracted the new version of Mammoth when we executed the .run file, we now have a version mismatch between the mammoth software and the params file /opt/oracle/BDAMammoth/bdaconfig/VERSION. We know that when we ran BDAMammoth-ol6-4.1.0.run the old mammoth software was backed up to /opt/oracle/BDAMammoth/previous-BDAMammoth/. Our first thought was to replace the new version with the previous version:

[root@bda1node01 oracle]# mv BDAMammoth BDAMammoth.new
[root@bda1node01 oracle]# cp -R ./BDA
BDABaseImage/           BDABaseImage-ol6-4.0.0/ BDABaseImage-ol6-4.1.0/ BDAMammoth.new/
[root@bda1node01 oracle]# cp -R ./BDAMammoth.new/previous-BDAMammoth/ ./BDAMammoth

Running it again resulted in a lot more errors; I am just pasting parts of the output here, to give you an idea:

[root@bda1node01 BDAMammoth]# ./mammoth-reconfig remove big_data_sql
INFO: Logging all actions in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20150130113539.log and traces in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20150130113539.trc
INFO: This is the install of the primary rack
INFO: Checking if password-less ssh is set up
...
..
.
json-select: no valid input found for "@CIDR@"

*** json-select *** Out of branches near "json object <- json value
     OR json array <- json value" called from "json select command line"
*** json-select *** Syntax error on line 213:
}
...
..
.
ERROR: Puppet agent run on node bda1node01 had errors. List of errors follows

************************************
Error [6928]: Report processor failed: Permission denied - /opt/oracle/BDAMammoth/puppet/reports/bda1node01.oracle.vxcompany.local
************************************

INFO: Also check the log file in /opt/oracle/BDAMammoth/bdaconfig/tmp/pagent-bda1node01-20150130113621.log
...
..
.

After fiddling around with this we found out that the solution was actually extremely simple: there was no need to move the old 4.0 mammoth software back into place. The solution was to simply run the mammoth-reconfig script directly from the backup directory /opt/oracle/BDAMammoth/previous-BDAMammoth/, and we finally have BDSQL (mainly the cellsrv software) disabled on the cluster:

[root@bda1node01 BDAMammoth]# bdacli status big_data_sql_cluster
ERROR: Service big_data_sql_cluster is disabled. Please enable the service first to run this command.
[root@bda1node01 BDAMammoth]#
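
For completeness, the invocation that did work was nothing more than the same command as before, just run from the backup directory:

# Run the mammoth-reconfig that matches the 4.0.0 params file, straight from the backup location
cd /opt/oracle/BDAMammoth/previous-BDAMammoth/
./mammoth-reconfig remove big_data_sql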

With BDSQL disabled we can give mammoth -p another shot, try to upgrade the Hadoop cluster and wait for all the puppet scripts to finish. The output of the mammoth script is long so I won’t post it here for readability reasons. All the logging from the mammoth script and the puppet agents is written to /opt/oracle/BDAMammoth/bdaconfig/tmp/ during the patch process, and when the patching is done it is all nicely put into a single zipfile in /tmp/cdhctr1-install-summary.zip. The patch process on our starter rack took about 2 hours in total, and about 40 minutes into the patching process we get this:

WARNING: The OS kernel was updated on some nodes - so those nodes need to be rebooted
INFO: Nodes to be rebooted: bda1node01,bda1node02,bda1node03,bda1node04,bda1node05,bda1node06
Proceed with reboot? [y/n]: y


Broadcast message from root@bda1node01.oracle.vxcompany.local
	(unknown) at 13:09 ...

The system is going down for reboot NOW!
INFO: Reboot done.
INFO: Please wait until all reboots are complete before continuing
[root@bda1node01 BDAMammoth]# Connection to bda1node01 closed by remote host.

Don’t be fooled into thinking that the reboot is the end of the patch process; the interesting part is just about to start. After the system is back online you can see that we are still on the old BDA software version:

[root@bda1node01 ~]# bdacli getinfo cluster_version
4.0.0

This is not documented very clearly by Oracle: there is no special command to continue the patch process (like dbnodeupdate.sh -c on Exadata), just simply run mammoth -p again and mammoth will pick up the patching process where it left off. In this second part of the patching process mammoth will actually update the Cloudera stack and all its components like Hadoop, Hive, Impala etc. This takes about 90 minutes to finish, with a whole bunch of test runs of MapReduce jobs, Oozie jobs, teragen sorts etc. (in my case it reported that I had a flume agent in concerning health after the upgrade). After all this is done the mammoth script will finish and we can verify whether we have an upgraded cluster or not:

[root@bda1node01 BDAMammoth]# bdacli getinfo cluster_version
4.1.0

[root@bda1node01 BDAMammoth]# bdacli getinfo cluster_cdh_version
5.3.0-ol6

Apart from the issues surrounding BDSQL, patching the BDA worked like a charm, although Oracle could have documented a bit better how to continue after the OS upgrade and reboot. Personally I really like the fact that Oracle uses puppet for orchestration on the BDA. The next step will be waiting for the update of BDSQL so we can do some proper testing.

Oracle Big Data Appliance X4-2: First impressions

For the last 2 weeks we have been lucky enough to have the Big Data Appliance (BDA) from Oracle in our lab/demo environment at VX Company and iRent. In this blog post I am trying to share my first experiences and some general observations. I am coming from an Oracle (Exadata) RDBMS background, so I will probably reflect some of that experience on the BDA. The BDA we have here is a starter rack which consists of 6 stock X4-2 servers, each with 2 sockets of 8-core Intel Xeon E5-2650 processors:

[root@bda1node01 bin]# cat /proc/cpuinfo | grep "model name" | uniq
model name	: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
[root@bda1node01 bin]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2593.864
BogoMIPS:              5186.76
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
[root@bda1node01 bin]#

Half of the memory banks are filled with 8GB DIMMs, giving each BDA node a total of 64GB:

[root@bda1node01 ~]# dmidecode  --type memory | grep Size
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
	Size: 8192 MB
	Size: No Module Installed
[root@bda1node01 ~]#
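
A quick way to check that those populated banks indeed add up to 64GB per node (a one-liner sketch; dmidecode reports sizes in MB):

dmidecode --type memory | grep "Size: [0-9]" | awk '{sum += $2} END {print sum/1024 " GB"}'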

The HDFS filesystem is writing to the 12 disks in the server, all mounted at /u[1-12]:

[root@bda1node01 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2              459G   40G  395G  10% /
tmpfs                  32G   12K   32G   1% /dev/shm
/dev/md0              184M  116M   60M  67% /boot
/dev/sda4             3.1T  214M  3.1T   1% /u01
/dev/sdb4             3.1T  203M  3.1T   1% /u02
/dev/sdc1             3.6T  4.3G  3.6T   1% /u03
/dev/sdd1             3.6T  4.1G  3.6T   1% /u04
/dev/sde1             3.6T  4.2G  3.6T   1% /u05
/dev/sdf1             3.6T  3.8G  3.6T   1% /u06
/dev/sdg1             3.6T  3.4G  3.6T   1% /u07
/dev/sdh1             3.6T  3.1G  3.6T   1% /u08
/dev/sdi1             3.6T  3.9G  3.6T   1% /u09
/dev/sdj1             3.6T  3.4G  3.6T   1% /u10
/dev/sdk1             3.6T  3.2G  3.6T   1% /u11
/dev/sdl1             3.6T  3.8G  3.6T   1% /u12
cm_processes           32G  8.9M   32G   1% /var/run/cloudera-scm-agent/process
[root@bda1node01 bin]# hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   18338 MB in  2.00 seconds = 9183.19 MB/sec
 Timing buffered disk reads:  492 MB in  3.03 seconds = 162.44 MB/sec
[root@bda1node01 bin]#

Deploying Cloudera CDH on the BDA is done by mammoth, the BDA's equivalent of onecommand. Mammoth is used not only for deploying your rack but also for extending it with additional nodes:

[root@bda1node01 bin]# mammoth -l
INFO: Logging all actions in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20141116155709.log and traces in /opt/oracle/BDAMammoth/bdaconfig/tmp/bda1node01-20141116155709.trc
The steps in order are...
Step  1 = PreinstallChecks
Step  2 = SetupPuppet
Step  3 = PatchFactoryImage
Step  4 = CopyLicenseFiles
Step  5 = CopySoftwareSource
Step  6 = CreateLogicalVolumes
Step  7 = CreateUsers
Step  8 = SetupMountPoints
Step  9 = SetupMySQL
Step 10 = InstallHadoop
Step 11 = StartHadoopServices
Step 12 = InstallBDASoftware
Step 13 = HadoopDataEncryption
Step 14 = SetupKerberos
Step 15 = SetupEMAgent
Step 16 = SetupASR
Step 17 = CleanupInstall
Step 18 = CleanupSSHroot (Optional)
[root@bda1node01 bin]#

Interestingly, Oracle is using puppet to deploy CDH and to configure the BDA nodes. Deploying a starter rack from scratch is quick; within a few hours we had our BDA installed and running. As an Exadata ‘fanboy’ I also have to say some words about cellsrv running on the BDA:

[root@bda1node01 bin]# ps -ef | grep [c]ell
oracle    8665     1  0 Nov05 ?        00:01:38 /opt/oracle/cell/cellsrv/bin/bdsqlrssrm -ms 1 -cellsrv 1
oracle    8672  8665  0 Nov05 ?        00:02:11 /opt/oracle/cell/cellsrv/bin/bdsqlrsomt -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8673  8665  0 Nov05 ?        00:00:47 /opt/oracle/cell/cellsrv/bin/bdsqlrsbmt -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8674  8665  0 Nov05 ?        00:00:30 /opt/oracle/cell/cellsrv/bin/bdsqlrsmmt -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8675  8673  0 Nov05 ?        00:00:08 /opt/oracle/cell/cellsrv/bin/bdsqlrsbkm -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8678  8674  0 Nov05 ?        00:10:57 /usr/java/default/bin/java -Xms256m -Xmx512m -XX:-UseLargePages -Djava.library.path=/opt/oracle/cell/cellsrv/lib -Ddisable.checkForUpdate=true -jar /opt/oracle/cell/oc4j/ms/j2ee/home/oc4j.jar -out /opt/oracle/cell/cellsrv/deploy/log/ms.lst -err /opt/oracle/cell/cellsrv/deploy/log/ms.err
oracle    8707  8675  0 Nov05 ?        00:00:48 /opt/oracle/cell/cellsrv/bin/bdsqlrssmt -rs_conf /opt/oracle/cell/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsms.state -cellsrv_conf /opt/oracle/cell/cellsrv/deploy/config/bdsqlrsos.state -debug 0
oracle    8710  8672 16 Nov05 ?        1-21:15:17 /opt/oracle/cell/cellsrv/bin/bdsqlsrv 100 5000 9 5042
oracle    8966     1  0 Nov05 ?        02:18:06 /opt/oracle/bd_cell/bd_cellofl-12.1.2.0.99_LINUX.X64_140907.2307/cellsrv/bin/bdsqloflsrv -startup 1 0 1 5042 8710 SYS_1212099_140907 cell
[root@bda1node01 bin]#

Of course this is Oracle’s new product Big Data SQL, which is cellsrv ported to run on top of Hadoop. Unfortunately we could not get Big Data SQL running yet because the mandatory patch that needs to be installed on top of the RDBMS/GI homes on the Exadata side conflicts with Oracle 12.1.0.2.1 (12.1.0.2 is mandatory for Big Data SQL), so right now we are waiting for development to fix that issue. Strangely enough BD SQL also says that we have flash cache in writethrough mode on the BDA, and a whole lot of other abnormal readings:

[root@bda1node01 bin]# bdscli -e list bdsql detail
	 name:              	 bda1node01
	 bdsVersion:        	 OSS_PT.EXADOOP2_LINUX.X64_140907.2307
	 cpuCount:          	 0
	 diagHistoryDays:   	 7
	 fanCount:          	 0/0
	 fanStatus:         	 normal
	 flashCacheMode:    	 WriteThrough
	 id:
	 interconnectCount: 	 0
	 ipaddress1:        	 192.168.100.100/24
	 kernelVersion:     	 2.6.39-400.215.9.el6uek.x86_64
	 locatorLEDStatus:  	 unknown
	 makeModel:
	 memoryGB:          	 0
	 metricHistoryDays: 	 7
	 offloadGroupEvents:
	 offloadEfficiency: 	 1.0
	 powerCount:        	 0/0
	 powerStatus:       	 normal
	 releaseVersion:
	 releaseTrackingBug:	 17885582
	 status:            	 online
	 temperatureReading:	 0.0
	 temperatureStatus: 	 normal
	 upTime:            	 0 days, 0:00
	 bdsqlsrvStatus:    	 running
	 bdsqlmsStatus:     	 running
	 bdsqlrsStatus:     	 running
[root@bda1node01 bin]#

I am sure we actually have some CPUs in here and that this machine is powered on, that we have some actual memory in here and that the temperature is more than zero degrees. Apart from the issues with Big Data SQL (which I am sure will be resolved soon) I am impressed with the set of tools that Oracle delivers with the BDA.

Grid logging in 12.1.0.2

Most of the talk about Oracle’s 12.1.0.2 release is about the In-Memory feature, but more things have changed, for example some essentials of logging in the Grid Infrastructure. Before this release, logging for Grid Infrastructure components was done in $GI_HOME:

oracle@dm01db01(*gridinfra):/home/oracle> cd $ORACLE_HOME/log/`hostname -s`/
oracle@dm01db01(*gridinfra):/u01/app/12.1.0.2/grid/log/dm01db01>

There we have the main alert log for GI and several subdirectories in which the GI binaries write their logging information to a set of rotating logfiles. So after upgrading an 11.2.0.4 cluster on Exadata to 12.1.0.2, you will see that the alert log is empty, or only contains a couple of lines written at the point of upgrading the GI stack:

oracle@dm01db01(*gridinfra):/u01/app/12.1.0.2/grid/log/dm01db01> cat alertdm01db01.log
2014-08-21 11:59:35.455
[client(9458)]CRS-0036:An error occurred while attempting to open file "UNKNOWN".
2014-08-21 11:59:35.455
[client(9458)]CRS-0004:logging terminated for the process. log file: "UNKNOWN"

So in 12.1.0.2 the old directories are still there but are not being used anymore; where can we find the logs now? Prior to 12.1.0.2, all GI binaries were also writing directly to the alert.log file. Let's start by finding out where, for example, the crsd.bin process is writing to, by looking at its open file descriptors in the proc filesystem:

[root@dm01db01 ~]# cd /proc/`ps -C crsd.bin -o pid=`/fd
[root@dm01db01 fd]# ls -la *.log
[root@dm01db01 fd]# ls -la *.trc
lrwx------ 1 root root 64 Aug 21 11:41 1 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc
l-wx------ 1 root root 64 Aug 21 11:41 15 -> /u01/app/grid/diag/crs/dm01db01/crs/trace/crsd.trc
lrwx------ 1 root root 64 Aug 21 11:41 2 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc
lrwx------ 1 root root 64 Aug 21 11:41 3 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc

Crsd.bin is not writing to a logfile anymore; it is writing to a trace file in a new location. The new logging location can be found at $ORACLE_BASE/diag/crs/`hostname -s`/crs/trace/ and is now in regular ADR-formatted directories. In the old structure we had all the logfiles nicely divided into subdirectories; in this new structure everything is in a single directory. This directory contains a lot of files; this is on a freshly installed cluster node:

[root@dm01db01 trace]# ls -la | wc -l
800

The majority of the files are trace files from cluster commands like oifcfg; every execution is logged and traced. These files follow a -.trc naming format. All GI processes can easily be listed (the example below is from an environment with role separation between GI and RDBMS):

[root@dm01db01 trace]# ls -la | grep [a-z].trc
-rw-rw----  1 grid     oinstall   1695134 Aug 24 15:30 crsd_oraagent_grid.trc
-rw-rw----  1 oracle   oinstall   4555557 Aug 24 15:30 crsd_oraagent_oracle.trc
-rw-rw----  1 root     oinstall   1248672 Aug 24 15:30 crsd_orarootagent_root.trc
-rw-rw----  1 oracle   oinstall   6156053 Aug 24 15:30 crsd_scriptagent_oracle.trc
-rw-rw----  1 root     oinstall   4856950 Aug 24 15:30 crsd.trc
-rw-rw----  1 grid     oinstall  10329905 Aug 24 15:31 diskmon.trc
-rw-rw----  1 grid     oinstall   2825769 Aug 24 15:31 evmd.trc
-rw-rw----  1 grid     oinstall       587 Aug 21 11:19 evmlogger.trc
-rw-rw----  1 grid     oinstall   8994527 Aug 24 15:31 gipcd.trc
-rw-rw----  1 grid     oinstall     12663 Aug 21 11:41 gpnpd.trc
-rw-rw----  1 grid     oinstall     11868 Aug 21 11:36 mdnsd.trc
-rw-rw----  1 grid     oinstall 132725348 Aug 24 15:31 ocssd.trc
-rw-rw----  1 root     oinstall   6321583 Aug 24 15:31 octssd.trc
-rw-rw----  1 root     oinstall     59185 Aug 24 14:02 ohasd_cssdagent_root.trc
-rw-rw----  1 root     oinstall     72961 Aug 24 14:38 ohasd_cssdmonitor_root.trc
-rw-rw----  1 grid     oinstall    804408 Aug 24 15:31 ohasd_oraagent_grid.trc
-rw-rw----  1 root     oinstall   1094709 Aug 24 15:31 ohasd_orarootagent_root.trc
-rw-rw----  1 root     oinstall  10384867 Aug 24 15:30 ohasd.trc
-rw-rw----  1 root     oinstall    169081 Aug 24 15:06 ologgerd.trc
-rw-rw----  1 root     oinstall   5781762 Aug 24 15:31 osysmond.trc
[root@dm01db01 trace]#

So if we look at the old, pre-12.1.0.2 cluster alert log, you will see that there are a lot of processes writing directly to this file:

[root@dm02db01 ~]# lsof | grep alertdm02db01.log | wc -l
443

Those are a lot of processes writing to one file; in the new environment it is a lot less… just one:

[root@dm01db01 trace]# lsof | grep alert.log
java      23914     root  335w      REG              252,2     29668   10142552 /u01/app/grid/diag/crs/dm01db01/crs/trace/alert.log
[root@dm01db01 trace]# ps -p 23914 -o pid,args
  PID COMMAND
23914 /u01/app/12.1.0.2/grid/jdk/jre/bin/java -Xms128m -Xmx512m -classpath
/u01/app/12.1.0.2/grid/tfa/dm01db01/tfa_home/jlib/RATFA.jar:/u01/app/12.1.0.2/grid/tfa/dm01db01/tfa_home/jlib/je-5.0.84.jar:

For now this new logging structure is a lot less organized than the previous structure. It is, however, a lot more like regular database tracing and logging, where there is one general alert log and the different background processes like pmon, arch, lgwr etc. write to their own trace files.
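
In day-to-day use this mostly means pointing your tail and grep at the new location; for example (paths as shown above, with /u01/app/grid being the ORACLE_BASE of the grid owner):

# Follow the new GI alert log in its ADR-style location
tail -f /u01/app/grid/diag/crs/$(hostname -s)/crs/trace/alert.log

# List the biggest trace files in the new single trace directory
ls -lS /u01/app/grid/diag/crs/$(hostname -s)/crs/trace/*.trc | head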

Inserts on HCC tables

There are already a lot of blogposts and presentations about Hybrid Columnar Compression, and I am adding one more blogpost to that list. Recently I was doing some small tests on HCC and noticed that inserts on an HCC table didn’t get compressed, and yes, I was using direct path loads:

DBA@TEST1> create table hcc_me (text1 varchar2(4000)) compress for archive high;

Table created.

KJJ@TEST1> insert /*+ append */ into hcc_me select dbms_random.string('x',100) from dual;

1 row created.

KJJ@TEST1> commit;

Commit complete.

KJJ@TEST1> select rowid from hcc_me;

ROWID
------------------
AAAWw/AAAAAACXzAAA

DBA@TEST1> @compress
Enter value for schemaname: kjj
Enter value for tablename: hcc_me
Enter value for rowid: AAAWw/AAAAAACXzAAA
old   2:    dbms_compression.get_compression_type(upper('&SchemaName'),upper('&TableName'),'&RowID'),
new   2:    dbms_compression.get_compression_type(upper('kjj'),upper('hcc_me'),'AAAWw/AAAAAACXzAAA'),

COMPRESSION_TYPE
---------------------
COMP_NOCOMPRESS

So my row did not get compressed. Let’s insert a little bit more data into our test table:

declare
    sql_stmt varchar(200);
begin
for i in 1..1000 loop
    sql_stmt := 'insert /*+ append_values */ into hcc_me select dbms_random.string(''x'',100) from dual';
    execute immediate sql_stmt;
    commit;
end loop;
end;
/

And let's see what we end up with:

select count(*), compression_type from (
select decode(dbms_compression.get_compression_type('KJJ','HCC_ME',rowid),
       1, 'COMP_NOCOMPRESS',
       2, 'COMP_FOR_OLTP',
       4, 'COMP_FOR_QUERY_HIGH',
       8, 'COMP_FOR_QUERY_LOW',
      16, 'COMP_FOR_ARCHIVE_HIGH',
      32, 'COMP_FOR_ARCHIVE_LOW',
      64, 'COMP_BLOCK',
      1000000, 'COMP_RATIO_MINROWS',
      -1, 'COMP_RATIO_ALLROWS') "COMPRESSION_TYPE"
      from hcc_me)
group by compression_type;

so none of my records got compressed:

  COUNT(*) COMPRESSION_TYPE
---------- ---------------------
      1000 COMP_NOCOMPRESS

Maybe it is size dependent; the rows I am inserting into this HCC table are extremely small. Let's re-create the table and make every row one byte bigger than the previous row:

declare

sql_stmt    varchar(200);
v_random1   varchar2(4000);

begin

     execute immediate 'drop table hcc_me';
     execute immediate 'create table hcc_me (text1 varchar2(4000)) compress for archive high';

 for i in 1..1000 loop
     v_random1 := dbms_random.string('x', i);
     sql_stmt := 'insert /*+ append_values */ into hcc_me values (:1)';
     execute immediate sql_stmt using v_random1;
     commit;
   end loop;
end;
/

This will give me a table where row 1 is 1 byte big and the last row is 1000 bytes big. Re-run our select statement and see if we have HCC-compressed rows now:

  COUNT(*) COMPRESSION_TYPE
---------- ---------------------
       697 COMP_FOR_ARCHIVE_HIGH
       303 COMP_NOCOMPRESS

2 rows selected.

Victory! But now let's see where our records start to compress; let's adapt the query a bit:

select vsize(text1) row_bytes,
       decode(dbms_compression.get_compression_type('KJJ','HCC_ME',rowid),
       1, 'COMP_NOCOMPRESS',
       2, 'COMP_FOR_OLTP',
       4, 'COMP_FOR_QUERY_HIGH',
       8, 'COMP_FOR_QUERY_LOW',
      16, 'COMP_FOR_ARCHIVE_HIGH',
      32, 'COMP_FOR_ARCHIVE_LOW',
      64, 'COMP_BLOCK',
      1000000, 'COMP_RATIO_MINROWS',
      -1, 'COMP_RATIO_ALLROWS') COMPRESSION_TYPE
      from hcc_me;

 ROW_BYTES COMPRESSION_TYPE
---------- ---------------------
         1 COMP_NOCOMPRESS
         2 COMP_NOCOMPRESS
         3 COMP_NOCOMPRESS
         4 COMP_NOCOMPRESS
         5 COMP_NOCOMPRESS
<cut>
       292 COMP_NOCOMPRESS
       293 COMP_NOCOMPRESS
       294 COMP_NOCOMPRESS
       295 COMP_FOR_ARCHIVE_HIGH
       296 COMP_NOCOMPRESS
       297 COMP_NOCOMPRESS
       298 COMP_NOCOMPRESS
       299 COMP_NOCOMPRESS
       300 COMP_FOR_ARCHIVE_HIGH
       301 COMP_NOCOMPRESS
       302 COMP_NOCOMPRESS
       303 COMP_NOCOMPRESS
       304 COMP_FOR_ARCHIVE_HIGH
       305 COMP_FOR_ARCHIVE_HIGH
       306 COMP_FOR_ARCHIVE_HIGH
       307 COMP_FOR_ARCHIVE_HIGH
       308 COMP_FOR_ARCHIVE_HIGH
       309 COMP_FOR_ARCHIVE_HIGH
       310 COMP_FOR_ARCHIVE_HIGH
       311 COMP_FOR_ARCHIVE_HIGH
       312 COMP_FOR_ARCHIVE_HIGH
       313 COMP_FOR_ARCHIVE_HIGH
       314 COMP_NOCOMPRESS
       315 COMP_NOCOMPRESS
       316 COMP_FOR_ARCHIVE_HIGH
       317 COMP_FOR_ARCHIVE_HIGH
       318 COMP_FOR_ARCHIVE_HIGH
       319 COMP_FOR_ARCHIVE_HIGH
<cut>
       996 COMP_FOR_ARCHIVE_HIGH
       997 COMP_FOR_ARCHIVE_HIGH
       998 COMP_FOR_ARCHIVE_HIGH
       999 COMP_FOR_ARCHIVE_HIGH
      1000 COMP_FOR_ARCHIVE_HIGH

1000 rows selected.

Alright, this is unexpected. I ran this test multiple times and ended up with different results: some rows around the 300-byte mark are getting compressed and some are not. So somewhere around 300 bytes Oracle decides seemingly at random whether or not to compress. Oh wait, randomly… I am using DBMS_RANDOM to fill my one column with data, so let's take the random factor out of the equation and fill our rows with a fixed character:

declare

sql_stmt    varchar(200);

begin

     execute immediate 'create table hcc_me (text1 varchar2(4000)) compress for archive high';

 for i in 1..1000 loop
     sql_stmt := 'insert /*+ append_values */ into hcc_me select lpad( ''x'','||i||',''x'') from dual';
     execute immediate sql_stmt;
     commit;
   end loop;
end;
/

Now we end up with much more effective compression:

 ROW_BYTES COMPRESSION_TYPE
---------- ---------------------
         1 COMP_NOCOMPRESS
         2 COMP_NOCOMPRESS
         3 COMP_NOCOMPRESS
         4 COMP_NOCOMPRESS
         5 COMP_NOCOMPRESS
<cut>
        65 COMP_NOCOMPRESS
        66 COMP_NOCOMPRESS
        67 COMP_NOCOMPRESS
        68 COMP_FOR_ARCHIVE_HIGH
        69 COMP_FOR_ARCHIVE_HIGH
        70 COMP_FOR_ARCHIVE_HIGH
        71 COMP_FOR_ARCHIVE_HIGH
<cut>
       998 COMP_FOR_ARCHIVE_HIGH
       999 COMP_FOR_ARCHIVE_HIGH
      1000 COMP_FOR_ARCHIVE_HIGH

1000 rows selected.

The moral of this story is to be careful with HCC and small inserts; there is some logic built into the HCC engine that decides whether or not it should compress, based on size and estimated compression ratio. If you end up with a situation where not all rows are compressed (the insert was too small, you forgot to do a direct load, or whatever other reason), a simple alter table move will compress all rows again.
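
A minimal sketch of that cleanup, assuming the same KJJ.HCC_ME table used above and DBA access on the node:

# Rebuild the table so all rows go through the HCC engine again, then verify
sqlplus -s / as sysdba <<'EOF'
alter table kjj.hcc_me move compress for archive high;

select count(*), compression_type from (
  select decode(dbms_compression.get_compression_type('KJJ','HCC_ME',rowid),
                1, 'COMP_NOCOMPRESS',
               16, 'COMP_FOR_ARCHIVE_HIGH',
                   'OTHER') compression_type
  from kjj.hcc_me)
group by compression_type;
EOF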

Cellcli can lie to you…

Yes, that is true, I said it: cellcli can lie to you. There are some special cases where the output of cellcli is not the reality and you should double check its output with your standard OS tools. This is the output of dcli calling cellcli on an Exadata rack from a client:

[root@dm01db01 ~]# dcli -g cell_group -l root cellcli -e list cell attributes cellsrvStatus,msStatus,rsStatus
dm01cel01: running       running         running
dm01cel02: running       running         running
dm01cel03: stopped       running         running
dm01cel04: running       running         running
dm01cel05: running       running         running
dm01cel06: running       running         running
dm01cel07: running       running         running
dm01cel08: running       running         running
dm01cel09: running       running         running
dm01cel10: running       running         running
dm01cel11: running       running         running
dm01cel12: running       running         running
dm01cel13: running       running         running
dm01cel14: running       running         running
[root@dm01db01 ~]#

It seems that on this rack cellsrv on cel03 has stopped. Let's zoom in, log on to that cell and verify:

CellCLI> list cell attributes cellsrvStatus,msStatus,rsStatus
         stopped         running         running

Well, that is expected: same command, same output. But now let's double-check this against what is actually happening on the OS. Let's see what processes are actually running at OS level:

[root@dm01cel03 trace]# ps -ef | grep cellsrv/bin/cell[srv]
root      3143  3087  0 12:00 ?        00:00:00 /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/bin/cellsrvstat -interval=5 -count=720
root     21040     1 60 Mar26 ?        20:33:37 /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/bin/cellsrv 100 5000 9 5042
root     25662     1  0 Mar26 ?        00:02:08 /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/bin/cellrssrm -ms 1 -cellsrv 1
root     25673 25662  0 Mar26 ?        00:00:07 /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/bin/cellrsbmt -ms 1 -cellsrv 1
root     25674 25662  0 Mar26 ?        00:00:07 /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/bin/cellrsmmt -ms 1 -cellsrv 1
root     25676 25673  0 Mar26 ?        00:00:01 /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/bin/cellrsbkm -rs_conf /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/deploy/config/cellrsms.state -cellsrv_conf /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/deploy/config/cellrsos.state -debug 0
root     25710 25676  0 Mar26 ?        00:00:07 /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/bin/cellrssmt -rs_conf /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/deploy/config/cellinit.ora -ms_conf /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/deploy/config/cellrsms.state -cellsrv_conf /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/deploy/config/cellrsos.state -debug 0
[root@dm01cel03 trace]#

That is confusing: cellcli tells me that cellsrv is down, but “ye good ol'” ps is telling me that cellsrv is up-and-running as it should be. It looks like my cell storage is available; let's verify at ASM level that we have all 12 disks from that cell available and that we don’t have any repair timers counting down:

SYS@+ASM1> select count(*), repair_timer from v$asm_disk where path like '%DATA%dm05cel03' group by repair_timer;

  COUNT(*) REPAIR_TIMER
---------- ------------
        12            0

1 row selected.

All disks have a repair timer of 0, meaning that no disks have failed; if there really was a problem with the disks we would see the repair_timer counting down.

Now that we have confirmed that cellsrv is available, the output of cellcli is just plain wrong here. So what is going on? Let's start by checking the cell alertlog in $CELLTRACE:

[RS] monitoring process /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/cellsrv/bin/cellrsomt (pid: 0) returned with error: 126
[RS] Monitoring process for service CELLSRV detected a flood of restarts. Disable monitoring process.
Errors in file /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/log/diag/asm/cell/dm05cel03/trace/rstrc_25662_4.trc  (incident=179):
RS-7445 [CELLSRV monitor disabled] [Detected a flood of restarts] [] [] [] [] [] [] [] [] [] []
Incident details in: /opt/oracle/cell11.2.3.2.1_LINUX.X64_130912/log/diag/asm/cell/dm05cel03/incident/incdir_179/rstrc_25662_4_i179.trc
Sweep [inc][179]: completed

The cellrs processes monitor the cellms and cellsrv processes; however, there is a flood control built in to prevent a loop of restarts. If that happened it could bring down a cell, so to prevent it this flood control was built in. When this kicks in, RS will stop monitoring the problematic service, cellsrv in this case. This also means that it will report back to cellcli that the process is stopped. Personally I think this built-in flood control is a good thing; however, I would like to see cellcli report it properly. For instance, it would be nice if Oracle would let cellcli report the cellsrv status as something like intermediate when RS has stopped monitoring it; it now says ‘stopped’, which is not true at all. This also means that when you see cellcli reporting that cellsrv is down, you always need to double check whether this is actually true before you try restarting cellsrv.
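
A hedged sketch of that double check, plus the CellCLI command to restart the RS service so it resumes monitoring (verify the restart syntax against your own cell software release before using it on a production cell):

# What does cellcli claim?
cellcli -e list cell attributes cellsrvStatus,msStatus,rsStatus

# What is actually running at OS level?
ps -ef | grep cellsrv/bin/cell[srv]

# Only when you are sure a restart is needed: bounce RS so it picks up monitoring again
cellcli -e alter cell restart services rs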

Rolling back a failed rootupgrade.sh CRS upgrade

Recently I was upgrading a half rack Exadata to Grid Infrastructure 11.2.0.4 for a customer who had 1 node removed from the cluster, or at least so we thought. During the upgrade we ran rootupgrade.sh on the first 2 nodes without issues. But when running the script on what was supposed to be the 3rd and final node in the cluster, rootupgrade.sh failed with the following error:

CRS-1119: Unable to complete Oracle Clusterware upgrade while nodes dm0104 have not yet upgraded
CRS-1112: Failed to set the Oracle Clusterware operating version 11.2.0.4.0
CRS-4000: Command Set failed, or completed with errors.
/u01/app/11.2.0.4/grid/bin/crsctl set crs activeversion ... failed
Failed to set active version of the Grid Infrastructure at /u01/app/11.2.0.4/grid/crs/install/crsconfig_lib.pm line 9284.
/u01/app/11.2.0.4/grid/perl/bin/perl -I/u01/app/11.2.0.4/grid/perl/lib -I/u01/app/11.2.0.4/grid/crs/install /u01/app/11.2.0.4/grid/crs/install/rootcrs.pl execution failed

So what to do now? The first step should be to find the root cause of the failed upgrade, then fix the problem and re-run rootupgrade.sh if possible; otherwise roll back the upgrade:

[root@dm0101 ~]# dcli -g dbs_group -l root /u01/app/11.2.0.4/grid/bin/crsctl query crs activeversion
dm0101: Oracle Clusterware active version on the cluster is [11.2.0.3.0]
dm0102: Oracle Clusterware active version on the cluster is [11.2.0.3.0]
dm0103: Oracle Clusterware active version on the cluster is [11.2.0.3.0]
[root@dm0101 ~]# dcli -g dbs_group -l root /u01/app/11.2.0.4/grid/bin/crsctl query crs softwareversion
dm0101: Oracle Clusterware version on node [dm0201] is [11.2.0.4.0]
dm0102: Oracle Clusterware version on node [dm0202] is [11.2.0.4.0]
dm0103: Oracle Clusterware version on node [dm0203] is [11.2.0.4.0]
[root@dm0101 ~]#

Let's run cluvfy to verify the status of CRS on all nodes:

[oracle@dm0101 [] grid]$ mkdir /tmp/cvudbg
[oracle@dm0101 [] grid]$ export CV_TRACELOC=/tmp/cvudbg
[oracle@dm0101 [] grid]$ export SRVM_TRACE=true
[oracle@dm0101 [] grid]$ export SRVM_TRACE_LEVEL=1
[oracle@dm0101 [] grid]$ ./runcluvfy.sh comp crs -n all

Verifying CRS integrity

Checking CRS integrity...

WARNING:
PRVF-4038 : CRS is not installed on nodes:
dm0204
Verification will proceed with nodes:
dm0103,dm0102,dm0101


ERROR:
PRVG-10605 : Release version [11.2.0.4.0] is consistent across nodes but does not match the active version [11.2.0.3.0].
PRVG-10603 : Clusterware version consistency failed
Check failed on nodes:
dm0103,dm0102,dm0101

CRS integrity check failed

Verification of CRS integrity was unsuccessful.
Checks did not pass for the following node(s):
dm0204
[oracle@dm0101 [] grid]$

In this case the error was caused by the 4th node being only partially removed from the cluster. The fix here was to roll back the upgrade, remove the 4th node properly, and then re-run rootupgrade.sh. Rolling back a failed rootupgrade.sh is done by running rootcrs.pl, where you start the rollback in reverse order. So in our case I start with node number 3, then run rootcrs.pl on node 2, using the following command:

[root@dm0101 ~]# /u01/app/11.2.0.4/grid/crs/install/rootcrs.pl -downgrade -oldcrshome /u01/app/11.2.0.3/grid -version 11.2.0.3.0 -force

Node 1 (dm0101) is the last node, on which we start the rootcrs.pl script with the parameter -lastnode; this tells rootcrs.pl to look in the $GI_HOME/cdata directory for the OCR backup that rootupgrade.sh made when it was started on the first node:

[root@dm0101 ~]# /u01/app/11.2.0.4/grid/crs/install/rootcrs.pl -downgrade -lastnode -oldcrshome /u01/app/11.2.0.3/grid -version 11.2.0.3.0 -force
Using configuration parameter file: /u01/app/11.2.0.4/grid/crs/install/crsconfig_params
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'dm0101'
CRS-2673: Attempting to stop 'ora.crsd' on 'dm0101'
CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'dm0101'
CRS-2673: Attempting to stop 'ora.dm0203-bk-vip.vip' on 'dm0101'
CRS-2673: Attempting to stop 'ora.LISTENER_SCAN2.lsnr' on 'dm0101'
CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'dm0101'
CRS-2673: Attempting to stop 'ora.elpadop.web.svc' on 'dm0101'
CRS-2673: Attempting to stop 'ora.lsfdp.lsfdpdg.svc' on 'dm0101'
CRS-2673: Attempting to stop 'ora.montyp.montypdg.svc' on 'dm0101'
...
..
.
CRS-2673: Attempting to stop 'ora.gpnpd' on 'dm0101'
CRS-2677: Stop of 'ora.gpnpd' on 'dm0101' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'dm0101' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Successfully downgraded Oracle Clusterware stack on this node
Run '/u01/app/11.2.0.3/grid/bin/crsctl start crs' on all nodes
[root@dm0101 ~]#

Now, to finalize the rollback, uninstall the USM (ACFS) components from the 11.2.0.4 home and install them from the 11.2.0.3 home, in the same order as you ran the rootcrs.pl scripts. So start with node 3, then 2, and end at node 1:

[root@dm0101 ~]# /u01/app/11.2.0.4/grid/bin/acfsroot uninstall
[root@dm0101 ~]# /u01/app/11.2.0.3/grid/bin/acfsroot install

At this point we are back at version 11.2.0.3 and we can remove those pesky remains of node 4 that are still there, and then restart the rootupgrade.sh scripts again.
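
A rough sketch of that node removal, assuming the standard clusterware node deletion procedure applies here (run as root from a surviving node; check the node removal documentation for your exact version, as there are additional inventory cleanup steps):

# How does the cluster currently see its members?
/u01/app/11.2.0.3/grid/bin/olsnodes -s -t

# Remove the half-deleted node from the clusterware configuration
/u01/app/11.2.0.3/grid/bin/crsctl delete node -n dm0104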