Creating a RAC cluster using Ansible (part 2)

In my previous two blogposts I explained how to set up a Vagrantfile that deploys two VirtualBox machines that can be used as the basis for a RAC cluster, and how you can use Ansible to deploy RAC on those VMs. In this post I want to dive a bit deeper into how I set up Ansible and why. Keep in mind that this is just one way of doing it.

The Github repository containing all the files can be found here

I am not going to go over the contents of the common role, as it probably speaks for itself. After the common role we first need to set up the networking that is needed for the interconnects. Because we have added extra network devices, we need to make sure that the device we configured as the interconnect interface always remains the interconnect; we configured VirtualBox so that these interfaces are on their own network. To keep the device naming persistent we configure udev. This works because in our Vagrantfile we set the MAC addresses of the interfaces to fixed values:

- name: Setup udev for network devices
  replace: dest=/etc/udev/rules.d/70-persistent-net.rules regexp='ATTR.*{{ item.device }}' replace='ATTR{address}=="{{ item.mac|lower }}", ATTR{type}=="1", KERNEL=="eth*", NAME="{{ item.device }}"'
  with_items: "{{ network_ether_interfaces }}"
  when: network_ether_interfaces is defined
  register: udev_net

The array referenced in the with_items line is a host-specific setting, so its value is taken from the host_vars/ file for each host. This means that the task above creates a udev rule for every network interface on the host Ansible is running against.
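To give an idea of what such a host_vars entry looks like, here is a minimal sketch; the attribute names match those used in the tasks and the template, but the addresses and MAC values below are made up for illustration:

network_ether_interfaces:
  - device: eth1
    bootproto: static
    address: 192.168.78.51
    netmask: 255.255.255.0
    mac: "08:00:27:AA:BB:51"
    onboot: yes
  - device: eth2
    bootproto: static
    address: 192.168.56.51
    netmask: 255.255.255.0
    mac: "08:00:27:AA:BB:52"
    onboot: yes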

Ansible uses Jinja2 as its templating format to generate files dynamically, and we use it for our network configuration files. With the Ansible template module, Ansible reads the ethernet.j2 file and the Jinja2 engine renders the correct file from it. In the task below, ethernet.j2 is rendered and uploaded as an ifcfg- file, and it does so on every host for which it can find data in the host_vars directory:

- name: Create the network configuration for ethernet devices
  template: src=ethernet.j2 dest=/etc/sysconfig/network-scripts/ifcfg-{{ item.device }}
  with_items: "{{ network_ether_interfaces }}"
  when: network_ether_interfaces is defined
  register: ether_result

The Jinja2 code is quite simple to read; the ethernet.j2 file looks like this:

# {{ ansible_managed }}
{% if item.bootproto == 'static' %}
DEVICE={{ item.device }}
BOOTPROTO=static
{% if item.address is defined %}
IPADDR={{ item.address }}
{% endif %}
{% if item.onboot is defined %}
ONBOOT={{ item.onboot }}
{% endif %}
{% if item.peerdns is defined %}
PEERDNS={{ item.peerdns }}
{% endif %}
{% if item.defroute is defined %}
DEFROUTE={{ item.defroute }}
{% endif %}
{% if item.netmask is defined %}
NETMASK={{ item.netmask }}
{% endif %}
{% if item.gateway is defined %}
GATEWAY={{ item.gateway }}
{% endif %}
{% if item.mac is defined %}
HWADDR={{ item.mac }}
{% endif %}
{% endif %}

{% if item.bootproto == 'dhcp' %}
DEVICE={{ item.device }}
BOOTPROTO=dhcp
{% endif %}

It is basically divided into two parts: the first part tells Jinja2 what to do for a statically configured device, the second handles DHCP-enabled devices. Jinja2 writes a line for every attribute it finds defined in the host_vars. The results of this action are registered by Ansible through the line "register: ether_result", and we use those results in the next action:

- name: bring up network devices
  shell: ifdown {{ item.item.device }}; ifup {{ item.item.device }}
  with_items: "{{ ether_result.results }}"
  when: ether_result is defined and item.changed

Here we only restart the interfaces that were registered in ether_result and actually changed. A more complex use of Ansible is the template for the hosts file. The hosts file is the basis for dnsmasq, which we use as a simpler alternative to BIND DNS. The template for the hosts file is built from the Ansible facts. These facts are gathered automatically as soon as a playbook begins, but we have changed things during this run, so right after bringing up the network interfaces we gather the facts again, as sketched below, and use the refreshed facts to build the hosts file.
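A task along these lines re-runs fact gathering in the middle of the play; this is a minimal sketch of the technique, the exact task in the repository may differ:

- name: Re-gather facts now that the interfaces are up
  setup:

With the refreshed facts available, the hosts file is generated from the following template: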

127.0.0.1 localhost
{% for host in groups['all'] %}
{{ hostvars[host]['ansible_eth2']['ipv4']['address'] }} {{ hostvars[host]['ansible_hostname'] }}
{{ hostvars[host]['ansible_eth2']['ipv4']['address'] | regex_replace('(^.*\.).*$', '\\1') }}{{ hostvars[host]['ansible_eth2']['ipv4']['address'].split('.')[3] | int + 10 }} {{ hostvars[host]['ansible_hostname'] }}-vip
{{ hostvars[host]['ansible_eth1']['ipv4']['address'] }} {{ hostvars[host]['ansible_hostname'] }}-priv
{% endfor %}
{% for i in range(1,4) %}
{{ hostvars[inventory_hostname]['ansible_eth2']['ipv4']['address'] | regex_replace('(^.*\.).*$', '\\1') }}{{ i + 250 }} rac-cluster-scan
{% endfor %}

The for-loop goes through the facts of every node in the cluster, takes the IPv4 address and generates the matching hosts file entry. It then takes the same IPv4 address and adds 10 to the last octet to create the address for the VIP interface. The eth1 address is used for the interconnect in our cluster. The last part of the file is a loop that generates three additional IPv4 addresses based on the address of eth2, adding 250 plus the loop counter to the last octet; these are the SCAN addresses of our cluster. Now that the hosts file is set up we can install dnsmasq and have DNS ready. We ping the interfaces just to make sure they are up; if the ping fails, Ansible will stop the playbook.
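As a sketch of such a check (the task in the repository may look slightly different), a ping per node makes the play fail on any unreachable interface, because the shell module reports the non-zero exit code:

- name: Check that every node answers on its private interconnect name
  shell: ping -c 2 {{ hostvars[item]['ansible_hostname'] }}-priv
  with_items: "{{ groups['all'] }}"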

Our network is now set up as we want it, so we can move on to configuring storage. Vagrant has already created four shared disks for us, presented to both virtual machines. We now have to make sure that the device names are persistent, which can be done with either ASMLib or udev. For both methods the disks need to be partitioned; I am using sfdisk as an easy way to create a partition on all four disks:

- name: Create disk partitions
  shell: echo "0," | sfdisk -q {{item.1}}
  with_indexed_items: "{{ device_list.stdout_lines }}"
  when: "{{ item.0 }} > 0"
  become: true
  run_once: true
  register: sfdisk_output

The with_indexed_items loop gives me an index number, which is used to make sure we do not destroy the partition table on /dev/sda, where the OS is installed. Because the OS is on the first disk, we let sfdisk start at index 1. When Ansible needs to install ASMLib, it installs the required RPMs and then uploads and runs a shell script. The reason for this script is that configuring ASMLib requires some interactivity; in regular shell scripting this is easily solved with a here document, and as far as I know Ansible has no equivalent. For udev we can more or less copy what we did to make our network interfaces persistent.
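A minimal sketch of what such a script can look like; the here document feeds the answers to the interactive prompts of oracleasm configure (driver owner, driver group, start on boot, scan on boot). The user and group names here are assumptions for illustration, not necessarily what the repository uses:

#!/bin/bash
# Answer the interactive prompts of oracleasm configure from a here document
/usr/sbin/oracleasm configure -i <<EOF
oracle
dba
y
y
EOF
# Load the kernel driver and mount the oracleasm filesystem
/usr/sbin/oracleasm init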

After the toolset is installed we can start with the installation of the Grid Infrastructure. The ASM diskstring and disk prefix depend on whether we configured the disks earlier with udev or with ASMLib. In order to use the correct values I add them to the Ansible facts as custom facts; depending on which role it finds, Ansible loads the matching values:

- set_fact: asm_diskstring="/dev/oracleasm/disks"
  when: "'configure_asmlib' in role_names"

- set_fact: asm_diskstring="/dev/ASMDISK*"
  when: "'configure_udev' in role_names"

Now we can create the SSH keys on both nodes and download the public keys back to our VirtualBox host. After we have uploaded the public keys to all the hosts, we can add the hosts to our known_hosts files using ssh-keyscan.
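As an illustration of that last step (a sketch; the exact task and paths are assumptions), ssh-keyscan can append all cluster nodes to the oracle user's known_hosts:

- name: Add all cluster nodes to known_hosts of the oracle user
  shell: ssh-keyscan -t rsa {{ hostvars[item]['ansible_hostname'] }} >> /home/oracle/.ssh/known_hosts
  with_items: "{{ groups['all'] }}"
  become: true
  become_user: oracle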

We upload the Grid Infrastructure zip files, extract them and install the cvuqdisk RPM on all nodes. For the installation of the Grid Infrastructure software itself we use the network_ether_interfaces array to set the client and interconnect interfaces correctly. The response file is generated from a Jinja2 template so we can easily customise settings depending on our needs. Because we are doing a silent install from a response file, we also need to run configToolAllCommands, which is normally an interactive step when you use the OUI. Finally we create the FRA diskgroup, after which we can clean up everything we uploaded and no longer need.
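Stripped of the Ansible wrapping, the core of that step boils down to a silent OUI run followed by the configuration tools; the stage paths below are placeholders for illustration:

# Silent Grid Infrastructure install using the response file rendered from the Jinja2 template
/u01/stage/grid/runInstaller -silent -waitforcompletion -responseFile /u01/stage/grid_install.rsp
# Run the configuration assistants that the OUI would otherwise run interactively
$GRID_HOME/cfgtoollogs/configToolAllCommands RESPONSE_FILE=/u01/stage/cfgtool.rsp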

The installation of the RDBMS software with Ansible is now very straightforward: adjust limits.conf, create the response file from its Jinja2 template and install the software like a regular silent install. The same goes for the actual creation of the database: the checks, the response file creation and the database creation itself. For database creation I currently use the response file method; as soon as I have some time I will switch this to database creation using a template, to have more fine-grained control over the created database.
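In the same spirit, the database creation itself essentially comes down to a silent dbca run against the templated response file (the path is a placeholder):

dbca -silent -responseFile /u01/stage/dbca.rsp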

If you take just the Vagrantfile and the Ansible files and start installing from scratch, it will take you a couple of hours (about 3) to download the Vagrant box and the software, create the machines and run the provisioning with Ansible.

  • The Github repository can be found here
  • The blogpost about Vagrant and the vagrantfile can be found here
  • Creating a RAC cluster using Ansible (part 1) can be found here

Grid logging in 12.1.0.2

Most of the talk about Oracle’s release of 12.1.0.2 is about the In-Memory feature, but more things have changed; for example, some essential things about logging in the Grid Infrastructure are different. Normally, Grid Infrastructure logging for the Grid components was done in the $GI_HOME:

oracle@dm01db01(*gridinfra):/home/oracle> cd $ORACLE_HOME/log/`hostname -s`/
oracle@dm01db01(*gridinfra):/u01/app/12.1.0.2/grid/log/dm01db01>

There we have the main alert log for GI and several subdirectories in which the GI binaries write their logging information to a set of rotating logfiles. So after upgrading an 11.2.0.4 cluster on Exadata, you will see that this alert log is empty, or just contains a couple of lines written at the point the GI stack was upgraded:

oracle@dm01db01(*gridinfra):/u01/app/12.1.0.2/grid/log/dm01db01> cat alertdm01db01.log
2014-08-21 11:59:35.455
[client(9458)]CRS-0036:An error occurred while attempting to open file "UNKNOWN".
2014-08-21 11:59:35.455
[client(9458)]CRS-0004:logging terminated for the process. log file: "UNKNOWN"

So in 12.1.0.2 the old directories are still there but are no longer used; where can we find the logging now? Prior to 12.1.0.2 all GI binaries also wrote directly to the alert.log file. Let's start by finding out where, for example, the crsd.bin process is writing to, by looking at its open file descriptors in the proc filesystem:

[root@dm01db01 ~]# cd /proc/`ps -C crsd.bin -o pid=`/fd
[root@dm01db01 fd]# ls -la | grep '\.log'
[root@dm01db01 fd]# ls -la | grep '\.trc'
lrwx------ 1 root root 64 Aug 21 11:41 1 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc
l-wx------ 1 root root 64 Aug 21 11:41 15 -> /u01/app/grid/diag/crs/dm01db01/crs/trace/crsd.trc
lrwx------ 1 root root 64 Aug 21 11:41 2 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc
lrwx------ 1 root root 64 Aug 21 11:41 3 -> /u01/app/grid/crsdata/dm01db01/output/crsdOUT.trc

Crsd.bin is not writing to any logfile anymore; it is writing to a trace file in a new location. The new logging location is $ORACLE_BASE/diag/crs/`hostname -s`/crs/trace/ and uses the regular ADR directory format. In the old structure all the logfiles were nicely divided into subdirectories; in the new structure everything sits in a single directory. That directory contains a lot of files; this is on a freshly installed cluster node:

[root@dm01db01 trace]# ls -la | wc -l
800

The majority of the files are trace files from cluster commands like oifcfg; every execution is logged and traced. These files use a -.trc name format. All GI processes can easily be listed (the example below is from an environment with role separation between GI and RDBMS):

[root@dm01db01 trace]# ls -la | grep [a-z].trc
-rw-rw----  1 grid     oinstall   1695134 Aug 24 15:30 crsd_oraagent_grid.trc
-rw-rw----  1 oracle   oinstall   4555557 Aug 24 15:30 crsd_oraagent_oracle.trc
-rw-rw----  1 root     oinstall   1248672 Aug 24 15:30 crsd_orarootagent_root.trc
-rw-rw----  1 oracle   oinstall   6156053 Aug 24 15:30 crsd_scriptagent_oracle.trc
-rw-rw----  1 root     oinstall   4856950 Aug 24 15:30 crsd.trc
-rw-rw----  1 grid     oinstall  10329905 Aug 24 15:31 diskmon.trc
-rw-rw----  1 grid     oinstall   2825769 Aug 24 15:31 evmd.trc
-rw-rw----  1 grid     oinstall       587 Aug 21 11:19 evmlogger.trc
-rw-rw----  1 grid     oinstall   8994527 Aug 24 15:31 gipcd.trc
-rw-rw----  1 grid     oinstall     12663 Aug 21 11:41 gpnpd.trc
-rw-rw----  1 grid     oinstall     11868 Aug 21 11:36 mdnsd.trc
-rw-rw----  1 grid     oinstall 132725348 Aug 24 15:31 ocssd.trc
-rw-rw----  1 root     oinstall   6321583 Aug 24 15:31 octssd.trc
-rw-rw----  1 root     oinstall     59185 Aug 24 14:02 ohasd_cssdagent_root.trc
-rw-rw----  1 root     oinstall     72961 Aug 24 14:38 ohasd_cssdmonitor_root.trc
-rw-rw----  1 grid     oinstall    804408 Aug 24 15:31 ohasd_oraagent_grid.trc
-rw-rw----  1 root     oinstall   1094709 Aug 24 15:31 ohasd_orarootagent_root.trc
-rw-rw----  1 root     oinstall  10384867 Aug 24 15:30 ohasd.trc
-rw-rw----  1 root     oinstall    169081 Aug 24 15:06 ologgerd.trc
-rw-rw----  1 root     oinstall   5781762 Aug 24 15:31 osysmond.trc
[root@dm01db01 trace]#

If we look at the old, pre-12.1.0.2 cluster alert log, you will see that there are a lot of processes writing directly to this file:

[root@dm02db01 ~]# lsof | grep alertdm02db01.log | wc -l
443

That is a lot of processes writing to one file. In the new environment it is a lot less… just one:

[root@dm01db01 trace]# lsof | grep alert.log
java      23914     root  335w      REG              252,2     29668   10142552 /u01/app/grid/diag/crs/dm01db01/crs/trace/alert.log
[root@dm01db01 trace]# ps -p 23914 -o pid,args
  PID COMMAND
23914 /u01/app/12.1.0.2/grid/jdk/jre/bin/java -Xms128m -Xmx512m -classpath
/u01/app/12.1.0.2/grid/tfa/dm01db01/tfa_home/jlib/RATFA.jar:/u01/app/12.1.0.2/grid/tfa/dm01db01/tfa_home/jlib/je-5.0.84.jar:

For now this new logging structure is a lot less organized than the previous structure. It is, however, much more like regular database tracing and logging, where there is one general alert log and the different background processes like pmon, arch and lgwr write to their own trace files.
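In practice this means that following cluster-wide events now comes down to tailing the single ADR-style alert log in the new location:

tail -f $ORACLE_BASE/diag/crs/`hostname -s`/crs/trace/alert.log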

Info about Exadata’s imageinfo and rpm’s

Recently we noticed that imageinfo was displaying the following information at a customer's site. Imageinfo was not displaying the data I expected, so here is a quick little blogpost about what was going on.

[root@dm01db01 ~]# dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root "imageinfo | grep 'Image version'"
dm01db01: Image version: 11.2.3.2.1.130109
dm01db02: Image version: 11.2.3.2.1.130109
dm01db03: Image version: 11.2.3.2.1.130109
dm01db04: Image version: 11.2.3.2.1.130109
dm01db05: Image version: 11.2.3.2.0.120713
dm01db06: Image version: 11.2.3.2.0.120713
dm01db07: Image version: 11.2.3.2.0.120713
dm01db08: Image version: 11.2.3.2.0.120713

So what happened here, did those last four nodes not get updated when the latest CellOS QDPFE was installed? A quick investigation shows that all the correct packages are in fact installed:

[root@dm0105 ~]# for i in `yum info | grep Name | grep exadata |cut -d: -f2`; do yum info $i ; done | grep 'Version\|Name'
Name       : exadata-applyconfig
Version    : 11.2.3.2.1.130109
Name       : exadata-asr
Version    : 11.2.3.2.1.130109
Name       : exadata-base
Version    : 11.2.3.2.1.130109
Name       : exadata-commonnode
Version    : 11.2.3.2.1.130109
Name       : exadata-exachk
Version    : 11.2.3.2.1.130109
Name       : exadata-firmware-compute
Version    : 11.2.3.2.1.130109
Name       : exadata-ibdiagtools
Version    : 11.2.3.2.1.130109
Name       : exadata-ipconf
Version    : 11.2.3.2.1.130109
Name       : exadata-onecommand
Version    : 11.2.3.2.1.130109
Name       : exadata-oswatcher
Version    : 11.2.3.2.1.130109
Name       : exadata-sun-computenode
Version    : 11.2.3.2.1.130109
Name       : exadata-validations-compute
Version    : 11.2.3.2.1.130109
[root@dm0101 rpm-extraxt]#

So according to yum everything is installed here, and rpm -qa confirms this, so no mismatch there:

[root@dm0105 ~]# rpm -qa | grep exadata
exadata-asr-11.2.3.2.1.130109-1
exadata-validations-compute-11.2.3.2.1.130109-1
exadata-base-11.2.3.2.1.130109-1
exadata-exachk-11.2.3.2.1.130109-1
exadata-oswatcher-11.2.3.2.1.130109-1
exadata-ipconf-11.2.3.2.1.130109-1
exadata-firmware-compute-11.2.3.2.1.130109-1
exadata-applyconfig-11.2.3.2.1.130109-1
exadata-onecommand-11.2.3.2.1.130109-1
exadata-commonnode-11.2.3.2.1.130109-1
exadata-sun-computenode-11.2.3.2.1.130109-1
exadata-ibdiagtools-11.2.3.2.1.130109-1
[root@dm0101 ~]#

It looks like imageinfo is displaying incorrect information here. How come, and where does it get its information from? Imageinfo gets its information from a file called image.id:

[root@dm01db05 ~]# cat /opt/oracle.cellos/image.id
id:1357731445
kernel:2.6.32-400.11.1.el5uek
kernelstock:2.6.32-300.19.1.el5uek
version:11.2.3.2.1.130109
type:production
label:OSS_11.2.3.2.1_LINUX.X64_130109
cellrpmver:OSS_11.2.3.2.1_LINUX.X64_130109
cellswver:OSS_11.2.3.2.1_LINUX.X64_130109
mode:install
created:2013-01-09 03:37:26 -0800
node:COMPUTE
f331a2bfa27cc7371d936009289c23b6 *commonos.tbz
22212b4436b57fba6d53be4d7a6fd975 *dbboot.tbz
c7cf016f2180ab820010a787b77c59fb *dbfw.tbz
fd9d8b4f4a30588ca39f66015d5edc4e *dbrpms.tbz
eaa1a7973efcdcd2fe40b09828ba0e01 *debugos.tbz
290909dec5b3995aed2298f20d345aea *exaos.tbz
cfda22b7599e47bdc686acf41fd5d67c *kernel.tbz
18db43d6e94797710ffa771702be33e8 *ofed.tbz
4831ba742464caaa11c1c0153b823886 *sunutils.tbz
activated:2013-11-24 04:41:03 +0100
status:success

You can see that this file contains all the information about the current version of your Exadata software stack, as well as a list of md5 checksums of the .tbz files in /opt/oracle.cellos/iso/cellbits. We checked the md5 checksums of those files as well, so now we know for 100% that we are on the correct level. So how does this file get updated? It seems to be out of date, or at least not in line with our system. It is generated when you patch your compute node to another level, and it is updated by the exadata-sun-computenode RPM. Let's peek a bit into that RPM and what it deploys; first I am going to mount an ISO file as my yum repository:

[root@dm0101 ~]# mount -o loop /u01/app/oracle/stage.2288/112_latest_repo_130109.iso /var/www/html/yum/unknown/EXADATA/dbserver/11.2/latest.2288/
[root@dm0101 ~]# cp /var/www/html/yum/unknown/EXADATA/dbserver/11.2/latest.2288/x86_64/exadata-sun-computenode-11.2.3.2.1.130302-1.x86_64.rpm 

Next step is to extract the files from the RPM and see what we get:

[root@dm0101 ~]# mkdir /tmp/exadata-sun-computenode-
[root@dm0101 ~]# cd /tmp/exadata-sun-computenode-
[root@dm0101 exadata-sun-computenode-]# rpm2cpio ../exadata-sun-computenode-11.2.3.2.1.130302-1.x86_64.rpm | cpio -idv
./etc/init.d/pciibmap
./etc/init.d/workaround_13083530
./etc/yum.repos.d/Exadata-computenode.repo.sample
./etc/yum/pluginconf.d/installonlyn.conf
./opt/oracle.SupportTools/corecontrol
./opt/oracle.SupportTools/dbserver_backup.sh
./opt/oracle.SupportTools/sundiag.sh
./opt/oracle.cellos/exadata.computenode.post.sh
./opt/oracle.cellos/tmpl/image.id.11.2.3.2.1.130302.in
194 blocks
[root@dm0101 exadata-sun-computenode-]#

Aha! There is a file called ./opt/oracle.cellos/tmpl/image.id.11.2.3.2.1.130302.in in that RPM, but it does not contain everything yet; it still doesn't have the md5 checksum information in it:

[root@dm0101 exadata-sun-computenode-]# cat ./opt/oracle.cellos/tmpl/image.id.11.2.3.2.1.130302.in
id:1362281283
kernel:2.6.32-400.21.1.el5uek
kernelstock:2.6.32-300.19.1.el5uek
version:11.2.3.2.1.130302
type:production
label:OSS_11.2.3.2.1_LINUX.X64_RELEASE
cellrpmver:OSS_11.2.3.2.1_LINUX.X64_130302
cellswver:OSS_11.2.3.2.1_LINUX.X64_130302
mode:install
created:2013-03-02 19:28:04 -0800
node:COMPUTE
[root@dm0101 exadata-sun-computenode-]#

To find out how that gets filled in, we need to take a look at the script that is executed when we install this RPM. The output below is truncated a bit for readability:

[root@dm0101 rpm-extraxt]# rpm --scripts -qp exadata-sun-computenode-11.2.3.2.1.130302-1.x86_64.rpm
.
..
fi
# Don't start post script for the fresh imaging (on zero boot)
# Don't install any files (remove them)
if [ -f /.post.imaging ]; then
  rm -f /opt/oracle.cellos/tmpl/image.id.11.2.3.2.1.130302.in
  rm -f /opt/oracle.cellos/exadata.computenode.post.sh
else
  /sbin/service exachkcfg postrpm 11.2.3.2.1.130302 exadata-sun-computenode
fi
..
.

So when we install this RPM it (among other things not shown here) either removes the image.id template and the shell script exadata.computenode.post.sh on a freshly imaged node, or executes the new version of exadata.computenode.post.sh through the exachkcfg service. Now exadata.computenode.post.sh is quite a big script: it sets up your compute node properly and then finally reboots it. In this huge script there is a function being called that actually calculates the md5 checksums of my cellbits directory and appends that information to the image.id file:

md5sum_iso_cellbits ()
{
  local imageid_out=$1

  [ "$imageid_out" == '' ] && imageid_out=$IMAGE_ID_DEF
  print_log "Please wait. Calculating checksums for $imageid_out ..." INFO
  # Remove any existing md5 sums from the output file
  perl -ne '/[a-f\d]\s+\*.+/ or print $_' -i $imageid_out
  local tmp_lst=/tmp/msd5sum.cellbits.lst
  pushd /opt/oracle.cellos/iso/cellbits
  md5sum -b * > $tmp_lst
  cat $tmp_lst >> $imageid_out
  rm -f $tmp_lst
  popd
}

Finally, the function db_update_imageinfo() in exadata.computenode.post.sh creates a new, up-to-date version of my image.id file, which makes imageinfo happy again. To make a long blogpost short, the solution to our misbehaving imageinfo (and also imagehistory, by the way) was to run

yum --enablerepo=exadata_dbserver_11.2.3.2.1_x86_64_base reinstall exadata-sun-computenode-11.2.3.2.1.130109-1

and let the nodes reboot. This solved our problem with the imageinfo command; unfortunately we could not determine the root cause of the image.id problem.
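After the reboots, the same dcli check from the start of this post can be used to confirm that all nodes now report the expected image version:

dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root "imageinfo | grep 'Image version'"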

Analyzing page allocation failures on Exadata

I have seen several clients struggling to decode page allocation failures on Exadata, so in this post I will try to explain how to read the backtrace. The following is an anonymized client case where page allocation failures lead up to a node reboot.

Jan 1 11:58:02 dm01db01 kernel: oracle: page allocation failure. order:1, mode:0x20
Jan 1 11:58:02 dm01db01 kernel: Pid: 80047, comm: oracle Tainted: P           2.6.32-400.11.1.el5uek #1
Jan 1 11:58:02 dm01db01 kernel: Call Trace:
Jan 1 11:58:02 dm01db01 kernel:  <IRQ>  [<ffffffff810ddf74>] __alloc_pages_nodemask+0x524/0x595
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110da3f>] kmem_getpages+0x4f/0xf4
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110dc3c>] fallback_alloc+0x158/0x1ce
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110ddd3>] ____cache_alloc_node+0x121/0x134
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110e3f3>] kmem_cache_alloc_node_notrace+0x84/0xb9
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110e46e>] __kmalloc_node+0x46/0x73
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813b9aa8>] ? __alloc_skb+0x72/0x13d
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813b9aa8>] __alloc_skb+0x72/0x13d
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813b9bdb>] alloc_skb+0x13/0x15
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813b9f11>] dev_alloc_skb+0x1b/0x38
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02a3722>] ipoib_cm_alloc_rx_skb+0x31/0x1de [ib_ipoib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02a4d04>] ipoib_cm_handle_rx_wc+0x3a1/0x5b8 [ib_ipoib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa0198bdc>] ? mlx4_ib_free_srq_wqe+0x27/0x54 [mlx4_ib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa01904d4>] ? mlx4_ib_poll_cq+0x620/0x65e [mlx4_ib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa029fe97>] ipoib_poll+0x87/0x128 [ib_ipoib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813c4b69>] net_rx_action+0xc6/0x1cd
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8105e8cd>] __do_softirq+0xd7/0x19e
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff810aefdc>] ? handle_IRQ_event+0x66/0x120
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff81012eec>] call_softirq+0x1c/0x30
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff81014695>] do_softirq+0x46/0x89
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8105e752>] irq_exit+0x3b/0x7a
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8145bea1>] do_IRQ+0x99/0xb0
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff81012713>] ret_from_intr+0x0/0x11
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02bc2df>] ? kcalloc+0x35/0x3d [rds]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02bc2df>] ? kcalloc+0x35/0x3d [rds]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02bc724>] ? __rds_rdma_map+0x16c/0x32c [rds]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02bca09>] ? rds_get_mr+0x42/0x4f [rds]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02b67b2>] ? rds_setsockopt+0xae/0x14f [rds]
Jan 1 11:58:02 dm01db01 kernel:  <EOI>  [<ffffffff81458045>] ? _spin_lock+0x21/0x25

Let's take a look at that call trace. On line 1 we see that it is the oracle binary that is having the page allocation failure. The Linux kernel allocates pages in powers of 2, so 'order:1' means that Linux failed to allocate 2^1 pages, 2 pages in total. We can see how much memory that is if we look up the page size on Exadata:

[root@dm0101 ~]# getconf PAGESIZE
4096

so this means we wanted to allocate 2^1 * 4096 = 8192 bytes.

So in this case the system could not allocate 8k of memory. If we walk through the trace we can see where it went wrong: in lines 10, 11 and 12 you can see that Linux tries to allocate socket kernel buffers (alloc_skb) and fails. Walking further down the stack we can see it is IP over InfiniBand that is trying to allocate these buffers (ipoib_cm_alloc_rx_skb).

The mode:0x20 part is the GFP (Get Free Pages) flag that was sent with the request; you can look up what flag it is in gfp.h:

[root@dm0101 ~]# grep 0x20 /usr/src/kernels/`uname -r`/include/linux/gfp.h
#define __GFP_HIGH	((__force gfp_t)0x20u)	/* Should access emergency pools? */

So if we look further down the messages file we can see the following part:

Jun 14 11:58:02 dm04dbadm02 kernel: Node 0 DMA free:15800kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15252kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes

As we can see we have only 4kB of low memory left and absolutely no room left in the kernel slabs at all. RDS and IP, and thus also IPoIB, use slab-allocated pages (you can monitor slabs using slabinfo):

[root@dm0101 ~]# cat /proc/slabinfo | grep "^ip\|rds"
rds_ib_frag        23600  23600   8192    1    4 : tunables    8    4    0 : slabdata  23600  23600      0
rds_ib_incoming    23570  23570   8192    1    4 : tunables    8    4    0 : slabdata  23570  23570      0
rds_iw_frag            0      0     40   92    1 : tunables  120   60    8 : slabdata      0      0      0
rds_iw_incoming        0      0    120   32    1 : tunables  120   60    8 : slabdata      0      0      0
rds_connection        18     77    688   11    2 : tunables   54   27    8 : slabdata      7      7      0
ip6_dst_cache         10     24    320   12    1 : tunables   54   27    8 : slabdata      2      2      0
ip6_mrt_cache          0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
ip_mrt_cache           0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
ip_fib_alias           0      0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
ip_fib_hash           28    106     72   53    1 : tunables  120   60    8 : slabdata      2      2      0
ip_dst_cache         486    530    384   10    1 : tunables   54   27    8 : slabdata     53     53      0
[root@dm0101 ~]#
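To keep an eye on how fast the RDS slabs grow on a running node, something as simple as the following will do (a suggestion, not part of the original case):

watch -d "grep '^rds' /proc/slabinfo"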

Furthermore, IPoIB uses an MTU size of 64k, so a single buffer already requires at least an order-4 allocation of slab-managed pages (2^4 * 4096 bytes = 64k):

[root@dm0101 ~]# ifconfig bond0 | grep MTU
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1

With RDS running on our interconnect this will eventually lead to a node eviction: if RDS messages cannot be written into these pages, interconnect traffic will not arrive at this host (remember that InfiniBand is a lossless fabric).