Analyzing page allocation failures on Exadata

I have seen several clients struggling to decode page allocation failures on Exadata, so in this post I will try to explain how to read the backtrace. The following is an anonymized client case where page allocation failures lead up to a node reboot.

Jan 1 11:58:02 dm01db01 kernel: oracle: page allocation failure. order:1, mode:0x20
Jan 1 11:58:02 dm01db01 kernel: Pid: 80047, comm: oracle Tainted: P           2.6.32-400.11.1.el5uek #1
Jan 1 11:58:02 dm01db01 kernel: Call Trace:
Jan 1 11:58:02 dm01db01 kernel:  <IRQ>  [<ffffffff810ddf74>] __alloc_pages_nodemask+0x524/0x595
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110da3f>] kmem_getpages+0x4f/0xf4
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110dc3c>] fallback_alloc+0x158/0x1ce
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110ddd3>] ____cache_alloc_node+0x121/0x134
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110e3f3>] kmem_cache_alloc_node_notrace+0x84/0xb9
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8110e46e>] __kmalloc_node+0x46/0x73
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813b9aa8>] ? __alloc_skb+0x72/0x13d
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813b9aa8>] __alloc_skb+0x72/0x13d
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813b9bdb>] alloc_skb+0x13/0x15
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813b9f11>] dev_alloc_skb+0x1b/0x38
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02a3722>] ipoib_cm_alloc_rx_skb+0x31/0x1de [ib_ipoib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02a4d04>] ipoib_cm_handle_rx_wc+0x3a1/0x5b8 [ib_ipoib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa0198bdc>] ? mlx4_ib_free_srq_wqe+0x27/0x54 [mlx4_ib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa01904d4>] ? mlx4_ib_poll_cq+0x620/0x65e [mlx4_ib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa029fe97>] ipoib_poll+0x87/0x128 [ib_ipoib]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff813c4b69>] net_rx_action+0xc6/0x1cd
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8105e8cd>] __do_softirq+0xd7/0x19e
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff810aefdc>] ? handle_IRQ_event+0x66/0x120
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff81012eec>] call_softirq+0x1c/0x30
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff81014695>] do_softirq+0x46/0x89
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8105e752>] irq_exit+0x3b/0x7a
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff8145bea1>] do_IRQ+0x99/0xb0
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffff81012713>] ret_from_intr+0x0/0x11
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02bc2df>] ? kcalloc+0x35/0x3d [rds]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02bc2df>] ? kcalloc+0x35/0x3d [rds]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02bc724>] ? __rds_rdma_map+0x16c/0x32c [rds]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02bca09>] ? rds_get_mr+0x42/0x4f [rds]
Jan 1 11:58:02 dm01db01 kernel:  [<ffffffffa02b67b2>] ? rds_setsockopt+0xae/0x14f [rds]
Jan 1 11:58:02 dm01db01 kernel:  <EOI>  [<ffffffff81458045>] ? _spin_lock+0x21/0x25

Let's take a look at that call trace. The first line tells us that it is the oracle binary that is hitting the page allocation failure. The Linux kernel allocates pages in powers of 2, so 'order:1' means that Linux failed to allocate 2^1 pages, i.e. 2 contiguous pages. We can see how much memory that is if we look up the page size on Exadata:

[root@dm0101 ~]# getconf PAGESIZE
4096

So this means the kernel wanted to allocate 2^1 * 4096 = 8192 bytes (8kB).
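You can do that arithmetic straight from the shell as a quick sanity check:

[root@dm0101 ~]# echo $((2**1 * $(getconf PAGESIZE)))
8192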

So in this case the system could not allocate 8k of contiguous memory. If we walk through the trace we can see where it went wrong: in lines 10, 11 and 12 of the trace Linux tries to allocate socket buffers (__alloc_skb/alloc_skb) and fails. Walking further down the stack we can see that it is IP over InfiniBand that is requesting these buffers (ipoib_cm_alloc_rx_skb).

The mode:0x20 part is the GFP (Get Free Pages) flag that was passed with the request; you can look up what flag it is in gfp.h:

[root@dm0101 ~]# grep 0x20 /usr/src/kernels/`uname -r`/include/linux/gfp.h
#define __GFP_HIGH	((__force gfp_t)0x20u)	/* Should access emergency pools? */
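As a side note: on this kernel version GFP_ATOMIC is defined as just __GFP_HIGH, so this looks like a GFP_ATOMIC allocation, which fits the picture, because the allocation happens in softirq context where the kernel is not allowed to sleep and wait for memory. You can verify the definition in the same header:

[root@dm0101 ~]# grep "define GFP_ATOMIC" /usr/src/kernels/`uname -r`/include/linux/gfp.h
#define GFP_ATOMIC	(__GFP_HIGH)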

So if we look further down the messages file we can see the following part:

Jun 14 11:58:02 dm04dbadm02 kernel: Node 0 DMA free:15800kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15252kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes

This zone's watermarks are tiny (min/low/high of just 4kB) and it holds no slab pages at all: slab_reclaimable and slab_unreclaimable are both 0kB. The rds and ip structures, and thus also ipoib, are allocated from slab caches (you can monitor slabs through /proc/slabinfo):

[root@dm0101 ~]# cat /proc/slabinfo | grep "^ip\|rds"
rds_ib_frag        23600  23600   8192    1    4 : tunables    8    4    0 : slabdata  23600  23600      0
rds_ib_incoming    23570  23570   8192    1    4 : tunables    8    4    0 : slabdata  23570  23570      0
rds_iw_frag            0      0     40   92    1 : tunables  120   60    8 : slabdata      0      0      0
rds_iw_incoming        0      0    120   32    1 : tunables  120   60    8 : slabdata      0      0      0
rds_connection        18     77    688   11    2 : tunables   54   27    8 : slabdata      7      7      0
ip6_dst_cache         10     24    320   12    1 : tunables   54   27    8 : slabdata      2      2      0
ip6_mrt_cache          0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
ip_mrt_cache           0      0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
ip_fib_alias           0      0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
ip_fib_hash           28    106     72   53    1 : tunables  120   60    8 : slabdata      2      2      0
ip_dst_cache         486    530    384   10    1 : tunables   54   27    8 : slabdata     53     53      0
[root@dm0101 ~]#
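To get a feel for how much memory is pinned in those RDS caches, multiply the number of allocated objects by the object size shown in slabinfo. For rds_ib_frag that is roughly:

[root@dm0101 ~]# echo $((23600 * 8192 / 1024 / 1024))
184

So about 184MB in 8k objects for the fragment cache alone, and rds_ib_incoming holds about the same again.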

Furthermore, ipoib uses an MTU of 64k (65520 bytes), so a single receive buffer spans 16 contiguous pages; that is an order-4 allocation of slab-managed pages.

[root@dm0101 ~]# ifconfig bond0 | grep MTU
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1
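Quick shell arithmetic confirms the order: a 65520 byte buffer needs 16 pages of 4096 bytes, and 2^4 = 16 gives order 4:

[root@dm0101 ~]# echo $(( (65520 + 4095) / 4096 ))
16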

With RDS running on our interconnect this will eventually lead to a node eviction: if RDS messages can't be written into these pages, interconnect traffic won't arrive at this host (remember that InfiniBand is a lossless fabric).
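When you are chasing failures like this it also helps to keep an eye on /proc/buddyinfo, which lists per zone the number of free blocks of order 0, 1, 2 and so on; you can watch the higher orders run dry before the allocation failures start (the output obviously differs per system and point in time):

[root@dm0101 ~]# cat /proc/buddyinfo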

Exadata versus IPv6

Recently one of my customers got a complaint from their DNS administrators: our Exadatas were doing 40,000 DNS requests per minute. We like our DNS admins, so we had a look at these requests and what was causing them. I started by firing up tcpdump on one of the bonded client interfaces on a random compute node:

[root@dm01db01 ~]# tcpdump -i bondeth0 -s 0 port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bondeth0, link-type EN10MB (Ethernet), capture size 65535 bytes
15:41:04.937009 IP dm0101.domain.local.59868 > dnsserver01.domain.local:  53563+ AAAA? dm0101-vip.domain.local. (41)
15:41:04.937287 IP dm0101.domain.local.46672 > dnsserver01.domain.local:  44056+ PTR? 8.18.68.10.in-addr.arpa. (41)
15:41:04.938409 IP dnsserver01.domain.local > dm0101.domain.local.59868:  53563* 0/1/0 (116)
15:41:04.938457 IP dm0101.domain.local.56576 > dnsserver01.domain.local:  45733+ AAAA? dm0101-vip.domain.local.domain.local. (54)
15:41:04.939547 IP dnsserver01.domain.local > dm0101.domain.local.46672:  44056* 1/1/1 PTR dnsserver01.domain.local. (120)
15:41:04.940204 IP dnsserver01.domain.local > dm0101.domain.local.56576:  45733 NXDomain* 0/1/0 (129)
15:41:04.940237 IP dm0101.domain.local.9618 > dnsserver01.domain.local:  64639+ A? dm0101-vip.domain.local. (41)
15:41:04.941912 IP dnsserver01.domain.local > dm0101.domain.local.9618:  64639* 1/1/1 A dm0101-vip.domain.local (114)

So what are we seeing here? There are a bunch of AAAA requests to the DNS server and only one A record request. But the weirdest thing is of course the requests with the doubled domain name extensions. If we zoom in on those AAAA record requests we see the following. Here is the request:

15:41:04.937009 IP dm0101.domain.local.59868 > dnsserver01.domain.local:  53563+ AAAA? dm0101-vip.domain.local. (41)

And here is our answer:

15:41:04.938409 IP dnsserver01.domain.local > dm0101.domain.local.59868:  53563* 0/1/0 (116)

The interesting part is in the answer of the DNS server: with 0/1/0 it tells me that for this lookup it found 0 answer resource records, 1 authority resource record, and 0 additional resource records. So it could not resolve my VIP name in DNS. Now if we look at the A record request:

15:41:04.945697 IP dm0101.domain.local.10401 > dnsserver01.domain.local:  37808+ A? dm0101-vip.domain.local. (41)

and the answer:

15:41:04.947249 IP dnsserver01.domain.local > dm0101.domain.local.10401:  37808* 1/1/1 A dm0101-vip.domain.local (114)

Looking at the answer, 1/1/1, we can see that I got 1 answer record in return (the first 1), so the DNS server does know the IP for dm0101-vip.domain.local when an A record is requested. So what is going on here? The answer is simple: AAAA queries are IPv6 DNS lookups, and our DNS servers are not configured for IPv6 names, so these queries come back empty. And what about those weird double domain names like dm0101-vip.domain.local.domain.local? When Linux requests a DNS record the following happens:

1. Linux issues a DNS request for dm0101-vip.domain.local; because IPv6 is enabled, it issues an AAAA request.
2. The DNS server is not configured for IPv6 names and returns no answer.
3. Linux retries the request, looks into resolv.conf and appends the domain name (see the resolv.conf sketch below); we now have dm0101-vip.domain.local.domain.local.
4. Once again, the DNS server comes back empty.
5. Linux retries the AAAA request once more and appends the domain name again: dm0101-vip.domain.local.domain.local.domain.local.
6. The DNS server rejects this AAAA request as well.
7. Linux now falls back to IPv4 and issues an A request for dm0101-vip.domain.local.
8. The DNS server understands this and replies.
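The appending behaviour comes from the search domain in /etc/resolv.conf, which looks something like this (illustrative; the nameserver address here is the one behind the PTR query in the tcpdump output above):

search domain.local
nameserver 10.68.18.8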

This happens because Exadata comes with IPv6 enabled on both the InfiniBand and Ethernet interfaces:

[root@dm0101 ~]# ifconfig bond0;ifconfig bond1
bond0     Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.100.1  Bcast:192.168.100.255  Mask:255.255.255.0
          inet6 addr: fe80::221:2800:13f:2673/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1
          RX packets:226096104 errors:0 dropped:0 overruns:0 frame:0
          TX packets:217747947 errors:0 dropped:55409 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:320173078389 (298.1 GiB)  TX bytes:176752381042 (164.6 GiB)

bond1     Link encap:Ethernet  HWaddr 00:21:28:84:16:49  
          inet addr:10.18.1.10  Bcast:10.18.1.255  Mask:255.255.255.0
          inet6 addr: fe80::221:28ff:fe84:1649/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:14132063 errors:2 dropped:0 overruns:0 frame:2
          TX packets:7334898 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2420637835 (2.2 GiB)  TX bytes:3838537234 (3.5 GiB)

[root@dm0101 ~]# 

Let's disable IPv6; my client is not using IPv6 on its internal network anyway (like most companies, I assume). You can edit /etc/modprobe.conf to prevent the module from being loaded at boot time; add the following 2 lines to modprobe.conf:

alias ipv6 off
install ipv6 /bin/true

Then add the following entry to /etc/sysconfig/network:

IPV6INIT=no

Reboot the host and let's look at what we see after the host is up again:

[root@dm0103 ~]# cat /proc/net/if_inet6
00000000000000000000000000000001 01 80 10 80       lo
fe8000000000000002212800013f111f 08 40 20 80    bond0
fe80000000000000022128fffe8e5f6a 02 40 20 80     eth0
fe80000000000000022128fffe8e5f6b 09 40 20 80    bond1
[root@dm0103 ~]# ifconfig bond0;ifconfig bond1
bond0     Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.100.3  Bcast:192.168.100.255  Mask:255.255.255.0
          inet6 addr: fe80::221:2800:13f:111f/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1
          RX packets:318265 errors:0 dropped:0 overruns:0 frame:0
          TX packets:268072 errors:0 dropped:16 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:433056862 (412.9 MiB)  TX bytes:190905039 (182.0 MiB)

bond1     Link encap:Ethernet  HWaddr 00:21:28:8E:5F:6B  
          inet addr:10.18.1.12  Bcast:10.18.1.255  Mask:255.255.255.0
          inet6 addr: fe80::221:28ff:fe8e:5f6b/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:10256 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5215 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1559169 (1.4 MiB)  TX bytes:1350653 (1.2 MiB)

[root@dm0103 ~]# 

So disabling the ipv6 module through modprobe.conf did not do the trick. What did bring up the IPv6 stack?

[root@dm0103 ~]# lsmod | grep ipv6
ipv6 291277 449 bonding,ib_ipoib,ib_addr,cnic

The InfiniBand stack brought up IPv6. We can still disable IPv6 at the kernel level:

[root@dm0103 ~]# sysctl -a | grep net.ipv6.conf.all.disable_ipv6
net.ipv6.conf.all.disable_ipv6 = 0
[root@dm0103 ~]# echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6
[root@dm0103 ~]# sysctl -a | grep net.ipv6.conf.all.disable_ipv6 
net.ipv6.conf.all.disable_ipv6 = 1
[root@dm0103 ~]# cat /proc/net/if_inet6
[root@dm0103 ~]# 
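Keep in mind that echoing into /proc only changes the running kernel; to make the setting survive a reboot you would add this line to /etc/sysctl.conf (although, as we will see in a moment, on Exadata you will want to undo all of this):

net.ipv6.conf.all.disable_ipv6 = 1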

Now we are running this Exadata compute node without IPv6. Let's check whether we still have InfiniBand connectivity: on a cell, start an ibping server and use ibstat to get the port GUIDs:

[root@dm01cel01 ~]# ibstat -p
0x00212800013ea3bf
0x00212800013ea3c0
[root@dm01cel01 ~]# ibping -S

On our IPv6-disabled host, start an ibping to one of the port GUIDs we just listed:

[root@dm0103 ~]# ibping -c 4 -v -G 0x00212800013ea3bf
ibwarn: [14476] ibping: Ping..
Pong from dm01cel01.oracle.vxcompany.local.(none) (Lid 6): time 0.148 ms
ibwarn: [14476] ibping: Ping..
Pong from dm01cel01.oracle.vxcompany.local.(none) (Lid 6): time 0.205 ms
ibwarn: [14476] ibping: Ping..
Pong from dm01cel01.oracle.vxcompany.local.(none) (Lid 6): time 0.247 ms
ibwarn: [14476] ibping: Ping..
Pong from dm01cel01.oracle.vxcompany.local.(none) (Lid 6): time 0.139 ms
ibwarn: [14476] report: out due signal 0

--- dm01cel01.oracle.vxcompany.local.(none) (Lid 6) ibping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 4001 ms
rtt min/avg/max = 0.139/0.184/0.247 ms
[root@dm0103 ~]# 

So we have InfiniBand connectivity; let's see how Oracle reacts:

[root@dm0103 ~]# crsctl stat res -t

And now we play the waiting game… well, basically it never comes back. Looking at it with strace shows it trying to read from 2 network sockets; it hangs at:

[pid 15917] poll([{fd=3, events=POLLIN|POLLRDNORM}, {fd=4, events=POLLIN|POLLRDNORM}], 2, -1

That poll points to 2 file descriptors it can't read from:

[root@dm0103 ~]# ls -altr /proc/15917/fd
total 0
dr-xr-xr-x 7 root root  0 Feb  3 18:37 ..
lrwx------ 1 root root 64 Feb  3 18:37 4 -> socket:[3447070]
lrwx------ 1 root root 64 Feb  3 18:37 3 -> socket:[3447069]
lrwx------ 1 root root 64 Feb  3 18:37 2 -> /dev/pts/0
lrwx------ 1 root root 64 Feb  3 18:37 1 -> /dev/pts/0
lrwx------ 1 root root 64 Feb  3 18:37 0 -> /dev/pts/0
dr-x------ 2 root root  0 Feb  3 18:37 .
[root@dm0103 ~]# 
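If you want to know what kind of sockets those are, you can look the socket inode numbers up in the /proc/net tables; a sketch, assuming the usual files are present on this kernel:

[root@dm0103 ~]# egrep '3447069|3447070' /proc/net/tcp /proc/net/tcp6 /proc/net/udp /proc/net/unix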

There is a dependency between IPv6 and CRS on an Exadata: disabling IPv6 will cripple your clusterware. There is no real solution for this problem; we need IPv6 on an Exadata and we can't disable it. However, we can easily reduce the amount of IPv6 DNS lookups by extending our /etc/hosts file, adding the hostnames, VIP names etc. of all the hosts in our cluster to every single hosts file on the compute nodes. Unfortunately we can't do this on our cell servers, because Oracle does not want us to go ‘messing’ with them, so there you have to live with it for now.
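For illustration, such /etc/hosts entries would look something like this (the VIP address below is made up; use the real addresses of your client network):

10.18.1.10    dm0101.domain.local       dm0101
10.18.1.20    dm0101-vip.domain.local   dm0101-vip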

Adding an Exadata V2 as a target in Enterprise Manager 12c

Although Oracle says that Enterprise Manager 12c “provides the tools to effectively and efficiently manage your Oracle Exadata Database Machine”, it is a bit of a challenge to get it all working correctly on an Exadata V2. It looks like the Exadata plugin for Enterprise Manager 12c was developed on an X2 only; getting a V2 as a target into Enterprise Manager does not work out of the box. To get Enterprise Manager 12c to discover your Exadata V2 you need to perform some extra steps.

Exadata discovery is done using the first compute node in your Exadata rack (e.g. dm01db01). The agent uses a file called databasemachine.xml which is located in your One Command directory:

[oracle@dm01db01 [+ASM1] ~]$ ls -la /opt/oracle.SupportTools/onecommand/database*
-rw-r--r-- 1 root root 15790 May 10 22:07 /opt/oracle.SupportTools/onecommand/databasemachine.xml
[oracle@dm01db01 [+ASM1] ~]$

This file is generated with dbm_configurator.xls in the One Command directory. Unfortunately for V2 owners, early One Command versions did not generate these files, so you have to generate it yourself. Obviously you need Excel on a Windows PC to use dbm_configurator.xls, as it uses VBA (Visual Basic for Applications) to generate the One Command files.

  • From the first node in the rack, scp the following 2 files from /opt/oracle.SupportTools/onecommand:
    1. config.dat
    2. onecommand.params
  • Download One Command: Patch 13612149
  • Unzip the file p13612149_112242_Generic.zip on your Windows host
  • Extract the tarball onecmd.tar
  • Open dbm_configurator.xls in Excel
  • Enable macros within Excel
  • Click the import button in the top left and locate the onecommand.params file (make sure that config.dat is in the same directory)
  • Check that the imported data is still correct
  • Click the generate button
  • Click the create config files button

After this, upload at least the databasemachine.xml back to /opt/oracle.SupportTools/onecommand on the first node in your rack.

The next step is to correct the InfiniBand node descriptions of the compute node HCAs; right now on a V2 these are as follows:

[root@dm01db01 mlx4_0]# ibnodes | grep dm01db
Ca     : 0x00212800013f1242 ports 2 "dm01db04 HCA-1"
Ca     : 0x00212800013f12da ports 2 "dm01db02 HCA-1"
Ca     : 0x00212800013f111e ports 2 "dm01db03 HCA-1"
Ca     : 0x00212800013f2672 ports 2 "dm01db01 HCA-1"

Unfortunately the agent discovery process is looking for a naming convention of the form 'hostname S ip-address HCA-1'. Fortunately Oracle provides a script to correct this: /opt/oracle.cellos/ib_set_node_desc.sh. However, when you run this script on a V2 not much will happen; it is broken on a V2 system. The problem is in the InfiniBand bond naming:

[root@dm01db01 ~]# grep IFCFG_BONDIB /opt/oracle.cellos/ib_set_node_desc.sh
  local IFCFG_BONDIB=/etc/sysconfig/network-scripts/ifcfg-bondib
        local addr=`awk -F= 'BEGIN {IGNORECASE=1} /^IPADDR=[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {print $2}' $IFCFG_BONDIB$id 2>/dev/null`
[root@dm01db01 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bondib
cat: /etc/sysconfig/network-scripts/ifcfg-bondib: No such file or directory
[root@dm01db01 ~]# 

So the Exadata V2 IB bond has a different name: it is actually called bond0 instead of bondib:

[root@dm01db01 ~]# ifconfig bond0
bond0     Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
          inet addr:192.168.100.1  Bcast:192.168.100.255  Mask:255.255.255.0
          inet6 addr: fe80::221:2800:13f:2673/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:65520  Metric:1
          RX packets:55048256 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56638365 errors:0 dropped:21 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:12207158878 (11.3 GiB)  TX bytes:18646886557 (17.3 GiB)

[root@dm01db01 ~]# 

So instead of using the broken ib_set_node_desc.sh script, set the node description manually. Note the escaped backticks: they make sure that hostname and ifconfig are evaluated on each node dcli connects to, instead of being expanded locally before the command is shipped off:

[root@dm01db01 ~]# dcli -g dbs_group -l root "echo -n \`hostname -s\` S \`ifconfig bond0 | grep 'inet addr' | cut -f2 -d: | cut -f1 -d' '\` HCA-1 > /sys/class/infiniband/mlx4_0/node_desc"

If all went well you should end up with the following:

[root@dm01db01 ~]# ibnodes | grep dm01db
Ca     : 0x00212800013f1242 ports 2 "dm01db04 S 192.168.100.4 HCA-1"
Ca     : 0x00212800013f12da ports 2 "dm01db02 S 192.168.100.2 HCA-1"
Ca     : 0x00212800013f111e ports 2 "dm01db03 S 192.168.100.3 HCA-1"
Ca     : 0x00212800013f2672 ports 2 "dm01db01 S 192.168.100.1 HCA-1"

After these changes the guided discovery of your Exadata should run as described in the Cloud Control manual.

Peeking at your Exadata infiniband traffic

As a DBA you are probably very curious about what is going on on your system. So if you have a shiny Exadata you have probably had a look at the InfiniBand fabric that connects the compute nodes and the storage nodes. When you want to see what kind of traffic is going from the compute nodes to the storage nodes, or over the RAC interconnect, you can use tcpdump (if it is not installed you can do a 'yum install tcpdump'):

[root@dm01db02 ~]# tcpdump -i bond0 -s 0 -w /tmp/tcpdump.pcap
tcpdump: WARNING: arptype 32 not supported by libpcap - falling back to cooked socket
tcpdump: listening on bond0, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
2073 packets captured
2073 packets received by filter
0 packets dropped by kernel
[root@dm01db02 ~]#

This will give you a dump file (/tmp/tcpdump.pcap) which you can analyze with your favorite network analyzer (probably Wireshark). If you are new to this you can download and install Wireshark here: http://www.wireshark.org/download.html
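If you want a quick first look at the capture on the node itself before hauling the file over to a workstation, you can also read it back with tcpdump:

[root@dm01db02 ~]# tcpdump -nn -r /tmp/tcpdump.pcap | head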

Using tcpdump you can sniff all the IPoIB traffic (IP over InfiniBand), but can you also take a peek at the other traffic that is going over the InfiniBand wire? Yes there is a way: you can use Mellanox's ibdump. This tool is not installed by default on your compute nodes, so you need to download it and install it on the node of your choice (as a reminder: don't install anything on your cell servers!):

[root@dm01db02 ~]# wget http://www.mellanox.com/downloads/tools/ibdump-1.0.5-4-rpms.tgz
--2012-02-11 15:13:27--  http://www.mellanox.com/downloads/tools/ibdump-1.0.5-4-rpms.tgz
Resolving www.mellanox.com... 98.129.157.233
Connecting to www.mellanox.com|98.129.157.233|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 486054 (475K) [application/x-gzip]
Saving to: `ibdump-1.0.5-4-rpms.tgz'

100%[==========================================================================================================================================>] 486,054      290K/s   in 1.6s

2012-02-11 15:13:29 (290 KB/s) - `ibdump-1.0.5-4-rpms.tgz' saved [486054/486054]
[root@dm01db02 ~]

Extract the tarball:

[root@dm01db02 ~]# tar -xvf ibdump-1.0.5-4-rpms.tgz
ibdump-1.0.5-4-rpms/
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.i386-rhel5.4.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.ppc64-rhel5.4.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.i386-rhel5.5.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.ppc64-rhel5.5.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.i386-rhel5.6.rpm
ibdump-1.0.5-4-rpms/ibdump_release_notes.txt
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.x86_64-rhel5.4.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.x86_64-rhel5.5.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.ppc64-rhel5.6.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.x86_64-rhel5.6.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.i686-rhel6.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.ppc64-rhel6.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.x86_64-rhel6.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.i586-sles10sp3.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.ppc64-sles10sp3.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.x86_64-sles10sp3.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.i586-sles11.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.ppc64-sles11.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.i586-sles11sp1.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.ppc64-sles11sp1.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.x86_64-sles11sp1.rpm
ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.x86_64-sles11.rpm
[root@dm01db02 ~]#

Next step: install it. The binary will be placed in your /usr/bin folder:

[root@dm01db02 ~]# rpm -i ./ibdump-1.0.5-4-rpms/ibdump-1.0.5-4.x86_64-rhel`lsb_release -r|awk '{print $2}'`.rpm
[root@dm01db02 ~]# ls -la /usr/bin/ibdump
-rwxr-xr-x 1 root root 41336 Dec 19  2010 /usr/bin/ibdump
[root@dm01db02 ~]#

Now you are ready to play with ibdump. Running it without parameters makes ibdump sniff interface mlx4_0 (which is ib0) and write the frames to a file called sniffer.pcap in your working directory. Some parameters can be added, such as the dump file location:

[root@dm01db02 ~]# ibdump -o /tmp/ibdump.pcap
 ------------------------------------------------
 IB device                      : "mlx4_0"
 IB port                        : 1
 Dump file                      : /tmp/ibdump.pcap
 Sniffer WQEs (max burst size)  : 4096
 ------------------------------------------------

Initiating resources ...
searching for IB devices in host
Port active_mtu=2048
MR was registered with addr=0x1bc58590, lkey=0x8001c34e, rkey=0x8001c34e, flags=0x1
QP was created, QP number=0x60005b

Ready to capture (Press ^c to stop):
Captured:     11711 packets, 10978982 bytes

Interrupted (signal 2) - exiting ...

[root@dm01db02 ~]#

There are some drawbacks to ibdump though:

  • ibdump may encounter packet drops upon a burst of more than 4096 (or 2^max-burst) packets.
  • Packet loss is not reported by ibdump.
  • Outbound retransmitted and multicast packets may not be captured correctly.
  • ibdump may stop capturing packets when run on the same port as the Subnet Manager (e.g. opensm). It is advised not to run the SM and ibdump on the same port.

Be aware of the issues above; besides that: have fun peeking around at your Exadata InfiniBand fabric!