Last year Oracle added new functionality to dbms_stats.gather_system_stats, making it more Exadata aware and thus letting the optimizer make choices that might result in offloading more often. See this blog post from Kerry Osborne about it; I made a small comment about the MBRC there and would like to zoom in on that a bit more in this post.
So what happens during the gathering of Exadata system statistics? The easiest way to find out is to run an SQL trace alongside the system stats gathering:
SYS@kjj AS SYSDBA> alter session set events 'sql_trace wait=true, bind=true, level=12';

Session altered.

SYS@kjj AS SYSDBA> exec dbms_stats.gather_system_stats('EXADATA');

PL/SQL procedure successfully completed.
If we look in the trace file we see that Oracle starts with some recursive SQL statements to find out what datafiles there are and then starts doing a lot of IO; these IOs are all recorded under a dedicated wait event:
WAIT #139869656707320: nam='Disk file I/O Calibration' ela= 3255 count=1 p2=182669 p3=2 obj#=-1 tim=1379436371918507
It looks like Oracle is doing some type of IO to find out what this storage can do, so let's take a deeper look into these IOs. You can use the GDB route, described by Frits Hoogland here, but for this purpose there is a somewhat easier way. The new trace interface in 11g has an interesting option where we can trace the OSS calls to the storage:
SYS@kjj AS SYSDBA> oradebug setmypid
Statement processed.
SYS@kjj AS SYSDBA> oradebug unlimit
Statement processed.
SYS@kjj AS SYSDBA> alter session set events 'trace[libcell.client_library.*]';

Session altered.

SYS@kjj AS SYSDBA> alter session set events 'sql_trace wait=true, bind=true, level=12';

Session altered.

SYS@kjj AS SYSDBA> exec dbms_stats.gather_system_stats('EXADATA');

PL/SQL procedure successfully completed.

SYS@kjj AS SYSDBA> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/kjj/kjj/trace/kjj_ora_13915.trc
SYS@kjj AS SYSDBA> oradebug close_trace
Statement processed.
SYS@kjj AS SYSDBA>
So what am I doing there? First I remove the 5MB tracefile limit for my session, because the steps above will generate a huge trace file. Next I enable tracing of all the communication between my shadow process and the storage layer, so this:
alter session set events 'trace[libcell.client_library.*]';
will trace all of the following components:
Components in library LIBCELL:
--------------------------
  Client_Library       Client Library
  Disk_Layer           Disk Layer
  Network_Layer        Network Layer
  IPC_Layer            IPC Layer
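The component tree above comes from oradebug's event documentation; a way to list it yourself from a SYSDBA session (a sketch, assuming the 11.2 oradebug doc syntax):

```
SYS@kjj AS SYSDBA> oradebug doc component LIBCELL
```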
libcell.client_library.* means that I want to trace all calls to libcell.client_library and all three child layers. If we now take a look at the trace file and zoom in on the Disk file I/O Calibration waits, we see that each wait is now preceded by trace data containing information about the communication between my shadow process and the storage layer:
oss_open called for o/192.168.100.11/DATA_CD_09_dm01cel07 with guid 56d7b5d4-12a-0-0
connect: box name -192.168.100.11, port - 5042
TCP-connected to: box name -192.168.100.11, ospid - 14950, port - 5042, my box refid: 5, source id: 2575757666
SKGXP-connecting to the box 192.168.100.11 OSS OS Pid 14950
SKGXP-connected to the appliance number -728063998
Sending message 0xdabf2f0 with frag # 0, msgid 27 to box my_box_refid: 5 (my sourceid in box: 2575757666 box inc: 17)
Command with band 1, payload length 138, payload 0xdabf4f0
Command code OPEN, msgid 27, refid 0
Open flags: readwrite - 1, sync - 1, char - 0. Device name - DATA_CD_09_dm01cel07
Vrpc completion - cmd 0xdabf330, req 0xdabee50, msgid 27, frag # 1
Vrpc reply 0xdabf3e8 with refid 27 for existing message: 0xdabf2f0
Reply with band 0, payload length 0, payload (nil)
Command code REPLY OPEN, msgid 0, refid 27
Returned open file descriptor 0
opening file number 9
oss_read open file: 9, offset: 38498574336, blocks: 128
Sending message 0xdabf2f0 with frag # 0, msgid 28 to box my_box_refid: 5 (my sourceid in box: 2575757666 box inc: 17)
Command with band 0, payload length 0, payload (nil)
Command code BREAD, msgid 28, refid 0
Block arguments fd 0 offset 38498574336 cnt 128 blksize 512 prefetch 0
oss_read done
oss_wait called for request: 0xdabee50
Vrpc completion - cmd 0xdabf330, req 0xdabee50, msgid 28, frag # 1
Vrpc reply 0xdabf3e8 with refid 28 for existing message: 0xdabf2f0
Reply with band 7, payload length 65536, payload 0x7f35f0e37e00
Command code REPLY READ, msgid 0, refid 28
Number of bytes read 65536
oss_wait done, request 0xdabee50, result 0x7f35f116bfc8
WAIT #139869656707320: nam='Disk file I/O Calibration' ela= 3255 count=1 p2=182669 p3=2 obj#=-1 tim=1379436371918507
So the above is an excerpt of the generated tracefile, where we can see how Exadata handles its IO requests. I will zoom in on how Oracle does this; it is not specific to Exadata system stats though. In this case it is an IO call where Oracle needs to open a new connection to a celldisk; this disk was apparently not needed in earlier IO requests. The oss_open call at the top shows that Oracle wants to open celldisk DATA_CD_09 on cell07. Next it sets up the connection: it first contacts the cell over TCP and then switches to RDS using SKGXP, sending a message down the wire with msgid 27. The Vrpc reply shows the cell responding to our msgid 27 with refid 27, and we now have a new file descriptor to this celldisk, opened as file number 9. At this level files are not datafiles but chunks of data on a disk, so on this half rack we have 7 cells * 12 disks for a maximum of 84 disks. Now that we have our file descriptor open, we can issue our read request with the oss_read call, which requests 128 blocks. Keep in mind that these are blocks at the celldisk level and thus not database blocks; as the Block arguments line shows, they have a 512-byte blksize. Also note that this read request has msgid 28; it looks like these are handed out sequentially for your shadow process. Directly after the oss_read call Exadata issues an oss_wait and tries to reap the IO request. The REPLY READ shows that the 128 blocks resulted in a read of 65536 bytes (128 * 512, so 64K), after which the oss_wait call is done and Oracle posts our wait event.
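That the msgids are handed out sequentially is easy to eyeball from the trace itself; a minimal sketch (assumes the trace file was copied to the current directory, and the sed pattern is only an assumption based on the line layout shown above):

```shell
# Print the msgid of every outgoing ("Sending message") line in order of
# appearance; a per-process sequential counter shows up as an ascending
# list (27, 28, ...).
grep 'Sending message' kjj_ora_13915.trc |
  sed 's/.*msgid \([0-9]*\).*/\1/'
```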
Now that we know how Exadata handles IO requests, let's zoom in on the IOs behind our Disk file I/O Calibration waits:
[oracle@dm0101 [kjj] trace]$ grep 'oss_read open file' /u01/app/oracle/diag/rdbms/kjj/kjj/trace/kjj_ora_13915.trc
...
..
.
oss_read open file: 19, offset: 37603999744, blocks: 128
oss_read open file: 12, offset: 38469820416, blocks: 16
oss_read open file: 16, offset: 38503866368, blocks: 128
oss_read open file: 43, offset: 38536380416, blocks: 16
oss_read open file: 51, offset: 37821882368, blocks: 128
oss_read open file: 78, offset: 37753462784, blocks: 16
oss_read open file: 44, offset: 38539337728, blocks: 128
oss_read open file: 33, offset: 38455091200, blocks: 16
oss_read open file: 49, offset: 37558116352, blocks: 128
oss_read open file: 30, offset: 38454730752, blocks: 16
oss_read open file: 16, offset: 38439862272, blocks: 128
oss_read open file: 67, offset: 37605408768, blocks: 16
...
..
.
[oracle@dm0101 [kjj] trace]$
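A quick way to summarize these calibration reads is to tally the request sizes; a minimal sketch in shell (assumes the trace file was copied to the current directory; the 512-byte celldisk block size is taken from the Block arguments line in the trace above):

```shell
# Tally the "blocks:" values of all calibration reads in the trace.
# 128 celldisk blocks * 512 bytes = 65536 bytes (64K reads),
#  16 celldisk blocks * 512 bytes =  8192 bytes ( 8K reads).
grep 'oss_read open file' kjj_ora_13915.trc |
  awk -F'blocks: ' '{count[$2]++}
    END { for (b in count)
            printf "%3d blocks (%6d bytes): %d reads\n", b, b*512, count[b] }'
```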
So when Oracle fires dbms_stats.gather_system_stats('EXADATA') you see that it alternates between requesting 128 blocks and 16 blocks from different celldisks. Apparently Oracle uses these IOs to determine whether or not it can set the MBRC to 128 (it also takes db_cache_size into account). Then it deletes the old results in aux_stats$:
DELETE FROM SYS.AUX_STATS$ WHERE SNAME = 'SYSSTATS_MAIN'
and eventually it inserts the new values; amongst others MBRC, which in this case is set to 128:
PARSING IN CURSOR #139869655553736 len=71 dep=1 uid=0 oct=2 lid=0 tim=1379436419280111 hv=4016519028 ad='db24c118' sqlid='32kbjpmrqfdvn'
INSERT INTO SYS.AUX_STATS$ VALUES ('SYSSTATS_MAIN', 'MBRC', :B1 , NULL)
END OF STMT
BINDS #139869655553736:
 Bind#0
  oacdty=02 mxl=22(21) mxlc=00 mal=00 scl=00 pre=00
  oacflg=13 fl2=206001 frm=00 csi=00 siz=24 off=0
  kxsbbbfp=7f35f12e4088  bln=22  avl=03  flg=09
  value=128
EXEC #139869655553736:c=0,e=267,p=0,cr=0,cu=3,mis=0,r=1,dep=1,og=1,plh=0,tim=1379436419280370
CLOSE #139869655553736:c=0,e=1,dep=1,type=3,tim=1379436419280396
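After the gather completes you can verify what ended up in aux_stats$, since that is where the optimizer reads its system statistics from:

```sql
-- show the gathered system statistics, MBRC amongst them
SELECT pname, pval1
  FROM sys.aux_stats$
 WHERE sname = 'SYSSTATS_MAIN';
```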