Wednesday, October 31, 2012

Thank you!

I started this blog with some small hope that it would find a place online to grow, and tonight it has hit 100 page views!  To me:
  • that is really amazing, especially since I've kept it to myself and told nobody in my personal or professional life at this point, let alone shared it online at some of the other blogs I've been frequenting
  • it means that what I have to say, or share, is relevant or of some value to the general Oracle space
  • this gives me some motivation to keep going forward and building this site up more
So again, thanks to everybody out there that's visiting the site!!

-The Oracle EMT

Data (historical and backups)

I found this post a little while ago on cleaning up historical data, but I wanted to save blogging about it until I had something real to say about it.  That time is now, because it has become very apparent that another one of our Main Tenets should be something to do with data, maybe Data Consumption or Data Usage, but SOMETHING, because data is just piling up in our systems every day.  Audit data, user data, this data, and that data: how much is being purged, how much can be purged, who has a strategy for keeping track of all of this, and where does that fit in with a corporate strategy?

Skip several years later, and the database had grown to be huge for that time. Retrievals were taking longer and longer, and talking to some of the sysadmins late at night around the coffee machine, I learned that the backup was getting too large to even back it up within a week. The application was starting to hit pretty much a ‘brick wall’; it started to fall further and further behind, and something really needed to be done.

Does this sound familiar?  How big is your system?  Has it hit 1 TB?  Are you approaching 5, 10, 50, 100, or more TB of space for your footprint?  How do you approach your backup system?  Do the people who manage your backups tell you when you're having a problem, or do you have to figure it out yourself?  Questions like these are going through my head as I re-read that link, because lately in the mornings I've been seeing something like this in our new event monitoring alert:

  Waits                                 Wait Time       Server

  Backup: MML create a backup piece     79291.40622     X1
  Backup: sbtbackup                      3937.101761    X2
  Backup: sbtbackup                     17308.812339    X3

This sure seems like our systems are suddenly "wasting" a lot of cycles per hour on backups, especially when these events are from backups that run several hours into our morning and sometimes even into the afternoon.  So begins the investigation!
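
As a first step, it helps to see how long the backups themselves have actually been running and how much data they move.  A query along these lines against v$rman_backup_job_details (a rough sketch, not our monitoring query) gives a quick picture of the last week:

-- Sketch: duration and volume of recent RMAN backup jobs (last 7 days)
SELECT start_time,
       end_time,
       input_type,
       status,
       ROUND(elapsed_seconds / 3600, 1)             AS elapsed_hours,
       ROUND(input_bytes  / 1024 / 1024 / 1024, 1)  AS input_gb,
       ROUND(output_bytes / 1024 / 1024 / 1024, 1)  AS output_gb
  FROM v$rman_backup_job_details
 WHERE start_time > SYSDATE - 7
 ORDER BY start_time;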

Wednesday, October 17, 2012

System Wait Events Part 2

For my first follow-up post, I thought it would be appropriate to show why ample alerting can help you save the day.  It just so happens that on the day we were closing A/R for the year, we started seeing some very odd things in a new alert we have.

Waits                                                           Wait Time

  gc buffer busy acquire                                     17253.907174
  read by other session                                      16173.301272
  cell single block physical read                             4533.532144
  Disk file operations I/O                                    3961.591694
  gc buffer busy release                                       3783.63651
  RMAN backup & recovery I/O                                  2962.854376
  buffer busy waits                                           1803.283808
  gc current block busy                                       1652.088758
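
For what it's worth, an alert like this can be built from something as simple as a snapshot of gv$system_event (a rough sketch only; the view is cumulative since instance startup, so a real alert would compare deltas between samples):

-- Sketch: top non-idle wait events summed across all instances
SELECT event,
       wait_class,
       SUM(total_waits)                       AS total_waits,
       ROUND(SUM(time_waited_micro) / 1e6, 2) AS wait_seconds
  FROM gv$system_event
 WHERE wait_class <> 'Idle'
 GROUP BY event, wait_class
 ORDER BY wait_seconds DESC;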

At this time, we see in our DB logs a multitude of these entries:

Tue Sep 04 13:15:04 2012
OSPID: 19311: connect: ossnet: connection failed to server x.x.x.x, result=5 (login: sosstcpreadtry failed) (difftime=1953)

Anecdotal evidence that something was going wrong: at this time I observed closing a form take 20-30 seconds for one individual, and another business user indicated that for the last hour or so he had been getting a "circle mouse" (waiting on the form) whenever he tried to move between records on a student account.  Knowing that "something is going on," I started drilling in to investigate the sessions making up some of the time in the alert buckets above, and this session breakdown of wait times over 2 seconds shows:

WAIT_EVENT                       TOT_WAITS  WAIT_PCT  WAIT_SECONDS  WAIT_MINUTES  TIME_PCT
gc buffer busy acquire                1033      0.09        418.49          6.97      0.03
gc cr block 2-way                    31312      2.71          3.21          0.05         0
gc cr block busy                      6460      0.56          3.24          0.05         0
SQL*Net message from client         532955     46.14    1526240.71      25437.35     99.96
cell single block physical read       1972      0.17           2.8          0.05         0
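
The exact alert query isn't shown here, but a sketch along these lines against gv$session_event produces a similar per-session breakdown (&inst_id and &sid are just placeholders for the session being investigated, and the percentages are relative to that session's own totals):

-- Sketch: one session's wait profile with each event's share of waits and time
SELECT event                                                       AS wait_event,
       total_waits                                                 AS tot_waits,
       ROUND(RATIO_TO_REPORT(total_waits) OVER () * 100, 2)        AS wait_pct,
       ROUND(time_waited_micro / 1e6, 2)                           AS wait_seconds,
       ROUND(time_waited_micro / 60000000, 2)                      AS wait_minutes,
       ROUND(RATIO_TO_REPORT(time_waited_micro) OVER () * 100, 2)  AS time_pct
  FROM gv$session_event
 WHERE inst_id = &inst_id
   AND sid     = &sid
 ORDER BY time_waited_micro DESC;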

I don’t know what the session was doing, or if it was successful, but that doesn’t look like a normal session wait time profile on our system.  Later we saw this as well (but it was related to our DBA trying to kill the RMAN jobs):
Tue Sep 04 14:18:52 2012
ARCH: Possible network disconnect with primary database

So, this is all nice and everything but what happened?

Answer:  Our storage engineer started researching the issue and found that our IB switches and other Exadata components had no issues, but when he checked out our 7420 storage array it was a completely different matter altogether.  There were no issues with head 1 of the array, but when he went to check on head 2 he was unable to log in through the GUI or ssh, and a simple df command hung on an NFS mount coming from that head.  When he tried a takeover to move NFS control over to head 1, head 2 complained about lacking the resources to fork a process, and the only way he could force the takeover was to power down head 2.  What I learned while researching for this article about what a 7420 head configuration can look like can be found here, in Illustration 20 and Illustration 21 and the paragraph that follows them.

Wednesday, October 3, 2012

Parallel Executions

I mentioned earlier that we implemented Exadata, and I learned something about parallel execution as a result!  Setting aside the fact that too much parallel execution almost brought the system down after the first few hours, one of our developers found that a simple query of a table with 54 rows in 8 blocks was taking far longer to process than in our old environment.

Global Stats
=======================================
| Elapsed |  Other   | Fetch | Buffer |
| Time(s) | Waits(s) | Calls |  Gets  |
=======================================
|    0.00 |     0.00 |     1 |      3 |
=======================================

Global Stats
===============================================================
| Elapsed |   Cpu   | Concurrency |  Other   | Fetch | Buffer |
| Time(s) | Time(s) |  Waits(s)   | Waits(s) | Calls |  Gets  |
===============================================================
|    1.92 |    0.02 |        0.86 |     1.04 |     1 |      5 |
===============================================================

The 0 values in the first example were actually recorded as 0.000079s, so when we were seeing the same call take 2-3 seconds at a time, he started drilling down to see why our online performance was so bad.  He found that the default INSTANCES setting on the table was causing our optimizer to compute a parallel execution plan, which had never caused us a problem before because we didn't have a huge number of potential slaves available for the parallel execution to take up the overhead of our RAC nodes.  The sins of our past, marking tables and indexes for parallel execution without needing to or really utilizing the feature, had caught up with us; Exadata didn't CAUSE the flaws, it merely REVEALED them to us.  So we obviously fixed this, as well as limited parallel execution on our DB nodes, but what else did we do?
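
(As an aside, the "fixed this" part amounts to something like the sketch below; the object names and the slave limit are hypothetical, not our actual values.)

-- Remove the unintended parallel attributes (object names are made up for illustration)
ALTER TABLE app_owner.small_lookup_table NOPARALLEL;
ALTER INDEX app_owner.small_lookup_table_pk NOPARALLEL;

-- Cap the parallel slave pool at the instance level (the value here is illustrative only)
ALTER SYSTEM SET parallel_max_servers = 64 SCOPE=BOTH SID='*';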

We learned from our failure by creating an alert to track anything currently running marked as parallel in our application, just in case we had other objects marked similarly (or implement new objects that way in the future):

SELECT sql_id,
       SUBSTR(sql_text, 0, 200) AS sql_text,
       SUM(px_servers_executions) AS sum_px_exec,
       TO_CHAR(MAX(last_active_time), 'DD-MON-YYYY HH24:MI:SS') AS last_active_time
  INTO &SQLID,
       &TEXT,
       &EXEC,
       &TIME
  FROM gv$sqlstats
 WHERE px_servers_executions > 0
   AND UPPER(sql_text) NOT LIKE '%GV$%'
   AND UPPER(sql_text) NOT LIKE '%ORACLEOEM%'
   AND UPPER(sql_text) NOT LIKE '%GV%DB%'
   AND UPPER(sql_text) NOT LIKE '%SYS%DB%'
   AND last_active_time > SYSDATE - 1
 GROUP BY sql_id,
          sql_text
 ORDER BY sum_px_exec DESC

Obviously you can see I "commented" out (filtered, really, with the NOT LIKE clauses) a lot of the returns from gv$sqlstats, since many objects owned by the system run in parallel and we really don't care too much about those.  Here are a few more scripts you can use to identify the objects in your system that have parallelism activated:


-- DEGREE is stored as a string, so this comparison catches both
-- numeric degrees greater than 1 and the value 'DEFAULT'
select owner, table_name, degree
from   dba_tables
where  TRIM(degree) > TO_CHAR(1);

select owner, table_name, index_name, degree
from   dba_indexes
where  TRIM(degree) > TO_CHAR(1);
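
If you use partitioning, the partitioned objects carry a default degree of their own; a similar check along the same lines (a sketch, adjust to taste) would be:

select owner, table_name, def_degree
from   dba_part_tables
where  TRIM(def_degree) > TO_CHAR(1);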

Edit 10/4/2012:  Re-read this today and realized that I misspoke about what our alert was all about, so I corrected it by changing a few words.