Wednesday, October 31, 2012

Thank you!

I started this blog with some small hope that it would find a place online to grow, and tonight it has hit 100 page views!  To me:
  • that is really amazing, especially since I've kept it to myself and told nobody in my personal or professional life at this point, let alone shared it online at some of the other blogs I've been frequenting
  • it means that what I have to say, or share, is relevant or of some value to the general Oracle space
  • this gives me some motivation to keep going forward and building this site up more
So again, thanks to everybody out there that's visiting the site!!

-The Oracle EMT

Data (historical and backups)

I found this post a little while ago on cleaning up historical data, but I wanted to save blogging about it until I had something real to say about it.  That time is now, because it has become very apparent that another one of our Main Tenets should be something to do with data, maybe Data Consumption or Data Usage, but SOMETHING, because data is just piling up in our systems every day.  Audit data, user data, this data, and that data: how much is being purged, how much can be purged, who has a strategy for keeping track of all of this, and where does that fit in with a corporate strategy?

Skip several years later, and the database had grown to be huge for that time. Retrievals were taking longer and longer, and talking to some of the sysadmins late at night around the coffee machine, I learned that the backup was getting too large to even back it up within a week. The application was starting to hit pretty much a ‘brick wall’; it started to fall further and further behind, and something really needed to be done.

Does this sound familiar?  How big is your system?  Has it hit 1 TB?  Are you approaching 5, 10, 50, 100, or more TB of space for your footprint?  How do you approach your backup system?  Do the people who manage your backups tell you when you're having a problem, or do you have to figure it out yourself?  Questions like these are going through my head as I re-read that link, because lately in the mornings I've been seeing something like this in our new event monitoring alert:

  Waits                                 Wait Time       Server

  Backup: MML create a backup piece     79291.40622     X1
  Backup: sbtbackup                      3937.101761    X2
  Backup: sbtbackup                     17308.812339    X3

This sure seems like our systems are suddenly "wasting" a lot of cycles per hour on backups, especially when these events are from backups that run several hours into our morning and sometimes even into the afternoon.  So begins the investigation!
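
As a first step, it helps to see how long the backups themselves have actually been running and how much data they move.  A query along these lines against v$rman_backup_job_details (a rough sketch, not our monitoring query) gives a quick picture of the last week:

-- Sketch: duration and volume of recent RMAN backup jobs (last 7 days)
SELECT start_time,
       end_time,
       input_type,
       status,
       ROUND(elapsed_seconds / 3600, 1)             AS elapsed_hours,
       ROUND(input_bytes  / 1024 / 1024 / 1024, 1)  AS input_gb,
       ROUND(output_bytes / 1024 / 1024 / 1024, 1)  AS output_gb
  FROM v$rman_backup_job_details
 WHERE start_time > SYSDATE - 7
 ORDER BY start_time;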

Wednesday, October 17, 2012

System Wait Events Part 2

For my first follow-up post, I thought it would be appropriate to show why ample alerting can help you save the day.  It just so happens that on the day we were closing A/R for the year, we started seeing some very odd things in a new alert we have.

Waits                                                           Wait Time

  gc buffer busy acquire                                     17253.907174
  read by other session                                      16173.301272
  cell single block physical read                             4533.532144
  Disk file operations I/O                                    3961.591694
  gc buffer busy release                                       3783.63651
  RMAN backup & recovery I/O                                  2962.854376
  buffer busy waits                                           1803.283808
  gc current block busy                                       1652.088758
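
For what it's worth, an alert like this can be built from something as simple as a snapshot of gv$system_event (a rough sketch only; the view is cumulative since instance startup, so a real alert would compare deltas between samples):

-- Sketch: top non-idle wait events summed across all instances
SELECT event,
       wait_class,
       SUM(total_waits)                       AS total_waits,
       ROUND(SUM(time_waited_micro) / 1e6, 2) AS wait_seconds
  FROM gv$system_event
 WHERE wait_class <> 'Idle'
 GROUP BY event, wait_class
 ORDER BY wait_seconds DESC;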

At this time, we see in our DB logs a multitude of these entries:

Tue Sep 04 13:15:04 2012
OSPID: 19311: connect: ossnet: connection failed to server x.x.x.x, result=5 (login: sosstcpreadtry failed) (difftime=1953)

Anecdotal evidence that something was going wrong: at this time I observed closing a form take 20-30 seconds for one individual, and another business user indicated that for the last hour or so he had been getting a "circle mouse" (waiting on the form) whenever he tried to move between records on a student account.  Knowing that "something is going on," I started drilling in to investigate the sessions making up some of the time in the alert buckets above, and this session breakdown of wait times over 2 seconds shows:

WAIT_EVENT                       TOT_WAITS  WAIT_PCT  WAIT_SECONDS  WAIT_MINUTES  TIME_PCT
gc buffer busy acquire                1033      0.09        418.49          6.97      0.03
gc cr block 2-way                    31312      2.71          3.21          0.05         0
gc cr block busy                      6460      0.56          3.24          0.05         0
SQL*Net message from client         532955     46.14    1526240.71      25437.35     99.96
cell single block physical read       1972      0.17           2.8          0.05         0
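
The exact alert query isn't shown here, but a sketch along these lines against gv$session_event produces a similar per-session breakdown (&inst_id and &sid are just placeholders for the session being investigated, and the percentages are relative to that session's own totals):

-- Sketch: one session's wait profile with each event's share of waits and time
SELECT event                                                       AS wait_event,
       total_waits                                                 AS tot_waits,
       ROUND(RATIO_TO_REPORT(total_waits) OVER () * 100, 2)        AS wait_pct,
       ROUND(time_waited_micro / 1e6, 2)                           AS wait_seconds,
       ROUND(time_waited_micro / 60000000, 2)                      AS wait_minutes,
       ROUND(RATIO_TO_REPORT(time_waited_micro) OVER () * 100, 2)  AS time_pct
  FROM gv$session_event
 WHERE inst_id = &inst_id
   AND sid     = &sid
 ORDER BY time_waited_micro DESC;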

I don’t know what the session was doing, or if it was successful, but that doesn’t look like a normal session wait time profile on our system.  Later we saw this as well (but it was related to our DBA trying to kill the RMAN jobs):
Tue Sep 04 14:18:52 2012
ARCH: Possible network disconnect with primary database

So, this is all nice and everything but what happened?

Answer:  Our storage engineer started researching the issue and found that our IB switches and other Exadata components had no issues, but when he checked out our 7420 storage array it was a completely different matter altogether.  There were no issues with head 1 of the array, but when he went to check on head 2 he was unable to log in through the GUI or ssh, and a simple df command hung on an NFS mount coming from that head.  When he tried a takeover to move NFS control over to head 1, head 2 complained about lacking the resources to fork a process, and the only way he could force the takeover was to power down head 2.  What I learned while researching for this article about what a 7420 head configuration can look like can be found here, in Illustration 20 and Illustration 21 and the paragraph that follows them.

Wednesday, October 3, 2012

Parallel Executions

I mentioned earlier that we implemented Exadata, and I learned something about parallel execution as a result!  Setting aside the fact that too much parallel execution almost brought the system down after the first few hours, one of our developers found that a simple query of a table with 54 rows in 8 blocks was taking far longer to process than in our old environment.

Global Stats
=======================================
| Elapsed |  Other   | Fetch | Buffer |
| Time(s) | Waits(s) | Calls |  Gets  |
=======================================
|    0.00 |     0.00 |     1 |      3 |
=======================================

Global Stats
===============================================================
| Elapsed |   Cpu   | Concurrency |  Other   | Fetch | Buffer |
| Time(s) | Time(s) |  Waits(s)   | Waits(s) | Calls |  Gets  |
===============================================================
|    1.92 |    0.02 |        0.86 |     1.04 |     1 |      5 |
===============================================================

The 0 values in the first example were actually recorded as 0.000079s, so when we were seeing the same call take 2-3 seconds at a time, he started drilling down to see why our online performance was so bad.  He found that the default INSTANCES setting on the table was causing our optimizer to compute a parallel execution plan, which had never caused us a problem before because we didn't have a huge number of potential slaves available for the parallel execution to take up the overhead of our RAC nodes.  The sins of our past, marking tables and indexes for parallel execution without needing to or really utilizing the feature, had caught up with us; Exadata didn't CAUSE the flaws, it merely REVEALED them to us.  So we obviously fixed this, as well as limited parallel execution on our DB nodes, but what else did we do?
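
(As an aside, the "fixed this" part amounts to something like the sketch below; the object names and the slave limit are hypothetical, not our actual values.)

-- Remove the unintended parallel attributes (object names are made up for illustration)
ALTER TABLE app_owner.small_lookup_table NOPARALLEL;
ALTER INDEX app_owner.small_lookup_table_pk NOPARALLEL;

-- Cap the parallel slave pool at the instance level (the value here is illustrative only)
ALTER SYSTEM SET parallel_max_servers = 64 SCOPE=BOTH SID='*';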

We learned from our failure by creating an alert to track anything currently running marked as parallel in our application, just in case we had other objects marked similarly (or implement new objects that way in the future):

SELECT sql_id,
       SUBSTR(sql_text, 0, 200) AS sql_text,
       SUM(px_servers_executions) AS sum_px_exec,
       TO_CHAR(MAX(last_active_time), 'DD-MON-YYYY HH24:MI:SS') AS last_active_time
  INTO &SQLID,
       &TEXT,
       &EXEC,
       &TIME
  FROM gv$sqlstats
 WHERE px_servers_executions > 0
   AND UPPER(sql_text) NOT LIKE '%GV$%'
   AND UPPER(sql_text) NOT LIKE '%ORACLEOEM%'
   AND UPPER(sql_text) NOT LIKE '%GV%DB%'
   AND UPPER(sql_text) NOT LIKE '%SYS%DB%'
   AND last_active_time > SYSDATE - 1
 GROUP BY sql_id,
          sql_text
 ORDER BY sum_px_exec DESC

Obviously you can see I "commented" out (filtered, really, with the NOT LIKE clauses) a lot of the returns from gv$sqlstats, since many objects owned by the system run in parallel and we really don't care too much about those.  Here are a few more scripts you can use to identify the objects in your system that have parallelism activated:


-- DEGREE is stored as a string, so this comparison catches both
-- numeric degrees greater than 1 and the value 'DEFAULT'
select owner, table_name, degree
from   dba_tables
where  TRIM(degree) > TO_CHAR(1);

select owner, table_name, index_name, degree
from   dba_indexes
where  TRIM(degree) > TO_CHAR(1);
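
If you use partitioning, the partitioned objects carry a default degree of their own; a similar check along the same lines (a sketch, adjust to taste) would be:

select owner, table_name, def_degree
from   dba_part_tables
where  TRIM(def_degree) > TO_CHAR(1);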

Edit 10/4/2012:  Re-read this today and realized that I misspoke about what our alert was all about, so I corrected it by changing a few words.