Wednesday, July 16, 2014

Why the wait event "SQL*Net message from dblink" can suddenly increase for no reason

Following yesterday's article about making assumptions post project implementation, are the technical details to what I was talking about!

We launched our 12c upgrade to a major Oracle database in our infrastructure and being one of the first in the company to do so made us (okay, probably just me) very cautious (or leery) about what bugs would shake out of the system after we put it in.  Over the weekend we did the upgrade by installing 12c on new hardware, made sure GoldenGate was up to date before we broke the link between systems, and then linking to the new system.  After the logs were synchronized and everything was tested, not a creature was stirring, not even a mouse.

The next day, we still had no obvious issues but our system wait event alert told me that the activity for the wait event "SQL*Net message from dblink" was starting to trend upwards from where it normally is during the day.  By midday, the activity was measured to be 10-15 TIMES what it was Friday before the deployment and not only that but "TCP Socket (KGAS)" activity was trending higher on the new 12c database.  Of course, 12c is the culprit here, and we have to find out what bug we are encountering right?  Not so fast.  We need to get evidence and prove that out!

First off, is what we were seeing in the database log every few seconds on the new 12c database connecting back out to our main database:

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 12.1.0.1.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 12.1.0.1.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 12.1.0.1.0 - Production
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0

This leads us to My Oracle Support article "ORA-0600 [17099]" in document ID 1325533.1 but why is nobody noticing a problem here that is being reported if "things are not right" to such a degree?  I'm glad you asked, because while I started researching this I noticed that my SQL client tool was hanging for up to 15 minutes before the error "ORA-03113: end-of-file on communication channel" was returned.  That is not right at all.

So, what kind of errors are we seeing in our main database log?  I am glad you asked!

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.3.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address:
Error 3135 trapped in 2PC on transaction 3283.9.55580. Cleaning up.
Error stack returned to user:
ORA-03135: connection lost contact
ORA-02063: preceding line from 12C_DB

Seeing that most of the error messages are the same between both databases made me do more research on My Oracle Support!  What I found was "Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out" in document ID 1628949.1 and again the article was extensive.  Not only in depth, but it had another hint to what the problem was which was found in the first MOS article I shared here.

What is the problem?  The SQLNET.EXPIRE_TIME parameter was not set in the $TNS_ADMIN/sqlnet.ora file on the new 12c platform, the way it was in the original database which we "cloned" over.  After setting this on both RAC nodes and reloading the listeners, the issue went away along with all of the error messages in the log file!

We never had a problem with 12c, even though it was easy to blame it outright at the start!

1 comment:

  1. Que tiempo le asignó a la variable sqlnet.expire_time?

    ReplyDelete