Long, but hopefully worth it. Chris, do you think this should also be duplicated in the database forum? If so, let me know, and I'll post it there as well.
After we upgraded from OTM 5.0 to OTM 5.5 CU4, we experienced performance problems on the database server. We’re running the certified Linux version on a server with 24 GB of RAM and four dual-core processors; a fairly robust system.
I wanted to use HUGE_PAGES to allocate a LOT of the machine’s memory to the database buffer cache (about 10 GB initially). My thinking was that accessing data from RAM is quicker than accessing it from disk (and that still stands, BTW).
So, the users got on the system and complained about performance; even moving from screen to screen was slow. Looking at the system with the Linux top utility, I noticed something rather strange that I hadn’t seen before: SYSTEM utilization was much higher than user utilization, meaning the system was busy doing something other than processing our data (i.e., housekeeping).
Here’s an example of the top session (the problem area is the sy column on the Cpu lines):
top - 09:51:56 up 19 days, 19:45, 2 users, load average: 8.97, 8.97, 8.48
Tasks: 255 total, 11 running, 244 sleeping, 0 stopped, 0 zombie
Cpu0 : 2.3% us, 97.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 3.0% us, 96.7% sy, 0.0% ni, 0.0% id, 0.3% wa, 0.0% hi, 0.0% si
Cpu2 : 5.0% us, 95.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu3 : 3.0% us, 97.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu4 : 2.3% us, 97.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu5 : 2.3% us, 97.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu6 : 2.3% us, 96.7% sy, 0.0% ni, 0.0% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu7 : 1.3% us, 98.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 24953744k total, 24934296k used, 19448k free, 73048k buffers
Swap: 36861100k total, 23132k used, 36837968k free, 11840736k cached
As you can see, the system is busy doing stuff, but NOT what we want it to be doing.
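To put a number on that imbalance, here is a minimal Python sketch that averages the us and sy columns from the Cpu lines pasted above. The regular expression and the helper name summarize_cpu are my own choices for illustration, not part of any tool mentioned in this post, and the pattern assumes the older top layout shown here.

```python
import re

# Per-CPU summary lines copied from the top output in this post.
TOP_CPU_LINES = """\
Cpu0 : 2.3% us, 97.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 3.0% us, 96.7% sy, 0.0% ni, 0.0% id, 0.3% wa, 0.0% hi, 0.0% si
Cpu2 : 5.0% us, 95.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu3 : 3.0% us, 97.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu4 : 2.3% us, 97.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu5 : 2.3% us, 97.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu6 : 2.3% us, 96.7% sy, 0.0% ni, 0.0% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu7 : 1.3% us, 98.7% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
"""

# Matches "CpuN : X% us, Y% sy" at the start of a line (assumed layout).
CPU_LINE = re.compile(
    r"^(Cpu\d+)\s*:\s*([\d.]+)% us,\s*([\d.]+)% sy", re.MULTILINE
)


def summarize_cpu(top_text):
    """Return (average user %, average system %) across all Cpu lines."""
    rows = CPU_LINE.findall(top_text)
    if not rows:
        raise ValueError("no Cpu lines found")
    user = [float(us) for _, us, _ in rows]
    system = [float(sy) for _, _, sy in rows]
    return sum(user) / len(rows), sum(system) / len(rows)


avg_user, avg_system = summarize_cpu(TOP_CPU_LINES)
print(f"average user: {avg_user:.2f}%  average system: {avg_system:.2f}%")
```

On this snapshot the system column averages roughly 97% across all eight CPUs, while user time stays under 3%.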
So, after a few iterations that I won’t go through in detail (disabling huge pages, changing some database parameters, etc.), our Linux guru, John Poff, looked closely at what was happening. His findings were quite interesting. He found that one of the Oracle system processes was calling the Linux system call gettimeofday, and it was calling it a LOT! How many times? Well, over a 30-second period, a single Oracle process called this function more than 505,000 times. That’s a LOT, and it was affecting our performance in a BIG way.
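For a sense of scale, here is a quick back-of-the-envelope calculation in Python using only the two figures above (505,000 calls, 30 seconds). Since the count was "over" 505,000, treat the results as a lower bound on the rate; the variable names are just for illustration.

```python
# Figures reported above: calls seen from one Oracle process in the window.
CALLS_OBSERVED = 505_000
WINDOW_SECONDS = 30

# Average call rate over the observation window.
calls_per_second = CALLS_OBSERVED / WINDOW_SECONDS

# Average gap between consecutive calls, in microseconds.
microseconds_between_calls = 1_000_000 / calls_per_second

print(f"about {calls_per_second:,.0f} calls per second")
print(f"about one call every {microseconds_between_calls:.1f} microseconds")
```

That works out to nearly 17,000 calls per second from that one process, or a call roughly every 59 microseconds.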
So, off to MetaLink I went, searching for Linux and gettimeofday, and I found a few hits. The behavior is associated with a database parameter called STATISTICS_LEVEL. The OTM 5.5 documentation says this parameter should be set to ALL, which I had done, as outlined in the documentation and in the sample init.ora file delivered in the scripts directory. This parameter also has an effect on another parameter, TIMED_OS_STATISTICS, which degrades performance even more. The MetaLink TARs say that STATISTICS_LEVEL should only be set to ALL if you’re looking for something specific; otherwise, leave it at the default, which is TYPICAL.
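If you want to check your own parameter file for this setting, something like the following Python sketch could help. The init.ora fragment is a made-up example (it is not the file shipped with OTM), and the function names are hypothetical. The sketch assumes plain name = value lines and that the last occurrence of a repeated parameter is the one that takes effect.

```python
import re

# Hypothetical init.ora fragment for illustration only.
SAMPLE_INIT_ORA = """\
# sample parameters
db_name = otmdb
statistics_level = ALL
"""

# An uncommented "statistics_level = VALUE" line, optionally prefixed with "*."
# and optionally quoted. Lines starting with "#" do not match.
PARAM_LINE = re.compile(
    r"^\s*(?:\*\.)?statistics_level\s*=\s*['\"]?(\w+)['\"]?",
    re.IGNORECASE | re.MULTILINE,
)


def statistics_level(init_ora_text):
    """Return the explicitly set STATISTICS_LEVEL in upper case, or None."""
    matches = PARAM_LINE.findall(init_ora_text)
    # Assumption: when a parameter is repeated, the last value wins.
    return matches[-1].upper() if matches else None


def needs_attention(init_ora_text):
    """True when the file explicitly sets STATISTICS_LEVEL to ALL."""
    return statistics_level(init_ora_text) == "ALL"


if needs_attention(SAMPLE_INIT_ORA):
    print("STATISTICS_LEVEL is ALL; consider the default, TYPICAL")
```

A return value of None means the parameter is not set in the file, in which case the database uses the default, TYPICAL.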
So, after changing the database parameter and bouncing the entire system, it appears that this recommended setting was the source of all of our grief. Once our users got back on, I did NOT see the behavior that I had seen before on the database server.
I hope others can avoid the turmoil that John, Beth, I, and others endured in discovering this tidbit of information. It took us almost two weeks to get to the bottom of this. I don’t know, maybe we’re slow. But the ironic part is that Oracle recommended setting this parameter to the non-default value. If you follow the documentation, you’ll probably experience performance problems on your database tier. I’m guessing it may affect other platforms as well, but I’m not 100% certain.
Thanks...Steve Hughes