Monthly Archives: October 2012

Final (!) update/fix for EM12c increased redo from EM_JOB_METRICS after upgrade to R2 (bug/patch 14726136)

Update 13 Nov 2012: Patch 14726136 has been obsoleted. The note on MOS indicated that the patch caused some metrics to be calculated incorrectly. The patch has been replaced by patch 14833587. I have applied the new patch and all appears well — jobs are running and redo generation on the repository database remains where it was before the upgrade to EM12c R2.

Following up again on the EM12cR2 upgrade issue from BP1 that causes significantly increased redo logging on the repository database due to heavy insert activity in EM_JOB_METRICS. I first covered this here, with a followup here, a partial workaround here, and then a warning here.

Patch 14726136 has been re-released on MOS. The initial release caused problems for me as it prevented all OEM scheduled jobs from running, eventually causing me to rollback the patch. I am very pleased to report that the new update of the patch (from Oct 30th) applies cleanly and all of my jobs are now running on-schedule and completing successfully. EM_JOB_METRICS is showing no more than 11 inserts per second, and more than a minute is passing between the batches of inserts. My redo volume is already down significantly.

Big thanks to Oracle support and development and the EM12c team!

Advertisements

Patch 14726136

You may have seen the patch in the title drop regarding increased redo logging after upgrading to EM12cR2.

I tried the patch Friday morning, and while it did decrease my redo logging, it also prevented any OEM scheduled jobs from running. I eventually rolled back the patch and jobs ran again.

I would be very interested to know if the patch works for you AND your jobs still run.

Fix for ASH Analytics deployment error after upgrade to R2

When upgrading from Oracle Enterprise Manager 12c R1 to R2, the one-system upgrade requires that you place the OMS into a new middleware home. After you’ve validated your R2 installation, and followed the instructions for removing the R1 install, the OMS host’s agent is still running out of your original middleware home so you can’t just remove the entire old directory.

You may have decided to clean up the leftover files from your old middleware home, taking care not to remove the agent. That’s all well and good. But after doing so, you may run into an error the next time you try to deploy the ASH Analytics PL/SQL packages to a database instance.

As a reminder, deploying these packages is done by going to the database target home page, and selecting “ASH Analytics” from the Performance menu.

If you’ve done a cleanup of your old middleware home, you are very likely to receive the following error text in the deployment job:
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/ashviewer_pkgdefs.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/ashviewer_pkgbodys.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/prvs_awr_data.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/prvt_awr_data.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/prvs_awr_data_cp.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/prvt_awr_data_cp.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/dbms_compare_period.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/prvt_compare_period.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/eaddm_pkgdef.sql
Package deployment driver file is at /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/eaddm_pkgbody.sql
Instantiated JDBC Engine with connect string host.domain.com:1521:SID


You can view the output of this execution at the following location /tmp/DB_Deploy_201210171243.log
Altering session to set schema to DBSNMP

Currently executing null :: ALTER SESSION SET CURRENT_SCHEMA = DBSNMP
Altered session to set schema to DBSNMP

Executing /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/ashviewer_pkgdefs.sql

Error while executing the script : oracle.sysman.assistants.common.dbutil.SQLFatalErrorException: java.io.FileNotFoundException: /oracle/oem/Middleware/plugins/oracle.sysman.db.oms.plugin_12.1.0.2.0/sql/db/latest/instance/ashviewer_pkgdefs.sql (No such file or directory) at line number - -1
Driver SQL Script encountered errors.
Fail

At least the error text makes the problem clear. The SQL files containing the needed packages can’t be found, because EM12cR2 is still looking for them in the old middleware home.

The fix is easy. Just add a symlink in your old middleware home for the ‘plugins’ directory that points to the new middleware home’s plugins directory:

oracle@omshost$ cd /oracle/oem/Middleware
oracle@omshost$ ln -s ../Middleware12cR2/plugins/ ./plugins

Once you’ve finished that, return to the ASH Analytics deployment page and re-run the deployment. It should succeed and you can now take advantage of ASH Analytics!

If you need a primer on ASH Analytics, I highly recommend viewing the set of webinars presented by Doug Burns through AllThingsOracle.

How and why you should set target lifecycle status properties in EM12c

If, like me, you’re using EM12c after you were already plenty familiar with EM11g, you may have missed an important detail in the EM12c new features guide.

From New Features in Oracle Enterprise Manager Cloud Control 12c

Each target now has a lifecycle status target property which can be set to one of the following values: mission critical, production, staging, test, or development. This target property is used to specify the priority by which data from the target should be handled. When Enterprise Manager is under a heavy load, targets where the value of the lifecycle property is mission critical or production are treated with higher priority than targets where the value of the lifecycle property is staging, test, or development. Setting the priorities of your most important targets ensures that even as your data center grows and the number of managed targets grows, your most important targets continue to be treated at high priority.

You may not use some of the other new features like administration groups or lifecycle management, but it’s still very much worth your while to set the lifecycle status target property. After all, you’re more concerned about alerts and monitoring on your mission critical and other production systems than you are on the staging and test systems, so why not tell EM12c about that and gain the benefits of target prioritization?

If you have quite a few targets it can be quite tedious to step through them all in the GUI interface to set this property. It works, but it’ll take a while. Enter emcli. Rob Zoeteweij has covered the setup of EMCLI in his blog post Installing EMCLI on EM12c so I won’t repeat that here other than to add that with the release of EM12cR2 there is no longer a JDK in the $OMS_HOME so if you’re running 12.1.0.2 you should amend his instructions as follows:


oracle@omshost$ export JAVA_HOME=$OMS_HOME/../jdk16/jdk
oracle@omshost$ export PATH=$JAVA_HOME/bin:$PATH
oracle@omshost$ export ORACLE_HOME=$OMS_HOME
oracle@omshost$ cd $ORACLE_HOME
oracle@omshost$ mkdir emcli
oracle@omshost$ java -jar $ORACLE_HOME/sysman/jlib/emclikit.jar client -install_dir=$ORACLE_HOME/emcli
Oracle Enterprise Manager 12c Release 2.
Copyright (c) 1996, 2012 Oracle Corporation. All rights reserved.

EM CLI client-side install completed successfully.
oracle@omshost$ $ORACLE_HOME/emcli/emcli setup -url=https://omshost.domain.com:7803/em -username=sysman
Oracle Enterprise Manager Cloud Control 12c Release 2.
Copyright (c) 1996, 2012 Oracle Corporation and/or its affiliates. All rights reserved.


Enter password

Warning: This certificate has not been identified as trusted in the local trust store
--------------------------------------
[certificate details snipped]
--------------------------------------
Do you trust the certificate chain? [yes/no] yes
Emcli setup successful

Once you have emcli running you can easily create files containing the target properties you would like to set, including the target lifecycle status, and apply them in bulk to your EM12c installation.

For the example I will demonstrate setting the target lifecycle status property for host targets. First you need to produce a list of your host targets, formatted for eventual input to the set_target_property_value verb:

oracle@omshost$ ./emcli get_targets -noheader -script -targets=host | awk '{print $4":"$3":LifeCycle Status:"}' > /tmp/targets

As another example, from commenter Bill Korb, to produce a list of database targets, including any currently under blackout, run:

oracle@omshost$ ./emcli get_targets -noheader -format='name:script;column_separator:|;' -targets='%database%' | awk -F\| '{print $4":"$3":LifeCycle Status:"}' > /tmp/targets

Edit the resulting file, appending the host or database’s lifecycle stage to each line. Be aware of the predefined lifecycle stages provided by Oracle, which are listed below in order of precedence:

  1. MissionCritical
  2. Production
  3. Stage
  4. Test
  5. Development

You can modify the names of these lifecycle stages with the modify_lifecycle_stage_name verb if you wish. Your file should now look something like:

dev1.domain.com:host:LifeCycle Status:Development
omshost.domain.com:host:LifeCycle Status:MissionCritical
prod1.domain.com:host:LifeCycle Status:Production
prod2.domain.com:host:LifeCycle Status:Production
prod3.domain.com:host:LifeCycle Status:Production
prod4.domain.com:host:LifeCycle Status:Production
stage1.domain.com:host:LifeCycle Status:Stage
test1.domain.com:host:LifeCycle Status:Test
test2.domain.com:host:LifeCycle Status:Test

Now make the call to emcli to load your target property definitions into the OMS:

oracle@omshost$ ./emcli set_target_property_value -property_records="REC_FILE" -input_file="REC_FILE:/tmp/targets" -separator=property_records="\n"

There you go. Your hosts are now updated with appropriate target lifecycle stages and the OMS will prioritize them based on these settings whenever the OMS is under high load. Repeat this for your listeners (-targets=oracle_listener), database instances (-targets=oracle_database) and so on until all of your targets have a lifecycle stage assigned. I’ve broken these out by target type for simplicity of documentation, but you can also just produce a single large file containing the lifecycle status for all of your targets and load the whole thing at once. This same technique works to assign a contact, comment, location, or any other target property you find useful.

Setting up notifications and incident rules in EM12c

Minor update 20130219:  You may also wish to read Rob Zoeteweij’s white paper on Incident Rules in EM12c.

Minor update 20130906: You may also wish to read Kellyn Pot’Vin’s blog article on Incident Rule Sets, EM12c Enterprise Monitoring, Part IV, part of a series covering effective usage of EM12c for enterprise monitoring.

I’ve been seeing a few hits here from people searching for information on how to set up notification emails in EM12c. I haven’t specifically covered this, but I think people are finding it due to the post about setting up OOB monitoring notifications for the OMS. So here’s a post on setting up incident notifications in EM12cR2, detailing the specific set of notifications that I find most useful as a DBA that maintains the OEM stack.

Out of the box, EM12c comes with several pre-defined incident rules (accessible via the Setup menu, under Incidents, then Incident Rules). The default rules are overly chatty for my needs. Subscribing to them all means I’ll receive several duplicate emails, and subscribing to only a few of them means I’ll miss important notifications. The incident rules are very configurable and that’s a great advantage if you’re running a large shop with a number of different admins that each have responsibility over different areas of the landscape, but in a small shop with one DBA, one sysadmin and one manager, setting things up to provide a reasonable set of notifications seems like a bit of overkill. But it is well worth the effort.

So you’ve decided to take the plunge and set up incident/notification rules more suitable to your site. My first recommendation is to go right ahead and subscribe to all of those overly chatty default rules. You’ll receive a ton of emails, but you’ll get an idea of the incidents that occur day-to-day and which ones need attention and which ones are just noise. Let your EM12c install run for a while with those default rules subscribed. Go through a few OMS bounces and a few maintenance windows for your monitored systems. Let it all sink in.

Look closely at some of the notification emails you receive. Notice the “Update details” section in those emails. This section is key to keep an eye on while you define your incident notification policies. It will make it obvious to you as you look at each wanted (or unwanted) notification exactly which rule(s) created the incident and which rule(s) sent out the email. In my situation, I’m the OEM admin as well as the database admin, so I want to see practically everything, and want to send a tightly curated set of notifications to the sysadmin(s) and manager(s) in my shop.

The alerts I consider the most important to receive are:

  • Failed backup jobs
  • Database system incidents
  • Agent incidents
  • Listener incidents
  • Metric alert incidents
  • Notification when a corrective action linked to a metric runs
Incident Rules

EM12c Incident Rules

I’ll discuss each of these in turn. First off, don’t edit the predefined rules — create a new rule set that applies to all targets so that you can always revert back to the out-of-the-box behavior.

Notification for failed backup jobs was more difficult in EM12cR1. There was a bug that caused them to appear for a while and then suddenly stop. This is fixed in EM12cR2 so that gives us all the more reason to use the latest and greatest version. I have a working system so I haven’t tested to see if this has changed in EM12cR2, but as of EM12cR1 I’ve found that the most workable way to deal with backup jobs is to schedule them all directly from the database target, selecting “Schedule Backup” from the Availability menu. Even if you use stored RMAN scripts, I still recommend scheduling backup jobs this way rather than directly creating an RMAN Script type job. Step through the wizard and hit the ‘Edit RMAN Script’ button near the end and paste in the run { execute script script_name; } portion at the end, then save and schedule the job as usual. Note that you can’t currently edit the RMAN script attached to an existing job, though I’ve submitted that as a feature request. In the ‘Access’ submenu in the job configuration I prefer to check the ‘Action Required’, ‘Success’, and ‘Problems’ checkboxes. With the ‘Problems’ box checked and the incident rules I will describe, you will receive double-notification for failed backup jobs but that’s one situation where I think it’s worth it to get duplicates.

On to the incident rule for failed backup jobs.

In your newly created rule set, add an event rule (applies to incoming events or updates to existing events). In the dropdown for “Type”, select “High Availability”. In the radio select button that appears, select “Specific events of type High Availability”. Another table will appear on-screen, in this table select ‘Add’, and select Component ‘Backup’, Operation ‘Backup’, and Status ‘Failed’. Click Next to get to the Add Actions screen. Here you will add a notification action, in my case a simple email to my EM12c administrator account’s email address. Click ‘Add’, and make sure that ‘Always execute the actions’ is selected. In the Basic Notifications “Email To” section, click the magnifying glass and select your administrator account. Click ‘Continue’, then ‘Next’. Give the rule a descriptive name and click ‘Next’, then ‘Continue’ again. Your backup failure notification rule is created but will not be saved until you click ‘Save’. Do so now to avoid losing your work. You should now receive notifications for failed jobs of type Database Backup.

Return to your new incident rule set and click ‘Edit’ to begin setting up notifications for Database System incidents. I prefer to receive notice of Database System incidents rather than Database Instance incidents, as this avoids duplicate emails when a database is taken down or an agent is stopped. Click the ‘Rules’ subtab, and create a new incident rule (applies to newly created incidents or updates to existing incidents). I want to know when a database system incident is created, and I also want to know when the problem is resolved and the incident is closed. So I set the new incident rule to apply to Target Type in Database System with Status in Closed or New. As before, add a notification rule set to email your administrator account. I also prefer to set this rule so that it assigns Database System incidents to me, which you can optionally configure in the ‘Update Incident’ section. Give the new notification rule a name and save it and the entire rule set.

Return to editing the rule set, on the ‘Rules’ subtab, and add another incident rule. This rule will be for Agent incidents, so set it to apply where Target Type in Agent. I will deal with metric alerts from Agents later with a catchall rule so in this case I only want the rule to apply to incidents where the Category is in Availability. By doing so and also selecting Status in Closed or New, I will receive notification when an agent is not available to the OMS for any reason — whether the agent is stopped, blocked, or the network drops out, the incident will have the category set to Availability. Create a notification rule that emails your administrator account, name it and save it.

Now the catchall rule for metric alerts. Database Systems don’t produce metric alerts and we want to find out about metric threshold incidents that occur on database instances, listeners, agents, hosts, everything. So we create a new incident rule that will fire on incidents with Status in Closed or New (again, so we receive notice when alerts are created AND when they’re cleared), and apply it where Category in Business, Capacity, Configuration, Diagnostics, Error, Fault, Jobs, Load, Performance, Security. I specifically exclude Availability here because I have rules for agents, listeners and database systems keyed directly to Availability and I do not want to receive duplicate alerts. Add a notification to your administrator account, name it and save it.

Listeners also need notifications. The Database System target encompasses the instance and the listener, but will not fire an incident if the listener goes down (like it will if the database instance itself does — presumably this is because the database itself is still available to existing sessions). So we add a new incident rule where target in Listener, Status in Closed or New and Category in Availability. Add a notification to your administrator account, name it and save it.

Corrective actions are great. They allow automated responses to specific metric issues that can save a DBA from having to run things manually. For example, you may run archivelog backups with delete all input once every half hour but occasionally your database has very heavy processing that can fill up your disk space allocated to archived logs in 20 minutes. This is the perfect situation for a corrective action based on the Archive Area Used % metric that will fire off an archivelog backup to keep you from running out of space and receiving the “archiver hung” error. I won’t cover creating corrective actions in this post, but assuming you have one in place here is how you can set up notifications whenever it runs.

Add a new event rule that applies to specific events of type ‘Metric Alert’. In the table that comes up, click Add and select the metric on which you have a corrective action defined — in my case, this is Metric Group Archive Area, Metric Archive Area Used (%) for Target Type Database Instance. Check the box next to this metric (for some reason it appears twice on my screen — I selected both, I think one applies to pre 10G databases while the applies to more recent versions, either way I don’t receive any duplicates here). Make sure that ‘All Objects’ is selected and no objects are in the exclude list. At the bottom of this subwindow, leave the ‘Severity’ drop down blank and check all four boxes, like in the image below.

Next click OK in that window, then click ‘Next’. As with the other notifications, add a notification email to your administrator account, name it and save it. Make doubly sure that you save the entire incident rule set or you will lose your work.

This covers all of the alerts that I consider most important to receive myself. The next step is adjusting your monitoring thresholds so that they meet the needs of your environment. Once you have the monitoring thresholds set somewhere that makes sense, that’s the time to start adding even more incident rule set to support notifications to your sysadmins or managers or users. I prefer to keep “my” rule set with the small set of rules I’ve listed above, and handle most notifications to others in their own customized rule sets. Just for example, your manager may want to see alerts for all availability issues on production but doesn’t want their mailbox cluttered up with alerts about your sandbox systems. I create a new rule set for each environment (Production, Test, Development, Sandbox), with very specific incident rules covering the exact issues others wish to be notified about. In my case that’s generally the Load and Capacity categories for Database Instance targets (to my manager) and all metric alerts (excluding categories Security and Configuration) on Host targets to send alerts to the sysadmin. OEM is a little chatty on security recommendations and we deal with host patching and vulnerability monitoring outside of OEM so I don’t like to bother the sysadmin with EM12c’s opinion of our system security.

That’s it. What sort of notification rules do the rest of you out there use that differ from this? What can I do better?

Partial workaround for increased redo generation on EM12cR2 repository database after upgrade from R1

Another followup to my previous posts on increased redo logging on the repository database after upgrading to EM12cR2.

On the OTN Forum thread, slohit posted a suggestion with a way to reduce the frequency of job dispatcher runs:

To reduce the frequency of inserts, it would be better to slow down the frequency of the dispatcher. This is OK only if you are fine with a lower response time between steps for your jobs. If it is only a few jobs/job steps scheduled to run together, this is all right

emctl set property -name oracle.sysman.core.jobs.minDispatcherSleepInterval -value 0.5
emctl set property -name oracle.sysman.core.jobs.linearBackoff -value 0.5

By making this change, followed by a bounce of the OMS, the redo generation on my repository database decreased by about 50%. It’s still 200-300% more than before the upgrade, but it’s quite an improvement. The EM_JOB_METRICS table now only receives about 20 inserts per second compared to 60. My 13GB repository database should now only produce about 5GB of redo per day instead of 10GB.

Prior to the change, emctl get property reported these property values as null so presumably the OMS was working with whatever the built-in default values were.

In my case, a delay between job steps is a tradeoff that I can tolerate. This OMS runs simple database backups, SQL and RMAN scripts, so I do not have any time-critical multi-step jobs that could be negatively impacted by delayed response to completion of individual job steps. Your mileage may vary.

Oracle Support is continuing to investigate this bug. At least one other poster on the forum has reported that they too are experiencing this issue so I’m looking forward to finding out if this change reduces their redo generation to an acceptable rate. That same poster also reports that they do not see the issue on a new install of EM12cR2, only on a system upgraded from R1. It’s not clear to me why an upgraded system would experience this when processing the raw metric data while a fresh install would not, nor is it clear how many upgraded systems are experiencing the issue. But at least we now have a partial workaround for bug 14726136.

BUG 14726136 – INCREASED REDO LOGGING ON REPOSITORY DB AFTER UPGRADING TO OEM 12.1.0.2

Following up on my previous post about increased redo logging on the repository DB after upgrading to OEM 12cR2, we now have bug 14726136 filed.

The heavy redo activity is coming from SYSMAN inserts into the EM_JOB_METRICS table. It’s doing 60 inserts a second, in batches of 10 per LOG_TIMESTAMP, where each insert looks like:

insert into "SYSMAN"."EM_JOB_METRICS"("SOURCE","LOG_TIMESTAMP","METRIC_NAME","METRIC_QUALIFIER","METRIC_VALUE") values
('omshost.domain.com:4890_Management_Service',TO_TIMESTAMP('26-SEP-12 03.09.25.215415 PM'),'DISP_REQ_STEPS','0','25');

The 10 inserts differ only in the values of the METRIC_QUALIFIER and METRIC_VALUE columns:

'0', '25'
'0', '0'
'1', '12'
'1', '0'
'2', '25'
'2', '0'
'3', '10'
'3', '0'
'4', '10'
'4', '0'

LogMiner shows this as a significant proportion of the total database activity, with the following extracts based on about 20 minutes (or about 200MB) of repository DB operation:

SQL> select TABLE_NAME, count(*) from v$logmnr_contents group by table_name order by 2 desc;

                                                       TABLE_NAME           COUNT(*)
                                                     EM_JOB_METRICS           201740
                                                                              163843
                                                    EM_METRIC_ITEMS            13374
                                                    EM_METRIC_VALUES           12524
                                                MGMT_AVAILABILITY_MARKER        3845
                                                       COL_USAGE$               2403
                                                          ROUT                  1453
                                                     SCHEDULER$_JOB             1067


SQL> select seg_owner, operation, count(*) from v$logmnr_contents group by seg_owner, operation order by 1;

                                              SEG_OWNER     OPERATION     COUNT(*)  
                                               DBSNMP        DELETE             13  
                                               DBSNMP        INSERT             13  
                                               DBSNMP        UPDATE              4  
                                                RCAT         DELETE            269  
                                                RCAT         INSERT           1372  
                                                RCAT    SELECT_FOR_UPDATE       50  
                                                RCAT         UPDATE            884  
                                                 SYS         DELETE             88  
                                                 SYS         INSERT           1041
                                                 SYS    SELECT_FOR_UPDATE      265
                                                 SYS         UPDATE           4802  
                                               SYSMAN        DELETE          97325  
                                               SYSMAN        INSERT         106610  
                                               SYSMAN       INTERNAL           196  
                                               SYSMAN       LOB_TRIM             5  
                                               SYSMAN       LOB_WRITE          326  
                                               SYSMAN   SELECT_FOR_UPDATE     1203  
                                               SYSMAN    SEL_LOB_LOCATOR       103  
                                               SYSMAN      UNSUPPORTED       13626  
                                               SYSMAN        UPDATE          21808
                                               UNKNOWN       DELETE             82  
                                               UNKNOWN       INSERT             80  
                                               UNKNOWN  SELECT_FOR_UPDATE       65
                                               UNKNOWN       UPDATE             65
                                                             COMMIT          19792
                                                          DPI SAVEPOINT         15
                                                            INTERNAL        124178
                                                            ROLLBACK            33
                                                              START          19825

This is not a particularly busy OMS, it’s monitoring fewer than 200 targets and only runs a single job every couple minutes.