Setting up notifications and incident rules in EM12c

Minor update 20130219:  You may also wish to read Rob Zoeteweij’s white paper on Incident Rules in EM12c.

Minor update 20130906: You may also wish to read Kellyn Pot’Vin’s blog article on Incident Rule Sets, EM12c Enterprise Monitoring, Part IV, part of a series covering effective usage of EM12c for enterprise monitoring.

I’ve been seeing a few hits here from people searching for information on how to set up notification emails in EM12c. I haven’t specifically covered this, but I think people are finding it due to the post about setting up OOB monitoring notifications for the OMS. So here’s a post on setting up incident notifications in EM12cR2, detailing the specific set of notifications that I find most useful as a DBA that maintains the OEM stack.

Out of the box, EM12c comes with several pre-defined incident rules (accessible via the Setup menu, under Incidents, then Incident Rules). The default rules are overly chatty for my needs. Subscribing to them all means I’ll receive several duplicate emails, and subscribing to only a few of them means I’ll miss important notifications. The incident rules are very configurable and that’s a great advantage if you’re running a large shop with a number of different admins that each have responsibility over different areas of the landscape, but in a small shop with one DBA, one sysadmin and one manager, setting things up to provide a reasonable set of notifications seems like a bit of overkill. But it is well worth the effort.

So you’ve decided to take the plunge and set up incident/notification rules more suitable to your site. My first recommendation is to go right ahead and subscribe to all of those overly chatty default rules. You’ll receive a ton of emails, but you’ll get an idea of the incidents that occur day-to-day and which ones need attention and which ones are just noise. Let your EM12c install run for a while with those default rules subscribed. Go through a few OMS bounces and a few maintenance windows for your monitored systems. Let it all sink in.

Look closely at some of the notification emails you receive. Notice the “Update details” section in those emails. This section is key to keep an eye on while you define your incident notification policies. It will make it obvious to you as you look at each wanted (or unwanted) notification exactly which rule(s) created the incident and which rule(s) sent out the email. In my situation, I’m the OEM admin as well as the database admin, so I want to see practically everything, and want to send a tightly curated set of notifications to the sysadmin(s) and manager(s) in my shop.

The alerts I consider the most important to receive are:

  • Failed backup jobs
  • Database system incidents
  • Agent incidents
  • Listener incidents
  • Metric alert incidents
  • Notification when a corrective action linked to a metric runs

[Screenshot: EM12c Incident Rules]

I’ll discuss each of these in turn. First off, don’t edit the predefined rules — create a new rule set that applies to all targets so that you can always revert back to the out-of-the-box behavior.

Notification for failed backup jobs was more difficult in EM12cR1: a bug caused the notifications to appear for a while and then suddenly stop. This is fixed in EM12cR2, which is all the more reason to run the latest release. As of EM12cR1, I found that the most workable way to deal with backup jobs was to schedule them all directly from the database target, selecting “Schedule Backup” from the Availability menu. I have a working system, so I haven’t tested whether this has changed in EM12cR2. Even if you use stored RMAN scripts, I still recommend scheduling backup jobs this way rather than directly creating an RMAN Script type job. Step through the wizard, hit the ‘Edit RMAN Script’ button near the end, paste in the run { execute script script_name; } portion, then save and schedule the job as usual. Note that you can’t currently edit the RMAN script attached to an existing job, though I’ve submitted that as a feature request.

In the ‘Access’ submenu in the job configuration I prefer to check the ‘Action Required’, ‘Success’, and ‘Problems’ checkboxes. With the ‘Problems’ box checked and the incident rules I describe below in place, you will receive two notifications for each failed backup job, but that’s one situation where I think the duplicates are worth it.

On to the incident rule for failed backup jobs.

In your newly created rule set, add an event rule (applies to incoming events or updates to existing events). In the dropdown for “Type”, select “High Availability”. In the radio button group that appears, select “Specific events of type High Availability”. Another table will appear on-screen; in this table click ‘Add’, and select Component ‘Backup’, Operation ‘Backup’, and Status ‘Failed’. Click ‘Next’ to get to the Add Actions screen. Here you will add a notification action, in my case a simple email to my EM12c administrator account’s email address. Click ‘Add’, and make sure that ‘Always execute the actions’ is selected. In the Basic Notifications “Email To” section, click the magnifying glass and select your administrator account. Click ‘Continue’, then ‘Next’. Give the rule a descriptive name and click ‘Next’, then ‘Continue’ again. Your backup failure notification rule is now created, but it will not be saved until you click ‘Save’. Do so now to avoid losing your work. You should now receive notifications for failed jobs of type Database Backup.

Return to your new incident rule set and click ‘Edit’ to begin setting up notifications for Database System incidents. I prefer to receive notice of Database System incidents rather than Database Instance incidents, as this avoids duplicate emails when a database is taken down or an agent is stopped. Click the ‘Rules’ subtab, and create a new incident rule (applies to newly created incidents or updates to existing incidents). I want to know when a database system incident is created, and I also want to know when the problem is resolved and the incident is closed. So I set the new incident rule to apply to Target Type in Database System with Status in Closed or New. As before, add a notification action that emails your administrator account. I also prefer to have this rule assign Database System incidents to me, which you can optionally configure in the ‘Update Incident’ section. Give the new rule a name, then save both the rule and the entire rule set.

Return to editing the rule set, on the ‘Rules’ subtab, and add another incident rule. This rule will be for Agent incidents, so set it to apply where Target Type in Agent. I will deal with metric alerts from Agents later with a catchall rule so in this case I only want the rule to apply to incidents where the Category is in Availability. By doing so and also selecting Status in Closed or New, I will receive notification when an agent is not available to the OMS for any reason — whether the agent is stopped, blocked, or the network drops out, the incident will have the category set to Availability. Create a notification rule that emails your administrator account, name it and save it.

Now the catchall rule for metric alerts. Database Systems don’t produce metric alerts and we want to find out about metric threshold incidents that occur on database instances, listeners, agents, hosts, everything. So we create a new incident rule that will fire on incidents with Status in Closed or New (again, so we receive notice when alerts are created AND when they’re cleared), and apply it where Category in Business, Capacity, Configuration, Diagnostics, Error, Fault, Jobs, Load, Performance, Security. I specifically exclude Availability here because I have rules for agents, listeners and database systems keyed directly to Availability and I do not want to receive duplicate alerts. Add a notification to your administrator account, name it and save it.

Listeners also need notifications. The Database System target encompasses the instance and the listener, but will not fire an incident if the listener goes down (like it will if the database instance itself does — presumably this is because the database itself is still available to existing sessions). So we add a new incident rule where target in Listener, Status in Closed or New and Category in Availability. Add a notification to your administrator account, name it and save it.
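To sanity-check that the four incident rules described so far don’t overlap, here is a minimal Python sketch that models them as plain data. This is not an EM12c API: the `Incident` and `Rule` classes and the `matching_rules` helper are hypothetical names invented for illustration, and the matching logic is a simplification of whatever the OMS actually does.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Category names as they appear in the incident rule screens described above.
ALL_CATEGORIES = frozenset({
    "Availability", "Business", "Capacity", "Configuration", "Diagnostics",
    "Error", "Fault", "Jobs", "Load", "Performance", "Security",
})
NOTIFY_STATUSES = frozenset({"New", "Closed"})


@dataclass(frozen=True)
class Incident:  # hypothetical stand-in for an EM12c incident
    target_type: str
    category: str
    status: str


@dataclass(frozen=True)
class Rule:  # hypothetical stand-in for one incident rule
    name: str
    statuses: FrozenSet[str] = NOTIFY_STATUSES
    target_types: Optional[FrozenSet[str]] = None  # None means "any target type"
    categories: Optional[FrozenSet[str]] = None    # None means "any category"

    def matches(self, inc: Incident) -> bool:
        return (
            inc.status in self.statuses
            and (self.target_types is None or inc.target_type in self.target_types)
            and (self.categories is None or inc.category in self.categories)
        )


RULES = [
    Rule("Database System incidents", target_types=frozenset({"Database System"})),
    Rule("Agent availability", target_types=frozenset({"Agent"}),
         categories=frozenset({"Availability"})),
    Rule("Metric alert catchall", categories=ALL_CATEGORIES - {"Availability"}),
    Rule("Listener availability", target_types=frozenset({"Listener"}),
         categories=frozenset({"Availability"})),
]


def matching_rules(inc: Incident) -> List[str]:
    """Return the names of every rule that would notify for this incident."""
    return [rule.name for rule in RULES if rule.matches(inc)]


print(matching_rules(Incident("Agent", "Availability", "New")))
print(matching_rules(Incident("Agent", "Performance", "New")))
```

In this model an agent going down matches only the agent rule, while a metric alert raised on the same agent falls through to the catchall. One email per incident is the whole point of excluding Availability from the catchall.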

Corrective actions are great. They allow automated responses to specific metric issues that can save a DBA from having to run things manually. For example, you may run archivelog backups with delete all input once every half hour but occasionally your database has very heavy processing that can fill up your disk space allocated to archived logs in 20 minutes. This is the perfect situation for a corrective action based on the Archive Area Used % metric that will fire off an archivelog backup to keep you from running out of space and receiving the “archiver hung” error. I won’t cover creating corrective actions in this post, but assuming you have one in place here is how you can set up notifications whenever it runs.
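To put rough numbers on that scenario, here is a back-of-the-envelope sketch in Python. The function names, the constant fill-rate assumption and the sample figures are all mine for illustration. EM12c does not compute this for you, and real archive log generation is rarely this linear.

```python
def minutes_until_full(used_pct: float, fill_pct_per_minute: float) -> float:
    """Minutes until the archive area reaches 100%, assuming a constant fill rate."""
    if fill_pct_per_minute <= 0:
        return float("inf")
    return (100.0 - used_pct) / fill_pct_per_minute


def scheduled_backup_is_enough(used_pct: float,
                               fill_pct_per_minute: float,
                               backup_interval_minutes: float) -> bool:
    """True if a full backup interval can elapse before the archive area fills."""
    return minutes_until_full(used_pct, fill_pct_per_minute) > backup_interval_minutes


# Heavy processing: an empty archive area fills in 20 minutes, i.e. 5% per minute.
heavy = scheduled_backup_is_enough(0.0, 5.0, 30)
# Quiet period: 1% per minute, so 100 minutes to fill.
quiet = scheduled_backup_is_enough(0.0, 1.0, 30)
print(heavy, quiet)  # False True
```

When the first case comes back False, the half-hourly schedule alone cannot save you, and a corrective action keyed to Archive Area Used % is what closes the gap.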

Add a new event rule that applies to specific events of type ‘Metric Alert’. In the table that comes up, click ‘Add’ and select the metric on which you have a corrective action defined. In my case, this is Metric Group Archive Area, Metric Archive Area Used (%) for Target Type Database Instance. Check the box next to this metric. (For some reason it appears twice on my screen. I selected both; I think one applies to pre-10g databases while the other applies to more recent versions. Either way, I don’t receive any duplicates here.) Make sure that ‘All Objects’ is selected and no objects are in the exclude list. At the bottom of this subwindow, leave the ‘Severity’ drop down blank and check all four boxes, like in the image below.

Next click OK in that window, then click ‘Next’. As with the other notifications, add a notification email to your administrator account, name it and save it. Make doubly sure that you save the entire incident rule set or you will lose your work.

This covers all of the alerts that I consider most important to receive myself. The next step is adjusting your monitoring thresholds so that they meet the needs of your environment. Once you have the monitoring thresholds set somewhere that makes sense, that’s the time to start adding more incident rule sets to support notifications to your sysadmins, managers or users. I prefer to keep “my” rule set limited to the small set of rules I’ve listed above, and handle most notifications to others in their own customized rule sets. For example, your manager may want to see alerts for all availability issues on production but not want their mailbox cluttered up with alerts about your sandbox systems. I create a new rule set for each environment (Production, Test, Development, Sandbox), with very specific incident rules covering the exact issues others wish to be notified about. In my case that’s generally the Load and Capacity categories for Database Instance targets (to my manager) and all metric alerts on Host targets, excluding the Security and Configuration categories (to the sysadmin). OEM is a little chatty on security recommendations and we deal with host patching and vulnerability monitoring outside of OEM, so I don’t like to bother the sysadmin with EM12c’s opinion of our system security.
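The per-environment split can be sketched the same way. The snippet below is a hypothetical Python model, not anything EM12c exposes: `Alert`, `RULE_SETS` and `recipients` are made-up names, the recipients are role labels rather than real accounts, and only the Production and Sandbox rule sets are filled in.

```python
from typing import List, NamedTuple

# Metric alert categories, i.e. everything except Availability.
METRIC_CATEGORIES = {
    "Business", "Capacity", "Configuration", "Diagnostics", "Error",
    "Fault", "Jobs", "Load", "Performance", "Security",
}


class Alert(NamedTuple):  # hypothetical stand-in for an incident notification
    environment: str
    target_type: str
    category: str


# One rule set per environment; each rule is (recipient, target type, categories).
RULE_SETS = {
    "Production": [
        ("manager", "Database Instance", {"Load", "Capacity"}),
        ("sysadmin", "Host", METRIC_CATEGORIES - {"Security", "Configuration"}),
    ],
    "Sandbox": [],  # nobody else wants sandbox noise in their mailbox
}


def recipients(alert: Alert) -> List[str]:
    """Return who, besides the DBA, would be emailed about this alert."""
    rules = RULE_SETS.get(alert.environment, [])
    return [who for who, target_type, categories in rules
            if alert.target_type == target_type and alert.category in categories]


print(recipients(Alert("Production", "Database Instance", "Load")))  # ['manager']
print(recipients(Alert("Production", "Host", "Security")))           # []
```

Keeping each audience in its own rule set means you can change what the manager sees on Production without touching your own catchall rules.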

That’s it. What sort of notification rules do the rest of you out there use that differ from this? What can I do better?
