Setting up notifications and incident rules in EM12c

Minor update 20130219:  You may also wish to read Rob Zoeteweij’s white paper on Incident Rules in EM12c.

Minor update 20130906: You may also wish to read Kellyn Pot’Vin’s blog article on Incident Rule Sets, EM12c Enterprise Monitoring, Part IV, part of a series covering effective usage of EM12c for enterprise monitoring.

I’ve been seeing a few hits here from people searching for information on how to set up notification emails in EM12c. I haven’t specifically covered this, but I think people are finding it due to the post about setting up OOB monitoring notifications for the OMS. So here’s a post on setting up incident notifications in EM12cR2, detailing the specific set of notifications that I find most useful as a DBA that maintains the OEM stack.

Out of the box, EM12c comes with several pre-defined incident rules (accessible via the Setup menu, under Incidents, then Incident Rules). The default rules are overly chatty for my needs. Subscribing to them all means I’ll receive several duplicate emails, and subscribing to only a few of them means I’ll miss important notifications. The incident rules are very configurable and that’s a great advantage if you’re running a large shop with a number of different admins that each have responsibility over different areas of the landscape, but in a small shop with one DBA, one sysadmin and one manager, setting things up to provide a reasonable set of notifications seems like a bit of overkill. But it is well worth the effort.

So you’ve decided to take the plunge and set up incident/notification rules more suitable to your site. My first recommendation is to go right ahead and subscribe to all of those overly chatty default rules. You’ll receive a ton of emails, but you’ll get an idea of the incidents that occur day-to-day and which ones need attention and which ones are just noise. Let your EM12c install run for a while with those default rules subscribed. Go through a few OMS bounces and a few maintenance windows for your monitored systems. Let it all sink in.

Look closely at some of the notification emails you receive. Notice the “Update details” section in those emails. This section is key to keep an eye on while you define your incident notification policies. It will make it obvious to you as you look at each wanted (or unwanted) notification exactly which rule(s) created the incident and which rule(s) sent out the email. In my situation, I’m the OEM admin as well as the database admin, so I want to see practically everything, and want to send a tightly curated set of notifications to the sysadmin(s) and manager(s) in my shop.

The alerts I consider the most important to receive are:

  • Failed backup jobs
  • Database system incidents
  • Agent incidents
  • Listener incidents
  • Metric alert incidents
  • Notification when a corrective action linked to a metric runs

[Screenshot: EM12c Incident Rules]

I’ll discuss each of these in turn. First off, don’t edit the predefined rules — create a new rule set that applies to all targets so that you can always revert back to the out-of-the-box behavior.

Notification for failed backup jobs was more difficult in EM12cR1. There was a bug that caused them to appear for a while and then suddenly stop. This is fixed in EM12cR2, which gives us all the more reason to use the latest and greatest version.

I have a working system so I haven’t tested to see if this has changed in EM12cR2, but as of EM12cR1 I’ve found that the most workable way to deal with backup jobs is to schedule them all directly from the database target, selecting “Schedule Backup” from the Availability menu. Even if you use stored RMAN scripts, I still recommend scheduling backup jobs this way rather than directly creating an RMAN Script type job. Step through the wizard, hit the ‘Edit RMAN Script’ button near the end, and paste in the run { execute script script_name; } portion, then save and schedule the job as usual. Note that you can’t currently edit the RMAN script attached to an existing job, though I’ve submitted that as a feature request.

In the ‘Access’ submenu in the job configuration I prefer to check the ‘Action Required’, ‘Success’, and ‘Problems’ checkboxes. With the ‘Problems’ box checked and the incident rules I will describe, you will receive double-notification for failed backup jobs, but that’s one situation where I think it’s worth it to get duplicates.

On to the incident rule for failed backup jobs.

In your newly created rule set, add an event rule (applies to incoming events or updates to existing events). In the dropdown for “Type”, select “High Availability”. In the radio select button that appears, select “Specific events of type High Availability”. Another table will appear on-screen; in this table select ‘Add’, then select Component ‘Backup’, Operation ‘Backup’, and Status ‘Failed’. Click Next to get to the Add Actions screen. Here you will add a notification action, in my case a simple email to my EM12c administrator account’s email address. Click ‘Add’, and make sure that ‘Always execute the actions’ is selected. In the Basic Notifications “Email To” section, click the magnifying glass and select your administrator account. Click ‘Continue’, then ‘Next’. Give the rule a descriptive name and click ‘Next’, then ‘Continue’ again. Your backup failure notification rule is created but will not be saved until you click ‘Save’. Do so now to avoid losing your work. You should now receive notifications for failed jobs of type Database Backup.
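To make the selection criteria concrete, here is a minimal Python sketch of the condition this event rule expresses. It is only a model for reasoning about the rule, not an EM12c API: the `Event` class and `is_failed_backup` function are names I made up, and the only values taken from the rule screen are the event type, component, operation and status.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for an incoming EM12c event (not a real EM class)."""
    event_type: str
    component: str
    operation: str
    status: str


def is_failed_backup(event: Event) -> bool:
    """True when the event matches the criteria chosen in the rule wizard."""
    return (
        event.event_type == "High Availability"
        and event.component == "Backup"
        and event.operation == "Backup"
        and event.status == "Failed"
    )


if __name__ == "__main__":
    failed = Event("High Availability", "Backup", "Backup", "Failed")
    print(is_failed_backup(failed))
```

All four fields have to match, so a backup event with any other status does not trigger the email.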

Return to your new incident rule set and click ‘Edit’ to begin setting up notifications for Database System incidents. I prefer to receive notice of Database System incidents rather than Database Instance incidents, as this avoids duplicate emails when a database is taken down or an agent is stopped. Click the ‘Rules’ subtab, and create a new incident rule (applies to newly created incidents or updates to existing incidents). I want to know when a database system incident is created, and I also want to know when the problem is resolved and the incident is closed. So I set the new incident rule to apply to Target Type in Database System with Status in Closed or New. As before, add a notification rule set to email your administrator account. I also prefer to set this rule so that it assigns Database System incidents to me, which you can optionally configure in the ‘Update Incident’ section. Give the new notification rule a name and save it and the entire rule set.

Return to editing the rule set, on the ‘Rules’ subtab, and add another incident rule. This rule will be for Agent incidents, so set it to apply where Target Type in Agent. I will deal with metric alerts from Agents later with a catchall rule so in this case I only want the rule to apply to incidents where the Category is in Availability. By doing so and also selecting Status in Closed or New, I will receive notification when an agent is not available to the OMS for any reason — whether the agent is stopped, blocked, or the network drops out, the incident will have the category set to Availability. Create a notification rule that emails your administrator account, name it and save it.

Now the catchall rule for metric alerts. Database Systems don’t produce metric alerts and we want to find out about metric threshold incidents that occur on database instances, listeners, agents, hosts, everything. So we create a new incident rule that will fire on incidents with Status in Closed or New (again, so we receive notice when alerts are created AND when they’re cleared), and apply it where Category in Business, Capacity, Configuration, Diagnostics, Error, Fault, Jobs, Load, Performance, Security. I specifically exclude Availability here because I have rules for agents, listeners and database systems keyed directly to Availability and I do not want to receive duplicate alerts. Add a notification to your administrator account, name it and save it.

Listeners also need notifications. The Database System target encompasses the instance and the listener, but will not fire an incident if the listener goes down (like it will if the database instance itself does — presumably this is because the database itself is still available to existing sessions). So we add a new incident rule where target in Listener, Status in Closed or New and Category in Availability. Add a notification to your administrator account, name it and save it.
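The four incident rules described so far are meant to send at most one notification per incident. The sketch below models them in Python to check that property. It is an illustration under my own assumptions, not how EM12c evaluates rules internally: the `Incident` and `Rule` classes, the rule names and the `matching_rules` helper are all hypothetical, while the target types, categories and statuses come from the rules above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Categories used by the catchall rule: everything listed in the post except Availability.
CATCHALL_CATEGORIES = frozenset({
    "Business", "Capacity", "Configuration", "Diagnostics", "Error",
    "Fault", "Jobs", "Load", "Performance", "Security",
})
# Every rule fires on newly created and on closed incidents.
NOTIFY_STATUSES = frozenset({"New", "Closed"})


@dataclass(frozen=True)
class Incident:
    target_type: str
    category: str
    status: str


@dataclass(frozen=True)
class Rule:
    name: str
    target_types: Optional[FrozenSet[str]]  # None means "any target type"
    categories: Optional[FrozenSet[str]]    # None means "any category"

    def matches(self, incident: Incident) -> bool:
        if incident.status not in NOTIFY_STATUSES:
            return False
        if self.target_types is not None and incident.target_type not in self.target_types:
            return False
        if self.categories is not None and incident.category not in self.categories:
            return False
        return True


RULES: List[Rule] = [
    Rule("Database System incidents", frozenset({"Database System"}), None),
    Rule("Agent availability", frozenset({"Agent"}), frozenset({"Availability"})),
    Rule("Listener availability", frozenset({"Listener"}), frozenset({"Availability"})),
    Rule("Metric alert catchall", None, CATCHALL_CATEGORIES),
]


def matching_rules(incident: Incident) -> List[str]:
    """Names of every rule that would notify for this incident."""
    return [rule.name for rule in RULES if rule.matches(incident)]


if __name__ == "__main__":
    for inc in (
        Incident("Listener", "Availability", "New"),
        Incident("Agent", "Load", "Closed"),
        Incident("Database Instance", "Availability", "New"),
    ):
        print(inc, "->", matching_rules(inc))
```

In this model an Availability incident on a Database Instance matches no rule at all. That is deliberate: the Database System rule is the one that reports a database going down, so the instance-level incident would only be a duplicate.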

Corrective actions are great. They allow automated responses to specific metric issues that can save a DBA from having to run things manually. For example, you may run archivelog backups with delete all input once every half hour but occasionally your database has very heavy processing that can fill up your disk space allocated to archived logs in 20 minutes. This is the perfect situation for a corrective action based on the Archive Area Used % metric that will fire off an archivelog backup to keep you from running out of space and receiving the “archiver hung” error. I won’t cover creating corrective actions in this post, but assuming you have one in place here is how you can set up notifications whenever it runs.
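The timing problem in that example is easy to put into numbers. The following Python sketch assumes a steady fill rate and a critical threshold of 80% on Archive Area Used %; the 80% figure and all of the names are my assumptions for illustration, while the 30 minute backup interval and the 20 minute fill time come from the example above.

```python
def minutes_until(percent_used: float, minutes_to_fill: float) -> float:
    """Minutes for an empty archive area to reach percent_used at a steady fill rate."""
    return minutes_to_fill * percent_used / 100.0


BACKUP_INTERVAL_MIN = 30.0     # scheduled archivelog backup frequency
MINUTES_TO_FILL = 20.0         # heavy processing fills the archive area this fast
CRITICAL_THRESHOLD_PCT = 80.0  # assumed threshold on Archive Area Used %

# Worst case: the burst starts right after a scheduled backup has emptied the area.
full_at = minutes_until(100.0, MINUTES_TO_FILL)
alert_at = minutes_until(CRITICAL_THRESHOLD_PCT, MINUTES_TO_FILL)
scheduled_backup_too_late = full_at < BACKUP_INTERVAL_MIN
headroom = full_at - alert_at

if __name__ == "__main__":
    print("archive area full after", full_at, "minutes")
    print("threshold crossed after", alert_at, "minutes")
    print("scheduled backup arrives too late:", scheduled_backup_too_late)
    print("minutes left for the corrective action:", headroom)
```

With these assumed numbers the threshold is crossed 16 minutes into the burst, which leaves about 4 minutes for the corrective action's backup to start freeing space. The sketch ignores the metric collection interval, so the real margin is smaller; that is worth keeping in mind when you pick the threshold.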

Add a new event rule that applies to specific events of type ‘Metric Alert’. In the table that comes up, click Add and select the metric on which you have a corrective action defined; in my case, this is Metric Group Archive Area, Metric Archive Area Used (%) for Target Type Database Instance. Check the box next to this metric. (For some reason it appears twice on my screen. I selected both; I think one applies to pre-10g databases while the other applies to more recent versions. Either way, I don’t receive any duplicates here.) Make sure that ‘All Objects’ is selected and no objects are in the exclude list. At the bottom of this subwindow, leave the ‘Severity’ drop down blank and check all four boxes, like in the image below.

Next click OK in that window, then click ‘Next’. As with the other notifications, add a notification email to your administrator account, name it and save it. Make doubly sure that you save the entire incident rule set or you will lose your work.

This covers all of the alerts that I consider most important to receive myself. The next step is adjusting your monitoring thresholds so that they meet the needs of your environment. Once you have the monitoring thresholds set somewhere that makes sense, that’s the time to start adding even more incident rule sets to support notifications to your sysadmins or managers or users. I prefer to keep “my” rule set with the small set of rules I’ve listed above, and handle most notifications to others in their own customized rule sets. Just for example, your manager may want to see alerts for all availability issues on production but doesn’t want their mailbox cluttered up with alerts about your sandbox systems. I create a new rule set for each environment (Production, Test, Development, Sandbox), with very specific incident rules covering the exact issues others wish to be notified about. In my case that’s generally the Load and Capacity categories for Database Instance targets (to my manager) and all metric alerts (excluding categories Security and Configuration) on Host targets to send alerts to the sysadmin. OEM is a little chatty on security recommendations and we deal with host patching and vulnerability monitoring outside of OEM so I don’t like to bother the sysadmin with EM12c’s opinion of our system security.
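Before building these extra rule sets in the console, it can help to write the routing down as a small table and test it against a few sample alerts. The Python sketch below does that. The `Subscription` structure and the `recipients` function are hypothetical, and which environments each person subscribes to is purely an assumption for illustration; only the target types and categories come from the description above.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Subscription(NamedTuple):
    """One hypothetical per-recipient rule; not an EM12c object."""
    recipient: str
    environments: FrozenSet[str]
    target_type: str
    include: Optional[FrozenSet[str]]  # None means "every category"
    exclude: FrozenSet[str]


SUBSCRIPTIONS: List[Subscription] = [
    # Manager: Load and Capacity on database instances (Production only is assumed).
    Subscription("manager", frozenset({"Production"}), "Database Instance",
                 frozenset({"Load", "Capacity"}), frozenset()),
    # Sysadmin: host metric alerts except Security and Configuration
    # (Production and Test are assumed).
    Subscription("sysadmin", frozenset({"Production", "Test"}), "Host",
                 None, frozenset({"Security", "Configuration"})),
]


def recipients(environment: str, target_type: str, category: str) -> List[str]:
    """Who would be emailed for an alert with these attributes."""
    result = []
    for sub in SUBSCRIPTIONS:
        if environment not in sub.environments or target_type != sub.target_type:
            continue
        if category in sub.exclude:
            continue
        if sub.include is not None and category not in sub.include:
            continue
        result.append(sub.recipient)
    return result


if __name__ == "__main__":
    print(recipients("Production", "Database Instance", "Load"))
    print(recipients("Sandbox", "Database Instance", "Load"))
    print(recipients("Production", "Host", "Security"))
```

Keeping one entry per recipient mirrors the idea of one customized rule set per audience, separate from the admin's own rule set.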

That’s it. What sort of notification rules do the rest of you out there use that differ from this? What can I do better?


13 thoughts on “Setting up notifications and incident rules in EM12c”

  1. Oti Ometie.

    Nice post. Easy to understand, and even easier to follow and make modifications to suit your need. Thumbs up mate.

  2. sravya

    Thanks for the nice post. I have a quick question.

    We have a generic production incident rule set which applies its rules to all production databases and does email and paging. Now we have started getting disk device alerts paged to our phones for a particular server, which we want to avoid (we know a backup runs at that moment, so this is expected).

    In order to only mail and not page, I created a separate incident rule only for that server with that metric. But as I didn’t modify the existing production incident rule set, will that still keep sending pages?

    How can we accomplish this?

    1. Brian Pardy Post author

      Hi Sravya,

      You are correct. By creating an additional incident rule for that server and that metric, the new rule will send email notifications AND the old rule will continue to send pager notifications as it did before. You would need to exclude this metric from your original incident rule somehow or find another way to keep the metric from reaching the critical status.

      There are a few different ways to do that.

      My preference is to edit the metric thresholds so that no alert is triggered. You can do this by changing the advanced metric collection settings to adjust the number of occurrences required before an alert is triggered. So if I have a backup that takes one hour to run and it forces the disk busy % to 100%, then I may set that metric to run every 15 minutes but require 5 consecutive occurrences before triggering an alert. That way the backup does not cause the metric to trigger an alert.
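A rough Python sketch of the occurrence counting described in this reply, to show why a short spike stays quiet while a sustained problem still alerts. The function name, the sample values and the 95% threshold are assumptions, and the code only models the idea of consecutive occurrences; it does not reproduce how the agent evaluates thresholds.

```python
from typing import Optional, Sequence


def first_alert_sample(values: Sequence[float], threshold: float,
                       occurrences: int) -> Optional[int]:
    """Index of the sample at which an alert would fire, or None if it never does.

    An alert needs `occurrences` consecutive samples at or above `threshold`;
    any sample below the threshold resets the count.
    """
    streak = 0
    for index, value in enumerate(values):
        if value >= threshold:
            streak += 1
            if streak >= occurrences:
                return index
        else:
            streak = 0
    return None


if __name__ == "__main__":
    # Disk busy % sampled every 15 minutes; the backup is assumed to show up
    # in four consecutive samples before the disk calms down again.
    during_backup = [20.0, 100.0, 100.0, 100.0, 100.0, 30.0]
    print(first_alert_sample(during_backup, 95.0, 5))
    print(first_alert_sample(during_backup, 95.0, 1))
```

With five occurrences required, the four busy samples from the backup never produce an alert, whereas with the setting at 1 the first busy sample does. If the backup window happens to line up with five collections, it can still alert, so leave some margin when choosing the number.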

      You can also try some other options. I would try something like this:
      1) Add a new incident rule to your new rule set that matches this metric, and add a rule action that sets the priority to “Low” when it sends an email notification.
      2) Edit your original incident rule set, and change your notification action so that instead of running as “Always execute the actions”, set it to “Only execute if specified conditions match” and select the checkbox next to Priority so that you can apply it to every priority EXCEPT for Priority=Low.

      That should allow you to, in the future, set up rules that specify Priority=Low for any metrics for which you do not want to send pages. There are probably several other ways to do this as well, such as by excluding Category=Load, or ignoring this metric alert and using your own metric extension to notify on high disk usage outside of your backup window, or setting status to “Work In Progress” and ignoring that status… any of these should work for you.

      Good luck!

  3. Blesson

    For a production 11gR2 Oracle EBS related database that has been added in our Enterprise Manager Grid Control 12c, we require the email alert for blocking and locking sessions to be sent to a particular user from the application team, who wants to be notified of any locking/blocking sessions that stay in the same state for more than 30 minutes when the number of sessions being blocked is > 5. The version of our EM is 12.1.0.2.0. Currently we receive email alerts for issues sent to the DBA team email address which is stored in OEM12c. But for this one particular database, the application team wants the email alerts, and only those related to blocking/locking sessions. Can you help me set this up?

    1. Brian Pardy Post author

      Hello Blesson,

      This should be possible. I am not sure from your comment if you already have this working – it sounds like you do, for the DBA team, but just in case I’ll mention how I would suggest setting it up. I will make a guess that you are using monitoring templates (under the Enterprise menu) to define the metric thresholds that you use for alerts on most of your databases. Assuming that is the case, within EM12c you should be able to go directly to the specific 11gR2 EBS database you’re interested in here, then go to Oracle Database -> Monitoring -> Metric and Collection Settings. The settings that you make on this screen would apply to only that one database, so this is where I would set your User Locks -> Maximum Blocked Session count threshold to critical at 5, then click the pencil icon to the right and that should bring up a detail screen. The detail screen should show the collection schedule for the metric (every 10 minutes in my case for this metric), and here you can change the “number of occurrences” setting from 1 to something more appropriate, such as 3. With the metric collected every 10 minutes and number of occurrences set to 3, the metric should only trigger an alert and incident after 30 minutes (3 * 10).

      That will make sure the alerts occur at the threshold/timing you wanted. Next you need to add a new incident rule that will identify this specific incident. I would create a new incident ruleset, call it something like “Apps Team Incident Ruleset”, then add rules to the ruleset that apply only to this specific database target and only for these specific metric alert events.

      I would use a rule of type “Incoming events and updates to events”, make sure the radio button is selected next to “Specific events of type metric alert”, then click the Add button and select the target type “database instance”, then find the metric for blocked user sessions. Set the severity to warning/critical (depending on the thresholds you set and your needs), then add an action to the incident rule that will email the apps team. Save this ruleset and make sure it is active and you should be all set from here. It’s a little more complicated than that and this is mostly from memory, but I think you should be able to make that work. Please feel free to ask if you have other questions.

  4. Sameer

    Hello Brian,

    Have you tried setting up an incident rule that fires upon failure of a corrective action? For example: an alert is raised and you have attached a corrective action to mitigate the alert situation. You need an email if the corrective action fails. At my client, we have set up multiple rules and they all work fine as expected. But unfortunately, I am having issues with the situation mentioned above and would like to know whether you have worked on a similar rule or scenario.

    Currently, I am waiting for useless/endless feedback from my Oracle SR.

    Regards
    Sameer

    1. Brian Pardy Post author

      Hello Sameer,

      First I have to say that I have upgraded to EM13c and no longer have an EM12c environment available. I have a corrective action in EM13c that remounts an NFS backup directory when a backup job fails, and I receive an email notification from the corrective action whenever it runs, whether it succeeded or failed.

      From a look at the 12c documentation (https://docs.oracle.com/cd/E24628_01/doc.121/e24473/jobs.htm#EMADM10883) it sounds like it works the same way as in EM13c.

      I have two separate rulesets that handle this. One of them is used only to invoke the corrective action when an event of type High Availability, component Backup, operation Backup has status=Failed. This incident rule does not send me any notifications at all.

      In the other ruleset (which I use for all of my notifications), I have a simple rule for all events of type High Availability, component Backup, operation backup, status=Failed and the only rule action for this rule is to email me a notification.

      Based on the documentation, a corrective action triggers the same incident rules as the event that caused the corrective action. So I receive a notification when the backup fails (from the second ruleset I described above) and then another notification when the corrective action succeeds or fails.

      I hope somehow this is helpful even though I don’t still have EM12c available. Good luck!

      -Brian

      1. Sameer

        Thanks Brian for the reply.

        “and I receive an email notification from the corrective action whenever it runs, whether it succeeded or failed.”

        Will it be possible for you to upload screen shots? You can mail me as well if uploading here is not possible. I will appreciate that…

        One thing I would like to know from your above post, I see in the image (Edit Selected Metrics)

        Severity and Corrective Action Status ->

        You haven’t selected any severity as I see in the image and you have checked all the 4 check boxes, meaning for success & problem for corrective action (warning & critical).

        What I do is, I also select severity from drop down box (Critical & Warning) and for CA, I select only “Problems for critical metric alert” and “Problems for warning metric alert”. For success, I do get an email. But for problems (critical & warning), I do not..

        I will try to check this with my 13c test OEM and will get back to you.

        Thanks
        Sameer

    2. Brian Pardy Post author

      Hi Sameer,

      I don’t see a reply button on your latest comment for some reason so I am responding to your first one.

      I wonder if we are seeing different behavior because you are running a corrective action triggered by Metric Alert events, while I am running a corrective action triggered by High Availability events. These may be processed differently inside of the incident rule system – I can see that the rule screens are very different.

      Rule screen for metric alert event: https://s32.postimg.org/mo7z04m9h/metricalertrule.png
      Rule screen for high availability event: https://s32.postimg.org/9isgu0sdx/harule.png

      I do also have a Metric Alert-based corrective action set up to run archive log backups when the archive area used % metric reaches critical, but I don’t think this corrective action has ever failed, so I am not sure whether or not it actually sends me notifications properly on failure. I will try to see if I can find a way to test this out in a way to force that corrective action to fail to see if I receive notifications.

      Have you tried this with the Severity drop down box left blank? I cannot remember why I suggested leaving that blank when I wrote this post, but leaving it blank may help.

      Also, just to confirm, in this rule’s action settings, do you have anything set in the “Conditions for actions” area? I have mine set to “Always execute the actions”, and the only action specified I have is a basic notification “E-mail To” my EM13c user.

      As a last thing to check, I see at the bottom of the two screenshots I linked in this response there is a checkbox for “Corrective action completed” – if I check that box, it offers me choices of “with failed status” or “with successful status”. Is that checkbox there in EM12c? If it is, that might be the best way to make this work for you. I am not sure if it is new in EM13c or not, and I definitely do not have it checked, but it looks like this might be the way Oracle expects us to configure corrective action notifications.

      Good luck!

      1. Sameer

        Hi Brian,

        “I wonder if we are seeing different behavior because you are running a corrective action triggered by Metric Alert events, while I am running a corrective action triggered by High Availability events. These may be processed differently inside of the incident rule system – I can see that the rule screens are very different.”

        This is very likely. I got the same question from Oracle: do I see the same behavior for Host based alerts? As I don’t have any CA written, I cannot verify it. So the behavior might be broken for some events.

        “Have you tried this with the Severity drop down box left blank? I cannot remember why I suggested leaving that blank when I wrote this post, but leaving it blank may help.”

        Yes, but it’s weird: the rule is not executed at all. For example: a tablespace gets full and reaches the “Warning” threshold, then the CA gets executed and fails, but I guess as there is no mapping set in the “Severity” drop down, no notification is sent.

        So I think, when I leave the Severity (“Warning”) unchecked (as you suggest as well), the CA failure rule doesn’t even get picked up 😦

        “Conditions for actions” -> I am executing always. I haven’t set any condition.

        “Corrective action completed” -> This is 13c feature and check box is missing in 12c. 😦 🙂
        Thanks again Brian for the responses.

        Sameer
