Re: snmpconf-pm-04 notes

Wes Hardaker wrote:
> Steve> So we've made it easy to poll but haven't achieved sub-second
> Steve> notification (only possible with a notification). And it's here
> Steve> that I'm a bit reluctant to say that a notification is a
> Steve> requirement. To be honest, I really don't have a sense of this
> Steve> yet. One more notification isn't that big a deal but I'm most
> Steve> worried about getting into ratholes about how we can make sure
> Steve> that floods of notifications don't occur.
> Steve> All we would absolutely need would be to have a notification
> Steve> that is sent whenever the abnormalTerminations gauge goes from
> Steve> zero to non-zero and no more than 1 notification per 60 seconds
> Steve> (or some similar non-configurable constant). If we could keep
> Steve> it simple like this it might be worth it.
> Throttling is a very very good thing, you're right.  They should be
> throttled.  I wouldn't, however, base my notifications on the value 0
> for the pmPolicyAbnormalTerminations object.  It implies that it won't
> fire another notification till it wraps around to 0 again or until the
> object is destroyed and recreated.  I'd say fire if it changes at all,
> and don't fire another one till after the throttle factor.

Ack. That pesky troublesome abnormalTerminations object again. This has
confused people in the past so I recently rewrote it:

pmPolicyAbnormalTerminations OBJECT-TYPE
    SYNTAX      Gauge32
    UNITS       "elements"
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
         "The number of elements that, in their most recent filter or
         action execution, have experienced a run-time exception and
         terminated abnormally. Note that if a policy was experiencing
         a run-time exception while processing a particular element
         but on a subsequent invocation it runs normally, this number
         can decline."
    ::= { pmPolicyEntry 12 }

Let's say we have 20 elements. If, on the first iteration through them,
policy P failed on 3 of them, this object would be set to 3. If on a
later iteration through them policy P failed on only 1 of them, this
object would be set to 1. If on a later iteration through them policy P
succeeded on all of them, this object would be set to 0. This object is
a gauge.

In other words, this should be 0 on most policies. Any nonzero value
indicates a policy that is currently failing on some elements. If you
fix a policy or it "fixes itself", the value will go back to zero.
pmPolicyExecutionErrors is a counter and shows all errors the policy has
ever experienced.
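
To make the gauge/counter distinction concrete, here is a minimal sketch
of the 20-element example. The PolicyStats class and run_pass helper are
hypothetical names for illustration only, and it assumes that
pmPolicyExecutionErrors increments once per failed execution on an
element.

```python
# Hypothetical model of one policy's two error objects. The attribute names
# mirror the MIB objects; the class itself is only an illustration.
class PolicyStats:
    def __init__(self):
        self.abnormal_terminations = 0  # gauge: elements failing right now
        self.execution_errors = 0       # counter: errors ever seen (assumed +1 per failure)
        self._failed = set()            # elements whose most recent run failed

    def record(self, element, ok):
        """Record the outcome of the most recent execution on one element."""
        if ok:
            self._failed.discard(element)
        else:
            self._failed.add(element)
            self.execution_errors += 1
        self.abnormal_terminations = len(self._failed)


def run_pass(stats, failing, total=20):
    """Run the policy once over elements 0..total-1; `failing` is a set of elements."""
    for element in range(total):
        stats.record(element, element not in failing)


stats = PolicyStats()
history = []
for failing in ({0, 1, 2}, {0}, set()):
    run_pass(stats, failing)
    history.append((stats.abnormal_terminations, stats.execution_errors))

print(history)  # [(3, 3), (1, 4), (0, 4)]
```

The gauge falls back to 0 once every element runs cleanly, while the
counter never decreases.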

So that's why I'm thinking about the 0 to 1 transition as the noteworthy
event.

> And I disagree that it shouldn't be a configurable throttle value.  A
> global notification throttle object (defaulting to, say, 60) should be
> defined to deal with this.

The 0 to 1 transition should substantially limit the number of messages
and then all we need is some reasonable fixed interval to mop up any
issues regarding repetitively bouncing between 0 and 1.
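
A rough sketch of that rule, under stated assumptions: TransitionNotifier
and its update() method are hypothetical names, the agent is assumed to
sample the gauge itself, and the interval is an arbitrary fixed constant
(this thread floats values from 60 seconds up to an hour). A transition
that lands inside the throttle window is dropped here, not deferred.

```python
class TransitionNotifier:
    """Notify on a 0 -> non-zero gauge transition, at most once per interval."""

    def __init__(self, min_interval_s=3600):
        self.min_interval_s = min_interval_s  # fixed constant; the value is an assumption
        self._prev = 0                        # last gauge value seen
        self._last_sent = None                # time of the last notification, if any

    def update(self, gauge, now):
        """Feed a gauge sample taken at time `now` (seconds); True means send."""
        rising = self._prev == 0 and gauge > 0
        self._prev = gauge
        if not rising:
            return False
        if self._last_sent is not None and now - self._last_sent < self.min_interval_s:
            return False  # suppressed: still inside the throttle window
        self._last_sent = now
        return True


notifier = TransitionNotifier(min_interval_s=3600)
# (time in seconds, gauge value): the gauge bounces 0 -> 3 -> 1 -> 0 -> 2 -> 0 -> 1
samples = [(0, 0), (10, 3), (20, 1), (30, 0), (40, 2), (4000, 0), (4010, 1)]
sent = [t for t, g in samples if notifier.update(g, t)]
print(sent)  # [10, 4010]
```

The bounce at t=40 produces nothing because it falls within an hour of
the first alert; a manager that cares can still see it by polling the
error counters.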

Keep in mind that the MTTR of a broken script is fairly long because you
have to find a programmer, give him a few cokes, and wait. The reason we
want a notification is so that some automated remedial action can occur
instantly (disable the policy, install the last good version, shut down
the device, ...) until the repair is made some time later. So once we're
alerted, we're probably not going to find another notification
"interesting" for hours (until we think it's actually repaired). What
I'm trying to argue is that if we can guarantee an alert on the first 0
to 1 transition you're pretty much done. Very little chance of
underreporting the problem. All that's left is to make sure you don't
overreport.

Regarding overreporting, we just need some protection from floods. I say
just pick a number. I realize it should be higher than I originally
said. Maybe 30-60 minutes. Why would you ever set it lower? In the
unlikely case that your programmer has instant response time and fixes
the problem in 10 minutes, why can't we just watch the error counters
for another 50 minutes?

I'm not saying it won't ever be useful, just that most of the time it
isn't useful and I can't see that it's ever necessary. I'm just trying
to avoid feature creep. It's easy to add objects now. It's easy to add
them later. But it's impossible to lower the cost after the fact.

> (side note: I'd also really like to see multiple actions be available
> per policy rule, and ideally fall back actions as well.  Or can a code
> table call another code object in the same table? (hence making it
> possible to write a function X that calls Y and Z, which could also be
> called independently by another action).

We only have one action. Fall-back actions are available through the use
of precedence:

Let's say I have a normal QOS, a gold QOS, and a videoconference QOS:
Filter                     Action                    Precedence
if (interface)             apply normal QOS          1
if (interface
  && roleMatch("Gold"))    apply Gold QOS            2
if (interface #27
  && between 2 and 3 PM)   apply Videoconference QOS 3

So all interfaces get at least normal. Anything tagged with "Gold"
gets Gold QOS. If you delete the "Gold" tag, it reverts to normal. No
matter what the previous QOS was, at 2PM interface 27 gets set to
Videoconference. At 3PM interface 27 reverts back to whatever it 
deserves at that time.
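
The selection logic behind that table can be sketched roughly as
follows. This is an illustration, not the draft's execution model: the
Rule class, the select_action function, and the element dictionary with
its ifIndex, roles, and hour keys are all hypothetical, every element is
assumed to be an interface, and the rule with the numerically highest
precedence is assumed to win.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    action: str                       # what to apply when the filter matches
    matches: Callable[[dict], bool]   # the filter, evaluated per element
    precedence: int                   # assumed: higher number wins


RULES = [
    Rule("normal QOS", lambda e: True, 1),
    Rule("Gold QOS", lambda e: "Gold" in e["roles"], 2),
    Rule("Videoconference QOS",
         lambda e: e["ifIndex"] == 27 and 14 <= e["hour"] < 15, 3),
]


def select_action(element: dict) -> Optional[str]:
    """Return the action of the highest-precedence rule whose filter matches."""
    candidates = [r for r in RULES if r.matches(element)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.precedence).action


gold27 = {"ifIndex": 27, "roles": {"Gold"}, "hour": 13}
print(select_action(gold27))                     # Gold QOS
print(select_action({**gold27, "hour": 14}))     # Videoconference QOS
print(select_action({**gold27, "hour": 15}))     # Gold QOS
print(select_action({**gold27, "roles": set()})) # normal QOS
```

Because the choice is re-evaluated from the filters each time, the
fall-back needs no memory of the previous QOS: when a higher-precedence
filter stops matching, the next one down takes over.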