snmpconf String handling in PM-09



Hi -

This posting is a follow-up to a hallway conversation Steve
Waldbusser and I had in Salt Lake City.  There appear to be
some string handling problems in the Policy Based Management
MIB version draft-ietf-snmpconf-pm-09.txt, resulting from
some ambiguity about whether strings in the PolicyScript
language are to be considered:

        1) arbitrary sequences of octets, some of which may
           be interpreted as UTF-8

        2) UTF-8 data

        3) sequences of Unicode / ISO 10646 code points, which
           may be converted to UTF-8 under some circumstances

The decision to make the language loosely typed, along
with SNMP's requirements for strong typing, can result
in problems if we're not very careful in the language and
library definitions.  Here are some specific examples from
the document:

|     hex_digit:    digit | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
|                         | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
|
|     escape_seq:    '\''   |   '\"'   |   '\?'   |   '\\'
|                  | '\a'   |   '\b'   |   '\f'   |   '\n'
|                  | '\r'   |  '\t'    |   '\v'
|                  | '\' oct_digit+    | '\x' hex_digit+
..
|     string_literal:    '"' s_char* '"'
|
|     s_char:            non_quote | ''' | escape_seq

This snippet defines, among other things, the use of "\x"
escapes in strings.  For hex values in the range 0x80 to 0xff,
there is a potential problem.  Does the string "Jos\xe9",
when transformed into an index, become 5.74.111.115.195.169 or
4.74.111.115.233?  (forgive me if I get a bit or two wrong -
I did the calculations by hand)  The difference results from
whether the hex escape is thought of as a code point for
é (consequently subject to the UTF-8 transformation)
or as a raw octet.
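
To make the two readings concrete, here is a minimal sketch
in Python (not PolicyScript).  The helper as_index() is a
hypothetical name; it assumes a variable-length OCTET STRING
index rendered as a length subid followed by one subid per
octet.

```python
def as_index(octets: bytes) -> str:
    """Render octets as a length-prefixed index: length, then one subid per octet."""
    return ".".join(str(n) for n in (len(octets), *octets))

# Reading 1: \xe9 names the code point U+00E9, which is then UTF-8 encoded.
as_code_point = "Jos\u00e9".encode("utf-8")

# Reading 2: \xe9 is a raw octet stored in the string unchanged.
as_raw_octet = b"Jos\xe9"

print(as_index(as_code_point))  # 5.74.111.115.195.169
print(as_index(as_raw_octet))   # 4.74.111.115.233
```

The same literal therefore names two different rows,
depending on which reading an implementation picks.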


This interacts with the definition of the var class:

| The String type is the set of all finite ordered sequences of
| zero or more 8-bit unsigned integer values ("elements"). The
| string type can store textual data as well as binary data
| sequences. Each element is regarded as occupying a position
| within the sequence. These positions are indexed with
| nonnegative integers. The first element (if any) is at
| position 0, the next element (if any) at position 1, and so
| on. The length of a string is the number of elements (i.e.,
| 8-bit values) within it. The empty string has length zero and
| therefore contains no elements.

It needs to be made clearer that the elements of strings as
defined here are NOT characters.  I'd also suggest that, in
addition to the \x escape for inserting arbitrary octets (which
may represent values that would be illegal in a UTF-8 encoding),
we also need a \u escape for inserting arbitrary code points
(which will be subject to the UTF-8 transform) into a string
constant.  This is particularly important where the developer's
tools may not have the necessary input or rendering methods for
arbitrary writing systems.
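
A rough sketch of how the two escapes might differ, written
in Python.  The function decode_literal() and the fixed
widths (exactly two hex digits after \x, exactly four after
\u) are assumptions made for illustration; the draft's
grammar allows one or more hex digits after \x and does not
define \u at all.

```python
import re

# Assumed escape shapes: \xHH (raw octet) and \uHHHH (code point).
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})")

def decode_literal(body: str) -> bytes:
    """Turn the body of a string literal into the octets it denotes."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(body):
        # Ordinary text between escapes is stored as UTF-8.
        out += body[pos:match.start()].encode("utf-8")
        if match.group(1) is not None:
            # \xHH: append the octet as-is, even if the result is not UTF-8.
            out.append(int(match.group(1), 16))
        else:
            # \uHHHH: append the UTF-8 encoding of the code point.
            # (A surrogate value would raise UnicodeEncodeError here.)
            out += chr(int(match.group(2), 16)).encode("utf-8")
        pos = match.end()
    out += body[pos:].encode("utf-8")
    return bytes(out)

print(decode_literal(r"Jos\xe9"))    # b'Jos\xe9'
print(decode_literal(r"Jos\u00e9"))  # b'Jos\xc3\xa9'
```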

|   - Strings are compared with ==, <=, < etc. (Details in Sec. 6.2.1)

It also needs to be made clear whether, for example, é
appearing in a string ends up getting normalized in some way,
as in http://www.unicode.org/unicode/reports/tr15/#Examples.
Otherwise, to ensure consistent operation across
implementations, script authors would be forced to enter all
their strings in hex or octal.
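
As an illustration of why normalization matters, this short
Python sketch uses the standard unicodedata module to show
two spellings of the same visible character that differ at
the octet level.  It says nothing about what the draft
should choose; it only shows the mismatch.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'

# Without normalization the two strings are simply different.
print(composed == decomposed)      # False

# After normalizing to form NFC they compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```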

Furthermore, it should be made explicit that "==", "<=",
and so on are NOT subject to locale-specific behaviour.
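
One locale-independent definition is plain octet-by-octet
comparison.  The Python sketch below assumes that reading
(the draft does not say so explicitly) and shows the
ordering it produces, which differs from any dictionary
order.

```python
words = ["zebra", "\u00e9clair", "Zulu", "apple"]
encoded = [w.encode("utf-8") for w in words]

# Python compares bytes objects octet by octet, ignoring the locale.
print(sorted(encoded))
# [b'Zulu', b'apple', b'zebra', b'\xc3\xa9clair']
```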

|   - No variable substitution in "" strings. '' strings are 1 char only.

I've already commented on why we should get rid of this.

| Octet String
|     On input:
|       Either a String or an Integer. If an Integer, it will be coerced
|       to a String with the ToString() function. This string will be
|       used as an unencoded representation of the octet string value.
|
|     On output:
|       A String containing the unencoded value of the octet string.
|
| Object Identifier
|     On input and on output:
|       A String containing a decimal ascii encoded object identifier
|       of the following form:
|
|           oid:       subid [ '.' subid ]* [ '.' ]
|           subid:     '0' | decimal_constant
|
|     It is an RTE if an Object Identifier argument is not in the form
|     above. Note that a trailing '.' is acceptable and will simply be
|     ignored (note however, that a trailing dot could cause a strncmp()
|     comparison of two otherwise-identical OIDs to fail - instead use
|     oidncmp()).
|
|     Note that ascii descriptors (e.g. "ifIndex") are never used in
|     these encodings "over the wire". They are never returned from
|     accessor functions nor are they ever accepted by them. NMS user
|     interfaces are encouraged to allow humans to view object
|     identifiers with ascii descriptors, but they must translate those
|     descriptors to dotted-decimal format before sending them in MIB
|     objects to policy agents.

I think these two could interact badly in the case of extracting
and re-assembling an index of type OCTET STRING whose value happens
to be something like "+123".  (Would the "+" get lost along the way?)
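
The worry can be sketched as follows in Python.  Both
functions are hypothetical stand-ins: loosely_typed() models
an implementation that treats integer-looking text as an
Integer, and to_string() models the ToString() coercion.
Whether a real implementation behaves this way is exactly
the open question.

```python
def loosely_typed(text: str):
    """Hypothetical coercion: integer-looking text becomes an Integer."""
    try:
        return int(text)
    except ValueError:
        return text

def to_string(value) -> str:
    """Hypothetical stand-in for ToString(): integers become decimal text."""
    return str(value) if isinstance(value, int) else value

original = "+123"
round_tripped = to_string(loosely_typed(original))

print(round_tripped)              # 123
print(round_tripped == original)  # False
```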

| 9.1.3.4.  searchColumn()
|
|     integer searchColumn(string columnoid, string &oid,
|                          string pattern, integer mode
|                          [, string context, NonLocalArgs])
..
|         The 'mode' value controls what type of match to perform on
|         this 'SearchString' value. There are 6 possibilities for mode:
|
|             mode       Search Action
|                0       Case sensitive exact match of 'pattern'
|                        and 'SearchString'
|                1       Case insensitive exact match of 'pattern'
|                        and 'SearchString'
|                2       Case sensitive substring match, finding
|                        'pattern' in 'SearchString'
|                3       Case insensitive substring match, finding
|                        'pattern' in 'SearchString'
|                4       Case sensitive regular expression match,
|                        searching 'SearchString' for the regular
|                        expression given in 'pattern'.
|                5       Case insensitive regular expression match,
|                        searching 'SearchString' for the regular
|                        expression given in 'pattern'.

Do we *really* want to go here?  draft-alvestrand-i18n-howto-01.txt
gives a glimpse at the issues.  For more details please see
http://www.unicode.org/unicode/reports/tr18/ and
http://www.unicode.org/unicode/reports/tr21/

|         searchColumn uses the POSIX extended regular expressions
|         defined in POSIX 1003.2.

I think we need to be explicit about making the behaviour
locale-independent.

|     integer roleMatch(string roleString [, string element,
|                       string context, string contextEngineID])
..
|         'contextEngineID' contains the contextEngineID of the remote
|         system that 'element' resides on. It is encoded as a pair of
|         hex digits (upper and lower case are valid) for each octet of
|         the contextEngineID. If 'contextEngineID' is not present, the

Why do we mandate a special encoding for this string?

| 9.4.1.  regexp()
|
|     integer regexp(string pattern, string str,
|                    integer case [, string &match])

Same issues as for 9.1.3.4 above.

| 9.4.4.  oidncmp()
|
|     integer oidncmp(string oid1, string oid2, integer n)
|
|         Arguments 'oid1' and 'oid2' are strings containing
|         ASCII dotted-decimal representations of object identifiers
|         (e.g. "1.3.6.1.2.1.1.1.0").
|
|         oidcmp compares not more than 'n' subidentifiers of 'oid1' and
|         'oid2' and returns -1 if 'oid1' is less than 'oid2', 0 if they
|         are equal, and 1 if 'oid1' is greater than 'oid2'.
|
|
| 9.4.5.  inSubtree()
|
|     integer inSubtree(string oid, string prefix)
|
|         Arguments 'oid' and 'prefix' are strings containing
|         ASCII dotted-decimal representations of object identifiers
|         (e.g. "1.3.6.1.2.1.1.1.0").
|
|         inSubtree returns 1 if every subidentifier in 'prefix' equals
|         the corresponding subidentifier in 'oid', otherwise it returns
|         0. The is equivalent to oidncmp(oid1, prefix, oidlen(prefix))
|         is provided because this is an idiom and because it avoids
|         evaluating 'prefix' twice if is an expression.

We've found it useful to combine these two into a single
function:

#         oidcmp returns
#               -2 if oid1 is a parent of oid2
#               -1 if oid1 is less than oid2 (but not its parent)
#               0  if oid1 is equal to oid2
#               +1 if oid1 is greater than oid2 (but not its child)
#               +2 if oid1 is a child of oid2
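
A minimal sketch of such a combined function in Python.  It
assumes "parent" and "child" mean any ancestor or descendant
(not just the immediate one) and that a single trailing dot
is ignored, as in the draft's OID syntax; parse_oid() is a
helper invented for this example.

```python
def parse_oid(text: str) -> tuple:
    """Split a dotted-decimal OID into integers, ignoring one trailing dot."""
    if text.endswith("."):
        text = text[:-1]
    return tuple(int(subid) for subid in text.split("."))

def oidcmp(oid1: str, oid2: str) -> int:
    a, b = parse_oid(oid1), parse_oid(oid2)
    if a == b:
        return 0
    if b[:len(a)] == a:   # oid1 is a proper prefix of oid2
        return -2
    if a[:len(b)] == b:   # oid2 is a proper prefix of oid1
        return 2
    # Tuples compare subidentifier by subidentifier, numerically.
    return -1 if a < b else 1

print(oidcmp("1.3.6.1", "1.3.6.1.2.1"))    # -2
print(oidcmp("1.3.6.1.2.1", "1.3.6.1"))    # 2
print(oidcmp("1.3.6.1.2", "1.3.6.1.10"))   # -1
print(oidcmp("1.3.6.1.2.1.", "1.3.6.1.2.1"))  # 0
```

Comparing parsed integers rather than text is what makes
"1.3.6.1.2" sort before "1.3.6.1.10".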

| 9.4.9.  parseIndex()
..
|         If 'type' is String and 'len' is greater than zero, 'len'
|         subids will be parsed. For each subid parsed, the chr() value
|         of the subid will be appended to the returned string. If any

Coupled with the definition of chr() and the rules for encoding
octet strings, this won't work.  A Unicode / ISO 10646 code point
past 127 will be encoded in multiple subidentifiers.  Modifying
the definition of chr() to simply return an octet, without
applying the UTF-8 transformation to the result, would probably
fix this.
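
The mismatch can be sketched in Python.  The function
parse_string_index() is a simplified, hypothetical model of
the String case of parseIndex(); the two chr variants model
the draft's definition and the suggested fix respectively.

```python
def chr_as_code_point(subid: int) -> bytes:
    """chr() as the draft defines it: the UTF-8 encoding of a code point."""
    return chr(subid).encode("utf-8")

def chr_as_octet(subid: int) -> bytes:
    """chr() as suggested here: a single raw octet."""
    return bytes([subid])

def parse_string_index(subids, chr_fn) -> bytes:
    """Append chr_fn(subid) for each subid, as the draft's text describes."""
    return b"".join(chr_fn(s) for s in subids)

subids = [74, 111, 115, 233]   # four octets taken from an index

print(parse_string_index(subids, chr_as_code_point))  # b'Jos\xc3\xa9'
print(parse_string_index(subids, chr_as_octet))       # b'Jos\xe9'
```

With the code point reading, four subids come back as five
octets, so the parsed value no longer matches the index it
came from.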

..
|         If 'type' is String and 'len' is -1, subids will be parsed
|         until the end of 'oid'. For each subid parsed, the chr() value
|         of the subid will be appended to the returned string. If any
|         subid is greater than 255, 'index' will be set to -1 on return
|         and an empty string will be returned.

Same problem.

| 9.4.14.  chr()
|
|     string chr(integer utf8)
|
|         Returns a one-character string containing the character
|         specified by the UTF8 code contained in 'utf8'. Note that a
|         property of UTF8 is that 7-bit ASCII characters are
|         represented by the same UTF8 code-points as their ascii
|         equivalents.

This definition is broken (see above).  There are probably two
functions needed here.
        The first would take an integer in the range 0-255
        as its input and return a string of length one
        containing a single octet with that binary value.

        The other would take an integer in the range of
        valid ISO 10646 code points and return a string
        of some number of octets corresponding to the UTF-8
        encoding of that code point (RFC 2279).  Since there
        are some integers in the range that do not have legal
        UTF-8 encodings (think surrogates), I would suggest
        returning a zero-length string for those cases.
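
A sketch of the two functions in Python, with hypothetical
names chr_octet() and chr_utf8().  One assumption to note:
Python only encodes code points up to U+10FFFF, so this
sketch treats anything above that as having no encoding,
whereas RFC 2279 defines encodings up to 0x7fffffff.

```python
def chr_octet(value: int) -> bytes:
    """Return a one-octet string holding the given binary value (0-255)."""
    if not 0 <= value <= 255:
        raise ValueError("octet value must be in the range 0..255")
    return bytes([value])

def chr_utf8(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a code point, or b'' if it has none."""
    out_of_range = not 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    if out_of_range or is_surrogate:
        return b""
    return chr(code_point).encode("utf-8")

print(chr_octet(233))    # b'\xe9'
print(chr_utf8(233))     # b'\xc3\xa9'
print(chr_utf8(0xD800))  # b''
```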

| 9.4.15.  ord()
|
|     integer ord(string str)
|
|         Returns the UTF8 code-point value of the first character of
|         'str'. This function complements chr(). Note that a property of
|         UTF8 is that 7-bit ASCII characters are represented by the
|         same UTF8 code-points as their ascii equivalents.

The description could be clearer.  The value of the first
byte of str will determine how many additional bytes would
need to be read to compute the value of the corresponding
ISO 10646 code point.
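
A sketch of that logic in Python.  The name utf8_ord() and
the choice of -1 for malformed input are assumptions, and
the lead-byte ranges follow the current four-octet limit
rather than the six-octet forms of RFC 2279.

```python
def utf8_ord(data: bytes) -> int:
    """Return the code point of the first UTF-8 character, or -1 if malformed."""
    if not data:
        return -1
    first = data[0]
    # The lead octet alone determines the length of the sequence.
    if first < 0x80:
        length = 1
    elif 0xC2 <= first <= 0xDF:
        length = 2
    elif 0xE0 <= first <= 0xEF:
        length = 3
    elif 0xF0 <= first <= 0xF4:
        length = 4
    else:
        return -1   # continuation octet or invalid lead octet
    try:
        return ord(data[:length].decode("utf-8"))
    except UnicodeDecodeError:
        return -1   # truncated or otherwise ill-formed sequence

print(utf8_ord(b"J"))          # 74
print(utf8_ord(b"\xc3\xa9s"))  # 233
print(utf8_ord(b"\xe9"))       # -1
```

The last call shows the hazard: a raw 0xe9 octet looks like
the start of a three-octet sequence that never arrives.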

| 9.4.16.  substr()
|
|     string substr(string &str, integer offset
|                   [, integer len, string replacement])

A "health warning" that this function operates on octets rather
than characters would be in order.
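
For instance, slicing by octet position can cut a multi-octet
character in half.  The Python sketch below uses byte slicing
as a stand-in for substr(); it is not the draft's function.

```python
text = "Jos\u00e9!".encode("utf-8")   # 5 characters, 6 octets

piece = text[0:4]   # "first four elements" in octet terms
print(len(text))    # 6
print(piece)        # b'Jos\xc3'

# The slice ends in the middle of the two-octet encoding of é.
try:
    piece.decode("utf-8")
    piece_is_valid_utf8 = True
except UnicodeDecodeError:
    piece_is_valid_utf8 = False

print(piece_is_valid_utf8)  # False
```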

|   strncmp()
|   strncasecmp()

We should be explicit that these differ from their POSIX
equivalents in how they are (not) affected by locale.

|   sprintf()
|   sscanf()

I don't even want to think about "%c" for these two.  :-)

..
| UTF8String ::= TEXTUAL-CONVENTION
|     STATUS       current
|     DESCRIPTION
..
|         Since additional code points are added by
|         amendments to the 10646 standard from time
|         to time, implementations must be prepared to
|         encounter any code point from 0x00000000 to
|         0x7fffffff.  Byte sequences that do not
|         correspond to the valid UTF-8 encoding of a
|         code point or are outside this range are
|         prohibited.

Note that the current Unicode specifications have tightened
up their view of what is legal UTF-8.  See table 3.1b in
http://www.unicode.org/unicode/reports/tr27/
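
To give a feel for what a stricter definition rejects, this
Python sketch runs a few byte sequences through Python's own
strict UTF-8 decoder.  That decoder follows later Unicode
rules and is used here only as a stand-in; it is not a
restatement of table 3.1b.

```python
def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the whole sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

candidates = {
    "shortest form of U+00E9": b"\xc3\xa9",
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "five-octet form from RFC 2279": b"\xf8\x88\x80\x80\x80",
}

for label, octets in candidates.items():
    print(label, is_strict_utf8(octets))
```

Only the first sequence is accepted; the other three are
rejected even though older readings of UTF-8 might have let
some of them through.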

(When I raised this question on the snmpv3 WG list, no
one wanted to deal with it.  :-)

 ------------------------------------------------------
 Randy Presuhn          BMC Software, Inc.  1-3141
 randy_presuhn@bmc.com  2141 North First Street
 Tel: +1 408 546-1006   San José, California 95131  USA
 ------------------------------------------------------
 My opinions and BMC's are independent variables.
 ------------------------------------------------------