Part 1 of 2 Parts

Part 1 introduces some problems discovered with DN handling in
FreeIPA.

Part 2 discusses proposed utility classes which eliminate the
problems described in Part 1 and how to use them.

Introduction
------------

[Note: to keep things simple I'm going to just use the term DN to
discuss Distinguished Names. A complete and accurate discussion would
involve the RDN and AVA components of a DN and the rules governing
their relationships. The gory details can be found in the
documentation of the proposed DN, RDN and AVA classes. A brief
discussion can be found in [1].]

Sometimes values used in FreeIPA appear to get corrupted. This is due
to how we've been handling LDAP DN's. The problem manifests itself
when a FreeIPA value contains an LDAP DN reserved character, because
in many instances we store FreeIPA values as DN's.

We'll use as an example a privilege named "r,w" for Read/Write. The
comma (,) is a reserved character and must be escaped when used in a
DN.

The LDAP RFC for string representations of DN's (RFC 4514) requires
the type and value to be encoded in UTF-8 and then permits multiple
ways of escaping reserved characters. Possible ways to escape r,w
would be:

'r\,w'          # backslash escape
                # backslash, then escaped char

'r\2cw'         # hexadecimal ascii escape
                # backslash, then 2 hex digits for the ascii code point

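'#0403722C77'   # binary encoded (should only be used for binary data)
                # number sign (i.e. #) followed by two hex digits for
                # each octet of the BER encoding of the value (shown
                # here as an OCTET STRING: tag 04, length 03, then the
                # octets 72 2C 77)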

Which method is used is implementation dependent. The LDAP client
library we use uses the backslash escape method. The LDAP server we
use uses the hexadecimal ascii escape method. Thus you can't expect
to see one encoding method throughout the code; it depends on the
context.
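
As a rough illustration of the first two escaping styles, here is a
minimal Python 3 sketch. The function names are hypothetical (they are
not part of FreeIPA or of any LDAP library), and the sketch is
simplified: it does not handle NUL characters or the '#' binary form.

```python
# Characters RFC 4514 requires escaping anywhere in an attribute value.
SPECIAL = '"+,;<>\\'


def _needs_escape(value, i):
    """True if the character at index i must be escaped."""
    ch = value[i]
    # A leading '#' or space and a trailing space are also reserved.
    at_edge = (i == 0 and ch in '# ') or (i == len(value) - 1 and ch == ' ')
    return ch in SPECIAL or at_edge


def escape_backslash(value):
    """Escape by prefixing each reserved character with a backslash."""
    return ''.join('\\' + ch if _needs_escape(value, i) else ch
                   for i, ch in enumerate(value))


def escape_hex(value):
    """Escape by replacing each reserved character with backslash + 2 hex digits."""
    # All reserved characters are ASCII, so ord() gives the UTF-8 octet.
    return ''.join('\\%02x' % ord(ch) if _needs_escape(value, i) else ch
                   for i, ch in enumerate(value))


print(escape_backslash('r,w'))   # r\,w
print(escape_hex('r,w'))         # r\2cw
```

Both results are valid encodings of the same value, which is why code
that sees DN strings from different sources cannot assume one form.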

To use a DN with LDAP it must be encoded according to one of the above
methods. But when we use FreeIPA values which appear in a DN we must
decode the DN to retrieve the original value. It's critical that the
decoded values in a DN are only accessed individually, component by
component, because you can't decode an entire DN string and preserve
the integrity of the individual components (just consider the case of
a comma or an equal sign in one of the components). [1]

Similarly one cannot just compose a DN via string formatting
(e.g. %s=%s) where one of the string substitutions is a string which
needs to be encoded. For example if we wanted to form 'cn=rw' we could
write '%s=%s' % ('cn', 'rw') but '%s=%s' % ('cn', 'r,w') would fail
because the unescaped comma in 'cn=r,w' is read as an RDN separator,
which makes the result invalid DN syntax.
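
To make the failure concrete, here is a small Python 3 sketch. The
helper split_unescaped() is hypothetical and written only for this
illustration; it splits on separators that are not backslash escaped
and ignores other parts of the DN grammar (such as multi-valued RDN's).

```python
def split_unescaped(s, sep):
    """Split s on sep, skipping any character that follows a backslash."""
    parts, current, i = [], [], 0
    while i < len(s):
        ch = s[i]
        if ch == '\\' and i + 1 < len(s):
            # Keep the backslash and the next character together;
            # an escaped character is never a separator.
            current.append(s[i:i + 2])
            i += 2
        elif ch == sep:
            parts.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append(''.join(current))
    return parts


# Naive composition: the value's comma becomes an RDN separator.
naive = '%s=%s,%s' % ('cn', 'r,w', 'ou=privilege')
print(split_unescaped(naive, ','))
# ['cn=r', 'w', 'ou=privilege']   ('w' is not a valid RDN)

# Composition with the value escaped first keeps the RDN intact.
escaped = '%s=%s,%s' % ('cn', 'r\\,w', 'ou=privilege')
print(split_unescaped(escaped, ','))
# ['cn=r\\,w', 'ou=privilege']
```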

Also one cannot just compare DN encoded strings because different
implementations use different encoding syntaxes. The decoded values
would be equal, the encoded values equivalent, but the encoded values
may not be equal. One must decode the values and compare them
component-wise. For example r,w might be r\,w on one side of a
comparison and r\2cw on the other side of a comparison. If you try
to compare the two encoded values they won't compare as equal when in
fact they are equal as decoded values.
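
The comparison problem can be sketched in Python 3 as follows. The
function unescape_value() is a hypothetical helper, not an existing
FreeIPA or LDAP library call, and it only handles the backslash and
hexadecimal escape forms (not the '#' binary form).

```python
import string


def unescape_value(escaped):
    """Decode an escaped DN attribute value back to the original text."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        ch = escaped[i]
        if ch == '\\' and i + 1 < len(escaped):
            pair = escaped[i + 1:i + 3]
            if len(pair) == 2 and all(c in string.hexdigits for c in pair):
                # Backslash + two hex digits encodes one UTF-8 octet.
                out.append(int(pair, 16))
                i += 3
            else:
                # Backslash + character stands for that character.
                out.extend(escaped[i + 1].encode('utf-8'))
                i += 2
        else:
            out.extend(ch.encode('utf-8'))
            i += 1
    return out.decode('utf-8')


client_form = 'r\\,w'    # backslash escape
server_form = 'r\\2cw'   # hexadecimal escape

print(client_form == server_form)                                   # False
print(unescape_value(client_form) == unescape_value(server_form))   # True
```

The decoded values are compared here as plain strings; a real
comparison would also have to be done one component at a time, as
described above.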

The above requirements suggest the use of a class to represent a DN
which easily handles access, composition, comparison, encoding,
decoding and UTF-8 conversion. Several classes have been added to
FreeIPA. The classes also relieve the programmer from having to know
what transformations have been applied to a value. The class access
methods always provide the right form.

Coding Issues in FreeIPA
------------------------

Here is a brief list of problematic coding practices involving
DN's. Note that sometimes our code does the right thing, or partly the
right thing (e.g. functions in ldap2.py). The problem is that it's not
consistent, and it's very difficult to determine whether certain
transformations have been applied when working with a value at
some random location in the code.

* Forming an RDN via string formatting, e.g. %s=%s. (fails if either
  string needs encoding, may not be UTF-8)

* Composing DN's via string formatting, e.g. %s,%s. Technically this
  is the concatenation of RDN sequences.  (fails if either string
  needs encoding, may not be UTF-8)

* Not knowing what state a DN or RDN is in: encoded (escaped) or
  decoded (unescaped), UTF-8 encoded or unicode (UTF-8 decoded), and
  then operating on it without accounting for the encoding. We've
  been encoding DN's at certain late stages, e.g. just prior to
  passing them to an LDAP library or via the get_dn() methods, so it's
  sometimes ambiguous whether the DN string you're operating on has
  gone through the encoding logic provided by ldap2 or not.

* Trying to decompose a DN or extract values from a DN via string
  searches or regular expressions (various failure modes depending on
  whether the DN string is encoded or not).

* Trying to compare a DN or a component of a DN via string operations
  (various failure modes when a DN is encoded)

* Not handling the UTF-8 <-> unicode conversion. DN's must be encoded
  in UTF-8. FreeIPA values by convention are UCS4 Unicode. Thus a DN
  can never be a unicode object and when FreeIPA unicode values are
  inserted into a DN they must be encoded as UTF-8 prior to
  escaping. We sometimes get away with comparing UTF-8 DN components
  to FreeIPA unicode values because of Python's automatic type
  promotion, but there are edge cases which can fail.

* Trying to use a DN component and one of the DN's entry's attribute
  values interchangeably. A common case is to have the first RDN of a
  DN replicated as an attribute in the entry. cn is a good example and
  what follows illustrates:

  # r\2Cw, privilege, usersys.redhat.com
  dn: cn=r\2Cw,ou=privilege,dc=usersys,dc=redhat,dc=com
  objectClass: groupOfNames
  objectClass: top
  cn: r,w

  In this example the first RDN's value in the DN is r,w but because
  it's a DN it must be encoded hence the first RDN is
  cn=r\2Cw. However the cn is also replicated as an attribute/value in
  the entry, e.g. cn: r,w

  If we want the value of the cn the code sometimes plucks off the
  first RDN's value, in this case r\2Cw and sometimes the code grabs
  the value of the cn attribute in the entry, in this case r,w.  But
  r\2Cw is not the same as r,w. Why? Because attribute values are not
  subject to the encoding rules of DN's.

  Extracting the value of the first RDN would be fine if we were using
  the proposed DN classes because the class knows how to extract and
  decode the component value, but currently we try to extract the
  substring up to the first comma, which of course is incorrect.
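
The last item can be sketched in Python 3 using the entry shown above.
The function first_rdn_value() is a hypothetical helper written for
this illustration (it is not the proposed DN class), and it assumes
the first RDN is single-valued.

```python
import string


def first_rdn_value(dn):
    """Return the decoded value of the first (single-valued) RDN in dn."""
    # Skip the attribute type, which ends at the first '='.
    i = dn.index('=') + 1
    out = bytearray()
    # Stop at the first comma that is not part of an escape sequence.
    while i < len(dn) and dn[i] != ',':
        if dn[i] == '\\' and i + 1 < len(dn):
            pair = dn[i + 1:i + 3]
            if len(pair) == 2 and all(c in string.hexdigits for c in pair):
                out.append(int(pair, 16))            # hex escape: one octet
                i += 3
            else:
                out.extend(dn[i + 1].encode('utf-8'))  # backslash escape
                i += 2
        else:
            out.extend(dn[i].encode('utf-8'))
            i += 1
    return out.decode('utf-8')


dn = 'cn=r\\2Cw,ou=privilege,dc=usersys,dc=redhat,dc=com'
entry_cn = 'r,w'   # the cn attribute value stored in the entry

# Plucking the substring up to the first comma leaves the escape in place.
naive = dn.split(',')[0].split('=', 1)[1]
print(naive)                               # r\2Cw
print(naive == entry_cn)                   # False

# Decoding the component yields the same value as the attribute.
print(first_rdn_value(dn) == entry_cn)     # True
```

With the backslash form (cn=r\,w,...) the substring approach is worse
still, because it cuts the value off at the escaped comma.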

----

[1] Technically a DN component is the type or value of an AVA
(Attribute Value Assertion) which is expressed as type=value. AVA's
belong to a container called an RDN (Relative Distinguished Name). An
RDN is a set of AVA's, hence the AVA's in an RDN are unordered. Most
RDN's are single-valued, containing only one AVA (hence one
type=value). However an RDN may be multi-valued, in which case there
is more than one AVA. When multiple AVA's are present in an RDN they
are separated by a plus sign (+), e.g. cn=Bob+ou=people. DN's are an
ordered sequence of RDN's. In the string representation of a DN the
RDN's are separated by a comma (,). RDN's are relative because an RDN
has meaning only by virtue of its parent RDN. A sequence of RDN's
(i.e. a DN) forms a path in a tree with the left most RDN being a leaf
and the right most RDN being the root. Thus a DN is a unique path in
the directory tree, which is why it's called a Distinguished Name:
the DN uniquely identifies an LDAP entry.
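
The structure described in this note can be modelled with a few lines
of Python 3. This is only a sketch of the data model: the names AVA,
make_rdn and make_dn are hypothetical and are not the API of the
proposed DN, RDN and AVA classes. Values are held in decoded form.

```python
from collections import namedtuple

# One type=value pair, with the value stored unescaped.
AVA = namedtuple('AVA', ['type', 'value'])


def make_rdn(*pairs):
    """An RDN is a set of AVA's, so the order they are given in is irrelevant."""
    # Attribute types are case-insensitive, so normalize them here.
    return frozenset(AVA(t.lower(), v) for t, v in pairs)


def make_dn(*rdns):
    """A DN is an ordered sequence of RDN's, leaf first and root last."""
    return tuple(rdns)


# Multi-valued RDN: cn=Bob+ou=people equals ou=people+cn=Bob.
a = make_rdn(('cn', 'Bob'), ('ou', 'people'))
b = make_rdn(('ou', 'people'), ('cn', 'Bob'))
print(a == b)       # True

# The order of RDN's within a DN does matter.
dn1 = make_dn(make_rdn(('cn', 'r,w')), make_rdn(('ou', 'privilege')))
dn2 = make_dn(make_rdn(('ou', 'privilege')), make_rdn(('cn', 'r,w')))
print(dn1 == dn2)   # False
```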