Part 1 of 2 Parts

Part 1 introduces some problems discovered with DN handling in FreeIPA. Part 2 discusses proposed utility classes which eliminate the problems described in Part 1, and how to use them.

Introduction
------------

[Note: to keep things simple I'm going to use just the term DN when discussing Distinguished Names. A complete and accurate discussion would involve the RDN and AVA components of a DN and the rules governing their relationships. The gory details can be found in the documentation of the proposed DN, RDN and AVA classes. A brief discussion can be found in [1].]

Sometimes values used in FreeIPA appear to get corrupted. This is due to how we've been handling LDAP DN's. The problem manifests itself when a FreeIPA value contains an LDAP DN reserved character, because in many instances we store FreeIPA values as DN's.

We'll use as an example a privilege named "r,w" for Read/Write. The comma (,) is a reserved character and must be escaped when used in a DN. The LDAP RFC for string representations of DN's (RFC 4514) requires the type and value to be encoded in UTF-8 and then permits multiple ways of escaping reserved characters. Possible ways to escape r,w would be:

    'r\,w'     # backslash escape
               # backslash, then escaped char
    'r\2cw'    # hexadecimal ascii escape
               # backslash, then 2 hex digits for the ascii code point
    '#722C77'  # binary encoded (only should be used for binary data)
               # number sign (i.e. #) followed by two hex digits for
               # each octet

It is implementation dependent as to which method is used. The LDAP client library we use uses the backslash escape method. The LDAP server we use uses the hexadecimal ascii escape method. Thus you can't expect to see one encoding method throughout the code; which one you see depends on the context.

To use a DN with LDAP it must be encoded according to one of the above methods. But when we use FreeIPA values which appear in a DN we must decode the DN to retrieve the original value.
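As a rough illustration of the backslash and hexadecimal escape methods and of decoding either one, here is a minimal Python 3 sketch. The helper names (escape_backslash, escape_hex, unescape) are hypothetical and are not part of FreeIPA or of any LDAP library. The sketch handles only the special characters listed in SPECIAL and ignores the RFC 4514 rules for a leading '#', leading or trailing spaces, and NUL.

```python
# Hypothetical helpers for illustration only; not FreeIPA or LDAP library APIs.
# Simplified: covers only the characters in SPECIAL.
SPECIAL = ',+"\\<>;='
HEX_DIGITS = '0123456789abcdefABCDEF'


def escape_backslash(value):
    """Escape each special character as backslash + the character itself."""
    return ''.join('\\' + ch if ch in SPECIAL else ch for ch in value)


def escape_hex(value):
    """Escape each special character as backslash + two hex digits."""
    return ''.join('\\%02x' % ord(ch) if ch in SPECIAL else ch for ch in value)


def unescape(encoded):
    """Decode a value escaped with either method back to the original text."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == '\\' and i + 1 < len(encoded):
            pair = encoded[i + 1:i + 3]
            if len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
                out.append(int(pair, 16))       # hex escape: one octet
                i += 3
                continue
            out.extend(encoded[i + 1].encode('utf-8'))  # backslash escape
            i += 2
            continue
        out.extend(ch.encode('utf-8'))          # ordinary character
        i += 1
    return out.decode('utf-8')


client_form = escape_backslash('r,w')   # the form a client library might emit
server_form = escape_hex('r,w')         # the form a server might return
```

The two encoded strings differ, yet both decode to the same value, which is why the encoded form seen at any point in the code depends on where the string came from.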
However, it's critical that the decoded values in a DN are only accessed individually, component-wise, because you can't decode an entire DN string and preserve the integrity of the individual components (just consider the case of a comma or an equal sign in one of the components) [1].

Similarly, one cannot just compose a DN via string formatting (e.g. %s=%s) where one of the string substitutions is a string which needs to be encoded. For example, if we wanted to form 'cn=rw' we could write '%s=%s' % ('cn', 'rw'), but '%s=%s' % ('cn', 'r,w') would produce 'cn=r,w', which is invalid DN syntax.

Also, one cannot just compare DN encoded strings, because different implementations use different encoding syntaxes. The decoded values would be equal and the encoded values equivalent, but the encoded values may not be equal as strings. One must decode the values and compare them component-wise. For example, r,w might be r\,w on one side of a comparison and r\2cw on the other side. If you compare the two encoded values they won't compare as equal, when in fact they are equal as decoded values.

The above requirements suggest the use of a class to represent a DN which easily handles access, composition, comparison, encoding, decoding and UTF-8 conversion. Several classes have been added to FreeIPA. The classes also relieve the programmer from having to know what transformations have been applied to a value. The class access methods always provide the right form.

Coding Issues in FreeIPA
------------------------

Here is a brief list of problematic coding practices involving DN's. Note that sometimes our code does the right thing, or partly the right thing (e.g. functions in ldap2.py). The problem is that it's not consistent, and it's very difficult to determine whether certain transformations have been applied when working with a value at some random location in the code.

* Forming an RDN via string formatting, e.g. %s=%s.
  (fails if either string needs encoding, may not be UTF-8)

* Composing DN's via string formatting, e.g. %s,%s. Technically this is the concatenation of RDN sequences.
  (fails if either string needs encoding, may not be UTF-8)

* Not knowing what state a DN or RDN is in, encoded (escaped) or decoded (unescaped), UTF-8 encoded vs. unicode (UTF-8 decoded), and trying to operate on it without accounting for the encoding. (We've been encoding DN's at certain late stages, e.g. just prior to passing them to an LDAP library or via the get_dn() methods. Sometimes it's ambiguous whether the DN string you're operating on has gone through the encoding logic provided by ldap2 or not.)

* Trying to decompose a DN or extract values from a DN via string searches or regular expressions (various failure modes depending on whether the DN string is encoded or not).

* Trying to compare a DN or a component of a DN via string operations (various failure modes when a DN is encoded).

* Not handling the UTF-8 <-> unicode conversion. DN's must be encoded in UTF-8. FreeIPA values by convention are UCS4 Unicode. Thus a DN can never be a unicode object, and when FreeIPA unicode values are inserted into a DN they must be encoded as UTF-8 prior to escaping. We sometimes get away with comparing UTF-8 DN components to FreeIPA unicode values because of Python's automatic type promotion, but there are edge cases which can fail.

* Trying to use a DN component and one of the attribute values of the DN's entry interchangeably. A common case is to have the first RDN of a DN replicated as an attribute in the entry. cn is a good example, and what follows illustrates it:

      # r\2Cw, privilege, usersys.redhat.com
      dn: cn=r\2Cw,ou=privilege,dc=usersys,dc=redhat,dc=com
      objectClass: groupOfNames
      objectClass: top
      cn: r,w

  In this example the first RDN's value in the DN is r,w, but because it's in a DN it must be encoded, hence the first RDN is cn=r\2Cw. However, the cn is also replicated as an attribute/value in the entry, e.g.
      cn: r,w

  If we want the value of the cn, the code sometimes plucks off the first RDN's value, in this case r\2Cw, and sometimes grabs the value of the cn attribute in the entry, in this case r,w. But r\2Cw is not the same as r,w. Why? Because attribute values are not subject to the encoding rules of DN's. Extracting the value of the first RDN would be fine if we were using the proposed DN classes, because the class knows how to extract and decode the component value, but currently we try to extract the substring up to the first comma, which of course is incorrect.

----

[1] Technically a DN component is the type or value of an AVA (Attribute Value Assertion), which is expressed as type=value. AVA's belong to a container called an RDN (Relative Distinguished Name). An RDN is a set of AVA's, hence the AVA's in an RDN are unordered. Most RDN's are single-valued, containing only one AVA (hence one type=value). However, an RDN may be multi-valued, in which case there is more than one AVA. When multiple AVA's are present in an RDN they are separated by a plus sign (+), e.g. cn=Bob+ou=people. DN's are an ordered sequence of RDN's. In the string representation of a DN the RDN's are separated by a comma (,). RDN's are relative because an RDN has meaning only by virtue of its parent RDN. A sequence of RDN's (i.e. a DN) forms a path in a tree, with the leftmost RDN being a leaf and the rightmost RDN being the root. Thus a DN is a unique path in the directory tree, which is why it's called a Distinguished Name: the DN uniquely identifies an LDAP entry.
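
The structure described in [1], and the component-wise access and comparison argued for earlier, can be sketched in a few lines of Python 3. This is only an illustration under simplifying assumptions, not the proposed DN, RDN and AVA classes. The function names (split_unescaped, unescape, parse_dn) are hypothetical, quoted values and '#' hex strings are not handled, and values are compared exactly rather than with per-attribute matching rules.

```python
# Hypothetical sketch; these helpers are not the proposed FreeIPA classes.
HEX_DIGITS = '0123456789abcdefABCDEF'


def split_unescaped(text, sep):
    """Split text at each sep that is not part of a backslash escape."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == '\\' and i + 1 < len(text):
            current.append(text[i:i + 2])   # keep the escape for later decoding
            i += 2
        elif ch == sep:
            parts.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append(''.join(current))
    return parts


def unescape(encoded):
    """Decode backslash and hexadecimal escapes in a single AVA value."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == '\\' and i + 1 < len(encoded):
            pair = encoded[i + 1:i + 3]
            if len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
                out.append(int(pair, 16))
                i += 3
                continue
            out.extend(encoded[i + 1].encode('utf-8'))
            i += 2
            continue
        out.extend(encoded[i].encode('utf-8'))
        i += 1
    return out.decode('utf-8')


def parse_dn(dn):
    """Return an ordered list of RDNs; each RDN is a frozenset of (type, value)."""
    rdns = []
    for rdn_text in split_unescaped(dn, ','):
        avas = set()
        for ava_text in split_unescaped(rdn_text, '+'):
            attr, _, value = ava_text.partition('=')
            # Attribute types are case-insensitive; values are decoded here.
            avas.add((attr.strip().lower(), unescape(value)))
        rdns.append(frozenset(avas))
    return rdns


client_dn = 'cn=r\\,w,ou=privilege,dc=usersys,dc=redhat,dc=com'
server_dn = 'cn=r\\2Cw,ou=privilege,dc=usersys,dc=redhat,dc=com'

same_entry = parse_dn(client_dn) == parse_dn(server_dn)  # component-wise compare
first_cn = dict(parse_dn(server_dn)[0])['cn']            # decoded first RDN value
```

Because each RDN is held as a set, cn=Bob+ou=people and ou=people+cn=Bob compare as equal, while the order of the RDN's within the DN still matters. The decoded first RDN value also matches the cn attribute value stored in the entry.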