Introduction

Where to get the code

I put a git repo with my work up on fedorapeople. It's been rebased against master, in the dn branch. To get the code:

git clone git://fedorapeople.org/~jdennis/freeipa.dn.git
git checkout dn

Suggested review strategy

The patch is huge as one would expect when changing the data type of the most heavily used data item in the project. 114 files were affected and 5,526 lines of code modified in some form. I don't think it's practical to review every changed line of code. Instead you should realize the bulk of the changes can be distilled into some basic patterns that were repeated frequently.

Changing dn's that were strings into DN objects was a big set of changes. Another frequent code change was the insertion of assert statements to verify that routines expecting DN objects were in fact passed DN objects and not strings. Those two rather simple modifications account for a large part of the changed lines. I visited each change from string to DN individually, but the edit itself was performed by an editor macro; this sped up the process and eliminated human error during the edit. After the macro executed I visually verified the modification was correct. I don't think there is much value in trying to review all of those changes: it's unlikely a mistake survived, given the automation combined with my review and the fact that the code fully passes the unit tests and installs the server correctly. I'm not saying there couldn't be a lingering error, just that trying to eyeball every one of these in a review does not seem like a fruitful pursuit.

Another large collection of innocuous changes was simplifying how DN's are checked in the unit tests. Formerly we used a lambda expression in the expected value to promote the returned value (e.g. the got value) into a DN object so the equality test could be performed between two DN's. That was fine when we only had a handful of DN's in the code. But now that everything is a DN all those lambda expressions in the test results got unwieldy, verbose and made the code hard to read. The promotion to DN now occurs automatically. Every place in the unit tests where we used a lambda expression to compare a DN has been removed and replaced with the simple expected value. As above, I don't see much value in reviewing all the changes to lambda expressions in the unit tests.

Another large block of diffs in the patch come from moving a file or removing a file.

The above changes probably account for 80% of the patch.

The rest of the changes fall into some basic groups:

  1. Changing how components of a dn are accessed and modified. String operations were replaced with the use of DN operators.
  2. Refactoring some of the ldap objects to get all ldap operations to funnel through the same low level ldap code, so we can use DN's through the entire pipeline and control how data flowing in and out of LDAP is converted. This is explained in great detail below.

Reviewing changes from item 1 may help you better understand how DN's streamline the code by replacing (convoluted) string manipulation with concise Python operators that clearly express the code logic (internally those simple operators assure escaping, encoding, formatting, etc. are properly handled). For review purposes the question arises whether I sufficiently understood the existing code logic and was able to replicate it with DN operators. There were definitely a number of places where I had to study the code long and hard to suss out its actual logic. Were there mistakes in any of these conversions? Careful code review is one approach, but there are quite a few of them to look at. I'm more of the mind that if a logic error was introduced we're more likely to find it via the unit tests or other code exercise rather than by having eyeballs on every one of these changes.

Item 2 is the area most deserving of a traditional code review. It's an architectural change. It attempts to push our code in a particular direction: convergence on a single API for performing ldap operations and a reduction of duplicate code. This was perhaps the most difficult part of the entire task; we had so many different ways of doing the same thing, and unfortunately DN's flowed through every one of those needlessly different code paths.

State of rebase

The repo is current with master as of Friday July 6th. FWIW keeping the branch current with master has been a significant time sink because the changes are spread out over the entire code base.

State of testing

As of Friday July 6th all unit tests pass. ipa-server-install runs without error and produces a server installation under which all unit tests run by lite-server pass.

FWIW during development I typically use RPM's from the devel repo to install the server components, this way I know the 389-ds, KDC, etc. install is clean. Then I run the unit tests under lite-server. Once all the unit tests run successfully I build RPM's from my tree and perform a new install using code from my tree, if there are no install errors I then re-run the unit tests using lite-server against the installed components from my tree. I've found this to be a good way to isolate the two major parts of our code, install code and run-time code, both of which share DN changes.

What needs testing

Obviously others should repeat the unit test and install testing I did as described in State of testing. However, one critical area I have not tested is the command line utilities, especially those that set up replication, perform migration, etc. If someone could help exercise those it would be a tremendous help.

Remaining things to clean up

I've got a few FIXME's sprinkled through the code that I need to go back and revisit (marked with my initials 'jrd').

There are a few places where code formatting or file layout are ugly. Since I was most focused on getting things to run cleanly I postponed making things pretty. Along these lines we should be careful with what we export from modules. Python provides the __all__ module export list, we really should be making better use of that.

I need to add documentation in places, it's not up to my usual standards of documenting everything (although I did document some stuff already).

We need to break the DN class into mutable and immutable variants. There is a lengthy discussion of this below, but I don't believe it's affecting anything at the moment.

Why did I use tuples instead of strings when initializing DN's?

I'm sure this question will come up. Why did you use individual tuples to initialize a DN when you could have used a string? It's a good question to ask.

Here's an example:

dn = 'cn=masters,cn=ipa,cn=etc,%s' % suffix

became:

dn = DN(('cn', 'masters'), ('cn', 'ipa'), ('cn', 'etc'), suffix)

but it could have been this:

dn = DN('cn=masters,cn=ipa,cn=etc', suffix)

Why not? Isn't it easier to read, and doesn't it still produce the same DN object?

Yes, the last form will produce the same value because the DN constructor will parse the first string as a dn string representation, breaking it into its components, and then append the suffix dn components (here we assume suffix is already a DN object).

There are two issues with the above:

It's inefficient: it has to fully parse the string and produce tuples from the parsing, so why not feed it the tuples to begin with and avoid all the parsing work?

It's dependent upon the string being a correctly formatted and escaped string representation of a dn. For this particular simple dn it's not hard to get this right.

But now suppose someone comes along later and decides the first cn value (masters) needs to be parameterized. I can almost guarantee it will be done as string substitution (I've seen actual examples of this multiple times) where the above becomes this:

dn = DN('cn=%s,cn=ipa,cn=etc' % my_master, suffix)

This type of dn formation via string formatting is 100% wrong and is a large part of what motivated us to introduce DN objects in the first place. It's wrong because simple string substitution may not produce a valid dn string representation. Now we're back to the exact same problem we tried so hard to get rid of.
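To make the failure concrete, here is a hedged illustration (the exact escaped form depends on the DN implementation, but the point is that the DN constructor performs RFC 4514 escaping while string substitution does not):

from ipapython.dn import DN

# A value that unexpectedly contains DN special characters:
my_master = 'server1,example,com'

# String substitution silently produces a malformed dn (extra bogus RDN's):
bad = 'cn=%s,cn=ipa,cn=etc' % my_master

# The DN constructor escapes the value, preserving a single RDN, yielding
# something like 'cn=server1\,example\,com,cn=ipa,cn=etc' when stringified:
good = DN(('cn', my_master), ('cn', 'ipa'), ('cn', 'etc'))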

My belief is that, as best code practice with dn's, one should use tuples for the following reasons:

  • It's more efficient

  • It expresses what DN's really are, a sequence of attribute/value pairs (technically that's not correct but it's a reasonable mental model in the majority of cases [4])

  • It discourages the type of careless string formatting discussed above. If the original code had been:

    dn = DN(('cn', 'masters'), ('cn', 'ipa'), ('cn', 'etc'), suffix)
    

and the programmer needed to parameterize 'masters', the natural inclination would have been the correct formulation, where the parameterized value is passed as a tuple member:

dn = DN(('cn', my_master), ('cn', 'ipa'), ('cn', 'etc'), suffix)

I personally find the tuple expression of a dn just as easy to read as a string formatted version, but I realize this may be an issue of personal taste. I believe code readability is an important factor to consider.

[4] DN's are a sequence of RDN's, an RDN is a set of AVA's, and an AVA is an attribute/value pair. In the vast majority of cases an RDN contains only a single AVA, which means an RDN reduces to a single attribute/value pair. An example of an RDN with multiple AVA's would be "ou=people+department=sales". In this case there are two AVA's in the RDN, "ou=people" and "department=sales", formed as a conjunction via the + sign, meaning both assertions must be true for a match to occur.
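For the curious, python-ldap's DN parser makes this structure easy to see (a quick illustration, not IPA code):

import ldap.dn

# Each RDN is returned as a list of (attr, value, flags) AVA tuples; the
# first RDN below contains the two AVA's joined by '+':
rdns = ldap.dn.str2dn('ou=people+department=sales,dc=example,dc=com')
assert len(rdns) == 3        # three RDN's in the DN
assert len(rdns[0]) == 2     # the first RDN holds two AVA's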

Discussion of the changes

Motivation

DN's are compound binary objects with specific rules. DN's have a string representation defined by RFC 4514. There is more than one valid string representation for equivalent DN's. It is not possible to correctly manipulate DN's (extract components, perform comparisons, assemble components, etc.) using simple string operations on a given string representation. The string representation needs to be parsed into its constituent components and the operations performed on the normalized individual components. After manipulation the components can be reassembled back into one of the valid string representations for data exchange purposes.
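A small illustration of why string comparison of dn's is unreliable (this assumes, as the DN class is designed to do, that comparison happens on the parsed and normalized components):

from ipapython.dn import DN

# Two valid string representations of the same DN; the attribute types
# differ only in case, which is insignificant in a DN.
a = 'CN=Web Servers,OU=Groups,DC=example,DC=com'
b = 'cn=Web Servers,ou=Groups,dc=example,dc=com'

a == b            # False: naive string comparison gives the wrong answer
DN(a) == DN(b)    # True:  comparison is performed on normalized components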

In IPA we tried to treat DN's as simple strings; they aren't, and this introduced problems.

Overarching Goal

Define a DN object class which properly implements DN behavior.

Inside of IPA code anything that is a DN will be a DN object 100% of the time.

Quick overview of what changed

  • Convert every string specifying a DN into a DN object

  • Every place a dn was manipulated in some fashion it was replaced by the use of DN operators

  • Add new DNParam parameter type for parameters which are DN's

  • Require that every place a dn is used, it must be a DN object. This translates into a lot of:

    assert isinstance(dn, DN)
    

    sprinkled throughout the code.

  • Moved ipalib.dn to ipapython.dn because the DN class is shared by all components, not just the server which uses ipalib.

  • All API's now accept DN's natively, no need to convert to str

  • unit test framework now accepts DN's natively, no need for lambda expressions in the expected dicts

  • removed ipalib.encoder (see below for details)

  • Entity & Entry classes now utilize DN's

  • Fix __getattr__ in Entity & Entry so that Python internal methods are not masked, e.g. those beginning and ending with double underscores.

  • All certificate subject bases are now DN's

  • Fixed ldapupdate to work with DN's. Rewrote a couple of functions in ldapupdate and added documentation on the data structure being used. The goal was to be clearer and more Pythonic.

  • Removed all usage of get_schema() which was being called prior to accessing the .schema attribute of an object. If a class is using internal lazy loading as an optimization it's not right to require users of the interface to be aware of internal optimizations. Instead the class should handle its lazy loading completely internally by loading the data on first access. This is very easy to do with class properties. schema is now a property of the class instead of an attribute. When the schema property is accessed it actually calls a private internal method to perform the lazy loading.

  • Added SchemaCache class to cache the schemas from individual servers. This was done because of the observation that we talk to different LDAP servers, each of which may have its own schema. Previously we globally cached the schema from the first server we connected to and returned that schema in all contexts.

  • We are aware of the LDAP syntax of all LDAP attributes. Every attribute returned from an LDAP operation is passed through a central table look-up based on its LDAP syntax. The table key is the LDAP syntax; its value is a Python callable that returns a Python object matching the LDAP syntax. There are a handful of LDAP attributes whose syntax is historically incorrect (e.g. DistinguishedNames that are defined as DirectoryStrings). The table driven conversion mechanism is augmented with a table of hard coded exceptions. (A minimal sketch of the table driven idea appears after this list.)

    Currently only the following conversions occur via the table:

    • dn's are converted to DN objects
    • binary objects are converted to Python str objects (IPA convention).
    • everything else is converted to unicode using UTF-8 decoding (IPA convention).

    However, now that the table driven conversion mechanism is in place it would be trivial to do things such as converting attributes which have LDAP integer syntax into a Python integer, etc.

  • Expected values in the unit tests which are a DN no longer need to use lambda expressions to promote the returned value to a DN for equality comparison. The return value is automatically promoted to a DN. The lambda expressions have been removed making the code much simpler and easier to read.

  • Add class level logging to a number of classes which did not support logging, reducing the need for root_logger.

  • Remove ipaserver/conn.py, it was unused.

  • Consolidated duplicate code wherever I found it.
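As promised above, here is a minimal sketch of the table driven conversion idea. The names and OIDs are shown for illustration; the actual table in the patch is richer and includes the hard coded exception list:

from ipapython.dn import DN

# Key: LDAP attribute syntax OID.  Value: a callable producing the Python
# object IPA uses for that syntax.
_SYNTAX_DECODERS = {
    '1.3.6.1.4.1.1466.115.121.1.12': DN,   # Distinguished Name -> DN object
    '1.3.6.1.4.1.1466.115.121.1.40': str,  # Octet String -> binary str (IPA convention)
}

def decode_value(syntax_oid, raw_value):
    # Everything else follows the IPA convention: UTF-8 decoded unicode.
    decoder = _SYNTAX_DECODERS.get(syntax_oid, lambda value: value.decode('utf-8'))
    return decoder(raw_value)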

Derived Goals

Background

I like to think in terms of API boundaries. Anything which is not part of IPA code is on one side of an API boundary; anything inside of IPA code is on the other side. All the code comprising IPA is in the IPA programming domain utilizing IPA API's. One imports data into the IPA programming domain or exports data from the IPA programming domain at an external API boundary.

There is tremendous value in knowing inside of IPA a data item will have a well defined specific data type 100% of the time. This applies to other things besides just DN's. Inconsistency leads to the following problems:

  • Programmer confusion
  • Inefficiency from always coercing to the desired type.
  • Forgetting to coerce to the desired type.
  • Inventing local (incorrect) solutions as opposed to encapsulating all the logic in one central class definition. Encapsulation in a class means a fix can be applied in just one place rather than trying to find every place needing fixing when it's spread out over hundreds of files and thousands of lines of code.
  • Cut-n-paste code duplication.

When a data item has domain specific properties it should be encapsulated in a class definition that implements those domain specific properties. Trying to locally apply domain specific properties to native language data types (i.e. strings, integers, lists, etc.) eschews the primary advantage of writing in an object oriented language and introduces many of the problems object oriented methodology was designed to solve.

Goals

  • Start to converge on type consistency in the IPA code domain. DN's will be an initial step in that direction.
  • Anything that is logically a dn will be a DN object everywhere in IPA code.
    • DN's will be converted to DN objects when they enter the IPA domain via some type of data import at an API boundary. Examples include:
      • Reading data from an LDAP server via a native LDAP call.
      • Reading data from an RPC call.
      • Reading data from any non-IPA external API.
      • Reading data from user input.
      • Reading data from a file.
    • DN's will be converted to a string representation when they are exported from the IPA domain. Examples include:
      • Writing data to an LDAP server via a native LDAP call.
      • Writing data to an RPC call.
      • Writing data to any non-IPA external API.
      • Writing data to a user interface.
      • Writing data to a file.
  • Converge on ldap2 as the IPA interface to LDAP (a stated project goal)

Existing problems in conflict with the goals

Multiple inconsistent LDAP interfaces

We currently have 5 different incompatible ways to interact with ldap instead of just one consistent API.

  • ldap2
  • IPAdmin
  • Entity objects
  • Entry objects
  • native python-ldap

Trying to get consistent DN usage across these 5 different mechanisms was daunting.

Approach taken

Part of the refactoring was trying to coalesce all of these into something common. It would have been too disruptive to completely remove the 5 different strategies and replace them with a common API. Instead the approach I took was to retain the 5 different API's but change their underlying implementation to call into the exact same code that ldap2 is using, IPASimpleLDAPObject. In the refactored code, every place an LDAP connection is created it is done with the common IPASimpleLDAPObject.

I also modified what is returned from the ldap support code so that it has the same access methodologies as the Entity and Entry classes (a named tuple). Thus everything can be accessed the same way it always had been (e.g. using either tuple indexing or object attribute access) but it originates from exactly one common location in the code.
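A hedged sketch of the idea (the real class and field names in the patch may differ):

from collections import namedtuple
from ipapython.dn import DN

# An LDAP result entry exposed as a named tuple: a dn plus its attributes.
LDAPEntry = namedtuple('LDAPEntry', ['dn', 'data'])

entry = LDAPEntry(DN(('cn', 'masters'), ('cn', 'ipa'), ('cn', 'etc')),
                  {'objectclass': [u'nsContainer']})

# Both access styles keep working:
assert entry[0] is entry.dn             # tuple indexing
assert entry.data['objectclass']        # attribute access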

Now that all the ldap API's share common access methodologies and underlying data structures we're well on the way to converging on a single ldap API down the road. We've reduced duplicate code and increased code sharing. This seems like a win/win because we haven't ripped out the other API's; we just made them share common code. Because behaviors and data are shared and access methods are interchangeable it's now easy to convert one small area of code at a time, incrementally converging on just one API as time permits.

DN object implementation experience

DN objects were coded up well before they were widely utilized in the IPA code base. How did they hold up in the myriad ways IPA would utilize them? There is a big difference between using them in a few isolated places versus using them extensively.

Overall I was pretty happy: I didn't discover any bugs in the DN implementation [1]. However, what I did discover was that they needed some extra features for specific situations. The following enhancements were added:

  • An insert() method to permit an RDN or DN to be inserted at an arbitrary location in an existing DN.
  • The find() and rfind() methods (modeled after their string cousins) to locate an RDN or DN in an existing DN.
  • The replace() method to allow replacing a sequence of RDN's with another RDN or DN within a DN.
  • Locking to prevent modification. [1]
  • A __repr__() method to produce a value that can be read by the interpreter, and so that when objects are printed you can distinguish whether an item is a string, a string representation of a DN, or a DN object.
  • Permitting a string to be compared for equality with a DN object. Formerly only two DN's could be compared, which makes sense because two objects need to be of the same type to be equal. But it proved worthwhile in practice to promote a string to a DN when it is compared to a DN and perform the equality test between the two DN's. This is a form of automatic type promotion. Python already does automatic type promotion in a limited number of circumstances (but nothing on the order of Perl). Originally I was against automatic type promotion on the belief that if you were comparing a string to a DN it was a failure of program logic, but there were some circumstances where it was a necessary evil. The equality test is the only place where an automatic type promotion occurs. (A toy sketch of the idea follows this list.)
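Here is the toy sketch promised above. It is not the real DN implementation (which does full RFC 4514 parsing and escaping); it only shows where the promotion happens:

class MiniDN(object):
    def __init__(self, value):
        # Normalize to a tuple of (attr, value) pairs; accept another MiniDN,
        # a naive 'attr=value,attr=value' string, or the pairs directly.
        if isinstance(value, MiniDN):
            self.rdns = value.rdns
        elif isinstance(value, basestring):
            self.rdns = tuple(tuple(part.split('=', 1)) for part in value.split(','))
        else:
            self.rdns = tuple(value)

    def __eq__(self, other):
        if isinstance(other, basestring):
            other = MiniDN(other)       # automatic promotion, only for equality
        if not isinstance(other, MiniDN):
            return NotImplemented
        return self.rdns == other.rdns

    def __ne__(self, other):
        return not self.__eq__(other)

assert MiniDN('cn=masters,cn=ipa') == 'cn=masters,cn=ipa'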

Overall, I found DN's easy to work with and Pythonic, and in many cases they greatly simplified code so that the intended logic was obvious and concise.

[1] See the discussion concerning DN Mutability

Why and how LDAP encode/decode was refactored

1. As noted in Multiple inconsistent LDAP interfaces it's important that all LDAP operations funnel through the same common code otherwise it's too difficult and error prone to provide the exact same type conversions universally throughout the IPA programming domain.

2. We also noted in Derived Goals that inside the IPA programming domain we would always utilize the same object types.

Issues 1 and 2 above presented a structural problem that forced some LDAP code refactoring.

ldap2 encode/decode decorators

The encoder.py module provided the encode/decode decorators that adorned some methods in ldap2. Their purpose was to convert from IPA types to native python-ldap types when calling python-ldap and to convert python-ldap types back to IPA types.

Unfortunately the type conversion was occurring at the wrong logical location. Instead of occurring at the boundary between IPA and python-ldap it was occurring between IPA API's. When code executed in decorated methods in ldap2 it operated on python-ldap types, not IPA types, because the decorators converted the types on the way into ldap2 methods and converted the results when an ldap2 method returned its value(s).

But ldap2 is executing IPA logic! We do things such as the synthetic generation of the member, indirect_member, and memberof attributes and values. We also do other IPA specific data munging on LDAP data inside ldap2. Much of this processing benefits from using IPA types such as DN objects. The type conversion really needs to occur at the python-ldap API boundary.

To decorate or not to decorate, that is the question

Should the decorators just be moved over the python-ldap methods? Unfortunately that's impossible because you have to decorate at the point of method declaration, in other words inside python-ldap.

You can get around that problem by defining another class that implements the python-ldap API and decorating those methods. But if you have to re-implement part of a class's API there might be a better method (see below).

What about subclassing the python-ldap object?

Short story: great idea, doesn't work. Why? It has to do with how python-ldap implemented things internally. There are many places in SimpleLDAPObject which call self.foo() expecting to call SimpleLDAPObject.foo(), but if you subclass SimpleLDAPObject and override the foo method, then when something in the SimpleLDAPObject implementation calls self.foo() it gets the derived class's foo method, not SimpleLDAPObject.foo! It should have been calling SimpleLDAPObject.foo, not self.foo. If the derived class did type conversion, then when something inside SimpleLDAPObject called self.foo it would get types it knows nothing about, leading to pandemonium and all sorts of really bad things.
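The problem is generic to Python attribute dispatch, so a tiny self-contained demonstration (not python-ldap code) shows it:

class Base(object):
    def search(self, filterstr):
        return 'base search: %s' % filterstr

    def search_s(self, filterstr):
        # Base's own implementation calls self.search(), expecting Base.search() ...
        return self.search(filterstr)

class Derived(Base):
    def search(self, filterstr):
        # ... but the override is what actually runs, so any type conversion
        # done here leaks back into Base's internal code paths.
        return 'derived search: %s' % filterstr.upper()

print Derived().search_s('cn=foo')   # prints 'derived search: CN=FOO'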

The implemented solution for the python-ldap API boundary

We define a new IPA specific class called IPASimpleLDAPObject. It does not subclass anything (aside from the root object class, which makes it a new style Python class).

IPASimpleLDAPObject implements only those methods from SimpleLDAPObject that are used in IPA. Each method in IPASimpleLDAPObject explicitly converts its parameters to python-ldap types, calls the corresponding SimpleLDAPObject method, and then converts the result back to an IPA type before returning.
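A hedged sketch of the wrapping pattern (method and helper names are illustrative; the real IPASimpleLDAPObject in the patch is more complete):

import ldap
from ipapython.dn import DN

class IPASimpleLDAPObject(object):
    def __init__(self, uri):
        # The wrapped python-ldap connection; we delegate to it, we do not subclass it.
        self.conn = ldap.initialize(uri)

    def _decode_attrs(self, attrs):
        # Stand-in for the syntax driven decoding described elsewhere in this document.
        return dict((name, [value.decode('utf-8') for value in values])
                    for name, values in attrs.iteritems())

    def search_s(self, base, scope, filterstr='(objectClass=*)', attrlist=None):
        # IPA types in (a DN object for base) -> python-ldap types (str) ...
        result = self.conn.search_s(str(base), scope, filterstr, attrlist)
        # ... python-ldap types out -> IPA types (DN objects, unicode values).
        return [(DN(dn), self._decode_attrs(attrs)) for dn, attrs in result]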

Another nice feature of this approach is we can add class level logging and log the actual raw python-ldap operations.

IPASimpleLDAPObject is now used for all LDAP connections (typically called 'conn' in the code). This is what provides the uniformity despite still having multiple LDAP mechanisms.

But what about decorating the IPASimpleLDAPObject methods?

I decided against that approach for the following reasons:

  • encoder.py was complicated, it had a lot of functionality and customization that was never used, which made it difficult to understand what actually happened at run time.
  • Decorators are a somewhat obtuse Python construct. Many programmers don't understand them. But what's worse is they tend to hide code that would otherwise be visible to a programmer looking at the sources. In the case of the encode/decode decorators there were a lot of important things happening inside the decorators that were mostly hidden from view unless you had a really good grasp of the entire mechanism. There is a virtue to making program logic clear and explicit; the decorators did not achieve that. I equate Python decorators with C preprocessor hell when CPP macros are abused. It can be extraordinarily difficult as a programmer to see what the C compiler is seeing, or in the case of Python decorators what the Python interpreter is seeing.
  • Decorators are inefficient, they add a lot of extra processing to each function call.
  • The encode/decode decorators were hard to use correctly. This is evidenced by mistakes in the current code. They required a very detailed understanding of when and how to use them. Coding mistakes are hard to detect. Combined with the fact that they're somewhat invisible, this exacerbates the problem of using them correctly.
  • The decorators were difficult to construct to work universally across all method signatures. Not all methods are the same. So you either end up with a couple of complicated decorators which are difficult to understand (compounded by the fact that decorators are inherently obtuse) or you end up with a large collection of decorators geared to individual method signatures. But if you're going to end up with almost as many decorators as there are methods, why not just put the logic in the method where it's visible, changeable, and more efficient?

See Actual problems with encoder.py and the decorators for a discussion of other issues. In particular note how correctly using a decorator is not obvious and some problems were not caught by pylint or other programmers.

What replaced the encoder.py and encode/decode decorators

Conversion was moved to IPASimpleLDAPObject. Each SimpleLDAPObject method used in IPA is wrapped by IPASimpleLDAPObject. The wrapping method explicitly converts the necessary parameters before calling the native ldap method. Likewise the result of the native ldap method is converted before the wrapper returns.

  • The conversion routines were based on the code in encoder.py but greatly simplified. Their logic is much easier to understand and they are more efficient.
  • The conversions are explicit and clear for every method; any programmer looking at the code will immediately see and understand them. There are no unusual special cases. It's easy to apply per method special case handling. The conversion is done with common functions to avoid code duplication.
  • The conversion back into IPA types from an LDAP result is more flexible. It is based on the syntax of the attribute and uses a table driven methodology. For example, if the syntax of an attribute says it's a dn then it's converted to a DN object. Because it's table driven it's easy to maintain and extend; all you have to do is associate an LDAP syntax with a Python type call (e.g. a constructor).
  • Since every IPA LDAP call from any of the 4 mechanisms described in Multiple inconsistent LDAP interfaces now funnels through IPASimpleLDAPObject, everything in IPA universally gets the same treatment from exactly one code location. The only exceptions are the handful of places which still call python-ldap directly.

Actual problems with encoder.py and the decorators

This is incorrect:

str(var).encode(self.encoder_settings.encode_to)

Note: encode_to == 'utf-8'

The str() function will do one of several things depending on the value passed to it. If the value is not a string it will call the object's __str__() method to get a string representation. However, if the value is a unicode object then str() will perform an encode operation on the unicode object using the default encoding. In this instance encode() is then called on an already encoded value (the result of str()), which is pointless. But worse is the fact that the default encoding is often incorrectly set to 'ascii' instead of 'utf-8', thus the str() function might raise an encode exception when the original value was unicode!

One needs to convert the value to a string (unicode) and then call encode on the unicode string.

The correct way to do this is:

import codecs
utf8_codec = codecs.lookup('utf-8')
value = utf8_codec.encode(unicode(value))[0]

Note: calling unicode on a unicode object simply returns the exact same object (with its ref count incremented). This means calling unicode on a unicode object is effectively a no-op, thus it's not inefficient.

The decoder used a table based on LDAP syntax to decide if an attribute was binary or not. Binary attributes were converted to Python str objects and non-binary attributes were converted to unicode objects (IPA convention). The conversion based on LDAP syntax was a good idea but was a bit anemic. The values in the table were simply used as a boolean flag indicating whether the attribute was to be decoded into unicode. It could be much richer, with the table returning a Python callable used to perform the conversion. That would be much more flexible and is what was implemented in the new code. The original was a good idea; it was just further refined.

The encode_args decorator will not work correctly if the decorated function has keyword args and both the positional index of the arg and the keyword name are not passed to the decorator. The reason is that what the decorator sees is how the function is called. If the arg is passed positionally (without the name=) it appears in the arg list; if the same arg is passed as a keyword arg (with the name=) then it appears in the keyword dict. The arg never appears in both the positional list and the keyword dict; it's one or the other depending on how the arg is passed in the function call.

Therefore keyword args in the function declaration must appear as both a positional integer and a name string in the encode_args decorator invocation. Otherwise, depending on how the decorated function is called, it may miss encoding that argument because encode_args did not look for its presence in both locations.

The following exhibits these errors because the keyword names are absent:

@encode_args(1, 2, 3)
@decode_retval()
def find_entries(self, filter=None, attrs_list=None, base_dn='',
        scope=_ldap.SCOPE_SUBTREE, time_limit=None, size_limit=None,
        normalize=True, search_refs=False):

@encode_args(1, 2, 3)
def modify_password(self, dn, new_pass, old_pass=''):
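For contrast, following the rule above, a correct invocation for find_entries would presumably have had to list the keyword names as well as the positions, something like (illustrative only, since encoder.py has been removed):

@encode_args(1, 2, 3, 'filter', 'attrs_list', 'base_dn')
@decode_retval()
def find_entries(self, filter=None, attrs_list=None, base_dn='',
        scope=_ldap.SCOPE_SUBTREE, time_limit=None, size_limit=None,
        normalize=True, search_refs=False):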

This encodes the attr parameter, why?:

@encode_args(1, 2)
def make_filter_from_attr(self, attr, value, rules='|', exact=True,

DN Mutability

There is one issue with DN objects. DN objects as originally implemented are mutable, i.e. you can modify them. Python makes a clear distinction between mutable and immutable objects. Examples of immutable Python objects are strings, integers and floats. Examples of mutable Python objects are lists, dicts, and sets.

Mutability has a direct impact on how objects are used as keys in Python dictionaries and sets. When a Python object is used as a key in a dict or as a member of a set, Python will compute a hash value for that object and use the object's hash value to access the item in the dict or set. For immutable objects the hash value is computed from the value of the object. Thus two strings that are equal will hash to the same hash value even though they are different objects. For example:

>>> A = 'some string'
>>> B = 'some string'
>>> hash(A)
-1704511165
>>> hash(B)
-1704511165
>>> hash(A) == hash(B)
True
>>> A == B
True
>>> id(A)
143567432
>>> id(B)
143567352
>>> id(A) == id(B)
False
>>> A is B
False
>>> d = {A: 1}
>>> A in d
True
>>> B in d
True
>>> d[B] = 2
>>> d
{'some string': 2}
>>> A in d
True

A and B are two different Python objects, however they are equal. Both A and B produce the same hash value. You can tell they are different objects because A and B have different id's. If you ask Python if A and B are equal, Python says yes. If you ask Python if A and B are the same object, Python says no [2]. For those familiar with Lisp this is the analog of equal vs. eq. If you use A as a key in a dict then B hashes to the same value, thus object A and object B are considered to refer to the same entry in the dict (or the same member of a set) even though they are different objects. This is how you can use a dict (or set) to give you unique, non-duplicate collections: equal immutable objects hash to the same value. Note how either object A or object B can be used to change the value in the dict, and the dict has just one entry no matter whether A or B is used as a key.

But consider the dilemma facing Python when a mutable object is used as a key in a dict or as a member of a set. Since it's mutable, its hash value may be different at different moments in time, which leads to unpredictable results. The solution is simple: instead of using the object's value to compute the hash, its identity is used instead. Every Python object has a unique identity (most implementations use the pointer to the object's memory block as its identity).

This means two mutable objects that are otherwise equal do not share the same hash value, thus they do not share the same key in a dict and are considered different members of a Python set. This can produce counter-intuitive results. In general only Python native immutable types can be used in Python dicts and sets when uniqueness is desired. User defined types (i.e. classes defined in a program) are by default considered mutable and thus do not provide uniqueness when used as keys in a dict or as members of a set.

Python determines whether an object is hashable by whether the object has a __hash__ method. The __hash__ method is invoked to obtain the object's hash value. For immutable objects the __hash__ method uses the object's value. For mutable objects the __hash__ method uses the object's identity. It is entirely possible to provide immutable semantics to user defined classes if the class defines its own __hash__ method.
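A short demonstration under Python 2 semantics: a user defined class that defines __eq__ but not __hash__ keeps the default identity based hash, so two equal instances still count as two distinct set members (or dict keys):

class Pair(object):
    def __init__(self, attr, value):
        self.attr, self.value = attr, value
    def __eq__(self, other):
        return (self.attr, self.value) == (other.attr, other.value)
    def __ne__(self, other):
        return not self.__eq__(other)

a = Pair('cn', 'masters')
b = Pair('cn', 'masters')
print a == b              # True:  equal by value
print len(set([a, b]))    # 2:     hashed by identity, so no deduplication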

It's important to understand the difference in Python between hashing, equality, and identity.

Equality and inequality (i.e. the == operator implemented by a class via the __eq__ method and != operator implemented by the class __ne__ method respectively) compare values. The 'is' keyword is used to compare object identity. Thus when you see:

if foo is None:

in the code, what is occurring is a test to see if the object the symbol foo refers to is exactly the same object as the global object None. Why does this work? Because None is a singleton; there is exactly one None object. If you assign None to a symbol you're assigning a reference to exactly the same None object used everywhere.

The 'in' operator, e.g.:

if foo in (bar, blatz, burfl):

uses the equality operator (e.g. ==) iteratively to test each member for equality (not identity). The above is equivalent to:

for x in (bar, blatz, burfl):
    if foo.__eq__(x):
        return True
return False

The distinction between membership testing using the 'in' operator versus using a hash value for dict keys or sets is critical to understand, because the former uses the __eq__ method of an object and the latter uses the __hash__ method of an object. In practical terms it's whether an object is mutable that determines if the 'in' operator and key indexing or set membership operate the same or differently.

Why DN object mutability is an issue

After I converted everything to use DN objects I discovered some logic was failing. This is because in a number of places we use DN's as keys in a dict or members of a set to eliminate duplicates assuring uniqueness. But two DN objects that were otherwise equal and would compare equal were showing up as duplicates. This was due to the fact they were hashing using their object identity as opposed to their object value.

When dn's were strings the 'in' operator and object hashing behaved the same. But when dn's became mutable DN objects, equal DN objects hashed to different values yet still compared equal (preserving the 'in' operator semantics).

How much of an issue is this?

Since to the best of my knowledge we never modify a DN while it's being used as a key in a dict or as a member of a set, it doesn't affect us as long as DN objects implement a __hash__ method based on their value. Providing a custom __hash__ method was the expedient solution I implemented.
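Sketching the expedient fix on the toy Pair class from the example above (the real DN __hash__ is of course based on the DN's normalized value, not on a bare tuple):

class HashablePair(Pair):
    def __hash__(self):
        # Hash on the value, not the identity, so equal objects collapse
        # to one dict key / set member.
        return hash((self.attr, self.value))

a = HashablePair('cn', 'masters')
b = HashablePair('cn', 'masters')
print len(set([a, b]))    # 1: equal objects now share a hash value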

However, it's not technically correct according to the semantics of the Python language. It has the potential to produce difficult to diagnose bugs, which also requires the person trying to track down the problem to have the kind of detailed understanding of Python semantics discussed above.

Another issue I ran into as I converted everything to DN objects was the fact that some DN's are logically constants (for example the containers defined in constants.py). Since these constants are defined in one place but passed by reference (in Python mutable objects are passed by reference) through many different function calls, it would be easy to accidentally modify in a local context what is logically a global constant, leading to errors.

To address the issue of accidental modification of a constant I introduced locking (much like we lock Param objects preventing Param's from modification). DN's which are supposed to be immutable can be locked. But there is a better solution (see below).

Proposed solution

The correct solution is to provide both immutable and mutable variants of DN objects (for those interested, search for discussions of Python sets versus frozensets). By default a DN object would be immutable. We would then subclass DN to produce a mutable version, probably called something like EditableDN. You can convert between the two easily by passing one into the constructor of the other.

Since editing DN's is the exception rather than the rule it makes sense for DN's to default to being the immutable variant.

Having both immutable and mutable variants of DN's solves both the hashing issue as well as providing protection against accidental modification. Immutable DN's are a cleaner solution than locking a mutable DN class.
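By analogy with Python's frozenset/set split, the intended usage would look roughly like this (a hypothetical sketch of the proposal, not code that exists in the patch; EditableDN and its exact methods are the proposed, not yet implemented, names):

dn = DN(('cn', 'masters'), ('cn', 'ipa'), ('cn', 'etc'))   # immutable by default

edn = EditableDN(dn)                 # explicit opt-in to mutation
edn.insert(0, ('cn', 'replicas'))    # hypothetical edit on the mutable variant

dn = DN(edn)                         # convert back to the immutable form when done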

Do we have a problem which would prevent accepting the patch?

A good question to ask is whether the issues discussed above should prevent integrating the DN work until DN's have been broken out into mutable and immutable variants.

I don't think so.

We're talking obscure corner cases here. I've been through every line of code that uses a DN and I've seen no cases of us modifying a DN while it's a member of a set or used as a key in a dict. We do need to fix it to bullet-proof things and obey Python semantics but I believe we can do this in a subsequent patch (I'm already working on it) without much short term risk until the patch is ready.

[2] Caveat: sometimes two short strings will be the same Python object. This is because of an internal Python optimization. Python will cache some strings and integers and reuse them; this prevents excessive memory use allocating copies and reduces the cycles spent generating the objects. Thus in some special cases two strings or integers that are equal may in fact be the same object. This in no way invalidates the discussion concerning mutability, hashing, and equality. Caching and reusing equal immutable objects works precisely because they are immutable.