Separating Configuration and Code

Ben Kuhn wrote a post on the tradeoff between readability, hackability, and generality in software. The post describes how what starts off as a simple script often reaches a point where he has to either duplicate code, make it harder to understand, or make it harder to extend.

One thing worth noting about this problem is that it has many cases that are similar but not quite the same, and it’s not obvious where the logic for each case should live. Creating multiple separate functions means each individual function is readable, but it makes the whole system difficult to modify or extend. Embedding the behavior into classes creates code that’s harder to reason about and debug when problems arise. I’ve frequently run into the same situation. The solution that’s worked for me is to explicitly separate configuration and code.

 

Conceptually, data represents inputs given from an external source, configuration represents the bare minimum specification you need to get a certain type of output from that data, and code is a stateless set of procedures that takes configuration + data to return an output. In other words, configuration can be thought of as a domain specific language (DSL), and the code is the underlying stuff that makes the DSL work. In simpler cases, configuration can be a lookup table.
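As a minimal sketch of this split, here is the simplest case, where the configuration is just a lookup table. The names PLANS and bill, the plan names, and the rates are all hypothetical and made up for illustration. The dict is the configuration, the usage list is the data, and the function is stateless code that combines the two.

```python
# Configuration: the bare minimum needed to describe each kind of output.
# Hypothetical plans and rates, for illustration only.
PLANS = {
    "basic":   {"rate_per_unit": 2.0, "flat_fee": 0.0},
    "premium": {"rate_per_unit": 1.5, "flat_fee": 10.0},
}


def bill(plan, usage, plans=PLANS):
    """Stateless code: takes configuration + data and returns an output."""
    spec = plans[plan]
    return spec["flat_fee"] + spec["rate_per_unit"] * sum(usage)


# Data: inputs that would normally come from an external source.
usage = [3, 4, 5]

print(bill("basic", usage))    # 24.0
print(bill("premium", usage))  # 28.0
```

Adding a new plan here means adding one entry to PLANS; bill does not change.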

Example 1: compare Ben’s second solution:

 

to this:

Here we’ve explicitly separated the configuration for each type of server from the logic in the functions. This means we can look up all the server-specific details in one place without messing around with the logic, and vice versa. It also effectively compresses the several parameters needed for each type of server into one lookup, so we only have to pass one parameter.
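The separation described above can be sketched roughly as follows. This is not Ben’s code or the original listing; the SERVERS entries, the hosts, the branch names, the helper build_commands, and the command strings are all assumptions chosen for illustration. Only the ideas of a dict of server details, a deploy function, and a single server argument come from the text.

```python
# Hypothetical server configuration: every server-specific detail lives here.
SERVERS = {
    "staging":    {"host": "staging.example.com", "branch": "develop", "restart": False},
    "production": {"host": "www.example.com",     "branch": "master",  "restart": True},
}


def build_commands(server, servers=SERVERS):
    """Return the (illustrative) shell commands for a deploy, without running them."""
    cfg = servers[server]
    commands = [f"ssh {cfg['host']} 'cd /srv/app && git pull origin {cfg['branch']}'"]
    if cfg["restart"]:
        commands.append(f"ssh {cfg['host']} 'sudo service app restart'")
    return commands


def deploy(server, run=print):
    """Deploy to the named server.

    Only the server name is passed down; details are looked up where needed.
    `run` defaults to print so this sketch has no side effects.
    """
    for command in build_commands(server):
        run(command)


deploy("staging")
```

In this sketch, adding a third type of server is a one-line change to SERVERS.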

 

I would argue that the second version is more readable, easier to modify, and more general.

  • There’s only one API function.
  • Unlike refactoring using classes, we haven’t added any state to the program. In fact, the relevant parameter is called out in the function arguments, rather than being woven into multiple places in the function’s definition.
  • Separating server details into the dict makes them easy to read and modify. In particular, the most common type of change will be adding or changing a type of server, which can now be done without touching deploy.
  • It’s not too difficult to pass one argument, server, down several layers of function calls without making the code substantially harder to reason about, and we can look up the parameters associated with server whenever necessary.

Example 2: I recently built some software that parses data from a market research survey and turns it into a report (example). The architecture of the code looks like this:

 

[Figure: architecture diagram of the report-generation code]

All the functions in the code are stateless and idempotent, which makes them easy to debug, and editing a report involves pretty simple changes to the configuration. I actually added a section on Monday and it took just about 10 minutes: the entire section (along with 3 subsections) is generated from 11 lines of configuration, with no changes to the code.
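To give a feel for what configuration-driven report sections can look like, here is a small sketch. It is not the actual survey software: the REPORT entries, the question names, the SUMMARIES table, and the sample responses are all hypothetical. Each section of the report is one entry of configuration, and the functions are stateless, so the same inputs always produce the same text.

```python
from collections import Counter

# Hypothetical report configuration: one entry per section.
REPORT = [
    {"title": "Overall satisfaction", "question": "satisfaction", "summary": "mean"},
    {"title": "Preferred channel",    "question": "channel",      "summary": "counts"},
]

# Hypothetical survey responses (the data).
RESPONSES = [
    {"satisfaction": 4, "channel": "email"},
    {"satisfaction": 5, "channel": "phone"},
    {"satisfaction": 3, "channel": "email"},
]

# The small set of general procedures that the configuration can refer to.
SUMMARIES = {
    "mean": lambda values: sum(values) / len(values),
    "counts": lambda values: dict(Counter(values)),
}


def render_section(section, responses):
    """Summarize one question as described by one configuration entry."""
    values = [r[section["question"]] for r in responses if section["question"] in r]
    result = SUMMARIES[section["summary"]](values)
    return f"{section['title']}: {result}"


def render_report(config, responses):
    """Render every configured section, one per line."""
    return "\n".join(render_section(section, responses) for section in config)


print(render_report(REPORT, RESPONSES))
```

Adding a section in this sketch means appending one dict to REPORT, with no change to render_section or render_report.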

It’s also easy to write completely different types of reports. For example, we use the exact same Python code with a different configuration to run a report that takes reader reactions to an article and rates different possible headlines.

So next time you find yourself writing code that does a lot of similar but slightly different things, check if you can factor it out into an explicit configuration table or dict and a smaller number of more general functions.

Edit: I got a reply from Ben, where he pointed out a few weaknesses of this approach:

  • A config structure can actually make a problem more complicated if you start needing ad-hoc config-specific pieces of code.
  • Nontrivial config changes frequently require some fairly deep code restructuring.
  • More problematically, if you want to modify behavior without modifying code, you are forced to pre-specify a framework that you think will be general enough to capture any config option that you want, which is super hard and usually fails. And changing your config DSL to accommodate new behaviors is not always trivial.
  • If your configs have parts that are supposed to be coupled together, you’ve just punted the problem of abstracting common code up a level to abstracting common config snippets.
  • Passing an entire config dict through a bunch of functions has many of the same problems as passing an object (in fact, objects are just dicts plus virtual dispatch).
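The first and last of these points can be seen in a small sketch. Everything here is hypothetical (the SERVERS entries, the post_deploy key, the warm_cache helper, and the command strings). Once a single entry needs behavior that the table can’t describe, a callable ends up in the config, and the dict starts doing the job of an object with an overridden method.

```python
def warm_cache(cfg):
    """Ad-hoc, config-specific step that only one server needs."""
    return f"curl -s https://{cfg['host']}/warmup"


# Hypothetical configuration that has started to carry code.
SERVERS = {
    "staging":    {"host": "staging.example.com"},
    "production": {"host": "www.example.com", "post_deploy": warm_cache},
}


def deploy_commands(server, servers=SERVERS):
    """Return the (illustrative) commands for a deploy, including any hook."""
    cfg = servers[server]
    commands = [f"ssh {cfg['host']} ./deploy.sh"]
    hook = cfg.get("post_deploy")  # dispatch by dict lookup, much like a method call
    if hook is not None:
        commands.append(hook(cfg))
    return commands


print(deploy_commands("production"))
```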

Overall I agree with Ben’s comments and think a lot of it comes down to the type of problem you’re working on. If you’re working on problems where the hardest part is figuring out what the solution should look like, then a config-driven approach makes a lot of sense. On the other hand, if figuring out what you want is relatively easy but figuring out how to do it is hard, then separating config won’t get you very much and may actually make things more complicated.

Categories: Productivity, Programming