Thursday, March 29, 2007

Closures for Organizing Your Code

Much of the discussion of Closures in Java has been about they way they affect public APIs. But there is another aspect that is just as important: the way closures affect private APIs between parts of your program. Closures often enable a tremendous simplification of a program design compared to what would be required in their absence. The following describes my implementation of a graph algorithm for computing the Jeffersonians of a graph using algorithm K from Knuth's The Art of Computer Programming, volume 4B, section 7.5.7.

As you may be aware the set of Jeffersonians of a graph is best computed using a complex recursive algorithm. Although recursive algorithms can be translated into algorithms without using recursion (Java without recursion remains Turing-complete), the recursive version of the algorithm is much shorter and easier to understand. We're lucky to be living in an age in which virtually all programming languages support recursion. Though details of the implementation are not important, my implementation went something like this:

public Collection<Jeffersonian> findAllJeffersonians(Graph g) {
    Collection<Jeffersonian> result = new ArrayList<Jeffersonian>();
    findAllJeffersoniansInternal(g, result);
    return result;
}

The idea is that the recursive part of the algorithm can pass around the collection into which the result will be placed, and every Jeffersonian that is found will be placed into the collection:

private void findAllJeffersoniansInternal(
Graph g, Collection<Jeffersonian> result) {
// complex recursive algorithm here
Jeffersonian found = ...;
result.add(found);
// more complex recursion here
}

One pot of coffee and an all-nighter later I had this working like a charm. The next day my tech lead asked me to add an API element that determines whether or not a graph has a Jeffersonian or not. That was easy:

public boolean hasJeffersonian(Graph g) {
    return findAllJeffersonians(g).size() != 0;
}

This didn't pass code review. The problem is that this new method is to be used in the inner loop of Google's über-secret application that will take over the world. Never mind that. The problem is performance. Determining whether or not a graph has a Jeffersonian can be done in linear time, but enumerating all of them requires quadratic time (or worse). But my implementation does it the hard way. By then it was Friday afternoon and I really wanted to head home for a glass of wine, so I did what any self-loathing software engineer would do: I cut and pasted the complex recursive code in findAllJeffersoniansInternal into hasJeffersonianInternal and added a boolean return value (true when a Jeffersonian was found). Then I added logic to short-circuit the rest of the algorithm once a Jeffersonian had been found at any step. The code was messy but workable, and I had it passing tests in less than an hour. The code duplication left me somewhat uncomfortable, but the two methods were different enough that merging them would have been hard. I considered adding a second flag so I could have one body of code to do both versions, but I decided to leave that refactoring until Monday.

Something very strange happened over the weekend, though. On Monday my pointy-haired boss told me there was both good news and bad news, and asked which I wanted first. Knowing how these jokes work (the second one always trumps the first) I asked for the bad news first. The bad news was that my machine had crashed, losing all of my work from Friday. Including my implementation of hasJeffersonian. The good news was that my machine had been replaced with a brand new one, a fast new 40-core workstation, and it came with JDK7 preinstalled. I had been using JDK6 before, so I was eager to try the new Java language features.

Taking a fresh look at the problem of writing hasJeffersonian, I decided to refactor the original program to pass a closure instead of a collection:

public Collection<Jeffersonian> findAllJeffersonians(Graph g) {
    Collection<Jeffersonian> result = new ArrayList<Jeffersonian>();
    findAllJeffersoniansInternal(g, { Jeffersonian j => result.add(j); });
    return result;
}

private void findAllJeffersoniansInternal(
Graph g, {Jeffersonian => void} foundJeffersonian) {
// complex recursive algorithm here
Jeffersonian found = ...;
foundJeffersonian.invoke(found);
// more complex recursion here
}

Then I realized I could use the nicer syntax allowed for passing a closure to a method:

public Collection<Jeffersonian> findAllJeffersonians(Graph g) {
    Collection<Jeffersonian> result = new ArrayList<Jeffersonian>();
    findAllJeffersoniansInternal(Jeffersonian j : g) {
        result.add(j);
    }
    return result;
}

Solving the second problem was then trivial:

public boolean hasJeffersonian(Graph g) {
    findAllJeffersoniansInternal(Jeffersonian j : g) {
return true;
} return false; }

That was the entire implementation. I had a strange sense of elation, but I couldn't quite tell why. I could no longer remember why the problem was so messy on Friday. This refactoring seemed trivial, and this code was so clear. What made it so hard before?

Then I woke up. It's 2007, not 2009. JDK7 is barely a gleam in the eye of Sun. My machine is only dual-core. Consensus on closures is elusive. As far as I can tell, there isn't any such thing as a graph's Jeffersonian, or a Google plan to take over the world. It's Monday morning, and I have to figure out how to merge two almost-copies of a big recursive algorithm.

But on the bright side, my boss is a really nice guy.

27 comments:

Santhosh Kumar T said...

I also faced same issue once. and I solved it as follows:


public Collection findAllJeffersonians(Graph g) {
return findJeffersonians(g, null);
}


public Collection findAllJeffersonians(Graph g, Collection result) {
if(result==null)
result = new ArrayList();
findAllJeffersoniansInternal(g, result);
return result;
}

public boolean hasJeffersonian(Graph g) {
try{
findAllJeffersonians(g, Collections.EMPTY_LIST);
return false;
}catch(UnsupportedOperationException uoe){
return true;
}
}


I removed generics usage as, the Blogger comment treating them
as HTML and rejecting my comment ;(

Anonymous said...

santhosh - Nice trick. I'll try to remember that one. That said, I think the closures version looks cleaner and would apply more generally.

gafter - I think that there's a resolution to be found between FCM and BGGA. I think the three use cases are asynchronous callbacks (best handled currently with FCM's control semantics), synchronous expressions (best handled currently by BGGA, but I think a variation on FCM for expression methods would do the trick), and control blocks (best handled currently by BGGA's "abbreviated syntax" and semantics - which I hope the FCM folks can just adopt nearly wholesale). For all three cases (and anonymous inner classes), I think unrestricted access to local vars (with the option for "volatile" modifiers) would be best. There really are no new threading issues that I've seen. And people will cope with secretly instantiated container objects. I think it's possible to get a resolution on BGGA vs. FCM.

Anonymous said...

The trick using an exception is not that much different to how it is currently thought to be solved in BGGA (maybe not in detail, though).
I think, I found a solution not employing exceptions (no, Neal, not the M-word ;) ). Needs some code rewrite on compilation regarding closures and the called method. Still have to think about recursive use and details.
I like the Jeffersonian example. Actually had a similar problem yesterday regarding validation: first, a method validating a hierarchical structure returning a list of all failures, and second, a method to return fail-fast on the same structure.

Anonymous said...

I understand that the obvious implementation for transparent return is with exceptions, but that's not how it's described at the programming level.

I think that's an important distinction.

Also, it's worth noting (and I think we're all on the same page here) that you can still solve more problems generally with closures vs. having a convenient side effect of the API at hand (such as the constrained empty list).

Anonymous said...

Neal, why do not declare findAllJeffersoniansInternal()
as a loop method using for
to allow to use break/continue at
call site.

Jochen "blackdrag" Theodorou said...

Hi Neal, nice post.

But there is one part in your post that makes me thinking. That is why haven't you named your method "each" instead of "findAll". Usually the closure in findAll is used to test if an entry fits a criteria. Your closure then defines the criteria. So this variant would leave the construction of the result to the method and would give the closure instead the power to decide if a criteria matches or not. A version that might look a bit strange is to have two closures, one for the result construction and one for the criteria. Ok, is more a crazy idea.. but given that you do searches on the graph and use different criteria for this, then using a closure for the criteria is of more benefit. The solution with two closures enables you to just have the graph traversal in your method. For example you could this way use a HashSet or a LinkedList or something completely different, just depending on your need. I would show an example but I don't know how to curry a method ;)

Neal Gafter said...

Remi and Jochen: excellent ideas.

Ricky Clarkson said...

Neal, you must have been reading too much Joel Spolsky - your posts are getting more entertaining.

I was actually disappointed to discover that the all-nighter was just a dream.

I was thinking about the same kind of thing recently, after IDEA kindly alerted me to the fact that two of my automated test methods have the same 7 initial lines of code. I could extract a method, but IDEA isn't capable of doing that when the rest of the method uses more than one reference from the code block.

The software is a network simulator, those few lines of code are some setup. There is no great link between the two automated tests. If I could put code outside classes or at least outside methods, I could do something like:

{
 Context context=ContextUtility.debugContext();
 Computer computer=whatever;
 Card card=whatever;
 public boolean test1()
 {
  stuff that uses context, computer and card
 }

 public boolean test2()
 {
  more stuff that uses context, computer and card
 }
}

I don't think there's a comparable solution to Lisp's let (or let*) macro, in Java, with or without closures. To get rid of the duplication, I'll have to do some artificial refactoring, i.e., refactoring that is only done to get rid of duplication, and doesn't actually make anything simpler.

Neal Gafter said...

There is a comparable solution to Lisp's let - it's called the control invocation syntax, and it is included in BGGA.

Ricky Clarkson said...

Can you demonstrate how you'd use the control invocation syntax for my example, or a similar one?

Neal Gafter said...

void runTest({ Context, Computer, Card => void } block) {
Context context=ContextUtility.debugContext();
Computer computer=whatever;
Card card=whatever;
block.invoke(context, computer, card);
}

public boolean test1() {
runTest(Context context, Computer computer, Card card) {
// stuff that uses context, computer and card
}
}

public boolean test2() {
runTest(Context context, Computer computer, Card card) {
// more stuff that uses context, computer and card
}
}

Ricky Clarkson said...

Thanks for that. I guessed you'd write it in that way, but having seen it instead of imagined it, it is certainly more elegant than I supposed.

A very small point is that you actually need a name for runTest, which you don't for a let block in Lisp. I'm loathe to provide names for things that aren't actually abstractions.

If reasons for that aren't obvious, consider the case where test A and B use runTest, C and D use runTest2, etc. With scale, the 'artifical' name for the method becomes more and more of a problem.

Anonymous said...

Ricky:
If you have runTest and runTest2, either their signature would be different (so it's necessary to have two methods, which still may have the same name), or you pass some configuration parameters to keep a single runTest. I wonder how many tests your class would have to have to make piles of runTest methods applicable, as the idea is to abstract out code pattern that are common on test methods.

Anonymous said...

I noticed you dropped the ':' in the runTest() calls. Looks great. I hope that works its way into the spec(s). (And I do think syntax matters.)

Anonymous said...

I feel like leaving another comment here for more info to parallel my comments on the FCM side of things.

By the way, my concern with normal BGGA syntax/semantics is that people don't expect blocks to have values in Java. And simple expressions probably shouldn't contain control-flow commands (break/return/throw/...) anyway. That's why I think "expression closures" like in ECMAScript 4 (without the function keyword, though) would be a good fit for LINQ-like stuff. Keeps it simpler really.

But as I've mentioned, I love the "Control invocation syntax" blocks of BGGA, and I think this is the most important kind of closure to add.

For asynchronous closures, the need is mostly a matter of simpler syntax than anonymous inner classes (for single methods, too), and I think having non-local return/break is likely to cause confusion. That (in addition to closure blocks having values) is a big reason why I think BGGA closure semantics (other than control invocation blocks) are not best for this case.

Ideally (maybe), the control invocation blocks could be used for unusual cases like invokeAndWait(), redispatching return/breaks to the correct thread. That would rock, since invokeAndWait really isn't quite asynchronous. But normal asynchronous code is probably not best served by Tennent's.

Karsten Wagner said...

(I removed the generics in the code below because of the blogger-limitation)

This problem could be solved best by using generators instead of closures. Let's have

public Iterator giveJeffersonians(Graph g) {
// does the calculation but allows for recursion because it uses
// a generator (similar to Icon or Sather, but using the Iterator-interface)
}

Then we could use it like every Iterator:

public Collection collect(Iterator it) {
Collection result = new ArrayList();
for(Object o: it) result.add(o);
return result;
}

public Collection findAllJeffersonians(Graph g) {
return collect(giveJeffersonians(g));
}

public boolean hasJeffersonian(Graph g) {
return giveJeffersonians(g).hasNext();
}

Quite simple and flexible. The advantage is that it's easy to do parallel-iteration:

public Iterator giveMax(Iterator it1, Iterator it2) {
for(Object j1: it1; Object j2: it2) yield(max(j1, j2));
}

public Iterator giveJeffersonianMax(Graph g1, Graph g2) {
return giveMax(giveJeffersonians(g1), giveJeffersonians(g2));
}

How would you do this as effectively with closures (without having lazy evaluation)?


Closures in languages with mutation have a big problem. Consider this code:

for(int i = 0; i < elements; i++) {
Element el = elements[i];
String text = "element " + i;
el.addAction({ some_label.setText(text); }
}

Looks nice, but in fact it won't work. The action would always the 'text'-variable from the frame which contains the last value of 'text'. So 'some_label' would always be set to 'element 9' if there are 10 elements in the Array. To work around it, you have to write something like:

for(int i = 0; i < elements; i++) {
Element el = elements[i];
String text = "element " + i;
el.addAction({ text => { some_label.setText(text); }}.invoke(text));
}

This would bind the actual value to a new frame for every look iteration, creating the wanted behavior. But you have to spot it first.

This kind of possible error don't exists in pure function languages (where closures belongs) but is hard to find in a language like Java. That's the reason I really like the 'final' limitation of the already existing closures in Java because it makes this kind of mistake nearly impossible.

Neal Gafter said...

@Karsten: How do you implement a 'generator' when the generating method is recursive, as it is in this case? I tried to emphasize the recursive nature early in the essay; perhaps you missed it.

Karsten Wagner said...

In principle generators work like co-routines but in a more limited way. They use a 'yield' statement which returns a value and suspends the generator. If there is not pending 'yield' left, the generator is finished. A 'return' is like 'yield' but it also terminates the generator. Below I use a syntax which simply uses the Iterator-Interface to encapsulate the generator.

Lets now write two generators which do recursive iteration over a tree:

class Tree {
Tree prev, next;
Object value;
}

Iterator treeWalkDepthFirst(Tree t) {
if (t.prev != null) {
for(Object value: treeWalkDepthFirst(t.prev)) yield value;
}
if (t.next != null) {
for(Object value: treeWalkDepthFirst(t.next)) yield value;
}
return t.value;
}

Iterator treeWalk(Tree t) {
yield t.value;
if (t.prev != null) {
for(Object value: treeWalk(t.prev)) yield value;
}
if (t.next != null) {
for(Object value: treeWalk(t.next)) yield value;
}
}

The advantage of this approach is that it fits very well into the Iterator-concept of Java and makes it rather easy to write complex Iterators in a straight forward way. Take a look at the Sather language for a good implementation of this concept.

Neal Gafter said...

@Karsten: Now I know what you meant by generators. I thought you were referring to the programming pattern. I agree that a different language extension could be added to address this use case.

Anonymous said...

@Karsten: "Looks nice, but in fact it won't work. The action would always the 'text'-variable from the frame which contains the last value of 'text'."
I'm not sure you are correct. Unless the compiler "optimizes" here to reuse the same object each time when creating the String variable (it's defined in the loop!), each closure will refer to a different object. As with optimizing in normal cases, the compiler could prevent optimization when a variable is used within a closure.

With coroutines I agree that they provide another mechanism for enabling state-aware operations. Your examples will have to use the iterators directly, though, as for-each does only allow for Iterables and arrays.
What coroutines might also provide is some mechanism for implementing control invocation syntax.

Karsten Wagner said...

No, its mandatory that a closure acts the way I've described it. A closure closes over the variables, not over the value of a variable. In languages without mutation (like in Haskell) there is no difference between those. But in imperative languages where mutation isn't an exception but very frequent, it's quite visible. In current Java a closure captures simply the values because of the final restriction. But if we allow free mutation, consider this:

for(int i = 0; i < elements; i++) {
Element el = elements[i];
String text = "element " + i;
el.addAction({ some_label.setText(text); });
text += "!!";
}

What should happen here? If we allow 'closure by copy' (like in current Java) the 'text += "!!"' wouldn't do anything - which is wrong, because the code inside the closure is called after the statement which has changed the value of 'text'. And because of this, it also can't create a new variable for 'text' at each iteration of the loop or we would end with a frame with multiple variables for 'text'. This isn't how closures work.

In Ocaml variables are always immutable, mutation is only possible to fields of 'objects'. This is identically to 'current Java' where you can't mutate variables from closures - something which is very sensible. But many people call for a removal this 'final' restriction - and I suspect that those people simply haven't uses closures in an imperative language enough to know about the risks. Of course it's possible to remember it, but my experience from JavaScript (which I'm doing a lot in the moment) shows that it's really a frequent source of errors. It's simply to easy to overlook-

So if the final restriction got removed in Java, I would really like to see that at least there is some keyword to mark 'risky' variables as mutable. This would ring at least some 'alarm bells' making this kind of mistake less probable. In fact it would be a good idea to make all locals 'final' per default and require the use of some keyword (for example 'transient') to declare variables mutable when necessary. If we learned one lesson from functional programming than that mutation is risky and should avoided as far as possible. FP is as much about limitations (referential transparency) as it is about new features (like closures).

Generators are like coroutines but they are more limited. This limitation allows for better compiler optimization and for less possible sources of errors (the typical threading problems) but gives certain advantages which are very useful in an imperative language. Closures are less powerful, create more sources for mistakes and fit less well into the concept of a language like Java (and closure do exist in Java altogether, even if they are a bit less easy to use - which again rings some alarm bells that we're maybe leaving OOP-land by using them). But while closures are widely known, Generators are a less well known and thus less often considered as a useful addition to an imperative language.

Anonymous said...

Karsten, the variable text is created as local variable within the loop having the scope of the remaining statements of the loop. Of course the concatenation would work but when continuing with the next loop iteration, a new local variable text is created not being identical to the one before nor being the same instance. That's how Java (semantically) works and it has nothing to do with closures.

Anonymous said...

Ah, forget about the instance stuff in my previous post, it's not relevant to what I tried to explain. It's about it being a local variable and its scope being limited to the life-time of the iteration step. At least, that's what I read in the JLS. If the compiler optimizes this to reuse the same spot on a stack, when being used within a closure, this optimization can surely be forbidden.
But Neal will know much better than I do.

n said...

I think this example is extremely important because it demonstrates something you can't actually do with the Java language as it stands today without a) Using a hack such as that suggested by Santhosh in the first comment or b) introducing some domain specific classes representing the evaluation of the algorithm.

The solution with a closure works wonderfully.

Anonymous said...

@Karsten:

I think the problem in JavaScript is caused by there not being a 'block-level' scope. JS only has only Object and function scopes. Therefore, the 'text' variable is the same variable for each iteration in the loop. I've been bitten by this problem in ActionScript.

In Java (as Stefan says), each iteration of the loop creates a new scope and therefore a new 'text' variable.

Anonymous said...

The problem with generators is "finally". Every generator is a new resource that needs closed. One of the nice use cases for closures is make it easier (not harder) to close resources.

Now, with closures _and_ continuations, you get generators for free (maybe not easily optimized, but they work the same way). But I don't feel like starting a continuations debate. We can leave that for Java 8.

Fireblaze said...

Your example is flawed, there is no need for closures just because you cant not figure out how to remove duplicated code.

It looks to me that the problem here is the algorithm implementation.
If you post real working code, the community can help you with your problem. And probably come up with a propper clean solution.