Tuesday, July 31, 2007

Internal Versus External Iterators

In the "Gang Of Four" Patterns book's discussion of the Iterator pattern, we read (page 260):

Who controls the iteration? A fundamental issue is deciding which party controls the iteration, the iterator or the client that uses the iterator. When the client controls the iteration, the iterator is called an external iterator (C++ and Java), and when the iterator controls it, the iterator is an internal iterator (Lisp and functional languages). Clients that use an external iterator must advance the traversal and request the next element explicitly from the iterator. In contrast, the client hands an internal iterator an operation to perform, and the iterator applies that operation to every element in the aggregate.

External iterators are more flexible than internal iterators. It's easy to compare two collections for equality with an external iterator, for example, but it's practically impossible with internal iterators. Internal iterators are especially weak in a language like C++ that does not provide anonymous functions, closures, or continuations like Smalltalk and CLOS. But on the other hand, internal iterators are easier to use, because they define the iteration logic for you.

To make this very concrete, one might define a collection-like interface using external iterators like this:

public interface ExternalIterable<T> {
    ExternalIterator<T> iterator();
}
public interface ExternalIterator<T> {
    T next();
    boolean hasNext();
}

On the other hand, using internal iterators one might define an interface something like this:

public interface InternalIterable<T> {
    void iterate(Function<T> closure);
}
public interface Function<T> {
    void invoke(T t);
}

Languages with well-integrated support for closures (such as Scala, Smalltalk, and Ruby) usually provide support for looping over their collections using internal iterators - they are, after all, easier to use in most cases - while other object-oriented languages (such as C++, Java, and C#) tend to use external iterators. Without well-integrated language support for closures, internal iterators would be too painful to use effectively. For that reason, the Java collection framework uses external iterators. But once we have closures in the language, wouldn't it be worth reversing that decision?

The answer is no, and it isn't just because it would be an incompatible change to an existing interface. As discussed above, external iterators are more flexible for some clients. The simpler code that clients can write using internal iterators is already achieved in many clients (of external iterators) due to the previous addition of the for-each loop in JDK5. For the remaining clients, simple library methods can bridge the gap between internal and external iterators. See, for example, the "eachEntry" method for iterating over the entries of a map, discussed in my earlier postings on closures. To see how easy the conversion is, here is the code to convert from an external iterator to an internal one:

    public <T> InternalIterable<T> internalize(final ExternalIterable<T> ext) {
        return new InternalIterable<T>() {
            public void iterate(Function<T> closure) {
                for (ExternalIterator<T> it = ext.iterator(); it.hasNext(); ) {
                    closure.invoke(it.next());
                }
            }
        };
    }

Iteration using internal iterators is often much easier to implement, because the iterator implementation doesn't have to explicitly store and manage the state of the iteration. Much of the complexity in the implementation of the iterators for Java's HashMap and TreeMap (and their Set cousins) would simply vanish if the iterators were internal. For that reason, it is interesting to see if it is possible to have the iterator implemented internally, but exposed to the client externally, by writing a utility method that converts between the two iterable interfaces. This is the reverse of the conversion above. How easy this is to implement depends on the features of your programming language.

C# provides a "yield return" construct that helps provide the convenience of implementing internal iterators and the flexibility of using external iterators. But it is not quite powerful enough to bridge the gap between them. See notes from Cyrus Najmabadi's attempt to do so. Neither are simple (local) byte-code rewriting systems such as Aviad Ben Dov's Yielder Framework for Java. You can do it using continuations, coroutines, or fibers. But Java doesn't have them.

You can solve the problem in Java by resorting to the use of a separate thread to simulate coroutines. The result is messy and expensive, as each converted external iterator requires its own thread. Here is my implementation; can you do better?

30 comments:

Anonymous said...

External iterators may have things to cleanup (like a Thread in your example), and it is sometimes better to not wait for the GC to do that.

In my application, I have to create an interface ClosableIterator which is an Iterator with a close() method.

Iterator it = cont.iterator();
...
it.close();

However this is not compatible with the for-each syntax ...

Arnaud

Neal Gafter said...

@Anonymous: right, that is another advantage of internal iteration: the implementation can provide any cleanup code required.

Fabio said...

Another advantage of internal iterator is the possibility to implement atomic group operation on the elements of the collection (synchronized on the collection implementation's lock).

ps.: sorry about my poor english :P

Steven Brandt said...

If the invoke() method on Function returns a boolean, and the interpretation on that value means "stop internal iteration if true" then it is easy to convert back and forth between an InternalIterable and an Iterable. Or is that a good way or is that cheating?

Neal Gafter said...

@Steven: I'd love to see your code to do the conversions (with or without the boolean return values). I don't think it helps at all. In fact, I think it makes it harder.

Ricky Clarkson said...

You're probably aware of this, but where internal iterators are preferred, testing collections for equality can be done using 'zip', which makes two collections of items into one collection of pairs.

I can implement what you asked for in exponential time - that is, I can call my internal iterator for every invocation of the external iterator's next() method.

Here's my implementation: http://docs.google.com/Doc?id=dx5mfkq_91z4mgjc

Rémi Forax said...

Hi Neal,
you forget to mention that internal
iterators can be far more effective
that their external counterparts,
by example when iterating through
a tree.

Else as you certainly know
there is a python implementation named
stackless python have a VM
mecanism that allow to stop and
restart an execution
by allowing multiple stacks
for one thread (only one is active
at a time). With that mecanism,
yield is easy to implement.

The big question is why there is no
"stackless" Java.

Rémi

Neal Gafter said...

@Ricky: I don't think exponential time is better, but I very much would like to see your implementation of "zip" for internal-iterator-based collections.

Mario Gleichmann said...

Hm, i always wonder if 'external iteration' is a kind of violating DRY.
In most of cases (simply) iterating through the items of a collection (and that are REALLY most of the cases) stays the same, only the applied execution logic is different...
So separating the execution logic from iteration logic seems pretty natural to me :o)

Anonymous said...

I've long thought that if we just had closures and continuations, you get the best of both worlds. But I like closures at the core because it covers 90% of cases and provides better resource management. If continuations get you the harder 10% for free, then happy day.

Anonymous said...

Pseudo-code for n-way zip:

List[List] zip (List... x)
..int tuplePos = 0;
..for (i : x)
....int pos = 0;
....for (j : i)
......if (result.size() <= pos)
........list = new List()
........result.add(list); ......else list = result.get(pos)
......while (list.size() < tuplePos) ........list.add(null);
......list.add(j);
......pos++;
....tuplePos++;
..return result;

Neal Gafter said...

@Anonymous: that's pretty bad, performance-wise. It also has the disadvantage that it iterates over elements that its receiver doesn't use. Comparing (for equality) the list of odd numbers to the list of even numbers should take a small constant amount of time, but basing it on this zip - it would not terminate before lunch.

Michael said...

Is this as mapcar a la Lisp??

Neal Gafter said...

@Michael: Since lisp lists are already external iterators, mapcar converts from external to internal iterators. That's the easy direction.

Michael said...

Something else:

A key (like in lisp again) function would be nice:

public void iterate(Function<T> f, Function<T> key);

So before calling f, key gets called and extract the right element. So you could avoid defining a lot of closures.

For example:
{int=>int} sum;
{String=>int} parseInt;
iterate( sum, parseInt )

Neal Gafter said...

@Michael: you've invented map-reduce!

Anonymous said...

You may also be interested in Towards the best collection traversal interface...


Most programming languages support collections, represented by an in-memory data structure, a file, a database, or a generating function. A programming language system gives us typically one of the two interfaces to systematically access elements of a collection. One traversal API is based on enumerators -- e.g., for-each, map, filter higher-order procedures -- of which the most general is fold. The second approach relies on streams, a.k.a. cursors, lazy lists. Generators such as the ones in Icon, Ruby and Python are a hybrid approach.

It is well-known that given a cursor interface to a collection, we can implement an enumerator. It is less appreciated that given an enumerator interface, we can always derive a cursor -- in an automatic way. We demonstrate that generic procedure for languages with and without first-class continuations.

Now that cursors and enumerators are inter-convertible, an implementor of a collection has a choice: which of the two interfaces to implement natively? We argue that he should offer the enumerator interface as the native one. The paper elaborates that enumerators are superior: in efficiency; in ease of programming; in more predictable resource utilization and avoidance of resource leaks. We present a design of the overall optimal collection traversal interface, which is based on a left-fold-like combinator with premature termination. The design has been implemented and tested in practice.

Neal Gafter said...

@Anonymous: I would love to see that transformation applied in a language, such as Java, that is not lazy and does not have continuations, to solve the problem posed here.

Jim Baker said...

I'd agree with using external iterators. Here's my experience in Python: Python leverages them nicely, allowing for composition with an iterator algebra similar to C#'s LINQ or the query execution plans in Oracle or SQLServer. You might be interested in my Python Cookbook recipe (shameless promotion), which shows both the iterator algebra for relational queries and how one can manipulate two iterators in a satisfyingly clever way (IMHO) to implement the underlying logic of a merge join. And incidentally Python 3K is moving away from internal iterators because for complex logic, a generator expression for example is just much easier to follow than say reduce.

Back to Java. Chaotic Java's post is very compelling because generators are a great way of creating iterators, and a bytecode solution, in ASM itself, instead of emulation with our own framing, is something that the Jython 2.5 work needs. (So this is very timely for my GSoC student Tobias Ivarsson who is working on a new ASM backend for Jython.) The first comment by Anonymous is true of many iterators (certainly interesting ones like database cursors), or otherwise you get resource leaks. But the fix is you just need to scope this with deterministic destruction either via try-catch-finally, Python 2.5's with-statement, or C#'s using, or perhaps there's a forthcoming RAII equivalent in Java 7?

Lastly Henry Baker's paper in OOPS Messenger from 1993 gives some great historical context, as well as describing a great problem to compare iterator approaches, the samefringe problem. Looks like you need to be subscribed to the ACM to get the full text unfortunately.

Jim Baker said...

To follow up on my previous comment: I really haven't looked much at Java 7, but automatic resource management (perhaps better term than C++'s RAII) is in the works. I suppose the use of "do" has to do with the fact that it's already reserved, but I think "using" or "with" would have been much clearer, with good precedents.

Anonymous said...

Regarding the fold<->iterator transform: Oleg already shows how to do it without first-class continuations. Furthermore, laziness can be emulated with closures, so I don't see that as an obstactle.

The only remaining issue AFAICS is whether Java's type system is up to the task of implementing Oleg's solution. I suspect it isn't, but that's just a "gut feeling".

(Not the same Anonymous as the one posting the link to Oleg's paper.)

Neal Gafter said...

@Anonymous: the face that closures can be used to emulate laziness doesn't help if the laziness is required in the internal iterator; that would require a global program transformation. I'm looking for a closed (i.e. complete) solution to the problem as posed. I don't mind if the solution contains unsafe casts.

arnaud said...

@Jim: even if the proposed "automatic resource management" is useful, would it really help in the for-each syntax since there is no explicit iterator declared?
Would the java runtime call internally the same close() method as the one used in the new syntax "do (...) { ... }" ?

Anonymous said...

As long as we're mentioning Henry Baker, here's his take on the iterator question.

Jim Baker said...

@arnaud: ARM has nothing to do with the for-each loop itself (the external iterator), but instead what it is iterating on.

Where it would get mixed is if Java 7 were to also introduce Python idioms like generator expressions. This would make it convenient to create iterator objects from a for-each loop. In this case you do have something to cleanup with ARM. Python generators are closed via the close() method, and you can do things like the following:

from __future__ import with_statement
from contextlib import closing

with closing((i * 2 for i in xrange(100))) as g:
~~~~print g.next()
print g.next()

(~ means space here)

The closing contextmanager is used to automatically close the generator expression when we except the scope of the with-statement. The last print statement will raise an exception because the generator is already closed, even though we still have a reference to it.

If you are interested, check out PEP 343 for more details on Python's support of ARM. It would certainly be nice to see this in Java 7!

ounos said...

gafter:
>It's easy to compare two collections for equality with an external iterator, for example, but it's practically impossible with internal iterators.

I can generalize this phenomenon. Assume 2 algorithms need to communicate in the same callstack (in this case: 2 iterations). [Note that since a method call can switch from one algorithm to another one, we only need to care about 2 algorithms].

How can they be arranged in the callstack? Only one at most can be on top (by viewing the root of the stack on top of everything) of the other, so only one can use the stack to have its program counter automatically managed, and use state saved on the stack (local vars). The one at the bottom needs to handle its state on the heap, because it needs to destroy its stack frame in order to return.

Heap-based algorithms are more flexible than stack-based, since they allow interaction with either another heap-based algorithm or a stack-based. The latter is more restrictive, as it needs to be placed on top, and its counterpart necessarily at bottom (and thus, in the heap).

This analysis also explains the fact that the equality check is still easy with an internal and an external iterator. (Just put the stack-based/internal on top and the heap-based/external at bottom, and at each return of the external, you have both results needed for the equality comparison).

But having two internal iterators mean both of them use the stack to persist their state, so each can only contain the whole execution of the other inside it (per the nesting that a stack implies). The only way would be to use a separate stack(thread) (assuming no continuations available).

Sandro Magi said...

I've implemented Oleg's best traversal interface in C#. It's more useful for functional languages with proper tail call support, but I also provide FoldLeft over ordinary IEnumerable which supports early termination in my FP# library as detailed in that post.

Cyrus said...

Hi Neal. Just ran into your post on this. Noticed that you linked to same conversation on this topic. Hope you're doing well :)

-- Cyrus

Tesfaye said...

Really what you are posting are interesting and giving detail resources for those who need it. I learned a lot from your posts. Thank you Neel Gafter. Keep it up
Tesfaye Gari (tesfaye.gari@gmail.c

Ikariel said...

Another advantage of internal iteration is the reification of the action done on each item, as a function or closure. This is also possible with external iteration but not as common, because it is not needed. That object can then be shared among iterations.

Another advantage of external iteration is possibility of deferrement of the iteration which helps to combine iterators, or process the whole iteration later, maybe with total anonymity of the original collection.