Wednesday, August 14, 2019

Accidentally Quadratic Constant Folding

I just fixed a bug in the implementation of constant folding in the C# and VB.Net compilers. They used to require O(n^2) space. Now they require O(n) space.

The naive approach to implementing string concatenation for an expression such as "S1" + "S2" + "S3" is to recursively visit the left-hand and right-hand operands and compute their constant string values, and then concatenate these strings to produce the constant string for the concatenation. It seems obviously correct and the only reasonable way to do it. But it has a serious problem.

The concatenation operation is left associative, so for a long sequence of n concatenations, each of size k, we would produce intermediate results of size k, 2k, 3k, and so on up to nk. The total amount of space needed to represent this collection of strings is O(k n^2). For moderately-sized values of n, this would cause the compilers to run out of memory and crash. This is not merely a theoretical problem, as we had been getting customer reports of this problem about quarterly for the past 7-8 years.

Fixing the problem was fairly straightforward, using a technique I learned from the Cedar/Mesa compiler in the early '80s. Rather than representing string constants in the compilers using strings, they are now represented using a data structure called Ropes. Concatenating two ropes requires only a small constant amount of memory.

Any compiler for a language that supports constant folding of string concatenation will run into this problem unless there is a similarly low-space representation of two concatenated strings. This is a technique that should be in the bag of tricks of any compiler writer.


Eric Lippert said...

In the original C++ based compiler I removed the recursions on the most-likely-to-be-deep side of the constant foldings because the problem was not the quadratic step in the fold; it was that we could run out of stack space if there were large -- typically machine-generated -- expressions. It was my intention to put similar logic into Roslyn and to eliminate the quadratic steps at the same time, but evidently I did not get around to it. I'm glad to hear this is finally done!

Eric Lippert said...

Incidentally, Herman Venter and I built a ropes implementation into JScript in the late 1990s. We abandoned it, because it turned out that the overhead of the rope data structure cost more than the small win we got in deferring string concatenations. It turned out that most people are not concatenating thousands of strings into one string. It was a great lesson to me: not all optimizations are actually improvements, even if their algorithmic order looks better.

Neal Gafter said...

Indeed, I would not necessarily recommend replacing strings with ropes in the language. I am recommending ropes only for the purpose of representing constant strings in the compiler so that string concatenation can be folded (in the compiler) in O(1) space.