Calculating Programs from Galois Connections

Galois connections are ubiquitous in mathematics and computing science. One is often amazed that, once two functions are identified as a Galois connection, a long list of nice and often useful properties follows from one concise, elegant defining equation. But how does one construct a program from a specification given as a Galois connection? This is the topic of a recent work by José Nuno Oliveira and me, and this article is an advertisement.

Galois Connections as Specifications

In program construction one often encounters program specifications of the form “… the smallest such number”, “the longest prefix of the input list satisfying …”, etc. A typical example is whole number division: given a natural number `x` and a positive integer `y`, `x / y` is the largest natural number that, when multiplied by `y`, is at most `x`. For another example, the Haskell function `takeWhile p` returns the longest prefix of the input list such that all elements satisfy predicate `p`.

Such specifications can be seen as consisting of two parts. The easy part specifies a collection of solution candidates: numbers that are at most `x` after multiplication with `y`, or all prefixes of the input list. The hard part, on the other hand, picks one optimal solution, such as the largest, the longest, etc., among the collection.

Our goal is to calculate programs for such specifications. But how best should the specification be given in the first place? Take division for example: one might start from a specification that literally translates our description above into mathematics:
``` ```

``    x / y = ⋁{ z | z * y ≤ x } ``
``` ```

As we know, however, suprema are in general not easy to handle. One could also explicitly name the remainder:
``` ```

``    z = x / y  ≡  (∃ r : 0 ≤ r < y : x = z * y + r)``
``` ```

at the cost of existentially quantifying over the remainder.

A third option looks surprisingly simpler: given `x` and `y`, the value `x / y` is such that for all `z`,
``` ```

``    z * y ≤ x  ≡  z ≤ x / y        (1)``
``` ```

Why is this sufficient as a definition of `x / y`? Firstly, by substituting `x / y` for `z`, the right hand side of `≡` reduces to true, and we obtain on the left hand side `(x / y) * y ≤ x`. This tells us that `x / y` is a candidate: it satisfies the easy part of the specification. Secondly, read the definition from left to right: `z * y ≤ x ⇒ z ≤ x / y`. It says that `x / y` is the largest among all the numbers satisfying the easy part.
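As a quick sanity check, property `(1)` can be tested against Haskell's built-in `div` on a finite range. This is only a sketch: the names `divisionGC` and `divSpec` are made up for illustration, and `x` is assumed to be a natural number and `y` positive.

```haskell
-- Property (1), checked for every z in a finite range (x >= 0, y >= 1 assumed).
divisionGC :: Integer -> Integer -> Bool
divisionGC x y = and [ (z * y <= x) == (z <= x `div` y) | z <- [0 .. x + 1] ]

-- The specification read literally: the largest candidate z with z * y <= x.
-- The candidate list is never empty, because z = 0 always qualifies.
divSpec :: Integer -> Integer -> Integer
divSpec x y = maximum [ z | z <- [0 .. x], z * y <= x ]
```

Both views agree on small inputs: `divSpec x y` coincides with ``x `div` y``, and `divisionGC x y` holds.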

Equations of the same form as `(1)` are called Galois connections. Given preorders `⊑` and `≤`, functions `f` and `g` form a Galois connection if for all `x` and `z` we have
``` ```

``    f z ⊑ x  ≡  z ≤ g x        (2)``
``` ```

The function `f` is called the lower adjoint and `g` the upper adjoint.

The definition of division above is a Galois connection where `f = (* y)` and `g = (/ y)`. For another example, `takeWhile p` can be specified as an upper adjoint:
``` ```

``    map p? zs ⊑ xs  ≡  zs ⊑ takeWhile p xs        (3)``
``` ```

where `⊑` is the prefix ordering: `ys ⊑ xs` if `ys` is a prefix of `xs`, and `map p?` is a partial function: `map p? xs = xs` if `p x` holds for each `x` in `xs`.
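Specification `(3)` can also be checked by brute force. In the sketch below the partial function `map p?` is modelled by the test `all p zs`, so the left hand side holds exactly when `map p? zs` is defined and is a prefix of `xs`. The names `candidateOK`, `belowTakeWhile` and `takeWhileGC` are invented for this example.

```haskell
import Data.List (isPrefixOf, subsequences)

-- Left hand side of (3): map p? zs is defined and is a prefix of xs.
candidateOK :: Eq a => (a -> Bool) -> [a] -> [a] -> Bool
candidateOK p zs xs = all p zs && zs `isPrefixOf` xs

-- Right hand side of (3): zs is a prefix of takeWhile p xs.
belowTakeWhile :: Eq a => (a -> Bool) -> [a] -> [a] -> Bool
belowTakeWhile p zs xs = zs `isPrefixOf` takeWhile p xs

-- Compare both sides for every subsequence of xs; this covers all the
-- prefixes as well as a number of lists that are not prefixes.
takeWhileGC :: Eq a => (a -> Bool) -> [a] -> Bool
takeWhileGC p xs =
  and [ candidateOK p zs xs == belowTakeWhile p zs xs | zs <- subsequences xs ]
```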

We love Galois connections because once two functions are identified as such, a long list of useful properties follows: `f (g x) ⊑ x`, `z ≤ g (f z)`, `f` and `g` are monotonic, and are inverses of each other in the other's range... etc.

These are all very nice. But can one calculate a program from a Galois connection? Given `⊑`, `≤`, and `f`, how does one construct `g`?

The "Shrink" Operator

José discovered and proposed a relational operator to handle such calculations. To use the operator, we have to turn the Galois connection `(2)` into point-free style. We look at the left hand side of `(2)`: `f z ⊑ x`, and try to write it as a relation between `z` and `x`. Let `f°` denote the relational converse of `f` (roughly, think of it as the inverse function of `f`, that is, it maps `f z` to `z`), and let `∘` denote relational composition, that is, function composition extended to relations. Thus `f z ⊑ x` translates to
``` ```

``    f° ∘ (⊑)``
``` ```

It is a relation between `z` and `x`: putting `x` on the left hand side of `f° ∘ (⊑)`, it relates, through `⊑`, to `f z`, which is then mapped to `z` through `f°`.

Then we hope that `f° ∘ (⊑)` can be transformed into a (relational) fold or unfold, which is often the case because the defining components `⊑`, `≤`, and `f` are often folds or unfolds. Consider the lower adjoint of `takeWhile p` in `(3)`. Since `⊑`, the relation that takes a list and returns a prefix of the list, can be defined as a fold on lists, `(map p?)° ∘ (⊑)`, by fold fusion, is also a fold. As for `(1)`, since `≤` and `(* y)` are both folds on natural numbers, `(* y)° ∘ (≤)` can be both a fold and an unfold.

In our paper we showed that a Galois connection `(2)` can be transformed into
``` ```

``    g = (f° ∘ (⊑)) ↾ (≥)``
``` ```

where `↾` is the new operator José introduced. The relation `S ↾ R`, pronounced "`S` shrunk by `R`", is a sub-relation of `S` that yields, for each input, an optimal result under relation `R`. Note that the equation makes the easy/hard division explicit: `f° ∘ (⊑)` is the easy part: we want a solution `z` that satisfies `f z ⊑ x`, while `≥` is the criterion we use, in the hard part, to choose an optimal solution.
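To get a feel for the operator, here is a small executable model. It is only a sketch under a strong assumption: a relation is represented by a function returning the finite list of all results for an input, which is not how the paper treats relations. The names `Rel`, `shrink`, `easyPart` and `divide` are hypothetical.

```haskell
-- Assumption: a relation is modelled by the finite list of results per input.
type Rel a b = a -> [b]

-- s `shrink` r keeps those results of s that are related by r to every
-- result of s for the same input, i.e. the optimal ones under r.
shrink :: Rel a b -> (b -> b -> Bool) -> Rel a b
shrink s r x = [ z | z <- s x, all (z `r`) (s x) ]

-- Easy part of division: every z with z * y <= x (x >= 0 and y >= 1 assumed).
easyPart :: Integer -> Rel Integer Integer
easyPart y x = [ z | z <- [0 .. x], z * y <= x ]

-- Hard part: among the candidates, keep the one that is >= all the others.
divide :: Integer -> Rel Integer Integer
divide y = easyPart y `shrink` (>=)
```

For instance, `divide 3 17` evaluates to `[5]`: the candidates are `0` to `5`, and only `5` is at least as large as all of them.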

The `↾` operator is similar to the `min` operator of Bird and de Moor, without having to use sets (which needs a power allegory). It satisfies a number of useful properties. In particular, we have theorems stating when `(↾ R)` promotes into folds and unfolds. For example,
``` ```

``    (fold S) ↾ R ⊇ fold (S ↾ R)``
``` ```

if `R` is transitive and `S` is monotonic on `R`.

With the theorems we can calculate `g`. Given `g`, specified as an upper adjoint in a Galois connection with lower adjoint `f`, we first try to turn `f° ∘ (⊑)` into a fold or an unfold, and then apply the theorems to promote `(↾ (≥))`. For more details, take a look at our paper!

Programming from Galois connections — principles and applications

Shin-Cheng Mu and José Nuno Oliveira. Technical Report TR-IIS-10-009, Academia Sinica, December 2010.
[PDF]

This report is an extended version of our conference submission Programming from Galois connections.

Problem statements often resort to superlatives such as in eg. “… the smallest such number”, “… the best approximation”, “… the longest such list” which lead to specifications made of two parts: one defining a broad class of solutions (the easy part) and the other requesting the optimal such solution (the hard part).

This report introduces a binary relational combinator which mirrors this linguistic structure and exploits its potential for calculating programs by optimization. This applies in particular to specifications written in the form of Galois connections, in which one of the adjoints delivers the optimal solution at target.

The framework encompasses re-factoring of results previously developed by Bird and de Moor for greedy and dynamic programming, in a way which makes them less technically involved and therefore easier to understand and play with.

Programming from Galois connections

Shin-Cheng Mu and José Nuno Oliveira. In the 12th International Conference on Relational and Algebraic Methods in Computer Science (RAMiCS #12), LNCS 6663, pages 294-313. May 30 – June 3, 2011.
[PDF]

Problem statements often resort to superlatives such as in eg. “… the smallest such number”, “… the best approximation”, “… the longest such list” which lead to specifications made of two parts: one defining a broad class of solutions (the easy part) and the other requesting the optimal such solution (the hard part).

This paper introduces a binary relational combinator which mirrors this linguistic structure and exploits its potential for calculating programs by optimization. This applies in particular to specifications written in the form of Galois connections, in which one of the adjoints delivers the optimal solution at target.

The framework encompasses re-factoring of results previously developed by Bird and de Moor for greedy and dynamic programming, in a way which makes them less technically involved and therefore easier to understand and play with.

An accompanying technical report is available online.

Constructing datatype-generic fully polynomial-time approximation schemes using generalised thinning

Shin-Cheng Mu, Yu-Han Lyu, and Akimasa Morihata. In the 6th ACM SIGPLAN workshop on Generic programming (WGP 2010), pages 97-108, Sep. 2010. [PDF]

The fully polynomial-time approximation scheme (FPTAS) is a class of approximation algorithms that is able to deliver an approximate solution within any chosen ratio in polynomial time. By generalising Bird and de Moor’s Thinning Theorem to a property between three orderings, we come up with a datatype-generic strategy for constructing fold-based FPTASs. Greedy, thinning, and approximation algorithms can thus be seen as a series of generalisations. Components needed in constructing an FPTAS are often natural extensions of those in the thinning algorithm. Design of complex FPTASs is thus made easier, and some of the resulting algorithms turn out to be simpler than those in previous works.

Evaluating Simple Polynomials

At the end of FLOLAC ’10 I had a chance to show the students, in 25 minutes, what functional program calculation is about. The students had been exposed to functional programming only a week earlier, in a three-hour course, after which they had written some simple programs handling concrete data but might still have problems grasping more abstract concepts like folds. I have talked to them about maximum segment sum way too many times (in the context of imperative program derivation, though), and it is perhaps too complex to cover in 25 minutes. The steep list problem, on the other hand, can be dealt with in 5 minutes. Thus I needed another example.

This is what I eventually came up with: given a list of numbers `a₀, a₁, a₂ ... aₙ` and a constant `X`, compute `a₀ + a₁X + a₂X² + ... + aₙXⁿ`. In Haskell it can be specified as a one-liner:
``` ```

``  poly as = sum (zipWith (×) as (iterate (×X) 1))``
``` ```

One problem of this example is that the specification is already good enough: it is a nice linear time algorithm. To save some multiplications, perhaps, we may try to further simplify it.

It is immediate that `poly [] = 0`. For the non-empty case, we reason:
``` ```

``````   poly (a : as)
=   { definition of poly }
sum (zipWith (×) (a:as) (iterate (×X) 1))
=   { definition of iterate }
sum (zipWith (×) (a:as) (1 : iterate (×X) X))
=   { definition of zipWith }
sum (a : zipWith (×) as (iterate (×X) X))
=   { definition of sum }
a + sum (zipWith (×) as (iterate (×X) X))
``````
``` ```

The expression to the right of `a +` is unfortunately not `poly as` — the last argument to `iterate` is `X` rather than `1`. One possibility is to generalise `poly` to take another argument. For this problem, however, we can do slightly better:
``` ```

``````   a + sum (zipWith (×) as (iterate (×X) X))
=   { since iterate f (f b) = map f (iterate f b) }
a + sum (zipWith (×) as (map (×X) (iterate (×X) 1)))
=   { zipWith (⊗) as . map (⊗X) = map (⊗X) . zipWith (⊗) as
if ⊗ associative }
a + sum (map (×X) (zipWith (×) as (iterate (×X) 1)))
=   { sum . map (×X) = (×X) . sum }
a + (sum (zipWith (×) as (iterate (×X) 1))) × X
=   { definition of poly }
a + (poly as) × X
``````
``` ```

We have thus come up with the program
``` ```

``````  poly [] = 0
poly (a : as) = a + (poly as) × X
``````
``` ```

Besides the definitions of `sum`, `zipWith`, `iterate`, etc, the rules used include:

1. `map f (iterate f x) = iterate f (f x)`
2. `zipWith (⊗) as . map (⊗X) = map (⊗X) . zipWith (⊗) as` if `⊗` associative
3. `sum . map (×X) = (×X) . sum`, a special case of `foldr ⊕ e . map (⊗X) = (⊗X) . foldr ⊕ e` if `(a ⊕ b) ⊗ X = (a ⊗ X) ⊕ (b ⊗ X)` and `e ⊗ X = e`.
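For readers who want to run the two versions, here is a plain-ASCII transcription. It is a sketch: the constant `X` becomes an explicit first argument `x`, and the names `polySpec` and `polyHorner` are mine.

```haskell
-- The specification, with the constant X passed as the argument x.
polySpec :: Integer -> [Integer] -> Integer
polySpec x as = sum (zipWith (*) as (iterate (* x) 1))

-- The derived program: poly [] = 0 and poly (a : as) = a + poly as * x,
-- written as a foldr.
polyHorner :: Integer -> [Integer] -> Integer
polyHorner x = foldr (\a r -> a + r * x) 0
```

With `x = 2`, both map `[1, 2, 3]` to `17`, that is, `1 + 2 × 2 + 3 × 4`.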

Well, this is not a very convincing example. Ideally I’d like to have a derivation, like the steep list, where we gain some improvement in complexity by calculation.

What is your favourite example for functional program calculation?

Sum of Squares of Differences

In the final exam of the Program Construction course in FLOLAC ’10, I gave the students this problem (from Kaldewaij’s book):
``` ```

``````|[ con N {N ≥ 2}; a : array [0..N) of int;
var r : int;
S
{ r = (Σ i,j : 0 ≤ i < j < N : (a.i - a.j)²) }
]|``````
``` ```

In words, given an array of integers having at least two elements, compute the sum of squares of the difference between all pairs of elements. (Following the convention of the guarded command language, function application is written `f.x`, and an array is seen as a function from indices to values.)

It is not hard to quickly write up an `O(N²)` program using nested loops, which, I have to confess, is what I would have done before reading Kaldewaij’s book and realising that it is possible to do the task in linear time using one loop. Unfortunately, not many students managed to come up with this solution, therefore I think it is worth some discussion.

Quantifiers

Before we solve the problem, let us review the “Dutch style” quantifier syntax and rules. Given a commutative, associative binary operator `⊕` with unit element `e`, if we informally denote the (integral) values in the interval `[A .. B)` by `i₀, i₁, i₂ ... iₙ`, the quantified expression:
``` ```

``   (⊕ i : A ≤ i < B : F.i)``
``` ```

informally denotes `F.i₀ ⊕ F.i₁ ⊕ F.i₂ ⊕ ... ⊕ F.iₙ`. More generally, if all values satisfying predicate `R` can be enumerated as `i₀, i₁, i₂ ... iₙ`, the expression
``` ```

``   (⊕ i : R.i : F.i)``
``` ```

denotes `F.i₀ ⊕ F.i₁ ⊕ F.i₂ ⊕ ... ⊕ F.iₙ`. We omit the `i` in `R.i` and `F.i` when there can be no confusion.

A more formal characterisation of the quantified expression is given by the following rules:

1. `(⊕ i : false : F.i) = e`
2. `(⊕ i : i = x : F.i) = F.x`
3. `(⊕ i : R : F) ⊕ (⊕ i : S : F) = (⊕ i : R ∨ S : F) ⊕ (⊕ i : R ∧ S : F)`
4. `(⊕ i : R : F) ⊕ (⊕ i : R : G) = (⊕ i : R : F ⊕ G)`
5. `(⊕ i : R.i : (⊕ j : S.j : F.i.j)) = (⊕ j : S.j : (⊕ i : R.i : F.i.j))`

Rules 1, 2 and 3 give rise to a useful rule "split off `n`": consider `i` such that `0 ≤ i < n + 1`. If `n > 0`, the set of possible values of `i` can be split into two disjoint subsets: `0 ≤ i < n` and `i = n`. By rule 3 (reversed), rule 1 (the conjunction of the two ranges is `false`) and rule 2 we get:
``` ```

``  (⊕ i : 0 ≤ i < n + 1 : F.i) = (⊕ i : 0 ≤ i < n : F.i) ⊕ F.n``
``` ```
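The quantifier notation has a direct, if naive, reading in Haskell. The sketch below assumes a finite list `dom` of candidate values for the bound variable; `quant` and `splitOff` are names made up for this illustration.

```haskell
-- (⊕ i : R.i : F.i), where the bound variable ranges over the finite list dom.
quant :: (b -> b -> b) -> b -> [a] -> (a -> Bool) -> (a -> b) -> b
quant op e dom r f = foldr op e [ f i | i <- dom, r i ]

-- The "split off n" rule, instantiated to summation, for a given n >= 0.
splitOff :: Integer -> (Integer -> Integer) -> Bool
splitOff n f =
  quant (+) 0 dom (\i -> 0 <= i && i < n + 1) f
    == quant (+) 0 dom (\i -> 0 <= i && i < n) f + f n
  where
    dom = [0 .. n]
```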

Expressions quantifying over more than one variable can be expressed in terms of quantifiers over single variables:
``` ```

``   (⊕ i,j : R.i ∧ S.i.j : F.i.j) = (⊕ i : R.i : (⊕ j : S.i.j : F.i.j))``
``` ```

If `⊗` distributes into `⊕`, we have an additional property:
``` ```

``   x ⊗ (⊕ i : R : F) = (⊕ i : R : x ⊗ F)``
``` ```

As a convention, `(+ i : R : F)` is often written `(Σ i : R : F)`.

Computing the Sum of Squares of Differences

The first step is to replace the constant `N` by a variable `n`. The main worker of the program is going to be a loop, in whose invariant we try to maintain:
``` ```

``   P  ≣  r = (Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²)``
``` ```

At the end of the loop body we increment `n`, and the loop terminates when `n` coincides with `N`:
``` ```

``````   { Inv: P ∧ 2 ≤ n ≤ N , Bound: N - n}
do n ≠ N → ... ; n := n + 1 od``````
``` ```

We shall then find out how to update `r` before `n := n + 1` in a way that preserves `P`.

Assume that `P` and `2 ≤ n ≤ N` hold. To find out how to update `r`, we substitute `n + 1` for `n` in the desired value of `r`:
``` ```

``````   (Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²)[n+1 / n]
= (Σ i,j : 0 ≤ i < j < n + 1 : (a.i - a.j)²)
=   { split off j = n }
(Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²) +
(Σ i : 0 ≤ i < n : (a.i - a.n)²)
=   { P }
r + (Σ i : 0 ≤ i < n : (a.i - a.n)²)
``````
``` ```

This is where most people stop the calculation and start constructing a loop computing `(Σ i : 0 ≤ i < n : (a.i - a.n)²)`. One might later realise, however, that most computations are repeated. Indeed, the expression above can be expanded further:
``` ```

``````   r + (Σ i : 0 ≤ i < n : (a.i - a.n)²)
=   { (x - y)² = x² - 2xy + y² }
r + (Σ i : 0 ≤ i < n : a.i² - 2 × a.i × a.n + a.n²)
=   { Rule 4 }
r + (Σ i : 0 ≤ i < n : a.i²)
- (Σ i : 0 ≤ i < n : 2 × a.i × a.n)
+ (Σ i : 0 ≤ i < n : a.n²)
=   { a.n is a constant, multiplication distributes into addition }
r + (Σ i : 0 ≤ i < n : a.i²)
- 2 × (Σ i : 0 ≤ i < n : a.i) × a.n
+ (Σ i : 0 ≤ i < n : a.n²)
=   { simplifying the last term }
r + (Σ i : 0 ≤ i < n : a.i²)
- 2 × (Σ i : 0 ≤ i < n : a.i) × a.n + n × a.n²
``````
``` ```

which suggests that we can store the values of `(Σ i : 0 ≤ i < n : a.i²)` and `(Σ i : 0 ≤ i < n : a.i)` in two additional variables:
``` ```

``````  Q₀  ≣  s = (Σ i : 0 ≤ i < n : a.i²)
Q₁  ≣  t = (Σ i : 0 ≤ i < n : a.i)
``````
``` ```

It merely takes some routine calculation to find out how to update `s` and `t`. The resulting code is:
``` ```

``````|[ con N {N ≥ 2}; a : array [0..N) of int;
var r, s, t, n : int;

r, s, t, n := (a.0 - a.1)², a.0² + a.1², a.0 + a.1, 2
{ Inv: P ∧ Q₀ ∧ Q₁ ∧ 2 ≤ n ≤ N , Bound: N - n }
; do n ≠ N →
r := r + s - 2 × t × a.n + n × a.n²;
s := s + a.n²;
t := t + a.n;
n := n + 1
od
{ r = (Σ i,j : 0 ≤ i < j < N : (a.i - a.j)²) }
]|``````
``` ```
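The same idea can be tried out in Haskell. This is a sketch rather than a translation of the guarded-command program: the array becomes a list, the loop becomes a left fold, and the fold starts from `n = 0` instead of `n = 2`, which assumes that the invariants also hold for empty ranges. The names `ssdSpec` and `ssdLinear` are hypothetical.

```haskell
-- Quadratic specification: all pairs i < j.
ssdSpec :: [Integer] -> Integer
ssdSpec a = sum [ (x - y) ^ 2 | (i, x) <- ixs, (j, y) <- ixs, i < j ]
  where
    ixs = zip [0 :: Int ..] a

-- One pass, carrying (r, s, t, n) as in the invariant P ∧ Q₀ ∧ Q₁:
-- r is the answer so far, s the sum of squares, t the sum, n the count.
ssdLinear :: [Integer] -> Integer
ssdLinear = (\(r, _, _, _) -> r) . foldl step (0, 0, 0, 0)
  where
    step (r, s, t, n) x =
      (r + s - 2 * t * x + n * x * x, s + x * x, t + x, n + 1)
```

On `[1, 2, 3]` both functions return `6`, that is, `1 + 4 + 1`.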

Another “One Loop” Solution

Among those students who did come up with a program, most of them resorted to a typical two-loop, `O(N²)` solution. Given that this 9-hour course is, for almost all of them, their first exposure to program derivation, I shall perhaps be happy enough that around 3 to 4 out of 38 students came up with something like the program above.

One student, however, delivered a program I did not expect to see:
``` ```

``````|[ con N {N ≥ 2}; a : array [0..N) of int;
var r, i, j : int;

r, i, j := 0, 0, 0
{ Inv: ... ∧ 0 ≤ i ≤ j ∧ 0 ≤ j ≤ N, Bound: ? }
; do j ≠ N →
if i < j → r := r + (a.i - a.j)²;  i := i + 1
| i = j → i, j := 0, j + 1
fi
od
]|``````
``` ```

The program uses only one loop, but is still `O(N²)`: on closer inspection one realises that it is actually simulating the inner loop manually. Still, I’d be happy if the student could show me a correctness proof, with a correct loop invariant and a bound, since both of them are more complex than what I expected them to learn. Unfortunately, in the answer handed in, the program, the invariant, and the bound all contain some bugs. Does anyone want to give it a try?

An Exercise Utilising Galois Connections

Given two partial orders `(A, ⊑)`, `(B, ≼)`, two functions `f : A → B`, `g : B → A` form a Galois connection between them if for all `a : A`, `b : B` we have

``  f a ≼ b ≣ a ⊑ g b``

We will refer to this defining property as “GC” later. The function `f` is called the lower adjoint and `g` the upper adjoint of the Galois connection. Galois connections are interesting because once two functions are identified as such, they immediately satisfy a rich collection of useful properties:

• letting `a := g b` in GC, we get `f (g b) ≼ b`;
• letting `b := f a`, we get `a ⊑ g (f a)`;
• `f` is monotonic, since:
`````` f a₁ ≼ f a₂
≣   { GC }
a₁ ⊑ g (f a₂)
⇐  {  since a ⊑ g (f a) }
a₁ ⊑ a₂``````
• similarly, `g` is monotonic: `b₁ ≼ b₂ ⇒ g b₁ ⊑ g b₂`,

and many more.

In the recent work of Sharon and me on maximally dense segments we needed quite a number of functions to be monotonic, idempotent, etc. It only occurred to me after submitting the paper: could they be defined as Galois connections? The number of properties we needed in the paper is huge and it would be nice to establish them on fewer basic properties. And it looks prettier.

Longest Prefix Up to a Certain Sum

One such function is `trim` in the paper, but it is sufficient to consider a simplification: let `sam : [Int] → [Int]` (for “sum atmost”) return the longest prefix of the input list whose sum is no larger than a constant `U`. Denote “`x` is a prefix of `y`” by `x ⊑ y`. We want to show that `sam` satisfies

• monotonicity: `x ⊑ y ⇒ sam x ⊑ sam y`, and
• idempotence: `sam (sam x) = sam x`.

Can they be derived by defining `sam` as a Galois connection?

I learned from José N. Oliveira’s talk A Look at Program “G”alculation in IFIP WG 2.1 #65 Meeting how (the uncurried version of) `take` can be defined as a Galois connection. It turns out that `sam` is just the same. We consider a slight generalisation `sam' : (Int, [Int]) → [Int]` that takes an upper bound as a parameter. It can be characterised by:

``sum y ≤ b  ∧  y ⊑ x   ≣   y ⊑ sam' (b, x)``

There is in fact a Galois connection hidden already! To see that, define `⟨f, g⟩ a = (f a, g a)` (in the Haskell Hierarchical Libraries it is defined in Control.Arrow as `&&&`), and denote the product of binary relations by `×`, that is, if `a ≤ b` and `x ⊑ y` then `(a,x)` is related to `(b,y)` by `≤×⊑`. We write a composed relation as an infix operator by surrounding it in square brackets: `(a,x) [≤×⊑] (b,y)`.

Using these notations, the defining equation of `sam'` can be rewritten as:

``⟨sum, id⟩ y [≤×⊑] (b,x)   ≣   y ⊑ sam' (b,x)``

Thus `sam'` is the upper adjoint in a Galois connection between `((Int, [Int]), ≤×⊑)` and `([Int], ⊑)`!

Now that `⟨sum, id⟩` and `sam'` form a Galois connection, we have:

• `f (g b) ≼ b` instantiates to `⟨sum, id⟩ (sam' (b,x)) [≤×⊑] (b,x)`, that is, `sum (sam' (b,x)) ≤ b` and `sam' (b,x) ⊑ x`;
• `a ⊑ g (f a)` instantiates to `x ⊑ sam' (sum x, x)`. Together with the previous property we have `x = sam' (sum x, x)`;
• monotonicity of the lower adjoint instantiates to `y₁ ⊑ y₂ ⇒ sum y₁ ≤ sum y₂ ∧ y₁ ⊑ y₂`;
• monotonicity of the upper adjoint instantiates to
``(b₁,x₁) [≤×⊑] (b₂,x₂)   ⇒   sam' (b₁,x₁) ⊑ sam' (b₂,x₂) ``

that is

``b₁ ≤ b₂   ∧  x₁ ⊑ x₂   ⇒   sam' (b₁,x₁) ⊑ sam' (b₂,x₂) ``

a generalisation of the monotonicity we want.

Finally, to show idempotence, we reason

``````   sam' (b₁, x) ⊑ sam' (b₁, sam' (b₂, x))
≣   { GC }
⟨sum, id⟩ (sam' (b₁, x)) [≤×⊑]  (b₁, sam' (b₂, x))
≣   { definitions }
sum (sam' (b₁, x)) ≤ b₁   ∧   sam' (b₁, x) ⊑ sam' (b₂, x)
⇐  { properties above }
b₁ ≤ b₂``````
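The properties can be checked on concrete data. The sketch below makes an assumption that the text does not state: the list elements are non-negative and `b ≥ 0`, so that the prefixes whose sum is at most `b` are closed under taking prefixes and the empty list is always a candidate. The function names `gcHolds` and `idempotent` are invented here.

```haskell
import Data.List (inits, isPrefixOf)

-- Longest prefix of x whose sum is at most b.
-- Assumes non-negative elements and b >= 0, so the candidate list is non-empty.
sam' :: (Int, [Int]) -> [Int]
sam' (b, x) = last [ y | y <- inits x, sum y <= b ]

-- The defining property: sum y ≤ b ∧ y ⊑ x  ≣  y ⊑ sam' (b, x).
gcHolds :: Int -> [Int] -> [Int] -> Bool
gcHolds b x y = (sum y <= b && y `isPrefixOf` x) == (y `isPrefixOf` sam' (b, x))

-- Idempotence for a fixed bound b.
idempotent :: Int -> [Int] -> Bool
idempotent b x = sam' (b, sam' (b, x)) == sam' (b, x)
```

For example, `sam' (6, [3, 1, 4, 1, 5])` is `[3, 1]`, since adding the next element would bring the sum to `8`.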

These are all nice and pretty. There is another function, however, that is much harder to deal with, which I will write about next time.

Finding Maximally Dense Segments

Sharon and I have finally concluded, for now, our work on the maximally dense segment problem (draft, with an errata already!), on which we have been working on and off for the past two years. Considering the algorithm itself and its derivation/proofs, I am quite happy with what we have achieved. The algorithm is rather complex, however, and it is a challenge presenting it in an accessible way. Sharon has done a great job polishing the paper, and I do hope more people would be interested in reading it and it would, eventually, inspire more work on interesting program derivations.

The basic form of the problem looks like a natural variation of the classical maximum segment sum problem: given a list of numbers, find a consecutive segment whose average, that is, sum divided by length, is maximum. The problem would be trivial without more constraints, since one could simply return the largest element, thus we usually impose a lower bound `L` on the length of feasible segments.

It was noticed by Huang [3] that a segment having maximum average need not be longer than `2L - 1`: given a segment of `2L` elements or more, we cut it in the middle. If the two halves have different averages, we keep the one with the larger average. Otherwise the two halves have the same average, and we may keep either. Either way, we get a shorter, feasible segment whose average is not lower. This fact hints at a trivial `O(nL)` algorithm: for each suffix of the list, find its best prefix up to `2L - 1` elements long.
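The trivial algorithm can be sketched as follows. It only returns the maximum average, not the segment itself, and it assumes `l ≥ 1` and elements of length 1; the name `maxAverage` is made up for this example. Each suffix contributes at most `l` candidates, whose sums are obtained as running sums by `scanl`.

```haskell
import Data.List (tails)
import Data.Ratio ((%))

-- Maximum average over all segments of length l .. 2l - 1 (l >= 1 assumed).
-- Returns Nothing when the input has no segment of length l.
maxAverage :: Int -> [Integer] -> Maybe Rational
maxAverage l xs
  | null avgs = Nothing
  | otherwise = Just (maximum avgs)
  where
    avgs =
      [ s % n
      | suffix <- tails xs
        -- pairs (length, sum) of the prefixes of this suffix, lengths 0 .. 2l - 1
      , (n, s) <- drop l (zip [0 ..] (scanl (+) 0 (take (2 * l - 1) suffix)))
      ]
```

For example, `maxAverage 2 [1, 5, 3, -2, 8]` is `Just (4 % 1)`, achieved by the segment `[5, 3]`.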

A difficult challenge, however, is to come up with an algorithm that is `O(n)`, independently of `L`. The problem can be generalised to the case where the elements do not have length 1, but each has a width, and the goal is to maximise the density, that is, the sum of the elements divided by the sum of their widths. It makes the problem slightly more complicated, but does not change its nature. If we go on to impose an upper bound `U` on the length as well, however, the problem becomes much more difficult. There was a published algorithm that claimed to be linear, only to be found not so. We discovered that two later algorithms, which appeared to have concluded the problem, also fail for a boundary case. The bug is easy to fix for one of the algorithms, but might not be so for the other.

Our algorithm closely relates to that of Chung and Lu [1] and that of Goldwasser et al [2]. The algorithm is perhaps too complex to present in detail in a blog post (that’s why we need a paper!), but I will try to give an outline using pictures from the paper, my slides and poster.

One of the ways to visualise the problem is to see each element as a block, the number being the area of the block, and the density would be its height. The input is a list of (area, width) pairs, and the goal is to find a consecutive segment maximising the height. Shown below is the input list `[(9,6),(6,2),(14,7),(20,4),(-10,5),(20,8),(-2,2),(27,6)]`, and the dashed line is their average height:

Notice that an area can be negative. In the paper, since the letter `w` is used for “window” (to be explained below), we instead refer to the width as “breadth”.

Prefixes of Suffixes, and the Window

Many optimal segment problems (finding some optimal segment of a given list) are solved by finding, for each suffix, its optimal prefix, as shown below. Each bar is a suffix of the input, and the blue part is its optimal prefix:

It is preferable that an optimal prefix of `a : x` can be computed from the optimal prefix of `x`, that is, the function computing the optimal prefix is a `foldr`. If that is the case, the algorithm may keep a pair of (optimal segment, optimal prefix). Each time a new element is read, it computes the new optimal prefix using the previous optimal prefix, and updates the optimal segment if the new prefix is better. If you like structured recursion (or the so-called “origami programming”), this form of computation is an instance of a zygomorphism.

For each optimal prefix to be computable from the previous optimal prefix, it may not extend further than the latter. We do not want the following to happen:

However, it appears to be possible for the maximally dense prefix! Imagine adding a very small, or even negative, area. We might get a denser prefix by extending further to the right since the denominator is larger.

The first theorem we had to prove aimed to show that it does not matter — if a maximally dense prefix extends further than the previous one, it is going to be suboptimal anyway. Thus it is safe if we always start from the right end of the previous prefix. That is, we do not compute the maximally dense prefix of the entire input, but merely the maximally dense prefix of the previous prefix.

This is an instance of the sliding window scheme proposed by Zantema [4]. The blue part is like a “window” of the list, containing enough information to guarantee the correctness of the algorithm. As the algorithm progresses, the two ends of the window keep sliding to the left, hence the name.

To formally show that the window contains enough information to compute the maximally dense segment, we have to clearly state what a window is, and what invariant it satisfies. It turned out to be quite tricky to formally state the intuition that “the window does not always give you the optimal prefix, but it does when it matters,” and this was the first challenge we met.

Since we aim at computing a segment at least `L` units in breadth, it might be handy to split the window into a “compulsory part” (the shortest prefix that is at least `L` units wide) and the rest, the “optional part”. The algorithm thus looks like this:

where the yellow bars are the compulsory parts and blue bars the optional parts. Each time we read an element into the compulsory part, zero or more elements (since the elements have non-uniform breadths) may be shifted from the compulsory part to the optional part. Then we compute a maximally dense prefix (the yellow and the blue parts together) that does not extend further than the previous one. The best among all these prefixes is the maximally dense segment.

We want a linear time algorithm, which means that all the computation from a pair of yellow-blue bars to the next pair has to be done in (amortised) constant time — how is that possible at all? To do so we will need to exploit some structure in the optional part, based on properties of density and segments.

Right-Skew Segments, and the DRSP

A non-empty list of elements `x` is called right-skew if, for every non-empty `x₁` and `x₂` such that `x₁ ⧺ x₂ = x`, we have `density x₁ ≤ density x₂`. Informally, a right-skew list is drawn as the blue wavy block below:

The rising wavy slope informally hints that the right half has a higher density than the left half wherever you make the cut. However, the drawing risks giving the impression that a right-skew segment is a list of elements with ascending areas or densities. Note that neither the areas nor the densities of individual elements have to be ascending. For example, the list `[(9,6),(6,2),(14,7)]`, with densities `[1.5, 3, 2]`, is right-skew.
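The definition translates almost literally into Haskell. This is a naive sketch that tries every cut, with names (`Elem`, `density`, `rightSkew`) chosen for illustration; elements are (area, breadth) pairs and breadths are assumed to be positive.

```haskell
import Data.Ratio ((%))

type Elem = (Integer, Integer)   -- (area, breadth), breadth > 0 assumed

-- Density of a non-empty list: total area divided by total breadth.
density :: [Elem] -> Rational
density xs = sum (map fst xs) % sum (map snd xs)

-- x is right-skew if it is non-empty and, for every cut into non-empty
-- x1 and x2, density x1 <= density x2.
rightSkew :: [Elem] -> Bool
rightSkew x =
  not (null x)
    && and [ density x1 <= density x2
           | k <- [1 .. length x - 1], let (x1, x2) = splitAt k x ]
```

The example above passes the check: `rightSkew [(9,6),(6,2),(14,7)]` is `True`, while swapping the first two elements gives a list that is not right-skew.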

Right-skew lists are useful because of the following property. Imagine placing a list `z` next to `x`, as depicted above. To find a maximally dense prefix of `z ⧺ x` starting with `z`, it is sufficient to consider only `z` and `z ⧺ x`: nothing in the middle, such as `z ⧺ x₁`, can be denser than the two ends!

Given a window with compulsory part `c` and optional part `x`, if we can partition `x` into `x₁ ⧺ x₂ ⧺ ... ⧺ xₙ`, such that `x₁`, `x₂`, … `xₙ` are all right-skew, then to compute the maximally dense prefix of `c ⧺ x`, we only need to consider `c`, `c ⧺ x₁`, `c ⧺ x₁ ⧺ x₂`,… and `c ⧺ x₁ ⧺ x₂ ⧺ ... ⧺ xₙ`.

Such a partition is always possible for any list `x`: after all, each element by itself constitutes a singleton right-skew list. However, there is one unique right-skew partition such that the densities of `x₁`, `x₂`, … `xₙ` are strictly decreasing. This is called the decreasing right-skew partition (DRSP) of `x`. We will partition the optional part of the window into its DRSP. A window now looks like the picture below:

Sharon summarised many nice properties of DRSP in the paper, for which we unfortunately do not have space here. We will only look at some properties that matter for this blog post. Firstly, consider the diagram below:

In the bottom row, the leftmost block is the density of `c`, and the second block is the density of `c ⧺ x₁`, etc. If segments `x₁`, `x₂`, … `xₙ` have decreasing densities, the densities of `c`, `c ⧺ x₁`, `c ⧺ x₁ ⧺ x₂`,… and `c ⧺ x₁ ⧺ x₂ ⧺ ... ⧺ xₙ` must be bitonic: first ascending, then descending. This helps to efficiently locate the maximally dense prefix.

Secondly, the DRSP can be built and maintained in a `foldr`. The following diagram depicts how the DRSP for the list of areas `[1,4,2,5,3]` (all with breadth `1`) can be built by adding elements from the left one by one (which eventually results in one big partition):

The rule is that a block newly added from the left keeps merging with the blocks to its right until it encounters a block shorter than itself. The top-left of the diagram indicates that the DRSP of `[3]` is itself. Since `5 > 3`, adding `5` results in a partition containing two segments. When `2` is added, it is merged with `5` to form a new segment with density `3.5`. No merging is triggered with the addition of `4` since `4 > 3.5` and thus `[4,3.5,3]` form a decreasing sequence. The newly added `1` first merges with `4`, forming a block having density `2.5`. Since `2.5 < 3.5`, it again merges with the block `[2,5]`. Eventually all elements are grouped into one segment with density `3`. One important thing here is that adding a new element only involves merging some initial parts of the DRSP.
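The merging rule can be written as a `foldr`. The following is a sketch and not the data structure from the paper: segments are plain lists joined with `++`, so a merge is not constant time here. Densities are compared by cross-multiplication, which assumes positive breadths. The names `Seg`, `notDenser`, `addElem` and `drsp` are hypothetical.

```haskell
type Elem = (Integer, Integer)   -- (area, breadth), breadth > 0 assumed

-- A segment with its cached total area and total breadth.
data Seg = Seg Integer Integer [Elem]

-- density s1 <= density s2, compared without division.
notDenser :: Seg -> Seg -> Bool
notDenser (Seg a1 b1 _) (Seg a2 b2 _) = a1 * b2 <= a2 * b1

-- Add an element on the left: keep merging with the next segment as long as
-- the new block is not strictly denser than it.
addElem :: Elem -> [Seg] -> [Seg]
addElem (a, b) = go (Seg a b [(a, b)])
  where
    go s (t : ts) | s `notDenser` t = go (merge s t) ts
    go s ts = s : ts
    merge (Seg a1 b1 e1) (Seg a2 b2 e2) = Seg (a1 + a2) (b1 + b2) (e1 ++ e2)

-- Segments of the partition, built from right to left.
drsp :: [Elem] -> [[Elem]]
drsp = map (\(Seg _ _ es) -> es) . foldr addElem []
```

On the areas `[4,2,5,3]`, all with breadth `1`, this yields the three segments `[4]`, `[2,5]` and `[3]`; adding `1` in front collapses everything into a single segment, as in the diagram.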

Algorithm Overview

Recall that our algorithm computes, for each suffix, a prefix (a window) that is possibly optimal and contains enough information to compute all future optimal solutions. Since a feasible prefix is wider than `L`, we split it into a (yellow) compulsory part and a (blue) optional part. To obtain a linear time algorithm, we have to compute one row from the previous row in amortised constant time (the corresponding diagram is duplicated here):

The diagram below depicts how to go from one row to the next. The blue part is partitioned into DRSP. Each time an element is added to the yellow part, some elements may be shifted to the blue part, and that may trigger some right-skew segments in the blue part to be merged (second row). Then we look for a maximally dense prefix by going from right to left, chopping away segments, until we find the peak (third row):

Note that the operation shown on the third row (chopping to find the maximum) always chops away a right-skew segment in its entirety. It is important that the merging happens at the left end of the optional part, while the chopping happens at the right end. By using a tree-like data structure, each merging can be an `O(1)` operation. With this data structure, we may argue that, since each element can be merged at most once, only `O(n)` merges can happen throughout the algorithm. Similarly, each element can be chopped away at most once, so chopping happens at most `O(n)` times as well. Therefore the operations in the second and third rows above are both amortised `O(1)`.

Problem with Having an Upper Bound

The discussion so far already allows us to develop an algorithm for the maximally dense segment problem without an upper bound on the breadth of feasible segments. Having the upper bound makes the problem much harder because, different from the chopping depicted above, an upper bound may cut through a right-skew segment in the middle:

And a right-skew segment, with some elements removed, might not be right-skew anymore!

Our solution is to develop another data structure that allows efficient removal from the right end of a DRSP, while maintaining the DRSP structure. The final configuration of a window looks like the diagram below, where the new data structure is represented by the green blocks:

Unfortunately, it is inefficient to add new elements from the left into the green blocks. Therefore we have to maintain the window in a way similar to how a queue is implemented using two lists. New elements are added from the left into the blue blocks; when we need to remove elements from the right of a block, it is converted to a green block in large chunks.
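For readers who have not seen it, the two-list queue that the window imitates can be sketched as follows. This shows only the analogy, not the blue and green blocks of the paper; the names `Queue`, `emptyQ`, `pushLeft` and `popRight` are mine.

```haskell
-- A queue as two lists: elements enter at the head of `front`
-- and leave from the head of `back`, which is stored in reverse.
-- The sequence represented is  front ++ reverse back.
data Queue a = Queue [a] [a]

emptyQ :: Queue a
emptyQ = Queue [] []

-- Add an element at the left end: O(1).
pushLeft :: a -> Queue a -> Queue a
pushLeft x (Queue front back) = Queue (x : front) back

-- Remove the rightmost element. When `back` runs out, the whole of `front`
-- is converted in one large chunk, which is what makes the cost amortised O(1).
popRight :: Queue a -> Maybe (a, Queue a)
popRight (Queue front (y : back)) = Just (y, Queue front back)
popRight (Queue front []) = case reverse front of
  []       -> Nothing
  y : back -> Just (y, Queue [] back)
```

Each element is moved from `front` to `back` at most once, which is the same style of argument as the one used for merging and chopping above.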

For more details, see the paper!

References

1. Chung, Kai-Min and Lu, Hsueh-I. An Optimal Algorithm for the Maximum-Density Segment Problem. SIAM Journal on Computing 34(2):373-387, 2004.
2. Goldwasser, Michael H. and Kao, Ming-Yang and Lu, Hsueh-I. Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications. Journal of Computer and System Sciences, 70(2):128-144, 2005.
3. Huang, Xiaoqiu. An algorithm for identifying regions of a DNA sequence that satisfy a content requirement. Computer Applications in the Biosciences 10(3):219-225, 1994.
4. Zantema, Hans. Longest segment problems. Science of Computer Programming, 18(1):39-66, 1992.

Functional pearl: maximally dense segments

Sharon Curtis and Shin-Cheng Mu. Submitted.
[PDF]

errata:
• Page 3: “This input sequence does not have a solution…” what we meant was “This input does not have a prefix that is within bounds.” We used another example, in which the input does not have a feasible segment at all, before changing to the current example, but I forgot to change the text accordingly.
• Page 4, Proof of Theorem 3.2: the first `mdsM x ⇑d win (a:x)` should be `mdsM x ⇑d wp (trim (a:x))`; `a : x <b L` and `a : x ≥b L` should respectively be `trim (a : x) <b L` and `trim (a : x) ≥b L`.
• Thanks to Josh Ko for pointing out both errors.

The problem of finding a maximally dense segment (MDS) of a list is a generalisation of the well-known maximum segment sum (MSS) problem, but its solution is more challenging. We extend and illuminate some recent work on this problem with a formal development of a linear-time online algorithm, in the form of a sliding window zygomorphism. The development highlights some elegant properties of densities, involving partitions which are decreasing and all right-skew.

Code and supplementary proofs are available online.

keywords: program derivation, segment problem, maximum density, sliding window, zygomorphism, right-skew.

The Maximum Segment Sum Problem: Its Origin, and a Derivation

In a previous paper of mine, regrettably, I wrongly attributed the origin of the maximum segment sum problem to Dijkstra and Feijen’s Een methode van programmeren. In fact, the story behind the problem was told very well in Jon Bentley’s Programming Pearls.

The Problem, and the Linear-Time Algorithm

Given a list of numbers, the task is to compute the largest possible sum of a consecutive segment. In a functional language the problem can be specified by:

`````` mss = max . map sum . segments
``````

where `segments = concat . map inits . tails` enumerates all segments of the input list, `map sum` computes the sum of each of the segments, before `max :: Ord a ⇒ [a] → a` picks the maximum. The specification, if executed, is a cubic time algorithm. Yet there is a linear time algorithm scanning through the list only once:

``````mss = snd . foldr step (0,0)
  where step x (p,s) = (0 ↑ (x+p), (0 ↑ (x+p)) ↑ s)
``````

where `a ↑ b` yields the maximum of `a` and `b`.
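Both definitions can be run side by side. The sketch below is a direct transcription into plain Haskell, assuming `Integer` elements, with the Prelude's `maximum` for the article's list-level `max` and the binary `max` for `↑`; the name `mssSpec` is mine.

```haskell
import Data.List (inits, tails)

-- Specification: try every segment (cubic time).
mssSpec :: [Integer] -> Integer
mssSpec = maximum . map sum . segments
  where segments = concatMap inits . tails

-- The linear-time program, with `max` standing in for the article's (↑).
mss :: [Integer] -> Integer
mss = snd . foldr step (0, 0)
  where step x (p, s) = let p' = max 0 (x + p) in (p', max p' s)
```

On `[-1,3,3,-4]` both return `6` (the segment `[3,3]`), and on the empty list both return `0`, the sum of the empty segment.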

Both the specification and the linear time program are short. The program is merely a `foldr` that can be implemented as a simple for-loop in an imperative language. Without some reasoning, however, it is not that trivial to see why the program is correct (hint: the `foldr` computes a pair of numbers, the first one being the maximum sum of all prefixes of the given list, while the second is the maximum sum of all segments). Derivation of the program (given below) is mostly mechanical, once you learn the basic principles of program calculation. Thus the problem has become a popular choice as the first non-trivial example of program derivation.

Origin

Jon Bentley recorded in Programming Pearls that the problem was proposed by Ulf Grenander of Brown University. In a pattern-matching procedure he designed, a subarray having maximum sum is the most likely to yield a certain pattern in a digitised image. The two dimensional problem took too much time to solve, so he simplified it to one dimension in order to understand its structure.

In 1977 [Grenander] described the problem to Michael Shamos of UNILOGIC, Ltd. (then of Carnegie-Mellon University) who overnight designed Algorithm 3. When Shamos showed me the problem shortly thereafter, we thought that it was probably the best possible; … A few days later Shamos described the problem and its history at a Carnegie-Mellon seminar attended by statistician Jay Kadane, who designed Algorithm 4 within a minute.

Jon Bentley, Programming Pearls (1st edition), page 76.

Jay Kadane’s Algorithm 4 is the now well-known linear time algorithm, the imperative version of the functional program above:

``````maxpre, maxseg = 0, 0
for i in range(0, N):                # a is the input array, N its length
    maxpre = max(0, maxpre + a[i])   # max(x, y) plays the role of x ↑ y
    maxseg = max(maxpre, maxseg)
``````

Algorithm 3, on the other hand, is a divide and conquer algorithm. An array `a` is split into two halves `a₁ ⧺ a₂`, and the algorithm recursively computes the maximum segment sums of `a₁` and `a₂`. However, there could be some segment across `a₁` and `a₂` that yields a good sum. Therefore the algorithm performs two additional loops, respectively computing the maximum suffix sum of `a₁` and the maximum prefix sum of `a₂`, whose sum is the maximum sum of a segment crossing the boundary. The algorithm runs in `O(N log N)` time. (My pseudo Python translation of the algorithm is given below.)

In retrospect, Shamos did not have to compute the maximum prefix and suffix sums in two loops each time. The recursive function could have computed a quadruple of (maximum prefix sum, maximum segment sum, maximum suffix sum, and sum of the whole array) for each array. The prefix and suffix sums could thus be computed bottom-up. I believe that would result in an `O(N)` algorithm. This linear time complexity might suggest that the “divide” is superficial: we do not have to divide the array in the middle. It is actually easier to divide the array into a head and a tail, which was perhaps how Kadane quickly came up with Algorithm 4!
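The bottom-up idea can be sketched in Haskell as follows. The names `Summary`, `single`, `combine`, `summary` and `mssDC` are mine, chosen for illustration. The sketch works on lists rather than arrays, so the splitting itself is not free; the point is that `combine` takes constant time, and on an array indexed by bounds that gives the recurrence `T(N) = 2 T(N/2) + O(1)`.

```haskell
-- (maximum prefix sum, maximum segment sum, maximum suffix sum, total sum)
type Summary = (Integer, Integer, Integer, Integer)

single :: Integer -> Summary
single x = (max 0 x, max 0 x, max 0 x, x)

-- Summarise the concatenation of two arrays from their summaries, in O(1).
combine :: Summary -> Summary -> Summary
combine (p1, s1, q1, t1) (p2, s2, q2, t2) =
  ( max p1 (t1 + p2)           -- a prefix ends in the left half, or covers it entirely
  , maximum [s1, s2, q1 + p2]  -- best segment: in the left, in the right, or crossing
  , max q2 (q1 + t2)           -- a suffix starts in the right half, or covers it entirely
  , t1 + t2 )

summary :: [Integer] -> Summary
summary []  = (0, 0, 0, 0)
summary [x] = single x
summary xs  = combine (summary ys) (summary zs)
  where (ys, zs) = splitAt (length xs `div` 2) xs

mssDC :: [Integer] -> Integer
mssDC xs = let (_, s, _, _) = summary xs in s
```

For example, `summary [-1,3,3,-4]` is `(5,6,2,1)`: the best prefix is `[-1,3,3]`, the best segment is `[3,3]`, the best suffix is `[3,3,-4]`, and the total is `1`.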

A Functional Derivation

I learned the functional derivation of the maximum segment sum problem from one of Jeremy’s papers [3] and was amazed. It was perhaps one of the early incidents that inspired my interest in program calculation. The derivation does not appear to be very well known outside the program derivation circle, not even among functional programmers, so I would like to redo it here.

The first few steps of the derivation go:

``````   max . map sum . segments
=    { definition of segments }
   max . map sum . concat . map inits . tails
=    { since map f . concat = concat . map (map f) }
   max . concat . map (map sum) . map inits . tails
=    { since max . concat = max . map max }
   max . map max . map (map sum) . map inits . tails
=    { since map f . map g = map (f.g) }
   max . map (max . map sum . inits) . tails
``````

The purpose of the book-keeping transformation above is to push `max . map sum` closer to `inits`. The fragment `max . map sum . inits` is a function which, given a list of numbers, computes the maximum sum among all its prefixes. We denote it by `mps`, for maximum prefix sum. The specification has been transformed to:

``   mss = max . map mps . tails ``

This is a common strategy for segment problems: to solve a problem looking for an optimal segment, proceed by looking for an optimal prefix of each suffix. (Symmetrically we could process the list the other way round, look for an optimal suffix for each prefix.)

We wish that `mps` for each of the suffixes can be efficiently computed in an incremental manner. For example, to compute `mps [-1,3,3,-4]`, rather than actually enumerating all prefixes, we wish that it can be computed from `-1` and `mps [3,3,-4] = 6`, which can in turn be computed from `3` and `mps [3,-4] = 3`, all in constant time. In other words, we wish that `mps` is a `foldr` using a constant time step function. If this is true, one can imagine that we could efficiently implement `map mps . tails` in linear time. Indeed, `scanr f e = map (foldr f e) . tails`!
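The `scanr` law is easy to try out. The sketch below simply names the two sides of the equation (`viaTails` and `viaScanr` are my names) so that they can be compared for any `f` and `e`.

```haskell
import Data.List (tails)

-- The two sides of  scanr f e = map (foldr f e) . tails.
viaTails, viaScanr :: (a -> b -> b) -> b -> [a] -> [b]
viaTails f e = map (foldr f e) . tails   -- quadratic: refolds every suffix
viaScanr     = scanr                     -- linear: reuses the previous result
```

With `f = (+)` and `e = 0`, both sides compute the sums of all suffixes: on `[-1,3,3,-4]` they yield `[1,2,-1,-4,0]`.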

The aim now is to turn `mps = max . map sum . inits` into a `foldr`. Luckily, `inits` is actually a `foldr`. In the following we will perform `foldr`-fusion twice, respectively fusing `map sum` and `max` into `inits`, thus turning the entire expression into a `foldr`.

The first fusion goes:

``````   max . map sum . inits
=    { definition of inits }
   max . map sum . foldr (\x xss -> [] : map (x:) xss) [[]]
=    { fold fusion, see below }
   max . foldr zplus [0]
``````

The fusion condition can be established below, through which we also construct the definition of `zplus`:

``````   map sum ([] : map (x:) xss)
=  0 : map (sum . (x:)) xss
=    { by definition, sum (x : xs) = x + sum xs }
   0 : map (x+) (map sum xss)
=    { define zplus x xss = 0 : map (x+) xss }
   zplus x (map sum xss)
``````
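The result of this first fusion can be checked on examples. In the sketch below, `initsF` is `inits` written as the `foldr` used in the derivation, and `sumsSpec` and `sumsFused` (my names) are the two sides of `map sum . inits = foldr zplus [0]`, specialised to `Integer`.

```haskell
import Data.List (inits)

-- inits written as a foldr, as in the derivation.
initsF :: [a] -> [[a]]
initsF = foldr (\x xss -> [] : map (x :) xss) [[]]

-- The step function constructed by the fusion condition.
zplus :: Integer -> [Integer] -> [Integer]
zplus x sums = 0 : map (x +) sums

sumsSpec, sumsFused :: [Integer] -> [Integer]
sumsSpec  = map sum . inits     -- before fusion
sumsFused = foldr zplus [0]     -- after fusion
```

On `[-1,3,3,-4]` both give the prefix sums `[0,-1,2,5,1]`.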

We continue with the derivation and perform another fusion:

``````   max . foldr zplus [0]
=    { fold fusion, let zmax x y = 0 ↑ (x+y) }
   foldr zmax 0
``````

For the second fold fusion to work, we have to prove the following fusion condition:

``````   max (0 : map (x+) xs)
=  0 ↑ max (map (x+) xs)
=    { since  max (map (x +) xs) = x + max xs }
   0 ↑ (x + max xs)
``````

The property `max (map (x +) xs) = x + max xs` in the last step follows from the fact that `(+)` distributes over `(↑)`, that is, `(x + y) ↑ (x + z) = x + (y ↑ z)`. This is the key property that allows the whole derivation to work.
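Both the distributivity property and the fusion condition can be tested on small inputs. This is a sanity check rather than a proof; `distributes` and `fusionStep` are names I made up. Note that `max (map (x +) xs) = x + max xs` needs `xs` to be non-empty, which holds here because the list of prefix sums always contains the `0` of the empty prefix.

```haskell
zmax :: Integer -> Integer -> Integer
zmax x y = max 0 (x + y)

-- (+) distributes over max: the two sides agree for all x, y, z.
distributes :: Integer -> Integer -> Integer -> Bool
distributes x y z = max (x + y) (x + z) == x + max y z

-- The fusion condition for one step, on a non-empty list xs.
fusionStep :: Integer -> [Integer] -> Bool
fusionStep x xs = maximum (0 : map (x +) xs) == zmax x (maximum xs)
```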

By performing `foldr`-fusion twice we have established that

``mps = foldr zmax 0``

In words, `mps (x : xs)`, the best prefix sum of `x : xs`, can be computed by `zmax x (mps xs)`. The definition of `zmax` says that if `x + mps xs` is positive, it is the maximum prefix sum; otherwise we return `0`, the sum of the empty prefix.
Therefore, `mss` can be computed by a `scanr`:

``````   mss
=    { reasoning so far }
   max . map (foldr zmax 0) . tails
=    { introducing scanr }
   max . scanr zmax 0
``````

We have derived `mss = max . scanr zmax 0`, where `zmax x y = 0 ↑ (x+y)`.

Many functional derivations stop here. This gives us an algorithm that runs in linear time, but takes linear space. A tupling transformation eliminates the need for linear space:

``  mss = snd . (head &&& max) . scanr zmax 0 ``

where `(f &&& g) a = (f a, g a)`. The part `(head &&& max) . scanr zmax 0` returns a pair, the first component being the result of `mps`, the second `mss`. By some mechanical simplification we get the final algorithm:

``````mss = snd . foldr step (0,0)
  where step x (p,s) = (0 ↑ (x+p), (0 ↑ (x+p)) ↑ s)
``````
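The effect of the tupling step can be observed by comparing the pair computed before and after the simplification. In this sketch, `tupled` and `folded` are my names, `(&&&)` is taken from `Control.Arrow`, and `maximum` and the binary `max` stand in for the article's `max` and `↑`.

```haskell
import Control.Arrow ((&&&))

zmax :: Integer -> Integer -> Integer
zmax x y = max 0 (x + y)

-- Before simplification: build the whole scan, then take its head and its maximum.
-- scanr never returns an empty list, so head and maximum are safe here.
tupled :: [Integer] -> (Integer, Integer)
tupled = (head &&& maximum) . scanr zmax 0

-- After simplification: the same pair, maintained by a single fold in constant space.
folded :: [Integer] -> (Integer, Integer)
folded = foldr step (0, 0)
  where step x (p, s) = let p' = zmax x p in (p', max p' s)
```

On `[-1,3,3,-4]` the scan is `[5,6,3,0,0]`, so both functions return `(5,6)`: the maximum prefix sum and the maximum segment sum.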

A Relational Derivation?

The maximum segment sum problem later turned out to be an example of Richard and Oege’s Greedy Theorem [2]. It is an exercise in the Algebra of Programming book, but I have not seen the solution given anywhere. For completeness, I recorded a relational derivation in a paper of mine about some other variations of the maximum segment sum problem [4].

References

1. Bentley, Jon. Programming Pearls. Addison-Wesley, Inc, 1987.
2. Bird, Richard and de Moor, Oege. Algebra of Programming. Prentice-Hall, 1997.
3. Gibbons, Jeremy. Calculating Functional Programs. Proceedings of ISRG/SERG Research Colloquium, Oxford Brookes University, November 1997.
4. Mu, Shin-Cheng. Maximum segment sum is back: deriving algorithms for two segment problems with bounded lengths. Partial Evaluation and Program Manipulation (PEPM ’08), pp 31-39. January 2008.

Appendix: Algorithm 3

``````
def mss(l, u):
    # a is the input array; max(x, y) plays the role of x ↑ y
    if l > u:
        return 0               # empty array
    elif l == u:
        return max(0, a[l])    # singleton array
    else:
        m = (l + u) // 2

        # compute maximum suffix sum of a[l..m]
        total, maxToLeft = 0, 0
        for i in range(m, l - 1, -1):
            total = total + a[i]
            maxToLeft = max(maxToLeft, total)
        # compute maximum prefix sum of a[m+1..u]
        total, maxToRight = 0, 0
        for i in range(m + 1, u + 1):
            total = total + a[i]
            maxToRight = max(maxToRight, total)
        maxCrossing = maxToLeft + maxToRight

        # recursively compute mss of a[l..m] and a[m+1..u]
        maxInL = mss(l, m)
        maxInR = mss(m + 1, u)
        return max(maxInL, maxCrossing, maxInR)
``````