Category Archives: Research Blog

Proving the Church-Rosser Theorem Using a Locally Nameless Representation

Around 2 years ago, for an earlier project of mine (which has not seen the light of day yet!) in which I had to build a language with variables and prove its properties, I surveyed a number of ways to handle binders. For some background: people have noticed that, when proving properties about a language with bound variables, one often ends up spending most of the effort proving “uninteresting” things about names, substitution, weakening, etc. For formal theorem proving to be practical, this effort has to be reduced. A number of approaches have been proposed. Among them, the “locally nameless” style appeared to be interesting and promising.

With the only reference I used, Engineering Formal Metatheory by Aydemir et al., which outlines the principal ideas of the approach, I imagined how it works and tried to implement my own version in Agda. My first few implementations, however, all ended up in a mess. There appeared to be an endless number of properties to prove. Besides the complexity of the language I was implementing, there must have been something I got wrong about the locally nameless representation. Realising that I could not finish the project this way, I eventually decided to learn from the basics.

I started with the tutorial by Chargueraud, which comes with complete code in Coq, and followed in his footsteps using Agda. The task is a classical one: define the untyped λ-calculus and its reduction rules, and prove the Church-Rosser theorem.

Term

At an abstract level, there is nothing too surprising. We define a syntax for untyped λ-calculus that distinguishes between free and bound variables:

data Term : Set where
  bv : (i : BName) → Term
  fv : (x : FName) → Term
  ƛ : (e : Term) → Term
  _·_ : (e₁ : Term) → (e₂ : Term) → Term

where BName = ℕ represents bound variables by de Bruijn indices, while FName is the type of free variables. The latter can be any type that supports equality check and a method that generates a new variable not in a given set (in fact, a List) of variables. If one takes FName = String, the expression λ x → x y, where y occurs free, is represented by ƛ (bv 0 · fv "y"). For ease of implementation, one may take FName = ℕ as well.

Not all terms you can build are valid. For example, ƛ (bv 1 · fv "y") is not a valid term since there is only one ƛ binder. How do we distinguish the valid terms from invalid ones? I would (and did) switch to a dependent datatype Term n, indexed by the number of enclosing binders, and let BName = Fin n. The index is passed top-down and is incremented each time we encounter a ƛ. Closed terms are then represented by the type Term 0.

The representation above, if it works, has the advantage that any term that can be built at all must be valid. Choosing such a representation was perhaps the first thing I did wrong, however. Chargueraud mentioned a similar predicate that also passes the “level” information top-down, and claimed that the predicate, on which we will have to perform induction to prove properties about terms, does not fit the usual pattern of induction. This was probably why I had so much trouble proving properties about terms.

The way to go, instead, is to use a predicate that assembles the information bottom up. The predicate LC (“locally-closed” — a term is valid if it is locally closed) is defined by:

data LC : Term → Set where
  fv : ∀ x → LC (fv x)
  ƛ : (L : FNames) → ∀ {e} → 
       (fe : ∀ {x} → (x∉L : x ∉ L) → LC ([ 0 ↦ fv x ]  e)) → LC (ƛ e)
  _·_ : ∀ {e₁ e₂} → LC e₁ → LC e₂ → LC (e₁ · e₂)

A free variable alone is a valid term. An application f · e is valid if both f and e are. And an abstraction ƛ e is valid if e becomes a valid term after we substitute any free variable x for the first (0-th) bound variable. There can be an additional constraint on x, that it is not in L, a finite set of “protected” variables — such co-finite quantification is one of the features of the locally nameless style.

The “open” operator [ n ↦ t ] e substitutes the term t for the n-th bound variable in e. It is defined by

[_↦_] : ℕ → Term → Term → Term
[ n ↦ t ] (bv i) with n ≟ i
...        | yes _ = t
...        | no _ = bv i
[ n ↦ t ] (fv y) = fv y
[ n ↦ t ] (ƛ e) = ƛ ([ suc n ↦ t ] e)
[ n ↦ t ] (e₁ · e₂) = [ n ↦ t ] e₁ · [ n ↦ t ] e₂

Note how n is incremented each time we go into a ƛ. A dual operator,

[_↤_] : ℕ → FName → Term → Term

performs the opposite: [ n ↤ x ] e replaces occurrences of the free variable x in e by the n-th bound variable.
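
For concreteness, here is a rough Haskell transcription of the two operators — a sketch of my own, not the Agda code accompanying this post — with bound variables as plain naturals and free variables as strings:

data Term = BV Int | FV String | Lam Term | App Term Term
  deriving (Eq, Show)

-- "open": substitute the term t for the n-th bound variable.
open' :: Int -> Term -> Term -> Term
open' n t (BV i) | n == i    = t
                 | otherwise = BV i
open' _ _ (FV y)      = FV y
open' n t (Lam e)     = Lam (open' (n + 1) t e)
open' n t (App e1 e2) = App (open' n t e1) (open' n t e2)

-- "close": the dual, turning occurrences of the free variable x
-- into the n-th bound variable.
close :: Int -> String -> Term -> Term
close _ _ (BV i)      = BV i
close n x (FV y) | x == y    = BV n
                 | otherwise = FV y
close n x (Lam e)     = Lam (close (n + 1) x e)
close n x (App e1 e2) = App (close n x e1) (close n x e2)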

β Reduction and Church-Rosser Theorem

Small-step β reduction can be defined by:

data _β→_ : Term → Term → Set where
  β-red : ∀ {t₁ t₂} → Body t₁ → LC t₂ 
          → ((ƛ t₁) · t₂) β→ [ 0 ↦ t₂ ] t₁
  β-app₁ : ∀ {t₁ t₁' t₂} → LC t₂ 
           → t₁ β→ t₁'
           → (t₁ · t₂) β→ (t₁' · t₂)
  β-app₂ : ∀ {t₁ t₂ t₂'} → LC t₁ 
           → t₂ β→ t₂'
           → (t₁ · t₂) β→ (t₁ · t₂')
  β-ƛ : ∀ L {t₁ t₁'}
           → (∀ x → x ∉ L → ([ 0 ↦ fv x ] t₁) β→ ([ 0 ↦  fv x ] t₁'))
           → ƛ t₁ β→ ƛ t₁'

where β-red reduces a redex, β-app₁ and β-app₂ allow reduction respectively on the left and right hand sides of an application, and β-ƛ goes into a ƛ abstraction — again we use co-finite quantification.

Given _β→_ we can define its reflexive, transitive closure _β→*_, and the reflexive, transitive, symmetric closure _β≣_. The aim is to prove that _β→*_ is confluent:

β*-confluent : 
  ∀ {m s t} → (m β→* s) → (m β→* t)
                  → ∃ (λ u → (s β→* u) × (t β→* u))

which leads to the Church-Rosser property:

β-Church-Rosser : ∀ {t₁ t₂} → (t₁ β≣ t₂)
                    → ∃ (λ t → (t₁ β→* t) × (t₂ β→* t))

At an abstract level, the proof follows the classical route: it turns out that it is easier to prove the confluence of a “parallel reduction” relation _⇉_, which allows β reduction to happen in several places of a term in one step. We then prove that _β→*_ is equivalent to _⇉*_, thereby proving the confluence of _β→*_ as well. All these can be carried out in a relatively nice and clean manner.

The gory details, however, hide in proving the infrastructural properties supporting the abstract view of the proofs.

Infrastructures

Confluence, Church-Rosser… these are the interesting things we want to prove. However, we often end up spending most of the time proving those infrastructural properties we have to prove — which is why there has been so much recent research hoping to find better representations that simplify them. The locally nameless style is supposed to be such a representation. (Another, orthogonal topic is to seek generic representations such that the proofs can be done once for all languages.)

In my code, most of these properties are piled up in the file Infrastructure.agda. They range from properties you might expect to have:

open-term : ∀ k t {e} → LC e → e ≡ [ k ↦ t ] e
close-var-open-aux : ∀ k x e → LC e → e ≡ [ k ↦ fv x ] ([ k ↤ x ] e)

to ones that are not so obvious:

open-var-inj : ∀ k x t u → x ∉ fvars t → x ∉ fvars u
               → [ k ↦ fv x ] t ≡ [ k ↦ fv x ] u → t ≡ u
open-term-aux : ∀ j t i u e → ¬ (i ≡ j)
                → [ j ↦ t ] e ≡ [ i ↦ u ] ([ j ↦ t ] e) 
                → e ≡ [ i ↦ u ] e

The lemma open-var-inj is one of the early lemmas that appeared in Chargueraud’s tutorial, which might give one the impression that it is an easy first lemma to prove. On the contrary, it is among the tedious ones — I needed a 40-line proof (most of the cases were simply eliminated by contradiction, though).

It takes experience and intuition to know what lemmas are needed. Had I not been assured that the approach works, I would have thought something must have gone wrong when I found myself having to prove weird-looking lemmas like:

close-var-rec-open : 
  ∀ x y z t i j 
  → ¬(i ≡ j) → ¬(y ≡ x) → y ∉ fvars t
  → [ i ↦ fv y ] ([ j ↦ fv z ] ([ j ↤ x ] t))
    ≡ [ j ↦ fv z ] ([ j ↤ x ] ([ i ↦ fv y ] t))

which is not easy to prove either.

The Bottom Line?

So, is the locally nameless representation what it claims to be — a way to represent binders that simplifies the infrastructural proofs and scales up more easily? When I was struggling with some of the proofs in Infrastructure.agda I did wonder whether the claim is true only for Coq, with its cleverly designed proof tactics, but not for Agda, where everything is done by hand (so far). Once the infrastructural proofs were done, however, the rest was carried out very pleasantly.

To make a fair comparison, I should re-implement everything again using de Bruijn notation. That has to wait till some other time, though. (Anyone want to give it a try?)

It could be the case that, while some proofs are easily dismissed in Coq using tactics, in Agda the programmer should develop some more abstractions. I did find myself repeating some proof patterns, and found one or two lemmas that are not present in Chargueraud’s tutorial which, when used in Agda, simplify the proofs a bit. There could be more, but at this moment I am perhaps too involved in the details to see the patterns from a higher viewpoint.

The exercise does pay off, though. Now I feel I am much more familiar with this style, and am perhaps more prepared to use it in my own project.

Code

A zip file containing all the code.

References

  1. Brian Aydemir, Arthur Chargueraud, Benjamin Pierce, Randy Pollack, and Stephanie Weirich. Engineering Formal Metatheory. POPL ’08.
  2. Arthur Chargueraud. The Locally Nameless Representation. Journal of Automated Reasoning, May 2011

Calculating Programs from Galois Connections

Galois connections are ubiquitous in mathematics and computing science. One is often amazed that, once two functions are identified as a Galois connection, a long list of nice and often useful properties follows from one concise, elegant defining equation. But how does one construct a program from a specification given as a Galois connection? This is the topic of a recent work of José Nuno Oliveira and me, and this article is an advertisement.

Galois Connections as Specifications

In program construction one often encounters program specification of the form “… the smallest such number”, “the longest prefix of the input list satisfying …”, etc. A typical example is whole number division: given a natural number x and a positive integer y, x / y is the largest natural number that, when multiplied by y, is at most x. For another example, the Haskell function takeWhile p returns the longest prefix of the input list such that all elements satisfy predicate p.

Such specifications can be seen as consisting of two parts. The easy part specifies a collection of solution candidates: numbers that are at most x after multiplication with y, or all prefixes of the input list. The hard part, on the other hand, picks one optimal solution, such as the largest, the longest, etc., among the collection.

Our goal is to calculate programs for such specifications. But how best should the specification be given in the first place? Take division for example: one might start from a specification that literally translates our description above into mathematics:

    x / y = ⋁{ z | z * y ≤ x } 

As we know, however, suprema are in general not easy to handle. One could also explicitly name the remainder:

    z = x / y  ≡  (∃ r : 0 ≤ r < y : x = z * y + r)

at the cost of existentially quantifying over the remainder.

A third option looks surprisingly simpler: given x and y, the value x / y is such that for all z,

    z * y ≤ x  ≡  z ≤ x / y        (1)

Why is this sufficient as a definition of x / y? Firstly, by substituting x / y for z, the right hand side of (1) reduces to true, and we obtain on the left hand side (x / y) * y ≤ x. This tells us that x / y is a candidate --- it satisfies the easy part of the specification. Secondly, read the definition from left to right: z * y ≤ x ⇒ z ≤ x / y. It says that x / y is the largest among all the numbers satisfying the easy part.
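
As a quick sanity check, the defining property can be tested directly. The following Haskell snippet (the name divGC is mine, purely for illustration) enumerates small cases, assuming y is positive and x, z are natural numbers:

-- Property (1) for whole-number division, testable in GHCi.
divGC :: Int -> Int -> Int -> Bool
divGC x y z = (z * y <= x) == (z <= x `div` y)

-- e.g.  and [ divGC x y z | x <- [0..30], y <- [1..10], z <- [0..30] ]  ==  True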

Equations of the form (1) are called Galois connections. Given preorders ⊑ and ≤, functions f and g form a Galois connection if for all x and z we have

    f z ⊑ x  ≡  z ≤ g x        (2)

The function f is called the lower adjoint and g the upper adjoint.

The definition of division above is a Galois connection where f = (* y) and g = (/ y). For another example, takeWhile p can be specified as an upper adjoint:

    map p? zs ⊑ xs  ≡  zs ⊑ takeWhile p xs        (3)

where ⊑ is the prefix ordering: ys ⊑ xs if ys is a prefix of xs, and map p? is a partial function: map p? xs = xs if p x holds for each x in xs.

We love Galois connections because once two functions are identified as such, a long list of useful properties follows: f (g x) ⊑ x, z ≤ g (f z), f and g are monotonic, and are inverses of each other in the other's range... etc.

These are all very nice. But can one calculate a program from a Galois connection? Given ⊑, ≤, and f, how does one construct g?

The "Shrink" Operator

José discovered and proposed a relational operator to handle such calculations. To use the operator, we have to turn the Galois connection (2) into point-free style. We look at the left hand side of (2): f z ⊑ x, and try to write it as a relation between z and x. Let f° denote the relational converse of f -- roughly, think of it as the inverse function of f, that is, it maps f z to z -- and let ∘ denote relational composition -- function composition extended to relations. Thus f z ⊑ x translates to

    f° ∘ (⊑)

It is a relation between z and x: putting x on the left hand side of f° ∘ (⊑), it relates, through (⊑), to f z, which is then mapped to z through f°.

Then we wish that f° ∘ (⊑) can be transformed into a (relational) fold or unfold, which is often the case because the defining components — (⊑), (≤), and f — are often folds or unfolds. Consider the lower adjoint of takeWhile p in (3). Since (⊑), the relation that takes a list and returns a prefix of the list, can be defined as a fold on lists, (map p?)° ∘ (⊑), by fold fusion, is also a fold. Consider (1): since (≤) and (* y) are both folds on natural numbers, (* y)° ∘ (≤) can be both a fold and an unfold.

In our paper we showed that a Galois connection (2) can be transformed into

    g = (f° ∘ (⊑)) ↾ (≥)

where ↾ is the new operator José introduced. The relation S ↾ R, pronounced "S shrunk by R", is a sub-relation of S that yields, for each input, an optimal result under relation R. Note that the equation makes the easy/hard division explicit: f° ∘ (⊑) is the easy part — we want a solution z that satisfies f z ⊑ x — while (≥) is the criterion we use, in the hard part, to choose an optimal solution.
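
To get a feel for what S ↾ R means, here is a small, hedged model in Haskell (my own rendering, not the relational-algebra development in the paper), with a relation represented as a finite list of (input, output) pairs:

type Rel a b = [(a, b)]

-- S ↾ R keeps, for each input, only those outputs that are related by R
-- to every output S associates with that input.
shrink :: Eq a => Rel a b -> (b -> b -> Bool) -> Rel a b
shrink s r =
  [ (a, b) | (a, b) <- s
           , and [ r b b' | (a', b') <- s, a' == a ] ]

-- With r = (>=), shrink keeps, for each input, a largest associated output.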

The ↾ operator is similar to the min operator of Bird and de Moor, without having to use sets (which needs a power allegory). It satisfies a number of useful properties. In particular, we have theorems stating when (↾ R) promotes into folds and unfolds. For example,

    (fold S) ↾ R ⊇ fold (S ↾ R)

if R is transitive and S is monotonic on R.

With the theorems we can calculate g. Given g, specified as an upper adjoint in a Galois connection with lower adjoint f, we first try to turn f° ∘ (⊑) into a fold or an unfold, and then apply the theorems to promote (↾ (≥)). For more details, take a look at our paper!

Evaluating Simple Polynomials

At the end of FLOLAC ’10 I had a chance to show the students, in 25 minutes, what functional program calculation is about. The students had been exposed to functional programming just a week before, in a three-hour course, after which they had written some simple programs handling concrete data, but they might have had trouble grasping more abstract concepts like folds. I have talked to them about maximum segment sum way too many times (in the context of imperative program derivation, though), and it is perhaps too complex to cover in 25 minutes. The steep list problem, on the other hand, can be dealt with in 5 minutes. Thus I needed another example.

This is what I eventually came up with: given a list of numbers a₀, a₁, a₂, …, aₙ and a constant X, compute a₀ + a₁X + a₂X² + ... + aₙXⁿ. In Haskell it can be specified as a one-liner:

  poly as = sum (zipWith (×) as (iterate (×X) 1))

One problem of this example is that the specification is already good enough: it is a nice linear time algorithm. To save some multiplications, perhaps, we may try to further simplify it.

It is immediate that poly [] = 0. For the non-empty case, we reason:

   poly (a : as)
 =   { definition of poly }
   sum (zipWith (×) (a:as) (iterate (×X) 1))
 =   { definition of iterate }
   sum (zipWith (×) (a:as) (1 : iterate (×X) X))
 =   { definition of zipWith } 
   sum (a : zipWith (×) as (iterate (×X) X))
 =   { definition of sum }
   a + sum (zipWith (×) as (iterate (×X) X))

The expression to the right of a + is unfortunately not poly as — the last argument to iterate is X rather than 1. One possibility is to generalise poly to take another argument. For this problem, however, we can do slightly better:

   a + sum (zipWith (×) as (iterate (×X) X))
 =   { since iterate f (f b) = map f (iterate f b) }
   a + sum (zipWith (×) as (map (×X) (iterate (×X) 1)))
 =   { zipWith (⊗) as . map (⊗X) = map (⊗X) . zipWith (⊗) as 
          if ⊗ associative }
   a + sum (map (×X) (zipWith (×) as (iterate (×X) 1)))
 =   { sum . map (×X) = (×X) . sum }
   a + (sum (zipWith (×) as (iterate (×X) 1))) × X
 =   { definition of poly }
   a + (poly as) × X

We have thus come up with the program

  poly [] = 0
  poly (a : as) = a + (poly as) × X
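
For reference, here is a self-contained version in which the constant X is passed as a parameter, so that the specification and the derived program can be run and compared (the names polySpec and polyDerived are mine):

polySpec :: Num a => a -> [a] -> a
polySpec x as = sum (zipWith (*) as (iterate (* x) 1))

polyDerived :: Num a => a -> [a] -> a
polyDerived _ []       = 0
polyDerived x (a : as) = a + polyDerived x as * x

-- e.g.  polySpec 2 [1,2,3] == polyDerived 2 [1,2,3]   -- both yield 17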


Besides the definitions of sum, zipWith, iterate, etc, the rules used include:

  1. map f (iterate f x) = iterate f (f x)
  2. zipWith (⊗) as . map (⊗X) = map (⊗X) . zipWith (⊗) as if ⊗ is associative
  3. sum . map (×X) = (×X) . sum, a special case of foldr ⊕ e . map (⊗X) = (⊗X) . foldr ⊕ e if (a ⊕ b) ⊗ X = (a ⊗ X) ⊕ (b ⊗ X) and e ⊗ X = e.

Well, this is not a very convincing example. Ideally I’d like to have a derivation, like the steep list, where we gain some improvement in complexity by calculation.

What is your favourite example for functional program calculation?

Sum of Squares of Differences

In the final exam of the Program Construction course in FLOLAC ’10, I gave the students this problem (from Kaldewaij’s book):

|[ con N {N ≥ 2}; a : array [0..N) of int;
   var r : int;
   S
   { r = (Σ i,j : 0 ≤ i < j < N : (a.i - a.j)²) }
]|

In words, given an array of integers having at least two elements, compute the sum of squares of the difference between all pairs of elements. (Following the convention of the guarded command language, function application is written f.x, and an array is seen as a function from indices to values.)

It is not hard to quickly write up an O(N²) program using nested loops, which, I have to confess, is what I would have done before reading Kaldewaij’s book and realising that it is possible to do the task in linear time using one loop. Unfortunately, not many students managed to come up with this solution, therefore I think it is worth some discussion.

Quantifiers

Before we solve the problem, let us review the “Dutch style” quantifier syntax and rules. Given a commutative, associative binary operator ⊕ with unit element e, if we informally denote the (integral) values in the interval [A .. B) by i₀, i₁, i₂, …, iₙ, the quantified expression:

   (⊕ i : A ≤ i < B : F.i)

informally denotes F.i₀ ⊕ F.i₁ ⊕ F.i₂ ⊕ ... ⊕ F.iₙ. More generally, if all values satisfying predicate R can be enumerated as i₀, i₁, i₂, …, iₙ, the expression

   (⊕ i : R.i : F.i)

denotes F.i₀ ⊕ F.i₁ ⊕ F.i₂ ⊕ ... ⊕ F.iₙ. We omit the i in R.i and F.i when there can be no confusion.

A more formal characterisation of the quantified expression is given by the following rules:

  1. (⊕ i : false : F.i) = e
  2. (⊕ i : i = x : F.i) = F.x
  3. (⊕ i : R : F) ⊕ (⊕ i : S : F) = (⊕ i : R ∨ S : F) ⊕ (⊕ i : R ∧ S : F)
  4. (⊕ i : R : F) ⊕ (⊕ i : R : G) = (⊕ i : R : F ⊕ G)
  5. (⊕ i : R.i : (⊕ j : S.j : F.i.j)) = (⊕ j : S.j : (⊕ i : R.i : F.i.j))

Rules 1, 2, and 3 give rise to a useful rule "split off n": consider i such that 0 ≤ i < n + 1. If n > 0, the set of possible values of i can be split into two subsets: 0 ≤ i < n and i = n. By rules 3 (reversed), 1, and 2 we get:

  (⊕ i : 0 ≤ i < n + 1 : F.i) = (⊕ i : 0 ≤ i < n : F.i) ⊕ F.n

Expressions quantifying over more than one variable can be expressed in terms of quantifiers over single variables:

    (⊕ i,j : R.i ∧ S.i.j : F.i.j) = (⊕ i : R.i : (⊕ j : S.i.j : F.i.j))

If ⊗ distributes over ⊕, we have an additional property:

    x ⊗ (⊕ i : R : F) = (⊕ i : R : x ⊗ F)

As a convention, (+ i : R : F) is often written (Σ i : R : F).

Computing the Sum of Squares of Differences

The first step is to turn the constant N into a variable n. The main worker of the program is going to be a loop, in which we try to maintain the invariant:

   P  ≣  r = (Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²)

At the end of the loop body we increment n, and the loop terminates when n coincides with N:

   { Inv: P ∧ 2 ≤ n ≤ N , Bound: N - n}
   do n ≠ N → ... ; n := n + 1 od

We shall then find out how to update r before n := n + 1 in a way that preserves P.

Assume that P and 2 ≤ n ≤ N hold. To find out how to update r, we substitute n + 1 for n in the desired value of r:

   (Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²)[n+1 / n]
 = (Σ i,j : 0 ≤ i < j < n + 1 : (a.i - a.j)²)
 =   { split off j = n }
   (Σ i,j : 0 ≤ i < j < n : (a.i - a.j)²) + 
   (Σ i : 0 ≤ i < n : (a.i - a.n)²)
 =   { P }
   r + (Σ i : 0 ≤ i < n : (a.i - a.n)²)

This is where most people stop the calculation and start constructing a loop computing (Σ i : 0 ≤ i < n : (a.i - a.n)²). One might later realise, however, that most computations are repeated. Indeed, the expression above can be expanded further:

   r + (Σ i : 0 ≤ i < n : (a.i - a.n)²)
 =   { (x - y)² = x² - 2xy + y² }
   r + (Σ i : 0 ≤ i < n : a.i² - 2 × a.i × a.n + a.n²)
 =   { Rule 4 }
   r + (Σ i : 0 ≤ i < n : a.i²) 
     - (Σ i : 0 ≤ i < n : 2 × a.i × a.n) 
     + (Σ i : 0 ≤ i < n : a.n²)
 =   { a.n is a constant, multiplication distributes into addition }
   r + (Σ i : 0 ≤ i < n : a.i²) 
     - 2 × (Σ i : 0 ≤ i < n : a.i) × a.n 
     + (Σ i : 0 ≤ i < n : a.n²)
 =   { simplifying the last term }
   r + (Σ i : 0 ≤ i < n : a.i²) 
     - 2 × (Σ i : 0 ≤ i < n : a.i) × a.n + n × a.n²

which suggests that we can store the values of (Σ i : 0 ≤ i < n : a.i²) and (Σ i : 0 ≤ i < n : a.i) in two additional variables:

  Q₀  ≣  s = (Σ i : 0 ≤ i < n : a.i²)
  Q₁  ≣  t = (Σ i : 0 ≤ i < n : a.i)

It merely takes some routine calculation to find out how to update s and t. The resulting code is:

|[ con N {N ≥ 2}; a : array [0..N) of int;
   var r, s, t, n : int;

   r, s, t, n := (a.0 - a.1)², a.0² + a.1², a.0 + a.1, 2
   { Inv: P ∧ Q₀ ∧ Q₁ ∧ 2 ≤ n ≤ N , Bound: N - n }
 ; do n ≠ N → 
      r := r + s - 2 × t × a.n + n × a.n²;
      s := s + a.n²;
      t := t + a.n;
      n := n + 1 
   od
   { r = (Σ i,j : 0 ≤ i < j < N : (a.i - a.j)²) }
]|
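
As a sanity check, the loop can be transliterated into Haskell (a sketch of my own, not part of the derivation): the quadruple (r, s, t, n) is threaded through a left fold, and starting from (0, 0, 0, 0) subsumes the initialisation for n = 2 above.

ssdSpec :: [Int] -> Int
ssdSpec a = sum [ (x - y) ^ 2 | (i, x) <- zip [0 ..] a
                              , (j, y) <- zip [0 ..] a, i < j ]

ssdFast :: [Int] -> Int
ssdFast a = result
  where
    (result, _, _, _) = foldl step (0, 0, 0, 0) a
    -- step mirrors the loop body: update r, then s, t, and n
    step (r, s, t, n) x =
      (r + s - 2 * t * x + n * x * x, s + x * x, t + x, n + 1)

-- e.g.  ssdSpec [3, 1, 4, 1, 5] == ssdFast [3, 1, 4, 1, 5]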

Another “One Loop” Solution

Among those students who did come up with a program, most of them resorted to a typical two-loop, O(N²) solution. Given that this 9-hour course is, for almost all of them, their first exposure to program derivation, I shall perhaps be happy enough that around 3 to 4 out of 38 students came up with something like the program above.

One student, however, delivered a program I did not expect to see:

|[ con N {N ≥ 2}; a : array [0..N) of int;
   var r, i, j : int;

   r, i, j := 0, 0, 0
   { Inv: ... ∧ 0 ≤ i ≤ j ∧ 0 ≤ j ≤ N, Bound: ? }
 ; do j ≠ N →
       if i < j → r := r + (a.i - a.j)²;  i := i + 1
        | i = j → i, j := 0, j + 1
       fi
   od
]|

The program uses only one loop, but is still O(N²) — on a closer inspection one realises that it is actually simulating the inner loop manually. Still, I’d be happy if the student could show me a correctness proof, with a correct loop invariant and a bound, since both of them are more complex than what I expected them to learn. Unfortunately, in the answer handed in, the program, the invariant, and the bound all contain some bugs. Anyone wants to give it a try?

An Exercise Utilising Galois Connections

Given two partial orders (A, ⊑), (B, ≼), two functions f : A → B, g : B → A form a Galois connection between them if for all a : A, b : B we have

  f a ≼ b ≣ a ⊑ g b

We will refer to this defining property as “GC” later. The function f is called the lower adjoint and g the upper adjoint of the Galois connection. Galois connections are interesting because once two functions are identified as such, they immediately satisfy a rich collection of useful properties:

  • letting a := g b in GC, we get f (g b) ≼ b;
  • letting b := f a, we get a ⊑ g (f a);
  • f is monotonic, since:
     f a₁ ≼ f a₂
    ≣   { GC }
     a₁ ⊑ g (f a₂)
    ⇐  {  since a ⊑ g (f a) }
     a₁ ⊑ a₂
  • similarly, g is monotonic: b₁ ≼ b₂ ⇒ g b₁ ⊑ g b₂,

and many more.

In the recent work of Sharon and me on maximally dense segments we needed quite a number of functions to be monotonic, idempotent, etc. It only occurred to me after submitting the paper: could they be defined as Galois connections? The number of properties we needed in the paper is huge and it would be nice to establish them on fewer basic properties. And it looks prettier.

Longest Prefix Up to a Certain Sum

One such function is trim in the paper, but it is sufficient to consider a simplification: let sam : [Int] → [Int] (for “sum at most”) return the longest prefix of the input list whose sum is no larger than a constant U. Denote “x is a prefix of y” by x ⊑ y. We want to show that sam satisfies

  • monotonicity: x ⊑ y ⇒ sam x ⊑ sam y, and
  • idempotence: sam (sam x) = sam x.

Can they be derived by defining sam as a Galois connection?

I learned from José N. Oliveira’s talk A Look at Program “G”alculation at the IFIP WG 2.1 #65 meeting how (the uncurried version of) take can be defined as a Galois connection. It turns out that sam is just the same. We consider a slight generalisation sam' : (Int, [Int]) → [Int] that takes the upper bound as a parameter. It can be characterised by:

sum y ≤ b  ∧  y ⊑ x   ≣   y ⊑ sam' (b, x)
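
As an executable sketch (mine, not from the paper), and assuming b ≥ 0 and non-negative list elements — as breadths are in the maximally dense segment setting — so that prefix sums are monotone, sam' is simply the longest prefix whose sum stays within the bound:

import Data.List (inits)

-- Longest prefix of x whose sum is at most b; with the assumptions above,
-- the qualifying prefixes are exactly the prefixes of the answer.
sam' :: (Int, [Int]) -> [Int]
sam' (b, x) = last [ y | y <- inits x, sum y <= b ]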

There is in fact a Galois connection hidden already! To see that, define ⟨f, g⟩ a = (f a, g a) (in the Haskell Hierarchical Libraries it is defined in Control.Arrow as &&&), and denote the product of binary relations by ×, that is, if a ≤ b and x ⊑ y then (a,x) is related to (b,y) by ≤×⊑. We write a composed relation as an infix operator by surrounding it with square brackets: (a,x) [≤×⊑] (b,y).

Using these notations, the defining equation of sam' can be rewritten as:

⟨sum, id⟩ y [≤×⊑] (b,x)   ≣   y ⊑ sam' (b,x)

Thus sam' is the upper adjoint in a Galois connection between ((Int, [Int]), ≤×⊑) and ([Int], ⊑)!

Now that ⟨sum, id⟩ and sam' form a Galois connection, we have:

  • f (g b) ≼ b instantiates to ⟨sum, id⟩ (sam' (b,x)) [≤×⊑] (b,x), that is, sum (sam' (b,x)) ≤ b and sam' (b,x) ⊑ x;
  • a ⊑ g (f a) instantiates to x ⊑ sam' (sum x, x). Together with the previous property we have x = sam' (sum x, x);
  • monotonicity of the lower adjoint instantiates to y₁ ⊑ y₂ ⇒ sum y₁ ≤ sum y₂ ∧ y₁ ⊑ y₂;
  • monotonicity of the upper adjoint instantiates to
    (b₁,x₁) [≤×⊑] (b₂,x₂)   ⇒   sam' (b₁,x₁) ⊑ sam' (b₂,x₂) 

    that is

    b₁ ≤ b₂   ∧  x₁ ⊑ x₂   ⇒   sam' (b₁,x₁) ⊑ sam' (b₂,x₂) 

    a generalisation of the monotonicity we want.

Finally, to show idempotence, we reason

   sam' (b₁, x) ⊑ sam' (b₁, sam' (b₂, x))
≣   { GC }
   ⟨sum, id⟩ (sam' (b₁, x)) [≤×⊑]  (b₁, sam' (b₂, x))
≣   { definitions }
   sum (sam' (b₁, x)) ≤ b₁   ∧   sam' (b₁, x) ⊑ sam' (b₂, x)
⇐  { properties above }
   b₁ ≤ b₂

These are all nice and pretty. There is another function, however, that is much harder to deal with, which I will write about next time.

Finding Maximally Dense Segments

Sharon and I have finally concluded, for now, our work on the maximally dense segment problem (draft, with an errata already!), on which we have been working on and off for the past two years. Considering the algorithm itself and its derivation/proofs, I am quite happy with what we have achieved. The algorithm is rather complex, however, and it is a challenge presenting it in an accessible way. Sharon has done a great job polishing the paper, and I do hope more people would be interested in reading it and it would, eventually, inspire more work on interesting program derivations.

The basic form of the problem looks like a natural variation of the classical maximum segment sum problem: given a list of numbers, find a consecutive segment whose average, that is, sum divided by length, is maximum. The problem would be trivial without more constraints, since one could simply return the largest element, thus we usually impose a lower bound L on the length of feasible segments.

It was noticed by Huang [3] that a segment having maximum average need not be longer than 2L - 1: given a segment of 2L elements or more, we cut it in the middle. If the two halves have different averages, we keep the one with the larger average. Otherwise the two halves have the same average. Either way, we get a shorter, feasible segment whose average is not lower. The fact hints at a trivial O(nL) algorithm: for each suffix of the list, find its best prefix up to 2L - 1 elements long.

A difficult challenge, however, is to come up with an algorithm that is O(n), independently of L. The problem can be generalised to the case where the elements do not have length 1, but each has a width, and the goal is to maximise the density — the sum of the elements divided by the sum of their widths. It makes the problem slightly more complicated, but does not change its nature. If we go on to impose an upper bound U on the length as well, however, the problem becomes much more difficult. There was a published algorithm that claimed to be linear, only to be found not so. We discovered that two later algorithms, which appeared to have concluded the problem, also fail for a boundary case. The bug is easy to fix for one of the algorithms, but might not be so for the other.

Our algorithm closely relates to that of Chung and Lu [1] and that of Goldwasser et al [2]. The algorithm is perhaps too complex to present in detail in a blog post (that’s why we need a paper!), but I will try to give an outline using pictures from the paper, my slides and poster.

One of the ways to visualise the problem is to see each element as a block, the number being the area of the block and the density its height. The input is a list of (area, width) pairs, and the goal is to find a consecutive segment maximising the height. Shown below is the input list [(9,6),(6,2),(14,7),(20,4),(-10,5),(20,8),(-2,2),(27,6)], and the dashed line is their average height:

Notice that an area can be negative. In the paper, since the letter w is used for “window” (to be explained below), we instead refer to the width as “breadth”.

Prefixes of Suffixes, and the Window

Many optimal segment problems (finding some optimal segment of a given list) are solved by finding, for each suffix, its optimal prefix, as shown below. Each bar is a suffix of the input, and the blue part is its optimal prefix:

It is preferable that an optimal prefix of a : x can be computed from the optimal prefix of x, that is, that the function computing the optimal prefix is a foldr. If this is true, the algorithm may keep a pair of (optimal segment, optimal prefix). Each time a new element is read, it computes the new optimal prefix from the previous optimal prefix, and updates the optimal segment if the new prefix is better. If you like structured recursion (or the so-called “origami programming”), this form of computation is an instance of a zygomorphism.

For each optimal prefix to be computable from the previous optimal prefix, it may not extend further than the latter. We do not want the following to happen:

However, it appears to be possible for the maximally dense prefix! Imagine adding a very small, or even negative, area. We might get a denser prefix by extending further to the right, since the denominator is larger.

The first theorem we had to prove aimed to show that it does not matter — if a maximally dense prefix extends further than the previous one, it is going to be suboptimal anyway. Thus it is safe if we always start from the right end of the previous prefix. That is, we do not compute the maximally dense prefix of the entire input, but merely the maximally dense prefix of the previous prefix.

This is an instance of the sliding window scheme proposed by Zantema [4]. The blue part is like a “window” of the list, containing enough information to guarantee the correctness of the algorithm. As the algorithm progresses, the two ends of the window keep sliding to the left, hence the name.

To formally show that the window contains enough information to compute the maximally dense segment, we have to clearly state what window is, and what invariant it satisfies. It turned out to be quite tricky to formally state the intuition that “the window does not always give you the optimal prefix, but it does when it matters,” and was the first challenge we met.

Since we aim at computing a segment at least L units in breadth, it might be handy to split the window into a “compulsory part” (the shortest prefix that is at least L units wide) and the rest, the “optional part”. The algorithm thus looks like this:

where the yellow bars are the compulsory parts and blue bars the optional parts. Each time we read an element into the compulsory part, zero or more elements (since the elements have non-uniform breadths) may be shifted from the compulsory part to the optional part. Then we compute a maximally dense prefix (the yellow and the blue parts together) that does not extend further than the previous one. The best among all these prefixes is the maximally dense segment.

We want a linear time algorithm, which means that all the computation from a pair of yellow-blue bars to the next pair has to be done in (amortised) constant time — how is that possible at all? To do so we will need to exploit some structure in the optional part, based on properties of density and segments.

Right-Skew Segments, and the DRSP

A non-empty list of elements x is called right-skew if, for every non-empty x₁ and x₂ such that x₁ ⧺ x₂ = x, we have density x₁ ≤ density x₂. Informally, a right-skew list is drawn as the blue wavy block below:

The rising wavy slope informally hints that the right half has a higher density than the left half wherever you make the cut. However, the drawing risks the misunderstanding that a right-skew segment is a list of elements with ascending areas or densities. Note that neither the areas nor the densities of individual elements have to be ascending. For example, the list [(9,6),(6,2),(14,7)], with densities [1.5, 3, 2], is right-skew.

Right-skew lists are useful because of the following property. Imagine placing a list z next to x, as depicted above. To find a maximally dense prefix of z ⧺ x starting with z, it is sufficient to consider only z and z ⧺ x — nothing in the middle, such as z ⧺ x₁, can be denser than both of the two ends!

Given a window with compulsory part c and optional part x, if we can partition x into x₁ ⧺ x₂ ⧺ ... ⧺ xₙ, such that x₁, x₂, …, xₙ are all right-skew, then to compute the maximally dense prefix of c ⧺ x, we only need to consider c, c ⧺ x₁, c ⧺ x₁ ⧺ x₂, … and c ⧺ x₁ ⧺ x₂ ⧺ ... ⧺ xₙ.

Such a partition is always possible for any list x — after all, each element by itself constitutes a singleton right-skew list. However, there is one unique right-skew partition such that the densities of x₁, x₂, …, xₙ are strictly decreasing. This is called the decreasing right-skew partition (DRSP) of x. We will partition the optional part of the window into its DRSP. A window now looks like the picture below:

Sharon summarised many nice properties of DRSP in the paper, for which we unfortunately do not have space here. We will only look at some properties that matter for this blog post. Firstly, consider the diagram below:

In the bottom row, the leftmost block is the density of c, and the second block is the density of c ⧺ x₁, etc. If segments x₁, x₂, …, xₙ have decreasing densities, the densities of c, c ⧺ x₁, c ⧺ x₁ ⧺ x₂, … and c ⧺ x₁ ⧺ x₂ ⧺ ... ⧺ xₙ must be bitonic — first ascending, then descending. It helps to efficiently locate the maximally dense prefix.

Secondly, the DRSP can be built and maintained in a foldr. The following diagram depicts how the DRSP for the list of areas [1,4,2,5,3] (all with breadth 1) can be built by adding elements from the left one by one (eventually resulting in a partition consisting of one single segment):

The rule is that a block newly added from the left keeps merging with the blocks to its right until it encounters a block shorter (that is, less dense) than itself. The top-left of the diagram indicates that the DRSP of [3] is itself. Since 5 > 3, adding 5 results in a partition containing two segments. When 2 is added, it is merged with 5 to form a new segment with density 3.5. No merging is triggered by the addition of 4, since 4 > 3.5 and thus [4, 3.5, 3] form a decreasing sequence. The newly added 1 first merges with 4, forming a block having density 2.5. Since 2.5 < 3.5, it again merges with the block [2,5]. Eventually all elements are grouped into one segment with density 3. One important thing here is that adding a new element only involves merging some initial part of the DRSP.
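
The merging rule can be phrased as a foldr. Below is a hedged Haskell sketch of my own (the names are not from the paper), where a block is represented by its total area and breadth:

type Block = (Rational, Rational)   -- (area, breadth)

density :: Block -> Rational
density (a, b) = a / b

merge :: Block -> Block -> Block
merge (a1, b1) (a2, b2) = (a1 + a2, b1 + b2)

-- Add one element, as a singleton block, to the left of a DRSP;
-- keep merging while the densities fail to be strictly decreasing.
addL :: Block -> [Block] -> [Block]
addL x (y : ys) | density x >= density y = addL (merge x y) ys
addL x ys                                = x : ys

drsp :: [Block] -> [Block]
drsp = foldr addL []

-- e.g.  drsp [ (a, 1) | a <- [1, 4, 2, 5, 3] ]  ==  [(15, 5)]   -- density 3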

Algorithm Overview

Recall that our algorithm computes, for each suffix, a prefix (a window) that is possibly optimal and contains enough information to compute all future optimal solutions. Since a feasible prefix is wider than L, we split it into a (yellow) compulsory part and a (blue) optional part. To obtain a linear time algorithm, we have to compute one row from the previous row in amortised constant time (the corresponding diagram is duplicated here):

The diagram below depicts how to go from one row to the next. The blue part is partitioned into DRSP. Each time an element is added to the yellow part, some elements may be shifted to the blue part, and that may trigger some right-skew segments in the blue part to be merged (second row). Then we look for a maximally dense prefix by going from right to left, chopping away segments, until we find the peak (third row):

Note that the operation shown on the third row (chopping to find the maximum) always chops away a right-skew segment in its entirety. It is important that the merging happens at the left end of the optional part, while the chopping happens at the right end. By using a tree-like data structure, each merging can be an O(1) operation. With the data structure, we may argue that, since each element can be merged at most once, throughout the algorithm only O(n) mergings could happen. Similarly, each element can be chopped away at most once, so the chopping could happen at most O(n) times as well. Therefore the operations in the second and third rows above are both amortised O(1).

Problem with Having an Upper Bound

The discussion so far already allows us to develop an algorithm for the maximally dense segment problem without an upper bound on the breadth of feasible segments. Having the upper bound makes the problem much harder because, different from the chopping depicted above, an upper bound may cut through a right-skew segment in the middle:

And a right-skew segment, with some elements removed, might not be right-skew anymore!

Our solution is to develop another data structure that allows efficient removal from the right end of a DRSP, while maintaining the DRSP structure. The final configuration of a window looks like the diagram below, where the new data structure is represented by the green blocks:

Unfortunately, it is inefficient to add new elements from the left into the green blocks. Therefore we have to maintain the window in a way similar to how a queue is implemented using two lists. New elements are added from the left into the blue blocks; when we need to remove elements from the right, blue blocks are converted to green blocks in large chunks.

For more details, see the paper!

References

  1. Chung, Kai-Min and Lu, Hsueh-I. An Optimal Algorithm for the Maximum-Density Segment Problem. SIAM Journal on Computing 34(2):373-387, 2004.
  2. Goldwasser, Michael H. and Kao, Ming-Yang and Lu, Hsueh-I. Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications. Journal of Computer and System Sciences, 70(2):128-144, 2005.
  3. Huang, Xiaoqiu. An algorithm for identifying regions of a DNA sequence that satisfy a content requirement. Computer Applications in the Biosciences, 10(3):219-225, 1994.
  4. Zantema, Hans. Longest segment problems. Science of Computer Programming, 18(1):39-66, 1992.

The Maximum Segment Sum Problem: Its Origin, and a Derivation

In a previous paper of mine, regrettably, I wrongly attributed the origin of the maximum segment sum problem to Dijkstra and Feijen’s Een methode van programmeren. In fact, the story behind the problem was told very well in Jon Bentley’s Programming Pearls.

The Problem, and the Linear-Time Algorithm

Given a list of numbers, the task is to compute the largest possible sum of a consecutive segment. In a functional language the problem can be specified by:

 mss = max . map sum . segments

where segments = concat . map inits . tails enlists all segments of the input list, map sum computes the sum of each of the segments, before max :: Ord a ⇒ [a] → a picks the maximum. The specification, if executed, is a cubic time algorithm. Yet there is a linear time algorithm scanning through the list only once:

mss = snd . foldr step (0,0)
  where step x (p,s) = (0 ↑ (x+p), (0 ↑ (x+p)) ↑ s)

where a ↑ b yields the maximum of a and b.

Both the specification and the linear time program are short. The program is merely a foldr that can be implemented as a simple for-loop in an imperative language. Without some reasoning, however, it is not that trivial to see why the program is correct (hint: the foldr computes a pair of numbers, the first one being the maximum sum of all prefixes of the given list, while the second is the maximum sum of all segments). Derivation of the program (given below) is mostly mechanical, once you learn the basic principles of program calculation. Thus the problem has become a popular choice as the first non-trivial example of program derivation.

Origin

Jon Bentley recorded in Programming Pearls that the problem was proposed by Ulf Grenander of Brown University. In a pattern-matching procedure he designed, a subarray having maximum sum is the most likely to yield a certain pattern in a digitised image. The two-dimensional problem took too much time to solve, so he simplified it to one dimension in order to understand its structure.

In 1977 [Grenander] described the problem to Michael Shamos of UNILOGIC, Ltd. (then of Carnegie-Mellon University) who overnight designed Algorithm 3. When Shamos showed me the problem shortly thereafter, we thought that it was probably the best possible; … A few days later Shamos described the problem and its history at a Carnegie-Mellon seminar attended by statistician Jay Kadane, who designed Algorithm 4 within a minute.

Jon Bentley, Programming Pearls (1st edition), page 76.

Jay Kadane’s Algorithm 4 is the now well-known linear time algorithm, the imperative version of the functional program above:

maxpre, maxseg = 0, 0
for i in range (0, N):
     maxpre = 0 ↑ (maxpre + a[i])
     maxseg = maxpre ↑ maxseg

Algorithm 3, on the other hand, is a divide and conquer algorithm. An array a is split into two halves a₁ ⧺ a₂, and the algorithm recursively computes the maximum segment sums of a₁ and a₂. However, there could be some segment across a₁ and a₂ that yields a good sum, therefore the algorithm performs two additional loops respectively computing the maximum suffix sum of a₁ and the maximum prefix sum of a₂, whose sum is the maximum sum of a segment crossing the boundary. The algorithm runs in O(N log N) time. (My pseudo-Python translation of the algorithm is given below.)

In retrospect, Shamos did not have to compute the maximum prefix and suffix sums in two loops each time. The recursive function could have computed a quadruple of (maximum prefix sum, maximum segment sum, maximum suffix sum, sum of the whole array) for each array. The prefix and suffix sums could thus be computed bottom-up. I believe that would result in an O(N) algorithm. This linear time complexity might suggest that the “divide” is superficial — we do not have to divide the array in the middle. It is actually easier to divide the array into a head and a tail — which was perhaps how Kadane quickly came up with Algorithm 4!

A Functional Derivation

I learned the functional derivation of the maximum segment sum problem from one of Jeremy’s papers [3] and was very amazed. It was perhaps one of the early incidents that inspired my interest in program calculation. The derivation does not appear to be very well known outside the program derivation circle — not even among functional programmers — so I would like to redo it here.

The first few steps of the derivation goes:

   max . map sum . segments
=    { definition of segments }
   max . map sum . concat . map inits . tails
=    { since map f . concat = concat . map (map f) }
   max . concat . map (map sum) . map inits . tails 
=    { since max . concat = max . map max } 
   max . map max .  map (map sum) . map inits . tails
=    { since map f . map g = map (f.g) }
   max . map (max . map sum . inits) . tails

The purpose of the book-keeping transformation above is to push max . map sum closer to inits. The fragment max . map sum . inits is a function which, given a list of numbers, computes the maximum sum among all its prefixes. We denote it by mps, for maximum prefix sum. The specification has been transformed to:

   mss = max . map mps . tails 

This is a common strategy for segment problems: to solve a problem looking for an optimal segment, proceed by looking for an optimal prefix of each suffix. (Symmetrically we could process the list the other way round, look for an optimal suffix for each prefix.)

We wish that mps for each of the suffixes can be efficiently computed in an incremental manner. For example, to compute mps [-1,3,3,-4], rather than actually enumerating all suffixes, we wish that it can be computed from -1 and mps [3,3,-4] = 6, which can in turn be computed from 3 and mps [3,-4] = 3, all in constant time. In other words, we wish that mps is a foldr using a constant time step function. If this is true, one can imagine that we could efficiently implement map mps . tails in linear time. Indeed, scanr f e = map (foldr f e) . tails!

The aim now is to turn mps = max . map sum . inits into a foldr. Luckily, inits is actually a foldr. In the following we will perform foldr-fusion twice, respectively fusing map sum and max into inits, thus turning the entire expression into a foldr.

The first fusion goes:

   max . map sum . inits
=    { definition of inits }
   max . map sum . foldr (\x xss -> [] : map (x:) xss) [[]]
=    { fold fusion, see below }
   max . foldr zplus [0]

The fusion condition can be established below, through which we also construct the definition of zplus:

   map sum ([] : map (x:) xss)
=  0 : map (sum . (x:)) xss
=    { by definition, sum (x : xs) = x + sum xs }
   0 : map (x+) (map sum xss) 
=    { define zplus x xss = 0 : map (x+) xss }  
   zplus x (map sum xss)

We continue with the derivation and perform another fusion:

   max . foldr zplus [0]
=    { fold fusion, let zmax x y = 0 ↑ (x+y) }
   foldr zmax 0

For the second fold fusion to work, we have to prove the following fusion condition:

   max (0 : map (x+) xs)
=  0 ↑ max (map (x+) xs)
=    { since  max (map (x +) xs) = x + max xs }
   0 ↑ (x + max xs)

The property max (map (x +) xs) = x + max xs in the last step follows from the fact that addition distributes into (↑), that is, (x + y) ↑ (x + z) = x + (y ↑ z). This is the key property that allows the whole derivation to work.

By performing foldr-fusion twice we have established that

mps = foldr zmax 0

In words, mps (x : xs), the best prefix sum of x : xs, can be computed by zmax x (mps xs). The definition of zmax says that if x + mps xs is positive, it is the maximum prefix sum; otherwise we return 0, sum of the empty prefix.
Therefore, mss can be computed by a scanr:

   mss
=    { reasoning so far }
   max . map (foldr zmax 0) . tails
=    { introducing scanr }
   max . scanr zmax 0

We have derived mss = max . scanr zmax 0, where zmax x y = 0 ↑ (x+y).

Many functional derivations usually stop here. This gives us an algorithm that runs in linear time, but takes linear space. A tupling transformation eliminates the need for linear space:

  mss = snd . (head &&& max) . scanr zmax 0 

where (f &&& g) a = (f a, g a). The part (head &&& max) . scanr zmax 0 returns a pair, the first component being the result of mps, the second mss. By some mechanical simplification we get the final algorithm:

mss = snd . foldr step (0,0)
  where step x (p,s) = (0 ↑ (x+p), (0 ↑ (x+p)) ↑ s)
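
For completeness, here are the specification and the final program side by side in runnable form (a small test harness of my own). Note that the maximum over all segments includes the empty segment, matching the derived program's lower bound of 0:

import Data.List (inits, tails)

mssSpec :: [Int] -> Int
mssSpec = maximum . map sum . concatMap inits . tails

mssFast :: [Int] -> Int
mssFast = snd . foldr step (0, 0)
  where step x (p, s) = (0 `max` (x + p), (0 `max` (x + p)) `max` s)

-- e.g.  mssSpec [-1, 3, 3, -4] == mssFast [-1, 3, 3, -4]   -- both are 6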

A Relational Derivation?

The maximum segment sum problem later turned out to be an example of Richard and Oege’s Greedy Theorem [2]. It is an exercise in the Algebra of Programming book, but I have not seen the solution given anywhere. For completeness, I recorded a relational derivation in a paper of mine about some other variations of the maximum segment sum problem [4].

References

  1. Bentley, Jon. Programming Pearls. Addison-Wesley, Inc, 1987.
  2. Bird, Richard and de Moor, Oege. Algebra of Programming. Prentice-Hall, 1997
  3. Gibbons, Jeremy. Calculating Functional Programs. Proceedings of ISRG/SERG Research Colloquium, Oxford Brookes University, November 1997.
  4. Mu, Shin-Cheng. Maximum segment sum is back: deriving algorithms for two segment problems with bounded lengths. Partial Evaluation and Program Manipulation (PEPM ’08), pp 31-39. January 2008.

Appendix: Algorithm 3


def mss(l,u):
  if l > u: 
    return 0          # empty array
  elif l == u:
    return (0 ↑ a[l])  # singleton array
  else:
    m = (l + u) / 2

    # compute maximum suffix sum of a[l..m]
    sum, maxToLeft = 0, 0
    for i in range (m, l-1, -1):
      sum = sum + a[i]
      maxToLeft = maxToLeft ↑ sum
    # compute maximum prefix sum of a[m+1..u]
    sum, maxToRight = 0, 0
    for i in range (m+1, u+1):
      sum = sum + a[i]
      maxToRight = maxToRight ↑ sum
    maxCrossing = maxToLeft + maxToRight

    # recursively compute mss of a[l..m] and a[m+1..u]
    maxInL = mss(l,m)
    maxInR = mss(m+1,u)
    return (maxInL ↑ maxCrossing ↑ maxInR)

A Survey of Binary Search

If you think you know everything you need to know about binary search, but have not read Netty van Gasteren and Wim Feijen’s note The Binary Search Revisited, you should.

In the FLOLAC summer school this year we plan to teach the students some basics of Hoare logic and Dijkstra’s weakest precondition calculus — a topic surprisingly little known among Taiwanese students. Binary search seems to be a topic I should cover. A precise specification will be given later, but basically the requirement is like this: given a sorted array of N numbers and a value to search for, either locate the position where the value resides in the array, or report that the value is not present in the array, in O(log N) time.

Given that everybody should have learned about binary search in their algorithms class, you would not expect it to be a hard programming task. Yet in his popular book Programming Pearls, Jon Bentley noted that surprisingly few professional programmers managed to implement the algorithm without bugs at their first attempt. I tried it myself and, being very careful (and having learned something about program construction already), I produced a program which I believe to be correct, but rather inelegant. I tried it on my students and they did produce code with some typical bugs. So it seems to be a good example for a class about correct program construction.

The Van Gasteren-Feijen Approach

The surprising fact that van Gasteren and Feijen pointed out was that binary search does not apply only to sorted lists! In fact, the usual practice of comparing binary search to searching for a word in a dictionary is, according to van Gasteren and Feijen, a major educational blunder.

Van Gasteren and Feijen considered solving a more general problem: let M and N be integral numbers with M < N, and let Φ be a relation such that M Φ N (where Φ is written infix), with some additional constraints to be given later. The task is to find l such that

M ≤ l < N   ∧   l Φ (l+1)

This is the program:

  { M < N ∧ M Φ N }
  l, r := M, N
  { Inv: M ≤ l < r ≤ N  ∧  l Φ r,   Bound: r - l }
; do l+1 ≠ r →
    { l + 2 ≤ r }
    m := (l + r) / 2
  ; if m Φ r → l := m
    [] l Φ m → r := m
    fi
  od
  { M ≤ l < N  ∧  l Φ (l+1) }

Notice first that the loop guard l+1 ≠ r, if satisfied, guarantees that l and r are not adjacent numbers, therefore assigning m := (l + r) / 2 establishes l < m < r, and thus the bound r - l is guaranteed to decrease. The if statement clearly maintains the invariant, provided that at least one of the guards is always satisfied:

 l Φ r  ∧  l < m < r   ⇒   l Φ m  ∨  m Φ r        (*)

For Φ satisfying the condition above, at the end of the loop we will find some l such that l Φ (l+1).
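
The loop translates readily into a recursive function. Here is a hedged Haskell rendering of my own (not van Gasteren and Feijen's), parameterised by Φ as a Boolean function, assuming M < N, M Φ N, and that Φ satisfies (*):

-- Returns some l with m0 <= l < n0 and l `phi` (l+1).
bsearch :: (Int -> Int -> Bool) -> Int -> Int -> Int
bsearch phi m0 n0 = go m0 n0
  where
    go l r
      | l + 1 == r = l
      | m `phi` r  = go m r
      | otherwise  = go l m       -- by (*), l `phi` m holds here
      where m = (l + r) `div` 2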

What relations satisfy (*)? Examples given by van Gasteren and Feijen include:

  • i Φ j = a[i] ≠ a[j] for some array a. Van Gasteren and Feijen suggested using this as the example when introducing binary search.
  • i Φ j = a[i] < a[j],
  • i Φ j = a[i] × a[j] ≤ 0,
  • i Φ j = a[i] ∨ a[j], etc.

Searching for a Key in a Sorted Array

To search for a key K in an ascending-sorted array a, it seems that we could just pick this Φ:

i Φ j = a[i] ≤ K < a[j]

and check whether a[i] = K after the loop. There is only one problem, however -- we are not sure we can establish the precondition a[l] ≤ K < a[r]!

Van Gasteren and Feijen's solution is to add two imaginary elements to the two ends of the array. That is, for a possibly empty array a[0..N), we imagine two elements a[-1] such that a[-1] ≤ x and a[N] such that x < a[N] for all x. I believe this is equivalent to using this Φ:

i Φ j  =  (i = -1  ∨  a[i] ≤ K)  ∧  (K < a[j]  ∨  j = N)

which still satisfies (*) if a is sorted. And here is the program

  { 0 ≤ N ∧ -1 Φ N }
  l, r := -1, N
  { Inv: -1 ≤ l < r ≤ N  ∧  l Φ r,   Bound: r - l }
; do l+1 ≠ r →
    { l + 2 ≤ r }
    m := (l + r) / 2
  ; if a[m] ≤ K → l := m
    [] K < a[m] → r := m
    fi
  od
  { -1 ≤ l < N  ∧  l Φ (l+1) }
; if l > -1 → found := a[l] = K
  [] l = -1 → found := false
  fi

Do not worry about the idea of "adding" elements to a. The invariant implies that -1 < m < N, thus a[-1] and a[N] are never accessed, and the array a need not actually be altered. They are there only to justify the correctness of the program. This also enables us to handle possibly empty arrays, while the loop body seems to be designed for the case when the range [l..r) is non-empty.
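
As a small concrete instance of the above, the sorted-array search can be expressed by reusing the bsearch sketch given earlier, with the imaginary sentinels a[-1] and a[N] handled inside Φ itself (again my own rendering, assuming a 0-indexed list):

-- Uses the bsearch sketch from the previous section.
searchSorted :: [Int] -> Int -> Maybe Int
searchSorted a k = if l > -1 && a !! l == k then Just l else Nothing
  where
    n       = length a
    phi i j = (i == -1 || a !! i <= k) && (j == n || k < a !! j)
    l       = bsearch phi (-1) n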

Bentley's Program

Bentley's program for binary search in Programming Pearls can be rephrased as below:

  found := false
; l, r := 0, N-1
; do l ≤ r →
    m := (l + r) / 2
  ; if a[m] < K → l := m + 1
    [] a[m] = K → found := true; break
    [] K < a[m] → r := m - 1
    fi
  od

I would like to be able to derive this program in class, since this appears to be the more popular version. Apart from the presence of break, for which I do not yet know an easy variation of Hoare logic that helps to derive it, to relate the test a[m] < K to l := m + 1 I will have to bring in the fact that a is sorted at an earlier stage of the development. Thus it is harder to put it in a more general picture.

For several reasons I used to believe that Bentley's program could be preferred: for example, it seems to shrink the range more effectively, assigning l and r to m + 1 and m - 1, rather than m. On second thought I realised that this might not be true. Variable l can be assigned m + 1 because the possibility of a[m] = K is covered in another case with an early exit, and r is assigned m - 1 because this algorithm represents an array segment with an inclusive right bound, as opposed to the previous algorithm.

The two algorithms do not solve exactly the same problem. With multiple occurrences of K in the array, Bentley's algorithm is non-deterministic about which index it returns, while the van Gasteren-Feijen algorithm, as enforced by the specification, always returns the largest index. When K does not appear in the array, van Gasteren and Feijen's program could be more efficient because it needs only one comparison in the loop, rather than two as in Bentley's case (I am assuming that the last comparison is a catch-all case and need not be implemented). What if K does appear in the array? An analysis by Timothy J. Rolfe concluded that a single-comparison approach is still preferable on average -- the benefit of the early exit does not outweigh the cost of the extra comparison in the loop.

On Computing the Middle Index

There are some other interesting stories regarding the assignment m := (l + r) / 2. Joshua Bloch from Google noted that for large arrays, adding l and r may cause an overflow, and Bloch was not picking on Bentley -- the bug was reported to Sun. Bloch suggests one of the following:

int m = l + ((r - l) / 2);                          /* for Java/C/C++ */
int m = (l + r) >>> 1;                              /* for Java */
m = ((unsigned int)l + (unsigned int)r) >> 1;       /* for C/C++ */

Since the publication of the blog post, there have been numerous discussions on whether it should be considered a bug in the binary search algorithm or in the integer datatype, and on more machine-dependent issues, such as whether one may have an array so large that it cannot be indexed by an int, etc.
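
To see the overflow concretely, here is a tiny demonstration (my own example, relying on the wrapping behaviour of GHC's fixed-width Int32; the names midNaive and midSafe are mine):

import Data.Int (Int32)

midNaive, midSafe :: Int32 -> Int32 -> Int32
midNaive l r = (l + r) `div` 2       -- wraps around once l + r exceeds 2^31 - 1
midSafe  l r = l + (r - l) `div` 2   -- stays in range whenever 0 <= l <= r

For instance, midNaive 2000000000 2100000000 comes out negative, while midSafe gives 2050000000.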

To be more language-independent, on the other hand, Roland Backhouse in his book Program Construction: Calculating Implementations from Specifications suggested using (l + r - 1) / 2, such that the value of m will be ⌊(l + r)/2⌋, regardless of whether the integral division operator actually rounds the number up or down.

Exercise?

Among the exercises suggested by van Gasteren and Feijen, this one caught my eye: let the array a[0..N), with 0 < N, be the concatenation of a strictly increasing and a strictly decreasing array. Use binary search to find the maximum element. (For this problem I think it is reasonable to assume that the two sub-arrays may be empty, while a is non-empty.) This happens to be a sub-routine I needed for an O(N log N) algorithm for the maximum segment density problem (there are linear-time algorithms for this problem, though), and I do remember I started off treating it as an unimportant sub-routine but had a hard time getting it right. I am glad that now I know more about it.
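
For what it is worth, here is my own attempt at the exercise as a Haskell sketch (not van Gasteren and Feijen's solution), maintaining the invariant that the maximum lies within [l..r]:

maxBitonic :: (Int -> Int) -> Int -> Int
maxBitonic a n = go 0 (n - 1)
  where
    go l r
      | l == r          = a l
      | a m < a (m + 1) = go (m + 1) r   -- still climbing: the maximum is to the right of m
      | otherwise       = go l m         -- at or past the peak: the maximum is at m or to its left
      where m = (l + r) `div` 2

Since l < r implies l ≤ m < r, the access a (m + 1) stays within bounds.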

References

A “Side-Swapping” Lemma Regarding Minimum, Using Enriched Indirect Equality

Yu-Han Lyu and I were studying some paper from the algorithm community, and we noticed a peculiar kind of argument. For a much simplified version, let X and D be two relations of type A → B, denoting two alternative approaches to non-deterministically compute possible solution candidates to a problem. Also let ≤ be a transitive relation on B, and ≥ its converse. The relation min ≤ : {B} → B, given a set, returns one of its elements that is no larger (under ≤) than any element in the set, if such a minimum exists.
We would like to find a solution that is as small as possible under ≤.

When arguing for the correctness of its algorithm, the paper we are studying claims that the method X is no worse than D in the following sense: if every solution returned by D is no better than some solution returned by X, which we translate to:

D ⊆ ≥ . X

then the best (smallest) solution by X must be no worse than (one of the) best solutions returned by D:

min ≤ . ΛX ⊆ ≤ . min ≤ . ΛD

where Λ converts a relation A → B to a function A → {B} by collecting its results to a set. Note that, awkwardly, X and D are swapped to different sides of relational inclusion.

“What? How could this be true?” was my first reaction. I bombarded Yu-Han with lots of emails, making sure that we didn’t misinterpret the paper. An informal way to see it is that since every result of D is outperformed by something returned by X, collectively, the best result among the latter must be a “lower bound” of the optimal result of D. But this sounds unconvincing to me. Something is missing.

Totality and Well-Boundedness

It turns out that the reasoning can be correct, but we need some more constraints on D and ≤. Firstly, D must yield some result whenever X does. Otherwise it could be that D ⊆ ≥ . X is true but ΛD returns an empty set while ΛX still returns something. This is bad because X is then no longer a safe alternative to D — it could sometimes do too much. One way to prevent this from happening is to demand that ΛD = dom ∈ . ΛD, where ∈ is the membership relation, and dom ∈, the domain of ∈, consists only of non-empty sets. It will be proved later that this is equivalent to demanding that D be total.

Secondly, we need to be sure that every non-empty set has a minimum, that is, min ≤ always yields something for non-empty sets, so that min ≤ . ΛD does not fall back to the empty relation. Formally, this can be expressed as dom ∈ = dom (min ≤). Bird and de Moor called this property well-boundedness of ≤.

Recall that min ≤ = ∈ ∩ ≤/∋. The ∈ part guarantees that min ≤ returns something that is in the given set, while ≤/∋ guarantees that the returned value is a lower bound of the given set. Since ΛD (as well as ΛX) is a function, we also have min ≤ . ΛD = D ∩ ≤/D°, following from the laws of division.
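
To make the point-free operators concrete, here is a small finite model in Haskell (entirely my own sketch; the names Rel, lambda, minLe and sideSwap are made up). Relations are lists of pairs, Λ collects results into a list, min ≤ keeps the elements that are below everything in the set, and sideSwap merely states the claimed inclusion at one input so that it can be tested on small cases:

type Rel a b = [(a, b)]

-- ΛR at a given input: collect all results of R
lambda :: Eq a => Rel a b -> a -> [b]
lambda r x = [y | (x', y) <- r, x' == x]

-- min ≤ of a finite set: the elements that are ≤ everything in the set
minLe :: (b -> b -> Bool) -> [b] -> [b]
minLe le ys = [y | y <- ys, all (le y) ys]

-- the side-swapping claim at input i: every minimum of ΛX i is ≤ some minimum of ΛD i
sideSwap :: Eq a => (b -> b -> Bool) -> Rel a b -> Rel a b -> a -> Bool
sideSwap le x d i =
  and [ or [ le mx md | md <- minLe le (lambda d i) ]
      | mx <- minLe le (lambda x i) ]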

Later we will prove an auxiliary lemma stating that if ≤ is well-bounded, we have:

≤/∋ . dom ∈   ⊆   ≤ . min ≤ . dom ∈

The right-hand side, given a non-empty set, takes its minimum and returns something possibly smaller. The left-hand side merely returns some lower bound of the given set. It sounds weaker because it does not demand that the set has a minimum. Nevertheless, the inclusion holds if ≤ is well-bounded.

An algebraic proof of the auxiliary lemma was given by Akimasa Morihata. The proof, to be discussed later, is quite interesting to me because it makes an unusual use of indirect equality. With the lemma, the proof of the main result becomes rather routine:

  min ≤ . ΛX   ⊆   ≤ . min ≤ . ΛD
≣   { since ΛD = dom ∈ . ΛD }
  min ≤ . ΛX   ⊆   ≤ . min ≤ . dom ∈ . ΛD
⇐  { ≤/∋ . dom ∈ ⊆ ≤ . min ≤ . dom ∈, see below }
  min ≤ . ΛX   ⊆   ≤/∋ . dom ∈ . ΛD
≣   { since ΛD = dom ∈ .  ΛD }
  min ≤ . ΛX   ⊆   ≤/∋ . ΛD
≣   { since ΛD is a function, R/S . f = R/(f° . S) }
  min ≤ . ΛX   ⊆   ≤/D° 
≣   { Galois connection }
  min ≤ . ΛX . D°   ⊆   ≤ 
⇐   { min ≤ . ΛX ⊆ ≤/X° }
  ≤/X°. D°   ⊆   ≤ 
⇐   { since  D ⊆ ≥ . X }
  ≤/X°. X° . ≤   ⊆   ≤ 
⇐   { division }
  ≤ . ≤   ⊆   ≤ 
≣   { ≤ transitive }
  true

Proof Using Enriched Indirect Equality

Now we have got to prove that ≤/∋ . dom ∈ ⊆ ≤ . min ≤ . dom ∈ provided that ≤ is well-bounded. To prove this lemma I had to resort to first-order logic. I passed the problem to Akimasa Morihata and he quickly came up with a proof. We start with some preparation:

  ≤/∋ . dom ∈ ⊆ ≤ . min ≤ . dom ∈
⇐   { since min ≤ ⊆ ∈ }
  ≤/(min ≤)° . dom ∈ ⊆ ≤ . min ≤ . dom ∈

And then we use proof by indirect (in)equality. The proof, however, is unusual in two ways. Firstly, we need the enriched indirect equality proposed by Dijkstra in
EWD 1315: Indirect equality enriched (and a proof by Netty). Typically, proof by indirect equality exploits the property:

x = y    ≡   (∀u. u ⊆ x ≡ u ⊆ y)

and also:

x ⊆ y   ≡   (∀u. u ⊆ x ⇒ u ⊆ y)

When we know that both x and y satisfy some predicate P, enriched indirect equality allows us to prove x = y (or x ⊆ y) by proving a weaker premise:

x = y   ≡   (∀u. P u ⇒ u ⊆ x ≡ u ⊆ y)

Note that both ≤/(min ≤)° . dom ∈ and ≤ . min ≤ . dom ∈ satisfy X = X . dom ∈. Later we will try to prove:

X ⊆ ≤/(min ≤)° . dom ∈    ⇒    X ⊆ ≤ . min ≤ . dom ∈

for X such that X = X . dom ∈.

The second unusual aspect is that, rather than starting from one of X ⊆ ≤/(min ≤)° . dom ∈ and X ⊆ ≤ . min ≤ . dom ∈ and ending at the other, Morihata’s proof took the goal as a whole and used rules like (P ⇒ Q) ⇒ (P ⇒ P ∧ Q). The proof goes:

  (X ⊆ ≤/(min ≤)° . dom ∈  ⇒  X ⊆ ≤ . min ≤ . dom ∈)
⇐   { dom ∈  ⊆ id }
  (X  ⊆ ≤/(min ≤)°  ⇒  X ⊆ ≤ . min ≤ . dom ∈)
≣    { Galois connection } 
  (X . (min ≤)° ⊆ ≤  ⇒  X ⊆ ≤ . min ≤ . dom ∈)
⇐   { (P ⇒ Q) ⇒ (P ⇒ P ∧ Q) }
  (X . (min ≤)° ⊆ ≤  ⇒  X ⊆ X . (min ≤)° . min ≤ . dom ∈)
⇐   { R ∩ S ⊆ R  }
  (X . (min ≤)° ⊆ ≤  ⇒  X ⊆ X . (((min ≤)° . min ≤) ∩ id) . dom ∈)
≣   { dom R = (R° . R) ∩ id }
  (X . (min ≤)° ⊆ ≤  ⇒  X ⊆ X . dom (min ≤) . dom ∈)
≣   { ≤ well-bounded: dom ∈ = dom (min ≤) }
  (X . (min ≤)° ⊆ ≤  ⇒  X ⊆ X . dom ∈ . dom ∈)
≣   { dom ∈ . dom ∈ = dom ∈ }
  (X . (min ≤)° ⊆ ≤  ⇒  X ⊆ X . dom ∈)
≣   { X = X . dom ∈ }
  (X . (min ≤)° ⊆ ≤  ⇒  true)
≣ true

Auxiliary Proofs

Finally, this is a proof that the constraint ΛD = dom ∈ . ΛD is equivalent to D being total, that is id ⊆ D° . D. Recall that dom ∈ = ((∋ . ∈) ∩ id). We simplify dom ∈ . ΛD a bit:

  dom ∈ . ΛD
= ((∋ . ∈) ∩ id) . ΛD
=   { ΛD a function }
  (∋ . ∈ . ΛD) ∩ ΛD
=   { ∈ . ΛD = D }
  (∋ . D) ∩ ΛD

We reason:

  dom ∈ . ΛD = ΛD
≡   { R ∩ S = S iff S ⊆ R }
  ΛD ⊆ ∋ . D
≡   { ΛD function, shunting }
  id ⊆ (ΛD)° . ∋ . D
≡   { (ΛD)° . ∋ = (∈ . ΛD)° = D° }
  id ⊆ D° . D

which is the definition of totality.

Determining List Steepness in a Homomorphism

The steep list problem is an example we often use when we talk about tupling. A list of numbers is called steep if each element is larger than the sum of elements to its right. Formally,

steep [] = True
steep (x : xs) = x > sum xs ∧ steep xs

The definition above is a quadratic-time algorithm. Can we determine the steepness of a list in linear time? To do so, we realise that we have to compute some more information. Letting steepsum = ⟨steep, sum⟩, where ⟨f, g⟩ x = (f x, g x), we discover that steepsum can be computed as a foldr:

steepsum = foldr step (True, 0)
  where step x (b, s) = (x > s ∧ b, x + s)

which takes linear time. After computing steepsum, simply take steep = fst . steepsum.
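
For reference, here is a directly runnable transcription of the foldr above (my own code), together with a small test:

steepsum :: [Int] -> (Bool, Int)
steepsum = foldr step (True, 0)
  where step x (b, s) = (x > s && b, x + s)

-- steepsum [9,5,3] == (True, 17);  steepsum [9,5,5] == (False, 19)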

In the Software Construction course we also talked about list homomorphism, that is, functions that can be defined in the form

h [] = e
h [a] = f a
h (xs ⧺ ys) = h xs ⊚ h ys

where ⊚ is associative and e is the identity element of ⊚. The discussion would be incomplete if we didn’t mention the third homomorphism theorem: if a function on lists can be computed both from right to left and from left to right, that is, both by a foldr and a foldl, it can be computed from anywhere in the middle by a homomorphism, which has the potential of being parallelised. Hu sensei had the idea of using steepsum as an exercise: can we express steepsum as a foldl, and as a list homomorphism?

Unfortunately, we cannot — steepsum is not a foldl. To determine the steepness from left to right, we again have to compute some more information — not necessarily in the form of a tuple.

Right Capacity

The first idea would be to tuple steepsum with yet another element, some information that would allow us to determine what could be appended to the right of the input. Given a list of numbers xs, let rcap xs (for right capacity) be a (non-inclusive) upper bound: a number y < rcap xs can be safely appended to the right of xs without invalidating the steepness within xs. It can be defined by:

rcap [] = ∞
rcap (x : xs) = (x - sum xs) ↓ rcap xs

where ↓ stands for binary minimum. For an explanation of the second clause, consider x : xs ⧺ [y]. To make it a steep list, y has to be smaller than x - sum xs and, furthermore, y has to be small enough so that xs ⧺ [y] is steep. For example, rcap [9,5] is 4, therefore [9,5,3] is still a steep list.
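
In runnable Haskell (my own transcription), ∞ can be represented by Nothing:

rcap :: [Int] -> Maybe Int
rcap []       = Nothing
rcap (x : xs) = Just (x - sum xs) `minC` rcap xs

-- binary minimum, with Nothing playing the role of ∞
minC :: Maybe Int -> Maybe Int -> Maybe Int
minC (Just a) (Just b) = Just (min a b)
minC a        Nothing  = a
minC Nothing  b        = b

-- rcap [9,5] == Just 4, matching the example above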

Following the theme of tupling, one could perhaps try to construct ⟨steep, sum, rcap⟩ as a foldl. By doing so, however, one would soon discover that rcap itself contains all information we need. Firstly, a list xs is steep if and only if rcap xs is greater than 0:
Lemma 1: steep xs ⇔ rcap xs > 0.
Proof: induction on xs.
case []: True ⇔ ∞ > 0.
case (x : xs):

         steep (x : xs) 
       ⇔ x > sum xs ∧ steep xs
       ⇔ x - sum xs > 0  ∧ steep xs
       ⇔   { induction }
         x - sum xs > 0 ∧ 0 < rcap xs
       ⇔ ((x - sum xs) ↓ rcap xs) > 0
       ⇔ rcap (x : xs) > 0

Secondly, it turns out that rcap is itself a foldl:
Lemma 2: rcap (xs ⧺ [x]) = (rcap xs - x) ↓ x.
Proof: induction on xs.
case []: rcap [x] = x = (∞ - x) ↓ x.
case (y : xs):

         rcap (y : xs ⧺ [x])
       = ((y - sum xs) - x) ↓ rcap (xs ⧺ [x])
       =   { induction }
         ((y - sum xs) - x) ↓ (rcap xs - x) ↓ x
        =   { since (a - x) ↓ (b - x) = (a ↓ b) - x }
         (((y - sum xs) ↓ rcap xs) - x) ↓ x
       = (rcap (y : xs) - x) ↓ x


Therefore we have rcap = foldl (λ y x → (y - x) ↓ x) ∞. If the aim were to determine steepness from left to right, our job would be done.

However, the aim is to determine steepness from both directions. To efficiently compute rcap using a foldr, we still need to tuple rcap with sum. In summary, the function ⟨sum, rcap⟩ can be computed both in terms of a foldl:

⟨sum, rcap⟩ = foldl step (0, ∞)
   where step (s, y) x = (s + x, (y - x) ↓ x)

and a foldr:

⟨sum, rcap⟩ = foldr step (0, ∞)
   where step x (s, y) = (x + s, (x - s) ↓ y)

Now, can we compute ⟨sum, rcap⟩ by a list homomorphism?

List Homomorphism

Abbreviate ⟨sum, rcap⟩ to sumcap. The aim now is to construct ⊚ such that sumcap (xs ⧺ ys) = sumcap xs ⊚ sumcap ys. The paper Automatic Inversion Generates Divide-and-Conquer Parallel Programs by Morita et al. suggested the following approach (which I discussed, well, using more confusing notations, in a previous post): find a weak inverse of ⟨sum, rcap⟩, that is, some g such that sumcap (g z) = z for all z in the range of sumcap. Then we may take z ⊚ w = sumcap (g z ⧺ g w).

For this problem, however, I find it hard to look for the right g. Anyway, this is the homomorphism that seems to work:


sumcap [] = (0, ∞)
sumcap [x] = (x, x)
sumcap (xs ⧺ ys) = sumcap xs ⊚ sumcap ys
  where (s1,c1) ⊚ (s2,c2) = (s1+s2, (c1-s2) ↓ c2)

However, I basically constructed ⊚ by guessing, and proved it correct afterwards. It takes a simple inductive proof to show that rcap (xs ⧺ ys) = (rcap xs - sum ys) ↓ rcap ys.
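
As a sanity check (my own test code, again with Maybe Int standing for ∞), the homomorphism can be run by splitting the list anywhere in the middle, and it agrees with the foldr version on small examples:

type SC = (Int, Maybe Int)   -- (sum, rcap)

odot :: SC -> SC -> SC
odot (s1, c1) (s2, c2) = (s1 + s2, fmap (subtract s2) c1 `minC` c2)
  where minC (Just a) (Just b) = Just (min a b)
        minC a        Nothing  = a
        minC Nothing  b        = b

sumcapH :: [Int] -> SC
sumcapH []  = (0, Nothing)
sumcapH [x] = (x, Just x)
sumcapH xs  = sumcapH ys `odot` sumcapH zs
  where (ys, zs) = splitAt (length xs `div` 2) xs

-- sumcapH [9,5,3] == (17, Just 1), the same as the foldr version gives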

I am still wondering what weak inverse of sumcap would lead us to the solution above, however.