Maximum Segment Density with Lower-Bounded Length Shin-Cheng Mu May 2007 This literate Haskell script presents a program solving the maximum segment density problem: given a list of numbers, compute the maximum density (average) of a segment that is longer than a lower-bound. At the first glance, the derivation would make use of the Thinning Theorem. It turns out that we need a kind of "global" thinning difficult to be modelled by a preorder. See Mu (2007) for details. The derived algorithm is similar to that of Lin et al. (2002). Functions defined: msdl_s :: Len -> [V] -> R The specification. msdl_1 :: Len -> [V] -> R An intermediate stage of derivation kept for expository purpose. We pass a set of solutions around, represented naively using a list. msdl_2 :: Int -> [V] -> R The next stage of derivation. After exploiting some properties we realise that the set can be represented by a sequence. The sequences are still represented by lists. msdl_l :: Len -> [V] -> R Changing the representation by fusing "diffs" into the fold. The program has become rather cryptical. msdl :: Len -> [V] -> R Derived program, instead of lists, we use SimpleQueue and BraunSeq in Edison. QuickCheck properties: test_lb :: Len The lower-bound on the length of segments. prop_msdl_1_correct :: [V] -> Bool That msdl_1 implements mssu_s. prop_msdl_2_correct :: [V] -> Bool That msdl_2 implements mssu_s. prop_msdl_l_correct :: [V] -> Bool That msdl_l implements mssu_s. prop_msdl_correct :: [V] -> Bool That msdl implements mssu_s. > import List This program uses Edison. If you do not have Edison installed, comment out the line below as well as the section for mssu. > import qualified Data.Edison.Seq.SimpleQueue as Q > import qualified Data.Edison.Seq.SizedSeq as Sz > import qualified Data.Edison.Seq.BraunSeq as B > import Test.QuickCheck > type V = Integer > type Len = Int > type R = Rational Specification. In order to avoid round-off errors during QuickChecking, we assume that V, the type of values we deal, is Integer and use Rational to represent density. In general, V need not be restricted to Integer. > msdl_s :: Len -> [V] -> R > msdl_s lb = maxlist . map avg . filter ((lb <=) . snd) . > map (split (sum, length)) . concat . map inits . tails > maxlist :: Ord v => [v] -> v > maxlist = foldr1 bmax > x `bmax` y | x < y = y > | otherwise = x > split (f,g) a = (f a, g a) > avg :: (V,Len) -> R > avg (s,l) = (fromIntegral s) / (fromIntegral l) After (global) thinning, the algorithm is expressed by a fold passing around a set of solutions. > msdl_1 :: Len -> [V] -> R > msdl_1 lb = snd . foldr (step1 lb) (([],[(0,0)]),-32767) > type SL = (V, Len) > type Queue a = [a] > type Set a = [a] > type ST1 = ((Queue V, Set SL),R) > step1 :: Len -> V -> ST1 -> ST1 > step1 lb a ((x,ys),m) | length x < lb = > ((a:x,ys),avg (split (sum, length) (a:x))) > step1 lb a ((x,ys),m) | otherwise = > ((a:x',ys'), m `bmax` findmax1 (a:x') ys') > where (x',b) = split (init,last) x > ys' = (0,0) : rm1 b ys > add a (s,l) = (a+s,1+l) > acat (s1,l1) (s2,l2) = (s1+s2,l1+l2) > rm1 b ys = [add b y | y <- ys, and [safe b y y' | y' <- ys]] > safe b (s1,l1) (s2,l2) = l2 <= l1 || > avg (b+s1,1+l1) > avg (s2-s1,l2-l1) > findmax1 x ys = maxlist [avg ((s,l) `acat` y) | y <- ys ] > where (s,l) = split (sum, length) x Testing. > test_lb :: Len > test_lb = 4 > prop_msdl_1_correct = > forAll (longer test_lb) $ > \x -> msdl_s test_lb x == msdl_1 test_lb x > longer 0 = arbitrary > longer (n+1) = do a <- arbitrary > x <- longer n > return (a:x) By exploiting properties of the set xs, we realise that there are faster ways to implement rm and maxfind. The program is kept for illustrative purpose because after introducing diffs, the program becomes rather cryptical. > msdl_2 :: Int -> [V] -> R > msdl_2 lb = snd . foldr (step2 lb) (([],[]),-32767) > type ST2 = ((Queue V, [SL]),R) > step2 :: Len -> V -> ST2 -> ST2 > step2 lb a ((x,ys),m) | length x < lb = > ((a:x,ys),avg (split (sum, length) (a:x))) > step2 lb a ((x,ys),m) | otherwise = > ((a:x',ys'), m `bmax` findmax2 (split (sum, length) (a:x')) ys') > where (x',b) = split (init,last) x > ys' = rm2 lb b ys > rm2 lb b ys = (bump . dropR_l long) ((b,1) : map (add b) ys) > where long (s,l) = l > lb -1 > bump [] = [] > bump [y] = [y] > bump (y:z:ys) > | avg y > avg z = y:z:ys > | otherwise = bump (z:ys) > findmax2 x ys = findmax2' x ((0,0):ys) > findmax2' x [] = avg x > findmax2' x [y] = avg x `bmax` avg (x `acat` y) > findmax2' x (y:z:ys) > | avg (x `acat` y) > avg (x `acat` z) = avg (x `acat` y) > | otherwise = findmax2' x (z:ys) > dropR_l p [] = [] > dropR_l p xs | p (last xs) = dropR_l p (init xs) > | otherwise = xs > prop_msdl_2_correct = > forAll (longer test_lb) $ > \x -> msdl_s test_lb x == msdl_2 test_lb x The final algorithm. Refined from msdl2 by planting diffs. > msdl_l :: Len -> [V] -> R > msdl_l lb = snd . foldr (stepl_l lb) (([],((0,0),[])),-32767) > type ST = ((Queue V, (SL, [SL])), R) > stepl_l :: Len -> V -> ST -> ST > stepl_l lb a ((x,ys),m) | length x < lb = > ((a:x,ys),avg (split (sum, length) (a:x))) > stepl_l lb a ((x,ys),m) | otherwise = > ((a:x',ys'), > m `bmax` findmax_l (split (sum, length) (a:x')) ys') > where (x',b) = split (init,last) x > ys' = rm_l lb b ys > rm_l lb b ((c,n),ys) = > ((c',n'), (bump . dropR_l long) ((c,n):ys)) > where long (s,l) = 1+n-l > lb -1 > (c',n') = (b+c,1+n) > bump [] = [] > bump [y] = [y] > bump (y:z:ys) > | avgd (c',n') y > avgd (c',n') z = y:z:ys > | otherwise = bump (z:ys) > avgd (c,n) (s,l) = avg (c-s,n-l) > avgdx (t,m) (c,n) (s,l) = avg (t+c-s, m+n-l) > findmax_l x ((c,n),ys) = findmax_l' x ((c,n),(c,n):ys) > findmax_l' x ((c,n),[]) = avg x > findmax_l' x ((c,n),[y]) = avg x `bmax` avgdx x (c,n) y > findmax_l' x ((c,n),y:z:ys) > | avgdx x (c,n) y > avgdx x (c,n) z > = avgdx x (c,n) y > | otherwise = findmax_l' x ((c,n),z:ys) > prop_msdl_l_correct = > forAll (longer test_lb) $ > \x -> msdl_s test_lb x == msdl_l test_lb x Derived algorithm, using advanced data structures for sequences. If you do not have Edison installed, comment out the rest of the file. > msdl :: Len -> [V] -> R > msdl lb = snd . foldr (stepl lb) (((Sz.empty,0),((0,0),Sz.empty)),-32767) We use a SimpleQueue to implement the prefix, and a Braun sequence to implement the set of solutions. We need to maintain the sizes of both (by using the Sized module of Edison). Also, we need to maintain the sum of the prefix. > type PrefixSeq a = (Sz.Sized (Q.Seq) a, V) -- maintaining the sum. > type SolSeq a = Sz.Sized (B.Seq) a > type STQ = ((PrefixSeq V, (SL, SolSeq SL)) ,R) > stepl :: Len -> V -> STQ -> STQ > stepl lb a ((x,ys),m) | pssize x < lb = > ((a `pslcons` x,ys),avg (split (pssum, pssize) (a `pslcons` x))) > stepl lb a ((x,ys),m) | otherwise = > ((a `pslcons` x',ys'), > m `bmax` findmax (split (pssum, pssize) (a `pslcons` x')) ys') > where (x',b) = split (psrtail,psrhead) x > ys' = rm lb b ys Some functions maintaining the prefix. > pslcons a (x, s) = (a `Sz.lcons` x, s+a) > psrtail (x,s) = (Sz.rtail x, s-(Sz.rhead x)) > psrhead (x,s) = Sz.rhead x > pssize (x,s) = Sz.size x > pssum (x,s) = s > rm lb b ((c,n),ys) = > ((c',n'), (bump . dropR long) ((c,n) `Sz.lcons` ys)) > where long (s,l) = 1+n-l > lb -1 > (c',n') = (b+c,1+n) > bump ys > | Sz.null ys = ys > | Sz.null (Sz.ltail ys) = ys > | otherwise = let (y,ys') = (Sz.lhead ys, Sz.ltail ys) > z = Sz.lhead ys' > in if avgd (c',n') y > avgd (c',n') z > then ys else bump ys' > dropR p xs > | Sz.null xs = xs > | p (Sz.rhead xs) = dropR p (Sz.rtail xs) > | otherwise = xs > findmax x ((c,n),ys) = > if i == k-1 then avg x `bmax` avgdx x (c,n) (Sz.lookup i ys') > else avgdx x (c,n) (Sz.lookup (i+1) ys') > where ys' = (c,n) `Sz.lcons` ys > k = Sz.size ys' > i = bsearch p 0 (Sz.size ys') > p j = (j == k-1) || > (avgdx x (c,n) (Sz.lookup j ys') <= > avgdx x (c,n) (Sz.lookup (j+1) ys')) The function call bsearch p l r returns the largest i within the range l <= i < r such that p i holds. > bsearch p l r > | l == r = -1 > | l+1 == r = if p l then l else -1 > | otherwise = let m = (l + r) `div` 2 > in if p m then bsearch p m r > else bsearch p l m > prop_msdl_correct = > forAll (longer test_lb) $ > \x -> msdl_s test_lb x == msdl test_lb x References [LJ02] Y.-L. Lin, T. Jiang, and K.-M. Chao. Efficient algorithms for locating the length-constrained heaviest segments, with applications to biomolecular sequence analysis. In Proceedings of the 27th International Symposium on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science 2420, pages 459–470. Springer-Verlag, 2002. [Mu07] S-C. Mu. Maximum segment sum is back: deriving algorithms for two maximum segment problems with bounded lengths. Under submission. 2007.