Maximum Segment Density with Lower-Bounded Length Shin-Cheng Mu Sep 2007 This literate Haskell script presents a program solving the maximum segment density problem: given a list of numbers, compute the maximum density (average) of a segment that is longer than a lower-bound. At the first glance, the derivation would make use of the Thinning Theorem. It turns out that we need a kind of "global" thinning difficult to be modelled by a preorder. See Mu (2007) for details. The derived algorithm is similar to that of Lin et al. (2002). Functions defined: msdl_s :: Len -> [V] -> R The specification. msdl_1 :: Len -> [V] -> R An intermediate stage of derivation kept for expository purpose. We pass a set of solutions around, represented naively using a list. msdl_2 :: Int -> [V] -> R The next stage of derivation. After exploiting some properties we realise that the set can be represented by a sequence. With the discovery of Goldwasser et al. (2005) we are able to refine the algorithm to amortised linear time. The sequences are still represented by lists. msdl_l :: Len -> [V] -> R Changing the representation by fusing "diffs" into the fold. The program has become rather cryptical. msdl :: Len -> [V] -> R Derived program, instead of lists, we use SimpleQueue and BraunSeq in Edison. QuickCheck properties: test_lb :: Len The lower-bound on the length of segments. prop_msdl_1_correct :: [V] -> Bool That msdl_1 implements msdl_s. prop_msdl_2_correct :: [V] -> Bool That msdl_2 implements msdl_s. prop_msdl_l_correct :: [V] -> Bool That msdl_l implements msdl_s. prop_msdl_correct :: [V] -> Bool That msdl implements msdl_s. > import List > import Control.Arrow This program uses Edison. If you do not have Edison installed, comment out the line below as well as the section for mssu. > import qualified Data.Edison.Seq.SimpleQueue as Q > import qualified Data.Edison.Seq.SizedSeq as Sz > import qualified Data.Edison.Seq.BraunSeq as B > import Test.QuickCheck > type V = Integer > type Len = Int > type R = Rational Specification. In order to avoid round-off errors during QuickChecking, we assume that V, the type of values we deal, is Integer and use Rational to represent density. In general, V need not be restricted to Integer. > msdl_s :: Len -> [V] -> R > msdl_s lb = maxlist . map avg . filter ((lb <=) . snd) . > map sumlen . concat . map inits . tails > sumlen = sum &&& length > maxlist :: Ord v => [v] -> v > maxlist = foldr1 bmax > x `bmax` y | x < y = y > | otherwise = x > avg :: (V,Len) -> R > avg (s,l) = (fromIntegral s) / (fromIntegral l) After (global) thinning, the algorithm is expressed by a fold passing around a set of solutions. > msdl_1 :: Len -> [V] -> R > msdl_1 lb = snd . foldr (step1 lb) (([],[(0,0)]),-32767) > type SL = (V, Len) > type Queue a = [a] > type Set a = [a] > type ST1 = ((Queue V, Set SL),R) > step1 :: Len -> V -> ST1 -> ST1 > step1 lb a ((x,ys),m) | length x < lb = > ((a:x,ys),avg (sumlen (a:x))) > step1 lb a ((x,ys),m) | otherwise = > ((a:x',ys'), m `bmax` findmax1 (a:x') ys') > where (x',b) = (init &&& last) x > ys' = (0,0) : rm1 (map (add b) ys) > add a (s,l) = (a+s,1+l) > acat (s1,l1) (s2,l2) = (s1+s2,l1+l2) > rm1 ys = [y | y <- ys, and [safe y y' | y' <- ys]] > where safe (s1,l1) (s2,l2) = l2 <= l1 || > avg (s1,l1) > avg (s2-s1,l2-l1) > findmax1 x ys = maxlist [avg ((s,l) `acat` y) | y <- ys ] > where (s,l) = sumlen x Testing. > test_lb :: Len > test_lb = 4 > prop_msdl_1_correct = > forAll (longer test_lb) $ > \x -> msdl_s test_lb x == msdl_1 test_lb x > longer 0 = arbitrary > longer (n+1) = do a <- arbitrary > x <- longer n > return (a:x) By exploiting properties of the set xs, we realise that there are faster ways to implement rm and maxfind. The program is kept for illustrative purpose because after introducing diffs, the program becomes rather cryptical. > msdl_2 :: Int -> [V] -> R > msdl_2 lb = snd . foldr (step2 lb) (([],[]),-32767) > type ST2 = ((Queue V, [SL]),R) > step2 :: Len -> V -> ST2 -> ST2 > step2 lb a ((x,ys),m) | length x < lb = > ((a:x,ys),avg (sumlen (a:x))) > step2 lb a ((x,ys),m) | otherwise = > ((a:x',ys'), m `bmax` m') > where (x',b) = (init &&& last) x > sl = sumlen (a:x') > ys' = cut2 sl (rm2 ((b,1) : map (add b) ys)) > m' = if null ys' then avg sl > else avg (sl `acat` last ys') > rm2 [] = error "[]" > rm2 [y] = [y] > rm2 (y:z:ys) | avg y > avg z = y:z:ys > | otherwise = rm2 (z:ys) > cut2 x [] = error "[]" > cut2 x [y] = > if avg x > avg (x `acat` y) > then [] > else [y] > cut2 x ys = -- ys = ys'' ++ [z,y] > let (ys',y) = (init &&& last) ys > (ys'',z) = (init &&& last) ys' > in if avg (x `acat` z) > avg (x `acat` y) > then cut2 x ys' > else ys > dropR_l p [] = [] > dropR_l p xs | p (last xs) = dropR_l p (init xs) > | otherwise = xs > prop_msdl_2_correct = > forAll (longer test_lb) $ > \x -> msdl_s test_lb x == msdl_2 test_lb x The final algorithm. Refined from msdl2 by planting diffs. > msdl_l :: Len -> [V] -> R > msdl_l lb = snd . foldr (stepl_l lb) (([],((0,0),[])),-32767) > type ST = ((Queue V, (SL, [SL])), R) > stepl_l :: Len -> V -> ST -> ST > stepl_l lb a ((x,ys),m) | length x < lb = > ((a:x,ys),avg (sumlen (a:x))) > stepl_l lb a ((x,((c,n),ys)),m) | otherwise = > ((a:x',(cn', ys')), m' `bmax` m) > where (x',b) = (init &&& last) x > cn' = (b+c,1+n) > sl = sumlen (a:x') > ys' = cut_l sl cn' (rm_l cn' ((c,n):ys)) > m' = if null ys' then avg sl else avgdx sl cn' (last ys') > rm_l (c',n') [] = error "[]" > rm_l (c',n') [y] = [y] > rm_l (c',n') (y:z:ys) > | avgd (c',n') y > avgd (c',n') z = y:z:ys > | otherwise = rm_l (c',n') (z:ys) > avgd (c,n) (s,l) = avg (c-s,n-l) > avgdx (t,m) (c,n) (s,l) = avg (t+c-s, m+n-l) > cut_l x _ [] = error "[]" > cut_l x (c,n) [y] = > if avg x > avgdx x (c,n) y > then [] > else [y] > cut_l x (c,n) ys = -- ys = ys'' ++ [z,y] > let (ys',y) = (init &&& last) ys > (ys'',z) = (init &&& last) ys' > in if avgdx x (c,n) z > avgdx x (c,n) y > then cut_l x (c,n) ys' > else ys > prop_msdl_l_correct = > forAll (longer test_lb) $ > \x -> msdl_s test_lb x == msdl_l test_lb x Derived algorithm, using advanced data structures for sequences. If you do not have Edison installed, comment out the rest of the file. > msdl :: Len -> [V] -> R > msdl lb = snd . foldr (stepl lb) (((Sz.empty,0),((0,0),Sz.empty)),-32767) We use two SimpleQueue to implement the prefix and the set of solutions. We need to maintain the sizes of both (by using the Sized module of Edison). Also, we need to maintain the sum of the prefix. > type PrefixSeq a = (Sz.Sized (Q.Seq) a, V) -- maintaining the sum. > type SolSeq a = Sz.Sized (Q.Seq) a > type STQ = ((PrefixSeq V, (SL, SolSeq SL)) ,R) > stepl :: Len -> V -> STQ -> STQ > stepl lb a ((x,ys),m) | pssize x < lb = > ((a `pslcons` x,ys),avg ((pssum &&& pssize) (a `pslcons` x))) > stepl lb a ((x,((c,n),ys)),m) | otherwise = > ((a `pslcons` x',(cn',ys')), m' `bmax` m) > where (x',b) = (psrtail &&& psrhead) x > cn' = (b+c,1+n) > sl = (pssum &&& pssize) (a `pslcons` x') > ys' = cut sl cn' (rm cn' ((c,n) `Sz.lcons` ys)) > m' = if Sz.null ys' then avg sl else avgdx sl cn' (Sz.rhead ys') Some functions maintaining the prefix. > pslcons a (x, s) = (a `Sz.lcons` x, s+a) > psrtail (x,s) = (Sz.rtail x, s-(Sz.rhead x)) > psrhead (x,s) = Sz.rhead x > pssize (x,s) = Sz.size x > pssum (x,s) = s > rm (c',n') ys | Sz.null ys = ys > | Sz.null (Sz.ltail ys) = ys > | otherwise = let (y,ys') = (Sz.lhead ys, Sz.ltail ys) > z = Sz.lhead ys' > in if avgd (c',n') y > avgd (c',n') z > then ys else rm (c',n') ys' > dropR p xs > | Sz.null xs = xs > | p (Sz.rhead xs) = dropR p (Sz.rtail xs) > | otherwise = xs > cut x (c,n) ys > | Sz.null ys = Sz.empty > | Sz.size ys == 1 = > if avg x > avgdx x (c,n) (Sz.rhead ys) > then Sz.empty else ys > | otherwise = > let (ys',y) = (Sz.rtail &&& Sz.rhead) ys > (ys'',z) = (Sz.rtail &&& Sz.rhead) ys' > in if avgdx x (c,n) z > avgdx x (c,n) y > then cut x (c,n) ys' > else ys > prop_msdl_correct = > forAll (longer test_lb) $ > \x -> msdl_s test_lb x == msdl test_lb x References [GK05] M. H. Goldwasser, M.-Y. Kao, and H.-I. Lu. Linear-time algorithms for computing maximum-density sequence segments with bioinformatics applications. Journal of Computer and System Sciences, 70(2):128–144, 2005. [LJ02] Y.-L. Lin, T. Jiang, and K.-M. Chao. Efficient algorithms for locating the length-constrained heaviest segments, with applications to biomolecular sequence analysis. In Proceedings of the 27th International Symposium on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science 2420, pages 459–470. Springer-Verlag, 2002. [Mu07] S-C. Mu. Maximum segment sum is back: deriving algorithms for two maximum segment problems with bounded lengths. Under submission. 2007.