Tag Archives: Regular Expression

An Unsuccessful Attempt to Compute the Intersection of Regular Expressions

Several months ago, Tyng-Ruey, Chin-Lung, and I were working on some ideas in which we needed to compute the intersection of two regular expressions (REs). The traditional approach is to convert them to finite state automata, take the intersection, and convert the result back to a RE. The disadvantage is that the structure of the original REs would be lost. We therefore tried to perform the intersection directly on the REs. The result was not satisfactory, however.

Regular expressions can be represented by the following algebraic datatype:

> data RE = Z            -- Empty
>         | E            -- Epsilon
>         | L Token      -- Literal
>         | RE :+ RE     -- Plus
>         | RE :. RE     -- Times 
>         | S RE         -- Star
>            deriving Show

> type Token = Char
> la = L 'a'
> lb = L 'b'
> lc = L 'c'

At one point we will need to determine whether an expression is nullable. The following function nullable actually computes the intersection of the input with epsilon:

> nullable :: RE -> RE
> nullable Z = Z
> nullable E = E
> nullable (L _) = Z
> nullable (r :+ s) = nullable r .+ nullable s
> nullable (r :. s) = nullable r .: nullable s
> nullable (S r) = E

I defined some “smart” constructors doing some minimal simplification for plus and times:

> Z .+ r = r
> r .+ Z = r
> r .+ s = r :+ s

> Z .: r = Z
> r .: Z = Z
> E .: r = r
> r .: E = r
> r .: s = r :. s

Now, here comes Brzozowski’s derivative of regular expressions. Intuitively speaking, d a e yields a regular expression resulting from extracting a prefixing token a from the regular expression e. Brzozowski actually defined derivative with respect to strings rather than just tokens.

> d :: Token -> RE -> RE
> d _ Z = Z
> d _ E = Z
> d a (L b) | a == b = E
>           | otherwise = Z
> d a (r :+ s) = d a r .+ d a s
> d a (r :. s) = (d a r .: s) .+ (nullable r .: d a s)
> d a (S r) = d a r .: S r

The interesting case is probably r :. s. To extract a from r :. s, we either extract a from r or, if r is nullable, extract a from s.

Now let us consider how to define intersection. The intersection of any expression with the empty set Z yields Z. Intersection with epsilon has been defined as nullable above.

> Z `cap` _ = Z
> _ `cap` Z = Z
> E `cap` E = E
> E `cap` r = nullable r
> r `cap` E = nullable r

The case for literals is defined below. Again the interesting case is the one regarding r :. s:

> L a `cap` L b | a == b = L a
>               | otherwise = Z
> L a `cap` (r :+ s) = (L a `cap` r) .+ (L a `cap` s)
> L a `cap` (r :. s) = (L a `cap` r .: nullable s) .+ 
>                      (nullable r .: L a `cap` s) 
> L a `cap` S r = L a `cap` r

The cases for plus is simple:

> (r :+ s) `cap` t = (r `cap` t) .+ (s `cap` t)

The case for :. is the most difficult. We have to do case analysis on the left side of :.:

> (Z :. s) `cap` t = Z
> (E :. s) `cap` t = s `cap` t
> (L a :. s) `cap` t = L a .: (s `cap` d a t)
> ((r :+ s) :. t) `cap` u = ((r .: t) `cap` u) .+ ((s .: t) `cap` u)
> ((r :. s) :. t) `cap` u = (r .: (s .: t)) `cap` u
> (S r :. s) `cap` t = (s `cap` t) .+ ((r .: (S r .: s)) `cap` t)

The case (Z :. s) `cap` t should be Z because (Z :. s) is Z, while (E :. s) `cap` t is s `cap` t because (E :. s) simplifies to s. The intersection of (L a :. s) and t is a cons the intersection of s and d a t. If it is not possible to extract a from t, the expression collapses to Z.

The case for :+ is given by distributivity, the case for (r :. s) :. t by associativity. The case for (S r :. s) `cap` t actually comes from expanding S r :. s to s :+ (r :. S r :. s). This may have contributed to the non-termination we will see later.

Finally, the case for (S r) `cap` t results from expanding S r to E :+ (r :. S r):

> (S r) `cap` t = E `cap` t .+ ((r .: S r) `cap` t)

Unfortunately, cap does not always terminate when star is involved. In some cases it works well:

*Main> S la `cap` lb
Z
*Main> S la `cap` (la :+ lb)
L 'a'
*Main> S la `cap` S lb
E

In some cases it loops. For example, consider S lb `cap` S lb. We get

  b* cap b* 
 = e cap b* + bb* cap b*
 = e + b (b* cap d b b*)
 = e + b (b* cap b*)
 = e + b ({-- loop --})

We could probably try to capture the recursive pattern in, perhaps, a fold. But we were then distracted by other topics and did not explore further.

Is it possible at all?