40

I have a large collection of regular expression that when matched call a particular http handler. Some of the older regex's are unreachable (e.g. a.c* ⊃ abc*) and I'd like to prune them.

Is there a library that given two regex's will tell me if the second is subset of the first?

I wasn't sure this was decidable at first (it smelled like the halting problem by a different name). But it turns out it's decidable.

10
  • Not entirely sure I understand - are you saying that you have two regexes, a.c* and abc* ? And you wan't to decipher if they are the same, or partially the same? Or is a.c* ⊃ abc* a whole regex? As I've never seen that notation before
    – MDEV
    Commented Sep 11, 2013 at 11:52
  • 2
    ⊃ means strict superset, I probably should have used ⊇, which is more common. I'm trying to say that every string accepted by abc* is also accepted by a.c*
    – deft_code
    Commented Sep 11, 2013 at 19:25
  • 4
    What is your definition of Regex? In most programming languages, regular expression syntax, which often allows back references, is more powerful than regular languages. So decidability of inclusion is not even clear...
    – J.-E. Pin
    Commented Sep 16, 2013 at 20:16
  • 1
    In this case I mean proper regular expressions. I'm using the RE2 library which implements only common regex features that can map directly to proper regular expressions.
    – deft_code
    Commented Sep 17, 2013 at 19:57
  • 1
    Correction, it smells rather EXPSPACE-complete ;) en.wikipedia.org/wiki/EXPSPACE Commented Sep 18, 2013 at 5:13

4 Answers 4

18
+100

Trying to find the complexity of this problem lead me to this paper.

The formal definition of the problem can be found within: this is generally called the inclusion problem

The inclusion problem for R, is to test for two given expressions r, r′ ∈ R, whether r ⊆ r′.

That paper has some great information (summary: all but the simplest expressions are fairly complex), however searching for information on the inclusion problem leads one directly back to StackOverflow. That answer already had a link to a paper describing a passable polynomial time algorithm which should cover a lot of common cases.

9

I found a python regex library that provides set operations.

http://github.com/ferno/greenery

The proof says Sub ⊆ Sup ⇔ Sub ∩ ¬Sup is {}. I can implement this with the python library:

import sys
from greenery.lego import parse

subregex = parse(sys.argv[1])
supregex = parse(sys.argv[2])

s = subregex&(supregex.everythingbut())
if s.empty():
  print("%s is a subset of %s"%(subregex,supregex))
else:
  print("%s is not a subset of %s, it also matches %s"%(subregex,supregex,s)

examples:

subset.py abcd.* ab.*
abcd.* is a subset of ab.*

subset.py a[bcd]f* a[cde]f*
a[bcd]f* is not a subset of a[cde]f*, it also matches abf*

The library may not be robust because as mentioned in the other answers you need to use the minimal DFA in order for this to work. I'm not sure ferno's library makes (or can make) that guarantee.

As an aside: playing with the library to calculate inverse or simplify regexes is lots of fun.
a(b|.).* simplifies to a.+. Which is pretty minimal.
The inverse of abf* is ([^a]|a([^b]|bf*[^f])).*|a?. Try to come up with that on your own!

1
  • 1
    That library doesn't guarantee to generate a minimal DFA, but I don't believe the DFAs in question need to be minimal to get the correct answer.
    – qntm
    Commented Dec 8, 2013 at 23:19
6

If the regular expressions use "advanced features" of typical procedural matchers (like those in Perl, Java, Python, Ruby, etc.) that allow accepting languages that aren't regular, then you are out of luck. The problem is in general undecidable. E.g. the problem of whether one pushdown automaton recognizes the same context free (CF) language as another is undecidable. Extended regular expressions can describe CF languages.

On the other hand, if the regular expressions are "true" in the theoretical sense, consisting only of concatenation, alternation, and Kleene star over strings with a finite alphabet, plus the usual syntactic sugar on these (character classes, +, ?, etc), then there is a simple polynomial time algorithm.

I can't give you libraries, but this:

For each pair of regexes r and s for languages L(r) and L(s)
  Find the corresponding Deterministic Finite Automata M(r) and M(s)
    Compute the cross-product machine M(r x s) and assign accepting states
       so that it computes L(r) - L(s)
    Use a DFS or BFS of the the M(r x s) transition table to see if any
       accepting state can be reached from the start state
    If no, you can eliminate s because L(s) is a subset of L(r).
    Reassign accepting states so that M(r x s) computes L(s) - L(r)
    Repeat the steps above to see if it's possible to eliminate r

Converting a regex to a DFA generally uses Thompson's construction to get a non-deterministic automaton. This is converted to a DFA using the Subset Construction. The cross-product machine is another standard algorithm.

This was all worked out in the 1960's and is now part of any good undergrad computer science theory course. The gold standard for the topic is Hopcroft and Ullman, Automata Theory.

4

There is an answer in the mathematics section: https://math.stackexchange.com/questions/283838/is-one-regular-language-subset-of-another.

Basic idea:

  • Compute the minimal DFA for both languages.
  • Calculate the cross product of both automates M1 and M2, which means that each state consists of a pair [m1, m2] where m1 is from M1 and m2 from M2 for all possible combinations.
  • The new transition F12 is: F12([m1, m2], x) => [F1(m1, x), F2(m2, x)]. This means if there was a transition in M1 from state m1 to m1' while reading x and in M2 from state m2 to m2' while reading x then there is one transition in M12 from [m1, m2] to [m1', m2'] while reading x.
  • At the end you look into the reachable states:
    • If there is a pair [accepting, rejecting] then the M2 is not a subset of M1
    • If there is a pair [rejecting, accapting] then M1 is not a subset of M2

It would be benificial if you would just compute the new transition and the resulting states, omitting all non reachable states from the beginning.

Not the answer you're looking for? Browse other questions tagged or ask your own question.