How do I write a regex that matches lines that do not contain
hi
?
That’s a frequently asked question if I ever saw one.
Of course, the proper answer is: you don’t. You write a regex that does match hi
and then invert the matching logic, ostensibly with grep -v
. But where’s the fun in that?
One interesting theorem that pops up in any book or class on formal grammars is that regular languages are closed under complement: the inverse of a regular expression is also a regular expression. This means that writing inverted regular expressions is theoretically possible, though it turns out to be quite tricky
Just try writing a regex that matches strings that does not contain “hi”, and test it against “hi”, “hhi” and “ih”, “iih” and such variations. Some solutions are coming up.
A way to cheat is using PCRE negative lookahead: ^(?!.*foo)
matches all strings not containing the substring “foo”. However, lookahead assertions require a stack, and thus can’t be modelled as a finite state machine. In other words, they don’t fit the mathematical definition of a regular expression, and therefore disqualify.
There are simple, well-known algorithms for turning regular expressions into non-deterministic finite automata, and from there to deterministic FA. Less commonly used and known are algorithms for inverting a DFA and for generating familiar textual regex from it.
You can find these described in various lecture notes and slides, so I won’t recite them.
What I had a harder time finding was software that actually did this. So here is a Haskell program. It’s highly suboptimal but it does the job. When executed, it will ask for a regex and will then output a grep command that matches everything the regex does not (without -v
, obviously).
The expressions it produces are quite horrific; it’s computer generated code, after all.
A regular expression for matching strings that do not match .*hi.*
could be grep -E '^([^h]|h+$|h+[^hi])*$'
.
This app, however, suggests grep -E '^([^h]([^h]|)*||([^h][^h]*h|h)|([^h][^h]*(h(hh*[^hi]|[^hi]))|(h(hh*[^hi]|[^hi])))((hh*[^hi]|[^h])|)*|([^h][^h]*(h([^hi][^h]*h|h))|(h([^hi][^h]*h|h)))(([^hi][^h]*h|h)|)*)$'
It still works exactly as stated though!
The app just supports a small subset of regex, just enough to convince someone that it works, and as a party trick lets you answer the original question exactly as stated.
Technically correct is the best kind of correct.