Semi-coherent ramblings of a semi-professional coder
Wednesday, February 9, 2011
The Pumping Lemma, and Why It's Slightly More Important Than You Might Have Thought
A while ago I answered a question on The Stack about the pumping lemma. My answer was, of course, a raving success, earning an unprecedented twenty-three up-votes (ie. not very many). But what's important here isn't how awesome I am, but rather how awesome other people aren't. On a site populated with professional software engineers, the majority of which probably have degrees in computer science, I was surprised to see exactly how poorly the majority of people understand what what the pumping lemma is and what it has to do with anything. So, for the benefit of my vast readership (hi mom!), here is my explanation of the pumping lemma in layman's terms.
The pumping lemma is a simple proof to show that a language is not regular, meaning that a Finite State Machine cannot be built for it. The canonical example is the language (a^n)(b^n). This is the simple language which is just any number of as, followed by the same number of bs. So the strings
etc. are in the language, but
etc. are not.
It's simple enough to build a FSM for these examples:
This one will work all the way up to n=4. The problem is that our language didn't put any constraint on n, and Finite State Machines have to be, well, finite. No matter how many states I add to this machine, someone can give me an input where n equals the number of states plus one and my machine will fail. So if there can be a machine built to read this language, there must be a loop somewhere in there to keep the number of states finite. With these loops added:
all of the strings in our language will be accepted, but there is a problem. After the first four as, the machine loses count of how many as have been input because it stays in the same state. That means that after four, I can add as many as as I want to the string, without adding any bs, and still get the same return value. This means that the strings:
with (a*) representing any number of as, will all be accepted by the machine even though they obviously aren't all in the language. In this context, we would say that the part of the string (a*) can be pumped. The fact that the Finite State Machine is finite and n is not bounded, guarantees that any machine which accepts all strings in the language MUST have this property. The machine must loop at some point, and at the point that it loops the language can be pumped. Therefore no Finite State Machine can be built for this language, and the language is not regular.
At this point you're probably asking yourself, "what does this have to do with anything and why should I care?" A fair question. The most immediate application of the pumping lemma is probably the problem of parsing HTML, or more generally, XML. As you may or may not remember, regular expressions and finite state machines are equivalent. That, along with the fact that HTML tags can be embedded within themselves deeper than Inception's dream levels, means that regular expressions can't parse HTML. In other words, regular expressions can't parse HTML. Which brings me to my next point: regular expressions can't parse HTML. Give me a regular expression for parsing HTML and I'll give you valid HTML that it is incapable of parsing. Seriously, I'm not busy.
Of course that probably doesn't seem like all that big of a deal, given all the freely available HTML parsers out there, but it turns out that the programmers who don't understand the pumping lemma are the same programmers who unnecessarily reinvent well established tools when faced with simple problems.