Tuesday, December 13, 2011

Mergesort: Like You're 5

Merge sort is based on the principle that merging two sorted sets into a single sorted set is easy. If we take two sorted sets, S1 and S2, such that:
S1 = [1, 3, 4] and S2 = [2, 5]
And we want to merge them into a new set, S3, which is also sorted, here's how we do it the mergesort-way. First, start with the two original sets, S1 and S2, and have S3 be empty.
S1 = [1, 3, 4] and S2 = [2, 5] and S3 = []
What we want to do is add the elements to S3 in order so that once the merge is complete the sorting is already finished. So the first step is to add the smallest element to S3. Since the sets S1 and S2 are already sorted, the smallest elements of each are their first elements. All we need to do to find the smallest element is look at the first element of each and take whichever is smaller. So we look at:
      v                  v
S1 = [1, 3, 4] and S2 = [2, 5]
and notice that 1 is smaller than 2, so we can be certain that 1 is the smallest element in either set. So we can take it out of S1 and add it as the first element of S3. After that we just repeat the process and compare the first elements again.
      v               v
S1 = [3, 4] and S2 = [2, 5] and S3 = [1]
In this case 2 is less than 3, so we append it to S3 and repeat.
      v               v
S1 = [3, 4] and S2 = [5] and S3 = [1, 2]
3 is less than 5, so the next step looks like:
      v            v
S1 = [4] and S2 = [5] and S3 = [1, 2, 3]
4 is less than 5, so we append it to S3, leaving S1 empty.
S1 = [] and S2 = [5] and S3 = [1, 2, 3, 4]
Now that S1 is empty, we can just append the entire contents of S2 onto the end of S3, since they're already in order, and we end up with S3 as the complete, sorted merge of S1 and S2.
S1 = [] and S2 = [] and S3 = [1, 2, 3, 4, 5]
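If it helps to see that merge step as actual code, here's a minimal sketch of it in C++ (the ints-in-vectors setup and the function name are just my choices for illustration):

#include <iostream>
#include <vector>

// Merge two already-sorted vectors into one sorted vector, exactly like
// the walkthrough above: repeatedly take the smaller front element until
// one side runs out, then append whatever is left of the other side.
std::vector<int> merge(const std::vector<int>& s1, const std::vector<int>& s2)
{
    std::vector<int> s3;
    size_t i = 0, j = 0;
    while (i < s1.size() && j < s2.size()) {
        if (s1[i] <= s2[j])
            s3.push_back(s1[i++]);
        else
            s3.push_back(s2[j++]);
    }
    // One input is now empty; the rest of the other is already in order.
    while (i < s1.size()) s3.push_back(s1[i++]);
    while (j < s2.size()) s3.push_back(s2[j++]);
    return s3;
}

int main()
{
    std::vector<int> s1 = {1, 3, 4}, s2 = {2, 5};
    for (int x : merge(s1, s2))
        std::cout << x << ' ';   // prints: 1 2 3 4 5
}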
Of course this whole time you've been wondering: how does this apply to sorting in general? We don't normally get the luxury of starting with two sorted lists, so this whole thing is stupid and you're stupid for writing it. Well, although I'm going to have to disagree about the me-being-stupid part, I see where you're coming from. The way it works, just like with quicksort, is that it has to be applied recursively. Here's a trivial example: let's use mergesort to sort this small set:
S1 = [4, 1, 3, 2]
You should notice two differences between this and the starting conditions of the other example: it's one list instead of two, and it's not sorted. The first difference is easy to fix: let's just split it down the middle.
S1 = [4, 1] and S2 = [3, 2]
That's one problem solved; now let's work on the other. Neither S1 nor S2 is sorted, so we'll have to sort them. The way we do that is, of course, with mergesort! Let's start with S1.
S1 = [4, 1]
Once again we've got one unsorted set, and once again, we begin by splitting it in half.
S1 = [4] and S2 = [1]
Now the sets only contain one element each, which makes them sorted by definition! Just like in quicksort, the base case of mergesort is that a set containing 1 or 0 elements is automatically sorted. Now we just need to merge them like we did before, comparing the first element of each set and adding the smaller one to the result. I won't bore you with the step-by-step of the merge process since we already went over it in depth, but just trust me when I say that the result will look like this:
S1 = [] and S2 = [] and S3 = [1, 4]
Just to clarify, I've left the empty sets S1 and S2 in to illustrate that we got to this result the same way as before: by taking elements one at a time out of S1 and S2 until they were both empty. Now we've got one of our subsets sorted and we can bring it back a level to get here:
S1 = [1, 4] and S2 = [3, 2]
Now we just need to do mergesort on S2 and we'll be able to merge them together. Once again we split S2 into two sets:
S1 = [3] and S2 = [2]
Do the merge process on them again:
S1 = [] and S2 = [] and S3 = [2, 3]
Take this result back a level:
S1 = [1, 4] and S2 = [2, 3]
Now just merge this whole thing:
S1 = [] and S2 = [] and S3 = [1, 2, 3, 4]
And we're done!
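Here's the whole split-recurse-merge process as a sketch, reusing the merge() function from the sketch above:

#include <vector>

// Mergesort as described: split in half, mergesort each half (the base
// case being that a set of 1 or 0 elements is sorted by definition),
// then merge the two sorted halves with merge() from the earlier sketch.
std::vector<int> mergesort(const std::vector<int>& s)
{
    if (s.size() <= 1)
        return s;
    std::vector<int> left(s.begin(), s.begin() + s.size() / 2);
    std::vector<int> right(s.begin() + s.size() / 2, s.end());
    return merge(mergesort(left), mergesort(right));
}

So mergesort({4, 1, 3, 2}) splits into {4, 1} and {3, 2}, sorts each half recursively, and merges them back into {1, 2, 3, 4}.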

Quicksort: Like You're 5

One of the problems I've noticed with explanations of quicksort is that they often fail to separate the concept from the implementation. It's easy to get bogged down in the details of iteration, element swapping, etc., without ever getting to the meaty center of the discussion: how quicksort works. So here's my attempt to explain the concept without getting into the implementation.
Let's take the set:
[5, 3, 4, 9, 1, 7]
First pick a random element of the set to be the pivot. Let's just pick the first one, 5. So now we have:
Pivot - 5, remaining set [3, 4, 9, 1, 7]
Now iterate through the set and compare each element to the pivot. If an element is "less than" the pivot, place it in one set; if it is "greater than" the pivot, place it in the other. We should end up with this:
Less than [3, 4, 1], pivot - 5, greater than [9, 7].
From here we can see that all the elements are in place with respect to the pivot, 5. If we somehow sorted the "less than" and "greater than" sets, we could just merge all the sets together and the whole thing would be sorted. The trick is to sort those "less than" and "greater than" sets, and the way to do that is, of course, using quicksort! Let's sort the "less than" set first. So we have:
[3, 4, 1]
Let's pick the first element to be the pivot again, so we end up with:
pivot - 3, remaining set [4, 1]
Repeating the comparison step from before, we get:
less than [1], pivot - 3, greater than [4]
Now we're in the same situation as before, only one thing is different: the "less than" and "greater than" sets only have one element each, and a one-element set is automatically sorted! This is the base case of quicksort: performing quicksort on a one-element set (or an empty set) just returns the set. So now all we have to do is merge the three sets together and it's sorted. We end up with:
[1, 3, 4]
Now we go back to where we were before, only now we have the "less than" set already sorted. So we're here:
Less than [1, 3, 4], pivot - 5, greater than [9, 7]
Now we just have to sort that greater than set. Once again using quicksort and selecting the first element as the pivot, we get:
less than [7], pivot - 9, greater than []
The less than and greater than sets each have one element or fewer, so they're already sorted and we can just merge them together! We get:
[7, 9]
Bringing this sorted set back to where we started, we get:
less than [1, 3, 4], pivot - 5, greater than [7, 9]
Now the less than and greater than sets are sorted and we can just merge the whole thing together! We get:
[1, 3, 4, 5, 7, 9]
And we're done!
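In case you'd like to see the concept written down, here's a minimal C++ sketch of it (deliberately the naive copy-everything version rather than the clever in-place one, since the point here was to avoid implementation details):

#include <iostream>
#include <vector>

// Quicksort as described: pick the first element as the pivot, split the
// rest into "less than" and "greater than" sets, recursively sort those,
// then stitch the pieces together. A set with 1 or 0 elements is the
// base case and is returned as-is. Ties go to the "greater than" set.
std::vector<int> quicksort(const std::vector<int>& s)
{
    if (s.size() <= 1)
        return s;
    int pivot = s[0];
    std::vector<int> less, greater;
    for (size_t i = 1; i < s.size(); ++i)
        (s[i] < pivot ? less : greater).push_back(s[i]);
    std::vector<int> result = quicksort(less);
    result.push_back(pivot);
    for (int x : quicksort(greater))
        result.push_back(x);
    return result;
}

int main()
{
    for (int x : quicksort({5, 3, 4, 9, 1, 7}))
        std::cout << x << ' ';   // prints: 1 3 4 5 7 9
}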

Friday, November 18, 2011

How many bugs can you find?

Working in C++ you are given an "immutable string" class, which we'll call "IMS", that you're forced to work with for some kind of bureaucratic BS reasons that you probably disagree with. You don't know much about it other than it's a class written in C++ which stores a string and doesn't allow it to be modified. Management has heard, and ignored, your suggestion to just use a "const std::string," because that sounds complicated and they're management, so just do it. A coworker approaches and explains that they've figured out how to do string equality comparisons with the new IMS class, and shows you the code they drafted up to show how it works. See how many bugs you can find.

IMS istr("hello");
if(istr.c_str() == "hello")
   std::cout << "it worked!" << std::endl;

Note that when run, "it worked!" gets printed out, and note that the "IMS::c_str()" function returns a char*. See if you can answer these questions:

  1. Why won't this type of string comparison work?
  2. What exactly happens when the comparison in the if statement is executed?
  3. Why does "it worked!" get printed out?
  4. What bug can you conclude is in the IMS class?
The questions are sorted in order of difficulty. Detailed answers follow, so stop here if you want to try to figure them out on your own.

Questions 1 and 2 are closely related. What happens in istr.c_str() == "hello" boils down to a simple pointer comparison. In C++, when pointers are compared with the == operator, they are just compared numerically. There is no pointer dereferencing or anything like that; the pointers are just compared as numbers. So the only way this comparison can evaluate to true is if both pointers point to the exact same memory location; it has nothing to do with comparing strings. To do a string comparison on a pair of C strings, you need to dereference the pointers and step through the strings, comparing the characters along the way. A function like strcmp() or strncmp() would be a good solution.
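For example, assuming the char* coming back from IMS::c_str() is an ordinary NUL-terminated C string (which is all we really know about it), the comparison should look something like this:

#include <cstring>   // for strcmp

IMS istr("hello");
// strcmp() walks both strings character by character and returns 0 when
// the contents match, so this compares the strings, not the pointers.
if(std::strcmp(istr.c_str(), "hello") == 0)
   std::cout << "it worked!" << std::endl;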

Question 3 is tricky because it's not specified C++ behavior that is causing it; it's a compiler optimization. When the compiler encounters the string literal "hello" on the first line, it stores it somewhere in the text segment of the program's memory space. When it encounters it again on line 2, the compiler is smart enough to identify that it has already stored that string literal, so instead of duplicating it, it just uses a pointer to the first one. This is possible because the text segment is read-only, so neither pointer can be used to modify the string which would mess it up for the other pointer.  The comparison passes because it's comparing two pointers which are pointing to the exact same memory location in the text segment. Knowing this, we can solve question 4.

The IMS class is supposed to be an "immutable" string, but let's look at what's going on in this short code segment. An IMS is instantiated with a char* pointing to the string "hello". On the next line, when we call c_str() on that IMS object, we get back the exact same pointer value pointing to the exact same "hello" string. Apparently the IMS class is just using whatever memory the given char* points to as its internal string storage. The problem is, the IMS class has no idea where that char* came from, what memory it's pointing at, who else has access to it, and most importantly, it has no idea if it will be modified. For all it knows, the char* could be pointing to a string being temporarily stored on the stack, rendering the "immutable" string anything but.

Saturday, February 12, 2011

The Problem With The Halting Problem

Of course everybody knows the halting problem is undecidable; what is less known is why the halting problem is undecidable. Surprisingly, I think the problem comes from the fact that the proof is actually too simple. It's really only a few sentences and doesn't even involve any big words or anything. You don't have to follow along with a pencil and paper, recomputing each step of the proof to see how it was done. You just start reading and before you know it it's over. You read and understand every single word but feel like you've been cheated. Something about it just doesn't feel right. It's one of the most profound proofs in the theory of computation, but you don't feel any different after reading it. The problem is that the simplicity is deceptive. If you don't take the time to limber up your mind for all the bending it has to do, you'll skip right over it. So before we get into the proof, let's start with some mind-stretching.

The proof goes something like this:
  1. Assume that the halting problem is decidable.
  2. ???
  3. Deduce that the universe doesn't exist.
Limber yet? Good.

Now we can address what you've been wondering this whole time: what the heck the halting problem is. The halting problem is just the problem of deciding whether a given program will eventually terminate or continue to run indefinitely on a given input. If it were decidable, that would mean that for any given program on any given input, it would be theoretically possible to determine whether or not that program would halt on that input. Emphasis is of course on the fact that it would have to be decidable for any program and any input; solving trivial examples is obviously possible. Unfortunately the most obvious solution doesn't quite cut it. If you just run the program on the input and see if it halts, you would never know how long to let it run before deciding that it will never halt. The best you could do is wait a while and hope it halts, but if it didn't you still wouldn't know if it would eventually halt or not. Even with a time machine you could fast-forward to the end of the universe and still never know if it really was going to loop forever or if it was just about to halt. If a solution exists, it's going to involve a bit more cunning.

It may seem like it's really not even worth it to muster up the necessary cunning to solve the halting problem, but a solution would actually be incredibly valuable. Hard math problems like the Collatz conjecture would be trivially solvable: just make a program that searches every number one by one, halting if it finds a counterexample, and give that program to the halting problem solver to see if it ever finishes. Done. There are a myriad of other undecidable problems that the halting-problem-solver would make short work of. Not to mention, how cool would it be to have a compiler that could warn you that the program you just compiled will never finish running? I know it would have saved me some debugging time.

Well, now that you're all excited about how awesome a solution to the halting problem would be, let's get to the reason it can never exist. Like we went over in the outline, step one of the proof is pretty simple, we just assume we have some program that can solve the halting problem. Let's call it Halting_Problem_Solver(p, i). Halting_Problem_Solver takes two parameters, the first is a program (p), and the second is some input (i). Halting_Problem_Solver returns true if the program p will ever halt on input i, and false if it will loop infinitely. Simple enough, right? Just make sure your mind is still limber for step two.

Now that we have a Halting_Problem_Solver, we're going to design a program which uses it to crash the universe. This is where it gets tricky. We'll call this second program Halting_Problem_Crasher(p). It takes one input, a program p. Halting_Problem_Crasher(p) is going to call Halting_Problem_Solver(p, p), which is to say it will see if program p will ever halt when given itself as input. Whatever Halting_Problem_Solver determines that p does with itself as its input, Halting_Problem_Crasher does the exact opposite. In other words, if it is determined that program p does eventually halt when given itself as input, Halting_Problem_Crasher enters an infinite loop. If it is determined that program p loops infinitely when given itself as input, Halting_Problem_Crasher halts.

Here it is in some python-esque pseudo-code:

Halting_Problem_Crasher(p):
  if(Halting_Problem_Solver(p, p) == Halts):
    while True: pass #infinite loop
  else:
    exit() #halt

Still with me? Ok, now what we're going to do is call Halting_Problem_Crasher(Halting_Problem_Crasher). That is, we're running Halting_Problem_Crasher with itself as input. Let's run through what happens. Halting_Problem_Crasher takes the program it's given and gives it to the Halting_Problem_Solver, using it as both of the parameters. So in this case, it calls
Halting_Problem_Solver(Halting_Problem_Crasher, Halting_Problem_Crasher)
So Halting_Problem_Solver is going to try to figure out if Halting_Problem_Crasher will loop infinitely or eventually halt with itself as input. But wait, that's what we were trying to figure out in the first place when we called
Halting_Problem_Crasher(Halting_Problem_Crasher)
So what should Halting_Problem_Solver determine? If it says it halts, Halting_Problem_Crasher loops. If it says it loops, Halting_Problem_Crasher halts. No matter what it says it's wrong. The only way it can work is by looping and halting simultaneously. The very existence of a solution to the halting problem proves that looping and halting are equivalent. And if looping and halting are equivalent, so are true and false, one and zero, up and down, etc. In other words, the existence of a solution to the halting problem would disprove our very existence.

Wednesday, February 9, 2011

The Pumping Lemma, and Why It's Slightly More Important Than You Might Have Thought

A while ago I answered a question on The Stack about the pumping lemma. My answer was, of course, a raving success, earning an unprecedented twenty-three up-votes (i.e. not very many). But what's important here isn't how awesome I am, but rather how awesome other people aren't. On a site populated with professional software engineers, most of whom probably have degrees in computer science, I was surprised to see exactly how poorly most people understand what the pumping lemma is and what it has to do with anything. So, for the benefit of my vast readership (hi mom!), here is my explanation of the pumping lemma in layman's terms.

The pumping lemma is a simple way to prove that a language is not regular, meaning that a Finite State Machine cannot be built for it. The canonical example is the language (a^n)(b^n). This is the simple language which is just any number of as, followed by the same number of bs. So the strings

ab
aabb
aaabbb
aaaabbbb
etc. are in the language, but
aab
bab
aaabbbbbb
etc. are not.
It's simple enough to build an FSM for these examples:
[Figure: a chain of states accepting (a^n)(b^n) up to n = 4]
This one will work all the way up to n=4. The problem is that our language didn't put any constraint on n, and Finite State Machines have to be, well, finite. No matter how many states I add to this machine, someone can give me an input where n equals the number of states plus one, and my machine will fail. So if there can be a machine built to read this language, there must be a loop somewhere in there to keep the number of states finite. With these loops added:
[Figure: the same machine with loops added so it can accept arbitrarily large n]
all of the strings in our language will be accepted, but there is a problem. After the first four as, the machine loses count of how many as have been input because it stays in the same state. That means that after four, I can add as many as as I want to the string, without adding any bs, and still get the same return value. This means that the strings:
aaaa(a*)bbbb
with (a*) representing any number of as, will all be accepted by the machine even though they obviously aren't all in the language. In this context, we would say that the part of the string (a*) can be pumped. The fact that the Finite State Machine is finite while n is not bounded guarantees that any machine which accepts all strings in the language MUST have this property. The machine must loop at some point, and at the point where it loops, the string can be pumped. Therefore no Finite State Machine can be built for this language, and the language is not regular.
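If you'd like to watch the machine fail concretely, here's a little C++ sketch of that looped machine. The saturating counter is just my shorthand for the chain of states, but the behavior is the same: everything in the language gets accepted, and so do pumped strings that shouldn't be.

#include <iostream>
#include <string>

// Simulates the looped machine: it can only count a's up to 4, so the
// counter saturates there (that's the self-loop "forgetting" how many
// a's went by), and the b chain is only 4 states long.
bool accepts(const std::string& s)
{
    size_t i = 0;
    int a = 0, b = 0;
    while (i < s.size() && s[i] == 'a') {
        if (a < 4) ++a;              // the self-loop: a's past 4 aren't counted
        ++i;
    }
    while (i < s.size() && s[i] == 'b') {
        if (++b > 4) return false;   // more b's than the machine has states for
        ++i;
    }
    return i == s.size() && a > 0 && b == a;
}

int main()
{
    std::cout << std::boolalpha
              << accepts("aaabbb") << '\n'        // true: in the language
              << accepts("aaaaaaabbbb") << '\n'   // true: pumped, NOT in the language
              << accepts("aabbb") << '\n';        // false: correctly rejected
}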

At this point you're probably asking yourself, "what does this have to do with anything and why should I care?" A fair question. The most immediate application of the pumping lemma is probably the problem of parsing HTML, or more generally, XML. As you may or may not remember, regular expressions and finite state machines are equivalent. That, along with the fact that HTML tags can be embedded within themselves deeper than Inception's dream levels, means that regular expressions can't parse HTML. In other words, regular expressions can't parse HTML. Which brings me to my next point: regular expressions can't parse HTML. Give me a regular expression for parsing HTML and I'll give you valid HTML that it is incapable of parsing. Seriously, I'm not busy.

Of course that probably doesn't seem like all that big of a deal, given all the freely available HTML parsers out there, but it turns out that the programmers who don't understand the pumping lemma are the same programmers who unnecessarily reinvent well-established tools when faced with simple problems.

Friday, January 14, 2011

Pandora's Unexpected Consequence

Let me just say that I like Pandora. I listen to it just about every day. It's just that in the course of my interaction with Pandora I've found some faults, and, like a normal person, rather than come up with solutions, I'd prefer to just smugly point them out on my public blog.

Anyway, the specific problem I'm talking about is actually an unexpected consequence of the Music Genome Project as opposed to Pandora. The Music Genome Project creates a roughly 100- to 400-dimensional vector for each song which classifies its musical qualities. The numerical values in the vector correspond to different aspects of the song, like how much distortion is on the guitar, what tonality it's in, etc. Pandora chooses songs to play for you by finding songs that are musically similar to the songs you've already told it you like, probably using a simple distance formula to determine the similarity of songs, something like the sketch below. The problem that I've decided to let bother me today is that the Music Genome Project is missing one genome that I think is crucial: how GOOD the song is.
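I have no idea what distance measure Pandora actually uses, but a plain Euclidean distance over genome vectors would look something like this sketch (the tiny five-number "genomes" are made up for illustration):

#include <cmath>
#include <iostream>
#include <vector>

// Euclidean distance between two song-genome vectors (assumed to be the
// same length): the smaller the result, the more "musically similar"
// the two songs are judged to be.
double distance(const std::vector<double>& a, const std::vector<double>& b)
{
    double sum = 0.0;
    for (size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

int main()
{
    // Hypothetical genomes: each number rates one quality of the song
    // (guitar distortion, tonality, and so on).
    std::vector<double> original = {0.8, 0.2, 0.5, 0.9, 0.1};
    std::vector<double> remix    = {0.7, 0.2, 0.5, 0.8, 0.1};
    std::cout << distance(original, remix) << std::endl;   // small = "similar"
}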

If you were an artist in the olden days, in order to get your song played on the radio it had to be popular. If not enough people liked the song, it just wouldn't get played. Although popularity isn't a great measure of quality, it's at least some sort of approximation, making it infinitely better than nothing. Nowadays, if you want to get your song played on Pandora, instead of making a song that people like, you just need to make a song that is musically similar to a song that is popular. Reverse engineering songs from popular music genomes sounds like a recipe for musical disaster, yet there is no reason it wouldn't work for Pandora, since quality is immune to measurement. Internet radio is still in its infancy, so it remains to be seen if this defect will result in any serious problems, but even now I can't tell you how many "DJ Whatever" remixes of good songs I've had come up on my radio stations in spite of thumbs-downing every single one of them. Just because it's musically similar to the original doesn't mean it's similarly good.