Recently, I have been learning about SQL indexing and the way databases store information. During this process, the run times for sorts, searches, and so on have come up. Because of that, I think it's worth trying to explain the run times (Big O Notation) of a few different searches - or rather, what it means when we say $O(n)$.
Credit to this post on Reddit for giving me an idea of how to explain it, but I think I'll try to take a slightly different route. I didn't actually read the whole post, but I do plan on borrowing a few ideas.
First off, what is Big O Notation? In the context of algorithms in computer science, I would define it as a function that represents the amount of time (worst case scenario) it will take a process (algorithm) to run on a given data structure. Why is this important? When we build an algorithm to handle data, it takes some amount of time to execute; depending on how it's built, it may take longer than another algorithm that yields the same result. It also matters for how well the algorithm scales as the amount of data you're working with grows.
To work with these ideas, we generally use Big O Notation (describing the worst case run time - sometimes the average case). Here's one example of why this is useful: suppose an algorithm, in practice, returns its result immediately every time. If we assume that behavior is typical, then when we run our process on different or larger amounts of data, we may find that it actually executes in quadratic time - a drastic decline in performance. If instead we know the algorithm runs in, worst case, $O(\log n)$, we know what to expect as the data changes.
In this post, I plan to go over a few of the time complexities I run into most often: constant, logarithmic, linear, linearithmic, and quadratic run times.
For these explanations, I'd like to set up a scenario:
You have a bunch of boxes and, depending on the algorithm, you've misplaced an item somewhere in these boxes. Besides the missing item, these boxes may have other items or properties; I'll use that extra information about the boxes to further explain each example.
Constant:
Constant run time is the easiest to understand (I'd think) because it's just that - constant. It's written as $O(1)$. Based on my scenario (though the box scenario isn't really needed for this run time), picture a room of 10 boxes. For a search, we could say every box has the item I'm looking for, so every time we look, we find the item right away. It always takes the same amount of time for this search; hence, constant run time. For an insertion or deletion, we could say we always remove a box from the same position, or we always add a new box to the room (even at random) the same way; the process always takes a constant amount of time.
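Here's a minimal Python sketch of the idea (the find_item function and the lists are just made up for illustration): because every box has the item, the lookup takes the same single step no matter how many boxes are in the room.

```python
def find_item(boxes):
    # Every box contains the item, so we only ever open the first box.
    # The work done never depends on how many boxes there are: O(1).
    return boxes[0]

small_room = ["item"] * 10
big_room = ["item"] * 1000

print(find_item(small_room))  # one step
print(find_item(big_room))    # still one step, even with 1000 boxes
```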
Logarithmic:
Logarithmic run time is a little trickier, but not too bad. It's written as $O(\log n)$. Let's say we have 10 boxes organized by weight and we're looking for a box of a specific weight. Given that we don't know each box's weight up front, only the weight we're looking for, and these boxes are ordered from least weight to most weight, left to right, a logarithmic search will be sufficient. Since there are only 10 boxes, starting at the leftmost and weighing each one might not be so bad; however, what if there were 1000 boxes? If it took 15 seconds to weigh each box and the box we were looking for was the rightmost one, it would take 15,000 seconds (over 4 hours), worst case, to find the box we want. Maybe instead we could start in the middle and work our way out in some direction from there? That would give us $n/2$, or with 1000 boxes, 500 checks worst case (assuming we were lucky enough to pick the correct half); a little over 2 hours, but still not good. Going back to 10 boxes, what if we applied the second method - starting at the middle box - over and over again? Let's say the box we want is the third box. Starting at box five, if it weighs too much, we then look at the 4 boxes to the left and try their middle - box two. If that is too light, we check the 2 boxes to its right - boxes three and four. No matter which one we pick, we'll have found our box in 3 or 4 checks, or about $\log_2 10$. This is how a logarithmic run time (or search, in this case) works. Scaling up to 1000 boxes, instead of up to 1000 checks we need only about $\log_2 1000 \approx 10$ checks - roughly 150 seconds at 15 seconds per weighing, just over 2 minutes, instead of 15,000 seconds. This is a significant improvement and shows why a logarithmic run time may be considered ideal in many cases.
NOTE: In this case, because we split the groups in half, we dealt with log base 2; this may not always be the case depending on what data structure we're working with - it's logarithmic time, not logarithmic-base-2 time, after all.
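A rough Python sketch of this divide-in-half search (a binary search) under the box scenario; the find_box_by_weight function and the list of weights are just made up for illustration:

```python
def find_box_by_weight(weights, target):
    # Binary search over boxes sorted by weight (lightest on the left).
    # Each weighing rules out half of the remaining boxes, so the worst
    # case is about log2(n) checks.
    low, high = 0, len(weights) - 1
    checks = 0
    while low <= high:
        mid = (low + high) // 2
        checks += 1
        if weights[mid] == target:
            return mid, checks
        elif weights[mid] < target:
            low = mid + 1   # the middle box is too light; look to the right
        else:
            high = mid - 1  # the middle box is too heavy; look to the left
    return -1, checks

weights = list(range(1, 1001))            # 1000 boxes ordered by weight
print(find_box_by_weight(weights, 1000))  # found in about 10 checks, not 1000
```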
Linear:
Linear run time, like constant, is pretty straightforward (haha...). It's written as $O(n)$. Let's take the same scenario as we did with logarithmic time, with one exception: the boxes are not sorted. Since they are not sorted, we can't apply our method of dividing them in half - we'd be in the same situation with each half as we were with the whole group, still needing to check each box. That means in the case of 1000 boxes, we may have to check all 1000 depending on where we start. After looking at logarithmic time, this seems like a terrible way to search for things, but the next two scenarios can be much worse (as can many other run times that I won't be talking about in this post).
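A quick Python sketch of the unsorted case (again, the function name and data are made up for illustration): with no ordering to exploit, the only option is to weigh box after box.

```python
import random

def find_box_linear(weights, target):
    # Linear search over unsorted boxes: with nothing to rule out groups
    # of boxes at once, we may have to weigh every box - n checks worst case.
    for position, weight in enumerate(weights):
        if weight == target:
            return position
    return -1

weights = list(range(1, 1001))
random.shuffle(weights)                # 1000 unsorted boxes
print(find_box_linear(weights, 1000))  # worst case: all 1000 boxes weighed
```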
Linearithmic:
Linearithmic run time, I think, is kind of difficult to give an example for; hopefully what I came up with makes sense. It's written as $O(n \log n)$. It makes a little more sense to me to describe linearithmic run time as a way of sorting objects rather than searching them - maybe because I can't think of a good searching example. Let's take our linear scenario and set it up so that we could later search for our item in logarithmic time; that is, we have 10 unsorted boxes and we want to order them by weight. We can start by grouping the boxes into pairs and sorting each pair (so the lighter box is on the left). From here, we can group the boxes again, giving us three groups of sorted boxes - two of 4 and one of 2. These were merged by comparing the leftmost box of one group to the leftmost box of the other and moving the lighter of the two into a new group, repeating until the two groups became one sorted group. We can do this again, merging the group of 2 into one of the groups of 4 to get a group of 6, leaving us with two groups - one of 4 and one of 6. And finally, one more combination gives a single sorted group. This is also known as a merge sort, which I'll attempt to demonstrate below (with a code sketch after the demonstration). This is linearithmic ($O(n \log n)$) because, worst case, we may have had to make about $n \log n$ comparisons in total. That is, it could have taken up to about 34 comparisons ($10 \cdot \log_2 10$ - again, base 2 because we're combining 2 groups at a time) to sort the boxes. This is relatively efficient, considering other options may have taken much longer (quadratic time, explained below).
Demonstration (the numbers are the weights of each box and do not have to be unique):
Starting Point: 3 1 9 4 2 5 10 8 7 6
First Grouping: 1,3 4,9 2,5 8,10 6,7 - 4 box movements, 5 checks (comparisons)
Second Grouping: 1,3,4,9 2,5,8,10 6,7 - 0 box movements, 5 checks
Third Grouping: 1,3,4,9 2,5,6,7,8,10 - 2 box movements, 4 checks
Fourth and Final Grouping: 1,2,3,4,5,6,7,8,9,10 - 5 box movements, 9 checks
NOTE: The boxes are now sorted, and while in this example it may look like we took 40 comparisons (each of the 10 boxes touched in each of the 4 groupings), we actually only took 23. Remember, $n \log n$ is the worst case scenario.
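Here's a rough Python sketch of the same idea. It's the common recursive form of merge sort rather than the bottom-up pairing walked through above, but the merging step - repeatedly taking the lighter of the two leftmost boxes - is the same:

```python
def merge_sort(boxes):
    # Split the boxes in half, sort each half, then merge the two sorted
    # halves. There are about log2(n) levels of merging and each level
    # touches all n boxes, giving O(n log n) comparisons overall.
    if len(boxes) <= 1:
        return boxes
    mid = len(boxes) // 2
    left = merge_sort(boxes[:mid])
    right = merge_sort(boxes[mid:])

    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:        # compare the leftmost box of each group
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])            # one group ran out; keep the rest as-is
    merged.extend(right[j:])
    return merged

print(merge_sort([3, 1, 9, 4, 2, 5, 10, 8, 7, 6]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```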
Quadratic:
Quadratic run time isn't too difficult to understand, but it may not always be that easy to recognize. It's written as $O(n^2)$. Let's say we have 10 unsorted boxes, and in each box there are 10 more unsorted boxes, one of which has the item we seek. Since these boxes are unsorted, we can't divide them up, so we must look in each box and in each of its inner boxes until we find our item. In other words, if the item is in the very last inner box we check of the last outer box we check, we will have checked 10 inner boxes for each of the 10 outer boxes, or $10^2$ (100) checks. This is horribly inefficient, and it would actually be faster to sort the boxes first and then search, but there are some scenarios where you are stuck with quadratic run times.
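A small Python sketch of that nested search (the find_item_nested function and the box layout are made up for illustration): every outer box forces a full pass over its inner boxes.

```python
def find_item_nested(outer_boxes, item):
    # Check every inner box of every outer box. With n outer boxes each
    # holding n inner boxes, the worst case is n * n = n^2 checks.
    checks = 0
    for outer in outer_boxes:
        for inner in outer:
            checks += 1
            if inner == item:
                return checks
    return checks

# 10 outer boxes, each with 10 inner boxes; the item is in the very last one.
outer_boxes = [["nothing"] * 10 for _ in range(10)]
outer_boxes[9][9] = "item"
print(find_item_nested(outer_boxes, "item"))  # 100 checks, i.e. 10^2
```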
Those are the most common run times I've run into, along with a bit of an explanation of what they actually mean. I should mention that there are cases where you might end up with an algorithm whose run time is something like $n + \log n$. We should look at this as though it were a limit going to infinity: since both terms grow to infinity, we don't write it as infinity plus infinity - that is still infinity. Instead, we keep only the dominant (faster-growing) of the two terms, since the added $\log n$ only slightly affects the growth; $\lim_{n \to \infty} \frac{n + \log n}{n} = 1$, so this is simply $O(n)$. There are many other run times that show up in algorithms, and it's important to understand what they mean, especially if you're designing algorithms to scale. This is all I'll cover for now on this topic. Any questions, comments, and/or suggestions are appreciated.