Today I give my take on a blog article I stumbled across today called “Computer Algorithms: Linear Search in Sorted Lists” which gives a look at applying the linear search model to something sorted. I will offer a suggestion to perhaps make this a bit better for implementation and perhaps performance reasons. So buckle up and we will take another spin around the Programming Underground!
Ok, so if you are a reasonably skilled programmer you are probably saying to yourself “Linear search for sorted lists? Are you on crack? Use an algorithm like binary search!”. And no, I am not on crack and neither is Stoimen Popov. What he is attempting to say is that for sorted lists which contain often searched for values, why not implement a linear search over these common values and go to the binary search method if the value has not already been found?
What he is describing is a cache. Starting with a sorted list we search for a value and once found, we move it to the top of the list. Next time we search for the value, we do a quick linear search at the top… which would be much quicker than a binary search every time. If it is not found in the first few entries at the top of the list, then you would go to a binary search to find the value in the remaining sorted list.
This approach is sound and definitely on the right track. The problem with it, besides having to go back to the binary search if the items being searched for was not at the top, is that if a person tries to do this on a list that is searched often, eventually you will throw your sorted list into disorganization. Perhaps you search for the value “Tom” a lot and someone else search “John” a lot and someone else “Jack” or “George” or “Ken” etc. Next thing you know you have hundreds of items out of order at the top of your list in a list that was suppose to be sorted. In other words, it is going to degrade over time. Especially if this list is being hit through something that shares the list.
My take on this is that we should look at breaking out this list of commonly searched for values into its own list, leaving the original list intact. Now perhaps this is what the original poster meant but failed to mention.
The thought here is that we will not destroy the order of the original sorted list by moving items we find to the top. Instead we will find the value and put it into a second list of X items. Then when a new search is performed, we linear search this subset list first and if not found then we go to a binary search on the original list. This allows the original list to remain “pure” and also gives us a second list to manipulate.
If we search for a value that was found in our smaller “cached” list, we move it to the top of the cached list and continue on. If it was found in the larger list, we add it to the top of the cached list. Once the cached list reaches X items (we can base the size of this list depending on the size of the original list), we start popping off items from the bottom. Thus the cached list is really a FIFO structure.
My initial attempt would be to make this cache list anywhere from 10 – 20% the size of the original list. So if we had 100 items, the cached item would be 10 to 20 items. However, this will always depend on the type of data you are storing.
The only real two setbacks I could see with this is how to implement a generic enough version of it and that it can take up additional memory (the classic space for performance trade off). Do we make the cache something static between calls? Your language and design will surely dictate how you do this. And do we have the space for this cached list?
I think putting this cache in its own list, broken out from the original list will prevent the original list from degrading, will allow us to manipulate the size of the cached terms, afford us the ability to even share the cache between lists if we wanted and also take advantage of the small linear search before using the binary search theory.
I am glad I stumbled across this article and will be giving it more thought during my implementations of it. It really kicks us out of the rut of seeing a sorted list and immediately skipping linear search as a method. Why not use it for a small subset and then use the binary as a backup? No harm in that I think!
Thanks for reading! 🙂