Thursday, 7 November 2013

Priority Queue Example

The training program for campus hires is still going on, with Narayana Reddy from Architecture group and Swapna Shetty from training department driving the program well. I stepped in for one class in which in order to drive a better understanding of Priority Queues, I presented a small example.

But first, what exactly is a Priority Queue? It is an abstract data type which is like a regular queue or stack data structure, but where additionally each element has a “priority” associated with it [Source: wikipedia]. After telling this definition, I asked the trainees, how do you find the five largest sized files in eTrans (one of our products)?

The first response is what can be termed as brute-force method. Find the size of all files, store them in a data structure, sort the sizes and find the last five elements. The second response said, store them in a data structure that maintains the elements in sorted order so that we avoid the work of sorting.

They missed thinking about space complexity. So I counter asked: what if there are millions of files? Or even tens of millions. You would need a data structure that stores all those files and waste so much memory space, when all you need is to know what are the five largest elements.

Then I explained: Priority Queues are the solution to this problem where you don’t have to use up all that memory. All we need is a data structure that holds one more than the required elements. So in our case it will be six size priority queue. I wrote the design of the program on the board as follows:

1) Create a class called Pair with instance variables key (String) and value (long).

2) Pair implements Comparable.

3) Its compareTo method
--- returns -1 if this value is less than the compared value.
--- returns 1 if this value is greater than the compared value.
--- returns 0 other wise.

4) Create a class called FindLargestFiles.
--- Write the main method to take two command line arguments : 1) an integer that represents how many largest files you want to find. 2) the root folder where you want to start your search.
--- Store the integer in a variable called top.
--- Instantiate a priority queue called pq with size top +1.
--- Call the method getFileSizes with the root folder name as argument.
--- Use the foreach loop to print out the key and value pairs in pq.

5) Write a method getFileSizes that takes a folder name as parameter
--- if it finds a folder, it recurses
--- else if it finds a java file, it adds the name and size to the priority queue
--- if queue exceeds top, remove the minimum

Then I showed the source code (and also ran it).

The explanation is: For finding the five largest files, we instantiate a priority queue of size six. Basically the trick here is that the first six filenames & sizes go straight into the priority queue. After inserting the sixth element, we remove the minimum sized element. Now we have five largest elements in the priority queue. For the seventh file we again insert the filename & size, then remove the minimum sized element. Again the priority queue is left with five largest files examined so far. The process continues for all files. At the end of the run, after all files are examined, the priority queue is left with the five larges files.

The code to find if a file is a java file, i.e., if the name ends with “.java” is straightforward. Similarly, getting the file size is a simple call to the method length of the File class.

As an exercise, I asked the group to get the file sizes by number of lines. This is easily done by a while loop in which you increment a loop counter after getting each line. Finally, in order to emphasize the importance of priority queues in programming, I concluded the class by reading out an extract from Algorithms by Wayne and Sedgewick -

To appreciate the value of the priority-queue abstraction, consider the following problem: You have a huge input stream of N string and associated integer values, and your task is to find the largest or smallest M integers (and associated strings) in the input stream. You might imagine the stream to be financial transactions, where your interest is to find the big ones, or pesticide levels in an agricultural product, where your interest is to find the small ones, or requests for service, or results from a scientific experiment, or whatever. In some applications, the size of the input stream is so huge that it is best to consider it to be unbounded. One way to address the problem would be to sort the input stream and take the M largest keys from the result, but we have just stipulated that the input stream is too large for that. Another approach would be to compare each new key against the M largest seen so far, but that is also likely to be prohibitively expensive unless M is small. With priority queues, we solve the problem. For the huge values of N that are likely to be encountered in our modern computational infrastructure, these implementations can make the difference between being able to address such a problem and not having the resources to do it at all.

A concept is well understood if people are given examples from their regular work. To use a more technical phrase, we can call it as finding the relevant use case. This process is applicable at a high level too. First we monitor new technologies. Next we ask how do we apply them in our domain. Then we extend the thought process to how we translate them into our product features. This three step process is nothing but finding the relevant use case applied to technology adoption.

No comments:

Post a Comment