Picking the first/top entry of a set of values is a pretty frequent need; this usually means defining one value that has priority over the other and involves sorting.
But sorting can be expensive and re-sorting every time you add an entry to your values is certainly not a very convenient way to pick the first entry out of a set of values with some kind of priority.
Heaps are a perfect match for everything that has priorities, such as a priority queue:
import time import heapq class PriorityQueue: def __init__(self): self._q = [] def add(self, value, priority=0): heapq.heappush(self._q, (priority, time.time(), value)) def pop(self): return heapq.heappop(self._q)[-1]
Then, our PriorityQueue
can be used to retrieve entries given a priority:
>>> def f1(): print('hello')
>>> def f2(): print('world')
>>>
>>> pq = PriorityQueue()
>>> pq.add(f2, priority=1)
>>> pq.add(f1, priority=0)
>>> pq.pop()()
hello
>>> pq.pop()()
world
PriorityQueue
works by storing everything in an heap. Heaps are particularly efficient at retrieving the top/first element of a sorted set without having to actually sort the whole set.
Our priority queue stores all the values in a three-element tuple: priority
, time.time()
, and value
.
The first entry of our tuple is priority
(lower is better). In the example, we recorded f1
with a better priority than f2
, which ensures than when we use heap.heappop
to fetch tasks to process, we get f1
and then f2
, so that we end up with the hello world
message and not world hello
.
The second entry, timestamp
, is used to ensure that tasks that have the same priority are processed in their insertion order. The oldest task will be served first as it will have the smallest timestamp.
Then, we have the value itself, which is the function we want call for our task.
A very common approach to sorting is to keep a list of entries in a tuple, where the first element is key
for which we are sorting and the second element is the value itself.
For a scoreboard, we can keep each player's name and how many points they got:
scores = [(123, 'Alessandro'), (143, 'Chris'), (192, 'Mark']
Storing those values in tuples works because comparing two tuples is performed by comparing each element of the first tuple with the element in the same index position in the other tuple:
>>> (10, 'B') > (10, 'A')
True
>>> (11, 'A') > (10, 'B')
True
It's very easy to understand what's going on if you think about strings. 'BB' > 'BB'
is the same as ('B', 'B') > ('B', 'A')
; in the end, a string is just a list of characters.
We can use this property to sort our scores
and retrieve the winner of a competition:
>>> scores = sorted(scores)
>>> scores[-1]
(192, 'Mark')
The major problem with this approach is that every time we add an entry to our list, we have to sort it again, or our scoreboard would became meaningless:
>>> scores.append((137, 'Rick'))
>>> scores[-1]
(137, 'Rick')
>>> scores = sorted(scores)
>>> scores[-1]
(192, 'Mark')
This is very inconvenient because it's easy to miss re-sorting somewhere if we have multiple places appending to the list, and sorting the whole list every time can be expensive.
The Python standard library offers a data structure that is a perfect match when we're interested in finding out the winner of a competition.
In the heapq
module, we have a fully working implementation of a heap data structure, a particular kind of tree where each parent is smaller than its children. This provides us with a tree that has a very interesting property: the root element is always the smallest one.
And being implemented on top of a list, it means that l[0]
is always the smallest element in a heap
:
>>> import heapq
>>> l = []
>>> heapq.heappush(l, (192, 'Mark'))
>>> heapq.heappush(l, (123, 'Alessandro'))
>>> heapq.heappush(l, (137, 'Rick'))
>>> heapq.heappush(l, (143, 'Chris'))
>>> l[0]
(123, 'Alessandro')
You might have noticed, by the way, that the heap finds the loser of our tournament, not the winner, and we were interested in finding the best player, with the highest value.
This is a minor problem we can easily solve by storing all scores as negative numbers. If we store each score as * -1
, the head of the heap will always be the winner:
>>> l = []
>>> heapq.heappush(l, (-143, 'Chris'))
>>> heapq.heappush(l, (-137, 'Rick'))
>>> heapq.heappush(l, (-123, 'Alessandro'))
>>> heapq.heappush(l, (-192, 'Mark'))
>>> l[0]
(-192, 'Mark')