
Tools on display (source: Frank Vincentz via Wikimedia Commons).

Expanding Your Python Knowledge: Lesser-Known Libraries
The Python ecosystem is vast and far-reaching in both scope and depth. Starting out in this crazy, open-source forest is daunting, and even with years of experience, it still requires continual effort to keep up-to-date with the best libraries and techniques.
In this report we take a look at some of the lesser-known Python libraries and tools. Python itself already includes a huge number of high-quality libraries; collectively these are called the standard library. The standard library receives a lot of attention, but there are still some libraries within it that should be better known. We will start out by discussing several extremely useful tools in the standard library that you may not know about.
We’re also going to discuss several exciting, lesser-known libraries from the third-party ecosystem. Many high-quality third-party libraries are already well-known, including NumPy, SciPy, Django, Flask, and Requests; you can easily learn more about these libraries by searching for information online. Rather than focusing on those standouts, this report is instead going to focus on several interesting libraries that are growing in popularity.
Let’s start by taking a look at the standard library.
The Standard Library

The libraries that tend to get all the attention are the ones heavily used for operating-system interaction, like sys, os, shutil, and to a slightly lesser extent, glob. This is understandable because most Python applications deal with input processing; however, the Python standard library is very rich and includes a bunch of additional functionality that many Python programmers take too long to discover. In this chapter we will mention a few libraries that every Python programmer should know very well.
collections

First up we have the collections module. If you’ve been working with Python for any length of time, it is very likely that you have made use of this module; however, the batteries contained within are so important that we’ll go over them anyway, just in case.
collections.OrderedDict

collections.OrderedDict gives you a dict that will preserve the order in which items are added to it; note that this is not the same as a sorted order.
The need for an ordered dict comes up surprisingly often. A common example is processing lines in a file where the lines (or something within them) map to other data. A mapping is the right solution, and you often need to produce results in the same order in which the input data appeared. Here is a simple example of how the ordering changes with a normal dict:
>>> from string import ascii_lowercase
>>> dict(zip(ascii_lowercase, range(4)))
{'a': 0, 'b': 1, 'c': 2, 'd': 3}
>>> dict(zip(ascii_lowercase, range(5)))
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
>>> dict(zip(ascii_lowercase, range(6)))
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'f': 5, 'e': 4}
>>> dict(zip(ascii_lowercase, range(7)))
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'g': 6, 'f': 5, 'e': 4}
See how the key "f" now appears before the "e" key in the sequence of keys? They no longer appear in the order of insertion, due to how the dict internals manage the assignment of hash entries. (Note that the built-in dict began preserving insertion order in CPython 3.6, and this became a language guarantee in Python 3.7; the output shown here is from an older version.)
The OrderedDict , however, retains the order in which items are inserted:
>>> from collections import OrderedDict
>>> OrderedDict(zip(ascii_lowercase, range(5)))
OrderedDict([('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)])
>>> OrderedDict(zip(ascii_lowercase, range(6)))
OrderedDict([('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4), ('f', 5)])
>>> OrderedDict(zip(ascii_lowercase, range(7)))
OrderedDict([('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4), ('f', 5), ('g', 6)])

Warning: OrderedDict and creation with keyword arguments

There is an unfortunate catch with OrderedDict you need to be aware of: it doesn’t work when you create the OrderedDict with keyword arguments, a very common Python idiom:
>>> collections.OrderedDict(a=1, b=2, c=3)
OrderedDict([('b', 2), ('a', 1), ('c', 3)])

This seems like a bug, but as explained in the documentation, it happens because the keyword arguments are first processed as a normal dict before they are passed on to the OrderedDict. (This particular wrinkle was fixed by PEP 468 in Python 3.6, where keyword arguments keep their order.)
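If you need a deterministic ordering at construction time, one simple workaround (a sketch, not the only option) is to pass a sequence of (key, value) pairs instead of keyword arguments:

```python
from collections import OrderedDict

# A list of pairs preserves the intended order, unlike
# keyword arguments on pre-3.6 versions of Python.
d = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
print(list(d.keys()))  # ['a', 'b', 'c']
```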
collections.defaultdict

collections.defaultdict is another special-case dictionary: it allows you to specify a default value for all new keys.
Here’s a common example:
>>> import collections
>>> d = collections.defaultdict(list)
>>> d['a']
[]
You didn’t create this item yet? No problem! Key lookups automatically create values using the function provided when creating the defaultdict instance.
By setting up the default value as the list constructor in the preceding example, you can avoid wordy code that looks like this:
d = {}
for k in keydata:
    if k not in d:
        d[k] = []
    d[k].append(...)

The setdefault() method of a dict can be used in a somewhat similar way to initialize items with defaults, but defaultdict generally results in clearer code.
In the preceding examples, we’re saying that every new element, by default, will be an empty list. If, instead, you wanted every new element to contain a dictionary, you might say defaultdict(dict) .
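For instance, here is a small sketch of the nested-mapping case (the keys and values are invented for illustration), with the setdefault() alternative alongside for comparison:

```python
from collections import defaultdict

# defaultdict(dict): every missing key gets a fresh empty dict.
users = defaultdict(dict)
users['alice']['age'] = 30       # no KeyError: inner dict created on demand
users['alice']['city'] = 'Oslo'

# The setdefault() equivalent needs a little more ceremony.
plain = {}
plain.setdefault('alice', {})['age'] = 30

print(users['alice'])  # {'age': 30, 'city': 'Oslo'}
print(plain['alice'])  # {'age': 30}
```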
collections.namedtuple

The next tool, collections.namedtuple, is magic in a bottle! Instead of working with this:

tup = (1, True, "red")

You get to work with this:
>>> from collections import namedtuple
>>> A = namedtuple('A', 'count enabled color')
>>> tup = A(count=1, enabled=True, color="red")
>>> tup.count
1
>>> tup.enabled
True
>>> tup.color
'red'
>>> tup
A(count=1, enabled=True, color='red')

The best thing about namedtuple is that you can add it to existing code and use it to progressively replace tuples: it can appear anywhere a tuple is currently being used, without breaking existing code, and without using any extra resources beyond what plain tuples require. Using namedtuple incurs no extra runtime cost, and can make code much easier to read. The most common situation where a namedtuple is recommended is when a function returns multiple results, which are then unpacked into a tuple. Let’s look at an example of code that uses plain tuples, to see why such code can be problematic:
>>> def f():
...     return 2, False, "blue"

>>> count, enabled, color = f()

>>> tup = f()

>>> enabled = tup[1]
Simple function returning a tuple.
When the function is evaluated, the results are unpacked into separate names.
Worse, the caller might access values inside the returned tuple by index.
The problem with this approach is that this code is fragile to future changes. If the function changes (perhaps by changing the order of the returned items, or adding more items), the unpacking of the returned value will be incorrect. Instead, you can modify existing code to return a namedtuple instance:
>>> def f():
...     # Return a namedtuple!
...     return A(2, False, "blue")

>>> count, enabled, color = f()
Even though our function now returns a namedtuple, the same calling code still works.
You now also have the option of working with the returned namedtuple in the calling code:
>>> tup = f()
>>> print(tup.count)
2
Being able to use attributes to access data inside the tuple is much safer than relying on indexing alone; if future changes in the code added new fields to the namedtuple, tup.count would continue to work.
The collections module has a few other tricks up its sleeve, and your time is well spent brushing up on the documentation. In addition to the classes shown here, there is also a Counter class for easily counting occurrences, a list-like container for efficiently appending and removing items from either end (deque), and several helper classes to make subclassing lists, dicts, and strings easier.
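As a brief taste of two of those classes, here is a minimal sketch of Counter and deque in action (the sample data is invented for illustration):

```python
from collections import Counter, deque

# Counter tallies occurrences in any iterable.
votes = Counter(['red', 'blue', 'red', 'green', 'red'])
print(votes.most_common(1))  # [('red', 3)]

# deque supports fast appends and pops at both ends; with
# maxlen set, old items fall off the opposite end automatically.
window = deque(maxlen=3)
for i in range(5):
    window.append(i)
print(list(window))  # [2, 3, 4]
```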
contextlib

A context manager is what you use with the with statement. A very common idiom in Python for working with file data demonstrates the context manager:
with open('data.txt', 'r') as f:
    data = f.read()

This is good syntax because it simplifies the cleanup step where the file handle is closed. Using the context manager means that you don’t have to remember to do f.close() yourself: this will happen automatically when the with block exits.
You can use the contextmanager decorator from the contextlib library to benefit from this language feature in your own nefarious schemes. Here’s a creative demonstration where we create a new context manager to print out performance (timing) data.
This might be useful for quickly testing the time cost of code snippets, as shown in the following example. The numbered notes are intentionally not in numerical order in the code. Follow the notes in numerical order as shown following the code snippet.
from time import perf_counter
from array import array
from contextlib import contextmanager

@contextmanager                          # (2)
def timing(label: str):
    t0 = perf_counter()                  # (3)
    yield lambda: (label, t1 - t0)       # (4)
    t1 = perf_counter()                  # (5)

with timing('Array tests') as total:     # (7)
    with timing('Array creation innermul') as inner:   # (1)
        x = array('d', [0] * 1000000)
    with timing('Array creation outermul') as outer:   # (6)
        x = array('d', [0]) * 1000000

print('Total [%s]: %.6f s' % total())
print('    Timing [%s]: %.6f s' % inner())
print('    Timing [%s]: %.6f s' % outer())
1. The array module in the standard library has an unusual approach to initialization: you pass it an existing sequence, such as a large list, and it converts the data into the datatype of your array if possible; however, you can also create an array from a short sequence, after which you expand it to its full size. Have you ever wondered which is faster? In a moment, we’ll create a timing context manager to measure this and know for sure!
2. The key step you need to take to make your own context manager is to use the @contextmanager decorator.
3. The section before the yield is where you can write code that must execute before the body of your context manager will run. Here we record the timestamp before the body runs.
4. The yield is where execution is transferred to the body of your context manager; in our case, this is where our arrays get created. You can also return data: here I return a closure that will calculate the elapsed time when called. It’s a little clever but hopefully not excessively so: the final time t1 is captured within the closure even though it will only be determined on the next line.
5. After the yield, we write the code that will be executed when the context manager finishes. For scenarios like file handling, this would be where you close the file. In this example, this is where we record the final time t1.
6. Here we try the alternative array-creation strategy: first create a small array, and then expand it to full size.
7. For fun, we’ll use our awesome new context manager to also measure the total time.
On my computer, this code produces this output:
Total [Array tests]: 0.064896 s
    Timing [Array creation innermul]: 0.064195 s
    Timing [Array creation outermul]: 0.000659 s

Quite surprisingly, the second method of producing a large array is around 100 times faster than the first. This means that it is much more efficient to create a small array and then expand it, rather than to create an array entirely from a large list.
The point of this example is not to show the best way to create an array: rather, it is that the contextmanager decorator makes it exceptionally easy to create your own context manager, and context managers are a great way of providing a clean and safe means of managing before-and-after coding tasks.
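One caveat worth knowing: in the timing example, if the body of the with block raises an exception, the code after the yield never runs. Production context managers usually wrap the yield in try/finally so that cleanup is guaranteed. Here is a minimal sketch of that pattern (the managed() helper and Dummy class are invented for illustration):

```python
from contextlib import contextmanager

@contextmanager
def managed(resource):
    # Setup would go here, before the yield; the finally clause
    # guarantees cleanup even if the with-block body raises.
    try:
        yield resource
    finally:
        resource.close()

class Dummy:
    closed = False
    def close(self):
        self.closed = True

r = Dummy()
try:
    with managed(r) as res:
        raise ValueError('boom')
except ValueError:
    pass

print(r.closed)  # True: close() ran despite the exception
```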
concurrent.futures

The concurrent.futures module that was introduced in Python 3 provides a convenient way to manage pools of workers. If you have previously used the threading module in the Python standard library, you will have seen code like this before:
import threading

def work():
    return sum(x for x in range(1000000))

thread = threading.Thread(target=work)
thread.start()
thread.join()

This code is very clean with only one thread, but with many threads it can become quite tricky to deal with sharing work between them. Also, in this example the result of the sum is not obtained from the work function, simply to avoid all the extra code that would be required to do so. There are various techniques for obtaining the result of a work function, such as passing a queue to the function, or subclassing threading.Thread, but we’re not going to discuss them any further, because the multiprocessing package provides a better method for using pools, and the concurrent.futures module goes even further to simplify the interface. Best of all, thread-based pools and process-based pools share the same interface, making it easy to switch between the two approaches.
Here we have a trivial example using the ThreadPoolExecutor. We download the landing page of a plethora of popular social media sites, and, to keep the example simple, we print out the size of each. Note that in the results, we show only the first four to keep the output short.
from concurrent.futures import ThreadPoolExecutor as Executor

urls = """google twitter facebook youtube pinterest tumblr
instagram reddit flickr meetup classmates microsoft apple
linkedin xing renren disqus snapchat twoo whatsapp""".split()

def fetch(url):
    from urllib import request, error
    try:
        data = request.urlopen(url).read()
        return '{}: length {}'.format(url, len(data))
    except error.HTTPError as e:
        return '{}: {}'.format(url, e)

with Executor(max_workers=4) as exe:
    template = 'http://www.{}.com'
    jobs = [exe.submit(fetch, template.format(u))
            for u in urls]
    results = [job.result() for job in jobs]

print('\n'.join(results))
Our work function, fetch() , simply downloads the given URL.
Yes, it is rather odd nowadays to see urllib, because the fantastic third-party library requests is a great choice for all your web-access needs. However, urllib still exists and, depending on your needs, may allow you to avoid an external dependency.
We create a ThreadPoolExecutor instance, and here you can specify how many workers are required.
Jobs are created, one for every URL in our considerable list. The executor manages the delivery of jobs to the four threads.
This is a simple way of waiting for all the threads to return.
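Collecting job.result() in submission order is the simplest approach, but concurrent.futures also offers as_completed(), which yields each future as soon as it finishes. Here is a small sketch using a CPU-light stand-in for the fetch() function (the work() function is invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(n):
    return n * n

with ThreadPoolExecutor(max_workers=4) as exe:
    jobs = [exe.submit(work, n) for n in range(5)]
    # as_completed yields futures in completion order, not
    # submission order, so we sort the results for display.
    results = sorted(f.result() for f in as_completed(jobs))

print(results)  # [0, 1, 4, 9, 16]
```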
This produces the following output (I’ve shortened the number of results for brevity):
http://www.google.com: length 10560
http://www.twitter.com: length 268924
http://www.facebook.com: length 56667
http://www.youtube.com: length 437754
[snip]

Even though one job