Speed up your Python Scripts using Ubuntu and GNU Parallel
Introduction
NOTE: This blog post is not an overview of Linux-based environments, it assumes a base knowledge and that Python is set up.
I’m sure one of the main reasons most of us got into Python is to automate long, tedious workflows that must be repeated for multiple sets of data. The beauty of Python is its set-it-and-forget-it simplicity once the code is written. However, depending on the size and number of datasets, along with the number of steps involved, we may have to “forget” about our script for anywhere from hours to days! If only there were a way to speed up our scripts…
This is where the power of a Linux-based operating system comes into play, through the command-line utility GNU Parallel! Before we get too in-depth into the application of this tool, we first must understand what parallel computing is. Parallel computing is the practice of utilizing multiple cores simultaneously to perform a task. A core is a “brain” within our CPU; it is where tasks are sent to be performed. Most PCs nowadays have at least two cores, if not more, and most modern software packages (such as ArcGIS Pro) take advantage of parallel computing to perform tasks in less time than previously possible. Python is a general exception to this, as a standard Python script runs on only a single core, which brings us to the main advantage of this tool. GNU Parallel allows us to run multiple instances of a Python script, each on its own core. This lets us work through our data up to N times faster, where N is the number of cores we dedicate to the job.
Setup and Commands
In order for this to work properly, a few modifications must be made to your Python scripts:
- All files that are being modified/created must have unique names! Since we will have multiple instances of Python running at the same time, unique naming conventions must be used so that one instance does not overwrite the files of another (this includes temporary files!). The sketch after this list shows one way to build unique output names.
- The script must be set up to receive its input dataset from the command line. This prevents every instance from processing the same data. As an example, suppose you have 10 raster datasets that need to be classified. Instead of looping through the directory containing the datasets within Python, the looping will be done in the shell, and each instance of Python will receive the dataset it is to run via a line of code like the following (a fuller sketch follows this list):
input_path = os.path.abspath(os.path.expanduser(str(sys.argv[-1])))
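Putting both requirements together, a parallel-safe worker script might look something like this minimal sketch. The "_classified" suffix and the commented-out classify_raster call are placeholders for whatever naming convention and processing your own workflow uses:

import os
import sys

# The last command-line argument is the dataset handed to this instance by GNU Parallel.
input_path = os.path.abspath(os.path.expanduser(str(sys.argv[-1])))

# Derive a unique output name from the input so parallel instances never collide.
base, ext = os.path.splitext(input_path)
output_path = base + "_classified" + ext

# classify_raster(input_path, output_path)  # placeholder for your actual processing step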
Now that we have our script in order, we must install GNU Parallel. To do this on Ubuntu, simply run the following line in the shell:
sudo apt-get install parallel
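If you want to confirm the installation worked, and see how many cores GNU Parallel detects on your machine, these two commands are a quick check:

parallel --version
parallel --number-of-cores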
Once installed, change to the directory containing your datasets using the cd command. Once in place, we run our script in parallel using the following command in the shell:
find . -name "*.tif" -type f | parallel -j 8 --progress python "path to python script"
Let’s look at this command in more depth:
- The first portion before the “|” is a shell find command, which searches through the current directory and any sub-directories for files matching a pattern, in this case any files with a .tif extension. The "*.tif" pattern can be replaced with whatever type of file you are looking for, whether that be shapefiles ("*.shp") or zip files ("*.zip"). These are the files we will be sending to our Python script.
- “|” is the pipe symbol in the Ubuntu shell; it takes the output of the command before it and passes it to the command after it. In this case, what is piped over is the list of .tif files, which is picked up by the parallel program and passed, one file at a time, to the Python script via the line of code specified earlier!
- -j specifies how many jobs to run simultaneously, which in practice means how many cores the command will use. If you do not know how many cores your computer has, execute parallel --number-of-cores in the shell. In the case above, 8 cores will be used, meaning 8 instances of Python will run simultaneously!
- --progress tells parallel to print a running summary of how many jobs have completed. Everything after the options, in this case the call to python with the path to our script, is the command we wish to parallelize.
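By default, parallel appends each filename it reads from the pipe to the end of the command, which is exactly why sys.argv[-1] picks it up inside the script. If you prefer to make the substitution explicit, GNU Parallel’s {} placeholder does the same thing:

find . -name "*.tif" -type f | parallel -j 8 --progress python "path to python script" {}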
Always test Python parallelization with a small directory of files before executing on everything! It is easy to make a mistake in the shell command or within the script, especially related to file management.
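One low-risk way to test, besides pointing the command at a small test directory, is GNU Parallel’s --dry-run option, which prints the commands it would run without actually executing them:

find . -name "*.tif" -type f | parallel --dry-run -j 2 python "path to python script"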
A Note on Core Management
A good rule of thumb when using the parallel utility is to always leave at least one core free so that you can still navigate and monitor your computer without it being fully bogged down. This also means we must consider how many cores the commands our Python script calls will use themselves. In the case of ArcPy, we can regulate this with the following setting:
arcpy.env.parallelProcessingFactor = "1"  # a core count such as "1", or a percentage such as "50%"
More about this setting can be found in Esri’s ArcPy environment documentation; it controls how many cores a single ArcPy geoprocessing tool is allowed to use. I highly recommend giving it a read, as it also provides additional background on parallel computing and how to decide on the number of cores to use.
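As a rough sketch of how the two levels of parallelism fit together in a worker script (the actual geoprocessing step is left as a placeholder, since it depends on your workflow):

import os
import sys
import arcpy

# Keep each instance of this script on roughly one core, so that
# 8 instances launched by GNU Parallel use about 8 cores in total.
arcpy.env.parallelProcessingFactor = "1"

# Dataset handed to this instance by GNU Parallel.
input_path = os.path.abspath(os.path.expanduser(str(sys.argv[-1])))

# ... run the ArcPy geoprocessing for this dataset here ...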
Conclusion
GNU Parallel is an extremely powerful tool for increasing processing speed by executing the same process on more than one core. While it can be used with any process that can be run from the shell, it is especially useful for GIS users working with Python, as shown in the example above. Although many here won’t be using a Linux-based environment, I feel this is a good example of why Linux-based environments can be so powerful, and how they can benefit the GIS operator.