Useful tools for common Python tasks.
This package arose from a desire to standardize useful methods and classes I found myself reusing in many projects. These fall into several sub-packages:
`scripting`
: methods with useful defaults and settings for log format, verbosity, and destination

`io_tools`
: methods for intelligently guessing file compression from the extension and safely buffering numerical data before writing to an HDF5 file (a short sketch of `smart_open` follows this list)

`parallel`
: determine which options for parallelization are available in the current environment, and run a method on a dataset using a master-slave paradigm. The `Parallelizer` class arose from a common use case: writing, testing, and running scripts on a local machine using multiprocessing or multithreading for parallelization, but then needing to modify the scripts to use MPI on a large cluster. `Parallelizer` allows the same script to be run in both contexts without any code changes.

`plotting`
: useful color schemes for maximum contrast and methods for conversion between color spaces
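For instance, `io_tools.smart_open` picks a (de)compression codec from the file extension, so gzipped and plain text files can be handled by the same code. A minimal sketch, assuming write mode `"w"` mirrors the read mode used in the full example further down:

```python
from python_utilities.io_tools import smart_open

# Write a gzipped file; compression is guessed from the ".gz" extension.
# (Write mode "w" is an assumption, mirroring the documented read usage.)
with smart_open("numbers.txt.gz", "w") as f:
    for i in range(10):
        f.write("%d\n" % i)

# Read it back the same way; no explicit gzip handling is needed.
with smart_open("numbers.txt.gz", "r") as f:
    numbers = [float(line.strip()) for line in f]
```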
`python_utilities` may be installed in three ways, in order of preference:

- Using conda: `conda install -c conda-forge sdaxen_python_utilities`
- Using pip: `pip install sdaxen_python_utilities`
- From GitHub:
  1. Download this repository to your machine, either by cloning it with `git clone https://github.com/sdaxen/python_utilities.git` OR by navigating to https://github.com/sdaxen/python_utilities, clicking "Clone or download > Download ZIP", and extracting the archive.
  2. Add the path to the repository to your `$PYTHONPATH`. On Unix, this can be done with `export PYTHONPATH=[PATH/TO/REPO]:$PYTHONPATH`, where `[PATH/TO/REPO]` is replaced with the path on your machine.
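Whichever method you use, the package should then be importable under the name `python_utilities`, as in the example below. A quick sanity check:

```python
# Confirm the package and its sub-packages are importable.
import python_utilities
from python_utilities.parallel import Parallelizer

print(python_utilities.__file__)  # location of the installed package
```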
An example usage of the most common methods/classes is given below. In this example, we read in a file that contains a range of numbers and compute the product of each of those numbers with a single number. We do this in parallel, so that as each slave node is ready, the master sends it a number from the file. All results are logged to `log.txt`, and the results are saved to a file `products.txt`.
```python
from python_utilities.scripting import setup_logging
from python_utilities.io_tools import smart_open
from python_utilities.parallel import Parallelizer, make_data_iterator


# Methods written for parallel have non-keyword (num1) and keyword (num2)
# arguments. All keyword arguments must be constant across all parallel
# runs, while non-keyword arguments may vary. Here, we will vary num1, but
# num2 will be constant.
def product(num1, num2=100):
    return num1 * num2


# log everything, including logging.debug messages, to log.txt
setup_logging("log.txt", verbose=True)

data_list = []
# smart_open recognizes the .gz extension
with smart_open("numbers.txt.gz", "r") as f:
    for line in f:
        data_list.append(float(line.strip()))

# items in the iterator must be lists or tuples (non-keyword args)
data_iterator = make_data_iterator(data_list)

# use multiprocessing if available
parallelizer = Parallelizer(parallel_mode="processes")

run_kwargs = {"out_file": "products.txt",         # save one result per line
              "out_str": "%d\n",                  # formatting of output line
              "out_format": lambda x: x,          # modify result before saving
              "logging_str": "Multiplied by %d",  # format log line
              "logging_format": lambda x: x,      # modify result before logging
              "kwargs": {"num2": 100}}            # pass constant keyword argument

# Run the method on every item in the iterator. If out_file is specified,
# a boolean success flag is returned; otherwise, the result is returned.
# Use parallelizer.run to run the method on all data before returning,
# with results in order.
for success, data in parallelizer.run_gen(product, data_iterator,
                                          **run_kwargs):
    print(success)
```
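As the final comment notes, `parallelizer.run` is the blocking counterpart of `run_gen`: it runs the method on all of the data before returning, with results in order. A minimal sketch of the same job using `run`; the exact structure of the return value, beyond the success flags described above, is an assumption:

```python
# Rebuild the iterator; the run_gen loop above already consumed it.
data_iterator = make_data_iterator(data_list)

# Blocking variant: runs product on every item before returning, with
# results in order. Because out_file is set in run_kwargs, each entry
# is assumed to carry the boolean success flag described above.
results = parallelizer.run(product, data_iterator, **run_kwargs)
```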