Python Utilities

Useful tools for common Python tasks.

Introduction

This package arose from a desire to standardize useful methods and classes I found myself reusing in many projects. These fall into several sub-packages:

  • scripting: a logging setup method with sensible defaults and settings for log format, verbosity, and destination
  • io_tools: methods for intelligently guessing file compression from the extension (see the sketch after this list) and for safely buffering numerical data before writing it to an HDF5 file
  • parallel: determine which parallelization options are available in the current environment, and run a method on a dataset using a master-slave paradigm. The Parallelizer class arose from a common use case: writing, testing, and running scripts on a local machine with multiprocessing or multithreading for parallelization, then needing to modify those scripts to use MPI on a large cluster. Parallelizer lets the same script run in both contexts without any code changes.
  • plotting: color schemes chosen for maximum contrast and methods for converting between color spaces
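
As an illustration, the extension-based compression guessing done by io_tools can be sketched with just the standard library. This is a minimal sketch of the idea, not the package's actual implementation, and the helper name open_by_extension is hypothetical:

import bz2
import gzip


def open_by_extension(path, mode="rt"):
    # Hypothetical helper: choose an opener based on the file extension,
    # illustrating the idea behind io_tools' compression guessing.
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    if path.endswith(".bz2"):
        return bz2.open(path, mode)
    return open(path, mode)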

Installation

python_utilities may be installed in three ways, in order of preference:

  1. Using conda: conda install -c conda-forge sdaxen_python_utilities
  2. Using pip: pip install sdaxen_python_utilities
  3. From GitHub:
    1. Download this repository to your machine:
      • Clone it with git clone https://github.com/sdaxen/python_utilities.git
      • OR download an archive by navigating to https://github.com/sdaxen/python_utilities and clicking "Clone or download > Download ZIP", then extract the archive.
    2. Add the path to the repository to your $PYTHONPATH. On Unix, this can be done with export PYTHONPATH=[PATH/TO/REPO]:$PYTHONPATH, where [PATH/TO/REPO] is replaced with the path on your machine.
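
After installing by any of these methods, a quick import check confirms the package is on your path (a sanity check added here, not part of the original instructions):

# verify that the package can be imported and see where it was found
import python_utilities
print(python_utilities.__file__)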

Usage

An example usage of the most common methods/classes is given below. In this example, we read in a file that contains a range of numbers and compute the product of each of those numbers with a single constant. We do this in parallel, so that as each slave node is ready, the master sends it a number from the file. Progress is logged to log.txt, and the results are saved to products.txt.
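
To run the example end to end, you first need the gzipped input file it reads. This short setup snippet (an addition for convenience, not part of the original example) creates one with the standard library:

import gzip

# write the numbers 0-9, one per line, to a gzipped text file
with gzip.open("numbers.txt.gz", "wt") as f:
    for i in range(10):
        f.write("%d\n" % i)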

from python_utilities.scripting import setup_logging
from python_utilities.io_tools import smart_open
from python_utilities.parallel import Parallelizer, make_data_iterator


# Methods written for parallel have non-keyword (num1) and keyword (num2)
# arguments. All keyword arguments must be constant across all parallel
# runs, while non-keyword arguments may vary. Here, we will vary num1, but
# num2 will be constant.
def product(num1, num2=100):
    return num1 * num2


# log everything, including logging.debug messages, to log.txt
setup_logging("log.txt", verbose=True)

data_list = []
# smart_open recognizes the .gz extension
with smart_open("numbers.txt.gz", "r") as f:
    for line in f:
        data_list.append(float(line.strip()))

# items yielded to the Parallelizer must be lists or tuples (the
# non-keyword args); make_data_iterator wraps each entry accordingly
data_iterator = make_data_iterator(data_list)
# use multiprocessing if available
parallelizer = Parallelizer(parallel_mode="processes")
run_kwargs = {"out_file": "products.txt",  # save one result per line
              "out_str": "%d\n",  # formatting of output line
              "out_format": lambda x: x,  # modify result before saving
              "logging_str": "Multiplied by %d",  # format log line
              "logging_format": lambda x: (x),  # modify result before logging
              "kwargs": {"num2": 100}}  # pass constant keyword argument

# Run the method on every item in the iterator. If out_file is
# specified, a boolean success flag is yielded with each item's data;
# otherwise, the result is yielded. Use parallelizer.run instead to
# process all the data before returning the results in order.
for success, data in parallelizer.run_gen(product, data_iterator,
                                          **run_kwargs):
    print(success)
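
As the comment above notes, parallelizer.run processes all of the data before returning, with results in input order. Under that reading, the generator loop could be replaced by the call below; this is a sketch based on that comment, not separately documented behavior:

# collect every result up front, in the same order as the input data
results = parallelizer.run(product, data_iterator, **run_kwargs)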