More liberal pattern when running terracotta ingest #271

Closed
tomalrussell opened this issue Jul 26, 2022 · 3 comments

@tomalrussell
Contributor

It would be convenient to allow some punctuation marks as well as alphanumeric characters in the regex pattern matching key values.

I have a raster pattern like:

  • {type}__rp_{rp}__rcp_{rcp}__epoch_{epoch}__gcm_{gcm}.tif

and files like:

  • cyclone__rp_10__rcp_8.5__epoch_2050__gcm_CMCC-CM2-VHR4.tif
  • river__rp_2__rcp_8.5__epoch_2030__gcm_MIROC-ESM-CHEM.tif

The regex match is too strict to allow the . or - in values - could it be relaxed? I can patch the filenames as a workaround, but a quick edit to the key bit of the regex so it's just [^_]+ seems to work okay locally:
https://github.com/DHI-GRAS/terracotta/blob/b7c67c3c2736401295644c1e8882b3f0f013bb5c/terracotta/scripts/click_types.py#L74
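To illustrate the difference, here is a minimal sketch of how a `{key}` pattern could be expanded into a regular expression. This is not terracotta's actual implementation: `pattern_to_regex` is a hypothetical helper, and `[A-Za-z0-9]+` is only an assumed stand-in for the current alphanumeric-only matching.

```python
import re


def pattern_to_regex(pattern: str, value_regex: str) -> re.Pattern:
    """Expand '{key}' placeholders into named groups (hypothetical helper)."""
    # re.split with one capture group alternates literal text and key names
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))  # literal text, e.g. "__rp_" or ".tif"
        else:
            pieces.append(f"(?P<{part}>{value_regex})")  # key placeholder
    return re.compile("".join(pieces))


PATTERN = "{type}__rp_{rp}__rcp_{rcp}__epoch_{epoch}__gcm_{gcm}.tif"
FILES = [
    "cyclone__rp_10__rcp_8.5__epoch_2050__gcm_CMCC-CM2-VHR4.tif",
    "river__rp_2__rcp_8.5__epoch_2030__gcm_MIROC-ESM-CHEM.tif",
]

# Assumed stand-in for the strict behaviour: alphanumeric values only
strict = pattern_to_regex(PATTERN, r"[A-Za-z0-9]+")
# Relaxed variant from this issue: anything except an underscore
relaxed = pattern_to_regex(PATTERN, r"[^_]+")

for name in FILES:
    strict_match = strict.fullmatch(name)
    relaxed_match = relaxed.fullmatch(name)
    print(name)
    print("  strict: ", strict_match.groupdict() if strict_match else None)
    print("  relaxed:", relaxed_match.groupdict() if relaxed_match else None)
```

The trade-off with `[^_]+` is that key values can then never contain an underscore, which works for these filenames but may not suit every naming scheme.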

@dionhaefner
Collaborator

dionhaefner commented Jul 26, 2022

Unfortunately this is an unsolvable problem. Raster patterns are a confusing mess, and I am hesitant to make it even more confusing by capturing more stuff. They are really meant only for the simplest of use cases.

Is there any particular reason why you don't want to use the Python API for ingestion like we recommend in the docs?

I could be nudged to support regex patterns as a power-user feature:

$ terracotta ingest --raster-regex "(?P<type>\w+)__rp_(?P<rp>\d+)__rcp_(?P<rcp>\d+\.\d+)__epoch_(?P<epoch>\d+)__gcm_(?P<gcm>[\w-]+)\.tif"

But I think you have to agree that the patterns are quite messy, so it might be easier to use the Python API :)
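For reference, this is roughly what that regex would capture, sketched with Python's standard `re` module. The `--raster-regex` option is only a proposal at this point, so this only shows the pattern itself applied to the filenames from the original post.

```python
import re

# The named-group pattern from the proposed --raster-regex example
RASTER_REGEX = re.compile(
    r"(?P<type>\w+)__rp_(?P<rp>\d+)__rcp_(?P<rcp>\d+\.\d+)"
    r"__epoch_(?P<epoch>\d+)__gcm_(?P<gcm>[\w-]+)\.tif"
)

FILENAMES = [
    "cyclone__rp_10__rcp_8.5__epoch_2050__gcm_CMCC-CM2-VHR4.tif",
    "river__rp_2__rcp_8.5__epoch_2030__gcm_MIROC-ESM-CHEM.tif",
]

for filename in FILENAMES:
    match = RASTER_REGEX.fullmatch(filename)
    # groupdict() maps each named group (key) to the matched value
    print(filename, "->", match.groupdict() if match else "no match")
```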

@tomalrussell
Contributor Author

I can see the problem. I guess I asked because the ingest subcommand almost does what I want and tweaking it seemed easier than learning how to use the Python API.

I can't see myself typing out that --raster-regex example (correctly, first time!), and at that point I'd be writing some kind of script anyway.

I wonder if a simpler example script in the docs might help:

import os
from typing import Dict, List

import tqdm
import terracotta

# Define the location of the SQLite database
# (this will be created if it doesn't already exist)
DB_NAME = "./terracotta.sqlite"

# Define the list of keys that will be used to identify datasets.
# (these need to match the key_values dicts defined in RASTER_FILES below)
KEYS = ["type", "rp", "rcp", "epoch", "gcm"]

# Define a list of raster files to import
# (this is a list of dictionaries, each with a file path and the values for
# each key - make sure the order matches the order of KEYS defined above)
#
# This part of the script could be replaced with something that makes sense for
# your data - it could use a glob expression to find all TIFFs and a regular
# expression pattern to extract the key values, or it could read from a CSV,
# or use some other reference or metadata generating process.
RASTER_FILES = [
    {
        "key_values": {
            "type": "river",
            "rp": 250,
            "rcp": 4.5,
            "epoch": 2030,
            "gcm": "NorESM1-M",
        },
        "path": "./data/river__rp_250__rcp_4x5__epoch_2030__gcm_NorESM1-M.tif",
    },
    {
        "key_values": {
            "type": "river",
            "rp": 500,
            "rcp": 8.5,
            "epoch": 2080,
            "gcm": "NorESM1-M",
        },
        "path": "./data/river__rp_500__rcp_8x5__epoch_2080__gcm_NorESM1-M.tif",
    },
]


def load(db_name: str, keys: List[str], raster_files: List[Dict]):
    driver = terracotta.get_driver(db_name)

    # create an empty database if it doesn't exist
    if not os.path.isfile(db_name):
        driver.create(keys)

    # sanity check that the database has the same keys that we want to load
    assert list(driver.key_names) == keys, (driver.key_names, keys)

    progress_bar = tqdm.tqdm(raster_files)

    for raster in progress_bar:
        progress_bar.set_postfix(file=raster["path"])

        with driver.connect():
            driver.insert(raster["key_values"], raster["path"])


if __name__ == "__main__":
    load(DB_NAME, KEYS, RASTER_FILES)
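
As a sketch of the glob-and-regex alternative mentioned in the comments above, the hard-coded RASTER_FILES list could be generated from the filenames instead. The names `FILENAME_REGEX`, `parse_key_values` and `find_raster_files` are hypothetical, the `./data` directory is an assumption, and key values are kept as the strings found in the filename.

```python
import re
from pathlib import Path
from typing import Dict, List, Optional

KEYS = ["type", "rp", "rcp", "epoch", "gcm"]

# Assumed filename layout; each value is anything except an underscore
FILENAME_REGEX = re.compile(
    r"(?P<type>[^_]+)__rp_(?P<rp>[^_]+)__rcp_(?P<rcp>[^_]+)"
    r"__epoch_(?P<epoch>[^_]+)__gcm_(?P<gcm>[^_]+)\.tif"
)


def parse_key_values(filename: str) -> Optional[Dict[str, str]]:
    """Return the key values encoded in a filename, or None if it doesn't match."""
    match = FILENAME_REGEX.fullmatch(filename)
    if match is None:
        return None
    return {key: match.group(key) for key in KEYS}


def find_raster_files(data_dir: str) -> List[Dict]:
    """Build a list shaped like RASTER_FILES from the TIFFs in a directory."""
    raster_files = []
    for path in sorted(Path(data_dir).glob("*.tif")):
        key_values = parse_key_values(path.name)
        if key_values is None:
            print(f"Skipping {path}: name does not match the expected pattern")
            continue
        raster_files.append({"key_values": key_values, "path": str(path)})
    return raster_files


if __name__ == "__main__":
    RASTER_FILES = find_raster_files("./data")
    print(f"Found {len(RASTER_FILES)} raster files")
```

The resulting list has the same shape as RASTER_FILES, so it could be passed to `load` unchanged.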

I can draft a PR with an attempt at adding to the docs if you like - otherwise do close this, some version of using the API is the way forward 😊

@dionhaefner
Collaborator

This looks awesome, thanks! I would gladly accept a PR on this.

tomalrussell added a commit to nismod/infra-risk-vis that referenced this issue Jul 29, 2022
The terracotta ingest script is limited in what it can parse,
see DHI/terracotta#271