Support .quiltignore on S3 #4326

Open
drernie opened this issue Feb 15, 2025 · 8 comments

Comments

@drernie
Member

drernie commented Feb 15, 2025

Right now, quiltignore_filter only supports 'file' for the url_scheme.

@drernie
Member Author

drernie commented Feb 15, 2025

Suggested improvement

import boto3
from fnmatch import fnmatch
from pathlib import Path
from urllib.parse import urlparse

def quiltignore_filter(paths, ignore, url_scheme):
    """Filter out paths captured by ignore rules, supporting both local and S3 paths.

    Args:
        paths (list): List or iterable of paths (file system or S3).
        ignore (path): Path to the file defining ignore rules (Unix shell-style wildcard format).
        url_scheme (str): The URL scheme ("file" for local, "s3" for S3 paths).
        
    Returns:
        set: Filtered paths after applying ignore rules.
    """
    # Read ignore rules, skipping blank lines and '#' comment lines
    ignore_rules = [line.strip() for line in ignore.read_text('utf-8').splitlines()]
    ignore_rules = ['*/' + rule for rule in ignore_rules if rule and not rule.startswith('#')]

    if url_scheme == 'file':
        return _filter_local_paths(paths, ignore_rules)
    elif url_scheme == 's3':
        return _filter_s3_paths(paths, ignore_rules)
    else:
        raise NotImplementedError(f"URL scheme '{url_scheme}' is not supported.")


def _filter_local_paths(paths, ignore_rules):
    """Filter local file paths based on ignore rules."""
    files, dirs = set(), set()
    for path in paths:
        if path.is_file():
            files.add(path)
        else:
            dirs.add(path)

    filtered_dirs = dirs.copy()
    for ignore_rule in ignore_rules:
        for pkg_dir in filtered_dirs.copy():
            if fnmatch(pkg_dir.as_posix() + "/", ignore_rule) or fnmatch(pkg_dir.as_posix(), ignore_rule):
                files = {n for n in files if pkg_dir not in n.parents}
                dirs.discard(pkg_dir)

        files = {n for n in files if not fnmatch(n.as_posix(), ignore_rule)}

    return files.union(dirs)


def _filter_s3_paths(paths, ignore_rules):
    """Filter S3 object paths based on ignore rules."""
    s3_client = boto3.client("s3")
    s3_objects = set()

    for path in paths:
        parsed = urlparse(path)
        bucket_name = parsed.netloc
        prefix = parsed.path.lstrip("/")

        # Get all objects in S3 under the given prefix. Paginate, because a
        # single list_objects_v2 call returns at most 1,000 keys.
        paginator = s3_client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            for obj in page.get("Contents", []):
                s3_objects.add(f"s3://{bucket_name}/{obj['Key']}")

    # Apply ignore rules
    filtered_objects = s3_objects.copy()
    for ignore_rule in ignore_rules:
        filtered_objects = {obj for obj in filtered_objects if not fnmatch(obj, ignore_rule)}

    return filtered_objects

@drernie
Member Author

drernie commented Feb 15, 2025

🎯 Supported and Unsupported Patterns in quiltignore_filter

✅ Supported Patterns

| Pattern Type | Works on Local Files? | Works on S3 Paths? | Notes |
|---|---|---|---|
| Basic Filename Matching | ✅ Yes | ✅ Yes | Matches specific filenames. |
| Wildcards (*) | ✅ Yes | ✅ Yes | Matches any number of characters. Unlike .gitignore, fnmatch lets * match / as well. |
| Recursive (**) | ✅ Yes | ✅ Yes | Matches everything inside a folder or nested folders; fnmatch treats ** the same as *. |
| Single Character Matching (?) | ✅ Yes | ✅ Yes | Handled by fnmatch. Matching happens client-side, so S3 itself is not involved. |
| Character Ranges ([a-z]) | ✅ Yes | ✅ Yes | Handled by fnmatch. |
| Negation (!) | ⚠️ Limited | ⚠️ Limited | fnmatch has no negation (a leading ! is matched literally), so the filter needs extra logic for both local and S3. |
| Directory Matching (/) | ✅ Yes | ✅ Yes | S3 treats directories as prefixes. |
| Prefix Matching (temp/) | ✅ Yes | ⚠️ Limited | The local filter drops files under a matching directory. On S3, fnmatch must match the whole key, so write temp/* or temp/**. |

❌ Unsupported Patterns

| Pattern Type | Works on Local Files? | Works on S3 Paths? | Notes |
|---|---|---|---|
| Escaping (\) | ❌ No | ❌ No | fnmatch has no backslash escaping. Needs manual handling for special characters like #. |

📌 Key Takeaways

  • S3 treats "directories" as prefixes, so directory filtering works well once the rule is written as a wildcard (temp/*).
  • Negation (!) needs additional filtering logic, because fnmatch does not provide it.
  • Wildcard (*) and recursive (**) patterns work, but both also match across /.
  • ? and [a-z] are handled by fnmatch; escaped characters are not supported.
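
Since the suggested filter delegates all matching to Python's fnmatch, the pattern types listed above can be checked directly against the standard library. A minimal sketch; the object URL is made up for illustration and no S3 access is involved:

```python
from fnmatch import fnmatch

# Hypothetical object URL, used only for illustration.
url = "s3://my-bucket/folder1/subfolder/data1.csv"

# Each entry pairs a description with the result of one fnmatch call.
checks = {
    "* crosses '/'": fnmatch(url, "*/folder1/*.csv"),
    "? matches one character": fnmatch(url, "*/data?.csv"),
    "[0-9] matches a range": fnmatch(url, "*/data[0-9].csv"),
    "leading ! acts as negation": fnmatch(url, "!*.txt"),
    "bare directory rule matches a file under it": fnmatch(url, "*/subfolder/"),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```

The first three checks print True and the last two print False: fnmatch must match the whole string, and it gives a leading ! no special meaning.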

@drernie
Member Author

drernie commented Feb 15, 2025

Sample .quiltignore file for testing

# Ignore all `.log` files except `important.log`
*.log
!important.log

# Ignore the `logs/` directory and everything inside it
logs/**

# Ignore `backups/` directory but allow `.zip` files inside it
backups/*
!backups/*.zip

# Ignore all files inside `folder1/subfolder` except `keep-me.txt`
folder1/subfolder/*
!folder1/subfolder/keep-me.txt
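
The ! rules in this sample only work if later rules can override earlier ones, which plain fnmatch filtering does not do. A minimal sketch of one way to handle that, assuming gitignore-style "last matching rule wins" semantics; `parse_rules` and `is_ignored` are hypothetical helpers, not part of quilt3, and the `*/` prefix simply mirrors the suggested quiltignore_filter:

```python
from fnmatch import fnmatch


def parse_rules(text):
    """Return (pattern, negated) pairs, skipping blank lines and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        negated = line.startswith("!")
        if negated:
            line = line[1:]
        # Same '*/' prefix as the suggested quiltignore_filter
        rules.append(("*/" + line, negated))
    return rules


def is_ignored(path, rules):
    """Apply rules in order; the last rule that matches decides."""
    ignored = False
    for pattern, negated in rules:
        if fnmatch(path, pattern):
            ignored = not negated
    return ignored


SAMPLE = """\
*.log
!important.log
backups/*
!backups/*.zip
"""

# Hypothetical object URLs for illustration; no S3 call is made.
urls = [
    "s3://my-bucket/file2.log",
    "s3://my-bucket/important.log",
    "s3://my-bucket/backups/backup1.zip",
    "s3://my-bucket/backups/backup2.tar.gz",
]
rules = parse_rules(SAMPLE)
kept = [u for u in urls if not is_ignored(u, rules)]
print(kept)
```

This is a simplification of real .gitignore behavior (for example, git will not re-include a file whose parent directory is excluded), so it is a starting point rather than a drop-in replacement.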

@drernie
Member Author

drernie commented Feb 15, 2025

Script to create test folder

#!/bin/bash

# Define base directory
BASE_DIR="."

# Create main directories
mkdir -p $BASE_DIR/{logs,backups,folder1/subfolder}

# Create files in the root directory
touch $BASE_DIR/file1.txt
touch $BASE_DIR/file2.log
touch $BASE_DIR/important.log

# Create files inside logs/
touch $BASE_DIR/logs/error.log
touch $BASE_DIR/logs/access.log

# Create files inside backups/
touch $BASE_DIR/backups/backup1.zip
touch $BASE_DIR/backups/backup2.tar.gz

# Create files inside folder1/
touch $BASE_DIR/folder1/keep.txt
touch $BASE_DIR/folder1/temp.log

# Create files inside folder1/subfolder/
touch $BASE_DIR/folder1/subfolder/ignore-me.txt
touch $BASE_DIR/folder1/subfolder/keep-me.txt

# Create the .gitignore file
# Quote the delimiter so the backticks below are not run as commands
cat <<'EOL' > "$BASE_DIR/.gitignore"
# Ignore all `.log` files except `important.log`
*.log
!important.log

# Ignore the `logs/` directory and everything inside it
logs/**

# Ignore `backups/` directory but allow `.zip` files inside it
backups/*
!backups/*.zip

# Ignore all files inside `folder1/subfolder` except `keep-me.txt`
folder1/subfolder/*
!folder1/subfolder/keep-me.txt
EOL

echo "✅ Test directory structure created successfully inside '$BASE_DIR'."

@drernie
Member Author

drernie commented Feb 15, 2025

Sample folder: s3://udp-spec/example/quiltignore

@drernie
Member Author

drernie commented Feb 15, 2025

Raw output

project-root/
├── .gitignore
├── file1.txt
├── file2.log
├── important.log
├── logs/
│   ├── error.log
│   ├── access.log
├── backups/
│   ├── backup1.zip
│   ├── backup2.tar.gz
├── folder1/
│   ├── keep.txt
│   ├── temp.log
│   ├── subfolder/
│   │   ├── ignore-me.txt
│   │   ├── keep-me.txt

@drernie
Member Author

drernie commented Feb 15, 2025

After the ignore filter:

s3://my-bucket/
├── file1.txt
├── important.log
├── backups/
│   ├── backup1.zip
├── folder1/
│   ├── keep.txt
│   ├── subfolder/
│   │   ├── keep-me.txt

@drernie
Member Author

drernie commented Feb 15, 2025

We should verify that this produces the expected result for both local and S3 folders!
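
One way to make that comparison mechanical is to reduce both result sets to keys relative to the package root and compare them with the "After the ignore filter" listing. A rough sketch; `local_keys` and `s3_keys` are hypothetical helpers, the local root `/tmp/project-root` is an assumption, and the two result lists are stand-ins for real quiltignore_filter output:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_keys(paths, root):
    """Express local file paths relative to the package root."""
    root = PurePosixPath(root)
    return {PurePosixPath(p).relative_to(root).as_posix() for p in paths}


def s3_keys(urls, root_url):
    """Express s3:// object URLs relative to the root prefix."""
    root = urlparse(root_url)
    prefix = root.path.strip("/")
    keys = set()
    for url in urls:
        parsed = urlparse(url)
        if parsed.netloc != root.netloc:
            raise ValueError(f"{url} is not in bucket {root.netloc}")
        key = PurePosixPath(parsed.path.lstrip("/"))
        keys.add(key.relative_to(prefix).as_posix() if prefix else key.as_posix())
    return keys


# Expected survivors, copied from the "After the ignore filter" listing.
EXPECTED = {
    "file1.txt",
    "important.log",
    "backups/backup1.zip",
    "folder1/keep.txt",
    "folder1/subfolder/keep-me.txt",
}

# Stand-ins for the two filter outputs; a real test would call
# quiltignore_filter with url_scheme 'file' and 's3' instead.
local_result = [f"/tmp/project-root/{k}" for k in EXPECTED]
s3_result = [f"s3://udp-spec/example/quiltignore/{k}" for k in EXPECTED]

assert local_keys(local_result, "/tmp/project-root") == EXPECTED
assert s3_keys(s3_result, "s3://udp-spec/example/quiltignore") == EXPECTED
```

Comparing relative keys keeps the test independent of where the local copy lives and of the bucket prefix.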
