docs: 📝 pseudo code and docstring for write_resource_parquet() (#816)
## Description

Based on @martonvago's suggestion, I'll write things in "pseudocode"
from now on. But instead of pseudocode, I will write an outline of the
Python function with how I think it might flow inside. Plus, I can write
the full docstrings inside, so we don't need to move them over from the
Quarto doc. **I have NOT run this, tested it, or executed it in any
way; this is purely how I think it might work**, hence "pseudo" 😛.
I'll add some comments directly to the code in the PR.

Closes #642

This PR needs an in-depth review.

## Checklist

- [x] Updated documentation

---------

Co-authored-by: martonvago <[email protected]>
lwjohnst86 and martonvago authored Feb 19, 2025
1 parent fa92e90 commit 71f451a
Showing 2 changed files with 86 additions and 7 deletions.
22 changes: 15 additions & 7 deletions docs/design/interface/functions.qmd
@@ -101,7 +101,7 @@ more details.

## Data resource functions

- ### {{< var done >}}`create_resource_structure(path)`
+ ### {{< var done >}} `create_resource_structure(path)`

See the help documentation with `help(create_resource_structure)` for
more details.
@@ -127,13 +127,21 @@ flowchart
function --> out
```

- ### {{< var wip >}} `write_resource_parquet(raw_files, path)`
+ ### {{< var wip >}} `build_resource_parquet(raw_files_path, resource_properties)`

- This function takes the files provided by `raw_files` and merges them
- into a `data.parquet` file provided by `path`. Use
- `path_resource_data()` to provide the correct path location for `path`
- and `path_resource_raw_files()` for the `raw_files` argument. Outputs
- the path object of the created file.
+ See the help documentation with `help(build_resource_parquet)` for more
+ details.
+
+ ```{mermaid}
+ flowchart
+ in_raw_files_path[/raw_files_path/]
+ in_properties[/resource_properties/]
+ function("build_resource_parquet()")
+ out[("./resources/{id}/data.parquet")]
+ in_raw_files_path --> function
+ in_properties --> function
+ function --> out
+ ```

### {{< var wip >}} `edit_resource_properties(path, properties)`

71 changes: 71 additions & 0 deletions docs/design/interface/pseudocode/build_resource_parquet.py
@@ -0,0 +1,71 @@
# ruff: noqa
from pathlib import Path

import polars
from polars import DataFrame

# Note: sprout-internal names such as `ResourceProperties`, `check_is_file()`,
# `check_resource_properties()`, and `check_data()` are assumed to be in scope.

def build_resource_parquet(
raw_files_path: list[Path], resource_properties: ResourceProperties
) -> Path:
    """Merge all raw resource file(s) and write into a Parquet file.

    This function takes the file(s) provided by `raw_files_path` and merges
    them into a `data.parquet` file. The Parquet file will be stored at the
    path found in `ResourceProperties.path`. While Sprout generally assumes
    that the files stored in the `resources/raw/` folder are already correctly
    structured and tidy, it still runs checks to ensure the data are correct
    by comparing them to the properties. All data in the `resources/raw/`
    folder will be merged into one single data object and then written back
    to the Parquet file. The Parquet file will be overwritten.

    If there are any duplicate observation units in the data, only the most
    recent observation unit will be kept. This way, if there are any errors or
    mistakes in older raw files that have been corrected in later files, the
    mistake is still preserved in the raw files but won't impact the data that
    will actually be used.

    Examples:
        ``` python
        import seedcase_sprout.core as sp

        sp.build_resource_parquet(
            raw_files_path=sp.path_resources_raw_files(1),
            resource_properties=sp.example_resource_properties,
        )
        ```

    Args:
        raw_files_path: A list of paths for all the raw files, most commonly
            stored in the `.csv.gz` format. Use `path_resource_raw_files()` to
            help provide the correct paths to the raw files.
        resource_properties: The `ResourceProperties` object that contains the
            properties of the resource you want to create the Parquet file for.

    Returns:
        The path object of the created Parquet file.
    """
    # Not sure if this is the correct way to verify multiple files.
    [check_is_file(path) for path in raw_files_path]
    check_resource_properties(resource_properties)

    data = read_raw_files(raw_files_path)
    data = drop_duplicate_obs_units(data)

    # This function could be several functions or the one full function.
    check_data(data, resource_properties)

    return write_parquet(data, resource_properties.path)


def write_parquet(data: DataFrame, path: Path) -> Path:
    data.write_parquet(path)
    return path


def read_raw_files(paths: list[Path]) -> DataFrame:
    # `polars.read_csv()` can read gzip-compressed files directly.
    data_list = [polars.read_csv(path) for path in paths]
    # Merge them all together.
    data = polars.concat(data_list)
    return data


def drop_duplicate_obs_units(data: DataFrame) -> DataFrame:
    # Drop duplicates based on the observation unit, keeping only the most
    # recent one. This allows older raw files to contain potentially wrong
    # data that was corrected in the most recent file.
    # Polars uses `unique()` rather than pandas' `drop_duplicates()`; the
    # observation-unit columns would need to be passed via `subset=`.
    return data.unique(keep="last", maintain_order=True)
