Refactor Data Guide #162

Merged
merged 11 commits into from
Aug 22, 2024
32 changes: 16 additions & 16 deletions book/_toc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,29 +4,29 @@
format: jb-book
root: intro
parts:
- caption: LEAP-Pangeo
- caption: Tutorials
chapters:
- file: leap-pangeo/tutorial.md
- file: leap-pangeo/jupyterhub.md
- file: leap-pangeo/architecture
- file: leap-pangeo/implementation
- file: tutorials/getting_started
- caption: Guides
chapters:
- file: guides/hub_guides
- file: leap-pangeo/solutions
- file: guides/education
- file: guides/bootcamp
- file: guides/team_docs
- file: guides/faq
- file: guides/data_guide
- file: guides/compute_guide
- file: guides/education_guide
- file: guides/bootcamp_guide
- file: guides/vm_access
- caption: Policies
- file: guides/team_guide
- file: guides/faq
- caption: Explanation
chapters:
- file: policies/code_policy
- file: policies/data_policy
- file: policies/infrastructure_policy
- file: policies/users_roles
- file: explanation/architecture
- file: explanation/implementation
- file: explanation/code_policy
- file: explanation/data_policy
- file: explanation/infrastructure_policy
- caption: Reference
chapters:
- file: reference/infrastructure
- file: reference/membership
- file: reference/education
- caption: Miscellaneous
chapters:
Expand Down
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
(explanation.architecture)=

# LEAP-Pangeo Architecture

LEAP-Pangeo is a cloud-based data and computing platform that will be used to support research, education, and knowledge transfer within the LEAP program.
Expand Down Expand Up @@ -27,6 +29,8 @@ LEAP-Pangeo high-level architecture diagram

There are four primary components to LEAP-Pangeo.

(explanation.architecture.data-library)=

### The Data Library

The data library will provide analysis-ready, cloud-optimized data for all aspects of LEAP.
Expand All @@ -44,6 +48,22 @@ Examples of data that may become part of the library are
- Easily accessible syntheses of climate projections from [CMIP6 data](https://esgf-node.llnl.gov/projects/cmip6/), produced by the LEAP team,
for use by industry partners for business strategy and decision making.

(explanation.architecture.catalog)=

#### Data Catalog

A [STAC](https://stacspec.org/) data catalog will be used to enumerate all LEAP-Pangeo datasets and provide this information to the public.
The catalog will store all relevant metadata about LEAP datasets following established metadata standards (e.g. CF Conventions).
It will also provide direct links to raw data in cloud object storage.

The catalog will facilitate several different modes of access:

- Searching, crawling, and opening datasets from within notebooks or scripts
- "Crawling" by search indexes or other machine-to-machine interfaces
- A pretty web front-end interface for interactive public browsing

The [Radiant Earth MLHub](https://mlhub.earth/) is a great reference for how we imagine the LEAP data catalog will eventually look.
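
As a rough illustration of what one catalog entry could carry, the sketch below builds a minimal STAC Item as a plain Python dictionary, using only the fields the STAC Item specification requires plus one asset that links to the data in object storage. The helper name `make_stac_item`, the dataset id, and the `gs://example-bucket/...` path are all hypothetical; this is not the actual LEAP catalog schema.

```python
import json
from datetime import datetime, timezone


def make_stac_item(item_id, href, when, bbox):
    """Build a minimal STAC Item as a plain dict (hypothetical helper)."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": [west, south, east, north],
        # Footprint of the dataset as a closed GeoJSON polygon ring.
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [west, south],
                [east, south],
                [east, north],
                [west, north],
                [west, south],
            ]],
        },
        "properties": {"datetime": when.isoformat()},
        "links": [],
        # The asset holds the direct link to the data in cloud object storage.
        "assets": {"data": {"href": href, "roles": ["data"]}},
    }


item = make_stac_item(
    "example-dataset",  # hypothetical dataset id
    "gs://example-bucket/example-dataset.zarr",  # hypothetical storage path
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    (-180.0, -90.0, 180.0, 90.0),
)
print(json.dumps(item, indent=2))
```

A notebook or a search index would read the same JSON document, which is why a single catalog can serve all three access modes listed above.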

### Data Storage Service

The underlying technology for the LEAP Data catalog will be cloud object storage (e.g. Amazon S3),
Expand Down Expand Up @@ -82,20 +102,6 @@ This synergistic relationship with be mutually beneficial to two NSF-sponsored p
Using Pangeo Forge effectively will require LEAP scientists and data engineers to engage
with the open-source development process around Pangeo Forge and related technologies.

#### Catalog

A [STAC](https://stacspec.org/) data catalog will be used to enumerate all LEAP-Pangeo datasets and provide this information to the public.
The catalog will store all relevant metadata about LEAP datasets following established metadata standards (e.g. CF Conventions).
It will also provide direct links to raw data in cloud object storage.

The catalog will facilitate several different modes of access:

- Searching, crawling, and opening datasets from within notebooks or scripts
- "Crawling" by search indexes or other machine-to-machine interfaces
- A pretty web front-end interface for interactive public browsing

The [Radiant Earth MLHub](https://mlhub.earth/) is a great reference for how we imagine the LEAP data catalog will eventually look.

### The Hub

```{figure} https://jupyter.org/assets/homepage/labpreview.webp
Expand Down Expand Up @@ -130,9 +136,7 @@ with full-featured Python software environments for environmental data science.
These environments will be the starting point for LEAP environments.
They may be augmented as LEAP evolves with more specific software as needed by research projects.

User management and access control for the Hub are described in {doc}`/policies/users_roles`.
We use GitHub for identity management, in order to make it easy to include participants
from any partner institution.
User management and access control for the Hub are described in [](reference.membership).

### The Knowledge Graph

Expand All @@ -145,9 +149,9 @@ LEAP "outputs" will be of four main types:
- **Educational Modules** - used for teaching

All of these objects must be tracked and cataloged in a uniform way.
The {doc}`/policies/code_policy` and {doc}`/policies/data_policy` will help set these standards.
The [](explanation.code_policy) and [](explanation.data-policy) will help set these standards.

```{figure} LEAP_knowledge_graph.png
```{figure} ../images/LEAP_knowledge_graph.png
---
width: 600px
name: knowledge-graph
Expand Down
10 changes: 10 additions & 0 deletions book/explanation/code_policy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
(explanation.code_policy)=

# LEAP-Pangeo Code Policy

(explanation.code-policy.dont-let-perfect-be-the-enemy-of-good)=

## Enable Science now, but keep evolving.

"Don't let perfect be the enemy of good"
🚧
30 changes: 28 additions & 2 deletions book/policies/data_policy.md → book/explanation/data_policy.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,15 +3,31 @@ abbreviations:
ARCO: Analysis-Ready Cloud-Optimized
---

# Data Policy
(explanation.data-policy)=

# LEAP-Pangeo Data Policy

(explanation.data-policies.access)=

## Data Access

🚧

(explanation.data-policy.reproducibility)=

## Reproducibility

🚧

(explanation.data-policy.types)=

## Types of Data Used at LEAP

Within the LEAP project we distinguish between several types of data, mostly based on whether the data was used or produced at LEAP and whether it is already accessible in {abbr}`ARCO` formats in the cloud.

:::\{admonition} LEAP produced
:class: dropdown
Data that has been created or modified by LEAP researchers.
Data that has been created or modified by LEAP researchers. We currently do not provide any way of ensuring that data is archived, and users should never rely on LEAP-Pangeo resources as the only copy of valuable data (see also [](guides.data.ingestion)).
:::

:::\{admonition} LEAP ingested
Expand All @@ -23,3 +39,13 @@ Data that is already publically available but has been ingested into cloud stora
:class: dropdown
Data that is already available in {abbr}`ARCO` formats in publicly accessible object storage. Adding this data to the LEAP-Pangeo Catalog lets us visualize it with the Data Viewer and collect all datasets of importance in a single location, but none of the data itself is modified.
:::

## Roles

Many different people at LEAP interact with data in various ways. Here is a list of typical roles (some people have multiple roles):

(explanation.data-policy.roles.data-expert)=

### Data Expert

🚧
File renamed without changes.
1 change: 1 addition & 0 deletions book/explanation/infrastructure_policy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# LEAP-Pangeo Infrastructure Policy
2 changes: 1 addition & 1 deletion book/guides/bootcamp.md → book/guides/bootcamp_guide.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Running LEAP bootcamps
# Bootcamp Guide

We collect all bootcamp materials in the [LEAP-Pangeo bootcamp repository](https://github.com/leap-stc/LEAP-bootcamps). Please keep all relevant information and materials in this repository to make it easier for participants to find them.

Expand Down
10 changes: 10 additions & 0 deletions book/guides/compute_guide.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# Compute Guide

This is a set of guides for using the JupyterHub Compute Environment effectively.

## Dask

To help you scale up calculations using a cluster, the Hub is configured with Dask Gateway.
For a quick guide on how to start a Dask Cluster, consult this page from the Pangeo docs:

- <https://pangeo.io/cloud.html#dask>
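
For orientation, here is a minimal sketch of what starting a cluster through Dask Gateway might look like. It assumes the `dask-gateway` client is installed in your Hub environment and that `Gateway()` picks up the Hub's configuration without arguments; the helper names `start_cluster` and `main` and the worker limits are illustrative choices, not Hub defaults.

```python
def start_cluster(gateway, min_workers=1, max_workers=4):
    """Request a new cluster from a gateway and return (cluster, client).

    `gateway` is expected to be a dask_gateway.Gateway instance.
    The worker limits are illustrative values, not Hub defaults.
    """
    cluster = gateway.new_cluster()
    # Let the cluster grow and shrink with the workload.
    cluster.adapt(minimum=min_workers, maximum=max_workers)
    client = cluster.get_client()
    return cluster, client


def main():
    # Imported here so the helper above can be loaded without dask-gateway.
    from dask_gateway import Gateway
    import dask.array as da

    gateway = Gateway()  # assumption: the Hub preconfigures the gateway address
    cluster, client = start_cluster(gateway)
    try:
        # Small example computation that runs on the cluster workers.
        x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
        print(x.mean().compute())
    finally:
        # Always release cloud resources when you are done.
        cluster.shutdown()
```

Call `main()` from a notebook cell on the Hub to try it. Shutting the cluster down in a `finally` block keeps idle workers from lingering if the computation fails.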