- Overview
- Articles
- Database Migration Terminology 101
- Data Quality at Petabyte Scale: Building Trust in the Data Lifecycle
- Testing, Monitoring, and Observability
- Dev/Stage/Prod is the Wrong Pattern for Data Pipelines
- Putting the Write-Audit-Publish Pattern (WAP) into Practice with lakeFS
- How to Implement Write-Audit-Publish (WAP)
- Virtual Data Environments
- Books
- Frameworks / tools
Table of contents created by gh-md-toc
This project aims to document requirements and reference material about the data life cycle, in particular to differentiate it from the software delivery life cycle (SDLC).
Even though members of the GitHub organization may be employed by various companies, they speak in a personal capacity and do not represent those companies.
- Data Engineering Helpers - Knowledge Sharing - Data contracts
- Data Engineering Helpers - Knowledge Sharing - Data products
- Data Engineering Helpers - Knowledge Sharing - Data quality
- Data Engineering Helpers - Knowledge Sharing - Data pipeline deployment
- Data Engineering Helpers - Knowledge Sharing - Data lakehouse
- Data Engineering Helpers - Knowledge Sharing - Data management
- Data Engineering Helpers - Knowledge Sharing - Metadata
- Data Engineering Helpers - Knowledge Sharing - Architecture principles
- Data Engineering Helpers - Knowledge Sharing - Semantic layer
- Title: Database Migration Terminology 101
- Date: Feb. 2025
- Author: Tom Baeyens
- Link to the post on LinkedIn: https://www.linkedin.com/posts/tombaeyens_database-migration-terminology-101-source-activity-7297573805609156609-ZrQW/
- Title: Data Quality at Petabyte Scale: Building Trust in the Data Lifecycle
- Date: Feb. 2025
- Author: Zakariah Siyaji (Zakariah Siyaji on LinkedIn, Zakariah Siyaji on Medium)
- Link to the post on LinkedIn by Chad Sanderson: https://www.linkedin.com/posts/chad-sanderson_many-companies-talk-about-implementing-data-activity-7296212049565515777-dnCn/
- Link to the article on Medium: https://medium.com/glassdoor-engineering/data-quality-at-petabyte-scale-building-trust-in-the-data-lifecycle-7052361307a4
- Also referenced in Data Engineering Helpers - Knowledge Sharing - Data contracts
- Title: The Data Engineer’s Guide to Testing, Monitoring, and Observability
- Date: Dec. 2024
- Author: Alex Caruso (Alex Caruso on LinkedIn, Alex Caruso on Airbyte blog)
- Link to the article: https://airbyte.com/blog/the-data-engineers-guide-to-testing-monitoring-and-observability
- Publisher: Airbyte blog
- Title: Dev/Stage/Prod is the Wrong Pattern for Data Pipelines
- Date: 2 Aug. 2023
- Publisher: Enigma engineering blog
- Link to the article: https://enigma.com/blog/post/dev-stage-prod-is-the-wrong-pattern-for-data-pipelines
- Title: Putting the Write-Audit-Publish Pattern into Practice with lakeFS
- Date: June 2023
- Author: Robin Moffatt
- Link to the article: https://lakefs.io/blog/write-audit-publish-with-lakefs/
- Publisher: LakeFS
- Title: How to Implement Write-Audit-Publish (WAP)
- Date: May 2023
- Author: Robin Moffatt
- Link to the article: https://lakefs.io/blog/how-to-implement-write-audit-publish/
- Publisher: LakeFS
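The Write-Audit-Publish pattern covered in the two lakeFS articles above can also be sketched independently of any particular tool: write into a staging location, audit the staged data, and publish atomically only once the checks pass. A minimal filesystem-based illustration (the audit rules and file names are invented for the example):

```python
import csv
import os
import tempfile

def write_audit_publish(rows, final_path):
    """Write rows to a staging file, audit them, then publish atomically.

    The audit rules here (non-empty dataset, no null ids) stand in for
    real data-quality checks; os.replace provides the atomic 'publish'.
    """
    # Write: stage the data next to the final location (same filesystem,
    # so the final rename is atomic)
    fd, staging_path = tempfile.mkstemp(
        dir=os.path.dirname(os.path.abspath(final_path)), suffix=".staging"
    )
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Audit: validate the staged data before it becomes visible
    try:
        if not rows:
            raise ValueError("audit failed: empty dataset")
        if any(r[0] in (None, "") for r in rows):
            raise ValueError("audit failed: null id")
    except Exception:
        os.remove(staging_path)  # a failed audit never reaches consumers
        raise

    # Publish: one atomic rename exposes the validated data
    os.replace(staging_path, final_path)

if __name__ == "__main__":
    out = os.path.join(tempfile.gettempdir(), "events.csv")
    write_audit_publish([("1", "click"), ("2", "view")], out)
    print(os.path.exists(out))  # the published file now exists
```

Here `os.replace` plays the "publish" role: consumers reading `final_path` never observe a partially written or unaudited file.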
- Title: Virtual Data Environments
- Date: 18 April 2023
- Author: Iaroslav Zeigerman
- Link to the article: https://tobikodata.com/virtual-data-environments.html
- Publisher: Tobiko Data
- Title: Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation
- Authors: Jez Humble and David Farley
- Foreword by Martin Fowler
- Date: 27 Jul. 2010
- ASIN: 0321601912
- Publisher: Addison-Wesley Professional; 1st edition
- ISBN-10: 0321601912
- ISBN-13: 978-0321601919
- Link to the book home page: https://martinfowler.com/bliki/ContinuousDelivery.html
- Home page: https://github.com/treeverse/lakeFS
lakeFS is an open-source tool that transforms your object storage into a Git-like repository. It enables you to manage your data lake the way you manage your code.
With lakeFS you can build repeatable, atomic, and versioned data lake operations - from complex ETL jobs to data science and analytics.
lakeFS supports AWS S3, Azure Blob Storage, and Google Cloud Storage as its underlying storage service. It is API compatible with S3 and works seamlessly with all modern data frameworks such as Spark, Hive, AWS Athena, DuckDB, and Presto.
For more information, see the documentation.
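As a rough illustration of this Git-like workflow, the Write-Audit-Publish cycle from the articles above can be condensed into a sequence of `lakectl` CLI calls. The repository URI, branch name, file paths, and audit script below are all placeholders, and the sketch only prints the commands by default (pass `dry_run=False` against a real lakeFS installation):

```python
import shlex
import subprocess

REPO = "lakefs://example-repo"   # placeholder repository URI
MAIN = f"{REPO}/main"
BRANCH = f"{REPO}/etl-run-001"   # short-lived ingestion branch

def run(cmd, dry_run=True):
    """Print the lakectl command; execute it only when dry_run is False."""
    print("would run:" if dry_run else "running:", cmd)
    if not dry_run:
        subprocess.run(shlex.split(cmd), check=True)

# Write: isolate the new data on its own branch
run(f"lakectl branch create {BRANCH} --source {MAIN}")
run(f"lakectl fs upload {BRANCH}/raw/events.parquet --source ./events.parquet")
run(f"lakectl commit {BRANCH} -m ingest-events")

# Audit: data-quality checks see only the branch (placeholder script)
run(f"./audit_checks.sh {BRANCH}")

# Publish: an atomic merge into main makes the audited data visible
run(f"lakectl merge {BRANCH} {MAIN}")
```

Because lakeFS exposes branches as path prefixes through its S3-compatible API, the audit step can point any S3-aware engine (Spark, DuckDB, etc.) at the branch without copying data.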