enhancement: add fresh, max/min count by columns #7

Closed
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contributing to sdf-action
# Contributing to sdf-tests

Thank you for your interest in contributing to `sdf-action`! We welcome contributions from the community and are excited to see what you can bring to the project. Before you get started, please review the guidelines below to ensure a smooth and efficient contribution process.
Thank you for your interest in contributing to `sdf-tests`! We welcome contributions from the community and are excited to see what you can bring to the project. Before you get started, please review the guidelines below to ensure a smooth and efficient contribution process.

## PR Labeling Guidelines

85 changes: 66 additions & 19 deletions README.md
@@ -19,25 +19,27 @@ For an in-depth guide on how to use SDF tests, please see the Tests section of [

## SDF Standard Library Tests

| Test Name | Type |
| ------------------------------ | --------- |
| [`not_null()`](#not-null) | Scalar |
| [`valid_scalar(condition)`](#valid-scalar) | Scalar |
| [`valid_aggregate(condition)`](#valid-aggregate) | Aggregate |
| [`unique()`](#unique) | Aggregate |
| [`in_accepted_values([values])`](#in-accepted-values) | Aggregate |
| [`minimum(value)`](#minimum) | Aggregate |
| [`maxiumum(value)`](#maximum) | Aggregate |
| [`exclusive_minimum(value)`](#exclusive-minimum) | Aggregate |
| [`exclusive_maximum(value)`](#exclusive-maximum) | Aggregate |
| [`between(lower, upper)`](#between) | Aggregate |
| [`max_length(value)`](#max-length) | Aggregate |
| [`min_length(value)`](#min-length) | Aggregate |
| [`like(string)`](#like) | Aggregate |
| [`try_cast(type)`](#try-cast) | Aggregate |
| [`primary_key(column)`](#primary-key) | Aggregate |
| [`unique_columns([c1, c2])`](#unique-columns)| Table |

| Test Name | Type |
| -------------------------------------------------------------------- | --------- |
| [`not_null()`](#not-null) | Scalar |
| [`valid_scalar(condition)`](#valid-scalar) | Scalar |
| [`valid_aggregate(condition)`](#valid-aggregate) | Aggregate |
| [`unique()`](#unique) | Aggregate |
| [`in_accepted_values([values])`](#in-accepted-values) | Aggregate |
| [`minimum(value)`](#minimum) | Aggregate |
| [`maximum(value)`](#maximum) | Aggregate |
| [`exclusive_minimum(value)`](#exclusive-minimum) | Aggregate |
| [`exclusive_maximum(value)`](#exclusive-maximum) | Aggregate |
| [`between(lower, upper)`](#between) | Aggregate |
| [`max_length(value)`](#max-length) | Aggregate |
| [`min_length(value)`](#min-length) | Aggregate |
| [`like(string)`](#like) | Aggregate |
| [`try_cast(type)`](#try-cast) | Aggregate |
| [`primary_key(column)`](#primary-key) | Aggregate |
| [`unique_columns([c1, c2])`](#unique-columns) | Table |
| [`fresh(reference_value, value, date_part)`](#fresh) | Aggregate |
| [`maximum_count(value, [c1, c2])`](#maximum-count-by-partition) | Table |
| [`minimum_count(value, [c1, c2])`](#minimum-count-by-partition) | Table |

#### Not Null

@@ -232,3 +234,48 @@ table:
- expect: unique_columns(['a', 'b'])
```

#### Fresh

Asserts that the most recent value in a column is no older than a given number of intervals (`value` units of `date_part`) relative to a reference value.
The column and the reference value must be of the same data type.

**Example:**
```yaml
columns:
  - name: a
    tests:
      - expect: fresh(current_date(), 1, 'day')
  - name: b
    tests:
      - expect: fresh(current_timestamp(), 180, 'minute')
      - expect: fresh(current_datetime(), 90, 'minute')
```
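
As a rough illustration of the freshness rule described above, the following Python sketch checks whether the newest value in a column lies within `value` units of `date_part` of a reference value. The helper `is_fresh` and the sample data are hypothetical and not part of `sdf-tests`; the sketch compares plain durations, which may differ slightly from how a given SQL dialect counts date parts.

```python
from datetime import datetime, timedelta

# Hypothetical helper (not part of sdf-tests): mirrors the intended check in plain Python.
def is_fresh(values, reference, value, date_part):
    """Return True when the newest entry is within `value` units of `date_part` of `reference`."""
    units = {"minute": "minutes", "hour": "hours", "day": "days"}
    window = timedelta(**{units[date_part]: value})
    newest = max(values)
    return reference - newest <= window

now = datetime(2024, 1, 1, 12, 0)
recent = [now - timedelta(minutes=30), now - timedelta(days=2)]
stale = [now - timedelta(minutes=200)]

print(is_fresh(recent, now, 180, "minute"))  # True: newest value is 30 minutes old
print(is_fresh(stale, now, 180, "minute"))   # False: newest value is 200 minutes old
```

Only the newest value matters: older rows in the same column do not make the column stale.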

#### Maximum Count by Partition

Asserts that, when the table is grouped by a list of columns, no group contains more rows than a threshold value.
If no column list is given, the threshold applies to the row count of the whole table.

**Example:**
```yaml
table:
  tests:
    - expect: maximum_count(1000, ['a', 'b'])
```
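
To make the grouping behaviour concrete, here is a minimal sketch of the kind of query such a check boils down to, run against an in-memory SQLite database. The table `t`, its columns, and the helper `partitions_over` are assumptions made for illustration; the SQL that `sdf-tests` actually generates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("x", "1"), ("x", "1"), ("x", "1"), ("y", "2")],
)

# Hypothetical helper: every returned row is a partition that exceeds the threshold.
def partitions_over(conn, max_count):
    query = """
        SELECT a, b, COUNT(*) AS row_count
        FROM t
        GROUP BY a, b
        HAVING COUNT(*) > ?
        ORDER BY a, b
    """
    return conn.execute(query, (max_count,)).fetchall()

print(partitions_over(conn, 2))  # [('x', '1', 3)]
print(partitions_over(conn, 3))  # []
```

An empty result means the assertion holds; any returned row identifies an offending partition.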

#### Minimum Count by Partition

Asserts that, when the table is grouped by a list of columns, every group contains at least a threshold number of rows.
If no column list is given, the threshold applies to the row count of the whole table.

**Example:**
```yaml
table:
  tests:
    - expect: minimum_count(1, ['a', 'b'])
```
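
One caveat with a grouped minimum check is worth sketching: a `GROUP BY` only produces groups that have at least one row, so a partition that is missing entirely is never reported. The example below uses an in-memory SQLite database; the table `events`, the column `region`, and the helper `partitions_under` are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT)")
# Note: no rows at all for an expected region such as 'apac'.
conn.executemany("INSERT INTO events VALUES (?)", [("eu",), ("eu",), ("us",)])

# Hypothetical helper: every returned row is a partition below the threshold.
def partitions_under(conn, min_count):
    query = """
        SELECT region, COUNT(*) AS row_count
        FROM events
        GROUP BY region
        HAVING COUNT(*) < ?
        ORDER BY region
    """
    return conn.execute(query, (min_count,)).fetchall()

violations = partitions_under(conn, 2)
print(violations)  # [('us', 1)]

# The absent 'apac' partition does not show up as a violation.
print(any(region == "apac" for region, _ in violations))  # False
```

If completely missing partitions must be detected, the expected partition values would need to come from a separate reference list or table.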
72 changes: 63 additions & 9 deletions macros/test.jinja
@@ -26,7 +26,6 @@
'{{severity}}: column {{ column_name }} has unexpected values in {{condition_str}}'
{%- endmacro %}


{# should mention column_name in condition#}
{% macro valid_scalar(severity, column_name, condition) -%}
COUNT(CASE WHEN NOT({{ condition }}) THEN 1 ELSE NULL END) > 0
@@ -43,7 +42,7 @@


{# ---------------------------------------------------------------------------------------------- #}
{# number column checks: via aggregate #}
{# number column checks: via aggregate #}
{% macro minimum(severity, column_name, value) -%}
NOT(MIN({{column_name}}) >= {{value}})
==>
@@ -68,38 +67,44 @@
'{{severity}}: column {{ column_name }} has values greater than or equal to {{ value }}'
{%- endmacro %}


{% macro between(severity, column_name, min_value, max_value) -%}
NOT(MIN({{column_name}}) >= {{min_value}}) or MAX({{column_name}}) > {{max_value}}
==>
'{{severity}}: column {{ column_name }} has values outside of {{min_value | safe_str()}}..{{max_value| safe_str()}}'
{%- endmacro %}

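{# Fails when the newest value is more than `value` date parts older than the reference. #}
{# Assumes a Trino-style date_diff(unit, start, end); adjust for other dialects. #}
{% macro fresh(severity, column_name, reference_value, value, date_part) -%}
date_diff('{{ date_part }}', MAX({{ column_name }}), {{ reference_value }}) > {{ value }}
==>
'{{severity}}: column {{ column_name }} has no values fresher than {{ value | safe_str }} {{ date_part }}'
{%- endmacro %}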



{# ---------------------------------------------------------------------------------------------- #}
{# string column checks: via aggregate #}
{# string column checks: via aggregate #}
{% macro max_length(severity, column_name, value) -%}
MAX(LENGTH({{column_name}})) > {{value}}
==>
'{{severity}}: column {{ column_name }} has string lengths greater than {{ value | safe_str }}'
{%- endmacro %}

{# string column checks: via aggregate #}
{# string column checks: via aggregate #}
{% macro min_length(severity, column_name, value) -%}
MIN(LENGTH({{column_name}})) < {{value}}
==>
'{{severity}}: column {{ column_name }} has string lengths shorter than {{ value | safe_str }}'
{%- endmacro %}


{# string column checks: via aggregate #}
{# string column checks: via aggregate #}
{% macro like(severity, column_name, value) -%}
NOT(COUNT(CASE WHEN {{ column_name }} LIKE {{value}} THEN 1 ELSE NULL END)>0)
==>
'{{severity}}: column {{ column_name }} has strings that are NOT like {{ value| safe_str }}'
{%- endmacro %}

{# string column checks: via aggregate #}
{# string column checks: via aggregate #}
{% macro try_cast(severity, column_name, type_name) -%}
NOT(COUNT(CASE WHEN TRY_CAST({{column_name}} AS {{type_name}}) IS NOT NULL THEN 1 ELSE NULL END)>0)
==>
@@ -116,9 +121,8 @@
{%- endmacro %}



{# ---------------------------------------------------------------------------------------------- #}
{# dbt generic tests #}
{# generic tests #}



@@ -183,6 +187,56 @@
SELECT reason FROM {{verdict}}
{% endmacro %}

{% macro maximum_count(severity, table_name, max_count, column_list) %}

{#- column_list is optional: without it, the threshold applies to the whole table -#}
{%- if column_list is none -%}
{%- set verdict = (table_name ~ '_maximum_count_' ~ max_count) | safe_id -%}
{%- set grouped_columns = '' -%}
{%- set scope = 'table ' ~ table_name -%}
{% else %}
{%- set verdict = (table_name ~ '_maximum_count_' ~ max_count ~ '_by_' ~ (column_list | join('_'))) | safe_id -%}
{%- set grouped_columns = 'GROUP BY ' ~ (column_list | join(', ')) -%}
{%- set scope = 'columns ' ~ (column_list | join(', ')) -%}
{%- endif -%}

{{verdict}} AS (
WITH RowCounts AS (
{# one row per group (or per table) whose row count exceeds the maximum #}
SELECT COUNT(*) AS row_count
FROM {{table_name}}
{{ grouped_columns }}
HAVING COUNT(*) > {{ max_count }}
)
SELECT
'{{severity}}: {{ scope | safe_str }} has row count above maximum threshold' AS reason
FROM RowCounts
)
==>
SELECT reason FROM {{verdict}}
{% endmacro %}

{% macro minimum_count(severity, table_name, min_count, column_list) %}

{#- column_list is optional: without it, the threshold applies to the whole table -#}
{%- if column_list is none -%}
{%- set verdict = (table_name ~ '_minimum_count_' ~ min_count) | safe_id -%}
{%- set grouped_columns = '' -%}
{%- set scope = 'table ' ~ table_name -%}
{% else %}
{%- set verdict = (table_name ~ '_minimum_count_' ~ min_count ~ '_by_' ~ (column_list | join('_'))) | safe_id -%}
{%- set grouped_columns = 'GROUP BY ' ~ (column_list | join(', ')) -%}
{%- set scope = 'columns ' ~ (column_list | join(', ')) -%}
{%- endif -%}

{{verdict}} AS (
WITH RowCounts AS (
{# one row per group (or per table) whose row count falls below the minimum #}
SELECT COUNT(*) AS row_count
FROM {{table_name}}
{{ grouped_columns }}
HAVING COUNT(*) < {{ min_count }}
)
SELECT
'{{severity}}: {{ scope | safe_str }} has row count below minimum threshold' AS reason
FROM RowCounts
)
==>
SELECT reason FROM {{verdict}}
{% endmacro %}


{# ---------------------------------------------------------------------------------------------- #}
{# generate column constraints #}