Commit d63e5c6

Merge branch 'master' into releases

Author: Alex Higgs
2 parents da5553c + 06e404f commit d63e5c6

9 files changed: +125 -80 lines changed

README.md

+2-2
@@ -4,7 +4,7 @@

 latest [![Documentation Status](https://readthedocs.org/projects/dbtvault/badge/?version=latest)](https://dbtvault.readthedocs.io/en/latest/?badge=latest)

-stable [![Documentation Status](https://readthedocs.org/projects/dbtvault/badge/?version=v0.3.1-pre)](https://dbtvault.readthedocs.io/en/v0.3.1-pre/?badge=v0.3.1-pre)
+stable [![Documentation Status](https://readthedocs.org/projects/dbtvault/badge/?version=v0.3.2-pre)](https://dbtvault.readthedocs.io/en/v0.3.2-pre/?badge=v0.3.2-pre)

 [past docs versions](https://dbtvault.readthedocs.io/en/latest/changelog/)

@@ -34,7 +34,7 @@ Add the following to your ```packages.yml```
 packages:

 - git: "https://github.com/Datavault-UK/dbtvault"
-revision: v0.3.1-pre # Latest stable version
+revision: v0.3.2-pre # Latest stable version
 ```

 And run

dbt_project.yml

+1-1
@@ -1,5 +1,5 @@
 name: 'dbtvault'
-version: '0.3.1'
+version: '0.3.2'

 profile: 'dbtvault'

docs/bestpractices.md

+41-4
@@ -31,16 +31,53 @@ If there is already a source in the raw staging layer, you may keep this or over

 ## Hashing

-Best practises for hashing include:
+!!! seealso "See Also"
+- [hash](#hash)
+- [multi-hash](macros.md#multi_hash)
+
+### The drawbacks of using MD5

-- Alpha sorting hashdiff columns. dbtvault does this for us, so no worries! Refer to the [multi-hash](macros.md#multi_hash) docs for how to do this
+We are using md5 to calculate the hash in the macros above. If your table contains more than a few billion rows,
+then there is a chance of a clash: where two different values generate the same hash value
+(see [Collision vulnerabilities](https://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities)).
+
+For this reason, it **should not be** used for cryptographic purposes either.
+
+In future releases of dbtvault, we will allow you to change the algorithm that is used (e.g. to SHA-256) to reduce the
+chance of a clash (at the expense of more processing and a larger column), or switch off hashing entirely.
+
+### Why do we hash?
+
+Data Vault uses hashing for two different purposes.
+
+#### Primary Key Hashing
+
+A hash of the primary key. This creates a surrogate key, but it is calculated consistently across the database:
+as it is a single column, same data type, it supports the use of pattern-based loading.
+
+#### Hashdiffs
+
+Used to finger-print the payload of a satellite (similar to a checksum) so it is easier to detect if there has been a
+change in payload, to trigger the load of a new satellite record. This simplifies the SQL as otherwise we'd have to
+compare each column in turn and handle nulls to see if a change had occured.
+
+Hashing is sensitive to column ordering. You can ask the macro to sort the columns alphabetically for you
+(as per best practices), or switch this off and let your order take precedence (by setting the sort parameter
+to true or false accordingly). Columns are sorted by their alias.
+
+### Hashing best practices
+
+Best practices for hashing include:
+
+- Alpha sorting hashdiff columns. As mentioned, dbtvault can do this for us, so no worries!
+Refer to the [multi-hash](macros.md#multi_hash) docs for details on how to do this.

 - Ensure all **hub** columns used to calculate a primary key hash are presented in the same order across all
 staging tables

 !!! note
-Some tables may use different column names for primary key components, so we cannot sort the columns for
-you as we do with hashdiffs.
+Some tables may use different column names for primary key components, so you generally **should not** use
+the sorting functionality for primary keys.

 - For **links**, columns must be sorted by the primary key of the hub and arranged alphabetically by the hub name.
 The order must also be the same as each hub.
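
Reviewer note on the "few billion rows" figure in the MD5 section above: the chance of an accidental clash can be estimated with the birthday approximation. The following is a rough sketch only, assuming MD5 outputs behave like uniformly random 128-bit values. It says nothing about deliberately crafted collisions, which is what the linked Wikipedia section covers. The function name is illustrative and not part of dbtvault.

```python
import math


def collision_probability(n_rows: int, hash_bits: int = 128) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes hash values are uniformly distributed (an idealisation).
    """
    exponent = -n_rows * (n_rows - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for rows in (10**9, 10**10, 10**12):
        print(f"{rows:>15,d} rows, 128-bit hash: p ~= {collision_probability(rows):.3e}")
    # A 256-bit hash (e.g. SHA-256) lowers the estimate further for the same row count.
    print(f"{10**10:>15,d} rows, 256-bit hash: p ~= {collision_probability(10**10, 256):.3e}")
```

Under this idealised model, a wider hash lowers the estimate for the same row count, which matches the trade-off described in the added text (lower clash chance at the cost of a larger column).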

docs/changelog.md

+11
@@ -4,6 +4,17 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [v0.3.2-pre] - 2019-10-28
+[![Documentation Status](https://readthedocs.org/projects/dbtvault/badge/?version=v0.3.2-pre)](https://dbtvault.readthedocs.io/en/v0.3.2-pre/?badge=v0.3.2-pre)
+
+### Bug Fixes
+
+- Fixed a bug where the logic for performing a base-load (loading for the first time) on a union-based hub or link was incorrect, causing a load failure.
+
+### Documentation
+
+- Various corrections and clarifications on the macros page.
+
 ## [v0.3.1-pre] - 2019-10-25
 [![Documentation Status](https://readthedocs.org/projects/dbtvault/badge/?version=v0.3.1-pre)](https://dbtvault.readthedocs.io/en/v0.3.1-pre/?badge=v0.3.1-pre)

docs/macros.md

+53-23
@@ -1,26 +1,44 @@
 ## Table templates
 ######(macros/tables)

-These macros form the core of the package and can be called in your models to build the tables for your Data Vault.
+These macros form the core of the package and can be called in your models to build the different types of tables needed
+for your Data Vault.

+### Metadata notes
 #### Using a source reference for the target metadata

-As of release 0.3, you may now use a reference as a target metadata value, to streamline metadata entry.
+!!! note
+As of release 0.3, you may now use a source reference as a target metadata value, to streamline metadata entry.
+Read below!

-In the usage examples for the table template macros in this section, you will see ```source``` provided as the values for some
-of the target metadata variables. ```source``` has been declared as a variable at the top of the models, and holds a
-reference to the source table we are loading from. This is shorthand for retaining the name and data types of the columns as they were
-provided in the ```src``` variables. You may wish to alias the columns or change their data types in specific
-circumstances, which is possible by providing a triple in the form of a list.
+In the usage examples for the table template macros in this section, you will see ```source``` provided as the values
+for some of the target metadata variables. ```source``` has been declared as a variable at the top of the models,
+and holds a reference to the source table we are loading from. This is shorthand for retaining the name and data types
+of the columns as they are provided in the ```src``` variables. You may wish to alias the columns or change their data
+types in specific circumstances, which is possible by providing an additional parameter as a list of triples:
+``` (source column name, data type to cast to, target column name)```.

 Both approaches are shown in the snippet below:

 ```mysql
+{%- set src_pk = 'CUSTOMER_NATION_PK' -%}
+{%- set src_fk = ['CUSTOMER_PK', 'NATION_PK'] -%}
+{%- ...other src metadata... -%}
+
 {%- set tgt_pk = source -%}
 {%- set tgt_fk = [['CUSTOMER_PK', 'BINARY(16)', 'CUSTOMER_FK'],
 ['NATION_PK', 'BINARY(16)', 'NATION_FK']] -%}
 ```

+Here, we are keeping the ```tgt_pk``` (the target table's primary key) the same as the primary key identified in the
+source (```src_pk```).
+Behind the scenes, the macro will get the datatype of the column provided in the ```src_pk``` variable and generate a
+mapping for us. If the ```src_pk``` column does not exist, an appropriate exception will be raised.
+
+Alternatively we have provided a manual mapping for the ```tgt_fk``` (the target table's foreign key).
+
+*For further details and examples on both methods, refer to the usage examples
+and snippets in the table template documentation below (both Single-Source and Union).*

 !!! note
 If only aliasing and **not** changing data types, we suggest using the [add_columns](#add_columns) macro.
@@ -30,7 +48,7 @@ ___

 ### hub_template

-Creates a hub with provided metadata.
+Generates sql to build a hub table using the provided metadata.

 ```mysql
 dbtvault.hub_template(src_pk, src_nk, src_ldts, src_source,
@@ -152,7 +170,7 @@ ___

 ### link_template

-Creates a link with provided metadata.
+Generates sql to build a link table using the provided metadata.

 ```mysql
 dbtvault.link_template(src_pk, src_fk, src_ldts, src_source,
@@ -228,7 +246,6 @@ dbtvault.link_template(src_pk, src_fk, src_ldts, src_source,
 source) }}
 ```

-
 #### Output

 ```mysql tab="Single-Source"
@@ -280,7 +297,7 @@ ___

 ### sat_template

-Creates a satellite with provided metadata.
+Generates sql to build a satellite table using the provided metadata.

 ```mysql
 dbtvault.sat_template(src_pk, src_hashdiff, src_payload,
@@ -398,9 +415,10 @@ ___
 [Read More](https://www.md5online.org/blog/why-md5-is-not-safe/)

 !!! seealso "See Also"
-[hash](#hash)
+- [hash](#hash)
+- [Hashing best practises and why we hash](bestpractices.md#hashing)

-A macro for generating multiple lines of hashing SQL for columns:
+This macro will generate SQL hashing sequences for one or more columns as below:
 ```sql
 CAST(MD5_BINARY(UPPER(TRIM(CAST(column1 AS VARCHAR)))) AS BINARY(16)) AS alias1,
 CAST(MD5_BINARY(UPPER(TRIM(CAST(column2 AS VARCHAR)))) AS BINARY(16)) AS alias2
@@ -437,9 +455,9 @@ CAST(MD5_BINARY(CONCAT(
 ```

 !!! success "Column sorting"
-You do not need to worry about providing the columns in any particular order, as long as you set the
-```sort``` flag to true when creating hashdiffs.
-
+If you wish to sort columns in alphabetical order as per [best practices](bestpractices.md#hashing),
+you do not need to worry about doing this manually, just set the
+```sort``` flag to true when creating hashdiffs as per the above example.
 ___

 ### add_columns
@@ -483,16 +501,21 @@ OLD_CUSTOMER_PK AS CUSTOMER_PK
 The ```add_columns``` macro will automatically select all columns from the optional ```source_table``` reference,
 if provided.

-##### Overring source columns
+##### Overriding source columns

 You may wish to override some of the source columns with different values. To replace the ```SOURCE```
 or ```LOADDATE``` column value, for example, then you must provide the column name
 that you wish to override as the alias in the pair.

+!!! note
+The macro will not actually override (delete or replace) any of the source columns, but simply add new columns
+using the provided column as a basis.
+
 ##### Functions

-Database functions may be used, for example ```CURRENT_DATE()```, to set the current date as the value of a column, as on
-```line 2``` of the usage example.
+Database functions may be used, for example ```CURRENT_DATE()``` to set the current date as the value of a column, as on
+```line 2``` of the usage example. Any function supported by the database is valid, for example ```LPAD()```, which pads
+a column with leading zeroes.

 ##### Adding constants
 With the ```add_columns``` macro, you may provide constants.
@@ -502,9 +525,11 @@ and the macro will do the rest. See ```line 3``` of the usage example above, and

 ##### Aliasing columns

-As of release 0.3, columns must now be aliased prior to loading, in the staging layer. This can be done by providing the
+As of release 0.3, columns should now be aliased in the staging layer prior to loading. This can be achieved by providing the
 column name you wish to alias as the first argument in a pair, and providing the alias for that column as the second argument.
-This process can be observed on ```line 4``` of the usage example above.
+This can be observed on ```line 4``` of the usage example above. Aliasing can still be carried out using a
+manual mapping (shown in the [table template](#table-templates) section examples) but this is less concise for aliasing
+purposes.

 ___

@@ -517,7 +542,7 @@ FROM MYDATABASE.MYSCHEMA.MYTABLE
 ```

 !!! info
-Sources need to be set up in dbt. [Read More](https://docs.getdbt.com/docs/using-sources)
+Sources need to be set up in dbt to ensure this works. [Read More](https://docs.getdbt.com/docs/using-sources)

 #### Parameters

@@ -604,14 +629,19 @@ ___
 The intended use is for creating checksum-like fields only, so that a record change can be detected.

 [Read More](https://www.md5online.org/blog/why-md5-is-not-safe/)
+
+!!! seealso "See Also"
+- [multi-hash](#multi_hash)
+- [Hashing best practises and why we hash](bestpractices.md#hashing)

 A macro for generating hashing SQL for columns:
 ```sql
 CAST(MD5_BINARY(UPPER(TRIM(CAST(column AS VARCHAR)))) AS BINARY(16)) AS alias
 ```

 - Can provide multiple columns as a list to create a concatenated hash
-- Hashdiffs should be alpha sorted using the ```sort``` flag.
+- Columns are sorted alphabetically (by alias) if you set the ```sort``` flag to true.
+- Generally, you should alpha sort hashdiffs using the ```sort``` flag.
 - Casts a column as ```VARCHAR```, transforms to ```UPPER``` case and trims whitespace
 - ```'^^'``` Accounts for null values with a double caret
 - ```'||'``` Concatenates with a double pipe
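
Reviewer note: the hashing rules listed above (cast to string, upper-case, trim, `'^^'` for nulls, `'||'` as the delimiter, optional alphabetical sort) can be illustrated outside the database. The following is a minimal Python sketch under stated assumptions, not the SQL the macro actually generates. The names `normalise` and `hashdiff` are hypothetical, dictionary keys stand in for column aliases, and Python's `strip()` removes all surrounding whitespace, which may differ from the database's `TRIM`.

```python
import hashlib
from typing import Mapping

NULL_PLACEHOLDER = "^^"  # stands in for null values, as in the docs above
DELIMITER = "||"         # joins the normalised column values


def normalise(value: object) -> str:
    """Mimic CAST AS VARCHAR, UPPER and TRIM for a single value."""
    if value is None:
        return NULL_PLACEHOLDER
    return str(value).strip().upper()


def hashdiff(payload: Mapping[str, object], sort: bool = True) -> bytes:
    """Return a 16-byte MD5 digest over the normalised payload values.

    With sort=True the columns are ordered alphabetically by name, so the
    order in which they were supplied no longer affects the result.
    """
    names = sorted(payload) if sort else list(payload)
    joined = DELIMITER.join(normalise(payload[name]) for name in names)
    return hashlib.md5(joined.encode("utf-8")).digest()


if __name__ == "__main__":
    a = hashdiff({"NAME": " alice ", "PHONE": None})
    b = hashdiff({"PHONE": None, "NAME": "ALICE"})
    print(a.hex(), a == b)  # same digest: order, case and padding are normalised
```

Because the digest depends on column order, two payloads with identical values only compare equal when their columns are arranged the same way, which is why the docs recommend setting the `sort` flag for hashdiffs.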

macros/internal/get_tgt_cols.sql

-44
This file was deleted.

macros/internal/union.sql

+1-2
@@ -14,11 +14,10 @@
 -#}
 {%- macro union(src_pk, src_nk, src_ldts, src_source, tgt_pk, source) -%}

-SELECT {{ dbtvault.prefix([src_pk, src_nk, src_ldts, src_source], 'src')}}{% if is_incremental() or union -%},
+SELECT {{ dbtvault.prefix([src_pk, src_nk, src_ldts, src_source], 'src')}},
 LAG({{ src_source }}, 1)
 OVER(PARTITION by {{ tgt_pk | last }}
 ORDER BY {{ tgt_pk | last }}) AS FIRST_SOURCE
-{%- endif %}
 FROM (

 {%- set letters='abcdefghijklmnopqrstuvwxyz' -%}

macros/tables/hub_template.sql

+8-2
@@ -33,12 +33,18 @@ FROM (
 tgt_pk, tgt_nk, tgt_ldts, tgt_source,
 source, is_union) }}
 ) AS stg
-{% if is_incremental() or is_union -%}
+{# If incremental union or single #}
+{%- if is_incremental() -%}
 LEFT JOIN {{ this }} AS tgt
 ON {{ dbtvault.prefix([tgt_pk|first], 'stg') }} = {{ dbtvault.prefix([tgt_pk|last], 'tgt') }}
 WHERE {{ dbtvault.prefix([tgt_pk|last], 'tgt') }} IS NULL
-{%- if is_union %}
+{# If an incremental and union load -#}
+{% if is_union -%}
 AND stg.FIRST_SOURCE IS NULL
 {%- endif -%}
 {%- endif -%}
+{# If a union base-load #}
+{%- if is_union and not is_incremental() -%}
+WHERE stg.FIRST_SOURCE IS NULL
+{%- endif -%}
 {%- endmacro -%}

macros/tables/link_template.sql

+8-2
@@ -33,12 +33,18 @@ FROM (
 tgt_pk, tgt_fk, tgt_ldts, tgt_source,
 source, is_union) }}
 ) AS stg
-{% if is_incremental() or is_union -%}
+{# If incremental union or single #}
+{%- if is_incremental() -%}
 LEFT JOIN {{ this }} AS tgt
 ON {{ dbtvault.prefix([tgt_pk|first], 'stg') }} = {{ dbtvault.prefix([tgt_pk|last], 'tgt') }}
 WHERE {{ dbtvault.prefix([tgt_pk|last], 'tgt') }} IS NULL
-{%- if is_union %}
+{# If an incremental and union load -#}
+{% if is_union -%}
 AND stg.FIRST_SOURCE IS NULL
 {%- endif -%}
 {%- endif -%}
+{# If a union base-load #}
+{%- if is_union and not is_incremental() -%}
+WHERE stg.FIRST_SOURCE IS NULL
+{%- endif -%}
 {%- endmacro -%}
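
Reviewer note: the branching introduced in the hub and link templates above can be summarised as a small decision function. This is an illustrative Python sketch of the control flow only, not code from the package. `load_filter` is a hypothetical name, and `<this>` and `<pk>` are placeholders for the rendered `{{ this }}` relation and primary key column.

```python
from typing import List


def load_filter(is_incremental: bool, is_union: bool) -> List[str]:
    """Return the SQL fragments appended after ') AS stg' for each load type."""
    clauses: List[str] = []
    if is_incremental:
        # Incremental load (single-source or union): anti-join against the target.
        clauses.append("LEFT JOIN <this> AS tgt ON stg.<pk> = tgt.<pk>")
        clauses.append("WHERE tgt.<pk> IS NULL")
        if is_union:
            # Incremental union load: also keep only the first source per key.
            clauses.append("AND stg.FIRST_SOURCE IS NULL")
    elif is_union:
        # Union base-load: no join against the target, just de-duplicate sources.
        clauses.append("WHERE stg.FIRST_SOURCE IS NULL")
    return clauses


if __name__ == "__main__":
    for incremental in (False, True):
        for union in (False, True):
            print(incremental, union, load_filter(incremental, union))
```

Before this commit the outer condition was `is_incremental() or is_union`, so a union base-load also emitted the `LEFT JOIN` against `{{ this }}`. The new branches emit the join only for incremental loads.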
