Commit 8bdd40f

author
Alex Higgs
committed
Added performance note to worked example
1 parent a2f95e3 commit 8bdd40f

2 files changed

+23
-3
lines changed

docs/staging.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -63,7 +63,7 @@ in our model.
6363
!!! note
6464
On line 3 below we are using a dbt source.
6565

66-
If you have not yet set up sources in your dbt configuration please refer to [setting up sources](gettingstarted.md#setting-up-sources).
66+
If you have not yet set up sources in your dbt configuration please refer to [setting up sources](walkthrough.md#setting-up-sources).
6767

6868

6969
```stg_customer_hashed.sql```

docs/workedexample.md

+22-2
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,6 @@ We will:
1717
- process the raw staging layer.
1818
- create a Data Vault with hubs, links and satellites using dbtvault and pre-written models.
1919

20-
2120
## Pre-requisites
2221

2322
These pre-requisites are separate from those found on the [getting started](walkthrough.md) page and will
@@ -37,4 +36,25 @@ be the only necessary requirements you will need to get started with the example
3736

3837
!!! note
3938
We have provided a complete ```requirements.txt``` to install with ```pip install -r requirements.txt```
40-
as a quick way of getting your Python environment set up. This file includes dbt and comes with the download in the next section.
39+
as a quick way of getting your Python environment set up. This file includes dbt and comes with the download in the
40+
next section.
41+
42+
## Performance note
43+
44+
Please be aware that table structures are simulated from the TPC-H dataset. The TPC-H dataset is a static view of data.
45+
46+
Only a subset of the data contains dates, which allows us to simulate daily feeds. The ```v_stg_orders``` view is
47+
filtered by date. Unfortunately, the ```v_stg_inventory``` view cannot be filtered by date, so it ends up being a feed of
48+
the entire contents of the view each cycle.
49+
50+
This means that inventory-related hubs, links and satellites are populated once during the initial load cycle with
51+
everything, and later cycles insert 0 new records from their left outer joins.
52+
53+
As the dataset increases in size, e.g. if you run with a larger TPC-H dataset (100, 1000 etc.), be aware that you are
54+
processing the entire inventory dataset each cycle, which results in unrepresentative load cycle times.
55+
56+
Unfortunately, this is the nature of the dataset; it will not be that way for other datasets. We will look at additional
57+
datasets in the future!
58+
59+
If you are feeling adventurous, you may disable the inventory feed (```raw_inventory``` and child models) to see a more
60+
accurate representation of performance.
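
The behaviour described in the performance note (a full feed every cycle, with later cycles inserting nothing) can be sketched in isolation. The example below is a minimal illustration in Python with SQLite, not dbtvault's generated SQL. The table names `stg_inventory` and `hub_inventory`, the `part_key` column and the `load_hub` helper are all hypothetical.

```python
import sqlite3


def load_hub(conn, feed_keys):
    """Simulate one load cycle and return how many new hub rows it inserted.

    The staging table is replaced with the full feed, then only keys that
    are not yet in the hub are inserted (left outer join + IS NULL filter).
    """
    before = conn.execute("SELECT COUNT(*) FROM hub_inventory").fetchone()[0]

    # Hypothetical staging table: refreshed with the entire feed each cycle.
    conn.execute("DELETE FROM stg_inventory")
    conn.executemany(
        "INSERT INTO stg_inventory (part_key) VALUES (?)",
        [(key,) for key in feed_keys],
    )

    # Every staged row is still scanned and joined, even when nothing is new.
    conn.execute(
        """
        INSERT INTO hub_inventory (part_key)
        SELECT DISTINCT s.part_key
        FROM stg_inventory AS s
        LEFT OUTER JOIN hub_inventory AS h
            ON s.part_key = h.part_key
        WHERE h.part_key IS NULL
        """
    )

    after = conn.execute("SELECT COUNT(*) FROM hub_inventory").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_inventory (part_key TEXT)")
conn.execute("CREATE TABLE hub_inventory (part_key TEXT PRIMARY KEY)")

# The same undated feed arrives in full on every cycle.
full_feed = ["P1", "P2", "P3"]
first_cycle = load_hub(conn, full_feed)
second_cycle = load_hub(conn, full_feed)

print(first_cycle, second_cycle)  # 3 0
```

The second cycle does the same staging and join work as the first but inserts nothing. This is why cycle times on an undated feed say little about incremental loads.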
