The Data Vault Guru: a pragmatic guide on building a data vault
Rate it:
20%
Flag icon
Never use any of the aforementioned business dates as load date timestamps!
23%
Flag icon
If raw data sources had all the derived content (making it raw content) as well then there would be no business vault.
23%
Flag icon
in this framework we do not enforce referential integrity and cause delayed latency in data availability, as soon as the data is ready to load it loads. Links, link-satellites, hubs and its satellites all load independently of each other, there are no dependencies because all use either natural keys or surrogate hash keys that are deterministic.
23%
Flag icon
3.5 Table & column metadata reference
24%
Flag icon
Table Artefact Suggested prefix
26%
Flag icon
Inserting a ghost record into a hub table is optional but mandatory in satellite tables, it only becomes necessary to load a ghost record to a hub table for schema-on-read platforms.
26%
Flag icon
insert into <hub> select distinct dv_tenantid , dv_bkeycode_hub_party , dv_hashkey_hub_party , dv_recsource , dv_taskid , dv_appts , dv_loadts , contact_id from <staged-file> stg where not exists (select 1 from <hub> h where stg.dv_hashkey_hub_party = h.dv_hashkey_hub_party) ;
26%
Flag icon
What is standard is the business key treatments, that is cast business keys to text; ensure the business keys are on a standard encoding (ASCII); include a business key collision code; include the multi-tenant id; apply a standard sanding value as delimiters for composite business keys; trim all leading and trailing blanks for each business key; substitute null values with a zero key; uppercase the business keys (passive integration); choice of hashing algorithm;
30%
Flag icon
degenerate dimensions
31%
Flag icon
4.3 SATELLITES An apt definition in the Oxford dictionary summarises the satellite table’s place in data vault, “something that is separated from or on the periphery of something else but is nevertheless dependent on or controlled by it.” A satellite contains the descriptive details about the artefact it is adjacent to. If the satellite is adjacent to a hub table then it will contain the change-tracked descriptive details about a single business entity in a hub. If the satellite is adjacent to a link table, then it will contain the change-tracked descriptive details about a relationship in a ...more
34%
Flag icon
Data vault documentation advocates that business keys should not live in satellite tables. First and foremost, this is correct, all treated business keys must live in the hub table and not the adjacent satellites and links in a hash-key based data vault, however there are exceptions to the rule, and we will discuss each one.
35%
Flag icon
Schema Evolution The indisputable constant in business is change, this change is reflected in the purchase decisions of IT systems and the evolution and upgrade of IT systems themselves,
46%
Flag icon
Because data vault has a limited number of table types to load the loads to these tables can be fully automated by repeatable parameterized loading patterns (tasks):- Load hubs Load links Load satellites (with dependent child keys) Reuse loading pattern for staged status tracking satellite Reuse loading pattern for record tracking satellite Reuse loading pattern for staged effectivity satellite Load multi-active satellites Load non-historized links and satellites
47%
Flag icon
regular – satellite is at the single record grain or multi-active – record at a set grain or non-historized – simply load what is in the staged file.
47%
Flag icon
Source push/pull Source files may arrive in varying delta types either pushed from source or pulled from source, staging and loading
48%
Flag icon
Also shared between the load to the hub and the satellites are the record source, tenant id, task id, load date timestamp and the applied date timestamp {{HUB-METADATA}}.
48%
Flag icon
idempotent.
51%
Flag icon
Each of the three load procedures are autonomous and predictable. If there is nothing new to load then nothing new will be loaded. As a part of test-driven development this is a vital test; what happens if I try to break my code by loading the same thing twice?
63%
Flag icon
in order to be successful in using data vault analysts should really try to stop comparing Kimball data models to data vault as the two modelling paradigms are designed to solve different problems in the data warehouse;
69%
Flag icon
Do not include ghost record creation in your daily data pipelines, they are to be created and inserted once upon table creation then never thought of again until creating point-in-time (PIT) tables to join to that ghost record.
70%
Flag icon
Can a PIT table be created as a VIEW? Absolutely not! PIT and bridge tables are in essence JOIN INDEXES (a term borrowed from Teradata jargon) that enable equi-join performance gains when retrieving data from the underlying data vault structures. A well-constructed PIT table should include the business keys available in the hub table(s) so that a join to the hub is no longer necessary.
70%
Flag icon
No, PIT and bridge tables must be physical table constructs with lots of indexes from the outset!
72%
Flag icon
Conceptually the bridge table is a link super highway and is heavily indexed for performance and can be used in conjunction with various PIT tables. They too take snapshots but of the hub and link tables instead
74%
Flag icon
data vault modelling is a data modelling task and attempting to solve data modelling with engineering will likely lead to a fake vault (see: Appendix D) and a future legacy data warehouse.
74%
Flag icon
0-Stage Area Criteria Hash-key based or Non-hash key based
75%
Flag icon
Business keys can be included in the record-hash if desired. Yes APP Are dependent child keys included in hash difference calculations? Yes APP Are business keys included in hash difference calculations? No
76%
Flag icon
If multiple sources are loading to a hub with composite keys, is the business key order from multiple sources the same for the target hub table?
88%
Flag icon
Data vault is not meant to replace dimensional modelling; it is complimentary to it; it is far better suited to ingest data and flexible enough to adapt with the changes of the business and business processes.
91%
Flag icon
No link-satellites Data vault delivered without link satellites will likely include weak hubs to compensate, the weak hub is arbitrary and simply creates additional join conditions to get the data out.