Kindle Notes & Highlights
by Patrick Cuba
Read between October 16 - October 22, 2023
Never use any of the aforementioned business dates as load date timestamps!
If the raw data sources contained all of the derived content as well (making it raw content), then there would be no business vault.
In this framework we do not enforce referential integrity, nor do we accept the delayed data availability it causes; as soon as the data is ready to load, it loads. Links, link-satellites, hubs and their satellites all load independently of each other. There are no dependencies, because all of them use either natural keys or deterministic surrogate hash keys.
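Because the hash keys are computed deterministically in staging, a link load never has to look up its hubs first. A minimal sketch of such an independent link load, assuming hypothetical staged columns dv_hashkey_lnk_party_contact, dv_hashkey_hub_party and dv_hashkey_hub_contact; the table and column names follow the hub example later in these notes and are illustrative, not the author's exact pattern:

insert into <link>
select distinct
    dv_tenantid
  , dv_hashkey_lnk_party_contact   -- hash of the combined business keys, derived in staging
  , dv_hashkey_hub_party
  , dv_hashkey_hub_contact
  , dv_recsource
  , dv_taskid
  , dv_appts
  , dv_loadts
from <staged-file> stg
where not exists (select 1
                    from <link> l
                   where stg.dv_hashkey_lnk_party_contact = l.dv_hashkey_lnk_party_contact);
-- No foreign-key check against the hubs: the hubs can load before, after or in parallel.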
3.5 Table & column metadata reference
Table Artefact | Suggested prefix
Inserting a ghost record into a hub table is optional but mandatory in satellite tables; it only becomes necessary to load a ghost record into a hub table on schema-on-read platforms.
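A minimal sketch of a one-time ghost record insert into a satellite, assuming the dv_ column names used in the hub example below plus a hypothetical descriptive column contact_name; the all-zero hash key and placeholder dates are illustrative choices, not the author's prescribed values:

-- One-off ghost record so that point-in-time joins always find a satellite row.
insert into <satellite>
   (dv_hashkey_hub_party, dv_loadts, dv_appts, dv_recsource, dv_taskid, dv_hashdiff, contact_name)
values
   ('0000000000000000000000000000000000000000'   -- zero hash key (length depends on the hash algorithm)
  , '1900-01-01 00:00:00'                         -- placeholder load date timestamp
  , '1900-01-01 00:00:00'                         -- placeholder applied date timestamp
  , 'GHOST'
  , 'GHOST'
  , '0000000000000000000000000000000000000000'
  , null);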
insert into <hub>
select distinct
    dv_tenantid
  , dv_bkeycode_hub_party
  , dv_hashkey_hub_party
  , dv_recsource
  , dv_taskid
  , dv_appts
  , dv_loadts
  , contact_id
from <staged-file> stg
where not exists (select 1
                    from <hub> h
                   where stg.dv_hashkey_hub_party = h.dv_hashkey_hub_party);
What is standard is the business key treatments, that is (a sketch of these treatments follows this list):
- cast business keys to text
- ensure the business keys are on a standard encoding (ASCII)
- include a business key collision code
- include the multi-tenant id
- apply a standard sanding value as the delimiter for composite business keys
- trim all leading and trailing blanks from each business key
- substitute null values with a zero key
- uppercase the business keys (passive integration)
- choice of hashing algorithm
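A minimal sketch of these treatments applied in staging, assuming the staged columns from the hub load above (dv_tenantid, dv_bkeycode_hub_party, contact_id); sha1() stands in for whichever deterministic hash function the platform provides (md5(), sha1_hex(), etc.), and the '~' sanding value and '-1' zero key are illustrative assumptions:

select
    -- treated business key: cast to text, trim blanks, zero-key nulls, uppercase
    upper(coalesce(trim(cast(contact_id as varchar)), '-1'))            as dv_bkey_hub_party
    -- surrogate hash key: tenant id + collision code + treated key, delimited by a sanding value
  , sha1(
        upper(coalesce(trim(cast(dv_tenantid as varchar)), 'default'))
        || '~' || upper(coalesce(trim(cast(dv_bkeycode_hub_party as varchar)), 'default'))
        || '~' || upper(coalesce(trim(cast(contact_id as varchar)), '-1'))
    )                                                                    as dv_hashkey_hub_party
from <staged-file>;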
degenerate dimensions
4.3 SATELLITES
An apt definition in the Oxford dictionary summarises the satellite table’s place in data vault, “something that is separated from or on the periphery of something else but is nevertheless dependent on or controlled by it.” A satellite contains the descriptive details about the artefact it is adjacent to. If the satellite is adjacent to a hub table then it will contain the change-tracked descriptive details about a single business entity in a hub. If the satellite is adjacent to a link table, then it will contain the change-tracked descriptive details about a relationship in a link.
Data vault documentation advocates that business keys should not live in satellite tables. First and foremost, this is correct: in a hash-key based data vault all treated business keys must live in the hub table and not in the adjacent satellites and links. However, there are exceptions to the rule, and we will discuss each one.
Schema Evolution: The indisputable constant in business is change; this change is reflected in the purchase decisions of IT systems and in the evolution and upgrade of the IT systems themselves.
Because data vault has a limited number of table types to load, the loads to these tables can be fully automated by repeatable, parameterized loading patterns (tasks):
- Load hubs
- Load links
- Load satellites (with dependent child keys)
- Reuse loading pattern for staged status tracking satellite
- Reuse loading pattern for record tracking satellite
- Reuse loading pattern for staged effectivity satellite
- Load multi-active satellites
- Load non-historized links and satellites
- regular – satellite is at the single record grain (the pattern sketched below), or
- multi-active – record at a set grain, or
- non-historized – simply load what is in the staged file.
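A minimal sketch of the regular (single record grain) satellite loading pattern, reusing the dv_ metadata columns from the hub load above and assuming hypothetical dv_hashdiff and contact_name columns; a staged row is inserted only when the satellite's current record for that hash key carries a different hash difference (or no record exists yet):

insert into <satellite>
select
    stg.dv_hashkey_hub_party
  , stg.dv_loadts
  , stg.dv_appts
  , stg.dv_recsource
  , stg.dv_taskid
  , stg.dv_hashdiff
  , stg.contact_name
from <staged-file> stg
where not exists (
    select 1
      from <satellite> sat
     where sat.dv_hashkey_hub_party = stg.dv_hashkey_hub_party
       and sat.dv_loadts = (select max(z.dv_loadts)
                              from <satellite> z
                             where z.dv_hashkey_hub_party = stg.dv_hashkey_hub_party)
       and sat.dv_hashdiff = stg.dv_hashdiff);   -- insert only true changes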
Source push/pull: Source files may arrive in varying delta types, either pushed from source or pulled from source; staging and loading
Also shared between the load to the hub and the satellites are the record source, tenant id, task id, load date timestamp and the applied date timestamp {{HUB-METADATA}}.
idempotent.
Each of the three load procedures is autonomous and predictable. If there is nothing new to load then nothing new will be loaded. As a part of test-driven development this is a vital test: what happens if I try to break my code by loading the same thing twice?
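A minimal sketch of that test, assuming the hub load above; re-running the loading pattern against the same staged file should insert nothing:

-- Idempotency check: run the hub load, capture the count, run the same load again.
select count(*) as rows_after_first_load  from <hub>;
-- ... re-run the hub loading pattern against the same <staged-file> here ...
select count(*) as rows_after_second_load from <hub>;
-- The two counts must match: loading the same thing twice changes nothing.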
In order to be successful in using data vault, analysts should really try to stop comparing Kimball data models to data vault, as the two modelling paradigms are designed to solve different problems in the data warehouse;
Do not include ghost record creation in your daily data pipelines; ghost records are created and inserted once upon table creation and then never thought of again until you create point-in-time (PIT) tables that join to that ghost record.
Can a PIT table be created as a VIEW? Absolutely not! PIT and bridge tables are in essence JOIN INDEXES (a term borrowed from Teradata jargon) that enable equi-join performance gains when retrieving data from the underlying data vault structures. A well-constructed PIT table should include the business keys available in the hub table(s) so that a join to the hub is no longer necessary.
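A minimal sketch of a PIT table build along these lines, assuming a hypothetical <snapshot-dates> table and a single satellite; for each hub key and snapshot date it records the satellite load date timestamp to join on, falling back to the ghost record's zero key and placeholder date when no satellite row exists yet, and it carries the hub's business key (contact_id) so queries no longer need to join to the hub:

insert into <pit>
select
    h.dv_hashkey_hub_party
  , h.contact_id                                              -- business key carried into the PIT
  , d.snapshot_date
  , case when max(s.dv_loadts) is null
         then '0000000000000000000000000000000000000000'      -- point at the satellite ghost record
         else h.dv_hashkey_hub_party end                      as sat_party_hashkey
  , coalesce(max(s.dv_loadts), '1900-01-01 00:00:00')         as sat_party_loadts
from <hub> h
cross join <snapshot-dates> d
left join <satellite> s
       on  s.dv_hashkey_hub_party = h.dv_hashkey_hub_party
       and s.dv_loadts <= d.snapshot_date
group by
    h.dv_hashkey_hub_party
  , h.contact_id
  , d.snapshot_date;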
No, PIT and bridge tables must be physical table constructs with lots of indexes from the outset!
Conceptually the bridge table is a link superhighway; it is heavily indexed for performance and can be used in conjunction with various PIT tables. Bridge tables also take snapshots, but of the hub and link tables instead.
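A minimal sketch of a bridge table build over one link and its two hubs, reusing the hypothetical <link>, <snapshot-dates> and hub names from the earlier sketches; the contact hub and its contact_number business key are illustrative assumptions:

insert into <bridge>
select
    d.snapshot_date
  , l.dv_hashkey_lnk_party_contact
  , hp.dv_hashkey_hub_party
  , hp.contact_id                        -- business key from the party hub
  , hc.dv_hashkey_hub_contact
  , hc.contact_number                    -- business key from the contact hub (assumed)
from <snapshot-dates> d
join <link> l
  on l.dv_loadts <= d.snapshot_date      -- only relationships known as at the snapshot date
join <hub-party> hp
  on hp.dv_hashkey_hub_party = l.dv_hashkey_hub_party
join <hub-contact> hc
  on hc.dv_hashkey_hub_contact = l.dv_hashkey_hub_contact;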
Data vault modelling is a data modelling task, and attempting to solve data modelling with engineering will likely lead to a fake vault (see Appendix D) and a future legacy data warehouse.
0 - Stage Area
Criteria | Hash-key based or Non-hash key based
Business keys can be included in the record-hash if desired. | Yes | APP
Are dependent child keys included in hash difference calculations? | Yes | APP
Are business keys included in hash difference calculations? | No
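A minimal sketch of a hash difference calculation that follows these criteria, i.e. it includes a dependent child key but excludes the business key; the descriptive columns (phone_sequence, contact_name, contact_email) and the '~' delimiter are illustrative assumptions, and sha1() again stands in for the chosen algorithm:

select
    dv_hashkey_hub_party
  , sha1(
        coalesce(trim(cast(phone_sequence as varchar)), '')   -- dependent child key: included
        || '~' || coalesce(trim(cast(contact_name  as varchar)), '')
        || '~' || coalesce(trim(cast(contact_email as varchar)), '')
        -- contact_id, the business key, is deliberately excluded from the hash difference
    ) as dv_hashdiff
from <staged-file>;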
If multiple sources are loading to a hub with composite keys, is the business key order from multiple sources the same for the target hub table?
Data vault is not meant to replace dimensional modelling; it is complementary to it; it is far better suited to ingest data and flexible enough to adapt to the changes of the business and its business processes.
No link-satellites: A data vault delivered without link satellites will likely include weak hubs to compensate; the weak hub is arbitrary and simply creates additional join conditions to get the data out.