Notes from the Data Warehouse Toolkit

by Ralph Kimball

This is a set of notes I made while reading the book "The Data Warehouse Toolkit" by Ralph Kimball.

Goals of the Data Warehouse

  1. The data warehouse provides access to corporate or organizational data.
    The data must be easily accessible, and reports can be run and rerun easily and results returned quickly.
  2. The data in a data warehouse is consistent.
    Two different users who make the same query should get identical results even if they make requests at different times. Also if the data load is incomplete for the request the user is informed about this.
  3. The data in the data warehouse can be sliced and diced along every dimension i.e. by every possible measure in the business.
  4. The data warehouse is not just data, but also a set of tools to query, analyze and present information.
    The DW encompasses both back-end database infrastructure, as well as front-end reporting tools.
  5. The data warehouse is where we publish used data.
    Data is assembled from a variety of sources, cleaned up, quality assured and released only when ready for use.
  6. The quality of data in the data warehouse is a driver of business re-engineering.

I. Two Different Worlds

OLTP is profoundly different from dimensional data warehousing DDW.

Fact tables
Each item in the fact table represents the grain of the fact table.
The fact table is where the numerical measurements/metrics of the business are stored. The most useful facts are numeric, continuously valued and additive. The data should be continuously valued since it will in general be impossible to predict the set of values it can take on.

Dimension tables
Dimension tables are where the description of the data is stored. The best attributes are textual, discrete and are the source of the constraints and row headers in the result set. It is often unclear whether a numeric data field extracted from a production data source is a fact or a a dimensional attribute. We can tell the difference by asking the question "is the data field variable and continuously valued, or is it a discrete value of something that stays fairly constant.

Standard Template Query
select d.rowheader,sum(f.metric) : select list
from fact f,dimension d,time t : from clause
where f.dimensionkey=d.dimensionkey : join constraint
and f.timekey=t.timekey : join constraint
and t.date='11/11/97' : application constraint
group by p.rowheader : group by clause
order by p.rowheader : order by clause
The select list defines the columns that will appear in the result set. There is a row header and an aggregated fact (sum) in the select list.
The from clause is a list of the tables used in the query.
The join constraints capture the relationship between the fact and dimension tables. They link the tables by means of primary keys and foreign keys. The primary key forms the index of the dimension table. The foreign key is used to lookup into the fact table. Every foreign key from the fact table must have its counterpart in the dimension table in order to preserve referential integrity.
The fact table has a composite primary key, meaning that it is a combination of multiple foreign keys. Every fact table has a composite key, and conversely, every table that has a composite key is a fact table. The joins in a dimensional data warehouse capture a fundamental relationship between entities in the business model.
The application constraint serves to limit the result to the subset of the facts desired from the fact table.
The group by clause shows how to summarize the results in the result set, i.e. by row headers.
The order by clause expresses the sort order of the result set for display to the user.
Attributes
The database is only as good as the dimension tables as the attributes in the dimension tables provide the constraints that limit the result set to a useful and meaningful subset of the universe of facts.

OLAP - On-line Analytical Processing

II. The Grocery Store

Steps in the Design Process

  1. Choose a business process to model.
  2. Choose the grain of the business process.
  3. Choose the dimensions that will apply to each fact table record.
  4. Choose the metrics to be stored in the fact table.

Identifying the Processes to Model

Normalization

Time Dimension

Product Dimension

Grocery Store/Page Turn facts

Just as customer counts are semi-additive across the sales fact table due to the fact that we summarized up from the individual transactions to the daily item movement totals (e.g. if we have 25 sales of cornflakes and 30 of rice krispies we cannot tell how many customers bought either cornflakes or rice krispies) so are visitor counts. This is why the Visit dimension is needed to keep track of individual visitors.

III. The Warehouse

Degenerate dimensions are dimension keys with no corresponding dimension table.

IV. Shipments

Ship-To Dimension

VI. Big Dimensions

Drill-down
Drill-down simply means "show more information"/add a new row header to the report.

Resist the tendency to snowflake
Reasons NOT to snowflake:

MiniDimensions

Slowly Changing Dimensions

How do we track changes over time in dimension tables, e.g. product names change, addresses change etc.? This emphasizes the fact that primary dimensions are not completely time independent.

We have the following 3 fundamental choices to make when we encounter a slowly changing dimension:

  1. Overwrite old values with new ones and hence lose the ability to provide a history over time.
    This is very rarely of value, except when the data was incorrect in the first place.
  2. Create a new dimension record at the time of the change with the new attribute values, thereby segmenting history very accurately between the old description and the new description.
    This requires that a new dimension key be generated for this new attribute value.
  3. Creating new "current" fields within the original dimension to record the new attribute values, while keeping the original attribute values as well, thereby being able to describe history both forward and backward from the change either in terms of the original attribute values or in terms of the current attribute values.
    The Type 3 slowly changing dimensions is equipped to handle only the original and current values of the changed attribute. Intermediate values are lost. Thus we cannot partition history as in the Type 2 model.

Slowly Changing MiniDimensions

The attributes isolated in a minidimension are very often those attributes that change over time and whose distinct values need to be tracked. In those cases where all possible combinations of the attribute values have already been created, updating a description in the deriving minidimension (customer) means placing a different key in the fact table record from a certain time forward and overwriting the derived minidimension key (demographics key) in the deriving table record (customer record) whenever it changes.

VII. Financial Services

X. Factless Fact Tables

Factless fact tables are used to track events for which there are no measured facts. There are two major types: event tracking tables and coverage tables.

Event Tracking Tables

These are used for recording events which occur as the culmination of a number of dimensional entities, e.g. daily attendance at a college. There are no facts per se. The "facts" only occur when each of the dimensional entities occur/exist at a point in time. The purpose of the fact table is to register the event of the occurrence of the entities at that point in time.

Coverage Tables

XII. Building a Dimensional Data Warehouse

The 9 decision points to be decided when designing a dimensional data warehouse are:

  1. The processes, and hence the identity of the fact tables.
  2. The grain of each fact table.
  3. The dimensions of each fact table.
  4. The facts, incl. pre-calculated facts.
  5. The dimension attributes with complete descriptions and proper terminology.
  6. How to track slowly changing dimensions.
  7. The aggregations, heterogeneous dimensions, minidimensions, query modes and other physical storage decisions.
  8. The historical duration of the database.
  9. The urgency with which data is extracted and loaded into the data warehouse.

These decisions should be made in the order given.

Interviewing the End User

Interviewing the end user is the most important first step in designing a data warehouse. The interview achieves the following:

  1. It gives the designers an idea as to the needs and expectations of the users
  2. It serves a PR purpose: i.e. it allows the designer to raise the awareness level of the forthcoming data warehouse among end users, and to adjust and correct some of the user's expectations. Hence we are managing user's expectations.

The interviews should be as thorough and encompass as many of the business units of the company as much as possible.

XIII. Aggregates

An aggregate is a fact table record representing a summarization of base-level fact table records. An aggregate fact table record is always associated with one or more aggregate dimension table records.
Question: where do we put the new fact table records and dimension table records and how are they to be administered? There are 2 main approaches:

  1. New Fact Tables
  2. New Level Fields

New Fact Tables

Aggregate fact tables are derivative fact tables since they are derived from the main base-level fact table. Each derivative fact table must be joined to one nor more derivative dimension tables. This derivative dimension table is a small subset of the larger dimension table where the subset only consists of those attributes relevant to the derived dimension. Entirely new keys my be created for this derived dimension. These keys did not exist in the base-level dimension table.

New Level Fields

Aggregates can also be stored by using Level fields in the affected dimension tables, thus leaving the aggregate fact records in the original fact table. The only difference in this approach is that all the new aggregate records are kept in the original dimension and fact tables. In the Level field model the dimension table is used with all its fields, with the addition of a Level field. The Level field describes the aggregation level of every record in the dimension table. The new aggregate records need to have keys that are compatible with and do not conflict with the original base-level keys in the dimension table.

Indexing Issues

Application Issues

A problem with building aggregates for dimensional tables is that applications had to know of the presence of aggregates and specifically request for them in the SQL generated. viz: select ... from base_sales_fact... vs select ... from department_sales fact. This is called an applications discontinuity.

Aggregate Navigation

A new layer of software, the aggregate navigatorhas been inserted between the application and the DBMS in order to address the applications discontinuity problem. The role of the aggregate navigator is to intercept the end user's SQL and transform it so as to use the best available aggregate. The end user's query tools, report writers and applications all speak nothing but base-level SQL. Even the applications developer does not need to worry about the identity of aggregate tables.
All of the end user's clients make requests to the aggregate navigator. The end-user apps. only speak base-level SQL. The function of the aggregate navigator is to transform the user's base-level SQL into aggregate-aware SQL. The aggregate navigator then submits this SQL to the DBMS. The real DBMS returns the result set to the aggregate navigator which subsequently passes it to the end user.
The phenomenon of query tool vendors offering aggregate navigation is the wrong architecture.

XIV. The Back End

Browse Queries
Browse queries usually touch a single attribute in a dimension table through an index.

Multitable Join QueriesMultitable join queries involve the fact table and dimension tables. The joining of multiple fact tabes is not recommended. In a multitable join query the dimensional tables are evaluated and a list of primary key values from each of the tables is generated. The separate dimensions combine to generate a set of composite keys in sorted order that is presented to the fact table's composite key index. The facts returned from the fact table are as a result of the matching between these two ordered sets of composite keys in the fastest possible time.

Loading Phase
Make good use of mirroring to ensure that the DBMS is up for almost 24-7.

Production Data Extraction
This should in general involve the following 11 steps:

  1. Primary Extraction
  2. Identifying the changed records
  3. Generalizing keys for changing dimensions
  4. Transforming into load record images
  5. Migration from the legacy system to DDW system
  6. Sorting and building aggregates
  7. Generalizing keys for aggregates
  8. Loading
  9. Processing exceptions
  10. Quality assurance
  11. Publishing

XIV. The Front End

Client-Side Query Tools
A client side query tool performs the following 2 functions:

  1. It sends SQL requests to the DBMS
  2. It receives answers back from the DBMS.

The query tool architecture should include the following:

  1. The use of SQL to make requests to the DBMS.
  2. Use of a database independent connectivity interface, e.g.. ODBC.
  3. Use of a network based aggregate navigator.
  4. Reliance on a bottleneck style of communication with the DBMS, with relatively small SQL messages being sent to the DBMS and relatively small answer sets being received from the DBMS.

Single table browse queries account for around 80% of end user queries while the remaining 20% are multitable joins. The multitable joins account for a majority of the resources used by the user. The use of a larger number of simpler multitable joins rather than a smaller number of more complex multitable joins is recommended.

There are 3 kinds of activities users do when data is returned to the screen.:

  1. Comparison
  2. Presentation
  3. Asking Why

The data warehouse team should have a software shop with responsibility to create the pre-canned parameterized reports. This shop should:

  1. Develop an overall plan for building applications from a set of basic templates which can later be built upon.
  2. Build the template applications with the aim of having the applications work immediately on the user's desktop and can be changed with the minimum number of button clicks.
  3. Build the template apps. with dimension browsing capability and customized Help screen capability. A glossary should be provided in every template.
  4. Provide a bug reporting hotline which is highly responsive to help requests from users.

XVI. Front End Applications

There are 4 possible techniques for producing comparison columns in business reports:

  1. Self join
  2. Correlated subquery
  3. Case statement
  4. Separate query streams (Multipass queries)

Of these, 4. Multipass queries seems to be the most simplest to generate and most convenient. For more info. on multipass queries, see the article by Kimball on Features for Query Tools in DBMS magazine, Feb. 1997.

Technique Advantage/Disadvantage.
Self-join Disadvantage: Complicated SQL; not produced by commercial query tool; produces large no. of tables which trouble DBMS optimizer
Correlated subquery Same as above
Case statement Advantage: Optimizer does not get confused by the existence of false additional tabes, and SQL is more compact than previous cases.

Disadvantage: Division by 0 is not well handled. The SQL is not compact/intentional and no query tool will generate such SQL. The CASE statement requires the main SQL query constraints be made loose which hampers performance on complex queries and makes it difficult to use aggregates effectively.
Multipass Queries Advantage: Small, compact SQL which allows tight control on DBMS performance. Division by zero problem disappears. Separate queries are amenable to use of full aggregate navigation. Dynamic edits to the overall report can be handled by fetching only the changed column, rather than the entire report. Post-processing of the answer set.
Disadvantage: Query tool must manage the separate query streams.

Executive, Analyst and Developer Interfaces

The executive interfaceis concerned with alerting the executive to look at something unusual in a pre-canned report. It does not need to allow the construction or modification of an individual reporting screen.
The analyst interface is concerned withe running and modifying pre-canned, parameterized reports. The analyst user should be able to change among pre-canned constraints on dimensions etc..
The developer interface mainly concerned with building pre-canned, parameterized reports.

Query Tool Features

These are desirable features for a query tool:

  1. Browsing
    A browser allows the user to display distinct lists of values for each dimension attribute and then to set constraints progressively on one or more of the attributes.
  2. Outer Join
    A query tool fetching multiple answer sets and combining them must always combine these answer sets using outer joins. The outer joins are almost all 2-way (symmetric) outer joins.
  3. Interface to Help
  4. Pre-canned Comparisons
  5. Distributed Calculations: Nonadditive Measures
  6. Drilling Down- produce more row headers.
  7. Advanced exception handling
  8. Context-Specific Drill-Down Targets
    Drill down to a different report depending on the context, e.g. if a particular exceptional no. is selected for drill-down, an entirely different report is generated.
  9. Interaction with Aggregates
    The need for an aggregate navigator has been previously mentioned.
  10. Drilling Across
    Drilling Across involves combining 2 or more fact tables that share dimensions into a single report. This will produce 2 or more metric columns in the final report, depending upon the no. of fact tables used. The drill-across model works as long as the row headers chosen for a particular report have exactly the same meaning in both fact tables and their resp. dimensions.
  11. Extending SQL's Aggregation Operations
    This would be extended to include measures like PERIODAVG, PERIODCOUNT.
  12. Break rows
    Break rows amount to sectional sub-total/semi-additive rows in the report.
  13. Behavioral Constraints.
    This involves building a special table containing keys we wish to constrain on and constraining on them in the SQL that gets generated.
  14. Pivoting.
    Pivoting allows the row and column headers to be scrambled in arbitrary combinations e.g. rows become columns and vice versa.
  15. Handing off the answer set.
    The query tool hands off the answer set to adjacent tools.
  16. Printing
  17. Batch operation
    This is useful for
    1. Time-scheduled report generation
    2. Event-triggered report generation
    3. Stress Testing

Administrative Responsibilities

  1. Tool and Data Training
  2. Pre-joins and Query Tool Metadata
  3. Information Dictionary
  4. Browsing support
  5. Public and Private Constraint Groups
  6. Public and Private Behavioral Groups
  7. Aggregate Navigation Metatables and Statistics
  8. Extract Tool Metatables

XVII. The Future

Points for the future:

  1. Software is the key
Hosted by www.Geocities.ws

1