Notes from the Data Warehouse Toolkit
by Ralph Kimball
This is a set of notes I made while reading the book "The Data Warehouse
Toolkit" by Ralph Kimball.
Goals of the Data Warehouse
- The data warehouse provides access to corporate or organizational
data.
The data must be easily accessible, and reports can be run and rerun
easily and results returned quickly.
- The data in a data warehouse is consistent.
Two different users who
make the same query should get identical results even if they make requests at
different times. Also if the data load is incomplete for the request the user
is informed about this.
- The data in the data warehouse can be sliced and diced along every
dimension i.e. by every possible measure in the business.
- The data warehouse is not just data, but also a set of tools to query,
analyze and present information.
The DW encompasses both back-end database
infrastructure, as well as front-end reporting tools.
- The data warehouse is where we publish used data.
Data is
assembled from a variety of sources, cleaned up, quality assured and released
only when ready for use.
- The quality of data in the data warehouse is a driver of business
re-engineering.
I. Two Different Worlds
OLTP is profoundly different from dimensional data warehousing DDW.
- Consistency
OLTP consistency is microscopic. In OLTP
consistency, ALL the transactions presented to the system are accounted
for.
In a data warehouse, consistency is global. We care that the load of
new data is a full and consistent load of data. In the case of OLTP we have a
microscopic perspective. In the case of DDW, we have a quality assurance
perspective.
Does the data load pass the consistency test? This is what we
ask in DW
- An OLTP system processes thousands or millions of transactions per day. A
DW system processes a single transaction each day called the production data
load.
- Users and managers of an OLTP system
i. usually deal w/ one account at
a time
ii. are making use of the system continuously throughout the day to
make updates, queries, lookups of SINGLE records from the data. (That is why
it is called on-line).
iii. are concerned about performance and
reliability for every transaction.
iv. make transactions for small (almost
atomic) sets of data.
v. OLTP users perform the same query multiple
times.
Users and managers of a DW system
i.
usually deal with more than one account at a time, but need information on a
series of account across the organization.
ii. make use of the system
periodically, e.g. morning reports from the previous night's load.
iii. are
more concerned with the performance of single-table queries, which are
expected to be instantaneous. Join queries can be allowed to run for seconds
or minutes.
iv. usually make queries that summarize a large set of data
for reporting.
- Time
OLTP databases are twinkling databases, i.e. they are in
flux and their state is constantly changing, . Thus there is temporal
inconsistency wrt state. OLTP databases lack explicit support for
historical data, e.g. a financial securities database system which contains a
snapshot of trading positions as they change throughout the day. This is also
temporal inconsistency.
DW systems contain time-series data. The
data is generally not in flux, except at load time, and even then inserts,
rather than updates are generally being done. Hence individual records, once
added, are static rather than dynamic. By storing static snapshots of the data
at periodic time intervals, we are able to make historical queries of the
data. The snapshot is called the production data extract.
- Data model
OLTP systems use an entity-relationship data model. This
model consists of a set of tables of normalized data which are connected to
each other in many possible ways. Queries to an entity relationship model
usually involve joins between pairs of tables. Entity relationship data models
are inappropriate for data warehouses since they cannot be understood by users
or navigated adequately by DW software - a query involving joins between many
large tables would perform terribly.
Kimball outlines a problem in
data warehouse projects with overly complex schemas: $3 million hardware &
software doing 50 queries per day, an IT dept. that has to write queries in
hand-coded SQL,a marketing dept. that is unhappy because they can't use the
system directly and can't ask in which areas of the business they are
profitable, etc..
DW systems use a dimensional data model/star join
schema. This model is asymmetrical, with a central fact table
surrounded by multiple lookup/dimension tables.
Fact tables
Each item in the fact table represents the grain
of the fact table.
The fact table is where the numerical
measurements/metrics of the business are stored. The most useful facts are
numeric, continuously valued and additive. The data should be continuously
valued since it will in general be impossible to predict the set of values it
can take on.
Dimension tables
Dimension tables are where the description of the
data is stored. The best attributes are textual, discrete and are the source of
the constraints and row headers in the result set. It is often unclear whether a
numeric data field extracted from a production data source is a fact or a a
dimensional attribute. We can tell the difference by asking the question "is the
data field variable and continuously valued, or is it a discrete value of
something that stays fairly constant.
Standard Template Query
select d.rowheader,sum(f.metric) :
select list
from fact f,dimension d,time t : from
clause
where f.dimensionkey=d.dimensionkey : join
constraint
and f.timekey=t.timekey : join constraint
and
t.date='11/11/97' : application constraint
group by p.rowheader :
group by clause
order by p.rowheader : order by clause
The
select list defines the columns that will appear in the result set. There is a
row header and an aggregated fact (sum) in the select list.
The from clause
is a list of the tables used in the query.
The join constraints capture the
relationship between the fact and dimension tables. They link the tables by
means of primary keys and foreign keys. The primary key forms the index of the
dimension table. The foreign key is used to lookup into the fact table. Every
foreign key from the fact table must have its counterpart in the dimension table
in order to preserve referential integrity.
The fact table has a
composite primary key, meaning that it is a combination of multiple foreign
keys. Every fact table has a composite key, and conversely, every table that has
a composite key is a fact table. The joins in a dimensional data warehouse
capture a fundamental relationship between entities in the business
model.
The application constraint serves to limit the result to the subset of
the facts desired from the fact table.
The group by clause shows how to
summarize the results in the result set, i.e. by row headers.
The order by
clause expresses the sort order of the result set for display to the
user.
Attributes
The database is only as good as the dimension
tables as the attributes in the dimension tables provide the constraints that
limit the result set to a useful and meaningful subset of the universe of
facts.
OLAP - On-line Analytical Processing
II. The Grocery Store
Steps in the Design Process
- Choose a business process to model.
- Choose the grain of the business process.
- Choose the dimensions that will apply to each fact table record.
- Choose the metrics to be stored in the fact table.
Identifying the Processes to Model
- The first step in the design process is to decide what business processes
to model, by combining an understanding of the business with an understanding
of what data is available.
- The second step in the design is to decide on the grain of the fact table
in each business process.
- A data warehouse almost always demands data expressed at the lowest
possible rain of each dimension, because queries need to cut through the
database in very precise ways.
- A careful grain statement determines the primary dimensionality of the
fact table. If it is recognized that an additional desired dimension violates
the grain by causing additional records to be generates, e.g. the fact table
records items per day, but we wish to add an hour attribute to the Time
dimension table.
Normalization
- The fact table in a dimensional schema is naturally highly normalized
- The dimension tables must not be normalized but should remain as flat
tables. Normalized dimension tables destroy the ability to browse.
Time Dimension
- Most data warehouses need an explicit time dimension even though the
primary time key may be an SQL date-valued object. The explicit time dimension
is needed to describe fiscal periods, seasons, holidays, weekends and other
calendar calculations that are difficult to get from the SQL date machinery.
Product Dimension
- The product dimension is one of the two or three primary dimensions in
nearly every data warehouse.
- Drilling down in a data warehouse is nothing more than adding new headers
from the dimension tables. Drilling up is subtracting row headers.
Grocery Store/Page Turn facts
Just as customer counts are semi-additive across the sales fact table due to
the fact that we summarized up from the individual transactions to the daily
item movement totals (e.g. if we have 25 sales of cornflakes and 30 of rice
krispies we cannot tell how many customers bought either cornflakes or rice
krispies) so are visitor counts. This is why the Visit dimension is needed to
keep track of individual visitors.
- Visitor/Customer counts are semi-additive when they occur in time snapshot
fact tables because they double count activity across products during the
visit/customer event.
Degenerate dimensions are dimension keys with no corresponding dimension
table.
IV. Shipments
Ship-To Dimension
- Any dimension whose records define a point in space automatically is
capable of supporting multiple independent geographic hierarchies.
- It is natural and common, for a dimension to simultaneously support
multiple independent hierarchies. Drilling up and down within each of these
hierarchies must be supported in a data warehouse.
- Two loosely correlated attributes that have a many-to-many relationship
can be modeled either as a single compound dimension, or they can be modeled
as separate dimensions, at the designer's discretion.
- If attributes are highly correlated (i.e. one-to-many) one should use a
single dimension, since the two attributes can then be browsed against each
other more efficiently than if they intersect only in the fact tabs, and the
total no. of keys in the single dimension will only be a little larger than
the cardinality of the more numerous attribute. If the attributes are very
loosely correlated or have zero correlation (many-to-many), then it is best to
model them as separate dimensions, because otherwise the no. of records in a
combined dimension will probably be the product of the no.s of the separate
attributes.
VI. Big Dimensions
Drill-down
Drill-down simply means "show more information"/add a
new row header to the report.
Resist the tendency to snowflake
Reasons NOT to snowflake:
- The space savings resulting from snowflaking are minute compared to the
size of the overall fact table and the database as a whole in most cases.
- If you snowflake your dimensions, be prepared to live with poor browsing
performance.
MiniDimensions
- The best approach for tracking changes in really huge dimensions is to
break off one or more minidimensions from the dimension table, each consisting
of small clumps of attributes that have been administered to have a limited
no. of values.
- The minidimension key can exist as a foreign key in both the fact table
and the dimension table from which it was broken off.
Slowly Changing Dimensions
How do we track changes over time in dimension tables, e.g. product names
change, addresses change etc.? This emphasizes the fact that primary dimensions
are not completely time independent.
We have the following 3 fundamental choices to make when we encounter a
slowly changing dimension:
- Overwrite old values with new ones and hence lose the ability to provide a
history over time.
This is very rarely of value, except when the data was
incorrect in the first place.
- Create a new dimension record at the time of the change with the new
attribute values, thereby segmenting history very accurately between the old
description and the new description.
This requires that a new dimension key
be generated for this new attribute value.
- The use of the Type 2 slowly changing dimension requires that the
dimension key be generalized. It may be sufficient to take the underlying
production key and add two or three version digits to the end of the key to
simplify the key generation process.
- The Type 2 slowly changing dimension automatically partitions history
and an application must not be required to place any time constraints on
effective dates in the dimension. The reason for this is that we are always
looking up the attribute from a row in the fact table which is already
dated.
- Since the Type 2 slowly changing dimension partitions history, we will
not be able to use the new value of a changed attribute on old history or
vice versa. E.g. if we constrain on Label=Active from Oct 15, 1997 we will
to be able to see that label before this date. If we wish to see it "if it
had been this way all the time, i.e. Active" we have to resort to Type 3.
- Creating new "current" fields within the original dimension to record the
new attribute values, while keeping the original attribute values as well,
thereby being able to describe history both forward and backward from the
change either in terms of the original attribute values or in terms of the
current attribute values.
The Type 3 slowly changing dimensions is equipped
to handle only the original and current values of the changed attribute.
Intermediate values are lost. Thus we cannot partition history as in the Type
2 model.
Slowly Changing MiniDimensions
The attributes isolated in a minidimension are very often those attributes
that change over time and whose distinct values need to be tracked. In those
cases where all possible combinations of the attribute values have already been
created, updating a description in the deriving minidimension (customer) means
placing a different key in the fact table record from a certain time forward and
overwriting the derived minidimension key (demographics key) in the deriving
table record (customer record) whenever it changes.
VII. Financial Services
- In data warehouses where a dimension must describe a large no. of
heterogeneous items, the recommended technique is to create a core fact table
and a core dimension table in order to allow queries to cross the disparate
types, and to create a custom fact table and a custom dimension table for
querying each individual type in depth.
- All the keys are duplicated exactly once in the custom dimension tables.
- The primary core facts should be duplicated in the custom fact tables.
This virtually eliminates the need to process two fact tables in a single
query in a heterogeneous product schema
Factless fact tables are used to track events for which there are no measured
facts. There are two major types: event tracking tables and coverage tables.
Event Tracking Tables
These are used for recording events which occur as the culmination of a
number of dimensional entities, e.g. daily attendance at a college. There are no
facts per se. The "facts" only occur when each of the dimensional entities
occur/exist at a point in time. The purpose of the fact table is to register the
event of the occurrence of the entities at that point in time.
- Events are often modeled by a fact table containing a number of keys, each
representing a participating dimension in the event. Such event tables often
have no obvious numerical facts associated with them, and hence are called
factless fact tables.
Coverage Tables
- Coverage tables are often tables of events that didn't happen. Coverage
tables are usually factless in the same way as event tracking tables.
XII. Building a Dimensional Data Warehouse
The 9 decision points to be decided when designing a dimensional data
warehouse are:
- The processes, and hence the identity of the fact tables.
- The grain of each fact table.
- The dimensions of each fact table.
- The facts, incl. pre-calculated facts.
- The dimension attributes with complete descriptions and proper
terminology.
- How to track slowly changing dimensions.
- The aggregations, heterogeneous dimensions, minidimensions, query modes
and other physical storage decisions.
- The historical duration of the database.
- The urgency with which data is extracted and loaded into the data
warehouse.
These decisions should be made in the order given.
Interviewing the End User
Interviewing the end user is the most important first step in
designing a data warehouse. The interview achieves the following:
- It gives the designers an idea as to the needs and expectations of the
users
- It serves a PR purpose: i.e. it allows the designer to raise the awareness
level of the forthcoming data warehouse among end users, and to adjust and
correct some of the user's expectations. Hence we are managing user's
expectations.
The interviews should be as thorough and encompass as many of the business
units of the company as much as possible.
- The use of pre-stored aggregates (summaries) is the single most effective
tool the data warehouse designer has to control performance.
The dba
should not be wasting time trying to get the DBMS optimizer to work
"properly", rather the dba should be evaluating the need for building more
aggregates based upon the behavior of the users. Another benefit of aggregates
is that they can be guaranteed to be correct.
An aggregate is a fact table record representing a summarization of
base-level fact table records. An aggregate fact table record is always
associated with one or more aggregate dimension table records.
Question:
where do we put the new fact table records and dimension table records and how
are they to be administered? There are 2 main approaches:
- New Fact Tables
- New Level Fields
New Fact Tables
Aggregate fact tables are derivative fact tables since they are derived from
the main base-level fact table. Each derivative fact table must be joined to one
nor more derivative dimension tables. This derivative dimension table is a small
subset of the larger dimension table where the subset only consists of those
attributes relevant to the derived dimension. Entirely new keys my be created
for this derived dimension. These keys did not exist in the base-level dimension
table.
- The creation of aggregate fact table records always requires the creation
of artificial keys in each of the dimensions being aggregated.
- Any dimension attribute that survives in the aggregate dimension table can
be used more efficiently in the aggregate schema than in the base-level schema
because it is guaranteed to make sense at the aggregated level.
New Level Fields
Aggregates can also be stored by using Level fields in the affected dimension
tables, thus leaving the aggregate fact records in the original fact table. The
only difference in this approach is that all the new aggregate records are kept
in the original dimension and fact tables. In the Level field model the
dimension table is used with all its fields, with the addition of a Level field.
The Level field describes the aggregation level of every record in the dimension
table. The new aggregate records need to have keys that are compatible with and
do not conflict with the original base-level keys in the dimension table.
- If aggregates are represented in the original dimension and fact tables by
means of the Level field construct, then every query ever presented to that
schema must constrain the Level field to a single value of double counting
will occur. Hence the constraint "and LEVEL=Category" must be present in the
query.
- Each type (grain) of aggregate should occupy its own fact table, and
should be supported by the proper set of dimension tables containing only
those dimensional attributes that are defined for that grain of aggregate.
- The single most effective way to control an aggregation explosion, but
still benefit from the value of aggregates is to make sure that each aggregate
summarizes at least 10 and preferably 20 or more lower-level items.
Indexing Issues
- A loose constraint, or no constraint in a base-level fact table is
actually a tight constraint in an aggregate fact table.
- Only one sort order on the master composite index on the fact table needs
to be built, since other sort orders based on missing dimensional constraints
should be handled by separate aggregation tables instead.
Application Issues
A problem with building aggregates for dimensional tables is that
applications had to know of the presence of aggregates and specifically request
for them in the SQL generated. viz: select ... from base_sales_fact... vs select
... from department_sales fact. This is called an applications
discontinuity.
Aggregate Navigation
A new layer of software, the aggregate navigatorhas been inserted
between the application and the DBMS in order to address the applications
discontinuity problem. The role of the aggregate navigator is to intercept the
end user's SQL and transform it so as to use the best available aggregate. The
end user's query tools, report writers and applications all speak nothing but
base-level SQL. Even the applications developer does not need to worry about the
identity of aggregate tables.
All of the end user's clients make requests to
the aggregate navigator. The end-user apps. only speak base-level SQL. The
function of the aggregate navigator is to transform the user's base-level SQL
into aggregate-aware SQL. The aggregate navigator then submits this SQL to the
DBMS. The real DBMS returns the result set to the aggregate navigator which
subsequently passes it to the end user.
The phenomenon of query tool vendors
offering aggregate navigation is the wrong architecture.
- An aggregate navigator is an essential component of a data warehouse
because it insulates end user applications from the changing portfolio of
aggregations, and allows the DBA to dynamically adjust the aggregations
without having to roll over the applications base.
XIV. The Back End
Browse Queries
Browse queries usually touch a single attribute in a
dimension table through an index.
Multitable Join QueriesMultitable join queries involve the fact table
and dimension tables. The joining of multiple fact tabes is not recommended. In
a multitable join query the dimensional tables are evaluated and a list of
primary key values from each of the tables is generated. The separate dimensions
combine to generate a set of composite keys in sorted order that is presented to
the fact table's composite key index. The facts returned from the fact table are
as a result of the matching between these two ordered sets of composite keys in
the fastest possible time.
Loading Phase
Make good use of mirroring to ensure that the DBMS is
up for almost 24-7.
Production Data Extraction
This should in general involve the
following 11 steps:
- Primary Extraction
- Identifying the changed records
- Generalizing keys for changing dimensions
- Transforming into load record images
- Migration from the legacy system to DDW system
- Sorting and building aggregates
- Generalizing keys for aggregates
- Loading
- Processing exceptions
- Quality assurance
- Publishing
XIV. The Front End
Client-Side Query Tools
A client side query tool performs the
following 2 functions:
- It sends SQL requests to the DBMS
- It receives answers back from the DBMS.
The query tool architecture should include the following:
- The use of SQL to make requests to the DBMS.
- Use of a database independent connectivity interface, e.g.. ODBC.
- Use of a network based aggregate navigator.
- Reliance on a bottleneck style of communication with the DBMS, with
relatively small SQL messages being sent to the DBMS and relatively small
answer sets being received from the DBMS.
Single table browse queries account for around 80% of end user queries while
the remaining 20% are multitable joins. The multitable joins account for a
majority of the resources used by the user. The use of a larger number of
simpler multitable joins rather than a smaller number of more complex multitable
joins is recommended.
There are 3 kinds of activities users do when data is returned to the
screen.:
- Comparison
- Presentation
- Asking Why
- In the trade-off between power and ease of use, ease of use must always
come first. End users will not use a tool that seems too complicated. If the
tool is perceived as complicated, either IS will have to use the tool on the
end user's behalf, or the tool will not be used.
- Simplicity is mostly a function of whether the user can get to the desired
result in "one button click", where that button click is obvious from the user
interface. If the user believes that the query tool is simple by this
definition, then the query tool will probably be spontaneous and repeatedly
used.
- For the rank and file user, the data warehouse should consist of a library
of template applications that run immediately on the user's desktop. These
applications should have a limited set of user-selectable alternatives for
setting new constraints and for picking new measures. These template
applications are pre-canned, parameterized reports.
The data warehouse team should have a software shop with responsibility to
create the pre-canned parameterized reports. This shop should:
- Develop an overall plan for building applications from a set of basic
templates which can later be built upon.
- Build the template applications with the aim of having the applications
work immediately on the user's desktop and can be changed with the minimum
number of button clicks.
- Build the template apps. with dimension browsing capability and customized
Help screen capability. A glossary should be provided in every template.
- Provide a bug reporting hotline which is highly responsive to help
requests from users.
- It is essential for a query tools operating against a dimensional data
warehouse to be able to perform comparisons flexibly and immediately. A single
row of an answer set should simultaneously be able to show comparisons over
multiple time periods of different grains and comparisons over other
dimensions, and compound comparisons across two or more dimensions. A query
tool should have these comparison alternatives available in a pull down menu
or its equivalent. The end user should never see the SQL that implements the
comparisons.
- Presentation is a separate activity from query and comparing, and the data
warehouse owner should choose tools that allow answer sets to be transferred
easily by the user into multiple presentation environments for different
purposes.
- There is a fundamental mismatch between the power and depth of a
individual SQL answer set and the power and depth of a business report. In
most cases, a business report must be assembled from several answer sets. Any
query tool that deals with answers to business questions that can be described
as business reports must deal with the issue of multiple answer sets.
Conversely a raw answer set is not a sufficient deliverable to the user.
There are 4 possible techniques for producing comparison columns in business
reports:
- Self join
- Correlated subquery
- Case statement
- Separate query streams (Multipass queries)
Of these, 4. Multipass queries seems to be the most simplest to generate and
most convenient. For more info. on multipass queries, see the article by Kimball
on Features for Query
Tools in DBMS magazine, Feb. 1997.
| Technique |
Advantage/Disadvantage. |
| Self-join |
Disadvantage: Complicated SQL; not produced by commercial
query tool; produces large no. of tables which trouble DBMS
optimizer |
| Correlated subquery |
Same as above |
| Case statement |
Advantage: Optimizer does not get confused by the existence
of false additional tabes, and SQL is more compact than previous
cases.
Disadvantage: Division by 0 is not well handled. The SQL is
not compact/intentional and no query tool will generate such SQL. The CASE
statement requires the main SQL query constraints be made loose which
hampers performance on complex queries and makes it difficult to use
aggregates effectively. |
| Multipass Queries |
Advantage: Small, compact SQL which allows tight control on
DBMS performance. Division by zero problem disappears. Separate queries
are amenable to use of full aggregate navigation. Dynamic edits to the
overall report can be handled by fetching only the changed column, rather
than the entire report. Post-processing of the answer
set. Disadvantage: Query tool must manage the separate query
streams. |
- Comparisons are a key element of nearly every business report. Comparisons
should be implemented as separate query streams, managed by the query tool.
- DBMSs constructed on the fly to hold
the results of separate query streams are not recommended. The separate query
streams should be sent to the client query tool for all further
processing.
- A good user interface is based on recognizing and pointing, not
remembering and typing. A good interface minimizes the no. of button clicks
and context switches.
- A report-writing query tool should communicate the context of the report
at a glance, including esp. the identities of the dimension tables and fact
tables, the constraints in effect on the dimensions, and the current state of
the report. The report itself should be visible in a single button click.
- If users can see something, like a column of a report, they should be able
to edit it directly.
- Edits performed on report columns should leave as much of the report still
valid as possible. Requerying after an edit should at most fetch the data
needed to rebuild the edited column.
- The query must have an immediate STOP command, and the query tool must not
preempt the user's machine while it is waiting for data from the DBMS.
Executive, Analyst and Developer Interfaces
The executive interfaceis concerned with alerting the executive to
look at something unusual in a pre-canned report. It does not need to allow the
construction or modification of an individual reporting screen.
The
analyst interface is concerned withe running and modifying pre-canned,
parameterized reports. The analyst user should be able to change among
pre-canned constraints on dimensions etc..
The developer interface
mainly concerned with building pre-canned, parameterized reports.
Query Tool Features
These are desirable features for a query tool:
- Browsing
A browser allows the user to display distinct lists of values
for each dimension attribute and then to set constraints progressively on one
or more of the attributes.
- Outer Join
A query tool fetching multiple answer sets and combining
them must always combine these answer sets using outer joins. The outer joins
are almost all 2-way (symmetric) outer joins.
- Interface to Help
- Pre-canned Comparisons
- Distributed Calculations: Nonadditive Measures
- Nonadditive quantities derived from separate queries (e.g. ratios) can
be usefully presented at different grains and in break rows only if the
additive components are fetched from the DBMS and the computation is
performed in the query tool as the last step before presentation.
- Drilling Down- produce more row headers.
- Drilling down does not mean descending a predetermined hierarchy. It
means being able to quickly ask for additional row headers from any of the
dimensions joined to the fact table. It also means being able to remove row
headers and drill down in a different direction.
- Advanced exception handling
- Highlighting exceptional items
- Limiting the display to or throwing out exceptional items
- Marking as exception if in top./bottom N,N% of report.
Agents/triggers that occur if an event occurs in the DW
environment.
- Context-Specific Drill-Down Targets
Drill down to a different report
depending on the context, e.g. if a particular exceptional no. is selected for
drill-down, an entirely different report is generated.
- Interaction with Aggregates
The need for an aggregate navigator has
been previously mentioned.
- Drilling Across
Drilling Across involves combining 2 or more fact
tables that share dimensions into a single report. This will produce 2 or more
metric columns in the final report, depending upon the no. of fact tables
used. The drill-across model works as long as the row headers chosen for a
particular report have exactly the same meaning in both fact tables and their
resp. dimensions.
- Drilling across 2 or more fact tables is well defined as long as the row
headers chosen for the particular report have exactly the same meaning in
all the fact and dimension tables involved. Constraints may be applied to
nonshared dimensions in one or more fact tables, but it would be advisable
to warn the user of this condition either at runtime or on the report
itself.
- Extending SQL's Aggregation Operations
This would be extended to
include measures like PERIODAVG, PERIODCOUNT.
- Break rows
Break rows amount to sectional sub-total/semi-additive rows
in the report.
- Behavioral Constraints.
This involves building a special table
containing keys we wish to constrain on and constraining on them in the SQL
that gets generated.
- Pivoting.
Pivoting allows the row and column headers to be scrambled in
arbitrary combinations e.g. rows become columns and vice versa.
- Handing off the answer set.
The query tool hands off the answer set to
adjacent tools.
- Printing
- Batch operation
This is useful for
- Time-scheduled report generation
- Event-triggered report generation
- Stress Testing
Administrative Responsibilities
- Tool and Data Training
- Pre-joins and Query Tool Metadata
- Information Dictionary
- Browsing support
- Public and Private Constraint Groups
- Public and Private Behavioral Groups
- Aggregate Navigation Metatables and Statistics
- Extract Tool Metatables
XVII. The Future
Points for the future:
- Software is the key
- Optimization of star join query execution
- Indexing of multi-million-row dimension tables for browsing and
constraining
- Accessing and indexing of composite key of large fact tables.
- Completing SQL so it can process business questions
- Support of low-level data compression
- Support of parallel processing
- Dimensional database design tools.
- Dimensional database extract, admin and QA tools.
- End user query tools