Back to Tylogix Home Page


The Data Warehouse: Dawn of a New Process

By Thibault Dambrine

Data warehousing is a hot topic in the news today. Walmart currently operates a 7TB system, the largest in the world. [1 terabyte = 1,000 gigabytes]. The company claims it "enables it to manage an operation that would be otherwise beyond human capacity". Bank of America’s 1.2 TB data warehouse is used daily by 1,500 users who make 2,800 queries per day. Marriott International uses data mining to identify potential buyers for its time-share properties. Using data mining techniques, Australia’s Health Insurance Commission found an unexpected surprise: Buried in the mass of medical billing data, a shady billing practice was uncovered and reversed, saving Australian tax payers millions of dollars.

But what is a data warehouse and what does it do? In this article, we will cover the basic elements.

W. H. Inmon, widely regarded as the inventor of the Data Warehouse concept, has this definition: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process."

Traditionally, the database model of the computer system has been the base building block for operational applications for the organization. The aim of such an operational system database is to give good response time, to be reliable, to provide good security and good recoverability in case of accident. In the Data Warehouse model, the database has a different aim. It can also be called an "informational system". The data warehouse goal is to handle data in a centralized database for smart and deductive information retrieval purpose. With the data warehouse, the business user is aiming to gain a competitive advantage with information technology. The data warehouse thus is a decision support system.

John Zachman, likens the building of a large and complex Information System to building a large and complex building. Owners and Investors are interested in a building that meets residential and lifestyle requirements. The architects are interested in designing a building that incorporates the investors requirements as effectively, efficiently and aesthetically as possible. In the same way a building is realized by combining many interests and skills, the data warehouse requires different interests to combine skills. To summarise these interests, John Zachman has created the following chart. (Figure 1.) It defines two main categories of users within a business environment and the implementor’s matching concerns to meet them.

 

What

How

Where

Who

When

Why

Owner

DSS for Customer Revenue Analysis

Analyse Historical Data

Marketing Dept. Office

Marketing Dept.

Monthly, Quarterly and Annually

Business Objectives, Making Promotional Decisions

Business User

Subject Areas

Dimensions

Granularity

Drill down

Report

Query

Desktop

Dept. Conference Room

Managers

Analysts

Currency of Information

Operational Objectives, Decision Support

IT Implementor

Subject Areas, Warehouse Meta Model

Data Warehouse and Data Mart functions

LAN

Corporate network

DA/DBA, Domain Analyst

Application Programmer

Data Warehouse Load & Refresh Cycle

Technical Objectives

Figure 1.

The purpose of the data warehouse is to present the data from a user’s perspective rather than (more traditionally done) from the application or flow of data perspective.

Typically, the data warehouse will handle historical data that would be otherwise removed or "retired" from the active data processing system. By the very nature of time passing, historical data will grow in volume and the data warehouse must offer tools to effectively summarize or aggregate. Without this, a data warehouse could end up storing more than it economically should. Another reason for summarizing is that it makes the "big picture" easier to understand. At the same time, when users inquire in a data warehouse they will sometimes need to "drill down" to the details that make up a given summary. Somewhere between the two poles of storing every detail and summarizing everything to a higher level is the proper balance. The aim is to make the data warehouse an effective decision support information system with the mix of details and summaries appropriate to each user, department and company.

To support this aim, the data warehouse should support the following functionality:

To best describe the data warehousing process, let us go through the path of the data to be stored in the data warehouse. The data warehouse is a repository for information from various sources. It should, before anything else be able to communicate with the systems that will provide it with data. Once the data is made available, the first concern with this information is quality. The basic meaning of quality in this case is that data should be accurate, current, and contain no duplicates. Some databases have built-in intelligence in elements such as customer numbers or item numbers. Older databases often "recycle" older, unused fields with unrelated information. For best results, these anomalies should be unraveled and reorganized logically before taking in the data for permanent storage. Analyzing these issues and creating a "clean" database is a cornerstone necessity for the data warehouse. Without good data quality, the data warehouse practically loses its reason to be.

Once the data is "cleaned", it will likely require some re-structuring. The re-structuring and re-grouping of the data will facilitate the access for the analysts and the users of the data warehouse. This process will include the following activities:

Another component of the data warehouse is created before permanent storage, and is called "metadata", which means "data about data". Here is the type of information metadata consists of:

Metadata not only deals with the source-to-warehouse-rules segment. It also records warehouse-to-user-rules:

The purpose of the data warehouse is to create a pool of data that is organized purposely for the user. The data model may well be completely different from the original sources, and that is fine. The architect of the data warehouse has a mandate to make the data easy to retrieve in the format that is necessary for the decision makers.

In the process of building a data warehouse, the architect may decide to take a more segmented approach, to serve specific departments as opposed to a global, company-wide, all encompassing data warehouse. Enter the "data mart". Data marts can be a building block towards the creation of the data warehouse, but they can also be created from data drawn within an existing data warehouse. The primary difference between the data mart and the data warehouse is the focus towards a specific group of users, such as a specific department within a company. This makes the scope and the complexity of this "mini data warehouse" more manageable and more focused.

Here to wrap up, is Gill & Rao’s summary (Figure 2.) of the data warehouse building activities:

Refinement

Engineering

Data Warehouse

Retrieval

Standardise

Integrate & Partition

Modeling

 

Filter & Match

Summarise & aggregate

Summarise

 

Cleanup & Scrub

Pre-Calculate & Derive

Aggregate

 

Stamp Time Dimension & Data Source

Translate & Format

Reconcile & Validate

 

Verify Data Quality

Transform & Re-Map

Build Architected queries

 

Metadata Extraction & Creation

Metadata Creation

Create Glossary

Meta-Data Browse & Navigate

Figure 2.

The reason-to-be of the data warehouse is the availability of the data it contains for decision making. This is what provides the payoff and the value to the business of the data warehousing exercise. W.H. Inmon states, "The Holy Grail for the end user is insight as to why business has been conducted the way it has been in the past and how the business can be conducted more effectively in the future."

Once the data warehouse is in place, the following list of retrieval tools become available to the user:

Informational Processing: Includes techniques such as data and basic statistical analysis, query and reporting. The results may be presented as reports or charts.

Analytical Processing: OLAP, (On-line Analytical Processing), has the mission of extracting data for decision making support process. It is a relatively new technology and in the data warehouse realm warrants strong attention. Here, according to Gill & Rao, are a few points that uniquely define OLAP:

In fewer words, OLAP multiplies the power of the on-line database power-user.

Data Mining: Serendipitous process attempting to uncover previously unknown and unsuspected hidden patterns or relationships in the data. Data Mining, to achieve results, scans data at the transaction detail level. The results can come in reports in various states of summarization. Emerging visualization tools make trends "easy to see". These tools are used mostly, but not exclusively to understand customer behavior.

Pre-Processed information: Not all of the information stored in the data warehouse is as sexy as the one extracted by Data Mining tools. A good part of the data warehouse typically can be used for purposes such as routine weekly and monthly marketing summaries and reports.

All together, these tools offer new, creative and unforeseen weapons for doing business. They can be applied at every end of the spectrum, from cost cutting to targeted marketing. The data warehousing process can be seen as knowledge value for the preemptive decision maker. To know before the others, to decide and decide right, before anybody else does. The pay-back of knowledge here, is money.

In Conclusion:

Data warehousing is already a staple in larger organizations. The Walmarts of the world already study consumer behavior like some scientists study ants. The question is, what’s in it for the average sized I/S Department? "Data Warehouse", by nature, evokes the word "BIG". Unearthing the "gold" from your own data does not mean you have to have a huge mass of it. It is not reserved exclusively to the giant corporations. Soon, most companies will not go without it, much like the PCs, now on every desk. To illustrate this point here is one last example:

16 out of 29 NBA teams are currently using or evaluating a data mining tool called "Advanced Scout". It is DOS-based, and runs on a lap-top. This product, specifically designed for the NBA, is designed to help coaches better leverage the statistical information gathered during each game. By analyzing information such as the number of blocked shots or assists or the success of certain plays and lineups against a particular team, the coach can get a competitive edge. "This lets us look at the big picture" says Tom Sterner, assistant coach of the Orlando Magic who has used the "Advanced Scout" for a year. "We can see a trend we didn’t even know existed, isolate the problem and solve it or use the information to our benefit".

Sources:

The Official Guide to Data Warehousing , (1996 Que Corporation)

W.H. Inmon, Building a Data Warehouse (1992, QED Technical Publishing Group)

J. Zachman, A Framework for Information System Architecture – IBM Systems Journal 1987

Ziff Davis Publishing, PC Week, article by Aileen Crowley, June 3, 1996

The Wall Street Journal, article by Gomes Lee, November 18 1996

Back to Tylogix Home Page