Data Warehousing Concepts

Back to Tylogix Home Page

The Data Warehouse: Dawn of a New Process

By Thibault Dambrine

Data warehousing is a hot topic in the news today. Walmart currently operates a 7TB system, the largest in the world. [1 terabyte = 1,000 gigabytes]. The company claims it "enables it to manage an operation that would be otherwise beyond human capacity". Bank of America’s 1.2 TB data warehouse is used daily by 1,500 users who make 2,800 queries per day. Marriott International uses data mining to identify potential buyers for its time-share properties. Using data mining techniques, Australia’s Health Insurance Commission found an unexpected surprise: Buried in the mass of medical billing data, a shady billing practice was uncovered and reversed, saving Australian tax payers millions of dollars.

But what is a data warehouse and what does it do? In this article, we will cover the basic elements.

W. H. Inmon, widely regarded as the inventor of the Data Warehouse concept, has this definition: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process."

Traditionally, the database model of the computer system has been the base building block for operational applications for the organization. The aim of such an operational system database is to give good response time, to be reliable, to provide good security and good recoverability in case of accident. In the Data Warehouse model, the database has a different aim. It can also be called an "informational system". The data warehouse goal is to handle data in a centralized database for smart and deductive information retrieval purpose. With the data warehouse, the business user is aiming to gain a competitive advantage with information technology. The data warehouse thus is a decision support system.

John Zachman, likens the building of a large and complex Information System to building a large and complex building. Owners and Investors are interested in a building that meets residential and lifestyle requirements. The architects are interested in designing a building that incorporates the investors requirements as effectively, efficiently and aesthetically as possible. In the same way a building is realized by combining many interests and skills, the data warehouse requires different interests to combine skills. To summarise these interests, John Zachman has created the following chart. (Figure 1.) It defines two main categories of users within a business environment and the implementor’s matching concerns to meet them.

	What	How	Where	Who	When	Why
Owner	DSS for Customer Revenue Analysis	Analyse Historical Data	Marketing Dept. Office	Marketing Dept.	Monthly, Quarterly and Annually	Business Objectives, Making Promotional Decisions
Business User	Subject Areas Dimensions Granularity	Drill down Report Query	Desktop Dept. Conference Room	Managers Analysts	Currency of Information	Operational Objectives, Decision Support
IT Implementor	Subject Areas, Warehouse Meta Model	Data Warehouse and Data Mart functions	LAN Corporate network	DA/DBA, Domain Analyst Application Programmer	Data Warehouse Load & Refresh Cycle	Technical Objectives

Figure 1.

The purpose of the data warehouse is to present the data from a user’s perspective rather than (more traditionally done) from the application or flow of data perspective.

Typically, the data warehouse will handle historical data that would be otherwise removed or "retired" from the active data processing system. By the very nature of time passing, historical data will grow in volume and the data warehouse must offer tools to effectively summarize or aggregate. Without this, a data warehouse could end up storing more than it economically should. Another reason for summarizing is that it makes the "big picture" easier to understand. At the same time, when users inquire in a data warehouse they will sometimes need to "drill down" to the details that make up a given summary. Somewhere between the two poles of storing every detail and summarizing everything to a higher level is the proper balance. The aim is to make the data warehouse an effective decision support information system with the mix of details and summaries appropriate to each user, department and company.

To support this aim, the data warehouse should support the following functionality:

It should be scaleable, as size variation (mostly from big to bigger) will be a factor.

It should be able to connect with ease with systems of different nature in order to collect data.

It should provide tools to ensure data quality before it is stored permanently.

It should provide the user with decision support tools such as queries, to give different ways to view the same data. The ability to aggregate and summarize, to do statistical analysis, and so-called "knowledge discovery" also known as "data mining" tools are also items on the list.

It should be flexible, as the structures of data can change as time passes and the best use of the data can be refined over time with more efficient structures.

Logically, the data warehouse leads into the subject of Data Mining, and OLAP (On-line Analytical Processing), the star feature of the data warehousing process.

To best describe the data warehousing process, let us go through the path of the data to be stored in the data warehouse. The data warehouse is a repository for information from various sources. It should, before anything else be able to communicate with the systems that will provide it with data. Once the data is made available, the first concern with this information is quality. The basic meaning of quality in this case is that data should be accurate, current, and contain no duplicates. Some databases have built-in intelligence in elements such as customer numbers or item numbers. Older databases often "recycle" older, unused fields with unrelated information. For best results, these anomalies should be unraveled and reorganized logically before taking in the data for permanent storage. Analyzing these issues and creating a "clean" database is a cornerstone necessity for the data warehouse. Without good data quality, the data warehouse practically loses its reason to be.

Once the data is "cleaned", it will likely require some re-structuring. The re-structuring and re-grouping of the data will facilitate the access for the analysts and the users of the data warehouse. This process will include the following activities:

Integration and re-formatting of different but similar types of data from different sources into a common, consistent, new format. For example, call detail records from different switches have similar characteristics (time of call, duration, etc.) but seldom come in the same record layout.

Partitioning data into time-phased series to facilitate analysis.

Pre-calculation of commonly used summary information for the business users.

Another component of the data warehouse is created before permanent storage, and is called "metadata", which means "data about data". Here is the type of information metadata consists of:

Mapping rules and the maps between the data sources and the warehouse data.

Translation, transformation and "cleaning" rules applied to the original source data.

Information about the source data such as date & time stamps, system of origin, file name of origin, type of filtering or matching applied if any.

Pre-calculated and derived data and the rules applicable to derive the new elements.

Metadata not only deals with the source-to-warehouse-rules segment. It also records warehouse-to-user-rules:

The description of the data warehouse database.

The navigation paths and rules for browsing in the data warehouse.

The data dictionary.

The list of pre-designed and built-in queries available to the users.

The purpose of the data warehouse is to create a pool of data that is organized purposely for the user. The data model may well be completely different from the original sources, and that is fine. The architect of the data warehouse has a mandate to make the data easy to retrieve in the format that is necessary for the decision makers.

In the process of building a data warehouse, the architect may decide to take a more segmented approach, to serve specific departments as opposed to a global, company-wide, all encompassing data warehouse. Enter the "data mart". Data marts can be a building block towards the creation of the data warehouse, but they can also be created from data drawn within an existing data warehouse. The primary difference between the data mart and the data warehouse is the focus towards a specific group of users, such as a specific department within a company. This makes the scope and the complexity of this "mini data warehouse" more manageable and more focused.

Here to wrap up, is Gill & Rao’s summary (Figure 2.) of the data warehouse building activities:

Refinement	Engineering	Data Warehouse	Retrieval
Standardise	Integrate & Partition	Modeling
Filter & Match	Summarise & aggregate	Summarise
Cleanup & Scrub	Pre-Calculate & Derive	Aggregate
Stamp Time Dimension & Data Source	Translate & Format	Reconcile & Validate
Verify Data Quality	Transform & Re-Map	Build Architected queries
Metadata Extraction & Creation	Metadata Creation	Create Glossary	Meta-Data Browse & Navigate

Figure 2.

The reason-to-be of the data warehouse is the availability of the data it contains for decision making. This is what provides the payoff and the value to the business of the data warehousing exercise. W.H. Inmon states, "The Holy Grail for the end user is insight as to why business has been conducted the way it has been in the past and how the business can be conducted more effectively in the future."

Once the data warehouse is in place, the following list of retrieval tools become available to the user:

Informational Processing: Includes techniques such as data and basic statistical analysis, query and reporting. The results may be presented as reports or charts.

Analytical Processing: OLAP, (On-line Analytical Processing), has the mission of extracting data for decision making support process. It is a relatively new technology and in the data warehouse realm warrants strong attention. Here, according to Gill & Rao, are a few points that uniquely define OLAP:

It presents a multidimensional, logical view of the data in the data warehouse. The view is independent of the way the data is stored.

It always involves interactive query and analysis of the data. The interaction is usually multiple passes, involving drilling down into successively lower levels of detail data or roll-ups to higher levels of summarization and aggregation.

It offers analytical modeling capabilities, including a calculation engine for deriving ratios, variances, etc., involving measurements or numerical data across many dimensions.

It supports functional models for forecasting , trend analysis and statistical analysis.

It retrieves and displays data in 2D or 3D cross-tabs, charts & graphs with easy pivoting of axis (to see what "pivoting" does, try this feature on a Microsoft Excel spreadsheet ). Pivoting is key because business users need to analyze the data from different perspectives; and the analysis of one perspective leads to business questions to be examined from another perspective.

It responds to queries quickly, so the analysis process is not interrupted and the information is not stale.

It has a multi-dimensional data store engine, which stores data in arrays. These arrays are a logical representation of the business dimensions.

In fewer words, OLAP multiplies the power of the on-line database power-user.

Data Mining: Serendipitous process attempting to uncover previously unknown and unsuspected hidden patterns or relationships in the data. Data Mining, to achieve results, scans data at the transaction detail level. The results can come in reports in various states of summarization. Emerging visualization tools make trends "easy to see". These tools are used mostly, but not exclusively to understand customer behavior.

Pre-Processed information: Not all of the information stored in the data warehouse is as sexy as the one extracted by Data Mining tools. A good part of the data warehouse typically can be used for purposes such as routine weekly and monthly marketing summaries and reports.

All together, these tools offer new, creative and unforeseen weapons for doing business. They can be applied at every end of the spectrum, from cost cutting to targeted marketing. The data warehousing process can be seen as knowledge value for the preemptive decision maker. To know before the others, to decide and decide right, before anybody else does. The pay-back of knowledge here, is money.

In Conclusion:

Data warehousing is already a staple in larger organizations. The Walmarts of the world already study consumer behavior like some scientists study ants. The question is, what’s in it for the average sized I/S Department? "Data Warehouse", by nature, evokes the word "BIG". Unearthing the "gold" from your own data does not mean you have to have a huge mass of it. It is not reserved exclusively to the giant corporations. Soon, most companies will not go without it, much like the PCs, now on every desk. To illustrate this point here is one last example:

16 out of 29 NBA teams are currently using or evaluating a data mining tool called "Advanced Scout". It is DOS-based, and runs on a lap-top. This product, specifically designed for the NBA, is designed to help coaches better leverage the statistical information gathered during each game. By analyzing information such as the number of blocked shots or assists or the success of certain plays and lineups against a particular team, the coach can get a competitive edge. "This lets us look at the big picture" says Tom Sterner, assistant coach of the Orlando Magic who has used the "Advanced Scout" for a year. "We can see a trend we didn’t even know existed, isolate the problem and solve it or use the information to our benefit".

Sources:

The Official Guide to Data Warehousing , (1996 Que Corporation)

W.H. Inmon, Building a Data Warehouse (1992, QED Technical Publishing Group)

J. Zachman, A Framework for Information System Architecture – IBM Systems Journal 1987

Ziff Davis Publishing, PC Week, article by Aileen Crowley, June 3, 1996

The Wall Street Journal, article by Gomes Lee, November 18 1996

Back to Tylogix Home Page