Data Warehousing, Part II: Database Designs
(This was originally posted to the NTEN Data Warehousing group list, by Nate Wilbert. Nate has kindly allowed us to publish his work here.)
A couple of weeks ago I wrote about similarities between putting a jigsaw puzzle together and building up a data warehouse. I finished the letter with a comment about discussing a couple of ways to design databases. Technically speaking, I was referring to the difference between a database designed for OLAP, or "online analytical processing" and OLTP, or "online transaction processing." (More on terminology below!) But first, last weekend the Olympics started in China and so I'll work on an analogy in honor of the Summer Games.
One of my favorite parts of the summer games is Track and Field. I used to be a distance runner. I ran cross country in the fall semester and track and field during the spring semester. We had two different styles of training during either season. There were days of the week in which we ran great distances, but at a slower pace to help build up our endurance. Then there were the days in which we ran short distances, but at a very fast pace, over and over. They both had a goal.
The former was to give us stamina for the length of the race and the latter gave us an ability to switch gears during the race and finish with some speed! We needed both to perform well on race day. The same is true of database design.
We need an area to enter data in fast and efficiently. We also need an area to retrieve data fast and efficiently. Let's discuss the first type. In scenario one, someone is browsing on your website and decides they want to support your organization and are willing to sign up for more information. They fill out a form on your site, sign up for your email list and provide some critical information about themselves. As soon as they click the submit button, a transaction occurs that loads their information into your contacts database. There are special design rules for building "transaction oriented databases" and we'll briefly touch on that below.
The second kind of database design is called OLAP. There are a couple kinds of OLAP, but the basic idea there is that we design the database differently from the one in which we're entering data because the goal is to pull data out quickly, not load it quickly. You can't design a database that does both efficiently. For an OLAP example, let's say you run the marketing group of your organization. You want to know how many contacts have donated money by region of the country. With a database designed for analysis (OLAP) we will much more quickly be able to get the answer to that question than with one designed using a traditional method (for transactions).
A transaction oriented database is usually designed with normalization in mind. Let's unpack the term. Normalization is a fancy database term that basically means we do not have data in multiple places within the same database. It helps in a couple of areas. One, it helps maintain data integrity in that we don't have to worry about updating every single place we've stored a piece of information, because the goal is to only store it in one place! And along with that, when loading data we only have to load to one place so the LOAD process is faster than if we had to insert into multiple tables. You can check the term out on Wikipedia if you like.
An analysis oriented database is usually designed using what's called a dimensional framework. The term dimension refers to different facets of your organization. A common dimension would be Time or Geography. That is, we will typically report on XYZ measures by Time and Geography. "Show me number of donors for last year, by state." A dimensional framework is something which Ralph Kimball has written many books on. We actually do the opposite of a normalized design and it's called a denormalized database design. We repeat data in many places sometimes. The reason is that when we are querying the database, the more tables we have to query to get our information the slower the response time will be. So we try to minimize the number of table joins created. Additionally, a dimensional model is easier for nontechnical persons to understand which becomes important when trying to design a database which will support the reports they require. There are typically fewer tables because we're allowing data to be denormalized and repeated rather than forcing and each type of data into its own table.
Just like there were two types of training to help with the goal of winning a race on race day, there are two types of database design to reach the goal of becoming a more information driven organization. One helps us load and store data efficiently, while the other helps us pull the data quickly. I'm not really getting into the nuts and bolts here, but if you're interested in some further study on the ideas here I'd recommend Kimball's The Data Warehouse Toolkit (for dimensional design) or his Data Warehouse Lifecycle Toolkit, as well as his articles (listed in the first post).
Enjoy the rest of the Games.
Nate
- Login to post comments