When you are starting to build a data warehouse (DWH), you probably ask yourself: which data modeling approach should I choose? Should I go with a Star schema or a Snowflake schema? The Snowflake schema seems like the more natural approach – every data entity gets its own table. However, if you have a lot of fact and dimension tables in your data warehouse, queries against a Snowflake schema can be cumbersome. A Star schema, or some kind of Snowflake/Star hybrid, is more common and user friendly.
Star and Snowflake data modeling approaches were developed when companies mostly dealt with data like orders, purchases, shipments, etc. This type of data obviously still flows into most data warehouses. However, these days many companies also collect data that reflects users’ activity on their websites: page views, clicks, searches, etc. It’s not uncommon to have tens or even hundreds of millions of such events per day, and each event might have tens or even hundreds of attributes. Should we create a dimension for every attribute? Why not just load all this data into a flat fact table in our DWH?
With the introduction of columnar databases such as Vertica, Redshift, and Greenplum, where every column is stored in a separate file, querying tables with very many columns became more efficient. However, storage utilization and query performance can still be an issue if tables are not designed properly. If you have hundreds of millions or billions of rows reflecting page views or clicks, it might be very costly to store the actual useragent or URL value in every row. For example: let’s say you have 3B page visits stored in your database and there are ~1M unique useragents corresponding to them. If you store the useragents in a dimension table, it will consume ~500MB of space. However, if you store the useragents in your fact table, attached to every page visit, it will require ~1.7TB of space. Also, if you need to perform a search within a list of useragents or URLs, it’s more efficient to do it over the dimension table rather than over the fact table.
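The space estimate above can be sketched as back-of-the-envelope arithmetic. The assumed sizes here (a ~500-byte average useragent string and an 8-byte surrogate key) are illustrative, not measurements; actual numbers will depend on your data and on the database’s compression:

```python
# Back-of-the-envelope storage estimate (uncompressed, assumed sizes).
ROWS = 3_000_000_000      # page visits in the fact table
UNIQUE_UAS = 1_000_000    # distinct useragent strings
UA_BYTES = 500            # assumed average useragent length in bytes
KEY_BYTES = 8             # surrogate key (e.g. BIGINT) per fact row

# Option 1: a dimension table holds each useragent once;
# the fact table stores only an integer key per row.
dim_table = UNIQUE_UAS * UA_BYTES      # ~0.5 GB
fact_keys = ROWS * KEY_BYTES           # ~24 GB of keys

# Option 2: the raw useragent string repeated in every fact row.
inline = ROWS * UA_BYTES               # ~1.5 TB

GB = 10 ** 9
print(f"dimension table: ~{dim_table / GB:.1f} GB")
print(f"fact-table keys: ~{fact_keys / GB:.1f} GB")
print(f"inline strings:  ~{inline / GB:.1f} GB")
```

Even with these rough assumptions, storing the strings inline costs roughly two orders of magnitude more space than the dimension table plus keys combined.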
On the other hand, if your fact table contains a column like “payment_method” or “page_type” that rarely has more than 5-10 unique and fairly short values, creating a dimension table might not be needed. The space savings wouldn’t be significant in this case. We usually just standardize the values by trimming trailing spaces and converting all characters to lowercase. Such a design will also accommodate the majority of reporting tools. For example, tools like MicroStrategy “like” to have dimension tables so that the list of unique values for a given data entity can be easily extracted. A columnar database will usually respond very quickly to a “select distinct …” query on a column with very few unique values, like “payment_method” or “page_type”, so these dimension tables can be created on the fly.
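The standardize-then-distinct idea above can be sketched in a few lines. This is a minimal illustration using SQLite as a stand-in for a columnar DWH; the table and column names are hypothetical, and the normalization (trim plus lowercase) mirrors the cleanup described above:

```python
import sqlite3

# Minimal sketch: derive a "dimension" on the fly from a low-cardinality
# fact-table column. SQLite stands in for a columnar DWH here; the table
# and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_visits (url TEXT, payment_method TEXT)")
conn.executemany(
    "INSERT INTO page_visits VALUES (?, ?)",
    [("/checkout", "Credit Card "),   # note inconsistent case/whitespace
     ("/checkout", "credit card"),
     ("/checkout", "PayPal"),
     ("/cart",     "paypal ")],
)

# Standardize (trim + lowercase) and extract the distinct values --
# effectively building the dimension table at query time.
rows = conn.execute(
    "SELECT DISTINCT lower(trim(payment_method)) AS payment_method "
    "FROM page_visits ORDER BY 1"
).fetchall()
print([r[0] for r in rows])   # ['credit card', 'paypal']
```

In a real columnar database the same `SELECT DISTINCT` over billions of rows stays cheap because only that one column’s files are scanned, and run-length or dictionary encoding makes low-cardinality columns especially compact.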