Saturday, October 25, 2008

What to Learn About in Order to Speed Up Data Warehouse Querying

This paper is a laundry list of items data warehouse implementers may wish to learn more about in order to speed up their data warehouse queries or to make the data warehouse "environment" more responsive to the bulk of the data warehouse query users. This paper will not attempt to provide detailed explanations of these topics. Nor is including a topic in this list a declaration that knowledge of the topic will definitely speed up querying. Rather, data warehouse implementers may use this paper as a starting point in their search for ways to speed up queries. This list includes topics that are relevant to many of the relational database and data access tool technologies. Some topics that apply, to the best of my knowledge, to one or two vendors' technologies are not listed.
SQL SELECT statements
This is bedrock knowledge. It is quite worthwhile to get a book on SQL (there are quite a few good ones) and review (or learn) this topic. Though you may think that your query tool's SQL generation capabilities lessen the need for this knowledge, you will eventually find the SQL knowledge quite helpful.
How your database joins tables, unions tables, uses indexes, and chooses access paths
This is more bedrock knowledge. Unfortunately, this information may not be that accessible. If it exists, it may be poorly written, aimed at an academic audience, and/or scattered among many manuals. Nevertheless, it is worth making a determined effort to understand these topics. The vendor/consultant community would do itself well if it tried much harder to communicate this information in coherent and comprehensible terms.
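A quick refresher of the bedrock: the sketch below uses Python's built-in sqlite3 module purely for illustration (the table and column names are made up), and shows the filter/group/aggregate/order pattern that most warehouse queries reduce to.

```python
import sqlite3

# In-memory database with a tiny, made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("East", 50.0), ("West", 75.0)])

# Bedrock SELECT: filter rows, group them, aggregate, order the result.
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 0
    GROUP BY region
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('East', 150.0), ('West', 75.0)]
```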
What statistics your database provides on query execution
Sometimes those of us building stores of information for users to analyze forget about our own information needs. You need this information to identify which queries are especially resource consumptive. You probably will be concerned with a clump of queries that are far more consumptive than average. Sometimes the resolution of consumption issues is a simple rewrite of the query. Sometimes resolution is more technically involved and requires doing many things listed in this paper. And sometimes the solution is to do nothing - you just have to accept that your data warehouse has to support these demanding queries.
Aggregate tables
This is probably the most used method of speeding up queries. There are many discussions of this in the literature. The books "The Data Warehouse Lifecycle Toolkit", "The Data Warehouse Toolkit", and "Data Warehousing in the Real World" have especially good non-technology specific discussions of this topic.
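As a minimal sketch of the idea, again using sqlite3 with hypothetical table names: a monthly aggregate table is built once at load time, so that month-level queries never have to scan the detail rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, store TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2008-01-05", "A", 10.0), ("2008-01-20", "A", 5.0),
    ("2008-02-03", "B", 7.0),
])

# Build the monthly aggregate table once, at load time.
conn.execute("""
    CREATE TABLE sales_by_month AS
    SELECT substr(sale_date, 1, 7) AS month, store, SUM(amount) AS amount
    FROM sales
    GROUP BY month, store
""")

# Month-level queries now read the (much smaller) aggregate table.
totals = conn.execute(
    "SELECT month, SUM(amount) FROM sales_by_month GROUP BY month ORDER BY month"
).fetchall()
print(totals)  # [('2008-01', 15.0), ('2008-02', 7.0)]
```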
Aggregate navigators/query redirectors
This is the technology that automatically directs a query to aggregated data if such data are available and appropriate for the query.
Partitioning
This is probably the second most common method of speeding up queries. Note that partitioning comes in many shapes and forms. At the very least, it means dividing one table into several, usually based on the time period the data represent. Note that both tables and indexes may be partitioned.
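A toy illustration of time-based partitioning, using sqlite3. Real warehouse DBMSs provide partitioning declaratively; the per-year tables and the view stitching them together here are just a hand-rolled stand-in for the concept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One physical table per year, unified behind a view.
for year in (2007, 2008):
    conn.execute(f"CREATE TABLE sales_{year} (sale_date TEXT, amount REAL)")
conn.execute("INSERT INTO sales_2007 VALUES ('2007-06-01', 20.0)")
conn.execute("INSERT INTO sales_2008 VALUES ('2008-06-01', 30.0)")
conn.execute("""
    CREATE VIEW sales AS
    SELECT * FROM sales_2007 UNION ALL SELECT * FROM sales_2008
""")

# A query constrained to 2008 can go straight to that partition,
# never touching the 2007 data.
total_2008 = conn.execute("SELECT SUM(amount) FROM sales_2008").fetchone()[0]
print(total_2008)  # 30.0
```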
B-tree indexing
Adding numerous indexes is another common method for speeding up queries. Note that persons with a transaction processing mindset may have a hard time accepting as much use of these indexes as is usually helpful in a data warehouse.
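SQLite's ordinary indexes happen to be B-trees, so it can illustrate the point; EXPLAIN QUERY PLAN confirms the optimizer chooses the index over a full table scan (the table and index names below are invented for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i % 100, float(i)) for i in range(1000)])

# SQLite indexes are B-trees; this one serves lookups on customer_id.
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")

# The query plan shows an index search rather than a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM sales WHERE customer_id = 7"
).fetchall()
uses_index = any("idx_sales_customer" in row[-1] for row in plan)
print(uses_index)  # True
```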
Dimensional modeling
With certain database technologies, this modeling can reduce the amount of sort/merging that goes on when joining tables. And, some query tools may generate more efficient SQL if data are modeled dimensionally. Also, if you use surrogate keys in conjunction with dimension modeling, joins may be more efficient.
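A minimal star schema sketch in sqlite3, with an integer surrogate key joining one fact table to one dimension table (all names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table with an integer surrogate key.
conn.execute(
    "CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Food"), (2, "Tools")])
# Fact table referencing the dimension via the surrogate key.
conn.execute("CREATE TABLE fact_sales (product_key INTEGER, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 10.0), (1, 4.0), (2, 6.0)])

# The classic star-join: fact joined to dimension on the surrogate key.
by_category = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d ON f.product_key = d.product_key
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(by_category)  # [('Food', 14.0), ('Tools', 6.0)]
```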
Parallelizing query execution
Developments in database technology have made doing this much easier. Note, however, the number of users running queries and the amount of data to be returned in a query can sometimes limit this technique's effectiveness.
Archiving/purging data
Sometimes the cost of having to scan through older data exceeds the benefit of having it available on the unlikely chance that someone wants to examine it.
Reducing the width of large tables that get scanned
There are also many ways to do this. Before getting fancy with this it is worth taking the time to understand what actually takes up space in your database tables.
Completely denormalizing aggregate tables
If these tables can be heavily indexed and can be maintained by complete refreshing, the requirements of join processing can be eliminated.
Loading tables completely in memory
Presuming the memory is available to do this and you have researched other topics in this paper, this may be an interesting strategy.
Bit mapped indexing
This technique can work well when a field takes on a low number of distinct values (i.e., low cardinality) and tends to appear in WHERE clauses often.
Striping files
This means spreading a file over several physical disks. Look into the topic of RAID for more details.
Locating different files used concurrently on different disks
This is basic stuff but it can be helpful.
Defragmentation of table and index files
This is more basic stuff.
Solid State Disk
Supposedly prices have come down in the last few years.
Disk controllers
Too few can be a query bottleneck.
What your query tool attempts to do via SQL and what it does internally
The book "The Data Warehouse Toolkit" has a good discussion of where query tools may fall short. The reason you need to learn about this is to prevent using the query tool where it is inefficient or to know when you might build some "get arounds".
Query scheduling capabilities
This does not necessarily speed up a given query. However, scheduling resource consumptive queries for off-hours times may free up resources for other queries during prime time.
Query queuing
As with scheduling, this does not speed a given query up. However, this facility gives you a means by which priority queries (such as a query needed to gain information for the monthly close of the financial books) can execute sooner.
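The mechanics can be as simple as a priority queue. A toy sketch in Python's standard heapq module, where a lower number means higher priority (an assumption of this sketch, not a statement about any particular product):

```python
import heapq

# Queued queries as (priority, description) pairs; lower number wins.
queue = []
heapq.heappush(queue, (5, "ad hoc trend analysis"))
heapq.heappush(queue, (1, "monthly close query"))
heapq.heappush(queue, (5, "ad hoc customer list"))

# Pop in priority order: the month-end close query jumps the line.
order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(order[0])  # 'monthly close query'
```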
Query accelerators
These help you generate more efficient SQL. Note that they are probably more helpful to those who report off of highly normalized databases.
Query governors
These stop queries usually after a specified number of rows have been returned and/or a specified time has elapsed.
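A hand-rolled row-count governor might look like the following sketch (the threshold and function name are invented; commercial governors usually also enforce elapsed-time limits):

```python
import sqlite3

ROW_LIMIT = 3  # invented governor threshold

def governed_query(conn, sql, row_limit=ROW_LIMIT):
    """Fetch rows, but stop the query once row_limit is exceeded."""
    cur = conn.execute(sql)
    rows = cur.fetchmany(row_limit + 1)
    if len(rows) > row_limit:
        raise RuntimeError(f"query governor: result exceeds {row_limit} rows")
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

small = governed_query(conn, "SELECT n FROM t WHERE n < 2")  # fine: 2 rows
stopped = False
try:
    governed_query(conn, "SELECT n FROM t")  # 10 rows: governor kicks in
except RuntimeError:
    stopped = True
print(stopped)  # True
```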
Query nannies
This is my term for technologies that warn (scold?) the user if he submits an inefficient query. Some of these provide hints about how to make the query more efficient, and some (I have heard) actually try to fix up the queries.
"Productionizing" regularly used, highly resource consumptive queries
Certain queries probably should be written by someone with a great deal of knowledge of how to make queries efficient.
Storing the image of the report
If a report based on a query is used by many people and on-line retrieval of the report is needed, the image of the report may be stored. The query then need be run only once and perhaps at a less busy time. There are tools that allow intelligent retrieval of stored report data.
Query tool caching of results
Some tools store the results of some queries. If the same query is run again, the tool may check to see if the results are stored. Or, if a subset of a previously retrieved result set is desired, the tool will read the previously retrieved query result set rather than the data warehouse.
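A bare-bones version of result caching keyed by SQL text (this sketch ignores the subset-reuse case and cache invalidation on warehouse reload; the class and attribute names are invented):

```python
import sqlite3

class CachingRunner:
    """Cache query results keyed by SQL text, as some query tools do."""
    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.hits = 0

    def run(self, sql):
        if sql in self.cache:           # same SQL seen before:
            self.hits += 1              # serve the stored result set
            return self.cache[sql]
        rows = self.conn.execute(sql).fetchall()
        self.cache[sql] = rows
        return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])

runner = CachingRunner(conn)
first = runner.run("SELECT SUM(n) FROM t")   # goes to the database
second = runner.run("SELECT SUM(n) FROM t")  # served from the cache
print(first, runner.hits)  # [(10,)] 1
```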
Query tool preview of a subset of records
When a query is being developed, some tools make it easy to retrieve a small subset of records that meet the query criteria. This makes it quicker to test the query and cuts down the number of potentially expensive test queries.
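In plain SQL the preview is typically just a LIMIT (or TOP, or FETCH FIRST, depending on the dialect); a sqlite3 sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (n INTEGER)")
conn.executemany("INSERT INTO big VALUES (?)", [(i,) for i in range(10000)])

# Test the query logic against a handful of matching rows, not all of them.
preview = conn.execute(
    "SELECT n FROM big WHERE n % 2 = 0 LIMIT 5").fetchall()
print(preview)  # [(0,), (2,), (4,), (6,), (8,)]
```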
Making two copies of the data warehouse - one for "operational" users and one for "analytical" users
It actually is hard to draw a line between what is operational use and what is analytical use of a data warehouse. However, in a typical data warehouse most of the users (usually with more "operational" needs) are running IS-written, parameterized queries, while a relatively small number of users (usually with more "analytical" needs) are running potentially highly resource consumptive ad hoc queries. Though it is not necessarily pretty, sometimes the best way to handle this mixed use of the data warehouse is to create a separate copy of the data warehouse for each user group.
Multi-tiered architectures/Application partitioning
Some query tools allow you to run different components (i.e., "tiers" or "partitions") of the tool on different hardware servers.
Table compression
Reading fewer blocks of data may result in improved query performance. Note there are many approaches to compressing data, and you may have to experiment with them.
Network bottlenecks
Though you do not have to become an expert at network topologies, if some of your users will run queries that generate large result sets (and do not assume that only lengthy reports bring back large result sets to the query tool), it pays to trace the flow of data from the server to the user's workstation in order to see if there are any mismatched network components. For example, Fast Ethernet may be in your new facility, but your user may have a 10Mbps network interface card. Or, your user may have a card that was advertised to perform at 100Mbps but in actuality performs at 30Mbps. Also, find out how your network people load balance; they are more used to dealing with predictable transaction processing than with extremely variable data warehousing demands. And if necessary, find out the cost of dropping more cable so you can put users who run large result set producing queries on dedicated network segments. If you have invested millions in the data warehouse, the cost of an electrician and wire may be worth it.
Database technology designed specifically for data warehousing
Google data warehouse appliances.
Columnar databases
Many of the data warehouse appliances feature this architecture.
The cost of installing more/faster CPU, memory, disk
Sometimes buying metal is (by far) the least expensive way to speed up your queries.

Some final thoughts about speeding up queries:

- You had best expect that many of your queries are going to run a "long" time. You will prevent some problems if you spend some time teaching your users about what, in general, will take a long time.

- In line with what I just said, you can spend plenty of time tuning queries. Though many IS people like to spend their time tuning queries, this tuning time can take IS away from other data warehouse problems whose solution is more meaningful to the business.

- In reality, the area of speeding up queries involves plenty of guesswork, doing things by intuition, trial and error, and making uncomfortable trade-offs.
