An earlier blog post focused on simple steps to get started using the Teradata 14.10 Automated Statistics Management (AutoStats) feature. One of the new capabilities that AutoStats relies on when it streamlines statistics collection is the new “Threshold” option. Threshold applies some intelligence about when statistics actually need to be re-collected, allowing the optimizer to skip some recollections.
Although you will probably want to begin relying on AutoStats when you get to 14.10, you don’t have to be using AutoStats to take advantage of threshold, as the two features are independent from one another. This post will give you a simple explanation of what the threshold feature is, what default threshold activity you can expect when you get on 14.10, and what the options having to do with threshold do for you. And you’ll get some suggestions on how you can get acquainted with threshold a step at a time.
For more thorough information about statistics improvements in 14.10, including the threshold functionality, see the orange book Teradata Database 14.10 Statistics Enhancements by Rama Korlapati.
What Does the Threshold Option Do?
When you submit a COLLECT STATISTICS statement in 14.10, it may or may not execute. A decision is made whether or not there is a value in recollecting these particular statistics at the time they are submitted. That decision is only considered if threshold options are being used.
Threshold options can exist at three different levels, each of which will be discussed more fully in their own section below. This is a very general description of the three levels:
- System threshold: This is the default approach for applying thresholds for all 14.10 platforms. The system threshold default is not a single threshold value. Rather this default approach determines the appropriate threshold for each statistic and considers how much the underlying table has changed since the last collection.
- DBA-defined global thresholds: These optional global thresholds override the system default, and rely on DBA-defined fixed percentages as thresholds. Once set, all statistics collection statements will use these global threshold values, unless overridden by the third level of threshold options at the statement level.
- Thresholds on individual statements: Optional USING clauses that are attached to COLLECT STATISTICS statements can override the system default or any global DBA-defined thresholds when there is a need for customization at the individual statistic level.
Whichever threshold level is being used, if the optimizer determines that the threshold has not been met, no statistics will be collected, even though they have been requested. When a collection has been asked for but has not been executed, a StatsSkipCount column in the DBC.StatsTbl row that represents this statistics will be incremented.
StatsSkipCount appears as an explicit column in the view, but in the base DBC.StatsTbl StatsSkipCount is carried in the Reserved1 field. When StatsSkipCount is zero it means that the most recent COLLECT STATISTICS request was executed.
Ways That a Threshold Can Be Expressed
The system setting (level 1) for threshold logic is not one threshold value applied to all statistics collections. Rather, when enabled, the setting tells the optimizer to hold back the execution of a collection submission based on whatever it deems as an appropriate threshold for this statistics at this point in time. This high-level setting uses a “percent of change” type of threshold only.
Statistics collection thresholds are explicitly specified when using DBA-defined global settings or individual statement thresholds are used. These explicit thresholds can be expressed as a percent of change to the rows of the table upon which statistics are being collected, or as time (some number of days) since the last collection.
The most reliable way to express thresholds is by means of a percent of table change. That is why the highest level system setting, the one that is on by default, only supports percent of change thresholds. Time as a threshold must be explicitly specified in order to be used.
Importance of DBQL USECOUNT Logging
The recommended percent of change thresholds rely on having DBQL USECOUNT logging turned on. See my earlier blog on AutoStats for an explanation of USECOUNT DBQL logging. USECOUNT logging is a special type of DBQL logging that is enabled at the database level. Among other things, USECOUNT tracks inserts, deletes and updates made to tables within a database, and as a result, can provide highly accurate information to the optimizer about how the table has changed since the last statistics collection.
The default system threshold functionality is able to be applied to a statistic collection only if USECOUNT logging has been enabled for the database that the statistics collection table belongs to. In the absence of USECOUNT data, the default threshold behavior will be ignored. However, both DBA-defined global thresholds and statement-based thresholds are able to use percent of change thresholds even without USECOUNT logging, but with the risk of less accuracy.
In the cases where USECOUNT logging is not enabled, percent of change values are less reliable because the optimizer must rely on random AMP sample comparisons. Such comparisons consider estimated table row counts (the size of the table) since the last statistics collection. This can mask some conditions, like deletes and inserts happening in the same timeframe. Comparisons based strictly on table rows counts are not able to detect row updates, which could change column demographics. For that reason, it is recommended that USECOUNT logging be turned on for all databases undergoing change once you get to 14.10.
Percent of change is the recommended way to express thresholds when you begin to use the threshold feature in 14.10. Time-based thresholds are offered as options primarily for sites that have evolved their own in-house statistics management applications at a time when percent of change was unavailable, and wish to continue to use time.
The next three sections discuss the three different levels of threshold settings.
More about the System Threshold Option
All 14.10 systems have the system threshold functionality turned on by default. But by itself, that is not enough. USECOUNT logging for the database must also be enabled. If USECOUNT DBQL logging is turned on, then each COLLECT STATISTICS statement will be evaluated to see if it will run or be skipped.
During this evaluation, an appropriate change threshold for the statistic is established by the optimizer. The degree of change to the table since the last collection is compared against the current state of the table, based on USECOUNT tracking of inserts, deletes and updates performed. If the change threshold has not been reached, and enough history has been collected for this statistics (usually four of five full collections) so that the optimizer can perceive a pattern in the data such that extrapolations can be confidently performed, then this statistics collection will be skipped.
Even if the percent of change threshold has not been reached (indicating that statistics can be skipped), if there are insufficient history records, the statistics will be recollected. And even with 10 or 20 history records, if there is no regular pattern of change that the optimizer can rely on to make reasonable extrapolations, statistics will be recollected.
There is a DBS Control record parameter called SysChangeThresholdOption which the behavior of the system threshold functionality. This parameter is set at zero by default. Zero means that as long as USECOUNT logging in DBQL is enabled for the database that the table belongs to, then all statistics collection statements will undergo a percent of change threshold evaluation, as described above.
If you want to maintain the legacy behavior, threshold logic can be turned off completely at the system level by disabling the SysChangeThresholdOption setting in DBS Control (set it to 3). This field, along with parameters to set DBA-defined global parameters, can be found in the new Optimizer Statistics Fields in DBS Control.
It is important to re-emphasize that the DBQL USECOUNT logging must be enabled for all databases that you want to take advantage of the system threshold functionality. In addition, all other lower-level threshold settings must remain off (as they are by default) in order for the system threshold to be in effect.
More about DBA-Defined Global Thresholds
While it is recommended that the system threshold setting be embraced as the universal approach, there are some sites that have established their own statistics management processes prior to 14.10. Some of these involve logic that checks on the number of days that have passed since the last collection as an indicator of when to recollect.
In order to allow those statistics applications to continue to function as they have in the past within the new set of threshold options in 14.10, global thresholds parameters have been made available. These options are one step down from the system threshold and will cancel out use of DefaultUserChangeThreshold.
There are two parameters in the same section of DBS Control Optimizer Statistics Field that allow you to set DBA-defined thresholds:
DefaultUserChangeThreshold: If this global threshold is modified with a percent of change value (some number > 0), then the system default threshold will be disabled, and the percent defined here will be used to determine whether or not skip statistic collections globally.
Unlike the system default, if DBQL USECOUNT logging has not been enabled, random AMP samples will be used instead if this global setting has been enabled. The approach of using random AMP sample is somewhat less reliable, particularly in cases where there are updates, or deletes accompanied by inserts, rather than just inserts.
DefaultTimeThreshold: This global setting provides backward compatibility with home-grown statistics management applications that rely on the passage of time. Using a time-based threshold offers a less precise way of determining when a given statistic requires recollection. Some tables may undergo large changes in a 7-day period, while others may not change at all during that same interval. This is a one-size-fits-all lacks specificity and may result in unneeded resource usage.
More about Statement-Level Thresholds
USING THRESHOLD syntax can be added manually to any COLLECT STATISTIC statement.
COLLECT STATISTICS USING THRESHOLD 10% AND THRESHOLD 15 DAYS COLUMN TestID ON SandBoxT1;
When you use USING THRESHOLD, it will override any default or global threshold settings that are in place. See the Teradata Database 14.10 Statistics Enhancements orange book for detailed information about the variations of statement-level options you can use for this purpose.
For statement-based percent of change thresholds, the optimizer does not require that there be a history of past collections. If data change is detected over the specified threshold statistics will be collected, otherwise they will be skipped.
Statement-level thresholds are for special cases where a particular statistic needs to be treated differently than the higher level default parameters dictate. They can also be useful when you are getting starting with threshold, and you want to limit the scope to just a few statistics.
Getting Started Using Threshold
Here are some suggestions for sites that have just moved to 14.10 and want to experience how the threshold logic works on a small scale before relying on the system and/or global options:
- Pick a small, non-critical database.
- Enable DBQL USECOUNT logging on that database:
BEGIN QUERY LOGGING WITH USECOUNT ON SandBoxDB;
- Disable the system threshold parameter by setting DBS Control setting:
SysChangeThresholdOption = 3
- Leave the global parameters disabled as they are by default:
DefaultUserChangeThreshold = 0
DefaultTimeThreshold = 0
- Add USING THRESHOLD 10 PERCENT to the statistics collection statements just for the tables within your selected database:
COLLECT STATISTICS USING THRESHOLD 10% COLUMN TestID ON SandBoxT1;
- Insert a few rows into the table (less than 10% of the table size) and run an Explain of the statistics collection statement itself, and it will tell you whether or not skipping is taking place.
EXPLAIN COLLECT STATISTICS COLUMN TestID ON SandBoxT1;
See page 39 of the Teradata Database 14.10 Statistics Enhancement orange book for some examples.
Summary of Recommendations for Threshold Use
The following recommendations apply when you are ready to use the threshold functionality fully:
- If your statistics are not under the control of AutoStats, make no changes and rely on the system threshold functionality to appropriately run or skip your statistics collection statements.
- Always turn on USECOUNT logging in DBQL for all databases for which statistics are being collected and where you are relying on system threshold.
- If you have your own statistics management routines that rely on executing statistics only after a specific number of days have passed, set the DefaultTimeThreshold to meet your threshold criteria. You should experience similar behavior as you did prior to 14.10. Over time, consider switching to a change-based threshold and re-establishing the system threshold, as it will be more accurate for you.
- Don’t lead with either of the DBA-defined global parameters DefaultUserChangeThreshold or DefaultTimeThreshold unless there is a specific reason to do so.
- Use the statement-level threshold only for special statistics that need to be handled differently from the system or the DBA-defined global defaults.
- Favor percent of change over number of days as your first choice for the type of threshold to use.
- But if USECOUNT is not being logged, then rely on time-based thresholds and set DefaultTimeThreshold at your preferred number of days.