Teradata Developer Exchange - All blogs

How to use JDBC FastExport with R


The following shows how to select rows from a database table using JDBC FastExport, which only works with JDBC PreparedStatement.

If the SELECT statement has no '?' parameter markers, then...

library(RJDBC) # provides JDBC(), dbConnect(), dbSendUpdate(), fetch(); loads DBI and rJava

drv = JDBC("com.teradata.jdbc.TeraDriver","c:\\terajdbc\\terajdbc4.jar;c:\\terajdbc\\tdgssconfig.jar")
con = dbConnect(drv,"jdbc:teradata://system/TYPE=FASTEXPORT","user","password")

# initialize table foo
dbSendUpdate(con,"drop table foo")
dbSendUpdate(con,"create table foo(a int,b varchar(100))")
dbSendUpdate(con,"insert into foo values(?,?)", 42, "bar1")
dbSendUpdate(con,"insert into foo values(?,?)", 43, "bar2")

# select * from table foo
ps = .jcall(con@jc, "Ljava/sql/PreparedStatement;", "prepareStatement","select * from foo")
rs = .jcall(ps, "Ljava/sql/ResultSet;", "executeQuery")
md = .jcall(rs, "Ljava/sql/ResultSetMetaData;", "getMetaData")
jr = new("JDBCResult", jr=rs, md=md, stat=ps, pull=.jnull())
fetch(jr, -1)
.jcall(rs,"V","close")
.jcall(ps,"V","close")

dbDisconnect(con)

If the SELECT statement has '?' parameter markers, then...

drv = JDBC("com.teradata.jdbc.TeraDriver","c:\\terajdbc\\terajdbc4.jar;c:\\terajdbc\\tdgssconfig.jar")
con = dbConnect(drv,"jdbc:teradata://system/TYPE=FASTEXPORT","user","password")

# initialize table foo
dbSendUpdate(con,"drop table foo")
dbSendUpdate(con,"create table foo(a int,b varchar(100))")
dbSendUpdate(con,"insert into foo values(?,?)", 42, "bar1")
dbSendUpdate(con,"insert into foo values(?,?)", 43, "bar2")

# select * from table foo with '?' parameter marker
dbGetQuery(con,"select * from foo where a=?",as.integer(43))

dbDisconnect(con)

How to avoid duplicate JDBC Connection created by RJDBC dbConnect


The implementation of dbConnect in RJDBC version 0.2-2 and earlier (http://www.rforge.net/RJDBC/news.html) erroneously creates two (duplicate) connections.

RJDBC dbDisconnect only closes one of the two connections, leaving the duplicate connection orphaned. The issue has been reported at https://github.com/s-u/RJDBC/commit/c6a0907822d6bcfe003f4de38bd4c65ae7c261aa#commitcomment-4772766 and https://github.com/s-u/RJDBC/issues/1.

There are two workarounds until RJDBC dbConnect is fixed:

  1. Avoid RJDBC dbConnect and dbDisconnect. Invoke JDBC DriverManager.getConnection() and JDBC Connection.close() directly.
    drv = JDBC("com.teradata.jdbc.TeraDriver","c:\\terajdbc\\terajdbc4.jar;c:\\terajdbc\\tdgssconfig.jar")
    con = .jcall("java/sql/DriverManager","Ljava/sql/Connection;","getConnection", "jdbc:teradata://system","user","password")
    s = .jcall(con, "Ljava/sql/Statement;", "createStatement")
    rs = .jcall(s, "Ljava/sql/ResultSet;", "executeQuery", "SELECT SessionNo,TRIM(UserName),LogonSource FROM DBC.SessionInfo ORDER BY SessionNo")
    md = .jcall(rs, "Ljava/sql/ResultSetMetaData;", "getMetaData")
    jr = new("JDBCResult", jr=rs, md=md, stat=s, pull=.jnull())
    fetch(jr, -1)
    .jcall(rs,"V","close")
    .jcall(s,"V","close")
    .jcall(con,"V","close")
    
  2. Overwrite RJDBC dbConnect with the expected fix.
    setMethod("dbConnect", "JDBCDriver", def=function(drv, url, user='', password='', ...) {
      jc <- .jcall("java/sql/DriverManager","Ljava/sql/Connection;","getConnection", as.character(url)[1], as.character(user)[1], as.character(password)[1], check=FALSE)
      if (is.jnull(jc) && !is.jnull(drv@jdrv)) {
        # ok one reason for this to fail is its interaction with rJava's
        # class loader. In that case we try to load the driver directly.
        oex <- .jgetEx(TRUE)
        p <- .jnew("java/util/Properties")
        if (length(user)==1 && nchar(user)) .jcall(p,"Ljava/lang/Object;","setProperty","user",user)
        if (length(password)==1 && nchar(password)) .jcall(p,"Ljava/lang/Object;","setProperty","password",password)
        l <- list(...)
        if (length(names(l))) for (n in names(l)) .jcall(p, "Ljava/lang/Object;", "setProperty", n, as.character(l[[n]]))
        jc <- .jcall(drv@jdrv, "Ljava/sql/Connection;", "connect", as.character(url)[1], p)
      }
      .verify.JDBC.result(jc, "Unable to connect JDBC to ",url)
      new("JDBCConnection", jc=jc, identifier.quote=drv@identifier.quote)},
              valueClass="JDBCConnection")
    
    .verify.JDBC.result <- function (result, ...) {
      if (is.jnull(result)) {
        x <- .jgetEx(TRUE)
        if (is.jnull(x))
          stop(...)
        else
          stop(...," (",.jcall(x, "S", "getMessage"),")")
      }
    }
    
    drv = JDBC("com.teradata.jdbc.TeraDriver","c:\\terajdbc\\terajdbc4.jar;c:\\terajdbc\\tdgssconfig.jar")
    con = dbConnect(drv,"jdbc:teradata://system","user","password")
    dbGetQuery(con,"SELECT SessionNo,TRIM(UserName),LogonSource FROM DBC.SessionInfo ORDER BY SessionNo")
    dbDisconnect(con)
    


How to capture JDBC Connection parameter LOG=DEBUG messages with R


JDBC Connection parameter LOG=DEBUG prints error, timing, informational, and debug messages to Java System.out. The following shows how the messages can be captured with R.

R can be started with several executables: R, Rgui, Rscript, Rterm. Not all of them capture messages printed to Java System.out, e.g. Rgui on Windows. Rscript does not include the invoked R commands in its output, so the executables to use for capturing invoked R commands and messages printed to Java System.out are R or Rterm.

  1. Save the following in a file, e.g. Rtest.txt.
    library(RJDBC)
    library(teradataR)
    
    drv = JDBC("com.teradata.jdbc.TeraDriver","c:\\terajdbc\\terajdbc4.jar;c:\\terajdbc\\tdgssconfig.jar")
    con = dbConnect(drv,"jdbc:teradata://system/LOG=DEBUG","user","password")
    dbGetQuery(con,"select * from dbc.dbcinfo")
    dbDisconnect(con)
    
  2. Enter the following to execute Rterm and save the output in Result.log:
    Rterm --no-save < Rtest.txt > Result.log

The Future of Big Data: the Answer is Not “42”


“42” is the Answer to the Ultimate Question of Life, the Universe, and Everything.  But for Big Data, the answer is simply “everything.” 

As our gadgets and sensors progress, the range of human activity we can monitor and analyze, now and in the future, keeps growing.  First we monitored and captured business transactions:  sales, inventory, computer and telco networks, etc.  With the advent of social media, we monitor social interactions online:  tweets, likes, posts, texts, email. Pictures, with their 1,000 words, yield a vast amount of data that you may not have intended to state directly.

With mobile devices that can report on location and activity, retailers and insurance providers can sell and price based on location data.  Of course, with shoppers “showrooming” and comparison shopping using their mobile devices, they are not shy about letting organizations know where they are.  In addition, businesses do not need a gadget to understand your interests:  the eyes are the window to the soul.  And a license plate is sufficient for the insurers.

Now with the addition of physical devices that can report on body location and activities – “wearables” – we can monitor and report on practically all human activity:  physical activity (steps) and the condition of the body (heart, lungs, temperature, blood).  And with upgrades to noninvasive devices that can interface the mind by scanning the brain’s electrical activity for games and prosthetics, capturing and analyzing thought patterns is not too far in the future.

Remember when capturing and analyzing business transactions was a challenge?  How about all this activity?  The pipes required simply to capture the data are huge – petabytes per day huge for some organizations – let alone actually running analytics against the data once it has been filtered and landed.

What is the answer to building an architecture and support organization to handle this data monster?  It is not going to be as simple as “42.”


How to capture chained JDBC SQLException error messages and stack trace with R


The following batches 5 rows of data, 2 of them with errors. The JDBC PreparedStatement.executeBatch attempts to INSERT the batched rows into a database table but throws a JDBC SQLException. Here is how to capture the chain of JDBC SQLException error messages and stack trace:

drv = JDBC("com.teradata.jdbc.TeraDriver","c:\\terajdbc\\terajdbc4.jar;c:\\terajdbc\\tdgssconfig.jar")
con = .jcall("java/sql/DriverManager","Ljava/sql/Connection;","getConnection", "jdbc:teradata://system","user","password")
s = .jcall(con, "Ljava/sql/Statement;", "createStatement")

.jcall(s, "I", "executeUpdate", "drop table foo")
.jcall(s, "I", "executeUpdate", "create table foo(c1 integer check(c1 between 10 and 20),c2 varchar(100)) unique primary index (c1)")

ps = .jcall(con, "Ljava/sql/PreparedStatement;", "prepareStatement","insert into foo values(?,?)")

.jcall(ps,"V","setInt",as.integer(1),as.integer(12))
.jcall(ps,"V","setString",as.integer(2),"bar1")
.jcall(ps,"V","addBatch") # row 1

.jcall(ps,"V","setInt",as.integer(1),as.integer(23)) # inject constraint violation error
.jcall(ps,"V","setString",as.integer(2),"bar2")
.jcall(ps,"V","addBatch") # row 2

.jcall(ps,"V","setInt",as.integer(1),as.integer(14))
.jcall(ps,"V","setString",as.integer(2),"bar3")
.jcall(ps,"V","addBatch") # row 3

.jcall(ps,"V","setInt",as.integer(1),as.integer(12)) # inject duplicate key error
.jcall(ps,"V","setString",as.integer(2),"bar4")
.jcall(ps,"V","addBatch") # row 4

.jcall(ps,"V","setInt",as.integer(1),as.integer(16))
.jcall(ps,"V","setString",as.integer(2),"bar5")
.jcall(ps,"V","addBatch") # row 5

# capture chained JDBC SQLException error messages and stack trace from PreparedStatement.executeBatch()
.jcall(ps,"[I","executeBatch", check=FALSE) # disable the default jcall exception handling with check=FALSE
x = .jgetEx() # save exceptions from PreparedStatement.executeBatch()
.jclear() # clear all pending exceptions
while (!is.jnull(x)) { # walk thru chained exceptions
    sw = .jnew("java/io/StringWriter")
    pw = .jnew("java/io/PrintWriter",.jcast(sw, "java/io/Writer"),TRUE)
    .jcall(x,"V","printStackTrace",pw) # redirect printStackTrace to a Java PrintWriter so it can be displayed in Rterm AND Rgui
    if (x %instanceof% "java.sql.BatchUpdateException") {
        print(.jcall(x,"[I","getUpdateCounts")) # show int[] update count with 3 rows inserted successfully (1) and 2 rows failed to insert (-3)
    }
    cat(.jcall(sw,"Ljava/lang/String;","toString")) # print the error message and stack trace
    if (x %instanceof% "java.sql.SQLException") {
        x = x$getNextException()
    } else {
        x = x$getCause()
    }
}

.jcall(ps,"V","close")

rs = .jcall(s, "Ljava/sql/ResultSet;", "executeQuery", "select * from foo order by 1")
md = .jcall(rs, "Ljava/sql/ResultSetMetaData;", "getMetaData")
jr = new("JDBCResult", jr=rs, md=md, stat=s, pull=.jnull())
fetch(jr, -1) # 3 rows are inserted (c1=12,14,16)
.jcall(rs,"V","close")

.jcall(s,"V","close")
.jcall(con,"V","close")

How to capture chained JDBC FastLoad SQLException/SQLWarning messages and stack trace with R


The following batches 5 rows of data, 2 of them with errors. The JDBC FastLoad PreparedStatement.executeBatch attempts to INSERT the batched rows into a database table but throws a JDBC SQLException. Here is how to capture the chain of JDBC FastLoad SQLException messages and stack trace from JDBC FastLoad PreparedStatement.executeBatch and the chain of JDBC FastLoad SQLWarning messages and stack trace from JDBC FastLoad Connection.rollback:

drv = JDBC("com.teradata.jdbc.TeraDriver","c:\\terajdbc\\terajdbc4.jar;c:\\terajdbc\\tdgssconfig.jar")
con = .jcall("java/sql/DriverManager","Ljava/sql/Connection;","getConnection", "jdbc:teradata://system/TYPE=FASTLOAD","user","password")
s = .jcall(con, "Ljava/sql/Statement;", "createStatement")

.jcall(s, "I", "executeUpdate", "drop table foo")
.jcall(s, "I", "executeUpdate", "create table foo(c1 integer check(c1 between 10 and 20),c2 varchar(100)) unique primary index (c1)")

.jcall(con, "V", "setAutoCommit",FALSE)
ps = .jcall(con, "Ljava/sql/PreparedStatement;", "prepareStatement","insert into foo values(?,?)")

.jcall(ps,"V","setInt",as.integer(1),as.integer(12))
.jcall(ps,"V","setString",as.integer(2),"bar1")
.jcall(ps,"V","addBatch") # row 1

.jcall(ps,"V","setInt",as.integer(1),as.integer(23)) # inject constraint violation error
.jcall(ps,"V","setString",as.integer(2),"bar2")
.jcall(ps,"V","addBatch") # row 2

.jcall(ps,"V","setInt",as.integer(1),as.integer(14))
.jcall(ps,"V","setString",as.integer(2),"bar3")
.jcall(ps,"V","addBatch") # row 3

.jcall(ps,"V","setInt",as.integer(1),as.integer(25)) # inject constraint violation error
.jcall(ps,"V","setString",as.integer(2),"bar4")
.jcall(ps,"V","addBatch") # row 4

.jcall(ps,"V","setInt",as.integer(1),as.integer(16))
.jcall(ps,"V","setString",as.integer(2),"bar5")
.jcall(ps,"V","addBatch") # row 5

# capture chained JDBC SQLException messages and stack trace from PreparedStatement.executeBatch()
.jcall(ps,"[I","executeBatch", check=FALSE) # disable the default jcall exception handling with check=FALSE
ex = .jgetEx() # save exceptions from PreparedStatement.executeBatch()
.jclear() # clear all pending exceptions
if (!is.jnull(ex)) {
    while (!is.jnull(ex)) { # loop thru chained exceptions
        sw = .jnew("java/io/StringWriter")
        pw = .jnew("java/io/PrintWriter",.jcast(sw, "java/io/Writer"),TRUE)
        .jcall(ex,"V","printStackTrace",pw) # redirect printStackTrace to a Java PrintWriter so it can be printed in Rterm AND Rgui
        if (ex %instanceof% "java.sql.BatchUpdateException") {
            print(.jcall(ex,"[I","getUpdateCounts")) # print int[] update count showing 3 rows inserted successfully (1) and 2 rows failed to insert (-3)
        }
        cat(.jcall(sw,"Ljava/lang/String;","toString")) # print the error message and stack trace
        if (ex %instanceof% "java.sql.SQLException") {
            ex = ex$getNextException()
        } else {
            ex = ex$getCause()
        }
    }

    # capture chained JDBC SQLWarning messages and stack trace from Connection.rollback()
    .jcall(con, "V", "rollback")
    w = .jcall(con, "Ljava/sql/SQLWarning;", "getWarnings") # save warnings from Connection.rollback()
    while (!is.jnull(w)) { # loop thru chained warnings
        sw = .jnew("java/io/StringWriter")
        pw = .jnew("java/io/PrintWriter",.jcast(sw, "java/io/Writer"),TRUE)
        .jcall(w,"V","printStackTrace",pw) # redirect printStackTrace to a Java PrintWriter so it can be printed in Rterm AND Rgui
        cat(.jcall(sw,"Ljava/lang/String;","toString")) # print the warning message and stack trace
        w = w$getNextWarning()
    }
} else {
    .jcall(con, "V", "commit")
}

.jcall(ps,"V","close")
.jcall(con, "V", "setAutoCommit",TRUE)

rs = .jcall(s, "Ljava/sql/ResultSet;", "executeQuery", "select * from foo order by 1")
md = .jcall(rs, "Ljava/sql/ResultSetMetaData;", "getMetaData")
jr = new("JDBCResult", jr=rs, md=md, stat=s, pull=.jnull())
fetch(jr, -1) # 0 rows are selected
.jcall(rs,"V","close")

.jcall(s,"V","close")
.jcall(con,"V","close")

A sample Java program that illustrates the use of JDBC FastLoad can be found here.


Column compress values from statistics

Get multi value compression out of column statistics for free

Besides collecting statistics on the columns of your Teradata database, compressing the data to save disk space is an important maintenance task. So why not connect these two tasks? The idea is to extract the values for multi-value compression of a column from the collected statistics.
 

The idea

 
Starting with Teradata V14, "SHOW STATISTICS VALUES COLUMN col ON db.tab;" prints the detailed results of the last statistics collection as text (optionally as XML). The text output is exactly the command needed to insert the collected results back into the database. The command prints many lines; the following are the ones of interest for the algorithm:
 
...
 /* NumOfNulls            */ 20,
...
 /* NumOfRows             */ 3180,
...
 /** Biased: Value, Frequency **/
 /*   1 */   'N', 3147,
 /*   2 */   'Y', 13
...
 
In particular, the biased-values block shows the values that occur very frequently in the column's data, and these values can be used for compressing the column.
 
A column qualifies for this compression approach only if it meets the following requirements:
  • The statistics have to be representative and current, but they may be sampled
  • The column must not be part of an index or the partitioning
  • The statistics values must have the correct (untruncated) length
  • The column must not have statistics on it at the time of the ALTER TABLE (they are dropped first and re-collected afterwards)

In Teradata 14 all statistics values are limited to 26 characters by default. To get untruncated values you have to use the "USING MAXVALUELENGTH" clause in the COLLECT STATISTICS command.
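
For example, a collection that keeps longer, untruncated values might look like this (hypothetical column and table names; the length 50 is just an illustration):

COLLECT STATISTICS USING MAXVALUELENGTH 50 COLUMN (col1) ON dbtest.tab1;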

The other requirement affects the algorithm more: you cannot alter a column while statistics exist on it.
 
The advantages are:
  • No extra cost for obtaining the values for compression
  • Good compression results with a simple algorithm

This simple solution, which fits on one page, has some disadvantages:

  • The procedure does not take an existing compress value list into account
  • The algorithm does not handle multi-column statistics

The algorithm

First, for each column of the table to be compressed that has statistics, we execute "SHOW STATISTICS VALUES COLUMN". From this output we take the number of nulls and the values of the biased-values block. Based on the number of occurrences we decide which values go into the multi-value compress list; at the moment a value must be estimated to account for more than 1% of the data. With this limit it can never happen that we end up with more than 100 compress values. In parallel we create a "DROP STATISTICS" and a "COLLECT STATISTICS COLUMN ... ON ... VALUES (...);" statement to put the statistics back. With these three files we first drop the statistics, perform the ALTER TABLE statement, and after that put the statistics back.
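
Applied to the sample output above (3180 rows, 20 nulls, 'N' occurring 3147 times, 'Y' occurring 13 times), the 1% cutoff is roughly 32 rows, so only 'N' qualifies; 'Y' and NULL stay uncompressed. Assuming that statistic belongs to a column col1 of dbtest.tab1 (hypothetical names), the generated statements would look roughly like this:

DROP STATISTICS COLUMN col1 ON dbtest.tab1;
ALTER TABLE dbtest.tab1 ADD col1 COMPRESS ('N');
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab1 VALUES (...);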

The process

The algorithm consists of a SQL file and an awk script. The SQL file generates the "SHOW STATISTICS VALUES COLUMN" commands for the columns of the tables in a useful order:

SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col2 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col3 on dbtest.tab1;
SHOW STATISTICS VALUES COLUMN col1 on dbtest.tab2;
SHOW STATISTICS VALUES COLUMN col2 on dbtest.tab2;

These commands have to be executed with BTEQ and their output stored in one file. The awk script takes this file and produces a larger file:

DROP STATISTICS column col1 on dbtest.tab1;
DROP STATISTICS column col2 on dbtest.tab1;
DROP STATISTICS column col3 on dbtest.tab1;
ALTER TABLE dbtest.tab1 add col1 compress ( ...)
, add col2 compress ( ...)
, add col3 compress ( ...)
;
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab1 VALUES (...);
COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab1 VALUES (...);
COLLECT STATISTICS COLUMN ( col3 ) ON dbtest.tab1 VALUES (...);


DROP STATISTICS column col1 on dbtest.tab2;
DROP STATISTICS column col2 on dbtest.tab2;
ALTER TABLE dbtest.tab2 add col1 compress ( ...)
, add col2 compress ( ...)
;
COLLECT STATISTICS COLUMN ( col1 ) ON dbtest.tab2 VALUES (...);
COLLECT STATISTICS COLUMN ( col2 ) ON dbtest.tab2 VALUES (...);

 

Executing these statements performs the compression. Finished.

The source code

SQL File

SELECT
    'SHOW STATISTICS VALUES COLUMN '||(trim (both from a.columnname))||' on '||(trim(both from a.databasename))||'.'||(trim(both from a.tablename))||';' as stmt
FROM
    dbc.ColumnStatsV a
INNER JOIN
    dbc.columns b
ON  a.databasename=b.databasename
    AND a.tablename=b.tablename
    AND a.columnname=b.columnname
LEFT OUTER JOIN
    dbc.PartitioningConstraintsV c
ON  a.databasename=c.databasename
    AND a.tablename=c.tablename
    AND upper(c.constrainttext) LIKE '%'||(upper(a.columnname))||'%'
WHERE
    c.constrainttext is null
    AND a.indexnumber is null
    AND a.databasename='${DB}'
    AND (a.databasename,a.tablename,a.columnname) not in (select databasename,tablename,columnname from dbc.indices)
    AND (a.databasename,a.tablename) in (select databasename,tablename from dbc.tables where tablekind='T')
ORDER BY a.databasename,a.tablename,a.columnname;

AWK File

BEGIN   { CUTPERCENTAGE=1;
          print ".errorlevel (3582) severity 0";
          print ".errorlevel (6956) severity 0";
          print ".errorlevel (5625) severity 0";
          print ".errorlevel (3933) severity 0";
        }
/            COLUMN \(/ { COL=$3; }
/                ON / { DBTAB=$2; }
/^ \/\*\* / { BIASEDON=0; }
/Data Type and Length/ { DATATYPE=substr($6,2,2); }
/NumOfRows/ { CUTROWS=$4*CUTPERCENTAGE/100;
              BIASED=="";
              if (0+CUTROWS<0+NULLROWS) BIASED="NULL,";
            }
/\/\* NumOfNulls/ { NULLROWS=$4; }
/^ \/\*\* Biased:/ { if (DATATYPE!="TS"&& DATATYPE!="AT"&& DATATYPE!="DS") BIASEDON=1;}
/^ \/\* / { if (BIASEDON==1)
                {
                if (CUTROWS < 0+gensub(".*,","","",gensub(",? ?$","","g")))
                        {
                        BIAS=gensub("^ */[^/]*/","","g",gensub(",[0-9 ]*,? ?$","","g"));
                        if (index (BIASED,BIAS)==0)
                                BIASED=BIASED BIAS ",";
                        }
                }
        }
/^COLLECT STATISTICS/   { COLSTATON=1; }
        {       if (COLSTATON==1) COLSTAT=COLSTAT "\n" $0; }

/^);/   { BIASEDON=0;
        COLSTATON=0;
        if (BIASED=="")
                {
                COLSTAT="";
                next;
                }
        if (DBTAB!=DBTABOLD)
                {
                if (DBTABOLD!="")
                        {
                        print DROPSTATS;
                        print ALTERTABLE ";";
                        print COLSTATALL;
                        COLSTATALL="";
                        }
                ALTERTABLE="ALTER TABLE " DBTAB " ADD " COL " COMPRESS (" gensub(",$","","",BIASED) ")";
                DROPSTATS="DROP STATISTICS COLUMN " COL " ON " DBTAB ";";
                DBTABOLD=DBTAB;
                }
        else
                {
                ALTERTABLE=ALTERTABLE "\n""        ,ADD " COL " COMPRESS (" gensub(",$","","",BIASED) ")";
                DROPSTATS=DROPSTATS "\n""DROP STATISTICS COLUMN " COL " ON " DBTAB ";";
                }
        COLSTATALL=COLSTATALL "\n" COLSTAT;
        COLSTAT="";
        BIASED="";
        }

END     {
        print ";";
        }
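
Putting the pieces together, the overall pipeline described above can be sketched roughly as follows (all file names are hypothetical, each BTEQ input is assumed to start with a .LOGON line, and in practice the generating steps need BTEQ export/format settings such as .EXPORT REPORT and a sufficient .SET WIDTH so that only clean SQL ends up in the intermediate files):

bteq < gen_show.sql  > show_cmds.sql     # SQL file above: generates the SHOW STATISTICS VALUES commands
bteq < show_cmds.sql > stats_values.txt  # executes them; detailed statistics output collected in one file
awk -f mvc.awk stats_values.txt > compress.sql   # awk script above: builds DROP STATISTICS / ALTER TABLE / COLLECT STATISTICS
bteq < compress.sql                      # performs the compression and puts the statistics back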

 

First Results and Motivation

As a Teradata customer we run an Appliance instance with about 10 TB of user data. Within a few hours of running these scripts we reduced our used space by 20%.

Unfortunately this is the only instance on which I can test the scripts at the moment, so further improvements and remarks are very welcome.

Last, but not least, thanks to Dieter Nöth (dnoeth) for the tips.


Determining Precedence Among TASM Planned Environments


Most Teradata systems support different processing windows, each of which has a somewhat different set of business priorities.  From 8 AM to 12 noon the most important work may be the Dashboard application queries.  But from 12 noon to 6 PM it could be the ad hoc business users.  At night maybe it’s the batch loads and reporting.  Planned Environments function within the TASM state matrix to support the ability to automatically manage changes to the workload management setup during those different windows of time.

The state matrix represents more than just processing windows, however.  It intersects your Planned Environments (time-based processing windows) with any system Health Conditions you may have defined (such as high AMP worker task usage levels, or hardware disabilities).  TASM moves you from one Planned Environment to another or one Health Condition to another based on events that you have defined being triggered (such as time of day).

At the intersection of the two dimensions are the TASM states, which contain the actual workload management setup definitions.  A state may contain different values for some of the TASM settings, such as throttle limits or workload priorities. So when you change state, you are changing some of the rules that TASM uses when it manages your workloads.

Example of a State Matrix

The figure below illustrates a simple 4 x 3 state matrix.  The Planned Environments go along the top, and the Health Conditions go along the left side.   The same state can be, and often is, used at multiple intersection points.

How TASM Decides Which State to Switch to  

At each event interval, all system events are checked to see if a Health Condition needs to be changed.   After that is done, all defined Planned Environments are checked to see if a different Planned Environment should now be in effect.   It is possible for both the Planned Environment and the Health Condition to change on a given event interval.   Once the correct ones are established, their intersection points to the state that they are associated with. 

If more than one Planned Environment meets the current conditions (for example, one planned environment is for Monday and the other is for the end of the month and the end of the month falls on a Monday), then the Planned Environment with the highest precedence number wins. This will usually be the one in the rightmost position in the state matrix. 

When the Planned Environments are evaluated to see which one should be active, internally the search is performed from the right-most state (the one with the highest precedence) and moves leftwards, stopping at the first Planned Environment that fits the criteria.  For Health Conditions, the search starts at the bottom, with the one with the highest severity, stopping at the first Health Condition that qualifies.   That way, if more than one of those entities is a candidate for being active, the one that is deemed most important or most critical will be selected first. 

Changing Precedence

Viewpoint Workload Designer allows you to set the precedence order you wish by dragging and dropping individual planned environments.  In the state matrix shown above, assume that the Planned Environments are in the order in which they were created, with Monday being the right-most because it was added most recently.  But if your business dictated that EndOfMonth processing was more important and should be in effect during the times when Monday ended up being at the end of the month (a two-way tie), then you would want to drag EndOfMonth over to the right-most position.  Your state matrix would then look like this:

Note that when you drag a Planned Environment to a different position, the cells below it that represent the states associated with that Planned Environment are moved as well.

In addition to looking at your Workload Designer screens, you can also see which Planned Environment has the highest precedence amongst two that might overlap by looking at tdwmdmp -a output.  

Under the heading "State and Event Information" tdwmdmp output will show you each Operating Environment (AKA Planned Environment) that is available, with their precedence order.  If more than one is eligible, the one with the higher precedence wins.  In the output below, the EndOfMonth Planned Environment will win if another Planned Environment is triggered at the same time.

Tdwmdmp -a also tells you the current Planned Environment.

State and Event Information:

Operational Environments:

 PRECE  OPENV     OPERATING ENVIRONMENT
 DENCE    ID             NAME               ENAB  CURR  DFLT
 -----  ------  --------------------------  ----  ----  ----
     4      77  EndOfMonth                   Yes
     3      74  Monday                       Yes   Yes
     2      75  Weekday                      Yes
     1      76  Always                       Yes         Yes


Ordered Analytical Functions: Translating SQL Functions to Set SQL

Porting stored procedures that contain functions to Teradata

Many experienced SQL and procedure developers use SQL functions to perform common tasks and simplify their code narratives. The function concept is a good programming technique, but adopting it literally in a Set SQL statement may force the statement to execute as a serial process; that is, it can make parallel processing impossible.  This is one reason Teradata SQL does not allow functions to access tables.

There are many types of user-defined functions (UDFs), and they can be written in C/C++, Java or SQL.  The functions that concern us here are scalar functions that can be written in the procedural languages of other database systems, such as Oracle PL/SQL and Microsoft SQL Server T-SQL.  Teradata SPL (Stored Procedure Language) offers no facility for writing a scalar function, so there is no direct method of translating this sort of code.

However, most functions that contain only logic and arithmetic can be translated to Teradata SQL functions, and these perform very well.  Deploying them is different, though, because they cannot simply be defined within a stored procedure: they are created with the Create Function SQL DDL command, and permissions are granted just as with any other function.

Functions that issue SQL commands cannot be translated to Teradata functions, but the data access and logic in such functions can almost always be translated to a form of Set SQL that is incorporated in the main query as a derived table or a view.

To illustrate these methods, let's look at some examples of logic implemented with functions in other database systems and the translation to Teradata set-level processing.

First, consider an example of an Update statement in a procedure using a function that could be used in parallel processing. The Update statement calls a function that contains only logic and does not access any tables.

PROCEDURE Update_Table1() IS
BEGIN

FUNCTION Get_Desc_ID(Descr varchar2) RETURN INTEGER IS
DescID INTEGER := 0;
BEGIN
 IF Descr = 'High' then DescID := 4;
 ELSIF Descr = 'Medium' then DescID := 3;
 ELSIF Descr = 'Low' then DescID := 2;
 ELSIF Descr = 'Trivial' then DescID := 1;
 END IF;
 RETURN DescID;
END Get_Desc_ID;

UPDATE Table1
Set DescID = Get_Desc_ID(Table1.Descr);
END Update_Table1;

In Teradata SQL we can create an equivalent SQL function:

REPLACE FUNCTION Get_Desc_ID(Descr VARCHAR(100)) RETURNS INTEGER  -- VARCHAR replaces Oracle's VARCHAR2; length chosen arbitrarily
LANGUAGE SQL
CONTAINS SQL
RETURNS NULL ON NULL INPUT
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN CASE Descr WHEN 'High' THEN 4
 WHEN 'Medium' THEN 3
 WHEN 'Low' THEN 2
 WHEN 'Trivial' THEN 1
 ELSE 0
END

Note that there are several phrases required in Teradata SQL Function Create statements, but most of these will be the same for every SQL function.
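
With the function in place (and EXECUTE FUNCTION granted as needed), the Update from the original procedure can be written directly against it; a minimal sketch, assuming the column is named Descr as above:

UPDATE Table1
SET DescID = Get_Desc_ID(Table1.Descr);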

Now consider a different implementation of this function that does a table lookup rather than burying the values within the function code:

PROCEDURE Update_Table2() IS
BEGIN

FUNCTION Get_Desc_ID(Descr varchar2) RETURN INTEGER IS
DescID INTEGER := 0;
BEGIN
 SELECT DescID INTO :DescID FROM DESC_TABLE WHERE DESCR = :Descr;
 RETURN DescID;
END Get_Desc_ID;

UPDATE Table2
Set DescID = Get_Desc_ID(Table2.Descr);
END Update_Table2;

This function issues an SQL command. If the database system executes this literally as written, then, for each row in Table2, it will have to scan the DESC_TABLE looking for a match. This may not be much overhead for a sequential database that can buffer all of DESC_TABLE so that it stays in memory, but a parallel processing database would need to buffer a copy of DESC_TABLE for each unit of parallelism.  When the process becomes more complex, as we shall see, buffering becomes impractical.

As noted, Teradata functions cannot access tables, so the table access in this function is rewritten as a join in the main query, like this:

UPDATE Table2
FROM DESC_TABLE
SET DescID = DESC_TABLE.DescID
WHERE Table2.Descr = DESC_TABLE.Descr;

To summarize to this point:
1. Teradata functions can be written in SQL language, but they cannot access tables. This limits their power to what can be expressed in pure logic (CASE operations), arithmetic and data type transformations.
2. If a function from another database system is being converted to Teradata, and it accesses data, then it must be rewritten as something that can be joined to the SQL statements that use it.

Now consider a complex function that contains both logic and DML statements: one that computes the average daily balance of a bank account from all the transactions in a given month.  Think of a table of transactions that looks like this:

 

Account  Date        Time      Amount    Balance
1        2013-10-02  15:32:33  1000.00   2500.00
1        2013-10-02  16:44:44   500.00   3000.00
1        2013-10-02  16:44:45   700.00   3700.00
1        2013-10-03  12:00:00  -700.00   3000.00
2        2013-10-04  09:12:34  1000.00   1000.00
2        2013-10-04  12:00:00  -500.00    500.00
3        2013-10-14  12:00:00  -500.00    500.00

 

The average daily balance is the sum of all the balances in an account for each day in a month, divided by the number of days in a month.  The transaction table does not contain a record for each day of the month, but we know that the balance is constant between two sequential transactions.  So if the balance in Account 1 is 3000 on October 3, it is also 3000 on October 4, 5, and so on until the next transaction.  We can quickly add the balances for each day between transactions by multiplying the balance after the first transaction by the number of days between transactions.

This gives the balance from the day of the first transaction through the end of the month, but we still need to compute the balance for the days preceding the first transaction.  We know that it must be the balance shown in the first transaction, minus the transaction amount, and we can multiply this by the number of days preceding that transaction for the month.

Given the transaction data for Account 1, we can visualize the balance for each day like so:

2013-10-01 1500.00
2013-10-02 3700.00
2013-10-03 3000.00
2013-10-04 3000.00
2013-10-05 3000.00
...
2013-10-30 3000.00
2013-10-31 3000.00

The average daily balance for Account 1 is ( 1500 + 3700 + 3000 + ... + 3000 ) / 31, or, stated more simply, ( 1500 + 3700 + 3000*29 ) / 31.

The average daily balance for Account 2 is ( 0*3 + 500*28 ) / 31.  The average daily balance for Account 3 is ( 1000*13 + 500*18 ) / 31.

Here is the pseudo code for a function that would compute this. It requires that transactions be sorted by Account, Date, and Time.

Read the first transaction date, amount and balance for Account X in Month M.
Multiply ( balance - amount) by the number of days in the month preceding the transaction.
Store the result in Accum_Balance.
Store the date and balance from the first transaction.
DO:
 Read next transaction; Exit loop if no-more-transactions
 If the date is different from the prior transaction:
  multiply the days between this transaction and the prior by the prior balance.
  Add the result to Accum_Balance.
 Save the date and balance.
END-DO
Multiply the balance from the last transaction by the number of days remaining in the month.
Add the result to Accum_Balance.
Return Accum_Balance divided by the number of days in Month M.

There are three challenges with this function: it contains complex logic, it reads a table, and the data has to be processed sequentially.

The fact that it reads a table tells us that this function should be implemented as a derived table within any Set SQL statement that needs to use it.  If it is used in several places, it might be stated as a view.

The fact that this process needs to see data in a particular sequence tells us that, if we can do it in Set SQL, it will have to use an ordered analytical function.

Take the steps in the function one at a time and consider how they might be done in Set SQL.  The first step is to compute the balance before the first transaction.  This requires that we read the first transaction.  The SQL will look like this:

select acctid, txndate, bal - amt
 ,extract(day from txndate) - 1 as NrDays
from TxnTable
qualify [transaction date and time]
 = min( [transaction date and time] ) over(partition by AcctID)
 and extract(month from txndate) = :vMonth

The predicate has to use an ordered analytical function to identify the first transaction date-time, so the keyword QUALIFY is used instead of WHERE.

To handle the loop on transactions by Account and date-time, we can use another ordered analytical function:

select acctid, txndate, bal
 ,coalesce(
  extract ( day from
   max(txndate) over(partition by AcctID order by txndate
    rows between 1 following and 1 following)
  ),
  :vMonthDays) NextDay
 ,NextDay + 1 - extract(day from txndate) as NrDays
from
 ( select acctid, txndate, bal from txntable
  qualify txntime = max(txntime) over(partition by AcctID, txndate)
   and extract(month from txndate) = :vMonth
 ) M

The expression:
 

  extract ( day from
   max(txndate) over(partition by AcctID order by txndate
    rows between 1 following and 1 following) )

returns the day of the month of the next transaction following the current row. If this is the last transaction of the month, there is no next transaction, so coalesce() returns the last day of the month in that case.

The derived table: 

   select acctid, txndate, bal from txntable
  qualify txntime = max(txntime) over(partition by AcctID, txndate)
   and extract(month from txndate) = :vMonth

contains the last transaction of each day; the other transactions that occurred that day are not needed to calculate the ending balance for the day.

Now we have the balances and the number of days for each balance. The UNION set operator combines the calculation for the beginning of the month with the calculations on the other days.  All that remains is to add these numbers together and divide by the number of days in the month; this is easy if we place the UNION of the two computations in a derived table. 

 select acctid, sum(bal*NrDays)/:vMonthDays
 from (
  select acctid, txndate, bal
   ,coalesce(
    extract ( day from
    max(txndate) over(partition by AcctID order by txndate
      rows between 1 following and 1 following)
    ),
    :vMonthDays) NextDay
   ,NextDay + 1 - extract(day from txndate) as NrDays
  from
   ( select acctid, txndate, bal from txntable
   qualify txntime = max(txntime) over(partition by AcctID, txndate)
        and extract(month from txndate) = :vMonth
   ) M
 union
  /* Compute balance from the beginning of the month to 1st transaction */
  select acctid, txndate, bal - amt
   ,0 NextDay
   ,extract(day from txndate) - 1 as NrDays
  from txntable
  qualify cast( cast(txndate as timestamp(0)) +
  cast(cast(txntime as char(8)) as interval hour to second) as timestamp(0) )
   = min( cast( cast(txndate as timestamp(0)) +
  cast(cast(txntime as char(8)) as interval hour to second) as timestamp(0) ) )
   over(partition by AcctID)
      and extract(month from txndate) = :vMonth
 ) T (acctid, txndate, bal, NextDay, NrDays)
 group by 1

If this Select statement were a view, it would be easy to generate a report using a Select statement or update the Accounts table with an Update-join.  However, a view cannot contain host variables. This Select uses the two host variables vMonth and vMonthDays, which tell us the month and how many days it has.  One way to solve this is to assume this will be run only on the prior month's data: then one could write a couple of simple SQL functions like Get_Desc_ID, above, that return the prior month and the number of days.  Another option is to put these values in a lookup table and join to that table in the outer select statement.

If this does not have to be a view, then it can be a derived table in a stored procedure that takes the month (and perhaps the number of days) as parameters.  An Update statement in such a procedure would look like this: 

UPDATE Cust_Table
FROM
(
 select acctid, sum(bal*NrDays) / :vMonthDays
 from (
  select acctid, txndate, bal
   ,coalesce(
    extract ( day from
    max(txndate) over(partition by AcctID order by txndate
      rows between 1 following and 1 following)
    ),
    :vMonthDays) NextDay
   ,NextDay + 1 - extract(day from txndate) as NrDays
  from
   ( select acctid, txndate, bal from txntable
   qualify txntime = max(txntime) over(partition by AcctID, txndate)
        and extract(month from txndate) = :vMonth
   ) M
 union
  /* Compute balance from the beginning of the month to 1st transaction */
  select acctid, txndate, bal - amt
   ,0 NextDay
   ,extract(day from txndate) - 1 as NrDays
  from txntable
  qualify cast( cast(txndate as timestamp(0)) +
  cast(cast(txntime as char(8)) as interval hour to second) as timestamp(0) )
   = min( cast( cast(txndate as timestamp(0)) +
  cast(cast(txntime as char(8)) as interval hour to second) as timestamp(0) ) )
    over(partition by AcctID)
      and extract(month from txndate) = :vMonth
 ) T (acctid, txndate, bal, NextDay, NrDays)
 group by 1
) AvgDailyBal (AcctID, Bal)

Set ADB = AvgDailyBal.Bal
WHERE  Cust_Table.ACCTID = AvgDailyBal.AcctID

We frequently encounter stored procedures that contain complex PL/SQL or T-SQL functions, but functions are certainly not required for complex operations.  Rewriting them as views or derived tables, using ordered analytical functions when needed, simplifies code maintenance and enables parallel set processing.

 


.NET Data Provider for Teradata supports Visual Studio 2012 and 2013


.NET Data Provider for Teradata version 14.11.0.1 (or above) supports integration with Visual Studio 2012 and Visual Studio 2013. This exposes the .NET Data Provider for Teradata objects necessary for development of ADO.NET applications utilizing the Teradata Database within Microsoft Visual Studio 2012 and 2013.

The following features are supported:

  • Server Explorer Support
  • Microsoft Query Designer support
  • Toolbox Support
  • Window Form Designer Support
  • Dataset Designer Support
  • Data Binding Support
  • SQL Server Analysis Services Support
  • Teradata Generated Data Behavior Support
  • Microsoft Visual Studio IntelliSense Support
  • Help Integration
  • Entity Data Model Generation

For example, Toolbox Support:

The .NET Data Provider for Teradata objects can be added to the Visual Studio Toolbox and dragged and dropped onto design surfaces within Visual Studio.

For more information on Server Explorer support, see:

http://developer.teradata.com/connectivity/articles/visual-studio-server-explorer-integrated-with-net-data-provider-for-teradata


Statistics Threshold Functionality 101


An earlier blog post focused on simple steps to get started using the Teradata 14.10 Automated Statistics Management (AutoStats) feature.  One of the new capabilities that AutoStats relies on when it streamlines statistics collection is the new “Threshold” option.  Threshold applies some intelligence about when statistics actually need to be re-collected, allowing the optimizer to skip some recollections.  

Although you will probably want to begin relying on AutoStats when you get to 14.10, you don’t have to be using AutoStats to take advantage of threshold, as the two features are independent from one another.  This post will give you a simple explanation of what the threshold feature is, what default threshold activity you can expect when you get on 14.10, and what the options having to do with threshold do for you.  And you’ll get some suggestions on how you can get acquainted with threshold a step at a time.  

For more thorough information about statistics improvements in 14.10, including the threshold functionality, see the orange book Teradata Database 14.10 Statistics Enhancements by Rama Korlapati.

What Does the Threshold Option Do?

When you submit a COLLECT STATISTICS statement in 14.10, it may or may not execute.  A decision is made whether or not there is a value in recollecting these particular statistics at the time they are submitted.  That decision is only considered if threshold options are being used.

Threshold options can exist at three different levels, each of which will be discussed more fully in their own section below.  This is a very general description of the three levels:

  1. System threshold:   This is the default approach for applying thresholds for all 14.10 platforms.  The system threshold default is not a single threshold value.  Rather this default approach determines the appropriate threshold for each statistic and considers how much the underlying table has changed since the last collection.
  2. DBA-defined global thresholds:  These optional global thresholds override the system default, and rely on DBA-defined fixed percentages as thresholds.   Once set, all statistics collection statements will use these global threshold values, unless overridden by the third level of threshold options at the statement level.
  3. Thresholds on individual statements:  Optional USING clauses that are attached to COLLECT STATISTICS statements can override the system default or any global DBA-defined thresholds when there is a need for customization at the individual statistic level.    

Whichever threshold level is being used, if the optimizer determines that the threshold has not been met, no statistics will be collected, even though they have been requested.  When a collection has been asked for but has not been executed, the StatsSkipCount column in the DBC.StatsTbl row that represents that statistic is incremented.

StatsSkipCount appears as an explicit column in the view, but in the base DBC.StatsTbl  StatsSkipCount is carried in the Reserved1 field.  When StatsSkipCount is zero it means that the most recent COLLECT STATISTICS request was executed.
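
As a quick way to see how often recollections are being skipped, you can query the dictionary view; a minimal sketch, assuming the DBC.StatsV view exposes the StatsSkipCount column described above alongside the usual name columns:

SELECT DatabaseName, TableName, ColumnName, StatsSkipCount
FROM DBC.StatsV
WHERE DatabaseName = 'SandBoxDB'   -- hypothetical database name
ORDER BY StatsSkipCount DESC;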

Ways That a Threshold Can Be Expressed

The system setting (level 1) for threshold logic is not one threshold value applied to all statistics collections.  Rather, when enabled, the setting tells the optimizer to hold back the execution of a collection submission based on whatever it deems an appropriate threshold for that statistic at that point in time.  This high-level setting uses a “percent of change” type of threshold only.

Statistics collection thresholds are explicitly specified when using DBA-defined global settings or individual statement thresholds are used.  These explicit thresholds can be expressed as a percent of change to the rows of the table upon which statistics are being collected, or as time (some number of days) since the last collection.  

The most reliable way to express thresholds is by means of a percent of table change.  That is why the highest level system setting, the one that is on by default, only supports percent of change thresholds.  Time as a threshold must be explicitly specified in order to be used.

Importance of DBQL USECOUNT Logging

The recommended percent of change thresholds rely on having DBQL USECOUNT logging turned on.  See my earlier blog on AutoStats for an explanation of USECOUNT DBQL logging.  USECOUNT logging is a special type of DBQL logging that is enabled at the database level.  Among other things, USECOUNT tracks inserts, deletes and updates made to tables within a database, and as a result, can provide highly accurate information to the optimizer about how the table has changed since the last statistics collection.

The default system threshold functionality can be applied to a statistics collection only if USECOUNT logging has been enabled for the database that the statistics collection table belongs to.  In the absence of USECOUNT data, the default threshold behavior will be ignored. However, both DBA-defined global thresholds and statement-based thresholds are able to use percent of change thresholds even without USECOUNT logging, but with the risk of less accuracy.

In cases where USECOUNT logging is not enabled, percent of change values are less reliable because the optimizer must rely on random AMP sample comparisons.  Such comparisons consider estimated table row counts (the size of the table) since the last statistics collection.  This can mask some conditions, like deletes and inserts happening in the same timeframe.  Comparisons based strictly on table row counts are not able to detect row updates, which could change column demographics.  For that reason, it is recommended that USECOUNT logging be turned on for all databases undergoing change once you get to 14.10.

Percent of change is the recommended way to express thresholds when you begin to use the threshold feature in 14.10.   Time-based thresholds are offered as options primarily for sites that have evolved their own in-house statistics management applications at a time when percent of change was unavailable, and wish to continue to use time.

The next three sections discuss the three different levels of threshold settings.

More about the System Threshold Option

All 14.10 systems have the system threshold functionality turned on by default.  But by itself, that is not enough.   USECOUNT logging for the database must also be enabled.   If USECOUNT DBQL logging is turned on, then each COLLECT STATISTICS statement will be evaluated to see if it will run or be skipped.    

During this evaluation, an appropriate change threshold for the statistic is established by the optimizer.  The degree of change to the table since the last collection is compared against the current state of the table, based on USECOUNT tracking of inserts, deletes and updates performed.   If the change threshold has not been reached, and enough history has been collected for this statistic (usually four or five full collections) so that the optimizer can perceive a pattern in the data and confidently perform extrapolations, then this statistics collection will be skipped.

Even if the percent of change threshold has not been reached (indicating that statistics can be skipped), if there are insufficient history records, the statistics will be recollected.  And even with 10 or 20 history records, if there is no regular pattern of change that the optimizer can rely on to make reasonable extrapolations, statistics will be recollected.

There is a DBS Control record parameter called SysChangeThresholdOption which controls the behavior of the system threshold functionality.  This parameter is set to zero by default.  Zero means that as long as USECOUNT logging in DBQL is enabled for the database that the table belongs to, all statistics collection statements will undergo a percent of change threshold evaluation, as described above.

If you want to maintain the legacy behavior, threshold logic can be turned off completely at the system level by disabling the SysChangeThresholdOption setting in DBS Control (set it to 3).  This field, along with the DBA-defined global threshold parameters, can be found in the new Optimizer Statistics fields of DBS Control.

It is important to re-emphasize that DBQL USECOUNT logging must be enabled for all databases for which you want to take advantage of the system threshold functionality.  In addition, all other lower-level threshold settings must remain off (as they are by default) in order for the system threshold to be in effect.

More about DBA-Defined Global Thresholds

While it is recommended that the system threshold setting be embraced as the universal approach, there are some sites that have established their own statistics management processes prior to 14.10.  Some of these involve logic that checks on the number of days that have passed since the last collection as an indicator of when to recollect. 

In order to allow those statistics applications to continue to function as they have in the past within the new set of threshold options in 14.10, global threshold parameters have been made available.  These options are one step down from the system threshold and, once set, cancel out use of the system default threshold.

There are two parameters in the same section of DBS Control Optimizer Statistics Field that allow you to set DBA-defined thresholds:

DefaultUserChangeThreshold:   If this global threshold is modified with a percent of change value (some number > 0), then the system default threshold will be disabled, and the percent defined here will be used to determine whether or not to skip statistics collections globally.

Unlike the system default, this global setting falls back to random AMP samples if DBQL USECOUNT logging has not been enabled.  Using random AMP samples is somewhat less reliable, particularly in cases where there are updates, or deletes accompanied by inserts, rather than just inserts.

DefaultTimeThreshold:    This global setting provides backward compatibility with home-grown statistics management applications that rely on the passage of time.  Using a time-based threshold offers a less precise way of determining when a given statistic requires recollection.   Some tables may undergo large changes in a 7-day period, while others may not change at all during that same interval.  This one-size-fits-all approach lacks specificity and may result in unneeded resource usage.

More about Statement-Level Thresholds

USING THRESHOLD syntax can be added manually to any COLLECT STATISTICS statement.

COLLECT STATISTICS USING THRESHOLD 10% AND THRESHOLD 15 DAYS 
COLUMN TestID ON SandBoxT1;

When you use USING THRESHOLD, it will override any default or global threshold settings that are in place.   See the Teradata Database 14.10 Statistics Enhancements orange book for detailed information about the variations of statement-level options you can use for this purpose.

For statement-based percent of change thresholds, the optimizer does not require that there be a history of past collections.  If data change above the specified threshold is detected, statistics will be collected; otherwise they will be skipped.

Statement-level thresholds are for special cases where a particular statistic needs to be treated differently than the higher-level default parameters dictate.  They can also be useful when you are getting started with threshold and want to limit the scope to just a few statistics.

Getting Started Using Threshold

Here are some suggestions for sites that have just moved to 14.10 and want to experience how the threshold logic works on a small scale before relying on the system and/or global options:

  1. Pick a small, non-critical database.
  2. Enable DBQL USECOUNT logging on that database:
BEGIN QUERY LOGGING WITH USECOUNT ON SandBoxDB;
  1. Disable the system threshold parameter by setting DBS Control setting: 

SysChangeThresholdOption = 3

  1. Leave the global parameters disabled as they are by default:

DefaultUserChangeThreshold = 0

DefaultTimeThreshold = 0

  1. Add USING THRESHOLD 10 PERCENT to the statistics collection statements just for the tables within your selected database:
 COLLECT STATISTICS USING THRESHOLD 10% COLUMN TestID ON SandBoxT1;
  6. Insert a few rows into the table (less than 10% of the table size) and run an Explain of the statistics collection statement itself, and it will tell you whether or not skipping is taking place.  
EXPLAIN COLLECT STATISTICS COLUMN TestID ON SandBoxT1;

See page 39 of the Teradata Database 14.10 Statistics Enhancements orange book for some examples.

Summary of Recommendations for Threshold Use

The following recommendations apply when you are ready to use the threshold functionality fully:

  1. If your statistics are not under the control of AutoStats, make no changes and rely on the system threshold functionality to appropriately run or skip your statistics collection statements.
  2. Always turn on USECOUNT logging in DBQL for all databases for which statistics are being collected and where you are relying on system threshold.
  3. If you have your own statistics management routines that rely on executing statistics only after a specific number of days have passed, set the DefaultTimeThreshold to meet your threshold criteria.  You should experience similar behavior as you did prior to 14.10.  Over time, consider switching to a change-based threshold and re-establishing the system threshold, as it will be more accurate for you.
  4. Don’t lead with either of the DBA-defined global parameters DefaultUserChangeThreshold or DefaultTimeThreshold unless there is a specific reason to do so.
  5. Use the statement-level threshold only for special statistics that need to be handled differently from the system or the DBA-defined global defaults.
  6. Favor percent of change over number of days as your first choice for the type of threshold to use.  
  7. But if USECOUNT is not being logged, then rely on time-based thresholds and set DefaultTimeThreshold at your preferred number of days (a small sketch follows this list).
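As a minimal sketch of recommendation 7 (the 7-day value is only an illustration, and the table and column names reuse the SandBoxT1 example from earlier), a time-based threshold can be expressed either through the global DBS Control parameter or at the statement level:

DefaultTimeThreshold = 7

COLLECT STATISTICS USING THRESHOLD 7 DAYS
COLUMN TestID ON SandBoxT1;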

 


Filter on descriptive text, without date lookup.


Translating descriptive dates, like “last week”, with inline SQL code to achieve partition elimination.

One frequently used feature of BI tools is to present drop-down lists of values. The purpose is to provide limiting parameters used in the creation of a report or selection. Within the database, these parameters are used to limit the data selected and (hopefully) use partition elimination on the larger fact tables.

Most commonly, the Fact tables are partitioned by transaction date. The reporting and selection that is run on these tables either specifies a single date or a date range.

When actual dates are provided, all types of selection are able to use partition elimination.

However, when the parameter first needs a lookup to be translated to a date, like “last week”, partition elimination can become more problematic.

If the selection results in a single value that can be compared to the partition column in an equal (=) test, then partition elimination works. The optimizer creates a temporary variable, and the actual value is inserted into this variable during execution. When looking in the explain plan, you will see this variable being used in the retrieve or join step on the table with partitioning. This is called delayed partition elimination.
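For instance, a minimal sketch of that lookup pattern (the Date_Lookup table, its columns, and the label value are hypothetical), where the lookup resolves to a single value compared with an equality test:

SEL FACT.*
FROM  Daily_transactions_table FACT
WHERE FACT.Activity_DATE =
      ( SELECT MAX(Ref_Date)            /* lookup must resolve to a single value */
        FROM   Date_Lookup
        WHERE  Date_Label = 'YESTERDAY' );

Check the Explain to confirm the temporary variable is used in the retrieve step on the partitioned table.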

Unfortunately, this delayed partition elimination only works for an equality test. When we need to use the commonly used BETWEEN clause, partition elimination with a lookup does not work.

In some cases we can create a lookup table, where a list of dates can be found for each combination of the selection parameters. Then this list is used in a product join, achieving Dynamic partition elimination (DPE); a sketch follows. Unfortunately this does not always perform well, as the filtering has to be done in a (costly) product join.
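A sketch of that approach (the Period_Dates lookup table and its columns are hypothetical; whether the optimizer actually chooses a product join with DPE depends on statistics and demographics):

SEL FACT.*
FROM  Daily_transactions_table FACT
INNER JOIN Period_Dates PD
   ON PD.Calendar_Date = FACT.Activity_DATE     /* equality on the partitioning column */
WHERE PD.Period_Label  = 'LAST_WEEK';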

Fortunately, we have another option for achieving direct partition elimination with multiple-parameter lookups: translating the lookup into a date calculation.

For instance, the descriptive terms below are easily translated into a date calculation (a quick verification query follows the list).

  • YESTERDAY = CURRENT_DATE - 1
  • CURRENT_MONTH_START = CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE) + 1
  • PREVIOUS_MONTH_START = ADD_MONTHS(CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE) + 1, -1)
  • PREVIOUS_MONTH_END = CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE)
  • CURRENT_YEAR_START = ADD_MONTHS(CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE) + 1, -EXTRACT(MONTH FROM CURRENT_DATE) + 1)
  • CURRENT_YEAR_END = ADD_MONTHS(CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE) + 1, -EXTRACT(MONTH FROM CURRENT_DATE) + 13) - 1
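A quick way to sanity-check these expressions (Teradata allows a SELECT without a FROM clause) is to run them side by side:

SELECT CURRENT_DATE - 1                                                  AS Yesterday
     , CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE) + 1                 AS Current_Month_Start
     , ADD_MONTHS(CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE) + 1, -1) AS Previous_Month_Start
     , CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE)                     AS Previous_Month_End
     , ADD_MONTHS(CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE) + 1
                 ,-EXTRACT(MONTH FROM CURRENT_DATE) + 1)                 AS Current_Year_Start
     , ADD_MONTHS(CURRENT_DATE - EXTRACT(DAY FROM CURRENT_DATE) + 1
                 ,-EXTRACT(MONTH FROM CURRENT_DATE) + 13) - 1            AS Current_Year_End;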

We can take advantage of Transitive Closure (the optimizer’s elimination of conditions that can never be true, like 1=2) to code several of these translations into a single filter condition.

For example, take the SQL below:

SEL FACT.* FROM Daily_transactions_table FACT
WHERE ( /* filter to set the FROM for the BETWEEN (>=) */
      (  @FROM_DATE = 'TODAY'     AND FACT.Activity_DATE >= CURRENT_DATE   )
   OR (  @FROM_DATE = 'YESTERDAY' AND FACT.Activity_DATE >= CURRENT_DATE -1)
    )
AND ( /* filter to set the TO for the BETWEEN (<=) */
      (  @TO_DATE   = 'TODAY'     AND FACT.Activity_DATE <= CURRENT_DATE   )
   OR (  @TO_DATE   = 'TOMORROW'  AND FACT.Activity_DATE <= CURRENT_DATE +1)
    ) ;

In this example, the FROM and TO parameters are linked with date calculations. When we run this query and substitute YESTERDAY and TOMORROW for the FROM and TO parameters, what remains for the system to execute is:

SEL FACT.* FROM Daily_transactions_table FACT
WHERE ((  'YESTERDAY' = 'YESTERDAY' AND FACT.Activity_DATE >= CURRENT_DATE -1))
  AND ((  'TOMORROW'  = 'TOMORROW'  AND FACT.Activity_DATE <= CURRENT_DATE +1))

The @FROM_DATE = ‘TODAY’ filter is eliminated by Transitive Closure, as ‘YESTERDAY’ = ‘TODAY’ is not possible, and so is the @TO_DATE = ‘TODAY’ filter. This leaves simple BETWEEN logic with calculated dates, resulting in partition elimination.

Of course, we should add some tests to validate the dates entered, and some more options for the selection lists.

Below is a complete working example with date validation:

SEL FACT.txt_typ_cd, COUNT(*) Cntr
FROM       ProdDB.Large_Fact_Table FACT
INNER JOIN SYS_CALENDAR.CALENDAR   CAL
 ON CAL.Calendar_Date = FACT.activity_date
 /***************************************************************************************/
 /* Possible combinations for FROM - TO parameters are;
 /* TODAY  - TODAY
 /* TODAY  - MTD
 /* TODAY  - QTD
 /* TODAY  - YTD
 /* <valid date> - TODAY
 /* <valid date> - <valid date>
 /* PRIOR  - WEEK
 /* PRIOR  - 2WEEKS
 /* PRIOR  - MONTH
 /* PRIOR  - 2MONTHS
 /* PRIOR  - 3MONTHS
 /* All selections are assumed to be based on yesterday (CURRENT_DATE-1).
 /*************************************************************************************/
 /* replace the parameters with location of “list of values” (LOV) for use in BO;
 /* @FROM_PARM: @Prompt('Report Start:','A','Date LOVs\Relative Date LOV', MONO,FREE,not_persistent,{'TODAY'})
 /* @TO_PARM  : @Prompt('Report END:','A','Date LOVs\Month number LOV', MONO,FREE,not_persistent,{'MTD'})
 /* note: do this substitution last, as it makes the code near to unreadable.
 /*************************************************************************************/
AND ( (  /* for TODAY - TODAY */
         @FROM_PARM = 'TODAY'     AND @TO_PARM   = 'TODAY'     AND CAL.Calendar_Date  = CURRENT_DATE-1
      )
   OR (  /* for TODAY - MTD */
         @FROM_PARM = 'TODAY'     AND @TO_PARM   = 'MTD'     AND CAL.Calendar_Date
     BETWEEN CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1) +1
         AND CURRENT_DATE-1
      )
   OR (  /* for TODAY - QTD */
         @FROM_PARM = 'TODAY'     AND @TO_PARM   = 'QTD'     AND CAL.Calendar_Date
     BETWEEN ADD_MONTHS(CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1) +1
             ,-EXTRACT(MONTH FROM CURRENT_DATE-1)
              +(((EXTRACT(MONTH FROM CURRENT_DATE-1)+2)/3-1)*3+1))
         AND CURRENT_DATE-1
      )
   OR (  /* for TODAY - YTD */
         @FROM_PARM = 'TODAY'     AND @TO_PARM   = 'YTD'     AND CAL.Calendar_Date
     BETWEEN ADD_MONTHS(CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1) +1
            ,-EXTRACT(MONTH FROM CURRENT_DATE-1) +1)
         AND CURRENT_DATE-1
      )
   OR (  /* for <valid date> - TODAY */
         @TO_PARM   = 'TODAY'     AND SUBSTR(@FROM_PARM,1,2)   = '20'     AND SUBSTR(@FROM_PARM,3,1)  IN ('0','1','2','3','4','5','6','7','8','9')
     AND SUBSTR(@FROM_PARM,4,1)  IN ('0','1','2','3','4','5','6','7','8','9')
     AND SUBSTR(@FROM_PARM,5,1)   = '-'     AND ( (  SUBSTR(@FROM_PARM,6,2)  IN ('02')
          AND SUBSTR(@FROM_PARM,9,2)  <= '28' )
        OR (  SUBSTR(@FROM_PARM,6,2)  IN ('04','06','09','11')
          AND SUBSTR(@FROM_PARM,9,2)  <= '30' )
        OR (  SUBSTR(@FROM_PARM,6,2)  IN ('01','03','05','07','08','10','12')
          AND SUBSTR(@FROM_PARM,9,2)  <= '31' )
        OR @FROM_PARM IN ('2004-02-29','2008-02-29','2012-02-29','2016-02-29','2020-02-29'                         ,'2024-02-29','2028-02-29','2032-02-29','2036-02-29','2040-02-29')
         )
     AND SUBSTR(@FROM_PARM,8,1)   = '-'     AND SUBSTR(@FROM_PARM,9,1)  IN ('0','1','2','3')
     AND SUBSTR(@FROM_PARM,10,1) IN ('0','1','2','3','4','5','6','7','8','9')
     AND CAL.Calendar_Date
     BETWEEN CAST((CASE WHEN SUBSTR(@FROM_PARM,1,2) = '20'                        THEN @FROM_PARM ELSE '1000-01-01' END) AS DATE)
         AND CURRENT_DATE-1
      )
   OR (  /* for <valid date> - <valid date> */
         SUBSTR(@FROM_PARM,1,2)   = '20'     AND SUBSTR(@FROM_PARM,3,1)  IN ('0','1','2','3','4','5','6','7','8','9')
     AND SUBSTR(@FROM_PARM,4,1)  IN ('0','1','2','3','4','5','6','7','8','9')
     AND SUBSTR(@FROM_PARM,5,1)   = '-'     AND ( (  SUBSTR(@FROM_PARM,6,2)  IN ('02')
          AND SUBSTR(@FROM_PARM,9,2)  <= '28' )
        OR (  SUBSTR(@FROM_PARM,6,2)  IN ('04','06','09','11')
          AND SUBSTR(@FROM_PARM,9,2)  <= '30' )
        OR (  SUBSTR(@FROM_PARM,6,2)  IN ('01','03','05','07','08','10','12')
          AND SUBSTR(@FROM_PARM,9,2)  <= '31' )
        OR @FROM_PARM IN ('2004-02-29','2008-02-29','2012-02-29','2016-02-29','2020-02-29'                         ,'2024-02-29','2028-02-29','2032-02-29','2036-02-29','2040-02-29')
         )
     AND SUBSTR(@FROM_PARM,8,1)   = '-'     AND SUBSTR(@FROM_PARM,9,1)  IN ('0','1','2','3')
     AND SUBSTR(@FROM_PARM,10,1) IN ('0','1','2','3','4','5','6','7','8','9')
     AND SUBSTR(@TO_PARM,1,2)   = '20'     AND SUBSTR(@TO_PARM,3,1)  IN ('0','1','2','3','4','5','6','7','8','9')
     AND SUBSTR(@TO_PARM,4,1)  IN ('0','1','2','3','4','5','6','7','8','9')
     AND SUBSTR(@TO_PARM,5,1)   = '-'     AND ( (  SUBSTR(@TO_PARM,6,2)  IN ('02')
          AND SUBSTR(@TO_PARM,9,2)  <= '28' )
        OR (  SUBSTR(@TO_PARM,6,2)  IN ('04','06','09','11')
          AND SUBSTR(@TO_PARM,9,2)  <= '30' )
        OR (  SUBSTR(@TO_PARM,6,2)  IN ('01','03','05','07','08','10','12')
          AND SUBSTR(@TO_PARM,9,2)  <= '31' )
        OR @TO_PARM IN ('2004-02-29','2008-02-29','2012-02-29','2016-02-29','2020-02-29'                       ,'2024-02-29','2028-02-29','2032-02-29','2036-02-29','2040-02-29')
         )
     AND SUBSTR(@TO_PARM,8,1)   = '-'     AND SUBSTR(@TO_PARM,9,1)  IN ('0','1','2','3')
     AND SUBSTR(@TO_PARM,10,1) IN ('0','1','2','3','4','5','6','7','8','9')
     AND CAL.Calendar_Date
     BETWEEN CAST((CASE WHEN SUBSTR(@FROM_PARM,1,2) = '20'                        THEN @FROM_PARM ELSE '1000-01-01' END) AS DATE)
         AND CAST((CASE WHEN SUBSTR(@TO_PARM,1,2) = '20'                        THEN @TO_PARM ELSE '1000-01-01' END) AS DATE)
      )
   OR (  /* for PRIOR - WEEK */
         /* note: ((CURRENT_DATE - DATE'0001-01-07') MOD 7) gives 0 for Sun, 6 for Sat */
         /* this calculation bases a week as Sun-Sat */
         @FROM_PARM = 'PRIOR'     AND @TO_PARM   = 'WEEK'     AND CAL.Calendar_Date
     BETWEEN CURRENT_DATE-1 -((CURRENT_DATE - DATE'0001-01-07') MOD 7)-7
         AND CURRENT_DATE-1 -((CURRENT_DATE - DATE'0001-01-07') MOD 7)
      )
   OR (  /* for PRIOR - 2WEEKS */
         @FROM_PARM = 'PRIOR'     AND @TO_PARM   = '2WEEKS'     AND CAL.Calendar_Date
     BETWEEN CURRENT_DATE-1 -((CURRENT_DATE - DATE'0001-01-07') MOD 7)-14
         AND CURRENT_DATE-1 -((CURRENT_DATE - DATE'0001-01-07') MOD 7)
      )
   OR (  /* for PRIOR - MONTH */
         @FROM_PARM = 'PRIOR'     AND @TO_PARM   = 'MONTH'     AND CAL.Calendar_Date
     BETWEEN ADD_MONTHS(CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1) +1 ,-1)
         AND CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1)
      )
   OR (  /* for PRIOR - 2MONTHS */
         @FROM_PARM = 'PRIOR'     AND @TO_PARM   = '2MONTHS'     AND CAL.Calendar_Date
     BETWEEN ADD_MONTHS(CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1) +1 ,-2)
         AND CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1)
      )
   OR (  /* for PRIOR - 3MONTHS */
         @FROM_PARM = 'PRIOR'     AND @TO_PARM   = '3MONTHS'     AND CAL.Calendar_Date
     BETWEEN ADD_MONTHS(CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1) +1 ,-3)
         AND CURRENT_DATE-1 -EXTRACT(DAY FROM CURRENT_DATE-1)
      )
    ) /** end of OR list **/
GROUP BY  1;

Notes:

Starting with Teradata Database 12.0, static partition elimination also occurs for the built-in functions CURRENT_DATE and DATE.

Dynamic partition elimination (DPE) for product joins may occur when there is an equality constraint (including IN and NOT IN starting with Teradata Database 13.0) between the partitioning column of a partitioning expression of one table and a column of another table, subquery, or spool.


Teradata Analytics for SAP Solutions Demo Video

Short teaser: 
Want to improve your SAP analytics capability? Teradata Analytics for SAP Solutions unlocks SAP data

Learn how to unlock data in SAP® Enterprise Resource Planning (ERP) solutions to discover deeper business insights. Teradata Analytics for SAP® Solutions is the newest addition to Teradata’s portfolio of SAP® integration tools and services. This product demonstration shows out-of-the-box functionality from the powerful pre-built and custom reporting capabilities to the easy cross-functional integration of SAP® data.

 

 


How Resources are Shared in the SLES 11 Priority Scheduler


The SLES 11 priority scheduler implements priorities and assigns resources to workloads based on a tree structure.   The priority administrator defines workloads in Viewpoint Workload Designer and places the workloads on one of several different available levels in this hierarchy. On some levels the admin assigns an allocation percent to the workloads, on other levels not.

How does the administrator influence who gets what?  How do the tier level and the presence of other workloads on the same tier impact what resources are actually allocated?  What happens when some workloads are idle and others are not?

This posting gives you a simple explanation of how resources are shared in SLES 11 priority scheduler and what happens when one or more workloads are unable to consume what they have been allocated.

The Flow of Resources within the Hierarchy

Conceptually, resources flow from the top of the priority hierarchy through to the bottom.  Workloads near the top of the hierarchy will be offered all the resources they are entitled to receive first.  What they cannot use, or what they are not entitled to, will flow to the next level in the tree.  Workloads at the bottom of the hierarchy will receive resources that either cannot be used by workloads above them, or resources that workloads above them are not entitled to.

What does “resources a workload is entitled to” mean? 

Tactical is the highest level where workloads can be placed in the priority hierarchy.  A workload in tactical is entitled to a lot of resources, practically all of the resources on the node if it is able to consume that much.  However, tactical workloads are intended to support very short, very highly-tuned requests, such as single-AMP queries, or few-AMP queries.  Tactical is automatically given a very large allocation of resources to boost its priority, so that work running there can enjoy a high level of consistency.  Tactical work is expected to use only a small fraction of what it is entitled to.

If recommended design approaches have been followed, the majority of the resources that flow into the tactical level will flow down to the level below.  If you are on an Active EDW platform, the next level down will be SLG Tier 1.   If you are on an Appliance platform, it will be Timeshare.

The flow of resources to Service Level Goal (SLG) Tiers

SLG Tiers are intended for workloads that have a service level goal, whose requests have an expected elapsed time, and whose elapsed time is critical to the business.  Up to five SLG Tiers may be defined, although one, or maybe two, are likely to be adequate for most sites.  Multiple workloads may be placed on each SLG Tier.  The figure below shows an example of what SLG Tier 1 might look like.

In looking back at the priority hierarchy figure, shown first, note that the tactical tier and each SLG Tier include a workload labeled “Remaining”.  That workload is created internally by priority scheduler.  It doesn’t have any tasks or use any resources.  Its purpose is to connect to and act as a parent to the children in the tier below.  The Remaining workload passes unused or unallocated resources from one tier to another.

The administrator assigns an allocation percent to each user-defined workload on an SLG Tier.  This allocation represents a percent of resources the workload is entitled to from among the resources that flow into the tier.  If 80% of the node resources flow into SLG Tier 1, the Dashboard workload (which has been assigned an allocation of 15%) is entitled to 12% of the node resources (80% of 15% = 12%).

The Remaining workload on an SLG tier is automatically assigned an allocation that is derived by summing all the user-defined workload allocations on that tier and subtracting that sum from 100%.  Remaining in the figure above gets an allocation of 70% because 100% - (15% + 10% + 5%) = 70%.  Remaining’s allocation of 70% represents the percent of the resources that flow into SLG Tier 1 that the tiers below are entitled to.  You will be forced by Workload Designer to always leave some small percent to Remaining on an SLG Tier so work below will  never be in danger of starving.

Sharing unused resources within the SLG Tier

An assigned allocation percent could end up providing a larger level of node resources than a workload ever needs.   Dashboard may only ever consume 10% of node resources at peak processing times.   Or there may be times of day when Dashboard is not active.  In either of those cases, unused resources that were allocated to one workload will be shared by the other user-defined workloads on that tier, based on their percentages.  This is illustrated in the figure below.  

Note that what the Remaining workload is entitled to remains the same.   The result of Dashboard being idle is that WebApp1 and WebApp2 receive higher run-time allocations.  Only if the two of them are not able to use that spare resource will it go to Remaining and flow down to the tiers below.  

Unused resources on a tier are offered to sibling workloads (workloads on the same tier) first.  What is offered to each is based on the ratio of their individual workload allocations.  WebApp1 gets offered twice as much unused resource originally intended for Dashboard as WebApp2, because WebApp1 has twice as large a defined allocation.
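As a rough sketch of the arithmetic only (the allocations are the ones assumed in the figures: Dashboard 15%, WebApp1 10%, WebApp2 5%, Remaining 70%), when Dashboard is idle its 15 points are shared 2:1 between the two siblings:

SELECT 10 + (15 * 10) / (10 + 5)  AS WebApp1_RunTime_Alloc   /* = 20 percent of the tier inflow */
     ,  5 + (15 *  5) / (10 + 5)  AS WebApp2_RunTime_Alloc   /* = 10 */
     , 70                         AS Remaining_Alloc;        /* unchanged */

Priority scheduler does this redistribution internally; the query simply works through the numbers.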

Priority scheduler uses the same approach to sharing unused resources if the tiers below cannot use what flows to them.  The backflow that comes to an SLG tier from the tier below will be offered to all active workloads on the tier, proportional to their allocations. However, this situation would only occur if Timeshare workloads were not able to consume the resources that flowed down to them.  All resources flow down to the base of the hierarchy first.  Only if they cannot be used by the workloads at the base will they be available to other workloads to consume.   Just as in SLES 10 priority scheduler, no resource is wasted as long as someone is able to use it.

Sharing resources within Timeshare

Timeshare is a single level in the hierarchy that is expected to support the majority of the work running on a Teradata platform.  The administrator selects one of four access levels when a workload is assigned to Timeshare:  Top, High, Medium and Low.  The access level determines the level of resources that will be assigned to work running in that access level's workloads.  Each access level comes with an access rate that determines the actual contrast in priority among work running in Timeshare.  Top has an access rate of 8, High 4, Medium 2, and Low 1.  Access rates cannot be altered.

Priority Scheduler tells the operating system to allocate resources to the different Timeshare requests based on the access rates of the workload they have classified to.  This happens in such a way that any Top query will always receive eight times the resources of any Low query, four times the resources of any Medium query, and two times the resources of any High query.

This contrast in resource allocation is maintained among queries within Timeshare no matter how many are running in each access level.  If there are four queries running in Top, each will get 8 times the resource of a single query in Low. If there are 20 queries in Top, each will get 8 times the resource of a single query in Low.  In this way, high concurrency in one access level will not dilute the priority differences among queries active in different access levels at the same time.
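One rough way to picture that contrast (an illustration only, assuming Timeshare resources are divided in proportion to the access rates of the queries currently active) is with four Top queries and one Low query running at the same time:

/* 4 Top queries (access rate 8) and 1 Low query (access rate 1): 4*8 + 1*1 = 33 rate points active */
SELECT CAST(8 AS FLOAT) / 33  AS Share_Per_Top_Query   /* roughly 0.24 of the Timeshare resources */
     , CAST(1 AS FLOAT) / 33  AS Share_Per_Low_Query;  /* roughly 0.03 -- still an 8:1 ratio per query */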

Conclusions

When using SLES 11 priority scheduler, the administrator can influence the level of resources assigned to various workloads by several means.  The tier (or level) in the priority hierarchy where a workload is placed will identify its general priority.  If a workload is placed on an SLG Tier, the highest SLG Tier will offer a more predictable level of resources than the lowest SLG Tier.

The allocation percent given to SLG Tier workloads will determine the minimum percent those workloads will be offered.   How many other workloads are defined on the same SLG Tier and their patterns of activity and inactivity can tell you whether sibling sharing will enable a workload to receive more than its defined allocation.

Workloads placed in the Timeshare level may end up with the least predictable stream of resources, especially on a platform that supports SLG Tiers that use more at some times and less at others.  This is by design, because Timeshare work is intended to be less critical and not generally associated with service levels.  When there is low activity above Timeshare in the hierarchy, more unused resources will flow into Timeshare workloads.  But if all workloads above Timeshare are consuming 100% of their allocations, Timeshare will get less. 

However, there is always an expected minimum amount of resources you can count on Timeshare receiving.  This can be determined by looking at the allocation percent of the Remaining workload in the tier just above.  That Remaining workload is the parent of all activity that runs in Timeshare, so whatever is allocated to that Remaining will be shared across Timeshare requests.   

You can route more resources to Timeshare, should you need to do that, by ensuring that the SLG Tier Remaining workloads that are in the parent chain above Timeshare in the tree have adequate allocations associated with them.  (To accomplish this you may need to reduce some of the allocation percentages of the user-defined workloads on the various SLG Tiers.)  What Timeshare is entitled to, based on the Remaining workloads above it, is honored by the SLES 11 priority scheduler in the same way as the allocation of any other component higher up in the tree is honored.  But since Timeshare is able to get all the unused resources that no one else can use, it is likely that Timeshare workloads will receive much more than they are entitled to most of the time.

 


Big Data and Social Media: a Threatened Future?


A couple of recent articles in Wired got me thinking about just how social media services, and thus the value of the big data that they create, could be under threat from their own customers.

The first article, “Silicon Valley Needs to Lose the Arrogance or Risk Destruction,” outlines some issues around the tin ear of the larger social media organizations.  I consider this to be more of a perception problem for a young industry and its owners than a real issue.  The free services provided are just too popular at this point.

However, when looking at the entire social media culture, there could be concerns.  “The entire business models of Google and Facebook are built not on a physical product or even a service but on monetizing data that users freely supply. Were either company to lose the trust and optimism of its customers, it wouldn’t just be akin to ExxonMobil failing to sell oil or Dow Chemical to sell plastic; it would be like failing to drill oil, to make plastic.

When William Gibson envisioned cyberspace as a “consensual hallucination,” he was right. Unsettle the consensus about the social web and you don’t just risk slowing its growth or depopulating it slightly. You risk ending it, as mistrust of corporate motives festers into cynicism about the entire project.”

The key here is that the big data underlying these products IS the product, or at least where the value of the product resides.  No data means no revenue.

Not much of a threat?  How is the customer going to duplicate the value of these services without participating and divulging their data for analysis and revenue generation?  Funny you should ask:  in the same issue, we have an article on BYOC (bring your own cloud).  There are tools that allow one to host their own social media site.  “But as I discovered, running a cloud brings with it deeper and weirder pleasures. When you’re master of your own domain, you subtly change your relationship to being online. In a thread with friends on my Tonido service, I discovered that I was far more willing to be jokey or nuts or to curse like a sailor. I was no longer worried about my postings suddenly becoming public without my knowledge, as when Facebook “revises” its privacy settings in the middle of the night.”  Not only is the BYOC data not accessible to the social media companies that need to live off of this data, but it is also a bit more under the radar of government organizations.  Of course, remember the rule about data:  if it is online, it is available to anybody who really wants to make an effort to obtain it.

And most of us have an old unused server or laptop that could easily be tasked for this purpose.  The ease of use of these tools will evolve, and the Torrent culture will spread these tools to all whose interests are piqued.  Piqued by, let’s say, their perception that their favorite social media service is stepping on their toes?

Maybe this isn’t just a perception problem after all.


Expediting Express Requests


If I told you there was a way you might be able to speed up parsing time for your queries, would you be interested?  

In Teradata Database 14.10.02 there is a new capability that allows you to expedite express requests, and I’d like to explain how that works, describe when it can help you, and make some suggestions about how you can use it to get the best performance you can from parsing when the system is under stress.  But first a little background.

What is an express request?

When data dictionary information is needed by modules in the parsing engine (PE), and it cannot be found in the data dictionary cache on that PE, an express request is issued.  That stream-lined request goes directly to the AMPs and attempts to find that data dictionary information.  Because the database code itself is issuing express requests (rather than an end user), and the database can trust itself to do the right thing, these very short requests are allowed to bypass the resolver, security, parsing and optimization modules and are sent directly to the AMPs.

Most express requests are single-AMP requests that go to one AMP.  If there is a lock on the row hash of the dictionary row they are trying to access, an express request will be resent to that AMP with an access lock applied.   If data from the data dictionary is accessed using an access lock, that data is not cached, as it is the result of a dirty read and the row or rows may be in the process of undergoing change.

Several different modules in the parsing engine can submit express requests.  The figure below lists some of the dictionary information that express requests access, and which modules issue the requests. Even a simple query may require 30 or more express requests to be issued, and they will be issued serially.  Things like the number of database objects referenced in the SQL, the number of statistics that the optimizer looks for, and the complexity of access rights can influence the number of separate requests for data that will be required.

 

Assessing parsing time spent on express requests

Starting in Teradata Database 14.0 you can see the wall clock time that was spent in processing express requests. Usually this number is close to zero.  This data is in a new column in DBQLogTbl named ParserExpReq.   Below is a sample of columns from several rows of DBQLogTbl output showing ParserExpReq, intentionally selected to show variation.  The unit reported in ParserExpReq is seconds.

NumSteps    AMPCPUTime    ParserCPUTime    ParserExpReq
       6          0.52             0.02            0.01
     955          0.32             1.51           27.47
      15          0.6              0.04           19.26
      12         30.41             0.02            0.01
       9        268.85             0.04            0
       4          0.07             0.02            ?
      26         55.02             0.38            1.96

In many cases ParserExpReq will be NULL.   You will see a NULL when no express requests were issued because all the data was found in the data dictionary cache.  Zero means that the wall clock time for express requests was less than 0.01 seconds.  99.9% of the DBQLogTbl rows from my shared test system showed ParserExpReq values of either zero or NULL.  I would expect that to be the same on your platform.   But as you can see from the data above, taken from that same system, there were occasional times when ParserExpReq was reporting some number of seconds (close to half a minute in the worst case above), even when the CPU time for parsing was very low.

ParserExpReq reports wall clock time for the execution of all express requests combined on behalf of a query, and will not correlate directly to ParserCPUTime.  The usual reason for ParserExpReq to be a higher number of seconds is that one or more of the express requests were blocked once they got to the AMP.   This could happen if the AMP has exhausted AMP worker tasks.
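A simple way to look for this on your own system (a sketch only; adjust the table or view name and add whatever time filter your site uses for DBQL access) is to pull out the queries with the largest ParserExpReq values:

SELECT TOP 20
       QueryID
     , StartTime
     , NumSteps
     , AMPCPUTime
     , ParserCPUTime
     , ParserExpReq
FROM   DBC.DBQLogTbl
WHERE  ParserExpReq > 1          /* more than one second spent in express requests */
ORDER  BY ParserExpReq DESC;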

What does expediting a request do?

Expediting a request marks it for special performance advantages.  In SLES11 and current SLES10 releases, all queries within a tactical workload are automatically expedited.  As of 14.10.02 you have the capability of expediting express requests as well.  

Here’s why that might make a difference for you.  When a request is expedited it is able to use special reserve pools of AMP worker tasks (AWTs), intended for tactical queries only.  If there is a shortage of AWTs on your platform, use of these reserve pools can speed up the elapsed time of a request, as the request no longer has to wait for another request to complete and free up an AWT so that it can begin to execute.

See this blog posting on reserving AMP worker tasks for more information on how expedited requests take advantage of reserve pools:

http://developer.teradata.com/blog/carrie/2010/01/expedite-your-tactical-queries-whether-you-think-they-need-it-or-not

In addition to being given access to special reserve pools of AWTs, expedited requests are given other small internal boosts that are coded into the database.  While probably not noticeable in most cases, these slight performance advantages can contribute to completing work more quickly, especially on a platform with a high degree of contention.

The standard way that express requests are assigned to AMP worker tasks

Prior to taking advantage of this enhancement in 14.10.02, express requests were sent to the AMP in a message classified as a Work01 work type message.  Message work types are used to indicate the importance of the work that is contained in the message.  Work01 is used for spawned work on behalf of a user-initiated query (such as the receiver task during row redistribution).  It is a step up from new work (which runs in Work00).  

If there is a delay in getting an AWT, Work01 messages queue up in the message queue ahead of Work00 messages (new work sent from the dispatcher), but behind all the other work types.   If there are no AWTs available in the unassigned AWT pool at the time the message arrives, and the three reserves for Work01 are in-use, the message will wait on the queue.  This can increase the time for an express request to complete.  

 

How express requests are assigned to AMP worker tasks with this enhancement

If you take advantage of this enhancement, then messages representing express requests that arrive on the AMPs may be able to use the Work09 work type and reserve pool.  Work09 is a more elevated work type and is the work type assigned to spawned work from an expedited request.   Use of Work09 for express requests only happens, however, if there are AMP worker tasks reserved for the Work09 reserve pool.  If there are no reserves in Work09, then Work01 will continue to be used.

For more information on work types and reserve pools read this posting:

http://developer.teradata.com/blog/carrie/2011/10/reserving-amp-worker-tasks-don-t-let-the-parameters-confuse-you

The important point in all of this is that if you are often, or even occasionally, out of AWTs, then making sure your express requests are not going to be impacted when that condition arises could provide better query performance for parsing activities during stressful times.  

Steps you have to take

The default behavior for express requests will remain the same when you upgrade to 14.10.02 or 15.0.   In order to expedite express requests you will need to put in a request to the support center or your account team asking them to change an internal DBS Control parameter called:  EnableExpediteExp.

The EnableExpediteExp parameter has three possible settings:

0 = Use current behavior, all express requests go to Work01 (the default)

1 = Parser express requests will be expedited and use Work09 for all requests if AWTs have been reserved

2 = Parser express requests will be expedited and use Work09 only if the current workload is expedited by workload management and AWTs have been reserved

If you set EnableExpediteExp = 1, then all express requests for all queries will be expedited, even when the request undergoing parsing is running in a workload that itself is not expedited.   If you set EnableExpediteExp = 2, then only express requests issued on behalf of an expedited workload will be expedited.

Setting EnableExpediteExp to 1 or 2 is going to provide a benefit primarily in situations where there is some level of AWT exhaustion and where an AMP worker task reserve pool for tactical work has been set up. You can look at ParserExpReq in DBQL to check if you are experiencing longer parsing times due to express request delays.   If you are, have a conversation with the support center about whether this is the right change for you.

 


Don’t confuse SLES11 Virtual Partitions with SLES10 Resource Partitions


Because they look like just another group of workloads, you might think that SLES11 virtual partitions are the same as SLES10 resource partitions.  I’m here to tell you that is not the case.  They have quite different capabilities and purposes.  So don’t fall victim to retro-conventions and old-school habits that might hold you back from the full value of new technology.  Start using SLES11 with fresh eyes and brand new attitudes.  Begin at the virtual partition level.

This content is relevant to EDW platforms only.

Background on SLES10 Resource Partitions  

Use of multiple resource partitions (RP) in SLES10 originated due to restrictions in the early days on how many different priorities each RP could support.  The original Teradata priority scheduler had four external performance groups and four internal performance groups contained in a single default RP.  Even today, the original RP (RP 0, the default RP) usually supports no more than four default priorities of $L, $M, $H, and $R.

Teradata V2R5 brought the ability to add resource partitions, but even then each new resource partition could only support 4 different external performance groups, similar to how RP 0 worked.  This forced users to branch out to more RPs if they had a greater number of priority differences.  So it was common to see 4 to 5 RPs in use, and some users complained that that wasn’t enough to provide homes to the growing mix of priorities they were trying to support.

In V2R6, priority scheduler was enhanced to allow more than 4 priority groupings in any RP.  At that time we encouraged users to consolidate all their performance groups into three standard partitions for ease of management:  Default, Standard, and Tactical.  Generally, a Tactical RP was needed to give special protection to short tactical queries.  Some internal work still ran in RP 0 so it was recommended that you avoid assigning user work there, which necessitated that a “Standard” RP be set up to manage all of the non-tactical performance groups.   In SLES10 many users embraced this three-RP approach, while others went their own way with subject-area divisions or priority-based divisions among multiple RPs (creating a Batch RP and a User RP, for example).

Here are four rationales for the multiple resource partition usage patterns that are in heavy rotation with SLES10 today.  For the most part they came into being due to restrictions within the SLES10 priority scheduler which encouraged out-of-the-box use of multiple RPs on EDW platforms, whether you thought you needed them or not.

  1. Internal work: Some sensitive internal work ran in RP 0, so the recommendation was to avoid putting user work there.
  2. Protection for tactical work by isolating it into its own RP with a high RP weight.  A high RP weight contributed to a more stable relative weight (allocation of resources) for tactical workloads.
  3. Desire to more easily swap priorities between load and query work by different times of day (by making one change at the RP-level instead of multiple changes at the level of the allocation group).   These RP-level changes often included the desire to add RP-level CPU limits on RPs supporting resource-intensive work, in order to protect tactical queries at certain times of the day.
  4. Sharing unused resources with an RP.   Some sites liked putting all work from one application type in the same RP so that if one of the allocation groups was idle, the other allocation groups of that type would get their relative weight points.  The SLES10 relative weight calculation benefits groups within the same RP, such that they share unused resources among themselves first, before those resources are made available to allocation groups in other RPs.

Very limited examples of using resource partitions for business unit divisions have been in evidence among Teradata sites on SLES10, partly because of only having four usable RPs and partly because the SLES10 technology has not been all-encompassing enough to support the degree of separation required.

What Has Changed with SLES11?

A lot.

First, let’s address the four key motives (or rationales) users have had for spreading workloads and performance groups across multiple RPs in SLES10, but looking at them from the SLES11 perspective.

  1. Internal work: In SLES11 all internal work has been moved up in the priority hierarchy above the virtual partition level, where it can get all of the resource it needs off the top, without the user having to be aware or considerate of where that internal work is running.   There is no longer a need to set up additional partitions to avoid impacting internal work.    
  2. Protection for tactical work: The Tactical tier in SLES11 is intended to be (and is) a turbo-powered location in which to place tactical queries, where response time expectations can be consistently met without taking extraordinary steps.  The Tactical tier in SLES11 is first in line when it comes to resource allocation, right after operating system and internal database tasks.  This eliminates the need for a special partition solely for tactical work, or as a means of applying resource limits on the non-tactical work.
  3. Desire to more easily swap priorities: There is something to be said for grouping workloads that need priority changes at similar times into a single partition, because then you only have to make the change in one place.  But that is a fairly minor issue on either SLES10 or SLES11 with the advent of TASM planned environments.   You’re not saving that much during TASM setup to indicate a change in one place (a virtual partition) vs. making a change in several places (multiple workloads) when those changes are going to be happening automatically for you at run time each day.  There is no repetitive action that needs to be taken by the administrator once a new planned environment has been created.  New planned environments can automatically implement new definitions, with lower priorities for some of the workloads and higher for others, no matter how many workloads are involved.     

Applying higher-level (partition-level) resource limits on a group of workloads, as we have seen in some SLES10 sites, is much less likely to be needed in SLES11 (I personally believe it will not be needed at all).  That is because the accounting in SLES11 priority scheduler is more accurate, giving SLES 11 the ability to deliver exactly what is specified.  No more, no less. There is no longer a performance-protection need for resource limits or an over-/under-allocation of weight at the partition level. And because that need has gone away, the argument in favor of separate partitions for performance benefit is less compelling.

  4. Sharing unused resources.  Sharing unused resources among a small set of selected workloads is available on each SLG Tier as it exists within a single virtual partition in SLES11.  If an SLG Tier 1 workload is idle, the other workloads placed on SLG Tier 1 will be able to share its allocation before those resources are made available to other workloads lower in the hierarchy.  The order of sharing of unused resources is guided by the priority hierarchy in SLES11 and does not require multiple partitions to implement. 

The Intent and Vision of SLES11 Virtual Partitions

A virtual partition in SLES11 is a self-contained microcosm.  It has a place for very high priority tactical work in the Tactical tier.  It has many places in the SLG Tiers for critical, time dependent work across all applications ranging from the very simple to the more complex.   And at the base of its structure in Timeshare it can accommodate large numbers of different workloads submitting resource-intensive or background work at different access levels, including load jobs, sandbox applications and long-running queries. Within its self-sufficient world, priorities at the workload level can be changed multiple times every day if you wish, using planned environments in the TASM state matrix.

If you’re on an EDW platform with SLES11, you are offered multiple virtual partitions, but their intent is different from SLES10 resource partitions.   Virtual partitions were implemented in order to provide a capability that SLES10 was not well suited to deliver:  Supporting differences in resource availability across multiple business units, or distinct geographic areas, or a collection of tenants.

Virtual partitions are there to provide a method of slicing up available resources among key business divisions of the company on the same hardware platform.  Once you get on SLES11, if you begin moving in a direction that made sense in SLES10, you lose the ability to sustain distinct business units in the future.  And you’ll be less in harmony with TASM/SLES11 enhancements going forward.

New capabilities around virtual partitions, such as virtual partition throttles in 15.0, and other similar enhancements being planned, are all being put in place with the same consistent vision of what a virtual partition is.  Keep in step with these enhancements and position yourself to use them fully, by letting go of previous conventions and embracing the new world of SLES11 possibilities.  


Government Black Swans


This blog concentrates on the expected unexpected external factors that can have a (negative) impact on your organization’s Integrated Data Warehouse (IDW).  The current discussions around what the NSA can and cannot capture and store for data analysis got me thinking about the biggest elephant in the room:  the government.

In the WSJ’s coverage of a house bill overhauling the NSA phone program, “Instead of collecting millions of Americans' phone records en masse, the NSA would ask phone companies to query their databases for connections to suspicious phone numbers.”  In other words, all processing power that the NSA has been dedicating to the analytics of the phone metadata would be pushed back into the private phone companies’ IT departments.  Do they have this in their analytics processing capacity plan budget?  And how do they prioritize the security letter information requests vs. their own processing on a shared platform?

What about the major social media sites that have been acting as the repository for the world’s online activities?  Up to now, based on the press reports and Edward J. Snowden’s revelations, the NSA has been vacuuming up this data in the background for internal pattern analysis.  Can the social media sites expect a major uptick in security letters asking for analytics on their IDWs if this is put to a halt?  If the data collection activities are further restricted, I can imagine a large broadening of the impact across all industries:  queries from hardware chains on material purchases; queries from large retailers looking for specific market baskets; etc. 

Here’s a link to the 2013 security letter activities. There were nearly 100,000 “targets” affected.  It will be interesting to see what this looks like in 2015 if the current legislation restricting NSA collection and storage activities gets through the Senate. 

But it isn’t just the NSA that has an impact on private company IDW infrastructures.  Sarbanes-Oxley changed the storage requirements on practically every company activity.  Transportation companies have to track and report in detail on passenger and shipping manifests.  The list goes on and on. 

So, when building your organization’s IDW capacity plan, please do keep an eye on the Washington (or your home country’s) legislature.  Something big may be coming your way.


In-Lining of LOBs in .NET Data Provider for Teradata 15.0


This is part 1 of a multi-part blog about how the .NET Data Provider for Teradata 15.0 can now In-Line Large Objects (LOBs) that are sent to a Teradata Database when executing an INSERT or UPDATE statement.  This first blog introduces In-Lining of LOBs.  Later blogs will discuss how to take advantage of this feature and its performance characteristics.  All these blogs will be more technically oriented than my other blogs.

Overview

The first item that needs to be mentioned is that In-Lining of LOBs is only supported when the Data Provider is connected to a Teradata Database 14.0 or later.  If you install the .NET Data Provider for Teradata 15.0 and connect to a Teradata Database release earlier than 14.0 all LOBs will be sent using Deferred mode.

In releases prior to 15.0, the data for all LOBs (BLOB, CLOB, XML, JSON, Geospatial) was sent to the Teradata Database using Deferred mode by the .NET Data Provider for Teradata.  It is now possible for a LOB to be In-Lined by the 15.0 release of the Data Provider. 

This feature is applied automatically:  if the Data Provider determines that the data of a LOB can be In-Lined, it will automatically In-Line the data.

What does it mean to In-Line a LOB or send a LOB as Deferred?

Before answering this question a Message Buffer must be defined.  A Message Buffer is used by the Data Provider to write out all the data of a request that will be sent to the Teradata Database.  This includes the SQL statement, data of all the parameters, and associated overhead.  After the Message Buffer has been filled with all this information, it is sent to the Teradata Database for processing.

A LOB is In-Lined when all of its data is written to the Message Buffer.  Since the Message Buffer is limited to 1mb, the largest LOB that can be In-Lined will be approximately 1mb.

When a LOB is sent to the Teradata Database using Deferred mode a unique identifier is written in place of the LOB data in the Message Buffer.  This identifier takes up 4 bytes in the buffer.  When the Teradata Database receives the data in the buffer it sends a request to the Data Provider for the LOB data.  The Data Provider will fill the Message Buffer with the LOB data, then send the contents of the buffer to the Teradata Database.  The amount of data that can be sent to the Teradata Database is 1mb.  This back and forth communication between the Data Provider and Teradata Database continues until all the data of the LOB has been sent.

There is one less round of communication between the Data Provider and Teradata Database when a "small" LOB can be In-Lined.

When is a LOB In-Lined?

A LOB will not be In-Lined under the following conditions:

  • When the TdParameters.Size property is set to 0 and the data type that represents the LOB is a Stream, TextReader, or XmlReader.
  • When the total number of bytes of non-LOB parameters and the total number of bytes of all LOB parameters exceed 1mb.  This is clarified in the discussion that follows.

There are two scenarios that need to be discussed when the Data Provider determines when the data of a LOB can be In-Lined:

  • Executing a command using one of the TdCommand's execution methods
  • Using TdDataAdapter to process a batch.

How In-Lining is Determined when TdCommand's Execution Methods are Called

When one of the TdCommand's execution methods (i.e. TdCommand.ExecuteNonQuery) is called, the Data Provider determines whether a LOB can be In-Lined by performing the following:

  • Subtracting the total number of bytes of the non-LOB parameters, the SQL statement,  and the number of bytes of the overhead from 1mb. 
  • Starting from the smallest LOB the Data Provider will check whether it can fit within the remaining space.  The LOB will be In-Lined if it can fit.  The size of the LOB is subtracted from the remaining space.  The size of the next smallest LOB is checked.  This will continue until a LOB does not fit in the space that remains.  Any LOB that cannot be In-Lined will be sent using Deferred mode.

Example using TdCommand.ExecuteNonQuery

In this example, a parameter row contains 3 BLOBS.

static void Example1(TdCommand cmd)
{
 
    FileStream blob1 = new FileStream("blob1.mp3", FileMode.Open);
    FileStream blob2 = new FileStream("blob2.mp3", FileMode.Open);
    FileStream blob3 = new FileStream("blob3.mp3", FileMode.Open);
 
    cmd.CommandText = "insert into exTable (int1, blob1, blob2, blob3) values (?, ?, ?, ?)";
 
    cmd.Parameters.Add(null, TdType.Integer);
    cmd.Parameters[0].Value = 1;
 
    // blob1 will be sent as Deferred because Size=0 and blob1 is a Stream
    cmd.Parameters.Add(null, TdType.Blob);
    cmd.Parameters[1].Size = 0;
    cmd.Parameters[1].Value = blob1;
 
    // blob2 will be sent In-Line because the Size will fit within the Message Buffer
    cmd.Parameters.Add(null, TdType.Blob);
    cmd.Parameters[2].Size = 1000;
    cmd.Parameters[2].Value = blob2;
 
    // blob3 will be sent as Deferred because the Size is too large to fit 
    // within the Message buffer
    cmd.Parameters.Add(null, TdType.Blob);
    cmd.Parameters[3].Size = 1000000000;      // 1,000,000,000
    cmd.Parameters[3].Value = blob3;
 
    cmd.ExecuteNonQuery();
}

In this example the BLOBs are represented by a FileStream. The Data Provider will perform the following actions on each of the BLOBs:

  • blob1 -- TdParameter.Size = 0 and the base type is a Stream. The Data Provider cannot determine the size of the BLOB. This BLOB will be sent deferred.
  • blob2 -- TdParameter.Size = 1000. When the overhead and the size of all the other parameters are accounted for, this LOB will fit within the Message Buffer. It will be In-Lined.
  • blob3 -- TdParameter.Size=1,000,000,000. The size of the BLOB is too large to fit within the Message Buffer. This BLOB will be sent Deferred.

How In-Lining is Determined when Processing a Batch

It gets a little more complicated when the Data Provider is processing a Batch.

  • The Message Buffer space allocated for each parameter row in the batch is calculated:
    • 1mb / TdDataAdapter.UpdateBatchSize
  • The Teradata Database requires the same LOB parameter in each row to be either In-Lined or Deferred, so the calculation to determine which LOBs can be In-Lined only has to be performed on one row.
  • The same calculation described for the TdCommand execution methods is used, except that the per-row space allocation is used instead of 1mb.

Example of Using a Batch to Update a Table

static void Example2(TdConnection cn)
{
    DataTable dt = new DataTable("example2");

    dt.Columns.Add("int1", typeof(Int32));
    dt.Columns.Add("blob1", typeof(FileStream));
    dt.Columns.Add("blob2", typeof(FileStream));
    dt.Columns.Add("blob3", typeof(FileStream));
 
    // ***********
    // The dt is filled with rows 
    // ***********
 
    TdCommand cmd = cn.CreateCommand();
     cmd.Parameters.Add("int1", TdType.Integer, 0, "int1");
    cmd.Parameters[0].SourceVersion = DataRowVersion.Proposed;

    // blob1 will be sent In-Line because the Size will fit within the Message Buffer
    cmd.Parameters.Add("blob1", TdType.Blob, 40000, "blob1");
    cmd.Parameters[1].SourceVersion = DataRowVersion.Proposed;
 
    // blob2 will be sent as Deferred because Size=0 and blob2 is a Stream
    cmd.Parameters.Add("blob2", TdType.Blob, 0, "blob2");
    cmd.Parameters[2].SourceVersion = DataRowVersion.Proposed;
 
    // blob3 will be In-Lined because the space that remains in the Message 
    // Buffer will fit the 30,000 bytes.
    cmd.Parameters.Add("blob3", TdType.Blob, 30000, "blob3");
    cmd.Parameters[3].SourceVersion = DataRowVersion.Proposed;
 
    cmd.CommandText = "insert into exTable (int1, blob1, blob2, blob3) values (?, ?, ?, ?)";
 
    TdDataAdapter da = new TdDataAdapter();
    da.InsertCommand = cmd;
 
    // iterated request (parameter arrays) will be used
    da.UpdateBatchSize = 10;
    da.ContinueUpdateOnError = true;
    da.KeepCommandBatchSequence = false;
 
    // Sending batch to Teradata Database.
    da.Update(dt);
 
    da.Dispose();
    cmd.Dispose();
}

In this example each of the BLOBs are represented as a FileStream. To determine which BLOBs will be In-Lined the Data Provider performs the following actions:

  • Calculates the space in the Message Buffer that will be assigned to each Parameter Row.
    •  1mb / TdDataAdapter.UpdateBatchSize  = 1mb / 10 = 100kb
  • Starting with the smallest LOB, the Data Provider will determine whether each LOB can fit within the 100kb.
    • blob2 will be sent Deferred because the TdParameter.Size=0 and the base type is a Stream.
    • The first BLOB that is considered for In-Lining is blob3 because it is the smallest LOB.  It has a Size of 30,000.  After subtracting the overhead from 100kb, blob3 can fit within the allocated space.  The Size of blob3 is subtracted from the remaining bytes.
    • The second BLOB that is considered for In-Lining is blob1.  It has a size of 40,000.  The Data Provider will check whether it can fit within the remaining space.  "blob1" will fit so it will also be In-Lined.

blob1 and blob3 are small BLOBs.  You may think that they can be represented as a Byte Array and the TdParameter.TdType for each parameter can be set to TdType.VarByte.  However, the Teradata Database has a limit on the number of bytes that can be set in a parameter row.  The limit is 64kb. 

This limit changes if the parameter row contains a LOB data type whose value can be In-Lined.  The limit increases to 1mb.
 
