Showing posts with label T-SQL. Show all posts
Showing posts with label T-SQL. Show all posts

Friday, September 7, 2007

Intelligent Database Design Using Hash Keys - Checksum()

Intelligent Database Design Using Hash Keys

Using Hash Keys instead of String Indexes

Your application may require an index based on a lengthy string, or even worse, a concatenation of two strings, or of a string and one or two integers. In a small table, you might not notice the impact. But suppose the table of interest contains 50 million rows? Then you will notice the impact both in terms of storage requirements and search performance.

You don’t have to do it this way. There is a very slick alternative, using what are known alternatively as hash buckets or hash keys.

What is a Hash?

In brief, a hash is the integer result of an algorithm (known as a hash function) applied to a given string. You feed said algorithm a string and you get back an integer. If you use an efficient hash function then there will be only a small chance that two different strings will yield the same hash value. If this does occur, then it is known as a hash collision. Suppose that you fed this article into a hash algorithm, then changed one character in the article and fed the article back into the hashing algorithm: it would return a different integer.

Hash Keys in Database Design

Now, how can we apply hash leys intelligently in our database designs? Suppose that we have these columns in the table of interest:

Column Name


Data Type

Name

Varchar(50)

GroupName

Varchar(50)

A compound index on both these columns would consume 50 + 50 characters per row. Given 50 million rows, this is a problem. A hash key based on these two columns is vastly smaller (4 bytes per row). Even better, we don’t have to store the hash keys themselves – or more accurately, we have to store them just once. We create a calculated column whose formula is the hash key of these two columns. Now, we index the hash key row and don’t bother with the index on the two columns mentioned above.

The basic process is as follows:

  1. The user (whether a human or an application) queries the values of interest
  2. These values are then converted into a hash key
  3. The database engine searches the index on the hashed column, returning the required row, or a small subset of matching rows.

In a 50 million row table, there will undoubtedly be hash collisions, but that isn’t the point. The set of rows returned will be dramatically smaller than the set of rows the engine would have to visit in order to find an exact match on the original query values. You isolate a small subset of rows using the hash key and then perform an exact-string match against the hits. A search based on an integer column can be dramatically faster than a search based on a lengthy string key, and more so if it is a compound key.

Hash Key Algorithms using the Checksum Function

There are several algorithms available, the simplest of which is built into SQL Server in the form of the Checksum function. For example, the following query demonstrates how to obtain the hash key for any given value or combination of values:

USE AdventureWorks
SELECT Name, GroupName, Checksum(Name,GroupName)AS HashKey
FROM Adventureworks.HumanResources.Department
ORDER BY HashKey

This results in the following rows (clipped to 10 for brevity):


Name



GroupName



Hashkey


Tool Design

Research and Development

-2142514043

Production

Manufacturing

-2110292704

Shipping and Receiving

Inventory Management

-1405505115

Purchasing

Inventory Management

-1264922199

Document Control

Quality Assurance

-922796840

Information Services

Executive General and Administration

-904518583

Quality Assurance

Quality Assurance

-846578145

Sales

Sales and Marketing

-493399545

Production Control

Manufacturing

-216183716

Marketing

Sales and Marketing

-150901473

You have a number of choices as to how you create the hash key. You might elect to fire an INSERT trigger, or use a stored procedure to create the hash key once the values of interest have been obtained, or even to execute an UPDATE query that creates the hash keys and populates the hash column retroactively (so that you can apply this technique to tables that already contain millions of rows). As stated above, my preferred solution is to "store" the hash keys in a calculated column that is then indexed. As such, the index contains the hash keys but the table itself does not.

Using this technique, you might approach the problem as follows, assuming that the front end passes in the target values for Name and GroupName:

CREATE PROCEDURE DemoHash
(
@Name Varchar(50),
@GroupName Varchar(50)
)
AS
-- USE AdventureWorks
DECLARE @id as int
SET @id = Checksum(@Name,@GroupName)
SELECT * FROM Adventureworks.HumanResources.Department
WHERE HashKey = @id
AND Name = @Name AND GroupName = @GroupName

Conclusion


This approach can yield considerable performance benefits and I encourage you to test it out on your own systems. The technique, as presented here, assumes that the search targets exist in a single table, which may not always be the case. I am still experimenting with ways to use this technique to search joined tables, and when I come up with the best approach, I will let you know.

© Simple-Talk.com


Intelligent Database Design Using Hash Keys

Sunday, April 8, 2007

Quick MS Access (JET SQL) to SQL Server (T-SQL) Conversion Guide

Quick MS Access (JET SQL) to SQL Server (T-SQL) Conversion Guide


Quick MS Access (JET SQL) to SQL Server (T-SQL) Conversion Guide

Lots of questions come up in the SQL Team forums about conversions between Access and T-SQL and some of the differences between the two SQL dialects. Here's a few handy things to help you out with converting your projects. Check in now and then as this short list will eventually grow as more things come up.

Converting NULL values


Access: NZ(Value, ValueToReturnIfNull)
T-SQL: COALESCE(Value, ValueToReturnIfNull) -- or -- ISNULL(Value, ValueToReturnIfNull)

Checking for NULLs

Access: WHERE Value IS NULL -- or -- WHERE ISNULL(Value) (note the difference from T-SQL's ISNULL)
T-SQL: WHERE Value IS NULL

String Segments

Access: MID(StringVal, StartPos, [length] ) (length is optional)
T-SQL: SUBSTRING(StringVal, StartPos, length ) (length is required!)

Finding a String within a String

Access: SELECT INSTR(start, StringToSearch, StringToFind)
T-SQL: SELECT CHARINDEX(start, StringToSearch, StringToFind)

Reverse a String

Access: SELECT STRREVERSE(StringVal)
T-SQL: SELECT REVERSE(StringVal)

Formatting Dates, Booleans, Numerics as Strings

Access: SELECT Format(Value, FormatSpecification) (note: this always returns a string value)
T-SQL: Do not do this in T-SQL; format data at your front-end application or report

String Literals

Access: SELECT "This is a string"
T-SQL: SELECT 'This is a string'

LIKE pattern matching

matching multiple characters:

Access: WHERE Column LIKE "*string*"
T-SQL: WHERE Column LIKE '%string%'

matching a single character:

Access: WHERE Column LIKE "?string?"
T-SQL: WHERE Column LIKE '_string_'

not matching a character or range:

Access: WHERE Column LIKE "[!a-z]"
T-SQL: WHERE Column LIKE '[^a-z]'

Triming White Space

Access: TRIM(val)
T-SQL: RTRIM(LTRIM(val))

Converting DataTypes

Access: CINT(value), CDBL(value), CDEC(value), CSTR(value), CDATE(value), CBOOL(value)
T-SQL: CONVERT(DATATYPE, value) -- or -- CAST(value AS datatype)

Conditional Expressions

Access: IIF(Condition, ReturnIfTrue, ReturnIfValue)
T-SQL: CASE WHEN Condition THEN ReturnIfTrue ELSE ReturnIfFalse END

Working with Date Literals

Access: WHERE SomeDate = #1/1/2005#
T-SQL: WHERE SomeDate = '1/1/2005' (this is an implicit conversion from a string to a date)

Creating new Dates

Access: DATESERIAL(year,month,date)
T-SQL: Use the Date() function here -- there is no quick easy way to do this in T-SQL

Creating new Times

Access: TIMESERIAL(Hour, minute, second)
T-SQL: Use the Time() function here -- there is no quick easy way to do this in T-SQL

Getting Today's Date and Time

Access: SELECT now()
T-SQL: SELECT getdate()

Getting Today's Date only (i.e., at midnight)

Access: SELECT date()
T-SQL: Use the DateOnly() function here : SELECT dbo.DateOnly(getdate())

Getting Today's Time Only (at the "base" date, or date with a numeric value of 0)

Access: SELECT Time() (this returns the time at 12/30/1899)
T-SQL: Use the TimeOnly() function here : SELECT dbo.TimeOnly(getdate()) (returns the time at 1/1/1900)

Boolean (True/False) Values

Access: WHERE Active = True -- and -- WHERE Active = False
(Active is a Boolean datatype)

T-SQL: WHERE Active=1 -- and -- WHERE Active=0
(Active is a Bit datatype)

Returning or Setting Boolean Values

Access: SELECT BooleanExpression
T-SQL: CAST(CASE WHEN BooleanExpression THEN 1 ELSE 0 END) AS BIT

FULL OUTER JOINS

(Note: try to avoid these as a general practice)

Access: SELECT ... FROM tableA LEFT OUTER JOIN tableB ON ...
UNION ALL
SELECT ... FROM tableB LEFT OUTER JOIN tableA ON ... WHERE tableA .PK IS NULL

T-SQL: SELECT ... FROM tableA FULL OUTER JOIN tableB ON ....

RIGHT OUTER JOINS

Because we all know that using the query designer in Access sometimes results in these, but we should never use them in manually written and maintained SQL:

Access: SELECT ... FROM tableA RIGHT OUTER JOIN tableB ON ....
T-SQL: SELECT ... FROM tableB LEFT OUTER JOIN tableA ON ....

Parameters

Access: SELECT [Any column name not defined]
T-SQL: SELECT @ParamName

Modulo Operator

Access: SELECT value1 MOD value2
T-SQL: SELECT value1 % value2

Dividing Integers to calculate a Percentage or other result with decimal places

Access: SELECT Int1 / Int2 (this returns a Double value implicitly)
T-SQL: SELECT Int1 * 1.0 / Int2 (the multiplication by 1.0 results in a numeric(8,6) being returned)

String Concatenation Operator

Access: Val1 & Val2 (both will be implicitly converted to strings if they are not already)
T-SQL: Val1 + Val2 ( note that explicit conversion to a "string" datatypes is necessary in T-SQL)

Referencing an Expression in a SELECT

Here, we define A+B as a new column X, and we want to reference X in the SELECT:

Access: SELECT A+B as X, X+C as D FROM ...
T-SQL: SELECT X, X+C as D FROM (SELECT A+B as X, C FROM ... ) tmp

Getting a Character from an ASCII code

Access: SELECT CHR(AsciiCode)
T-SQL: SELECT CHAR(AsciiCode)

Getting an ASCII code from a Character

Access: SELECT ASC(Character)
T-SQL: SEELCT ASCII(Character)

Date Part Indicators (DateAdd, DateDiff, DatePart)

MS Access and SQL Server both use the same basic date functions (DateAdd, DateDiff, DatePart) but the way you indicate which "date part" you are after differs between the two.

MS Access uses a string expression to indicate the "dart part" in DateAdd, DatePart and DateDiff expressions; SQL Server uses symbols. Thus, you need to put quotes around the part name in MS Access since it is just a string expression, but you should NOT use quotes in SQL Server -- just enter the value directly.

The Date Part indicators are listed below:

Date Part SQL Server MS Access
Year year, yy, yyyy "yyyy"
Quarter quarter, qq, q "q"
Month month, mm, m "m"
Day of Year dayofyear, dy, y "y"
Day day, dd, d "d"
Week week, wk, ww "ww"
Day of Week weekday, dw "w"
Hour hour, hh "h"
Minute minute, mi, n "n"
Second second, ss, s "s"
Millisecond millisecond, ms -

Finally, note that both Access and T-SQL support the Year(), Month() and Day() functions.