Thursday, May 10, 2007

SQL Server 2005: Regular Expressions Make Pattern Matching And Data Extraction Easier -- MSDN Magazine, February 2007

SQL Server 2005: Regular Expressions Make Pattern Matching And Data Extraction Easier -- MSDN Magazine, February 2007

  • Efficient SQL querying using regular expressions
  • Support in SQL Server 2005 for regular expressions
  • Using .NET Regex classes from SQL Server
  • Effective uses for regular expressions in a database

select dbo.RegexMatch( N'123-45-6789', N'^\d{3}-\d{2}-\d{4}$' )


select ROUTINE_NAME
from INFORMATION_SCHEMA.ROUTINES
where ROUTINE_TYPE = N'PROCEDURE'
and dbo.RegexMatch( ROUTINE_NAME,
N'^usp_(Insert|Update|Delete|Select)([A-Z][a-z]+)+$' ) = 0


select ROUTINE_NAME
from INFORMATION_SCHEMA.ROUTINES
where ROUTINE_TYPE = N'PROCEDURE'
and ( LEN( ROUTINE_NAME ) < 11
or LEFT( ROUTINE_NAME, 4 ) <> N'usp_'
or SUBSTRING( ROUTINE_NAME, 5, 6 ) not in
( N'Insert', N'Update', N'Delete', N'Select' ) )


CREATE TABLE [Account]
(
[AccountNumber] nvarchar(20) CHECK (dbo.RegexMatch(
[AccountNumber], '^[A-Z]{3,5}\d{5}-\d{3}$' ) = 1),
[PhoneNumber] nchar(13) CHECK (dbo.RegexMatch(
[PhoneNumber], '^\(\d{3}\)\d{3}-\d{4}$' ) = 1),
[ZipCode] nvarchar(10) CHECK (dbo.RegexMatch(
[ZipCode], '^\d{5}(\-\d{4})?$' ) = 1)
)



Data Extraction

The grouping features of regular expressions can be used to extract data from a string. My RegexGroup function provides that functionality to T-SQL:

[SqlFunction]
public static SqlChars RegexGroup(
SqlChars input, SqlString pattern, SqlString name )
{
Regex regex = new Regex( pattern.Value, Options );
Match match = regex.Match( new string( input.Value ) );
return match.Success ?
new SqlChars( match.Groups[name.Value].Value ) : SqlChars.Null;
}


select distinct dbo.RegexGroup( [Url],
N'https?://(?([\w-]+\.)*[\w-]+)', N'server' )
from [UrlTable]


CREATE TABLE [Email]
(
[Address] nvarchar(max),
[Mailbox] as dbo.RegexGroup( [Address],
N'(?[^@]*)@', N'mailbox' ),
[Domain] as dbo.RegexGroup( [Address], N'@(?.*)', N'domain' )


Pattern Storage

All of the patterns used by these functions are just strings, which means that any of them can be stored in a table within your database. Most databases that store international data have a table representing countries. By adding a few extra columns to that table, you could store country-specific validation patterns. That would allow the constraint applied to an address row to vary based on the country for that row.

In databases that store data on behalf of clients, there is typically already a table representing a client. That table can be used to store grouping patterns that let you describe the way raw client data is stored within the database, and this allows you to create computed columns to pull the data you actually need from the client data. For example, if each of your clients has unique schemes for account numbers and you only need specific pieces of that account number, you could easily create an expression that pulls the correct piece of information for each client.


declare @text nvarchar(max), @pattern nvarchar(max)
select
@text = N'Here are four words.',
@pattern = '\w+'
select count(distinct [Text])
from dbo.RegexMatches( @text, @pattern )


declare @pattern nvarchar(max), @list nvarchar(max)
select @pattern = N'[^,]+', @list = N'2,4,6'

select d.* from [Data] d
inner join dbo.RegexMatches( @list, @pattern ) re
on d.[ID] = re.[Text]




Related Articles from MSDN Magazine: