Thursday, January 6, 2011

Fun with a String

I recently had a friend contact me recently to ask for some help with TSQL.  The basic issue is that he had 2 tables.  Table 1 has numeric key values which are contained within a character column in Table 2 which is also in another database on the same server.  His job was to parse the character column in Table 2 so he could join to the numeric key in Table 1.  He originally said that the numeric data in the character column would start at position 7 and be 12 characters long so his original query was something like this:

SELECT
*
FROM
db.schemaname.table1 AS T1
JOIN db.schemaname.table2 AS T2
ON T1.id = SUBSTRING(T2.character_col, 7, 12)



I replied that the query would work, but was he guaranteed that the data would ALWAYS start at position 7 and be 12 characters?  Also that using the SUBSTRING function meant that the optimizer could not use an index.  As I thought about the situation I came up with this as probably the most flexible solution.



Here’s some test data:



/* Don't like to work in a user database */
USE tempdb ;
GO

CREATE TABLE
#test1 (id INT)

/* create a test table */
CREATE TABLE #test
(
string VARCHAR(50) NOT NULL
) ;

/* put in some data */
INSERT INTO #test1
(
id
)
SELECT
123456789012
UNION ALL
SELECT
234567890121
UNION ALL
SELECT
345678901212
UNION ALL
SELECT
456789012123
UNION ALL
SELECT
12345678901
UNION ALL
SELECT
23456789011
UNION ALL
SELECT
1234567890123
UNION ALL
SELECT
12345678901234 ;

INSERT INTO #test
(
string
)
SELECT
'Jack 123456789012'
UNION ALL
SELECT
'234567890121'
UNION ALL
SELECT
'Jack 345678901212'
UNION ALL
SELECT
'Jack abcdefg'
UNION ALL
SELECT
'Jack 456789012123x'
UNION ALL
SELECT
'Jack 12345678901'
UNION ALL
SELECT
'Jack 23456789011a'
UNION ALL
SELECT
'Jack 1234567890123'
UNION ALL
SELECT
'Jack 12345678901234a' ;


/* Use PATINDEX inside SUBSTRING to find the first instance of a numeric character 
to set the starting index. Use PATINDEX on the string REVERSE to find the
last instance of a numeric character. */
SELECT
*
FROM
#test1 AS T1 JOIN
#test AS T2 ON T1.id = SUBSTRING(T2.string, PATINDEX('%[0-9]%', T2.string),
LEN(T2.string) - PATINDEX('%[0-9]%', REVERSE(T2.string)) -
PATINDEX('%[0-9]%', T2.string) + 2);

DROP TABLE #test1;
DROP TABLE #test ;


This code works for this situation.  Will it work in every situation?  Probably not.  Will it scale?  Probably not, but it was an interesting exercise.



Do you have a better solution?  If you do post it in the comments or post a link to a blog post where you put your solution.

No comments:

Post a Comment

So what do you think I am?