Tokenize UDF

March 10, 2005

Yes, another string splitting UDF from a guy who’s obviously become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call “Tokenization.”

So what is Tokenization? It’s a word I made up for this problem.

But what is it, really? It’s splitting up a string based on a delimiter — in this case, a comma — but being wary of substring delimiters. In this case, that’s a pair of apostrophes, because that’s what TSQL uses for strings.

I think this is best illustrated with an example string:

[sql]DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''
[/sql]

A basic split string function, of the sort you can find just about anywhere, will produce the following output:

[sql]SELECT *
FROM dbo.SplitString(@Tokens, ',')

OutParam
-------------
a
'b'
''c'
'd'
'e''
f
'1
2
3
4'
[/sql]
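
The post doesn’t show dbo.SplitString itself, so here is a minimal sketch of what such a naive splitter might look like. It is a hypothetical stand-in, not the function that produced the output above: the parameter names are made up, it trims each piece, and it assumes a Numbers table with a Number column covering at least 1 through LEN(@Input) + 1.

[sql]-- Hypothetical stand-in for dbo.SplitString; names and trimming are assumptions.
-- Assumes Numbers(Number INT) holds at least 1 .. LEN(@Input) + 1.
CREATE FUNCTION dbo.SplitString
(
    @Input NVARCHAR(2000),
    @Delimiter NCHAR(1)
)
RETURNS TABLE
AS
RETURN
(
    SELECT
        LTRIM(RTRIM(SUBSTRING(
            @Input,
            n.Number,
            -- Distance from this start position to the next delimiter
            CHARINDEX(@Delimiter, @Input + @Delimiter, n.Number) - n.Number
        ))) AS OutParam
    FROM Numbers n
    WHERE n.Number BETWEEN 1 AND LEN(@Input) + 1
        -- A piece starts at position 1 and right after every delimiter
        AND SUBSTRING(@Delimiter + @Input, n.Number, 1) = @Delimiter
)
[/sql]

Because it cuts at every comma without looking at the apostrophes, it has no way to keep '1,2,3,4' in one piece. Row order is also not guaranteed unless the caller adds an ORDER BY.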

Well, that’s wrong, because what I want to do is keep the substrings intact (or “tokens,” as I like to call them; thus, Tokenization!).

The output I desire is:

[sql]Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'
[/sql]

Notice that each substring delimited with apostrophes is kept whole, even when it contains commas.

And here’s how I’ve solved this problem…

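[sql]CREATE FUNCTION dbo.Tokenize
(
    @Input NVARCHAR(2000)
)
RETURNS @Tokens TABLE
(
    TokenNum INT IDENTITY(1,1),
    Token NVARCHAR(2000)
)
AS
BEGIN
    DECLARE @i INT SET @i = 0
    DECLARE @StartChar INT SET @StartChar = 1
    DECLARE @Quote INT SET @Quote = 0

    -- One row per character of the padded input
    DECLARE @Chars TABLE
    (
        CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        TheChar CHAR(1),
        TheCount INT,
        StartChar INT
    )

    -- Pad both ends with a delimiter so the first and last tokens are bounded
    SET @Input = ' , ' + @Input + ' , '

    INSERT @Chars (TheChar)
    SELECT SUBSTRING(@Input, n.Number, 1)
    FROM Numbers n
    WHERE n.Number > 0
        AND n.Number <= LEN(@Input)
    ORDER BY n.Number

    -- "Aggregate update": the variables carry state from one row to the next.
    -- Each row looks at the character that follows it (Chars1).
    UPDATE Chars SET
        -- TheCount drops to 0 when the next character is a comma
        -- and the apostrophe count is even (outside a substring)
        @i = Chars.TheCount =
            CASE
                WHEN Chars1.TheChar = ','
                    AND @Quote % 2 = 0 THEN 0
                ELSE @i + 1
            END,
        -- Count apostrophes since the last delimiter
        @Quote =
            CASE
                WHEN Chars1.TheChar = '''' THEN @Quote + 1
                WHEN @i = 0 THEN 0
                ELSE @Quote
            END,
        -- Remember the last delimiter position; on a TheCount = 0 row,
        -- StartChar is the character just after it
        @StartChar = Chars.StartChar =
            CASE
                WHEN @i = 1 THEN Chars1.CharNum - 1
                WHEN @i = 0 THEN @StartChar + 1
                ELSE @StartChar
            END
    FROM @Chars Chars
    JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1

    INSERT @Tokens(Token)
    SELECT
        RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
    FROM (
        -- Rows with TheCount = 0 mark the last character of each token
        SELECT StartChar, CharNum
        FROM @Chars
        WHERE TheCount = 0

        UNION ALL

        -- Whatever follows the last token boundary (normally just the
        -- padding comma, which the WHERE clause below discards)
        SELECT
            MAX
            (
                CASE TheCount
                    WHEN 0 THEN CharNum
                    ELSE 0
                END
            ) + 1,
            MAX(CharNum)
        FROM @Chars
    ) x
    WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT IN ('', ',')
    ORDER BY x.StartChar
    RETURN
END
[/sql]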

A word of warning: this UDF uses the undocumented (and unsupported) “aggregate update” functionality. The UPDATE assigns to variables and columns in the same SET clause and relies on the rows being processed in CharNum order, which SQL Server does not guarantee. I’ve tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you not to use it in your own projects without extensive testing. Handle with care.
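
If relying on unsupported behavior is a worry, the same rules can be written as a plain character-by-character loop. What follows is a minimal sketch rather than a replacement for the function above: the name dbo.TokenizeLoop and its variables are made up for illustration, and it assumes the same rules (split on a comma only when the number of apostrophes seen since the last split is even, then trim and drop empty tokens).

[sql]-- Hypothetical alternative, not from the original function:
-- a row-by-row scan that uses only documented features.
CREATE FUNCTION dbo.TokenizeLoop
(
    @Input NVARCHAR(2000)
)
RETURNS @Tokens TABLE
(
    TokenNum INT IDENTITY(1,1),
    Token NVARCHAR(2000)
)
AS
BEGIN
    DECLARE @Pos INT, @Start INT, @Quotes INT, @Len INT
    DECLARE @Token NVARCHAR(2000)

    SET @Pos = 1
    SET @Start = 1
    SET @Quotes = 0
    -- DATALENGTH / 2 counts trailing spaces, which LEN would ignore
    SET @Len = DATALENGTH(@Input) / 2

    -- Run one position past the end so the final token is flushed
    WHILE @Pos <= @Len + 1
    BEGIN
        IF @Pos <= @Len AND SUBSTRING(@Input, @Pos, 1) = ''''
            SET @Quotes = @Quotes + 1

        -- A token ends at the end of the input, or at a comma
        -- reached while the apostrophe count is even
        IF @Pos = @Len + 1
            OR (SUBSTRING(@Input, @Pos, 1) = ',' AND @Quotes % 2 = 0)
        BEGIN
            SET @Token = LTRIM(RTRIM(SUBSTRING(@Input, @Start, @Pos - @Start)))

            IF @Token <> ''
                INSERT @Tokens (Token) VALUES (@Token)

            SET @Start = @Pos + 1
            SET @Quotes = 0
        END

        SET @Pos = @Pos + 1
    END

    RETURN
END
[/sql]

It is called the same way as dbo.Tokenize, for example SELECT Token FROM dbo.TokenizeLoop(@Tokens). It trades the set-based trick for a loop, so treat it as a cross-check for the aggregate update version rather than a tuned implementation.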

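And by the way, you need a numbers table to use dbo.Tokenize, of course. The function expects a table named Numbers with an integer column named Number, covering at least 1 through the length of the padded input string.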
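
In case you don’t have one yet, here is one way to build it. This is a sketch under a few assumptions: the table and column names are the ones dbo.Tokenize references, the 10,000 row count is an arbitrary choice that comfortably covers a 2,000 character input, and ROW_NUMBER and sys.all_objects require SQL Server 2005 or later.

[sql]-- Sketch only: names match what dbo.Tokenize reads from;
-- the row count of 10,000 is an arbitrary assumption.
CREATE TABLE dbo.Numbers
(
    Number INT NOT NULL PRIMARY KEY
)

-- Cross join a system view with itself to get plenty of rows to number
INSERT dbo.Numbers (Number)
SELECT TOP (10000)
    ROW_NUMBER() OVER (ORDER BY a.object_id)
FROM sys.all_objects a
CROSS JOIN sys.all_objects b
[/sql]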

As for using this thing, it’s pretty easy:

[sql]DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

SELECT Token
FROM dbo.Tokenize(@Tokens)

Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'
[/sql]

… and it even appears to work properly!

Enjoy… an application for this, and for the other strange things I’ve been posting recently, is coming very, very soon.

1 COMMENT

Jeremy Swartwood May 29, 2013 At 11:39 pm

Thank you for this. To note, if a token contains nothing but a space, your script excludes this extra "token". In my situation I needed to always compare a specific token number, so I needed this empty token.
These changes are not efficient, but they worked.
I changed the INSERT section to use a CASE expression that compares the trimmed substring against ''. When they match, it returns the substring without LTRIM/RTRIM; otherwise it returns the trimmed value.
case when RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) = ''
then SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)
else RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
end
Additionally, I had to change the WHERE clause because SQL thinks that '' = ' '.
SUBSTRING(@Input, StartChar, CharNum - StartChar + 1) NOT LIKE ''
AND
RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT LIKE ','
