Yes, another string splitting UDF from a guy who’s obvioiusly become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call, “Tokenization.”
So what is Tokenization? It’s a word I made up for this problem.
But what is it, really? It’s splitting up a string based on a delimiter — in this case, a comma — but being wary of substring delimiters. In this case, that’s a pair of apostrophes, because that’s what TSQL uses for strings.
I think this is best illustrated with an example string:
[sql]DECLARE @Tokens VARCHAR(50)
SET @Tokens = ‘a, ”b”, ””c”, ”d”, ”e””, f, ”1,2,3,4”’
[/sql]
The basic split string function that you can find will produce the following output:
[sql]SELECT *
FROM dbo.SplitString(@Tokens, ‘,’)
OutParam
————-
a
‘b’
”c’
‘d’
‘e”
f
‘1
2
3
4’
[/sql]
Well, that’s wrong. Because what I want to do is maintain the substrings (or, “tokens,” as I like to call them — thus, Tokenization!)
The output I desire is:
[sql]Token
——–
a
‘b’
”c’, ‘d’, ‘e”
f
‘1,2,3,4’
[/sql]
Notice that substrings — delimited with apostrophes — should be maintained.
And here’s how I’ve solved this problem…
[sql]CREATE FUNCTION dbo.Tokenize
(
@Input NVARCHAR(2000)
)
RETURNS @Tokens TABLE
(
TokenNum INT IDENTITY(1,1),
Token NVARCHAR(2000)
)
AS
BEGIN
DECLARE @i INT SET @i = 0
DECLARE @StartChar INT SET @StartChar = 1
DECLARE @Quote INT SET @Quote = 0
DECLARE @Chars TABLE
(
CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
TheChar CHAR(1),
TheCount INT,
StartChar INT
)
SET @Input = ‘ , ‘ + @Input + ‘ , ‘
INSERT @Chars (TheChar)
SELECT SUBSTRING(@Input, n.Number, 1)
FROM Numbers n
WHERE n.Number > 0
AND n.Number <= LEN(@Input)
ORDER BY n.Number
UPDATE Chars SET
@i = Chars.TheCount =
CASE
WHEN Chars1.TheChar = ‘,’
AND @Quote % 2 = 0 THEN 0
ELSE @i + 1
END,
@Quote =
CASE
WHEN Chars1.TheChar = ”” THEN @Quote + 1
WHEN @i = 0 THEN 0
ELSE @Quote
END,
@StartChar = Chars.StartChar =
CASE
WHEN @i = 1 THEN Chars1.CharNum – 1
WHEN @i = 0 THEN @StartChar + 1
ELSE @StartChar
END
FROM @Chars Chars
JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1
INSERT @Tokens(Token)
SELECT
RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1)))
FROM (
SELECT StartChar, CharNum
FROM @Chars
WHERE TheCount = 0
UNION ALL
SELECT
MAX
(
CASE TheCount
WHEN 0 THEN CharNum
ELSE 0
END
) + 1,
MAX(CharNum)
FROM @Chars
) x
WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1))) NOT IN (”, ‘,’)
ORDER BY x.StartChar
RETURN
END
[/sql]
A word of warning: This UDF uses the undocumented — and unsupported — “aggregate update” functionality. I’ve tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you to not use it in your own projects without extensive testing! MS doesn’t support this one, so handle with care.
And by the way, you need a numbers table to use this thing. Of course.
As for using this thing, it’s pretty easy:
[sql]DECLARE @Tokens VARCHAR(50)
SET @Tokens = ‘a, ”b”, ””c”, ”d”, ”e””, f, ”1,2,3,4”’
SELECT Token
FROM dbo.Tokenize(@Tokens)
Token
——–
a
‘b’
”c’, ‘d’, ‘e”
f
‘1,2,3,4’
[/sql]
… and it even appears to work properly!
Enjoy… and application for this and other strange things I’ve been posting recently coming very, very soon.
Thank you for this. To note, if there is a token and nothing but a space, your script excludes this extra "token". In my situation I needed to always compare a specific token number so I needed this empty token.
These changes are not efficient, but they worked.
I changed the INSERT section to use a CASE instead that compared against ” and then it didn’t use the LTREM/RTRIM else it used the trim.
case when RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1))) = ”
then SUBSTRING(@Input, StartChar, CharNum – StartChar + 1)
else RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1)))
end
Additionally, I had to change the WHERE clause because SQL thinks that ” = ‘ ‘.
SUBSTRING(@Input, StartChar, CharNum – StartChar + 1) NOT LIKE ”
AND
RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1))) NOT LIKE ‘,’