Home Uncategorized Tokenize UDF

    Tokenize UDF

    145
    1

    Yes, another string splitting UDF from a guy who’s obvioiusly become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call, “Tokenization.”

    So what is Tokenization? It’s a word I made up for this problem.

    But what is it, really? It’s splitting up a string based on a delimiter — in this case, a comma — but being wary of substring delimiters. In this case, that’s a pair of apostrophes, because that’s what TSQL uses for strings.

    I think this is best illustrated with an example string:

    [sql]DECLARE @Tokens VARCHAR(50)

    SET @Tokens = ‘a, ”b”, ””c”, ”d”, ”e””, f, ”1,2,3,4”’
    [/sql]

    The basic split string function that you can find will produce the following output:

    [sql]SELECT *
    FROM dbo.SplitString(@Tokens, ‘,’)

    OutParam
    ————-
    a
    ‘b’
    ”c’
    ‘d’
    ‘e”
    f
    ‘1
    2
    3
    4’
    [/sql]

    Well, that’s wrong. Because what I want to do is maintain the substrings (or, “tokens,” as I like to call them — thus, Tokenization!)

    The output I desire is:

    [sql]Token
    ——–
    a
    ‘b’
    ”c’, ‘d’, ‘e”
    f
    ‘1,2,3,4’
    [/sql]

    Notice that substrings — delimited with apostrophes — should be maintained.

    And here’s how I’ve solved this problem…

    [sql]CREATE FUNCTION dbo.Tokenize
    (
    @Input NVARCHAR(2000)
    )
    RETURNS @Tokens TABLE
    (
    TokenNum INT IDENTITY(1,1),
    Token NVARCHAR(2000)
    )
    AS
    BEGIN
    DECLARE @i INT SET @i = 0
    DECLARE @StartChar INT SET @StartChar = 1
    DECLARE @Quote INT SET @Quote = 0

    DECLARE @Chars TABLE
    (
    CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    TheChar CHAR(1),
    TheCount INT,
    StartChar INT
    )

    SET @Input = ‘ , ‘ + @Input + ‘ , ‘

    INSERT @Chars (TheChar)
    SELECT SUBSTRING(@Input, n.Number, 1)
    FROM Numbers n
    WHERE n.Number > 0
    AND n.Number <= LEN(@Input)
    ORDER BY n.Number

    UPDATE Chars SET
    @i = Chars.TheCount =
    CASE
    WHEN Chars1.TheChar = ‘,’
    AND @Quote % 2 = 0 THEN 0
    ELSE @i + 1
    END,
    @Quote =
    CASE
    WHEN Chars1.TheChar = ”” THEN @Quote + 1
    WHEN @i = 0 THEN 0
    ELSE @Quote
    END,
    @StartChar = Chars.StartChar =
    CASE
    WHEN @i = 1 THEN Chars1.CharNum – 1
    WHEN @i = 0 THEN @StartChar + 1
    ELSE @StartChar
    END
    FROM @Chars Chars
    JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1

    INSERT @Tokens(Token)
    SELECT
    RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1)))
    FROM (
    SELECT StartChar, CharNum
    FROM @Chars
    WHERE TheCount = 0

    UNION ALL

    SELECT
    MAX
    (
    CASE TheCount
    WHEN 0 THEN CharNum
    ELSE 0
    END
    ) + 1,
    MAX(CharNum)
    FROM @Chars
    ) x
    WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1))) NOT IN (”, ‘,’)
    ORDER BY x.StartChar
    RETURN
    END
    [/sql]

    A word of warning: This UDF uses the undocumented — and unsupported — “aggregate update” functionality. I’ve tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you to not use it in your own projects without extensive testing! MS doesn’t support this one, so handle with care.

    And by the way, you need a numbers table to use this thing. Of course.

    As for using this thing, it’s pretty easy:

    [sql]DECLARE @Tokens VARCHAR(50)

    SET @Tokens = ‘a, ”b”, ””c”, ”d”, ”e””, f, ”1,2,3,4”’

    SELECT Token
    FROM dbo.Tokenize(@Tokens)

    Token
    ——–
    a
    ‘b’
    ”c’, ‘d’, ‘e”
    f
    ‘1,2,3,4’
    [/sql]

    … and it even appears to work properly!

    Enjoy… and application for this and other strange things I’ve been posting recently coming very, very soon.

    1 COMMENT

    1. Thank you for this.  To note, if there is a token and nothing but a space, your script excludes this extra "token".  In my situation I needed to always compare a specific token number so I needed this empty token.
      These changes are not efficient, but they worked.
      I changed the INSERT section to use a CASE instead that compared against ” and then it didn’t use the LTREM/RTRIM else it used the trim.
      case when RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1))) = ”
      then SUBSTRING(@Input, StartChar, CharNum – StartChar + 1)
       else RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1)))
      end
      Additionally, I had to change the WHERE clause because SQL thinks that ” = ‘ ‘.
      SUBSTRING(@Input, StartChar, CharNum – StartChar + 1) NOT LIKE ”
      AND
      RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum – StartChar + 1))) NOT LIKE ‘,’

    LEAVE A REPLY

    Please enter your comment!
    Please enter your name here