Home Uncategorized T-SQL Tuesday #001: Exploring "Fuzzy" Interval Islands Using SQLCLR

T-SQL Tuesday #001: Exploring "Fuzzy" Interval Islands Using SQLCLR

December 8, 2009

When working with time intervals, we often want to ask a couple of basic questions:

Which time periods are not covered by our intervals? These are known as “gaps”.
What are the time ranges that we are fully covering? These are known as “islands”.

If you’re unfamiliar with “gaps” and “islands” I highly recommend reading some of Itzik Ben-Gan’s recent work in SQL Server Magazine. He’s had a great series going on the topic. But one thing that he hasn’t found a good T-SQL solution for is a problem that I call “fuzzy islands.”

When is an island fuzzy? When it doesn’t necessarily have a fixed end time. For an example of this, consider a store, selling a number of products. Management might want to see a report showing when each product was selling. This is, effectively, an island question. The goal is to find all of the covered ranges during which sales occurred. But running such a report, you might find that way too much data is returned. A given product may have sold units on Monday, Tuesday, and Thursday, but for some reason no one bought one on Wednesday. Creating a new island every time there is a small gap will create a 300-page report where a 1-page dashboard might suffice–not a good user experience, nor a good way of representing the data. The solution? Introduce a bit of fuzziness–a rule that says, for instance, that a gap is only a gap if it’s longer than 7 days.

Answering the fuzzy islands question is not a very difficult thing to do in T-SQL. The basic algorithm follows:

Find all of the “start” dates or times. These are simply those dates or times for which a previous date or time in the fuzzy interval does not exist. So if a row is dated 2009-12-08 and our fuzzy granularity is 7 days, we know we have a start date if there is no other covered data from the end of November.
For each start date identified, find the minimum date greater than the start date. This is done by looking ahead rather than behind, so if our date is 2009-12-08 and we have a granularity of 7 days, we’ll look forward until December 15th.
Optionally, add the fuzzy factor to the end date. This is something that I think is a good idea, as it introduces the concept of an “active” interval–a period over which, for example, a product is considered to have been selling. The interval shouldn’t necessarily terminate the day that the last sale occurred. But of course this depends on the situation in question.

The following query produces a fuzzy islands report for each product in the AdventureWorks (or AdventureWorks2008) Production.TransactionHistory table. You can modify the @active_interval variable to tweak the fuzziness and change the output.

[sql]–Find all “active” product time ranges, meaning that the product
–has sold within the previous 7 days
DECLARE @active_interval INT = 7

SELECT DISTINCT
t_s.ProductID,
t_s.TransactionDate AS StartDate,
DATEADD
(
dd,
@active_interval,
(
SELECT
MIN(t_e.TransactionDate)
FROM Production.TransactionHistory AS t_e
WHERE
t_e.ProductID = t_s.ProductID
AND t_e.TransactionDate >= t_s.TransactionDate
AND NOT EXISTS
(
SELECT *
FROM Production.TransactionHistory AS t_ae
WHERE
t_ae.ProductID = t_s.ProductID
AND t_ae.TransactionDate BETWEEN
DATEADD(dd, 1, t_e.TransactionDate)
AND DATEADD(dd, @active_interval, t_e.TransactionDate)
)
)
) AS EndDate
FROM
(
SELECT DISTINCT
ProductID,
TransactionDate
FROM Production.TransactionHistory
) AS t_s
WHERE
NOT EXISTS
(
SELECT *
FROM Production.TransactionHistory AS t_ps
WHERE
t_ps.ProductID = t_s.ProductID
AND t_ps.TransactionDate BETWEEN
DATEADD(dd, -@active_interval, t_s.TransactionDate)
AND DATEADD(dd, -1, t_s.TransactionDate)
)
ORDER BY
ProductID,
StartDate
GO[/sql]

Running this query you’ll find that it works… But the results are returned a bit more slowly than we might desire–15 to 16 seconds on my end. Looking at the query plan, the reason for this becomes quite obvious: Lots and lots of table scans. How can we eliminate all of the overhead?

SQLCLR to the rescue. The best way to solve this problem–at least until the SQL Server team adds proper OVER clause support (LAG and LEAD, specifically)–is to use a cursor algorithm. We could do this in a T-SQL cursor, but why bother? Cursor logic in SQLCLR is much, much faster.

To solve the problem, I implemented an enumerator, called active_products_enumerator. The enumerator is initialized using a SqlDataReader and an “active interval” — the number of days we’re allowing for fuzziness. The DataReader is expected to return rows ordered by ProductID and TransactionDate. The enumeration process uses the following algorithm:

If the current ProductID is not the same as the previous ProductID, return an end date for the previous interval and start a new one
If the current period date is greater than the previous period date plus the active interval, return an end date for the previous interval and start a new one
Otherwise, continue

Following is the MoveNext method for the enumerator (the complete code is attached to this post so that you can run it on your end without my bombarding you with a gigantic code-filled post):

[sql]

public bool MoveNext()
{
try
{
current_results = null;

while (r.Read())
{
new_ProductID = r.GetSqlInt32(0);
new_period_date = r.GetDateTime(1);

if (new_ProductID != ProductID)
{
if (ProductID != 0)
{
current_results = new results(
ProductID,
StartDate,
period_plus_interval);
}

ProductID = new_ProductID;
StartDate = new_period_date;
period_plus_interval = new_period_date.AddDays(activeInterval);

if (current_results != null)
return (true);
}
else
{
if (period_plus_interval < new_period_date)
{
current_results = new results(
ProductID,
StartDate,
period_plus_interval);

StartDate = new_period_date;
}

period_plus_interval = new_period_date.AddDays(activeInterval);

if (current_results != null)
return (true);
}
}

//return the last row of data
if (ProductID != 0)
{
current_results = new results(
ProductID,
StartDate,
period_plus_interval);

//set this to 0 so we don’t return another row
ProductID = 0;
}

if (current_results != null)
return (true);
else
{
r.Dispose();
return (false);
}
}
catch
{
r.Dispose();
throw;
}
} [/sql]

Put into a table-valued function, as the attached code does, this algorithm will return the same data as the T-SQL query in under a third of a second on my end–around 45 times faster than the T-SQL version.

Note that I’ve played some games with a loopback connection to get this whole thing to work. That’s a topic for a future blog post, so stay tuned. In the meantime, please realize that you’ll have to catalog the assembly with EXTERNAL_ACCESS permission to make this happen.

This post was created for T-SQL Tuesday, the revolving SQL Server blog party, hosted this month by… me. Enjoy!

File Attachment: AdamMachanic_TSQLTuesday_ActiveProducts.zip

T-SQL Tuesday #001: Exploring "Fuzzy" Interval Islands Using SQLCLR

LEAVE A REPLY Cancel reply

Popular Posts

The SQL Hall of Shame

Capturing Attention: Writing Great Session Descriptions

Invitation to Participate in T-SQL Tuesday #001: Date/Time Tricks

SQLCLR String Splitting Part 2: Even Faster, Even More Scalable

Scalar functions, inlining, and performance: An entertaining title for a boring post

T-SQL Tuesday #21 – A Day Late and Totally Full of It

Next-Level Parallel Plan Forcing: An Alternative to 8649

SQLQueryStress: The Source Code