比較使用TSQL那些方法效能可最接近CLR
撰寫TSQL達到string splitting並不難,以前我使用CTE來處理但不太關心效能,但如你是效能至上的話,透過SQL CLR在任何案例中,效能都有不錯表現,
這篇我要測試使用TSQL還有那些方法可以達到string splitting且效能接近CLR。
CTE
CREATE FUNCTION dbo.StrSplit_CTE
(
@inputcontent nvarchar(MAX),
@Delimiter nvarchar(255)
)
RETURNS @Items TABLE (Item NVARCHAR(4000))
WITH SCHEMABINDING
AS
BEGIN
DECLARE @ll INT = LEN(@inputcontent) + 1, @ld INT = LEN(@Delimiter);
WITH a AS
(
SELECT
[start] = 1,
[end] = COALESCE(NULLIF(CHARINDEX(@Delimiter,
@inputcontent, 1), 0), @ll),
[value] = SUBSTRING(@inputcontent, 1,
COALESCE(NULLIF(CHARINDEX(@Delimiter,
@inputcontent, 1), 0), @ll) - 1)
UNION ALL
SELECT
[start] = CONVERT(INT, [end]) + @ld,
[end] = COALESCE(NULLIF(CHARINDEX(@Delimiter,
@inputcontent, [end] + @ld), 0), @ll),
[value] = SUBSTRING(@inputcontent, [end] + @ld,
COALESCE(NULLIF(CHARINDEX(@Delimiter,
@inputcontent, [end] + @ld), 0), @ll)-[end]-@ld)
FROM a
WHERE [end] < @ll
)
INSERT @Items SELECT [value]
FROM a
WHERE LEN([value]) > 0
OPTION (MAXRECURSION 0);
RETURN;
END
GO
透過遞迴CTE取得每個部份字串,但要注意MVTF無法使用統計值(因為需要本地變數資料表),而且要設定最大遞迴數0,以免預設100不夠用而出現錯誤。
Test
declare @inputcontent nvarchar(max)=N'Patriots,Red Sox,Bruins,abcd,rico,sa,ricoisme,bar,mort,splunge,bacon'
select * from dbo.StrSplit_CTE(@inputcontent,',')
select * from dbo.StrSplit_CTE(REPLICATE(@inputcontent,1000),',')
XML
create FUNCTION dbo.StrSplit_XML
(
@inputcontent nvarchar(MAX),
@Delimiter nvarchar(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')
FROM
(
SELECT x = CONVERT(XML, '<i>'
+ REPLACE(@inputcontent, @Delimiter, '</i><i>')
+ '</i>').query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO
如果可以確保輸入的XML資料正確的話,那麼這方法我還滿推薦的,且並不像CTE屬於MTVF(all inline)
TEST
declare @inputcontent nvarchar(max)=N'Patriots,Red Sox,Bruins,abcd,rico,sa,ricoisme,bar,mort,splunge,bacon'
select * from dbo.StrSplit_XML(@inputcontent,',')
select * from dbo.StrSplit_XML(REPLICATE(@inputcontent,1000),',')
Numbers table
DECLARE @UpperLimit INT = 100000;
WITH n AS
(
SELECT
x = ROW_NUMBER() OVER (ORDER BY (select null))
FROM sys.all_objects AS s1
CROSS JOIN sys.all_objects AS s2
CROSS JOIN sys.all_objects AS s3
)
SELECT Number = x
INTO dbo.testNumbers
FROM n
WHERE x BETWEEN 1 AND @UpperLimit;
CREATE UNIQUE CLUSTERED INDEX n ON dbo.testNumbers(Number)
WITH (DATA_COMPRESSION = PAGE);--使用頁面壓縮減少I/O
GO
CREATE FUNCTION dbo.StrSplit_Numbers
(
@inputcontent nvarchar(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = SUBSTRING(@inputcontent, Number,
CHARINDEX(@Delimiter, @inputcontent + @Delimiter, Number) - Number)
FROM dbo.testNumbers
WHERE Number <= CONVERT(INT, LEN(@inputcontent))
AND SUBSTRING(@Delimiter + @inputcontent, Number, LEN(@Delimiter)) = @Delimiter
);
GO
該方法須先建立數字位置資料表表來存放字串起始位置,而且還要確保該資料表總筆數>=字串最大長度,這方法有一點麻煩。
TEST
declare @inputcontent nvarchar(max)=N'Patriots,Red Sox,Bruins,abcd,rico,sa,ricoisme,bar,mort,splunge,bacon'
select * from dbo.StrSplit_Numbers(@inputcontent,',')
select * from dbo.StrSplit_Numbers(REPLICATE(@inputcontent,1000),',')
Jeff Moden
利用Tally Table分配每個字串的起始字元位置和長度(類似Numbers table
概念),我覺得很Cool
create FUNCTION dbo.StrSplit_Moden
(
@inputcontent nvarchar(max),
@Delimiter nvarchar(255)
)
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
WITH E1(N) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1),
E2(N) AS (SELECT 1 FROM E1 a, E1 b),
E4(N) AS (SELECT 1 FROM E2 a, E2 b),
E42(N) AS (SELECT 1 FROM E4 a, E2 b),
cteTally(N) AS (SELECT 0 UNION ALL SELECT TOP (DATALENGTH(ISNULL(@inputcontent,1)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E42),
cteStart(N1) AS (SELECT t.N+1 FROM cteTally t
WHERE (SUBSTRING(@inputcontent,t.N,1) = @Delimiter OR t.N = 0))
SELECT Item = SUBSTRING(@inputcontent, s.N1, ISNULL(NULLIF(CHARINDEX(@Delimiter,@inputcontent,s.N1),0)-s.N1,8000))
FROM cteStart s;
Test
declare @inputcontent nvarchar(max)=N'Patriots,Red Sox,Bruins,abcd,rico,sa,ricoisme,bar,mort,splunge,bacon'
select * from dbo.StrSplit_Moden(@inputcontent,',')
select * from dbo.StrSplit_Moden(REPLICATE(@inputcontent,1000),',')
結論:
方法 |
10001 rows (長字串) |
11 rows(短字串) |
CTE |
CPU 時間 = 812 ms,經過時間 = 1330 ms。 掃描計數 1,邏輯讀取 31,讀取前讀取 6。 |
CPU 時間 = 15 ms,經過時間 = 16 ms。 掃描計數 1,邏輯讀取 1 |
XML |
CPU 時間 = 141 ms,經過時間 = 352 ms。 掃描計數 0,邏輯讀取 27,讀取前讀取 0,LOB 邏輯讀取 1461。 |
CPU 時間 = 0 ms,經過時間 = 1 ms。 |
Numbers table |
CPU 時間 = 156 ms,經過時間 = 296 ms。 掃描計數 1,邏輯讀取 95,實體讀取 0,讀取前讀取 85。 |
CPU 時間 = 0 ms,經過時間 = 1 ms。 掃描計數 1,邏輯讀取 3,實體讀取 1,讀取前讀取 8。 |
Jeff Moden
|
CPU 時間 = 985 ms,經過時間 = 1161 ms。 |
CPU 時間 = 0 ms,經過時間 = 0 ms。 |
String splitting透過CLR效能還是最佳,如果嫌維護CLR麻煩的話,那麼XML是第二選擇,因為Numbers table和Jeff Moden需額外考慮字串最大長度,這一點我個人覺得彈性不是那麼好。
參考
[SQL SERVER][Performance]資料表值函數(2)
SQLCLR String Splitting Part 2: Even Faster, Even More Scalable
Tally OH! An Improved SQL 8K “CSV Splitter” Function