Optimizing numeric range (interval) searches in SQL Server












This question is similar to Optimizing IP Range Search? but that one is restricted to SQL Server 2000.



Suppose I have 10 million ranges provisionally stored in a table structured and populated as below.



CREATE TABLE MyTable
(
Id INT IDENTITY PRIMARY KEY,
RangeFrom INT NOT NULL,
RangeTo INT NOT NULL,
CHECK (RangeTo > RangeFrom),
INDEX IX1 (RangeFrom,RangeTo),
INDEX IX2 (RangeTo,RangeFrom)
);

WITH RandomNumbers
AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
FROM sys.all_objects o1,
sys.all_objects o2,
sys.all_objects o3,
sys.all_objects o4)
INSERT INTO MyTable
(RangeFrom,
RangeTo)
SELECT Num,
Num + 1 + CRYPT_GEN_RANDOM(1)
FROM RandomNumbers


I need to know all ranges containing the value 50,000,000. I try the following query



SELECT *
FROM MyTable
WHERE 50000000 BETWEEN RangeFrom AND RangeTo


SQL Server shows that there were 10,951 logical reads and nearly 5 million rows were read to return the 12 matching ones.
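
The BETWEEN predicate is just shorthand for the two inequalities below; with IX1 only the RangeFrom bound can be used as a seek predicate, so RangeTo has to be checked as a residual against roughly half the table, which is where the ~5 million rows come from. This is a logically equivalent sketch, not a different query:

SELECT *
FROM MyTable
WHERE RangeFrom <= 50000000  -- seek predicate on IX1
  AND RangeTo >= 50000000;   -- residual predicate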






Can I improve on this performance? Any restructuring of the table or additional indexes is fine.










  • If I'm understanding the set up of the table correctly, you're picking random numbers uniformly to form your ranges, with no constraints on the "size" of each range. And your probe is for the middle of the overall range 1..100M. In that case - no apparent clustering due to uniform randomness - I don't know why an index on either lower bound or upper bound would be helpful. Can you explain that?
    – davidbak
    2 days ago












  • @davidbak the conventional indexes on this table are indeed not very helpful in the worst case, as it has to scan half the range - hence asking for potential improvements on it. There is a nice improvement in the linked question for SQL Server 2000 with the introduction of the "granule". I was hoping spatial indexes could help here as they support contains queries, and whilst they work well at reducing the amount of data read, they seem to add other overhead that counteracts this.
    – Martin Smith
    2 days ago












  • I don't have the facility to try it - but I wonder if two indexes - one on the lower bound, one on the upper - and then an inner join - would let the query optimizer work something out.
    – davidbak
    2 days ago








  • Related: Select all overlapping ranges from starting range
    – Paul White
    2 days ago
















6 Answers














Columnstore is very helpful here compared to a nonclustered index, which scans half the table. A nonclustered columnstore index provides most of the benefit, but inserting ordered data into a clustered columnstore index is even better.



DROP TABLE IF EXISTS dbo.MyTableCCI;

CREATE TABLE dbo.MyTableCCI
(
Id INT PRIMARY KEY,
RangeFrom INT NOT NULL,
RangeTo INT NOT NULL,
CHECK (RangeTo > RangeFrom),
INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO dbo.MyTableCCI
SELECT TOP (987654321) *
FROM dbo.MyTable
ORDER BY RangeFrom ASC
OPTION (MAXDOP 1);


By design I can get rowgroup elimination on the RangeFrom column, which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column:



Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.
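
For reference, the probe query is unchanged apart from the table name. A minimal way to reproduce the segment-elimination message quoted above (the exact counts will depend on how the rowgroups were built) is:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT *
FROM dbo.MyTableCCI
WHERE 50000000 BETWEEN RangeFrom AND RangeTo;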







  • yep definitely looking for other approaches to consider without the 2000 restriction. Doesn't sound like that will be beaten.
    – Martin Smith
    2 days ago



















One alternative way of representing a range would be as points on a line.



The below migrates all the data into a new table with the range represented as a geometry datatype.



CREATE TABLE MyTable2
(
Id INT IDENTITY PRIMARY KEY,
Range GEOMETRY NOT NULL,
RangeFrom AS Range.STPointN(1).STX,
RangeTo AS Range.STPointN(2).STX,
CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
);

SET IDENTITY_INSERT MyTable2 ON

INSERT INTO MyTable2
(Id,
Range)
SELECT ID,
geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
FROM MyTable

SET IDENTITY_INSERT MyTable2 OFF


CREATE SPATIAL INDEX index_name
ON MyTable2 ( Range )
USING GEOMETRY_GRID
WITH (
BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
GRIDS = (HIGH, HIGH, HIGH, HIGH),
CELLS_PER_OBJECT = 16);


The equivalent query to find ranges containing the value 50,000,000 is below.



SELECT Id,
RangeFrom,
RangeTo
FROM MyTable2
WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


The reads for this show an improvement on the 10,951 from the original query.



Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.
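
As an optional diagnostic (a sketch only - the timings above do not depend on it), sp_help_spatial_geometry_index reports how selective the grid is for this particular probe, which helps show where the extra filtering overhead in the plan comes from:

DECLARE @probe geometry = geometry::STPointFromText('POINT (50000000 0)', 0);

EXEC sp_help_spatial_geometry_index
     @tabname = 'MyTable2',
     @indexname = 'index_name',
     @verboseoutput = 0,
     @query_sample = @probe;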



The execution plan is more complex as below



(execution plan image)



The only case where the rewrite reliably performs better for me is with a cold cache.



So it's disappointing in this case, and difficult to recommend this rewrite, but publication of negative results can also be useful.






    Paul White pointed out an answer to a similar question containing a link to an interesting article by Itzik Ben Gan. This describes the "Static Relational Interval Tree" model that allows this to be done efficiently.



    In summary this approach involves storing a computed ("forknode") value based on the interval values in the row. When searching for ranges that intersect another range, it is possible to precalculate the possible forknode values that matching rows must have and use these to find the results with a maximum of 31 seek operations (the below supports integers in the range 0 to the max signed 32-bit int).



    Based on this I restructured the table as below.



    CREATE TABLE dbo.MyTable3
    (
    Id INT IDENTITY PRIMARY KEY,
    RangeFrom INT NOT NULL,
    RangeTo INT NOT NULL,
    node AS RangeTo - RangeTo % POWER(2, FLOOR(LOG((RangeFrom - 1) ^ RangeTo, 2))) PERSISTED NOT NULL,
    CHECK (RangeTo > RangeFrom)
    );

    CREATE INDEX ix1 ON dbo.MyTable3 (node, RangeFrom) INCLUDE (RangeTo);
    CREATE INDEX ix2 ON dbo.MyTable3 (node, RangeTo) INCLUDE (RangeFrom);

    SET IDENTITY_INSERT MyTable3 ON

    INSERT INTO MyTable3
    (Id,
    RangeFrom,
    RangeTo)
    SELECT Id,
    RangeFrom,
    RangeTo
    FROM MyTable

    SET IDENTITY_INSERT MyTable3 OFF
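
    To make the computed node column less opaque, here is a worked example with made-up bounds (values chosen purely for illustration): the expression rounds RangeTo down to a multiple of the highest power of two at which RangeFrom - 1 and RangeTo differ, which is the fork node of the interval.

    -- Illustrative only: (49999995 - 1) ^ 50000010 = 240, the highest differing bit is
    -- 2^7 = 128, and 50000010 - 50000010 % 128 = 50000000, so this row's node is 50,000,000.
    DECLARE @RangeFrom INT = 49999995, @RangeTo INT = 50000010;

    SELECT @RangeTo - @RangeTo % POWER(2, FLOOR(LOG((@RangeFrom - 1) ^ @RangeTo, 2))) AS node;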


    And then used the following query (the article is looking for intersecting intervals so finding an interval containing a point is a degenerate case of this)



    DECLARE @value INT = 50000000;

    ;WITH N AS
    (
    SELECT 30 AS Level,
    CASE WHEN @value > POWER(2,30) THEN POWER(2,30) END AS selected_left_node,
    CASE WHEN @value < POWER(2,30) THEN POWER(2,30) END AS selected_right_node,
    (SIGN(@value - POWER(2,30)) * POWER(2,29)) + POWER(2,30) AS node
    UNION ALL
    SELECT N.Level-1,
    CASE WHEN @value > node THEN node END AS selected_left_node,
    CASE WHEN @value < node THEN node END AS selected_right_node,
    (SIGN(@value - node) * POWER(2,N.Level-2)) + node AS node
    FROM N
    WHERE N.Level > 0
    )
    SELECT I.id, I.RangeFrom, I.RangeTo
    FROM dbo.MyTable3 AS I
    JOIN N AS L
    ON I.node = L.selected_left_node
    AND I.RangeTo >= @value
    AND L.selected_left_node IS NOT NULL
    UNION ALL
    SELECT I.id, I.RangeFrom, I.RangeTo
    FROM dbo.MyTable3 AS I
    JOIN N AS R
    ON I.node = R.selected_right_node
    AND I.RangeFrom <= @value
    AND R.selected_right_node IS NOT NULL
    UNION ALL
    SELECT I.id, I.RangeFrom, I.RangeTo
    FROM dbo.MyTable3 AS I
    WHERE node = @value;


    This typically executes in 1 ms on my machine when all pages are in cache - with IO stats:



    Table 'MyTable3'. Scan count 24, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 4, logical reads 374, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


    and plan



    (execution plan image)



    NB: The source uses multistatement TVFs rather than a recursive CTE to get the nodes to join on, but in the interests of making my answer self-contained I have opted for the latter. For production use I'd probably use the TVFs.






      I was able to find a row mode approach that is competitive with the N/CCI approach, but you need to know something about your data. Suppose that you had a column that contained the difference of RangeTo and RangeFrom and you indexed it along with RangeFrom:



      ALTER TABLE dbo.MyTableWithDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

      CREATE INDEX IXDIFF ON dbo.MyTableWithDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


      If you knew all of the distinct values of DiffOfColumns then you could perform a seek for every value of DiffOfColumns with a range filter on RangeFrom to get all of the relevant data. For example, if we know that DiffOfColumns = 2 then the only allowed values for RangeFrom are 49999998, 49999999 and 50000000. Recursion can be used to get all of the distinct values of DiffOfColumns and it works well for your data set because there are only 256 of them. The query below takes about 6 ms on my machine:



      WITH RecursiveCTE
      AS
      (
      -- Anchor
      SELECT TOP (1)
      DiffOfColumns
      FROM dbo.MyTableWithDiff AS T
      ORDER BY
      T.DiffOfColumns

      UNION ALL

      -- Recursive
      SELECT R.DiffOfColumns
      FROM
      (
      -- Number the rows
      SELECT
      T.DiffOfColumns,
      rn = ROW_NUMBER() OVER (
      ORDER BY T.DiffOfColumns)
      FROM dbo.MyTableWithDiff AS T
      JOIN RecursiveCTE AS R
      ON R.DiffOfColumns < T.DiffOfColumns
      ) AS R
      WHERE
      -- Only the row that sorts lowest
      R.rn = 1
      )
      SELECT ca.*
      FROM RecursiveCTE rcte
      CROSS APPLY (
      SELECT mt.Id, mt.RangeFrom, mt.RangeTo
      FROM dbo.MyTableWithDiff mt
      WHERE mt.DiffOfColumns = rcte.DiffOfColumns
      AND mt.RangeFrom >= 50000000 - rcte.DiffOfColumns AND mt.RangeFrom <= 50000000
      ) ca
      OPTION (MAXRECURSION 0);


      You can see the usual recursive part along with the index seek for every distinct value:



      query plan 1



      The flaw with this approach is that it starts to get slow when there are too many distinct values for DiffOfColumns. Let's do the same test, but use CRYPT_GEN_RANDOM(2) instead of CRYPT_GEN_RANDOM(1).



      DROP TABLE IF EXISTS dbo.MyTableBigDiff;

      CREATE TABLE dbo.MyTableBigDiff
      (
      Id INT IDENTITY PRIMARY KEY,
      RangeFrom INT NOT NULL,
      RangeTo INT NOT NULL,
      CHECK (RangeTo > RangeFrom)
      );

      WITH RandomNumbers
      AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4)%100000000) AS Num
      FROM sys.all_objects o1,
      sys.all_objects o2,
      sys.all_objects o3,
      sys.all_objects o4)
      INSERT INTO dbo.MyTableBigDiff
      (RangeFrom,
      RangeTo)
      SELECT Num,
      Num + 1 + CRYPT_GEN_RANDOM(2) -- note the 2
      FROM RandomNumbers;


      ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

      CREATE INDEX IXDIFF ON dbo.MyTableBigDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


      The same query now finds 65536 rows from the recursive part and takes 823 ms of CPU on my machine. There are PAGELATCH_SH waits and other bad things going on. I can improve performance by bucketing the diff values to keep the number of unique values under control and adjusting for the bucketing in the CROSS APPLY. For this data set I'll try 256 buckets:



      ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns_bucket256 AS CAST(CEILING((RangeTo-RangeFrom) / 256.) AS INT);

      CREATE INDEX [IXDIFF8] ON dbo.MyTableBigDiff (DiffOfColumns_bucket256, RangeFrom) INCLUDE (RangeTo);


      One way to avoid getting extra rows (now I'm comparing to a rounded value instead of the true value) is by filtering on RangeTo:



      CROSS APPLY (
      SELECT mt.Id, mt.RangeFrom, mt.RangeTo
      FROM dbo.MyTableBigDiff mt
      WHERE mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
      AND mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
      AND mt.RangeFrom <= 50000000
      AND mt.RangeTo >= 50000000
      ) ca
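
      Putting the pieces together, the full bucketed query is just the earlier recursive query with DiffOfColumns_bucket256 and the adjusted CROSS APPLY substituted in (shown for completeness; an assembly of the fragments above rather than anything new):

      WITH RecursiveCTE
      AS
      (
          -- Anchor: lowest bucket value
          SELECT TOP (1)
                 DiffOfColumns_bucket256
          FROM dbo.MyTableBigDiff AS T
          ORDER BY T.DiffOfColumns_bucket256

          UNION ALL

          -- Recursive: next distinct bucket value
          SELECT R.DiffOfColumns_bucket256
          FROM
          (
              SELECT T.DiffOfColumns_bucket256,
                     rn = ROW_NUMBER() OVER (ORDER BY T.DiffOfColumns_bucket256)
              FROM dbo.MyTableBigDiff AS T
              JOIN RecursiveCTE AS R
                ON R.DiffOfColumns_bucket256 < T.DiffOfColumns_bucket256
          ) AS R
          WHERE R.rn = 1
      )
      SELECT ca.*
      FROM RecursiveCTE rcte
      CROSS APPLY (
          SELECT mt.Id, mt.RangeFrom, mt.RangeTo
          FROM dbo.MyTableBigDiff mt
          WHERE mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
            AND mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
            AND mt.RangeFrom <= 50000000
            AND mt.RangeTo >= 50000000
      ) ca
      OPTION (MAXRECURSION 0);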


      The full query now takes 6 ms on my machine.






        I found a pretty good solution using a computed column; however, it's only good for a single value. That being said, if you've got a magic value, maybe it's enough.



        Starting with your given sample, then modifying the table:



        ALTER TABLE dbo.MyTable
        ADD curtis_jackson
        AS CONVERT(BIT, CASE
        WHEN RangeTo >= 50000000
        AND RangeFrom < 50000000
        THEN 1
        ELSE 0
        END);

        CREATE INDEX IX1_redo
        ON dbo.MyTable (curtis_jackson)
        INCLUDE (RangeFrom, RangeTo);


        The query simply becomes:



        SELECT *
        FROM MyTable
        WHERE curtis_jackson = 1;


        Which returns the same results as your starting query. With execution plans turned off, here are the stats (truncated for brevity):



        Table 'MyTable'. Scan count 1, logical reads 3...

        SQL Server Execution Times:
        CPU time = 0 ms, elapsed time = 0 ms.


        And here's the query plan:



        NUTS






        • and the index could be filtered too
          – Martin Smith
          yesterday






        • @MartinSmith Not on a computed column :(
          – sp_BlitzErik
          yesterday










        • And how much time does the index need to be created? (so we know the overhead in case it is not a single value)
          – yper-crazyhat-cubeᵀᴹ
          16 hours ago












        • @yper-crazyhat-cubeᵀᴹ on my machine? light speed, baby.
          – sp_BlitzErik
          16 hours ago






        • @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis
          – Martin Smith
          1 hour ago
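
        For readability, the filtered-index variant from Martin Smith's comment above, laid out as statements (taken verbatim from the comment, nothing new):

        CREATE INDEX IX1_redo
        ON dbo.MyTable (curtis_jackson)
        INCLUDE (RangeFrom, RangeTo)
        WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000;

        SELECT *
        FROM MyTable
        WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000;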



















        As an homage to our new robot overlords, I decided to see if any of the new-ish R and Python functionality could help us out here. The answer is no, at least for the scripts that I could get working and returning correct results. If anyone with better knowledge comes along, well, feel free to spank me. My rates are reasonable.



        To do this, I set up a VM with 4 cores and 16 GB RAM, thinking this would be enough to deal with a ~200MB data set.



        Let's start with the language that doesn't exist in Boston!



        R



        EXEC sp_execute_external_script 
        @language = N'R',
        @script = N'
        tweener = 50000000
        MO = data.frame(MartinIn)
        MartinOut <- subset(MO, RangeFrom <= tweener & RangeTo >= tweener, select = c("Id","RangeFrom","RangeTo"))
        ',
        @input_data_1_name = N'MartinIn',
        @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
        @output_data_1_name = N'MartinOut',
        @parallel = 1
        WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


        This was a bad time.



        Table 'MyTable'. Scan count 1, logical reads 22400

        SQL Server Execution Times:
        CPU time = 3219 ms, elapsed time = 5349 ms.


        The execution plan is pretty uninteresting, though I don't know why the middle operator has to call us names.



        NUTS



        Next up, coding with crayons!



        Python



        EXEC sp_execute_external_script 
        @language = N'Python',
        @script = N'
        import pandas as pd
        MO = pd.DataFrame(MartinIn)
        tweener = 50000000
        MartinOut = MO[(MO.RangeFrom <= tweener) & (MO.RangeTo >= tweener)]
        ',
        @input_data_1_name = N'MartinIn',
        @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
        @output_data_1_name = N'MartinOut',
        @parallel = 1
        WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


        Just when you thought it couldn't get worse than R:



        Table 'MyTable'. Scan count 1, logical reads 22400

        SQL Server Execution Times:
        CPU time = 3797 ms, elapsed time = 10146 ms.


        Another foul-mouthed execution plan:



        NUTS



        Hmm and Hmmer



        So far, I'm not impressed. I can't wait to delete this VM.






        • You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.
          – wBob
          6 hours ago
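
        For reference, the parameterised call from the comment above, reformatted only (verbatim otherwise):

        DECLARE @input INT = 50000001;

        EXEC dbo.sp_execute_external_script
            @language = N'R',
            @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]',
            @parallel = 1,
            @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;',
            @params = N'@x INT',
            @x = 50000001
        WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL ));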













        Your Answer








        StackExchange.ready(function() {
        var channelOptions = {
        tags: "".split(" "),
        id: "182"
        };
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using("snippets", function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: false,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: null,
        bindNavPrevention: true,
        postfix: "",
        imageUploader: {
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        },
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        });


        }
        });














        draft saved

        draft discarded


















        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f225953%2foptimizing-numeric-range-interval-searches-in-sql-server%23new-answer', 'question_page');
        }
        );

        Post as a guest















        Required, but never shown

























        6 Answers
        6






        active

        oldest

        votes








        6 Answers
        6






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        9














        Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



        DROP TABLE IF EXISTS dbo.MyTableCCI;

        CREATE TABLE dbo.MyTableCCI
        (
        Id INT PRIMARY KEY,
        RangeFrom INT NOT NULL,
        RangeTo INT NOT NULL,
        CHECK (RangeTo > RangeFrom),
        INDEX CCI CLUSTERED COLUMNSTORE
        );

        INSERT INTO dbo.MyTableCCI
        SELECT TOP (987654321) *
        FROM dbo.MyTable
        ORDER BY RangeFrom ASC
        OPTION (MAXDOP 1);


        By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



        Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


        For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.






        share|improve this answer























        • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.
          – Martin Smith
          2 days ago
















        9














        Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



        DROP TABLE IF EXISTS dbo.MyTableCCI;

        CREATE TABLE dbo.MyTableCCI
        (
        Id INT PRIMARY KEY,
        RangeFrom INT NOT NULL,
        RangeTo INT NOT NULL,
        CHECK (RangeTo > RangeFrom),
        INDEX CCI CLUSTERED COLUMNSTORE
        );

        INSERT INTO dbo.MyTableCCI
        SELECT TOP (987654321) *
        FROM dbo.MyTable
        ORDER BY RangeFrom ASC
        OPTION (MAXDOP 1);


        By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



        Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


        For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.






        share|improve this answer























        • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.
          – Martin Smith
          2 days ago














        9












        9








        9






        Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



        DROP TABLE IF EXISTS dbo.MyTableCCI;

        CREATE TABLE dbo.MyTableCCI
        (
        Id INT PRIMARY KEY,
        RangeFrom INT NOT NULL,
        RangeTo INT NOT NULL,
        CHECK (RangeTo > RangeFrom),
        INDEX CCI CLUSTERED COLUMNSTORE
        );

        INSERT INTO dbo.MyTableCCI
        SELECT TOP (987654321) *
        FROM dbo.MyTable
        ORDER BY RangeFrom ASC
        OPTION (MAXDOP 1);


        By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



        Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


        For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.






        share|improve this answer














        Columnstore is very heplful here compared to a nonclustered index which scans half the table. A nonclustered columnstore index provides most of the benefit but inserting ordered data into a clustered columnstore index is even better.



        DROP TABLE IF EXISTS dbo.MyTableCCI;

        CREATE TABLE dbo.MyTableCCI
        (
        Id INT PRIMARY KEY,
        RangeFrom INT NOT NULL,
        RangeTo INT NOT NULL,
        CHECK (RangeTo > RangeFrom),
        INDEX CCI CLUSTERED COLUMNSTORE
        );

        INSERT INTO dbo.MyTableCCI
        SELECT TOP (987654321) *
        FROM dbo.MyTable
        ORDER BY RangeFrom ASC
        OPTION (MAXDOP 1);


        By design I can get rowgroup elimination on the RangeFrom column which will eliminate half of my rowgroups. But due to the nature of the data I also get rowgroup elimination on the RangeTo column as well:



        Table 'MyTableCCI'. Segment reads 1, segment skipped 9.


        For larger tables with more variable data there are different ways to load the data to guarantee the best possible rowgroup elimination on both columns. For your data in particular, the query takes 1 ms.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited 2 days ago

























        answered 2 days ago









        Joe Obbish

        20.6k32880




        20.6k32880












        • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.
          – Martin Smith
          2 days ago


















        • yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.
          – Martin Smith
          2 days ago
















        yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.
        – Martin Smith
        2 days ago




        yep definitely looking for other approaches to consider without the 2000 restriction. Doesnt sound like that will be beaten.
        – Martin Smith
        2 days ago













        6














        One alternative way of representing a range would be as points on a line.



        The below migrates all the data into a new table with the range represented as a geometry datatype.



        CREATE TABLE MyTable2
        (
        Id INT IDENTITY PRIMARY KEY,
        Range GEOMETRY NOT NULL,
        RangeFrom AS Range.STPointN(1).STX,
        RangeTo AS Range.STPointN(2).STX,
        CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
        );

        SET IDENTITY_INSERT MyTable2 ON

        INSERT INTO MyTable2
        (Id,
        Range)
        SELECT ID,
        geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
        FROM MyTable

        SET IDENTITY_INSERT MyTable2 OFF


        CREATE SPATIAL INDEX index_name
        ON MyTable2 ( Range )
        USING GEOMETRY_GRID
        WITH (
        BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
        GRIDS = (HIGH, HIGH, HIGH, HIGH),
        CELLS_PER_OBJECT = 16);


        The equivalent query to find ranges containing the value 50,000,000 is below.



        SELECT Id,
        RangeFrom,
        RangeTo
        FROM MyTable2
        WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


        The reads for this show an improvement on the 10,951 from the original query.



        Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
        Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
        Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
        Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


        However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



        The execution plan is more complex as below



        enter image description here



        The only case where the rewrite reliably performs better for me is with a cold cache.



        So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.






        share|improve this answer


























          6














          One alternative way of representing a range would be as points on a line.



          The below migrates all the data into a new table with the range represented as a geometry datatype.



          CREATE TABLE MyTable2
          (
          Id INT IDENTITY PRIMARY KEY,
          Range GEOMETRY NOT NULL,
          RangeFrom AS Range.STPointN(1).STX,
          RangeTo AS Range.STPointN(2).STX,
          CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
          );

          SET IDENTITY_INSERT MyTable2 ON

          INSERT INTO MyTable2
          (Id,
          Range)
          SELECT ID,
          geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
          FROM MyTable

          SET IDENTITY_INSERT MyTable2 OFF


          CREATE SPATIAL INDEX index_name
          ON MyTable2 ( Range )
          USING GEOMETRY_GRID
          WITH (
          BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
          GRIDS = (HIGH, HIGH, HIGH, HIGH),
          CELLS_PER_OBJECT = 16);


          The equivalent query to find ranges containing the value 50,000,000 is below.



          SELECT Id,
          RangeFrom,
          RangeTo
          FROM MyTable2
          WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


          The reads for this show an improvement on the 10,951 from the original query.



          Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
          Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
          Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
          Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


          However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



          The execution plan is more complex as below



          enter image description here



          The only case where the rewrite reliably performs better for me is with a cold cache.



          So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.






          share|improve this answer
























            6












            6








            6






            One alternative way of representing a range would be as points on a line.



            The below migrates all the data into a new table with the range represented as a geometry datatype.



            CREATE TABLE MyTable2
            (
            Id INT IDENTITY PRIMARY KEY,
            Range GEOMETRY NOT NULL,
            RangeFrom AS Range.STPointN(1).STX,
            RangeTo AS Range.STPointN(2).STX,
            CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
            );

            SET IDENTITY_INSERT MyTable2 ON

            INSERT INTO MyTable2
            (Id,
            Range)
            SELECT ID,
            geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
            FROM MyTable

            SET IDENTITY_INSERT MyTable2 OFF


            CREATE SPATIAL INDEX index_name
            ON MyTable2 ( Range )
            USING GEOMETRY_GRID
            WITH (
            BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
            GRIDS = (HIGH, HIGH, HIGH, HIGH),
            CELLS_PER_OBJECT = 16);


            The equivalent query to find ranges containing the value 50,000,000 is below.



            SELECT Id,
            RangeFrom,
            RangeTo
            FROM MyTable2
            WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


            The reads for this show an improvement on the 10,951 from the original query.



            Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
            Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
            Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
            Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


            However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



            The execution plan is more complex as below



            enter image description here



            The only case where the rewrite reliably performs better for me is with a cold cache.



            So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.






            share|improve this answer












            One alternative way of representing a range would be as points on a line.



            The below migrates all the data into a new table with the range represented as a geometry datatype.



            CREATE TABLE MyTable2
            (
            Id INT IDENTITY PRIMARY KEY,
            Range GEOMETRY NOT NULL,
            RangeFrom AS Range.STPointN(1).STX,
            RangeTo AS Range.STPointN(2).STX,
            CHECK (Range.STNumPoints() = 2 AND Range.STPointN(1).STY = 0 AND Range.STPointN(2).STY = 0)
            );

            SET IDENTITY_INSERT MyTable2 ON

            INSERT INTO MyTable2
            (Id,
            Range)
            SELECT ID,
            geometry::STLineFromText(CONCAT('LINESTRING(', RangeFrom, ' 0, ', RangeTo, ' 0)'), 0)
            FROM MyTable

            SET IDENTITY_INSERT MyTable2 OFF


            CREATE SPATIAL INDEX index_name
            ON MyTable2 ( Range )
            USING GEOMETRY_GRID
            WITH (
            BOUNDING_BOX = ( xmin=0, ymin=0, xmax=110000000, ymax=1 ),
            GRIDS = (HIGH, HIGH, HIGH, HIGH),
            CELLS_PER_OBJECT = 16);


            The equivalent query to find ranges containing the value 50,000,000 is below.



            SELECT Id,
            RangeFrom,
            RangeTo
            FROM MyTable2
            WHERE Range.STContains(geometry::STPointFromText ('POINT (50000000 0)', 0)) = 1


            The reads for this show an improvement on the 10,951 from the original query.



            Table 'MyTable2'. Scan count 0, logical reads 505, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
            Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
            Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
            Table 'extended_index_1797581442_384000'. Scan count 4, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


            However there is no significant improvement over the original in terms of time elapsed. Typical execution results are 250 ms vs 252 ms.



            The execution plan is more complex as below



            enter image description here



            The only case where the rewrite reliably performs better for me is with a cold cache.



            So disappointing in this case and difficult to recommend this rewrite but publication of negative results can also be useful.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered 2 days ago









            Martin Smith

            61.6k10166247




            61.6k10166247























                5














                Paul White pointed out an answer to a similar question containing a link to an interesting article by Itzik Ben Gan. This describes the "Static Relational Interval Tree" model that allows this to be done efficiently.



                In summary this approach involves storing a computed ("forknode") value based on the interval values in the row. When searching for ranges that intersect another range it is possible to precalculate the possible forknode values that matching rows must have and use this to find the results with a maximum of 31 seek operations (the below supports integers in the range 0 to the max signed 32 bit int)



                Based on this I restructured the table as below.



                CREATE TABLE dbo.MyTable3
                (
                Id INT IDENTITY PRIMARY KEY,
                RangeFrom INT NOT NULL,
                RangeTo INT NOT NULL,
                node AS RangeTo - RangeTo % POWER(2, FLOOR(LOG((RangeFrom - 1) ^ RangeTo, 2))) PERSISTED NOT NULL,
                CHECK (RangeTo > RangeFrom)
                );

                CREATE INDEX ix1 ON dbo.MyTable3 (node, RangeFrom) INCLUDE (RangeTo);
                CREATE INDEX ix2 ON dbo.MyTable3 (node, RangeTo) INCLUDE (RangeFrom);

                SET IDENTITY_INSERT MyTable3 ON

                INSERT INTO MyTable3
                (Id,
                RangeFrom,
                RangeTo)
                SELECT Id,
                RangeFrom,
                RangeTo
                FROM MyTable

                SET IDENTITY_INSERT MyTable3 OFF


                And then used the following query (the article is looking for intersecting intervals so finding an interval containing a point is a degenerate case of this)



                DECLARE @value INT = 50000000;

                ;WITH N AS
                (
                SELECT 30 AS Level,
                CASE WHEN @value > POWER(2,30) THEN POWER(2,30) END AS selected_left_node,
                CASE WHEN @value < POWER(2,30) THEN POWER(2,30) END AS selected_right_node,
                (SIGN(@value - POWER(2,30)) * POWER(2,29)) + POWER(2,30) AS node
                UNION ALL
                SELECT N.Level-1,
                CASE WHEN @value > node THEN node END AS selected_left_node,
                CASE WHEN @value < node THEN node END AS selected_right_node,
                (SIGN(@value - node) * POWER(2,N.Level-2)) + node AS node
                FROM N
                WHERE N.Level > 0
                )
                SELECT I.id, I.RangeFrom, I.RangeTo
                FROM dbo.MyTable3 AS I
                JOIN N AS L
                ON I.node = L.selected_left_node
                AND I.RangeTo >= @value
                AND L.selected_left_node IS NOT NULL
                UNION ALL
                SELECT I.id, I.RangeFrom, I.RangeTo
                FROM dbo.MyTable3 AS I
                JOIN N AS R
                ON I.node = R.selected_right_node
                AND I.RangeFrom <= @value
                AND R.selected_right_node IS NOT NULL
                UNION ALL
                SELECT I.id, I.RangeFrom, I.RangeTo
                FROM dbo.MyTable3 AS I
                WHERE node = @value;


                This typically executes in 1mson my machine when all pages are in cache - with IO stats.



                Table 'MyTable3'. Scan count 24, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                Table 'Worktable'. Scan count 4, logical reads 374, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


                and plan



                enter image description here



                NB: The source uses multistatement TVFs rather than a recursive CTE to get the nodes to join on but in the interests of making my answer self contained I have opted for the latter. For production use I'd probably use the TVFs.






                share|improve this answer




























                  5














                  Paul White pointed out an answer to a similar question containing a link to an interesting article by Itzik Ben Gan. This describes the "Static Relational Interval Tree" model that allows this to be done efficiently.



                  In summary this approach involves storing a computed ("forknode") value based on the interval values in the row. When searching for ranges that intersect another range it is possible to precalculate the possible forknode values that matching rows must have and use this to find the results with a maximum of 31 seek operations (the below supports integers in the range 0 to the max signed 32 bit int)



                  Based on this I restructured the table as below.



                  CREATE TABLE dbo.MyTable3
                  (
                  Id INT IDENTITY PRIMARY KEY,
                  RangeFrom INT NOT NULL,
                  RangeTo INT NOT NULL,
                  node AS RangeTo - RangeTo % POWER(2, FLOOR(LOG((RangeFrom - 1) ^ RangeTo, 2))) PERSISTED NOT NULL,
                  CHECK (RangeTo > RangeFrom)
                  );

                  CREATE INDEX ix1 ON dbo.MyTable3 (node, RangeFrom) INCLUDE (RangeTo);
                  CREATE INDEX ix2 ON dbo.MyTable3 (node, RangeTo) INCLUDE (RangeFrom);

                  SET IDENTITY_INSERT MyTable3 ON

                  INSERT INTO MyTable3
                  (Id,
                  RangeFrom,
                  RangeTo)
                  SELECT Id,
                  RangeFrom,
                  RangeTo
                  FROM MyTable

                  SET IDENTITY_INSERT MyTable3 OFF


                  And then used the following query (the article is looking for intersecting intervals so finding an interval containing a point is a degenerate case of this)



                  DECLARE @value INT = 50000000;

                  ;WITH N AS
                  (
                  SELECT 30 AS Level,
                  CASE WHEN @value > POWER(2,30) THEN POWER(2,30) END AS selected_left_node,
                  CASE WHEN @value < POWER(2,30) THEN POWER(2,30) END AS selected_right_node,
                  (SIGN(@value - POWER(2,30)) * POWER(2,29)) + POWER(2,30) AS node
                  UNION ALL
                  SELECT N.Level-1,
                  CASE WHEN @value > node THEN node END AS selected_left_node,
                  CASE WHEN @value < node THEN node END AS selected_right_node,
                  (SIGN(@value - node) * POWER(2,N.Level-2)) + node AS node
                  FROM N
                  WHERE N.Level > 0
                  )
                  SELECT I.id, I.RangeFrom, I.RangeTo
                  FROM dbo.MyTable3 AS I
                  JOIN N AS L
                  ON I.node = L.selected_left_node
                  AND I.RangeTo >= @value
                  AND L.selected_left_node IS NOT NULL
                  UNION ALL
                  SELECT I.id, I.RangeFrom, I.RangeTo
                  FROM dbo.MyTable3 AS I
                  JOIN N AS R
                  ON I.node = R.selected_right_node
                  AND I.RangeFrom <= @value
                  AND R.selected_right_node IS NOT NULL
                  UNION ALL
                  SELECT I.id, I.RangeFrom, I.RangeTo
                  FROM dbo.MyTable3 AS I
                  WHERE node = @value;


                  This typically executes in 1mson my machine when all pages are in cache - with IO stats.



                  Table 'MyTable3'. Scan count 24, logical reads 72, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
                  Table 'Worktable'. Scan count 4, logical reads 374, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


                  and plan



                  enter image description here



                  NB: The source uses multistatement TVFs rather than a recursive CTE to get the nodes to join on but in the interests of making my answer self contained I have opted for the latter. For production use I'd probably use the TVFs.






I was able to find a row mode approach that is competitive with the N/CCI approach, but you need to know something about your data. Suppose that you had a column that contained the difference of RangeTo and RangeFrom and you indexed it along with RangeFrom:



                        ALTER TABLE dbo.MyTableWithDiff ADD DiffOfColumns AS RangeTo-RangeFrom;

                        CREATE INDEX IXDIFF ON dbo.MyTableWithDiff (DiffOfColumns,RangeFrom) INCLUDE (RangeTo);


If you knew all of the distinct values of DiffOfColumns then you could perform a seek for every value of DiffOfColumns with a range filter on RangeFrom to get all of the relevant data. For example, if we know that DiffOfColumns = 2 then the only allowed values for RangeFrom are 49999998, 49999999 and 50000000. Recursion can be used to get all of the distinct values of DiffOfColumns, and it works well for your data set because there are only 256 of them. The query below takes about 6 ms on my machine:



WITH RecursiveCTE
AS
(
    -- Anchor
    SELECT TOP (1)
           DiffOfColumns
    FROM   dbo.MyTableWithDiff AS T
    ORDER BY
           T.DiffOfColumns

    UNION ALL

    -- Recursive
    SELECT R.DiffOfColumns
    FROM
    (
        -- Number the rows
        SELECT T.DiffOfColumns,
               rn = ROW_NUMBER() OVER (ORDER BY T.DiffOfColumns)
        FROM   dbo.MyTableWithDiff AS T
        JOIN   RecursiveCTE AS R
          ON   R.DiffOfColumns < T.DiffOfColumns
    ) AS R
    WHERE
        -- Only the row that sorts lowest
        R.rn = 1
)
SELECT ca.*
FROM   RecursiveCTE rcte
CROSS APPLY
(
    SELECT mt.Id, mt.RangeFrom, mt.RangeTo
    FROM   dbo.MyTableWithDiff mt
    WHERE  mt.DiffOfColumns = rcte.DiffOfColumns
    AND    mt.RangeFrom >= 50000000 - rcte.DiffOfColumns
    AND    mt.RangeFrom <= 50000000
) ca
OPTION (MAXRECURSION 0);


                        You can see the usual recursive part along with the index seek for every distinct value:



[query plan 1]



The flaw with this approach is that it starts to get slow when there are too many distinct values for DiffOfColumns. Let's do the same test, but use CRYPT_GEN_RANDOM(2) instead of CRYPT_GEN_RANDOM(1):



DROP TABLE IF EXISTS dbo.MyTableBigDiff;

CREATE TABLE dbo.MyTableBigDiff
(
    Id INT IDENTITY PRIMARY KEY,
    RangeFrom INT NOT NULL,
    RangeTo INT NOT NULL,
    CHECK (RangeTo > RangeFrom)
);

WITH RandomNumbers
AS (SELECT TOP 10000000 ABS(CRYPT_GEN_RANDOM(4) % 100000000) AS Num
    FROM sys.all_objects o1,
         sys.all_objects o2,
         sys.all_objects o3,
         sys.all_objects o4)
INSERT INTO dbo.MyTableBigDiff
            (RangeFrom,
             RangeTo)
SELECT Num,
       Num + 1 + CRYPT_GEN_RANDOM(2) -- note the 2
FROM   RandomNumbers;

ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns AS RangeTo - RangeFrom;

CREATE INDEX IXDIFF ON dbo.MyTableBigDiff (DiffOfColumns, RangeFrom) INCLUDE (RangeTo);


                        The same query now finds 65536 rows from the recursive part and takes 823 ms of CPU on my machine. There are PAGELATCH_SH waits and other bad things going on. I can improve performance by bucketing the diff values to keep the number of unique values under control and adjusting for the bucketing in the CROSS APPLY. For this data set I'll try 256 buckets:



ALTER TABLE dbo.MyTableBigDiff ADD DiffOfColumns_bucket256 AS CAST(CEILING((RangeTo - RangeFrom) / 256.) AS INT);

CREATE INDEX IXDIFF_bucket256 ON dbo.MyTableBigDiff (DiffOfColumns_bucket256, RangeFrom) INCLUDE (RangeTo);


                        One way to avoid getting extra rows (now I'm comparing to a rounded value instead of the true value) is by filtering on RangeTo:



CROSS APPLY
(
    SELECT mt.Id, mt.RangeFrom, mt.RangeTo
    FROM   dbo.MyTableBigDiff mt
    WHERE  mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
    AND    mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
    AND    mt.RangeFrom <= 50000000
    AND    mt.RangeTo >= 50000000
) ca


                        The full query now takes 6 ms on my machine.
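
Put together, the bucketed variant looks something like this (a sketch assembled from the fragments above; the original answer does not show it assembled):

-- Sketch: the bucketed variant end to end. Same recursive pattern as before,
-- but enumerating DiffOfColumns_bucket256 and widening the RangeFrom window by
-- the bucket size (256).
WITH RecursiveCTE
AS
(
    SELECT TOP (1) DiffOfColumns_bucket256
    FROM   dbo.MyTableBigDiff AS T
    ORDER BY T.DiffOfColumns_bucket256

    UNION ALL

    SELECT R.DiffOfColumns_bucket256
    FROM
    (
        SELECT T.DiffOfColumns_bucket256,
               rn = ROW_NUMBER() OVER (ORDER BY T.DiffOfColumns_bucket256)
        FROM   dbo.MyTableBigDiff AS T
        JOIN   RecursiveCTE AS R
          ON   R.DiffOfColumns_bucket256 < T.DiffOfColumns_bucket256
    ) AS R
    WHERE  R.rn = 1
)
SELECT ca.*
FROM   RecursiveCTE rcte
CROSS APPLY
(
    SELECT mt.Id, mt.RangeFrom, mt.RangeTo
    FROM   dbo.MyTableBigDiff mt
    WHERE  mt.DiffOfColumns_bucket256 = rcte.DiffOfColumns_bucket256
    AND    mt.RangeFrom >= 50000000 - (256 * rcte.DiffOfColumns_bucket256)
    AND    mt.RangeFrom <= 50000000
    AND    mt.RangeTo >= 50000000
) ca
OPTION (MAXRECURSION 0);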






I found a pretty good solution using a computed column, but it's only good for a single value. That being said, if you've got a magic value, maybe it's enough.



                                Starting with your given sample, then modifying the table:



ALTER TABLE dbo.MyTable
    ADD curtis_jackson
        AS CONVERT(BIT, CASE
                            WHEN RangeTo >= 50000000
                             AND RangeFrom <= 50000000
                            THEN 1
                            ELSE 0
                        END);

CREATE INDEX IX1_redo
    ON dbo.MyTable (curtis_jackson)
        INCLUDE (RangeFrom, RangeTo);


                                The query simply becomes:



                                SELECT *
                                FROM MyTable
                                WHERE curtis_jackson = 1;


This returns the same results as your starting query. With execution plans turned off, here are the stats (truncated for brevity):



                                Table 'MyTable'. Scan count 1, logical reads 3...

                                SQL Server Execution Times:
                                CPU time = 0 ms, elapsed time = 0 ms.


                                And here's the query plan:



[query plan screenshot]






• and the index could be filtered too – Martin Smith
• @MartinSmith Not on a computed column :( – sp_BlitzErik
• And how much time does the index need to be created? (so we know the overhead in case it is not a single value) – yper-crazyhat-cubeᵀᴹ
• @yper-crazyhat-cubeᵀᴹ on my machine? light speed, baby. – sp_BlitzErik
• @yper-crazyhat-cubeᵀᴹ - yes. CREATE INDEX IX1_redo ON dbo.MyTable (curtis_jackson) INCLUDE (RangeFrom, RangeTo) WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000 would work. And the query SELECT * FROM MyTable WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000; uses it - so then not much need for poor Curtis – Martin Smith
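
Written out, the filtered-index idea from the comments above looks something like this (a sketch; the index name is illustrative, and per the last comment the computed column is no longer needed once the index itself is filtered):

-- Sketch: filtered index covering only the rows that contain the magic value.
-- The filter predicate references the base columns (a filtered index predicate
-- cannot reference a computed column).
CREATE INDEX IX_contains_50M
    ON dbo.MyTable (RangeFrom)
        INCLUDE (RangeTo)
    WHERE RangeTo >= 50000000 AND RangeFrom <= 50000000;

-- The predicate form of the original query can then read just this small index:
SELECT Id, RangeFrom, RangeTo
FROM   dbo.MyTable
WHERE  RangeTo >= 50000000
AND    RangeFrom <= 50000000;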
















                                As an homage to our new robot overlords, I decided to see if any of the new-ish R and Python functionality could help us out here. The answer is no, at least for the scripts that I could get working and returning correct results. If anyone with better knowledge comes along, well, feel free to spank me. My rates are reasonable.



                                To do this, I set up a VM with 4 cores and 16 GB RAM, thinking this would be enough to deal with a ~200MB data set.



                                Let's start with the language that doesn't exist in Boston!



                                R



EXEC sp_execute_external_script
     @language = N'R',
     @script = N'
tweener = 50000000
MO = data.frame(MartinIn)
MartinOut <- subset(MO, RangeFrom <= tweener & RangeTo >= tweener, select = c("Id","RangeFrom","RangeTo"))
',
     @input_data_1_name = N'MartinIn',
     @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
     @output_data_1_name = N'MartinOut',
     @parallel = 1
WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                This was a bad time.



                                Table 'MyTable'. Scan count 1, logical reads 22400

                                SQL Server Execution Times:
                                CPU time = 3219 ms, elapsed time = 5349 ms.


                                The execution plan is pretty uninteresting, though I don't know why the middle operator has to call us names.



[execution plan screenshot]



                                Next up, coding with crayons!



                                Python



EXEC sp_execute_external_script
     @language = N'Python',
     @script = N'
import pandas as pd
MO = pd.DataFrame(MartinIn)
tweener = 50000000
MartinOut = MO[(MO.RangeFrom <= tweener) & (MO.RangeTo >= tweener)]
',
     @input_data_1_name = N'MartinIn',
     @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
     @output_data_1_name = N'MartinOut',
     @parallel = 1
WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                Just when you thought it couldn't get worse than R:



                                Table 'MyTable'. Scan count 1, logical reads 22400

                                SQL Server Execution Times:
                                CPU time = 3797 ms, elapsed time = 10146 ms.


                                Another foul-mouthed execution plan:



[execution plan screenshot]



                                Hmm and Hmmer



                                So far, I'm not impressed. I can't wait to delete this VM.






• You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something. – wBob


















                                2














                                As an homage to our new robot overlords, I decided to see if any of the new-ish R and Python functionality could help us out here. The answer is no, at least for the scripts that I could get working and returning correct results. If anyone with better knowledge comes along, well, feel free to spank me. My rates are reasonable.



                                To do this, I set up a VM with 4 cores and 16 GB RAM, thinking this would be enough to deal with a ~200MB data set.



                                Let's start with the language that doesn't exist in Boston!



                                R



                                EXEC sp_execute_external_script 
                                @language = N'R',
                                @script = N'
                                tweener = 50000000
                                MO = data.frame(MartinIn)
                                MartinOut <- subset(MO, RangeFrom <= tweener & RangeTo >= tweener, select = c("Id","RangeFrom","RangeTo"))
                                ',
                                @input_data_1_name = N'MartinIn',
                                @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                @output_data_1_name = N'MartinOut',
                                @parallel = 1
                                WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                This was a bad time.



                                Table 'MyTable'. Scan count 1, logical reads 22400

                                SQL Server Execution Times:
                                CPU time = 3219 ms, elapsed time = 5349 ms.


                                The execution plan is pretty uninteresting, though I don't know why the middle operator has to call us names.



                                NUTS



                                Next up, coding with crayons!



                                Python



                                EXEC sp_execute_external_script 
                                @language = N'Python',
                                @script = N'
                                import pandas as pd
                                MO = pd.DataFrame(MartinIn)
                                tweener = 50000000
                                MartinOut = MO[(MO.RangeFrom <= tweener) & (MO.RangeTo >= tweener)]
                                ',
                                @input_data_1_name = N'MartinIn',
                                @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable',
                                @output_data_1_name = N'MartinOut',
                                @parallel = 1
                                WITH RESULT SETS ((ID INT, RangeFrom INT, RangeTo INT));


                                Just when you thought it couldn't get worse than R:



                                Table 'MyTable'. Scan count 1, logical reads 22400

                                SQL Server Execution Times:
                                CPU time = 3797 ms, elapsed time = 10146 ms.


                                Another foul-mouthed execution plan:



                                NUTS



                                Hmm and Hmmer



                                So far, I'm not impressed. I can't wait to delete this VM.






                                share|improve this answer





















                                • You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.
                                  – wBob
                                  6 hours ago
















                                2












                                2








                                2






                                answered 17 hours ago









                                sp_BlitzErik

                                • You can pass in parameters too, eg DECLARE @input INT = 50000001; EXEC dbo.sp_execute_external_script @language = N'R', @script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', @parallel = 1, @input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', @params = N'@x INT', @x = 50000001 WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL )); but yeah performance isn't great. I use R for stuff you can't do in SQL, say if you wanted to predict something.
                                  – wBob
                                  6 hours ago
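
For readability, here is the parameterized R call from the comment above, reflowed but otherwise unchanged (note that, as written, the declared @input variable isn't actually referenced; the literal is passed to @x directly):

DECLARE @input INT = 50000001;

EXEC dbo.sp_execute_external_script 
@language = N'R', 
@script = N'OutputDataSet <- InputDataSet[which(x >= InputDataSet$RangeFrom & x <= InputDataSet$RangeTo) , ]', 
@parallel = 1, 
@input_data_1 = N'SELECT Id, RangeFrom, RangeTo FROM dbo.MyTable;', 
@params = N'@x INT', 
@x = 50000001 
WITH RESULT SETS ( ( Id INT NOT NULL, RangeFrom INT NOT NULL, RangeTo INT NOT NULL ));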



















