How can I optimize a Point in Polygon query for millions of points when most of the points lie within the...












8














I have 150 million points in a point table and would like to find the few points lying outside a given polygon geometry. I know that 99.9% of the points are within the polygon geometry. I am interested in finding the few points which lie outside the polygon.



My present best query using indexed PostGIS tables takes about 30 minutes to complete. Is there a way to optimize the following query knowing that most of the points are within the polygon (border)?



SELECT COUNT(*) 
FROM italy_points pt
JOIN borders poly
ON ST_WITHIN (pt.the_geom, poly.geom)
WHERE poly.iso3 = 'ITA';


The polygon is basically the admin 0 border of Italy.
Vertices - 405,000. Parts - 510. The envelope is much larger than the polygon (The polygon covers 24% of the envelope)










share|improve this question




















  • 2




    Please Edit the question to give an indication of the complexity of the polygon -- How many parts? How many vertices? What percentage of the envelope of the polygon is within the polygon. I've found that partitioning complex polygons can improve point-in polygon evaluation, but you need to handle the condition where a single point intersects more than one partition.
    – Vince
    Dec 17 at 17:02










  • The first optimization for this type of operation is commonly to check whether the point is in the bounding box of the polygon before carrying on to the full point-in-polygon operation. Point-in-box is a very efficient operation by comparison.
    – WhiteboxDev
    Dec 17 at 19:56










  • @Vince If duplicates are possible (the only case I think think of is when it falls exactly on the border of two partitions), this is trivially handled in PostGIS. You need only GROUP BY the primary key of the points. (PostgreSQL conveniently allows you to reference any columns in the SELECT clause that come from a table where the primary key is included in the GROUP BY clause.)
    – jpmc26
    Dec 17 at 21:02












  • @WhiteboxDev ST_Within already does a boundary box check that enables to use of the index. (Almost all of PostGIS's functions include this optimization.) If it's still slow, then clearly the problem is with the complexity of the polygon.
    – jpmc26
    Dec 17 at 21:05












  • @jpmc26 Certainly, but the SQL query would also need to be modified to use ST_Intersects, since ST_Within would not reliably match internal boundary conditions.
    – Vince
    Dec 17 at 21:52
















8














I have 150 million points in a point table and would like to find the few points lying outside a given polygon geometry. I know that 99.9% of the points are within the polygon geometry. I am interested in finding the few points which lie outside the polygon.



My present best query using indexed PostGIS tables takes about 30 minutes to complete. Is there a way to optimize the following query knowing that most of the points are within the polygon (border)?



SELECT COUNT(*) 
FROM italy_points pt
JOIN borders poly
ON ST_WITHIN (pt.the_geom, poly.geom)
WHERE poly.iso3 = 'ITA';


The polygon is basically the admin 0 border of Italy.
Vertices - 405,000. Parts - 510. The envelope is much larger than the polygon (The polygon covers 24% of the envelope)










share|improve this question




















  • 2




    Please Edit the question to give an indication of the complexity of the polygon -- How many parts? How many vertices? What percentage of the envelope of the polygon is within the polygon. I've found that partitioning complex polygons can improve point-in polygon evaluation, but you need to handle the condition where a single point intersects more than one partition.
    – Vince
    Dec 17 at 17:02










  • The first optimization for this type of operation is commonly to check whether the point is in the bounding box of the polygon before carrying on to the full point-in-polygon operation. Point-in-box is a very efficient operation by comparison.
    – WhiteboxDev
    Dec 17 at 19:56










  • @Vince If duplicates are possible (the only case I think think of is when it falls exactly on the border of two partitions), this is trivially handled in PostGIS. You need only GROUP BY the primary key of the points. (PostgreSQL conveniently allows you to reference any columns in the SELECT clause that come from a table where the primary key is included in the GROUP BY clause.)
    – jpmc26
    Dec 17 at 21:02












  • @WhiteboxDev ST_Within already does a boundary box check that enables to use of the index. (Almost all of PostGIS's functions include this optimization.) If it's still slow, then clearly the problem is with the complexity of the polygon.
    – jpmc26
    Dec 17 at 21:05












  • @jpmc26 Certainly, but the SQL query would also need to be modified to use ST_Intersects, since ST_Within would not reliably match internal boundary conditions.
    – Vince
    Dec 17 at 21:52














8












8








8







I have 150 million points in a point table and would like to find the few points lying outside a given polygon geometry. I know that 99.9% of the points are within the polygon geometry. I am interested in finding the few points which lie outside the polygon.



My present best query using indexed PostGIS tables takes about 30 minutes to complete. Is there a way to optimize the following query knowing that most of the points are within the polygon (border)?



SELECT COUNT(*) 
FROM italy_points pt
JOIN borders poly
ON ST_WITHIN (pt.the_geom, poly.geom)
WHERE poly.iso3 = 'ITA';


The polygon is basically the admin 0 border of Italy.
Vertices - 405,000. Parts - 510. The envelope is much larger than the polygon (The polygon covers 24% of the envelope)










share|improve this question















I have 150 million points in a point table and would like to find the few points lying outside a given polygon geometry. I know that 99.9% of the points are within the polygon geometry. I am interested in finding the few points which lie outside the polygon.



My present best query using indexed PostGIS tables takes about 30 minutes to complete. Is there a way to optimize the following query knowing that most of the points are within the polygon (border)?



SELECT COUNT(*) 
FROM italy_points pt
JOIN borders poly
ON ST_WITHIN (pt.the_geom, poly.geom)
WHERE poly.iso3 = 'ITA';


The polygon is basically the admin 0 border of Italy.
Vertices - 405,000. Parts - 510. The envelope is much larger than the polygon (The polygon covers 24% of the envelope)







postgis postgresql






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 17 at 22:25

























asked Dec 17 at 16:49









Prithvi

433




433








  • 2




    Please Edit the question to give an indication of the complexity of the polygon -- How many parts? How many vertices? What percentage of the envelope of the polygon is within the polygon. I've found that partitioning complex polygons can improve point-in polygon evaluation, but you need to handle the condition where a single point intersects more than one partition.
    – Vince
    Dec 17 at 17:02










  • The first optimization for this type of operation is commonly to check whether the point is in the bounding box of the polygon before carrying on to the full point-in-polygon operation. Point-in-box is a very efficient operation by comparison.
    – WhiteboxDev
    Dec 17 at 19:56










  • @Vince If duplicates are possible (the only case I think think of is when it falls exactly on the border of two partitions), this is trivially handled in PostGIS. You need only GROUP BY the primary key of the points. (PostgreSQL conveniently allows you to reference any columns in the SELECT clause that come from a table where the primary key is included in the GROUP BY clause.)
    – jpmc26
    Dec 17 at 21:02












  • @WhiteboxDev ST_Within already does a boundary box check that enables to use of the index. (Almost all of PostGIS's functions include this optimization.) If it's still slow, then clearly the problem is with the complexity of the polygon.
    – jpmc26
    Dec 17 at 21:05












  • @jpmc26 Certainly, but the SQL query would also need to be modified to use ST_Intersects, since ST_Within would not reliably match internal boundary conditions.
    – Vince
    Dec 17 at 21:52














  • 2




    Please Edit the question to give an indication of the complexity of the polygon -- How many parts? How many vertices? What percentage of the envelope of the polygon is within the polygon. I've found that partitioning complex polygons can improve point-in polygon evaluation, but you need to handle the condition where a single point intersects more than one partition.
    – Vince
    Dec 17 at 17:02










  • The first optimization for this type of operation is commonly to check whether the point is in the bounding box of the polygon before carrying on to the full point-in-polygon operation. Point-in-box is a very efficient operation by comparison.
    – WhiteboxDev
    Dec 17 at 19:56










  • @Vince If duplicates are possible (the only case I think think of is when it falls exactly on the border of two partitions), this is trivially handled in PostGIS. You need only GROUP BY the primary key of the points. (PostgreSQL conveniently allows you to reference any columns in the SELECT clause that come from a table where the primary key is included in the GROUP BY clause.)
    – jpmc26
    Dec 17 at 21:02












  • @WhiteboxDev ST_Within already does a boundary box check that enables to use of the index. (Almost all of PostGIS's functions include this optimization.) If it's still slow, then clearly the problem is with the complexity of the polygon.
    – jpmc26
    Dec 17 at 21:05












  • @jpmc26 Certainly, but the SQL query would also need to be modified to use ST_Intersects, since ST_Within would not reliably match internal boundary conditions.
    – Vince
    Dec 17 at 21:52








2




2




Please Edit the question to give an indication of the complexity of the polygon -- How many parts? How many vertices? What percentage of the envelope of the polygon is within the polygon. I've found that partitioning complex polygons can improve point-in polygon evaluation, but you need to handle the condition where a single point intersects more than one partition.
– Vince
Dec 17 at 17:02




Please Edit the question to give an indication of the complexity of the polygon -- How many parts? How many vertices? What percentage of the envelope of the polygon is within the polygon. I've found that partitioning complex polygons can improve point-in polygon evaluation, but you need to handle the condition where a single point intersects more than one partition.
– Vince
Dec 17 at 17:02












The first optimization for this type of operation is commonly to check whether the point is in the bounding box of the polygon before carrying on to the full point-in-polygon operation. Point-in-box is a very efficient operation by comparison.
– WhiteboxDev
Dec 17 at 19:56




The first optimization for this type of operation is commonly to check whether the point is in the bounding box of the polygon before carrying on to the full point-in-polygon operation. Point-in-box is a very efficient operation by comparison.
– WhiteboxDev
Dec 17 at 19:56












@Vince If duplicates are possible (the only case I think think of is when it falls exactly on the border of two partitions), this is trivially handled in PostGIS. You need only GROUP BY the primary key of the points. (PostgreSQL conveniently allows you to reference any columns in the SELECT clause that come from a table where the primary key is included in the GROUP BY clause.)
– jpmc26
Dec 17 at 21:02






@Vince If duplicates are possible (the only case I think think of is when it falls exactly on the border of two partitions), this is trivially handled in PostGIS. You need only GROUP BY the primary key of the points. (PostgreSQL conveniently allows you to reference any columns in the SELECT clause that come from a table where the primary key is included in the GROUP BY clause.)
– jpmc26
Dec 17 at 21:02














@WhiteboxDev ST_Within already does a boundary box check that enables to use of the index. (Almost all of PostGIS's functions include this optimization.) If it's still slow, then clearly the problem is with the complexity of the polygon.
– jpmc26
Dec 17 at 21:05






@WhiteboxDev ST_Within already does a boundary box check that enables to use of the index. (Almost all of PostGIS's functions include this optimization.) If it's still slow, then clearly the problem is with the complexity of the polygon.
– jpmc26
Dec 17 at 21:05














@jpmc26 Certainly, but the SQL query would also need to be modified to use ST_Intersects, since ST_Within would not reliably match internal boundary conditions.
– Vince
Dec 17 at 21:52




@jpmc26 Certainly, but the SQL query would also need to be modified to use ST_Intersects, since ST_Within would not reliably match internal boundary conditions.
– Vince
Dec 17 at 21:52










1 Answer
1






active

oldest

votes


















10














Use ST_Subdivide to cut your polygon into smaller polygons, save them to a table and create a spatial index. Then do your query on the gridded polygons.



Without that, spatial indexing does not provide any advantages in your case (only 1 polygon of interest).






share|improve this answer

















  • 4




    Further to the ST_Subdivide() comment, you might find that using less than the default number of vertices yields further gains by increasing index leverage and decreasing geometry recovery time. Try 64 or even 32.
    – Paul Ramsey
    Dec 17 at 17:27










  • The query now completes in 5 minutes as opposed to 30! Thanks for the suggestion.
    – Prithvi
    Dec 17 at 22:28











Your Answer








StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "79"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fgis.stackexchange.com%2fquestions%2f306295%2fhow-can-i-optimize-a-point-in-polygon-query-for-millions-of-points-when-most-of%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









10














Use ST_Subdivide to cut your polygon into smaller polygons, save them to a table and create a spatial index. Then do your query on the gridded polygons.



Without that, spatial indexing does not provide any advantages in your case (only 1 polygon of interest).






share|improve this answer

















  • 4




    Further to the ST_Subdivide() comment, you might find that using less than the default number of vertices yields further gains by increasing index leverage and decreasing geometry recovery time. Try 64 or even 32.
    – Paul Ramsey
    Dec 17 at 17:27










  • The query now completes in 5 minutes as opposed to 30! Thanks for the suggestion.
    – Prithvi
    Dec 17 at 22:28
















10














Use ST_Subdivide to cut your polygon into smaller polygons, save them to a table and create a spatial index. Then do your query on the gridded polygons.



Without that, spatial indexing does not provide any advantages in your case (only 1 polygon of interest).






share|improve this answer

















  • 4




    Further to the ST_Subdivide() comment, you might find that using less than the default number of vertices yields further gains by increasing index leverage and decreasing geometry recovery time. Try 64 or even 32.
    – Paul Ramsey
    Dec 17 at 17:27










  • The query now completes in 5 minutes as opposed to 30! Thanks for the suggestion.
    – Prithvi
    Dec 17 at 22:28














10












10








10






Use ST_Subdivide to cut your polygon into smaller polygons, save them to a table and create a spatial index. Then do your query on the gridded polygons.



Without that, spatial indexing does not provide any advantages in your case (only 1 polygon of interest).






share|improve this answer












Use ST_Subdivide to cut your polygon into smaller polygons, save them to a table and create a spatial index. Then do your query on the gridded polygons.



Without that, spatial indexing does not provide any advantages in your case (only 1 polygon of interest).







share|improve this answer












share|improve this answer



share|improve this answer










answered Dec 17 at 17:00









Mesa

989615




989615








  • 4




    Further to the ST_Subdivide() comment, you might find that using less than the default number of vertices yields further gains by increasing index leverage and decreasing geometry recovery time. Try 64 or even 32.
    – Paul Ramsey
    Dec 17 at 17:27










  • The query now completes in 5 minutes as opposed to 30! Thanks for the suggestion.
    – Prithvi
    Dec 17 at 22:28














  • 4




    Further to the ST_Subdivide() comment, you might find that using less than the default number of vertices yields further gains by increasing index leverage and decreasing geometry recovery time. Try 64 or even 32.
    – Paul Ramsey
    Dec 17 at 17:27










  • The query now completes in 5 minutes as opposed to 30! Thanks for the suggestion.
    – Prithvi
    Dec 17 at 22:28








4




4




Further to the ST_Subdivide() comment, you might find that using less than the default number of vertices yields further gains by increasing index leverage and decreasing geometry recovery time. Try 64 or even 32.
– Paul Ramsey
Dec 17 at 17:27




Further to the ST_Subdivide() comment, you might find that using less than the default number of vertices yields further gains by increasing index leverage and decreasing geometry recovery time. Try 64 or even 32.
– Paul Ramsey
Dec 17 at 17:27












The query now completes in 5 minutes as opposed to 30! Thanks for the suggestion.
– Prithvi
Dec 17 at 22:28




The query now completes in 5 minutes as opposed to 30! Thanks for the suggestion.
– Prithvi
Dec 17 at 22:28


















draft saved

draft discarded




















































Thanks for contributing an answer to Geographic Information Systems Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fgis.stackexchange.com%2fquestions%2f306295%2fhow-can-i-optimize-a-point-in-polygon-query-for-millions-of-points-when-most-of%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Morgemoulin

Scott Moir

Souastre