Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GiST support #42

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from
Draft

GiST support #42

wants to merge 4 commits into from

Conversation

zachasme
Copy link
Owner

@zachasme zachasme commented Aug 24, 2020

Replaces #7. Part of #5.

References:

  • GiST bible
  • PostGIS implementation (gserialized_gist*.* a bit spread over here and here).

I have closed the old PR which contained code for both GiST and SP-GiST, splitting it into two separate PRs. This PR also includes the work by @AbelVM in #40.

The PR builds and runs, but basic tests fail (see test/sql/opclass_gist.sql), so there is some logical error. Any help figuring it out is appreciated. ❤️

We should also consider either 1) remove the unused parts of h3Index.h, 2) simply write out the few macros we need, or 3) get the macros pulled into the public API upstream.

@zachasme
Copy link
Owner Author

There is still some logical error in the current implementation, as seen from this simple test:

\set hexagon '\'831c02fffffffff\'::h3index'
CREATE TABLE h3_test_gist (hex h3index);
CREATE INDEX GIST_IDX ON h3_test_gist USING gist(hex);

-- insert immediate children
INSERT INTO h3_test_gist (hex) SELECT h3_to_children(:hexagon);
-- num children is 7
SELECT COUNT(*) FROM h3_test_gist WHERE hex <@ :hexagon;
     7
-- insert deeper children
INSERT INTO h3_test_gist (hex) SELECT h3_to_children(:hexagon, 8);
-- num children is ... ZERO?
SELECT COUNT(*) FROM h3_test_gist WHERE hex <@ :hexagon;
     0  

Any help figuring it out is much appreciated. ❤️

@AbelVM
Copy link

AbelVM commented Aug 24, 2020

Using my sql function, with the very same logic I ported to c:

WITH a AS(SELECT h3_to_children('831c02fffffffff'::h3index, 8) as h)
SELECT 
	count(*)
FROM a
WHERE h3_contains('831c02fffffffff'::h3index, a.h);

Returns 16807

Weird ¯\(ツ)

@zachasme
Copy link
Owner Author

zachasme commented Aug 26, 2020

That's the problem. The regular operations work fine, but breaks when using the GiST index as it is currently implemented in this branch:

CREATE TABLE t (hex h3index);
INSERT INTO t (hex) SELECT h3_to_children('831c02fffffffff', 8);
SELECT COUNT(*) FROM t WHERE hex <@ '831c02fffffffff';
 16807

-- after adding gist index
CREATE INDEX idx ON t USING gist(hex);
SELECT COUNT(*) FROM t WHERE hex <@ '831c02fffffffff';
     0

It is the algorithm implemented in opclass_gist.c that is off.

@AbelVM
Copy link

AbelVM commented Aug 26, 2020

The logic used in gist_cmp is the same within my custom h3_contains function 🤔

Maybe something is lost in translation 😞 . Does the c function work as expected when run standalone? I will give it some more love this afternoon

@zachasme
Copy link
Owner Author

zachasme commented Aug 27, 2020

Ah sorry for the confusion -- as far as I can tell, your gist_cmp works as intended 👍

I think the problem lies in the functions

// opclass_gist.c
h3index_gist_consistent
h3index_gist_union
h3index_gist_penalty
h3index_gist_picksplit
h3index_gist_same

which together defines the behavior of our GiST implementation.

@JohanMollevik
Copy link

Is work still being done on this pull request or is it abandoned? Having GIST indices would be great.

@zachasme
Copy link
Owner Author

I haven't touched the branch in a year, but I don't wish to abandon it. We need to correctly define the GIST operator class methods for the H3 data type. Our efforts so far (opcalss_gist.c in this branch) results in a fail of our basic test suite. Any help would be appreciated (see #42 (comment))!

@AbelVM
Copy link

AbelVM commented Sep 8, 2021

Updated reference: https://www.postgresql.org/docs/13/gist-extensibility.html

@zachasme zachasme changed the title WIP: GiST support GiST support Oct 10, 2022
@zachasme zachasme marked this pull request as draft October 10, 2022 08:31
@zachasme zachasme force-pushed the gist branch 6 times, most recently from 6c410a6 to 4a532cc Compare December 1, 2022 13:06
@mattiZed
Copy link

mattiZed commented Oct 8, 2024

Hey guys, software engineer here from a company that competes in the broad realm of "location-based services".

First off, much appreciation for all the work on this project!

We are using H3-indexed data in various applications, some of which might greatly benefit from index-accelerated queries. Especially for checking containment of a cell within a set of compacted cells (so contained by/contains operators would be very helpful). I can probably justify looking into this issue in my working hours so I'm happy to give it a shot.

I had a quick GPT session to understand how GiST-Indices are designed and I think I'm pretty clear on the topic now. I'll leave the chat here for others to check out.

The example I chose in that chat features concepts that seem very close to how a GiST implementation for H3 could look and from what I saw in opclass_gist.cc, this seems to be exactly the path that you are currently on.

Have there been any new insights meanwhile that are not reflected in the branch or this discussion?

I'm planning to set up an environment via Docker to get going. Is it possible to debug the extension live, e.g. using GDB? Any advice on setting up a dev environment is appreciated. I'm pretty seasoned in C(++) but have never touched a native PSQL extension.

@jmealo
Copy link
Contributor

jmealo commented Oct 10, 2024

I'm interested in GiST and GIN support as well. Also looking at porting over h3_coverage or similar.

@zachasme
Copy link
Owner Author

Hi @mattiZed!

Any help would be appreciated, no progress has been made outside this discussion. :-)

I haven't managed to get GDB to work. At one point I had a docker development setup in the repo: https://github.com/zachasme/h3-pg/tree/5a4899fcfeb89bff8366757ac9d0c21fb1eeb24c/.github/docker maybe you can use some of that.

I basically do this:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
sudo cmake --install build --component h3-pg
ctest --test-dir build --output-on-failure --build-config Release

@mattiZed
Copy link

mattiZed commented Oct 11, 2024

Thanks @zachasme

Edit: Alright, managed to compile it just fine and got to the point where the tests are failing.

I will try to dedicate some time to this next week.

@mattiZed
Copy link

mattiZed commented Oct 11, 2024

Hm, I couldn't help it and fumbled a bit already. I changed my test to perform just this:

\pset tuples_only on
\set hexagon '\'831c02fffffffff\'::h3index'

CREATE TABLE h3_test_gist (cell h3index);
CREATE INDEX h3_test_gist_index
          ON h3_test_gist
       USING gist(cell experimental_h3index_ops);

INSERT INTO h3_test_gist (cell) SELECT h3_cell_to_children(:hexagon);
INSERT INTO h3_test_gist (cell) SELECT h3_cell_to_children(:hexagon, 8);
SELECT COUNT(*) FROM h3_test_gist WHERE cell <@ :hexagon;

Output:

\pset tuples_only on
\set hexagon '\'831c02fffffffff\'::h3index'

CREATE TABLE h3_test_gist (cell h3index);
CREATE INDEX h3_test_gist_index
          ON h3_test_gist
       USING gist(cell experimental_h3index_ops);

INSERT INTO h3_test_gist (cell) SELECT h3_cell_to_children(:hexagon);
INSERT INTO h3_test_gist (cell) SELECT h3_cell_to_children(:hexagon, 8);
SELECT COUNT(*) > 8 FROM h3_test_gist WHERE cell <@ :hexagon;
 t

SELECT COUNT(*) FROM h3_test_gist WHERE cell <@ :hexagon;
 16814

I quickly confirmed via an SQL instance running h3-pg 4.1.2 that this seems indeed correct:

WITH hexagon AS (
    SELECT '831c02fffffffff'::h3index as idx  
),
children as (
    SELECT h3_cell_to_children(idx) as idx FROM hexagon
),
many_children as (
    SELECT h3_cell_to_children(idx, 8) as idx FROM hexagon
),
children_and_many_children as (
    SELECT idx FROM children
    UNION
    SELECT idx FROM many_children
)
    
SELECT count(idx) FROM children_and_many_children
-> 16814

Am I missing something?

Edit: Also some timings:

postgres=# CREATE TABLE h3_test_gist (cell h3index);

CREATE TABLE
Time: 14.966 ms
postgres=# INSERT INTO h3_test_gist (cell) SELECT h3_cell_to_children(:hexagon);

INSERT 0 7
Time: 12.594 ms
postgres=# INSERT INTO h3_test_gist (cell) SELECT h3_cell_to_children(:hexagon, 10);

INSERT 0 823543
Time: 1500.014 ms (00:01.500)
postgres=# CREATE INDEX h3_test_gist_index ON h3_test_gist USING gist(cell experimental_h3index_ops);

CREATE INDEX
Time: 58060.916 ms (00:58.061)
postgres=# EXPLAIN SELECT count(*) FROM h3_test_gist WHERE h3index_contains(:hexagon, cell);
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=9220.49..9220.50 rows=1 width=8)
   ->  Gather  (cost=9220.28..9220.49 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=8220.28..8220.29 rows=1 width=8)
               ->  Parallel Seq Scan on h3_test_gist  (cost=0.00..7934.32 rows=114382 width=0)
                     Filter: h3index_contains('831c02fffffffff'::h3index, cell)
(6 rows)

Time: 0.539 ms
postgres=# SELECT count(*) FROM h3_test_gist WHERE h3index_contains(:hexagon, cell);
 count
--------
 823550
(1 row)

Time: 45.824 ms
postgres=# EXPLAIN SELECT count(*) FROM h3_test_gist WHERE cell <@ :hexagon;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Aggregate  (cost=36.77..36.77 rows=1 width=8)
   ->  Index Only Scan using h3_test_gist_index on h3_test_gist  (cost=0.29..34.70 rows=824 width=0)
         Index Cond: (cell <@ '831c02fffffffff'::h3index)
(3 rows)

Time: 0.449 ms
postgres=# SELECT count(*) FROM h3_test_gist WHERE cell <@ :hexagon;
 count
--------
 823550
(1 row)

Time: 198.721 ms

So, while the index seems to perform correctly, at least in this little test (which is not really a meaningful query you would use in an application, after all), it's not of much use. I guess this is not using any binary optimizations to leverage the h3index bit layout?

However, if I do something more meaningful:

# Get a cell that is in the set for a test
postgres=# SELECT cell, h3_get_resolution(cell) FROM h3_test_gist LIMIT 1;
      cell       | h3_get_resolution
-----------------+-------------------
 8b1c02000000fff |                11
(1 row)

Time: 0.467 ms

# Check if that exact cell is in the set
postgres=# EXPLAIN SELECT * FROM h3_test_gist WHERE h3index_contains(cell, '8b1c02000000fff'::h3index);
                              QUERY PLAN
----------------------------------------------------------------------
 Seq Scan on h3_test_gist  (cost=0.00..97568.01 rows=1921600 width=8)
   Filter: h3index_contains(cell, '8b1c02000000fff'::h3index)
(2 rows)

Time: 0.455 ms
postgres=# SELECT * FROM h3_test_gist WHERE h3index_contains(cell, '8b1c02000000fff'::h3index);
      cell
-----------------
 8b1c02000000fff
(1 row)

Time: 553.407 ms
postgres=# EXPLAIN SELECT * FROM h3_test_gist WHERE cell @> '8b1c02000000fff'::h3index;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Index Only Scan using h3_test_gist_index on h3_test_gist  (cost=0.41..233.30 rows=5765 width=8)
   Index Cond: (cell @> '8b1c02000000fff'::h3index)
(2 rows)

Time: 0.408 ms

# The index-accelerated query is already 10 times faster
postgres=# SELECT * FROM h3_test_gist WHERE cell @> '8b1c02000000fff'::h3index;
      cell
-----------------
 8b1c02000000fff
(1 row)

Time: 55.111 ms
postgres=# SELECT * FROM h3_test_gist WHERE h3index_contains(cell, h3_cell_to_parent('8b1c02000000fff'::h3index));
 cell
------
(0 rows)

Time: 397.042 ms

# Sanity check - this should return nothing
postgres=# SELECT * FROM h3_test_gist WHERE cell @> h3_cell_to_parent('8b1c02000000fff'::h3index);
 cell
------
(0 rows)

Time: 60.516 ms

# Query with this cell's parent
postgres=# SELECT * FROM h3_test_gist WHERE h3index_contains(h3_cell_to_parent('8b1c02000000fff'::h3index), cell);
      cell
-----------------
 8b1c02000000fff
 8b1c02000001fff
 8b1c02000002fff
 8b1c02000003fff
 8b1c02000004fff
 8b1c02000005fff
 8b1c02000006fff
(7 rows)

Time: 466.482 ms

# Again a close-to 10x speedup over the naive variant
postgres=# SELECT * FROM h3_test_gist WHERE h3_cell_to_parent('8b1c02000000fff'::h3index) @> cell;
      cell
-----------------
 8b1c02000005fff
 8b1c02000000fff
 8b1c02000001fff
 8b1c02000003fff
 8b1c02000006fff
 8b1c02000002fff
 8b1c02000004fff
(7 rows)

Time: 58.952 ms
postgres=#

@zachasme
Copy link
Owner Author

zachasme commented Oct 14, 2024

Hm, you are right, I've rebased the branch and it does seem to return the correct results. I actually don't recall what the issue was.

I would like to come up with a good test with a large amount of inserts and deletions over a range of resolutions, in order to verify my picksplit/consistent algorithm.

Thanks for picking this up @mattiZed. If we feel somewhat comfortable about the correctness, I could release a new version and you/others could start testing on real data.

@mattiZed
Copy link

mattiZed commented Oct 15, 2024

Hey @zachasme

thanks for the reply. I'll pull your changes and do some more digging.

I'd also like to spark some discussion here. While it seems to me that the current implementation yields correct results (bear in mind that I all I did so far, was doing some quick checks in a psql-terminal), I don't understand yet if it's particularly efficient or "exploitative" of the H3Index properties.

Primer: No offense. This is not meant to come off condescending. I understand that you might have invested a lot more thought on this topic than I currently have. I just want to share my thoughts :)

I'm still understanding how GiST platform really is designed and I would like to take a deeper look at what opclass_gist.c is actually doing currently. For that I managed to get the debugger running. Since sharing is caring, here are my steps:

Debugging

  1. All inside a container based on image ubuntu:22.04
  2. install postgresql-server-dev-14 (PG 14 is the version I currently target), build-essential, gdb, git etc
  3. create user postgres and clone this repo
  4. Connect VS Code via the Remote Dev extension. Important: You'd wanna change your VSC container config (when connected via VS Code, click on the blue bar in the bottom left, then select "Open Container Configuration File" from the commands) to contain something like this:
{
	"workspaceFolder": "/var/lib/postgresql/h3-pg",
	"remoteUser": "postgres"
}

so VSC Editor operations inside the container are not done by the root user (avoids having to chmod/chown all the time)
5) I compile like this (basically exactly what you posted):

cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
sudo cmake --install build --component h3-pg
ctest --test-dir build --output-on-failure --build-config Debug
  1. Create a DB
/usr/lib/postgresql/14/bin/initdb /tmp/test
  1. Start the thing /usr/lib/postgresql/14/bin/pg_ctl -D /tmp/test/ start
  2. I then connect from the terminal via psql and load the extension
  3. Find the current backend SELECT pg_backend_pid()
  4. My VS Code launch configuration for the debugger:
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Attach",
            "type": "cppdbg",
            "request": "attach",
            "program": "/usr/lib/postgresql/14/lib/h3_postgis.so",
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "Enable pretty-printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                },
                {
                    "description": "Set Disassembly Flavor to Intel",
                    "text": "-gdb-set disassembly-flavor intel",
                    "ignoreFailures": true
                }
            ]
        }

    ]
}
  1. Start the debug configuration and attach to the PID reported in step 9. For this to work, you have to modify kernel security: echo 0 > /proc/sys/kernel/yama/ptrace_scope. This might have implications if you're working on a multi-user system, so use with care. There seems to be a way to do it directly via docker: first response which I haven't tried.

Thoughts

Penalty

For some reason, I was initially under the assumption that GiST is a binary tree behind the scenes, so each internal node can exactly have two children. By now I think this is not correct and an internal node can probably have an arbitrary number of children - ultimately this behaviour seems to be governed by the penalty function.

I tried to think of what the Index tree really should look like, if we forget about GiST for a moment: Considering an example where we have a "highly populated" Index, the index tree could look like this:

Up to 122 nodes in the first generation (the base cells) and then each node in any following generation could have up to 7 child nodes. We could also think of the index as up to 122 "hepternary" trees since each tree has a base cell as a root node and then each child node could have up to 7 child nodes. I hope I made sense so far.

I plan to look closely under the debugger what happens, if we do e.g.

INSERT INTO table (cell) SELECT :hexagon
INSERT INTO table (cell) SELECT h3_cell_to_children(:hexagon)

I would expect the index to contain :hexagon as the only internal node at this point with all it's children attached as leaf nodes. I did not find a way yet to really inspect the layout of the index data, but I hope what happens is somewhat traceable by breaking on calls to penalty and probably union.

Picksplit

I initially thought that this somehow splits what is connected to an internal node. However, the documentation clearly states "index page". I'm not clear if the terms "internal node" and "index page" can be used interchangeably here, or if they are totally different concepts. For now, I decided to treat this function exactly as the documentation states: It should split a vector of elements. I was creating some console output with the DEBUG macro and I was confused that the split seemed very unbalanced (the left side and the right side contained e.g. 4 vs 400 elements) I think all the time.

What makes the H3 Picksplit different from other approaches I have seen (e.g. when building a GiST for bounding box datatypes): we cannot just easily create an "enclosing" feature for representing either of the splits that contains exactly all the members of one side of the split as we could with the bounding box example. We are bound to the H3 grid, and many times the representants of either side of the split are exactly the same index.

I think splitting could be improved if we

  1. identity if left and right side would end up with the same representant
  2. if 1), then we distribute the entries across all the 7 children of the representant
  3. we associate the children to left or right side of the split based on the entry count of each child

So that, for example, we could end up with a split that contains 2 children in left and 4 children in right. But the 2 children in left might cover 10 entries each, where the 4 children in right cover 6 entries each. I hope this made sense as well.

Would love to hear your thoughts.

@zachasme
Copy link
Owner Author

zachasme commented Jan 2, 2025

Hi @mattiZed, sorry for the late reply. And thank you for looking into this!

I got stuck on picking a picksplit and penalty algorithm. My current picksplit works as follow:

  • Find the two H3 indices that would result the most "wasted space" if grouped. That is, the number of children of the finest common ancestor, at the finer resolution of the two indices, minus the number of children of the two indices.
  • Then, assign each node/index to one of the two chosen indices, where the finest common ancestor has the smallest change in resolution (and update the parent node to the new common ancestor).

I have no idea if the right path forward is to optimize on that initial algorithm, or go with something completely different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement 🚀 New feature or request help wanted ⛏️ Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants