-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for Neo4j Uploader Improvments #357
base: main
Are you sure you want to change the base?
Conversation
jexp
commented
Jan 28, 2025
•
edited
Loading
edited
- use neo4j-rust-ext instead of plain neo4j driver for 10x perf improvement
- always match on a single node label (equivalent to the constraint), never blank matches
- group relationships by type, source- and target-type
- increase batch size
- use vector property procedure to set fp32 instead of p64
- method to select the main label for a node
- TODO: create vector index would need information from the embedder (dimension) and similarity function (from config)
@@ -257,6 +261,8 @@ async def run_async(self, path: Path, file_data: FileData, **kwargs) -> None: # | |||
graph_data = _GraphData.model_validate(staged_data) | |||
async with self.connection_config.get_client() as client: | |||
await self._create_uniqueness_constraints(client) | |||
# TODO need chunker config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could get the dimensions from the content by inspecting the length of the embeddings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea
- use neo4j-rust-ext instead of plain neo4j driver for 10x perf improvement - always match on a single node label (equivalent to the constraint), never blank matches - group relationships by type, source- and target-type - increase batch size - use vector property procedure to set fp32 instead of p64 - method to select the main label for a node - TODO: create vector index would need information from the embedder (dimension) and similarity function (from config) - Set extra labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few question and comments to inform further review
logger.info( | ||
f"Adding id uniqueness constraint for nodes labeled '{label.value}'" | ||
" if it does not already exist." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log doesn't seem to describe what's happening, we create a vector index but log we're creating id uniqueness constraint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this should read, "Adding vector index for nodes labeled '{label.value}'" instead.
def _main_label(self, labels: list[Label]) -> Label: | ||
if labels is None or len(labels) == 0: return None | ||
|
||
for label in Label: | ||
if label in labels: | ||
return label | ||
else: | ||
return labels[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this function may not do what it was intended to, right now it:
- Returns first enum value of
Label
(where order is arbitrary) if that label is present in providedlabels
parameter - Returns first label from provided
labels
parameter otherwise
- What was the intention of the function?
- If the idea is that each node should necessarily have a main label that it will be searched by then I'd argue that we should represent that in the abstraction, that is to:
- add a
main_label: Label
field to the node - use that for searches - have
secondary_labels: list[Label]
field for the rest of the labels
Names subject to change, just sketching the idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the intent was to return the first of any found label. Neo4j does not qualify between available labels - so not sure what the best universal way of identifying which should be the main label outside of some external configuration.
def _create_nodes_query(nodes: list[_Node], label: Label) -> tuple[str, dict]: | ||
logger.info(f"Preparing MERGE query for {len(nodes)} nodes labeled '{label}'.") | ||
query_string = f""" | ||
UNWIND $nodes AS node | ||
MERGE (n: {labels_string} {{id: node.id}}) | ||
MERGE (n: `{label.value}` {{id: node.id}}) | ||
SET n += node.properties | ||
SET n:$(node.labels) | ||
WITH * WHERE node.vector IS NOT NULL | ||
CALL db.create.setNodeVectorProperty(n, 'embedding', node.vector) | ||
""" | ||
parameters = {"nodes": [{"id": node.id_, "properties": node.properties} for node in nodes]} | ||
parameters = {"nodes": [{"id": node.id_, "labels":[l.value for l in node.labels if l != label],"vector":node.properties.pop('embedding', None), "properties": node.properties} for node in nodes]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to allow exactly one label per Node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly would be simpler. But would dramatically alter the data model for graphs containing multi-labeled nodes, resulting in a lot of duplicated nodes (with the same properties) and relationships to get the same effect. My gut says no, but it's an idea worth exploring if the ingested data is coming entirely from unstructured.
def _create_edges_query(edges: list[_Edge], relationship: tuple[Relationship,Label,Label]) -> tuple[str, dict]: | ||
logger.info(f"Preparing MERGE query for {len(edges)} {relationship} relationships.") | ||
query_string = f""" | ||
UNWIND $edges AS edge | ||
MATCH (u {{id: edge.source}}) | ||
MATCH (v {{id: edge.destination}}) | ||
MERGE (u)-[:{relationship.value}]->(v) | ||
MATCH (u: `{relationship[1].value}` {{id: edge.source}}) | ||
MATCH (v: `{relationship[2].value}` {{id: edge.destination}}) | ||
MERGE (u)-[:`{relationship[0].value}`]->(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this imply that a relationship should always be defined in terms of connecting two specified labels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Match could be found with just the ids, but by adding labels, the match part the of the query will be more performant.
f""" | ||
CREATE VECTOR INDEX {index_name} IF NOT EXISTS | ||
FOR (n:{label.value}) ON n.embedding | ||
OPTIONS {{`vector.similarity_function`: '{similarity_function}', `vector.dimensions`: {dimensions}}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want the function to be configurable by user when creating the destination connector then such field should be added to Neo4jUploaderConfig
class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is setup to try and avoid any naming collisions when additional indexes are added. Making the index name configurable might have unexpected consequences later.
Implement main label getting logic on the Node. Validate that Node has at least one label to ensure there's always a main label. Remove data about nodes from the edge, refer to Node objects directly in the Edge.