Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for Neo4j Uploader Improvments #357

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

jexp
Copy link

@jexp jexp commented Jan 28, 2025

  • use neo4j-rust-ext instead of plain neo4j driver for 10x perf improvement
  • always match on a single node label (equivalent to the constraint), never blank matches
  • group relationships by type, source- and target-type
  • increase batch size
  • use vector property procedure to set fp32 instead of p64
  • method to select the main label for a node
  • TODO: create vector index would need information from the embedder (dimension) and similarity function (from config)

@@ -257,6 +261,8 @@ async def run_async(self, path: Path, file_data: FileData, **kwargs) -> None: #
graph_data = _GraphData.model_validate(staged_data)
async with self.connection_config.get_client() as client:
await self._create_uniqueness_constraints(client)
# TODO need chunker config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could get the dimensions from the content by inspecting the length of the embeddings.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea

- use neo4j-rust-ext instead of plain neo4j driver for 10x perf improvement
- always match on a single node label (equivalent to the constraint), never blank matches
- group relationships by type, source- and target-type
- increase batch size
- use vector property procedure to set fp32 instead of p64
- method to select the main label for a node
- TODO: create vector index would need information from the embedder (dimension) and similarity function (from config)
- Set extra labels
Copy link
Contributor

@ds-filipknefel ds-filipknefel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few question and comments to inform further review

Comment on lines 285 to 288
logger.info(
f"Adding id uniqueness constraint for nodes labeled '{label.value}'"
" if it does not already exist."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log doesn't seem to describe what's happening, we create a vector index but log we're creating id uniqueness constraint?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this should read, "Adding vector index for nodes labeled '{label.value}'" instead.

Comment on lines 313 to 320
def _main_label(self, labels: list[Label]) -> Label:
if labels is None or len(labels) == 0: return None

for label in Label:
if label in labels:
return label
else:
return labels[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this function may not do what it was intended to, right now it:

  • Returns first enum value of Label (where order is arbitrary) if that label is present in provided labels parameter
  • Returns first label from provided labels parameter otherwise
  1. What was the intention of the function?
  2. If the idea is that each node should necessarily have a main label that it will be searched by then I'd argue that we should represent that in the abstraction, that is to:
  • add a main_label: Label field to the node - use that for searches
  • have secondary_labels: list[Label] field for the rest of the labels

Names subject to change, just sketching the idea

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intent was to return the first of any found label. Neo4j does not qualify between available labels - so not sure what the best universal way of identifying which should be the main label outside of some external configuration.

Comment on lines 382 to 392
def _create_nodes_query(nodes: list[_Node], label: Label) -> tuple[str, dict]:
logger.info(f"Preparing MERGE query for {len(nodes)} nodes labeled '{label}'.")
query_string = f"""
UNWIND $nodes AS node
MERGE (n: {labels_string} {{id: node.id}})
MERGE (n: `{label.value}` {{id: node.id}})
SET n += node.properties
SET n:$(node.labels)
WITH * WHERE node.vector IS NOT NULL
CALL db.create.setNodeVectorProperty(n, 'embedding', node.vector)
"""
parameters = {"nodes": [{"id": node.id_, "properties": node.properties} for node in nodes]}
parameters = {"nodes": [{"id": node.id_, "labels":[l.value for l in node.labels if l != label],"vector":node.properties.pop('embedding', None), "properties": node.properties} for node in nodes]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow exactly one label per Node?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly would be simpler. But would dramatically alter the data model for graphs containing multi-labeled nodes, resulting in a lot of duplicated nodes (with the same properties) and relationships to get the same effect. My gut says no, but it's an idea worth exploring if the ingested data is coming entirely from unstructured.

Comment on lines 396 to 402
def _create_edges_query(edges: list[_Edge], relationship: tuple[Relationship,Label,Label]) -> tuple[str, dict]:
logger.info(f"Preparing MERGE query for {len(edges)} {relationship} relationships.")
query_string = f"""
UNWIND $edges AS edge
MATCH (u {{id: edge.source}})
MATCH (v {{id: edge.destination}})
MERGE (u)-[:{relationship.value}]->(v)
MATCH (u: `{relationship[1].value}` {{id: edge.source}})
MATCH (v: `{relationship[2].value}` {{id: edge.destination}})
MERGE (u)-[:`{relationship[0].value}`]->(v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this imply that a relationship should always be defined in terms of connecting two specified labels?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Match could be found with just the ids, but by adding labels, the match part the of the query will be more performant.

f"""
CREATE VECTOR INDEX {index_name} IF NOT EXISTS
FOR (n:{label.value}) ON n.embedding
OPTIONS {{`vector.similarity_function`: '{similarity_function}', `vector.dimensions`: {dimensions}}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want the function to be configurable by user when creating the destination connector then such field should be added to Neo4jUploaderConfig class

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is setup to try and avoid any naming collisions when additional indexes are added. Making the index name configurable might have unexpected consequences later.

Filip Knefel added 3 commits February 6, 2025 10:44
Implement main label getting logic on the Node.
Validate that Node has at least one label to ensure there's always a
main label.
Remove data about nodes from the edge, refer to Node objects directly
in the Edge.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants