Methods for creating fine-tuning datasets for text-to-Cypher generation.

Created with ChatGPT-DALLE

Cypher is Neo4j’s graph query language. It was inspired and bears similarities with SQL, enabling data retrieval from knowledge graphs. Given the rise of generative AI and the widespread availability of large language models (LLMs), it is natural to ask which LLMs are capable of generating Cypher queries or how we can finetune our own model to generate Cypher from the text.

The issue presents considerable challenges, primarily due to the scarcity of fine-tuning datasets and, in my opinion, because such a dataset would significantly rely on the specific graph schema.

In this blog post, I will discuss several approaches for creating a fine-tuning dataset aimed at generating Cypher queries from text. The initial approach is grounded in Large Language Models (LLMs) and utilizes a predefined graph schema. The second strategy, rooted entirely in Python, offers a versatile means to produce a vast array of questions and Cypher queries, adaptable to any graph schema. For experimentation I created a knowledge graph that is based on a subset of the ArXiv dataset.

As I was finalizing this blogpost, Tomaz Bratanic launched an initiative project aimed at developing a comprehensive fine-tuning dataset that encompasses various graph schemas and integrates a human-in-the-loop approach to generate and validate Cypher statements. I hope that the insights discussed here will also be advantageous to the project.

I like working with the ArXiv dataset of scientific articles because of its clean, easy-to-integrate format for a knowledge graph. Utilizing techniques from my recent Medium blogpost, I enhanced this dataset with additional keywords and clusters. Since my primary focus is on building a fine-tuning dataset, I’ll omit the specifics of constructing this graph. For those interested, details can be found in this Github repository.

The graph is of a reasonable size, featuring over 38K nodes and almost 96K relationships, with 9 node labels and 8 relationship types. Its schema is illustrated in the following image:

Image by the Author

While this knowledge graph isn’t fully optimized and could be improved, it serves the purposes of this blogpost quite effectively. If you prefer to just test queries without building the graph, I uploaded the dump file in this Github repository.

The first approach I implemented was inspired by Tomaz Bratanic’s blogposts on building a knowledge graph chatbot and finetuning a LLM with H2O Studio. Initially, a selection of sample queries was provided in the prompt. However, some of the recent models have enhanced capability to generate Cypher queries directly from the graph schema. Therefore, in addition to GPT-4 or GPT-4-turbo, there are now accessible open source alternatives such as Mixtral-8x7B I anticipate could effectively generate decent quality training data.

In this project, I experimented with two models. For the sake of convenience, I decided to use GPT-4-turbo in conjunction with ChatGPT, see this Colab Notebook. However, in this notebook I performed a few tests with Mixtral-7x2B-GPTQ, a quantized model that is small enough to run on Google Colab, and which delivers satisfactory results.

To maintain data diversity and effectively monitor the generated questions, Cypher statements pairs, I have adopted a two steps approach:

  • Step 1: provide the full schema to the LLM and request it to generate 10–15 different categories of potential questions related to the graph, along with their descriptions,
  • Step 2: provide schema information and instruct the LLM to create a specific number N of training pairs for each identified category.

Extract the categories of samples:

For this step I used ChatGPT Pro version, although I did iterate through the prompt several times, combined and enhanced the outputs.

  • Extract a schema of the graph as a string (more about this in the next section).
  • Build a prompt to generate the categories:
chatgpt_categories_prompt = f"""
You are an experienced and useful Python and Neo4j/Cypher developer.

I have a knowledge graph for which I would like to generate
interesting questions which span 12 categories (or types) about the graph.
They should cover single nodes questions,
two or three more nodes, relationships and paths. Please suggest 12
categories together with their short descriptions.
Here is the graph schema:
{schema}
"""

  • Ask the LLM to generate the categories.
  • Review, make corrections and enhance the categories as needed. Here is a sample:
'''Authorship and Collaboration: Questions about co-authorship and collaboration patterns.
For example, "Which authors have co-authored articles the most?"''',
'''Article-Author Connections: Questions about the relationships between articles and authors,
such as finding articles written by a specific author or authors of a particular article.
For example, "Find all the authors of the article with tile 'Explorations of manifolds'"''',
'''Pathfinding and Connectivity: Questions that involve paths between multiple nodes,
such as tracing the relationship path from an article to a topic through keywords,
or from an author to a journal through their articles.
For example, "How is the author 'John Doe' connected to the journal 'Nature'?"'''

💡Tips💡

  • If the graph schema is very large, split it into overlapping subgraphs (this depends on the graph topology also) and repeat the above process for each subgraph.
  • When working with open source models, choose the best model you can fit on your computational resources. TheBloke has posted an extensive list of quantized models, Neo4j GenAI provides tools to work on your own hardware and LightningAI Studio is a recently released platform which gives you access to a multitude of LLMs.

Generate the training pairs:

This step was performed with OpenAI API, working with GPT-4-turbo which also has the option to output JSON format. Again the schema of the graph is provided with the prompt:

def create_prompt(schema, category):
"""Build and format the prompt."""
formatted_prompt = [
{"role": "system",
"content": "You are an experienced Cypher developer and a
helpful assistant designed to output JSON!"},
{"role": "user",
"content": f"""Generate 40 questions and their corresponding
Cypher statements about the Neo4j graph database with
the following schema:
{schema}
The questions should cover {category} and should be phrased
in a natural conversational manner. Make the questions diverse
and interesting.
Make sure to use the latest Cypher version and that all
the queries are working Cypher queries for the provided graph.
You may add values for the node attributes as needed.
Do not add any comments, do not label or number the questions.
"""}]
return formatted_prompt

Build the function which will prompt the model and will retrieve the output:

def prompt_model(messages):
"""Function to produce and extract model's generation."""
response = client.chat.completions.create(
model="gpt-4-1106-preview", # work with gpt-4-turbo
response_format={"type": "json_object"},
messages=messages)
return response.choices[0].message.content

Loop through the categories and collect the outputs in a list:

def build_synthetic_data(schema, categories):
"""Function to loop through the categories and generate data."""

# List to collect all outputs
full_output=[]
for category in categories:
# Prompt the model and retrieve the generated answer
output = [prompt_model(create_prompt(schema, category))]
# Store all the outputs in a list
full_output += output
return full_output

# Generate 40 pairs for each of the categories
full_output = build_synthetic_data(schema, categories)

# Save the outputs to a file
write_json(full_output, data_path + synthetic_data_file)

At this point in the project I collected almost 500 pairs of questions, Cypher statements. Here is a sample:

{"Question": "What articles have been written by 'John Doe'?",
"Cypher": "MATCH (a:Author {first_name:'John', last_name:'Doe'})-
[:WRITTEN_BY]-(article:Article) RETURN article.title, article.article_id;"}

The data requires significant cleaning and wrangling. While not overly complex, the process is both time-consuming and tedious. Here are several of the challenges I encountered:

  • non-JSON entries due to incomplete Cypher statements;
  • the expected format is {’question’: ‘some question’, ‘cypher’:’some cypher’}, but deviations are frequent and need to be standardized;
  • instances where the questions and the Cypher statements are clustered together, necessiting their separation and organization.

💡Tip💡

It is better to iterate through variations of the prompt than trying to find the best prompt format from the beginning. In my experience, even with diligent adjustments, generating a large volume of data like this inevitably leads to some deviations.

Now regarding the content. GPT-4-turbo is quite capable to generate good questions about the graph, however not all the Cypher statements are valid (working Cypher) and correct (extract the intended information). When fine-tuning in a production environment, I would either rectify or eliminate these erroneous statements.

I created a function execute_cypher_queries() that sends the queries to the Neo4j graph database . It either records a message in case of an error or retrieves the output from the database. This function is available in this Google Colab notebook.

From the prompt, you may notice that I instructed the LLM to generate mock data to populate the attributes values. While this approach is simpler, it results in numerous empty outputs from the graph. And it demands extra effort to identify those statements involving hallucinatins, such as made-up attributes:

'MATCH (author:Author)-[:WRITTEN_BY]-(article:Article)-[:UPDATED]-
(updateDate:UpdateDate)
WHERE article.creation_date = updateDate.update_date
RETURN DISTINCT author.first_name, author.last_name;"

The Article node has no creation_date attribute in the ArXiv graph!

💡Tip💡

To minimize the empty outputs, we could instead extract instances directly from the graph. These instances can then be incorporated into the prompt, and instruct the LLM to use this information to enrich the Cypher statements.

This method allows to create anywhere from hundreds to hundreds of thousands of correct Cypher queries, depending on the graph’s size and complexity. However, it is crucial to strike a balance bewteen the quantity and the diversity of these queries. Despite being correct and applicable to any graph, these queries can occasionally appear formulaic or rigid.

Extract Information About the Graph Structure

For this process we need to start with some data extraction and preparation. I use the Cypher queries and the some of the code from the neo4j_graph.py module in Langchain.

  • Connect to an existing Neo4j graph database.
  • Extract the schema in JSON format.
  • Extract several node and relationship instances from the graph, i.e. data from the graph to use as samples to populate the queries.

I created a Python class that perfoms these steps, it is available at utils/neo4j_schema.py in the Github repository. With all these in place, extracting the relevant data about the graph necessitates a few lines of code only:

# Initialize the Neo4j connector
graph = Neo4jGraph(url=URI, username=USER, password=PWD)
# Initialize the schema extractor module
gutils = Neo4jSchema(url=URI, username=USER, password=PWD)

# Build the schema as a JSON object
jschema = gutils.get_structured_schema
# Retrieve the list of nodes in the graph
nodes = get_nodes_list(jschema)
# Read the nodes with their properties and their datatypes
node_props_types = jschema['node_props']

# Check the output
print(f"The properties of the node Report are:\n{node_props_types['Report']}")

>>>The properties of the node Report are:
[{'property': 'report_id', 'datatype': 'STRING'}, {'property': 'report_no', 'datatype': 'STRING'}]

# Extract a list of relationships
relationships = jschema['relationships']

# Check the output
relationships[:1]

>>>[{'start': 'Article', 'type': 'HAS_KEY', 'end': 'Keyword'},
{'start': 'Article', 'type': 'HAS_DOI', 'end': 'DOI'}]

Extract Data From the Graph

This data will provide authentic values to populate our Cypher queries with.

  • First, we extract several node instances, this will retrieve all the data for nodes in the graph, including labels, attributes and their values :
# Extract node samples from the graph - 4 sets of node samples
node_instances = gutils.extract_node_instances(
nodes, # list of nodes to extract labels
4) # how many instances to extract for each node
  • Next, extract relationship instances, this includes all the data on the start node, the relationship with its type and properties, and the end node information:
# Extract relationship instances
rels_instances = gutils.extract_multiple_relationships_instances(
relationships, # list of relationships to extract instances for
8) # how many instances to extract for each relationship

💡Tips💡

  • Both of the above methods work for the full lists of nodes, relationships or sublists of them.
  • If the graph contains instances that lack records for some attributes, it is advisable to collect more instances to ensure all possible scenarios are covered.

The next step is to serialize the data, by replacing the Neo4j.time values with strings and save it to files.

Parse the Extracted Data

I refer to this phase as Python gymnastics. Here, we handle the data obtained in the previous step, which consists of the graph schema, node instances, and relationship instances. We reformat this data to make it easily accessible for the functions we are developing.

  • We first identify all the datatypes in the graph with:
dtypes = retrieve_datatypes(jschema)
dtypes

>>>{'DATE', 'INTEGER', 'STRING'}

  • For each datatype we extract the attributes (and the corresponding nodes) that have that dataype.
  • We parse instances of each datatype.
  • We also process and filter the relationships so that the start and the end nodes have attributes of specifid data types.

All the code is available in the Github repository. The reasons of doing all these will become transparent in the next section.

How to Build One or One Thousand Cypher Statements

Being a mathematician, I often perceive statements in terms of the underlying functions. Let’s consider the following example:

q = "Find the Topic whose description contains 'Jordan normal form'!"
cq = "MATCH (n:Topic) WHERE n.description CONTAINS 'Jordan normal form' RETURN n"

The above can be regarded as functions of several variables f(x, y, z) and g(x. y, z) where

f(x, y, z) = f"Find the {x} whose {y} contains {z}!"
q = f('Topic', 'description', 'Jordan normal form')

g(x, y, z) = f"MATCH (n:{x}) WHERE n.{y} CONTAINS {z} RETURN n"
qc = g('Topic', 'description', 'Jordan normal form')

How many queries of this type can we build? To simplify the argument let’s assume that there are N node labels, each having in average n properties that have STRING datatype. So at least Nxn queries are available for us to build, not taking into account the options for the string choices z.

💡Tip💡

Just because we are able to construct all these queries using a single line of code doesn’t imply that we should incorporate the entire set of examples into our fine-tuning dataset.

Develop a Process and a Template

The main challenge lies in creating a sufficiently varied list of queries that covers a wide range of aspects related to the graph. With both proprietary and open-source LLMs capable of generating basic Cypher syntax, our focus can shift to generating queries about the nodes and relationships within the graph, while omitting syntax-specific queries. To gather query examples for conversion into functional form, one could refer to any Cypher language book or explore the Neo4j Cypher documentation site.

In the GitHub repository, there are about 60 types of these queries that are then applied to the ArXiv knowledge graph. They are versatile and applicable to any graph schema.

Below is the complete Python function for creating one set of similar queries and incorporate it in the fine-tuning dataset:

def find_nodes_connected_to_node_via_relation():
def prompter(label_1, prop_1, rel_1, label_2):
subschema = get_subgraph_schema(jschema, [label_1, label_2], 2, True)
message = {"Prompt": "Convert the following question into a Cypher query using the provided graph schema!",
"Question": f"""For each {label_1}, find the number of {label_2} linked via {rel_1} and retrieve the {prop_1} of the {label_1} and the {label_2} counts in ascending order!""",
"Schema": f"Graph schema: {subschema}",
"Cypher": f"MATCH (n:{label_1}) -[:{rel_1}]->(m:{label_2}) WITH DISTINCT n, m RETURN n.{prop_1} AS {prop_1}, count(m) AS {label_2.lower()}_count ORDER BY {label_2.lower()}_count"
}
return message

sampler=[]
for e in all_rels:
for k, v in e[1].items():
temp_dict = prompter(e[0], k, e[2], e[3])
sampler.append(temp_dict)

return sampler

  • the function find_nodes_connected_to_node_via_relation() takes the generating prompter and evaluates it for all the elements in all_rels which is the collection of extracted and processed relationship instances, whose entries are of the form:
['Keyword',
{'name': 'logarithms', 'key_id': '720452e14ca2e4e07b76fa5a9bc0b5f6'},
'HAS_TOPIC',
'Topic',
{'cluster': 0}]
  • the prompter inputs are two nodes denoted label_1 and label_2 , the property prop_1 for label_1 and the relationship rel_1 ,
  • the message contains the components of the prompt for the corresponding entry in the fine-tuning dataset,
  • the subschema extracts first neighbors for the two nodes denoted label_1 and label_2 , this means: the two nodes listed, all the nodes related to them (distance one in the graph), the relationships and all the corresponding attributes.

💡Tip💡

Including the subschema in the finetuning dataset is not essential, although the more closely the prompt aligns with the fine-tuning data, the better the generated output tends to be. From my perspective, incorporating the subschema in the fine-tuning data still offers advantages.

To summarize, post has explored various methods for building a fine-tuning dataset for generating Cypher queries from text. Here is a breakdown of these techniques, along with their advantages and disadvantages:

LLM generated question and Cypher statements pairs:

  • The method may seem straightforward in terms of data collection, yet it often demands excessive data cleaning.
  • While certain proprietary LLMs yield good outcomes, many open source LLMs still lack the proficiency of generating a wide range of accurate Cypher statements.
  • This technique becomes burdensome when the graph schema is complex.

Functional approach or parametric query generation:

  • This method is adaptable across various graphs schemas and allows for easy scaling of the sample size. However, it is important to ensure that the data doesn’t become overly repetitive and maintains diversity.
  • It requires a significant amount of Python programming. The queries generated can often seem mechanial and may lack a conversational tone.

To expand beyond these approaches:

  • The graph schema can be seamlessley incorporated into the framework for creating the functional queries. Consider the following question, Cypher statement pair:
Question: Which articles were written by the author whose last name is Doe?
Cypher: "MATCH (a:Article) -[:WRITTEN_BY]-> (:Author {last_name: 'Doe') RETURN a"

Instead of using a direct parametrization, we could incorporate basic parsing (such as replacing WRITTEN_BY with written by), enhancing the naturalness of the generated question.

This highligts the significance of the graph schema’s design and the labelling of graph’s entities in the construction of the fine-tuning pars. Adhering to general norms like using nouns for node labels and suggestive verbs for the relationships proves beneficial and can create a more organically conversational link between the elements.

  • Finally, it is crucial not to overlook the value of collecting actual user generated queries from graph interactions. When available, parametrizing these queries or enhancing them through other methods can be very useful. Ultimately, the effectiveness of this method depends on the specific objectives for which the graph has been designed.

To this end, it is important to mention that my focus was on simpler Cypher queries. I did not address creating or modifying data within the graph, or the graph schema, nor I did include APOC queries.

Are there any other methods or ideas you might suggest for generating such fine-tuning question and Cypher statement pairs?

Code

Github Repository: Knowledge_Graphs_Assortment — for building the ArXiv knowledge graph

Github Repository: Cypher_Generator — for all the code related to this blogpost

Data

• Repository of scholary articles: arXiv Dataset that has CC0: Public Domain license.

Leave a Reply