Embedding near the edge: pgEdge Distributed PostgreSQL with pgVector
Putting vector similarity search capability closer to end users for faster response times
Introduction
We are excited to be announcing that we now support the increasingly popular pgVector Postgres extension for storing and searching vector embeddings in AI-powered applications. Bringing pgVector and pgEdge’s distributed capabilities together makes for a powerful combination that greatly improves performance for users regardless of their geographic location.
In this blog we'll demonstrate how to configure pgVector with pgEdge to provide similarity search functionality across a pgEdge Distributed PostgreSQL cluster.
I will start with brief summary of the products mentioned in the title of this blog:
pgEdge is fully-distributed PostgreSQL, optimized for the network edge and deployable across multiple cloud regions or data centers. pgEdge is available as pgEdge Platform, self-hosted software available for download from [download link]; or as pgEdge Cloud, a fully managed service. This blog is applicable to both pgEdge Cloud and pdEdge Platform.
pgvector is an open source extension for PostgreSQL that enables efficient similarity search and other vector-based operations. It's often used for applications like recommendation systems and image search. The pgvector extension provides an indexable vector data type that stores vectors in a PostgreSQL database. pgvector supports the ivfflat
index, which implements the IVF FLAT
method of indexing.
Vector Database
Vector data stores data as high-dimensional vectors, which are mathematical representations of features or attributes. The number of dimensions in a vector ranges from tens to thousands, depending on the complexity and granularity of the data. The main advantage of a vector database is that it allows for fast and accurate similarity search and retrieval of data based on their vector distance or similarity. So instead of using the conventional methods for searching data using predefined criteria or exact matches or wildcards, one can use the vector database to find similar or relevant data based on semantic or contextual meaning.
Vector databases enable accurate and efficient search and analysis of large datasets by utilizing the characteristics of vectors. A vector database's capacity to locate comparable items is its key benefit. For example, two statements with comparable meanings will produce vectors that are close to one another. This allows you to use the vector database to locate all the vectors that are near to one another. For example, a vector database can be used to find:
images that are similar to a given image based on visual content and style.
documents that are similar to a given document based on topic and content.
products that are similar to a given product based on features and ratings.
Vector databases are currently the popular choice. With the rise of large-language AI models (LLMs), efficiently managing and searching large-scale, high-dimensional data has become a tremendously important use case. The solution to this challenge lies in vector databases – a powerful and increasingly popular data storage technology that enables faster and more accurate searches.
With the addition of the open-source pgvector extension, PostgreSQL is being used as a vector database. There is a lot of excitement about using PostgreSQL as a vector database, but there is more innovation to come, and work to be done to make the vector workload more secure, performant, and scalable.
Vector Data
Before showing an example of how pgEdge works with pgvector extension, it is important to understand the dynamics of vector data, and how it is stored in the database. Vector data refers to a type of data representation where each data point is described by a set of numerical values arranged in a specific order. These values are usually referred to as components or features and they capture different aspects or attributes of a data point. Vectors are commonly used to represent a wide range of information in many fields: mathematics, computer science, data science, and machine learning.
Real-world applications utilize far more than just two dimensions; OpenAI embeddings may use more than a thousand dimensions to vectorize data. One method for converting high-dimensional data into a low-dimensional space is embedding. Embedding allows us to extract data from multiple dimensions and sources, including text, photos, audio, and video, and convert it into vectors. Embedding is a widely-used technique in machine learning and natural language processing (NLP) to represent sparse symbols or objects as continuous vectors.
For example, tree data like a car, truck, cycle, helicopter, or hoverboard object may all be converted into vectors using embeddings. Two-dimensional embeddings are shown behind the object they describe in the following list:
car: embedding [2.0,2.3]
truck: embedding [3.4, 5.9]
motorcycle: embedding [0.5,1.2]
bicycle: embedding [0.2,0.8]
helicopter : embedding [13.2,19.8]
hoverboard: embedding [0.1,0.2]
A review of the result set shows us that a bicycle and a motorcycle are similar and that their vectors (if charted) would be fairly close in distance. Vehicle characteristics can also be categorized along dimensions that include color, model, year, and manufacturer. The finer-grained your data is when describing an object, the more precise your results will be in the resulting vehicle grouping.
Vector databases can efficiently find items that satisfy a query using vector representations. They use similarity metrics like Euclidean distance, Cosine similarity, or Manhattan distance to determine data point proximity, resulting in relevant and similar results.
pgvector syntax
The pgvector extension introduces a vector data type that can be used as the column type in a PostgreSQL database. The simple examples that follow show how to use the vector data type in INSERT/UPDATE/DELETE
statements, and search the vector data. Invoke the following commands with the psql client:
Creating a Sample Table
CREATE TABLE items (data text, embedding vector(3));
INSERT INTO items (data, embedding)
VALUES ('German Shepherd', '[1,2,3]'), ('Bulldog','[4,5,6]');
Retrieving Data
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
-- Finding the nearest neighbour
SELECT * FROM items WHERE data != 'Bulldog' ORDER BY embedding <-> (SELECT embedding FROM items WHERE data = 'Bulldog') LIMIT 5;
-- Get the nearest neighbours
SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;
-- Get rows inside a specific range
Managing Data Operations
INSERT INTO items (data, embedding) VALUES ('Golden Retriever', '[9,10,11]'), ('Mastiff', '[20,30,12]');
UPDATE items SET embedding = '[4.2,5.9,6]' WHERE data = 'Bulldog';
DELETE FROM items WHERE data = 'Boerboel';
Querying Aggregates
SELECT AVG(embedding) FROM items;
-- Average vectors
SELECT data, AVG(embedding) FROM items GROUP BY data;
-- Average groups of vectors
Creating Indexes
PostgreSQL can create indexes for vectors that hold up to 2000 dimensions.
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
-- L2 distance
CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);
-- Inner product
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
-- Cosine distance
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100) WHERE (data = 'Bulldog');
-- This creates a partial index
You can create embeddings using tools like the OpenAI API client. Similarity searches of vector embeddings have a variety of commercial uses like fraud detection, food industry use, security systems.
pgvector real world example
The following example is a real world sample code of an AI based enquiry system that tries to automatically answer client queries. It has a limited knowledge base, if it doesn't know the answer, it replies appropriately.
import psycopg2
import openai
import os
# OPENAI Settings
EMBEDDINGS_MODEL = "text-embedding-ada-002"
INDEX_DIMENSIONS = 1536
R_MODEL = "gpt-3.5-turbo"
R_TEMPERATURE = 0.1
R_MAX_TOKENS = 150
R_TOP_P = 1
R_FREQUENCY_PENALTY = 0
R_PRESENCE_PENALTY = 0.6
# PostgreSQL Settings
PG_DATABASE = "postgres"
PG_HOST = "localhost"
PG_USER = "asif"
PG_PASSWORD = "db_pass"
PG_PORT = "5477"
# Envionment variables OPENAI_API_KEY1 and OPENAI_ORG also require to be set before running the script
# We need the following knowledge base for OpenAI and similarity search comparison
DATAFEED = [
"PostgreSQL Conference Europe 2023 will be held on December 12-15 in Prague, Czechia at the Clarion Congress Hotel"
"Prague",
"PGDay UK 2023 is the first in a new series of local annual events in the UK on 2023-09-12, complementing the"
"existing pgDay Paris, Nordic PGDay, and other similar events organised by PostgreSQL Europe.",
"pgEdge provides distributed PostgreSQL optimized for the network edge to reduce data latency and achieve ultra"
"high availability. Only pgEdge combines multi-active (multi-master), multi-region and multi-cloud in a fully"
"managed cloud service that is 100% standard Postgres and 100% open (source available).",
"Denis Lussier. Denis is a serial postgres entrepreneur and the co-founder and CTO of pgEdge.",
"Ibrar Ahmed is a specialist in PostgreSQL core development",
"PgEge office is located at 201 N. Union Street Alexandria, VA 22314 USA. Queries can be sent to [email protected]"
]
# Test questions to mimic the client queries
QUERIES = [
"tell me about pgEdge",
"Who is Denis Lussier",
"Where is pgEdge office",
"Who is Ibrar Ahmed",
"Where is PostgreSQL Conference Europe 2023 taking place",
"When is PGDay UK 2023 taking place",
"Which country K2 mountain belong to ?"
]
# Make connection with PostgreSQL database instance
def connect_with_postgres():
connection = psycopg2.connect(database=PG_DATABASE,
host=PG_HOST,
user=PG_USER,
password=PG_PASSWORD,
port=PG_PORT)
return connection
def initialize_openai():
openai.api_key = os.getenv('OPENAI_API_KEY1')
openai.organization = os.getenv('OPENAI_ORG')
# For similarity search, generate the embedding and store it in the database
def populate_kb_table():
cur = conn.cursor()
print("Creating table ...")
cur.execute("DROP TABLE IF EXISTS knowledge_base;")
cur.execute("CREATE TABLE IF NOT EXISTS knowledge_base (id SERIAL PRIMARY KEY, info VARCHAR(500),"
"embedding VECTOR("+str(INDEX_DIMENSIONS)+"));")
print("Populating embeddings ...")
for idx, info in enumerate(DATAFEED):
embedding = openai.Embedding.create(input=info, engine=EMBEDDINGS_MODEL)['data'][0]['embedding']
cur.execute(f"""
INSERT INTO knowledge_base (info, embedding)
VALUES ('{info}', '{embedding}');
""")
cur.close()
conn.commit()
def get_response(question):
print("Get response for question: " + question)
query_embedding = openai.Embedding.create(input=question, engine=EMBEDDINGS_MODEL)['data'][0]['embedding']
cur = conn.cursor()
cur.execute("WITH matching_data AS ("
"SELECT kb.info, 1 - (kb.embedding <=> '" + str(query_embedding) + "') matching from knowledge_base kb)"
"SELECT * from matching_data where matching > 0.80 order by matching desc limit 1;")
kb_record = cur.fetchone()
cur.close()
if kb_record is not None:
info = kb_record[0]
matching = kb_record[1]
messages = [{"role": "system", "content": info},
{"role": "user", "content": question}]
response = openai.ChatCompletion.create(
model=R_MODEL,
messages=messages,
temperature=R_TEMPERATURE,
max_tokens=R_MAX_TOKENS,
top_p=R_TOP_P,
frequency_penalty=R_FREQUENCY_PENALTY,
presence_penalty=R_PRESENCE_PENALTY
)
response_content = response['choices'][0]['message']['content']
print("Response: " + response_content + "\n\n")
else:
print("Response: Unable to answer the question\n\n")
conn = connect_with_postgres()
initialize_openai()
populate_kb_table()
# run each query one by one
for i, query in enumerate(QUERIES):
get_response(query)
# clean up
conn.commit()
conn.close()
This generates the following log:
Creating table ...
Populating embeddings ...
Get response for question: tell me about pgEdge
Response: pgEdge is a distributed PostgreSQL database system designed to optimize data latency and ensure high availability. It combines three key features: multi-active (multi-master) replication, multi-region deployment, and multi-cloud support. This unique combination allows pgEdge to deliver exceptional performance and reliability.
With multi-active replication, pgEdge enables multiple database nodes to accept both read and write operations simultaneously. This ensures that data can be accessed and modified from any node, reducing latency and improving overall performance. Additionally, this feature provides automatic failover and load balancing capabilities, ensuring high availability even in the event of node failures.
pgEdge also supports multi-region deployment, allowing you to distribute your database across different geographical locations. This enables you to place your data closer to your users, reducing Get response for question: Who is Denis Lussier
Response: Denis Lussier is a serial entrepreneur and technology executive. He is known for his expertise in PostgreSQL, an open-source relational database management system. Denis has co-founded several companies in the PostgreSQL ecosystem, including pgEdge, which provides a cloud-native database platform for PostgreSQL. As the CTO of pgEdge, he is responsible for driving the technical direction and innovation of the company. Denis is highly regarded in the PostgreSQL community and has contributed to the development and adoption of the database technology.
Get response for question: Where is pgEdge office
Response: The pgEdge office is located at 201 N. Union Street Alexandria, VA 22314 USA.
Get response for question: Who is Ibrar Ahmed
Response: Ibrar Ahmed is a specialist in PostgreSQL core development. He has extensive experience in working with the PostgreSQL database management system and has contributed to its core development. He is knowledgeable in areas such as query optimization, performance tuning, and database administration.
Get response for question: Where is PostgreSQL Conference Europe 2023 taking place
Response: PostgreSQL Conference Europe 2023 will be taking place in Prague, Czechia at the Clarion Congress Hotel Prague.
Get response for question: When is PGDay UK 2023 taking place
Response: PGDay UK 2023 is taking place on September 12, 2023.
Get response for question: Which country K2 mountain belong to ?
Response: Unable to answer the question
The above sample code elaborates the use of pgvector extension for a real world example of AI based enquiry system that tries to automatically answer client queries. We can divide the application into four sections:
Questions mimic client queries to drive learning. Since it is an intelligent automatic reply enquiry system, we have fed all the client queries in
QUERIES
array.The system has a knowledge base that contains all the information that we want the system to learn. The knowledge base grows.
We perform a similarity search that exercises pgvector/PostgreSQL capabilities. We iterate and get responses from the system for each query.
We generate a response from an AI model. Since we have a limited knowledge base, if our AI model doesn't know the answer, it replies accordingly. We expect that it will reply to all the related queries.
Exercising the example
Query:
When is PGDay UK 2023 taking place?
The knowledge base contains the following entry to educate the automatic system to answer correctly:
PGDay UK 2023 is the first in a new series of local annual events in the UK on 2023-09-12, complementing the existing pgDay Paris, Nordic PGDay, and other similar events organized by PostgreSQL Europe.
Enquiry System Response:
PGDay UK 2023 is taking place on September 12, 2023.
The system has capability to do similarity search to correctly answer the posted query by the client. This is possible with the help of the PostgreSQL pgvector extension and the OpenAI embedding generation feature. When we use PostgreSQL with pgvector, not only does it provide vector search, but it helps with storage and other RDBMS features that help us develop a professional and industrial quality application.
To generate a good reasonable response to the client, we used the OpenAI model (gpt-3.5-turbo
) to generate an answer to the query. If the knowledge base provides no related knowledge, it will reply with Unable to answer the question.
This application is written in basic python code to demonstrate the real world use of the pgvector extension. It was tested with PostgreSQL 15 (with pgvector extension installed), OpenAI (via online internet access), and Python 3.9.