Exciting news for fans of PostgreSQL: pgvectorscale brings a vector search extension to the robust PostgreSQL database. I have always recommended PostgreSQL for data handling in larger projects, and with pgvectorscale it can now be recommended for machine learning projects as well. Essentially, pgvectorscale addresses two of the requirements of machine learning projects: vectors and scaling.
While MongoDB is very useful for client-side apps, pgvectorscale brings enterprise-grade scaling to the machine learning platform. One thing machine learning platforms do is generate data. At the core of a vector search engine is the idea that if data and documents are alike, their vectors will be similar. By indexing both queries and documents with vector embeddings, you find similar documents as the nearest neighbors of your query.
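To make the nearest-neighbor idea concrete, here is a minimal sketch in plain Python. The document names and the three-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), and cosine similarity is only one of several measures a vector search engine might use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, documents, k=2):
    """Return the names of the k documents most similar to the query vector."""
    ranked = sorted(
        documents,
        key=lambda name: cosine_similarity(query, documents[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings, invented for this example.
documents = {
    "postgres_tuning": [0.9, 0.1, 0.0],
    "docker_basics": [0.1, 0.9, 0.1],
    "postgres_indexes": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

print(nearest_neighbors(query, documents))
```

This brute-force version compares the query against every document. An index such as the one pgvectorscale provides exists to avoid that full scan as the data grows.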
Getting Docker
pgvectorscale is an innovative PostgreSQL extension designed to enhance the capabilities of pgvector, making PostgreSQL a competitive choice for high-performance, cost-efficient vector search applications commonly required in AI and machine learning.
Docker has become a popular choice for containerizing software, with a reported 82% market share. While you can get up and running with Docker quickly, under the hood is a rich feature set that I have been exploring for days. Dockerfiles in tandem with Starlark provide a robust installation framework that reduces complicated build dependencies to a single call. Part of the beauty of Docker lies in the Dockerfile, which is useful for creating customized software environments as containers.
The following is a pgvector.Dockerfile, posted more for my own reference than yours. It will help you onboard into pgvectorscale using Ubuntu 22.04 LTS.
Creating a Docker image that seamlessly integrates PostgreSQL 16 and the pgvectorscale extension requires meticulous configuration and setup. This article explains how to construct such an image using Ubuntu 22.04 as the base, ensuring that all components work harmoniously together. The Dockerfile provided below outlines each step in this comprehensive setup process.
List all Docker images:
docker images -a
Remove all Docker images:
docker rmi -f $(docker images -aq)
We start with the official Ubuntu 22.04 image, known for its stability and support for various applications. Setting environment variables to non-interactive ensures that package installations proceed without user prompts, which is crucial for automated builds.
pgvector.Dockerfile
# Use Ubuntu 22.04 as a base image
FROM ubuntu:22.04
# Set environment variables to non-interactive to prevent prompts
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=America/New_York
# Install necessary packages
RUN ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime && \
apt-get update && \
apt-get install -y git tzdata gnupg2 wget nano curl make gcc pkg-config clang libssl-dev lsb-release software-properties-common && \
dpkg-reconfigure --frontend noninteractive tzdata
# Add the PostgreSQL 16 repository
RUN wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add - && \
sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' && \
apt-get update
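# Install PostgreSQL 16, contrib modules, pgvector, and timescaledb
# Note: these packages do not include pgvectorscale itself; the vectorscale
# extension must be built or installed separately before CREATE EXTENSION succeeds
RUN apt-get install -y postgresql-16 postgresql-contrib-16 postgresql-16-pgvector postgresql-16-timescaledb postgresql-server-dev-16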
# Ensure PostgreSQL binaries are in the PATH
ENV PATH="/usr/lib/postgresql/16/bin:$PATH"
# Ensure the postgres user exists
RUN id -u postgres || useradd -ms /bin/bash postgres
# Change ownership of PostgreSQL data directory
RUN mkdir -p /var/lib/postgresql/16/main && chown -R postgres:postgres /var/lib/postgresql && chmod -R 700 /var/lib/postgresql
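# Copy the initialization script
# Note: unlike the official postgres image, this image has no entrypoint that
# runs scripts in /docker-entrypoint-initdb.d/ automatically; run init.sql with psql after startup
COPY templates/databases/init.sql /docker-entrypoint-initdb.d/
RUN chown -R postgres:postgres /docker-entrypoint-initdb.d/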
# Configure PostgreSQL for remote connections and password authentication
RUN sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '*'/" /etc/postgresql/16/main/postgresql.conf && \
sed -i '/^local\s\+all\s\+postgres\s\+trust/s/trust/md5/' /etc/postgresql/16/main/pg_hba.conf && \
sed -i '/^local\s\+all\s\+all\s\+trust/s/trust/md5/' /etc/postgresql/16/main/pg_hba.conf && \
sed -i '/^local\s\+all\s\+all\s\+peer/s/peer/md5/' /etc/postgresql/16/main/pg_hba.conf && \
echo "host all all 0.0.0.0/0 md5" >> /etc/postgresql/16/main/pg_hba.conf
# Expose PostgreSQL port
EXPOSE 5432
# Start PostgreSQL service as the postgres user
USER postgres
CMD ["postgres", "-D", "/var/lib/postgresql/16/main", "-c", "config_file=/etc/postgresql/16/main/postgresql.conf"]
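To see what the three sed commands and the appended host rule in the Dockerfile above actually do, here is a rough Python sketch that applies the same substitutions to a sample pg_hba.conf. The sample is an assumption modelled on a stock Ubuntu install, and your file may differ, so treat this as a way to reason about the edits, not as a copy of what runs in the image.

```python
import re

# (line pattern, old method, new method), mirroring the three sed commands
RULES = [
    (r"^local\s+all\s+postgres\s+trust", "trust", "md5"),
    (r"^local\s+all\s+all\s+trust", "trust", "md5"),
    (r"^local\s+all\s+all\s+peer", "peer", "md5"),
]

# Assumed sample content; a real pg_hba.conf also contains comments and more rules.
SAMPLE = """\
local   all             postgres                                peer
local   all             all                                     peer
host    all             all             127.0.0.1/32            scram-sha-256
"""


def rewrite_pg_hba(text):
    """Apply each rule to every matching line, then append the remote-access rule."""
    lines = text.splitlines()
    for pattern, old, new in RULES:
        lines = [
            line.replace(old, new, 1) if re.search(pattern, line) else line
            for line in lines
        ]
    lines.append("host all all 0.0.0.0/0 md5")
    return "\n".join(lines) + "\n"


print(rewrite_pg_hba(SAMPLE))
```

With this sample, the `local all all` line switches to md5, while the `local all postgres` line keeps peer authentication because none of the three patterns matches it. That is what lets the postgres operating system user run psql inside the container without a password.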
./templates/databases/init.sql
-- Loop through databases to create users and databases, and grant privileges
{{ range .dbs }}
\connect {{ $.master_db }} {{ $.master_user }};
CREATE USER {{ .user }} WITH PASSWORD '{{ .password }}';
CREATE DATABASE {{ .name }} OWNER {{ .user }};
{{ if .init }}
\connect {{ .name }} {{ .user }};
{{ .init }}
{{ end }}
GRANT ALL PRIVILEGES ON DATABASE {{ .name }} TO {{ .user }};
{{ end }}
-- Create the vectorscale extension
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
Build the image and start the container:
docker build -t custom-timescaledb -f pgvector.Dockerfile .
docker run -d --name my-timescaledb -p 5432:5432 custom-timescaledb
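Once the container is started, a quick way to check that the published port accepts connections is a small TCP probe. This is only a sketch: the helper name is my own, and a successful connection shows that something is listening on the port, not that PostgreSQL is ready to serve queries.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 5432 is the port published by the docker run command above.
print("PostgreSQL port reachable:", is_port_open("localhost", 5432))
```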
Stop and remove the existing container:
docker stop my-timescaledb
docker rm my-timescaledb
# run interactively
docker run -it --name my-timescaledb custom-timescaledb /bin/bash
# verify PostgreSQL
which postgres
Interact with PostgreSQL as a client:
sudo apt install postgresql-client-common
docker run -it --name my-timescaledb custom-timescaledb /bin/bash
Access the PostgreSQL server as the postgres user inside the container:
docker exec -it my-timescaledb /bin/bash
psql
Make sure the local entry in /etc/postgresql/16/main/pg_hba.conf no longer uses trust, shown here before the change:
# "local" is for Unix domain socket connections only
local all all trust
Set a password for the postgres user:
ALTER USER postgres PASSWORD 'your_secure_password';
\q
exit
Create the master_user role and the master database:
CREATE ROLE master_user WITH LOGIN PASSWORD 'master_password';
CREATE DATABASE master OWNER master_user;
GRANT ALL PRIVILEGES ON DATABASE master TO master_user;
Connect to the PostgreSQL master database as master_user:
docker exec -it my-timescaledb psql -U master_user -d master
\q
exit
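With a working connection, the next step is usually a table with a vector column and a nearest-neighbor query. The sketch below only builds the SQL and the parameters, so it runs without a database. The table name, column names, and the three-dimensional vector size are hypothetical. The `<=>` operator is pgvector's cosine distance, the `diskann` index type comes from pgvectorscale, and the `%s` placeholders assume a psycopg-style driver.

```python
def to_vector_literal(values):
    """Format numbers as a pgvector text literal such as [1.0,0.25,0.0]."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


# Hypothetical schema; requires the vectorscale extension to be installed on the server.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
CREATE TABLE IF NOT EXISTS documents (
    id BIGSERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding VECTOR(3)
);
CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING diskann (embedding vector_cosine_ops);
"""

# Smallest cosine distance first, so the most similar rows come back at the top.
SEARCH_SQL = (
    "SELECT id, content FROM documents "
    "ORDER BY embedding <=> %s::vector LIMIT %s;"
)


def search_params(query_vector, limit=5):
    """Parameters to pass alongside SEARCH_SQL to a driver's execute() call."""
    return (to_vector_literal(query_vector), limit)


print(SEARCH_SQL)
print(search_params([1, 0.25, 0]))
```

Passing the vector as a parameter, as opposed to pasting it into the SQL string, keeps the query text constant and leaves quoting to the driver.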
By following these steps, you have created a “secure” Docker image for PostgreSQL 16 integrating pgvectorscale, and configured password authentication for enhanced security. This setup ensures that database access requires authentication, preventing unauthorized access.
This custom Docker image provides a powerful, flexible, and “secure” PostgreSQL 16 environment enhanced with pgvectorscale. Key benefits include:
Enhanced PostgreSQL Capabilities: By including pgvectorscale and TimescaleDB, this setup leverages advanced features for managing time-series data and vector computations, which are crucial for modern applications such as IoT, financial analysis, and machine learning.
Automated and Reproducible: The Dockerfile ensures that the entire environment can be built and deployed consistently, making it easy to reproduce the setup across different systems and for different team members.
Security: By configuring PostgreSQL for password authentication and remote connections, the image requires basic authentication to the database. This is a crucial first step for protecting sensitive data in production environments.
Flexibility: Using environment variables and the ability to customize the PostgreSQL configuration makes this setup highly adaptable to various use cases and requirements.
Development and Testing: This image is ideal for development and testing environments where you need a reliable and consistent database setup with advanced capabilities for data analysis and processing.
Following these steps provides a point of departure for your custom Docker image as you create a robust foundation for any application requiring a powerful and secure PostgreSQL setup that can be easily enhanced with modern extensions and further integrations.
This article is released under The PostgreSQL License.
Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies.
IN NO EVENT SHALL TIMESCALE BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF TIMESCALE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
TIMESCALE SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
THE SOFTWARE PROVIDED HEREUNDER IS ON AN “AS IS” BASIS, AND TIMESCALE HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.