Apache Cassandra
This page provides a quickstart for using Apache Cassandra® as a Vector Store.
Cassandra is a NoSQL, row-oriented, highly scalable and highly available database.Starting with version 5.0, the database ships with vector search capabilities.
Note: in addition to access to the database, an OpenAI API Key is required to run the full example.
Setup and general dependencies
Use of the integration requires the following Python package.
%pip install --upgrade --quiet langchain-community "cassio>=0.1.4"
Note: depending on your LangChain setup, you may need to install/upgrade other dependencies needed for this demo
(specifically, recent versions of datasets
, openai
, pypdf
and tiktoken
are required, along with langchain-community
).
import os
from getpass import getpass
from datasets import (
load_dataset,
)
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass("OPENAI_API_KEY = ")
embe = OpenAIEmbeddings()
Import the Vector Store
from langchain_community.vectorstores import Cassandra
Connection parameters
The Vector Store integration shown in this page can be used with Cassandra as well as other derived databases, such as Astra DB, which use the CQL (Cassandra Query Language) protocol.
DataStax Astra DB is a managed serverless database built on Cassandra, offering the same interface and strengths.
Depending on whether you connect to a Cassandra cluster or to Astra DB through CQL, you will provide different parameters when creating the vector store object.
Connecting to a Cassandra cluster
You first need to create a cassandra.cluster.Session
object, as described in the Cassandra driver documentation. The details vary (e.g. with network settings and authentication), but this might be something like:
from cassandra.cluster import Cluster
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
You can now set the session, along with your desired keyspace name, as a global CassIO parameter:
import cassio
CASSANDRA_KEYSPACE = input("CASSANDRA_KEYSPACE = ")
cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)
Now you can create the vector store:
vstore = Cassandra(
embedding=embe,
table_name="cassandra_vector_demo",
# session=None, keyspace=None # Uncomment on older versions of LangChain
)
Note: you can also pass your session and keyspace directly as parameters when creating the vector store. Using the global cassio.init
setting, however, comes handy if your applications uses Cassandra in several ways (for instance, for vector store, chat memory and LLM response caching), as it allows to centralize credential and DB connection management in one place.
Connecting to Astra DB through CQL
In this case you initialize CassIO with the following connection parameters:
- the Database ID, e.g.
01234567-89ab-cdef-0123-456789abcdef
- the Token, e.g.
AstraCS:6gBhNmsk135....
(it must be a "Database Administrator" token) - Optionally a Keyspace name (if omitted, the default one for the database will be used)
ASTRA_DB_ID = input("ASTRA_DB_ID = ")
ASTRA_DB_APPLICATION_TOKEN = getpass("ASTRA_DB_APPLICATION_TOKEN = ")
desired_keyspace = input("ASTRA_DB_KEYSPACE (optional, can be left empty) = ")
if desired_keyspace:
ASTRA_DB_KEYSPACE = desired_keyspace
else:
ASTRA_DB_KEYSPACE = None
import cassio
cassio.init(
database_id=ASTRA_DB_ID,
token=ASTRA_DB_APPLICATION_TOKEN,
keyspace=ASTRA_DB_KEYSPACE,
)
Now you can create the vector store:
vstore = Cassandra(
embedding=embe,
table_name="cassandra_vector_demo",
# session=None, keyspace=None # Uncomment on older versions of LangChain
)