Release Notes — AI-Native Databases Episode 1: Andy Pavlo
Hey everyone! I am super excited to publish the first episode of our AI-Native Database series! I set out to write a TLDR of the podcast, but ended up writing something a bit longer than that. I hope this is useful to those following along with the podcast!
Self-Driving Databases
We begin by discussing “Self-Driving Databases”, a phrase coined by Pavlo et al. that really captures people’s imagination. The wording itself invokes surprise: not many expect “Database” to follow the “Self-Driving” prefix, and you can reliably count on a quick laugh or look of surprise when telling someone about this for the first time. Aside from the beauty of the naming, let’s dive into what a “Self-Driving” or “Self-Managing” database system entails.
At the top level, we have how you query a database. SQL is already a declarative language: we don’t tell the database exactly how to execute a query, relying instead on the database to figure it out by itself. This relates closely to recent advances in LLMs and Natural Language Processing, letting us simply tell the database what we want in natural language, moving even more declaratively beyond SQL. Text-to-SQL is certainly one of the most popular LLM use cases I have seen — there are even (multiple!) billboards praising Text-to-SQL in San Francisco! The language we use to get our data in and out of the database is one of the key topics throughout the podcast (later on Andy asks Bob, “when are we going to see SQL supported in Weaviate?!”), but a “Self-Driving Database” goes deeper than querying…
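To make that concrete, here is a minimal Text-to-SQL sketch using the OpenAI Python client; the podcasts table and the model choice are just assumptions for illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A hypothetical table the model can ground its SQL in.
schema = "CREATE TABLE podcasts (id INT, title TEXT, guest TEXT, published DATE)"

def text_to_sql(question: str) -> str:
    """Translate a natural language question into SQL over the schema above."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Translate questions into SQL for this schema: {schema}. Reply with SQL only."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(text_to_sql("Which episodes featured Andy Pavlo?"))
```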
So let’s dive from the highest level (querying) to the lowest level (indexes and how the data is stored on disk), and then come back somewhere in the middle (schema design and Generative Feedback Loops). In the podcast, Andy gives the example of DBAs choosing whether to use row-oriented storage (generally preferred for transactional processing) or column-oriented storage (generally preferred for analytical processing). One way for Machine Learning models to make these decisions is to look at the query logs and from there determine the indexes that need to be built, and / or how to store the data. What else can the database use to make this decision? Can it infer the indexes that are needed from the schema you have given it? How about the data itself?
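As a toy illustration of the query-log idea (nothing like what a production tuner actually does), here is a sketch that counts which columns show up in WHERE clauses and suggests indexes for the hot ones:

```python
import re
from collections import Counter

# A tiny stand-in for a real query log.
query_log = [
    "SELECT * FROM users WHERE email = 'a@b.com'",
    "SELECT * FROM users WHERE email = 'c@d.com'",
    "SELECT name FROM users WHERE age > 30",
]

# Count the columns referenced in WHERE clauses across the workload.
column_counts = Counter()
for query in query_log:
    column_counts.update(re.findall(r"WHERE\s+(\w+)", query, flags=re.IGNORECASE))

# Suggest an index for any column filtered on more than once.
for column, count in column_counts.items():
    if count > 1:
        print(f"CREATE INDEX idx_users_{column} ON users({column});")
```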
This transitions to the next Self-Driving Database concept: schema design. The low hanging fruit is to provide better interfaces to help people use databases, for example, correcting a user trying to store their ids in a VARCHAR instead of the native UUID datatype. Another idea is to infer what the application is from the current schema, figuring out what the ORM looks like in the application code sitting on top of the database. Andy describes how we might want to infer categories of applications such as finance / healthcare / … and apply certain optimizations; however, this may not be something the database can do by itself without a connection to the application code. Andy concludes with an interesting thought on plugging this kind of optimization into a VS Code extension to give the model the context it would need; more on this later.
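As a toy sketch of that first interface idea, here is a check that flags UUID-shaped strings stored in a VARCHAR column (a made-up heuristic, not a feature of any particular database):

```python
import uuid

def looks_like_uuid(value: str) -> bool:
    """Return True if the string parses as a UUID."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False

# Values sampled from a column the user declared as VARCHAR.
sample = [
    "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "550e8400-e29b-41d4-a716-446655440000",
]

if all(looks_like_uuid(v) for v in sample):
    print("Hint: this column stores UUIDs; consider the native UUID datatype.")
```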
The introduction to “Self-Driving Databases” concludes with this idea:
If you had to build a brand new database system from scratch knowing that it was going to be tuned by an AI or machine learning algorithm, what would you do differently?
Why isn’t there just 1 Database?
So let’s imagine we have a Self-Driving Database that looks at your data and / or workload, builds all the appropriate indexes / data storage mechanisms, tailors your schema, and lets you query it however you want. Sounds plausible, and yet we have such massive diversity in the database market, which transitions to one of the biggest questions in the podcast: Why isn’t there just 1 Database? In addition to Co-Founding OtterTune and teaching Database Systems at CMU, Andy also maintains a “Database of Databases”, containing nearly 1,000 databases. Andy presents 3 main reasons for the high number of databases on the market:
- People want a DB system for the new thing they’re building: OOP / XML / JSON, …
- Data models change: the object-oriented data model, for example, and its evolution over time.
- Commercial activity in the space: building a new OS is hard to get funding for, but DBs are another story, with founders chasing successes such as the Snowflake IPO.
Collaboration of Models and Databases
This section will dive into two concepts: Relations captured in Vector Embeddings and Generative Feedback Loops.
Bob begins this section by telling the story of his experience trying to get different departments of a large organization to agree on the data model for a customer. This is what planted the seed in Bob’s head to eventually develop an obsession for vector embeddings of objects. Rather than wait for everyone to agree on definitions for everything, we can use the model to capture relationships in the semantic vector space.
I find this to be a super interesting topic, especially in discussing the evolution of databases with Andy Pavlo: we saw the move from the relational model to the document model, then the document model picking up some things from the relational model, and also… the emergence of the graph model…
We currently describe the Weaviate data model as a “graph-like model” which means that you can make links between collections, but it doesn’t have low-level support for efficient joins.
Are we seeing the “vector model” as a new data model?
For example, the easiest way to use a Vector DB is to have an `Object` collection with a vectorized property called `content`. Since the 1.22 release, Weaviate supports nested object storage, so you could just throw all metadata in a property named `metadata`. The idea being that instead of relying on indexes built on symbolic properties such as `name` or `age`, all relations you would be interested in capturing have been achieved with the vector for the `content` property.
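A sketch of what that collection might look like with the v4 Python client (the `metadata` sub-properties and the vectorizer choice are just assumptions for illustration):

```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

# One vectorized `content` property plus a nested `metadata` object (1.22+).
client.collections.create(
    name="Object",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(
            name="metadata",
            data_type=DataType.OBJECT,
            nested_properties=[
                Property(name="source", data_type=DataType.TEXT),
            ],
        ),
    ],
)
```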
That’s maybe one argument for how the “vector” model will emerge: an `Object` class with a `content` property and a `metadata` JSON. However, we have seen a lot of use out of filtered vector search, for example NEARTEXT “the future of databases” where source == “podcast” versus source == “arxiv”, which you may be losing with this kind of Object/content/metadata data model. Or not; maybe you can encode the source directly in `content`.
I am also very curious where the join analog comes into vector search. On one hand, Weaviate lets you combine where filters such as where source == “podcast” OR “arxiv”. But in the relational DB sense, this would entail joins such as where podcast.topic == arxiv.topic and then extensions to control exclusivity in the set overlap. Does it then make sense to join collections such as “Papers” and “Podcasts” that share an id or property and vector search over the resulting product?
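As a sketch of that filtered search (continuing the collection above, but assuming `source` is a top-level property rather than nested under `metadata`):

```python
from weaviate.classes.query import Filter

objects = client.collections.get("Object")

# The same vector search, restricted to each symbolic source in turn.
for source in ["podcast", "arxiv"]:
    response = objects.query.near_text(
        query="the future of databases",
        filters=Filter.by_property("source").equal(source),
        limit=3,
    )
    for obj in response.objects:
        print(source, obj.properties["content"])
```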
So hopefully that’s a decent framing of the “data model” concept and how an AI-Native Database might come with an entirely new data model; let’s call it the “vector model” for now. The next key topic here is Generative Feedback Loops.
We can begin with the LLM taking in new data objects and formatting them into the schema. This could be done with things like semantic type detection for new properties. Another case could be, “Hey you are an AirBnB listing without a description, let me use this other data to write that description” → vectorize synthetic description → into the vector index. This AirBnB example is an illustration of the Generative Feedback Loop performing CRUD on the data, changing what is being stored in the DB itself. There are all sorts of examples of this such as synthesizing a summary of product reviews, generating synthetic hotels, and more!
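Here is a minimal sketch of that loop, with `generate_description` standing in for the LLM call (a hypothetical helper, not a real library function):

```python
def generate_description(listing: dict) -> str:
    """Stand-in for an LLM call that drafts a description from other fields."""
    return f"A {listing['bedrooms']}-bedroom place in {listing['neighborhood']}."

listing = {
    "name": "Cozy loft",
    "neighborhood": "Brooklyn",
    "bedrooms": 2,
    "description": None,
}

# Generative Feedback Loop: the model writes back into the database.
if listing["description"] is None:
    listing["description"] = generate_description(listing)
    # Next: vectorize the synthetic description and upsert it into the index.
    print(listing["description"])
```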
Bob concludes with a funny saying of “There is an old saying in data, shit in / shit out, but now we have the opportunity to turn chicken shit into chicken salad”. I think this is right on the money to describe the opportunity of Generative Feedback Loops, say a student uploads their test into the database and the system grades it and synthesizes a lesson plan to help them improve… and so many other cases like this where the model takes some data and uses the combination of memory + planning + tools to extract new value from it.
LLM Schema Tuning
I have decided to move this section into the overview of Self-Driving Databases. As a quick recap, we have (1) correcting user mistakes such as using the wrong datatype for the UUID or uploading a data object with the incorrect schema (also related to the GFL case with the AirBnB listings). We then have (2) inferring the application the data is being used for based on the schema, the workload, and / or the data itself, and using the predicted application to make optimizations.
The last idea is quite exciting, but limited because the database is disconnected from the application. For example, if the Self-Driving DB sees a pattern of SELECT statements that suggests normalizing the columns into sub-tables, that will break the application code above. Andy presents a super interesting idea to connect this kind of optimization with the application code on top of the database, perhaps interfaced in IDEs such as VS Code.
The Opinion of the System
We then transition topics into a reminder of the systems we are dealing with: Machine Learning models are subjective. An LLM, or a vector embedding of an object, has an opinion, or bias, about the data it is presented with. This is one of the key themes of the podcast series in understanding “AI-Native Databases”.
Imagine a query to a Self-Driving Car, “drive me from A to B with the most beautiful scenery”. Well, it is now up to the model to determine what is beautiful.
As we give our AI-Native DB more autonomy, for example connecting it with tools such as Web Search and the ability to store the information it acquires back into the database, or Python executors and the ability to store code back into the database (similar to the Minecraft Voyager experiments), the DB develops an opinion on how it needs to be designed and structured to achieve these goals.
I would say understanding this aspect of “AI-Native DBs” is the most common topic in the series broadly, although not as central to this particular episode.
PyTorch DB — Moving the Data closer to the Model
This is another massive topic in the evolution of AI-Native Databases. I’m sure everyone reading this has heard of RAG. RAG describes bringing the model to the data. Alternatively, we can think of bringing the data to the model.
Bringing the data to the model typically means plugging it into the gradient descent training loops that produce the models. There are tons of great systems that orchestrate this. Further, active learning, where we perform vector search to prepare mini-batches of data around the highest-loss points from the last batch, is one of the most common applications of ANN search.
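A rough numpy sketch of that active-learning idea, assuming we already have dataset embeddings and per-example losses from the last mini-batch (brute-force search here; an ANN index in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))  # embeddings for the whole dataset
batch_ids = rng.choice(1000, size=32, replace=False)
losses = rng.random(32)                   # per-example losses from the last batch

# Take the hardest examples from the last batch...
hardest = embeddings[batch_ids[np.argsort(losses)[-4:]]]

# ...and retrieve their nearest neighbors as candidates for the next batch.
for query in hardest:
    distances = np.linalg.norm(embeddings - query, axis=1)
    neighbors = np.argsort(distances)[:8]
    print(neighbors)
```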
However, there is a new opportunity emerging with model editing techniques such as ROME, MEMIT, or GRACE. These techniques let you directly change the knowledge in the model such as editing LeBron James plays “basketball” to “baseball”. Can we imagine an LLM as an index of information that we would update in a similar way to inserting an id in a B+ tree or a vector in an HNSW graph?
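Purely as a thought experiment, here is a toy of that “model as index” analogy; `memit_edit` is a stand-in for a ROME/MEMIT-style routine, and the dict-backed model is obviously nothing like real weights:

```python
def memit_edit(model, subject, relation, new_object):
    """Stand-in for a MEMIT/ROME-style edit; a real implementation would
    locate and rewrite the weights storing this association."""
    model[(subject, relation)] = new_object  # toy: model as a fact store
    return model

class ModelIndex:
    """Treat the model like an index that supports inserts and updates."""
    def __init__(self, model):
        self.model = model

    def insert(self, subject, relation, new_object):
        # A B+ tree inserts a key; here an editing algorithm rewrites a fact.
        self.model = memit_edit(self.model, subject, relation, new_object)

index = ModelIndex({("LeBron James", "plays"): "basketball"})
index.insert("LeBron James", "plays", "baseball")
print(index.model)
```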
Another point to this is the storage of model weights. Andy describes how the Stable Diffusion model is generally 8 GB, such that we can manage it in memory, but larger models — not as much. I think one of the biggest opportunities in AI right now is inference acceleration, and the inference API world is certainly booming at the time of writing this, such as Anyscale and a plethora of new providers serving the Mixtral 8x7B model. I expect to also see a lot of great content explaining inference acceleration and memory management that may look reminiscent of database systems, caches, and virtual context replacement — perhaps more so with the mixture-of-experts architectures or the rise of knowledge distilled task-specific models. I am not yet sure how memory management will intersect with model editing algorithms such as MEMIT, but we will be paying attention at Weaviate!
Database APIs
The next topic is the design of Database APIs. Returning to the original question:
If you had to build a brand new database system from scratch knowing that it was going to be tuned by an AI or machine learning algorithm, what would you do differently?
I think the lowest hanging fruit is the thinking around designing REST status codes for LLMs rather than humans, or writing API documentation for LLMs. Deeper into the technical details of LLMs, I think we have seen two key themes here: JSON Function Calling and Structured Output Parsing.
JSON Function Calling describes interfacing functions to LLMs with a JSON that describes the name of the function, what it does, and its potential input arguments. This worked fairly well with LLMs off-the-shelf, but has been taken to the next level by fine-tuning LLMs on these datasets. JSON Function Calling is an important consideration when thinking about next generation API design. Having these JSONs that can be plugged into an LLM seems highly likely to play a major role in interfacing software with software powered by LLMs.
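For example, an OpenAI-style function schema for a hypothetical `search_podcasts` tool looks like this:

```python
# JSON Function Calling: the name, what it does, and typed input arguments.
search_podcasts = {
    "name": "search_podcasts",
    "description": "Search podcast episodes by topic.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Topic to search for."},
            "limit": {"type": "integer", "description": "Max results to return."},
        },
        "required": ["query"],
    },
}
```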
The next key topic is Structured Output Parsing. This is a major problem for LLMs whose output must follow a template to be passed into the next stage of a pipeline. For example, an LLM that reranks documents has to output something like a `List[int]` or a dictionary with keys `id` and `rank`. I am unfortunately not too caught up with the latest on this; my latest understanding was to force decoding (for example, only sampling tokens corresponding to numbers when an `int` datatype might be returned). I’m sure there is more out there now.
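A toy illustration of forced decoding over a six-token vocabulary, masking everything that can’t appear in an int (not how any real decoding library is implemented):

```python
import numpy as np

vocab = ["0", "1", "2", "the", "cat", ","]
digit_ids = [i for i, token in enumerate(vocab) if token.isdigit()]

logits = np.array([1.2, 0.3, 2.1, 5.0, 4.2, 0.9])  # raw model scores

# Mask every non-digit token, then decode greedily: the output must be a digit.
masked = np.full_like(logits, -np.inf)
masked[digit_ids] = logits[digit_ids]
print(vocab[int(np.argmax(masked))])  # prints "2" even though "the" scored higher
```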
We also discuss APIs with respect to how the data is returned; Andy argues that this is analogous to projections in SQL. We will return to this topic when discussing when Weaviate will add SQL support, gRPC support, GraphQL, and Gorillas.
Learning to operate Databases
We then transition into Andy’s experience teaching students about databases, beginning with understanding what the database system is trying to do for them. This is also related to educating people on the trade-offs of adding LLMs and ML to their applications, such as the “opinion” of the system as described above. A key takeaway of Andy’s DB course is to have a “BS Meter” when evaluating new DBs, which brings us into the DB hype cycle!
The Database Hype Cycle
We begin with the story of NoSQL systems, which were perhaps overly critical of the systems that came before them and then had to re-position themselves, learning that they actually did need to integrate the lessons of the relational DBs.
Andy describes how learning about the inner workings of databases helps you develop a “BS meter” for these new technologies. Fortunately for us, Vector DBs fly under Andy’s BS meter radar thanks to the novelty of the vector index and APIs for similarity search. Bob and Andy continue to discuss whether a bank would switch from Oracle to Weaviate and the focus on transactional processing; unfortunately, I need to do more research before I can comment on this topic.
SQL in Weaviate?
What might an SQL operator look like in Weaviate? Maybe something like this:
```sql
SELECT * FROM podcasts NEARTEXT "AI-Native Databases"
```
Even the Medium code editor knows this is SQL code, although we have made up the “NEARTEXT” operator.
As mentioned earlier, Text-to-SQL is one of the most common applications of LLMs, complete with San Francisco billboards. One of the most exciting papers this year has been the Gorilla LLM from Patil et al. Gorilla performs Text-to-ML Model API, and can be easily extended to other tools, such as Gorilla OpenFunctions. Gorilla OpenFunctions is a bit different from Text-to-SQL, returning to the JSON Function Calling approach we described earlier. Sitting between JSON Function Calling and SQL on this scale are GraphQL, REST, and gRPC APIs (I would then say direct code is the next level in the LLM tool use hierarchy, though of course there are levels to that as well…).
So does Weaviate need to implement an SQL parser to achieve an SQL API? Or can we use an LLM that is trained to map SQL-like queries to GraphQL, maybe with an intermediate step that extracts the intent of the user and then maps that to GraphQL, further offering an “inferred intent” option for debugging?
Andy mentions another product that I think will play an absolutely enormous role in the evolution of Text-to-SQL: CatSQL. CatSQL from Alibaba also gives you data you didn’t ask for in the original query, but that the system determines you may also be interested in. I was first exposed to this idea in Doris Lee’s excellent talk in the Stanford MLSys series describing Ponder. Further, I recently read a Medium article from researchers at Snowflake describing connecting their Text-to-SQL models to Streamlit visualizations. Something is definitely cooking in the end-to-end of how we query databases, as well as how the result is presented.
The Future of DBs
Ending with the greatest podcast question out there: What excites you the most?
Andy presents exciting directions for new hardware, poses the battle between transactional processing systems adding vector support and vector DBs adding transactional support, and highlights eBPF, putting database logic inside of the Linux kernel. All super exciting stuff!