Neon: A Serverless And Developer Friendly Postgres

Jul 8, 2024

Startups User Experience AI Business Product Management

Summary

The podcast episode explores Neon, a cloud-native, serverless version of Postgres designed to scale elastically and optimize the developer experience. Nikita Shamgunov, Neon’s founder, discusses the motivation behind creating Neon, emphasizing the need for a Postgres alternative that supports serverless cloud usage without sacrificing compatibility with Postgres’s extensive ecosystem of plugins and extensions. Neon achieves this by running an unmodified Postgres engine, intercepting its local storage calls, and redirecting them through a thin API to a custom-built, scalable networked storage layer, which separates storage from compute. This architecture supports zero-cost branching, allowing developers to quickly create isolated development and staging environments tied directly to GitHub PRs, which significantly accelerates modern development workflows.

Neon targets small, agile developer teams that lack dedicated DevOps resources, focusing on simplicity, automatic scaling, and minimal operational overhead. Key engineering innovations include running Postgres in dynamically resizable virtual machines with live migration to preserve stateful TCP connections, and providing a serverless driver alongside a connection pooler to overcome Postgres’s traditional connection limits, especially for edge environments like Cloudflare Workers. The episode also addresses challenges such as the latency introduced by network hops in decoupled architectures and the complexity of Postgres version upgrades, highlighting Neon’s ongoing work on automated upgrade tooling. Neon’s contributions to and support for AI-related features such as the pgvector extension underscore its commitment to AI capabilities on its platform.

Finally, the conversation situates Neon within the competitive landscape of cloud-native and open-source databases, explaining its cloud-only strategy and reflecting on lessons learned from other systems like SingleStore, with a focus on sustainable engineering and developer productivity.

Key Takeaways

1. Neon maintains full compatibility with Postgres by running the core engine unmodified and replacing only the local storage with a networked storage layer via a thin API that intercepts WAL and page requests, enabling serverless Postgres without breaking the extensive Postgres plugin ecosystem.
2. Focusing exclusively on a cloud-only, serverless Postgres product enables Neon to optimize user experience and scalability specifically for small, agile development teams that lack traditional DevOps resources, avoiding the dilution of effort found in hybrid cloud/on-prem strategies.
3. Neon's serverless implementation runs Postgres inside dynamically sized virtual machines with live migration to maintain stable, stateful TCP connections, addressing the challenge of scaling a traditional stateful database in a serverless environment.
4. Neon introduces a serverless Postgres driver and a connection pooler (PgBouncer) to overcome Postgres's well-known difficulty with large numbers of connections, and to serve edge environments like Cloudflare Workers that lack TCP support.
5. The separation of storage and compute enables Neon's zero-cost branching feature, which lets developers create instant, isolated copies of databases tied to GitHub pull requests, facilitating advanced development workflows and rapid feature testing.
6. Neon manages the latency impact of the network hops introduced by separating storage and compute through optimizations that build on existing Postgres synchronous replication paradigms, and uses the Paxos protocol within its Safekeepers component to ensure reliability without significant performance degradation.
7. Neon is actively developing automated and simplified Postgres upgrade processes, recognizing the historical complexity of Postgres upgrades involving multiple versions and storage format migrations, and drawing lessons from simpler, more reliable processes like those in SQL Server.
8. Neon contributes directly to Postgres’s open-source ecosystem, particularly supporting the pgvector extension for embedding and similarity search, demonstrating its commitment to AI workloads and cloud optimization while fostering community collaboration.
9. Neon's changes to Postgres have been kept minimal and modular via extensions and limited forking, enabling easier merging with future upstream versions and ensuring platform sustainability as Postgres anticipates native support for pluggable storage engines.
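
To make the branching idea in takeaway 5 concrete, here is a minimal copy-on-write sketch in Python. It is only an illustration, not Neon's implementation: the `Branch` class, its methods, and the page-dictionary model are all hypothetical. It shows why creating a branch can cost almost nothing: no pages are copied, and a branch stores only the pages it changes.

```python
class Branch:
    """Hypothetical copy-on-write branch: it stores only the pages it has changed."""

    def __init__(self, parent=None):
        self.parent = parent  # shared, read-only ancestor layer (or None)
        self.pages = {}       # page_id -> bytes written on this branch

    def write(self, page_id, data):
        # Writes never touch ancestor layers, so sibling branches stay isolated.
        self.pages[page_id] = data

    def read(self, page_id):
        # Walk up the ancestor chain until some layer holds the page.
        node = self
        while node is not None:
            if page_id in node.pages:
                return node.pages[page_id]
            node = node.parent
        raise KeyError(page_id)

    def branch(self):
        # Freeze the current state into a shared layer without copying page data,
        # then let this branch and the new one diverge on top of it.
        frozen = Branch(parent=self.parent)
        frozen.pages = self.pages
        self.parent = frozen
        self.pages = {}
        return Branch(parent=frozen)
```

In this model, `dev = main.branch()` returns immediately regardless of database size. Later writes to `main` are invisible to `dev` and vice versa, which mirrors the isolated dev and staging environments described above.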

Notable Quotes

"The ecosystem around Postgres is large and varied. A lot of different use cases that it's supporting, a massive number of different plugin types. I'm wondering if you can talk to some of the ways that you're thinking about what it means to be serverless for such a diverse ecosystem. What are some of the ways that you're trying to scope the applicability of Neon so that you don't have people coming to you and complaining that it doesn't do X, Y or Z because I'm trying to use these 15 different plugins. And some of the ways that you're trying to use these 15 different plugins and some of the ways that you're orienting towards that developer experience by removing the operational concerns. Well, I think there are two questions in one here."

"We run Postgres in ADM and we attach that Postgres into custom built storage. And that thing we built from scratch. So the integration point with Postgres with our storage goes on a relatively thin API. You know, at the end of the day, Postgres storage engine requests pages from disk, and then writes transaction log record called WAL on disk, and then uses WAL to update those pages both on disk and in memory. And that's precisely where we interjected. So we said, instead of writing a transaction log record on disk, send it over the network into our service. And then instead of requesting a page or reading a page from disk, read it from our service over an API call. And just as you see, that allows us to actually not change the engine."

"Every successful cloud product inside a hyperscaler has an open source alternative. Lots and lots of examples of that, right? For Redshift, that would probably be ClickHouse. Now, these days, that's one of the more popular open source products. But then, also, Redshift had an alternative, a cloud native alternative, which is Snowflake, and Snowflake frankly out executed Redshift. So unbundling a popular database service seems like a good idea. And I was also thinking about GitHub versus GitLab analogy, where GitHub is a cloud product, and GitLab is an open source product. And when you're an open source product, that gives you the right to exist. And I was like, nobody is building an alternative to Aurora. And that was strange to me."

""It's a gigantic pain to connect your application to the database. You know, certain things don't support TCP connections. So that's why we launched our serverless driver. Postgres doesn't do very well with lots of connections. And therefore there are systems like poolers, our PG balancer that allows you to scale number connections to Postgres. So part of the value is just packaging all of that and make it as stupid simple to consume, never run out of connections, and never do operations that you have to do with other systems, adding infrastructure to just the core database.""

"The user is a developer, and developers have lots of needs. We asked ourselves a question, why people use Neon versus AWS or Azure or GCS? And the answer was always kind of like, oh, it's easy to use, you push a button and you get it. I think the real answer is that small teams that need to move fast don't have the luxury of having DevOps. And if you're using Amazon, you need DevOps, right? Because Amazon is infrastructure; Amazon is not a developer platform. When you use GitHub, you don't need DevOps, right? You know, you as a developer consume that feels super native for you as a developer. But when you use EC2 or 200 services on AWS, you feel like it's Lego bricks on which you build your application. And it doesn't feel like this is built for developer as an end user consumption. So that aha moment was like, oh, this is what it means. Smaller teams can move faster because they don't have DevOps and they consume this directly. Once that clicked, we're realizing that those teams need more than just a database."

"So it's very quick. And then compute is just separate, right? So it's a different VM that runs Postgres and that's your compute. So that's the definition of separation of storage and compute and taking advantage of that architecture here. Now, in your developer environment, you can do whatever you want. You can change data, you can change schema, you can test performance, you can drop indexes, creating whatever. But then you want to roll these things forward into first staging environments and then eventually production environments."

"What we've discovered is that people don't really care about the changes in data. As a matter of fact, the data changes in the dev environment should not propagate it all the way to the production. But the application depends on the schema. So schema has to migrate forward. There are lots of tools that help you with schema migrations. Those are called ORMs. Things like Prisma, like Drizzle, TypeRM, and whatnot. And we're just plugging into that workflow."

""We have database previews, which is achieved through the technology with whole branching. We have the ability to create those previews based on every PR in GitHub. And now we're adding more and more features that would integrate with a deeper JavaScript ecosystem. So when you build apps and you need these systems like auth or payments or storage, that's also trivial to do on the end. So all of that kind of falls under the umbrella that you want to ship your applications faster. That's really the whole acceleration movement, which is mostly driven by AI, but really by developer productivity.""

"So the hop is already there. If you run Postgres on an EBS node, well, EBS node is network attached as well. So we're not really, actually we do, but like at the high level, it's roughly the same number of hops. While the reality, there's a Paxos protocol that we use for reliability when we send the log record into our service that's called Safekeepers. So that have multiple hops to persist the record in the Paxos protocol."

"The latencies fundamentally becoming, the latencies and throughput are becoming roughly the same. And roughly, there's still a bit of a haircut that we're taking on latencies. But in return, we're giving you infinite I.O. throughput, right? Because our storage is multi-tenant and, you know, we can request as many pages as you want. So that's the trade-off and specifically works super well for much larger databases. And for small databases, performance usually is not a problem."

"In SQL Server, you just restart. You like, shut down the old binary, start a new binary, point at the data location, and then it just upgrades on the spot in place. And the SQL Server team makes sure that the upgrades never fail. They kind of guarantee that this is the case. Here you have to do a bunch of dance to upgrade a Postgres instance, but we just treat it as a feature. By the way, we don't have that feature yet, but this feature is under development."

"We let people choose the version, the Postgres version. Today, I was actually advocating to not. I was saying, let's just run the latest version and upgrade ourselves. But then, we didn't have the upgrade feature for a while and we still don't have it. It's coming. So, we landed somewhere in between. When the new Postgres version shows up, the default Postgres that we spin up is the latest version. We don't upgrade automatically and let people choose up to two versions back."

""I'm wondering how you have had to approach the rework of that Postgres engine to minimize the footprint of your changes while maximizing the capabilities that you're enabling, and some of the ways that the scope and goals of your work on Postgres and Neon have changed from when you first came up with this vision of what you wanted to build to where you are today, where you have a real-world production system that people are using every day.""

"We plugged in at the page level and pages don't care about the version. So, that all works. I think there are benefits to just being on the latest version. I just lost that argument when we were introducing that feature. But we haven't been bitten by that much. Postgres is fairly disciplined and regimented in how it releases. It releases once a year. Not that much stuff changes."

""We contributed back to PG vector. So, so that was our experience from the business perspective. Oh, it's wonderful. It's wonderful that this thing is there. We obviously support it. We're contributors. We do things also, our architecture makes it better to run PG vector on Neon versus other platforms.""

Episode questions

What motivated Neon to focus exclusively on a cloud-only serverless Postgres rather than supporting on-premises deployments?

Nikita explains that they learned from SingleStore’s experience that supporting both on-prem and cloud workloads diluted focus and prevented them from competing strongly in cloud analytics. They observed that serverless cloud deployment requires optimizations that differ from on-prem and that focusing solely on cloud products enables a sharper user experience and scalability. They also found no existing open source alternative to Aurora, which inspired them to build Neon as a cloud-only, serverless, Postgres-compatible alternative. This strategy allows them to prioritize developer experience and cloud-native scalability.

How does Neon maintain Postgres compatibility given that Postgres does not have a pluggable storage engine?

Neon runs an unmodified Postgres engine but replaces the local disk storage with a networked storage layer behind a thin API. It intercepts Postgres’s storage calls, redirecting WAL records and page read requests over the network to Neon’s storage system. This approach keeps the full Postgres engine intact, preserving ecosystem compatibility and plugin functionality. While some surgical modifications were needed, the overall Postgres behavior and protocol remain unchanged, which is crucial because any break in compatibility leads to customer problems and loss of ecosystem support.
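
The redirection described in this answer can be pictured with a toy Python model. It is a sketch under stated assumptions, not Neon's actual API: `RemoteStorage`, `Compute`, `WalRecord`, and their methods are hypothetical names, the network is replaced by plain method calls, and rebuilding a page by replaying its WAL records is a simplification chosen for illustration.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8  # toy page size in bytes; real Postgres pages default to 8 kB


@dataclass
class WalRecord:
    lsn: int      # log sequence number, strictly increasing
    page_id: int
    offset: int   # byte offset of the change within the page
    data: bytes


@dataclass
class RemoteStorage:
    """Hypothetical stand-in for a networked storage service."""
    wal: list = field(default_factory=list)

    def append_wal(self, record: WalRecord) -> None:
        # In a real system this would be a network call, not a list append.
        self.wal.append(record)

    def get_page(self, page_id: int, lsn: int) -> bytes:
        # Rebuild the page by replaying its WAL records up to `lsn`.
        page = bytearray(PAGE_SIZE)
        for rec in self.wal:
            if rec.page_id == page_id and rec.lsn <= lsn:
                page[rec.offset:rec.offset + len(rec.data)] = rec.data
        return bytes(page)


class Compute:
    """Toy 'engine' whose storage calls go to the service instead of local disk."""

    def __init__(self, storage: RemoteStorage):
        self.storage = storage
        self.last_lsn = 0

    def write(self, page_id: int, offset: int, data: bytes) -> int:
        # Instead of writing the log record to local disk, send it to the service.
        self.last_lsn += 1
        self.storage.append_wal(WalRecord(self.last_lsn, page_id, offset, data))
        return self.last_lsn

    def read_page(self, page_id: int) -> bytes:
        # Instead of reading the page from local disk, request it from the service.
        return self.storage.get_page(page_id, self.last_lsn)
```

Because the compute side holds no durable state in this model, it can be restarted or resized while the storage service retains the full history, which is the property the separation of storage and compute relies on.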

Why does Neon see small developer teams as the ideal user base, and how does this influence their product design?

Nikita highlights that small teams need to move fast but often lack dedicated DevOps staff. Therefore, they want databases that just work without requiring deep operational knowledge or managing complex infrastructure. Neon is designed to be simple to use with a push-button serverless experience, eliminating manual provisioning or scaling. This design philosophy leads to a cloud-native, managed service that reduces time to value and minimizes operational complexity, fitting the needs of startups and agile teams that are prominent in today’s software development landscape.

What engineering challenges did Neon encounter when running Postgres in a serverless environment and how were they addressed?

Nikita described how Postgres is a stateful system that maintains TCP connections, which complicates serverless scaling. Neon runs Postgres inside virtual machines whose CPU and memory are adjusted dynamically. To maintain connection stability during resource changes or VM relocations, Neon uses VM live migration technology. This approach also necessitates strict security boundaries and internal VM management expertise. The overall solution allows the serverless scaling benefits without interrupting database sessions or violating Postgres’ stateful requirements.
