Can you just ask your database a question?

When Google shut down Universal Analytics in July 2024, historical analytics data for a lot of teams landed in a PostgreSQL database with no clean way to use it. The standard fix is to connect it to a visualization tool like Tableau or Google Data Studio. That works, but it means setting up environments, learning dashboards, and knowing enough about the database structure to build useful queries. Most people on the marketing team don’t want to do any of that. They just want to ask a question and get an answer.

We wondered whether an LLM could sit in the middle and handle the translation. Type a question in plain English, get a real answer back from the database.


Our hypothesis: if we gave an LLM enough context about the database, it could generate accurate SQL queries from natural language questions and return useful answers, without the user needing to know anything about the underlying tables or schema.


We built a chat interface connected directly to the Postgres database. When a user submits a question, the system runs through four steps: identify which tables are relevant, generate a SQL query, validate and run it, then summarize the results in plain language. Each step is a separate LLM call or function, which made it easier to debug and improve each one independently.
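The four steps above can be sketched as a simple chain. The function and class names here are illustrative, not our actual implementation; each step stands in for what is really an LLM call or a database round trip.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    tables: list = field(default_factory=list)
    sql: str = ""
    rows: list = field(default_factory=list)
    answer: str = ""

def select_tables(question: str) -> list:
    # Step 1: LLM call that picks relevant tables (stubbed here).
    return ["sessions"] if "session" in question.lower() else ["pages"]

def generate_sql(question: str, tables: list) -> str:
    # Step 2: LLM call that writes the PostgreSQL query (stubbed here).
    return f'SELECT COUNT(*) FROM "{tables[0]}"'

def validate_and_run(sql: str) -> list:
    # Step 3: validation checks, then execution against Postgres (stubbed here).
    assert sql.strip().upper().startswith("SELECT"), "read-only queries only"
    return [(42,)]

def summarize(question: str, sql: str, rows: list) -> str:
    # Step 4: LLM call that phrases the result in plain language (stubbed here).
    return f"The answer to '{question}' is {rows[0][0]}."

def answer_question(question: str) -> PipelineResult:
    # Each stage is a separate call, so each can be debugged and tuned alone.
    r = PipelineResult()
    r.tables = select_tables(question)
    r.sql = generate_sql(question, r.tables)
    r.rows = validate_and_run(r.sql)
    r.answer = summarize(question, r.sql, r.rows)
    return r
```

Keeping the stages as separate functions is what made independent debugging possible: a bad answer can be traced to the specific stage that produced it.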

The hardest part was getting the model to reliably generate correct PostgreSQL. SQL looks deceptively simple but has sharp edges: column names need double quotes, type casting has to be explicit, and a query that’s logically wrong can still run without errors and return plausible-looking garbage. We addressed this with a detailed prefix prompt covering 20+ specific guidelines for PostgreSQL syntax, column quoting, data type handling, and logical validation. We also migrated from GPT-3.5 to GPT-4o and set temperature to 0.2, trading some creativity for more consistent, near-deterministic outputs.
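A condensed sketch of what such a prefix prompt looks like. The wording below is illustrative, not our production prompt (which had 20+ guidelines), and `build_sql_prompt` is a hypothetical helper showing how the guidelines, schema, and question fit together.

```python
SQL_GUIDELINES = """You are an expert PostgreSQL analyst. Follow these rules:
1. Double-quote all column and table names: SELECT "pagePath" FROM "pages".
2. Cast types explicitly, e.g. "eventDate"::date, "revenue"::numeric.
3. Use PostgreSQL syntax only: no backticks, no MySQL-style LIMIT x, y.
4. Return exactly one SELECT statement and nothing else.
5. Before answering, check that every column you reference exists in the schema."""

def build_sql_prompt(question: str, table_schemas: dict) -> str:
    # Assemble guidelines + schema + question into one prompt.
    schema_text = "\n".join(f"- {table}: {cols}" for table, cols in table_schemas.items())
    return f"{SQL_GUIDELINES}\n\nAvailable tables:\n{schema_text}\n\nQuestion: {question}\nSQL:"

# Request settings that trade creativity for consistency.
MODEL_SETTINGS = {"model": "gpt-4o", "temperature": 0.2}
```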

Even with a strong prompt, the model still made mistakes. So we added a dedicated validation layer that runs 9 checks on every generated query before it touches the database, plus an automatic retry mechanism that catches SQL errors and asks the model to correct itself, up to three attempts. This caught a meaningful number of failures that would otherwise have returned wrong answers silently.
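The shape of that validate-then-retry loop, as a sketch. `run_query` and `fix_sql` are hypothetical callables standing in for database execution and the LLM self-correction call; `basic_checks` shows only a few of the checks (the real layer ran nine, and was more careful than this naive keyword scan).

```python
MAX_ATTEMPTS = 3

def basic_checks(sql: str) -> None:
    # A few pre-execution checks, simplified for illustration.
    upper = sql.strip().upper()
    if not upper.startswith("SELECT"):
        raise ValueError("only SELECT statements are allowed")
    if ";" in sql.strip().rstrip(";"):
        raise ValueError("multiple statements are not allowed")
    for banned in ("DROP", "DELETE", "INSERT", "TRUNCATE"):
        if banned in upper:
            raise ValueError(f"forbidden keyword: {banned}")

def run_with_retries(sql, run_query, fix_sql):
    # Validate, execute, and on any failure ask the model to repair the
    # query given the error message -- up to MAX_ATTEMPTS times.
    last_error = None
    for _ in range(MAX_ATTEMPTS):
        try:
            basic_checks(sql)
            return run_query(sql)
        except Exception as e:
            last_error = e
            sql = fix_sql(sql, str(e))  # LLM call: "here is the error, fix the SQL"
    raise RuntimeError(f"query failed after {MAX_ATTEMPTS} attempts: {last_error}")
```

Feeding the actual error message back to the model is what makes the retry useful: a bare "try again" rarely fixes anything, but the Postgres error text usually points at the exact broken identifier.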

For few-shot example selection, we built a library of 60+ hand-crafted query examples and used FAISS embeddings to find the 5 most semantically similar examples to include with each new question. This was more effective than writing longer instructions. The model consistently performed better when it could see worked examples close to the question it was being asked.
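The selection step looks roughly like this. The real system embedded the examples with a neural embedding model and searched them with FAISS; here `embed` is a toy bag-of-words stand-in so the ranking logic runs without external dependencies, and the example questions and SQL are invented for illustration.

```python
import math
from collections import Counter

EXAMPLES = [
    {"question": "How many sessions came from paid search last month?",
     "sql": 'SELECT COUNT(*) FROM "sessions" WHERE "channel" = \'Paid Search\''},
    {"question": "Which pages had the most traffic in Q4?",
     "sql": 'SELECT "pagePath", SUM("pageviews") FROM "pages" GROUP BY 1 ORDER BY 2 DESC'},
    {"question": "What was total revenue by month?",
     "sql": 'SELECT date_trunc(\'month\', "date"), SUM("revenue") FROM "transactions" GROUP BY 1'},
]

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. The real system used dense vectors + FAISS.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_examples(question: str, examples: list, k: int = 5) -> list:
    # Rank the library by similarity to the incoming question, keep the top k.
    q = embed(question)
    return sorted(examples, key=lambda ex: cosine(q, embed(ex["question"])),
                  reverse=True)[:k]
```

The selected examples then go into the SQL-generation prompt alongside the guidelines, so the model always sees worked question-to-query pairs that resemble the question at hand.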

Table selection was its own problem. The database has 40+ tables, and passing all of them into every prompt was expensive and made the model less accurate. We solved this with a two-step process: first ask the model to identify which of 20 logical table categories are relevant to the question, then map those categories to the actual table names. This kept each prompt focused and cut down on noise.
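The two-step routing can be sketched as follows. The category names, keyword rules, and table mapping are all invented for illustration (the real system had 20 categories over 40+ tables, and step 1 was an LLM call rather than keyword matching).

```python
# Step 2's mapping: logical categories -> concrete table names (illustrative).
CATEGORY_TO_TABLES = {
    "traffic": ["sessions", "pageviews"],
    "acquisition": ["campaigns", "channels"],
    "ecommerce": ["transactions", "products"],
}

def pick_categories(question: str) -> list:
    # Step 1: decide which broad categories the question touches.
    # An LLM call in the real system; keyword matching stands in here.
    q = question.lower()
    hits = []
    if any(w in q for w in ("session", "pageview", "traffic")):
        hits.append("traffic")
    if any(w in q for w in ("campaign", "paid", "channel")):
        hits.append("acquisition")
    if any(w in q for w in ("revenue", "purchase", "product")):
        hits.append("ecommerce")
    return hits

def tables_for(question: str) -> list:
    # Only the tables behind the chosen categories reach the SQL prompt.
    return [t for c in pick_categories(question) for t in CATEGORY_TO_TABLES[c]]
```

The payoff is that the SQL-generation prompt only ever sees a handful of table schemas instead of all 40+, which is both cheaper and less distracting for the model.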

The final step, turning SQL results into a plain-language answer, is a separate LLM call that takes the original question, the query, and the results together. Keeping this as its own chain meant we could tune it independently from the query generation logic.
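The summarization prompt is the simplest of the chains. The exact wording below is an assumption; the point it illustrates is that all three pieces of context (question, query, results) go into the call together.

```python
def build_summary_prompt(question: str, sql: str, rows: list) -> str:
    # Hypothetical prompt builder for the final LLM call.
    return (
        "Answer the user's question in plain language, using only these results.\n"
        f"Question: {question}\n"
        f"SQL used: {sql}\n"
        f"Results: {rows}\n"
    )
```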


An LLM is a translator, and translators make mistakes in ways that are hard to catch. The model’s job was to figure out what kind of answer was expected and where in the database to find it. When it got either one wrong, it didn’t crash or throw an error. It queried a non-existent column, made up a table name, or returned a number that was technically an answer to something, just not the question that was asked. Nothing signaled to the user that anything went wrong. That’s a harder failure mode to design around than a system that breaks loudly.

The question type matters more than the question length. Simple, direct lookups worked well: “how many sessions came from paid search last March,” “what pages had the most traffic in Q4.” Multi-step questions requiring several independent queries and aggregation were unreliable. Ambiguous questions like “what are the best-performing pages” were the worst, because “best performing” can mean a dozen different things and the model picks one without telling you it made a choice.

Good examples beat good instructions. The most effective improvement we made wasn’t writing better prompts, it was building a library of 60+ concrete examples and using semantic search to select the 5 most similar ones dynamically based on the user’s question. The model learned more from seeing relevant worked examples than from detailed written instructions about what to do. For teams building similar tools, investing in a strong example library early pays off faster than trying to engineer the perfect system prompt.

Controlling what the model sees is as important as what you tell it. With 40+ tables in the database, passing everything into every prompt made the model slower and less accurate. The fix was a two-step process: first have the model identify which broad categories of tables were relevant, then pass only those tables into the next step. Narrowing the model’s view at each stage, rather than giving it everything and hoping it would focus, made a real difference. This is a pattern worth carrying into any project where the data is large or complex.

Context engineering is the actual work. We couldn’t point the model at the database and hope. The bulk of our time went into deciding what information to pass in, how to structure it, and which examples to include. Any team building something like this should expect that work to be ongoing. The good news is the approach generalizes: the same pattern works for PDFs, spreadsheets, or any other structured data source. What changes is the upfront analysis, understanding what data is available, organizing it so the model can navigate it, and mapping it to the kinds of questions users are likely to ask. That work determines whether the system is useful or just impressive in a demo.