We need a dataset of (NL, SQL) pairs that we can use to test the system's accuracy. Diversity matters: the set should represent the kinds of inputs actual users will query. Later we will fold in real user queries - but "good enough" is the rule of the day here. General guidance is 50-100 manually curated records, which we can then use to bootstrap a few hundred "ground truth" records with some clever LLM usage.
We can start with just a list of questions, then use a general-purpose LLM like Claude/ChatGPT4o to draft the SQL. We'll need to edit and correct the LLM-generated queries - there's no substitute for looking at the data - but it's a nice way to leverage AI to speed things up.
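A minimal sketch of that bootstrapping step, assuming the Anthropic Python SDK; the file names, model alias, and prompt text are placeholders, and every drafted query still gets reviewed by hand:

```python
# Draft candidate SQL for a list of questions with an LLM, writing (NL, SQL)
# pairs to a CSV for manual review. Assumes the Anthropic Python SDK
# (`pip install anthropic`) with ANTHROPIC_API_KEY set; questions.csv (one
# question per row) and the schema description are placeholders.
import csv
import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "You are a SQL expert. Reply with a single SQL query and nothing else. "
    "Schema: <paste your CREATE TABLE statements here>"
)

def draft_sql(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model alias
        max_tokens=512,
        system=SYSTEM,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text.strip()

with open("questions.csv", newline="") as f_in, \
     open("draft_pairs.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["nl", "sql"])
    for (question,) in csv.reader(f_in):  # one question per row
        writer.writerow([question, draft_sql(question)])  # then correct by hand
```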
None of this requires special infrastructure: Python and CSVs for now (a rough sketch of the CSV plumbing follows below). We want to minimize the code we write directly without an AI, so the work to do is writing prompts, then testing and integrating the results against the following milestones.
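To make the "Python and CSVs" point concrete, here is one possible shape for the plumbing; the column layout and file names (`ground_truth.csv`, `runs.csv`) are assumptions, not a fixed spec:

```python
# Append one row per test case to a run log CSV: the NL input, the generated
# SQL, and the query results -- nothing fancier than the csv module.
# File names and columns are placeholders.
import csv
from datetime import datetime, timezone

def log_test_case(nl: str, sql: str, rows: list[tuple], path: str = "runs.csv") -> None:
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),  # when the test ran
            nl,                                      # natural-language input
            sql,                                     # generated SQL
            repr(rows),                              # query results, stringified
        ])

def load_ground_truth(path: str = "ground_truth.csv") -> list[dict]:
    with open(path, newline="") as f:
        return list(csv.DictReader(f))  # expects columns: nl, sql
```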
Milestones
- [Evals] 10x (NL, SQL) pairs
- [Evals] 100x (NL, SQL) pairs
- [Evals] Metrics / Measurement Plan
- [Infra] Can run queries against DB
- [Infra] Log test runs
- [Infra] Log NL (input), SQL (output), Data (Query Results) for test cases
- [Test Code] calculate metrics for NL2SQL records
- [RAG] Vector DB with certified NL2SQL records
- [Prompt] Few-shot prompting via RAG retrieval (see the sketch after this list)
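One way the [RAG] and [Prompt] milestones could look, sketched with chromadb's default local embeddings; the collection name, prompt format, and `k` are assumptions:

```python
# Index certified (NL, SQL) pairs in a local vector store, then retrieve the
# k most similar ones and splice them into a few-shot prompt. Uses chromadb's
# default embedding function; names and prompt wording are placeholders.
import chromadb

client = chromadb.Client()
examples = client.get_or_create_collection("certified_nl2sql")

def index_certified(pairs: list[dict]) -> None:
    examples.add(
        ids=[str(i) for i in range(len(pairs))],
        documents=[p["nl"] for p in pairs],            # embed the NL question
        metadatas=[{"sql": p["sql"]} for p in pairs],  # carry the certified SQL
    )

def few_shot_prompt(question: str, k: int = 3) -> str:
    hits = examples.query(query_texts=[question], n_results=k)
    shots = [
        f"Q: {nl}\nSQL: {meta['sql']}"
        for nl, meta in zip(hits["documents"][0], hits["metadatas"][0])
    ]
    return "\n\n".join(shots) + f"\n\nQ: {question}\nSQL:"
```

At query time, `few_shot_prompt(question)` becomes the prompt body sent to the model, with the certified pairs serving as in-context examples.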
At this point, we have a basic set of metrics to evaluate NL2SQL accuracy: Does the generated SQL lint? Does it execute without errors? These tests are simple, but for bootstrapping, simple is good. We'll add sophistication later - v0 is about building the feedback loop that tells us whether prompting, RAG, or fine-tuning actually improves system performance.
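A sketch of those two v0 checks, assuming sqlglot for linting and a local SQLite test database; swap in your own dialect and driver as needed:

```python
# The two v0 metrics: does the SQL parse ("lint"), and does it execute
# without errors against the test database? sqlglot and SQLite are
# assumptions here.
import sqlite3
import sqlglot
from sqlglot.errors import SqlglotError

def lints(sql: str, dialect: str = "sqlite") -> bool:
    """Does the generated SQL parse under the target dialect?"""
    try:
        sqlglot.parse_one(sql, read=dialect)
        return True
    except SqlglotError:
        return False

def executes(sql: str, db_path: str = "test.db") -> bool:
    """Does the generated SQL run without errors against the test database?"""
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False
```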
Future Directions
- Increase accuracy: more training data, DSPy integration to auto-optimize prompts and RAG approaches, and richer evaluation metrics.
- Integrate a user interface: users write a natural-language query and see results from the live database.
Baselining Performance
Using a base model, without in-context learning (no RAG!) and without prompt engineering: how good is performance on our test data? How do different models affect performance?
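A rough harness for that baseline, reusing the `lints`/`executes` checks sketched earlier; `generate_sql` is a stand-in for whatever model client is in play, called with a bare prompt and no retrieved examples:

```python
# Baseline: each candidate model answers every test case with a bare prompt
# (no RAG, no prompt engineering), and we record how often the output lints
# and executes. lints()/executes() come from the metrics sketch above.
def generate_sql(model: str, nl: str) -> str:
    # Stand-in: call your model of choice here with just the question.
    raise NotImplementedError

def baseline(models: list[str], test_set: list[dict]) -> dict[str, dict]:
    report = {}
    for model in models:
        lint_ok = exec_ok = 0
        for case in test_set:
            sql = generate_sql(model, case["nl"])
            lint_ok += lints(sql)
            exec_ok += executes(sql)
        n = len(test_set)
        report[model] = {"lint_rate": lint_ok / n, "exec_rate": exec_ok / n}
    return report
```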
```mermaid
%%{ init: { 'theme': 'base', 'themeVariables': { 'primaryColor': '#000', 'primaryTextColor': '#fff', 'primaryBorderColor': '#7C0000', 'lineColor': '#000', 'secondaryColor': '#006100', 'tertiaryColor': '#fff', 'fontSize' : '24px' } } }%%
graph TD
    A[Data Collection] --> B[Ground Truth Data]
    A --> C[Synthetic Data]
    D[NL2SQL Dataset] --> E[Test Suite]
    B --> D
    C --> D
    F[Eval Metrics] --> E
    E --> G[Performance Benchmarks]
    style G fill: green;
```