We need to create a dataset of (NL, SQL) pairs that we can use to test the system’s accuracy. Diversity is good, so the set should be representative of the questions actual users will ask. Later, we will include real user queries - but “good enough” is the rule of the day here. General guidance is 50-100 manually curated records, which we can then use to bootstrap a few hundred “Ground Truths” with some clever LLM usage.
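For the record format, a minimal sketch using stdlib csv - the column names, the example row, and the table/field names in the sample SQL are illustrative assumptions, not a fixed schema:

```python
import csv

# Assumed columns - not a fixed schema: nl (question), sql (query),
# verified (has a human checked it?), source (manual vs. llm-bootstrapped).
FIELDS = ["nl", "sql", "verified", "source"]

rows = [
    {
        "nl": "How many orders shipped last month?",
        "sql": "SELECT COUNT(*) FROM orders WHERE shipped_at >= date('now', 'start of month', '-1 month');",
        "verified": "true",
        "source": "manual",
    },
]

with open("nl2sql_pairs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```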

We can start with just a list of questions, then use a general-purpose LLM like Claude or GPT-4o to draft the SQL. We’ll need to edit and correct the LLM-generated queries - there’s no substitute for looking at the data - but it’s a nice way to leverage AI to speed things up.
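A rough sketch of that drafting loop, assuming the OpenAI Python client, an OPENAI_API_KEY in the environment, and a schema.sql file whose DDL gets pasted into the prompt (the model name and prompt wording are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCHEMA = open("schema.sql").read()  # assumed file: DDL pasted into the prompt for grounding

def draft_sql(question: str, model: str = "gpt-4o") -> str:
    """Ask the model for a first-pass SQL query; a human reviews and corrects every draft."""
    prompt = (
        f"Database schema:\n{SCHEMA}\n\n"
        f"Write a single SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    resp = client.chat.completions.create(
        model=model,  # placeholder; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

questions = ["How many orders shipped last month?"]
drafts = {q: draft_sql(q) for q in questions}  # review and correct these by hand
```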

None of this requires special infrastructure: Python and CSVs for now. We want to minimize the code we write directly (without an AI), so the work to do is creating prompts and testing/integrating results to ensure they meet the milestones:

Milestones

  1. [Evals] 10x (NL, SQL) pairs
  2. [Evals] 100x (NL, SQL) pairs
  3. [Evals] Metrics / Measurement Plan
  4. [Infra] Can run queries against DB
  5. [Infra] Log test runs
  6. [Infra] Log NL (input), SQL (output), Data (Query Results) for test cases (see the harness sketch after this list)
  7. [Test Code] Calculate metrics for NL2SQL records
  8. [RAG] Vector DB with certified NL2SQL records
  9. [Prompt] Few-shot prompting via RAG retrieval (see the retrieval sketch below)
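For milestones 4-6, a minimal harness sketch - it assumes a SQLite file (warehouse.db) and a JSONL log (test_runs.jsonl), both stand-ins for whatever DB and log store we actually use:

```python
import json, sqlite3, time

def run_and_log(nl: str, sql: str, db_path: str = "warehouse.db",
                log_path: str = "test_runs.jsonl") -> dict:
    """Run generated SQL against the DB and log NL, SQL, query results, and any error."""
    record = {"ts": time.time(), "nl": nl, "sql": sql, "rows": None, "error": None}
    try:
        with sqlite3.connect(db_path) as conn:
            record["rows"] = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        record["error"] = str(exc)
    with open(log_path, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return record
```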

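For milestones 8-9, the idea is to embed the certified (NL, SQL) records, retrieve the questions nearest to a new input, and paste them into the prompt as few-shot examples. The sketch below uses a hashed bag-of-words embed() stand-in and plain cosine similarity in place of a real embedding model and vector DB:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding (hashed bag of words); replace with a real embedding model / vector DB."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    return vec

# Certified (NL, SQL) records; in practice these come from the curated CSV.
certified = [
    ("How many users signed up last week?",
     "SELECT COUNT(*) FROM users WHERE created_at >= date('now', '-7 days');"),
]
index = [(nl, sql, embed(nl)) for nl, sql in certified]

def few_shot_examples(question: str, k: int = 3) -> str:
    """Retrieve the k most similar certified pairs and format them as few-shot prompt examples."""
    q = embed(question)
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    top = sorted(index, key=lambda item: cosine(item[2]), reverse=True)[:k]
    return "\n\n".join(f"Q: {nl}\nSQL: {sql}" for nl, sql, _ in top)
```
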
At this point, we have a basic set of metrics to evaluate the accuracy of NL2SQL: does the generated SQL lint? Does it execute without errors? These tests are simple, but for bootstrapping, simple is good. We’ll add sophistication later - v0 is about building the feedback loop that tells us whether prompt, RAG, or fine-tuning changes actually improve system performance.
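A sketch of those two checks - it assumes sqlglot for the lint/parse step (SQLite dialect as a placeholder) and reuses the run_and_log() harness sketched above for execution:

```python
import sqlglot
from sqlglot.errors import ParseError

def sql_lints(sql: str, dialect: str = "sqlite") -> bool:
    """Lint check: does the generated SQL at least parse for the target dialect?"""
    try:
        sqlglot.parse_one(sql, read=dialect)
        return True
    except ParseError:
        return False

def score(nl: str, sql: str) -> dict:
    """v0 metrics per test case: lints, and executes without errors."""
    result = run_and_log(nl, sql)  # harness sketched under the milestones above
    return {"lints": sql_lints(sql), "executes": result["error"] is None}
```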

Future Directions

  1. Increase accuracy: more training data, integrating DSPy to auto-optimize prompts and RAG approaches (see the sketch after this list), and enhanced evaluation metrics.
  2. Integrate a user interface: users write a natural-language query and see results run against the live DB.
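A hedged sketch of what the DSPy piece could look like - class names follow recent DSPy releases and may differ by version; the model name is a placeholder, and the metric reuses score() and certified from the sketches above:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

class NL2SQL(dspy.Signature):
    """Translate a natural-language question into a SQL query."""
    question = dspy.InputField()
    sql = dspy.OutputField()

program = dspy.ChainOfThought(NL2SQL)

def executes(example, pred, trace=None):
    # Reuse the v0 metric: the generated SQL must execute without errors.
    return score(example.question, pred.sql)["executes"]

trainset = [dspy.Example(question=nl, sql=sql).with_inputs("question")
            for nl, sql in certified]  # certified pairs from the retrieval sketch
optimized = BootstrapFewShot(metric=executes).compile(program, trainset=trainset)
```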

Baselining Performance

Using a base model, without in-context learning (no RAG!) and without prompt engineering: how good is performance on our test data? How do different models impact performance?
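A sketch of that baseline run, looping over candidate models and the test set with a bare question in and SQL out; it reuses draft_sql() and score() from the sketches above, and the model names are placeholders:

```python
import csv
from collections import Counter

MODELS = ["gpt-4o", "gpt-4o-mini"]  # placeholder list of candidate models

def baseline(models=MODELS, test_path: str = "nl2sql_pairs.csv") -> dict:
    """Score each model on the test set with no RAG and no prompt engineering."""
    with open(test_path, newline="") as f:
        cases = list(csv.DictReader(f))
    results = {}
    for model in models:
        tally = Counter()
        for case in cases:
            sql = draft_sql(case["nl"], model=model)  # bare question in, SQL out
            metrics = score(case["nl"], sql)
            tally["lints"] += int(metrics["lints"])
            tally["executes"] += int(metrics["executes"])
        results[model] = {k: v / len(cases) for k, v in tally.items()}
    return results
```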

```mermaid
%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#000',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#7C0000',
      'lineColor': '#000',
      'secondaryColor': '#006100',
      'tertiaryColor': '#fff',
      'fontSize' : '24px'
    }
  }
}%%
graph TD
    A[Data Collection] --> B[Ground Truth Data]
    A --> C[Synthetic Data]

    D[NL2SQL Dataset]  --> E[Test Suite]

    B --> D
    C --> D

    F[Eval Metrics] --> E
    E --> G[Performance Benchmarks]

    style G fill:green
```