We need to create a dataset of (NL, SQL) pairs that we can use to test the system’s accuracy. Diversity is good, so the set should be representative of the questions actual users will ask. Later, we will include real user queries - but “good enough” is the rule of the day here. General guidance is 50-100 manually curated records, which we can then use to bootstrap a few hundred “Ground Truths” with some clever LLM usage.
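For the record format, a minimal sketch using stdlib csv - the column names, the example row, and the table/field names in the sample SQL are illustrative assumptions, not a fixed schema:

```python
import csv

# Assumed columns - not a fixed schema: nl (question), sql (query),
# verified (has a human checked it?), source (manual vs. llm-bootstrapped).
FIELDS = ["nl", "sql", "verified", "source"]

rows = [
    {
        "nl": "How many orders shipped last month?",
        "sql": "SELECT COUNT(*) FROM orders WHERE shipped_at >= date('now', 'start of month', '-1 month');",
        "verified": "true",
        "source": "manual",
    },
]

with open("nl2sql_pairs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```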

We can start with just a list of questions, then use a general-purpose LLM like Claude or GPT-4o to draft the SQL. We’ll need to edit and correct the LLM-generated queries - there’s no substitute for looking at the data - but it’s a nice way to leverage AI to speed things up.
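A rough sketch of that drafting loop, assuming the OpenAI Python client, an OPENAI_API_KEY in the environment, and a schema.sql file whose DDL gets pasted into the prompt (the model name and prompt wording are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCHEMA = open("schema.sql").read()  # assumed file: DDL pasted into the prompt for grounding

def draft_sql(question: str, model: str = "gpt-4o") -> str:
    """Ask the model for a first-pass SQL query; a human reviews and corrects every draft."""
    prompt = (
        f"Database schema:\n{SCHEMA}\n\n"
        f"Write a single SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    resp = client.chat.completions.create(
        model=model,  # placeholder; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

questions = ["How many orders shipped last month?"]
drafts = {q: draft_sql(q) for q in questions}  # review and correct these by hand
```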

None of this requires special infrastructure: Python and CSVs for now. We want to minimize the code we write directly (without an AI), so the work to do is creating prompts and testing/integrating results to ensure they meet the milestones:

Milestones

  1. [Evals] 10x (NL, SQL) pairs
  2. [Evals] 100x (NL, SQL) pairs
  3. [Evals] Metrics / Measurement Plan
  4. [Infra] Can run queries against DB
  5. [Infra] Log test runs
  6. [Infra] Log NL (input), SQL (output), Data (Query Results) for test cases (see the harness sketch after this list)
  7. [Test Code] Calculate metrics for NL2SQL records
  8. [RAG] Vector DB with certified NL2SQL records
  9. [Prompt] Few-shot prompting via RAG retrieval (see the retrieval sketch below)
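For milestones 4-6, a minimal harness sketch - it assumes a SQLite file (warehouse.db) and a JSONL log (test_runs.jsonl), both stand-ins for whatever DB and log store we actually use:

```python
import json, sqlite3, time

def run_and_log(nl: str, sql: str, db_path: str = "warehouse.db",
                log_path: str = "test_runs.jsonl") -> dict:
    """Run generated SQL against the DB and log NL, SQL, query results, and any error."""
    record = {"ts": time.time(), "nl": nl, "sql": sql, "rows": None, "error": None}
    try:
        with sqlite3.connect(db_path) as conn:
            record["rows"] = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        record["error"] = str(exc)
    with open(log_path, "a") as f:
        f.write(json.dumps(record, default=str) + "\n")
    return record
```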

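For milestones 8-9, the idea is to embed the certified (NL, SQL) records, retrieve the questions nearest to a new input, and paste them into the prompt as few-shot examples. The sketch below uses a hashed bag-of-words embed() stand-in and plain cosine similarity in place of a real embedding model and vector DB:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding (hashed bag of words); replace with a real embedding model / vector DB."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    return vec

# Certified (NL, SQL) records; in practice these come from the curated CSV.
certified = [
    ("How many users signed up last week?",
     "SELECT COUNT(*) FROM users WHERE created_at >= date('now', '-7 days');"),
]
index = [(nl, sql, embed(nl)) for nl, sql in certified]

def few_shot_examples(question: str, k: int = 3) -> str:
    """Retrieve the k most similar certified pairs and format them as few-shot prompt examples."""
    q = embed(question)
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    top = sorted(index, key=lambda item: cosine(item[2]), reverse=True)[:k]
    return "\n\n".join(f"Q: {nl}\nSQL: {sql}" for nl, sql, _ in top)
```
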
At this point, we have a basic set of metrics to evaluate the accuracy of NL2SQL: does the generated SQL lint? Does it execute without errors? These tests are simple, but for bootstrapping, simple is good. We’ll add sophistication later - v0 is about building the feedback loop that tells us whether prompt, RAG, or fine-tuning changes actually improve system performance.
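A sketch of those two checks - it assumes sqlglot for the lint/parse step (SQLite dialect as a placeholder) and reuses the run_and_log() harness sketched above for execution:

```python
import sqlglot
from sqlglot.errors import ParseError

def sql_lints(sql: str, dialect: str = "sqlite") -> bool:
    """Lint check: does the generated SQL at least parse for the target dialect?"""
    try:
        sqlglot.parse_one(sql, read=dialect)
        return True
    except ParseError:
        return False

def score(nl: str, sql: str) -> dict:
    """v0 metrics per test case: lints, and executes without errors."""
    result = run_and_log(nl, sql)  # harness sketched under the milestones above
    return {"lints": sql_lints(sql), "executes": result["error"] is None}
```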

Future Directions

  1. Increase accuracy: more training data, integrating DSPy to auto-optimize prompts and RAG approaches (see the sketch after this list), and enhanced evaluation metrics.
  2. Integrate a user interface: users write a natural-language query and see results run against the live DB.
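A hedged sketch of what the DSPy piece could look like - class names follow recent DSPy releases and may differ by version; the model name is a placeholder, and the metric reuses score() and certified from the sketches above:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

class NL2SQL(dspy.Signature):
    """Translate a natural-language question into a SQL query."""
    question = dspy.InputField()
    sql = dspy.OutputField()

program = dspy.ChainOfThought(NL2SQL)

def executes(example, pred, trace=None):
    # Reuse the v0 metric: the generated SQL must execute without errors.
    return score(example.question, pred.sql)["executes"]

trainset = [dspy.Example(question=nl, sql=sql).with_inputs("question")
            for nl, sql in certified]  # certified pairs from the retrieval sketch
optimized = BootstrapFewShot(metric=executes).compile(program, trainset=trainset)
```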

Baselining Performance

Using a base model, without in-context learning (no RAG!) and without prompt engineering: how good is performance on our test data? How do different models impact performance?
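A sketch of that baseline run, looping over candidate models and the test set with a bare question in and SQL out; it reuses draft_sql() and score() from the sketches above, and the model names are placeholders:

```python
import csv
from collections import Counter

MODELS = ["gpt-4o", "gpt-4o-mini"]  # placeholder list of candidate models

def baseline(models=MODELS, test_path: str = "nl2sql_pairs.csv") -> dict:
    """Score each model on the test set with no RAG and no prompt engineering."""
    with open(test_path, newline="") as f:
        cases = list(csv.DictReader(f))
    results = {}
    for model in models:
        tally = Counter()
        for case in cases:
            sql = draft_sql(case["nl"], model=model)  # bare question in, SQL out
            metrics = score(case["nl"], sql)
            tally["lints"] += int(metrics["lints"])
            tally["executes"] += int(metrics["executes"])
        results[model] = {k: v / len(cases) for k, v in tally.items()}
    return results
```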

```mermaid
%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#000',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#7C0000',
      'lineColor': '#000',
      'secondaryColor': '#006100',
      'tertiaryColor': '#fff',
      'fontSize' : '24px'
    }
  }
}%%
graph TD
    A[Data Collection] --> B[Ground Truth Data]
    A --> C[Synthetic Data]

    D[NL2SQL Dataset]  --> E[Test Suite]

    B --> D
    C --> D

    F[Eval Metrics] --> E
    E --> G[Performance Benchmarks]

    style G fill:green
```