Our goal is to produce a first estimate of system accuracy. We will need a dataset of NL <-> SQL pairs, along with a repeatable set of measurements. These measurements should be easy to rerun as we expand the NL2SQL dataset, and their results should be easy to inspect. Thinking ahead, we’ll want to tag pipeline runs so we can compare results before and after we make changes to the system.
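To make the idea of a repeatable, tagged measurement concrete, here is a minimal sketch. It assumes execution accuracy as the metric (a predicted query counts as correct when it returns the same rows as the reference query) and uses an in-memory SQLite database. The `batting` table, the `EvalCase` and `evaluate` names, and the sample questions are all hypothetical illustrations, not the Retrosheet schema or the project's actual harness.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str  # natural-language question
    gold_sql: str  # reference SQL that answers it


def run_query(conn, sql):
    """Return the rows in a stable order, or None if the SQL fails to execute."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def evaluate(conn, cases, predictions, run_tag):
    """Compare each predicted query's rows with the reference query's rows."""
    results = []
    for case, predicted_sql in zip(cases, predictions):
        expected = run_query(conn, case.gold_sql)
        actual = run_query(conn, predicted_sql)
        results.append(expected is not None and actual == expected)
    accuracy = sum(results) / len(results) if results else 0.0
    # The tag lets us line up reports from before and after a system change.
    return {"run_tag": run_tag, "n": len(results), "accuracy": accuracy}


# Toy data standing in for a real database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batting (player TEXT, hr INTEGER)")
conn.executemany("INSERT INTO batting VALUES (?, ?)", [("A", 30), ("B", 45)])

cases = [
    EvalCase("Who hit the most home runs?",
             "SELECT player FROM batting ORDER BY hr DESC LIMIT 1"),
    EvalCase("How many players are there?",
             "SELECT COUNT(*) FROM batting"),
]
predictions = [
    # Different SQL, same result rows: counted as correct.
    "SELECT player FROM batting WHERE hr = (SELECT MAX(hr) FROM batting)",
    # References a table that does not exist: counted as incorrect.
    "SELECT COUNT(*) FROM players",
]

report = evaluate(conn, cases, predictions, run_tag="baseline")
print(report)
```

Comparing result rows rather than SQL text means a prediction is not penalized for being phrased differently from the reference, though it can also reward a wrong query that happens to return the right rows on a small database.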

The prompt was run against Claude, in a project containing several contextual PDFs about Retrosheet, NL2SQL, writing evals, and other related topics. The system prompt:

Use the knowledge of a senior data and machine learning engineer. Your goal is to build a system (prompts, evals, data pipelines, code, schema, logging & observability) to transform natural language queries about baseball statistics into SQL queries that can be executed against retrosheet data.

After prompting Claude and receiving 20 questions to bootstrap our evaluation system, we had Claude write a Colab notebook that asks a generic OpenAI ChatGPT model for the SQL corresponding to each question.
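The generation step can be sketched roughly as follows. This is not the notebook Claude wrote: the prompt wording, `PROMPT_TEMPLATE`, `generate_predictions`, and the sample question are assumptions made for illustration. The model call is passed in as a plain function so the loop can be exercised without network access; in a notebook, `ask_model` would wrap whichever OpenAI client call is in use.

```python
from typing import Callable, Dict, List

# Hypothetical prompt wording, kept deliberately generic.
PROMPT_TEMPLATE = (
    "Write a single SQL query that answers the question below.\n"
    "Return only SQL.\n\n"
    "Question: {question}"
)


def generate_predictions(
    questions: List[str], ask_model: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Ask the model for SQL once per question and keep the pairs together."""
    rows = []
    for question in questions:
        prompt = PROMPT_TEMPLATE.format(question=question)
        predicted_sql = ask_model(prompt).strip()
        rows.append({"question": question, "predicted_sql": predicted_sql})
    return rows


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same query."""
    return "  SELECT 1;  "


rows = generate_predictions(
    ["How many home runs did Babe Ruth hit in 1927?"], fake_model
)
print(rows)
```

Keeping each question next to its predicted SQL makes it straightforward to join the output against the reference queries later.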