What’s the difference between Baseball Reference, Python, R, and SQL
This page describes some options for starting in baseball analysis. Where to get data, how to analyze it, as well as where to start. If you aren’t crystal clear about the difference between Python, R, SQL, and just using Baseball Reference or Fangraphs - this page can help.
Audience: Baseball fans who want to dive deeper
Baseball Data Options
Baseball Reference (Data Platform)
Everyone should start here. They have all the data, plus a fantastic interface for performing supported analyses. If you want to analyze baseball data, BBRef is going to get you 90% of your answers.
Summary the starting point and gold standard - they have all the data you want, accessible through the thoughtfully pre-built analyses they offer. Great point and click interface – click to summarize a player or team statistics between different periods. The best leaderboards. Premium membership gives access to the play-by-play index, which is going to be your best route to analyzing common questions you might have.
Tradeoff: if BBRef offers what you need, it’s the best option, full stop. The only real limitation is that you can’t extend results in novel ways. BBRef is great if you are asking questions they have built an interface to answer; if you have been frustrated by not getting exactly what you want, or wanted to extend an analysis past BR’s pre-set rollups, then you will need to do your own analysis.
Fangraphs.com (Content Site + Data Platform)
Best sabermetric content site on the web. Best authors who are regularly scooped up by teams. Close ties with industry - when MLB teams hire analysts, the jobs are cross-posted to Fangraphs readers. They have some sweet data sources - a stellar second place in data access, compared to BBRef.
Tradeoff: Cannot run your own computation on the underlying data. If the site supports the analysis you want, perfect! But if they don’t support that exact analysis - e.g., which players have seen the most significant changes in swing rates over the past 30 games - you’re out of luck.
Run Your Own Analyses (Programming Languages; DIY)
Only go this route if you have questions that BBRef and Fangraphs can’t answer. Focus on the analysis, not on the tools. However, if you cannot run your analyses, you will need to reach for your own tools. Be aware - this is going to be much harder! But sometimes it’s the only way.
Tradeoff: you have to do a bunch of things (acquire and clean data, update it manually to refresh analyses, etc.) that are more operationally focused and not aligned with your goal of baseball data analysis. (This is where Project Baseball can help! I want to do that work, so you can focus on cool analyses!)
When to do your own analysis
R and Python are great options. Python is one of the top universal programming languages, and so will pay dividends if you ever want to write programs (such as this website). R is easier to use for data anlysis (controversial opinion), but less popular outside of biostatistics communities. My advice - don’t sweat this decision. Choose Python if you want to go bigger than just data analysis. Otherwise, flip a coin!
R/Python versus SQL
R versus Python is a choice. R/Python versus SQL is more complementary. SQL is better for having large amounts of data, closer to plain English, and a language where you describe what you want (“how many sliders did Wainwright throw per game, for the past 7 games”). R/Python would require much more work to answer that question, but are better at running simulations, more sophisticated statistics, and any machine learning/AI. If your goal is machine learning, Python is probably a better choice.
Project Baseball
We offer access to data, via SQL, for free. Plus an easy way to publish analyses online and build a portfolio. I recommend you just jump in, check out other people’s queries, and play around. Starting points:
My Recommendation
My typical workflow is to filter data using SQL, from the event data (each pitch) or maybe rolling up to the “grain” that I care about. EG, if I want to count who has thrown the most sliders out of their last 100 pitches, I would stay on the pitch grain. However, if I wanted to count how frequently Whit Merrifield nailed the 2 hit threshold in the last 30 days giving KC fans free Pepsi - I’d want to aggregate (using SQL) the pitch data into at-bat data into per-game data. Each level - pitch by pitch, at bat, per game, per season, per player, per team - that is called the “grain”.
Back to our story. If you only care about 2021 data, SQL is great to get 2021 data on the grain (level) that you want. If you try to load all data of all time (20+GB), you will choke your laptop trying to use Python and R. Use SQL to filter to the grain, the season, the team, the player, using SQL. Then use R or Python to finish the analysis and charting.
Leaderboards
If your analysis lends itself to a leaderboard - maybe a last 30d Swinging Strike %, or Most Career Homeruns in a Leap Year - a leaderboard is a good fit. This is just a SQL query backed data table, chart, or a single number, displayed on a website, updated each day.
If you want summary statitics, over a period of time, at a specific grain, displayed on a chart or table - think “Leaderboard”!
SQL is your best option at Project Baseball, but also more broadly: most dashboard software support SQL as the primary programming language. This is one of the top career building skills, if summarizing data is part of your job responsibilities.
Simulations and Sophistications
If you want to do a more sophisticated analysis and need to use R/Python, you will save the results into a database table. Then your dashboard (leaderboard, ie a data table that updates automatically) will be using SQL to query the analysis results. (Generally, R/Python analyses have to be refreshed manually, when the underlying data refreshes; SQL operates on a closer-to-the-database level, so generally is easy to run when new data arrives.)