If you’ve ever tried to build a serious football analytics project, whether for fantasy, betting, scouting or media, you’ve probably felt the pain of coordinating an overwhelming amount of disjointed data. Data is everywhere, but it’s messy. Player stats live in one format. Transfer histories in another. Lineups might be buried in PDFs. And joining all of it together? That’s a weekend lost to spreadsheet hell. At Baselight, we believe sports analytics should be built on clean, structured, and accessible data. That’s why we created the Ultimate Soccer Dataset—a living, open foundation for football intelligence.
Why We Built It
The world of football is overflowing with data. Every match is logged, every player tracked, every statistic recorded—from expected goals and pressing intensity to transfer fees and tactical shifts. But if you’ve ever tried to use that data, you know how frustrating it can be. Stats are fragmented across APIs, locked behind paywalls, buried in spreadsheets, or formatted inconsistently across platforms. It’s a mess.
At Baselight, we’re both football fans and data infrastructure nerds. We wanted something better. Something structured. Something composable. Something that didn’t require hours of cleaning before you could even begin your analysis. That’s why we built the Ultimate Soccer Dataset—a public, normalized, and ever-growing foundation for high-quality football analytics.
What’s Inside?
The Ultimate Soccer Dataset is one of the most ambitious structured football datasets ever released. It covers data on:
- 5,000+ teams
- 120,000+ individual players
- 95,000+ matches
- 1,600,000+ player-match stats
- 400,000+ transfers
- 500,000+ betting odds
Whatever football stat you’re looking for, we probably have it.
Data structure is what sets it apart
We didn’t just collect data. We normalized it. What does that mean? Technically, the dataset is in what’s called third normal form (3NF)—a standard in database design that avoids redundancy and ensures each fact is stored in exactly one place. For example:
- Player demographics are stored once and referenced wherever needed.
- Matches reference teams, players, and stats, but never duplicate their information.
- Transfers, betting odds, and performance stats are all cleanly separated into purpose-specific tables.
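To make the "stored once, referenced everywhere" idea concrete, here is a minimal sketch that runs locally with Python's built-in sqlite3 module. It is not the dataset itself: the two toy tables only borrow the names players and match_player_stats and the columns player_id, name, and minutes_played from the query further down this post, the key constraints are assumptions, and every row is an invented placeholder.

```python
import sqlite3

# Toy, in-memory stand-in for two tables of a normalized schema.
# Rows are invented placeholders, NOT data from the Ultimate Soccer Dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (
        player_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE match_player_stats (
        match_id       INTEGER NOT NULL,
        player_id      INTEGER NOT NULL REFERENCES players(player_id),
        minutes_played INTEGER NOT NULL,
        PRIMARY KEY (match_id, player_id)
    );
    INSERT INTO players VALUES (1, 'Player A'), (2, 'Player B');
    INSERT INTO match_player_stats VALUES
        (100, 1, 90), (101, 1, 75), (100, 2, 60);
""")

# Stat rows carry only player_id; the name is looked up through a join.
TOTAL_MINUTES_SQL = """
    SELECT p.name, SUM(s.minutes_played) AS total_minutes
    FROM match_player_stats s
    JOIN players p ON p.player_id = s.player_id
    GROUP BY p.player_id, p.name
    ORDER BY total_minutes DESC
"""

def total_minutes(connection):
    """Return (name, total_minutes) rows, resolving names via player_id."""
    return connection.execute(TOTAL_MINUTES_SQL).fetchall()

before = total_minutes(conn)

# Correct a player's name in exactly one place...
conn.execute("UPDATE players SET name = 'Player A. Renamed' WHERE player_id = 1")

# ...and every stat row reflects the change, because none of them copied it.
after = total_minutes(conn)

print(before)
print(after)
```

Because the name lives only in players, a single UPDATE is enough; in a denormalized table the same correction would have to touch every stat row that repeated the name.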
The result is a dataset that’s faster to query, easier to join, and far less error-prone than traditional sources.
- Faster queries — Because data isn’t duplicated across tables.
- Fewer errors — You’re not at risk of inconsistencies between different tables.
- One source of truth — Every data point (player, team, match, etc.) is consistent, no matter how you query it.
In other words, this is the fastest and easiest way to query football stats. For example, our Chief Portuguese Football Fan, Paulo Sousa, quickly whipped up a query after Ronaldo scored the game-winning goal against Germany in the UEFA Nations League semifinal!

So what can I do with the Ultimate Soccer Dataset?
What can you actually do with this dataset? A lot. Sports journalists can use it to support articles with facts rather than gut feelings. Analysts can create models that predict match outcomes or injury risk with reliable inputs. Developers can build bots and AI agents that respond to football questions with real data. Fantasy and betting communities can surface undervalued players and optimize lineup decisions with precise metrics. And curious fans can finally answer questions like: Which referees hand out the most red cards? How does transfer spend correlate with final league position? Who’s the most efficient striker in the last three years?
Here’s a great example! Let’s say you’re building a Golden Boot prediction model. You want to find efficient goal scorers across the top leagues from 2021 to 2024. With the Ultimate Soccer Dataset, you join the goal_events table with match_player_stats, filter for players with more than 900 minutes played, calculate goals per shot, and sort the results. From there, you can flag value picks, refine your fantasy roster, or generate betting signals. It’s a 20-minute workflow that used to take hours—and it’s fully auditable, reproducible, and shareable.
-- Matches played between 2021 and 2024
WITH valid_matches AS (
    SELECT match_id
    FROM "@blt.ultimate_soccer_dataset.matches"
    WHERE year(date) BETWEEN 2021 AND 2024
),
-- Goal events per player in those matches
goals AS (
    SELECT
        player_id,
        COUNT(*) AS goals
    FROM "@blt.ultimate_soccer_dataset.goal_events"
    WHERE match_id IN (SELECT match_id FROM valid_matches)
    GROUP BY player_id
),
-- Minutes and shots per player in those matches
player_stats AS (
    SELECT
        player_id,
        SUM(minutes_played) AS total_minutes,
        SUM(shots_total) AS total_shots
    FROM "@blt.ultimate_soccer_dataset.match_player_stats"
    WHERE match_id IN (SELECT match_id FROM valid_matches)
    GROUP BY player_id
)
SELECT
    ps.player_id,
    p.name,
    ps.total_minutes,
    ps.total_shots,
    COALESCE(g.goals, 0) AS goals,
    -- NULLIF avoids division by zero for players with no recorded shots
    ROUND(COALESCE(g.goals, 0) / NULLIF(ps.total_shots, 0), 4) AS goals_per_shot
FROM player_stats ps
LEFT JOIN goals g ON ps.player_id = g.player_id
LEFT JOIN "@blt.ultimate_soccer_dataset.players" p ON ps.player_id = p.player_id
WHERE ps.total_minutes > 900
ORDER BY goals_per_shot DESC;
Join the Movement
We see this dataset as a shared resource for the next generation of football thinkers. Whether you’re creating tactical visualizations, training reinforcement learning agents, testing scouting models, or building fan dashboards, the Ultimate Soccer Dataset gives you the raw material to work with speed and confidence.
We’re just getting started. As more leagues come online and more users contribute, we expect this to become the go-to public dataset for football data analysis. And we want you to be part of it.
Explore the Ultimate Soccer Dataset today and experience a new standard in football intelligence.
Have a dataset to contribute? Want to build an app or model on Baselight? Interested in partnering? Reach out here or join our Discord community to start collaborating.
Let’s build the future of football data, together.
