Skip to content

Home / Blog

Loading Cricsheet YAML into DuckDB

·6 min read ·by Dipankar Sarkar

In short: Install CricketLogic, point it at a folder of Cricsheet YAML with db.ingest_directory(), and each match is parsed into DuckDB matches, innings, deliveries, players and teams tables — deduplicated by filename — ready to query with SQL.

What is Cricsheet YAML?

Cricsheet publishes free, structured, ball-by-ball data for thousands of cricket matches. Each match is a single YAML (or JSON) file describing the teams, the toss, the outcome, and every delivery bowled — who faced it, who bowled it, how many runs, and any wicket.

That structure is great for analysis but awkward to query directly: the deliveries are deeply nested inside innings inside a match. To ask “who has the best T20 strike rate?” you need those deliveries flattened into rows. That flattening is exactly what CricketLogic does.

The fastest path

pip install cricketlogic
cricketlogic download by-competition ipl     # grab match files
cricketlogic ingest ./data/ipl --db-path ipl.duckdb

Or from Python:

from cricketlogic import CricketDB

db = CricketDB("ipl.duckdb")
db.ingest_directory("./data/ipl")
db.close()

What tables do you get?

Ingestion normalizes each YAML into five core tables:

TableOne row perKey columns
matchesmatchteams, venue, city, dates, toss, outcome, match_type, competition, gender
inningsinningsteam, match_id
deliveriesdelivery (ball)batsman, bowler, non_striker, runs_batsman, runs_total, wicket
playersparticipantname, registry_identifier
teamsteamname

The deliveries table is where the value is: one row per legal and illegal ball across every ingested match. Every analytical view is an aggregation over it.

Why duplicates don’t pile up

Matches are identified by their filename. Before inserting, CricketLogic checks whether a file with that name has already been ingested and skips it if so. That means you can safely re-run ingest on an overlapping folder, or run download auto-update on a schedule, without double-counting a single ball.

# Safe to run repeatedly — already-seen matches are skipped
cricketlogic download auto-update --days 7
cricketlogic ingest ./data/recent --db-path ipl.duckdb

Verifying the load

After ingesting, sanity-check the counts:

SELECT match_type, COUNT(*) AS matches
FROM match_summary
GROUP BY match_type
ORDER BY matches DESC;

Then confirm the deliveries landed:

SELECT COUNT(*) AS balls FROM deliveries;

Where to go next

Once the data is in DuckDB, you never touch YAML again — it is all SQL. Read Computing batting & bowling stats from ball-by-ball data to see how the predefined views turn those delivery rows into metrics, or jump to the quickstart.

Last updated Jul 2, 2026 · Code verified against CricketLogic MIT · Source: https://github.com/cricketlogic/cricketlogic