How the data was assembled, what the codings mean, and where the judgment calls live.
The dataset comes from parsing the embedded JavaScript in the Guardian's interactive feature page (May 2026). Every voter's full ranked top-10 is in that source, including books that didn't make the top 100. The parser extracts (voter, book, rank, position, commentary) tuples and pivots them into structured CSVs. The full pipeline — build scripts, analysis code, interactive site — is in the project's git repo.
Each of the 172 voters is coded for gender — F, M, NB, or unknown. The coding was done in two passes. The first pass drew on general knowledge of contemporary literary figures (most of the voters are well-known novelists, critics, and Guardian journalists). The second pass was a careful validation: every voter was checked against an authoritative public source — typically the lead paragraph of their Wikipedia entry, with explicit attention to pronouns; failing that, their Guardian author-profile page, publisher bio, or personal website. The two-pass process caught three errors in the first-pass coding: Guy Gunaratne (M → NB; he/they), Lucas Rijneveld (NB → M, per his post-2022 use of he/him in English), and Olivia Laing (F → NB; they/them per Wikipedia).
Caveats about voter gender coding. Even with a validation pass, this kind of coding has real failure modes; the resulting CSV is an informed best reading, not ground truth:
Net of these caveats, the voter-gender file represents a sincere best effort, and the headline statistics about gendered voting patterns are robust to plausible single-voter recoding — no individual voter's gender determines the F-versus-M ballot split.
Each book is coded along three axes:
Authors are also coded for gender, separately from voters. 479 distinct authors across the dataset are coded (75 for the top 100; 404 for the additional authors that appear only in long-tail picks). Authors who use pen names (George Eliot, George Orwell, Elena Ferrante) are coded as the gender the author publicly identified as.
For the chart's binary F-vs-M math, the two non-binary voters are not in the denominator or the marginal curves. They appear normally in voter lists and rankings — they just don't enter the percentage math. The Y-axis is labelled Voter gender* with vertical "All male" / "All female" endpoint labels; the asterisk footnote spells out the F ÷ (F + M) calculation and the non-binary voter handling.
The four percentages at the top of the main page are parallel asymmetry measures, not complementary halves of a single whole. Each is computed against its own group's total ballots:
The two stats describe each gender's cross-the-line voting behaviour independently. Adding them together (27% + 49% = 76%) is not meaningful — they're different bases. Within each voter gender the complements do sum to 100% as expected (male voters: 27% to F authors + 73% to M authors = 100%; female voters: 49% to M authors + 51% to F authors = 100%). The same logic applies to the two subject-coded statistics.
Most of the codings in this analysis are judgment calls, and they are open to reasonable disagreement. It's worth being frank about which ones.
Subject score is the most subjective code. It compresses several distinct features of a book — protagonist gender, setting, themes, violence, narrative voice — into a single number on a stereotype gradient. The same book could legitimately be coded ±1 in either direction depending on which features you weight most. Some specific examples of the judgment calls made:
Canonicity score is similarly judgment-based. It conflates two correlated factors — age and author identity — into a single number, which works for most books on this list but breaks down for cases like Anita Brookner (canonically venerated but not in the deep canon) or Toni Morrison (very canonical now but the canon adopted her relatively recently). Each book's canonicity score reflects where the book currently sits in the literary establishment's received wisdom, which is not identical to how venerated it was at publication.
The long-tail subject coding was done by a research agent working from plot summaries and brief author research — not as carefully as the top-100 codings, which were hand-done with individual rationale per book. Coding consistency across the long tail is therefore weaker than within the top 100. For the headline patterns this matters less (the signal at the top of the list dominates), but for any individual long-tail book, the score should be taken as approximate rather than precise.
Author gender for the 404 new authors was also agent-coded. Famous and confidently-known authors are reliable; obscure single-name authors with no clear public gender are coded as "unknown" rather than guessed.
The "voter pool" is the Guardian's curated list of contributors, not a random sample of readers. The patterns here describe the gendered reading habits of a specific community of literary professionals — novelists, critics, academics, Guardian journalists. Generalising to "what men read" or "what women read" outside this community would overreach. The voter pool is also 59% female, which itself shapes which books aggregate to the top 100 and which fall out as honourable-mentions.