Structural Snapshot of the Toshima-ku Assembly 2024-2025 Pilot Data — An Early-Stage View of Assembly Response Corpus Building in Tokyo's Special Wards

Jul 4, 2026

Naoya Yokota

About 8 min read

The machikarte assembly-response corpus currently holds 2,211 speeches from the Toshima-ku assembly for 2024-2025; the fiscal 2024 portion spans 65 speakers and 18 sessions. This case note reads that pilot slice for what it is: it holds the analysis at municipality-aggregate granularity and treats the absence of records for other wards, including Setagaya-ku, as 'ingestion not yet started' rather than 'nothing to observe'.


This note is a case-study entry from the machikarte laboratory (ISVD-LAB-004). Against the long-term target of about 1,788 local assemblies nationwide, the Toshima-ku assembly pilot data (2,211 speeches recorded for 2024-2025) represents an early ingestion slice. The article holds the analysis at municipality-aggregate granularity and does not address individual councillors or factions.

What is happening

The core table correlate-workspace.machikarte.speeches in machikarte holds, as of 2026-07-02, 1,669 speeches for the Toshima-ku assembly in fiscal 2024 and 542 speeches for fiscal 2025 (in progress), for a combined 2,211 speeches. The 2024 fiscal year covers 18 sessions and 65 speakers, with a total content length of 1,161,170 characters. Ingestion of fiscal 2025 continues.

This is an early-stage slice of assembly-response corpus building for the special wards of Tokyo (tokubetsu-ku). Toshima-ku data for 2018-2023 is not yet in the corpus, and Setagaya-ku (including 2024-2025) has not yet been ingested at all. The target scale for machikarte overall is around 1,788 assemblies and roughly 126 million speech records; the ward-level ingestion progress is uneven, and Toshima-ku's two-year, 2,211-speech slice sits at the early stage of that progression.

The article deliberately avoids the shortcut "ingested = ready for analysis". Some questions can be approached at the current 2,211-speech scale, and others cannot in principle be approached with a single ward and two years. Keeping those two apart is the point of this note. The stance follows the three-tier public-granularity rule set out in the machikarte laboratory hypothesis map; this article stays at the first tier (municipality-aggregate) and does not address individual councillors or factions.

Background and context

Corpus scale overall and the position of the special wards

The target scale of the machikarte corpus is roughly 1,788 local assemblies nationwide and about 126 million assembly speech records. Prior laboratory analyses include prefectural aggregates on datasets exceeding one million records as of 2024, and a distribution study of response phrasing across 870 municipalities and roughly 18.97 million responses (National Distribution of Deferral Phrasing in Assembly Responses).

Those articles handle "nationwide scale" by bundling many municipality–year combinations. This article, in contrast, takes Toshima-ku 2024-2025 as a single point in the corpus, closer to an early snapshot. Making that positional relationship visible up front lets the reader trace the split, later in the article, between what is observable now and what is not yet observable.

The special wards are the base-tier local governments of Tokyo, a specific category of local public entity under Japanese local government law. Policy debate at this tier concentrates on questions close to residents' daily lives: child-rearing, elderly care, housing, and ward services. Building a cross-cutting archive of assembly responses at this tier carries clear value, but ingestion progress is stage-based and ward-by-ward, and the coverage that would support cross-ward comparison is not yet in place.

The legal frame for open assemblies and the practical cost of ingestion

Local assembly meetings are open in principle. Article 115 of the Local Autonomy Act establishes the openness of plenary sessions and provides the institutional basis for both public attendance and the publication of minutes. Minutes of the Toshima-ku assembly are published via the Toshima-ku official site. As far as legal openness goes, the requirement is met.

Even so, ingestion progresses in stages because the publication format differs by municipality. Minutes appear as PDFs, as HTML, on in-house CMSs, and on third-party search systems, and the coverage of periods and committee types is not uniform. Each municipality needs its own scraper (a data-fetching program), and its output has to be normalised into a shared schema and validated before loading into BQ (BigQuery, Google Cloud's data warehouse). The details of that process are set out in the laboratory's assembly corpus construction methodology.
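The normalise-and-validate step can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the SpeechRecord fields other than municipality_code, the raw keys ("text", "year", "session", "speaker"), and the validation rules are hypothetical and are not taken from the actual machikarte schema. The code is stored as a string so that leading zeros survive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechRecord:
    """Hypothetical shared schema; only municipality_code is named in this note."""
    municipality_code: str  # six-digit code kept as text to preserve leading zeros
    fiscal_year: int
    session_id: str
    speaker_id: str
    content: str


def normalise(raw: dict, municipality_code: str) -> SpeechRecord:
    """Map one scraper-specific row onto the shared schema and validate it."""
    # Collapse runs of whitespace left over from PDF or HTML extraction.
    content = " ".join(str(raw.get("text", "")).split())
    if not content:
        raise ValueError("empty speech content")
    if len(municipality_code) != 6 or not municipality_code.isdigit():
        raise ValueError("municipality_code must be six digits")
    return SpeechRecord(
        municipality_code=municipality_code,
        fiscal_year=int(raw["year"]),
        session_id=str(raw["session"]),
        speaker_id=str(raw["speaker"]),
        content=content,
    )


record = normalise(
    {"text": "  Example   speech text ", "year": "2024", "session": "s01", "speaker": "a"},
    "131164",
)
print(record.content)  # Example speech text
```

Only the mapping from raw keys to schema fields would change per municipality; the validation and the target schema stay shared.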

The current picture (Toshima-ku 2024-2025 ingested; Toshima-ku 2018-2023 and other special wards including Setagaya-ku not ingested) is a direct reflection of that stage-based, per-municipality ingestion cost.

Reading the structure

The 2024 slice (1,669 speeches × 65 speakers × 18 sessions)

Some simple averages for Toshima-ku fiscal 2024: about 93 speeches per session, about 26 speeches per speaker, and an average speech length of about 696 characters.
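The three averages follow directly from the fiscal 2024 totals reported above. A short check, using only those published figures:

```python
# Fiscal 2024 totals for Toshima-ku as reported in this note.
speeches = 1_669
sessions = 18
speakers = 65
total_chars = 1_161_170

per_session = speeches / sessions    # roughly 92.7
per_speaker = speeches / speakers    # roughly 25.7
avg_length = total_chars / speeches  # roughly 695.7

print(round(per_session), round(per_speaker), round(avg_length))  # 93 26 696
```

These are means only; they say nothing about how speech volume is spread across individual sessions or speakers.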

At this scale, what is observable stays within the outline of aggregate values: an initial description of the session-level debate structure, a snapshot of the per-speaker speech-volume distribution, and the shape of the length distribution. Questions like "which sessions in 2024 concentrate a given keyword" or "does the per-speaker speech-volume distribution have a long tail or a short one" remain within reach even at this scale.
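Both example questions reduce to simple counting. The sketch below uses toy rows rather than corpus data, and the column names session_id, speaker_id, and content are assumptions, not the actual machikarte schema. The per-speaker function returns counts only, without identifiers, which keeps the output at municipality-aggregate granularity.

```python
from collections import Counter

# Toy rows standing in for speeches already filtered to one municipality and year.
rows = [
    {"session_id": "s01", "speaker_id": "a", "content": "childcare budget question"},
    {"session_id": "s01", "speaker_id": "b", "content": "response on childcare"},
    {"session_id": "s02", "speaker_id": "a", "content": "housing policy"},
    {"session_id": "s02", "speaker_id": "a", "content": "follow-up on housing"},
]


def keyword_by_session(rows, keyword):
    """Count speeches per session whose content mentions the keyword."""
    return Counter(r["session_id"] for r in rows if keyword in r["content"])


def speeches_per_speaker(rows):
    """Speech counts per speaker, sorted descending, with identifiers dropped."""
    return sorted(Counter(r["speaker_id"] for r in rows).values(), reverse=True)


print(keyword_by_session(rows, "childcare"))
print(speeches_per_speaker(rows))
```

A sorted count list is enough to see whether a few speakers account for most speeches (a long tail) or whether volume is spread evenly.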

What is not observable at this scale is time-series change and cross-municipality comparison with replicable structure. Observing a seven-year trajectory requires seven years of records; the two years 2024-2025 do not reach the horizon where policy lag becomes readable. Cross-ward comparison cannot even be set up while the other special wards have zero records. Both observation gaps must wait for further ingestion.

Collision avoidance and the unique-code discipline

The value 131164 in the municipality_code column uniquely identifies Toshima-ku (a Tokyo special ward) under the Japanese national municipality code system. The underlying JIS X 0402 code is the five-digit form 13116 (a two-digit prefectural code, 13 = Tokyo, plus a three-digit municipal code, 116 = Toshima-ku). The national municipality code table maintained by the Ministry of Internal Affairs and Communications adopts a six-digit form that appends a check digit (4) to that five-digit JIS X 0402 code, giving 131164. JIS X 0402 and the MIC national code coexist as separate standards; this article uses the latter as the primary key.
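The check digit can be reproduced with a short sketch. The weighting scheme used here (weights 6, 5, 4, 3, 2 on the five digits, modulus 11) is the commonly documented rule for this code system; it is not stated in this note, so treat it as an assumption and verify it against the MIC code table before relying on it.

```python
def check_digit(jis_code: str) -> int:
    """Check digit for a five-digit code, assuming weights 6..2 and modulus 11."""
    if len(jis_code) != 5 or not (jis_code.isascii() and jis_code.isdigit()):
        raise ValueError("expected a five-digit numeric code")
    weighted = sum(int(d) * w for d, w in zip(jis_code, (6, 5, 4, 3, 2)))
    # 11 minus the remainder; results of 10 and 11 keep only their last digit.
    return (11 - weighted % 11) % 10


def to_six_digit(jis_code: str) -> str:
    """Append the check digit to form the six-digit code."""
    return jis_code + str(check_digit(jis_code))


print(to_six_digit("13116"))  # 131164
```

For 13116 the weighted sum is 6 + 15 + 4 + 3 + 12 = 40, the remainder modulo 11 is 7, and 11 - 7 gives the check digit 4.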

The name "Toshima-ku" happens to be unique within the Tokyo special wards, and the practical collision risk against other ward names is low. That safety, however, rests on a name-slug (a normalised string) and disappears as soon as name-based cross-reference (joining across tables) is used more widely. A prior incident involving Fukushima-cho in Hokkaido and Fukushima-shi in Fukushima Prefecture, both mapped to a similar name-slug, produced an erroneous join that placed one municipality's speeches under the other.

The laboratory therefore performs cross-reference only via municipality_code (the primary key), and treats name, slug, and display_name as display-only dependent columns. The full argument is laid out in the unique-identifier-discipline section of the assembly corpus construction methodology. Every aggregate in this article uses municipality_code = 131164 as the filter and does not pass through any ward-name string.
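The difference between the two join keys can be shown on toy rows. In this sketch the slug rule naive_slug is hypothetical (the note does not say how the colliding slug was produced), the speech rows are invented, and the two six-digit codes are given for illustration and should be checked against the MIC table.

```python
from collections import defaultdict

# Illustrative rows only; verify the codes against the MIC table before reuse.
municipalities = [
    {"municipality_code": "013323", "display_name": "Fukushima-cho"},
    {"municipality_code": "072010", "display_name": "Fukushima-shi"},
]
speeches = [
    {"speech_id": 1, "municipality_code": "013323"},
    {"speech_id": 2, "municipality_code": "072010"},
    {"speech_id": 3, "municipality_code": "072010"},
]


def naive_slug(display_name: str) -> str:
    """Hypothetical slug rule that drops the administrative suffix."""
    return display_name.lower().rsplit("-", 1)[0]


# Grouping by slug silently merges two different municipalities.
by_slug = defaultdict(set)
for m in municipalities:
    by_slug[naive_slug(m["display_name"])].add(m["municipality_code"])
collisions = {slug: codes for slug, codes in by_slug.items() if len(codes) > 1}

# Counting by municipality_code keeps them apart.
counts = defaultdict(int)
for s in speeches:
    counts[s["municipality_code"]] += 1

print(collisions)
print(dict(counts))
```

A collision check of this kind can run as a validation step before any aggregate is computed, so that a slug shared by two codes is caught before it reaches a join.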

Three tiers of public granularity, and the scope of this article

The interface between machikarte and the laboratory runs on a three-tier public-granularity rule. Tier one is municipality-aggregate (adopted here). Tier two is faction-aggregate. Tier three is verbatim quotation. Faction-aggregate and verbatim quotation carry a risk of approaching individual identification and are treated as sensitive.

This article stays at tier one. No individual councillor names or faction names appear. A closer look at the composition of the 65 speakers or the topic composition of the 18 sessions may appear in a separate article once the tier-two public-granularity judgement is made explicitly for that scope. The role of this article is to make the early-stage slice visible as a shape, and to keep individual evaluation outside the scope.

"No records" is not the same as "no debate"

The number of Setagaya-ku speeches in the corpus, as of 2026-07-02, is zero. That number does not mean "no debate takes place in the Setagaya-ku assembly". Setagaya-ku runs its own minutes-publication system, and there is no basis for assuming its minutes volume is smaller than that of Toshima-ku. The zero means "ingestion has not started"; it does not mean the absence of debate.

The same reading applies to Toshima-ku 2018-2023. Assembly responses from that period clearly exist; the machikarte scraper has simply not covered that period yet. Equating "record count" with "debate volume" is a shortcut that this early stage does not support.
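One way to keep the two readings apart in code is to represent "not ingested" as a missing value rather than as zero. The sketch below is an assumption about representation, not the machikarte implementation; the Toshima-ku counts are the ones reported in this note, and the Setagaya-ku code shown is assumed and should be verified against the MIC table.

```python
from typing import Optional

# Speech counts per (municipality_code, fiscal year), for ingested slices only.
ingested = {
    ("131164", 2024): 1_669,  # Toshima-ku, fiscal 2024
    ("131164", 2025): 542,    # Toshima-ku, fiscal 2025 (in progress)
}

SETAGAYA = "131121"  # assumed code for Setagaya-ku; verify before reuse


def speech_count(municipality_code: str, fiscal_year: int) -> Optional[int]:
    """Return the count for an ingested slice, or None when ingestion has not started."""
    return ingested.get((municipality_code, fiscal_year))


def describe(municipality_code: str, fiscal_year: int) -> str:
    count = speech_count(municipality_code, fiscal_year)
    if count is None:
        return "ingestion not yet started"
    return f"{count} speeches recorded"


print(describe("131164", 2024))  # 1669 speeches recorded
print(describe(SETAGAYA, 2024))  # ingestion not yet started
print(describe("131164", 2018))  # ingestion not yet started
```

With this representation a true zero stays available for the case where a slice has been ingested and contains no speeches.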

The overall roadmap for expanded ingestion is documented on the machikarte side. This article stays with the current slice and its limits. The connection point to future work is picked up in the next section.

Limits of observation and the next stage

What can be observed from a single ward and two years is narrow. A seven-year trajectory cannot be traced from two years, and comparison across the 23 special wards cannot be set up with a single ward. Time-series and cross-comparison, the two main observation axes of corpus research, are both sealed off at this stage. Making that fact visible honestly is one of the main purposes of this article.

Observation possibilities will open in stages as ingestion advances. When Toshima-ku 2018-2023 comes online, a single-ward seven-year trajectory (fiscal 2018 through 2024, with 2025 still in progress) becomes tractable: policy lag, annual variation in topic concentration, annual distributions of speech volume. When other special wards come online, cross-ward aggregates become possible, including the width of ward-level distributions and the position of the special wards within Tokyo as a whole. When all 23 wards are in place, multi-layer analysis linking the Tokyo Metropolitan Assembly and the ward assemblies comes within reach.

These future items connect naturally to other article groups in the laboratory: prefectural distribution analysis, seven-year trend analysis, topic-specific aggregates. The role of this article is confined to setting the entry-point snapshot as a shape.

The methodology-side connection point is the assembly corpus construction methodology. That article covers the ingestion stage this note treats and the overall design that scales to 1,788 assemblies. The two are best read as sibling notes running in parallel.

References

machikarte — Nationwide Search Platform for Japanese Local Assembly Speeches (beta) — Institute for Social Vision Design (ISVD). ISVD

machikarte (GitHub) — schema, aggregation queries, license (MIT + CC BY 4.0) — Institute for Social Vision Design (ISVD). GitHub

Local Autonomy Act (chihou jichi hou) — Government of Japan. e-Gov Law Search

National Municipality Code Table — Ministry of Internal Affairs and Communications. MIC

Toshima-ku Assembly — Toshima-ku, Tokyo. Toshima-ku Official Site