Methodology of Assembly Corpus Construction — Turning Response Data from 1,700 Japanese Local Assemblies into Structural Analysis Material

This is a methodology note from the machikarte research lab (ISVD-LAB-004). It sets out the design required to turn response data from roughly 1,700 Japanese local assemblies into structural analysis material, organized along three layers: acquisition, normalization, and publication granularity.

What is happening

Local assemblies across Japan publish their meeting minutes on their own official websites. As an institution, publication is in place. Yet cross-sectional analyses such as "tracing seven years of care-related mentions across the country" or "mapping the regional distribution of PPP and PFI discussions" are not, in practice, available to either researchers or practitioners. Minutes are scattered across municipalities, formats vary, and search infrastructure remains closed within each assembly.

machikarte treats this scattered state itself as a structural problem. It acquires minutes from individual assemblies, normalizes them into a common schema, stores them in BigQuery (Google Cloud's data warehouse, hereafter BQ), and publishes the result as an aggregation and search base. The labs articles sit as a layer above this base, extracting analytical material to "read the structure".

This note asks the following question. What design is required to turn response data at the 1,700-assembly scale into material for structural analysis rather than a mere collection? What must be held as an identifier, what must not be discarded, and what must be split by publication granularity? This note organizes these three points as a methodology.

Background and context

Institutional premise for public access to local assembly responses

Meetings of local assemblies are, in principle, open to the public. Article 115 of the Local Autonomy Act stipulates the open nature of plenary sessions, providing the ground for the freedoms of attendance and reporting. The Information Disclosure Act and similar ordinances give citizens the right of access to minutes as administrative documents.

Institutional openness is in place, but the form in which minutes are published differs across municipalities. PDF, HTML, in-house CMSs, and third-party vendor search systems coexist, and coverage by period and by committee type is not aligned. At the national level, the National Diet Library's Diet Minutes Search System serves as an integrated base, but local assemblies have long lacked an equivalent nationwide cross-cutting base.

The lineage of prior research

Research that treats parliamentary and assembly text has developed along at least three lines. Details are consolidated in the lab's literature map; here we sketch the outline as it connects to methodology.

The first is structural analysis of national and local assembly minutes. A continuing line of work by Yasunori Kimura and colleagues has kept updating methods that extract speaker roles, topics, and stances from minutes text. Yutaro Miyaki and Yuzu Uchida's "Verification of a BERT-based classifier for role classification of Diet minutes speeches" (2025) sits along this line and verifies the accuracy of speech-role classification.

The second is the corpus and natural language processing lineage. Assembly minutes corpus research led by Haruka Watanabe and colleagues has accumulated the practices of schema design and quality control needed to treat text as structured data. The proceedings of the annual meeting of the Association for Natural Language Processing continue to host research targeting assembly and administrative text.

The third is the civic participation platform lineage. Applied research on public disclosure of assembly information and visualization of debate has been advancing under the civic tech umbrella. machikarte stands where these three lines meet: it inherits the corpus practices of the first and second lines while serving the public disclosure infrastructure purpose of the third.

Three specificities of machikarte

machikarte differs from prior research in three respects. First, national scale: it consolidates assembly minutes at the 1,700-assembly scale into a single base. Second, continuous updating: the scraper (the acquisition program) runs on a schedule, and response data keeps accumulating. Third, two-layer publication: it separates the data layer on BQ from the article layer on labs, so that raw data provision and structural analysis run in parallel with a division of roles.

This two-layer publication follows from the policy of remaining a neutral data source. The machikarte main site carries no evaluative commentary on individual assemblies, councillors, or heads of local government. It stays a data provision layer. Evaluative reading is done in the labs article layer, under a separate set of aggregation granularities and publication rules. This separation is the background for the three-stage publication granularity discussed in the next section, "Reading the structure".

Reading the structure

Schema design and distributed processing optimization

The central table of machikarte is correlate-workspace.machikarte.speeches on BQ. One row corresponds to one utterance. The table has 15 columns, designed in four functional groups.

The identifier group has four columns. speech_id is the unique ID of the utterance, session_id is the ID at the meeting level, municipality_code is the nationwide code for local public bodies maintained by the Ministry of Internal Affairs and Communications, and municipality_name is the display-purpose name of the municipality. Within this table speech_id identifies the row; municipality_code is the key that identifies the municipality, and holding it is the entry point to the unique-identifier principle discussed later.

The time-series group has three columns. session_date is the date of the meeting, session_title is the meeting name, and session_type holds the category (plenary, committee, and so on). session_date is the partition column on BQ, with daily (DAY) granularity.

The utterance group has five columns. speaker_name is the speaker's name, speaker_role is the role — councillor, head of government, division director, and so on — content is the utterance body, content_length is the character count, and sequence is the order within the meeting. content_length is held as an independent column and used for automatic verification.

The source tracking group has three columns. source_offset is the position within the original text, source_url is the URL of the source minutes, and fetched_at is the acquisition timestamp. Holding source_url keeps traceback to the primary source available at all times.
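The four column groups above can be sketched as a typed record. This is a minimal sketch, not the actual machikarte schema definition: the column names and the grouping come from the text, while the Python types (for example, keeping municipality_code as a string so that leading zeros survive) and the names Speech, COLUMN_GROUPS, and column_names are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from datetime import date, datetime


@dataclass(frozen=True)
class Speech:
    """One row of the speeches table, i.e. one utterance (types are assumed)."""

    # Identifier group
    speech_id: str
    session_id: str
    municipality_code: str  # assumed string type so leading zeros are preserved
    municipality_name: str
    # Time-series group
    session_date: date
    session_title: str
    session_type: str
    # Utterance group
    speaker_name: str
    speaker_role: str
    content: str
    content_length: int
    sequence: int
    # Source tracking group
    source_offset: int
    source_url: str
    fetched_at: datetime


# Grouping as described in the text; the dictionary itself is illustrative.
COLUMN_GROUPS = {
    "identifier": ["speech_id", "session_id", "municipality_code", "municipality_name"],
    "time_series": ["session_date", "session_title", "session_type"],
    "utterance": ["speaker_name", "speaker_role", "content", "content_length", "sequence"],
    "source_tracking": ["source_offset", "source_url", "fetched_at"],
}


def column_names() -> list[str]:
    """Return the column names in declaration order."""
    return [f.name for f in fields(Speech)]
```

Keeping the grouping next to the record makes it easy to check mechanically that the four groups add up to the 15 columns stated above.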

For distributed processing optimization, we use a two-tier setup of partitioning and clustering. Setting session_date as the partition column makes period-specified queries read only the partitions of that period. Setting municipality_code as the clustering column sorts the data within each partition by municipality, so municipality-specified queries can skip storage blocks that do not contain the relevant rows. This design matches the combination of a time-series column and a high-cardinality column recommended by the BigQuery official documentation. A query that would read several hundred GB in a full scan without a partition filter reads only a few GB once it filters by period and municipality.
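As a rough illustration of the partition and clustering setup, the sketch below builds the corresponding BigQuery DDL and a filtered query as plain strings. The column types, the abbreviated column list, the function names, and the parameter names (@start_date, @end_date, @municipality_code) are assumptions; only the table name, the partition column, and the clustering column come from the text.

```python
TABLE = "correlate-workspace.machikarte.speeches"


def build_ddl(table: str = TABLE) -> str:
    """Return DDL with daily partitioning and clustering.

    The column list is abbreviated and the types are assumed.
    """
    return "\n".join(
        [
            f"CREATE TABLE IF NOT EXISTS `{table}` (",
            "  speech_id STRING NOT NULL,",
            "  municipality_code STRING NOT NULL,",
            "  session_date DATE NOT NULL,",
            "  sequence INT64,",
            "  content STRING",
            ")",
            # Partitioning on a DATE column yields one partition per day.
            "PARTITION BY session_date",
            # Clustering sorts rows inside each partition by municipality.
            "CLUSTER BY municipality_code",
        ]
    )


def build_filtered_query(table: str = TABLE) -> str:
    """Return a query that filters on both the partition and cluster columns."""
    return "\n".join(
        [
            "SELECT speech_id, session_date, sequence, content",
            f"FROM `{table}`",
            # The date range lets BigQuery prune partitions outside the period.
            "WHERE session_date BETWEEN @start_date AND @end_date",
            # The equality filter lets BigQuery skip blocks of other municipalities.
            "  AND municipality_code = @municipality_code",
            "ORDER BY session_date, sequence",
        ]
    )
```

With the google-cloud-bigquery client, the named parameters would typically be supplied through a QueryJobConfig carrying ScalarQueryParameter values; that wiring is omitted here to keep the sketch free of external dependencies.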

Quality assurance and the unique-identifier principle

The first layer of quality assurance is the content_length INTEGER column. The character count recorded in content_length is compared with a count computed independently from the acquired text, and any discrepancy is flagged. Text corruption, mid-way truncation, and stray whitespace become mechanically visible as content_length differences.
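The content_length check can be sketched as follows. This is an illustrative sketch under stated assumptions: it counts characters as Unicode code points via Python's len, and the function name and the labels it returns are hypothetical, not machikarte's actual verification rules.

```python
def check_content_length(content: str, content_length: int) -> str:
    """Classify the relation between the stored text and the recorded count.

    Assumes content_length counts Unicode code points, as Python's len() does.
    The returned labels are hypothetical categories for illustration.
    """
    actual = len(content)
    if actual == content_length:
        return "ok"
    # Surrounding whitespace alone explains the difference.
    if len(content.strip()) == content_length:
        return "extra_whitespace"
    # Fewer characters than recorded suggests the text was cut off.
    if actual < content_length:
        return "possible_truncation"
    return "mismatch"
```

For example, a row whose content is "議長 " with content_length 2 would be flagged as extra_whitespace, while a row whose content is "議" with the same recorded length would be flagged as possible_truncation.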

The second layer of quality assurance is retention of source_url. When aggregation results are in doubt, both readers and editors of labs articles can go back from source_url to the relevant location in the original minutes. This backward traceability is a basic requirement for extending the reproducibility practice of corpus research all the way to the public disclosure layer.

The third layer of quality assurance is the combination of the fetched_at TIMESTAMP with history tables. fetched_at holds "when it was acquired", the audit_history table holds "when it was updated", and the corrections table holds "where it was corrected". With these three records, one column and two history tables, changes in the state of the data remain traceable.

On top of that, the most important methodological point is the unique-identifier-code principle. Municipalities in Japan can share names. Fukushima-cho in Hokkaido and Fukushima-shi in Fukushima Prefecture share a base name, and Date-shi exists under the identical name in both Hokkaido and Fukushima Prefecture. Cross-references (joins with other tables) based on municipality-name slugs (normalized strings) invite mis-joins through same-name collisions. Incidents have occurred in the past in which name-based joining linked one municipality's utterances to another municipality.

The countermeasure is clear. Set municipality_code, the nationwide code for local public bodies maintained by the Ministry of Internal Affairs and Communications, as the primary key for municipalities, and perform all cross-references by code. Keep name, slug, and display_name as dependent columns for display purposes. This policy is not merely a technical convention; it is applied to every aggregation query in labs articles as a baseline practice for nationwide cross-cutting research on local assemblies.
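The difference between joining by slug and joining by code can be shown with a small in-memory example. This is a sketch under assumptions: the slug "fukushima" presumes a normalization that drops the -cho and -shi suffixes, the code values are illustrative and should be checked against the official list, and all function names are hypothetical.

```python
from collections import defaultdict

# Illustrative master data; codes are examples, not verified values.
MUNICIPALITIES = [
    {"municipality_code": "01332", "display_name": "Fukushima-cho", "slug": "fukushima"},
    {"municipality_code": "07201", "display_name": "Fukushima-shi", "slug": "fukushima"},
]


def find_slug_collisions(municipalities):
    """Return slugs that map to more than one municipality_code."""
    by_slug = defaultdict(list)
    for m in municipalities:
        by_slug[m["slug"]].append(m["municipality_code"])
    return {slug: codes for slug, codes in by_slug.items() if len(codes) > 1}


def join_by_slug(speeches, municipalities):
    """Unsafe join: a later municipality silently overwrites an earlier one."""
    index = {m["slug"]: m for m in municipalities}
    return [{**s, "display_name": index[s["slug"]]["display_name"]} for s in speeches]


def join_by_code(speeches, municipalities):
    """Safe join: municipality_code is unique, so each speech links correctly."""
    index = {m["municipality_code"]: m for m in municipalities}
    if len(index) != len(municipalities):
        raise ValueError("duplicate municipality_code in master data")
    return [
        {**s, "display_name": index[s["municipality_code"]]["display_name"]}
        for s in speeches
    ]
```

In this example a speech from the Hokkaido town, joined by slug, would be labelled with the Fukushima Prefecture city, which is the kind of mis-link the code-based join rules out.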

Three-stage publication granularity and ethics

Turning material into structural analysis is inseparable from publication granularity design. The linkage between machikarte and labs is operated under a three-stage publication granularity rule. Details are documented in the linkage document (docs/labs-public-asset-ppp-machikarte-cross-reference.md §1-3).

The first stage is municipality aggregation. It is used to describe nationwide distributions, such as how many of the 1,700 assemblies mention a specific keyword and what the mention rates are by prefecture. Most labs articles work at this stage. Individual municipality names are not published as top-or-bottom rankings; the article keeps the posture of reading the structure of the distribution.
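A mention rate by prefecture can be computed roughly as below. This is a minimal sketch: it assumes municipality_code is a string whose first two digits are the prefecture code, and it defines the rate as the share of municipalities in a prefecture with at least one mention, which is one possible definition rather than the one labs articles necessarily use. The function name is hypothetical.

```python
from collections import defaultdict


def mention_rate_by_prefecture(all_codes, mentioning_codes):
    """Return {prefecture_code: share of municipalities with a mention}.

    all_codes: municipality codes of every assembly in the corpus.
    mentioning_codes: codes of assemblies with at least one keyword mention.
    """
    mentioning = set(mentioning_codes)
    total = defaultdict(int)
    hits = defaultdict(int)
    for code in set(all_codes):
        prefecture = code[:2]  # first two digits identify the prefecture
        total[prefecture] += 1
        if code in mentioning:
            hits[prefecture] += 1
    # Output is keyed by prefecture only; no municipality names are exposed.
    return {p: hits[p] / total[p] for p in sorted(total)}
```

Because the output carries prefecture-level rates only, it stays within the first-stage rule of describing the distribution rather than ranking individual municipalities.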

The second stage is caucus aggregation. It aggregates mention tendencies by caucus within the same municipality. It is effective for policy-tendency analysis, but with small caucuses it can approach de facto identification of individuals and therefore calls for cautious handling. The threshold (the lower bound on the number of councillors in a caucus) and the anonymization decision are examined case by case for each labs article.
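One way to make the caucus-size threshold operational is to withhold counts for caucuses below a minimum size. The sketch below is hypothetical throughout: the speeches table described above has no caucus column, so the input rows are assumed to come from a separate roster, and the default threshold of 5 is a placeholder, since the text states that the threshold is decided case by case.

```python
from collections import defaultdict


def aggregate_by_caucus(rows, min_members=5):
    """Sum mentions per caucus, withholding caucuses below the threshold.

    rows: iterable of (caucus, speaker_name, mention_count), assumed to contain
    one entry per caucus member, including members with zero mentions.
    Returns {caucus: total mentions, or None when the count is withheld}.
    """
    members = defaultdict(set)
    mentions = defaultdict(int)
    for caucus, speaker_name, mention_count in rows:
        members[caucus].add(speaker_name)
        mentions[caucus] += mention_count
    return {
        caucus: (mentions[caucus] if len(members[caucus]) >= min_members else None)
        for caucus in sorted(members)
    }
```

Returning None rather than dropping the caucus keeps it visible that a value was withheld, which may matter when readers interpret the remaining totals.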

The third stage is verbatim citation. It cites the body of a specific utterance as it is. It enters the domain of research ethics; the relationship to public-figure status (councillors, heads of government), context, and policy issue is judged individually. The machikarte main site takes the position of not handling evaluative citation, and the labs article side handles it with the editorial committee's judgment framework in place.

The point of distinguishing three stages is neither that "lowering granularity is safe" nor that "raising granularity is useful". Even with the same data, a different reading yields a different social meaning. Granularity is a parameter to be chosen together with the analytical purpose, and that choice is itself part of the methodology.

References

machikarte — A search infrastructure of local assembly speeches across Japan (beta) — Institute of Social Vision Design (ISVD). ISVD

machikarte (GitHub) — schema, aggregation queries, license (MIT + CC BY 4.0) — Institute of Social Vision Design (ISVD). GitHub

Verification of a BERT-based classifier for role classification of Diet minutes speeches — Yutaro Miyaki and Yuzu Uchida. Journal of Japan Society for Fuzzy Theory and Intelligent Informatics, Vol. 37, No. 1, pp. 530-534

Introduction to Partitioned Tables — Google Cloud. BigQuery Documentation

Introduction to Clustered Tables — Google Cloud. BigQuery Documentation

Local Autonomy Act — Government of Japan. e-Gov Law Search

Nationwide Code for Local Public Bodies — Ministry of Internal Affairs and Communications. Ministry of Internal Affairs and Communications

Diet Minutes Search System — National Diet Library. National Diet Library