Engineering data-infra 52 min read

Designing 'medallion' for Automated Daily Earnings Reports — XBRL/PDF Scraping and YAML Configuration

A deep dive into medallion's design for automated earnings collection. Covers XBRL/PDF reconciliation and YAML-driven pipelines.

Published 2026-05-23 森本 拓見

The HUMAN_INPUT marker appearing in this article is a placeholder used by AI writing skills to indicate where a human should later fill in finalized values (e.g., <!-- HUMAN_INPUT: enter numeric value -->).

This article discloses the full design of “medallion”—a Python-based data collection tool that automatically retrieves earnings reports every morning and accumulates them in shared Google Sheets. This guide is intended for data engineers, senior engineers, and architects. For perspectives on ROI, cost comparison, and investment decisions, please refer to the sister pillar article: “Business Decisions and ROI for Automating Financial Data Collection.” Here, we focus on the implementation structure itself.

Three Key Takeaways:

  1. We use a multi-stage fallback across three sources (TDnet, corporate IR sites, and EDINET). Discrepancies between XBRL and PDF are absorbed by a “PDF-Primary, XBRL-Supplement” merge pattern.
  2. Retrieval strategies for each ticker (url_template, page_scrape, edinet_only) are isolated within YAML files. No ticker-specific logic is hardcoded.
  3. Daily launchd execution is narrowed down by the intersection of the “TDnet Disclosure List” and “tickers registered in master.meta.” Eliminating full-ticker scans significantly reduced processing time and error noise.

Overall Architecture of medallion

Defining Collection Targets: Three Sources for Automated Financial Data Retrieval

The “earnings reports” (Kessan Tanshin) collected by medallion are financial summary documents submitted by Japanese listed companies quarterly, semi-annually, and annually based on JPX regulations. Released 2 to 4 times a year per company, they include net sales, operating income, net income, balance sheet (BS) summaries, cash flow (CF) summaries, and earnings forecasts.

We use three sources with prioritized fallback:

SourceTargetStrengthsWeaknesses
TDnet (Tokyo Stock Exchange)Latest reports (Flash PDF)Reliable for same-day to 3-day-old flashes.Limited retention for past data.
Corporate IR SitesReports from the past 5-10 years (PDF)Long-term history.URL rules vary wildly by company.
EDINET (FSA Electronic Disclosure)Annual/Semi-annual reports (XBRL/PDF)Easy to handle as structured data.Slower than flashes (delays of up to a month).

Our design principle is a division of responsibilities: “Flashes from TDnet, historical data from IR sites, and structured data from EDINET.” Combining these three ensures the entire system doesn’t stop even if one source changes.

Data Flow: Fetch → Stage → Normalize → Write to Sheets

[TDnet / IR / EDINET]
       |
  fetch Phase
  (tdnet.py / ir_jp.sh / edinet_search.py)
       |
  stage Phase
  (kessan_pdf.py / kessan_xbrl.py / edinet_xbrl.py)
  ↓ JSON (with field-level provenance)
  kessan_merge.py (PDF-Primary + XBRL-Supplement)
       |
  write Phase
  (schema.py → sheets_io.py → GWS Sheets)

We divide the phases into fetch, stage, and write to make reruns safe. The bulk_runner.sh includes a --phase fetch|stage|write|all flag, allowing us to resume from a subsequent phase if one fails. For example, if a fetch stops due to a network error, we can simply restart from --phase stage.

Directory Structure and Major Modules

scripts/earnings/
├── fetcher/                  # Zero GWS dependency (future MCP server candidate)
│   ├── schema.py             # 63-column schema, period generation, JSON-to-row conversion
│   ├── kessan_merge.py       # PDF-Primary + XBRL-Supplement merge
│   ├── extractors/
│   │   ├── kessan_pdf.py     # pdfplumber-based PDF extraction
│   │   ├── kessan_xbrl.py    # arelle/xml.etree-based XBRL extraction
│   │   ├── edinet_xbrl.py    # EDINET annual report XBRL parser
│   │   └── ixbrl_html.py     # Inline XBRL (iXBRL) HTM parser
│   └── sources/
│       ├── tdnet.py          # TDnet list retrieval + PDF URL resolution
│       ├── ir_yaml.py        # Ticker-specific YAML config reading + PDF URL resolution
│       └── edinet_search.py  # EDINET API client

└── montage_integration/      # GWS dependent (orchestration layer)
    ├── sheets_io.py          # Sheets R/W wrapper
    ├── history_runner.sh     # Orchestrator for historical data of a single ticker
    ├── bulk_runner.sh        # Parallel execution for multiple tickers
    └── daily_tdnet_runner.sh # Daily execution wrapper (called by launchd)

Keeping fetcher/ free of GWS dependencies allows for its potential future use as an MCP server. By isolating Google Sheets writing logic within montage_integration/, the fetcher layer remains portable for use in other environments.


Design Driven by Ticker-Specific YAML Configuration

Structure of YAML Config Files: The Foundation for JPX TDnet Implementation

config/companies/{ticker}.yaml serves as the SSOT (Single Source of Truth) for each ticker’s settings. A minimal configuration looks like this:

# config/companies/7203.yaml (Example: Toyota Motor Corporation)
ticker: "7203"
company: "Toyota Motor"
fiscal_month: 3
sector: "Automotive"
accounting_std: "IFRS"
ir_url: "https://global.toyota/pages/global_toyota/ir/financial-results/"
short_term_template: "https://global.toyota/pages/global_toyota/ir/financial-results/{FY}_{QNUM}q_summary_jp.pdf"
yuho_template: null
scraper_strategy: "url_template"
notes: "IFRS consolidated. Operating income not listed in annual report (pre-tax profit only)."

{FY} represents the calendar year of the fiscal year-end (e.g., 2024), and {QNUM} represents the quarter number (1-4, where 4 is full-year). By embedding these in the URL template, we can generate period-specific PDF URLs without writing extra code.

For companies with inconsistent IR URL rules (like Sony Group, where patterns change by year), we use scraper_strategy: "page_scrape" to scan the IR page and identify PDFs by link text:

# config/companies/6758.yaml (Example: Sony Group Corporation)
ticker: "6758"
company: "Sony Group"
accounting_std: "IFRS"
scraper_strategy: "page_scrape"
ir_pages:
  - url: "https://www.sony.com/ja/SonyInfo/IR/library/presen/"
    pdf_link_pattern: "Consolidated Financial Results"
  - url: "https://www.sony.com/ja/SonyInfo/IR/library/presen/archive/"
    pdf_link_pattern: "Consolidated Financial Results"
ir_page_selector: "a[href*='.pdf']"

There are three choices for scraper_strategy:

StrategyUse Case
url_templateWhen IR PDF URLs can be templated like {FY}_{QNUM}q_summary.pdf.
page_scrapeWhen URL patterns vary by year and must be identified by scanning link text on IR pages.
edinet_onlyWhen direct retrieval from the IR site is difficult and must rely solely on EDINET.

Why We Chose Configuration-Driven Design: Avoiding Hardcoding

The first problem we faced in the initial implementation was that URL rules for corporate IR sites were too inconsistent to be written into the code.

While Toyota uses a template like {FY}_{QNUM}q_summary_jp.pdf, another company might use /ir/kessan/{YY}_{MM}/summary.pdf, and yet another might change its entire URL structure every year. Writing these as if statements in the code would require code modifications every time a new ticker is added, making testing increasingly complex.

By isolating ticker-dependent logic in YAML, the code’s sole responsibility becomes “reading YAML and resolving URLs.” ir_yaml.py works like this:

# Excerpt from ir_yaml.py (Pseudo-code for conceptual illustration)
def resolve_url(config: dict, fy: int, quarter: int) -> str | None:
    strategy = config.get("scraper_strategy", "url_template")
    
    if strategy == "url_template":
        tmpl = config.get("short_term_template")
        return tmpl.format(FY=fy, QNUM=quarter) if tmpl else None
    
    elif strategy == "page_scrape":
        return scrape_ir_page(config)  # Scan the IR page
    
    elif strategy == "edinet_only":
        return None  # Rely on EDINET fallback

Adding a ticker is as simple as adding one YAML file. No code release is required.

Operational Procedures for Adding/Removing Tickers

# 1. Create the YAML
cp config/companies/7203.yaml config/companies/9999.yaml
# Edit contents (ticker / company / ir_url / scraper_strategy)

# 2. Validate YAML settings
python3 scripts/earnings/fetcher/sources/ir_yaml.py \
  --ticker 9999 --project-root . dump

# 3. Verify operation with a single ticker (last 2 years)
bash scripts/earnings/montage_integration/history_runner.sh \
  9999 "Test Company" 3 --years 2

# 4. Confirm meta information is written to the Master Spreadsheet
# (Check the 'meta' tab in the GWS console)

As of May 2026, 54 tickers are registered in config/companies/. We do not explicitly distinguish between active registrations and deleted ones (YAML remains but unused) on the YAML files themselves.


Reconciling XBRL and PDF Discrepancies

Parsing Strategy for XBRL-Compliant Companies: The Core of XBRL Scraping Design

XBRL (eXtensible Business Reporting Language) is the structured financial reporting format used by EDINET and TDnet. Because tags have conceptual names (like NetSales or OperatingIncome), parsing precision is higher compared to PDF text extraction.

Our design policy for kessan_xbrl.py has two main points:

1. Prioritize arelle, fallback to xml.etree

try:
    from arelle import Cntlr, ModelManager, ModelXbrl
    ARELLE_AVAILABLE = True
except ImportError:
    ARELLE_AVAILABLE = False

arelle is a native XBRL parser that accurately handles namespaces, contexts, and units. However, because it can be difficult to install in some environments, we fallback to manual parsing using xml.etree.ElementTree when arelle is unavailable.

2. Resolve “Local Name → Field Name” via Concept Maps

XBRL tag names vary by accounting standard (JGAAP, IFRS, or US-GAAP). kessan_xbrl.py absorbs these using a prioritized concept map:

# Common map for JGAAP / IFRS / US-GAAP (in order of priority)
CONCEPT_PL: dict[str, list[str]] = {
    "revenue": [
        "NetSales",            # JGAAP Manufacturing
        "Revenues",            # JGAAP Finance
        "OperatingRevenues",   # JGAAP Electric/Gas
        "Revenue",             # IFRS
        "SalesAndOperatingRevenues",
    ],
    "operatingIncome": [
        "OperatingIncome",                          # JGAAP
        "ProfitLossFromOperatingActivities",        # IFRS
        "OperatingIncomeLoss",                      # US-GAAP
    ],
    ...
}

The system searches the list from top to bottom to see if a tag name exists in the XBRL and adopts the first match. Conceptual differences between JGAAP and IFRS (e.g., JGAAP has “Ordinary Income,” but IFRS does not) are handled within the concept map definitions.

Handling iXBRL (Inline XBRL)

The “Earnings Report XBRL” from JPX is different from standard XBRL ZIPs; it is in iXBRL format, where XBRL tags are embedded inline within HTML (.htm extension). ixbrl_html.py is a dedicated parser that uses lxml to parse the HTML and extract ix:nonfraction and ix:nonnumeric tags.

# Core logic of ixbrl_html.py (Pseudo-code for conceptual illustration)
def extract_ixbrl(html_content: str) -> dict:
    tree = etree.fromstring(html_content.encode(), parser=html_parser)
    for elem in tree.iter():
        # Target ix:nonfraction tags (numeric values)
        if elem.tag.endswith("}nonfraction"):
            name = elem.get("name", "")  # e.g., "jppfs_cor:NetSales"
            context_ref = elem.get("contextRef", "")
            value = clean_number(elem.text_content())
            yield (name, context_ref, value)

Parsing Strategy for PDF-Only Companies

kessan_pdf.py analyzes PDFs using pdfplumber. Earnings report PDFs generally follow standard layout rules:

  • Page 1: Summary (PL/BS summaries, dividends, forecasts)
  • Pages 5-6: BS (Assets, Liabilities)
  • Pages 7-8: PL (Profit and Loss Statement)
  • Page 10: CF (Cash Flow Statement)

However, several real-world issues arise:

  1. Full-width numbers: “123,456” needs conversion to half-width.
  2. ”△” (Triangle) symbols: These represent losses and must be treated as negative numbers.
  3. Repeated character rendering: Some PDFs render characters multiple times on top of each other (e.g., “IIIFRS”), causing standard substring searches for “IFRS” to fail.
def _collapse_repeated_chars(s: str) -> str:
    """Collapses consecutive identical characters into one.
    
    Some PDFs, like Sojitz (2768), render as "IIIIIFFFFFRRRRRSSSSS".
    Collapsing results in "IFRS," enabling substring search.
    """
    if not s:
        return s
    return re.sub(r"(.)\1{2,}", r"\1", s)

We chose pdfplumber because its API (page.extract_text() / page.extract_tables()) allows us to handle both text and table extraction with a single library. Since earnings reports contain both prose and tables, being able to handle both in one library was a priority. We haven’t conducted quantitative benchmarks against pymupdf or pdfminer. Specific weaknesses of pdfplumber (e.g., cell merging issues in IFRS PDFs) are addressed with individual workarounds within kessan_pdf.py.

Normalizing XBRL/PDF into a Unified Schema

kessan_merge.py follows the policy of “PDF-Primary, XBRL-Supplement”:

# Design policy for kessan_merge.py (Conceptual code)
def merge(pdf_data: dict, xbrl_data: dict) -> dict:
    result = copy.deepcopy(pdf_data)
    
    for field in SCALAR_FIELDS:
        pdf_val = get_scalar(pdf_data, field)
        if pdf_val is None:  # Fill with XBRL only if PDF data is missing
            xbrl_val = get_scalar(xbrl_data, field)
            if xbrl_val is not None:
                set_scalar(result, field, xbrl_val)
                result.setdefault("_xbrl_filled_fields", []).append(field)
    
    # Record provenance
    filled = result.get("_xbrl_filled_fields", [])
    result["_source"] = "pdf+xbrl" if filled else "pdf"
    return result

By recording "pdf" or "pdf+xbrl" in the _source field, we can track the origin of each value, which is vital for debugging data quality issues.

Finally, the map_to_*_row() functions in schema.py convert the merged JSON into the 63-column Google Sheets format.


Designing for IFRS and JGAAP Support

Differences in Data Structure Between IFRS and JGAAP

Listed companies in Japan use either JGAAP (Japanese GAAP) or IFRS (International Financial Reporting Standards). According to JPX data (published 2025-08-08), 300 companies have adopted, decided to adopt, or planned to adopt IFRS. In terms of market capitalization, these companies accounted for 49.8% of the share as of 2025-06-30. While JGAAP is still used by more companies, IFRS covers nearly half the market cap. Therefore, a pipeline that cannot handle both would miss major tickers.

Key differences:

ItemJGAAPIFRS
Ordinary IncomePresent (includes non-operating income/expense excluding financial costs)None (replaced by “Pre-tax Profit”)
Cost of Revenue BreakdownStepwise: Gross Profit → SGA → Operating IncomeDisclosed by nature or function (optional)
Non-controlling InterestsShown between Liabilities and EquityShown as part of Equity
Quarterly ReportsMandatory until FY2024/3. Transitioned to Semi-annual from Oct 2024.Same as above.

To absorb these differences, schema.py is designed to allow None for fields present in JGAAP but absent in IFRS. For example, ordinaryIncome has a value only for JGAAP companies and is always None for IFRS companies.

Normalization Strategy for Unified Format

The keys for the unified schema are based on JGAAP conceptual names, with IFRS mappings resolved via concept maps.

For instance, IFRS ProfitLossBeforeIncomeTaxes is interpreted as equivalent to JGAAP “Income Before Income Taxes” and stored in the pretaxIncome field. Since there is no equivalent to JGAAP OrdinaryIncome in IFRS, the ordinaryIncome field remains empty for IFRS companies.

The intent behind this design is to be “honest about absent concepts.” Filling the ordinaryIncome field with IFRS ProfitLoss would lead to incorrect conclusions during comparative analysis.

Switching Accounting Standards via YAML

The accounting_std field in config/companies/{ticker}.yaml determines the parser’s behavior:

# Context selection in kessan_xbrl.py (Conceptual code)
def select_context(contexts: list[Context], accounting_std: str) -> Context:
    if accounting_std == "IFRS":
        # IFRS uses "CurrentYearDuration" + Consolidated (no _ConsolidatedMember)
        preferred = [c for c in contexts if "CurrentYearDuration" in c.id]
    else:
        # JGAAP uses "CurrentYearDuration" or "FilingDateInstant"
        preferred = [c for c in contexts if "CurrentYearDuration" in c.id]
    return preferred[0] if preferred else contexts[0]

This delegation to the accounting_std field allows for parallel processing of IFRS and JGAAP companies.


Design for Daily launchd Execution

launchd plist Design: Automated 5 AM Execution

We use macOS launchd to run daily_tdnet_runner.sh every morning at 05:00. The core of the plist:

<!-- config/launchd/com.medallion.tdnet-daily.plist -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.medallion.tdnet-daily</string>

    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/bash</string>
        <string>/path/to/medallion/scripts/earnings/montage_integration/daily_tdnet_runner.sh</string>
    </array>

    <!-- Run every morning at 05:00 JST -->
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>5</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>

    <key>StandardOutPath</key>
    <string>/Users/t.morimoto/Library/Logs/medallion-tdnet-daily.out.log</string>
    <key>StandardErrorPath</key>
    <string>/Users/t.morimoto/Library/Logs/medallion-tdnet-daily.err.log</string>
</dict>
</plist>

We chose 05:00 to reflect disclosures from the previous business day in Sheets before the market opens at 9:00. Designed as a pre-market batch, this timing isn’t dictated by TDnet’s own hours (which begin at 9:00 on weekdays). A 05:00 start provides a four-hour buffer for PDF/XBRL retrieval, analysis, and Sheets writing, ensuring data is ready for morning meetings and analysis.

Four-Step Flow of daily_tdnet_runner.sh

The core of the daily execution is a design that avoids full-ticker scans, processing only the intersection of TDnet disclosures and tickers registered in master.meta:

1. Retrieve TDnet disclosure list
   → output/tdnet/YYYY-MM-DD/disclosures.json
   
2. Read registered tickers from master.meta
   (gws sheets +read --range "meta!A2:E")
   
3. Calculate intersection (TDnet ∩ master.meta)
   → output/tdnet/YYYY-MM-DD/medallion_tickers.txt
   
4. Call bulk_runner.sh (parallel=8)
   → Runs history_runner.sh for each ticker
# Intersection logic in daily_tdnet_runner.sh (Excerpt)
python3 - <<PYEOF
import json

# Today's TDnet disclosure list
with open('disclosures.json') as f:
    disclosures = json.load(f)
tdnet_tickers = {item['ticker'] for item in disclosures}

# Tickers registered in master.meta
meta_map = {r['ticker']: r for r in meta_records}

# Intersection
intersect = sorted(tdnet_tickers & meta_map.keys())
PYEOF

Compared to the previous design (manually entering schedules monthly in an earningsSchedule sheet), this intersection method eliminates human maintenance. Even if someone forgets to update the schedule sheet, the system automatically detects disclosures from the actual TDnet list.

Retry Design for Retrieval Failures

daily_tdnet_runner.sh responds based on the following failure categories:

Failure TypeAction
TDnet site unresponsive (HTTP error / timeout)Normal exit (exit 0); launchd retries the next morning.
master.meta unreadable (GWS auth expired, etc.)Error exit (exit 1); launchd records the error.
bulk_runner failure (individual ticker fetch error)Recorded in warning logs; retried in the next run.

Swallowing TDnet failures with exit 0 is an intentional design. Because launchd’s StartCalendarInterval maintains the “next schedule” even if a run fails, the retry will happen automatically the following morning.

Alert Implementation

At this time, Slack or email notifications are not implemented. There is no alert code in daily_tdnet_runner.sh. It is listed as a future feature in the README; currently, monitoring logs via tail -f is the only way to confirm status.

Logs are output to three files: ~/Library/Logs/medallion-tdnet-daily.log, .out.log, and .err.log. You can check them using standard commands like tail -f or grep ERROR.


Google Sheets Writing Design

Sheets API Integration: Auth, Rate Limits, and Batch Writing

sheets_io.py is a wrapper that calls the Google Workspace CLI (gws command) as a subprocess. We use the CLI instead of the direct Sheets API to delegate GWS authentication profile management to the CLI.

Batch writing design:

# Writing logic in sheets_io.py (Conceptual code)
def write_rows(sheet_id: str, tab: str, rows: list[list]) -> None:
    # Coerce numeric strings to numbers (GWS CLI doesn't distinguish between strings and numbers)
    coerced = coerce_numeric(rows)
    
    gws(["sheets", "batchUpdate",
         "--spreadsheet", sheet_id,
         "--range", f"{tab}!A1",
         "--values", json.dumps(coerced)])

There is no explicit rate-limiting design in sheets_io.py. However, the EDINET API client (edinet_search.py) has a _rate_limit() method using time.sleep. While Sheets API requests concentrate during ticker processing, they are handled serially (though tickers are processed with --parallel 8), reducing the risk of excessive requests.

Incremental Update Design

medallion uses “incremental upsert” rather than overwriting. Using the Period column (e.g., 2024.03-Q1) as the primary key, existing rows are updated, and new rows are appended to the end.

The core of the schema lies in the HEADERS_* constants in schema.py. Each of the eight tabs (PL, BS, CF, per_share, forecast, segments, plans, meta) has its own header constant, and the map_to_*_row() functions convert the merged JSON into flat row arrays.

# From schema.py (Actual constants)
HEADERS_PL = [
    "Ticker", "Period",
    "Revenue", "Op_Income", "Ordinary_Income", "Pretax_Income",
    "Net_Income", "Comprehensive_Income",
    "Cost_of_Revenue", "Gross_Profit", "SGA",
    "Non_Op_Income", "Non_Op_Expense",
    "Extraordinary_Gain", "Extraordinary_Loss",
    "Income_Tax", "Minority_Interest",
    "Interest_Expense",  # Used for Interest_Coverage in metrics
]

Design Decisions: Rationale Behind the Structure

Why Write to Sheets Instead of a Database?

“Why not use PostgreSQL or BigQuery for financial data storage?” is a valid question.

We chose Sheets because the primary users (non-engineer members at Yakumo) are already proficient with Google Sheets and can perform analysis, aggregation, and visualization immediately. Using a database would require someone capable of writing SQL.

Additionally, the cost of integrating with Google Apps Script (GAS) is nearly zero. We have already implemented metrics calculation, automated graph generation, and Slack notifications via the Sheets → GAS pipeline.

There are clear downsides: Sheets’ limit of 10 million cells, contention during concurrent writes, and the difficulty of complex queries. Currently, medallion tracks 54 tickers, writing data across 8 tabs. For each ticker/period, the 6 single-row tabs (PL, BS, CF, per_share, forecast, plans) take up 81 columns, and the segments tab averages 50 cells (5 segments × 10 columns). This totals about 130 cells per entry. With 5 periods a year (4 quarters + 1 full-year) over 10 years, that’s about 6,500 cells per ticker. Even if we scale to our MVP scope of 231 tickers (Nikkei 225 + TOPIX Core 30 + TOPIX 100), it would total about 1.5 million cells, or only 15% of the 10-million-cell limit. Annual additions would be about 150,000 cells (231 tickers × 5 periods × 130 cells), leaving room for over 50 years of data. Sheets’ cell limit is not a bottleneck for our current scope.

Why YAML Configuration-Driven?

The decision to put settings in external files versus the code depends on whether the frequency of change differs between the two.

Corporate IR URLs change about once a year, but code releases should ideally be more deliberate. Having to edit, test, and PR code every time a URL changes is excessive. By using YAML, a URL change is just a one-line edit. No code review or CI is needed.

This follows the principle that “differences in change frequency should determine design.” The same principle applies to the header constants in schema.py.

Why Not Process XBRL and PDF in Parallel?

Retrieving XBRL and PDF simultaneously during the fetch phase might seem efficient, but we didn’t adopt it.

Reasons:

  1. Dependencies: Although XBRL and PDF are retrieved from different sources, the merge (kessan_merge.py) requires both to be present. Parallelization would introduce intermediate states where only one is available, complicating retry logic.

  2. Server Load: We want to avoid sending high-frequency requests to TDnet and EDINET from the same IP. tdnet.py is conservatively configured with TIMEOUT_SEC = 15 and MAX_RETRIES = 1.

Parallelization across tickers (--parallel 8) is implemented. While XBRL and PDF for a single ticker are retrieved serially, multiple tickers are processed in parallel. In practice, 50 tickers take about 20 minutes. Since daily_tdnet_runner.sh only processes the intersection of disclosures and registered tickers, the volume is manageable even on busy earnings days.


Pitfalls and Implementation Notes

Pitfall 1: JPX Reports use “iXBRL”

XBRL from TDnet is in iXBRL (inline XBRL) format as .htm files. XBRL tags are embedded inside HTML.

arelle handles this automatically, but for manual parsing, you must search for ix:nonfraction and ix:nonnumeric tags using lxml. Standard XBRL parsers will fail here.

Pitfall 2: Overlapping Text in PDFs

Some companies use a PDF format where characters are rendered multiple times on top of each other. pdfplumber might extract text like "IIIIIFFFFFRRRRRSSSSS".

We address this by collapsing consecutive identical characters into one with _collapse_repeated_chars() before searching for substrings. If text extraction seems off, check for this first.

Pitfall 3: Operating Income for IFRS Companies

Earnings reports for IFRS companies (e.g., Toyota, Sony) sometimes lack “Operating Income.” Since IFRS doesn’t standardize the definition of operating income, companies may define it themselves or only disclose it in annual reports.

We handle this by falling back to XBRL from EDINET annual reports (edinet_xbrl.py), but Op_Income may be None if only the flash PDF is available. This is an intentional design choice.

Pitfall 4: Transition to Semi-annual Reports

Starting after FY2024/3, Japanese companies are no longer required to submit quarterly reports and have transitioned to a new semi-annual report system. This affects document type codes (docTypeCode) in edinet_search.py:

# EDINET Document Type Codes (from edinet.json)
# Old System
"Quarterly Report Q1/Q3": "140",  # Until FY2024/3
# New System
"Semi-annual Report H1": "160",   # After Oct 2024

When retrieving H1 data from 2024 onwards, use 160 instead of 140. Logic to switch document types based on the date and the accounting_std YAML field is required.

Pitfall 5: launchd Sleep Behavior

If a Mac is asleep at the scheduled time, launchd’s StartCalendarInterval will trigger the job immediately upon waking. If the Mac is closed over the weekend, the 05:00 jobs for Saturday and Sunday might all run together on Monday morning.

Because daily_tdnet_runner.sh outputs disclosures.json for each specific date, running it multiple times for the same day is idempotent. However, it will trigger redundant writes, assuming the upsert logic in sheets_io.py works correctly.


Summary: medallion Implementation Checklist

Here are the principles and a checklist for data engineers building similar financial data pipelines.

Design Summary

DecisionApproachRationale
Data SourcesMulti-stage fallback: TDnet → IR → EDINETEnsures coverage and avoids single-source dependency.
Ticker ConfigOne YAML file per tickerAbsorbs ticker additions and URL changes without code changes.
XBRL/PDF ReconciliationPDF-Primary, XBRL-Supplement mergePDFs have higher immediacy for flashes.
Accounting Std. DifferencesYAML accounting_std + Concept MapsAvoids if statements in the code.
Daily ExecutionIntersection of TDnet ∩ master.metaReduces error noise by eliminating full scans.
Data StoreGoogle SheetsPrioritizes user proficiency and GAS integration.

Implementation Checklist

  • Create one config/companies/{ticker}.yaml and verify with ir_yaml.py --ticker {n} dump.
  • Choose scraper_strategy from url_template, page_scrape, or edinet_only.
  • Add tag names for relevant accounting standards to CONCEPT_PL and CONCEPT_BS in kessan_xbrl.py.
  • Ensure HEADERS_* in schema.py match the spreadsheet headers.
  • Verify single ticker operation with history_runner.sh --years 2.
  • Configure StartCalendarInterval in the launchd plist and run launchctl load.
  • Run daily_tdnet_runner.sh --dry-run to check the intersection results.
  • Confirm log output in ~/Library/Logs/medallion-tdnet-daily.log.

Automating business analysis and investment decisions using medallion data is covered in the sister pillar: “Business Decisions and ROI for Automating Financial Data Collection.” This article is for engineers interested in the tech design, while the sister pillar is for readers interested in business data utilization.