
[Feature] Standardized Financial Statements From SEC Company Facts API#7416

Open
deeleeramone wants to merge 27 commits into develop from feature/sec-company-facts


@deeleeramone commented Mar 17, 2026

This PR adds a rule-based system that transforms raw SEC XBRL Company Facts data into standardized, cross-validated financial statements.

The system maps 1,190 distinct XBRL tag–namespace pairs drawn from 15 years of US GAAP and IFRS taxonomy vintages into 250 standardized line items across three financial statements (income statement, balance sheet, cash flow statement) and four industry-specific templates (industrial, financial, diversified, insurance).

The approach resolves three fundamental challenges in XBRL standardization: temporal tag evolution, cross-filer tag variation, and structural differences across industries. To address pervasive scope mismatches in reported XBRL data—where tags of different consolidation scope are combined within a single filing—the system employs a multi-stage pipeline comprising priority-ordered tag chain resolution, algebraic imputation from accounting identities, hierarchical statement articulation, targeted scope-mismatch corrections, and three-tier identity enforcement with full provenance tracking. For quarterly data, the system reconstructs Q4 from audited annual totals minus reported interim quarters, following the canonical SEC reporting structure where no standalone Q4 filing exists.
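The Q4 reconstruction described above reduces to simple arithmetic for duration items. The following is a minimal sketch of that idea, not the PR's implementation; the helper name `derive_q4` and the numbers are hypothetical.

```python
from typing import Optional


def derive_q4(annual: float, quarters: list[Optional[float]]) -> Optional[float]:
    """Return Q4 = FY - (Q1 + Q2 + Q3), or None if any input is missing.

    Hypothetical helper: applies only to duration items (income statement,
    cash flow). Balance sheet items are point-in-time and need no derivation.
    """
    if len(quarters) != 3 or any(q is None for q in quarters):
        return None
    return annual - sum(quarters)


# Illustrative numbers only, not taken from a real filing.
fy_revenue = 400.0
q1_to_q3 = [90.0, 95.0, 100.0]
print(derive_q4(fy_revenue, q1_to_q3))
```

Returning `None` when an interim quarter is missing avoids attributing the whole shortfall to Q4.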

It resolves #6654 and aims to provide high-quality standardization of the three financial statements, following up on the conversation in #7411.

Endpoints

SEC has been added as a provider to:

  • equity.fundamental.balance
  • equity.fundamental.balance_growth
  • equity.fundamental.cash
  • equity.fundamental.cash_growth
  • equity.fundamental.income
  • equity.fundamental.income_growth

All have periods: ["annual", "quarterly", "ttm"]
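The PR does not spell out how the "ttm" period is computed. A common convention, assumed here, is to sum the four most recent quarters for duration items and take the latest value for instant (balance sheet) items. The function name below is hypothetical.

```python
def trailing_twelve_months(values: list[float], period_type: str) -> float:
    """Compute a TTM value from quarterly values ordered oldest to newest.

    Assumption: duration items are summed over the last four quarters,
    instant items simply use the most recent quarter.
    """
    if len(values) < 4:
        raise ValueError("at least four quarters are required")
    if period_type == "instant":
        return values[-1]
    return sum(values[-4:])


# Illustrative quarterly values, oldest first.
print(trailing_twelve_months([10.0, 20.0, 30.0, 40.0, 50.0], "duration"))
```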

Additional parameters are:

pit_mode: bool - When True, returns data as originally reported at the time of filing, without subsequent restatements or amendments. For annual data, uses the original 10-K values.

include_preliminary: bool - Whether to include preliminary data from 8-K filings for periods not yet reported on 10-Q/K.

use_cache: bool - Whether to cache HTTP requests to the CompanyFacts API. Cached responses are held in memory for four hours.
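The pit_mode behaviour described above can be pictured as choosing between the first and the last filing that reported a given period. This is a minimal sketch under that assumption, not the PR's code; `select_facts` is a hypothetical name, and the `end`, `filed`, and `val` keys mirror the fields of Company Facts entries.

```python
from collections import defaultdict


def select_facts(facts: list[dict], pit_mode: bool) -> dict[str, float]:
    """Pick one value per period end date.

    pit_mode=True keeps the earliest-filed value (as originally reported);
    pit_mode=False keeps the latest-filed value (including restatements).
    """
    by_period: dict[str, list[dict]] = defaultdict(list)
    for fact in facts:
        by_period[fact["end"]].append(fact)
    pick = min if pit_mode else max
    # ISO-8601 dates sort correctly as plain strings.
    return {
        end: pick(group, key=lambda f: f["filed"])["val"]
        for end, group in by_period.items()
    }


# Illustrative facts: the same period reported twice, the second time restated.
facts = [
    {"end": "2022-12-31", "filed": "2023-02-15", "val": 100.0},
    {"end": "2022-12-31", "filed": "2024-02-14", "val": 98.0},
]
print(select_facts(facts, pit_mode=True), select_facts(facts, pit_mode=False))
```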

If a filer does not submit interim reports to the SEC (many 20-F and 40-F companies), an error is raised indicating that interim data is not available.

Detailed metadata (source provenance, tag details, etc.) is returned in OBBject.extra["results_metadata"].

The company_facts utility returns the standardized statements in a narrow, long format. The ODP endpoints convert that to a wide format that matches the other providers, and separate out the metadata by returning an AnnotatedResult.
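The long-to-wide conversion can be illustrated with a small pivot. This is a sketch only: the record keys (`period_ending`, `tag`, `value`, `source`) and the function name are assumptions, and the real endpoints wrap the result in an AnnotatedResult rather than returning a tuple.

```python
from collections import defaultdict


def long_to_wide(records: list[dict]) -> tuple[list[dict], dict]:
    """Pivot long records into one row per period, splitting out provenance."""
    rows: dict[str, dict] = defaultdict(dict)
    metadata: dict[str, dict] = defaultdict(dict)
    for rec in records:
        rows[rec["period_ending"]][rec["tag"]] = rec["value"]
        metadata[rec["period_ending"]][rec["tag"]] = rec["source"]
    wide = [{"period_ending": p, **rows[p]} for p in sorted(rows)]
    return wide, dict(metadata)


# Illustrative long-format records.
records = [
    {"period_ending": "2023-12-31", "tag": "total_revenue", "value": 100.0,
     "source": "us-gaap:Revenues"},
    {"period_ending": "2023-12-31", "tag": "total_gross_profit", "value": 40.0,
     "source": "imputed: total_revenue + -total_cost_of_revenue"},
    {"period_ending": "2022-12-31", "tag": "total_revenue", "value": 90.0,
     "source": "us-gaap:Revenues"},
]
wide, meta = long_to_wide(records)
print(wide)
```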


Implementation Details

The system is organized into a Python package and a set of JSON schema
files, supported by a public API layer and a taxonomy maintenance tool:

  • statement_schema/schemas/ — The declarative schema, split into four
    JSON files: _meta.json (metadata, detection signals), plus one file per
    statement — income_statement.json, balance_sheet.json, cash_flow.json
    (~950 KB in total).
  • statement_schema/ — The runtime engine package: _types.py
    (dataclasses, constants), _detection.py (company-type classification,
    filing dates, fiscal metadata), _extraction.py (row-level XBRL
    extraction, reference filing computation), _rules.py (imputation and
    verification rule definitions), _imputation.py (multi-pass imputation,
    hierarchical articulation, identity enforcement), _schema.py
    (StatementSchema class orchestrating the pipeline), and __init__.py.
  • utils/company_facts.py — The public API: wraps the engine, merges
    configured multi-CIK histories, and produces long-format records with full
    provenance per line item per period.
  • xbrl_taxonomy_helper.py — Taxonomy infrastructure: programmatic
    access to FASB, SEC, and IFRS Foundation taxonomies for schema maintenance.

The system has been evaluated against a curated, edge-case-weighted validation corpus of 970 companies spanning industrials, banks, insurers, conglomerates, REITs, IFRS filers, discontinued-operations cases, and multi-CIK entities.

The schema is split across four JSON files in statement_schema/schemas/:

statement_schema/schemas/
├── _meta.json              # version, generated, taxonomy_sources, detection signals
│   ├── version: "2.0"
│   ├── generated: "2026-03-17"
│   ├── taxonomy_sources
│   │   ├── us_gaap: { years: [2011, 2026], tags_indexed: 3753 }
│   │   └── ifrs: { years: [2020, 2025], tags_indexed: 3979 }
│   └── detection
│       ├── insurance_is_signals: [8 tags]
│       ├── insurance_bs_signals: [5 tags]
│       ├── financial_signals: [16 tags]
│       ├── industrial_signals: [5 tags]
│       ├── diversified_signals: [5 tags]
│       └── min_financial_signals: 2
├── income_statement.json   # {industrial: [55 rows], financial: [74 rows],
│                           #  diversified: [52 rows], insurance: [56 rows]}
├── balance_sheet.json      # {industrial: [73 rows], financial: [68 rows],
│                           #  diversified: [73 rows], insurance: [68 rows]}
└── cash_flow.json          # {industrial: [57 rows], financial: [62 rows],
                            #  diversified: [57 rows], insurance: [62 rows]}
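The detection block suggests a threshold rule: a filer counts as financial once enough signal tags appear among its reported tags. The sketch below illustrates only that threshold idea; the real engine distinguishes four company types, and the signal names here are placeholders, not the tags stored in _meta.json.

```python
def looks_financial(reported_tags: set[str], detection: dict) -> bool:
    """Return True when enough financial signal tags are present.

    Hypothetical helper: counts the overlap between a filer's reported tags
    and the configured signals, then compares it to the minimum.
    """
    hits = reported_tags & set(detection["financial_signals"])
    return len(hits) >= detection["min_financial_signals"]


# Placeholder signal names; the actual list holds 16 XBRL tags.
detection = {
    "financial_signals": ["SIGNAL_A", "SIGNAL_B", "SIGNAL_C"],
    "min_financial_signals": 2,
}
print(looks_financial({"SIGNAL_A", "SIGNAL_B", "Revenues"}, detection))
```

Requiring at least two signals guards against a single stray tag flipping the template.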

A standardized line item appears as a row in the schema like:

{
  "tag": "total_gross_profit",
  "label": "Total Gross Profit",
  "description": "Aggregate revenue less cost of goods and services sold...",
  "parent": null,
  "sequence": 8,
  "factor": "+",
  "balance": "credit",
  "unit": "monetary",
  "period_type": "duration",
  "xbrl_tags": [
    { "tag": "GrossProfit", "namespace": "us-gaap" },
    { "tag": "GrossProfit", "namespace": "ifrs-full", "first_year": 2020, "last_year": 2025 }
  ]
}
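Priority-ordered tag chain resolution over a row like the one above might look as follows. This is a sketch, not the engine's code: it assumes `first_year` and `last_year` bound the fiscal years in which a candidate is valid, and that reported facts are keyed as `namespace:tag`. The function name is hypothetical.

```python
from typing import Optional


def resolve_tag_chain(
    xbrl_tags: list[dict], facts: dict[str, float], fiscal_year: int
) -> Optional[tuple[float, str]]:
    """Return (value, source) for the first valid candidate that was reported."""
    for candidate in xbrl_tags:
        # Skip candidates outside their (assumed) validity window.
        if candidate.get("first_year", fiscal_year) > fiscal_year:
            continue
        if candidate.get("last_year", fiscal_year) < fiscal_year:
            continue
        key = f'{candidate["namespace"]}:{candidate["tag"]}'
        if key in facts:
            return facts[key], key
    return None


chain = [
    {"tag": "GrossProfit", "namespace": "us-gaap"},
    {"tag": "GrossProfit", "namespace": "ifrs-full", "first_year": 2020, "last_year": 2025},
]
print(resolve_tag_chain(chain, {"ifrs-full:GrossProfit": 55.0}, 2023))
```

The returned key doubles as a provenance string for the resolved value.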

When an accounting identity fails to hold, a warning is broadcast and the full details of the violation are included with the other metadata.
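As an illustration of an identity check that warns and records details, consider the balance sheet equation. This is a sketch under assumptions: the field names other than total_equity_and_nci, the tolerance, and the shape of the violation record are all hypothetical.

```python
import warnings
from typing import Optional


def check_balance_identity(row: dict, tolerance: float = 0.5) -> Optional[dict]:
    """Check total_assets == total_liabilities + total_equity_and_nci.

    Returns None when the identity holds within `tolerance`; otherwise emits
    a warning and returns the violation details for the metadata payload.
    """
    expected = row["total_liabilities"] + row["total_equity_and_nci"]
    difference = row["total_assets"] - expected
    if abs(difference) <= tolerance:
        return None
    violation = {
        "identity": "total_assets = total_liabilities + total_equity_and_nci",
        "period_ending": row["period_ending"],
        "reported": row["total_assets"],
        "expected": expected,
        "difference": difference,
    }
    warnings.warn(
        f"Identity violation for {row['period_ending']}: off by {difference}"
    )
    return violation
```

A caller would collect the returned records alongside the provenance metadata, so a failed identity is visible without interrupting the request.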

Source provenance strings help the user evaluate where each standardized row came from; most rows are direct XBRL facts.

| Source Pattern | Meaning |
|---|---|
| us-gaap:Revenues | Direct XBRL extraction at reference filing date |
| us-gaap:Revenues(fallback) | Cross-vintage extraction from a non-reference filing |
| imputed: total_revenue + -total_cost_of_revenue | Algebraic imputation from accounting identity |
| imputed-rollup: sum of explicitly mapped children | Hierarchical parent rollup from children |
| imputed-plug: balancing remainder | Synthetic plug to maintain parent-child articulation |
| identity-enforced: ... | Value overridden during identity enforcement |
| corrected: total_gross_profit - total_operating_income | OpEx/COGS disambiguation |
| us-gaap:NetIncomeLoss(NCI-corrected) | NI switched from NCI-inclusive to parent-only |
| us-gaap:ProfitLoss(disc-adjusted) | NI adjusted from total to continuing-ops scope |
| us-gaap:NetIncomeLoss(identity_lock:cash_flow) | Direct XBRL value locked to a cross-statement identity target |
| reconciled: total_equity_and_nci - noncontrolling_interests | Equity decomposition override |
| Q4: FY − (Q1+Q2+Q3) | Q4 derived from annual minus first three quarters |
| H2: FY − H1 | H2 derived from annual minus first-half semi-annual filing |
| derived: cash_at_end_of_period(2023-09-30) | Beginning cash derived from prior period's ending |
| standalone | Extracted without filing-date constraint |
| imputed-zero: ... | Suspect zero flagged for review |
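A consumer could bucket these provenance strings into coarse categories with prefix and suffix matching. The mapping below is an assumption made for illustration; in particular, which patterns count as "Scope correction" or "Direct XBRL" is a guess and may differ from how validate_corpus groups them.

```python
# Assumed mapping from source-string prefixes to report categories.
PREFIX_CATEGORIES = [
    ("imputed-rollup:", "Hierarchical rollup"),
    ("imputed-plug:", "Balancing plug"),
    ("imputed-zero:", "Suspect zero"),
    ("imputed:", "Algebraic impute"),
    ("identity-enforced:", "Identity enforced"),
    ("corrected:", "Scope correction"),
    ("reconciled:", "Scope correction"),
    ("Q4:", "Q4/H2 derivation"),
    ("H2:", "Q4/H2 derivation"),
    ("derived:", "Period derived"),
]
# Assumed suffixes that mark a scope-corrected direct tag.
SCOPE_SUFFIXES = ("(NCI-corrected)", "(disc-adjusted)")


def categorize(source: str) -> str:
    """Map a provenance string to a coarse category (hypothetical helper)."""
    for prefix, category in PREFIX_CATEGORIES:
        if source.startswith(prefix):
            return category
    if source.endswith(SCOPE_SUFFIXES):
        return "Scope correction"
    # Everything else (plain tags, fallback, standalone) is treated as direct.
    return "Direct XBRL"


print(categorize("us-gaap:ProfitLoss(disc-adjusted)"))
```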

Pasted below is the summary output of the validate_corpus script (included in the repository, but not shipped with the package).

COMPANY TYPE BREAKDOWN
--------------------------------------------------
  diversified: 231
  financial: 68
  industrial: 619
  insurance: 52

SOURCE PROVENANCE — ANNUAL
------------------------------------------------------------------------------------------
  Category                    Income Stmt   Balance Sheet     Cash Flow         Total
  --------------------------------------------------------------------------------
  Direct XBRL                     409,025         428,544       428,800     1,266,369
  Algebraic impute                 13,942           7,186           862        21,990
  Hierarchical rollup              22,757          41,178        19,326        83,261
  Balancing plug                   67,973          38,237        48,232       154,442
  Scope correction                  8,210             623             1         8,834
  Identity enforced                    28             161         1,243         1,432
  Period derived                        0               0        11,550        11,550
  Suspect zero                      7,042           9,320         4,539        20,901
  --------------------------------------------------------------------------------
  TOTAL                           528,977         525,249       514,553     1,568,779

  Direct XBRL: 1,266,369 (80.7%)  |  Q4/H2: 0 (0.0%)  |  Other computed: 302,410 (19.3%)

SOURCE PROVENANCE — QUARTERLY
------------------------------------------------------------------------------------------
  Category                    Income Stmt   Balance Sheet     Cash Flow         Total
  --------------------------------------------------------------------------------
  Direct XBRL                   1,025,495       1,447,322     1,066,304     3,539,121
  Q4/H2 derivation                319,656             948       304,247       624,851
  Algebraic impute                 77,527          28,308         3,312       109,147
  Hierarchical rollup              87,623         155,647        70,999       314,269
  Balancing plug                  259,541         136,890       162,110       558,541
  Scope correction                 32,461           2,729            27        35,217
  Identity enforced                   280             997         3,939         5,216
  Period derived                        0               0        46,968        46,968
  Suspect zero                     31,286          36,147        33,345       100,778
  --------------------------------------------------------------------------------
  TOTAL                         1,833,869       1,808,988     1,691,251     5,334,108

  Direct XBRL: 3,539,121 (66.3%)  |  Q4/H2: 624,851 (11.7%)  |  Other computed: 1,170,136 (21.9%)
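The percentages on the two summary lines follow directly from the category totals. The short check below recomputes them from the figures in the tables above; the function name is hypothetical.

```python
def provenance_shares(direct: int, q4_h2: int, total: int) -> dict[str, float]:
    """Return each group's share of all rows, in percent, rounded to 0.1."""
    other = total - direct - q4_h2  # everything computed by other means
    groups = {"Direct XBRL": direct, "Q4/H2": q4_h2, "Other computed": other}
    return {name: round(100 * count / total, 1) for name, count in groups.items()}


# Totals copied from the annual and quarterly tables.
annual = provenance_shares(direct=1_266_369, q4_h2=0, total=1_568_779)
quarterly = provenance_shares(direct=3_539_121, q4_h2=624_851, total=5_334_108)
print(annual)
print(quarterly)
```

The quarterly shares add up to 99.9 rather than 100.0 because each is rounded independently.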

See the file, ./openbb_sec/utils/STATEMENT_SCHEMA_README.md, for a detailed technical explanation of the methodology.

Tests

Unit tests for the new Fetchers all share the same Company Facts file (BLK - newest CIK, ~450 KB) that is stored in the ./tests/record folder.

Additional unit tests for the statement_schema logic are under ./tests/test_company_facts.

The script, validate_corpus.py, downloads all files (3+ GB), runs the extraction pipeline for annual and quarterly periods, and generates TXT and JSON report files that highlight any identity verification failures and provide detailed summary statistics.

@github-actions bot added the enhancement, platform, and v4 labels Mar 17, 2026
@deeleeramone deeleeramone marked this pull request as ready for review March 21, 2026 06:52

@piiq left a comment


As with the OECD refactor, I've asked Codex to review this against KISS and DRY principles. This usually gives a good path to consider for one improvement pass.

From my own judgement I think this is good (despite being prohibitively complex). The thing I strongly suggest adding is a BibTeX-style citation for the whitepaper. We should actually publish it on arXiv and point to the repo source code from there. It's not yet good enough to be shared in an "academic" setting, but it is good enough for an arXiv publication. If you want, I can help you work on that.

Here's Codex's response:


• Findings

  1. This is not KISS overall. The branch exposes a clean top-level idea, but the
     implementation sits on a few very large orchestration modules with complexity
     suppressions:
       - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_schema.py:3
       - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_schema.py:212
       - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_schema.py:451
       - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_imputation.py:3
       - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_imputation.py:216
       - openbb_platform/providers/sec/openbb_sec/utils/company_facts.py:367
     That is powerful, but not simple.
  2. It is not DRY in the fetcher layer. The same fetch-pivot-metadata pipeline is
     repeated across six statement fetchers with only statement name, model type,
     and minor period mapping changed:
       - openbb_platform/providers/sec/openbb_sec/models/balance_sheet.py:641
       - openbb_platform/providers/sec/openbb_sec/models/income_statement.py:538
       - openbb_platform/providers/sec/openbb_sec/models/cash_flow.py:407
       - openbb_platform/providers/sec/openbb_sec/models/balance_sheet_growth.py:586
       - openbb_platform/providers/sec/openbb_sec/models/income_statement_growth.py:513
       - openbb_platform/providers/sec/openbb_sec/models/cash_flow_growth.py:393
     The repeated period_map blocks in the three growth models are the clearest
     example:
       - openbb_platform/providers/sec/openbb_sec/models/balance_sheet_growth.py:611
       - openbb_platform/providers/sec/openbb_sec/models/income_statement_growth.py:538
       - openbb_platform/providers/sec/openbb_sec/models/cash_flow_growth.py:421
  3. The branch is DRY in the schema definition itself. Moving statement mapping
     into shared JSON schemas plus one StatementSchema engine is a real
     consolidation win:
       - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_schema.py:40
       - openbb_platform/providers/sec/openbb_sec/utils/company_facts.py:23
       - openbb_platform/providers/sec/openbb_sec/utils/company_facts.py:71
     So this is not sloppy duplication everywhere; it is mixed.
  4. Test coverage looks materially better than the structure.
     openbb_platform/providers/sec/tests/test_company_facts.py:1 is extensive, and
     openbb_platform/providers/sec/tests/test_sec_fetchers.py:1 covers the new
     fetchers. That reduces risk, but it does not make the design KISS.

  Assessment
  Reviewed current branch feature/sec-company-facts against develop.

  Short version: more DRY than KISS.

  The good part is the data-driven schema approach. It removes a lot of tag-level
  duplication and gives the SEC provider one normalization pipeline. The weak part
  is that the pipeline is concentrated into very large multi-responsibility
  modules, and the six public fetchers duplicate the same wrapper logic instead of
  sharing a smaller base/helper layer.

  I would describe this branch as:

  - DRY at the schema/domain level
  - not DRY enough at the fetcher/plumbing level
  - not KISS overall

  I did not run tests; this was a code-structure review of the branch diff.
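One way to act on the fetcher-layer finding is to build all six extractors from a single parameterized pipeline. The sketch below is plain Python and does not use the OpenBB Fetcher API; `make_extractor`, `PERIOD_MAP`, the record keys, and `fake_fetch` are all hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical period codes shared by every statement.
PERIOD_MAP = {"annual": "FY", "quarterly": "Q", "ttm": "TTM"}

FetchLong = Callable[[str, str, str], list[dict]]


def make_extractor(statement: str, fetch_long: FetchLong, growth: bool = False):
    """Build one extractor per statement from a single shared pipeline."""

    def extract(symbol: str, period: str = "annual") -> list[dict]:
        records = fetch_long(symbol, statement, PERIOD_MAP[period])
        rows: dict[str, dict] = {}
        for rec in records:  # pivot long records to one row per period
            rows.setdefault(rec["period_ending"], {})[rec["tag"]] = rec["value"]
        ordered = [{"period_ending": p, **rows[p]} for p in sorted(rows)]
        if not growth:
            return ordered
        out = []
        for prev, cur in zip(ordered, ordered[1:]):
            # Period-over-period change; skip missing or zero bases.
            changes = {
                tag: (cur[tag] - prev[tag]) / abs(prev[tag])
                for tag in cur
                if tag != "period_ending" and prev.get(tag)
            }
            out.append({"period_ending": cur["period_ending"], **changes})
        return out

    return extract


def fake_fetch(symbol: str, statement: str, period_code: str) -> list[dict]:
    """Stand-in data source returning illustrative long-format records."""
    return [
        {"period_ending": "2022-12-31", "tag": "total_assets", "value": 100.0},
        {"period_ending": "2023-12-31", "tag": "total_assets", "value": 110.0},
    ]


balance = make_extractor("balance_sheet", fake_fetch)
balance_growth = make_extractor("balance_sheet", fake_fetch, growth=True)
print(balance_growth("XYZ"))
```

With this shape, the statement name and the growth flag are the only things that vary between the six fetchers, and the period mapping lives in one place.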


Development

Successfully merging this pull request may close these issues.

[FR]: use SEC company facts for equity fundamental commands
