[Feature] Standardized Financial Statements From SEC Company Facts API by deeleeramone · Pull Request #7416 · OpenBB-finance/OpenBB

deeleeramone · 2026-03-17T18:01:18Z

This PR adds a rule-based system that transforms raw SEC XBRL Company Facts data into standardized, cross-validated financial statements.

The system maps 1,190 distinct XBRL tag–namespace pairs drawn from 15 years of US GAAP and IFRS taxonomy vintages into 250 standardized line items across three financial statements (income statement, balance sheet, cash flow statement) and four industry-specific templates (industrial, financial, diversified, insurance).

The approach resolves three fundamental challenges in XBRL standardization: temporal tag evolution, cross-filer tag variation, and structural differences across industries. To address pervasive scope mismatches in reported XBRL data—where tags of different consolidation scope are combined within a single filing—the system employs a multi-stage pipeline comprising priority-ordered tag chain resolution, algebraic imputation from accounting identities, hierarchical statement articulation, targeted scope-mismatch corrections, and three-tier identity enforcement with full provenance tracking. For quarterly data, the system reconstructs Q4 from audited annual totals minus reported interim quarters, following the canonical SEC reporting structure where no standalone Q4 filing exists.
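As a rough illustration of the Q4 reconstruction described above, here is a minimal sketch. It is not the code in this PR: the function name `derive_q4` and the figures are hypothetical, and it covers only the simple duration-item case.

```python
from typing import Optional


def derive_q4(
    annual_total: float,
    q1: Optional[float],
    q2: Optional[float],
    q3: Optional[float],
) -> Optional[float]:
    """Return Q4 = FY - (Q1 + Q2 + Q3) for a duration (flow) item.

    Returns None when any interim quarter is missing, because the
    subtraction would otherwise absorb the missing quarter into Q4.
    """
    quarters = (q1, q2, q3)
    if any(q is None for q in quarters):
        return None
    return annual_total - sum(quarters)


# Hypothetical revenue figures, in millions.
q4_revenue = derive_q4(400.0, 90.0, 95.0, 100.0)
print(q4_revenue)  # 115.0
```

The sketch refuses to derive a value when an interim quarter is absent; how the actual pipeline treats that case is not shown here.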

It resolves #6654 and aims to provide high-quality standardization of the three financial statements, following up on the conversation in #7411.

Endpoints

SEC has been added as a provider to:

equity.fundamental.balance
equity.fundamental.balance_growth
equity.fundamental.cash
equity.fundamental.cash_growth
equity.fundamental.income
equity.fundamental.income_growth

All have periods: ["annual", "quarterly", "ttm"]

Additional parameters are:

pit_mode: bool - When True, returns data as originally reported at the time of filing, without subsequent restatements or amendments. For annual data, uses the original 10-K values.

include_preliminary: bool - Whether to include preliminary data from 8-K filings for periods not yet reported on 10-Q/K.

use_cache: bool - HTTP requests to the CompanyFacts API are cached in-memory for four hours.
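To make the pit_mode distinction concrete, here is a minimal sketch of choosing between originally reported and restated values. The `end`, `val`, `filed`, and `form` keys mirror fields found in Company Facts entries; the function `select_values` and the sample numbers are hypothetical and are not taken from this PR.

```python
from typing import Dict, List


def select_values(facts: List[dict], pit_mode: bool) -> Dict[str, float]:
    """Pick one value per period end date from repeated facts.

    pit_mode=True keeps the earliest filing (as originally reported);
    pit_mode=False keeps the latest filing (including restatements).
    """
    chosen: Dict[str, dict] = {}
    for fact in facts:
        current = chosen.get(fact["end"])
        if current is None:
            chosen[fact["end"]] = fact
            continue
        # ISO-8601 dates compare correctly as plain strings.
        if pit_mode:
            replace = fact["filed"] < current["filed"]
        else:
            replace = fact["filed"] > current["filed"]
        if replace:
            chosen[fact["end"]] = fact
    return {end: fact["val"] for end, fact in chosen.items()}


# Hypothetical facts: the same fiscal year reported, then restated a year later.
facts = [
    {"end": "2022-12-31", "val": 100.0, "filed": "2023-02-15", "form": "10-K"},
    {"end": "2022-12-31", "val": 98.0, "filed": "2024-02-14", "form": "10-K"},
]
print(select_values(facts, pit_mode=True))   # {'2022-12-31': 100.0}
print(select_values(facts, pit_mode=False))  # {'2022-12-31': 98.0}
```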

If a filer does not submit interim reports to the SEC (many 20-F and 40-F filers), requesting quarterly data raises an error indicating that it is not available.

Detailed metadata (source provenance, tag details, etc.) is returned in OBBject.extra["results_metadata"].

The company_facts utility returns the standardized statements in a narrow, long format. The ODP endpoints convert that to a wide format that matches the other providers, and separate the metadata by returning it as an AnnotatedResult.
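The long-to-wide conversion can be pictured with the sketch below. It uses only the standard library, and the record keys (`period_ending`, `tag`, `value`, `source`) and the function `long_to_wide` are assumptions for illustration, not the actual field names or helpers in this PR.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def long_to_wide(
    records: List[dict],
) -> Tuple[List[dict], Dict[str, Dict[str, str]]]:
    """Pivot long records into one row per period and split out provenance."""
    rows: Dict[str, dict] = defaultdict(dict)
    metadata: Dict[str, Dict[str, str]] = defaultdict(dict)
    for rec in records:
        period = rec["period_ending"]
        rows[period][rec["tag"]] = rec["value"]       # data goes into the wide row
        metadata[period][rec["tag"]] = rec["source"]  # provenance is kept apart
    wide = [{"period_ending": p, **rows[p]} for p in sorted(rows)]
    return wide, dict(metadata)


# Hypothetical long-format records: one line item per period per record.
records = [
    {"period_ending": "2024-12-31", "tag": "total_revenue", "value": 500.0,
     "source": "us-gaap:Revenues"},
    {"period_ending": "2024-12-31", "tag": "total_gross_profit", "value": 200.0,
     "source": "imputed: total_revenue + -total_cost_of_revenue"},
    {"period_ending": "2023-12-31", "tag": "total_revenue", "value": 450.0,
     "source": "us-gaap:Revenues"},
]
wide, meta = long_to_wide(records)
print(wide[0])  # {'period_ending': '2023-12-31', 'total_revenue': 450.0}
```

Keeping provenance in a parallel structure keyed by period and tag means the wide rows stay purely numeric, which is what lets them line up with the other providers.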

Implementation Details

The system is organized into a Python package and a set of JSON schema files, supported by a public API layer and a taxonomy maintenance tool:

statement_schema/schemas/ - The declarative schema, split into four JSON files: _meta.json (metadata, detection signals), plus one file per statement: income_statement.json, balance_sheet.json, cash_flow.json (~950 KB in total).
statement_schema/ - The runtime engine package: _types.py (dataclasses, constants), _detection.py (company-type classification, filing dates, fiscal metadata), _extraction.py (row-level XBRL extraction, reference filing computation), _rules.py (imputation and verification rule definitions), _imputation.py (multi-pass imputation, hierarchical articulation, identity enforcement), _schema.py (StatementSchema class orchestrating the pipeline), and __init__.py.
utils/company_facts.py - The public API: wraps the engine, merges configured multi-CIK histories, and produces long-format records with full provenance per line item per period.
xbrl_taxonomy_helper.py - Taxonomy infrastructure: programmatic access to FASB, SEC, and IFRS Foundation taxonomies for schema maintenance.

The system has been evaluated against a curated, edge-case-weighted validation corpus of 970 companies spanning industrials, banks, insurers, conglomerates, REITs, IFRS filers, discontinued-operations cases, and multi-CIK entities.

The schema is split across four JSON files in statement_schema/schemas/:

statement_schema/schemas/
├── _meta.json              # version, generated, taxonomy_sources, detection signals
│   ├── version: "2.0"
│   ├── generated: "2026-03-17"
│   ├── taxonomy_sources
│   │   ├── us_gaap: { years: [2011, 2026], tags_indexed: 3753 }
│   │   └── ifrs: { years: [2020, 2025], tags_indexed: 3979 }
│   └── detection
│       ├── insurance_is_signals: [8 tags]
│       ├── insurance_bs_signals: [5 tags]
│       ├── financial_signals: [16 tags]
│       ├── industrial_signals: [5 tags]
│       ├── diversified_signals: [5 tags]
│       └── min_financial_signals: 2
├── income_statement.json   # {industrial: [55 rows], financial: [74 rows],
│                           #  diversified: [52 rows], insurance: [56 rows]}
├── balance_sheet.json      # {industrial: [73 rows], financial: [68 rows],
│                           #  diversified: [73 rows], insurance: [68 rows]}
└── cash_flow.json          # {industrial: [57 rows], financial: [62 rows],
                            #  diversified: [57 rows], insurance: [62 rows]}
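To show how detection signals like those in _meta.json might drive template selection, here is a minimal sketch. The precedence order (insurance, then financial, then diversified, with industrial as the default), the function `classify_company`, and the signal tags listed are all assumptions for illustration; the actual rules live in _detection.py and are not reproduced here.

```python
from typing import Dict, Iterable


def classify_company(reported_tags: Iterable[str], detection: Dict) -> str:
    """Choose a statement template by counting signal tags a filer reports."""
    tags = set(reported_tags)
    insurance = set(detection["insurance_is_signals"]) | set(
        detection["insurance_bs_signals"]
    )
    if tags & insurance:
        return "insurance"
    # Require a minimum number of hits so one stray tag does not reclassify a filer.
    hits = len(tags & set(detection["financial_signals"]))
    if hits >= detection["min_financial_signals"]:
        return "financial"
    if tags & set(detection["diversified_signals"]):
        return "diversified"
    return "industrial"  # assumed default when nothing else matches


# Illustrative signal lists only; the real lists are longer.
detection = {
    "insurance_is_signals": ["PremiumsEarnedNet"],
    "insurance_bs_signals": ["LiabilityForClaimsAndClaimsAdjustmentExpense"],
    "financial_signals": [
        "InterestAndDividendIncomeOperating",
        "Deposits",
        "LoansAndLeasesReceivableNetReportedAmount",
    ],
    "diversified_signals": ["ExampleDiversifiedSignalTag"],
    "min_financial_signals": 2,
}

print(classify_company(["Revenues", "Deposits"], detection))  # industrial
print(classify_company(["Deposits", "InterestAndDividendIncomeOperating"], detection))  # financial
```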

A standardized line item appears as a row in the schema like:

{
  "tag": "total_gross_profit",
  "label": "Total Gross Profit",
  "description": "Aggregate revenue less cost of goods and services sold...",
  "parent": null,
  "sequence": 8,
  "factor": "+",
  "balance": "credit",
  "unit": "monetary",
  "period_type": "duration",
  "xbrl_tags": [
    { "tag": "GrossProfit", "namespace": "us-gaap" },
    { "tag": "GrossProfit", "namespace": "ifrs-full", "first_year": 2020, "last_year": 2025 }
  ]
}
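A schema row like the one above can be read as a priority-ordered chain of candidate tags. The sketch below shows one way such a chain might be resolved. It assumes that `first_year` and `last_year` bound the fiscal years in which a candidate applies, and the function `resolve_tag_chain` and the `facts` lookup are hypothetical rather than the engine's real interface.

```python
from typing import Dict, List, Optional, Tuple


def resolve_tag_chain(
    xbrl_tags: List[dict],
    facts: Dict[Tuple[str, str], float],
    fiscal_year: int,
) -> Optional[Tuple[str, float]]:
    """Return (source, value) for the first in-window candidate that has a fact."""
    for candidate in xbrl_tags:
        first = candidate.get("first_year")
        last = candidate.get("last_year")
        if first is not None and fiscal_year < first:
            continue  # candidate not yet applicable in this year
        if last is not None and fiscal_year > last:
            continue  # candidate no longer applicable in this year
        key = (candidate["namespace"], candidate["tag"])
        if key in facts:
            return f"{key[0]}:{key[1]}", facts[key]
    return None


xbrl_tags = [
    {"tag": "GrossProfit", "namespace": "us-gaap"},
    {"tag": "GrossProfit", "namespace": "ifrs-full", "first_year": 2020, "last_year": 2025},
]
# Hypothetical IFRS filer that reports only the ifrs-full tag.
facts = {("ifrs-full", "GrossProfit"): 50.0}

print(resolve_tag_chain(xbrl_tags, facts, 2022))  # ('ifrs-full:GrossProfit', 50.0)
print(resolve_tag_chain(xbrl_tags, facts, 2019))  # None
```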

When an accounting identity fails to hold, a warning is broadcast and the full details of the violation are included with the other metadata.
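For intuition, an identity check that warns and records the violation could look like the sketch below. It uses the basic balance sheet identity (assets = liabilities + equity); the line-item names, the tolerance, and the function `check_balance_sheet_identity` are assumptions for illustration and are not the rules defined in _rules.py.

```python
import warnings
from typing import Dict, Optional


def check_balance_sheet_identity(
    values: Dict[str, float], tolerance: float = 1.0
) -> Optional[dict]:
    """Warn and return violation details when assets != liabilities + equity."""
    expected = values["total_liabilities"] + values["total_equity"]
    difference = values["total_assets"] - expected
    if abs(difference) <= tolerance:
        return None
    violation = {
        "identity": "total_assets = total_liabilities + total_equity",
        "reported": values["total_assets"],
        "expected": expected,
        "difference": difference,
    }
    warnings.warn(f"Identity violated: {violation['identity']} (diff={difference})")
    return violation


# Hypothetical values that are off by 5.
result = check_balance_sheet_identity(
    {"total_assets": 1005.0, "total_liabilities": 600.0, "total_equity": 400.0}
)
print(result["difference"])  # 5.0
```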

Source provenance strings help the user evaluate where each standardized row came from. Most are direct XBRL facts.

Source Pattern	Meaning
us-gaap:Revenues	Direct XBRL extraction at reference filing date
us-gaap:Revenues(fallback)	Cross-vintage extraction from a non-reference filing
imputed: total_revenue + -total_cost_of_revenue	Algebraic imputation from accounting identity
imputed-rollup: sum of explicitly mapped children	Hierarchical parent rollup from children
imputed-plug: balancing remainder	Synthetic plug to maintain parent-child articulation
identity-enforced: ...	Value overridden during identity enforcement
corrected: total_gross_profit - total_operating_income	OpEx/COGS disambiguation
us-gaap:NetIncomeLoss(NCI-corrected)	NI switched from NCI-inclusive to parent-only
us-gaap:ProfitLoss(disc-adjusted)	NI adjusted from total to continuing-ops scope
us-gaap:NetIncomeLoss(identity_lock:cash_flow)	Direct XBRL value locked to a cross-statement identity target
reconciled: total_equity_and_nci - noncontrolling_interests	Equity decomposition override
Q4: FY − (Q1+Q2+Q3)	Q4 derived from annual minus first three quarters
H2: FY − H1	H2 derived from annual minus first-half semi-annual filing
derived: cash_at_end_of_period(2023-09-30)	Beginning cash derived from prior period's ending
standalone	Extracted without filing-date constraint
imputed-zero: ...	Suspect zero flagged for review
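A consumer of the metadata could bucket these strings into the categories used in the summary report. The mapping below is an assumption inferred from the pattern names in the table (for example, treating `corrected:` and `reconciled:` as scope corrections, and leaving `(fallback)`, `(identity_lock:...)`, and `standalone` in the direct bucket); the function `categorize_source` is hypothetical and may not match how validate_corpus.py counts.

```python
PREFIX_CATEGORIES = [
    ("imputed-rollup:", "Hierarchical rollup"),
    ("imputed-plug:", "Balancing plug"),
    ("imputed-zero:", "Suspect zero"),
    ("imputed:", "Algebraic impute"),
    ("identity-enforced:", "Identity enforced"),
    ("corrected:", "Scope correction"),
    ("reconciled:", "Scope correction"),
    ("Q4:", "Q4/H2 derivation"),
    ("H2:", "Q4/H2 derivation"),
    ("derived:", "Period derived"),
]

SCOPE_SUFFIXES = ("(NCI-corrected)", "(disc-adjusted)")


def categorize_source(source: str) -> str:
    """Map a provenance string to a report category (assumed mapping)."""
    for prefix, category in PREFIX_CATEGORIES:
        if source.startswith(prefix):
            return category
    if source.endswith(SCOPE_SUFFIXES):
        return "Scope correction"
    # Anything else is treated as a direct XBRL fact in this sketch.
    return "Direct XBRL"


print(categorize_source("us-gaap:Revenues"))                      # Direct XBRL
print(categorize_source("imputed-plug: balancing remainder"))     # Balancing plug
print(categorize_source("us-gaap:NetIncomeLoss(NCI-corrected)"))  # Scope correction
```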

Pasted below is a summary of the output from the validate_corpus script (the script is included in the repository but does not ship with the package).

COMPANY TYPE BREAKDOWN
--------------------------------------------------
  diversified: 231
  financial: 68
  industrial: 619
  insurance: 52

SOURCE PROVENANCE — ANNUAL
------------------------------------------------------------------------------------------
  Category                    Income Stmt   Balance Sheet     Cash Flow         Total
  --------------------------------------------------------------------------------
  Direct XBRL                     409,025         428,544       428,800     1,266,369
  Algebraic impute                 13,942           7,186           862        21,990
  Hierarchical rollup              22,757          41,178        19,326        83,261
  Balancing plug                   67,973          38,237        48,232       154,442
  Scope correction                  8,210             623             1         8,834
  Identity enforced                    28             161         1,243         1,432
  Period derived                        0               0        11,550        11,550
  Suspect zero                      7,042           9,320         4,539        20,901
  --------------------------------------------------------------------------------
  TOTAL                           528,977         525,249       514,553     1,568,779

  Direct XBRL: 1,266,369 (80.7%)  |  Q4/H2: 0 (0.0%)  |  Other computed: 302,410 (19.3%)

SOURCE PROVENANCE — QUARTERLY
------------------------------------------------------------------------------------------
  Category                    Income Stmt   Balance Sheet     Cash Flow         Total
  --------------------------------------------------------------------------------
  Direct XBRL                   1,025,495       1,447,322     1,066,304     3,539,121
  Q4/H2 derivation                319,656             948       304,247       624,851
  Algebraic impute                 77,527          28,308         3,312       109,147
  Hierarchical rollup              87,623         155,647        70,999       314,269
  Balancing plug                  259,541         136,890       162,110       558,541
  Scope correction                 32,461           2,729            27        35,217
  Identity enforced                   280             997         3,939         5,216
  Period derived                        0               0        46,968        46,968
  Suspect zero                     31,286          36,147        33,345       100,778
  --------------------------------------------------------------------------------
  TOTAL                         1,833,869       1,808,988     1,691,251     5,334,108

  Direct XBRL: 3,539,121 (66.3%)  |  Q4/H2: 624,851 (11.7%)  |  Other computed: 1,170,136 (21.9%)
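The summary percentages follow directly from the table totals. The short check below recomputes them from the counts reported above; the helper name `shares` is just for illustration.

```python
from typing import Dict


def shares(direct: int, q4_h2: int, total: int) -> Dict[str, float]:
    """Return each bucket's share of all rows, as a percentage to one decimal."""
    other = total - direct - q4_h2
    return {
        "direct": round(100 * direct / total, 1),
        "q4_h2": round(100 * q4_h2 / total, 1),
        "other": round(100 * other / total, 1),
    }


# Counts taken from the ANNUAL and QUARTERLY tables above.
annual = shares(direct=1_266_369, q4_h2=0, total=1_568_779)
quarterly = shares(direct=3_539_121, q4_h2=624_851, total=5_334_108)

print(annual)     # {'direct': 80.7, 'q4_h2': 0.0, 'other': 19.3}
print(quarterly)  # {'direct': 66.3, 'q4_h2': 11.7, 'other': 21.9}
```

The quarterly shares sum to 99.9 rather than 100.0 because each is rounded independently.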

See the file, ./openbb_sec/utils/STATEMENT_SCHEMA_README.md, for a detailed technical explanation of the methodology.

Tests

Unit tests for the new Fetchers all share the same Company Facts file (BLK - newest CIK, ~450 KB) that is stored in the ./tests/record folder.

Additional unit tests for the statement_schema logic are under ./tests/test_company_facts.

The script, validate_corpus.py, downloads all files (3+ GB), runs the extraction pipeline for annual and quarterly periods, and generates TXT and JSON report files that highlight any identity verification failures and provide detailed summary statistics.

piiq

As with the OECD refactor, I've asked Codex to review this against KISS and DRY principles. This usually gives a good path to consider for one improvement pass.

From my own judgement I think this is good (despite being prohibitively complex). The thing I strongly suggest adding is a BibTeX-style citation for the whitepaper. We should actually publish it on arXiv and point to the repo source code from there. It's not yet good enough to be shared in an "academic" setting, but it is good enough for an arXiv publication. If you want, I can help you work on that.

Here's Codex's response:


• Findings

  1. This is not KISS overall. The branch exposes a clean top-level idea, but the
     implementation sits on a few very large orchestration modules with complexity
     suppressions: openbb_platform/providers/sec/openbb_sec/utils/
     statement_schema/_schema.py:3, openbb_platform/providers/sec/openbb_sec/
     utils/statement_schema/_schema.py:212, openbb_platform/providers/sec/
     openbb_sec/utils/statement_schema/_schema.py:451, openbb_platform/providers/
     sec/openbb_sec/utils/statement_schema/_imputation.py:3, openbb_platform/
     providers/sec/openbb_sec/utils/statement_schema/_imputation.py:216, and
     openbb_platform/providers/sec/openbb_sec/utils/company_facts.py:367. That is
     powerful, but not simple.
  2. It is not DRY in the fetcher layer. The same fetch-pivot-metadata pipeline is
     repeated across six statement fetchers with only statement name, model type,
     and minor period mapping changed: openbb_platform/providers/sec/openbb_sec/
     models/balance_sheet.py:641, openbb_platform/providers/sec/openbb_sec/models/
     income_statement.py:538, openbb_platform/providers/sec/openbb_sec/models/
     cash_flow.py:407, openbb_platform/providers/sec/openbb_sec/models/
     balance_sheet_growth.py:586, openbb_platform/providers/sec/openbb_sec/models/
     income_statement_growth.py:513, openbb_platform/providers/sec/openbb_sec/
     models/cash_flow_growth.py:393. The repeated period_map blocks in the three
     growth models are the clearest example: openbb_platform/providers/sec/
     openbb_sec/models/balance_sheet_growth.py:611, openbb_platform/providers/sec/
     openbb_sec/models/income_statement_growth.py:538, openbb_platform/providers/
     sec/openbb_sec/models/cash_flow_growth.py:421.
  3. The branch is DRY in the schema definition itself. Moving statement mapping
     into shared JSON schemas plus one StatementSchema engine is a real
     consolidation win: openbb_platform/providers/sec/openbb_sec/utils/
     statement_schema/_schema.py:40, openbb_platform/providers/sec/openbb_sec/
     utils/company_facts.py:23, openbb_platform/providers/sec/openbb_sec/utils/
     company_facts.py:71. So this is not sloppy duplication everywhere; it is
     mixed.
  4. Test coverage looks materially better than the structure. openbb_platform/
     providers/sec/tests/test_company_facts.py:1 is extensive, and
     openbb_platform/providers/sec/tests/test_sec_fetchers.py:1 covers the new
     fetchers. That reduces risk, but it does not make the design KISS.

  Assessment
  Reviewed current branch feature/sec-company-facts against develop.

  Short version: more DRY than KISS.

  The good part is the data-driven schema approach. It removes a lot of tag-level
  duplication and gives the SEC provider one normalization pipeline. The weak part
  is that the pipeline is concentrated into very large multi-responsibility
  modules, and the six public fetchers duplicate the same wrapper logic instead of
  sharing a smaller base/helper layer.

  I would describe this branch as:

  - DRY at the schema/domain level
  - not DRY enough at the fetcher/plumbing level
  - not KISS overall

  I did not run tests; this was a code-structure review of the branch diff.
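One possible shape for the shared base/helper layer suggested in the review is a small factory that takes the statement name as a parameter, so the six fetchers differ only in their arguments. This is a hypothetical sketch: `make_statement_fetcher`, `fake_loader`, and the record keys are invented for illustration and do not correspond to existing code in the SEC provider.

```python
from typing import Callable, Dict, List

# A loader takes (symbol, statement, period) and returns long-format records.
LongLoader = Callable[[str, str, str], List[dict]]


def make_statement_fetcher(
    statement: str, loader: LongLoader
) -> Callable[..., List[dict]]:
    """Build a fetch function for one statement from a shared pipeline."""

    def fetch(symbol: str, period: str = "annual") -> List[dict]:
        records = loader(symbol, statement, period)
        rows: Dict[str, dict] = {}
        for rec in records:
            end = rec["period_ending"]
            row = rows.setdefault(end, {"period_ending": end})
            row[rec["tag"]] = rec["value"]
        return [rows[key] for key in sorted(rows)]

    return fetch


def fake_loader(symbol: str, statement: str, period: str) -> List[dict]:
    """Stand-in for the real data source, used only for this example."""
    return [{"period_ending": "2024-12-31", "tag": f"{statement}_total", "value": 1.0}]


fetch_income = make_statement_fetcher("income_statement", fake_loader)
fetch_balance = make_statement_fetcher("balance_sheet", fake_loader)
print(fetch_income("XYZ"))
# [{'period_ending': '2024-12-31', 'income_statement_total': 1.0}]
```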

add standardized financial statements to sec provider

a7b3132

github-actions bot added the enhancement, platform, and v4 labels Mar 17, 2026

deeleeramone added 20 commits March 17, 2026 11:08

codespell

253231e

add type annotation to json_schema_extra dict

3fe67c3

test params

e78382c

mypy

b8da01e

Merge branch 'develop' of https://github.com/OpenBB-finance/OpenBB in…

29483e4

…to feature/sec-company-facts

pylint

48b8824

if no actual quarterly data is reported by 20/40-F, raise an error wh…

968fdf2

…en requesting it

Fix D&A parent double count rollups

586200f

some updates

2a08760

codespell

422c057

breakout statement_schema into smaller files

512d6a7

Merge branch 'develop' of https://github.com/OpenBB-finance/OpenBB in…

5cba013

…to feature/sec-company-facts

balance sheet ordering

a7c089f

reorganize the results metadata output to OBBject

8890b9c

unused import

e1eb5ce

restore sec_filing.py file - not in scope

af4a356

organize imports

c58c44b

AdjustmentsNoncashItemsToReconcileNetIncomeLossToCashProvidedByUsedIn…

86f9ca8

…OperatingActivitiesOther

fix docstring

7eefeaf

Merge branch 'develop' into feature/sec-company-facts

b22ece6

deeleeramone marked this pull request as ready for review March 21, 2026 06:52

deeleeramone and others added 5 commits March 22, 2026 13:05

Merge branch 'develop' into feature/sec-company-facts

59b62b3

Merge branch 'develop' into feature/sec-company-facts

776f72d

Merge branch 'develop' into feature/sec-company-facts

e43376b

Merge branch 'develop' into feature/sec-company-facts

267f11a

Merge branch 'develop' into feature/sec-company-facts

43522c8

piiq requested changes Apr 7, 2026

View reviewed changes

Merge branch 'develop' into feature/sec-company-facts

72bd242

deeleeramone wants to merge 27 commits into develop from feature/sec-company-facts
