[Feature] Standardized Financial Statements From SEC Company Facts API #7416
Open
deeleeramone wants to merge 27 commits into develop from feature/sec-company-facts
…to feature/sec-company-facts
…to feature/sec-company-facts
…OperatingActivitiesOther
piiq
requested changes
Apr 7, 2026
Member
As with the OECD refactor, I asked Codex to review this against the KISS and DRY principles. That usually gives a good path for one improvement pass.
In my own judgement this is good (despite being prohibitively complex). The one thing I strongly suggest adding is a BibTeX-style citation for the whitepaper. We should actually publish it on arXiv and point to the repo source code from there. It's not yet good enough to be shared in an "academic" setting, but it is good enough for an arXiv publication. If you want, I can help you work on that.
Here's Codex's response:
• Findings
1. This is not KISS overall. The branch exposes a clean top-level idea, but the implementation sits on a few very large orchestration modules with complexity suppressions:
   - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_schema.py:3
   - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_schema.py:212
   - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_schema.py:451
   - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_imputation.py:3
   - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_imputation.py:216
   - openbb_platform/providers/sec/openbb_sec/utils/company_facts.py:367
   That is powerful, but not simple.
2. It is not DRY in the fetcher layer. The same fetch-pivot-metadata pipeline is repeated across six statement fetchers with only statement name, model type, and minor period mapping changed:
   - openbb_platform/providers/sec/openbb_sec/models/balance_sheet.py:641
   - openbb_platform/providers/sec/openbb_sec/models/income_statement.py:538
   - openbb_platform/providers/sec/openbb_sec/models/cash_flow.py:407
   - openbb_platform/providers/sec/openbb_sec/models/balance_sheet_growth.py:586
   - openbb_platform/providers/sec/openbb_sec/models/income_statement_growth.py:513
   - openbb_platform/providers/sec/openbb_sec/models/cash_flow_growth.py:393
   The repeated period_map blocks in the three growth models are the clearest example:
   - openbb_platform/providers/sec/openbb_sec/models/balance_sheet_growth.py:611
   - openbb_platform/providers/sec/openbb_sec/models/income_statement_growth.py:538
   - openbb_platform/providers/sec/openbb_sec/models/cash_flow_growth.py:421
3. The branch is DRY in the schema definition itself. Moving statement mapping into shared JSON schemas plus one StatementSchema engine is a real consolidation win:
   - openbb_platform/providers/sec/openbb_sec/utils/statement_schema/_schema.py:40
   - openbb_platform/providers/sec/openbb_sec/utils/company_facts.py:23
   - openbb_platform/providers/sec/openbb_sec/utils/company_facts.py:71
   So this is not sloppy duplication everywhere; it is mixed.
4. Test coverage looks materially better than the structure. openbb_platform/providers/sec/tests/test_company_facts.py:1 is extensive, and openbb_platform/providers/sec/tests/test_sec_fetchers.py:1 covers the new fetchers. That reduces risk, but it does not make the design KISS.
Assessment
Reviewed current branch feature/sec-company-facts against develop.
Short version: more DRY than KISS.
The good part is the data-driven schema approach. It removes a lot of tag-level
duplication and gives the SEC provider one normalization pipeline. The weak part
is that the pipeline is concentrated into very large multi-responsibility
modules, and the six public fetchers duplicate the same wrapper logic instead of
sharing a smaller base/helper layer.
I would describe this branch as:
- DRY at the schema/domain level
- not DRY enough at the fetcher/plumbing level
- not KISS overall
I did not run tests; this was a code-structure review of the branch diff.
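To make the "smaller base/helper layer" suggestion concrete, here is a rough sketch of what a shared helper could look like. Everything in it is hypothetical: `StatementSpec`, `fetch_statement`, the `PERIOD_MAP` values, and the record fields are illustration only and are not taken from the branch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical values: the real period_map blocks are not reproduced here.
PERIOD_MAP: Dict[str, str] = {"annual": "FY", "quarterly": "Q", "ttm": "TTM"}

Record = Dict[str, object]
Loader = Callable[[str, str, str], List[Record]]


@dataclass(frozen=True)
class StatementSpec:
    """The only things that differ between the six fetchers (assumed)."""

    statement: str        # e.g. "balance_sheet"
    growth: bool = False  # True for the *_growth variants


def fetch_statement(
    spec: StatementSpec, symbol: str, period: str, load: Loader
) -> Dict[str, object]:
    """Shared fetch -> split results/metadata step, written once."""
    if period not in PERIOD_MAP:
        raise ValueError(f"Unsupported period: {period!r}")
    records = load(symbol, spec.statement, PERIOD_MAP[period])
    # Values go to the results; provenance goes to the metadata.
    results = [{k: v for k, v in r.items() if k != "provenance"} for r in records]
    metadata = [{"tag": r["tag"], "provenance": r.get("provenance")} for r in records]
    return {
        "statement": spec.statement,
        "growth": spec.growth,
        "results": results,
        "metadata": metadata,
    }


# Each public fetcher would then shrink to a one-line spec.
BALANCE_SHEET = StatementSpec("balance_sheet")
BALANCE_SHEET_GROWTH = StatementSpec("balance_sheet", growth=True)
```

With something along these lines, the period mapping lives in one place and each fetcher only declares what is different about it.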
This PR adds a rule-based system that transforms raw SEC XBRL Company Facts data into standardized, cross-validated financial statements.
The system maps 1,190 distinct XBRL tag–namespace pairs drawn from 15 years of US GAAP and IFRS taxonomy vintages into 250 standardized line items across three financial statements (income statement, balance sheet, cash flow statement) and four industry-specific templates (industrial, financial, diversified, insurance).
The approach resolves three fundamental challenges in XBRL standardization: temporal tag evolution, cross-filer tag variation, and structural differences across industries. To address pervasive scope mismatches in reported XBRL data—where tags of different consolidation scope are combined within a single filing—the system employs a multi-stage pipeline comprising priority-ordered tag chain resolution, algebraic imputation from accounting identities, hierarchical statement articulation, targeted scope-mismatch corrections, and three-tier identity enforcement with full provenance tracking. For quarterly data, the system reconstructs Q4 from audited annual totals minus reported interim quarters, following the canonical SEC reporting structure where no standalone Q4 filing exists.
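The Q4 reconstruction can be pictured with a minimal sketch. It assumes discrete (non-cumulative) quarterly values for a duration item such as revenue; the function name and the `"Q1"`/`"Q2"`/`"Q3"` keys are hypothetical and are not the package's API. Instant items (balance sheet values) would not be subtracted this way, since the year-end balance already is the Q4 balance.

```python
from typing import Dict, Optional


def reconstruct_q4(annual_total: float, quarters: Dict[str, float]) -> Optional[float]:
    """Derive Q4 for a duration item as FY total minus Q1 + Q2 + Q3.

    Returns None when any interim quarter is missing, because the
    subtraction would then silently absorb the missing quarter.
    """
    needed = ("Q1", "Q2", "Q3")
    if any(q not in quarters for q in needed):
        return None
    return annual_total - sum(quarters[q] for q in needed)


# Hypothetical figures, for illustration only.
q4_revenue = reconstruct_q4(100, {"Q1": 20, "Q2": 25, "Q3": 30})
```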
It resolves #6654 and aims to provide high-quality standardization of the three financial statements, following up on the conversation in #7411.
Endpoints
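SEC has been added as a provider to the statement endpoints served by the six new fetchers: balance sheet, income statement, cash flow, and the growth variant of each.
All support the periods: ["annual", "quarterly", "ttm"]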
Additional parameters are:
pit_mode: bool - When True, returns data as originally reported at the time of filing, without subsequent restatements or amendments. For annual data, uses the original 10-K values.
include_preliminary: bool - Whether to include preliminary data from 8-K filings for periods not yet reported on 10-Q/K.
use_cache: bool - HTTP requests to the CompanyFacts API are cached in-memory for four hours.
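As an illustration of what a four-hour in-memory cache amounts to, here is a minimal sketch. The `TTLCache` class and its `get_or_fetch` method are hypothetical and only show the idea; the caching actually used for `use_cache` may be implemented quite differently.

```python
import time
from typing import Any, Callable, Dict, Tuple

FOUR_HOURS = 4 * 60 * 60  # seconds


class TTLCache:
    """Keep fetched values in memory and refetch once they are too old."""

    def __init__(
        self,
        ttl_seconds: float = FOUR_HOURS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no HTTP request needed
        value = fetch()
        self._store[key] = (now, value)
        return value
```

The cache key would typically be the request URL, so repeated calls for the same company within the window reuse a single download.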
If a filer does not submit interim reports to the SEC (as is the case for many 20-F and 40-F filers), an error is raised indicating that interim data is not available.
Detailed metadata (source provenance, tag details, etc.) is returned in OBBject.extra["results_metadata"].
The company_facts utility returns the standardized statements in a narrow, long format. The ODP endpoints convert that to a wide format that matches the other providers, and separate the metadata by returning an AnnotatedResult.
Implementation Details
The system is organized into a Python package and a set of JSON schema files, supported by a public API layer and a taxonomy maintenance tool:
- statement_schema/schemas/ - The declarative schema, split into four JSON files: _meta.json (metadata, detection signals), plus one file per statement: income_statement.json, balance_sheet.json, cash_flow.json (~950 KB in total).
- statement_schema/ - The runtime engine package:
  - _types.py (dataclasses, constants)
  - _detection.py (company-type classification, filing dates, fiscal metadata)
  - _extraction.py (row-level XBRL extraction, reference filing computation)
  - _rules.py (imputation and verification rule definitions)
  - _imputation.py (multi-pass imputation, hierarchical articulation, identity enforcement)
  - _schema.py (StatementSchema class orchestrating the pipeline)
  - __init__.py
- utils/company_facts.py - The public API: wraps the engine, merges configured multi-CIK histories, and produces long-format records with full provenance per line item per period.
- xbrl_taxonomy_helper.py - Taxonomy infrastructure: programmatic access to FASB, SEC, and IFRS Foundation taxonomies for schema maintenance.
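The relationship between the long-format records and the wide layout returned by the endpoints can be sketched roughly as follows. The record fields (`period_ending`, `tag`, `value`, `provenance`) and the `long_to_wide` function are assumptions made for illustration; the actual field names and conversion code may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

LongRecord = Dict[str, Any]


def long_to_wide(
    records: List[LongRecord],
) -> Tuple[List[Dict[str, Any]], Dict[Tuple[str, str], Any]]:
    """Pivot one-row-per-(period, line item) records into one row per period.

    Values go into the wide rows; provenance is split out into a separate
    mapping keyed by (period, tag), mirroring the results/metadata split.
    """
    wide: Dict[str, Dict[str, Any]] = defaultdict(dict)
    metadata: Dict[Tuple[str, str], Any] = {}
    for rec in records:
        period = rec["period_ending"]
        wide[period][rec["tag"]] = rec["value"]
        metadata[(period, rec["tag"])] = rec.get("provenance")
    rows = [{"period_ending": p, **values} for p, values in sorted(wide.items())]
    return rows, metadata


# Hypothetical sample records, for illustration only.
sample = [
    {"period_ending": "2024-12-31", "tag": "total_revenue", "value": 100, "provenance": "direct"},
    {"period_ending": "2024-12-31", "tag": "total_gross_profit", "value": 40, "provenance": "direct"},
    {"period_ending": "2023-12-31", "tag": "total_revenue", "value": 90, "provenance": "imputed"},
]
rows, metadata = long_to_wide(sample)
```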
The system has been evaluated against a curated, edge-case-weighted validation corpus of 970 companies spanning industrials, banks, insurers, conglomerates, REITs, IFRS filers, discontinued-operations cases, and multi-CIK entities.
The schema is split across four JSON files in statement_schema/schemas/.
A standardized line item appears as a row in the schema like:
{
  "tag": "total_gross_profit",
  "label": "Total Gross Profit",
  "description": "Aggregate revenue less cost of goods and services sold...",
  "parent": null,
  "sequence": 8,
  "factor": "+",
  "balance": "credit",
  "unit": "monetary",
  "period_type": "duration",
  "xbrl_tags": [
    { "tag": "GrossProfit", "namespace": "us-gaap" },
    { "tag": "GrossProfit", "namespace": "ifrs-full", "first_year": 2020, "last_year": 2025 }
  ]
}
When an accounting identity fails to hold, a warning is broadcast and the full details of the violation are included with the other metadata.
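A minimal sketch of such an identity check, using the balance sheet identity (assets = liabilities + equity) as the example. The function name, the field names, the tolerance, and the shape of the violation record are all hypothetical; the package's own checks and tiers are described in the README.

```python
import warnings
from typing import Any, Dict, Optional


def check_balance_identity(
    row: Dict[str, float], tolerance: float = 0.5
) -> Optional[Dict[str, Any]]:
    """Check assets = liabilities + equity for one period.

    Returns None when the identity holds within the tolerance; otherwise
    emits a warning and returns a record describing the violation so it
    can be attached to the metadata.
    """
    expected = row["total_liabilities"] + row["total_equity"]
    difference = row["total_assets"] - expected
    if abs(difference) <= tolerance:
        return None
    violation = {
        "identity": "total_assets = total_liabilities + total_equity",
        "reported": row["total_assets"],
        "expected": expected,
        "difference": difference,
    }
    warnings.warn(f"Identity violation: {violation['identity']} off by {difference}")
    return violation
```

Returning the violation rather than raising keeps the statement usable while still surfacing the discrepancy.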
Source provenance strings help the user evaluate where each standardized row came from; most are direct XBRL facts.
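For the common case of a direct XBRL fact, the resolution and its provenance can be sketched as below. This assumes that the order of the xbrl_tags list is the priority order and that first_year / last_year are inclusive bounds on when a tag applies; both are assumptions for illustration, as are the function name and the "namespace:tag" provenance format.

```python
from typing import Any, Dict, List, Optional, Tuple

FactKey = Tuple[str, str]  # (namespace, tag)


def resolve_tag_chain(
    xbrl_tags: List[Dict[str, Any]],
    facts: Dict[FactKey, float],
    fiscal_year: int,
) -> Optional[Tuple[float, str]]:
    """Return (value, provenance) from the first usable candidate tag.

    A candidate is usable when the fiscal year falls inside its optional
    first_year/last_year window and the filer reported a fact for it.
    """
    for candidate in xbrl_tags:
        first = candidate.get("first_year")
        last = candidate.get("last_year")
        if first is not None and fiscal_year < first:
            continue
        if last is not None and fiscal_year > last:
            continue
        key = (candidate["namespace"], candidate["tag"])
        if key in facts:
            return facts[key], f"{candidate['namespace']}:{candidate['tag']}"
    return None  # nothing matched; imputation would have to take over


# The chain from the total_gross_profit schema row.
chain = [
    {"tag": "GrossProfit", "namespace": "us-gaap"},
    {"tag": "GrossProfit", "namespace": "ifrs-full", "first_year": 2020, "last_year": 2025},
]
```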
Pasted below is a summary from the validate_corpus script (included in the repository, but not shipped with the package).
See the file ./openbb_sec/utils/STATEMENT_SCHEMA_README.md for a detailed technical explanation of the methodology.
Tests
Unit tests for the new Fetchers all share the same Company Facts file (BLK - newest CIK, ~450 KB), which is stored in the ./tests/record folder.
Additional unit tests for the statement_schema logic are under ./tests/test_company_facts.
The script validate_corpus.py downloads all files (3+ GB), runs the extraction pipeline for annual and quarterly periods, and generates TXT and JSON report files that highlight any identity verification failures and provide detailed summary statistics.