Ingesting SEC disclosures for algorithmic natural language processing (NLP) is difficult because the HTML is poorly formed.  Now Calcbench API users can access standardized disclosure HTML.

For instance, Microsoft's Contingencies note looks like this  -




but the HTML looks like this -



Everything is a paragraph: there is no hierarchy, and the headers are not marked up as headers.


Calcbench's standardized HTML looks like this -



The hierarchy of headers is correct, and each header sits in a section with the text to which it refers.
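Once headers are real header elements, splitting a note into labeled chunks for NLP becomes straightforward. The following is a minimal sketch using only the Python standard library. The `SAMPLE_HTML` string, the `SectionSplitter` class, and the `split_sections` helper are hypothetical names made up for illustration, and the sample markup is an assumption about what heading-structured HTML might look like, not Calcbench's actual output.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

# Hypothetical sample: an assumed shape for heading-structured HTML,
# NOT actual Calcbench output.
SAMPLE_HTML = """
<section>
  <h1>Example Note</h1>
  <section>
    <h2>First Topic</h2>
    <p>Text about the first topic.</p>
  </section>
  <section>
    <h2>Second Topic</h2>
    <p>Text about the second topic.</p>
    <p>More text.</p>
  </section>
</section>
"""


class SectionSplitter(HTMLParser):
    """Collect one record per heading, with the text that follows it."""

    def __init__(self):
        super().__init__()
        self.sections = []        # dicts with keys: level, heading, text
        self._in_heading = False  # True while inside an <h1>..<h6> element

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # A new heading starts a new record; the digit is its depth.
            self.sections.append({"level": int(tag[1]), "heading": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # ignore any text that appears before the first heading
        key = "heading" if self._in_heading else "text"
        self.sections[-1][key] += data


def split_sections(html):
    """Return heading records with whitespace collapsed to single spaces."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return [
        {
            "level": s["level"],
            "heading": " ".join(s["heading"].split()),
            "text": " ".join(s["text"].split()),
        }
        for s in parser.sections
    ]


if __name__ == "__main__":
    for section in split_sections(SAMPLE_HTML):
        indent = "  " * (section["level"] - 1)
        print(f"{indent}{section['heading']}: {section['text']}")
```

With the original filing HTML, where every line is a paragraph, this kind of split would need heuristics on font weight or position to guess which paragraphs are headers.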


To get the standardized HTML, use the disclosure API (Calcbench API access required) and pass standardized=True to the DisclosureSearchResults objects returned by the disclosure_search method; see the documentation for details.



