Open-sourcing SEC EDGAR on Hugging Face
Posted by EnricoShippole@reddit | LocalLLaMA | View on Reddit | 3 comments

Given the increasingly closed-source nature of the U.S. AI ecosystem, it is now more important than ever to push for the proliferation of open model and dataset releases. [Datamule](https://datamule.xyz/), [Teraflop AI](https://www.teraflopai.com/), and [Eventual](https://www.eventual.ai/) collaborated to release the [SEC-EDGAR dataset](https://huggingface.co/datasets/TeraflopAI/SEC-EDGAR).
The dataset contains 590 GB of data, spanning 8 million samples and 43 billion tokens from all major filings in the SEC EDGAR database. Many different unofficial API providers charge hundreds of dollars a month to access this data with strict limits.
The SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system is a free public online database providing access to millions of corporate financial filings from publicly traded companies over the last 20 years. We provide free and open access to numerous annual and quarterly reports, including 10-K, 10-Q, and 8-K filings, from the EDGAR system.
The bulk data was collected using the [datamule-python](https://github.com/john-friedman/datamule-python) library and the official [datamule API](https://datamule.xyz/) created by [John Friedman](https://john-friedman.github.io/). The datamule Python library is a package for collecting, manipulating, and processing SEC EDGAR data at scale. Datamule provides a simple open-source API interface to easily download each of a company's filings by ticker and submission type. SEC EDGAR rate-limits requests at 10 per second. Even before accounting for network overhead, continuously crawling 8 million major filings at that rate takes roughly 10 days, following the official EDGAR guidance. The documentation for datamule can be found [here](https://john-friedman.github.io/datamule-python/).
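The crawl-time estimate follows directly from the rate limit; a quick back-of-the-envelope check using the figures from the post (pure arithmetic, not a benchmark):

```python
# Back-of-the-envelope: how long does a polite crawl of EDGAR take?
RATE_LIMIT_RPS = 10          # SEC EDGAR's documented request ceiling
TOTAL_FILINGS = 8_000_000    # major filings in this release

seconds = TOTAL_FILINGS / RATE_LIMIT_RPS
days = seconds / 86_400
print(f"{days:.1f} days at the rate limit")  # ~9.3 days, before retries or network overhead
```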
The dataset contains the raw contents of each major filing, the extracted and parsed HTML/XML plaintext, and relevant metadata such as the filing’s accession number, filing date, period, documents, and filer. The raw document contents are provided so that you may use your own custom parser to extract the HTML/XML to plaintext. The text was parsed and extracted from the HTML/XML contents using the [selectolax](https://selectolax.readthedocs.io/en/latest/index.html) HTML parser and a modified version of [doc2dict](https://github.com/john-friedman/doc2dict/tree/main) and [secsgml](https://github.com/john-friedman/secsgml) libraries.
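The release used selectolax for extraction; purely as an illustration of the HTML-to-plaintext step, here is a minimal standard-library sketch (a simplified stand-in, not the actual parser used, and far less robust than selectolax):

```python
from html.parser import HTMLParser

class PlaintextExtractor(HTMLParser):
    """Minimal HTML-to-plaintext extractor (stdlib stand-in for selectolax)."""
    SKIP = {"script", "style"}  # tags whose text content should be dropped

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only, collapsing surrounding whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

extractor = PlaintextExtractor()
extractor.feed("<html><body><h1>10-K</h1><script>x()</script><p>Annual report</p></body></html>")
print(extractor.text())  # "10-K Annual report"
```

A production parser also has to handle malformed markup, nested tables, and inline XBRL, which is why the release relies on selectolax plus custom table handling rather than anything this simple.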
The secsgml library is used to parse the [Standard Generalized Markup Language](https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language) document format used by the Securities and Exchange Commission, and to handle [daily archive](http://sec.gov/Archives/edgar/Feed/) and [submission file types](https://www.sec.gov/Archives/edgar/data/1318605/000095017022000796/0000950170-22-000796.txt). The doc2dict library provides multiple parsers for extracting HTML, XML, and PDF content, and was used to convert documents to plaintext and explicitly handle table mappings. The documentation for doc2dict can be found [here](https://john-friedman.github.io/doc2dict/whitepaper/). We use Daft's stateful UDFs, [`@daft.cls`](https://docs.daft.ai/en/stable/custom-code/cls/#stateful-class-udfs-with-daftcls) and [`@daft.method.batch`](https://docs.daft.ai/en/stable/custom-code/cls/#batch-methods-with-daftmethodbatch), to batch-process the documents with doc2dict and secsgml.
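To give a flavor of what secsgml deals with, here is a toy standard-library sketch that pulls a few metadata fields out of an EDGAR SGML submission header. The header sample mirrors the format visible in any raw `.txt` submission file; the parsing logic is a simplified stand-in, not secsgml itself:

```python
import re

# Sample mirroring the top of a raw EDGAR submission .txt file.
HEADER = """\
<SEC-HEADER>
ACCESSION NUMBER:\t\t0000950170-22-000796
CONFORMED SUBMISSION TYPE:\t10-K
FILED AS OF DATE:\t\t20220425
</SEC-HEADER>
"""

def parse_header(raw: str) -> dict:
    """Extract a few well-known metadata fields from an SGML header."""
    fields = {}
    for key in ("ACCESSION NUMBER", "CONFORMED SUBMISSION TYPE", "FILED AS OF DATE"):
        m = re.search(rf"^{key}:\s*(\S+)", raw, re.MULTILINE)
        if m:
            fields[key] = m.group(1)
    return fields

meta = parse_header(HEADER)
print(meta["ACCESSION NUMBER"])  # 0000950170-22-000796
```

The real library additionally handles nested `<DOCUMENT>` sections, UU-encoded attachments, and the daily-feed archive layout, none of which this sketch attempts.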
Distributed processing of the data was scaled out using the highly efficient [Daft dataframe library](https://www.daft.ai/), the [Ray](https://github.com/ray-project/ray) distributed framework, and [Teraflop AI data pipelines](https://github.com/teraflop-ai). The entire dataset was processed into clean plaintext using a total of 12 cores in under 24 hours, at a total cost of approximately $1.10 USD.
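Those figures imply a modest aggregate throughput; a quick sanity check using the numbers stated above (decimal GB assumed):

```python
# Sanity-check the stated processing run: 590 GB in under 24 hours on 12 cores.
DATASET_GB = 590
HOURS = 24
CORES = 12

gb_per_hour = DATASET_GB / HOURS                    # aggregate rate (lower bound)
mb_per_sec = DATASET_GB * 1000 / (HOURS * 3600)     # decimal GB -> MB
print(f"{gb_per_hour:.1f} GB/hour (~{mb_per_sec:.1f} MB/s across {CORES} cores)")
```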
The full dataset is freely available on Hugging Face [here](https://huggingface.co/datasets/TeraflopAI/SEC-EDGAR). A collection of the full dataset and all individual filing subsets can be found [here](https://hf.co/collections/TeraflopAI/sec-edgar).
Below, we provide a table of the total number of crawled and released samples per document type:
| Filing | Total number of samples |
| :---- | :---- |
| Form 5 | 114,724 |
| Form 4 | 4,474,981 |
| Form 3 | 387,465 |
| S-1 | 24,866 |
| S-8 | 95,543 |
| 10-K | 223,275 |
| 8-K | 1,952,207 |
| 20-F | 19,428 |
| 10-Q | 674,240 |
| 144 | 88,726 |
| Total | 8,055,455 |
A breakdown of the total token counts for each filing is provided below:
| Filing | Total token count |
| :---- | :---- |
| 10-K | 14,518,876,137 |
| 20-F | 2,917,164,397 |
| Form 5 | 66,330,315 |
| Form 4 | 1,676,565,503 |
| Form 3 | 110,098,014 |
| 10-Q | 17,509,723,617 |
| S-1 | 2,914,107,827 |
| S-8 | 472,867,864 |
| 8-K | 3,466,866,649 |
| 144 | 73,218,304 |
| Total | 43,725,818,627 |
The next SEC-EDGAR dataset release will include all remaining filing and form types that were not part of this release, alongside the major filings released here. You can find a full breakdown of each document type through Datamule’s SEC Census [here](https://github.com/john-friedman/SEC-Census/tree/master).
We are building open-source state-of-the-art search across numerous domains. If you would like to help support or contribute to future open-source projects and dataset releases, you can join our [Discord](https://discord.gg/bWW8Wbhxhx) or contact us directly [here](https://x.com/EnricoShippole).
soshulmedia@reddit
I think you should also post this to /r/datahoarder.
ttkciar@reddit
When you have time, please fix your post's formatting.
status-code-200@reddit
Happy to have helped! I'm surprised how cheap daft made the project. Good to know for the future.