We organized Japanese financial reports to encourage applying NLP techniques to financial analytics.
The corpus is split by fiscal year.
master version.
fiscal_year | Raw file version (F) | Text extracted version (E) |
---|---|---|
2014 | .zip (9.3GB) | .zip (269.9MB) |
2015 | .zip (9.8GB) | .zip (291.1MB) |
2016 | .zip (10.2GB) | .zip (334.7MB) |
2017 | .zip (9.1GB) | .zip (309.4MB) |
2018 | .zip (10.5GB) | .zip (260.9MB) |
- Financial data is from 決算短信情報 (summaries of financial results).
- We use non-consolidated data if it exists.
- Stock data is from 月間相場表(内国株式) (monthly quotation tables for domestic stocks). `close` is taken at the fiscal period end and `open` 1 year before it.
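Since `close` and `open` are one year apart, a relative change over the fiscal period can be derived from them. The following is a minimal sketch, assuming both values are prices available as plain numbers. The function name `fiscal_year_return` and its parameter names are illustrative and not part of the dataset or the `coarij` package.

```python
from typing import Optional


def fiscal_year_return(
    open_price: Optional[float], close_price: Optional[float]
) -> Optional[float]:
    """Relative change from `open` (1 year before period end) to `close`.

    Returns None when either value is missing or the opening value
    is not positive, so that incomplete rows are skipped, not guessed.
    """
    if open_price is None or close_price is None:
        return None
    if open_price <= 0:
        return None
    return (close_price - open_price) / open_price
```

Reports without stock data (see the `has_stock_data` column) would yield `None` here if their missing values are passed as `None`.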
fiscal_year | number_of_reports | has_csr_reports | has_financial_data | has_stock_data |
---|---|---|---|---|
2014 | 3,724 | 92 | 3,583 | 3,595 |
2015 | 3,870 | 96 | 3,725 | 3,751 |
2016 | 4,066 | 97 | 3,924 | 3,941 |
2017 | 3,578 | 89 | 3,441 | 3,472 |
2018 | 3,513 | 70 | 2,893 | 3,413 |
The dataset is structured as follows.
chakki_esg_financial_{year}.zip
└── {year}
    ├── documents.csv
    └── docs/
`docs` includes XBRL and PDF files.
- XBRL files of annual reports (retrieved from EDINET).
- PDF files of CSR reports (additional content).
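To check what a downloaded archive contains, the files under `docs/` can be counted by extension using only the standard library. This is a sketch, assuming the archive follows the layout shown above. The function name `count_doc_types` is hypothetical and not part of `coarij`.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_doc_types(zip_path: str) -> Counter:
    """Count files located under a `docs` directory, grouped by extension."""
    counts: Counter = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            path = PurePosixPath(name)
            # Only count files that have `docs` among their parent directories.
            if "docs" in path.parts[:-1]:
                counts[path.suffix.lower()] += 1
    return counts
```

For example, `count_doc_types("chakki_esg_financial_2014.zip")` would return a `Counter` keyed by extensions such as `.xbrl` and `.pdf`, without extracting the archive to disk.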
`documents.csv` has metadata like the following. Please refer to the Wiki for details.
- edinet_code: `E0000X`
- filer_name: `XXX株式会社`
- fiscal_year: `201X`
- fiscal_period: `FY`
- doc_path: `docs/S000000X.xbrl`
- csr_path: `docs/E0000X_201X_JP_36.pdf`
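The metadata fields above can be used to look up the files of one company. The sketch below reads `documents.csv` with the standard `csv` module. It assumes the file has a header row with the field names listed above and is UTF-8 encoded. The delimiter is left as a parameter because the actual separator is not stated here. `load_documents` and `select_documents` are illustrative helpers, not `coarij` functions.

```python
import csv
from typing import Dict, Iterable, List


def load_documents(
    csv_path: str, delimiter: str = ",", encoding: str = "utf-8"
) -> List[Dict[str, str]]:
    """Read documents.csv into a list of dicts keyed by the header row."""
    with open(csv_path, newline="", encoding=encoding) as f:
        return list(csv.DictReader(f, delimiter=delimiter))


def select_documents(
    rows: Iterable[Dict[str, str]], edinet_code: str
) -> List[Dict[str, str]]:
    """Keep only the rows whose `edinet_code` matches exactly."""
    return [row for row in rows if row.get("edinet_code") == edinet_code]
```

A call such as `select_documents(load_documents("2014/documents.csv"), "E00021")` would then give the rows whose `doc_path` and `csr_path` point to that filer's files. Pass `delimiter="\t"` if the file turns out to be tab-separated.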
The text extracted version includes `txt` files that correspond to each part of an annual report.
The extracted parts are defined in xbrr.
chakki_esg_financial_{year}_extracted.zip
└── {year}
    ├── documents.csv
    └── docs/
You can download the dataset with the command line tool.
pip install coarij
To see the usage, run the command with `--` (the CLI is built with fire).
coarij --
Example commands.
# Download raw file version dataset of 2014.
coarij download --kind F --year 2014
# Extract business.overview_of_result part of TIS.Inc (sec code=3626).
coarij extract business.overview_of_result --sec_code 3626
# Tokenize text by Janome (`janome` or `sudachi` is supported).
pip install janome
coarij tokenize --tokenizer janome
# Show tokenized result (words are separated by \t).
head -n 5 data/processed/2014/docs/S100552V_business_overview_of_result_tokenized.txt
1 【 業績 等 の 概要 】
( 1 ) 業績
当 連結 会計 年度 における 我が国 経済 は 、 消費 税率 引上げ に 伴う 駆け込み 需要 の 反動 や 海外 景気 動向 に対する 先行き 懸念 等 から 弱い 動き も 見 られ まし た が 、 企業 収益 の 改善 等 により 全体 ...
To download the latest dataset, specify `--version master` when downloading the data.
- For the parts that can be parsed, please refer to xbrr.
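Because the tokenized files separate words with `\t`, they can be consumed without a tokenizer installed. The following sketch counts token frequencies. It assumes one sentence or paragraph per line and UTF-8 encoding. `count_tokens` is a hypothetical helper, not part of `coarij`.

```python
from collections import Counter
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count tab-separated tokens across all given lines."""
    counts: Counter = Counter()
    for line in lines:
        # Drop the trailing newline, split on tabs, and ignore empty fields.
        tokens = [t for t in line.rstrip("\n").split("\t") if t.strip()]
        counts.update(tokens)
    return counts
```

An open file object can be passed directly, for example `count_tokens(open(path, encoding="utf-8"))` with `path` set to a tokenized file such as the one shown in the example commands.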
You can use `Ledger` to select the files you need from the overall CoARiJ dataset.
from coarij.storage import Storage
storage = Storage("your/data/directory")
ledger = storage.get_ledger()
collected = ledger.collect(edinet_code="E00021")