From 37aba7e78db865470cb63e74798acd66f985666a Mon Sep 17 00:00:00 2001
From: Pavel Bedrin
Date: Fri, 2 Aug 2024 15:52:07 +0300
Subject: [PATCH 1/8] move from lfs to nextcloud

---
 .gitattributes                | 1 -
 README.md                     | 4 ++--
 data/annotations/arabic.json  | 3 ---
 data/annotations/chinese.json | 3 ---
 data/annotations/deutsch.json | 3 ---
 data/annotations/english.json | 3 ---
 data/annotations/korean.json  | 3 ---
 data/annotations/russian.json | 3 ---
 data/mhtml-ru.zip             | 3 ---
 9 files changed, 2 insertions(+), 24 deletions(-)
 delete mode 100644 .gitattributes
 delete mode 100644 data/annotations/arabic.json
 delete mode 100644 data/annotations/chinese.json
 delete mode 100644 data/annotations/deutsch.json
 delete mode 100644 data/annotations/english.json
 delete mode 100644 data/annotations/korean.json
 delete mode 100644 data/annotations/russian.json
 delete mode 100644 data/mhtml-ru.zip

diff --git a/.gitattributes b/.gitattributes
deleted file mode 100644
index 20739a9..0000000
--- a/.gitattributes
+++ /dev/null
@@ -1 +0,0 @@
-data/** filter=lfs diff=lfs merge=lfs -text
diff --git a/README.md b/README.md
index 07a7444..deaf02f 100644
--- a/README.md
+++ b/README.md
@@ -164,8 +164,8 @@ JSONs structure for other languages:
 ## Download
-* Multilingual dataset (1.1 GB): [`data/annotations`](./data/annotations)
-* Russian-language web pages in MHTML format (zipped 1 GB): [`data/mhtml-ru.zip`](./data/mhtml-ru.zip)
+* Multilingual dataset (1.1 GB): [`annotations/`](https://nextcloud.ispras.ru/index.php/s/zbaDqkxmQPmaEkT)
+* Russian-language web pages in MHTML format (zipped 1 GB): [`news-page-dataset-mhtmls.zip`](https://nextcloud.ispras.ru/index.php/s/YDwme8jSByQY2xC)
 ## Citation
diff --git a/data/annotations/arabic.json b/data/annotations/arabic.json
deleted file mode 100644
index 412c6d8..0000000
--- a/data/annotations/arabic.json
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c60b531ae8c51f416052512b806136d9721cd5bab2bb5dfb4118dc7542728757
-size 148343716
diff --git a/data/annotations/chinese.json b/data/annotations/chinese.json
deleted file mode 100644
index 266a5b7..0000000
--- a/data/annotations/chinese.json
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:466c5919b5a8f9c05025716fa749a0ca43b95eb572eac25e509c2070e603afad
-size 50161106
diff --git a/data/annotations/deutsch.json b/data/annotations/deutsch.json
deleted file mode 100644
index 7879db7..0000000
--- a/data/annotations/deutsch.json
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c8515996ce0e5f62e68ccefa61617836e7844dde7b49af05245f2db93473a1d9
-size 248170217
diff --git a/data/annotations/english.json b/data/annotations/english.json
deleted file mode 100644
index 6c7cfad..0000000
--- a/data/annotations/english.json
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:57c02a80ea0b500461f1f7901e0be979d002f1fca2f863c581ac16aa7d320d4b
-size 197585205
diff --git a/data/annotations/korean.json b/data/annotations/korean.json
deleted file mode 100644
index 0b82902..0000000
--- a/data/annotations/korean.json
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:8c188a3bc9962eb8c42e6d02fa34d01ba2c4192b63e2becaf018691e793c5b3c
-size 186998347
diff --git a/data/annotations/russian.json b/data/annotations/russian.json
deleted file mode 100644
index 4aeb6ec..0000000
--- a/data/annotations/russian.json
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:2885afa56b8bcf9c88809f7c19e484d8160c96aa4c65cfa48e972b08b2bb8c6a
-size 186622332
diff --git a/data/mhtml-ru.zip b/data/mhtml-ru.zip
deleted file mode 100644
index 99b7bd8..0000000
--- a/data/mhtml-ru.zip
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d6d28440669c06236800850dbdd1ca4e3d643bb4397c2f238659ebfa1a1990cc
-size 1104865366

From a1ff7723cfb1506b44ccae01e52dd5b1227cc61c Mon Sep 17 00:00:00 2001
From: Alexander Kustenkov
Date: Mon, 5 Aug 2024 12:34:38 +0300
Subject: [PATCH 2/8] Add NewsListDataset description

---
 README.md | 69 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 62 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index deaf02f..0e27e62 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,27 @@
-# Dataset For Information Extraction From News Web Pages
+
+# ISPRAS News Datasets Collection
+
+Table of Contents
+
+- [Dataset For Information Extraction From News Web Pages](#Dataset-For-Information-Extraction-From-News-Web-Pages)
+  - [Dataset Description](#Dataset-Description)
+  - [Data Collection](#Data-Collection)
+  - [Dataset Format](#Dataset-Format)
+  - [Download](#Download)
+  - [Citation](#Citation)
+- [NewsListDataset](#NewsListDataset)
+  - [Dataset Description](#Dataset-Description)
+  - [Data Collection](#Data-Collection)
+  - [Dataset Format](#Dataset-Format)
+  - [Download](#Download)
+  - [Citation](#Citation)
+
+
+## Dataset For Information Extraction From News Web Pages
 Multilingual dataset of labeled news web pages for information extraction task
-## Dataset Description
+### Dataset Description {#description-1}
 Dataset contains websites in 6 languages: Russian, English, German, Chinese, Korean, Arabic. We labeled news pages with attributes from these sets:
 * For Russian: title, subtitle, publication date, modification date, text, authors, sources, categories, tags
 * For other languages: title, publication date, text, authors, tags
@@ -102,13 +121,13 @@ Dataset contains websites in 6 languages: Russian, English, German, Chinese, Kor
-## Data Collection
+### Data Collection {#data-collection-1}
 Creating the Russian-language part of the dataset is described in our [paper](https://ieeexplore.ieee.org/document/10076872). The annotators marked up web pages using Label Studio according to the [guideline](./MANIFEST.md).
 For other languages, we marked up nodes on pages using sitemaps created in the [Web Scraper](https://github.com/ispras/web-scraper-chrome-extension).
-## Dataset Format
+### Dataset Format {#data-format-1}
 For Russian-language part we have JSON file with the following structure (Label Studio JSON MIN format):
 ```
@@ -162,13 +181,13 @@ JSONs structure for other languages:
 ...}
 ```
-## Download
+### Download {#download-1}
 * Multilingual dataset (1.1 GB): [`annotations/`](https://nextcloud.ispras.ru/index.php/s/zbaDqkxmQPmaEkT)
 * Russian-language web pages in MHTML format (zipped 1 GB): [`news-page-dataset-mhtmls.zip`](https://nextcloud.ispras.ru/index.php/s/YDwme8jSByQY2xC)
-## Citation
+### Citation {#citation-1}
 More details about the Russian-language part of the dataset are available in our [paper](https://ieeexplore.ieee.org/document/10076872). Please cite us if you use or discuss this dataset in your work:
 ```
@@ -182,4 +201,40 @@ More details about the Russian-language part of the dataset are available in our
 pages={100-106},
 keywords={Annotations;Neural networks;Web pages;Data aggregation;Information retrieval;Data mining;Electronic commerce;web data extraction;information extraction;news;webpage dataset;neural networks},
 doi={10.1109/ISPRAS57371.2022.10076872}}
-```
\ No newline at end of file
+```
+
+## NewsListDataset
+Dataset for extracting news records with their attributes from html pages.
+### Dataset Description {#dataset-description-2}
+This dataset contains pages with lists of news in Russian.
+The following attributes were marked: title, date, tag, short_text, time, short_title, author.
+
+Their distribution:
+
+| | Pages | Records | Domains |
+|-------------|-------|---------|---------|
+| title | 12679 | 247262 | 275 |
+| date | 12296 | 241634 | 251 |
+| tag | 6165 | 108400 | 140 |
+| short_text | 6855 | 115983 | 138 |
+| time | 1938 | 41892 | 8 |
+| short_title | 105 | 1289 | 4 |
+| author | 87 | 957 | 1 |
+
+Totally dataset contains 13099 pages.
+
+### Dataset Format {#dataset-format-2}
+Each file from data folder is instance of json dictionary with fields:
+* **html**: formatted html code of page
+* **exist_labels**: labels which are located at html
+* **domain**: domain of page
+* **labeled_xpaths**: dictionary of xpaths and its labels
+* **timestamp**: timestamp of date, when page was loaded
+* **url**: url of page
+* **record_xpaths**: xpaths of block-nodes(first text node of each record)
+
+### Download {#download-2}
+Dataset available at:
+* NewsListDataset (915 MB): [`russian.json`](https://nextcloud.ispras.ru/index.php/s/ZP4D8cjAs4FcAjx)
+
+This file is dump of python-like list object, each item of it is instance of dictionary with fields described at [Dataset Format](#dataset-format-2) . So the size of list is 13099 items.
\ No newline at end of file

From 38fdc2a21d48d41e3c4efdfa68bb6757fa827525 Mon Sep 17 00:00:00 2001
From: Alexander Kustenkov
Date: Mon, 5 Aug 2024 12:42:12 +0300
Subject: [PATCH 3/8] Fix links

---
 README.md | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 0e27e62..cdd4c0f 100644
--- a/README.md
+++ b/README.md
@@ -1,27 +1,25 @@
 # ISPRAS News Datasets Collection
-Table of Contents
+Datasets:
 - [Dataset For Information Extraction From News Web Pages](#Dataset-For-Information-Extraction-From-News-Web-Pages)
-  - [Dataset Description](#Dataset-Description)
-  - [Data Collection](#Data-Collection)
-  - [Dataset Format](#Dataset-Format)
-  - [Download](#Download)
-  - [Citation](#Citation)
+  - [Dataset Description](#dataset-description-1)
+  - [Data Collection](#data-collection-1)
+  - [Dataset Format](#dataset-format-1)
+  - [Download](#download-1)
+  - [Citation](#citation-1)
 - [NewsListDataset](#NewsListDataset)
-  - [Dataset Description](#Dataset-Description)
-  - [Data Collection](#Data-Collection)
-  - [Dataset Format](#Dataset-Format)
-  - [Download](#Download)
-  - [Citation](#Citation)
+  - [Dataset Description](#dataset-description-2)
+  - [Dataset Format](#dataset-format-2)
+  - [Download](#download-2)
 ## Dataset For Information Extraction From News Web Pages
 Multilingual dataset of labeled news web pages for information extraction task
-### Dataset Description {#description-1}
+### Dataset Description {#dataset-description-1}
 Dataset contains websites in 6 languages: Russian, English, German, Chinese, Korean, Arabic. We labeled news pages with attributes from these sets:
 * For Russian: title, subtitle, publication date, modification date, text, authors, sources, categories, tags
 * For other languages: title, publication date, text, authors, tags
@@ -127,7 +125,7 @@ Creating the Russian-language part of the dataset is described in our [paper](ht
 For other languages, we marked up nodes on pages using sitemaps created in the [Web Scraper](https://github.com/ispras/web-scraper-chrome-extension).
-### Dataset Format {#data-format-1}
+### Dataset Format {#dataset-format-1}
 For Russian-language part we have JSON file with the following structure (Label Studio JSON MIN format):
 ```

From cd04195bf78de7a74163a3308cafb7df51d12953 Mon Sep 17 00:00:00 2001
From: Alexander Kustenkov
Date: Mon, 5 Aug 2024 15:07:49 +0300
Subject: [PATCH 4/8] Move to HTML links

---
 README.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index cdd4c0f..1b53e4c 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,8 @@ Multilingual dataset of labeled news web pages for information extraction task
-### Dataset Description {#dataset-description-1}
+

Dataset Description

+
 Dataset contains websites in 6 languages: Russian, English, German, Chinese, Korean, Arabic. We labeled news pages with attributes from these sets:
 * For Russian: title, subtitle, publication date, modification date, text, authors, sources, categories, tags
 * For other languages: title, publication date, text, authors, tags
@@ -119,7 +120,7 @@ Dataset contains websites in 6 languages: Russian, English, German, Chinese, Kor
-### Data Collection {#data-collection-1}
+

Data Collection

 Creating the Russian-language part of the dataset is described in our [paper](https://ieeexplore.ieee.org/document/10076872). The annotators marked up web pages using Label Studio according to the [guideline](./MANIFEST.md).
@@ -179,13 +180,13 @@ JSONs structure for other languages:
 ...}
 ```
-### Download {#download-1}
+

Download

 * Multilingual dataset (1.1 GB): [`annotations/`](https://nextcloud.ispras.ru/index.php/s/zbaDqkxmQPmaEkT)
 * Russian-language web pages in MHTML format (zipped 1 GB): [`news-page-dataset-mhtmls.zip`](https://nextcloud.ispras.ru/index.php/s/YDwme8jSByQY2xC)
-### Citation {#citation-1}
+

Citation

 More details about the Russian-language part of the dataset are available in our [paper](https://ieeexplore.ieee.org/document/10076872). Please cite us if you use or discuss this dataset in your work:
 ```
@@ -203,7 +204,9 @@ More details about the Russian-language part of the dataset are available in our
 ## NewsListDataset
 Dataset for extracting news records with their attributes from html pages.
-### Dataset Description {#dataset-description-2}
+
+

Dataset Description

+
 This dataset contains pages with lists of news in Russian.
 The following attributes were marked: title, date, tag, short_text, time, short_title, author.
@@ -221,7 +224,8 @@ Their distribution:
 Totally dataset contains 13099 pages.
-### Dataset Format {#dataset-format-2}
+

Dataset Format

+
 Each file from data folder is instance of json dictionary with fields:
 * **html**: formatted html code of page
 * **exist_labels**: labels which are located at html
 * **domain**: domain of page
@@ -231,7 +235,8 @@ Each file from data folder is instance of json dictionary with fields:
 * **url**: url of page
 * **record_xpaths**: xpaths of block-nodes(first text node of each record)
-### Download {#download-2}
+

Dataset Format

+ Dataset available at: * NewsListDataset (915 MB): [`russian.json`](https://nextcloud.ispras.ru/index.php/s/ZP4D8cjAs4FcAjx) From 5f64727a4d64fa3f8e12250cf2d03331f75e8f8c Mon Sep 17 00:00:00 2001 From: pbedrin Date: Fri, 27 Sep 2024 11:28:20 +0300 Subject: [PATCH 5/8] update multilingual dataset --- AE_DATASET_STATS.md | 632 ++++++++++++++++++++++++++++++++++++++++++++ README.md | 165 +++--------- 2 files changed, 668 insertions(+), 129 deletions(-) create mode 100644 AE_DATASET_STATS.md diff --git a/AE_DATASET_STATS.md b/AE_DATASET_STATS.md new file mode 100644 index 0000000..3093132 --- /dev/null +++ b/AE_DATASET_STATS.md @@ -0,0 +1,632 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
TitleTextDateAuthorTag
amSites / Pages1 / 50
Sites with attribute
Pages with attribute
Nodes with attribute
1
50
50
1
50
568
0
0
0
0
0
0
0
0
0
arSites / Pages59 / 2218
Sites with attribute
Pages with attribute
Nodes with attribute
58
2198
2251
58
2216
21175
59
2218
2732
17
517
738
12
407
1811
bgSites / Pages5 / 247
Sites with attribute
Pages with attribute
Nodes with attribute
5
247
247
5
247
3680
5
247
297
4
155
158
2
98
505
caSites / Pages3 / 7
Sites with attribute
Pages with attribute
Nodes with attribute
3
7
7
3
7
48
3
7
7
1
3
3
1
3
8
daSites / Pages4 / 200
Sites with attribute
Pages with attribute
Nodes with attribute
4
200
244
4
200
2439
4
200
500
3
119
133
0
0
0
deSites / Pages9 / 450
Sites with attribute
Pages with attribute
Nodes with attribute
9
450
454
9
449
6847
9
450
600
9
270
308
2
100
336
dvSites / Pages1 / 2
Sites with attribute
Pages with attribute
Nodes with attribute
1
2
2
1
2
26
1
2
2
1
2
2
0
0
0
elSites / Pages4 / 154
Sites with attribute
Pages with attribute
Nodes with attribute
4
154
154
4
154
3242
4
154
154
3
102
113
3
144
510
enSites / Pages320 / 10479
Sites with attribute
Pages with attribute
Nodes with attribute
319
10428
10686
320
10469
192188
309
10069
10489
136
3920
4332
73
2080
7013
esSites / Pages92 / 3579
Sites with attribute
Pages with attribute
Nodes with attribute
92
3579
3580
92
3525
110752
91
3568
3810
51
1987
2635
37
1442
6106
etSites / Pages8 / 380
Sites with attribute
Pages with attribute
Nodes with attribute
8
380
436
8
380
6225
8
380
380
3
95
95
2
63
272
faSites / Pages3 / 146
Sites with attribute
Pages with attribute
Nodes with attribute
3
146
146
3
146
1575
3
146
146
2
96
96
0
0
0
fiSites / Pages2 / 100
Sites with attribute
Pages with attribute
Nodes with attribute
2
100
100
2
100
3876
2
100
100
1
50
50
1
49
301
frSites / Pages59 / 1750
Sites with attribute
Pages with attribute
Nodes with attribute
58
1701
1706
59
1750
31201
57
1743
1752
23
791
807
15
378
839
heSites / Pages3 / 148
Sites with attribute
Pages with attribute
Nodes with attribute
3
147
154
3
130
1383
3
145
145
3
118
142
1
25
44
hrSites / Pages13 / 540
Sites with attribute
Pages with attribute
Nodes with attribute
13
540
540
13
540
8174
12
508
558
6
284
286
10
407
1325
huSites / Pages7 / 258
Sites with attribute
Pages with attribute
Nodes with attribute
7
258
258
7
258
3969
6
256
306
4
200
205
3
101
655
idSites / Pages13 / 473
Sites with attribute
Pages with attribute
Nodes with attribute
13
472
472
13
471
8922
13
464
464
7
125
126
7
154
623
itSites / Pages11 / 275
Sites with attribute
Pages with attribute
Nodes with attribute
11
275
275
11
259
4351
11
275
275
5
106
106
4
84
214
jaSites / Pages2 / 97
Sites with attribute
Pages with attribute
Nodes with attribute
2
97
97
2
96
650
2
97
97
0
0
0
0
0
0
koSites / Pages16 / 536
Sites with attribute
Pages with attribute
Nodes with attribute
16
536
540
16
536
8097
16
536
540
11
382
473
3
57
230
loSites / Pages1 / 49
Sites with attribute
Pages with attribute
Nodes with attribute
1
49
49
1
49
297
1
49
49
1
49
49
1
49
196
ltSites / Pages3 / 149
Sites with attribute
Pages with attribute
Nodes with attribute
3
149
153
3
149
5668
3
149
149
0
0
0
1
47
281
lvSites / Pages4 / 144
Sites with attribute
Pages with attribute
Nodes with attribute
4
144
174
4
144
1687
4
144
144
1
8
18
1
8
45
mkSites / Pages5 / 105
Sites with attribute
Pages with attribute
Nodes with attribute
5
105
105
5
105
2955
5
105
105
3
72
79
2
61
139
mySites / Pages1 / 21
Sites with attribute
Pages with attribute
Nodes with attribute
1
21
21
1
21
67
1
21
21
1
21
21
0
0
0
nlSites / Pages1 / 98
Sites with attribute
Pages with attribute
Nodes with attribute
1
98
98
1
98
277
1
98
98
1
53
53
1
63
305
noSites / Pages11 / 83
Sites with attribute
Pages with attribute
Nodes with attribute
11
83
83
11
83
1495
11
83
83
2
5
5
0
0
0
ptSites / Pages43 / 1584
Sites with attribute
Pages with attribute
Nodes with attribute
43
1584
1584
43
1583
34677
43
1581
1581
21
667
797
17
571
2020
ruSites / Pages85 / 3623
Sites with attribute
Pages with attribute
Nodes with attribute
84
3523
3531
85
3623
79980
82
3396
3911
17
429
483
25
1110
4599
siSites / Pages1 / 46
Sites with attribute
Pages with attribute
Nodes with attribute
1
46
46
1
46
268
1
46
46
0
0
0
0
0
0
skSites / Pages5 / 104
Sites with attribute
Pages with attribute
Nodes with attribute
5
104
104
5
104
836
5
104
104
1
21
21
3
72
86
slSites / Pages2 / 50
Sites with attribute
Pages with attribute
Nodes with attribute
2
50
50
2
50
803
2
50
50
1
46
46
2
50
256
soSites / Pages4 / 103
Sites with attribute
Pages with attribute
Nodes with attribute
4
103
103
4
89
481
4
103
103
3
93
93
1
11
11
sqSites / Pages2 / 59
Sites with attribute
Pages with attribute
Nodes with attribute
2
59
59
2
59
456
2
59
59
1
50
50
1
9
51
swSites / Pages9 / 123
Sites with attribute
Pages with attribute
Nodes with attribute
9
123
123
9
116
1693
9
123
123
3
14
14
2
11
11
thSites / Pages3 / 131
Sites with attribute
Pages with attribute
Nodes with attribute
3
131
131
3
131
1392
3
131
131
0
0
0
2
58
159
tlSites / Pages5 / 24
Sites with attribute
Pages with attribute
Nodes with attribute
5
24
24
5
24
286
5
24
24
2
7
7
2
14
74
trSites / Pages3 / 107
Sites with attribute
Pages with attribute
Nodes with attribute
3
107
261
3
107
1306
3
107
157
0
0
0
0
0
0
ukSites / Pages1 / 47
Sites with attribute
Pages with attribute
Nodes with attribute
1
47
47
1
47
1799
1
47
47
0
0
0
0
0
0
urSites / Pages1 / 50
Sites with attribute
Pages with attribute
Nodes with attribute
1
50
50
1
50
599
1
50
50
1
49
52
0
0
0
viSites / Pages2 / 101
Sites with attribute
Pages with attribute
Nodes with attribute
2
101
101
2
100
1626
2
101
101
0
0
0
1
23
102
zh-cnSites / Pages12 / 951
Sites with attribute
Pages with attribute
Nodes with attribute
12
951
951
12
951
12527
12
951
951
5
318
318
0
0
0
zh-twSites / Pages6 / 85
Sites with attribute
Pages with attribute
Nodes with attribute
6
85
85
6
85
1463
6
85
85
3
56
56
1
11
43
diff --git a/README.md b/README.md
index 1b53e4c..10d5580 100644
--- a/README.md
+++ b/README.md
@@ -21,114 +21,40 @@ Multilingual dataset of labeled news web pages for information extraction task

Dataset Description

-Dataset contains websites in 6 languages: Russian, English, German, Chinese, Korean, Arabic. We labeled news pages with attributes from these sets: -* For Russian: title, subtitle, publication date, modification date, text, authors, sources, categories, tags -* For other languages: title, publication date, text, authors, tags - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TitleTextDateAuthorTag
ruSites / Pages112 / 722
Sites with attribute
Pages with attribute
Nodes with attribute
110
712
714
112
716
5918
110
708
724
54
262
272
49
332
1190
enSites / Pages10 / 500
Sites with attribute
Pages with attribute
Nodes with attribute
10
500
500
10
499
22200
10
499
499
4
147
147
2
98
258
deSites / Pages9 / 450
Sites with attribute
Pages with attribute
Nodes with attribute
9
450
454
9
449
6847
9
450
600
9
270
308
2
100
336
zhSites / Pages10 / 500
Sites with attribute
Pages with attribute
Nodes with attribute
10
500
501
10
500
5872
10
500
500
6
227
277
0
0
0
koSites / Pages10 / 500
Sites with attribute
Pages with attribute
Nodes with attribute
10
500
500
10
500
6898
10
500
550
8
358
409
1
41
155
arSites / Pages10 / 500
Sites with attribute
Pages with attribute
Nodes with attribute
10
500
500
10
500
5752
10
500
550
10
180
274
4
184
648
+Dataset contains websites in 44 languages. We labeled such attributes on news web pages: *title*, *publication date*, *text*, *authors*, *tags*. Some sites may also have *subtitle*, *sources* and *categories* annotations.
+We presented the statistics for the number of sites, pages and labeled nodes in the [AE_DATASET_STATS.md](./AE_DATASET_STATS.md) file.
+
+We also have a separate dataset for Russian news sites. We labeled there *title*, *subtitle*, *publication date*, *modification date*, *text*, *authors*, *sources*, *categories* and *tags*.

Data Collection

-Creating the Russian-language part of the dataset is described in our [paper](https://ieeexplore.ieee.org/document/10076872). The annotators marked up web pages using Label Studio according to the [guideline](./MANIFEST.md).
+For multilingual dataset, we marked up nodes on pages using sitemaps created with the [Web Scraper](https://github.com/ispras/web-scraper-chrome-extension).
+
+Creating the Russian dataset is described in our [paper](https://ieeexplore.ieee.org/document/10076872). The annotators marked up web pages using Label Studio according to the [guideline](./MANIFEST.md).
-For other languages, we marked up nodes on pages using sitemaps created in the [Web Scraper](https://github.com/ispras/web-scraper-chrome-extension).
+

Dataset Format

-### Dataset Format {#dataset-format-1} +For the multilingual dataset we have JSON for each language with the following structure: +``` +{'site': [ + { + 'uuid': + 'url': + 'html': + 'annotations': [ + { + 'xpath': + 'text': + 'label': + }, + ...] + }, + ...], +...} +``` -For Russian-language part we have JSON file with the following structure (Label Studio JSON MIN format): +JSON structure for the Russian dataset is the Label Studio JSON MIN format: ``` [ { @@ -160,35 +86,17 @@ For Russian-language part we have JSON file with the following structure (Label ``` We additionally added `html_en` with translated HTML into English. -JSONs structure for other languages: - -``` -{'site': [ - { - 'uuid': - 'url': - 'html': - 'annotations': [ - { - 'xpath': - 'text': - 'label': - }, - ...] - }, - ...], -...} -``` -

Download

-* Multilingual dataset (1.1 GB): [`annotations/`](https://nextcloud.ispras.ru/index.php/s/zbaDqkxmQPmaEkT)
-* Russian-language web pages in MHTML format (zipped 1 GB): [`news-page-dataset-mhtmls.zip`](https://nextcloud.ispras.ru/index.php/s/YDwme8jSByQY2xC)
-
+* Multilingual dataset (8.4 GB): [`multilingual-ae/`]()
+* Multilingual web pages in MHTML (zipped 43.7 GB): [`multilingual-ae-mhtml.zip`]()
+* Multilingual web pages in HTML (zipped 1.4 GB): [`multilingual-ae-html.zip`]()
+* Russian dataset (178 MB): [`russian.json`]()
+* Russian web pages in MHTML (zipped 1 GB): [`russian-ae-mhtml.zip`]()

Citation

-More details about the Russian-language part of the dataset are available in our [paper](https://ieeexplore.ieee.org/document/10076872). Please cite us if you use or discuss this dataset in your work:
+More details about the Russian-language dataset are available in our [paper](https://ieeexplore.ieee.org/document/10076872). Please cite us if you use or discuss this dataset in your work:
 ```
 @INPROCEEDINGS{10076872,
 author={Varlamov, Maksim and Galanin, Denis and Bedrin, Pavel and Duda, Sergey and Lazarev, Vladimir and Yatskov, Alexander},
@@ -208,7 +116,7 @@ Dataset for extracting news records with their attributes from html pages.

Dataset Description

 This dataset contains pages with lists of news in Russian.
-The following attributes were marked: title, date, tag, short_text, time, short_title, author.
+The following attributes were marked: *title*, *date*, *tag*, *short_text*, *time*, *short_title*, *author*.
 Their distribution:
@@ -235,9 +143,8 @@ Each file from data folder is instance of json dictionary with fields:
 * **url**: url of page
 * **record_xpaths**: xpaths of block-nodes(first text node of each record)
-

Dataset Format

+

Download

-Dataset available at:
 * NewsListDataset (915 MB): [`russian.json`](https://nextcloud.ispras.ru/index.php/s/ZP4D8cjAs4FcAjx)
-This file is dump of python-like list object, each item of it is instance of dictionary with fields described at [Dataset Format](#dataset-format-2) . So the size of list is 13099 items.
\ No newline at end of file
+This file is dump of python-like list object, each item of it is instance of dictionary with fields described at [Dataset Format](#dataset-format-2). So the size of list is 13099 items.
\ No newline at end of file

From ebd052da2f9838068b5f7d479bbfef4c92255349 Mon Sep 17 00:00:00 2001
From: pbedrin
Date: Tue, 1 Oct 2024 11:54:58 +0300
Subject: [PATCH 6/8] multilingual overall stats

---
 AE_DATASET_STATS.md | 1 +
 README.md           | 4 +---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/AE_DATASET_STATS.md b/AE_DATASET_STATS.md
index 3093132..3b66125 100644
--- a/AE_DATASET_STATS.md
+++ b/AE_DATASET_STATS.md
@@ -1,3 +1,4 @@
+Overall: 781 sites, 109 372 pages, 677 840 labels
diff --git a/README.md b/README.md
index 10d5580..1f0b8f9 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,6 @@
 # ISPRAS News Datasets Collection
-Datasets:
-
 - [Dataset For Information Extraction From News Web Pages](#Dataset-For-Information-Extraction-From-News-Web-Pages)
   - [Dataset Description](#dataset-description-1)
   - [Data Collection](#data-collection-1)
@@ -17,7 +15,7 @@
 ## Dataset For Information Extraction From News Web Pages
-Multilingual dataset of labeled news web pages for information extraction task
+Multilingual dataset of labeled news web pages for information extraction task.

Dataset Description

From eb6941ceb24e328781b669b6e56c42fe8c273973 Mon Sep 17 00:00:00 2001
From: pbedrin
Date: Tue, 17 Dec 2024 18:14:03 +0300
Subject: [PATCH 7/8] recalculate stats after extra data cleaning

---
 AE_DATASET_STATS.md | 65 +++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 32 deletions(-)

diff --git a/AE_DATASET_STATS.md b/AE_DATASET_STATS.md
index 3b66125..003ae4c 100644
--- a/AE_DATASET_STATS.md
+++ b/AE_DATASET_STATS.md
@@ -1,4 +1,5 @@
-Overall: 781 sites, 109 372 pages, 677 840 labels
+Overall: 783 sites, 30 003 pages, 678 411 labels
+
@@ -35,10 +36,10 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
-
-
-
+
+
+
+
@@ -87,14 +88,14 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
+
-
-
-
-
+
+
+
+
@@ -133,11 +134,11 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
-
-
-
-
+
+
+
+
+
@@ -147,8 +148,8 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
-
+
+
@@ -204,7 +205,7 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
+
@@ -259,8 +260,8 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
-
+
+
@@ -274,7 +275,7 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
+
@@ -414,7 +415,7 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
+
@@ -427,11 +428,11 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
-
-
-
-
+
+
+
+
+
@@ -609,10 +610,10 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
-
-
-
+
+
+
+
@@ -624,7 +625,7 @@ Overall: 781 sites, 109 372 pages, 677 840 labels
-
+

From 28db89161dae69dcd0e31061e323f1263e102559 Mon Sep 17 00:00:00 2001
From: pbedrin
Date: Tue, 17 Dec 2024 19:31:21 +0300
Subject: [PATCH 8/8] add dataset links

---
 README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 1f0b8f9..ccaf66f 100644
--- a/README.md
+++ b/README.md
@@ -86,11 +86,11 @@ We additionally added `html_en` with translated HTML into English.

Download

-* Multilingual dataset (8.4 GB): [`multilingual-ae/`]()
-* Multilingual web pages in MHTML (zipped 43.7 GB): [`multilingual-ae-mhtml.zip`]()
-* Multilingual web pages in HTML (zipped 1.4 GB): [`multilingual-ae-html.zip`]()
-* Russian dataset (178 MB): [`russian.json`]()
-* Russian web pages in MHTML (zipped 1 GB): [`russian-ae-mhtml.zip`]()
+* Multilingual dataset (8.4 GB): [`multilingual-ae/`](https://nextcloud.ispras.ru/index.php/s/gkttoE637s9kxfJ)
+* Multilingual web pages in MHTML (zipped 43.9 GB): [`multilingual-ae-mhtml.zip`](https://nextcloud.ispras.ru/index.php/s/RbxBaq3P7MD9DCC)
+* Multilingual web pages in HTML (zipped 1.5 GB): [`multilingual-ae-html.zip`](https://nextcloud.ispras.ru/index.php/s/L2DTDdkRfwmy4PS)
+* Russian dataset (178 MB): [`russian.json`](https://nextcloud.ispras.ru/index.php/s/fHgbox8mZYkQkew)
+* Russian web pages in MHTML (zipped 1 GB): [`russian-ae-mhtml.zip`](https://nextcloud.ispras.ru/index.php/s/xm7NGX7bya7W8eo)

Citation

@@ -139,7 +139,7 @@ Each file from data folder is instance of json dictionary with fields:
 * **labeled_xpaths**: dictionary of xpaths and its labels
 * **timestamp**: timestamp of date, when page was loaded
 * **url**: url of page
-* **record_xpaths**: xpaths of block-nodes(first text node of each record)
+* **record_xpaths**: xpaths of block-nodes (first text node of each record)

Download

Sites with attribute
Pages with attribute
Nodes with attribute
58
2198
2251
58
2216
21175
59
2218
2732
17
517
738
58
2198
2248
59
2217
26354
59
2218
2538
17
517
739
12
407
1811
de Sites / Pages9 / 45014 / 527
Sites with attribute
Pages with attribute
Nodes with attribute
9
450
454
9
449
6847
9
450
600
9
270
308
14
527
546
14
526
7776
14
527
677
10
273
311
2
100
336
Sites with attribute
Pages with attribute
Nodes with attribute
319
10428
10686
320
10469
192188
309
10069
10489
136
3920
4332
73
2080
7013
319
10418
10674
320
10468
192737
308
10052
10394
135
3849
4154
73
2050
6956
Sites with attribute
Pages with attribute
Nodes with attribute
92
3579
3580
92
3525
110752
92
3541
3542
92
3525
110552
91
3568
3810
51
1987
2635
37
1442
6106
Sites with attribute
Pages with attribute
Nodes with attribute
58
1701
1706
59
1750
31201
59
1708
31150
57
1743
1752
23
791
807
15
378
839
Sites with attribute
Pages with attribute
Nodes with attribute
13
472
472
13
471
8922
13
473
473
13
471
8868
13
464
464
7
125
126
7
154
623
Sites with attribute
Pages with attribute
Nodes with attribute
11
275
275
11
259
4351
11
268
4437
11
275
275
5
106
106
4
84
214
Sites with attribute
Pages with attribute
Nodes with attribute
43
1584
1584
43
1583
34677
43
1582
33000
43
1581
1581
21
667
797
17
571
2020
Sites with attribute
Pages with attribute
Nodes with attribute
84
3523
3531
85
3623
79980
82
3396
3911
17
429
483
25
1110
4599
85
3623
3631
85
3623
78991
81
3346
3861
17
422
476
26
1197
4805
Sites with attribute
Pages with attribute
Nodes with attribute
12
951
951
12
951
12527
12
951
951
5
318
318
12
890
890
11
888
11792
11
888
888
4
255
255
0
0
0
Sites with attribute
Pages with attribute
Nodes with attribute
6
85
85
6
85
1463
6
85
1154
6
85
85
3
56
56
1
11
43