Compression ratio improvements.
IlyaGrebnov committed Nov 11, 2022
1 parent 89e696b commit 4f61c07
Showing 14 changed files with 6,515 additions and 1,374 deletions.
3 changes: 2 additions & 1 deletion AUTHORS
@@ -5,6 +5,7 @@
 -- This program is based on (at least) the work of

 Michael Maniscalco, Atsushi Komiya, Pochang Chen,
-Surya Kandau and Malte Skarupke.
+Surya Kandau, Malte Skarupke, Danny Dube, Vincent Beaudoin,
+Takahiro Ota, Hiroyoshi Morita and Akiko Manada.


11 changes: 7 additions & 4 deletions CHANGES
@@ -1,12 +1,15 @@
+* 2022-11-10 : Version 0.3.0
+  * Compression ratio improvements.
+
 * 2022-01-08 : Version 0.2.1
-  * Replaced std::stable_sort with ska_sort.
+  * Performance improvements.

 * 2022-01-05 : Version 0.2
-  * Improved compression.
-  * Reduced memory usage from 15x to 13x.
+  * Memory usage improvements.
+  * Compression ratio improvements.

 * 2021-12-07 : Version 0.1.1 - 0.1.2
-  * Slightly improved compression using symbols history.
+  * Minor compression ratio improvements.

 * 2021-12-03 : Version 0.1.0
   * Initial public release of the bsc-m03.
138 changes: 72 additions & 66 deletions README.md
@@ -4,19 +4,25 @@ The bsc-m03 is experimental block sorting compressor based on M03 context aware
 * Michael Maniscalco *M03: A solution for context based blocksort (BWT) compression*, 2004
 * Jurgen Abel *Post BWT stages of the Burrows-Wheeler compression algorithm*, 2010

+Moreover, the bsc-m03 compressor is a practical implementation of *Compression via Substring Enumeration* for byte-oriented sources:
+* Danny Dube, Vincent Beaudoin *Lossless Data Compression via Substring Enumeration*, 2010
+* Takahiro Ota, Hiroyoshi Morita, Akiko Manada *Compression by Substring Enumeration with a Finite Alphabet Using Sorting*, 2018
+
 Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>

 ## License
 The bsc-m03 is released under the [GNU General Public License](LICENSE "GNU General Public License")

 ## Changes
+* 2022-11-10 : Version 0.3.0
+  * Compression ratio improvements.
 * 2022-01-08 : Version 0.2.1
-  * Replaced std::stable_sort with ska_sort.
+  * Performance improvements.
 * 2022-01-05 : Version 0.2
-  * Improved compression.
-  * Reduced memory usage from 15x to 13x.
+  * Memory usage improvements.
+  * Compression ratio improvements.
 * 2021-12-07 : Version 0.1.1 - 0.1.2
-  * Slightly improved compression using symbols history.
+  * Minor compression ratio improvements.
 * 2021-12-03 : Version 0.1.0
   * Initial public release of the bsc-m03.

@@ -25,89 +31,89 @@ The bsc-m03 is released under the [GNU General Public License](LICENSE "GNU Gene
 ### Calgary Corpus ###
 | File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
 |:---------------:|:-----------:|:------------:|:-------:|
-| bib | 111261 | 24832 | 1.785 |
-| book1 | 768771 | 206247 | 2.146 |
-| book2 | 610856 | 140103 | 1.835 |
-| geo | 102400 | 52597 | 4.109 |
-| news | 377109 | 107049 | 2.271 |
-| obj1 | 21504 | 9863 | 3.669 |
-| obj2 | 246814 | 68833 | 2.231 |
-| paper1 | 53161 | 15145 | 2.279 |
-| paper2 | 82199 | 22824 | 2.221 |
-| pic | 513216 | 44694 | 0.697 |
-| progc | 39611 | 11390 | 2.300 |
-| progl | 71646 | 13689 | 1.529 |
-| progp | 49379 | 9376 | 1.519 |
-| trans | 93695 | 15550 | 1.328 |
+| bib | 111261 | 24656 | 1.773 |
+| book1 | 768771 | 204395 | 2.127 |
+| book2 | 610856 | 139566 | 1.828 |
+| geo | 102400 | 52580 | 4.108 |
+| news | 377109 | 106395 | 2.257 |
+| obj1 | 21504 | 9795 | 3.644 |
+| obj2 | 246814 | 68414 | 2.218 |
+| paper1 | 53161 | 15048 | 2.265 |
+| paper2 | 82199 | 22687 | 2.208 |
+| pic | 513216 | 44620 | 0.696 |
+| progc | 39611 | 11320 | 2.286 |
+| progl | 71646 | 13610 | 1.520 |
+| progp | 49379 | 9316 | 1.509 |
+| trans | 93695 | 15446 | 1.319 |

 ### Canterbury Corpus ###
 | File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
 |:---------------:|:-----------:|:------------:|:-------:|
-| alice29.txt | 152089 | 38841 | 2.043 |
-| asyoulik.txt | 125179 | 36149 | 2.310 |
-| cp.html | 24603 | 6969 | 2.266 |
-| fields.c | 11150 | 2712 | 1.946 |
-| grammar.lsp | 3721 | 1138 | 2.447 |
-| kennedy.xls | 1029744 | 56929 | 0.442 |
-| lcet10.txt | 426754 | 95628 | 1.793 |
-| plrabn12.txt | 481861 | 130437 | 2.166 |
-| ptt5 | 513216 | 44694 | 0.697 |
-| sum | 38240 | 11539 | 2.414 |
-| xargs.1 | 4227 | 1603 | 3.034 |
+| alice29.txt | 152089 | 38667 | 2.034 |
+| asyoulik.txt | 125179 | 36019 | 2.302 |
+| cp.html | 24603 | 6915 | 2.249 |
+| fields.c | 11150 | 2695 | 1.934 |
+| grammar.lsp | 3721 | 1130 | 2.429 |
+| kennedy.xls | 1029744 | 56568 | 0.439 |
+| lcet10.txt | 426754 | 95240 | 1.785 |
+| plrabn12.txt | 481861 | 130068 | 2.159 |
+| ptt5 | 513216 | 44620 | 0.696 |
+| sum | 38240 | 11479 | 2.401 |
+| xargs.1 | 4227 | 1585 | 3.000 |

 ### Large Canterbury Corpus ###
 | File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
 |:---------------:|:-----------:|:------------:|:-------:|
-| bible.txt | 4047392 | 703933 | 1.391 |
-| E.coli | 4638690 | 1129304 | 1.948 |
-| world192.txt | 2473400 | 381247 | 1.233 |
+| bible.txt | 4047392 | 701049 | 1.386 |
+| E.coli | 4638690 | 1126463 | 1.943 |
+| world192.txt | 2473400 | 378508 | 1.224 |

 ### Silesia Corpus ###
 | File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
 |:---------------:|:-----------:|:------------:|:-------:|
-| dickens | 10192446 | 2208219 | 1.733 |
-| mozilla | 51220480 | 15704019 | 2.453 |
-| mr | 9970564 | 2160359 | 1.733 |
-| nci | 33553445 | 1137038 | 0.271 |
-| ooffice | 6152192 | 2522972 | 3.281 |
-| osdb | 10085684 | 2230920 | 1.770 |
-| reymont | 6627202 | 964011 | 1.164 |
-| samba | 21606400 | 3839503 | 1.422 |
-| sao | 7251944 | 4656134 | 5.136 |
-| webster | 41458703 | 6279969 | 1.212 |
-| xml | 5345280 | 364952 | 0.546 |
-| x-ray | 8474240 | 3685642 | 3.479 |
+| dickens | 10192446 | 2203859 | 1.730 |
+| mozilla | 51220480 | 15630325 | 2.441 |
+| mr | 9970564 | 2158802 | 1.732 |
+| nci | 33553445 | 1130423 | 0.270 |
+| ooffice | 6152192 | 2511633 | 3.266 |
+| osdb | 10085684 | 2221807 | 1.762 |
+| reymont | 6627202 | 962152 | 1.161 |
+| samba | 21606400 | 3816749 | 1.413 |
+| sao | 7251944 | 4651078 | 5.131 |
+| webster | 41458703 | 6267572 | 1.209 |
+| xml | 5345280 | 362358 | 0.542 |
+| x-ray | 8474240 | 3681801 | 3.476 |

 ### Manzini Corpus ###
 | File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
 |:---------------:|:-----------:|:------------:|:-------:|
-| chr22.dna | 34553758 | 7227116 | 1.673 |
-| etext99 | 105277340 | 21586520 | 1.640 |
-| gcc-3.0.tar | 86630400 | 10198397 | 0.942 |
-| howto | 39422105 | 7594162 | 1.541 |
-| jdk13c | 69728899 | 2659297 | 0.305 |
-| linux-2.4.5.tar | 116254720 | 16599153 | 1.142 |
-| rctail96 | 114711151 | 9852234 | 0.687 |
-| rfc | 116421901 | 15047359 | 1.034 |
-| sprot34.dat | 109617186 | 17382679 | 1.269 |
-| w3c2 | 104201579 | 5717299 | 0.439 |
+| chr22.dna | 34553758 | 7206590 | 1.668 |
+| etext99 | 105277340 | 21508150 | 1.634 |
+| gcc-3.0.tar | 86630400 | 10131247 | 0.936 |
+| howto | 39422105 | 7556359 | 1.533 |
+| jdk13c | 69728899 | 2638786 | 0.303 |
+| linux-2.4.5.tar | 116254720 | 16489301 | 1.135 |
+| rctail96 | 114711151 | 9788959 | 0.683 |
+| rfc | 116421901 | 14967795 | 1.029 |
+| sprot34.dat | 109617186 | 17259191 | 1.260 |
+| w3c2 | 104201579 | 5666677 | 0.435 |

 ### Maximum Compression Corpus ###
 | File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
 |:---------------:|:-----------:|:------------:|:-------:|
-| A10.jpg | 842468 | 823856 | 7.823 |
-| AcroRd32.exe | 3870784 | 1568677 | 3.242 |
-| english.dic | 465211 | 147280 | 2.533 |
-| FlashMX.pdf | 4526946 | 3721859 | 6.577 |
-| FP.LOG | 20617071 | 508327 | 0.197 |
-| MSO97.DLL | 3782416 | 1890558 | 3.999 |
-| ohs.doc | 4168192 | 810011 | 1.555 |
-| rafale.bmp | 4149414 | 745966 | 1.438 |
-| vcfiu.hlp | 4121418 | 613304 | 1.190 |
-| world95.txt | 2988578 | 448323 | 1.200 |
+| A10.jpg | 842468 | 823496 | 7.820 |
+| AcroRd32.exe | 3870784 | 1560548 | 3.225 |
+| english.dic | 465211 | 145707 | 2.506 |
+| FlashMX.pdf | 4526946 | 3717253 | 6.569 |
+| FP.LOG | 20617071 | 505982 | 0.196 |
+| MSO97.DLL | 3782416 | 1882533 | 3.982 |
+| ohs.doc | 4168192 | 805796 | 1.547 |
+| rafale.bmp | 4149414 | 743544 | 1.434 |
+| vcfiu.hlp | 4121418 | 608769 | 1.182 |
+| world95.txt | 2988578 | 445466 | 1.192 |

 ### Large Text Compression Benchmark Corpus ###
 | File name | Input size (bytes) | Output size (bytes) | Bits per symbol |
 |:---------------:|:-----------:|:------------:|:-------:|
-| enwik8 | 100000000 | 20398312 | 1.632 |
-| enwik9 | 1000000000 | 161062976 | 1.289 |
+| enwik8 | 100000000 | 20339773 | 1.627 |
+| enwik9 | 1000000000 | 160616907 | 1.285 |
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-0.2.1
+0.3.0
2 changes: 1 addition & 1 deletion bsc-m03.cpp
@@ -409,7 +409,7 @@ static int print_usage()

 int main(int argc, const char * argv[])
 {
-fprintf(stdout, "bsc-m03 is experimental block sorting compressor. Version 0.2.1 (08 January 2022).\n");
+fprintf(stdout, "bsc-m03 is experimental block sorting compressor. Version 0.3.0 (10 November 2022).\n");
 fprintf(stdout, "Copyright (c) 2021-2022 Ilya Grebnov <[email protected]>. ABSOLUTELY NO WARRANTY.\n");
 fprintf(stdout, "This program is based on (at least) the work of Michael Maniscalco (see AUTHORS).\n\n");

6 changes: 6 additions & 0 deletions libsais/CHANGES
@@ -1,3 +1,9 @@
+Changes in 2.7.1 (June 19, 2022)
+- Improved cache coherence for ARMv8 architecture.
+
+Changes in 2.7.0 (April 12, 2022)
+- Support for longest common prefix array (LCP) construction.
+
 Changes in 2.6.5 (January 1, 2022)
 - Exposed functions to construct suffix array of a given integer array.
 - Improved detection of various compiler intrinsics.
2 changes: 1 addition & 1 deletion libsais/VERSION
@@ -1 +1 @@
-2.6.5
+2.7.1