- Console interface
- Algorithm
- Strategy to merge a file 1 GB
- Strategy to merge a file 10 GB
- Strategy to merge a file 100 GB
- Created with
The software provides a console interface with two verbs: [g]enerate, [s]ort for every program. To start sort an output file the target directory must be created. By default the path is C:\Temp\Files
.
.\ExtSort.exe --help
Fig. 1 - Using the option --help with the interface.
To generate a file with the size of 1 GB the following command can be executed:
.\ExtSort.exe generate output.txt 1048576
To start sorting a file with the correct data format the following commands can be executed:
.\ExtSort.exe sort output.txt output_sorted.txt IO
.\ExtSort.exe sort output.txt output_sorted.txt CPU
The provided software contains an implementation of the External Sort algorithm with a possible extension to split I/O operations between 2 drives.
During the splitting phase, a source file is sequentially read and split into NumberOfFiles
chunks. For every chunk, the name is set as <counter>.unsorted
. In the end of each iteration the algorithm analyzes whether the current byte is set at the position of a line's end, and if not, continues to write byte-by-byte until the end of the line is met. After a file is persisted, a map of filename::number of lines
is populated.
For every page, the program opens the SortPageSize
streams each in a separate task and starts to populate a buffer with priorities, the size of which was calculated during the Splitting phase. When a buffer is loaded, sorting occurs using the implemented Multi-Column Comparer to correspond to the requested template, then a space for a sorted file is allocated and the buffer is written into a file with the .sorted
extension. Right after a page was sorted a task to merge sorted files is executed. A possible strategy during the phase is to equate SortPageSize = SortThenMergePageSize x SortThenMergeChunkSize
so a page starts merging into ~sqrt(SortPageSize)
files when the next page is being sorted.
During the splitting phase, a source file is sequentially read and split into blocks of FileSplitSizeKb
size. For every chunk, a priority is calculated and linked with the original row using a priority queue size of BufferCapacityLines
. At the end of each iteration, the algorithm starts to dequeue the priority queue into a file with the .sorted.tmp
extension. When there are SortPageSize
tasks, they are awaited until all of them are completed.
The common merging strategy is to try to converge files from bottom to top forming a B-Tree. A possible chain is 64 -> 16 -> 4 -> 1
. Every MergePageSize
opens MergeChunkSize
streams, then reads the very first element, binds an index to it and continues to process every stream line-by-line sequentially enqueue a row to the priority queue, where the priority is set as a tuple of (<number>, <string>)
using the K-Way Merge. The first dequeued item is written to a target file. In case two drives are correctly set in the appsettings.json
it is possible to merge files from one drive to another: ex: C:\\->E:\\->C:\\->
.
Specifying the following settings the algorithm will split a file into 64 chunks ~16MB each and start processing 4 pages of 16 files.
The general files merging strategy: 64 -> 16
(during the Sorting/Merging Phase) -> 1
(during the Merging Phase). All operations will be performed within the two drives C:\ and E:\.
"SorterSettings": {
"NumberOfFiles": 64,
"SortPageSize": 16,
"SortOutputBufferSize": 2097152,
"MergePageSize": 4,
"MergeChunkSize": 16,
"MergeOutputBufferSize": 16777216,
"IOPath": {
"SortReadPath": "C:\\Temp\\Files",
"SortWritePath": "E:\\Temp\\Files",
"MergeStartPath": "E:\\Temp\\Files",
"MergeStartTargetPath": "C:\\Temp\\Files"
}
},
"SorterCPUSettings": {
"BufferCapacityLines": 720000
},
"SorterIOSettings": {
"SortThenMergePageSize": 4,
"SortThenMergeChunkSize": 4
}
Specifying the following settings the algorithm will split a file into 512 chunks ~20MB each and start processing 32 pages of 16 files.
The general merging strategy: 512 -> 64
(during the Sorting/Merging Phase) -> 8 -> 1
(during the Merging Phase). All operations will be performed within the single drive C:\.
"SorterSetting": {
"NumberOfFiles": 512,
"SortPageSize": 16,
"SortOutputBufferSize": 2097152,
"MergePageSize": 8,
"MergeChunkSize": 8,
"MergeOutputBufferSize": 16777216,
"IOPath": {
"SortReadPath": "C:\\Temp\\Files",
"SortWritePath": "C:\\Temp\\Files",
"MergeStartPath": "C:\\Temp\\Files",
"MergeStartTargetPath": "C:\\Temp\\Files"
}
},
"SorterCPUSettings": {
"BufferCapacityLines": 1000000
},
"SorterIOSettings": {
"SortThenMergePageSize": 2,
"SortThenMergeChunkSize": 8
}
Specifying the following settings the algorithm will split a file into 4096 chunks ~25MB each and start processing 128 pages of 32 files.
The general merging strategy: 4096 -> 512
(during the Sorting/Merging Phase) -> 64 -> 8 -> 1
(during the Merging Phase). All operations will be performed within the single drive C:\.
"SorterSetting": {
"NumberOfFiles": 4096,
"SortPageSize": 32,
"SortOutputBufferSize": 2097152,
"MergePageSize": 8,
"MergeChunkSize": 8,
"MergeOutputBufferSize": 16777216,
"IOPath": {
"SortReadPath": "C:\\Temp\\Files",
"SortWritePath": "C:\\Temp\\Files",
"MergeStartPath": "C:\\Temp\\Files",
"MergeStartTargetPath": "C:\\Temp\\Files"
}
},
"SorterCPUSettings": {
"BufferCapacityLines": 1300000
},
"SorterIOSettings": {
"SortThenMergePageSize": 4,
"SortThenMergeChunkSize": 8
}