The Billion Row Challenge in Prolog #2344

david-sitsky · 2024-02-27T07:50:26Z

Hi all - I realise this is perhaps a bit of a distraction, but I thought I'd quickly attempt "The Billion Row Challenge" as described here: https://www.morling.dev/blog/one-billion-row-challenge/ to see what issues might come up with Scryer. I've hit this problem before when using Scryer on massive log files: even when my DCGs leave no choicepoints, memory keeps accumulating until the Linux OOM killer terminates the process. The file with a billion rows is 13G in size, yet the same program in SWI keeps memory usage nice and constant because of GC. Is there a rough timeline for when Scryer will get a GC?

Apart from that, I noticed SWI-Prolog reads from the file in 4 KB chunks, which is reasonable, as strace shows:

[pid 1193873] read(3, "n;11.8\nAustin;21.7\nSaint Petersburg;-18.1\nBridgetown;22.1\nRangpur;39.1\nBelize City;27.8\nSan Francisco;17.5\nChittagong;44.8\nMaun;7.6\nOdesa;2.3\nOuarzazate;19.4\nKyiv;-10.3\nKarachi;19.1\nVillahermosa;5.6\nN"..., 4096) = 4096
[pid 1193873] clock_gettime(0xff6e4376 /* CLOCK_??? */, {tv_sec=1113, tv_nsec=300972667}) = 0
[pid 1193873] rt_sigprocmask(SIG_BLOCK, ~[QUIT BUS SEGV CONT STOP PROF RTMIN RT_1], [], 8) = 0
[pid 1193873] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 1193873] clock_gettime(0xff6e4376 /* CLOCK_??? */, {tv_sec=1113, tv_nsec=301368717}) = 0
[pid 1193873] read(3, "1\nMaputo;24.0\nOmaha;-2.7\nOslo;11.6\nToliara;27.3\nAssab;20.1\nMadrid;8.4\nBatumi;17.7\nAustin;4.9\nYangon;38.0\nBujumbura;31.1\nToliara;32.8\nPhoenix;29.3\nMaun;13.9\nVeracruz;22.5\nMonaco;35.5\nNaha;24.5\nJohannes"..., 4096) = 4096
[pid 1193873] clock_gettime(0xff6e4376 /* CLOCK_??? */, {tv_sec=1113, tv_nsec=304342126}) = 0
[pid 1193873] rt_sigprocmask(SIG_BLOCK, ~[QUIT BUS SEGV CONT STOP PROF RTMIN RT_1], [], 8) = 0
[pid 1193873] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 1193873] clock_gettime(0xff6e4376 /* CLOCK_??? */, {tv_sec=1113, tv_nsec=304678181}) = 0
[pid 1193873] read(3, "cow;5.2\nDhaka;13.0\nNakhon Ratchasima;51.3\nVeracruz;31.7\nIstanbul;18.7\nMarrakesh;40.2\nNapier;-2.3\nBaguio;15.6\nNapier;11.3\nLa Ceiba;26.7\nHiroshima;16.2\nLhasa;11.4\nDa Lat;13.3\nTamanrasset;11.6\nDa Nang;48"..., 4096) = 4096
...

However, for some reason Scryer is issuing read() system calls of only 4 bytes at a time, which is not great for performance. Is there a reason for this?

[pid 1191974] read(11, "Kyiv", 4)       = 4
[pid 1191974] read(11, ";4.8", 4)       = 4
[pid 1191974] read(11, "\nBuj", 4)      = 4
[pid 1191974] read(11, "umbu", 4)       = 4
[pid 1191974] read(11, "ra;1", 4)       = 4
[pid 1191974] read(11, "6.3\n", 4)      = 4
[pid 1191974] read(11, "Pana", 4)       = 4
[pid 1191974] read(11, "ma C", 4)       = 4
[pid 1191974] read(11, "ity;", 4)       = 4
[pid 1191974] read(11, "34.2", 4)       = 4
[pid 1191974] read(11, "\nSap", 4)      = 4
[pid 1191974] read(11, "poro", 4)       = 4
[pid 1191974] read(11, ";4.2", 4)       = 4
[pid 1191974] read(11, "\nCon", 4)      = 4
[pid 1191974] read(11, "akry", 4)       = 4
[pid 1191974] read(11, ";18.", 4)       = 4
[pid 1191974] read(11, "3\nTa", 4)      = 4
[pid 1191974] read(11, "shke", 4)       = 4
[pid 1191974] read(11, "nt;6", 4)       = 4
[pid 1191974] read(11, ".3\nG", 4)      = 4
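
To make the difference between the two traces above concrete, here is a minimal Rust sketch that counts how many reads reach the underlying source when data is consumed 4 bytes at a time, with and without a 4096-byte buffer in between. `CountingReader` and `count_reads` are hypothetical names invented for this illustration. This is not Scryer's actual `CharReader` code, and an in-memory byte slice stands in for the file, so each counted call only approximates one read() system call.

```rust
use std::io::{self, BufReader, Read};

/// Wraps a reader and counts how many times `read` is called on it.
/// Each call stands in for one read() system call on a real file.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Count the newlines in `buf`.
fn newlines(buf: &[u8]) -> usize {
    buf.iter().filter(|&&b| b == b'\n').count()
}

/// Consume `data` in requests of `chunk` bytes and return
/// (lines seen, reads that reached the underlying source).
/// With `buffered == true`, a 4096-byte BufReader sits in between.
fn count_reads(data: &[u8], chunk: usize, buffered: bool) -> io::Result<(usize, usize)> {
    let mut source = CountingReader { inner: data, calls: 0 };
    let mut buf = vec![0u8; chunk];
    let mut lines = 0;
    if buffered {
        let mut reader = BufReader::with_capacity(4096, &mut source);
        loop {
            let n = reader.read(&mut buf)?;
            if n == 0 {
                break;
            }
            lines += newlines(&buf[..n]);
        }
    } else {
        loop {
            let n = source.read(&mut buf)?;
            if n == 0 {
                break;
            }
            lines += newlines(&buf[..n]);
        }
    }
    Ok((lines, source.calls))
}

fn main() -> io::Result<()> {
    // Synthetic input in the challenge's "station;temperature" format.
    let data = "Kyiv;4.8\n".repeat(1000);
    let (lines, direct) = count_reads(data.as_bytes(), 4, false)?;
    let (_, via_buffer) = count_reads(data.as_bytes(), 4, true)?;
    println!("{lines} lines: {direct} unbuffered reads, {via_buffer} buffered reads");
    Ok(())
}
```

The consumer still asks for 4 bytes at a time in both cases; only the number of requests that reach the source changes, which is the effect a buffering layer in front of the file descriptor would have.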

mthom (Maintainer) · 2024-02-27T20:33:21Z

CharReader is probably responsible for all the read system calls. As to the GC timeline, I hope to have it finished this year or the next. I'd like to establish a steady source of funding for Scryer development. Failing that, I'll have to take a job, and of course that will slow down development.

triska · 2024-03-24T09:04:15Z

All the overhead of tiny syscalls will disappear completely once the OS maps files into the heap in their entirety with a single mmap syscall, as described in #251. I think Trealla Prolog is already able to do this. The prerequisite is the efficient string representation outlined in #24: a naive representation of lists of characters incurs a 24-fold overhead, which can only be removed by representing lists of characters more compactly in memory.
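
As a rough back-of-the-envelope check, the sketch below applies the 24-fold factor quoted above to the 13G input from the original post. `in_memory_bytes` is a hypothetical helper, "13G" is assumed to mean 13 GiB, and the result is only an estimate of what holding the whole file as a naive list of characters would cost, not a measurement of Scryer.

```rust
/// Estimated bytes needed to hold `file_bytes` of text when every input
/// byte costs `overhead_factor` bytes in memory.
fn in_memory_bytes(file_bytes: u64, overhead_factor: u64) -> u64 {
    file_bytes * overhead_factor
}

fn main() {
    const GIB: u64 = 1 << 30;
    // Assumption: the "13G" file size from the thread means 13 GiB.
    let file = 13 * GIB;
    // 24 is the overhead factor quoted in the thread for the naive representation.
    let naive = in_memory_bytes(file, 24);
    println!("compact: {} GiB, naive: {} GiB", file / GIB, naive / GIB);
}
```

Under these assumptions the naive representation would need on the order of 312 GiB, which suggests why the compact string representation is a prerequisite for mapping whole files.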
