Paths are used by file systems as well as protocols to represent paths to files and other resources. We need a generic type safe way to represent paths in general and file system paths in particular.
OS include files generally define PATH_MAX to 4K and NAME_MAX to 255, however, it is possible to create paths bigger than these depending on the file system.
In general directory trees could be quite deep and a directory can contain millions of entries. A good benchmark to measure the efficiency of path representation would be to traverse a directory tree recursively and list all the nodes under the tree. We could do this many times so that we do not need a really big directory tree.
A file system starts with just the root directory and then files are created in the file system by the user or by programs storing their data on the file system. When a directory or file is created, or when a directory is listed, the following operations are performed:
- for lookups an existing directory name must be resolved based on the name supplied by the user.
- for creation the file name to be created is supplied by the user
When the user asks the file system to lookup or create a file or directory in the file system:
- The operating system passes the name, as it is without any changes whatsoever, to the file system. or does it? Windows?
- The file system may translate the name to its own conventions before a
lookup or create, e.g. it may
- change the name to upper case
- translate the name to 8.3 chars
- change the character encoding?
- change the unicode normalization form of the name (Apple)
When resolving an existing directory name in the file system we need to supply a path which consists of component names separated by a separator byte. Separators are of no consequence to the file system, they are resolved by the OS and the path components are used to lookup the paths one components at a time. The path for lookup is acquired either by a user input, device input or by the program which previously got the path entries by traversing the file system itself.
- When the path is acquired by a user input, the user input could be:
- a literal string in the program
- a path entered via an input device
- a path coming from the network
- If the path was previously acquired from the file system then the best thing to do is to never change anything in the path and store it as it is and supply exactly the same path when needed. That way we can guarantee that the path remains exactly what it was in the file system.
The encoding of the source code file depends on the editor used and the encoding chosen when saving it. The string literals would be parsed by the GHC parser and then stored in the generated binaries as null terminated C string literals encoded in UTF-8 (see GHC reference). There are several possible points of failure here:
- GHC parser needs to interpret the source code encoding correctly.
- We assume that the editor does not perform any translation on the literal as entered by the user e.g. it does not perform unicode normalization on it. If it does then the string as entered by the user won't remain the same when it reaches the file system.
- GHC parser stores the parsed string literal in UTF-8 encoding. We assume that GHC does not perform any unicode normalization or any other translation on the string.
The UTF-8 encoded string literal can be passed as a blob of bytes to the file system or it can be converted to String type and re-encoded as UTF-8 both should work equivalently in this case.
The path provided by the user would assume some encoding based on the terminal settings or the encoding assumed by the sender over the network. The correctness depends on the contract between the two parties e.g. the locale setting. We assume that we get a sequence of raw bytes from the input device. We need to use the sequence as raw bytes and send it as it is to the file system without any translation.
We need to parse the path components by the separator bytes. We assume that the separator can be identified and removed correctly irrespective of the encoding. We also make sure that none of the bytes in the components is a separator byte.
Also, we would join the path components by the separator byte irrespective of the encoding of the components. If the OS treats the path as a sequence of bytes and nothing else and the components do not have the separator byte then we are good, we know that the OS would also be parsing based on the separator as a raw byte.
We may perform some validations on the paths such as the file names are not "." or "..". Such validations could be optional and we could also provide a way to not perform any validations and just blindly use the paths as provided by the user and let the file system/OS fail.
As we noted earlier, the file system may translate the paths before using them. For example, it may store the path after converting it to NFD unicode normalization. Translation may create some round tripping issues for programs. For example, a program may use a string literal which is stored in NFC and the file system converts it to NFD. Later, when the same path is retrieved from the file system and compared with the string literal that was to create it then it won't match. For such cases the programs need to understand the file system and perform comparisons by performing appropriate translations on the paths. To perform matching and translations correctly the program needs to correctly interpret the encoding specific to the file system.
When we display the paths to the user then we are forced to interpret it according to some encoding, to display the path correctly we have to know exactly how the file system stores the path. Otherwise if we display it differently, the user may use the displayed result to find the file and may not find it.
In general, the OS does not have to look inside the path components other than the ability to recognize the separator char to parse the path into components and to construct it from components. However, in some cases the OS may may have to look inside the path e.g. "." or ".." special paths on Unix.
On Unix a path component is just a blob of bytes. The user can encode and decode this blob of bytes as they desire. All they need to care about is that the path separator byte is respected. However, the the path component blobs may be changed by the underlying filesystem e.g. apple HFS may normalize the path before storing, therefore, the representation may change when it is returned back to the user.
On modern Windows systems a path component is UTF16-LE encoded. Theoretically, from the OS perspective, it does not have to look inside the path components other than determining whether a 16-bit unit represents the separator character, therefore, users can use any encoding as long as it encodes to 16-bit multiples and the separator is preserved as equivalent to the UTF16-LE representation.
- Safety against using an absolute path where a relative path is to be
used and vice-versa.
- Validations for absolute or relative path when constructing a path.
- We cannot append an absolute path to another path
- Safety against using a file name where a directory name is to be used and
vice-versa.
- Certain validations can be performed e.g. file names cannot be "." or "..".
- We should not be appending more directory components to a file path
In don't care situations we should be easily able to use any type conveniently or cast a type into another. It is desirable that the programmer can choose the safety level. For example, we should be able to instantiate a path type where we only worry about the distinction between Absolute and Relative paths but no distinction between files and directories or vice versa.
- minimal dependencies, specifically streamly does not depend on bytestring
- round-tripping safety wrt to file system returned paths
- type safety for different path types
- support Posix/Windows
- support URI paths and other ways to represent paths where the separator could be different.
- Should we store path as separate components or single string with separators?
- Should we validate the paths returned from the file system or trust those and use directly without any validations? Need to see if that makes any difference to path heavy benchmarks. If we want to use it directly then we have to store it as a single string.
- Parameterize the low level APIs with the separator so that we can support arbitrary separators when parsing or reconstructing paths.
- The low level API can support path handling in trees/DAGs/Graphs in general. For example, in trees we cannot have multiple parents of a child whereas in DAGs that is allowed, in graphs we can have cycles. We may also need ways to detect cycles.
- Do we need to support arbitrarily long paths i.e. streaming of path? We do not need that for file system paths and file system paths are limited size and operating system anyway requires them in strict buffers. In case of graphs if we have cycles paths can be infinite, we could generate a stream of path and the consumer could be traversing the graph according to the generated stream. If we want to support streaming then we have to store paths as a stream of chunks rather than a single string.
- In general, paths need not be strings, e.g. they can be references to locations in memory or they can be IP addresses of nodes. At an abstract level, paths are just a stream of tokens that represent a certain traversal.
- Relative paths are the most general representation. At a low level, all paths are relative, absolute paths are relative to a specified root whereas relative paths are relative to a dynamic root which is the current directory.
- Windows can have the root as different drive letters. So to represent paths with a root in general we can also store the specific root along with the path. In case of POSIX this will always be "/". In general, it could be a host name or IP address or dependent on the protocol whose path we are representing.
- We can parameterize the low level path type with the type of path e.g. POSIX, WINDOWS, HTTP etc. In general, programs may have to manipulate different types of paths at the same time. High level path types can be instantiated using the low level type therefore they can be much simpler as desired.
Some related links found by web search:
- https://gitlab.haskell.org/ghc/ghc/issues/5218
- https://nodejs.org/fr/docs/guides/working-with-different-filesystems/
- https://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux
- https://docs.microsoft.com/en-us/windows/win32/intl/character-sets-used-in-file-names
- https://beets.io/blog/paths.html