-
Notifications
You must be signed in to change notification settings - Fork 94
Model design
The model is the library that does most of the heavy lifting for odoc. Its aim is to provide descriptions of the interfaces of compilation units that are accurate according to OCaml’s module system semantics. This document describes the intended design of the model. Many of the details differ from what is currently implemented.
Most of the model’s work involves accurately determining which definition a path refers to. To do this, it tries to use very accurate representations of the different kinds of path present in an OCaml compilation unit. The representations try to ensure that all paths point to a definition of the appropriate kind. The representations also keep track of which paths have been “resolved”. A path is only “resolved” once we have verified that the corresponding definition exists in the place that we expect it.
The representations of paths are in the Paths_types and Paths modules.
The model tries to be completely independent of the “root” type used to represent compilation units – in fact it used to be polymorphic over it. This allows the rest of odoc the freedom to decide how different units are represented. This avoids the model making any assumptions about what compilation units might be present in the documentation. It also allows the root to contain information about where a particular unit is located (e.g. what package it is in).
Identifiers represent concrete definitions within compilation units. When generating documentation these correspond to the actual URL that a path should link to. They are pretty simple, consisting of various projections originating eventually from a single [Root] element.
Paths represent the concept of “path” within the OCaml language. It is these that appear in the representations of OCaml definitions (e.g. in descriptions of type expressions). Each resolved path refers to a single identifier, but a single identifier might have many possible paths.
Since we only handle interfaces, there are only paths for modules, module types, types and class types. Unlike identifiers, paths are quite complicated. They support simple projections, functor applications and a variety of book-keeping paths which dictate how a particular path should be interpreted as an identifier. The book-keeping paths are discussed in the sections on hidden paths, canonical paths, module alias substitutions and functor argument substitutions.
Fragments represent the fragments of OCaml paths that appear in with type
and
with module
constructs in the module language. A fragment does not, on its own,
correspond to an identifier. It instead represents how one identifier can be extended
to become a second identifier.
Fragments contain similar constructors to paths, although they do not need to deal with functor application which simplifies things a bit.
References represent paths written within documentation comments. Each resolved reference refers to a single identifier, but a single identifier might have many possible references.
References have similar constructors to paths, but with additional projections to handle references to other kinds of definition, and no support for functor application (yet).
Note that references must either be to a compilation dependency, a unit within the same package, or they must explicitly mention the package in which the target definition resides.
Many definitions are not intended to be visible to the user. We treat any definition
whose identifier contains __
as hidden, as well as supporting explicit hiding
constructs in documentation comments. The model tries to replace paths to hidden
definitions with paths to equivalent non-hidden definitions. In order to do this, it
marks any resolved path or reference to a hidden definition with a Hidden
path
constructor. This constructor optionally includes an alternative path that can be
substituted for the hidden one.
Some definitions are intended to be hidden and always replaced by a specific
alternative definition. Here the model allows all paths to a particular definition to be
replaced with an equivalent reference. To support this it marks any resolved path or
reference to such a definition using a Canonical
path constructor. This constructor
contains the original path and the reference it should be replaced with. If the
reference cannot be resolved then the original path is still used.
To (dramatically) reduce the size of interfaces the model does not expand interfaces for
module aliases. So if the module Foo
is defined as an alias for the module Bar
then
there is no page for Foo
. This means that any paths to the definition of Foo
should
instead refer to the identifier for Bar
. One exception to this rule is if the
definition of Bar
is hidden, in which case Foo
is given its own page. To support
this the model marks any resolved paths or reference to a module alias with an Alias
path constructor. This constructor contains the original path of the alias and the path
of which it is an alias.
Due to the dependent nature of module types, some paths involving functor application require substitution of the functor argument to make a valid identifier. For example, given:
module F (X : sig module type S end) : sig module N : X.S end
module M : sig module type S = sig type t end end
then the path F(M).N.t
should correspond to the identifier for M.S.t
. To represent
these cases the model would mark the path of F(M).N
with a Subst
path constructor,
which includes the path of the module F(M).N
along with the path of the module type
M.S
that it should be replaced with when projecting from it. There is also a similar
SubstAlias
path constructor for replacing a module path with another module
path. Unlike the case of module aliases above, this replacement must be carried out
even if the replacement path is hidden because without replacement there is no valid
path.
Module alias paths created with -no-alias-deps
option may point to compilation units
that are further forward in the dependency graph than the unit that contains them.
Such paths cannot be resolved until the “link” phase of generating interfaces. In
earlier phases they are explicitly marked with a Forward
constructor.
In full generality, there is no complete way to handle forward paths since they essentially behave as dynamically bound names. To avoid this we only allow forward paths within a package, and mark any other forward paths as unresolved.
Compilation unit interfaces are represented by a set of mutually recursive types. Each kind of definition has it’s own type. These types contain the identifier of the definition, any associated documentation, and a description of the definition itself.
The representations of compilation unit interfaces are in the Lang module.
The descriptions of modules, module types, includes, classes and class types, also contain an optional “expansion” field. This contains a normalized version of the definition that directly contains its constituent components. For example, given:
module M : sig type t end
module N = M
the description of the N
module will essentially be = M
, whilst its expansion would
be a representation of sig type t = M.t end
.
Creating descriptions of compilation unit interfaces is done in two phases: “compile” and “link”.
The compile phase requires all compilation dependencies of the unit to have already been “compiled”. This allows most paths and expansions to be done, since those are following the OCaml semantics and the compilation dependencies are all the modules which the OCaml type-checker needed to read to understand the file. The main exception is forward paths, where we don’t follow OCaml’s (slightly insane) semantics and so may require additional units outside of the compilation dependencies.
The compile phase first reads an OCaml file (e.g. .cmti, .cmi, .cmt) and creates an initial description of the interface with paths and references unresolved and no expansions. This part is handled by the modules Cmi, Cmti and Cmt.
Using the “compiled” descriptions of the unit’s compilation dependencies, the model can resolve all backwards paths. Forward paths and references are left until the “link” phase.
Using the “compiled” descriptions of the unit’s compilation dependencies, the model can add expansions for modules, module types, classes and class types. As mentioned above, we don’t add expansions for module aliases to keep down the size of the interfaces.
The linking phase requires the following units to have been “compiled” (but not “linked”):
- All units in the same package as the current unit
- All units in the same package as a compilation dependency of the current unit
- All units in packages that are mentioned by name in references
Compared to the dependencies for “compile”, dependencies for the “linking” step are defined at the granularity of the package and include packages mentioned by name in references.
Note that odoc provides a command which, given a “compiled” unit, lists all the packages that need to have been “compiled” before you can “link” that unit.
Forward paths are resolved during linking. Since we treat forward paths that go across a package boundary as unresolved, any legal destination of a forward path will be “compiled” before the linking step.
References are resolved during linking. Since forward references that cross package boundaries must name the destination package explicitly, any legal destination of a reference will be “compiled” before the linking step.
As mentioned earlier, any module alias to a hidden path is replaced by its
expansion. Note that a path containing a Hidden
node where the right-hand side of
the node is resolved is not considered hidden – since we can just use the non-hidden
right-hand side instead of the hidden path.
This is done during linking because the “hiddenness” of a path can depend on forward paths (due to forward paths in module aliases) or references (due to canonical paths).
Hidden paths includes “self-canonical aliases”, which are aliases to a path with a canonical element that points back to the original alias. For example, given
(** @canonical N *)
module M : sig ... end
module N = M
the definition of N
(= M
) is a “self-canonical alias” and as such is treated as an
alias to a hidden path.
In addition to hidden module aliases, there are also hidden type aliases, module type aliases, etc. For definitions that can have expansions we replace the alias with the expansion. For definitions that cannot have expansions we remove the alias to leave an abstract definition.
The bulk of the work done by the model should be done using the “components” data structure. This data structure is an optimized representation of a module’s interface that supports path resolution, reference resolution and module expansion. These operations could all be done directly using the compilation unit interface’s data, however that is very inefficient so using a specialized structure that takes advantage of laziness and memoization is needed for reasonable performance.
There are a number of core module system operations that must be supported by the components data structure.
A constructor is needed to build a components value from the description of a module signature. This requires iterating over the definitions in the signature.
A constructor is needed to build a components value to represent an alias to a module path or a module whose type is a module type path.
Projections allow a sub-module to be extracted from the components of a module. There are two kinds of projections here: identifier projections and path projections.
Consider the following module definitions:
module M : sig
type t
module N : sig
type s = t
end
(* A *)
end
(* B *)
There are two different views on the module type of M.N
, the view at (* A *)
which is:
sig type s = t end
and the view at (* B *)
which is:
sig type s = M.t end
the difference between the two views is that in the first the projection of t
from
M
occurs within the identifier (and so is not visible to the user), whilst in the
second view the projection of t
from M
occurs withing the path. We call the first
view of M.N
the identifier projection, and the second view the path projection.
Renaming occurs when a module type is used in a different location. For example in,
module type S = sig type t end
module M : S
we do a renaming when we copy the components of S
to M
. This involves changing the
identifiers associated with the definitions in S
, which is analogous to generating
fresh identifiers as the OCaml compiler would do in this situation.
Substitution occurs during functor application. It involves replacing all uses of the identifier of the functor parameter with the path of the functor argument.
Strengthening occurs when a module alias is expanded to its constituent components. It involves recursively finding abstract definitions in the module and extending them with aliases to the corresponding definitions in the aliased module. For example, given
module M : sig type t type s = int end
then = M
will expand to sig type t type s = int end
strengthened with M
, which
is:
sig type t = M.t type s = int end
Extension of a module type with additional equations happens when using with type =
and with module =
.
Destructive substitution is for handling with type :=
and with module :=
There are many places where memoization and laziness can improve the performance of the components data structure. We describe a few important ones below.
All lookups of identifiers and paths can be memoized by memoizing lookups of roots, projections from components and functor applications of components.
It is worth noting that whilst constructing the components of a particular module there will be lookups of sub-components of that module, which could cause an infinite loop. Such issues can be avoided by having a local environment that maps local identifiers to the corresponding memoized components, and passing that environment around whilst constructing the original components structure.
The construction of components from signatures and identifiers can be done lazily. This avoids needing to process an entire compilation unit in order to access a single sub-component within that unit. (This also helps with the infinite loop problem mentioned above).
The recursive operations (e.g. renaming, substitution, strengthening) can be done lazily. This is particularly important for renaming and substitution. Looking up a sub-component of a module only requires performing renamings and substitutions on the sub-component being looked up, however if these recursive operations are performed eagerly then the entire module will be processed unnecessarily.