Update November 2020:
This software implemented ideas from my dissertation, "Automated Parallelization from R Code". If you have questions, or you're interested in reviving this project, then you can contact me at [email protected].
This project is experimental and ambitious. To write parallel R today, consider using an established package, such as the recommended parallel package, which has been included with R since version 2.14. The CRAN Task View "High-Performance and Parallel Computing with R" provides a thorough review of available third party software.
makeParallel makes code parallel. It statically analyzes entire R scripts to detect dependencies between statements, and then generates equivalent parallel scripts.
The appeal of this approach is that we don't have to change our code to make it parallel: we can write higher level code that also performs better. By letting this system transform our code, we benefit from improvements in the system itself, and our code can be changed in ways we may never have thought of, or that would be infeasible to do by hand. This is akin to an optimizing compiler: as the compiler becomes smarter, the same programs run faster with no change to the source. Since makeParallel creates R code from R code, in this sense it acts as a transpiler.
Most conventional approaches to parallel programming require the user to specify when and where the program should run in parallel, typically through dedicated programming models and application programming interfaces. The conventional approach has several advantages. First of all, it works, and it scales up to huge clusters. Well defined interfaces simplify understanding and debugging. It also has disadvantages: programmers usually need detailed knowledge of the underlying system, data, and computations to use it effectively.
In contrast to the conventional approach, makeParallel takes general R code and automatically transforms it into parallel code. Thus the system must determine a correct and reasonably efficient way to execute parts of the program in parallel.
makeParallel builds on existing infrastructure wherever possible. For example, it uses `parallel::makeCluster` to initialize clusters of workers communicating over sockets. Indeed, the conventional approaches provide the lower level tools.
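For context, here is a small sketch of that lower level, explicit style using only the parallel package. makeParallel generates code in this same spirit, but on the user's behalf.

```r
library(parallel)

# The programmer explicitly decides what runs in parallel and where.
cl <- makeCluster(2)                               # two socket workers
squares <- parLapply(cl, 1:10, function(i) i^2)    # apply in parallel
stopCluster(cl)                                    # shut the workers down
```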
The following diagram illustrates the computational model for the steps in inferring parallelism as implemented in the function `makeParallel`.
In this diagram the rectangles represent base classes defined in this package, and the edges represent generic methods. The diagram depicts the following:
- The end user supplies some input code, which could be the name of a file or an actual expression.
- `inferGraph` statically analyzes the code and constructs a dependency graph between the expressions, currently based on use-definition chains. A future version of the software will allow other types of constraints, such as requiring two expressions to run on the same processor.
- `schedule` assigns each expression to a processor and orders the expressions on each processor, while respecting the constraints of the dependency graph.
- `generate` creates executable code from the schedule.
- `writeCode` simply writes the code to a file and returns the expression containing the code.
These steps are all designed to be modular.
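As a rough sketch of how these modular steps compose, the pipeline can also be run one generic at a time (argument details here are illustrative and may differ from the current API):

```r
library(makeParallel)

code <- parse(text = "
x <- rnorm(100)
y <- x + 1
z <- sum(y)
")

g <- inferGraph(code)   # build the dependency graph between expressions
s <- schedule(g)        # assign expressions to processors and order them
p <- generate(s)        # produce executable parallel code from the schedule
writeCode(p)            # write the code to a file and return the expression
```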
The most interesting step is the scheduler.
In general, scheduling a task graph is NP-hard, and many algorithms exist for this problem.
The default scheduler produces a `MapSchedule`, which represents data parallelism. This captures the common and useful case of "embarrassingly parallel" operations, which come from replacing apply type functions with equivalent parallel versions.
The current implementation is basic and does the following:

- Identifies `for` loops that can be parallelized
- Replaces top level `lapply` and `mapply` with their `mc` equivalents from the parallel package (a sketch follows this list)
- Avoids nested parallelism
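To make the rewrite concrete, here is a hypothetical before-and-after; the code makeParallel actually generates may differ in detail.

```r
library(parallel)

# Serial original:
squares <- lapply(1:8, function(i) i^2)

# The kind of data-parallel replacement the MapSchedule scheduler aims for
# (hypothetical; mclapply forks, so this falls back to serial on Windows):
squares <- mclapply(1:8, function(i) i^2, mc.cores = 2)
```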
Future versions may include the following features. If you need these sooner, then get in touch.

- Marking expressions that can be embarrassingly parallel
- Deciding whether or not to run each expression in parallel
- Generating code for Windows
- Handling more apply style functions, e.g. `apply`, `replicate`
- Making vectorized operations parallel
- Splitting data and keeping it distributed for the duration of the program
Task parallelism means two or more workers execute two or more different blocks of code simultaneously.
The only way to get a speedup from task parallelism is if the dependency graph allows two or more long running parts of the program to run simultaneously.
This could happen when reading two large files, for example.
`scheduleTaskList` implements a standard list scheduling algorithm. The algorithm needs estimates of the execution time of each expression to work properly, so we pass them in explicitly below.
```r
n = 1e6
xfile = tempfile()
write.csv(data.frame(x = rnorm(n)), xfile, row.names = FALSE)
yfile = tempfile()
write.csv(data.frame(y = rnorm(n)), yfile, row.names = FALSE)
```
Here is how to use the task scheduler, `scheduleTaskList`:
```r
library(makeParallel)

code = parse(text = "
x = read.csv(xfile)
y = read.csv(yfile)
xy = sort(c(x[, 1], y[, 1]))
")

g = inferGraph(code, time = c(1.25, 1.24, 0.2))
pcode = makeParallel(graph = g, scheduler = scheduleTaskList)

plot(schedule(pcode))
```
This plot illustrates the schedule returned from the scheduling algorithm. The completion time of the script is the last time any processor is busy in the graph; more efficient schedules complete earlier. The plot is a useful diagnostic: if all cores are mostly busy, that's a good sign. If only one core does almost all the work, then either 1) the code can't be improved through task parallelism, or 2) the default scheduling algorithm chose a poor schedule. The task scheduling problem is a very rich problem with many published algorithms, so we leave hooks in for users to supply their own scheduling algorithm.
The generated code is an ugly thing that no sane person would ever write by hand. But it does have a few virtues:
- It runs
- It does the same thing as the serial version while being faster
This is the actual generated code:

```r
writeCode(pcode)
```
`makeParallel` allows us to pass in functions for scheduling and code generation that customize the behavior.
TODO: Come up with improved example. Perhaps task parallelism with bottom level sorting.
Some schedulers must be tied to their code generators.
`inferGraph`, `schedule`, and `generate` are all generic functions, so we can allow user defined classes to extend the system through R's S4 object oriented programming system.
TODO: Incorporate the example above into the makeParallel classes.
In this section we went beyond the basic customization in the previous section in two ways. First, we extended the existing class hierarchy by defining our own scheduler. Second, we defined methods and relied on method dispatch to control some aspects of the code generation process. We did not have to touch the dependency graph computations.
Users can specify the scheduling and code generation steps by passing functions or defining methods.
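As a minimal sketch of the function-passing route (the scheduler's exact calling convention is an assumption here; consult the package documentation), a custom scheduler can simply wrap an existing one:

```r
library(makeParallel)

# Hypothetical custom scheduler: inspect or tweak the dependency graph,
# then delegate to the existing list scheduler.
my_scheduler <- function(graph, ...) {
    # ... custom logic on the dependency graph could go here ...
    scheduleTaskList(graph, ...)
}

pcode <- makeParallel(code, scheduler = my_scheduler)
```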
Here's an associated mental model for the possible steps taken by the `makeParallel` function.
The shaded nodes are not yet fully implemented.
In summary, `makeParallel` can be used in the following ways:

- `makeParallel(code)` is the simple, common use case.
- `makeParallel("some_script.R")` dispatches on the class of the first argument.
- `makeParallel(graph = g)` skips the dependency graph inference in case the user would like to supply their own graph.
- `makeParallel(code, scheduler = my_scheduler, generator = my_code_generator)` allows users to customize the steps in the process by passing in their own functions to perform them.
- `makeParallel(code, scheduler = scheduleTaskList)` dispatches on the class of the resulting objects, allowing users to extend the system by defining their own classes.
Code transformation relies on static code analysis. This means we don't actually run any code until the user specifically asks for it. The CodeDepends package currently provides most of the underlying tools.
As we build more tools that may be useful for general purpose static R code analysis, we've been putting them in the CodeAnalysis package. makeParallel will most likely come to depend on CodeAnalysis.
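For a rough illustration of the kind of information the dependency graph is built from, CodeDepends can report the variables an expression uses and defines (the exact API details may differ slightly; makeParallel's internal use of the package is more involved):

```r
library(CodeDepends)

e <- quote(xy <- sort(c(x[, 1], y[, 1])))
info <- getInputs(e)

info@inputs    # variables this expression uses: "x", "y"
info@outputs   # variables this expression defines: "xy"
```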
The following features are planned, but not yet implemented.
- Handle calls to `library()`
- Generate code for Windows
- Use measured information on expression run time and object sizes
- Add constraints for processes, e.g. code that performs plotting must happen in one process
- Show which code actually changed, perhaps by providing a `diff()` method
- More efficient scheduling algorithms
- Detection of objects with reference semantics to handle appropriately
- Identify conditional versus unconditional data dependencies and handle appropriately
- Handle output dependencies, also known as Write After Write (WAW) dependencies
- Allow users to pass information about the data, e.g. dimensions and types
- Infer dimensions and run time when possible
- Generate code that uses systems besides R, for example Apache Hive
Some R code is more difficult to parallelize. This package currently doesn't handle the following cases.
1. Non-standard evaluation (NSE) complicates static code analysis because it changes the semantics of the language. Familiar R functions with NSE include `library()` and the `~` syntax for modeling. Some third party packages use NSE extensively; examples include data.table and dplyr.
2. Objects with reference semantics can change in method calls, so they require much more synchronization to run in parallel. Environments, reference classes, and R6 classes are examples of such objects (see the sketch after this list).
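To illustrate the reference semantics issue, here is a small, self-contained example (not makeParallel output) of a function call mutating an environment, a side effect that a separate worker process would not see:

```r
e <- new.env()
e$count <- 0

increment <- function(env) {
    # Modifies the caller's environment in place: a side effect that would
    # be lost if this call ran in a separate worker process.
    env$count <- env$count + 1
}

increment(e)
e$count    # 1 in serial execution; a naive parallel rewrite could miss this
```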
Thanks to Duncan Temple Lang and Nick Ulle for providing feedback.
Thanks to all those who have let me look at their code, especially Scott Devine, Nistara Randhawa, and Lynna Chu.