
Local Execution #1305

Closed

Conversation

reyna-abhyankar
Collaborator

@reyna-abhyankar reyna-abhyankar commented Feb 20, 2024

Description of changes:

Implement local execution

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

Initial TODO

  • allocator
  • local backing
    • constructor
    • init
    • forward
    • backward
    • update - refactor optimizer
    • local task argument accessor
  • port over code from Implement infer_bwd_binding, task reg #1097
  • reorganize repo to be able to build operators, kernels, and local execution without legion
  • new plantuml diagram that matches the interface

Later TODO

  • make signatures, invocations, bindings, etc. all immutable data types and construct them with a OpTaskSignatureBuilder etc.


@reyna-abhyankar reyna-abhyankar marked this pull request as draft February 20, 2024 16:55
Collaborator

@lockshaw lockshaw left a comment


I'm guessing this currently does not build. Now that it's pulled out into a separate library, it would be nice to prioritize getting it building, as well as adding the build to CI, to help push the code toward being mergeable.

Also, it would be nice to start adding some tests: not aiming at full coverage, but just documenting how you're envisioning some of the APIs here being used (essentially machine-checkable documentation).

Reviewed 137 of 137 files at r1, all commit messages.
Reviewable status: all files reviewed, 67 unresolved discussions (waiting on @reyna-abhyankar)


lib/CMakeLists.txt line 3 at r1 (raw file):

add_subdirectory(pcg)
add_subdirectory(compiler)
add_subdirectory(execution)

Might be better renamed "local-execution"--"execution" by itself is a bit vague

Code quote:

execution

lib/README.md line 8 at r1 (raw file):

- `kernels`:
- `op-attrs`:
- `execution`:

Might be worth adding a quick summary of what the library does, especially as this one is a bit less trivial than some of the others

Code quote:

- `execution`:

lib/execution/CMakeLists.txt line 13 at r1 (raw file):

    op-attrs
    utils
    optional

Use std::optional instead and remove the optional dependency


lib/execution/CMakeLists.txt line 14 at r1 (raw file):

    utils
    optional
    compiler

Why does this need compiler?


lib/execution/include/local_execution/local_allocator.h line 11 at r1 (raw file):

struct LocalAllocator : public IAllocator {
  LocalAllocator(size_t);

Worth including a parameter name here since it's not clear what this actually does.

Suggestion:

 LocalAllocator(size_t ?????);

lib/execution/include/local_execution/local_allocator.h line 14 at r1 (raw file):

  ~LocalAllocator() override;

  void *allocate(Tensor) override;

The code to calculate the memory size needed for a tensor is shared between allocators, so it seems worth making this the interface rather than forcing each allocator to compute the size itself

Suggestion:

  void *allocate(size_t) override;

lib/execution/include/local_execution/local_allocator.h line 16 at r1 (raw file):

  void *allocate(Tensor) override;
  void deallocate(void *) override;
  size_t get_ptr_memory_size(void *);

Why?


lib/execution/include/local_execution/local_allocator.h line 21 at r1 (raw file):

  size_t total_memory_size;
  size_t allocated_memory_size;
  std::unordered_map<void *, size_t> ptr_memory_size_mapping;

Why? Isn't this already handled by cudaMalloc?


lib/execution/include/local_execution/local_allocator.h line 22 at r1 (raw file):

  size_t allocated_memory_size;
  std::unordered_map<void *, size_t> ptr_memory_size_mapping;
};

Add CHECK_RC_COPY_VIRTUAL_COMPLIANT(LocalAllocator);. This checks the conditions in https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-copy-virtual
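For context, the C.67 guideline says a class with virtual functions should suppress copying (to prevent slicing) and offer a virtual clone() instead. A minimal sketch of a compliant hierarchy follows; `IAllocatorExample` and `DerivedAllocator` are hypothetical names, and the macro presumably static_asserts properties like these:

```cpp
#include <memory>
#include <type_traits>

// A polymorphic base per C++ Core Guidelines C.67: copying is deleted so a
// Derived can never be sliced through a Base, and a virtual clone() provides
// explicit deep copies instead.
struct IAllocatorExample {
  IAllocatorExample() = default;
  IAllocatorExample(IAllocatorExample const &) = delete;
  IAllocatorExample &operator=(IAllocatorExample const &) = delete;
  virtual ~IAllocatorExample() = default;
  virtual std::unique_ptr<IAllocatorExample> clone() const = 0;
};

static_assert(!std::is_copy_constructible<IAllocatorExample>::value,
              "copying through the polymorphic base is suppressed");

struct DerivedAllocator final : IAllocatorExample {
  std::unique_ptr<IAllocatorExample> clone() const override {
    // Explicit, non-slicing copy of the concrete type.
    return std::make_unique<DerivedAllocator>();
  }
};
```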


lib/execution/include/local_execution/local_model_training_instance.h line 14 at r1 (raw file):

namespace FlexFlow {

struct LocalModelTrainingInstance {

Maybe worth splitting out a TrainingInstance to remove duplication between LocalModelTrainingInstance and ModelTrainingInstance (or whatever it is called now if you changed its name)


lib/execution/include/local_execution/local_model_training_instance.h line 34 at r1 (raw file):

                    local_training_backing);

void initialize_backing(LocalModelTrainingInstance &);

See https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/

Suggestion:

LocalModelTrainingInstance initialize_backing(ComputationGraph const &, Optimizer const &, etc.);
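To illustrate the linked idea outside of FlexFlow's types: instead of a function that checks (or mutates) an already-constructed value, the checking function itself produces a value whose type guarantees validity. All names below are hypothetical, not FlexFlow APIs:

```cpp
#include <optional>

// "Validate": the caller must remember to call this and keep the result in
// sync with the raw int it describes.
bool is_valid_port(int port) {
  return port > 0 && port < 65536;
}

// "Parse": checking produces a Port, so any function taking a Port can
// assume validity; there is no unchecked state to forget about.
struct Port {
  int value;
};

std::optional<Port> parse_port(int raw) {
  if (raw > 0 && raw < 65536) {
    return Port{raw};
  }
  return std::nullopt;
}
```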

lib/execution/include/local_execution/local_task_argument_accessor.h line 15 at r1 (raw file):

  template <typename T>
  T const &get_argument(slot_id) const;

Same for all methods here I think

Suggestion:

  template <typename T>
  T const &get_argument(slot_id) const override;

lib/execution/include/local_execution/local_task_argument_accessor.h line 37 at r1 (raw file):

private:
  Allocator allocator;
  std::unordered_map<std::pair<slot_id, IsGrad>, GenericTensorAccessorW> tensor_backing_map;

Pull out into a separate struct that has a name. Raw pairs, variants, etc. tend to be confusing as it's not clear what they do.

Code quote:

std::pair<slot_id, IsGrad>
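A sketch of what the named struct could look like (using `SlotGradId`, the name floated later in this thread; `int` and `bool` stand in for the real `slot_id` and `IsGrad` types). Note that unlike `std::pair`, a named type needs an explicit `std::hash` specialization to serve as an `unordered_map` key:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hypothetical replacement for std::pair<slot_id, IsGrad>: the field names
// document what each component means at every use site.
struct SlotGradId {
  int slot_id;   // stand-in for the real slot_id type
  bool is_grad;  // stand-in for the real IsGrad enum

  bool operator==(SlotGradId const &other) const {
    return this->slot_id == other.slot_id && this->is_grad == other.is_grad;
  }
};

namespace std {
template <>
struct hash<SlotGradId> {
  size_t operator()(SlotGradId const &k) const {
    return hash<int>{}(k.slot_id) ^ (hash<bool>{}(k.is_grad) << 1);
  }
};
} // namespace std

// Usage: same ergonomics as the raw pair, but self-documenting.
int lookup_demo() {
  std::unordered_map<SlotGradId, int> m;
  m[SlotGradId{0, true}] = 7;
  return m.at(SlotGradId{0, true});
}
```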

lib/execution/include/local_execution/local_task_argument_accessor.h line 44 at r1 (raw file):

}

CHECK_RC_COPY_VIRTUAL_COMPLIANT


lib/execution/include/local_execution/local_training_backing.h line 28 at r1 (raw file):

struct LocalTrainingBacking {
  LocalTrainingBacking(ComputationGraph, Allocator, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW>);

Suggestion:

 LocalTrainingBacking(ComputationGraph const &, Allocator const &, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> const &);

lib/execution/include/local_execution/local_training_backing.h line 35 at r1 (raw file):

  void execute_update();

  LocalTaskArgumentAccessor get_fwd_accessor(OpTaskInvocation);

Suggestion:

  TaskArgumentAccessor get_fwd_accessor(OpTaskInvocation);

lib/execution/include/local_execution/local_training_backing.h line 36 at r1 (raw file):

  LocalTaskArgumentAccessor get_fwd_accessor(OpTaskInvocation);
  LocalTaskArgumentAccessor get_bwd_accessor(OpTaskInvocation);

Suggestion:

  TaskArgumentAccessor get_bwd_accessor(OpTaskInvocation);

lib/execution/include/local_execution/local_training_backing.h line 40 at r1 (raw file):

private:
  Allocator allocator;
  std::vector<Node> topologically_ordered_graph;

Just use the ComputationGraph you already have until we actually learn the performance is that important.

Code quote:

  std::vector<Node> topologically_ordered_graph;

lib/execution/include/local_execution/local_training_backing.h line 43 at r1 (raw file):

  // memory
  std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> op_slot_tensor_mapping;

What's happening with args (i.e., not tensors)?


lib/execution/include/local_execution/local_training_backing.h line 50 at r1 (raw file):

  template <typename DeviceState>
  std::unordered_map<task_id_t, TaskImplFunction<DeviceState>> task_id_impl_mapping;
  std::unordered_map<task_id_t, OpTaskSignature> task_id_signature_mapping;

Probably worth pulling out into a separate struct that's responsible for the internals of task registration to avoid LocalTrainingBacking from becoming monolithic and unwieldy

Code quote:

  // hold mappings
  std::unordered_map<task_id_t, void*> task_id_impl_mapping;
  // TODO: add init task mapping
  template <typename DeviceState>
  std::unordered_map<task_id_t, TaskImplFunction<DeviceState>> task_id_impl_mapping;
  std::unordered_map<task_id_t, OpTaskSignature> task_id_signature_mapping;

lib/execution/include/local_execution/task_argument_accessor.h line 12 at r1 (raw file):

namespace FlexFlow {

using PrivilegeType = variant<privilege_mode_to_accessor<Permissions::RW>,

We have std::variant now that we're on C++17

Suggestion:

std::variant
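For reference, the C++17 replacement looks like this; `int` and `std::string` stand in for the real accessor types in `PrivilegeType`:

```cpp
#include <string>
#include <variant>

// Sketch: the alias becomes a std::variant over the accessor types.
using ExamplePrivilegeType = std::variant<int, std::string>;

// std::holds_alternative queries the active member; std::get extracts it.
bool holds_int(ExamplePrivilegeType const &v) {
  return std::holds_alternative<int>(v);
}
```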

lib/execution/src/task_spec/LocalTaskBackend.md line 27 at r1 (raw file):

Retriever

Is this still up to date? I don't recall seeing this class. We should make sure all the documentation is up to date before merging. Feel free to draft @abisatish to help out with any doc writing and updating (a few docstrings would be good too as the design solidifies).


lib/execution/src/task_spec/OpTaskSpec.md line 1 at r1 (raw file):

# OpTaskSpec

Worth coordinating with @abisatish to try to make the documentation around operators, task_spec, etc. a bit more cohesive and approachable


lib/execution/src/task_spec/op_task_signature.cc line 8 at r1 (raw file):

void OpTaskSignature::add_input_slot(slot_id name, SlotType slot_type = SlotType::TENSOR) {
  OpTensorSlotSpec op_tensor_slot_spec {

I like to do this to signify that this is just a composite of these values, rather than some object that might be doing nontrivial things in a constructor.

Suggestion:

  OpTensorSlotSpec op_tensor_slot_spec = {
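The distinction being drawn: copy-list-initialization with `= { ... }` only compiles for aggregates and non-explicit constructors, so it visually signals "this is just a bundle of values", whereas direct-list-initialization `T t{ ... }` also admits explicit constructors. A self-contained sketch (`SlotSpecExample` is a hypothetical stand-in for `OpTensorSlotSpec`):

```cpp
// An aggregate: no user-declared constructors, just data members.
struct SlotSpecExample {
  int name;
  bool is_grad;
};

SlotSpecExample make_spec() {
  // Copy-list-init: reads as a plain composite of values, and would fail to
  // compile if SlotSpecExample later grew an explicit constructor.
  SlotSpecExample spec = {
      3,
      true,
  };
  return spec;
}
```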

lib/execution/src/task_spec/TaskSpecDSL.md line 0 at r1 (raw file):
Is this still up-to-date?


lib/execution/src/task_spec/TaskSpecDSL.md line 3 at r1 (raw file):

# Task Spec DSL

There are three components in task-based systems: `agent`, `task`, and `backend`. At a high level, they interact with each other in the following way: an `agent` defines a `task` which is executed by the `backend`. In FlexFlow, an `agent` could be an operator (e.g., `Conv2D`), a `task` could be its `forward()` function, and this could be executed on a `backend` like the Legion runtime. A more complicated example is a deep neural network, in which there are many `agent`s dispatching many `task`s that may depend on each other. In this document, we will define the three terms and their interactions.

Why add the concept of an agent? I'm not against describing it here in a more object-oriented vs functional way, but I'm not sure why it's necessary? Why not just treat tasks as functions?


lib/execution/src/task_spec/TaskSpecInternals.md line 0 at r1 (raw file):
Would be nice to combine some of these docs and make them a bit more cohesive. It would also be good to explain what capabilities the task spec interface provides to code implementing tasks (argument accessor, bindings, etc.) independent of how they're implemented in the local backing. Finally, it would be nice to have some documentation explaining why we have the backings and how they relate to each other (i.e., they are not equivalent, but each must be able to handle all operators in a semantics-preserving, if not performance-preserving, way).


lib/execution/src/task_spec/local_backing/local_allocator.cc line 24 at r1 (raw file):

  size_t freed_memory_size = this->ptr_memory_size_mapping[ptr];
  checkCUDA(cudaFree(ptr));
  this->allocated_memory_size -= freed_memory_size;

Why track allocated memory in LocalAllocator? Might be worth instead having a separate IAllocator implementation that wraps an IAllocator with this functionality instead

Code quote:

  this->allocated_memory_size
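The wrapper idea is essentially the decorator pattern: byte accounting lives in a separate `IAllocator` implementation that forwards to any inner allocator, instead of being baked into `LocalAllocator`. A self-contained sketch with a simplified `size_t`-based interface and a malloc-backed allocator (all names hypothetical):

```cpp
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Simplified interface for illustration (the real IAllocator differs).
struct ISimpleAllocator {
  virtual void *allocate(size_t) = 0;
  virtual void deallocate(void *) = 0;
  virtual ~ISimpleAllocator() = default;
};

struct MallocAllocator final : ISimpleAllocator {
  void *allocate(size_t size) override { return std::malloc(size); }
  void deallocate(void *ptr) override { std::free(ptr); }
};

// Decorator: adds allocation tracking to any allocator it wraps.
struct TrackingAllocator final : ISimpleAllocator {
  explicit TrackingAllocator(ISimpleAllocator &inner) : inner(inner) {}

  void *allocate(size_t size) override {
    void *ptr = this->inner.allocate(size);
    this->sizes[ptr] = size;
    this->allocated += size;
    return ptr;
  }

  void deallocate(void *ptr) override {
    this->allocated -= this->sizes.at(ptr);
    this->sizes.erase(ptr);
    this->inner.deallocate(ptr);
  }

  ISimpleAllocator &inner;
  size_t allocated = 0;
  std::unordered_map<void *, size_t> sizes;
};

// Usage: accounting is visible while a block is live, and returns to zero
// after deallocation.
size_t tracking_demo() {
  MallocAllocator base;
  TrackingAllocator tracked(base);
  void *p = tracked.allocate(64);
  size_t while_live = tracked.allocated; // 64
  tracked.deallocate(p);
  return while_live + tracked.allocated; // 64 + 0
}
```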

lib/execution/src/task_spec/local_backing/local_allocator.cc line 37 at r1 (raw file):

}

LocalAllocator::~LocalAllocator() {

Why was this chosen over not owning all of the memory regions (i.e., whoever uses them is responsible for deleting them)? No strong opinion, just curious


lib/execution/src/task_spec/local_backing/local_model_training_instance.cc line 16 at r1 (raw file):

}

GenericTensorAccessorR forward(LocalModelTrainingInstance const & local_model_training_instance) {

Why return this over having the LocalModelTrainingInstance just write out to an externally-provided tensor that's encoded when the instance is created?

Code quote:

GenericTensorAccessorR

lib/execution/src/task_spec/local_backing/local_task_argument_accessor.cc line 9 at r1 (raw file):

template <typename T>
T const & LocalTaskArgumentAccessor::get_argument(slot_id) const {
  not_implemented()

Semicolons?


lib/execution/src/task_spec/local_backing/local_task_argument_accessor.cc line 38 at r1 (raw file):

template <Permissions PRIV>
privilege_mode_to_accessor<PRIV> LocalTaskArgumentAccessor::get_tensor_grad(slot_id slot) const {
  return this->get_tensor(slot_id, is_grad = true);

What is this syntax?

Code quote:

is_grad = true

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 12 at r1 (raw file):

// operator/slot --> GTAR

LocalTrainingBacking::LocalTrainingBacking(ComputationGraph computation_graph, 

This function is quite long and seems to do a couple different things--might be worth splitting up into multiple functions/objects that are composed together for readability


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 14 at r1 (raw file):

LocalTrainingBacking::LocalTrainingBacking(ComputationGraph computation_graph, 
                                           Allocator allocator, 
                                           std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> allocated_tensors) {

Suggestion:

LocalTrainingBacking::LocalTrainingBacking(ComputationGraph const &computation_graph,
                                           Allocator const &allocator,
                                           std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> const &preallocated_tensors) {

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 15 at r1 (raw file):

                                           Allocator allocator, 
                                           std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> allocated_tensors) {
  this->op_slot_tensor_mapping.insert(allocated_tensors.begin(), allocated_tensors.end());

Just add to initializer list?

Suggestion:

: op_slot_tensor_mapping(allocated_tensors)
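For illustration, initializing the member in the initializer list constructs it directly from the argument, rather than default-constructing an empty map and then copying elements in with `insert`. A minimal sketch with hypothetical types:

```cpp
#include <string>
#include <unordered_map>

// The member is built in one step from the constructor argument; no empty
// map is default-constructed and then populated in the constructor body.
struct Backing {
  explicit Backing(std::unordered_map<int, std::string> const &tensors)
      : tensor_map(tensors) {}

  std::unordered_map<int, std::string> tensor_map;
};
```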

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 17 at r1 (raw file):

  this->op_slot_tensor_mapping.insert(allocated_tensors.begin(), allocated_tensors.end());

  std::vector<Node> layer_nodes = get_topological_ordering(computation_graph); 

Add overload so it returns a std::vector<layer_guid_t>


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 18 at r1 (raw file):

  std::vector<Node> layer_nodes = get_topological_ordering(computation_graph); 
  for (Node node: layer_nodes) {

Suggestion:

Node const &node

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 19 at r1 (raw file):

  std::vector<Node> layer_nodes = get_topological_ordering(computation_graph); 
  for (Node node: layer_nodes) {
    Layer layer = computation_graph.value().at(node);  // operator_guid_t doesn't seem as expressive as layer -- how to get attrs?

Add overload for layer_guid_t

Code quote:

computation_graph.value().at(node);

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 21 at r1 (raw file):

    Layer layer = computation_graph.value().at(node);  // operator_guid_t doesn't seem as expressive as layer -- how to get attrs?
    std::vector<task_id_t> task_ids = get_task_ids(layer.attrs);  // still think we need this, since we can't assume all ops have an init task
    for (task_id_t task_id: task_ids) {

Gentle reminder that format.sh before requesting review is appreciated 🙂


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 23 at r1 (raw file):

    for (task_id_t task_id: task_ids) {
      this->task_id_signature_mapping.insert({task_id, get_signature<task_id>()});
      this->task_id_impl_mapping.insert({task_id, get_task_impl<task_id>()});

Should it ever be possible for these to diverge? Why two maps and not one?

Code quote:

      this->task_id_signature_mapping.insert({task_id, get_signature<task_id>()});
      this->task_id_impl_mapping.insert({task_id, get_task_impl<task_id>()});

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 30 at r1 (raw file):

    //    TODO: this ^^ should definitely be a test
    std::unordered_set<MultiDiEdge> outgoing_edges = get_outgoing_edges(computation_graph, node);
    for (MultiDiEdge edge: outgoing_edges) {

Suggestion:

    for (MultiDiEdge const &edge

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 47 at r1 (raw file):

      Tensor tensor = computation_graph.value().at(edge);
      void * ptr = this->allocator.allocate(tensor);
      GenericTensorAccessorW tensor_backing = {tensor.data_type, tensor.get_shape(), ptr};

Pull this out into a separate function? Allocating a tensor backing seems like a common operation

Code quote:

      void * ptr = this->allocator.allocate(tensor);
      GenericTensorAccessorW tensor_backing = {tensor.data_type, tensor.get_shape(), ptr};

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 55 at r1 (raw file):

    // TODO: register update task

    this->allocator = allocator;

Move to initialization list?


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 64 at r1 (raw file):

GenericTensorAccessorR LocalTrainingBacking::execute_forward() {
  for (auto operator_node: this->topologically_ordered_graph) {

Suggestion:

  for (layer_guid_t operator_node: get_topological_ordering(this->computation_graph)) {

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 66 at r1 (raw file):

  for (auto operator_node: this->topologically_ordered_graph) {
    auto attrs = computation_graph.value().at(operator_node).attrs;
    OpTaskInvocation invocation = forward(operator_node.attrs);

Is there currently any checking anywhere that the OpTaskInvocation actually complies with the OpTaskSignature?


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 74 at r1 (raw file):

void LocalTrainingBacking::execute_backward() {
  for (auto operator_node: std::reverse(this->topologically_ordered_graph.begin(), this->topologically_ordered_graph.end())) {

Check containers.h

Code quote:

std::reverse(this->topologically_ordered_graph.begin(), this->topologically_ordered_graph.end())
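As written this won't compile: `std::reverse` permutes the range in place and returns `void`, so it can't appear in a range-for. One non-mutating alternative (besides whatever helper containers.h may provide) is to walk the reverse iterators:

```cpp
#include <vector>

// Iterate back-to-front without mutating the source: construct from the
// reverse iterators (a range-for over rbegin()/rend() works the same way).
std::vector<int> reversed_copy(std::vector<int> const &xs) {
  return std::vector<int>(xs.rbegin(), xs.rend());
}
```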

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 74 at r1 (raw file):

void LocalTrainingBacking::execute_backward() {
  for (auto operator_node: std::reverse(this->topologically_ordered_graph.begin(), this->topologically_ordered_graph.end())) {

Suggestion:

layer_guid_t

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 78 at r1 (raw file):

    OpTaskInvocation invocation = backward(attrs);
    LocalTaskArgumentAccessor accessor = this->get_local_task_arg_accessor(invocation);
    task_id_impl_mapping[task_id](accessor);

Probably worth pulling out into a separate method for readability

Suggestion:

    this->call_task_impl(task_id, accessor);

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 87 at r1 (raw file):

LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {

Suggestion:

get_task_argument_accessor

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 87 at r1 (raw file):

LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {

Suggestion:

TaskArgumentAccessor

lib/execution/src/task_spec/local_backing/local_training_backing.cc line 88 at r1 (raw file):

LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {
  LocalTaskArgumentAccessor local_task_arg_acc (this->allocator);

Maybe make LocalTaskArgumentAccessor immutable rather than being incrementally constructed with insert_tensor? Not critical, just worth considering


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 94 at r1 (raw file):

    std::pair<slot_id, IsGrad> tensor_id = tensor_binding->first;
    operator_guid_t op_guid = invocation.get_operator_guid_t();
    GenericTensorAccessorW tensor_backing = this->op_slot_tensor_mapping[{op_guid, tensor_id->first}];

Suggestion:

    GenericTensorAccessorW tensor_backing = this->op_slot_tensor_mapping.at({op_guid, tensor_id->first});
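The reason for preferring `.at()` here: on a missing key, `operator[]` silently default-constructs and inserts a value, while `.at()` throws `std::out_of_range` and leaves the map untouched. A self-contained demonstration:

```cpp
#include <stdexcept>
#include <unordered_map>

// operator[] default-inserts {key, 0} on a miss, silently growing the map.
size_t size_after_bracket_miss() {
  std::unordered_map<int, int> m;
  (void)m[5]; // miss: inserts {5, 0}
  return m.size();
}

// at() throws on a miss instead, surfacing the bug at the lookup site.
bool at_throws_on_miss() {
  std::unordered_map<int, int> const m;
  try {
    (void)m.at(5);
    return false;
  } catch (std::out_of_range const &) {
    return true;
  }
}
```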

lib/kernels/include/kernels/allocation.h line 11 at r1 (raw file):

struct IAllocator {
  virtual void *allocate(Tensor) = 0;

Calculating the memory size of a tensor isn't an allocator-specific operation, so let's keep the IAllocator API like this. If you wanted this for convenience, I have no issue with putting it on Allocator. Just keep in mind that everything that IAllocator defines must be implemented by each implementation of IAllocator

Suggestion:

  virtual void *allocate(size_t) = 0;
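One way to read this suggestion: the tensor-to-bytes computation becomes a single free function that every allocator shares, and allocators only ever see a byte count. A sketch with a hypothetical stand-in for the shape type:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for TensorShape: dimensions plus bytes per element.
struct ShapeExample {
  std::vector<size_t> dims;
  size_t element_size;
};

// Size calculation is allocator-agnostic, so it lives here once rather than
// inside every IAllocator implementation; allocators just take a size_t.
size_t get_size_in_bytes(ShapeExample const &shape) {
  size_t num_elements =
      std::accumulate(shape.dims.begin(), shape.dims.end(), size_t{1},
                      [](size_t a, size_t b) { return a * b; });
  return num_elements * shape.element_size;
}
```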

lib/kernels/include/kernels/allocation.h line 20 at r1 (raw file):

  Allocator() = delete;

  void *allocate(Tensor);

Suggestion:

 void *allocate(TensorShape const &);

lib/op-attrs/include/op-attrs/get_task_ids.h line 0 at r1 (raw file):
Should be moved to local-execution--task ids are not a concept that exists at the level of op-attrs


lib/op-attrs/include/op-attrs/get_task_ids.h line 50 at r1 (raw file):

template <typename... Ts>
std::vector<task_id_t> get_task_ids(variant<Ts...> const &attrs) {
  return visit(GetTaskIdsFunctor{}, attrs);

Just use an auto lambda
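The pattern being suggested, sketched with a hypothetical two-alternative variant: when every `operator()` overload of the functor does the same thing, a single generic lambda passed to `std::visit` replaces the whole struct:

```cpp
#include <variant>

using Attrs = std::variant<int, double>; // stand-in for the op-attrs variant

// One generic lambda handles every alternative uniformly, replacing a
// hand-written functor struct with one identical operator() per alternative.
double as_double(Attrs const &attrs) {
  return std::visit(
      [](auto const &x) { return static_cast<double>(x); }, attrs);
}
```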


lib/op-attrs/src/datatype.cc line 5 at r1 (raw file):

namespace FlexFlow {

size_t size_of(DataType data_type) {

Let's not make this one underscore away from a really annoying-to-debug typo 😂. Currently, if you forget the underscore you get sizeof instead, which compiles fine but gives the wrong result, with no type error to stop you.

Suggestion:

size_t size_of_datatype(DataType data_type) {

lib/op-attrs/src/datatype.cc line 20 at r1 (raw file):

      return sizeof(double);
    default:
      throw mk_runtime_error("Unknown data type");

I think mk_runtime_error supports fmt syntax (if I'm wrong ignore me)

Suggestion:

      throw mk_runtime_error("Unknown data type {}", dt);

lib/runtime/src/task_invocation_compilation.cc line 0 at r1 (raw file):
Is there a reason we're deleting all this? I'm not advocating moving it to local-execution, but keeping it around for now for Tanmaey's sanity might be nice (same for a couple other files deleted in this PR that would actually be quite useful for implementing the legion backend)


lib/runtime/src/tasks.h line 187 at r1 (raw file):

template <task_id_t>
void* get_task_impl();

Why not use that nice variant of std::functions you defined somewhere else? void * seems maybe a little drastic: it doesn't even ensure the thing is a function, and it provides no hints to anyone reading the code about what this function does.


lib/execution/include/task_spec/typed_future.h line 32 at r1 (raw file):

  }
  
  ArgTypeRuntimeTag get_type_tag() const {

Why?


lib/execution/include/task_spec/typed_future.h line 58 at r1 (raw file):

  std::type_index type_idx;
  Legion::Future future;
  ArgTypeRuntimeTag type_tag;

Should replace std::type_index, not accompany it.


lib/execution/include/task_spec/typed_future_map.h line 55 at r1 (raw file):

  std::type_index type;
  Legion::FutureMap future_map;

This is going to have issues since local-execution should not depend on legion.


lib/execution/include/task_spec/typed_future_map.h line 56 at r1 (raw file):

  std::type_index type;
  Legion::FutureMap future_map;
  ArgTypeRuntimeTag type_tag;

Should replace std::type_index.

Also I'm not sure if this is necessary, since ArgTypeRuntimeTag currently (i.e., in the main repo-refactor branch) primarily exists to enable legion serialization, which is not necessary for Futures (or for local execution at all)


lib/execution/include/task_spec/typed_task_invocation.h line 120 at r1 (raw file):

  std::type_index type_idx;
  std::shared_ptr<TaskInvocation const> invocation;
  ArgTypeRuntimeTag type_tag;

Should replace std::type_index


lib/execution/src/task_spec/op_task_invocation.h line 66 at r1 (raw file):

    return spec.get_type_tag().get_type_idx();
  }
};

auto lambda?

Code quote:

struct OpArgSpecTypeAccessor {
  std::type_index operator()(ConcreteArgSpec& spec) {
    return spec.get_type_tag().get_type_idx();
  }
  std::type_index operator()(IndexArgSpec& spec) {
    return spec.get_type_tag().get_type_idx();
  }
  std::type_index operator()(OpArgRefSpec& spec) {
    return spec.get_type_tag().get_type_idx();
  }
  std::type_index operator()(CheckedTypedFuture& spec) {
    return spec.get_type_tag().get_type_idx();
  }
  std::type_index operator()(CheckedTypedFutureMap& spec) {
    return spec.get_type_tag().get_type_idx();
  }
  std::type_index operator()(RuntimeArgRefSpec& spec) {
    return spec.get_type_tag().get_type_idx();
  }
  std::type_index operator()(TaskInvocationSpec& spec) {
    return spec.get_type_tag().get_type_idx();
  }
};

lib/execution/src/task_spec/op_task_signature.h line 91 at r1 (raw file):

private:
  OpTaskType type;
  std::vector<std::type_index> return_value;

Do we need multiple return values?

Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment


Reviewable status: 112 of 144 files reviewed, 67 unresolved discussions (waiting on @lockshaw)


lib/README.md line 8 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be worth adding a quick summary of what the library does, especially as this one is a bit less trivial than some of the others

Done.


lib/execution/CMakeLists.txt line 13 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use std::optional instead and remove the optional dependency

Done.


lib/execution/CMakeLists.txt line 14 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why does this need compiler?

Done.


lib/execution/include/local_execution/local_allocator.h line 22 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add CHECK_RC_COPY_VIRTUAL_COMPLIANT(LocalAllocator);. This checks the conditions in https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-copy-virtual

Done.


lib/execution/include/local_execution/local_task_argument_accessor.h line 37 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Pull out into a separate struct that has a name. Raw pairs, variants, etc. tend to be confusing as it's not clear what they do.

Why not just using SlotGradId = std::pair<slot_id, IsGrad>?


lib/execution/include/local_execution/local_task_argument_accessor.h line 44 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

CHECK_RC_COPY_VIRTUAL_COMPLIANT

Done.


lib/execution/include/local_execution/local_training_backing.h line 28 at r1 (raw file):

struct LocalTrainingBacking {
  LocalTrainingBacking(ComputationGraph, Allocator, std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW>);

Done.


lib/execution/include/local_execution/local_training_backing.h line 35 at r1 (raw file):

  void execute_update();

  LocalTaskArgumentAccessor get_fwd_accessor(OpTaskInvocation);

Done.


lib/execution/include/local_execution/local_training_backing.h line 36 at r1 (raw file):

  LocalTaskArgumentAccessor get_fwd_accessor(OpTaskInvocation);
  LocalTaskArgumentAccessor get_bwd_accessor(OpTaskInvocation);

Done.


lib/execution/include/local_execution/local_training_backing.h line 43 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What's happening with args (i.e., not tensors)?

Yeah, have to add that based on our discussion yesterday.


lib/execution/include/local_execution/task_argument_accessor.h line 12 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

We have std::variant now that we're on C++17

Done.


lib/execution/src/task_spec/op_task_signature.cc line 8 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I like to do this to signify that this is just a composite of these values, rather than some object that might be doing nontrivial things in a constructor.

Done.


lib/execution/src/task_spec/local_backing/local_task_argument_accessor.cc line 9 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Semicolons?

Done.


lib/execution/src/task_spec/local_backing/local_task_argument_accessor.cc line 38 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is this syntax?

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 14 at r1 (raw file):

LocalTrainingBacking::LocalTrainingBacking(ComputationGraph computation_graph, 
                                           Allocator allocator, 
                                           std::unordered_map<OperatorSlotBackingId, GenericTensorAccessorW> allocated_tensors) {

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 15 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just add to initializer list?

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 18 at r1 (raw file):

  std::vector<Node> layer_nodes = get_topological_ordering(computation_graph); 
  for (Node node: layer_nodes) {

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 21 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Gentle reminder that format.sh before requesting review is appreciated 🙂

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 23 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should it ever be possible for these to diverge? Why two maps and not one?

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 30 at r1 (raw file):

    //    TODO: this ^^ should definitely be a test
    std::unordered_set<MultiDiEdge> outgoing_edges = get_outgoing_edges(computation_graph, node);
    for (MultiDiEdge edge: outgoing_edges) {

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 87 at r1 (raw file):

LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 87 at r1 (raw file):

LocalTaskArgumentAccessor LocalTrainingBacking::get_local_task_arg_accessor(OpTaskInvocation invocation) {

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 94 at r1 (raw file):

    std::pair<slot_id, IsGrad> tensor_id = tensor_binding->first;
    operator_guid_t op_guid = invocation.get_operator_guid_t();
    GenericTensorAccessorW tensor_backing = this->op_slot_tensor_mapping[{op_guid, tensor_id->first}];

Done.


lib/op-attrs/src/datatype.cc line 5 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Let's not make this one underscore off from a really annoying to debug typo 😂. Currently if you forget an underscore you get the wrong result and there's no type error to stop you

Done.


lib/op-attrs/src/datatype.cc line 20 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I think mk_runtime_error supports fmt syntax (if I'm wrong ignore me)

Done.


lib/runtime/src/task_invocation_compilation.cc line at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is there a reason we're deleting all this? I'm not advocating moving it to local-execution, but keeping it around for now for Tanmaey's sanity might be nice (same for a couple other files deleted in this PR that would actually be quite useful for implementing the legion backend)

So there were actually 2 files with the same name in different subfolders. This deleted one looks like a precursor to the newer file which is in lib/execution/src/task_spec/task_invocation_compilation.cc and has a bunch of Legion stuff. Let me know if that's correct (I haven't verified which one is actually older via commit history, so I'll do that if you can't immediately tell).

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 112 of 144 files reviewed, 67 unresolved discussions (waiting on @lockshaw)


lib/execution/include/local_execution/local_allocator.h line 16 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why?

Yeah, I don't think this is needed. I wrote the allocator first and just added that as an API in case but didn't end up using it.


lib/execution/include/local_execution/local_allocator.h line 21 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why? Isn't this already handled by cudaMalloc?

Yeah, I don't think this is needed.

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 112 of 144 files reviewed, 67 unresolved discussions (waiting on @lockshaw)


lib/execution/include/local_execution/local_model_training_instance.h line 34 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/

This is a good paradigm. But I'm not connecting the dots for this example. Wouldn't we have already initialized LocalModelTrainingInstance before calling this function? That's what is expected for forward, backward, and update.

Actually, in that case shouldn't we do LocalModelTrainingInstance::forward() instead of forward(LocalModelTrainingInstance)?

Copy link
Collaborator Author

@reyna-abhyankar reyna-abhyankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 112 of 145 files reviewed, 67 unresolved discussions (waiting on @lockshaw)


lib/CMakeLists.txt line 3 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be better renamed "local-execution"--"execution" by itself is a bit vague

Done.


lib/execution/include/local_execution/local_allocator.h line 11 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Worth including a parameter name here since it's not clear what this actually does.

Done.


lib/execution/include/local_execution/local_allocator.h line 14 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The code to calculate the memory size needed for a tensor is shared between allocators, so it seems worth making this the interface rather than forcing each allocator to compute the size itself

Done.


lib/execution/include/local_execution/local_model_training_instance.h line 14 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Maybe worth splitting out a TrainingInstance to remove duplication between LocalModelTrainingInstance and ModelTrainingInstance (or whatever it is called now if you changed its name)

Done.


lib/execution/include/local_execution/local_task_argument_accessor.h line 15 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Same for all methods here I think

I think to have this kind of polymorphism we need to have a variant that holds all argument types. 2 reasons:

  1. C++ doesn't allow virtual template functions
  2. Having a map like below doesn't make sense because we'd need to then have LocalTaskArgumentAccessor<T> for a particular argument type which doesn't work.
    The variant solution (while it kind of seems like hard-coding) will fix this, and I think since arguments broadly fall into 2-3 categories, this is fine. (I think RuntimeArgRefSpec or OpArgRefSpec already does some of this)

Code snippet:

template <typename T>
std::unordered_map<slot_id, T> argument_map;

lib/execution/include/local_execution/local_training_backing.h line 40 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just use the ComputationGraph you already have until we actually learn the performance is that important.

Done.


lib/execution/include/local_execution/local_training_backing.h line 50 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably worth pulling out into a separate struct that's responsible for the internals of task registration to avoid LocalTrainingBacking from becoming monolithic and unwieldy

I've simplified these maps down, may pull more out.


lib/execution/src/task_spec/local_backing/local_allocator.cc line 24 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why track allocated memory in LocalAllocator? Might be worth instead having a separate IAllocator implementation that wraps an IAllocator with this functionality instead

Sure, I've slimmed down LocalAllocator accordingly so it doesn't track memory and purely allocates/deallocates.


lib/execution/src/task_spec/local_backing/local_allocator.cc line 37 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why was this chosen over not owning all of the memory regions (i.e., whoever uses them is responsible for deleting them)? No strong opinion, just curious

What would the alternative be? My take is that once LocalAllocator goes out of scope, it should de-allocate all the memory it allocated, but perhaps deallocate can be private. What do you think?


lib/execution/src/task_spec/local_backing/local_model_training_instance.cc line 16 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why return this over having the LocalModelTrainingInstance just write out to an externally-provided tensor that's encoded when the instance is created?

Yeah, I'm changing this to what you said.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 55 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

move to initialization list?

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 64 at r1 (raw file):

GenericTensorAccessorR LocalTrainingBacking::execute_forward() {
  for (auto operator_node: this->topologically_ordered_graph) {

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 74 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Check containers.h

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 74 at r1 (raw file):

void LocalTrainingBacking::execute_backward() {
  for (auto operator_node: std::reverse(this->topologically_ordered_graph.begin(), this->topologically_ordered_graph.end())) {

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 78 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably worth pulling out into a separate method for readability

Done.


lib/execution/src/task_spec/local_backing/local_training_backing.cc line 88 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Maybe make LocalTaskArgumentAccessor immutable rather than being incrementally constructed with insert_tensor? Not critical, just worth considering

Yeah, that's better. Done


lib/kernels/include/kernels/allocation.h line 11 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Calculating the memory size of a tensor isn't an allocator-specific operation, so let's keep the IAllocator API like this. If you wanted this for convenience, I have no issue with putting it on Allocator. Just keep in mind that everything that IAllocator defines must be implemented by each implementation of IAllocator

Done.


lib/kernels/include/kernels/allocation.h line 20 at r1 (raw file):

  Allocator() = delete;

  void *allocate(Tensor);

Let's have Allocator also be the one to return a GenericTensorAccessorW since that logic is repeated pretty often.

Code snippet:

GenericTensorAccessorW allocate(TensorShape const & shape) {
  size_t requested_mem_size = get_volume(shape); // dims * datatype
  void* ptr = this->i_allocator.allocate(requested_mem_size);
  return {shape.data_type, shape, ptr};
}

lib/op-attrs/include/op-attrs/get_task_ids.h line 50 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just use an auto lambda

Done.


lib/op-attrs/include/op-attrs/get_task_ids.h line at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should be moved to local-execution--task ids are not a concept that exists at the level of op-attrs

Done.


lib/execution/include/task_spec/typed_future.h line 32 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why?

In op_task_invocation.h, we have OpArgSpec = variant<ConcreteArgSpec,...,CheckedTypedFuture,...> with a bunch of arg types. Initially I thought this function lets us visit them uniformly, since all other types have an ArgTypeRuntimeTag. But if ArgTypeRuntimeTag is primarily for legion serialization, then we should just use type index and remove the future arg types from OpArgSpec. We can always extend it in runtime.


lib/execution/include/task_spec/typed_future.h line 58 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should replace std::type_index, not accompany it.

See above comment.


lib/execution/include/task_spec/typed_future_map.h line 55 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This is going to have issues since local-execution should not depend on legion.

See above comment.


lib/execution/include/task_spec/typed_future_map.h line 56 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should replace std::type_index.

Also I'm not sure if this is necessary, since ArgTypeRuntimeTag currently (i.e., in the main repo-refactor branch) primarily exists to enable legion serialization, which is not necessary for Futures (or for local execution at all)

See above comment.


lib/execution/include/task_spec/typed_task_invocation.h line 120 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should replace std::type_index

See above comment.


lib/execution/src/task_spec/op_task_invocation.h line 66 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

auto lambda?

Done.


lib/execution/src/task_spec/op_task_signature.h line 91 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Do we need multiple return values?

No, I think we agreed earlier there could be at most one.

@reyna-abhyankar reyna-abhyankar requested a review from lockshaw April 6, 2024 13:15
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be a lot of stray commented-out code throughout this PR--it would be nice to try to clean up some of that before this is merged.

Also, try to use more specific and clear names--a lot of the names (e.g., tensor_mapping) provide essentially no information, making the code way more challenging to read than it needs to be. Decomposing functions, even in instances where the functions are only called from that one location, can also be helpful for this reason--it bundles up pieces of functionality into nice named pieces that I can understand without having to wade through their implementation details

Also, let's try to aim for slightly smaller PRs in the future 🙂--I know a lot of the changes were relatively simple, but it's still a lot of code to go through

Reviewed 5 of 32 files at r2, 109 of 235 files at r4, 125 of 125 files at r5, all commit messages.
Reviewable status: all files reviewed, 126 unresolved discussions (waiting on @reyna-abhyankar)


lib/kernels/include/kernels/allocation.h line 28 at r5 (raw file):

  }

  void deallocate(GenericTensorAccessorW tensor_backing) {

Suggestion:

  void deallocate(GenericTensorAccessorW const &tensor_backing)

lib/kernels/include/kernels/flat_kernels.h line 4 at r5 (raw file):

#define _FLEXFLOW_OPS_KERNELS_FLAT_KERNELS_H

#include "device.h"

Why this change?


lib/kernels/include/kernels/layer_norm_kernels.h line 10 at r5 (raw file):

namespace FlexFlow {

// class LayerNormPerDeviceState : public PerDeviceOpState {

Delete?


lib/kernels/include/kernels/reduce_kernels.h line 37 at r5 (raw file):

                                 ArrayShape output_shape);

// void forward_kernel_wrapper(ReducePerDeviceState const &m,

Delete?


lib/kernels/include/kernels/reverse_kernels.h line 8 at r5 (raw file):

namespace FlexFlow {

using legion_coord_t = long long;

Just use the type directly--having a type named legion_coord_t in a part of the codebase without legion is just confusing


lib/kernels/include/kernels/split_kernels.h line 8 at r5 (raw file):

namespace FlexFlow {

using legion_coord_t = long long;

Delete


lib/kernels/include/kernels/transpose_kernels.h line 11 at r5 (raw file):

struct TransposePerDeviceState {
  int num_dim;
  req<int> perm[MAX_TENSOR_DIM];

vector?


lib/kernels/include/kernels/transpose_kernels.h line 36 at r5 (raw file):

// namespace Internal {

// void forward_kernel_wrapper(TransposePerDeviceState const &m,

Delete?


lib/kernels/src/cuda/reverse_kernels.cu line 20 at r5 (raw file):

namespace FlexFlow {
// declare Legion names
using legion_coord_t = long long;

Delete


lib/local-execution/include/arg_backing.h line 19 at r5 (raw file):

using DeviceStates = std::variant<LinearPerDeviceState>;

struct OpArgBacking {

visitable?


lib/local-execution/include/arg_backing.h line 28 at r5 (raw file):

};

using ArgBackingMapping = std::unordered_map<operator_guid_t, OpArgBacking>;

What's the purpose of ArgBackingMapping here?


lib/local-execution/include/local_allocator.h line 19 at r5 (raw file):

private:
  std::unordered_set<void *> ptrs;

Why have a set of pointers?


lib/local-execution/include/local_model_training_instance.h line 14 at r5 (raw file):

namespace FlexFlow {

struct LocalModelTrainingInstance {

Turn into a visitable-style type without methods


lib/local-execution/include/local_task_argument_accessor.h line 14 at r5 (raw file):

using SlotGradId = std::pair<slot_id, IsGrad>;
using TensorBackingOption = std::variant<

Why a variant of maps instead of a map of variants?


lib/local-execution/include/local_task_argument_accessor.h line 14 at r5 (raw file):

using SlotGradId = std::pair<slot_id, IsGrad>;
using TensorBackingOption = std::variant<

Also I think it's generally better to wrap std::variants rather than just using them directly


lib/local-execution/include/local_task_argument_accessor.h line 25 at r5 (raw file):

      std::unordered_map<slot_id, ArgRefBacking> argument_map)
      : allocator(allocator), tensor_backing_map(tensor_backing_map),
        argument_map(argument_map){};

.cc file


lib/local-execution/include/local_task_argument_accessor.h line 30 at r5 (raw file):

  ConcreteArgSpec const &get_concrete_arg(slot_id) const override;
  OpArgRefTypeBacking const &get_op_arg_ref(slot_id) const override;

These should already have been flattened down to all be ConcreteArgSpec by the type that argument accessor is called--OpArgRefTypeBacking, RuntimeArgRefTypeBacking, etc. are just placeholders that should get filled in during taskspec compilation


lib/local-execution/include/local_task_argument_accessor.h line 33 at r5 (raw file):

  RuntimeArgRefTypeBacking const &get_runtime_arg(slot_id) const override;

  PrivilegeType

What is PrivilegeType? (and PrivilegeVariadicType?)


lib/local-execution/include/local_task_argument_accessor.h line 42 at r5 (raw file):

  size_t get_device_idx() const override {
    return 0;

.cc file


lib/local-execution/include/local_training_backing.h line 26 at r5 (raw file):

struct TaskRegistry {
  TaskRegistry(std::unordered_map<tensor_guid_t, GenericTensorAccessorW &>);

Why is this taking in a map of mutable references? That seems strange (and likely unsafe)


lib/local-execution/include/local_training_backing.h line 27 at r5 (raw file):

struct TaskRegistry {
  TaskRegistry(std::unordered_map<tensor_guid_t, GenericTensorAccessorW &>);
  void register_task(task_id_t, operator_guid_t);

Why is TaskRegistry tied to a PCG and specific operator_guid_ts?


lib/local-execution/include/local_training_backing.h line 29 at r5 (raw file):

  void register_task(task_id_t, operator_guid_t);
  // void register_args(operator_guid_t, OpArgBacking);
  bool is_tensor_allocated(tensor_guid_t tensor_id);

Feels like this should be const (same for a lot of these)


lib/local-execution/include/local_training_backing.h line 69 at r5 (raw file):

  void execute_update();

  DeviceSpecific<DeviceStates> call_init_task_impl(task_id_t,

Should this and the methods below be public?


lib/local-execution/include/op_task_invocation.h line 28 at r5 (raw file):

    std::variant<ConcreteArgSpec, OpArgRefSpec, RuntimeArgRefSpec>;

struct OpArgSpecTypeAccessor {

Why is this a functor?


lib/local-execution/include/op_task_invocation.h line 104 at r5 (raw file):

  OpTaskBinding binding;
};
FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(OpTaskInvocation,

Why nonstandard construction?


lib/local-execution/include/op_task_invocation.h line 111 at r5 (raw file):

OpTaskBinding infer_bwd_binding(OpTaskBinding const &fwd);

bool validate_invocation(OpTaskSignature sig, OpTaskInvocation inv) {

.cc file


lib/local-execution/include/op_task_invocation.h line 111 at r5 (raw file):

OpTaskBinding infer_bwd_binding(OpTaskBinding const &fwd);

bool validate_invocation(OpTaskSignature sig, OpTaskInvocation inv) {

Suggestion:

bool is_invocation_valid(OpTaskSignature sig, OpTaskInvocation inv) {

lib/local-execution/include/op_task_invocation.h line 112 at r5 (raw file):

bool validate_invocation(OpTaskSignature sig, OpTaskInvocation inv) {
  // tensors

Just pull into a separate function rather than a comment?


lib/local-execution/include/op_task_invocation.h line 113 at r5 (raw file):

bool validate_invocation(OpTaskSignature sig, OpTaskInvocation inv) {
  // tensors
  auto tensor_bindings = inv.binding.get_tensor_bindings();

Avoid auto unless the type name is really long


lib/local-execution/include/op_task_invocation.h line 115 at r5 (raw file):

  auto tensor_bindings = inv.binding.get_tensor_bindings();
  for (OpTensorSlotSpec const &op_tensor_slot_spec : sig.get_tensor_slots()) {
    slot_id name = op_tensor_slot_spec.name;

Seems like this is just validating an OpTensorSlotSpec against an OpTensorSpec--why not create a separate function for doing that? This current function that lumps everything together is a pain to read.


lib/local-execution/include/op_task_invocation.h line 117 at r5 (raw file):

    slot_id name = op_tensor_slot_spec.name;
    IsGrad is_grad = op_tensor_slot_spec.is_grad;
    std::pair<slot_id, IsGrad> tensor_key = std::make_pair(name, is_grad);

Suggestion:

std::pair<slot_id, IsGrad> tensor_key = std::make_pair(
    op_tensor_slot_spec.name,
    op_tensor_slot_spec.is_grad
);

lib/local-execution/include/op_task_invocation.h line 128 at r5 (raw file):

  auto sig_arg_types = sig.get_arg_types();
  OpArgSpecTypeAccessor type_accessor;
  for (auto arg_binding : inv.binding.get_arg_bindings()) {

Pull out into a separate function


lib/local-execution/include/op_task_invocation.h line 132 at r5 (raw file):

    OpArgSpec op_arg_spec = arg_binding.second;
    std::type_index arg_type = sig_arg_types.at(name);
    if (type_accessor(op_arg_spec) != arg_type) {

Not really a helpful or clear name

Code quote:

type_accessor

lib/local-execution/src/local_model_training_instance.cc line 13 at r5 (raw file):

    EnableProfiling enable_profiling,
    ProfilingSettings profiling_settings) {
  this->local_training_backing = LocalTrainingBacking(computation_graph,

Rather than delegating everything why not just have LocalModelTrainingInstance take in a LocalTrainingBacking or the other way around?


lib/local-execution/src/local_task_argument_accessor.cc line 41 at r5 (raw file):

  std::pair<slot_id, IsGrad> slot_grad_pair = std::make_pair(slot, is_grad);
  auto variadic_tensor_backing_map = std::get<
      std::unordered_map<SlotGradId, std::vector<GenericTensorAccessorW>>>(

Wrap your variants and tuples to avoid code like this


lib/local-execution/src/local_training_backing.cc line 21 at r5 (raw file):

void TaskRegistry::register_task(task_id_t task_id, operator_guid_t op_id) {
  TaskSignatureImpl task_signature_impl = {get_task_impl<task_id>(),

Suggestion:

TaskImplWithSignature

lib/local-execution/src/local_training_backing.cc line 40 at r5 (raw file):

bool TaskRegistry::is_tensor_allocated(tensor_guid_t tensor_id) {
  return this->tensor_mapping.find(tensor_id) != this->tensor_mapping.end();

contains in containers.h

And const


lib/local-execution/src/local_training_backing.cc line 45 at r5 (raw file):

GenericTensorAccessorW &
    TaskRegistry::get_tensor_backing(tensor_guid_t tensor_id) {
  return this->tensor_mapping.at(tensor_id);

Rather unclear name--from this name it could be a mapping from each tensor to its favorite color for all I know 🙂

Code quote:

tensor_mapping

lib/local-execution/src/local_training_backing.cc line 64 at r5 (raw file):

  this->task_registry = TaskRegistry(allocated_tensors);
  std::vector<operator_guid_t> layers = computation_graph.traverse();
  for (operator_guid_t const &node : layers) {

traverse is also really unclear, do you mean get_topological_ordering?

Suggestion:

  for (operator_guid_t const &node : computation_graph.traverse()) {

lib/local-execution/src/local_training_backing.cc line 69 at r5 (raw file):

    for (task_id_t task_id : task_ids) {
      this->task_registry.register_task(task_id, node);
    }

All of the intermediate variables make the code rather hard to read--they don't need to be eliminated entirely, but it would really help readability if the number was reduced a bit

Suggestion:

    CompGraphOperatorAttrs attrs = computation_graph.get_layer_attrs(node);
    for (task_id_t task_id : get_task_ids(attrs)) {
      this->task_registry.register_task(task_id, node);
    }

lib/local-execution/src/local_training_backing.cc line 98 at r5 (raw file):

        this->task_registry.get_init_signature(operator_node), invocation));
    TaskArgumentAccessor accessor = this->get_task_arg_accessor(invocation);
    DeviceSpecific<DeviceStates> device_state =

Why are we just throwing away the per-device state here?


lib/local-execution/src/local_training_backing.cc line 111 at r5 (raw file):

      this->task_registry.task_mapping.at(task_id);
  auto fn = std::get<std::function<DeviceStates(TaskArgumentAccessor const &)>>(
      task_sig_impl.impl_function);

Suggestion:

  auto task_impl_function = std::get<std::function<DeviceStates(TaskArgumentAccessor const &)>>(
      task_sig_impl.impl_function);

lib/local-execution/src/local_training_backing.cc line 147 at r5 (raw file):

    TaskArgumentAccessor accessor =
        this->get_task_arg_accessor(invocation, operator_guid_t);
    this->call_task_impl(invocation.task_id, accessor);

This seems duplicated a couple times--why not just have a call_task_impl overload that takes a signature and an invocation and does this part?

Code quote:

    assert(validate_invocation(
        this->task_registry.get_bwd_signature(operator_node), invocation));
    TaskArgumentAccessor accessor =
        this->get_task_arg_accessor(invocation, operator_guid_t);
    this->call_task_impl(invocation.task_id, accessor);

lib/local-execution/src/local_training_backing.cc line 152 at r5 (raw file):

void LocalTrainingBacking::execute_update() {
  not_implemented();

Suggestion:

  NOT_IMPLEMENTED();

lib/local-execution/src/local_training_backing.cc line 158 at r5 (raw file):

TaskArgumentAccessor
    LocalTrainingBacking::get_task_arg_accessor(OpTaskInvocation invocation,

Isn't this just a pure function of the OpTaskInvocation and the OpTaskSignature? Why is the task registry getting all involved here?


lib/local-execution/src/ops/split.cc line 96 at r5 (raw file):

  auto attrs = acc.get_argument<SplitAttrs>(ATTRS);

  legion_coord_t num_blks, in_blk_size, out_blk_size[MAX_NUM_OUTPUTS];

"block" is much clearer than "blk"

Suggestion:

  legion_coord_t num_blocks, input_block_size, output_block_size[MAX_NUM_OUTPUTS];

lib/local-execution/src/ops/split.cc line 126 at r5 (raw file):

  SimTaskBinding fwd_binding;
  fwd_binding.bind(INPUT, input.shape);

At some point this should just be replaced by the existing forward function


lib/op-attrs/include/op-attrs/ops/embedding.h line 13 at r5 (raw file):

namespace FlexFlow {

enum class AggregateOp { SUM, AVG, NONE };

Just use a std::optional<AggregateOp> if you want this functionality

Suggestion:

enum class AggregateOp { SUM, AVG };

lib/pcg/include/pcg/computation_graph.h line 19 at r5 (raw file):

                            OutputLabelledMultiDiGraph<Layer, Tensor>> {
  using strong_typedef::strong_typedef;

Make most/all of these methods functions


lib/pcg/include/pcg/computation_graph.h line 20 at r5 (raw file):

  using strong_typedef::strong_typedef;

  std::vector<operator_guid_t> traverse() {

Move to .cc file? (same with the other methods here)


lib/pcg/include/pcg/computation_graph.h line 27 at r5 (raw file):

  }

  std::vector<operator_guid_t> traverse_reverse_order() {

Pretty much all of these should be const


lib/pcg/include/pcg/computation_graph.h line 35 at r5 (raw file):

  }

  bool out_edge_comparator(MultiDiOutput x, MultiDiOutput y) {

Since this appears to only be used in sort_edge_set, just use a lambda there rather than defining a whole new method?


lib/pcg/include/pcg/computation_graph.h line 50 at r5 (raw file):

                     [&](MultiDiOutput const &e) -> tensor_guid_t {
                       return tensor_guid_t{e};
                     });

Also, what is this function doing? The name is not at all clear

Suggestion:

    std::unordered_set<MultiDiOutput> outputs =
        transform(
          sorted_by(
            edges,
            compare_by([](MultiDiEdge const &e) { return e.src_idx; })
          ),
          [](MultiDiEdge const &o) {
            return tensor_guid_t{o}; 
          }
        );

lib/pcg/include/pcg/computation_graph.h line 53 at r5 (raw file):

  }

  std::vector<tensor_guid_t> get_outgoing_tensors(operator_guid_t n) {

Why sort? I'm not sure I understand why something as get_outgoing_tensors should be doing things like sorting


lib/pcg/include/pcg/computation_graph.h line 61 at r5 (raw file):

  }

  operator_guid_t add_node(Layer const &layer) {

Suggestion:

add_layer

lib/pcg/include/pcg/computation_graph.h line 75 at r5 (raw file):

  }

  tensor_guid_t create_outgoing_edge_with_label(operator_guid_t node,

Aren't all edges labelled?


lib/pcg/include/pcg/computation_graph.h line 89 at r5 (raw file):

      MultiDiOutput input_view = input.value();
      MultiDiEdge edge = {node.value(),
                          NodePort{incoming_edge_dst_port++},

Fix usage of node ports (todo for me)


lib/runtime/include/runtime/model_training_instance.h line 16 at r5 (raw file):

struct ModelTrainingInstance {
  TrainingConfig training_config;

Why this change?


lib/utils/include/utils/graph/node_port.h line 17 at r5 (raw file):

  using strong_typedef::strong_typedef;

  static NodePort construct_port(size_t idx);

Remove


lib/execution/include/local_execution/local_allocator.h line 11 at r1 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Done.

Not seeing it


lib/execution/include/local_execution/local_allocator.h line 16 at r1 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Yeah, I don't think this is needed. I wrote the allocator first and just added that as an API in case but didn't end up using it.

Then remove it?


lib/execution/include/local_execution/local_allocator.h line 21 at r1 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Yeah, I don't think this is needed.

Then remove?


lib/execution/src/ops/attention.cc line 174 at r1 (raw file):

}

static void forward_task(Task const *task,

Where did forward_task go?


lib/execution/src/task_spec/local_backing/local_allocator.cc line 37 at r1 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

What would the alternative be? My take is that once LocalAllocator goes out of scope, it should de-allocate all the memory it allocated, but perhaps deallocate can be private. What do you think?

The alternative would be that once LocalAllocator goes out of scope nothing happens to those memory regions--it's that calling code's responsibility to clean it up (not arguing for this, just wanted to understand why you went with this way).


lib/execution/src/task_spec/local_backing/local_allocator.cc line 38 at r1 (raw file):

LocalAllocator::~LocalAllocator() {
  for (auto it = this->ptr_memory_size_mapping.begin(); it != this->ptr_memory_size_mapping.end();) {

Why not just a for-in (i.e., for ( .... : ... )) loop?


lib/local-execution/include/arg_ref.h line 58 at r5 (raw file):

      : type_idx(type_index), ref_type(ref_type) {}

  std::type_index type_idx;

I assume this change was made to avoid depending on Legion::Serializer?


lib/local-execution/include/config.h line 49 at r5 (raw file):

};

using legion_mapping_tag_id_t = unsigned long;

Probably better to just create a wrapper type?


lib/local-execution/include/device_specific.h line 13 at r5 (raw file):

  DeviceSpecific() = delete;
  DeviceSpecific(T ptr_type) { // accessor

Why does this exist? Isn't DeviceSpecific::create sufficient?


lib/local-execution/include/op_task_signature.h line 44 at r5 (raw file):

struct OpTaskSignature {
  OpTaskSignature() = default;

Why? Generally avoid default construction unless there's a super obvious initial state, and I don't see what the super obvious initial state is for OpTaskType


lib/local-execution/include/op_task_signature.h line 95 at r5 (raw file):

  std::unordered_set<OpTensorSlotSpec> op_tensor_slots;
};
// FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(OpTaskSignature,

Why?


lib/local-execution/include/profiling.h line 15 at r5 (raw file):

  std::optional<float> elapsed =
      profiling_wrapper<F, Ts...>(f, profiling, std::forward<Ts>(ts)...);
  // TODO -- local logger?

What does this comment mean? Why was this commented out?


lib/local-execution/include/tasks.h line 173 at r5 (raw file):

                   std::string const &name,
                   F const &func,
                   std::optional<F const &> cpu_func = std::nullopt);

I don't think we get optionals of references with std::optional--I think this was a tl::optional-only feature


lib/local-execution/include/tracked_allocator.h line 10 at r5 (raw file):

struct TrackedAllocator : public IAllocator {
  TrackedAllocator(size_t);

TrackedAllocator should just wrap an Allocator rather than reimplementing all of LocalAllocator


lib/local-execution/include/tracked_allocator.h line 20 at r5 (raw file):

private:
  const size_t total_memory_size;

(admittedly this should happen automatically when you format)

Suggestion:

  size_t const total_memory_size;

lib/local-execution/src/tracked_allocator.cc line 42 at r5 (raw file):

}

Allocator get_tracked_memory_allocator(size_t total_memory_size) {

Suggestion:

Allocator create_tracked_memory_allocator(size_t total_memory_size) {

lib/local-execution/src/ops/attention.cc line 124 at r5 (raw file):

  int num_heads = get_piece_shape(weight_parallel_tensor_shape)[ff_dim_t(1)];

  // MHAPerDeviceState per_device_state =

Delete?


lib/local-execution/src/ops/attention.cc line 348 at r5 (raw file):

                "Attention Bwd",
                bwd_signature<ATTENTION_BWD_TASK_ID>(),
                backward_task_impl);

How does this work with legion tasks? Doesn't backward_task_impl have the wrong argument types?


lib/local-execution/src/ops/batch_matmul.cc line 190 at r5 (raw file):

template <>
OpTaskSignature fwd_signature<BATCHMATMUL_FWD_TASK_ID>() {
  OpTaskSignature fwd;

Why is OpTaskSignature default constructible? Does it make any sense to have an OpTaskSignature without a type?


lib/local-execution/src/ops/linear.cc line 149 at r5 (raw file):

                 "[Linear] backward_time = %.2lfms\n",
                 per_device_state,
                 (void *)input.get_float_ptr(),

Why is everything getting cast to a void* here?


lib/local-execution/src/ops/replicate.cc line 75 at r5 (raw file):

                 input_grad,
                 output_grad,
                 attrs.replicate_degree); // is this `num_replicas`?

I believe so


lib/local-execution/src/ops/reverse.cc line 24 at r5 (raw file):

using namespace FlexFlow::Kernels::Reverse;
using legion_coord_t = long long;

Why is legion being referenced here?

Collaborator

@lockshaw lockshaw left a comment

Reviewable status: all files reviewed, 95 unresolved discussions (waiting on @reyna-abhyankar)


lib/runtime/src/task_invocation_compilation.cc line at r1 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

So there were actually 2 files with the same name in different subfolders. This deleted one looks like a precursor to the newer file which is in lib/execution/src/task_spec/task_invocation_compilation.cc and has a bunch of Legion stuff. Let me know if that's correct (I haven't verified which one is actually older via commit history, so I'll do that if you can't immediately tell).

I think they both need to exist--this is the one for the actual compilation to legion objects, which shouldn't be part of local-execution but is critical for runtime


lib/execution/include/local_execution/local_model_training_instance.h line 34 at r1 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

This is a good paradigm. But I'm not connecting the dots for this example. Wouldn't we have already initialized LocalModelTrainingInstance before calling this function? That's what is expected for forward, backward, and update.

Actually, in that case shouldn't we do LocalModelTrainingInstance::forward() instead of forward(LocalModelTrainingInstance)?

I guess the question to answer is: if the lifetime of LocalTrainingBacking is different than the lifetime of LocalModelTrainingInstance (i.e., we create a LocalModelTrainingInstance and then only later actually initialize the backing), should they be part of the same object? We generally try to avoid objects/structs going through a whole bunch of state changes, since those changes can't be tracked by the type system. Instead, we represent each state with a different type so the progression can be checked by the type system (e.g., rather than having an FFModel which has to have a computation graph constructed, then legion initialized, then run, etc., we have a ComputationGraph object and a LegionBacking object that get created as you go along, so that if you have a LegionBacking you know for sure that legion has already been initialized).


lib/execution/include/local_execution/local_task_argument_accessor.h line 37 at r1 (raw file):

Previously, reyna-abhyankar (Reyna Abhyankar) wrote…

Why not just using SlotGradId = std::pair<slot_id, IsGrad>?

Because that's just an alias--anywhere I'm using SlotGradId I could instead use a std::pair directly (as is being done here), which means the types only make sense if the reader already knows what SlotGradId actually is, so the alias doesn't buy much. It also leads to code that looks like std::get<IsGrad>(my_slot_id) rather than my_slot_id.is_grad, and the latter is often easier to read. It's less of an issue here because pairs are pretty small and both elements have wrapper types that make their purpose quite clear, but a distinct type would definitely improve some other parts of the code that don't have this advantage (which is why I care less about this particular occurrence).

It's a similar answer to "why provide Node when you could just return int?"--while int is sufficiently expressive, that extra expressiveness actually makes the code harder to read (it's the same reason we don't just pass around everything as a void *).

@reyna-abhyankar reyna-abhyankar marked this pull request as ready for review April 14, 2024 22:30
@lockshaw lockshaw marked this pull request as draft May 14, 2024 18:42