Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New compiler: support function overloading #2533

Open
ivan-mogilko opened this issue Sep 21, 2024 · 12 comments
Open

New compiler: support function overloading #2533

ivan-mogilko opened this issue Sep 21, 2024 · 12 comments
Labels
ags 4 related to the ags4 development context: script compiler type: enhancement a suggestion or necessity to have something improved

Comments

@ivan-mogilko
Copy link
Contributor

ivan-mogilko commented Sep 21, 2024

The proposal is to support script function overloading.

CC @fernewelten

Function overloading means that you may have multiple functions of identical name, but different prototype (return value and parameter list). For example:

int DoAction(Character *c);
int DoAction(Character *c, int param);
int DoAction(Character *c, float param);
int DoAction(Character *c, int param1, int param2);
int DoAction(Character *c, float param_f, const string s_param);

EDIT: since we now have struct constructors (#2582), overloading should be supported for constructors too.

NOTE: overloading must have different argument list, it cannot support function variants that only differ in return type, because there will be no way to tell which of those variants is being called.

In order to support this, function variants must be distinguished on both compilation and linking stages. In other words, each function variant must be registered under a unique internal name. Right now AGS uses a "FUNC^N" notation for distinguishing imports with different number of parameters (and afaik "FUNC$N" as a corresponding export name). This was done primarily to let link deprecated API functions in the engine (i think). But number of parameters is not enough for overloading, as we would also need to differentiate variants with different return and argument types.

The first idea that comes to mind is to generate a second suffix which contains encoded parameter types. Note that they do not exactly have to be uniquely identified throughout the script or game: for the purpose of overloading itself having different suffixes is enough. But it may be still beneficial to have a strict rule for these, i.e. not a random garbage, at least because this may be useful for debugging. And there may also be additional uses found later, so it would be best to not block this opportunity.

Now, this is where this becomes bit complicated. I may imagine that primitive types such as ints, floats, etc, could be identified by a single letter, like i, f, etc, but what about others? Having a single letter will not be suitable, having full type name may make this internal name quite long.

As a random idea there may be a "compressed" name generated as min number of characters enough to distinguish the type, maybe starting with 3 letters (unless the type is shorter). And then this type name "shortcut" is also saved somewhere, like in RTTI table, as a way to reference a type, in case we may need to quickly find that type's entry.

Are there any other visible options here?

@ivan-mogilko ivan-mogilko added type: enhancement a suggestion or necessity to have something improved ags 4 related to the ags4 development context: script compiler labels Sep 21, 2024
@fernewelten
Copy link
Contributor

there may be a "compressed" name generated as min number of characters enough to distinguish the type, maybe starting with 3 letters (unless the type is shorter). And then this type name "shortcut" is also saved somewhere, like in RTTI table, as a way to reference a type, in case we may need to quickly find that type's entry.

My first idea was using symbol table entry numbers, but just using symbol table entry numbers to identify the types won't be enough. The issue is, a struct foo could be symbol table entry 271 when a certain program is compiled that exports some function, and the same struct foo might be a completely different symbol table entry, e.g., 721, when a different program is compiled that imports that function.

Instead, we'd probably need to stringify the types in a unique way and use some sort of concatenation of this for the mangled function name. One could imagine mangled function names such as

fname^param_count^stringified_type_of_first_param^stringified_type_of_second_param^…

That might make for very long mangled function names, but as far as I know, other ecosystems such as GNU's gcc also have very long mangled function names.

To stringify a type in a unique way, as a first approach we might linearize its composition of data components (and ignore the attributes and function components)

  • At the base level, there would be single letters to represent the atomic types, e.g., i for integer.
  • A dynpointer to a type t could be stringified as *T, where T is the stringification of t
  • A dynarray to a type t could be stringified as T[]
  • A struct consisting of data fields of types t, u, v could be stringified as (TUV)

If you've already GOT a stringification for types as part of your RTTI scheme, let's take that.

If the mangled function names become too long, we might hash them so that we can look them up first by hash, then by exact name. The hashing only needs to occur once, at the linking stage, so this will probably not slow down program execution.

@ivan-mogilko
Copy link
Contributor Author

ivan-mogilko commented Sep 22, 2024

The issue is, a struct foo could be symbol table entry 271 when a certain program is compiled that exports some function, and the same struct foo might be a completely different symbol table entry, e.g., 721, when a different program is compiled that imports that function.
<...>
If you've already GOT a stringification for types as part of your RTTI scheme, let's take that.

So, in RTTI there are local tables with local typeids, and a joint table created at runtime, which assigns global numeric ids to the types. These tables are binded using a fully qualified type name (string). This is similar to how function fixups is done across scripts.

I suppose that if you will use numeric local type ids in this mangled name, then it will be possible to resolve them into global type ids (whether numeric or string) at the script linking stage in the engine.

As for string names, RTTI currently does not have any "shortcut" names, only full names. It's possible to add "shortcuts" there though, generated using your proposed "stringification" rules, if using numeric typeids does seem inconvenient.

EDIT:
But, I'd like to clarify, that supposedly using RTTI here is not exactly required to make these functions link, is it?
Function linking is still going to be done with the use of import/export tables and fixups.
RTTI will only be required if some operation would need to know exact types of function args at runtime.

@fernewelten
Copy link
Contributor

fernewelten commented Sep 22, 2024

Hm. We might keep things simple. We don't need RTTI at link time, but when RTTI already has a way to name types uniquely in a string (long names), then we could use just this naming mechanism and name the functions with these long names, too. That is,

fname^parameter_count^long_type_name1^long_type_name2^long_type_name3…

That makes for long function names, but AFAIK they are only used for linking and aren't usually shown to the programmer.
The engine would

  1. try to link a full mangled function name first,
  2. fall back to linking fname^parameter_count when there isn't a full mangled function name,
  3. and link to just fname as a last resort.

It seems that the engine does 2. and 3. already, so this might be a comparatively simple modification of pre-existing code.

In communication with the programmer, IMO they need to be told what they need to be told, it can't be helped. So when there are several functions of the same name that have different parameter lists and an error message needs to refer to one specific function, it would call the function fname(int, float) or some such. In simple cases that can't be misunderstood, the error messages could stil refer to fname() if that is simpler to understand.

@ericoporto
Copy link
Member

Are bool and enums assumed to be int in this overload idea? Or would they be their own stuff?

I am curious if this put new overhead in script function calls at runtime or if these would be figured out at compile time.

Just to expand a bit, assuming we do have it and let's we use it in Maths in the engine API to also support int, I imagine in the ags manual we would have all the overloaded methods in the same entry.

@fernewelten
Copy link
Contributor

fernewelten commented Sep 22, 2024

Are bool and enums assumed to be int in this overload idea?

Currently, the compiler doesn't know bool, it doesn't even see it. There are #defines in the autoheader that equate bool to int, true to 1 and false to 0.

As concerns enum, I think the language follows the old C conventions that essentially treat enums as int that have some compile-time constants. I'm not sure of the ramifications. This might confuse a user when they have defined a function as, e.g., fname(Enum i) (where Enum is an enum) and the compiler tells them about fname(int) in an error message.

We don't have casts in the AGS language, neither the C++ casts nor the C casts. So the language kind of relies on being able to assign ints to enums and vice versa,

@ivan-mogilko
Copy link
Contributor Author

ivan-mogilko commented Sep 22, 2024

In other words, in order to distinguish overloads with bools and different enums, the compiler would have to register bool and enums as distinct types. This sounds like a separate issue of its own though.

I am curious if this put new overhead in script function calls at runtime or if these would be figured out at compile time.

Function overloads are resolved at compile and linking time.

@ivan-mogilko
Copy link
Contributor Author

ivan-mogilko commented Nov 29, 2024

I looked into this a little.
Creating import names and linking in the engine part seems simple enough to achieve:
https://github.com/ivan-mogilko/ags-refactoring/commits/ags4--scriptfunctionoverload/

Where I got stumbled is the compiler part, and particularily the question of how to register different function variants in the symbol table.

Just looking at options, having all variants under same symbol does not look feasible, because symbol must have "Declared" location, "Scope", and "LifeScope" (that was added for the runtime reflection), and obviously overloaded functions will have different ones.

So, they must be separate symbols. But in this case, their string "Name" must be different, or they would require some additional field to be able to distinguish them when they are being looked up by the name. The Parser deals with Symbols, which are integer ids. The mapping between a Symbol and a Name is performed earlier by the Scanner. And Scanner does know nothing about function overload.

OTOH there's already a separation between "unqualified" symbol and "qualified" symbol; currently it's used for things like struct member functions: a function may first be recognized by its "unqualified" symbol (matching "funcname" string), and then registered with "qualified" symbol (matching "type::funcname" string). This is done by Parser.

So Parser should be able to do the overloading handling as well. Starting with "unqualified" function symbol, and generating a "qualified" function symbol, which corresponds to particular overload. The question remains how to distinguish these overloads, other than by symbol id itself. Should they have different Name value, or have other field that lets distinguish them?
If former, then what should be in the Name, is it same "mangled" name as we create for imports, with number of arguments and arglist?

Afaik Name is used mostly for reporting compiler messages, and registering RTTI/TOC tables, but not in the parser logic (?).
There's a place to strip the Name would it contain something extra for service purposes (see SymbolTable::GetName()).

I don't have a full picture of this yet, but might research this further.


I suppose, that Parser would need a quick way of finding other function variants. Maybe FunctionD member of a symbol entry could contain a vector of "overloads" with symbols with them.

Similarly, a "Constructor" field in VartypeD, which I added recently, would be replaced with a vector of symbols, for multiple constructors.

@fernewelten
Copy link
Contributor

fernewelten commented Nov 29, 2024

One way of having several parameter lists to one function name in the symbol table: Keep a vector of all the parameter lists.

  • Whenever the compiler encounters a function call that has several parameter lists associated with the function name, it tries all the parameter lists in sequence until it finds a parameter list that accepts all the arguments. If it doesn't find any that match, it calls an error such as ‘Could not find any parameter list that would match these arguments.’
  • For instance, when there is f(int, float), f(int) and f(float), then f(3.14) can only be a call for the parameter list (float).
  • Things will become a hassle once optional parameters and variadic parameters enter the picture. For instance, once the compiler knows f(int, optional int) it must balk when it encounters f(int, ...) because it can't differentiate between the parameter lists when it encounters, e.g., f(3, 17) later on.

Imported and exported functions would not be tracked by the name only, but by a "mangled name" that encodes the parameters, too.

  • This is a hassle and leads to very long names, but it seems that C++ compilers do exactly that behind-the-scenes.
  • Which convention is used to "mangle" the name is a thing that differs from compiler (tool train) to compiler (tool train). As far as I know, gcc does it differently than the Visual C++ compiler.
  • Maybe for us, "mangled names" could be Display^101^string or String::Concatenate^2^String^String, i.e., function name, ^, parameter count ^ and then the parameter types separated by ^.
  • This mangling schema is probably hard to get right, and once we adopt such a scheme, it will stay with us for a long time and will be difficult to change.

We might do it so that a function f with one set of parameters has one type of scope, and a function of the same name f with a different set of parameters has a different scope. For instance, the one is imported and the other is locally defined. On the other hand, it might be easier and less confusing to keep things simple and simply require that all the functions that have the same name must have the same scope, period. Only the parameter lists may differ, the rest must not.

This will still mean that for each parameter list, an offset into the code must be tracked where the function code begins, etc. How this would be done is a hassle, but it is doable. One way of doing it would be, have several entries in the symbol table:

  • Keep a "master" entry for the "unadorned" function name, e.g. Armor::Heal. This "master entry" has the list of parameter lists, e.g. (int) and (int, int, int).
  • In the same symbol table, keep a "subsidiary" entry for the "mangled" function names, i.e., Armor::Heal^1^int and Armor::Heal^3^int^int^int.
  • All the details that differ between the parameter lists would be saved in the "subsidiary" entries. All the details that are common are with the "master" entry.

@fernewelten
Copy link
Contributor

Once we add functions that can have more than one parameter list, then our error messages will perhaps become less helpful to the programmers:

Let's say that we have Armor::Heal(int mana). Now when the compiler encounters Heal(15.5), it can tell the programmer, ‘Check your first argument: it should be an int but is a float’.

Now let's add Armor::Heal(float mana, int conversion_type), i.e., there are now two parameter lists for one function name.

Now when the compiler encounters Heal(15.5), it can't know whether the programmer has got the first argument wrong – should have been an int – or whether the second argument is missing. So it can't tell the programmer any longer, ‘Check your first argument’ but it can only say, ‘I can't find a parameter list that would fit the arguments in this call’, which is less helpful.

So we might have to weigh what is more important to us: A language with more possibilities, or a restricted language with more specific help from the compiler.

@ivan-mogilko
Copy link
Contributor Author

ivan-mogilko commented Nov 30, 2024

  • Keep a "master" entry for the "unadorned" function name, e.g. Armor::Heal. This "master entry" has the list of parameter lists, e.g. (int) and (int, int, int).
  • In the same symbol table, keep a "subsidiary" entry for the "mangled" function names, i.e., Armor::Heal^1^int and Armor::Heal^3^int^int^int.
  • All the details that differ between the parameter lists would be saved in the "subsidiary" entries. All the details that are common are with the "master" entry.

This sounds like a better plan.
I would maybe try to not put all variants of parameter lists inside a master entry, but instead have a list of references to "subsidiary" entries in it.
Maybe make it so that if there's just one function variant, then it has a parameter list, but if there are multiple, then it has a list of "overloads", and parser would need to iterate these to check their parameter lists, and find the most matching one?

FunctionD -> Parameters (in case of a single variant)
FunctionD -> Overloads (std::vector<Symbol>)
                            | -> FunctionD -> Parameters
                            | -> FunctionD -> Parameters
                            | -> FunctionD -> Parameters
                            | -> ...

@ericoporto
Copy link
Member

ericoporto commented Nov 30, 2024

Is function overloading something that assumes the return value is the same between functions? Like, could I have

// header file
import int Square (int value);
import float Square (float value);
// script file
int Square(int value)
{
  return value*value;
}

float Square(float value)
{
  return value*value;
}

@ivan-mogilko
Copy link
Contributor Author

ivan-mogilko commented Nov 30, 2024

Is function overloading something that assumes the return value is the same between functions?

Traditional rules is that return types may be different, but the functions cannot differ only by return type. That is because compiler won't be able to deduce which variant to call.

Meaning, you can have:

int Func(int param);
float Func(float param);

But you cannot have

int Func(String param);
float Func(String param);
// which one to call if I pass String??

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ags 4 related to the ags4 development context: script compiler type: enhancement a suggestion or necessity to have something improved
Projects
None yet
Development

No branches or pull requests

3 participants