Encoding structs (and pointers to structs) #1194
Replies: 4 comments 9 replies
-
So in my head when I was thinking about having a Struct COL node this would be something which only has fields like in C and LLVM. But of course in C++ structs are just classes where things are public by default. Would it then make sense to keep the C++ frontend converting C++ structs to classes and the C/LLVM frontends converting structs to the specific Struct COL node? Of course in C++ a class can still be treated as a value type unlike Java so we could also include the ability to have constructors and methods on the Struct COL node and have the only real difference between two be that structs get copied by assignments and classes just duplicate the reference. In that case a C++ class/struct would become a COL Struct if it has a copy constructor and a COL Class if it doesn't have a copy constructor? |
Beta Was this translation helpful? Give feedback.
-
Bob mentioned that we try to avoid having boolean flags in COL nodes and I think that is indeed a good choice to use separate nodes. Of course we also need separate types nodes for these two classes and in fact we might be able to get away with only having different types and keeping the Class node the same since we can determine whether or not to use by value or by reference semantics based on the type of the expression. Then we could do something similar to this: ---
title: Class Diagram
---
classDiagram
direction LR
`trait TClass` : +*clsː Ref[G, Class[G]]
`trait TClass` : +*typeArgsː Seq[Variable[G]]
`case class TByReferenceClass` : +clsː Ref[G, Class[G]]
`case class TByReferenceClass` : +typeArgsː Seq[Variable[G]]
`case class TByValueClass` : +clsː Ref[G, Class[G]]
`case class TByValueClass` : +typeArgsː Seq[Variable[G]]
`trait TClass` <|-- `case class TByReferenceClass`
`trait TClass` <|-- `case class TByValueClass`
The only thing that I can think of right now to watch out for is that I don't think it makes sense to have an intrinsic lock invariant on a by value class |
Beta Was this translation helpful? Give feedback.
-
After thinking about this a bit longer (and trying some stuff out) I've found that only differentiating in the type is not enough since there are places where we only have the Class and not an instance of TClass. I guess this means we'll need to use a similar kind of structure for the Class node: ---
title: Class Diagram
---
classDiagram
direction LR
`trait Class` : +*typeArgsː Seq[Variable[G]]
`trait Class` : +*declsː Seq[ClassDeclaration[G]]
`trait Class` : +*supportsː Seq[Type[G]]
`case class ByReferenceClass` : +typeArgsː Seq[Variable[G]]
`case class ByReferenceClass` : +declsː Seq[ClassDeclaration[G]]
`case class ByReferenceClass` : +supportsː Seq[Type[G]]
`case class ByReferenceClass` : +intrinsicLockInvariantː Expr[G]
`case class ByValueClass` : +typeArgsː Seq[Variable[G]]
`case class ByValueClass` : +declsː Seq[ClassDeclaration[G]]
`case class ByValueClass` : +supportsː Seq[Type[G]]
`trait Class` <|-- `case class ByReferenceClass`
`trait Class` <|-- `case class ByValueClass`
Also we should probably also think about Java's value classes in case project Valhalla ever materialises and we want to support it |
Beta Was this translation helpful? Give feedback.
-
So I implemented the above. However, I ran into a somewhat fundamental issue. I wanted to simply implement copy semantics for the ByValueClass by replacing accesses to the struct (when we are in a function call or in an assignment) with making a copy instead. However, a large part of VerCors currently assumes that assignments to locals can never fail, as such there is no suitable Blame available for me to use whenever the introduced dereferences for accessing the fields of the to-be-copied ByValueClass fail. In the current implementation this works because LangCToCol is the first rewriter and as such all the assignments have a non-panic blame attached. (no other rewriters have created assignments to locals which they assumed to never fail) The fact that local assignments are infallible seems like a logical and useful property to have so we want to preserve that. Since we cannot anticipate every single assignment statement that might be created by another rewriter we can probably not be sure that we've added a copy in all places where we should add it for the proper semantics. Pieter suggested we keep track of a separate set of local variables the "local heap variables" which can only be accessed through a struct A {
int a;
};
void test() {
struct A b;
b.a = 10;
struct A c = b;
c.a = 5;
} Which would be transformed into: struct A {
int *a;
};
void test() {
@heap@ struct A b;
ptrDeref(HeapLocal(b)) = new A();
ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
@heap@ struct A c;
ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
} Next we look for the address of operator to determine if we need to reason about the memory location of the struct and its fields. In this example we have no address of operator. Therefore, we can flatten this to: struct A {
int a;
};
void test() {
struct A b;
Local(b) = new A();
Local(b).a = 10;
struct A c;
Local(c) = new A();
Local(c).a = Local(b).a;
Local(c).a = 5;
} We would do this flattening at a late stage (probably somewhere just before the side-effects get resolved?) such that we do not risk any other rewrites making it impossible for us to add copy semantics. (the Then we can distinguish two more cases, one where we take a reference to the struct and one where we take a reference to the field. struct A {
int a;
};
void other(struct A* d) {}
void test() {
struct A b;
b.a = 10;
other(&b);
struct A c = b;
c.a = 5;
} Gets transformed into struct A {
int *a;
};
void other(struct A* d) {}
void test() {
@heap@ struct A b;
ptrDeref(HeapLocal(b)) = new A();
ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
other(AddrOf(ptrDeref(HeapLocal(b))));
@heap@ struct A c;
ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
} After the TrivialAddrOf rewrite pass this becomes struct A {
int *a;
};
void other(struct A* d) {}
void test() {
@heap@ struct A b;
ptrDeref(HeapLocal(b)) = new A();
ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
other(HeapLocal(b));
@heap@ struct A c;
ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
} Then because we have a "naked" struct A {
int a;
};
void other(struct A* d) {}
void test() {
struct A *b;
ptrDeref(Local(b)) = new A();
ptrDeref(Local(b)).a = 10;
other(Local(b));
struct A c;
Local(c) = new A();
Local(c).a = ptrDeref(Local(b)).a;
Local(c).a = 5;
} Finally an example where we take a reference to the field: struct A {
int a;
};
void other(int *d) {}
void test() {
struct A b;
b.a = 10;
other(&b.a);
struct A c = b;
c.a = 5;
} Gets transformed into struct A {
int *a;
};
void other(int* d) {}
void test() {
@heap@ struct A b;
ptrDeref(HeapLocal(b)) = new A();
ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
other(AddrOf(ptrDeref(ptrDeref(HeapLocal(b)).a)));
@heap@ struct A c;
ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
} After the TrivialAddrOf rewrite pass this becomes struct A {
int *a;
};
void other(struct A* d) {}
void test() {
@heap@ struct A b;
ptrDeref(HeapLocal(b)) = new A();
ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
other(ptrDeref(HeapLocal(b)).a);
@heap@ struct A c;
ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
} Then because we have a field that is access without a ptrDeref we cannot lower it to a regular field. However, because we never got a reference to the struct A {
int *a;
};
void other(int* d) {}
void test() {
struct A b;
Local(b) = new A();
ptrDeref(Local(b).a).int = 10;
other(Local(b).a);
struct A c;
Local(c) = new A();
Local(c).a = ptrDeref(Local(b).a).int;
Local(c).a = 5;
} This is just a worked example of the idea. I've yet to convince myself that we've gotten all the corner cases here. There are some unanswered questions:
|
Beta Was this translation helpful? Give feedback.
-
Currently structs are encoded similar to a class in Java. However, since structs are actually value types and not reference types this is not desirable. I would like to discuss what would be a good encoding of structs. I will discuss four different approaches
Approaches
Current encoding
While it is undesirable to use the same type of COL node for classes and structs the actual encoding of the two could remain very similar. This means that for every field in the struct we have a field in Viper. Creating a new instance of a struct is simply matter of using the
new
statement in Viper:However, this does mean that the values inside a struct are stored in heap locations and therefore need their permissions specified. We could try and work around that by automatically generating the permission annotations for structs. (which would be possible once we separate structs and classes)
To be able to take a pointer to a field in a struct we need to do something similar to what I implemented for the local variables in #1172 where the actual type of the field is replaced with a pointer to the actual type of the field and any access to that field must go through a pointer dereference.
Encoding as ADTs
We could also encode structs as ADTs. This would look something like this in Viper:
Compared to the current encoding this has the advantage that there is no need to deal with permissions when passing structs from one place to another. Another advantage is that this would allow us to easily check struct equality (without some auxiliary generated function) which might be nice in specifications. However, if we again need to be able to take a pointer to a field in the struct we must again replace the type with a pointer type which means that the nice equality operator on structs breaks. Another disadvantage is that to modify a field in the struct you would need to make copy of the struct since ADTs are immutable. This means that any field access to a struct field would have to be encoded as some function call that returns a new struct.
Encoding as ADTs with references
To get a "mutable" ADT we can make all the fields have the Ref type. This would look something like this in Viper:
Compared to the current encoding we still have to manage the same amount of permissions except that if we want to take a pointer to a field in the struct we no longer need to change the type of the field to a pointer type since we can simply set
ptrDeref(pointerValue)
to be equals to the field's Ref. Of course it would also be possible to do the same thing for the current encoding. (simply giving all the fields the type Ref)A downside of this encoding is that we need to generate a method to create a new instance of this struct. However this method would not be too complicate to generate (in this case we have a struct A which has as its first field an integer array and as its second field an integer):
Encoding as collections of variables
The simplest encoding would be to encode struct values as a collection of variables. A function that takes a struct would then expand the definition of the struct such that it takes each field as a separate parameter. This avoids having to deal with permissions for struct fields but it is unclear what should happen if we want to have a pointer to the struct.
Additional requirements from the LLVM side
One thing I would like to be able to do for the LLVM support is to treat a pointer to a struct as a pointer to its first element. One thing we could do is for every function that takes a pointer automatically generate a function that takes a more specific type (i.e. a struct that has the type we want as its first field, or there could be more levels of nesting there) but I don't think that is a great solution. I feel like it might make sense to represent for example a
pointer<int> a
in an LLVM context as apointer<TAny> a
with the additional permission annotationPerm(a, int, write)
saying that it must be some pointer that is accessible as an integer.On the viper side this could be implemented as follows:
First we define an ADT so that we can define functions that can called on any type:
Then whenever we make a new instance of a struct our struct creation method should also set the appropriate as* functions:
Then we can also make a new pointer to the struct:
Then we can read and write to the struct in the pointer as if it was an integer (accessing the first element), as an integer array, or as a StructA:
Note that when replacing the struct inside of the pointer we must now use a method to preserve the as* functions:
Using this approach for the current encoding (but with Ref fields) instead of the "ADT with Ref fields" encoding would be pretty much the same except that with the ADT approach you could make asIntArray(res) == field_1_of_StructA(res) an axiom.
So, any opinions on this? Am I overcomplicating it and should we stick to the current encoding (with the as* stuff added to it) but just with different COL nodes?
I think it would be nice if we could have some notion of equality on structs for specifications I think. For all of these encoding that would mean generating a function which checks this. It is also interesting to compare this approach to the way Prusti encodes Rust datatypes which is similar to our current encoding except that all the permissions are put into predicates which are folded and unfolded when needed. They also have the notion of snapshots where the current state of a struct is encoded into an ADT for use in pure functions, etc.
Beta Was this translation helpful? Give feedback.
All reactions