Encoding structs (and pointers to structs) #1194

superaxander · 2024-05-07T14:23:47Z

superaxander
May 7, 2024
Maintainer

Currently structs are encoded similarly to a class in Java. However, since structs are value types and not reference types, this is not desirable. I would like to discuss what would be a good encoding of structs. I will discuss four different approaches.

Approaches

Current encoding

While it is undesirable to use the same type of COL node for classes and structs, the actual encoding of the two could remain very similar. This means that for every field in the struct we have a field in Viper. Creating a new instance of a struct is simply a matter of using the new statement in Viper:

var s: Ref := new(a,b,c)

However, this does mean that the values inside a struct are stored in heap locations and therefore need their permissions specified. We could try to work around that by automatically generating the permission annotations for structs (which would be possible once we separate structs and classes).

To be able to take a pointer to a field in a struct we need to do something similar to what I implemented for the local variables in #1172 where the actual type of the field is replaced with a pointer to the actual type of the field and any access to that field must go through a pointer dereference.

Encoding as ADTs

We could also encode structs as ADTs. This would look something like this in Viper:

domain StructA {
    function StructA_of(field_1: T1, field_2: T2, ...): StructA
    function field_1_of_StructA(s: StructA): T1
    function field_2_of_StructA(s: StructA): T2

    axiom { forall s: StructA :: StructA_of(field_1_of_StructA(s), field_2_of_StructA(s), ...) == s }
}

Compared to the current encoding, this has the advantage that there is no need to deal with permissions when passing structs from one place to another. Another advantage is that this would allow us to easily check struct equality (without some auxiliary generated function), which might be nice in specifications. However, if we again need to be able to take a pointer to a field in the struct, we must again replace the field's type with a pointer type, which means that the nice equality operator on structs breaks. Another disadvantage is that to modify a field in the struct you would need to make a copy of the struct, since ADTs are immutable. This means that any write to a struct field would have to be encoded as some function call that returns a new struct.
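To make the trade-off concrete, here is a minimal C sketch of what "immutable struct with functional update" looks like from the program's point of view. It assumes a hypothetical StructA with two integer fields; with_field_1 and StructA_eq are illustrative names, not anything that exists in VerCors.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical struct standing in for StructA, with two integer fields. */
struct StructA {
    int field_1;
    int field_2;
};

/* Constructor, mirroring StructA_of in the domain above. */
static struct StructA StructA_of(int field_1, int field_2) {
    struct StructA s = { field_1, field_2 };
    return s;
}

/* A "write" to field_1 does not mutate s; it builds a new value instead. */
static struct StructA with_field_1(struct StructA s, int value) {
    return StructA_of(value, s.field_2);
}

/* Structural equality: holds exactly when all fields agree. */
static bool StructA_eq(struct StructA x, struct StructA y) {
    return x.field_1 == y.field_1 && x.field_2 == y.field_2;
}

static void adt_demo(void) {
    struct StructA s = StructA_of(1, 2);
    struct StructA t = with_field_1(s, 10);
    assert(s.field_1 == 1);                     /* the original value is unchanged */
    assert(t.field_1 == 10 && t.field_2 == 2);  /* only field_1 differs in the copy */
    /* Rebuilding a value from its own fields gives back an equal value,
       which is what the axiom in the domain states. */
    assert(StructA_eq(s, StructA_of(s.field_1, s.field_2)));
}
```

Every assignment such as s.field_1 = 10 would have to be rewritten into s = with_field_1(s, 10), which is the copying cost mentioned above.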

Encoding as ADTs with references

To get a "mutable" ADT we can make all the fields have the Ref type. This would look something like this in Viper:

domain StructA {
    function StructA_of(field_1: Ref, field_2: Ref, ...): StructA
    function field_1_of_StructA(s: StructA): Ref
    function field_2_of_StructA(s: StructA): Ref

    axiom { forall s: StructA :: StructA_of(field_1_of_StructA(s), field_2_of_StructA(s), ...) == s }
}

Compared to the current encoding we still have to manage the same amount of permissions, except that if we want to take a pointer to a field in the struct we no longer need to change the type of the field to a pointer type, since we can simply set ptrDeref(pointerValue) to be equal to the field's Ref. Of course it would also be possible to do the same thing for the current encoding (simply giving all the fields the type Ref).

A downside of this encoding is that we need to generate a method to create a new instance of this struct. However, this method would not be too complicated to generate (in this case we have a struct A which has as its first field an integer array and as its second field an integer):

method make_StructA(a: Array, b: Int)
    returns (res: StructA)
    ensures acc(field_1_of_StructA(res).intArray, write)
    ensures field_1_of_StructA(res).intArray == a
    ensures acc(field_2_of_StructA(res).int, write)
    ensures field_2_of_StructA(res).int == b

Encoding as collections of variables

The simplest encoding would be to encode struct values as a collection of variables. A function that takes a struct would then expand the definition of the struct such that it takes each field as a separate parameter. This avoids having to deal with permissions for struct fields but it is unclear what should happen if we want to have a pointer to the struct.
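As a rough illustration of the expansion, the following C sketch shows a function that takes a struct next to the shape it would have after each field becomes its own parameter. The struct A and both function names are made up for this example.

```c
#include <assert.h>

/* Hypothetical struct with two fields. */
struct A {
    int a;
    int b;
};

/* Original shape: the struct is passed by value as a single parameter. */
static int sum_struct(struct A s) {
    return s.a + s.b;
}

/* Expanded shape: one parameter per field, no struct value left over. */
static int sum_expanded(int s_a, int s_b) {
    return s_a + s_b;
}

static void expansion_demo(void) {
    struct A s = { 3, 4 };
    /* A call site passes the fields individually instead of the struct. */
    assert(sum_struct(s) == sum_expanded(s.a, s.b));
}
```

Note that once the struct is split up like this there is no single object left whose address could be taken, which is the open question about pointers to the struct.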

Additional requirements from the LLVM side

One thing I would like to be able to do for the LLVM support is to treat a pointer to a struct as a pointer to its first element. One thing we could do is for every function that takes a pointer automatically generate a function that takes a more specific type (i.e. a struct that has the type we want as its first field, or there could be more levels of nesting there) but I don't think that is a great solution. I feel like it might make sense to represent for example a pointer<int> a in an LLVM context as a pointer<TAny> a with the additional permission annotation Perm(a, int, write) saying that it must be some pointer that is accessible as an integer.
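For reference, this is the behaviour on the C side that the encoding would have to allow: C guarantees that a pointer to a struct, suitably converted, points to its first member, and this also works through levels of nesting. The struct and function names below are hypothetical and only serve as a small sketch.

```c
#include <assert.h>

struct Inner { int x; int y; };
struct Outer { struct Inner inner; int z; };

/* A function that only knows about int pointers. */
static void set_int(int *p, int v) {
    *p = v;
}

static void first_member_demo(void) {
    struct Outer o = { { 1, 2 }, 3 };

    /* Pointer to the struct, converted to a pointer to its first member. */
    struct Inner *pi = (struct Inner *)&o;
    /* One more level of nesting: Inner's first member is the int x. */
    int *px = (int *)pi;

    assert(pi == &o.inner);
    assert(px == &o.inner.x);

    /* Writing through the int pointer updates the nested field. */
    set_int(px, 10);
    assert(o.inner.x == 10);
    assert(o.inner.y == 2 && o.z == 3);
}
```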

On the Viper side this could be implemented as follows:
First we define an ADT so that we can define functions that can be called on any type:

domain Value[T] {
    function asInt(v: T): Ref
    function asIntArray(v: T): Ref
    function asStructA(v: T): Ref
    function asStructB(v: T): Ref
    function asStructC(v: T): Ref
    ...etc
}

Then whenever we make a new instance of a struct our struct creation method should also set the appropriate as* functions:

method make_StructA(a: Array, b: Int)
    returns (res: StructA)
    ensures acc(field_1_of_StructA(res).intArray, write)
    ensures field_1_of_StructA(res).intArray == a
    ensures asInt(res) == asInt(field_1_of_StructA(res).intArray)
    ensures asIntArray(res) == field_1_of_StructA(res)
    ensures acc(field_2_of_StructA(res).int, write)
    ensures field_2_of_StructA(res).int == b
    ensures acc(asStructA(res).structA, wildcard)
    ensures asStructA(res).structA == res

Then we can also make a new pointer to the struct:

method make_StructA_pointer()
    returns (res: Pointer)
    ensures block_length(pointer_block(res)) == 1
    ensures pointer_offset(res) == 0
    ensures acc(ptrDeref(res).structA, write)
    ensures asInt(ptrDeref(res)) == asInt(ptrDeref(res).structA)
    ensures asIntArray(ptrDeref(res)) == asIntArray(ptrDeref(res).structA)
    ensures asStructA(ptrDeref(res)) == ptrDeref(res)

Then we can read and write to the struct in the pointer as if it was an integer (accessing the first element), as an integer array, or as a StructA:

var p: Pointer := make_StructA_pointer();
asInt(ptrDeref(p)).int := 10
assert aloc(asIntArray(ptrDeref(p)).intArray, 0) == 10
assert aloc(field_1_of_StructA(asStructA(ptrDeref(p)).structA).intArray, 0) == 10

Note that when replacing the struct inside of the pointer we must now use a method to preserve the as* functions:

method write_to_StructA_pointer(ptr: Pointer, value: StructA)
    requires acc(ptrDeref(ptr).structA, write)
    ensures acc(ptrDeref(ptr).structA, write)
    ensures ptrDeref(ptr).structA == value
    ensures asInt(ptrDeref(ptr)) == asInt(ptrDeref(ptr).structA)
    ensures asIntArray(ptrDeref(ptr)) == asIntArray(ptrDeref(ptr).structA)
    ensures asStructA(ptrDeref(ptr)) == ptrDeref(ptr)

Using this approach for the current encoding (but with Ref fields) instead of the "ADT with Ref fields" encoding would be pretty much the same except that with the ADT approach you could make asIntArray(res) == field_1_of_StructA(res) an axiom.

So, any opinions on this? Am I overcomplicating it and should we stick to the current encoding (with the as* stuff added to it) but just with different COL nodes?

I think it would be nice if we could have some notion of equality on structs for specifications. For all of these encodings that would mean generating a function which checks this. It is also interesting to compare this approach to the way Prusti encodes Rust datatypes, which is similar to our current encoding except that all the permissions are put into predicates which are folded and unfolded when needed. They also have the notion of snapshots, where the current state of a struct is encoded into an ADT for use in pure functions, etc.

superaxander · 2024-05-08T10:55:54Z

superaxander
May 8, 2024
Maintainer Author

So in my head, when I was thinking about having a Struct COL node, this would be something which only has fields, like in C and LLVM. But of course in C++ structs are just classes where things are public by default. Would it then make sense to keep the C++ frontend converting C++ structs to classes and have the C/LLVM frontends convert structs to the specific Struct COL node? Of course in C++ a class can still be treated as a value type, unlike in Java, so we could also include the ability to have constructors and methods on the Struct COL node and have the only real difference between the two be that structs get copied by assignments and classes just duplicate the reference. In that case, would a C++ class/struct become a COL Struct if it has a copy constructor and a COL Class if it doesn't?
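The difference between the two semantics can be sketched in plain C, where struct assignment copies and pointer assignment aliases. This is only meant to pin down the behaviour being discussed; the struct A here is a made-up one-field example.

```c
#include <assert.h>

struct A {
    int a;
};

static void copy_vs_alias_demo(void) {
    struct A b = { 10 };

    /* By-value semantics: c is an independent copy of b. */
    struct A c = b;
    c.a = 5;
    assert(b.a == 10);  /* writing to the copy does not affect b */

    /* By-reference semantics: r refers to the same object as b. */
    struct A *r = &b;
    r->a = 7;
    assert(b.a == 7);   /* writing through the reference does affect b */
    assert(c.a == 5);   /* the earlier copy is still independent */
}
```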

2 replies

sakehl May 14, 2024
Maintainer

Hmm, strange. I thought there was a fundamental difference between structs and classes with respect to passing by reference or by value. But apparently classes in C++ are passed by value, just like structs.

So I think we would just need two types of classes in COL: the "ValueClass" (or "CClass", whatever name we want), which is passed by value, and the normal "Class", which is passed by reference.

For the above discussion: I think it would be best if the attributes/fields of a class/struct are always references. That way, in a parallel block we can divide permissions among different threads, so that different threads can write to them. With an ADT approach, this would never be possible.

superaxander May 14, 2024
Maintainer Author

Yeah, but then I wonder if we should just make "byValue" a boolean flag on Class or make ValueClass extend Class, because otherwise we are duplicating a lot of stuff. Then I think we'd just need to change ClassToRef to make copies whenever a byValue Class is used in an expression and make declarations include initialisation for byValue classes? Or are there other things that would need to be different between the two types (other than maybe nice stuff in the future like being able to use == to check for equality)?

superaxander · 2024-05-14T09:30:17Z

superaxander
May 14, 2024
Maintainer Author

Bob mentioned that we try to avoid having boolean flags in COL nodes, and I think using separate nodes is indeed a good choice.

Of course we also need separate type nodes for these two classes, and in fact we might be able to get away with only having different types and keeping the Class node the same, since we can determine whether to use by-value or by-reference semantics based on the type of the expression. Then we could do something similar to this:

---
title: Class Diagram
---
classDiagram
    direction LR
    `trait TClass` : +*clsː Ref[G, Class[G]]
    `trait TClass` : +*typeArgsː Seq[Variable[G]]

    `case class TByReferenceClass` : +clsː Ref[G, Class[G]]
    `case class TByReferenceClass` : +typeArgsː Seq[Variable[G]]

    `case class TByValueClass` : +clsː Ref[G, Class[G]]
    `case class TByValueClass` : +typeArgsː Seq[Variable[G]]
    
    `trait TClass` <|-- `case class TByReferenceClass`
    `trait TClass` <|-- `case class TByValueClass`

The only thing that I can think of right now to watch out for is that I don't think it makes sense to have an intrinsic lock invariant on a by value class

1 reply

superaxander May 14, 2024
Maintainer Author

I'm actually not sure how TAnyClass fits into this. Is it confusing if that is then not part of this type hierarchy? Would we need a separate TByValueAnyClass?

superaxander · 2024-05-14T12:10:22Z

superaxander
May 14, 2024
Maintainer Author

After thinking about this a bit longer (and trying some stuff out) I've found that only differentiating in the type is not enough since there are places where we only have the Class and not an instance of TClass. I guess this means we'll need to use a similar kind of structure for the Class node:

---
title: Class Diagram
---
classDiagram
    direction LR
    `trait Class` : +*typeArgsː Seq[Variable[G]]
    `trait Class` : +*declsː Seq[ClassDeclaration[G]]
    `trait Class` : +*supportsː Seq[Type[G]]
    
    `case class ByReferenceClass` : +typeArgsː Seq[Variable[G]]
    `case class ByReferenceClass` : +declsː Seq[ClassDeclaration[G]]
    `case class ByReferenceClass` : +supportsː Seq[Type[G]]
    `case class ByReferenceClass` : +intrinsicLockInvariantː Expr[G]

    `case class ByValueClass` : +typeArgsː Seq[Variable[G]]
    `case class ByValueClass` : +declsː Seq[ClassDeclaration[G]]
    `case class ByValueClass` : +supportsː Seq[Type[G]]
    
    `trait Class` <|-- `case class ByReferenceClass`
    `trait Class` <|-- `case class ByValueClass`

We should probably also think about Java's value classes, in case Project Valhalla ever materialises and we want to support it.

1 reply

bobismijnnaam May 15, 2024
Maintainer

where we only have the Class and not an instance of TClass

Might not be relevant, but when I was implementing generics I noticed this as well. In the end I had to refactor those places to use a tclass, as those pieces of code were just not accounting for generics. Once I started replacing class with tclass where it was possible or made sense, it got sorted pretty quickly.

superaxander · 2024-05-29T14:02:12Z

superaxander
May 29, 2024
Maintainer Author

So I implemented the above. However, I ran into a somewhat fundamental issue. I wanted to simply implement copy semantics for the ByValueClass by replacing accesses to the struct (when we are in a function call or in an assignment) with making a copy instead. However, a large part of VerCors currently assumes that assignments to locals can never fail, so there is no suitable Blame available for me to use whenever the introduced dereferences for accessing the fields of the to-be-copied ByValueClass fail. In the current implementation this works because LangCToCol is the first rewriter, and as such all the assignments have a non-panic blame attached (no other rewriters have created assignments to locals which they assumed to never fail).

The fact that local assignments are infallible seems like a logical and useful property to have, so we want to preserve that. Since we cannot anticipate every single assignment statement that might be created by another rewriter, we probably cannot be sure that we've added a copy in all places where we should add it for the proper semantics.

Pieter suggested we keep track of a separate set of local variables, the "local heap variables", which can only be accessed through a HeapLocal node (instead of the Local node). Every "local heap variable" is stored as a pointer and must therefore be accessed through a pointer dereference. For example we can take this simple program:

struct A {
    int a;
};
void test() {
    struct A b;
    b.a = 10;
    struct A c = b;
    c.a = 5;
}

Which would be transformed into:

struct A {
    int *a;
};
void test() {
    @heap@ struct A b;
    ptrDeref(HeapLocal(b)) = new A();
    ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
    @heap@ struct A c;
    ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
    ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
}

Next we look for the address-of operator to determine if we need to reason about the memory location of the struct and its fields. In this example we have no address-of operator. Therefore, we can flatten this to:

struct A {
    int a;
};
void test() {
    struct A b;
    Local(b) = new A();
    Local(b).a = 10;
    struct A c;
    Local(c) = new A();
    Local(c).a = Local(b).a;
    Local(c).a = 5;
}

We would do this flattening at a late stage (probably somewhere just before the side-effects get resolved?) such that we do not risk any other rewrites making it impossible for us to add copy semantics. (the HeapLocal node would have a Blame since we are essentially saying that for this type of variable it is possible for a "read" of the value to cause an error)

Then we can distinguish two more cases, one where we take a reference to the struct and one where we take a reference to the field.

struct A {
    int a;
};
void other(struct A* d) {}
void test() {
    struct A b;
    b.a = 10;
    other(&b);
    struct A c = b;
    c.a = 5;
}

Gets transformed into

struct A {
    int *a;
};
void other(struct A* d) {}
void test() {
    @heap@ struct A b;
    ptrDeref(HeapLocal(b)) = new A();
    ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
    other(AddrOf(ptrDeref(HeapLocal(b))));
    @heap@ struct A c;
    ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
    ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
}

After the TrivialAddrOf rewrite pass this becomes

struct A {
    int *a;
};
void other(struct A* d) {}
void test() {
    @heap@ struct A b;
    ptrDeref(HeapLocal(b)) = new A();
    ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
    other(HeapLocal(b));
    @heap@ struct A c;
    ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
    ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
}

Then, because we have a "naked" HeapLocal, we cannot lower it to a regular variable, so it becomes a pointer type. We do lower it to a normal Local variable, since we assume that the last few remaining rewrites don't add any new assignments/calls that might require copy semantics (alternatively we could keep the separate HeapLocal around until ColToSilver, at which point it just becomes a silver.LocalVar). Because we never got a reference to the field, that one can be lowered to a simple integer instead of a pointer. Therefore we end up with:

struct A {
    int a;
};
void other(struct A* d) {}
void test() {
    struct A *b;
    ptrDeref(Local(b)) = new A();
    ptrDeref(Local(b)).a = 10;
    other(Local(b));
    struct A c;
    Local(c) = new A();
    Local(c).a = ptrDeref(Local(b)).a;
    Local(c).a = 5;
}

Finally an example where we take a reference to the field:

struct A {
    int a;
};
void other(int *d) {}
void test() {
    struct A b;
    b.a = 10;
    other(&b.a);
    struct A c = b;
    c.a = 5;
}

Gets transformed into

struct A {
    int *a;
};
void other(int* d) {}
void test() {
    @heap@ struct A b;
    ptrDeref(HeapLocal(b)) = new A();
    ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
    other(AddrOf(ptrDeref(ptrDeref(HeapLocal(b)).a)));
    @heap@ struct A c;
    ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
    ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
}

After the TrivialAddrOf rewrite pass this becomes

struct A {
    int *a;
};
void other(int* d) {}
void test() {
    @heap@ struct A b;
    ptrDeref(HeapLocal(b)) = new A();
    ptrDeref(ptrDeref(HeapLocal(b)).a).int = 10;
    other(ptrDeref(HeapLocal(b)).a);
    @heap@ struct A c;
    ptrDeref(HeapLocal(c)) = ptrDeref(HeapLocal(b));
    ptrDeref(ptrDeref(HeapLocal(c)).a).int = 5;
}

Then, because we have a field that is accessed without a ptrDeref, we cannot lower it to a regular field. However, because we never got a reference to the HeapLocal struct itself, that one can be lowered to a regular local variable instead of a pointer. Therefore we end up with:

struct A {
    int *a;
};
void other(int* d) {}
void test() {
    struct A b;
    Local(b) = new A();
    ptrDeref(Local(b).a).int = 10;
    other(Local(b).a);
    struct A c;
    Local(c) = new A();
    ptrDeref(Local(c).a).int = ptrDeref(Local(b).a).int;
    ptrDeref(Local(c).a).int = 5;
}
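The three worked cases can be summarised as a small decision rule. The C sketch below is only an illustration under the assumption that the two decisions are independent; the case where both the struct and a field have their address taken is not worked out above, so its result here is an extrapolation. All type and function names are hypothetical and do not correspond to VerCors code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of how a by-value struct local is used. */
struct AddrOfUsage {
    bool struct_address_taken;  /* e.g. other(&b)   */
    bool field_address_taken;   /* e.g. other(&b.a) */
};

/* Hypothetical outcome of the late flattening step. */
struct Lowering {
    bool local_is_pointer;  /* b stays a pointer: struct A *b   */
    bool field_is_pointer;  /* the field stays a pointer: int *a */
};

/* Assumption: each pointer indirection is kept only if the matching
   address-of operator occurs somewhere; otherwise it is flattened away. */
static struct Lowering decide_lowering(struct AddrOfUsage u) {
    struct Lowering l;
    l.local_is_pointer = u.struct_address_taken;
    l.field_is_pointer = u.field_address_taken;
    return l;
}

static void lowering_demo(void) {
    struct AddrOfUsage no_addr_of  = { false, false };
    struct AddrOfUsage struct_addr = { true,  false };
    struct AddrOfUsage field_addr  = { false, true  };

    /* First example: everything flattens to plain locals and fields. */
    assert(!decide_lowering(no_addr_of).local_is_pointer);
    assert(!decide_lowering(no_addr_of).field_is_pointer);
    /* Second example: b becomes struct A *b, the field stays an int. */
    assert(decide_lowering(struct_addr).local_is_pointer);
    assert(!decide_lowering(struct_addr).field_is_pointer);
    /* Third example: b is a plain local, the field stays int *a. */
    assert(!decide_lowering(field_addr).local_is_pointer);
    assert(decide_lowering(field_addr).field_is_pointer);
}
```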

This is just a worked example of the idea. I've yet to convince myself that we've gotten all the corner cases here. There are some unanswered questions:

- What about variables that are already on the heap?
- If you add a new function which takes the reference of a struct field, does that mean that you now need to deal with permission to access the field where you didn't before (because it wasn't a pointer type by default)? I think so, but is that a problem?
- How do we want the user to specify the permissions? Since structs are values, I think you should be able to just say Perm(b, f) to get access to all the "real" fields (as in fields in Viper) of b, and Perm(b.a, f) to get access to a specific "C field", which would only be necessary if there is some place where you refer to the address of b.a. Alternatively we could always represent the fields as pointers and make Perm(b, f) grant access to the "real" Viper fields and the "C fields". Then Perm(b.a, f) would also grant access to both the "real" Viper field and the "C field".
  - The advantage of storing a struct as an ADT with Refs in it for every field is that we don't have separate "real" Viper fields to worry about.

5 replies

superaxander May 30, 2024
Maintainer Author

For variables that are already on the heap we can keep using the same logic since the DerefHeapVariable is preserved until HeapVariableToRef. That means we can simply look for DerefHeapVariables with type ByValueClass and apply a copy at the appropriate places. The only difference between using HeapVariable + DerefHeapVariable and LocalHeapVariable + HeapLocal would be that heap local/local heap variables can be lowered to local variables (Variable + Local) if no one takes their address.

sakehl May 30, 2024
Maintainer

You mention ""real" fields" and "C fields". But I don't really see why you need both?
Or what the difference is. (Probably I just don't understand so that is fine :p but hard to help out if I don't understand this)

Also, small remark in the latest dev, I've added support for typedefs. So if you don't want to type 'struct' before every struct type, this is now possible:)

superaxander May 30, 2024
Maintainer Author

@sakehl

You mention ""real" fields" and "C fields". But I don't really see why you need both? Or what the difference is. (Probably I just don't understand so that is fine :p but hard to help out if I don't understand this)

Well, we encode pointers using the Pointer ADT. So a "C field" is the field ptrDeref(b.a).int in Viper, whereas the "real field" is b.a in Viper. Because in Viper we would have (simplified, we actually use Option<Pointer<int>>):

// "Real field"
field a: Pointer

// "C field"
field int: Int

method new_int_pointer() returns (res: Pointer)
ensures acc(ptrDeref(res).int, write)

method test() 
{
    var b: Ref := new(a)
    b.a := new_int_pointer();
    ptrDeref(b.a).int := 5
}

If we were to modify the "encoding as ADTs with references" approach I mentioned in the top post by simply having all the elements be of the Pointer type we would avoid having a separate "real field" for which we also need to track permissions.

Also, small remark in the latest dev, I've added support for typedefs. So if you don't want to type 'struct' before every struct type, this is now possible:)

Yup and I've used that but I usually don't when I write C (out of habit, no real reason) so I forgot to do that here :)

sakehl May 30, 2024
Maintainer

Check, I get that now :)

Hmm, maybe then an ADT with references makes more sense. The encoding in Viper becomes simpler with fewer permissions to track, which I think is an advantage.

pieter-bos May 30, 2024
Maintainer

Aside: maybe we can scan for write/read usages of HeapLocal in the method that they are declared, in which case we might automatically add the relevant permission in the nearby contract (method/loop). This is not infallible (after all you could just do int a; int *p=&a; and use p instead) but I think scanning is a correct lower bound of the permissions you need - so why not add them.
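The aliasing case from the aside can be written out as a tiny C function: after the address is taken, the variable is written without its name appearing at the write, so a scan for direct uses of a would not see it. The function name is made up for this sketch.

```c
#include <assert.h>

static int write_through_alias(void) {
    int a = 0;
    int *p = &a;  /* from here on, a can be accessed without naming it */
    *p = 42;      /* this write needs permission for a, but only mentions p */
    return a;
}

static void alias_demo(void) {
    /* The write through p is visible when a is read back. */
    assert(write_through_alias() == 42);
}
```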
