Mission to decode block files (.sai) #598

mat888 · 2022-06-04T19:58:54Z

Update Below

Interested in anyone that can help:

I'm on a mission to decode the binary block data stored from the saito-lite-rust client written in JS. I've made some decent progress:

Firstly I've gotten the binary data into Haskell using:

import qualified Data.ByteString as B

main :: IO ()
main = do
  contents <- B.readFile "block.sai"
  -- unpack turns the ByteString into a [Word8]
  print (B.unpack contents)

This gives a list of type [GHC.Word.Word8]. In practice it's a list of unsigned 8-bit integers, each in the range 0 to 255.

According to lines 6 and 158 of lib/saito/block.ts, the header is 229 bytes long, so the transaction data starts at byte offset 229.

const BLOCK_HEADER_SIZE = 229;
// ...
let start_of_transaction_data = BLOCK_HEADER_SIZE;

From there, the following code from lib/saito/transaction.ts reads the metadata for the transaction whose bytes come right after the 229 header bytes:

for (let i = 0; i < transactions_length; i++) {
  const inputs_len = this.app.binary.u32FromBytes(
    buffer.slice(start_of_transaction_data, start_of_transaction_data + 4)
  );
  const outputs_len = this.app.binary.u32FromBytes(
    buffer.slice(start_of_transaction_data + 4, start_of_transaction_data + 8)
  );
  const message_len = this.app.binary.u32FromBytes(
    buffer.slice(start_of_transaction_data + 8, start_of_transaction_data + 12)
  );
  const path_len = this.app.binary.u32FromBytes(
    buffer.slice(start_of_transaction_data + 12, start_of_transaction_data + 16)
  );
  // ...
}

This says that four values are read from the first four groups of 4 bytes, starting 229 bytes past the start of the block. It seems I am on the right track: this is the corresponding point in the list from Haskell.

[0,0,0,1,0,0,0,1,0,0,3,228,0,0,0,1...]

Here is what they should mean according to the TypeScript code above:

0,0,0,1 means one tx input - that's correct
0,0,0,1 means one tx output - that's correct
0,0,3,228 is the message length. 3 * 256 + 228 = 996 bytes
0,0,0,1 is the path_len - not sure exactly but it seems to line up
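
The reading of those 16 bytes can be checked with a small TypeScript sketch. It assumes the length fields are big-endian unsigned 32-bit integers, which matches 0,0,3,228 = 996 above but has not been confirmed against app.binary.u32FromBytes. The names u32FromBytesBE, TxLengths and readTxLengths are made up for illustration.

```typescript
// Read a big-endian unsigned 32-bit integer from `bytes` at `offset`.
// Big-endian is an assumption based on the byte values quoted in the post.
function u32FromBytesBE(bytes: Uint8Array, offset: number): number {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getUint32(offset, false); // false = big-endian
}

// Hypothetical container for the four length fields at the start of a transaction.
interface TxLengths {
  inputsLen: number;
  outputsLen: number;
  messageLen: number;
  pathLen: number;
}

// Read the four consecutive u32 fields beginning at `start`.
function readTxLengths(bytes: Uint8Array, start: number): TxLengths {
  return {
    inputsLen: u32FromBytesBE(bytes, start),
    outputsLen: u32FromBytesBE(bytes, start + 4),
    messageLen: u32FromBytesBE(bytes, start + 8),
    pathLen: u32FromBytesBE(bytes, start + 12),
  };
}

// The 16 bytes quoted above, standing in for the block contents at offset 229.
const sampleBytes = new Uint8Array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 3, 228, 0, 0, 0, 1]);
const txLengths = readTxLengths(sampleBytes, 0);
// txLengths.messageLen is 3 * 256 + 228 = 996
```

On a real block the same call would be readTxLengths(blockBytes, 229), where blockBytes is the whole file as a Uint8Array.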

The test message I included in the transaction is meant to be easy to spot in any encoding or format:

aaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbcccccccccccccccccccccccddddddddddddddddddddeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeefffffffffffffffffffffffffffffffffffggggggggggggggggggggggggggg
1111111111111111111111111111111111112222222222222222222222222222222222222222222333333333333333333333333333333333333333333444444444444444444444444444444444444444444444455555555555555555555555555555555555555555556666666666666666666666666666666666666

The message above is 247 characters long. Assuming each character is encoded in 4 bytes, 247 * 4 = 988, which is just under the message length of 996 encoded as 0,0,3,228. So that seems to make sense.

I was expecting to see clean repetitions that would make the message, and the relationship between characters, obvious.
In practice, while repeating sequences do occur in the binary data, their lengths do not match the lengths of the runs in the plaintext above, and they often vary in unexpected ways.

Using this block of code from lines 190... of the transaction.ts file, I figured the actual message bytes should begin 95 + 75 + 75 = 245 bytes past the start of the transaction metadata.

const transaction_type = buffer[start_of_transaction_data + 92];
const start_of_inputs = start_of_transaction_data + TRANSACTION_SIZE; // TRANSACTION_SIZE (given on line 8) is 95
const start_of_outputs = start_of_inputs + inputs_len * SLIP_SIZE;    // SLIP_SIZE (given on line 9) is 75, times 1 input = 75
const start_of_message = start_of_outputs + outputs_len * SLIP_SIZE;  // 75 times 1 output = 75, so 95 + 75 + 75
const start_of_path = start_of_message + message_len;

//...

const message = buffer.slice(start_of_message, start_of_message + message_len); // 95 + 75 + 75

This calculation could contain a mistake, but it certainly puts us in the ballpark of where the message should sit within the bytes. Here is the first notable repeating sequence that shows up, which might represent repeating plaintext characters:

[...70,104,89,87,70,104,89,87,70,104,89,87,70,104,89,87,70,104,89,87,70,104,89,87,70,(pattern breaks here) 104,89, ~~107~~...]

But it isn't long enough to simply decode as "each repetition is one 'a'". Generally there is a lot more repetition around this area, where counting bytes indicates the message should live, but it doesn't seem to match the test message above in any simple way.

I believed it would be as simple as taking the repeating groups and figuring out which character from the test message each one corresponds to, but the groups never repeat consistently, and the runs are often much shorter than the number of repeated characters in the plaintext. I can post the full list if it's helpful to anyone.
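
For reference, the offset arithmetic from the transaction.ts excerpt above can be written out as a small TypeScript sketch. The constants are the values quoted in this post (229, 95, 75) and have not been re-checked against the source; messageRange is a hypothetical helper name.

```typescript
// Sizes as quoted in the post; treat them as assumptions about the format.
const BLOCK_HEADER_SIZE = 229;
const TRANSACTION_SIZE = 95;
const SLIP_SIZE = 75;

// Byte range [start, end) of the message for the transaction starting at `txStart`.
function messageRange(
  txStart: number,
  inputsLen: number,
  outputsLen: number,
  messageLen: number
): [number, number] {
  const startOfInputs = txStart + TRANSACTION_SIZE;
  const startOfOutputs = startOfInputs + inputsLen * SLIP_SIZE;
  const startOfMessage = startOfOutputs + outputsLen * SLIP_SIZE;
  return [startOfMessage, startOfMessage + messageLen];
}

// First transaction of the block in the post: 1 input, 1 output, 996 message bytes.
const firstTxMessage = messageRange(BLOCK_HEADER_SIZE, 1, 1, 996);
// firstTxMessage is [474, 1470]: 229 + 95 + 75 + 75 = 474, and 474 + 996 = 1470
```

If those sizes hold, the message of the first transaction would be bytes 474 up to (but not including) 1470 of the file.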

This is fairly niche, but if someone is willing to point me in the right direction to finish decoding this, it would be useful for building apps that require more low-level functionality. I'm looking to decode the /blocks folder in real time, modularly, and feed relevant information into other programs like smart contract nodes, and, well, the sky's the limit.

It may be as simple as digging a little deeper into the source code and JS primitives, but in the meantime I'll drop this here.

Update

I've started fresh with cleaner results. The test message this time is a Saitolicious post with:

587 lowercase 'a's
linebreak
559 lowercase 'b's
linebreak
561 lowercase 'c's

I've grouped the bytes into groups of four, since every indication seems to point that way. There are now three runs in the byte array which clearly delineate these long lines of characters.

The line of 'a' is this sequence repeated 195 times:
[89,87,70,104]
195 x 3 = 585 ~ 587 'a's

Line of 'b' is this repeated 184 times
[89,109,74,105]
184 x 3 = 552 ~ 559 'b's

Line of 'c' is this repeated 186 times
[89,50,78,106]
186 x 3 = 558 ~561 'c's

Assuming these byte groupings correspond to three of each character gets us very close. The minor differences are probably explained by similar but not exactly equal byte arrays which sometimes come at the start and/or end of the repeating sequences.
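
The repeat counts above can be reproduced mechanically. Below is a minimal TypeScript sketch that counts how many times a 4-byte group repeats back to back from a given position; countRepeats is a made-up helper, and the demo array is a short stand-in rather than real block data.

```typescript
// Count how many times `group` occurs back to back in `bytes`, starting at `start`.
function countRepeats(bytes: number[], group: number[], start: number): number {
  if (group.length === 0) return 0; // avoid looping forever on an empty group
  let count = 0;
  let pos = start;
  while (
    pos + group.length <= bytes.length &&
    group.every((b, i) => bytes[pos + i] === b)
  ) {
    count++;
    pos += group.length;
  }
  return count;
}

// The 'a' group from the post, repeated three times, followed by the
// non-matching tail group [89,87,69,56] that ends the run.
const aGroup = [89, 87, 70, 104];
const demoBytes = [1, 2].concat(aGroup, aGroup, aGroup, [89, 87, 69, 56]);
const aRuns = countRepeats(demoBytes, aGroup, 2);
// aRuns is 3; at three characters per group that would be 9 characters
```

Run against the real message bytes, the same function should return the 195, 184 and 186 counted above, provided the start position is aligned with the first full group.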

The 'a' group of bytes ends with [89,87,70,104],[89,87,69,56]

The 'b' group of bytes ends with [89,109,74,105],[89,109,73,56]

The 'c' group of bytes starts with [80,109,78,106],[89,50,78,106]
and ends with [89,50,78,106],[89,122,120,105]

These are surely just the cases where a run of characters doesn't fit neatly into groups of three, since it seems fairly clear there are three characters to every four bytes.

I haven't figured out how to get the text out just yet, but this should be a much clearer starting point.
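
One possible lead, based only on the byte values: every byte in these groups is a printable ASCII character ([89,87,70,104] reads as "YWFh"), and three characters per four bytes is exactly the ratio base64 produces. The sketch below treats each group as four base64 characters and decodes it by hand. That the message field really is base64 text is a guess that has not been checked against the saito source, and decodeGroup is a hypothetical helper that ignores '=' padding.

```typescript
const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decode one 4-byte group, treating each byte as an ASCII base64 character.
// Returns null if the group is not 4 bytes long or contains a byte outside
// the standard base64 alphabet (padding with '=' is not handled).
function decodeGroup(group: number[]): string | null {
  if (group.length !== 4) return null;
  let bits = 0;
  for (const byte of group) {
    const value = B64_ALPHABET.indexOf(String.fromCharCode(byte));
    if (value < 0) return null;
    bits = (bits << 6) | value; // accumulate 4 x 6 = 24 bits
  }
  // Split the 24 bits back into three 8-bit characters.
  return String.fromCharCode((bits >> 16) & 0xff, (bits >> 8) & 0xff, bits & 0xff);
}

const decodedA = decodeGroup([89, 87, 70, 104]);    // "aaa"
const decodedB = decodeGroup([89, 109, 74, 105]);   // "bbb"
const decodedC = decodeGroup([89, 50, 78, 106]);    // "ccc"
const decodedATail = decodeGroup([89, 87, 69, 56]); // "aa<"
const decodedCHead = decodeGroup([80, 109, 78, 106]); // ">cc"
```

Under that reading, the odd groups at the edges of each run decode to the last or first few letters plus a '<' or '>', which would fit the line breaks being stored as markup of some kind. If it holds up, running the whole message field through a standard base64 decoder should yield the text directly.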
