Cleaned up descriptions in the README.

kanaverse · Dec 3, 2024 · d67a795 · d67a795
1 parent 6dcf1a5
commit d67a795
Showing 1 changed file with 49 additions and 131 deletions.
diff --git a/README.md b/README.md
@@ -2,24 +2,34 @@
 
 ## Overview
 
-The **wobbegong** R client converts `SummarizedExperiment` objects into files that can be easily interrogated via HTTP range requests.
-The idea is to use a static file server to transfer parts of the object to the client for use in web applications.
-Using the R client is very easy; just take your `SummarizedExperiment` and call the `wobbegongify()` function:
+The **wobbegong** R package converts SummarizedExperiment objects into files that can be easily accessed via HTTP range requests.
+The idea is to use a static file server to host **wobbegong**-formatted files, allowing web applications to transfer specific parts of the SummarizedExperiment to the client on demand.
+No custom server logic is required, nor do we require a bulk download of the entire dataset.
+Front-end developers should check out the [**wobbegong.js**](https://github.com/kanaverse/wobbegong.js) package to interface with the hosted **wobbegong** files.
+
+## Quick start
+
+To get started, we first install the package:
+
+```r
+devtools::install_github("kanaverse/wobbegong-R")
+```
+
+Then, we call the `wobbegongify()` function on our SummarizedExperiment of interest:
 
 ```r
 library(wobbegong)
 wobbegongify(my_se, "/my/server/directory")
 ```
 
-This will dump a whole bunch of files at the requested directory.
-Now, the real challenge is how to retrieve information from these files on the client.
+This will dump a whole bunch of files at the specified directory, which can be accessed with [**wobbegong.js**](https://github.com/kanaverse/wobbegong.js).
 
 ## Directory structure
 
-### For a `SummarizedExperiment`
+### For a SummarizedExperiment
 
 The top-level directory (referred to here as `{DIR}`) has a number of files and subdirectories.
-The most important is `{DIR}/summary.json`, which provides a summary of the `SummarizedExperiment`'s components.
+The most important is `{DIR}/summary.json`, which provides a summary of the SummarizedExperiment's components.
 This will have the following properties:
 
 - `row_count`: integer, the number of rows.
@@ -30,9 +40,9 @@ This will have the following properties:
 - `reduced_dimension_names`: array of strings, the names of the reduced dimensions.
   Only available for `SingleCellExperiment` objects.
 
-If `has_row_data = true`, a `{DIR}/row_data` subdirectory will be present, containing the row annotations in the [`DataFrame` directory layout](#for-a-dataframe).
+If `has_row_data = true`, a `{DIR}/row_data` subdirectory will be present, containing the row annotations in the [DataFrame directory layout](#for-a-dataframe).
 
-If `has_column_data = true`, a `{DIR}/column_data` subdirectory will be present, containing the column annotations in the [`DataFrame` directory layout](#for-a-dataframe).
+If `has_column_data = true`, a `{DIR}/column_data` subdirectory will be present, containing the column annotations in the [DataFrame directory layout](#for-a-dataframe).
 
 Any assay with more than two dimensions is not converted and is automatically excluded from `assay_names`.
 Any non-integer, non-logical or non-double assays are also ignored.
@@ -46,18 +56,18 @@ For each element of `reduced_dimensions`, a subdirectory will be present at `{DI
 (For example, the first reduced dimensionality result would be present at `{DIR}/reduced_dimensions/0`.)
 This subdirectory uses the [reduced dimension directory layout](#for-reduced-dimensions).
 
-### For a `DataFrame`
+### For a DataFrame
 
-Each `DataFrame` directory contains a `summary.json` file and a `content` file.
+Each DataFrame directory contains a `summary.json` file and a `content` file.
 The `summary.json` file has the following properties:
 
 - `byte_order`: string, the byte order used for [encoding](#data-encoding).
-- `row_count`: integer, the number of rows in the `DataFrame`.
-- `has_row_names`: boolean, whether row names are present in the `DataFrame`.
+- `row_count`: integer, the number of rows in the DataFrame.
+- `has_row_names`: boolean, whether row names are present in the DataFrame.
 - `columns`: object, information about the columns.
   - `names`: array of strings, the column names.
-    Each value corresponds to one of the columns of the `DataFrame`.
-  - `types`: array of strings, the type of each column (integer, boolean, string or double).
+    Each value corresponds to one of the columns of the DataFrame.
+  - `types`: array of strings, the type of each column (`"integer"`, `"boolean"`, `"string"` or `"double"`).
     Each value corresponds to an entry of `names`.
   - `bytes`: array of integers, the length (in bytes) of the range in `content` corresponding to each column.
     Each value corresponds to an entry of `names`.
@@ -68,7 +78,8 @@ For example, if `bytes` is `[100, 200, 300]`, the first column could be retrieve
 the second column would be retrieved by requesting bytes `100-299`;
 and the final column (or the row names, if `has_row_names = true`) would be retrieved with `300-599`.
 
-Once a column is retrieved, it can be decoded according to its type in `types` - see the [Data encoding section](#data-encoding) for more details.
+Once a column is retrieved, its byte range can be decoded into an array according to its type in `types` - see the [Data encoding section](#data-encoding) for more details.
+The length of the decoded array is guaranteed to be the same length as `row_count`.
 
 ### For an assay matrix
 
@@ -78,7 +89,7 @@ The `summary.json` file has the following properties:
 - `byte_order`: string, the byte order used for [encoding](#data-encoding).
 - `row_count`: integer, the number of rows in the matrix.
 - `column_count`: integer, the number of rows in the matrix.
-- `type`: string, the type of the matrix (integer, boolean or double).
+- `type`: string, the type of the matrix (`"integer"`, `"boolean"` or `"double"`).
 - `format`: string, the matrix format (dense or sparse).
 - `statistics`: object, information about the [statistics](#matrix-statistics).
 
@@ -94,8 +105,8 @@ For example, if `row_bytes` is `[100, 200, 300]`, the first row could be retriev
 the second row would be retrieved by requesting bytes `100-299`;
 and the final row would be retrieved with `300-599`.
 
-Once a row is retrieved from a dense matrix, it can be decoded according to the matrix `type` - see the [Data encoding section](#data-encoding) for more details.
-Decoding yields an array that is guaranteed to be the same length as `column_count`.
+Once a row is retrieved, its byte range can be decoded into an array according to the matrix `type` - see the [Data encoding section](#data-encoding) for more details.
+The length of the decoded array is guaranteed to be the same length as `column_count`.
 
 #### Sparse matrices
 
@@ -117,10 +128,13 @@ Values of the first row could be retrieved by requesting bytes `0-99`, while the
 Values of the second row could be retrieved by requesting bytes `110-309`, while the delta-encoded indices could be retrieved by requesting bytes `310-329`.
 Values of the third row could be retrieved by requesting bytes `330-629`, while the delta-encoded indices could be retrieved by requesting bytes `630-659`.
 
-Once the values are retrieved from a dense matrix, they can be decoded according to the matrix `type` - see the [Data encoding section](#data-encoding) for more details.
-The delta-encoded indices are decoded as integers, which are decoded to the column indices by computing the cumulative sum across the array.
+Once the row is retrieved from a sparse matrix:
+
+- The value byte range can be decoded into an array according to the matrix `type` - see the [Data encoding section](#data-encoding) for more details.
+- The index byte range can be decoded into an integer array, which is converted to the column indices by computing the cumulative sum across the array.
+
 Both decoded arrays are guaranteed to be of the same length that is no greater than `column_count`.
-Column indices (after decoding) are guaranteed to be zero-based and sorted in strictly ascending order.
+Column indices are guaranteed to be zero-based and sorted in strictly ascending order.
 
 #### Matrix statistics 
 
@@ -129,7 +143,7 @@ This is described by the `statistics` property of `summary.json`, which contains
 
 - `names`: array of strings, the names of the statistics.
   This is guaranteed to have `row_sum`, `column_sum`, `row_nonzero` and `column_nonzero`. 
-- `types`: array of strings, the types of the statistics.
+- `types`: array of strings, the types of the statistics (`"integer"`, `"double"`, `"boolean"` or `"string"`).
 - `bytes`: array of integers, the length (in bytes) of the range in `content` corresponding to each statistic.
   Each value corresponds to an entry of `names`.
 
@@ -138,7 +152,9 @@ For example, if `row_bytes` is `[100, 200, 300, 400]`, the first statistic could
 the second statistic would be retrieved by requesting bytes `100-299`;
 and so on.
 
-Once a statistic is retrieved, it can be decoded according to its type in `types` - see the [Data encoding section](#data-encoding) for more details.
+Once a statistic is retrieved, its byte range can be decoded into an array according to its type in `types` - see the [Data encoding section](#data-encoding) for more details.
+All statistics that are named `row_*` or `column_*` are guaranteed to have length equal to the number of matrix rows or columns, respectively,.
+For statistics with other names, the length is implementation-defined.
 
 ### For reduced dimensions
 
@@ -147,7 +163,7 @@ The `summary.json` file has the following properties:
 
 - `byte_order`: string, the byte order used for [encoding](#data-encoding).
 - `row_count`: integer, the number of rows in the reduced dimension matrix.
-- `type`: string, the type of the data (integer, boolean, string or double).
+- `type`: string, the type of the data (`"integer"`, `"double"`, `"boolean"` or `"string"`).
 - `column_bytes`: array of integers, the length (in bytes) of the range in `content` corresponding to each column of the reduced dimension matrix.
   The number of columns is defined by the length of this array.
 
@@ -156,120 +172,22 @@ For example, if `row_bytes` is `[100, 200, 300]`, the first column could be retr
 the second column would be retrieved by requesting bytes `100-299`;
 and the final column would be retrieved with `300-599`.
 
-Once a column is retrieved, it can be decoded according to its type in `types` - see the [Data encoding section](#data-encoding) for more details.
+Once a column is retrieved, its byte range can be decoded into an array according to its type in `types` - see the [Data encoding section](#data-encoding) for more details.
+The length of the array is guaranteed to be equal to `row_count`.
 
 ## Data encoding
 
-### Basics
-
-Integer data are encoded as a DEFLATE-compressed array of 32-bit signed integers in the specified `byte_order`.
+Integer data (`"integer"`) are encoded as a DEFLATE-compressed array of 32-bit signed integers in the specified `byte_order`.
 Missing values are represented as -2147483648.
 
-Double-precision data are encoded as a DEFLATE-compressed array of 64-bit IEEE double-precision floats in the specified `byte_order`.
+Double-precision data (`"double"`) are encoded as a DEFLATE-compressed array of 64-bit IEEE double-precision floats in the specified `byte_order`.
 This may contain IEEE special values like NaN and infinity.
+Missing values are encoded as NaN with a payload of 1954, inherited from R
+(see discussion [here](https://stackoverflow.com/questions/70471859/difference-between-na-real-and-nan/70472081#70472081)).
 
-Boolean data are encoded as a DEFLATE-compressed array of 8-bit unsigned integers. 
+Boolean data (`"boolean"`) are encoded as a DEFLATE-compressed array of 8-bit unsigned integers. 
 Values of 0 represent false, values of 1 represent true, and values of 2 are missing. 
 
-String data are encoded as a DEFLATE-compressed array of null-terminated strings.
+String data (`"string"`) are encoded as a DEFLATE-compressed array of null-terminated strings.
 Each string can be assumed to follow the UTF-8 character encoding.
 Missing values are represented by the Unicode replacement character (`U+FFFD`).
-
-### Demonstration code
-
-First let's set up some common utility functions:
-
-```js
-async function decompress(raw) {
-    // See https://developer.mozilla.org/en-US/docs/Web/API/DecompressionStream
-    // for the implementation status across browsers and frameworks.
-    let dec = new DecompressionStream("deflate-raw");
-    let bb = new Blob([raw]);
-    let readable = bb.stream().pipeThrough(dec);
-
-    let chunks = [];
-    let total = 0;
-    for await (const chunk of readable) {
-        chunks.push(chunk)
-        total += chunk.length;
-    }
-
-    let combined = new Uint8Array(total);
-    let offset = 0;
-    for (const chunk of chunks) {
-        combined.set(chunk, offset);
-        offset += chunk.length;
-    }
-
-    return combined;
-}
-
-function current_byte_order() { 
-    let val = new Int32Array([1]);
-    let view = new Uint8Array(val.buffer)
-    return (view[0] == 1 ? "little_endian" : "big_endian");
-}
-
-function convert_byte_order(x, size) {
-    for (let i = 0; i < x.length; i += size) {
-        const sub = x.subarray(i, i + size);
-        sub.reverse();
-    }
-}
-```
-
-Now we can decode integer data:
-
-```js
-let out = await decompress(range);
-if (current_byte_order() != summary["byte_order"]) {
-    convert_byte_order(out, 4);
-}
-let data = new Int32Array(out.buffer);
-```
-
-Doubles:
-
-```js
-let out = await decompress(range);
-if (current_byte_order() != summary["byte_order"]) {
-    convert_byte_order(out, 8);
-}
-let data = new Float64Array(out.buffer);
-```
-
-Booleans:
-
-```js
-let out = await decompress(range);
-let data = Array.from(out);
-for (const [i, v] of Object.entries(data)) {
-    switch (v) {
-        case 0: case 1:
-            data[i] = (v != 0);
-            break;
-        default:
-            data[i] = null;
-    }
-}
-```
-
-And strings:
-
-```js
-let out = await decompress(range);
-let last = 0;
-let data = [];
-const dec = new TextDecoder;
-for (let i = 0; i < out.length; i++) {
-    if (out[i] == 0) {
-        const view = out.subarray(last, i);
-        let curstr = dec.decode(view);
-        if (curstr == "�") {
-            curstr = null;
-        }
-        data.push(curstr);
-        last = i + 1;
-    }
-}
-```