Base32768 is a binary encoding optimised for UTF-16-encoded text. This JavaScript module, base32768, is the first implementation of this encoding.
The efficiency chart speaks for itself. Efficiency ratings are averaged over long inputs. Higher is better.
| Category | Encoding | Efficiency (UTF‑8) | Efficiency (UTF‑16) | Efficiency (UTF‑32) | Bytes per Tweet * |
|---|---|---|---|---|---|
| ASCII‑constrained | Unary / Base1 | 0% | 0% | 0% | 1 |
| ASCII‑constrained | Binary | 13% | 6% | 3% | 35 |
| ASCII‑constrained | Hexadecimal | 50% | 25% | 13% | 140 |
| ASCII‑constrained | Base64 | 75% | 38% | 19% | 210 |
| ASCII‑constrained | Base85 † | 80% | 40% | 20% | 224 |
| BMP‑constrained | HexagramEncode | 25% | 38% | 19% | 105 |
| BMP‑constrained | BrailleEncode | 33% | 50% | 25% | 140 |
| BMP‑constrained | Base2048 | 56% | 69% | 34% | 385 |
| BMP‑constrained | Base32768 | 63% | 94% | 47% | 263 |
| Full Unicode | Ecoji | 31% | 31% | 31% | 175 |
| Full Unicode | Base65536 | 56% | 64% | 50% | 280 |
| Full Unicode | Base131072 ‡ | 53%+ | 53%+ | 53% | 297 |
* New-style "long" Tweets, up to 280 Unicode characters give or take Twitter's complex "weighting" calculation.
† Base85 is listed for completeness, but all variants use characters which are considered hazardous for general use in text: escape characters, brackets, punctuation, etc.
‡ Base131072 is a work in progress, not yet ready for general use.
Base32768 uses only "safe" Unicode code points: no unassigned code points, no whitespace, no control characters, etc.
npm install base32768
import { encode, decode } from 'base32768'
const uint8Array = new Uint8Array([104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100])
const str = encode(uint8Array)
console.log(str)
// 6 code points, '媒腻㐤┖ꈳ埳'
const uint8Array2 = decode(str)
console.log(uint8Array2)
// [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
Load this file in the browser to gain access to a base32768 global.
<script src="https://unpkg.com/base32768@2/dist/iife/base32768.js" crossorigin></script>
<script>
console.log(base32768.decode('怗膹䩈㭴䂊䫁輪黔'))
</script>
encode(uint8Array) encodes a Uint8Array and returns a Base32768 string. Note that every Node.js Buffer is a Uint8Array.
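As a small illustration of the note about Buffer, the sketch below checks in Node.js that a Buffer really is a Uint8Array, and that the bytes of the ASCII string 'hello world' match the array used in the usage example. It deliberately does not call this module, so the variable names are purely illustrative.

```javascript
// The bytes of the ASCII string 'hello world', as a Node.js Buffer.
const buf = Buffer.from('hello world', 'utf8')

// Buffer is a subclass of Uint8Array, so no conversion should be needed
// before handing it to a function that expects a Uint8Array.
const isUint8Array = buf instanceof Uint8Array
console.log(isUint8Array) // true

// Same 11 bytes as the Uint8Array literal in the usage example.
const bytes = Array.from(buf)
console.log(bytes)
```

A Buffer built this way can therefore be passed wherever a Uint8Array is expected.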
The string is suitable for passing safely through almost any "Unicode-clean" text-handling API. This string contains no special characters and is immune to Unicode normalization. Give or take some padding characters, the output string has 1 character per 15 bits of input.
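The two claims above can be spot-checked on the sample output from the usage example. The sketch below is only an illustration: the helper `expectedLength` is hypothetical (it is not part of this module) and assumes that any final partial group of bits is padded out to one whole character.

```javascript
// Sample Base32768 output from the usage example (11 input bytes).
const sample = '媒腻㐤┖ꈳ埳'

// A normalization-immune string is unchanged by all four Unicode forms.
const forms = ['NFC', 'NFD', 'NFKC', 'NFKD']
const unchanged = forms.every(form => sample.normalize(form) === sample)
console.log(unchanged) // true

// Hypothetical helper, not part of the module: estimated output length,
// ASSUMING one character per 15 bits with the last group rounded up.
const expectedLength = byteCount => Math.ceil((byteCount * 8) / 15)

console.log(expectedLength(11)) // 6
console.log([...sample].length) // 6
```

For 11 input bytes (88 bits) the estimate is 6 characters, which agrees with the 6 code points shown in the usage example.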
All characters are chosen from the Basic Multilingual Plane. This means that when encoded as UTF-16, all characters occupy 16 bits. Thus, there are 16 bits of output UTF-16 text per 15 bits of input, an efficiency of 93.75%.
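To make the arithmetic concrete, the following sketch measures how many bytes the sample string from the usage example occupies in each Unicode encoding and compares that with the 11 input bytes. It is an illustration only: the variable names are made up here, and a short input like this one falls slightly below the long-input figures because of padding.

```javascript
// 11 bytes of input produced this 6-character string in the usage example.
const inputBytes = 11
const encoded = '媒腻㐤┖ꈳ埳'

// Size of the encoded text in each Unicode encoding form.
const utf8Bytes = new TextEncoder().encode(encoded).length // UTF-8 bytes
const utf16Bytes = encoded.length * 2 // 2 bytes per UTF-16 code unit
const utf32Bytes = [...encoded].length * 4 // 4 bytes per code point

// Efficiency = input bytes / output bytes, as a percentage.
const percent = outputBytes => (100 * inputBytes) / outputBytes

console.log(utf8Bytes, percent(utf8Bytes).toFixed(1))
console.log(utf16Bytes, percent(utf16Bytes).toFixed(1))
console.log(utf32Bytes, percent(utf32Bytes).toFixed(1))

// Long-input limit for UTF-16: 15 payload bits per 16-bit code unit.
const utf16Limit = (100 * 15) / 16
console.log(utf16Limit) // 93.75
```

As the input grows, the cost of the final padded character is amortised and the UTF-16 figure approaches the 93.75% limit.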
decode(str) decodes a Base32768 string and returns a Uint8Array containing the original binary data. Note that a Uint8Array can be converted to a Node.js Buffer like so:
const buffer = Buffer.from(uint8Array.buffer, uint8Array.byteOffset, uint8Array.byteLength)
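The offset and length arguments matter whenever a Uint8Array is a view onto part of a larger ArrayBuffer. Whether decode ever returns such a view is not stated here, so the sketch below builds one by hand (the names `backing`, `view`, `whole` and `exact` are illustrative) to show the difference.

```javascript
// An 8-byte backing store holding the values 0..7.
const backing = new ArrayBuffer(8)
new Uint8Array(backing).set([0, 1, 2, 3, 4, 5, 6, 7])

// A 3-byte view starting at offset 2, i.e. the bytes 2, 3, 4.
const view = new Uint8Array(backing, 2, 3)

// Without offset and length, the Buffer spans the whole backing store.
const whole = Buffer.from(view.buffer)
console.log(whole.length) // 8

// With offset and length, the Buffer covers exactly the view's bytes.
const exact = Buffer.from(view.buffer, view.byteOffset, view.byteLength)
console.log(exact.length) // 3

// No copy is made: the Buffer shares memory with the original view.
view[0] = 99
console.log(exact[0]) // 99
```

Because the resulting Buffer shares memory with the Uint8Array, use `Buffer.from(uint8Array)` instead if an independent copy is wanted.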
MIT