Consider a stream of information in binary encoding (“radix 2”) going from sender to recipient, or writer to reader, whose length n in bits is unknown and whose value is unknown.

Now let’s say two or more parts of the information are separate, that is, to be interpreted in different ways by the recipient, but the recipient does not know where one part ends and the next begins. If i is the position where the end of the first part of information lies, this stream of information S becomes:

S = AB, where A = S[1..i] and B = S[i+1..n]

Subsequences A and B are both bitstreams of unknown length. In fact, by Cantor, every sub-bitstream in S is a bitstream of unknown length. Here we start with investigating: if we want to mark two or more pieces of information in an unknown unbounded bitstream as separate, how do we do that?
We begin by exhaustively trying approaches, seeing what axes this uncovers, and seeing if we can determine a direction once those arise. Our first approach is a common one in “software engineering”: length prefixing, for encoding the position of the boundary between two pieces of information. If piece of information A is |A| bits long, then its prefix will require

ceil(log2(|A| + 1))

bits to encode its position. But if |A| is unknown, then this quantity is unknown. This leaves us with three unknown bitstreams of unknown length, which increases our unknowns rather than decreasing them, and so cannot be treated as a solution.
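As a quick sketch of the quantity above (the helper name is mine), the prefix width needed to encode a boundary position grows with the length it must describe, so it cannot be fixed without first bounding that length:

```python
from math import ceil, log2

def prefix_bits(length):
    """Bits needed in a fixed-width prefix to encode any boundary
    position from 0 up to `length` inclusive."""
    return ceil(log2(length + 1))

# The prefix width itself depends on the length it describes:
# prefix_bits(255) == 8, but prefix_bits(65535) == 16.
```

The circularity is visible here: to size the prefix you must already know |A|, which is exactly the unknown the prefix was meant to communicate.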
But let’s assert that the prefix length is known, and let’s assign 8 to it. If our recipient understands that it was to scan 8 bits per 2^8 − 1 bits (minus one because the nil value 00000000 would mean “no separation in the next understood amount of bits”) for a value to mark separation, and if there is none then to look again after the next 255-bit “chunk” (we will call this technique “chunking”), and once it finds one it can stop scanning and interpret the rest of the bits as literals, this works. And this can quickly be expanded to include “an unknown amount of separated pieces of information”. If a chunk is found to have a separation marker, the recipient will disregard all the bits of information after the separation marker until the start of the next chunk, and the next piece of information will start at the next chunk (this is because of the case where |sequence A| + |sequence B| < |chunk|). If the amount of pieces of information being marked as separated is known by the recipient, then the sender can skip chunk encoding for the last piece of information, saving up to 8 bits per 263 bits of data after the second-to-last piece of information. If the amount of pieces of information being marked as separated is unknown, then chunk encoding continues until the end of the message. This shared
understanding between sender and recipient, that the sender knows the
recipient will interpret the data a certain way, and the recipient knows
the sender will send it a certain way, is called protocol. The
two known pieces of information being communicated are the epistemics of
the cardinality of pieces of information to be marked as separated (is
the cardinality known or unknown), and how many bits are to be reserved
as chunk prefixes. The fact of dropping the chunk prefixes for the last
piece of information if the cardinality is known can easily be inferred
by any intelligent actor. This is actually an important point of
protocol design: if you could communicate information to another party,
what would they be expected to do with the data? What are the obvious
things they could be expected to do, such as the energetically efficient
thing of dropping unnecessary prefix encodings? This actually neatly
solves our issue here, which is one of the worst issues that has plagued
all of computer science for the entirety of its over half-century
history.
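The chunking scheme above can be sketched concretely. This is a minimal sketch under the stated assumptions (8-bit markers where value 0 means “no separation in this chunk” and a nonzero value k means “this piece ends after k payload bits”); the function names are mine, and bits are modeled as '0'/'1' strings for clarity:

```python
MARKER_BITS = 8
PAYLOAD_BITS = (1 << MARKER_BITS) - 1   # 255 payload bits per chunk

def encode(pieces):
    """Encode a list of non-empty bitstrings into one chunked bitstream."""
    out = []
    for piece in pieces:
        pos = 0
        while pos < len(piece):
            rest = len(piece) - pos
            if rest > PAYLOAD_BITS:
                # Full chunk; marker 0 means "no separation here".
                out.append(format(0, "08b"))
                out.append(piece[pos:pos + PAYLOAD_BITS])
                pos += PAYLOAD_BITS
            else:
                # Final chunk of this piece: marker says where it ends;
                # the rest of the chunk is padding, to be disregarded.
                out.append(format(rest, "08b"))
                out.append(piece[pos:] + "0" * (PAYLOAD_BITS - rest))
                pos = len(piece)
    return "".join(out)

def decode(stream):
    """Split a chunked bitstream back into the separated pieces."""
    pieces, current, i = [], [], 0
    while i + MARKER_BITS <= len(stream):
        marker = int(stream[i:i + MARKER_BITS], 2)
        payload = stream[i + MARKER_BITS:i + MARKER_BITS + PAYLOAD_BITS]
        i += MARKER_BITS + PAYLOAD_BITS
        if marker == 0:
            current.append(payload)
        else:
            current.append(payload[:marker])
            pieces.append("".join(current))
            current = []
    if current:
        pieces.append("".join(current))
    return pieces
```

Each 263-bit chunk carries 255 payload bits, so at most 8 of every 263 transmitted bits are marker overhead.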
So protocol has to be communicated between sender and recipient for this to work. Imagine the sender walking around with two pieces of information stuck to their forehead: a number (e.g. 8) and a cardinality epistemic status (e.g. “3 (known)” or “unknown”). Or imagine a radio signal coming from a star that has a piece of information attached to its sender: the light frequency it emits relative to other stars (high or low?). This brings our question to a further question: what is the minimum protocol that is required for a sender to send information and assume that the recipient will be able to do the expected, energetically efficient things with it?
Let’s go to the extreme and say “no protocol at all. The sender is
not able to convey any additional information at all to the recipient
besides the message itself”. If the receiver sees something that looks
like it has a pattern in it, that it is not random noise (e.g. 1, the value
00000000 mostly appearing every 255 bits; e.g. 2, all 0s
appearing after bit v + 8 every 263rd bit, where v
is the value of the first 8 bits), and they were to try to interpret it,
they might be able to see that it is for separation of data. But trying
to find patterns like that computationally over otherwise unknown data
would require a scan on the order of 2^J bits, where J is the candidate
prefix length. This goes up infinitely; J is unbounded. These scans do not have
to be for the chunking protocol as described above, there are probably
other minimal protocols, and probably a provable finite number of them
(for future work), but that is beyond the scope of this paper. We will
refer to these now-generalized unbounded scans as scanning for integer
N. So, to try to reasonably interpret otherwise noisy data when looking
for protocol-structured data, what is the minimum scanning number we can
look for? Let’s continue with our binary scheme and try 2.
Value one is for “separate”, value zero is for “not separate”. Every other symbol is an actual encoding, and every other symbol is “separate” or “not separate”: one separation symbol per informational symbol. Dividing the informational symbols by the total amount of symbols gives us 1/2, or 50% informational efficiency, which is my term for the ratio of informational symbols to total symbols. This is far worse than the >96% informational efficiency (255 of every 263 bits) we had previously, so a predictable protocol reader would not choose it.
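A sketch of this two-symbol scheme (the names and the exact flag convention are mine), pairing every informational bit with a flag bit, where flag 1 marks the start of a new piece:

```python
def encode_interleaved(pieces):
    """Before each data bit, emit a flag bit: 1 marks a separation."""
    out = []
    for k, piece in enumerate(pieces):
        for j, bit in enumerate(piece):
            flag = "1" if (j == 0 and k > 0) else "0"
            out.append(flag + bit)
    return "".join(out)

def decode_interleaved(stream):
    """Read (flag, bit) pairs; a set flag starts a new piece."""
    pieces, current = [], []
    for i in range(0, len(stream), 2):
        flag, bit = stream[i], stream[i + 1]
        if flag == "1" and current:
            pieces.append("".join(current))
            current = []
        current.append(bit)
    if current:
        pieces.append("".join(current))
    return pieces
```

Exactly half of every transmitted pair is flag overhead, which is the 50% informational efficiency computed above.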
If we were to try another approach to get the number to be 2, and we reserved one symbol out of two to be the separator, that leaves us one symbol to encode information. But you cannot encode information with one symbol, because its entropy is 0: it cannot collapse into one of multiple states when observed. And adding a time channel breaks our question of protocol and pure information representation. But if we kept two symbols to encode information and added a symbol, so we had one symbol to separate and two to encode, that gives us ternary. This is the minimum viable protocol; the number of scans to go over it is 3. This is also the minimum number of symbols needed to clearly separate information.
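The ternary scheme can be sketched directly, reserving a third symbol (here written “2”, a representational choice of mine) purely as the separator, with “0” and “1” left to carry information:

```python
SEP = "2"  # the reserved third symbol; "0" and "1" carry information

def encode_ternary(pieces):
    """Join binary-valued pieces with the reserved separator symbol."""
    return SEP.join(pieces)

def decode_ternary(stream):
    """Split on the separator; no positional scanning is needed."""
    return stream.split(SEP)
```

Because the separator can never collide with an informational symbol, the recipient needs no chunk geometry or shared counts: a single pass over the three possible symbols recovers the boundaries.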