Stratus is working really well. Phase II: use it for IPC (pub/sub store-n-forward)

Findings:

  • HTTPS continues to be nontrivial
    • need to push on this a little bit harder just to be sure
    • no hard requirement tho, so maybe screw it?
  • client-side caching is super effective; a ~1 kB cache stores 10 of everything (incl Strings), at a ~10% performance cost vs. uncached access to an int32
  • Shared per-app secret is OK. Functionally similar to an API key / API secret pair. Write requests are signed with the shared secret; the API key is essentially the per-device GUID. Requires no per-device configuration in firmware.
  • Not attempted: a "configurator" program which would include all GUIDs + per-device private keys and burn each key into EEPROM. Not clear whether this is required or even meaningful...
  • Timing doesn't (cannot) matter. Specifically, race conditions would make fine-grained sequence-of-publication difficult to enforce. Therefore it probably doesn't (shouldn't) matter in which order subscriptions are replayed

Concerns:

  • writes might be pretty slow, should look into caching / batching
    • publish() queue, flush with update()
  • no real threading available for asynchronous updates
  • pub/sub pattern is attractive, but (because there's no threading) the sub runs in the foreground during update(). pub might be queued or synchronous.

Core functionality / API:

  • subscribe(stream, callback, scope=PRIVATE, limit=none, reset=false)
    • for each event: callback(stream, data)
    • if limit is specified, will re-send stored events, up to limit of them (oldest -> newest)
    • should be able to reset its access counter
    • should also be able to unsubscribe
      • a client-side activity; server doesn't care
  • publish(stream, data, scope=PRIVATE, ttl=SHORTISH, queue=false)
    • publishes String data to stream
      • this may trigger a subscribed callback, synchronously
    • if specified, can queue data for the next update()
  • update()
    • updates the configuration (accessed via get())
    • sends any queued data
    • downloads any subscription data
      • may trigger callbacks, synchronously
  • maybeUpdate(interval=get("REFRESH INTERVAL"))
    • helper function, trivial
    • if interval seconds have passed, update()
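
To make the intended flow concrete, here's a rough sketch of one publish/subscribe round trip. The library itself is Arduino-flavored C++; this is just Python-ish pseudocode for the semantics, and the stream name, values, and handler are made up:

def on_reading(stream, data):
    # runs synchronously -- from inside update(), or from publish() if not queued
    print(stream, data)

stratus.subscribe("temperature", on_reading, scope="PRIVATE", limit=10)       # replay up to 10 stored events
stratus.publish("temperature", "72.5", scope="PRIVATE", ttl=300, queue=True)  # held for the next update()

stratus.update()  # flushes the queued publish, downloads subscription data,
                  # and calls on_reading() once per new (or replayed) event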

Scope is just a channel, identified with a String. For simplicity any Stratus node can subscribe to any channel, but only one per subscribe() call (assuming the ACL permits it).

Publishing to a PRIVATE (unreadable) scope with a very long TTL is just a key/value store.

HTTP/1.1 & keepAlive (persistent TCP) would improve performance measurably. Does that matter?

  • annoying to implement (TCP client & socket server)
  • Do we care if the MCU pauses for a ~ second during an update?
  • Probably decide when HTTPClient stops working out (e.g. very long timeouts or something)

IOT MQ table would require:

  • GUID of the writer (ownership)
  • Scope (PRIVATE or other keyword)
  • key & value (both String, which can be interpreted later to whatever)
  • TTL
  • timestamp
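
As a sanity check on those fields, here's a minimal sketch of one queue row plus its expiry check (names and types are illustrative, not a real schema):

import time
from dataclasses import dataclass

@dataclass
class MQRow:
    writer_guid: str   # GUID of the writer (ownership)
    scope: str         # "PRIVATE" or another keyword / channel
    key: str           # key & value are both Strings;
    value: str         #   interpret them later as whatever
    ttl: int           # seconds to live
    timestamp: float   # epoch seconds at publish time

    def expired(self, now=None):
        return ((now or time.time()) - self.timestamp) > self.ttl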

Access control? A Stratus MQ has a name (URL) and a secret; access would also require a GUID. Do we need to further grant access by GUID (and therefore maintain a table of queue, GUID, secret, approve/deny)? Yes: this allows for publish-only things (data sources) and subscribe-only things (consumers). Implementation is (was) straightforward.

Implementation details:

  • publish: action=publish, fields: key, value, ttl, scope, IOTMQ, GUID, signature
  • subscribe: action=subscribe, fields: key, scope, IOTMQ, GUID, signature
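
The signing scheme isn't spelled out here; one plausible (assumed, not confirmed) approach is an HMAC over the request fields using the shared per-app secret. Field ordering, hash choice, and the function below are all assumptions:

import hmac, hashlib
from urllib.parse import urlencode

def signed_publish(guid, secret, iotmq, key, value, ttl, scope="PRIVATE"):
    fields = {"action": "publish", "IOTMQ": iotmq, "GUID": guid,
              "key": key, "value": value, "ttl": str(ttl), "scope": scope}
    payload = "&".join(f"{k}={fields[k]}" for k in sorted(fields))  # canonical field order
    fields["signature"] = hmac.new(secret.encode(), payload.encode(),
                                   hashlib.sha256).hexdigest()
    return urlencode(fields)  # request body / query string

# the server recomputes the HMAC over the same fields and compares before accepting the write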

Access records (approve & reject) can be stored in the queue (private scope to the server), with a longish TTL. Rejects would be consumed for logging purposes or for an admin console ("approve this access"). Approved access can be consumed to track subscription (in which case the TTL can be pretty short).

Do we need a TTL based on message depth ("store N messages for at most M seconds")? No current use case.

Do we need default TTLs for queues (either #messages, or time-per-message)? #messages over time? No use case.

After careful consideration I've decided that stratus are shitty clouds. They're grey and amorphous, can be low-flying, and are aesthetically unpleasing. In a not-unrelated vein I am pondering a thing that might serve as an IOT-friendly key:value datastore.

High-level requirements / goals:

  1. Easy to add to existing code. Minimal boilerplate, minimal RAM/CPU overhead.
  2. reasonably secure; an endpoint (IOT device) should be reasonably confident that it's getting unmolested data
    1. note that this shouldn't protect against MITM or anything sophisticated
    2. note also that you shouldn't use this for anything important like heart monitors
  3. dead-ass simple. IOT devices suck and have limited CPU/RAM/connectivity resources
    1. unfortunately this pretty much eliminates active crypto
    2. hard-coded & configurable API keys would be just fine tho
    3. shared-secret check bits might work ok (actually they'd be great with a decent hashing algo)
  4. massively tolerant of failure
  5. self-configurable / zero bootstrap

So here's a starting place, read-only:

  • some ideas are stupid, but I don't want to forget that they're stupid
  • hosted via simple static/flat text file with key: value\n pairs
    • build static config files offline, signed
  • simple strnpos (or similar) to identify key and end-of-key "\n", substr() to extract it
  • "API key" as an n-bit hex value
    • this is nontrivial. IOT code is distributed en masse, per-device configs are infeasible
    • each host can have a GUID (MAC address or derived from)
  • host the static text file in a web space under that key, making it hard to find
    • /df/0xDEADBEEF.txt
  • Use a key like "where my config should live". If you need to move it for any reason, just update that key and the thing would next pull from the new location. This key should mostly be self-referential tho (duh)
  • Use a key for the API key too, in case you have to migrate that for any reason
  • an IOT device should also check for its own GUID as a key, possibly indicating a new config location
    • to let one retroactively split off some clients, or per-client config specialization
  • Other self-referential config things should include update frequency and debug / logging levels

If done correctly one could debug production device(s), ship them, disable debugging, and later re-enable it -- without requiring a code push / OTA.
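
Putting that together, a config file under this scheme might look something like the following. Only the key: value format is prescribed above; these key names and values are purely illustrative:

config url: http://austindavid.com/df/0xDEADBEEF.txt
api key: 0xDEADBEEF
refresh interval: 300
debug level: 1
variable: 42
0xC0FFEE42: http://austindavid.com/df/0xC0FFEE42.txt

Client code to consume it stays tiny: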

#include "stratus.h"
Stratus stratus("http://austindavid.com:80/df/df.txt", "secret key");
...
void setup() {
  // networking setup happens here (WiFi etc.), then pull the initial config
  stratus.update();
}

void loop() {
  static int variable = 1;
  EVERY_N_SECONDS(stratus.get("refresh interval", 300)) {
    stratus.update();
    variable = stratus.get("variable", variable);
  }
}

Doing a little better (potential TODO):

  • implement a pub-sub; URL -> MQ, GUID + secret for authorization & authentication
  • for more arbitrary-sized objects (BLOBs), HTTP GET a single key / retrieve a single (biggish) value?
  • use the datastore -- HTTP POST a key + value pub
  • have clients pump a check-in value to indicate when they last successfully read a sub

I hope the ads aren't too annoying. I enabled what I HOPE will be unobtrusive ads across all platforms, placed in the non-content areas of the pages. If they get out of hand, please let me know. Your ad blocker should entirely hide them if needed.

I choose to run the ads to finance the site. My net hosting costs are on the order of $200/yr, and ad traffic roughly covers that -- with most volume to my virtual hat around Christmas time. Unfortunately the volume has tailed off in the last year so I'm exploring options for improving things -- higher-quality placement (and therefore higher revenue per ad) and more prominent placement, beyond the virtual hat.

One of my boys is homeschooled, and even before that he did well with some warm-up / practice worksheets at home. I made randomized worksheets in Google Sheets that let us print the same types of problems with different values, because he could memorize the fixed ones (at least the earlier, simpler ones). These are targeted at a 7th grade program but likely apply to 6th and 8th grades as well.

I'm writing this because I spent too much time puzzling over it until I found a few gems. I probably won't use this IRL because it's clunky (and I think I have a way I prefer), but here's ACTUALLY how to write a JSON-serializable class, and why.

The official JSONEncoder docs say: "To use a custom JSONEncoder subclass (e.g. one that overrides the default() method to serialize additional types), specify it with the cls kwarg; otherwise JSONEncoder is used." Less-obvious (at least to me): JSONEncoder can *only* encode native Python types (list, dict, int, string, tuple ... that sort of thing); if you want to encode any other class, you *must* specify cls= in the dump/load methods, and you must provide a JSONEncoder subclass.

Specifically: json.dumps(data, cls=CustomClass), where CustomClass is a subclass of JSONEncoder with a default(self, o) method:

from json import JSONEncoder

class SpecialDataClass(JSONEncoder):
    # the instance carries .data, and the class doubles as its own encoder
    def default(self, obj):
        if isinstance(obj, self.__class__):
            return {'data': obj.data}  # or whatever serializable native type
        # anything else: defer to JSONEncoder, which raises TypeError
        return super().default(obj)

Serialize with: json.dumps(specialData, cls=SpecialDataClass)

This "works," but it broke down (for me) when I wanted to have a class stored within a dict, and I wanted to serialize that dict. Like, json.dumps(dict) fails if dict contains a special class. The class doing the serialization has to know what it's serializing (so it can specify that keyword), but I don't want that level of binding in my code. Generally this will fail if you have >1 special, unrelated classes to deserialize. Yes you can write a custom deserializing class which handles all of them (sort of a Moderator), but c'mon.

What I did instead (and what will probably change): I provided my to-be-serialized class with "serialize" and "deserialize" methods; they return or take simple JSON-friendly data. The enclosing class (PersistentDict) now accepts a "cls=" argument. New values in that dict would be instantiated in this class, and on read/write all values are pre-serialized or post-deserialized for sending down to json's dump/load methods.
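
A minimal sketch of that pattern: PersistentDict is the real class name, but everything else here (the Special class, method names, and the omitted persist-to-disk behavior) is illustrative:

from json import dumps, loads

class Special:
    # to-be-serialized class: knows how to flatten and rebuild itself
    def __init__(self, data=None):
        self.data = data

    def serialize(self):
        return {"data": self.data}   # simple, JSON-friendly

    @classmethod
    def deserialize(cls, obj):
        return cls(obj["data"])

class PersistentDict(dict):
    # sketch only: the real class also persists to disk; this just shows the cls= hook
    def __init__(self, cls=None):
        super().__init__()
        self._cls = cls

    def dump(self):
        flat = {k: (v.serialize() if self._cls else v) for k, v in self.items()}
        return dumps(flat)

    def load(self, blob):
        for k, v in loads(blob).items():
            self[k] = self._cls.deserialize(v) if self._cls else v

pd = PersistentDict(cls=Special)
pd["x"] = Special(42)
blob = pd.dump()                        # '{"x": {"data": 42}}'
restored = PersistentDict(cls=Special)
restored.load(blob)                     # restored["x"].data == 42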

Working on my cluster-backups I wanted an RPC-like mechanism for communication from client -> server, but I required decoupled operation of the clients and server. I wanted to "magically" serialize data of arbitrary length, simplify the logic (abstracting out checking for error conditions etc), but if possible maintain long-lived TCP connections between client and server.

Above all else, code using datagrams should be very natural, readable, and very easy to use.

A simple "echo" server:

from datagram import *

s = DatagramServer("localhost", 5005)

while True:
    with s.accept() as datagram:
        message = datagram.value()
        datagram.send(message.upper())

The client:

from datagram import *

buffer = "a" * 1000

with Datagram(buffer, server="localhost", port=5005) as datagram:
    if datagram.send(): # or send(server="localhost", port=5005)
        print(datagram.receive())

The datagram will evaluate True or False based on the most recent connection, so "if datagram:" reads naturally. datagram.value() returns its deserialized contents (whatever was most recently sent or received). len() and contains/in treat it like a list, and it is iterable: "for token in datagram" and "if token in datagram" both work.
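
Pulling those behaviors together in one place (based only on the description above; the payload here is arbitrary):

with Datagram([1, 2, 3], server="localhost", port=5005) as datagram:
    if datagram.send() and datagram:   # truthiness reflects the most recent connection
        print(datagram.value())        # deserialized contents (last sent or received)
        print(len(datagram))           # list-like length
        for token in datagram:         # iterable
            print(token)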

I have a handful of smallish machines in my home; the 'always on' systems include an iMac (about 25% used of a 1TB drive), a few rpis (about 25-50% used of 2-4TB storage, give or take), and a mac mini (50% used of 4TB RAIDZ2 + 2TB Time Machine). "Traditional" backups involve slurping all the important data to a (single) redundant location, which traditionally must be as large as the sum of the data being backed up. So backups here would require yet another machine and another 8TB drive; or I would have to host that drive on an existing machine and deal with the potential irritation of restoring that machine if/when it fails.

This seems silly; in aggregate my network is less than half utilized, so it should be able to back itself up, right? BUT my mini has a few filesystems with over 1T used... so I would have to split those up and back them up to different machines around the network, keep track of all that, and pull it back if/when I have to restore. That's silly. And OBTW my net reliability approaches zero as I spread subsets-of-backups over more hosts: single host failure would almost certainly cause loss of some portion of the backup.

Quick recap of things I'd want:

  1. not-another-machine to store copies of my data
  2. not a single, massive FS on which to store them
  3. no SPOF (backup host)
  4. no "where did the files go" tracking

Cluster filesystems are available to pool storage around the network; a bunch of smallish filesystems can contribute their storage to look like one big filesystem. Unfortunately, as above I have at least 4 different OSs and I think 5-6 flavors. FORTUNATELY they're all UNIX-like (Windows was excised in 2004).

What I want is a cluster backup solution: NO extra machine to store data, NO massive FS, NO single point of failure, and NO juggling of "this can fit over there (as long as it doesn't grow)". This should be a clever (not smart) and dynamic system. I would also love the ability to keep N copies of data spread across my M machines, and I DO NOT want to care which of those machines are up when it comes time to restore.

So yeah, I got bored and made one. It doesn't mangle files (keeping them easy to verify, easy to restore) but it also doesn't require I have any single storage location as large as any single dataset. It is dynamic, will live on free storage in the network, and requires minimal configuration. I can even specify a minimum number of replicas per "source", and the system will maintain at least those; if there are two replicas, I can survive *any* single-host failure.

Ex: say a source filesystem has 10 files, 0.txt ... 9.txt, of 100GB each (1TB total), with two replicas required. I have 3 hosts on the network with 700GB available each (total: 2.1TB available). A traditional system won't work here. The dynamic system, however, will self-organize and may come up with a solution like: host A: {0-6}.txt; host B: {3-9}.txt; host C: {0,1,2,7,8,9}.txt. Every file exists (at least) twice, so any single-host failure still leaves a complete copy of everything on the network.

Implementation details: it's a simple client/server architecture, many:many. The configuration specifies source filesystems, backup locations, space (or reservation) for backups, and some options like "TCP port" or "number of replicas." That's it. Adding a new source (or backup host) is a one-line config change. The configuration is also backed up, so the change gets made once and propagates to all participating machines.

The "Server" is relatively dumb, passive, and really acts like a locking mechanism. Clients will request files, and can ask status ("are any files underserved" / "are any files overserved"). Clients are more clever (but ignorant of one another): they will try to copy underserved files, greedily. If they are over-capacity, they'll try to drop overserved files. There is no client-client communication, and no server-initiated client communication.

The upside is a very simple dynamic system which stores N copies of data over M machines; the downside is some extra network use while clients settle into a stable arrangement. If, for instance, one (large) file is served to 4 clients initially, 2 of them may drop it later as other files need coverage, wasting the network traffic and storage IO spent on the two extra copies. In practice this seems pretty minimal: the system tends to converge quickly, and with some almost-trivial locking in the server, "efficiency" approaches 100%.

Copies are performed with rsync (see above: all UNIX-like). It handles checksumming on copy, and can smartly transmit partial data for changed files. The clients and server will exchange checksums (server never volunteers -- it only confirms). For IO sanity, checksums can be sampled (and the clients+servers can rate-limit their own IO).
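
For example, an individual copy might boil down to something like this; the flags shown are standard rsync options, but the actual invocation, checksum sampling, and rate limits live in the tool's own logic and config:

import subprocess

def copy_file(src, dest, bwlimit_kbps=None):
    cmd = ["rsync", "-a", "--partial"]           # archive mode; keep partial transfers on interruption
    if bwlimit_kbps:
        cmd.append(f"--bwlimit={bwlimit_kbps}")  # rate-limit network / IO
    subprocess.run(cmd + [src, dest], check=True)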

Metadata communication is over TCP, and could be encrypted with minimal effort. It's all client->server including the likely case where the client and server are on the same host.

It's a "teach myself nontrivial Python" project. I have mixed feelings about the language, tho more positive than when I started.