Working on my cluster-backups I wanted an RPC-like mechanism for communication from client -> server, but I required decoupled operation of the clients and server. I wanted to "magically" serialize data of arbitrary length, simplify the logic (abstracting out checking for error conditions etc), but if possible maintain long-lived TCP connections between client and server.

Above all else, code using datagrams should be very natural, readable, and very easy to use.

A simple "echo" server:

from datagram import *

s = DatagramServer("localhost", 5005)

while True:
    with s.accept() as datagram:
        message = datagram.value().upper()

The client:

from datagram import *

buffer = "a" * 1000

with Datagram(buffer, server="localhost", port=5005) as datagram:
    if datagram.send(): # or send(server="localhost", port=5000)

The datagramwill evaluate True or False based on the most recent connection. if datagram is great. datagram.value() returns its deserialized contents (which is either what was most recently sent or received). len() and contains/in treats it like a list, and it can return iterables like for token in datagram or if token in datagram

I have a handful of smallish machines in my home; the 'always on' systems include an iMac (about 25% used of 1TB drive), a few rpis (about 25-50% used of 2-4TB storage, give/take), and a mac mini (50% used of 4TB RAIDZ2 + 2TB Time Machine). "Traditional" backups involve slurping all the important data to a (single) redundant location, and traditionally requires that the single location be as large as the sum of storage in the network. So backups here would require yet another machine and another 8T drive; or I would have to host that drive on an existing machine and deal with the potential irritation of restoring that machine if/when it fails.

This seems silly; in aggregate my network is less than half utilized, so it should be able to back itself up, right? BUT my mini has a few filesystems with over 1T used... so I would have to split those up and back them up to different machines around the network, keep track of all that, and pull it back if/when I have to restore. That's silly. And OBTW my net reliability approaches zero as I spread subsets-of-backups over more hosts: single host failure would almost certainly cause loss of some portion of the backup.

Quick recap of things I'd want:

  1. not-another-machine to store copies of my data
  2. not a single, massive FS on which to store them
  3. no SPOF (backup host)
  4. no "where did the files go" tracking

Cluster filesystems are available to pool storage around the network; a bunch of smallish filesystems can contribute their storage to look like one big filesystem. Unfortunately, as above I have at least 4 different OSs and I think 5-6 flavors. FORTUNATELY they're all UNIX-like (Windows was excised in 2004).

What I want is a cluster backup solution: NO extra machine to store data, NO massive FS, NO single point of failure, and NO juggling of "this can fit over there (as long as it doesn't grow). This should be a clever (not smart) and dynamic system. I would also love the ability to keep N copies of data spread across my M machines, and I DO NOT want to care which of those machines are up when it comes time to restore.

So yeah, I got bored and made one. It doesn't mangle files (keeping them easy to verify, easy to restore) but it also doesn't require I have any single storage location as large as any single dataset. It is dynamic, will live on free storage in the network, and requires minimal configuration. I can even specify a minimum number of replicas per "source", and the system will maintain at least those; if there are two replicas, I can survive *any* single-host failure.

Ex: say a source filesystem has 10 files, 0.txt ... 9.txt, of 100GB each (1TB total). I have 3 hosts on the network with 700GB available each (total: 2.1TB available). A traditional system won't work here. The dynamic system, however, will self-organize and may come up with a solution like: host A: {0-7}.txt; host B: {3-9}.txt; host C: {0,1,2,8,9,4,7}.txt. Any single failure still leaves a complete replica on the network, and every file exists (at least) twice.

Implementation details: it's a simple client/server architecture, many:many. The configuration specifies source filesystems, backup locations, space (or reservation) for backups, and some options like "TCP port" or "number of replicas." That's it. Adding a new source (or backup host) is a one-line config change. The configuration is also backed up, so the change gets made once and propagates to all participating machines.

The "Server" is relatively dumb, passive, and really acts like a locking mechanism. Clients will request files, and can ask status ("are any files underserved" / "are any files overserved"). Clients are more clever (but ignorant of one another): they will try to copy underserved files, greedily. If they are over-capacity, they'll try to drop overserved files. There is no client-client communication, and no server-initiated client communication.

The upside is a very simple dynamic system which stores N copies of data over M machines; the downside, there may be some over-use of the network while clients settle in to a stable convergence. If, for instance, one (large) file is served to 4 clients initially, 2 of them may choose to drop it later as other files need coverage. This results in the lost network traffic and storage IO for the two extra copies. In practice this seems pretty minimal, the system tends to converge quickly and with some almost-trivial locking in the server, "efficiency" approaches 100%.

Copies are performed with rsync (see above: all UNIX-like). It handles checksumming on copy, and can smartly transmit partial data for changed files. The clients and server will exchange checksums (server never volunteers -- it only confirms). For IO sanity, checksums can be sampled (and the clients+servers can rate-limit their own IO).

Metadata communication is over TCP, and could be encrypted with minimal effort. It's all client->server including the likely case where the client and server are on the same host.

It's a "teach myself nontrivial Python" project. I have mixed feelings about the language, tho more positive than when I started.

I am moving to a new hosting provider because the old one is raising rates and annoyed me. This was a good excuse to upgrade a lot of software too.

If anything got lost, This email address is being protected from spambots. You need JavaScript enabled to view it.. The old stuff is available at (new provider, old content) or (old provider)

Note: this was written for Joomla 1.x and is WAY stale; I use Joe's Word Cloud

A simple content plugin which generates a "keyword cloud" at the end of every article.

Keywords are selected from the article's metadata, comma-separated keywords. They are then sized 1-5 (definitions in css) based on the relative hit frequency in the keywords in all viewable articles.

Each keyword is linked out to a search page for that keyword.

This plugin has been validated against W3C XHTML/1.0 Transitional and CSS 2.1
Licensed under GPL2, and don't be a jackass. If you use it, This email address is being protected from spambots. You need JavaScript enabled to view it.

About this: it will first check if a number is "probably prime" using gmp_prob_prime(). It will then spend up to 1s testing sums of powers, printing any as appropriate.

GCF (greatest common factors) are calculated using my own, probably somewhat / not terribly inefficient algorithm.

Draw Names from a Virtual Hat

Names Draw results

The Algorithm

The "virtual hat" follows a specific series of steps to draw all names from the hat: Each person draws in turn, starting with the first person in the list.

On a person's turn:

  • he shakes the hat
  • he removes one name from the hat
  • If anyone draws his own name, or if a "loop" is detected, the draw is considered "unfair" and everyone draws again.


When a group of people draw names from a hat, sometimes the last person to get the hat will have to draw his own name; this is an unfair draw. When the "virtual hat" detects this condition, it will simply draw all names again (after reshuffling the hat).

Loops / cliques

A 'loop' occurs / clique exists when a subset of people draw each other's name. For instance:
  • Alice => Bob
  • Bob => Carol
  • Carol => Alice
  • Doug => Edward
  • Edward => Frances
  • Frances => Doug
The clique Alice, Bob, Carol does not include the rest of the participants.

If this condition is detected, the draw is considered invalid and is run again; the process repeats until a draw is valid. To avoid blowing up a browser, it will give up after 500 tries. Please LMK if you see that.

A loop is detected by traversing the list in order -- in the above example, Alice -> Bob -> Carol -> Alice. If we manage to find all the names before we find a repeat, then the draw is valid (no loops). This is a variation of the "tortoise and hare" algorithm for detecting loops in a linked list of unknown size.