Leonard Richardson's Blog, page 13

February 19, 2016

The Ephemeral Software Collection

A lot of stuff has been happening around the Minecraft Archive Project, and NYCB is no longer the best place to put all this information, so I've created a separate website for it: The Minecraft Archive Project. It incorporates most of the stuff I've told you over the past year-and-a-half, about why I'm doing this, what's in the captures and who has copies of the data, but there's also plenty of new stuff, which I'll summarize.

The big thing is that I've started a whole other collection, the Ephemeral Software Collection, which is now bigger than the MAP. My goal with the ESC is to archive software that's likely to be overlooked, forgotten, or destroyed by a takedown notice. Also stuff that I just think would be interesting to have around. The ESC contains the non-Minecraft stuff I got from CurseForge in the December capture, but it also contains a ton of Git repos that I cloned from GitHub.

I asked around about games that had active level creation/modding communities, searched the GitHub API for the names of those games, and cloned all the repos that showed up in the search results. Then I started branching out, running searches for classic games like checkers and Snake, as well as more general terms like 'surreal' and 'gender' and 'senior project'. This is how I got the data for That's Life!. IMO the most significant part of the ESC capture is 750 gigabytes of games created for game jams.
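
In sketch form, the search-and-clone step looks something like this (not my actual capture script; authentication, pagination past GitHub's search caps, and rate-limit handling are all omitted, and the function name is just illustrative):

import subprocess

import requests

def clone_search_results(term, dest="."):
    # Ask GitHub's repository search API for repos matching a term.
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": term, "per_page": 100},
    )
    resp.raise_for_status()
    # Clone each hit into the destination directory.
    for repo in resp.json()["items"]:
        subprocess.run(["git", "clone", "--quiet", repo["clone_url"]], cwd=dest)

clone_search_results("conway game of life")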

I stopped when I ran out of old hard drives to fill up. You can see the full list of ESC collections; there are about 100 of them.

Before creating this web page, when I heard about another source of Minecraft maps or other ephemeral software, I had two choices: 1) do a lot of work to incorporate it into the MAP, 2) do nothing, feel guilty, eventually forget about it, and suffer a nagging feeling that I'd forgotten something important. Now when I find out this sort of thing, I stick it in the "What I Didn't Capture" section and then forget about it guilt-free. It's a nice system.

I guess the only other piece of news is that I did another MAP capture in early February, to see if it was too much hassle to do a capture every month. Total haul: about 75 GB of images and binaries. It was a pretty big hassle, but that number implies I'd save about twice as much stuff by capturing every month as by waiting a year, so I'm torn.


February 16, 2016

#botUPDATE

Last week I fell ill and my cognitive capacity was limited to simple bot work. I created That's Life!, a bot which posts distinctive lines of code from Conway's Life implementations.

For reasons that will shortly become clear, I have cloned about 4000 Git repos that contain implementations of Conway's Life. (Well, I trust my reasons are already clear, but my overall strategy will shortly become clear.) That's a lot of code, but how to pick out the Life-specific code from generic loop processing, framework setup, etc?

Well, I have also cloned about 14,000 Git repos that contain Tic-Tac-Toe implementations. I used Pygments to tokenize all the code in both corpora. Any line of a Conway's Life implementation that contains a token not found in the Tic-Tac-Toe corpus is considered distinctive enough to go in the bot.
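
In sketch form (assuming, to keep things simple, that both corpora live in local directories and everything is Python; the real corpora are multi-language, and Pygments can guess a lexer per file):

import os

from pygments.lexers import get_lexer_by_name

lexer = get_lexer_by_name("python")

def corpus_tokens(directory):
    # Gather every token value Pygments finds in a directory of code.
    tokens = set()
    for root, _, files in os.walk(directory):
        for name in files:
            with open(os.path.join(root, name), encoding="utf-8",
                      errors="ignore") as f:
                code = f.read()
            tokens.update(value.strip() for _, value in lexer.get_tokens(code))
    return tokens

# Any token that shows up in Tic-Tac-Toe code is, by definition, generic.
generic = corpus_tokens("tictactoe-repos")

def distinctive_lines(code):
    # A line goes in the bot if it has a token Tic-Tac-Toe never uses.
    for line in code.splitlines():
        tokens = {value.strip() for _, value in lexer.get_tokens(line)}
        if tokens - generic - {""}:
            yield line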

Alas, my condition deteriorated, until I was no longer able to write code at all. So I turned towards fulfilling my final vision for The Lonely Dungeon: augmenting the text clips with spot art. This meant a lot of miserable grunt work: scrolling through about 30,000 candidate images, marking the ones that looked cool or weird. But I was already miserable, so I was able to get it all done.

The Lonely Dungeon is now complete! We've got line drawings executed with varying levels of skill, glorious oil paintings, tons of maps with mysterious labels, and old RPG advertisements from magazines. And now I feel better and I can go back to work. Great timing!


February 8, 2016

The Lonely Dungeon

Tactics: The huge monstrous crab is hungry and attacks all non-locathah who enter the water or attempt to travel on the ledges. It uses its 10-foot reach to attempt to grab victims on the ledge. If it can establish a grapple (note its +22 grapple modifier!), it deals automatic constriction damage each round. While partially submerged in the pool, the crab gains improved cover (+8 to AC and +4 to Reflex saves).

Dear diary, once again I have created the greatest bot ever. It's The Lonely Dungeon (Tumblr, Twitter), another in my tradition of "out-of-context selections from a very large corpus". In this case the corpus is all those RPG sourcebooks that came out in the late 20th century.

I found these books fascinating when I was a kid. They were full of secret information, obscure contingencies, bit characters with weird motivations, worldbuilding for made-up societies. Each paragraph was a little story about why this part of the game couldn't be handled by the normal rules.

Now the books have been replaced by newer editions, or just forgotten since nobody plays the games anymore. As forbidding as they seemed, all those crypts and forests and space stations were incomplete unless someone was going through them and uncovering their secrets.

One of my current interests is worlds that end not through some calamity, but because the inhabitants get bored and move out. Like Minecraft Signs, The Lonely Dungeon is a spotlight picking out features of abandoned worlds.

I've been working on this bot for over a year in spare moments. For the first time in Leonard bot history, The Lonely Dungeon's primary medium is Tumblr, so that I can give you the full OCRed text of the text box. It's better for accessibility, especially as those scans can be difficult to read. I had to learn a lot about PDFs and image processing, and I've scaled back this bot from my original plans, but those plans are still on the table in some form. More on this when it happens! In the meantime... keep adventuring.


January 22, 2016

The Minecraft (And Other Games) Archive Project

As suggested in the previous Minecraft Archive Project post, I have now completed a capture of the CurseForge family of sites. They host a ton of Minecraft stuff I hadn't downloaded before, including the popular Feed the Beast series of modpacks, lots of other modpacks, mods, and a ton of Bukkit plugins (not really sure what those are or how they differ from mods TBH).


CurseForge also has sites for Terraria and Kerbal Space Program, as well as many other games I haven't heard of or don't care about. I paid $30 for a premium membership and grabbed it all, downloading about 500 gigabytes of images and binaries. This doubles the size of the 201512 capture (though it probably introduces a lot of duplicates).

Here are the spoils, ordered by game:


Game                       What                        Capture size (GB)
Firefall                   Add-ons                     <1
Kerbal Space Program       Mods                        23
Kerbal Space Program       Shareables                  1.8
Minecraft                  Bukkit plugins              19
Minecraft                  Customization               <1
Minecraft                  Modpacks (Feed the Beast)   15
Minecraft                  Modpacks (Other)            87
Minecraft                  Mods                        33
Minecraft                  Resource Packs              80
Minecraft                  Worlds                      45
Rift                       Add-ons                     7.5
Runes of Magic             Add-ons                     1.8
Skyrim                     Mods                        6.4
Starcraft 2                Assets                      4.7
Starcraft 2                Maps                        46
Terraria                   Maps                        4.8
The Elder Scrolls Online   Add-ons                     <1
The Secret World           Mods                        <1
Wildstar                   Add-ons                     1.7
World of Tanks             Mods                        40
World of Tanks             Skins                       12
World of Warcraft          Addons                      47+


Here's the really cool part: CurseForge projects frequently link to Git repositories. I cloned every one I could find. I ended up with 5000 Minecraft/Bukkit repositories totalling 47 gigs, 103 Kerbal Space Program repositories totalling 6 gigs, and a couple hundred megabytes here and there for the other games. That's over 50 gigs of game-mod source code, which I predict will be a lot more useful to the future than a bunch of JAR files.

These numbers are gloriously huge, for two reasons. First, this is the first capture I've done of CurseForge, and possibly the only full capture I will ever do, so I got stuff dating back several years. Second, CurseForge keeps a full history of your uploaded files, not just the most recent version (which is typically what you'd find on Planet Minecraft or the Minecraft forum). Some of the World of Warcraft add-ons have hundreds of releases! I guess that's because they have to be re-released for every client update. And it doesn't take many releases for a 100MB Minecraft mod pack to start becoming huge.

Anyway, as always it's good to be done with a project like this, so I can work on other stuff, like all the short stories I owe people.


January 10, 2016

Minecraft Archive Project: The 201512 Capture

On December 27th I started the third capture for the Minecraft Archive Project. Previous captures ran in February 2015 and March 2014. This time I collected about 420 gigabytes of material.

Screenshot of the Thermal Pointe map.

Here's the breakdown of what I believe the new files to be:


Type                    Number of files   Collective size
Maps                    33112             320 GB
Maps (MCPE)             1552              2 GB
Resource packs          2137              30 GB
Resource packs (MCPE)   1761              72 MB
Mods                    6082              10 GB
Mods (MCPE)             1839              1 GB
Screenshots             33565             157 GB
Skins                   31064             132 MB
Server records          25923             361 MB
Blog posts              6562              129 MB


This time I think I was able to archive about 60-65% of the maps I saw, compared to 73% in the last capture. Even so, we ended up with 33k new maps in this capture versus 22k in the last one--and I didn't even get the adf.ly maps this time! (Nor will I--it's a huge pain and I'm sick of it.) 2012 was the single biggest year for custom Minecraft maps, and there was a downward trend visible in 2013 and 2014, but it looks like 2015 was really huge.

Screenshot from zero.min.org, a server that's been up since 2010

Couple new features in this capture: I started keeping track of blog posts and server records from Planet Minecraft. Server records are especially important because they usually feature screenshots, and in twenty years those screenshots will be the only record of what those servers looked like.

I've completely given up on the idea of archiving public servers--it's still theoretically possible but it's a full-time job for two developers, so I'd need to get a grant or some volunteer interest from the modding community. In fact, a few months ago the multiuser server I played Minecraft on went down, and I don't know whether my stuff is still around. That's life! Gonna archive the screenshots.

Screenshot for the Fairy Lights mod

The full dataset is now about 2.4 terabytes. I bought a new drive to store the archive and set it up with XFS, and it does seem to improve the performance when iterating over the file set.

As always I'm putting a copy of the data on a server at NYPL Labs, and I recently gave Jason Scott a drive that contained the first two captures, so he can do whatever Jason thing he wants with the data. I don't have any plans to make this archive public, or even to re-run the Minecraft Geologic Survey on the new data. My maximum supportable commitment is spending some time once a year to shepherd these scripts through saving a representative sample of this artform.

I'm going to leave everything else to the future when the archive becomes valuable to other people. I am doing exploratory work for adding a third site to the archive, but that's all I'll say about that for now.


January 7, 2016

The Crummy.com Review of Things 2015

Another year has gone, but what's the big deal? Let's remember the magical moments, like 12:12:12 on 12/12, or June 30th's leap second. Good timestamps, good timestamps. Here are the most worthwhile investments of my hard-earned 2015:

Books

I've been giving books short shrift by only mentioning a single Crummy.com Book of the Year, and in 2015 I started reading books on my commute (partly because I'm developing a tool that helps people read books on their commute), so I can afford to mention more than one. I have records of reading 25 books this year, and probably a couple more slipped through the cracks, but I've got a solid best-of slate.

The 2015 Crummy.com Book of the Year is Dragonfly: NASA And The Crisis Aboard Mir by Bryan Burrough. So much good stuff in that book. If you want to write fictional dingy spacecraft, you can't do better than looking at the dingy spacecraft we've actually built.

Runners-up:


Nightwood by Djuna Barnes (who needs her own NYCB post)
Jim Henson: The Biography by Brian Jay Jones
You Can't Win by Jack Black (not that Jack Black)
The Space Opera Renaissance, ed. David G. Hartwell and Kathryn Cramer (book needs its own NYCB post)
The Long Way to a Small, Angry Planet, by Becky Chambers


Honorable mention to Mallworld by Somtow Sucharitkul, a book that I didn't love, but I was blown away by its inventiveness. In 1982, Sucharitkul crammed Mallworld with all the jokes that would later be used in Futurama.

Film

Saw ninety-one features this year. As always, only films I saw for the first time are eligible for consideration, though that only eliminates three. Here are my must-see movies:


The Americanization Of Emily (1964)
Mad Max: Fury Road (2015)
The Brink's Job (1978)
Inside Out (2015)
Sullivan's Travels (1941)
Sunset Boulevard (1950)
The Breaking Point (1950)
The Man Who Shot Liberty Valance (1962)
Sweet Smell of Success (1957)
Fantastic Mr. Fox (2009)
The Parallax View (1974)
Nightmare Alley (1947)


And this year's bumper crop of "recommended" films:


The Best of Everything (1959)
Clueless (1995)
Wagon Master (1950)
The Crimson Kimono (1959)
The Godfather, Part II (1974)
Desperately Seeking Susan (1985)
Star Wars: The Force Awakens (2015)
Inside Man (2006)
The Grapes of Wrath (1940)
Kundo: Age of the Rampant (2014)
Ed Wood (1994)
How To Marry A Millionaire (1953)
Brainstorm (1983)
Invention For Destruction (1958)


Honorable mentions to the burglary in Rififi (1955) and the hotel tour in The Shining (1980). I don't want to sit through the whole movie again but those scenes were awesome.

Bots

Looking at the list of my follows I feel like I need to broaden my bot horizons because I love all of Allison's bots (except that damn Unicode Ebooks, which still has three more followers than Smooth Unicode) and I love bots that post images from image collections, and that doesn't seem like a very diverse set. Anyway, here are my faves of 2015:


Deep Question Bot and The Ephemerides by Allison Parrish
wikishoutouts by Jeff Sisson
Gutenberg's Delight by Hugo
The now-defunct men only by thricedotted


Games

Didn't play a lot of new video games this year because of the persistent problem with my computer shutting off if I dare to start up a game. I did replace the computer near the end of the year, so there will probably be more games in 2016. In the meantime, the Crummy.com Game of the Year is the super-atmospheric This War of Mine; its only flaw, which it shares with nearly all games, is that it's not roguelike enough.

A couple runners-up and honorable mentions:


80 Days
Mini Metro
Alphabear


I played board games pretty regularly but the only new game I remember is the much-loved "Codenames", which I also think is great.

I'd wanted to do an escape room this year, but put the idea on hold when Sumana wasn't interested. Near the end of the year, though, Pat Rafferty (who now works at an escape room in Portland) invited me to join his room-escaping team, and I leapt at the opportunity. As part of a crew of six, I helped to repair a drifting spacecraft. It was really immersive, finally allowing me to live the experience of crawling through a Jefferies tube.

My only complaint is the puzzles were free-to-play iOS game-level stuff. I understand why you have to do it that way, since none of us would be able to repair a spacecraft in real life, but it meant that a very immersive exploration experience was constantly interrupted by having to decode some Morse Code or solve cheesy riddles. Same reason I didn't like Myst. I did like the puzzles that made you combine objects.

Going Out



Sumana and I at Town Hall for PHC

Stereotypically this section would be called "Going Outside", but all the things I want to talk about happened indoors. In fact, two of them happened in the same building: the Town Hall Theater near Times Square. In fact, all of them, since I moved the escape room to the previous section.

Sumana and I both grew up listening to NPR, and we're both fans of the schticky comedy and down-home existentialism of A Prairie Home Companion (though less ardent fans than we were as teenagers). 2015 was the year I told Sumana (paraphrase) "You know, PHC does shows in New York, and as a project focused around a single individual who has been doing it since before we were born, it might not be around for much longer. We should see it live while we have the opportunity." Sumana was convinced by my airtight logic, and we caught the April 25th show. We had lousy seats but it was fun!

Town Hall selfie pre-PDQ

Then, near the end of the year, the PDQ Bach Golden Anniversary Concert Kickstarter was announced. As per previous paragraph, Sumana and I are also fans of Peter Schickele's ur-podcast Schickele Mix, so we went through a similar process, although I ended up going to the concert alone. This time I had a great seat! Beautiful music, lots of laughs, I'm really glad I went.

Food

As you can see from the associated pictures, I lost a lot of weight in 2015. I still have a little more planned, but I'm very close to the impossible-seeming target weight I set in July. I found the Atkins diet to be very effective. I don't think I have a lot of self-control, but I am very, very stubborn, and Atkins lets you substitute stubbornness for self-control.

Because of this I didn't exactly spend a lot of time in 2015 exploring New York's burgeoning restaurant scene, and the Food section will be correspondingly short. However, I want to give a special shout-out to the King of Falafel halal food truck in Astoria. See, most places, if you order a meal without the carby thing, they'll simply omit the carby thing, yielding about 60% of a meal. However, if you order a plate at King of Falafel and ask for no rice, they will fill up the empty space with more meat and salad, and you still get a full meal. Thanks, King of Falafel. Saved my sanity.

Also this sugar-free flourless chocolate cake recipe is good for managing your chocolate cravings. Honorable mention: xylitol.

My Accomplishments

People say that being on Atkins normalizes your energy level, getting rid of the highs and crashes. I've found this to be true but very inconvenient, since the highs are where I do all my creative work, and the crashes happen either at night (a.k.a. "getting sleepy") or at 2 PM, when I drink some tea and the problem's solved. Right now I feel like it's 1:30 PM all day. Anyway, if you don't count the amazing work I did going from Before to After, 2015 wasn't my most productive year, since I spent half the year in power-saving mode.

But I did finish Situation Normal, and handed it off to an agent, so the book is officially Not My Problem. I've started work on a new novel, Mine, my take on the classic Big Dumb Object In Space story.

I wrote three short stories: "We, the Unwilling" (a bonus story for Situation Normal); "Worm Hunt" (exploratory work for a novel I probably won't write); and "Only G51 Kids Will Remember These Five Moments", which I think I can sell if I ever get around to sending it out.

I gave three talks of note:


"The Enterprise Media Distribution Platform At The End Of This Book" (RESTFest), which goes over my hypermedia work at NYPL.
"What Have I Done?" (RESTFest), a five-minute talk about the "Richardson Maturity Model".
"Painting Bots, Carving Bots" (unofficial Bot Summit), a praxis-oriented taxonomy of bots.


I crafted a fabulous NaNoGenMo entry with a one-line shell script: Alphabetical Order.

Four bots came from my fingers in 2015:

The highly overengineered Ghostbusters Past.
The old-school phrasebot AMA Bot.
The self-descriptive A Dull Bot.
My fave of the year, Roller Derby Names! Endless fun and easy to write.


I also breathed new life into Smooth Unicode by implementing beautiful emoji mosaics.

Well-wishing

Finally I want to wish all of you readers the best in 2016, and to ask you to tell me what you liked in 2015, or what you're proud of accomplishing. I like other people's posts like this (Here's Allison's, here's Darius's), and I think taking a moment at the beginning of the new year to look back is satisfying in a way that can't be matched by the corporate "best of the year" lists that dominate the end of the old year.


November 27, 2015

Roy's Postcards Return[s]!

Back in 2009 I started a project to transcribe and put online over 1000 postcards my dad bought in the 1980s. The toolchain that took things from postcards to web pages was always kind of rickety, and the project petered out altogether when my sisters sent me about 500 more postcards that Dad sent them. I decided I wouldn't start it up again until I'd transcribed all 1500 postcards and could put everything up at once.

Now it's done! The best way to experience it is through the daily @RoyPostcards bot. This is a labor of love for me, so I'm not as concerned that people follow along, but I tried to add interesting commentary whenever I could, and it's a fascinating glimpse into everyday life in the 80s.


October 20, 2015

Bot Techniques: The Wandering Monster Table

In preparation for the talk I'm giving Friday at Allison's unofficial Bot Summit, I'm writing little essays explaining some of the techniques I've used in bots. Today: the Wandering Monster Table!

In D&D, the Wandering Monster Table is a big situation-specific table that makes it possible for you, the Dungeon Master, to derail your carefully planned campaign on a random mishap. You roll the dice and a monster just kind of shows up and has to be dealt with. There are different tables for different scenarios and different biomes, but they're generally based on this probability distribution (from AD&D 1st Edition):


65% of the time you will get a Common monster, like a really big rat.
20% of the time you will get an Uncommon monster, like a hobgoblin.
11% of the time you will get a Rare monster, like a neo-otyugh.
4% of the time you will get a Very Rare monster, like Ygorl, Lord of Entropy.


Note that this doesn't mean you're going to run into Ygorl (Lord of Entropy) once every twenty-five adventures. There are a ton of Very Rare monsters, and Ygorl is just one chaos lord. He can't be everywhere. What this means is that most of the time the PCs are going to experience normal, boring wandering monsters. The values of die rolls fall under a roughly normal distribution, and 68% (~65%) of die rolls will fall within one standard deviation of the mean. Those are your common monsters.

Go out two standard deviations (95%) and things might get a little hairy for the PCs. Go out three standard deviations (99.7%) and you're looking at something really weird that even the Dungeon Master didn't really plan for. But what, exactly? That depends on the situation, and it may require another dice roll.

The WMT is a really good abstraction for creating variety. I use it in my bots all the time. Here's a sample of the WMT for Serial Entrepreneur:


common = ["%(product)s",
"%(product)s!",
"%(product)s...\n%(variant)s...",
"%(product)s? %(variant)s?",
...
]

uncommon = [
"%(product)s... %(variant)s...? Just throwing some ideas around.",
"%(product)s... or maybe %(variant)s...",
"%(product)s or %(variant)s?",
"Eureka! %(product)s!",
...
]

rare = [
"I don't think I'll ever be happy with my %(product)s...",
"Got a meeting with some VCs to pitch my %(product)s!",
"I'm afraid that my new %(product)s is cannibalizing sales of my %(variant)s.",
"The %(product)s flopped in my %(state)s test market... back to the draw
ing board.",
...
]

very_rare = [
"Am I to be remembered as the inventor of the %(product)s?",
"Sometimes I think about Edison's famous %(product)s and I wonder... can my %(product2)s compare?",
"I haven't sold a single %(product)s...",
"I hear %(billionaire)s is working on %(a_product)s...",
...
]



This creates a personality that most of the time just mutters project ideas to itself, but sometimes (uncommonly) gets a little more verbose, or (rarely) talks about where it is in the product development process, or (very rarely) compares itself to other inventors. The 'common' bucket contains nine entries which are slight variants; the 'rare' bucket contains 32 entries which are worded very differently.
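
Rolling on the table is just a weighted choice of bucket followed by a uniform choice within the bucket. A sketch, using the AD&D percentages from above and assuming the buckets just shown (with their "..." placeholders removed); olipy's actual implementation may differ:

import random

def roll(common, uncommon, rare, very_rare):
    # Pick a bucket using the 65/20/11/4 split, then an entry within it.
    r = random.random()
    if r < 0.65:
        bucket = common
    elif r < 0.85:
        bucket = uncommon
    elif r < 0.96:
        bucket = rare
    else:
        bucket = very_rare
    return random.choice(bucket)

template = roll(common, uncommon, rare, very_rare)
print(template % dict(product="self-stirring mug", variant="self-stirring stein",
                      product2="self-stirring mug 2.0", a_product="a self-stirring mug",
                      state="Ohio", billionaire="Warren Buffett"))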

The WMT works the same way in Smooth Unicode and Euphemism Bot. All these bots have their standbys: common constructs they return to over and over. Then they have three more tiers of constructs where the result is aesthetically riskier, or the joke is less likely to land, or a little of that construct goes a long way.

I also use the WMT in A Dull Bot for a more subtle purpose. Each tweet contains a random number of typos, and each typo is chosen from a WMT. One of the common typos is to transpose two letters. A very rare typo is to uppercase one word while leaving the rest of the sentence alone.
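
The same structure works when the table entries are functions instead of strings. A sketch of the typo table (the weights and implementations here are my stand-ins, not A Dull Bot's actual code):

import random

def transpose(text):
    # Common typo: swap two adjacent characters.
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def shout_one_word(text):
    # Very rare typo: uppercase one word, leaving the rest alone.
    words = text.split()
    if not words:
        return text
    i = random.randrange(len(words))
    words[i] = words[i].upper()
    return " ".join(words)

# (probability, typo) pairs standing in for the common and very rare tiers.
typo_table = [(0.96, transpose), (0.04, shout_one_word)]

def apply_typo(text):
    r = random.random()
    for probability, typo in typo_table:
        if r < probability:
            return typo(text)
        r -= probability
    return text

print(apply_typo("all work and no play makes jack a dull boy"))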

The WMT fixes one of the common aesthetic problems with bots, where every output is randomly generated but it gets dull quickly because the presentation is always the same. Since you can always dump more stuff into a WMT, it's an easy way to keep your bot's output fresh. In particular, whenever I get an idea like emoji mosaics, I can add it to Smooth Unicode's WMT instead of creating a whole new bot.

There's a Python implementation of a Wandering Monster Table in olipy.


October 16, 2015

Auditioning: Sampling a Dataset to Maximize Diversity

My latest bot is Roller Derby Names, which takes its data from a list of about 40,000 distinct names chosen by roller derby participants. 40,000 is a lot of names, and although a randomly selected name is likely to be hilarious, if you look at a bunch of them they can get kind of repetitive. My challenge was to cut it down to a maximally distinctive subset of names. I used a simple technique I call 'auditioning' (couldn't find a preexisting name for it) which I first used with Minecraft Signs:


Shuffle the list.
Create a counter of words seen.
For each string in the list:
    Split the string into words.
    Assume the string is not distinctive.
    For each word in the string:
        If this word has been seen fewer than n times, the string is distinctive.
        Increment the counter for this word.
    If the string is distinctive, output it.
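
Here's the audition as a Python sketch (not my exact code; lowercasing and whitespace-splitting are simplifying assumptions about what counts as a "word"):

import random
from collections import Counter

def audition(strings, n):
    # Return the strings that pass an n-audition, in shuffled order.
    strings = list(strings)
    random.shuffle(strings)
    seen = Counter()
    survivors = []
    for s in strings:
        distinctive = False
        for word in s.lower().split():
            if seen[word] < n:
                distinctive = True
            seen[word] += 1
        if distinctive:
            survivors.append(s)
    return survivors

# For Roller Derby Names: corpus = audition(names, n=1)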



My mental idea of this process is that each string is auditioning before the talent agent from the classic Chuck Jones cartoon One Froggy Evening. One word at a time, the string tries to impress the talent agent, but the agent has seen it all before. In fact, the agent has seen it all n times before! But then comes that magical word that the agent has seen only n-1 times. Huzzah! The string passes its audition. But the next string is going to have a tougher time, because with each audition the agent becomes ever more jaded.

You don't have to worry about stopwords because the string only needs one rare word to pass its audition. By varying n you can get a smaller or larger output set. For Minecraft Signs I set n=5, which gave a wide variety of signs while eliminating the ones that say "White Wool". For Roller Derby Names I decided on n=1.

Here's the size of the Roller Derby Names dataset, n-auditioned for varying values of n:


n                   Dataset size
∞ (original data)   40198
100                 40191
50                  40089
10                  37860
6                   36104
5                   35307
4                   34203
3                   32751
2                   30387
1                   25710


Auditioning the Roller Derby Names with n=50 excludes only the most generic-sounding names: "Crash Baby", "Bad Lady", "Queen Bitch", etc. Setting n=1 restricts the dataset to the most distinctive names, like "Battlestar Kick Asstica" and "Collideascope". But it still includes over half the dataset. There's not really a lot of difference between n=10 and n=4; it's just a question of how many names you want in the corpus.

I want to note that this is not a technique for picking out the 'good' items. It's a technique for maximizing diversity or distinctiveness. You can say that a name excluded by a lower value of n is more distinctive, but for a given value of n it can be totally random whether or not a name makes the cut. "Angry Beaver" made it into the final corpus and "Captain Beaver" didn't. As "beaver" jokes go, I'd say they're about the same quality. When the algorithm encountered "Captain Beaver", it had already seen "captain" and "beaver". If the list had been shuffled differently, the string "Captain Beaver" would have nailed its audition and "Angry Beaver" would be a has-been. That's show biz. This technique also magnifies the frequency of misspellings, as anyone who follows Minecraft Signs knows.

Also note that "Dirty Mary" is excluded by n=50. It's not the greatest name but it is a legitimate pun, so in terms of quality it should have made the corpus, but "Dirty" and "Mary" are both very common name components, so it didn't pass.

PS: Boat Names (Roller Derby Names's sister bot) does not use this technique. There's no requirement that a boat name be unique, and TBH most boat-namers aren't terribly creative. Picking boat names that have only been used once (and are not names for human beings) cuts the dataset down plenty.


