Secretly Public Domain: Update

My "Secretly Public Domain" project got a lot of attention, which is great, but it also gave me a lot more work to do and pointed to some things that hadn't been explained very well. I've done that work, and here's an update:

Topline number is 73%

My original estimate was that 80% of pre-1963 books were not renewed. This was based on a couple of inaccurate assumptions, the big one being that I was counting works originally published in a foreign country. Those works might have lapsed into the public domain at some point, but the US copyright has since been restored by treaty. So their renewal status isn't really relevant.

Of the books where renewal status is relevant, here are the most recent statistics:


73% have no renewal record at all.
19% have a renewal record that's an excellent match.
8% are in a grey area. They have one or more renewal records, but none of them are an excellent match. One of them might be legit, or they might all be renewals for totally different books. They need to be checked manually.


Credits

The "Secretly Public Domain" bot was a publicity stunt to draw attention to the machine-readable registration records. It worked great, but it also drew attention to me, the person doing the publicity stunt, even though I had basically nothing to do with the original work. For the record, here are the people who actually did the work. The project inside NYPL was run by Sean Redmond, Greg Cram, and Josh Hadro (now of IIIF). The work of making the copyright records machine-readable was done by Data Conversion Laboratory.

Buried treasure

Most of the books whose copyright wasn't renewed are really obscure titles, but without looking very hard I found a very well-known science fiction novel that has no renewal record. I'm not going to mention the name; consider it an incentive to look at the data yourself. It's probably not the only well-known work whose copyright wasn't renewed.

How to make your own list

My original estimate of 80% was based on the quick and dirty script I used to write the Mastodon bot. To fix the "foreign works" problem and to produce a dataset that would stand up to scrutiny, I published a Python library specifically for handling this data. It's got business logic for making determinations like "was this book published in a foreign country" and "how well does this renewal record match this registration record". You run the scripts and at the end you have a bunch of JSON files with consolidated data. If you think there are bad assumptions, you can change the business logic and run the scripts again.
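To give a concrete idea of what you can do with the output, here's a minimal sketch in Python. The filename and field names are my guesses, not the library's actual schema (look at the JSON the scripts produce and adjust to match), but the basic move is the same: skip foreign publications, then bucket each remaining registration by how good its best renewal match is.

import json

# Hypothetical filename and field names: inspect the JSON the scripts
# actually produce and adjust to match.
counts = {"not renewed": 0, "renewed": 0, "grey area": 0}

with open("registrations_consolidated.ndjson") as f:
    for line in f:                        # assuming one JSON object per line
        book = json.loads(line)
        if book.get("foreign"):           # renewal status irrelevant for foreign works
            continue
        renewals = book.get("renewals", [])
        if not renewals:
            counts["not renewed"] += 1
        elif any(r.get("match") == "excellent" for r in renewals):
            counts["renewed"] += 1
        else:
            counts["grey area"] += 1

total = sum(counts.values())
for category, n in counts.items():
    print(f"{category}: {n} ({n / total:.0%})")

If the field names lined up with the real schema, the three percentages should come out close to the numbers at the top of this post.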

How to see the data

There were a number of requests for this data in tabular form. I totally understand where this is coming from, and it's certainly the easiest way to get into the data, but it's tricky, because converting the JSON to tabular data destroys information that would be useful for taking the next step (see below).

So, I've done the best I can. I added a script to the end of my Python workflow which generates three huge tab-separated files, and I put those files in the cce-spreadsheets project. This should be good for getting an overview of which books were renewed, which weren't, and which are foreign publications.
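For a quick look, you can read one of those tab-separated files with Python's csv module. The filename and column names below are placeholders (check the actual headers in the cce-spreadsheets files):

import csv

# Placeholder filename and column names: check the real headers in
# the cce-spreadsheets files.
with open("unrenewed.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        print(row["title"], "/", row["author"], "/", row["registration_date"])

The same files load straight into a spreadsheet program or pandas if you'd rather browse and filter interactively.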

What's next?

Discovering that a book published in 1950 is in the public domain doesn't make a free digitized version of that book automatically appear. Somebody has to do the work. At this point we go from fast data processing to really slow research and digitization work. You or I can now make a near-complete list of unrenewed books in a few minutes, but that list just represents an enormous to-do list for someone.

There are basically three "someones" who might step up here: Project Gutenberg, Hathi Trust, and Internet Archive.

Project Gutenberg

As I mentioned earlier, Project Gutenberg digitized the copyright renewal records some time ago, and they use them all the time. They have a section of their Copyright How-To explaining how to check whether a particular title was renewed, and whether the renewal matters. There are other steps to clear a pre-1963 work: you have to verify that the author lived in the US at the time, stuff like that. The newly digitized registration records can help with some of this, and my data processing script that combines registration and renewal can help with more of it, but there's still some manual work you have to do for each book.

Once that work is done, Project Gutenberg volunteers will locate a copy of the book, scan it, and OCR it (assuming there's no existing scan). Then they'll proofread it and put out HTML and plain-text editions. As you can imagine, this process takes a really long time, but the result is a clean, accurate copy of the book that can be read on its own or reused in other projects. The catch is that somebody has to care enough about a specific book to go through all this trouble.

Hathi Trust

Hathi Trust already has scans of a lot of these 1924-1963 books. They just don't make these scans available to the public, because as far as they know, all these books are still under copyright. If they were convinced otherwise, they'd open up the scans—they opened up almost all of their 1923 stuff this January when the 95-year copyright term finally expired. So we have to make a case for opening up these books.

Earlier, NYPL took the highest-circulating 1924-1963 books in our research collection and checked to see which ones lacked a renewal record. We sent the list to Hathi Trust, and they did their own verification and opened up some of the books: The Americans in Santo Domingo from 1928 is an example. Once Hathi opens up a scan, it's available to the public. It also becomes possible for Gutenberg et al. to turn the raw scan into something more readable.

In the near future, people at NYPL (not me) will be talking to people at Hathi Trust about what kind of evidence is necessary, in general, to convince them that the copyright on a 1924-1963 book has lapsed. Then we'll be able to give them a list of all the books where we can find that kind of evidence. There'll still be a verification process on the Hathi Trust side -- at the very least, they have to go through the book and make sure it doesn't contain unauthorized reprints from other books -- but it should streamline things quite a bit.

Internet Archive

Internet Archive is a wild card here. They scan a lot of books, and I could see them treating the "unrenewed" list as a big list of additional books to scan, but it would be a new undertaking. Making unrenewed works available is something Project Gutenberg volunteers do already, and it's something that Hathi Trust could do relatively easily, but for Internet Archive it would be a new line of work.

Data problems

That 8% grey area, where it's not clear whether a book was renewed, points to the general difficulty of meshing together two sets of public records published across half a century and digitized by different people. The grey area represents a lot of manual work that has to be done, and of course there's always the fear that a book that seems to be free and clear actually isn't: the title page says "printed in Canada", or the smoking-gun copyright renewal didn't show up because its ID number was typed wrong.

There's going to be a lot of manual work in the process of clearing these books, but there's no reason to wait until everything's perfect to get started. My preference is to cast a very wide net, try to find any renewal that might possibly be related to a registration, and make the grey area as big as possible. We know that a majority of 1924-1963 books will always come up "no renewal", because there are way more registrations than renewals. We can deal with those and then take a closer look at the grey area.
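To make the "wide net" idea concrete, here's a toy sketch (this is not the library's actual matching logic) that uses a cheap string-similarity score with a deliberately low threshold, so anything even vaguely plausible ends up in the grey area for a human to check:

from difflib import SequenceMatcher

def similarity(a, b):
    # Cheap 0.0-1.0 similarity between two titles.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def possible_renewals(registration_title, renewal_titles, threshold=0.5):
    # Cast a wide net: keep every renewal that's even vaguely similar,
    # and let a human sort out which (if any) is the real match.
    return [t for t in renewal_titles
            if similarity(registration_title, t) >= threshold]

# Made-up titles; real matching would also look at authors, dates, and ID numbers.
renewals = ["Voyage to the Imaginary Moon", "The Imaginary Moon", "Moonlight Cookbook"]
print(possible_renewals("The Imaginary Moon: A Voyage", renewals))

A low threshold means more false positives and a bigger grey area, but that's the point: it's much easier to manually reject a bad candidate than to notice a renewal the matching never surfaced.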
