Goodreads Developers

message 1: by Robert (last edited Jul 27, 2012 02:00PM) (new)

Jul 27, 2012 01:33PM

Hello!

I'm trying to use the search API method to find a book by title and author, but it often labels the wrong book as "best". You can see similar results on the Goodreads site. There seem to be many bogus entries for popular classics, credited to questionable authors, that put both the title and the real author's name in the "title" field. These entries are inevitably returned as the "best book" in a search, with the real book coming in second place.

For example, a search for "The Last of the Mohicans James Fenimore Cooper" returns a book with 0 ratings authored by "Tom Ratliff". The correct and popular entry for The Last of the Mohicans is the second result. Around the World in Eighty Days by Jules Verne falls victim to this odd behavior, too.

The search API only takes a single query string and I’m not restricting the fields to search (meaning it should be searching title and author fields). Shouldn’t the results favor the book with 23,000 ratings over the book with none?

Any suggestions on how to more reliably find the expected book via the search API?

Thanks!


message 2: by Brian (last edited Jul 28, 2012 01:06PM) (new)

Jul 28, 2012 01:05PM

There certainly are examples where our search isn't returning the book you're expecting. I think you may have to concede that the example you gave was going to be a difficult one: the title of the top result was exactly the search string you gave (I've updated the book's data now that you've pointed it out). We certainly endeavor to improve our search algorithm, but in cases like this, the best recommendation I can give you is to post on the Librarians' Group (http://www.goodreads.com/group/show/2...) and request that someone clean up a particular book that looks funky (wrong title, misleading title, wrong author, author in title, needs to be combined with other editions, etc.).

Now that a few minutes have passed, my corrections to the book's data seem to have fixed the search you've provided.

I haven't seen this behavior very frequently, but you've found it happening 'often'...could you share some more examples? Are they usually in cases where you provide title and author as the query? See if the librarians can help out on your list, but if all the book data are correct, we'd certainly appreciate if you compiled a list of some examples so we can take a look at how to tweak the search algorithm parameters.


message 3: by Robert (new)

Jul 29, 2012 07:23AM

"around the world in eighty days jules verne"
"the adventures of tom sawyer mark twain"
"kidnapped robert louis stevenson"
"pride and prejudice jane austen"
"the great gatsby francis scott fitzgerald"
"the wind in the willows kenneth grahame"
"alice in wonderland lewis carroll"
"treasure island robert louis stevenson"
"the jungle book rudyard kipling"


message 4: by Robert (new)

Jul 29, 2012 07:31AM

It appears to weigh title matches more heavily than author matches. I'm not sure how one would deal with it except to add rating counts to the logic but I can imagine that could introduce different unwanted results.

I'm considering adding logic on my side to deal with it since I know I'm searching for both title and author; perhaps look at the review count on the best_book and perform a quick comparison on the author last name, grabbing the next result if it doesn't match.
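
A minimal sketch of that client-side check, assuming the search response has already been parsed into a list of dictionaries in the order returned. The field names (`title`, `author`, `ratings_count`) and the helpers `last_name` and `pick_result` are hypothetical and do not reflect the actual response format.

```python
def last_name(author):
    """Return the lowercased family name from 'First Last' or 'Last, First'."""
    # For 'Last, First' the family name is everything before the comma.
    family = author.split(",", 1)[0] if "," in author else author
    parts = family.split()
    return parts[-1].lower() if parts else ""


def pick_result(results, expected_author, min_ratings=1):
    """Return the first result whose author's last name matches and that has
    at least min_ratings ratings; otherwise fall back to the first result."""
    wanted = last_name(expected_author)
    for book in results:
        if last_name(book["author"]) == wanted and book["ratings_count"] >= min_ratings:
            return book
    return results[0] if results else None


# Hypothetical parsed results mirroring the Mohicans example in this thread.
results = [
    {"title": "The Last of the Mohicans James Fenimore Cooper",
     "author": "Tom Ratliff", "ratings_count": 0},
    {"title": "The Last of the Mohicans",
     "author": "James Fenimore Cooper", "ratings_count": 23000},
]
best = pick_result(results, "James Fenimore Cooper")
print(best["title"])
```

Falling back to the first result keeps the behavior no worse than trusting the "best book" when nothing matches; a stricter caller could return nothing instead.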


message 5: by Robert (new)

Jul 30, 2012 08:38AM

As a follow-up, I stumbled onto a presumably related yet different issue with the various Oz books. Searching by title and full author name sometimes returns nothing but often returns one result, The Wizard of Oz, instead of the Oz book in question.

In this case I'm dealing with public domain Oz EPUBs that list the author as "Lyman Frank Baum". If I try a search on goodreads.com and change the author to the more common "L Frank Baum" or simply "Frank Baum", some results improve slightly. Results improve greatly if I put the title and "Frank Baum" in their own sets of quotes; however, including "Lyman" still torpedoes the search.

Some examples:
"the marvelous land of oz lyman frank baum"
"dorothy and the wizard in oz lyman frank baum"
"tik-tok of oz lyman frank baum"
"ozma of oz lyman frank baum"
"the road to oz lyman frank baum"
"the emerald city of oz lyman frank baum"
"the patchwork girl of oz lyman frank baum"
"the scarecrow of oz lyman frank baum"

I realize these are rather unusual/low-popularity examples but I thought you might be interested.
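
The workaround described above (quoting the title and a shortened author name separately) could be generated programmatically. This is only a sketch: `query_variants` is a hypothetical helper, and whether the search API treats quoted phrases the same way the goodreads.com search box does is an assumption.

```python
def query_variants(title, author):
    """Return quoted queries, dropping leading given names one at a time."""
    names = author.split()
    variants = []
    for i in range(len(names)):
        shorter = " ".join(names[i:])
        # Title and author each go in their own set of quotes.
        variants.append(f'"{title}" "{shorter}"')
    return variants


for q in query_variants("ozma of oz", "lyman frank baum"):
    print(q)
```

A caller would try the variants in order and stop at the first one that returns a plausible match.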


message 6: by Brian (new)

Jul 31, 2012 12:37AM

I am interested, thanks. I'm curious why you're always inserting the author into searches. Are these searches generated by your users, or are you constructing these queries programmatically? Have you tried dropping the author from the search terms? Because, yes, we definitely weight the title more than the author.

Do any of these look like 'dirty data'? Perhaps low-quality books squatting on those exact titles, non-books, or even books that just need to be combined? Please post to the librarians' group if you're finding any of those conditions.


message 7: by Robert (last edited Jul 31, 2012 06:24AM) (new)

Jul 31, 2012 06:23AM

I'm programmatically calling the search API method, passing in the title (plus subtitle if I have one) and author name in a single string for the "q" parameter. I start with as much information as I can in an effort to ensure I get the right book. If that yields no results, I try again without the author name.
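
Roughly, the retry logic looks like the sketch below. The `search` argument is a hypothetical callable standing in for whatever wraps the actual API request; it is assumed to take the "q" string and return a list of results (empty when nothing is found).

```python
def find_book(search, title, author, subtitle=None):
    """Try the most specific query first, then retry without the author."""
    full_title = f"{title} {subtitle}" if subtitle else title
    queries = [f"{full_title} {author}", full_title]
    for q in queries:
        results = search(q)
        if results:
            # Report which query succeeded along with its results.
            return q, results
    return None, []


def fake_search(q):
    """Stand-in for the real API call, backed by a tiny in-memory catalog."""
    catalog = {"kidnapped": ["Kidnapped"]}
    return catalog.get(q.lower(), [])


query, results = find_book(fake_search, "Kidnapped", "Robert Louis Stevenson")
print(query, results)
```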

To me, the "best books" that are returned from the first batch of examples I listed look like dirty data; not real editions. Even if they are real books, they are certainly not the book one would expect as the "best book".

The Oz books are a slightly different problem caused by the same situation. I'm getting a real book back, just not the right one. If I search for "the emerald city of oz lyman frank baum" and there really is a book called "The Emerald City of Oz" by "L Frank Baum", that should be a better match than "The Wizard of Oz" by "L Frank Baum".

Before the search API method I was trying to use the book.title API method which accepts title and author as different parameters. The book.title API method results have different issues. For example, that API method doesn't find certain books that the search API method does. Also, the search API method can properly find books when provided a "lastname, firstname" author string but the book.title API method can't...so I switched to the search API.


message 8: by William, Goodreads engineer (new)

Aug 01, 2012 10:17AM

Mod

Robert wrote: "Before the search API method I was trying to use the book.title API method which accepts title and author as different parameters. The book.title API method results have different issues. For example, that API method doesn't find certain books that the search API method does. Also, the search API method can properly find books when provided a 'lastname, firstname' author string but the book.title API method can't...so I switched to the search API."

Did you encounter this with your first examples, e.g., "around the world in eighty days jules verne", or are there other title/author combinations that illustrate this better?


message 9: by Robert (last edited Aug 01, 2012 11:32AM) (new)

Aug 01, 2012 11:30AM

Actually, yes, "Around the World in Eighty Days" is problematic for the book.title API method, too. If I pass "Around the World in Eighty Days" and "Jules Verne" as title and author, book.title's response is a book titled "Around The World In Eighty Days By Jules Verne" with a primary author named "Steven Otfinoski". Jules Verne is listed as a second author. However, Goodreads HAS a book (ID 54479) that exactly matches the title and author I provided. So...I don't know what is going on there.

The book.title API method has other minor issues that ultimately made it less useful than the search API method. It seems to be case-sensitive, while search is not. It is also very strict with its matches, while the search API is fuzzier and can usually ignore differences not only in capitalization but also in formatting.

One example is "Forever" by Maggie Stiefvater. The search API will return the correct book for this query: "forever stiefvater, maggie". But book.title will yield no results until you properly capitalize "Forever" and send Maggie Stiefvater's name in First Last format.

Another example is "Island in the Sea of Time". The EPUB I was working with defined the author name as "S. M. Stirling" but the Goodreads book lists him as "S.M. Stirling". Again, the search API will find the correct book with the query "Island in the Sea of Time S. M. Stirling". The book.title API won't return any result as long as there is a space in between the author initials.

So it would be useful if the book.title method could inherit the case-insensitive/fuzzy nature of search while keeping its more useful, distinct title and author params...IF those params are used to good effect. The Jules Verne example sort of hints that perhaps they aren't utilized as one might expect.
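
To illustrate the kind of tolerance I mean, here is a sketch of normalizing names before comparing them. This is not how either API method actually works; `normalize_title` and `normalize_author` are hypothetical helpers that only cover the three differences mentioned above (case, "Last, First" order, and spacing between initials).

```python
def normalize_title(title):
    """Lowercase and collapse whitespace so case differences are ignored."""
    return " ".join(title.lower().split())


def normalize_author(name):
    """Lowercase, reorder 'Last, First' to 'First Last', and drop periods
    so that 'S. M.' and 'S.M.' compare as equal."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    name = name.replace(".", " ")
    return " ".join(name.lower().split())


# Each comparison below prints True.
print(normalize_author("S. M. Stirling") == normalize_author("S.M. Stirling"))
print(normalize_author("stiefvater, maggie") == normalize_author("Maggie Stiefvater"))
print(normalize_title("forever") == normalize_title("Forever"))
```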


message 10: by Robert (last edited Aug 01, 2012 12:15PM) (new)

Aug 01, 2012 12:14PM

Also, book.title's strict string-matching rules can be extra problematic when one combines a title and subtitle into the title API parameter for a more specific search. In my testing so far, the titles I encounter are generally capitalized the same way as in Goodreads' data. However, subtitles are often slightly inconsistent. An EPUB file's metadata might have a title of "Kaboom!" and a subtitle of "The Awesome Series Volume #2", while the Goodreads title will be something like "Kaboom! (The Awesome Series Book 2)". Sending a combined title/subtitle string to the search API almost always yields a match but often results in "book not found" from book.title.
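
One way a client could tolerate that kind of subtitle drift is to compare word tokens instead of exact strings. The sketch below uses Jaccard similarity over lowercased tokens; it is my own illustration, not something either API method is known to do, and `tokens` and `title_overlap` are hypothetical helpers.

```python
import re


def tokens(text):
    """Lowercase a title and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_overlap(a, b):
    """Jaccard similarity of the two token sets (1.0 means identical words)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


epub_title = "Kaboom! The Awesome Series Volume #2"
goodreads_title = "Kaboom! (The Awesome Series Book 2)"
print(title_overlap(epub_title, goodreads_title))
```

These two titles share five of their seven distinct tokens, so a threshold somewhat below that ratio would accept the pair while an exact string comparison rejects it.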


message 11: by Brian (new)

Aug 01, 2012 02:57PM

By the way, I did some librarian cleanup on the particular examples you gave above that didn't do well on the search API, and it seems to have helped in most cases. Others might take 30 minutes or so for search indexing to catch up. So if you have a number of these and can post them to the librarians' group, we'll probably find that takes care of many of the problem books.


message 12: by Robert (new)

Aug 01, 2012 05:37PM

Thanks, will do in the future if I encounter more.


message 13: by Robert (new)

Aug 03, 2012 10:51AM

Just as an FYI, here's another example where I don't understand how the Goodreads search query works. I randomly tried a query based on an EPUB from Project Gutenberg called "1914" by Earl of Ypres John Denton Pinkstone French.

I sent the search API a query with my usual combined title and author string "1914 Earl of Ypres John Denton Pinkstone French". That yields No Results, so I fall back and try again with just the title "1914" and I get "A Christmas Carol" by Charles Dickens!?

Goodreads has a record for this title: http://www.goodreads.com/book/show/93...
The title is "1914" and the Goodreads version of the author name is "John Denton Pinkstone French". Why did the search API fail to find this book? The title is an exact match and the author name is a fairly close match. The book.title API also fails to return this book when I specify a title param of "1914".

