Goodreads Developers discussion
Search often chooses wrong book as 'best'


Now that a few minutes have passed, my corrections to the book's data seem to have fixed the search you've provided.
I haven't seen this behavior very frequently, but you've found it happening 'often'...could you share some more examples? Are they usually in cases where you provide title and author as the query? See if the librarians can help out on your list, but if all the book data are correct, we'd certainly appreciate if you compiled a list of some examples so we can take a look at how to tweak the search algorithm parameters.

"the adventures of tom sawyer mark twain"
"kidnapped robert louis stevenson"
"pride and prejudice jane austen"
"the great gatsby francis scott fitzgerald"
"the wind in the willows kenneth grahame"
"alice in wonderland lewis carroll"
"treasure island robert louis stevenson"
"the jungle book rudyard kipling"

I'm considering adding logic on my side to deal with it since I know I'm searching for both title and author; perhaps look at the review count on the best_book and perform a quick comparison on the author last name, grabbing the next result if it doesn't match.
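A minimal sketch of that client-side check, assuming the search results have already been parsed into a list of dicts in ranked order. The `title` and `author` keys and the sample entries are invented for illustration and are not the actual API response shape.

```python
from typing import Dict, List, Optional


def last_name(author: str) -> str:
    """Lowercased surname; handles both "First Last" and "Last, First"."""
    author = author.strip()
    if "," in author:
        return author.split(",", 1)[0].strip().lower()
    parts = author.split()
    return parts[-1].lower() if parts else ""


def pick_result(results: List[Dict], expected_author: str) -> Optional[Dict]:
    """Return the first result whose author surname matches the expected one.

    Falls back to the top result if nothing matches, and None if there
    are no results at all.
    """
    wanted = last_name(expected_author)
    for result in results:
        if last_name(result["author"]) == wanted:
            return result
    return results[0] if results else None


# Invented sample data: a bogus entry ranked above the real book.
results = [
    {"title": "Kidnapped Robert Louis Stevenson", "author": "Some Publisher"},
    {"title": "Kidnapped", "author": "Robert Louis Stevenson"},
]
chosen = pick_result(results, "robert louis stevenson")
print(chosen["title"])
```

Comparing only the surname keeps the check tolerant of differences such as "L Frank Baum" versus "Lyman Frank Baum", at the cost of accepting a different author who shares the surname.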

In this case I'm dealing with public domain Oz EPUBs that list the author as "Lyman Frank Baum". If I try a search on goodreads.com and change the author to the more common "L Frank Baum" or simply "Frank Baum", some results slightly improve. Results improve greatly if I put the title and "Frank Baum" in their own sets of quotes; however, including "Lyman" still torpedoes the search.
Some examples:
"the marvelous land of oz lyman frank baum"
"dorothy and the wizard in oz lyman frank baum"
"tik-tok of oz lyman frank baum"
"ozma of oz lyman frank baum"
"the road to oz lyman frank baum"
"the emerald city of oz lyman frank baum"
"the patchwork girl of oz lyman frank baum"
"the scarecrow of oz lyman frank baum"
I realize these are rather unusual/low-popularity examples but I thought you might be interested.
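For what it's worth, the author rewrites I tried by hand could be generated automatically. This is only a sketch: the helper names `author_variants` and `quoted_queries` are made up, and the ordering (shortest form first) is tuned to the Baum case rather than being a general rule.

```python
from typing import List


def author_variants(author: str) -> List[str]:
    """Return alternative spellings of a three-or-more-part author name.

    For "lyman frank baum" this yields "frank baum", "l frank baum",
    and finally the original string.
    """
    parts = author.split()
    variants = []
    if len(parts) >= 3:
        variants.append(" ".join(parts[1:]))                  # drop first name
        variants.append(" ".join([parts[0][0]] + parts[1:]))  # first name as initial
    variants.append(author)
    return variants


def quoted_queries(title: str, author: str) -> List[str]:
    """Build query strings with title and author in their own sets of quotes."""
    return ['"{}" "{}"'.format(title, v) for v in author_variants(author)]


for query in quoted_queries("ozma of oz", "lyman frank baum"):
    print(query)
```

Dropping the first name blindly is wrong for many authors (it would turn "robert louis stevenson" into "louis stevenson"), so each query's result would still need to be checked against the expected title and author.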

do any of these look like 'dirty data'?...perhaps low quality books squatting on those exact titles, non-books, or even books that just need to be combined? please post to librarians' group if you're finding any of those conditions.

To me, the "best books" that are returned from the first batch of examples I listed look like dirty data; not real editions. Even if they are real books, they are certainly not the book one would expect as the "best book".
The Oz books are a slightly different problem caused by the same situation. I'm getting a real book back, just not the right one. If I search for "the emerald city of oz lyman frank baum" and there really is a book called "The Emerald City of Oz" by "L Frank Baum", that should be a better match than "The Wizard of Oz" by "L Frank Baum".
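To illustrate what "better match" means here, this sketch scores each returned title against the title I searched for, using the standard library's `difflib`. The result dicts and their `title`/`author` keys are assumptions for illustration, not the API's actual format.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def title_score(wanted: str, candidate: str) -> float:
    """Case-insensitive similarity ratio between two titles (0.0 to 1.0)."""
    return SequenceMatcher(None, wanted.lower(), candidate.lower()).ratio()


def closest_title(wanted: str, results: List[Dict]) -> Optional[Dict]:
    """Return the result whose title is most similar to the wanted title."""
    if not results:
        return None
    return max(results, key=lambda r: title_score(wanted, r["title"]))


# The situation described above: the wrong Oz book is ranked first.
results = [
    {"title": "The Wizard of Oz", "author": "L Frank Baum"},
    {"title": "The Emerald City of Oz", "author": "L Frank Baum"},
]
best = closest_title("the emerald city of oz", results)
print(best["title"])
```

An exact title match scores 1.0, so it always wins over a partial one like "The Wizard of Oz".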
Before the search API method I was trying to use the book.title API method which accepts title and author as different parameters. The book.title API method results have different issues. For example, that API method doesn't find certain books that the search API method does. Also, the search API method can properly find books when provided a "lastname, firstname" author string but the book.title API method can't...so I switched to the search API.
Did you encounter this with your first examples, e.g., "around the world in eighty days jules verne", or are there other title/author combinations that illustrate this better?

The book.title API method has other minor issues that ultimately made it less useful than the search API method. Book.title seems to be case-sensitive while search is not. Book.title is also very strict with its matches, while the search API is fuzzier and can usually ignore differences not only in capitalization but also in formatting.
One example is "Forever" by Maggie Stiefvater. The search API will return the correct book for this query: "forever stiefvater, maggie". But book.title will yield no results until you properly capitalize "Forever" and send Maggie Stiefvater's name in First Last format.
Another example is "Island in the Sea of Time". The EPUB I was working with defined the author name as "S. M. Stirling" but the Goodreads book lists him as "S.M. Stirling". Again, the search API will find the correct book with the query "Island in the Sea of Time S. M. Stirling". The book.title API won't return any result as long as there is a space in between the author initials.
So it would be useful if the book.title method could inherit the case-insensitive/fuzzy nature of search while keeping its more useful, distinct title and author params...IF those params are used to good effect. The Jules Verne example sort of hints that perhaps they aren't utilized as one might expect.
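Until then, a client-side workaround for the two examples above would be to normalize the author string before calling book.title. This is a rough sketch: `normalize_author` is a made-up helper, and the assumption that Goodreads always stores initials without spaces is based only on the "S.M. Stirling" case.

```python
import re


def normalize_author(name: str) -> str:
    """Rewrite an author string into "First Last" form with compact initials."""
    name = name.strip()
    # "Stiefvater, Maggie" -> "Maggie Stiefvater"
    if "," in name:
        last, first = name.split(",", 1)
        name = "{} {}".format(first.strip(), last.strip())
    # "S. M. Stirling" -> "S.M. Stirling": drop spaces between initials only.
    name = re.sub(r"(?<=\b[A-Za-z]\.)\s+(?=[A-Za-z]\.)", "", name)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", name)


print(normalize_author("S. M. Stirling"))
print(normalize_author("Stiefvater, Maggie"))
```

This does not touch capitalization, so a lowercase title such as "forever" would still need to be fixed separately.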



I sent the search API a query with my usual combined title and author string "1914 Earl of Ypres John Denton Pinkstone French". That yields No Results, so I fall back and try again with just the title "1914" and I get "A Christmas Carol" by Charles Dickens!?
Goodreads has a record for this title: http://www.goodreads.com/book/show/93...
The title is "1914" and the Goodreads version of the author name is "John Denton Pinkstone French". Why did the search API fail to find this book? The title is an exact match and the author name is a fairly close match. The book.title API also fails to return this book when I specify a title param of "1914".
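The fallback I describe is roughly the following. It is a sketch only: `search` stands in for whatever function calls the search API and returns parsed results, and the `title`/`author` keys are assumed. Checking the returned title before accepting a result at least keeps "A Christmas Carol" from being taken as a match for "1914".

```python
from typing import Callable, Dict, List, Optional


def find_book(
    title: str,
    author: str,
    search: Callable[[str], List[Dict]],
) -> Optional[Dict]:
    """Try "title author", then title alone; accept only an exact title match."""
    wanted = title.strip().lower()
    for query in ("{} {}".format(title, author), title):
        for result in search(query):
            if result["title"].strip().lower() == wanted:
                return result
    return None


def fake_search(query: str) -> List[Dict]:
    """Stand-in for the real API call, reproducing the behavior described above."""
    if query == "1914":
        return [{"title": "A Christmas Carol", "author": "Charles Dickens"}]
    return []


found = find_book("1914", "Earl of Ypres John Denton Pinkstone French", fake_search)
print(found)
```

With the behavior above this returns None instead of the wrong book, which is easier to handle downstream than a confident mismatch.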
I'm trying to use the search API method to find a book by title and author, but it often chooses the wrong book to label "best". You can see similar results on the Goodreads site. There seem to be tons of bogus entries for popular classics, from questionable authors, that put both the title and the real author in the "title" field. Inevitably these entries are returned as the “best book” in a search, with the real book coming in second place.
For example, a search for “The Last of the Mohicans James Fenimore Cooper” returns a 0-ratings book authored by “Tom Ratliff”. The actual correct and popular entry for Last of the Mohicans is the 2nd result. Around the World in Eighty Days by Jules Verne falls victim to this odd behavior, too.
The search API only takes a single query string and I’m not restricting the fields to search (meaning it should be searching title and author fields). Shouldn’t the results favor the book with 23,000 ratings over the book with none?
Any suggestions on how to more reliably find the expected book via the search API?
Thanks!
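One workaround on the client side for the ratings question above would be to skip results that have no ratings while keeping the search order otherwise. This is a sketch under assumptions: the `ratings_count` key and the `min_ratings` threshold are illustrative, and the two sample entries use the figures from the Last of the Mohicans example.

```python
from typing import Dict, List, Optional


def prefer_rated(results: List[Dict], min_ratings: int = 1) -> Optional[Dict]:
    """Return the first result with at least min_ratings ratings.

    Falls back to the top result if none qualify, and None if the
    list is empty.
    """
    for result in results:
        if result.get("ratings_count", 0) >= min_ratings:
            return result
    return results[0] if results else None


# Search order as described: the 0-ratings entry comes back first.
results = [
    {"author": "Tom Ratliff", "ratings_count": 0},
    {"author": "James Fenimore Cooper", "ratings_count": 23000},
]
picked = prefer_rated(results)
print(picked["author"])
```

A threshold keeps the search engine's own ordering among rated books, rather than always jumping to the most-rated edition.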