Goodreads Librarians Group discussion
note: This topic has been closed to new comments.
Large Book Data Import
message 51: by Plethora, Jan 02, 2014 12:05PM



https://www.goodreads.com/book/edits/...
An existing record was apparently edited by sable to include the ASIN, but you can't tell that directly from the librarian edit log here. It would be nice if the "Data sources for this edition" table at the bottom of the librarian edit log included ASIN as well as ISBN and ISBN13.
It also doesn't help that a potentially bogus Amazon edition was created that conflated two books (see discussion here.) That doesn't exactly fill me with confidence regarding the quality of Amazon's data, even for kindle editions.
EDIT: I'm realizing now that the entry with the stolen ASIN may or may not have been a valid entry. Maybe it was a dup, actually. But still, it's alarming to see the sable import moving around ASINs between old records, rather than just making new duplicate entries as advertised.

Another example of partial name, with more than 1 publisher involved.
Jr. Willis
And another.

FYI, not so much a bogus author as the import being unable to process umlauts properly. :(

I just came across 2 books with 'a' somewhere in the title, not at the beginning. In both instances, the sort-by-title field had a comma incorrectly inserted, so the title was 'broken up' wherever the 'a' was.

One other case where an ASIN comes in handy (for links to Amazon sales pages if nothing else) is where an ASIN has been assigned to a pre-ISBN book sold as used, especially for books old enough to have more than one pre-ISBN edition, such as Remembrance of Things Past. I could swear that one had an ASIN when I created it, but the Librarian Edits now don't show even a field for such a record.

According to a kicking I received the other day, using an ASIN from an old book is naughty. Thinking about it, that does make some sort of sense, since that ASIN is a selling link for a dealer, not an actual identifier for an edition.




Changed to the default description.

I don't think this should still have an ASIN under policy. Those are reserved for Kindle ebooks, and it would need to be listed without an ISBN. In this case the ASIN is just a number Amazon issued for a third-party seller to sell an item. These books likely have a correct ISBN and likely need merging into this Temeraire: In the Service of the King edition.

I have changed a bunch myself but I just wanted to bring it up because it seems to be a problem with the imports overwriting real descriptions and it's getting kind of annoying.


SFBC releases used not to have ISBNs, but this recent one probably does - the same as the mass market edition. Since that one is in use, might as well keep the ASIN to link to the proper page on Amazon.

According to Banjomike's message above ... "According to a kicking I received the other day, using an ASIN from an old book is naughty." This is not acceptable practice.
I still venture that it matches the book I linked above, which is a SFBC 2006 edition.
Also, old editions from before ISBNs are not getting ASINs assigned for used-book dealers that are selling them on Amazon. They simply have an empty field.
To clarify: We use ASINs for Kindle editions and Audible ebooks. We do not use them for print books.

That just sounds weird because the publisher page at http://www.sfbc.com/temeraire-in-the-... sells that hardcover edition omnibus book exclusively.
Any other non-SFBC purchase links go only to used editions. Meaning there is no amazon page for it, only an amazon marketplace seller.
(Just a seller entry that will never get used again now that the used copy is sold unless for some reason seller had more than one copy to sell. By that logic, might as well put ebay, half.com, and other used edition item numbers in isbn/asin fields.)
I know Rivka clarified it correctly; just wanted to give the publisher link confirming that it is the 2006 edition so I think should be combined as Bookworm R said in message 64 and 68.
It's isbn 9780739468715 according to details tab on that publisher link, so really I think it just needs to merge into the already there isbn 9780739468715 edition.
ETA: clarity; plus, just in case these ASINs show up, there are some duplicate, uncombined edition pages for the marketplace item on Amazon under ASINs including B001S3IINE and B002V4KOOK (no doubt more ASINs will be added as more sellers create additional pages after failing to find the existing ones)

D.A. wrote: "I'm going over to feedback group and suggest that marketplace data for existing works be excluded from amazon data feed."
They are not coming in via the feed, as far as I am aware. That particular edition was added manually in 2010 (and actually appears to have originally been a Kindle edition).


I have been setting the descriptions to the default description when I find those.

Another option is to revert the change. Sometimes the old blurb is better than the default blurb, more relevant to that specific edition, or it might contain a note about alternate cover editions.

https://www.goodreads.com/book/show/1... and the librarian change log https://www.goodreads.com/book/edits/...
Such edits should not be allowed to happen. I left it untouched.
@Julie: Setting the default will not save edition-specific descriptions.
@Banjomike: Reverting a change of a description never worked for me (regardless of whose edit, amazon_kcw's or human, it was).

Just wanted to give you a heads up and let you know that I'm working on returning the majority of the 'stolen' kindle edition ASINs back to their original book. This will be effective only for books altered by amazon_sable, not for amazon_kcw. Explanation:
amazon_sable updates occur when Goodreads does a large data dump from Amazon's catalog. amazon_kcw updates occur when a Goodreads on Kindle user attempts to add an Amazon book to their GR library that has not yet been mapped to a GR book.
In the future, amazon_sable data dumps will *not* steal ASINs. And, since amazon_kcw will only attempt a book match or creation if a GR book has not yet been mapped to an Amazon book, ASIN stealing should be very very rare in that case too (since we've mapped the vast majority of the existing ASINs in our library)
I will also be working on the issue of books being created with no isbn13 or isbn - This is caused when there is a partial but not full match for a book coming from sable or kcw. Similar to our decision about asins, we've decided that in the future amazon_sable will NOT create a new book if it matches an existing GR book on isbn/isbn13 but NOT title or author. We'll also be cleaning up a lot of the books and works created in this case.
amazon_kcw will continue to generate books when there is a partial match because a book creation request from amazon_kcw indicates that an Amazon book that a real-life user has purchased has not yet been mapped to an existing GR book.
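The sable/kcw difference on partial matches can be sketched roughly as below. This is only an illustration of the rule as described in this post, not Goodreads code: the `Action` enum, the function name, and the boolean inputs are hypothetical, and the "no identifier match" branch is an assumption.

```python
from enum import Enum


class Action(Enum):
    MAP_TO_EXISTING = "map to existing book"
    CREATE_NEW = "create new book"
    SKIP = "do not create a book"


def partial_match_action(source: str, identifier_matches: bool,
                         title_author_match: bool) -> Action:
    """Decide what an import does with one incoming record.

    source is "amazon_sable" or "amazon_kcw" (names taken from the thread).
    identifier_matches: an existing book shares the isbn/isbn13 (or ASIN).
    title_author_match: that book's title and author also match.
    """
    if identifier_matches and title_author_match:
        # Full match: the incoming record maps to the existing book.
        return Action.MAP_TO_EXISTING
    if identifier_matches:
        # Partial match: only kcw (a real user's purchase) still creates a book.
        return Action.CREATE_NEW if source == "amazon_kcw" else Action.SKIP
    # Assumption: with no identifier match at all, a new record is created.
    return Action.CREATE_NEW


print(partial_match_action("amazon_sable", True, False))
print(partial_match_action("amazon_kcw", True, False))
```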
(There's a lot of twists and turns and ifs and thens in this explanation, so let me know if anything isn't clear.)
Thank you everyone for all your input - it's been invaluable for tracking down some of the bad data patterns existing in the pre-Christmas data import and it's helped us refine our future strategy!!
-Sarah
Edit: And please keep new issues coming as they arise - we have an ongoing list and will continue to add new concerns to it. This post is just to let you know what I'm working on right this moment - other work is being done by others on the team and I'm sure I'll tackle some of the other issues once this first set of fixes is done.


Here's a link to my search at the point I left it.
(ETA: That's odd: the link is formatted correctly but doesn't seem to be live. Here it is at length: https://www.goodreads.com/search?page...)

I have mixed feelings about import-generated books removing ASINs from existing Goodreads books. On the one hand, we first attempt to match the import to the existing book and only generate a new book if the import doesn't match the existing book's title and author. On the other hand, GR has title and author formatting policies which sometimes result in false mismatches. Moving forward we'll be improving the matching algorithms to limit those incorrect mismatches, but another potential option is to add an item to the (to-be-created) Librarian Queue and allow a librarian to determine whether the existing book's data needs to be updated/edited.
This is still under discussion :-)

I'd hold off on your manual updates until we've run a few of our clean-up scripts. It might save you a lot of work! I'll keep this thread updated about when we run those scripts and what the results were.

I would suggest that your development team talks to Amazon about the reuse of ASINs. From what I understand, they allow the author to do whatever they want with a book after it has been uploaded, so the content, title and/or cover can change without a new ASIN being issued. This creates a problem: I bought a book titled XXX with this cover and content under ASIN XYZ. The author changes some component and the ASIN remains the same. I didn't read "NEW XYZ" or buy "NEW XYZ". If a book has formatting or typos fixed, fine, reuse the ASIN, but if they issued a new ASIN when other components start changing, we could cut down on all the back-and-forth "fighting" with editions.
[Not sure if that makes sense and sounds right, hopefully someone else can state more elegantly]

Sarah,
you do not mention under which circumstances amazon_kcw is entitled to UPDATE/MODIFY EXISTING Goodreads entries.
Could you please elaborate on this?
Furthermore, could you please enlighten us on the match criteria?
Thank you very much,
Michael

So maybe the data import and asin moving issues should instead turn into addressing the purchase link features. I know of two major issues with the purchase links (again, I don't think that means valid editions should be vandalized just to move information onto what is currently for sale):
1. Amazon allows authors to bypass isbn agency rules requiring a new isbn be obtained when an author or publisher re-releases a book with a new bookcover. Goodreads alternate cover edition policy allows updated covers but because the isbn/asin numbers are not in the fields purchase links use, purchase links for alternate cover editions fail. Maybe goodreads database needs an ace or "also uses isbn/asin/ean/sku" field that purchase links would use first if present (I wouldn't even mind a cover edition #field to keep track of which cover change book is on. It's not really a new edition if all that changed was the cover. I'm not real sure most authors, readers or amazon understands the alternate cover edition policies like librarians and staff do).
2. Purchase links for books with bookseller numbers in the isbn13 field (such as a BNID starting with 294 or a Kobo ID starting with 123) fail if anything is in the isbn10 field. You don't find those books for sale searching by isbn10, which is what the purchase links search on first. Maybe change it so that links only use the isbn10 field if no isbn13 or asin number is present.
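The change suggested in point 2 could look something like the sketch below. It is a guess at the idea, not how Goodreads purchase links actually work: `purchase_lookup_id` is a made-up name, and trying isbn13 before asin is an assumption, since the suggestion only says isbn10 should come last.

```python
from typing import Optional


def purchase_lookup_id(isbn10: Optional[str], isbn13: Optional[str],
                       asin: Optional[str]) -> Optional[str]:
    """Pick the identifier a purchase link would search on.

    isbn10 is used only when both isbn13 and asin are blank.
    The isbn13-before-asin order is an assumption for illustration.
    """
    for candidate in (isbn13, asin, isbn10):
        if candidate and candidate.strip():
            return candidate.strip()
    return None


# Placeholder values: a bookseller number in isbn13 wins over a stray isbn10.
print(purchase_lookup_id("0000000000", "2940000000000", None))
```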

As to "Amazon allows authors to bypass isbn agency rules requiring a new isbn be obtained when an author or publisher re-releases a book with a new bookcover" .... this isn't required by ISBN.org, a new cover does not automatically trigger the need for a new ISBN.
I have found though that any cover changes an author has made to a kindle edition filter down to my kindle editions. I leave the vast majority of my books in the cloud and never did the Calibre download cataloging of books (even though I own Calibre) but I know I have books that the cover did not look like it does now when I purchased the book. Which is a shame as I liked the old cover.
D.A. wrote: "So maybe the data import and asin moving issues should instead turn into addressing the purchase link features."
That's really a separate issue and outside the range of this thread.

I would suggest that your development team talks to Amazon about the reuse of ASIN's. From what I understand they allow the author to do whatever they want with a book after it has been u..."
Hi Bookworm - thanks for your thoughts. It's a long standing issue/debate/discussion about whether to consider ASINs, isbns, and isbn13s as unique...
One of the cases we run into a lot (as you all know) is the multiple-cover book. Same isbns, different cover image. Same book? Different book? Hard to say.
It's a similar issue when an author changes their pen name or, even more confusingly the title of their book. We'll likely work toward some model of authors that allows for different aliases. Not sure if we'd want to do the same thing with book titles.
There's an even more confusing issue in that isbns and isbn13s are occasionally reused in different marketplaces for completely different books... *cry*
As it stands, GR policy is that isbn13s, isbns, and asins are all considered unique. In some cases this is going to result in book records that don't have isbn13/isbn values. In other cases this may result in a book on GR being mapped to the same book on Amazon even though the two records have slightly different data.
We're trying to find a happy medium that maps books well and doesn't create *too* much duplicate information.

under what circumstances will that policy allow for the modification of already existing data?
Thanks.

You might want to check on amazon in "manage my kindle" to see if you have automatic updates turned on.

I don't have automatic updates turned on. In fact I got a notice today for a book that has substantial fixing of typos. But I know I have had cover images changed.

Our general policy: User generated data (Librarian data in particular) should *never* be overwritten by data feeds (be they Onix feeds or data from Amazon). Data feeds are *only* used to supplement blank values on existing records or to create new records where one does not already exist. We are definitely aware that you guys do a better job of cleaning up book records than any of the other data sources out there ;-)
More specifics: The *one* exception to this at the moment is ASINs. When we receive book information from Amazon that indicates that an ASIN has potentially been misattributed to a book record, we are currently removing that ASIN from the current record and applying it to the newly created book. We have logs of *every* time this has happened and hope to expose this information in a useful way to Librarians sometime in the near future so that such changes can be easily overridden.
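As a rough illustration of the general policy (feeds only fill in blanks and never overwrite existing values), here is a minimal sketch. The record layout, field names, and `apply_feed` function are hypothetical, and the ASIN exception described above is deliberately not modelled.

```python
def apply_feed(existing: dict, feed: dict) -> dict:
    """Return a copy of `existing` with blank fields filled from `feed`.

    A field counts as blank when it is missing, None, or an empty or
    whitespace-only string. Non-blank existing values are never replaced.
    """
    merged = dict(existing)
    for field, value in feed.items():
        if value is None or (isinstance(value, str) and not value.strip()):
            continue  # the feed has nothing useful for this field
        current = merged.get(field)
        if current is None or (isinstance(current, str) and not current.strip()):
            merged[field] = value
    return merged


# Made-up record: the librarian's title survives, the blank description is filled.
record = {"title": "Librarian Title", "description": "", "isbn13": None}
feed = {"title": "Feed Title", "description": "Feed blurb", "isbn13": "9780739468715"}
print(apply_feed(record, feed))
```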
Matching:
Matching is done for Amazon data in a pretty similar way to other data feeds. Specifically:
- For Kindle Editions, we first look to see whether we have any books in our database that share the imported book's ASIN. If so, we then check whether the title and author are close matches to the existing book (we're working on tweaking our similarity algorithms so that books like this don't create a new book record: https://www.goodreads.com/book/show/1... ). If the title and the author are close enough matches, we consider the two books to be a match. If one of these *doesn't* match, we create a new record and generate a log of the conflict
- Non-Kindle Editions work in almost the same way, but we first try to match on isbn13. If we fail to find a book that matches on isbn13, title, and author, we then try to find an appropriate isbn10 match. If this again fails, we create a new record and generate a log of any conflicts we encountered along the way.
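The matching steps above might be sketched like this. It is only an approximation under stated assumptions: the `Book` record, the `find_match` name, and the use of `difflib.SequenceMatcher` with a 0.85 threshold as the "close match" test are all stand-ins, since the thread does not say which similarity algorithm is actually used.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class Book:
    title: str
    author: str
    asin: Optional[str] = None
    isbn13: Optional[str] = None
    isbn10: Optional[str] = None


def close(a: str, b: str, threshold: float = 0.85) -> bool:
    """Assumed similarity test: case-insensitive ratio above a threshold."""
    ratio = SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()
    return ratio >= threshold


def find_match(incoming: Book, catalog: List[Book],
               is_kindle: bool) -> Tuple[Optional[Book], List[tuple]]:
    """Return (matching book or None, logged conflicts).

    Kindle editions are looked up by ASIN; other editions by isbn13 first,
    then isbn10. None means the caller would create a new record.
    """
    conflicts = []
    keys = ["asin"] if is_kindle else ["isbn13", "isbn10"]
    for key in keys:
        value = getattr(incoming, key)
        if not value:
            continue
        for book in catalog:
            if getattr(book, key) != value:
                continue
            if close(book.title, incoming.title) and close(book.author, incoming.author):
                return book, conflicts
            # Same identifier but different title/author: log the conflict.
            conflicts.append((key, value, book.title))
    return None, conflicts
```

With a made-up catalog entry `Book("Example Title", "A. Author", asin="B000000001")`, an incoming Kindle record with the same ASIN and a title differing only in case matches, while one with an unrelated title returns `None` plus one logged conflict.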

Some really good thoughts, points - we have a lot of discussions about how to use identifiers and when they should be considered unique, etc etc. :-)

This is based on the untidy Amazon database which has a priority over the really neat Goodreads database?
I would love to know just how the Amazon site is kept up-to-date. I think we can safely assume that they don't have hordes of volunteers keeping publishers/authors/literary agents/bloggers under control. I think we can also assume that if an author/etc wants to swap a cover or an ASIN or a title they can do it on Amazon without much trouble. On a daily basis we get authors asking for such changes to be made on Goodreads that they have already done on Amazon. Now those Amazon changes are likely to result in a new book on Goodreads with the ASIN stolen from the original and probably correct edition. It doesn't seem like a very dependable way of deciding to move an ASIN between editions and it is likely to be a total swine to keep track of.

Personally, I don't think it's really an identifier at all if it's not unique (of course, isbn13 numbers starting with 978 have corresponding isbn10 numbers, but in that case the isbn13 and the isbn10 are each unique).
Rather defeats the purpose of having an identifier if it belongs to more than one work or edition.
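For anyone curious, the 978 correspondence mentioned above is mechanical: drop the 978 prefix and the old check digit, then recompute the ISBN-10 check digit (weights 10 down to 2, modulo 11, with X standing for 10). A small sketch follows; the function name is made up, and it does not validate the ISBN-13's own check digit.

```python
from typing import Optional


def isbn13_to_isbn10(isbn13: str) -> Optional[str]:
    """Convert a 978-prefixed ISBN-13 to its ISBN-10, or return None.

    Numbers with any other prefix (979, bookseller numbers, etc.)
    have no ISBN-10 equivalent.
    """
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit() or not digits.startswith("978"):
        return None
    core = digits[3:12]  # the nine digits shared by both forms
    total = sum((10 - i) * int(d) for i, d in enumerate(core))
    check = (11 - total % 11) % 11
    return core + ("X" if check == 10 else str(check))


# The isbn13 quoted earlier in this thread.
print(isbn13_to_isbn10("9780739468715"))  # 0739468715
```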

Thank you, Sarah, that was the answer I was expecting.
Still, there is ample evidence of amazon_kcw happily *altering* non-blank, valid, human-user-supplied data.
Is this considered a one-time initial bug or just collateral damage to be cleaned up by volunteers?

To hopefully assuage your concerns a bit - we're taking a number of steps to dial back how aggressive the last data import was and to ensure future imports are cleaner. We'll be returning the ASINs stolen in the previous import back to their original books. Future imports won't result in stolen ASINs. We'll also be removing a lot of the books imported without isbns and isbn13s that are likely duplicates of already existing GR books.
Also, cover images, book titles, descriptions, and all other book data that already exists on a book record will *not* be overwritten by imports. If they're being overwritten by a non-user, there's a bug! Make sure they're reported and we will look into them.
And, to give some context to what's going on here:
Over the coming year we're hoping to expand Goodreads into other countries and make it more accessible to users that speak languages other than English. While we have a large and incredible army of librarians, that army won't necessarily scale to meet the needs of users all over the world as we begin to fully expand our library to encompass book editions for more and more countries.
Furthermore, even if we could inspire the help of volunteers around the world as quickly as we'll need them, it wouldn't be the best use of your/their time to ask you/them to manually enter data for every book that exists in the world.
Right now our goal is to use the information we've gathered from this first data import, along with the insights that GR Librarians have shared with us to calibrate our system for clean, non-intrusive imports that will scale as our library grows rapidly over the coming year.
I want to reiterate that we know that you guys do an incredible amount of really really amazing work. Our mission here isn't to undo or undermine what you're doing - it's hopefully to develop a system that will support what you do and make your job easier (even if that's not how it seems while we feel our way through these initial, somewhat tangled steps.)

Do you have some examples? We can take a look to make sure it's not a systematic issue.

Personally, I don't thin..."
Too true. Unfortunately the waters are muddied by some real-world abuses of isbns and isbn13s. We deal with them as best we can!
Books mentioned in this topic
Snobs
The Twelve Dates of Christmas: Dates 1 and 2
Divisadero
Authors mentioned in this topic
Unknown
Various
Avery T. Willis Jr.