Goodreads Librarians Group discussion
note: This topic has been closed to new comments.
Large Book Data Import

One weird behavior I see is the cover image being uploaded twice when it's uploaded by amazon_kcw. I'll look into that.
Also, re: descriptions like "Great story! Book is new and unread." - we'll look into whether there's a way to disregard descriptions entered by individual amazon merchants.

Thank you. I have fixed 4 of these already that I found and I wasn't even looking for them.

KCW seems to be bringing in authors, including a lot of existing GR authors, without the periods between their initials. Anything there?

The publishers with quotes is definitely on our list - essentially the data feed we got had some publishers with two sets of quotes around their names and we didn't successfully strip both sets. That'll get fixed.
As far as author names - there are several issues where the formatting of an author's name in the feed is preventing it from matching to a GR author. I'll add the missing periods between initials to the list if it isn't already there. Do you have a quick example?
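For illustration only, the doubled-quote problem could be handled along these lines. This is a minimal sketch, not the actual import code; the function name `strip_wrapping_quotes` is made up for the example. The point is to loop until no wrapping quotes remain, so a second set is not left behind.

```python
def strip_wrapping_quotes(name: str) -> str:
    """Remove every layer of matching quotes wrapped around a publisher name."""
    cleaned = name.strip()
    # Loop instead of stripping once, so a doubled set of quotes is fully removed.
    while len(cleaned) >= 2 and cleaned[0] == cleaned[-1] and cleaned[0] in "\"'":
        cleaned = cleaned[1:-1].strip()
    return cleaned
```

Quotes inside the name (for example an apostrophe) are left alone, because only a matching pair at both ends is removed.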


https://www.goodreads.com/book/show/2...

I know I have seen public domain editions whose description says something along the lines of "this is an OCR version" and nothing else, with no actual book description. I don't think it is a new issue, but this could be an opportunity to remove that type of wording and maybe just use the default description, hoping the default is at least about the book.

===================================================
This book was converted from its physical edition to the digital format by a community of volunteers. You may find it for free on the web. Purchase of the Kindle edition includes wireless delivery.
I realize this particular example came from a user, but it is what you will find on Amazon, so I would expect it to come in the form of import feeds as well.
https://www.goodreads.com/book/show/1...
===================================================
More information to be announced soon on this forthcoming title from Penguin USA
This example did come from an import
https://www.goodreads.com/book/show/1...
===================================================
No Description Available
This example was from an import as well
https://www.goodreads.com/book/show/1...

===================================================
I think that those types of blurbs are actually quite useful. They say to me "Don't buy me, you can get me for free elsewhere". And probably better formatted.

Well, I suppose but these are free editions typically to begin with. But I prefer to purchase a Penguin, Oxford etc edition instead of going with a free public domain.
I do, though, find it rather annoying when I find a book with this type of description and I want to know what the book is about. I will research the edition, translations etc. once I figure out if I want to read the darn thing, which I can't do from such a description.
I also don't feel they fit guidelines.

This will not scale indeed. Considering the various issues with data imports, multiple language support, multiple character set support, stray editions that need to be combined, duplicated authors, multiple ISBNs without formalized storage thereof, and a data model in need of some revision, just pumping more data into this system will leave it beyond repair, and, I agree, no amount of volunteers will be able to save it. Unless you tackle the architectural/software issues first.
Good Luck.

For amazon_kcw, take http://www.goodreads.com/book/edits/1...
It lists the authors in wrong order, producing an incorrect principal author.
It clutters up the title with additional info (in brackets).
It has no clue of UTF-8 encoding in the description.
It omits the ISBN10 which can be automatically derived from the ISBN13 (well, GR could do that as well but doesn't).
It sets an incorrect *and* invalid language code.
It will leave the combination with the existent work to "volunteers".
Executive Summary: None of the relevant textual information elements provided can be used "as is".
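On the ISBN10 point: the conversion is mechanical for any ISBN13 with the 978 prefix. A minimal sketch follows (the function name is hypothetical, and it does not verify the ISBN13 check digit). ISBN13s with the 979 prefix have no ISBN10 equivalent, so the sketch returns None for them.

```python
from typing import Optional


def isbn13_to_isbn10(isbn13: str) -> Optional[str]:
    """Derive the ISBN-10 for a 978-prefixed ISBN-13, or None if there is none."""
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit() or not digits.startswith("978"):
        return None
    core = digits[3:12]  # the nine digits shared by both forms
    # ISBN-10 check digit: weights 10 down to 2, modulo 11, with 10 written as X.
    total = sum((10 - i) * int(d) for i, d in enumerate(core))
    check = (11 - total % 11) % 11
    return core + ("X" if check == 10 else str(check))
```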

For amazon_sable, take http://www.goodreads.com/book/show/19...
This is not the book. It has just 13 pages. And it is not available from Amazon.
The book is http://www.goodreads.com/book/show/15....

How would you match a given author from Amazon, even with correct "spelling", to multiple existent authors in GR, differentiated only by the varying number of blanks in their names?
E.g. ([] signifies blank)
Amazon: "J.R.[]Smith"
GR: "J.R.[]Smith", "J.R.[][]Smith", "J.R.[][][]Smith"
Which one will you choose?
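To make the ambiguity concrete, here is a rough sketch (hypothetical helper names, not GR's matching code). It builds a comparison key that ignores periods and blanks and returns every existing author sharing that key. When more than one comes back, there is no safe automatic choice and the record would need a human.

```python
import re


def name_key(name: str) -> str:
    """Comparison key that ignores periods, blanks, and letter case."""
    return re.sub(r"[.\s]+", "", name).casefold()


def candidate_authors(feed_name: str, gr_authors: list) -> list:
    """Return every existing author whose key equals the feed name's key."""
    key = name_key(feed_name)
    return [author for author in gr_authors if name_key(author) == key]
```

With the three "J.R. Smith" variants above, `candidate_authors` returns all three, which is exactly the problem: the key can find candidates but cannot pick between them.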

"another weird ISBN"
That's showing up as an "EAN" on some other websites.

Ah cover images. We had a few of Amazon's generic 'no cover' images blacklisted so they wouldn't be imported, but since images are often uploaded by merchants, we couldn't screen for all of them.
If you see more of these, can you let me know their GR book IDs/ASINs? We might be able to expand the blacklist to get rid of useless non-covers.
https://www.goodreads.com/book/show/1...

I tried to undo the change, but it didn't seem to do anything.
Also has the " around Publisher.

The revert function for descriptions does not work. You have to do it manually.

Thank you for the pointer. I'll do it when I'm back at my desktop unless someone else gets to it first. It was a long one that probably has formatting that is a PIA to deal with on a mobile device.

Problem 1: Bad Amazon descriptions overwriting good onix feed descriptions
Solution 1: We're lowering the priority of the sable and kcw feeds to be below our more trusted onix feed sources. Let me know if you folks have any reservations about this - from what I've seen our usual data feeds are more reliable than data coming from amazon merchants.
Problem 2: Even if we enact Solution 1, bad descriptions will surface when no description exists for a book/work
Solution 2: We're looking into where the bad data is coming from. A lot of it seems to be coming from non-Amazon-affiliated merchants who tend to enter information about their particular product quality, shipping policies, etc. We're going to see if there's some way to whitelist only descriptions from trusted merchants or for books sold directly by Amazon itself.
Problem 3: The revert function isn't working for descriptions
Solution 3: I'll look into this or file a ticket to be looked at soon. Does anyone know if this might be happening when a book's description is changed from using the work's default description? Or does it fail to work even when the change is from one non-default description to another non-default description?
(Note - I'm not implying these are all the issues, just trying to keep you all up to date on some of the solutions we're working on!)
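As a rough illustration of Solution 1 and Problem 2 together, a priority rule might look like the sketch below. The numeric priorities, the "librarian" source label, and the function name are assumptions made for the example, not the real configuration. A lower-priority feed may fill an empty description but may not replace text from a more trusted source.

```python
# Hypothetical priorities: higher wins. The values are illustrative only.
SOURCE_PRIORITY = {"librarian": 3, "onix": 2, "amazon_kcw": 1, "amazon_sable": 1}


def should_overwrite(existing_text: str, existing_source: str, incoming_source: str) -> bool:
    """Decide whether an incoming description may replace the existing one."""
    if not existing_text.strip():
        return True  # nothing to protect; this is where Problem 2 still applies
    if incoming_source == existing_source:
        return True  # a feed may refresh its own data
    incoming = SOURCE_PRIORITY.get(incoming_source, 0)
    existing = SOURCE_PRIORITY.get(existing_source, 0)
    return incoming > existing
```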

Three quick examples should all be in the change list for Rivka at the moment, unless she's got to them already.

Personally, I find that particular descriptor useful: it tells me that this is the worst available edition of the book, riddled with errors, possibly missing entire sections of text, and I almost certainly don't want to buy it if there is a different real edition available.
When hand editing, I will sometimes supplement these non-descriptions with the default description or something more directly relevant, but I almost always retain some indication that it's a machine/OCR edition because that is relevant data. I wouldn't want to see that relevant data stripped by another machine.

I don't research my editions in that way on GR, but I also don't use public domain books; I will purchase the Penguin, Oxford etc. book or borrow from the library.
Per manual: The description field is for entering a summary of the work.

https://www.goodreads.com/book/edits/...
The author names were all rather heavily appended with academic credentials. I fixed the primary, but left the secondaries as evidence for now.

So, my question is: shall we not import secondary authors? Or is it better to have a duplicate than to have no data at all? (I don't know the answer, I'm just reporting a problem: of course it's a radical solution not importing secondary authors at all, but having multiple entries for the same author is a problem too)

Solution 1: Yes.
Solution 2: Yes.
Solution 3: Yes. Note: It failed for me either way. Furthermore, there seems to be something fishy in the logging process of description changes. Sometimes an edition will change the [default] description and you will not be able to find a trace of that change in the logs... Maybe that is linked?

Which makes it harder to combine (cause you have to go looking) and always with the " " around the publisher & sometimes the title.
Fixed a whole bunch of those titles yesterday if you want to track some down.

I have trouble understanding why amazon wants to override any such librarian edits. As a reader, I really don't like any solution that overrides existing goodreads book data much less a solution overriding the librarian corrections.
Let's face it, unless filling in the blanks or improving image quality, what librarians are doing is correcting bad data or standardizing it to work with the GR database, series, editions, etc. It doesn't make sense that corrections for bad data get overridden by any data feed, including Amazon's.
In terms of benefits to amazon, well, if bookbuyers cannot find kindle or other editions for sale at amazon because the data feed caused bad author, title or series information—surely that's not what amazon or authors want? (Okay, maybe the authors who are determined to treat goodreads book pages as if their product pages on bookseller sites ... but that's another discussion).
It's just really weird to me to think of having any "solutions" that need volunteer librarians correcting data then letting next data feed undo their efforts. Not unusual for a lot of book data on goodreads to come from the publisher and it's weird that amazon data would override those records as well.

Amazon should not be overriding librarian edits; we are reporting instances where librarian edits are being overridden as bugs. That's the whole point of this thread, yes?

1. Is it possible for tags such as <p> and <br> to be replaced or removed?
2. Is it possible for descriptions to be cleared of non-relevant information such as OCL numbers and links?
amazon_kcw updated the book Lucky: A Memoir by Alice Sebold
description: 'In a memoir hailed for its searing candor and wit, Alice Sebold reveals how her life was utterly transformed when, as an eighteen-year-old college freshman, she was brutally raped and beaten in a park near campus. What propels this chronicle of her recovery is Sebold's indomitable spirit-as she struggles for understanding ("After telling the hard facts to anyone, from lover to friend, I have changed in their eyes"); as her dazed family and friends sometimes bungle their efforts to provide comfort and support; and as, ultimately, she triumphs, managing through grit and coincidence to help secure her attacker's arrest and conviction. In a narrative by turns disturbing, thrilling, and inspiring, Alice Sebold illuminates the experience of trauma victims even as she imparts wisdom profoundly hard-won: "You save yourself or you remain unsaved."' to 'The author describes the circumstances of her rape as an eighteen-year-old college freshman, the arrest and trial of her attacker, and her struggle to reclaim her shattered life...Title: .Lucky..Author: .Sebold, Alice..Publisher: .Little Brown & Co..Publication Date: .2002/09/16..Number of Pages: .12..Binding Type: .PAPERBACK..Library of Congress: .<a href=''http://lccn.loc.gov/BL2002011677'' target=''Library of Congress''>BL2002011677</a>
Dec 12, 2013 11:32PM (#60371416)
Edit page: https://www.goodreads.com/book/edits/...
Book page: https://www.goodreads.com/book/show/2...

Cait wrote: "I am pretty sure that the "Goodreads combined" indicates that a new edition was imported and autocombined with an existing edition with the exact same title. It looks like there was an amazon_kcw import of a Kindle edition with the same timestamp."
Is it possible for that to be stopped, as books with the exact same title are not always the same content? See HERE

The 'edition language' is always set to English although it should be German, French, Spanish, Italian, or Portuguese respectively. Instead, the actual language is added in brackets to the title as '(German Edition)'/'(French Edition)'/'(Spanish Edition)'/etc.
Some random examples:
- German
https://www.goodreads.com/book/edits/...
https://www.goodreads.com/book/edits/...
https://www.goodreads.com/book/edits/...
- French
https://www.goodreads.com/book/edits/...
https://www.goodreads.com/book/edits/...
https://www.goodreads.com/book/edits/...
- Spanish
https://www.goodreads.com/book/edits/...
https://www.goodreads.com/book/edits/...
https://www.goodreads.com/book/edits/...
- Other languages
https://www.goodreads.com/book/edits/...
https://www.goodreads.com/book/edits/...

There seems to have been some problem with our credential/title/prefix stripping code during the import - and by seems I mean there was definitely a problem ;-)
I'm not sure if the issue came up when a name had more than one title (e.g. MD, PhD) or if the filter just wasn't being applied at the proper time.
There's a ticket in progress to clean up the authors created by this import and to try to rematch them with pre-existing authors.
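A sketch of how trailing credentials could be stripped even when a name carries more than one. This is an illustration under assumptions, not the import's actual filter: the credential list is a small made-up sample, and prefixes such as "Dr." are not handled. The loop matters because a single pass removes only the last credential.

```python
import re

# Illustrative sample of credentials; a real list would be much longer.
CREDENTIAL_SUFFIX = re.compile(
    r"[,\s]+(?:M\.?D|Ph\.?D|D\.?D\.?S|R\.?N|M\.?B\.?A)\.?$", re.IGNORECASE
)


def strip_credentials(name: str) -> str:
    """Repeatedly drop a trailing credential until none is left."""
    cleaned = name.strip()
    while True:
        stripped = CREDENTIAL_SUFFIX.sub("", cleaned)
        if stripped == cleaned:
            return cleaned
        cleaned = stripped
```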

Thanks for the feedback - Ticket has been generated and hopefully we can put someone on that soon.

Thanks Sandra - We're talking about adding a few filters to book titles when importing (like searching for and removing the Author's name or the Publisher's name). If you see common patterns - like the one you mentioned in your post - definitely report them.

We're working on improving our name matching algorithm to include all these possible patterns. Like Cait says in her post below yours, no Librarian edits should be getting overwritten. Overwrites are definitely bugs. And we're working to improve our matching code so that it generates *less* work for you all!

1. Is it possible for tags to be replaced or removed?
I'll have to check what our standard practice is for feeds - but that's definitely something that's possible to improve.
2. Is it possible for descriptions to be cleared of non-relevant information such as OCL numbers and links?
This is a much trickier problem to solve programmatically. Removing content requires us to be able to match it to a specific pattern. We could potentially remove links and if OCL numbers are included in a systematic way we could try to weed them out.
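The pattern-based part could be sketched roughly as follows. This is an illustration, not the feed code: the function name is invented, and the regex approach is a simplification that would also eat literal angle brackets in ordinary text. It drops anchor elements entirely, turns <br> and </p> into line breaks, removes any remaining tags, and decodes entities.

```python
import html
import re


def clean_description(raw: str) -> str:
    """Strip links and markup from a feed description, keeping the plain text."""
    # Remove whole links, including their visible text.
    text = re.sub(r"<a\b[^>]*>.*?</a>", "", raw, flags=re.IGNORECASE | re.DOTALL)
    # Line-level tags become newlines so paragraphs do not run together.
    text = re.sub(r"<br\s*/?>|</p\s*>", "\n", text, flags=re.IGNORECASE)
    # Drop any other tags, then decode entities such as &amp;.
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```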

I thought it might be difficult, but decided to report it anyway. Maybe the script can just leave the description blank if it finds certain strings of data in the description? I've seen descriptions that describe the physical condition of the book, for example. I'm not sure which would be the better option.

Cait wrote: "I am pretty sure that the "Goodreads combined" indicates that a new edition was imported and autocombined with an existing edit..."
This is a difficult problem to solve. You're absolutely correct that the same title + same author on two books does not necessarily indicate that they belong to the same work. However, the two choices here are essentially:
1. Make the assumption that two books by the same author with the same title are the same work - this will require exceptions to the rule to be manually corrected.
2. Never assume that two books are part of the same work - this would require every book to be manually added to the appropriate work.
Number 2 generates way more manual work, so we've opted for solution number 1. That being said, the code tries to make the best guess it can about whether a book belongs to a work or not.
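In code terms, option 1 amounts to grouping editions by a key built from author and title. The sketch below is only an illustration of that assumption, with a hypothetical author_id standing in for whatever author matching has already decided; the real code reportedly does more guessing than this.

```python
from collections import defaultdict


def work_key(author_id: int, title: str) -> tuple:
    """Key under option 1: same author plus same title (case and spacing ignored)."""
    return (author_id, " ".join(title.casefold().split()))


def group_into_works(editions: list) -> dict:
    """Group (edition_id, author_id, title) tuples into assumed works."""
    works = defaultdict(list)
    for edition_id, author_id, title in editions:
        works[work_key(author_id, title)].append(edition_id)
    return dict(works)
```

Any two books that share a key land in the same group, which is why the exceptions to the rule need manual correction afterwards.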

Thank you. Is it possible for a blacklist to be created for titles or authors? I separated the works again, and don't want them to be re-combined by the same script.

This is a known-ish issue. We were waiting for some changes to our data model to fix it, which I believe happened last night (I'll know more when the developers responsible for the change get into the office).
To avoid making this issue worse, the amazon_sable data import only imported English books - but we're still getting non-English books from amazon_kcw. We'll have to do some cleanup once the code fix goes out.

I'll have to look into exactly what the sequence of events was for that particular combination before I can say for sure. The improvements we hope to make for our author and title matching algorithms should help some though.

This issue should become less common as of yesterday afternoon - I dropped the priority of amazon_kcw to be below our usual onix feeds, so it should be less likely that an existing description will get overwritten by an Amazon merchant provided description.
The change will only affect new imports, though - we'll still have to deal with ones done previously.

One way might be to have the matching process check for librarian notes on the existing book and return no match if there is a note. You'd still have a problem where the note wasn't about combining, but often books with notes have librarians patrolling the area who would be more likely to notice that new combines are needed. (Of course, that might still fall on the side of "more work to combine stray editions than to separate incorrect ones", even so. Would it be possible to generate a report showing some recent matching action on books with notes so that a human could evaluate it?)
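A rough sketch of what I mean, with made-up field names ('work_id', 'title', 'has_librarian_note') since I don't know the real schema: a title match against a book with a librarian note returns no match, and the skipped work is recorded so a human can review it later.

```python
def match_existing_work(incoming_title: str, candidates: list, review_log: list):
    """Return a matching work_id, or None when there is no match or a note blocks it."""
    wanted = incoming_title.casefold().strip()
    for candidate in candidates:
        if candidate["title"].casefold().strip() != wanted:
            continue
        if candidate["has_librarian_note"]:
            # Do not auto-combine; remember the work for the human-readable report.
            review_log.append(candidate["work_id"])
            return None
        return candidate["work_id"]
    return None
```

The review_log list is the raw material for the report: it shows which noted books the matcher would otherwise have combined into.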

I'll put that suggestion into our list of possible solutions :-) We can do some analysis of how many books have librarian notes to get a better sense of how much of the catalog would be affected by that policy.

Here are two editions where I assume that amazon_sable created the first record under Jeff S. Smith on Dec 21 2013 and amazon_kcw created the second record under Jeff Smith on Jan 2 2014:
http://www.goodreads.com/work/edition...
Despite having different forms of the primary author name, these editions are combined, so I assume that they matched in some way (does the amazon_kcw feed include a list of other editions of the book?). When the second record matched as an edition of the first, it should have taken the first record's author name.
I might have the sequence of events wrong there, but here are some other examples:
http://www.goodreads.com/work/edition...
The Kindle edition was added by amazon_sable on Dec 21 2013 with a primary author of Jeff-1space-Smith but combined with two existing editions which at that time already had a primary author of Jeff-11space-Smith -- those I can confirm already existed with that disambiguated author name.
http://www.goodreads.com/work/edition...
Again, the Kindle edition was added by amazon_kcw on Jan 2 2014 with an author of Jeff-1space-Smith and matched to existing editions with Jeff-3space-Smith.
http://www.goodreads.com/book/show/19...
This is another Kindle edition which ought to have matched to the previous one but did not, created by amazon_kcw on Nov 29 2013 (was there a change in the matching after that?).
I've left all of these records as-is so that you can see the primary authors -- I'll come back for them in a bit. I consider the many, many Jeffs Smith out there one of my librarian responsibilities, as may be evidenced by the zebra-striping of notes on the Jeff-1space-Smith combine page. :)

https://www.goodreads.com/book/show/1...
https://www.goodreads.com/book/show/1...
These appear to have been created ~2 seconds apart on Dec 12 2013 by amazon_kcw. They were not combined with each other (or any other editions) despite being, as far as I can see, identical. Perhaps they came in with different id numbers and both failed a title match, prompting separate creation of alternate records -- is that something which can be checked?
(As for the question of the match, I don't think there's much that can be done about that when the record comes in as a library-bound edition with a publisher of "Turtleback" instead of the actual publisher, is there? If there is, there is an edition already existing for "The Dragonslayer (Bone)" which would match to these books' "The Dragonslayer (Turtleback School & Library Binding Edition) (Bone (Prebound))" if all of the extraneous formatting notes were stripped out of the title string.)

Sarah wrote: "I'll put that suggestion into our list of possible solutions :-) We can do some analysis of how many books have librarian notes to get a better sense of how much of the catalog would be affected by that policy."
Yay! :)

isbndb updated the book Out Of It by Stuart Walton
title: 'Out of It: A Cultural History of Intoxication' to 'Out Of It: A Cultural History Of Intoxication'
And Onix has done a similar import:
https://www.goodreads.com/topic/show/...
They are corrected now.
I believe he pointed out one in Msg 77."
Woops - thanks for pointing me there. Missed in my first glance through the page. Taking a look.