TCDL: Karma – A Data Integration Tool
Today I got to attend a half-day workshop at the 2015 Texas Conference on Digital Libraries (TCDL) on Karma, an open source data integration tool. Pedro Szekely was our instructor, and he started by warning us that he knows very little about libraries but a ton about data.
The files we needed for the workshop are all in Github if you’re interested in checking them out. You can follow the tutorial steps on the wiki. And of course you can find Karma itself on Github as well.
The Basics
Karma is a web-based tool that runs both the server and the browser right on your own machine, so we had computers with the tool installed to play with.
Karma is an open source tool that makes it easy to convert data from a variety of formats into Linked Data.
Users load into Karma the ontologies for their application and data samples of each of the data files to be converted. Karma makes the conversion process easy as it provides an intuitive graphical user interface to visualize and edit the mapping of data files to ontologies; Karma is flexible as it can import data from a wide variety of data formats (SQL, XML, JSON, CSV, Excel, AVRO, Web-Services) … Karma scales to very large datasets (40 million documents, 1 billion triples) and can refresh periodically (e.g., every hour); Karma is a free, open source tool.
Hands On
The rest of the workshop was hands-on experience with Karma. We played with some sample data.
After we loaded some data into Karma we mapped it to a few ontologies. When we clicked on the title field, for example, Karma gave us 4 suggestions for what our titles might need to be mapped to. It knew how to make these suggestions because the tool learns from past mappings (even if you made a mapping mistake in the past). This can be a huge time saver if you’re often working with the same types of data. Pedro did remind us that Karma does not know the right mapping; the user gets to choose whatever they want – even if it’s “wrong”.
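To make the "it learns from what you mapped before" idea concrete, here is a toy sketch in Python. This is not how Karma does it (its learning is surely more sophisticated); the MappingSuggester class and the term names like "dcterms:title" are made up for illustration. It only shows that every accepted mapping, right or wrong, feeds the later suggestions.

```python
from collections import Counter, defaultdict


class MappingSuggester:
    """Toy model: remember which ontology term each column name was mapped to."""

    def __init__(self):
        # column name (lowercased) -> how often each term was chosen for it
        self._history = defaultdict(Counter)

    def record(self, column_name, ontology_term):
        # Store every accepted mapping, including mistaken ones.
        self._history[column_name.strip().lower()][ontology_term] += 1

    def suggest(self, column_name, limit=4):
        # Return the most frequently chosen terms first, up to `limit`.
        counts = self._history.get(column_name.strip().lower(), Counter())
        return [term for term, _ in counts.most_common(limit)]


suggester = MappingSuggester()
suggester.record("Title", "dcterms:title")
suggester.record("title", "dcterms:title")
suggester.record("title", "skos:prefLabel")  # maybe a mistake, still remembered
print(suggester.suggest("TITLE"))
```

The user still picks the final mapping; the suggestions are only a shortcut.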
Once your data is in Karma, you can use Python scripts to clean it up if you’d like. Each column has a ‘PyTransform’ option in the menu. I personally have never written Python, but it looks pretty simple, and Pedro assured us that before he used Karma he also didn’t know Python but found that every question he had had already been asked and answered on StackOverflow.
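For a sense of what that kind of cleanup can look like, here is a minimal sketch of a title-tidying function in plain Python. The clean_title name and the sample value are my own assumptions, and how the cell value actually gets in and out of a PyTransform is Karma-specific and not shown here.

```python
import re


def clean_title(raw):
    """Tidy a messy title: collapse whitespace and drop trailing punctuation."""
    if raw is None:
        return ""
    # Collapse runs of spaces, tabs and newlines into single spaces.
    text = re.sub(r"\s+", " ", raw).strip()
    # Remove trailing separators that often cling to catalog titles.
    return text.rstrip(" /.:;,")


print(clean_title("  Moby   Dick ; or, The Whale /  "))
```

Run on the sample value, this prints "Moby Dick ; or, The Whale".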
Once you’re done working with your data you can then generate RDF, MySQL, JSON or many other formats for use with web applications.
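To give a feel for what "generate RDF" means, here is a small sketch that turns one row of data plus a column-to-property mapping into N-Triples lines. It is not Karma's exporter; the row_to_ntriples function, the example.org item URI, and the sample row are assumptions for illustration. The two Dublin Core property URIs are the real ones.

```python
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_ntriples(subject_uri, row, mapping):
    """Build one triple per mapped column that has a non-empty value."""
    lines = []
    for column, predicate in mapping.items():
        value = row.get(column)
        if not value:
            continue  # skip unmapped or empty cells
        lines.append('<%s> <%s> "%s" .' % (subject_uri, predicate, escape_literal(value)))
    return lines


# Hypothetical mapping and row, standing in for what you would set up in the tool.
mapping = {"title": DC_TITLE, "author": DC_CREATOR}
row = {"title": "Moby Dick", "author": "Herman Melville", "notes": ""}
triples = row_to_ntriples("http://example.org/item/1", row, mapping)
for triple in triples:
    print(triple)
```

Each printed line is one statement about the item, which is the shape of data that Linked Data applications consume.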
When we were editing the data in a column Pedro made a very funny comment about one of the options we had to choose from. He said “You should never do that” and when asked why it was an option at all he said “because someone asked us to add it.” This is a question I find myself answering the same exact way when teaching people how to use open source tools. Open source is full of features that are there just because someone asked for them.
Conclusions
What I learned from this workshop is that Karma is awesomely powerful! We have so much messy data out there that a tool like this can be very handy – and of course it’s open source, which just makes it that much more appealing. I also learned that I’m probably not really cut out for working with a tool like Karma on a daily basis, but I know a lot of people who will be and I hope this summary will help them out.
Links/Resources
TCDL Sample/Tutorial
Karma on Github
Possible Schemas (from our tutorial)
DPLA
SKOS
EDM
Dublin Core
Cool URIs
The post TCDL: Karma – A Data Integration Tool appeared first on What I Learned Today....