Topic Discovery Using Machine Learning

By guest contributor Ergest Xheblati


Tiago recently asked me whether I could do something interesting with his Instapaper article collection, which is stored in Evernote. I had just completed a similar project at work, where I looked at product descriptions and generated tags from them, and I thought I could do something similar for Tiago’s project.


My favorite tool for this is KNIME, a free, visual data science platform with all the capabilities you need for NLP (Natural Language Processing). I thought I could use the LDA (Latent Dirichlet Allocation) algorithm in KNIME to do unsupervised (aka unaided) topic discovery. Unsupervised simply means that the algorithm discovers the topics without being told about them beforehand, so it doesn’t need any labeled examples to learn from.


The intuition behind the algorithm is quite simple, though the statistics are not. Given a group of documents, assume that they contain only a small number of topics or themes, and that each word in a document can be attributed to one of those topics. The output of the algorithm is a collection of topics labeled topic_1, topic_2 … topic_n (because it doesn’t know what they actually are) and, for each topic, the group of words that make up that topic.
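
To make that intuition concrete, here is a minimal sketch of LDA in plain Python, using a collapsed Gibbs sampler. This is not what KNIME runs under the hood, and it is not the workflow from this project. The function names (`lda_gibbs`, `top_words`), the hyperparameter values, and the four toy documents are all made up for illustration. The only point is to show the shape of the output: anonymous topics, each with a group of words.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.

    alpha and beta are smoothing hyperparameters (illustrative values).
    Returns (topic_word, doc_topic) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]          # words per topic, per document
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # total words per topic

    # Start by assigning every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the word's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # A topic is likely if this document already uses it a lot
                # and if it already contains this word a lot.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Label topics topic_1..topic_n and list each topic's most frequent words."""
    return {
        f"topic_{k + 1}": [w for w, c in counts.most_common(n) if c > 0]
        for k, counts in enumerate(topic_word)
    }


# Invented toy corpus: two documents about cooking, two about software.
docs = [
    "recipe oven bake flour recipe sugar".split(),
    "bake flour sugar oven butter".split(),
    "python code bug test code deploy".split(),
    "deploy test python bug server".split(),
]

topic_word, doc_topic = lda_gibbs(docs, n_topics=2)
topics = top_words(topic_word)
for name, words in topics.items():
    print(name, words)
```

Notice that the number of topics is something you pick up front, and that the algorithm only hands back numbered labels. Deciding that one topic is "cooking" and another is "software" is left to whoever reads the word lists.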




Published on October 10, 2019 08:04