Benjamin Bengfort's Blog, page 2
August 31, 2016
Principal Component Analysis with Python
The amount of data generated each day from sources such as scientific experiments, cell phones, and smartwatches has been growing exponentially over the last several years. Not only is the number of data sources increasing, but the data itself is also growing richer as the number of features in the data increases. Datasets with a large number of features are called high-dimensional datasets.
One example of high-dimensional data is high-resolution image data, where the features are pixels, and w...
August 2, 2016
NLP Research Lab Part 2: Skip-Gram Architecture Overview
Editor's Note: This post is part of a series based on the research conducted in District Data Labs' NLP Research Lab. Make sure to check out NLP Research Lab Part 1: Distributed Representations.
Chances are, if you’ve been working in Natural Language Processing (NLP) or machine learning, you’ve heard of the class of approaches called Word2Vec. Word2Vec is an implementation of the Skip-Gram and Continuous Bag of Words (CBOW) neural network architectures. At its core, the skip-gram appr...
July 27, 2016
NLP Research Lab Part 1: Distributed Representations
Editor's Note: This post is part of a series based on the research conducted in District Data Labs' NLP Research Lab.
This post is about Distributed Representations, a concept that is foundational not only to the understanding of data processing in machine learning, but also to the understanding of information processing and storage in the brain. Distributed representations of data are the de facto approach for many state-of-the-art deep learning techniques, notably in the area of Nat...
July 26, 2016
Beyond the Word Cloud
In this article, we explore two extremely powerful ways to visualize text: word bubbles and word networks. These two visualizations are replacing word clouds as the de facto text visualization of choice because they are simple to create, understandable, and provide deep and valuable at-a-glance insights. In this post, we will examine how to construct these visualizations from a non-trivial corpus of news and blog RSS feeds. We begin by investigating the importance of text visualization. Next,...
June 9, 2016
District Data Labs PyCon Recap
Last week, a group of us from District Data Labs flew to Portland, Oregon to attend PyCon, the largest annual gathering for the Python community. We had a talk, a tutorial, and two posters accepted to the conference, and we also hosted development sprints for several open source projects. With this blog post, we are putting everything together in one place to share with those who couldn't be with us at the conference.
Tutorial: Natural Language Processing with NLTK and Gensim
May 25, 2016
Visual Diagnostics for More Informed Machine Learning: Part 3
Note: Before starting Part 3, be sure to read Part 1 and Part 2!
Welcome back! In this final installment of Visual Diagnostics for More Informed Machine Learning, we'll close the loop on visualization tools for navigating the different phases of the machine learning workflow. Recall that we are framing the workflow in terms of the 'model selection triple' — this includes analyzing and selecting features, experimenting with different model forms, and evaluating and tuning fit...
Preparing for NLP with NLTK and Gensim
This post is designed to point you to the resources that you need in order to prepare for the NLP tutorial at PyCon this coming weekend! If you have any questions, please contact us according to the directions at the end of the post.
In this tutorial, we will explore the features of the NLTK library for text processing in order to build language-aware data products with machine learning. In particular, we will use a corpus of RSS feeds that have been collected since March to create supervis...
May 23, 2016
Visual Diagnostics for More Informed Machine Learning: Part 2
Note: Before starting Part 2, be sure to read Part 1!
When it comes to machine learning, ultimately the most important picture to have is the big picture. Discussions of (i.e., arguments about) machine learning are usually about which model is the best. Whether it's logistic regression, random forests, Bayesian methods, support vector machines, or neural nets, everyone seems to have their favorite! Unfortunately, these discussions tend to truncate the challenges of machine learning into a s...
May 19, 2016
Visual Diagnostics for More Informed Machine Learning: Part 1
How could they see anything but the shadows if they were never allowed to move their heads?
— Plato The Allegory of the Cave
Python and high-level libraries like Scikit-learn, TensorFlow, NLTK, PyBrain, Theano, and MLPY have made machine learning accessible to a broad programming community that might never have found it otherwise. With the democratization of these tools, there is now a large, and growing, population of machine learning practitioners who are primarily self-taught. At t...
May 11, 2016
Named Entity Recognition and Classification for Entity Extraction
The overwhelming amount of unstructured text data available today from traditional media sources as well as newer ones, like social media, provides a rich source of information if the data can be structured. Named Entity Extraction is a core subtask in building knowledge from semi-structured and unstructured text sources. Some of the first researchers working to extract information from unstructured texts recognized the importance of “units of information” like names (such as person, organiza...


