Benjamin Bengfort's Blog, page 4

February 2, 2015

Getting Started with Spark (in Python)

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storag...

Published on February 02, 2015 19:30

January 8, 2015

Creating a Hadoop Pseudo-Distributed Environment

Hadoop developers usually test their scripts and code on a pseudo-distributed environment (also known as a single-node setup), which is a virtual machine that runs all of the Hadoop daemons simultaneously on a single machine. This allows you to quickly write scripts and test them on limited data sets without having to connect to a remote cluster or pay the expense of EC2. If you're learning Hadoop, you'll probably also want to set up a pseudo-distributed environment to facilitate your...

Published on January 08, 2015 06:56

November 8, 2014

Simple CSV Data Wrangling with Python

I wanted to write a quick post today about a task that most of us do routinely but often think very little about: loading CSV (comma-separated values) data into Python. This simple action has a variety of obstacles that need to be overcome due to the nature of serialization and data transfer. In fact, I'm routinely surprised by how often I have to jump through hoops to deal with this type of data, when it feels like it should be as easy as JSON or other serialization formats.
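
As a rough illustration of the kind of loading task described here, the sketch below reads CSV text with the standard library's `csv.DictReader`. The sample data, the column names, and the `load_rows` helper are all hypothetical and invented for this example (and it assumes Python 3), so treat it as one possible approach rather than the method the post goes on to describe.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
# Note the quoted field containing a comma, a common CSV pitfall.
SAMPLE = 'name,age,city\nAlice,30,"Washington, DC"\nBob,25,Baltimore\n'


def load_rows(handle):
    """Yield one dict per CSV row, converting the 'age' column to int."""
    reader = csv.DictReader(handle)
    for row in reader:
        # CSV carries no type information, so every value arrives as a string.
        row["age"] = int(row["age"])
        yield row


rows = list(load_rows(io.StringIO(SAMPLE)))
print(rows[0]["city"])
```

Unlike JSON, CSV gives the reader no types, so the conversion step has to be written by hand for each column.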

The basic proble...

Published on November 08, 2014 07:53

October 23, 2014

Conditional Probability with R

In addition to regular probability, we often want to figure out how probability is affected by observing some event. For example, the NFL season is rife with possibilities. From the beginning of each season, fans start trying to figure out how likely it is that their favorite team will make the playoffs. After every game the team plays, these probabilities change based on whether they won or lost. This post won't speak to how these probabilities are updated. That's the subject for a f...

Published on October 23, 2014 14:12

September 11, 2014

Computing a Bayesian Estimate of Star Rating Means

Consumers rely on the collective intelligence of other consumers to protect themselves from coffee pots that break at the first sign of water, bad food at the wrong restaurant, and stunning flops at the theater. Although occasionally there are metrics like Rotten Tomatoes, we primarily prejudge products we would like to consume through a simple 5-star rating. This methodology is powerful, because not only does it provide a simple, easily understandable metric, but people are generally...

Published on September 11, 2014 10:01

August 20, 2014

How to Develop Quality Python Code

Developing in Python is very different from developing in other languages. Python is an interpreted language like Ruby or Perl, so developers are able to use read-evaluate-print loops (REPLs) to execute Python in real time. This feature of Python means that it can be used for rapid development and prototyping because there is no build process. Python includes many functional programming tools akin to Scala or JavaScript to assist with closure-based script development. But Python is also a ful...

Published on August 20, 2014 12:01

August 18, 2014

What Are the Odds?

The probability of an event represents how likely the event is to occur. For example, most of us would agree that the probability of getting heads after flipping a fair coin is 0.5 or that the probability of getting a one on rolling a fair die is 1/6. However, there are many more places where we encounter probabilities in our lives. During election season, we have pundits and polls speaking to the likelihood (probability) of winning for each candidate. Doctors will often state that a pa...
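
The coin and die figures quoted in this excerpt can be checked with a short sketch that counts favorable outcomes over equally likely ones. The `probability` helper and the variable names are hypothetical, introduced only for this illustration, and the approach assumes every outcome in the sample space is equally likely.

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(event) for equally likely outcomes: favorable count / total count."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))


coin = ["H", "T"]          # a fair coin
die = [1, 2, 3, 4, 5, 6]   # a fair six-sided die

p_heads = probability(lambda outcome: outcome == "H", coin)
p_one = probability(lambda outcome: outcome == 1, die)

print(p_heads, p_one)  # 1/2 1/6
```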

Published on August 18, 2014 05:34

August 8, 2014

How to Transition from Excel to R

In today's increasingly data-driven world, business people constantly say they want more powerful and flexible analytical tools. They are usually intimidated, however, by the programming knowledge these tools require and by the learning curve they must overcome just to reproduce what they already know how to do in the programs they are accustomed to. For most business people, the go-to tool for anything analytical is Microsoft Excel.

If you're an E...

Published on August 08, 2014 12:01