Dharmesh Kakadia's Blog
March 4, 2022
Crypto Removes Friction
There has been a lot of debate recently about crypto, casting it either as the solution to all the world's problems (“Bitcoin fixes this”) or as the biggest Ponzi scheme. Instead of arguing which side is correct, I want to offer an alternative lens: looking at crypto through technological progress.
All technologies remove friction
All technologies are tools for removing friction. Good and bad uses of technology are different sides of the same coin and are entirely dependent on the use of technology by the peopl...
March 13, 2021
What I Consumed This Week
I believe that “what you consume consumes you”, so here is my diet from last week.
# Podcasts & Videos
Tim Urban - Exploring Ourselves on Infinite Loops. Great discussion and frameworks.
Jonathan Neman - Building the Modern Restaurant on Invest Like the Best. Good discussion on traditional businesses embracing tech & economics of food delivery. Also, check out Doordash and Pizza Arbitrage
Meghan Oprah Interview.
# Articles
Prediction Markets: Tales from the Election
Do Amazon ads bring in m...
January 19, 2020
Internals of Spark Parser
In this post we will try to demystify the Spark parser and show how to implement a very simple language using the same parser toolkit that Spark uses.
# Intro
Apache Spark is a widely used analytics and machine learning engine, which you have probably heard of. You can use Spark with various languages - Scala, Java, Python - to perform a wide variety of tasks - streaming, ETL, SQL, ML or graph computations. Spark SQL/dataframe is one of the most popular ways to interact with Spark...
December 26, 2019
Verifying links with Github actions & Awesome Bot
Recently I started using GitHub Actions to automate link checking in all of my awesome repos. I have been using awesome_bot with Travis to validate links and check for duplicates for the past 2+ years. I decided to give GitHub Actions a try with this very simple automation. GitHub Actions is very rich and can automate a lot of chores for developers. There are a number of existing actions available in the GitHub Marketplace. However, I couldn’t find one that allows me to verify links in markdown. So l...
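To make the two checks mentioned above concrete (finding links in markdown and spotting duplicates), here is a minimal Python sketch. It is not how awesome_bot is implemented and it makes no network requests; the function names and the example.com URLs are made up for illustration, and it only handles inline links of the form [text](url).

```python
import re
from collections import Counter
from typing import List

# Matches inline markdown links such as [text](https://example.com/page).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def extract_links(markdown: str) -> List[str]:
    """Return every http(s) URL used as an inline link target, in document order."""
    return LINK_PATTERN.findall(markdown)


def find_duplicates(links: List[str]) -> List[str]:
    """Return the URLs that occur more than once, in first-seen order."""
    counts = Counter(links)
    return [url for url, n in counts.items() if n > 1]


# Hypothetical markdown content, used only to exercise the functions.
sample = """
- [First](https://example.com/a)
- [Second](https://example.com/b)
- [First again](https://example.com/a)
"""

links = extract_links(sample)
print(links)
print(find_duplicates(links))
```

A real checker would also request each URL and report the ones that fail, which is the part a tool like awesome_bot takes care of.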
December 4, 2018
Versatile RStudio development environment on Kubernetes
R is a very versatile language for data analysis, widely used for data science and exploration alongside Python. RStudio is a great IDE for exploring data using R. It has a lot of powerful features for writing and debugging R code, but using it on large data can be challenging due to:
Scalability
Privacy and security of data
Ability to connect R workflows with other tools (Spark, Tensorflow etc.)
Backing up the R code automatically
We solve these challenges by running RStudio o...
December 1, 2018
MXNet tools in docker
How to convert MXNet model to Apple CoreML:
docker run -v "$PWD":/data --rm -it dharmeshkakadia/mxnet-coreml-tools-docker python mxnet_coreml_converter.py
For example, suppose you want to convert the Squeezenet model to CoreML for use with iOS.
Run the following from a directory containing the Squeezenet model files (params, symbols, labels); it will generate squeezenetv11.mlmodel in the current directory.
March 31, 2018
Review - Are Ideas Getting Harder to Find?
This is a review of a recent paper Are Ideas Getting Harder to Find? by Charles I. Jones. Slides are also available.
The core of the paper is answering that question with the following formula:
Economic growth = Research productivity × Number of researchers
The paper presents evidence and arguments that even though economic growth has been relatively stable over the years, there is a clear downward trend in research productivity. This is compensated by more and more people getting...
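As a quick illustration of the formula, the sketch below holds growth fixed and solves for research productivity as the number of researchers rises. The numbers are hypothetical and chosen only to show the arithmetic; they are not taken from the paper.

```python
def research_productivity(growth: float, researchers: float) -> float:
    """Solve growth = productivity * researchers for productivity."""
    return growth / researchers


# Hypothetical values: growth held constant while the researcher count doubles.
growth = 0.02
for researchers in (100, 200, 400):
    productivity = research_productivity(growth, researchers)
    print(researchers, productivity)
```

Each doubling of researchers at constant growth halves the implied productivity, which is the pattern the formula is meant to expose.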
March 8, 2018
Automate SQL server data pipelines with Kubernetes
Kubernetes provides a great way to run modern infrastructure. SQL Server is a widely deployed database. When you combine the two, you get a robust way of running a data pipeline on a modern platform.
Data pipelines are a large part of all data infrastructure. The need to move data between different systems is almost universal, and the tools and processes that achieve this are generally referred to as a data pipeline. In this post we will see how we can leverage the Kubernetes Jobs API to build and run data pi...
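For readers unfamiliar with Kubernetes Jobs, here is a minimal Python sketch that builds a batch/v1 Job manifest as a plain dictionary and prints it as JSON. It is not the manifest from the post: the job name, image, and command are hypothetical placeholders, and the helper function is invented for this example.

```python
import json
from typing import Dict, List


def make_job_manifest(name: str, image: str, command: List[str]) -> Dict:
    """Build a run-to-completion Kubernetes Job manifest (batch/v1)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Retry a failed pod at most twice before marking the Job failed.
            "backoffLimit": 2,
            "template": {
                "spec": {
                    # Job pods must use Never or OnFailure, not Always.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


# Hypothetical job: the name, image, and script below are placeholders.
manifest = make_job_manifest(
    "copy-orders",
    "example.com/sql-copy:latest",
    ["python", "copy_table.py"],
)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be saved to a file and submitted with kubectl apply -f.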
January 9, 2018
Write a Presto query logging plugin
Presto is a fast distributed SQL query engine for big data. I wrote a more introductory, up-and-running post a while back.
Presto users frequently [1, 2, 3, 4] want the ability to log various details about queries and their execution. This is very useful for operationalizing Presto in any organization. Logging query details allows a team to understand how Presto is used, provide operational analytics, and identify performance bottlenecks. If you want to know how to ac...
December 24, 2017
Analyzing Azure Storage Performance
I work on the performance of big data systems at Azure HDInsight, and as part of benchmarking I often need to analyze the performance of cloud storage. The performance of the storage system plays a critical role in the performance of cloud big data systems. Even though there are public benchmarks available for these systems, it's important to measure performance for your own workload. In that spirit, we will see how to leverage storage logs for benchmarking your big data workload on Azure ...