How to get into Machine Learning for a Haskeller

2018-08-21

I've been interested in Haskell for about 5 years now. I've been using it professionally for about 3 years. Most of the work I've done so far has been web-related. A large part has been writing web application backends in Haskell.

I really enjoy using Haskell as a programming language, but I would like to expand my skill set.

When looking to level up, it seems like a lot of Haskellers focus on improving their skills in one of the following areas:

  1. type systems / formal methods
  2. compilers / programming languages
  3. speed / performance
  4. interesting abstractions

However, I want to focus on something else. I want to focus on data science.

The Road Less Travelled

Compared to web applications and compilers, there are not many Haskellers who are interested in data science. However, data science is a huge field. It encompasses everything from data visualization to deep learning. As a beginner, I find it tough to know where to start, and hard to figure out where I can leverage my Haskell knowledge to get some quick wins.

I decided to email a few people from the Haskell community and ask their opinion.

I asked two main questions:

  1. What area of data science should I focus on, given my existing Haskell skills? Should I focus on data science as a whole, or something more specific like machine learning?

  2. Given your answer for the first question, are there any learning resources you would specifically recommend?

They all were gracious enough to email me back detailed responses.

Answers

I received great answers from Marco Zocca, Dominic Steinitz, and Joseph Abrahamson. I introduce each of them below and share their responses1.

Marco Zocca

Marco is a data scientist and is currently organizing the DataHaskell movement. He's active on Twitter and Github.

Marco starts off his response talking a little bit about what "data science" consists of.

I find the "data science" branding to be too loose. It really describes a broad range of competencies that span data visualization, statistics, computer science, etc. The most common course of action is to start with a quantitative question (e.g. "how many new houses will be built in Boston next year?"), gather relevant data, eyeball it with some sort of visualization, and fit a model to it. This last step is where it really gets interesting, especially for people with a more theoretical bent. Model fitting is the starting point of most machine learning techniques ("pick a function f such that it predicts the output y given a dataset X with good accuracy").

Granted, visualization and communication are also excellent crafts that take years to perfect. They connect with people more than raw numbers do, and they are used to create the necessary understanding of the data. [...]

Indeed, there are people building whole careers out of each of these competencies, so it's hard to give one-size-fits-all advice.

He is talking about a common theme: the label "data science" is too broad. It encompasses a huge range of skills. It is possible to have a career focusing on any one of these subfields.
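
Going back to the model-fitting step Marco describes ("pick a function f such that it predicts the output y given a dataset X"), here is a minimal sketch of the simplest possible version of it in plain Haskell: choosing a straight line by ordinary least squares and using it to answer a question like the housing one above. The data points are made up purely for illustration.

    -- A minimal sketch of model fitting: choose a line y = a + b*x by
    -- ordinary least squares. The data points are invented for illustration.
    fitLine :: [(Double, Double)] -> (Double, Double)
    fitLine points = (intercept, slope)
      where
        n         = fromIntegral (length points)
        xs        = map fst points
        ys        = map snd points
        meanX     = sum xs / n
        meanY     = sum ys / n
        -- slope = covariance(x, y) / variance(x)
        slope     = sum [ (x - meanX) * (y - meanY) | (x, y) <- points ]
                  / sum [ (x - meanX) ^ 2 | x <- xs ]
        intercept = meanY - slope * meanX

    predict :: (Double, Double) -> Double -> Double
    predict (a, b) x = a + b * x

    main :: IO ()
    main = do
      let housing = [(1, 120), (2, 150), (3, 190), (4, 210)]  -- (year, houses built)
          model   = fitLine housing
      print model               -- the fitted (intercept, slope)
      print (predict model 5)   -- prediction for "next year"

Real problems obviously need richer models and proper validation, but the overall shape is the same: pick a family of functions, then pick the member of that family that best fits the data.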

He then goes on to talk a little more about the mathematical aspects of data science:

Part of my studies was in applied math (numerical optimization under uncertainty), so I greatly enjoy mathematically well-founded expositions of machine learning.

My personal favorite introductory textbook is Bishop's Pattern Recognition and Machine Learning. It has many nice pictures and formulas. It covers all of the basic ideas, from both the "frequentist" and Bayesian views of probability theory. This makes many applications click together beautifully. Bishop also explains a few advanced applications in great detail. However, it was written in 2006, before deep learning started becoming really popular.

Another personal favorite textbook is Murphy's Machine Learning: A Probabilistic Perspective. This book is more recent but still gives an excellent theoretical foundation. After reading these two books, you'll be able to make informed choices on which ideas can be applied to which problems.

Marco makes two excellent book recommendations:

  1. Bishop's Pattern Recognition and Machine Learning
  2. Murphy's Machine Learning: A Probabilistic Perspective

It sounds like both of these books cover the basic ideas and provide a good theoretical foundation.

Marco then goes on to give a specific recommendation for beginners:

My suggestion is that you get a good feeling for foundational algorithms such as Expectation Maximization, which reappear regularly in applications such as "backpropagation" learning for neural networks.

At the end of the day, most of machine learning relies in various proportions on a few fields of knowledge, i.e. linear algebra, convex optimization, probability and information theory (all of these are subfields of functional analysis if you will), though computer science notions such as tree and graph algorithms and computational complexity are also very useful.

It sounds like the two books above give a good introduction to this type of foundational knowledge, so they should definitely be on your reading list!
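
I won't pretend to understand Expectation Maximization in any depth yet, but to get a feel for the shape of the algorithm, here is a rough sketch of EM for a two-component, one-dimensional Gaussian mixture in plain Haskell. The data, starting guesses, and fixed iteration count are all placeholders; a real implementation would use a proper numerics library and a convergence check.

    -- A rough sketch of Expectation Maximization for a two-component,
    -- one-dimensional Gaussian mixture, using only the Prelude.
    -- Data, starting parameters, and iteration count are placeholders.
    data Component = Component
      { weight :: Double  -- mixing proportion
      , mu     :: Double  -- mean
      , var    :: Double  -- variance
      }

    -- Density of a one-dimensional Gaussian.
    gaussian :: Double -> Double -> Double -> Double
    gaussian m v x = exp (negate ((x - m) ^ 2) / (2 * v)) / sqrt (2 * pi * v)

    -- E-step: the responsibility of each component for one data point.
    responsibilities :: [Component] -> Double -> [Double]
    responsibilities comps x = map (/ total) ps
      where
        ps    = [ weight c * gaussian (mu c) (var c) x | c <- comps ]
        total = sum ps

    -- M-step: re-estimate each component from the responsibility-weighted data.
    mStep :: [Double] -> [[Double]] -> [Component]
    mStep xs resps =
      [ Component (nk / n) mk vk
      | k <- [0 .. length (head resps) - 1]
      , let rk = map (!! k) resps
            nk = sum rk
            n  = fromIntegral (length xs)
            mk = sum (zipWith (*) rk xs) / nk
            vk = sum (zipWith (\r x -> r * (x - mk) ^ 2) rk xs) / nk
      ]

    -- One full EM iteration: E-step over every point, then the M-step.
    emStep :: [Double] -> [Component] -> [Component]
    emStep xs comps = mStep xs (map (responsibilities comps) xs)

    main :: IO ()
    main = do
      let points = [1.0, 1.2, 0.8, 5.1, 4.9, 5.3]          -- toy data
          start  = [Component 0.5 0 1, Component 0.5 6 1]  -- rough initial guesses
          fitted = iterate (emStep points) start !! 20     -- fixed iteration count
      mapM_ (\c -> print (weight c, mu c, var c)) fitted

Even in this toy form, the two-phase structure (compute responsibilities, then re-estimate the parameters) is the part that carries over to the real applications Marco mentions.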

Marco then gives a great introduction to the current landscape for doing machine learning in Haskell.

Getting to Haskell, we still have a very fragmented landscape. People interested in numerics/machine learning AND Haskell are few and far between, but I also notice a growing interest. DataHaskell definitely brought together many people who were previously working in isolation, which is what keeps me going at the end of the day. I've taken a sort of "editorial" role because I deeply believe in the usefulness of this project.

If you take a look at our knowledge base, you can see a growing collection of packages, classified by scope or application. There are still gaps, and the packages vary in scope and quality. But we're slowly getting there.

If you can, drop by our Gitter chatroom. There are people hanging out there more or less all the time.

If you're interested in both data science and Haskell, it sounds like DataHaskell is the community to join!

Dominic Steinitz

Dominic is working as a Haskeller and data scientist at Tweag I/O. He has a blog, as well as Twitter and Github.

In his response, Dominic recommends the Machine Learning online course by Andrew Ng on Coursera. He says it gives a good overview of a lot of the techniques of machine learning.

Dominic also recommends two books on machine learning:

  1. Bishop's Pattern Recognition and Machine Learning
  2. MacKay's Information Theory, Inference, and Learning Algorithms

He also mentions the video lectures that go along with the Information Theory book.

Dominic says the following about these two books:

Even though they are now quite old, you will find they cover pretty much every technique in a readable fashion. Then, if you need to, you can move on to more specialised books and papers.

I also asked Dominic what sorts of skills he looks for when hiring people for data science roles at Tweag I/O. He says the following:

I would want them to demonstrate some previous work where they had applied their skills / knowledge. A lot of statistical / data science / machine learning is about knowing how to model and which techniques to apply and why. Only practical experience can provide that. Of course it doesn't have to be paid experience.

He echoes the sentiment from Marco that knowing how to model a particular problem is very important. You must know when to apply which techniques. This seems to be another common thread when talking about data science and machine learning.

Joseph Abrahamson

You probably know Joseph as tel. You've almost certainly seen him on Reddit, Twitter, Hacker News, Stack Overflow, or Github.2

Joseph's response starts off similarly to Marco's:

Data science is a BIG field, encompassing everything from business analytics, to "data engineering" with a focus on databases, to a more traditional focus on statistics (maybe even something like biostatistics).

A more modern spin comes from artificial intelligence and machine learning—perhaps the most modern is focused on "deep learning" (though I'm sort of convinced that's a bit of a fad). You can also dig in with topic areas like Natural Language Processing (NLP), Image Processing, Speech Recognition, etc.

There are a lot of inroads.

Just like Marco, Joseph acknowledges that data science is a very big field. There are many subfields.

Joseph goes on to talk specifically about machine learning:

Recently, "Machine Learning" is a common pathway for people with engineering chops to get a taste of this world. You can learn some existing toolkits (like scipy/scilearn or tensorFlow/keras) and understand a little how they work. Then, using normal software techniques you can integrate these tools with data streams and create powerful results. This is popular because it leverages these existing frameworks (which always seem to be made in Python for whatever reason) and merges traditional software engineering in nicely.

On the other hand, it's not a terrifically deep way of getting involved in data science / artificial intelligence / machine learning. For that, you will need to spend more time learning and should also probably try to figure out why you're interested.

It sounds like an engineer can learn a popular data science framework or library and use it to get some interesting results fairly quickly. Going any deeper, though, requires much more knowledge and experience.

Joseph views data science as having three archetypes: the statistician, the machine learning expert, and the business analyst.3 These are each big directions you can take. He explains each of these as follows:

The traditional statistician route focuses on statistical techniques for model building and information synthesis. I call it "traditional" because you might spend a lot of time working with refinements of the same techniques that have been developed for the last century or so. You might also find these super-powered by vast data sets never before accessible or extended with modern Bayesian techniques that are terrifically flexible and interesting. This route seeks "generalizable learning", as opposed to optimal prediction or specific problem solving. This is what makes it my favorite.

Within this one, you've got the subgenres of biostatistician, experimentalist, modeler.

The machine learning expert focuses on the real nuts and bolts of designing machine learning algorithms. They are probably spending a lot of time tracking down the cutting edge of this research. It's moving quickly today, and you'll want to be able to understand and contrast a large variety of methods, considering their applicability to a given task. This also usually entails a fairly large amount of engineering and "tending to" the machines you build—training them, reviewing them, analyzing their performance, etc. This is sort of the leveled up version of the most common inroad I mentioned above.

Within this one, you've got the subgenres of NLP specialist, image specialist, speech specialist.

The business analyst route focuses on lighter-weight statistical techniques from either of the previous two roles but spends more time understanding how to apply them to a business context where people need to discover decent answers quickly so that they can make decisions on the fly. This can be a very lightweight role where people are not well-versed in statistical technique, but in a business where this sort of reasoning is actually important... well, you'll find some very smart quantitative folks.

It sounds like the machine learning expert would fit best with my current programming skills. The traditional statistician also sounds like a good route to follow, assuming you have a good math background. The business analyst also sounds good if you have a specific business problem to solve.

Joseph talks about what sort of background skills you need, as well as resources to look into:

For all of these, you will want to make sure you're leveled up on calculus and linear algebra, and ultimately their synthesis, multivariate calculus. These will give you the basic vocabulary for reading statistical materials, and they will also make approachable the linear and non-linear optimization techniques that are the bread and butter of most statistical and machine learning methods.

I learned both of these a long time ago, so I can't really give you a great recommendation for a book or anything. Spivak's Calculus is good, but it's dense and difficult to learn from. I also like Axler's Linear Algebra Done Right. It's an interesting take on the subject, but it's less practical and written in a non-standard way.

Joseph recommends having a good handle on both linear algebra and calculus. His suggestions for books are the following:

  1. Spivak's Calculus
  2. Axler's Linear Algebra Done Right

Joseph then talks a little more about resources for machine learning.

From here, you can dig in with statistical methods and machine learning methods and see which make more sense. I recommend learning Bayesian statistics, which you can pick up from Andrew Gelman's Bayesian Data Analysis and Bishop's Pattern Recognition and Machine Learning. You should also investigate Hastie, Tibshirani, and Friedman's book, The Elements of Statistical Learning.

Neural networks are very hot now, but I don't know if there's a good intro guide. I think the best suggestion at this point is to just dig into the literature and learn some packages like TensorFlow.

Finally, the literature is pretty good because it's very often all free—so don't feel afraid to try digging into papers once you've got some of the basics down.

Joseph also recommends Bishop's Pattern Recognition and Machine Learning! Here's a list of all three books he recommends:

  1. Gelman's Bayesian Data Analysis
  2. Bishop's Pattern Recognition and Machine Learning
  3. Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning

Coming from Haskell, it is great to know that there is a lot of accessible literature. That's one of the things I really like about Haskell. There are tons of great research papers on programming techniques, libraries, new compiler features, etc.

Other Suggestions

I've also received a few other suggestions from people I know personally. Most of the suggestions mirror what has already been said above.

Mainly, if you're already a competent programmer, it may make sense to focus on machine learning instead of data science as a whole. Machine learning is where you can make the most use of your existing programming and DevOps knowledge.

Also, if you want to get a job doing machine learning, it may even make sense to focus on a subfield of machine learning, like Image Recognition or Natural Language Processing.

Conclusion

I received a lot of great advice about what to focus on and how to spend my study time. I think I am going to take Dominic's advice and go through the Machine Learning Coursera course, implementing the exercises in Haskell. This will hopefully be a broad (but shallow) introduction to the main ideas in machine learning. If I have any trouble with using Haskell, I'll have to seek help from Marco and the DataHaskell community.
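
To give an idea of what those exercises might look like in Haskell, here is a rough sketch of batch gradient descent for a univariate linear hypothesis, which is (as far as I can tell) roughly what the first programming exercise asks for. The training data, learning rate, and iteration count below are placeholders.

    -- A rough sketch of batch gradient descent for a univariate linear
    -- hypothesis h(x) = theta0 + theta1 * x, in plain Haskell.
    -- Training data, learning rate, and iteration count are placeholders.
    type Theta = (Double, Double)

    hypothesis :: Theta -> Double -> Double
    hypothesis (t0, t1) x = t0 + t1 * x

    -- One batch update of (theta0, theta1) with learning rate alpha.
    step :: Double -> [(Double, Double)] -> Theta -> Theta
    step alpha training (t0, t1) = (t0 - alpha * g0, t1 - alpha * g1)
      where
        m      = fromIntegral (length training)
        errors = [ (hypothesis (t0, t1) x - y, x) | (x, y) <- training ]
        g0     = sum [ e     | (e, _) <- errors ] / m
        g1     = sum [ e * x | (e, x) <- errors ] / m

    -- Mean squared error cost, useful for checking that descent converges.
    cost :: [(Double, Double)] -> Theta -> Double
    cost training theta =
      sum [ (hypothesis theta x - y) ^ 2 | (x, y) <- training ]
        / (2 * fromIntegral (length training))

    main :: IO ()
    main = do
      let training = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.2)]  -- toy data
          alpha    = 0.05
          fitted   = iterate (step alpha training) (0, 0) !! 2000
      print fitted                  -- the fitted (theta0, theta1)
      print (cost training fitted)  -- should be small if it converged

A real attempt would presumably swap the plain lists for one of the matrix libraries listed in the DataHaskell knowledge base, which seems like a good way to kick the tires on them.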

After that, I will read through Pattern Recognition and Machine Learning, since it is recommended by everyone! If the math is too advanced, I'll have to take Joseph's advice and brush up on my calculus and linear algebra.

Thanks again to Marco, Dominic, and Joseph for their detailed answers!

If any other Haskellers doing data science have an opinion on the topic, please feel free to leave a comment below, or email me directly!

Footnotes


  1. Some of the answers are slightly edited so that they flow better with this article.↩︎

  2. Joseph has some of the most in-depth and well-explained answers about difficult Haskell concepts on Reddit, Hacker News, and Stack Overflow. If you're interested in Haskell, I really recommend reading through his post history. You won't be disappointed!↩︎

  3. I recently read an article by the head of data science at Airbnb, Elena Grewal. In it, she breaks down data science into almost exactly the same roles as Joseph does.↩︎

tags: machine learning, haskell