Is this the best machine learning course?

The "mascot" for the Machine Learning course.

I recently completed the Machine Learning by Stanford University course taught by Dr. Andrew Ng, and offered through Coursera.

Previously I had taken several courses from DataCamp (see my review here), and while I found them valuable, I still didn’t feel like I understood the material at a sufficiently deep level. I gained a general understanding of the tools and techniques presented and became proficient enough to use these tools for my own projects (here, and here), but as a mathematician I don’t feel like I understand an algorithm until I can code it up myself.

After completing Dr. Ng’s course, I feel like I understand the algorithms covered, and the coursework provides a foundation that gives me confidence that I can implement these algorithms myself.

Course Structure and Content

The course is designed to take 11 weeks assuming you spend 5-8 hours per week on it, but allows students to work at their own pace. The material each week is presented through a set of video lectures by Dr. Ng, along with quizzes and a programming exercise.

Dr. Ng covers the most commonly used algorithms for supervised and unsupervised machine learning tasks:

  • Linear regression
  • Logistic regression
  • Neural networks
  • Support vector machines
  • K-means clustering
  • Principal component analysis

Dr. Ng is an engaging lecturer who is able to distill the essential information into easily digested pieces. The material is presented at multiple levels providing both intuition about the algorithm and the technical details behind it.

The lectures are complemented by programming exercises. Here again the essential components of each algorithm are the focus of the assignment. For example, in the linear regression exercise the students write code to compute the cost function and a single step in the gradient descent loop. The loop structure is provided, as are a number of test cases that let the student verify their solution is correct. All the assignments are done in Matlab or Octave, which reduces the level of programming skill required while still requiring an understanding of the math involved in each algorithm. This isn’t a coding course, and much of the tedious work is done for the students. At the same time, the full code is provided to the students and can be examined by those so inclined.
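To give a flavour of the linear regression exercise, here is a minimal sketch of the two pieces students write: the cost function and a single gradient descent update. The course itself uses Matlab or Octave; this version is plain Python, and the function names, toy data, learning rate, and iteration count are illustrative assumptions, not the course's code.

```python
def compute_cost(xs, ys, theta0, theta1):
    """Squared-error cost J = (1 / 2m) * sum((h(x) - y)^2), with h(x) = theta0 + theta1 * x."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = theta0 + theta1 * x - y
        total += error ** 2
    return total / (2 * m)


def gradient_step(xs, ys, theta0, theta1, alpha):
    """One simultaneous gradient descent update of both parameters."""
    m = len(xs)
    grad0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
    grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    return theta0 - alpha * grad0, theta1 - alpha * grad1


# Made-up data lying exactly on y = 1 + 2x (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = 0.0, 0.0
initial_cost = compute_cost(xs, ys, theta0, theta1)
for _ in range(5000):  # stands in for the loop the course provides
    theta0, theta1 = gradient_step(xs, ys, theta0, theta1, alpha=0.05)
final_cost = compute_cost(xs, ys, theta0, theta1)
```

Because the toy data is noise-free, the parameters should settle near an intercept of 1 and a slope of 2, with the cost falling towards zero.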

More than just algorithms

Week 6 in the syllabus is called “Advice for Applying Machine Learning,” and I found this to be one of the most beneficial topics. Following my earlier courses, I was able to use any number of machine learning algorithms (whether I really understood them or not) because I knew which library to load, and the syntax to train the model. But I didn’t know how to deal with cases that weren’t so simple (see here). My approach in these cases was to go with my gut feeling and hope for the best.

In the week 6 lectures, Dr. Ng presents best practices for applying machine learning, and discusses diagnostic tools (e.g. error metrics, learning curves) that can help highlight reasons why the algorithm may be under-performing. Armed with these new skills, I’m going back to my previous problem to see what can be done.
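As a rough illustration of one of these diagnostics, the sketch below computes a learning curve: training and validation error as the training set grows. It is written in plain Python rather than the course's Octave, fits a one-feature least-squares line, and uses made-up quadratic data so that the straight-line model is deliberately too simple. All names and data here are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (closed form for one feature)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mse(xs, ys, a, b):
    """Mean squared error of the line a + b*x on the given points."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(train_x, train_y, val_x, val_y):
    """Training and validation error for each training-set size m >= 2."""
    sizes, train_err, val_err = [], [], []
    for m in range(2, len(train_x) + 1):
        a, b = fit_line(train_x[:m], train_y[:m])
        sizes.append(m)
        train_err.append(mse(train_x[:m], train_y[:m], a, b))
        val_err.append(mse(val_x, val_y, a, b))
    return sizes, train_err, val_err


# Made-up quadratic data (illustrative only): a line cannot fit it well.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_y = [x * x for x in train_x]
val_x = [0.5, 2.5, 4.5, 6.5]
val_y = [x * x for x in val_x]

sizes, train_err, val_err = learning_curve(train_x, train_y, val_x, val_y)
for m, tr, va in zip(sizes, train_err, val_err):
    print(f"m={m}  train={tr:.2f}  validation={va:.2f}")
```

Broadly, training and validation errors that are both high suggest the model is too simple (high bias), while a low training error with a much higher validation error suggests over-fitting (high variance).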

Best course ever?

This course came highly recommended by a number of data scientists who I respect, and it was certainly a valuable course to take. I’m happy that I now understand the algorithms covered in this course at the deep level I was looking for.

I can’t answer this question based on my own experience, as I haven’t taken every other machine learning course that is out there. However, thanks to this course, I could build a collaborative filtering algorithm and recommender system to help answer this question.

In all seriousness, I found this course to be incredibly helpful, and would recommend it to anyone who is interested in machine learning.

How to get a data science job

Here are some online resources for learning data science, and career advice. I’m planning to come back to this list often as I continue to develop my skills and work through my career transition.

Data Science Resources

Career Resources

Finding a Mentor

How is this a beginner level project?

I spent some time last week working on another project to add to my portfolio. This project, building a predictive model for loan approvals, is listed as the third beginner level project in this guide and I thought it would be straightforward.

Following the typical data cleanup tasks, my initial plan was to compare three basic predictive models:

  1. Logistic Regression
  2. Decision Tree
  3. Random Forest

Loan approval rates based on credit history

Initial exploration of the data showed that the applicant’s credit history has the largest single-variable influence on the loan approval. Other variables that seem important are: marital status, education level, and property area. Having made these discoveries, I was ready to jump into training the models.

Because the credit history appears to have such a strong correlation with loan approval, I also tried a model using credit history as a predictor. When the results from this model were submitted to the competition, it received a score of 78%.
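One plausible form of such a baseline, sketched in plain Python, is a rule that approves a loan exactly when the applicant has a credit history. The rule, the function names, and the handful of rows below are assumptions made for illustration; they are not the competition data and will not reproduce the 78% score.

```python
def predict_from_credit_history(credit_history):
    """Approve the loan exactly when the applicant has a credit history."""
    return 1 if credit_history == 1 else 0


def accuracy(predictions, actual):
    """Fraction of predictions that match the actual outcome."""
    correct = sum(1 for p, a in zip(predictions, actual) if p == a)
    return correct / len(actual)


# Hypothetical rows of (credit_history, loan_approved); not the real data set.
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (1, 1), (0, 0), (1, 1), (0, 1)]

predictions = [predict_from_credit_history(ch) for ch, _ in rows]
actual = [approved for _, approved in rows]
baseline_accuracy = accuracy(predictions, actual)
print(f"Baseline accuracy on the toy rows: {baseline_accuracy:.2f}")
```

A baseline like this gives a floor to compare against: a more elaborate model has to beat the single-variable rule before it is worth the extra complexity.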

Next I trained the logistic regression model. It turns out that the only variable of any significance is the credit history. Submitting the results of this model again gave a score of 78% accuracy.

Okay, moving on to the decision tree model. Decision trees are easy to over-fit, and so I didn’t want to include all the variables. Starting with the variables I determined to have the most impact, I trained the decision tree model and found that the optimal decision tree model looks like the one shown below:

Decision Tree model.

At this point, I’m really wondering what I’m doing wrong. I break down and look at the tutorial. At first glance, the tutorial isn’t helpful. They don’t do better, and the included random forest model doesn’t get much better accuracy than the original logistic regression model that only uses the credit history.

So none of the basic models seem to work very well. Why is this a beginner project? Keep in mind, this is the 3rd beginner project. By this point, it’s assumed the student is able to run the models. It’s only natural that the difficulty increases. The tutorial concludes with 3 important points:

  1. Using a more sophisticated model does not guarantee better results.
  2. Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so would increase the tendency of overfitting, thus making your models less interpretable.
  3. Feature Engineering is the key to success. Everyone can use an XGBoost model, but the real art and creativity lies in enhancing your features to better suit the model.

Looking at the competition leaderboard, the top submissions have an 83% accuracy rate, so improvements can be made. Time to dive into the data again and see what can be done.

What I’m Reading: May 5, 2017

Data Science posts that have caught my interest this week:

I found a couple posts by Sam DeBrule that give a good overview of how Artificial Intelligence will change the working world in the near future, and resources to learn more, and keep up to date on the progress in the field:

A good post about Math:
You weren’t bad at maths — you just weren’t looking at it the right way

And some advice for PhDs trying to get into a data science role:
What PhDs do wrong (and right!) when applying for Data Science jobs

What I’m reading

Lately, I’ve found that I’m not reading as much. I’ve been focused on a couple of projects and have neglected other aspects of my data science education. I’m starting to schedule time into my day for finding and reading things online. Here are some of the posts I’ve found interesting lately.

In addition to learning data science, I’m also trying to build a career. There are so many aspects to data science that I’m not always sure what to focus on. I think I could really use a mentor to help me through this process, and found some good advice on that front as well:

Third class travel at first class prices.

As a first data science project, I performed an exploratory analysis of the Titanic passenger list. Using the data from the Kaggle competition, I looked for unexpected features in the data and found two things that surprised me:

  1. There were a significant number of unmarried young women traveling without any immediate family members. This seems very strange for 1912.
  2. The pricing of tickets doesn’t seem to follow any sort of pattern. There are many second and third class passengers who paid more for their tickets than some first class passengers.

You can find my full analysis here.

Is DataCamp worth the money?

I recently left my job as a software developer to focus on transitioning into a data science role (see this post). As the first step of my transition, I am working through the courses offered by DataCamp.

DataCamp offers a range of courses in Python and R in topics including: data importing, cleaning, manipulation, and visualization, as well as probability and statistics, machine learning, and finance. Within the last month, DataCamp has also created a number of course tracks based on specific skills or career path. These tracks are very helpful. I already had a good idea of which courses I wanted to take, but the tracks laid them out in an appropriate order and automatically started the next course in the sequence.

After about 3 weeks, I’ve completed 27 courses including the Data Scientist with R career track.

The Good:

  • At $30/month you can’t beat the price, though I don’t imagine they expect many people to complete 10 courses/week. I keep seeing reminders that paying for the full year up front works out cheaper than a monthly subscription, but after my first month, I’ll have taken all the courses I’m interested in.
  • Each course has a number of video segments, with exercises interspersed. There is no need to install R or Python to get started; everything runs through your web browser. This makes it easy to focus on understanding the underlying concepts without worrying about whether all the required packages are loaded.
  • The courses are consistently good, and I feel like I learned a lot in most of them.
  • The skill and career tracks simplify the task of choosing what to do next. When I first started I read each course description and made a list, and tried to decide the best order to take the courses in. When the career tracks came along, I was able to enroll and get through my courses in an appropriate order.

The (not so) Bad:

  • Everything is done in the browser. At the end of the course, I don’t have any working examples to refer back to. Also to simplify the exercises, each exercise builds on the last. I have seldom seen all the code for one task collected together on the screen.
  • The DataCamp platform provides excellent feedback, including guidance matching the errors found in your code. The drawback is that you must code the exercise in exactly the same way the course creator did. After taking some of the more advanced programming courses, I was frustrated at the way some exercises were presented knowing that there was a better way to do it and I was unable to practice what I had already learned.
  • Many of the exercises are reduced to “fill in the blank.” I would like to do more of the typing myself as I find this helps me remember what I’ve learnt better.

At this point, I’m very happy with my experience. While I don’t think I could consider myself a Data Scientist, I’ve gotten an introduction to many topics and have at least a vague sense of how to start a project of my own. Other people might disagree and feel completely qualified to call themselves data scientists after completing the career track, but I’m a mathematician by training, and I don’t think I understand something until I know all the details about the algorithm and can implement it myself (perhaps that’s a topic for another post). None of the DataCamp courses go into this level of detail, nor do they promise to.

As with everything in life, what you get out depends on what you put in. It would be very easy to get through the DataCamp courses without learning anything. I focused on understanding the concepts since I can always look up the syntax as needed.

Have you taken any online courses? What was your experience?

Advice, insight, etc.

Some of the things I’ve read lately:

Did you do an integral today?

When I was at UBC, I lived in a graduate student residence. There was an engineering student who would ask me regularly if I had done an integral that day. While I suspect it was a bit of a joke, I was never quite sure how to respond other than, “well, no, that’s not really what I do.”

This interaction has stuck with me, and while you can find many write-ups about the misconceptions people have about math (see here for example), I know this phenomenon exists in every field. My sister is a librarian, and I was shocked to learn that there’s more to her job than scowling at people from behind her desk and saying “shh.”

What are the misconceptions people have about what you do?

Advice, inspiration, etc.

I’ve recently come across a number of blog posts, articles, etc. that I’ve found useful. As the first in an ongoing series of posts, I’ll share the links that have captured my attention.

  1. Chad Bryant has an excellent series: So You Want to be a Data Scientist Part 1, Part 2, Part 3. The suggestion to find an area of interest and become a subject matter expert really resonated with me.
  2. The Road to Data Science, by Joel Grus: a great set of slides about becoming a data scientist.
  3. How to get your first job in Data Science, by Tomi Mester: some good advice regarding the most important skills, how to develop them, and encouragement to work on “pet projects”.
  4. How to be an “idea machine”, by James Altucher: I’m trying to do this, but still get stuck trying to have “good” ideas…