Becoming a Data Scientist

After working as a software developer for a few years, I’m ready for something different. I’ve explored a few different options and have decided to pursue a career in data science.

How will I make this switch? Following the advice of Marc Miller, I’ve talked to a number of people to find out what a data scientist does, the skills needed, and how to get a data science job.

The term data scientist is not well defined and means different things to different people (Chad Bryant has an excellent summary here), but common advice is to create a portfolio to demonstrate what you can do. Ideally this will contain a variety of projects showcasing your expertise. I’m currently working towards developing this portfolio.

Step 1 is to gain some basic skills, and I’ve decided to begin by working through the many courses at DataCamp. DataCamp offers courses in Python and R (the two main languages used for data science), and since I am already proficient in Python, I’m tackling the R courses. I’m hoping that through these courses I’ll gain an overview of the basic concepts and have the tools to begin building my portfolio.

Step 2 in my plan is to work through more in-depth courses. I haven’t decided which provider to use yet, but there seem to be many options (Udacity, edX, Udemy, etc.). These courses dig deeper into the material and also provide projects that can be included in a portfolio.

Step 3 is to go it alone: find a data set and see what I can do. This step could also include Kaggle competitions, as these tend to have nicer data sets and clearly defined questions that could make a first project more manageable.

Through all of this I’m continuing to meet people and learn about what they do. Do you work as a data scientist? I’d love the opportunity to talk to you.

A model for the early stages of type 1 diabetes

As an M.Sc. student, I worked on a model describing the early stages of type 1 diabetes. Type 1 diabetes occurs when the immune system attacks and destroys the insulin-producing beta-cells in the pancreas. Our experimental collaborator had observed that macrophages (a part of the immune system that clears away dead cells from tissues) from mice susceptible to type 1 diabetes were less efficient than macrophages from healthy mice. It was also observed that not every mouse susceptible to the disease developed it. Based on these observations, we developed a model to answer two questions:
1. Can the difference in macrophage efficiency account for the differences observed between healthy and susceptible mice?
2. Can a naturally occurring wave of beta-cell death associated with normal development in all mice be a triggering event that leads to type 1 diabetes in susceptible mice?

Within our modeling framework, the healthy state corresponds to a steady-state solution representing no inflammation, while the diseased state corresponds to a steady-state solution representing chronic inflammation. It is then assumed that during the chronic inflammation state the immune system will become primed to attack and kill the pancreatic beta-cells. Since not all susceptible mice develop type 1 diabetes, the healthy state should exist and be stable for both strains of mice being modeled. In addition, the steady state corresponding to chronic inflammation should exist and be stable for the susceptible mice. Through the use of dynamical systems approaches (especially phase-plane analysis) we determined that the initial model could not satisfy these requirements for biologically reasonable parameter values. The model was then expanded to include additional cell populations as well as the toxic effect of harmful cytokines released by the macrophages. This expanded model, shown below, demonstrated that differences in macrophage efficiency could explain the difference between healthy and susceptible mice.

[Figure: model]

To answer the second question, the wave of beta-cell death was incorporated into the model. The model demonstrated that for healthy mice the temporary inflammation quickly died down, and the model returned to the non-inflamed steady state. In the case of the susceptible mice, the wave of beta-cell death was sufficient to push the system to chronic inflammation, suggesting that the naturally occurring cell death could indeed be a triggering stimulus for the development of type 1 diabetes.
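The two-steady-state argument can be illustrated with a deliberately simple sketch. This is not the published model: it is a hypothetical one-variable caricature in which `m` stands for the level of inflammation, `clearance` stands in for macrophage efficiency, and a temporary `pulse` plays the role of the wave of beta-cell death. The equation and every parameter value are assumptions chosen only to show the qualitative behavior.

```python
def simulate(clearance, pulse=0.5, pulse_start=10.0, pulse_end=15.0,
             t_end=200.0, dt=0.01):
    """Forward-Euler integration of a toy inflammation model (an assumption,
    not the published equations):

        dm/dt = stimulus(t) - clearance * m + m**2 / (1 + m**2)

    The last term is a self-reinforcing feedback that saturates.
    """
    m = 0.0  # start in the non-inflamed state
    n_steps = int(round(t_end / dt))
    for i in range(n_steps):
        t = i * dt
        # The transient stimulus is only switched on during the pulse window.
        stimulus = pulse if pulse_start <= t < pulse_end else 0.0
        dm = stimulus - clearance * m + m**2 / (1.0 + m**2)
        m += dt * dm
    return m


# Hypothetical values: efficient clearance vs. reduced clearance.
healthy_final = simulate(clearance=0.6)
susceptible_final = simulate(clearance=0.4)

print(f"healthy mouse, final inflammation:     {healthy_final:.3f}")
print(f"susceptible mouse, final inflammation: {susceptible_final:.3f}")
```

With these assumed values, `m = 0` is a stable steady state in both cases, but only the low-clearance case has a second stable state (at `m = 2`, with an unstable threshold at `m = 0.5`). The same pulse therefore dies away when clearance is 0.6 and tips the system into the inflamed state when clearance is 0.4.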

For more details, read the full paper here.

An algorithm for locally adaptive time stepping

[Figure: in silico network, from the Blue Brain Project]

As a Ph.D. candidate, my research focused on the development of an efficient computational algorithm suitable for simulations of electrical impulses in nerve cells. Current research in computational neuroscience involves simulations of electrical impulses that travel through large computational domains, such as the example from the Blue Brain Project shown above. In many cases there is a spatial localization of activity, with a small region of the cell (or network of cells) changing rapidly while the majority of the system evolves very little. By taking advantage of this spatial localization of activity, I was able to develop an algorithm that can be more efficient than the standard approach. In a traditional simulation algorithm, the entire cell is treated as a single large system that is solved simultaneously. Thus, in order to obtain an accurate solution, the entire system must be updated using a time step size that is sufficiently small to capture the fastest evolution, even though most of the system could be accurately updated using a much larger time step.

To take advantage of the spatial localization of activity, I developed an algorithm for locally adaptive time stepping (LATS). Within this scheme, the system is split into subdomains, and each subdomain is updated with an adaptive time step most appropriate for the local level of activity, as shown in the figure below. The challenge of localized adaptive time stepping is in maintaining accurate flow of information and stability of the solution. Through the application of domain decomposition techniques, I was able to computationally connect the subdomains through boundary conditions obtained through a conservation of flux. To address the stability concerns, I replaced the time stepping scheme that had been used for neuroscience simulations since the 1960s with a method that provides better stability and proved better suited to the LATS algorithm.

[Figure: time step size in each subdomain]
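The idea of letting each subdomain choose its own step size can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the LATS algorithm itself: the two subdomains here are uncoupled (the flux-conserving boundary conditions are omitted entirely), each one is a single linear equation, and the backward Euler update with a step-doubling error estimate is a stand-in that I chose for simplicity. The names `integrate_adaptive`, `active`, and `quiet` are hypothetical.

```python
import math


def backward_euler_step(v, t, h, stimulus, tau=1.0):
    """One implicit Euler step for dv/dt = -v / tau + stimulus(t)."""
    return (v + h * stimulus(t + h)) / (1.0 + h / tau)


def integrate_adaptive(stimulus, t_end, tol=1e-3, h0=0.1, h_max=5.0):
    """Integrate one subdomain with its own adaptive step.

    The error is estimated by comparing one step of size h with two steps
    of size h / 2. Returns the final value and the number of accepted steps.
    """
    t, v, h, steps = 0.0, 0.0, h0, 0
    while t < t_end:
        h = min(h, t_end - t)  # do not step past the end time
        full = backward_euler_step(v, t, h, stimulus)
        half = backward_euler_step(v, t, h / 2, stimulus)
        half = backward_euler_step(half, t + h / 2, h / 2, stimulus)
        err = abs(full - half)
        if err <= tol:
            # Accept the more accurate two-half-step result.
            t, v, steps = t + h, half, steps + 1
            if err < tol / 4:
                h = min(2 * h, h_max)  # little activity: try a larger step
        else:
            h /= 2  # too much change: retry with a smaller step
    return v, steps


def active(t):
    """Hypothetical rapidly varying input to an active subdomain."""
    return 10.0 * math.sin(2.0 * math.pi * t)


def quiet(t):
    """Hypothetical subdomain with no input."""
    return 0.0


_, steps_active = integrate_adaptive(active, t_end=50.0)
_, steps_quiet = integrate_adaptive(quiet, t_end=50.0)
print(f"active subdomain: {steps_active} steps")
print(f"quiet subdomain:  {steps_quiet} steps")
```

The quiet subdomain quickly grows its step to the assumed maximum and finishes in a handful of steps, while the active subdomain is forced to keep its step small. In a global scheme, both would have to use the small step.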

Evaluating the LATS algorithm is not as simple as stating an X% reduction in computational time. The underlying numerical scheme is comparable to the standard approach, but the major benefit of the LATS method is that the computational cost scales with the level of activity in the system, rather than the physical size of the domain. Thus in situations where there is sparse activity in a large computational domain, the LATS method provides a significant reduction in computational cost by focusing computational resources where they are most needed. The LATS method was developed within the context of computational neuroscience, but is applicable to any system with sparse activity.
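A back-of-the-envelope count makes the scaling argument concrete. The sketch below simply counts subdomain updates for a global step versus per-subdomain steps; the subdomain counts and step sizes are hypothetical numbers picked for illustration, and the count ignores the overhead of coupling the subdomains.

```python
def update_counts(step_sizes, t_end):
    """Count subdomain updates needed to reach t_end.

    step_sizes holds the largest acceptable step for each subdomain.
    Global stepping forces every subdomain to use the smallest step;
    local stepping lets each subdomain use its own.
    """
    n = len(step_sizes)
    global_updates = n * round(t_end / min(step_sizes))
    local_updates = sum(round(t_end / h) for h in step_sizes)
    return global_updates, local_updates


# Hypothetical system: 5 active subdomains need dt = 0.01,
# while 95 quiet subdomains tolerate dt = 1.0.
step_sizes = [0.01] * 5 + [1.0] * 95
global_updates, local_updates = update_counts(step_sizes, t_end=100.0)
print(global_updates, local_updates)  # 1000000 59500
```

In this made-up case, adding more quiet subdomains adds 10,000 updates each under global stepping but only 100 each under local stepping, which is the sense in which the cost follows the activity rather than the size of the domain.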

In the video below, an electrical impulse is initiated in a cell, and propagates through two cells. The colors represent the membrane voltage, and the sections of the cell become transparent as the step size increases.

For more details, read the full paper here.

Do you have an application where the LATS method could be helpful? I’d like to discuss it.

What is Quantitative Systems Pharmacology?

I recently attended the Sanofi – MSISB Mount Sinai Systems Pharmacology Symposium. This one-day meeting focused on the ways quantitative systems pharmacology approaches can be used throughout the drug development process and how these approaches can gain larger prominence within the industry. Speakers from academia, industry, and the FDA presented their success stories and vision for the future. In this short post, I will share what I took away from the meeting.

With greater availability of data, advances in computing, and the previous successes of quantitative approaches, mechanistic modelling is gaining prominence under the label “quantitative systems pharmacology” (QSP). Many of the speakers at the symposium discussed the benefits of QSP. A QSP approach quantifies the mental model and assumptions that the research team is working from, and provides a framework for integrating data and making predictions. A QSP model also provides insight when it fails to fit the data, highlighting gaps in knowledge that can in turn suggest new experiments and prioritize future studies.

The standard modeling approach that is used within the pharmaceutical industry is a data-driven, empirical approach. This brute-force approach requires many experimental tests and statistical analysis to obtain a set of equations that can reproduce the observed behavior. These models are relatively inexpensive to produce, but are problem specific and only valid within the range of experimentally observed data. On the other hand, QSP models take much longer to develop, but the extra cost can be worthwhile since QSP models are not restricted to a “range of validity” and so can be used to extrapolate from observed values to make predictions. These predictions can be used to translate observations from one system to another, allowing researchers, for example, to make predictions in human populations based on animal models.
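A toy example may help show why model structure matters for extrapolation. In this sketch, a saturating dose-response curve stands in for a mechanistic model, and a straight line fitted to a narrow range of doses stands in for an empirical one. Everything here is an assumption made for illustration: the curve, its parameters `e_max` and `ec50`, and the synthetic "observations", which are generated from the curve itself. A real QSP model would be far larger and would have to be validated against data.

```python
def saturating_effect(conc, e_max=100.0, ec50=5.0):
    """Hypothetical mechanistic stand-in: the effect levels off at e_max."""
    return e_max * conc / (ec50 + conc)


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic observations over a narrow range of concentrations.
observed_conc = [0.5, 1.0, 1.5, 2.0, 2.5]
observed_effect = [saturating_effect(c) for c in observed_conc]
slope, intercept = fit_line(observed_conc, observed_effect)


def empirical_effect(conc):
    """Empirical stand-in: the fitted straight line."""
    return slope * conc + intercept


worst_in_range = max(
    abs(empirical_effect(c) - saturating_effect(c)) for c in observed_conc
)
print(f"largest error inside the observed range: {worst_in_range:.2f}")
print(f"line at conc = 50:  {empirical_effect(50.0):.1f}")
print(f"curve at conc = 50: {saturating_effect(50.0):.1f}")
```

Inside the observed range the line stays within about one unit of the curve, but at a concentration of 50 it predicts an effect several times larger than the assumed maximum of 100. The example is circular by construction, so it shows only the risk of extrapolating an empirical fit, not that any particular mechanistic model is right.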

Another advantage of QSP models is that they are not “single-use” products, and can be used during the drug development process for many compounds that affect the system covered by the model. During some of the informal discussions it came out that this aspect of QSP was the foundation of the business plan for companies like Rosa & Co. and Applied BioMath, who develop large models based on basic biology (described as “pre-competitive” models) and then incorporate the client company’s proprietary data to create a system-specific model for the compound under investigation. I have also come across the company DILIsym, which has created a mechanistic model of drug-induced liver injury that they license to companies to support risk assessment and decision-making related to new compounds.

Perhaps the biggest take-away for me was a renewed appreciation for the power of mathematical modeling that I discovered as a student. The speakers presented concrete examples of how mathematical modeling contributed to advances in care for people with heart arrhythmia, kidney disease, TB, and other conditions. These examples demonstrate the powerful role mathematics can play, and are a preview of the changes to come within the pharmaceutical industry as quantitative systems pharmacology approaches gain more traction.