What does it take to be a data scientist? The question is not new, but the answer has changed slightly. The term ‘data science’ was coined in 2001 and serious practice commenced from 2010. Early articles from 2010 mention three characteristics of a data scientist: IT Skills, Math/Stat Skills and Domain Expertise. Possibly, there is nothing more to add to this triad even now.
However, the last four or five years have forced some changes in the underlying make-up of the triad. Several factors are driving the change: the increasing gap between IT and Business, rapid changes in computing and storage, the explosion of data (especially unstructured data), the arrival of new algorithms (particularly in the deep learning space), the proximity of data scientists to top management, the idea of unlocking value from systems thinking, and the increasing value creation opportunities from understanding the interconnectedness of various industries.
Consequently, data scientists have to involve themselves in the operational side of the business (e.g. CRM systems), handle more unstructured data such as text, voice and image, possess data engineering skill sets, work more with High Performance Clusters and Big Data, move to structural equation modelling rather than simple linear equations, solve more math problems than ever before, discuss opportunities, problems and solutions with senior management using BI & visualisation tools, think in systems and have experience in multiple domains.
Therefore, in this post I revisit the earlier thinking, add new points, and make the picture comprehensive and current.
In the context of a Data Scientist, IT Skills refer to the ability to fully understand the software world that is vital for her performance. It includes knowledge of databases and ways to handle them, and of statistical or mathematical software packages. A lot appears to be changing in this area. In spite of the vast amounts of data already available, data scientists are seeking new data to improve model performance. The characteristics of data are changing: it will increasingly be text, voice or image. In a separate development, untapped machine data from industry and IoT, running into several exabytes (EB), is now available for analysis. All this calls for astute and robust technology to lift and analyse the data. Every day, new libraries are added to the body of open source technologies. What are the implications for a data scientist? She has to:
- Engineer new datasets: Even in a world of exploding data, a Data Scientist needs to know what data is required to answer a specific question and have it acquired if it is not already available. To do so, some exposure to Data Planning is desirable. Data planning is the ability to (1) understand the enterprise’s end goal and execution strategy, (2) convert that understanding into lead and lag measures, metrics, etc. and (3) have the data (measures and metrics) collected by the IT team. It calls for knowledge of both IT and Business Strategy. For example, data scientists at matrimony.com, a leading matchmaking portal, were asked to maximise call centre revenues from contacting members who abandoned a payment page visit midway. The payment page is a page on the portal that has information, in the form of links, about the different packages, the benefits of each package, package comparisons, the several ways to make payments, and more. The data scientists found several critical data items missing from the data mart, one being the time spent on the payment page and another being the clicks on the different links. They worked with the IT teams to have this new data collected. Later, a logistic regression analysis showed that members who spent more time on the page, clicked several links, and visited more than once in the last week had a very high propensity to pay. It led to better conversion rates and more revenue.
- Assemble existing data: A data scientist’s work commences with exploring the data in the data warehouse as well as the data lake. It calls for skills in handling data in structured and unstructured forms, and in multiple database formats. The days of solving a problem in silos are over. A data scientist now takes data not just from one mart (e.g. sales), but also from other marts (such as operations and HR), which calls for a systems view. Some of the essential skills of a data scientist now include up-to-date knowledge of:
- Programming and scripting languages such as Python, Scala and C++
- SQL and NoSQL databases
- Big Data / Open Source: Hadoop and the tools around it, such as Hive, Pig and Spark, along with languages such as R and Scala. Hive is used more by data analysts and Pig more by programmers.
- Parallel Databases and parallel query processing
- Handle statistical and math processing software: Once the data is in one’s grasp, the next step is to analyse it. The popular stats and math packages are still the ones that have been around for years, if not decades. However, newer libraries that improve model performance are being added each day. Any one or more of the following statistical / math tools, kept up to date, would do:
- R or Python. One may need pbdR (Programming with Big Data in R) to use High Performance Clusters (HPC) and extremely large data sets or data lakes, allowing R to meet very high processing requirements. There is an interesting ‘R vs Python’ debate among data scientists. Python is a general-purpose language rather than a statistics tool, and it is good when it comes to integrating code into production systems. If you are a developer, you will like Python. R, on the other hand, has had a head start in stats, with several built-in functions. If you are a stats person, you will like R. I really don’t have a preference.
- Weka, or ADaMSoft, or Shogun, or Random Forest implementations, or OpenStat. These are some open source options with their own unique capabilities. E.g. Random Forest is an ensemble technique.
- SPSS, or SAS, or RapidMiner. These are proprietary packages with great capabilities for end-to-end analytics.
- LISREL or AMOS (now part of IBM SPSS) for structural equation modelling
- SageMath for comprehensive math problems such as Calculus, Linear Programming, and Algebra.
- Optimisation or Linear Programming problems: Free: OpenMDAO or Scilab; Proprietary: MATLAB or Mathematica
- GNU Octave for solving linear and non-linear problems
- BI and Visualisation Tools such as Spark, Pentaho, SpagoBI, Dundas, Tableau, Cognos, and SAS VA. With the increasing proximity of data scientists to top management, part of the job is being able to explain data science work to them easily.
- Understand operational IT systems: The work of a data scientist is greatly enhanced by her understanding of how operational IT systems such as campaign management, sales force automation, and call centre diallers work. For example, a list of customers with a high propensity to buy a product, generated from a logistic regression model, needs to be exported or deployed to an operational CRM system used by the Sales/Marketing team. The CRM system may then send a mail, SMS or similar message with an inducement to buy. Knowing how the operational IT systems are deployed and currently work provides invaluable insight into how and where a Data Scientist needs to focus her workflows.
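The propensity model and the CRM hand-off mentioned in the bullets above can be sketched in a few lines of Python. This is a minimal illustration only, not the model used at matrimony.com: the training rows, member IDs, feature choice (minutes on the payment page, link clicks, visits in the last week), the 0.5 threshold and all function names are assumptions made up for the example. A real project would more likely use an established statistics library than hand-written gradient descent.

```python
import csv
import io
import math

# Hypothetical training rows, all values invented for illustration:
# (minutes on payment page, link clicks, visits in last week, paid 1/0)
ROWS = [
    (0.5, 0, 1, 0), (1.0, 1, 1, 0), (0.8, 0, 1, 0), (1.5, 1, 2, 0),
    (2.0, 2, 1, 0), (3.5, 4, 2, 1), (4.0, 5, 3, 1), (5.0, 6, 2, 1),
    (2.5, 3, 3, 1), (6.0, 7, 4, 1),
]

# Hypothetical members who abandoned the payment page.
MEMBERS = {
    "M001": (0.7, 0, 1),
    "M002": (4.5, 5, 3),
    "M003": (3.0, 3, 2),
}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def propensity(weights, bias, features):
    """Predicted probability of paying for one member."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit_logistic(rows, learning_rate=0.05, epochs=5000):
    """Fit logistic regression by batch gradient descent on the log-loss."""
    n_features = len(rows[0]) - 1
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for *features, label in rows:
            error = propensity(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def export_call_list(weights, bias, members, threshold=0.5):
    """Return CSV text of members at or above the threshold, best first."""
    scored = sorted(
        ((propensity(weights, bias, feats), member_id)
         for member_id, feats in members.items()),
        reverse=True,
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["member_id", "propensity"])
    for score, member_id in scored:
        if score >= threshold:
            writer.writerow([member_id, f"{score:.3f}"])
    return buffer.getvalue()


weights, bias = fit_logistic(ROWS)
call_list = export_call_list(weights, bias, MEMBERS)
print(call_list)
```

The CSV text stands in for whatever import format the operational CRM or dialler actually accepts; knowing that format is exactly the kind of operational knowledge the last bullet refers to.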
Stats and Math Skills:
Possibly, at the heart of data science lies the improving ability to crunch numbers. New techniques are being developed to handle common issues faced by data scientists. For instance, the Support Vector Machine, a classification tool, solves no new problem, but it often solves the problem more efficiently, i.e. with fewer classification errors. Analysing text, voice, and image has been vexing. Advances by way of adding layers to Neural Networks (deep learning) have allowed hitherto unsolved problems to be solved. Consider, for example, ‘Dittory’. It is in the challenging business of helping customers discover similar unbranded apparel on the web using image search. Data scientists struggled even to detect a feature (e.g. a mandarin neck) in an image. However, very high processing capabilities and very large datasets have turned the old and ignored Convolutional Neural Network (CNN) into a powerhouse of new capabilities. Almost 30 million apparel images from across Indian eCommerce sites were used, and the rest is history.
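To make the CNN idea concrete, here is a minimal sketch of the single operation that a convolutional layer repeats many times: sliding a small kernel over an image to produce a feature map. It is not Dittory's pipeline. The 5x5 toy image and the hand-written vertical-edge kernel are assumptions for illustration, and in a trained CNN the kernel values are learned from data rather than written by hand.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# Toy grayscale image: dark (0) on the left, bright (1) from column 3 on.
IMAGE = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

# Hand-written kernel that responds to dark-to-bright vertical edges.
KERNEL = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = relu(convolve2d(IMAGE, KERNEL))
for row in feature_map:
    print(row)  # each row is [0, 3, 3]: no response in the flat area, strong at the edge
```

Stacking many such learned kernels in layers is what lets a network move from edges to shapes to features such as a collar type, given enough images and compute.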
The examples of SVM and CNN carry an important message for data scientists: keep a keen eye on the latest developments in a select set of important techniques:
- Data Exploration Techniques:
- Uni- and bi- variate analysis
- Correlation and covariance matrix
- Simple tests of hypothesis such as z and chi square tests
- Missing value treatment
- Outlier detection and treatments
- Variable transformation (also called Feature Engineering)
- Confidence Interval Estimation
- Dependent Techniques:
- Regression (including Cox Regression)
- Logistic Regression
- Other General Linear Models such as Lasso, Ridge, Elastic Net, Bayesian, and Polynomial
- Linear and Quadratic Discriminant Analysis
- Special Linear Models such as Kernel Ridge Regression, Support Vector Machines
- Structural Equation Modelling
- Experimental Design (Lift Modelling, Yield Optimisation, etc)
- Independent Techniques:
- Clustering Analysis, including k-means
- Factor Analysis or PCA extraction or Feature Selection
- Other statistical techniques and algorithms:
- Forecasting / Time series
- Survival Analysis
- C5 Decision Tree Algorithm
- Attribution modelling
- Collaborative Filtering, Association Rules, Linkages
- RFM Techniques
- Neural Networks
- Handling unstructured data (such as text, image and voice):
- Indexing or Tagging
- Web Analytics
- Text Analytics: Sentiment Analysis
- Natural Language Processing
- Image Analysis
- Voice Analysis
- Game Theory
- Linear Programming
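A few of the data exploration items in the list above (missing value treatment, outlier detection and correlation) can be sketched with the Python standard library alone. The spend and support-call figures, the choice of median imputation and the 1.5 x IQR rule are illustrative assumptions; the right treatment depends on the data and the business question.

```python
import math
import statistics

# Hypothetical monthly spend per customer; None marks a missing value.
MONTHLY_SPEND = [120, 135, None, 128, 140, 950, 131, None, 125, 138]
# Hypothetical number of support calls for the same ten customers.
SUPPORT_CALLS = [1, 0, 2, 1, 0, 6, 1, 2, 1, 0]


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation; assumes neither series is constant."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


cleaned = impute_median(MONTHLY_SPEND)
print("imputed:", cleaned)
print("outliers:", iqr_outliers(cleaned))
print("spend vs support calls:", round(pearson(cleaned, SUPPORT_CALLS), 3))
```

Whether the flagged value is an error to be removed or a high-value customer to be studied is a judgement that the code cannot make, which is where the domain knowledge discussed next comes in.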
To be a good data scientist, domain knowledge, systems thinking and cross-industry exposure are important.
Domain knowledge is acquired through exposure to industry dynamics. Industries such as BFSI, Telecom, Retail, eCommerce, and Education have a large number of customers and tech-enabled data systems, leading to the generation of large (if not Big) data. While the application of IT Skills and Math/Stats Skills is nearly the same in each of these industries, the business questions may be different. For example, Market Basket Analysis may be more important in the Retail industry, while Survival Analysis may be more important in Insurance.
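For readers who have not met Market Basket Analysis, a minimal sketch of its three standard measures (support, confidence and lift) follows. The five baskets and the item names are invented for illustration, and real retail work would mine rules over many thousands of transactions with a dedicated algorithm such as Apriori rather than score one hand-picked rule.

```python
# Hypothetical shopping baskets, one set of items per transaction.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
    {"milk"},
]


def support(baskets, items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent.

    Assumes both sides occur in at least one basket (no zero division guard).
    """
    support_both = support(baskets, set(antecedent) | set(consequent))
    confidence = support_both / support(baskets, antecedent)
    lift = confidence / support(baskets, consequent)
    return support_both, confidence, lift


rule_support, rule_confidence, rule_lift = rule_metrics(
    BASKETS, {"bread"}, {"butter"}
)
print(f"support={rule_support:.2f} "
      f"confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift above 1 suggests the two items are bought together more often than chance would predict, which is the kind of signal a retailer might use for shelf placement or bundling.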
Some questions appear to be universal, for example Churn Reduction. Yet the approach and the variables that determine churn vary somewhat across industries. Consider, for example, churn modelling in telecom and BFSI. The broad categories of predictor variables in both industries may be Customer Characteristics, Purchase History, Customer Product Usage Data, and Customer Payments or Billing Data.
In telecom, Customer Product Usage Data may cover variables such as Number of Calls (Outgoing, Incoming, Roaming and International), Number of SMS, Total Minutes, Number of VAS activated or deactivated, Data Usage, and App Usage. In the BFSI credit card business, the same category may take a different avatar. It may refer to variables such as Number of Transactions, Categories of Purchases, Days of Card Usage, Value of Purchases and Number of Automatic Debit Instructions.
Identifying the specific variables for a good analysis calls for reasonable domain expertise.
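As a small illustration of turning raw records into the telecom usage variables named above, the sketch below aggregates an event log into per-customer features. The event layout, customer IDs and feature names are hypothetical; a credit card version would aggregate transactions into counts, values and purchase categories in the same way.

```python
from collections import defaultdict

# Hypothetical usage events: (customer_id, event_type, direction, minutes)
EVENTS = [
    ("C1", "call", "outgoing", 3.0),
    ("C1", "call", "incoming", 1.5),
    ("C1", "sms", "outgoing", 0.0),
    ("C2", "call", "outgoing", 10.0),
    ("C2", "sms", "outgoing", 0.0),
    ("C2", "sms", "outgoing", 0.0),
]


def new_feature_row():
    """Empty usage-feature row for one customer."""
    return {
        "num_calls": 0,
        "outgoing_calls": 0,
        "incoming_calls": 0,
        "num_sms": 0,
        "total_minutes": 0.0,
    }


def usage_features(events):
    """Aggregate raw events into per-customer usage variables."""
    features = defaultdict(new_feature_row)
    for customer_id, event_type, direction, minutes in events:
        row = features[customer_id]
        if event_type == "call":
            row["num_calls"] += 1
            row["total_minutes"] += minutes
            if direction in ("outgoing", "incoming"):
                row[f"{direction}_calls"] += 1
        elif event_type == "sms":
            row["num_sms"] += 1
    return dict(features)


features = usage_features(EVENTS)
for customer_id, row in features.items():
    print(customer_id, row)
```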
Systems Thinking: Clearly, data science practice calls for an interdisciplinary approach. One cannot reduce churn (marketing analytics) while the same (poor) product performance continues (product analytics). Nor can one reduce warranty costs (marketing analytics) without appropriate changes in reverse logistics (supply chain analytics), or improve workforce productivity (HR analytics) without changes in production scheduling (production analytics).
A data scientist has to think holistically. No wonder the function has strategic importance and, in several organisations, reports directly to the CEO.
Cross Industry Exposure: I think exposure to the application of data science in two or more industries adds to the effectiveness of the practice, due to the ‘outside-in innovation’ effect. In fact, there is early evidence that analytics may soon no longer be confined to a single industry; it will call for analysis of data from across industries. We are already witnessing firms aggregating data from industries such as telecom, social media and ecommerce to improve search engine data analytics and the consequent marketing campaigns.
A lack of cross-industry exposure can be compensated for by studying successful applications of IT, Stats or Math in different domains or industries. One can also augment it by talking to peers in other industries and attending data science application conferences. The picture below shows how the successful application of a technique in one field has spawned similar applications in other fields as well.
The question is whether such cross-industry exposure should occur in the early, mid or late career of a data scientist. While there are no studies to back my hunch, I would avoid such exposure in the early stages of a data science career; focusing on one domain in the early stages has advantages.
Have the broad requirements of what it takes to be a data scientist changed? No. The triad still comprises IT Skills, Stats and Math Skills and Domain Knowledge. However, several changes in technology, science and business dynamics are forcing changes in the underlying characteristics of the triad. Data scientists are increasingly expected to spend time on data planning, use different and better technologies for data lifting, refurbish their stats and math armoury with techniques they have not used before, perform holistic analytics that involves all functions within an organisation, and use data and practices not just from their own industry but from across industries. The change calls for strategic thinking, fast and continuous learning, and a focus on outcomes.
Postscript: Reviewers of the above article pointed out that data scientists should also have some very important soft skills and abilities, such as communication, a questioning mindset, a problem-solving attitude and influencing without authority. I agree and thank the reviewers.