
What is Data Science?

The science of extracting knowledge and insights from data.

Examples of Data Science Projects:

  • Network Analytics
  • Social Networks and Marketing
  • TV and Video Consumption
  • Content Recommender Systems

Data Science Process

We will be focusing on the CRISP Data Mining Process (CRISP-DM = Cross-Industry Standard Process for Data Mining).

Figure 1: The CRISP Data Mining Process

Business Understanding

In this stage, the analysts’ creativity plays a large role. The key to success is often a creative problem formulation: casting the business problem as one or more data science problems. High-level knowledge of the fundamentals helps creative business analysts see novel formulations.

One way to improve understanding is to break the big questions into smaller questions that are answerable by data.

Big Questions

  • How do we increase revenue?
  • How do we diversify customers?
  • Why is customer satisfaction down?

Small Questions

  • What are the properties of high-performing stores?
  • Which population segments are buying less?
  • What strategies reduce customer service hold times?

We will also improve our business understanding by asking the right questions to different stakeholders. This also means that we need to communicate our results to different groups in different ways.

Stakeholders

  • Executive
  • Sales
  • Marketing
  • Technology
  • Operations

Questions to ask

  • What is the goal of the solution?
  • What data is available?
  • What constraints exist?
  • How do we evaluate?
  • What does success look like?
  • How will people act on the results?

Data Understanding

It is important to understand the strengths and limitations of the data because rarely is there an exact match with the data science problem you’re trying to solve.

First, know where the data comes from: connect with the teams that collected it and understand how it was created. Second, work out how to get the data from these teams and which systems or tools are required. Next, understand the data by conducting exploratory data analysis (EDA). Finally, know the limits of the data, and spend time cleaning and understanding it.

Data Preparation

This stage involves transforming our data for modelling by conducting exploratory data analysis, missing data analysis, outlier detection and assessment, feature engineering and data munging.

One of the core concepts in data science and machine learning that separates it from statistics is the splitting of data into training and test sets.

We randomly partition the labelled dataset into training and test sets.

  • The training set is a random selection of data points (usually between 70%-90% of the original dataset size) which is used to build the model and estimate the model parameters.
  • The test set (a.k.a. “hold-out set”) consists of the remaining data (10%-30% of the original dataset size) and is used to assess the model’s performance. It gives us an estimate of the “generalization error” - the model’s error on data it has not seen.
Figure 2: Training and Test Datasets

Modelling

This is the stage where we apply machine learning algorithms to make predictions.

Evaluation

The purpose of the evaluation stage is to assess the model results rigorously and to gain confidence that they are valid and reliable before moving on to deployment.

We must define numerical metrics to determine the best model or the appropriate model parameters to use. Some examples are RMSE (root mean square error), MAE (mean absolute error), and ROC curves (receiver operating characteristic curves).

The evaluation metric should be tied to the business problem. For example, in a recommender system, the popular items should score well. It is also important to understand how the metrics translate to practical significance, such as, what does RMSE mean for our particular use case. You may find that the “best” performance might not actually have practical significance.

Always ask: is the success metric of the evaluation aligned with the business need? We can also compare our model’s results against a baseline or a “dumb” model, as in the sketch below.
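
As a rough illustration (with made-up numbers, not real results, and assuming a recent scikit-learn version that provides root_mean_squared_error, as used later in these notes), the sketch below computes RMSE and MAE for a set of predictions and compares them against a “dumb” baseline that always predicts the mean:

    # Minimal sketch: compare a model's error metrics against a "dumb" baseline.
    # The arrays below are made-up illustration values, not real results.
    import numpy as np
    from sklearn.metrics import mean_absolute_error, root_mean_squared_error

    y_test = np.array([3.0, 5.0, 7.5, 10.0])        # true values (hypothetical)
    y_pred = np.array([2.8, 5.4, 7.0, 9.5])         # model predictions (hypothetical)
    baseline = np.full_like(y_test, y_test.mean())  # always predict the mean (use the training mean in practice)

    print("model RMSE   :", root_mean_squared_error(y_test, y_pred))
    print("model MAE    :", mean_absolute_error(y_test, y_pred))
    print("baseline RMSE:", root_mean_squared_error(y_test, baseline))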

Deployment

One critical aspect of every data science exercise in business is: what action will be taken?

Models don’t have any business value until they get into production. Deployment decisions can affect the impact of the model, and deployments always have surprises. One approach is to deploy slowly, with pilot studies and A/B testing, careful monitoring, and feedback loops.

Data science and the deployment teams (software engineers, DevOps, QA) should be integrated from the start to avoid any technical surprises.

The CRISP-DM process should not stop here. It should iterate and continuously improve!

Types of Data Science Tasks

Supervised Learning

  • There is a specific quantifiable target that we are interested in or trying to predict.
  • Example: predict whether a customer will buy our product based on their browsing habits on our website.
  • Algorithms like decision trees and regression are used.

Types of Supervised Learning Models

  • Classification: The prediction variable is categorical (perhaps binary). Example: will the loan applicant default on their loan?
  • Regression: The prediction variable is numeric (continuous). Example: how much money will this customer spend on my product next year?
  • Time Series: Here the data is time-based, and we are forecasting into the future. Example: what will my company’s stock close at next month?
  • Other Models: Predicting text or other media. Example: what should my customer-care chatbot say next?

Unsupervised Learning

  • There is no specific quantifiable target; we are just trying to understand the data.
  • Example: segment the customer base into clusters that help define broad customer types.
  • We use data visualization, segmentation, and clustering here.

Exploratory Data Analysis

Understanding your data goes hand in hand with the business problem you’re trying to solve.

Finding Data

  • In business-critical systems, the main challenge is data cleaning. These systems are typically set up for running the business rather than for collecting analysis-ready data.
  • We can find data online, but companies do not share critical data, and when they do it is often via a rate-limited API. To collect online data we can use parsers, scrapers, and other tools; BeautifulSoup is a Python library for scraping websites (see the sketch after this list). We could also use GenAI to get data.
  • We can rely on curated data science websites like the UCI repository. The challenge there is that the datasets are sanitized and have already been analyzed extensively.
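
As a minimal, hypothetical scraping sketch (the URL is a placeholder, the page structure is assumed, and the requests and beautifulsoup4 packages must be installed):

    # Minimal scraping sketch: the URL is a placeholder and the page structure is hypothetical.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com")
    soup = BeautifulSoup(response.text, "html.parser")

    # Turn every link on the page into a (text, href) row - a rough example of HTML becoming data
    rows = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]
    print(rows[:5])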

Reading Data in Python

Ideally, the data we access will come to us as a tidy Excel or CSV file. In real life, however, we often need to deal with database access, SQL, or streaming data.

Reading a CSV requires the pandas module and the pd.read_csv() function, given the file name or the URL of the data.
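
A minimal sketch (the file name is a placeholder):

    # Minimal sketch: load a CSV into a pandas DataFrame (the file name is a placeholder).
    import pandas as pd

    df = pd.read_csv("customers.csv")  # a local path or a URL both work
    print(df.shape)
    print(df.head())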

Exploring Data through EDA

  • Study and understand each feature
  • Know the feature types and distributions
  • Identify outliers
  • Consider data transformations
  • Identify missing data
  • Visualize the data

Feature Types

  • Numeric: Features where calculations are meaningful
  • Categorical: Features where we cannot do meaningful calculations. They can be multivalued or might have numeric labels. Also check whether and how the categories are ordered:
    • Ordinal: Income (A: 0-50k, B: 50k-100k, C: 100k-500k, D: >500k)
    • Nominal: Eye color (A: Blue, B: Brown, C: Green, D: Other)
  • Binary: Only two possible values, 0 and 1 (a special case of the categorical feature type).
  • Date Stamps: Always check the date format. They can be mistaken for numeric values if not treated carefully. (Python uses the datetime library.)

Things can get messy very easily: for example, data containing names, SSNs, credit ratings, zip codes, text, images, or sound. You need to make sure the software recognizes the right data type.
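
A small sketch of what that check can look like in pandas (the columns and values are hypothetical):

    # Minimal sketch: inspect what pandas inferred and correct it (columns are hypothetical).
    import pandas as pd

    df = pd.DataFrame({
        "zip_code": [2138, 10001, 94105],                          # looks numeric, but math on it is meaningless
        "signup_date": ["2024-01-05", "2024-02-11", "2024-03-02"],
        "income_band": ["A", "C", "B"],                            # ordinal categories
    })
    print(df.dtypes)                                               # see what was inferred

    df["zip_code"] = df["zip_code"].astype("string")               # treat as a label, not a number
    df["signup_date"] = pd.to_datetime(df["signup_date"])          # proper date stamps
    df["income_band"] = pd.Categorical(df["income_band"], categories=["A", "B", "C"], ordered=True)
    print(df.dtypes)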

Manipulating data in Python

  • Renaming variables (.rename)
  • Summaries of variables (.info and .describe)
  • Exploring data (.head and .tail)
  • Viewing slices of data (.loc and .iloc)
  • Categorical variables (.value_counts)
  • Merging and splitting attributes (pd.concat, .str.split)
  • Sampling when the data is too large (.sample); a short sketch of these calls follows below
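
Here is a minimal sketch of these calls on a made-up DataFrame:

    # Minimal sketch of the pandas calls listed above (the DataFrame is made up).
    import pandas as pd

    df = pd.DataFrame({"age": [34, 45, 29, 61], "segment": ["A", "B", "A", "C"]})

    df = df.rename(columns={"segment": "customer_segment"})   # renaming variables
    df.info()                                                  # types and non-null counts
    print(df.describe())                                       # numeric summaries
    print(df.head(2))                                          # first rows
    print(df.loc[0:1, ["age"]])                                # label-based slice
    print(df["customer_segment"].value_counts())               # categorical counts
    print(df.sample(2, random_state=0))                        # random subsample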

Encoding Categorical Variables

For example, suppose we have categorical data for diet types: vegan, vegetarian, pescatarian, and omnivore. We can recode this variable as k binary (0/1) indicator variables (also called dummy variables, or one-hot encoding). Note that we only need k - 1 dummies to encode the information.

For some methods (like regression) you must encode with k - 1 dummies to avoid perfect collinearity (the “dummy variable trap”).

Category      isVegan   isVegetarian   isPescatarian
Vegan         1         0              0
Vegetarian    0         1              0
Pescatarian   0         0              1

If all three dummies are 0, the observation belongs to the Omnivore category.

You can use the pd.get_dummies(drop_first=True) or sklearn.preprocessing.OneHotEncoder(drop='first') in Python.
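
A quick sketch of the pandas route for the diet example above (the tiny DataFrame is made up):

    # Minimal sketch: one-hot encode the diet variable with k - 1 dummies.
    import pandas as pd

    diets = pd.DataFrame({"diet": ["Vegan", "Vegetarian", "Pescatarian", "Omnivore"]})
    dummies = pd.get_dummies(diets["diet"], prefix="diet", drop_first=True)
    print(dummies)  # the dropped level becomes the all-zeros baseline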

Another issue with categorical variables is that there can be too many levels, which can make the model complicated and messy. We may then want to keep only the top levels and group the rest into an “Other” category.
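
One hedged way to do this in pandas (the column and the cut-off of two levels are arbitrary choices for illustration):

    # Minimal sketch: keep the most frequent levels and lump the rest into "Other".
    import pandas as pd

    s = pd.Series(["US", "US", "UK", "DE", "FR", "US", "UK", "BR", "IN"], name="country")
    top_levels = s.value_counts().nlargest(2).index    # keep the 2 most common levels
    s_lumped = s.where(s.isin(top_levels), other="Other")
    print(s_lumped.value_counts())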

Outliers

Outliers are observations that are far from the mean, and fall outside the range of most of the data.

We can find outliers using summary tables, data visualization or simple data sorting.

Outliers typically are attributable to one of the following causes:

  • The measurement is observed, recorded or entered incorrectly (e.g. blood pressure = -50)
  • The measurement comes from a different population (e.g. age = 5 in a dataset about adults)
  • The measurement is correct, but represents a rare (chance) event (e.g. people with a net worth of billions of dollars in a dataset about the general population)

Outliers should always be considered and inspected to see if they are “real” or some artifact of data collection.

To Remove or Not To Remove

  • Outliers can have an outsized impact on an analysis. You may choose to remove an outlier, even if it is legitimate, just to limit its influence on the analysis.
  • You can choose a robust technique like decision trees; regression, in contrast, is sensitive to outliers.
  • Carefully assess the data for legitimacy. If a value appears incorrect, remove it.
  • Do not blindly trust automated outlier-finding algorithms (like the IQR rule); see the sketch below.
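
As an illustration of that last point, here is a minimal sketch of the IQR rule on made-up values - useful for flagging candidates to inspect, not for automatic deletion:

    # Minimal sketch: flag (don't automatically drop) points outside the 1.5 * IQR fences.
    import pandas as pd

    values = pd.Series([4, 5, 5, 6, 7, 7, 8, 95])   # made-up data with one suspicious point
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    flagged = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
    print(flagged)   # inspect these before deciding whether to keep or remove them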

Skewed Data

Highly skewed data can be problematic in data science analyses, especially with regression models (linear and logistic). Skew can obscure patterns and invalidate statistical tests. One way to handle skewed data is to transform it, for example with a log scale, square roots, or binning.
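
A minimal sketch of the log-scale option (the income values are made up and heavily right-skewed):

    # Minimal sketch: reduce right skew with a log transform (values are made up).
    import numpy as np

    income = np.array([20_000, 35_000, 42_000, 58_000, 75_000, 2_500_000])
    log_income = np.log10(income)   # or np.log1p(income) if zeros are possible
    print(log_income.round(2))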

Missing Data

Missing data points are values that were not recorded or are absent from the dataset. They can be represented as ‘.’, ‘NA’, ‘-1’, etc. (Python uses NaN for missing values.)

Reasons for missing data:

  • data entry errors
  • non-responses in surveys
  • system errors, etc.

Missing data considerations:

  • How much is there?
  • Is it missing at random?
  • Is missing-ness important by itself?
  • How does the model handle it?

Pandas functions - isnull() and notnull() are used to detect missing values.

Strategies for Missing Data

  • Deletion Methods:
    • Row deletion: remove the entire row if any value is NA
    • Column deletion: delete features with a high percentage of NAs (the impact on the analysis needs to be evaluated, as it could be drastic)
  • Imputation Methods:
    • Mean/Median/Mode Imputation: replace all the NAs with mean, median or mode of the feature
    • Modeled Imputation: use algorithms like regression or k-nearest neighbours to predict and fill missing values
  • Interpolation: Give a missing point the average of the two observations surrounding it (relevant for time series)
  • Create a new category for NA

It is important to evaluate how much influence your strategy has on the final conclusions.

Pandas functions - dropna() drops rows or columns with missing values, and fillna() fills missing values with a specified value such as the mean, median or mode.

Scikit-learn functions - SimpleImputer() applies basic imputation strategies, and IterativeImputer() applies model-based imputation.
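
A small sketch putting several of these pieces together (the DataFrame is made up):

    # Minimal sketch of the strategies above (the DataFrame is made up).
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"age": [34, np.nan, 29, 61],
                       "income": [52_000, 61_000, np.nan, np.nan]})

    print(df.isnull().sum())                          # how much is missing, per column
    dropped = df.dropna()                             # row deletion
    filled = df.fillna(df.median(numeric_only=True))  # median imputation in pandas

    imputer = SimpleImputer(strategy="median")        # the same idea in scikit-learn
    imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    print(imputed)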

Data Visualization

Looking carefully at the data is an important part of the exploratory process:

  • to identify mistakes in collection/processing
  • to find violations of statistical assumptions
  • to observe patterns in the data

Sometimes a good, simple visualization is all the “analysis” you need.

However, data visualization is a challenging topic - it is very hard to do well, with many design decisions to make. There are in-depth courses that discuss the underlying theory, covering things like what humans perceive well and how to use marks, colors, and so on. Think carefully about the message you’re trying to convey with your data.

It is always best to start simple with the basic building blocks of visualization: histograms, boxplots, and scatterplots.

Histograms

Histograms are probably the most used (and useful) way to present numerical data. They provide a quick view of the range, distribution shape, outliers and trends in the data.

Figure 3: Data Visualization - Histogram

The number of bins and the bin widths can affect the interpretation.

Figure 4: Histogram with more bins
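
A quick matplotlib sketch of the same idea, drawing one set of simulated values with two different bin counts:

    # Minimal sketch: the same data with two different bin counts (values are simulated).
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)
    values = rng.normal(loc=50, scale=10, size=500)

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    axes[0].hist(values, bins=10)
    axes[0].set_title("10 bins")
    axes[1].hist(values, bins=50)
    axes[1].set_title("50 bins")
    plt.show()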

Boxplots

A boxplot is another way of getting a quick visual clue of the key points of the data distribution.

Figure 5: Data Visualization - Boxplot

Multiple Variables

You can represent contingency tables using stacked bars, side-by-side bars or standardized bars.

Side-by-side boxplots also allow you to compare a numeric distribution across the levels of a categorical variable.

Figure 6: Data Visualization - Multiple Boxplots

For multiple numeric variables, the scatterplot is the standard tool for displaying the relationship between two variables. We can easily see whether there is any correlation between the values plotted on the x-axis and the y-axis, spot strange patterns, identify outliers, and so on. When dealing with overplotting, you can use “alpha blending” (i.e. transparency) or heatmaps, as in the sketch below.

Figure 7: Data Visualization - Scatterplot
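
A minimal sketch of a scatterplot with alpha blending (the data is simulated purely for illustration):

    # Minimal sketch: alpha blending to cope with overplotting (simulated data).
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=5_000)
    y = 0.6 * x + rng.normal(scale=0.8, size=5_000)

    plt.scatter(x, y, alpha=0.1)   # transparency reveals where points pile up
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()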

From Exploration to Model Fitting

Fitting data science models follows the same basic structure:

  • Instantiate a model
      from sklearn.linear_model import LinearRegression
    
      model = LinearRegression()
    
  • Identify X (predictors) and Y (target) and then create training and test sets.
      from sklearn.model_selection import train_test_split
    
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
  • Use .fit to apply the model to the data
      model.fit(X_train, y_train)
    
  • Explore the outcomes of the model
      y_pred = model.predict(X_test)
    
  • Calculate model fit metrics
      from sklearn.metrics import root_mean_squared_error    
    
      rmse = root_mean_squared_error(y_test, y_pred)
    

We do this over and over again.

Decision Trees

Decision trees are one of the most useful and interpretable ways to build a predictive, supervised machine learning model.

They are very easy to interpret.

Figure 8: Decision Tree Example
  • We start off with all the data.
  • At each level, we create a rule that creates “branches” (typically 2).
  • Each branch then can split independently.
  • Each level is defined by a rule.
  • The bottom-level nodes are called the “leaves”.
  • Each data point falls in one and only one leaf.
  • Each leaf is defined by the set of rules that got us there.
  • Each leaf gets labelled with the majority class of the training points that fell into it.
  • Thus, new data can now be classified - every new instance goes to a single leaf.

Nodes

Numeric attributes are split at an “optimal” point at each node.

Leaves

Prediction for a new case is the majority class in the leaf node.

Class Probabilities

We can also assign a class probability to a new data point from the leaf node it falls into. Trees assign probabilities by looking at the proportion of the training labels that fell into each leaf node.
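
A hedged sketch of that idea with scikit-learn’s DecisionTreeClassifier (the tiny dataset is invented, purely for illustration):

    # Minimal sketch: leaf proportions become class probabilities via predict_proba.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([[25], [32], [47], [51], [62], [68]])  # e.g. customer age (invented)
    y = np.array([0, 0, 1, 1, 1, 0])                    # e.g. bought the product (invented)

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(tree.predict_proba([[45]]))  # proportions of training labels in the leaf this point lands in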

Probability Correction

\[ \text{Laplace Correction} = p(c) = \frac{n + 1}{N + 2} \]

Where \(n\) is the number of examples in the leaf belonging to the class, and \(N\) is the total number of examples in the leaf.

The Laplace correction is often used with small datasets, to offset overconfident probability estimates caused by overfitting.
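
As a quick worked example with made-up counts: a small leaf containing 3 training examples, all of class \(c\), gets

\[ p(c) = \frac{3 + 1}{3 + 2} = 0.8 \]

rather than the raw, overconfident estimate of \(3/3 = 1\).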

Overfitting

The larger you build the tree, the more accurate it gets… on the training set. 😕

Applying the tree to a test set allows us to determine the optimal tree size - one that maximizes generalization and avoids overfitting.

Why Decision Trees?

Trees are one of the most popular models for supervised classification. Trees are easy to understand (when small), easy to communicate to executives and non-technical stakeholders, and they work pretty well for a simple model.

Trees are also computationally cheap for large datasets, handle large sets of attributes easily, ignore irrelevant variables, handle missing data well, and make no distributional assumptions (unlike regression).

Growing a Tree

There are two elements to a good split:

  • “purity” of the branches
  • “coverage” of the data

Measuring Impurity

The more homogeneous (consistent) a group is, the more pure it is. We measure purity with entropy:

\[ \text{Entropy} = - p_1 \log_2(p_1) - p_2 \log_2(p_2) \]

Entropy is maximized when the classes are equally distributed, and minimized when only one class is present in the sample. (Note: entropy also extends to more than two classes.)

\[ \text{Low Entropy} = \text{More Information} \]

How do we use Entropy?

  • A decrease in entropy means more information
  • We calculate entropy before and after the split
  • We evaluate the potential split by taking the overall entropy of the split (a weighted average of the two entropies).

Measuring Coverage

Entropy gives us a measure of how helpful a given leaf is in predicting the target, but we need an overall sense of the value of a split. We want splits that increase our overall information, not just carve off little areas that are perfectly pure. We do this by measuring the overall improvement in entropy across the entire dataset for each split.

\[ \text{Information Gain} = \text{Impurity (parent)} - \text{Impurity (children)} \]
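
A small sketch of this calculation on made-up counts: a parent node with 10 positives and 10 negatives, and a candidate split that sends 12 cases left (9 of them positive) and 8 right (1 positive):

    # Minimal sketch: entropy and information gain for one candidate split (counts are made up).
    import numpy as np

    def entropy(p):
        """Two-class entropy given the proportion p of the positive class."""
        if p in (0, 1):        # a pure node has zero entropy
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    parent = entropy(10 / 20)                                           # parent: 10 of 20 positive
    children = (12 / 20) * entropy(9 / 12) + (8 / 20) * entropy(1 / 8)  # weighted average of child entropies

    print("information gain:", round(parent - children, 3))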

Finding the Best Split

At each split we need to:

  • Find the optimal split point for each numeric attribute
  • Find the optimal split for any categorical attribute
  • Compare the information gain of those splits for each attribute
  • Pick the best one and split

However, if we follow these rules, we can keep splitting and growing the tree until we have perfect purity or we find no splits that can add information.

But we don’t want to grow the tree as far as we can. Fewer nodes are typically better: more comprehensible and often more accurate on new data. Test (hold-out) sets guide us in finding the optimal tree size.

Trees can also be interpreted geometrically - they create rules that correspond to axis-parallel decision boundaries.

Information Gain for Insight

Each split of the decision tree results in information gain - decreasing entropy is information gain. We can assign the information gain to the attribute used for the split. Thus, information gain can give us a measure of feature importance.

Random Forests

Typically multiple trees (or forests) will outperform a single well-built tree.

Random forests are a machine learning method that fits many different trees to a prediction problem - using random samples of rows and columns - and averages the predictions across the trees. This improves predictive power, at the cost of interpretability.

from sklearn.ensemble import RandomForestClassifier
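
A minimal sketch using a synthetic dataset (generated with make_classification, so the numbers mean nothing on their own); the feature importances at the end connect back to the information-gain idea above:

    # Minimal sketch: fit a random forest on a synthetic dataset (the data is generated, not real).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("test accuracy      :", forest.score(X_test, y_test))
    print("feature importances:", forest.feature_importances_.round(3))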

Evaluating Models
