How to leverage Data Science in the Retail Industry

The notion of leveraging data to make product and strategic decisions is now common knowledge, and it is even more essential in the consumer space. While we may have a gut feeling about how customers will react to product changes such as new features, their actual reactions can surprise us. Today's analytics-driven approach to product development allows us to test different hypotheses, measure how they perform, learn, and continuously improve our product or service. Combining large volumes of data with knowledge of the business and its rules allows us to do predictive as well as prescriptive analytics. Predictive analytics is used to predict a future event, outcome, or problem, whereas prescriptive analytics takes this one step further by offering specific, actionable next steps for handling the outcomes or issues surfaced by the predictive analysis.

This blog details how a data pipeline can be leveraged to ingest data and run analytics. The AWS technology stack is used as the reference here, but other tools and technologies work just as well. The steps involved are building data pipelines to collect the metrics, setting up the data warehouse or data lake to store the data, and creating the analysis and visualization tools. Building your own custom analytics lets you analyze and improve your marketing and sales pipeline: collect data about the actions visitors take on your website and correlate it with data coming from your Facebook ads, Google AdWords and other social channels. This allows you to build easily digestible dashboards that analyze the performance of each component of your product and give a holistic view of the ROI of each initiative.

There are various data science tools and techniques available for setting up the data engineering platform. Some of those offered by AWS are described below.

Redshift - Redshift combines high performance with low cost, providing great value for money. It is based on columnar storage technology, which makes it well suited to analytic queries that require aggregation and to high-throughput data ingestion.

Data Pipeline - AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services. AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available. AWS Data Pipeline manages and streamlines data-driven workflows, which includes scheduling data movement and processing.

Building the Data Pipeline - An analytics data pipeline over Redshift is usually composed of the following steps:

  • Connecting to data sources
  • Mapping data to Redshift tables
  • Loading data to Redshift (the staging area, a.k.a. the data lake)
  • Cleaning, filtering, enriching and transforming data
  • Loading the updated data to the Data Mart area

Connecting to Data Sources

First, the data pipeline needs to be able to acquire data from all of the data sources. Most data sources require the pipeline to actively pull from them, which means the pipeline has to query each source regularly to collect the latest updates. Some data sources, however, can push data to the pipeline; for example, Android and web applications push analytics events directly from the clients.

You should track all the relevant user actions in your apps. For example, in order to monitor your order leads, you should track when a retailer opens, closes, and submits (or does not submit) an order.
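As a minimal sketch of client-side tracking, the snippet below posts a single order event to a hypothetical collection endpoint; the URL and the field names are assumptions and would be replaced by whatever your pipeline exposes.

  import json
  import time
  import urllib.request

  # Hypothetical event-collector endpoint; replace with your pipeline's ingestion URL.
  COLLECTOR_URL = "https://collector.example.com/events"

  def track_event(retailer_id, event_name, properties=None):
      """Send a single analytics event to the collection endpoint."""
      payload = {"retailer_id": retailer_id,
                 "event_name": event_name,          # e.g. "app_opened", "order_submitted"
                 "event_time": int(time.time()),    # epoch seconds
                 **(properties or {})}
      req = urllib.request.Request(COLLECTOR_URL,
                                   data=json.dumps(payload).encode("utf-8"),
                                   headers={"Content-Type": "application/json"})
      urllib.request.urlopen(req, timeout=5)

  # Example: a retailer submits an order from the app.
  track_event("R-1042", "order_submitted", {"order_value": 2500.0, "ip_address": "203.0.113.7"})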

Mapping data to Redshift tables

Mapping data to Redshift tables requires creating new Redshift tables and choosing a Redshift-compatible column name and column type (varchar, integer, boolean, timestamp, etc.) for each field. When the source schema changes and new fields appear for the first time, the Redshift schema will probably require modification as well. How much depends on how the data are ingested: if whole data points with all of their fields are loaded into Redshift tables, the Redshift schema will certainly have to change; if only selected fields are loaded, it might not.

When creating a new Redshift table, it is also important to choose the distribution key and sort key carefully. If there is a key that is likely to be used for joins (e.g. user id, product id), use it as the distribution key. Amazon Redshift makes sure that all rows with the same distribution key are stored on the same node slice, which reduces the amount of data sent over the network during queries and dramatically improves performance.
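As an illustrative sketch (the table, column and cluster names are hypothetical), the DDL below creates an events table distributed by retailer_id, which is frequently joined on, and sorted by event time, executed from Python with psycopg2:

  import psycopg2

  # Hypothetical cluster endpoint and credentials.
  conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                          port=5439, dbname="analytics", user="analyst", password="***")

  create_events_table = """
  CREATE TABLE IF NOT EXISTS app_events (
      event_id     BIGINT,
      retailer_id  VARCHAR(32),
      event_name   VARCHAR(64),     -- e.g. 'app_opened', 'order_created', 'order_submitted'
      event_time   TIMESTAMP,
      order_value  DECIMAL(12, 2),
      ip_address   VARCHAR(45)
  )
  DISTSTYLE KEY
  DISTKEY (retailer_id)             -- rows for the same retailer land on the same slice
  SORTKEY (event_time);             -- speeds up time-range scans
  """

  with conn, conn.cursor() as cur:
      cur.execute(create_events_table)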

Loading your data to Redshift

Loading data into Amazon Redshift can be tricky. To achieve high load throughput, data needs to be buffered and loaded in batches. To avoid data loss, we should also take into account that Redshift will sometimes be unavailable due to maintenance (resize, vacuum) or a long-running query. Adopting a policy of loading data into Redshift incrementally avoids data loss, and the scheduling of the data pipeline should take care of any Redshift-related operational issues.
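A minimal batch-loading sketch, assuming the buffered events have already been written to S3 as gzipped JSON and that an IAM role with read access exists (the bucket, prefix and role ARN are placeholders), reusing the connection from the table-creation sketch:

  # Placeholders: S3 location of one buffered batch and the IAM role allowed to read it.
  S3_BATCH = "s3://my-analytics-bucket/events/2020/09/02/"
  IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftCopyRole"

  copy_batch = f"""
  COPY app_events
  FROM '{S3_BATCH}'
  IAM_ROLE '{IAM_ROLE}'
  FORMAT AS JSON 'auto'
  GZIP
  TIMEFORMAT 'epochsecs';
  """

  # conn: the psycopg2 connection from the table-creation sketch above.
  with conn, conn.cursor() as cur:
      cur.execute(copy_batch)   # one COPY per batch is far faster than row-by-row INSERTs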

It is also possible for one or more activities in the AWS Data Pipeline to fail due to operational or data issues, so daily monitoring of the process helps minimize data loss. Once a failed activity is identified, it needs to be rerun manually after the cause has been rectified.

For example, the AWS Data Pipeline sometimes fails to launch the EC2 resource on which a particular activity runs. In that case the activity needs to be rerun manually, along with all the other activities that depend on it.
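One way to support such monitoring, sketched here with boto3 (the pipeline id is a placeholder), is to query the pipeline for FAILED instances and, once the cause has been rectified, mark them for rerun:

  import boto3

  client = boto3.client("datapipeline")
  PIPELINE_ID = "df-0123456789ABCDEFGHIJ"   # placeholder pipeline id

  # Find pipeline object instances whose status is FAILED.
  failed = client.query_objects(
      pipelineId=PIPELINE_ID,
      sphere="INSTANCE",
      query={"selectors": [{"fieldName": "@status",
                            "operator": {"type": "EQ", "values": ["FAILED"]}}]},
  )

  failed_ids = failed.get("ids", [])
  if failed_ids:
      print("Failed instances:", failed_ids)
      # After rectifying the cause (e.g. the EC2 resource issue), mark the instances
      # for rerun; dependent activities are picked up again by the pipeline.
      client.set_status(pipelineId=PIPELINE_ID, objectIds=failed_ids, status="RERUN")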

Cleaning, filtering, transforming and enriching data

After acquiring the raw data from all data sources, it needs to be prepared before being loaded into the Data Mart. Some of the data you will probably prefer to ignore completely, while for some data points you will want to keep only the relevant fields. Some data needs enrichment (e.g. adding a geolocation based on the IP address), while other data needs to be transformed. We might also look at ‘flattening’ the data so that the reporting and analysis queries become simpler and run faster.

One example of data cleaning is a data source where some of the records are junk and need to be cleaned before any analysis or reporting is done on them. Cleaning also covers the detection and removal of incorrect records from the data. You can write scripts (e.g. in Python or R) that perform these cleaning and validation activities and plug them in as part of the data pipeline.

The process of data validation and cleansing ensures that inconsistencies in the data are identified well before the data is used in the analytics process. The inconsistent data is then replaced, modified, or deleted to make the data set consistent.
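A minimal cleaning and enrichment sketch in Python, assuming the staged events are available as a pandas DataFrame; the geolocation helper is hypothetical and stands in for a real geo-IP lookup:

  import pandas as pd

  def geolocate(ip_address):
      """Hypothetical helper; in practice this would call a geo-IP service or lookup table."""
      return {"city": "unknown", "country": "unknown"}

  def clean_events(df: pd.DataFrame) -> pd.DataFrame:
      # Drop junk records: missing retailer id or unexpected event names.
      valid_events = {"app_opened", "order_created", "order_submitted"}
      df = df.dropna(subset=["retailer_id", "event_name"])
      df = df[df["event_name"].isin(valid_events)]

      # Remove obviously incorrect records, e.g. negative order values.
      df = df[df["order_value"].fillna(0) >= 0]

      # Enrich each record with a geolocation derived from the client IP address.
      geo = df["ip_address"].apply(geolocate).apply(pd.Series)
      return pd.concat([df.reset_index(drop=True), geo.reset_index(drop=True)], axis=1)

  # Example: clean one staged batch before loading it into the Data Mart.
  staged = pd.DataFrame([
      {"retailer_id": "R-1042", "event_name": "order_submitted", "order_value": 2500.0, "ip_address": "203.0.113.7"},
      {"retailer_id": None,     "event_name": "app_opened",      "order_value": None,   "ip_address": "203.0.113.8"},
  ])
  print(clean_events(staged))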

Analysis

User Engagement

User engagement leads to the growth of your business. When attempting to analyze the engagement of users with your product, you can measure, for example, how many events or footfalls were captured every day/week/month.

Events per week

This analysis looks at how many of those footfalls converted into actual orders and which brands, products, or categories are ordered most frequently by retailer, area, or city. This gives deeper insight into end-user (consumer) behavior geographically, which in turn surfaces additional retailer leads you might want to target in that geolocation.
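As an illustrative query against the hypothetical app_events table used in the earlier sketches, the snippet below counts footfalls per week and the share that converted into submitted orders:

  import pandas as pd

  # conn: the psycopg2 connection to Redshift created in the loading sketches above.
  events_per_week = """
  SELECT DATE_TRUNC('week', event_time) AS week,
         COUNT(*)                       AS footfalls,
         SUM(CASE WHEN event_name = 'order_submitted' THEN 1 ELSE 0 END) AS orders
  FROM   app_events
  GROUP  BY 1
  ORDER  BY 1;
  """

  weekly = pd.read_sql(events_per_week, conn)
  weekly["conversion_rate"] = weekly["orders"] / weekly["footfalls"]
  print(weekly)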

Monthly Active Users (MAU)

The problem with the events-per-week analysis above is that we do not know how many distinct users generated those events or footfalls. We can see growth in activity, but it’s hard to tell whether it is because there are more users or because the existing users used the product more often.

Therefore, another popular metric to follow in addition is monthly active users (MAU), which measures how many distinct users were active during the month. This shows how many new users we were able to acquire compared with the previous month and how many we were able to retain. It can also give an indication of how effective our sales force is in the field, which geolocations are doing well, and so on.
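A corresponding sketch for MAU over the same hypothetical table, counting the distinct retailers active in each month:

  monthly_active_users = """
  SELECT DATE_TRUNC('month', event_time) AS month,
         COUNT(DISTINCT retailer_id)     AS monthly_active_users
  FROM   app_events
  GROUP  BY 1
  ORDER  BY 1;
  """

  # Executed the same way as the weekly-events query, e.g. pd.read_sql(monthly_active_users, conn).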

Engagement charts are interesting, but they are often referred to as "vanity metrics". That’s because they don’t tell us anything about improvements in the performance of our product; engagement can go up and to the right simply because we invest more in the sales force, for example.

Funnels

Funnels allow us to measure the performance of our product. For example, a funnel tells us how likely a retailer is to place an order using your app, and an improvement in that likelihood demonstrates an improvement in the performance of the product.

Let’s say we have a three-step funnel for your app:

  • Retailer opens your app
  • Retailer creates an order
  • Retailer submits the order

We can write queries to count the distinct retailers who performed each of these three activities per day, week, or month and compute the conversion rate. We can also see how the funnel develops over time, which gives an idea of whether the product is improving.
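A sketch of such a funnel query over the same hypothetical table (the event names are assumed from the tracking sketch above); it can be run via pd.read_sql with the Redshift connection from the earlier sketches:

  funnel_per_week = """
  SELECT DATE_TRUNC('week', event_time) AS week,
         COUNT(DISTINCT CASE WHEN event_name = 'app_opened'      THEN retailer_id END) AS opened_app,
         COUNT(DISTINCT CASE WHEN event_name = 'order_created'   THEN retailer_id END) AS created_order,
         COUNT(DISTINCT CASE WHEN event_name = 'order_submitted' THEN retailer_id END) AS submitted_order
  FROM   app_events
  GROUP  BY 1
  ORDER  BY 1;
  """

  # Conversion rate per week, computed after loading the result:
  # df = pd.read_sql(funnel_per_week, conn); df["conversion"] = df["submitted_order"] / df["opened_app"]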

Retention

One of the most popular and important charts is the retention cohort analysis. It allows us to see how often our users come back to our product. We can learn how quickly users churn and what percentage of our initial users become regular/retained users.

Looking at your retention cohort chart gives you insight into whether your product is improving or not. If users who start using your product this week churn less than users who started last month, you can infer that updates to your product offering have had a positive impact on the user experience.
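A sketch of a monthly retention cohort query over the same hypothetical table: each retailer is assigned to the cohort of their first active month, and we count how many of them are active again N months later (run via pd.read_sql as before).

  retention_cohorts = """
  WITH first_seen AS (
      SELECT retailer_id,
             DATE_TRUNC('month', MIN(event_time)) AS cohort_month
      FROM   app_events
      GROUP  BY retailer_id
  ),
  activity AS (
      SELECT DISTINCT retailer_id,
             DATE_TRUNC('month', event_time) AS active_month
      FROM   app_events
  )
  SELECT f.cohort_month,
         DATEDIFF(month, f.cohort_month, a.active_month) AS months_since_first,
         COUNT(DISTINCT a.retailer_id)                   AS retained_retailers
  FROM   first_seen f
  JOIN   activity a ON a.retailer_id = f.retailer_id
  GROUP  BY 1, 2
  ORDER  BY 1, 2;
  """

  # Pivoting the result (cohort_month x months_since_first) gives the familiar retention triangle.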

Machine Learning (Predictive and Prescriptive Analytics)

You can leverage machine learning to develop recommendation systems, create a custom experience for each of your users, or estimate the potential value of each user and re-target high-value users with customized campaigns. For example, you can show a customized product list to each retailer based on an order-basket analysis of that retailer, or recommend that the retailer stock up on products expected to be in demand in the near future.
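As a very small illustration of the order-basket idea (the baskets and the co-occurrence approach are simplified assumptions, not a production recommender), the sketch below counts which products are ordered together most often and uses that to suggest products to a retailer:

  from collections import Counter
  from itertools import combinations

  # Hypothetical order baskets per retailer, e.g. extracted from the Data Mart.
  baskets = [
      ["tea", "sugar", "milk"],
      ["tea", "sugar"],
      ["bread", "milk"],
      ["tea", "milk"],
  ]

  # Count how often each pair of products appears in the same basket.
  co_occurrence = Counter()
  for basket in baskets:
      for a, b in combinations(sorted(set(basket)), 2):
          co_occurrence[(a, b)] += 1

  def recommend(product, top_n=3):
      """Suggest the products most frequently ordered together with the given product."""
      scores = Counter()
      for (a, b), count in co_occurrence.items():
          if a == product:
              scores[b] += count
          elif b == product:
              scores[a] += count
      return [item for item, _ in scores.most_common(top_n)]

  print(recommend("tea"))   # -> ['milk', 'sugar'] for the baskets above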

The Data Mart you create will hold data from multiple sources. The value lies in being able to identify which business outcomes you can impact through the integration and analysis of this data, and in having the execution capability to glean relevant insights and deliver them to the right person at the right time so they can act on them. In general, you can:

  • Gain a deeper understanding of your customers’ behaviors, needs and preferences to build a more personal relationship.
  • Improve marketing effectiveness through micro-targeting, personalization and the delivery of context- and channel-sensitive promotions and offers that increase the likelihood of purchase.
  • Optimize the supply chain to ensure the most profitable outcome in terms of demand fulfillment balanced against the cost of carrying excess inventory.
  • Spot flash trends that have an impact on demand and turn them into revenue-capturing opportunities.
