Shopify's Approach to Leveraging Recursive Embedding and Clustering to Enhance Data Explainability

Shopify recently published an engineering blog post describing an internal machine learning process for extracting actionable insights from customer signals. Turning data into insight that supports decision-making is a central challenge for any online business. Shopify shares its methodology and experience in tackling this problem by clustering diverse data sets with a method that combines dimensionality reduction, recursion, and supervised machine learning. According to the post, the approach yields strong results and better explainability, helping user researchers and data scientists deepen their understanding, refine their solutions, and iterate more efficiently toward a final solution. The method also includes an explainability layer, which makes it easier to validate findings and communicate them to stakeholders. The following diagram shows the method at a high level.

Overall Workflow Diagram

The author of the blog post proposes a method with four simple steps:

Make the data manageable.

Cluster it.

Understand it (and predict it).

Communicate it.

The first step is to make the data manageable by finding a way to visualize it. In practice, the data is usually high-dimensional. One common approach is to apply a dimensionality reduction technique such as Principal Component Analysis (PCA). The main limitation of PCA is that, in many cases, two dimensions cannot represent all of the information. The author suggests using the state-of-the-art Uniform Manifold Approximation and Projection (UMAP) technique instead. The main difference is that PCA is a linear projection, whereas UMAP is a non-linear method that aims to preserve both the local and the global similarity of points in the lower-dimensional space, so it can capture non-linear relationships in the data. As an example, the author compared the results of the two techniques on the MNIST (Modified National Institute of Standards and Technology) dataset, in which each handwritten digit from 0 to 9 is represented by 784 dimensions. The following figures show the differences.
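The limitation of a two-dimensional PCA projection can be checked directly. The following is a minimal sketch, not Shopify's code. It assumes scikit-learn's bundled 8x8 digits dataset (64 features) as a small stand-in for MNIST, and the UMAP parameters shown are illustrative choices, not values taken from the blog post. The UMAP step only runs if the third-party umap-learn package is installed.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled 8x8 digits dataset (64 features) stands in for MNIST (784 features).
X, y = load_digits(return_X_y=True)

# Linear projection to 2 dimensions.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X)

# Share of the total variance that survives the 2D linear projection.
retained = float(pca.explained_variance_ratio_.sum())
print(f"Variance kept by 2 PCA components: {retained:.1%}")

# Non-linear projection with UMAP, guarded because umap-learn is an optional dependency.
try:
    import umap

    # n_neighbors and min_dist are illustrative values, not taken from the blog post.
    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
    X_umap = reducer.fit_transform(X)
    print("UMAP embedding shape:", X_umap.shape)
except ImportError:
    X_umap = None
    print("umap-learn is not installed; skipping the UMAP projection")
```

Plotting X_pca and X_umap side by side, colored by y, gives a comparison in the spirit of the figures described above.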

Once the data has been visualized and we have an initial sense of it, the next step is to create meaningful clusters. As mentioned in the article, the clustering should have the following properties for explainability:

A point belongs to a cluster if the cluster exists.

If you need parameters for your clustering, make them intuitive.

Clusters should be stable, even when changing the order of the data or the starting conditions.

Numerous clustering algorithms exist, such as K-Means and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). HDBSCAN combines hierarchical clustering with the density-based DBSCAN method to yield more robust and meaningful clusters. According to the post, extensive experimentation at Shopify showed that HDBSCAN consistently produced more meaningful and stable results.

To understand cluster behavior more deeply, the clustering is applied recursively. Re-clustering the points inside a cluster reveals finer structure within it. Once enough clusters have been established, supervised techniques, notably classification, can be applied. An established classifier such as XGBoost can be trained as a one-versus-all model for each cluster.

In addition, SHAP (SHapley Additive exPlanations) improves interpretability by revealing the primary drivers of each cluster. Combining HDBSCAN for clustering, XGBoost for classification, and SHAP for explainability yields a comprehensive methodology for understanding the behavior of diverse clusters.
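SHAP attributes a model's prediction to its input features using Shapley values. To show what that attribution means, the sketch below computes exact Shapley values by brute force for a toy scoring function. It is not the shap library's algorithm. In practice an explainer such as shap.TreeExplainer would be applied to the trained classifier. The function shapley_values, the linear toy model, and the choice to fill in "missing" features from background rows are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, background):
    """Exact Shapley values of predict(x) relative to the mean prediction on background."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their values from x; the rest come from background rows.
        data = background.copy()
        cols = list(subset)
        data[:, cols] = x[cols]
        return predict(data).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Weight of a coalition of size k in the Shapley formula.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Hypothetical one-versus-all score for a cluster: a simple linear function of 3 features.
weights = np.array([2.0, -1.0, 0.0])


def predict(data):
    return data @ weights


rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))  # reference rows, e.g. a sample of all points
x = np.array([1.5, -0.5, 3.0])         # the point whose score we want to explain

phi = shapley_values(predict, x, background)
print("per-feature contributions:", np.round(phi, 3))
```

The contributions sum to the difference between the score for x and the average score on the background rows, and the third feature, which the toy model ignores, receives a contribution of zero. The brute-force loop grows exponentially with the number of features, which is why dedicated explainers are used on real models.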

In the final stage, the findings are communicated to the data science group and other stakeholders, and the process is iterated on, if needed, to reach the final solution.

A similar methodology has also been used successfully in other disciplines like anomaly detection in health data.

Many machine learning engineers found this work exciting. One of them commented on the LinkedIn post about this work: