
Machine learning and fraud prediction: What data is required to make it work?

What data does a machine learning engine need to predict chargebacks and fraud? Let's take a deep dive.


There is more noise than signal when it comes to machine learning (ML) and its role in fraud detection – or more accurately, fraud prediction.

Opinions vary from the deeply cynical to the almost magical, and the net result is a great deal of confusion over what it is and is not capable of doing.

For an introduction to ML and how we use it at Ravelin, listen to my colleague Dr. Eddie Bell’s excellent podcast.

In this blog post, I’ll attempt to tackle one key aspect of ML: data.

  • What data does a fraud prevention model need to be effective?
  • Why does it need this data?
  • What does "effective" mean anyway?

To start at the end, effective means accurate – and accurate means correctly predicting, most of the time, which orders and customers are likely to be linked to fraud. To make those predictions accurately, Ravelin needs access to a merchant’s data.

NOTE: Let's say straight away that in this blog post, we’re simplifying matters. Any merchant will have data specific to their own business and that’s great; more data is always better than less. However, there are general truths that we can talk about, and that's what we'll do here.

Types of data

One consequence of magical thinking around ML is the belief that the models will somehow come up with an accurate prediction from minimal inputs. Unfortunately, this is not the case.

Ravelin requires a reasonable amount of data to make good predictions. The better the data, the better the predictions. This is managed through the integration process at the start of an engagement, where the data is consumed through the API – whose full documentation you can access right here.
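To make the shape of that integration concrete, here is a minimal sketch of what pushing customer data to a fraud-detection API can look like. The endpoint URL, field names and auth header below are illustrative placeholders, not Ravelin's actual schema – the API documentation linked above is the source of truth.

```python
# Hypothetical example: sending a customer record to a fraud-detection API.
# The URL, field names and auth scheme are placeholders, not Ravelin's API.
import requests

API_KEY = "sk_live_placeholder"  # illustrative key, not a real credential

event = {
    "customerId": "cust-61283",
    "registeredAt": "2024-05-01T12:30:00+00:00",
    "email": "jane@example.com",
    "device": {"deviceId": "dev-9f2c", "type": "phone"},
    "location": {"latitude": 51.5074, "longitude": -0.1278},
}

resp = requests.post(
    "https://api.example.com/v2/customer",  # placeholder endpoint
    json=event,
    headers={"Authorization": f"token {API_KEY}"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # e.g. an acknowledgement, or a fraud score
```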

Identity, behavior and networks

Ravelin uses a micro-model architecture: many small, discrete machine learning models that combine, in aggregate, to give a prediction.

But for clarity, we can bundle them into three categories:

  1. Identity - who the customer is
  2. Behavior - what the customer does
  3. Network - who the customer associates with

The percentages in the diagram are purely indicative of how much each category of a merchant's data might contribute to a prediction and a determination. We can dig into a little more detail on each.

[Diagram: Identity, Behavior and Network models combining into a machine learning prediction]

Identity Model

The identity model is everything a merchant can tell us about the customer on their system: the initial sign-up, email, location, device, timestamps – this can be anything up to 100 attributes, but usually much less. You can read the API documentation here.
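As a hedged illustration of what an identity model might consume, here is a small sketch of turning sign-up attributes into features. The attribute names and heuristics are invented for this example; they are not Ravelin's actual feature set.

```python
# Illustrative identity features derived from sign-up data.
# Attribute names and heuristics are invented for this sketch.
from datetime import datetime, timezone

FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}

def identity_features(customer: dict) -> dict:
    email = customer["email"].lower()
    local_part, domain = email.split("@")
    # Expects an ISO-8601 timestamp with a UTC offset, e.g. "2024-05-01T12:30:00+00:00"
    signed_up = datetime.fromisoformat(customer["registeredAt"])
    return {
        "account_age_days": (datetime.now(timezone.utc) - signed_up).days,
        "free_email_domain": domain in FREE_EMAIL_DOMAINS,
        "email_has_digits": any(ch.isdigit() for ch in local_part),
        "device_is_emulator": customer.get("device", {}).get("emulator", False),
    }

customer = {
    "email": "jane1987@gmail.com",
    "registeredAt": "2024-05-01T12:30:00+00:00",
    "device": {"deviceId": "dev-9f2c", "emulator": False},
}
print(identity_features(customer))
```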

Ensemble machine learning: It is tempting to see these models as discrete and atomic, but it is important to realize that they are not. These models can predict individually, but not as effectively as when they are combined into what are called ensemble models. Combined models are many times more effective than models working in isolation.
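To show the idea – and only the idea, since the scores and labels below are invented for illustration – here is a minimal stacking sketch: three micro-model scores are combined by a meta-model that learns how much weight to give each one.

```python
# Toy stacking example: three micro-model scores (identity, behavior,
# network) combined by a meta-model. All numbers are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [identity_score, behavior_score, network_score]
scores = np.array([
    [0.20, 0.10, 0.05],  # looks clean on every model
    [0.90, 0.85, 0.70],  # suspicious everywhere
    [0.15, 0.80, 0.90],  # clean identity, risky behavior and network
    [0.60, 0.20, 0.10],  # odd identity, otherwise clean
])
labels = np.array([0, 1, 1, 0])  # fraud outcomes from chargebacks/reviews

meta = LogisticRegression().fit(scores, labels)
print(meta.predict_proba([[0.3, 0.7, 0.8]])[:, 1])  # combined fraud probability
```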

Behavior Model

This is the big one. This is everything a customer does on the site: their actions, their orders and how they pay. For the technically-minded, there is more API documentation to explore here.

This is a rich seam for machine learning fraud detection. It is where we find the most variety in the data types available, but equally where we find the most compelling contributions to fraud prediction accuracy. This can easily reach 200 or so attributes, and within those attributes, the models can mine thousands of features.
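As a sketch of what mining features from order history can look like – with invented field names, not Ravelin's actual schema – consider a handful of velocity and payment features derived from a customer's orders:

```python
# Illustrative behavioral features mined from order history.
# Field names are invented; real models derive thousands of such features.
from datetime import datetime, timedelta

def behavior_features(orders: list[dict], now: datetime) -> dict:
    recent = [o for o in orders if now - o["createdAt"] < timedelta(hours=24)]
    values = [o["amount"] for o in orders] or [0]
    cards = {o["paymentMethodId"] for o in orders}
    failed = sum(o.get("failed", False) for o in orders)
    return {
        "orders_last_24h": len(recent),     # order velocity
        "avg_order_value": sum(values) / len(values),
        "max_order_value": max(values),
        "distinct_cards_used": len(cards),  # card cycling is a red flag
        "failed_payment_ratio": failed / max(len(orders), 1),
    }

now = datetime(2024, 5, 2, 12, 0)
orders = [
    {"createdAt": now - timedelta(hours=2), "amount": 45.0, "paymentMethodId": "card-1"},
    {"createdAt": now - timedelta(hours=1), "amount": 250.0, "paymentMethodId": "card-2", "failed": True},
]
print(behavior_features(orders, now))
```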

Orders versus customers: An important point to make here is that while in this blog post we’ve focused on a customer-centric view of predictions, it is equally possible to do it with an order-centric view. It uses slightly different models, and a different dashboard, but we’ve found the prediction results to be very good. Customer history, however, definitely provides a richer resolution in fraud prediction.

Network Model

Finally, we have the network model. This is especially well-developed in Ravelin as it is something we know adds significant marginal gain when combined with the other models.

Here, we pull information such as a device ID or location information, and quickly map out connections in the data that look highly suspicious. This model is less data-intensive as it pulls from other sources. There are also JS snippets available that pull data from your site and app, making it a very straightforward part of the integration process.
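To illustrate the principle with a toy example – the graph library, thresholds and IDs here are our own choices for this sketch, not Ravelin's implementation – linking customers through shared devices and flagging dense clusters might look like this:

```python
# Toy network example: link customers that share a device, then flag
# unusually large connected clusters. Thresholds and IDs are invented.
import networkx as nx

events = [
    ("cust-1", "dev-a"), ("cust-2", "dev-a"), ("cust-3", "dev-a"),
    ("cust-4", "dev-b"), ("cust-5", "dev-c"), ("cust-5", "dev-a"),
]

G = nx.Graph()
for customer, device in events:
    G.add_edge(customer, device)  # bipartite customer-device graph

for component in nx.connected_components(G):
    customers = {n for n in component if n.startswith("cust-")}
    if len(customers) >= 3:  # many accounts on shared hardware looks suspicious
        print("suspicious cluster:", sorted(customers))
```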

Why does Ravelin need this data?

The algorithms that underpin Ravelin are built on the experiences of our existing clients. They evolve constantly and we continually update the models for our clients based on the data we receive, the review feedback from the client’s analysts, and what we are learning across our client base in general.

We also employ investigations analysts to look into specific anomalies or errors, and in aggregate their findings can result in model adaptations.
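As a simplified sketch of that feedback loop – not Ravelin's actual pipeline, and with an arbitrary model choice – retraining can amount to folding freshly reviewed labels back into the training set:

```python
# Simplified sketch of folding analyst review labels back into training.
# Not Ravelin's actual pipeline; the model choice here is arbitrary.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def retrain(X_hist, y_hist, X_reviewed, y_reviewed):
    """Refit on historical outcomes plus freshly reviewed cases."""
    X = np.vstack([X_hist, X_reviewed])       # feature matrices
    y = np.concatenate([y_hist, y_reviewed])  # fraud labels (0/1)
    return GradientBoostingClassifier().fit(X, y)
```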

In short, Ravelin is a powerful fraud prediction engine from the get-go. However, for a new client, the engine is highly reliant on the quality and quantity of the data that is fed into it. Where there are gaps, the performance simply is not as good.

Trust. Then analyze.

Now, data quality is a hard thing to define. No-one wants to admit that their baby is ugly, and there are often hard conversations about how and why certain data is missing.

This is something that the Integrations team is very used to. Equally, the Detection team at Ravelin is creative about working around gaps and ensuring optimal performance (recommending the optimal block and acceptance rates for a business).

This is an open, productive and valuable process, worth investing the time and energy in getting right. We believe the integration phase is the bedrock of trust in engaging with Ravelin.
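On the block and acceptance rates mentioned above: as a rough illustration – the score distribution and the 2% target are invented – tuning a block threshold can be as simple as picking the score quantile that blocks the desired share of orders.

```python
# Toy example of tuning a block threshold to a target block rate.
# The score distribution and the 2% target are invented for illustration.
import numpy as np

def threshold_for_block_rate(scores: np.ndarray, block_rate: float) -> float:
    """Score cut-off above which roughly `block_rate` of orders fall."""
    return float(np.quantile(scores, 1.0 - block_rate))

scores = np.random.default_rng(0).beta(2, 8, size=10_000)  # fake fraud scores
cutoff = threshold_for_block_rate(scores, 0.02)            # block ~2% of orders
print(f"block orders scoring above {cutoff:.3f}")
```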

Learn more about machine learning here.
