Technical Aspects of Customer Segmentation – Data Synchronization
"Extensive experience analyzing and segmenting CRM databases," "Strong grasp of audience segmentation," "Understanding and experience of data segmentation and serving personalised content." These are examples of the most common requirements placed in CRM/lifecycle marketing job offers. Although the concept of customer segmentation is fairly simple and easy to start with, becoming a segmentation pro requires a dive into technical details. With this article, we want to help you take this leap.
Table of contents:
- Storage and format complexity
- Data formats
- Synchronization frequency
- Data synchronization medium
- API-first world
- Data synchronization triggers
Customer segmentation starts with collecting customer data in one central place and making it ready for grouping and acting on. Sounds easy, but a growing number of data sources adds complexity to collecting and crunching that data.
That's why efficient segmentation starts with ensuring a consistent data flow from several data sources to a single server. This process has earned a name you might have heard a lot recently: data synchronization, or simply data sync. It's the process of establishing consistency among systems and then continuously updating them to maintain that uniformity.
All these clever words mean that data sync is primarily the engineering team's domain. But familiarizing yourself with the key concepts helps a lot. Here's an overview:
- Data synchronization – in this section we'll learn how customer data is stored, why teams want to move it around, and what roadblocks digital teams need to overcome to do so. On a more practical note, this section explains the inner workings of how modern CRM systems exchange data over the Internet with APIs.
- Data integrity & security – the next part will help you understand what it takes to keep data in a consistent state after you synchronize it. We'll learn how you can use a schema to take care of data uniqueness and prevent data duplication. Finally, we'll reflect on the technical aspects of data security and privacy, as they've become first-class citizens in post-GDPR and post-CCPA times.
- Data crunching – finally, we'll help you become fluent in working with the data itself, with tips on data filtering and making data-driven decisions in general.
Storage and format complexity
Because of recent developments in technology, the cost of data storage has fallen. This has enabled enterprises to collect enormous amounts of data.
This data can be divided into two categories: structured and unstructured. To make segmentation work, digital teams need to figure out how to turn the unstructured part into structured data.
By structured data storage we mostly mean SQL databases or Excel files. They are great, versatile tools, but they have their cons too. SQL is hard to learn for people without a technical background, and Excel's top advantage, flexibility, becomes a nightmare for the long-term maintenance of data integrity. That's why these general-purpose tools have been complemented with more focused ones like CRM, CMS, ERP, or analytics tools.
Although these job-oriented tools increase productivity in the areas they're hired for, they often become a problem at the department or company level. Why's that? Each of these tools usually uses its own data format, and unless two software platforms are integrated with each other, data exchange between them is hindered. And a modern marketer uses a helluva lot of such tools.
This is why most data sync processes require a middleman. It adapts the data format generated by the authoring software (called the "source" in the IT world) to the destination format (the "target"). The process of such adaptation is called ETL, an abbreviation of Extract (data from the source), Transform (it in a way that's recognized and accepted by the target), and Load (it into the target while maintaining data consistency).
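To make ETL less abstract, here's a minimal sketch in Python. The file names, column names, and transformation rules are all hypothetical; they only illustrate the three steps.

```python
import csv
import json

def extract(path):
    # Extract: read raw rows from the source system's CSV export
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: rename fields and coerce types so the target accepts them
    return [
        {
            "email": row["E-mail"].strip().lower(),
            "full_name": row["Name"],
            "orders_count": int(row["Orders"]),
        }
        for row in rows
    ]

def load(records, path):
    # Load: write the records in the format the target system expects
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    load(transform(extract("crm_export.csv")), "segmentation_import.json")
```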
What kind of data formats can you meet in the customer segmentation industry?
Data formats
Nowadays, segmentation software mostly uses two open formats for data sync. But what's a "data format" anyway? It's nothing more than text structured in a way that computers can understand. Let's start with the first, simpler one.
CSV – Comma-Separated Values. Imagine you take a table – a regular table from MS Word is just fine. Now remove the external borders and replace the internal ones with commas. Voilà, you've just created a CSV file.
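For illustration, a tiny CSV file could look like this (the customers and values are made up):

```
name,email,orders_count
Jane Doe,jane.doe@example.com,9
John Smith,john.smith@example.com,3
```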
The biggest advantage? Simplicity and compactness. Developers can easily export data in this format because SQL databases and Excel store data in tables.
On the other hand, CSV has two major flaws. First, within a CSV file you can't represent hierarchy – for example, you can't show that one value is nested under another. The second major drawback is that CSV is a fixed format: it lets you exchange data only according to the exact number and type of columns you defined at the beginning. If you add a column needed by another target system, or simply remove one, the import will fail with an error. In such a case, developers need to adjust the code to reflect the structural changes.
Because of the dynamics that marketing technology has gained in recent years, this limitation turned out to be a major roadblock.
For this reason, modern systems started to exchange data in XML and JSON formats. Both are based on the same concept of data representation. We'll describe the latter because it's more popular.
Here's an example of a JSON file. When we compare it to its CSV counterpart, we'll immediately find similarities – the file can store exactly the same data.
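A minimal, illustrative sketch, mirroring the made-up records from the CSV sample above:

```json
[
  {
    "name": "Jane Doe",
    "email": "jane.doe@example.com",
    "orders_count": 9
  },
  {
    "name": "John Smith",
    "email": "john.smith@example.com",
    "orders_count": 3
  }
]
```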
The striking difference is that the column names from the CSV file are repeated for every record (as property names). While this might seem redundant and impede readability for humans, it's exactly this repetition that gives JSON its flexibility – it makes the order of items irrelevant. This is helpful if you need to add a new property (a column) to an exchanged file. A target system that expects the new property will consume it, whereas targets that processed the file before the property was added will simply ignore it and keep working undisturbed.
This feature guarantees that if you add more fields to the data format sent by the source application, it won’t break the target. This is why the JSON format is said to be more scalable and more flexible than CSV.
You can read more about JSON files here.
Synchronization frequency
A common requirement today is that ecommerce systems work in real time. Customers want to see the status of their order, track their parcel in real time, or check the current balance of their account. Marketers also want to react fast – they want to run real-time campaigns that create timely shopping experiences.
To achieve that, the underlying data storages have adopted real-time synchronization.
However, there are two challenges with real-time synchronization to be mindful of.
First, a general rule of thumb: the closer to real time you want your data, the more it costs. This cost manifests itself in developers' time, but also in the hardware (today, mostly cloud infrastructure) you need to keep the data synchronization systems up and running. So, your very first task before putting down a "real-time" requirement when talking to engineers is to consider what synchronization frequency you really need. Perhaps updates sent once an hour, or once a day, are enough to ensure a great customer experience while reducing developers' work and saving budget.
The other challenge is the data extraction capabilities of the data storage you work with. Real-time synchronization can be impeded when one of the systems doesn't provide an easy – meaning developer-friendly – way to extract data. Let's analyse this issue in detail by going through how developers move data between systems.
Data synchronization medium
We've learned how the data is stored, what formats are used for exchange, and how the synchronization frequency may impact the effort of setting the whole synchronization system up. But what does it actually take to transfer data from one database to another? Well, you need a medium.
It can be physical storage like a DVD, a USB drive, or other hardware, but with today's massive amounts of data and real-time synchronization requirements, few do it this way anymore. In the vast majority of cases, it's all done via a cable – or rather many cables connecting computers around the globe – called the Internet.
To be more specific, modern software platforms use the Hypertext Transfer Protocol (HTTP) which is the foundation of the World Wide Web.
If you’re interested (and you’d better be!) in how servers talk to each other through the Internet, we strongly recommend jumping into this non-techie’s guide to servers.
In a nutshell, you can treat the HTTP protocol as a set of guidelines telling developers how to send data over the Internet. On top of HTTP, developers create Application Programming Interfaces (APIs), which are specific descriptions of what data, and in what order, can be exchanged between two systems.
The software that makes an API available on the Internet and accessible to other systems (similarly to the way you visit a regular webpage) is called an application server. All it does is listen and respond to requests coming in from other application servers and, depending on the request, add or get information from the underlying database. Useful to know: when people mention an API, it's often shorthand for the application server that exposes the API to the Internet.
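To make this less abstract, here's a minimal sketch of an application server written in Python with Flask. The endpoint paths, field names, and the in-memory "database" are all made up for illustration; a real system would sit on top of an actual database and add authentication and validation.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# A stand-in for the underlying database (illustrative only)
customers = {
    "jane.doe@example.com": {"name": "Jane Doe", "orders_count": 9},
}

# Other systems call this URL to read customer data
@app.route("/api/customers/<email>", methods=["GET"])
def get_customer(email):
    customer = customers.get(email)
    if customer is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(customer)

# ...and this one to record that a new order was placed
@app.route("/api/customers/<email>/orders", methods=["POST"])
def add_order(email):
    customer = customers.setdefault(email, {"orders_count": 0})
    customer["orders_count"] += 1
    return jsonify(customer)

if __name__ == "__main__":
    app.run(port=5000)
```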
Let’s play with the APIs then.
API-first world
APIs have become the lingua franca of today's digital marketing. Most data synchronization work today comes down to learning an API so you can extract data from it. It's like learning a new set of words in a foreign language when you already know the grammar – which, in this case, is defined by HTTP.
Although it’s a developer's job, diving into this topic might help you navigate the marketing technology world.
We have created a dedicated article with some practical examples on understanding APIs, so if you want to master it (and again, you should), give it a read here.
This is the most important excerpt from it:
“If you go to a restaurant as a customer, you are not allowed to enter the kitchen. You need to know what is available. For that, you have the menu. After looking at the menu, you make an order to a waiter, who passes it to the kitchen and who will then deliver what you have asked for. The waiter can only deliver what the kitchen can provide.
How does that relate to an API? The waiter is the API. You are someone who is asking for service. In other words, you are an API customer or consumer. The menu is the documentation which explains what you can ask for from the API. The kitchen is, for example, a server; a database that holds only a certain type of data — whatever the buyer has bought for the restaurant as ingredients and what the chef has decided they will offer and what the cooks know how to prepare.”
- Kitchen – The database, no customers allowed to protect data integrity.
- Waiter – The API, a middleman that knows how to serve data from the database without disrupting its functioning.
- Customer – An external system that wants to get their data.
- Menu – The data format reference the external systems have to use to perform their operation.
- Order – An actual single API call.
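To put the analogy to work, here's what a single "order" could look like in Python. The URL and fields are hypothetical and simply mirror the application server sketch from the previous section.

```python
import requests

# The "menu" (API documentation) tells us this endpoint exists and what it returns.
response = requests.get("https://eshop.example.com/api/customers/jane.doe@example.com")

# The "waiter" (API) brings back only what the "kitchen" (database) can provide.
print(response.status_code)  # e.g. 200
print(response.json())       # e.g. {"name": "Jane Doe", "orders_count": 9}
```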
Coming back to synchronization, the remaining question now is to figure out under what conditions two systems should exchange data.
Data synchronization triggers
Let's assume we have two application servers ready to exchange information. To be more specific, let's imagine your email service provider (ESP) wants to know how many orders a customer has placed in your e-shop, so it can send them a promo coupon after their tenth order. Now, what kind of data flow can we implement to achieve this scenario? We can distinguish three "triggers" that can start the email coupon machinery on the ESP side.
a) Data polling – in this case the ESP keeps asking the e-shop API: "Let me know the total number of orders for Jane Doe". It does so every minute, hour, or day, and after every response it recalculates the coupon sendout conditions. Your gut feeling is right if you're thinking this might be a suboptimal way to call and process data (you'll find a sketch of both this approach and the next one after this list). What's the alternative?
b) Data pushing – what if the e-shop could notify the ESP the moment Jane made her 10th order? That's exactly what modern ecommerce platforms do: they've recognized these shortcomings and included such notifications in their feature set. They are usually called webhooks or call-outs. It's a simple yet powerful mechanism: it lets you define which applications should be notified and under what conditions. You can also define what information a notification should include – sometimes it makes sense to send the full record, but often only the changed property is enough, as sketched below.
c) Batch processing – sometimes the number of requests is so huge that neither of these two methods is reasonable. The processing power needed to handle the load would harm one (or even both) of the parties. In this case, developers group all the information into a "batch" and schedule the data exchange at night or, more generally, when the servers aren't busy with daily traffic. This helps control traffic to applications and prevents unnecessary strain on your servers.
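Here's a minimal sketch contrasting the first two triggers, using the ESP/e-shop coupon scenario from above. The URLs, field names, and thresholds are hypothetical – they don't come from any real product.

```python
import time
import requests

ESHOP_API = "https://eshop.example.com/api/customers"  # hypothetical source API
ESP_WEBHOOK = "https://esp.example.com/hooks/orders"   # hypothetical target endpoint

# a) Data polling: the ESP keeps asking the e-shop on a schedule
def poll_order_count(email, interval_seconds=3600):
    while True:
        response = requests.get(f"{ESHOP_API}/{email}")
        orders_count = response.json()["orders_count"]
        if orders_count >= 10:
            print(f"{email} reached {orders_count} orders, send the coupon")
            break
        time.sleep(interval_seconds)  # wait, then ask again

# b) Data pushing: the e-shop notifies the ESP the moment the 10th order happens
def notify_esp_about_order(email, orders_count):
    # Only the changed property is sent, not the whole customer profile
    payload = {"email": email, "orders_count": orders_count}
    requests.post(ESP_WEBHOOK, json=payload, timeout=10)
```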
Useful terms: to prevent too many requests from hitting a server, application developers apply rate limiters to their applications. A rate limiter caps the number of API calls in a given period, e.g., 5000 calls per minute. Calls that exceed the quota are often throttled, which means they will be served eventually (instead of being dropped altogether) but with some delay.
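If you hit a rate limiter, the polite way to react is to back off and retry. Here's a minimal sketch assuming the server answers throttled calls with the standard HTTP 429 status and a Retry-After header – many APIs do, but check the documentation of the one you actually use.

```python
import time
import requests

def call_with_retry(url, max_attempts=5):
    for attempt in range(max_attempts):
        response = requests.get(url)
        if response.status_code != 429:  # not throttled, return the result
            return response
        # Throttled: wait as long as the server asks (or back off exponentially), then retry
        wait_seconds = int(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait_seconds)
    raise RuntimeError("Rate limit still exceeded after several retries")
```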
So far, we’ve learned how customer data is synced. But before the actual customer segmentation can happen, we need to understand how to keep the data consistent in your database. In the next part, we’ll describe how to ensure data integrity and security to achieve well-performing campaigns.