Hi! My name is David and I am the CTO of Easysize. We work with large fashion retailers on identifying and preventing returns in real time. I have 8 years of experience in data-driven products, machine learning and SaaS. In this post, I’ll share some of the techniques we employ while working with our clients’ precious data.
As with many other companies nowadays, Easysize is data-driven. All our products rely heavily on how much and what kind of data we have. Alongside the data we gather from tracking and mining, we also work with data provided directly by our clients. Here are my five best pieces of advice for doing exactly that!
Whether you are just starting out or have been working in a data-driven field for a long time, getting more data is always a priority. We require our clients to provide us with large sets of historical data before we even start the integration process. At this stage, the priority is the trust between you and your customer, and your ability to ensure the safety and security of their data.
We use the most secure data-transfer tools available. The most common practice is an SFTP server with IP filtering, which ensures that no one except us and the client has access to the raw data files.
If, like us, you rely on both historical data and continuous updates, you should help your clients set up automated data extraction. One way to do this is through an API (record by record); another is to reuse the same SFTP server for regular uploads of raw files.
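To give a rough idea, here is a minimal sketch of what such an automated upload could look like, using a paramiko-based SFTP client. The host, account and paths are hypothetical placeholders, not our actual setup.

```python
# Minimal sketch of an automated raw-file upload over SFTP.
# Host, key path and directory names are hypothetical placeholders.
import paramiko
from pathlib import Path

def upload_daily_export(local_dir: str, remote_dir: str) -> None:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        hostname="sftp.example.com",          # replace with the provider's SFTP host
        username="client-upload",             # dedicated per-client account
        key_filename="/etc/keys/upload_key",  # key-based auth instead of passwords
    )
    sftp = client.open_sftp()
    try:
        for csv_file in Path(local_dir).glob("*.csv"):
            sftp.put(str(csv_file), f"{remote_dir}/{csv_file.name}")
    finally:
        sftp.close()
        client.close()

# Typically run from cron or another scheduler, e.g. once per night:
# upload_daily_export("/var/exports/orders", "/uploads/orders")
```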
Ok, so you've got the data. Now what? You need to prepare it for future use. Since you will have a variety of sources for the same kind of data, the most important thing at this stage is to standardise its format and structure.
Even though you probably spent some time preparing a nice document with data requirements for your clients and explained all the needed formats and whatnot, I'm afraid getting clean data from your customers is just not that simple. Most of your clients will have their own unique data structure, and they usually tend to simplify their own lives rather than yours. But worry not: everyone has to deal with the same problems, and here is what we do.
We have a golden rule, and that is: Never modify initial raw files manually.
Simply by following that rule, we ended up with a pretty flexible setup: before processing the CSVs we get from our clients, we put them through a preparation step. Every client has their own data processing configuration, which allows for a variety of modifications, from renaming columns to applying regular expressions to each row. This setup lets us maintain a single process of cleaning, parsing, structuring, and loading for any file once it has gone through preparation. And we are dealing with fashion data, which is messy at best.
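To illustrate the idea, here is a simplified sketch of such a preparation step. The raw file is never touched; a per-client configuration drives the column renames and regex fixes, and a new, cleaned file is written out. The column names and regex rules here are made-up examples, not our actual configuration.

```python
# Simplified sketch of a per-client preparation step: the raw file is never
# modified; a config drives renames and regex fixes, and a new file is written.
import csv
import re

# Hypothetical configuration for one client.
CLIENT_CONFIG = {
    "rename_columns": {"Artikel": "product_name", "Groesse": "size"},
    "row_regexes": {"size": [(r"^EU\s*", "")]},  # e.g. "EU 38" -> "38"
}

def prepare(raw_path: str, prepared_path: str, config: dict) -> None:
    with open(raw_path, newline="") as src, open(prepared_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        fieldnames = [config["rename_columns"].get(c, c) for c in reader.fieldnames]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            # Rename columns, then apply the per-column regex rules.
            clean = {config["rename_columns"].get(k, k): v for k, v in row.items()}
            for column, rules in config["row_regexes"].items():
                for pattern, replacement in rules:
                    clean[column] = re.sub(pattern, replacement, clean[column])
            writer.writerow(clean)
```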
One of the biggest challenges with automated data extraction and a complicated loading process is that your parsing and cleaning logic needs to learn from somewhere. In our case, we went through several years of data verification and built substantial dictionaries that let us automatically process anything we have seen before, or anything similar to it. But a big chunk of the data still has to go through verification before it ends up in the hands of your app or API or what have you.
Try to save every last bit of information you use when verifying. This, together with a dictionary of already verified data, will help you speed up the process and get close to real time. The sketch below shows the idea.
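The class and storage format here are illustrative rather than our production code: anything verified before is resolved automatically, and anything new is queued for manual review and then remembered for next time.

```python
# Rough sketch of a verification dictionary: values seen and verified before
# are resolved automatically; unknown values are queued for manual review.
import json

class VerificationDictionary:
    def __init__(self, path: str):
        self.path = path
        try:
            with open(path) as f:
                self.verified = json.load(f)   # raw value -> canonical value
        except FileNotFoundError:
            self.verified = {}
        self.pending = set()                   # values awaiting manual verification

    def resolve(self, raw_value: str):
        """Return the canonical value if known, otherwise queue it for review."""
        key = raw_value.strip().lower()
        if key in self.verified:
            return self.verified[key]
        self.pending.add(key)
        return None

    def confirm(self, raw_value: str, canonical: str) -> None:
        """Record a manual verification so the same value is automatic next time."""
        key = raw_value.strip().lower()
        self.verified[key] = canonical
        self.pending.discard(key)
        with open(self.path, "w") as f:
            json.dump(self.verified, f)
```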
Security is the most crucial part of handling any kind of valuable data. That means you must do everything you possibly can, and more, to protect your customers' data. In our case, apart from implementing every security measure recommended by vendors like AWS, Azure and Google Cloud, and setting up all sorts of firewalls and VPNs, we decided to go one step further.
Before any data goes through our loading pipeline, it undergoes complete anonymisation. In this step, we encrypt and anonymise every piece of information that could indicate which of our clients it belongs to. Fortunately, that doesn't limit our algorithms in any way, and we still get 100% of the information out of the data. At the same time, this step makes sure that our data can never be used against our clients.
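As an illustration only (the field names and key handling here are assumptions, not our actual pipeline), a keyed hash is one simple way to pseudonymise client-identifying fields: the same input always maps to the same opaque token, but without the secret key the token cannot be traced back to the client.

```python
# Illustrative sketch of pseudonymising client-identifying fields with a
# keyed hash (HMAC-SHA256). Field names and key handling are assumptions.
import hmac
import hashlib
import os

SECRET_KEY = os.environ["ANON_KEY"].encode()   # kept outside the data pipeline

SENSITIVE_FIELDS = {"client_id", "shop_domain", "customer_email"}  # example fields

def pseudonymise(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

def anonymise_record(record: dict) -> dict:
    return {
        key: pseudonymise(value) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }

# anonymise_record({"client_id": "acme-fashion", "size": "38", "returned": "yes"})
# keeps the useful attributes intact but replaces "acme-fashion" with an opaque token.
```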
Every new client and data source might bring new data to the table. The common practice is to ignore the extra information and stick with a standard set of attributes. In the early days we did the same, but we learned the hard way that every piece of information counts and that there is no way to foresee what we might need in the future. Now we save every piece of information we get from our clients in a structured way.
Following this rule also forces you to look ahead when designing your data structures. On top of that, we make sure all our algorithms are aware that some information might be missing.
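Here is a small sketch of how this "keep everything" idea can look in practice (the field names are illustrative): known attributes go into fixed columns, anything unexpected is preserved in a structured extras field, and downstream code reads those extras defensively because they may not exist for every client.

```python
# Small sketch of keeping every attribute: known fields go into fixed columns,
# everything else is preserved in a structured "extras" blob, and downstream
# code treats every extra attribute as potentially missing.
KNOWN_FIELDS = {"order_id", "product_name", "size", "returned"}

def split_record(raw_row: dict) -> dict:
    record = {k: v for k, v in raw_row.items() if k in KNOWN_FIELDS}
    record["extras"] = {k: v for k, v in raw_row.items() if k not in KNOWN_FIELDS}
    return record

def fabric_weight(record: dict, default=None):
    """Downstream access never assumes an extra attribute is present."""
    return record["extras"].get("fabric_weight", default)
```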