How to clean up data

Can you clean up data from a single source?

It is entirely possible to clean up and refine your data from a single source. Use a master data management tool to take control of your data tidy-up efforts.

Get to grips with your data

Now is the time to refine, organise and improve your data. Enterprise businesses worldwide are amassing ‘big data’ and ‘deep diving’ into this data for real-time access to information that helps them know and understand their customers. Dealing with data effectively means making use of the information that is available to you, across the whole spectrum of data, to transform or revolutionise your business.

However, the power of data is also its weakness; it’s only as useful as how accurate it is. Data needs to be precise, up-to-date, functional, and correct to be used for the best benefit.

Problematic data processes

Data errors can hold you back from creating real customer insights and generating information that can improve your efficiency, accountability, and success. So if it’s time for a data health check, consider whether your data handling practices are serving you well. Or do you have any of these problematic processes in place?

Activities that create data that is not useful and can’t be accessed when required, which creates a duplication of effort
Systems that do not add value to strategy, reporting, operations or business
Processes and records that are poorly integrated, or not integrated at all, with other central operating systems, or that sit adjacent to the main information network
Acquisition of data that could be used in other applications, but isn’t
Data decision processes that are not documented, clear, or are changing with individual users
Poor data handling structure and hierarchy

What are examples of data errors or “Dirty Data”?

If you have any of these ‘bad data habits’ happening in your enterprise, it’s likely you’ll be inadvertently causing data errors, degrading the overall quality of your data set. Errors can occur in data in several ways, most frequently when the data is:

Out of date
Incomplete
Inaccurate
Duplicated
Inconsistent

Nowadays, it’s basically impossible to have an absolutely perfect data set. When you have thousands upon thousands of data records, it is not realistic to expect that these records will have a 100% accuracy rate. Data quality is about accepting errors, understanding how they occur, and setting up processes, practices, and reviews to do what you can to improve the overall integrity of your data.

Out of date data

Things change - it’s inevitable. One type of data particularly prone to becoming out of date quickly is customer records. Customers move, and they change email addresses and mobile phone numbers. Often you can use online tools to assist, such as prompts and reminders during transactions or using EDM contacts to encourage subscribers to update their details.

But this should be done subtly and carefully, as too many prompts for information might make some customers unsubscribe from your accounts or unfollow your profile. Consider how you might incentivise the record updating process and help customers to feel motivated to keep their details up to date.

Internal data can become out of date too. Data that is superseded may not be adequately eliminated or disposed of. Staff might not include records held in long-forgotten folders or directories in data mergers and updates. When new systems are implemented, due diligence may not be applied during rollout or integration, and some of the old data can remain.

Old data can negatively impact reporting and limit your ability to get a complete picture of business processes and transactions. Data is especially at risk of becoming out of date when:

Key employees and data contributors change roles or leave the business
Organisations restructure, merge or rebrand
Software systems become obsolete, are not updated, are superseded or replaced
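
One simple way to spot out of date records is to flag anything that has not been touched within a set period. The sketch below is only an illustration: the field names (id, email, last_updated), the sample records and the two-year cut-off are all assumptions, not part of any particular system.

```python
from datetime import date, timedelta

# Hypothetical customer records; the field names are illustrative only.
customers = [
    {"id": 1, "email": "ana@example.com", "last_updated": date(2024, 5, 1)},
    {"id": 2, "email": "ben@example.com", "last_updated": date(2020, 1, 15)},
    {"id": 3, "email": "cy@example.com", "last_updated": None},
]


def find_stale(records, today, max_age_days=730):
    """Return ids of records not updated within max_age_days (or never dated)."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        r["id"]
        for r in records
        if r["last_updated"] is None or r["last_updated"] < cutoff
    ]


# The reference date is passed in so the check is repeatable.
stale_ids = find_stale(customers, today=date(2024, 6, 1))
print(stale_ids)
```

A list like this could feed a gentle "please confirm your details" prompt rather than an automatic deletion, in keeping with the caution above about over-prompting customers.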

Incomplete data

A record can be defined as incomplete if it lacks the key fields you need to process the incoming information before sales and marketing take action.

When you are only able to collect and store some of the information related to products, options, and customers, you can create incomplete data records. One simple way to ensure you are collecting the data you need is to make all fields mandatory on forms and records. When asking customers to subscribe, or when completing an order, take steps to ensure you are acquiring as much information as possible.

Details like gender and age might not seem so important at the point of sale; however, this type of data is very useful for profiling and understanding your customer segments. Reporting on segmentation requires an aggregation of key fields and complex entries to be able to run effectively, so you need a full set of data to do this properly.

You may have data fields available but not use them consistently, or not see completing them as a necessity. The more fields you can populate, the more insights you can gather from your data. Fields that do not seem critical when you set out collecting data may become so later on. Incomplete entries in certain fields can result in missed profiling and revenue raising opportunities.

Another reason you might have incomplete data is as a result of systems integration, where systems use different metadata, tags, and titles to hold information about your products.
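
To make the idea of an incomplete record concrete, here is a minimal sketch that lists which key fields are absent or blank. The choice of required fields (name, email, postcode) and the sample records are assumptions for illustration; your own key fields will depend on what sales and marketing need.

```python
# Assumed key fields; replace with the fields your own processes rely on.
REQUIRED_FIELDS = ("name", "email", "postcode")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    return [f for f in required if not str(record.get(f) or "").strip()]


# Hypothetical records for illustration.
records = [
    {"name": "Ana Ruiz", "email": "ana@example.com", "postcode": "2000"},
    {"name": "Ben Lee", "email": "", "postcode": "3000"},
    {"name": "  ", "email": "cy@example.com"},
]

for r in records:
    gaps = missing_fields(r)
    if gaps:
        print(r, "is missing", gaps)
```

Treating a whitespace-only value as missing is a design choice here; it catches fields that were technically filled in but carry no usable information.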

Inaccurate data

Your data can become inaccurate when errors are made in acquiring, recording, and creating records and files. Non-deliberate errors are caused by customers when they, say, enter their email address instead of a street address in a field.

There are tools that you can use to help ensure you get accurate information when it is created. It is much better to get data right at the point of creation than trying to sort and tidy it up later. Validation tools can be used in records systems or on websites to cross-check and validate information coming into corporate systems. You can prevent fake street addresses by adding a location checker.

We can all imagine customers trying to enter ten zeros for their mobile number because they don’t want to be contacted by phone. Data checkers and validators can draw on multiple data sources, not just customer input, to ensure accuracy is created in the data creation process.
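
The two examples above (an email address typed into the street address field, and a mobile number made of ten zeros) can be caught with simple checks at the point of entry. This is a rough sketch only: the function name, the deliberately loose email pattern and the minimum length of eight digits are assumptions, and real address or phone validation would normally draw on a dedicated service.

```python
import re

# Deliberately loose pattern: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_contact(street_address, mobile):
    """Return a list of problems found in the submitted contact details."""
    problems = []
    if EMAIL_RE.match(street_address.strip()):
        problems.append("street address looks like an email address")
    digits = re.sub(r"\D", "", mobile)  # keep digits only
    if len(digits) < 8:  # assumed minimum length
        problems.append("mobile number is too short")
    elif len(set(digits)) == 1:
        problems.append("mobile number is a single repeated digit")
    return problems


print(validate_contact("ana@example.com", "0000000000"))
print(validate_contact("12 High Street", "0412 345 678"))
```

Returning a list of problems, rather than a simple pass or fail, lets a form show the customer exactly what to correct.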

Duplicate data

Firstly, concerning customer records, duplicates can throw out your capability to accurately report on transactions, the number of subscribers and customer profiles and demographics. Having multiple records for the same customer or product can cause a problem. Reporting can be impacted because marketing data will be out and incorrect as a result of these duplicated records.

Duplicate records can occur in systems in a number of ways. Most commonly, they will be related to customers who have changed contact details or addresses, establishing a new account rather than updating an existing account. On occasion, when people forget a password or which email address the account used, they go ahead and make another profile, resulting in two records for the same customer.

Duplicate customer records can make it difficult to determine how much customers are spending and even trickier when they have a complaint or grievance about a product or order. Some customers might not know they have an existing account, especially when they can’t seem to log in with what they thought was their account name and details. One way to reduce this problem is to make it easy for customers to check if they have an account with a prominent and straightforward account checker.

Customers also expect that when they contact you, you will have ready access to any information, receipts, transactions and emails you have on their file. If you have to go looking for other records within the system when they contact you, it looks terrible, and distinctly as if you don’t have your data under control.

Customers are often unimpressed when they receive more than one contact at a time because it is quickly apparent to them that you don’t have your data in order. Other ways duplicate entries occur include:

Flaws in data migration processes
Issues with third party connectors and plugins
Errors in manual data entry
Batch import faults

Duplicate data related to products and stock can also cause issues. It is especially frustrating for customers when they can’t complete a transaction. They may see goods available on site but, because of a duplicate record, be unable to complete the transaction because what they are after isn’t actually in stock. Other duplicate data problems that can arise include:

Incorrect supply in storage or product availability figures
Skewed metrics related to campaigns
Poor engagement online
Less opportunity for automated marketing
Inefficiency in data handling and workflows
Lost revenue if there are duplicate account holders

Many systems now have automatic data checking processes that can be switched on to help you identify identical or closely matching records.
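
As an illustration of what identifying identical or closely matching records can involve, the sketch below pairs up customer records that share an email address (ignoring case and stray spaces) or have very similar names. It uses SequenceMatcher from Python's standard difflib module; the 0.85 similarity threshold, the field names and the sample records are assumptions, and this is not how any specific product performs its matching.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def likely_duplicates(records, threshold=0.85):
    """Return pairs of record ids that share an email or have very similar names."""
    pairs = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            same_email = normalise(a["email"]) == normalise(b["email"])
            name_score = SequenceMatcher(
                None, normalise(a["name"]), normalise(b["name"])
            ).ratio()
            if same_email or name_score >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


# Hypothetical customer records for illustration.
customers = [
    {"id": 1, "name": "Jane Smith", "email": "jane@example.com"},
    {"id": 2, "name": "Jane  Smith", "email": "JANE@example.com "},
    {"id": 3, "name": "Jayne Smith", "email": "jayne.s@example.org"},
    {"id": 4, "name": "Robert Brown", "email": "rob@example.com"},
]

print(likely_duplicates(customers))
```

Comparing every record with every other record is fine for a small sample but grows quickly with volume, and name similarity alone can produce false matches, so candidate pairs like these are best reviewed before any merge.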

Inconsistent data

Inconsistent data makes it very difficult to get a complete picture of your operations and reduces the faith you can have in your data, because it can’t be seen as entirely accurate and holistic.

With records that have inconsistent metadata, attributes and criteria, you can’t report on them holistically. It is also challenging to carry out integration on data that has different structure types and fields. Other systems use attributes such as file names in different ways.

If files haven’t been set up with a correct name or have different formats, including in how publication details and dates are written, it can be hard to merge all of the records to gain a complete picture. It matters not only at the point of collection but also in how data is stored and filed within networks and systems.
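
Dates are a common example of inconsistent formatting. The following sketch converts dates written in a few different styles into a single YYYY-MM-DD form so records can be merged and compared. The list of accepted formats is an assumption and would need to match whatever your own systems actually produce; month names are parsed using the default English locale.

```python
from datetime import datetime

# Assumed input formats; extend the tuple to match what your systems produce.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")


def to_iso_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


for raw in ["2023-03-07", "07/03/2023", "7 March 2023", "next Tuesday"]:
    print(raw, "->", to_iso_date(raw))
```

Returning None for anything unrecognised, rather than guessing, leaves those values visible as exceptions for someone to review.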

Data veracity

Because you can never guarantee an absolutely accurate data set, you will need to accept or tolerate errors. Veracity is a term used to describe the existence of inherent biases, murkiness or inconsistencies. Data collection or processing methods may have varied over time; there might be uncertainty about the accuracy of some components.

Systems complications, the application of security settings or conditions or even digital assets and infrastructure can impact the overall integrity of your data. Considering veracity helps you and your business determine your tolerance level for errors or omissions in the data you have.

How to manage multiple data errors

When you know you have errors, you will need to assess what is realistic for your business, what resources you have available to work on data integrity and the degree of accuracy you are happy to settle for. You will rarely have the resources, time, or capacity to have a person manually look through and update all of your records. You will want some software on your side to identify and resolve as many issues as possible for you.

Although there will always be some errors within your data, the best ways to manage multiple errors are to pay attention to:

Quality data entry and collection processes with clear standards, hierarchies and defined fields
Databases which are kept up to date and reviewed for accuracy
Fields that are created to provide good value for reporting, giving you accurate information pulled from correct records
Consistency across data collection points, website and accounts

What is the best data cleansing strategy?

To carry out data cleansing, a strategy will help you articulate a clear vision for what to achieve, the level of accuracy you are seeking, what opportunities will arise from adequate data cleaning, and then a clearly defined process for how the cleansing should occur. The strategy should cover:

People - you will need people to help establish, run and review data cleansing efforts and clearly detail key responsibilities and stakeholders
Process - a data cleansing strategy should clearly outline the systems and techniques for conducting data assessment and clean-up
Systems - a straightforward approach for running data cleanups needs to articulate the systems involved and the hierarchy of decision making for corrections and amendments

Step 1: Make clear what data is there

To begin with, make a comprehensive assessment of the data in your systems. Seek to determine which data is:

Present but in an incorrect format
Inconsistent
Incorrect or out of date
Redundant or duplicate

Step 2: Set goals and targets for data

Next, set a plan for your data by determining what your internal requirements are and thinking of data from the customer's point of view. What information do they require access to? What will make it easier for them to complete a purchase? What level of personalisation do you want to provide to each user?

Step 3: Design a data quality model

This involves thinking even more closely about data to make your data as valuable and relevant as possible. Consider and structure data fields, parts and characteristics. Compare your ideal data standards with the findings of your initial data assessment.

Step 4: Integrate ‘data quality rules’

At this point, you will be able to use your clean-up system to define the rules you want to check and apply to your data.

Step 5: Run the data clean up

When you get started with your data clean-up, you will be running extensive assessment tools to help you be confident your data quality rules are being implemented and used effectively. The data cleaning process will help you identify any erroneous records or data exceptions to the stated data quality rules you are checking against.
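
Steps 4 and 5 can be pictured as a set of named rules plus a pass over the records that reports every exception. The sketch below is a generic illustration and not the rule syntax of any particular clean-up system; the rule names, product fields and sample records are all assumptions.

```python
# Each rule is a name plus a check that returns True when the record passes.
# These rules and field names are illustrative assumptions.
RULES = {
    "sku_present": lambda r: bool(str(r.get("sku") or "").strip()),
    "price_not_negative": lambda r: (
        isinstance(r.get("price"), (int, float)) and r["price"] >= 0
    ),
    "name_not_blank": lambda r: bool(str(r.get("name") or "").strip()),
}


def find_exceptions(records, rules=RULES):
    """Map each failing record's index to the names of the rules it breaks."""
    exceptions = {}
    for index, record in enumerate(records):
        failed = [name for name, check in rules.items() if not check(record)]
        if failed:
            exceptions[index] = failed
    return exceptions


# Hypothetical product records for illustration.
products = [
    {"sku": "TS-001", "name": "T-shirt", "price": 19.95},
    {"sku": "", "name": "Mug", "price": 8.50},
    {"sku": "HT-003", "name": " ", "price": -1},
]

print(find_exceptions(products))
```

Keeping the rules in one named collection means new rules can be added without touching the code that runs them, and the exception report says which rule each record broke.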

Step 6: Monitor data quality in relation to the targets

Your data cleanup will give you insight into the strengths and weaknesses of your data. Use the cleanup findings to propel you towards achieving the data goals you established in step 2.
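
One way to relate clean-up findings back to the goals from step 2 is to track the share of records that pass every rule and compare it with a target. The sketch below assumes a target of 95% purely for illustration; the function names are hypothetical and the right figure depends on your own tolerance for errors.

```python
def pass_rate(total_records, failing_records):
    """Share of records that passed every rule, as a value from 0 to 1."""
    if total_records == 0:
        return 1.0  # nothing to fail
    return (total_records - failing_records) / total_records


def meets_target(total_records, failing_records, target=0.95):
    """True when the pass rate reaches the target set in step 2."""
    return pass_rate(total_records, failing_records) >= target


# Example: 200 records checked, 14 of them broke at least one rule.
print(f"{pass_rate(200, 14):.0%}", meets_target(200, 14))
```

Recording this figure after each clean-up run gives a simple trend to review over time.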

Data cleansing steps to take

Data cleansing discovers errors and issues in data, identifying weaknesses and gaps, monitoring data use and storage, and protecting data that is used, wherever it is stored around the business information network. The data may be inactive, in motion, or in use. Data cleansing systems can look at data at any point along the data lifecycle stages of:

Creation
Transmission
Usage
Storage
Architecture
Destruction

The process should help you improve data quality, making it easier to search, use, combine, compare and access.

Data cleansing doesn’t happen only once

Achieving a clean data set is an ongoing process, and it is wise to regularly review how accurate and complete your data is. Data cleansing should occur regularly, at set schedules through the life of your business.

How Pimcore can clean up data

Pimcore is a robust open-source master data management system that can help manage data through cohesive analysis, testing, monitoring, and reporting. It has the unique capability to modify records to improve integrity while keeping a version history to enable you to track differences. It can compare versions of records and track modifications and changes made to product records and files.

Suppose you have a data collection of any considerable size, and the data has issues related to data inaccuracy, inconsistency, duplication, or other errors.

In that case, you will need both an effective data cleansing strategy and a data assessment system. Your plan will outline the minimum standards you want for your data, so you’ll need to think about how the data needs to be used.

When you use a master data management product like Pimcore, a single source of truth methodology is put into place. This means that there is a master set of information and product details that are used throughout the data network.

There is less risk of duplicate files and other errors, because assets are managed centrally and securely within one system.

Data cleansing can occur from within the master data management program and does not need to be scheduled and carried out through the siloes and separate software systems and structures. This helps ensure data is both relevant and accurate.