There’s bad data everywhere, but nowhere is it more insidious than when you encounter it for the first time while trying to migrate a website. Bad data, in this case, is defined as any data that doesn’t conform to the expectations of the migrator and their code, whether that’s because of bad data entry, surprising contextual requirements, or historical changes in value patterns. It can come at you from a variety of directions and can have surprising origins. Discovering your bad data early, and the reasons for it, can point to issues that may exist in other parts of the migration and can also give you clues on how to effectively deal with it.
Forewarned is forearmed, so identifying bad or surprising data early allows your migration to go more smoothly and will make your client a lot happier. Here are some approaches and tips to make that migration go more smoothly.
Finding Bad Data
You may have already created a set of spreadsheets for the new site that indicate where the content in the old site is coming from, and what transformations may be needed to get it there. Just assuming that you know what the data in the old site looks like, however, is a pitfall waiting to snare you. Even if you’ve looked at a handful of records for each source object or field, you may not find the data that’s in there waiting to bite you and your migration code.
For many fields, it can be useful to do queries on a snapshot of the old database. Fields that have a limited number of distinct values should have a distinct() query run against them. For example, a field that should have a ‘Y’ or ‘N’ value, may also have ‘y’ or ‘n’ values, too, or even ‘true’, ‘false’, 0, 1, ‘’, NULL, ‘t’, ‘f’, and so on. Find all the outliers before you begin coding, so you can take them into account. For a yes/no field like this, you may want to code a process plugin that can interpret all of them into a single pair of values that you can then reuse on other fields in your migration.
Date Data
Dates are particularly troublesome fields. If the old database field is not in ISO date format, you may find that dates are recorded in all sorts of ways. Converting them all correctly for use in a MySQL date or datetime field can be difficult. Often the easiest thing to do is not try to do it yourself, but rather use PHP’s strtotime() function. But that may not be good enough if your dates might be outside of the timestamp epoch (roughly, 1970 - 2038). Then you’ll have to fallback on PHP’s date class.
This brings up the next issue with dates: they may have contextual problems. Is it ok if you find a date that’s 100 years in the future or 100 years in the past? Often, dates need to be range checked somehow.
Another problem dates can have is timezones. Often, dates don’t have an associated timezone in a database. You may need to know the default timezone for the old system, or perhaps where the client is located. Timezone date calculations are surprisingly finicky to get right.
Deeper Waters
If there are unstructured text fields in the old data, you may need to watch out for data that is outside the “normal” 7 bit ASCII code page. Even Western European languages have codes outside of this range, things like ø, ç, ü and so on, including currency signs like £ and €. Then there is multibyte string data, used for representing characters on the vast number of code pages for languages that use other alphabets or other special-purpose glyphs, like for math and physics.
There can also be problems with characters that have been entered for various symbols: emojis that employ character codes that are specific to Windows or the Mac and cause problems in their native forms.
If your MySQL database is not set up right, it might reject inserting such data. Choosing a “reasonable” format can be a challenge sometimes, but usually, the better answer is to detect these situations and write some custom code to fix them, to convert the wacky characters into something acceptable and displayable.
Unstructured Text
A common problem often seen when migrating older Drupal sites, but can also happen with other technologies as well, is that text fields will have self-referencing data encoded in it. A very common situation is that images on the site are inserted into the text with an <a>
tag: < href="sites/default/files/my_file_name.jpg">
, for example. This may not work correctly on the new site, especially if, as a part of the migration, the existing files have been reorganized.
In Drupal 7 sites, you may instead see a blob of JSON that is a media entity reference. You will have to find these and use the mapping tables created when the media entities were migrated to change them. You will probably have to find another way to deal with them, as media entities in Drupal 8 are not yet integrated into the default CKEditor WYSIWYG fields.
Getting Cozy with Your Data
While investigating the existing data, be sensitive to changes in patterns. Maybe you’re dealing with an e-commerce system and the states that an order goes through start having a different pattern at some point in time in the data. This can indicate a point when the old system was modified or upgraded, and your assumptions about what data you’ll see and how to migrate it across to the new system need to take this into account.
This is true especially if you’re migrating off a system that no longer has any technical support. Maybe because the company that implemented isn’t around anymore or is otherwise not available, or the software in use is obsolete. You may have to guess how data objects are constructed out of the database tables. There may be extra tables no longer in use - but do they have historic information that you should be capturing and moving? You will need to have several sessions with the client to ask them about particular examples in their data. Puzzling this out when the client themselves doesn’t know how their system works can keep you up at night.
You’ll have to try to rebuild the original designer’s mindset to understand how they structured the data. If the data has any complexity to it, you will have to learn how related data objects are referenced, and rebuild these connections correctly in Drupal.
Sometimes references are conditional on the state of one or more fields. For example, if there’s an original order field in the order table, maybe that is only filled in when this order is actually a return. What does the order total field represent in this case? Should it be positive or negative? These are the kinds of issues to be sensitive to. Often an initial audit of an old site won’t find these items.
Data Issues at Many Levels
These are just a few areas where problems pop up when migrating into Drupal. Issues pop up in the raw data, in the context around the data and in the actual meaning of the data. Finding and fixing these can require a great deal of careful testing and creative solution development. Being aware of the issues beforehand and following through during development can lead to a smoother, faster migration process, and a fuller understanding of what problems the client is trying to solve.
Remember to use your favorite database tool to investigate your incoming data fields like the ones above:
- Look at select and checkbox lists to verify that the values are consistent throughout.
- Check date fields for good and consistent formatting and “reasonable” data ranges, and consider if there are any timezone issues.
- Look at the content of unstructured text. Are there character set issues? Internal HTML references to the old site, especially regarding images?
- Be sure to deeply investigate how the old system is structured, especially if you don’t have access to a technical resource for the old system. Verify that you understand how, and why, the data in the old system are structured the way they are.