Typical Legacy Data Problems

The table below was originally in an article entitled "Challenges With Legacy Data" by Scott Ambler on IBM's DeveloperWorks website.

The original article concentrates largely on integration of existing legacy databases into an object-orientated environment rather than data migration, but the typical problems found when working with legacy data are the same and the information remains relevant.

Problem

Example

Potential impact

A single column being used for several purposes

Additional information for an inventory item is stored in the Notes column. Additional information will be one or more of: a lengthy description of the item, storage requirements, or safety requirements when handling the item.

  • One or more attributes of your objects may need to be mapped to this field, requiring a complex parsing algorithm to determine the proper usage of the column.

  • Your objects may be forced to implement a similar attribute instead of implementing several attributes as your design originally described.


The purpose of a column is determined by the value of one or more other columns

If the value of DateType is 17, then PersonDate represents the date of birth of the person. If the value is 84, then PersonDate is the person's date of graduation from high school. If the value is between 35 and 48, then it is the date the person entered high school.



  • A potentially complex mapping is required to work with the value stored in the column.

Incorrect data values

The AgeInYears column for a person contains the value -3. Or the AgeInYears column contains 7 although the BirthDate is August 14 1967 and the current date is October 10 2001.

  • Your objects will need to implement validation code to ensure that their base data values are correct.

  • Strategies to replace incorrect values may need to be defined and implemented.

  • An error-handling strategy will need to be developed to deal with bad data. This may include logging of the error, attempting to fix the error, or dropping the data from processing until the problem is corrected.


Inconsistent/incorrect data formatting

The name of a person is stored in one table in the format "Firstname Surname" and in another table in the format "Surname, Firstname".



  • Parsing code will be required to retrieve and store the data as appropriate.

Missing data

The date of birth of a person has not been recorded in some records.



  • See strategies for dealing with incorrect data values.

Missing columns

You need a middle name of a person but a column for it does not exist.

  • You may need to add the column to the existing legacy schema.

  • You may need to do without the data.

  • Identify a default value until the data is available.

  • An alternate source for the data may need to be found.


Additional columns

The Social Security number for a person is stored in the database and you don't need it.

  • For columns that are required for other applications you may be required to implement them in your objects to ensure the other applications can use the data your application generates.

  • You may need to write the appropriate default value to the database when inserting a new record.

  • For database updates, you may need to read the original value and then write it back out again.


Multiple sources for the same data

Customer information is stored in three separate legacy databases.

  • Identify a single source for your information and use only that source.

  • Be prepared to access multiple sources for the same information.

  • Identify rules for choosing a preferred source when you discover the same information is stored in several places.


Important entities, attributes, and relationships hidden and floating in text fields



A notes text field contains the information ("Clark and Lois Kent, Daily Planet Publications").

  • Develop code to parse the information from the fields.

  • Do without the information.

Data values that stray from their field descriptions and business rules

The maiden name column is being used to store a person's fabric preference for clothing.

  • You need to update the documentation to reflect the actual usage.

  • Developers who took the documentation at face value may need to update their code.

  • Data analysis should be performed to determine the exact usage in case different applications are using the field for different purposes.


Various key strategies for the same type of entity

One table stores customer information using the Social Security number as the key, another uses the ClientID as the key, and another uses a surrogate key.

  • You need to be prepared to access similar data via several strategies, implying the need for similar finder operations in some classes.

  • Some attributes of an object may be immutable, their value cannot be changed, because they represent part of a key in your relational database.


Unrealized relationships between data records

A customer has a summer home. Both pieces of data are recorded in your database, but there is no relationship stored in the database regarding this fact.

  • Data may be inadvertently replicated, eventually a new address record is inadvertently created (and the relationship now defined) for the summer home even though one already exists.

  • Additional code may need to be developed to detect potential problems. Procedures for handling the problems will also be required.


One attribute is stored in several fields

The Person class requires a single name field whereas it is stored in the columns FirstName and Surname in your database.



  • Potentially complex parsing code may be required to retrieve and then save the data.

Inconsistent use of special characters



A date uses hyphens to separate the year, month, and day whereas a numerical value stored as a string uses hyphens to indicate negative numbers.



  • Complexity of parsing code increases.

  • Additional documentation required to indicate character usage.

Different data types for similar columns

A customer ID is stored as a number in one table and a string in another.

  • You may need to decide how you want the data to be handled by your objects, and then transform it to/from your data source(s) as appropriate.

  • If foreign data have a different type than original data they represent, then table joins (and hence any SQL embedded in your objects) become more difficult.


Different levels of detail

An object requires the total sales for the month but your database stores individual totals for each order, or an object requires the weight of individual components of an item, such as the doors and engine of a car, but your database only records the aggregate weight.



  • Potentially complex mapping code may be required to resolve the various levels of detail.

Different modes of operation

Some data is a read-only snapshot of information whereas other data is read write.

  • The design of your objects must reflect the nature of the data they are mapped to. Objects based on read-only data therefore cannot update or delete it.

Varying timeliness of data

The Customer data is current, Address data is one day out of date, and the data pertaining to countries and states is accurate to the end of the previous quarter because you purchase that information from an external source.



  • Your object code must reflect, and potentially report to their clients, the timeliness of the information that they are based on.

Varying default values

Your object uses a default of Green for a given value yet another application has been using Yellow, resulting in a preponderance (in the opinion of your users) of Yellow values stored in the database.

  • You may need to negotiate a new default value with your users.

  • You may not be allowed to store your default value (for example, Green is an illegal value in the database).


Various representations

The day of the week is stored as T, Tues, 2, and Tuesday in four separate columns.

  • Translation code back and forth between a common value that your object(s) use will need to be developed.


About the author

Scott Ambler

Scott W. Ambler is President of Ronin International, a consulting firm specializing in object-oriented software process mentoring, architectural modeling, and Enterprise JavaBeans (EJB) development. He has authored or co-authored several books about object-oriented development, including the recently released The Object Primer 2nd Edition. He can be reached at scott.ambler@ronin-intl.com and at his Web site at www.ambysoft.com.