One of my colleagues recently said about some data in our database: "I cannot imagine the code that would produce such data." And that's correct. We don't have that code right now. But we used to.
Code changes. Data stays. This is the fundamental truth of production systems that most developers learn the hard way.
Your current codebase doesn't produce that weird state? Great. But your codebase from six months ago might have. Your codebase from two years ago definitely did. And that data is still there, waiting to break your assumptions.
When you're debugging production errors, expect the unexpected. Your code doesn't allow a given state? Well, maybe it doesn't now. But it might have allowed it in the past.
The Data Inertia Problem
You may be familiar with Hyrum's law:
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
A similar rule applies to your data. If you have a small internal database, data migration is a no-brainer. It might be just a matter of a tiny SQL migration script executed on application boot.
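For that small-database case, here's roughly what a boot-time migration can look like. This is a minimal sketch assuming SQLite; the `users` table, the `MIGRATIONS` list, and the `migrate_on_boot` helper are hypothetical, not from any particular framework.

```python
import sqlite3

# Ordered list of migrations; the position in the list is the schema version.
# Hypothetical example: older rows simply won't have the new column filled in.
MIGRATIONS = [
    "ALTER TABLE users ADD COLUMN email TEXT",
]

def migrate_on_boot(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA user_version is SQLite's built-in schema version counter.
        current = conn.execute("PRAGMA user_version").fetchone()[0]
        for version, statement in enumerate(MIGRATIONS, start=1):
            if version > current:
                conn.execute(statement)
                conn.execute(f"PRAGMA user_version = {version}")
                conn.commit()
    finally:
        conn.close()
```

Run it once at startup, before the app serves traffic, and the database is always at the version your current code expects.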
It is more problematic if your data is large and a migration takes a long time. Is it feasible to run it in one go? Is it possible to do it with near-zero downtime?
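One common answer, sketched below, is to avoid a single long-running UPDATE and backfill in small batches instead, so each transaction stays short and writers are never blocked for long. The `orders` table and its `total` / `total_cents` columns are hypothetical; the pattern is what matters.

```python
import sqlite3
import time

def backfill_in_batches(db_path: str, batch_size: int = 1000) -> None:
    """Backfill a new column in small slices rather than one giant UPDATE.
    A rough sketch assuming SQLite and a hypothetical `orders` table."""
    conn = sqlite3.connect(db_path)
    try:
        while True:
            cursor = conn.execute(
                """
                UPDATE orders
                SET total_cents = CAST(total * 100 AS INTEGER)
                WHERE id IN (
                    SELECT id FROM orders
                    WHERE total_cents IS NULL
                    LIMIT ?
                )
                """,
                (batch_size,),
            )
            conn.commit()
            if cursor.rowcount == 0:
                break  # nothing left to backfill
            time.sleep(0.1)  # give other writers room to breathe
    finally:
        conn.close()
```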
It is even more problematic if the data is shared across multiple services. How do you orchestrate the upgrade of those services? Do you prefer downtime and a maintenance window, or services that support both old and new data formats?
It is even more problematic if the data is an immutable ledger or a blockchain where users own the data and every update must be authorized by them. You cannot migrate the data on your own. Would you rather bother users to take action, or make your services backward compatible?
Avoiding the Data Inertia Problem
Before we go further: Can you avoid the data inertia problem?
Yes: Keep your data small and encapsulated. Don't share it across services. Use a regular, boring database instead of an immutable ledger. That approach is probably sufficient for most apps.
Backward Compatibility Forever
I'd say the more heavily systems rely on the data, the more inevitable backward compatibility becomes. Eventually you may reach the point where data migrations are no longer feasible.
Even if your application is no longer able to produce data in a given format, that data might still be there.
Your code needs to:
- Read old data formats
- Handle missing fields gracefully
- Support deprecated APIs alongside new ones
- Maintain multiple code paths for different data versions
- Never assume data matches your current schema
It's not elegant. It's not clean. But it's what production systems require. The alternative is breaking user data.
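For the first two points above (reading old formats and tolerating missing fields), here is a sketch of what that can look like in code. The record shapes and field names are hypothetical; the point is that the reader absorbs the history of the data instead of assuming the current schema.

```python
from dataclasses import dataclass
from typing import Any, Mapping

@dataclass
class Address:
    street: str
    city: str
    country: str

def parse_address(record: Mapping[str, Any]) -> Address:
    """Parse an address that may have been written by several generations
    of code. The formats here are made up for illustration."""
    # Current format: a nested "address" object.
    if isinstance(record.get("address"), dict):
        addr = record["address"]
        return Address(
            street=addr.get("street", ""),
            city=addr.get("city", ""),
            # Older writers never stored a country; default rather than crash.
            country=addr.get("country", "unknown"),
        )
    # Legacy format: flat fields on the record itself.
    return Address(
        street=record.get("addr_line", ""),
        city=record.get("addr_city", ""),
        country="unknown",
    )
```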
Debugging the Unexpected
When you see unexpected data in a production system, there is a subtle difference between asking "How did this happen?" and asking "What code could have created this state?"
Check git history. Look at old migrations. Review deprecated code paths. Ask your colleagues. The bug might not be in your current code. It might be in data created by code that no longer exists.
Your code doesn't allow a given state? That's fine. But your system does. Systems have history. Systems accumulate technical debt in the form of data that doesn't match current expectations.
Write code that handles historical data. Write code that validates everything. Write code that fails gracefully when it encounters the unexpected.
In complex systems it's often easier to migrate code than to migrate data. Code changes, data stays.
Appendix: Data in the AI world
People tend to say that code is cheap in the AI era. It's easy to rework an app. That's true. But what about the data?
If the data is small, this is not an issue. AI can easily write a migration script for you.
For most complex cases, you cannot avoid the inertia. Migrating a 200 GB database still takes time and careful planning. Updating an immutable ledger still requires user authorization. The fundamental constraints remain unchanged.
AI is powerful enough to change the code in minutes. Yet it has no power over data. Code changes, data stays—even in the AI era.