Be careful using Python’s dataclass

python

Python’s @dataclass is a powerful tool for creating data containers with minimal boilerplate code. However, it introduces a subtle pitfall when working with mutable defaults like lists or NumPy arrays. This article explores this common issue, its root cause, and how to fix it effectively.

Author

Dominik Lindner

Published

February 18, 2025

## 1 What’s the Problem with Mutable Defaults?

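When using @dataclass, attributes defined with default values can behave unexpectedly if the default value is a mutable object. A default is evaluated once, when the class is defined, so every instance that relies on it receives a reference to the same object. For immutable values this is harmless. For mutable objects (e.g., lists, dictionaries, NumPy arrays), an in-place change made through one instance becomes visible in all the others, which leads to unintentional coupling between instances.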

### 1.1 A Stress-Strain Data Class

I ran into this issue while rewriting some old code of mine, the FEM solver I wrote during my PhD, to follow clean architecture principles. Let’s have a look at one of its dataclasses, CStressStrainData. Its goal is to store and manipulate stress-strain-related quantities such as stress, strain, and energy that arise during processing at a single integration point.

Here’s a straightforward implementation using @dataclass:

from dataclasses import dataclass
import numpy as np

@dataclass
class CStressStrainData:
    stress: np.ndarray = np.zeros(4)
    eps_total: np.ndarray = np.zeros(4)
    energy: float = 0

Different integration algorithms require a converged solution as the starting point of each step. The MaterialModel class holds both the current and the converged data. We do not need to bother with the actual intent of the class here.

class MaterialModel:
    def __init__(self):
        self.stress_strain_converged = CStressStrainData()
        self.stress_strain = CStressStrainData()

At first glance, this seems fine. Each instance of CStressStrainData appears independent. However, this assumption is incorrect.

### 1.2 The Issue: Shared References

Consider this snippet:

model1 = MaterialModel()
model2 = MaterialModel()

# Modify eps_total in model1
model1.stress_strain.eps_total += np.array([1, 0, 0, 0])

# Inspect eps_total in model2
print(model2.stress_strain.eps_total)  
# Output: [1. 0. 0. 0.]

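Wait, what? Why did modifying model1.stress_strain.eps_total also affect model2? The issue lies in how Python handles default values. The expression np.zeros(4) is evaluated once, when the class is defined, and the resulting array is shared by every instance that uses the default.

A note on versions: this silent sharing reproduces on Python 3.10 and earlier. Since Python 3.11, @dataclass rejects any unhashable default, which includes NumPy arrays, so the class definition above raises a ValueError that points to default_factory. The underlying mechanism and the fix are the same either way.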

## 2 Why This Happens: The Core of Mutable Defaults

In Python:

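- Every name and attribute holds a reference to an object. Assignment and argument passing never copy the object.
- Immutable types (e.g., integers, floats, strings) cannot be changed in place, so sharing them is harmless: an operation such as += creates a new object and rebinds the name.
- Mutable types (e.g., lists, dictionaries, NumPy arrays) can be changed in place, so a change made through one reference is visible through every other reference to the same object.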
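To make the difference concrete, here is a minimal sketch in plain Python. The variable names are only illustrative, and a list stands in for the NumPy array so the snippet runs without third-party packages. It shows that += rebinds the name for an immutable float but modifies the existing object for a mutable list.

```python
# Immutable float: += builds a new object and rebinds b, a is untouched.
a = 0.0
b = a
b += 1.0
print(a)  # 0.0

# Mutable list: += extends the existing object in place,
# so the change is visible through x as well.
x = [0.0]
y = x
y += [1.0]
print(x)       # [0.0, 1.0]
print(x is y)  # True: both names still refer to the same list
```

NumPy arrays behave like the list here: += on an array writes into the existing buffer.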

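When you define a default value like np.zeros(4), the resulting array is stored on the class and used as the default argument of the generated __init__. Every instance created without an explicit value therefore receives the same object, and any in-place modification affects every instance referencing it.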

In our example, both model1.stress_strain.eps_total and model2.stress_strain.eps_total point to the same NumPy array.
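The mechanism can be reproduced without @dataclass at all. The following is a rough hand-written approximation of what the decorator generates, not its actual source. The class name SharedDefault is made up for this sketch, and a list again stands in for the NumPy array.

```python
class SharedDefault:
    # Evaluated once, when the class body runs.
    eps_total = [0.0, 0.0, 0.0, 0.0]

    # The default argument is bound to that one object, not to a fresh copy.
    def __init__(self, eps_total=eps_total):
        self.eps_total = eps_total


first = SharedDefault()
second = SharedDefault()

first.eps_total[0] += 1.0                    # in-place change through one instance
print(second.eps_total)                      # [1.0, 0.0, 0.0, 0.0]
print(first.eps_total is second.eps_total)   # True
```

Passing an explicit value, e.g. SharedDefault([0.0] * 4), sidesteps the sharing, which is one reason the bug can stay hidden for a long time.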

### 2.1 Fixing the Mutable Default Issue

The solution is to ensure that each instance gets its own copy of the mutable object. In @dataclass, this can be achieved using field(default_factory=...) for mutable defaults.

Here’s the corrected version:

from dataclasses import dataclass, field
import numpy as np

@dataclass
class CStressStrainData:
    stress: np.ndarray = field(default_factory=lambda: np.zeros(4))
    eps_total: np.ndarray = field(default_factory=lambda: np.zeros(4))
    energy: float = 0

### 2.2 Why Does This Work?

- The expression field(default_factory=...) stores a zero-argument callable instead of a ready-made value. The generated __init__ calls it whenever no explicit value is passed for that field.
- The lambda function (lambda: np.zeros(4)) is that callable, so each call creates a fresh, independent NumPy array.

Now, the behavior is as expected:

model1 = MaterialModel()
model2 = MaterialModel()

model1.stress_strain.eps_total += np.array([1, 0, 0, 0])
print(model2.stress_strain.eps_total)  # Output: [0. 0. 0. 0.]

Each instance of CStressStrainData now has its own independent eps_total.
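The same check can be run without NumPy. This is a small sketch under stated assumptions: the class StressStrainSketch and the helper zeros4 are hypothetical names, and lists replace the arrays. A named function is used as the factory to show that a lambda is not required; any zero-argument callable works.

```python
from dataclasses import dataclass, field


def zeros4():
    """Factory: returns a new list of four zeros on every call."""
    return [0.0] * 4


@dataclass
class StressStrainSketch:
    stress: list = field(default_factory=zeros4)
    eps_total: list = field(default_factory=zeros4)
    energy: float = 0.0


first = StressStrainSketch()
second = StressStrainSketch()

first.eps_total[0] += 1.0
print(second.eps_total)                      # [0.0, 0.0, 0.0, 0.0]
print(first.eps_total is second.eps_total)   # False: separate objects
```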

## 3 When to Use `@dataclass` and Mutable Defaults

### 3.1 Pros of `@dataclass`:

- Reduces boilerplate code by generating __init__, __repr__, and other methods.
- Works well for simple data containers with default values.

### 3.2 Cons of `@dataclass` with Mutable Defaults:

- Requires careful handling of mutable types to avoid shared state issues.
- Can become awkward for complex initialization logic.
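To give a feel for where it starts to get awkward, here is a sketch of a variant whose array length depends on another field. A default_factory takes no arguments, so it cannot see n_components, and the initialization has to move into __post_init__. The class SizedStressStrain and the field n_components are hypothetical and not part of my solver. Lists stand in for NumPy arrays.

```python
from dataclasses import dataclass, field


@dataclass
class SizedStressStrain:
    n_components: int = 4
    # init=False: not constructor arguments, filled in by __post_init__.
    stress: list = field(init=False)
    eps_total: list = field(init=False)
    energy: float = 0.0

    def __post_init__(self):
        # Runs at the end of the generated __init__, once per instance.
        self.stress = [0.0] * self.n_components
        self.eps_total = [0.0] * self.n_components


plane = SizedStressStrain()
truss = SizedStressStrain(n_components=1)
print(len(plane.stress), len(truss.stress))  # 4 1
```

It works, but the field declarations no longer tell the whole story, which is the point at which a regular class may read more clearly.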

### 3.3 General Rules:

- Use field(default_factory=...) for mutable defaults.
- Avoid defining mutable objects directly as default values.
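The dataclasses module enforces part of this rule on its own: a list, dict, or set used directly as a default is rejected when the class is defined. The sketch below, with a made-up class name Bad, shows that guard. It does not print an expected message because the exact wording is an implementation detail.

```python
from dataclasses import dataclass

message = None
try:
    @dataclass
    class Bad:
        values: list = []  # mutable default written directly
except ValueError as exc:
    # Raised at class-definition time, before any instance exists.
    message = str(exc)

print(message)  # the error text recommends default_factory
```

Do not rely on this guard alone. Depending on the Python version, other mutable types can slip through, so default_factory remains the safe habit.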

### 3.4 For complex classes, switch back to traditional classes

If the class has complex initialization or significant behavior, a traditional class definition might be more appropriate:

class CStressStrainData:
    def __init__(self):
        self.stress = np.zeros(4)
        self.eps_total = np.zeros(4)
        self.energy = 0

This approach avoids the pitfalls of shared mutable defaults and offers greater flexibility.

## 4 Key Takeaways

- Understand mutable defaults: avoid using mutable objects like lists or NumPy arrays as direct default values.
- Use field(default_factory=...): it’s the correct way to define mutable defaults in @dataclass.
- Test for shared references: use id() or inspect behavior to confirm objects are independent.
- Know when to skip @dataclass: if initialization or behavior is complex, a regular class might be a better choice.
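The check for shared references can be turned into a small helper for a unit test. This is one possible sketch, not a standard utility: shared_mutable_fields and Independent are hypothetical names, and "unhashable" is used only as a rough approximation of "mutable".

```python
from dataclasses import dataclass, field, fields


def shared_mutable_fields(a, b):
    """Names of fields where a and b hold the very same unhashable object."""
    shared = []
    for f in fields(a):
        value_a = getattr(a, f.name)
        value_b = getattr(b, f.name)
        # Identity check, restricted to unhashable values so that
        # harmlessly shared immutables (ints, floats, strings) are skipped.
        if value_a is value_b and type(value_a).__hash__ is None:
            shared.append(f.name)
    return shared


@dataclass
class Independent:
    eps_total: list = field(default_factory=lambda: [0.0] * 4)
    energy: float = 0.0


print(shared_mutable_fields(Independent(), Independent()))  # []

# Sharing introduced on purpose, by passing the same list twice:
buffer = [0.0] * 4
print(shared_mutable_fields(Independent(buffer), Independent(buffer)))  # ['eps_total']
```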

By following these best practices, you can leverage the power of @dataclass without falling into the mutable defaults trap.

## 5 Conclusion

Mutable defaults can cause subtle but impactful bugs in Python. Using @dataclass is a great way to simplify your code, but you must handle mutable objects carefully. With field(default_factory=...) and proper design, you can avoid unexpected behavior and keep your code clean and robust.