
Beyond Accuracy: Measuring the Production Fragility Caused by Redundant, Irrelevant, and Low-Signal Features in Regression Models

At first glance, adding more features to a model seems like an obvious way to improve performance. If the model can learn from more information, it should make better predictions. In practice, however, this intuition hides subtle structural hazards. Every added feature creates additional dependencies on upstream data pipelines, external systems, and data quality checks. A single missing field, a schema change, or a delayed dataset can quietly degrade predictions in production.

The deeper problem isn't computational cost or system complexity: it's weight instability. In regression models, especially when features are correlated or carry weak information, the optimizer struggles to assign credit meaningfully. Coefficients can change unpredictably as the model tries to spread the influence of overlapping signals across them, and weak features may appear significant simply because of noise in the data. Over time, this leads to models that look sophisticated on paper but behave inconsistently in use.
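The dilution effect is easy to reproduce in miniature. The following sketch (illustrative only, with made-up variable names, not part of the article's main experiment) fits a regression on a feature and its near-duplicate across random subsamples: the individual weights swing wildly while their sum stays put.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x = rng.normal(0, 1, n)
x_dup = x + rng.normal(0, 0.01, n)    # near-duplicate of x
y = 3.0 * x + rng.normal(0, 0.5, n)   # true effect lives on x alone

coef_pairs = []
for _ in range(20):
    idx = rng.choice(n, 200, replace=False)           # a fresh "retraining" slice
    X = np.column_stack([x[idx], x_dup[idx]])
    coef_pairs.append(LinearRegression().fit(X, y[idx]).coef_)

coef_pairs = np.array(coef_pairs)
print("std of w1, w2 across refits:", coef_pairs.std(axis=0).round(2))
print("std of w1 + w2 across refits:", coef_pairs.sum(axis=1).std().round(3))
```

The combined weight w1 + w2 stays close to the true coefficient of 3, while each individual weight is orders of magnitude less stable, which is exactly the credit-assignment ambiguity described above.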

In this article, we examine why adding too many features can make regression models less reliable rather than more accurate. We look at how correlated features distort coefficient estimates, how weak signals get mistaken for true patterns, and why each additional feature increases production fragility. To make these ideas concrete, we work through examples using a synthetic housing-price dataset and compare the behavior of a large "kitchen-sink" model against a smaller, more stable one.

Installing and Importing Dependencies

# Install dependencies first (shell command, run outside Python):
# pip install seaborn scikit-learn pandas numpy matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")


plt.rcParams.update({
    "figure.facecolor": "#FAFAFA",
    "axes.facecolor":   "#FAFAFA",
    "axes.spines.top":  False,
    "axes.spines.right":False,
    "axes.grid":        True,
    "grid.color":       "#E5E5E5",
    "grid.linewidth":   0.8,
    "font.family":      "monospace",
})

SEED = 42
np.random.seed(SEED)

This code applies a clean, consistent Matplotlib style by adjusting the background colors and grid appearance and removing the top and right spines for a less cluttered view. It also seeds NumPy's random number generator (42) so that any randomly generated data remains reproducible across runs.

Synthetic Housing Dataset

N = 800   # training samples

# ── True signal features ────────────────────────────────────
sqft          = np.random.normal(1800, 400, N)          # strong signal
bedrooms      = np.round(sqft / 550 + np.random.normal(0, 0.4, N)).clip(1, 6)
neighborhood  = np.random.choice([0, 1, 2], N, p=[0.3, 0.5, 0.2])  # categorical

# ── Derived / correlated features (multicollinearity) ───────
total_rooms   = bedrooms + np.random.normal(2, 0.3, N)       # ≈ bedrooms
floor_area_m2 = sqft * 0.0929 + np.random.normal(0, 1, N)   # ≈ sqft in m²
lot_sqft      = sqft * 1.4    + np.random.normal(0, 50, N)   # ≈ sqft scaled

# ── Weak / spurious features ────────────────────────────────
door_color_code  = np.random.randint(0, 10, N).astype(float)
bus_stop_age_yrs = np.random.normal(15, 5, N)
nearest_mcdonalds_m = np.random.normal(800, 200, N)

# ── Pure noise features (simulate 90 random columns) ────────
noise_features = np.random.randn(N, 90)
noise_df = pd.DataFrame(
    noise_features,
    columns=[f"noise_{i:03d}" for i in range(90)]
)

# ── Target: house price ─────────────────────────────────────
price = (
      120 * sqft
    + 8_000 * bedrooms
    + 30_000 * neighborhood
    - 15 * bus_stop_age_yrs          # tiny real effect
    + np.random.normal(0, 15_000, N) # irreducible noise
)

# ── Assemble DataFrames ──────────────────────────────────────
base_cols = ["sqft", "bedrooms", "neighborhood",
             "total_rooms", "floor_area_m2", "lot_sqft",
             "door_color_code", "bus_stop_age_yrs",
             "nearest_mcdonalds_m"]

df_base = pd.DataFrame({
    "sqft": sqft,
    "bedrooms": bedrooms,
    "neighborhood": neighborhood,
    "total_rooms": total_rooms,
    "floor_area_m2": floor_area_m2,
    "lot_sqft": lot_sqft,
    "door_color_code": door_color_code,
    "bus_stop_age_yrs": bus_stop_age_yrs,
    "nearest_mcdonalds_m": nearest_mcdonalds_m,
    "price": price,
})

df_full = pd.concat([df_base.drop("price", axis=1), noise_df,
                     df_base[["price"]]], axis=1)

LEAN_FEATURES  = ["sqft", "bedrooms", "neighborhood"]
NOISY_FEATURES = [c for c in df_full.columns if c != "price"]

print(f"Lean model features : {len(LEAN_FEATURES)}")
print(f"Noisy model features: {len(NOISY_FEATURES)}")
print(f"Dataset shape       : {df_full.shape}")

This code creates a synthetic dataset designed to simulate a real-world house-price scenario in which only a small number of variables actually influence the target while many others introduce redundancy or noise. The dataset contains 800 training samples. The true signal features, square footage (sqft), number of bedrooms, and neighborhood class, represent the key drivers of home prices. On top of these, several derived features were deliberately constructed to be highly correlated with the signal features, e.g. floor_area_m2 (sqft converted to square meters), lot_sqft, and total_rooms. This setup simulates multicollinearity, a common problem in real datasets where many features carry overlapping information.

The dataset also includes weak or spurious features, such as door_color_code, bus_stop_age_yrs, and nearest_mcdonalds_m, that have little or no relationship to the house price. To further replicate the "kitchen-sink model" problem, the script generates 90 pure-noise columns, representing the uninformative features that often appear in large datasets. The target variable price is created using a known formula in which square footage, bedrooms, and neighborhood have a strong influence, the age of the nearest bus stop has a tiny effect, and random noise introduces irreducible variation.

Finally, two feature sets are defined: a lean model that contains only the three real signal features (sqft, bedrooms, neighborhood) and a noisy model that contains every available column except the target. This setup lets us directly compare how a small, high-signal feature set performs against a large, feature-heavy model full of redundant and irrelevant variables.

Multicollinearity Between Features

print("\n── Correlation between correlated feature pairs ──")
corr_pairs = [
    ("sqft", "floor_area_m2"),
    ("sqft", "lot_sqft"),
    ("bedrooms", "total_rooms"),
]
for a, b in corr_pairs:
    r = np.corrcoef(df_full[a], df_full[b])[0, 1]
    print(f"  {a:20s} ↔  {b:20s}  r = {r:.3f}")


fig, axes = plt.subplots(1, 3, figsize=(14, 4))
fig.suptitle("Weight Dilution: Correlated Feature Pairs",
             fontsize=13, fontweight="bold", y=1.02)

for ax, (a, b) in zip(axes, corr_pairs):
    ax.scatter(df_full[a], df_full[b],
               alpha=0.25, s=12, color="#3B6FD4")
    r = np.corrcoef(df_full[a], df_full[b])[0, 1]
    ax.set_title(f"r = {r:.3f}", fontsize=11)
    ax.set_xlabel(a); ax.set_ylabel(b)

plt.tight_layout()
plt.savefig("01_multicollinearity.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved → 01_multicollinearity.png")

This section demonstrates multicollinearity, a situation where multiple features contain almost the same information. The code prints correlation coefficients for three deliberately correlated feature pairs: sqft vs floor_area_m2, sqft vs lot_sqft, and bedrooms vs total_rooms.

As the printed results show, these relationships are extremely strong (r ≈ 1.0, 0.996, and 0.945), meaning the model is handed multiple signals that describe the same underlying characteristic of the house.

The scatter plots visualize this overlap. Because these features move almost in lockstep, the regression optimizer struggles to decide which one should get credit for predicting the target. Instead of assigning a clear weight to one variable, the model tends to split the influence across all of the related features in arbitrary ways, leading to unstable, diluted coefficients. This is one of the main reasons why adding redundant features can make a model ill-determined and unstable, even when its predictive performance initially looks unchanged.
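A standard numeric diagnostic for this overlap, not used in the article itself, is the Variance Inflation Factor (VIF): regress each feature on all the others and compute 1 / (1 − R²). Values above roughly 10 are conventionally read as problematic collinearity. A self-contained sketch that mirrors the sqft / floor_area_m2 / bedrooms construction above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 800
sqft = rng.normal(1800, 400, n)
floor_area_m2 = sqft * 0.0929 + rng.normal(0, 1, n)          # near-copy of sqft
bedrooms = np.round(sqft / 550 + rng.normal(0, 0.4, n)).clip(1, 6)

X = np.column_stack([sqft, floor_area_m2, bedrooms])
names = ["sqft", "floor_area_m2", "bedrooms"]

def vif(X, j):
    """VIF of column j: 1 / (1 - R²) from regressing it on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

for j, name in enumerate(names):
    print(f"{name:15s} VIF = {vif(X, j):.1f}")
```

With this construction, sqft and floor_area_m2 produce VIFs in the hundreds or above, while bedrooms stays in the single digits, putting a concrete number on the credit-assignment ambiguity shown in the scatter plots.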

Weight Instability Across Training Cycles

N_CYCLES   = 30
SAMPLE_SZ  = 300  # size of each retraining slice

scaler_lean  = StandardScaler()
scaler_noisy = StandardScaler()

# Fit scalers on full data so units are comparable
X_lean_all  = scaler_lean.fit_transform(df_full[LEAN_FEATURES])
X_noisy_all = scaler_noisy.fit_transform(df_full[NOISY_FEATURES])
y_all        = df_full["price"].values

lean_weights  = []   # shape: (N_CYCLES, 3)
noisy_weights = []   # shape: (N_CYCLES, 3)  -- first 3 cols only for comparison

for cycle in range(N_CYCLES):
    idx = np.random.choice(N, SAMPLE_SZ, replace=False)

    X_l = X_lean_all[idx];  y_c = y_all[idx]
    X_n = X_noisy_all[idx]

    m_lean  = Ridge(alpha=1.0).fit(X_l, y_c)
    m_noisy = Ridge(alpha=1.0).fit(X_n, y_c)

    lean_weights.append(m_lean.coef_)
    noisy_weights.append(m_noisy.coef_[:3])   # sqft, bedrooms, neighborhood

lean_weights  = np.array(lean_weights)
noisy_weights = np.array(noisy_weights)

print("\n── Coefficient Std Dev across 30 retraining cycles ──")
print(f"{'Feature':<18} {'Lean σ':>10} {'Noisy σ':>10}  {'Amplification':>14}")
for i, feat in enumerate(LEAN_FEATURES):
    sl = lean_weights[:, i].std()
    sn = noisy_weights[:, i].std()
    print(f"  {feat:<16} {sl:>10.1f} {sn:>10.1f}  ×{sn/sl:.1f}")


fig, axes = plt.subplots(1, 3, figsize=(15, 4))
fig.suptitle("Weight Instability: Lean vs. Noisy Model (30 Retraining Cycles)",
             fontsize=13, fontweight="bold", y=1.02)

colors = {"lean": "#2DAA6E", "noisy": "#E05C3A"}

for i, feat in enumerate(LEAN_FEATURES):
    ax = axes[i]
    ax.plot(lean_weights[:, i],  color=colors["lean"],
            linewidth=2, label="Lean (3 features)", alpha=0.9)
    ax.plot(noisy_weights[:, i], color=colors["noisy"],
            linewidth=2, label="Noisy (100+ features)", alpha=0.9, linestyle="--")
    ax.set_title(f'Coefficient: "{feat}"', fontsize=11)
    ax.set_xlabel("Retraining Cycle")
    ax.set_ylabel("Standardised Weight")
    if i == 0:
        ax.legend(fontsize=9)

plt.tight_layout()
plt.savefig("02_weight_instability.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved → 02_weight_instability.png")

This experiment simulates what happens in real production systems, where models are periodically retrained on new data. Over 30 retraining cycles, the code randomly samples subsets of the dataset and fits two models: a lean model that uses only the three important signal features, and a noisy model that uses the full feature set including the redundant and noise variables. By tracking the coefficients of the key features across retraining cycles, we can see how stable the learned weights are over time.

The results show a clear pattern: the noisy model exhibits far greater coefficient variability.

For example, the standard deviation of the sqft coefficient grows by 2.6×, and the bedrooms coefficient becomes 2.2× less stable compared to the lean model. The plotted lines make this effect clearly visible: the lean model's coefficients stay smooth and consistent across retraining cycles, while the noisy model's weights fluctuate significantly. This instability arises because the correlated and irrelevant features force the optimizer to redistribute credit unpredictably, making the model's behavior unreliable even when its overall accuracy appears the same.
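The claim that overall accuracy "appears the same" can be checked with a held-out split (the article imports train_test_split but never uses it). The sketch below uses simplified stand-in data rather than the article's dataset, with three signal columns and fifty pure-noise columns:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 800
signal = rng.normal(0, 1, (n, 3))     # stand-ins for sqft, bedrooms, neighborhood
noise = rng.normal(0, 1, (n, 50))     # pure-noise columns
y = signal @ np.array([120.0, 8.0, 30.0]) + rng.normal(0, 15.0, n)

feature_sets = {"lean": signal, "noisy": np.hstack([signal, noise])}
rmses = {}
for name, X in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=0)
    m = Ridge(alpha=1.0).fit(X_tr, y_tr)
    rmses[name] = np.sqrt(mean_squared_error(y_te, m.predict(X_te)))
    print(f"{name:5s} test RMSE = {rmses[name]:.2f}")
```

Both models land near the irreducible noise floor, within a few RMSE units of each other, which is exactly why accuracy metrics alone fail to surface the weight instability measured above.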

Signal-to-Noise Ratio (SNR) Degradation

correlations = df_full[NOISY_FEATURES + ["price"]].corr()["price"].drop("price")
correlations = correlations.abs().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(14, 5))
bar_colors = [
    "#2DAA6E" if f in LEAN_FEATURES
    else "#E8A838" if f in ["total_rooms", "floor_area_m2", "lot_sqft",
                             "bus_stop_age_yrs"]
    else "#CCCCCC"
    for f in correlations.index
]

ax.bar(range(len(correlations)), correlations.values,
       color=bar_colors, width=0.85, edgecolor="none")

# Legend patches
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor="#2DAA6E", label="High-signal (lean set)"),
    Patch(facecolor="#E8A838", label="Correlated / low-signal"),
    Patch(facecolor="#CCCCCC", label="Pure noise"),
]
ax.legend(handles=legend_elements, fontsize=10, loc="upper right")
ax.set_title("Signal-to-Noise Ratio: |Correlation with Price| per Feature",
             fontsize=13, fontweight="bold")
ax.set_xlabel("Feature rank (sorted by |r|)")
ax.set_ylabel("|Pearson r| with price")
ax.set_xticks([])

plt.tight_layout()
plt.savefig("03_snr_degradation.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved → 03_snr_degradation.png")

This section measures the signal strength of each feature by computing its absolute correlation with the target variable (price). The bar chart ranks all features by this correlation, highlighting the real high-signal features in green, the correlated or weak features in orange, and the large set of pure-noise features in gray.

The chart shows that only a small number of variables carry a meaningful predictive signal, while most contribute little more than noise. When too many low-signal or noisy features are included in the model, they lower the effective signal-to-noise ratio and make it harder for the optimizer to identify the truly important features.
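One lightweight response to a degraded signal-to-noise ratio is univariate screening before fitting: keep only features whose relationship with the target is statistically distinguishable from noise. The article itself does no selection, so this is a hedged sketch using scikit-learn's f_regression on stand-in data:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
n = 800
signal = rng.normal(0, 1, (n, 3))    # real predictors (columns 0-2)
noise = rng.normal(0, 1, (n, 60))    # pure-noise columns
X = np.hstack([signal, noise])
y = signal @ np.array([5.0, 3.0, 2.0]) + rng.normal(0, 1.0, n)

# Univariate F-test per feature; keep columns with small p-values
_, pvals = f_regression(X, y)
keep = np.where(pvals < 0.01)[0]
print("kept columns:", keep)
```

The three signal columns survive while nearly all noise columns are dropped. One caveat worth keeping in mind: with many candidate features, a few noise columns will pass any fixed threshold by chance, so in practice a multiple-testing correction or cross-validated selection is safer than a raw p-value cut.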

Feature Drift Simulation

def predict_with_drift(model, scaler, X_base, drift_col_idx,
                       drift_magnitude, feature_cols):
    """Inject drift into one feature column and measure prediction shift."""
    X_drifted = X_base.copy()
    X_drifted[:, drift_col_idx] += drift_magnitude
    return model.predict(scaler.transform(X_drifted))

# Re-fit both models on the full dataset
sc_lean  = StandardScaler().fit(df_full[LEAN_FEATURES])
sc_noisy = StandardScaler().fit(df_full[NOISY_FEATURES])

m_lean_full  = Ridge(alpha=1.0).fit(
    sc_lean.transform(df_full[LEAN_FEATURES]),  y_all)
m_noisy_full = Ridge(alpha=1.0).fit(
    sc_noisy.transform(df_full[NOISY_FEATURES]), y_all)

X_lean_raw  = df_full[LEAN_FEATURES].values
X_noisy_raw = df_full[NOISY_FEATURES].values
base_lean   = m_lean_full.predict(sc_lean.transform(X_lean_raw))
base_noisy  = m_noisy_full.predict(sc_noisy.transform(X_noisy_raw))

# Drift the "bus_stop_age_yrs" feature (low-signal, yet in noisy model)
drift_col_noisy = NOISY_FEATURES.index("bus_stop_age_yrs")
drift_range     = np.linspace(0, 20, 40)   # up to 20-year drift in bus stop age

rmse_lean_drift, rmse_noisy_drift = [], []
for d in drift_range:
    preds_noisy = predict_with_drift(
        m_noisy_full, sc_noisy, X_noisy_raw,
        drift_col_noisy, d, NOISY_FEATURES)
    # Lean model doesn't even have this feature → unaffected
    rmse_lean_drift.append(
        np.sqrt(mean_squared_error(base_lean, base_lean)))  # 0 by design
    rmse_noisy_drift.append(
        np.sqrt(mean_squared_error(base_noisy, preds_noisy)))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(drift_range, rmse_lean_drift,  color="#2DAA6E",
        linewidth=2.5, label="Lean model (feature not present)")
ax.plot(drift_range, rmse_noisy_drift, color="#E05C3A",
        linewidth=2.5, linestyle="--",
        label='Noisy model ("bus_stop_age_yrs" drifts)')
ax.fill_between(drift_range, rmse_noisy_drift,
                alpha=0.15, color="#E05C3A")
ax.set_xlabel("Feature Drift Magnitude (years)", fontsize=11)
ax.set_ylabel("Prediction Shift RMSE ($)", fontsize=11)
ax.set_title("Feature Drift Sensitivity:\nEach Extra Feature = Extra Failure Point",
             fontsize=13, fontweight="bold")
ax.legend(fontsize=10)
plt.tight_layout()
plt.savefig("05_drift_sensitivity.png", dpi=150, bbox_inches="tight")
plt.show()
print("Saved → 05_drift_sensitivity.png")

This experiment shows how feature drift can silently affect model predictions in production. The code injects a gradual shift into a weak feature (bus_stop_age_yrs) and measures how much the model's predictions change. Since the lean model does not include this feature, its predictions remain completely stable, while the noisy model becomes increasingly sensitive as the drift magnitude grows.

The resulting plot shows a prediction shift that grows steadily as the feature drifts, highlighting an important production reality: every additional feature is another potential point of failure. Even low-signal variables can introduce instability when their data distribution shifts or their upstream pipelines change.
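In production the practical defence is monitoring rather than luck: compare each feature's live distribution against a reference snapshot from training time and alert on shifts. A minimal sketch (an assumption, not the article's code) using a standardized mean-shift score; population-stability-index or Kolmogorov-Smirnov checks are common heavier-weight alternatives:

```python
import numpy as np

rng = np.random.default_rng(42)
train_ref = rng.normal(15, 5, 5000)   # e.g. bus_stop_age_yrs at training time

def drift_score(reference, live):
    """Shift of the live mean, measured in units of the reference std."""
    return abs(live.mean() - reference.mean()) / reference.std()

stable  = rng.normal(15, 5, 500)      # live batch with no drift
drifted = rng.normal(25, 5, 500)      # live batch with a +10-year shift

print(f"stable  batch drift score: {drift_score(train_ref, stable):.2f}")
print(f"drifted batch drift score: {drift_score(train_ref, drifted):.2f}")
```

A score near zero means the live batch matches training; the drifted batch scores around two reference standard deviations, which would trip any reasonable alert threshold long before the prediction-shift curve above becomes a silent production incident.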




I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I am very interested in Data Science, especially Neural Networks and its application in various fields.
