Refactored Neural Scaling Laws -- Update 1

This is a short update about my research: Critical Depth and the Scaling Law Paradox: A Refactored Resource Model

My proposed refactored model of the Chinchilla-based neural scaling laws (NSL) may now have some empirical grounding.

Indeed, a piece of research by A. Karpathy appears to support the predictions of my refactored model. My reply to his experiments is reproduced below:


Hi Andrej,

Thank you for sharing your detailed and granular empirical analysis.

Building on an earlier piece of research, A Resource Based Model For Neural Scaling Laws, I've come up with a refined profile for the NSL, as follows:

  • Structural Phase (below critical depth): ℓ ∝ Np^(-2/3)
  • Redundancy Phase (standard scaling): ℓ ∝ Np^(-1/3)
  • Width-Only Scaling: ℓ ∝ Np^(-1/2)
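These three regimes can be sketched as a toy piecewise predictor. This is a minimal illustration only: the critical parameter threshold (standing in for critical depth) and the constant prefactor are hypothetical placeholders, not fitted values.

```python
def predicted_loss(n_params, critical_params=1e8, width_only=False, prefactor=1.0):
    """Toy loss predictor for the three proposed scaling regimes.

    n_params        -- non-embedding parameter count Np
    critical_params -- hypothetical Np threshold standing in for critical depth
    width_only      -- True when depth is frozen and only width grows
    prefactor       -- arbitrary constant; only the exponents matter here
    """
    if width_only:
        return prefactor * n_params ** (-1 / 2)   # width-only scaling
    if n_params < critical_params:
        return prefactor * n_params ** (-2 / 3)   # structural phase
    return prefactor * n_params ** (-1 / 3)       # redundancy phase
```

Only the exponents carry content here; the branch conditions simply select which of the three power laws applies.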

From my theoretical derivation with width-only scaling (fixed depth): Np ∝ N^2

This implies that N ∝ Np^(1/2). Since compute (C, in FLOPs) grows proportionally to the parameter count Np, this means that N ∝ C^0.5, which is quite close to the 0.49 you measured for the optimal model size exponent.

This seems to suggest that width-only scaling beyond the critical depth is not just theoretically sound but also matches (at least in your setting) the optimal scaling behavior you've observed in practice. The slight deviation from 0.5 (0.49) is consistent with real-world constraints while supporting the core theoretical framework.
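For completeness, an exponent comparison like the one above can be reproduced with a simple log-log fit. Here is a sketch using synthetic data generated under the N ∝ C^0.5 assumption (the prefactor 3.1 and the compute range are arbitrary placeholders, not fitted constants):

```python
import numpy as np

# Synthetic compute budgets (FLOPs) spanning several orders of magnitude.
compute = np.logspace(15, 21, 20)

# Width-only prediction: optimal model size scales as C^0.5.
n_opt = 3.1 * compute ** 0.5

# Recover the exponent as the slope of a linear fit in log-log space,
# the same way an empirical exponent like 0.49 would be estimated from real runs.
slope, intercept = np.polyfit(np.log(compute), np.log(n_opt), 1)
print(f"fitted exponent: {slope:.3f}")
```

On this noiseless synthetic data the fit recovers 0.5 essentially exactly; with real training runs, noise and constraints produce small deviations such as the observed 0.49.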

Additionally, here is an illustration (from my preprint) projecting my model in relation to the Chinchilla paper:

[Figure: neural_scaling_external_labels]

This alignment between theory and empirical evidence reinforces the value of your miniseries approach -- it's helping to bridge the gap between theoretical understanding and practical implementation of scaling laws.

Best regards, Tolga


In addition, another piece of research that echoes my refactored model's predictions is a paper from the DeepSeek-AI team: DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. One element echoing my model's predictions:

  • The 0.5243 exponent is within measurement error of the -1/2 width-only scaling exponent predicted by my model.
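A quick arithmetic check of how far the DeepSeek exponent sits from the predicted 1/2 magnitude:

```python
predicted = 0.5    # magnitude of the width-only scaling exponent
observed = 0.5243  # exponent fitted in the DeepSeek LLM paper
rel_dev = abs(observed - predicted) / predicted
print(f"relative deviation: {rel_dev:.1%}")  # about 4.9%
```

A deviation of roughly 5% is the kind of gap that fitting noise across training runs can plausibly account for, which is the sense in which "within measurement error" is meant above.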

Finally, there is other research that falls in line with my model's predictions. This may warrant a new preprint on the NSL. The subject could be even more important if one wants to "align" policies/governance decisions based on evals.