Does Natural Data Really Have a Power-Law Spectrum?

Motivation

The rank-ordered eigenvalues of sample covariances of natural datasets appear to exhibit a power-law falloff. This is upstream of the scaling laws for the loss of LLMs as compute is increased, whether by training set size or model size [1]. The scaling laws are important for obvious reasons, perhaps the most important being that the absence of any indication of their abatement promises steady improvement with scale. There is of course also the question of whether the scalings can be improved upon, or at least whether their prefactors can, which would lead to more rapid progress and cost-effectiveness in training. As such, scaling constitutes a broad and active subfield of ML research.

Many interesting theoretical papers exploring the origins and limits of scaling utilize the idealized assumption of power-law data as a starting point. That is, data which either before or after some feature mapping is sampled from a Gaussian with structured covariance whose eigenvalues fall off as a power-law. This assumption is particularly well-motivated by looking at the rank-ordered plot of eigenvalues of sample covariances for natural data after feature mapping, but a full comparison requires comparing the spectral densities, which I have not seen in the literature. For this reason, I set out to compute the theoretical spectral density for power-law data and compare it to natural data.

In this post I leverage the Silverstein equation [6] to numerically compute the spectral densities for power-law data in the infinite-dimension limit and use the implementation to study natural data more closely. The case of Gaussian input data, yielding the famous Marchenko-Pastur law for its eigenvalue distribution, already provides the qualitative picture of double descent and its mitigation by regularization, but it does not exhibit power-law behavior and cannot sensibly be compared directly to natural data or feature-mapped natural data. For this reason, I numerically solve the Silverstein equation, which I use to compute the theoretical sample spectral densities in the large-dimension limit at finite overparameterization. Then I compare to natural image and text datasets, where we see vastly improved fits for non-linearly mapped data, as well as interesting structure in the deviations from theory. In particular, I observe log-periodic oscillatory behavior in the fractional empirical excess of the sample eigenvalue distributions.

The code needed to produce the figures in this post is housed at this GitHub repo.

Background

First I will review what is being calculated and its motivation from wanting to understand the generalization error or loss. A useful starting point is the double-descent analysis of Advani and Saxe [2]. They consider a student-teacher model in the context of linear regression, with outputs generated by $$y= X w_* +\epsilon $$ where $w_* \in \mathbb{R}^D$ has entries sampled from a Gaussian with unit variance for simplicity, the design matrix of inputs $X \in \mathbb{R}^{P \times D}$ has entries sampled from a zero-mean Gaussian with variance $\frac{1}{D}$, and the noise $\epsilon$ is sampled from a zero-mean Gaussian with variance $\sigma_\epsilon^2$. Using the model $$ \hat y = X w $$ we would like to know how the generalization error behaves in training, in particular to prevent over-training. The updates for $w$ come from the gradient of the mean-squared error as usual, $$d w = -\eta\left( X^T X w - X^T y \right) $$ Some linear algebra reveals that the generalization error, with the updates for $w$ viewed as a continuous function of time $t$ defined in units of a time-step $\tau$, takes the following form $$E_g(w(t)) = \frac{1}{D} \sum\limits_i \left(e^{-2\lambda_i t/\tau}+\frac{\sigma_\epsilon^2}{\lambda_i}(1-e^{-\lambda_i t /\tau})^2 \right)$$ where the $\lambda_i$ are the eigenvalues of $X^T X$ (proportional to those of the sample covariance defined below, with the overall scale absorbed into $\tau$). This equation reveals the phenomenon of double descent, with the accumulation of many small eigenvalues near equiparameterization (i.e. when $q \sim 1$ with $q = \frac{D}{P}$) driving up the noise term. In the infinite-dimensional limit, i.e. $D, P \to \infty$ with $q$ fixed, this sum becomes an integral over the spectral density, the continuous analogue of the discrete set of eigenvalues above. So it is the eigenvalue distribution of the sample covariance $$ \Sigma_S = \frac{X^T X}{P} $$ along with the noise scale, which dictates the loss landscape in this student-teacher model, and motivates for instance early stopping as a regularization technique to prevent over-training.
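To make the formula concrete, here is a minimal sketch (my own, not taken from the linked repo) that samples the student-teacher setup and evaluates $E_g(t)$ from the sample-covariance eigenvalues; the dimensions, noise level, and time grid are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch (not the repo's code): sample the student-teacher setup and
# evaluate E_g(t) from the sample-covariance eigenvalues.
rng = np.random.default_rng(0)
D, P, sigma_eps = 1000, 1100, 0.5                    # q = D/P ~ 0.9, near equiparameterization
X = rng.normal(0.0, 1.0 / np.sqrt(D), size=(P, D))   # entries with variance 1/D
lam = np.linalg.eigvalsh(X.T @ X)                    # eigenvalues of X^T X; rescaling only shifts tau
lam = np.clip(lam, 1e-12, None)                      # guard the 1/lambda noise term

def gen_error(t, lam, sigma_eps):
    """E_g(t) = (1/D) sum_i [ exp(-2 lam_i t) + sigma^2/lam_i (1 - exp(-lam_i t))^2 ],
    with t measured in units of the time-step tau."""
    return np.mean(np.exp(-2.0 * lam * t) + (sigma_eps**2 / lam) * (1.0 - np.exp(-lam * t))**2)

ts = np.logspace(-2, 4, 300)
Eg = np.array([gen_error(t, lam, sigma_eps) for t in ts])
t_opt = ts[np.argmin(Eg)]                            # early-stopping time that minimizes E_g
```

Plotting $E_g$ against $t$ for $q$ near one shows the characteristic dip followed by a rise, as the noise term weighted by the many small eigenvalues takes over.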

Double descent in linear regression is perhaps the simplest setting in which the effect of the spectrum of the input data on the loss can be analyzed. But as Kaplan et al. showed [1], even in the case of complex, deep networks for LLMs, power-law scaling of the loss remains robust. This has motivated the study of scaling in more non-trivial toy models, and there has been a plethora of work in this domain. One such paper is the work of Maloney, Roberts, and Sully [3], which utilized tools from random matrix theory to compute the generalization error in random feature models, which have additional layers of linear mapping. Along the way, they note that non-linear mapping of features in natural datasets extends the number of dimensions over which there is a power-law falloff in the eigenvalues. It is worth scrutinizing the claim of power-law behavior in natural data more closely, and to do so it is prudent to first introduce some of the technical tools needed.

Resolvents, Stieltjes transform, and Spectral Density

It is useful to recall the matrix resolvent, the Stieltjes transform, and their relationship to the spectral density. The matrix resolvent of an $N \times N$ matrix $\mathbf{A}$ is:

$$ \mathbf{G}_{\mathbf{A}}(z) = (z \mathbf{I}-\mathbf{A})^{-1}$$

The Stieltjes transform of the matrix $A$ is a scalar function obtained by taking the normalized trace

$$g_{\mathbf{A}}(z) = \frac{1}{N} \text{Tr}[\mathbf{G}_{\mathbf{A}}(z)] = \frac{1}{N}\sum\limits_k \frac{1}{z-\lambda_k}$$

This sum over the eigenvalues can of course be written as an integral over the spectral density:

$$g(z) = \int\limits_{-\infty}^\infty \frac{\rho(\lambda)}{z-\lambda}d\lambda $$

This equation and its "inverse" are crucial as there exists a self-consistent equation for $g(z)$ which we can then leverage to determine the spectral density. The inversion of the above is:

$$\rho(\lambda) = -\frac{1}{\pi}\lim_{\epsilon \to 0^+}\text{Im}[g(\lambda+i\epsilon)] $$ The matrices we want to apply this to are sample covariance matrices for input data with design matrix $X \in \mathbb{R}^{P \times D}$ whose rows are drawn from a Gaussian with population covariance $$\Sigma_P = \text{diag}(\lambda_1, \dots, \lambda_k, \dots, \lambda_D) $$

with $\lambda_k \sim k^{-\alpha}$ for some non-negative exponent $\alpha$. The classic Marchenko-Pastur case corresponds to $\alpha = 0$. As stated above, we want to compute the Stieltjes transform of the sample covariance

$$ \Sigma_S = \frac{X^T X}{P} $$

in the limit as $P, D \to \infty$ with $q = D/P$ held fixed. When this sample covariance is built from Gaussian input vectors, it is called a Wishart matrix. A white Wishart samples every component from the same Gaussian. A structured Wishart has non-trivial structure in the eigenvalues of the population covariance, with each component sampled from a Gaussian of different variance. The theoretical distribution of sample covariance eigenvalues determined by the equations above is what we will compare to the empirical sample covariance eigenvalue distributions for natural data, such as CIFAR-10, MNIST, and WikiText.
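As a point of reference for what follows, here is a small sketch (my own choices of $D$ and $P$, not the repo's code) that draws a structured Wishart sample with power-law population eigenvalues and computes the empirical sample-covariance spectrum.

```python
import numpy as np

# Sketch: Gaussian data with power-law population covariance lambda_k = k^(-alpha),
# and the resulting sample-covariance spectrum. alpha and q match Figs. 3-5;
# D and P themselves are arbitrary choices.
rng = np.random.default_rng(1)
alpha, D, P = 1.5, 2000, 1000                          # q = D / P = 2
pop_eigs = np.arange(1, D + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((P, D)) * np.sqrt(pop_eigs)    # rows ~ N(0, Sigma_P)
sample_eigs = np.linalg.eigvalsh(X.T @ X / P)          # spectrum of Sigma_S
rank_ordered = np.sort(sample_eigs)[::-1]              # for the log-log rank-ordered plot
```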

Silverstein Equation

The final ingredient needed is the most non-trivial: the self-consistent equation for $g(z)$. This is the Silverstein equation:

$$ g(z) = \int \frac{1}{t \Delta +z}dH(t) $$

where $\Delta$ is

$$\Delta = -1+q-qzg(z) $$

and $dH(t)$ is the density of the eigenvalues of the population covariance matrix, $dH(t) \sim t^{-1-1/\alpha}$ for $\alpha > 0$ and $\delta(t-1)$ when $\alpha =0$. Indeed we recover the usual equation for $g(z)$ in the $\alpha = 0$ case

$$ \frac{1}{g(z)} = z-1+q-qz g(z) $$

This equation can be solved numerically for a given real $z$ plus a tiny imaginary part. Then $-\frac{1}{\pi}\text{Im}[g(z+i\epsilon)]$ gives us our desired spectral density. This classic case is also of course analytically tractable, with

$$g(z) = \frac{z-1+q\pm \sqrt{(z-\lambda_+)(z-\lambda_-) }}{2qz} $$

where $\lambda_\pm = (1\pm\sqrt{q})^2$. The discontinuity across the branch cut yields the Marchenko-Pastur distribution

$$\rho(z) = \frac{\sqrt{(\lambda_+-z)(z-\lambda_-) }}{2\pi qz}+\mathbb{1}_{q > 1}(1-1/q)\delta(z)$$
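The bulk part of this closed form is straightforward to evaluate directly; a small helper for it (hypothetical, not from the repo), omitting the point mass at zero, might look like:

```python
import numpy as np

# Hypothetical helper (not from the repo): the bulk Marchenko-Pastur density above,
# omitting the point mass at zero that appears for q > 1.
def mp_density(z, q):
    lam_minus, lam_plus = (1.0 - np.sqrt(q))**2, (1.0 + np.sqrt(q))**2
    z = np.atleast_1d(np.asarray(z, dtype=float))
    rho = np.zeros_like(z)
    inside = (z > lam_minus) & (z < lam_plus)
    rho[inside] = np.sqrt((lam_plus - z[inside]) * (z[inside] - lam_minus)) / (2.0 * np.pi * q * z[inside])
    return rho
```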

I present the numerical, analytical, and empirical comparison in Fig. 1.

mp spec density
Figure 1. White Wishart, Marchenko-Pastur law. The dashed curve is the numerical solution and the solid curve is the analytic expression.

For comparison I also present the fractional excess of the empirical density relative to the theory (i.e. the ratio of the empirical to the theoretical density, minus one) in Fig. 2 below. The dispersion is symmetric, as one would expect.

mp frac excess
Figure 2. The fractional excess of the empirical spectral density relative to theory. The dispersion is symmetric.
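Concretely, the fractional excess is computed from a binned empirical density divided by the theory curve; the binning below (log-spaced bins with geometric centers) is my own choice here and not necessarily what the repo uses.

```python
import numpy as np

# One way to compute the fractional excess; binning choices are mine, not necessarily the repo's.
def fractional_excess(eigs, theory_density, n_bins=60):
    eigs = np.asarray(eigs, dtype=float)
    pos = eigs[eigs > 1e-12]
    edges = np.logspace(np.log10(pos.min()), np.log10(pos.max()), n_bins + 1)
    counts, _ = np.histogram(pos, bins=edges)
    rho_emp = counts / (len(eigs) * np.diff(edges))      # zero modes still count toward the normalization
    centers = np.sqrt(edges[1:] * edges[:-1])            # geometric bin centers
    rho_th = np.array([theory_density(c) for c in centers])
    keep = rho_th > 0
    return centers[keep], rho_emp[keep] / rho_th[keep] - 1.0
```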

For generic $\alpha > 0$, the $t$-integral in the self-consistent equation is not analytically tractable, but it can be handled numerically. I discuss the numerical implementation below.

Numerical Implementation

For the numerical implementation I need to be slightly more specific about what I mean by the distribution $dH(t)$. Its normalization diverges as $t \to 0$, so a finite lower cutoff $\Lambda > 0$, corresponding to the smallest population eigenvalue, is needed. With this cutoff, the normalized density on $t \in [\Lambda, 1]$ is

$$ dH(t) = \frac{1}{\alpha}\frac{\Lambda^{1/\alpha}}{1-\Lambda^{1/\alpha}}t^{-1-1/\alpha}\,dt$$
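To see where this comes from, note that it is just the continuum limit of the discrete spectrum $\lambda_k = k^{-\alpha}$: treating the rank $k$ as uniformly distributed on $[1, D]$ and changing variables to $t = k^{-\alpha}$ gives

$$ dH(t) = \frac{1}{D}\left|\frac{dk}{dt}\right| dt = \frac{1}{\alpha D}\, t^{-1-1/\alpha}\, dt, \qquad t \in [\Lambda, 1], \qquad \Lambda = D^{-\alpha},$$

which matches the normalized expression above, since $\Lambda^{1/\alpha} = 1/D$ implies $\frac{\Lambda^{1/\alpha}}{1-\Lambda^{1/\alpha}} = \frac{1}{D-1} \approx \frac{1}{D}$ at large $D$.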

Additionally, in order to improve numerical stability, it is best to switch to a logarithmic integration variable (here $t = \Lambda e^{-u}$) and to divide both sides of the self-consistent equation by $g(z)$. This way the numerical solver is not trying to identify cancellations between large numbers, but reliably computes an order-one quantity when looking for a solution. This form of the self-consistent equation is

$$ 1 = \frac{1}{\alpha}\frac{1}{1-\Lambda^{1/\alpha}} \int\limits_{\log(\Lambda)}^0 \frac{e^{u+u/\alpha} }{g(z)(\Lambda\Delta+z e^{u}) }du$$

In practice, I use SciPy's numerical integration and root finder to find the solution of this equation for complex $g(z)$, giving $z$ a small imaginary part. The imaginary part of this solution (times $-1/\pi$, as above) yields the spectral density at that value of $z$. These are all the tools needed in order to compare to data.
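The full implementation lives in the linked repo; below is only a minimal sketch of the idea under my own choices of interface, default imaginary part $\epsilon$, and initial guess $g \approx 1/z$ (none of which are necessarily those of the repo). It solves the rescaled equation above with `scipy.integrate.quad` and `scipy.optimize.root`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import root

def solve_g(z, q, alpha, Lam, eps=1e-4, g0=None):
    """Solve the rescaled self-consistent equation for g(z + i*eps), with a
    power-law population density dH(t) ~ t^(-1-1/alpha) on [Lam, 1]."""
    zc = z + 1j * eps
    pref = 1.0 / (alpha * (1.0 - Lam ** (1.0 / alpha)))

    def residual(gv):
        g = gv[0] + 1j * gv[1]
        Delta = -1.0 + q - q * zc * g
        def integrand(u):
            return np.exp(u + u / alpha) / (g * (Lam * Delta + zc * np.exp(u)))
        re_val, _ = quad(lambda u: integrand(u).real, np.log(Lam), 0.0, limit=200)
        im_val, _ = quad(lambda u: integrand(u).imag, np.log(Lam), 0.0, limit=200)
        # residual of 1 = pref * integral, split into real and imaginary parts
        return [pref * re_val - 1.0, pref * im_val]

    if g0 is None:
        g0 = 1.0 / zc                      # large-z asymptotics as a starting point
    sol = root(residual, x0=[g0.real, g0.imag], method="hybr")
    return sol.x[0] + 1j * sol.x[1]

def spectral_density(z, q, alpha, Lam, eps=1e-4, g0=None):
    g = solve_g(z, q, alpha, Lam, eps=eps, g0=g0)
    return abs(g.imag) / np.pi             # = -Im[g]/pi on the physical branch
```

In practice, sweeping $z$ from the largest eigenvalue downward and warm-starting each solve from the previous solution helps keep the root finder on the physical branch.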

First it is instructive to look at the rank-ordered plot of the eigenvalues of the sample covariance. These are the plots typically presented for natural datasets, and it is worth keeping in mind what they look like for exactly power-law data. This is presented in Fig. 3.

structured wishart
Figure 3. Rank-ordered eigenvalues for power-law data with $\alpha = 1.5$ and $q = 2$.

There is a bit of a tail at the end, and by eye it is difficult to discern what behavior at the smaller eigenvalues is to be expected for samples from pure power-law data and what constitutes a systematic deviation. It will be fruitful to examine the spectral density and the fractional empirical excesses in order to analyze this. I compare the numerical solution for the spectral density to empirics for $\alpha = 1.5$ and $q = 2$ in Fig. 4, finding solid agreement.

power spec density
Figure 4. Spectral density for data with $\alpha = 1.5$ and $q = 2$.

The fractional empirical excess is presented in Fig. 5. Again, symmetric dispersion is found, indicative of the absence of systematic deviations. There are very few large eigenvalues in absolute terms, so the dispersion naturally becomes quite large in that regime, where finite-dimensional fluctuations matter.

power frac excess
Figure 5. Fractional empirical excess for data with $\alpha = 1.5$ and $q = 2$.

It is also interesting to consider equiparameterized samples, i.e. samples with $q \sim 1$. Close to equiparameterization, there are two power-laws in two parts of the distribution, depicted for $\alpha = 1.7$ and $q = .99$ in Fig. 6. The larger eigenvalues fall off as the population density dictates, $\sim z^{-1-1/\alpha}$, before transitioning to the $z^{-1/2}$ falloff of the density of smaller eigenvalues. The latter is also observed in the white Wishart case. There is again good agreement, with symmetric dispersion. I emphasize this point as there are systematic deviations in the case of natural data which may be indicative of interesting phenomena.

power equi
Figure 6. Equiparameterized, structured Wishart with $\alpha = 1.7$ and $q = .99$. The line at smaller eigenvalues has slope $-1/2$ and the line at larger eigenvalues has slope $-1-1/\alpha$.
power equi excess
Figure 7. Fractional excess of empirical distribution relative to theory. The dispersion is symmetric.

Natural Data: Images

CIFAR-10

It is time to compare the theoretical spectral density for power-law population covariances to natural data. We can start with the most naive comparison, which is to directly compare the sample covariance of the CIFAR-10 training set to the power-law spectral density implied by its parameters. The first obvious issue with this is that the rank-ordered plot makes pretty manifest that there isn't an unambiguous power, and the falloff looks more like a geometric distribution, as I depict in Fig. 8. On the other hand, there is no unambiguous way to assess the tail of this plot, which is why it is crucial to analyze the spectral density.

cifar rank ordered
Figure 8. Rank-ordered, CIFAR-10 training set covariance eigenvalues.

Despite the initial power-law stretch, the tail is more than an artifact of finite overparameterization, as demonstrated by the spectral density in Fig. 9.

cifar spec density
Figure 9. Spectral density of the CIFAR-10 training-set covariance eigenvalues. The theory curve uses the exponent implied by the large-eigenvalue behavior in the rank-ordered plot, the $q$ implied by the input dimension and training-set size, and a cutoff fit to agree with the data.

The agreement is very poor for the raw CIFAR data, but as has been noted previously, the non-linear activations so central to neural networks are capable of extending the power-law [3]. This is demonstrated clearly by the plot of rank-ordered eigenvalues, presented in Fig. 10.

cifar relu
Figure 10. Rank-ordered eigenvalues of the sample covariance for feature-mapped CIFAR-10.
cifar relu frac
Figure 11. Rank-ordered eigenvalues divided by the power-law. Some log-oscillatory behavior is present.

The ReLU mapping not only extends the spectrum beyond the trivial rank limitation, but drastically improves the agreement with a power-law spectral density. We see this in Fig. 12. The data has been feature-mapped up to 10k feature dimensions and then ReLU-mapped. We therefore now use an overparameterization ratio of $q = \frac{1}{5}$ and a power-law exponent $\alpha = 1.23$. These naive parameters agree extremely well.
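The post does not pin down the feature map beyond "mapped up to 10k feature dimensions and then ReLU-mapped"; the sketch below assumes a random Gaussian projection followed by ReLU, which is one standard choice, and loads CIFAR-10 via torchvision as one convenient option (the repo may do both differently).

```python
import numpy as np
import torchvision

# Sketch assuming a random Gaussian projection + ReLU feature map (N = 10k),
# with CIFAR-10 loaded via torchvision; both are my choices, not necessarily the repo's.
train = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
X = train.data.reshape(len(train.data), -1).astype(np.float32) / 255.0   # (50000, 3072)
X -= X.mean(axis=0)

rng = np.random.default_rng(2)
N = 10_000
W = rng.standard_normal((X.shape[1], N), dtype=np.float32) / X.shape[1] ** 0.5
F = np.maximum(X @ W, 0.0)                       # ReLU-mapped features, shape (P, N)
F -= F.mean(axis=0)

# Eigenvalues of the N x N sample covariance via the singular values of F.
# (Memory- and compute-heavy; subsample rows of X for a quick look.)
eigs = np.linalg.svd(F, compute_uv=False)**2 / F.shape[0]   # q = N / P = 1/5
```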

cifar relu density
Figure 12. ReLU mapped CIFAR-10 data, mapped up to 10k feature dimensions. Theoretical spectral density for $\alpha = 1.22$ and $q = 1/5$.

Systematic deviations do remain at small eigenvalues, though.

cifar relu density zoom
Figure 13. Zoomed-in version of Fig. 12, where a systematic discrepancy exists.
cifar relu frac
Figure 14. The fractional empirical excess of the spectral density exhibits log-periodic oscillations at small eigenvalues.

This excess is reminiscent of the log-periodic oscillations observed in rank-ordered word frequencies around the Zipf's law prediction. The ReLU mapping improves the power-law and reveals two clear periods of oscillation in the spectral density, with a period of about a decade. Perhaps the phenomenon of log-periodic oscillations around a power-law in the spectra of natural data has some universality.

MNIST

The MNIST dataset has the amusing property that, despite its training set size far exceeding its input dimension, it does not have a full-rank covariance matrix. This is not a widely advertised fact, though it has been noted elsewhere [9]. The raw input spectral density is in Fig. 15. We see the transition in the scaling of the empirical density characteristic of equiparameterization. The rank of the covariance in feature (pixel) space can vary by a few depending on the numerical precision, but is around 711.
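This is easy to check directly. The sketch below uses sklearn's OpenML copy of MNIST as one convenient source and takes the first 60k rows as the training split (which I assume matches the usual ordering); the reported rank depends on the tolerance used by `matrix_rank`.

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Quick rank check of the centered MNIST training-set covariance.
# Data source, split convention, and rank tolerance are choices, not the repo's.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data[:60_000].astype(np.float64) / 255.0   # assumed training split, (60000, 784)
X -= X.mean(axis=0)
cov = X.T @ X / X.shape[0]
print(np.linalg.matrix_rank(cov))                    # well below 784, consistent with the text above
```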

mnist raw density
Figure 15. The spectrum of the MNIST training set. Despite the 60k images, the covariance is not full rank.

This near-equiparameterization is maintained after feature mapping to 10k feature dimensions and applying ReLU.

mnist relu density
Figure 16. The spectrum after random feature mapping and ReLU. The near-equiparameterization is preserved.
mnist relu excess
Figure 17. The fractional empirical excess is less well motivated given the poorer fit, but in the region of better agreement it still exhibits some evidence of log-periodic oscillation.

Despite the inferior fit, I present the fractional empirical excess as well. It is certainly not meaningful in the small-eigenvalue regime, where the empirical density scales more flatly than $\lambda^{-1/2}$ and disagrees violently with the theory, but at intermediate eigenvalues the dispersion does show some evidence of log-periodic oscillation that might survive a more rigorous analysis.

Natural Data: Text

It's also interesting to look at natural language. Here the design matrix is contingent on both a tokenization scheme and a choice of embedding dimension. For tokenization I use a custom Byte Pair Encoding (BPE) tokenizer from Hugging Face, trained on the first 5% of "wikitext-2-raw-v1". I use a vocab size of 8k, and take 100k tokens as the input data whose spectrum I study. Among these 100k tokens there are only 7216 unique tokens. In particular, an embedding dimension above 7216 will furnish a rank-deficient design matrix. I look at embedding dimensions above and below this unique-token count. After embedding and centering, I apply ReLU and compute the covariance from the ReLU-mapped design matrix.
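For reference, here is one plausible version of that pipeline; the byte-level BPE flavor, the random-embedding scale, which 100k tokens are taken, and the (heavy) dense linear algebra are my choices rather than necessarily those of the repo.

```python
import numpy as np
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# Sketch of the text pipeline: BPE trained on the first 5% of WikiText-2 with an 8k vocab,
# then a random embedding, centering, ReLU, and the covariance spectrum.
train_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:5%]")["text"]
tok = ByteLevelBPETokenizer()                     # one choice of "custom BPE"
tok.train_from_iterator(train_texts, vocab_size=8000)

full_text = " ".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"])
ids = np.array(tok.encode(full_text).ids[:100_000])
print(len(np.unique(ids)))                        # number of unique tokens among the 100k

n_embed = 8000
rng = np.random.default_rng(3)
E = rng.standard_normal((tok.get_vocab_size(), n_embed), dtype=np.float32) / n_embed ** 0.5
X = E[ids]                                        # (100k, n_embed) design matrix
X -= X.mean(axis=0)                               # center, then ReLU, as described above
F = np.maximum(X, 0.0)
eigs = np.linalg.eigvalsh(F.T @ F / F.shape[0])   # spectrum of the ReLU-mapped covariance
```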
wiki 8k spec
Figure 18. Spectral density for a vocab of size 8k and an embedding of dimension 8k. The effective $q$, taken as the ratio of the embedding dimension to the number of unique tokens, is close to that of the theory curve, which uses $q = 1.14$ and $\alpha = 1$.
wiki 8k excess figure
Figure 19. The empirical fractional excess exhibits some oscillatory behavior.
With $n_{\text{embed}} = 8\text{k}$ there is an implied $q \sim 1.1$. I produce the above theory curve with $q = 1.14$ and $\alpha = 1$. The fit is quite good, and there is again log-periodic structure in the excess density at smaller eigenvalues. A separate comparison can be made with an embedding dimension $n_{\text{embed}} = 4\text{k}$. This is presented in Fig. 20 and Fig. 21.
wiki 4k spec
Figure 20. Again, the appropriate $q$ is roughly the embedding dimension divided by the number of unique tokens. The theory curve has $q = .58$.
wiki 4k excess
Figure 21. Again, some oscillatory behavior is present in the fractional empirical excess.
There is an implied overparameterization ratio $q \sim .55$. I use a fit with $q = .58$ and $\alpha = .98$. Similar systematic deviations are present at small to intermediate eigenvalues.

Log-periodicity and Discrete Scale Invariance

If these deviations are robust and indicative of something universal in data, one possibility is discrete scale invariance. Schematically, discrete scale invariance is the presence of a scaling symmetry only for integer powers of some preferred value of the scaling parameter. That is, whereas an ordinary power-law $f(r) = r^{-\alpha}$ respects the scaling symmetry $$ f(\lambda r) = \lambda^{-\alpha} f(r)$$ for every $\lambda$, the discrete case would say that this holds only for $\lambda = \lambda_*^n$ for some $\lambda_*$ and integer $n$. An ordinary power-law is clearly not restricted in this way, but the following function is: $$f(r) = r^{-\alpha}\left[1+A\sin\left(2\pi\frac{\log r}{\log \lambda_*} \right) \right]$$ This has been proposed as a possible explanation for the log-periodicity around power-laws observed elsewhere, and may have a basis in the underlying discrete and hierarchical structure of language. It is less clear what this picture might mean for image data.
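One can check directly that this form has exactly the discrete symmetry: rescaling by $\lambda_*$ (and hence by any integer power of it) shifts the argument of the sine by $2\pi$,

$$ f(\lambda_* r) = (\lambda_* r)^{-\alpha}\left[1+A\sin\left(2\pi\frac{\log r + \log \lambda_*}{\log \lambda_*} \right) \right] = \lambda_*^{-\alpha}\, r^{-\alpha}\left[1+A\sin\left(2\pi\frac{\log r}{\log \lambda_*} \right) \right] = \lambda_*^{-\alpha} f(r), $$

while a generic rescaling shifts the phase and breaks the relation. Such a correction term is precisely an oscillation that is periodic in $\log r$ with period $\log \lambda_*$, which is the kind of structure seen in the fractional excesses above.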

Summary

Getting enough control over the computation of the theoretical spectral density for power-law structured input data made it possible to make more refined observations about the covariance spectra of natural data. The non-linear feature maps not only extend the power-law behavior, as observed before, but indeed improve the fit to a power-law spectral density. It would be interesting to probe whether the log-periodic oscillatory structure is universal and whether it has any implications for the training loss.

Acknowledgments

I would like to thank Alex Atanasov and Cengiz Pehlevan for valuable conversations related to the work herein.

References

[1] J. Kaplan et al., “Scaling Laws for Neural Language Models” (2020), arXiv:2001.08361

[2] M. Advani, A. Saxe, "High-dimensional dynamics of generalization error in neural networks" (2017), arXiv:1710.03667v1

[3] A. Maloney, D. Roberts, J. Sully, "A Solvable Model of Neural Scaling Laws" (2022), arXiv:2210.16859

[4] A. Atanasov, J. A. Zavatone-Veth, C. Pehlevan, "Scaling and Renormalization in high-dimensional regression" (2024), arXiv:2405.00592v3

[5] B. Bordelon, A. Atanasov, C. Pehlevan, "How Feature Learning Can Improve Neural Scaling Laws" (2024), arXiv:2409.17858v2

[6] J. Silverstein, "Strong Convergence of the Empirical Distribution of Eigenvalues of Large Dimensional Random Matrices" (1995)

[7] E. Dobriban, "Efficient computation of limit spectra of sample covariance matrices" (2015)

[8] M. Potters, J.P. Bouchaud, "A First Course in Random Matrix Theory" (2020)

[9] P. Izmailov, P. Nicholson, S. Lotfi, A. Gordon Wilson, "Dangers of Bayesian Model Averaging under Covariate Shift" (2021)