Why is predicting a patch-wise neural field more scalable than directly predicting pixels?

Nice work! Could you share more insight into why predicting a patch-wise neural field is more scalable than directly predicting pixels?