|
11 | 11 | "**Author:** [Jyotinder Singh](https://x.com/Jyotinder_Singh)<br>\n", |
12 | 12 | "**Date created:** 2025/10/14<br>\n", |
13 | 13 | "**Last modified:** 2025/10/14<br>\n", |
14 | | - "**Description:** Complete guide to using INT8 quantization in Keras and KerasHub" |
| 14 | + "**Description:** Complete guide to using INT8 quantization in Keras and KerasHub." |
15 | 15 | ] |
16 | 16 | }, |
17 | 17 | { |
|
24 | 24 | "\n", |
25 | 25 | "Quantization lowers the numerical precision of weights and activations to reduce memory use\n", |
26 | 26 | "and often speed up inference, at the cost of a small accuracy drop. Moving from `float32` to\n", |
27 | | - "`float16` halves the memory requirements; `float32` to `int8` is ~4x smaller (and ~2x vs\n", |
28 | | - "`float16`). On hardware with low-precision kernels (e.g., Tensor Cores), this can also\n", |
| 27 | + "`float16` halves the memory requirements; `float32` to INT8 is ~4x smaller (and ~2x vs\n", |
| 28 | + "`float16`). On hardware with low-precision kernels (e.g., NVIDIA Tensor Cores), this can also\n", |
29 | 29 | "improve throughput and latency. Actual gains depend on your backend and device.\n", |
30 | 30 | "\n", |
31 | 31 | "### How it works\n", |
|
36 | 36 | "* For a tensor (often per-output-channel for weights) with values `w`:\n", |
37 | 37 | " * Compute `a_max = max(abs(w))`.\n", |
38 | 38 | " * Set scale `s = (2 * a_max) / 256`.\n", |
39 | | - " * Quantize: `q = clip(round(w / s), -128, 127)` (stored as int8) and keep `s`.\n", |
| 39 | + " * Quantize: `q = clip(round(w / s), -128, 127)` (stored as INT8) and keep `s`.\n", |
40 | 40 | "* Inference uses `q` and `s` to reconstruct effective weights on the fly\n", |
41 | | - "(`w ≈ s · q`) or folds `s` into the matmul/conv for efficiency.\n", |
42 | | - "\n", |
43 | | - "### Trade-off\n", |
44 | | - "Wider dynamic range (larger `a_max`) reduces clipping but increases rounding error;\n", |
45 | | - "tighter range reduces rounding error but risks more clipping. Per-channel scaling\n", |
46 | | - "for weights can be used to recover accuracy when compared to per-tensor scaling.\n", |
| 41 | + " (`w ≈ s · q`) or folds `s` into the matmul/conv for efficiency.\n", |
47 | 42 | "\n", |
48 | 43 | "### Benefits\n", |
49 | 44 | "\n", |
50 | 45 | "* Memory / bandwidth bound models: When implementation spends most of its time on memory I/O,\n", |
51 | | - "reducing the computation time does not reduce their overall runtime. `int8` reduces bytes\n", |
52 | | - "moved by ~4x vs `float32`, improving cache behavior and reducing memory stalls;\n", |
53 | | - "this often helps more than increasing raw FLOPs.\n", |
54 | | - "* Compute bound layers on supported hardware: On NVIDIA GPUs, int8\n", |
55 | | - "[Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/) speed up matmul/conv,\n", |
56 | | - "boosting throughput on compute-limited layers.\n", |
57 | | - "* Accuracy: Many models retain near-FP accuracy with `float16`; `int8` may introduce a modest\n", |
58 | | - "drop (often ~1-5% depending on task/model/data). Always validate on your own dataset.\n", |
59 | | - "\n", |
60 | | - "### What Keras does in `int8` mode\n", |
61 | | - "\n", |
62 | | - "* **Mapping**: Symmetric, linear quantization with `int8` plus a floating-point scale.\n", |
| 46 | + " reducing the computation time does not reduce their overall runtime. INT8 reduces bytes\n", |
| 47 | + " moved by ~4x vs `float32`, improving cache behavior and reducing memory stalls;\n", |
| 48 | + " this often helps more than increasing raw FLOPs.\n", |
| 49 | + "* Compute bound layers on supported hardware: On NVIDIA GPUs, INT8\n", |
| 50 | + " [Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/) speed up matmul/conv,\n", |
| 51 | + " boosting throughput on compute-limited layers.\n", |
| 52 | + "* Accuracy: Many models retain near-FP accuracy with `float16`; INT8 may introduce a modest\n", |
| 53 | + " drop (often ~1-5% depending on task/model/data). Always validate on your own dataset.\n", |
| 54 | + "\n", |
| 55 | + "### What Keras does in INT8 mode\n", |
| 56 | + "\n", |
| 57 | + "* **Mapping**: Symmetric, linear quantization with INT8 plus a floating-point scale.\n", |
63 | 58 | "* **Weights**: per-output-channel scales to preserve accuracy.\n", |
64 | 59 | "* **Activations**: **dynamic AbsMax** scaling computed at runtime.\n", |
65 | 60 | "* **Graph rewrite**: Quantization is applied after weights are trained and built; the graph\n", |
66 | | - "is rewritten so you can run or save immediately." |
| 61 | + " is rewritten so you can run or save immediately." |
67 | 62 | ] |
68 | 63 | }, |
69 | 64 | { |
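As a concrete illustration of the AbsMax recipe described in this cell, here is a minimal NumPy sketch of per-output-channel INT8 quantization and reconstruction. It follows the formulas above directly; it is an editorial example, not the Keras internals:

```python
import numpy as np

# Toy weight matrix: 10 inputs x 4 output channels.
rng = np.random.default_rng(0)
w = rng.normal(size=(10, 4)).astype("float32")

# Per-output-channel AbsMax: one scale per column.
a_max = np.max(np.abs(w), axis=0, keepdims=True)
s = (2.0 * a_max) / 256.0
s = np.maximum(s, 1e-12)  # guard against all-zero channels

# Quantize: q = clip(round(w / s), -128, 127), stored as INT8.
q = np.clip(np.round(w / s), -128, 127).astype("int8")

# Dequantize on the fly: w ≈ s * q.
w_hat = s * q.astype("float32")
print("max abs reconstruction error:", float(np.max(np.abs(w - w_hat))))
```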
|
101 | 96 | "\n", |
102 | 97 | "\n", |
103 | 98 | "# Create a random number generator.\n", |
104 | | - "rng = np.random.default_rng(7)\n", |
| 99 | + "rng = np.random.default_rng()\n", |
105 | 100 | "\n", |
106 | 101 | "# Create a simple functional model.\n", |
107 | 102 | "inputs = keras.Input(shape=(10,))\n", |
|
155 | 150 | }, |
156 | 151 | "outputs": [], |
157 | 152 | "source": [ |
158 | | - "from keras import saving\n", |
159 | | - "\n", |
160 | | - "# Build a functional model.\n", |
161 | | - "inputs = keras.Input(shape=(10,))\n", |
162 | | - "x = layers.Dense(32, activation=\"relu\")(inputs)\n", |
163 | | - "outputs = layers.Dense(1, name=\"target\")(x)\n", |
164 | | - "model = keras.Model(inputs, outputs)\n", |
165 | | - "model.build((None, 10))\n", |
166 | | - "\n", |
167 | | - "# Sample inputs for evaluation.\n", |
168 | | - "x_eval = rng.random((32, 10)).astype(\"float32\")\n", |
169 | | - "\n", |
170 | | - "# Quantize the model in-place to INT8.\n", |
171 | | - "model.quantize(\"int8\")\n", |
172 | | - "\n", |
173 | | - "# INT8 outputs after quantization.\n", |
174 | | - "y_int8 = model(x_eval)\n", |
175 | | - "\n", |
176 | 153 | "# Save the quantized model and reload to verify round-trip.\n", |
177 | 154 | "model.save(\"int8.keras\")\n", |
178 | | - "int8_reloaded = saving.load_model(\"int8.keras\")\n", |
| 155 | + "int8_reloaded = keras.saving.load_model(\"int8.keras\")\n", |
179 | 156 | "y_int8_reloaded = int8_reloaded(x_eval)\n", |
180 | 157 | "roundtrip_mse = keras.ops.mean(keras.ops.square(y_int8 - y_int8_reloaded))\n", |
181 | 158 | "print(\"MSE (INT8 vs reloaded-INT8):\", float(roundtrip_mse))" |
|
195 | 172 | "In this example, we will:\n", |
196 | 173 | "\n", |
197 | 174 | "1. Load the [gemma3_1b](https://www.kaggle.com/models/keras/gemma3/keras/gemma3_1b)\n", |
198 | | - "preset from KerasHub\n", |
| 175 | + " preset from KerasHub\n", |
199 | 176 | "2. Generate text using both the full-precision and quantized models, and compare outputs.\n", |
200 | 177 | "3. Save both models to disk and compute storage savings.\n", |
201 | 178 | "4. Reload the INT8 model and verify output consistency with the original quantized model." |
|
215 | 192 | "gemma3 = Gemma3CausalLM.from_preset(\"gemma3_1b\")\n", |
216 | 193 | "\n", |
217 | 194 | "# Generate text for a single prompt\n", |
218 | | - "output = gemma3.generate(\"Keras is a\", max_length=30)\n", |
| 195 | + "output = gemma3.generate(\"Keras is a\", max_length=50)\n", |
219 | 196 | "print(\"Full-precision output:\", output)\n", |
220 | 197 | "\n", |
221 | | - "# Save FP32 Gemma3 model\n", |
| 198 | + "# Save FP32 Gemma3 model for size comparison.\n", |
222 | 199 | "gemma3.save_to_preset(\"gemma3_fp32\")\n", |
223 | 200 | "\n", |
224 | 201 | "# Quantize in-place to INT8 and generate again\n", |
225 | 202 | "gemma3.quantize(\"int8\")\n", |
226 | 203 | "\n", |
227 | | - "output = gemma3.generate(\"Keras is a\", max_length=30)\n", |
| 204 | + "output = gemma3.generate(\"Keras is a\", max_length=50)\n", |
228 | 205 | "print(\"Quantized output:\", output)\n", |
229 | 206 | "\n", |
230 | 207 | "# Save INT8 Gemma3 model\n", |
|
233 | 210 | "# Reload and compare outputs\n", |
234 | 211 | "gemma3_int8 = Gemma3CausalLM.from_preset(\"gemma3_int8\")\n", |
235 | 212 | "\n", |
236 | | - "output = gemma3_int8.generate(\"Keras is a\", max_length=30)\n", |
| 213 | + "output = gemma3_int8.generate(\"Keras is a\", max_length=50)\n", |
237 | 214 | "print(\"Quantized reloaded output:\", output)\n", |
238 | 215 | "\n", |
239 | 216 | "\n", |
|