diff --git a/src/llmcompressor/modifiers/smoothquant/README.md b/src/llmcompressor/modifiers/smoothquant/README.md
index ff65836c6..03d1060a3 100644
--- a/src/llmcompressor/modifiers/smoothquant/README.md
+++ b/src/llmcompressor/modifiers/smoothquant/README.md
@@ -5,7 +5,7 @@ In this tutorial, we'll cover how to specify the correct mappings for applying t
 ## Understanding the Mapping Format

 ### Context
-SmoothQuant leverages activation scaling to smooth out input activations to make quantization more efficient for large language models (LLMs). As mentioned in the SmoothQuant paper, "By default, we perform scale smoothing for the input activations of self-attention and feed-forward layers."
+SmoothQuant leverages activation scaling to smooth out input activations, making quantization more efficient for large language models (LLMs). As mentioned in the SmoothQuant paper, "By default, we perform scale smoothing for the input activations of self-attention and feed-forward layers."

 This means that we need to smooth the inputs feeding into:
 - The **q/k/v blocks** (query, key, value blocks of self-attention)
@@ -27,45 +27,45 @@ One of the quirks of working with LLM architectures is that we need to apply smo
 [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
 ```

-Instead of targeting broader modules like `mlp`, we specify the lower-level projections (`gate_proj` and `up_proj`) and the post-attention layer normalization explicitly.
+Instead of targeting broader modules like `mlp`, we explicitly specify the lower-level projections (`gate_proj` and `up_proj`) and the `post_attention_layernorm` layer.

 ### The Mapping Format

 A mapping in SmoothQuant takes the form:

-```python
+```python
 [[layers smoothed inputs pass into], output_to_smooth]
 ```

 For example, in the default mapping:

-```python
+```python
 [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
 ```

-This specifies that we want to smooth the inputs feeding into the projections (`gate_proj`, `up_proj`) as well as the output from `post_attention_layernorm`.
+This specifies that we want to smooth the inputs feeding into the projections (`gate_proj`, `up_proj`) and the output from `post_attention_layernorm`.

 ## Specifying Your Own Mappings

 To create your own mappings, follow these steps:
 1. **Identify the layers you want to pass smoothed input activations into**:
-   You can find the exact names of these layers by exploring the relevant model file (e.g., `modeling_llama.py`). For example, you might target layers related to the self-attention or feed-forward blocks.
+   You can find the exact names of these layers by exploring the relevant model file (e.g., `modeling_llama.py`). For example, you might target layers related to the self-attention or feed-forward blocks.

 2. **Match leaf modules**:
-   Ensure you're targeting leaf modules (i.e., the individual components of broader blocks, such as `gate_proj` and `up_proj` instead of a larger `mlp` module).
+   Ensure you're targeting leaf modules (i.e., the individual components of broader blocks, such as `gate_proj` and `up_proj` instead of a larger `mlp` module).

 3. **Specify the correct regular expressions**:
-   Use regular expressions to match the layers you want to target. For instance, if you want to target all projection layers across all attention heads, you could use a regex like `"re:.*proj"`. If you want to target a specific projection layer, make the regex more specific.
+   Use regular expressions to match the layers you want to target. For instance, if you want to target all projection layers across all attention heads, you could use a regex like `"re:.*proj"`. If you want to target a specific projection layer, make the regex more specific.

 ### Example Custom Mapping

-Let’s say you’re working with a model that has layers named similarly to LLaMA, and you want to smooth the input activations of the self-attention layers as well as the feed-forward layers. Here is how you might specify the mapping:
+Let's say you're working with a model that has layers named similarly to LLaMA, and you want to smooth the input activations of the self-attention layers and the feed-forward layers. Here is how you might specify the mapping:

 ```python
 mapping = [
     # Smooth the inputs going into the query, key, value projections of self-attention
-    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
+    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
     # Smooth the inputs going into the first feed-forward block (fc1)
-    [["re:.*fc1"], "re:.*post_attention_layernorm"]
+    [["re:.*fc1"], "re:.*post_attention_layernorm"]
 ]
 ```
@@ -77,6 +77,6 @@ This ensures that SmoothQuant modifies the correct activations, improving quanti

 ## Conclusion

-By understanding the structure of your model and specifying precise mappings, you can apply the SmoothQuant Modifier effectively. Use the diagram on page 5 of the [SmoothQuant paper](https://arxiv.org/pdf/2211.10438) and inspect your model’s code to identify the correct layers and leaf modules to target for smoothing.
+By understanding the structure of your model and specifying precise mappings, you can apply the SmoothQuant Modifier effectively. Use the diagram on page 5 of the [SmoothQuant paper](https://arxiv.org/pdf/2211.10438) and inspect your model's code to identify the correct layers and leaf modules to target for smoothing.

-Now that you know how to create these mappings, experiment with different model architectures and observe how SmoothQuant impacts performance and quantization accuracy.
\ No newline at end of file
+Now that you know how to create these mappings, experiment with different model architectures and observe how SmoothQuant impacts performance and quantization accuracy.
\ No newline at end of file
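To put the steps above into practice, the sketch below shows one way to list a model's leaf modules (so you can pick regex targets) and hand a custom mapping to the modifier. It is a minimal sketch rather than a verified recipe: the checkpoint path is a placeholder, and the `SmoothQuantModifier` import path and its `smoothing_strength`/`mappings` keyword arguments are assumed to match your installed `llmcompressor` release.

```python
from transformers import AutoModelForCausalLM

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Placeholder checkpoint -- substitute the model you are quantizing.
model = AutoModelForCausalLM.from_pretrained("path/to/your-llama-style-model")

# Steps 1-2: print leaf module names so you can choose regex targets,
# e.g. "model.layers.0.self_attn.q_proj" or "model.layers.0.mlp.gate_proj".
for name, module in model.named_modules():
    if len(list(module.children())) == 0:  # leaf modules only
        print(name)

# Step 3: express the targets as [[layers smoothed inputs pass into], output_to_smooth].
mapping = [
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
]

# Assumed keyword arguments; check them against your llmcompressor version.
modifier = SmoothQuantModifier(smoothing_strength=0.8, mappings=mapping)
```

From there, the modifier can be added to whatever compression recipe or one-shot entry point you normally use, so the smoothing is applied before quantization.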