
Commit 56ba00b: Fix grammar

rahul-tuli committed Sep 6, 2024 (1 parent: b86cbbb)

Showing 1 changed file with 13 additions and 13 deletions: src/llmcompressor/modifiers/smoothquant/README.md
@@ -5,7 +5,7 @@ In this tutorial, we'll cover how to specify the correct mappings for applying t
## Understanding the Mapping Format

### Context
SmoothQuant leverages activation scaling to smooth out input activations, making quantization more efficient for large language models (LLMs). As mentioned in the SmoothQuant paper, "By default, we perform scale smoothing for the input activations of self-attention and feed-forward layers."

This means that we need to smooth the inputs feeding into:
- The **q/k/v blocks** (query, key, value blocks of self-attention)
@@ -27,45 +27,45 @@ One of the quirks of working with LLM architectures is that we need to apply smo
[["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
```

Instead of targeting broader modules like `mlp`, we explicitly specify the lower-level projections (`gate_proj` and `up_proj`) and the `post_attention_layernorm` layer.
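
To see why these modules are paired, it helps to trace the dataflow inside a decoder layer. The sketch below is a simplified, illustrative rendering of a LLaMA-style feed-forward block (not the actual `modeling_llama.py` code; the residual connection is omitted and `LayerNorm` stands in for RMSNorm): the output of `post_attention_layernorm` is exactly the tensor that `gate_proj` and `up_proj` consume, so scaling that output smooths the inputs to both projections.

```python
import torch
import torch.nn as nn


class SimplifiedLlamaBlockTail(nn.Module):
    """Illustrative only: the feed-forward tail of a LLaMA-style decoder layer."""

    def __init__(self, hidden: int = 16, intermediate: int = 32):
        super().__init__()
        # LLaMA uses RMSNorm; LayerNorm keeps this sketch self-contained.
        self.post_attention_layernorm = nn.LayerNorm(hidden)
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The layernorm output feeds BOTH gate_proj and up_proj,
        # which is why the default mapping pairs them with post_attention_layernorm.
        normed = self.post_attention_layernorm(hidden_states)
        gated = torch.nn.functional.silu(self.gate_proj(normed)) * self.up_proj(normed)
        return self.down_proj(gated)


# Quick shape check: block(torch.randn(1, 4, 16)) -> (1, 4, 16)
```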

### The Mapping Format

A mapping in SmoothQuant takes the form:

```python
[[layers that the smoothed inputs pass into], output_to_smooth]
```
For example, in the default mapping:
```python
[["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
```
This specifies that we want to smooth the inputs feeding into the projections (`gate_proj`, `up_proj`) and the output from `post_attention_layernorm`.
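
For reference, here is a minimal sketch of how a mapping in this format might be passed to the modifier. It assumes the `SmoothQuantModifier` class exposed by this package and its `mappings`/`smoothing_strength` arguments, as used in the repository's quantization examples; treat the exact signature as an assumption and check the modifier's docstring.

```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Spell out the default mapping explicitly instead of relying on the built-in default.
default_mapping = [
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
]

modifier = SmoothQuantModifier(
    smoothing_strength=0.8,  # the alpha migration strength from the SmoothQuant paper
    mappings=default_mapping,
)
```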

## Specifying Your Own Mappings

To create your own mappings, follow these steps:

1. **Identify the layers you want to pass smoothed input activations into**:
You can find the exact names of these layers by exploring the relevant model file (e.g., `modeling_llama.py`). For example, you might target layers related to the self-attention or feed-forward blocks.

2. **Match leaf modules**:
Ensure you're targeting leaf modules (i.e., the individual components of broader blocks, such as `gate_proj` and `up_proj` instead of a larger `mlp` module).

3. **Specify the correct regular expressions**:
Use regular expressions to match the layers you want to target. For instance, if you want to target all projection layers across all attention heads, you could use a regex like `"re:.*proj"`. If you want to target a specific projection layer, make the regex more specific. A short inspection sketch follows this list.
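
The sketch below shows one way to carry out these steps: load the model, list its leaf modules, and check which module names a candidate pattern matches before committing it to a mapping. The model name is only an example, and stripping the `re:` prefix by hand mirrors how the patterns are written in this README; it is not the modifier's own resolution logic.

```python
import re

from transformers import AutoModelForCausalLM

# Example model; any causal LM you plan to smooth works here.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Steps 1 & 2: list leaf modules (modules with no children) to find exact layer names.
leaf_names = [
    name for name, module in model.named_modules() if len(list(module.children())) == 0
]
print(leaf_names[:10])  # e.g. 'model.layers.0.self_attn.q_proj', ...

# Step 3: check which leaf modules a candidate pattern actually matches.
pattern = "re:.*q_proj"
regex = re.compile(pattern.removeprefix("re:"))
matches = [name for name in leaf_names if regex.match(name)]
print(f"{pattern!r} matches {len(matches)} modules")
```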

### Example Custom Mapping

Let's say you're working with a model whose layers are named similarly to LLaMA's, and you want to smooth the input activations of both the self-attention layers and the feed-forward layers. Here is how you might specify the mapping:

```python
mapping = [
    # Smooth the inputs going into the query, key, value projections of self-attention
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    # Smooth the inputs going into the first feed-forward block (fc1)
    [["re:.*fc1"], "re:.*post_attention_layernorm"],
]
```
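
As a usage sketch, this custom `mapping` could then be handed to the modifier and applied through the library's one-shot entry point. The `oneshot` import path, the calibration dataset name, and the surrounding arguments follow the repository's quantization examples, but treat them as assumptions and adapt them to your setup; in practice you would also pair SmoothQuant with a quantization modifier in the same recipe.

```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8, mappings=mapping),
]

# Calibrate and apply the smoothing scales in one shot.
oneshot(
    model="meta-llama/Llama-2-7b-hf",  # example model; use the one your mapping was written for
    dataset="open_platypus",           # example calibration dataset from the repo's examples
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=64,
)
```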

@@ -77,6 +77,6 @@ This ensures that SmoothQuant modifies the correct activations, improving quanti

## Conclusion

By understanding the structure of your model and specifying precise mappings, you can apply the SmoothQuant Modifier effectively. Use the diagram on page 5 of the [SmoothQuant paper](https://arxiv.org/pdf/2211.10438) and inspect your model's code to identify the correct layers and leaf modules to target for smoothing.

Now that you know how to create these mappings, experiment with different model architectures and observe how SmoothQuant impacts performance and quantization accuracy.
