Commit 9364d8b

mrutland-arm authored and nvidia-bfigg committed
arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
BugLink: https://bugs.launchpad.net/bugs/2143859

The ARM64_WORKAROUND_REPEAT_TLBI workaround is used to mitigate several
errata where broadcast TLBI;DSB sequences don't provide all the
architecturally required synchronization. The workaround performs more
work than necessary, and can have significant overhead. This patch
optimizes the workaround, as explained below.

The workaround was originally added for Qualcomm Falkor erratum 1009 in
commit:

  d9ff80f ("arm64: Work around Falkor erratum 1009")

As noted in the message for that commit, the workaround is applied even
in cases where it is not strictly necessary.

The workaround was later reused without changes for:

* Arm Cortex-A76 erratum #1286807
  SDEN v33: https://developer.arm.com/documentation/SDEN-885749/33-0/

* Arm Cortex-A55 erratum #2441007
  SDEN v16: https://developer.arm.com/documentation/SDEN-859338/1600/

* Arm Cortex-A510 erratum #2441009
  SDEN v19: https://developer.arm.com/documentation/SDEN-1873351/1900/

The important details to note are as follows:

1. All relevant errata only affect the ordering and/or completion of
   memory accesses which have been translated by an invalidated TLB
   entry. The actual invalidation of TLB entries is unaffected.

2. The existing workaround is applied to both broadcast and local TLB
   invalidation, whereas for all relevant errata it is only necessary to
   apply a workaround for broadcast invalidation.

3. The existing workaround replaces every TLBI with a TLBI;DSB;TLBI
   sequence, whereas for all relevant errata it is only necessary to
   execute a single additional TLBI;DSB sequence after any number of
   TLBIs are completed by a DSB.

   For example, for a sequence of batched TLBIs:

     TLBI <op1>[, <arg1>]
     TLBI <op2>[, <arg2>]
     TLBI <op3>[, <arg3>]
     DSB ISH

   ... the existing workaround will expand this to:

     TLBI <op1>[, <arg1>]
     DSB ISH                  // additional
     TLBI <op1>[, <arg1>]     // additional
     TLBI <op2>[, <arg2>]
     DSB ISH                  // additional
     TLBI <op2>[, <arg2>]     // additional
     TLBI <op3>[, <arg3>]
     DSB ISH                  // additional
     TLBI <op3>[, <arg3>]     // additional
     DSB ISH

   ... whereas it is sufficient to have:

     TLBI <op1>[, <arg1>]
     TLBI <op2>[, <arg2>]
     TLBI <op3>[, <arg3>]
     DSB ISH
     TLBI <opX>[, <argX>]     // additional
     DSB ISH                  // additional

   Using a single additional TLBI and DSB at the end of the sequence can
   have significantly lower overhead as each DSB which completes a TLBI
   must synchronize with other PEs in the system, with potential
   performance effects both locally and system-wide.

4. The existing workaround repeats each specific TLBI operation, whereas
   for all relevant errata it is sufficient for the additional TLBI to
   use *any* operation which will be broadcast, regardless of which
   translation regime or stage of translation the operation applies to.

   For example, for a single TLBI:

     TLBI ALLE2IS
     DSB ISH

   ... the existing workaround will expand this to:

     TLBI ALLE2IS
     DSB ISH
     TLBI ALLE2IS             // additional
     DSB ISH                  // additional

   ... whereas it is sufficient to have:

     TLBI ALLE2IS
     DSB ISH
     TLBI VALE1IS, XZR        // additional
     DSB ISH                  // additional

   As the additional TLBI doesn't have to match a specific earlier TLBI,
   the additional TLBI can be implemented in separate code, with no
   memory of the earlier TLBIs. The additional TLBI can also use a
   cheaper TLBI operation.

5. The existing workaround is applied to both Stage-1 and Stage-2 TLB
   invalidation, whereas for all relevant errata it is only necessary to
   apply a workaround for Stage-1 invalidation.

   Architecturally, TLBI operations which invalidate only Stage-2
   information (e.g. IPAS2E1IS) are not required to invalidate TLB
   entries which combine information from Stage-1 and Stage-2
   translation table entries, and consequently may not complete memory
   accesses translated by those combined entries. In these cases,
   completion of memory accesses is only guaranteed after subsequent
   invalidation of Stage-1 information (e.g. VMALLE1IS).

Taking the above points into account, this patch reworks the workaround
logic to reduce overhead:

* New __tlbi_sync_s1ish() and __tlbi_sync_s1ish_hyp() functions are
  added and used in place of any dsb(ish) which is used to complete
  broadcast Stage-1 TLB maintenance. When the
  ARM64_WORKAROUND_REPEAT_TLBI workaround is enabled, these helpers will
  execute an additional TLBI;DSB sequence.

  For consistency, it might make sense to add __tlbi_sync_*() helpers
  for local and stage 2 maintenance. For now I've left those with
  open-coded dsb() to keep the diff small.

* The duplication of TLBIs in __TLBI_0() and __TLBI_1() is removed. This
  is no longer needed as the necessary synchronization will happen in
  __tlbi_sync_s1ish() or __tlbi_sync_s1ish_hyp().

* The additional TLBI operation is chosen to have minimal impact:

  - __tlbi_sync_s1ish() uses "TLBI VALE1IS, XZR". This is only used at
    EL1 or at EL2 with {E2H,TGE}=={1,1}, where it will target an unused
    entry for the reserved ASID in the kernel's own translation regime,
    and have no adverse effect.

  - __tlbi_sync_s1ish_hyp() uses "TLBI VALE2IS, XZR". This is only used
    in hyp code, where it will target an unused entry in the hyp code's
    TTBR0 mapping, and should have no adverse effect.

* As __TLBI_0() and __TLBI_1() no longer replace each TLBI with a
  TLBI;DSB;TLBI sequence, batching TLBIs is worthwhile, and there's no
  need for arch_tlbbatch_should_defer() to consider
  ARM64_WORKAROUND_REPEAT_TLBI.

When building defconfig with GCC 15.1.0, compared to v6.19-rc1, this
patch saves ~1KiB of text, makes the vmlinux ~42KiB smaller, and makes
the resulting Image 64KiB smaller:

| [mark@lakrids:~/src/linux]% size vmlinux-*
|     text     data    bss      dec     hex filename
| 21179831 19660919 708216 41548966 279fca6 vmlinux-after
| 21181075 19660903 708216 41550194 27a0172 vmlinux-before

| [mark@lakrids:~/src/linux]% ls -l vmlinux-*
| -rwxr-xr-x 1 mark mark 157771472 Feb 4 12:05 vmlinux-after
| -rwxr-xr-x 1 mark mark 157815432 Feb 4 12:05 vmlinux-before

| [mark@lakrids:~/src/linux]% ls -l Image-*
| -rw-r--r-- 1 mark mark 41007616 Feb 4 12:05 Image-after
| -rw-r--r-- 1 mark mark 41073152 Feb 4 12:05 Image-before

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit a8f7868)
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Acked-by: Jamie Nguyen <jamien@nvidia.com>
Acked-by: Carol L Soto <csoto@nvidia.com>
Acked-by: Jacob Martin <jacob.martin@canonical.com>
Acked-by: Abdur Rahman <abdur.rahman@canonical.com>
Signed-off-by: Brad Figg <bfigg@nvidia.com>
1 parent 6128b24 commit 9364d8b
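
The instruction counts behind point 3 of the commit message can be written down as a small user-space sketch. This is illustrative bookkeeping only, not kernel code: `struct tlbi_cost`, `cost_old_workaround()` and `cost_new_workaround()` are hypothetical names, and the formulas are read directly off the expansions shown in the message for a batch of n TLBIs completed by one DSB.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction counts for one batch of TLBIs (hypothetical helper type). */
struct tlbi_cost {
	size_t tlbis;	/* TLBI instructions executed */
	size_t dsbs;	/* DSB ISH instructions executed */
};

/*
 * Old workaround: every TLBI becomes TLBI;DSB;TLBI, and the batch is
 * still completed by its original DSB.
 */
static struct tlbi_cost cost_old_workaround(size_t n)
{
	assert(n > 0);
	return (struct tlbi_cost){ .tlbis = 2 * n, .dsbs = n + 1 };
}

/*
 * New workaround: the n TLBIs and their DSB are left alone, followed by
 * a single additional TLBI;DSB pair.
 */
static struct tlbi_cost cost_new_workaround(size_t n)
{
	assert(n > 0);
	return (struct tlbi_cost){ .tlbis = n + 1, .dsbs = 2 };
}
```

For the three-TLBI batch in the message this gives 6 TLBIs and 4 DSBs before the patch versus 4 TLBIs and 2 DSBs after it, and the DSB count no longer grows with the batch size.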

6 files changed: +47 -36 lines changed
arch/arm64/include/asm/tlbflush.h

Lines changed: 35 additions & 24 deletions
@@ -31,18 +31,10 @@
  */
 #define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \
 			       "tlbi " #op "\n" \
-		   ALTERNATIVE("nop\n nop", \
-			       "dsb ish\n tlbi " #op, \
-			       ARM64_WORKAROUND_REPEAT_TLBI, \
-			       CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
 			    : : )
 
 #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \
 			       "tlbi " #op ", %x0\n" \
-		   ALTERNATIVE("nop\n nop", \
-			       "dsb ish\n tlbi " #op ", %x0", \
-			       ARM64_WORKAROUND_REPEAT_TLBI, \
-			       CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
 			    : : "rZ" (arg))
 
 #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)
@@ -181,6 +173,34 @@ static inline unsigned long get_trans_granule(void)
 		(__pages >> (5 * (scale) + 1)) - 1; \
 })
 
+#define __repeat_tlbi_sync(op, arg...) \
+do { \
+	if (!alternative_has_cap_unlikely(ARM64_WORKAROUND_REPEAT_TLBI)) \
+		break; \
+	__tlbi(op, ##arg); \
+	dsb(ish); \
+} while (0)
+
+/*
+ * Complete broadcast TLB maintenance issued by the host which invalidates
+ * stage 1 information in the host's own translation regime.
+ */
+static inline void __tlbi_sync_s1ish(void)
+{
+	dsb(ish);
+	__repeat_tlbi_sync(vale1is, 0);
+}
+
+/*
+ * Complete broadcast TLB maintenance issued by hyp code which invalidates
+ * stage 1 translation information in any translation regime.
+ */
+static inline void __tlbi_sync_s1ish_hyp(void)
+{
+	dsb(ish);
+	__repeat_tlbi_sync(vale2is, 0);
+}
+
 /*
  * TLB Invalidation
  * ================
@@ -266,7 +286,7 @@ static inline void flush_tlb_all(void)
 {
 	dsb(ishst);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 	isb();
 }
 
@@ -278,7 +298,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	asid = __TLBI_VADDR(0, ASID(mm));
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
@@ -305,20 +325,11 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
 	flush_tlb_page_nosync(vma, uaddr);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 }
 
 static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
 {
-	/*
-	 * TLB flush deferral is not required on systems which are affected by
-	 * ARM64_WORKAROUND_REPEAT_TLBI, as __tlbi()/__tlbi_user() implementation
-	 * will have two consecutive TLBI instructions with a dsb(ish) in between
-	 * defeating the purpose (i.e save overall 'dsb ish' cost).
-	 */
-	if (alternative_has_cap_unlikely(ARM64_WORKAROUND_REPEAT_TLBI))
-		return false;
-
 	return true;
 }
 
@@ -334,7 +345,7 @@ static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
  */
 static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	dsb(ish);
+	__tlbi_sync_s1ish();
 }
 
 /*
@@ -469,7 +480,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 {
 	__flush_tlb_range_nosync(vma->vm_mm, start, end, stride,
 				 last_level, tlb_level);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
@@ -501,7 +512,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
 	dsb(ishst);
 	__flush_tlb_range_op(vaale1is, start, pages, stride, 0,
 			     TLBI_TTL_UNKNOWN, false, lpa2_is_enabled());
-	dsb(ish);
+	__tlbi_sync_s1ish();
 	isb();
 }
 
@@ -515,7 +526,7 @@ static inline void __flush_tlb_kernel_pgtable(unsigned long kaddr)
 
 	dsb(ishst);
 	__tlbi(vaae1is, addr);
-	dsb(ish);
+	__tlbi_sync_s1ish();
 	isb();
 }

arch/arm64/kernel/sys_compat.c

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ __do_compat_cache_op(unsigned long start, unsigned long end)
 			 * We pick the reserved-ASID to minimise the impact.
 			 */
 			__tlbi(aside1is, __TLBI_VADDR(0, 0));
-			dsb(ish);
+			__tlbi_sync_s1ish();
 		}
 
 		ret = caches_clean_inval_user_pou(start, start + chunk);

arch/arm64/kvm/hyp/nvhe/mm.c

Lines changed: 1 addition & 1 deletion
@@ -271,7 +271,7 @@ static void fixmap_clear_slot(struct hyp_fixmap_slot *slot)
 	 */
 	dsb(ishst);
 	__tlbi_level(vale2is, __TLBI_VADDR(addr, 0), level);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 }

arch/arm64/kvm/hyp/nvhe/tlb.c

Lines changed: 4 additions & 4 deletions
@@ -169,7 +169,7 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	 */
 	dsb(ish);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -226,7 +226,7 @@ void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
 
 	dsb(ish);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -240,7 +240,7 @@ void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 	enter_vmid_context(mmu, &cxt, false);
 
 	__tlbi(vmalls12e1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -266,5 +266,5 @@ void __kvm_flush_vm_context(void)
 	/* Same remark as in enter_vmid_context() */
 	dsb(ish);
 	__tlbi(alle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 }

arch/arm64/kvm/hyp/pgtable.c

Lines changed: 1 addition & 1 deletion
@@ -483,7 +483,7 @@ static int hyp_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 		*unmapped += granule;
 	}
 
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 	mm_ops->put_page(ctx->ptep);

arch/arm64/kvm/hyp/vhe/tlb.c

Lines changed: 5 additions & 5 deletions
@@ -115,7 +115,7 @@ void __kvm_tlb_flush_vmid_ipa(struct kvm_s2_mmu *mmu,
 	 */
 	dsb(ish);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -176,7 +176,7 @@ void __kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
 
 	dsb(ish);
 	__tlbi(vmalle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -192,7 +192,7 @@ void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu)
 	enter_vmid_context(mmu, &cxt);
 
 	__tlbi(vmalls12e1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	exit_vmid_context(&cxt);
@@ -217,7 +217,7 @@ void __kvm_flush_vm_context(void)
 {
 	dsb(ishst);
 	__tlbi(alle1is);
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 }
 
 /*
@@ -358,7 +358,7 @@ int __kvm_tlbi_s1e2(struct kvm_s2_mmu *mmu, u64 va, u64 sys_encoding)
 	default:
 		ret = -EINVAL;
 	}
-	dsb(ish);
+	__tlbi_sync_s1ish_hyp();
 	isb();
 
 	if (mmu)
