Recombine FS inputs into vectors #56

anarsoul · 2018-05-24T00:43:20Z

NIR splits outputs and inputs in nir_lower_io_to_scalar_early() and unfortunately for FS it means increased number of instructions and increased register pressure.

We need to recombine inputs from scalar to vectors.

See https://www.mail-archive.com/[email protected]/msg189216.html

yuq · 2018-05-24T02:02:17Z

Right, I also meet this problem after 18.0 rebase. But due to want to focus on kernel, I haven't solve it from the root. Now I'm doing 18.1 rebase, so will cover these NIR changes in a better way.

anarsoul · 2018-05-25T06:32:02Z

kmscube -M rgba fails in lima-18.0 branch due to this issue with "ppir: ppir: regalloc fail"

anarsoul · 2018-05-25T06:35:28Z

impl main {
	decl_reg vec4 32 r0
	decl_reg vec2 32 r1
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = intrinsic load_input (ssa_0) () (1, 0) /* base=1 */ /* component=0 */	/* vVaryingColor */
	vec1 32 ssa_2 = intrinsic load_input (ssa_0) () (1, 1) /* base=1 */ /* component=1 */	/* vVaryingColor */
	vec1 32 ssa_3 = intrinsic load_input (ssa_0) () (1, 2) /* base=1 */ /* component=2 */	/* vVaryingColor */
	vec1 32 ssa_4 = intrinsic load_input (ssa_0) () (1, 3) /* base=1 */ /* component=3 */	/* vVaryingColor */
	r0.x = imov ssa_1
	r0.y = imov ssa_2.x
	r0.z = imov ssa_3.x
	r0.w = imov ssa_4.x
	vec1 32 ssa_6 = intrinsic load_input (ssa_0) () (0, 0) /* base=0 */ /* component=0 */	/* vTexCoord */
	vec1 32 ssa_7 = intrinsic load_input (ssa_0) () (0, 1) /* base=0 */ /* component=1 */	/* vTexCoord */
	r1.x = imov ssa_6
	r1.y = imov ssa_7.x
	vec4 32 ssa_9 = tex r1 (coord), 0 (texture) 0 (sampler)
	vec4 32 ssa_10 = fmul r0, ssa_9
	intrinsic store_output (ssa_10, ssa_0) () (0, 15, 0) /* base=0 */ /* wrmask=xyzw */ /* component=0 */	/* gl_FragColor */
	/* succs: block_0 */
	block block_0:
}

========prog========
-------block------
const 0 ssa0
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      mov 11 reg1
        ld_var 9 ssa6
      mov 12 reg1
        ld_var 10 ssa7
====================
ppir: ppir_lower_texture create load_coords node 16 for 13
========prog========
-------block------
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      ld_coords 16 new
        mov 11 reg1
          ld_var 9 ssa6
        mov 12 reg1
          ld_var 10 ssa7
====================
ppir: node_to_instr create move 17 from store 15
ppir: insert_load_tex: create move 18 for 13
======ppir instr list======
      vary texl unif vmul smul vadd sadd comb stor const0|1
*000: null null null 14   null 17   null null null | 
 001: null null null null null null 5    null null | 
 002: 1    null null null null null null null null | 
 003: null null null null null null 6    null null | 
 004: 2    null null null null null null null null | 
 005: null null null null null null 7    null null | 
 006: 3    null null null null null null null null | 
 007: null null null null null null 8    null null | 
 008: 4    null null null null null null null null | 
 009: 16   13   null null null 18   null null null | 
 010: null null null null null null 11   null null | 
 011: 9    null null null null null null null null | 
 012: null null null null null null 12   null null | 
 013: 10   null null null null null null null null | 
------------------------
======ppir instr depend======
[0[1[2]][3[4]][5[6]][7[8]][9[10[11]][12[13]]]]
------------------------
ppir: ppir: regalloc fail

yuq · 2018-05-25T07:10:08Z

In fact, even we don't recombine scalar to vector, ppir should not fail, but only generate longer code. This "regalloc fail" is indeed the ppir need to implement reg spill when out of regs.

anarsoul · 2018-05-25T07:17:27Z

As far as I understand we need to reverse engineer how temporaries work first. I see store and load temporary instructions in the doc, but I don't understand where they're stored.

yuq · 2018-05-25T07:23:09Z

I guess it's here:
https://github.com/yuq/mesa-lima/blob/lima-18.0/include/drm-uapi/lima_drm.h#L107

Each PP can have a memory stack which I guess is used to store tmp.

anarsoul · 2018-05-25T17:34:03Z

There're 2 stack_address, one in lima_pp_frame_reg, another in drm_lima_m400_pp_frame/drm_lima_m450_pp_frame.

I'm not sure what's the difference between them.

yuq · 2018-05-25T23:22:47Z

lima_pp_frame_reg one is dummy, drm_lima_m400_pp_frame one is used, one for each PP.

anarsoul · 2018-05-26T07:23:04Z

@yuq how do you tell which one is dummy?

yuq · 2018-05-26T12:10:05Z

In this function:
https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L303

LIMA_PP_FRAME & LIMA_PP_STACK are per PP, so the lima_pp_frame_reg will be set to new value before task start.

anarsoul · 2018-05-26T18:48:40Z

@yuq, what about LIMA_PP_STACK_SIZE?

anarsoul · 2018-05-26T18:50:38Z

And why LIMA_PP_STACK is not used for mali450?

yuq · 2018-05-27T00:23:11Z

LIMA_PP_STACK is used by mali450 here:
https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L321

Just because bcast will set same address to all PPs, so I have to set it
individually for each PP.

LIMA_PP_STACK_SIZE is same for all PPs, so the lima_pp_frame_reg one
is not dummy.

enunes · 2018-06-05T02:09:46Z

I'll have some time to work on lima again starting by the end of this week. If nobody is currently working on this, I could pick it up and work on it.

From what I understand there are two issues,

recombine scalars to vec4 to restore the earlier behavior and
implement register spilling to avoid this kind of problem in the future

Is that correct?
Anybody working in any of these?

Just to be sure though, are we sure already that we want (1) (as in the title of this issue), even with (2) implemented? Or is it something that we would need to benchmark?

yuq · 2018-06-05T02:21:41Z

I think you are right. 2 is needed anyway for correctness and 1 is for better performance. But 1 needs to be done also because I wrote ppir for vec4, not for scalar, some refine or recombine may be needed for correctness.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recombine FS inputs into vectors #56

Recombine FS inputs into vectors #56

anarsoul commented May 24, 2018

yuq commented May 24, 2018

anarsoul commented May 25, 2018

anarsoul commented May 25, 2018

yuq commented May 25, 2018

anarsoul commented May 25, 2018

yuq commented May 25, 2018

anarsoul commented May 25, 2018

yuq commented May 25, 2018

anarsoul commented May 26, 2018

yuq commented May 26, 2018

anarsoul commented May 26, 2018 •

edited

Loading

anarsoul commented May 26, 2018

yuq commented May 27, 2018

enunes commented Jun 5, 2018 •

edited

Loading

yuq commented Jun 5, 2018

Recombine FS inputs into vectors #56

Recombine FS inputs into vectors #56

Comments

anarsoul commented May 24, 2018

yuq commented May 24, 2018

anarsoul commented May 25, 2018

anarsoul commented May 25, 2018

yuq commented May 25, 2018

anarsoul commented May 25, 2018

yuq commented May 25, 2018

anarsoul commented May 25, 2018

yuq commented May 25, 2018

anarsoul commented May 26, 2018

yuq commented May 26, 2018

anarsoul commented May 26, 2018 • edited Loading

anarsoul commented May 26, 2018

yuq commented May 27, 2018

enunes commented Jun 5, 2018 • edited Loading

yuq commented Jun 5, 2018

anarsoul commented May 26, 2018 •

edited

Loading

enunes commented Jun 5, 2018 •

edited

Loading