Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recombine FS inputs into vectors #56

Open
anarsoul opened this issue May 24, 2018 · 15 comments
Open

Recombine FS inputs into vectors #56

anarsoul opened this issue May 24, 2018 · 15 comments

Comments

@anarsoul
Copy link
Contributor

NIR splits outputs and inputs in nir_lower_io_to_scalar_early() and unfortunately for FS it means increased number of instructions and increased register pressure.

We need to recombine inputs from scalar to vectors.

See https://www.mail-archive.com/[email protected]/msg189216.html

@yuq
Copy link
Owner

yuq commented May 24, 2018

Right, I also meet this problem after 18.0 rebase. But due to want to focus on kernel, I haven't solve it from the root. Now I'm doing 18.1 rebase, so will cover these NIR changes in a better way.

@anarsoul
Copy link
Contributor Author

kmscube -M rgba fails in lima-18.0 branch due to this issue with "ppir: ppir: regalloc fail"

@anarsoul
Copy link
Contributor Author

impl main {
	decl_reg vec4 32 r0
	decl_reg vec2 32 r1
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = intrinsic load_input (ssa_0) () (1, 0) /* base=1 */ /* component=0 */	/* vVaryingColor */
	vec1 32 ssa_2 = intrinsic load_input (ssa_0) () (1, 1) /* base=1 */ /* component=1 */	/* vVaryingColor */
	vec1 32 ssa_3 = intrinsic load_input (ssa_0) () (1, 2) /* base=1 */ /* component=2 */	/* vVaryingColor */
	vec1 32 ssa_4 = intrinsic load_input (ssa_0) () (1, 3) /* base=1 */ /* component=3 */	/* vVaryingColor */
	r0.x = imov ssa_1
	r0.y = imov ssa_2.x
	r0.z = imov ssa_3.x
	r0.w = imov ssa_4.x
	vec1 32 ssa_6 = intrinsic load_input (ssa_0) () (0, 0) /* base=0 */ /* component=0 */	/* vTexCoord */
	vec1 32 ssa_7 = intrinsic load_input (ssa_0) () (0, 1) /* base=0 */ /* component=1 */	/* vTexCoord */
	r1.x = imov ssa_6
	r1.y = imov ssa_7.x
	vec4 32 ssa_9 = tex r1 (coord), 0 (texture) 0 (sampler)
	vec4 32 ssa_10 = fmul r0, ssa_9
	intrinsic store_output (ssa_10, ssa_0) () (0, 15, 0) /* base=0 */ /* wrmask=xyzw */ /* component=0 */	/* gl_FragColor */
	/* succs: block_0 */
	block block_0:
}

========prog========
-------block------
const 0 ssa0
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      mov 11 reg1
        ld_var 9 ssa6
      mov 12 reg1
        ld_var 10 ssa7
====================
ppir: ppir_lower_texture create load_coords node 16 for 13
========prog========
-------block------
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      ld_coords 16 new
        mov 11 reg1
          ld_var 9 ssa6
        mov 12 reg1
          ld_var 10 ssa7
====================
ppir: node_to_instr create move 17 from store 15
ppir: insert_load_tex: create move 18 for 13
======ppir instr list======
      vary texl unif vmul smul vadd sadd comb stor const0|1
*000: null null null 14   null 17   null null null | 
 001: null null null null null null 5    null null | 
 002: 1    null null null null null null null null | 
 003: null null null null null null 6    null null | 
 004: 2    null null null null null null null null | 
 005: null null null null null null 7    null null | 
 006: 3    null null null null null null null null | 
 007: null null null null null null 8    null null | 
 008: 4    null null null null null null null null | 
 009: 16   13   null null null 18   null null null | 
 010: null null null null null null 11   null null | 
 011: 9    null null null null null null null null | 
 012: null null null null null null 12   null null | 
 013: 10   null null null null null null null null | 
------------------------
======ppir instr depend======
[0[1[2]][3[4]][5[6]][7[8]][9[10[11]][12[13]]]]
------------------------
ppir: ppir: regalloc fail

@yuq
Copy link
Owner

yuq commented May 25, 2018

In fact, even we don't recombine scalar to vector, ppir should not fail, but only generate longer code. This "regalloc fail" is indeed the ppir need to implement reg spill when out of regs.

@anarsoul
Copy link
Contributor Author

As far as I understand we need to reverse engineer how temporaries work first. I see store and load temporary instructions in the doc, but I don't understand where they're stored.

@yuq
Copy link
Owner

yuq commented May 25, 2018

I guess it's here:
https://github.com/yuq/mesa-lima/blob/lima-18.0/include/drm-uapi/lima_drm.h#L107

Each PP can have a memory stack which I guess is used to store tmp.

@anarsoul
Copy link
Contributor Author

There're 2 stack_address, one in lima_pp_frame_reg, another in drm_lima_m400_pp_frame/drm_lima_m450_pp_frame.

I'm not sure what's the difference between them.

@yuq
Copy link
Owner

yuq commented May 25, 2018

lima_pp_frame_reg one is dummy, drm_lima_m400_pp_frame one is used, one for each PP.

@anarsoul
Copy link
Contributor Author

@yuq how do you tell which one is dummy?

@yuq
Copy link
Owner

yuq commented May 26, 2018

In this function:
https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L303

LIMA_PP_FRAME & LIMA_PP_STACK are per PP, so the lima_pp_frame_reg will be set to new value before task start.

@anarsoul
Copy link
Contributor Author

anarsoul commented May 26, 2018

@yuq, what about LIMA_PP_STACK_SIZE?

@anarsoul
Copy link
Contributor Author

And why LIMA_PP_STACK is not used for mali450?

@yuq
Copy link
Owner

yuq commented May 27, 2018

LIMA_PP_STACK is used by mali450 here:
https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L321

Just because bcast will set same address to all PPs, so I have to set it
individually for each PP.

LIMA_PP_STACK_SIZE is same for all PPs, so the lima_pp_frame_reg one
is not dummy.

@enunes
Copy link
Contributor

enunes commented Jun 5, 2018

I'll have some time to work on lima again starting by the end of this week. If nobody is currently working on this, I could pick it up and work on it.

From what I understand there are two issues,

  1. recombine scalars to vec4 to restore the earlier behavior and
  2. implement register spilling to avoid this kind of problem in the future

Is that correct?
Anybody working in any of these?

Just to be sure though, are we sure already that we want (1) (as in the title of this issue), even with (2) implemented? Or is it something that we would need to benchmark?

@yuq
Copy link
Owner

yuq commented Jun 5, 2018

I think you are right. 2 is needed anyway for correctness and 1 is for better performance. But 1 needs to be done also because I wrote ppir for vec4, not for scalar, some refine or recombine may be needed for correctness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants