I wrote portable LLVM IR code that implements a vector max() function (this should be provided as an intrinsic function for MUDA), shown below.
;; max.ll
define <4xfloat> @muda_maxf4(<4xfloat> %a, <4xfloat> %b)
{
;; extract
%a0 = extractelement <4xfloat> %a, i32 0
%a1 = extractelement <4xfloat> %a, i32 1
%a2 = extractelement <4xfloat> %a, i32 2
%a3 = extractelement <4xfloat> %a, i32 3
%b0 = extractelement <4xfloat> %b, i32 0
%b1 = extractelement <4xfloat> %b, i32 1
%b2 = extractelement <4xfloat> %b, i32 2
%b3 = extractelement <4xfloat> %b, i32 3
;; c[N] = a[N] > b[N]
%c0 = fcmp ogt float %a0, %b0
%c1 = fcmp ogt float %a1, %b1
%c2 = fcmp ogt float %a2, %b2
%c3 = fcmp ogt float %a3, %b3
;; if %c[N] == 1 then %a[N] else %b[N]
%r0 = select i1 %c0, float %a0, float %b0
%r1 = select i1 %c1, float %a1, float %b1
%r2 = select i1 %c2, float %a2, float %b2
%r3 = select i1 %c3, float %a3, float %b3
;; pack
%tmp0 = insertelement <4xfloat> undef, float %r0, i32 0
%tmp1 = insertelement <4xfloat> %tmp0, float %r1, i32 1
%tmp2 = insertelement <4xfloat> %tmp1, float %r2, i32 2
%r = insertelement <4xfloat> %tmp2, float %r3, i32 3
ret <4xfloat> %r
}
Since LLVM IR doesn't accept vector types for the compare and select instructions,
I first convert the vectors to scalars, then do the compare/select, and finally pack the scalar results back into a vector.
But for this LLVM IR input, the LLVM x86/SSE backend emits the following unoptimized assembly.
$ llvm-as max.ll; opt -std-compile-opts max.bc -f | llc -march=x86 -mcpu=penryn -f
.text
.align 16
.globl muda_maxf4
.type muda_maxf4,@function
muda_maxf4:
extractps $3, %xmm1, %xmm2
extractps $3, %xmm0, %xmm3
ucomiss %xmm2, %xmm3
ja .LBB1_2 #
.LBB1_1: #
movaps %xmm2, %xmm3
.LBB1_2: #
extractps $1, %xmm1, %xmm2
extractps $1, %xmm0, %xmm4
ucomiss %xmm2, %xmm4
ja .LBB1_4 #
.LBB1_3: #
movaps %xmm2, %xmm4
.LBB1_4: #
movss %xmm4, %xmm2
unpcklps %xmm3, %xmm2
extractps $2, %xmm1, %xmm3
extractps $2, %xmm0, %xmm4
ucomiss %xmm3, %xmm4
ja .LBB1_6 #
.LBB1_5: #
movaps %xmm3, %xmm4
.LBB1_6: #
ucomiss %xmm1, %xmm0
ja .LBB1_8 #
.LBB1_7: #
movaps %xmm1, %xmm0
.LBB1_8: #
unpcklps %xmm4, %xmm0
unpcklps %xmm2, %xmm0
ret
.size muda_maxf4, .-muda_maxf4
It consists of lots of jumps, while I expected to get one of the following:
- cmpps + andps/andnps/orps
- cmpps + blendvps (in SSE4.1; blendps itself takes only an immediate mask)
- maxps
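To make the first option concrete, here is a minimal portable C sketch of what the cmpps + andps/andnps/orps sequence computes, written lane by lane on the raw bit patterns. It is a scalar model only, not the backend's output: the float4 struct and the names float_bits, bits_float and maxf4_branchless are hypothetical and introduced just for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-lane type standing in for <4 x float>. */
typedef struct { float v[4]; } float4;

/* Reinterpret a float as its 32-bit pattern (memcpy avoids aliasing issues). */
static uint32_t float_bits(float f) {
    uint32_t u;
    assert(sizeof u == sizeof f);
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret a 32-bit pattern as a float. */
static float bits_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Scalar model of the mask-and-merge select, one lane at a time. */
float4 maxf4_branchless(float4 a, float4 b) {
    float4 r;
    for (int i = 0; i < 4; ++i) {
        /* cmpps (greater-than): all ones if a > b, all zeros otherwise */
        uint32_t mask = 0u - (uint32_t)(a.v[i] > b.v[i]);
        uint32_t keep_a = float_bits(a.v[i]) & mask;   /* andps  */
        uint32_t keep_b = float_bits(b.v[i]) & ~mask;  /* andnps */
        r.v[i] = bits_float(keep_a | keep_b);          /* orps   */
    }
    return r;
}
```

Each lane gets a mask of all ones or all zeros from the comparison, and the result is (a & mask) | (b & ~mask), so no jump is needed to pick between %a[N] and %b[N]. With SSE intrinsics the same shape would presumably be written with _mm_cmpgt_ps, _mm_and_ps, _mm_andnot_ps and _mm_or_ps on whole vectors.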
According to the documentation ($(llvm)/lib/Target/X86/README-SSE.txt),
the select instruction (conditional move) is currently mapped to a branch,
not to an and*/andn*/or* triple or a blend* x86 instruction.
This is why the LLVM x86/SSE backend emits this unexpected, unoptimized assembly.
If I had enough time, I'd like to write a patch for LLVM's x86/SSE backend to emit the and*/andn*/or* triple instead of branching.