This is a blog for MUDA development. MUDA is a (short) vector language for CPUs.

Thursday, April 10, 2008

Initial tryout on double x 4 in MUDA.

To prepare for 256-bit SIMD (double x 4) in Intel's AVX (a future Intel CPU extension), I'm trying to implement a double x 4 feature in MUDA.

A new type, dvec, is introduced in MUDA, which represents double x 4.




// input.mu
dvec bora_func(dvec a)
{
    return a * a * a;
}




First, I wrote an SSE backend which translates dvec-typed expressions in almost the same manner as the existing vec (float x 4) type.




$ mudah input.mu > bora.c
dvec bora (const double * a)
{
const __muda_m256 t_dvec2 = (*((__muda_m256 *)(a))) ;
const __muda_m256 t_dvec1 = (*((__muda_m256 *)(a))) ;
const __muda_m256 t_dvec3 = _muda_mul_4d( t_dvec2 , t_dvec1 ) ;
const __muda_m256 t_dvec4 = (*((__muda_m256 *)(a))) ;
const __muda_m256 t_dvec5 = _muda_mul_4d( t_dvec3 , t_dvec4 ) ;
return t_dvec5 ;
}




Here __muda_m256 and _muda_mul_4d are a simple wrapper type and wrapper C function which emulate 256-bit SIMD on a current 128-bit SIMD machine, as defined below.




#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_mul_pd */

typedef union {
    struct { __m128d v[2]; };  /* two 128-bit halves */
    double f[4];
} __muda_m256 __attribute__((aligned(16)));

static inline __muda_m256 _muda_mul_4d(__muda_m256 a, __muda_m256 b)
{
    __muda_m256 ret;

    ret.v[0] = _mm_mul_pd(a.v[0], b.v[0]);
    ret.v[1] = _mm_mul_pd(a.v[1], b.v[1]);

    return ret;
}




But gcc translates this code into the following unoptimized assembly.




$ gcc -msse2 -O3 -c bora.c
$ otool -v -t bora.o
_bora:
00000000 pushl %ebp
00000001 movl %esp,%ebp
00000003 pushl %edi
00000004 pushl %esi
00000005 subl $0x00000150,%esp
0000000b movl 0x0c(%ebp),%eax
0000000e movl (%eax),%edx
00000010 movl %edx,0xfffffed4(%ebp)
...
00000111 movl 0xfffffec4(%ebp),%eax
00000117 mulpd 0xffffff38(%ebp),%xmm0
0000011f movapd %xmm0,0xffffff18(%ebp)
00000127 movapd 0xffffff08(%ebp),%xmm0
...
(total 170 instructions)




Doh! lots of mov*!

Even when using the latest llvm-gcc (llvm-gcc4.2-2.2-x86-darwin8), some redundant mov instructions still remain in the output.




_bora:
00000000 pushl %ebp
00000001 movl %esp,%ebp
00000003 subl $0x000000e8,%esp
00000009 movl 0x0c(%ebp),%eax
0000000c movapd 0x10(%eax),%xmm0
00000011 movapd (%eax),%xmm1
00000015 movapd %xmm0,0xffffff68(%ebp)
0000001d movapd %xmm1,0xffffff58(%ebp)
00000025 movapd %xmm0,0xffffff48(%ebp)
0000002d movapd %xmm1,0xffffff38(%ebp)
00000035 movapd 0xffffff58(%ebp),%xmm2
0000003d mulpd 0xffffff38(%ebp),%xmm2
00000045 movapd %xmm2,0xffffff78(%ebp)
0000004d movapd 0xffffff68(%ebp),%xmm2
00000055 mulpd 0xffffff48(%ebp),%xmm2
0000005d movapd %xmm2,0x88(%ebp)
00000062 movapd 0xffffff78(%ebp),%xmm3
0000006a movapd %xmm2,0xc8(%ebp)
0000006f movapd %xmm3,0xb8(%ebp)
00000074 movapd %xmm0,0xa8(%ebp)
00000079 movapd %xmm1,0x98(%ebp)
0000007e movapd 0xb8(%ebp),%xmm0
00000083 mulpd 0x98(%ebp),%xmm0
00000088 movapd %xmm0,0xd8(%ebp)
0000008d movapd 0xc8(%ebp),%xmm0
00000092 mulpd 0xa8(%ebp),%xmm0
00000097 movapd %xmm0,0xe8(%ebp)
0000009c movapd 0xd8(%ebp),%xmm1
000000a1 movapd %xmm0,0xffffff28(%ebp)
000000a9 movapd %xmm1,0xffffff18(%ebp)
000000b1 movl 0x08(%ebp),%eax
000000b4 movapd 0xffffff28(%ebp),%xmm0
000000bc movapd 0xffffff18(%ebp),%xmm1
000000c4 movapd %xmm1,(%eax)
000000c8 movapd %xmm0,0x10(%eax)
000000cd addl $0x000000e8,%esp
000000d3 popl %ebp
000000d4 ret $0x0004
(38 instructions)




I also got almost the same result from Intel's icc compiler.
It seems that this code is difficult for a C compiler to optimize.
I think I have to translate MUDA code into C in a much flatter manner, without using any macros or inlined wrapper functions
(i.e. directly emit two _mm_mul_pd() calls for a dvec-typed multiply).
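
For example, a flatter C translation of bora_func might look something like the following. This is just a hand-written sketch of what I have in mind (the name bora_flat and the out parameter are made up for illustration, and 16-byte aligned pointers are assumed), not actual mudah output:

#include <emmintrin.h>

/* Hand-flattened sketch of a * a * a for double x 4:
   two _mm_mul_pd calls per dvec multiply, everything kept in SSE
   registers, no wrapper union in between. Assumes 16-byte aligned
   pointers. */
void bora_flat(const double *a, double *out)
{
    __m128d a_lo = _mm_load_pd(a);          /* a[0], a[1] */
    __m128d a_hi = _mm_load_pd(a + 2);      /* a[2], a[3] */

    __m128d t_lo = _mm_mul_pd(a_lo, a_lo);  /* a * a (low half)  */
    __m128d t_hi = _mm_mul_pd(a_hi, a_hi);  /* a * a (high half) */

    _mm_store_pd(out,     _mm_mul_pd(t_lo, a_lo));  /* (a * a) * a */
    _mm_store_pd(out + 2, _mm_mul_pd(t_hi, a_hi));
}

With this kind of flat output the compiler no longer has to see through the union and the by-value wrapper calls, so the redundant moves should largely disappear.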

How about LLVM IR?

Then I also added initial double x 4 support to MUDA's LLVM backend.

Here is the LLVM IR version.




$ mudah --llvm input.mu
define <4xdouble> @bora (<4xdouble> %a)
{
%a.addr = alloca <4xdouble> ;
store <4xdouble> %a, <4xdouble>* %a.addr ;
%t_dvec2 = load <4xdouble>* %a.addr ;

%t_dvec1 = load <4xdouble>* %a.addr ;

%t_dvec3 = mul <4xdouble> %t_dvec2 , %t_dvec1 ;
%t_dvec4 = load <4xdouble>* %a.addr ;

%t_dvec5 = mul <4xdouble> %t_dvec3 , %t_dvec4 ;
ret <4xdouble> %t_dvec5 ;
}

$ llvm-as bora.ll -f
$ llc bora.bc -f
$ cat bora.s

_bora:
Leh_func_begin3:
Llabel3:
subl $44, %esp
movapd %xmm0, (%esp)
movapd %xmm1, 16(%esp)
movaps %xmm1, %xmm2
mulpd %xmm2, %xmm2
mulpd %xmm2, %xmm1
movaps %xmm0, %xmm2
mulpd %xmm2, %xmm2
mulpd %xmm2, %xmm0
addl $44, %esp
ret
(11 instructions)




The output assembly is almost optimal!
The LLVM infrastructure does a good job when we use vector expressions!

Tuesday, March 4, 2008

Unoptimized "select" instruction handling in LLVM x86 backend

I wrote portable LLVM IR code which implements a vector max() function (this should be provided as an intrinsic function in MUDA), as shown below.


;; max.ll
define <4xfloat> @muda_maxf4(<4xfloat> %a, <4xfloat> %b)
{

;; extract
%a0 = extractelement <4xfloat> %a, i32 0
%a1 = extractelement <4xfloat> %a, i32 1
%a2 = extractelement <4xfloat> %a, i32 2
%a3 = extractelement <4xfloat> %a, i32 3

%b0 = extractelement <4xfloat> %b, i32 0
%b1 = extractelement <4xfloat> %b, i32 1
%b2 = extractelement <4xfloat> %b, i32 2
%b3 = extractelement <4xfloat> %b, i32 3

;; c[N] = a[N] > b[N]
%c0 = fcmp ogt float %a0, %b0
%c1 = fcmp ogt float %a1, %b1
%c2 = fcmp ogt float %a2, %b2
%c3 = fcmp ogt float %a3, %b3

;; if %c[N] == 1 then %a[N] else %b[N]

%r0 = select i1 %c0, float %a0, float %b0
%r1 = select i1 %c1, float %a1, float %b1
%r2 = select i1 %c2, float %a2, float %b2
%r3 = select i1 %c3, float %a3, float %b3

;; pack

%tmp0 = insertelement <4xfloat> undef, float %r0, i32 0
%tmp1 = insertelement <4xfloat> %tmp0, float %r1, i32 1
%tmp2 = insertelement <4xfloat> %tmp1, float %r2, i32 2
%r = insertelement <4xfloat> %tmp2, float %r3, i32 3

ret <4xfloat> %r
}

Since LLVM IR doesn't accept vector types for the compare and select instructions,
I first convert the vectors to scalars, then do the compare/select, and finally pack the scalar results back into a vector.

But for this LLVM IR input, the LLVM x86/SSE backend emits the following unoptimized assembly.


$ llvm-as max.ll; opt -std-compile-opts max.bc -f | llc -march=x86 -mcpu=penryn -f

.text
.align 16
.globl muda_maxf4
.type muda_maxf4,@function
muda_maxf4:
extractps $3, %xmm1, %xmm2
extractps $3, %xmm0, %xmm3
ucomiss %xmm2, %xmm3
ja .LBB1_2 #
.LBB1_1: #
movaps %xmm2, %xmm3
.LBB1_2: #
extractps $1, %xmm1, %xmm2
extractps $1, %xmm0, %xmm4
ucomiss %xmm2, %xmm4
ja .LBB1_4 #
.LBB1_3: #
movaps %xmm2, %xmm4
.LBB1_4: #
movss %xmm4, %xmm2
unpcklps %xmm3, %xmm2
extractps $2, %xmm1, %xmm3
extractps $2, %xmm0, %xmm4
ucomiss %xmm3, %xmm4
ja .LBB1_6 #
.LBB1_5: #
movaps %xmm3, %xmm4
.LBB1_6: #
ucomiss %xmm1, %xmm0
ja .LBB1_8 #
.LBB1_7: #
movaps %xmm1, %xmm0
.LBB1_8: #
unpcklps %xmm4, %xmm0
unpcklps %xmm2, %xmm0
ret
.size muda_maxf4, .-muda_maxf4

^^) It consists of lots of jumps, while I expected to get one of the following:

  • cmpps + andps/andnps/orps
  • cmpps + blendps(in SSE4)
  • maxps
Anyway, it's not so strange that LLVM's x86/SSE backend emits the unoptimized code above.

According to the document ($(llvm)/lib/Target/X86/README-SSE.txt),
the select instruction (conditional move) is currently mapped to a branch,
not to an and*/andn*/or* triple or a blend* x86 instruction.

This is the reason the LLVM x86/SSE backend emits unexpected and unoptimized assembly.

If I had enough time, I'd like to write a patch for LLVM's x86/SSE backend to emit an and*/andn*/or* triple instead of branching.
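
For reference, here is what the and*/andn*/or* select pattern looks like when written by hand with SSE intrinsics. This is just my own illustration of the expected pattern (muda_maxf4_sse is a made-up name), not output from LLVM or MUDA; with SSE4's blendps, or simply maxps, the same result could be obtained even more directly:

#include <xmmintrin.h>

/* Branch-free select-based max: r[i] = (a[i] > b[i]) ? a[i] : b[i] */
static inline __m128 muda_maxf4_sse(__m128 a, __m128 b)
{
    __m128 mask  = _mm_cmpgt_ps(a, b);     /* cmpps: all-ones lanes where a > b */
    __m128 sel_a = _mm_and_ps(mask, a);    /* andps: keep a where the mask is set */
    __m128 sel_b = _mm_andnot_ps(mask, b); /* andnps: keep b in the other lanes */
    return _mm_or_ps(sel_a, sel_b);        /* orps: merge the two selections */
}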

Thursday, February 28, 2008

Logical op in LLVM

In MUDA, taking a logical op on floating point types is possible.


vec and_func(vec a, vec b)
{
return a & b;
}


For example, this expression computes as follows.


// a = (0x3f800000, 0x3f800000, 0x3f800000, 0x3f800000) = (1.0f, 1.0f, 1.0f, 1.0f)
// b = (0xffffffff, 0x00000000, 0x00000000, 0xf0f00000)
// a & b = (0x3f800000, 0x00000000, 0x00000000, 0x30800000)


But in LLVM IR, logical instructions do not accept floating point types as operands, so if we want to take a logical op on floating point values,
we first have to change the type of the variable from float to integer, without changing its bit content.

For this purpose, LLVM IR provides the bitcast op.
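
(As an aside, SSE itself has no such restriction: a C backend could lower the same expression directly to _mm_and_ps, which operates on float registers. A hypothetical sketch for the vec type, not actual MUDA output:)

#include <xmmintrin.h>

/* Hypothetical C/SSE lowering of "return a & b;" for the vec (float x 4) type. */
static inline __m128 muda_and_vec(__m128 a, __m128 b)
{
    return _mm_and_ps(a, b);  /* bitwise AND of the 128-bit registers */
}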

MUDA's LLVM IR backend emits the following code for the above MUDA input.


define <4xfloat> @and_func (<4xfloat> %a, <4xfloat> %b)
{
%a.addr = alloca <4xfloat> ;
store <4xfloat> %a, <4xfloat>* %a.addr ;
%b.addr = alloca <4xfloat> ;
store <4xfloat> %b, <4xfloat>* %b.addr ;
%t_vec2 = load <4xfloat>* %a.addr ;

%t_vec3 = load <4xfloat>* %b.addr ;

%t_ivec4 = bitcast <4xfloat> %t_vec2 to <4xi32> ;
%t_ivec5 = bitcast <4xfloat> %t_vec3 to <4xi32> ;
%t_ivec6 = and <4xi32> %t_ivec4 , %t_ivec5 ;
%t_vec1 = bitcast <4xi32> %t_ivec6 to <4xfloat> ;
ret <4xfloat> %t_vec1 ;
}

The generated LLVM IR code is somewhat redundant, but the LLVM optimizer and x86 backend emit exactly what I expected.


$ llvm-as tmp.ll -f; opt -std-compile-opts tmp.bc -f | llc -march=x86
.text
.align 4,0x90
.globl _and_func
_and_func:
andps %xmm1, %xmm0
ret

.subsections_via_symbols


The LLVM IR is mapped to just one andps instruction. LLVM is so nice!

Friday, February 8, 2008

Work in progress | MUDA -> LLVM backend

I've started to implement an LLVM IR backend for MUDA.

http://lucille.svn.sourceforge.net/viewvc/lucille/angelina/haskellmuda/CodeGenLLVM.hs?view=markup


MUDA's LLVM IR backend is still a work in progress.
First, I've had success with a simple case.

Here's the MUDA input code:



// input.mu
vec func(vec a, vec b)
{
    return a + b;
}


MUDA's current LLVM backend emits,


$ mudah --llvm input.mu > tmp.ll
$ cat tmp.ll

;;
;; The following code was generated by MUDA compiler
;;
target datalayout = "i32:128:128-f32:128:128"

define <4xfloat> @func (<4xfloat> %a, <4xfloat> %b)
{
%a.addr = alloca <4xfloat> ;
store <4xfloat> %a, <4xfloat>* %a.addr ;
%b.addr = alloca <4xfloat> ;
store <4xfloat> %b, <4xfloat>* %b.addr ;
%t_vec1 = load <4xfloat>* %a.addr ;

%t_vec2 = load <4xfloat>* %b.addr ;

%t_vec3 = add <4xfloat> %t_vec1 , %t_vec2 ;
ret <4xfloat> %t_vec3 ;
}



The LLVM code generated by the MUDA -> LLVM IR backend is straightforward and somewhat redundant.

Let's try to get native code with optimization from the LLVM midend and backend:


$ llvm-as tmp.ll -f
$ opt -std-compile-opts -f tmp.bc -o tmp.opt.bc
$ llc tmp.opt.bc -f
$ cat tmp.opt.s

.text
.align 4,0x90
.globl _func
_func:
addps %xmm1, %xmm0
ret

.subsections_via_symbols



This is exactly what I wanted to get in x86 assembly (just one addps instruction).
LLVM (and its x86 backend) rocks!

Sunday, February 3, 2008

MUDA blog launched.

I've decided to launch MUDA blog site separately.

And I've finished most of the basic implementation of the MUDA language.
If you want to play with MUDA, go to

http://lucille.sourceforge.net/muda/


And check out the current svn tree.

Examples & documentation will be updated soon.

Here are the TODOs:

  • LLVM IR backend (near future)
  • Math library for MUDA (near future)
  • Automatic optimization (middle future)
  • Formal verification of computation code by gappa (middle future)

About

My life as a renderer writer & quant. Here is my main blog: http://lucille.atso-net.jp/blog/