Tuesday, January 24, 2012

Bash'ing in Parallel

When you need to process many items (for the video in my previous post I started with 4817 pictures), it pays to think about how long it might take. ImageMagick is an excellent tool, but it is not yet threaded well enough to use all CPU cores, and it takes about 2 seconds per picture. For 4817 pictures that is roughly 4817 x 2 s = 9634 s, or about 160 minutes, which is a nightmare to wait for.
Since I am processing on Linux with a script, here is a trick that uses all local and remote CPU cores even from a Bash script and (!) requires minimal changes.
Instead of the loop that has been traditional for many years:

for i in `find . -type f -name "file*.png"`;
do 
do_view_port.sh `echo $i | sed -e 's/\.png//' -e 's/^.*_//'`
done 

Here every matching .png file in the folder is processed one by one by do_view_port.sh, my "magic" script with the needed actions. Just get GNU Parallel (IMHO the best option so far) and make a very minor change:


find . -type f -name "file*.png"  | sed -e 's/\.png//' -e 's/^.*_//' |
parallel -j+0 --eta do_view_port.sh {.}

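The result looks cleaner and runs faster (in my case: 7.65x).
Note {.}: GNU Parallel replaces it with the current input line with its extension removed (plain {} is the unmodified line). Since sed has already stripped .png here, {} should work as well, as long as the remaining name contains no dot. -j+0 runs one job per CPU core, and --eta prints the estimated time left.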

All in all, this one small change let me finish the processing in 23 minutes instead of 175 minutes (close to 3 hours).

PS: I personally liked the free ETA, meaning, as always, the time yet to go / Estimated Time to Achieve :)

Sunday, January 22, 2012

breaking picture down

Japanese games have always been a pleasure to look at, especially their effects, I would even say special effects...
I was always curious to see how exactly they implement those rainbow colors of theirs, but never got around to it...
Eventually, though, I did get to a rather good (IMHO) one: Devil May Cry, its 4th part.
It turns out the Japanese run their particle engine at full throttle and even a bit beyond: wider, and with a bigger set of colors, including some toxic-looking ones...
A lot, of course, is done with "well thought out" meshes and then shaders on top; for example, the upward sword swing is neatly reproduced with prepared meshes, and in several passes...
Naturally, alpha is used, plus a little of some other potion...

Somehow it all came together in a short video:


How exactly everything was prepared, rendered and post-processed will most likely turn into a separate post; some of it is simple, some of it varies...


I will have to take something else apart "down to the screws"...

Update: the last post-effect in the video is, in essence, naturally a fairly simple pixel/fragment shader:

// Buffer Definitions: 
//
// cbuffer FilterBlur
// {
//
//   float gXfBlurStart;                // Offset:    0 Size:     4
//   float gXfBlurWidth;                // Offset:    4 Size:     4 [unused]
//   float2 gXfScreenCenter;            // Offset:    8 Size:     8
//   float gXfAlpha;                    // Offset:   16 Size:     4
//
// }
//
//
// Resource Bindings:
//
// Name                   Type  Format         Dim Slot Elements
// ---------------- ---------- ------- ----------- ---- --------
// PointSampler0       sampler      NA          NA    0        1
// LinearSampler1      sampler      NA          NA    1        1
// PointSampler0TEXTURE    texture   float          2d    0        1
// LinearSampler1TEXTURE    texture   float          2d    1        1
// FilterBlur          cbuffer      NA          NA    0        1
//
//
//
// Input signature:
//
// Name             Index   Mask Register SysValue Format   Used
// ---------------- ----- ------ -------- -------- ------ ------
// SV_POSITION          0   xyzw        0      POS  float       
// TEXCOORD             0   xy          1     NONE  float   xy  
// TEXCOORD             1     zw        1     NONE  float       
//
//
// Output signature:
//
// Name             Index   Mask Register SysValue Format   Used
// ---------------- ----- ------ -------- -------- ------ ------
// SV_TARGET            0   xyzw        0   TARGET  float   xyzw
//
ps_4_0
dcl_input linear v1.xy
dcl_output o0.xyzw
dcl_constantbuffer  cb0[2], immediateIndexed
dcl_sampler s0, mode_default
dcl_sampler s1, mode_default
dcl_resource_texture2d ( float , float , float , float ) t0
dcl_resource_texture2d ( float , float , float , float ) t1
dcl_temps 2
add r0.xy, v1.xyxx, -cb0[0].zwzz
mad r0.xy, r0.xyxx, cb0[0].xxxx, cb0[0].zwzz
sample r0.xyzw, r0.xyxx, t1.xyzw, s1
sample r1.xyzw, v1.xyxx, t0.xyzw, s0
add r0.xyzw, r0.xyzw, -r1.xyzw
mad o0.xyzw, cb0[1].xxxx, r0.xyzw, r1.xyzw
ret
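
Read instruction by instruction, the disassembly seems to boil down to two steps: scale the texture coordinate around gXfScreenCenter by gXfBlurStart, then blend the sample taken there with the unscaled sample using gXfAlpha. Here is a rough CPU-side sketch of just that math in C++. The type and function names (Float2, Float4, FilterBlur, scaledUv, blend) are made up for illustration, and the texture sampling itself is left out.

```cpp
// Hypothetical CPU-side mirror of the shader math; all names here are
// illustrative and are not taken from the game.
struct Float2 { float x, y; };
struct Float4 { float x, y, z, w; };

// Only the constants the shader actually reads from cbuffer FilterBlur.
struct FilterBlur {
    float  blurStart;     // gXfBlurStart    -> cb0[0].x
    Float2 screenCenter;  // gXfScreenCenter -> cb0[0].zw
    float  alpha;         // gXfAlpha        -> cb0[1].x
};

// add r0.xy, v1.xyxx, -cb0[0].zwzz
// mad r0.xy, r0.xyxx, cb0[0].xxxx, cb0[0].zwzz
// Scales the coordinate around the screen center.
Float2 scaledUv(const FilterBlur& cb, Float2 uv) {
    return { (uv.x - cb.screenCenter.x) * cb.blurStart + cb.screenCenter.x,
             (uv.y - cb.screenCenter.y) * cb.blurStart + cb.screenCenter.y };
}

// add r0.xyzw, r0.xyzw, -r1.xyzw
// mad o0.xyzw, cb0[1].xxxx, r0.xyzw, r1.xyzw
// Per channel this is lerp(base, scaled, alpha).
Float4 blend(const FilterBlur& cb, Float4 scaled, Float4 base) {
    auto mix = [&](float s, float b) { return cb.alpha * (s - b) + b; };
    return { mix(scaled.x, base.x), mix(scaled.y, base.y),
             mix(scaled.z, base.z), mix(scaled.w, base.w) };
}
```

With gXfBlurStart a little below 1, the second sample comes from a slightly zoomed-in copy of the frame, which presumably gives the zoom-blur look once it is blended over the original.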

Thursday, January 12, 2012

Timing fun: timeBeginPeriod/timeEndPeriod

I will start from: The multimedia timer services allow an application to schedule periodic timer events — that is, the application can request and receive timer messages at application-specified intervals.
It is interesting to see a rather curious limitation:
You must match each call to timeBeginPeriod with a call to timeEndPeriod, specifying the same minimum resolution in both calls. An application can make multiple timeBeginPeriod calls as long as each call is matched with a call to timeEndPeriod.

That is from the timeBeginPeriod description on MSDN. And a bit more:
You must match each call to timeBeginPeriod with a call to timeEndPeriod, specifying the same minimum resolution in both calls. An application can make multiple timeBeginPeriod calls as long as each call is matched with a call to timeEndPeriod.

which is from the timeEndPeriod description on MSDN.

Funny, right? "Must match each call ... the same minimum resolution in both calls ..."
Let me show the details that help to understand why:
(disassembled and turned into C-like code by Hex-Rays; not everything has been renamed, only the major logic)
MMRESULT __stdcall timeBeginPeriod(UINT uPeriod)
{
  UINT v1; // esi@1
  char *v2; // eax@2
  __int16 v3; // cx@3
  int v4; // eax@8
  MMRESULT v5; // esi@10
  MMRESULT result; // eax@11

  v1 = uPeriod;
  if ( uPeriod < TDD_MAXRESOLUTION )
  {
    result = 97;
  }
  else
  {
    JUMPOUT(uPeriod, dword_41B28FE4, loc_41B0A0A1);
    EnterCriticalSection(&ResolutionCritSec);
    v2 = (char *)&word_41B28FF6[v1 - TDD_MAXRESOLUTION];
    if ( *(_WORD *)v2 == -1 )
    {
      LeaveCriticalSection(&ResolutionCritSec);
      result = 97;
    }
    else
    {
      v3 = *(_WORD *)v2 + 1;
      *(_WORD *)v2 = v3;
      if ( v3 != 1 || v1 >= saved_value_2 )
      {
        v5 = 0;
      }
      else
      {
        if ( WPP_GLOBAL_Control != &WPP_GLOBAL_Control
          && *((_DWORD *)WPP_GLOBAL_Control + 7) & 0x400000
          && *((_BYTE *)WPP_GLOBAL_Control + 25) >= 5u )
          WPP_SF_P(
            *((_DWORD *)WPP_GLOBAL_Control + 4),
            *((_DWORD *)WPP_GLOBAL_Control + 5),
            16,
            (int)dword_41B02720,
            v1);
        v4 = 10000 * v1;
        uPeriod = 10000 * v1;
        if ( 10000 * v1 < MinimumTime )
        {
          v4 = MinimumTime;
          uPeriod = MinimumTime;
        }
        if ( NtSetTimerResolution(&uPeriod, v4, 1, &uPeriod) < 0 )
        {
          if ( WPP_GLOBAL_Control != &WPP_GLOBAL_Control && *((_DWORD *)WPP_GLOBAL_Control + 7) & 0x400000 )
          {
            if ( *((_BYTE *)WPP_GLOBAL_Control + 25) >= 1u )
              WPP_SF_P(
                *((_DWORD *)WPP_GLOBAL_Control + 4),
                *((_DWORD *)WPP_GLOBAL_Control + 5),
                17,
                (int)dword_41B02720,
                v1);
          }
          --word_41B28FF6[v1 - TDD_MAXRESOLUTION];
          v5 = 97;
        }
        else
        {
          saved_value = v1;
          v5 = 0;
          saved_value_2 = (uPeriod + 9900) / 0x2710;
        }
      }
      LeaveCriticalSection(&ResolutionCritSec);
      result = v5;
    }
  }
  return result;
}



//----- (41B09FEB) --------------------------------------------------------
MMRESULT __stdcall timeEndPeriod(UINT uPeriod)
{
  UINT v1; // esi@1
  char *v2; // eax@3
  __int16 v3; // cx@4
  int v4; // ecx@9
  MMRESULT v5; // esi@10
  MMRESULT result; // eax@11
  char v7; // [sp+4h] [bp-4h]@9

  v1 = uPeriod;
  if ( uPeriod < TDD_MAXRESOLUTION )
  {
    result = 97;
  }
  else
  {
    if ( uPeriod >= dword_41B28FE4 )
    {
      result = 0;
    }
    else
    {
      EnterCriticalSection(&ResolutionCritSec);
      v2 = (char *)&unk_41B28FF6 + 2 * (v1 - TDD_MAXRESOLUTION);
      if ( *(_WORD *)v2 )
      {
        v3 = *(_WORD *)v2 - 1;
        *(_WORD *)v2 = v3;
        if ( !v3 && v1 == saved_value )
        {
          while ( v1 < dword_41B28FE4 && !*(_WORD *)v2 )
          {
            ++v1;
            v2 += 2;
          }
          NtSetTimerResolution(dword_41B28FE4, 10000 * saved_value_2, 0, &v7);
          saved_value_2 = dword_41B28FE4;
          saved_value = v1;
          if ( v1 < dword_41B28FE4 )
          {
            if ( WPP_GLOBAL_Control != &WPP_GLOBAL_Control
              && *((_DWORD *)WPP_GLOBAL_Control + 7) & 0x400000
              && *((_BYTE *)WPP_GLOBAL_Control + 25) >= 5u )
              WPP_SF_P(
                *((_DWORD *)WPP_GLOBAL_Control + 4),
                *((_DWORD *)WPP_GLOBAL_Control + 5),
                20,
                (int)dword_41B02720,
                v1);
            if ( NtSetTimerResolution(v4, 10000 * v1, 1, &uPeriod) < 0 )
            {
              if ( WPP_GLOBAL_Control != &WPP_GLOBAL_Control && *((_DWORD *)WPP_GLOBAL_Control + 7) & 0x400000 )
              {
                if ( *((_BYTE *)WPP_GLOBAL_Control + 25) >= 1u )
                  WPP_SF_P(
                    *((_DWORD *)WPP_GLOBAL_Control + 4),
                    *((_DWORD *)WPP_GLOBAL_Control + 5),
                    21,
                    (int)dword_41B02720,
                    v1);
              }
            }
            else
            {
              saved_value_2 = (uPeriod + 9999) / 0x2710;
            }
          }
        }
        v5 = 0;
      }
      else
      {
        v5 = 97;
      }
      LeaveCriticalSection(&ResolutionCritSec);
      result = v5;
    }
  }
  return result;
}

Note several things:
- the use of saved_value (and saved_value_2)
- the use of TDD_MAXRESOLUTION and the details of the error returns (97 is TIMERR_NOCANDO)
- the explicit use of EnterCriticalSection for thread safety
(I will skip the rest as less relevant for now)

You have already noticed the check
if ( !v3 && v1 == saved_value )
inside timeEndPeriod, right :)?

That explains the "Must match each call ... the same minimum resolution in both calls ..." requirement: the counters are kept per period value, so a timeEndPeriod call with a different value decrements a different counter, or fails with 97 if that counter is already zero.
IMHO, a purely architectural issue...
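
To make the bookkeeping easier to see, here is a simplified, portable C++ model of what the decompiled code appears to do: one 16-bit counter per period value, plus the finest period currently requested. It is only a sketch. The names (TimerPeriods, begin, end, current) are hypothetical, the call to NtSetTimerResolution is left out, and returning success for periods at or above the upper bound in begin() is an assumption copied from the matching branch in timeEndPeriod.

```cpp
#include <cstdint>
#include <vector>

constexpr unsigned kOk = 0;
constexpr unsigned kNoCanDo = 97;  // TIMERR_NOCANDO

// Hypothetical model of the per-period reference counting; not the real API.
class TimerPeriods {
public:
    TimerPeriods(unsigned minPeriod, unsigned maxPeriod)
        : min_(minPeriod), max_(maxPeriod),
          counts_(maxPeriod - minPeriod, 0), current_(maxPeriod) {}

    unsigned begin(unsigned period) {
        if (period < min_) return kNoCanDo;
        if (period >= max_) return kOk;        // assumption: nothing to record
        std::uint16_t& count = counts_[period - min_];
        if (count == 0xFFFF) return kNoCanDo;  // counter would overflow
        ++count;
        // Only the first request for a finer period changes the resolution.
        if (count == 1 && period < current_) current_ = period;
        return kOk;
    }

    unsigned end(unsigned period) {
        if (period < min_) return kNoCanDo;
        if (period >= max_) return kOk;
        std::uint16_t& count = counts_[period - min_];
        if (count == 0) return kNoCanDo;       // no begin() with this exact value
        --count;
        if (count == 0 && period == current_) {
            // Fall back to the next finest period that is still requested.
            unsigned p = period;
            while (p < max_ && counts_[p - min_] == 0) ++p;
            current_ = p;
        }
        return kOk;
    }

    unsigned current() const { return current_; }

private:
    unsigned min_, max_;
    std::vector<std::uint16_t> counts_;
    unsigned current_;  // plays the role of saved_value
};
```

In this model, begin(1) followed by end(2) fails with 97 and leaves the resolution at 1, which is exactly the situation the MSDN remark warns about.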

Now about the timeGetDevCaps function, which determines the minimum and maximum timer resolutions supported by the timer services, and TDD_MAXRESOLUTION.
Well, its code shows everything faster than I could ever describe it:
MMRESULT __stdcall timeGetDevCaps(LPTIMECAPS ptc, UINT cbtc)
{
  MMRESULT result; // eax@3

  if ( cbtc >= 8 && ptc )
  {
    ptc->wPeriodMin = TDD_MAXRESOLUTION;
    ptc->wPeriodMax = 1000000;
    result = 0;
  }
  else
  {
    result = 97;
  }
  return result;
}

PS: WineHQ does things differently...

PS2: With Hex-Rays: some time (a coffee break) for Hex-Rays to do the processing, then 10 seconds of looking to understand the full story.
Assembler without Hex-Rays: 1 minute (mostly scrolling back and forth), so 6x slower? :)

Saturday, January 7, 2012

Deferred Shading by OpenGL

Deferred shading, rather than standard (forward) shading, is quite popular these days. Many of the details, with pros and cons, are already well described elsewhere.
With many lights and vertices it is worth checking: I have already seen 5.8x better results for the deferred approach (deferred: 480 FPS vs standard: 84 FPS for 1081344 vertices).
This is based on the model(s): Harley Quinn alone contains a bit more than 36K vertices.
From the OpenGL side, the differences are as follows (taken from an ExamDiff Pro diff report; the first fragment is deferred shading, the second is standard shading):
//Deferred shading (first fragment of the diff)
//Pass 1
//Draw the geometry, saving parameters into the buffer

//Make the pbuffer the current context
pbuffer.MakeCurrent();

//Clear buffers
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glColor4f(1.0f, 1.0f, 1.0f, 1.0f);

glLoadIdentity();
gluLookAt( 0.0f, 4.0f, 3.0f,
           0.0f, 0.0f, 0.0f,
           0.0f, 1.0f, 0.0f);

//Bind and enable vertex & fragment programs
glBindProgramARB(GL_VERTEX_PROGRAM_ARB, deferredShadingPass1VP);
glEnable(GL_VERTEX_PROGRAM_ARB);

glBindProgramNV(GL_FRAGMENT_PROGRAM_NV, deferredShadingPass1FP);
glEnable(GL_FRAGMENT_PROGRAM_NV);

//Draw the torus knot
glDrawElements(GL_TRIANGLES, torusKnot.numIndices, GL_UNSIGNED_INT, (char *)NULL);

//Draw the "floor"
glNormal3f(0.0f, 1.0f, 0.0f);
glBegin(GL_TRIANGLE_STRIP);
{
    glVertex3f( 5.0f,-0.5f, 5.0f);
    glVertex3f( 5.0f,-0.5f,-5.0f);
    glVertex3f(-5.0f,-0.5f, 5.0f);
    glVertex3f(-5.0f,-0.5f,-5.0f);
}
glEnd();

glDisable(GL_VERTEX_PROGRAM_ARB);
glDisable(GL_FRAGMENT_PROGRAM_NV);

//Copy the pbuffer contents into the pbuffer texture
glBindTexture(GL_TEXTURE_RECTANGLE_NV, pbufferTexture);
glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, 0, 0, 0, 0, 0,
                    pbuffer.width, pbuffer.height);

//Make the window the current context
WINDOW::Instance()->MakeCurrent();

//Pass 2
//Draw a quad covering the region of influence of each light
//Unpack the data from the buffer, perform the lighting equation and update
//the framebuffer

//Set orthographic projection, 1 unit=1 pixel
glMatrixMode(GL_PROJECTION);
glPushMatrix();
glLoadIdentity();
gluOrtho2D(0, WINDOW::Instance()->width, 0, WINDOW::Instance()->height);

//Set identity modelview
glMatrixMode(GL_MODELVIEW);
glPushMatrix();
glLoadIdentity();

//Disable depth test
glDisable(GL_DEPTH_TEST);

//Bind the pbuffer texture
glBindTexture(GL_TEXTURE_RECTANGLE_NV, pbufferTexture);

//Bind and enable fragment program
glBindProgramNV(GL_FRAGMENT_PROGRAM_NV, deferredShadingPass2FP);
glEnable(GL_FRAGMENT_PROGRAM_NV);

//Loop through the lights
for(int i=0; i<numLights; ++i)
{
    //Calculate the rectangle to draw for this light
    int rectX, rectY, rectWidth, rectHeight;

    lights[i].GetWindowRect(WINDOW::Instance()->width, WINDOW::Instance()->height,
                            viewMatrix, currentTime, cameraNearDistance,
                            cameraFovy, cameraAspectRatio,
                            rectX, rectY, rectWidth, rectHeight);

    //Enable additive blend if i>0
    if(i>0)
    {
        glBlendFunc(GL_ONE, GL_ONE);
        glEnable(GL_BLEND);
    }

    //Send the light's color to fragment program local parameter 0
    glProgramLocalParameter4fvARB( GL_FRAGMENT_PROGRAM_NV, 0, lights[i].color);

    //Send 1/(light radius)^2 to fragment program local parameter 1
    float inverseSquareLightRadius=1.0f/(lights[i].radius*lights[i].radius);
    glProgramLocalParameter4fARB( GL_FRAGMENT_PROGRAM_NV, 1,
                                  inverseSquareLightRadius, inverseSquareLightRadius,
                                  inverseSquareLightRadius, inverseSquareLightRadius);

    //Send the light's position to fragment program local parameter 2
    glProgramLocalParameter4fvARB( GL_FRAGMENT_PROGRAM_NV, 2,
                                   VECTOR4D(lights[i].GetPosition(currentTime)));

    //Draw the rectangle
    glBegin(GL_TRIANGLE_STRIP);
    {
        glVertex2i(rectX, rectY);
        glVertex2i(rectX+rectWidth, rectY);
        glVertex2i(rectX, rectY+rectHeight);
        glVertex2i(rectX+rectWidth, rectY+rectHeight);
    }
    glEnd();
}

//Restore matrices
glMatrixMode(GL_PROJECTION);
glPopMatrix();
glMatrixMode(GL_MODELVIEW);
glPopMatrix();

glEnable(GL_DEPTH_TEST);
glDisable(GL_FRAGMENT_PROGRAM_NV);
glDisable(GL_BLEND);

//Standard shading (second fragment of the diff)
//Make an initial pass to lay down Z
glColorMask(0, 0, 0, 0);

//Draw the torus knot
glDrawElements(GL_TRIANGLES, torusKnot.numIndices, GL_UNSIGNED_INT, (char *)NULL);

//Draw the "floor"
glNormal3f(0.0f, 1.0f, 0.0f);
glBegin(GL_TRIANGLE_STRIP);
{
    glVertex3f( 5.0f,-0.5f, 5.0f);
    glVertex3f( 5.0f,-0.5f,-5.0f);
    glVertex3f(-5.0f,-0.5f, 5.0f);
    glVertex3f(-5.0f,-0.5f,-5.0f);
}
glEnd();

glColorMask(1, 1, 1, 1);

//Bind and enable vertex & fragment programs
glBindProgramARB(GL_VERTEX_PROGRAM_ARB, standardShadingVP);
glEnable(GL_VERTEX_PROGRAM_ARB);

glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, standardShadingFP);
glEnable(GL_FRAGMENT_PROGRAM_ARB);

//Loop through the lights
for(int i=0; i<numLights; ++i)
{
    //Calculate and set the scissor rectangle for this light
    int scissorX, scissorY, scissorWidth, scissorHeight;

    lights[i].GetWindowRect(WINDOW::Instance()->width, WINDOW::Instance()->height,
                            viewMatrix, currentTime, cameraNearDistance,
                            cameraFovy, cameraAspectRatio,
                            scissorX, scissorY, scissorWidth, scissorHeight);

    glScissor(scissorX, scissorY, scissorWidth, scissorHeight);
    glEnable(GL_SCISSOR_TEST);

    //Enable additive blend if i>0
    if(i>0)
    {
        glBlendFunc(GL_ONE, GL_ONE);
        glEnable(GL_BLEND);
    }

    //Calculate the object space light position and send to
    //vertex program local parameter 0
    //Object space and world space are the same
    glProgramLocalParameter4fvARB( GL_VERTEX_PROGRAM_ARB, 0,
                                   VECTOR4D(lights[i].GetPosition(currentTime)));

    //Send the light's color to fragment program local parameter 0
    glProgramLocalParameter4fvARB( GL_FRAGMENT_PROGRAM_ARB, 0, lights[i].color);

    //Send 1/(light radius)^2 to fragment program local parameter 1
    float inverseSquareLightRadius=1.0f/(lights[i].radius*lights[i].radius);
    glProgramLocalParameter4fARB( GL_FRAGMENT_PROGRAM_ARB, 1,
                                  inverseSquareLightRadius, inverseSquareLightRadius,
                                  inverseSquareLightRadius, inverseSquareLightRadius);

    //Draw the torus knot
    glDrawElements(GL_TRIANGLES, torusKnot.numIndices, GL_UNSIGNED_INT, (char *)NULL);

    //Draw the "floor"
    glNormal3f(0.0f, 1.0f, 0.0f);
    glBegin(GL_TRIANGLE_STRIP);
    {
        glVertex3f( 5.0f,-0.5f, 5.0f);
        glVertex3f( 5.0f,-0.5f,-5.0f);
        glVertex3f(-5.0f,-0.5f, 5.0f);
        glVertex3f(-5.0f,-0.5f,-5.0f);
    }
    glEnd();
}

glDisable(GL_VERTEX_PROGRAM_ARB);
glDisable(GL_FRAGMENT_PROGRAM_ARB);
glDisable(GL_SCISSOR_TEST);
glDisable(GL_BLEND);

The shader side will come later...
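
In the meantime, one structural difference between the two fragments can be put into numbers: the standard path submits the whole geometry once for the depth pass and then once more per light, while the deferred path submits it once and then draws a 4-vertex rectangle per light. Here is a tiny C++ sketch of that count. The function names are hypothetical, and it only counts submitted vertices: fragment work, buffer bandwidth and everything else that drives the real FPS numbers are ignored.

```cpp
#include <cstdint>

// Standard path: one depth-only pass plus one full geometry pass per light.
std::uint64_t standardVertexWork(std::uint64_t sceneVertices,
                                 std::uint64_t numLights) {
    return sceneVertices * (1 + numLights);
}

// Deferred path: geometry once, then a 4-vertex triangle strip per light.
std::uint64_t deferredVertexWork(std::uint64_t sceneVertices,
                                 std::uint64_t numLights) {
    return sceneVertices + 4 * numLights;
}
```

For example, with 1000 vertices and 8 lights this gives 9000 submitted vertices for the standard path against 1032 for the deferred one, so the gap should grow with both the vertex count and the number of lights.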

PS: have you seen a good diff viewer?

Wednesday, January 4, 2012

Parallel file load

In some cases and on some OSes (Linux, I have to say),
loading the files to process in parallel, such as:
tbb::tick_count start = tbb::tick_count::now();

parallel_invoke( [&]() {preload(argv[1],first);},[&]() {preload(argv[2],second);} );

// preload #1 0.118449 seconds ,  :: parallel_invoke
// preload #2 0.130777 seconds ,  :: preload, preload
can be, as you can see, ~10% faster (measured via tbb::tick_count, of course).

I don't really want to go into more details, but it is a nice improvement for almost no changes...
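
For those without TBB at hand, the same idea can be sketched with the standard library only. Note that this swaps in std::async and std::chrono::steady_clock for tbb::parallel_invoke and tbb::tick_count. The preload() below is a hypothetical stand-in that simply reads a whole file into memory (the real preload is not shown in this post), and preloadBoth and Loaded are illustrative names too.

```cpp
#include <chrono>
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for preload(): read the whole file into memory.
std::vector<char> preload(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

struct Loaded {
    std::vector<char> first;
    std::vector<char> second;
    double seconds;  // wall-clock time for both loads together
};

// Load one file on a worker thread while the calling thread loads the other.
Loaded preloadBoth(const std::string& firstPath, const std::string& secondPath) {
    auto start = std::chrono::steady_clock::now();
    auto pending = std::async(std::launch::async, preload, firstPath);
    std::vector<char> second = preload(secondPath);
    std::vector<char> first = pending.get();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return { std::move(first), std::move(second), elapsed.count() };
}
```

On some Linux toolchains this needs -pthread when compiling. Whether it is actually faster will depend on the storage and the file cache, so it is worth timing against two plain sequential preload() calls, as in the numbers above.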