2

I have a 455 byte array that contains 13, 35 byte data structures.

TurnTreeBuff:   resb 455    ; TurnNum (1 byte) + MT (16 bytes) + PM (16 bytes)
                ; + BoardState (2 bytes) = 35 bytes * 13 turns = 455 bytes

Initially I thought I could access a particular data structure by taking an index, multiplying by 35 and adding it to TurnTreeBuff. But the only valid scale factors are 1, 2, 4 and 8.

So this:

mov word [TurnTreeBuff+ebx*35],ax ; doesn't work

I can do it by copying the index into another register, then shifting the original index value left 5 times and adding it's copy to it's shifted value 3 times. But this seems really cumbersome. It also seems really cumbersome to use the MUL command. Am I better off just padding my structure to be 64 bytes, so it's a multiple of 2? That seems really wasteful as well.

Is there a better way?

(Using NASM 2.10 to assemble a 32-bit binary)

10
  • Is speed an issue, Is memory space an issue? Or is the issue one of code readability? Commented Nov 3, 2014 at 21:38
  • 1
    32 bit binary I guess you are using 32 bit code. Have you considered imul? Commented Nov 3, 2014 at 21:51
  • 1
    Yes, you can. The instruction set reference is your friend: The two- and three-operand forms may also be used with unsigned operands. Note, if you are in a loop, it's probably easier to increment your pointer by 35 in each iteration, and not do a multiply. Commented Nov 3, 2014 at 22:04
  • 1
    It will work on unsigned numbers. But I see @jester answered that. That's what I get for walking away for a snack! He is also right .. if you are walking a structure from beginning to end one structure at a time adding would be better. I had assumed that access may be random given the comment access a particular data structure Commented Nov 3, 2014 at 22:09
  • 2
    Also, it makes sense to keep stuff aligned so you could extend TurnNum to 2 bytes (which is how you use it apparently anyway). Multiplying by 36 can be done with a LEA trick: lea ecx, [ebx*8+ebx]; mov [TurnTreeBuff + 4*ecx], ax Commented Nov 3, 2014 at 22:16

1 Answer 1

3

That * 35 would indeed not compile:

mov word [TurnTreeBuff+ebx*35],ax ; doesn't work

The processor only supports power of two and only the first few: x1, x2, x4, and x8. So if your structure is larger, you have to revert to using a multiplication.

Note that mul/imul are very fast (i.e. as fast as an add or sub) so you shouldn't worry about using such, although if a simple shift would work with your structure (i.e. 64 bytes as you mentioned) then using a shift is probably a lot better. (the mul/imul take longer if you use the result on the very next line.)

Finally, a mov in 32 bit process from an odd address is not a good idea. So the size of your structure should at least be a multiple of 4 bytes, so 36 bytes in your case.

P.S. you could also use both features: so your index (ebx) could be set to 0, 9, 18, etc. and then use the x4 in the instruction. However, using a multiplicator in the address field slows things down a bit... Yet, if you like to have fun, you can do what Jester proposed which is to multiply a bare index (0, 1, 2, etc.) and use lea ecx, [ebx * 8 + ebx] to multiply by 9 and then use that with x4 in the other address. The big problem with such cool things is: if your structure changes its size... you have to rewrite a lot of code.

Now, what I would generally do, assuming you are looping over your array of structures, is add the size of the structure to the index. For example:

  mov ebx, TurnTreeBuff  ; get address of first structure
.L1:
  ...
  mov al, [ebx+0] ; TurnNum
  mov eax, [ebx+1] ; MT 1st work (should be aligned...)
  mov eax, [ebx+5] ; MT 2nd work
  ..
  mov ax, [ebx+33] ; BoardState
  ...
  add ebx, 35 ; again, use a multiple of your architecture: 16, 32, 64 bits...
  loop .L1

Now the mov instructions are very efficient because you do not have a complicated address mode which slows down things (assuming you do millions of accesses, it will show!)

Note that your structure should be reorganized to get things aligned:

  TurnNum (1 byte)
  PAD (1 byte)
  BoardState (2 bytes)
  MT (16 bytes)
  PM (16 bytes)

Otherwise you hit the memory at unaligned positions all the time and that definitively slows things down.

P.S. The x2, x4, and x8 were primarily added to processors so one could access arrays of pointers, with the added benefit that you could access structures of such sizes. It uses just 2 bits in the instruction, hence the limited range: 1 << n where n is 0, 1, 2, 3.

Sign up to request clarification or add additional context in comments.

2 Comments

Some points: + mul/imul are not as fast as an add or sub Source: (agner.org/optimize/instruction_tables.pdf) + Processors generally have dedicated hardware for address generation. So using adds or powers of 2 multiplies don't slow things down at all on x86 processors.
The addressing may not slow down on newer versions of x86. Although if you overuse it, you end up with a bigger program which won't as well fit in the instruction cache generating another kind of slow down... For the IMUL with an immediate value is 2 cycles instead of 1 for the ADD. I think that's bearable. Plus if you can have another instructions in between before using the result of the IMUL the pipeline will probably "cancel out" the "loss."

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.