Optimize PutFS in text.z80 #53

Open
wants to merge 17 commits into
base: master

NonstickAtom785
Contributor

No description provided.

@NonstickAtom785
Contributor Author

Optimized the Text Routine and added it to the pull request. You can test it if you'd like :)

@NonstickAtom785
Contributor Author

PutLeft:
  ld a,(de)
  call nc,Put
PutRight:
  ld a,(de)
  call ShiftPut
  ld a,(de)
  inc de
  call c,Put
  djnz PutLeft
  ret

Five bytes smaller. I think...

@NonstickAtom785
Contributor Author

I don't know how fast this one is but it is a little more optimized.

  ld bc,$030F
  ld a,(de)
  jr nz,PutRight    ;Note my nz is there because I am doing and 7 in my GetPixel routine.
  ld c,$F0
PutLeft:
  call Put
  inc de
  call ShiftPut
  djnz PutLeft
  ret
PutRight:
  call ShiftPut
  inc de
  call Put
  djnz PutRight
  ret
ShiftPut:
  rlca \ rlca \ rlca \ rlca
Put:
  bit textInverse,(iy+textFlags)
  jr z,+_
  cpl
_:
  and c
  or (hl)
  ld (hl),a
  push bc
  ld bc,12
  add hl,bc
  pop bc
  ld a,(de)
  ret

I would like to remove the push bc and the pop bc but I don't know any faster alternatives. Also, everything after ld (hl),a costs extra clock cycles on the final pass, where the result is no longer needed. Is there a way to improve that?

I added some improvements to this version. If I calculated correctly, the clock cycle counts are 839 for PutRight and 834 for PutLeft.
I used the ixl half-register as temporary storage for the mask. It saved 1 byte and removed more clock cycles.
@Zeda
Owner

Zeda commented Dec 5, 2020

  push bc
  ld bc,12
  add hl,bc
  pop bc
  ld a,(de)
  ret

I would like to remove the push bc and the pop bc but I don't know any faster alternatives.

For code like that, I do something like this:

  ld a,12
  add a,l
  ld l,a
  ld a,(de)
  ret nc
  inc h
  ret

That goes from 8 bytes, 59cc to 8 bytes, 33cc or 41cc. The average is 33.375cc, since the 41cc path (where the add to L carries and H has to be incremented) only happens 12/256 of the time (~4.7%).
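For anyone wondering where those numbers come from, here is a sketch of the count, one instruction at a time. The per-instruction timings are the standard Z80 T-state figures; the totals are just their sums. Both fragments assume HL already holds the buffer pointer and DE the font pointer, as in the routines above.

```z80
; Original: advance HL by 12 using BC as scratch (8 bytes, 59cc)
  push bc       ; 11cc, 1 byte
  ld bc,12      ; 10cc, 3 bytes
  add hl,bc     ; 11cc, 1 byte
  pop bc        ; 10cc, 1 byte
  ld a,(de)     ;  7cc, 1 byte
  ret           ; 10cc, 1 byte

; Replacement: 8-bit add on L, fix up H only on carry (8 bytes)
  ld a,12       ;  7cc, 2 bytes
  add a,l       ;  4cc, 1 byte
  ld l,a        ;  4cc, 1 byte
  ld a,(de)     ;  7cc, 1 byte (LD leaves the carry flag alone)
  ret nc        ; 11cc if it returns, 5cc if it does not, 1 byte
  inc h         ;  4cc, 1 byte
  ret           ; 10cc, 1 byte
; No carry: 7+4+4+7+11      = 33cc
; Carry:    7+4+4+7+5+4+10  = 41cc
; Carry needs L >= 244, i.e. 12 of 256 values of L,
; so the average is 33 + 8*(12/256) = 33.375cc
```

The average assumes every value of L is equally likely, which is only an approximation for real buffer addresses.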

Your newest text code didn't quite work for me when I compiled it, but here is what I came up with for the PutFS routine:

PutFS:
; read the font from flash to RAM
; need to add 3*A to the fontpointer
  ld hl,(FontPointer)
  ld b,0
  ld c,a
  add hl,bc
  add hl,bc
  adc hl,bc     ;add hl,bc won't set the right flags, so use adc
  ld a,(font_ptr_page)
  jp p,+_
  or a
  jr z,+_
  set 6,h
  res 7,h
  inc a
_:
  ld c,3
  ld de,$8005
  call readarc

; get the text position and update it
  ld hl,(textRow)
;  ld b,0 ;B is already 0 from the ReadArc routine
  ld a,h
  cp 24
  ld a,l
  jr c,+_
  ld h,b
  add a,6
_:
  cp 3Bh
  jr c,+_
  sub 3Ch
  jr nc,+_
  add a,6
_:
  ld l,a
; need to advance the x-coord by 1
  inc h
  ld (textRow),hl
  dec h
  ;want A*12+H/2+(gbuf_temp); A < 64, so A*4 still fits in 8 bits
  add a,a
  add a,a
  ;A has been multiplied by 4, so now we want A*3+(gbuf_temp)+H/2
  ld c,a
  ld a,h
  ld hl,(gbuf_temp)
  add hl,bc
  add hl,bc
  add hl,bc
  ld c,a
  srl c
  add hl,bc
  rra
  ld e,4    ; now DE points to the byte before the char data
  jr nc,put_left
put_right:
  ld c,$0F
  call put_right2
  call put_right2
put_right2:
; read in the byte
  inc de
  ld a,(de)

; check if it needs to be inverted
  bit InvertTextFlag,(iy+UserFlags)
  jr z,$+3
  cpl
  ld b,a      ; back up the byte
  call shift_put_lr
  ld a,b      ;restore the byte
  jr put_lr

put_left:
  ld c,$F0
  call put_left2
  call put_left2
put_left2:
; read in the byte
  inc de
  ld a,(de)

; check if it needs to be inverted
  bit InvertTextFlag,(iy+UserFlags)
  jr z,$+3
  cpl

  ld b,a      ; back up the byte
  call put_lr
  ld a,b      ;restore the byte

shift_put_lr:
; rotate the nibbles
  rrca
  rrca
  rrca
  rrca
put_lr:
; mask the byte
  and c

; OR it to the screen
  or (hl)
  ld (hl),a

; advance the gbuf ptr
  ld a,l
  add a,12
  ld l,a
  ret nc
  inc h
  ret

I reorganized some of the beginning code in PutFS. With the somewhat recent text updates, Grammer (finally) supported archived fonts, but I basically patched the PutFS code instead of reorganizing it to be more optimal. So now it reads the char data from flash to a fixed location so it doesn't have to keep track of the char # or pointer. It then updates the text coordinates and directly proceeds to convert those to an offset into the graphics buffer. I tweaked that calculation to save a few more bytes and clock cycles by taking advantage of the Y-coordinate being less than 64. Then we get into the actual drawing of the char, where I use your idea of calling a common put/shiftput subroutine, but instead of using B as a counter and looping 3 times, I just call the body of the routine twice and fall through for the third iteration. As well, I moved the logic to invert the text to the body instead of the put routine, saving about 81cc (at the cost of 7 bytes since I duplicate that code in the left and right variants).
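As a stripped-down illustration of the "call twice, then fall through" idea, here is a sketch comparing it with a DJNZ loop. The labels loop_three, unrolled_three, and body are placeholders for illustration and do not appear in text.z80, and the overhead totals are my own sums of the standard instruction timings.

```z80
; Looping form: B is tied up as the counter (8 bytes of control code)
loop_three:
  ld b,3          ;  7cc, 2 bytes
loop_three_next:
  call body       ; 17cc, 3 bytes (body must preserve B)
  djnz loop_three_next ; 13cc when looping, 8cc on exit, 2 bytes
  ret             ; 10cc, 1 byte

; Fall-through form: two calls, then run into the body a third time
; (6 bytes of control code, B stays free)
unrolled_three:
  call body       ; 17cc, 3 bytes
  call body       ; 17cc, 3 bytes
body:
  ; ... per-row work goes here ...
  ret             ; 10cc; on the third pass this returns to the caller

; Control overhead, not counting the per-row work:
;   loop:         7 + 3*(17+10) + (13+13+8) + 10 = 132cc
;   fall-through: 2*17 + 3*10                    =  64cc
```

The trade-off is that the body has to sit directly after the second call, which is why the left and right variants each get their own copy of the entry code.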

Overall, the code that actually draws the char is about 141.25cc faster than your latest routine (and ~263.25cc faster than the original), and I didn't calculate the clock cycles saved from my changes to the load/coord/calculate stages. Your version is 6 bytes smaller than mine and a full 19 bytes smaller than the original, but currently I like the above version more.

(Side note: While I was typing all of this up, I saw your trick with using (de) to restore the byte, and by using that in my code, saved 3 bytes and 6cc, nice! Since that also frees up a variable, I'm hoping to find even more optimizations, so I'll edit this comment.)

EDIT: That ld a,(de) trick won't work with my code because my code doesn't re-apply the invert logic, so every other row of pixels would be inverted in invert mode. So I lost the three-byte savings, but I was able to save 18cc more.

@NonstickAtom785
Contributor Author

NonstickAtom785 commented Dec 7, 2020

I think these modifications save bytes and clock cycles. I can't test it atm, but I'm almost positive they do, unless there is something I'm missing. I moved some of the operations into the routine itself instead of the overhead loop, as I think they should do the same thing with fewer bytes.

PutFS:
; Read the font from Flash to RAM
; Need to add 3*A to the font-pointer
  ld hl,(FontPointer)
  ld b,0
  ld c,a
  add hl,bc
  add hl,bc
  adc hl,bc     ;add hl,bc won't set the right flags, so use adc
  ld a,(font_ptr_page)
  jp p,+_
  or a
  jr z,+_
  set 6,h
  res 7,h
  inc a
_:
  ld c,3
  ld de,$8005
  call readarc

; get the text position and update it
  ld hl,(textRow)
;  ld b,0 ;B is already 0 from the ReadArc routine
  ld a,h
  cp 24
  ld a,l
  jr c,+_
  ld h,b
  add a,6
_:
  cp 3Bh
  jr c,+_
  sub 3Ch
  jr nc,+_
  add a,6
_:
  ld l,a
; need to advance the x-coord by 1
  inc h
  ld (textRow),hl
  dec h
; Want A*12+H/2+(gbuf_temp); A < 64, so A*4 still fits in 8 bits
  add a,a
  add a,a
; A has been multiplied by 4, so now we want A*3+(gbuf_temp)+H/2
  ld c,a
  ld a,h
  ld hl,(gbuf_temp)
  add hl,bc
  add hl,bc
  add hl,bc
  ld c,a
  srl c
  add hl,bc
  rra
  ld e,5    ; Well now DE should point to the character.
  ld a,(de)
  jr nc,+_
  ld c,$0F
put_right:
  call put_right2
  call put_right2
put_right2:
  call shift_put_lr
  inc de
  jr put_lr
_:
  ld c,$F0
put_left:
  call put_left2
  call put_left2
put_left2:
  call put_lr
  inc de

shift_put_lr:
; Rotate the nibbles
  rrca
  rrca
  rrca
  rrca
put_lr:
; Check if it needs to be inverted
  bit InvertTextFlag,(iy+UserFlags)
  jr z,$+3
  cpl
; Mask the byte
  and c

; OR it to the screen
  or (hl)
  ld (hl),a

; advance the gbuf ptr
  ld a,l
  add a,12
  ld l,a
  ld a,(de)   ; Restore the byte
  ret nc
  inc h
  ret

Edit: I tested and it didn't work. So time to try again until I get it. I know the c optimization works because c isn't being used by anything else so it should stay the same no matter what.

Edit2: I removed the extra bytes and added my trick into the routine. I also removed a lot of unneeded code in the loop area that appeared twice, and moved it into the put_lr: routine so that both variants share it. It works quite well. I would still like to know how you are counting your clock cycles. I updated the code above to my current version.

Edit3:
You said something about your invert logic not reapplying. Do you mean that you save A and then use the inverted A again? If that's the case, the size-optimized method might be a tad slower, because it reapplies the invert logic on every put.
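To put a rough number on that, here is a sketch of what one invert check costs, using the standard Z80 timings. InvertTextFlag and UserFlags are the equates already used in the routines above, and the totals are my own arithmetic rather than measured figures.

```z80
; Invert check as used in put_lr (7 bytes)
  bit InvertTextFlag,(iy+UserFlags)   ; 20cc, 4 bytes
  jr z,$+3                            ; 12cc if taken, 7cc if not, 2 bytes
  cpl                                 ;  4cc, 1 byte
; Normal text:   20 + 12    = 32cc per check
; Inverted text: 20 + 7 + 4 = 31cc per check
; Once per font byte  (3 checks per char): about 93 to 96cc
; Once per nibble put (6 checks per char): about 186 to 192cc
```

So checking on every put instead of once per font byte costs roughly 93 to 96cc more per character, in exchange for not carrying a second copy of those 7 bytes.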

@NonstickAtom785
Contributor Author

Here are the proofs:
Test1
Test2

NonstickAtom785 changed the title from "Remove push/pop af in text.z80" to "Optimize PutFS in text.z80" on Dec 7, 2020
I just prettied it up a bit.
@NonstickAtom785
Contributor Author

Your optimization is almost too superior but that's okay. It is fast! That's my final push. I think.
