
Arec performance improvements #154

Closed

Conversation

Philonous
Contributor

  • Use SmallArray# instead of Array to avoid intermediate lists when constructing ARecs
    • Also saves a few words per record since we don't need Array's bookkeeping
  • Add ARec pseudo-constructors &:, arnil and arec to efficiently construct ARecs without going through Recs
  • Turn ElField into a newtype to improve accessor performance (both for Rec and ARec)
    • Also saves one word per field because we don't need the constructor
    • Also saves space for the KnownSymbol dictionary that wasn't used anywhere
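For readers skimming the diff, here is a minimal standalone sketch of the data-to-newtype change for ElField. It is simplified for illustration: the Snd family and getField are written out here and differ in detail from vinyl's actual definitions.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeFamilies #-}
module Main where

import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- Before (vinyl 0.13): a GADT whose Field constructor captured a
-- KnownSymbol dictionary and cost one constructor word per field:
--
--   data ElField (field :: (Symbol, Type)) where
--     Field :: KnownSymbol s => !t -> ElField '(s, t)

-- After: a newtype, so a field is representationally just its payload;
-- the label exists only at the type level.
type family Snd (p :: (Symbol, Type)) :: Type where
  Snd '(s, t) = t

newtype ElField (field :: (Symbol, Type)) = Field (Snd field)

-- Accessing the payload no longer pattern-matches a boxed constructor.
getField :: ElField '(s, t) -> t
getField (Field x) = x

main :: IO ()
main = print (getField (Field (4 :: Int) :: ElField '("age", Int)))
```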

See #150 (comment) for benchmark comparisons

Example usage of the ARec pseudo-constructors:

user :: ARec ElField '[ "name"   ::: String
                      , "age"    ::: Int
                      , "active" ::: Bool]
user = arec (  #name   =: "Peter"
            &: #age    =: 4
            &: #active =: True
            &: arnil
            )

Related to #150 and #153

@Philonous Philonous changed the title Arec performance refactored Arec performance improvements May 19, 2021
@Philonous Philonous force-pushed the arec-performance-refactored branch from 230a745 to ffdb558 Compare May 19, 2021 21:41
@Philonous
Contributor Author

Using the new &: and arnil constructors looks a lot like the old toARec method; compare with

user = toARec ( #name   =: "Peter"
             :& #age    =: 4
             :& #active =: True
             :& RNil
             )

The difference is that it doesn't rely on inlining to get rid of the intermediate data structure, which might speed up compilation and work out better in -O0 situations, though I haven't tested it. I'm not sure if it's worth the additional code.

@Philonous Philonous changed the title Arec performance improvements DRAFT: Arec performance improvements May 22, 2021
@acowley
Contributor

acowley commented May 25, 2021

I've been thinking about this, and I'm leaning towards just keeping the old toArec variant. My main concern is the potential confusion between &: and :&. I think I did something similar to this (kind of cute variant of an operator) in another part of the API and quickly came to regret it as it was just too confusing (not in a deep way, of course, more of a USB plug kind of thing where you just have to try it both ways).

@Philonous
Contributor Author

I would agree, but I figured out that the same interface can be used for more things, e.g.

  • A combinator that copies one arec into another, so you can easily add fields to an existing arec (either the beginning or the end)
  • A combinator that copies a slice of an arec, so you can keep, say, the first 4 fields of an existing arec
  • Updating an arec. The existing update combinator copies the entire array every time, with no way to batch updates.
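The semantics of those proposed combinators could be modelled on plain lists roughly as follows. The names arecAppend, arecSlice, and arecBatchUpdate are hypothetical (mine, not part of this PR), and the list model only illustrates the intended behaviour, not the single-copy SmallArray# implementation.

```haskell
module Main where

import Data.Maybe (fromMaybe)

-- Append the fields of one record onto another.
arecAppend :: [a] -> [a] -> [a]
arecAppend = (++)

-- Keep a contiguous slice, e.g. the first 4 fields of an existing record.
arecSlice :: Int -> Int -> [a] -> [a]
arecSlice start len = take len . drop start

-- Apply several positional updates in one pass, unlike an update
-- combinator that copies the whole array once per field updated.
arecBatchUpdate :: [(Int, a)] -> [a] -> [a]
arecBatchUpdate us xs =
  [ fromMaybe x (lookup i us) | (i, x) <- zip [0 ..] xs ]

main :: IO ()
main = do
  putStrLn (arecSlice 0 4 "abcdefg")                    -- abcd
  putStrLn (arecBatchUpdate [(0, 'X'), (2, 'Z')] "abc") -- XbZ
```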

But I would probably get away from the "cute" operators.

@Philonous Philonous force-pushed the arec-performance-refactored branch from ffdb558 to e396e6a Compare May 26, 2021 20:56
@Philonous
Contributor Author

I have removed &:, arnil and arec from the public interface.

They still live in Data.Vinyl.ARec.Internal since I used them to implement toARec. This should be much less of a problem, since Internal modules don't need to be stable, so they can be renamed or removed at will. I have also renamed &: to arcons.

I have also added test cases for arecGetSubset and arecSetSubset since I changed the code in a way that's not obviously semantically identical.

@Philonous Philonous force-pushed the arec-performance-refactored branch 2 times, most recently from bae68f2 to b8f30bb Compare May 27, 2021 10:52
Philonous added 3 commits May 27, 2021 12:56
* Move from Array to SmallArray# to avoid intermediate list during construction
* Use class instead of recursive function in toARec for improved inlining
* Saves 1 word per field
* Improves accessor performance significantly for both Rec and ARec
@Philonous Philonous force-pushed the arec-performance-refactored branch from b8f30bb to 5aeff45 Compare May 27, 2021 10:56
@Philonous Philonous changed the title DRAFT: Arec performance improvements Arec performance improvements May 27, 2021
@@ -23,9 +24,26 @@
-- element to the head of a list, but element access is linear time;
-- array access time is uniform, but extending the array is more
-- slower.

This blank line needs to start with -- or Haddock cannot process the file.

@acowley
Contributor

acowley commented May 28, 2021

The comments you added are invaluable; I really appreciate that thoroughness!

The Frames test suite built against this new version of vinyl passed, but I'm hitting a "Simplifier ticks exhausted" panic when building the benchdemo executable.

The change in this PR that I'm most wary of is changing ElField to a newtype, which I'm guessing is related. I haven't really dug into it yet, and I'm interested to hear if you have some thoughts on what's going wrong.

ETA: Compiling that executable also uses a huge amount of RAM.

@Philonous
Contributor Author

I can reproduce the compilation failure, and reverting 5e38074 fixes it.

How vexing; I would never have guessed that a newtype might lead to that problem.

@Philonous
Contributor Author

The problem might be some over-zealous inlining in Frames, though.

I ran

sed -i -e 's/INLINE/INLINEABLE/g' **/*.hs

in the Frames directory, and it solved the problem.

@Philonous
Contributor Author

Philonous commented May 30, 2021

In any case, the problem seems to be that turning the data into a newtype simplifies strictness analysis, giving the inliner more opportunities to fire, and boy does it ever fire.

I manually copied the TH-splice back into the source file, and commented out a bunch of column definitions, and the result is that reducing the number of column definitions drastically improves compilation times (almost exponentially).

Btw, that's part of the reason that I am interested in ARecs, since they don't rely on unrolling and inlining so heavily.

@acowley
Contributor

acowley commented May 30, 2021

That makes a lot of sense! I need to check that the INLINEABLE change doesn’t affect performance much.

Is there a way of determining steps we could take to ease the tradeoffs between compile- and run-times? I never worked out a way of profiling compile times, so I’m always flying a bit blind trying to help that. But I figured that — to some extent — inlining happens when something is used, so hopefully we’re not spending time optimizing something the user isn’t even using.

@Philonous
Contributor Author

Philonous commented May 30, 2021

OK, I bisected the INLINE pragmas in Frames until I found this single one which fixes the compilation issue for BenchDemo.hs:
https://github.com/acowley/Frames/blob/master/src/Frames/CSV.hs#L247

It doesn't seem to make a huge difference for run time:
before:

0.23user 0.00system 0:00.24elapsed 98%CPU

after:

0.24user 0.01system 0:00.25elapsed 100%CPU

That's well within run-to-run variance.

@acowley
Contributor

acowley commented May 30, 2021

This is really interesting. I believe the thinking was that we wanted the whole chain of loading functions to be specialized to the right type. But perhaps it’s mostly okay if various layers are generic.

Many, perhaps most, uses of Frames are some kind of fold. We want the inner part of the fold to inline away into vector accesses of particular columns so there is no manifest Rec. Getting a better handle on where an opaque generic blocks that would be ideal.

For now, we should probably just remove that INLINE and keep moving forward. I’d really like to have a compilation benchmark harness so we can see the effects of changes on compile and run times. I’m not sure if there’s a way we could instruct cabal to help with that.

@Philonous
Contributor Author

Just to be a little more precise, I put the benchdemo main in a criterion harness and ran the different variants:
Gist

newtype / INLINEABLE:    194.3 ms   (192.0 ms .. 197.0 ms)
data    / INLINEABLE:    193.4 ms   (190.4 ms .. 198.1 ms)
data    / INLINE:        199.2 ms   (197.8 ms .. 199.8 ms)
data    / all NOINLINE:  199.1 ms   (197.3 ms .. 201.6 ms)
  • newtype / data are the vinyl variants
  • I tried both INLINE and INLINEABLE on readTableMaybeOpt
  • I replaced every occurrence of INLINE with NOINLINE in the Frames code and ran the benchmark (last line)

It seems that the INLINE pragmas make no discernible difference in this particular case.

@Philonous
Contributor Author

Interestingly, comparing the core output from the NOINLINE version to the INLINE version of Benchdemo, I see this:

data + NOINLINE:

$ grep -e ':&' -o BenchDemo.noinline.dump-simpl | wc -l
184

data + INLINE

$ grep -e ':&' -o BenchDemo.dump-simpl | wc -l
390

so it doesn't seem like the INLINE fused away the Rec in the first place, which would also explain why disabling inlining doesn't hurt performance.

@acowley
Contributor

acowley commented May 31, 2021

Okay, so one thing is to take out those INLINE pragmas in Frames. But of course then one has to wonder if we can do better. No difference between newtype and data is also a bit surprising given the vinyl benchmarks.

Is it worth making the newtype change now? My concern is that it can affect user code, and I’m not confident how it will benefit performance. It seems like the right thing to do if we can get away with it given what you found in vinyl, but I’d love to see it borne out in frames.

@Philonous
Contributor Author

Philonous commented Jun 1, 2021

Back-of-the-envelope calculation: moving from data to newtype saved ~9 ns for 1 row of 15 fields [1].
If we assume that everything scales linearly, then in the BenchDemo, which handles 36,634 rows of 18 columns, we'd expect to save

9 ns / 15 fields / 1 row * 18 fields * 36,634 rows ≈ 395,647 ns ≈ 0.4 ms

That's below the run-to-run variance, so no wonder we don't see anything.

[1]: #150 (comment)
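That estimate can be reproduced directly (the name savedMs is mine, just to make the arithmetic runnable):

```haskell
module Main where

-- Scale the ~9 ns/row saving measured for 15 fields up to the
-- BenchDemo workload of 18 fields over 36,634 rows, in milliseconds.
savedMs :: Double
savedMs = 9 / 15 * 18 * 36634 / 1e6

main :: IO ()
main = print savedMs  -- roughly 0.4 ms, below run-to-run variance
```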

@acowley
Contributor

acowley commented Jun 3, 2021

For benchdemo, I tried building and running it with the existing Vinyl and Frames, then Frames with the INLINE pragmas removed from the CSV module, and then with the ElField newtype (i.e. vinyl-0.14). Run times were all the same, as you've reported, but compile times were affected:

variant                    Compile Time (s)
vinyl-0.13.3 with INLINE    8.8
vinyl-0.13.3 no INLINE      8.0
vinyl-0.14.0 no INLINE     10.8

You've explained how the newtype offers the optimizer more to work with, so this isn't entirely unexpected. But is there anything we can do? With the fix you identified of removing the bad INLINE pragmas in Frames, we're looking at a noticeable increase in compile times without improving run time. Likely we still have too many INLINE pragmas in various places.

As I understand it, the main performance win with the newtype is shown in the ARec sums benchmark, but that's also a benchmark where plain Haskell records are significantly faster, and SRec is a better Vinyl choice if it works for the use case. The newtype change is a great win relative to the current implementation, so I'm inclined to go with it, but it's disappointing that the particular benchmark that motivates this change is also suggestive of things not working out so well. I've played with it some, but haven't had any successes improving that benchmark for ARec.

Compare to the storable benchmark in Vinyl that shows zero overhead from using Vinyl vs plain records. Admittedly, the storable benchmark is a very specific technique that I happened to make heavy use of, and not representative of all uses.

So, end of the day, I'm happy to go ahead with this PR if you feel like it's ready as there are a lot of very positive changes. I think there's just that haddock issue lingering, but I can fix that in a followup commit if you'd prefer.

acowley added a commit to acowley/Frames that referenced this pull request Jun 3, 2021
These were increasing compile times without improving run times. They
also caused the GHC simplifier to blow up with an upcoming change to
Vinyl to [change](VinylRecords/Vinyl#154) the
`ElField` type from a GADT to a `newtype`.
@Philonous
Contributor Author

I've looked at -dump-simpl-stats for both the newtype and the data version. (Working on this project was the first time I had to deal with GHC debug output, so I'm not sure I'm reading it right.) Most of it looks like internally generated names that mean nothing to me, but one interesting difference was that readRec got inlined much more often in the newtype version. So I tried adding {-# NOINLINE #-} pragmas around it, which brought compilation time down from 13 to 9 seconds on my machine, but it also worsened the benchmark from 193 ms to 216 ms.

It does however suggest that removing the right inline pragma(s) could make a significant difference in compilation times.

Regarding SRec: They are indeed as fast as normal Haskell records, but aren't they better compared to UNPACKed Haskell records? As far as I remember these were faster still. But it's still impressive performance.

@acowley
Contributor

acowley commented Jun 21, 2021

I've gone ahead and pulled your commits in. I spent some more time poking at this, but didn't have any new ideas. Given that, I'd rather we move forward than let this great contribution wither on the vine.

Thank you for the code and the discussions!

@acowley acowley closed this Jun 21, 2021