Add matmul with float16 #39
Conversation
gpu.h (Outdated)
 */
template<class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
  throw std::runtime_error("Unsupported number type for reateTensor(Context &ctx, const Shape &shape, precision *data) ");
Typo: `reateTensor` in the error message should be `createTensor`.
numeric_types/half.h (Outdated)
};

inline half operator + (const half& a, const half& b) {
Although the expectation is that host code is ahead-of-time and not performance critical, if we're going to include these at all and it's not too complex to implement something reasonably performant, we should consider that. We're already citing Mike Acton's implementation for the conversions, and there are at least some reference implementations here: https://github.com/macton/ps3-archive/blob/master/src/half-precision/Half.v2/int_insn.h
Alternatively, we could leave this out of scope unless we have a strong case for it being valuable. It could be a reasonable expectation for users to bring their own computations on the host side - they could use f32 or their own choice of library or implementation, and limit gpu.cpp to casting / data movement.
Maybe there's a rationale that's not obvious, but it seems like having operator implementations that go through casts internally could be worse than either option? If f32 conversions are happening within the operations anyway, wouldn't it be better to use f32 and cast at the end?
`randn` needs these operators to generate values of half type.
But I can cast float32 values to half after running `randn` instead.
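For reference, a cast-based host-side operator of the kind discussed above would look roughly like the sketch below. The minimal `half` struct and the `halfToFloat`/`floatToHalf` helpers are stand-ins for the definitions in numeric_types/half.h; the names here are assumptions, not the PR's exact API.

```cpp
#include <cstdint>

// Stand-ins for the half type and conversion helpers in numeric_types/half.h;
// names and layout are assumptions for illustration only.
struct half { uint16_t data; };
float halfToFloat(half h); // assumed conversion helper
half floatToHalf(float f); // assumed conversion helper

// Each operation round-trips through f32, so a chain of half ops pays the
// conversion cost repeatedly; doing the whole computation in f32 and casting
// once at the end avoids that.
inline half operator+(const half &a, const half &b) {
  return floatToHalf(halfToFloat(a) + halfToFloat(b));
}
```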
utils/array_utils.h (Outdated)
@@ -285,7 +306,8 @@ inline void flip(float *a, size_t R, size_t C, bool horizontal = true) {
 * @param tol The tolerance for closeness.
 * @return bool True if the arrays are close, false otherwise.
 */
bool isclose(float *a, float *b, size_t n, float tol = 1e-3) {
template<class precision=float>
I think `isclose` can be kept as two concrete overloads (one for half, one for f32) - that keeps this code easier to quickly scan without having to think about which template overload is going to get triggered. If in the future we have more precision types and this trivial overloading starts to feel too repetitive, we can re-evaluate then.
But optimizing for easy-to-scan, easy-to-reason-about code is valuable enough to keep simple overloads as the default, unless there's something I'm missing.
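As a concrete illustration of the two-overload shape (a sketch, not the final code; it assumes the `half` type from numeric_types/half.h and a `halfToFloat` conversion helper, whose name is an assumption):

```cpp
#include <cmath>
#include <cstddef>

// f32 overload: same signature as the existing implementation.
bool isclose(float *a, float *b, size_t n, float tol = 1e-3) {
  for (size_t i = 0; i < n; ++i) {
    if (std::abs(a[i] - b[i]) > tol) {
      return false;
    }
  }
  return true;
}

// half overload: compare in f32 after conversion, with a looser default
// tolerance to account for the reduced precision.
bool isclose(half *a, half *b, size_t n, float tol = 1e-2) {
  for (size_t i = 0; i < n; ++i) {
    if (std::abs(halfToFloat(a[i]) - halfToFloat(b[i])) > tol) {
      return false;
    }
  }
  return true;
}
```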
utils/array_utils.h (Outdated)
void randn(float *a, size_t N, std::mt19937 &gen, float mean = 0.0,
           float std = 1.0) {
  std::normal_distribution<float> dist(mean, std);
template<class precision=float>
I think instead of committing to implement variants of every function here (`randn`, `transpose`, etc.) for each precision type, let's start with a clean f32 implementation, with the expectation that most tasks can be handled by deferring the casting until after any AOT host operations are done.
`randn` and `transpose` are already written with the expectation that they're only used for testing or non-critical AOT data preparation. AFAICT implementing support for half here doesn't add much, either in performance or in what's possible, over deferring the half cast until other CPU preparation/testing code is done in f32.
If I'm missing something let me know.
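A sketch of the deferred-cast pattern being suggested: keep `randn` f32-only and convert once after all host-side preparation. `randnHalf` and `floatToHalf` are illustrative names, not part of the existing API.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Generate in f32 with the existing randn from utils/array_utils.h, then cast
// to half in one pass at the end. randnHalf and floatToHalf are assumed names.
void randnHalf(half *out, size_t n, std::mt19937 &gen, float mean = 0.0,
               float std = 1.0) {
  std::vector<float> tmp(n);
  randn(tmp.data(), n, gen, mean, std); // existing f32 implementation
  for (size_t i = 0; i < n; ++i) {
    out[i] = floatToHalf(tmp[i]);       // single cast after f32 work is done
  }
}
```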
gpu.h (Outdated)
  return createTensor(ctx, shape, kf16 ,data);
}

template<class precision>
See the above comments on whether we can get rid of `getPrecision` entirely.
gpu.h (Outdated)
 * @endcode
 */
template<class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
As with the other comments, let's keep `createTensor` at the value level unless there's something for which we really need the templated variants.
    kernel = createKernel(ctx, matmul, bindings,
                          /*nWorkgroups*/ nWorkgroups);
  }
  return kernel;
}

template<class precision=float>
After learning my lesson from the type-templated early iterations of gemma.cpp, my bias here is to implement the core implementations with precision types at the value level by default, and only reach for templates when there's a strong case.
Templated overloads are okay as more typesafe convenience API sugar - it's easy to wrap a value-level core implementation in a more strongly templated wrapper (a sketch of that layering follows the suggestions below). It's a bigger refactoring to go in the other direction and unwind a core implementation into value level.
My suggestion would be, unless there's some value I'm missing:
- Either have `runTest` take a precision argument or have a separate `runTest` (e.g. `runtest16`) for alternative precision values.
- Avoid having `getPrecision` to pull out values from types; remove them entirely unless there's something we need them for.
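One way to read that in code (a sketch, not the final API): the value-level `createTensor(ctx, shape, kf16, data)` from this PR stays the core, and any templated variant is only a thin wrapper over it. The dispatch below is illustrative and assumes the existing `kf32`/`kf16` tags and the declarations in gpu.h.

```cpp
#include <type_traits>

// Thin, optional sugar over the value-level core; illustrative only.
template <class precision>
Tensor createTensor(Context &ctx, const Shape &shape, precision *data) {
  static_assert(std::is_same_v<precision, float> ||
                    std::is_same_v<precision, half>,
                "unsupported precision type");
  constexpr auto dtype = std::is_same_v<precision, float> ? kf32 : kf16;
  return createTensor(ctx, shape, dtype, data); // value-level core does the work
}
```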
Thank you! This is a massive performance boost. Most of my comments are really the same idea (favor value-level, especially for the core implementations, unless there's a strong rationale for templates), which I don't think should be that big a change, but let me know if there's a rationale I'm missing. Minor question - are the M2 numbers for Pallas and the PyTorch numbers @ghostplant mentions in #40 (comment) for 2-byte or 4-byte floating point representations? Look forward to trying this out on my M1 when it lands, nice work!
BTW I've been meaning to write some contributor notes but TLDR is something like ~
@austinvhuang Thank you for your review!
LGTM, thanks for the revisions!
Thank you, too.
This PR implements matrix multiplication with float16.
The Mac's memory bandwidth is not enough to reach the GPU's peak floating-point performance, so to improve performance we need to reduce memory traffic.
#40 (comment)
The results on the M2 Pro are as follows: