Use compiler intrinsics for bit ops in hash


Use compiler intrinsics for bit ops in hash

David Fetter
Folks,

The recent patch for distinct windowing aggregates contained a partial
fix of the FIXME that didn't seem entirely right, so I extracted that
part, changed it to use compiler intrinsics, and submit it here.
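
For context, the pg_bitutils.h helpers this patch leans on boil down to
the compiler's count-leading-zeros intrinsic where one is available. A
sketch of the pattern only (pg_bitutils.h itself also carries a
lookup-table fallback):

    #include <stdint.h>

    /* What pg_leftmost_one_pos32 amounts to with __builtin_clz:
     * the position of the most significant set bit, counting from 0.
     * __builtin_clz(0) is undefined, so callers must pass word > 0. */
    static inline int
    leftmost_one_pos32(uint32_t word)
    {
        return 31 - __builtin_clz(word);
    }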

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment: v1-0001-Use-compiler-intrinsics-for-bit-ops-in-hash.patch (8K)

Re: Use compiler intrinsics for bit ops in hash

Jesse Zhang
Hi David,

On Tue, Jan 14, 2020 at 9:36 AM David Fetter <[hidden email]> wrote:
>
> Folks,
>
> The recent patch for distinct windowing aggregates contained a partial
> fix of the FIXME that didn't seem entirely right, so I extracted that
> part, changed it to use compiler intrinsics, and submit it here.

The changes in hash AM and SIMPLEHASH do look like a net positive
improvement. My biggest cringe might be in pg_bitutils:

> diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
> index 498e532308..cc9338da25 100644
> --- a/src/include/port/pg_bitutils.h
> +++ b/src/include/port/pg_bitutils.h
> @@ -145,4 +145,32 @@ pg_rotate_right32(uint32 word, int n)
>  return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
>  }
>
> +/* ceil(lg2(num)) */
> +static inline uint32
> +ceil_log2_32(uint32 num)
> +{
> + return pg_leftmost_one_pos32(num-1) + 1;
> +}
> +
> +static inline uint64
> +ceil_log2_64(uint64 num)
> +{
> + return pg_leftmost_one_pos64(num-1) + 1;
> +}
> +
> +/* calculate first power of 2 >= num
> + * per https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
> + * using BSR where available */
> +static inline uint32
> +next_power_of_2_32(uint32 num)
> +{
> + return ((uint32) 1) << (pg_leftmost_one_pos32(num-1) + 1);
> +}
> +
> +static inline uint64
> +next_power_of_2_64(uint64 num)
> +{
> + return ((uint64) 1) << (pg_leftmost_one_pos64(num-1) + 1);
> +}
> +
>  #endif /* PG_BITUTILS_H */
>

1. Is ceil_log2_64 dead code?

2. The new utilities added here (ceil_log2_32 and company,
next_power_of_2_32 and company) all require num > 1, but don't clearly
Assert (or at the very least document) so.

3. A couple of the callers can actively pass in an argument of 1, e.g.
from _hash_spareindex in hashutil.c, while some other callers are iffy
at best (simplehash.h maybe?).

4. It seems like you *really* would like an operation like LZCNT in x86
(first appearing in Haswell) that is well defined on zero input. ISTM
the alternatives are:

   a) Special-case 1. That seems straightforward, but the cost of
   branching on a seemingly unlikely condition could be a significant
   performance loss.

   b) Use an architecture-specific intrinsic (possibly with CPUID
   shenanigans) like __builtin_ia32_lzcnt_u64 on x86, and use the CLZ
   intrinsic elsewhere. The CLZ GCC intrinsic seems to map to
   instructions that are well defined on zero in most ISAs other than
   x86, so maybe we can get away with special-casing x86?
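
For concreteness, (a) might look something like this (just a sketch
against the helpers quoted above, not code from the patch):

    /* Option (a): branch so pg_leftmost_one_pos32 never sees zero */
    static inline uint32
    ceil_log2_32(uint32 num)
    {
        if (num <= 1)
            return 0;           /* ceil(lg2(1)) == 0 */
        return pg_leftmost_one_pos32(num - 1) + 1;
    }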

Cheers,
Jesse



Re: Use compiler intrinsics for bit ops in hash

David Fetter
On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote:

> Hi David,
>
> On Tue, Jan 14, 2020 at 9:36 AM David Fetter <[hidden email]> wrote:
> >
> > Folks,
> >
> > The recent patch for distinct windowing aggregates contained a partial
> > fix of the FIXME that didn't seem entirely right, so I extracted that
> > part, changed it to use compiler intrinsics, and submit it here.
>
> The changes in hash AM and SIMPLEHASH do look like a net positive
> improvement. My biggest cringe might be in pg_bitutils:

Thanks for looking at this!

> > diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
> > index 498e532308..cc9338da25 100644
> > --- a/src/include/port/pg_bitutils.h
> > +++ b/src/include/port/pg_bitutils.h
> > @@ -145,4 +145,32 @@ pg_rotate_right32(uint32 word, int n)
> >  return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
> >  }
> >
> > +/* ceil(lg2(num)) */
> > +static inline uint32
> > +ceil_log2_32(uint32 num)
> > +{
> > + return pg_leftmost_one_pos32(num-1) + 1;
> > +}
> > +
> > +static inline uint64
> > +ceil_log2_64(uint64 num)
> > +{
> > + return pg_leftmost_one_pos64(num-1) + 1;
> > +}
> > +
> > +/* calculate first power of 2 >= num
> > + * per https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
> > + * using BSR where available */
> > +static inline uint32
> > +next_power_of_2_32(uint32 num)
> > +{
> > + return ((uint32) 1) << (pg_leftmost_one_pos32(num-1) + 1);
> > +}
> > +
> > +static inline uint64
> > +next_power_of_2_64(uint64 num)
> > +{
> > + return ((uint64) 1) << (pg_leftmost_one_pos64(num-1) + 1);
> > +}
> > +
> >  #endif /* PG_BITUTILS_H */
> >
>
> 1. Is ceil_log2_64 dead code?

Let's call it nascent code. I suspect there are places it could go, if
I look for them.  Also, it seemed silly to have one without the other.

> 2. The new utilities added here (ceil_log2_32 and company,
> next_power_of_2_32 and company) all require num > 1, but don't clearly
> Assert (or at the very least document) so.

Assert()ed.
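
For instance, roughly like this (the exact form is in the attached v2):

    static inline uint32
    ceil_log2_32(uint32 num)
    {
        /* pg_leftmost_one_pos32 is undefined on zero input */
        Assert(num > 1);
        return pg_leftmost_one_pos32(num - 1) + 1;
    }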

> 3. A couple of the callers can actively pass in an argument of 1, e.g.
> from _hash_spareindex in hashutil.c, while some other callers are iffy
> at best (simplehash.h maybe?).

What would you recommend be done about this?

> 4. It seems like you *really* would like an operation like LZCNT in x86
> (first appearing in Haswell) that is well defined on zero input. ISTM
> the alternatives are:
>
>    a) Special-case 1. That seems straightforward, but the cost of
>    branching on a seemingly unlikely condition could be a significant
>    performance loss.
>
>    b) Use an architecture-specific intrinsic (possibly with CPUID
>    shenanigans) like __builtin_ia32_lzcnt_u64 on x86, and use the CLZ
>    intrinsic elsewhere. The CLZ GCC intrinsic seems to map to
>    instructions that are well defined on zero in most ISAs other than
>    x86, so maybe we can get away with special-casing x86?

b) seems much more attractive. Is there some way to tilt the tools so
that this happens? What should I be reading up on?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment: v2-0001-Use-compiler-intrinsics-for-bit-ops-in-hash.patch (8K)

Re: Use compiler intrinsics for bit ops in hash

Jesse Zhang
On Tue, Jan 14, 2020 at 2:09 PM David Fetter <[hidden email]> wrote:
> > The changes in hash AM and SIMPLEHASH do look like a net positive
> > improvement. My biggest cringe might be in pg_bitutils:
> >
> > 1. Is ceil_log2_64 dead code?
>
> Let's call it nascent code. I suspect there are places it could go, if
> I look for them.  Also, it seemed silly to have one without the other.
>

While not absolutely required, I'd like us to find at least one
place and start using it. (Clang also nags at me when we have
unused functions).

> On Tue, Jan 14, 2020 at 12:21:41PM -0800, Jesse Zhang wrote:
> > 4. It seems like you *really* would like an operation like LZCNT in x86
> > (first appearing in Haswell) that is well defined on zero input. ISTM
> > the alternatives are:
> >
> >    a) Special-case 1. That seems straightforward, but the cost of
> >    branching on a seemingly unlikely condition could be a significant
> >    performance loss.
> >
> >    b) Use an architecture-specific intrinsic (possibly with CPUID
> >    shenanigans) like __builtin_ia32_lzcnt_u64 on x86, and use the CLZ
> >    intrinsic elsewhere. The CLZ GCC intrinsic seems to map to
> >    instructions that are well defined on zero in most ISAs other than
> >    x86, so maybe we can get away with special-casing x86?

i. We can detect the LZCNT instruction by checking one of the
"extended feature" (EAX=0x80000001) bits using CPUID. Unlike the
"basic features" (EAX=1), extended feature flags have been more
vendor-specific, but fortunately it seems that the feature bit
for LZCNT is the same across vendors [1][2].

ii. We'll most likely still need to provide a fallback
implementation for processors that don't have LZCNT (either
because they are from a different vendor or are older Intel/AMD
processors). I wonder if simply checking for 1 is "good enough".
Maybe a microbenchmark is in order?
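
For (i), the check might look roughly like this (a sketch using
GCC/Clang's <cpuid.h>; the function name is invented):

    #include <cpuid.h>
    #include <stdbool.h>

    /* LZCNT support: CPUID leaf 0x80000001, ECX bit 5 (the ABM/LZCNT
     * bit, which Intel and AMD agree on [1][2]) */
    static bool
    lzcnt_available(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
            return false;       /* extended leaf not supported */
        return (ecx & (1U << 5)) != 0;
    }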

> Is there some way to tilt the tools so that this happens?

We have a couple options here:

1. Use a separate object (a la our SSE 4.2 implementation of
CRC). On Clang and GCC (I don't have MSVC at hand), -mabm or
-mlzcnt should cause __builtin_clz to generate the LZCNT
instruction, which is well defined on zero input. The default
configuration would translate __builtin_clz to code that
subtracts BSR from the width of the input, but BSR leaves the
destination undefined on zero input.

2. (My least favorite) use inline asm (a la our popcount
implementation).

> b) seems much more attractive. Is there some way to tilt the tools so
> that this happens? What should I be reading up on?

The enclosed references hopefully are good places to start. Let
me know if you have more ideas.

Cheers,
Jesse


References:

[1] "How to detect New Instruction support in the 4th generation Intel®
Core™ processor family"
https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family
[2] "Bit Manipulation Instruction Sets"
https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets



Re: Use compiler intrinsics for bit ops in hash

John Naylor
On Wed, Jan 15, 2020 at 6:09 AM David Fetter <[hidden email]> wrote:
> [v2 patch]

Hi David,

I have a stylistic comment on this snippet:

- for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
- {
- if ((1 << i) <= metap->hashm_bsize)
- break;
- }
+ i =  pg_leftmost_one_pos32(metap->hashm_bsize);
  Assert(i > 0);
  metap->hashm_bmsize = 1 << i;
  metap->hashm_bmshift = i + BYTE_TO_BIT;

Naming the variable "i" made sense when it was a loop counter, but it
seems out of place now. Same with the Assert.

Also, this

+ * using BSR where available */

is not directly tied to anything in this function, or even in the
function it calls, and could get out of date easily.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Use compiler intrinsics for bit ops in hash

David Fetter
On Sat, Jan 18, 2020 at 11:46:24AM +0800, John Naylor wrote:

> On Wed, Jan 15, 2020 at 6:09 AM David Fetter <[hidden email]> wrote:
> > [v2 patch]
>
> Hi David,
>
> I have a stylistic comment on this snippet:
>
> - for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
> - {
> - if ((1 << i) <= metap->hashm_bsize)
> - break;
> - }
> + i =  pg_leftmost_one_pos32(metap->hashm_bsize);
>   Assert(i > 0);
>   metap->hashm_bmsize = 1 << i;
>   metap->hashm_bmshift = i + BYTE_TO_BIT;
>
> Naming the variable "i" made sense when it was a loop counter, but it
> seems out of place now. Same with the Assert.

Fixed by removing the variable entirely.
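
That stretch now reads roughly as follows (see the attached v3 for the
exact form):

    metap->hashm_bmsize = 1 << pg_leftmost_one_pos32(metap->hashm_bsize);
    metap->hashm_bmshift =
        pg_leftmost_one_pos32(metap->hashm_bsize) + BYTE_TO_BIT;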

> Also, this
>
> + * using BSR where available */
>
> is not directly tied to anything in this function, or even in the
> function it calls, and could get out of date easily.

Removed.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment: v3-0001-Use-compiler-intrinsics-for-bit-ops-in-hash.patch (8K)