Neither GCC nor Clang define an __arm64__ preprocessor macro, but use
__aarch64__ (MSVC uses _MARM_64). Add a "64" suffix to the define, i.e.
NETGEN_ARCH_ARM64 to make it more obvious in only refers to aarch64, and
to be in line with NETGEN_ARCH_AMD64.
Replace the (Clang specific) __builtin_readcyclecounter with inline
asm:
- The function return cycles (i.e. varies with CPU frequency), not time
- It may return 0, depending on the PMU settings
- It may cause an illegal instruction, in case it is not trapped by the
kernel, e.g. on FreeBSD.
Reading the generic timer/counter CNTVCT_EL0 instead of PMCCNTR_EL0 avoids
these pitfalls. The inline asm works on GCC and Clang, instead of
Clang only for the builtin.