I suspect a lot of the complication here is ultimately due to what the hardware allows, but it’s unfortunate that everything is so underdocumented (or requires reverse engineering).
An amazing performance tool is the ‘Processor Trace’ instrument. Unfortunately it’s M4-only, I think due to hardware capability. It gives a detailed timeline of which functions were called, in which order, and how long each took. This can reveal all sorts of things that statistical profilers cannot: sometimes a call is much faster than other times; sometimes you do something slow and stupid between two steps when you would rather they run together; sometimes you are calling the same function again and again instead of caching its result. Unlike older methods, this is low overhead and doesn’t require recompilation or knowing which functions are ‘interesting’. It is similar to Intel Processor Trace and, I think, based on Arm CoreSight. Both Intel PT and CoreSight tracing are supported by perf on Linux (I’m a bit unclear on exactly which features ‘CoreSight’ corresponds to); I’m not sure about Asahi on Apple silicon.
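To make the "sometimes a call is much faster than other times" point concrete, here is a rough sketch of what an exact call timeline lets you compute and a sampling profiler doesn't. The `(timestamp_ns, kind, function)` event tuples are a made-up format for illustration, not what the instrument actually exports, and the sketch assumes enter/exit events are properly nested.

```python
from collections import defaultdict

def call_durations(events):
    """Group per-call durations by function name.

    `events` is a hypothetical trace: (timestamp_ns, kind, function) tuples,
    where kind is "enter" or "exit" and calls are properly nested.
    """
    stack = []                      # currently open calls: (function, start time)
    durations = defaultdict(list)   # function -> duration of each individual call
    for ts, kind, func in events:
        if kind == "enter":
            stack.append((func, ts))
        else:
            name, start = stack.pop()
            durations[name].append(ts - start)
    return dict(durations)

# Invented example trace: parse() is called twice with very different costs.
trace = [
    (0, "enter", "main"),
    (10, "enter", "parse"),
    (15, "exit", "parse"),
    (20, "enter", "parse"),
    (90, "exit", "parse"),
    (100, "exit", "main"),
]
per_call = call_durations(trace)
```

A statistical profiler would only tell you that `parse` accounts for most of the samples; the per-call list shows that one call was quick and the other was not, and that the same function was called twice in a row.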
> I suspect a lot of the complication here is ultimately due to what the hardware allows
Most likely, yes. Reading between the lines a bit, there are ten hardware counters: two are wired to fixed triggers, and each of the other eight has a multiplexer that selects from its own, counter-specific set of triggers. This is a fairly typical hardware design - limiting the number of counters and the number of triggers per counter makes for a simpler hardware implementation. (The increased software complexity is an acceptable tradeoff for a feature which is only used by some developers.)
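As a sketch of the software complexity this pushes onto tools: picking which counter measures which event becomes a small bipartite matching problem. The counter numbering, event names and per-counter event sets below are all invented for illustration (the real mapping is exactly the underdocumented part); only the two-fixed-plus-eight-multiplexed shape comes from the description above. It also assumes each requested event appears once.

```python
# Hypothetical layout: counters 0-1 are fixed, counters 2-9 are configurable.
# Event names and which counter can see which event are invented.
FIXED = {0: "cycles", 1: "instructions"}
CONFIGURABLE = {
    2: {"l1d_miss", "branch_mispredict"},
    3: {"l1d_miss"},
    4: {"l2_miss", "tlb_miss"},
    5: {"branch_mispredict", "tlb_miss"},
    6: {"l2_miss"},
    7: {"store_retired", "load_retired"},
    8: {"load_retired"},
    9: {"store_retired", "l1d_miss"},
}

def assign_counters(requested, fixed=FIXED, configurable=CONFIGURABLE):
    """Return {event: counter}, or None if the events cannot all be counted at once."""
    fixed_by_event = {event: ctr for ctr, event in fixed.items()}
    assignment = {}
    pending = []
    for event in requested:
        if event in fixed_by_event:
            assignment[event] = fixed_by_event[event]
        else:
            pending.append(event)

    owner = {}  # configurable counter -> event currently placed on it

    def try_place(event, seen):
        # Augmenting-path search: take a free counter, or evict the current
        # occupant if it can be moved to some other counter.
        for ctr, events in configurable.items():
            if event in events and ctr not in seen:
                seen.add(ctr)
                if ctr not in owner or try_place(owner[ctr], seen):
                    owner[ctr] = event
                    return True
        return False

    for event in pending:
        if not try_place(event, set()):
            return None
    assignment.update({event: ctr for ctr, event in owner.items()})
    return assignment

# A tiny case where "first counter that fits" fails but a valid assignment exists.
tiny = {2: {"a", "b"}, 3: {"a"}}
result = assign_counters(["a", "b", "cycles"], configurable=tiny)
```

A greedy allocator would put `a` on counter 2 and then have nowhere for `b`; the matching moves `a` to counter 3 instead. When the function returns `None`, a tool has to fall back to time-multiplexing the events or to multiple runs.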