Overview
LWP is AMD's Lightweight Profiling extension to the x86_64 architecture,
which enables collection of statistical performance data from user-mode code.
Prior to the introduction of LWP, the traditional methods of collecting performance
data about a particular process were to intrusively inject collection code, or to query
the on-chip performance counters via a device driver.
Because LWP operates completely in user space, on a per-thread basis, it incurs much
lower overhead than other statistical performance collection methods.
Notes:
- intrusive instrumentation, at varying granularity and injection points (where in the
  toolchain the collection code is added)
- statistical sampling
  - time: sample at timer intervals and collect the number of events that have happened.
    Counters can roll over if the interval is large enough, unless they are 64-bit
    (few if any are).
  - event: count up to a limit or down to zero, and when the threshold has been reached,
    generate an event record
  - instruction: when the threshold for a particular instruction action is reached
    (instructions retired, branches retired, etc.), generate an event record
- skid issues: initially, events from PMUs were imprecise and could be attributed to an
  incorrect instruction. Precise events have generally eliminated this.
- limited/bounded set of resources: counters are system-wide. If User A sets them up, that
  can conflict with User B's sampling, and vice versa. Pre-emption can trivially happen.
- LWP + system-wide possible.
Because LWP is a per-thread, user-space mechanism, it can be operational at the same
time that the system-wide PMUs are in use.
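The roll-over concern noted for time-based sampling can be illustrated with a small
sketch. It is not part of lwpcore; counter_delta and its bits parameter are hypothetical
names, and the counter width is an assumption supplied by the caller rather than a
property of any particular PMU. The result is only correct if the counter wrapped at
most once between the two readings, which is why a large sampling interval is a problem.

```cpp
#include <cassert>
#include <cstdint>

// Difference between two readings of a free-running counter that is only
// `bits` wide. Hypothetical helper: the width is an assumption passed in by
// the caller, not taken from any specific hardware.
inline std::uint64_t counter_delta(std::uint64_t prev, std::uint64_t curr, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    const std::uint64_t mask =
        (bits == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << bits) - 1);
    // Unsigned subtraction wraps modulo 2^64; masking reduces it modulo 2^bits,
    // which yields the true delta as long as the counter wrapped at most once.
    return (curr - prev) & mask;
}
```

For example, with a 48-bit counter a reading that goes from 2^48 - 2 to 3 yields a delta
of 5 rather than a huge bogus value.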
Implementation approach - the lwpcore files provide a core set of routines to access
an LWPCB (LWP Control Block) in a policy-neutral way. Higher layers of software are
generally assumed to perform the storage management and/or consume the entries
in the ring buffer.
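As a rough illustration of that split, the sketch below shows how a higher layer might
consume ring buffer entries while owning the storage itself. RingView, drain, and the
head/tail offsets are hypothetical names for this example, not the lwpcore interface.
It assumes fixed-size records, a producer that advances the head offset, and a consumer
that advances the tail offset.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of a ring of fixed-size records. The storage is owned by
// the caller (the higher layer); this struct only describes it.
struct RingView {
    const std::uint8_t* base;   // start of the ring buffer
    std::uint32_t size;         // total bytes, a multiple of record_size
    std::uint32_t record_size;  // bytes per event record
};

// Visits every record in [tail, head), wrapping at the end of the buffer,
// and returns the new tail offset for the caller to publish.
template <typename Visitor>
std::uint32_t drain(const RingView& ring, std::uint32_t head, std::uint32_t tail,
                    Visitor&& visit)
{
    assert(ring.record_size != 0 && ring.size % ring.record_size == 0);
    assert(head < ring.size && tail < ring.size);
    assert(head % ring.record_size == 0 && tail % ring.record_size == 0);
    while (tail != head) {
        visit(ring.base + tail);          // policy lives in the visitor
        tail += ring.record_size;
        if (tail == ring.size) tail = 0;  // wrap to the start of the ring
    }
    return tail;
}
```

What happens to each record (aggregation, logging, forwarding) is left entirely to the
visitor, which keeps the traversal itself free of policy.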
For this particular implementation, we use the following conventions:
- C++ is used
- the shared memory has a header region, followed by the ring buffer itself.
- the header holds:
  - misc information such as thread id, thread creation/exit time, etc.
  - the LWPCB
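The layout convention above could be sketched roughly as follows. All names and sizes
here (ThreadInfo, SharedHeader, kLwpcbBytes, kRecordBytes, the 64-byte alignment) are
assumptions made for illustration. In particular, the LWPCB is kept as opaque bytes
because its real size and field layout are defined by the hardware, not by this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder sizes: assumptions for illustration only.
constexpr std::size_t kLwpcbBytes  = 128;  // assumed; the real size comes from the CPU
constexpr std::size_t kRecordBytes = 32;   // assumed event record size

// "Misc information" kept at the front of the header.
struct ThreadInfo {
    std::uint64_t thread_id;
    std::uint64_t create_time;
    std::uint64_t exit_time;   // 0 while the thread is still running
};

// Header region of the shared memory: thread info followed by the LWPCB.
struct alignas(64) SharedHeader {
    ThreadInfo info;
    alignas(64) std::uint8_t lwpcb[kLwpcbBytes];  // opaque LWPCB storage
};

// The ring buffer starts immediately after the header.
inline std::uint8_t* ring_base(void* shared)
{
    return static_cast<std::uint8_t*>(shared) + sizeof(SharedHeader);
}

// Number of whole event records that fit in the rest of the region.
inline std::size_t ring_capacity(std::size_t shared_bytes)
{
    assert(shared_bytes >= sizeof(SharedHeader));
    return (shared_bytes - sizeof(SharedHeader)) / kRecordBytes;
}
```

Keeping the header a multiple of its alignment means the ring buffer that follows it
starts on the same boundary, provided the shared region itself is suitably aligned.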