Utility for statistically sampling strings with associated values.
Install ‘Middleware’ into your WSGI app to use it. Call the ‘set’ function at any time during a request to sample a key/value pair probabilistically. Uses reservoir sampling (http://en.wikipedia.org/wiki/Reservoir_sampling) to approximate incidence rates and sums. The cost per request is ~3 memcache calls.
Gathered stats are available via a built-in UI that is divided into related sections. Users are encouraged to configure several samplers with overlapping time periods to get different levels of resolution.
Example code
    SAMPLEZ_POPULAR_CONTENT = samplez.Section(
        'Popular content',
        samplez.Config(
            'content_10m',
            period=600,
            by_value=True,
            samples=10000,
            value_units='qps'),
        samplez.Config(
            'content_1h',
            period=3600,
            by_value=True,
            samples=10000,
            value_units='qps'))

    SAMPLEZ_LATENCY = samplez.Section(
        'Latency',
        samplez.Config(
            'latency_10m',
            period=600,
            by_value=True,
            samples=10000,
            value_units='ms'),
        samplez.Config(
            'latency_1h',
            period=3600,
            by_value=True,
            samples=10000,
            value_units='ms'))

    samplez.set(SAMPLEZ_LATENCY, 'http://example.com/some/content', 124)
    samplez.set(SAMPLEZ_POPULAR_CONTENT, 'http://example.com/some/content')
Code originally from the PubSubHubbub project:
http://pubsubhubbub.googlecode.com
Configuration for a reservoir sampler.
Adjust the value for a sampling key.
Computes the frequency of a sample.
Checks if this config is expired.
Something is wrong with a configured DoS limit, sampler, or scorer.
WSGI middleware that asynchronously updates samplez tables.
Sampler that saves key/value pairs for multiple reservoirs in parallel.
The basic algorithm is:
- Get the reservoir start timestamp.
- If more than period seconds have elapsed, set the timestamp to now and reset the reservoir’s event counter to zero (in the average case this step is skipped).
- Increment the event counter by the number of new samples.
- Set memcache values to incoming samples following the reservoir algorithm, potentially only sampling a subset.
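The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: a plain dict stands in for memcache, and the class name, the ':start' and ':counter' key suffixes, and the injected clock and random generator are all hypothetical rather than taken from the module.

```python
import random
import time


class ReservoirSketch:
    """In-memory stand-in for one memcache-backed reservoir (hypothetical)."""

    def __init__(self, name, period, samples, now=time.time, rng=None):
        self.name = name
        self.period = period      # seconds per sampling window
        self.samples = samples    # number of reservoir slots
        self.now = now
        self.rng = rng or random.Random()
        self.store = {}           # plays the role of memcache

    def add(self, items):
        """Offers a list of (key, value) pairs to the reservoir."""
        start_key = self.name + ':start'      # assumed key name
        count_key = self.name + ':counter'    # assumed key name
        current = self.now()

        # Steps 1-2: read the start timestamp; reset if the period elapsed.
        start = self.store.get(start_key)
        if start is None or current - start > self.period:
            self.store[start_key] = current
            self.store[count_key] = 0

        # Step 3: bump the event counter by the number of new samples.
        before = self.store.get(count_key, 0)
        self.store[count_key] = before + len(items)

        # Step 4: reservoir rule. The n-th event fills slot n-1 while free
        # slots remain; afterwards it replaces a random slot with
        # probability samples / n and is dropped otherwise.
        for offset, (key, value) in enumerate(items):
            n = before + offset + 1
            if n <= self.samples:
                slot = n - 1
            else:
                slot = self.rng.randrange(n)
                if slot >= self.samples:
                    continue  # this event is not sampled
            self.store['%s:%d' % (self.name, slot)] = (key, value, current)
```

Slots left over from an earlier period are not cleared on reset in this sketch; each stored tuple carries its own timestamp so a reader can skip stale entries.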
The benefit of this approach is that it can be applied to many reservoirs in parallel without incurring additional API calls. The only constraint is the 32MB limit on App Engine batch API calls, which caps the number of samples that can be written simultaneously.
Samples are stored in keys like: ‘sampler_name:0’, ‘sampler_name:1’
Values stored for samples look like: ‘key_sample:NNNN:WWWW’ where the ‘N’s represent the sample value as a big-endian-encoded 4-byte string, and the ‘W’s are a UNIX timestamp as a big-endian-encoded 4-byte string. The timestamp is used to ignore samples that are not from the current period.
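As an illustration of this value layout, the sketch below packs and unpacks a sample with the standard struct module. The helper names are hypothetical, and treating both 4-byte fields as unsigned integers ('>I') is an assumption; the text only states that they are big-endian and 4 bytes wide.

```python
import struct


def encode_sample(key, value, timestamp):
    """Packs a sample as b'<key>:NNNN:WWWW' (hypothetical helper)."""
    return (key.encode('utf-8') + b':' + struct.pack('>I', value) +
            b':' + struct.pack('>I', int(timestamp)))


def decode_sample(blob):
    """Unpacks a blob produced by encode_sample.

    Fields are sliced by fixed width from the end, because both the key
    and the packed bytes may themselves contain a ':' byte.
    """
    key = blob[:-10].decode('utf-8')            # everything before ':NNNN:WWWW'
    value, = struct.unpack('>I', blob[-9:-5])   # NNNN
    timestamp, = struct.unpack('>I', blob[-4:])  # WWWW
    return key, value, timestamp


def is_current(blob, period_start):
    """True if the sample was written during the current period."""
    return decode_sample(blob)[2] >= int(period_start)
```

Splitting on ':' would be fragile here: a URL key contains colons, and a packed value such as 58 is itself the byte ':'.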
There can be a race when resetting the timestamp for a sampler right after a period starts, but it always favors the caller who inserted last (all earlier data will be overwritten). This loses some data for short-period samplers, which is an accepted trade-off.
Gets statistics for a particular config and/or key.
This will only retrieve samples for the current time period. Samples from previous time periods will be ignored.
Samples a set of reported key/values synchronously.
Contains a batch of keys and values for potential sampling.
Returns all the sampling keys present across all configs.
Each key will be present at least once, but some keys may be present more than once if they were inserted repeatedly. The keys are in insertion order. This simplifies testing of this class.
Gets the value for a key/config.
Retrieves the keys present for a specific Config.
Removes a key/value for a specific config.
If the key is not present for the config, this method does nothing.
Sets a key/value for one or more configs.
Each config/key combination may only have a single value. Subsequent calls to this method with the same key/config will overwrite the previous value.
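The batch semantics described above (one value per config/key pair, later writes win, keys reported in insertion order) can be sketched as follows. The class and method names are hypothetical, and configs are represented by plain strings for brevity.

```python
class ReporterSketch:
    """Holds a batch of key/value pairs per config (hypothetical sketch)."""

    def __init__(self):
        self._values = {}   # (config, key) -> value
        self._keys = []     # every inserted key, in insertion order

    def set(self, key, value, *configs):
        """Sets a value for one or more configs, overwriting earlier ones."""
        for config in configs:
            self._values[(config, key)] = value
        self._keys.append(key)

    def get(self, key, config):
        """Returns the value for a key/config pair, or None if absent."""
        return self._values.get((config, key))

    def get_keys(self, config):
        """Returns the keys currently present for one config."""
        return [k for (c, k) in self._values if c == config]

    def remove(self, key, config):
        """Removes a key for a config; does nothing if it is not present."""
        self._values.pop((config, key), None)

    def all_keys(self):
        """Returns every inserted key; repeated inserts appear repeatedly."""
        return list(self._keys)
```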
Contains the current results of a sampler for a given config.
Adds a new sample to these results.
Gets the weighted average of this key’s sampled values.
Gets the count of unique samples for a key.
Gets the frequency of events for this key during the sampling period.
Gets the max value seen for a key.
Gets the min value seen for a key.
Gets the unique sample data for a key.
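One plausible way to derive these per-key statistics from a reservoir is sketched below. It is an assumption, not the module's confirmed method: frequency is estimated by scaling the key's share of the sampled slots by the total event count and dividing by the period length. The class and method names are hypothetical.

```python
class ResultsSketch:
    """Per-key statistics over one reservoir (hypothetical sketch)."""

    def __init__(self, period, total_events):
        self.period = period              # seconds in the sampling window
        self.total_events = total_events  # the reservoir's event counter
        self.total_samples = 0            # samples actually held
        self.by_key = {}                  # key -> list of sampled values

    def add(self, key, value):
        """Records one sample read back from the reservoir."""
        self.total_samples += 1
        self.by_key.setdefault(key, []).append(value)

    def get_count(self, key):
        return len(self.by_key.get(key, []))

    def get_min(self, key):
        values = self.by_key.get(key)
        return min(values) if values else None

    def get_max(self, key):
        values = self.by_key.get(key)
        return max(values) if values else None

    def get_frequency(self, key):
        """Estimated events per second for the key (assumed formula)."""
        if not self.total_samples:
            return 0.0
        share = self.get_count(key) / self.total_samples
        return share * self.total_events / self.period
```

For example, if a key holds 3 of 4 sampled slots, the counter reads 100 events, and the period is 10 seconds, the estimate is 0.75 * 100 / 10 = 7.5 events per second.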
Handler that serves samplez data.
A set of related Configs with a pretty name.
Configs that are added to a Section will be auto-registered with the module so they can be displayed on built-in status pages.
Gets statistics for a Section, optionally for a single key.
Use when retrieving data from multiple configs; ensures that the memory usage of the previous result is garbage collected before the next one is returned.
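The memory behaviour described here is what a Python generator provides naturally. In this minimal sketch, iter_results and the fetch callable are hypothetical names: each result is produced only when the caller asks for it, and the generator keeps no reference to earlier results, so they can be reclaimed once the caller drops them.

```python
def iter_results(configs, fetch):
    """Yields fetch(config) for each config, one result at a time."""
    for config in configs:
        # Nothing is accumulated here; the previous result is only kept
        # alive if the caller still holds a reference to it.
        yield fetch(config)
```

A caller that writes `for result in iter_results(configs, fetch): ...` therefore holds at most one result at a time, unlike building a full list of results up front.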
Applies all pending samples without the need for WSGI middleware.
WSGI convenience method; sets a key/value for one or more configs.
Each config/key combination may only have a single value. Subsequent calls to this method with the same key/config will overwrite the previous value.
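The interplay between a module-level set function and the middleware can be sketched as below. All names are hypothetical, and the sketch makes two simplifying assumptions: pending samples live in thread-local storage, and they are flushed synchronously when the wrapped app returns, whereas the middleware described above applies updates asynchronously.

```python
import threading

_pending = threading.local()  # per-thread batch of pending samples


def set_sample(config, key, value=1):
    """Queues a sample; a repeated config/key pair overwrites the old value."""
    batch = getattr(_pending, 'batch', None)
    if batch is None:
        batch = _pending.batch = {}
    batch[(config, key)] = value


def flush(sink):
    """Hands all pending samples to sink and clears the batch."""
    batch = getattr(_pending, 'batch', None) or {}
    _pending.batch = {}
    if batch:
        sink(batch)


class MiddlewareSketch:
    """WSGI wrapper that flushes queued samples after each request."""

    def __init__(self, app, sink):
        self.app = app
        self.sink = sink  # callable receiving a {(config, key): value} dict

    def __call__(self, environ, start_response):
        _pending.batch = {}  # start every request with an empty batch
        try:
            return self.app(environ, start_response)
        finally:
            flush(self.sink)
```

Because the flush happens when the app callable returns, samples set while a lazily generated response body is being iterated would be missed by this sketch. Calling flush directly covers code that runs outside any WSGI request.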