Group By Ranges & Buckets
The Keen REST API fully supports the new group_by ranges
and group_by buckets
functionality.
It’s possible to invoke an ad-hoc query, create and retrieve the results from a saved query, a cached query, or a cached dataset.
The response structure is identical as the group_by property_name
response, so it should work out-of-the-box with the visualisation framework of your choice.
However, we are still working on rolling out the support for the group_by ranges
and the group_by buckets
in:
- Some SDKs
- The Explorer
- The Saved Queries view
Overview
The purpose of this guide is to show you how to use the new group_by parameter options: ranges
and buckets
.
The group_by parameter can be added to all analysis types (i.e. count, count_unique, select_unique, sum, min, max, average, percentile, …) but extractions. It groups results categorically. Up until now it was only possible to group by a specific value of a property_name
. The new functionality introduces a possibility to group by a range of some property values.
Grouping by a range of values can be useful to:
- build a Web Service response time histogram
- charge for phone calls of different duration
- analyze the results of an exam
An exam case study
The examples below will analyse an exam results using the group_by ranges
and group_by buckets
functionality.
Following event schema will be used in the examples:
{
"score": 71,
"age": 59,
"state": "ID"
}
Assumptions:
- The max
score
at the exam is 100, the minscore
is 0. - For statistical reference the
age
and thestate
(2-letter code) are registered. - The exam is only for adults, so
age >= 18
.
Exam results analysis
To understand how students performed in the exam the group_by buckets
feature can be used. The score
will be divided into groups of size 10 as specified in the size
parameter.
The query:
{
"analysis_type": "count",
"event_collection": "exam_results",
"group_by": {
"buckets": {
"size": 10
},
"property_name": "score"
},
...
}
The query explained:
- The
group_by buckets
construct is perfect for an initial analysis of some data. The definition requires just thesize
and theproperty_name
parameters. - The optional
offset
parameter will default to 0, so buckets will be created using the pattern:[0,10), [10, 20), [20, 30), ...
. - The last bucket
[100, 110)
aggregates only thescore = 100
events, however, ingroup_by buckets
there is no way to modify thekey
shown in the chart legend.
The result:
Improved exam result analysis
The chart above gives us some information, however, it’s still unknown how many students have passed (65 points or more), or what grade they’ve received. Let’s use the group_by ranges
with the count analysis to build a more sophisticated exam score chart.
The query:
{
"analysis_type": "count",
"event_collection": "exam_results",
"group_by": {
"ranges": [
{"to": 65, "key": "[0, 65) -> E"},
{"from": 65, "to": 70, "key": "[65, 70) -> D"},
{"from": 70, "to": 80, "key": "[70, 80) -> C"},
{"from": 80, "to": 90, "key": "[80, 90) -> B"},
{"from": 90, "key": "[90, 100] -> A"}
],
"property_name": "score"
},
...
}
The query explained:
- The
group_by ranges
requires more manual work, as each range needs to be specified manually, but is more powerful:- each
range
can have different size - as edges can be specified as either inclusive or exclusive,
- the
key
can be customised
- each
- Both the
to
andfrom
parameters are optional.- If
from
is not provided then-Infinity
is assumed. - If
to
is not provided then+Infinity
is assumed.
- If
- The
key
is also optional. If not provided it will be calculated based on thefrom
andto
parameters. - The
from
parameter is inclusive by default. It can be changed by thefrom_closed
boolean parameter. - The
to
parameter is exclusive by default. It can be changed by theto_closed
boolean parameter.
The result:
Exam results in different age groups
We can drill down further to see in which age group the exam score
average was the highest.
The query:
{
"analysis_type": "average",
"target_property": "score",
"event_collection": "exam_results",
"group_by": {
"buckets": {
"size": 10,
"offset": 18
},
"property_name": "age"
},
...
}
The query explained:
- As mentioned in the
Overview
not only the count analysis type can be used with thegroup_by
parameter. - In this example the
average
analysis will be performed on data subsets determined by theage
property. - Since all participants are adults, the
offset
parameter is set to 18, so the buckets are built starting from this value.
The result:
Exam results grouped by US states
In this example more group_by ranges
features will be presented. The most important one is the ability to group by not only number ranges but also by string ranges.
The query:
{
"analysis_type": "average",
"target_property": "score",
"event_collection": "exam_results",
"group_by": {
"ranges": [
{"from": "A", "to": "B", "key": "Starting with A"},
{"from": "C", "to": "D", "key": "Starting with C"},
{"from": "M", "to": "MI", "to_closed": true, "key": "States: MA, MD, ME, MI"},
{"from": "MI", "to": "N", "from_closed": false, "key": "States: MN, MO, MS, MT"},
{"to":"N", "key": "Starting with [A-M]"},
{"from":"N", "key": "Starting with [N-Z]"},
{"key": "All states"}
],
"property_name": "state",
"allow_range_overlap": true,
"alias": "mark"
},
...
}
The query explained:
- String types are compared lexicographically, so you can define ranges for strings as well.
- The
from
parameter is inclusive by default. This can be changed by the"from_closed" = false
boolean parameter. - The
to
parameter is exclusive by default. Can be changed by the"to_closed" = true
boolean parameter. - There are several overlapping
range
definitions, so theallow_range_overlap
must be set totrue
, otherwise the query will error. - The
key
is the string that will be displayed in the chart legend. - The last
range
definition is interesting as neitherfrom
norto
is present. Suchrange
collects all events and will display the average for all states (as thekey
suggests). - The
alias
is optional and only visible in the raw JSON response.
The result:
Exam pass rate evolution over time
The exam results statistic: how many passed and failed attempts per year, can be created using the interval
keyword.
The query:
{
"analysis_type": "count",
"event_collection": "exam_results",
"group_by": {
"ranges": [
{"to": 65, "key": "FAIL"},
{"from": 65, "key": "PASS"}
],
"property_name": "score",
"alias": "Exam result"
},
"interval": "yearly",
...
}
The query explained:
- Only two
range
definitions are present:PASS
andFAIL
. - Since
to
is exclusive, then allscore
values strictly less than65
are marked asFAIL
. - Since
from
is inclusive, then allscore
values grater than or equal to65
are marked asPASS
. - The
yearly
interval will additionally group the results by the year the exam has been completed.
The result: