Compute Pricing Guide
What does “Properties Scanned” mean?
In short, it represents the amount of data we had to process to answer the query you requested. For the rest of this document we’ll use the following definitions:

- `N`: the total properties scanned.
- `E`: the number of events that exist within the `timeframe` you provided.
- `P`: the number of properties per event required to calculate the query.
The formula used to calculate the number of properties scanned per query is:
N = E * P
To calculate `P` for a given query, first find the set of unique properties referenced in the `filters`, `group_by`, or `target_property` parameters. If the query has a `timeframe` (which almost all should) then include `keen.timestamp` in this set. `P` is then equal to the cardinality, or number of elements, in the set. This means that if you’re filtering and grouping by the same property, that only increases `P` by 1.

Here are some examples of calculating `P` (all assuming that a `timeframe` is present):
| Query Definition | P |
|---|---|
| count collection C | 1 |
| count collection C, filter on property A | 2 |
| sum on property A, filter on property B | 3 |
| sum on property A, filter on property A, group_by property A | 2 |
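As a rough illustration, here is a minimal sketch of how `P` could be computed from a query definition held in a plain dictionary. The field names mirror the parameters described above, but the exact structure is an assumption made for this example, not an official client:

```python
# Sketch: count the unique properties (P) referenced by a query.
# The dict shape mirrors the parameter names discussed above; it is
# illustrative only, not an official Keen client structure.

def properties_per_event(query):
    props = set()
    # every property referenced in filters
    for f in query.get("filters", []):
        props.add(f["property_name"])
    # group_by can be a single property or a list
    group_by = query.get("group_by") or []
    props.update([group_by] if isinstance(group_by, str) else group_by)
    # the property being aggregated, if any
    if query.get("target_property"):
        props.add(query["target_property"])
    # a timeframe means keen.timestamp is read as well
    if query.get("timeframe"):
        props.add("keen.timestamp")
    return len(props)

# Last row of the table: sum on A, filter on A, group_by A -> P = 2
query = {
    "analysis_type": "sum",
    "target_property": "A",
    "filters": [{"property_name": "A", "operator": "eq", "property_value": 1}],
    "group_by": "A",
    "timeframe": "this_30_days",
}
print(properties_per_event(query))  # 2
```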
Working Example
To illustrate this formula, let’s consider an example query: a `sum` of property `x`, with a filter on property `y` being equal (`eq`) to a specific value, a `timeframe` of `this_90_days`, and an `interval` of `daily`. Assume that the collection being queried gets a steady 10K events per day. For this query `P` will be 3 (the properties are `x`, `y`, and `keen.timestamp`) and `E` will be 900K, so `N = 900K * 3 = 2.7M`. For the sake of example, if the cost is $1 per 10M properties scanned, this query would cost $0.27. If this query is powering a KPI in a dashboard that is viewed 20 times per day, or 600 times per month, the monthly cost of that KPI would be `600 * 2.7M = 1.62B` properties scanned, or $162. Read on to see how caching can help bring that price down.
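The same arithmetic, written out as a short sketch (the $1 per 10M price is the illustrative figure above, not an actual rate):

```python
# Worked example: N = E * P, priced at an illustrative $1 per 10M properties scanned.
events_per_day = 10_000
days = 90
P = 3                                  # x, y, and keen.timestamp
E = events_per_day * days              # 900K events in the timeframe
N = E * P                              # 2.7M properties scanned per execution

price_per_property = 1 / 10_000_000
print(N * price_per_property)          # 0.27 dollars per ad hoc run

views_per_month = 20 * 30              # KPI viewed 20 times per day
print(N * views_per_month * price_per_property)  # 162.0 dollars per month
```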
Why is Keen Compute usage priced based on properties scanned?
You can think of Keen as a columnar database indexed on `keen.timestamp`.
“Columnar database” means that when we are reading data to evaluate a query, we can read just the subset of properties (or columns) that are relevant to the query and skip the others. All other factors being equal, a query that reads 10 properties requires 5 times as much work as one that reads 2 properties.
“Indexed on `keen.timestamp`” means that we can efficiently look up the subset of events whose `keen.timestamp` falls within a given range, also known as a `timeframe`. All other factors being equal, a query whose `timeframe` includes 10 million events requires 10 times as much work as one whose `timeframe` includes 1 million events. Importantly, this is true even if filters on other properties dramatically reduce the number of events that are actually used to compute the result. A query with a `timeframe` that includes 10 million events and a `filter` that matches just 10 of them will still have to read all 10 million.
While these two factors are not the only contributors to how much it costs Keen to execute a query, they tend to dominate in most cases and provide a good approximation. We use these factors to compute your usage in order to align your costs with ours, which incentivizes efficient implementations and allows us to lower costs for everyone.
How are Extractions priced?
Giving you access to your raw data is very important for us at Keen. We provide the ability to extract chunks of your data in CSV or JSON format via our Extraction API.
The pricing for this follows the same pricing formula mentioned above. The `P` in this case is equal to the number of properties per event you want to extract. By default this is all of the properties in the schema for the given event collection (similar to a `SELECT *` SQL query).
You can limit the properties retrieved by an extraction using the `property_names` parameter. If you only need a small subset of the properties in the schema for your use case then this can result in a large cost savings (and a performance improvement).
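For example, an extraction limited to a handful of properties might look something like the sketch below. The collection and property names are hypothetical, and the request shape is only an approximation built around the `property_names` parameter described above, not a verbatim API reference:

```python
# Sketch: an extraction restricted to three properties instead of the full schema.
# Collection and property names are hypothetical; the shape is illustrative only.
extraction = {
    "event_collection": "purchases",
    "timeframe": "this_30_days",
    "property_names": ["user.id", "amount", "keen.timestamp"],
}

# If the full schema had, say, 40 properties, limiting P to 3 cuts the
# properties scanned (E * P) for the same timeframe by a factor of ~13.
full_schema_P = 40
limited_P = len(extraction["property_names"])
print(full_schema_P / limited_P)  # ~13.3
```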
How are funnels priced?
Funnels are a powerful tool for analyzing data. They allow you to analyze a cohort’s behavior across multiple events.
The calculations for `E` and `P` are slightly different for funnels than for other queries:

`E` is calculated as the sum of the number of events that matched the timeframe of each step in the funnel. For example, if step 1 is over collection `foo` with a timeframe of `this_30_days`, step 2 is over collection `foo` with a timeframe of `this_10_days`, and step 3 is over collection `bar` with a timeframe of `this_30_days`, then `E` will be equal to:
[# of events in foo in this_30_days]
+ [# of events in foo in this_10_days]
+ [# of events in bar in this_30_days]
`P` is calculated based on the set of properties that appear in any step: all filters, all `actor_id` properties, plus `keen.timestamp`. Note that properties with the same name but in different collections are currently considered to be the same property for the purposes of this calculation.

The total properties scanned is still just computed as `N = E * P`.
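A minimal sketch of the funnel calculation, assuming each step is described by its collection, the number of events in its timeframe, its filters, and its `actor_id` property (the step structure here is illustrative, not the exact API format):

```python
# Sketch: N = E * P for a funnel.
# E sums the events matching each step's timeframe; P counts the unique
# properties used by any step (filters + actor properties) plus keen.timestamp.

def funnel_properties_scanned(steps):
    E = sum(step["events_in_timeframe"] for step in steps)
    props = {"keen.timestamp"}
    for step in steps:
        props.add(step["actor_property"])
        props.update(f["property_name"] for f in step.get("filters", []))
    return E * len(props)

steps = [
    {"collection": "foo", "events_in_timeframe": 300_000, "actor_property": "user.id"},
    {"collection": "foo", "events_in_timeframe": 100_000, "actor_property": "user.id",
     "filters": [{"property_name": "plan", "operator": "eq", "property_value": "pro"}]},
    {"collection": "bar", "events_in_timeframe": 150_000, "actor_property": "user.id"},
]
# P = {keen.timestamp, user.id, plan} = 3, E = 550K, so N = 1.65M
print(funnel_properties_scanned(steps))  # 1650000
```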
Finding the Cost of a Query at Execution Time
We provide you with an option to enrich query results with the detailed number of scanned events (`E`) and properties (`P`), as well as the total `N`. Please read how to modify your query in order to see the execution details.
How Caching Saves on Query Costs
By default our compute API calculates answers at the time of request. These “ad hoc” queries are great for exploration, but if similar or identical queries are made frequently it can drive up costs. Caching is an effective way to make common queries both cheaper and faster.
Cached Queries
A Cached Query is a query that Keen automatically runs periodically according to a specified `refresh_rate` (configurable between 4 and 48 hours). The result is then kept in a cache, so when it is retrieved it is pulled from the cache instead of being recomputed.
Cached Query pricing is based purely on the queries that Keen runs to update the cache; there is no cost to you for retrieving the cached result.
If a hypothetical dashboard is viewed 100 times per day and its queries are all being calculated from scratch every time, the total properties scanned usage will rise very quickly. If instead the same dashboard uses Cached Queries, they will only be calculated once per `refresh_rate` period, thus reducing the amount of compute that needs to be done. On top of that, the data required to power the dashboard will be served from the cache for increased speed.
To estimate the total monthly properties scanned for a Cached Query, simply compute the properties scanned for a single execution (using the `N = E * P` formula from above) and then multiply by `R`, the number of times the query will be run per month. With a `refresh_rate` of 4 hours and a 30-day month, for example, `R` will be around 180.
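As a sketch, assuming a 30-day month and the illustrative $1 per 10M rate from the working example:

```python
# Sketch: monthly properties scanned for a Cached Query.
# R = executions per month, driven entirely by the refresh_rate (in hours).
def cached_query_monthly(n_per_run, refresh_rate_hours, days_in_month=30):
    R = (24 / refresh_rate_hours) * days_in_month
    return n_per_run * R

monthly = cached_query_monthly(n_per_run=2_700_000, refresh_rate_hours=4)  # R = 180
print(monthly, monthly / 10_000_000)  # 486000000.0 properties scanned, i.e. $48.60
```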
Working Example, revisited
In our example above we considered a query run 20 times per day that would generate 1.62B properties scanned, or $162 of usage, per month. If that same query was migrated to a Cached Query with a `refresh_rate` of 4 hours then it would generate `2.7M * 180 = 486M` properties scanned per month, or $48.60, a 70% savings. The more frequently the query result is retrieved, the bigger the savings of a Cached Query over ad hoc. Conversely, if the ad hoc query is only run once or twice per day then Cached Queries are probably not a good cost-saving opportunity.
Cached Datasets
Cached Datasets allow you to precompute results for a query for every value of an `index_by` property (or combination of properties) and then quickly look up the result of that query for a specific value.
Like Cached Queries, Cached Datasets are priced based on the queries that Keen runs in the background to build and update the cache. There is no cost for retrieving results. To estimate how much a Cached Dataset will cost per month, we need to understand a little more about how it is updated.
Under the hood, Keen logically stores the Cached Dataset as a matrix whose rows are the values of the `index_by` property and whose columns are the intervals. Each cell in the matrix represents the result of the query for that `index_by` value in that interval. Once per hour Keen checks which columns are due to be refreshed by finding (a) any columns that have never been computed before; (b) the column that contains the current time, if any; and (c) the column that contains the time 48 hours ago, if any. (This 48-hour “trailing update” is to catch any late-arriving data.) Note that (b) and (c) may be the same column, e.g. because the Cached Dataset has a `monthly` interval and it is the middle of the month.
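A rough sketch of that hourly selection logic, treating each column as the start time of the interval it covers (this is a simplification for illustration, not Keen's actual implementation):

```python
# Sketch: decide which Cached Dataset columns are due for refresh this hour.
# A "column" here is simply the start datetime of the interval it covers.
from datetime import datetime, timedelta

def columns_due(all_columns, computed_columns, interval_start_for, now):
    due = {c for c in all_columns if c not in computed_columns}   # (a) never computed
    due.add(interval_start_for(now))                              # (b) current interval
    due.add(interval_start_for(now - timedelta(hours=48)))        # (c) trailing update
    return due & set(all_columns)

# Daily interval: the interval containing t starts at midnight of t's day.
def daily_interval_start(t):
    return t.replace(hour=0, minute=0, second=0, microsecond=0)

now = datetime(2019, 1, 10, 15, 0)
all_columns = [datetime(2019, 1, d) for d in range(1, 11)]
already_computed = set(all_columns)
# Today's column plus the column containing "now - 48h" (Jan 8) come back as due.
print(sorted(columns_due(all_columns, already_computed, daily_interval_start, now)))
```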
To update a column Keen runs a query that is similar to the one in the Cached Dataset definition, but it is modified in the following ways:

- The `interval` is removed.
- The `timeframe` is set to the absolute boundaries of the interval corresponding to the column being updated. For example, if the Cached Dataset has a `daily` interval and we are updating the column for the current day, then the `timeframe` for the update query will be set to something like `{"start":"2019-01-01T00:00:00Z","end":"2019-01-02T00:00:00Z"}`.
- The `index_by` property (or properties) are added to the list of `group_by` properties.
Once this query completes, it is parsed into individual results for each `index_by` value and the appropriate cells in the matrix are updated.
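A sketch of that transformation, treating the Cached Dataset's underlying query as a plain dictionary (the field shapes are assumptions made for illustration):

```python
# Sketch: derive the column update query from a Cached Dataset definition.
import copy

def column_update_query(dataset_query, index_by, column_start, column_end):
    q = copy.deepcopy(dataset_query)
    q.pop("interval", None)                                        # 1. drop the interval
    q["timeframe"] = {"start": column_start, "end": column_end}    # 2. absolute timeframe
    group_by = q.get("group_by") or []
    group_by = [group_by] if isinstance(group_by, str) else list(group_by)
    q["group_by"] = group_by + list(index_by)                      # 3. index_by joins group_by
    return q

dataset_query = {"analysis_type": "sum", "target_property": "x",
                 "interval": "daily", "timeframe": "this_90_days"}
print(column_update_query(dataset_query, ["y"],
                          "2019-01-01T00:00:00Z", "2019-01-02T00:00:00Z"))
```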
With that background in mind, the monthly cost of a Cached Dataset can be estimated as follows. First, compute the cost of a single column update query. To do this, follow the same `N = E * P` formula as before, using the typical number of events per unit of the Cached Dataset's `interval` for `E` and computing `P` to include the `index_by` properties. Then, based on the `interval` and `timeframe` being used, look up how many column update queries will be run per month in the following table (note that a 30-day month is assumed for simplicity):
| Interval | Timeframe keyword | Column updates per month (approx.) | Explanation |
|---|---|---|---|
| `minutely` | `this_N_minutes` | 60 * 24 * 30 = 43200 | Each hour the latest ~60 minutes are computed. (Note: 48 hours = 2880 minutes, which exceeds 2000, the maximum value of N, so the “trailing update” always falls outside the timeframe.) |
| `minutely` | `previous_N_minutes` | 60 * 24 * 30 = 43200 | Same as `this_N_minutes`. |
| `hourly` | `this_N_hours` | 2 * 24 * 30 = 1440 | Each hour is computed twice: once when it first comes into the timeframe and once by the “trailing update” pass. |
| `hourly` | `previous_N_hours` | 2 * 24 * 30 = 1440 | Same as `this_N_hours`. |
| `daily` | `this_N_days` | 2 * 24 * 30 = 1440 | Each hour the current day and the day before yesterday are computed. |
| `daily` | `previous_N_days` | 30 + (24 * 30) = 750 | Each day is computed once within the hour after it ends, then again every hour during the second day after it by the “trailing update” pass. |
| `weekly` | `this_N_weeks` | (24 * 30) + (4 * 48) = 912 | Each hour the current week is computed. For the first 48 hours of a week, the previous week is also computed by the “trailing update” pass. (Note: some months will compute 5 weeks instead of 4.) |
| `weekly` | `previous_N_weeks` | 4 * 48 = 192 | For the first 48 hours of a week, the previous week is computed by the “trailing update” pass. (Note: some months will compute 5 weeks instead of 4.) |
| `monthly` | `this_N_months` | (24 * 30) + 48 = 768 | Each hour the current month is computed. For the first 48 hours of a month, the previous month is also computed by the “trailing update” pass. |
| `monthly` | `previous_N_months` | 48 | For the first 48 hours of a month, the previous month is computed by the “trailing update” pass. |
| `yearly` | `this_N_years` | 24 * 30 = 720 | Each hour the current year is computed. (Note: the previous year is also recomputed for the first 48 hours of the year, so this will be slightly higher in January.) |
| `yearly` | `previous_N_years` | 48 in January; 0 thereafter | The previous year is re-evaluated for the first 48 hours of January, but after that no updates are necessary. |
Then multiply the cost per update query by the number of column updates per month to get a rough estimate of total properties scanned usage. Note that for `this_N_*` timeframes with longer-duration intervals (`daily` and up) this will usually be an over-estimate, because the column update query for an in-progress interval won't read as much data as one for an already-finished interval. For example, when first updating the column for the current day given a `daily` interval and a `this_N_days` timeframe, there will only be 1/24th of a day's worth of events to scan. To get a precise estimate you will need to take this into account, but in most cases even the conservative estimate will be good enough.
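Putting the formula and the table together, a conservative monthly estimate can be sketched as follows. The per-month update counts come from the table above, and the numbers in the usage line are the same illustrative figures used in this guide's examples:

```python
# Sketch: conservative monthly properties-scanned estimate for a Cached Dataset.
# Column updates per month come from the table above (30-day month assumed).
COLUMN_UPDATES_PER_MONTH = {
    ("minutely", "this"): 43200, ("minutely", "previous"): 43200,
    ("hourly",   "this"): 1440,  ("hourly",   "previous"): 1440,
    ("daily",    "this"): 1440,  ("daily",    "previous"): 750,
    ("weekly",   "this"): 912,   ("weekly",   "previous"): 192,
    ("monthly",  "this"): 768,   ("monthly",  "previous"): 48,
    ("yearly",   "this"): 720,   ("yearly",   "previous"): 48,  # January only
}

def cached_dataset_monthly(events_per_interval, P, interval, timeframe_kind):
    per_update = events_per_interval * P                 # N = E * P for one column update
    return per_update * COLUMN_UPDATES_PER_MONTH[(interval, timeframe_kind)]

# 10K events/day, P = 3, daily interval with a this_N_days timeframe:
print(cached_dataset_monthly(10_000, 3, "daily", "this"))  # 43200000 (~$4.32 at $1/10M)
```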
It is also important to note that there will be an initial bootstrapping phase when a Cached Dataset is first created. During this phase Keen will need to run queries for every column in the matrix, i.e. every interval in the `timeframe`. To estimate the cost of this bootstrapping phase, just estimate the cost of a single column update query (as described above) and multiply by the number of columns/intervals (e.g. 500 for a `this_500_days` timeframe).
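For instance, reusing the assumptions from this guide's working example (10K events per day, a `P` of 3), bootstrapping a `this_500_days`, `daily` Cached Dataset would scan roughly:

```python
# Sketch: one-time bootstrap estimate for a this_500_days, daily Cached Dataset.
per_column = 10_000 * 3      # N = E * P for one finished daily column
columns = 500                # one column per interval in the timeframe
print(per_column * columns)  # 15000000 properties scanned (~$1.50 at $1 per 10M)
```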
Working Example, re-revisited
Building upon the example Query and Cached Query above, imagine we convert the query to a Cached Dataset by removing the filter on `y` and instead using `y` as the `index_by` property. We use a timeframe of `this_90_days` and a `daily` interval. The cost of a single column update query will be `N = E * P = 10K * 3 = 30K` (actually less in practice due to the partial-interval behavior mentioned above). The number of column updates per month will be 1440, from the table. So the total number of properties scanned will be less than `30K * 1440 = 43.2M`, or $4.32 per month, a more than 10x reduction compared to the Cached Query version and a 37x reduction over the original ad hoc version.
Warning: avoid this_N_years Cached Datasets
When using a `this` modifier and a `yearly` interval, we have to run a column update query every hour that covers the entire current year-to-date. This becomes quite expensive as the year progresses and contains more events. For this reason we strongly discourage use of `this_N_years` with Cached Datasets.