How Rakam API Works

Performance

The API is developed with Java 8 and Netty and is highly optimized for handling API requests. It's battle-tested in production environments processing 30B events per month on only 2 servers.
Modular

The API server is configured via a config.properties file, where you define all the enrichment, sanitization, and target database options. You can even develop your own modules using Java's ServiceLoader mechanism.
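
As a rough illustration of that mechanism, here is a minimal sketch using Java's standard ServiceLoader. The EventMapper interface below is hypothetical and only stands in for whatever SPI types Rakam actually exposes.

    import java.util.Map;
    import java.util.ServiceLoader;

    // Hypothetical module interface, used only for illustration; Rakam's actual
    // SPI types and names may differ.
    interface EventMapper {
        void map(Map<String, Object> event);
    }

    public class ModuleDiscovery {
        public static void main(String[] args) {
            // Standard Java ServiceLoader: any implementation listed in a
            // META-INF/services/EventMapper file on the classpath is discovered
            // at runtime, so custom modules can be dropped in without touching
            // the server code.
            ServiceLoader<EventMapper> modules = ServiceLoader.load(EventMapper.class);
            for (EventMapper module : modules) {
                System.out.println("Loaded module: " + module.getClass().getName());
            }
        }
    }
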
No learning curve

The API works similarly to SaaS product analytics tools: you send your event properties and the system automatically generates the schema on the fly. It's as easy as using Google Analytics.
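
For a sense of what that looks like in practice, here is a hedged sketch of posting a single event over HTTP from Java. The endpoint path, api_key field, and property names are assumptions for illustration, not Rakam's documented request format; the point is that the payload is just a collection name plus arbitrary typed properties, from which the schema is inferred.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class SendEventExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical payload: a collection name, credentials, and free-form properties.
            String json = "{"
                    + "\"collection\": \"pageview\","
                    + "\"api\": {\"api_key\": \"YOUR_WRITE_KEY\"},"
                    + "\"properties\": {\"url\": \"/pricing\", \"user_id\": \"u-42\"}"
                    + "}";

            // Hypothetical endpoint; replace with your own Rakam API deployment.
            URL url = new URL("http://your-rakam-api.example.com/event/collect");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("Response code: " + conn.getResponseCode());
        }
    }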









Your data, your rules

You install the API on your own servers, which we have no access to. You can use the event data for product analytics, a recommendation system, or any other in-house project later on.



One-click install with Terraform

We use Kubernetes with Terraform templates and we provide one-click installers for AWS and Google Cloud.

How SDKs work

We provide SDKs for Android, iOS and web platforms. The SDKs automatically enrich the data depending on the platform and send the data to your Rakam API.


Enrich data in real-time

The Rakam API automatically enriches your customer data with location, referrer, and device information.
The SDKs also collect built-in properties such as the advertising ID on mobile devices or the page URL on websites.
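
Conceptually, an enrichment step just derives extra properties from what the client sends and attaches them to the event before it is stored. A minimal sketch, with made-up property names and a stubbed GeoIP lookup standing in for the real modules:

    import java.util.HashMap;
    import java.util.Map;

    public class EnrichmentSketch {
        // Illustration only: the property names (_ip, _country) are assumptions,
        // not Rakam's actual schema.
        static Map<String, Object> enrich(Map<String, Object> properties) {
            String ip = (String) properties.get("_ip");
            if (ip != null) {
                properties.put("_country", lookupCountry(ip));
            }
            return properties;
        }

        // Stub standing in for a real GeoIP database lookup.
        static String lookupCountry(String ip) {
            return "US";
        }

        public static void main(String[] args) {
            Map<String, Object> event = new HashMap<>();
            event.put("_ip", "203.0.113.7");
            event.put("url", "/pricing");
            System.out.println(enrich(event));
        }
    }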


SQL for everything

You can ingest the data either into your PostgreSQL database or into a Snowflake cluster, which scales horizontally with ease. Then run SQL queries on your event data using the schema generated from your SDKs.
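
For example, once events land in PostgreSQL you can analyze them with plain SQL; here is a hedged JDBC sketch. The connection details, the pageview table, and the _time column are assumptions, since the actual schema is generated from the events you send.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class EventQueryExample {
        public static void main(String[] args) throws SQLException {
            // Requires the PostgreSQL JDBC driver on the classpath; connection
            // details, the pageview table, and the _time column are assumptions.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/rakam", "rakam", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT date_trunc('day', _time) AS day, count(*) AS events "
                                 + "FROM pageview GROUP BY 1 ORDER BY 1")) {
                while (rs.next()) {
                    // Daily event counts computed directly from the auto-generated schema.
                    System.out.println(rs.getTimestamp("day") + ": " + rs.getLong("events"));
                }
            }
        }
    }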


Deployment Types







PostgreSQL

It's the easiest way to start using Rakam. The API server enriches and sanitizes the data and INSERTs it into PostgreSQL in real time. We make use of partitioned tables and BRIN indexes to leverage PostgreSQL's analytical features. This deployment type is suitable if you have fewer than 100M events per month, because PostgreSQL is not horizontally scalable.
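
To illustrate the idea rather than Rakam's actual DDL, a time-partitioned event table with a BRIN index on the timestamp column might be set up like this (assuming PostgreSQL 11+; table and column names are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PartitionSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/rakam", "rakam", "secret");
                 Statement stmt = conn.createStatement()) {
                // A range-partitioned event table: each month of data lives in its own partition.
                stmt.execute("CREATE TABLE pageview (_time timestamptz NOT NULL, _user text, url text) "
                        + "PARTITION BY RANGE (_time)");
                stmt.execute("CREATE TABLE pageview_2024_01 PARTITION OF pageview "
                        + "FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')");
                // A BRIN index stays tiny and works well for append-only, time-ordered data.
                stmt.execute("CREATE INDEX pageview_time_brin ON pageview USING brin (_time)");
            }
        }
    }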


Snowflake

The Snowflake-based solution makes use of a distributed commit log (we support Kafka, Kinesis, and Google Pub/Sub), a distributed storage system (we support S3 and Google Cloud Storage) for storing raw event data, and a Snowflake cluster if you need SQL access. Since it's horizontally scalable, this is the big-data solution for more than 100M events per month.
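
To make the pipeline concrete, here is a hedged sketch of the commit-log step with Kafka: the API server publishes each event to a topic, a separate consumer batches the records to S3 (or Google Cloud Storage), and Snowflake loads the files from there. The topic name and JSON payload are assumptions, and the real server handles serialization internally.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CommitLogSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Each incoming event is appended to the commit log; a downstream consumer
                // writes these records to S3 in batches, and Snowflake loads the files.
                String event = "{\"collection\":\"pageview\",\"properties\":{\"url\":\"/pricing\"}}";
                producer.send(new ProducerRecord<>("rakam-events", event));
            }
        }
    }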

Comparison with other solutions

We believe in using the right tool for the job. Here is a comparison matrix against two similar solutions, Snowplow and Segment:

DevOps

Installation
Rakam API: It has a Terraform installer for AWS and a Docker Compose setup that let you install it on your own cloud. You enter variables describing your workload and throughput, and it sets up the resources for you.
Snowplow: It supports AWS, Google Cloud, and Azure. You have to go through the Snowplow documentation and set up the resources manually if you want to install it on your own; otherwise, Snowplow offers consulting services to set up the resources for you.
Segment: Since it's a SaaS, you register on Segment.com and create a tracker using their interface.

Maintenance
Rakam API: It uses Kubernetes under the hood and makes use of CloudWatch to scale the cluster automatically on AWS.
Snowplow: The enterprise package (Snowplow Insights) has built-in alarms that scale the cluster. In the open-source version you need to set up the alarms yourself.
Segment: No maintenance required.

High Availability
Rakam API: The Kubernetes cluster uses multiple availability zones within the region by default for performance reasons; it can also span multiple regions for high availability.
Snowplow: It depends on the collector that you use. The most reliable option is the CloudFront collector, which relies on the managed AWS service and thus doesn't require any servers.
Segment: No option provided.

Scalability
Rakam API: It scales horizontally and has been tested with 40B events per month and up to 2,000 event types in production; in theory there is no limit.
Snowplow: It scales horizontally and is probably better tested than Rakam API, since it's more popular.
Segment: No option provided.

Data Collection

SDKs
Rakam API: Available on the Web, iOS, and Android platforms; see the documentation for details. It also provides an API for collecting data from third-party and backend services.
Snowplow: Available in many different languages and platforms; see the documentation for details.
Segment: In addition to the mobile and web platforms, it supports hundreds of sources, including marketing and CRM tools; see the documentation.

Custom Events
Rakam API: Automatic schema discovery by inferring the values. We also have a Taxonomy feature for validating and controlling the schemas (see docs).
Snowplow: You need to define the schema during the integration phase (see docs).
Segment: Automatic schema discovery by inferring the values. It also has Segment Protocols for controlling the data flow.

User Attributes
Rakam API: See docs.
Snowplow: (no entry)
Segment: See docs.

Enrichment
Rakam API: The most common pre-built enrichment types, such as geo mapping, user agent, and sessionization, plus a custom module SPI for writing enrichments in Java and JavaScript. We're in favor of ELT (transforming the data in the data warehouse) since it's easier to maintain.
Snowplow: Many pre-built enrichment integrations (see docs).
Segment: It does not provide any enrichment of the event data.

Customization
Rakam API: We have an opinionated way of collecting the data and support only 2 deployment types for simplicity: for fewer than 100M events per month, use PostgreSQL; for more than 100M, use the S3 target and connect to Snowflake.
Snowplow: It supports many different backends, from Elasticsearch to BigQuery, S3, Redshift, and PostgreSQL. It also has many different collectors for different use cases. It might be a better solution if you need more than collecting raw customer event data.
Segment: It does not provide any customization; instead, everything works in a plug-and-play manner.

Latency
Rakam API: 5 minutes by default. Most of the latency comes from the Kinesis consumer writing to the S3 target, because we write the data in batches for performance reasons.
Snowplow: 45-80 minutes, because it relies on CloudFront; see the documentation for details.
Segment: At least 2 hours. Segment syncs the data as a separate process, and the shortest configurable interval is 2 hours.

Best For
Rakam API: Event analytics with many custom events.
Snowplow: Pageview (clickstream) and enterprise-level analytics with custom use cases.
Segment: Teams that don't have a data team and don't need any customization.