The world is one big data problem. – Andrew McAfee, Principal Research Scientist, MIT
You can have data without information, but you cannot have information without data. And businesses today are inundated with data from various sources.
As a multi-product company that has seen rapid growth over the past few years, we generate over 50 TB of data each month across our products.
With each of our software products catering to different aspects of the customer journey, we knew that we could not compromise on data privacy and security. The biggest challenge that we have been trying to solve is interpreting this data and generating insights out of it, while ensuring that the data is always secure. Naturally, our system was designed to ensure that we maintained the highest standards around security, governance, and compliance.
Today, between the data science team, machine learning team, customer success teams, product teams, and business operations teams, we have a lot of people who interpret, analyze, and build models based on available data.
Previously, only the IT department had access to the data; business teams had to go through it, and through various tedious manual steps, to generate any kind of information. On one hand, teams have a constant need for data access; on the other, there is a constant challenge in making data available while keeping security and privacy norms in check.
Earlier, if a team needed any crucial data, it first had to get security clearance; then the DevOps team would run queries on our databases; and only then would the team get the data it needed. We wanted proactive solutions to this challenge: to democratize data and make our organization run on data-driven decisions.
The primary step was to build a platform that provided transparency and enabled users to trust the data they were working with. It was clear that the need of the hour was to build a data lake. This inspired us to kick-start the big data initiative at Freshworks.
We needed to create a large body of data at rest, or a data lake, as known across the industry. Like any other man-made lake, a data lake needs steady input streams and a treatment plant to process the data. Apache Hadoop had all the tools we needed to ingest, process, and enrich the incoming data.
The construction of Baikal – an artificial lake
Every team has its own unique goals and priorities, and the data it generates is stored in silos: each repository is controlled by a single organizational unit and cannot be easily accessed by other teams. To derive business insights and make data-driven decisions in daily operations, we needed to ensure that all teams had access to this information. To consolidate the data silos built for special-purpose applications, we built ‘Baikal’, a scalable data repository, using Cloudera as a platform on AWS.
Cloudera is a provider and supporter of Apache Hadoop for the enterprise. It offers a unified platform to store, process, and analyze data, widely used for big data analysis, and it helps find new ways to derive value from existing data.
We put together a small team of specialized engineers to help accelerate our goal of building a data lake. We named our artificial lake after Lake Baikal in Siberia, the world’s deepest lake and its largest freshwater lake by volume. We designed it to capture data from all our products, across all possible formats, and store it in a way that was queryable.
We started building the platform using analytics and processing frameworks/libraries like Apache HBase, Apache Hive, Apache Pig, Impala, Apache Spark, and other ecosystem components. Our goal was to manipulate raw data and derive real insights, rather than make it a mere query and data extraction tool. We wanted our data lake to be a genuine enterprise data hub.
Arriving at the right platform
When we started, we went with the obvious choice: Amazon EMR. However, EMR was designed around ‘transient’ workloads and did not work well when we needed to keep a cluster running for prolonged periods. Being tied to the AWS stack also meant we could not access the richer set of Apache ecosystem components that distributions like Cloudera or Hortonworks provide.
We looked into third party data lake providers as well. S3-based data lakes are easy to set up. Data is kept in S3, and clusters are spun up on EC2 to query the data. Even spot instances can be spun up or down based on usage.
After detailed discussions, we decided to build the data lake in-house. That would allow us to build a complete solution that fit our needs, adhered to our strict security standards, allowed fine-grained access control, and had the right performance/cost characteristics. Moreover, it would help us develop in-house expertise that would come in handy in operating a large-scale system like this.
Cloudera gave us a platform to start working on. From an engineering standpoint, it gave us the same code that comes with the open source stack but with bug fixes, patches, and interoperability with Kerberos and other stacks in the same family. Cloudera gave us predictable and consistent access to platform improvements along with good community support.
We had laid out the fundamentals of the building process and set our end goals. The entire process required us to gather data that was being ingested from various sources, process it, and store it in some easily accessible form.
Freshworks’ data is spread across RDS, S3, and custom data sources. In Baikal, we made use of the existing stack to ingest data from these sources. Beyond that, we wrote many custom data pipelines (using the AWS and Cloudera stacks) to move data at regular intervals, and in a few cases we wrote custom connectors to ingest data into our storage layer: the Hadoop Distributed File System (HDFS) and Apache HBase. We wanted to efficiently create elaborate data-processing workloads that were fault tolerant, repeatable, and highly available (HA).
Sqoop, a command-line application, is the most commonly used component for ingesting data from an RDBMS into a Hadoop cluster. A staging area, or landing zone, is created as intermediate storage and used for data processing during the extract, transform, load (ETL) process. The final, relevant data is stored in the cluster, where it is consumed by Hive or Pig.
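As an illustration of this ingest step, a Sqoop import into a staging directory might be assembled like the sketch below. The connection string, table, directory, and split column are hypothetical placeholders, not our production configuration.

```python
# Hypothetical Sqoop import, assembled as an argv list. All names here
# (JDBC URL, table, staging dir, split column) are illustrative.
def build_sqoop_import(jdbc_url, table, staging_dir, split_by):
    """Build the argv for a parallel Sqoop import into a staging dir."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", staging_dir,   # landing zone for the ETL step
        "--split-by", split_by,        # column used to parallelize mappers
        "--num-mappers", "4",
        "--hive-import",               # load into a Hive table after staging
        "--hive-overwrite",
    ]

cmd = build_sqoop_import("jdbc:mysql://db.example.internal/app",
                         "accounts", "/staging/accounts", "id")
# subprocess.run(cmd, check=True) would execute this on an edge node
# with Sqoop installed.
```

The `--split-by` column is what lets Sqoop fan the import out across multiple mappers.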
Since we cater to different teams, there are multiple ways of processing the data. While some teams need quick, interactive reports, others find it more time-efficient to register a job and have the report delivered to their email. We use tools like Oozie and Luigi to create custom workflows.
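As a sketch of such a workflow, an Oozie definition can chain a Sqoop ingest into a Hive transform. The action names, table, paths, and script below are hypothetical, not taken from our actual pipelines.

```xml
<workflow-app name="monthly-report" xmlns="uri:oozie:workflow:0.5">
  <start to="ingest"/>
  <action name="ingest">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect ${jdbcUrl} --table accounts --target-dir /staging/accounts</command>
    </sqoop>
    <ok to="transform"/>
    <error to="fail"/>
  </action>
  <action name="transform">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>monthly_report.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```

The `ok`/`error` transitions are what make the flow repeatable: a failed ingest never triggers the downstream Hive step.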
Some other teams tapped into Apache Spark for its batch-processing capabilities. Spark, which has APIs in Scala, Java, and Python, runs on YARN in our Hadoop clusters. By keeping working data in memory and spilling to physical disk only when needed, it saves time and resources for our developers.
While we opened up Spark for batch processing, we knew that it provided the constructs for streaming analytics as well. At Freshworks, we had a team working on Central Kafka Platform and we were looking to leverage the same downstream. Developers were able to build a real-time streaming application using Apache Spark, and many such pipelines were being used by our machine learning team to build bots.
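A minimal PySpark Structured Streaming job of this shape, reading from a Kafka topic and counting events, might look like the sketch below. The broker address, topic name, and event format are placeholders, and the parsing logic is illustrative.

```python
# Sketch of a Spark Structured Streaming job consuming product events
# from Kafka. Broker address, topic name, and event schema are all
# illustrative placeholders, not our actual configuration.
import json

def parse_event(raw: bytes):
    """Parse one Kafka message value into (product, event_type)."""
    event = json.loads(raw.decode("utf-8"))
    return event.get("product", "unknown"), event.get("type", "unknown")

def main():
    # Requires a Spark installation with the Kafka connector on the classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("event-counts").getOrCreate()
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka.internal:9092")
              .option("subscribe", "product-events")
              .load())
    # In a real job, parse_event would run inside a UDF; here we simply
    # count raw message values per micro-batch.
    counts = (events.selectExpr("CAST(value AS STRING) AS value")
              .groupBy("value")
              .count())
    (counts.writeStream
           .outputMode("complete")
           .format("console")
           .start()
           .awaitTermination())

# main()  # submit with spark-submit on the cluster; not run here
```

Because Structured Streaming shares the DataFrame API with batch Spark, a team that already knows the batch path can reuse most of it for streaming.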
When we started with the cluster sizing experiment, we were clear that we would not go ahead with transient clusters. We needed a combination of a different family of machines to address different workloads. The diagram that follows depicts how we started classifying our machines based on various workloads and types.
We chose an EC2 instance with ephemeral storage, as it provided the highest throughput to HDFS. Historically, it has been a clear choice for most Hadoop users.
For ease of use, we stored our data in the form of Hive tables, with data pulled from RDBMS stored the same way as in the source. We ran comprehensive tests to choose the right compression format (Snappy, gzip, and the like), balancing compression ratio against performance while keeping CPU usage in mind.
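The tradeoff we were measuring can be illustrated with a stdlib codec: higher compression levels buy a smaller file at the cost of more CPU time. Here zlib level 1 stands in for a fast codec like Snappy and level 9 for an aggressive gzip setting; the payload is synthetic.

```python
# Illustrating the compression-ratio vs CPU tradeoff with stdlib zlib.
# Level 1 approximates a fast codec (Snappy-like); level 9 a heavy one.
import time
import zlib

payload = b"ticket_id,account,status,created_at\n" * 50_000  # synthetic data

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"level={level} ratio={ratio:.1f}x time={elapsed * 1000:.1f}ms")
```

On real, less repetitive data the gap in ratio narrows and the gap in CPU time widens, which is why the choice has to be tested against actual workloads.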
We also ran numerous tests and changed the storage format several times to suit our needs. Eventually, we found that columnar formats like Apache Parquet offered a good balance between fast data ingestion and fast random data lookup.
We started with a very simple pipeline: a Hadoop stack with Sqoop, Hive, and Oozie to ingest data from RDS. We faced common issues like handling multiline columns, accented characters, and incremental updates.
Taking into consideration the kind of apps that were running on our platform, we needed our data to be synchronized with the data source. We started by revamping the existing stack and used a known use case to popularize the platform. The stack sat behind Tableau (which can connect to Hive) as the visualization platform to make easy-to-read charts, infographics and reports, demonstrating the ease of integration with BI tools.
In order to bring in more stakeholders to the table and generate new ideas for data usage, we started with a customer-churn use case to help our customer success team. We noticed that certain periodically-run reports were resource-intensive, and could be a good start to help us advertise the system and its capabilities throughout the organization. We automated the flow using Hive and Pig scripts, and helped the customer success team have the reports on the first of every month.
The stack began to automate a number of reports and helped the entire organization understand the importance of such a project. Later, we released the data platform to our largest consumer: the machine learning team.
This is the first of two parts on why we built our data lake. We wanted to share our story, and start a discussion with engineers across the world who’d be interested in the topic.
Read the second part, where we dive into the engineering nuances of the data lake.