How Twitter Overcame Its Real-Time Data Challenges

Learn how Twitter addressed its real-time processing needs with a distributed, fault-tolerant stream processing engine.

Twitter’s real-time back-end infrastructure processes more than 500 million tweets per day. In 2013, when Karthik Ramasamy’s company, Locomatix, was acquired by Twitter, the then 7-year-old social media platform needed to upgrade its real-time processing and analytics engine. Ramasamy, until recently an engineering manager at Twitter, told the audience at Qubole’s 2017 Data Platforms Conference how the company tackled that challenge.

Ramasamy said that data has its highest value when it’s first produced. In marketing, for example, understanding and responding proactively to shifts in consumer behavior requires real-time analytics. Being proactive rather than reactive is often the fine line that separates success from failure in any industry.

Heron is born

To address Twitter’s real-time processing needs, Ramasamy and his team developed Heron, a “real-time, distributed, and fault-tolerant stream processing engine,” which has been in use since 2014.

Heron established a stable architecture for Twitter to achieve the following:

  • Provide backward interface compatibility to Apache Storm (see the topology sketch after this list).
  • Extract, transform, and load data in real time.
  • Disaggregate and classify data as it’s being created by Twitter users.
  • Quickly identify and take action regarding fraudulent Twitter accounts.
  • Improve the speed of real-time trending.
  • Perpetually and rapidly update machine learning models to match real-time data processing.
  • Classify, in near real time, the morass of media flowing through users’ Twitter feeds.
  • Rapidly analyze machine (server) health to predict the probability of failures in network and memory capacity.
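
The first point above, backward compatibility with Apache Storm, means that topologies written as Storm spouts and bolts are designed to run on Heron with minimal changes. The sketch below is a minimal, hypothetical example of such a topology written against the open-source Apache Storm API (package names vary between releases; older versions use backtype.storm rather than org.apache.storm). The TweetSpout, HashtagCounter, and stream names are illustrative placeholders, not Twitter’s actual pipeline.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

/** Hypothetical hashtag-counting topology written against the Storm API. */
public class HashtagTopology {

    /** Stub spout that emits one tweet's text per tuple; a real spout would read from a queue. */
    public static class TweetSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100); // throttle the stub; real data would arrive from an external source
            collector.emit(new Values("data is most valuable when first produced #realtime"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("tweet"));
        }
    }

    /** Counts hashtags as tweets stream through and emits running counts. */
    public static class HashtagCounter extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getStringByField("tweet").split("\\s+")) {
                if (word.startsWith("#")) {
                    long count = counts.merge(word, 1L, Long::sum);
                    collector.emit(new Values(word, count));
                }
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("hashtag", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TweetSpout(), 2);
        // A single counter instance keeps the running counts consistent; a production
        // topology would split parsing and counting into separate bolts and shard the
        // counts with a fields grouping on "hashtag".
        builder.setBolt("hashtag-counts", new HashtagCounter(), 1).shuffleGrouping("tweets");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("hashtag-topology", conf, builder.createTopology());
    }
}
```

A spout-and-bolt graph of this shape is the basic unit that an engine like Heron schedules, monitors, and restarts on failure, which is what makes the capabilities listed above possible at Twitter’s scale.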

Building a culture of data

Ramasamy emphasized the importance of a data-driven culture for reaping the benefits of a well-engineered, real-time, data-oriented system. Self-service is a key component of giving internal users access to the data, he said, but a centralized data team is essential for implementing that self-service framework.

Ramasamy echoed the stance of LinkedIn’s Shrikanth Shankar on having specialized sub-groups within the data team. For example, a dedicated ETL (extract, transform, and load) team or person could focus on ensuring the usability of data throughout an organization. If a company has reached the point in its big data initiative where it has hired data scientists, the ETL team could create an interface designed specifically to meet the data-pulling requirements unique to data science work (deriving actionable insight using statistical and machine learning models).

Depending on the size of the enterprise, data scientists can be assigned to individual departments, such as marketing and production. This shortens the delay between ETL and analysis, since each team has a streamlined path from gathering raw data to determining its utility (not all data is actionable).

For additional information on Heron use cases, or to read extensive details about Twitter’s transition to Heron for real-time processing, visit the Heron documentation resources on GitHub.

Related:
Lessons from LinkedIn: Faster Insights Through a Unified Data Ecosystem

More on Qubole’s 2017 Data Platforms Conference

About Kat Campise

Kat Campise is a journalist and data scientist. She has a Ph.D. in educational psychology from the University of Nevada, Las Vegas.
