
Using ORC Files to Speed Data Analytics


Why trying to analyze big data in the form of CSV and TSV files can be a colossal waste.

Written By
Chris Raphael
Apr 4, 2017

The comma-separated or tab-separated value (CSV or TSV) file is the go-to format for data analysis. If you have a CSV or TSV file on hand, almost any business intelligence, data visualization, or advanced analytics tool can handle it.

Even Microsoft Excel can handle CSV and TSV files, provided they are not too large.
And that’s the challenge: When it comes to big data, which may span millions of rows and hundreds of columns, spending time trying to analyze CSV and TSV files can be an expensive slog, says Charles Pritchard, a big data expert at Expedia.

“I’ve seen companies blow $1 million on learning how to grab CSV files with their own custom code,” Pritchard said. “They have to connect to the API as well as grab data from a server. And I end up seeing 12-to-18 month projects, and I am thinking if someone could have provided the file in an optimized format, we could have cut down that project scope tremendously.”

Pritchard advocates use of the Optimized Row Columnar (ORC) file format, which grew out of Apache Hive as an effort to make data storage in Hadoop more efficient. ORC files have several innovations:

Optimized storage format: The files contain groups of row data called stripes. Within each stripe, data is laid out column by column in row groups of 10,000 rows, so the values of a single column are stored contiguously rather than interleaved with the other columns.

Indices: For each group of 10,000 rows, an in-line index records statistics such as the minimum, maximum, and sum of the values in each column.

Faster analysis and compression: The ORC format also allows readers to skip row groups that cannot match a query’s filters. Auxiliary information is contained in a file footer, and a postscript holds compression parameters.


ORC file layout. Graphic credit: Apache Software Foundation.
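To make the layout concrete, here is a minimal sketch using the pyarrow library (an assumption; any ORC-capable tool would do) that writes a small table to ORC and reads back the stripe and row counts recorded in the file footer. The file and column names are hypothetical.

```python
# A minimal sketch, assuming pyarrow is installed; names are hypothetical.
import pyarrow as pa
import pyarrow.orc as orc

# Build a small table; in practice this would come from a real data source.
table = pa.table({
    "visitor_id": ["a1", "a2", "a3"],
    "page": ["/home", "/search", "/checkout"],
})

# Write it as ORC; the compression codec is configurable (zlib here).
orc.write_table(table, "visits.orc", compression="zlib")

# Reopen the file and inspect the layout described above.
f = orc.ORCFile("visits.orc")
print(f.nrows)     # total row count, read from the file footer
print(f.nstripes)  # number of stripes in the file
print(f.schema)    # column schema
```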

“It cuts down tremendously in the size of the file and the processing time needed to access the file,” Pritchard said. “Ten gigs of compressed ORC can hold the same data as 500 gigs of uncompressed CSV, depending on the data.”

Under the ORC format, users can also skip entire blocks of 10,000 rows, “making it much faster for processing in a big data world,” he said.
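This skipping is what query engines call predicate pushdown. As an illustration, a PySpark reader (the path and column name below are hypothetical) can apply a filter that the ORC reader uses to rule out stripes and row groups whose min/max statistics cannot match:

```python
# A hedged sketch, assuming a Spark environment; path and column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-skip-demo").getOrCreate()

# Reading ORC is built into Spark. The filter below is pushed down to the
# ORC reader, which consults the in-file indices and skips row groups whose
# min/max statistics show they cannot satisfy the predicate.
visits = spark.read.orc("s3://example-bucket/weblogs.orc")
recent = visits.filter(visits.event_date >= "2017-01-01")
recent.show()
```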

Big data analysis and CSV files: a match made in hell?

Pritchard offered an example of log files from website visits. Those log files may contain millions of rows of data. A program such as Adobe Site Catalyst, which is designed to enhance the user experience on websites, will store unique identifiers for website visitors: where they are from, plus their numerous interactions with various website pages. For a Fortune 500 company, the amount of data generated from its website can be enormous, and it often comes in the form of CSV or TSV files, some of them 200 to 300 columns wide for every user interaction.

Imagine trying to analyze user interactions on a website with the file in that format.

“The vast majority of the data is repeated for that user visit. So you’ll see a user interact a few dozen times on a website in a day: their browser info isn’t changing, their IP isn’t changing, etc. But it’s all repeated in the CSV/TSV file,” Pritchard said. “There’s room to compress that in standard CSV/TSV, but ORC really shines here.”
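One way to see that effect is to convert such a log export and compare sizes on disk. A rough sketch, again assuming pyarrow and using hypothetical file names:

```python
# A rough sketch, assuming pyarrow; file names are hypothetical.
import os
import pyarrow.csv as pv
import pyarrow.orc as orc

# Parse the raw log export, then write it back out as compressed ORC.
# Highly repetitive columns (browser info, IP address) compress very well
# once they are stored contiguously in a columnar layout.
table = pv.read_csv("weblog.csv")
orc.write_table(table, "weblog.orc", compression="zlib")

print("CSV bytes:", os.path.getsize("weblog.csv"))
print("ORC bytes:", os.path.getsize("weblog.orc"))
```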

Existing data tools might max out trying to analyze such data, which may also need to be filtered to isolate certain variables. The project then gets handed off to the engineering team.

“And it ends up as a mix,” Pritchard said. “So they’ll end up using kind of some programs and start manually doing things and copying and pasting values into spreadsheets to get aggregates.” Fusing that data with data stored in another system, say a customer relationship management platform, might be another project entirely.

“The journey of doing that and the code that is kind of created and wasted – it’s not trivial,” Pritchard said. “I’ve seen developers write thousands and thousands of lines of code.” Even in cases where code has been developed to translate CSV and TSV files into the ORC format, developers still need to perform quality assurance on the data to make sure nothing was lost in conversion.
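That quality-assurance step can start as simply as comparing row counts and cheap per-column checksums between the source file and the converted copy. A minimal sketch, with hypothetical file and column names:

```python
# A minimal post-conversion check, assuming pyarrow; names are hypothetical.
import pyarrow.csv as pv
import pyarrow.orc as orc
import pyarrow.compute as pc

source = pv.read_csv("weblog.csv")
converted = orc.ORCFile("weblog.orc").read()

# Row counts should survive the conversion exactly.
assert source.num_rows == converted.num_rows, "row counts differ"

# Compare a numeric column's sum as a cheap checksum of its contents.
src_sum = pc.sum(source["bytes_sent"]).as_py()
dst_sum = pc.sum(converted["bytes_sent"]).as_py()
assert src_sum == dst_sum, "column sums differ"
print("conversion verified")
```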

So what has to happen to speed the process? Analytics and business intelligence vendors that offer CSV and TSV compatibility need to offer the same support for ORC.

Sound too simple? Pritchard will be speaking more on the subject at the Data Platforms 2017 conference.


Chris Raphael covers fast data technologies and business use cases for real-time analytics. Follow him on Twitter at raphaelc44.
