Parallel Processing 2.0: Round Two, Ready Fight!


Vectorized CPUs have been gestating for decades. Now they are emerging as the innovator’s choice as they use parallel CPU instructions when working on a single user’s workload.

Gartner, IDC, and others say analytics is the #1 or #2 priority for large and small corporations. Underlying all analytics is the need to cope with big data: hundreds or thousands of terabytes of storage. Why? Because more data means greater accuracy and more hidden insights. Some data warehouses exceed 50 petabytes; data lakes can exceed 200PB of storage. But big data storage is only half the story. Big data also means horrible wait times for results. Anyone can attach 100 terabytes to a laptop, but analyzing 100 terabytes on a laptop could take weeks or months, which explains investments in massively parallel processing (MPP).

Massively parallel processing technology is the foundation of data warehouses, data lakes, NoSQL, data science, and a lot of artificial intelligence. MPP + analytics is a $67B worldwide market with 10% growth. Getting results fast means speeding up the time-to-value and less boredom waiting for answers. It also means delivering results to managers and peers sooner.

Software is the MPP hallmark. Many CPU cores per server execute many tasks concurrently for multiple users. One core is tasked with sending network messages. Another core filters incoming data from storage. Consolidating results across servers for each user is the core of the value that NoSQL and MPP databases deliver.

Round two: Vectorization begins with GPUs

Vectorized CPUs have been gestating for decades. Now they are emerging as the innovator’s choice.  Vectorization is a fancy word covering both GPUs and Intel AVX instructions. Vectorization uses parallel CPU instructions working on a single user’s workload.  Like synchronized dancers, each vector processor executes the same instruction against different data in memory.
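As a rough illustration of the synchronized-dancers idea, here is a toy model in plain Python. It is a sketch, not real SIMD: actual vector instructions run in hardware registers, while this code only counts how many "instructions" each approach would need. The names `scalar_add`, `vector_add`, and `LANES` are hypothetical, and the lane count of eight is an assumption chosen for the example.

```python
# Toy model of SIMD: one "instruction" applies the same operation to a
# fixed number of lanes at once. Pure Python, so it shows the idea only.

LANES = 8  # assumption: eight values handled per vector instruction


def scalar_add(a, b):
    """Add element by element: one instruction per element."""
    result = []
    instructions = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def vector_add(a, b, lanes=LANES):
    """Add in chunks of `lanes` elements: one instruction per chunk."""
    result = []
    instructions = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # Conceptually a single instruction: every lane performs the same add.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions


a = list(range(64))
b = list(range(64, 128))
scalar_result, scalar_count = scalar_add(a, b)
vector_result, vector_count = vector_add(a, b)
print(scalar_count, vector_count)  # prints: 64 8
```

Both functions return the same sums; the only difference in this model is the number of steps taken to produce them.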

Large memory array calculations finish in seconds instead of hours. Analysts and data scientists spend less time waiting, more time learning. Or, they can ingest real-time data streams as fast as the data arrives. Vector processing does for compute power what MPP does for storage. While MPP exploits multiple cores and servers (coarse-grained), vectorization exploits parallelism inside the CPU itself (fine-grained). Exploiting both is nirvana.
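The two levels of parallelism can be sketched together in a few lines of Python. This is only a structural illustration under stated assumptions: `WORKERS`, `LANES`, `partition`, `lane_sum`, and `parallel_sum` are hypothetical names, the thread pool stands in for cores or servers, and the inner loop stands in for a vector register. Python threads do not actually speed up this kind of arithmetic, so the sketch shows the shape of the work, not its performance.

```python
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4  # hypothetical number of cores or servers (coarse-grained)
LANES = 8    # hypothetical vector width in elements (fine-grained)


def partition(values, parts):
    """Split values into at most `parts` contiguous slices of similar size."""
    size = max(1, -(-len(values) // parts))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]


def lane_sum(values, lanes=LANES):
    """Sum a slice by keeping `lanes` running totals, as a vector register would."""
    totals = [0] * lanes
    for start in range(0, len(values), lanes):
        chunk = values[start:start + lanes]
        # Conceptually one vector add: each lane accumulates its own element.
        for lane, value in enumerate(chunk):
            totals[lane] += value
    return sum(totals)  # final reduction across the lanes


def parallel_sum(values, workers=WORKERS):
    """Coarse-grained split across workers, fine-grained lanes inside each."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lane_sum, partition(values, workers)))
    return sum(partials)


data = list(range(1_000))
total = parallel_sum(data)
print(total == sum(data))  # prints: True
```

The outer split mirrors what MPP software does across servers; the inner lanes mirror what a vector unit does inside one core.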

Vectorization had been around forever but only gained commercial traction around 2001. It started when NVIDIA built graphics processing units (GPUs). GPUs rendered video game displays in real time. In 2001, a GPU had four concurrent pixel pipelines. Today’s GPUs have hundreds of mini-CPUs on one silicon microchip, sometimes thousands.

Gamers love their graphics, as do video studios like Pixar. But what opened the GPU commercial floodgates? In 2007, NVIDIA released general-purpose programming interfaces. Meteoric adoption and innovation then burst from university graduate students and software vendors.

Voilà! Today, GPUs are the brains inside blockchains, NFTs, and speech recognition. Self-driving cars use GPUs as navigation “eyes.” Other GPU workloads include cryptographic (security) algorithms, real-time stream manipulations, neural networks, natural language processing, and audio/video searches. All because thousands of programmers got software that makes mini-CPUs dance in synchronization.

Intel’s vectorization

Intel added Advanced Vector Extensions (AVX) instructions to Sandy Bridge CPUs in 2011. Intel’s processors, and the software that exploits AVX, have evolved together ever since. Yes, you already have vector processing capability on-premises and in many cloud computers.

Today, each AVX-512 instruction can operate on a 512-bit register as eight simultaneous 64-bit calculations. Now consider an Intel server CPU with 28 cores. That’s up to 224 calculations (aka synchronized dancers) per clock tick.
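The arithmetic behind that figure is short enough to check in code. The sketch below uses a hypothetical helper, `peak_lanes`, and assumes the idealized case in which every core completes one full-width vector instruction on every clock tick; real workloads rarely sustain that peak.

```python
def peak_lanes(register_bits, element_bits, cores):
    """Idealized count of simultaneous calculations per clock tick."""
    lanes_per_core = register_bits // element_bits
    return lanes_per_core * cores


# 512-bit registers, 64-bit values, 28 cores (the figures used above).
print(peak_lanes(512, 64, 28))  # prints: 224

# The same registers hold twice as many 32-bit values.
print(peak_lanes(512, 32, 28))  # prints: 448
```

Narrower data types fit more values into each register, which is one reason the choice of data type matters for vectorized workloads.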

Hint: 224 > 28. And unlike multitasking, AVX instructions are extremely memory- and CPU-efficient.

The vectorization concepts used by AVX CPUs and GPUs, known as SIMD (single instruction, multiple data), are somewhat similar. But comparisons between AVX and GPUs are always misleading. They each have unique strengths the other does not have. Don’t bother counting CPUs or mini-CPUs. Look instead at the workloads that GPUs and AVX instructions each excel at. Corporations need both kinds of hardware.

Intel-based vectorization excels at data management pipelines, databases (sorting, calculations, aggregations), data compression, decompression, real-time streams, artificial intelligence, and machine learning.    

Tackling real-time data streams

It follows that real-time logistics and tracking are popular vectorized workloads that benefit from parallel processing. Typical real-time vector analyses of concurrent streams include:

  • Correlating Internet of Things sensor streams
  • Monitoring wildfires, storms, or flood response
  • Tracking trucks, shipments, sales receipts
  • National security and battlefield logistics
  • Machine learning recommendation engines
  • Fast loading of data streams to disk (especially columnar formats)

Bonus: Where available, MPP server-side vectorization can render millions of data points in real time. This avoids choking your BI server or PC on a terabyte of data. Some vectorized MPP servers render the visuals themselves, sending only a dozen megabytes to your screen. A life-saver!

Checklist for vectorized products for parallel processing

The first thing to look for is “native vectorization” software. Many major databases use Intel AVX instructions on a few SQL operations, but your programmers can’t see it, can’t measure it, and can’t be sure it’s ever being used. All you get are technical manuals and a sales pitch. Native vectorization, by contrast, provides easy-to-use tools and APIs for your programmers. Ideally, languages such as SQL and Python should invoke vectorization, avoiding arcane hardware-level coding.

Other attributes to look for include: 

  • Vector software running on an MPP server cluster for scalability (8 to 64 servers required).
  • Vectorized hardware in the cloud AND on-premises.   
  • Connectivity to open-source Kafka or other popular real-time streaming tools.
  • Software that exploits both GPUs and Intel AVX for different workloads. Not everything fits in one style of vector processing.
  • A strong library of vectorized built-in functions, e.g., geospatial, temporal, graph, AI, etc.

Be sure to call customer references and ask them about each item on the checklist above.

Parallel processing 2.0 is here — now! Ready FIGHT!


About Dan Graham and Chad Meley

With over 30 years in IT, Dan Graham has been a DBA, IBM’s Global BI Solutions Strategy Director, and General Manager of Teradata’s 6700 high-end servers. His skills include MPP systems, data warehouses, big data, data lakes, graph analytics, benchmarking, and IoT. Dan is currently an independent consultant.

Chad Meley is the CMO at Kinetica, the database for time and space. He has more than 20 years of experience in leadership roles centered around data and analytics marketing and information technology systems and encompassing all facets of management, strategy, planning, and operations for companies such as Teradata, Electronic Arts, Dell, and FedEx.
