The two disadvantages can be addressed by the column-store format, which organizes a table as a sequence of columns and offers two advantages: (1) only the required columns are read from storage during the selection process of a query; and (2) a high compression rate can be achieved because the values in a column have similar data types. The column-store format also has two disadvantages. First, if the result of a query needs operations across multiple columns that may be stored in different tracks on a disk, or even worse on different nodes connected by networks, significant delays may come from random disk accesses and/or remote accesses over the network. Second, it is not well suited to write-intensive workloads in which rows or records are frequently added, deleted, and updated.
As data volumes become increasingly large, tables have to be partitioned and placed across many computing nodes in clusters. Thus, neither a pure row-store nor a pure column-store is efficient for processing tables in a distributed way in large clusters.
The RCFile Data Format
The RCFile (Row Columnar File) data format is designed and implemented for data processing in distributed systems. RCFile is a hybrid data format that organizes a table as a sequence of row groups. A row group consists of multiple rows. Furthermore, each row group is partitioned into columns. In this way, the RCFile format combines the merits of both row-store and column-store, which is particularly desirable for big data processing in clusters.
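The row-group idea can be illustrated with a minimal in-memory sketch in Python. It is not the actual RCFile implementation: the function names (to_row_groups, read_column, read_row) and the rows_per_group parameter are hypothetical, chosen only for illustration, and on-disk details such as metadata and compression are ignored. The sketch first splits a table horizontally into row groups and then stores each group column by column.

```python
from typing import List, Sequence, Tuple

Row = Tuple[object, ...]
RowGroup = List[List[object]]  # one list of values per column


def to_row_groups(rows: Sequence[Row], rows_per_group: int) -> List[RowGroup]:
    """Split rows into row groups, then lay out each group column by column.

    Hypothetical helper for illustration; rows_per_group is an assumed knob.
    """
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    groups: List[RowGroup] = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        # Transpose the chunk so that each column's values are contiguous.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


def read_column(groups: List[RowGroup], col_index: int) -> List[object]:
    """Read one column by touching only that column's chunk in each row group."""
    values: List[object] = []
    for columns in groups:
        values.extend(columns[col_index])
    return values


def read_row(groups: List[RowGroup], rows_per_group: int, row_index: int) -> Row:
    """Rebuild a full row; all of its fields live in a single row group."""
    group_index, offset = divmod(row_index, rows_per_group)
    return tuple(column[offset] for column in groups[group_index])


table = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0),
         (4, "d", 40.0), (5, "e", 50.0)]
groups = to_row_groups(table, rows_per_group=2)
names = read_column(groups, 1)
last_row = read_row(groups, 2, 4)
```

In this sketch, reading a single column skips the other columns in every row group, as in a column-store, while all fields of a given row stay inside one row group, so rebuilding a row never has to reach across groups or, in a distributed setting, across nodes.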
The Usage of RCFile in Various Data Processing Systems
RCFile has been used in several major big data systems, including HBase, Hive, HCatalog, Impala, Pig, Presto, and others. Conventional databases have also been enabled to access data in the RCFile format, including IBM database, Microsoft SQL Server, Oracle database, SAS in-database products, Teradata, and others.