
Support for Parquet Predicate Pushdown Filter for Data Stored in Amazon S3

The combination of Spark and Parquet is a popular foundation for building scalable analytics platforms; performance, scalability, and ease of use are the key qualities that make this stack appealing. Predicate pushdown is a feature of Spark and Parquet that can improve query performance by reducing the amount of data read from Parquet files. It works by evaluating filter predicates from the query against metadata stored in the Parquet files. Parquet stores statistics (such as per-column minimum and maximum values) in the metadata section of its files, and Spark can use that information to make decisions such as skipping entire chunks of data when the predicate value in the query falls outside the range of values recorded for a given column.
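As a minimal sketch of this behavior (assuming a spark-shell session and write access to a hypothetical path /tmp/parquetdata), you can write a small Parquet dataset and query it with a selective filter; pushdown is controlled by the spark.sql.parquet.filterPushdown configuration key, which is enabled by default:

  // Write a small Parquet dataset; each file's footer records
  // min/max statistics for every column.
  spark.range(0, 1000000).toDF("id").write.mode("overwrite").parquet("/tmp/parquetdata")

  // Pushdown is on by default; this filter lets the reader skip any
  // row group whose [min, max] range for "id" does not contain 999999.
  spark.conf.set("spark.sql.parquet.filterPushdown", "true")
  spark.read.parquet("/tmp/parquetdata").filter("id = 999999").count()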

The predicate pushdown filter works with Parquet files stored in Amazon S3 in the same manner as it does with HDFS: instead of reading the whole dataset, Spark uses the statistics stored in the Parquet metadata to prune the data it must scan. This can be verified by running the same SparkSQL script against sample Parquet data stored in both HDFS and S3 and benchmarking the two. The query times are comparable in both cases, and the physical plan can be inspected with the following commands:

  spark.read.format("parquet").load("s3://<sample-bucket>/parquetdata").createOrReplaceTempView("abcd")
  spark.sql("EXPLAIN SELECT * FROM abcd WHERE id=-1").collect()

Both cases produce output similar to the following:

  res1: Array[org.apache.spark.sql.Row] =
  Array([== Physical Plan ==
  *(1) Project [registration_dttm#0, id#1, first_name#2, last_name#3, email#4, gender#5, ip_address#6, cc#7, country#8, birthdate#9, salary#10, title#11, comments#12]
  +- *(1) Filter (isnotnull(id#1) && (id#1 = -1))
     +- *(1) FileScan parquet [registration_dttm#0,id#1,first_name#2,last_name#3,email#4,gender#5,ip_address#6,cc#7,country#8,birthdate#9,salary#10,title#11,comments#12] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://<sample-bucket>/parquetdata], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,-1)], ReadSchema: struct<registration_dttm:timestamp,id:int,first_name:string,last_name:string,email:string,gender:...])

Here you can see that the PushedFilters entry in the FileScan node contains the predicates from the WHERE clause (IsNotNull(id) and EqualTo(id,-1)), which confirms that the filter was pushed down to the Parquet reader. This is what is meant by "predicate pushdown".
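To confirm that the pushed filter is what enables the pruning, the same query can be re-run with pushdown disabled. This is again a sketch for a spark-shell session, using the temporary view abcd registered above:

  // Disable Parquet predicate pushdown and re-examine the plan.
  spark.conf.set("spark.sql.parquet.filterPushdown", "false")
  spark.sql("EXPLAIN SELECT * FROM abcd WHERE id=-1").collect()
  // The FileScan node should now report PushedFilters: [], meaning every
  // row group is read and the filter is applied only after scanning.

  // Restore the default setting when done.
  spark.conf.set("spark.sql.parquet.filterPushdown", "true")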