
[Feature Request]: IcebergIO Predicate Filter Pushdown support for Nested Columns. #37288

@barunkumaracharya

Description


What would you like to happen?

I see that from Apache Beam 2.66.0, after this PR, one can apply a "filter" to an Iceberg Managed IO read operation.
For example, if I have two columns named day and ingestedAt, I can write something like the following and pass it as a map to my Iceberg ManagedIO.read() transform:

filter: "\"day\" = '20260109' AND \"ingestedAt\" = '2026-01-09T15:08:38.122'"
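For reference, a minimal Java sketch of how I pass such a filter through the Managed API (the table name, catalog name, and catalog properties below are placeholders for illustration; the "filter" entry is the relevant part):

    import java.util.Map;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.managed.Managed;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionRowTuple;
    import org.apache.beam.sdk.values.Row;

    public class IcebergFilteredRead {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();

        // Table and catalog settings are placeholders; substitute your own.
        Map<String, Object> config =
            Map.of(
                "table", "db.events",
                "catalog_name", "my_catalog",
                "catalog_properties",
                    Map.of("type", "hadoop", "warehouse", "gs://my-bucket/warehouse"),
                // Predicate pushed down to the Iceberg scan (Beam >= 2.66.0).
                "filter",
                    "\"day\" = '20260109' AND \"ingestedAt\" = '2026-01-09T15:08:38.122'");

        PCollection<Row> rows =
            PCollectionRowTuple.empty(pipeline)
                .apply(Managed.read(Managed.ICEBERG).withConfig(config))
                .getSinglePCollection();

        pipeline.run().waitUntilFinish();
      }
    }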

What I also want to know is whether it is possible to apply this kind of filter on a nested column (a column inside a struct-type column). For example, if I had a column named eventInfo whose type is struct, with a field called zone inside it, could I apply a predicate pushdown filter like:

filter: "\"day\" = '20260109' AND \"ingestedAt\" = '2026-01-09T15:08:38.122' AND \"eventInfo\".\"zone\" = 'abc'"

I tried running an example of this sort, where my struct column was named data, and I saw this exception:

Caused by: java.lang.IllegalArgumentException: Unsupported filter type in field 'data': STRUCT
    at org.apache.beam.sdk.io.iceberg.FilterUtils.convertLiteral(FilterUtils.java:390)
    at org.apache.beam.sdk.io.iceberg.FilterUtils.convertFieldAndLiteral(FilterUtils.java:320)
    at org.apache.beam.sdk.io.iceberg.FilterUtils.convertFieldAndLiteral(FilterUtils.java:305)
    at org.apache.beam.sdk.io.iceberg.FilterUtils.convert(FilterUtils.java:197)
    at org.apache.beam.sdk.io.iceberg.FilterUtils.convertLogicalExpr(FilterUtils.java:271)
    at org.apache.beam.sdk.io.iceberg.FilterUtils.convert(FilterUtils.java:227)
    at org.apache.beam.sdk.io.iceberg.FilterUtils.convert(FilterUtils.java:148)
    ... 41 more

26/01/12 23:24:40 ERROR PipelineException: Invalid resource error: Pipeline execution failed: Failed to create partitions for source ScanSource
java.lang.RuntimeException: Failed to create partitions for source ScanSource
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded.getPartitions(SourceRDD.java:128)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
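In the meantime, the only workaround I see is to keep the pushdown filter on the top-level columns and filter the nested field in the pipeline after the read, which of course prunes nothing at scan time. A rough sketch, assuming the rows PCollection from the read above and a struct column eventInfo with a string field zone:

    import java.util.Objects;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;

    // In-pipeline filtering: runs after the Iceberg scan, so no data files
    // are skipped; only the day/ingestedAt predicates are pushed down.
    PCollection<Row> filtered =
        rows.apply(
            "FilterByNestedZone",
            Filter.by(
                (Row row) -> {
                  Row eventInfo = row.getRow("eventInfo");
                  return eventInfo != null
                      && Objects.equals(eventInfo.getString("zone"), "abc");
                }));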

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
