Can You Compare None to Timestamp? Decoding Delta Lake Timestamp Comparisons in DataFusion

When querying timestamp columns in Delta Lake tables with DataFusion, you might encounter unexpected type-mismatch errors. The issue stems from a discrepancy between the logical field type stored in the Delta log and the physical column type in the underlying Parquet files. Let’s explore a common scenario in which filtering a timestamp column with literals of a different precision breaks both query planning and execution.

Understanding the Timestamp Discrepancy

Consider a Delta Lake table with a TIMESTAMP_MILLIS column representing timestamps in milliseconds. Parquet tools might identify this column as:

optional int64 endOfDsegTime (TIMESTAMP_MILLIS);

However, delta-rs, the native Rust implementation of the Delta Lake protocol, represents the same field in its schema as:

SchemaField {
    name: "endOfDsegTime",
    type: primitive("timestamp"),
    nullable: true,
    metadata: {},
}

Notice the lack of precision information in the delta-rs schema: the bare "timestamp" primitive says nothing about milliseconds. When this schema is converted to Arrow, DataFusion assumes nanosecond precision (Timestamp(Nanosecond, None)) by default, even though the data on disk is stored in milliseconds. This mismatch sets the stage for the errors below.
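
The mismatch is easy to demonstrate at the Arrow level. Here is a minimal sketch using the arrow crate (the variable names are illustrative, not taken from DataFusion’s source):

use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // Physical type of the column as written in the Parquet file (TIMESTAMP_MILLIS).
    let physical = DataType::Timestamp(TimeUnit::Millisecond, None);

    // Type DataFusion assumes when the Delta log only says "timestamp".
    let assumed = DataType::Timestamp(TimeUnit::Nanosecond, None);

    // The two are distinct types, so comparing columns of one against literals
    // of the other needs a coercion that the planner does not supply here.
    assert_ne!(physical, assumed);
}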

Comparing with Specific Timestamp Values

Attempting to filter this column with endOfDsegTime >= TO_TIMESTAMP_MILLIS('2021-03-19T00:00:00') in DataFusion fails at the planning stage:

Error during planning: 'Timestamp(Nanosecond, None) >= Timestamp(Millisecond, None)' can't be evaluated because there isn't a common type to coerce the types to
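
For reference, a hedged reproduction sketch, assuming delta-rs is compiled with its DataFusion integration so that DeltaTable can be registered as a TableProvider; the table path is hypothetical, and the exact APIs vary between versions:

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Open the Delta table; with the `datafusion` feature enabled,
    // delta-rs exposes DeltaTable as a DataFusion TableProvider.
    let table = deltalake::open_table("/path/to/table").await.unwrap();
    ctx.register_table("dseg", Arc::new(table))?;

    // Fails at the planning stage: the log-derived column type is
    // Timestamp(Nanosecond, None), while the literal is Timestamp(Millisecond, None).
    let err = ctx
        .sql("SELECT * FROM dseg WHERE endOfDsegTime >= TO_TIMESTAMP_MILLIS('2021-03-19T00:00:00')")
        .await
        .unwrap_err();
    println!("{err}");
    Ok(())
}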

Switching to TO_TIMESTAMP, which produces nanosecond timestamps, satisfies the planner, since both sides now appear to be Timestamp(Nanosecond, None) according to the log-derived schema. The query nonetheless fails during execution, when Arrow’s compute kernels encounter the actual millisecond data read from the Parquet file:

Cannot evaluate binary expression GtEq with types Timestamp(Millisecond, None) and Timestamp(Nanosecond, None) 
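
At the Arrow level, the failure disappears once both sides share a time unit. A minimal sketch using the arrow crate’s cast kernel to rescale a millisecond array to nanoseconds before any comparison takes place (the values are illustrative; 1_616_112_000_000 ms corresponds to 2021-03-19T00:00:00 UTC):

use std::sync::Arc;

use arrow::array::{ArrayRef, TimestampMillisecondArray, TimestampNanosecondArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // Millisecond data as it actually arrives from the Parquet file.
    let millis: ArrayRef =
        Arc::new(TimestampMillisecondArray::from(vec![Some(1_616_112_000_000), None]));

    // Rescale to nanoseconds so both sides of a comparison use one unit.
    let nanos = cast(&millis, &DataType::Timestamp(TimeUnit::Nanosecond, None)).unwrap();
    let nanos = nanos
        .as_any()
        .downcast_ref::<TimestampNanosecondArray>()
        .unwrap();

    // The cast kernel multiplies the underlying i64 values by 1_000_000
    // and preserves nulls.
    assert_eq!(nanos.value(0), 1_616_112_000_000_000_000);
    assert!(nanos.is_null(1));
}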

Rewriting the table to store timestamps as int96 (read back as nanoseconds) resolves the issue, because the physical type then matches the nanosecond type DataFusion assumes. Interestingly, the schema recorded in the Delta log is identical in both cases, underscoring that Delta Lake makes no explicit distinction between timestamp precisions.

The Challenge of Timestamp Metadata in Delta Lake

DataFusion relies on accurate schema information for query planning and execution, but the current Delta Lake metadata lacks the granularity to distinguish timestamp precisions from the log alone. While Spark handles these scenarios seamlessly, DataFusion requires explicit casts or schema adjustments to compare such columns correctly.

Two workarounds exist today. Reading the Parquet files directly bypasses the Delta log and lets DataFusion infer the correct physical schema, but it negates Delta Lake’s transactional capabilities (see the sketch below). Alternatively, using int96 consistently for timestamp columns aligns the physical type with DataFusion’s nanosecond assumption, but may introduce compatibility issues with other tools and systems that expect TIMESTAMP_MILLIS.

A proper fix would likely require enhancing the Delta Lake metadata to carry precision information for timestamp types, so that DataFusion can plan comparisons accurately without workarounds. For now there is no simple solution, and the right workaround depends on the specific use case.
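
For completeness, here is a sketch of the direct-Parquet workaround mentioned above, assuming a reasonably recent DataFusion release; the table name and file path are hypothetical, and the exact signature of register_parquet differs between versions:

use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register the data files directly, bypassing the Delta log. DataFusion
    // now infers the schema from the Parquet footer itself, at the cost of
    // losing Delta Lake's transactional guarantees and tombstone handling.
    ctx.register_parquet(
        "dseg",
        "/path/to/table/part-00000.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // The column and the literal now agree on millisecond precision.
    let df = ctx
        .sql("SELECT * FROM dseg WHERE endOfDsegTime >= TO_TIMESTAMP_MILLIS('2021-03-19T00:00:00')")
        .await?;
    df.show().await?;
    Ok(())
}

Because the schema now comes from the Parquet footer, the column arrives as Timestamp(Millisecond, None) and the TO_TIMESTAMP_MILLIS literal matches it without any coercion.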
