DuckDB 1.3.2 Crash: Array Columns Bug & Fix
Hey guys, have you ever run into a head-scratcher while working with DuckDB? I recently hit an issue where DuckDB 1.3.2 crashed when querying array columns. Let's dig into the bug, how to reproduce it, and, most importantly, how it was fixed in version 1.4.1. Plenty of people store structured data in array columns, so hopefully there's something here for everyone!
The Bug: DuckDB 1.3.2 and Array Columns
The Problem
So, what exactly was happening? In DuckDB 1.3.2, a specific query involving array columns would crash the database. The error wasn't very helpful at first glance: an "Unknown Error" followed by a "Vector::Reference used on vector of different type" assertion failure. An assertion failure like that is an internal error, which is never good news, but don't worry, this story has a happy ending! In short, DuckDB's vector machinery got confused about the data types inside the array, hit a fatal error, and invalidated the database. This is especially problematic if you're using array columns to store hierarchical data or other structured information.
Reproducing the Crash
The original bug report provided a detailed set of steps to reproduce the issue. You create two tables (t1 and t2) using the iceberg table access method (the SQL below is PostgreSQL syntax; the pg_lake extension that shows up in the verification query later runs these queries through DuckDB under the hood). The second table (t2) is created with an array column named hierarchy_array, and the same column is later added to t1 with ALTER TABLE. The steps insert data, update existing rows, and insert new rows with arrays of different lengths. The crash then appears when performing a slice comparison with the && (overlap) operator on the hierarchy_array column. Below is the SQL that triggers the crash:
DROP TABLE IF EXISTS t1, t2;
CREATE TABLE t1 (
id SERIAL,
name TEXT,
shipper_account_id INT,
created_at TIMESTAMP DEFAULT now()
) USING iceberg;
CREATE TABLE t2 (
id SERIAL,
company_name TEXT,
hierarchy_array TEXT[]
) USING iceberg;
-- Insert a few thousand rows
INSERT INTO t1 (name, shipper_account_id)
SELECT
'order_' || g,
(random() * 5000)::INT
FROM generate_series(1, 5000) AS g;
INSERT INTO t2 (company_name, hierarchy_array)
SELECT
'shipper_' || g,
ARRAY[ '0H' || (10000 + g)::TEXT, (10000 + g)::TEXT ]
FROM generate_series(1, 5000) AS g;
ALTER TABLE t1 ADD COLUMN hierarchy_array TEXT[];
-- Update existing rows
UPDATE t1
SET hierarchy_array = ARRAY[
'0H' || (10000 + id)::TEXT,
(id % 5)::TEXT
];
-- Insert new rows with different array lengths
INSERT INTO t1 (name, shipper_account_id, hierarchy_array)
SELECT
'new_order_' || g,
(random() * 5000)::INT,
ARRAY[
'0H' || (90000 + g)::TEXT,
'EXTRA' || (g % 10)::TEXT,
CASE WHEN g % 2 = 0 THEN 'ALT' ELSE NULL END
]
FROM generate_series(5001, 5200) AS g;
-- Update some random rows
UPDATE t1
SET hierarchy_array = ARRAY['0H11317', '1234567']
WHERE id % 37 = 0;
-- Random slice comparison
SELECT COUNT(*)
FROM t1
WHERE hierarchy_array && ARRAY['0H11317', '999999'];
If you ran this in DuckDB 1.3.2, boom! Crash! The error message would show up, and the database would be invalidated.
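For readers less familiar with PostgreSQL's array operators, `&&` returns true when two arrays share at least one element, which is exactly the comparison the crashing query performs. Here's a minimal Python sketch of that overlap semantics (the function name `arrays_overlap` is just for illustration; `None` stands in for SQL NULL, which never matches anything):

```python
def arrays_overlap(left, right):
    """Model of PostgreSQL's && operator: true if the two arrays
    share at least one non-NULL element."""
    # NULL elements (modeled as None) never compare equal in SQL.
    return any(x is not None and x in right for x in left)

# Rows from the reproduction script, compared against the probe array
# used in the crashing query.
probe = ["0H11317", "999999"]
print(arrays_overlap(["0H11317", "1234567"], probe))        # True: shares "0H11317"
print(arrays_overlap(["0H90001", "EXTRA1", None], probe))   # False: no common element
```

Note that the rows updated with `WHERE id % 37 = 0` all contain `'0H11317'`, which is why the final `COUNT(*)` query has matches to find at all.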
The Stack Trace
The stack trace provided in the bug report gives us a peek into what was happening under the hood. It points to an assertion failure within the vector handling code, specifically when referencing a vector with an incorrect type. This suggests that the issue lies in how DuckDB was processing the array data, likely during the overlap comparison, leading to incorrect type assumptions.
The Fix: DuckDB 1.4.1 to the Rescue!
The Solution
The good news is that the DuckDB team swiftly addressed this issue: the fix shipped in DuckDB 1.4.1. That release resolves the problem, so if you are on 1.3.2, updating to 1.4.1 or later stops the crashes.
Verification
The original report includes a query that was causing the crash. When run in DuckDB 1.4.1, it works without a hitch. This confirms that the bug has been squashed. The query now returns the correct count of matching rows without any errors.
SELECT count(*) AS count
FROM (
  SELECT id, name, shipper_account_id, created_at, hierarchy_array
  FROM read_parquet(
    [
      's3://testbucketpglake/postgres/public/t1/1059354/data/b9a7564d-54d5-43a0-9334-71a1e9dd65bf/data_0.parquet',
      's3://testbucketpglake/postgres/public/t1/1059354/data/45d926bc-b684-4586-a502-c06586c94f3e/data_0.parquet',
      's3://testbucketpglake/postgres/public/t1/1059354/data/e59ed7c4-78b6-488a-bdb0-44e06188d29f/data_0.parquet'
    ],
    schema = map {
      1: {name: 'id', type: 'int', default_value: NULL},
      2: {name: 'name', type: 'string', default_value: NULL},
      3: {name: 'shipper_account_id', type: 'int', default_value: NULL},
      4: {name: 'created_at', type: 'timestamp', default_value: NULL},
      5: {name: 'hierarchy_array', type: 'string[]', default_value: NULL}
    },
    explicit_cardinality = 5340,
    filename = '_pg_lake_filename',
    file_row_number = true
  ) res (id, name, shipper_account_id, created_at, hierarchy_array)
  WHERE (_pg_lake_filename, file_row_number) NOT IN (
    SELECT (file_path, pos)
    FROM read_parquet([
      's3://testbucketpglake/postgres/public/t1/1059354/data/ccb7f51a-685e-46b0-aa50-d89f728f4516.parquet',
      's3://testbucketpglake/postgres/public/t1/1059354/data/4e50caa3-ae73-48a5-9204-a0d788049b34.parquet'
    ])
  )
) t1(id, name, shipper_account_id, created_at, hierarchy_array)
WHERE (hierarchy_array && ARRAY['0H11317'::text, '999999'::text]);
Upgrading to 1.4.1
If you're using DuckDB and running into this issue, upgrading to version 1.4.1 or later is the recommended solution. You can typically upgrade using your preferred package manager or by downloading the latest binaries from the DuckDB website. Ensure that the new version is compatible with your environment before deploying it.
Conclusion: Keeping DuckDB Strong
This bug in DuckDB 1.3.2 highlights the importance of staying up-to-date with software releases, especially for data processing and database systems. The crash was frustrating, but the quick response from the DuckDB team and the fix in 1.4.1 are a testament to their dedication to providing a robust and reliable database. It also shows the value of thorough testing and of community bug reports in open-source projects.
Key Takeaways
- Stay Updated: Always keep your DuckDB version current to benefit from bug fixes and performance improvements.
- Test Thoroughly: Test your queries, especially those using array columns, to identify potential issues early on.
- Report Issues: If you find any bugs, don't hesitate to report them to the DuckDB community. Your feedback helps make the database better for everyone.
So, if you're working with DuckDB and array columns, make sure you're on version 1.4.1 or later to avoid this crash. Happy querying!