How to Build Portable, In-Database Feature Engineering Pipelines with Ibis Using Lazy Python APIs and DuckDB Execution

In this tutorial, we show how to use Ibis to build a portable, in-database feature engineering pipeline that looks and feels like Pandas but runs entirely inside the database. We show how to connect to DuckDB, register data directly in the backend, and express complex transformations using window functions and joins without pulling raw data into local memory. By keeping all transformations lazy and backend-agnostic, we demonstrate how to write analytics code once in Python and rely on Ibis to translate it into efficient SQL.
!pip -q install "ibis-framework[duckdb,examples]" duckdb pyarrow pandas
import ibis
from ibis import _
print("Ibis version:", ibis.__version__)
con = ibis.duckdb.connect()
ibis.options.interactive = True
We install the necessary libraries and initialize the Ibis environment. We establish a DuckDB connection and enable interactive mode so that expression results preview automatically, while all transformations remain lazy and are executed by the backend.
try:
    base_expr = ibis.examples.penguins.fetch(backend=con)
except TypeError:
    # Older Ibis versions do not accept a backend argument
    base_expr = ibis.examples.penguins.fetch()

if "penguins" not in con.list_tables():
    try:
        con.create_table("penguins", base_expr, overwrite=True)
    except Exception:
        # Fall back to materializing locally before registering
        con.create_table("penguins", base_expr.execute(), overwrite=True)

t = con.table("penguins")
print(t.schema())
We load the Penguins dataset and register it explicitly in the DuckDB catalog so that it is available for SQL execution. We inspect the table schema and confirm that the data now resides inside the database rather than in local memory.
def penguin_feature_pipeline(penguins):
    # Base features: bill ratio and a numeric sex indicator
    base = penguins.mutate(
        bill_ratio=_.bill_length_mm / _.bill_depth_mm,
        is_male=(_.sex == "male").ifelse(1, 0),
    )
    # Drop rows with missing values in any column the features rely on
    cleaned = base.filter(
        _.bill_length_mm.notnull()
        & _.bill_depth_mm.notnull()
        & _.body_mass_g.notnull()
        & _.flipper_length_mm.notnull()
        & _.species.notnull()
        & _.island.notnull()
        & _.year.notnull()
    )
    # Per-species window, plus a trailing 3-year window per island
    w_species = ibis.window(group_by=[cleaned.species])
    w_island_year = ibis.window(
        group_by=[cleaned.island],
        order_by=[cleaned.year],
        preceding=2,
        following=0,
    )
    feat = cleaned.mutate(
        species_avg_mass=cleaned.body_mass_g.mean().over(w_species),
        species_std_mass=cleaned.body_mass_g.std().over(w_species),
        mass_z=(
            cleaned.body_mass_g
            - cleaned.body_mass_g.mean().over(w_species)
        ) / cleaned.body_mass_g.std().over(w_species),
        island_mass_rank=cleaned.body_mass_g.rank().over(
            ibis.window(group_by=[cleaned.island])
        ),
        rolling_3yr_island_avg_mass=cleaned.body_mass_g.mean().over(
            w_island_year
        ),
    )
    # Aggregate the engineered features by species, island, and year
    return feat.group_by(["species", "island", "year"]).agg(
        n=feat.count(),
        avg_mass=feat.body_mass_g.mean(),
        avg_flipper=feat.flipper_length_mm.mean(),
        avg_bill_ratio=feat.bill_ratio.mean(),
        avg_mass_z=feat.mass_z.mean(),
        avg_rolling_3yr_mass=feat.rolling_3yr_island_avg_mass.mean(),
        pct_male=feat.is_male.mean(),
    ).order_by(["species", "island", "year"])
We define a reusable feature engineering pipeline using pure Ibis expressions. We derive base features, apply cleaning filters, and use window functions and grouped aggregation to create advanced, database-computed features while keeping the entire pipeline lazy.
features = penguin_feature_pipeline(t)
print(con.compile(features))

try:
    df = features.to_pandas()
except Exception:
    df = features.execute()

display(df.head())
We apply the pipeline to the penguins table and compile it to DuckDB SQL to confirm that all transformations are pushed down to the database. We then execute the pipeline and bring back only the final aggregated results for inspection.
con.create_table("penguin_features", features, overwrite=True)
feat_tbl = con.table("penguin_features")

try:
    preview = feat_tbl.limit(10).to_pandas()
except Exception:
    preview = feat_tbl.limit(10).execute()

display(preview)

out_path = "/content/penguin_features.parquet"
con.raw_sql(f"COPY penguin_features TO '{out_path}' (FORMAT PARQUET);")
print(out_path)
We materialize the engineered features as a table directly inside DuckDB and query it lazily for validation. We also export the results to a Parquet file, showing how computed features can be handed off to downstream analytics or machine learning workflows.
In conclusion, we built, compiled, and executed a fully in-database feature engineering workflow with Ibis on DuckDB. We demonstrated how to inspect the generated SQL, validate the results directly in the database, and export them for downstream use while maintaining portability across analytical backends. This approach reinforces the main idea behind Ibis: keep computation close to the data, reduce unnecessary data movement, and maintain a single, reusable Python codebase that scales from local experimentation to production systems.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of the Artificial Intelligence media platform Marktechpost, which stands out for its extensive coverage of machine learning and deep learning stories that are technically sound yet easily understood by a wide audience. The platform boasts more than 2 million monthly views, reflecting its popularity among readers.



