How to Build Portable, In-Database Feature Engineering Pipelines with Ibis Using Lazy Python APIs and DuckDB Execution

In this tutorial, we show how to use Ibis to build a portable, in-database feature engineering pipeline that looks and feels like Pandas but runs entirely inside the database. We show how to connect to DuckDB, register data directly in the backend, and express complex transformations using window functions and joins without pulling raw data into local memory. By keeping all transformations lazy and backend-agnostic, we demonstrate how to write analytics code once in Python and rely on Ibis to translate it into efficient SQL.
!pip -q install "ibis-framework[duckdb,examples]" duckdb pyarrow pandas
import ibis
from ibis import _
print("Ibis version:", ibis.__version__)
con = ibis.duckdb.connect()
ibis.options.interactive = True
We install the necessary libraries and initialize the Ibis environment. We establish a DuckDB connection and enable interactive mode so that expression results preview automatically, while all transformations remain lazy and are executed by the backend.
try:
    base_expr = ibis.examples.penguins.fetch(backend=con)
except TypeError:
    # Older Ibis versions do not accept a backend argument
    base_expr = ibis.examples.penguins.fetch()

if "penguins" not in con.list_tables():
    try:
        con.create_table("penguins", base_expr, overwrite=True)
    except Exception:
        # Fall back to materializing locally before registering
        con.create_table("penguins", base_expr.execute(), overwrite=True)

t = con.table("penguins")
print(t.schema())
We load the Penguins dataset and register it explicitly in the DuckDB catalog so that it is available for SQL execution. We inspect the table schema and confirm that the data now resides inside the database rather than in local memory.
def penguin_feature_pipeline(penguins):
    # Base features: bill ratio and a numeric sex indicator
    base = penguins.mutate(
        bill_ratio=_.bill_length_mm / _.bill_depth_mm,
        is_male=(_.sex == "male").ifelse(1, 0),
    )
    # Drop rows with missing values in any column the features rely on
    cleaned = base.filter(
        _.bill_length_mm.notnull()
        & _.bill_depth_mm.notnull()
        & _.body_mass_g.notnull()
        & _.flipper_length_mm.notnull()
        & _.species.notnull()
        & _.island.notnull()
        & _.year.notnull()
    )
    # Per-species window, plus a trailing 3-year window per island
    w_species = ibis.window(group_by=[cleaned.species])
    w_island_year = ibis.window(
        group_by=[cleaned.island],
        order_by=[cleaned.year],
        preceding=2,
        following=0,
    )
    feat = cleaned.mutate(
        species_avg_mass=cleaned.body_mass_g.mean().over(w_species),
        species_std_mass=cleaned.body_mass_g.std().over(w_species),
        mass_z=(
            cleaned.body_mass_g
            - cleaned.body_mass_g.mean().over(w_species)
        ) / cleaned.body_mass_g.std().over(w_species),
        island_mass_rank=cleaned.body_mass_g.rank().over(
            ibis.window(group_by=[cleaned.island])
        ),
        rolling_3yr_island_avg_mass=cleaned.body_mass_g.mean().over(
            w_island_year
        ),
    )
    # Aggregate the engineered features by species, island, and year
    return feat.group_by(["species", "island", "year"]).agg(
        n=feat.count(),
        avg_mass=feat.body_mass_g.mean(),
        avg_flipper=feat.flipper_length_mm.mean(),
        avg_bill_ratio=feat.bill_ratio.mean(),
        avg_mass_z=feat.mass_z.mean(),
        avg_rolling_3yr_mass=feat.rolling_3yr_island_avg_mass.mean(),
        pct_male=feat.is_male.mean(),
    ).order_by(["species", "island", "year"])
We define a reusable feature engineering pipeline using pure Ibis expressions. We derive base features, apply cleaning filters, and use window functions and grouped aggregation to create advanced, database-computed features while keeping the entire pipeline lazy.
features = penguin_feature_pipeline(t)
print(con.compile(features))

try:
    df = features.to_pandas()
except Exception:
    df = features.execute()

display(df.head())
We apply the pipeline to the penguins table and compile it to DuckDB SQL to confirm that all transformations are pushed down to the database. We then execute the pipeline and bring back only the final aggregated results for inspection.
con.create_table("penguin_features", features, overwrite=True)
feat_tbl = con.table("penguin_features")

try:
    preview = feat_tbl.limit(10).to_pandas()
except Exception:
    preview = feat_tbl.limit(10).execute()

display(preview)

out_path = "/content/penguin_features.parquet"
con.raw_sql(f"COPY penguin_features TO '{out_path}' (FORMAT PARQUET);")
print(out_path)
We materialize the engineered features as a table directly inside DuckDB and query it lazily for validation. We also export the results to a Parquet file, showing how computed features can be handed off to downstream analytics or machine learning workflows.
In conclusion, we built, compiled, and executed a fully in-database feature engineering workflow with Ibis on DuckDB. We demonstrated how to inspect the generated SQL, validate the results directly in the database, and export them for downstream use while maintaining portability across analytical backends. This approach reinforces the main idea behind Ibis: keep computation close to the data, reduce unnecessary data movement, and maintain a single, reusable Python codebase that scales from local experimentation to production systems.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of the Artificial Intelligence media platform Marktechpost, which stands out for its extensive coverage of machine learning and deep learning stories that are technically sound yet easily understood by a wide audience. The platform boasts more than 2 million monthly views, reflecting its popularity among readers.



