Haoyu (Clara) Su

AI & Data Engineering | Data, health, and humanity - engineered together

Streaming Data the YVR Way: Rainy City, Real-Time Data

[Abstract YVR bike-sharing animation]

──────── 🚲 🌦️ 🚲 ────────

Introduction

On a rainy Vancouver morning, it’s common to see rows of Mobi and Lime bikes waiting at docking stations for the next rider heading to work, school, or the SkyTrain. Bike-share systems have become part of everyday life in the city, supporting a convenient and low-carbon way to get around as Vancouver works toward its goal of 100% renewable energy by 2050.

This project started with a simple question: **🚲 What can real-time bike-share data tell us about how the city moves?**
To explore this, I built a data engineering pipeline that collects live GBFS feeds from Mobi and Lime, streams the data through Kafka on Confluent Cloud, processes it with Databricks, and visualizes the results in a dashboard.
The goal is to turn raw API data into insights about station activity, demand patterns, and urban mobility, while simulating a real-world workflow for building modern, data-driven platforms.

Scope

This project was designed as a practical end-to-end data engineering case study with three goals:

  • build a real-time streaming pipeline for live bike-share data ingestion
  • support historical analytics and trend analysis on station activity and demand
  • simulate a production-style workflow with clear separation between ingestion, staging, transformation, and orchestration

The platform combines streaming infrastructure, cloud processing, data modeling, and dashboard delivery into a single workflow. Rather than focusing only on analytics, the project emphasizes the engineering design choices needed to make a pipeline reliable, modular, and scalable.

Figure 1. High-level architecture of the YVR bike-sharing analytics platform integrating real-time streaming pipelines, historical analytics workflows, and CI/CD orchestration.

Tech Stack

Kafka · Confluent Cloud · Databricks · Apache Spark · Python · Snowflake · dbt · Dagster · Streamlit

The project uses a modern analytics stack built around streaming, transformation, and orchestration:

  • GBFS APIs (Mobi and Lime) for live station information and station status feeds
  • Kafka on Confluent Cloud for real-time ingestion and staging
  • Databricks + Spark Structured Streaming for stream processing and table loading
  • Snowflake for analytical storage
  • dbt for historical transformations and trend modeling
  • Dagster for workflow orchestration and CI/CD-style scheduling
  • Streamlit for dashboard-based visualization

One important design choice in this project was the addition of a data staging layer.

The app pulls real-time data from the Mobi and Lime GBFS APIs, including the station_information and station_status feeds, and publishes the raw JSON responses to Kafka topics hosted on Confluent Cloud. Instead of processing the API data directly, Kafka acts as a staging layer where all incoming messages are stored first. This makes it easier to debug issues, replay data, and keep ingestion separate from downstream processing logic.
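The ingestion step can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the feed URLs, topic names, and Confluent Cloud credentials are all placeholders, and the polling loop assumes the `requests` and `confluent-kafka` libraries.

```python
import json
import time

# Placeholder feed URLs and topic names -- the real project may use different ones.
FEEDS = {
    "mobi.station_status": "https://example.com/mobi/gbfs/en/station_status.json",
    "lime.station_status": "https://example.com/lime/gbfs/en/station_status.json",
}


def to_records(feed_json: dict, source: str) -> list[tuple[str, str]]:
    """Flatten a GBFS station_status payload into (key, value) pairs.

    Keys are station ids so every update for a station lands in the same
    Kafka partition; values are the raw station JSON plus a little metadata.
    """
    stations = feed_json.get("data", {}).get("stations", [])
    fetched_at = feed_json.get("last_updated")
    records = []
    for station in stations:
        value = {"source": source, "fetched_at": fetched_at, **station}
        records.append((str(station["station_id"]), json.dumps(value)))
    return records


def main() -> None:
    # Third-party imports live here so the pure helper above stays dependency-free.
    import requests
    from confluent_kafka import Producer

    # Confluent Cloud connection settings are placeholders.
    producer = Producer({
        "bootstrap.servers": "<broker>.confluent.cloud:9092",
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": "<api-key>",
        "sasl.password": "<api-secret>",
    })
    while True:
        for topic, url in FEEDS.items():
            payload = requests.get(url, timeout=10).json()
            for key, value in to_records(payload, source=topic.split(".")[0]):
                producer.produce(topic, key=key, value=value)
        producer.flush()
        time.sleep(30)  # GBFS feeds typically refresh every 10-60 seconds


if __name__ == "__main__":
    main()
```

Keying messages by station id keeps per-station updates ordered within a partition, which simplifies replay and downstream deduplication.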

After the data is staged in Kafka, Databricks streaming jobs read from the topics and load the data into bronze tables, where the records can later be cleaned and transformed. This staging-first design makes the pipeline feel closer to a real-world data engineering workflow, where ingestion, staging, transformation, and analytics are clearly separated.
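A bronze-load job along these lines could look as follows. This is a hedged sketch rather than the project's actual Databricks code: the broker address, topic name, checkpoint path, and the `topic_to_bronze` naming helper are all assumptions, and the Confluent Cloud SASL options are omitted for brevity.

```python
def topic_to_bronze(topic: str) -> str:
    """Map a Kafka topic to a hypothetical bronze table name,
    e.g. 'mobi.station_status' -> 'bronze.mobi_station_status'."""
    return "bronze." + topic.replace(".", "_")


def main() -> None:
    # PySpark is imported lazily so the helper above has no Spark dependency.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("gbfs-bronze-load").getOrCreate()
    topic = "mobi.station_status"

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<broker>.confluent.cloud:9092")
        .option("subscribe", topic)
        .option("startingOffsets", "latest")
        .load()
    )

    # Keep the payload as raw JSON in bronze; parsing happens downstream.
    bronze = raw.select(
        F.col("key").cast("string").alias("station_id"),
        F.col("value").cast("string").alias("raw_json"),
        F.col("timestamp").alias("kafka_ts"),
        F.current_timestamp().alias("ingested_at"),
    )

    (
        bronze.writeStream.format("delta")
        .option("checkpointLocation", f"/checkpoints/{topic}")
        .toTable(topic_to_bronze(topic))
    )


if __name__ == "__main__":
    main()
```

Storing the value column as an unparsed JSON string in bronze mirrors the staging-first idea: schema enforcement and cleaning are deferred to the silver layer, so upstream API changes never break ingestion.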

What the Pipeline Enables

With this architecture, the platform can support questions such as:

  • Which stations are most active during commuting hours?
  • How does bike availability change across different areas of the city?
  • What demand patterns appear over time?
  • How can real-time station feeds be combined with historical analytics for better operational visibility?
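To make the first question concrete, here is a simplified in-memory illustration of the kind of aggregation the warehouse layer would run (in the real pipeline this would be dbt SQL over Snowflake tables, not Python). The commute-hour window and the activity proxy are assumptions.

```python
from collections import defaultdict


def commute_activity(snapshots, commute_hours=(7, 8, 9, 16, 17, 18)):
    """Rank stations by how much their bike counts change during commute hours.

    `snapshots` is an iterable of (station_id, hour_of_day, bikes_available)
    tuples in time order. The sum of absolute changes between consecutive
    readings is used as a simple proxy for pickups plus returns.
    """
    last = {}
    activity = defaultdict(int)
    for station_id, hour, bikes in snapshots:
        if station_id in last and hour in commute_hours:
            activity[station_id] += abs(bikes - last[station_id])
        last[station_id] = bikes
    return sorted(activity.items(), key=lambda kv: kv[1], reverse=True)
```

For example, a station that swings from 10 bikes to 4 during the morning rush scores higher than one that drifts from 3 to 5, even though both ended up near their starting point by midday.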

By combining streaming ingestion with downstream modeling, the project supports both immediate operational monitoring and longer-term analytical use cases.

Showcase

The application layer is designed to surface live and historical insights through an interactive dashboard.

Planned showcase components include:

  • real-time station activity monitoring
  • station availability and demand trends
  • historical usage summaries
  • mobility insights across time and location
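A minimal Streamlit page for the first of these components might look like the sketch below. The `availability_summary` helper and the empty `stations` placeholder are illustrative assumptions; a real page would query the serving tables in Snowflake.

```python
def availability_summary(stations):
    """Aggregate station records into dashboard-ready totals.

    `stations` is a list of dicts carrying `num_bikes_available` and
    `num_docks_available`, as in a GBFS station_status feed.
    """
    bikes = sum(s.get("num_bikes_available", 0) for s in stations)
    docks = sum(s.get("num_docks_available", 0) for s in stations)
    return {
        "stations": len(stations),
        "bikes_available": bikes,
        "docks_available": docks,
    }


def main() -> None:
    # Streamlit is imported lazily; data loading is stubbed out here.
    import streamlit as st

    st.title("YVR Bike-Share: Live Station Activity")
    stations = []  # placeholder: replace with a query against the serving tables
    summary = availability_summary(stations)
    col1, col2, col3 = st.columns(3)
    col1.metric("Stations reporting", summary["stations"])
    col2.metric("Bikes available", summary["bikes_available"])
    col3.metric("Docks available", summary["docks_available"])


if __name__ == "__main__":
    main()
```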

Repository

The full project is available on GitHub:

yvr-bike-streaming-pipeline

It includes the streaming pipeline, transformation logic, orchestration setup, and project assets used to build this end-to-end bike-sharing analytics workflow.