──────── 🚲 🌦️ 🚲 ────────
Introduction
This project started with a simple question: **🚲 What can real-time bike-share data tell us about how the city moves?**
To explore this, I built a data engineering pipeline that collects live GBFS feeds from Mobi and Lime, streams the data through Kafka on Confluent Cloud, processes it with Databricks, and visualizes the results in a dashboard.
The goal is to turn raw API data into insights about station activity, demand patterns, and urban mobility, while simulating a real-world workflow for building modern, data-driven platforms.
Scope
This project was designed as a practical end-to-end data engineering case study with three goals:
- build a real-time streaming pipeline for live bike-share data ingestion
- support historical analytics and trend analysis on station activity and demand
- simulate a production-style workflow with clear separation between ingestion, staging, transformation, and orchestration
The platform combines streaming infrastructure, cloud processing, data modeling, and dashboard delivery into a single workflow. Rather than focusing only on analytics, the project emphasizes the engineering design choices needed to make a pipeline reliable, modular, and scalable.
Figure 1. High-level architecture of the YVR bike-sharing analytics platform integrating real-time streaming pipelines, historical analytics workflows, and CI/CD orchestration.
Tech Stack
The project uses a modern analytics stack built around streaming, transformation, and orchestration:
- GBFS APIs (Mobi and Lime) for live station information and station status feeds
- Kafka on Confluent Cloud for real-time ingestion and staging
- Databricks + Spark Structured Streaming for stream processing and table loading
- Snowflake for analytical storage
- dbt for historical transformations and trend modeling
- Dagster for workflow orchestration and CI/CD-style scheduling
- Streamlit for dashboard-based visualization
One important design choice in this project was the addition of a data staging layer.
The app pulls real-time data from the Lime and Mobi GBFS APIs, including station_information and station_status, and sends the raw JSON responses to Kafka topics running on Confluent Cloud. Instead of processing the API data directly, Kafka acts as a staging layer where all incoming messages are stored first. This makes it easier to debug issues, replay data, and keep ingestion separate from downstream processing logic.
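The staging step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the topic name, bootstrap-server placeholder, and helper names are assumptions, and the producer uses the `confluent-kafka` Python client. Keying each message by `station_id` keeps all updates for one station in the same partition, which preserves per-station ordering.

```python
import json
from typing import Any


def to_kafka_messages(feed: dict[str, Any]) -> list[tuple[bytes, bytes]]:
    """Turn one GBFS station_status response into Kafka (key, value) pairs.

    Keying by station_id keeps every update for a station in the same
    partition, so downstream consumers see per-station updates in order.
    """
    messages = []
    for station in feed.get("data", {}).get("stations", []):
        key = station["station_id"].encode("utf-8")
        # Carry the feed-level timestamp along with each station record.
        value = json.dumps(
            {"last_updated": feed.get("last_updated"), **station}
        ).encode("utf-8")
        messages.append((key, value))
    return messages


def produce_station_status(feed: dict[str, Any]) -> None:
    """Side-effecting part: push the raw messages to Confluent Cloud.

    Topic name and client config are illustrative placeholders.
    """
    from confluent_kafka import Producer  # assumption: confluent-kafka client

    producer = Producer({"bootstrap.servers": "<confluent-bootstrap>"})
    for key, value in to_kafka_messages(feed):
        producer.produce("mobi_station_status", key=key, value=value)
    producer.flush()
```

Because the raw JSON lands in Kafka unmodified, any bug found later in the transformation logic can be fixed and the affected window of messages replayed without re-hitting the GBFS APIs.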
After the data is staged in Kafka, Databricks streaming jobs read from the topics and load the data into bronze tables, where the records can later be cleaned and transformed. This staging-first design makes the pipeline feel closer to a real-world data engineering workflow, where ingestion, staging, transformation, and analytics are clearly separated.
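In Databricks this read-and-load step would typically be a Structured Streaming job (`spark.readStream.format("kafka")` plus a `from_json` over the value column). The per-message flattening it performs can be sketched as plain Python; the column names below are illustrative, not the project's actual bronze schema.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def flatten_status_message(raw: bytes, ingested_at: Optional[datetime] = None) -> dict:
    """Flatten one staged station_status message into a bronze-style record.

    Mirrors the shape a Spark from_json + select would produce: the raw
    JSON fields are lifted into columns and an ingestion timestamp is
    attached so records stay replayable and auditable.
    """
    record = json.loads(raw)
    return {
        "station_id": record["station_id"],
        "num_bikes_available": record.get("num_bikes_available"),
        "num_docks_available": record.get("num_docks_available"),
        "feed_last_updated": record.get("last_updated"),
        "ingested_at": (ingested_at or datetime.now(timezone.utc)).isoformat(),
    }
```

Keeping the bronze layer this close to the raw payload means cleaning and type coercion can happen later, in the dbt models, without losing the original data.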
What the Pipeline Enables
With this architecture, the platform can support questions such as:
- Which stations are most active during commuting hours?
- How does bike availability change across different areas of the city?
- What demand patterns appear over time?
- How can real-time station feeds be combined with historical analytics for better operational visibility?
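The first of those questions can be answered with a simple aggregation over the bronze-style records. The sketch below is a rough proxy only, under stated assumptions: the commute-hour windows are a guess, and "activity" is counted as the number of status snapshots observed per station in those windows, not verified trips.

```python
from collections import Counter
from datetime import datetime

# Assumption: AM and PM commute peaks in local time.
COMMUTE_HOURS = {7, 8, 9, 16, 17, 18}


def busiest_stations(events: list[dict], top_n: int = 3) -> list[tuple[str, int]]:
    """Rank stations by how many status updates fall in commute hours.

    `events` are records with an ISO-8601 "ingested_at" timestamp; the
    count of commute-hour snapshots serves as a crude demand proxy.
    """
    counts: Counter = Counter()
    for event in events:
        hour = datetime.fromisoformat(event["ingested_at"]).hour
        if hour in COMMUTE_HOURS:
            counts[event["station_id"]] += 1
    return counts.most_common(top_n)
```

In the actual platform this kind of logic would live in a dbt model over the Snowflake tables rather than in application code, but the aggregation is the same.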
By combining streaming ingestion with downstream modeling, the project supports both immediate operational monitoring and longer-term analytical use cases.
Showcase
The application layer is designed to surface live and historical insights through an interactive dashboard.
Planned showcase components include:
- real-time station activity monitoring
- station availability and demand trends
- historical usage summaries
- mobility insights across time and location
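A minimal Streamlit page for the first two components might look like the sketch below. The layout and widget choices are assumptions, not the project's actual dashboard; the pure helper groups availability readings per station so each can be plotted as its own trend line.

```python
from collections import defaultdict


def availability_series(rows: list[dict]) -> dict[str, list[int]]:
    """Group bike-availability readings by station, in arrival order,
    ready to plot as one trend line per station."""
    series: dict[str, list[int]] = defaultdict(list)
    for row in rows:
        series[row["station_id"]].append(row["num_bikes_available"])
    return dict(series)


def render_dashboard(rows: list[dict]) -> None:
    """Illustrative Streamlit page; run with `streamlit run app.py`."""
    import streamlit as st  # only needed when the app actually runs

    st.title("YVR Bike-Share Activity")
    for station_id, values in availability_series(rows).items():
        st.subheader(f"Station {station_id}")
        st.line_chart(values)  # one availability trend per station
```

Streamlit keeps the visualization layer thin: the dashboard reads already-modeled data, so the heavy lifting stays in the pipeline.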
Repository
The full project is available on GitHub.
It includes the streaming pipeline, transformation logic, orchestration setup, and project assets used to build this end-to-end bike-sharing analytics workflow.
