Dream data engineering stack for smart hardware products
Ghz is a data stack that creates and manages IoT message queues and ClickHouse on top of blob storage like S3/GCS/ABS
Explore how Ghz transforms IoT data engineering with up to 80% infrastructure cost savings, lean teams, and scalable pipelines.
Standard IoT data pipeline and trade-offs
A typical IoT data pipeline incorporates these key subsystems and data flows:
MQTT broker
Primary pub/sub gateway that devices connect to (a minimal publish sketch follows this list)
  • Lightweight open protocol for managing millions of devices
  • Not designed for persistent data storage
  • Scale depends on the implementation details of the broker
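To make the first hop concrete, here is a minimal sketch of a device publishing telemetry over MQTT with the paho-mqtt client (1.x-style API); the broker host, topic, and payload fields are hypothetical placeholders, not Ghz specifics.

```python
import json
import time

import paho.mqtt.client as mqtt

# Connect to the broker; MQTT keeps the connection lightweight,
# which is how a single broker can fan in millions of devices
client = mqtt.Client(client_id="device-001")
client.connect("broker.example.com", 1883)
client.loop_start()

while True:
    payload = json.dumps({"ts": time.time(), "temp_c": 42.1})
    # QoS 1 gives at-least-once delivery, but the broker does not
    # persist history: anything not consumed downstream is gone
    client.publish("devices/device-001/telemetry", payload, qos=1)
    time.sleep(1)
```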
Kafka
Secondary pub/sub engine that persistently stores data (a bridge sketch follows this list)
  • Reliably persists data on disk
  • Stored data is still raw data
  • Needs extra processing and querying subsystems
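The sketch below shows why Kafka becomes the second hop: a small bridge service (hypothetical names throughout, using paho-mqtt and kafka-python) subscribes to the broker and republishes payloads into a Kafka topic so they survive on disk.

```python
import paho.mqtt.client as mqtt
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka.example.com:9092")

def on_message(client, userdata, msg):
    # Forward the payload as-is: Kafka persists it reliably,
    # but the stored bytes are still raw, unqueryable data
    producer.send("iot.telemetry.raw", key=msg.topic.encode(), value=msg.payload)

bridge = mqtt.Client(client_id="mqtt-kafka-bridge")
bridge.on_message = on_message
bridge.connect("broker.example.com", 1883)
bridge.subscribe("devices/+/telemetry", qos=1)
bridge.loop_forever()
```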
Processing
Usually Spark/Flink or another stream processing engine that subscribes to Kafka (a sketch follows this list)
  • Enriches data
  • Writes data back to Kafka or stores in an analytics database
  • Tradeoffs vary depending on the processing engine
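As an illustration of this hop, here is a minimal PySpark Structured Streaming sketch that subscribes to the raw Kafka topic, parses JSON, adds one derived column, and writes Parquet; the topic name, schema, and paths are assumptions for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.appName("telemetry-enricher").getOrCreate()
schema = StructType([
    StructField("ts", DoubleType()),
    StructField("temp_c", DoubleType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka.example.com:9092")
       .option("subscribe", "iot.telemetry.raw")
       .load())

# Parse the raw bytes and enrich with a derived column
enriched = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("d"))
            .select("d.*")
            .withColumn("temp_f", F.col("temp_c") * 9 / 5 + 32))

(enriched.writeStream
 .format("parquet")
 .option("path", "s3a://iot-archive/telemetry/")
 .option("checkpointLocation", "s3a://iot-archive/checkpoints/")
 .start()
 .awaitTermination())
```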
Analytics databases
Serves as the query and API layer for building customer-facing products (a query sketch follows this list)
  • Loaded either from Kafka or processing layer
  • Very fast
  • Requires manual tuning and query optimizations
  • Hard to scale and manage distributed setup
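For a feel of the query layer, here is a sketch using the clickhouse-connect Python client against a hypothetical `telemetry` table; the query itself is fast, but keeping it fast is where the manual tuning of table engines, ORDER BY keys, and partitioning comes in.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse.example.com")

# Aggregations like this return in milliseconds on a well-tuned table
result = client.query(
    "SELECT device_id, avg(temp_c) AS avg_temp "
    "FROM telemetry WHERE ts > now() - INTERVAL 1 DAY "
    "GROUP BY device_id"
)
print(result.result_rows)
```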
Object storage
Usually S3/GCS/ABS. Serves as the archive for historical or cold data (a batching sketch follows this list)
  • Cheap and highly available
  • Slow
  • GETs and PUTs are charged per request. Billing footguns need to be prevented
  • Data compaction has to be handled manually
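The request-billing footgun is easy to demonstrate: writing one object per message means one billed PUT per message. The boto3 sketch below (bucket and key names hypothetical) buffers records and flushes large objects instead, amortizing request costs; real pipelines also need compaction of the resulting small files.

```python
import time

import boto3

s3 = boto3.client("s3")
buffer: list[bytes] = []

def on_record(record: bytes) -> None:
    buffer.append(record)
    # One PUT per ~10,000 records instead of 10,000 PUTs:
    # request charges drop by roughly four orders of magnitude
    if len(buffer) >= 10_000:
        s3.put_object(
            Bucket="iot-archive",
            Key=f"telemetry/batch-{int(time.time())}.jsonl",
            Body=b"\n".join(buffer),
        )
        buffer.clear()
```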
Most common problems

1
High infra costs
  • SSDs and VMs contribute to steep infrastructure expenses
  • Each subsystem adds its own infra overheads in VMs and disks
  • Because SSDs and compute are tightly coupled in many subsystems, it is hard to automate downscaling
2
High latency and hard-to-debug data gaps
  • Data jumps through too many hops before reaching analytics databases and becoming usable
  • Any data loss leads to debugging every hop, consuming precious engineering time
3
Specialized teams or per-subsystem enterprise offerings
  • Solving data engineering starts consuming crucial engineering bandwidth instead of delivering customer experience
  • Over time, each subsystem demands either experts for scaling and cost control, or enterprise versions of the subsystems (e.g. EMQX, Confluent, Databricks)
  • Leads to either high engineering budgets or high enterprise tooling budgets
Ghz: Simplifying IoT data engineering
After 5 years of building a full-stack IoT platform at Bytebeam that handles data capture, OTAs, visualizations, and remote diagnostics, Ghz was born out of the need to tackle 2 main problems:
  • Enable a one-person data engineering team to handle 1M devices
  • Design a system that brings costs down by up to 80%
Every design decision is made to prioritize these 2 points over everything else
Architecture
1
Breakthrough #1
A broker that can directly create and serve Delta Lakes in object storage (see the sketch after this list)
  • Data from devices lands directly in S3 in structured Parquet/Delta Lake format
  • If incoming data is in Arrow/JSON, all users have to do is upload a schema
  • Level-1 structured data is available without managing any infra or data jumping through MQTT → Kafka → Processing → Storage
  • The broker takes care of optimizing object storage PUTs, GETs, and storage efficiency
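Because the broker's output is a standard Delta Lake, it is readable with off-the-shelf tooling and no extra hops. A sketch with the `deltalake` Python bindings, assuming a hypothetical bucket path (S3 credentials come from the environment):

```python
from deltalake import DeltaTable

# The same S3 data the broker wrote, queryable immediately:
# no MQTT -> Kafka -> Processing -> Storage chain to traverse
dt = DeltaTable("s3://iot-lake/telemetry")
df = dt.to_pandas()
print(df.head())
```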
2
Breakthrough #2
By designing the broker for object storage (back-of-the-envelope arithmetic follows this list)
  • The entire pipeline becomes totally stateless and naturally separates storage and compute
  • This makes it easy to scale compute up/down, use spot VMs (up to 80% cheaper), and use local disks (50% cheaper and significantly faster than network disks)
  • Object storage is 70% cheaper than SSDs
  • There is no need to over-provision to keep an extra buffer
  • Together this leads to effective savings of up to 90%
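Back-of-the-envelope arithmetic for these claims, with deliberately illustrative baseline prices rather than real quotes:

```python
on_demand_vm = 100.0                  # $/month, illustrative baseline
spot_vm = on_demand_vm * (1 - 0.80)   # spot VMs: up to 80% cheaper
ssd = 50.0                            # $/month, illustrative baseline
object_store = ssd * (1 - 0.70)       # object storage: ~70% cheaper

before = on_demand_vm + ssd
after = spot_vm + object_store
print(f"${before:.0f}/mo -> ${after:.0f}/mo, {1 - after / before:.0%} saved")
# $150/mo -> $35/mo, 77% saved; dropping over-provisioned buffers
# pushes the effective figure toward the ~90% cited above
```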
3
Breakthrough #3
Manages infra and ephemeral ClickHouse (a loader sketch follows this list)
  • Creates VMs to provision ClickHouse as a super-fast analytics engine
  • Provisions loaders to load real-time or historical data from the Delta Lake into ClickHouse
  • Handles ClickHouse maintenance
  • Provides query insights to users to optimize their queries
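A loader's job can be pictured in a few lines: read a Delta Lake snapshot from S3 and insert it into ClickHouse. The sketch below uses the `deltalake` and clickhouse-connect libraries with hypothetical names; in Ghz the loaders themselves are provisioned and managed for you.

```python
import clickhouse_connect
from deltalake import DeltaTable

client = clickhouse_connect.get_client(host="clickhouse.example.com")

# Pull the current snapshot and bulk-insert it into ClickHouse
batch = DeltaTable("s3://iot-lake/telemetry").to_pandas()
client.insert_df("telemetry", batch)
```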
Impact at Bytebeam
1
GCP Bill Reduction
Lowered operational cost from $7000 to $1500 monthly
2
No Backups
We don't spend any time managing backups, as the data always exists in S3
3
Deploy VS Code near real data in the cloud
  • Since loaders automate loading data from S3 into ClickHouse, we create replica VMs to develop and debug data processing logic
  • This significantly improved our dev workflow and lets us ship faster
4
Lean Team
Two engineers manage 500K simulated devices sending data at per-second frequency