Dream data engineering stack for smart hardware products

Dream data engineering stack for smart hardware products
Ghz is a data stack which creates and manages IoT message queues and clickhouse on top of blob storages like S3/GCS/ABS
Explore how Ghz transforms IoT data engineering with upto 80% infra cost-saving, lean teams and scalable pipelines.
Loading...
Standard IoT Data pipeline and trade-offs
A typical IoT data pipeline incorporates these key subsystems and dataflow
MQTT broker
Primary pub/sub gateway which devices connects to
Light weight open protocol to manage millions of devices
Not designed for persistent data storage
Scale dependent on implementation details of the broker
Kafka
Secondary pub/sub engine to persistently store data
Reliably persists data on disk
Stored data is still raw data.
Need extra processing and querying subsystems
Processing
Usually Spark/Flink or other stream processing engines which subscribes to Kafka
Enriches data 
Writes data back to Kafka or stores in an analytics database
Tradeoffs vary depending on the processing engine
Analytics databases
Serves as query and API layer for building customer facing products
Loaded either from Kafka or processing layer 
Very fast
Requires manual tuning and query optimizations
Hard to scale and manage distributed setup x
Object storage
Usually S3/GCS/ABS. Serves as archival for historical or cold data
Cheap and highly available
Slow
GETS and PUTS are charged. Need to prevent billing foot guns
Data compactions should be manually handled
Most common problems
1
2
3
1
High infra costs
SSDs and VMs contribute to steep infrastructure expenses
Each subsystem adds its own infra overheads in vms and disks
Because SSDs and compute are tightly coupled in a lot of subsystems, it's hard to automate downscaling
2
High latency and hard to debug data gaps
Data jumps through too many hops to reach analytics databases and become usable
Any data loss will lead to debugging every hop consuming precious engineering time
3
Specialized teams or individual enterprise offerings
Starts taking crucial engineering bandwidth to solve data engineering instead of delivering customer experience
Over time, each subsystem demands either
 Experts for scaling and cost control or 
Buying enterprise versions of subsystems
EMQX
Confluent
Databricks
Leads to either high engineering budgets or enterprise tooling budget
Ghz: Simplifying IoT data engineering
After 5 years of building full stack IoT platform which handles Data capture, OTAs, Visualizations, Remote diagnostics at Bytebeam, Ghz is born out of the need to tackle 2 main problems
Enable one person data engineering team to handle 1M devices
Design a system which brings down costs down upto 80%
Every design decision is made to prioritize the above 2 points over anything else
Architecture
How
1
#1
A broker which can create directly create and serve deltalakes in object storage
Data from devices, directly land in S3 in structured parquet/deltalake format
If incoming data is in arrow/json, all users have to do is upload schema
Level 1 structured data is  available without  having to manage any infra or data jumping through 
MQTT → Kafka → Processing → Storage
Broker takes care of optimizing object storage PUTS, GETS and storage efficiency
2
#2
By designing the broker for object storage 
Entire pipleline becomes totally stateless and naturally separates storage and compute
Make it easy to 
Scale up/down compute,
Use spot vms which are upto 80% cheaper
Use local disks which are 50% cheaper and significantly faster than network disks
Object storage is 
70% cheaper than SSDs
There no need to over provision to keep extra buffer
Leads to effective 90% in savings
3
#3
Manages infra and ephemeral clickhouse
Creates vms to provision clickhouse as superfast analytics engine
Provisions loaders to load real-time or historical data from deltalake to clickhouse 
Handles clickhouse maintainance
Provides query insights to users to optimize their queries
Impact at Bytebeam
1
GCP Bill Reduction
Lowered operational cost from $6000 to $1500 monthly
2
No Backups
We dont spend any time for managing backups as data always exists in S3
3
Develop and deploy code near data
Instead of pulling data to your code, ghz infra provides the flexibility to deploy code near data
This keeps the infra lean and performant. We don't have to manage many services
Since loaders automate loading data from S3 to clickhouse, we also create replicas vms to develop and debug data processing logic
This significantly improved our dev work flow and ship faster
4
Lean Team
Two engineers manage 100K simulated devices sending data at per second frequency