Better data, not sensors
Want more reliable robots? Invest in dataset quality, replayable logs, and searchable telemetry: why better data matters more than adding sensors.
Every robot emits a stream of data: sensor bursts, camera frames, lidar scans, controller traces, and debug logs.
The pileup isn’t the problem; the disorder is. Files land in random buckets, live in strange formats, and demand fragile scripts.
When something fails, the evidence is there somewhere, but getting to it is slow and frustrating.
The shift is simple: teams are building data stacks that match their workflows.
Make runs visible: your robot needs a DVR. If you can’t reconstruct what the robot experienced, you’re guessing.
Modern viewers let you replay missions, scrub timelines, overlay sensor streams, and inspect messages, all in the browser. That means perception, controls, and ops can look at the exact moment together, without screen-share marathons.
Good places to start:
- Webviz: https://webviz.io/
- RViz: https://wiki.ros.org/rviz
- PlotJuggler: https://www.plotjuggler.io/
- NVIDIA Isaac tools: https://developer.nvidia.com/isaac-sdk
This kills “works on my machine” debates, speeds up root-cause analysis, and creates shared context across the team.
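Replay starts with being able to pull the exact window around an incident. Here is a minimal sketch, assuming runs are logged as MCAP (covered under standards below); the file name, topic names, and timestamps are placeholders:

```python
from mcap.reader import make_reader

INCIDENT_NS = 1_700_000_000 * 1_000_000_000  # hypothetical incident timestamp (ns)
WINDOW_NS = 5 * 1_000_000_000                # five seconds on each side

# Slice the ten seconds around an incident out of a run log,
# so the team can replay exactly what the robot experienced.
with open("run_0042.mcap", "rb") as f:
    reader = make_reader(f)
    for schema, channel, message in reader.iter_messages(
        topics=["/camera/front", "/lidar/points"],  # placeholder topic names
        start_time=INCIDENT_NS - WINDOW_NS,
        end_time=INCIDENT_NS + WINDOW_NS,
    ):
        print(channel.topic, message.log_time, len(message.data))
```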
Make logs searchable: the remote for that DVR. Sampling a few bags and hoping they’re representative is a time sink. You should be able to ask the data direct questions and get answers fast:
- “Show runs where the arm missed within 2 cm.”
- “Find lidar dropouts between mile markers 3 and 5.”
Practical building blocks:
- Elasticsearch + Kibana: https://www.elastic.co/
- OpenSearch: https://opensearch.org/
- ClickHouse for high-speed analytics: https://clickhouse.com/
- Vector databases for semantic queries (images/text/signals): Weaviate https://weaviate.io/ or Pinecone https://www.pinecone.io/
A pattern that works:
- Store structured run metadata and events in Elasticsearch/OpenSearch.
- Keep heavy time-series in ClickHouse or Parquet on S3/GCS.
- Add embeddings and a vector database to search unstructured data, such as images or windowed LiDAR signals.
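To make the first question above concrete, here is a minimal sketch using the opensearch-py client against a hypothetical runs index; the index name and fields (grasp_error_cm, outcome, replay_url) are assumptions about your metadata schema:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["http://localhost:9200"])

# "Show runs where the arm missed within 2 cm" as a structured query.
resp = client.search(
    index="runs",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"outcome": "miss"}},
                    {"range": {"grasp_error_cm": {"lte": 2.0}}},
                ]
            }
        },
        "size": 50,
    },
)

for hit in resp["hits"]["hits"]:
    run = hit["_source"]
    print(run["run_id"], run["grasp_error_cm"], run.get("replay_url"))
```

The point is not the specific engine: any store that answers this kind of question in seconds, and returns links back to the raw logs, fits the pattern.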
Speak a common language: standardize formats and schemas. Consistent, documented formats turn a fragile setup into a modular system.
When logs and labels follow shared schemas, you can swap tools, merge datasets, and scale pipelines without glue code.
Standards worth adopting:
- MCAP (high-performance logging): https://mcap.dev/
- ROS 2 bags (native): https://docs.ros.org/en/foxy/Tutorials/Ros2bag/
- OpenLABEL (annotations): https://www.asam.net/standards/detail/openlabel/
- COCO (vision annotations): https://cocodataset.org/
- Parquet (columnar telemetry): https://parquet.apache.org/
- Zarr (chunked arrays in the cloud): https://zarr.dev/
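As one way to put this into practice, here is a sketch of logging a custom event stream to MCAP with an explicit, documented JSON schema, using the mcap Python library; the topic, schema name, and fields are illustrative, not a standard:

```python
import json
import time
from mcap.writer import Writer

# Log grasp events to MCAP with a documented schema,
# so any MCAP-aware tool can decode them later.
with open("events.mcap", "wb") as f:
    writer = Writer(f)
    writer.start()

    schema_id = writer.register_schema(
        name="GraspEvent",  # illustrative schema name
        encoding="jsonschema",
        data=json.dumps({
            "type": "object",
            "properties": {
                "run_id": {"type": "string"},
                "error_cm": {"type": "number"},
                "outcome": {"type": "string"},
            },
        }).encode(),
    )
    channel_id = writer.register_channel(
        topic="/events/grasp",  # illustrative topic
        message_encoding="json",
        schema_id=schema_id,
    )

    now_ns = time.time_ns()
    writer.add_message(
        channel_id=channel_id,
        log_time=now_ns,
        publish_time=now_ns,
        data=json.dumps(
            {"run_id": "run_0042", "error_cm": 1.4, "outcome": "miss"}
        ).encode(),
    )
    writer.finish()
```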
Why this matters now: robotics is shifting from hardware heroics to data discipline.
The teams that move fastest won’t just design better mechatronics; they’ll build better data foundations.
- Faster iteration: Anyone can reproduce the exact moment, inspect it, and test a fix.
- Better models, sooner: Clean, queryable, well-labeled logs feed training and evaluation.
- Scalable operations: Web-based tools and standard formats work across labs, test tracks, and cloud clusters.
- Real velocity: Diagnose, improve, and redeploy weekly or daily.
How to get started
- Treat data like a product
  - Assign owners, define SLAs, and set retention policies.
  - Version datasets, labels, and models. Consider DVC (https://dvc.org/) or LakeFS (https://lakefs.io/).
- Stand up observability first
  - Start with Webviz or RViz for replay and inspection.
  - Make it a norm: every incident report links to a replayable timeline.
- Add search next
  - Centralize metadata (run IDs, robot IDs, routes, outcomes) in Elasticsearch/OpenSearch.
  - Keep large time series in ClickHouse or Parquet with lightweight indexes pointing to them.
  - Add vector search for semantic queries if you have images, text, or signal windows.
- Standardize early
  - Log to MCAP or ROS 2 bag with documented message schemas.
  - Pick OpenLABEL or COCO for annotations and stick with it.
  - Keep a schema repo with versioned protobufs/IDLs and migrations.
- Close the loop
  - Tie search results directly to replay links.
  - Auto-curate datasets from queries (e.g., “all failed grasps within 2 cm” becomes a training set; see the sketch after this list).
  - Track model performance by scenario tags sourced from your search system.
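Closing the loop can be as simple as turning a saved query into a versioned dataset manifest. A sketch, reusing the hypothetical runs index and fields from the search example above:

```python
import json
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["http://localhost:9200"])

# Turn the saved "failed grasps within 2 cm" query into a training-set
# manifest that points back at the source logs and replay links.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"outcome": "miss"}},
                {"range": {"grasp_error_cm": {"lte": 2.0}}},
            ]
        }
    },
    "size": 1000,
}
resp = client.search(index="runs", body=query)

manifest = [
    {
        "run_id": hit["_source"]["run_id"],
        "log_uri": hit["_source"].get("log_uri"),        # e.g. s3://... (assumed field)
        "replay_url": hit["_source"].get("replay_url"),  # assumed field
    }
    for hit in resp["hits"]["hits"]
]

# Check the manifest (not the raw data) into version control or DVC.
with open("datasets/failed_grasps_2cm.json", "w") as f:
    json.dump({"query": query, "items": manifest}, f, indent=2)
```

Because the manifest records the query itself, the dataset is reproducible and reviewable like code.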
A simple reference architecture
- Ingest: ROS 2 → MCAP → object storage (S3/GCS/MinIO)
- Index: Metadata to OpenSearch; heavy telemetry to ClickHouse; embeddings to Weaviate
- Observe: Webviz/RViz/PlotJuggler in CI and browser
- Label: CVAT or Label Studio exporting to OpenLABEL/COCO
- Orchestrate: Dagster/Prefect for pipelines; DVC/LakeFS for versioning
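The one piece above without an example yet is the embedding index. A minimal sketch with the weaviate-client v4 API, assuming a RunClip collection already exists and that embed() is a stand-in for your own model:

```python
import weaviate

def embed(path: str) -> list[float]:
    """Placeholder: swap in a real model (e.g., a CLIP image encoder)."""
    return [0.0] * 512

client = weaviate.connect_to_local()
try:
    clips = client.collections.get("RunClip")  # assumed pre-created collection

    # Index a clip's embedding alongside pointers back to the run.
    clips.data.insert(
        properties={"run_id": "run_0042", "topic": "/camera/front"},
        vector=embed("clips/run_0042_front.png"),
    )

    # "Find moments that look like this one": nearest-neighbor search.
    results = clips.query.near_vector(
        near_vector=embed("clips/query.png"),
        limit=5,
    )
    for obj in results.objects:
        print(obj.properties["run_id"])
finally:
    client.close()
```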
Team habits that make it stick
- Every PR links to a replay and a saved query with affected cases.
- Every incident includes a minimal reproducible replay.
- Dataset updates go through code review with schema diffs and metrics deltas.
The takeaway
Hardware will keep getting better. The leverage now is turning messy logs into a living system of record, one that lets you see what happened, ask precise questions, and ship improvements fast.
Build visibility, add search, and adopt standards. Your robots and team will move much faster.