TinyML: neural nets on microcontrollers
Keyword spotting, gesture detection, and the patterns for running inference on a $5 MCU. The other end of the AI spectrum from Jetson — where every kilobyte counts.
Jetson Orin runs a 7B-parameter VLA at 600 ms per inference. TinyML runs a 100,000-parameter classifier at 50 ms — on a $5 microcontroller drawing 50 mW. The use cases differ: not VLAs but always-on listening, vibration anomaly detection, gesture sensing, simple computer vision. The patterns are different from data-center ML; the field is mature in 2026.
Why TinyML
- Always-on: at ~50 mW (microwatts when duty-cycled between inferences), run continuously on a small battery for weeks to months.
- No connectivity required: inference happens at the sensor; no cloud roundtrip.
- Privacy: data never leaves the device.
- Latency: tens of milliseconds; suitable for closed-loop control.
- Cost: $5 MCU vs $250+ for the cheapest Jetson.
TinyML doesn't replace Jetson; it adds a layer that runs everywhere a Jetson can't.
The targets
| Class | Examples | What it runs |
|---|---|---|
| Cortex-M0+ class | RP2040, STM32G0 | Tiny models (5–20 KB) |
| Cortex-M4F class | STM32F4, nRF52, ESP32 | Most popular tier (50–500 KB) |
| Cortex-M7 / M33 | STM32H7, Teensy 4, ESP32-S3 | CNN-class (1–4 MB) |
| NPU-accelerated | Coral Dev Board Micro, Arduino Nicla, ESP32-P4 | Visual wake words, real-time CV |
As of 2026, the ESP32-S3 is a great hobby starting point: its built-in vector instructions run ML kernels roughly 10× faster than a plain Cortex-M4.
The classic TinyML applications
Keyword spotting ("Hey Robot")
Microphone → MFCC features → small CNN → "is this the wake word?" Runs constantly; triggers heavier processing on detection. Pioneered at consumer scale by Google's "Hey Google" hotword.
Model size: ~50 KB. Inference: ~10 ms on M4. Runs continuously at 50 mW.
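A minimal sketch of that always-on loop, assuming hypothetical helpers `run_kws_inference()` (mic capture + MFCC + TFLM `Invoke()`) and `wake_main_processor()`; averaging the last few posteriors before triggering is the standard trick for suppressing single-frame false positives:

```cpp
#include <cstddef>

constexpr size_t kHistory = 5;      // ~150 ms of smoothing at 30 ms frames
constexpr float kThreshold = 0.8f;

float history[kHistory] = {0};
size_t head = 0;

extern float run_kws_inference();   // hypothetical: returns P(wake word) for one frame
extern void wake_main_processor();  // hypothetical: the heavier processing

// Call continuously from loop().
void kws_step() {
  history[head] = run_kws_inference();
  head = (head + 1) % kHistory;

  float avg = 0;
  for (size_t i = 0; i < kHistory; ++i) avg += history[i];
  if (avg / kHistory > kThreshold) wake_main_processor();
}
```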
Visual wake words
Low-resolution camera → small CNN → "is there a person in frame?" Trigger high-power AI only on detection. Used in surveillance, doorbell cameras, smart-home robots.
Model size: 200–500 KB. Inference: ~50 ms on Cortex-M7.
Gesture / activity recognition
Accelerometer → window of last N samples → small NN → "is this a wave / shake / swipe?" Used in wearables, drone gesture control, pet collars.
Model size: 5–50 KB. Inference: <5 ms.
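The "window of last N samples" is usually a ring buffer flattened into the model's input tensor. A sketch, assuming a hypothetical `read_accel()` driver call and the 100 Hz × 3-axis setup from the shake example later in this post:

```cpp
constexpr int kWindow = 100;  // 1 s at 100 Hz
float window[kWindow][3];     // x, y, z
int write_idx = 0;

extern void read_accel(float* x, float* y, float* z);  // hypothetical driver

// Call once per sample tick (100 Hz timer).
void push_sample() {
  read_accel(&window[write_idx][0], &window[write_idx][1], &window[write_idx][2]);
  write_idx = (write_idx + 1) % kWindow;
}

// Copy oldest-to-newest into the input tensor so the model always sees
// samples in time order, regardless of where the ring currently wraps.
void fill_input(float* input) {
  for (int i = 0; i < kWindow; ++i) {
    const int src = (write_idx + i) % kWindow;
    for (int axis = 0; axis < 3; ++axis) *input++ = window[src][axis];
  }
}
```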
Anomaly detection
Vibration / sound from a machine → autoencoder → "is this normal?" Used in industrial maintenance, predictive failure detection.
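With an autoencoder, "is this normal?" reduces to reconstruction error: reconstruct the window, compare it to the original, and alarm when the error exceeds a threshold calibrated on known-good recordings. A sketch (threshold and helper names are illustrative; compare against your own copy of the input, since TFLM may reuse the input tensor's arena memory during `Invoke()`):

```cpp
// Mean squared error between the captured window and its reconstruction.
// An unfamiliar vibration signature reconstructs poorly, so MSE spikes.
float reconstruction_mse(const float* original, const float* reconstructed, int n) {
  float sum = 0;
  for (int i = 0; i < n; ++i) {
    const float d = original[i] - reconstructed[i];
    sum += d * d;
  }
  return sum / n;
}

// After interp->Invoke():
//   float mse = reconstruction_mse(window_copy, interp->output(0)->data.f, kLen);
//   if (mse > kNormalThreshold) flag_anomaly();  // threshold from healthy data
```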
Simple object classification
Tiny camera → 96×96 grayscale → 1 MB model → 2–5 classes. Used in toys, classroom robots.
The toolchain
| Stage | Tool |
|---|---|
| Train (desktop) | PyTorch, TensorFlow, Edge Impulse |
| Quantize | TensorFlow Lite, ONNX Runtime, NNCF |
| Convert | TF Lite converter → .tflite → C array via xxd |
| Deploy | PlatformIO + TFLM, Arduino lib, vendor SDKs |
Edge Impulse is the easiest path: web GUI, no-code model training, automatic deployment to many MCU targets. Free for hobbyists; commercial tier for production.
For more control: train PyTorch → export ONNX → use NNCF to quantize → use TFLM to deploy.
The size budget
Three things compete for MCU memory:
- Flash (program memory): model weights live here. Typical: 256 KB – 4 MB.
- RAM (working memory): activations + intermediate tensors. Typical: 64 KB – 1 MB.
- Stack/heap: the rest of your firmware's runtime state (the code itself lives in flash). Typical: 16 KB – 256 KB.
Models live in flash; inference produces intermediate activations in RAM. A model with 200 KB of weights but 500 KB of peak activations is constrained by RAM, not flash. The TFLM tensor_arena size is one of the first things you'll tune.
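TFLM can report how much of the arena it actually consumed, which turns sizing into a measurement instead of a guessing game. A sketch (Arduino-flavored, using the interpreter setup shown in the example below): oversize once, read the real number, then shrink with some safety margin.

```cpp
#include <Arduino.h>
#include "tensorflow/lite/micro/micro_interpreter.h"

// Start with a deliberately generous arena, measure, then trim.
constexpr int kArenaSize = 32 * 1024;
uint8_t arena[kArenaSize];

void report_arena_usage(tflite::MicroInterpreter& interp) {
  if (interp.AllocateTensors() != kTfLiteOk) {
    Serial.println("AllocateTensors failed: even the generous arena is too small");
    return;
  }
  Serial.print("Arena bytes actually used: ");
  Serial.println((unsigned)interp.arena_used_bytes());  // set kArenaSize to this + margin
}
```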
Quantization
FP32 weights → INT8: 4× smaller and ~2–3× faster on an M4 (which has DSP instructions), with minimal accuracy loss for most classifiers.
Pattern: train in FP32; quantize post-hoc with calibration data; deploy INT8. Quantization-aware training (training with simulated INT8) buys another ~1% accuracy.
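On the device side, an INT8 model wants quantized inputs: every tensor carries a scale and zero point from calibration, and floats convert as q = round(x/scale) + zero_point. A minimal sketch using the standard TfLiteTensor quantization params (the clamp bounds are INT8's range):

```cpp
#include <cmath>
#include <cstdint>
#include "tensorflow/lite/c/common.h"

// Float feature → INT8 tensor value, using the tensor's own calibration params.
int8_t quantize(float x, const TfLiteTensor* t) {
  int32_t q = static_cast<int32_t>(std::lroundf(x / t->params.scale)) + t->params.zero_point;
  if (q < -128) q = -128;  // clamp to INT8 range
  if (q > 127) q = 127;
  return static_cast<int8_t>(q);
}

// The inverse, for reading an INT8 output back as a float probability.
float dequantize(int8_t q, const TfLiteTensor* t) {
  return (q - t->params.zero_point) * t->params.scale;
}
```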
For very small models, sub-INT8 quantization (binary, ternary) becomes viable: niche, but extremely efficient.
The 50-line working example
On an ESP32-S3 with an accelerometer attached, deploy a "shake detection" model:
// 1. Capture 1 second of accelerometer at 100 Hz → 100 samples × 3 axes
// 2. Compute features (mean, std, FFT bins)
// 3. Run TFLM inference
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "model_data.h" // exported by xxd from .tflite
constexpr int kArenaSize = 8 * 1024;
uint8_t arena[kArenaSize];
tflite::MicroInterpreter* interp = nullptr; // file scope so loop() can see it
void fill_features(float* input); // your feature extraction (step 2)
void trigger_action();            // whatever a shake should trigger
void setup() {
  static tflite::AllOpsResolver resolver;
  static tflite::MicroInterpreter interpreter(
      tflite::GetModel(model_tflite),
      resolver, arena, kArenaSize);
  interp = &interpreter;
  interp->AllocateTensors(); // fails here if the arena is too small
}
void loop() {
  fill_features(interp->input(0)->data.f);
  interp->Invoke();
  float prob_shake = interp->output(0)->data.f[0];
  if (prob_shake > 0.8f) trigger_action();
}
A couple dozen lines of TFLM glue + a trained model. Total flash usage: ~50 KB. Runs forever on a small battery.
Limitations
- No on-device learning: TinyML deployment is one-way; updating the model means flashing new firmware.
- Limited op support: TFLM implements a subset of TF Lite ops, so custom architectures may not convert.
- No GPU: even with NPU accelerators, the gap to a Jetson is huge for any non-trivial model.
- Debugging is hard: you get GDB over a debug probe, and printf-style tracing wrecks real-time timing.
Where TinyML fits in robotics
- Wake-word activation for higher-power AI ("Hey, robot...").
- Always-on safety (vibration anomaly, motor over-temp).
- Gesture interfaces (drone hand-control, swarm signaling).
- Sensor preprocessing before transmission to a host.
- Educational robots / kits (deploy ML to a kid's robot in an afternoon).
Common gotchas
- Sensor noise vs training distribution: a model trained on Edge Impulse's ideal data falls apart on noisy real sensor data. Augment training data realistically.
- Quantization accuracy loss: an FP32 model at 95% accuracy can drop to 89% after INT8 conversion. Sometimes acceptable, sometimes not; test before deploying.
- Memory blowups: TFLM crashes silently when the arena is too small. Size it with a kArenaSize sweep.
- Flash wear: model updates over the air write to flash; flash has ~10,000 write cycles. Don't update too often.
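One cheap guard against the silent-crash failure mode: every TFLM call returns a TfLiteStatus, so check it and halt loudly instead of wandering into corrupted memory. A sketch (Arduino-flavored):

```cpp
#include <Arduino.h>
#include "tensorflow/lite/c/common.h"

// Log the failing expression and halt visibly rather than crashing silently.
#define TFLM_CHECK(expr)                  \
  do {                                    \
    if ((expr) != kTfLiteOk) {            \
      Serial.print("TFLM failure: ");     \
      Serial.println(#expr);              \
      while (true) {}                     \
    }                                     \
  } while (0)

// Usage:
//   TFLM_CHECK(interp->AllocateTensors());
//   TFLM_CHECK(interp->Invoke());
```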
Exercise
Use Edge Impulse to train a 3-gesture classifier (wave, tap, twist) on your phone's accelerometer. Deploy to an Arduino Nano 33 BLE Sense or ESP32-S3. Show that detection works in real time. Going from data collection to a deployed model in under 2 hours is the field's progress made tangible.
Next
I²C, SPI, UART, CAN — the bus protocols that connect every component on your robot.