TinyML: neural nets on microcontrollers
Keyword spotting, gesture detection, and the patterns for running inference on a $5 MCU. The other end of the AI spectrum from Jetson — where every kilobyte counts.
Jetson Orin runs a 7B-parameter VLA at 600 ms per inference. TinyML runs a 100,000-parameter classifier at 50 ms — on a $5 microcontroller drawing 50 mW. The use cases differ: not VLAs but always-on listening, vibration anomaly detection, gesture sensing, simple computer vision. The patterns are different from data-center ML; the field is mature in 2026.
Why TinyML
- Always-on: at ~50 mW (microwatts when duty-cycled between inferences), run continuously on a small battery for weeks to months.
- No connectivity required: inference happens at the sensor; no cloud roundtrip.
- Privacy: data never leaves the device.
- Latency: tens of milliseconds; suitable for closed-loop control.
- Cost: $5 MCU vs $250+ for the cheapest Jetson.
TinyML doesn't replace Jetson; it adds a layer that runs everywhere a Jetson can't.
The targets
| Class | Examples | What it runs |
|---|---|---|
| Cortex-M0+ class | RP2040, STM32G0 | Tiny models (5–20 KB) |
| Cortex-M4F class | STM32F4, nRF52, ESP32 | Most popular tier (50–500 KB) |
| Cortex-M7 / M33 | STM32H7, Teensy 4, ESP32-S3 | CNN-class (1–4 MB) |
| NPU-accelerated | Coral Dev Board Micro, Arduino Nicla, ESP32-P4 | Visual wake words, real-time CV |
As of 2026, the ESP32-S3 is a great hobby starting point: its built-in vector instructions run ML kernels roughly 10× faster than a plain Cortex-M4.
The classic TinyML applications
Keyword spotting ("Hey Robot")
Microphone → MFCC features → small CNN → "is this the wake word?" Runs constantly; triggers heavier processing on detection. Pioneered at consumer scale by Google's "Hey Google" hotword.
Model size: ~50 KB. Inference: ~10 ms on M4. Runs continuously at 50 mW.
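A minimal sketch of that always-on loop, assuming hypothetical helpers `run_kws_inference()` (mic capture + MFCC + TFLM `Invoke()`) and `wake_main_processor()`; averaging the last few posteriors before triggering is the standard trick for suppressing single-frame false positives:

```cpp
#include <cstddef>

constexpr size_t kHistory = 5;      // ~150 ms of smoothing at 30 ms frames
constexpr float kThreshold = 0.8f;

float history[kHistory] = {0};
size_t head = 0;

extern float run_kws_inference();   // hypothetical: returns P(wake word) for one frame
extern void wake_main_processor();  // hypothetical: the heavier processing

// Call continuously from loop().
void kws_step() {
  history[head] = run_kws_inference();
  head = (head + 1) % kHistory;

  float avg = 0;
  for (size_t i = 0; i < kHistory; ++i) avg += history[i];
  if (avg / kHistory > kThreshold) wake_main_processor();
}
```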
Visual wake words
Low-resolution camera → small CNN → "is there a person in frame?" Trigger high-power AI only on detection. Used in surveillance, doorbell cameras, smart-home robots.
Model size: 200–500 KB. Inference: ~50 ms on Cortex-M7.
Gesture / activity recognition
Accelerometer → window of last N samples → small NN → "is this a wave / shake / swipe?" Used in wearables, drone gesture control, pet collars.
Model size: 5–50 KB. Inference: <5 ms.
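The "window of last N samples" is usually a ring buffer flattened into the model's input tensor. A sketch, assuming a hypothetical `read_accel()` driver call and the 100 Hz × 3-axis setup from the shake example later in this post:

```cpp
constexpr int kWindow = 100;  // 1 s at 100 Hz
float window[kWindow][3];     // x, y, z
int write_idx = 0;

extern void read_accel(float* x, float* y, float* z);  // hypothetical driver

// Call once per sample tick (100 Hz timer).
void push_sample() {
  read_accel(&window[write_idx][0], &window[write_idx][1], &window[write_idx][2]);
  write_idx = (write_idx + 1) % kWindow;
}

// Copy oldest-to-newest into the input tensor so the model always sees
// samples in time order, regardless of where the ring currently wraps.
void fill_input(float* input) {
  for (int i = 0; i < kWindow; ++i) {
    const int src = (write_idx + i) % kWindow;
    for (int axis = 0; axis < 3; ++axis) *input++ = window[src][axis];
  }
}
```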
Anomaly detection
Vibration / sound from a machine → autoencoder → "is this normal?" Used in industrial maintenance, predictive failure detection.
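With an autoencoder, "is this normal?" reduces to reconstruction error: reconstruct the window, compare it to the original, and alarm when the error exceeds a threshold calibrated on known-good recordings. A sketch (threshold and helper names are illustrative; compare against your own copy of the input, since TFLM may reuse the input tensor's arena memory during `Invoke()`):

```cpp
// Mean squared error between the captured window and its reconstruction.
// An unfamiliar vibration signature reconstructs poorly, so MSE spikes.
float reconstruction_mse(const float* original, const float* reconstructed, int n) {
  float sum = 0;
  for (int i = 0; i < n; ++i) {
    const float d = original[i] - reconstructed[i];
    sum += d * d;
  }
  return sum / n;
}

// After interp->Invoke():
//   float mse = reconstruction_mse(window_copy, interp->output(0)->data.f, kLen);
//   if (mse > kNormalThreshold) flag_anomaly();  // threshold from healthy data
```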
Simple object classification
Tiny camera → 96×96 grayscale → 1 MB model → 2–5 classes. Used in toys, classroom robots.
The toolchain
| Stage | Tool |
|---|---|
| Train (desktop) | PyTorch, TensorFlow, Edge Impulse |
| Quantize | TensorFlow Lite, ONNX Runtime, NNCF |
| Convert | TF Lite converter → .tflite → C array via xxd |
| Deploy | PlatformIO + TFLM, Arduino lib, vendor SDKs |
Edge Impulse is the easiest path: web GUI, no-code model training, automatic deployment to many MCU targets. Free for hobbyists; commercial tier for production.
For more control: train PyTorch → export ONNX → use NNCF to quantize → use TFLM to deploy.
The size budget
Three things compete for MCU memory:
- Flash (program memory): model weights live here. Typical: 256 KB – 4 MB.
- RAM (working memory): activations + intermediate tensors. Typical: 64 KB – 1 MB.
- Stack/heap: the rest of your firmware's runtime state (the code itself lives in flash). Typical: 16 KB – 256 KB.
Models live in flash; inference produces intermediate activations in RAM. A model with 200 KB of weights but 500 KB of peak activations is constrained by RAM, not flash. The TFLM tensor_arena size is one of the first things you'll tune.
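TFLM can report how much of the arena it actually consumed, which turns sizing into a measurement instead of a guessing game. A sketch (Arduino-flavored, using the interpreter setup shown in the example below): oversize once, read the real number, then shrink with some safety margin.

```cpp
#include <Arduino.h>
#include "tensorflow/lite/micro/micro_interpreter.h"

// Start with a deliberately generous arena, measure, then trim.
constexpr int kArenaSize = 32 * 1024;
uint8_t arena[kArenaSize];

void report_arena_usage(tflite::MicroInterpreter& interp) {
  if (interp.AllocateTensors() != kTfLiteOk) {
    Serial.println("AllocateTensors failed: even the generous arena is too small");
    return;
  }
  Serial.print("Arena bytes actually used: ");
  Serial.println((unsigned)interp.arena_used_bytes());  // set kArenaSize to this + margin
}
```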
Quantization
FP32 weights → INT8: 4× smaller and ~2–3× faster on an M4 (which has DSP instructions), with minimal accuracy loss for most classifiers.
Pattern: train in FP32; quantize post-hoc with calibration data; deploy INT8. Quantization-aware training (training with simulated INT8) buys another ~1% accuracy.
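On the device side, an INT8 model wants quantized inputs: every tensor carries a scale and zero point from calibration, and floats convert as q = round(x/scale) + zero_point. A minimal sketch using the standard TfLiteTensor quantization params (the clamp bounds are INT8's range):

```cpp
#include <cmath>
#include <cstdint>
#include "tensorflow/lite/c/common.h"

// Float feature → INT8 tensor value, using the tensor's own calibration params.
int8_t quantize(float x, const TfLiteTensor* t) {
  int32_t q = static_cast<int32_t>(std::lroundf(x / t->params.scale)) + t->params.zero_point;
  if (q < -128) q = -128;  // clamp to INT8 range
  if (q > 127) q = 127;
  return static_cast<int8_t>(q);
}

// The inverse, for reading an INT8 output back as a float probability.
float dequantize(int8_t q, const TfLiteTensor* t) {
  return (q - t->params.zero_point) * t->params.scale;
}
```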
For very small models, sub-INT8 quantization (binary, ternary) becomes viable: niche, but extremely efficient.
The 50-line working example
On an ESP32-S3 with an accelerometer attached, deploy a "shake detection" model:
// 1. Capture 1 second of accelerometer at 100 Hz → 100 samples × 3 axes
// 2. Compute features (mean, std, FFT bins)
// 3. Run TFLM inference
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "model_data.h" // exported by xxd from .tflite
constexpr int kArenaSize = 8 * 1024;
uint8_t arena[kArenaSize];
tflite::MicroInterpreter* interp = nullptr; // file scope so loop() can see it
void fill_features(float* input); // your feature extraction (step 2)
void trigger_action();            // whatever a shake should trigger
void setup() {
  static tflite::AllOpsResolver resolver;
  static tflite::MicroInterpreter interpreter(
      tflite::GetModel(model_tflite),
      resolver, arena, kArenaSize);
  interp = &interpreter;
  interp->AllocateTensors(); // fails here if the arena is too small
}
void loop() {
  fill_features(interp->input(0)->data.f);
  interp->Invoke();
  float prob_shake = interp->output(0)->data.f[0];
  if (prob_shake > 0.8f) trigger_action();
}
A couple dozen lines of TFLM glue + a trained model. Total flash usage: ~50 KB. Runs forever on a small battery.
Limitations
- No on-device learning: TinyML deployment is one-way; updating the model means flashing new firmware.
- Limited op support: TFLM implements a subset of TF Lite ops, so custom architectures may not convert.
- No GPU: even with NPU accelerators, the gap to a Jetson is huge for any non-trivial model.
- Debugging is hard: you get GDB over a debug probe, and printf-style tracing wrecks real-time timing.
Where TinyML fits in robotics
- Wake-word activation for higher-power AI ("Hey, robot...").
- Always-on safety (vibration anomaly, motor over-temp).
- Gesture interfaces (drone hand-control, swarm signaling).
- Sensor preprocessing before transmission to a host.
- Educational robots / kits (deploy ML to a kid's robot in an afternoon).
Common gotchas
- Sensor noise vs training distribution: a model trained on Edge Impulse's ideal data falls apart on noisy real sensor data. Augment training data realistically.
- Quantization accuracy loss: an FP32 model at 95% accuracy can drop to 89% after INT8 conversion. Sometimes acceptable, sometimes not; test before deploying.
- Memory blowups: TFLM crashes silently when the arena is too small. Size it with a kArenaSize sweep.
- Flash wear: model updates over the air write to flash; flash has ~10,000 write cycles. Don't update too often.
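One cheap guard against the silent-crash failure mode: every TFLM call returns a TfLiteStatus, so check it and halt loudly instead of wandering into corrupted memory. A sketch (Arduino-flavored):

```cpp
#include <Arduino.h>
#include "tensorflow/lite/c/common.h"

// Log the failing expression and halt visibly rather than crashing silently.
#define TFLM_CHECK(expr)                  \
  do {                                    \
    if ((expr) != kTfLiteOk) {            \
      Serial.print("TFLM failure: ");     \
      Serial.println(#expr);              \
      while (true) {}                     \
    }                                     \
  } while (0)

// Usage:
//   TFLM_CHECK(interp->AllocateTensors());
//   TFLM_CHECK(interp->Invoke());
```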
Exercise
Use Edge Impulse to train a 3-gesture classifier (wave, tap, twist) on your phone's accelerometer. Deploy to an Arduino Nano 33 BLE Sense or ESP32-S3. Show that detection works in real time. Going from data collection to a deployed model in under 2 hours is the field's progress made tangible.
Next
I²C, SPI, UART, CAN — the bus protocols that connect every component on your robot.