I trained a 67-million-parameter transformer end to end on an M4 Mac Mini using Apple Silicon MPS and achieved 93.94 percent exact-match accuracy on CLI command generation.
No discrete GPU. Twenty-four gigabytes of unified memory. A task where a single missing character counts as complete failure.
This project started as a constraint experiment. How far could a carefully built small model go if every part of the pipeline was designed around consumer hardware limits? That meant training from scratch, streaming data instead of downloading it, and being honest about what worked and what broke.
The answer surprised me. With modern architectural components like RoPE, RMSNorm, and SwiGLU, aggressive data efficiency, and roughly 13 hours of pretraining plus about four minutes of supervised fine-tuning, a model smaller than GPT-2 learned to generate syntactically correct shell commands nearly 94 percent of the time. The remaining 6 percent failed in ways that turned out to be more instructive than the successes.
This is not a benchmark paper, a claim about general intelligence, or a guide to replacing ChatGPT. It is a grounded look at what actually happens when you build a modern small language model from first principles, train it on real data, and ask it to do something unforgiving.
Here is what I learned.
1. The Constraint and the Result
The defining constraint of this project was hardware.
All training was done on an M4 Mac Mini with 24GB of unified memory using Apple Silicon’s Metal Performance Shaders backend. There was no discrete GPU, no CUDA, and no ability to hide inefficiencies behind massive batch sizes. Every design choice had to respect memory pressure and wall-clock time. If a training decision was inefficient, the M4 made that obvious within minutes, not hours.
The task choice amplified those constraints. CLI command generation is exacting by nature. Commands are short, compositional, and brittle. A missing flag, a truncated regex, or an incomplete pipe is not “mostly right.” It is wrong. That made exact-match accuracy the only metric that mattered and removed any ability to rely on subjective evaluation.
Within those limits, the final results were:
- Model size: 66.73 million parameters
- Training data: 204.8 million tokens
- Pretraining time: roughly 13 hours wall time
- Supervised fine-tuning time: approximately 4 minutes
- Electricity usage: roughly 1 kilowatt-hour, under $0.50 at typical US electricity rates
- Final accuracy: 93.94 percent exact match on a held-out CLI evaluation set
The most important point is not the accuracy number in isolation. It is that these results were achieved end to end on consumer hardware, using a model trained from scratch, with full visibility into the data pipeline, training dynamics, and failure modes.
That combination—consumer hardware, exact evaluation, full transparency—shaped every decision that follows.
2. Why Build a Tiny LLM
The decision to build a small language model from scratch was driven by the task, not by ideology.
CLI command generation is a correctness problem, not a creativity problem. Commands are short, structured, and compositional. They rely on precise syntax, ordering, and punctuation. A missing flag, a truncated regex, or an incomplete pipe does not degrade quality gracefully. It fails outright.
This creates a clear neuro-symbolic boundary. The problem is not about producing plausible language, but about generating exact symbolic structures that must execute correctly. That makes CLI commands an unusually strong stress test for small models and a poor fit for subjective evaluation.
Training from scratch also provided control. Owning the tokenizer, data pipeline, training loop, and evaluation logic made failures diagnosable. When the model broke, the cause could be traced to data coverage, architectural constraints, or training dynamics rather than opaque behavior inside a black box.
Just as important were the explicit non-goals:
- This was not an attempt to build a general-purpose assistant.
- It was not a benchmark against frontier models.
- It was not designed for multilingual generation.
- It was not meant to replace API-based systems for broad tasks.
The goal was narrower and more practical: build the smallest model that could reliably generate exact, structured commands under tight hardware constraints, and understand precisely why it succeeded or failed.
3. High-Level System
The system was designed end to end, with each stage shaped by the constraints of consumer hardware and the requirements of exact output.
At a high level, the pipeline looks like this:
Tokenizer → streaming data → pretraining → supervised fine-tuning → evaluation → continual learning
The tokenizer was trained first, with explicit support for instruction and command boundaries. This made it possible to separate natural language intent from structured output during both training and evaluation.
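The project's exact tokenizer code is not reproduced in this post, but a minimal sketch using the Hugging Face `tokenizers` library shows the idea. The special-token names (`<|instruction|>`, `<|command|>`) are illustrative assumptions, not necessarily the markers the repository uses:

```python
# Sketch: a 32k BPE tokenizer with explicit instruction/command boundary tokens.
# Special-token names are hypothetical; the real project may use different markers.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<|instruction|>", "<|command|>"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # matches the 32,000-entry vocabulary in the parameter breakdown
    special_tokens=["<unk>"] + SPECIAL_TOKENS,
)

def text_iterator(paths):
    # Yield raw text; in the real pipeline this is the streamed Wikipedia corpus.
    for path in paths:
        with open(path, encoding="utf-8") as f:
            yield from f

tokenizer.train_from_iterator(text_iterator(["corpus.txt"]), trainer=trainer)
tokenizer.save("tokenizer.json")
```

Having explicit boundary tokens in the vocabulary is what later makes it straightforward to mask the fine-tuning loss to the command span only.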
Wikipedia was streamed directly from Hugging Face rather than downloaded, avoiding tens of gigabytes of local storage. Text was tokenized incrementally, segmented into fixed-length sequences, and written into token shards sized to balance disk IO and memory usage. These shards were later consumed using memory-mapped loading, allowing the training loop to scale without exhausting RAM.
Pretraining taught the model general language structure and syntax. Supervised fine-tuning then adapted the model to instruction-to-command mapping, with loss applied only to the command portion of each sequence.
Evaluation was handled asynchronously and designed to be repeatable and strict. Exact-match accuracy was computed on a held-out set using a lightweight AsyncIO-based evaluation loop, making it easy to rerun tests and gate updates without manual intervention.
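The evaluation harness itself is not shown in this post, but a lightweight AsyncIO loop over a held-out set could look roughly like this. The `generate` callable stands in for whatever wraps model generation and decoding; it is an assumption, not the project's actual interface.

```python
import asyncio
from typing import Callable, Iterable, Tuple

async def eval_example(generate: Callable[[str], str], instruction: str, reference: str) -> bool:
    # Run the blocking generation call off the event loop, then require
    # a character-for-character match against the reference command.
    prediction = await asyncio.to_thread(generate, instruction)
    return prediction == reference

async def exact_match_accuracy(generate: Callable[[str], str],
                               pairs: Iterable[Tuple[str, str]]) -> float:
    results = await asyncio.gather(
        *(eval_example(generate, ins, ref) for ins, ref in pairs)
    )
    return sum(results) / len(results)

# Usage: accuracy = asyncio.run(exact_match_accuracy(generate_fn, held_out_pairs))
```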
Finally, a small continual learning system wrapped the training process. New data could be introduced incrementally, but updates were only accepted if they improved performance without harming existing behavior.
The complexity here comes from orchestrating simple pieces under tight constraints, not from any single exotic component.
3.5. The Training Run in Numbers
Before going deeper, it is worth pausing on what the full training run actually looked like on the M4.
All results in this post come from a single end-to-end run with the following characteristics:
- Model size: 66.73 million parameters
- Training tokens: 204.8 million
- Pretraining time: roughly 13 hours wall time
- Supervised fine-tuning time: approximately 4 minutes
- Pretraining loss: reduced from roughly 60 to 3.59
- Accelerator: Apple Silicon MPS (M4 Mac Mini, 24GB unified memory)
- Electricity usage: roughly 1 kilowatt-hour, costing under $0.50
The Throughput Reality Check
It is important to be realistic about how this compares to enterprise hardware.
A single NVIDIA A100 would likely complete this specific pretraining run in 20 to 30 minutes. The cloud cost for that window is relatively low, on the order of one to two dollars. This estimate assumes on-demand pricing and ignores setup, data transfer, and iteration overhead. From a pure throughput perspective, there is no competition here.
But the advantage of training locally is not about beating an A100 in a sprint. It is about the cost of curiosity.
Training on local hardware fundamentally changes the developer’s relationship with iteration and failure:
- Zero marginal cost. In the cloud, every hyperparameter mistake, data-sharding bug, or aborted experiment has a price tag. On local hardware, the cost of a “failed” 13-hour run is roughly twenty-five cents of electricity.
- No cold-start overhead. There is no time spent provisioning instances, managing SSH keys, uploading data to remote volumes, or waiting for capacity. Training starts when you decide to start it.
- Persistence. You have a dedicated training appliance that is silent, draws less power than a lightbulb, and can iterate continuously without a ticking clock on your credit card.
This is the legitimacy checkpoint. These numbers reflect what actually happened on a single consumer machine. They show that for targeted, ~60M-parameter models, modern transformer architectures are no longer gated behind enterprise infrastructure. They are accessible at home, and that accessibility meaningfully changes how experimentation, debugging, and learning happen.
4. Architecture: Small but Modern
The model architecture was intentionally conservative.
TinyLLM uses a 12-layer transformer with a hidden dimension of 512, 8 attention heads, and a maximum context length of 512 tokens, for a total of 66.73 million parameters. There are no exotic blocks, no routing layers, and no architectural experiments designed to impress by novelty alone. Every component was chosen because it has demonstrated stable behavior at small scale under tight compute and memory constraints.
The core architectural choices were:
- Rotary positional embeddings (RoPE)
- RMSNorm instead of LayerNorm
- SwiGLU feed-forward layers
- Weight tying between input embeddings and the output projection
These choices reflect a well-understood modern recipe. They reduce parameter count, improve training stability, or both, without introducing additional complexity.
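For reference, minimal PyTorch versions of the two less-familiar components look roughly like this; a sketch, not the project's exact modules. With a model dimension of 512, a feed-forward hidden dimension of about 2,048 would reproduce the 37.75M feed-forward parameters listed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the activations: a learned gain,
    no bias, and no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x W1) * (x W3), projected back with W2.
    Three weight matrices per layer instead of the classic MLP's two."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```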
The parameter breakdown makes the tradeoffs explicit:
- Token embeddings: 16.38M parameters (32,000 vocab × 512 dim)
- Attention blocks (12 layers): 12.58M parameters
- Feed-forward networks (12 layers): 37.75M parameters
- RMSNorm layers: ~0.01M parameters
- Total: 66.73M parameters
Weight tying is the single most impactful optimization. By sharing the input embedding matrix with the output projection layer, the model saves exactly 16.38 million parameters, roughly 20 percent of the total size. This is not just a memory optimization. In small models, weight tying often improves consistency between learned token representations and output probabilities.
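The tying itself is essentially one line; a sketch with hypothetical module names, plus a parameter count that respects the sharing:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 32_000, dim: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # ... transformer blocks omitted ...
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        # Tie the output projection to the input embedding: one shared
        # 32,000 x 512 matrix instead of two, saving ~16.38M parameters.
        self.lm_head.weight = self.tok_emb.weight

def count_parameters(model: nn.Module) -> int:
    # nn.Module.parameters() deduplicates shared tensors by default,
    # so the tied matrix is counted only once.
    return sum(p.numel() for p in model.parameters())
```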
Nothing in this architecture is novel. That is the point. The goal was not to invent a new transformer variant, but to assemble a restrained, modern configuration that could deliver reliable results within hard hardware limits.
5. Data Pipeline: Fitting 200M Tokens in RAM
The data pipeline is where the hardware constraints became unavoidable.
Rather than downloading a full Wikipedia dump, the dataset was streamed directly from Hugging Face. This avoided tens of gigabytes of local storage and allowed preprocessing to happen incrementally. Text was tokenized on the fly and written into fixed-size shards without ever loading the full corpus into memory.
Each shard contains approximately one million tokens. This size was chosen deliberately. Smaller shards increase filesystem overhead and IO churn. Larger shards reduce flexibility and increase page-fault pressure during training. Around one million tokens per shard struck a practical balance.
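A sketch of the streaming-and-sharding step, assuming the Hugging Face datasets streaming API, the tokenizer sketched earlier, and uint16 token IDs (safe for a 32,000-entry vocabulary); the dataset name and shard layout are assumptions:

```python
import os
import numpy as np
from datasets import load_dataset

SHARD_TOKENS = 1_000_000  # ~1M tokens per shard, the balance discussed above

def write_shards(tokenizer, out_dir: str = "shards") -> None:
    os.makedirs(out_dir, exist_ok=True)
    # Stream articles instead of downloading the full dump to disk.
    ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

    buffer, shard_idx = [], 0
    for article in ds:
        buffer.extend(tokenizer.encode(article["text"]).ids)
        while len(buffer) >= SHARD_TOKENS:
            shard = np.array(buffer[:SHARD_TOKENS], dtype=np.uint16)
            np.save(os.path.join(out_dir, f"shard_{shard_idx:05d}.npy"), shard)
            buffer = buffer[SHARD_TOKENS:]
            shard_idx += 1
```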
During training, shards are loaded using NumPy memory-mapped arrays (mmap_mode="r"). This allows the model to index into large token arrays as if they were in memory while letting the operating system handle paging transparently. In practice, this made it possible to train on more than 200 million tokens without exceeding RAM limits or triggering swap.
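Reading the shards back is the same idea in reverse: open the .npy file memory-mapped and slice out training windows on demand. A sketch, with illustrative helper name and defaults:

```python
import numpy as np
import torch

def sample_batch(shard_path: str, batch_size: int = 8, seq_len: int = 512):
    # mmap_mode="r" maps the file into virtual memory; the OS pages in
    # only the slices that are actually indexed.
    tokens = np.load(shard_path, mmap_mode="r")
    starts = np.random.randint(0, len(tokens) - seq_len - 1, size=batch_size)
    x = np.stack([tokens[s : s + seq_len] for s in starts]).astype(np.int64)
    y = np.stack([tokens[s + 1 : s + seq_len + 1] for s in starts]).astype(np.int64)
    return torch.from_numpy(x), torch.from_numpy(y)
```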
Batch size and sequence length were not tuning knobs. They were hard constraints.
- Batch size was fixed at 8 during pretraining.
- Sequence length was fixed at 512 tokens.
Larger batches or longer contexts caused immediate memory pressure and degraded throughput on Apple Silicon. Rather than fighting those limits, the pipeline was designed to operate comfortably within them.
This approach is unglamorous, but it works.
6. Training: What Actually Happened
Training proceeded in two phases: pretraining followed by supervised fine-tuning.
During pretraining, the model was trained on 204.8 million tokens using AdamW with a cosine learning rate schedule and linear warmup. Gradient clipping was applied to stabilize updates at small batch sizes.
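That setup can be sketched with standard PyTorch pieces. The specific values here (peak learning rate, warmup length, clip norm, weight decay) are illustrative assumptions, not the run's exact hyperparameters:

```python
import math
import torch

def build_optimizer(model, max_steps: int, warmup_steps: int = 1_000, peak_lr: float = 3e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

    def lr_lambda(step: int) -> float:
        # Linear warmup to the peak rate, then cosine decay toward zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Inside the training loop, gradients are clipped before each optimizer step:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```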
The full pretraining loss trajectory looks like this:
- Step 100: 60.67
- Step 1,000: 10.97
- Step 5,000: 6.31
- Step 10,000: 5.39
- Step 20,000: 4.56
- Step 30,000: 4.23
- Step 50,000: 3.59
The sharpest drops occurred in the first 5,000 steps, as the model learned basic token statistics and syntax. After roughly 30,000 steps, gains slowed to incremental refinement, which is expected at this scale.
Supervised fine-tuning is where behavior changed abruptly.
The SFT dataset contained just over 2,300 instruction-to-command examples. During fine-tuning, loss was masked so that instruction tokens were ignored and only the command portion contributed to the gradient. This ensured the model was optimizing for exact command generation rather than instruction paraphrasing.
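A sketch of that masking, using the standard ignore_index convention in PyTorch's cross-entropy; the boundary handling is simplified and the function name is hypothetical:

```python
import torch.nn.functional as F

IGNORE_INDEX = -100  # labels with this value contribute nothing to the loss

def masked_sft_loss(logits, input_ids, instruction_lengths):
    # Next-token prediction: labels are the inputs shifted left by one.
    labels = input_ids[:, 1:].clone()
    logits = logits[:, :-1, :]

    # Mask the instruction span so only command tokens carry gradient.
    for i, n_instr in enumerate(instruction_lengths):
        labels[i, : max(0, n_instr - 1)] = IGNORE_INDEX

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```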
The fine-tuning loss collapsed rapidly:
- Step 50: 4.92
- Step 200: 1.92
- Step 500: 1.15
- Step 1,000: 0.08
- Step 2,000: 0.01
The collapse happened between steps 200 and 1,000. This was the moment the model “clicked.” Outputs shifted from loosely structured commands to consistently correct syntax with proper flags, ordering, and punctuation. The model did not become more fluent. It became exact.
The optimizer and schedule were intentionally unremarkable. AdamW with cosine decay is a known quantity. The gains here came from alignment between the task, the data, and the evaluation metric, not from clever optimization tricks.
At this point, the training metrics look strong. The next question is whether those gains hold up under strict evaluation, and where the model still breaks.
7. Evaluation: Exactness Over Vibes
Evaluation was deliberately strict.
Because the task is CLI command generation, approximate correctness is not meaningful. Commands either execute correctly or they do not. A missing flag, a truncated regex, or an incomplete pipe is a failure, regardless of how plausible the output might look to a human reader.
For that reason, exact-match accuracy was used as the primary metric. The generated command must match the reference command exactly, character for character. This is a harsh metric, but it aligns with real-world usage and removes ambiguity from evaluation.
The held-out evaluation set was constructed as follows:
- The supervised dataset was split 95 percent for training and 5 percent for evaluation.
- The held-out set was stratified by command complexity:
  * Simple commands (e.g., listing files): ~30%
  * Moderate commands (pipes, basic flags): ~50%
  * Complex commands (regex, redirection, multi-stage pipelines): ~20%
- No examples from the held-out set appeared in supervised fine-tuning.
- Pretraining data consisted of general Wikipedia text and did not include CLI command examples, eliminating leakage between pretraining and evaluation.
The final held-out set contained 99 instruction–command pairs.
Exact-match accuracy on this set reached 93.94 percent after fine-tuning. This number should be interpreted carefully. It does not imply general robustness or open-ended reasoning ability. It means that under a strict, task-aligned metric, the model produces fully correct commands most of the time.
The advantage of this evaluation setup is clarity. There is no room for subjective scoring, cherry-picked examples, or post-hoc interpretation. Either the output matches the target exactly, or it does not.
8. The 6 Percent That Failed
The most interesting part of the project lives in the failures.
Roughly 6 percent of the evaluation examples failed exact match. What mattered was not the number, but the pattern. Every failure shared a common trait: the model stopped early.
Here are three representative examples.
Failure 1: Regex Truncation
Instruction:
Remove all color codes and escape sequences from a log file.
Expected:
sed 's/\x1b\[[0-9;]*m//g' logfile.txt
Generated:
sed 's/
The model correctly identified the tool and the command prefix, then terminated as soon as it encountered a dense regex pattern.
Failure 2: Pipes and Redirection
Instruction:
Generate a secure random password using urandom.
Expected:
tr -dc A-Za-z0-9 </dev/urandom | head -c 16 ; echo ''
Generated:
tr -dc A-Za-z0-9
The output is a plausible prefix, followed by an early end-of-sequence token. The redirection and pipeline never appear.
Failure 3: Email Address Handling
Instruction:
Send an email with a subject and body from the command line.
Expected:
echo "Body text" | mail -s "Subject" user@example.com
Generated:
echo "Body text" | mail -s "Subject" user
The command structure is correct, but the output truncates partway through the email address.
The pattern held across all six failures: premature termination at the exact moment the model encountered a low-frequency, symbol-dense pattern.
This is not a reasoning failure. It is a data coverage failure. Regexes, redirection operators, email addresses, and long pipelines were underrepresented in the fine-tuning set. When the model encountered spans it had seen too infrequently, it defaulted to stopping rather than guessing.
This diagnosis is useful because it points to a clear path forward. The failures are fixable without changing the architecture.
9. Continual Learning That Says “No”
Rather than retraining the model wholesale whenever new data was added, a small continual learning system was layered on top of the training loop.
The goal was not to maximize short-term gains, but to preserve correctness on the original task while incorporating new examples safely.
The system has three core components:
- A replay buffer that mixes new examples with a subset of prior training data.
- Weight anchoring that penalizes large deviations from the previous model state.
- Evaluation gating that decides whether an update is accepted or rejected.
Each micro-update is evaluated against two criteria, with the gating rule sketched in code below:
- Improvement on the new task must exceed a minimum threshold, δ = 0.01.
- Degradation on the base evaluation set must remain below a maximum tolerance, ε = 0.02.
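A minimal sketch of that gating decision; the real gate may differ in units and bookkeeping, but the shape of the rule is simple:

```python
def accept_update(new_task_delta: float, base_task_delta: float,
                  delta: float = 0.01, epsilon: float = 0.02) -> bool:
    """Gate a continual-learning micro-update.

    new_task_delta:  change in accuracy on the new examples.
    base_task_delta: change in accuracy on the base evaluation set.
    Deltas and thresholds must be expressed in the same units.
    """
    improves_enough = new_task_delta >= delta
    preserves_base = base_task_delta >= -epsilon
    return improves_enough and preserves_base
```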
In practice, this meant the system could and did say no.
| Update | New Task Δ | Base Task Δ | Decision |
| ----- | ----- | ----- | ----- |
| #1 | +2.0% | −0.50% | REJECTED |
| #2 | +1.5% | −0.01% | ACCEPTED |
The rejected update improved performance on new examples but caused an unacceptable drop on the base task. The accepted update delivered smaller gains while preserving existing behavior.
The key point is that these decisions were automatic. The gating logic enforced constraints even when it was tempting to accept an update that “mostly worked.”
This turns continual learning into a conservative process. Progress is slower, but regressions are controlled. For tasks where exactness matters, that tradeoff is worth making.
10. What I’d Change for v2
The next version of this model does not need to be larger. It needs to be more deliberate.
The most obvious improvement is targeted data. Based on the failure analysis, adding roughly 500 regex-heavy and pipeline-heavy command examples would likely eliminate most of the early-EOS failures. These examples do not need to be diverse in topic. They need to be dense in symbols, redirection operators, and long spans of non-alphanumeric tokens. The expected cost is modest: a couple of hours to generate and validate data, with a high probability of fixing four out of the six observed failure cases.
The second change would be intermediate evaluation checkpoints during both pretraining and supervised fine-tuning. Accuracy was only measured at the end of training. Adding evaluations every 5,000 pretraining steps and every few hundred SFT steps would make plateaus and regressions visible much earlier. The additional wall-time cost is minimal compared to the insight gained.
Third, I would add explicit throughput and memory logging. Tokens per second, peak memory usage, and allocator behavior were not tracked in this run. That made it impossible to produce a precise cost and efficiency breakdown beyond rough estimates. Lightweight profiling with existing PyTorch tools would make future runs easier to compare, tune, and reproduce.
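A lightweight version of that logging might look like the sketch below, assuming recent PyTorch exposes torch.mps.current_allocated_memory() for the MPS allocator; the helper name is hypothetical:

```python
import time
import torch

def log_throughput(step: int, tokens_in_step: int, step_start: float) -> None:
    # Tokens per second from wall time, plus current MPS allocation if available.
    elapsed = time.perf_counter() - step_start
    tok_per_s = tokens_in_step / max(elapsed, 1e-9)
    mem_gb = (torch.mps.current_allocated_memory() / 1e9
              if torch.backends.mps.is_available() else float("nan"))
    print(f"step {step}: {tok_per_s:,.0f} tok/s, {mem_gb:.2f} GB allocated on MPS")

# Usage inside the loop:
#   t0 = time.perf_counter()
#   ... forward / backward / optimizer step ...
#   log_throughput(step, batch_size * seq_len, t0)
```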
None of these changes are architectural overhauls. They are small, practical steps that would materially improve reliability, debuggability, and confidence in the results.
11. When You Shouldn’t Do This
It is worth being explicit about when this approach does not make sense.
You should not build a model like this if you need general knowledge. Large API-backed models exist precisely to handle open-ended tasks, long-tail facts, and broad domains. A small model trained from scratch will not compete there.
You should not do this if you need multilingual support. Tokenization complexity, data requirements, and evaluation difficulty increase sharply, and the benefits of training your own model diminish quickly.
You should not build from scratch if your domain has fewer than a few thousand high-quality examples. Small models are unforgiving when data is sparse or noisy. In those cases, fine-tuning an existing model is almost always the better choice.
You should not attempt this without evaluation discipline. If you cannot define what “correct” means and enforce it mechanically, you will end up optimizing vibes instead of behavior.
You should not do this if you are in a high-velocity sprint phase with a finalized dataset. If you have a locked-in configuration and just need results in twenty minutes, rent the A100. Local training is for the exploration phase, where you are still figuring out the data sharding, tokenizer boundaries, and loss behavior.
Finally, you should not do this if you cannot afford to wait. Training from scratch takes hours to days of wall time. If you need results tomorrow, use an API.
Building a tiny LLM is a tool, not a philosophy. It is powerful in the right context and wasteful in the wrong one.
12. Closing: The Actual Lessons
The biggest lesson from this project was not architectural.
RoPE, RMSNorm, SwiGLU, and weight tying all worked as expected, but none of them mattered when the data was wrong. When I added targeted examples, failures dropped. When I didn’t, no amount of hyperparameter tuning or architectural tweaking helped. Data quality beat scale every time.
The second lesson was that infrastructure choices compound. Streaming data instead of downloading tens of gigabytes made training feasible. Memory-mapped shards made scale possible on limited hardware. Conservative evaluation gates prevented regressions during continual learning. These decisions mattered more than another layer or a wider hidden dimension.
Most importantly, small models are not toys when the task is narrow and exact. If you control the domain, the format, and the evaluation, a 67-million-parameter model can be useful, inspectable, fast, and cheap to run. CLI command generation does not need a poetic model. It needs a precise one.
Building from scratch also teaches things that fine-tuning never will. You see where models break. You learn which problems are data problems and which are architectural. You develop intuition for cost, memory, and failure modes that no API hides from you.
This project is not finished. The remaining failures are fixable. The continual learning system needs more real-world testing. And there are plenty of experiments left to run.
The code is on GitHub at github.com/geddydukes/tiny_llm. Training logs, evaluation results, and failure cases are all public.
If you have been considering training a small model for a specific, well-defined task, my advice is simple: start small, measure everything, and be honest about failures.
Go build something.