i tried to read Ukrainian handwriting with AI and accidentally learned ML exists

A personal, non-technical retrospective about Kaggle Handwritten to Data, GLM-OCR, Qwen-VL, adapters, Codex, Claude, and discovering the Ukrainian ML scene.

I joined AI HOUSE's Kaggle Handwritten to Data competition thinking it would be mostly about making OCR work on Ukrainian handwriting.

That sounds complicated, but in my head it was still kind of simple: take a page, ask a model what is written there, submit text, receive score, become smart.

In reality it became a small month-long documentary about why machine learning people always talk about data splits, validation, adapters, checkpoints, and why "bigger model" is not a magic spell.

I am not an ML engineer. I am a software engineering student. Most of the training work was me steering Codex and Claude, reading logs, asking why something got worse, rerunning comparisons, and trying not to submit nonsense. So this is not a tutorial from someone who suddenly became an OCR scientist. It is more like a field report from someone who entered the forest with a laptop and kept finding weirder animals.

The project repo is here: dusy4/handwritten-to-data-revelation. It has the experiment history, docs, runbooks, score notes, and enough artifacts to explain how the pipeline moved from GLM-OCR to Qwen-VL specialists.

competition context
what the competition was actually asking
why I started with GLM-OCR
adapters sounded like cheat codes
then came bigger models
agents did the ML, I did the steering
validation is where the fantasy ended
MamayLM and the thing I missed
Ukrainian ML scene rabbit hole
what I still want to try
final thoughts

competition context

The competition page is Handwritten to Data on Kaggle, the AI HOUSE competition this project was built for.

The simple description is: take Ukrainian handwritten documents and turn them into structured machine-readable output. It was not a clean OCR demo where every page looks the same. The data mixed school pages, archive-looking pages, university material, dictations, different layouts, different handwriting styles, and all the normal document weirdness that makes OCR stop feeling like "just read the text".

That context matters because my pipeline was not one model looking at one pretty image. It was a chain: detect regions, classify or preserve their type, read the handwriting inside them, rebuild the page, and submit it in the exact format Kaggle expected.

what the competition was actually asking

The task was not just "read handwriting".

The page first had to be cut into meaningful regions. A school notebook page, an archive page, a university page, a dictation page — all of them look different. Some have tables, some have formulas, some have old handwriting, some have clean modern lines, and some look like a phone camera had a personal conflict with the document.

So the pipeline became three separate jobs pretending to be one job.

First, find the regions on the page. In the final pipeline this was a YOLO v6 typed detector. Think of it as a model that points at the page and says: this rectangle is text, this one is a table, this one is another region type.

The training data context was RUKOPYS, the Ukrainian handwritten text recognition dataset from Ukrainian Catholic University. My project docs point to it as the real dataset source, and Hugging Face marks it as a Ukrainian image/text dataset for object detection and image-to-text. Pages like the ones in that dataset are what the detector had to reason about before the OCR model even got a crop.

Then each rectangle had to be read by an OCR model. This is where GLM-OCR, Qwen-VL, LoRAs, adapters, and most of the confusion lived.

Then everything had to become a valid Kaggle CSV. This sounds boring, but boring decides submissions. Wrong row count, invalid JSON, broken region order, weird serialization — the score does not care about your intentions.
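Those failure modes are cheap to catch before uploading. Here is a minimal sketch of a pre-submission check. The column names `page_id` and `regions` are hypothetical placeholders, not the competition's real schema, and the check only covers row count, unknown or duplicate ids, and JSON validity.

```python
import csv
import io
import json


def validate_submission(csv_text, expected_ids, id_col="page_id", json_col="regions"):
    """Return a list of problems found; an empty list means the file passed.

    id_col and json_col are hypothetical column names used for illustration.
    """
    problems = []
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    if len(rows) != len(expected_ids):
        problems.append(f"row count {len(rows)} != expected {len(expected_ids)}")

    seen = set()
    for i, row in enumerate(rows):
        page_id = row.get(id_col)
        if page_id not in expected_ids:
            problems.append(f"row {i}: unexpected id {page_id!r}")
        if page_id in seen:
            problems.append(f"row {i}: duplicate id {page_id!r}")
        seen.add(page_id)

        # An empty or missing cell is treated as invalid JSON as well.
        try:
            json.loads(row.get(json_col) or "")
        except json.JSONDecodeError:
            problems.append(f"row {i}: invalid JSON in {json_col}")

    return problems
```

Running something like this on every candidate file turns "the score does not care about your intentions" into an error message you see before Kaggle does.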

The best public score I got was:

v20 publicScore: 0.75185
pipeline: YOLO v6 typed detector + Qwen2.5-VL-7B-Instruct + archive-r64-1000 LoRA

That is not a victory lap. The public leader was far ahead. The interesting part is what the project taught me while getting there.

why I started with GLM-OCR

I started with GLM-OCR mostly because I already had experience around it from my glmmedia-ocr project.

That is the honest reason. Not because I had a perfect benchmark matrix. I had touched GLM-OCR before, I knew roughly how to make it run, and in a competition you start from the thing you can actually move.

GLM-OCR was good enough to create a real pipeline. It helped me get from random scripts to a system that had detection, recognition, validation, output files, and enough logs that a coding agent could continue the work without losing the whole plot.

The GLM era ended with:

v17 publicScore: 0.56819

At that moment it was not great, but it was real. A real bad score is more useful than an imaginary good architecture.

adapters sounded like cheat codes

One competition recommendation was basically: do not treat every document as the same document.

That made sense immediately. A school notebook is not an archive manuscript. A university sheet with formulas is not a dictation page. The model should not behave identically on all of them.

So I went into adapters for different document types: school, archive, university, dictation, and similar buckets.

The simple explanation of an adapter is this: instead of retraining the entire huge model, you attach a small learned patch to it. The base model stays mostly the same, but the adapter nudges it toward a specific style of data.
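For LoRA, the adapter type named in the score notes, the "small patch" is literal: the original weight matrix stays frozen and two thin matrices are learned, whose product is added on top. The sketch below uses plain Python lists so it runs anywhere. The 4096 x 4096 layer size and the rank of 64 are illustrative numbers, not the project's actual configuration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable weights: the full matrix versus the two low-rank factors."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)
    return full, adapter


def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(W, A, B, scale=1.0):
    """Effective weight W + scale * (B @ A); W itself is never modified.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Illustrative sizes only: one 4096 x 4096 layer with a rank-64 adapter.
full, adapter = lora_param_counts(4096, 4096, 64)
# full == 16777216, adapter == 524288, so the patch is about 3% of the layer.
```

If `B` starts as all zeros, `apply_lora` returns the original weights unchanged, which is why a freshly attached adapter behaves exactly like the base model until it is trained.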

In theory this sounds perfect. In practice it becomes routing. You now have many small specialists and must decide when each one should speak. If the archive adapter is better on archive pages but worse on school pages, then the router matters. A specialist can help one slice and hurt another.

This was one of the first ML lessons that actually landed for me: specialization is not free. Every specialist creates a new decision point.
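One way to keep that decision point honest is to let validation numbers build the routing table: a specialist only gets a document type if it strictly beats the generalist on that slice. This is a sketch under assumptions. The adapter names and CER values are invented for illustration, and it assumes the document type of each page is already known.

```python
def build_routes(val_cer, default_adapter):
    """Map each document type to the adapter with the lowest validation CER.

    val_cer: {adapter_name: {doc_type: cer}}. The default adapter keeps a
    document type unless another adapter is strictly better on it.
    """
    routes = {}
    for doc_type, base_cer in val_cer[default_adapter].items():
        best_name, best_cer = default_adapter, base_cer
        for name, scores in val_cer.items():
            cer = scores.get(doc_type)
            if cer is not None and cer < best_cer:
                best_name, best_cer = name, cer
        routes[doc_type] = best_name
    return routes


def route(routes, doc_type, default_adapter):
    """Pick an adapter for a page; unknown document types fall back to the default."""
    return routes.get(doc_type, default_adapter)


# Invented numbers: the specialist wins on archive pages and loses on school pages.
val_cer = {
    "general": {"school": 0.20, "archive": 0.30, "dictation": 0.18},
    "archive_specialist": {"school": 0.27, "archive": 0.22},
}
routes = build_routes(val_cer, "general")
```

In this toy table only archive pages go to the specialist, and everything else, including document types nobody measured, stays with the generalist.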

then came bigger models

Eventually GLM-OCR started to feel like it had reached its ceiling for this pipeline.

The jump happened when the project moved to Qwen/Qwen2.5-VL-7B-Instruct. That model was simply stronger as a recognizer. Not in a theoretical blog-post way, but in the only way that mattered: the score moved.

version	public score	what changed
v17	0.56819	best GLM-OCR routed system
v18	0.67364	first Qwen-VL LoRA submission
v19	0.71352	larger mixed-source Qwen-VL LoRA
v20	0.75185	archive specialist promoted near the end

This is also where I learned that bigger is only half the story.

Bigger base model helps. Bigger data can help. Bigger compute helps. But the exact checkpoint matters. Sometimes the earlier checkpoint is better than the final one. Sometimes training longer makes the model more confident and less correct. Before this competition I would have assumed training longer is obviously better. Now I understand why ML people keep saying: check validation.
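In code, "check validation" can be as small as refusing to assume the last checkpoint is the best one. A minimal sketch, with invented step numbers and CER values:

```python
def pick_checkpoint(val_cer_by_step):
    """Return (step, cer) for the lowest validation CER; ties go to the earlier step."""
    if not val_cer_by_step:
        raise ValueError("no checkpoints to compare")
    return min(val_cer_by_step.items(), key=lambda item: (item[1], item[0]))


# Invented numbers: the run keeps training after 600 steps but gets worse.
history = {200: 0.231, 400: 0.204, 600: 0.187, 800: 0.195}
best_step, best_cer = pick_checkpoint(history)
```

Here the step-600 checkpoint wins even though step 800 trained longer, which is the whole point of saving intermediate checkpoints.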

agents did the ML, I did the steering

I did not sit there manually writing perfect training code from scratch like a Kaggle grandmaster.

I used Codex and Claude heavily. They wrote scripts, fixed evaluation loops, generated runbooks, compared outputs, prepared Slurm commands, staged adapters, and explained what the logs meant. My role was closer to a project manager who keeps asking annoying questions:

is this actually better?
compare it against the old checkpoint
why did local improve but public not?
make it reproducible
save the artifact
write the command as one line

That workflow works surprisingly well, but only if the project has memory. The repository needed docs, experiment logs, current-state files, exact paths, submission IDs, and commands. Otherwise the agent becomes confident but contextless.

This is where software engineering helped more than ML knowledge. A chaotic experiment folder is bad engineering. If the model improves but nobody can reproduce why, it is not really a result.

validation is where the fantasy ended

The most useful diagram in the whole project is probably not the model architecture. It is the validation loop.

At first I wanted to look at OCR output and judge it like a human: this looks better, that looks worse, this crop feels cleaner.

That is not enough.

The competition score cared about exact characters. Ukrainian letters, punctuation, line order, missed regions, broken page reconstruction. A model can look nicer and still score worse. A model can produce more plausible Ukrainian and still be wrong because OCR does not reward plausible text. It rewards transcription.

This is why post-processing is tricky. A language model can "fix" the sentence into normal Ukrainian and accidentally change the ground truth. For normal humans this looks better. For CER it is worse.

The uncomfortable lesson: your eyes are not the metric.
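The standard character error rate (CER) makes this concrete: it is the edit distance between the prediction and the ground truth, divided by the length of the ground truth. The sketch below uses the textbook definition. The competition's exact scoring formula is not reproduced here, and the example sentences are invented.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference length.

    For an empty reference this sketch returns 0.0 for an empty hypothesis
    and 1.0 otherwise; that convention is a choice, not a standard.
    """
    if not reference:
        return float(bool(hypothesis))
    return levenshtein(reference, hypothesis) / len(reference)


# Invented example: the writer misspelled a word, so the ground truth keeps the misspelling.
truth = "я пішов до шкли"
faithful = "я пішов до шкли"   # transcribes what is on the page
polished = "я пішов до школи"  # "fixed" into correct Ukrainian
```

`cer(truth, faithful)` is 0.0, while `cer(truth, polished)` is one edit over the length of the reference. The nicer-looking sentence is the one that loses points.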

MamayLM and the thing I missed

MamayLM was already in my docs, but my brain filed it under "Ukrainian language model for correction".

The idea was to use it as a conservative post-processing corrector: take OCR output, fix obvious Ukrainian mistakes, do not rewrite formulas, do not destroy punctuation, and do not produce beautiful Ukrainian prose where the handwritten ground truth is messy.

That idea still makes sense.

But the dumb part is that it did not fully click for me that MamayLM-Gemma-3-12B-IT-v1.0 is listed as an image-text-to-text model. In other words, it is not only a text corrector candidate. It can be tested directly as an OCR recognizer too.

That skipped my mind during the competition. Maybe it would have been worse. Maybe it would have produced too much plausible text. Maybe it would have been too slow. But it was a real path and I mentally placed it in the wrong box.

Ukrainian ML scene rabbit hole

One unexpectedly good part of the competition was that I discovered more of the Ukrainian ML scene.

Before this, Ukrainian AI for me was mostly abstract. During the competition I started running into model names, channels, datasets, people, repositories, Telegram discussions, Hugging Face pages, and weirdly specific OCR experiments.

That was maybe more valuable than the Kaggle score.

It made ML feel less like a distant thing and more like an ecosystem I can actually follow. Ukrainian language models, OCR attempts, Cyrillic handwriting work, data collection, local communities — all of that became visible because I had a real problem and suddenly every niche link mattered.

I even bought AI Engineering: Building Applications with Foundation Models by Chip Huyen as a high-level ML introduction. I needed something that explains the field from above because my background is software engineering, not model training. The funny part is that in Ukraine you can buy the English book for around 500 UAH, which is about 10 EUR, while in Germany the same English book is around 50 EUR. Same language, completely different price reality.

After that I will probably read the deeper ML book by the same author. Basically the competition made me admit that I cannot just treat ML as a black box that Codex operates for me forever. At some point I need the mental map.

what I still want to try

Even though the competition ended, I do not feel done with the problem.

The first thing I want to test is MamayLM as a direct OCR engine, not only as a corrector. The safe way would be a small fixed validation probe: same crops, same metric, no vibes.

The second thing is conservative post-processing. Not "rewrite this into nice Ukrainian". More like: only fix obvious OCR corruption when the image and the original output agree enough.
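A text-only version of that gate can be sketched as follows. It does not look at the image at all, so it covers only half of the idea: it accepts a correction when digits and punctuation are untouched and only a small share of characters changed. The 10% threshold and the example sentences are assumptions for illustration.

```python
import difflib


def protected(text):
    """Characters a corrector must not alter: everything that is not a letter or whitespace."""
    return [ch for ch in text if not (ch.isalpha() or ch.isspace())]


def accept_correction(ocr_text, corrected, max_changed_ratio=0.1):
    """Return the corrected text only if the change is small and safe, else the raw OCR text."""
    # Digits, punctuation, and formula symbols must survive unchanged.
    if protected(ocr_text) != protected(corrected):
        return ocr_text
    similarity = difflib.SequenceMatcher(None, ocr_text, corrected, autojunk=False).ratio()
    # A large rewrite is more likely "nice Ukrainian" than a repaired OCR error.
    if 1.0 - similarity > max_changed_ratio:
        return ocr_text
    return corrected


# Invented example: one wrong letter is accepted, a changed digit or a rewrite is not.
raw = "Сьогодні гарна пагода, 5 учнів."
small_fix = "Сьогодні гарна погода, 5 учнів."
changed_digit = "Сьогодні гарна погода, 6 учнів."
rewrite = "Сьогодні дуже гарна і сонячна погода, 5 учнів."
```

Even this gate can hurt CER when the ground truth itself contains the "mistake", so it would still need to be measured on a fixed validation set before being trusted.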

The third thing is trying newer Gemma-based routes. I saw this Telegram post, and it reminded me to look at Gemma models. DiffusionGemma specifically is probably the wrong variant for CER. I thought it would be interesting to look at, but from what I have read, that series brings no meaningful quality improvement, only a speed bump at the cost of precision.

I still do not know if that direction is useful for OCR or if it is just another shiny thing that sounds better than it scores. But after missing the MamayLM multimodal angle, I do not want to ignore an entire model family just because it was not in my first mental search space.

The fourth thing is VmF0x/lapa-ocr-lora, which I found after the competition. Hugging Face lists it as an OCR/handwriting LoRA for Ukrainian image-to-text work, based on lapa-llm/lapa-v0.1.2-instruct. I have not properly tested it yet, but it is interesting because it is another LLM focused on the Ukrainian language.

The future plan is not "train everything again forever".

small fixed probe -> compare by CER -> keep only if it wins -> then think about routing

Boring, but now I understand why boring is the right shape.
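The "keep only if it wins" step can be written down as a small gate. This sketch assumes each model has already been run on the same fixed set of crops and that every crop is summarized as an (edit distance, reference length) pair. The numbers and the optional `min_gain` margin are invented for illustration.

```python
def corpus_cer(per_crop):
    """Pooled CER over a probe: total edits divided by total reference characters."""
    total_edits = sum(edits for edits, _ in per_crop)
    total_chars = sum(length for _, length in per_crop)
    if total_chars == 0:
        raise ValueError("probe has no reference characters")
    return total_edits / total_chars


def keep_candidate(baseline, candidate, min_gain=0.0):
    """True only if the candidate beats the baseline by more than min_gain on the same crops."""
    if len(baseline) != len(candidate):
        raise ValueError("both models must be scored on the same crops")
    return corpus_cer(candidate) < corpus_cer(baseline) - min_gain


# Invented probe of three crops: (edit distance, reference length) per crop.
baseline = [(3, 20), (5, 25), (0, 10)]
candidate = [(2, 20), (4, 25), (1, 10)]
```

With these numbers the candidate wins narrowly (7 edits against 8 over 55 characters), and a stricter `min_gain` would reject it as noise. Choosing that margin is the judgment call the probe cannot make for you.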

final thoughts

The project started as "let's make OCR for Ukrainian handwriting" and became a crash course in how ML work actually behaves.

The detector mattered, but the recognizer decided most of the score. Adapters helped, but routing made them complicated. Bigger models helped, but validation decided which bigger model was real. Agents helped massively, but only because the repo had enough structure for them to keep context.

The final Kaggle score was not first-place material. But the project itself was useful because it changed how I look at ML. Before this, a model was mostly "call API / fine-tune / hope". After this, a model is a whole messy system around data, evaluation, artifacts, compute, and human decisions under time pressure.

So the real result is not only 0.75185.