Yoruba OCR · structured datasets · human review

Build Yoruba LLM training data from scans

Upload page images, run PaddleOCR for text extraction, correct diacritics, and organize work in collections (plain OCR, VLM-style turns, or exam Q&A). Optionally connect Google Gemini on the server to suggest structured exam fields from the image.

Pipeline

One workspace, two kinds of AI

PaddleOCR extracts printed text for Yoruba OCR collections. Gemini (or an OpenAI-compatible model) is used only for the optional exam-paper vision assist: it suggests structured JSON that you still edit and save yourself.

PaddleOCR extraction

Run a local pipeline or a hosted Paddle API. Batch extraction is chunked for large sets, with a diacritics editor and confidence cues.
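As a rough illustration of what chunked batch extraction means, the sketch below splits a list of page images into fixed-size batches before they would be sent for OCR. The helper name `chunked`, the batch size, and the file names are assumptions made for this example, not the product's actual code.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Seven illustrative page images, processed three at a time.
pages = [f"page_{i:03d}.png" for i in range(1, 8)]
batches = list(chunked(pages, 3))
```

Smaller batches keep each request short and make it easier to retry a failed chunk without re-running the whole set.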

Gemini exam assist

Set GEMINI_API_KEY on the server to enable "Suggest from image" on exam Q&A items. Without a key, exam fields are entered manually; OCR still works.
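A minimal sketch of how a server might decide whether to offer the assist, assuming the only signal is whether GEMINI_API_KEY is set and non-empty. The function name `vision_assist_enabled` is hypothetical.

```python
import os
from typing import Mapping, Optional


def vision_assist_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when a non-empty GEMINI_API_KEY is present (hypothetical check)."""
    env = os.environ if env is None else env
    return bool(env.get("GEMINI_API_KEY", "").strip())


# With no key the assist is simply off; OCR does not depend on this check.
mode = "suggest-from-image" if vision_assist_enabled() else "manual-entry"
```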

Collections & schemas

Keep OCR, VLM multi-turn, and exam workflows separate. The workspace shows vision assist status with audited timestamps.

Exports for training

Export a ZIP of JSON/CSV files with train/val/test splits. Filter by collection and download under your session identity.
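One way such an export could be assembled is sketched below: each record is assigned to a split by hashing its ID, so the assignment is stable across runs, and the splits are then written as JSON files into an in-memory ZIP. The 80/10/10 ratio, the `id` field, the file names, and both function names are assumptions for illustration only.

```python
import hashlib
import io
import json
import zipfile
from typing import Dict, List


def assign_split(item_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map an ID to train/val/test by hash, so the same ID always lands in the same split."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def build_export(records: List[Dict[str, str]]) -> bytes:
    """Return ZIP bytes containing train.json, val.json, and test.json."""
    splits: Dict[str, List[Dict[str, str]]] = {"train": [], "val": [], "test": []}
    for record in records:
        splits[assign_split(record["id"])].append(record)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, rows in splits.items():
            # ensure_ascii=False keeps Yoruba diacritics readable in the JSON.
            archive.writestr(f"{name}.json", json.dumps(rows, ensure_ascii=False, indent=2))
    return buffer.getvalue()
```

Hash-based assignment avoids reshuffling items between splits when new pages are added later.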

Native diacritics & side-by-side editing

Quick access to tone marks and underdotted letters (á à ẹ ọ ṣ, …), with an image preview alongside while you refine transcriptions and structured payloads.
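Corrected diacritics are easier to compare and deduplicate when they share one Unicode form. The sketch below uses NFC normalization from Python's standard library; whether the product normalizes text this way is an assumption, and `normalize_yoruba` is a hypothetical name.

```python
import unicodedata


def normalize_yoruba(text: str) -> str:
    """Normalize text to NFC so equivalent diacritic sequences compare equal."""
    return unicodedata.normalize("NFC", text)


# The same letter typed with combining marks in two different orders:
# o + dot below + acute, and o + acute + dot below.
dot_then_tone = normalize_yoruba("o\u0323\u0301")
tone_then_dot = normalize_yoruba("o\u0301\u0323")
```

After normalization both inputs become the same sequence: the precomposed ọ followed by a combining acute accent, since Unicode has no single code point for ọ with a tone mark.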