Yoruba OCR · structured datasets · human review
Build Yoruba LLM training data from scans
Upload page images, run PaddleOCR for text extraction, correct diacritics, and organize work in collections (plain OCR, VLM-style turns, or exam Q&A). Optionally connect Google Gemini on the server to suggest structured exam fields from the image.
Pipeline
One workspace, two kinds of AI
PaddleOCR handles printed text for OCR Yoruba collections. Gemini (or an OpenAI-compatible model) is used only for the optional exam-paper vision assist: it suggests structured JSON that you still edit and save yourself.
PaddleOCR extraction
Run a local pipeline or a hosted Paddle API. Batch extraction is chunked for large sets, and results come with a diacritics editor and confidence cues.
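As a rough illustration of chunked batch extraction, the sketch below splits a list of page paths into fixed-size chunks and runs an extractor over each chunk. The names `chunked`, `extract_in_chunks`, `extract_page`, and the default chunk size are assumptions made for this example, not the app's actual code; `extract_page` stands in for whatever call wraps PaddleOCR locally or over the hosted API.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def extract_in_chunks(
    paths: Sequence[str],
    extract_page: Callable[[str], str],  # hypothetical OCR wrapper
    chunk_size: int = 16,                # assumed default, tune per deployment
) -> Dict[str, str]:
    """Run `extract_page` over every path, one chunk at a time."""
    results: Dict[str, str] = {}
    for chunk in chunked(paths, chunk_size):
        # A real pipeline could persist or report progress after each chunk.
        for path in chunk:
            results[path] = extract_page(path)
    return results
```

Processing one chunk at a time keeps memory bounded and gives a natural point to save progress on large sets.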
Gemini exam assist
Set GEMINI_API_KEY on the server to enable "Suggest from image" on exam Q&A items. Without a key, exam fields are entered manually; OCR still works.
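The key check could look roughly like the sketch below, which only reads the environment and never calls Gemini. The function name `vision_assist_enabled` is hypothetical; only the GEMINI_API_KEY variable comes from the description above.

```python
import os
from typing import Mapping, Optional


def vision_assist_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when a non-empty GEMINI_API_KEY is configured."""
    env = os.environ if env is None else env
    # Treat a missing, empty, or whitespace-only key as "not configured".
    return bool(env.get("GEMINI_API_KEY", "").strip())


if __name__ == "__main__":
    mode = "suggest from image" if vision_assist_enabled() else "manual entry only"
    print(f"Exam assist mode: {mode}")
```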
Collections & schemas
Keep OCR, VLM multi-turn, and exam workflows in separate collections. The workspace shows vision assist status with audited timestamps.
Exports for training
Export a ZIP with JSON/CSV and train/val/test splits. Filter by collection and download under your session identity.
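One way to produce reproducible train/val/test splits is a seeded shuffle followed by slicing. This is a minimal sketch under assumed defaults (80/10/10 ratios, seed 0); the function name `split_items` and both defaults are illustrative, not the app's actual export settings.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_items(
    items: Sequence[str],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed ratios
    seed: int = 0,                                          # assumed seed
) -> Dict[str, List[str]]:
    """Shuffle deterministically, then slice into train/val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        # Whatever remains after train and val goes to test.
        "test": shuffled[n_train + n_val:],
    }
```

Fixing the seed means the same collection exports the same split each time, which keeps evaluation sets stable across downloads.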
Quick access to tone marks (á à ẹ ọ ṣ, …) and image preview while you refine transcriptions and structured payloads.
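Yoruba letters such as ẹ, ọ, and ṣ can be encoded either as one precomposed character or as a base letter plus combining marks, so corrected transcriptions are easier to compare and deduplicate when normalized to a single form. The sketch below uses Python's standard `unicodedata.normalize` with NFC; choosing NFC and the helper name `normalize_yoruba` are assumptions for illustration, not a documented behavior of the editor.

```python
import unicodedata


def normalize_yoruba(text: str) -> str:
    """Normalize text to NFC so equivalent diacritic sequences compare equal."""
    return unicodedata.normalize("NFC", text)


# "s" followed by combining dot below (U+0323) composes to the single character ṣ.
decomposed = "s\u0323"
composed = normalize_yoruba(decomposed)
print(len(decomposed), len(composed))
```

Letters that carry both a dot below and a tone mark, such as ẹ́, have no single precomposed code point, so NFC leaves the tone mark as a combining character after the dotted base letter.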