QuechuaBase.org

Oct 1, 2025 · Elwin Huaman
Image credit: Elwin Huaman

Our Mission

Quechuabase.org is an open-source, community-driven digital effort dedicated to the preservation, documentation, and revitalization of the Quechua language family. By leveraging modern database architecture and collaborative linguistics, we aim to provide a centralized hub for researchers, educators, and native speakers to access high-quality linguistic data.

The Problem

Despite being the most widely spoken indigenous language family in the Americas, Quechua suffers from digital under-representation. Fragmented dialects, inconsistent orthography, and a lack of structured datasets make it difficult for developers to build effective AI tools, translation software, or educational apps.

Quechua Base Campaign 2025

Our Approach

Quechuabase.org serves as a Linguistic Data Knowledge Graph. Our project features:

  • Knowledge Graph: A searchable knowledge base mapping variations between Southern, Central, and Northern Quechua.
  • Corpus Architecture: A curated collection of written texts and transcribed sentences for Natural Language Processing (NLP) training.
  • Open API: A gateway for developers to integrate Quechua vocabulary and grammar rules into responsible applications.
  • Community Validation: A “wiki-style” verification layer where native speakers can validate and refine entries to ensure cultural and linguistic accuracy.
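
The knowledge-graph and community-validation features above can be pictured with a small in-memory sketch. Nothing here reflects the actual Quechuabase.org schema or API: the `VariantEntry` class, its fields, the `search` helper, and the two-approval threshold are all hypothetical. The sample forms for "four" (tawa in Southern Quechua, chusku in Central and Northern varieties) are illustrative only and are not drawn from the project's data.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: approvals needed before an entry counts as verified.
REQUIRED_APPROVALS = 2


@dataclass
class VariantEntry:
    """One concept with its written form in each dialect group (hypothetical schema)."""
    gloss: str
    forms: dict[str, str]                              # dialect group -> written form
    approvals: set[str] = field(default_factory=set)   # IDs of speakers who validated

    def validate(self, speaker_id: str) -> None:
        """Record that a native speaker has confirmed this entry."""
        self.approvals.add(speaker_id)

    @property
    def verified(self) -> bool:
        """True once enough distinct speakers have confirmed the entry."""
        return len(self.approvals) >= REQUIRED_APPROVALS


def search(entries: list[VariantEntry], form: str) -> list[VariantEntry]:
    """Return every entry in which `form` appears for any dialect group."""
    return [e for e in entries if form in e.forms.values()]


# Illustrative sample entry, not taken from the project's dataset.
entries = [
    VariantEntry("four", {"Southern": "tawa", "Central": "chusku", "Northern": "chusku"}),
]

hit = search(entries, "chusku")[0]
hit.validate("speaker-01")
hit.validate("speaker-02")
print(hit.gloss, hit.forms["Southern"], hit.verified)
```

Storing approvals as a set means the same speaker cannot push an entry over the threshold by validating it twice, which is one simple way a wiki-style layer might guard against accidental double counting.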

How to contribute

🚀 The Andes go Digital: Introducing Quechuabase.org 🏔️

Quechua is more than a language; it is a worldview. But in the world of Big Data and AI, indigenous voices are often left behind.

We’re changing that.

Introducing Quechuabase.org, the first comprehensive, open-source knowledge base designed to power the next generation of Quechua digital tools.

Whether you are a linguist, a developer, or a Quechua enthusiast, our platform is built for you.

Some Results in 2025

We have successfully integrated Puno Quechua into the global Common Voice ecosystem. The contributed datasets are categorized into two primary types:

Reading (Scripted) Speech:

This portion of the dataset consists of volunteers reading aloud a set of pre-approved, CC0-licensed sentences.

  • Corpus Size: We collected and uploaded 2,065 sentences for the platform, which represents 11.6% of the total Quechua sentences available on Common Voice.
  • Objective: These recordings are made in noise-free environments with clear pronunciation to provide high-quality training data for standard voice applications.
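
As a quick sanity check on the corpus-size figure, the two numbers above imply a total of roughly 17,800 Quechua sentences on Common Voice. The sketch below only rearranges the stated percentage. The resulting total is an inference from rounded figures, not a number reported by the project.

```python
contributed = 2065   # sentences uploaded, as stated in the text
share = 0.116        # stated share of all Quechua sentences (11.6%)

# share = contributed / total, so total = contributed / share
implied_total = contributed / share
print(round(implied_total))
```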

Spontaneous Speech:

This represents a more advanced contribution type designed to capture how the language is used in real-world, casual conversation.

  • Methodology: We uploaded 150 open-ended questions specifically focused on the Agricultural and Food domain for community members to answer freely.
  • Linguistic Features: These recordings capture natural speaking patterns, including accents, hesitations, repetitions, and interjections such as “mm” and “ahh”.
  • Code-Switching Mitigation: To minimize the frequent mixing of Quechua and Spanish (code-switching), we centered the questions on culturally vibrant subjects such as farming and daily life.

Authors

Elwin Huaman
Research Scientist
Elwin Huaman is a Researcher, Project Manager, and Activist for Under-Resourced Languages. He has experience creating Knowledge Graphs and applying Semantic Web technologies in academia and industry. He co-authored the book Knowledge Graphs: Methodology, Tools and Selected Use Cases, and leads the QICHWABASE Knowledge Graph, which supports a harmonization process for the language and knowledge of Quechua communities across the world.