Saturday, March 28, 2026

Fixing Urdu and Arabic Search in Koha’s Elasticsearch

Optimizing Elasticsearch for Sirah Research

Breaking the Language Barrier for Urdu and Arabic in Koha 25.11

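The Problem: Out of the box, Elasticsearch's standard analyzer applies no language-specific handling to Urdu or Arabic text. It does not normalize variant character forms or strip diacritics (Harakat), so a vocalized and an unvocalized spelling of the same word are indexed as different terms and do not match each other.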

For a specialized library like the Sirah Research Hub, Elasticsearch is a far better fit than the legacy Zebra engine, but only if you install the "Missing Piece."

1. The Missing Piece: The ICU Analysis Plugin

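The ICU (International Components for Unicode) analysis plugin is vital. Its icu_folding token filter strips diacritics such as Zer, Zabar, and Pesh at both index time and search time, so a user finds the Prophet's name whether or not the query (or the catalog record) is vocalized.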
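To make the folding idea concrete, here is a minimal Python sketch that approximates it with the standard library. It is not the ICU implementation: it simply decomposes the text and drops every combining mark, which is cruder than icu_folding. The function name strip_harakat is hypothetical.

```python
import unicodedata


def strip_harakat(text: str) -> str:
    """Drop combining marks (Unicode category Mn), e.g. zabar, pesh, zer.

    Rough approximation only: it also removes marks such as the madda
    on alef, so it should not be read as a copy of ICU's folding rules.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


# "Muhammad" written with pesh (U+064F), zabar (U+064E) and shadda (U+0651)
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
# The same word with no diacritics at all
plain = "\u0645\u062D\u0645\u062F"

print(strip_harakat(vocalized) == plain)  # True
```

Once both spellings reduce to the same string, they produce the same index term, which is why a search for either form can find the record.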


# Run this on your Koha server terminal:
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
sudo systemctl restart elasticsearch
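After the restart, it is worth confirming that the node actually loaded the plugin. The sketch below checks a plugin listing for analysis-icu; the listing can come from Elasticsearch's `_cat/plugins` endpoint or from `elasticsearch-plugin list`. The base URL, the absence of authentication, and the sample node name and version are all assumptions for illustration, and a secured cluster would need HTTPS and credentials.

```python
from urllib.request import urlopen


def fetch_plugin_listing(base_url: str = "http://localhost:9200") -> str:
    """Return the plain-text body of GET _cat/plugins (assumed URL, no auth)."""
    with urlopen(f"{base_url}/_cat/plugins", timeout=5) as response:
        return response.read().decode("utf-8")


def has_icu_plugin(listing: str) -> bool:
    """True if any line of the listing names the analysis-icu plugin."""
    return any("analysis-icu" in line.split() for line in listing.splitlines())


# Illustrative listing only: the node name and version are made up.
sample = "koha-node analysis-icu 7.17.0\n"
print(has_icu_plugin(sample))  # True
```

On a live server, `has_icu_plugin(fetch_plugin_listing())` would perform the same check against the running node.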

2. Configure the Arabic/Urdu Analyzer

Once the plugin is active, tell Koha to use it. Navigate to:

Koha Admin → Search Engine Configuration (Elasticsearch)

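Ensure your multilingual fields use either the icu_analyzer (provided by the plugin; it normalizes and folds characters but does not stem) or Elasticsearch's built-in arabic analyzer, which adds Arabic normalization and stemming so that inflected forms of a word match a common stem. Analyzer changes only affect records indexed afterwards, so rebuild the index once the configuration is in place (for example with koha-elasticsearch --rebuild).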
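As a rough illustration of what such an analyzer looks like at the Elasticsearch level, the sketch below builds a generic index-settings body that combines the plugin's icu_tokenizer and icu_folding with the built-in arabic_normalization filter and an Arabic stemmer. This is not Koha's shipped configuration: the names urdu_arabic_text and arabic_stemmer are hypothetical labels, and how the settings are wired into a particular Koha instance is left open. The Arabic stemmer is designed for Arabic, so its effect on Urdu vocabulary should be tested before relying on it.

```python
import json

# Generic Elasticsearch analysis settings (a sketch, not Koha's own config).
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            # Hypothetical label for the built-in stemmer filter, Arabic variant
            "arabic_stemmer": {"type": "stemmer", "language": "arabic"},
        },
        "analyzer": {
            # Hypothetical analyzer name
            "urdu_arabic_text": {
                "type": "custom",
                "tokenizer": "icu_tokenizer",      # from analysis-icu
                "filter": [
                    "icu_folding",                 # from analysis-icu: strips diacritics
                    "arabic_normalization",        # built-in: unifies letter variants
                    "arabic_stemmer",              # defined above
                ],
            },
        },
    }
}

print(json.dumps(ANALYSIS_SETTINGS, indent=2))
```

Folding runs first so that the normalization and stemming steps see text that is already free of diacritics.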

Zebra vs. Elasticsearch: The Sirah Hub Verdict

Feature             | Zebra (Legacy)                    | Elasticsearch (Recommended)
Urdu/Arabic Support | Requires manual .chr file hacking | Native with ICU Plugin
Performance         | Slow with 14,000+ records         | Sub-millisecond response
Fuzzy Searching     | Basic/Limited                     | Advanced (Finds "Sirah" vs "Seerah")
Scalability         | Fixed/Rigid                       | Built for Big Data

💡 Academic Impact: Elasticsearch's fuzzy matching is a lifesaver for researchers. It tolerates small spelling differences, measured in single-character edits, so varying transliterations of a name (e.g., Shibli vs. Shebli) still lead to the correct scholarly records.
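The transliteration examples can be checked by hand with a small edit-distance sketch. It uses plain Levenshtein distance (Elasticsearch also counts a transposition of adjacent characters as one edit by default, which this sketch does not) together with the length thresholds of Elasticsearch's AUTO fuzziness setting. Whether a given Koha search actually applies fuzziness depends on how the query is built, so treat this as an illustration of the matching rule only.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def auto_fuzziness(term: str) -> int:
    """Edits allowed by fuzziness AUTO: 0 for 1-2 chars, 1 for 3-5, 2 beyond."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


print(edit_distance("shibli", "shebli"))  # 1
print(edit_distance("sirah", "seerah"))   # 2
print(auto_fuzziness("sirah"), auto_fuzziness("seerah"))  # 1 2
```

Under these thresholds, "Shibli" and "Shebli" match in either direction. "Sirah" and "Seerah" are two edits apart, so a six-letter query for "seerah" can reach "sirah", while a five-letter query for "sirah" is allowed only one edit and would need an explicit fuzziness of 2 to reach "seerah".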

Action Step: Verify with your technical team whether analysis-icu is installed. Without it, the Hub is limited; with it, it is world-class.
