We run MediaWiki 1.43 internally, and I first draft my Zimbra wiki articles there before moving them over to Zimbra’s public wiki. That workflow has been effective, but I wanted to experiment with a natural language search interface. A key requirement was keeping the source links to wiki articles so we could click directly to the references under the answers.
In later iterations, I added a temperature (creativity) slider to control answer variability and a prompt history so identical questions return the same answers. I also tested different embedding models, including VoyageAI’s voyage-context-3, which claimed up to 80% higher efficiency for chunk embeddings. However, for our dataset it performed worse, so I standardized on OpenAI’s text-embedding-3-small for vector generation and GPT-4o for the search interface.
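For anyone curious what the embedding step looks like, here is a minimal sketch using the official openai Python SDK. The model name matches what I settled on above, but the helper function and sample text are just for illustration; the actual code in the repo may differ:

```python
# Minimal sketch of embedding one wiki chunk with text-embedding-3-small.
# Assumes the official openai Python SDK with OPENAI_API_KEY set in the
# environment; embed_chunk() is illustrative, not the tool's real code.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def embed_chunk(text: str) -> list[float]:
    """Return the embedding vector for one chunk of wiki text."""
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return resp.data[0].embedding  # 1536-dimensional float vector

vector = embed_chunk("To renew a Let's Encrypt certificate on Zimbra ...")
print(len(vector))  # 1536
```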
For cost, I started with a $5 token budget. After a few weeks of heavy testing and crawling, usage was only about $0.80. I also tested GPT-5, but in my limited trials it didn’t yield better results than GPT-4o.
The tool works as follows:
1) A crawler connects to your MediaWiki and builds a local vector database (sketched below).
2) A front-end takes prompts and returns answers from that local vector database.
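Roughly, the crawler in step 1 does something like the following: list every page through the MediaWiki Action API, pull the plain text, split it into chunks, and embed each chunk. The endpoint URL, chunk size, and helper names here are assumptions for illustration; the real crawler in the repo may work differently:

```python
# Illustrative crawler sketch: enumerate pages via the MediaWiki Action API
# and fetch their plain text. Logging in with an account/bot password is
# omitted here; fetch_plaintext() assumes the TextExtracts extension.
import requests

API = "https://wiki.example.internal/api.php"  # your MediaWiki endpoint

def iter_page_titles(session: requests.Session):
    """Yield every page title using list=allpages with continuation."""
    params = {"action": "query", "list": "allpages",
              "aplimit": "max", "format": "json"}
    while True:
        data = session.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the API left off

def fetch_plaintext(session: requests.Session, title: str) -> str:
    """Fetch one page as plain text (requires the TextExtracts extension)."""
    params = {"action": "query", "prop": "extracts", "explaintext": 1,
              "titles": title, "format": "json"}
    pages = session.get(API, params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def chunk(text: str, size: int = 1000) -> list[str]:
    """Naive fixed-size chunking; a real splitter would respect headings."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

Each chunk is then embedded (as in the earlier snippet) and stored alongside its source URL, which is what lets the front-end show clickable references under every answer.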
The local vector database is 4.1 MB, while the wiki site itself is just over 2 GB, so the difference in disk footprint is substantial.
Hallucination should be minimal because RAG constrains the model: it is only allowed to answer from the chunks retrieved from our MediaWiki, rather than from its general training data.
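In sketch form, that constraint is just a retrieval step plus a strict system prompt. Again, the function names, storage layout, and similarity search below are assumptions for illustration rather than the repo's actual implementation:

```python
# Sketch of the RAG answer step: find the most similar stored chunks,
# then force GPT-4o to answer only from them, listing the source wiki
# URLs under the answer. chunks/vectors come from the crawler sketch.
import numpy as np
from openai import OpenAI

client = OpenAI()

def top_k(query_vec, vectors, k: int = 4):
    """Cosine similarity over stored chunk vectors; return the best k indices."""
    v, q = np.asarray(vectors), np.asarray(query_vec)
    sims = v @ q / (np.linalg.norm(v, axis=1) * np.linalg.norm(q))
    return np.argsort(sims)[::-1][:k]

def answer(question: str, chunks: list[dict], vectors) -> str:
    """chunks: [{'text': ..., 'url': ...}, ...] built by the crawler."""
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    hits = [chunks[i] for i in top_k(q_vec, vectors)]
    context = "\n\n".join(f"Source: {c['url']}\n{c['text']}" for c in hits)
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,  # this is what the creativity slider would set
        messages=[
            {"role": "system",
             "content": "Answer ONLY from the provided wiki excerpts. "
                        "If the answer is not in them, say you don't know. "
                        "List the Source URLs you used under the answer."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```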
The project is here: https://github.com/JimDunphy/media-wiki-ai-search
It should NOT be used to crawl Zimbra's wiki pages, or anyone else's for that matter, since most sites have countermeasures, as any sensible website does these days, to keep bots from causing harm. In addition, you need an account and password on your own MediaWiki. I also have some management scripts in my GitHub to install MediaWiki 1.43 with a single command, so it's easy to get started if you want to investigate this.
The next step is to gather information, use cases, and requirements to determine what could be done for archival, or even a better search interface, for huge mailboxes with hundreds of GB of data.
What surprised me about this project with MediaWiki?
I have two long-winded wiki articles on Let's Encrypt. I asked the model for a recipe for installing a Let's Encrypt certificate, and it produced a perfect step-by-step answer without missing a single step. Mailboxes contain a lot of redundant threads and similar content, so this may not map directly, but I am curious, given that I initially didn't expect this kind of search to be all that useful for the MediaWiki content.
If anyone else is working on this, I would be interested to hear about your experiences, or about ideas you are attempting or have considered.
Jim
