Search Highlight On Wikicomplete Info
tags: dev search wikicomplete wikidot
05 Dec 2008 01:55
I just introduced this nice feature to the Wiki Complete wiki farm (based on Wikidot software).
How does it work?
Here are the Google results for the query female superheroes wikicomplete:
Locate any link to wikicomplete.info and click it. The words female, superheroes and wikicomplete should be highlighted in different colors.
The same mechanism is used for local searches.
hartnell (the main admin of Wiki Complete) really likes this feature. There is some chance we'll introduce it to Wikidot.com once the new search for Wikidot is done!
Tech
If you wonder how I did this, here is the answer: I used the Zend_Search_Lucene highlighting utility. It was quick and easy once I had parsed HTTP_REFERER to extract the actual search query.
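To make the idea concrete, here is a minimal sketch of the two steps in Python. It is not the Zend_Search_Lucene highlighter used on the site, just a stand-in. The function names, the color list, and the assumption that the search engine passes the query in a `q` URL parameter are all illustrative.

```python
import html
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical palette: one background color per search word, reused cyclically.
HIGHLIGHT_COLORS = ["#ffff66", "#a0ffff", "#99ff99", "#ff9999"]


def query_terms_from_referer(referer):
    """Return the search words found in a referrer URL, or [] if there are none."""
    params = parse_qs(urlparse(referer).query)
    # Assumption: the query travels in the "q" parameter.
    raw_query = params.get("q", [""])[0]
    return raw_query.split()


def highlight(text, terms):
    """Escape plain text for HTML and wrap every search word in a colored span."""
    color_of = {}
    for i, term in enumerate(terms):
        color_of.setdefault(term.lower(), HIGHLIGHT_COLORS[i % len(HIGHLIGHT_COLORS)])
    if not color_of:
        return html.escape(text)

    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in color_of) + r")\b",
        re.IGNORECASE,
    )
    pieces, position = [], 0
    for match in pattern.finditer(text):
        # Escape the untouched text between matches, then emit the colored word.
        pieces.append(html.escape(text[position:match.start()]))
        color = color_of[match.group(0).lower()]
        pieces.append(
            '<span style="background:%s">%s</span>' % (color, html.escape(match.group(0)))
        )
        position = match.end()
    pieces.append(html.escape(text[position:]))
    return "".join(pieces)
```

A page handler would call `query_terms_from_referer` with the value of the Referer header and pass the result to `highlight` before rendering. A visitor who arrives without a search query gets an empty word list and unhighlighted text.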
Summary
I had seen this feature on some random pages I came across. I hope you like having it on WikiComplete!
Comments: 3
New Search For Wikidot
tags: dev fulltext lucene php search wikidot zend
03 Dec 2008 20:58
As some of you already noticed, we now use Google Custom Search for searching the whole Wikidot. The reason for this was quite obvious.
We now host 2 million pages, and the full text search engine we had been using was not fast enough to satisfy a regular user. The average search time (across all wikis) was about 30 seconds.
Google searches the whole Wikidot in less than 1 second. The downsides of using the Google engine are:
- using an external service (a matter of prestige and dependence)
- displaying ads on search results (for those who don't use AdBlock)
- pages get indexed only after a significant delay
- only public wikis can be indexed
One important thing is that Google indexes all content on a site. This includes the Wikidot.com footer, menus, wiki header, the real content and tags. All of this is treated in an unknown way, so we have no real influence on how Google weighs different portions of a page.
This leads to the conclusion that we need our own search engine.
We would like:
- to treat tags as more important than the regular content
- not to search Wikidot.com static elements (like the footer on every page)
- to allow searching all wikis available to a given user
- all public wikis
- all private wikis that the user is a member of
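As a rough illustration of these requirements, the sketch below ranks pages with tags counting more than body text and restricts results to wikis the user may see. Everything in it is hypothetical: the page dictionaries, the `TAG_WEIGHT` value and the function names are invented for the example. Static elements such as the footer are kept out simply by never putting them into the `content` field.

```python
# Hypothetical boost: one matching tag counts as much as three matches in the body.
TAG_WEIGHT = 3.0


def searchable_wikis(wikis, memberships):
    """All public wikis plus the private wikis the user is a member of.

    `wikis` maps a wiki name to True (public) or False (private).
    """
    return {name for name, is_public in wikis.items() if is_public or name in memberships}


def score(page, term):
    """Count matches of the term in the page body, with tag matches weighted higher."""
    term = term.lower()
    content_hits = page["content"].lower().split().count(term)
    tag_hits = sum(1 for tag in page["tags"] if tag.lower() == term)
    return content_hits + TAG_WEIGHT * tag_hits


def search(pages, term, wikis, memberships):
    """Return titles of matching pages from visible wikis, best score first."""
    allowed = searchable_wikis(wikis, memberships)
    hits = [(score(page, term), page["title"]) for page in pages if page["wiki"] in allowed]
    ranked = sorted(hits, key=lambda hit: -hit[0])
    return [title for value, title in ranked if value > 0]
```

A real engine would use a proper relevance formula. The point here is only that tag weighting and access filtering become easy once we control the index ourselves.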
Coming to the technical details, one could say we just need a generic full text search engine. We can use the one available in our storage system or one of the dedicated search-only engines.
Tsearch
Tsearch, the full text search engine for PostgreSQL (the storage used for Wikidot data), is currently used when searching within a wiki. It is nicely integrated and works well. But when there are over 20,000 wikis to search (counting only non-spam public ones), it is not fast enough.
Lucene
Lucene, one of the most popular search engines for Java, is one of the possible choices. It works as follows:
- the application adds some documents to the search index
- a document is a webpage in our situation
- the index is a datastore to be used when searching
- the user queries the index with a query
- a bunch of documents is returned in order of relevance
- the documents returned are more or less the same as the documents added to the index earlier
This mechanism requires the application to populate the index:
- updating the index every now and then
- updating a document on every change, such as a page edit
It also requires us to define some functions that deal with search results:
- the original webpage is usually not stored in the index, only tokenized, so that it can be found when searching for any word in the document
- this makes the index smaller and faster
- this means we need to store an additional ID for each webpage, so that the full result can be retrieved from the database by that ID
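The mechanism described above can be sketched with a toy inverted index. This is not Lucene's API or its file format. `TinyIndex` and its methods are made up for illustration, and the ranking is a plain word count. The index stores only tokens and page IDs, so the full page has to be fetched from the database by ID afterwards.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each token to the page IDs that contain it."""

    def __init__(self):
        # token -> {page_id: number of occurrences on that page}
        self.postings = defaultdict(dict)

    @staticmethod
    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    def remove(self, page_id):
        """Forget a page, for example before re-indexing it after an edit."""
        for counts in self.postings.values():
            counts.pop(page_id, None)

    def add(self, page_id, text):
        """Index a page; only tokens and the ID are kept, not the original text."""
        self.remove(page_id)
        for token in self.tokenize(text):
            counts = self.postings[token]
            counts[page_id] = counts.get(page_id, 0) + 1

    def search(self, query):
        """Return page IDs ordered by how often the query words occur on them."""
        scores = defaultdict(int)
        for token in self.tokenize(query):
            for page_id, count in self.postings.get(token, {}).items():
                scores[page_id] += count
        return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


# Usage: the "database" holds the full pages, the index holds only tokens and IDs.
database = {1: "Lucene is a search library", 2: "Lucene index, Lucene query"}
index = TinyIndex()
for page_id, text in database.items():
    index.add(page_id, text)
results = [database[page_id] for page_id in index.search("lucene")]
```

Calling `add` again with the same ID after a page edit replaces the old entry, which corresponds to the "updating a document on every change" step.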
Nutch
Nutch would be a different, more Google-like approach to the search issue. Nutch mainly indexes HTML files (at given URLs) and crawls through their links. This has both advantages and disadvantages.
The main advantage of using Nutch is that as a search result we get a formatted HTML document:
- with links to items found
- with context of the search phrase quoted
- the search phrase words outlined in some way
This is very similar to what we get searching for some phrase with Google.
What I don't like about Nutch is the rather big overhead of populating the index. A page must be compiled by the server to produce HTML. Then the same HTML must be parsed by the search engine to extract the important data. A lot of information is generated by the server and then discarded by the search engine.
Nutch (like Lucene) is a Java project and requires a Java environment. This may or may not be a problem, but it is a point we must consider when looking for the optimal solution.
There is also the OpenSearch project, which aims to make Nutch results more interchangeable (by exporting them as RSS feeds). Using it, a PHP application can safely ask an HTTP service for results and get an RSS feed to parse and present to the user.
Zend_Search_Lucene
There is also a quite nice option around: Zend_Search_Lucene. It is a search engine written entirely in PHP and is part of the Zend Framework. The internal format of the search index file is compatible with Lucene, which is where the name of the package comes from. The query language is also the same (or very similar).
One would expect a PHP implementation to be really slow when searching really big data sets, but in our tests we got search results for almost any query in about 1 second, searching almost the whole of Wikidot.
I think this is a really nice solution, because it can integrate well with the existing PHP code of Wikidot. The searching can also be easily parallelized across many machines. For example, you can have 4 search machines, each handling 1/4 of the search queries. This does not reduce the time of a single search, but it avoids running many queries against one index at once.
There are some options to consider when distributing search queries across machines. We can select the machine to perform the search at random, in turn (round-robin) or by search hash. The search hash approach computes an MD5 sum (or another hash) of the query, treats it as an integer, takes the remainder of dividing it by the number of machines, and assigns the search to the machine with that number. This means the same query is always performed on the same machine (so it can be better cached or optimized).
The Zend implementation of Lucene is also really easy to understand and use, so it seems a good start to me. Testing it on the whole public part of Wikidot, I got an index of about 500 MB. Adding a single page to an index of this size takes about 2 seconds; searching takes about 1 second.
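The search hash idea fits in a few lines. The sketch below is in Python rather than PHP, and the function name and the choice to hash the query exactly as typed (no lowercasing or trimming) are assumptions made for the example.

```python
import hashlib


def machine_for_query(query, machine_count):
    """Pick a search machine (numbered from 0) from the MD5 hash of the query."""
    digest = hashlib.md5(query.encode("utf-8")).hexdigest()
    # Treat the hexadecimal digest as one big integer and take the remainder.
    return int(digest, 16) % machine_count


# The same query always lands on the same machine, so that machine can cache it.
first = machine_for_query("female superheroes", 4)
second = machine_for_query("female superheroes", 4)
```

One design note: with this scheme, changing the number of machines reassigns most queries to different machines, so any per-machine caches start cold after such a change.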
Sphinx
When I asked a friend which full text search engines he recommends, he pointed out Sphinx, a standalone application for this purpose. It is not very popular software (it hasn't found its way into the Ubuntu repository, for example), but it seems very interesting.
Sphinx can be fed XML streams of data from any application, can fetch data from PostgreSQL or MySQL databases, or can be accessed via its API and client libraries for many languages.
It seems somewhat similar to Lucene, but implemented in a traditional compiled language (C++) rather than Java.
The choice
There are probably other solutions worth trying, but I think the most appropriate for now is the Zend one, as it is the easiest to adapt. We can optionally add some caching and query distribution. We also need more testing of situations that may arise (like the need to perform 100 queries simultaneously).
Comments: 7
To Cyprus
tags: cypr cyprus polish
01 Dec 2008 22:26

Time to reveal my plans.
At some point in the foreseeable future I am heading to Cyprus. For at least a month (preferably a honeymoon), but in the best case for the rest of my life.
Why there? It is warm there, the country is English-speaking, there is some work, it is warm, the sea is close, the mountains are close, and it is warm. The island is small, so everything is nearby, and you can ride a scooter, because it is warm.
And they have a nice radio station that is full of static and cycles through 3 languages: English, Greek and Turkish. When they announce theatre performances, they always say which language the show will be in. Interestingly, quite a few of them are in French.
In the internet age it does not matter much where I work. What matters is for whom, and whether I do it well. Though if things go well, I might manage to set up a company there: the taxes are simple and affordable and, well, it is warm.
And when it is warm, there is less stress. Getting kicked out of your flat is not a great nightmare, because before you move in somewhere else you can sleep in the car or on the beach, because it is warm. You don't need to buy or carry tons of clothes, because it is warm. And when it is too warm? An hour's drive (from anywhere in Cyprus) and I am at the sea.
Cyprus is my dream island. Lately I have had a picture of it in my room, which I am very happy about.
Who's joining?