Documentation
Sphinx 0.9.9 reference manual
Sphinx is a full-text search engine, distributed under GPL version 2.
Commercial licensing (eg. for embedded use) is also available upon request.
Generally, it's a standalone search engine, meant to provide fast,
size-efficient and relevant full-text search functions to other
applications. Sphinx was specially designed to integrate well with
SQL databases and scripting languages.
Currently built-in data source drivers support fetching data either via
direct connection to MySQL, or PostgreSQL, or from a pipe in a custom XML
format. Adding new drivers (eg. to natively support some other DBMSes)
is designed to be as easy as possible.
Search API is natively ported to PHP, Python, Perl, Ruby, Java, and
also available as a pluggable MySQL storage engine. API is very
lightweight so porting it to a new language is known to take a few hours.
As for the name, Sphinx is an acronym which is officially decoded
as SQL Phrase Index. Yes, I know about CMU's Sphinx project.
- high indexing speed (up to 10 MB/sec on modern CPUs);
- high search speed (avg query is under 0.1 sec on 2-4 GB text collections);
- high scalability (up to 100 GB of text, up to 100 M documents on a single CPU);
- provides good relevance ranking through combination of phrase proximity ranking and statistical (BM25) ranking;
- provides distributed searching capabilities;
- provides document excerpts generation;
- provides searching from within MySQL through pluggable storage engine;
- supports boolean, phrase, and word proximity queries;
- supports multiple full-text fields per document (up to 32 by default);
- supports multiple additional attributes per document (ie. groups, timestamps, etc);
- supports stopwords;
- supports both single-byte encodings and UTF-8;
- supports English stemming, Russian stemming, and Soundex for morphology;
- supports MySQL natively (MyISAM and InnoDB tables are both supported);
- supports PostgreSQL natively.
Sphinx is available through its official Web site at http://www.sphinxsearch.com/.
Currently, Sphinx distribution tarball includes the following software:
- indexer: a utility which creates fulltext indexes;
- search: a simple command-line (CLI) test utility which searches through fulltext indexes;
- searchd: a daemon which enables external software (eg. Web applications) to search through fulltext indexes;
- sphinxapi: a set of searchd client API libraries for popular Web scripting languages (PHP, Python, Perl, Ruby);
- spelldump: a simple command-line tool to extract the items from an ispell dictionary to help customize your index, for use with wordforms.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License,
or (at your option) any later version. See COPYING file for details.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
If you don't want to be bound by GNU GPL terms (for instance,
if you would like to embed Sphinx in your software, but would not
like to disclose its source code), please contact
the author to obtain
a commercial license.
1.5. Author and contributors
Author
Sphinx initial author and current primary developer is:
Contributors
People who contributed to Sphinx and their contributions (in no particular order) are:
- Robert "coredev" Bengtsson (Sweden), initial version of PostgreSQL data source;
- Len Kranendonk, Perl API
- Dmytro Shteflyuk, Ruby API
Many other people have contributed ideas, bug reports, fixes, etc.
Thank you!
Sphinx development was started back in 2001, because I didn't manage
to find an acceptable search solution (for a database driven Web site)
which would meet my requirements. Actually, each and every important aspect was a problem:
- search quality (ie. good relevance)
- statistical ranking methods performed rather badly, especially on large collections of small documents (forums, blogs, etc)
- search speed
- especially if searching for phrases which contain stopwords, as in "to be or not to be"
- moderate disk and CPU requirements when indexing
- important in shared hosting environment, not to mention the indexing speed.
Despite the amount of time passed and numerous improvements made in the
other solutions, there's still no solution which I personally would
be eager to migrate to.
Considering that and a lot of positive feedback received from Sphinx users
over the last few years, the obvious decision is to continue developing Sphinx
(and, eventually, to take over the world).
Most modern UNIX systems with a C++ compiler should be able
to compile and run Sphinx without any modifications.
Currently known systems Sphinx has been successfully running on are:
- Linux 2.4.x, 2.6.x (various distributions)
- Windows 2000, XP
- FreeBSD 4.x, 5.x, 6.x
- NetBSD 1.6, 3.0
- Solaris 9, 11
- Mac OS X
CPU architectures known to work include X86, X86-64, SPARC64.
I hope Sphinx will work on other Unix platforms as well.
If the platform you run Sphinx on is not in this list,
please do report it.
At the moment, the Windows version of Sphinx is not intended to be used
in production, but rather for testing and debugging only. The two most prominent
issues are missing concurrent queries support (client queries are stacked
on TCP connection level instead), and missing index data rotation support.
There are successful production installations which work around these issues.
However, running high-volume search service under Windows
is still not recommended.
On UNIX, you will need the following tools to build
and install Sphinx:
- a working C++ compiler. GNU gcc is known to work.
- a good make program. GNU make is known to work.
On Windows, you will need Microsoft Visual C/C++ Studio .NET 2003 or 2005.
Other compilers/environments will probably work as well, but for the
time being, you will have to build makefile (or other environment
specific project files) manually.
2.3. Installing Sphinx on Linux
Extract everything from the distribution tarball (haven't you already?)
and go to the sphinx subdirectory:
$ tar xzvf sphinx-0.9.8.tar.gz
$ cd sphinx
Run the configuration program:
$ ./configure
There are a number of options to configure. The complete listing may
be obtained by using the --help switch. The most important ones are:
- --prefix, which specifies where to install Sphinx, such as --prefix=/usr/local/sphinx (all of the examples use this prefix);
- --with-mysql, which specifies where to look for MySQL include and library files, if auto-detection fails;
- --with-pgsql, which specifies where to look for PostgreSQL include and library files.
Build the binaries:
$ make
Install the binaries in the directory of your choice (defaults to /usr/local/bin/ on *nix systems, but is overridden with configure --prefix):
$ make install
2.4. Installing Sphinx on Windows
Installing Sphinx on a Windows server is often easier than installing on a Linux environment; unless you are preparing code patches, you can use the pre-compiled binary files from the Downloads area on the website.
Extract everything from the .zip file you have downloaded - sphinx-0.9.8-win32.zip (or sphinx-0.9.8-win32-pgsql.zip if you need PostgreSQL support as well). You can use Windows Explorer in Windows XP and up to extract the files, or a freeware package like 7Zip to open the archive.
For the remainder of this guide, we will assume that the folders are unzipped into C:\Sphinx, such that searchd.exe can be found in C:\Sphinx\bin\searchd.exe. If you decide to use any different location for the folders or configuration file, please change it accordingly.
Install the searchd system as a Windows service:
C:\Sphinx> C:\Sphinx\bin\searchd --install --config C:\Sphinx\sphinx.conf --servicename SphinxSearch
The searchd service will now be listed in the Services panel within the Management Console, available from Administrative Tools. It will not have been started, as you will need to configure it and build your indexes with indexer before starting the service. A guide to do this can be found under Quick tour.
2.5. Known installation issues
If configure fails to locate MySQL headers and/or libraries,
try checking for and installing mysql-devel package. On some systems,
it is not installed by default.
If make fails with a message which looks like
/bin/sh: g++: command not found
make[1]: *** [libsphinx_a-sphinx.o] Error 127
try checking for and installing gcc-c++ package.
If you are getting compile-time errors which look like
sphinx.cpp:67: error: invalid application of `sizeof' to
incomplete type `Private::SizeError<false>'
this means that some compile-time type size check failed.
The most probable reason is that off_t type is less than 64-bit
on your system. As a quick hack, you can edit sphinx.h and replace off_t
with DWORD in a typedef for SphOffset_t, but note that this will prohibit
you from using full-text indexes larger than 2 GB. Even if the hack helps,
please report such issues, providing the exact error message and
compiler/OS details, so I can properly fix them in the next releases.
If you keep getting any other error, or the suggestions above
do not seem to help you, please don't hesitate to contact me.
2.6. Quick Sphinx usage tour
All the example commands below assume that you installed Sphinx
in /usr/local/sphinx, so searchd can
be found in /usr/local/sphinx/bin/searchd.
To use Sphinx, you will need to:
Create a configuration file.
Default configuration file name is sphinx.conf.
All Sphinx programs look for this file in current working directory
by default.
Sample configuration file, sphinx.conf.dist, which has
all the options documented, is created by configure.
Copy and edit that sample file to make your own configuration: (assuming Sphinx is installed into /usr/local/sphinx/)
$ cd /usr/local/sphinx/etc
$ cp sphinx.conf.dist sphinx.conf
$ vi sphinx.conf
Sample configuration file is set up to index documents
table from MySQL database test; so there's example.sql
sample data file to populate that table with a few documents for testing purposes:
$ mysql -u test < /usr/local/sphinx/etc/example.sql
Run the indexer to create full-text index from your data:
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/indexer
Query your newly created index!
To query the index from command line, use search utility:
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/search test
To query the index from your PHP scripts, you need to:
Run the search daemon which your script will talk to:
$ cd /usr/local/sphinx/etc
$ /usr/local/sphinx/bin/searchd
Run the attached PHP API test script (to ensure that the daemon
was successfully started and is ready to serve the queries):
$ cd sphinx/api
$ php test.php test
Include the API (it's located in api/sphinxapi.php)
into your own scripts and use it.
Happy searching!
The data to be indexed can generally come from very different
sources: SQL databases, plain text files, HTML files, mailboxes,
and so on. From Sphinx's point of view, the data it indexes is a
set of structured documents, each of which has the
same set of fields. This is biased towards SQL, where
each row corresponds to a document, and each column to a field.
Depending on what source Sphinx should get the data from,
different code is required to fetch the data and prepare it for indexing.
This code is called data source driver (or simply
driver or data source for brevity).
At the time of this writing, there are drivers for MySQL and
PostgreSQL databases, which can connect to the database using
its native C/C++ API, run queries and fetch the data. There's
also a driver called xmlpipe, which runs a specified command
and reads the data from its stdout.
See Section 3.8, “xmlpipe data source” for the format description.
There can be as many sources per index as necessary. They will be
sequentially processed in the very same order which was specified in
index definition. All the documents coming from those sources
will be merged as if they were coming from a single source.
Attributes are additional values associated with each document
that can be used to perform additional filtering and sorting during search.
It is often desired to additionally process full-text search results
based not only on matching document ID and its rank, but on a number
of other per-document values as well. For instance, one might need to
sort news search results by date and then relevance,
or search through products within specified price range,
or limit blog search to posts made by selected users,
or group results by month. To do that efficiently, Sphinx allows
attaching a number of additional attributes
to each document, and stores their values in the full-text index.
It's then possible to use stored values to filter, sort,
or group full-text matches.
Attributes, unlike the fields, are not full-text indexed. They
are stored in the index, but it is not possible to search them as full-text,
and attempting to do so results in an error. For example, it is impossible to use the extended matching mode expression
@column 1 to match documents where column is 1, if column is an
attribute, and this is still true even if the numeric digits are normally indexed. Attributes can be used for filtering, though, to restrict returned
rows, as well as sorting or result grouping;
it is entirely possible to sort results purely based on attributes, and ignore the search
relevance tools. Additionally, attributes are returned from the search daemon, while the
indexed text is not.
A good example for attributes would be a forum posts table. Assume
that only title and content fields need to be full-text searchable -
but that sometimes it is also required to limit search to a certain
author or a sub-forum (ie. search only those rows that have some
specific values of author_id or forum_id columns in the SQL table);
or to sort matches by post_date column; or to group matching posts
by month of the post_date and calculate per-group match counts.
This can be achieved by specifying all the mentioned columns
(excluding title and content, that are full-text fields) as
attributes, indexing them, and then using API calls to
set up filtering, sorting, and grouping. Here is an example.
Example sphinx.conf part:
...
sql_query = SELECT id, title, content, \
author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date
...
Example application code (in PHP):
// only search posts by author whose ID is 123
$cl->SetFilter ( "author_id", array ( 123 ) );
// only search posts in sub-forums 1, 3 and 7
$cl->SetFilter ( "forum_id", array ( 1,3,7 ) );
// sort found posts by posting date in descending order
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
Attributes are named. Attribute names are case insensitive.
Attributes are not full-text indexed; they are stored in the index as is.
Currently supported attribute types are:
- unsigned integers (1-bit to 32-bit wide);
- UNIX timestamps;
- floating point values (32-bit, IEEE 754 single precision);
- string ordinals (specially computed integers);
- MVA, multi-value attributes (variable-length lists of 32-bit unsigned integers).
The complete set of per-document attribute values is sometimes
referred to as docinfo. Docinfos can either be
- stored separately from the main full-text index data ("extern" storage, in .spa file), or
- attached to each occurrence of document ID in full-text index data ("inline" storage, in .spd file).
When using extern storage, a copy of .spa file
(with all the attribute values for all the documents) is kept in RAM by
searchd at all times. This is for performance reasons;
random disk I/O would be too slow. On the contrary, inline storage does not
require any additional RAM at all, but that comes at the cost of greatly
inflating the index size: remember that it copies all
attribute values every time the document ID
is mentioned, and that is exactly as many times as there are
different keywords in the document. Inline may be the only viable
option if you have only a few attributes and need to work with big
datasets in limited RAM. However, in most cases extern storage
makes both indexing and searching much more efficient.
Search-time memory requirements for extern storage are
(1+number_of_attrs)*number_of_docs*4 bytes, ie. 10 million docs with
2 groups and 1 timestamp will take (1+2+1)*10M*4 = 160 MB of RAM.
This is PER DAEMON, not per query. searchd
will allocate 160 MB on startup, read the data and keep it shared between queries.
The children will NOT allocate any additional
copies of this data.
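The memory formula above can be turned into a small calculator. This is only a sketch of the arithmetic as stated in the text; the function name is made up for illustration, and it assumes 4 bytes per stored value (the extra 1 in the formula presumably stands for the document ID).

```python
def extern_docinfo_bytes(num_attrs: int, num_docs: int) -> int:
    """RAM estimate for extern docinfo: (1 + attrs) 4-byte values per document."""
    return (1 + num_attrs) * num_docs * 4


# The example from the text: 10 million docs, 2 groups and 1 timestamp.
total = extern_docinfo_bytes(num_attrs=3, num_docs=10_000_000)
print(total // 10**6, "MB")  # 160 MB
```

Running the same function with your own attribute and document counts gives a rough idea of how much RAM searchd will need at startup.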
3.3. MVA (multi-valued attributes)
MVAs, or multi-valued attributes, are an important special type of per-document attributes in Sphinx.
MVAs make it possible to attach lists of values to every document.
They are useful for article tags, product categories, etc.
Filtering and group-by (but not sorting) on MVA attributes is supported.
Currently, MVA list entries are limited to unsigned 32-bit integers.
The list length is not limited, you can have an arbitrary number of values
attached to each document as long as RAM permits (.spm file
that contains the MVA values will be precached in RAM by searchd).
The source data can be taken either from a separate query, or from a document field;
see source type in sql_attr_multi.
In the first case the query will have to return pairs of document ID and MVA values,
in the second one the field will be parsed for integer values.
There are absolutely no requirements as to incoming data order; the values will be
automatically grouped by document ID (and internally sorted within the same ID)
during indexing anyway.
When filtering, a document will match the filter on MVA attribute
if any of the values satisfy the filtering condition.
(Therefore, documents that pass through exclude filters will not
contain any of the forbidden values.)
When grouping by MVA attribute, a document will contribute to as
many groups as there are different MVA values associated with that document.
For instance, if the collection contains exactly 1 document having a 'tag' MVA
with values 5, 7, and 11, grouping on 'tag' will produce 3 groups with
'@count' equal to 1 and '@groupby' key values of 5, 7, and 11 respectively.
Also note that grouping by MVA might lead to duplicate documents in the result set:
because each document can participate in many groups, it can be chosen as the best
one in more than one group, leading to duplicate IDs. PHP API historically
uses ordered hash on the document ID for the resulting rows; so you'll also need to use
SetArrayResult() in order
to employ group-by on MVA with PHP API.
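The MVA filtering and grouping rules described above can be modelled in a few lines. The following is a plain Python sketch of the semantics only, not of how Sphinx implements them; the sample data and function names are hypothetical.

```python
from collections import Counter


def mva_include(values, allowed):
    """Include filter: the document matches if ANY of its values is allowed."""
    return any(v in allowed for v in values)


def mva_exclude(values, forbidden):
    """Exclude filter: the document passes only if NONE of its values is forbidden."""
    return not mva_include(values, forbidden)


def group_counts(docs):
    """Group-by on an MVA: a document counts once in every distinct value it has."""
    counts = Counter()
    for values in docs.values():
        counts.update(set(values))
    return dict(counts)


# Hypothetical collection: exactly one document with 'tag' values 5, 7 and 11.
docs = {1: [5, 7, 11]}
print(group_counts(docs))                # three groups, each with a count of 1
print(mva_include(docs[1], {7, 100}))    # True, because 7 is present
print(mva_exclude(docs[1], {7, 100}))    # False, because 7 is forbidden
```

Note how the single document shows up in three groups, which is exactly why duplicate document IDs can appear in a grouped result set.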
To be able to answer full-text search queries fast, Sphinx needs
to build a special data structure optimized for such queries from
your text data. This structure is called index; and
the process of building index from text is called indexing.
Different index types are well suited for different tasks.
For example, a disk-based tree-based index would be easy to
update (ie. insert new documents to existing index), but rather
slow to search. Therefore, Sphinx architecture allows for different
index types to be implemented easily.
The only index type which is implemented in Sphinx at the moment is
designed for maximum indexing and searching speed. This comes at a cost
of updates being really slow; theoretically, it might be slower to
update this type of index than to reindex it from scratch.
However, this very frequently could be worked around with
multiple indexes, see Section 3.10, “Live index updates” for details.
It is planned to implement more index types, including the
type which would be updateable in real time.
There can be as many indexes per configuration file as necessary.
indexer utility can reindex either all of them
(if --all option is specified), or a certain explicitly
specified subset. searchd utility will serve all
the specified indexes, and the clients can specify what indexes to
search in run time.
3.5. Restrictions on the source data
There are a few different restrictions imposed on the source data
which is going to be indexed by Sphinx, of which the single most
important one is:
ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).
If this requirement is not met, different bad things can happen.
For instance, Sphinx can crash with an internal assertion while indexing;
or produce strange results when searching due to conflicting IDs.
Also, a 1000-pound gorilla might eventually come out of your
display and start throwing barrels at you. You've been warned.
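Because violations of this rule only surface later as crashes or odd results, it can be worth checking IDs before they reach indexer. The sketch below is one possible pre-flight check written for illustration; the function name and the `bits` parameter are assumptions, not part of Sphinx.

```python
def check_doc_ids(ids, bits=32):
    """Return a list of problems found; an empty list means the IDs look valid."""
    problems = []
    seen = set()
    limit = 2 ** bits
    for doc_id in ids:
        if not isinstance(doc_id, int) or isinstance(doc_id, bool):
            problems.append(f"{doc_id!r}: not an integer")
        elif doc_id <= 0:
            problems.append(f"{doc_id}: must be a non-zero unsigned number")
        elif doc_id >= limit:
            problems.append(f"{doc_id}: does not fit in {bits} bits")
        else:
            if doc_id in seen:
                problems.append(f"{doc_id}: duplicate")
            seen.add(doc_id)
    return problems


for line in check_doc_ids([1, 1, 0, 2**32]):
    print(line)
```

Pass `bits=64` when checking data for a build that uses 64-bit document IDs.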
3.6. Charsets, case folding, and translation tables
When indexing some index, Sphinx fetches documents from
the specified sources, splits the text into words, and does
case folding so that "Abc", "ABC" and "abc" would be treated
as the same word (or, to be pedantic, term).
To do that properly, Sphinx needs to know
- what encoding is the source text in;
- what characters are letters and what are not;
- what letters should be folded to what letters.
This should be configured on a per-index basis using
charset_type and
charset_table options.
charset_type
specifies whether the document encoding is single-byte (SBCS) or UTF-8.
charset_table
specifies the table that maps letter characters to their case
folded versions. The characters that are not in the table are considered
to be non-letters and will be treated as word separators when indexing
or searching through this index.
Note that while default tables do not include space character
(ASCII code 0x20, Unicode U+0020) as a letter, it's in fact
perfectly legal to do so. This can be
useful, for instance, for indexing tag clouds, so that space-separated
word sets would index as a single search query term.
Default tables currently include English and Russian characters.
Please do submit your tables for other languages!
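To make the role of the folding table concrete, here is a toy tokenizer in Python. It is a sketch of the idea only: the table below is a hypothetical stand-in for charset_table (ASCII letters and digits), and real Sphinx tables are configured differently.

```python
import string

# Hypothetical folding table: ASCII letters fold to lowercase, digits map to themselves.
TABLE = {c: c.lower() for c in string.ascii_letters}
TABLE.update({d: d for d in string.digits})


def tokenize(text, table=TABLE):
    """Split text into folded terms; any character not in the table is a separator."""
    terms, current = [], []
    for ch in text:
        if ch in table:
            current.append(table[ch])
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms


print(tokenize("Abc, ABC and abc"))  # ['abc', 'abc', 'and', 'abc']

# Adding the space character to the table keeps space-separated word sets together.
with_space = dict(TABLE)
with_space[" "] = " "
print(tokenize("tag cloud,other tag", with_space))  # ['tag cloud', 'other tag']
```

The second call illustrates the tag cloud case: once the space is declared a letter, only the comma still acts as a separator.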
3.7. SQL data sources (MySQL, PostgreSQL)
With all the SQL drivers, indexing generally works as follows.
- connection to the database is established;
- pre-query (see Section 9.1.9, “sql_query_pre”) is executed
to perform any necessary initial setup, such as setting per-connection encoding with MySQL;
- main query (see Section 9.1.10, “sql_query”) is executed and the rows it returns are indexed;
- post-query (see Section 9.1.21, “sql_query_post”) is executed
to perform any necessary cleanup;
- connection to the database is closed;
- indexer does the sorting phase (to be pedantic, index-type specific post-processing);
- connection to the database is established again;
- post-index query (see Section 9.1.22, “sql_query_post_index”) is executed
to perform any necessary final cleanup;
- connection to the database is closed again.
Most options, such as database user/host/password, are straightforward.
However, there are a few subtle things, which are discussed in more detail here.
Ranged queries
Main query, which needs to fetch all the documents, can impose
a read lock on the whole table and stall the concurrent queries
(eg. INSERTs to MyISAM table), waste a lot of memory for result set, etc.
To avoid this, Sphinx supports so-called ranged queries.
With ranged queries, Sphinx first fetches min and max document IDs from
the table, and then substitutes different ID intervals into main query text
and runs the modified query to fetch another chunk of documents.
Here's an example.
Example 1. Ranged query usage example
# in sphinx.conf
sql_query_range = SELECT MIN(id),MAX(id) FROM documents
sql_range_step = 1000
sql_query = SELECT * FROM documents WHERE id>=$start AND id<=$end
If the table contains document IDs from 1 to, say, 2345, then sql_query would
be run three times:
- with $start replaced with 1 and $end replaced with 1000;
- with $start replaced with 1001 and $end replaced with 2000;
- with $start replaced with 2001 and $end replaced with 2345.
Obviously, that's not much of a difference for a 2000-row table,
but when it comes to indexing a 10-million-row MyISAM table,
ranged queries might be of some help.
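The interval stepping can be reproduced with a short generator. This is a sketch of the substitution logic as described in this section, not Sphinx's actual code; the function name is hypothetical and the query string is the one from the example configuration.

```python
def ranged_intervals(min_id, max_id, step):
    """Yield inclusive (start, end) ID intervals that cover min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


QUERY = "SELECT * FROM documents WHERE id>=$start AND id<=$end"

# IDs from 1 to 2345 with sql_range_step = 1000 give three chunks.
for start, end in ranged_intervals(1, 2345, 1000):
    print(QUERY.replace("$start", str(start)).replace("$end", str(end)))
```

Each printed line is one chunk query, so no single statement has to fetch or lock the whole table at once.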
sql_query_post vs. sql_query_post_index
The difference between the post-query and the post-index query is that the post-query
is run immediately after Sphinx has received all the documents, but further indexing
may still fail for some other reason. On the contrary,
by the time the post-index query gets executed, it is guaranteed
that the indexing was successful. Database connection is dropped and re-established
because sorting phase can be very lengthy and would just timeout otherwise.
xmlpipe data source was designed to enable users to plug data into
Sphinx without having to implement new data sources drivers themselves.
It is limited to 2 fixed fields and 2 fixed attributes, and is deprecated
in favor of Section 3.9, “xmlpipe2 data source” now. For new streams, use xmlpipe2.
To use xmlpipe, configure the data source in your configuration file
as follows:
source example_xmlpipe_source
{
type = xmlpipe
xmlpipe_command = perl /www/mysite.com/bin/sphinxpipe.pl
}
The indexer will run the command specified
in xmlpipe_command,
and then read, parse and index the data it prints to stdout.
More formally, it opens a pipe to given command and then reads
from that pipe.
indexer will expect one or more documents in custom XML format.
Here's the example document stream, consisting of two documents:
Example 2. XMLpipe document stream
<document>
<id>123</id>
<group>45</group>
<timestamp>1132223498</timestamp>
<title>test title</title>
<body>
this is my document body
</body>
</document>
<document>
<id>124</id>
<group>46</group>
<timestamp>1132223498</timestamp>
<title>another test</title>
<body>
this is another document
</body>
</document>
The legacy xmlpipe driver uses a builtin parser
which is pretty fast but really strict and does not actually
fully support XML. It requires that all the fields must
be present, formatted exactly as in this example, and
occur exactly in the same order. The only optional
field is timestamp; it defaults to 1.
3.9. xmlpipe2 data source
xmlpipe2 lets you pass arbitrary full-text and attribute data to Sphinx
in yet another custom XML format. It also allows you to specify the schema
(ie. the set of fields and attributes) either in the XML stream itself,
or in the source settings.
When indexing xmlpipe2 source, indexer runs the given command, opens
a pipe to its stdout, and expects well-formed XML stream. Here's sample
stream data:
Example 3. xmlpipe2 document stream
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>
<sphinx:document id="1234">
<content>this is the main content <![CDATA[[and this <cdata> entry must be handled properly by xml parser lib]]></content>
<published>1012325463</published>
<subject>note how field/attr tags can be in <b class="red">randomized</b> order</subject>
<misc>some undeclared element</misc>
</sphinx:document>
<!-- ... more documents here ... -->
</sphinx:docset>
Arbitrary fields and attributes are allowed.
They also can occur in the stream in arbitrary order within each document; the order is ignored.
There is a restriction on maximum field length; fields longer than 2 MB will be truncated to 2 MB (this limit can be changed in the source).
The schema, ie. complete fields and attributes list, must be declared
before any document could be parsed. This can be done either in the
configuration file using xmlpipe_field and xmlpipe_attr_XXX
settings, or right in the stream using <sphinx:schema> element.
<sphinx:schema> is optional. It is only allowed to occur as the very
first sub-element in <sphinx:docset>. If there is no in-stream
schema definition, settings from the configuration file will be used.
Otherwise, stream settings take precedence.
Unknown tags (which were declared neither as fields nor as attributes)
will be ignored with a warning. In the example above, <misc> will be ignored.
All embedded tags and their attributes (such as <b> in <subject>
in the example above) will be silently ignored.
Support for incoming stream encodings depends on whether iconv
is installed on the system. xmlpipe2 is parsed using libexpat
parser that understands US-ASCII, ISO-8859-1, UTF-8 and a few UTF-16 variants
natively. Sphinx configure script will also check
for libiconv presence, and utilize it to handle
other encodings. libexpat also enforces the
requirement to use UTF-8 charset on Sphinx side, because the
parsed data it returns is always in UTF-8.
XML elements (tags) recognized by xmlpipe2 (and their attributes where applicable) are:
- sphinx:docset
  Mandatory top-level element, denotes and contains xmlpipe2 document set.
- sphinx:schema
  Optional element, must either occur as the very first child
  of sphinx:docset, or never occur at all. Declares the document schema.
  Contains field and attribute declarations. If present, overrides
  per-source settings from the configuration file.
- sphinx:field
  Optional element, child of sphinx:schema. Declares a full-text field.
  The only recognized attribute is "name", it specifies the element name
  that should be treated as a full-text field in the subsequent documents.
- sphinx:attr
  Optional element, child of sphinx:schema. Declares an attribute.
  Known attributes are:
  - "name", specifies the element name that should be treated as an attribute in the subsequent documents.
  - "type", specifies the attribute type. Possible values are "int", "timestamp", "str2ordinal", "bool", "float" and "multi".
  - "bits", specifies the bit size for "int" attribute type. Valid values are 1 to 32.
  - "default", specifies the default value for this attribute that should be used if the attribute's element is not present in the document.
- sphinx:document
  Mandatory element, must be a child of sphinx:docset.
  Contains arbitrary other elements with field and attribute values
  to be indexed, as declared either using sphinx:field and sphinx:attr
  elements or in the configuration file. The only known attribute
  is "id" that must contain the unique integer document ID.
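A script that feeds xmlpipe2 only needs to print a stream built from the elements listed above. The following Python sketch shows one way to do that; the function name and the sample data are hypothetical, and it covers only the "name" and "type" attributes of sphinx:attr.

```python
from xml.sax.saxutils import escape, quoteattr


def xmlpipe2_stream(fields, attrs, documents):
    """Build an xmlpipe2 docset string.

    fields: list of field names; attrs: list of (name, type) pairs;
    documents: list of (doc_id, {element_name: value}) pairs.
    """
    out = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for name in fields:
        out.append(f"<sphinx:field name={quoteattr(name)}/>")
    for name, attr_type in attrs:
        out.append(f"<sphinx:attr name={quoteattr(name)} type={quoteattr(attr_type)}/>")
    out.append("</sphinx:schema>")
    for doc_id, values in documents:
        out.append(f'<sphinx:document id="{int(doc_id)}">')
        for name, value in values.items():
            # Escape &, < and > so that the stream stays well-formed XML.
            out.append(f"<{name}>{escape(str(value))}</{name}>")
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out)


print(xmlpipe2_stream(
    fields=["subject", "content"],
    attrs=[("published", "timestamp")],
    documents=[(1234, {"subject": "a < b & c", "content": "hello", "published": 1012325463})],
))
```

The schema is emitted before the first document, as required, and the script writes to stdout so that indexer can read it through the pipe.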
There's a frequent situation when the total dataset is too big
to be reindexed from scratch often, but the amount of new records
is rather small. Example: a forum with 1,000,000 archived posts,
but only 1,000 new posts per day.
In this case, "live" (almost real time) index updates could be
implemented using so called "main+delta" scheme.
The idea is to set up two sources and two indexes, with one
"main" index for the data which only changes rarely (if ever),
and one "delta" for the new documents. In the example above,
1,000,000 archived posts would go to the main index, and newly
inserted 1,000 posts/day would go to the delta index. Delta index
could then be reindexed very frequently, and the documents can
be made available to search in a matter of minutes.
Specifying which documents should go to what index and
reindexing main index could also be made fully automatic.
One option would be to make a counter table which would track
the ID which would split the documents, and update it
whenever the main index is reindexed.
Example 4. Fully automated live updates
# in MySQL
CREATE TABLE sph_counter
(
counter_id INTEGER PRIMARY KEY NOT NULL,
max_doc_id INTEGER NOT NULL
);
# in sphinx.conf
source main
{
# ...
sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
sql_query = SELECT id, title, body FROM documents \
WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
source delta : main
{
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, title, body FROM documents \
WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
index main
{
source = main
path = /path/to/main
# ... all the other settings
}
# note how all other settings are copied from main,
# but source and path are overridden (they MUST be)
index delta : main
{
source = delta
path = /path/to/delta
}
Note how we're overriding sql_query_pre in the delta source.
We need to explicitly have that override. Otherwise REPLACE query
would be run when indexing delta source too, effectively nullifying it. However,
when we issue the directive in the inherited source for the first time, it removes
all inherited values, so the encoding setup is also lost.
So sql_query_pre in the delta can not just be empty; and we need
to issue the encoding setup query explicitly once again.
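The split performed by the two sql_query statements can be illustrated outside of SQL. The following Python sketch is only a model of the scheme, not part of Sphinx: the function name split_documents and the in-memory list of IDs are hypothetical stand-ins for the documents table and the sph_counter row.

```python
def split_documents(doc_ids, max_doc_id):
    """Partition document IDs the way the main and delta sources do."""
    main = [d for d in doc_ids if d <= max_doc_id]   # WHERE id <= max_doc_id
    delta = [d for d in doc_ids if d > max_doc_id]   # WHERE id > max_doc_id
    return main, delta

# State when main was last reindexed: the counter stored MAX(id) = 5.
max_doc_id = 5
# Three documents (6, 7, 8) were inserted since then.
doc_ids = [1, 2, 3, 4, 5, 6, 7, 8]
main_ids, delta_ids = split_documents(doc_ids, max_doc_id)
```

Because only the main source updates the counter, reindexing delta never moves the split point; every document lands in exactly one of the two indexes.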
Merging two existing indexes can be more efficient than indexing the data
from scratch, and desired in some cases (such as merging 'main' and 'delta'
indexes instead of simply reindexing 'main' in 'main+delta' partitioning
scheme). So indexer has an option to do that.
Merging the indexes is normally faster than reindexing but still
not instant on huge indexes. Basically,
it will need to read the contents of both indexes once and write
the result once. Merging 100 GB and 1 GB index, for example,
will result in 202 GB of IO (but that's still likely less than
the indexing from scratch requires).
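The 202 GB figure follows from reading both indexes once and writing the result once. A minimal Python sketch of that arithmetic, assuming the merged index is as large as the two inputs combined (the function name merge_io_gb is hypothetical):

```python
def merge_io_gb(dst_gb, src_gb):
    """Rough I/O estimate for a merge, in GB."""
    read_gb = dst_gb + src_gb      # both indexes are read once
    written_gb = dst_gb + src_gb   # the merged result is written once
    return read_gb + written_gb

total_io = merge_io_gb(100, 1)
```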
The basic command syntax is as follows:
indexer --merge DSTINDEX SRCINDEX [--rotate]
Only the DSTINDEX index will be affected: the contents of SRCINDEX will be merged into it.
--rotate switch will be required if DSTINDEX is already being served by searchd.
The initially devised usage pattern is to merge a smaller update from SRCINDEX into DSTINDEX.
Thus, when merging the attributes, values from SRCINDEX will win if duplicate document IDs are encountered.
Note, however, that the "old" keywords will not be automatically removed in such cases.
For example, if there's a keyword "old" associated with document 123 in DSTINDEX, and a keyword "new" associated
with it in SRCINDEX, document 123 will be found by both keywords after the merge.
You can supply an explicit condition to remove documents from DSTINDEX to mitigate that;
the relevant switch is --merge-dst-range:
indexer --merge main delta --merge-dst-range deleted 0 0
This switch lets you apply filters to the destination index along with merging.
There can be several filters; all of their conditions must be met in order
to include the document in the resulting merged index. In the example above,
the filter passes only those records where 'deleted' is 0, eliminating all
records that were flagged as deleted (for instance, using
UpdateAttributes() call).
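The merge rules described above (SRCINDEX attributes win, old keywords survive, and the range filter applies to the destination only) can be modelled in a few lines of Python. This is a sketch under simplifying assumptions, not how indexer is implemented: an index is represented as a dict with hypothetical "attrs" and "keywords" entries, and merge_indexes is an illustrative name.

```python
def merge_indexes(dst, src, dst_range=None):
    """Model of 'indexer --merge DST SRC [--merge-dst-range attr min max]'."""
    if dst_range is not None:
        attr, lo, hi = dst_range
        kept = {d for d, a in dst["attrs"].items() if lo <= a[attr] <= hi}
    else:
        kept = set(dst["attrs"])

    # Attributes: start from the surviving DST documents, then let SRC win.
    attrs = {d: dict(a) for d, a in dst["attrs"].items() if d in kept}
    attrs.update({d: dict(a) for d, a in src["attrs"].items()})

    # Keywords: postings from both indexes are combined, never removed.
    keywords = {}
    for index, allowed in ((dst, kept), (src, set(src["attrs"]))):
        for word, ids in index["keywords"].items():
            hits = ids & allowed
            if hits:
                keywords.setdefault(word, set()).update(hits)
    return {"attrs": attrs, "keywords": keywords}

main = {"attrs": {123: {"deleted": 0}}, "keywords": {"old": {123}}}
delta = {"attrs": {123: {"deleted": 0}}, "keywords": {"new": {123}}}
plain = merge_indexes(main, delta)

# Same merge after document 123 was flagged as deleted in main.
main_flagged = {"attrs": {123: {"deleted": 1}}, "keywords": {"old": {123}}}
filtered = merge_indexes(main_flagged, delta, ("deleted", 0, 0))
```

In the plain merge, document 123 is reachable through both "old" and "new". In the filtered merge, the stale copy is dropped from the destination before merging, so only "new" remains.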
There are the following matching modes available:
- SPH_MATCH_ALL, matches all query words (default mode);
- SPH_MATCH_ANY, matches any of the query words;
- SPH_MATCH_PHRASE, matches query as a phrase, requiring perfect match;
- SPH_MATCH_BOOLEAN, matches query as a boolean expression (see Section 4.2, “Boolean query syntax”);
- SPH_MATCH_EXTENDED, matches query as an expression in Sphinx internal query language (see Section 4.3, “Extended query syntax”). As of 0.9.9, this has been superseded by SPH_MATCH_EXTENDED2, providing additional functionality and better performance. The ident is retained for legacy application code that will continue to be compatible once Sphinx and its components, including the API, are upgraded.
- SPH_MATCH_EXTENDED2, matches query using the second version of the Extended matching mode.
- SPH_MATCH_FULLSCAN, matches query, forcibly using the "full scan" mode as below. NB, any query terms will be ignored, such that filters, filter-ranges and grouping will still be applied, but no text-matching.
The SPH_MATCH_FULLSCAN mode will be automatically activated in place of the specified matching mode when the following conditions are met:
- The query string is empty (ie. its length is zero).
- docinfo storage is set to extern.
In full scan mode, all the indexed documents will be considered as matching.
Such queries will still apply filters, sorting, and group by, but will not perform any full-text searching.
This can be useful to unify full-text and non-full-text searching code, or to offload SQL server (there are cases when Sphinx scans will perform better than analogous MySQL queries).
An example of using the full scan mode might be to find posts in a forum. By selecting the forum's user ID via SetFilter() but not actually providing any search text, Sphinx will match every document (i.e. every post) where SetFilter() would match - in this case providing every post from that user. By default this will be ordered by relevancy, followed by Sphinx document ID in ascending order (earliest first).
4.2. Boolean query syntax
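Boolean queries allow the following special operators to be used:
- explicit operator AND:
hello & world
- operator OR:
hello | world
- operator NOT:
hello -world
hello !world
- grouping:
( hello world )
Here's an example query which uses all these operators: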
Example 5. Boolean query example
( cat -dog ) | ( cat -mouse)
There is always an implicit AND operator, so "hello world" query actually
means "hello & world".
OR operator precedence is higher than AND, so "looking for cat | dog | mouse"
means "looking for ( cat | dog | mouse )" and not
"(looking for cat) | dog | mouse".
Queries like "-dog", which implicitly include all documents from the
collection, can not be evaluated. This is both for technical and performance
reasons. Technically, Sphinx does not always keep a list of all IDs.
Performance-wise, when the collection is huge (ie. 10-100M documents),
evaluating such queries could take very long.
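The precedence rule in this section (OR binds tighter than the implicit AND) can be checked with a toy evaluator. The Python sketch below is only an illustration of the grammar described here, not the Sphinx query parser; tokenize and matches are hypothetical names, and a document is reduced to a set of lowercase words.

```python
import re

def tokenize(query):
    # Words plus the operators | & - ! ( )
    return re.findall(r"[|&!()\-]|\w+", query)

def matches(query, doc_words):
    """Evaluate a boolean query against the set of words in one document."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_and():
        # Lowest precedence: OR-groups joined by '&' or by plain adjacency.
        result = parse_or()
        while peek() is not None and peek() != ")":
            if peek() == "&":
                take()
            right = parse_or()
            result = result and right
        return result

    def parse_or():
        # Higher precedence than AND: a | b | c is grouped first.
        result = parse_unary()
        while peek() == "|":
            take()
            right = parse_unary()
            result = result or right
        return result

    def parse_unary():
        token = take()
        if token in ("-", "!"):
            return not parse_unary()
        if token == "(":
            result = parse_and()
            take()  # consume the closing ")"
            return result
        return token.lower() in doc_words

    return parse_and()

query = "looking for cat | dog | mouse"
hit = matches(query, {"looking", "for", "dog"})
miss = matches(query, {"mouse"})
```

The second document contains 'mouse' but neither 'looking' nor 'for', so it does not match; it would match if the query were read as "(looking for cat) | dog | mouse". Note that this toy happily evaluates a bare "-dog" against a single document, which is exactly the kind of query Sphinx refuses to run over a whole collection.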
4.3. Extended query syntax
The following special operators can be used when using the extended matching mode:
- operator OR:
hello | world
- operator NOT:
hello -world
hello !world
- field search operator:
@title hello @body world
- field position limit modifier (introduced in version 0.9.9):
@body[50] hello
- multiple-field search operator:
@(title, body) hello world
- all-field search operator:
@* hello
- phrase search operator:
"hello world"
- proximity search operator:
"hello world"~10
- quorum matching operator:
"the world is a wonderful place"/3
- exact form operator (introduced in version 0.9.9):
raining =cats and =dogs
Here's an example query which uses most of these operators:
Example 6. Extended matching mode: query example
"hello world" @title "example program"~5 @body python -(php|perl) @* code
The full meaning of this search is:
- Find the words 'hello' and 'world' adjacently in any field in a document;
- Additionally, the same document must also contain the words 'example' and 'program' in the title field, with up to, but not including, 5 words between the words in question (e.g. "example PHP program" would be matched, whereas "example script to introduce outside data into the correct context for your program" would not, because the two terms have 5 or more words between them);
- Additionally, the same document must contain the word 'python' in the body field, but not contain either 'php' or 'perl';
- Additionally, the same document must contain the word 'code' in any field.
There is always an implicit AND operator, so "hello world" means that
both "hello" and "world" must be present in matching document.
OR operator precedence is higher than AND, so "looking for cat | dog | mouse"
means "looking for ( cat | dog | mouse )" and not
"(looking for cat) | dog | mouse".
Field position limit, introduced in version 0.9.9, additionally restricts the searching
to the first N positions within the given field (or fields). For example, "@body[50] hello" will
not match the documents where the keyword 'hello' occurs at position 51 or later
in the body.
Proximity distance is specified in words, adjusted for word count, and
applies to all words within quotes. For instance, "cat dog mouse"~5 query
means that there must be a span of fewer than 8 words which contains all 3 words,
ie. "CAT aaa bbb ccc DOG eee fff MOUSE" document will not
match this query, because this span is exactly 8 words long.
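The span rule can be expressed as a small check: the shortest run of words containing every keyword must be shorter than the distance plus the keyword count. The Python sketch below models only that statement; the name proximity_match is hypothetical, and any ordering or tokenization rules the engine applies are ignored.

```python
def proximity_match(document, keywords, distance):
    """True if all keywords fit in a span shorter than distance + len(keywords)."""
    words = document.lower().split()
    wanted = {k.lower() for k in keywords}
    limit = distance + len(keywords)   # distance adjusted for word count
    for start, word in enumerate(words):
        if word not in wanted:
            continue
        seen = set()
        for end in range(start, len(words)):
            if words[end] in wanted:
                seen.add(words[end])
            if seen == wanted:
                # Shortest span that starts here and covers every keyword.
                if end - start + 1 < limit:
                    return True
                break
    return False

keywords = ["cat", "dog", "mouse"]
too_wide = proximity_match("CAT aaa bbb ccc DOG eee fff MOUSE", keywords, 5)
close_enough = proximity_match("CAT aaa bbb DOG eee fff MOUSE", keywords, 5)
```

The first document needs an 8-word span and is rejected; removing one filler word brings the span down to 7 words, which passes.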
Quorum matching operator introduces a kind of fuzzy matching.
It will only match those documents that pass a given threshold of given words.
The example above ("the world is a wonderful place"/3) will match all documents
that have at least 3 of the 6 specified words.
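As a rough model of the quorum rule, the following Python sketch counts how many distinct query words occur in a document and compares that count with the threshold. The name quorum_match is hypothetical, and stemming and tokenization are deliberately left out.

```python
def quorum_match(document, keywords, threshold):
    """True if at least `threshold` distinct keywords occur in the document."""
    doc_words = set(document.lower().split())
    present = sum(1 for k in {k.lower() for k in keywords} if k in doc_words)
    return present >= threshold

keywords = "the world is a wonderful place".split()
enough = quorum_match("a wonderful world", keywords, 3)
not_enough = quorum_match("hello world", keywords, 3)
```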
Exact form operator, introduced in version 0.9.9, will match the document only if the keyword occurred
in exactly the specified form. The default behaviour is to match the document
if the stemmed keyword matches. For instance, "runs" query will match both
the document that contains "runs" and the document that
contains "running", because both forms stem to just "run" - while "=runs"
query will only match the first document. Exact form operator requires
index_exact_words option to be enabled.
The operator affects the keywords, and thus can be used within phrase,
proximity, or quorum operators.
Starting with 0.9.9, arbitrarily nested brackets and negations are allowed.
However, the query must be possible to compute without involving an implicit
list of all documents:
// correct query
aaa -(bbb -(ccc ddd))
// queries that are non-computable
-aaa
aaa | -bbb
Specific weighting function (currently) depends on the search mode.
There are these major parts which are used in the weighting functions:
- phrase rank,
- statistical rank.
Phrase rank is based on the length of the longest common subsequence
(LCS) of search words between document body and query phrase. So if
there's a perfect phrase match in some document then its phrase rank
would be the highest possible, and equal to query words count.
Statistical rank is based on classic BM25 function which only takes
word frequencies into account. If the word is rare in the whole database
(ie. low frequency over document collection) or mentioned a lot in specific
document (ie. high frequency over matching document), it receives more weight.
Final BM25 weight is a floating point number between 0 and 1.
In all modes, per-field weighted phrase ranks are computed as
a product of LCS multiplied by per-field weight specified by user.
Per-field weights are integer, default to 1, and can not be set
lower than 1.
In SPH_MATCH_BOOLEAN mode, no weighting is performed at all, every match weight
is set to 1.
In SPH_MATCH_ALL and SPH_MATCH_PHRASE modes, final weight is a sum of weighted phrase ranks.
In SPH_MATCH_ANY mode, the idea is essentially the same, but it also
adds a count of matching words in each field. Before that, weighted
phrase ranks are additionally multiplied by a value big enough to
guarantee that higher phrase rank in any field will make the
match ranked higher, even if its field weight is low.
In SPH_MATCH_EXTENDED mode, final weight is a sum of weighted phrase
ranks and BM25 weight, multiplied by 1000 and rounded to integer.
This is going to be changed, so that MATCH_ALL and MATCH_ANY modes
use BM25 weights as well. This would improve search results in those
match spans where phrase ranks are equal; this is especially useful
for 1-word queries.
The key idea (in all modes, besides boolean) is that better subphrase
matches are ranked higher, and perfect matches are pulled to the top. Author's
experience is that this phrase proximity based ranking provides noticeably
better search quality than any statistical scheme alone (such as BM25,
which is commonly used in other search engines).
There are the following result sorting modes available:
- SPH_SORT_RELEVANCE mode, that sorts by relevance in descending order (best matches first);
- SPH_SORT_ATTR_DESC mode, that sorts by an attribute in descending order (bigger attribute values first);
- SPH_SORT_ATTR_ASC mode, that sorts by an attribute in ascending order (smaller attribute values first);
- SPH_SORT_TIME_SEGMENTS mode, that sorts by time segments (last hour/day/week/month) in descending order, and then by relevance in descending order;
- SPH_SORT_EXTENDED mode, that sorts by SQL-like combination of columns in ASC/DESC order;
- SPH_SORT_EXPR mode, that sorts by an arithmetic expression.
SPH_SORT_RELEVANCE ignores any additional parameters and always sorts matches
by relevance rank. All other modes require an additional sorting clause, with the
syntax depending on specific mode. SPH_SORT_ATTR_ASC, SPH_SORT_ATTR_DESC and
SPH_SORT_TIME_SEGMENTS modes require simply an attribute name.
SPH_SORT_RELEVANCE is equivalent to sorting by "@weight DESC, @id ASC" in extended sorting mode,
SPH_SORT_ATTR_ASC is equivalent to "attribute ASC, @weight DESC, @id ASC",
and SPH_SORT_ATTR_DESC to "attribute DESC, @weight DESC, @id ASC" respectively.
SPH_SORT_TIME_SEGMENTS mode
In SPH_SORT_TIME_SEGMENTS mode, attribute values are split into so-called
time segments, and then sorted by time segment first, and by relevance second.
The segments are calculated according to the current timestamp
at the time when the search is performed, so the results would change over time.
The segments are as follows:
- last hour,
- last day,
- last week,
- last month,
- last 3 months,
- everything else.
These segments are hardcoded, but it is trivial to change them if necessary.
This mode was added to support searching through blogs, news headlines, etc.
When using time segments, recent records would be ranked higher because of segment,
but within the same segment, more relevant records would be ranked higher -
unlike sorting by just the timestamp attribute, which would not take relevance
into account at all.
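The ordering can be sketched in Python as a two-part sort key: segment number first, relevance second. This is a model only. The segment boundaries in seconds (in particular 30 and 90 days for "month" and "3 months") and the names time_segment and sort_by_time_segments are assumptions made for the example, not values taken from the Sphinx source.

```python
HOUR = 3600
DAY = 24 * HOUR
# Upper bounds of the first five segments, in seconds (assumed lengths).
SEGMENT_LIMITS = [HOUR, DAY, 7 * DAY, 30 * DAY, 90 * DAY]

def time_segment(timestamp, now):
    """Return 0 for 'last hour' up to 5 for 'everything else'."""
    age = now - timestamp
    for segment, limit in enumerate(SEGMENT_LIMITS):
        if age < limit:
            return segment
    return len(SEGMENT_LIMITS)

def sort_by_time_segments(matches, now):
    # matches: list of (doc_id, weight, timestamp) tuples
    return sorted(matches, key=lambda m: (time_segment(m[2], now), -m[1]))

now = 1_000_000_000
matches = [
    (1, 10, now - 100),       # last hour, low relevance
    (2, 50, now - 200),       # last hour, higher relevance
    (3, 99, now - 2 * DAY),   # last week, best relevance overall
]
ordered_ids = [m[0] for m in sort_by_time_segments(matches, now)]
```

Document 3 has the highest weight but still sorts last, because segment membership is compared before relevance.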
SPH_SORT_EXTENDED mode
In SPH_SORT_EXTENDED mode, you can specify an SQL-like sort expression
with up to 5 attributes (including internal attributes), eg:
@relevance DESC, price ASC, @id DESC
Both internal attributes (that are computed by the engine on the fly)
and user attributes that were configured for this index are allowed.
Internal attribute names must start with magic @-symbol; user attribute
names can be used as is. In the example above, @relevance
and @id are internal attributes and price is user-specified.
Known internal attributes are:
- @id (match ID)
- @weight (match weight)
- @rank (match weight)
- @relevance (match weight)
- @random (return results in random order)
@rank and @relevance are just additional
aliases to @weight.
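To make the clause semantics concrete, here is a small Python model of an extended sort over in-memory matches. It is a sketch, not the engine's sorter: extended_sort is a hypothetical name, only numeric values are handled, and @random is not supported.

```python
def extended_sort(matches, clause):
    """Sort matches by a clause such as '@relevance DESC, price ASC'."""
    aliases = {"@rank": "@weight", "@relevance": "@weight"}
    keys = []
    for part in clause.split(","):
        name, order = part.split()
        sign = -1 if order.upper() == "DESC" else 1
        keys.append((aliases.get(name, name), sign))
    # Negating numeric values turns an ascending sort into a descending one.
    return sorted(matches, key=lambda m: tuple(sign * m[name] for name, sign in keys))

matches = [
    {"@id": 1, "@weight": 3, "price": 20},
    {"@id": 2, "@weight": 3, "price": 10},
    {"@id": 3, "@weight": 5, "price": 99},
    {"@id": 4, "@weight": 3, "price": 10},
]
ordered = extended_sort(matches, "@relevance DESC, price ASC, @id DESC")
ordered_ids = [m["@id"] for m in ordered]
```

The best-weighted match comes first; among the three matches with equal weight the cheaper ones win, and the remaining tie between documents 2 and 4 is broken by descending ID.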
SPH_SORT_EXPR mode
Expression sorting mode lets you sort the matches by an arbitrary arithmetic
expression, involving attribute values, internal attributes (@id and @weight),
arithmetic operations, and a number of built-in functions. Here's an example:
$cl->SetSortMode ( SPH_SORT_EXPR,
"@weight + ( user_karma + ln(pageviews) )*0.1" );
The following operators and functions are supported. They are modeled after MySQL.
The functions take a number of arguments depending on the specific function.
- Operators: +, -, *, /, <, >, <=, >=, =, <>.
- 0-argument functions: NOW().
- Unary (1-argument) functions: ABS(), CEIL(), FLOOR(), SIN(), COS(), LN(), LOG2(), LOG10(), EXP(), SQRT(), BIGINT().
- Binary (2-argument) functions: MIN(), MAX(), POW(), IDIV().
- Ternary (3-argument) functions: IF().
- Variable argument count functions: INTERVAL(), IN().
Calculations can be performed in three different modes: (a) using single-precision,
32-bit IEEE 754 floating point values (the default), (b) using signed 32-bit integers,
(c) using 64-bit signed integers. The expression parser will automatically switch
to integer mode if there are no operations that result in a floating point value.
Otherwise, it will use the default floating point mode. For instance, "a+b"
will be computed using 32-bit integers if both arguments are 32-bit integers;
or using 64-bit integers if both arguments are integers but one of them is
64-bit; or in floats otherwise. However, "a/b" or "sqrt(a)" will always be
computed in floats, because these operations return non-integer result.
To avoid the first, you can use IDIV(). Also, "a*b" will not be automatically
promoted to 64-bit when the arguments are 32-bit. To enforce 64-bit results,
you can use BIGINT(). (But note that if there are non-integer operations,
BIGINT() will simply be ignored.)
Comparison operators (eg. = or <=) return 1.0 when the condition is true and 0.0 otherwise.
For instance, (a=b)+3 will evaluate to 4 when attribute 'a' is equal to attribute 'b', and to 3 when 'a' is not.
Unlike MySQL, the equality comparisons (ie. = and <> operators) introduce a small equality threshold (1e-6 by default).
If the difference between compared values is within the threshold, they will be considered equal.
All unary and binary functions are straightforward, they behave just like their mathematical counterparts.
But IF() behavior needs to be explained in more detail.
It takes 3 arguments, checks whether the 1st argument is equal to 0.0, and returns the 2nd argument if it is not zero, or the 3rd one when it is.
Note that unlike comparison operators, IF() does not use a threshold!
Therefore, it's safe to use comparison results as its 1st argument, but arithmetic operators might produce unexpected results.
For instance, the following two calls will produce different results even though they are logically equivalent:
IF ( sqrt(3)*sqrt(3)-3<>0, a, b )
IF ( sqrt(3)*sqrt(3)-3, a, b )
In the first case, the comparison operator <> will return 0.0 (false)
because of a threshold, and IF() will always return 'b' as a result.
In the second one, the same sqrt(3)*sqrt(3)-3 expression will be compared
with zero without threshold by the IF() function itself.
But its value will be slightly different from zero because of limited floating point
calculations precision. Because of that, the comparison with 0.0 done by IF()
will not pass, and the second variant will return 'a' as a result.
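The difference between the two calls can be reproduced with a small Python model. This is a sketch of the documented behaviour only: ne and sphinx_if are hypothetical stand-ins for the <> operator and IF(), and Python computes in double precision whereas Sphinx uses single precision, so the size of the rounding error differs.

```python
import math

EQ_THRESHOLD = 1e-6   # default equality threshold mentioned above

def ne(a, b):
    """Model of <>: 0.0 when the values are within the threshold, else 1.0."""
    return 0.0 if abs(a - b) <= EQ_THRESHOLD else 1.0

def sphinx_if(cond, if_nonzero, if_zero):
    """Model of IF(): exact comparison with 0.0, no threshold."""
    return if_nonzero if cond != 0.0 else if_zero

# Mathematically zero, but floating point rounding leaves a tiny residue.
residue = math.sqrt(3) * math.sqrt(3) - 3

first = sphinx_if(ne(residue, 0), "a", "b")   # comparison absorbs the residue
second = sphinx_if(residue, "a", "b")         # residue is tested directly
```

The practical consequence is to wrap arithmetic results in an explicit comparison before passing them to IF() whenever a "close enough to zero" test is intended.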
BIGINT() function, introduced in version 0.9.9, forcibly promotes the integer argument to 64-bit type,
and does nothing on floating point argument. It's intended to help enforce evaluation
of certain expressions (such as "a*b") in 64-bit mode even though all the arguments
are 32-bit.
The IDIV() function performs an integer division on its 2 arguments. The result
is integer as well, unlike "a/b" result.
IN(expr,val1,val2,...), introduced in version 0.9.9, takes 2 or more arguments, and returns 1 if 1st argument
(expr) is equal to any of the other arguments (val1..valN), or 0 otherwise.
Currently, all the checked values (but not the expression itself!) are required
to be constant. (It's technically possible to implement arbitrary expressions too,
and that might be implemented in the future.) Constants are pre-sorted and then
binary search is used, so IN() even against a big arbitrary list of constants
will be very quick.
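The pre-sort plus binary search approach can be sketched in Python with the standard bisect module. This illustrates the technique rather than the actual Sphinx code; make_in is a hypothetical helper that prepares the constant list once and returns the lookup function.

```python
from bisect import bisect_left

def make_in(constants):
    """Prepare an IN() check: sort the constants once, search many times."""
    ordered = sorted(constants)
    def contains(value):
        i = bisect_left(ordered, value)   # O(log n) lookup
        return 1 if i < len(ordered) and ordered[i] == value else 0
    return contains

in_list = make_in([30, 10, 20])
found = in_list(20)
missing = in_list(25)
```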
INTERVAL(expr,point1,point2,point3,...), introduced in version 0.9.9, takes 2 or more arguments, and returns
the number of points that are less than or equal to the first argument: it returns
0 if expr<point1, 1 if point1<=expr<point2, and so on.
It is required that point1<point2<...<pointN for this function
to work correctly.
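Under the ordering requirement, the result of INTERVAL() is the position at which expr would be inserted to the right of any equal point. A minimal Python sketch using the standard bisect module (the function name interval is chosen for the example):

```python
from bisect import bisect_right

def interval(expr, *points):
    """Model of INTERVAL(): points must be in strictly increasing order."""
    return bisect_right(points, expr)

below_all = interval(5, 10, 20, 30)
on_first = interval(10, 10, 20, 30)
between = interval(25, 10, 20, 30)
above_all = interval(30, 10, 20, 30)
```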
NOW(), introduced in version 0.9.9, is a helper function that returns current timestamp as a 32-bit integer.
4.6. Grouping (clustering) search results
Sometimes it could be useful to group (or in other terms, cluster)
search results and/or count per-group match counts - for instance,
to draw a nice graph of how many matching blog posts there were per
month; or to group Web search results by site; or to group
matching forum posts by author; etc.
In theory, this could be performed by doing only the full-text search
in Sphinx and then using found IDs to group on SQL server side. However,
in practice doing this with a big result set (10K-10M matches) would
typically kill performance.
To avoid that, Sphinx offers so-called grouping mode. It is enabled
with SetGroupBy() API call. When grouping, all matches are assigned to
different groups based on group-by value. This value is computed from
specified attribute using one of the following built-in functions:
- SPH_GROUPBY_DAY, extracts year, month and day in YYYYMMDD format from timestamp;
- SPH_GROUPBY_WEEK, extracts the year and the number of the first day of the week (counting from the start of the year) in YYYYNNN format from timestamp;
- SPH_GROUPBY_MONTH, extracts month in YYYYMM format from timestamp;
- SPH_GROUPBY_YEAR, extracts year in YYYY format from timestamp;
- SPH_GROUPBY_ATTR, uses attribute value itself for grouping.
The final search result set then contains one best match per group.
Grouping function value and per-group match count are returned along
as "virtual" attributes named @group and @count respectively.
The result set is sorted by group-by sorting clause, with the syntax similar
to SPH_SORT_EXTENDED sorting clause
syntax. In addition to @id and @weight,
group-by sorting clause may also include:
- @group (groupby function value),
- @count (amount of matches in group).
The default mode is to sort by groupby value in descending order,
ie. by "@group desc".
On completion, total_found result parameter would
contain total amount of matching groups over the whole index.
WARNING: grouping is done in fixed memory
and thus its results are only approximate; so there might be more groups reported
in total_found than actually present. @count might also
be underestimated. To reduce inaccuracy, one should raise max_matches.
If max_matches allows to store all found groups, results will be 100% correct.
For example, if sorting by relevance and grouping by "published"
attribute with SPH_GROUPBY_DAY function, then the result set will
contain
- one most relevant match per each day when there were any
matches published,
- with day number and per-day match count attached,
- sorted by day number in descending order (ie. recent days first).
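The example above can be modelled in Python as follows. This is a sketch of the described result shape, not of the fixed-memory grouping inside Sphinx: it keeps every group, so its counts are exact. The names groupby_day, group_matches and ts are hypothetical, and the use of UTC for day boundaries is an assumption.

```python
from datetime import datetime, timezone

def groupby_day(timestamp):
    """YYYYMMDD as an integer (UTC is assumed for this sketch)."""
    d = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return d.year * 10000 + d.month * 100 + d.day

def group_matches(matches):
    # matches: list of (doc_id, weight, published_timestamp) tuples
    groups = {}
    for doc_id, weight, published in matches:
        key = groupby_day(published)
        best = groups.get(key)
        count = best["@count"] + 1 if best else 1
        if best is None or weight > best["weight"]:
            best = {"id": doc_id, "weight": weight, "@group": key}
        best["@count"] = count
        groups[key] = best
    # Default group-by sorting: "@group desc", ie. recent days first.
    return sorted(groups.values(), key=lambda g: g["@group"], reverse=True)

def ts(year, month, day, hour):
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())

matches = [
    (1, 10, ts(2007, 6, 29, 9)),
    (2, 30, ts(2007, 6, 29, 21)),
    (3, 20, ts(2007, 6, 30, 8)),
]
result = group_matches(matches)
summary = [(g["@group"], g["id"], g["@count"]) for g in result]
```

Each entry carries the most relevant document of its day together with the day number and the number of matches published that day.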
4.7. Distributed searching
To scale well, Sphinx has distributed searching capabilities.
Distributed searching is useful to improve query latency (ie. search
time) and throughput (ie. max queries/sec) in multi-server, multi-CPU
or multi-core environments. This is essential for applications which
need to search through huge amounts of data (ie. billions of records
and terabytes of text).
The key idea is to horizontally partition (HP) searched data
across search nodes and then process it in parallel.
Partitioning is done manually. You should
- set up several instances of Sphinx programs (indexer and searchd) on different servers;
- make the instances index (and search) different parts of data;
- configure a special distributed index on some of the searchd instances;
- and query this index.
This index only contains references to other
local and remote indexes - so it could not be directly reindexed,
and you should reindex those indexes which it references instead.
When searchd receives a query against distributed index,
it does the following:
- connects to configured remote agents;
- issues the query;
- sequentially searches configured local indexes (while the remote agents are searching);
- retrieves remote agents' search results;
- merges all the results together, removing the duplicates;
- sends the merged results to client.
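The sequence above can be sketched in Python with plain callables standing in for agents and local indexes. This is a model of the control flow only, not of the searchd network protocol: search_distributed is a hypothetical name, each callable is assumed to return a list of (doc_id, weight) pairs, and keeping the highest weight when a document ID appears more than once is an assumption about how duplicates are resolved.

```python
from concurrent.futures import ThreadPoolExecutor

def search_distributed(query, remote_agents, local_indexes):
    """Query remote agents in parallel, local indexes in sequence, then merge."""
    with ThreadPoolExecutor(max_workers=max(1, len(remote_agents))) as pool:
        # Issue the query to every remote agent first...
        pending = [pool.submit(agent, query) for agent in remote_agents]
        # ...then search local indexes one by one while the agents work.
        results = [index(query) for index in local_indexes]
        # Retrieve the remote agents' results.
        results += [p.result() for p in pending]

    merged = {}
    for result in results:
        for doc_id, weight in result:
            # Duplicate handling is assumed: keep the best weight per ID.
            if doc_id not in merged or weight > merged[doc_id]:
                merged[doc_id] = weight
    return sorted(merged.items(), key=lambda m: (-m[1], m[0]))

local = lambda q: [(1, 5), (2, 3)]
remote = lambda q: [(2, 4), (3, 9)]
merged = search_distributed("test", [remote], [local])
```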
From the application's point of view, there are no differences
between usual and distributed index at all.
Any searchd instance could serve both as a master
(which aggregates the results) and a slave (which only does local searching)
at the same time. This has a number of uses:
- every machine in a cluster could serve as a master which
searches the whole cluster, and search requests could be balanced between
masters to achieve a kind of HA (high availability) in case any of the nodes fails;
- if running within a single multi-CPU or multi-core machine, there
would be only 1 searchd instance querying itself as an agent and thus
utilizing all CPUs/cores.
Better HA support is planned, which would allow you
to specify which agents mirror each other, do health checks, keep track
of alive agents, load-balance requests, etc.
4.8. searchd query log format
searchd logs all successfully executed search queries
into query log file. Here's an example:
[Fri Jun 29 21:17:58 2007] 0.004 sec [all/0/rel 35254 (0,20)] [lj] test
[Fri Jun 29 21:20:34 2007] 0.024 sec [all/0/rel 19886 (0,20) @channel_id] [lj] test
This log format is as follows:
[query-date] query-time [match-mode/filters-count/sort-mode
total-matches (offset,limit) @groupby-attr] [index-name] query
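For scripts that analyse the query log, the format above maps onto a regular expression fairly directly. The Python sketch below is one possible parser, not a tool shipped with Sphinx; the pattern name LOG_LINE, the function name parse_query_log_line and the group names are all chosen for the example, and only the format shown above is covered.

```python
import re

LOG_LINE = re.compile(
    r"^\[(?P<date>[^\]]+)\] (?P<seconds>\d+\.\d+) sec "
    r"\[(?P<match_mode>\w+)/(?P<filters>\d+)/(?P<sort_mode>[\w+-]+) "
    r"(?P<total>\d+) \((?P<offset>\d+),(?P<limit>\d+)\)"
    r"(?: @(?P<groupby>\w+))?\] "
    r"\[(?P<index>[^\]]+)\] (?P<query>.*)$"
)

def parse_query_log_line(line):
    """Return the fields of one log line as a dict of strings, or None."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

plain = parse_query_log_line(
    "[Fri Jun 29 21:17:58 2007] 0.004 sec [all/0/rel 35254 (0,20)] [lj] test"
)
grouped = parse_query_log_line(
    "[Fri Jun 29 21:20:34 2007] 0.024 sec "
    "[all/0/rel 19886 (0,20) @channel_id] [lj] test"
)
```

All values are returned as strings; the groupby entry is None when the query did not use grouping.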
Match mode can take one of the following values:
- "all" for SPH_MATCH_ALL mode;
- "any" for SPH_MATCH_ANY mode;
- "phr" for SPH_MATCH_PHRASE mode;
- "bool" for SPH_MATCH_BOOLEAN mode;
- "ext" for SPH_MATCH_EXTENDED mode;
- "ext2" for SPH_MATCH_EXTENDED2 mode;
- "scan" if the full scan mode was used, either by being specified with SPH_MATCH_FULLSCAN, or if the query was empty (as documented under Matching Modes)
Sort mode can take one of the following values:
- "rel" for SPH_SORT_RELEVANCE mode;
- "attr-" for SPH_SORT_ATTR_DESC mode;
- "attr+" for SPH_SORT_ATTR_ASC mode;
- "tsegs" for SPH_SORT_TIME_SEGMENTS mode;
- "ext" for SPH_SORT_EXTENDED mode.
Additionally, if searchd was started with --iostats, there will be a block of data after where the index(es) searched are listed. A query log entry might take the form of:
[Fri Jun 29 21:17:58 2007] 0.004 sec [all/0/rel 35254 (0,20)] [lj] [ios=6 kb=111.1 ms=0.5] test
This additional block is information regarding I/O operations in performing the search: the number of file I/O operations carried out, the amount of data in kilobytes read from the index files and time spent on I/O operations (although there is a background processing component, the bulk of this time is the I/O operation time).
5. Command line tools reference
As mentioned elsewhere, Sphinx is not a single program called 'sphinx', but a collection of 4 separate programs which collectively form Sphinx. This section covers these tools and how to use them.
5.1. indexer command reference
indexer is the first of the two principal tools as part of Sphinx. Invoked from either the command line directly, or as part of a larger script, indexer is solely responsible for gathering the data that will be searchable.
The calling syntax for indexer is as follows:
indexer [OPTIONS] [indexname1 [indexname2 [...]]]
Essentially you would list the different possible indexes (that you would later make available to search) in sphinx.conf, so when calling indexer, as a minimum you need to be telling it what index (or indexes) you want to index. If sphinx.conf contained details on 2 indexes, mybigindex and mysmallindex, you could do the following:
$ indexer mybigindex
$ indexer mysmallindex mybigindex
As part of the configuration file, sphinx.conf, you specify one or more indexes for your data. You might call indexer to reindex one of them, ad-hoc, or you can tell it to process all indexes - you are not limited to calling just one, or all at once, you can always pick some combination of the available indexes. The majority of the options for indexer are given in the configuration file, however there are some options you might need to specify on the command line as well, as they can affect how the indexing operation is performed. These options are:
--config <file> (-c <file> for short) tells indexer to use the given file as its configuration. Normally, it will look for sphinx.conf in the installation directory (e.g. /usr/local/sphinx/etc/sphinx.conf if installed into /usr/local/sphinx), followed by the current directory you are in when calling indexer from the shell. This is most of use in shared environments where the binary files are installed somewhere like /usr/local/sphinx/ but you want to provide users with the ability to make their own custom Sphinx set-ups, or if you want to run multiple instances on a single server. In cases like those you could allow them to create their own sphinx.conf files and pass them to indexer with this option. For example:
$ indexer --config /home/myuser/sphinx.conf myindex
--all tells indexer to update every index listed in sphinx.conf, instead of listing individual indexes. This would be useful in small configurations, or cron-type or maintenance jobs where the entire index set will get rebuilt each day, or week, or whatever period is best. Example usage:
$ indexer --config /home/myuser/sphinx.conf --all
--rotate is used for rotating indexes. Unless you have the situation where you can take the search function offline without troubling users, you will almost certainly need to keep search running whilst indexing new documents. --rotate creates a second index, parallel to the first (in the same place, simply including .new in the filenames). Once complete, indexer notifies searchd via sending the SIGHUP signal, and searchd will attempt to rename the indexes (renaming the existing ones to include .old and renaming the .new to replace them), and then start serving from the newer files. Depending on the setting of seamless_rotate, there may be a slight delay in being able to search the newer indexes. Example usage:
$ indexer --rotate --all
--quiet tells indexer not to output anything, unless there is an error. Again, most used for cron-type, or other script jobs where the output is irrelevant or unnecessary, except in the event of some kind of error. Example usage:
$ indexer --rotate --all --quiet
--noprogress does not display progress details as they occur; instead, the final status details (such as documents indexed, speed of indexing and so on) are only reported at completion of indexing. In instances where the script is not being run on a console (or 'tty'), this will be on by default. Example usage:
$ indexer --rotate --all --noprogress
--buildstops <outputfile.txt> <N> reviews the index source, as if it were indexing the data, and produces a list of the terms that are being indexed. In other words, it produces a list of all the searchable terms that are becoming part of the index. Note: it does not update the index in question, it simply processes the data 'as if' it were indexing, including running queries defined with sql_query_pre or sql_query_post. outputfile.txt will contain the list of words, one per line, sorted by frequency with most frequent first, and N specifies the maximum number of words that will be listed; if N is large enough to encompass every word in the index, only as many words as the index contains will be returned. Such a dictionary list could be used for client application features around "Did you mean..." functionality, usually in conjunction with --buildfreqs, below. Example:
$ indexer myindex --buildstops word_freq.txt 1000
This would produce a document in the current directory, word_freq.txt with the 1,000 most common words in 'myindex', ordered by most common first. Note that the file will pertain to the last index indexed when specified with multiple indexes or --all (i.e. the last one listed in the configuration file)
--buildfreqs works with --buildstops (and is ignored if --buildstops is not specified). As --buildstops provides the list of words used within the index, --buildfreqs adds the quantity present in the index, which would be useful in establishing whether certain words should be considered stopwords if they are too prevalent. It will also help with developing "Did you mean..." features where you can see how much more common a given word is compared to another, similar one. Example:
$ indexer myindex --buildstops word_freq.txt 1000 --buildfreqs
This would produce the word_freq.txt as above, however after each word would be the number of times it occurred in the index in question.
--merge <dst-index> <src-index> is used for physically merging indexes together, for example if you have a main+delta scheme, where the main index rarely changes, but the delta index is rebuilt frequently, and --merge would be used to combine the two. The operation moves from right to left - the contents of src-index get examined and physically combined with the contents of dst-index and the result is left in dst-index.
In pseudo-code, it might be expressed as: dst-index += src-index
An example:
$ indexer --merge main delta --rotate
In the above example, where main is the master, rarely modified index, and delta is the more frequently modified one, you might use the above to call indexer to combine the contents of the delta into the main index and rotate the indexes.
--merge-dst-range <attr> <min> <max> runs the filter range given upon merging. Specifically, as the merge is applied to the destination index (as part of --merge, and is ignored if --merge is not specified), indexer will also filter the documents ending up in the destination index, and only documents will pass through the filter given will end up in the final index. This could be used for example, in an index where there is a 'deleted' attribute, where 0 means 'not deleted'. Such an index could be merged with:
$ indexer --merge main delta --merge-dst-range deleted 0 0
Any documents marked as deleted (value 1) would be removed from the newly-merged destination index. It can be added several times to the command line, to add successive filters to the merge, all of which must be met in order for a document to become part of the final index.
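The combined effect of --merge and one or more --merge-dst-range filters can be pictured with a small Python toy model. This is a sketch of the semantics as described above, not indexer's implementation: documents are modelled as plain dictionaries, the function name merge_indexes is hypothetical, ranges are assumed to be inclusive (as the "deleted 0 0" example suggests), and the sketch applies the filters to every document in the merged result. It makes no claim about how indexer treats documents that share an ID.

```python
def merge_indexes(dst, src, dst_ranges=()):
    """Toy model of `indexer --merge dst src --merge-dst-range ...`.

    dst, src: lists of documents, each a dict of attribute name -> int.
    dst_ranges: (attr, min, max) triples; a document is kept only when
    every listed attribute falls inside its inclusive range.
    """
    merged = list(dst) + list(src)  # dst-index += src-index

    def passes(doc):
        # All filters must be met for the document to survive.
        return all(low <= doc[attr] <= high for attr, low, high in dst_ranges)

    return [doc for doc in merged if passes(doc)]


# Invented sample data with distinct document IDs.
main = [
    {"id": 1, "deleted": 0, "group": 1},
    {"id": 2, "deleted": 1, "group": 1},
]
delta = [
    {"id": 3, "deleted": 0, "group": 2},
]

# Equivalent in spirit to: --merge main delta --merge-dst-range deleted 0 0
kept = merge_indexes(main, delta, [("deleted", 0, 0)])
print([doc["id"] for doc in kept])
```

Passing a second triple, such as ("group", 2, 2), models repeating --merge-dst-range on the command line: a document must then satisfy both ranges to remain.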
5.2. searchd command reference
searchd is the second of the two principal tools in Sphinx. searchd is the part of the system which actually handles searches; it functions as a server and is responsible for receiving queries, processing them and returning a dataset back to the different APIs for client applications.
Unlike indexer, searchd is not designed to be run either from a regular script or command-line calling, but instead either as a daemon to be called from init.d (on Unix/Linux type systems) or to be called as a service (on Windows-type systems), so not all of the command line options will always apply; which ones are available is build-dependent. Calling searchd is simply a case of:
$ searchd [OPTIONS]
The options available to searchd on all builds are:
--help (-h for short) lists all of the parameters that can be called in your particular build of searchd.
--config <file> (-c <file> for short) tells searchd to use the given file as its configuration, just as with indexer above.
--stop is used to stop searchd, using the details of the PID file as specified in the sphinx.conf file, so you may also need to confirm to searchd which configuration file to use with the --config option. NB, calling --stop will also make sure any changes applied to the indexes with UpdateAttributes() will be applied to the index files themselves. Example:
$ searchd --config /home/myuser/sphinx.conf --stop
--pidfile is used to explicitly state a PID file, where the process information regarding searchd is stored, used for inter-process communications (for example, indexer will need to know the PID to contact searchd for rotating indexes). Normally, searchd would use a PID file if running in regular mode (i.e. not with --console), but it is possible that you will be running it in console mode whilst the index is being updated and rotated, for which a PID file will be needed.
$ searchd --config /home/myuser/sphinx.conf --pidfile /home/myuser/sphinx.pid
--console is used to force searchd into console mode; typically it will be running as a conventional server application, and will aim to dump information into the log files (as specified in sphinx.conf). Sometimes though, when debugging issues in the configuration or the daemon itself, or trying to diagnose hard-to-track-down problems, it may be easier to force it to dump information directly to the console/command line from which it is being called. Running in console mode also means that the process will not be forked (so searches are done in sequence) and the log files will not be written to. (It should be noted that console mode is not the intended method for running searchd.) You can invoke it as such:
$ searchd --config /home/myuser/sphinx.conf --console
--iostats is used in conjunction with the logging options (the query_log will need to have been activated in sphinx.conf) to provide more detailed information on a per-query basis as to the input/output operations carried out in the course of that query, with a slight performance hit and of course bigger logs. Further details are available under the query log format section. You might start searchd thus:
$ searchd --config /home/myuser/sphinx.conf --iostats
--port portnumber (-p for short) is used to specify the port that searchd should listen on, usually for debugging purposes. This will usually default to 3312, but sometimes you need to run it on a different port. Specifying it on the command line will override anything specified in the configuration file. The valid range is 0 to 65535, but ports numbered below 1024 usually require a privileged account in order to run. An example of usage:
$ searchd --port 3313
--index <index> forces this instance of searchd only to serve the specified index. Like --port, above, this is usually for debugging purposes; more long-term changes would generally be applied to the configuration file itself. Example usage:
$ searchd --index myindex
There are some options for searchd that are specific to Windows platforms, concerning handling as a service, and are only available in Windows binaries. Note that on Windows searchd will default to --console mode, unless you install it as a service.
--install installs searchd as a service into the Microsoft Management Console (Control Panel / Administrative Tools / Services). Any other parameters specified on the command line where --install is specified will also become part of the command line on future starts of the service. For example, as part of calling searchd, you will likely also need to specify the configuration file with --config, and you would do that as well as specifying --install. Once called, the usual start/stop facilities will become available via the management console, so any methods you could use for starting, stopping and restarting services would also apply to searchd. Example:
C:\WINDOWS\system32> C:\Sphinx\bin\searchd.exe --install --config C:\Sphinx\sphinx.conf
If you wanted to have the I/O stats every time you started searchd, you would specify its option on the same line as the --install command thus:
C:\WINDOWS\system32> C:\Sphinx\bin\searchd.exe --install --config C:\Sphinx\sphinx.conf --iostats
--delete removes the service from the Microsoft Management Console and other places where services are registered, after it was previously installed with --install. Note, this does not uninstall the software or delete the indexes. It means the service will not be called from the services systems, and will not be started on the machine's next start. If currently running as a service, the current instance will not be terminated (until the next reboot, or until searchd is called with --stop). If the service was installed with a custom name (with --servicename), the same name will need to be specified with --servicename when calling to uninstall. Example:
C:\WINDOWS\system32> C:\Sphinx\bin\searchd.exe --delete
--servicename <name> applies the given name to searchd when installing or deleting the service, as would appear in the Management Console; this will default to searchd, but if being deployed on servers where multiple administrators may log into the system, or a system with multiple searchd instances, a more descriptive name may be applicable. Note that unless combined with --install or --delete, this option does not do anything. Example:
C:\WINDOWS\system32> C:\Sphinx\bin\searchd.exe --install --config C:\Sphinx\sphinx.conf --servicename SphinxSearch
--ntservice is the option that is passed by the Management Console to searchd to invoke it as a service on Windows platforms. It would not normally be necessary to call this directly; this would normally be called by Windows when the service would be started, although if you wanted to call this as a regular service from the command-line (as the complement to --console) you could do so in theory.
5.3. search command reference
search is one of the two less prominent tools within the Sphinx package. Whereas searchd is responsible for searches in a server-type environment, search is aimed at testing the index from the command line, and testing the index quickly without building a framework to make the connection to the server and process its response.
Note: search is not intended to be deployed as part of a client application; it is strongly recommended you do not write an interface to search instead of searchd, and none of the bundled client APIs support this method. (In any event, search will reload files each time, whereas searchd will cache them in memory for performance.) That said, many types of query that you could build in the APIs could also be made with search, however for very complex searches it may be easier to construct them using a small script and the corresponding API. Additionally, some newer features may be available in the searchd system that have not yet been brought into search. The calling syntax for search is as follows:
search [OPTIONS] word1 [word2 [word3 [...]]]
When calling search, it is not necessary to have searchd running; simply that the account running search has read access to the configuration file and the location and files of the indexes. The default behaviour is to apply a search for word1 (AND word2 AND word3... as specified) to all fields in all indexes as given in the configuration file. If constructing the equivalent in the API, this would be the equivalent to passing SPH_MATCH_ALL to SetMatchMode, and specifying * as the indexes to query as part of Query. There are many options available to search. Firstly, the general options:
--config <file> (-c <file> for short) tells search to use the given file as its configuration, just as with indexer above.
--index <index> (-i <index> for short) tells search to limit searching to the specified index only; normally it would attempt to search all of the physical indexes listed in sphinx.conf, not any distributed ones.
--stdin tells search to accept the query from the standard input, rather than the command line. This can be useful for testing purposes whereby you could feed input via pipes and from scripts.
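As one way of feeding a query to search from a script, the Python sketch below assembles the general options described above and pipes the query text to --stdin. It is a minimal sketch under assumptions: it assumes the search binary is on the PATH, the function names build_search_argv and run_query_via_stdin are hypothetical, and the paths and index names you pass are your own.

```python
import subprocess


def build_search_argv(config=None, index=None, use_stdin=False, words=()):
    """Assemble an argument list for the `search` tool."""
    argv = ["search"]
    if config:
        argv += ["--config", config]
    if index:
        argv += ["--index", index]
    if use_stdin:
        argv.append("--stdin")
    argv += list(words)  # query words given directly on the command line
    return argv


def run_query_via_stdin(query, config, index=None):
    """Pipe `query` to `search --stdin` and return whatever it prints."""
    argv = build_search_argv(config=config, index=index, use_stdin=True)
    result = subprocess.run(argv, input=query, capture_output=True, text=True)
    return result.stdout
```

For example, run_query_via_stdin("hello world", "/home/myuser/sphinx.conf", "myindex") would run the query against myindex only; this is suited to quick tests, in keeping with the note above that search is not meant to back a client application.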
Options for setting matches:
--any (-a for short) changes the matching mode to match any of the words as part of the query (word1 OR word2 OR word3). In the API this would be equivalent to passing SPH_MATCH_ANY to SetMatchMode.
--phrase (-p for short) changes the matching mode to match all of the words as part of the query, and do so in the phrase given (not including punctuation). In the API this would be equivalent to passing SPH_MATCH_PHRASE to SetMatchMode.
--boolean (-b for short) changes the matching mode to Boolean matching. Note if using Boolean syntax matching on the command line, you may need to escape the symbols (with a backslash) to avoid the shell
|