Mantis Bugtracker

Viewing Issue Simple Details Jump to Notes ] View Advanced ] Issue History ] Print ]
ID Category Severity Reproducibility Date Submitted Last Update
0000559 [Sphinx] general crash always 2010-07-27 05:05 2010-07-27 05:36
Reporter chris14679 View Status public  
Assigned To
Priority normal Resolution open  
Status new   Product Version 1.10-beta
Summary 0000559: frequently locks up entire server when using realtime indexes
Description We are using beta 1.10 downloaded as RPM with 64 bit IDs.

We have tried to convert an existing index into a realtime index but have now given up due to frequent serious crashes.

There are 330M rows in the table we are indexing. On average we can insert about 10M rows before the system locks up. I was monitoring 'top' at the exact time of one of the crashes, when the 'top' display froze there was 19GB of available ('cached' according to top) memory (8GB RAM + 11GB swap) from the total on this machine of 36GB RAM and 11GB swap. The load average was only 5.4 (the machine has 8 cores) so memory shortage and high CPU use can be rules out as causes.

I can rule out the content of the table causing the problem because when we restarted insertion from the same point after a crash, it works OK but crashes later on a different record. We got up to about 50M rows inserted with about 5 crashes at which point we abandoned realtime.

On the last occasion, the machine actually locked up while searchd was loading the realtime index, before we started inserting. This pretty much rules out that anything other than the realtime index could be causing the crashes.

There were no sphinx.crash files and nothing in the log to indicate what the problem is.

I have now reverted to using conventional sphinx indexes and rebuilding the whole index every few days.

Strangely, the server still responded to PING, sometimes intermittently, but everything else locked up - SSH, HTTP etc. all went down instantly and stayed down. A hard reboot followed by database recovery from logs was necessary each time.
Additional Information Server is RHEL5 and is shared by a mysql database server.

Unresolved bug 0000445 also occurred on this same server causing similar crashes with a different version of sphinx. Since then we have rotated indexes manually to avoid causing a crash so I don't know whether 0000445 is still an ongoing problem. I don't want to try a rotation until I have some indication that you think 0000445 has been identified and fixed.

Generally the crashes left no record in any of the logs but on one occasion we did get this in the 'messages' file:

Jul 25 05:20:56 db2 kernel: INFO: task kjournald:991 blocked for more than 120 seconds.
Jul 25 05:20:56 db2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 25 05:20:56 db2 kernel: kjournald D ffff810001004420 0 991 331 1011 954 (L-TLB)
Jul 25 05:20:56 db2 kernel: ffff8104be7e1dd0 0000000000000046 0000000000000010 ffff81079efe8bb0
Jul 25 05:20:56 db2 kernel: 0000000000000000 000000000000000a ffff81093ff59860 ffffffff80308b60
Jul 25 05:20:56 db2 kernel: 00000a47563b5d15 00000000000023d7 ffff81093ff59a48 00000000bfb15a78
Jul 25 05:20:56 db2 kernel: Call Trace:
Jul 25 05:20:56 db2 kernel: [<ffffffff880335cf>] :jbd:journal_commit_transaction+0x16d/0x1066
Jul 25 05:20:56 db2 kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Jul 25 05:20:56 db2 kernel: [<ffffffff8004b36f>] try_to_del_timer_sync+0x7f/0x88
Jul 25 05:20:56 db2 kernel: [<ffffffff880375d3>] :jbd:kjournald+0xc1/0x213
Jul 25 05:20:56 db2 kernel: [<ffffffff800a0abe>] autoremove_wake_function+0x0/0x2e
Jul 25 05:25:46 db2 kernel: [<ffffffff88037512>] :jbd:kjournald+0x0/0x213
Jul 25 05:26:41 db2 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jul 25 05:26:41 db2 kernel: [<ffffffff80032894>] kthread+0xfe/0x132
Jul 25 05:26:41 db2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jul 25 05:26:41 db2 kernel: [<ffffffff800a08a6>] keventd_create_kthread+0x0/0xc4
Jul 25 05:26:41 db2 kernel: [<ffffffff80032796>] kthread+0x0/0x132
Jul 25 05:26:41 db2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jul 25 05:26:41 db2 kernel:
Jul 25 05:26:41 db2 kernel: INFO: task mysqld:6497 blocked for more than 120 seconds.
Jul 25 05:26:41 db2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 25 05:26:41 db2 kernel: mysqld D 0000000000000003 0 6497 6261 6499 (NOTLB)
Jul 25 05:26:41 db2 kernel: ffff8104a7e6ded8 0000000000000086 ffff81093b609128 ffffffff800646ac
Jul 25 05:26:42 db2 kernel: 0000000000000292 0000000000000007 ffff8101d50ab820 ffff8104b67677a0
Jul 25 05:26:42 db2 kernel: 00000a59bf07efc2 0000000000005ccd ffff8101d50aba08 0000000480066b88
Jul 25 05:26:42 db2 kernel: Call Trace:
Jul 25 05:26:42 db2 kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92
Jul 25 05:26:42 db2 kernel: [<ffffffff8000df14>] do_mmap_pgoff+0x431/0x7d7
Jul 25 05:26:42 db2 kernel: [<ffffffff80064613>] __down_write_nested+0x7a/0x92
Jul 25 05:26:42 db2 kernel: [<ffffffff80020bbc>] sys_mprotect+0xcc/0x224
Jul 25 05:26:42 db2 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jul 25 05:26:42 db2 kernel:
Jul 25 05:26:42 db2 kernel: INFO: task mysqld:6534 blocked for more than 120 seconds.
Jul 25 05:26:42 db2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 25 05:26:42 db2 kernel: mysqld D 00002aaaaaaab000 0 6534 6261 6539 6531 (NOTLB)
Jul 25 05:26:42 db2 kernel: ffff81012c8e1f08 0000000000000086 0000000000000821 0000000000000002
Jul 25 05:26:42 db2 kernel: 0000000000000004 0000000000000007 ffff81025d3d4820 ffff8101d50ab820
Jul 25 05:26:42 db2 kernel: 00000a59bf0792f5 000000000001ca18 ffff81025d3d4a08 000000044c4c00e0
Jul 25 05:26:42 db2 kernel: Call Trace:
Jul 25 05:26:42 db2 kernel: [<ffffffff8000c5fe>] _atomic_dec_and_lock+0x39/0x57
Jul 25 05:26:42 db2 kernel: [<ffffffff80064613>] __down_write_nested+0x7a/0x92
Jul 25 05:26:42 db2 kernel: [<ffffffff800161ee>] sys_munmap+0x32/0x59
Jul 25 05:26:42 db2 kernel: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Jul 25 05:26:42 db2 kernel:
Jul 25 05:26:42 db2 kernel: INFO: task mysqld:7846 blocked for more than 120 seconds.
Jul 25 05:26:42 db2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 25 05:26:42 db2 kernel: mysqld D 0000000000000006 0 7846 6261 15789 7845 (NOTLB)
Jul 25 05:26:42 db2 kernel: ffff810706cd7e18 0000000000000086 0000000000000000 00002aad67b81000
Jul 25 05:26:42 db2 kernel: ffffffff80317ed0 0000000000000007 ffff8109350ab860 ffff81093e6d5040
Jul 25 05:26:42 db2 kernel: 00000a59bf084051 0000000000007ffa ffff8109350aba48 0000000580008f97
Jul 25 05:26:42 db2 kernel: Call Trace:
Jul 25 05:26:42 db2 kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92
Jul 25 05:26:42 db2 kernel: [<ffffffff80066ad0>] do_page_fault+0x446/0x874
Jul 25 05:26:42 db2 kernel: [<ffffffff800312b3>] do_fork+0x148/0x1c1
Jul 25 05:27:28 db2 kernel: [<ffffffff8005dde9>] error_exit+0x0/0x84
Jul 25 05:27:28 db2 kernel:
Jul 25 05:27:28 db2 kernel: INFO: task mysqld:13903 blocked for more than 120 seconds.
Jul 25 05:27:28 db2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 25 05:27:28 db2 kernel: mysqld D 0000000000000006 0 13903 6261 13904 13902 (NOTLB)
Jul 25 05:27:28 db2 kernel: ffff810723dc9e18 0000000000000086 0000000000000000 00002aad67939000
Jul 25 05:27:28 db2 kernel: ffffffff80317ed0 0000000000000007 ffff81093fc2f860 ffff8109350ab860
Jul 25 05:27:28 db2 kernel: 00000a59bf07c057 000000000001df0a ffff81093fc2fa48 0000000580008f97
Jul 25 05:27:28 db2 kernel: Call Trace:
Jul 25 05:29:52 db2 kernel: [<ffffffff800646ac>] __down_read+0x7a/0x92
Jul 25 05:38:02 db2 syslogd 1.4.1: restart.


-----------


The following errors also appeared in the 'messages' log after we stopped using realtime indexes and did not coincide with a lockup, but I include them as they may be relevant:

Jul 26 00:11:27 db2 kernel: searchd[6216]: segfault at 0000000000000000 rip 00002b31e9abced7 rsp 00007fffcb4c5668 error 6
Jul 26 00:21:48 db2 kernel: searchd[14278]: segfault at 0000000000000000 rip 00002b7024215ed7 rsp 00007fffa1d83888 error 6
Tags No tags attached.
Attached Files

- Relationships

-  Notes
(0000812)
chris14679 (reporter)
2010-07-27 05:36
edited on: 2010-07-27 05:39

#index rt_profiles
#{
# type = rt
# rt_mem_limit = 1024M

# path = /disk2/sphinx/rtprofiles/rtprofiles
# charset_type = utf-8

# morphology=none

# rt_field = str
# rt_attr_uint = age
# rt_attr_uint = quality
# rt_attr_uint = gender
# rt_attr_float = lat_rads
# rt_attr_float = long_rads
#}

#############################################################################
## searchd settings
#############################################################################

searchd
{
    # IP address to bind on
    # optional, default is 0.0.0.0 (ie. listen on all interfaces)
    #
    # address = 127.0.0.1
    # address = 192.168.0.1

    listen = [removed]:[removed]
    listen = [removed]:[removed]:mysql41
    listen = localhost:[removed]:mysql41
    
    # log file, searchd run info is logged here
    # optional, default is 'searchd.log'
    log = /var/log/sphinx/searchd.log

    # query log file, all search queries are logged here
    # optional, default is empty (do not log queries)
    query_log = /var/log/sphinx/query.log

    # client read timeout, seconds
    # optional, default is 5
    read_timeout = 5

    # maximum amount of children to fork (concurrent searches to run)
    # optional, default is 0 (unlimited)
    max_children = 50
    
    # request timeout, seconds
    # optional, default is 5 minutes
    client_timeout = 3600

    # PID file, searchd process ID file name
    # mandatory
    pid_file = /var/run/sphinx/searchd.pid

      # /var/log/searchd.pid

    # max amount of matches the daemon ever keeps in RAM, per-index
    # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
    # default is 1000 (just like Google)
    max_matches = 100000

    # seamless rotate, prevents rotate stalls if precaching huge datasets
    # optional, default is 1
    seamless_rotate = 1

    # whether to forcibly preopen all indexes on startup
    # optional, default is 0 (do not preopen)
    preopen_indexes = 1

    # whether to unlink .old index copies on succesful rotation.
    # optional, default is 1 (do unlink)
    unlink_old = 1
    
    mva_updates_pool = 20M
    crash_log_path = /var/log/sphinx/crash
    read_buffer=16M

workers=threads


max_packet_size=64M

}


- Issue History
Date Modified Username Field Change
2010-07-27 05:05 chris14679 New Issue
2010-07-27 05:36 chris14679 Note Added: 0000812
2010-07-27 05:39 chris14679 Note Edited: 0000812


Copyright © 2000 - 2008 Mantis Group
Powered by Mantis Bugtracker