OMERO web failure

Posted: Wed Nov 14, 2012 10:59 am
by ingvar
Hello,

Since I cleaned out my postgres DB, see:
viewtopic.php?f=4&t=3085
my image viewer has stopped working on four occasions. I have two servers; one has failed once, the other three times. In all cases, stopping and restarting web and admin made OMERO web work again.

On three out of four occasions, the output of "bin/omero admin diagnostics" had errors like:

Code:
ves-ebi-81:/var/omero/dist pdb_em$ bin/omero admin diagnostics
================================================================================
OMERO Diagnostics 4.3.3-DEV-ice34
================================================================================

WARNING:omero.util.UpgradeCheck:UPGRADE AVAILABLE:Please upgrade to 4.4.4
See http://trac.openmicroscopy.org.uk/omero for the latest version

Commands:   java -version                  error:'NoneType' object has no attribute 'group'
Commands:   python -V                      2.6.6     (/usr/bin/python)
Commands:   icegridnode --version          3.4.2     (/usr/bin/icegridnode)
Commands:   icegridadmin --version         3.4.2     (/usr/bin/icegridadmin)
Commands:   psql --version                 8.4.11    (/usr/bin/psql)

The 'NoneType' errors were for one or more of java and icegridnode/icegridadmin, though the most recent instance of OMERO web failing did not give an error here, so they may be unrelated.

OMEROweb_request.log shows errors like:
Code:
2012-11-14 03:51:37,374 ERROR [                          django.request] (proc.18414) handle_uncaught_exception:209 Internal Server Error: /emdb/omero/webemdb/1814_sliceviewer/
Traceback (most recent call last):
  File "/var/omero/dist/lib/python/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/var/omero/dist/lib/python/omeroweb/webemdb/views.py", line 1148, in sliceviewer
    conn = getConnection(request)
  File "/var/omero/dist/lib/python/omeroweb/webemdb/views.py", line 1129, in getConnection
    logger.debug('emdb connection: %s server %s' % (conn._sessionUuid, blitz.host))
AttributeError: 'NoneType' object has no attribute '_sessionUuid'
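
The traceback above fails on `conn._sessionUuid` because `conn` is `None` at that point. A minimal Python sketch of guarding against that case follows. `ConnectionUnavailable` and `describe_connection` are hypothetical names introduced for illustration; they are not part of OMERO or of the webemdb code, and the real view may need different handling.

```python
class ConnectionUnavailable(Exception):
    """Raised when no server connection could be created (hypothetical name)."""


def describe_connection(conn, host):
    """Return the debug message for a connection, or raise if there is none."""
    if conn is None:
        # No connection object was returned, for example because
        # session creation failed further down the stack.
        raise ConnectionUnavailable("no connection to server %s" % host)
    return "emdb connection: %s server %s" % (conn._sessionUuid, host)
```

With a check like this the view would report the missing connection directly, rather than an AttributeError that hides the earlier failure.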

OMEROweb.log around that time shows:
Code:
2012-11-14 03:21:37,353  INFO [                           omero.gateway] (proc.18412) connect:1617 created connection (uuid=823228bd-33e1-4c9d-b79c-19e4dc9c12c4)
2012-11-14 03:31:37,515  INFO [                           omero.gateway] (proc.18411) connect:1617 created connection (uuid=3623828d-a81a-4ceb-89e7-ac25ddd6e476)
2012-11-14 03:31:37,705  INFO [               omeroweb.webgateway.views] (proc.18411) _purge:233 reached connector_pool_size (70), size after purge: (18)
2012-11-14 03:41:37,457  INFO [                           omero.gateway] (proc.18414) connect:1617 created connection (uuid=3786f22e-6f43-4b29-9aaa-a85fbf55de23)
2012-11-14 03:51:37,371  INFO [                           omero.gateway] (proc.18414) connect:1610 BlitzGateway.connect().createSession(): Traceback (most recent call last):
  File "/var/omero/dist/lib/python/omero/gateway/__init__.py", line 1581, in connect
    self._createSession()
  File "/var/omero/dist/lib/python/omero/gateway/__init__.py", line 1491, in _createSession
    self.setSecure(self.secure)
  File "/var/omero/dist/lib/python/omero/gateway/__init__.py", line 1462, in setSecure
    self.c = oldC.createClient(secure=secure)
  File "/var/omero/dist/lib/python/omero/clients.py", line 303, in createClient
    nClient = omero.client(props)
  File "/var/omero/dist/lib/python/omero/__init__.py", line 28, in client
    return omero.clients.BaseClient(*args, **kwargs)
  File "/var/omero/dist/lib/python/omero/clients.py", line 131, in __init__
    self._initData(id)
  File "/var/omero/dist/lib/python/omero/clients.py", line 263, in _initData
    self.__oa = self.__ic.createObjectAdapter("omero.ClientCallback")
  File "/usr/lib64/python2.6/site-packages/Ice/Ice.py", line 417, in createObjectAdapter
    adapter = self._impl.createObjectAdapter(name)
UnknownException: exception ::Ice::UnknownException
{
    unknown = Thread.cpp:521: IceUtil::ThreadSyscallException:
syscall exception: Resource temporarily unavailable
}

2012-11-14 03:51:37,373 WARNI [                            omero.client] (proc.18414) __del__:318 Ignoring error in client.__del__:<class 'Glacier2.SessionNotExistException'>
2012-11-14 03:51:37,397 ERROR [                 omeroweb.feedback.views] (proc.18414) handler500:132 handler500: Server error
2012-11-14 03:51:37,397 ERROR [                 omeroweb.feedback.views] (proc.18414) handler500:134 Traceback (most recent call last):

  File "/var/omero/dist/lib/python/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)

  File "/var/omero/dist/lib/python/omeroweb/webemdb/views.py", line 1148, in sliceviewer
    conn = getConnection(request)

  File "/var/omero/dist/lib/python/omeroweb/webemdb/views.py", line 1129, in getConnection
    logger.debug('emdb connection: %s server %s' % (conn._sessionUuid, blitz.host))

AttributeError: 'NoneType' object has no attribute '_sessionUuid'

2012-11-14 03:53:37,266  INFO [                           omero.gateway] (proc.18414) connect:1610 BlitzGateway.connect().createSession(): Traceback (most recent call last):
  File "/var/omero/dist/lib/python/omero/gateway/__init__.py", line 1581, in connect
    self._createSession()
  File "/var/omero/dist/lib/python/omero/gateway/__init__.py", line 1491, in _createSession
    self.setSecure(self.secure)
  File "/var/omero/dist/lib/python/omero/gateway/__init__.py", line 1462, in setSecure
    self.c = oldC.createClient(secure=secure)
  File "/var/omero/dist/lib/python/omero/clients.py", line 303, in createClient
    nClient = omero.client(props)
  File "/var/omero/dist/lib/python/omero/__init__.py", line 28, in client
    return omero.clients.BaseClient(*args, **kwargs)
  File "/var/omero/dist/lib/python/omero/clients.py", line 131, in __init__
    self._initData(id)
  File "/var/omero/dist/lib/python/omero/clients.py", line 263, in _initData
    self.__oa = self.__ic.createObjectAdapter("omero.ClientCallback")
  File "/usr/lib64/python2.6/site-packages/Ice/Ice.py", line 417, in createObjectAdapter
    adapter = self._impl.createObjectAdapter(name)
UnknownException: exception ::Ice::UnknownException
{
    unknown = Thread.cpp:521: IceUtil::ThreadSyscallException:
syscall exception: Resource temporarily unavailable
}

2012-11-14 03:53:37,268 WARNI [                            omero.client] (proc.18414) __del__:318 Ignoring error in client.__del__:<class 'Glacier2.SessionNotExistException'>
2012-11-14 03:53:37,269 ERROR [                 omeroweb.feedback.views] (proc.18414) handler500:132 handler500: Server error
2012-11-14 03:53:37,269 ERROR [                 omeroweb.feedback.views] (proc.18414) handler500:134 Traceback (most recent call last):

  File "/var/omero/dist/lib/python/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)

  File "/var/omero/dist/lib/python/omeroweb/webemdb/views.py", line 1148, in sliceviewer
    conn = getConnection(request)

  File "/var/omero/dist/lib/python/omeroweb/webemdb/views.py", line 1129, in getConnection
    logger.debug('emdb connection: %s server %s' % (conn._sessionUuid, blitz.host))

AttributeError: 'NoneType' object has no attribute '_sessionUuid'


I have not figured out which file extensions are allowed for attachment upload (.txt, .text, .log, .dat and no extension are all disallowed), but a section of the Blitz log is available at: http://pastebin.com/Vyw6MRU7

Apart from OMEROweb.log and OMEROweb_request.log, none of the other log files show anything interesting.

One possibility is that I deleted something that I should not have deleted from the postgres db. I removed all rows from event, eventLog, and session that did not have a foreign key constraint.

On one of the failures, I got an out-of-memory error from java. I should also point out that java is only available through an NFS mount.

Kind Regards,
Ingvar

Re: OMERO web failure

Posted: Thu Nov 15, 2012 1:52 pm
by jmoore
I would most suspect that some type of error condition is leading to your java executable not being accessible (temporarily/periodically) over NFS. The missing version especially points to a major problem:

Code:
Commands:   java -version                  error:'NoneType' object has no attribute 'group'


Is there any chance of getting a local copy of java? We strongly discourage use of NFS in almost all cases.

Cheers,
~Josh

Re: OMERO web failure

Posted: Thu Nov 15, 2012 2:35 pm
by ingvar
Hi Josh,

Getting java installed locally will likely involve protracted negotiations, but I will try.

Some of the bits of information I have are confusing.

This problem only started within days of clearing out the postgres databases, though that may be a coincidence. It may also have something to do with the server now being accessed only every few minutes, rather than every two seconds.

Directly after running the diagnostics, 'java -version' gives the expected response, and running the diagnostics again gives the NoneType error.
While an NFS access problem may be intermittent, the above observation looks odd, unless either accessing java through a python call is somehow slower than accessing it from a bash shell, or the response is cached somewhere.
In one instance this morning, I had a single error, but the next access worked fine. This was different from the other instances, where once the first error happened, any further access attempts failed, at least for several hours.
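
The message "'NoneType' object has no attribute 'group'" is what Python raises when `.group` is called on the `None` that `re.search` or `re.match` returns for no match. Whether the diagnostics command parses the version output this way is an assumption. Under that assumption, a defensive version check could be sketched as follows; `parse_java_version` and `java_version` are hypothetical helpers, not OMERO functions.

```python
import re
import subprocess

# Matches e.g.: java version "1.6.0_24"
VERSION_PATTERN = re.compile(r'version "([^"]+)"')


def parse_java_version(output):
    """Return the quoted version string, or None if the output has none."""
    match = VERSION_PATTERN.search(output)
    if match is None:
        # Empty or unexpected output: report "unknown" instead of failing.
        return None
    return match.group(1)


def java_version(command=("java", "-version")):
    """Run the command and parse its output; None if it cannot be started."""
    try:
        proc = subprocess.Popen(list(command), stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
    except OSError:
        # The executable is missing or the process could not be created.
        return None
    output = proc.communicate()[0].decode("utf-8", "replace")
    return parse_java_version(output)
```

A check like this returns `None` both when the executable cannot be started and when it prints nothing usable, so an intermittent failure shows up as a missing version rather than as an exception.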

I will report on any progress on getting java installed locally.

Cheers,
Ingvar

Re: OMERO web failure

Posted: Sun Nov 18, 2012 11:33 am
by ingvar
Hi Josh,

I installed java locally on both my servers. About 30 hours later they both showed the same problem again: the web client just stops working, so something else is going on here.

One of the servers now shows another problem. It ran out of file handles:
cat /proc/sys/fs/file-nr
3744 0 384858

ves-ebi-81:/var/omero/dist pdb_em$ ulimit -n
1024
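
The two numbers above measure different things: /proc/sys/fs/file-nr is system-wide (allocated handles, unused allocated handles, system maximum), while `ulimit -n` is the limit for each process. A small Python sketch for reading both follows; `parse_file_nr` and `system_usage_percent` are illustrative helper names, and the sample values are the ones quoted above.

```python
import resource
from collections import namedtuple

FileNr = namedtuple("FileNr", "allocated unused maximum")


def parse_file_nr(text):
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return FileNr(allocated, unused, maximum)


def system_usage_percent(file_nr):
    """Share of the system-wide maximum that is currently allocated."""
    return 100.0 * file_nr.allocated / file_nr.maximum


if __name__ == "__main__":
    sample = parse_file_nr("3744 0 384858")  # values from the post above
    print("system-wide usage: %.2f%%" % system_usage_percent(sample))
    # Soft and hard per-process limits, the same value `ulimit -n` reports.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("per-process open files: soft=%s hard=%s" % (soft, hard))
```

For the sample values the system-wide usage is about 1% of the maximum, so the per-process limit of 1024 is the tighter of the two.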

I had to kill some processes by hand, and then log out and in again, before I could shut down and restart OMERO.

Looking at the various processes, it appears that I use about 500 file descriptors as soon as I have started OMERO + web server.
What is a reasonable file handle limit for the user that runs the OMERO services?

Cheers,
Ingvar

Re: OMERO web failure

Posted: Mon Nov 19, 2012 9:38 am
by cxallan
Is this RHEL or CentOS 6, Ingvar?

Re: OMERO web failure

Posted: Mon Nov 19, 2012 10:12 am
by ingvar
RHEL6.2

Re: OMERO web failure

Posted: Mon Nov 19, 2012 12:03 pm
by ingvar
On the file handle issue, I wrote a simple bash script:
Code:
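#!/bin/bash
# Print the number of open file descriptors for each process owned by
# the OMERO user, grouped by process name.
user='pdb_em'
groups='java glacier ice python'
echo "cannot check httpd processes"
for g in ${groups}; do
  echo "${g}"
  # pgrep matches the pattern against the process names of this user
  for pid in $(pgrep -u "${user}" "${g}"); do
    # Each entry in /proc/<pid>/fd is one open file descriptor
    echo "${pid} $(ls "/proc/${pid}/fd" | wc -l)"
  done
done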

which gives output like:
Code:
ves-ebi-81:/var/omero/dist pdb_em$ ./checkFH.bash
cannot check httpd processes
java
3702 129
3730 122
3735 123
glacier
3729 17
ice
3670 48
3719 28
python
3714 17
3716 15
3724 19
3743 18
3954 16
3955 185
3956 186
3957 175
3958 142
3959 125

Some of the python processes can reach 500 file handles; the others do not vary much. The number of file handles does not appear to drift, even after 1000 requests using ApacheBench. The total number of file handles as reported by
Code:
cat /proc/sys/fs/file-nr
temporarily jumped from 3000 to 4000 but started to come down again after a few minutes.
It seems that I stay below the limit of 1024 file handles per process, but maybe there is a per-user limit that I exceed.

/Ingvar

Re: OMERO web failure

Posted: Wed Nov 21, 2012 9:16 am
by cxallan
RHEL / CentOS 6 has very aggressive default limits on file handles and per-user process counts, one or two orders of magnitude lower than RHEL / CentOS 5. Under high load, 1024 file handles per process will simply not be sufficient, and neither will the default maximum process limit of 1024 (which threads also count against). I would suggest increasing both to at least 4096. You can examine these with ulimit. The following settings are from one of our production systems running on CentOS 6:

Code:
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256498
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 8192
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
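
Because threads count against the max user processes limit, it can help to compare the current thread count of the OMERO user with that limit. On Linux, thread creation can fail with "Resource temporarily unavailable" (EAGAIN) when this limit is reached, which would be consistent with the Thread.cpp exception in the logs earlier in this thread, although the logs alone do not prove it. The Python sketch below is an approximation under stated assumptions: `count_user_threads` and `headroom` are hypothetical helpers, and ownership of /proc/<pid> is used as a stand-in for the user that the kernel charges the threads to.

```python
import os
import resource


def count_user_threads(uid, proc_root="/proc"):
    """Count threads of processes owned by uid via /proc/<pid>/task."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return 0  # no readable /proc on this system
    total = 0
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        pid_dir = os.path.join(proc_root, entry)
        try:
            if os.stat(pid_dir).st_uid != uid:
                continue
            total += len(os.listdir(os.path.join(pid_dir, "task")))
        except OSError:
            continue  # process exited or is not readable
    return total


def headroom(used, soft_limit):
    """Remaining slots under the limit, or None if it is unlimited."""
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit - used


if __name__ == "__main__":
    # Same value as `ulimit -u` for the current user.
    soft_nproc, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    threads = count_user_threads(os.getuid())
    print("threads owned by this user: %d" % threads)
    print("max user processes (soft): %s" % soft_nproc)
    print("headroom: %s" % headroom(threads, soft_nproc))
```

Run as the user that owns the OMERO services, this shows how close the thread count is to the limit at the moment a failure occurs.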


https://www.openmicroscopy.org/site/sup ... open-files

We'll expand it to include a description about process limits.

Re: OMERO web failure

Posted: Wed Nov 21, 2012 7:43 pm
by ingvar
Thanks Chris,

I have raised the file handle limit to 4096, which seems to be enough for me, at least for now. I check the file handle count once an hour, and while the total drifts higher from the start, I think it levels out after a while.

I will ask our systems group to raise the max user processes too. Another value that was significantly different is the pending signals: I have 30499, you have 256498. Do I need to raise that too?

I still have the original problem, that web pages start to return 500 errors, though the NoneType error for java no longer shows up. Moving java locally cured that aspect of the problem.

Cheers,
Ingvar

Re: OMERO web failure

Posted: Thu Nov 22, 2012 8:31 am
by cxallan
No problem Ingvar.

ingvar wrote: Thanks Chris,

I have raised the file handle limit to 4096, which seems to be enough for me, at least for now. I check the file handle count once an hour, and while the total drifts higher from the start, I think it levels out after a while.

I will ask our systems group to raise the max user processes too. Another value that was significantly different is the pending signals: I have 30499, you have 256498. Do I need to raise that too?


Understood. To my knowledge we've not modified the pending signal count at all. I would think your 30499 to be sufficient.

ingvar wrote: I still have the original problem, that web pages start to return 500 errors, though the NoneType error for java no longer shows up. Moving java locally cured that aspect of the problem.

Cheers,
Ingvar


Great. The 500 errors are coming from OMERO.web errors then? Timeouts in the Apache log?