
import speed

PostPosted: Wed Feb 02, 2011 5:42 pm
by helmerj
Hi!

Still attempting to import a large number of files, we realized that importing about 100,000 files would take almost two days. In order to speed things up, we had to change our strategy: we implemented a job queuing system that starts a specified number of parallel import threads. I calculated the number of threads from CPU and memory usage, but the results I got were not as expected:

1. CPU usage of the importing client is fairly low.
2. Memory usage of the importing clients is rather low as well.
3. CPU usage on the OMERO server skyrocketed as more imports ran in parallel.

The high CPU usage on the OMERO server was due to several PostgreSQL processes each using about 20 to 30% of the server's CPU power, which soon became the limiting factor. I was able to gain a bit more speed by separating PostgreSQL from the OMERO server, moving the database to its own machine. But even then, CPU usage was close to 100% when running 4 imports in parallel.

Why is the PostgreSQL CPU usage so high? I am only importing 1400 files in this particular session. High CPU usage in PostgreSQL usually points to missing indices, resulting in frequent and time-consuming queries. With all the relations between users, projects, datasets, images, annotations, experiments, tags, ... are all the indices in the right place? Is there any other reason why PostgreSQL's CPU usage is skyrocketing?
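
One quick check I can run here (a standard statistics query, nothing OMERO-specific; "omero" is just my guess at the database name) is to ask PostgreSQL which tables are being sequentially scanned instead of hitting an index:

Code: Select all
# list the tables with the most sequential scans; a large seq_scan count
# with few idx_scan hits on a big table hints at a missing index.
# "omero" is an assumed database name - substitute your own.
psql -d omero -c "SELECT relname, seq_scan, idx_scan
                  FROM pg_stat_user_tables
                  ORDER BY seq_scan DESC
                  LIMIT 20;"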

Any advice on how to improve the PostgreSQL behavior would be greatly appreciated, as it would very much improve the speed of importing images into OMERO.

Cheers Juergen

Re: import speed

PostPosted: Wed Feb 02, 2011 11:02 pm
by helmerj
We have done some tests, logging query times inside PostgreSQL, and have found that PostgreSQL is spending a lot of time in this function:

Code: Select all

2011-02-02 16:31:54 CETSTATEMENT:  select ome_nextval('seq_session'::text)
2011-02-02 16:31:54 CETLOG:  duration: 7.606 ms  plan:
        Result  (cost=0.00..0.26 rows=1 width=0) (actual time=7.592..7.593 rows=1 loops=1)
          Output: ome_nextval('seq_session'::character varying)


This query is slower than all the others by a factor of 100 to 1000. Looking at the source code, could this possibly be a locking problem?
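
If it is locking, something like this standard catalog query should show ungranted lock requests piling up while the parallel imports are running ("omero" again being the assumed database name):

Code: Select all
# show lock requests that are currently waiting; run this while the
# parallel imports are active. "omero" is an assumed database name.
psql -d omero -c "SELECT locktype, relation::regclass, mode, pid
                  FROM pg_locks
                  WHERE NOT granted;"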

I have also tested the latest trunk (4.3-dev), which I have built from svn and in which this function is missing. This had raised the hope that DB performance would be better, but the import time is the same. Running 8 import sessions in parallel, the DB server is running at 100% CPU usage.

If anyone is interested, I can upload the PostgreSQL query log.

Cheers Juergen

Re: import speed

PostPosted: Thu Feb 03, 2011 7:34 am
by jmoore
Hi Juergen,

Foremost, the fact that so much PostgreSQL time is spent on ome_nextval('session') is interesting and disturbing. To make sure I understand what's going on: how are you running your experiments? Sessions are very heavyweight and are intended to be kept alive for hours or even days.

If the problems persist, there are several things we can do to try to isolate issues. One would involve turning off the Indexer:
Code: Select all
bin/omero admin ice server disable Indexer-0 && bin/omero admin ice server stop Indexer-0

It's responsible for adding everything you import to the Lucene index, but with the number of imports you are doing, that may be adding too much contention on some of the DB rows.

Another would be to get your PostgreSQL logging set up so that we know exactly which queries are taking so long. Which PG version are you currently using? Have you set pg_stats_command=true (or similar for your version)? Would setting up statement logging http://www.postgresql.org/docs/9.0/stat ... gging.html be possible?
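
For reference, the minimal statement-logging change is a single postgresql.conf setting plus a reload; the paths below are only examples, yours will differ:

Code: Select all
# log every statement together with its duration (0 = no threshold);
# config and data paths are assumptions - adjust to your installation.
echo "log_min_duration_statement = 0" >> /etc/postgresql/9.0/main/postgresql.conf
pg_ctl reload -D /var/lib/postgresql/9.0/main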

Another would be to add indexes as you mentioned. Ticket 2568 lists several of the indexes that we are planning to add in the next DB upgrade, but if you are willing to help with unstable feature testing, we could add many more indexes to your DB and test the impact.
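
Trying a single index by hand is cheap; the table/column pair below is purely illustrative, the real candidates are the ones listed on the ticket:

Code: Select all
# purely illustrative - substitute a table/column pair from ticket 2568.
psql -d omero -c "CREATE INDEX annotation_ns_idx ON annotation (ns);"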

Regards,
~Josh

Re: import speed

PostPosted: Thu Feb 03, 2011 9:19 am
by helmerj
Hi Josh,

I would be more than happy to help with the speed improvements, as we are planning to use OMERO in a High Content Screening environment, and we are talking about hundreds of thousands of images every day. So every millisecond we can save has a huge impact.

I will try to turn the indexer off during the import and see what impact that actually has. I will post here and present the results using the same hardware, so I can compare them to the benchmarks I did yesterday.

I already had some extra logging in my postgresql setup:

Code: Select all
shared_preload_libraries = 'auto_explain'   # (change requires restart)
custom_variable_classes = 'auto_explain'
auto_explain.log_min_duration = '3ms'
auto_explain.log_analyze = true
auto_explain.log_verbose = true
auto_explain.log_nested_statements = true


I have log files using the 3ms threshold and also a set which logged everything (0ms). I am happy to share those log files, which are checked into our project's repository. Of course I am happy to take advice on how to produce more and better log files so they would be of more use to you!
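
In case it helps with triage, this is roughly how I skim the logs for the slowest entries (the log file name is whatever your installation writes to):

Code: Select all
# rank the logged durations, slowest first; adjust the log path.
grep -o 'duration: [0-9.]* ms' postgresql.log | sort -k2 -rn | head -20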

Right now I am in a test environment and I have complete control over the versions and execution of all components of the import system. I started out using OMERO 4.2.2 and also compiled and installed trunk (4.3-dev). The log files I have were created while using the 4.2.2 version (3ms and 0ms). From my test import using 4.3-dev I should have a 3ms log version (a full import of 1404 files for that one). But as I said, tell me what you need and I will generate it.

Thanks for your help on this. I think that working on this together we can improve OMERO greatly and make it better for everyone. Without speed improvements, OMERO in its current state is unfortunately not ready for larger numbers of files and therefore not ready for HCS data streams. We are willing to do our best to change that!

Cheers Juergen

Re: import speed

PostPosted: Thu Feb 03, 2011 9:52 am
by helmerj
I forgot to explain the way we import in a bit more detail:

  • Images are transferred from a Windows machine, where they are acquired, to a storage machine (Unix).
  • Images are grid-fitted and reboxed.
  • A Ruby script loops over all images, extracting metadata from the MetaXpress files, getting metadata from the MetaXpress database, and generating an OME-XML block for every image, which is written to the image.
  • When the script has reached a configurable number of images (the default batch size right now is 100), the array of image file names is submitted to a worker script in a job queuing system (see the sketch after this list).
  • The job queuing system can be started with a configurable number of workers, each of which starts an import session by calling the importer-cli script with the list of file names, initiating the import to the central OMERO server in parallel.
  • CPU usage on the client is low as far as the OMERO Java threads are concerned.
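
Schematically, the batching step amounts to something like this in shell terms (queue_import_job is a hypothetical stand-in for our actual enqueue call, and the paths are made up):

Code: Select all
# split the full file list into batches of 100 and enqueue one job per
# batch; 'queue_import_job' and the data path are placeholders.
find /data/images -name '*.tif' -print0 | \
    xargs -0 -n 100 queue_import_job --dataset "$DATASET_ID"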

Yesterday I tested 4, 6, 8, and 12 parallel import sessions. When using 4.2.2, 4 workers result in 5 postgres processes on the DB server, which is then maxed out at 100% CPU usage. When using 4.3-dev, the DB server is maxed out at 100% CPU usage with 8 workers, resulting in 9 postgres processes. The speed of import is unfortunately the same for both versions with the same number of workers:

Code: Select all
OMERO server 4.2.2 (Xms=1024M) using a number of workers (each Xms=512M) on a machine with two cores and 8GB of RAM. Postgres server on a dedicated machine (3GB of RAM)

# workers     time [min]
    4            21.50
    6            18.80
    8            16.80
   12            15.95



Cheers Juergen

Re: import speed

PostPosted: Thu Feb 03, 2011 9:57 am
by jmoore
Juergen,

Thanks for the offer of help! I'll certainly take you up on it. But first I'd like to figure out the session issue before moving forward. Could you share the launcher from your queue? My guess is that it would be worthwhile for you to reuse a session between the worker threads. Perhaps looking at the SQL logs would also help clear it up.

Best regards,
~Josh.

Re: import speed

PostPosted: Thu Feb 03, 2011 10:20 am
by helmerj
Hi Josh,

Right now we are not reusing a session. We simply start a new import call like this in our worker script:

Code: Select all
@working_dir = '/home/helmerj/Downloads/OMERO.clients-Beta4.2.2.linux/'

def self.process_import(data)
  dataset_id = data["dataset_id"]
  files      = data["files"]

  # build the importer-cli command line, appending one argument per file
  importer_cmd_line = ["./importer-cli", "-s", "localhost",
                       "-u", "importer", "-w", "Secret_password",
                       "-d", dataset_id.to_s]
  files.each do |file|
    importer_cmd_line << file
  end
  command = Escape.shell_command(importer_cmd_line)
  system(command)   # execute the import (call added here for completeness)
end



So if I understand you right, we should call bin/omero sessions login once to get a session and then reuse it for our parallel import calls using:

Code: Select all
importer-cli -k "session-key"
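
i.e. roughly like this in the worker (a sketch only; SESSION_KEY stands for the key printed by the login, and the flag spellings are from memory):

Code: Select all
# create one session up front and keep it alive...
bin/omero sessions login -s localhost -u importer
# ...then have every worker attach to it instead of logging in itself;
# SESSION_KEY and DATASET_ID are placeholders.
./importer-cli -s localhost -k "$SESSION_KEY" -d "$DATASET_ID" image_001.ome.tif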


I will try that and report back if it speeds things up.

Cheers Juergen

Re: import speed

PostPosted: Thu Feb 03, 2011 10:28 am
by helmerj
Interestingly, the speed of import is, within a certain range, not related to the size of the imported files. My original data is 2MB per file, so the import of 1404 files of 2MB each takes 17 minutes, using OMERO 4.2.2 and 8 import threads in parallel. Using OMERO 4.3-dev, 8 workers, and 1404 copies of a 40k sample file (single-channel-2008.ome.tif) takes 18 minutes to import. So most of the time during import is spent on getting the information into the database, not on file conversion, which would be slower with increasing file size.

Cheers Juergen

Note: File indexing was still on.

Re: import speed

PostPosted: Thu Feb 03, 2011 11:52 am
by helmerj
Session reuse:

I have changed the importer-cli call to reuse a session that I had created beforehand and that I keep alive every 5 minutes.

This works fine but is no faster than having the importer-cli script log in individually each time... as a matter of fact, it is 2 minutes slower using 4.3-dev and importing 1404 image files...

Will now try to turn off index creation...

Cheers Juergen

Re: import speed

PostPosted: Thu Feb 03, 2011 12:43 pm
by helmerj
jmoore wrote:
Code: Select all
bin/omero admin ice server disable Indexer-0 && bin/omero admin ice server stop Indexer-0


Hi Josh, when I execute this command, I get the following error:

Code: Select all
# bin/omero admin ice server stop Indexer-0
error: the server didn't stop successfully:
The server is already inactive.


Does that mean that the indexing had been off all along?
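
A way to double-check would be to ask IceGrid directly for the server's state ('server state' is a standard icegridadmin command, relayed here through the omero CLI):

Code: Select all
# query the Indexer's current state via icegridadmin
bin/omero admin ice server state Indexer-0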

The import after issuing these commands is as slow as before and uses the same amount of CPU power on the postgres server...

Cheers Juergen