jwarren wrote: Unfortunately we go into the same directories often to upload data, as the directory the file is in is the only way to associate the correct omero id with our external db image record. When we have new images then we have to scan the directories to see if there are new ones in them.
Ah, ok. Sorry for not having realized that. What you're doing is sounding a bit more like dropbox now, but of course only partially, which explains how/why your script has developed as it has. Speaking of which, is that visible on GitHub?
Using the CLI with directory loading (rather than per file) and the no_thumbnails argument was 6 times faster than importing file by file with thumbnails.
Just for comparison, how many files (how large each) and how many seconds is that?
Using the CLI using the directory as an argument still falls over after a few thousand images (complained of lost connection) and then I have to restart the job.
Hmmmm.....that's surprising unless your calling Python script is timing out the session that it created. Looking at that again would be helpful.
This means we need to go through the same directories in case the directory contents were only half loaded... resulting in duplicates or triplicates in omero (depending on the number of times we've already run the importer).
How difficult would it be to put a filename check in the importer? Should I try it? Any pointers?
This should be pretty straightforward. Perhaps we need to work out some more of the details first though. (For example, would checking on the SHA-1 suffice? Filenames may not be unique for everyone.)
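To make the SHA-1 idea a bit more concrete, here is a rough client-side sketch in Python. It is not how the importer itself works: `sha1_of_file`, `files_to_import` and `known_hashes` are hypothetical names, and it assumes you keep your own record of the checksums of files that have already been imported (for example in your external db). It only uses the standard library and makes no OMERO calls.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large images are never held fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_import(directory, known_hashes):
    """Yield (path, sha1) for files whose content has not been seen before.

    known_hashes is an assumption: a collection of SHA-1 hex digests for
    files that were already imported, kept by the calling script.
    """
    seen = set(known_hashes)
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        checksum = sha1_of_file(path)
        if checksum in seen:
            # Same content already imported (or already listed in this run).
            continue
        seen.add(checksum)
        yield path, checksum
```

A calling script could then pass only the yielded paths to the import and store each checksum once the import succeeds, so that a restarted job skips what was already loaded. Whether a content hash alone is the right key is exactly the detail that still needs working out.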
Writing this importer in Java and using the Java client code itself may be faster for me to write and use?
That's certainly up to you. But, for example, if you'd like to have a go at the duplicate detection, I can point you to the right location (roughly
https://github.com/openmicroscopy/openmicroscopy/blob/v.5.0.8/components/blitz/src/ome/formats/importer/cli/CommandLineImporter.java#L332).
Cheers,
~Josh