bentomas.com

Trying to make Directory Watcher faster

For a new project I started recently, I have been using a Ruby gem called Directory Watcher, which does just what it says: it watches directories! Specifically, it watches for changes to files. Even more specifically, it watches for files being added, modified and deleted.

It is pretty simple. From the docs:

require 'directory_watcher'

dw = DirectoryWatcher.new '.'
dw.add_observer {|*args| args.each {|event| puts event}}

dw.start
gets      # when the user hits "enter" the script will terminate
dw.stop

Not only is it simple but it works well and it works quickly. I have absolutely no complaints.

One thing I found interesting about how it works, though, is that a registered observer gets passed an Array of all the changes, as opposed to each change one at a time. I asked the developer, Tim Pease, why that is, and he said:

No particular reason for the choice…Iterating over the events array and passing them one at a time might prove to be faster.

Please do investigate and let me know what you find out :)

And so I decided to investigate!

I tried two variations of the original code. The bulk of the original is in its run method:

def run
	until @stop
		start = Time.now.to_f

		files = scan_files
		keys = [files.keys, @files.keys]  # current files, previous files

		find_added(files, *keys)
		find_modified(files, *keys)
		find_removed(*keys)

		notify_observers
		@files = files    # store the current file list for the next iteration

		nap_time = @interval - (Time.now.to_f - start)
		sleep nap_time if nap_time > 0
	end
end

As you can see, it calls a method named scan_files, which returns a Hash of all the files found. It then compares this hash to the hash from the previous iteration to find the differences. In my first variation I modified the scan_files method to call notify_observers itself with each event as it is found, instead of collecting the events in an array first. I thought this would be faster because there would be less overhead.
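To make the comparison step concrete, here is a minimal sketch of diffing two snapshots. It is not the gem's actual code: the `diff_snapshots` helper, the `[type, path]` event shape and the idea of storing a modification time per path are all assumptions made for illustration.

```ruby
# Hypothetical snapshots mapping path => modification time.
# (Assumption: the real gem may store different per-file data.)
previous = { 'a.txt' => 100, 'b.txt' => 100, 'c.txt' => 100 }
current  = { 'a.txt' => 100, 'b.txt' => 150, 'd.txt' => 120 }

# Compare two snapshots and return a list of [type, path] events.
def diff_snapshots(previous, current)
  events = []
  # Paths only in the current snapshot were added.
  (current.keys - previous.keys).each { |path| events << [:added, path] }
  # Paths in both snapshots were modified if their stored value changed.
  (current.keys & previous.keys).each do |path|
    events << [:modified, path] if current[path] != previous[path]
  end
  # Paths only in the previous snapshot were removed.
  (previous.keys - current.keys).each { |path| events << [:removed, path] }
  events
end

events = diff_snapshots(previous, current)
events.each { |type, path| puts "#{type} #{path}" }
```

With these made-up snapshots, d.txt is reported as added, b.txt as modified and c.txt as removed.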

In my second variation, I did exactly as Tim suggested and just had it iterate through the array of events, calling notify_observers once for each event.
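The difference between batched and per-event notification can be sketched without the gem at all. In the snippet below the observers are plain lambdas and the event list is invented, so treat it as an illustration of the two calling styles rather than Directory Watcher's internals.

```ruby
# Made-up events standing in for one scan's worth of changes.
events = [[:added, 'a.txt'], [:modified, 'b.txt'], [:removed, 'c.txt']]

# Batched style (the original behaviour): one call per scan,
# with every event passed at once.
batch_calls = []
batch_observer = ->(*args) { batch_calls << args }
batch_observer.call(*events)

# Per-event style (the second variation): one call per event.
single_calls = []
single_observer = ->(event) { single_calls << event }
events.each { |event| single_observer.call(event) }

puts "batched:   #{batch_calls.size} call(s)"
puts "per-event: #{single_calls.size} call(s)"
```

The batched observer is invoked once and the per-event observer once per change, which is where any extra call overhead in the variations would come from.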

To test, I ran each of the three variations 50 times on directories of increasing size. Here is my test script.
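For readers who want to run a similar measurement, a repeated timing loop might look like the sketch below. It is not the linked test script: the `time_runs` helper is hypothetical, and the workload is a stand-in for a single directory scan.

```ruby
# Run the given block `runs` times and return [mean, standard deviation]
# of the elapsed times in seconds.
def time_runs(runs)
  samples = Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  mean = samples.sum / runs
  variance = samples.sum { |s| (s - mean)**2 } / runs
  [mean, Math.sqrt(variance)]
end

# Stand-in workload (assumption): replace with one scan of a real directory.
mean, stddev = time_runs(50) { (1..10_000).reduce(:+) }
puts format('mean %.6fs, stddev %.6fs', mean, stddev)
```

A monotonic clock is used so that system clock adjustments during the run do not skew the samples.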

My results:

Files      Original   First   Second
4,887      0.22       0.26    0.23
42,304     1.48       1.91    1.7
249,467    28.7       34.1    29.9

(All times in seconds. I calculated Standard Deviations but they didn’t vary much from variation to variation so I’m not bothering to post them.)

And a graph:

[Graph of the data from the table above]

As you can see, the original was the fastest, but it beat the second variation by only 1.2 seconds when scanning over 200,000 files. When running my tests I did nothing with the events that notify_observers passed to the observer, so in the case of the original nothing iterated through the events array, which might account for the discrepancy.
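To put those gaps in relative terms, the slowdown of each variation can be worked out from the table above. The `slowdown` helper is just a name chosen for this sketch; the times are copied from the table.

```ruby
# Times in seconds, copied from the results table, keyed by file count.
results = {
  4_887   => { original: 0.22, first: 0.26, second: 0.23 },
  42_304  => { original: 1.48, first: 1.91, second: 1.7 },
  249_467 => { original: 28.7, first: 34.1, second: 29.9 }
}

# Percentage slowdown of a variation relative to the original,
# rounded to one decimal place.
def slowdown(original, variation)
  ((variation - original) / original * 100).round(1)
end

results.each do |files, t|
  puts "#{files} files: first +#{slowdown(t[:original], t[:first])}%, " \
       "second +#{slowdown(t[:original], t[:second])}%"
end
```

At the largest directory size this works out to roughly 19% slower for the first variation and roughly 4% slower for the second.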

Either way, Directory Watcher's original implementation of storing the list of ‘events’ (changed files) in an array was definitely faster than notifying for each event as it is found (my first variation).

I prefer getting my events one at a time, so I’ll be going with my second variation for the code I use. And thanks, Tim, for writing this wonderful gem. It is making my life quite a bit easier!