Tuesday, September 9, 2008

Implementing a Breadth-First Search with LINQ

I was working on some code today that will manages the order of execution of some ETL jobs based on some known dependency information.  I've gathered the dependency information into a directed graph that I created using the QuickGraph library.  In order to prevent jobs from executing before their dependencies, I need to process the nodes (jobs) of the graph in a breadth-first manner.  For my unit test fixtures I have created the following directed graph:image

When I tried using the built-in breadth first search implementation, it visited the nodes in the following order 1, 3, 2, 5, and 4.  The problem is that it assumes a single root and does not look at the fact that 5 depends on 4, which hasn't run yet.  As I was digging around a bit with the .NET Reflector as to whether I should try implementing my own breadth-first search algorithm, I began thinking that there would probably be an easier solution possible with LINQ.  The highest number of vertices that I'm going to be dealing with at once is about 100, so performance really isn't much of a concern.

Note: I should mention that the vertices of the graph represent my jobs, which I originally was referring to as "Project Items".  In the code and comments below, these are all synonymous.  Eric Evans would disappointed in me for my lack of a ubiquitous language.

I started by initializing a Dictionary to track whether or not a particular job has been executed yet (starting with "false"):

// Initialize dictionary to track execution of all vertices
Dictionary<ProjectItem, bool> executedByProjectItem = graph.Vertices.ToDictionary(v => v, e => false);


Then I loop until they've all been executed:



// Keep looping while some jobs have not been executed
while (executedByProjectItem.Values.Any(e => !e))
{
	...
}


Inside the loop, the first thing I want to do is get a list of all the vertices (ProjectItems) that can currently be executed.  To get this list:




  1. I start with the graph's vertices and ...


  2. I perform a join to the edges...


  3.      ...where the vertex name matches the edge's target vertex's name. 



    The GroupJoin method provides a collection of edges for each vertex in my graph. 


  4. From that join, I create a new anonymous type that contains a reference to the vertex,


  5.      ...and then a count of the related edges for which the source job has yet to execute. 


  6. With that calculation in hand, I filter the list down to just those vertices that have no unexecuted dependencies


  7.      ...and have not already been executed themselves (root vertices would otherwise continue to be executed since they will always have an UnexecutedDependencyCount of  0). 


  8. In the final step, I use the Select projection method to create a return set of IEnumerable<ProjectItem> instead of the anonymous type, since that's what I'm really interested in.



image



The final step of the loop is to just iterate through all the jobs that can be executed, and uh... execute them.



When I ran this code, the output looked like this:



Executing job: projectItem1
Executing job: projectItem3
Executing job: projectItem4
=== Pass complete ====
Executing job: projectItem2
Executing job: projectItem5
=== Pass complete ====


There is a subtle issue at work here.  Based on my query, I should have only seen 1 and 4 in the first pass.  Job 3 should have been executed on the second pass, once Job 1 had been executed.  However, due to the nature of deferred execution of some operations in LINQ, once Job 1 has been marked as executed, Job 3 just showed up to the party late.  It's obviously not wrong, it was just a little unexpected the first time I saw it.  To get a static list on the first pass, I could add a call to the ToList() at #8 in the code listing above, which would have iterated through the collection immediately.  With the ToList call in place, the code produces the following output:



Executing job: projectItem1
Executing job: projectItem4
=== Pass complete ====
Executing job: projectItem3
=== Pass complete ====
Executing job: projectItem2
=== Pass complete ====
Executing job: projectItem5
=== Pass complete ====


It's interesting how the deferred execution actually cuts down on the number of times I have to iterate through the jobs to execute them all (from 4 to 2), and because the results will still be correct, I've gone without the ToList in my final copy of the code.



At the end of the day, I am happy with how that worked out.  In about 30 lines of code with LINQ, I was able to implement a reliable Breadth First iteration of my directed graph with potentially multiple root nodes.