Month: November 2015

Spliterator for “in-between” lambda processing

The usecase of summing up the distances in a list of vectors can benefit from parallelization on todays multi-core systems with Java 8 lambdas using a Spliterator.

If one writes

vecList.stream().mapToDouble( e-> 
    e.getDistance( b )

Then a reference to the next point “b” isn’t there hence the calculation cannot be performed that way. There is a property of each list item, it has a relation to the previous and a relation to the next item in the list.

Expressed as a piece of java code it might look as simple as this

final class Relation<T> {
   private final T a;
   private final T b;
   public Relation(T t0, T t1) {
      this.a = t0;
      this.b = t1;
   }
   public T getA() {
      return a;
   }
   public T getB() {
      return b;
   }
}

If we iterate on the relation organizing the list items into pairs, a distance calculation is possible by using a lambda expression and the stream api.

double totalDistance = stream.mapToDouble(
   e -> {
      return e.getA().distance(e.getB());
   }).sum();

To get there, a Spliterator class has to be defined that provides items for processing when the lambda expression is evaluated and is also capable of dividing the remaining set of items into subsets (sub-lists in this case) for parallel processing. Here is a possible implementation:

/**
 * A provider for pairs of consecutive list items
 *
 * @param <T>
 */
public class PairIterator<T> implements Spliterator<Relation<T>> {

 private final List<T> list;

 private int index;
 private final int start;
 private int end;
 
 /**
  * Allows to iterate over elements and their successors
  */
 public PairIterator(List<T> list) {
    this(list, 0, list.size());
 }

 /**
  * Internally used when the iterator is being split
  */
 private PairIterator(List<T> list, int start, int end) {
    this.list = list;
    this.start = index = start;
    this.end = end;
 }
 /**
  * Creates a new Relation between two consecutive items and 
  * performs the {@link Consumer#accept(entry)} operation. 
  */ 
 @Override
 public boolean tryAdvance(Consumer<? super Relation<T>> action) {
    Relation<T> entry;
    if (index >= end || index >= list.size()-1) {
       return false;
    }
    entry = new Relation<T>(list.get(index), list.get(++index));
    action.accept(entry);
    return true;
 }

 /**
  * If the list still has more that 4 elements to be processed
  * the method splits the work items into two working sets of ordered elements
  */ 
 @Override
 public PairIterator<T> trySplit() {
    int size = end - index;
    if (size > 4) {
       //reuse the middle element for both sides, so pair(mid-1,mid) and pair(mid,mid+1) is created 
       int mid = start + (size >> 1);
       int prevEnd = end;
       end = mid;
       return new PairIterator<T>(list, mid, prevEnd);
    }
    return null;
 }
 /**
  * @return the size of the remaining workload
  */
 @Override
 public long estimateSize() {
    return (end - index);
 }
 /**
  * Configuration of this spliterator
  */
 @Override
 public int characteristics() {
    return Spliterator.CONCURRENT | Spliterator.SIZED | Spliterator.ORDERED;
 }
}

There are two methods in the Spliterator implementation that are used to accomplish streaming the relations of predeccessor and successor, tryAdvance and trySplit. The first actually performs the action on the item pair by instantiating a Relation  on-the-fly, and calling accept(...) where the real calculation happens. The latter contains the algorithm to distribute the workload to the consumers. If the size of the remaining elements is greater than 4 it changes the indeces pointing into the list and returns a new instance configured for the remaining indeces. The method is designed to have a list element being used in two relations, once as a sucessor and once as a predecessor, so care must be taken not to modify its state by executing code with side-effects.

To use it a Stream is constructed with the StreamSupport class by passing the Spliterator as an argument and true to ensure it is processed in parallel:

//creates a parallel stream
Stream<Relation<Vec2>> stream = StreamSupport.stream(new PairIterator<Vec2>(vecList), true);

double totalDistance = stream.mapToDouble(
  e -> {
     return e.getA().distance(e.getB());
  }).sum();

 

Generic interfaces and algorithms

A common pattern in object oriented programming is interfacing classes that have the same methods. If these classes encapsulate a datatype and expose  methods that take the datatype as a parameter, generic interfaces. The approach results in the ability to apply a generic algorithm on different encapsulated data types.

For example vector math classes have this property, they can be defined for double and float datatypes and also for 2- and 3-dimensional implementations with methods for doing common operations.

public final class Vec3 implements Vec<Vec3> {
  private double x;
  private double y;
  private double z;

  /**
   * returns the distance to another vector
   * @param other the other vector
   * @return distance
   */
   public final double distance(Vec3 other) {
      return Math.sqrt(
        (x-other.x)*(x-other.x)+
        (y-other.y)*(y-other.y)+
        (z-other.z)*(z-other.z));
   }
}

public final class Vec2 implements Vec<Vec2> {
  private double x;
  private double y;
 
  /**
   * returns the distance to another vector
   * @param other the other vector
   * @return distance
   */
   public final double distance(Vec2 other) {
      return Math.sqrt(
        (x-other.x)*(x-other.x)+
        (y-other.y)*(y-other.y));
   }
}

The generic interface that both classes implement can then be declared like this:

public interface Vec<V> {
   double distance(V other);
}

It can contain all functions that have the same ‘abstract’ method signature that the classes have in common, ie. the more complicated cross product that returns a vector again.

The real benefit of this is not immediately obvious, consider adding up the distances of a list of items – one would naively just implement the method for each vector type Vec2 and Vec3. This results in duplicate code, reduced maintainability and is errorprone.

A class holding the algorithms operating on the Vec<V> interface can be used instead, the method needs to make use of the disance(V) method to determine the overall distance.

public class Toolbox {
  /**
   * Determines the overall distance between the points in the list
   */
  public static <T extends Vec<T>> double getAllDistances(List<T> points) {
     T prevPoint = points.get(0);
     T point;
     double sum = 0.0;
     int len = points.size();
     for (int i=1;i<len; i++) {
        point = points.get(i);
        sum+=point.distance(prevPoint);
        prevPoint = point;
     }
     return sum;
  }
}

This method is type-safe and works with all types implementing the generic Vec interface due to the <T extends Vec<T>> declaration. That way algorithms on vectors can safely be defined once for several variations of the classes.

In contrast, a declaration like below seems more natural at first, but does not work:

 public static <T> double getAllDistances(List<Vec<T>> points) {
    Vec<T> prevPoint = points.get(0);
    Vec<T> point;
    double sum = 0.0;
    for (int i=1; i<points.size(); i++) {
       point = points.get(i);
       sum+=point.distance(prevPoint);
       prevPoint = point;
    }
    return sum;
 }

Here, calling the method point.distance(prevPoint) gives a compiler error since prevPoint is of type Vec<T> but the method signature is distance(V other) and expects it to be T, rather than Vec<T>.