Optimize CGAL 3D convex hull
CGAL has a very robust and performant 3D Quickhull implementation. Recently I wrote my own Quickhull implementation, and tried some optimization ideas that bring 5x - 10x speedup over the CGAL implementation when tested on point sets with size ranging from a few thousand to a few million.
Before diving into the details on the optimization I did, let's first take a quick look at how Quickhull works.
Overview of the Quickhull algorithm
Quickhull is a popular choice for 3D convex hull computation due to its efficiency. Given a point set, Quickhull proceeds as follows: first it finds an initial tetrahedron, with the positive side of each face pointing outwards, and the vertices of each face are counterclockwise ordered. Then the outside point set of each face will be populated with the remaining points that are on the positive side (i.e. outside) of that face. Faces with non-empty outside point set are stored in a pending face container.
In each iteration, Quickhull will draw one pending face from the container, find the farthest point in the outside point set of that pending face, and mark all faces that have this point on their positive side as visible. The edges separating visible and invisible faces are collected. These edges form a closed border ring. New faces will be created from these edges and the farthest point. Visible faces will be removed and their outside points will be redistributed to the newly added faces, which will then be put in the pending face container.
As a simple example, let's take a look at the image above. We have 5 points $A, B, C, D, E$. The initial tetrahedron is created from $A, B, C$ and $D$, while $E$ is the only outside point of face $\triangle CBD$. Obviously $E$ is invisible to all faces other than $\triangle CBD$. So in the first iteration (which is also the last iteration), $E$ is the farthest point, and the visible set consists of only $\triangle CBD$. The three edges that form the border edge ring are $CB$, $BD$ and $DC$. As new faces are created from these three edges and point $E$, Quickhull terminates and the convex hull is found.
CGAL Implementation and optimization opportunities
The Quickhull implementation in CGAL is very fast and works extremely well. But one thing caught my attention was that at most places std::list is used. Triangulation_data_structure_2 is the data structure to store the intermediate result with neighborhood information for each face. The vertices and faces are stored in CGAL::Compact_container. Because it's impossible to specify where to put an element in a Compact_container, a Compact_container is essentially a list.
One of the drawbacks of list is that it requires one heap allocation for each element. If done frequently, Heap allocation is going to be much more expensive than stack allocation due to additional bookkeeping. Moreover, generally speaking, list is a lot less cache friendly because its elements are scattered around the memory instead of contiguously stored. Every fetch instruction could potentially populate the cache line with irrelevant data. In a standard list implementation, each entry in the list needs two additional pointers, which is not exactly very efficient memory-wise (on a 64-bit OS, each pointer needs 8 bytes).
Of course for the Quickhull implementation, list is a reasonable choice given that any vertex or face could be removed when running Quickhull. But list provides more than what is needed: when removing an element from list, the order of the elements remained in the list is unchanged. When order doesn't matter, it is possible to use vector instead while keeping the operation of removing an arbitrary element from the vector at constant time complexity. The idea is, to remove $i$th element of a vector v with length n, we can simply swap v[i] and v[n - 1] and then remove the last element.
Now it's time to take a look at the details of CGAL Quickhull implementation and the optimization techniques I used.
0. Initialization
1// CGAL Quickhull init
2std::list<Point_3> points(input_begin, input_end);
3Triangulation_data_structure_2 tds(initial_tetrahedron);
4points.remove(tds.vertices);
5for (face in tds.faces)
6 for (point in points)
7 if (is_on_positive_side(face, point))
8 {
9 points.remove(point);
10 // face.points is of type std::list<Point_3>
11 face.points.append(point);
12 }
13std::list<Face_handle> pending_facets;
14for (face in tds.faces)
15 if (!face.points.empty())
16 {
17 set_pending_flag(face);
18 pending_facets.push_back(&face);
19 }
One more feature I would like to have for convex hull is that the output should be indices into the input point set, which means I cannot remove points from the input point set because each face have to store the indices of the three points as its vertices. So I use a vector<uint8_t> to store flags of the vertices. Similarly, each face also has a uint8_t field for the flags. Faces are stored in a vector.
It's important to reserve enough space for faces because we would want to avoid copy operations incurred by reallocation. Four times the number of the input vertices seems to be good enough for me, even for a heavily tessellated sphere.
1. The loop
1// CGAL Quickhull main loop
2while (!pending_facets.empty())
3{
4 std::list<Point_3> vis_outside_set;
5 Face_handle start = pending_facets.front();
6 Point_3 farthest_pt_it = farthest_outside_point(start);
7 start->points.erase(farthest_pt_it);
8 std::list<Face_handle> visible_set;
9 std::map<Vertex_handle, Edge> border_map;
10 std::vector<Edge> edges;
11
12 // find visible set, i.e. faces that have farthest_pt_it on their positive side
13 // collect outside points
14 // create border edge ring
15 // create new faces from border edge ring
16 // partition outside set
17}
Face struct in my implementation
1// face_t in my implementation
2struct face_t {
3 std::array<uint32_t, 3> point_indices;
4 // opposite to the corresponding point_index
5 std::array<uint32_t, 3> neighbor_face_indices;
6 std::vector<uint32_t> outside_point_indices;
7 std::array<uint8_t, 3> in_neighbor_indices;
8 uint8_t flags;
9};
10// for faces[f] and its neighbor with index g = faces[f].neighbor_face_indices[i],
11// let k = f.in_neighbor_indices[i], then
12// f == faces[g].neighbor_face_indices[k]
As mentioned earlier, removing a face from std::vector<face_t> is done by swapping that face with the last face in the vector and then shrink the size of the vector by 1. This will invalidate any reference to the last face. So I cannot have a vector of indices of pending faces. It would force me to to go over the container to update the indices of the faces that are relocated. But without such a container, how can I skip the non-pending faces (faces that have no outside points) in each iteration?
The trick is to store all the non-pending faces in front of the pending faces, and keep track of the index of the first pending face. To maintain this loop invariant, it's necessary to be able to swap two faces. There is a bit more to do than simply exchanging the content of two memory places. Each face has references to its three (and only three) neighbors. If one face is relocated, and thereby gets a different index, we need to update its neighbors about the index change. To make this process more efficient, I introduced a new member face_t::in_neighbor_indices to keep track of references to a face in its neighbor faces.
With the small inconveniences above addressed, there is nothing stopping me from replacing all the lists with vectors. Another easy optimization is to define all the vectors before entering the loop to avoid frequent construction and destruction of vectors.
2. Find visible set
1std::vector<Vertex_handle> vertices;
2visible_set.push_back(start);
3set_visited_flag(start);
4vertices.push_back(start->vertices);
5set_visited_flag(start->vertices);
6
7for (vis_it = visible_set.begin(); vis_it != visible_set.end(); ++vis_it) {
8 for (nf : vis_it->neighbor_faces) {
9 int neighbor_idx = nf->get_neighbor_idx(vis_it);
10 Edge shared_edge = get_shared_edge(nf, vis_it);
11 // 0 <= neighbor_idx < 3;
12 // vis_it is nf->neighbor_faces[neighbor_idx];
13 // nf->vertices[neighbor_idx] is not in vis_it->vertices
14 if (!is_visited_flag_set(nf)) {
15 if (!is_on_negative_side(nf, farthest_pt_it)) {
16 visible_set.push_back(nf);
17 Vertex_handle vh = nf->vertices[neighbor_idx];
18 if (!is_visited_flag_set(vh)) {
19 vertices.push_back(vh);
20 set_visited_flag(vh);
21 }
22 } // nf not visited
23 else {
24 set_border_flag(nf);
25 set_border_flag(get_vertices(shared_edge));
26 border_map.insert({ get_start_vertex(shared_edge), shared_edge });
27 }
28 }
29 else if (is_border_flag_set(nf)) {
30 set_border_flag(get_vertices(shared_edge));
31 }
32 } // loop over neighbor faces of vis_it
33} // loop over visible_set
34
35for (v in vertices)
36{
37 if (!is_border_flag_set(v))
38 tds.delete_vertex(v);
39 else
40 unset_flags(v);
41}
This step is barely affected by the switch from list to vector. As some faces are marked as visible, which will be removed later, so here we note down how many of them are pending and non-pending respectively.
3. Collect outside points
1for (f in visible_set)
2{
3 vis_outside_set.append(f.points);
4 f.points.clear();
5 if (is_pending_flag_set(f)) // is in pending_facets
6 pending_facets.erase(f);
7 unset_flags(f);
8}
In this step, no need for me to remove f from pending_facets because I don't have such a container.
4. Create border edge ring
1it =border_map.begin();
2Edge e = it->second;
3edges.push_back(e);
4unset_flags(it->first);
5border_map.erase(it);
6while(!border_map.empty())
7{
8 it = border_map.find(get_end_vertex(e));
9 e = it->second;
10 unset_flags(e.first);
11 edges.push_back(e);
12 border_map.erase(it);
13}
5. Create new faces
1int to_remove_face_count = visible_set.size() - edges.size();
2// reuse visible_set for new faces
3if (to_remove_face_count < 0) {
4 for (int i = 0; i < -to_remove_face_count; ++i)
5 visible_set.push_back(tds.create_face());
6}
7else {
8 for (int i = 0; i < to_remove_face_count; ++i) {
9 tds.delete_face(visible_set.back());
10 visible_set.pop_back();
11 }
12}
13
14for (edge in edges) {
15 face next_f = create_face_from_edge_and_point(farthest_pt_it, edge);
16 face nf = get_border_face(edge);
17 set_adjacency(next_f, nf);
18 set_adjacency(prev_f, next_f);
19}
In this step, I need to use additional flags to indicate which faces are newly added and which faces will not be reused and should be removed at the end of current iteration.
6. Partition outside points
1for (f in visible_set) {
2 for (p in vis_outside_set)
3 {
4 if (is_on_positive_side(f, p)) {
5 f.points.append(p);
6 vis_outside_set.remove(p);
7 }
8 }
9 if (!f.points.empty()) {
10 pending_facets.push_back(f);
11 set_pending_flag(f);
12 }
13 else
14 unset_pending_flag(f);
15}
In this step, I need to again update the number of pending and non-pending faces.
7. Additional step to relocate the misplaced faces
This step is only required by my implementation. After last step, we can compute the new position of the first pending face (referred to as new_pending_face_begin in the following) should all non-pending faces were located before all pending faces.
But at this moment there might be some misplaced faces. (Most of) these misplaced faces are either replaced by new faces or marked as to be removed. I used two passes to relocate those misplaced faces.
In the first pass, the faces to be removed are also considered pending faces so that they will be relocated to the right side of the last non-pending face. Now we have two types of misplaced faces. Type A: non-pending face with index no less than new_pending_face_begin and type B: pending face with index smaller than new_pending_face_begin.
Apart from the visible_set, I also need to check the faces between the old first pending face (pending_face_begin) and new_pending_face_begin for potentially misplaced faces due to the change of the position of the first pending face. At the beginning of the first pass, the faces that are not touched by this iteration but still misplaced are pushed into a stack. Then I start to go over each face in the visible set and check if it is a different type of misplacement than the top of the stack, if so swap it with the top of the stack, and pop the top of the stack, otherwise push it to the stack. Although faces to be removed are treated like pending faces, but an additional vector is used to store the (new) indices of these faces, let's say the name of this vector is removed_face_indices.
By the end of the first pass, the stack should be empty, and all the non-pending faces should be located to the left of all the pending faces that start at new_pending_face_begin, because it is impossible that the faces in the stack are of the same type of misplacement, as that would mean the new_pending_face_begin should be wrong. In the second pass the faces to be removed (in removed_face_indices) will be relocated to the end of faces and eventually get removed by shrinking the size of faces.
Clearly the complexity of this additional step is linear to the size of the visible set in each iteration, so the overhead this approach entails should be negligible compared to the performance gain it could potentially bring. The performance test result also confirms that.