2016-04-30

Some visualizations of the Stacks project

The Stacks project?

The Stacks project is an open source textbook on algebraic geometry, which “aims to build up enough basic algebraic geometry as foundations for algebraic stacks“, implying “a good deal of theory on commutative algebra, schemes, varieties, algebraic spaces, has to be developed en route“. Algebraic geometry is a branch in mathematics that studies the geometry of zeros of a polynomial system. Despite the complexity and central importance in mathematics, various connections between machine learning and algebraic geometry has been shown in the literature: see S. Watanabe (2009), S. Arora et al (2012), V. Chandrasekaran et al (2012), G. Blekherman et al (2013), R. Livni et al (2013), F. J. Király et al (2014), and the Wikipedia page on algebraic statistics for examples.

Hope that you are now motivated to know more about algebraic geometry through the Stacks project. Let’s first have a peek on the project from some statistics. Currently (HEAD: a90a2ea, feel free to check the statistics from the newest HEAD), the project has

221 contributors
5190 pages after compilation
96 chapters
550 references
15457 tags (labels in the .tex files)
18 MB of .tex files

Apparently this is a daunting amount of texts. We need visualizations to help get ideas of the tome.

The provided API for tags

Stacks has provided API for retrieving statements and graphs given query tags. A statement is the text content labeled with the query. A graph contains nodes and links. The nodes include the query and other related tags with meta-data, while the links are relations between the involved tags. Graphs are described by d3.js-oriented JSON format with three categories: force-directed, cluster, and collapsible, for each tag. The differences are:

Force-directed: if the query tag is a theorem (lemma/proposition/…, same below), the API returns a set of dependent tags of theorems as nodes, and the the cross-references in the proofs as links (the source refers the target). In other cases, a single node will be returned and the links will be empty (even there are cross references in the corresponding text).
Cluster: similar to force-directed, but in the form of a tree instead of a general graph, with the query tag as the root. So nodes may be duplicated. If the query tag is not a theorem, the tree only contains the root.
Collapsible: always a tree of four levels with the query tag as the root. The levels are root > chapter > section > related tags or again the query itself. If the query tag is a chapter or a section, intermediate nodes will be duplicated to reach the depth of four. Since all the tags are at the leaves, the tag dependency is less clearer than in cluster.

The text above will be like buzz sound before seeing examples: see Figure 1 for some possible visualizations for tag 015I (Lemma 13.21.2), created directly from the returned JSON by the API.


(a) “Cluster” tree layout	(b) Force-direct graph. Tags are draggable.

(c) “Collapsible” tree layout. Click the blue circles to expand or collapse the intermediate nodes.	(d) “Collapsible” force-direct graph. Tags are draggable. Click the blue rectangles to expand or collapse the intermediate nodes.

Figure 1: different visualizations of related tags for the query 015I created from the graph API. In collapsible visualizations Figure 1(c) and Figure 1(d), more informative chapter/section names are shown instead of the corresponding raw tag names.

Some visualizations in Figure 1 are interactive so you can play with them. The visualizations can also be enhanced in various ways, like showing the tag statements from the statement API when mouse over the tags, to help the users understand why the tags are connected. (Update: the Stacks project actually has its own implementation of the graphs which do show the statements when hovering over the tag nodes. Check the Extras part in the lemma’s explanation page, for example)

What else to visualize?

With the graph API as building blocks, we can certainly do more. But to be motivated, we need to first ask ourselves: What do we want to see? Or, what do we care about in this data?

Among all the aspects of the book, let’s focus on one specific question in this post: which parts of the book are important to the subject?
There are immediately two follow-ups:

What are the “parts”?
How to measure the importance quantitatively?

Different answers give different insights. I will show them by different visualizations in the following sections.

Scatter plot for the theorem importance

Let’s start with the theorems. Intuitively a theorem is important if

it is cross-referred to by many later theorems, or
it cross-refers to many earlier theorems.

In the first case, we can say the theorem is fundamental since it acts as a basis for others, while in the second case, the theorem may be comprehensive since it requires many preliminaries. This motivates a scatter plot of “times of being cross-referred to” (“#cross-referred”) vs. “times of cross-referring to others” (“#cross-referring”) for each theorem. Figure 2 shows the plot. The 10 theorems to the top-right of the dashed line are assumed to be important. The slope is determined in an empirical way: sort the coordinates and take the 10 largest $x$ and $y$ values, say $(x_1,\cdots,x_{10})$ and $(y_1,\cdots,y_{10})$, then let $\mathrm{slope}=-(\sum_{i=1}^{10} y_i/x_i)/10$. I move the line until there are 10 (and only 10) points to its top-right. Table 1 shows the information of the selected tags.

Figure 2: a scatter plot of the theorems. $x$-axis: times of a theorem cross-referring to other theorems; $y$-axis: times of a theorem being cross-referred to. It is as expected that most theorems lie at the bottom-left corner. Theorems above the dashed line are of our interest.

Table 1: the detailed information of the 10 theorems that we think important according to our assumption. The tags are sorted into descending order of their distances to the dashed line. The hyperlinks point to the text contents of the theorems.

We can for example verify the fundamental role of 00DV, Nakayama lemma, by Wikipedia:

Informally, the lemma immediately gives a precise sense in which finitely generated modules over a commutative ring behave like vector spaces over a field. It is an important tool in algebraic geometry, because it allows local data on algebraic varieties, in the form of modules over local rings, to be studied pointwise as vector spaces over the residue field of the ring.

WikipediaNakayama lemma

Well, as a non-expert on algebraic geometry, I can still feel how it is crucial to the subject from the wording.
We can also investigate other theorems in the table. They are mostly lists of properties of some mathematics objects or long proofs, with 01UA somehow as an exception. We will investigate it more later.

Chord diagram for chapter relation

We can create the same kind of scatter plot for sections and chapters if pooling the number of cross-references at section or chapter level. But the scatter plot has at least one drawback: it does not tell where the cross-references come and go. Of course for theorems we can only go back to the full graph, which may be monstrous. However, at the level of chapter, we can resort to chord diagram to track the cross-references in the proofs. Chord diagram is a tool for visualizing a numerical square matrix with positive entries, which can encode pairwise- and self- relations between entities. In a chord diagram matrix entries are visualized as chords between two arcs on the same circle. Say we have a matrix $M=(m_{ij})_{n\times n}$, a chord diagram for $M$ can be constructed so that

The angular span of arc $i$ is proportional to $\sum_j m_{ij}$
The chord for $m_{ij}$ has a gradually changing width between arc $i$ and arc $j$ :
i. the width on arc $i$ is proportional to $m_{ij}/\sum_j m_{ij}$
ii. the width on arc $j$ is proportional to $m_{ij}/\sum_i m_{ij}$

See also an illustrative example made by d3.js. For the Stacks project, we can let $m_{ij}$ be the number of cross references in the proofs in chapter $i$ that refers to the theorems in chapter $j$. Additionally, chapters have been partitioned into seven “topics” in the project (Preliminaries, Schemes, etc, see stacks/stacks-project/chapters.tex), so another chord diagram for the topics is also available. Figure 3 shows a combination of the two chord diagrams.

Figure 3: a combination of two chord diagrams for chapters and topics. Chapters start with Set Theory and end with More on Morphisms of Stacks, both of which are located at the top. Mouse over the outer annular sectors to highlight the chords involving the corresponding topic. Mouse over the inner smaller colored blocks to highlight the chords involving the corresponding chapter.

From Figure 3, we can again see in the chapter view that Commutative Algebra is an important chapter as it is cross-referred to everywhere in later chapters, which coincides with Table 1. Additionally, as is hidden in the scatter plot, Commutative Algebra is more influential on Morphisms of Schemes, Varieties, and Dualizing Complexes than most of the others (obviously related chapters like More on Algegbra is ignored). For the topic view, we can see for example that as the name suggests, Topics in Scheme Theory is an extension of Schemes, and does not have much to do with the subsequent chapters. The proofs cross-referred to in later chapters are mostly from topic Schemes. This suggests a tree-like structure in the content dependency.

As a case study, we can go back and try to find out why 01UA (Lemma 28.24.9) is so popular, as it is not a list of properties nor a long proof. We start from its chapter Morphisms of Schemes. If following the chords we will see the chapter is mostly cross-referred by More on Morphisms and Morphisms of Algebraic Spaces, while the theorems in Morphisms of Algebraic Spaces is majorly cross-referred by More on Morphisms of Spaces and Morphisms of Algebraic Stacks. So I will guess 01UA can be found in these four morphisms-related chapters. Actually this can be verified in Figure 4, a tree layout for the theorems that cross-refers 01UA, including the chapters they are in.

Figure 4: a tree layout for theorems cross-referring 01UA, Lemma 28.24.9, with the chapters that the theorems are in. The four chapters in my guess are in the tree, plus Descent as another chapter that cross-refers the theorem a lot. On the other hand, schemes-related chapters also cross-refers to the theorem quite often. This may suggest Lemma 28.24.9 connects notions like “morphisms”, “schemes”, “descent”, and “algebraic spaces” in algebraic geometry. But interestingly the chapter Descent and Algebraic Spaces does not cross-refer to this theorem.

Why the visualizations?

We have created some visualizations to answer a version of the original question “which theorems/chapters of the book are important to the subject?” without any knowledge in algebraic geometry. But we have a missing tile – why are the visualizations like this in the sense of mathematics? The absence of the answer will make the figures less helpful for the practitioners, since the figures don’t help them much to understand the subject from the mathematics side. For example, we don’t know why Nakayama’s lemma is so important before looking into Wikipedia. And we don’t understand why Lemma 28.24.9 does not in the chapter Descent and Algebraic Spaces, though it appears in both Descent and Algebraic Spaces. The concerns here, are

What mathematics notions are handled in each chapter (section/theorem/…)?
How are the notions related by theorems?
Ultimately, which notions are important to the subjects? And why?

Let’s try to answer the questions by further mining the texts and the tag graph of Stacks.

(to be continued)

2016-04-09

A faster fisheye plugin for d3.js

Introduction to fisheye

Fisheye is a kind of “Focus+Context” techniques for visualizations. Fisheye enhances glyphs locally with displacements and zooming, providing details in the focal area. Meanwhile, it keeps the glyphs unchanged outside the area, serving as context so that the user still has an overview for the visualization. One of the advantages of fisheye over a plain magnifier, which has a constant scaling factor in the whole area, is the continuity at the boundary of focal area.

The continuity implies different glyphs should have different displacements and scaling factor ($\ge 1$) inside the focal area. Usually, we are more interested in the following isotropic case:

the focal area is a round region centered at the current focal point, which can be the current position of the mouse cursor if on a computer interface like a web page;
inside the area displacements or scaling factors are functions of the distances between their original positions and the focal point;
at the boundary there is no displacement nor scaling
The first two visualizations in d3’s fisheye demo page are examples for this case. To be more specific, let $p_0=(x_0,y_0)$ be the coordinate of the focal point, $p=(x, y)$ be the coordinate of a glyph to be distorted, and $r$ be the radius of the focal region. Here is one way for distortion:

$$\begin{equation} \begin{aligned} x_{f} = & x_0+k\cdot(x-x_0) \\ y_{f} = & y_0+k\cdot(y-y_0) \\ s_{f} = & k \end{aligned} \label{eq:distortion} \end{equation}$$

where $f$ stands for “fisheye”, $s$ is the scaling factor, and $k=k(d(p,p_0))$ is a monotonically decreasing function w.r.t. the distance between $p$ and $p_0$. Figure 1 shows a curve for $k$ from the implementation in d3-plugins/fisheye with $r=200$.

Figure 1: an example for the function $k$ in Eq. \eqref{eq:distortion}. Notice $k(r)=1$ so no changes at the boundary.

k-d tree: accelerate nearest neighbor search

So if we apply the calculated distortions on the nearest neighbors of the current focal point whenever the focal position changes then we are done. d3-plugins/fisheye implements this. However, this naive $O(n)$ nearest neighbor searching does not scale. Different tree-like space-partitioning data structures have been invented for indexing points in $\mathbb{R}^n$, reducing the time complexity of spatial searching to $O(\log n)$ on average. There are for example k-d tree, ball tree, quadtree, etc. I simply choose k-d tree in fast-fisheye.

k-d tree is a binary tree for space indexing, created from subdividing the space by axis-aligned hyperplanes recursively. To build a k-d tree, we start with an axis and find the median in that axis. The axis-aligned hyperplane going through the median will divide the whole point set into two subsets. We then create the root node at the median, and construct the two child k-d trees from the two subsets, until a stopping criterion is met, which is usually reaching the limit for tree depth, leaf size, or leaf radius. Figure 2 is taken from Wikipedia, showing an example of the construction.

Figure 2: a k-d tree for point set $\{(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)\}$. Axis $x$ and $y$ are chosen alternatingly when moving down during the construction. In fact cycling through the axes is the standard way of axis-choosing at each level.

Given a query point, k-d tree supports two kinds of nearest neighbor searching: the $k$ nearest neighbors of the query point, or all neighbors within a radius of the query point. We only care about the latter case in fisheye. A branch in the k-d tree is pruned if its bounding rectangle does not intersect the query disk. The Python-like pseudocode is shown below.

# input: root, query, radius
# output: neighbors
neighbors = []
stack = [root]
while stack:
	node = stack.pop()
	if node is leaf:
		neighbors += [for p in node.points if distance(p, query) <= radius]
	else:
		if bounding_rect(node.left_child) intersects disk(query, radius):
			stack.push(node.left_child)
		if bounding_rect(node.right_child) intersects disk(query, radius):
			stack.push(node.right_child)

Finally, I use the k-d tree implementation in nanoflann for fast-fisheye.

Emscripten: compile C++ to JavaScript

To glue things together, I use emscripten to compile the C++ computation code into JavaScript. Emscripten is an LLVM-based project, which “compiles any other code that can be translated into LLVM bitcode into JavaScript” without losing the native performance. In the simplest cases like here, all you need to do is 1) write adapters in C++ to expose your C++ classes or functions to JavaScript, and 2) compile the C++ code to JavaScript. Check their documentations for more information. A list of porting examples and demos may also be of interest.

An example

The code can be found on GitHub, including an example with 20000 random circles and fisheye effect rendered by this fast-fisheye, as well as a baseline with the same set of points but rendered by d3-plugins/fisheye. At least in my Chromium (version: 48.0.2564.82 (Developer Build) Ubuntu 15.04 (64-bit)), I can see a noticeable speedup and smoother rendering in fast-fisheye compared with the baseline. Unfortunately in Firefox both are slow, but I might be too lazy to figure out the reason.

It is also possible to incorporate nearest neighbor searching techniques for moving query. I leave this to future works.

All's (Not) Right with Data

Machine Learning | Visualization