| June 01, 2022

Communities in Data

By Kenneth S. Berenhaut and Katherine E. Moore

Figure 1. A small two-dimensional Euclidean data set with varying local density. The community structure is overlaid on the data, which reveals three distinct (community) clusters and two isolated points. The inset in the bottom right depicts the highly connected nature of the cluster in the bottom left. Figure courtesy of [1].

As humans, the way that we make sense of the world around us is often informed by our social perspectives and interactions. Our recent research leverages the instinctive understanding of social ideas—such as communities, conflicts, and foci—with probability to introduce a depth-outward perspective that reveals structure in data [1].

When viewing the 16 data points in Figure 1, we might notice two main groupings in the bottom left and top right—the former is tightly knit and the second is more spread out—as well as a couple of outliers and a small group of only two points. These clusters are not simply a reflection of absolute distance in a global sense; instead, they represent a nuanced local-global view from the perspective of individual points.

A social perspective can provide some insight. In a social context, communities, clusters, and cliques are held together by complex bonding ties that are driven by local perspectives. Suppose that we wish to determine the degree to which elements/points in a collection are “deep,” in that they are surrounded by other points. Note that what we view as a point’s local surroundings can be strongly influenced by local context (think of urban versus rural settings). To tackle potentially varying density and reflect locality, we borrow from the social analogy of opposition. Consider an individual \(x\) that is in conflict with an opposing individual \(y\). Other nearby individuals may then possess positional impetus to express comparative closeness (or support) for one of the members of the opposing pair \((x,y)\). In this sense, “nearby”—or \((x,y)\)-local—refers to the set of points \(U_{x,y}\) that are at least as close to either \(x\) or \(y\) as \(x\) and \(y\) are to one another, i.e.,

\[U_{x,y} = \{z \in S \ | \ d(z,x) \le d(y,x)\;\; {\textrm{or}}\;\; d(z,y) \le d(x,y)\}.\]

Here, \(d(x,y)\) conveys the distance (or dissimilarity) from \(x\) to \(y\). We determine the set \(U_{x,y}\) via local closeness comparisons without eliciting precise numeric distances, which again mirrors the social analogy.

Figure 2. Examples of sets \(U_{x,Y}\) of \((x,Y)\)-local points for fixed \(x\) and varying \(Y\). The underlying data is the same as in Figure 1. Figure courtesy of [1].

We can now sweep across the space, comparing \(x\) to potential opposing points while tallying relative support from nearby points. More formally, the local (community) depth of point \(x\) is the chance that \(Z\) is closer to \(x\) than to \(Y\) when the opposing point \(Y\) is selected (uniformly) at random and one of the \((x,Y)\)-local individuals \(Z\) is also randomly selected. That is,

\[\ell _S(x) = P(d(Z,x) < d(Z,Y)).\]

Note that \(Z\) could possibly be equidistant to both \(x\) and \(Y\), in which case a coin flip breaks the tie; this results in an average local depth of precisely 0.5. In the simple example in Figure 1, the local depths range from 0.33 to 0.64.

Asking to what extent a point \(w\) contributes to the local depth of \(x\) offers insight into \(w\)’s alignment to \(x\). Namely, we can obtain a measure of binding cohesion \(C_{x,w}\) by considering the probability that \(w\) is the selected point from the resulting local region and is closer to \(x\), i.e.,

\[C_{x,w} = P(Z = w \;\; \textrm{and} \;\; d(Z,x) < d(Z,Y)).\]

In essence, we partition the probability that defines the local depth of \(x\) over contributions from points across the space; we can thus refer to cohesion as partitioned local depth.

Now that we have a measure of the strength of relationships between individuals (i.e., cohesion), we can designate certain ties as particularly strong. A straightforward yet elegant option is to take as “strongly cohesive” any pair \((x,w)\) for which both mutual cohesions (of \(x\) to \(w\) and \(w\) to \(x\)) are greater than the mean value when we average globally over \(x\) and locally (or nearby) over \(w\) — i.e., pairs that satisfy

\[\textrm{min}\{C_{x,w},C_{w,x}\} \ge P(Z = W \;\, \textrm{and} \;\, d(Z,X) < d(Z,Y)).\]

Figure 3. Community structure for 87 Indo-European languages, which employs cognate information that was coded via 2,665-dimensional binary vectors. Commonly identifiable language clusters arise along with informative inter- and intra-cluster structure. Several ancient languages are centrally located. Figure courtesy of [1].

Here, we uniformly select \(X\) at random from the set of all points under consideration. The individuals \(Y\) and \(Z\) are as before, and \(W\) is random from the respective local \((X,Y)\) region.

We can therefore convey the resulting community structure through annotated networks with edges that are weighted by mutual cohesion; we can also emphasize connected components of the subnetwork with edge-pairs that satisfy the aforementioned threshold inequality. Doing so allows us to holistically consider both local and global structure. In the figures below, we will visualize all networks via a standard force-directed graph plotting algorithm.

Cohesion captures relative positioning through local queries—such as “Is \(A\) more similar to \(B\) or to \(C\)?”—and hence does not require a global sense of “closeness” in terms of radial distance. This situation is particularly valuable when density varies across the space, which can occur in complex systems that evolve locally — for instance, those that arise in the context of language, genetic information, or cultural values (see Figures 3 and 4). It also provides robustness to uncertainty in exact numeric values and sidesteps some of the issues in the consideration of non-metric and high-dimensional data [1].

Figure 3 considers cognate information for 87 Indo-European languages [2, 3]. We note the presence of commonly identifiable language clusters and find that, with a slight rotation, some of the underlying geography is mirrored in the plot. Perhaps surprisingly, we obtained this structural information without any input determination beyond a collection of distance comparisons; there was no need to specify a priori a number of clusters or neighborhood size.

Figure 4. Community structure based on the cultural fixation index values for regions within the U.S., China, India, and the European Union [4]. Figure courtesy of [1].

Figure 4 depicts the community structure among regions in the U.S., Europe, India, and China based on cultural differences. The input distances between regions were obtained from aggregated responses to recent World Values Survey questions that reflect cultural values regarding family, religion, education, and institutions [4]. Even though some cohesive groups (such as India and the European Union) are far more culturally diverse than others, Figure 4 provides an informative sense of community structure. In fact, the two most culturally similar regions in India (Uttar Pradesh and Bihar) are more different than the most culturally dissimilar regions in the U.S. (California and East South Central).

Beyond its fundamental role in the construction of community networks, the concept of cohesion can be valuable in other contexts where one might wish to consider local information. For example, particularly strong cohesion can serve as a parameter-free, weighted alternative to \(k\)-nearest neighbors or local kernel functions. In fact, the number and strength of strong ties can adapt to reflect local context and vary across the space. One could also apply this approach alongside other methods to account for varying densities in a broad range of applications, from clustering and classification to smoothing and outlier detection.

Our experiences can inspire and inform the ways in which we process and communicate structural information in the world around us. The concept of data communities suggested here is derived from—and aligns with—a shared human social perspective.

References
[1] Berenhaut, K.S., Moore, K.E., & Melvin, R.L. (2022). A social perspective on perceived distances reveals deep community structure. Proc. Natl. Acad. Sci., 119(4), e2003634119.
[2] Dyen, I., Kruskal, J.B., & Black, P. (1992). An Indoeuropean classification: A lexicostatistical experiment. Trans. Am. Philos. Soc., 82(5), iii-iv+1-132.
[3] Gray, R.D., & Atkinson, Q.D. (2003). Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965), 435-439.
[4] Muthukrishna, M., Bell, A.V., Henrich, J., Curtin, C.M., Gedranovich, A., McInerney, J., & Thue, B. (2020). Beyond Western, educated, industrial, rich, and democratic (WEIRD) psychology: Measuring and mapping scales of cultural and psychological distance. Psychol. Sci., 31(6), 678-701.

Kenneth S. Berenhaut ([email protected]) is a professor and Baker Family Faculty Fellow at Wake Forest University. He studies questions that relate to applied probability and the structure of data. Katherine E. Moore is a visiting assistant professor at Amherst College. Her research addresses discrete applied probability and the mathematical foundations of data science.