Seven Reasons to Use R for Social Network Analysis (and Three Reasons Against)

Learning to use software always entails some startup cost. I recently had an exchange with one of my colleagues who is relatively new to social network analysis. He asked about my thoughts on a certain network analysis program and mentioned that “it’s easy to get lost with so many [network analysis] programs out there.” His impression is completely understandable. Social network analysis has become immensely popular in recent years, and its rise has been especially pronounced among gifted people capable of writing good software. Indeed, one Wikipedia list broadly describes about 70 social network analysis programs. Each of these programs has its strengths and weaknesses with regard to its contributions to the field. Given the wealth of options, which programs are worth the time investment to learn?

If you’re new to network analysis then I’d highly recommend learning the packages in R, perhaps supplemented by Pajek and/or Python packages. Here’s why:

1. R and its packages are open source.

The benefits of open source software for the scientific community are numerous. Open source software means that anyone can freely access and improve upon the software’s code. Should the field present new developments, as all scientific endeavors should, hypothetically anyone in the world can take previously existing code and update it to accommodate the new developments. These updates on behalf of a large community minimize the lag between contributions in the literature and their corresponding software implementation.

Also, developers often cease maintaining a given piece of software. If a developer becomes too busy, loses interest, dies, switches fields, etc., then his or her closed source software will cease to change. Such circumstances are very common and mean that the software won’t incorporate changes in the field, it won’t adapt to operating system updates, and its bugs will stick around. If such a situation happens with your primary network analysis program, you’re going to have to adopt a new program to stay relevant. In contrast, due to universal access, open source software has the potential to live beyond the lifespan of its creator.

Lastly, the price tag on practically all open source software is $0.00. The costs incurred are time, but not money.

2. R is cross-platform.

Cross-platform software means that if you use Linux, you can collaborate on analyses with people who use MacOS or Windows. Your operating systems won’t stand in the way of collaboration.

Cross-platform software also entails universal replication. Anyone with the necessary hardware can recreate analyses in R.

3. R can import and export practically any form of data.

Though it may require some massaging afterwards, R can read data from other statistical packages (e.g., SPSS, Stata) using the foreign package, from other network programs using either the network or igraph packages, from internet sources using packages like RCurl or XML, and it can read anything that’s text formatted. It can write data just the same. This advantage means that your research with R will rarely be limited by the data’s format.

4. R is much more than social network analysis.

Aside from network analysis, R has a huge library of packages for practically every statistical need. You can complement your network research with any analysis of your choosing within the R environment. The time invested learning social network analysis within R lends itself to countless other statistical and quantitative techniques.

5. Network analysis within R can include igraph, RSiena, and the statnet suite.

While points 1-4 apply to the C++, Python, and R languages alike, RSiena and statnet are R-exclusive and igraph fills out most omissions from statnet and RSiena. The ability to access all three of these libraries within the same software environment provides a vast arsenal of network functionality, larger than any other software system I know of. A few words on each of these libraries:

statnet is a suite of R packages that includes the sna, network, and ergm packages among others. The sna package originated as an open source alternative to and improvement upon UCINET. sna includes all the functionality of UCINET along with many additional features. The focus is largely upon traditional sociometrics, yet it also includes some developments from the past five years. The network package is used to create “network” objects within R. Network objects are more computationally efficient than sociomatrices and serve as the universal data format among the various statnet packages. Lastly, ergm is the main package to model social networks and explain tie formation. Additional statnet packages, like stergm, hergm, degreenet, latentnet, networksis, netperm, and relevent, provide a number of cutting edge, specialized ways to model network phenomena.
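To make the efficiency point concrete, here is a minimal sketch of why a sparse representation beats a full sociomatrix. It is written in plain Python rather than R, the six-person network is made up, and the adjacency-list layout is only my illustration of the general idea, not the network package’s actual internal structure.

```python
# Hypothetical six-person network; names and ties are invented for illustration.
nodes = ["ann", "bob", "cat", "dan", "eve", "fay"]
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "eve")]

index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Dense sociomatrix: one cell per ordered pair, zeros included.
sociomatrix = [[0] * n for _ in range(n)]
for a, b in edges:
    sociomatrix[index[a]][index[b]] = 1
    sociomatrix[index[b]][index[a]] = 1  # undirected tie, so mirror it

# Sparse adjacency list: only the ties that exist are stored.
adjacency = {name: set() for name in nodes}
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

matrix_cells = n * n
stored_ties = sum(len(nbrs) for nbrs in adjacency.values())
print(f"sociomatrix cells: {matrix_cells}, stored ties: {stored_ties}")
```

The matrix grows with the square of the number of actors no matter how few ties there are, while the sparse form grows with the number of ties, which is what matters for the large, sparse networks most of us work with.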

RSiena began with Siena, standing for Simulation Investigation for Empirical Network Analysis. The program began as a standalone Windows program to model social networks. By the summer of 2011, Siena had transitioned from Windows to an R package. If you want to model longitudinal networks today, RSiena is the package to use.

From my understanding of its history, igraph began as a clone of the sna package, yet has evolved in a remarkably different direction. igraph includes many functions that duplicate those in sna, yet its network data format, referred to as a “graph,” is stored in a manner different from network’s “network” object. I’ve heard that graph objects are more efficient. Unlike the sna package, igraph includes a number of developments over the past decade from the field of “network science,” incorporating models, measurements, and clustering algorithms developed by physicists and computer scientists. Aside from perhaps speed improvements, I would recommend using this package for community detection and its random graph “games.” The package is also available for Python and C.

6. R can model network tie formation without assuming dyadic independence.

Would you use a statistics package that produced plots along with univariate and bivariate analyses, but omitted multivariate regression models? In general, probably not. How frequently do the top journals publish quantitative papers without regression models? Infrequently. For now, the status of social network analysis seems to be an exception to these matters, but I doubt it will last.

Only a very small minority of network analysis programs can truly model network tie formation. Sure, many programs might provide a technique that closely resembles linear regression, while assuming dyadic independence, failing to account for triadic effects and other graph-level properties. Other programs can model networks based upon specific theories on tie formation, like small world, preferential attachment, or density controlled random graphs. But if you want to test competing hypotheses within a single model that accounts for a variety of graph-level phenomena, your options are very limited. Siena and statnet are two of the very few programs capable of creating such models and they’re at the forefront of the field in this regard. Both RSiena and ergm are actively maintained and the statnet team is continuing to create new methodological tools to accomplish different modeling challenges.
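To illustrate what the dyadic independence assumption misses, here is a minimal sketch in plain Python (not R, and not how ergm or RSiena work internally). The friendship ties are made up. If ties formed independently of one another with a single probability, the chance that an open two-path closes into a triangle would simply equal the network’s density. Comparing density to observed transitivity shows how far a network departs from that.

```python
from itertools import combinations

# Hypothetical undirected friendship ties (invented data).
edges = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6), (6, 7), (7, 8)}
nodes = sorted({v for e in edges for v in e})

neighbors = {v: set() for v in nodes}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Density: share of possible ties that are present.
n = len(nodes)
density = len(edges) / (n * (n - 1) / 2)

# Transitivity: share of two-paths (a - v - b) where a and b are also tied.
two_paths = 0
closed = 0
for v in nodes:
    for a, b in combinations(sorted(neighbors[v]), 2):
        two_paths += 1
        if b in neighbors[a]:
            closed += 1
transitivity = closed / two_paths

print(f"density      = {density:.3f}")
print(f"transitivity = {transitivity:.3f}")
```

In this toy network 6 of the 13 two-paths are closed (about 0.46), against a density of about 0.32. A model that treats each dyad independently has no way to represent that surplus of triangles, which is exactly the kind of dependence the models above are built to capture.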

7. R plays nicely with Pajek.

Pajek is very fast. Pajek XXL eats your so-called “big data” for breakfast. Pajek makes pretty plots. Pajek is free. Pajek runs in Linux through WINE. Pajek has a big active community. R can read and write Pajek data. Pajek can send data directly to R. Pajek can also automate all this point-and-click business through macro scripts.

But Pajek isn’t open source. Pajek doesn’t model well. Pajek accepts a limited number of data formats and writes even fewer. Pajek only does network analysis. Pajek’s interface doesn’t make much sense.

Using R and Pajek together overcomes the limitations of either one.
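For a sense of what actually gets passed between programs, here is a minimal sketch that writes a tiny network in Pajek’s plain-text .net format. It is in Python rather than R, the function name to_pajek_net and the data are hypothetical, and it covers only the most basic subset of the format (a vertex list followed by weighted undirected edges).

```python
def to_pajek_net(labels, edges):
    """Return Pajek .net text for an undirected, weighted network.

    labels: list of vertex names; edges: list of (source, target, weight).
    """
    # Pajek numbers vertices from 1, in the order they are listed.
    ids = {label: i for i, label in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{ids[label]} "{label}"' for label in labels]
    lines.append("*Edges")  # "*Arcs" would be the directed counterpart
    lines += [f"{ids[a]} {ids[b]} {w}" for a, b, w in edges]
    return "\n".join(lines) + "\n"


# Invented three-person example.
labels = ["ann", "bob", "cat"]
edges = [("ann", "bob", 1), ("bob", "cat", 2)]
net_text = to_pajek_net(labels, edges)
print(net_text)
```

A file like this should be readable on the R side with network’s read.paj() or igraph’s read_graph() using format = "pajek", though I’d check the result on anything beyond a simple edge list.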

Downsides

To be fair, while I think working in R is by far your best bet for social network analysis, it does have a few notable downsides.

1. It’s less new user-friendly than point-and-click programs.

You’ve got to invest the time in learning how to use R before you can do social network analysis well in R. Clearly it’s more difficult to write a script than to point and click in order to run a network function. The upside to learning R for the sake of network analysis is that in the process you learn R for statistics and everything else the environment can do. When you spend time learning UCINET, Gephi, Pajek, etc., you end up only knowing how to use a single program with a limited set of commands representative of the interests of relatively few developers.

2. The visualizations take a bit of work to beautify.

Making nice visualizations is a trial and error process in practically all software. While you can make incredibly attractive, custom network plots in R through gplot(), gplot3d(), plot.network(), and plot.igraph(), plots made in Pajek and especially Gephi tend to look a bit nicer. Fortunately, data produced within R can be exported in formats readable by those two programs.

3. R tends to lag behind C++ and Python in speed tests.

Compared to C++ and Python, R is a painfully slow language. To compensate for the loss of performance, some of R’s base functions and many of its social network analysis functions are coded in C or C++. While R’s network and igraph packages aren’t slow for most uses and their developers have gone to great lengths to improve their speed, R’s performance can be a handicap compared to other programming languages if you’re working with very large datasets and/or using computationally intense simulations.

In light of these limitations, I’d recommend supplementing social network analysis in R with programs capable of faster performance and perhaps high-end visualizations. When I need to produce visualizations and basic measurements on very large networks, I use Pajek for the reasons outlined earlier. Increasingly, developers are writing network analysis libraries for Python, and they’re probably worth considering if you either know Python or want to learn it. These libraries include igraph, graph-tool, and NetworkX. The advantages of these Python libraries are that they’re open source, likely provide a speed boost beyond R, can interface with other Python libraries, and the advertised visuals look nice. I can often visually tell if a plot was produced in R, Pajek, Gephi, or NetDraw by color schemes, layout, the edge width to node diameter ratio, labels, arrowheads, etc., but the plots made in Python are quite neutral with regard to aesthetics. From what I can tell from the documentation, these packages overlap in terms of interests and thus lack the breadth found in R’s social network packages. Python’s network libraries are also limited with respect to modeling capabilities.

Of course, the “best” networks package is the one you most enjoy working in.

Bad Hessian

Too Nerdy to Facebook, Too Obscure to Tweet