Sebastiano Vigna, Professor, Università degli Studi di Milano



How would you present the project?

Software today is built using components. Only a few decades ago, a programmer had essentially complete control over every single line of code belonging to a software project. It was common to copy other people's source code, modify it for one's needs, and maintain it afterwards. This model of software development became increasingly difficult to manage: code was packaged into libraries, which you could use without actually looking at the code. The increasing number of available libraries led to ideas such as componentisation, isolation, and so on. The goal was to be able to do something essentially just by knowing the name of the right component and an external description of it (the so-called Application Programming Interface). This is akin to no longer washing your clothes with soap and manual work and buying a washer instead: you just need the instruction manual.

But not all components are created equal: some have programming defects, in particular security issues, which might make the component unusable in settings where security and safety are essential (think airplanes or power plants, for example). These defects are usually transient and fixed after some time, but in the meantime they render the component unusable.

The problem is exacerbated by the fact that we only track these defects at the component level: maybe you are not actually using the part of the component containing the defect. For example, your washer might stop working, showing some generic error message on its display, because it detects a leak in the softener container. But maybe you are not using the softener: if you knew exactly what the problem was, you could just continue to use your washer.

FASTEN wants to track problems in components down to a very fine-grained level: the function, or method, which is the smallest logically sound and syntactically recognizable unit of aggregation of code. We want to enable you to keep using your washer even before someone comes to fix the softener container.
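To make the idea concrete, here is a minimal sketch in Java of the kind of check that fine-grained tracking enables; all names (`ReachabilityCheck`, `addCall`, `isAffected`) are hypothetical illustrations, not the FASTEN API. The point is that an application is affected by a known defect only if its call graph actually reaches the defective function.

```java
import java.util.*;

// Minimal sketch (hypothetical names, not the FASTEN API): a call graph as an
// adjacency map from each function to the functions it directly calls.
public final class ReachabilityCheck {
    private final Map<String, Set<String>> calls = new HashMap<>();

    public void addCall(String caller, String callee) {
        calls.computeIfAbsent(caller, k -> new HashSet<>()).add(callee);
    }

    // True only if the defective function is reachable from the entry point:
    // a component-level check would instead flag every user of the component.
    public boolean isAffected(String entryPoint, String defectiveFunction) {
        Deque<String> stack = new ArrayDeque<>(List.of(entryPoint));
        Set<String> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            String f = stack.pop();
            if (f.equals(defectiveFunction)) return true;
            if (seen.add(f)) stack.addAll(calls.getOrDefault(f, Set.of()));
        }
        return false;
    }
}
```

In the washer analogy, an application that only ever reaches a hypothetical `Washer.wash()` would not be flagged for a defect in `Washer.addSoftener()`, even though both functions live in the same component.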

What is your role in the project?

UMIL is leading Work Package 5, Change impact analysis. We are also responsible for designing the storage format of the database of function calls. In a previous interview, Lodewijk Bergmans already explained the difference between maintaining a coarse-grained graph of dependencies between libraries and a fine-grained graph of function calls. Our goal is to store the information about function calls so that it can be searched efficiently: for example, by answering the question "Who's calling this function?", which tells us which components will be impacted by a modification to the function. This information can be massive, so we plan to borrow techniques from information retrieval to filter relevant results for the user.
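As a rough illustration of the kind of lookup involved, here is a sketch in Java of a reverse call index; the names (`ReverseCallIndex`, `directCallers`, `impactedBy`) are hypothetical, and the real storage format has to handle hundreds of millions of edges rather than in-memory hash maps.

```java
import java.util.*;

// Sketch (hypothetical names): a reverse index from each function to its
// direct callers, so that "Who's calling this function?" becomes a lookup
// rather than a scan of the whole call graph.
public final class ReverseCallIndex {
    private final Map<String, Set<String>> callers = new HashMap<>();

    public void addCall(String caller, String callee) {
        callers.computeIfAbsent(callee, k -> new HashSet<>()).add(caller);
    }

    // Direct callers: the code immediately impacted by a change.
    public Set<String> directCallers(String function) {
        return callers.getOrDefault(function, Set.of());
    }

    // Transitive callers: everything that may be impacted downstream.
    public Set<String> impactedBy(String function) {
        Set<String> impacted = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(directCallers(function));
        while (!stack.isEmpty()) {
            String f = stack.pop();
            if (impacted.add(f)) stack.addAll(directCallers(f));
        }
        return impacted;
    }
}
```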

What key innovation do you bring or help to develop?

The main challenge we face is storing and accessing a large number of function-call graphs from revisions of software products, loosely coupled by the explicit dependency descriptions provided by the developers. Since these descriptions allow for alternatives, we have to materialize, at query time, an actual dependency graph that depends on a timestamp; we answer questions like "Who was calling method A in July 2018?". The revision call graphs are numerous, and they are updated at a fast pace: the database of an ecosystem like Maven contains hundreds of millions of function calls.
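Here is a minimal sketch, in Java, of the temporal part of such a query; again, all names (`TemporalResolver`, `RevisionCallGraph`, `resolve`) are hypothetical. The idea is that each product has many timestamped revisions, each with its own call graph, and a query like "Who was calling method A in July 2018?" must first materialize the revisions that were current at that instant.

```java
import java.time.Instant;
import java.util.*;

// Sketch (hypothetical names): every revision of a product carries its own
// call graph; a temporal query selects, for each product, the latest
// revision released at or before the query timestamp.
interface RevisionCallGraph {
    Set<String> callersOf(String function);
}

public final class TemporalResolver {
    // product -> revisions, ordered by release time
    private final Map<String, NavigableMap<Instant, RevisionCallGraph>> revisions = new HashMap<>();

    public void addRevision(String product, Instant releasedAt, RevisionCallGraph graph) {
        revisions.computeIfAbsent(product, k -> new TreeMap<>()).put(releasedAt, graph);
    }

    // The revision of a product that was current at the given instant, if any.
    public Optional<RevisionCallGraph> resolve(String product, Instant when) {
        NavigableMap<Instant, RevisionCallGraph> m = revisions.get(product);
        Map.Entry<Instant, RevisionCallGraph> e = (m == null) ? null : m.floorEntry(when);
        return Optional.ofNullable(e == null ? null : e.getValue());
    }
}
```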

The other challenge is to make results understandable and browsable. While we can borrow some ideas from information retrieval and network centrality, they are only loosely related to the notion of importance of a function from the viewpoint of software architecture. We will have to develop new ranking methods for this purpose.
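As a baseline for what such a ranking could look like, here is a sketch, in Java, of a plain PageRank iteration over a call graph. This is only a familiar starting point from network centrality, not the ranking method the project will develop, and the names are illustrative.

```java
import java.util.*;

// Sketch: plain PageRank over a call graph (function -> called functions),
// a network-centrality baseline that is only loosely related to
// architectural importance. Dangling functions simply lose their rank mass
// in this simplified version.
public final class CallGraphRank {
    public static Map<String, Double> pageRank(Map<String, Set<String>> calls,
                                               double alpha, int iterations) {
        Set<String> nodes = new HashSet<>(calls.keySet());
        for (Set<String> out : calls.values()) nodes.addAll(out);
        final int n = nodes.size();
        Map<String, Double> rank = new HashMap<>();
        for (String f : nodes) rank.put(f, 1.0 / n);
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String f : nodes) next.put(f, (1 - alpha) / n);
            for (Map.Entry<String, Set<String>> e : calls.entrySet()) {
                Set<String> out = e.getValue();
                if (out.isEmpty()) continue;
                double share = alpha * rank.get(e.getKey()) / out.size();
                for (String callee : out) next.merge(callee, share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }
}
```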
  

A word about yourself and your organization.

The Laboratory for Web Algorithmics was established at the end of the 1990s, when we started to crawl the web to build datasets for ourselves and other researchers; in fact, literally hundreds of papers have been published using our datasets. We started by writing a crawler, which needed efficient software libraries; then we moved to permanent-storage technology, such as search engines, static key/value stores, and so on. We also used the datasets to study network centrality measures such as PageRank. Among other things, we collaborated with Facebook in 2011, using our high-performance algorithms to measure Facebook's degrees of separation. The LAW is part of the Dipartimento di Informatica of the Università degli Studi di Milano.

Sebastiano Vigna's research focuses on the interaction between theory and practice. He has worked on highly theoretical topics such as computability on the reals, distributed computability, self-stabilization, minimal perfect hashing, succinct data structures, query recommendation, algorithms for large graphs, pseudorandom number generation, theoretical and experimental analysis of spectral rankings such as PageRank, and the axiomatisation of centrality measures, but he is also (co)author of several widely used software tools, ranging from high-performance Java libraries to a search engine, a crawler, a text editor and a graph-compression framework. In 2011 he collaborated on the first computation of the distance distribution of the whole Facebook graph, which showed that on Facebook there are just 3.74 degrees of separation. His work on Elias-Fano coding and quasi-succinct indices is at the basis of the code of Facebook's "folly" library. His pseudorandom number generator xorshift128+ is the current stock generator of Google's V8 JavaScript engine and is used by Chrome, Safari, Firefox, Edge, and Node.js; it is also the stock generator of the Erlang language.