Successes and shortfalls of the current Canadian ARC platform and ideas to improve it further

Maxime Boissonneault
Research computing analyst, Université Laval
Team lead, User Support, Calcul Québec
Team lead, Research Support National Team (RSNT), Compute Canada

Preamble

I have used HPC clusters since 2004, when I was an undergraduate student in physics and the Elix‑2 and Mammouth‑1 clusters were recent additions at Université de Sherbrooke. I used the successive generations of clusters in Sherbrooke until I finished my PhD in 2011. The following year, I started working as an HPC analyst for Calcul Québec at Université Laval and familiarized myself with the other Calcul Québec clusters. In 2015, I took on my roles as team lead for research support for Calcul Québec as well as Compute Canada.

I helped drive the transition from the large collection of small clusters that Compute Canada had before 2015 to the small number of larger clusters that exist currently, and saw the successes and challenges along the way. This white paper is therefore not written from the perspective of a researcher, but rather from that of someone who used to be one and who now has considerable insight into the inner workings of the Canadian ARC ecosystem. I hope my observations can help shape NDRIO for the better.

Successes

First, let me acknowledge that the Compute Canada Federation has come a long way in usability since the launch of the new infrastructure round in 2015. This progress was driven by an array of national teams who put a lot of coordinated effort into making the user experience more uniform across the national platform.

National helpdesk

One major achievement driven by the Research Support National Team (RSNT) was the establishment of a common help desk across the federation. This shifted the user support model from being primarily cluster-centric – where questions from every user are only answered by staff from the hosting site – to being user-centric – where questions are primarily answered by staff local to the users regardless of which cluster is being used.

This is beneficial in many aspects. It first ensures that French-speaking users can get support in their primary language regardless of which cluster they are using or which region they are from. It also allows the federation to leverage domain expertise across the country. Since universities are naturally more inclined to recruit people with expertise in the institution’s dominant research fields, it also allows the federation to better tailor its expertise to that of researchers. Finally, having a large pool of support staff working in multiple time zones and with various schedules naturally increases the effective business hours and reduces answering delays.

National documentation website

Another initiative overseen by the RSNT was the deployment of a single bilingual documentation website for the whole federation. One of the challenges of having a bilingual website is that the two language versions drift out of sync over time. Preventing this was the main driver in choosing MediaWiki as a platform, since it provides extensions explicitly designed for handling translation and keeping content synchronized across multiple languages.

Having bilingual documentation benefits more than the French-speaking users. Indeed, the process of translation usually requires adjustments to clarify the original text, which improves it for all, and in particular for the large community of non-native English speakers.

National and portable software environment

The biggest achievement is probably the software environment designed by the RSNT. Its major feature is being portable, which means that users can get the same software on each of our clusters, or even in the commercial cloud, in their research lab, or on their laptop. This level of workload portability had never been seen before. This work was presented at PEARC19, and sparked a similar European initiative. The software environment is also being used by the National Research Council, by other Canadian partner institutions, by research groups and labs within Canada, and even by some sites in France and Switzerland.

Thanks to this portability, researchers can develop software locally in their lab and move their work seamlessly to the ARC infrastructure that best fits their research needs or migrate their workload across clusters without wasting time on software installation.

Coordinated decisions and nationally managed services

In addition to the work done by the RSNT, other national teams also had a lot of impact in making the user experience uniform across the federation. The Scheduler national team selected and configured the schedulers and now operates them across all clusters. The Storage national team defined a standard filesystem layout and policies so that users can find the same folders in the same location. The systems teams also adopted the same authentication backend (LDAP), so that users use the same credentials to connect to any of the clusters. All of this work was coordinated by national teams and by the Technological Leadership Council (TLC).

The level of uniformity that was achieved thanks to these teams allows most researchers to migrate their work seamlessly across clusters, which can be especially useful when an outage happens before one’s research deadline.

Shortfalls

While a lot of good work has been done to create a consistent user experience across the national platform, gaps still exist. Those shortfalls mostly stem from having separate operation teams for each cluster: implementation details decided by local teams become policy for specific sites. From a service model perspective, siloed operations introduce inconsistencies and technical non-conformities, which result in a loss of efficiency. A natural consideration would be to consolidate operations into a single operational team for each type of service, which would also leverage the benefits of geographical dispersion and time-zone differences.

Note that it may be perfectly acceptable to have different policies when an infrastructure serves a specific need. For example, the Niagara cluster is meant for large parallel jobs, and it is therefore natural that its scheduler does not accept single-core jobs. Another acceptable example would be an infrastructure designed to serve the needs of researchers with sensitive data, which could be designed with stricter security policies than other infrastructure. Many of the differences are however not driven by any such reasoning and should be considered for improvement. Below are such examples.

Internet access from the compute nodes

The policy for accessing the Internet from compute nodes varies widely. On Cedar, access is allowed by default. On Graham and Béluga, it is blocked by default, but the operation teams grant exceptions upon request. On Niagara, it is blocked and users are left to configure workarounds through SSH tunnels.

Front node policies

Front nodes of the various clusters have different restrictions. For example, Cedar allows users to run periodic jobs (crontab), while Graham, Béluga and Niagara do not. Limits on memory usage and CPU usage on those nodes also differ across clusters. Users can connect directly to the data transfer nodes on Graham, but not on Béluga, Cedar or Niagara.

Scheduling policies

While the sites use the same scheduling software (Slurm), scheduling policies vary quite a bit across sites. For example, the maximum job duration varies from 1 day (on Niagara) to 7 days (on Béluga), to 30 days (on Cedar and Graham). Other variations include the number of jobs that a user can submit, what location they can submit from (Cedar forbids submitting jobs from /home), or where they can write during a job (Niagara’s /home is read-only on compute nodes).

Globus file transfer

All of the clusters support Globus as a tool to transfer large files across the sites. However, not all sites have it set up the same way. On Cedar and Béluga, users authenticate through a federated authentication provider operated by Compute Canada, which only requires them to log in once (single sign-on), while Graham and Niagara require users to send their credentials to a third party (Globus).

The Niagara exception

It is well known within the organization that the Niagara team made many design decisions that are very different from those of the other sites. The cluster does not use the same filesystem layout as the other clusters, users on Niagara do not have the same POSIX group memberships, and the default file permissions are also different. Moreover, the Niagara team still uses and advertises a separate helpdesk email address, which is not accessible to staff outside of the Niagara team, and a separate documentation website, which is not bilingual. Finally, the portable software environment that Compute Canada supports is offered to Niagara users only as an alternative to a local software environment, which is incompatible with Compute Canada’s and is maintained independently, duplicating effort.

Different services on different sites

Because decision making is currently a local function, different hosting sites develop different services. For example, the Cedar team is hosting a NextCloud service, Graham and Cedar offer database services and Graham offers dedicated VDI (visualisation) nodes. Béluga and Niagara both offer a service for Jupyter notebooks, but their implementations differ. Currently, when coordination does happen and the same service is deployed on multiple sites, it is usually deployed independently, by different people, which results in different implementation decisions.

Why do such differences matter?

The above are just a few examples of remaining discrepancies across the existing clusters. Taken individually, none of them may seem like a big problem. However, each of them is a barrier to user migration from one cluster to another. We have encountered examples of users not being able to easily migrate from Cedar or Graham to Béluga, because their workload is tailored to running jobs longer than 7 days. Others cannot easily use Niagara, because they require Internet access. Others must use Graham because they require VDI nodes, or Cedar or Graham because they need database services. Others have a hard time using Graham and Cedar because they need Jupyter notebooks. Finally, some may have security requirements that are met by one cluster and not the other.
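As a rough illustration of how these constraints combine, the short Python sketch below encodes a few of the differences described in this section as a lookup table and lists the clusters that could host a given workload. The names ClusterPolicy, POLICIES and eligible_clusters are hypothetical, the table only reflects the policies mentioned in this paper, and it assumes for simplicity that "blocked by default, exceptions on request" counts as usable Internet access while Niagara's SSH-tunnel workaround does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPolicy:
    """Hypothetical summary of a few per-cluster policies named in this paper."""
    max_job_days: int    # maximum job duration, in days
    internet: str        # "allowed", "on_request" or "blocked" (compute nodes)
    services: frozenset  # extra services offered on that cluster


# Values taken from the examples in this section; illustrative, not exhaustive.
POLICIES = {
    "Cedar": ClusterPolicy(30, "allowed", frozenset({"database"})),
    "Graham": ClusterPolicy(30, "on_request", frozenset({"database", "vdi"})),
    "Béluga": ClusterPolicy(7, "on_request", frozenset({"jupyter"})),
    "Niagara": ClusterPolicy(1, "blocked", frozenset({"jupyter"})),
}


def eligible_clusters(job_days, needs_internet=False, services=()):
    """Return the clusters whose policies accommodate the given workload."""
    needed = frozenset(services)
    result = []
    for name, policy in POLICIES.items():
        if job_days > policy.max_job_days:
            continue  # job is longer than the cluster allows
        if needs_internet and policy.internet == "blocked":
            continue  # assumption: SSH-tunnel workarounds are not counted
        if not needed <= policy.services:
            continue  # a required service is not offered here
        result.append(name)
    return result


if __name__ == "__main__":
    print(eligible_clusters(10))                        # jobs longer than 7 days
    print(eligible_clusters(1, needs_internet=True))    # needs Internet access
    print(eligible_clusters(10, services=["jupyter"]))  # long job plus Jupyter
```

Each additional requirement narrows the list, and some combinations (for example a 10-day job that also needs Jupyter notebooks) leave no cluster at all, which is the mobility problem described above.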

The differences above and their impact on user mobility may sometimes be just an inconvenience, but for some user communities, they can be a deal breaker. In all cases, they limit user mobility, which may leave some with no usable resource at a critical moment, for example if there is an outage of one of the clusters right before a critical research deadline. While it is great to see different initiatives being born, the strategy going forward to support new services and offer researchers more research-support options could be greatly enhanced by taking a uniform approach to operations to eliminate or minimize inconsistencies.

Building a better future

Based on the above, in which direction should NDRIO take the Canadian digital research infrastructure? In the past 5 years, the Compute Canada Federation has made tremendous progress in better supporting advanced research computing in Canada, by providing a more mature environment to researchers. The achievements mentioned above should be preserved and built upon. I believe that they are perfectly suited for a new national organization. However, further progress is being held back by the existence of artificial divisions based on hosting sites rather than based on speciality and talent of people, irrespective of where they are located.

While these silos exist for historical reasons, we should not minimise the role of legal responsibility, which currently rests with individual sites. For hosting sites, this means that they have responsibility and decision-making power over the infrastructure. For all sites (hosting or not), it also means that each one has its own employees, and each employee’s superior is at the local institution. There are no secondment agreements between the national organization and the local staff. This creates a situation in which local priorities and decisions often supersede national ones. Collaboration is based on goodwill and, while that sometimes works well, it also sometimes fails.

Recommendations

In the future, NDRIO should take direct responsibility for operating the national infrastructure. Formal secondments should be required between NDRIO and the local staff to ensure a clear chain of command across the organization. Division of labour should be made not based on where the hardware is located, but rather on staff specialization. For example, if the national organization decides that we are to offer a database service, there should be a single team who is in charge of implementing it across all sites. There should be a single team who manages file systems across all clusters, one who manages data transfer nodes and their configurations, one who manages security aspects for all sites, etc. Each of these teams should be drawn from across the country, to leverage a wide range of expertise and increase robustness.

Hardware should be located in data centres that are chosen based on hosting costs and on technical criteria such as reliability and availability, rather than on politics. Other than for hardware maintenance, there should be no more geographical connection between staff and the infrastructure. That would also make it possible to leverage and integrate the expertise of staff not located at hosting sites, who represent the majority of the team members in the Compute Canada Federation.

While this change could be perceived as limiting innovation, by creating a more top-down approach driven nationally rather than locally, it does not have to be that way. By granting national teams responsibility and privilege to develop and operationalize services based on their speciality, we can instead give those individuals the leeway to create innovative solutions that will be implemented uniformly, for the benefit of everyone. We will also reduce overhead by having a given service run by an expert group, rather than by multiple independent people with varying degrees of expertise. One example of such a successful team is the Research Support National Team, which operates the national helpdesk, the national documentation and the national software environment. Another example is the Infrastructure Operations National team, which operates shared national services such as LDAP (authentication directory), DNS, Google Workspace, Slack, CC GitLab, SSL certificates, high availability servers and CVMFS.

Division of labour based on expertise rather than on colocation with infrastructure would also make it possible, for example, to have a team that specialises in user experience and works on developing and deploying simplified access to the infrastructure. Such access is especially lacking when it comes to serving the needs of non-traditional users.