Internet Access on Maestro

Summary

All nodes of the Maestro cluster allow HTTP/S connections to the Internet. This includes, but is not limited to:

  • wget/curl
  • apptainer build
  • pip/conda/poetry
  • git via https
  • Python urllib, Perl LWP
  • ColabFold (use --msa-only in a CPU allocation to download the alignments, then run the batch in a GPU allocation)

Please follow these guidelines:

  • Know what you (or your program) download
  • Download what's not already available at Pasteur (check /local/databases or ask us at ask-hpc@pasteur.fr)
  • Download only what you need and only once
  • Verify that your code has downloaded the required files before launching many jobs
  • AI-related scripts may overfill your home directory (check $HOME/.cache/huggingface)
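
The "download only once" and "verify before launching many jobs" guidelines can be sketched as a small shell helper. This is only an illustration under stated assumptions: `fetch_once` is a hypothetical function (not something provided on Maestro), it relies on `curl` being available, and the URL and destination path are whatever you pass in.

```shell
# Hypothetical helper (not a Maestro-provided command): download URL to DEST
# only if DEST does not already exist as a non-empty file.
# Usage: fetch_once URL DEST
fetch_once() {
    fetch_url="$1"
    fetch_dest="$2"
    if [ -s "$fetch_dest" ]; then
        echo "already present: $fetch_dest"
        return 0
    fi
    mkdir -p "$(dirname "$fetch_dest")"
    # Write to a temporary name first so that an interrupted transfer
    # is never mistaken for a complete file.
    curl --fail --silent --show-error --location \
         --output "$fetch_dest.part" "$fetch_url" \
        && mv "$fetch_dest.part" "$fetch_dest"
}
```

Calling such a helper once, before submitting a batch of jobs, means every job finds the file already in place instead of downloading it again. In the same spirit, `du -sh "$HOME/.cache/huggingface"` reports how much space that cache currently occupies.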

Unfortunately, NextFlow fails to authenticate with our proxy, so you will have to unset the proxy environment variables before running it:

remove proxy (bash)

 unset HTTP_PROXY https_proxy http_proxy HTTPS_PROXY

This restores NextFlow access to the Internet, but only on the submit node. Check page for more information.
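
If you would rather not lose the proxy settings for the rest of your session, one possible variant is to unset the variables in a subshell that runs a single command. The sketch below assumes a POSIX shell; `without_proxy` is a hypothetical helper name and the pipeline file in the comment is a placeholder.

```shell
# Hypothetical helper: run one command with the proxy variables removed,
# leaving the calling shell's environment untouched.
without_proxy() {
    (
        unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
        "$@"
    )
}

# Example (placeholder pipeline name):
# without_proxy nextflow run my_pipeline.nf
```

Because the `unset` happens inside the parentheses, other tools started from the same shell (wget, pip, git, etc.) still see the proxy variables.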

If in doubt, please contact ask-hpc@pasteur.fr or join us at https://rocketchat.pasteur.cloud/channel/ask-hpc

Details

Internet access on Maestro is evolving to address current shortcomings and accommodate broader workloads. The following diagram illustrates the changes taking place.

[[DRAW.IO DIAGRAM PLACEHOLDER]]

Previous configuration

  • Only the submit node was allowed Internet access.
  • Site-wide restrictions applied (Institut Pasteur-level policies governing Internet access).
  • Compute nodes were not allowed to reach any target on the Internet through any protocol.

New configuration

  • Access protocols are now split into two groups: HTTP/S (wget, curl, git over HTTP/S, etc.) and non-HTTP/S (ssh, GridFTP, etc.).
  • From the submit node's standpoint, the only change is that HTTP/S traffic goes through a proxy server. All other Internet traffic (that is, non-HTTP/S protocols) is still allowed directly.
  • From a compute node's standpoint, HTTP/S traffic is treated as on the submit node (allowed through the proxy); all other traffic is filtered out.
  • Site-wide restrictions still apply to all traffic leaving the proxy server and the submit node.

All HTTP/S traffic flowing through the proxy is logged (timestamp, username, URL/domain, source host and byte count).
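
To see which proxy settings a given shell or job actually inherits, something like the following sketch may help. It assumes the proxy is configured through the usual `http_proxy`, `https_proxy` and `no_proxy` environment variables (in either case), which the page does not state explicitly; `show_proxy_env` is a hypothetical helper name.

```shell
# Hypothetical helper: list proxy-related environment variables.
# Assumption: the proxy is configured through the usual
# http_proxy / https_proxy / no_proxy variables (lower or upper case).
show_proxy_env() {
    env | grep -i -E '^(https?|no)_proxy=' | sort
}
```

Running it on both the submit node and inside a compute allocation shows whether both environments point at the same proxy.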

Important:

  • Avoid downloading the same data again in each independently running job.
  • Some tools automatically send data during job execution. BLAST is a classic case: a usage report is sent to NCBI at the end of each BLAST command. It is advisable to use the Gensoft versions, as they have been configured not to connect elsewhere.
  • There is no definitive or straightforward way to know beforehand whether a tool will require Internet access, or which protocol it will use. If a connection times out, please test again on the submit node.
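
When comparing a compute node with the submit node, it can be convenient to make the time-out explicit rather than waiting for a tool's default. The sketch below is one way to do that for HTTP/S targets, assuming `curl` is available; `probe_http` is a hypothetical helper, and some servers may reject HEAD requests even though they are reachable.

```shell
# Hypothetical helper: send an HTTP HEAD request with a hard time limit.
# Usage: probe_http URL [SECONDS]   (default limit: 10 seconds)
probe_http() {
    probe_url="$1"
    probe_limit="${2:-10}"
    if curl --head --fail --silent --show-error \
            --max-time "$probe_limit" --output /dev/null "$probe_url"; then
        echo "reachable: $probe_url"
    else
        probe_status=$?
        # curl exit code 28 means the time limit was reached.
        echo "not reachable (curl exit code $probe_status): $probe_url" >&2
        return "$probe_status"
    fi
}
```

Running the same probe on a compute node and on the submit node helps tell a proxy or filtering issue apart from a target that is simply down.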