Internet Access on Maestro
Summary
All nodes of the Maestro cluster allow HTTP/S connections to the Internet. This includes, but is not limited to:
- wget/curl
- apptainer build
- pip/conda/poetry
- git via https
- python urllib, perl LWP
- Colabfold (use --msa-only in a CPU allocation to download the MSAs, then run the batch in a GPU allocation)
Please take these guidelines into account:
- Know what you (or your program) download
- Download what's not already available at Pasteur (check /local/databases or ask us at ask-hpc@pasteur.fr)
- Download only what you need and only once
- Verify that your code has downloaded the required files before launching many jobs
- AI-related scripts may overfill your home directory (check $HOME/.cache/huggingface)
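The "download only once" and "verify before launching many jobs" guidelines can be sketched as two small bash helpers. This is a minimal sketch, not a site-provided tool: the function names fetch_once and require_files are hypothetical, as are any URLs or paths you pass to them.

```shell
# Hypothetical helpers; the names are illustrative, not part of Maestro.

# Download URL to DEST only if DEST does not already exist with content.
fetch_once() {
    local url="$1" dest="$2"
    if [ -s "$dest" ]; then
        echo "skip: $dest already present"
        return 0
    fi
    # Download to a temporary name first, so an interrupted transfer
    # never leaves a partial file under the final name.
    curl --fail --silent --show-error --location --output "$dest.part" "$url" \
        && mv "$dest.part" "$dest"
}

# Return non-zero if any of the listed files is missing or empty.
require_files() {
    local f missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

One way to use them is to call fetch_once a single time before submitting anything, then gate the submission of the many jobs on require_files succeeding. The size of the Hugging Face cache mentioned above can be checked with du -sh "$HOME/.cache/huggingface".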
Unfortunately, NextFlow fails to authenticate with our proxy, so you will have to run:
remove proxy (bash)
unset HTTP_PROXY https_proxy http_proxy HTTPS_PROXY
This restores NextFlow's access to the Internet, but only on the submit node. Check page for more information.
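If other tools in the same session still need the proxy, one option is to drop the variables for a single command rather than for the whole shell. The following is a minimal sketch, assuming an env implementation that supports -u (GNU coreutils does); the function name without_proxy and the pipeline name are hypothetical.

```shell
# Run one command with the four proxy variables removed from its
# environment; the calling shell keeps its own proxy settings.
without_proxy() {
    env -u HTTP_PROXY -u HTTPS_PROXY -u http_proxy -u https_proxy "$@"
}

# Hypothetical usage (the pipeline name is a placeholder):
# without_proxy nextflow run my_pipeline.nf
```

Like the unset command, this only helps on a node where direct Internet access is allowed.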
If in doubt, please contact ask-hpc@pasteur.fr or join us at https://rocketchat.pasteur.cloud/channel/ask-hpc
Details
Internet access on Maestro is evolving to address current shortcomings and accommodate broader workloads. The following diagram illustrates the changes taking place.
[[DRAW.IO DIAGRAM PLACEHOLDER]]
Previous configuration
- Only the submit node was allowed Internet access.
- Site-wide restrictions were applied (Institut Pasteur-level policies governing Internet access).
- Compute nodes were not allowed to reach any target on the Internet through any protocol.
New configuration
- Access protocols are now split into two groups: HTTP/S (wget, curl, git over http/s, etc.) and non-HTTP/S (ssh, gridFTP, etc.).
- From the submit node standpoint, the only change is that HTTP/S traffic goes through a proxy server. The rest of the Internet traffic (that is, non-HTTP/S protocols) is still allowed directly.
- From a compute node standpoint, HTTP/S protocols are treated as on the submit node (allowed through the proxy); the rest of the traffic is filtered out.
- Site-wide restrictions still apply to all outgoing traffic from the proxy server and from the submit node.
All HTTP/S traffic flowing through the proxy is logged (timestamp, username, URL/domain, source host and byte count).
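To see which proxy settings a session on the submit node or a compute node actually carries, the four variables named in the summary can be listed. This is a minimal sketch; the function name show_proxy is hypothetical, and whether these variables are set in your session is an assumption to check rather than a guarantee.

```shell
# List the http(s)_proxy variables (either case) present in the
# environment, or say so when none is defined.
show_proxy() {
    env | grep -iE '^https?_proxy=' || echo "no proxy variables set"
}
```

Tools such as curl, wget and pip generally read these variables, which is how their HTTP/S traffic ends up going through the proxy.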
Important:
- Avoid redundant data downloads for each independently running job.
- Depending on the context, some tools automatically send data during job execution. A classic case is BLAST, which sends a usage report to NCBI at the end of each command. It is advisable to use the Gensoft versions, as they have been configured not to connect elsewhere.
- There is no definitive or straightforward way to know beforehand whether a tool will require Internet access, or which protocol it will use. If a time-out occurs, please test again on the submit node.
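When a time-out occurs, a quick probe with a short time limit can help tell whether a target is reachable over HTTP/S from the node you are on. The following is a minimal sketch using standard curl options; the function name probe_url and the 10-second limit are arbitrary choices, and the URL you pass is up to you.

```shell
# Print the HTTP status code returned for URL, or a failure message when
# curl gets no response within the time limit. curl honors the proxy
# variables when they are set.
probe_url() {
    local url="$1" code
    if code=$(curl --silent --head --max-time 10 --output /dev/null \
                   --write-out '%{http_code}' "$url"); then
        echo "reachable: HTTP $code"
    else
        echo "unreachable or timed out: $url"
        return 1
    fi
}
```

Running the same probe on a compute node and on the submit node shows whether the two behave differently for that target. It only covers HTTP/S; a tool that relies on another protocol will still need to be tested on the submit node.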