Clean up HTCondor-CE overview doc (SOFTWARE-5819) #157

Status: Open, wants to merge 1 commit into base: master
145 changes: 27 additions & 118 deletions docs/compute-element/htcondor-ce-overview.md
title: HTCondor-CE Overview
DateReviewed: 2024-02-09

HTCondor-CE Overview
====================
Before continuing with the overview, make sure that you are familiar with the following:

- An OSG site plan
- What is a batch system and which one will you use ([HTCondor](http://htcondor.org/), PBS, LSF, SGE, or
[Slurm](https://slurm.schedmd.com/))?
- Security via [host certificates](../security/host-certs/overview.md) to authenticate servers and
[bearer tokens](../security/tokens/overview.md) to authenticate clients
- Pilot jobs, frontends, and factories (e.g., [GlideinWMS](http://glideinwms.fnal.gov/doc.prd/index.html),
Harvester)

What is a Compute Entrypoint?
-----------------------------

An OSG Compute Entrypoint (CE) is the door for research organizations to submit requests to temporarily allocate local
compute capacity.
At the heart of the CE is the software that is responsible for handling incoming allocation requests, authenticating and
authorizing them, and delegating them to your batch system for execution as jobs.
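The token-based authentication and authorization step is typically configured through mapfiles. As a hedged illustration only, a `SCITOKENS` entry under `/etc/condor-ce/mapfiles.d/` might map a collaboration's token issuer to a local account; the issuer URL and the `osgpilot` user below are assumptions, not values from this document:

```
# Illustrative mapfile entry: tokens from this (assumed) issuer
# are mapped to the local 'osgpilot' account
SCITOKENS /^https:\/\/osg-htc\.org\/ospool,/ osgpilot
```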

Capacity allocation requests that arrive at a CE are **not** end-user jobs, but rather pilot jobs that are submitted
from pilot factories.
Successful pilot jobs create an environment for actual research user jobs to match and ultimately run under the pilot job.
Eventually pilot jobs remove themselves, typically after a period of inactivity.

!!! note
The Compute Entrypoint was previously known as the "Compute Element".

What is HTCondor-CE?
--------------------

HTCondor-CE is a special configuration of the HTCondor software designed to be a compute entrypoint solution for the OSG
Fabric of Services.
It is configured to use the [JobRouter daemon](https://htcondor.readthedocs.io/en/latest/grid-computing/job-router.html) to
delegate pilot jobs by transforming and submitting them to the site’s batch system.
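To make the transform-and-submit mechanism concrete, here is a minimal sketch of a job route for a Slurm cluster, assuming the new-style route syntax; the route name and default values are illustrative, not site recommendations:

```
# Hypothetical route: transform incoming pilot jobs and submit them to Slurm
JOB_ROUTER_ROUTE_My_Slurm @=jrt
  GridResource = "batch slurm"
  # Illustrative site-specific defaults applied during the transform
  SET default_maxMemory = 4000
  SET default_queue = "osg"
@jrt

JOB_ROUTER_ROUTE_NAMES = $(JOB_ROUTER_ROUTE_NAMES) My_Slurm
```

Routes like this live in ordinary HTCondor-CE configuration files, which is why customizations persist across upgrades.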

Benefits of running the HTCondor-CE:

- **Scalability:** HTCondor-CE is capable of supporting pilot job workloads of large sites
- **Debugging tools:** HTCondor-CE offers [many tools to help troubleshoot](troubleshoot-htcondor-ce.md)
issues with pilot jobs
- **Routing as configuration:** HTCondor-CE’s mechanism to transform and submit pilot jobs is customized via
configuration variables, which means that customizations will persist across upgrades and will not involve
modification of software internals to route jobs
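For instance, the troubleshooting tools include CE-specific counterparts of familiar HTCondor commands (the hostname below is a placeholder):

```console
# List pilot jobs currently in the CE's queue
condor_ce_q

# Check which CE daemons are up and advertising
condor_ce_status -any

# Submit an end-to-end test job through the CE (requires a valid credential)
condor_ce_trace condorce.example.com
```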

See the [upstream documentation](https://htcondor.com/htcondor-ce/architecture/) for details about the architecture of
HTCondor-CE.

Next steps
----------

If you are already running a local batch system and are interested in contributing computational capacity to the OSG
Consortium, deploy an OSG CE through one of the following methods:

- [Request an OSG-Hosted CE](hosted-ce.md)
- [Install an HTCondor-CE](install-htcondor-ce.md)

- [Register the CE](install-htcondor-ce.md#registering-the-ce)

If you do not already have a batch system installed,
consider contributing through an [Open Science Pool EP container](../resource-sharing/os-backfill-containers.md).