Files, Directories, And Config Roles¶
Read this page before running any commands.
If files and keys are placed inconsistently, the workflow may still run, but it becomes difficult to review, reuse, or debug.
Three Different Trees¶
This workflow uses three different directory trees:
the raw source-data tree,
the canonical ingested dataset tree,
the pleb config tree.
They serve different purposes and should be kept conceptually separate.
Raw Source-Data Tree¶
This is whatever external layout the source data currently uses. It may be messy, split by telescope, backend, year, or release bundle.
Typical contents:
scattered .tim files,
one or more .par roots,
optional template files,
inconsistent backend naming.
This tree should usually not be edited during normal runs.
Canonical Ingested Dataset Tree¶
After ingest, pleb expects a regular pulsar-oriented layout. For each
pulsar, the canonical form is:
<dataset_root>/
  J1909-3744/
    J1909-3744.par
    J1909-3744_all.tim
    tims/
      EFF.P200.1360.tim
      NRT.NUPPI.1480.tim
    tmplts/
      ...
This is the tree later stages operate on.
Key facts about this tree:
the pulsar directory is the main unit of work,
the pulsar parfile sits at the pulsar root,
backend tim files sit under tims/,
Jxxxx_all.tim is the include file that gathers backend tims.
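The key facts above can be turned into a quick layout check. The following is a minimal sketch, not part of pleb: the function name `check_pulsar_dir` is hypothetical, and it assumes the parfile and include file are named after the pulsar directory, as in the J1909-3744 example.

```python
from pathlib import Path


def check_pulsar_dir(pulsar_dir: Path) -> list[str]:
    """Return a list of problems found in one canonical pulsar directory."""
    name = pulsar_dir.name
    problems = []
    # The parfile and the include file sit at the pulsar root.
    if not (pulsar_dir / f"{name}.par").is_file():
        problems.append(f"missing {name}.par")
    if not (pulsar_dir / f"{name}_all.tim").is_file():
        problems.append(f"missing {name}_all.tim")
    # Backend tim files sit under tims/.
    tims = pulsar_dir / "tims"
    if not tims.is_dir():
        problems.append("missing tims/ directory")
    elif not any(tims.glob("*.tim")):
        problems.append("tims/ contains no .tim files")
    return problems
```

An empty return value means the directory has the files that later stages are described as relying on. The check does not inspect the contents of the include file.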
The pleb Config Tree¶
Inside this repository, config files are separated by role.
Use this as a hard rule:
configs/runs/* are executable run profiles,
configs/catalogs/* are shared data assets,
configs/rules/* are policy files,
configs/workflows/* are multi-step orchestration files,
configs/state/* is generated runtime state.
This split is described in Configuration Layout and
configs/README.md. Preserving this split makes later maintenance much
easier.
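The role split can also be expressed as a small lookup, which is handy in review scripts. This is an illustrative sketch only: `CONFIG_ROLES` and `config_role` are hypothetical names, not part of pleb, and the mapping simply restates the hard rule for the five directories under configs/.

```python
from pathlib import PurePosixPath

# Role of each directory directly under configs/, restating the rule above.
CONFIG_ROLES = {
    "runs": "executable run profile",
    "catalogs": "shared data asset",
    "rules": "policy file",
    "workflows": "multi-step orchestration file",
    "state": "generated runtime state",
}


def config_role(path: str) -> str:
    """Map a repo-relative config path to its role, or raise ValueError."""
    parts = PurePosixPath(path).parts
    # A valid path looks like configs/<role-dir>/<something>.
    if len(parts) < 3 or parts[0] != "configs" or parts[1] not in CONFIG_ROLES:
        raise ValueError(f"{path!r} is not under a known configs/ role directory")
    return CONFIG_ROLES[parts[1]]
```

For example, `config_role("configs/runs/ingest/single_pulsar_ingest.toml")` returns `"executable run profile"`, while a file placed directly in configs/ raises `ValueError`.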
Which Keys Go In Which File¶
This distinction is the main source of avoidable config drift.
Put these in a run profile under configs/runs/...:
environment-specific paths such as home_dir, results_dir, singularity_image,
run scope such as pulsars, branches, reference_branch,
stage toggles such as run_tempo2, run_pqc, run_fix_dataset,
run-local policy choices that are specific to one analysis pass.
Put these in a catalog file under configs/catalogs/...:
ingest mapping JSON,
backend classification tables,
variant definitions,
stable system lookup tables.
Put these in a rule file under configs/rules/...:
per-backend PQC overrides,
overlap action policies,
relabel policies.
Put these in a workflow file under configs/workflows/...:
stage order,
branch hand-off between stages,
per-step overrides,
one command that runs a known sequence.
Recommended Single-Pulsar File Set¶
For a single-pulsar setup, create a small dedicated set of files.
configs/
  catalogs/
    ingest/
      single_pulsar_mapping.json
  runs/
    ingest/
      single_pulsar_ingest.toml
    fixdataset/
      single_pulsar_step1_fix.toml
      single_pulsar_pqc_apply.toml
    pqc/
      single_pulsar_pqc_detect.toml
  rules/
    pqc/
      single_pulsar_backend_profiles.toml
  workflows/
    single_pulsar_3pass.toml
This layout is not required by the code, but it is clean, explicit, and easy to maintain.
How Paths Resolve Across These Files¶
The most important path relationship in a single-pulsar setup is:
ingest writes a canonical dataset tree to ingest_output_dir,
later run profiles refer to that tree through home_dir and dataset_name,
run outputs are written separately under results_dir.
Example:
ingest writes to /data/canonical/EPTA-DR3/epta-dr3-data,
later run profiles use home_dir = "/data/canonical" and dataset_name = "EPTA-DR3/epta-dr3-data",
while writing results under results_dir = "results".
This means the dataset itself and the run outputs are separate concerns:
the dataset tree is the input tree being analyzed or mutated,
the results tree is where logs, summaries, QC products, plots, and run metadata are written.
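The path relationship in the example can be checked mechanically. The sketch below assumes that later stages locate the dataset by joining home_dir and dataset_name, and that a relative results_dir is resolved against some working directory. The page does not say which directory that is, so `cwd` is a caller-supplied assumption and both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def dataset_root(home_dir: str, dataset_name: str) -> PurePosixPath:
    """Assumed resolution rule: the dataset tree lives at home_dir/dataset_name."""
    return PurePosixPath(home_dir) / dataset_name


def results_separate(results_dir: str, root: PurePosixPath, cwd: str) -> bool:
    """True if results_dir, resolved against cwd, lies outside the dataset tree."""
    # Joining with an absolute results_dir simply yields results_dir itself.
    results = PurePosixPath(cwd) / results_dir
    return not results.is_relative_to(root)


# Values from the example above.
root = dataset_root("/data/canonical", "EPTA-DR3/epta-dr3-data")
matches_ingest = root == PurePosixPath("/data/canonical/EPTA-DR3/epta-dr3-data")
```

Here `matches_ingest` is `True`, so the run profile points at the tree that ingest wrote. `results_separate` returns `False` if the run is launched from inside the dataset tree with a relative results_dir, which is the situation to avoid.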
How Branches Fit Into The Picture¶
The branch keys refer to the git repository that contains the canonical dataset
tree, not to the results_dir.
For mutating stages, the important branch keys are:
fix_base_branch: Existing branch used as the starting point for a mutation pass.
fix_branch_name: New branch written by that mutation pass.
branches: The branch or branches that the run processes.
reference_branch: The comparison anchor for reports and, in practice, the branch that the run is organized around.
In a branch-chained workflow, these values should form a simple sequence rather than a branching tangle.
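One way to read "a simple sequence" is that each mutation pass starts from the branch the previous pass wrote, and no branch is written twice. The following sketch checks that property. The function name and the branch names in the example are hypothetical, and each pass is represented as a plain dict holding only the two branch keys.

```python
def is_simple_chain(passes: list[dict]) -> bool:
    """True if each pass starts from the branch the previous pass wrote."""
    for prev, cur in zip(passes, passes[1:]):
        if cur["fix_base_branch"] != prev["fix_branch_name"]:
            return False
    # No branch name may be written by more than one pass.
    written = [p["fix_branch_name"] for p in passes]
    return len(written) == len(set(written))


# Hypothetical branch names for a two-pass chain.
passes = [
    {"fix_base_branch": "main", "fix_branch_name": "step1_fix"},
    {"fix_base_branch": "step1_fix", "fix_branch_name": "pqc_apply"},
]
print(is_simple_chain(passes))  # True
```

If the second pass used "main" as its base instead, the function would return `False`, flagging the start of a branching tangle.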
The Most Important Run Keys¶
These are the first keys to understand.
home_dir: Root that contains the dataset tree.
dataset_name: Dataset directory or dataset identifier resolved under home_dir. In most single-pulsar setups this is a relative path under home_dir, not an independent root.
results_dir: Where output run directories are written. This is separate from the dataset tree and should remain separate.
branches: Which data-repo git branches to process. For a single-pulsar run, this should usually be a one-element list.
reference_branch: The branch used as reference for comparisons and often the branch the run is conceptually anchored to.
pulsars: Either "ALL" or an explicit list such as ["J1909-3744"]. For a single-pulsar setup, use a one-element list.
run_tempo2: Whether the fit stage runs.
run_fix_dataset: Whether FixDataset logic runs.
fix_apply: Whether FixDataset actually writes mutations to the dataset branch. When fix_apply = true, branch names and commit messages are part of the operational state of the run and should be chosen deliberately.
run_pqc: Whether PQC detectors run.
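The single-pulsar guidance for these keys can be sketched as a pre-flight check on a loaded run profile. This is an illustration, not pleb behavior: the profile is modeled as a flat dict, `single_pulsar_warnings` is a hypothetical name, and the fix_apply check reflects an assumption that an explicit fix_branch_name is the deliberate choice the text asks for.

```python
def single_pulsar_warnings(profile: dict) -> list[str]:
    """Return warnings for a run profile intended to cover exactly one pulsar."""
    warnings = []
    pulsars = profile.get("pulsars")
    # "ALL" is a string, so the isinstance test rejects it too.
    if not isinstance(pulsars, list) or len(pulsars) != 1:
        warnings.append("pulsars should be a one-element list")
    branches = profile.get("branches")
    if not isinstance(branches, list) or len(branches) != 1:
        warnings.append("branches should usually be a one-element list")
    # Assumption: a mutating run should name its output branch explicitly.
    if profile.get("fix_apply") and not profile.get("fix_branch_name"):
        warnings.append("fix_apply is true but fix_branch_name is not set")
    return warnings


# Hypothetical profile values; only the key names come from this page.
profile = {
    "home_dir": "/data/canonical",
    "dataset_name": "EPTA-DR3/epta-dr3-data",
    "results_dir": "results",
    "branches": ["main"],
    "reference_branch": "main",
    "pulsars": ["J1909-3744"],
    "run_tempo2": True,
    "run_fix_dataset": False,
    "fix_apply": False,
    "run_pqc": True,
}
```

Calling `single_pulsar_warnings(profile)` on this profile returns an empty list. Changing pulsars to `"ALL"` produces a single warning.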
What Not To Do¶
Avoid these patterns:
store experimental one-off absolute paths in shared catalogs,
modify a shared example config directly without copying it,
use pulsars = "ALL" in an initial single-pulsar setup,
mix the detect stage and apply stage without understanding the difference,
apply QC deletion before you understand the QC columns.
Minimal Naming Convention¶
Use names that expose stage and purpose:
single_pulsar_ingest.toml
single_pulsar_step1_fix.toml
single_pulsar_pqc_detect.toml
single_pulsar_pqc_apply.toml
single_pulsar_backend_profiles.toml
single_pulsar_3pass.toml
That makes later debugging much easier than ambiguous names such as
test.toml or config2.toml.
How The Config Layers Work Together¶
For a single-pulsar workflow, the interaction between files usually looks like this:
the ingest run profile points to an ingest mapping catalog,
the first FixDataset run profile points to variant catalogs and optional system/group rule tables,
the PQC detect run profile points to global pqc_* settings and optionally to pqc_backend_profiles_path,
the QC-apply run profile points back to the QC outputs from the detect run,
the workflow file, if used, sequences those run profiles and passes branch names from one stage to the next.
This is the operational reason the repo separates runs, catalogs, rules, and workflows.
A Minimal Mental Model¶
When deciding where to add a new setting or file, ask three questions:
Is this about one specific run invocation? Put it in a run profile.
Is this a reusable mapping or lookup table? Put it in a catalog.
Is this a reusable behavior choice? Put it in a rule file.
If the answer is instead “this describes the order in which several run profiles should execute,” the right place is a workflow file.