Ingest: Build The Canonical Pulsar Tree¶

This page explains how to get from a messy source-data layout to the canonical dataset tree that pleb expects.

Ingest comes first. If ingest is wrong, every later stage inherits a bad filesystem model.

What Ingest Does¶

Ingest reads an explicit mapping file and writes a standard per-pulsar layout.

It does not guess backend names. Backend names come from the mapping file.

According to Ingest Mapping, ingest writes:

Jxxxx+xxxx/Jxxxx+xxxx.par
Jxxxx+xxxx/Jxxxx+xxxx_all.tim
Jxxxx+xxxx/tims/TEL.BACKEND.CENFREQ.tim
Jxxxx+xxxx/tmplts/...
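
As a rough illustration of that layout, the sketch below builds the expected paths for one pulsar. The helper name canonical_paths and its arguments are hypothetical and are not part of pleb; only the path patterns are taken from the list above.

```python
from pathlib import PurePosixPath


def canonical_paths(dataset_root, pulsar, backends):
    """Return the paths the canonical layout implies for one pulsar.

    Hypothetical helper for illustration only; not a pleb API.
    """
    psr_dir = PurePosixPath(dataset_root) / pulsar
    return {
        "par": psr_dir / f"{pulsar}.par",
        "all_tim": psr_dir / f"{pulsar}_all.tim",
        # One tim file per canonical backend key (TEL.BACKEND.CENFREQ).
        "tims": [psr_dir / "tims" / f"{backend}.tim" for backend in backends],
        "templates": psr_dir / "tmplts",
    }


paths = canonical_paths("/data/canonical", "J1909-3744", ["EFF.P200.1360"])
print(paths["tims"][0])  # /data/canonical/J1909-3744/tims/EFF.P200.1360.tim
```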

Two Files Are Required¶

For a normal ingest setup, create:

a mapping JSON under configs/catalogs/ingest/,
an ingest run profile under configs/runs/ingest/.

The Mapping JSON¶

The mapping file describes where source files live and how backend names are assigned.

Example: configs/catalogs/ingest/single_pulsar_mapping.json

Tracked repository example: configs/catalogs/ingest/single_pulsar_mapping.example.json

{
  "sources": [
    "/data/raw_release"
  ],
  "par_roots": [
    "/data/raw_release/par"
  ],
  "template_roots": [
    "/data/raw_release/templates"
  ],
  "pulsar_aliases": {
    "B1907-3744": "J1909-3744"
  },
  "backends": {
    "EFF.P200.1360": {
      "root": "/data/raw_release/tim/EFF/P200/1360",
      "tim_glob": "*.tim",
      "ignore_suffixes": ["_all.tim"]
    },
    "NRT.NUPPI.1480": {
      "root": "/data/raw_release/tim/NRT/NUPPI/1480",
      "tim_glob": "*.tim"
    }
  }
}

How To Read The Mapping Keys¶

sources: Informational root list. This records where material came from, but does not define backend identity.
par_roots: Directories where ingest looks for parfiles.
template_roots: Directories where template files live.
pulsar_aliases: Explicit alias map, usually B-name to J-name. If a name cannot be resolved cleanly, ingest should not guess.
backends: The most important part of the mapping. Each key is the canonical backend name that will be used later. Each backend entry must point to a real source directory.
tim_glob: The pattern used to find source tim files within that backend root.
ignore_suffixes: Filename suffixes to skip within that backend root, such as pre-existing aggregate _all.tim files.
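
To make the interplay of tim_glob and ignore_suffixes concrete, here is a minimal sketch of one plausible selection rule. The function select_tim_files is hypothetical, and the matching semantics (shell-style glob on the filename, then a plain suffix test) are an assumption; pleb's actual matching may differ.

```python
import fnmatch


def select_tim_files(filenames, tim_glob="*.tim", ignore_suffixes=()):
    """Keep names matching tim_glob, dropping any that end with an ignored suffix.

    Assumed semantics for illustration; not taken from pleb's source.
    """
    kept = []
    for name in sorted(filenames):
        if not fnmatch.fnmatchcase(name, tim_glob):
            continue  # not a tim file according to the glob
        if any(name.endswith(suffix) for suffix in ignore_suffixes):
            continue  # e.g. a pre-existing aggregate _all.tim
        kept.append(name)
    return kept


names = ["J1909-3744.tim", "J1909-3744_all.tim", "README.txt"]
selected = select_tim_files(names, "*.tim", ["_all.tim"])
print(selected)  # ['J1909-3744.tim']
```

Without the ignore_suffixes entry, the same call would also return J1909-3744_all.tim, which is the accidental re-ingest case the key is meant to prevent.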

Why The Backend Key Matters¶

The backend key is not just a label. It becomes part of the later dataset structure and later QC grouping logic.

If the mapping uses inconsistent backend names, later stages will inherit:

broken or misleading jump logic,
bad system grouping,
confusing PQC backend splits.

Rule:

The ingest mapping defines the canonical backend identity used by downstream stages.
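
One way to catch inconsistent backend names before they propagate is to look for keys that differ only in case or separator. The sketch below is a hypothetical pre-flight check, not a pleb feature, and the normalization rule (uppercase, with "_" and "-" treated as ".") is an assumption chosen for illustration.

```python
from collections import defaultdict


def find_near_duplicate_keys(backend_keys):
    """Group backend keys that collapse to the same normalized form.

    Hypothetical check; the normalization rule is an assumption.
    """
    groups = defaultdict(list)
    for key in backend_keys:
        normalized = key.upper().replace("_", ".").replace("-", ".")
        groups[normalized].append(key)
    # Only groups with more than one spelling are suspicious.
    return {norm: keys for norm, keys in groups.items() if len(keys) > 1}


keys = ["EFF.P200.1360", "eff.P200.1360", "NRT.NUPPI.1480"]
clashes = find_near_duplicate_keys(keys)
print(clashes)  # {'EFF.P200.1360': ['EFF.P200.1360', 'eff.P200.1360']}
```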

The Ingest Run Profile¶

Example: configs/runs/ingest/single_pulsar_ingest.toml

Tracked repository example: configs/runs/ingest/single_pulsar_ingest.example.toml

ingest_mapping_file = "configs/catalogs/ingest/single_pulsar_mapping.json"
ingest_output_dir = "/data/canonical/EPTA-DR3/epta-dr3-data"

ingest_verify = true
ingest_commit_branch_name = "raw_ingest"
ingest_commit_base_branch = "ingest"
ingest_commit_message = "Ingest: single-pulsar import"

This mirrors the role of configs/runs/ingest/ingest_epta_data.toml.

How To Explain Each Ingest Run Key¶

ingest_mapping_file: Path to the JSON mapping file.
ingest_output_dir: Where the canonical pulsar tree is written. Pipeline-style run profiles later refer to this same directory indirectly, as home_dir combined with dataset_name.
ingest_verify: Turn on ingest checks so mapping or alias errors fail early.
ingest_commit_branch_name: Branch that receives the ingested dataset state.
ingest_commit_base_branch: Existing branch ingest starts from.
ingest_commit_message: Git commit message for the ingest mutation.

Where The Data Ends Up¶

Suppose:

home_dir = "/data/canonical"
dataset_name = "EPTA-DR3/epta-dr3-data"

Then the canonical dataset root later used by pipeline profiles is:

/data/canonical/EPTA-DR3/epta-dr3-data

Inside that tree, each ingested pulsar has its own directory, for example:

/data/canonical/EPTA-DR3/epta-dr3-data/J1909-3744/

and inside it:

J1909-3744.par
J1909-3744_all.tim
tims/
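
The relationship between the two profile styles can be checked with a few lines of path arithmetic. This is only a sketch using the example values from this page; it assumes home_dir and dataset_name are combined by plain path joining.

```python
from pathlib import PurePosixPath

# Values from the pipeline-style profile.
home_dir = "/data/canonical"
dataset_name = "EPTA-DR3/epta-dr3-data"

# Value from the ingest run profile.
ingest_output_dir = "/data/canonical/EPTA-DR3/epta-dr3-data"

# Assumption: the dataset root is home_dir joined with dataset_name.
dataset_root = PurePosixPath(home_dir) / dataset_name
pulsar_dir = dataset_root / "J1909-3744"

print(dataset_root == PurePosixPath(ingest_output_dir))  # True
print(pulsar_dir)  # /data/canonical/EPTA-DR3/epta-dr3-data/J1909-3744
```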

How To Run Ingest¶

Run:

pleb ingest --config configs/runs/ingest/single_pulsar_ingest.toml

After it finishes, inspect the output tree before doing anything else.

What Ingest Does Not Do¶

Ingest standardizes the file layout. It does not by itself:

run tempo2,
insert jumps,
infer the final QC grouping policy,
decide which TOAs are suspicious,
apply FixDataset mutations beyond the ingest-specific commit step.

Those tasks happen later.

What To Check After Ingest¶

For one pulsar, verify all of the following manually:

the pulsar directory exists,
the pulsar parfile exists,
the Jxxxx+xxxx_all.tim include file exists,
backend tim files exist under tims/,
backend filenames match the intended canonical names,
there is only one parfile for the pulsar,
aliases were resolved to the intended J-name.

If any of these checks fail, do not proceed to FixDataset.
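
Most of the checklist above can be scripted once it has been understood manually. The following is a minimal sketch, not a pleb command: check_pulsar_tree is a hypothetical helper, and the demo builds a throwaway tree in a temporary directory rather than touching real data.

```python
import tempfile
from pathlib import Path


def check_pulsar_tree(dataset_root, pulsar, expected_backends):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration; not part of pleb.
    """
    psr_dir = Path(dataset_root) / pulsar
    if not psr_dir.is_dir():
        return [f"missing pulsar directory: {pulsar}"]
    problems = []
    # Exactly one parfile, named after the J-name.
    parfiles = sorted(p.name for p in psr_dir.glob("*.par"))
    if parfiles != [f"{pulsar}.par"]:
        problems.append(f"expected only {pulsar}.par, found {parfiles}")
    if not (psr_dir / f"{pulsar}_all.tim").is_file():
        problems.append(f"missing {pulsar}_all.tim")
    # Backend tim filenames must match the canonical backend keys.
    tims_dir = psr_dir / "tims"
    found = {p.stem for p in tims_dir.glob("*.tim")} if tims_dir.is_dir() else set()
    for name in sorted(set(expected_backends) - found):
        problems.append(f"missing backend tim: {name}.tim")
    for name in sorted(found - set(expected_backends)):
        problems.append(f"unexpected backend tim: {name}.tim")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    psr = Path(tmp) / "J1909-3744"
    (psr / "tims").mkdir(parents=True)
    (psr / "J1909-3744.par").touch()
    (psr / "J1909-3744_all.tim").touch()
    (psr / "tims" / "EFF.P200.1360.tim").touch()
    problems = check_pulsar_tree(
        tmp, "J1909-3744", ["EFF.P200.1360", "NRT.NUPPI.1480"]
    )

print(problems)  # ['missing backend tim: NRT.NUPPI.1480.tim']
```

The sketch does not cover alias resolution; confirming that a B-name source ended up under the intended J-name directory still needs a look at the source release.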

Common Ingest Mistakes¶

mapping the wrong source directory to a backend key,
forgetting a B-name to J-name alias,
accidentally ingesting an existing aggregate _all.tim,
using inconsistent backend keys for conceptually identical systems,
using a mapping file that was never reviewed against the source release tree.

Why This Stage Matters¶

Do not treat ingest as a clerical stage.

Ingest is the stage where raw source structure is translated into the naming and grouping model that later operations depend on.

Detailed Mapping Guidance¶

For more depth on mapping structure and failure conditions, see Ingest Mapping and the JSON schema at configs/schemas/ingest_mapping.schema.json.

When reviewing a mapping file, check three things in order:

backend names are canonical and stable,
pulsar aliases are complete enough to resolve all names,
source roots do not accidentally include derived or aggregate files that should not be re-ingested.
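
The third check can be partly automated by asking which backend entries would let an aggregate file through. The sketch below is hypothetical and assumes the same glob-then-suffix semantics as the mapping keys describe; the probe filename is just an example, and the root values are elided because they do not affect the check.

```python
import fnmatch


def unguarded_backends(backends, probe="J1909-3744_all.tim"):
    """List backend keys whose tim_glob matches an aggregate filename
    that no ignore_suffixes entry filters out.

    Hypothetical review aid; not part of pleb.
    """
    flagged = []
    for key, entry in backends.items():
        matches = fnmatch.fnmatchcase(probe, entry["tim_glob"])
        ignored = any(
            probe.endswith(suffix) for suffix in entry.get("ignore_suffixes", [])
        )
        if matches and not ignored:
            flagged.append(key)
    return flagged


backends = {
    "EFF.P200.1360": {
        "root": "...",
        "tim_glob": "*.tim",
        "ignore_suffixes": ["_all.tim"],
    },
    "NRT.NUPPI.1480": {"root": "...", "tim_glob": "*.tim"},
}
flagged = unguarded_backends(backends)
print(flagged)  # ['NRT.NUPPI.1480']
```

A flagged backend is only a real problem if its root actually contains an aggregate file, so treat the result as a prompt to inspect that directory rather than as an error.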

How Ingest Connects To The Next Stage¶

After a successful ingest run, the next stage is usually a FixDataset pass that starts from the branch ingest committed to. In practice, that means:

ingest writes the canonical tree,
ingest commits it to a branch such as raw_ingest,
the Step-1 FixDataset profile then sets fix_base_branch = "raw_ingest".

This is why the value of ingest_commit_branch_name matters even in a single-pulsar workflow.
