Ingest: Build The Canonical Pulsar Tree

This page explains how to get from a messy source-data layout to the canonical dataset tree that pleb expects.

Ingest comes first. If ingest is wrong, every later stage inherits a bad filesystem model.

What Ingest Does

Ingest reads an explicit mapping file and writes a standard per-pulsar layout.

It does not guess backend names. Backend names come from the mapping file.

According to Ingest Mapping, ingest writes:

  • Jxxxx+xxxx/Jxxxx+xxxx.par

  • Jxxxx+xxxx/Jxxxx+xxxx_all.tim

  • Jxxxx+xxxx/tims/TEL.BACKEND.CENFREQ.tim

  • Jxxxx+xxxx/tmplts/...
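As a rough illustration of the layout above, the following Python sketch builds the paths one would expect for a single pulsar. The function name expected_layout is hypothetical and not part of pleb, and it assumes each backend tim file is named after its backend key. The tmplts/ contents are left out because the page does not spell them out.

```python
from pathlib import PurePosixPath


def expected_layout(dataset_root, pulsar, backend_keys):
    """Return the canonical paths ingest is described as writing for one pulsar."""
    psr_dir = PurePosixPath(dataset_root) / pulsar
    paths = [
        psr_dir / f"{pulsar}.par",
        psr_dir / f"{pulsar}_all.tim",
    ]
    # Assumption: one tim file per backend key, named TEL.BACKEND.CENFREQ.tim.
    paths += [psr_dir / "tims" / f"{key}.tim" for key in backend_keys]
    return paths


if __name__ == "__main__":
    for path in expected_layout(
        "/data/canonical/EPTA-DR3/epta-dr3-data",
        "J1909-3744",
        ["EFF.P200.1360", "NRT.NUPPI.1480"],
    ):
        print(path)
```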

Two Files Are Required

For a normal ingest setup, create:

  1. a mapping JSON under configs/catalogs/ingest/,

  2. an ingest run profile under configs/runs/ingest/.

The Mapping JSON

The mapping file describes where source files live and how backend names are assigned.

Example: configs/catalogs/ingest/single_pulsar_mapping.json

Tracked repository example: configs/catalogs/ingest/single_pulsar_mapping.example.json

{
  "sources": [
    "/data/raw_release"
  ],
  "par_roots": [
    "/data/raw_release/par"
  ],
  "template_roots": [
    "/data/raw_release/templates"
  ],
  "pulsar_aliases": {
    "B1907-3744": "J1909-3744"
  },
  "backends": {
    "EFF.P200.1360": {
      "root": "/data/raw_release/tim/EFF/P200/1360",
      "tim_glob": "*.tim",
      "ignore_suffixes": ["_all.tim"]
    },
    "NRT.NUPPI.1480": {
      "root": "/data/raw_release/tim/NRT/NUPPI/1480",
      "tim_glob": "*.tim"
    }
  }
}

How To Read The Mapping Keys

sources

Informational root list. This records where material came from, but does not define backend identity.

par_roots

Directories where ingest looks for parfiles.

template_roots

Directories where template files live.

pulsar_aliases

Explicit alias map, usually B-name to J-name. If a name cannot be resolved cleanly, ingest should not guess.

backends

The most important part of the mapping. Each key is the canonical backend name that will be used later. Each backend entry must point to a real source directory.

tim_glob

The pattern used to find source tim files within that backend root.

ignore_suffixes

Filename suffixes to skip, such as the suffix of pre-existing aggregate _all.tim files.
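To make the interaction between root, tim_glob and ignore_suffixes concrete, here is a minimal Python sketch of how such a selection could work. It is not pleb's implementation. The function name select_tim_files is hypothetical, and the matching rules (non-recursive glob, plain suffix comparison on the filename) are assumptions.

```python
from pathlib import Path


def select_tim_files(root, tim_glob="*.tim", ignore_suffixes=()):
    """List files under root that match tim_glob, skipping ignored suffixes."""
    selected = []
    for path in sorted(Path(root).glob(tim_glob)):
        if not path.is_file():
            continue
        # Assumption: a file is skipped when its name ends with any listed suffix.
        if any(path.name.endswith(suffix) for suffix in ignore_suffixes):
            continue
        selected.append(path)
    return selected
```

With tim_glob set to "*.tim" and ignore_suffixes set to ["_all.tim"], a directory holding J1909-3744.tim and J1909-3744_all.tim would yield only the first file under these assumptions.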

Why The Backend Key Matters

The backend key is not just a label. It becomes part of the later dataset structure and later QC grouping logic.

If the mapping uses inconsistent backend names, later stages will inherit:

  • broken or misleading jump logic,

  • bad system grouping,

  • confusing PQC backend splits.

Rule:

The ingest mapping defines the canonical backend identity used by downstream stages.
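One way to catch inconsistent backend keys before they reach later stages is a small lint pass over the mapping. The sketch below is illustrative only. The regular expression encodes an assumed TEL.BACKEND.CENFREQ shape with an integer centre frequency, and check_backend_keys is a hypothetical helper, not a pleb function.

```python
import re

# Assumed key shape: TELESCOPE.BACKEND.CENFREQ, for example "EFF.P200.1360".
BACKEND_KEY = re.compile(r"^[A-Za-z0-9]+\.[A-Za-z0-9_-]+\.[0-9]+$")


def check_backend_keys(keys):
    """Return a list of human-readable problems found in backend keys."""
    problems = []
    seen = {}  # upper-cased key -> first spelling encountered
    for key in keys:
        if not BACKEND_KEY.match(key):
            problems.append(f"{key!r} is not of the form TEL.BACKEND.CENFREQ")
        folded = key.upper()
        if folded in seen:
            problems.append(f"{key!r} differs from {seen[folded]!r} only by case")
        else:
            seen[folded] = key
    return problems
```

An empty result does not prove the names are right. It only rules out malformed keys and case-only duplicates.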

The Ingest Run Profile

Example: configs/runs/ingest/single_pulsar_ingest.toml

Tracked repository example: configs/runs/ingest/single_pulsar_ingest.example.toml

ingest_mapping_file = "configs/catalogs/ingest/single_pulsar_mapping.json"
ingest_output_dir = "/data/canonical/EPTA-DR3/epta-dr3-data"

ingest_verify = true
ingest_commit_branch_name = "raw_ingest"
ingest_commit_base_branch = "ingest"
ingest_commit_message = "Ingest: single-pulsar import"

This mirrors the role of configs/runs/ingest/ingest_epta_data.toml.

How To Read The Ingest Run Keys

ingest_mapping_file

Path to the JSON mapping file.

ingest_output_dir

Where the canonical pulsar tree is written. Pipeline-style run profiles later reference this directory indirectly, as home_dir joined with dataset_name.

ingest_verify

Turn on ingest checks so mapping or alias errors fail early.

ingest_commit_branch_name

Branch that receives the ingested dataset state.

ingest_commit_base_branch

Existing branch that ingest starts from.

ingest_commit_message

Git commit message for the ingest mutation.

Where The Data Ends Up

Suppose:

  • home_dir = "/data/canonical"

  • dataset_name = "EPTA-DR3/epta-dr3-data"

Then the canonical dataset root later used by pipeline profiles is:

/data/canonical/EPTA-DR3/epta-dr3-data

Inside that tree, it should be possible to locate:

/data/canonical/EPTA-DR3/epta-dr3-data/J1909-3744/

and inside it:

J1909-3744.par
J1909-3744_all.tim
tims/
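The path arithmetic in this example can be written out as a short Python sketch. It assumes that home_dir and dataset_name combine by plain path joining, which matches the example above but is not confirmed here as pleb's exact logic. The function name dataset_root is hypothetical.

```python
from pathlib import PurePosixPath


def dataset_root(home_dir, dataset_name):
    """Join home_dir and dataset_name into the canonical dataset root."""
    return PurePosixPath(home_dir) / dataset_name


root = dataset_root("/data/canonical", "EPTA-DR3/epta-dr3-data")
print(root)                 # /data/canonical/EPTA-DR3/epta-dr3-data
print(root / "J1909-3744")  # /data/canonical/EPTA-DR3/epta-dr3-data/J1909-3744
```

The result should equal the ingest_output_dir value used in the ingest run profile. If the two differ, later pipeline profiles will point at a different tree than the one ingest wrote.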

How To Run Ingest

Run:

pleb ingest --config configs/runs/ingest/single_pulsar_ingest.toml

After it finishes, inspect the output tree before doing anything else.

What Ingest Does Not Do

Ingest standardizes the file layout. It does not by itself:

  • run tempo2,

  • insert jumps,

  • infer the final QC grouping policy,

  • decide which TOAs are suspicious,

  • apply FixDataset mutations beyond the ingest-specific commit step.

Those tasks happen later.

What To Check After Ingest

For one pulsar, verify all of the following manually:

  1. the pulsar directory exists,

  2. the pulsar parfile exists,

  3. the Jxxxx+xxxx_all.tim include file exists,

  4. backend tim files exist under tims/,

  5. backend filenames match the intended canonical names,

  6. there is only one parfile for the pulsar,

  7. aliases were resolved to the intended J-name.

If any of these checks fail, do not proceed to FixDataset.
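Checks 1 to 6 are mechanical enough to script. The sketch below is one possible helper and not a pleb command. The name check_pulsar_tree is hypothetical, and it assumes backend files are named after their backend keys. Check 7 (alias resolution) still needs a human comparison against the source release.

```python
from pathlib import Path


def check_pulsar_tree(dataset_root, pulsar, backend_keys):
    """Return a list of failed checks for one ingested pulsar directory."""
    psr_dir = Path(dataset_root) / pulsar
    if not psr_dir.is_dir():
        return [f"missing pulsar directory: {psr_dir}"]

    failures = []
    if not (psr_dir / f"{pulsar}.par").is_file():
        failures.append("missing parfile")
    if not (psr_dir / f"{pulsar}_all.tim").is_file():
        failures.append("missing _all.tim include file")

    parfiles = sorted(p.name for p in psr_dir.glob("*.par"))
    if len(parfiles) > 1:
        failures.append(f"more than one parfile: {parfiles}")

    tims_dir = psr_dir / "tims"
    found = {p.name for p in tims_dir.glob("*.tim")} if tims_dir.is_dir() else set()
    # Assumption: each backend file is named <backend key>.tim.
    expected = {f"{key}.tim" for key in backend_keys}
    if not found:
        failures.append("no backend tim files under tims/")
    for name in sorted(found - expected):
        failures.append(f"unexpected backend file: tims/{name}")
    return failures
```

The helper deliberately does not complain when a mapped backend has no file for this pulsar, since not every pulsar needs data from every backend.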

Common Ingest Mistakes

  • mapping the wrong source directory to a backend key,

  • forgetting a B-name to J-name alias,

  • accidentally ingesting an existing aggregate _all.tim,

  • using inconsistent backend keys for conceptually identical systems,

  • using a mapping file that was never reviewed against the source release tree.

Why This Stage Matters

Do not treat ingest as a clerical stage.

Ingest is the stage where raw source structure is translated into the naming and grouping model that later operations depend on.

Detailed Mapping Guidance

For more depth on mapping structure and failure conditions, see Ingest Mapping and the JSON schema at configs/schemas/ingest_mapping.schema.json.

When reviewing a mapping file, check three things in order:

  1. backend names are canonical and stable,

  2. pulsar aliases are complete enough to resolve all names,

  3. source roots do not accidentally include derived or aggregate files that should not be re-ingested.
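The second review item, alias completeness, can be approximated with a short Python sketch. It follows the "ingest should not guess" rule by raising on any name that is neither a J-name nor listed in pulsar_aliases. The J-name pattern and both function names are assumptions for illustration, not pleb's actual resolution logic.

```python
import re

# Assumed J-name shape, matching the Jxxxx+xxxx form used on this page.
J_NAME = re.compile(r"^J\d{4}[+-]\d{4}$")


def resolve_pulsar_name(name, aliases):
    """Return the J-name for a source name, or raise instead of guessing."""
    if J_NAME.match(name):
        return name
    if name in aliases:
        return aliases[name]
    raise KeyError(f"no alias for {name!r}; add it to pulsar_aliases")


def unresolved_names(names, aliases):
    """List source names that the alias map cannot resolve."""
    missing = []
    for name in names:
        try:
            resolve_pulsar_name(name, aliases)
        except KeyError:
            missing.append(name)
    return missing
```

Running unresolved_names over the pulsar names found in the source release gives a list of aliases still to be added. An empty list means the map is complete for that release under these assumptions.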

How Ingest Connects To The Next Stage

After a successful ingest run, the next stage is usually a FixDataset pass that starts from the branch ingest committed to. In practice, that means:

  • ingest writes the canonical tree,

  • ingest commits it to a branch such as raw_ingest,

  • the Step-1 FixDataset profile then sets fix_base_branch = "raw_ingest".

This is why the value of ingest_commit_branch_name matters even in a single-pulsar workflow.
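A mismatch between the two profiles is easy to miss by eye. As a hedged sketch, the check below compares the two keys named on this page after both profiles have been parsed into dictionaries. The helper name check_branch_handoff is hypothetical.

```python
def check_branch_handoff(ingest_profile, fix_profile):
    """Raise if FixDataset would not start from the branch ingest wrote."""
    written = ingest_profile["ingest_commit_branch_name"]
    base = fix_profile["fix_base_branch"]
    if written != base:
        raise ValueError(
            f"fix_base_branch is {base!r} but ingest commits to {written!r}"
        )
    return written
```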