Ingest: Build The Canonical Pulsar Tree
=======================================

This page explains how to get from a messy source-data layout to the canonical
dataset tree that ``pleb`` expects.

Ingest comes first. If ingest is wrong, every later stage is downstream of a
bad filesystem model.

What Ingest Does
----------------

Ingest reads an explicit mapping file and writes a standard per-pulsar layout.
It does not guess backend names. Backend names come from the mapping file.

According to :doc:`../ingest`, ingest writes:

- ``Jxxxx+xxxx/Jxxxx+xxxx.par``
- ``Jxxxx+xxxx/Jxxxx+xxxx_all.tim``
- ``Jxxxx+xxxx/tims/TEL.BACKEND.CENFREQ.tim``
- ``Jxxxx+xxxx/tmplts/...``

Two Files Are Required
----------------------

For a normal ingest setup, create:

1. a mapping JSON under ``configs/catalogs/ingest/``,
2. an ingest run profile under ``configs/runs/ingest/``.

The Mapping JSON
----------------

The mapping file describes where source files live and how backend names are
assigned.

Example: ``configs/catalogs/ingest/single_pulsar_mapping.json``

Tracked repository example:
``configs/catalogs/ingest/single_pulsar_mapping.example.json``

.. code-block:: json

   {
     "sources": [
       "/data/raw_release"
     ],
     "par_roots": [
       "/data/raw_release/par"
     ],
     "template_roots": [
       "/data/raw_release/templates"
     ],
     "pulsar_aliases": {
       "B1907-3744": "J1909-3744"
     },
     "backends": {
       "EFF.P200.1360": {
         "root": "/data/raw_release/tim/EFF/P200/1360",
         "tim_glob": "*.tim",
         "ignore_suffixes": ["_all.tim"]
       },
       "NRT.NUPPI.1480": {
         "root": "/data/raw_release/tim/NRT/NUPPI/1480",
         "tim_glob": "*.tim"
       }
     }
   }

How To Read The Mapping Keys
----------------------------

``sources``
    Informational root list. This records where material came from, but does
    not define backend identity.

``par_roots``
    Directories where ingest looks for parfiles.

``template_roots``
    Directories where template files live.

``pulsar_aliases``
    Explicit alias map, usually B-name to J-name. If a name cannot be resolved
    cleanly, ingest should not guess.

``backends``
    The most important part of the mapping. Each key is the canonical backend
    name that will be used later. Each backend entry must point to a real
    source directory.

``tim_glob``
    The pattern used to find source tim files within that backend root.

``ignore_suffixes``
    A way to skip known files such as pre-existing aggregate ``_all.tim``
    files.

Why The Backend Key Matters
---------------------------

The backend key is not just a label. It becomes part of the later dataset
structure and later QC grouping logic.

If the mapping uses inconsistent backend names, later stages will inherit:

- broken or misleading jump logic,
- bad system grouping,
- confusing PQC backend splits.

Rule: The ingest mapping defines the canonical backend identity used by
downstream stages.

The Ingest Run Profile
----------------------

Example: ``configs/runs/ingest/single_pulsar_ingest.toml``

Tracked repository example:
``configs/runs/ingest/single_pulsar_ingest.example.toml``

.. code-block:: toml

   ingest_mapping_file = "configs/catalogs/ingest/single_pulsar_mapping.json"
   ingest_output_dir = "/data/canonical/EPTA-DR3/epta-dr3-data"
   ingest_verify = true
   ingest_commit_branch_name = "raw_ingest"
   ingest_commit_base_branch = "ingest"
   ingest_commit_message = "Ingest: single-pulsar import"

This mirrors the role of ``configs/runs/ingest/ingest_epta_data.toml``.

How To Explain Each Ingest Run Key
----------------------------------

``ingest_mapping_file``
    Path to the JSON mapping file.

``ingest_output_dir``
    Where the canonical pulsar tree is written. This is the directory later
    referenced indirectly by ``home_dir`` and ``dataset_name`` in
    pipeline-style run profiles.

``ingest_verify``
    Turn on ingest checks so mapping or alias errors fail early.

``ingest_commit_branch_name``
    Branch that receives the ingested dataset state.

``ingest_commit_base_branch``
    Existing branch ingest starts from.

``ingest_commit_message``
    Git commit message for the ingest mutation.
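
Before running ingest, it can help to review the two files together. The
following is a minimal Python sketch of such a review, not part of ``pleb``.
The function names ``check_mapping`` and ``check_run_profile``, the regular
expression for backend keys, and the choice of which run keys to treat as
required are all assumptions made for illustration. The only pattern taken
from this page is the ``TEL.BACKEND.CENFREQ`` naming of backend tim files.
``tomllib`` requires Python 3.11 or newer.

.. code-block:: python

   import json
   import re
   import tomllib  # standard library from Python 3.11 onward

   # Assumption: backend keys have three dot-separated parts, matching the
   # TEL.BACKEND.CENFREQ filenames described above. This regex is
   # illustrative and is not taken from pleb itself.
   BACKEND_KEY = re.compile(r"^[A-Za-z0-9]+\.[A-Za-z0-9_-]+\.[0-9]+$")

   # Assumption: these keys from the example profile are treated as required.
   REQUIRED_RUN_KEYS = (
       "ingest_mapping_file",
       "ingest_output_dir",
       "ingest_commit_branch_name",
       "ingest_commit_base_branch",
   )


   def check_mapping(mapping: dict) -> list[str]:
       """Return a list of problems found in a parsed mapping dict."""
       problems = []
       backends = mapping.get("backends", {})
       if not backends:
           problems.append("mapping defines no backends")
       for key, entry in backends.items():
           if not BACKEND_KEY.match(key):
               problems.append(f"backend key {key!r} is not TEL.BACKEND.CENFREQ")
           if "root" not in entry:
               problems.append(f"backend {key!r} has no root directory")
       return problems


   def check_run_profile(profile: dict) -> list[str]:
       """Return a list of run keys missing from a parsed run profile."""
       return [f"missing run key {k!r}" for k in REQUIRED_RUN_KEYS if k not in profile]


   if __name__ == "__main__":
       # Hypothetical inline samples; real files would be read from disk.
       mapping = json.loads(
           '{"backends": {"EFF.P200.1360": {"root": "/data/raw_release/tim/EFF/P200/1360"}}}'
       )
       profile = tomllib.loads('ingest_mapping_file = "mapping.json"')
       for problem in check_mapping(mapping) + check_run_profile(profile):
           print(problem)

A check like this only catches structural slips. It does not replace
``ingest_verify`` or a manual review of the mapping against the source release
tree.
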

Where The Data Ends Up
----------------------

Suppose:

- ``home_dir = "/data/canonical"``
- ``dataset_name = "EPTA-DR3/epta-dr3-data"``

Then the canonical dataset root later used by pipeline profiles is:

.. code-block:: text

   /data/canonical/EPTA-DR3/epta-dr3-data

Inside that tree, it should be possible to locate:

.. code-block:: text

   /data/canonical/EPTA-DR3/epta-dr3-data/J1909-3744/

and inside it:

.. code-block:: text

   J1909-3744.par
   J1909-3744_all.tim
   tims/

How To Run Ingest
-----------------

Run:

.. code-block:: bash

   pleb ingest --config configs/runs/ingest/single_pulsar_ingest.toml

After it finishes, inspect the output tree before doing anything else.

What Ingest Does Not Do
-----------------------

Ingest standardizes the file layout. It does not by itself:

- run tempo2,
- insert jumps,
- infer the final QC grouping policy,
- decide which TOAs are suspicious,
- apply FixDataset mutations beyond the ingest-specific commit step.

Those tasks happen later.

What To Check After Ingest
--------------------------

For one pulsar, verify all of the following manually:

1. the pulsar directory exists,
2. the pulsar parfile exists,
3. the ``Jxxxx_all.tim`` include file exists,
4. backend tim files exist under ``tims/``,
5. backend filenames match the intended canonical names,
6. there is only one parfile for the pulsar,
7. aliases were resolved to the intended J-name.

If any of these checks fail, do not proceed to FixDataset.

Common Ingest Mistakes
----------------------

- mapping the wrong source directory to a backend key,
- forgetting a B-name to J-name alias,
- accidentally ingesting an existing aggregate ``_all.tim``,
- using inconsistent backend keys for conceptually identical systems,
- using a mapping file that was never reviewed against the source release tree.

Why This Stage Matters
----------------------

Do not treat ingest as a clerical stage.
Ingest is the stage where raw source structure is translated into the naming
and grouping model that later operations depend on.

Detailed Mapping Guidance
-------------------------

For more depth on mapping structure and failure conditions, see
:doc:`../ingest` and the JSON schema at
``configs/schemas/ingest_mapping.schema.json``.

When reviewing a mapping file, check three things in order:

1. backend names are canonical and stable,
2. pulsar aliases are complete enough to resolve all names,
3. source roots do not accidentally include derived or aggregate files that
   should not be re-ingested.

How Ingest Connects To The Next Stage
-------------------------------------

After a successful ingest run, the next stage is usually a FixDataset pass that
starts from the ingest branch.

In practice, that means:

- ingest writes the canonical tree,
- ingest commits it to a branch such as ``raw_ingest``,
- the Step-1 FixDataset profile then sets ``fix_base_branch = "raw_ingest"``.

This is why the ingest branch name matters even in a single-pulsar workflow.

Related Documentation
---------------------

- ingest mode overview and schema: :doc:`../ingest`
- CLI entry point details: :doc:`../cli`
- mode selection and compatibility notes: :doc:`../running_modes`