
LUTE Configuration Database Specification

Date: 2024-02-12 | Version: v0.1

Basic Outline

  • The backend database is SQLite, accessed via the sqlite3 module of the Python standard library (a minimal connection sketch follows this list).
  • A high-level API is provided so that, if needed, the backend database can be changed without affecting Executor-level code.
  • For this iteration of the database, one LUTE database is created per working directory. Note that this database is independent of any database used by a workflow manager (e.g. Airflow) to manage task execution order.
  • Each database has the following tables:
    • 1 table for Executor configuration
    • 1 table for general task configuration (i.e., lute.io.config.AnalysisHeader)
    • 1 table PER Task
      • Executor and general configuration are shared between Task tables by pointing/linking to the entry ids in the two tables above.
      • Multiple experiments can reside in the same table, although in practice this is unlikely to occur in production, as the working directory will most likely change between experiments.
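As a minimal sketch of the per-working-directory arrangement, using only the standard library (the database filename here is a placeholder, not necessarily the name LUTE uses):

import sqlite3
from pathlib import Path

def connect_lute_db(work_dir: str) -> sqlite3.Connection:
    """Open (or create) the single LUTE database for a working directory."""
    db_path = Path(work_dir) / "lute.db"  # placeholder filename
    return sqlite3.connect(db_path)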

gen_cfg table

The general configuration table contains entries which may be shared between multiple Tasks. The format of the table is:

id | title                | experiment  | run | date       | lute_version | task_timeout
2  | "My experiment desc" | "EXPx00000" | 1   | YYYY/MM/DD | 0.1          | 6000

These parameters are extracted from the TaskParameters object. Each TaskParameters object contains an AnalysisHeader object stored in its lute_config attribute. For a given experimental run, this value is shared across all Tasks that are executed.
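As a rough sketch, the relevant fields of AnalysisHeader can be pictured as follows; the field types are assumptions inferred from the column descriptions below, not the actual model definition:

from pydantic import BaseModel

class AnalysisHeader(BaseModel):
    """Sketch of lute.io.config.AnalysisHeader; field types are assumed."""
    title: str          # description/title of the analysis purpose
    experiment: str     # LCLS experiment (or placeholder)
    run: int            # LCLS acquisition run (or placeholder)
    date: str           # date the configuration file was first set up
    lute_version: str   # version of the codebase executing Tasks
    task_timeout: int   # maximum Task runtime in seconds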

Column descriptions

Column       | Description
id           | ID of the entry in this table.
title        | Arbitrary description/title of the purpose of the analysis, e.g. what kind of experiment is being conducted.
experiment   | LCLS experiment. Can be a placeholder if debugging, etc.
run          | LCLS acquisition run. Can be a placeholder if debugging, testing, etc.
date         | Date the configuration file was first set up.
lute_version | Version of the codebase being used to execute Tasks.
task_timeout | The maximum amount of time in seconds that a Task can run before being cancelled.
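In SQL terms the table could be created as below; the column types are assumptions inferred from the example row, not the literal DDL LUTE emits:

import sqlite3

def create_gen_cfg(con: sqlite3.Connection) -> None:
    """Create the gen_cfg table if it does not exist (column types are guesses)."""
    con.execute(
        """CREATE TABLE IF NOT EXISTS gen_cfg (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               title TEXT,
               experiment TEXT,
               run INTEGER,
               date TEXT,
               lute_version TEXT,
               task_timeout INTEGER
           )"""
    )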

exec_cfg table

The Executor table contains information on the environment provided to the Executor for Task execution, the polling interval used for IPC between the Task and Executor, and the communicator protocols used for IPC. This information can be shared between Tasks or between experimental runs, but not every Task of a given run will necessarily use exactly the same Executor configuration and environment.

id | env                   | poll_interval | communicator_desc
2  | "VAR1=val1;VAR2=val2" | 0.1           | "PipeCommunicator...;SocketCommunicator..."

Column descriptions

Column            | Description
id                | ID of the entry in this table.
env               | Execution environment used by the Executor and, by proxy, any Tasks submitted by an Executor matching this entry. The environment is stored as a string with variables delimited by ";".
poll_interval     | Polling interval used for Task monitoring.
communicator_desc | Description of the Communicators used.

NOTE: The env column currently only stores variables related to SLURM or LUTE itself.
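Since the environment is stored as a single ";"-delimited string, round-tripping it could look like the sketch below (the helper names are hypothetical):

from typing import Dict

def env_to_str(env: Dict[str, str]) -> str:
    """Serialize an environment mapping to the ';'-delimited form used by exec_cfg."""
    return ";".join(f"{key}={val}" for key, val in env.items())

def str_to_env(env_str: str) -> Dict[str, str]:
    """Parse the ';'-delimited env string back into a mapping."""
    return dict(item.split("=", 1) for item in env_str.split(";") if item)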

Task tables

For every Task, a table of the following format will be created. The exact number of columns depends on the specific Task, as the number of parameters varies between them, and each parameter gets its own column. Within a table, multiple experiments and runs can coexist. The experiment and run are not recorded directly; instead, the gen_cfg_id and exec_cfg_id columns point to the ids of entries in the general configuration and Executor tables, respectively. The general configuration table entry contains the experiment and run information.

id | timestamp             | gen_cfg_id | exec_cfg_id | P1 | P2 | ... | Pn | result.task_status | result.summary | result.payload | result.impl_schemas | valid_flag
2  | "YYYY-MM-DD HH:MM:SS" | 1          | 1           | 1  | 2  | ... | 3  | "COMPLETED"        | "Summary"      | "XYZ"          | "schema1;schema3;"  | 1
3  | "YYYY-MM-DD HH:MM:SS" | 1          | 1           | 3  | 1  | ... | 4  | "FAILED"           | "Summary"      | "XYZ"          | "schema1;schema3;"  | 0

Parameter sets which can be described as nested dictionaries are flattened, with levels joined by a . to create column names. Parameters which are lists (or Python tuples, etc.) get one column per entry, with names that include an index (counting from 0). E.g. consider the following dictionary of parameters:

from typing import Any, Dict

param_dict: Dict[str, Any] = {
    "a": {               # First parameter a
        "b": (1, 2),
        "c": 1,
        # ...
    },
    "a2": 4,             # Second parameter a2
    # ...
}

The dictionary a will produce columns: a.b[0], a.b[1], a.c, and so on.
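A sketch of this flattening rule (the function name is hypothetical; the actual implementation in LUTE may differ):

from typing import Any, Dict

def flatten_params(params: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested parameters into {column_name: value} per the rule above."""
    flat: Dict[str, Any] = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=name))
        elif isinstance(value, (list, tuple)):
            for i, item in enumerate(value):
                flat[f"{name}[{i}]"] = item
        else:
            flat[name] = value
    return flat

# flatten_params(param_dict) -> {"a.b[0]": 1, "a.b[1]": 2, "a.c": 1, "a2": 4}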

Column descriptions

Column              | Description
id                  | ID of the entry in this table.
timestamp           | Full timestamp for the entry.
gen_cfg_id          | ID of the entry in the general config table that applies to this Task entry. That table contains, e.g., the experiment and run number.
exec_cfg_id         | ID of the entry in the Executor table which applies to this Task entry.
P1 - Pn             | The specific parameters of the Task. The P{1..n} are replaced by the actual parameter names.
result.task_status  | Reported exit status of the Task. Note that the output may still be labeled invalid by the valid_flag (see below).
result.summary      | Short text summary of the Task result. This is provided by the Task, or sometimes the Executor.
result.payload      | Full description of the result from the Task. If the object is incompatible with the database, this will instead be a pointer to where it can be found.
result.impl_schemas | A string of semicolon-separated schema(s) implemented by the Task. Schemas describe conceptually the type of output the Task produces.
valid_flag          | A boolean flag for whether the result is valid. May be 0 (False) if, e.g., data is missing or corrupt, or the reported status is failed.

NOTE: The result.payload may be distinct from the output files. Payloads can be specified in terms of output parameters or specific output files, or may be an optional summary of the results provided by the Task. E.g. this may include graphical descriptions of results (plots, figures, etc.). In many cases, however, the output files will most likely be pointed to by a parameter in one of the columns P{1...n}; if properly specified in the TaskParameters model, the value of this output parameter will be replicated in the result.payload column as well.
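To illustrate how the pointer columns are used when reading back results, a query might join a Task table against gen_cfg to recover the experiment and run; the Task table name mytask below is hypothetical:

import sqlite3
from typing import Optional

def latest_valid_result(con: sqlite3.Connection) -> Optional[sqlite3.Row]:
    """Fetch the newest valid entry of a hypothetical 'mytask' Task table,
    joined with gen_cfg to recover the experiment and run."""
    con.row_factory = sqlite3.Row
    cur = con.execute(
        """SELECT t.*, g.experiment, g.run
           FROM mytask AS t
           JOIN gen_cfg AS g ON t.gen_cfg_id = g.id
           WHERE t.valid_flag = 1
           ORDER BY t.id DESC
           LIMIT 1"""
    )
    return cur.fetchone()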

API

This API is intended to be used at the Executor level, with some calls intended to provide default values for Pydantic models. Utilities for reading and inspecting the database outside of normal Task execution are covered under Utilities below.

Write

  • record_analysis_db(cfg: DescribedAnalysis) -> None: Writes the configuration to the backend database.
  • ...
  • ...
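A usage sketch for the write call (the import path is an assumption; the spec does not pin down the module layout):

from lute.io.db import record_analysis_db  # hypothetical import path

# After a Task completes, the Executor persists its configuration and result:
record_analysis_db(cfg)  # cfg: DescribedAnalysis for the finished Task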

Read

  • read_latest_db_entry(db_dir: str, task_name: str, param: str) -> Any: Retrieve the most recent entry from a database for a specific Task.
  • ...
  • ...
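And for the read call, e.g. to provide a default value for a downstream Task's Pydantic model (the Task and parameter names here are hypothetical):

from lute.io.db import read_latest_db_entry  # hypothetical import path

# Retrieve the most recent value of one parameter of a previously run Task:
out_file = read_latest_db_entry(
    db_dir=".",          # working directory containing the LUTE database
    task_name="MyTask",  # hypothetical Task name
    param="out_file",    # hypothetical parameter/column to retrieve
)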

Utilities

Scripts

  • invalidate_entry: Marks a database entry as invalid. A common reason to use this is that data has been deleted or found to be corrupt (see the sketch after this list).
  • ...
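At the SQL level, invalidation amounts to clearing the valid_flag of a row; a sketch of that operation, with a hypothetical Task table name (not necessarily how the script is implemented):

import sqlite3

def invalidate(con: sqlite3.Connection, task_table: str, entry_id: int) -> None:
    """Set valid_flag = 0 for one entry; mirrors what invalidate_entry does conceptually."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not task_table.isidentifier():
        raise ValueError(f"Bad table name: {task_table}")
    con.execute(f"UPDATE {task_table} SET valid_flag = 0 WHERE id = ?", (entry_id,))
    con.commit()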

TUI and GUI

  • dbview: TUI for database inspection. Read only.
  • ...