XILP001 Script Parser
src.xil_pipeline.XILP001_script_parser
Parse markdown production scripts into structured JSON.
Converts podcast scripts from markdown format into sequence-numbered entries suitable for voice generation.
Module Attributes
KNOWN_SPEAKERS: Ordered list of speaker names (longest-first for matching).
SPEAKER_KEYS: Mapping from display names to normalized keys.
SECTION_MAP: Mapping from section header text to URL-safe slugs.
DIRECTION_TYPES: Recognized direction subtypes for stage directions.
SECTION_MAP
module-attribute
SECTION_MAP = {'COLD OPEN': 'cold-open', 'OPENING CREDITS': 'opening-credits', 'CHAPTER ONE': 'chapter1', 'CHAPTER 1': 'chapter1', 'CHAPTER TWO': 'chapter2', 'CHAPTER 2': 'chapter2', 'CHAPTER THREE': 'chapter3', 'CHAPTER 3': 'chapter3', 'ACT ONE': 'act1', 'ACT 1': 'act1', 'ACT TWO': 'act2', 'ACT 2': 'act2', 'ACT THREE': 'act3', 'ACT 3': 'act3', 'ACT FOUR': 'act4', 'ACT 4': 'act4', 'MID-EPISODE BREAK': 'mid-break', 'CLOSING': 'closing', 'CLOSING — RADIO STATION': 'closing', "CLOSING — ADAM'S SIGN-OFF": 'closing', 'CLOSING — ADAM’S SIGN-OFF': 'closing', 'POST-INTERVIEW': 'post-interview', 'POST-INTERVIEW: ADAM & TINA': 'post-interview', 'POST-CREDITS SCENE': 'post-credits', "DEZ'S CLOSING NARRATION": 'dez-closing', 'DEZ’S CLOSING NARRATION': 'dez-closing', 'PRODUCTION NOTES': 'production-notes', 'PREAMBLE': 'preamble', 'POSTAMBLE': 'postamble'}
PODCAST_SECTIONS
module-attribute
PODCAST_SECTIONS: dict[str, str] = {'COLD OPEN': 'cold-open', 'OPENING CREDITS': 'opening-credits', 'ACT ONE': 'act1', 'ACT 1': 'act1', 'ACT TWO': 'act2', 'ACT 2': 'act2', 'ACT THREE': 'act3', 'ACT 3': 'act3', 'ACT FOUR': 'act4', 'ACT 4': 'act4', 'MID-EPISODE BREAK': 'mid-break', 'CLOSING': 'closing', 'POST-CREDITS SCENE': 'post-credits', 'INTRO': 'intro', 'OUTRO': 'outro', 'PREAMBLE': 'preamble', 'POSTAMBLE': 'postamble'}
AUDIOBOOK_SECTIONS
module-attribute
AUDIOBOOK_SECTIONS: dict[str, str] = {'PROLOGUE': 'prologue', 'EPILOGUE': 'epilogue', "AUTHOR'S NOTE": 'authors-note', 'AUTHOR’S NOTE': 'authors-note', **_AUDIOBOOK_CHAPTERS}
DRAMA_SECTIONS
module-attribute
DRAMA_SECTIONS: dict[str, str] = {'PROLOGUE': 'prologue', 'EPILOGUE': 'epilogue', 'INTERMISSION': 'intermission', 'ACT ONE': 'act1', 'ACT 1': 'act1', 'ACT TWO': 'act2', 'ACT 2': 'act2', 'ACT THREE': 'act3', 'ACT 3': 'act3', 'ACT FOUR': 'act4', 'ACT 4': 'act4', 'COLD OPEN': 'cold-open', 'CLOSING': 'closing', 'POST-CREDITS SCENE': 'post-credits'}
SPECIAL_SECTIONS
module-attribute
SPECIAL_SECTIONS: dict[str, str] = {**PODCAST_SECTIONS, **AUDIOBOOK_SECTIONS, **DRAMA_SECTIONS, **{f'SEGMENT {n}': f'segment{n}' for n in range(1, 16)}}
DIRECTION_TYPES
module-attribute
extract_cast_from_script
Extract cast members from the CAST: block in a script header.
Parses bullet-point entries of the form:
CAST:
* ADAM — Host/Narrator
* MR. PATTERSON — Recurring Caller
* DETECTIVE NORA WALSH — New this episode
Each entry is converted to {"display": str, "key": str}. Role descriptions after —, –, -, or ( are stripped.
Parameters:
Returns:
- list[dict] – List of {"display": str, "key": str} dicts; empty when no CAST: block is present.
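The CAST-block extraction described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the name `sketch_extract_cast`, the requirement that a separator dash be surrounded by whitespace, and the key-derivation rule (lowercase, non-alphanumerics collapsed to `_`) are all assumptions.

```python
import re

# Assumed separator rule: a dash (em, en, or hyphen) surrounded by whitespace,
# or an opening parenthesis, starts the role description.
_ROLE_SPLIT = re.compile(r"\s+[\u2014\u2013-]\s+|\s*\(")


def sketch_extract_cast(script_text: str) -> list[dict]:
    """Collect bullet entries that follow a 'CAST:' line (hypothetical helper)."""
    entries = []
    in_cast = False
    for raw in script_text.splitlines():
        line = raw.strip()
        if line == "CAST:":
            in_cast = True
            continue
        if not in_cast:
            continue
        if not line.startswith(("* ", "- ")):
            if line:  # first non-blank, non-bullet line ends the block
                break
            continue
        display = _ROLE_SPLIT.split(line[2:].strip(), maxsplit=1)[0].strip()
        # Assumed key derivation: lowercase, runs of non-alphanumerics become "_".
        key = re.sub(r"[^a-z0-9]+", "_", display.lower()).strip("_")
        entries.append({"display": display, "key": key})
    return entries
```

With these assumptions, "MR. PATTERSON" yields the key "mr_patterson", which matches the speakers.json example shown under load_speakers.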
Source code in src/xil_pipeline/XILP001_script_parser.py
load_speakers
load_speakers(path: str | None = None, cast_entries: list[dict] | None = None) -> tuple[list[str], dict[str, str]]
Load speaker definitions, merging CAST-block entries with speakers.json.
Resolution order:
1. cast_entries: speakers declared in the script's CAST: block (see extract_cast_from_script); auto-derived keys are used unless overridden by speakers.json.
2. Speakers from path / configs/{slug}/speakers.json / CWD speakers.json; JSON keys always win over auto-derived keys, and new JSON entries are appended.
3. Built-in _BUILTIN_KNOWN_SPEAKERS / _BUILTIN_SPEAKER_KEYS, used only when neither cast_entries nor a JSON file is available.
The JSON file is an array of objects with display and key fields:
[
{"display": "ADAM", "key": "adam"},
{"display": "MR. PATTERSON", "key": "mr_patterson"}
]
The returned list is automatically sorted longest-first so compound names match before short ones.
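The merge and the longest-first ordering can be sketched as below. This is an illustration under assumptions: `sketch_merge_speakers` is a hypothetical name, file discovery is left out, and the inputs are passed in as already-loaded lists.

```python
def sketch_merge_speakers(cast_entries, json_entries, builtin_keys):
    """Merge CAST-block and JSON speakers, falling back to built-ins.

    cast_entries / json_entries: lists of {"display": ..., "key": ...} or None.
    builtin_keys: display-name -> key mapping used only as a last resort.
    """
    if not cast_entries and not json_entries:
        merged = dict(builtin_keys)
    else:
        merged = {e["display"]: e["key"] for e in cast_entries or []}
        # JSON keys override auto-derived keys; unseen JSON speakers are appended.
        for e in json_entries or []:
            merged[e["display"]] = e["key"]
    # Longest display name first, so compound names match before short ones.
    known = sorted(merged, key=len, reverse=True)
    return known, merged
```

Sorting longest-first matters because matching is prefix-based: a short name that is a prefix of a longer one would otherwise shadow it.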
Parameters:
- path (str | None, default: None) – Explicit path to a speakers JSON file. None triggers auto-detection.
- cast_entries (list[dict] | None, default: None) – Speaker dicts extracted from the script's CAST: block via extract_cast_from_script. None or [] means no CAST block was found.
Returns:
- tuple[list[str], dict[str, str]] – The speaker display names sorted longest-first, and the mapping from display names to normalized keys.
load_speakers_registry
Load the full speaker registry from speakers.json, keyed by speaker key.
Returns the raw entry dicts, which may include optional per-character attributes (voice_id, pan, filter, role, etc.) in addition to the required display/key fields. Used by generate_cast_config to pre-populate cast skeletons.
Returns an empty dict when no speakers file is found (built-in defaults have no registry data).
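Parameters:
- path (str | None, default: None) – Explicit path to a speakers JSON file. None triggers auto-detection (same resolution order as load_speakers).
Returns:
- dict[str, dict] – Mapping from speaker key to the raw speakers.json entry dict; empty when no speakers file is found.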
get_section_map
Return the section-header-to-slug map for the given content type.
Falls back to the legacy SECTION_MAP entries not covered by the type-specific map so that existing show-specific section names continue to parse correctly.
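One way to picture the fallback is a dictionary merge in which the type-specific map takes precedence over the legacy map. The sketch below uses tiny stand-in maps and a hypothetical function name; the real tables are the module attributes listed above, and the actual merge logic may differ.

```python
# Stand-in data for illustration only; not the module's real tables.
LEGACY_MAP = {"COLD OPEN": "cold-open", "CHAPTER ONE": "chapter1"}
TYPED_MAPS = {
    "podcast": {"COLD OPEN": "cold-open", "INTRO": "intro"},
    "audiobook": {"PROLOGUE": "prologue"},
}


def sketch_get_section_map(project_type: str = "podcast") -> dict[str, str]:
    """Return the typed map, topped up with legacy entries it does not cover."""
    typed = TYPED_MAPS.get(project_type)
    if typed is None:
        # Unknown content type: fall back to the full legacy map.
        return dict(LEGACY_MAP)
    # Later unpacking wins, so typed entries override legacy ones on conflict.
    return {**LEGACY_MAP, **typed}
```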
Parameters:
- project_type (str, default: 'podcast') – One of "podcast", "audiobook", "drama", "special". Unknown values fall back to the full legacy map.
Returns:
- dict[str, str] – Mapping from section header text to slug for the given content type.
strip_markdown_escapes
Remove markdown backslash escapes from the script.
Parameters:
- text (str) – Raw text possibly containing backslash-escaped markdown characters.
Returns:
- str – Text with all backslash escapes removed.
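A minimal sketch of escape stripping, assuming that "backslash escape" means a backslash followed by an ASCII punctuation character (the CommonMark rule). The function name is hypothetical and the real implementation may use a narrower character set.

```python
import re
import string

# A backslash followed by any ASCII punctuation character.
_ESCAPE_RE = re.compile(r"\\([" + re.escape(string.punctuation) + r"])")


def sketch_strip_escapes(text: str) -> str:
    """Drop the backslash from each escaped punctuation character."""
    return _ESCAPE_RE.sub(r"\1", text)
```

Backslashes in front of letters or digits are left alone under this assumption, since they are not markdown escapes.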
Source code in src/xil_pipeline/XILP001_script_parser.py
strip_markdown_formatting
Remove markdown formatting syntax (bold, headings, trailing breaks).
Intended to run AFTER strip_markdown_escapes() so that backslash
escapes are already resolved. Operates per-line to correctly strip
# heading prefixes while leaving other content intact.
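The per-line approach can be sketched as follows. The exact rules here are assumptions (ATX headings of one to six `#`, `**` bold markers, and trailing two-space or backslash hard breaks), and `sketch_strip_formatting` is a hypothetical name.

```python
import re


def sketch_strip_formatting(text: str) -> str:
    """Strip heading prefixes, bold markers, and trailing hard breaks per line."""
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"^\s*#{1,6}\s+", "", line)   # "## ACT ONE" -> "ACT ONE"
        line = line.replace("**", "")               # bold markers
        line = re.sub(r"(?: {2,}|\\)$", "", line)   # trailing hard break
        cleaned.append(line)
    return "\n".join(cleaned)
```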
Parameters:
- text (str) – Text with markdown formatting (**, ##, etc.).
Returns:
classify_direction
Classify a stage direction into a sound category.
Parameters:
- text (str) – Bracket-interior text (e.g., "SFX: DOOR OPENS").
Returns:
- str | None – One of "SFX", "MUSIC", "AMBIENCE", "BEAT", or None if the direction doesn't match a known category.
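A sketch of the classification, assuming categories are recognized by a `PREFIX:` at the start of the text and that BEAT covers both "BEAT" and "LONG BEAT" (as the generate_sfx_config defaults suggest). The function name is hypothetical.

```python
def sketch_classify_direction(text: str):
    """Map bracket-interior text to a sound category, or None."""
    upper = text.strip().upper()
    for prefix in ("SFX", "MUSIC", "AMBIENCE"):
        if upper.startswith(prefix + ":"):
            return prefix
    if upper in ("BEAT", "LONG BEAT"):
        return "BEAT"
    return None
```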
try_match_speaker
try_match_speaker(line: str, known_speakers: list[str] | None = None, speaker_keys: dict[str, str] | None = None) -> tuple[str, str | None, str] | None
Match a known speaker name at the start of a line.
Parameters:
- line (str) – A stripped line from the script.
- known_speakers (list[str] | None, default: None) – Ordered list of speaker display names (longest-first). Defaults to the module-level KNOWN_SPEAKERS.
- speaker_keys (dict[str, str] | None, default: None) – Mapping from display names to normalized keys. Defaults to the module-level SPEAKER_KEYS.
Returns:
- tuple[str, str | None, str] | None – A tuple of (speaker_key, direction, spoken_text) if a known speaker is found, or None if no speaker matches.
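The matching step can be sketched as below. The line format is an assumption (`NAME: text` or `NAME (direction): text`); the source does not spell it out, and `sketch_match_speaker` is a hypothetical name.

```python
import re

# After the speaker name: optional "(direction)", then a colon, then the text.
_AFTER_NAME = re.compile(r"\s*(?:\(([^)]*)\))?\s*:\s*(.*)")


def sketch_match_speaker(line, known_speakers, speaker_keys):
    """Return (speaker_key, direction, spoken_text) or None."""
    for name in known_speakers:  # longest-first, so compound names win
        if not line.startswith(name):
            continue
        m = _AFTER_NAME.match(line[len(name):])
        if m:
            return speaker_keys[name], m.group(1), m.group(2)
    return None
```

Requiring the colon right after the name (or after the parenthetical) keeps a word such as "ADAMANT" from being read as speaker "ADAM".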
is_stage_direction
is_section_header
Check if a line matches a known section header.
Parameters:
- line (str) – A stripped line from the script.
- section_map (dict[str, str] | None, default: None) – Section map to check against. Defaults to SECTION_MAP.
Returns:
- bool – True if the line matches a key in the section map.
is_scene_header
Check if a line is a scene header (SCENE N: ...).
Parameters:
- line (str) – A stripped line from the script.
Returns:
- bool – True if the line matches the SCENE \d+[A-Za-z]*: pattern (supports suffixed scene numbers such as SCENE 5A:).
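The documented pattern can be exercised with a short sketch. The capture groups and function names below are illustrative assumptions layered on the `SCENE \d+[A-Za-z]*:` pattern; they show how the same regex could serve both the check and the later extraction of scene number and name.

```python
import re

# The documented pattern, with groups added for the number and the name.
_SCENE_RE = re.compile(r"^SCENE (\d+[A-Za-z]*):\s*(.*)$")


def sketch_is_scene_header(line: str) -> bool:
    return _SCENE_RE.match(line) is not None


def sketch_parse_scene_header(line: str):
    """Return (scene_number, scene_name), or (None, None) on no match."""
    m = _SCENE_RE.match(line)
    if not m:
        return (None, None)
    return m.group(1), m.group(2).strip()
```

The scene number stays a string so that suffixed values such as "5A" survive intact.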
is_divider
is_metadata_section
Check if a line begins a post-script metadata section.
Parameters:
- line (str) – A stripped line from the script.
Returns:
parse_scene_header
Extract scene number and name from a scene header line.
Parameters:
- line (str) – A line matching the SCENE N: ... pattern.
Returns:
- tuple[str | None, str | None] – A tuple of (scene_number, scene_name), or (None, None) if the line doesn't match. scene_number is a string to support suffixed numbers such as "5A".
write_debug_csv
write_debug_csv(output_path: str, debug_line_map: list[tuple[int, str, int]], entries: list[dict]) -> None
Write a diagnostic CSV mapping markdown source lines to parsed entries.
Each row represents one parsed entry, showing the originating markdown line alongside all fields from the parsed JSON output. Text fields are truncated at 200 characters to prevent unpredictable CSV cell sizes.
Parameters:
- output_path (str) – Filesystem path for the output CSV file.
- debug_line_map (list[tuple[int, str, int]]) – List of (1-based line number, raw line text, entry index) tuples collected during parsing.
- entries (list[dict]) – The fully-parsed entries list (after all continuation merges).
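A rough sketch of the CSV layout described above, using the standard csv module. The column names md_line, md_text, and entry_index are hypothetical, as is the function name; only the 200-character truncation and the tuple shape come from the description.

```python
import csv


def sketch_write_debug_csv(output_path, debug_line_map, entries, max_len=200):
    """Write one row per mapped line: source line info plus the entry's fields."""
    # Hypothetical column names for the source-line part of each row.
    fieldnames = ["md_line", "md_text", "entry_index"]
    for entry in entries:
        for field in entry:
            if field not in fieldnames:
                fieldnames.append(field)

    with open(output_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for line_no, raw_line, entry_index in debug_line_map:
            row = {"md_line": line_no, "md_text": raw_line[:max_len],
                   "entry_index": entry_index}
            for field, value in entries[entry_index].items():
                # Truncate text fields so cell sizes stay predictable.
                row[field] = value[:max_len] if isinstance(value, str) else value
            writer.writerow(row)
```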
parse_script_header
Extract show, season, episode, title, and season_title from the script header line.
Parses the first line of a production script, which follows the format:
SHOW [Season N:] Episode N: "Episode Title" [Arc: "Arc Title"] ...
Season is optional; scripts without a season declaration return None for the season element. Title is the first double-quoted string after Episode N:. Arc title (season title) is the quoted string after Arc:; it is None when no Arc: declaration is present. Falls back to bare text after Episode N: when no quoted strings are present.
Parameters:
- line (str) – The first non-empty line of the production script, after markdown escapes have been removed.
Returns:
- tuple[str, int | None, int, str, str | None] | None – A tuple of (show, season, episode, title, season_title) where season and season_title are None when not declared, or None if the line does not match the expected header format.
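The header format can be approximated with a regular expression, as in the sketch below. The regex, the function name, and the show name in the usage note are illustrative assumptions; the real parser may handle more variants.

```python
import re

_HEADER_RE = re.compile(
    r"^(?P<show>.+?)\s+"
    r"(?:Season\s+(?P<season>\d+):\s*)?"
    r"Episode\s+(?P<episode>\d+):\s*(?P<rest>.*)$"
)


def sketch_parse_header(line: str):
    """Return (show, season, episode, title, season_title) or None."""
    m = _HEADER_RE.match(line.strip())
    if not m:
        return None
    rest = m.group("rest")
    title_m = re.search(r'"([^"]*)"', rest)        # first quoted string
    arc_m = re.search(r'Arc:\s*"([^"]*)"', rest)   # quoted string after Arc:
    # Fall back to the bare text when nothing is quoted.
    title = title_m.group(1) if title_m else rest.strip()
    season = int(m.group("season")) if m.group("season") else None
    season_title = arc_m.group(1) if arc_m else None
    return (m.group("show"), season, int(m.group("episode")), title, season_title)
```

For a made-up header such as `NIGHT OWLS Season 2: Episode 3: "The Caller" Arc: "Static"`, this sketch returns ("NIGHT OWLS", 2, 3, "The Caller", "Static").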
parse_script
parse_script(filepath: str, debug_output: str | None = None, speakers_path: str | None = None, project_type: str | None = None) -> dict
Parse a markdown production script into structured entries.
Reads a markdown file and extracts dialogue lines, stage directions, section headers, and scene headers into a sequence-numbered list of entries.
Parameters:
- filepath (str) – Path to the markdown production script file.
- debug_output (str | None, default: None) – If provided, write a diagnostic CSV to this path mapping each markdown source line to its parsed entry. Text fields are truncated at 200 characters. Defaults to None (no CSV written).
- speakers_path (str | None, default: None) – Path to a speakers.json file. None uses the default resolution order (see load_speakers).
- project_type (str | None, default: None) – Content type from project.json ("podcast", "audiobook", "drama", "special"). None reads from project.json in the current directory, defaulting to "podcast" when the file is absent.
Returns:
- dict – Dictionary with keys show, season, episode, title, source_file, entries (list of entry dicts), and stats (aggregate statistics dict). Validates against the ParsedScript model.
Raises:
- FileNotFoundError – If the script file does not exist.
compute_speaker_stats
Compute per-speaker dialogue distribution.
Parameters:
- parsed (dict) – Output dictionary from parse_script().
Returns:
- list[dict] – List of dicts sorted by lines descending, each with keys: speaker, lines, words, chars, pct_lines, pct_words, pct_chars.
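The aggregation can be sketched as follows. The entry shape is an assumption: dialogue entries are taken to carry "type": "dialogue", a "speaker" key, and a "text" string, and percentages are rounded to one decimal place. The output keys follow the list above; the function name is hypothetical.

```python
from collections import defaultdict


def sketch_speaker_stats(parsed: dict) -> list[dict]:
    """Per-speaker line/word/char totals with percentage shares."""
    totals = defaultdict(lambda: {"lines": 0, "words": 0, "chars": 0})
    for entry in parsed["entries"]:
        if entry.get("type") != "dialogue":  # assumed entry type marker
            continue
        t = totals[entry["speaker"]]
        t["lines"] += 1
        t["words"] += len(entry["text"].split())
        t["chars"] += len(entry["text"])

    grand = {k: sum(t[k] for t in totals.values()) for k in ("lines", "words", "chars")}
    rows = []
    for speaker, t in totals.items():
        row = {"speaker": speaker, **t}
        for k in ("lines", "words", "chars"):
            row[f"pct_{k}"] = round(100 * t[k] / grand[k], 1) if grand[k] else 0.0
        rows.append(row)
    rows.sort(key=lambda r: r["lines"], reverse=True)
    return rows
```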
print_speaker_stats
Print per-speaker dialogue distribution table.
Shows lines, words, characters, and percentage share for each speaker, sorted by number of lines descending.
Parameters:
- parsed (dict) – Output dictionary from parse_script().
print_summary
Print a human-readable summary of the parsed script.
Displays show metadata, entry counts, TTS character budget, and a per-speaker breakdown of lines, words, and characters.
Parameters:
- parsed (dict) – Output dictionary from parse_script().
print_dialogue_preview
Print dialogue lines for review.
Parameters:
- parsed (dict) – Output dictionary from parse_script().
- limit (int | None, default: None) – Maximum number of dialogue lines to display. None shows all lines.
generate_cast_config
generate_cast_config(parsed: dict, cast_path: str, tag_override: str | None = None, speakers_registry: dict[str, dict] | None = None) -> None
Generate a skeleton cast config JSON from parsed script data.
Creates a cast config with all speakers found in the parsed script. When speakers_registry is provided (loaded via load_speakers_registry), any per-character attributes stored in speakers.json (voice_id, pan, filter, role, stability, similarity_boost, style, use_speaker_boost, language_code) are pre-populated from the registry instead of defaulting to TBD.
Parameters:
- parsed (dict) – Parsed script dict from parse_script.
- cast_path (str) – Output path for the cast config JSON.
- tag_override (str | None, default: None) – Raw non-episodic tag (e.g. "V01C03"); when set, season/episode are written as null and tag_override is added to the config.
- speakers_registry (dict[str, dict] | None, default: None) – Optional dict mapping speaker key → full speakers.json entry dict (from load_speakers_registry).
generate_sfx_config
Generate a skeleton SFX config JSON from parsed script data.
Creates an SFX config with entries for each unique direction found in the parsed script. Defaults are based on direction type:
- BEAT / LONG BEAT → silence (no API call)
- SFX: → 5s effect
- MUSIC: → 15s effect
- AMBIENCE: → 30s looping effect
- Other → 5s effect
The user should review and refine prompts before running generation.
Parameters:
- parsed (dict) – Parsed script dict from parse_script.
- sfx_path (str) – Output path for the SFX config JSON.
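The per-type defaults listed above can be sketched as a small lookup. The field names prompt, duration_seconds, and loop appear elsewhere on this page; the {"type": "silence"} marker and the function name are hypothetical.

```python
def sketch_sfx_defaults(direction: str) -> dict:
    """Return a skeleton SFX entry for one direction string."""
    upper = direction.strip().upper()
    if upper in ("BEAT", "LONG BEAT"):
        return {"type": "silence"}  # hypothetical marker: no API call needed
    # Stub prompt is the direction text itself; the user refines it later.
    entry = {"prompt": direction, "duration_seconds": 5.0}
    if upper.startswith("MUSIC:"):
        entry["duration_seconds"] = 15.0
    elif upper.startswith("AMBIENCE:"):
        entry["duration_seconds"] = 30.0
        entry["loop"] = True
    return entry
```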
backfill_sfx_sources
Add missing source fields to an existing SFX config from parsed hints.
When a script is re-parsed and the SFX config already exists, any direction entries that carry an sfx_source hint are used to update the SFX config in three ways:
- Clean key already exists, no source: adds the source field and removes the stub prompt if it matched the key text.
- Stale piped key exists ("KEY | file.mp3" from a pre-fix parse): renames it to the clean key and adds source.
- Key absent entirely: adds a new entry with source and sensible defaults (loop: True for AMBIENCE, appropriate duration_seconds).
Entries that already have a source field are never touched.
Parameters:
- parsed (dict) – Parsed script dict (after hint stripping).
- sfx_path (str) – Path to the existing SFX config JSON to update in-place.
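The three cases can be sketched on in-memory data. This is an assumption-heavy illustration: the config is modeled as a flat dict of key → entry, the hints as a dict of clean key → source filename, and the default durations for new entries are guesses. File I/O and the real config layout are left out.

```python
def sketch_backfill(effects: dict, hints: dict) -> dict:
    """Apply source hints to an SFX config dict in place and return it."""
    for key, source in hints.items():
        stale_key = f"{key} | {source}"
        if key not in effects and stale_key in effects:
            # Case 2: rename a stale piped key to the clean key.
            effects[key] = effects.pop(stale_key)
        if key not in effects:
            # Case 3: create a new entry with assumed defaults.
            is_ambience = key.upper().startswith("AMBIENCE:")
            effects[key] = {"duration_seconds": 30.0 if is_ambience else 5.0}
            if is_ambience:
                effects[key]["loop"] = True
        entry = effects[key]
        if "source" in entry:
            continue  # entries that already have a source are never touched
        entry["source"] = source
        if entry.get("prompt") == key:
            del entry["prompt"]  # Case 1: drop the stub prompt
    return effects
```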
get_parser
main
CLI entry point for script parsing.