Chapter names instead of numbers #248

pickae · 2025-01-25T14:32:33Z

pickae
Jan 25, 2025

The resulting audio files would be more user-friendly if they kept the chapter names instead of numbering them.

I'm aware it's a bit tricky because epub has nested chapters like a book, but audio files just support flat chapter structures.

I'm leaving the suggestion here so that it doesn't get forgotten.

ROBERT-MCDOWELL · 2025-01-25T14:59:39Z

ROBERT-MCDOWELL
Jan 25, 2025
Collaborator

if they kept the chapter names instead of numbering them

do you mean the chapter file names? actually, these chapter files are not the final audiobook; they are only used to concatenate all the chapters into one audio file, so you don't have to worry about them.

1 reply

pickae Jan 25, 2025
Author

No need to export the chapter file itself as long as it is embedded in the audio container. That is working.

But if the book has chapters for example named

1 Intro
2 Main Text
3 Conclusion

Then my audio files were coming out chaptered with just

1
2
3

Without keeping names

ROBERT-MCDOWELL · 2025-01-25T15:20:51Z

ROBERT-MCDOWELL
Jan 25, 2025
Collaborator

the A.I. speaker will say "1 Intro" etc., that's it. and the final output will be a single file containing all your book text, chapters included.
again, the chapter files are not public; they sit in a temporary folder used for audio concatenation and audio processing, so you don't have to worry about them. now, if you have a separate ebook for each chapter, the final file will respect the title of the chapter, provided you specify the chapter title in your metadata.

1 reply

pickae Jan 25, 2025
Author

I think this will be easier with examples. Here you see the chapters of an epub file, and in the second picture how VLC shows the chapters in the audio file made from it.

And in the last screenshot you see the chapters in VLC of another audio file, to show how they could look. That one was made with yt-dlp, which embedded the YouTube video chapters and their names into the audio file.

ROBERT-MCDOWELL · 2025-01-25T16:33:20Z

ROBERT-MCDOWELL
Jan 25, 2025
Collaborator

ok, tell me how to manage millions of ebooks, each with its own chapter classification and its own language? do you know there are absolutely no standard rules for guessing what a chapter is in an ebook? would you like me to show you 10 ebooks in 10 different languages so you can write code that guesses what a chapter is? then I will be happy to integrate it in eb2ab.
let me explain why I have no choice but to call them chapter 1, 2, 3 etc. first, there are books where chapters use roman numerals, so I have to convert them to numbers. then there are ebooks that use a word other than chapter, like 'section', or just a number like '1.' etc. that is for one language only, and already we are stuck guessing what a chapter is.
secondly, the chapter files are based only on the structure of the ebook. let's say your ebook has only one page with the whole book in it: that will be one chapter file, etc.
as you can see, just because you have several ebooks in English that call a chapter "chapter", with a distinct document per chapter, does not mean all ebooks will work. on top of all these issues, there have been 3 versions of the ebook format over 17 years: version 1 = no rules at all, version 2 = more structured metadata, version 3 = more metadata. but again, absolutely no standard rules.
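
The Roman-numeral conversion mentioned above can be sketched in a few lines of Python. This is only an illustration, not eb2ab's actual code; the name roman_to_int is hypothetical. It does not validate its input, and it cannot tell a numeral from an ordinary word or a section letter such as "C.", which is exactly the ambiguity described in this post.

```python
# Hypothetical helper: convert a Roman numeral such as "IV" to an integer.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Return the integer value of a Roman numeral (no validation of its form)."""
    # A KeyError is raised for characters that are not Roman digits.
    values = [ROMAN_VALUES[char] for char in numeral.upper()]
    total = 0
    # A digit smaller than the one that follows it is subtracted (IV = 5 - 1).
    for current, following in zip(values, values[1:] + [0]):
        total += -current if current < following else current
    return total


print(roman_to_int("IV"))
print(roman_to_int("XIV"))
```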

0 replies

ROBERT-MCDOWELL · 2025-01-25T16:36:58Z

ROBERT-MCDOWELL
Jan 25, 2025
Collaborator

that said, if you have a way to guess what a chapter is for at least 80% of all the ebooks in the world, I would be very happy to hear it.

1 reply

pickae Jan 27, 2025
Author

I think to begin, the challenge should be decomposed into subproblems, because I think on some fronts you have already made more progress than you let on. Maybe I can help you on another front.

1. How to extract the chapters from the ebook (extract the place and the name of the chapter mark)
2. How to insert the chapters into the audio file

You are already extracting the place of the chapter mark, at least for some well-behaved ebook formats, or you wouldn't be able to put chapters into the audio files at all, which you do.

There are challenges with this, for example how to treat sections. They could be treated just like chapters, flattening the tree structure of the ebook. Or they could be ignored, with only the top level of the ebook structure put into chapters of the audio file. I would suggest not changing that aspect yet and changing only one thing at a time: in this case, taking the names of the chapters that you have already identified as chapters and already decided to put into the audio file.
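
To make the two options concrete, here is a small Python sketch. The nested TOC is modelled as a list of (title, children) pairs, which is an assumption for illustration and not the structure any particular library returns; flatten_toc and top_level_only are hypothetical names.

```python
def flatten_toc(toc, depth=0):
    """Option 1: walk the tree in reading order and return every entry as (depth, title)."""
    flat = []
    for title, children in toc:
        flat.append((depth, title))
        flat.extend(flatten_toc(children, depth + 1))
    return flat


def top_level_only(toc):
    """Option 2: keep only the top level and ignore nested sections."""
    return [title for title, _children in toc]


# Made-up example book with two parts and three chapters.
sample_toc = [
    ("Part One", [("Chapter 1", []), ("Chapter 2", [])]),
    ("Part Two", [("Chapter 3", [])]),
]

print([title for _depth, title in flatten_toc(sample_toc)])
print(top_level_only(sample_toc))
```

Either result is a flat list, which is all an audio container needs.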

About those names. I'm not sure if every ebook format permits extracting them easily, but I think well-behaved epub v2 or v3 should. So two things can be done here:

1. start by extracting the chapter names from those well-behaved epubs
2. as you are using calibre, see if it helps to convert other formats like mobi or azw3 to epub first and then do point 1

This will leave you with a good amount of ebooks where you can extract the chapter names as strings. The important point to see here is that those will be strings. It doesn't matter so much if that string is nicely just "Introduction" or also once more says "Chapter: Introduction" or in another language "Chapitre: Introduction" or even with some roman numerals "Chapitre I : Introduction". They're just strings that you can embed as names into the chapter file for the audio.

Now, you can do some string manipulation to make the names nicer. I will put some suggestions. But the main takeaway is that any name at all is better than renaming to 1,2,3... Even if the UI of the audio-player adds hardcoded "Chapter:" to the front and the chapter itself is called "Chapter I: Introduction", let's see how that would look

"Chapter: Chapitre I: Introduction" is wordy and not ideal, but at least you see what is meant
"Chapter: Introduction" would be ideal, but you will be writing a lot of code to identify roman numerals as numerals and Chapitre as Chapter etc.
"Chapter: 1" and you cannot tell what it is

And for taking the chapter strings and embedding them into audio files as names, not much has to be done. I have done it in my own repo in the context of taking chapter names from the names of individual audio files to be concatenated, or taking them from cue sheets.

You insert the chapter names into a text-file like this

CHAPTER01=00:00:00.000
CHAPTER01NAME=Intro
CHAPTER02=00:02:30.000
CHAPTER02NAME=Main Body
CHAPTER03=00:07:34.000
CHAPTER03NAME=Outro

And you can embed it into audio files.
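
As a rough sketch of how such a file could be generated, the following Python builds the text shown above from (start time in seconds, title) pairs. The function names and the input shape are assumptions made for illustration; embedding the result into the audio container is left out, since that depends on the tool being used.

```python
def format_timestamp(seconds: float) -> str:
    """Render a start offset in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def build_chapter_file(chapters) -> str:
    """Build the CHAPTERnn / CHAPTERnnNAME lines from (start_seconds, title) pairs."""
    lines = []
    for index, (start, title) in enumerate(chapters, start=1):
        lines.append(f"CHAPTER{index:02d}={format_timestamp(start)}")
        lines.append(f"CHAPTER{index:02d}NAME={title}")
    return "\n".join(lines) + "\n"


# Same three chapters as in the example above.
chapter_text = build_chapter_file([(0, "Intro"), (150, "Main Body"), (454, "Outro")])
print(chapter_text)
```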

In my own repo I have a function called cleanNamesIndividually to do some needed things there. For example, the chapter names cannot contain double quotes, so I just delete them. You could look into escaping them, but deletion seemed easier and sufficient. I remove newlines inside chapter names and replace them with spaces, convert double spaces to spaces... The bare minimum cleanup is really easy to do. My code already goes further, removing numbers at the start of chapter names and removing any prefixes or suffixes that all chapters might have in common, but that's not even really needed.

https://github.com/pickae/concatAudio/blob/main/concatAudio.sh
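
For readers who prefer Python, the bare-minimum cleanup described above could look roughly like this. It is a sketch of the three steps named in the post (delete double quotes, turn newlines into spaces, collapse repeated spaces) and not a port of the linked shell function; clean_chapter_name is a hypothetical name.

```python
import re


def clean_chapter_name(name: str) -> str:
    """Apply the minimal cleanup: no double quotes, no newlines, no repeated spaces."""
    name = name.replace('"', "")          # delete double quotes instead of escaping them
    name = re.sub(r"[\r\n]+", " ", name)  # newlines inside a name become spaces
    name = re.sub(r" {2,}", " ", name)    # collapse runs of spaces
    return name.strip()


print(clean_chapter_name('Chapitre I :\n "Introduction"'))
```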

ROBERT-MCDOWELL · 2025-02-02T23:10:09Z

ROBERT-MCDOWELL
Feb 2, 2025
Collaborator

it's not worth copy-pasting ChatGPT.... metadata is already in eb2ab. the thing you still don't get is that WITHOUT standards, it's IMPOSSIBLE to guess the part/poem/chapter/psalm/whatever title, starting with the word itself that splits the text into chapters. some books have their chapters called 1., 2., 3., 4. etc., or A., B., C., with or without a title. now how does your script tell these apart from legends, exceptions, nomenclatures, prefaces etc.? and multiply this by 1100+ languages... your function will work for some books, until it breaks on other books. that's why it's better that you make your own function for your own books; on our side we cannot satisfy everybody with such a simple function. as I said, you would need an A.I. greater than all the TTS models put together....

0 replies

ROBERT-MCDOWELL · 2025-02-03T00:50:08Z

ROBERT-MCDOWELL
Feb 3, 2025
Collaborator

Now, there is an option: as the project becomes popular, we can create a STANDARD ourselves for all languages.
example that can be added to the README:

To define the exact chapters you want, each with its own title, in the audiobook chapter index, you must:

use the exact word CHAPTER (for all languages) followed by a colon, enclosed in double square brackets like this: [[CHAPTER: 1]] or [[CHAPTER: Summer on the beach]] or [[CHAPTER: IV]]

we can even create a STANDARD that defines the start and the end of the book like this:
[[STARTBOOK]] and [[ENDBOOK]]

this way, people will have the choice to use our standards or ignore them, and it will be much easier for everyone.
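
A parser for the proposed markers could be sketched as follows. This is only an illustration of the convention described in this post, not code from ebook2audiobook; the function names are hypothetical, and text that appears before the first [[CHAPTER: ...]] marker is simply dropped here.

```python
import re

# Matches [[CHAPTER: some title]] and captures the title without surrounding spaces.
CHAPTER_MARKER = re.compile(r"\[\[CHAPTER:\s*(.*?)\s*\]\]")


def book_body(text: str) -> str:
    """Return the text between [[STARTBOOK]] and [[ENDBOOK]] if the markers are present."""
    start = text.find("[[STARTBOOK]]")
    end = text.find("[[ENDBOOK]]")
    begin = start + len("[[STARTBOOK]]") if start != -1 else 0
    stop = end if end != -1 else len(text)
    return text[begin:stop]


def split_chapters(text: str):
    """Return a list of (title, chapter_text) pairs, one per chapter marker."""
    # re.split with one capture group yields [preamble, title1, body1, title2, body2, ...].
    parts = CHAPTER_MARKER.split(book_body(text))
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]


sample = (
    "[[STARTBOOK]]\n"
    "[[CHAPTER: 1]]\nFirst text.\n"
    "[[CHAPTER: Summer on the beach]]\nSecond text.\n"
    "[[ENDBOOK]]\nPublisher notes."
)
print(split_chapters(sample))
```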

1 reply

pickae Feb 5, 2025
Author

Or you could just use the same reliable methods that calibre uses to recognize the chapter tags and their corresponding names inside the various ebook formats.

You are already using calibre and it can already do all that.

Maybe it's not perfect, but honestly quite close, and for sure a lot better than anything that could come in the short or medium term from reinventing the wheel.

ROBERT-MCDOWELL · 2025-02-05T12:54:28Z

ROBERT-MCDOWELL
Feb 5, 2025
Collaborator

calibre does not do anything about how an ebook is structured. it only shows the metadata the author/editor put in, plus the xhtml pages, images and special pages the file contains. calibre applies no standard and makes no "guess" to check whether something is a chapter or not. there are NO chapter tags at all because there are NO standards. do you understand? if there is a chapter tag in an ebook, it exists for a personal, non-standard reason. with xhtml you can create any kind of TAG. you cannot reinvent the wheel when there is NO WHEEL. I'm not sure you get it yet.

2 replies

ROBERT-MCDOWELL Feb 5, 2025
Collaborator

for those who are following this discussion, you must really understand that there are no standards at all in an ebook; that's why HTML and XHTML were chosen to create e-books 18 years ago. here are some examples of ebooks with completely different structures:
ebooks_1.zip
as you can see, the only way for ebook2audiobook to "guess" that something is a chapter is the repetition of pages with the same name plus a number or similar. but that's not always the case, as you can see in one of the ebooks in the attached zip. also, there are ebooks without chapters but with only titles and paragraphs, or without titles but with subtitles, or even with just a number, etc., and their authors still consider these chapters...
making an audiobook with real people in a real studio is a full-time job, and for now, structuring an ebook like a book for the audio world is not possible when it follows no standards at all: not because it's technically impossible, but because there are NO STANDARDS to tell the software that this is a chapter, this is a title with a subtitle, this is the text of the book etc....

pickae Feb 5, 2025
Author

Here is a screenshot directly from calibre where you see that it takes the epub, figures out it is a zip file containing mostly html files, figures out which tags represent chapters, figures out which other tags represent subchapters or sections or however they will be called, and then represents this hierarchically in this TOC, of course preserving the names, the places and the sequence of the chapters.

Could there be someone that writes their tags so eccentrically in such an obscure ebook format that it won't work? Possibly.
But I've honestly never seen calibre choke on parsing a TOC.
At that point we might request something on calibre's repo, but rejoice in all the ones that do work in the meantime.

ROBERT-MCDOWELL · 2025-02-05T13:51:04Z

ROBERT-MCDOWELL
Feb 5, 2025
Collaborator

and what do you do with ebooks without a TOC? again, you insist on the ebooks you have on hand and on your language, and apparently you don't care about others.

3 replies

pickae Feb 5, 2025
Author

That's not my language, but it also works in all the languages I have ever opened.

And what to do when there is no TOC? Well, no chapters.

Or if you still have another fallback that I'm not aware of, great, why not use that.
Just sacrificing the main use case in favor of an obscure fallback is weird.

ROBERT-MCDOWELL Feb 5, 2025
Collaborator

No, it's absolutely not a main use case...

ROBERT-MCDOWELL Feb 5, 2025
Collaborator

let's use this script to get the TOC or NAV section and parse the ebook sections of one of the books I zipped and attached to the discussion:

import pprint

from ebooklib import epub

def extract_toc(ebook_path):
    """Return the EPUB's table of contents as a nested dictionary."""
    # Load the EPUB file; the parsed TOC is available as book.toc
    book = epub.read_epub(ebook_path)

    # Convert the TOC structure into a dictionary
    def parse_toc(toc_entries):
        toc_dict = {}
        for entry in toc_entries:
            if isinstance(entry, epub.Link):  # Leaf entry
                toc_dict[entry.title] = {"href": entry.href, "children": {}}
            elif isinstance(entry, tuple):  # Nested entry: (Section, [children])
                section, sub_items = entry
                toc_dict[section.title] = {
                    "href": section.href,
                    "children": parse_toc(sub_items) if sub_items else {},
                }
        return toc_dict

    return parse_toc(book.toc)

# Example usage
ebook_file = r"brzydkie-kaczatko.epub"
toc_dict = extract_toc(ebook_file)
pprint.pprint(toc_dict)  # Print TOC in dictionary format

the result is:
{'Początek utworu': {'children': {}, 'href': 'part1.xhtml'},
'Przypisy': {'children': {}, 'href': 'annotations.xhtml'},
'Spis treści': {'children': {}, 'href': 'nav.xhtml'},
'Strona redakcyjna': {'children': {}, 'href': 'last.xhtml'},
'Strona tytułowa': {'children': {}, 'href': 'title.xhtml'},
'Wesprzyj Wolne Lektury': {'children': {}, 'href': 'support.xhtml'}}

please write the code that parses the chapters of this ebook, and I will implement it in ebook2audiobook

ROBERT-MCDOWELL · 2025-02-05T14:00:11Z

ROBERT-MCDOWELL
Feb 5, 2025
Collaborator

what I can do is at least check whether there is a TOC and parse it into a dict() object. after that, the trick would be to check which info in every section points to a page/text of the ebook, and check where it is in the document provided. but even with a TOC we must know what a chapter is in 1100+ languages...

0 replies

ROBERT-MCDOWELL · 2025-02-05T16:33:54Z

ROBERT-MCDOWELL
Feb 5, 2025
Collaborator

could you try this script under python_env and tell me if it works for you (set the epub_file variable to the path of your file):

import re

import ebooklib
from bs4 import BeautifulSoup
from ebooklib import epub
from lxml import etree

def extract_epub_chapters(epub_file):
    book = epub.read_epub(epub_file)
    chapters = {}  
    toc = extract_toc(book)
    for item in book.get_items():
        if item.get_type() == ebooklib.ITEM_DOCUMENT:
            content = item.get_content().decode("utf-8")
            soup = BeautifulSoup(content, "xml")
            # Match file name in TOC to get the corresponding title
            title = match_chapter_title(item.file_name, toc)
            if title:
                chapter_text = extract_chapter_text(soup)
                chapters[title] = chapter_text
    return chapters

def extract_toc(book):
    toc_dict = {}
    for item in book.get_items_of_type(ebooklib.ITEM_NAVIGATION):
        soup = BeautifulSoup(item.get_content(), "xml")  # parse the nav document as XML
        for li in soup.find_all("li"):
            link = li.find("a")
            if link:
                href = link.get("href", "").split("#")[0]
                title = link.get_text(strip=True)
                if href and title:
                    toc_dict[href] = title
    if not toc_dict:
        for item in book.get_items():
            if item.get_name().endswith("toc.ncx"):
                root = etree.fromstring(item.get_content())
                for nav_point in root.findall(".//{*}navPoint"):
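                    text_node = nav_point.find(".//{*}text")
                    content_node = nav_point.find(".//{*}content")
                    if text_node is None or content_node is None:
                        continue  # skip malformed navPoint entries
                    title = (text_node.text or "").strip()
                    src = content_node.get("src", "").split("#")[0]
                    if src and title:
                        toc_dict[src] = title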
    return toc_dict

def match_chapter_title(file_name, toc):
    for href, title in toc.items():
        if file_name.endswith(href):
            return title
    return None

def extract_chapter_text(soup):
    text = []
    for elem in soup.find_all(["h1", "h2", "h3", "p"]):
        if elem.name in ["h1", "h2", "h3"]:
            text.append("\n" + elem.get_text(strip=True) + "\n")
        else:
            paragraph = clean_text(elem.get_text(strip=True))
            if paragraph:
                text.append(paragraph)
    return "\n".join(text)

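def clean_text(text):
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    text = text.replace("H2 anchor", "").strip()  # drop leftover anchor labels
    return text if len(text) > 20 else ""  # discard fragments of 20 characters or fewer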

if __name__ == "__main__":
    epub_file = r"brzydkie-kaczatko.epub"
    chapters = extract_epub_chapters(epub_file)
    for title, content in chapters.items():
        print(f"\n=== {title} ===\n{content[:500]}...\n")

0 replies
