mirror of
https://github.com/PacFactory/Docs-Exporter-Nextjs.git
synced 2025-12-19 19:21:05 -05:00
Refactored code for Playwright
Refactored the code for Playwright, replacing wkhtmltopdf
1
.gitignore
vendored
@@ -163,3 +163,4 @@ cython_debug/
 /*-docs
 *.pdf
 *.html
+.DS_Store
56
CHANGELOG.md
Normal file
@@ -0,0 +1,56 @@
|
|||||||
|
# CHANGELOG.md
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## General Changes
|
||||||
|
1. **Code Refactoring**
|
||||||
|
- Significant restructuring of the functions to improve readability and maintainability.
|
||||||
|
- Added Playwright to replace wkhtmltopdf.
|
||||||
|
- Introduction of meaningful function and variable names for better clarity.
|
||||||
|
|
||||||
|
2. **Error Handling**
|
||||||
|
- Enhanced error handling with `try-except` blocks, especially for frontmatter parsing and file operations.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## New Features
|
||||||
|
1. **HTML Preprocessing**
|
||||||
|
- Added functions `preprocess_code_blocks` and `process_image_paths` to handle custom Markdown syntax and image path updates for rendering consistency.
|
||||||
|
|
||||||
|
2. **Frontmatter Parsing**
|
||||||
|
- Introduced `parse_frontmatter`, `preprocess_frontmatter`, and `restore_html_tags` functions to manage YAML frontmatter in Markdown files, enhancing metadata handling.
|
||||||
|
|
||||||
|
3. **Repository Cloning**
|
||||||
|
- Added `clone_repo` to handle Git repository cloning with sparse checkout support, improving integration with remote documentation sources.
|
||||||
|
|
||||||
|
4. **PDF Generation**
|
||||||
|
- Integrated Playwright for rendering and generating PDFs from HTML content.
|
||||||
|
- Added support for custom headers, footers, and styles in the generated PDFs.
|
||||||
|
|
||||||
|
5. **Table of Contents (ToC)**
|
||||||
|
- Automatically generates a ToC from parsed metadata with proper hierarchy and numbering.
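As a rough illustration of the frontmatter handling listed under item 2, the sketch below splits a leading `---` block from a Markdown file and masks HTML-like tags with placeholders before the YAML would be parsed. `split_frontmatter`, `mask_html_tags`, and `unmask_html_tags` are hypothetical names, not the functions in `export-docs.py`, and the YAML parsing step is left out so the example needs only the standard library.

```python
import re

# A frontmatter block is assumed to open and close with a line of three dashes.
FRONTMATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n?", re.DOTALL)


def split_frontmatter(text):
    """Return (frontmatter, body); frontmatter is None when the file has none."""
    match = FRONTMATTER_RE.match(text)
    if not match:
        return None, text
    return match.group(1), text[match.end():]


def mask_html_tags(frontmatter):
    """Swap tags such as <Link> for placeholders so they cannot confuse a YAML parser."""
    tags = {}

    def replace_tag(match):
        placeholder = f"HTML_TAG_{len(tags)}"
        tags[placeholder] = match.group(0)
        return placeholder

    return re.sub(r"<[^>]+>", replace_tag, frontmatter), tags


def unmask_html_tags(value, tags):
    """Restore the original tags, longest placeholder first so that
    HTML_TAG_1 never clobbers the start of HTML_TAG_10."""
    for placeholder in sorted(tags, key=len, reverse=True):
        value = value.replace(placeholder, tags[placeholder])
    return value


doc = "---\ntitle: Using <Link>\ndescription: Intro\n---\n# Body\n"
frontmatter, body = split_frontmatter(doc)
masked, tags = mask_html_tags(frontmatter)
```

In this sketch the masked text would be handed to a YAML parser, and `unmask_html_tags` would then be applied to each string value in the result.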
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Bug Fixes
|
||||||
|
1. **File Sorting**
|
||||||
|
- Fixed sorting issues in `get_files_sorted` to prioritize `index.md` and `index.mdx` files.
|
||||||
|
|
||||||
|
2. **Open File Check**
|
||||||
|
- Added `is_file_open` to ensure output files are not already open, preventing write conflicts.
|
||||||
|
|
||||||
|
3. **Version Detection**
|
||||||
|
- Improved `find_latest_version` logic to detect and sort unique versions from HTML content.
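The file-sorting fix in item 1 relies on a sort-key trick: index files get a `!!!` prefix so that they sort ahead of their siblings. A minimal sketch of that idea follows. It works on a plain list of POSIX-style paths instead of walking a directory, and `sort_docs` is a hypothetical name used only for illustration.

```python
import posixpath

INDEX_NAMES = {"index.md", "index.mdx"}


def sort_docs(paths):
    """Sort paths so each folder's index file comes before its other files."""

    def sort_key(path):
        folder, name = posixpath.split(path)
        if name in INDEX_NAMES:
            # '!' sorts before letters and digits, so the index file leads its folder.
            name = "!!!" + name
        return posixpath.join(folder, name)

    return sorted(paths, key=sort_key)


ordered = sort_docs([
    "docs/a/zeta.mdx",
    "docs/a/index.mdx",
    "docs/a/alpha.mdx",
    "docs/index.mdx",
])
```

The prefix only affects the sort key; the original paths are returned unchanged.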
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Removed Features
|
||||||
|
- Any obsolete or unused features from `export-docs.old.py` were removed to streamline the codebase.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Performance Improvements
|
||||||
|
- Optimized file processing loop to handle large repositories more efficiently.
|
||||||
|
- Improved Playwright's rendering performance by preloading images and resources.
|
||||||
|
|
||||||
|
---
|
||||||
199
README.md
@@ -1,38 +1,191 @@
|
|||||||
# Docs-Exporter
|
# README.md
|
||||||
|
# Documentation to PDF Converter
|
||||||
|
|
||||||
This script automates the process of exporting Next.js documentation from the GitHub repository, converting it to HTML, and then compiling it into a PDF document. It also ensures that all visual content, including images used in the online documentation, and crucial formatting, such as code blocks and tables, are accurately fetched and included.
|
A Python script that clones documentation from a Git repository (default: Next.js), converts its Markdown and MDX files to HTML, and renders a single PDF with a cover page, a numbered table of contents, and styling from `styles.css`.
|
||||||
|
|
||||||
## Features
|
## Features
|
||||||
- **Accurate Content Replication**: Clones the Next.js documentation from the Canary channel of the GitHub repository and preserves its layout.
|
|
||||||
- **Image Handling**: Fetches and embeds the exact images used in the online documentation, ensuring that all visual explanations and illustrations are retained.
|
|
||||||
- **Advanced Formatting**: Maintains the integrity of advanced formatting elements such as code blocks, tables, and special markdown features, ensuring that the educational value of the documentation is preserved.
|
|
||||||
- **Custom PDF Styling**: Generates a styled PDF document with a cover page and a detailed table of contents, formatted through an external CSS file.
|
|
||||||
|
|
||||||
|
- Clones specific documentation directories from Git repositories
|
||||||
|
- Processes Markdown and MDX files
|
||||||
|
- Generates table of contents with proper numbering
|
||||||
|
- Handles code blocks with filename annotations
|
||||||
|
- Processes frontmatter for metadata
|
||||||
|
- Supports image path transformations
|
||||||
|
- Creates PDF with customizable headers and footers
|
||||||
|
- Includes cover page and proper page breaks
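To make the table-of-contents numbering concrete, here is a small sketch of hierarchical numbering driven by each file's directory depth. It follows the counter-per-level approach used in `process_files`, but `number_sections` is a hypothetical standalone function and the depth list is made-up input.

```python
def number_sections(depths):
    """Turn a sequence of nesting depths into labels such as '1', '1.1', '2.1.1'."""
    numbering = [0]
    labels = []
    for depth in depths:
        # Make sure a counter exists for this level.
        while len(numbering) <= depth:
            numbering.append(0)
        # Advance the counter at this level and reset every deeper level.
        numbering[depth] += 1
        for i in range(depth + 1, len(numbering)):
            numbering[i] = 0
        labels.append(".".join(map(str, numbering[: depth + 1])))
    return labels


labels = number_sections([0, 1, 1, 0, 1, 2])
```

Resetting the deeper counters is what makes the first subsection under a new top-level entry start again at 1.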
|
||||||
|
|
||||||
## Prerequisites
|
## Requirements
|
||||||
|
|
||||||
- Python
|
### System Requirements
|
||||||
- Git
|
- Python 3.7+
|
||||||
- wkhtmltopdf
|
- Git installed and accessible from command line
|
||||||
|
- Internet connection for cloning repositories
|
||||||
|
|
||||||
## Installation
|
### Python Dependencies
|
||||||
- Install `wkhtmltopdf` which is required for PDF generation. You can download it from [wkhtmltopdf downloads](https://wkhtmltopdf.org/downloads.html) and follow the installation instructions for your operating system.
|
Install all required packages using:
|
||||||
- Clone the Repository
|
|
||||||
```bash
|
|
||||||
git clone https://github.com/Riyooo/Docs-Exporter.git
|
|
||||||
```
|
|
||||||
- Go into the Directory
|
|
||||||
```bash
|
|
||||||
cd Docs-Exporter
|
|
||||||
```
|
|
||||||
- Install Python Dependencies
|
|
||||||
```bash
|
```bash
|
||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Then install Playwright's browser:
|
||||||
|
```bash
|
||||||
|
playwright install chromium
|
||||||
|
```
|
||||||
|
|
||||||
|
## Setup
|
||||||
|
|
||||||
|
1. Clone this repository:
|
||||||
|
```bash
|
||||||
|
git clone
|
||||||
|
cd
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Install dependencies:
|
||||||
|
```bash
|
||||||
|
pip install -r requirements.txt
|
||||||
|
playwright install chromium
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Ensure you have a `styles.css` file in the same directory as the script. This file should contain your desired CSS styling for the PDF output.
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
To run the script, execute the following command from the root of the repository:
|
1. Basic usage with default settings (Next.js documentation):
|
||||||
```bash
|
```bash
|
||||||
python export-docs.py
|
python export-docs.py
|
||||||
```
|
```
|
||||||
|
|
||||||
|
2. The script will:
|
||||||
|
- Clone/update the specified Git repository
|
||||||
|
- Process all documentation files
|
||||||
|
- Generate a PDF with proper formatting
|
||||||
|
- Include a cover page and table of contents
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
You can modify these variables in the script for different configurations:
|
||||||
|
|
||||||
|
```python
|
||||||
|
repo_dir = "nextjs-docs" # Local directory for cloned repo
|
||||||
|
repo_url = "https://github.com/vercel/next.js.git" # Repository URL
|
||||||
|
branch = "canary" # Branch to clone
|
||||||
|
docs_dir = "docs" # Directory containing documentation
|
||||||
|
|
||||||
|
# Image URL transformation settings
|
||||||
|
Change_img_url = True
|
||||||
|
base_path = "https://nextjs.org/_next/image?url="
|
||||||
|
path_args = "&w=1920&q=75"
|
||||||
|
```
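How `base_path` and `path_args` are combined is not spelled out in this README. One plausible reading, sketched below, is that the original image path is URL-encoded and placed between the two. `build_image_url` is a hypothetical helper, and the encoding step is an assumption, not confirmed behavior of `process_image_paths`.

```python
from urllib.parse import quote

BASE_PATH = "https://nextjs.org/_next/image?url="
PATH_ARGS = "&w=1920&q=75"


def build_image_url(image_path, base_path=BASE_PATH, path_args=PATH_ARGS):
    """Wrap a site-relative image path in the image-optimizer URL (assumed scheme)."""
    # safe="" also encodes "/", since the path travels as a query-string value.
    return f"{base_path}{quote(image_path, safe='')}{path_args}"


url = build_image_url("/docs/light/example.png")
```

Setting `Change_img_url = False` skips the image-path rewrite entirely, so none of this applies in that case.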
|
||||||
|
|
||||||
|
## PDF Output Settings
|
||||||
|
|
||||||
|
The PDF generation includes:
|
||||||
|
- A4 format
|
||||||
|
- Custom margins
|
||||||
|
- Page numbers in header
|
||||||
|
- Generation date in footer
|
||||||
|
- Background colors/images
|
||||||
|
- Proper page breaks between sections
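The settings above correspond to keyword arguments of Playwright's `page.pdf`. The sketch below only builds the options dictionary, so it runs without a browser. `build_pdf_options` is a hypothetical helper, and the margin values mirror the defaults in the script.

```python
from datetime import date
from html import escape


def build_pdf_options(title, generated_on):
    """Return keyword arguments for Playwright's page.pdf()."""
    header = (
        '<div style="font-size: 10px; width: 100%; padding: 10px 20px;">'
        f'<span style="float: left;">{escape(title)}</span>'
        '<span style="float: right;">Page <span class="pageNumber"></span>'
        ' of <span class="totalPages"></span></span></div>'
    )
    footer = (
        '<div style="font-size: 10px; width: 100%; text-align: center;">'
        f"Generated on {generated_on.isoformat()}</div>"
    )
    return {
        "format": "A4",
        "margin": {"top": "50px", "right": "50px", "bottom": "50px", "left": "50px"},
        "print_background": True,       # keep background colors and images
        "display_header_footer": True,  # required for the templates to render
        "header_template": header,
        "footer_template": footer,
    }


options = build_pdf_options("Docs & Guides", date(2024, 1, 2))
```

The result can be unpacked the same way the script does it, as `page.pdf(path=output_pdf, **options)`.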
|
||||||
|
|
||||||
|
## File Organization
|
||||||
|
|
||||||
|
- `export-docs.py`: Main script file
|
||||||
|
- `requirements.txt`: Python dependencies
|
||||||
|
- `styles.css`: CSS styling for PDF output
|
||||||
|
- `README.md`: This documentation
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
1. If the PDF file is locked:
|
||||||
|
- Ensure the output PDF is not open in any application
|
||||||
|
- Check file permissions
|
||||||
|
|
||||||
|
2. If images are not loading:
|
||||||
|
- Verify internet connection
|
||||||
|
- Check if image URLs are accessible
|
||||||
|
- Adjust the `wait_for_load_state` timing if needed
|
||||||
|
|
||||||
|
3. If the repository won't clone:
|
||||||
|
- Verify Git is installed and accessible
|
||||||
|
- Check internet connection
|
||||||
|
- Ensure you have access to the repository
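For the locked-PDF case in item 1, the check the script performs can be tried in isolation. The sketch below repeats the append-mode probe from `is_file_open` and exercises it on a temporary directory. Note that the probe mainly catches locks on Windows, where another program can hold an exclusive handle; on Linux and macOS a PDF open in a viewer usually does not raise `PermissionError`, so a `False` result is not a guarantee.

```python
import os
import tempfile


def is_file_open(file_path):
    """Return True when the file exists but cannot be opened for appending."""
    if not os.path.exists(file_path):
        return False  # nothing to conflict with yet
    try:
        with open(file_path, "a"):
            pass
        return False
    except PermissionError:
        return True


with tempfile.TemporaryDirectory() as tmp_dir:
    pdf_path = os.path.join(tmp_dir, "output.pdf")
    missing_result = is_file_open(pdf_path)   # file not created yet
    with open(pdf_path, "wb") as handle:
        handle.write(b"%PDF-")
    closed_result = is_file_open(pdf_path)    # file exists and is closed
```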
|
||||||
|
|
||||||
|
## Notes
|
||||||
|
|
||||||
|
- The script creates temporary files during processing
|
||||||
|
- Large documentation sets may take several minutes to process
|
||||||
|
- Memory usage depends on the size of the documentation
|
||||||
|
- The script requires an active internet connection to clone the repository and load remote images
|
||||||
|
|
||||||
|
## CSS Recommendations
|
||||||
|
|
||||||
|
Your `styles.css` should include at least these basic styles for proper PDF formatting:
|
||||||
|
|
||||||
|
```css
|
||||||
|
body {
|
||||||
|
font-family: Arial, sans-serif;
|
||||||
|
line-height: 1.6;
|
||||||
|
margin: 0;
|
||||||
|
padding: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.master-container {
|
||||||
|
display: flex;
|
||||||
|
justify-content: center;
|
||||||
|
align-items: center;
|
||||||
|
min-height: 100vh;
|
||||||
|
}
|
||||||
|
|
||||||
|
.container {
|
||||||
|
text-align: center;
|
||||||
|
}
|
||||||
|
|
||||||
|
.title {
|
||||||
|
font-size: 24px;
|
||||||
|
font-weight: bold;
|
||||||
|
margin-bottom: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.date {
|
||||||
|
font-size: 16px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.page-break {
|
||||||
|
page-break-after: always;
|
||||||
|
}
|
||||||
|
|
||||||
|
code {
|
||||||
|
background-color: #f4f4f4;
|
||||||
|
padding: 2px 4px;
|
||||||
|
border-radius: 4px;
|
||||||
|
}
|
||||||
|
|
||||||
|
pre {
|
||||||
|
background-color: #f8f8f8;
|
||||||
|
padding: 15px;
|
||||||
|
border-radius: 5px;
|
||||||
|
overflow-x: auto;
|
||||||
|
}
|
||||||
|
|
||||||
|
.code-header {
|
||||||
|
background-color: #e0e0e0;
|
||||||
|
padding: 5px 10px;
|
||||||
|
border-radius: 5px 5px 0 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
table {
|
||||||
|
border-collapse: collapse;
|
||||||
|
width: 100%;
|
||||||
|
margin: 15px 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
th, td {
|
||||||
|
border: 1px solid #ddd;
|
||||||
|
padding: 8px;
|
||||||
|
text-align: left;
|
||||||
|
}
|
||||||
|
|
||||||
|
th {
|
||||||
|
background-color: #f5f5f5;
|
||||||
|
}
|
||||||
|
```
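The `.code-header` rule above styles the filename label shown on top of a code block. How `preprocess_code_blocks` produces that label is not shown in this README; the sketch below is one possible approach, assuming fences annotated like `tsx filename="app/page.tsx"`. `add_code_headers` is a hypothetical name and the exact markup is an assumption.

```python
import re
from html import escape

# Three backticks, built this way so the example's own fence stays intact.
FENCE = "`" * 3

# Matches an opening fence such as: <fence>tsx filename="app/page.tsx" switcher
ANNOTATED_FENCE = re.compile(
    r"^" + re.escape(FENCE) + r'(\w+)[^\n]*?\bfilename="([^"]+)"[^\n]*$',
    re.MULTILINE,
)


def add_code_headers(md_text):
    """Put a .code-header div above each fence that carries a filename annotation."""

    def replace(match):
        lang, filename = match.group(1), match.group(2)
        header = f'<div class="code-header">{escape(filename)}</div>'
        # The blank line lets a Markdown parser treat the fence as a new block.
        return f"{header}\n\n{FENCE}{lang}"

    return ANNOTATED_FENCE.sub(replace, md_text)


source = (
    FENCE + 'tsx filename="app/page.tsx" switcher\n'
    "export default function Page() {}\n" + FENCE + "\n"
)
converted = add_code_headers(source)
```

Fences without a `filename` annotation, and closing fences, are left untouched because the pattern requires both a language and a filename.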
|
||||||
183
export-docs.py
@@ -1,6 +1,5 @@
|
|||||||
import os
|
import os
|
||||||
import markdown
|
import markdown
|
||||||
import pdfkit
|
|
||||||
import tempfile
|
import tempfile
|
||||||
import yaml
|
import yaml
|
||||||
import re
|
import re
|
||||||
@@ -9,6 +8,7 @@ from git import Repo, RemoteProgress
|
|||||||
from datetime import datetime
|
from datetime import datetime
|
||||||
from packaging import version
|
from packaging import version
|
||||||
from tqdm import tqdm
|
from tqdm import tqdm
|
||||||
|
from playwright.sync_api import sync_playwright
|
||||||
|
|
||||||
|
|
||||||
def process_image_paths(md_content):
|
def process_image_paths(md_content):
|
||||||
@@ -76,14 +76,13 @@ class CloneProgress(RemoteProgress):
|
|||||||
def update(self, op_code, cur_count, max_count=None, message=''):
|
def update(self, op_code, cur_count, max_count=None, message=''):
|
||||||
if max_count is not None:
|
if max_count is not None:
|
||||||
self.pbar.total = max_count
|
self.pbar.total = max_count
|
||||||
self.pbar.update(cur_count - self.pbar.n) # increment the pbar with the increment
|
self.pbar.update(cur_count - self.pbar.n)
|
||||||
|
|
||||||
def finalize(self):
|
def finalize(self):
|
||||||
self.pbar.close()
|
self.pbar.close()
|
||||||
|
|
||||||
# Clone a specific directory of a repository / branch
|
|
||||||
def clone_repo(repo_url, branch, docs_dir, repo_dir):
|
def clone_repo(repo_url, branch, docs_dir, repo_dir):
|
||||||
# Initialize and configure the repository for sparse checkout
|
|
||||||
if not os.path.isdir(repo_dir):
|
if not os.path.isdir(repo_dir):
|
||||||
os.makedirs(repo_dir, exist_ok=True)
|
os.makedirs(repo_dir, exist_ok=True)
|
||||||
print("Cloning repository...")
|
print("Cloning repository...")
|
||||||
@@ -91,17 +90,13 @@ def clone_repo(repo_url, branch, docs_dir, repo_dir):
|
|||||||
with repo.config_writer() as git_config:
|
with repo.config_writer() as git_config:
|
||||||
git_config.set_value("core", "sparseCheckout", "true")
|
git_config.set_value("core", "sparseCheckout", "true")
|
||||||
|
|
||||||
# Define the sparse checkout settings
|
|
||||||
with open(os.path.join(repo_dir, ".git/info/sparse-checkout"), "w") as sparse_checkout_file:
|
with open(os.path.join(repo_dir, ".git/info/sparse-checkout"), "w") as sparse_checkout_file:
|
||||||
sparse_checkout_file.write(f"/{docs_dir}\n")
|
sparse_checkout_file.write(f"/{docs_dir}\n")
|
||||||
|
|
||||||
# Pull the specific directory from the repository
|
|
||||||
origin = repo.create_remote("origin", repo_url)
|
origin = repo.create_remote("origin", repo_url)
|
||||||
origin.fetch(progress=CloneProgress())
|
origin.fetch(progress=CloneProgress())
|
||||||
repo.git.checkout(branch)
|
repo.git.checkout(branch)
|
||||||
print("Repository cloned.")
|
print("Repository cloned.")
|
||||||
|
|
||||||
# Update the repository if it already exists
|
|
||||||
else:
|
else:
|
||||||
print("Repository already exists. Updating...")
|
print("Repository already exists. Updating...")
|
||||||
repo = Repo(repo_dir)
|
repo = Repo(repo_dir)
|
||||||
@@ -114,54 +109,37 @@ def clone_repo(repo_url, branch, docs_dir, repo_dir):
|
|||||||
|
|
||||||
def is_file_open(file_path):
|
def is_file_open(file_path):
|
||||||
if not os.path.exists(file_path):
|
if not os.path.exists(file_path):
|
||||||
return False # File does not exist, so it's not open
|
return False
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# Try to open the file in append mode. If the file is open in another program, this might fail
|
|
||||||
with open(file_path, 'a'):
|
with open(file_path, 'a'):
|
||||||
pass
|
pass
|
||||||
return False
|
return False
|
||||||
except PermissionError:
|
except PermissionError:
|
||||||
# If a PermissionError is raised, it's likely the file is open elsewhere
|
|
||||||
return True
|
return True
|
||||||
|
|
||||||
|
|
||||||
def get_files_sorted(root_dir):
|
def get_files_sorted(root_dir):
|
||||||
all_files = []
|
all_files = []
|
||||||
|
|
||||||
# Step 1: Traverse the directory structure
|
|
||||||
for root, _, files in os.walk(root_dir):
|
for root, _, files in os.walk(root_dir):
|
||||||
for file in files:
|
for file in files:
|
||||||
full_path = os.path.join(root, file)
|
full_path = os.path.join(root, file)
|
||||||
|
|
||||||
# Step 2: Prioritize 'index.mdx' or 'index.md' within the same folder
|
|
||||||
modified_basename = '!!!' + file if file in ['index.mdx', 'index.md'] else file
|
modified_basename = '!!!' + file if file in ['index.mdx', 'index.md'] else file
|
||||||
sort_key = os.path.join(root, modified_basename)
|
sort_key = os.path.join(root, modified_basename)
|
||||||
|
|
||||||
# Add tuple to the list
|
|
||||||
all_files.append((full_path, sort_key))
|
all_files.append((full_path, sort_key))
|
||||||
|
|
||||||
# Step 3: Perform a global sort based on modified basename
|
|
||||||
all_files.sort(key=lambda x: x[1])
|
all_files.sort(key=lambda x: x[1])
|
||||||
|
|
||||||
# Step 4: Return the full paths in sorted order
|
|
||||||
return [full_path for full_path, _ in all_files]
|
return [full_path for full_path, _ in all_files]
|
||||||
|
|
||||||
|
|
||||||
def preprocess_frontmatter(frontmatter):
|
def preprocess_frontmatter(frontmatter):
|
||||||
# Dictionary to store HTML tags and their placeholders
|
|
||||||
html_tags = {}
|
html_tags = {}
|
||||||
|
|
||||||
# Function to replace HTML tags with placeholders
|
|
||||||
def replace_tag(match):
|
def replace_tag(match):
|
||||||
tag = match.group(0)
|
tag = match.group(0)
|
||||||
placeholder = f"HTML_TAG_{len(html_tags)}"
|
placeholder = f"HTML_TAG_{len(html_tags)}"
|
||||||
html_tags[placeholder] = tag
|
html_tags[placeholder] = tag
|
||||||
return placeholder
|
return placeholder
|
||||||
|
|
||||||
# Replace HTML tags with placeholders
|
|
||||||
modified_frontmatter = re.sub(r'<[^>]+>', replace_tag, frontmatter)
|
modified_frontmatter = re.sub(r'<[^>]+>', replace_tag, frontmatter)
|
||||||
|
|
||||||
return modified_frontmatter, html_tags
|
return modified_frontmatter, html_tags
|
||||||
|
|
||||||
|
|
||||||
@@ -171,18 +149,15 @@ def restore_html_tags(parsed_data, html_tags):
|
|||||||
if isinstance(value, str):
|
if isinstance(value, str):
|
||||||
for placeholder, tag in html_tags.items():
|
for placeholder, tag in html_tags.items():
|
||||||
value = value.replace(placeholder, tag)
|
value = value.replace(placeholder, tag)
|
||||||
# if key == 'title': # Escape HTML characters for titles
|
|
||||||
value = html.escape(value)
|
value = html.escape(value)
|
||||||
parsed_data[key] = value
|
parsed_data[key] = value
|
||||||
return parsed_data
|
return parsed_data
|
||||||
|
|
||||||
|
|
||||||
def process_files(files, repo_dir, docs_dir):
|
def process_files(files, repo_dir, docs_dir):
|
||||||
# Initialize the Table of Contents
|
toc = ""
|
||||||
toc = ""
|
|
||||||
html_all_pages_content = ""
|
html_all_pages_content = ""
|
||||||
|
|
||||||
# Initialize an empty string to hold all the HTML content & Include the main CSS directly in the HTML
|
|
||||||
html_header = f"""
|
html_header = f"""
|
||||||
<html>
|
<html>
|
||||||
<head>
|
<head>
|
||||||
@@ -193,63 +168,43 @@ def process_files(files, repo_dir, docs_dir):
|
|||||||
<body>
|
<body>
|
||||||
"""
|
"""
|
||||||
|
|
||||||
numbering = [0] # Starting with the first level
|
numbering = [0]
|
||||||
|
|
||||||
for index, file_path in enumerate(files):
|
for index, file_path in enumerate(files):
|
||||||
with open(file_path, 'r', encoding='utf8') as f:
|
with open(file_path, 'r', encoding='utf8') as f:
|
||||||
md_content = f.read()
|
md_content = f.read()
|
||||||
|
|
||||||
# Process the markdown content for image paths
|
|
||||||
if Change_img_url:
|
if Change_img_url:
|
||||||
md_content = process_image_paths(md_content)
|
md_content = process_image_paths(md_content)
|
||||||
|
|
||||||
# Process the markdown content for non standard code blocks
|
|
||||||
md_content = preprocess_code_blocks(md_content)
|
md_content = preprocess_code_blocks(md_content)
|
||||||
|
|
||||||
# Parse the frontmatter and markdown
|
|
||||||
frontmatter, md_content = parse_frontmatter(md_content)
|
frontmatter, md_content = parse_frontmatter(md_content)
|
||||||
|
|
||||||
if frontmatter:
|
if frontmatter:
|
||||||
# Preprocessing: replaces HTML tags with unique placeholders and stores the mappings
|
|
||||||
frontmatter, html_tags = preprocess_frontmatter(frontmatter)
|
frontmatter, html_tags = preprocess_frontmatter(frontmatter)
|
||||||
|
|
||||||
# Parse the YAML frontmatter
|
|
||||||
data = safe_load_frontmatter(frontmatter)
|
data = safe_load_frontmatter(frontmatter)
|
||||||
if data is not None:
|
if data is not None:
|
||||||
|
|
||||||
# Preprocessing: After parsing the YAML, restore the HTML tags in place of the placeholders
|
|
||||||
data = restore_html_tags(data, html_tags)
|
data = restore_html_tags(data, html_tags)
|
||||||
|
|
||||||
# Depth Level: Calculate relative path, directory depth and TOC
|
|
||||||
rel_path = os.path.relpath(file_path, os.path.join(repo_dir, docs_dir))
|
rel_path = os.path.relpath(file_path, os.path.join(repo_dir, docs_dir))
|
||||||
|
depth = rel_path.count(os.sep)
|
||||||
# Depth Level: Calculate the depth of each section
|
file_basename = os.path.basename(file_path)
|
||||||
depth = rel_path.count(os.sep) # Count separators to determine depth
|
|
||||||
file_basename = os.path.basename(file_path)
|
|
||||||
if file_basename.startswith("index.") and depth > 0:
|
if file_basename.startswith("index.") and depth > 0:
|
||||||
depth += -1 # or another title for the main index
|
depth += -1
|
||||||
indent = ' ' * 5 * depth # Adjust indentation based on depth
|
indent = ' ' * 5 * depth
|
||||||
|
|
||||||
# Numbering: Ensure numbering has enough levels
|
|
||||||
while len(numbering) <= depth:
|
while len(numbering) <= depth:
|
||||||
numbering.append(0)
|
numbering.append(0)
|
||||||
|
|
||||||
# Numbering: Increment at the current level
|
|
||||||
numbering[depth] += 1
|
numbering[depth] += 1
|
||||||
|
|
||||||
# Numbering: Reset for any lower levels
|
|
||||||
for i in range(depth + 1, len(numbering)):
|
for i in range(depth + 1, len(numbering)):
|
||||||
numbering[i] = 0
|
numbering[i] = 0
|
||||||
|
|
||||||
# Numbering: Create entry
|
|
||||||
toc_numbering = f"{'.'.join(map(str, numbering[:depth + 1]))}"
|
toc_numbering = f"{'.'.join(map(str, numbering[:depth + 1]))}"
|
||||||
|
|
||||||
# TOC: Generate the section title
|
|
||||||
toc_title = data.get('title', os.path.splitext(os.path.basename(file_path))[0].title())
|
toc_title = data.get('title', os.path.splitext(os.path.basename(file_path))[0].title())
|
||||||
toc_full_title = f"{toc_numbering} - {toc_title}"
|
toc_full_title = f"{toc_numbering} - {toc_title}"
|
||||||
toc += f"{indent}<a href='#{toc_full_title}'>{toc_full_title}</a><br/>"
|
toc += f"{indent}<a href='#{toc_full_title}'>{toc_full_title}</a><br/>"
|
||||||
|
|
||||||
# Page Content: Format the parsed YAML to HTML
|
|
||||||
html_page_content = f"""
|
html_page_content = f"""
|
||||||
<h1>{toc_full_title}</h1>
|
<h1>{toc_full_title}</h1>
|
||||||
<div class="doc-path"><p>Documentation path: {file_path.replace(chr(92),'/').replace('.mdx', '').replace(repo_dir + '/' + docs_dir,'')}</p></div>
|
<div class="doc-path"><p>Documentation path: {file_path.replace(chr(92),'/').replace('.mdx', '').replace(repo_dir + '/' + docs_dir,'')}</p></div>
|
||||||
@@ -268,78 +223,99 @@ def process_files(files, repo_dir, docs_dir):
|
|||||||
</div>
|
</div>
|
||||||
"""
|
"""
|
||||||
html_page_content += '</br>'
|
html_page_content += '</br>'
|
||||||
|
|
||||||
else:
|
else:
|
||||||
html_page_content = ""
|
html_page_content = ""
|
||||||
else:
|
else:
|
||||||
html_page_content = ""
|
html_page_content = ""
|
||||||
|
|
||||||
# Convert Markdown to HTML with table support and add content to the identified header
|
|
||||||
html_page_content += markdown.markdown(md_content, extensions=['fenced_code', 'codehilite', 'tables', 'footnotes', 'toc', 'abbr', 'attr_list', 'def_list', 'smarty', 'admonition'])
|
html_page_content += markdown.markdown(md_content, extensions=['fenced_code', 'codehilite', 'tables', 'footnotes', 'toc', 'abbr', 'attr_list', 'def_list', 'smarty', 'admonition'])
|
||||||
|
|
||||||
# Add page content to all cumulated pages content
|
|
||||||
html_all_pages_content += html_page_content
|
html_all_pages_content += html_page_content
|
||||||
|
|
||||||
# Add a page break unless it is the last file
|
|
||||||
if index < len(files) - 1:
|
if index < len(files) - 1:
|
||||||
html_all_pages_content += '<div class="page-break"></div>'
|
html_all_pages_content += '<div class="page-break"></div>'
|
||||||
|
|
||||||
# Prepend the ToC to the beginning of the HTML content
|
|
||||||
toc_html = f"""<div style="padding-bottom: 10px"><div style="padding-bottom: 20px"><h1>Table of Contents</h1></div>{toc}</div><div style="page-break-before: always;">"""
|
toc_html = f"""<div style="padding-bottom: 10px"><div style="padding-bottom: 20px"><h1>Table of Contents</h1></div>{toc}</div><div style="page-break-before: always;">"""
|
||||||
html_all_content = toc_html + html_all_pages_content
|
html_all_content = toc_html + html_all_pages_content
|
||||||
|
|
||||||
# Finalize html formatting
|
html_all_pages_content = html_header + html_all_pages_content + "</body></html>"
|
||||||
html_all_pages_content = html_header + html_all_pages_content + "</body></html>"
|
toc_html = html_header + toc_html + "</body></html>"
|
||||||
toc_html = html_header + toc_html + "</body></html>"
|
html_all_content = html_header + html_all_content + "</body></html>"
|
||||||
html_all_content = html_header + html_all_content + "</body></html>"
|
|
||||||
|
|
||||||
return(html_all_content, toc_html, html_all_pages_content)
|
return(html_all_content, toc_html, html_all_pages_content)
|
||||||
|
|
||||||
|
|
||||||
def find_latest_version(html_content):
|
def find_latest_version(html_content):
|
||||||
# Regular expression to find versions like v14.2.0
|
|
||||||
version_pattern = re.compile(r"v(\d+\.\d+\.\d+)")
|
version_pattern = re.compile(r"v(\d+\.\d+\.\d+)")
|
||||||
versions = version_pattern.findall(html_content)
|
versions = version_pattern.findall(html_content)
|
||||||
# Remove duplicates and sort versions
|
|
||||||
unique_versions = sorted(set(versions), key=lambda v: version.parse(v), reverse=True)
|
unique_versions = sorted(set(versions), key=lambda v: version.parse(v), reverse=True)
|
||||||
return unique_versions[0] if unique_versions else None
|
return unique_versions[0] if unique_versions else None
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
def generate_pdf(html_content, output_pdf, format_options=None):
|
||||||
|
"""
|
||||||
|
Generate PDF from HTML content using Playwright
|
||||||
|
"""
|
||||||
|
default_format = {
|
||||||
|
'format': 'A4',
|
||||||
|
'margin': {
|
||||||
|
'top': '50px',
|
||||||
|
'right': '50px',
|
||||||
|
'bottom': '50px',
|
||||||
|
'left': '50px'
|
||||||
|
},
|
||||||
|
'print_background': True,
|
||||||
|
'display_header_footer': True,
|
||||||
|
'header_template': '<div style="font-size: 10px; text-align: right; width: 100%; padding-right: 20px; margin-top: 20px;"><span class="pageNumber"></span> of <span class="totalPages"></span></div>',
|
||||||
|
'footer_template': '<div style="font-size: 10px; text-align: center; width: 100%; margin-bottom: 20px;"><span class="url"></span></div>'
|
||||||
|
}
|
||||||
|
|
||||||
|
format_options = format_options or default_format
|
||||||
|
|
||||||
# Define the output PDF file name
|
with sync_playwright() as p:
|
||||||
# project_title = "Next.js v14 Documentation"
|
browser = p.chromium.launch()
|
||||||
# output_pdf = "Next.js_v14_Documentation.pdf"
|
page = browser.new_page()
|
||||||
|
|
||||||
|
# Set viewport size to ensure consistent rendering
|
||||||
|
page.set_viewport_size({"width": 1280, "height": 1024})
|
||||||
|
|
||||||
|
# Set content and wait for network idle
|
||||||
|
page.set_content(html_content, wait_until='networkidle')
|
||||||
|
|
||||||
|
# Wait for any images and fonts to load
|
||||||
|
page.wait_for_load_state('networkidle')
|
||||||
|
page.wait_for_load_state('domcontentloaded')
|
||||||
|
|
||||||
|
# Generate PDF
|
||||||
|
page.pdf(path=output_pdf, **format_options)
|
||||||
|
|
||||||
|
browser.close()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
export_html = False
|
export_html = False
|
||||||
|
|
||||||
# Clone the repository and checkout the canary branch
|
|
||||||
repo_dir = "nextjs-docs"
|
repo_dir = "nextjs-docs"
|
||||||
repo_url = "https://github.com/vercel/next.js.git"
|
repo_url = "https://github.com/vercel/next.js.git"
|
||||||
branch = "canary"
|
branch = "canary"
|
||||||
docs_dir = "docs"
|
docs_dir = "docs"
|
||||||
|
|
||||||
# Define a base path and quality for the image URLs
|
|
||||||
Change_img_url = True
|
Change_img_url = True
|
||||||
base_path = "https://nextjs.org/_next/image?url="
|
base_path = "https://nextjs.org/_next/image?url="
|
||||||
path_args = "&w=1920&q=75"
|
path_args = "&w=1920&q=75"
|
||||||
|
|
||||||
# Clone the repository
|
|
||||||
clone_repo(repo_url, branch, docs_dir, repo_dir)
|
clone_repo(repo_url, branch, docs_dir, repo_dir)
|
||||||
|
|
||||||
# Traverse the docs directory and convert each markdown file to HTML
|
print("Converting the Documentation to HTML...")
|
||||||
print ("Converting the Documentation to HTML...")
|
|
||||||
docs_dir_full_path = os.path.join(repo_dir, docs_dir)
|
docs_dir_full_path = os.path.join(repo_dir, docs_dir)
|
||||||
files_to_process = get_files_sorted(docs_dir_full_path)
|
files_to_process = get_files_sorted(docs_dir_full_path)
|
||||||
html_all_content, _, _ = process_files(files_to_process, repo_dir, docs_dir)
|
html_all_content, _, _ = process_files(files_to_process, repo_dir, docs_dir)
|
||||||
print("Converted all MDX to HTML.")
|
print("Converted all MDX to HTML.")
|
||||||
|
|
||||||
# Save the HTML content to a file for inspection
|
|
||||||
if export_html:
|
if export_html:
|
||||||
with open('output.html', 'w', encoding='utf8') as f:
|
with open('output.html', 'w', encoding='utf8') as f:
|
||||||
f.write(html_all_content)
|
f.write(html_all_content)
|
||||||
print("HTML Content exported.")
|
print("HTML Content exported.")
|
||||||
|
|
||||||
# Find the latest version in the HTML content
|
|
||||||
latest_version = find_latest_version(html_all_content)
|
latest_version = find_latest_version(html_all_content)
|
||||||
if latest_version:
|
if latest_version:
|
||||||
project_title = f"""Next.js Documentation v{latest_version}"""
|
project_title = f"""Next.js Documentation v{latest_version}"""
|
||||||
@@ -348,7 +324,6 @@ if __name__ == "__main__":
|
|||||||
project_title = "Next.js Documentation"
|
project_title = "Next.js Documentation"
|
||||||
output_pdf = "Next.js_Documentation.pdf"
|
output_pdf = "Next.js_Documentation.pdf"
|
||||||
|
|
||||||
# Define the cover HTML with local CSS file
|
|
||||||
cover_html = f"""
|
cover_html = f"""
|
||||||
<html>
|
<html>
|
||||||
<head>
|
<head>
|
||||||
@@ -367,26 +342,38 @@ if __name__ == "__main__":
|
|||||||
</html>
|
</html>
|
||||||
"""
|
"""
|
||||||
|
|
||||||
# Write the cover HTML to a temporary file
|
format_options = {
|
||||||
with tempfile.NamedTemporaryFile(delete=False, suffix='.html') as cover_file:
|
'format': 'A4',
|
||||||
cover_file.write(cover_html.encode('utf-8'))
|
'margin': {
|
||||||
print("HTML Cover exported.")
|
'top': '50px',
|
||||||
|
'right': '50px',
|
||||||
|
'bottom': '50px',
|
||||||
|
'left': '50px'
|
||||||
|
},
|
||||||
|
'print_background': True,
|
||||||
|
'display_header_footer': True,
|
||||||
|
'header_template': f'''
|
||||||
|
<div style="font-size: 10px; padding: 10px 20px; margin-top: 20px;">
|
||||||
|
<span style="float: left;">{project_title}</span>
|
||||||
|
<span style="float: right;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></span>
|
||||||
|
</div>
|
||||||
|
''',
|
||||||
|
'footer_template': f'''
|
||||||
|
<div style="font-size: 10px; padding: 10px 20px; margin-bottom: 20px; text-align: center;">
|
||||||
|
Generated on {datetime.now().strftime("%Y-%m-%d")}
|
||||||
|
</div>
|
||||||
|
'''
|
||||||
|
}
|
||||||
|
|
||||||
# Convert the combined HTML content to PDF with a cover and a table of contents
|
# Check if file is open
|
||||||
if is_file_open(output_pdf):
|
if is_file_open(output_pdf):
|
||||||
print("The output file is already open in another process. Please close it and try again.")
|
print("The output file is already open in another process. Please close it and try again.")
|
||||||
else:
|
else:
|
||||||
options = {
|
try:
|
||||||
'encoding': 'UTF-8',
|
print("Generating PDF...")
|
||||||
'page-size': 'A4',
|
# Generate PDF with cover page and content
|
||||||
'quiet': '',
|
generate_pdf(cover_html + html_all_content, output_pdf, format_options)
|
||||||
'image-dpi': 150, # General reco.: printer - hq, 300 dpi| ebook - low quality, 150 dpi| screen-view-only quality, 72 dpi
|
print("Created the PDF file successfully.")
|
||||||
'image-quality': 75,
|
|
||||||
# 'no-outline': None,
|
|
||||||
# 'no-images': None,
|
|
||||||
}
|
|
||||||
pdfkit.from_string(html_all_content, output_pdf, options=options, cover=cover_file.name, toc={})
|
|
||||||
print("Created the PDF file successfully.")
|
|
||||||
|
|
||||||
# Delete the temporary file
|
except Exception as e:
|
||||||
os.unlink(cover_file.name)
|
print(f"Error generating PDF: {str(e)}")
|
||||||
@@ -1,6 +1,7 @@
-GitPython
-Markdown
-pdfkit
-PyYAML
-packaging
-tqdm
+# requirements.txt
+gitpython==3.1.40
+markdown==3.5.1
+packaging==23.2
+playwright==1.40.0
+PyYAML==6.0.1
+tqdm==4.66.1