MinerU Documentation

Posted on 2025-03-18 in AI, MinerU

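Basic Information

GitHub: MinerU
Gitee: MinerU (a personal mirror, kept as a backup because of network issues)
Reposted from: MinerU
Kept here as a study note and to guard against loss.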

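Project Introduction

MinerU is a tool that converts PDFs into machine-readable formats (such as Markdown and JSON), so that content can easily be extracted into any format. MinerU was born during the pre-training of InternLM. We will focus on solving symbol conversion problems in scientific literature and hope to contribute to technological progress in the era of large models. Compared with well-known commercial products in China and abroad, MinerU is still young. If you run into problems or the results fall short of expectations, please submit an issue and attach the relevant PDF.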

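Key Features

Removes headers, footers, footnotes, page numbers, and similar elements to keep the text semantically coherent
Outputs text in human reading order, for single-column, multi-column, and complex layouts
Preserves the structure of the original document, including headings, paragraphs, and lists
Extracts images, image captions, tables, table titles, and footnotes
Automatically detects formulas in the document and converts them to LaTeX
Automatically detects tables in the document and converts them to LaTeX or HTML
Automatically detects scanned PDFs and garbled PDFs and enables OCR for them
OCR supports detection and recognition of 84 languages
Supports multiple output formats, such as Markdown for multimodal and NLP use, JSON sorted by reading order, and an information-rich intermediate format
Supports multiple visualizations, including layout visualization and span visualization, for efficient output checking and quality inspection
Supports CPU and GPU environments
Compatible with Windows, Linux, and Mac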

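Installation

If you run into any installation problems, consult the FAQ first. If the parsing results are not as expected, see the known issues.

Warning

Read before installing: supported hardware and software environments

To keep the project stable and reliable, we optimize and test only against specific hardware and software environments during development. Users who deploy and run the project on a recommended configuration get the best performance and the fewest compatibility problems.

By concentrating resources and effort on the mainline environment, our team can resolve potential bugs more efficiently and deliver new features promptly.

In non-mainline environments, the variety of hardware and software configurations and the compatibility issues of third-party dependencies mean we cannot guarantee that the project is 100% usable. If you want to use the project in a non-recommended environment, we suggest reading the documentation and the FAQ carefully first, since most problems already have a solution in the FAQ. Beyond that, we encourage the community to report issues so that we can gradually widen the range of supported environments.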

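Operating system	Ubuntu 22.04 LTS	Windows 10 / 11	macOS 11+
CPU	x86_64 (ARM Linux not yet supported)	x86_64 (ARM Windows not yet supported)	x86_64 / arm64
Memory	At least 16 GB, 32 GB or more recommended	At least 16 GB, 32 GB or more recommended	At least 16 GB, 32 GB or more recommended
Python version	3.10 (be sure to create the 3.10 virtual environment with conda)	3.10 (be sure to create the 3.10 virtual environment with conda)	3.10 (be sure to create the 3.10 virtual environment with conda)
Nvidia driver version	latest (proprietary driver)	latest	None
CUDA environment	Installed automatically [12.1 (pytorch) + 11.8 (paddle)]	11.8 (manual install) + cuDNN v8.7.0 (manual install)	None
GPU hardware support, minimum requirement (8 GB+ VRAM)	3060ti/3070/4060; 8 GB of VRAM can enable layout, formula recognition, and OCR acceleration	3060ti/3070/4060; 8 GB of VRAM can enable layout, formula recognition, and OCR acceleration	None
GPU hardware support, recommended configuration (10 GB+ VRAM)	3080/3080ti/3090/3090ti 4070/4070ti/4070tisuper/4080/4090; 10 GB of VRAM or more can enable layout, formula recognition, OCR acceleration, and table recognition acceleration at the same time	3080/3080ti/3090/3090ti 4070/4070ti/4070tisuper/4080/4090; 10 GB of VRAM or more can enable layout, formula recognition, OCR acceleration, and table recognition acceleration at the same time	None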
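
The VRAM thresholds in the table can be read as a simple lookup. The sketch below is only an illustration of that reading: the function name `accelerations_for_vram` and the feature labels are hypothetical, and nothing here is part of MinerU itself.

```python
def accelerations_for_vram(vram_gb: float) -> list[str]:
    """Return the accelerations the table lists for a given amount of VRAM.

    Hypothetical helper: the 8 GB and 10 GB thresholds come from the table,
    the feature labels are illustrative names only.
    """
    features: list[str] = []
    if vram_gb >= 8:
        # 8 GB or more: layout, formula recognition, and OCR acceleration
        features += ["layout", "formula", "ocr"]
    if vram_gb >= 10:
        # 10 GB or more: table recognition acceleration as well
        features.append("table")
    return features


print(accelerations_for_vram(8))
print(accelerations_for_vram(24))
```

A card below 8 GB returns an empty list, which matches the table's statement that 8 GB is the minimum for GPU acceleration.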

Create the environment

conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple

Download model weight files

pip install huggingface_hub
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py

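Using CUDA acceleration

If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Choose the guide that fits your system: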

Ubuntu 22.04 LTS
Windows 10/11
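Quick deployment with Docker

Important

Docker requires a GPU with at least 16 GB of VRAM, and all acceleration features are enabled by default.

Before running this Docker container, you can use the following command to check whether your device supports CUDA acceleration in Docker.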

docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:latest .
docker run --rm -it --gpus=all mineru:latest /bin/bash
magic-pdf --help

Ubuntu 22.04 LTS

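Check whether the NVIDIA driver is installed

Run nvidia-smi. If you see output similar to the following, the NVIDIA driver is already installed and you can skip step 2 (installing the driver)

Important

The version shown after "CUDA Version" should be >= 12.1. If it is lower than 12.1, upgrade the driver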

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.34                 Driver Version: 537.34       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   51C    P8              12W / 200W |   1489MiB /  8192MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

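Install the driver

If no driver is installed, run the following commands

sudo apt-get update
sudo apt-get install nvidia-driver-545

to install the proprietary driver. Reboot the computer once installation finishes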

Install Anaconda

If conda is already installed, you can skip this step

wget -U NoSuchBrowser/1.0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh

At the last prompt, enter yes, then close the terminal and open it again

Create the environment with conda

The Python version must be 3.10

conda create -n MinerU python=3.10
conda activate MinerU

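Install the application

pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple

Important

After the download completes, be sure to confirm that the magic-pdf version is correct with the following command

magic-pdf --version

If the version number is lower than 0.7.0, please report it to us in an issue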
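
The comparison against 0.7.0 should be done on numbers, not on text, because "0.10.0" sorts before "0.7.0" as a string. The sketch below assumes the version is a plain dotted number with no suffix; the helper names are hypothetical, and the exact text printed by the version command is not covered here.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '0.7.0' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_supported(installed: str, minimum: str = "0.7.0") -> bool:
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


for candidate in ("0.6.1", "0.7.0", "0.10.0"):
    print(candidate, is_supported(candidate))
```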

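Download the models

For details, see "Download model weight files"

Where the configuration file is stored

The script automatically generates a magic-pdf.json file in the user directory and configures the default model path. You can find magic-pdf.json in your user directory.

Tip

On Linux the user directory is "/home/username"

First run

Download a sample file from the repository and test it

wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/demo/small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output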

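Test CUDA acceleration

If your graphics card has at least 8 GB of VRAM, you can follow the steps below to test CUDA-accelerated parsing

1. Change the value of "device-mode" in the magic-pdf.json configuration file in your user directory

2. Run the following command to test the CUDA acceleration

magic-pdf -p small_ocr.pdf -o ./output

Tip

You can roughly tell whether CUDA acceleration is working from the per-stage cost times printed in the log. Normally, layout detection cost and mfr time should be more than 10 times faster.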
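
Editing the "device-mode" value by hand works, but it can also be done with a few lines of Python. This is a minimal sketch under stated assumptions: the helper name `set_device_mode` is hypothetical, and the values "cpu" and "cuda" are assumed here because this page does not list the accepted values.

```python
import json
import tempfile
from pathlib import Path


def set_device_mode(config_path: Path, mode: str) -> dict:
    """Rewrite the "device-mode" key of a JSON config file, keeping other keys."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["device-mode"] = mode
    config_path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config


# Demonstration on a throwaway file instead of the real magic-pdf.json.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "magic-pdf.json"
    demo.write_text('{"device-mode": "cpu"}', encoding="utf-8")  # assumed value
    print(set_device_mode(demo, "cuda")["device-mode"])  # assumed value
```

To apply it to the real configuration, pass the path of magic-pdf.json in your user directory, ideally after making a backup copy.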

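Enable CUDA acceleration for OCR

1. Install paddlepaddle-gpu; OCR acceleration is enabled automatically once installation completes

python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/

2. Run the following command to test the OCR acceleration

magic-pdf -p small_ocr.pdf -o ./output

Tip

You can roughly tell whether CUDA acceleration is working from the per-stage cost times printed in the log. Normally, ocr cost should be more than 10 times faster.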

Windows 10/11

Install CUDA and cuDNN

Required versions: CUDA 11.8 + cuDNN 8.7.0

CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x https://developer.nvidia.com/rdp/cudnn-archive

Install Anaconda

If conda is already installed, you can skip this step

Download link: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Windows-x86_64.exe

Create the environment with conda

The Python version must be 3.10

conda create -n MinerU python=3.10
conda activate MinerU

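Install the application

pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple

Important

After the download completes, be sure to confirm that the magic-pdf version is correct with the following command

magic-pdf --version

If the version number is lower than 0.7.0, please report it to us in an issue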

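Download the models

For details, see "Download model weight files"

Where the configuration file is stored

The script automatically generates a magic-pdf.json file in the user directory and configures the default model path. You can find magic-pdf.json in your user directory.

Tip

On Windows the user directory is "C:/Users/username"

First run

Download a sample file from the repository and test it

wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output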

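Test CUDA acceleration

If your graphics card has at least 8 GB of VRAM, you can follow the steps below to test CUDA-accelerated parsing

1. Reinstall torch and torchvision with CUDA support, overwriting the existing install

pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118

Important

Be sure to specify the following versions in the command

torch==2.3.1 torchvision==0.18.1

These are the highest versions we support. If no version is specified, a newer version is installed automatically and the program will not run

2. Change the value of "device-mode" in the magic-pdf.json configuration file in your user directory

3. Run the following command to test the CUDA acceleration

magic-pdf -p small_ocr.pdf -o ./output

Tip

You can roughly tell whether CUDA acceleration is working from the per-stage times printed in the log. Normally, layout detection time and mfr time should be more than 10 times faster.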

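Enable CUDA acceleration for OCR

1. Install paddlepaddle-gpu; OCR acceleration is enabled automatically once installation completes

pip install paddlepaddle-gpu==2.6.1

2. Run the following command to test the OCR acceleration

magic-pdf -p small_ocr.pdf -o ./output

Tip

You can roughly tell whether CUDA acceleration is working from the per-stage cost times printed in the log. Normally, ocr time should be more than 10 times faster.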

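Download model weight files

Model downloads fall into two cases: the initial download, and updating an existing model directory. Follow the instructions for the case that applies to you.

Downloading the model files for the first time

Model files can be downloaded from Hugging Face or ModelScope. Because of network restrictions, users in China may fail to reach Hugging Face; in that case use ModelScope.

Method 1: download the models from Hugging Face

Use a Python script to download the model files from Hugging Face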

pip install huggingface_hub
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py

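The Python script downloads the model files automatically and sets the model directory in the configuration file

Method 2: download the models from ModelScope

Use a Python script to download the model files from ModelScope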

pip install modelscope
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
python download_models.py

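The Python script downloads the model files automatically and sets the model directory in the configuration file

The configuration file is in the user directory and is named magic-pdf.json

Tip

The user directory is "C:\Users\username" on Windows, "/home/username" on Linux, and "/Users/username" on macOS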

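Updating models downloaded earlier

Models downloaded via git lfs

Important

Because some users reported incomplete downloads and corrupted model files when downloading via git lfs, this method is no longer recommended.

From version 0.9.x onward, PDF-Extract-Kit 1.0 moved to a new repository and a layout-ordering model was added, so the models cannot be updated with git pull. Use the Python script for a one-step update instead.

When magic-pdf <= 0.8.1, if you previously downloaded the model files via git lfs, you can go into the earlier download directory and update the models with git pull.

Models downloaded from Hugging Face or ModelScope

If you previously downloaded the models from Hugging Face or ModelScope, you can rerun the same model download Python script. It updates the model directory to the latest version automatically.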

Command line

magic-pdf --help
Usage: magic-pdf [OPTIONS]

Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir PATH        output local directory  [required]
  -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
                               technique to extract information from pdf. txt:
                               suitable for the text-based pdf only and
                               outperform ocr. auto: automatically choose the
                               best method for parsing pdf from ocr and txt.
                               without method specified, auto will be used by
                               default.
  -l, --lang TEXT              Input the languages in the pdf (if known) to
                               improve OCR accuracy.  Optional. You should
                               input "Abbreviation" with language form url: ht
                               tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
                               /blog/multi_languages.html#5-support-languages-
                               and-abbreviations
  -d, --debug BOOLEAN          Enables detailed debugging information during
                               the execution of the CLI commands.
  -s, --start INTEGER          The starting page for PDF parsing, beginning
                               from 0.
  -e, --end INTEGER            The ending page for PDF parsing, beginning from
                               0.
  --help                       Show this message and exit.


## show version
magic-pdf -v

## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto

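{some_pdf} can be a single PDF file or a directory containing multiple PDF files. The result files are stored under {some_output_dir}. The generated files are listed below:

├── some_pdf.md                          # Markdown file
├── images                               # directory for images
├── some_pdf_layout.pdf                  # layout drawing (including the layout reading order)
├── some_pdf_middle.json                 # MinerU intermediate processing result
├── some_pdf_model.json                  # model inference result
├── some_pdf_origin.pdf                  # original PDF file
├── some_pdf_spans.pdf                   # drawing of bbox positions at the finest granularity
└── some_pdf_content_list.json           # rich-text JSON in reading order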
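
The naming pattern in the listing can be used to check that a run produced everything it should. The sketch below derives the expected names from the PDF file name; the helper `expected_outputs` is hypothetical, and whether the CLI nests these files in further subdirectories of the output directory is not covered by the listing.

```python
from pathlib import PurePosixPath

# Suffixes taken from the file listing above.
OUTPUT_SUFFIXES = [
    ".md",
    "_layout.pdf",
    "_middle.json",
    "_model.json",
    "_origin.pdf",
    "_spans.pdf",
    "_content_list.json",
]


def expected_outputs(pdf_path: str) -> list[str]:
    """List the result names expected for one input PDF, plus the images directory."""
    stem = PurePosixPath(pdf_path).stem  # "docs/some_pdf.pdf" -> "some_pdf"
    return [stem + suffix for suffix in OUTPUT_SUFFIXES] + ["images"]


for name in expected_outputs("docs/some_pdf.pdf"):
    print(name)
```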

Converting to Markdown

Local file example

import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

# args
pdf_file_name = "abc.pdf"  # replace with the real pdf path
name_without_suff = pdf_file_name.split(".")[0]

# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))

os.makedirs(local_image_dir, exist_ok=True)

image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
    local_md_dir
)

# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)

## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
    infer_result = ds.apply(doc_analyze, ocr=True)

    ## pipeline
    pipe_result = infer_result.pipe_ocr_mode(image_writer)

else:
    infer_result = ds.apply(doc_analyze, ocr=False)

    ## pipeline
    pipe_result = infer_result.pipe_txt_mode(image_writer)

### draw model result on each page
infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))

### draw layout result on each page
pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))

### draw spans result on each page
pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))

### dump markdown
pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)

### dump content list
pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)

Object storage example

import os

from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
ak = "{Your S3 access key}"  # replace with real s3 access key
sk = "{Your S3 secret key}"  # replace with real s3 secret key
endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url


reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)

# args
pdf_file_name = (
    "s3://llm-pdf-text-1/unittest/tmp/bug5-11.pdf"  # replace with the real s3 path
)

# prepare env
local_dir = "output"
name_without_suff = os.path.basename(pdf_file_name).split(".")[0]

# read bytes
pdf_bytes = reader.read(pdf_file_name)  # read the pdf content

# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)

## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
    infer_result = ds.apply(doc_analyze, ocr=True)

    ## pipeline
    pipe_result = infer_result.pipe_ocr_mode(image_writer)

else:
    infer_result = ds.apply(doc_analyze, ocr=False)

    ## pipeline
    pipe_result = infer_result.pipe_txt_mode(image_writer)

### draw model result on each page
infer_result.draw_model(os.path.join(local_dir, f'{name_without_suff}_model.pdf'))  # dump to local

### draw layout result on each page
pipe_result.draw_layout(os.path.join(local_dir, f'{name_without_suff}_layout.pdf'))  # dump to local

### draw spans result on each page
pipe_result.draw_span(os.path.join(local_dir, f'{name_without_suff}_spans.pdf'))  # dump to local

### dump markdown
pipe_result.dump_md(writer, f'{name_without_suff}.md', "unittest/tmp/images")  # dump to remote s3

### dump content list
pipe_result.dump_content_list(writer, f"{name_without_suff}_content_list.json", "unittest/tmp/images")  # dump to remote s3

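Output file formats

Besides the Markdown-related files, the magic-pdf command also generates several files that are not related to Markdown. Each is described below.

some_pdf_layout.pdf

The layout of each page consists of one or more boxes. The number in the top-left corner of each box is its sequence number. In addition, layout.pdf marks different content blocks with differently colored background fills inside the boxes.

some_pdf_spans.pdf

All spans on the page are drawn with outline boxes whose color depends on the span type. The file can be used for quality inspection, to quickly spot problems such as missing text or unrecognized interline formulas.

some_pdf_model.json

Structure definition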

from pydantic import BaseModel, Field
from enum import IntEnum

class CategoryType(IntEnum):
    title = 0               # title
    plain_text = 1          # body text
    abandon = 2             # headers, footers, page numbers, and page annotations
    figure = 3              # image
    figure_caption = 4      # image caption
    table = 5               # table
    table_caption = 6       # table caption
    table_footnote = 7      # table footnote
    isolate_formula = 8     # interline formula
    formula_caption = 9     # label of an interline formula

    embedding = 13          # inline formula
    isolated = 14           # interline formula
    text = 15               # OCR recognition result


class PageInfo(BaseModel):
    page_no: int = Field(description="Page index; the first page is 0", ge=0)
    height: int = Field(description="Page height", gt=0)
    width: int = Field(description="Page width", ge=0)

class ObjectInferenceResult(BaseModel):
    category_id: CategoryType = Field(description="Category", ge=0)
    poly: list[float] = Field(description="Quadrilateral coordinates: the top-left, top-right, bottom-right, and bottom-left points")
    score: float = Field(description="Confidence of the inference result")
    latex: str | None = Field(description="LaTeX parsing result", default=None)
    html: str | None = Field(description="HTML parsing result", default=None)

class PageInferenceResults(BaseModel):
    layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results")
    page_info: PageInfo = Field(description="Page metadata")


# The inference results of all pages, placed in a list in page order, form the MinerU inference result
inference_result: list[PageInferenceResults] = []

poly coordinates have the format [x0, y0, x1, y1, x2, y2, x3, y3], giving the top-left, top-right, bottom-right, and bottom-left points in that order
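
Many consumers want an axis-aligned box rather than eight numbers. The sketch below converts a poly into (x_min, y_min, x_max, y_max); the helper name `poly_to_bbox` is hypothetical, and taking the min and max is a choice made here so that slightly skewed quadrilaterals are handled too.

```python
def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] to (x_min, y_min, x_max, y_max)."""
    if len(poly) != 8:
        raise ValueError("poly must hold 8 numbers (4 points)")
    xs, ys = poly[0::2], poly[1::2]  # even positions are x, odd positions are y
    return min(xs), min(ys), max(xs), max(ys)


poly = [
    99.1906967163086, 100.3119125366211,
    730.3707885742188, 100.3119125366211,
    730.3707885742188, 245.81326293945312,
    99.1906967163086, 245.81326293945312,
]
print(poly_to_bbox(poly))
```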

Sample data

[
    {
        "layout_dets": [
            {
                "category_id": 2,
                "poly": [
                    99.1906967163086,
                    100.3119125366211,
                    730.3707885742188,
                    100.3119125366211,
                    730.3707885742188,
                    245.81326293945312,
                    99.1906967163086,
                    245.81326293945312
                ],
                "score": 0.9999997615814209
            }
        ],
        "page_info": {
            "page_no": 0,
            "height": 2339,
            "width": 1654
        }
    },
    {
        "layout_dets": [
            {
                "category_id": 5,
                "poly": [
                    99.13092803955078,
                    2210.680419921875,
                    497.3183898925781,
                    2210.680419921875,
                    497.3183898925781,
                    2264.78076171875,
                    99.13092803955078,
                    2264.78076171875
                ],
                "score": 0.9999997019767761
            }
        ],
        "page_info": {
            "page_no": 1,
            "height": 2339,
            "width": 1654
        }
    }
]
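
A quick way to sanity-check a model.json file is to count the detections per category. This is a minimal sketch over the structure shown in the sample: the helper `count_categories` and its `min_score` parameter are hypothetical, and the inline sample is a trimmed copy with shortened poly values.

```python
from collections import Counter

# Names copied from the CategoryType enum in the structure definition.
CATEGORY_NAMES = {
    0: "title", 1: "plain_text", 2: "abandon", 3: "figure",
    4: "figure_caption", 5: "table", 6: "table_caption",
    7: "table_footnote", 8: "isolate_formula", 9: "formula_caption",
    13: "embedding", 14: "isolated", 15: "text",
}


def count_categories(pages: list[dict], min_score: float = 0.0) -> Counter:
    """Count detections per category name across all pages."""
    counts: Counter = Counter()
    for page in pages:
        for det in page["layout_dets"]:
            if det["score"] >= min_score:
                name = CATEGORY_NAMES.get(det["category_id"], "unknown")
                counts[name] += 1
    return counts


sample = [
    {
        "layout_dets": [
            {"category_id": 2, "poly": [99.2, 100.3, 730.4, 100.3, 730.4, 245.8, 99.2, 245.8], "score": 0.9999997615814209}
        ],
        "page_info": {"page_no": 0, "height": 2339, "width": 1654},
    },
    {
        "layout_dets": [
            {"category_id": 5, "poly": [99.1, 2210.7, 497.3, 2210.7, 497.3, 2264.8, 99.1, 2264.8], "score": 0.9999997019767761}
        ],
        "page_info": {"page_no": 1, "height": 2339, "width": 1654},
    },
]
print(count_categories(sample))
```

For a real file, load the list with `json.load` and pass it in unchanged.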

some_pdf_middle.json

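Field	Description
pdf_info	list; each element is a dict holding the parsing result of one PDF page, see the table below
_parse_type	string, either ocr or txt
_version_name	string, the version of magic-pdf used for this parse

Structure of the pdf_info field

Field	Description
preproc_blocks	intermediate result after PDF preprocessing, not yet split into paragraphs
layout_bboxes	layout segmentation result, containing the layout direction (vertical or horizontal) and the bbox, sorted in reading order
page_idx	page index, starting from 0
page_size	page width and height
_layout_tree	layout tree structure
images	list; each element is a dict representing an img_block
tables	list; each element is a dict representing a table_block
interline_equations	list; each element is a dict representing an interline_equation_block
discarded_blocks	list; block information the model marked to be dropped
para_blocks	result of splitting preproc_blocks into paragraphs

In the table above, para_blocks is an array of dicts. Each dict is a block structure, and a block supports at most one level of nesting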

block

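The outer block is called a level-1 block

A level-1 block has the following fields

Field	Description
type	block type (table or image)
bbox	rectangle coordinates of the block
blocks	list; each element is a level-2 block in dict form

Level-1 blocks have only two types, "table" and "image"; all other blocks are level-2 blocks

A level-2 block has the following fields

Field	Description
type	block type
bbox	rectangle coordinates of the block
lines	list; each element is a dict representing a line, describing what makes up one line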

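Level-2 block types

type	desc
image_body	the image itself
image_caption	caption text of the image
image_footnote	footnote of the image
table_body	the table itself
table_caption	caption text of the table
table_footnote	footnote of the table
text	text block
title	title block
index	table-of-contents block
list	list block
interline_equation	interline formula block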

line

A line has the following fields

Field	Description
bbox	rectangle coordinates of the line
spans	list; each element is a dict representing a span, the smallest building unit

span

Field	Description
bbox	rectangle coordinates of the span
type	span type
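content | img_path	text spans use content and image or table spans use img_path, holding the actual text or the screenshot path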

A span has one of the following types

type	desc
image	image
table	table
text	text
inline_equation	inline formula
interline_equation	interline formula

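Summary

A span is the smallest storage unit for all elements

The elements stored in para_blocks are block information

The block structure is

level-1 block (if any) -> level-2 block -> line -> span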
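
The nesting above can be walked with one small recursive function. The sketch below is an illustration only: the helper names `iter_spans` and `page_text` are hypothetical, and the sample page is invented to exercise both a plain text block and a level-1 table block.

```python
from typing import Iterator


def iter_spans(block: dict) -> Iterator[dict]:
    """Yield every span under a block, descending through nested blocks."""
    for child in block.get("blocks", []):  # level-1 block: go into level-2 blocks
        yield from iter_spans(child)
    for line in block.get("lines", []):    # level-2 block: lines, then spans
        yield from line.get("spans", [])


def page_text(page: dict) -> str:
    """Join the content of all text spans found in a page's para_blocks."""
    parts = []
    for block in page["para_blocks"]:
        for span in iter_spans(block):
            if span.get("type") == "text":
                parts.append(span.get("content", ""))
    return "".join(parts)


sample_page = {
    "para_blocks": [
        {"type": "text", "lines": [{"spans": [{"type": "text", "content": "Hello "}]}]},
        {
            "type": "table",
            "blocks": [
                {"type": "table_caption", "lines": [{"spans": [{"type": "text", "content": "Table 1"}]}]},
                {"type": "table_body", "lines": [{"spans": [{"type": "table", "img_path": "images/t.jpg"}]}]},
            ],
        },
    ]
}
print(page_text(sample_page))
```

Non-text spans, such as the table span here, are skipped by `page_text` but are still yielded by `iter_spans`, so the same walker can collect image or table paths.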

Sample data

{
    "pdf_info": [
        {
            "preproc_blocks": [
                {
                    "type": "text",
                    "bbox": [
                        52,
                        61.956024169921875,
                        294,
                        82.99800872802734
                    ],
                    "lines": [
                        {
                            "bbox": [
                                52,
                                61.956024169921875,
                                294,
                                72.0000228881836
                            ],
                            "spans": [
                                {
                                    "bbox": [
                                        54.0,
                                        61.956024169921875,
                                        296.2261657714844,
                                        72.0000228881836
                                    ],
                                    "content": "dependent on the service headway and the reliability of the departure ",
                                    "type": "text",
                                    "score": 1.0
                                }
                            ]
                        }
                    ]
                }
            ],
            "layout_bboxes": [
                {
                    "layout_bbox": [
                        52,
                        61,
                        294,
                        731
                    ],
                    "layout_label": "V",
                    "sub_layout": []
                }
            ],
            "page_idx": 0,
            "page_size": [
                612.0,
                792.0
            ],
            "_layout_tree": [],
            "images": [],
            "tables": [],
            "interline_equations": [],
            "discarded_blocks": [],
            "para_blocks": [
                {
                    "type": "text",
                    "bbox": [
                        52,
                        61.956024169921875,
                        294,
                        82.99800872802734
                    ],
                    "lines": [
                        {
                            "bbox": [
                                52,
                                61.956024169921875,
                                294,
                                72.0000228881836
                            ],
                            "spans": [
                                {
                                    "bbox": [
                                        54.0,
                                        61.956024169921875,
                                        296.2261657714844,
                                        72.0000228881836
                                    ],
                                    "content": "dependent on the service headway and the reliability of the departure ",
                                    "type": "text",
                                    "score": 1.0
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ],
    "_parse_type": "txt",
    "_version_name": "0.6.1"
}
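
A short summary of a middle.json file is often enough to confirm that a parse went as expected. The following sketch reads only the top-level fields documented above; the helper name `summarize_middle` is hypothetical and the inline document is a stripped-down stand-in for the sample.

```python
def summarize_middle(middle: dict) -> dict:
    """Summarize parse type, version, and block counts of a middle.json dict."""
    pages = middle["pdf_info"]
    return {
        "parse_type": middle.get("_parse_type"),
        "version": middle.get("_version_name"),
        "page_count": len(pages),
        # page_idx -> number of paragraph-level blocks on that page
        "blocks_per_page": {p["page_idx"]: len(p["para_blocks"]) for p in pages},
    }


middle = {
    "pdf_info": [
        {"page_idx": 0, "page_size": [612.0, 792.0], "para_blocks": [{"type": "text", "lines": []}]}
    ],
    "_parse_type": "txt",
    "_version_name": "0.6.1",
}
print(summarize_middle(middle))
```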

Pipeline

Minimal example

import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

# args
pdf_file_name = "abc.pdf"  # replace with the real pdf path
name_without_suff = pdf_file_name.split(".")[0]

# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))

os.makedirs(local_image_dir, exist_ok=True)

image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
    local_md_dir
)

# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)

ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)

Running the code above produces the following result

output/
├── abc.md
└── images

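Leaving aside environment setup such as creating directories and importing dependencies, the code that actually converts the PDF to Markdown is the following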

# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)

ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)

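ds.apply(doc_analyze, ocr=True) produces an InferenceResult object. Calling pipe_ocr_mode on the InferenceResult object produces a PipeResult object. Calling dump_md on the PipeResult object writes a Markdown file to the specified location.

The pipeline executes as shown in the figure below

The process is currently divided into three stages: data, inference, and processing. They correspond to the three entities Dataset, InferenceResult, and PipeResult in the figure, which are linked together through methods such as apply, doc_analyze, and pipe_ocr_mode.

Pipeline composition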

class Dataset(ABC):
    @abstractmethod
    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(self, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        pass

class InferenceResult(InferenceResultBase):

    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(inference_result, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        return proc(copy.deepcopy(self._infer_res), *args, **kwargs)

    def pipe_ocr_mode(
        self,
        imageWriter: DataWriter,
        start_page_id=0,
        end_page_id=None,
        debug_mode=False,
        lang=None,
        ) -> PipeResult:
        pass

class PipeResult:
    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(pipeline_result, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        return proc(copy.deepcopy(self._pipe_res), *args, **kwargs)

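The Dataset, InferenceResult, and PipeResult classes all have an apply method, which can be used to compose the computation of different stages. As shown below, MinerU provides a set of computations that combine these classes.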

# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)

ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)

Users can implement their own composable functions as needed. For example, the apply method can be used to count the pages of a PDF file.

from magic_pdf.data.data_reader_writer import  FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset

# args
pdf_file_name = "abc.pdf"  # replace with the real pdf path

# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)

def count_page(ds)-> int:
    return len(ds)

print("page number: ", ds.apply(count_page)) # will output the page count of `abc.pdf`
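
The same pattern extends to functions that take extra arguments, because apply forwards them to the callable. The sketch below uses a stand-in class, `ToyDataset`, so that it runs without magic-pdf installed; it mirrors only the calling convention from the docstrings above, and both `ToyDataset` and `pages_containing` are hypothetical names.

```python
from typing import Any, Callable


class ToyDataset:
    """Stand-in for a Dataset: holds page texts and exposes apply()."""

    def __init__(self, pages: list[str]):
        self._pages = pages

    def __len__(self) -> int:
        return len(self._pages)

    def __iter__(self):
        return iter(self._pages)

    def apply(self, proc: Callable, *args: Any, **kwargs: Any) -> Any:
        # Same convention as the docstring above: proc(self, *args, **kwargs)
        return proc(self, *args, **kwargs)


def pages_containing(ds: ToyDataset, keyword: str, ignore_case: bool = False) -> int:
    """Count the pages whose text contains the keyword."""
    def norm(text: str) -> str:
        return text.lower() if ignore_case else text

    return sum(1 for page in ds if norm(keyword) in norm(page))


ds_demo = ToyDataset(["Intro", "intro again", "End"])
print(ds_demo.apply(len))
print(ds_demo.apply(pages_containing, "intro", ignore_case=True))
```

Positional and keyword arguments after the callable are passed through unchanged, which is what lets one apply method serve many small helper functions.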

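Datasets

Importing data classes

Dataset

Each PDF or image forms one Dataset. PDFs fall into two categories according to the parsing method, TXT or OCR. An ImageDataset, a subclass of Dataset, is obtained from images; a PymuDocDataset is obtained from PDF files. The difference between the two is that ImageDataset supports only the OCR parsing method, while PymuDocDataset supports both OCR and TXT.

Note

In practice, some PDFs are generated from images, which means they do not support the TXT method. For now it is up to the user to make sure the TXT method is not called on image-generated PDFs

PDF parsing methods

OCR

Extracts characters with optical character recognition.

TXT

Extracts characters with a third-party library; we currently use pymupdf.

read_api

Reads content from a file or directory to create a Dataset. We currently provide several functions that cover certain scenarios. If you have a new scenario that most users are likely to encounter, you can post a detailed description on the official GitHub issues page. It is also easy to implement your own read functions.

Important functions

read_jsonl

Reads content from a JSONL file on the local machine or on remote S3. To learn more about JSONL, see the JSON Lines documentation.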

from magic_pdf.data.read_api import *
from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
from magic_pdf.data.schemas import S3Config

# Read a local jsonl file
datasets = read_jsonl("tt.jsonl", None)   # replace with a real file

# Read a jsonl file from S3

bucket = "bucket_1"                     # replace with a real s3 bucket
ak = "access_key_1"                     # replace with a real s3 access key
sk = "secret_key_1"                     # replace with a real s3 secret key
endpoint_url = "endpoint_url_1"         # replace with a real s3 endpoint url

bucket_2 = "bucket_2"                   # replace with a real s3 bucket
ak_2 = "access_key_2"                   # replace with a real s3 access key
sk_2 = "secret_key_2"                   # replace with a real s3 secret key
endpoint_url_2 = "endpoint_url_2"       # replace with a real s3 endpoint url

s3configs = [
    S3Config(
        bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
    ),
    S3Config(
        bucket_name=bucket_2,
        access_key=ak_2,
        secret_key=sk_2,
        endpoint_url=endpoint_url_2,
    ),
]

s3_reader = MultiBucketS3DataReader(bucket, s3configs)

datasets = read_jsonl("s3://bucket_1/tt.jsonl", s3_reader)  # replace with a real s3 jsonl file
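
For readers unfamiliar with the format, a JSONL file is simply one JSON document per line. The sketch below parses such text with the standard library only. It does not reproduce what read_jsonl does internally, and the "path" key in the demo is a hypothetical field name, since this page does not document which keys read_jsonl expects.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of records, skipping blank lines."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON") from exc
    return records


demo_text = '{"path": "a.pdf"}\n\n{"path": "b.pdf"}\n'  # hypothetical records
print(parse_jsonl(demo_text))
```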

read_local_pdfs

Reads PDF files from a path or directory.

from magic_pdf.data.read_api import *

# Read a PDF by path
datasets = read_local_pdfs("tt.pdf")  # replace with a real file

# Read the PDF files in a directory
datasets = read_local_pdfs("pdfs/")   # replace with a real directory

read_local_images

Reads images from a path or directory.

from magic_pdf.data.read_api import *

# Read from an image path
datasets = read_local_images("tt.png")  # replace with a real file

# Read the files in a directory whose names end with one of the given suffixes
datasets = read_local_images("images/", suffixes=["png", "jpg"])  # replace with a real directory

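Data reader and writer classes

These classes read bytes from, or write bytes to, different media. If MinerU does not provide a suitable class, you can implement a new one for your own scenario. Doing so is easy; the only requirement is to inherit from DataReader or DataWriter.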

class SomeReader(DataReader):
    def read(self, path: str) -> bytes:
        pass

    def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
        pass


class SomeWriter(DataWriter):
    def write(self, path: str, data: bytes) -> None:
        pass

    def write_string(self, path: str, data: str) -> None:
        pass

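Readers may wonder how io differs from this section, since at first glance the two look very similar. io provides the basic functionality, while this section is aimed at the application level. Users can build their own classes to meet specific application needs, and those classes may share the same basic IO functionality. That is why io exists.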
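
To make the reader interface concrete, here is a minimal in-memory implementation. So that it runs without magic-pdf installed, `DataReader` below is a local stand-in that mirrors the two methods shown above, not the class from the library. `InMemoryReader` is a hypothetical name, and treating `limit=-1` as "read to the end" is an assumption.

```python
from abc import ABC, abstractmethod


class DataReader(ABC):
    """Local stand-in mirroring the reader interface shown above."""

    @abstractmethod
    def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
        ...

    def read(self, path: str) -> bytes:
        # Reading a whole file is read_at with the default offset and limit.
        return self.read_at(path)


class InMemoryReader(DataReader):
    """Serve bytes from a dict, which is handy for tests."""

    def __init__(self, files: dict[str, bytes]):
        self._files = files

    def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
        data = self._files[path]
        if limit == -1:  # assumed meaning: no limit, read to the end
            return data[offset:]
        return data[offset:offset + limit]


reader_demo = InMemoryReader({"a.txt": b"hello"})
print(reader_demo.read("a.txt"), reader_demo.read_at("a.txt", offset=1, limit=3))
```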

Important classes

class FileBasedDataReader(DataReader):
    def __init__(self, parent_dir: str = ''):
        pass


class FileBasedDataWriter(DataWriter):
    def __init__(self, parent_dir: str = '') -> None:
        pass

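The FileBasedDataReader class is initialized with a single parameter, parent_dir. Every method FileBasedDataReader provides therefore behaves as follows:

When reading from an absolute path, parent_dir is ignored.
When reading from a relative path, the path is first joined with parent_dir, and the content is read from the joined path.

Note

FileBasedDataWriter behaves the same way as FileBasedDataReader.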
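
The two path rules above match the behavior of the standard path join, which drops the first argument when the second is absolute. The sketch below illustrates the rules only; it is not taken from the library's source, the helper name `resolve_path` is hypothetical, and `posixpath` is used so the result is the same on every platform.

```python
import posixpath


def resolve_path(parent_dir: str, path: str) -> str:
    """Apply the rules above: absolute paths win, relative paths are joined."""
    # posixpath.join discards parent_dir when path is absolute.
    return posixpath.join(parent_dir, path)


print(resolve_path("/tmp", "abc"))
print(resolve_path("/tmp", "/tmp/logs/message.txt"))
print(resolve_path("", "abc"))
```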

class MultiS3Mixin:
    def __init__(self, default_prefix: str, s3_configs: list[S3Config]):
        pass

class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
    pass

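Every read method provided by MultiBucketS3DataReader behaves as follows:

When reading an object from a full S3-style path, such as s3://test_bucket/test_object, default_prefix is ignored.
When reading an object from a relative path, the path is first joined with default_prefix and the bucket_name is stripped off, then the content is read. bucket_name is the first element after splitting default_prefix on the separator.

Note

MultiBucketS3DataWriter behaves similarly to MultiBucketS3DataReader.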
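
The resolution rules for S3 paths can be expressed as a function that returns a bucket and an object key. This is a sketch of the rules as described above, not the library's implementation; the helper name `resolve_s3_path` is hypothetical and "/" is assumed to be the separator.

```python
def resolve_s3_path(default_prefix: str, path: str) -> tuple[str, str]:
    """Return (bucket, key) for a full s3:// path or a path relative to default_prefix."""
    if path.startswith("s3://"):
        full = path[len("s3://"):]  # full S3 path: default_prefix is ignored
    else:
        full = f"{default_prefix.rstrip('/')}/{path.lstrip('/')}"
    # The bucket is the first element before the separator, the rest is the key.
    bucket, _, key = full.partition("/")
    return bucket, key


print(resolve_s3_path("bucket/test/unittest", "abc"))
print(resolve_s3_path("bucket/test/unittest", "s3://bucket_2/test/unittest/abc"))
```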

class S3DataReader(MultiBucketS3DataReader):
    pass

S3DataReader is built on MultiBucketS3DataReader but supports only a single bucket. The same applies to S3DataWriter.

Read example

import os
from magic_pdf.data.data_reader_writer import *
from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
from magic_pdf.data.schemas import S3Config

# Initialize the reader
file_based_reader1 = FileBasedDataReader('')

## Read the local file abc
file_based_reader1.read('abc')

file_based_reader2 = FileBasedDataReader('/tmp')

## Read the local file /tmp/abc
file_based_reader2.read('abc')

## Read the local file /tmp/logs/message.txt
file_based_reader2.read('/tmp/logs/message.txt')

# Initialize the multi-bucket s3 reader
bucket = "bucket"               # replace with a real bucket
ak = "ak"                       # replace with a real access key
sk = "sk"                       # replace with a real secret key
endpoint_url = "endpoint_url"   # replace with a real endpoint_url

bucket_2 = "bucket_2"               # replace with a real bucket
ak_2 = "ak_2"                       # replace with a real access key
sk_2 = "sk_2"                       # replace with a real secret key
endpoint_url_2 = "endpoint_url_2"   # replace with a real endpoint_url

test_prefix = 'test/unittest'
multi_bucket_s3_reader1 = MultiBucketS3DataReader(f"{bucket}/{test_prefix}", [S3Config(
        bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
    ),
    S3Config(
        bucket_name=bucket_2,
        access_key=ak_2,
        secret_key=sk_2,
        endpoint_url=endpoint_url_2,
    )])

## Read the file s3://{bucket}/{test_prefix}/abc
multi_bucket_s3_reader1.read('abc')

## Read the file s3://{bucket}/{test_prefix}/efg
multi_bucket_s3_reader1.read(f's3://{bucket}/{test_prefix}/efg')

## Read the file s3://{bucket_2}/{test_prefix}/abc
multi_bucket_s3_reader1.read(f's3://{bucket_2}/{test_prefix}/abc')

# Initialize the s3 reader
s3_reader1 = S3DataReader(
    test_prefix,
    bucket,
    ak,
    sk,
    endpoint_url
)

## Read the file s3://{bucket}/{test_prefix}/abc
s3_reader1.read('abc')

## Read the file s3://{bucket}/efg
s3_reader1.read(f's3://{bucket}/efg')

Write example

import os
from magic_pdf.data.data_reader_writer import *
from magic_pdf.data.data_reader_writer import MultiBucketS3DataWriter
from magic_pdf.data.schemas import S3Config

# Initialize the writer
file_based_writer1 = FileBasedDataWriter("")

## Write the bytes 123 to abc
file_based_writer1.write("abc", "123".encode())

## Write the string 123 to abc
file_based_writer1.write_string("abc", "123")

file_based_writer2 = FileBasedDataWriter("/tmp")

## Write the string 123 to /tmp/abc
file_based_writer2.write_string("abc", "123")

## Write the string 123 to /tmp/logs/message.txt
file_based_writer2.write_string("/tmp/logs/message.txt", "123")

# Initialize the multi-bucket s3 writer
bucket = "bucket"               # replace with a real bucket
ak = "ak"                       # replace with a real access key
sk = "sk"                       # replace with a real secret key
endpoint_url = "endpoint_url"   # replace with a real endpoint_url

bucket_2 = "bucket_2"               # replace with a real bucket
ak_2 = "ak_2"                       # replace with a real access key
sk_2 = "sk_2"                       # replace with a real secret key
endpoint_url_2 = "endpoint_url_2"   # replace with a real endpoint_url

test_prefix = "test/unittest"
multi_bucket_s3_writer1 = MultiBucketS3DataWriter(
    f"{bucket}/{test_prefix}",
    [
        S3Config(
            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
        ),
        S3Config(
            bucket_name=bucket_2,
            access_key=ak_2,
            secret_key=sk_2,
            endpoint_url=endpoint_url_2,
        ),
    ],
)

## Write the string 123 to s3://{bucket}/{test_prefix}/abc
multi_bucket_s3_writer1.write_string("abc", "123")

## Write the bytes 123 to s3://{bucket}/{test_prefix}/abc
multi_bucket_s3_writer1.write("abc", "123".encode())

## Write the bytes 123 to s3://{bucket}/{test_prefix}/efg
multi_bucket_s3_writer1.write(f"s3://{bucket}/{test_prefix}/efg", "123".encode())

## Write the bytes 123 to s3://{bucket_2}/{test_prefix}/abc
multi_bucket_s3_writer1.write(f's3://{bucket_2}/{test_prefix}/abc', '123'.encode())

# Initialize the s3 writer
s3_writer1 = S3DataWriter(test_prefix, bucket, ak, sk, endpoint_url)

## Write the bytes 123 to s3://{bucket}/{test_prefix}/abc
s3_writer1.write("abc", "123".encode())

## Write the string 123 to s3://{bucket}/{test_prefix}/abc
s3_writer1.write_string("abc", "123")

## Write the bytes 123 to s3://{bucket}/efg
s3_writer1.write(f"s3://{bucket}/efg", "123".encode())

IO

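These classes read bytes from, or write bytes to, different media. We currently provide S3Reader and S3Writer for AWS S3-compatible media, and HttpReader and HttpWriter for remote HTTP files. If MinerU does not provide a suitable class, you can implement a new one for your own scenario. Doing so is easy; the only requirement is to inherit from IOReader or IOWriter.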

class SomeReader(IOReader):
    def read(self, path: str) -> bytes:
        pass

    def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
        pass


class SomeWriter(IOWriter):
    def write(self, path: str, data: bytes) -> None:
        pass