Paperless-ngx

Under the hood, Paperless-ngx is built on Django and is therefore written in Python.

Installation

This section describes a bare-metal installation (that is, without Docker) of Paperless-ngx 2.7.2 or later on Debian 12 with a MariaDB backend, loosely following https://docs.paperless-ngx.com/setup/#bare_metal.

  • On Debian 12 (Bookworm) with Python 3.11, a virtual environment is mandatory, because pip refuses to install packages into the system-wide, externally managed Python environment (PEP 668).

  • OCRmyPDF, the component used for OCR, requires Ghostscript 9.55 or later, which is available as of Debian 12.

  • On RHEL, apart from the different package names, the software does not get up and running without extra effort: on RHEL 8, for example, Unpaper is not available and would have to be compiled; on RHEL 9 or Fedora Server, mysqlclient cannot be installed via pip; and icc-profiles-free is missing on all of them, so it would have to be compiled as well.
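
Whether an installed Ghostscript meets the 9.55 minimum can be checked up front. The following is a minimal sketch: the function name version_ge is made up for illustration, and it only compares purely numeric, dot-separated version strings.

```shell
# Return 0 if version $1 is greater than or equal to version $2.
# Only handles numeric, dot-separated versions such as 9.55 or 10.00.0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, ".")
        m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            xi = (i <= n) ? x[i] + 0 : 0   # missing components count as 0
            yi = (i <= m) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Usage:
#   version_ge "$(gs --version)" 9.55 && echo "Ghostscript is recent enough"
```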

Prerequisites:

  • On a MariaDB server, create a schema "paperless" and a user "paperless" with all privileges but without the GRANT option: CREATE SCHEMA `paperless` DEFAULT CHARACTER SET utf8mb4;

    Note

    Using MariaDB results in the following warnings:

    account.EmailAddress: (models.W036) MariaDB does not support unique constraints with conditions.
        HINT: A constraint won't be created. Silence this warning if you don't care about it.
    account.EmailAddress: (models.W043) MariaDB does not support indexes on expressions.
        HINT: An index won't be created. Silence this warning if you don't care about it.
    documents.Correspondent: (models.W036) MariaDB does not support unique constraints with conditions.
        HINT: A constraint won't be created. Silence this warning if you don't care about it.
    documents.DocumentType: (models.W036) MariaDB does not support unique constraints with conditions.
        HINT: A constraint won't be created. Silence this warning if you don't care about it.
    documents.StoragePath: (models.W036) MariaDB does not support unique constraints with conditions.
        HINT: A constraint won't be created. Silence this warning if you don't care about it.
    documents.Tag: (models.W036) MariaDB does not support unique constraints with conditions.
        HINT: A constraint won't be created. Silence this warning if you don't care about it.
    
  • Install and configure Redis 6 or later.
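
The database prerequisite could be met with statements like the following. This is a minimal sketch: the helper name paperless_db_sql, the host part 'localhost' and the placeholder password are assumptions for illustration. Granting ALL PRIVILEGES without WITH GRANT OPTION gives the user full rights on the schema but no right to pass them on.

```shell
# Print the SQL for the "paperless" schema and user (hypothetical helper).
# $1 = password (must not contain a single quote), $2 = client host (assumed: localhost)
paperless_db_sql() {
    db_pass=$1
    db_host=${2:-localhost}
    cat <<SQL
CREATE SCHEMA IF NOT EXISTS \`paperless\` DEFAULT CHARACTER SET utf8mb4;
CREATE USER IF NOT EXISTS 'paperless'@'${db_host}' IDENTIFIED BY '${db_pass}';
GRANT ALL PRIVILEGES ON \`paperless\`.* TO 'paperless'@'${db_host}';
SQL
}

# Usage (as a MariaDB administrator on the database server):
#   paperless_db_sql 'change-me' | mariadb
```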

Then:

apt update
apt -y upgrade
apt -y install \
    python3 \
    python3-pip \
    python3-dev \
    default-libmysqlclient-dev \
    pkg-config \
    fonts-liberation \
    imagemagick \
    gnupg \
    libpq-dev \
    libmagic-dev \
    mariadb-client \
    mime-support \
    libzbar0 \
    poppler-utils

For OCR:

apt -y install unpaper \
    ghostscript \
    icc-profiles-free \
    qpdf \
    liblept5 \
    libxml2 \
    pngquant \
    zlib1g \
    tesseract-ocr \
    tesseract-ocr-eng tesseract-ocr-deu

Download the software:

apt -y install curl

VER=2.7.2
curl --remote-name --location https://github.com/paperless-ngx/paperless-ngx/releases/download/v$VER/paperless-ngx-v$VER.tar.xz
tar -xf paperless-ngx-v$VER.tar.xz
rm -f paperless-ngx-v$VER.tar.xz
mv paperless-ngx/ /opt/paperless/

mkdir -p /opt/paperless/{consume,data,trash,media,static}

Create the user "paperless":

adduser paperless --system --home /opt/paperless --group
chown -R paperless:paperless /opt/paperless

Install the Python dependencies in a virtual environment:

apt -y install python3-venv
mkdir -p /opt/python-venv
python3 -m venv /opt/python-venv/paperless
source /opt/python-venv/paperless/bin/activate
chown -R paperless:paperless /opt/python-venv/paperless

sudo --set-home --user=paperless /opt/python-venv/paperless/bin/pip3 install --upgrade pip

cd /opt/paperless
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/pip3 install --requirement requirements.txt

Allow ImageMagick to process PDF documents, so that they are handled by ImageMagick rather than Ghostscript:

/etc/ImageMagick-6/policy.xml
<!-- <policy domain="coder" rights="none" pattern="PDF" /> -->
<policy domain="coder" rights="read|write" pattern="PDF" />
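
The same change can be scripted, for example for configuration management. The sketch below assumes that the stock policy line appears exactly as shown above; the function name allow_pdf_policy is made up for illustration. It writes the edited policy to standard output so that the result can be reviewed before the file is replaced.

```shell
# Print policy.xml with the PDF coder switched from "none" to "read|write".
# $1 = path to policy.xml (on Debian 12: /etc/ImageMagick-6/policy.xml)
allow_pdf_policy() {
    sed 's,rights="none" pattern="PDF",rights="read|write" pattern="PDF",' "$1"
}

# Usage: review the result, then replace the original file.
#   allow_pdf_policy /etc/ImageMagick-6/policy.xml > /tmp/policy.xml
#   diff /etc/ImageMagick-6/policy.xml /tmp/policy.xml
#   install -m 0644 /tmp/policy.xml /etc/ImageMagick-6/policy.xml
```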

Configuration

Before the first start, set at least the following options in /opt/paperless/paperless.conf:

/opt/paperless/paperless.conf
PAPERLESS_REDIS=redis://127.0.0.1:6379
PAPERLESS_DBENGINE=mariadb
PAPERLESS_DBHOST=localhost
PAPERLESS_DBPORT=3306
PAPERLESS_DBNAME=paperless
PAPERLESS_DBUSER=paperless
PAPERLESS_DBPASS=linuxfabrik

PAPERLESS_CONSUMPTION_DIR=../consume
PAPERLESS_DATA_DIR=../data
PAPERLESS_TRASH_DIR=../trash
PAPERLESS_MEDIA_ROOT=../media
PAPERLESS_STATICDIR=../static
PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}

PAPERLESS_SECRET_KEY=8214cb7f-9646-487e-b3f1-9f0aeb7a5a0f

PAPERLESS_OCR_LANGUAGE=deu+eng

PAPERLESS_TIME_ZONE=Europe/Zurich
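
PAPERLESS_SECRET_KEY and PAPERLESS_DBPASS above are example values; the secret key in particular should be a random string that is unique to the installation. One way to generate one, as a sketch (the function name gen_secret_key and the length of 32 random bytes are arbitrary choices for illustration):

```shell
# Print 32 random bytes from the kernel as 64 hexadecimal characters.
gen_secret_key() {
    head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
    echo
}

# Usage: print a line that can be pasted into /opt/paperless/paperless.conf.
#   echo "PAPERLESS_SECRET_KEY=$(gen_secret_key)"
```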

The complete configuration file with all default settings:

/opt/paperless/paperless.conf
# Have a look at the docs for documentation.
# https://docs.paperless-ngx.com/configuration/

# Debug. Only enable this for development.

#PAPERLESS_DEBUG=false

# Required services

#PAPERLESS_REDIS=redis://localhost:6379
#PAPERLESS_REDIS_PREFIX=

# Database

#PAPERLESS_DBENGINE=postgresql
#PAPERLESS_DBHOST=
#PAPERLESS_DBPORT=5432
#PAPERLESS_DBNAME=paperless
#PAPERLESS_DBUSER=paperless
#PAPERLESS_DBPASS=paperless
#PAPERLESS_DBSSLMODE=prefer
#PAPERLESS_DBSSLROOTCERT=
#PAPERLESS_DBSSLCERT=
#PAPERLESS_DBSSLKEY=
#PAPERLESS_DB_TIMEOUT=

# Optional Services

# Tika
#PAPERLESS_TIKA_ENABLED=false
#PAPERLESS_TIKA_ENDPOINT=http://localhost:9998
#PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://localhost:3000

# Paths and folders

#PAPERLESS_CONSUMPTION_DIR=../consume/
#PAPERLESS_DATA_DIR=../data/
#PAPERLESS_TRASH_DIR=
#PAPERLESS_MEDIA_ROOT=../media/
#PAPERLESS_STATICDIR=../static/
#PAPERLESS_FILENAME_FORMAT=none
#PAPERLESS_FILENAME_FORMAT_REMOVE_NONE=false
#PAPERLESS_LOGGING_DIR=PAPERLESS_DATA_DIR/log/
#PAPERLESS_NLTK_DIR=/usr/share/nltk_data

# Logging

#PAPERLESS_LOGROTATE_MAX_SIZE=1M
#PAPERLESS_LOGROTATE_MAX_BACKUPS=20

# Hosting & Security

#PAPERLESS_SECRET_KEY=see `src/paperless/settings.py`
#PAPERLESS_URL=
#PAPERLESS_CSRF_TRUSTED_ORIGINS=
#PAPERLESS_ALLOWED_HOSTS=*
#PAPERLESS_CORS_ALLOWED_HOSTS=http://localhost:8000
#PAPERLESS_TRUSTED_PROXIES=
#PAPERLESS_FORCE_SCRIPT_NAME=
#PAPERLESS_STATIC_URL=/static/
#PAPERLESS_AUTO_LOGIN_USERNAME=
#PAPERLESS_ADMIN_USER=<username>
#PAPERLESS_ADMIN_MAIL=root@localhost
#PAPERLESS_ADMIN_PASSWORD=<password>
#PAPERLESS_COOKIE_PREFIX=
#PAPERLESS_ENABLE_HTTP_REMOTE_USER=false
#PAPERLESS_ENABLE_HTTP_REMOTE_USER_API=false
#PAPERLESS_HTTP_REMOTE_USER_HEADER_NAME=HTTP_REMOTE_USER
#PAPERLESS_LOGOUT_REDIRECT_URL=
#PAPERLESS_USE_X_FORWARD_HOST=false
#PAPERLESS_USE_X_FORWARD_PORT=false
#PAPERLESS_PROXY_SSL_HEADER=
#PAPERLESS_EMAIL_CERTIFICATE_LOCATION=
#PAPERLESS_SOCIALACCOUNT_PROVIDERS=
#PAPERLESS_SOCIAL_AUTO_SIGNUP=false
#PAPERLESS_SOCIALACCOUNT_ALLOW_SIGNUPS=true
#PAPERLESS_ACCOUNT_ALLOW_SIGNUPS=false
#PAPERLESS_ACCOUNT_DEFAULT_HTTP_PROTOCOL=https
#PAPERLESS_ACCOUNT_EMAIL_VERIFICATION=optional
#PAPERLESS_DISABLE_REGULAR_LOGIN=false
#PAPERLESS_ACCOUNT_SESSION_REMEMBER=<bool>

# OCR settings

#PAPERLESS_OCR_LANGUAGE=eng
#PAPERLESS_OCR_MODE=skip
#PAPERLESS_OCR_SKIP_ARCHIVE_FILE=never
#PAPERLESS_OCR_CLEAN=clean
#PAPERLESS_OCR_DESKEW=true
#PAPERLESS_OCR_ROTATE_PAGES=true
#PAPERLESS_OCR_ROTATE_PAGES_THRESHOLD=12
#PAPERLESS_OCR_OUTPUT_TYPE=<type>
#PAPERLESS_OCR_PAGES=
#PAPERLESS_OCR_IMAGE_DPI=
#PAPERLESS_OCR_MAX_IMAGE_PIXELS=<num>
#PAPERLESS_OCR_COLOR_CONVERSION_STRATEGY=<RGB>
#PAPERLESS_OCR_USER_ARGS=<json>

# Software tweaks

#PAPERLESS_TASK_WORKERS=1
#PAPERLESS_THREADS_PER_WORKER=<num>
#PAPERLESS_WORKER_TIMEOUT=1800
#PAPERLESS_TIME_ZONE=UTC
#PAPERLESS_ENABLE_NLTK=1
#PAPERLESS_EMAIL_TASK_CRON=*/10 * * * *
#PAPERLESS_TRAIN_TASK_CRON=5 */1 * * *
#PAPERLESS_INDEX_TASK_CRON=0 0 * * *
#PAPERLESS_SANITY_TASK_CRON=30 0 * * sun
#PAPERLESS_ENABLE_COMPRESSION=1
#PAPERLESS_CONVERT_MEMORY_LIMIT=0
#PAPERLESS_CONVERT_TMPDIR=
#PAPERLESS_APPS=
#PAPERLESS_MAX_IMAGE_PIXELS=

# Document Consumption

#PAPERLESS_CONSUMER_DELETE_DUPLICATES=false
#PAPERLESS_CONSUMER_RECURSIVE=false
#PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=false
#PAPERLESS_CONSUMER_IGNORE_PATTERNS=[".DS_Store", ".DS_STORE", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*", "Thumbs.db"]
#PAPERLESS_CONSUMER_BARCODE_SCANNER=PYZBAR
#PAPERLESS_PRE_CONSUME_SCRIPT=
#PAPERLESS_POST_CONSUME_SCRIPT=
#PAPERLESS_FILENAME_DATE_ORDER=
#PAPERLESS_NUMBER_OF_SUGGESTED_DATES=3
#PAPERLESS_THUMBNAIL_FONT_NAME=/usr/share/fonts/liberation/LiberationSerif-Regular.ttf
#PAPERLESS_IGNORE_DATES=
#PAPERLESS_DATE_ORDER=<format>

# Polling

#PAPERLESS_CONSUMER_POLLING=0
#PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=5
#PAPERLESS_CONSUMER_POLLING_DELAY=5

# iNotify

#PAPERLESS_CONSUMER_INOTIFY_DELAY=0.5

# Barcodes

#PAPERLESS_CONSUMER_ENABLE_BARCODES=false
#PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=false
#PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT
#PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=false
#PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=ASN
#PAPERLESS_CONSUMER_BARCODE_UPSCALE=0.0
#PAPERLESS_CONSUMER_BARCODE_DPI=300
#PAPERLESS_CONSUMER_ENABLE_TAG_BARCODE=false
#PAPERLESS_CONSUMER_TAG_BARCODE_MAPPING={"TAG:(.)": "\g<1>"}

# Audit Trail

#PAPERLESS_AUDIT_LOG_ENABLED=true

# Collate Double-Sided Documents

#PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=false
#PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME=double-sided
#PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=false

# Binaries

#PAPERLESS_CONVERT_BINARY=convert
#PAPERLESS_GS_BINARY=gs

# Docker-specific options

#PAPERLESS_WEBSERVER_WORKERS=1
#PAPERLESS_BIND_ADDR=[::]
#PAPERLESS_PORT=8000
#USERMAP_UID=1000
#USERMAP_GID=1000
#PAPERLESS_OCR_LANGUAGES=
#PAPERLESS_ENABLE_FLOWER=<defined>
#PAPERLESS_SUPERVISORD_WORKING_DIR=<defined>

# Frontend Settings

#PAPERLESS_APP_TITLE=<bool>
#PAPERLESS_APP_LOGO=<path>

# Email sending

#PAPERLESS_EMAIL_HOST=localhost
#PAPERLESS_EMAIL_PORT=25
#PAPERLESS_EMAIL_HOST_USER=
#PAPERLESS_EMAIL_FROM=PAPERLESS_EMAIL_HOST_USER
#PAPERLESS_EMAIL_HOST_PASSWORD=
#PAPERLESS_EMAIL_USE_TLS=false
#PAPERLESS_EMAIL_USE_SSL=false

Then create the database tables once, start Django's built-in web server manually, and open the site in a browser:

cd /opt/paperless/src

# just once: create the database tables
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3 manage.py migrate

# this creates your first paperless user
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3 manage.py createsuperuser

# to start the server:
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3 manage.py runserver 0.0.0.0:8000

Open http://paperless-ngx:8000 in the browser and log in with the user created above.

Then stop the web server and create the systemd services:

cp /opt/paperless/scripts/paperless-* /etc/systemd/system/

sed --in-place 's,ExecStart=celery,ExecStart=/opt/python-venv/paperless/bin/celery,g' /etc/systemd/system/paperless-scheduler.service
sed --in-place 's,ExecStart=celery,ExecStart=/opt/python-venv/paperless/bin/celery,g' /etc/systemd/system/paperless-task-queue.service
sed --in-place 's,ExecStart=/opt/paperless/.local/bin/gunicorn,ExecStart=/opt/python-venv/paperless/bin/gunicorn,g' /etc/systemd/system/paperless-webserver.service
sed --in-place 's,ExecStart=python3,ExecStart=/opt/python-venv/paperless/bin/python3,g' /etc/systemd/system/paperless-consumer.service

systemctl daemon-reload

systemctl enable --now paperless-webserver.service
systemctl enable --now paperless-consumer.service
systemctl enable --now paperless-task-queue.service
systemctl enable --now paperless-scheduler.service

systemctl status paperless-*

With the default settings, the scheduler's most frequent task, the mail check, runs every 10 minutes (PAPERLESS_EMAIL_TASK_CRON).

Install NLTK for machine learning (see https://www.nltk.org/data.html):

sudo --set-home --user=paperless /opt/python-venv/paperless/bin/pip3 install --upgrade nltk
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/pip3 install --upgrade numpy

Install the NLTK data from an interactive Python session:

sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3
import nltk
# if behind an HTTP proxy:
#nltk.set_proxy('http://proxy.example.com:3128', 'USERNAME', 'PASSWORD')

nltk.download('punkt')
nltk.download('snowball_data')
nltk.download('stopwords')

exit()

Adjust the path to the NLTK data:

/opt/paperless/paperless.conf
# Paths and folders
PAPERLESS_NLTK_DIR=/opt/paperless/nltk_data
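
For unattended installations, the same three data packages can also be fetched without an interactive session through NLTK's command-line downloader (python3 -m nltk.downloader). The following is a sketch: the function name fetch_nltk_data and the variable NLTK_DIR are made up for illustration, and the target directory matches PAPERLESS_NLTK_DIR above.

```shell
NLTK_DIR=/opt/paperless/nltk_data

# Download the NLTK packages used above into $NLTK_DIR.
# "$@" is the Python interpreter command, including any sudo prefix.
fetch_nltk_data() {
    "$@" -m nltk.downloader -d "$NLTK_DIR" punkt snowball_data stopwords
}

# Usage (as the "paperless" user, so that the files get the right owner):
#   fetch_nltk_data sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3
```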

Optional next steps:

  • In the personal settings, switch "Date display" to "ISO 8601".

  • Install an NFS or Samba server and share PAPERLESS_CONSUMPTION_DIR.

  • Set up a mail account such as paper@example.com from which documents are fetched.

Usage

Before uploading the first document, create suitable correspondents, tags and document types. Important: be sure to create a tag "inbox" and mark it as an inbox tag. Uploaded documents receive this tag automatically and can thus be filtered for manual review (remove the "inbox" tag afterwards). Uploaded documents are stored in /opt/paperless/media/documents/originals.

Management tools, for example for tagging documents in bulk, are located in /opt/paperless/src/documents/management/commands/. Example call that updates the trained neural network (the document classifier):

cd /opt/paperless/src
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3 manage.py document_create_classifier
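
Because every management command is invoked the same way, a small wrapper can save typing. This is a sketch: the function name paperless_manage and the two variables are made up for illustration. document_retagger is one of the bundled commands for bulk tagging; its --help output lists the options available in the installed version.

```shell
PAPERLESS_SRC=/opt/paperless/src
PAPERLESS_PYTHON=/opt/python-venv/paperless/bin/python3

# Run a manage.py command as the "paperless" user from the source directory.
# The subshell keeps the caller's working directory unchanged.
paperless_manage() {
    (
        cd "$PAPERLESS_SRC" &&
        sudo --set-home --user=paperless "$PAPERLESS_PYTHON" manage.py "$@"
    )
}

# Usage:
#   paperless_manage document_create_classifier
#   paperless_manage document_retagger --help
```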

Archive Serial Number (ASN)

Sequential numbering for paper documents based on a QR code. To generate stickers for paper documents, the QR Code Label Generator can be used; it is designed for Avery L4731 label sheets (189 labels per sheet; search the Internet for "L4731REV-10" (10 sheets) or "L4731REV-25" (25+5 sheets)).
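
For Paperless-ngx to pick up such a label during consumption, PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE (false in the default configuration listed earlier) should be set to true. The barcode text then consists of the prefix from PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX (default: ASN) followed by the number. The sketch below prints such label texts; the function name asn_labels and the zero-padding to five digits are assumptions for illustration, not something the source prescribes.

```shell
# Print ASN label texts for a range of serial numbers.
# $1 = first number, $2 = last number, $3 = prefix (default: ASN)
# Pass the numbers without leading zeros.
asn_labels() {
    n=$1
    while [ "$n" -le "$2" ]; do
        printf '%s%05d\n' "${3:-ASN}" "$n"
        n=$((n + 1))
    done
}

# Usage:
#   asn_labels 1 189    # one label text per line, enough for one sheet
```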

Troubleshooting

paperless-task-queue.service: Failed at step EXEC spawning celery: No such file or directory

The path to celery is wrong. Adjust ExecStart in /etc/systemd/system/paperless-task-queue.service.
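
A quick way to spot this kind of error in all unit files at once is to check that the program named in each ExecStart line exists and is executable. The following is a sketch: the function name execstart_ok is made up for illustration, and it ignores systemd's special ExecStart prefixes such as "-" or "@".

```shell
# Succeed if the program named in a unit file's first ExecStart line is executable.
# $1 = path to a systemd unit file
execstart_ok() {
    bin=$(sed -n 's/^ExecStart=\([^ ]*\).*/\1/p' "$1" | head -n 1)
    [ -n "$bin" ] && [ -x "$bin" ]
}

# Usage:
#   for unit in /etc/systemd/system/paperless-*.service; do
#       execstart_ok "$unit" || echo "check ExecStart in $unit"
#   done
```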

MissingDependencyError: gs

Install Ghostscript 9.55 or later.

Error occurred while consuming document: DigitalSignatureError: Input PDF has a digital signature. OCR would alter the document

Allow OCRmyPDF to invalidate the signature by setting PAPERLESS_OCR_USER_ARGS={"invalidate_digital_signatures": true} in /opt/paperless/paperless.conf.

Built on 2024-07-16