Paperless-ngx
Under the hood, Paperless-ngx is built on Django, so it is written in Python.
Installation
This section describes the bare-metal installation (that is, without Docker) of Paperless 2.7.2 or later on Debian 12 with a MariaDB backend, loosely following https://docs.paperless-ngx.com/setup/#bare_metal.
With Python 3.11 on Debian 12 Bookworm, a virtual environment is mandatory, because pip by default refuses to install packages into the system-wide Python environment.
The OCRmyPDF component used by Paperless-ngx requires Ghostscript 9.55+, which is available as of Debian 12.
On RHEL, apart from the different package names, the software cannot be brought up without extra effort: on RHEL 8, for example, Unpaper is not available and would have to be compiled; on RHEL 9 or Fedora Server, mysqlclient cannot be installed via pip; and icc-profiles-free is missing everywhere, so it would have to be compiled as well.
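The Ghostscript requirement can be checked before anything else is installed. The following is a minimal sketch rather than part of the original procedure: `version_ge` is a hypothetical helper name, and it assumes GNU `sort` with its `-V` (version sort) option, as shipped with Debian.

```shell
# Hypothetical helper: succeeds if version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use (assumes "gs --version" prints a bare version such as 10.00.0):
#   version_ge "$(gs --version)" 9.55 && echo "Ghostscript is recent enough"
```

The helper sorts the two version strings and checks whether the required version comes first, which avoids parsing the individual version components by hand.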
Prerequisites:
On a MariaDB server, create a user "paperless" with all privileges except GRANT, and a schema "paperless":
CREATE SCHEMA `paperless` DEFAULT CHARACTER SET utf8mb4;
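The user creation itself could look roughly like the following sketch. It is an illustration under assumptions, not the original author's statements: `paperless_sql` is a hypothetical helper name, the account is assumed to be `'paperless'@'localhost'` (database on the same host), and the password is passed as the first argument.

```shell
# Hypothetical helper: prints SQL that creates the schema and the user.
# $1 is the password for the database user; it must not contain a single quote.
paperless_sql() {
  printf "CREATE SCHEMA IF NOT EXISTS \`paperless\` DEFAULT CHARACTER SET utf8mb4;\n"
  printf "CREATE USER IF NOT EXISTS 'paperless'@'localhost' IDENTIFIED BY '%s';\n" "$1"
  # ALL PRIVILEGES does not include the GRANT OPTION privilege.
  printf "GRANT ALL PRIVILEGES ON \`paperless\`.* TO 'paperless'@'localhost';\n"
}

# Example use as a MariaDB administrator:
#   paperless_sql 'my-secret' | mariadb
```

Printing the statements instead of executing them directly makes it possible to review the SQL before it is piped into the client.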
Note
Using MariaDB results in the following warnings:
account.EmailAddress: (models.W036) MariaDB does not support unique constraints with conditions.
    HINT: A constraint won't be created. Silence this warning if you don't care about it.
account.EmailAddress: (models.W043) MariaDB does not support indexes on expressions.
    HINT: An index won't be created. Silence this warning if you don't care about it.
documents.Correspondent: (models.W036) MariaDB does not support unique constraints with conditions.
    HINT: A constraint won't be created. Silence this warning if you don't care about it.
documents.DocumentType: (models.W036) MariaDB does not support unique constraints with conditions.
    HINT: A constraint won't be created. Silence this warning if you don't care about it.
documents.StoragePath: (models.W036) MariaDB does not support unique constraints with conditions.
    HINT: A constraint won't be created. Silence this warning if you don't care about it.
documents.Tag: (models.W036) MariaDB does not support unique constraints with conditions.
    HINT: A constraint won't be created. Silence this warning if you don't care about it.
Install and configure Redis v6+.
Then:
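Whether the installed Redis meets the v6+ requirement can be verified with a small check. This is a sketch under assumptions: `redis_major` is a hypothetical helper name, and it expects the usual output format of `redis-server --version`, which contains a `v=MAJOR.MINOR.PATCH` field.

```shell
# Hypothetical helper: reads the output of "redis-server --version" on stdin,
# e.g. "Redis server v=7.0.15 sha=... bits=64", and prints the major version.
redis_major() {
  sed -n 's/.*v=\([0-9][0-9]*\)\..*/\1/p'
}

# Example use:
#   [ "$(redis-server --version | redis_major)" -ge 6 ] && echo "Redis is v6+"
```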
apt update
apt -y upgrade
apt -y install \
python3 \
python3-pip \
python3-dev \
default-libmysqlclient-dev \
pkg-config \
fonts-liberation \
imagemagick \
gnupg \
libpq-dev \
libmagic-dev \
mariadb-client \
mime-support \
libzbar0 \
poppler-utils
For OCR:
apt -y install unpaper \
ghostscript \
icc-profiles-free \
qpdf \
liblept5 \
libxml2 \
pngquant \
zlib1g \
tesseract-ocr \
tesseract-ocr-eng tesseract-ocr-deu
Download the software:
apt -y install curl
VER=2.7.2
curl --remote-name --location https://github.com/paperless-ngx/paperless-ngx/releases/download/v$VER/paperless-ngx-v$VER.tar.xz
tar -xf paperless-ngx-v$VER.tar.xz
rm -f paperless-ngx-v$VER.tar.xz
mv paperless-ngx/ /opt/paperless/
mkdir -p /opt/paperless/{consume,data,trash,media,static}
Create the user "paperless":
adduser paperless --system --home /opt/paperless --group
chown -R paperless:paperless /opt/paperless
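Before continuing, it can be worth confirming that the working directories created above actually exist. The following sketch uses a hypothetical helper name, `missing_dirs`, and takes the installation directory as its only argument.

```shell
# Hypothetical helper: prints each expected directory that is missing below $1.
# Prints nothing if the layout is complete.
missing_dirs() {
  for d in consume data trash media static; do
    [ -d "$1/$d" ] || printf '%s\n' "$1/$d"
  done
}

# Example use:
#   missing_dirs /opt/paperless
```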
Install the Python dependencies in a virtualenv:
apt -y install python3-venv
mkdir -p /opt/python-venv
python3 -m venv /opt/python-venv/paperless
source /opt/python-venv/paperless/bin/activate
chown -R paperless:paperless /opt/python-venv/paperless
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/pip3 install --upgrade pip
cd /opt/paperless
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/pip3 install --requirement requirements.txt
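Because all later commands call binaries inside the virtual environment by absolute path, a quick sanity check of that directory can catch a failed `python3 -m venv` run early. This is a minimal sketch; `is_venv` is a hypothetical helper name, and the check only looks for the `pyvenv.cfg` marker file and an executable `bin/python3`.

```shell
# Hypothetical helper: succeeds if $1 looks like a Python virtual environment,
# i.e. it contains a pyvenv.cfg file and an executable bin/python3.
is_venv() {
  [ -f "$1/pyvenv.cfg" ] && [ -x "$1/bin/python3" ]
}

# Example use:
#   is_venv /opt/python-venv/paperless || echo "virtual environment is incomplete"
```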
Have PDF documents processed by ImageMagick instead of Ghostscript. To do so, change the PDF coder policy in ImageMagick's policy.xml as follows:
<!-- <policy domain="coder" rights="none" pattern="PDF" /> -->
<policy domain="coder" rights="read|write" pattern="PDF" />
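The policy change can also be scripted. The sketch below is an assumption-laden illustration: `allow_pdf_policy` is a hypothetical helper name, it relies on GNU sed's in-place option, and the path in the usage comment assumes that Debian 12 keeps the policy in `/etc/ImageMagick-6/policy.xml`.

```shell
# Hypothetical helper: switches the PDF coder policy in the file $1 from
# rights="none" to rights="read|write"; a backup is kept as $1.bak.
# Relies on GNU sed's in-place option.
allow_pdf_policy() {
  sed --in-place=.bak \
    's,rights="none" pattern="PDF",rights="read|write" pattern="PDF",' "$1"
}

# Example use (assumption: Debian 12 keeps the policy in this file):
#   allow_pdf_policy /etc/ImageMagick-6/policy.xml
```

Unlike the manual edit, this replaces the line instead of commenting it out; the original stays available in the `.bak` copy.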
Configuration
Before the first start, set at least the following in /opt/paperless/paperless.conf:
PAPERLESS_REDIS=redis://127.0.0.1:6379
PAPERLESS_DBENGINE=mariadb
PAPERLESS_DBHOST=localhost
PAPERLESS_DBPORT=3306
PAPERLESS_DBNAME=paperless
PAPERLESS_DBUSER=paperless
PAPERLESS_DBPASS=linuxfabrik
PAPERLESS_CONSUMPTION_DIR=../consume
PAPERLESS_DATA_DIR=../data
PAPERLESS_TRASH_DIR=../trash
PAPERLESS_MEDIA_ROOT=../media
PAPERLESS_STATICDIR=../static
PAPERLESS_FILENAME_FORMAT={created_year}/{correspondent}/{title}
PAPERLESS_SECRET_KEY=8214cb7f-9646-487e-b3f1-9f0aeb7a5a0f
PAPERLESS_OCR_LANGUAGE=deu+eng
PAPERLESS_TIME_ZONE=Europe/Zurich
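The values for PAPERLESS_DBPASS and PAPERLESS_SECRET_KEY shown above are examples and should be replaced with your own. One possible way to generate a random secret key is sketched below; `gen_secret_key` is a hypothetical helper name, and the choice of 32 random bytes in hexadecimal notation is an assumption, not a Paperless-ngx requirement.

```shell
# Hypothetical helper: prints 64 hexadecimal characters (32 random bytes).
gen_secret_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
  printf '\n'
}

# Example use:
#   echo "PAPERLESS_SECRET_KEY=$(gen_secret_key)"
```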
The complete configuration file with all default settings:
# Have a look at the docs for documentation.
# https://docs.paperless-ngx.com/configuration/
# Debug. Only enable this for development.
#PAPERLESS_DEBUG=false
# Required services
#PAPERLESS_REDIS=redis://localhost:6379
#PAPERLESS_REDIS_PREFIX=
# Database
#PAPERLESS_DBENGINE=postgresql
#PAPERLESS_DBHOST=
#PAPERLESS_DBPORT=5432
#PAPERLESS_DBNAME=paperless
#PAPERLESS_DBUSER=paperless
#PAPERLESS_DBPASS=paperless
#PAPERLESS_DBSSLMODE=prefer
#PAPERLESS_DBSSLROOTCERT=
#PAPERLESS_DBSSLCERT=
#PAPERLESS_DBSSLKEY=
#PAPERLESS_DB_TIMEOUT=
# Optional Services
# Tika
#PAPERLESS_TIKA_ENABLED=false
#PAPERLESS_TIKA_ENDPOINT=http://localhost:9998
#PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://localhost:3000
# Paths and folders
#PAPERLESS_CONSUMPTION_DIR=../consume/
#PAPERLESS_DATA_DIR=../data/
#PAPERLESS_TRASH_DIR=
#PAPERLESS_MEDIA_ROOT=../media/
#PAPERLESS_STATICDIR=../static/
#PAPERLESS_FILENAME_FORMAT=none
#PAPERLESS_FILENAME_FORMAT_REMOVE_NONE=false
#PAPERLESS_LOGGING_DIR=PAPERLESS_DATA_DIR/log/
#PAPERLESS_NLTK_DIR=/usr/share/nltk_data
# Logging
#PAPERLESS_LOGROTATE_MAX_SIZE=1M
#PAPERLESS_LOGROTATE_MAX_BACKUPS=20
# Hosting & Security
#PAPERLESS_SECRET_KEY=see `src/paperless/settings.py`
#PAPERLESS_URL=
#PAPERLESS_CSRF_TRUSTED_ORIGINS=
#PAPERLESS_ALLOWED_HOSTS=*
#PAPERLESS_CORS_ALLOWED_HOSTS=http://localhost:8000
#PAPERLESS_TRUSTED_PROXIES=
#PAPERLESS_FORCE_SCRIPT_NAME=
#PAPERLESS_STATIC_URL=/static/
#PAPERLESS_AUTO_LOGIN_USERNAME=
#PAPERLESS_ADMIN_USER=<username>
#PAPERLESS_ADMIN_MAIL=root@localhost
#PAPERLESS_ADMIN_PASSWORD=<password>
#PAPERLESS_COOKIE_PREFIX=
#PAPERLESS_ENABLE_HTTP_REMOTE_USER=false
#PAPERLESS_ENABLE_HTTP_REMOTE_USER_API=false
#PAPERLESS_HTTP_REMOTE_USER_HEADER_NAME=HTTP_REMOTE_USER
#PAPERLESS_LOGOUT_REDIRECT_URL=
#PAPERLESS_USE_X_FORWARD_HOST=false
#PAPERLESS_USE_X_FORWARD_PORT=false
#PAPERLESS_PROXY_SSL_HEADER=
#PAPERLESS_EMAIL_CERTIFICATE_LOCATION=
#PAPERLESS_SOCIALACCOUNT_PROVIDERS=
#PAPERLESS_SOCIAL_AUTO_SIGNUP=false
#PAPERLESS_SOCIALACCOUNT_ALLOW_SIGNUPS=true
#PAPERLESS_ACCOUNT_ALLOW_SIGNUPS=false
#PAPERLESS_ACCOUNT_DEFAULT_HTTP_PROTOCOL=https
#PAPERLESS_ACCOUNT_EMAIL_VERIFICATION=optional
#PAPERLESS_DISABLE_REGULAR_LOGIN=false
#PAPERLESS_ACCOUNT_SESSION_REMEMBER=<bool>
# OCR settings
#PAPERLESS_OCR_LANGUAGE=eng
#PAPERLESS_OCR_MODE=skip
#PAPERLESS_OCR_SKIP_ARCHIVE_FILE=never
#PAPERLESS_OCR_CLEAN=clean
#PAPERLESS_OCR_DESKEW=true
#PAPERLESS_OCR_ROTATE_PAGES=true
#PAPERLESS_OCR_ROTATE_PAGES_THRESHOLD=12
#PAPERLESS_OCR_OUTPUT_TYPE=<type>
#PAPERLESS_OCR_PAGES=
#PAPERLESS_OCR_IMAGE_DPI=
#PAPERLESS_OCR_MAX_IMAGE_PIXELS=<num>
#PAPERLESS_OCR_COLOR_CONVERSION_STRATEGY=<RGB>
#PAPERLESS_OCR_USER_ARGS=<json>
# Software tweaks
#PAPERLESS_TASK_WORKERS=1
#PAPERLESS_THREADS_PER_WORKER=<num>
#PAPERLESS_WORKER_TIMEOUT=1800
#PAPERLESS_TIME_ZONE=UTC
#PAPERLESS_ENABLE_NLTK=1
#PAPERLESS_EMAIL_TASK_CRON=*/10 * * * *
#PAPERLESS_TRAIN_TASK_CRON=5 */1 * * *
#PAPERLESS_INDEX_TASK_CRON=0 0 * * *
#PAPERLESS_SANITY_TASK_CRON=30 0 * * sun
#PAPERLESS_ENABLE_COMPRESSION=1
#PAPERLESS_CONVERT_MEMORY_LIMIT=0
#PAPERLESS_CONVERT_TMPDIR=
#PAPERLESS_APPS=
#PAPERLESS_MAX_IMAGE_PIXELS=
# Document Consumption
#PAPERLESS_CONSUMER_DELETE_DUPLICATES=false
#PAPERLESS_CONSUMER_RECURSIVE=false
#PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS=false
#PAPERLESS_CONSUMER_IGNORE_PATTERNS=[".DS_Store", ".DS_STORE", "._*", ".stfolder/*", ".stversions/*", ".localized/*", "desktop.ini", "@eaDir/*", "Thumbs.db"]
#PAPERLESS_CONSUMER_BARCODE_SCANNER=PYZBAR
#PAPERLESS_PRE_CONSUME_SCRIPT=
#PAPERLESS_POST_CONSUME_SCRIPT=
#PAPERLESS_FILENAME_DATE_ORDER=
#PAPERLESS_NUMBER_OF_SUGGESTED_DATES=3
#PAPERLESS_THUMBNAIL_FONT_NAME=/usr/share/fonts/liberation/LiberationSerif-Regular.ttf
#PAPERLESS_IGNORE_DATES=
#PAPERLESS_DATE_ORDER=<format>
# Polling
#PAPERLESS_CONSUMER_POLLING=0
#PAPERLESS_CONSUMER_POLLING_RETRY_COUNT=5
#PAPERLESS_CONSUMER_POLLING_DELAY=5
# iNotify
#PAPERLESS_CONSUMER_INOTIFY_DELAY=0.5
# Barcodes
#PAPERLESS_CONSUMER_ENABLE_BARCODES=false
#PAPERLESS_CONSUMER_BARCODE_TIFF_SUPPORT=false
#PAPERLESS_CONSUMER_BARCODE_STRING=PATCHT
#PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE=false
#PAPERLESS_CONSUMER_ASN_BARCODE_PREFIX=ASN
#PAPERLESS_CONSUMER_BARCODE_UPSCALE=0.0
#PAPERLESS_CONSUMER_BARCODE_DPI=300
#PAPERLESS_CONSUMER_ENABLE_TAG_BARCODE=false
#PAPERLESS_CONSUMER_TAG_BARCODE_MAPPING={"TAG:(.)": "\g<1>"}
# Audit Trail
#PAPERLESS_AUDIT_LOG_ENABLED=true
# Collate Double-Sided Documents
#PAPERLESS_CONSUMER_ENABLE_COLLATE_DOUBLE_SIDED=false
#PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_SUBDIR_NAME=double-sided
#PAPERLESS_CONSUMER_COLLATE_DOUBLE_SIDED_TIFF_SUPPORT=false
# Binaries
#PAPERLESS_CONVERT_BINARY=convert
#PAPERLESS_GS_BINARY=gs
# Docker-specific options
#PAPERLESS_WEBSERVER_WORKERS=1
#PAPERLESS_BIND_ADDR=[::]
#PAPERLESS_PORT=8000
#USERMAP_UID=1000
#USERMAP_GID=1000
#PAPERLESS_OCR_LANGUAGES=
#PAPERLESS_ENABLE_FLOWER=<defined>
#PAPERLESS_SUPERVISORD_WORKING_DIR=<defined>
# Frontend Settings
#PAPERLESS_APP_TITLE=<bool>
#PAPERLESS_APP_LOGO=<path>
# Email sending
#PAPERLESS_EMAIL_HOST=localhost
#PAPERLESS_EMAIL_PORT=25
#PAPERLESS_EMAIL_HOST_USER=
#PAPERLESS_EMAIL_FROM=PAPERLESS_EMAIL_HOST_USER
#PAPERLESS_EMAIL_HOST_PASSWORD=
#PAPERLESS_EMAIL_USE_TLS=false
#PAPERLESS_EMAIL_USE_SSL=false
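Since almost every line in this file is commented out, it helps to list only the settings that are actually active. A minimal sketch, with `active_settings` as a hypothetical helper name:

```shell
# Hypothetical helper: prints the settings that are actually set in the
# configuration file $1, skipping comments and blank lines.
active_settings() {
  grep -E '^[[:space:]]*[A-Z_]+=' "$1"
}

# Example use:
#   active_settings /opt/paperless/paperless.conf
```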
Next, create the database tables once, start the built-in Django server by hand, and open the site in a browser:
cd /opt/paperless/src
# just once: create the database tables
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3 manage.py migrate
# this creates your first paperless user
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3 manage.py createsuperuser
# to start the server:
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3 manage.py runserver 0.0.0.0:8000
Open http://paperless-ngx:8000 in the browser and log in with the user created above.
Then stop the web server and set up the systemd services:
cp /opt/paperless/scripts/paperless-* /etc/systemd/system/
sed --in-place 's,ExecStart=celery,ExecStart=/opt/python-venv/paperless/bin/celery,g' /etc/systemd/system/paperless-scheduler.service
sed --in-place 's,ExecStart=celery,ExecStart=/opt/python-venv/paperless/bin/celery,g' /etc/systemd/system/paperless-task-queue.service
sed --in-place 's,ExecStart=/opt/paperless/.local/bin/gunicorn,ExecStart=/opt/python-venv/paperless/bin/gunicorn,g' /etc/systemd/system/paperless-webserver.service
sed --in-place 's,ExecStart=python3,ExecStart=/opt/python-venv/paperless/bin/python3,g' /etc/systemd/system/paperless-consumer.service
systemctl daemon-reload
systemctl enable --now paperless-webserver.service
systemctl enable --now paperless-consumer.service
systemctl enable --now paperless-task-queue.service
systemctl enable --now paperless-scheduler.service
systemctl status paperless-*
With the default settings, the scheduler runs every 10 minutes.
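If a service fails to start, a common cause is an ExecStart path that the sed commands did not rewrite as intended. The following sketch extracts the program from a unit file so that its existence can be checked. `unit_exec_binary` is a hypothetical helper name, and it assumes a plain `ExecStart=` line without systemd prefix characters such as `-` or `@`.

```shell
# Hypothetical helper: prints the program named in the first ExecStart= line
# of the unit file $1 (the first word after "ExecStart=").
unit_exec_binary() {
  sed -n 's/^ExecStart=\([^[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# Example use: report units whose program does not exist or is not executable.
#   for unit in /etc/systemd/system/paperless-*.service; do
#     bin="$(unit_exec_binary "$unit")"
#     [ -x "$bin" ] || echo "$unit: $bin not found"
#   done
```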
Install NLTK machine learning (see https://www.nltk.org/data.html):
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/pip3 install --upgrade nltk
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/pip3 install --upgrade numpy
Install the NLTK data with the downloader from an interactive Python session:
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3
import nltk
# if behind reverse proxy:
#nltk.set_proxy('http://proxy.example.com:3128', ('USERNAME', 'PASSWORD'))
nltk.download('punkt')
nltk.download('snowball_data')
nltk.download('stopwords')
exit()
Adjust the path to the NLTK data:
# Paths and folders
PAPERLESS_NLTK_DIR=/opt/paperless/nltk_data
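As an alternative to the interactive session, NLTK also ships a command-line downloader that can be scripted. The sketch below is an illustration under assumptions: `download_nltk_data` is a hypothetical helper name, the `PY` variable is introduced here only to make the interpreter path overridable, and the target directory is passed explicitly with the downloader's `-d` option.

```shell
# Python interpreter of the virtual environment; override PY to use another one.
PY="${PY:-/opt/python-venv/paperless/bin/python3}"

# Hypothetical helper: downloads the three NLTK packages into the directory $1
# using NLTK's command-line downloader.
download_nltk_data() {
  "$PY" -m nltk.downloader -d "$1" punkt snowball_data stopwords
}

# Example use, as the paperless user:
#   download_nltk_data /opt/paperless/nltk_data
```

Passing the directory explicitly keeps the download location in line with PAPERLESS_NLTK_DIR, regardless of the home directory of the calling user.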
Optional next steps:
In the personal settings, switch "Date display" to "ISO 8601".
Install an NFS or Samba server and share PAPERLESS_CONSUMPTION_DIR.
Set up a mail account such as paper@example.com to fetch documents from.
Usage
Before uploading the first document, create suitable correspondents, tags, and document types. Important: be sure to create a tag "inbox" of type "inbox". Uploaded documents automatically receive this tag and can thus be filtered for manual review (remove the "inbox" tag afterwards). Uploaded documents are stored under /opt/paperless/media/documents/originals.
Management tools, for example for bulk-tagging documents, are located in /opt/paperless/src/documents/management/commands/. Example call to update the trained neural network:
cd /opt/paperless/src
sudo --set-home --user=paperless /opt/python-venv/paperless/bin/python3 manage.py document_create_classifier
Archive Serial Number (ASN)
Sequential numbering for paper documents based on a QR code. To generate stickers for paper documents, the QR Code Label Generator can be used; it is designed for Avery L4731 label sheets (189 labels per page; search the internet for "L4731REV-10" (10 sheets) or "L4731REV-25" (25+5 sheets)).
Troubleshooting
paperless-task-queue.service: Failed at step EXEC spawning celery: No such file or directory
The path to celery is wrong. Adjust /etc/systemd/system/paperless-task-queue.service.
MissingDependencyError: gs
Install Ghostscript 9.55+.
Error occurred while consuming document: DigitalSignatureError: Input PDF has a digital signature. OCR would alter the document
PAPERLESS_OCR_USER_ARGS={"invalidate_digital_signatures": true}
Built on 2024-11-18