# DPLP-German RST Parser & EDU Segmenter
- clone repository

  ```shell
  git clone git@github.com:MaximilianKr/DPLP-German.git
  cd DPLP-German
  ```
- create virtual environment (using uv)

  ```shell
  uv venv --python 3.10
  source .venv/bin/activate
  uv pip install -r requirements.txt
  ```
- pull docker image

  ```shell
  docker pull mohamadisara20/dplp-env:ger
  ```
From the root of `DPLP-German`:

- put `.txt` file(s) to parse into `data/{folder}`
  - for example, using `data/test_input` --> adjust `{input_folder}`
- the segmenter will output `.txt` files with line-separated EDUs (see the example below)
  - adjust `{output_folder}` accordingly
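For illustration, a segmented output file contains one EDU per line. The sentence below is invented and the exact tokenization may differ:

```text
weil es gestern geregnet hat ,
sind die Straßen noch nass .
```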
- run segmenter with the default model

  ```shell
  python run_seg_pipeline.py {input_folder} {output_folder}
  ```

  - for example:

    ```shell
    python run_seg_pipeline.py data/test_input data/test_output
    ```
- you can use a custom-trained model by providing the path via the `--corpus_base_path` argument
  - for details on how to create a custom model, see the section Retrain DPLP-German with a Custom Corpus

  ```shell
  python run_seg_pipeline.py {input_folder} {output_folder} --corpus_base_path <path_to_your_corpus>
  ```

  - for example:

    ```shell
    python run_seg_pipeline.py data/pcc/test_input data/pcc/test_output --corpus_base_path data/pcc
    ```
- run parser with the default model

  ```shell
  docker run -it \
    -v $(pwd):/home/DPLP \
    -w /home/DPLP \
    mohamadisara20/dplp-env:ger \
    python3 ger_predict_dis_from_txt.py {input_folder}
  ```

  - for example:

    ```shell
    docker run -it \
      -v $(pwd):/home/DPLP \
      -w /home/DPLP \
      mohamadisara20/dplp-env:ger \
      python3 ger_predict_dis_from_txt.py data/test_input
    ```
## Retrain DPLP-German with a Custom Corpus
Before training, your corpus of `.rs3` files must be split into training, dev, and test sets.

- Ensure your annotated `.rs3` files are in a single source directory.
- Run the splitting script:

  ```shell
  python3 scripts/split_corpus.py <source_directory> <destination_directory> --split <train_count> <dev_count> <test_count>
  ```

  - `<source_directory>`: the directory containing your full set of `.rs3` files.
  - `<destination_directory>`: the base path where `training/`, `dev/`, and `test/` subdirectories will be created (e.g., `data/my_corpus`).
  - `--split`: the number of files for the training, dev, and test sets.
  - `--seed` (optional): a number to seed the random shuffle for reproducibility (defaults to `42`).

Example: to split 176 files from `data/pcc/rs3` into a training set of 141, a dev set of 18, and a test set of 17 inside `data/pcc`, run:

```shell
python3 scripts/split_corpus.py data/pcc/rs3 data/pcc --split 141 18 17
```
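Conceptually, the split is a seeded shuffle followed by slicing into the three counts. The sketch below is a simplified stand-in for that logic, not the actual `scripts/split_corpus.py`; file-handling details in the real script may differ.

```python
import argparse
import random
import shutil
from pathlib import Path

# Simplified stand-in for the corpus-splitting step: shuffle the .rs3 files
# with a fixed seed, then slice them into training/dev/test by count.
parser = argparse.ArgumentParser()
parser.add_argument("source")
parser.add_argument("destination")
parser.add_argument("--split", nargs=3, type=int, required=True)
parser.add_argument("--seed", type=int, default=42)
args = parser.parse_args()

files = sorted(Path(args.source).glob("*.rs3"))
random.Random(args.seed).shuffle(files)

n_train, n_dev, n_test = args.split
splits = {
    "training": files[:n_train],
    "dev": files[n_train:n_train + n_dev],
    "test": files[n_train + n_dev:n_train + n_dev + n_test],
}

for name, subset in splits.items():
    target = Path(args.destination) / name
    target.mkdir(parents=True, exist_ok=True)
    for f in subset:
        shutil.copy(f, target / f.name)  # copy, leaving the source files intact
```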
Once your data is split, you can train the model using the `train_custom_model.sh` script. Simply provide the path to your corpus directory as an argument.

- Run the training:

  ```shell
  bash train_custom_model.sh <corpus_base_path>
  ```

  - `<corpus_base_path>`: the destination directory you used in the splitting step (e.g., `data/pcc`).

Example:

```shell
bash train_custom_model.sh data/pcc
```

The script will automatically generate a custom relation map and run the entire training pipeline. The model files will be saved to `<corpus_base_path>`, with evaluation results in `<corpus_base_path>/result.txt`.
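As a rough orientation, based only on the descriptions above (the exact model file names are not listed here), the corpus directory ends up looking roughly like this after splitting and training:

```text
data/pcc/
├── training/      # .rs3 training files (from the split step)
├── dev/           # .rs3 dev files
├── test/          # .rs3 test files
├── result.txt     # evaluation results written by the training script
└── ...            # model files produced by the training pipeline
```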
After training, you can use your custom model to parse new `.txt` files.

- Run prediction:

  ```shell
  bash predict_with_custom_model.sh <corpus_base_path> <input_directory> <output_directory>
  ```

  - `<corpus_base_path>`: the path to the corpus directory containing your trained model.
  - `<input_directory>`: the directory containing your new `.txt` files.
  - `<output_directory>`: the directory where the output `.dis` and `.rs3` files will be saved.

Example: to parse files from `data/pcc/test_input` using the model in `data/pcc` and save the parsed results to `data/pcc/test_output`, run:

```shell
bash predict_with_custom_model.sh data/pcc data/pcc/test_input data/pcc/test_output
```
This repository is a fork of the DPLP parser that adapts the code to German by adding new code and other resources. A ready-to-use model has been trained on the Potsdam University RST corpus. The pretrained model can be found in `./data/de` along with its accompanying files. The source code consists of several Python scripts in the root directory, prefixed with `ger`.
The original DPLP parser was written in Python 2, which is discontinued, and depends on libraries that can no longer be installed via package managers. Therefore, a prebuilt Docker image (mohamadisara20/dplp-env) with all dependencies preinstalled has been created and shared on Docker Hub. Both the German and the original English DPLP parsers can be executed within this image. The image can be found here: https://hub.docker.com/repository/docker/mohamadisara20/dplp-env/general

There are two tags for this Docker image:
- latest: the default tag suitable for running the English DPLP parser.
- ger: the tag created for running German DPLP. It has some language-specific libraries and data files pre-installed.
The German RST parser's main script is `ger_predict_dis_from_txt.py` (it generates RST trees in both `.dis` and `.rs3` formats). All preprocessing, segmentation, and parsing are included, so the script simply takes text files as input and generates RST trees as its final output. You can follow these steps to parse a batch of text files using this script in a Docker container:
1- In the terminal, change directory to the repo's root path:

```shell
cd path_to_rst_german
```

2- Copy the `.txt` input files into a single subdirectory in `./data` (say, `data/input`). Please don't choose any place outside the current directory.

3- Run the following command:

```shell
docker run -d -v $(pwd):/home/DPLP -w /home/DPLP mohamadisara20/dplp-env:ger python3 ger_predict_dis_from_txt.py data/input
```

Important: replace `data/input` with the relative path to your input text files. You don't need to change anything else for standard parsing. For customization with additional args, refer to the parser script's source code.
4- There are several ways to check the progress or debug errors:

- verify the new files created in the input path
- replace the `-d` option with `-it` to get all logs printed throughout the process
- use `docker logs` to see the logs generated by the Docker container

5- You will see new files created throughout the parsing process. When `.rs3` files are generated, the process is finished.
The script `ger_rest_api.py` creates a REST API server that allows you to perform RST parsing remotely or integrate it into a web application. It can be launched by running:

```shell
docker run -d -p 5000:5000 -v $(pwd):/home/DPLP -w /home/DPLP mohamadisara20/dplp-env python3 ger_rest_api.py
```

The REST API will then listen on port 5000 and accept JSON requests at the subpath `dplp`, returning the RST tree in the two formats `dis` and `rs3`. You can test it with this command:

```shell
curl -d '{"text":"Ich bin gut."}' -H 'Content-Type: application/json' http://127.0.0.1:5000/dplp
```

You will get a result like this:
```json
{
  "dis": "(Root (leaf 1) (rel2par None) (text _!lorem ipsum_!))\n",
  "rs3": "<rst>\n<header>\n <relations>\n <rel name=\"Antithesis\" type=\"rst\"/>\n <rel name=\"Background\" type=\"rst\"/>\n <rel name=\"Cause\" type=\"rst\"/>\n <rel name=\"Circumstance\" type=\"rst\"/>\n <rel name=\"Concession\" type=\"rst\"/>\n <rel name=\"Condition\" type=\"rst\"/>\n <rel name=\"Conjunction\" type=\"multinuc\"/>\n <rel name=\"Contrast\" type=\"multinuc\"/>\n <rel name=\"Disjunction\" type=\"multinuc\"/>\n <rel name=\"Elaboration\" type=\"rst\"/>\n <rel name=\"Enablement\" type=\"rst\"/>\n <rel name=\"Evaluation\" type=\"rst\"/>\n <rel name=\"Evidence\" type=\"rst\"/>\n <rel name=\"Interpretation\" type=\"rst\"/>\n <rel name=\"Joint\" type=\"multinuc\"/>\n <rel name=\"Justify\" type=\"rst\"/>\n <rel name=\"Motivation\" type=\"rst\"/>\n <rel name=\"Otherwise\" type=\"rst\"/>\n <rel name=\"Preparation\" type=\"rst\"/>\n <rel name=\"Purpose\" type=\"rst\"/>\n <rel name=\"Restatement\" type=\"rst\"/>\n <rel name=\"Result\" type=\"rst\"/>\n <rel name=\"Sequence\" type=\"multinuc\"/>\n <rel name=\"Solutionhood\" type=\"rst\"/>\n <rel name=\"Summary\" type=\"rst\"/>\n </relations>\n</header>\n<body>\n <segment id=\"2\">Ich bin gut .</segment>\n</body>\n</rst>",
  "dis_url": "rstout/5d74b5b2-18c6-4443-9d5f-b67ebfe947a3/document.dis",
  "rs3_url": "rstout/5d74b5b2-18c6-4443-9d5f-b67ebfe947a3/document.rs3",
  "uid": "5d74b5b2-18c6-4443-9d5f-b67ebfe947a3"
}
```
You can train your own parser using a corpus of RST trees. The German parser uses `.rs3` files for training. Training consists of these steps:

1- Divide the corpus into train, dev, and test sets and save them in three subdirectories `training/`, `dev/`, and `test/` in a base directory; for instance, if the base directory is `data/base_dir`, you will get these subdirectories: `data/base_dir/training`, `data/base_dir/dev`, and `data/base_dir/test`.

2- Review the content of the relation mapping file `parsing_eval_metrics/rel_mapping.json`. It should contain all relations used in the whole corpus (train, dev, and test sets); otherwise training can fail and show undefined behavior. (A small helper for checking this is sketched after this list.)

3- The script `ger_train_parser.py` triggers the training process on the base directory. You can use the following command to run the training code in the Docker container:

```shell
docker run -d -v $(pwd):/home/DPLP -w /home/DPLP mohamadisara20/dplp-env:ger python3 ger_train.py data/base_dir
```

4- The model will be saved as `model/model.pickle.gz` inside the base path (e.g., `data/de`).

5- The parser precision scores will be reported in the file named `results.txt` in the base directory.
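To see which relations your corpus actually uses, you can collect the relation names declared and referenced in the `.rs3` files and compare them against `parsing_eval_metrics/rel_mapping.json` by hand. The helper below is hypothetical (not part of the repository) and assumes the standard `.rs3` XML layout with `<rel name="..."/>` declarations and `relname` attributes, as in the API output above.

```python
import sys
from pathlib import Path
from xml.etree import ElementTree

# Hypothetical helper: list every relation name that appears in the .rs3
# files under a directory, so you can check coverage of rel_mapping.json.
def collect_relations(corpus_dir: str) -> set:
    names = set()
    for rs3_file in Path(corpus_dir).rglob("*.rs3"):
        root = ElementTree.parse(rs3_file).getroot()
        # relations declared in the <relations> header ...
        for rel in root.iter("rel"):
            names.add(rel.get("name"))
        # ... and relations actually used on segments/groups
        for node in root.iter():
            relname = node.get("relname")
            if relname:
                names.add(relname)
    return names

if __name__ == "__main__":
    for name in sorted(collect_relations(sys.argv[1])):
        print(name)
```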
Please read the following papers for more technical details:

- Ji, Y., & Eisenstein, J. (2014). Representation Learning for Text-level Discourse Parsing. ACL 2014.
- Joty, S., Carenini, G., & Ng, R. T. (2015). CODRA: A Novel Discriminative Framework for Rhetorical Analysis. Computational Linguistics.
- Shahmohammadi, S., & Stede, M. (2024). Discourse Parsing for German with new RST Corpora. Workshop Proceedings of the 20th Edition of the KONVENS Conference.