# DPLP-German RST Parser & EDU Segmenter
- clone repository

  ```shell
  git clone git@github.com:MaximilianKr/DPLP-German.git
  cd DPLP-German
  ```
- create virtual environment (using uv)

  ```shell
  uv venv --python 3.10
  source .venv/bin/activate
  uv pip install -r requirements.txt
  ```
- pull docker image

  ```shell
  docker pull mohamadisara20/dplp-env:ger
  ```
From the root of `DPLP-German`:

- put `.txt` file(s) to parse into `data/{folder}`
  - for example, using `data/test_input` --> adjust `{input_folder}`
- the segmenter will output `.txt` files with line-separated EDUs (see the example below)
  - adjust `{output_folder}` accordingly
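For illustration, a segmented output file contains one EDU per line. The sentence below is invented and the exact tokenization may differ:

```text
weil es gestern geregnet hat ,
sind die Straßen noch nass .
```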
- run segmenter with the default model

  ```shell
  python run_seg_pipeline.py {input_folder} {output_folder}
  ```

  - for example:

    ```shell
    python run_seg_pipeline.py data/test_input data/test_output
    ```
- you can use a custom-trained model by providing the path via the `--corpus_base_path` argument
  - for details on how to create a custom model, see the section Retrain DPLP-German with a Custom Corpus

  ```shell
  python run_seg_pipeline.py {input_folder} {output_folder} --corpus_base_path <path_to_your_corpus>
  ```

  - for example:

    ```shell
    python run_seg_pipeline.py data/pcc/test_input data/pcc/test_output --corpus_base_path data/pcc
    ```
- run parser with the default model

  ```shell
  docker run -it \
    -v $(pwd):/home/DPLP \
    -w /home/DPLP \
    mohamadisara20/dplp-env:ger \
    python3 ger_predict_dis_from_txt.py {input_folder}
  ```

  - for example:

    ```shell
    docker run -it \
      -v $(pwd):/home/DPLP \
      -w /home/DPLP \
      mohamadisara20/dplp-env:ger \
      python3 ger_predict_dis_from_txt.py data/test_input
    ```
## Retrain DPLP-German with a Custom Corpus
Before training, your corpus of `.rs3` files must be split into training, dev, and test sets.

- Ensure your annotated `.rs3` files are in a single source directory.
- Run the splitting script:

  ```shell
  python3 scripts/split_corpus.py <source_directory> <destination_directory> --split <train_count> <dev_count> <test_count>
  ```

  - `<source_directory>`: the directory containing your full set of `.rs3` files.
  - `<destination_directory>`: the base path where `training/`, `dev/`, and `test/` subdirectories will be created (e.g., `data/my_corpus`).
  - `--split`: the number of files for the training, dev, and test sets.
  - `--seed` (optional): a number to seed the random shuffle for reproducibility (defaults to `42`).

Example: to split 176 files from `data/pcc/rs3` into a training set of 141, a dev set of 18, and a test set of 17 inside `data/pcc`, run:

```shell
python3 scripts/split_corpus.py data/pcc/rs3 data/pcc --split 141 18 17
```
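Conceptually, the split is a seeded shuffle followed by slicing into the three counts. The sketch below is a simplified stand-in for that logic, not the actual `scripts/split_corpus.py`; file-handling details in the real script may differ.

```python
import argparse
import random
import shutil
from pathlib import Path

# Simplified stand-in for the corpus-splitting step: shuffle the .rs3 files
# with a fixed seed, then slice them into training/dev/test by count.
parser = argparse.ArgumentParser()
parser.add_argument("source")
parser.add_argument("destination")
parser.add_argument("--split", nargs=3, type=int, required=True)
parser.add_argument("--seed", type=int, default=42)
args = parser.parse_args()

files = sorted(Path(args.source).glob("*.rs3"))
random.Random(args.seed).shuffle(files)

n_train, n_dev, n_test = args.split
splits = {
    "training": files[:n_train],
    "dev": files[n_train:n_train + n_dev],
    "test": files[n_train + n_dev:n_train + n_dev + n_test],
}

for name, subset in splits.items():
    target = Path(args.destination) / name
    target.mkdir(parents=True, exist_ok=True)
    for f in subset:
        shutil.copy(f, target / f.name)  # copy, leaving the source files intact
```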
Once your data is split, you can train the model using the `train_custom_model.sh` script. Simply provide the path to your corpus directory as an argument.

- Run the training:

  ```shell
  bash train_custom_model.sh <corpus_base_path>
  ```

  - `<corpus_base_path>`: the destination directory you used in the splitting step (e.g., `data/pcc`).

Example:

```shell
bash train_custom_model.sh data/pcc
```

The script will automatically generate a custom relation map and run the entire training pipeline. The model files will be saved to `<corpus_base_path>`, with evaluation results in `<corpus_base_path>/result.txt`.
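As a rough orientation, based only on the descriptions above (the exact model file names are not listed here), the corpus directory ends up looking roughly like this after splitting and training:

```text
data/pcc/
├── training/      # .rs3 training files (from the split step)
├── dev/           # .rs3 dev files
├── test/          # .rs3 test files
├── result.txt     # evaluation results written by the training script
└── ...            # model files produced by the training pipeline
```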
After training, you can use your custom model to parse new `.txt` files.

- Run prediction:

  ```shell
  bash predict_with_custom_model.sh <corpus_base_path> <input_directory> <output_directory>
  ```

  - `<corpus_base_path>`: the path to the corpus directory containing your trained model.
  - `<input_directory>`: the directory containing your new `.txt` files.
  - `<output_directory>`: the directory where the output `.dis` and `.rs3` files will be saved.

Example: to parse files from `data/pcc/test_input` using the model in `data/pcc` and save the parsed results to `data/pcc/test_output`, run:

```shell
bash predict_with_custom_model.sh data/pcc data/pcc/test_input data/pcc/test_output
```
This repository is a fork of the DPLP parser that adapts the code to German by adding new code and other resources. A ready-to-use model has been trained on the Potsdam University RST corpus. The pretrained model can be found in `./data/de` along with its accompanying files. The source code consists of several Python scripts in the root directory, prefixed with `ger`.
The original DPLP parser was written in Python 2, which is discontinued, and depends on libraries that can no longer be installed via package managers. Therefore, a prebuilt Docker image (mohamadisara20/dplp-env) with all dependencies preinstalled has been created and shared on Docker Hub. Both the German and the original English DPLP parsers can be executed within this image. The image can be found here: https://hub.docker.com/repository/docker/mohamadisara20/dplp-env/general

There are two tags for this Docker image:
- latest: the default tag suitable for running the English DPLP parser.
- ger: the tag created for running German DPLP. It has some language-specific libraries and data files pre-installed.
The German RST parser's main script is `ger_predict_dis_from_txt.py` (it generates RST trees in both `.dis` and `.rs3` formats). All preprocessing, segmentation, and parsing are included, so the script simply takes text files as input and generates RST trees as its final output. You can follow these steps to parse a batch of text files using this script in a Docker container:
1- In the terminal, change directory to the repo's root path:

```shell
cd path_to_rst_german
```

2- Copy the `.txt` input files into a single subdirectory in `./data` (say, `data/input`). Please don't choose any place outside the current directory.

3- Run the following command:

```shell
docker run -d -v $(pwd):/home/DPLP -w /home/DPLP mohamadisara20/dplp-env:ger python3 ger_predict_dis_from_txt.py data/input
```

Important: replace `data/input` with the relative path to your input text files. You don't need to change anything else for standard parsing. For customization with additional args, refer to the parser script's source code.
4- There are several ways to check the progress or debug errors:

- verify the new files created in the input path
- replace the `-d` option with `-it` to get all logs printed throughout the process
- use `docker logs` to see the logs generated by the Docker container

5- You will see new files created throughout the parsing process. When `.rs3` files are generated, the process is finished.
The script `ger_rest_api.py` creates a REST API server that allows you to perform RST parsing remotely or integrate it into a web application. It can be launched by running:

```shell
docker run -d -p 5000:5000 -v $(pwd):/home/DPLP -w /home/DPLP mohamadisara20/dplp-env python3 ger_rest_api.py
```

The REST API will then listen on port 5000 and accept JSON requests at the subpath `dplp`, returning the RST tree in the two formats `dis` and `rs3`. You can test it with this command:

```shell
curl -d '{"text":"Ich bin gut."}' -H 'Content-Type: application/json' http://127.0.0.1:5000/dplp
```

You will get a result like this:
```json
{
  "dis": "(Root (leaf 1) (rel2par None) (text _!lorem ipsum_!))\n",
  "rs3": "<rst>\n<header>\n <relations>\n <rel name=\"Antithesis\" type=\"rst\"/>\n <rel name=\"Background\" type=\"rst\"/>\n <rel name=\"Cause\" type=\"rst\"/>\n <rel name=\"Circumstance\" type=\"rst\"/>\n <rel name=\"Concession\" type=\"rst\"/>\n <rel name=\"Condition\" type=\"rst\"/>\n <rel name=\"Conjunction\" type=\"multinuc\"/>\n <rel name=\"Contrast\" type=\"multinuc\"/>\n <rel name=\"Disjunction\" type=\"multinuc\"/>\n <rel name=\"Elaboration\" type=\"rst\"/>\n <rel name=\"Enablement\" type=\"rst\"/>\n <rel name=\"Evaluation\" type=\"rst\"/>\n <rel name=\"Evidence\" type=\"rst\"/>\n <rel name=\"Interpretation\" type=\"rst\"/>\n <rel name=\"Joint\" type=\"multinuc\"/>\n <rel name=\"Justify\" type=\"rst\"/>\n <rel name=\"Motivation\" type=\"rst\"/>\n <rel name=\"Otherwise\" type=\"rst\"/>\n <rel name=\"Preparation\" type=\"rst\"/>\n <rel name=\"Purpose\" type=\"rst\"/>\n <rel name=\"Restatement\" type=\"rst\"/>\n <rel name=\"Result\" type=\"rst\"/>\n <rel name=\"Sequence\" type=\"multinuc\"/>\n <rel name=\"Solutionhood\" type=\"rst\"/>\n <rel name=\"Summary\" type=\"rst\"/>\n </relations>\n</header>\n<body>\n <segment id=\"2\">Ich bin gut .</segment>\n</body>\n</rst>",
  "dis_url": "rstout/5d74b5b2-18c6-4443-9d5f-b67ebfe947a3/document.dis",
  "rs3_url": "rstout/5d74b5b2-18c6-4443-9d5f-b67ebfe947a3/document.rs3",
  "uid": "5d74b5b2-18c6-4443-9d5f-b67ebfe947a3"
}
```
You can train your own parser using a corpus of RST trees. The German parser uses `.rs3` files for training. Training consists of these steps:

1- Divide the corpus into train, dev, and test sets and save them in three subdirectories `training/`, `dev/`, and `test/` in a base directory; for instance, if the base directory is `data/base_dir`, you will get these subdirectories: `data/base_dir/training`, `data/base_dir/dev`, and `data/base_dir/test`.

2- Review the content of the relation mapping file `parsing_eval_metrics/rel_mapping.json`. It should contain all relations used in the whole corpus (train, dev, and test sets); otherwise training can fail and show undefined behavior. (A small helper for checking this is sketched after this list.)

3- The script `ger_train_parser.py` triggers the training process on the base directory. You can use the following command to run the training code in the Docker container:

```shell
docker run -d -v $(pwd):/home/DPLP -w /home/DPLP mohamadisara20/dplp-env:ger python3 ger_train.py data/base_dir
```

4- The model will be saved as `model/model.pickle.gz` inside the base path (e.g., `data/de`).

5- The parser precision scores will be reported in the file named `results.txt` in the base directory.
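To see which relations your corpus actually uses, you can collect the relation names declared and referenced in the `.rs3` files and compare them against `parsing_eval_metrics/rel_mapping.json` by hand. The helper below is hypothetical (not part of the repository) and assumes the standard `.rs3` XML layout with `<rel name="..."/>` declarations and `relname` attributes, as in the API output above.

```python
import sys
from pathlib import Path
from xml.etree import ElementTree

# Hypothetical helper: list every relation name that appears in the .rs3
# files under a directory, so you can check coverage of rel_mapping.json.
def collect_relations(corpus_dir: str) -> set:
    names = set()
    for rs3_file in Path(corpus_dir).rglob("*.rs3"):
        root = ElementTree.parse(rs3_file).getroot()
        # relations declared in the <relations> header ...
        for rel in root.iter("rel"):
            names.add(rel.get("name"))
        # ... and relations actually used on segments/groups
        for node in root.iter():
            relname = node.get("relname")
            if relname:
                names.add(relname)
    return names

if __name__ == "__main__":
    for name in sorted(collect_relations(sys.argv[1])):
        print(name)
```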
Please read the following papers for more technical details:

- Ji, Y., & Eisenstein, J. (2014). Representation Learning for Text-level Discourse Parsing. ACL 2014.
- Joty, S., Carenini, G., & Ng, R. T. (2015). CODRA: A Novel Discriminative Framework for Rhetorical Analysis. Computational Linguistics.
- Shahmohammadi, S., & Stede, M. (2024). Discourse Parsing for German with new RST Corpora. Workshop Proceedings of the 20th Edition of the KONVENS Conference.