Open Source Text Processing Project: segtok

segtok: sentence segmentation and word tokenization tools

Project Website: http://fnl.es/segtok-a-segmentation-and-tokenization-library.html

Github Link: https://github.com/fnl/segtok

Description

A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.

The segtok package provides two modules, segtok.segmenter and segtok.tokenizer. The segmenter provides functionality for splitting (Indo-European) text into sentences. The tokenizer provides functionality for splitting (Indo-European) sentences into words and symbols (collectively called tokens). Both modules can also be used from the command line. While other Indo-European languages might work, the tools have been designed only with languages such as Spanish, English, and German in mind.
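
For illustration, here is a minimal sketch of using the two modules from Python, assuming the split_single and word_tokenizer functions described in the project documentation (names may differ across versions):

from segtok.segmenter import split_single
from segtok.tokenizer import word_tokenizer

text = "Mr. Smith bought cheapsite.com for 1.5 million dollars. He paid a lot for it."

# Split the text into sentences, then split each sentence into tokens.
for sentence in split_single(text):
    print(word_tokenizer(sentence))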

To install this package, you should have the latest official version of Python 2 or 3 installed. The package has been reported to work with Python 2.7, 3.3, and 3.4 and is tested against the latest Python 2 and 3 branches. The easiest way to install it is with pip or any other package manager that works with PyPI:

pip install segtok

Important: If you are on a Linux machine and have problems installing the regex dependency of segtok, make sure you have the python-dev and/or python3-dev packages installed so the necessary headers are available to compile the package.
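
On Debian- or Ubuntu-based systems, for example, those headers can usually be added via the system package manager (package names may differ on other distributions):

sudo apt-get install python-dev python3-dev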

Then try the command-line tools on some plain-text files (e.g., this README) to see if segtok meets your needs:

segmenter README.rst | tokenizer

