Support many encodings for text files. (#458)

### What problem does this PR solve?

#384
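
As a rough illustration of what handling many text-file encodings can look like, here is a minimal sketch that tries a list of candidate encodings in order and falls back to lossy UTF-8 decoding. The function name `decode_with_fallback` and the candidate list are hypothetical and are not taken from this PR's diff.

```python
from typing import Sequence, Tuple

# Hypothetical candidate list for illustration only; the encodings this PR
# actually handles are not shown in this excerpt.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5", "shift_jis")


def decode_with_fallback(
    data: bytes, encodings: Sequence[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next candidate.
            continue
    # No candidate worked: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8"
```

Trying encodings in a fixed order is simple but order-sensitive: a permissive encoding placed early can "succeed" on bytes that were written in a different one, so stricter encodings such as UTF-8 should come first.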

### Type of change

- [x] Performance Improvement
KevinHuSh
2024-04-19 18:02:53 +08:00
committed by GitHub
parent cda7b607cb
commit ed6081845a
19 changed files with 118 additions and 55 deletions

@@ -8,6 +8,7 @@ import re
import string
import sys
from hanziconv import HanziConv
from huggingface_hub import snapshot_download
from nltk import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from api.utils.file_utils import get_project_base_directory