tickets.csv can be transferred into the corpus
This commit is contained in:
parent fff1e5d0fd
commit 26c0f37ec8
|
@ -0,0 +1,169 @@
|
||||||
|
"TicketNumber";"Subject";"CreatedDate";"categoryName";"Impact";"Urgency";"BenutzerID";"VerantwortlicherID";"EigentuemerID";"Description";"Solution"
|
||||||
|
"INC20357";"schulungstest";"21.07.2015 08:19:34";"ZHB";"2 - Mittel (Abt./Bereich)";"B - Normal";"aa8315f5-52c3-e411-80c7-0050569c58f5";"";"aa8315f5-52c3-e411-80c7-0050569c58f5";"kevin arbeite gefälligst :)";""
|
||||||
|
"INC40481";"Telephone Contract";"13.08.2015 14:18:57";"Neuanschluss";"2 - Mittel (Abt./Bereich)";"B - Normal";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"Telefon-Neuanschluss
|
||||||
|
Antragsteller:
|
||||||
|
Melanie Hinrichs
|
||||||
|
melanie.hinrichs@tu-dortmund.de
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Terminvorschlag unbestimmt
|
||||||
|
Einrichtung Dezernat 3
|
||||||
|
Abteilung Abteilung 2
|
||||||
|
PSP Element L-11-10000-100-302300
|
||||||
|
UniAccount myvowest(Westerdorf, Yvonne)
|
||||||
|
Gebäude Pavillon 8
|
||||||
|
Raum ID 031 (63292)
|
||||||
|
Telefondose keine vorhanden
|
||||||
|
Telefonnr. -
|
||||||
|
Eintrag Telefonbuch
|
||||||
|
E-Mail melanie.hinrichs@tu-dortmund.de
|
||||||
|
Voicemail Nicht erwünscht
|
||||||
|
Ansprechpartner Melanie Hinrichs
|
||||||
|
Tel. Ansprechpartner 5848
|
||||||
|
Verantwortlicher Nutzer -
|
||||||
|
Type Amt
|
||||||
|
Bemerkung:
|
||||||
|
Es wird ein Telefon benötigt,ein Telefon mit 6 Speicherpl.f.die Gruppenfunktion ist ausreichend. Die Möbel werden am 10.06.2015 aufgestellt.Weder Netzwerkdose noch Telefondose vorhanden. Dez.6 hat Vorbereitungen getroffen.";"Frau Hinrichs überdenkt die Situation und macht dann neue Anträge.
|
||||||
|
Dieses Ticket wird geschlossen"
|
||||||
|
"INC40483";"Telephone Contract";"13.08.2015 14:22:06";"Neuanschluss";"2 - Mittel (Abt./Bereich)";"B - Normal";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"Telefon-Neuanschluss
|
||||||
|
Antragsteller:
|
||||||
|
Anja Kulmsee
|
||||||
|
anja.kulmsee@tu-dortmund.de
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Terminvorschlag 03.08.2015
|
||||||
|
Einrichtung Fk06 Dekanat
|
||||||
|
Abteilung Bereich Studium und Lehre
|
||||||
|
PSP Element L-11-10000-100-060011
|
||||||
|
UniAccount manjkulm(Kulmsee, Anja)
|
||||||
|
Gebäude CT Geschossbau 2
|
||||||
|
Raum ID G2-3.22 (64882)
|
||||||
|
Telefondose
|
||||||
|
Telefonnr. -
|
||||||
|
Eintrag Telefonbuch
|
||||||
|
E-Mail anja.kulmsee@tu-dortmund.de
|
||||||
|
Voicemail Nicht erwünscht
|
||||||
|
Ansprechpartner Anja Kulmsee
|
||||||
|
Tel. Ansprechpartner 6179, 7370, 7179
|
||||||
|
Verantwortlicher Nutzer -
|
||||||
|
Type Amt
|
||||||
|
Bemerkung:
|
||||||
|
Der Anschluß ist für ein Faxgerät. Wenn möglich hätte ich gern die Rufnummer 3033.";"Faxnummer 3166 wurde unter die Telefonnummer 7179 im elektronischen Telefonbuch eingetragen"
|
||||||
|
"INC40484";"Defekte Netzwerkdose / Frage zu VPN";"13.08.2015 14:25:50";"LAN";"2 - Mittel (Abt./Bereich)";"B - Normal";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"Sehr geehrtes ITMC Service Team,
|
||||||
|
|
||||||
|
seit ein einiger Zeit scheint der Netzwerkanschluss eines Kollegen an das Intranet der BMP mit der Dosennummer G1 303/04/12.05 (G1 4 26-1) in Raum G1-426 nicht mehr zu funktionieren.
|
||||||
|
Ich würde Sie daher bitten diese Mail an den zuständigen Kollegen weiterzuleiten, um die Leitung vielleicht einmal zu Prüfen.
|
||||||
|
|
||||||
|
Des Weiteren hätte ich noch eine Frage bezüglich der Möglichkeit zur Nutzung einer VPN Verbindung aus unserem Intranet heraus zu einem fremden Netzwerk. Dies ist zwar über das WLAN-Netz möglich, jedoch nicht aus unserem Netzwerk heraus. Vielleicht können Sie mir mitteilen an welchen Kollegen ich mich bezüglich dieses Problem wenden kann.
|
||||||
|
|
||||||
|
Bei Rückfragen stehe ich gerne zur Verfügung!
|
||||||
|
|
||||||
|
Beste Grüße,
|
||||||
|
|
||||||
|
Nicolas Rauner
|
||||||
|
|
||||||
|
LS Biomaterialien und Polymerwissenschaften
|
||||||
|
Fakultät Bio- und Chemieingenieurwesen
|
||||||
|
TU Dortmund
|
||||||
|
D-44227 Dortmund
|
||||||
|
|
||||||
|
Tel: + 49-(0)231 / 755 - 3015
|
||||||
|
Fax: + 49-(0)231 / 755 - 2480
|
||||||
|
|
||||||
|
www.ls-bmp.de <http://www.ls-bmp.de/>";"Hallo Herr Rauner,
|
||||||
|
die Netzwerkdose weist z. Z. keine Verbindungsprobleme auf. Falls doch welche bestehen, melden Sie sich bitte bei uns.
|
||||||
|
|
||||||
|
Mit freunldichen Grüßen
|
||||||
|
Aicha Oikrim"
|
||||||
|
"INC40487";"(SSO) Login via Browser mit Zertifikat";"13.08.2015 14:54:57";"Betrieb";"2 - Mittel (Abt./Bereich)";"B - Normal";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"Lieber Support,
|
||||||
|
ich habe gerade versucht mich mit meiner Unicard im Firefox-Browser für das
|
||||||
|
Service-Portal zu authentifizieren. Das hat vor einigen Wochen noch tadelos
|
||||||
|
geklappt und mittlerweile bekomme ich folgende Fehlermeldung:
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Ich hoffe Sie können mir weiterhelfen.
|
||||||
|
|
||||||
|
Vielen Dank und viele Grüße
|
||||||
|
Sascha Feldhorst
|
||||||
|
|
||||||
|
Dipl.-Inform.
|
||||||
|
Sascha Feldhorst
|
||||||
|
Wiss.-Ang.
|
||||||
|
|
||||||
|
Technische Universität Dortmund
|
||||||
|
Maschinenbau/Lehrstuhl für Förder- und Lagerwesen
|
||||||
|
LogistikCampus
|
||||||
|
Joseph-von-Fraunhofer-Str. 2-4
|
||||||
|
D-44227 Dortmund
|
||||||
|
|
||||||
|
Tel.: +49 231-755 40 73
|
||||||
|
Fax: +49 231-755 47 68
|
||||||
|
<mailto:sascha.feldhorst@tu-dortmund.de> sascha.feldhorst@tu-dortmund.de
|
||||||
|
<http://www.flw.mb.tu-dortmund.de/> www.flw.mb.tu-dortmund.de
|
||||||
|
|
||||||
|
Wichtiger Hinweis: Die Information in dieser E-Mail ist vertraulich. Sie ist
|
||||||
|
ausschließlich für den Adressaten bestimmt. Sollten Sie nicht der für diese
|
||||||
|
E-Mail bestimmte Adressat sein, unterrichten Sie bitte den Absender und
|
||||||
|
vernichten Sie diese Mail. Vielen Dank. Unbeschadet der Korrespondenz per
|
||||||
|
E-Mail, sind unsere Erklärungen ausschließlich final rechtsverbindlich, wenn
|
||||||
|
sie in herkömmlicher Schriftform (mit eigenhändiger Unterschrift) oder durch
|
||||||
|
Übermittlung eines solchen Schriftstücks per Telefax erfolgen.
|
||||||
|
|
||||||
|
Important note: The information included in this e-mail is confidential. It
|
||||||
|
is solely intended for the recipient. If you are not the intended recipient
|
||||||
|
of this e-mail please contact the sender and delete this message. Thank you.
|
||||||
|
Without prejudice of e-mail correspondence, our statements are only legally
|
||||||
|
binding when they are made in the conventional written form (with personal
|
||||||
|
signature) or when such documents are sent by fax.";"der Login via Zertifikat am SSO-Dienst mittels Firefox und UniCard sollte funktionieren.
|
||||||
|
Eventuell wurden durch ein Browserupdate die Einstellungen gelöscht. Bitte prüfen Sie ob die CA-Zertifikate installiert sind:
|
||||||
|
https://pki.pca.dfn.de/tu-dortmund-chipcard-ca/cgi-bin/pub/pki?cmd=getStaticPage;name=index;id=2&RA_ID=0 ""https://pki.pca.dfn.de/tu-dortmund-chipcard-ca/cgi-bin/pub/pki?cmd=getStaticPage;name=index;id=2&RA_ID=0""
|
||||||
|
und ob das Kryptographie Modul im Firefox hinterlegt ist:
|
||||||
|
https://service.tu-dortmund.de/group/intra/authentifizierung"
|
||||||
|
"INC40489";"Telephone Contract";"13.08.2015 14:57:23";"Elektronisches Telefonbuch";"2 - Mittel (Abt./Bereich)";"B - Normal";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"Telefon-Umzug
|
||||||
|
Antragsteller:
|
||||||
|
Astrid Gramm
|
||||||
|
astrid.gramm@tu-dortmund.de
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Terminvorschlag 14.08.2015
|
||||||
|
Einrichtung Dezernat 2
|
||||||
|
Abteilung 2.5
|
||||||
|
PSP Element
|
||||||
|
UniAccount mnichofm(Hofmann, Nicole)
|
||||||
|
Gebäude Dezernat 5
|
||||||
|
Raum ID 201 (651430)
|
||||||
|
Telefondose Neztwerkdose: DT04.5/04.6
|
||||||
|
Telefonnr. 4821
|
||||||
|
Eintrag Telefonbuch
|
||||||
|
E-Mail astrid.gramm@tu-dortmund.de
|
||||||
|
Voicemail
|
||||||
|
Ansprechpartner Astrid Gramm
|
||||||
|
Tel. Ansprechpartner 5444
|
||||||
|
Verantwortlicher Nutzer
|
||||||
|
Type
|
||||||
|
Bemerkung:
|
||||||
|
Frau Hofmann wird am 14.08.2015 in die WD 2 umziehen. Es ist der Raum 201a im OG (nicht 201)
|
||||||
|
Eine Bezeichnung der Telefondose ist nicht vorhanden.";"erledigt"
|
||||||
|
"INC40488";"Laptop macht komische Geräusche";"13.08.2015 14:56:24";"Verwaltung";"2 - Mittel (Abt./Bereich)";"B - Normal";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"9668e0af-7202-e711-0781-005056b025d0";"Hallo,
|
||||||
|
mein Laptop macht seit eben komische Geräusche.
|
||||||
|
Bitte um Klärung.
|
||||||
|
Jan Hustadt
|
||||||
|
(0231) 755-7248
|
||||||
|
WD2, R. 112
|
||||||
|
Dezernat 2 Hochschulentwicklung
|
||||||
|
Abteilung 2.3 Organisationsentwicklung
|
||||||
|
E-Mail: jan.hustadt@tu-dortmund.de";"Herr Alexev Swetlomier (HIWI) küümert sich bereits um das Laptop und Frau Herbst weiß auch Bescheid die zur Zeit im Urlaub ist"
|
336
test.py
|
@@ -1,4 +1,8 @@
 # -*- coding: utf-8 -*-
+import time
+start = time.time()
+
+
 import csv
 import functools
 import os.path
||||||
|
@ -11,23 +15,65 @@ import spacy
|
||||||
import textacy
|
import textacy
|
||||||
from scipy import *
|
from scipy import *
|
||||||
from textacy import Vectorizer
|
from textacy import Vectorizer
|
||||||
|
import warnings
|
||||||
csv.field_size_limit(sys.maxsize)
|
csv.field_size_limit(sys.maxsize)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
path2xml = "ticket.xml"
|
# Load the configuration file
|
||||||
import de_core_news_md
|
import configparser as ConfigParser
|
||||||
|
config = ConfigParser.ConfigParser()
|
||||||
|
with open("config.ini") as f:
|
||||||
|
config.read_file(f)
|
||||||
|
|
||||||
|
|
||||||
PARSER = de_core_news_md.load()
|
|
||||||
corpus = textacy.Corpus(PARSER)
|
path2xml = config.get("default","path2xml")
|
||||||
thesauruspath = "openthesaurus.csv"
|
thesauruspath = config.get("default","thesauruspath")
|
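Note on the configuration change above: the new code expects a `config.ini` with a `[default]` section holding `path2xml` and `thesauruspath`. The file itself is not part of this diff, so the following is only a minimal sketch under that assumption; it reuses the two values that were hard-coded before this commit and loads them from an inline string instead of a file.

```python
import configparser

# Assumed contents of config.ini; the real file is not shown in this commit.
EXAMPLE_INI = """
[default]
path2xml = ticket.xml
thesauruspath = openthesaurus.csv
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)  # read_file(f) works the same way on an open file

# "default" (lower case) is an ordinary section here, distinct from the
# special, case-sensitive DEFAULT section of configparser.
path2xml = config.get("default", "path2xml")
thesauruspath = config.get("default", "thesauruspath")
print(path2xml, thesauruspath)
```

If a key is missing, `config.get` raises `configparser.NoOptionError`, which is probably the desired failure mode for a required path.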
||||||
|
|
||||||
|
|
||||||
|
DE_PARSER = spacy.load("de")
|
||||||
|
|
||||||
|
de_stop_words=list(__import__("spacy." + DE_PARSER.lang, globals(), locals(), ['object']).STOP_WORDS)
|
||||||
|
|
||||||
|
|
||||||
|
corpus = textacy.Corpus(DE_PARSER)
|
||||||
|
|
||||||
THESAURUS = list(textacy.fileio.read_csv(thesauruspath, delimiter=";"))
|
THESAURUS = list(textacy.fileio.read_csv(thesauruspath, delimiter=";"))
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
############# misc
|
||||||
|
def compose(*functions):
|
||||||
|
def compose2(f, g):
|
||||||
|
return lambda x: f(g(x))
|
||||||
|
return functools.reduce(compose2, functions, lambda x: x)
|
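Note on `compose`: the helper added above composes right to left, so the last function passed is applied first. A small self-contained sketch of that behaviour follows; the example functions `str.strip` and `str.lower` are chosen only for illustration and do not appear in the commit.

```python
import functools


def compose(*functions):
    """Compose functions right to left: compose(f, g)(x) == f(g(x))."""
    def compose2(f, g):
        return lambda x: f(g(x))
    # The identity function is the initial value, so compose() with no
    # arguments returns its input unchanged.
    return functools.reduce(compose2, functions, lambda x: x)


# str.strip runs first, then str.lower.
strip_then_lower = compose(str.lower, str.strip)
result = strip_then_lower("  Telefon-Neuanschluss  ")
print(result)
```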
||||||
|
|
||||||
|
def get_calling_function():
|
||||||
|
"""finds the calling function in many decent cases.
|
||||||
|
https://stackoverflow.com/questions/39078467/python-how-to-get-the-calling-function-not-just-its-name
|
||||||
|
"""
|
||||||
|
fr = sys._getframe(1) # inspect.stack()[1][0]
|
||||||
|
co = fr.f_code
|
||||||
|
for get in (
|
||||||
|
lambda:fr.f_globals[co.co_name],
|
||||||
|
lambda:getattr(fr.f_locals['self'], co.co_name),
|
||||||
|
lambda:getattr(fr.f_locals['cls'], co.co_name),
|
||||||
|
lambda:fr.f_back.f_locals[co.co_name], # nested
|
||||||
|
lambda:fr.f_back.f_locals['func'], # decorators
|
||||||
|
lambda:fr.f_back.f_locals['meth'],
|
||||||
|
lambda:fr.f_back.f_locals['f'],
|
||||||
|
):
|
||||||
|
try:
|
||||||
|
func = get()
|
||||||
|
except (KeyError, AttributeError):
|
||||||
|
pass
|
||||||
|
else:
|
||||||
|
if func.__code__ == co:
|
||||||
|
return func
|
||||||
|
raise AttributeError("func not found")
|
||||||
|
|
||||||
def printRandomDoc(textacyCorpus):
|
def printRandomDoc(textacyCorpus):
|
||||||
import random
|
import random
|
||||||
print()
|
print()
|
||||||
|
@ -40,6 +86,9 @@ def printRandomDoc(textacyCorpus):
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
############# on xml
|
||||||
def generateMainTextfromTicketXML(path2xml, main_textfield='Beschreibung'):
|
def generateMainTextfromTicketXML(path2xml, main_textfield='Beschreibung'):
|
||||||
"""
|
"""
|
||||||
generates strings from XML
|
generates strings from XML
|
||||||
|
@ -55,7 +104,6 @@ def generateMainTextfromTicketXML(path2xml, main_textfield='Beschreibung'):
|
||||||
for field in ticket:
|
for field in ticket:
|
||||||
if field.tag == main_textfield:
|
if field.tag == main_textfield:
|
||||||
yield field.text
|
yield field.text
|
||||||
|
|
||||||
def generateMetadatafromTicketXML(path2xml, leave_out=['Beschreibung']):
|
def generateMetadatafromTicketXML(path2xml, leave_out=['Beschreibung']):
|
||||||
tree = ET.parse(path2xml, ET.XMLParser(encoding="utf-8"))
|
tree = ET.parse(path2xml, ET.XMLParser(encoding="utf-8"))
|
||||||
root = tree.getroot()
|
root = tree.getroot()
|
||||||
|
@ -71,53 +119,161 @@ def generateMetadatafromTicketXML(path2xml, leave_out=['Beschreibung']):
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
############# on csv
|
||||||
|
|
||||||
def processTextstream(textstream, funclist, parser=PARSER):
|
def csv_to_contentStream(path2csv: str, content_collumn_name: str):
|
||||||
# input:str-stream output:str-stream
|
"""
|
||||||
pipe = parser.pipe(textstream)
|
:param path2csv: string
|
||||||
|
:param content_collumn_name: string
|
||||||
|
:return: string-generator
|
||||||
|
"""
|
||||||
|
stream = textacy.fileio.read_csv(path2csv, delimiter=";") # ,encoding='utf8')
|
||||||
|
content_collumn = 0 # default value
|
||||||
|
|
||||||
for doc in pipe:
|
for i,lst in enumerate(stream):
|
||||||
tokens = [tok for tok in doc]
|
if i == 0:
|
||||||
|
# look for desired column
|
||||||
|
for j,col in enumerate(lst):
|
||||||
|
if col == content_collumn_name:
|
||||||
|
content_collumn = j
|
||||||
|
else:
|
||||||
|
yield lst[content_collumn]
|
||||||
|
def csv_to_metaStream(path2csv: str, metalist: [str]):
|
||||||
|
"""
|
||||||
|
:param path2csv: string
|
||||||
|
:param metalist: list of strings
|
||||||
|
:return: dict-generator
|
||||||
|
"""
|
||||||
|
stream = textacy.fileio.read_csv(path2csv, delimiter=";") # ,encoding='utf8')
|
||||||
|
|
||||||
|
content_collumn = 0 # default value
|
||||||
|
metaindices = []
|
||||||
|
metadata_temp = {}
|
||||||
|
for i,lst in enumerate(stream):
|
||||||
|
if i == 0:
|
||||||
|
for j,col in enumerate(lst): # could surely be done more efficiently, but it only runs once
|
||||||
|
if col in metalist:
|
||||||
|
metadata_temp[col] = j # map each requested column name to its own index, e.g. {'Subject' : 1, 'categoryName' : 3, 'Solution' : 10}
|
||||||
|
|
||||||
|
else:
|
||||||
|
metadata = metadata_temp.copy()
|
||||||
|
for key,value in metadata.items():
|
||||||
|
metadata[key] = lst[value]
|
||||||
|
yield metadata
|
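Note on the two CSV generators above: both locate columns by scanning the header row by hand. As a hedged alternative sketch, the standard library's `csv.DictReader` resolves columns by name directly and handles quoted multi-line fields such as the ticket descriptions. The function name `iter_metadata` and the sample row are hypothetical and made up for illustration; they are not taken from the export.

```python
import csv
import io


def iter_metadata(csv_file, wanted):
    """Yield one dict per data row, restricted to the wanted column names."""
    reader = csv.DictReader(csv_file, delimiter=";", quotechar='"')
    for row in reader:
        # Look each column up by name, so the order of `wanted` does not matter.
        yield {key: row[key] for key in wanted if key in row}


# Made-up sample with a quoted field that spans two lines.
sample = io.StringIO(
    '"TicketNumber";"Subject";"categoryName";"Description";"Solution"\n'
    '"INC00001";"Example subject";"LAN";"first line\nsecond line";"done"\n'
)

rows = list(iter_metadata(sample, ["Solution", "Subject", "categoryName"]))
print(rows)
```

When reading from disk, opening the file with `newline=""` is what the `csv` module documentation recommends so that embedded line breaks are preserved.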
||||||
|
|
||||||
|
|
||||||
|
############# on str-gen
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def processTokens(tokens, funclist, parser):
|
||||||
|
# in:tokenlist, funclist
|
||||||
|
# out: tokenlist
|
||||||
for f in funclist:
|
for f in funclist:
|
||||||
if 'bool' in str(f.__annotations__):
|
if 'bool' in str(f.__annotations__):
|
||||||
tokens = list(filter(f, tokens))
|
tokens = list(filter(f, tokens))
|
||||||
|
|
||||||
elif 'str' in str(f.__annotations__):
|
elif 'str' in str(f.__annotations__):
|
||||||
x=0
|
tokens = list(map(f, tokens)) # plain text
|
||||||
tokens = list(map(f, tokens))
|
|
||||||
#tokens = [f(tok.lower_) for tok in tokens] #plain text
|
|
||||||
doc = parser(" ".join(tokens)) # parsed
|
doc = parser(" ".join(tokens)) # parsed
|
||||||
tokens = [tok for tok in doc] # tokens only
|
tokens = [tok for tok in doc] # tokens only
|
||||||
|
|
||||||
elif 'spacy.tokens.Doc' in str(f.__annotations__):
|
elif 'spacy.tokens.doc.Doc' in str(f.__annotations__):
|
||||||
tokens = [tok for tok in f(tokens)]
|
toks = f(tokens)
|
||||||
|
tokens = [tok for tok in toks]
|
||||||
|
|
||||||
|
else:
|
||||||
|
warnings.warn("Unknown Annotation while preprocessing. Function: {0}".format(str(f)))
|
||||||
|
|
||||||
|
return tokens
|
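Note on `processTokens`: the function above decides how to apply each step by inspecting the return annotation that the factory copied onto the lambda (`bool` steps filter, `str` steps map). The sketch below shows the same dispatch idea on plain strings, without spaCy. The names `keep_longer_than`, `lowercase` and `process_tokens` are hypothetical, and comparing the annotation object directly is a substitution for the string matching used in the commit.

```python
import warnings


def keep_longer_than(n) -> bool:
    """Factory for a filter step; the lambda is tagged as returning bool."""
    pred = lambda tok: len(tok) > n
    pred.__annotations__ = {"return": bool}
    return pred


def lowercase() -> str:
    """Factory for a mapping step; the lambda is tagged as returning str."""
    fn = lambda tok: tok.lower()
    fn.__annotations__ = {"return": str}
    return fn


def process_tokens(tokens, funclist):
    """Apply each step as a filter or a map, depending on its annotation."""
    for f in funclist:
        kind = f.__annotations__.get("return")
        if kind is bool:
            tokens = [t for t in tokens if f(t)]
        elif kind is str:
            tokens = [f(t) for t in tokens]
        else:
            warnings.warn("Unknown annotation while preprocessing: {0}".format(f))
    return tokens


result = process_tokens(
    ["Der", "Laptop", "macht", "Geräusche"],
    [keep_longer_than(3), lowercase()],
)
print(result)
```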
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
############# return docs
|
||||||
|
|
||||||
|
def keepUniqueTokens() -> spacy.tokens.Doc:
|
||||||
|
#todo in:tok out:doc
|
||||||
|
ret = lambda doc: (set([tok.lower_ for tok in doc]))
|
||||||
|
|
||||||
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
|
return ret
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def processTextstream(textstream, funclist, parser=DE_PARSER):
|
||||||
|
"""
|
||||||
|
:param textstream: string-gen
|
||||||
|
:param funclist: [func]
|
||||||
|
:param parser: spacy-parser
|
||||||
|
:return: string-gen
|
||||||
|
"""
|
||||||
|
# input:str-stream output:str-stream
|
||||||
|
pipe = parser.pipe(textstream)
|
||||||
|
|
||||||
|
for doc in pipe:
|
||||||
|
tokens = [tok for tok in doc]
|
||||||
|
tokens = processTokens(tokens,funclist,parser)
|
||||||
yield " ".join([tok.lower_ for tok in tokens])
|
yield " ".join([tok.lower_ for tok in tokens])
|
||||||
|
|
||||||
def processDictstream(dictstream, funcdict, parser=PARSER): #todo same as with textstream; idea: processDoc(doc,funcs)
|
def processDictstream(dictstream, funcdict, parser=DE_PARSER):
|
||||||
|
"""
|
||||||
|
|
||||||
|
:param dictstream: dict-gen
|
||||||
|
:param funcdict:
|
||||||
|
clean_in_meta = {
|
||||||
|
"Solution":funclist,
|
||||||
|
...
|
||||||
|
}
|
||||||
|
|
||||||
|
:param parser: spacy-parser
|
||||||
|
:return: dict-gen
|
||||||
|
"""
|
||||||
for dic in dictstream:
|
for dic in dictstream:
|
||||||
result = {}
|
result = {}
|
||||||
for key, value in dic.items():
|
for key, value in dic.items():
|
||||||
|
|
||||||
if key in funcdict:
|
if key in funcdict:
|
||||||
result[key] = funcdict[key](parser(value))
|
|
||||||
|
doc = parser(value)
|
||||||
|
tokens = [tok for tok in doc]
|
||||||
|
funclist = funcdict[key]
|
||||||
|
|
||||||
|
tokens = processTokens(tokens,funclist,parser)
|
||||||
|
|
||||||
|
|
||||||
|
result[key] = " ".join([tok.lower_ for tok in tokens])
|
||||||
|
|
||||||
|
|
||||||
else:
|
else:
|
||||||
result[key] = value
|
result[key] = value
|
||||||
yield result
|
yield result
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
############# return tokens
|
||||||
|
|
||||||
def keepPOS(pos_list) -> bool:
|
def keepPOS(pos_list) -> bool:
|
||||||
ret = lambda tok : tok.pos_ in pos_list
|
ret = lambda tok : tok.pos_ in pos_list
|
||||||
|
|
||||||
ret.__annotations__ = keepPOS.__annotations__
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
return ret
|
return ret
|
||||||
|
|
||||||
def removePOS(pos_list)-> bool:
|
def removePOS(pos_list)-> bool:
|
||||||
ret = lambda tok : tok.pos_ not in pos_list
|
ret = lambda tok : tok.pos_ not in pos_list
|
||||||
|
|
||||||
ret.__annotations__ = removePOS.__annotations__
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
return ret
|
return ret
|
||||||
|
|
||||||
def removeWords(words, keep=None)-> bool:
|
def removeWords(words, keep=None)-> bool:
|
||||||
|
@ -131,86 +287,32 @@ def removeWords(words, keep=None)-> bool:
|
||||||
|
|
||||||
ret = lambda tok : tok.lower_ not in words
|
ret = lambda tok : tok.lower_ not in words
|
||||||
|
|
||||||
ret.__annotations__ = removeWords.__annotations__
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
return ret
|
return ret
|
||||||
|
|
||||||
def keepENT(ent_list) -> bool:
|
def keepENT(ent_list) -> bool:
|
||||||
ret = lambda tok : tok.ent_type_ in ent_list
|
ret = lambda tok : tok.ent_type_ in ent_list
|
||||||
|
|
||||||
ret.__annotations__ = keepENT.__annotations__
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
return ret
|
return ret
|
||||||
|
|
||||||
def removeENT(ent_list) -> bool:
|
def removeENT(ent_list) -> bool:
|
||||||
ret = lambda tok: tok.ent_type_ not in ent_list
|
ret = lambda tok: tok.ent_type_ not in ent_list
|
||||||
|
|
||||||
ret.__annotations__ = removeENT.__annotations__
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
return ret
|
return ret
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def keepUniqueTokens() -> spacy.tokens.Doc:
|
|
||||||
ret = lambda doc: (set([tok.lower_ for tok in doc]))
|
|
||||||
|
|
||||||
ret.__annotations__ = keepUniqueTokens.__annotations__
|
|
||||||
return ret
|
|
||||||
|
|
||||||
|
|
||||||
def lemmatize() -> str:
|
def lemmatize() -> str:
|
||||||
ret = lambda tok: tok.lemma_
|
ret = lambda tok: tok.lemma_
|
||||||
|
|
||||||
ret.__annotations__ = lemmatize.__annotations__
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
return ret
|
return ret
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
mentionFinder = re.compile(r"@[a-z0-9_]{1,15}", re.IGNORECASE)
|
|
||||||
emailFinder = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)
|
|
||||||
urlFinder = re.compile(r"^(?:https?:\/\/)?(?:www\.)?[a-zA-Z0-9./]+$", re.IGNORECASE)
|
|
||||||
|
|
||||||
def replaceEmails(replace_with="EMAIL") -> str:
|
|
||||||
ret = lambda tok : emailFinder.sub(replace_with, tok.lower_)
|
|
||||||
|
|
||||||
ret.__annotations__ = replaceEmails.__annotations__
|
|
||||||
return ret
|
|
||||||
|
|
||||||
def replaceURLs(replace_with="URL") -> str:
|
|
||||||
ret = lambda tok: textacy.preprocess.replace_urls(tok.lower_,replace_with=replace_with)
|
|
||||||
#ret = lambda tok: urlFinder.sub(replace_with,tok.lower_)
|
|
||||||
|
|
||||||
ret.__annotations__ = replaceURLs.__annotations__
|
|
||||||
return ret
|
|
||||||
|
|
||||||
def replaceTwitterMentions(replace_with="TWITTER_MENTION") -> str:
|
|
||||||
ret = lambda tok : mentionFinder.sub(replace_with,tok.lower_)
|
|
||||||
|
|
||||||
ret.__annotations__ = replaceTwitterMentions.__annotations__
|
|
||||||
return ret
|
|
||||||
|
|
||||||
def replaceNumbers(replace_with="NUMBER") -> str:
|
|
||||||
ret = lambda tok: textacy.preprocess.replace_numbers(tok.lower_, replace_with=replace_with)
|
|
||||||
|
|
||||||
ret.__annotations__ = replaceNumbers.__annotations__
|
|
||||||
return ret
|
|
||||||
|
|
||||||
def replacePhonenumbers(replace_with="PHONENUMBER",parser=PARSER):
|
|
||||||
ret = lambda tok: textacy.preprocess.replace_phone_numbers(tok.lower_, replace_with=replace_with)
|
|
||||||
|
|
||||||
ret.__annotations__ = replacePhonenumbers.__annotations__
|
|
||||||
return ret
|
|
||||||
|
|
||||||
|
|
||||||
def resolveAbbreviations():
|
|
||||||
pass #todo
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def normalizeSynonyms(default_return_first_Syn=False) -> str:
|
def normalizeSynonyms(default_return_first_Syn=False) -> str:
|
||||||
ret = lambda tok : getFirstSynonym(tok.lower_, default_return_first_Syn=default_return_first_Syn)
|
ret = lambda tok : getFirstSynonym(tok.lower_, default_return_first_Syn=default_return_first_Syn)
|
||||||
|
|
||||||
ret.__annotations__ = normalizeSynonyms.__annotations__
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
return ret
|
return ret
|
||||||
|
|
||||||
def getFirstSynonym(word, thesaurus=THESAURUS, default_return_first_Syn=False):
|
def getFirstSynonym(word, thesaurus=THESAURUS, default_return_first_Syn=False):
|
||||||
|
@ -251,20 +353,71 @@ def getHauptform(syn_block, word, default_return_first_Syn=False):
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
stop_words=list(__import__("spacy." + PARSER.lang, globals(), locals(), ['object']).STOP_WORDS)
|
############# return strings
|
||||||
|
|
||||||
|
mentionFinder = re.compile(r"@[a-z0-9_]{1,15}", re.IGNORECASE)
|
||||||
|
emailFinder = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)
|
||||||
|
urlFinder = re.compile(r"^(?:https?:\/\/)?(?:www\.)?[a-zA-Z0-9./]+$", re.IGNORECASE)
|
||||||
|
|
||||||
|
def replaceEmails(replace_with="EMAIL") -> str:
|
||||||
|
ret = lambda tok : emailFinder.sub(replace_with, tok.lower_)
|
||||||
|
|
||||||
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
|
return ret
|
||||||
|
|
||||||
|
def replaceURLs(replace_with="URL") -> str:
|
||||||
|
ret = lambda tok: textacy.preprocess.replace_urls(tok.lower_,replace_with=replace_with)
|
||||||
|
#ret = lambda tok: urlFinder.sub(replace_with,tok.lower_)
|
||||||
|
|
||||||
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
|
return ret
|
||||||
|
|
||||||
|
def replaceTwitterMentions(replace_with="TWITTER_MENTION") -> str:
|
||||||
|
ret = lambda tok : mentionFinder.sub(replace_with,tok.lower_)
|
||||||
|
|
||||||
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
|
return ret
|
||||||
|
|
||||||
|
def replaceNumbers(replace_with="NUMBER") -> str:
|
||||||
|
ret = lambda tok: textacy.preprocess.replace_numbers(tok.lower_, replace_with=replace_with)
|
||||||
|
|
||||||
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
|
return ret
|
||||||
|
|
||||||
|
def replacePhonenumbers(replace_with="PHONENUMBER") -> str:
|
||||||
|
ret = lambda tok: textacy.preprocess.replace_phone_numbers(tok.lower_, replace_with=replace_with)
|
||||||
|
|
||||||
|
ret.__annotations__ = get_calling_function().__annotations__
|
||||||
|
return ret
|
||||||
|
|
||||||
|
|
||||||
|
def resolveAbbreviations():
|
||||||
|
pass #todo
|
||||||
|
|
||||||
|
|
||||||
|
metaliste = [
|
||||||
|
"Subject",
|
||||||
|
"categoryName",
|
||||||
|
"Solution"
|
||||||
|
]
|
||||||
|
path2csv = "M42-Export/Tickets_small.csv"
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
clean_in_meta = {
|
||||||
|
"Solution":[removePOS(["SPACE"])],
|
||||||
|
"Subject":[removePOS(["SPACE","PUNCT"])]
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
 clean_in_content=[
-    removePOS(["SPACE"]),
-    removeWords(["dezernat"]),
-    removePOS(["PUNCT"]),
+    removePOS(["SPACE","PUNCT","NUM"]),
+    keepPOS(["NOUN"]),
     replaceURLs(),
-    removePOS(["NUM"]),
-    lemmatize(),
-    removeWords(stop_words),
-    keepUniqueTokens(),
-    normalizeSynonyms()
+    replaceEmails(),
+    removeWords(de_stop_words),
+    lemmatize()
 ]
 
 
@@ -272,7 +425,8 @@ clean_in_content=[
 ## add files to textacy-corpus,
 print("add texts to textacy-corpus...")
 corpus.add_texts(
-    processTextstream(generateMainTextfromTicketXML(path2xml), clean_in_content),
+    processTextstream(csv_to_contentStream(path2csv,"Description"), clean_in_content),
+    processDictstream(csv_to_metaStream(path2csv,metaliste),clean_in_meta)
 )
 
 printRandomDoc(corpus)
@@ -287,3 +441,5 @@ printRandomDoc(corpus)
 
 
 
+end = time.time()
+print("\n\n\nTime Elapsed:{0}".format(end - start))
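Note on the timing lines: the commit measures wall-clock time with `time.time()`. For elapsed-time measurements, `time.perf_counter()` is usually the better fit because it is monotonic and has higher resolution. The following is a minimal sketch; the summation is a stand-in workload and has nothing to do with the corpus code.

```python
import time

start = time.perf_counter()

# Stand-in workload, purely illustrative.
total = sum(range(100_000))

elapsed = time.perf_counter() - start
print("Time Elapsed: {0:.4f}s".format(elapsed))
```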