Package edu.odu.cs.cs350
Class ACMTrainingSetBuilder
java.lang.Object
edu.odu.cs.cs350.ACMTrainingSetBuilder
Build a training set based on TrainingData PDFs and train a FilteredClassifier.
-
Constructor Summary
Constructors
ACMTrainingSetBuilder()
Method Summary
Modifier and Type, Method, Description:
static void addStringInstancesFromRepository(weka.core.Instances raw, List<String> possibleCategories, List<String> docPaths)
    Gets PDFs from Maven repo, extracts text, and adds Instances.
static weka.filters.unsupervised.attribute.StringToWordVector buildTrainingFilter(weka.core.Instances raw)
    Build a StringToWordVector filter for training and classification.
static String determineCategoryFromPath(String resourceRelative, List<String> possibleCategories)
    Determine ACM category from document path within repo.
getACMClasses(List<String> docPaths)
    Get list of ACM categories from document paths (using the subfolder names).
getDocumentListing()
    Get list of document paths from TrainingData repository.
static String getTextFromDocument(String resourceRelative)
    Extract text from a PDF document in the TrainingData repository.
static void main(String[] inputArguments)
    Main method to build training set and train classifier.
static void writeModels(weka.core.Instances raw, weka.core.Instances trainingSet, weka.classifiers.meta.FilteredClassifier trainedClassifier)
    Write trained models to disk.
-
Constructor Details
-
ACMTrainingSetBuilder
public ACMTrainingSetBuilder()
-
Method Details
-
main
Main method to build training set and train classifier.
Parameters:
inputArguments - program arguments (unused)
Throws:
Exception - if files cannot be read or written
-
writeModels
public static void writeModels(weka.core.Instances raw, weka.core.Instances trainingSet, weka.classifiers.meta.FilteredClassifier trainedClassifier) throws Exception
Write trained models to disk.
Parameters:
raw - raw Instances header
trainingSet - vectorized training Instances
trainedClassifier - trained FilteredClassifier
Throws:
Exception - if files cannot be written
-
addStringInstancesFromRepository
public static void addStringInstancesFromRepository(weka.core.Instances raw, List<String> possibleCategories, List<String> docPaths) throws IOException
Gets PDFs from Maven repo, extracts text, and adds Instances.
Parameters:
raw - empty Instances object to fill
possibleCategories - list of valid ACM categories
docPaths - list of document paths within the Maven repo
Throws:
IOException - if a file cannot be read
-
determineCategoryFromPath
public static String determineCategoryFromPath(String resourceRelative, List<String> possibleCategories)
Determine ACM category from document path within repo.
Parameters:
resourceRelative - path to the document relative to the TrainingData resource path
possibleCategories - list of valid ACM categories
Returns:
determined category, or null if not recognized
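The contract above (return a known category, or null when none is recognized) can be illustrated with a minimal sketch. This is not the actual implementation: it assumes the category appears as a directory name somewhere in the relative path, and the class name CategoryFromPathSketch and the sample paths are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not the actual ACMTrainingSetBuilder code. */
final class CategoryFromPathSketch {

    /**
     * Returns the first path segment that matches a known category,
     * or null if no segment matches (assumed matching rule).
     */
    static String determineCategoryFromPath(String resourceRelative,
                                            List<String> possibleCategories) {
        if (resourceRelative == null || possibleCategories == null) {
            return null;
        }
        // Split on either separator so Windows-style paths are handled too.
        for (String segment : resourceRelative.split("[/\\\\]")) {
            if (possibleCategories.contains(segment)) {
                return segment;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample categories and paths are made up for illustration.
        List<String> categories = Arrays.asList("Hardware", "Software");
        System.out.println(determineCategoryFromPath("Hardware/paper1.pdf", categories));
        System.out.println(determineCategoryFromPath("Unknown/paper2.pdf", categories));
    }
}
```

Because the method can return null, callers would need a null check before using the result, for example to skip documents whose category is not in the list.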
-
getTextFromDocument
Extract text from a PDF document in the TrainingData repository.
Parameters:
resourceRelative - path to the document relative to the TrainingData resource path
Returns:
extracted text, or null if extraction failed
Throws:
IOException - if the document cannot be read
-
buildTrainingFilter
public static weka.filters.unsupervised.attribute.StringToWordVector buildTrainingFilter(weka.core.Instances raw) throws Exception
Build a StringToWordVector filter for training and classification.
Parameters:
raw - raw Instances used to set the filter's input format
Returns:
configured StringToWordVector filter
Throws:
Exception - if filter cannot be initialized
-
getDocumentListing
Get list of document paths from TrainingData repository.
Returns:
list of document paths including category
Throws:
IOException - if the listing cannot be read
-
getACMClasses
Get list of ACM categories from document paths (using the subfolder names).
Parameters:
docPaths - list of document paths from repository
Returns:
list of unique ACM categories
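As a rough illustration of deriving unique categories from subfolder names, the sketch below takes the first path segment of each document path and de-duplicates while preserving first-seen order. The List<String> return type, the "first segment is the category" rule, and the class name AcmClassesSketch are assumptions, since the page does not show them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not the actual ACMTrainingSetBuilder code. */
final class AcmClassesSketch {

    /** Collects the leading subfolder of each path, without duplicates. */
    static List<String> getACMClasses(List<String> docPaths) {
        // LinkedHashSet drops duplicates and keeps insertion order.
        Set<String> unique = new LinkedHashSet<>();
        for (String path : docPaths) {
            int slash = path.indexOf('/');
            // Paths with no subfolder carry no category and are skipped.
            if (slash > 0) {
                unique.add(path.substring(0, slash));
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Sample paths are made up for illustration.
        List<String> paths = Arrays.asList(
                "Hardware/a.pdf", "Software/b.pdf", "Hardware/c.pdf", "loose.pdf");
        System.out.println(getACMClasses(paths));
    }
}
```

The resulting list could then serve as the possibleCategories argument of determineCategoryFromPath and addStringInstancesFromRepository.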
-