Class ACMTrainingSetBuilder

java.lang.Object
edu.odu.cs.cs350.ACMTrainingSetBuilder

public class ACMTrainingSetBuilder extends Object
Builds a training set from the TrainingData PDFs and trains a FilteredClassifier on it.
  • Constructor Details

    • ACMTrainingSetBuilder

      public ACMTrainingSetBuilder()
  • Method Details

    • main

      public static void main(String[] inputArguments) throws Exception
      Main method to build training set and train classifier.
      Parameters:
      inputArguments - program arguments (unused)
      Throws:
      Exception - if files cannot be read or written
    • writeModels

      public static void writeModels(weka.core.Instances raw, weka.core.Instances trainingSet, weka.classifiers.meta.FilteredClassifier trainedClassifier) throws Exception
      Write trained models to disk.
      Parameters:
      raw - raw Instances header
      trainingSet - vectorized training Instances
      trainedClassifier - trained FilteredClassifier
      Throws:
      Exception - if files cannot be written
    • addStringInstancesFromRepository

      public static void addStringInstancesFromRepository(weka.core.Instances raw, List<String> possibleCategories, List<String> docPaths) throws IOException
      Gets PDFs from the Maven repo, extracts their text, and adds the resulting instances to raw.
      Parameters:
      raw - empty Instances object to fill
      possibleCategories - list of valid ACM categories
      docPaths - list of document paths within the Maven repo
      Throws:
      IOException - if a file cannot be read
    • determineCategoryFromPath

      public static String determineCategoryFromPath(String resourceRelative, List<String> possibleCategories)
      Determine ACM category from document path within repo.
      Parameters:
      resourceRelative - path to the document relative to the TrainingData resource path
      possibleCategories - list of valid ACM categories
      Returns:
      determined category or null if not recognized
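
A minimal sketch of the lookup that determineCategoryFromPath describes, not the actual implementation. It assumes the category is the first folder of the relative path. The class name CategoryFromPathSketch, the method name categoryFromPath, and the category names CategoryA and CategoryB are hypothetical placeholders.

```java
import java.util.Arrays;
import java.util.List;

final class CategoryFromPathSketch {

    // Hypothetical stand-in for determineCategoryFromPath.
    // ASSUMPTION: the category is the first folder of the relative path.
    static String categoryFromPath(String resourceRelative, List<String> possibleCategories) {
        if (resourceRelative == null) {
            return null;
        }
        // Normalize separators and drop any leading slashes.
        String normalized = resourceRelative.replace('\\', '/');
        while (normalized.startsWith("/")) {
            normalized = normalized.substring(1);
        }
        int slash = normalized.indexOf('/');
        if (slash <= 0) {
            return null; // no subfolder, so no category can be read from the path
        }
        String candidate = normalized.substring(0, slash);
        // Only names in the list of valid categories are accepted.
        return possibleCategories.contains(candidate) ? candidate : null;
    }

    public static void main(String[] args) {
        // Placeholder category names, not real ACM categories.
        List<String> categories = Arrays.asList("CategoryA", "CategoryB");
        System.out.println(categoryFromPath("CategoryA/paper1.pdf", categories)); // prints CategoryA
        System.out.println(categoryFromPath("Other/paper2.pdf", categories));     // prints null
    }
}
```

Because an unrecognized path yields null, a caller would need to check the result before using it.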
    • getTextFromDocument

      public static String getTextFromDocument(String resourceRelative) throws IOException
      Extract text from a PDF document in the TrainingData repository.
      Parameters:
      resourceRelative - path to the document relative to the TrainingData resource path
      Returns:
      extracted text, or null if extraction failed
      Throws:
      IOException - if the document cannot be read
    • buildTrainingFilter

      public static weka.filters.unsupervised.attribute.StringToWordVector buildTrainingFilter(weka.core.Instances raw) throws Exception
      Build a StringToWordVector filter for training and classification.
      Parameters:
      raw - raw Instances used to set the filter's input format
      Returns:
      configured StringToWordVector filter
      Throws:
      Exception - if filter cannot be initialized
    • getDocumentListing

      public static List<String> getDocumentListing() throws IOException
      Get list of document paths from TrainingData repository.
      Returns:
      list of document paths, each including its category subfolder
      Throws:
      IOException - if the listing cannot be read
    • getACMClasses

      public static List<String> getACMClasses(List<String> docPaths)
      Get list of ACM categories from document paths (using the subfolder names).
      Parameters:
      docPaths - list of document paths from repository
      Returns:
      list of unique ACM categories
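
A minimal sketch of how unique categories could be derived from subfolder names, as getACMClasses describes. It is not the actual implementation. It assumes the category is the first folder of each path and that first-seen order should be kept. The class name AcmClassesSketch, the method name categoriesFromPaths, and the sample paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class AcmClassesSketch {

    // Hypothetical stand-in for getACMClasses.
    // ASSUMPTION: the first folder of each path names the category.
    static List<String> categoriesFromPaths(List<String> docPaths) {
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> unique = new LinkedHashSet<>();
        for (String path : docPaths) {
            String normalized = path.replace('\\', '/');
            int slash = normalized.indexOf('/');
            if (slash > 0) {
                unique.add(normalized.substring(0, slash));
            }
            // Paths without a subfolder carry no category and are skipped.
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Placeholder paths, not real entries from the TrainingData repository.
        List<String> docPaths = Arrays.asList(
                "CategoryA/a.pdf", "CategoryB/b.pdf", "CategoryA/c.pdf", "loose.pdf");
        System.out.println(categoriesFromPaths(docPaths)); // prints [CategoryA, CategoryB]
    }
}
```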