java.lang.Object

edu.odu.cs.cs350.ACMTrainingSetBuilder

public class ACMTrainingSetBuilder extends Object

Build a training set based on TrainingData PDFs and train a FilteredClassifier.

Constructor Summary

| Constructor | Description |
| --- | --- |
| ACMTrainingSetBuilder() | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static void | addStringInstancesFromRepository(weka.core.Instances raw, List<String> possibleCategories, List<String> docPaths) | Gets PDFs from Maven repo, extracts text, and adds Instances. |
| static weka.filters.unsupervised.attribute.StringToWordVector | buildTrainingFilter(weka.core.Instances raw) | Build a StringToWordVector filter for training and classification. |
| static String | determineCategoryFromPath(String resourceRelative, List<String> possibleCategories) | Determine ACM category from document path within repo. |
| static List<String> | getACMClasses(List<String> docPaths) | Get list of ACM categories from document paths (using the subfolder names). |
| static List<String> | getDocumentListing() | Get list of document paths from TrainingData repository. |
| static String | getTextFromDocument(String resourceRelative) | Extract text from a PDF document in the TrainingData repository. |
| static void | main(String[] inputArguments) | Main method to build training set and train classifier. |
| static void | writeModels(weka.core.Instances raw, weka.core.Instances trainingSet, weka.classifiers.meta.FilteredClassifier trainedClassifier) | Write trained models to disk. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- ACMTrainingSetBuilder
  
  public ACMTrainingSetBuilder()

Method Details
- main
  
  public static void main(String[] inputArguments) throws Exception
  
  Main method to build training set and train classifier.
  
  Parameters:
  
  inputArguments - program arguments (unused)
  
  Throws:
  
  Exception - if files cannot be read or written
- writeModels
  
  public static void writeModels(weka.core.Instances raw, weka.core.Instances trainingSet, weka.classifiers.meta.FilteredClassifier trainedClassifier) throws Exception
  
  Write trained models to disk.
  
  Parameters:
  
  raw - raw Instances header
  
  trainingSet - vectorized training Instances
  
  trainedClassifier - trained FilteredClassifier
  
  Throws:
  
  Exception - if files cannot be written
- addStringInstancesFromRepository
  
  public static void addStringInstancesFromRepository(weka.core.Instances raw, List<String> possibleCategories, List<String> docPaths) throws IOException
  
  Gets PDFs from Maven repo, extracts text, and adds Instances.
  
  Parameters:
  
  raw - empty Instances object to fill
  
  possibleCategories - list of valid ACM categories
  
  docPaths - list of document paths within the Maven repo
  
  Throws:
  
  IOException - if a file cannot be read
- determineCategoryFromPath
  
  public static String determineCategoryFromPath(String resourceRelative, List<String> possibleCategories)
  
  Determine ACM category from document path within repo.
  
  Parameters:
  
  resourceRelative - path to the document relative to the TrainingData resource path
  
  possibleCategories - list of valid ACM categories
  
  Returns:
  
  determined category or null if not recognized
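
The lookup described above can be illustrated with a small, self-contained sketch. It is not the actual implementation: it assumes the category is the name of one of the folders in the path, and the class name `CategoryFromPathSketch`, the helper `categoryFromPath`, and the category names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class CategoryFromPathSketch {

    // Hypothetical stand-in for determineCategoryFromPath.
    // Assumption: the category is the name of a folder that appears in the path.
    public static String categoryFromPath(String resourceRelative,
                                          List<String> possibleCategories) {
        // Normalize Windows separators, then split the path into segments.
        String[] segments = resourceRelative.replace('\\', '/').split("/");
        // Skip the last segment, which is the file name.
        for (int i = 0; i < segments.length - 1; i++) {
            if (possibleCategories.contains(segments[i])) {
                return segments[i];
            }
        }
        return null; // category not recognized
    }

    public static void main(String[] args) {
        // Hypothetical category names, for illustration only.
        List<String> categories = Arrays.asList("Hardware", "Software");
        System.out.println(categoryFromPath("Hardware/paper1.pdf", categories));
        System.out.println(categoryFromPath("Unknown/paper2.pdf", categories));
    }
}
```

Because an unrecognized path yields null, callers of a method with this contract should check the result before using it as a class label.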
- getTextFromDocument
  
  public static String getTextFromDocument(String resourceRelative) throws IOException
  
  Extract text from a PDF document in the TrainingData repository.
  
  Parameters:
  
  resourceRelative - path to the document relative to the TrainingData resource path
  
  Returns:
  
  extracted text, or null if extraction failed
  
  Throws:
  
  IOException - if the document cannot be read
- buildTrainingFilter
  
  public static weka.filters.unsupervised.attribute.StringToWordVector buildTrainingFilter(weka.core.Instances raw) throws Exception
  
  Build a StringToWordVector filter for training and classification.
  
  Parameters:
  
  raw - raw Instances to set input format
  
  Returns:
  
  configured StringToWordVector filter
  
  Throws:
  
  Exception - if filter cannot be initialized
- getDocumentListing
  
  public static List<String> getDocumentListing() throws IOException
  
  Get list of document paths from TrainingData repository.
  
  Returns:
  
  list of document paths including category
  
  Throws:
  
  IOException - if the listing cannot be read
- getACMClasses
  
  public static List<String> getACMClasses(List<String> docPaths)
  
  Get list of ACM categories from document paths (using the subfolder names).
  
  Parameters:
  
  docPaths - list of document paths from repository
  
  Returns:
  
  list of unique ACM categories
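
As a rough illustration of deriving unique categories from subfolder names, the sketch below collects the immediate parent folder of each document path. It is an assumption-laden example rather than the class's real code: the class name `AcmClassesSketch`, the helper `categoriesFromPaths`, the sample paths, and the choice to preserve first-seen order are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class AcmClassesSketch {

    // Hypothetical stand-in for getACMClasses.
    // Assumption: each document's immediate parent folder names its category.
    public static List<String> categoriesFromPaths(List<String> docPaths) {
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> unique = new LinkedHashSet<>();
        for (String path : docPaths) {
            String normalized = path.replace('\\', '/');
            int lastSlash = normalized.lastIndexOf('/');
            if (lastSlash <= 0) {
                continue; // no parent folder, so no category can be inferred
            }
            String parent = normalized.substring(0, lastSlash);
            String name = parent.substring(parent.lastIndexOf('/') + 1);
            if (!name.isEmpty()) {
                unique.add(name);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Hypothetical document paths, for illustration only.
        List<String> paths = Arrays.asList(
                "Hardware/a.pdf", "Software/b.pdf", "Hardware/c.pdf", "loose.pdf");
        System.out.println(categoriesFromPaths(paths));
    }
}
```

A list built this way could then serve as the possibleCategories argument to the other methods, so that every label comes from the same source as the documents.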
