\documentclass{article}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{fancyvrb}
\usepackage{vhistory}
\usepackage[margin=1in]{geometry}
\usepackage{listing}
\usepackage{url}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{mathtools} %for use of := symbol mostly.
\usepackage{booktabs} % used for making tables
\usepackage[export]{adjustbox} % allows frame around figure
\usepackage{float} % provides the [H] float placement used for figures
\usepackage[round]{natbib}
\newcommand{\pnote}[1]{\ensuremath{\langle \text{#1} \rangle}}
\begin{document}
\title{\textbf{TrawlExpert: A Tool for Watershed Biological Research}}
\author{Trawlstars Inc. (Group 11) \\ Lab section: L01 \\ Version: 1.0 \\ SFWRENG 2XB3 \\ \\ Christopher W. Schankula, 400026650, schankuc \\ Haley Glavina, 001412343, glavinhc \\ Winnie Liang, 400074498, liangw15 \\ Ray Liu, 400055250, liuc40 \\ Lawrence Chung, 400014482, chungl1\\}
\maketitle
\begin{center}
\includegraphics{logo.png}
\end{center}
\newpage
\begin{versionhistory}
\vhEntry{1.0}{05.04.18}{HG}{created}
\vhEntry{1.1}{08.04.18}{HG}{algorithmic analysis added}
\vhEntry{1.2}{10.04.18}{RL}{added Modular structure and uses}
\vhEntry{1.3}{11.04.18}{CS}{added 2.3.x subsections}
\end{versionhistory}
\noindent\textit{By virtue of submitting this document we electronically sign and date that the work being submitted by all the individuals in the group is their exclusive work as a group and we consent to make available the application being developed through SE-2XB3 project, the reports, presentations, and assignments (not including my name and student number) for future teaching purposes.}
\newpage
\textbf{\Large{Team Contributions}}\\ \\
The individual contributions of each team member are described below. Subteam B indicates an algorithmic focus in a member's efforts while Subteam A indicates a focus on data parsing and user interface development. Although the contributions have been separated such that each task is recorded under one contributor, members often overlapped duties and designed modules together. \\
\begin{table}[h]
\centering
\begin{tabular}{p{0.16\hsize}p{0.30\hsize}p{0.46\hsize}}
\toprule
\textbf{Name} & \textbf{Role} & \textbf{Contributions}\\
\midrule
& Head of Room Booking \newline Subteam B Member
& Implemented the depth first search, connected components algorithms, client code for connected components, Graph building, Node class and Bag class. Oversaw editing of Final Design Document.\\
& Meeting Minutes Administrator \newline Subteam B Member
& Implemented the red-black tree, quickselect, and mergesort algorithms. Wrote \texttt{Histogram} function to sum each year's total individual count. Designed the final presentation slideshow, recorded and submitted all meeting minutes, and assembled the final design specification in LaTeX. Generated UML state machine diagrams.\\
& Project Log Administrator \newline Subteam A Member
& Implemented the module responsible for parsing our data to create related objects (FileProcessor) and implemented the TaxonNode ADT. Led user interface development, set up Tomcat files and directory structure, and handled communication between the Google Maps APIs and JavaScript code. Oversaw and maintained project log entries.\\
& TA \& Professor Liaison \newline Subteam A Member
& Implemented Record ADT, Date ADT, parsing API calls for WORMS API, RangeHelper for Basic Search, histogram output for command line, histogram output for web interface. Oversaw editing of Requirements Specification Document.\\
& & Determined the goals for each meeting, implemented the k-d tree algorithm, wrote backend of server, wrote command-line tool, implemented the BasicSearch and BasicSearchResult classes, implemented RecordCluster ADT, implemented TrawlExpert model class. Generated UML class diagram.\\
\bottomrule
\end{tabular}
\end{table}
\newpage
\begin{abstract}
\noindent \textit{TrawlExpert} is a powerful tool to enable researchers to analyze and filter large datasets from fish trawl surveys in order to perform environmental research on fish and invertebrate populations. The tool gives researchers the ability to intelligently filter and query datasets based on biological classification such as family, genus or species, or based on location or timeframe. Advanced outputs display data as a histogram or geographical map, each depending on population abundance as a function of time and spatial distribution. Additionally, \textit{TrawlExpert} provides a tool for finding local subpopulations within a larger query. A dataset of thousands of Great Lakes trawl surveys from 1958-2016 will be used as a demonstration of \textit{TrawlExpert}'s capability to help researchers narrow down large datasets and glean data which pertains to their research. \textit{TrawlExpert} will be designed to be used easily and effectively as the first step in a groundbreaking climate and ecological research pipeline.
\end{abstract}
\tableofcontents
\section{Project Scope}
\subsection{Objective}
Provide a statistical and visual tool for the analysis of water ecosystems, based on scientific water trawl data. The tool provides researchers with the means to analyze large datasets and find patterns in fish populations, including plotting historical population data on a map, analyzing population trends over time, and determining subpopulations of a given biological classification.
\subsection{Motivation}
The decline of fish populations in the Great Lakes became a problem in the latter half of the 20th century, with the total prey fish biomass declining in Lakes Superior, Michigan, Huron and Ontario between 1978 and 2015 \citep{michigan2017}. Annual bottom trawl surveys involve using specialized equipment to sweep an area and are used to determine the relative temporal variation in stock size, mortality and birth rates of different fish species \citep{walsh1997efficiency}. These surveys often produce hundreds of thousands of records, making manual analysis infeasible. The ongoing protection and development of the Great Lakes water basins is considered an important topic for scientists in both Canada and the United States, as evidenced by grants such as the \textit{Michigan Sea Grant} \citep{michseagr2018}.
TrawlExpert will give researchers tools to filter through these large amounts of data by allowing them to search based on class, order, family, genus or species. This will help support scientific researchers and fishing companies as they study fish populations. These studies help inform initiatives to preserve fish populations and to conduct business in an environmentally friendly way going forward. As more data is collected each year, it can easily be added to TrawlExpert, which will adjust and scale accordingly, combining the new data with the old for continued analysis.
TrawlExpert will also analyze the trawl data to find connected subpopulations within the data, giving researchers tools to analyze the portions of the water body that contain different populations and even track these specific subpopulations over time.
The focus of the project will be to develop these unique data searching and querying tools as a first step in a complete trawl survey analysis. For a complete analysis, tools like stratified statistical analysis are required by the researcher \citep{walsh1997efficiency}. For purposes of maintaining a manageable scope for this project, the implementation of advanced trawl survey scientific and statistical analysis tools will be relegated to future developments.
\subsection{Dataset}\label{sec:out}
The test dataset that will be used for purposes of this project is the \textit{USGS Great Lakes Science Center Research Vessel Catch Information System Trawl}, published by the United States Geological Survey \citep{usgs2018}. Compiled from yearly operations taking place from early spring to late fall between 1958 and 2016, the dataset contains over 283,000 trawl survey records from the five Great Lakes, including latitude and longitude coordinates and biological classification such as family, genus and species.
\begin{figure}
\centering
\includegraphics[width=16cm]{UI.png}
\caption{The final TrawlExpert web interface allows users to intelligently search for specific taxa and display several helpful statistical and data visualization tools such as maps and histograms. A live version of the web interface can be found at \url{http://trawl.schankula.ca/Trawl}}
\label{fig:UI}
\end{figure}
Apache Tomcat was used to create a web server that uses the internal functionality and model of \textit{TrawlExpert}, written in Java. The UI allows users to filter using information about different taxa (their biological relationships to each other, such as family, genus and species) and to display several different data outputs such as histograms, heatmaps, maps and population clusters, in addition to viewing raw data in tabular form. The clustering function is shown in figure \ref{fig:UI}. The \textit{TrawlExpert} is hosted on Google Cloud Platform and can be accessed at \url{http://trawl.schankula.ca/Trawl}.
\subsection{Glossary of Terms}
\noindent\textbf{Classification tree:} Tree describing the relationships between taxa (for example, a species is the child of its genus).
\noindent\textbf{Taxon (plural Taxa):} Refers to a classification of a type of biological entity (species, family, etc.).
\noindent\textbf{Taxon ID:} The ID given to a taxon in the WORMS database.
\noindent\textbf{Trawl survey:} Using Trawling to collect scientific data \citep{walsh1997efficiency}.
\noindent\textbf{Trawling:} A method of fishing by dragging a net along the bottom of a body of water \citep{walsh1997efficiency}.
\noindent\textbf{WORMS}: World Register of Marine Species, an openly accessible online database.
The implementation involved over 30 classes implemented in Java. Additional JavaScript and HTML files were used to create a sophisticated web-based user interface. For a complete description of each class and module used, JavaDoc documentation can be viewed by opening the \texttt{doc/index.html} file in the submission archive or by visiting \url{http://trawl.schankula.ca/Trawl/doc}.
\subsection{Modular Structure and Uses}
The requirements in the Requirements Specification Document are achieved by applying modularity and separation of concerns, which is shown in the UML diagrams. Table 1 maps requirements to each class supporting that requirement.\\
\newpage
\begin{center}
Table 1: Modules contributing to requirements
\textbf{FR:} functional requirement, \textbf{NR:} non-functional requirement\\\end{center}
\begin{tabular}{| l | l |}
\hline
\textbf{Requirements} & \textbf{Contributing modules}\\
\hline
\textbf{FR1.} Reading input file and correcting corrupted data & WormsAPI.java, FileProcessor.java \\
\hline
\textbf{FR2.} Generate output file & \textit{none --- replaced by command-line tool and web GUI} \\
\hline
\textbf{FR3.} List of species by family, order, genus & BioTree.java, TaxonNode.java \\
\hline
\textbf{FR4.} Historical distribution of records & Histogram.java, RedBlackTree.java, BasicSearch.java \\
\hline
\textbf{FR5.} Basic search by family, genus, species, time, location, etc. & BasicSearch.java, KDT.java \\
\hline
\textbf{FR6.} Geographical subgroupings & Cluster.java, RecordCluster.java, Graph.java, CC.java \\
\hline
\textbf{FR7.} Plotting/mapping tools & BasicSearch.java, CC.java, Google Maps API, Tomcat \\
\hline
\textbf{NR1.} Accuracy & KDT.java, TestBio.java, TestBasicSearch.java \\
\hline
\textbf{NR2.} Robustness & FileProcessor.java, BioTree.java, Main.java, Tomcat server \\
\hline
\textbf{NR3.} Fast speed & KDT.java, QuickSelect.java, RedBlackTree.java \\
\hline
\textbf{NR4.} Low memory usage & BioTree.java, DataStore.java \\
\hline
\textbf{NR5.} Scalability & KDT.java, QuickSelect.java, RedBlackTree.java \\
\hline
\textbf{NR6.} User-friendly & Main.java, Tomcat, Google Maps API, graphical histogram \\
\hline
\end{tabular}
\label{tab:Req}
An overview of the modules included in the \textit{TrawlExpert} is shown in figure \ref{fig:UML}. The \textit{TrawlExpert} implementation efforts were divided into two subteams: Subteam A and Subteam B. Tasks were assigned so as to maximize the parallelization of the development process. As a by-product, modules were designed to uphold the principles of information hiding, modularity and separation of concerns.
Related modules were grouped into Java packages of closely related logic, ensuring that modules were organized and easy to find. This section gives a package-level description of each package and the justification for this modular decomposition.
By breaking up modules into generic pieces in this way, the team was able to parallelize development efforts and agreement on APIs allowed for smooth integration between the subteams' efforts. Effective communication ensured that while subteams were working on independent projects concurrently, all team members were aware of work being done throughout the team.
\subsubsection{model Package}
This package includes the class representing the model of the \textit{TrawlExpert} platform. \texttt{TrawlExpert.java} provides wrapper hooks into functionality deeper inside the codebase, which allows different views to be used separately from the internal functionality. Two such views are the command-line interface in \texttt{main/Main.java} and the web interface through the Tomcat server. This design allows any such view to be built upon the foundation of the model, without needing to edit any underlying code.
\subsubsection{data Package}
The top-level \texttt{data} package contains three main types of modules: abstract data types (e.g. \texttt{Record}, \texttt{TaxonNode}, \texttt{Date}), abstract objects (e.g. \texttt{BioTree}, \texttt{DataStore}) and supporting modules such as \texttt{WormsAPI}.
Abstract data types such as \texttt{Record} are classes for representing and accessing certain types of information in our dataset. For example, \texttt{Record} is the representation of a single line of the dataset (including taxon ID, latitude, longitude and date). Each ADT was designed to uphold the principle of encapsulation and provides methods for accessing data relevant to the ADT.
Abstract singleton objects were used to provide a central storage location for data. For example, \texttt{BioTree} contains information about the biological classification within the dataset. Since they are singleton objects, data can always be accessed from anywhere else in the program which needs it (see figure \ref{fig:UML}) without needing to pass references to objects. Similarly, \texttt{DataStore} provides a central location for the storage of \texttt{Record}s in the dataset in a kd-tree structure.
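To illustrate the style of these ADTs, a minimal sketch of a \texttt{Record}-like class is shown below. The field and accessor names here are assumptions based on the dataset description and do not reflect the actual \texttt{Record.java} interface.
\begin{verbatim}
// Hypothetical sketch of a Record-style ADT; field names are assumptions.
public final class RecordSketch {
    private final int taxonId;       // WORMS taxon ID
    private final double latitude;   // survey latitude
    private final double longitude;  // survey longitude
    private final int year;          // survey year

    public RecordSketch(int taxonId, double latitude,
                        double longitude, int year) {
        this.taxonId = taxonId;
        this.latitude = latitude;
        this.longitude = longitude;
        this.year = year;
    }

    public int getTaxonId()      { return taxonId; }
    public double getLatitude()  { return latitude; }
    public double getLongitude() { return longitude; }
    public int getYear()         { return year; }
}
\end{verbatim}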
\subsubsection{search Package}
The \texttt{search} package provides several generic search structures such as Red-Black Tree and kd-tree. It also contains a sub-package search.trawl which contains client code that utilizes these search structures to provide \textit{TrawlExpert}-specific searching.
Firstly, search and \texttt{search.kdt} provide generic kd-tree and Red-Black Tree implementations. These each utilize the \texttt{GeneralCompare} interface for maximum generality.
Secondly, \texttt{search.trawl} provides the \texttt{BasicSearch} module. This module is the access point for range-searching the dataset of \texttt{Record} objects in the \texttt{DataStore}. The \texttt{BasicSearchResult} class is the type of object returned from a range search, providing metadata about the search (such as the amount of time taken to find the results) as well as wrapper functions for generating histograms, clusters and sums of the data. In the case of the histogram and sum, the results are cached the first time they are computed, improving performance if those fields are accessed again.
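This lazy-caching behaviour can be sketched as follows. The sketch is a simplified stand-in (an \texttt{int[]} of years rather than full \texttt{Record} objects, and a fixed year range on every call) and is not the actual \texttt{BasicSearchResult} code.
\begin{verbatim}
// Simplified stand-in for a result object that caches a derived histogram;
// assumes the same year range is requested on every call.
public class CachedResultSketch {
    private final int[] years;   // year of each matching record
    private int[] histogram;     // computed on first request, then reused

    public CachedResultSketch(int[] years) { this.years = years; }

    public int[] histogram(int firstYear, int lastYear) {
        if (histogram == null) {             // compute only once
            histogram = new int[lastYear - firstYear + 1];
            for (int y : years)
                if (y >= firstYear && y <= lastYear)
                    histogram[y - firstYear]++;
        }
        return histogram;
    }
}
\end{verbatim}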
\subsubsection{sort Package}
The \texttt{sort} package includes modules such as \texttt{MergeSort} and \texttt{QuickSelect} as well as the custom interfaces \texttt{GeneralCompare}, \texttt{GeneralRange} and \texttt{Field}.
The \texttt{MergeSort} class was designed with generic typing, and its use of the \texttt{GeneralCompare} interface makes it even more flexible than the regular Java \texttt{Comparable} interface. This class ended up being largely replaced by the \texttt{QuickSelect} class, which served our purposes better. However, \texttt{MergeSort} is still used for sorting scientific names for the web interface, a small but integral part of the UI.
\subsubsection{graph Package}
The \texttt{graph} package provides functionality and abstract data types for constructing undirected graphs with a given number of nodes. It also contains the \texttt{CC} class for outputting connected components of the graph in question.
Additionally, the \texttt{graph} package includes the client code relevant to \textit{TrawlExpert}, including \texttt{Cluster.java} and \texttt{RecordCluster.java}. These provide the functionality for clustering a given section of the dataset into related subpopulations based on a user-defined area of similarity amongst the records.
\subsubsection{utils Package}
The \texttt{utils} package contains miscellaneous helpful utilities for the \textit{TrawlExpert}. At present, the only contained class is \texttt{Stopwatch.java} which provides the method for timing how long a \texttt{BasicSearch} query takes to complete.
\subsubsection{web Package}
The \texttt{web} package contains the controller (\texttt{Director.java}) and the startup script (\texttt{StartUpContext.java}) for the Tomcat server. \texttt{Director} routes requests to the correct views in a model-view-controller-based setup. \texttt{StartUpContext} loads the \textit{TrawlExpert} data into persistent memory on the Tomcat server, allowing for fast lookups of data from memory.
Most of the other Java Server Pages (.jsp) files are contained in \texttt{tomcat/webapps/Trawl/}. Since this is not a focus of the project, these are not covered further in this design document.
\begin{figure}
\centering
\includegraphics[angle=90,width=14cm]{../UML.png}
\caption{The UML class diagram of TrawlExpert, showing the relationships amongst modules. The full-resolution version is available in the root directory of the submission archive as \texttt{UML.png}.}
\label{fig:UML}
\end{figure}
\subsection{UML State Diagrams}
Two UML state machine diagrams are included to describe the states and transitions within the \textit{BioTree.java} and \textit{Main.java} classes.
\subsubsection{Main.java}
The UML diagram for the Main.java class is shown in figure \ref{fig:MainUML}. This represents the \textit{TrawlExpert} console application's states, giving an overview of the types of queries and functions the user has access to. Since the \textit{Main.java} class is a console version of the final server implementation, the states shown in its UML state machine diagram are analogous to many of the states of the final \textit{TrawlExpert} website.
\subsubsection{BioTree.java}
The UML state diagram for the BioTree module is shown in figure \ref{fig:BioTreeUML}. The BioTree class is a singleton class which stores the information about the different taxa in the dataset. This design has a few advantages. Firstly, the string names and relationships amongst taxa (e.g. species, genus, family) are stored only once and accessed when needed, saving large amounts of memory. After the dataset has been processed by the \textit{TrawlExpert}, the serialized dataset representing the same data is only 27~MB, requiring very little storage on the user's computer.
Secondly, this diagram represents a key feature of \textit{TrawlExpert}: it is able to recover corrupted data as the dataset is processed, which is very helpful for large datasets. In the USGS dataset, for example, there were 115 instances of different incorrectly named taxa, totalling 15,596 records (almost 6\% of the records in the dataset). Using this method, these records were able to be recovered for proper use by the scientist. Using the smart caching of incorrect names described by this UML diagram, the number of API calls to WORMS is kept to a minimum and the dataset processing only takes about 3 minutes. After the initial processing, the BioTree and records are stored as serialized Java objects on disk, and can be reloaded in less than 10 seconds.
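A simplified sketch of this name-caching idea is given below; the class, method and map names are illustrative only and do not correspond to the actual \texttt{BioTree} or \texttt{WormsAPI} signatures.
\begin{verbatim}
import java.util.HashMap;
import java.util.Map;

// Illustrative cache of corrected taxon names; one WORMS lookup per unique
// (possibly misspelled) name, with hits and misses both remembered.
public class NameCacheSketch {
    private final Map<String, Integer> resolved = new HashMap<>();

    public Integer taxonIdFor(String scientificName) {
        if (resolved.containsKey(scientificName))
            return resolved.get(scientificName);      // cached (may be null)
        Integer id = lookupViaWorms(scientificName);  // one network call
        resolved.put(scientificName, id);
        return id;
    }

    private Integer lookupViaWorms(String name) {
        // Placeholder for a fuzzy-matching call to the WORMS API.
        return null;
    }
}
\end{verbatim}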
\begin{figure}[H]
\centering
\includegraphics[width=18cm, trim={4.5cm 0 3cm 0}, clip]{MainDotJava.pdf}
\caption{UML State machine diagram for \textit{main/Main.java}, a class that provides console access to the \textit{TrawlExpert}'s main functions. This class accepts search criteria from a user to produce a list of search results, depict a histogram of the records in that result, and compute a count of the search hits.}
\label{fig:MainUML}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[width=18cm, trim={1.3cm 0 0 0}, clip]{BioTreeDotJava.pdf}
\caption{UML State machine diagram for \textit{/data/biotree/BioTree.java}, a class that builds a tree data structure from the scientific name hierarchies (taxa) of fish. Uses the World Register of Marine Species (WORMS) API to identify the correct spelling of species names for misspelled scientific names in the dataset.}
\label{fig:BioTreeUML}
\end{figure}
\section{Algorithmic Opportunities}
The \textit{TrawlExpert} was made possible by the use of various algorithms studied in \textit{SFWRENG 2C03: Algorithms} offered at McMaster University. These algorithms include \textit{Red-Black Tree} for searching and \textit{Merge Sort} for sorting objects. Additional algorithms outside of the course scope were implemented to optimize the program; they are described below.
\subsection{Quick Select}
A modified form of the \textit{QuickSort} algorithm that returns the $k^{th}$ largest element of an unsorted array. Similar to \textit{QuickSort}, \textit{QuickSelect} randomly chooses a partitioning element and partitions the array such that all elements smaller than the partition are to its left and all larger elements are to its right. However, rather than recursively sorting both halves of the partitioned array, \textit{QuickSelect} only recurses into the half containing the $k^{th}$ index. The algorithm terminates once the partitioning element lands at the $k^{th}$ index of the array, at which point the value of that element is returned.
This algorithm is implemented in \textit{/sort/QuickSelect.java}. It is used during the construction of \textit{k-d tree}s which require the frequent division of an array into two equally sized halves. By finding the median element of an array, it is partially sorted into equally sized small and large halves. The \textit{QuickSelect} class implements a \textit{median} method to simplify its usage in \textit{k-d tree} construction.
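For illustration, a minimal \textit{QuickSelect} sketch over a plain \texttt{int} array is shown below. The actual \textit{/sort/QuickSelect.java} is generic (using the package's comparison interfaces), so this primitive-typed version is an assumption-laden simplification rather than the project code.
\begin{verbatim}
import java.util.Random;

// Minimal QuickSelect sketch on an int array.
public class QuickSelectSketch {
    private static final Random RAND = new Random();

    // Returns the element that would sit at index k (0-based)
    // if the array were fully sorted.
    public static int select(int[] a, int k) {
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == k) return a[p];
            else if (p < k) lo = p + 1;
            else hi = p - 1;
        }
        return a[lo];
    }

    // Median helper used when splitting an array into two halves.
    public static int median(int[] a) {
        return select(a, a.length / 2);
    }

    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RAND.nextInt(hi - lo + 1), hi);  // random pivot
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(a, i++, j);
        swap(a, i, hi);
        return i;   // final index of the pivot
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
\end{verbatim}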
\subsection{kd Tree}
A k-dimensional (kd) binary search tree was used to provide a fast range searching structure for the records. Given that the current size of the USGS dataset is over 280,000 entries and likely to grow with additional studies, it was crucial to have a structure to support fast searches. However, since the data contains many dimensions (taxon id, latitude, longitude, date), a simple binary search tree is not useful for this task. Instead, a kd-tree was employed.
A kd-tree is like a binary search tree except that on each level $i$ of the tree, the comparison between nodes is made on the ($i\ \%\ d$)th axis of the data. For example, in a two-dimensional tree of ($x,y$) points, the first level of the tree would be compared on $x$, and the second level would be compared on $y$, the third level on $x$, and so on.
This structure gives a fast way of range searching for values in the tree, with a search complexity of $\mathcal{O}(dn^{1-\frac{1}{d}})$ where $n$ is the number of nodes in the tree and $d$ is the number of splitting dimensions of the tree. In the \textit{TrawlExpert}, we use a 4-d tree to split on taxon id, date, latitude and longitude. Range searches for specific species are very fast, often on the order of 5-10ms and sometimes as fast as 1ms.
In order to build the kd-tree in a balanced way, it is crucial to be able to find the median of the data, so that a balanced number of nodes is inserted into each left and right subtree. To support fast kd-tree building (which only needs to happen once, when the dataset is first analyzed), the aforementioned \textit{QuickSelect} algorithm was used, allowing the whole kd-tree to be built in about 0.6 seconds. The kd-tree class then contains methods for serializing the data of the kd-tree so that it can be reloaded quickly from disk on subsequent launches of \textit{TrawlExpert}.
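The axis-cycling idea can be sketched as follows for a two-dimensional case. The real \texttt{KDT.java} is generic, splits on four dimensions and uses the project's accessor interfaces, so this is only an illustrative simplification.
\begin{verbatim}
// Sketch of how a k-d tree cycles its comparison axis with depth.
// A 2-d (latitude, longitude) case is shown; the project uses 4 dimensions.
public class KdNodeSketch {
    double[] point;            // point[0] = latitude, point[1] = longitude
    KdNodeSketch left, right;

    KdNodeSketch(double[] point) { this.point = point; }

    static KdNodeSketch insert(KdNodeSketch node, double[] p, int depth) {
        if (node == null) return new KdNodeSketch(p);
        int axis = depth % p.length;          // axis alternates by level
        if (p[axis] < node.point[axis])
            node.left = insert(node.left, p, depth + 1);
        else
            node.right = insert(node.right, p, depth + 1);
        return node;
    }
}
\end{verbatim}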
\subsection{Graphing}
Graph algorithms were used to support advanced searching features. Firstly, the biological classification of each organism forms a tree from which species in the same genus, for example, can be located. This was accomplished by creating a BioTree node, which stores the taxon id number of the classification, the scientific name of the entry, the number of records with that taxon id contained in the dataset and pointers to the parent and the children of the node. This structure directly mimics the method that scientists use to classify species according to their similarities (into family, genus, species) and allows for intelligent filtering and searching of the dataset. For example, with this structure it is possible to find all descendants of a certain biological classification.
Secondly, a graph algorithm was used to find connected components among search results. Nodes are connected together based on their distance to surrounding points \citep{tom10}. Depth-first search was used to determine connected components \citep{broder2000graph}.
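A minimal sketch of connected-component labelling with depth-first search is shown below. The project's \texttt{Graph} and \texttt{CC} classes differ in their details (for example, how edges are derived from distances between records), so this is illustrative only.
\begin{verbatim}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of connected-component labelling via depth-first search
// on an undirected graph stored as adjacency lists.
public class CCSketch {
    private final List<List<Integer>> adj;
    private final int[] componentOf;

    public CCSketch(int n) {
        adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        componentOf = new int[n];
    }

    public void addEdge(int u, int v) { adj.get(u).add(v); adj.get(v).add(u); }

    // Assign a component id to every vertex; return the number of components.
    public int label() {
        Arrays.fill(componentOf, -1);
        int id = 0;
        for (int v = 0; v < componentOf.length; v++)
            if (componentOf[v] == -1) dfs(v, id++);
        return id;
    }

    private void dfs(int v, int id) {
        componentOf[v] = id;
        for (int w : adj.get(v))
            if (componentOf[w] == -1) dfs(w, id);   // visit each node once
    }
}
\end{verbatim}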
\section{Software Design Principles}
\subsection{Robustness}
Robustness is a non-functional requirement prioritized during the \textit{TrawlExpert}'s development. Considering all 280,000+ records in the dataset were entered by humans, data entry errors were inevitable. The \textit{TrawlExpert} implementation had to ensure unexpected entries in the dataset were handled gracefully and could be recovered if possible.
When building a BioTree from the dataset, the World Register of Marine Species (WORMS) database API was used to find the correct scientific name of slightly misspelled names. Unless a name was severely misspelled, the WORMS API was able to salvage small data entry errors. This ensured records could be used when building the BioTree and protected the tool from raising exceptions from small input errors. While this introduces a dependence on an Internet connection for \textit{TrawlExpert}, it was assumed that the scientists working with \textit{TrawlExpert} would have access to an Internet connection, and the tradeoff is reasonable for the recovery of many errors in the dataset.
The use of drop-down boxes on the user interface helped prevent invalid search criteria from being entered. From left to right, each box contains increasingly specific components of a scientific name for fish species. When any of the dropdown boxes is changed, all boxes to the left (representing more general components of that species name) are updated; this ensures that the hierarchy formed by the more general components contains the newly adjusted value. Additionally, all boxes to the right are cleared, since a change to a more general component means the values populating the more specific boxes may no longer satisfy the hierarchy. Clearing them prevents invalid scientific names from being used as search input.
\subsection{Scalability}
The tool must be able to handle large amounts of data while still completing queries at high speed. Currently, the tool uses a dataset of over 280,000 records, but it must be able to maintain its high performance on larger datasets. By using selection algorithms such as \textit{QuickSelect} to build the \textit{k-d tree}, the \textit{TrawlExpert}'s tree construction has been made significantly faster.
Implementing \textit{Quick Select} rather than \textit{Merge Sort} drastically improved the \textit{TrawlExpert}'s performance. When using \textit{Merge Sort} during \textit{k-d tree} construction, an array must be fully sorted before retrieving the median element, taking $\mathcal{O}(n\lg n)$ time where $n$ is the size of the dataset. \textit{Quick Select} only partially sorts the array before reaching the median, taking $\mathcal{O}(n)$ time, and it reduced \textit{k-d tree} construction from 40.083 s using \textit{Merge Sort} to 0.56 s, representing a 72x improvement.
A common theme among \textit{TrawlExpert} classes is the use of lambda functions characterized by Java interfaces which describe their syntax as well as their semantic meaning. Lambda functions provide the capacity for parameterized object comparison or parameterized value access. This maintains the generality, and therefore reusability, of each class by allowing for generic types in class definitions. The types of the inputs, and how the input objects are used, are only fixed when the lambda function is supplied.
\subsubsection{General Compare}
The \textit{GeneralCompare} interface can be found at \textit{/sort/GeneralCompare.java}. This interface includes a \textit{compare} function that takes two generically typed inputs and produces an integer output. When \textit{GeneralCompare} is used in other classes, a compare function (the lambda function) is used to instantiate the expected input type and designate how the integer result must be calculated. This allows reuse of the interface among modules that perform comparisons of differently typed objects. Two records consisting of a fish species, date of observation, and geographic location can be compared based on lexicographic order of their names, date, or proximity to some location. \textit{GeneralCompare} enables the comparison of record objects based on any of these parameters.
\subsubsection{Field}
The \textit{Field} interface can be found at \textit{/search/Field.java}. This interface includes a \textit{field} function that retrieves a key (a generic type) from a generically typed input object. Similar to \textit{GeneralCompare}'s \textit{compare} function, \textit{field} is a lambda function. The \textit{Field} interface is used to perform searches in a tree of records that have been sorted by variable attributes from each record. The lambda function specifies which attribute to access when searching through the tree.
\subsubsection{General Range}
The \textit{GeneralRange} interface can be found at \textit{/sort/GeneralRange.java}. This interface includes an \textit{isInBounds} function that returns an integer describing whether a record is a member of a subset of the search results. The input has a generic type, rather than the \textit{Record} type, to satisfy reusability. The lambda function uses the range itself to perform conditional checks on whether the input object is below, within, or above the range. A return value of -1 indicates it is below, 0 indicates it is within, and 1 indicates it is above the range.
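The following sketch shows how such functional interfaces can be declared and instantiated with lambda functions. The signatures shown are assumptions for illustration and may differ from the actual \texttt{GeneralCompare}, \texttt{Field} and \texttt{GeneralRange} definitions.
\begin{verbatim}
// Illustrative functional-interface declarations and lambda instantiations;
// the project's actual signatures may differ.
public class LambdaSketch {
    @FunctionalInterface
    interface GeneralCompare<T> { int compare(T a, T b); }

    @FunctionalInterface
    interface Field<T, K> { K field(T item); }

    @FunctionalInterface
    interface GeneralRange<T> { int isInBounds(T item); }

    public static void main(String[] args) {
        // Compare two year values numerically.
        GeneralCompare<Integer> byYear = (a, b) -> Integer.compare(a, b);

        // Extract a latitude field from a double[] record stand-in.
        Field<double[], Double> latitude = rec -> rec[0];

        // -1: below range, 0: within, 1: above (years 1990-2000 assumed).
        GeneralRange<Integer> years = y -> (y < 1990) ? -1 : (y > 2000) ? 1 : 0;

        System.out.println(byYear.compare(1998, 2003));                // negative
        System.out.println(latitude.field(new double[]{43.2, -79.9})); // 43.2
        System.out.println(years.isInBounds(1995));                    // 0
    }
}
\end{verbatim}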
Table 1 shows the modules responsible for helping to meet the functional and non-functional requirements laid out in the Requirements Specification Document. This section describes in more detail how these were met.
The first challenge in developing this tool was parsing the data. One requirement was to read and clean the data, then produce a data structure of Record objects (\textbf{FR1}). The software tool performed this task as planned, and even exceeded expectations: a \textit{k-d tree} was used to store the Records in an easily and quickly accessible manner, and the \texttt{BioTree} module had tools to recover corrupt data. \textbf{FR3} described a requirement to be able to list related taxa (family, genus, species, etc.) in an intelligent way, which was also achieved by the \texttt{BioTree} module. \textbf{FR4} was achieved by the \texttt{Histogram} module, which generated a histogram displayed on both the command-line and web GUI interfaces. Another requirement was to provide basic searching capabilities based on input criteria (\textbf{FR5}); this was achieved through efficient sorting and searching algorithms, and the results were verified using JUnit test cases for all searching and sorting algorithms. The geographical subgroupings requirement (\textbf{FR6}) was achieved by using an undirected graph structure and connected-component analysis.
The requirement of outputting records to .csv files (\textbf{FR2}) was not completed, as it was replaced by efforts on the command-line interface and web GUI. It should be explored further in the future.
In terms of meeting non-functional requirements, the team met expectations. The use of the WORMS API when parsing the dataset improved robustness (\textbf{NR2}) and algorithmic choices such as \textit{QuickSelect} and \textit{kd-tree}s improved performance (\textbf{NR3, NR5}). The final product achieved the requirement of being user-friendly (\textbf{NR6}) since it is easily accessible via the Google Cloud server and prevents the user from entering invalid search criteria.
Additional goals included using less than 1~GB of RAM (\textbf{NR4}); this was achieved, with the \textit{TrawlExpert} using approximately 0.5~GB of RAM. One way this was accomplished was by storing duplicate data (e.g. scientific name strings) only once. An additional goal was to perform queries in less than 1 second (\textbf{NR3}); this was also achieved, with many queries taking approximately 1--10~ms.
A positive team dynamic throughout the development process ensured collaboration and help were always offered; this was a large contributing factor to the success of the final product.
\subsection{Changes During Development}
There were some algorithmic changes made during development. Two key changes were the switch from \textit{Merge Sort} to \textit{Quick Select}, and a change to how the connected-components client code groups clusters. As discussed in Algorithmic Opportunities, the use of \textit{Quick Select} dramatically improved performance.
Another algorithmic change involved the client code for Connected Components when determining fish clusters. Initially, every node was visited multiple times to determine whether other nodes were within a given radius. The running time was unacceptable using this approach, and as a result, the algorithm was changed such that visited nodes were not revisited. This decreased running time significantly and was considered acceptable by the team.
While \textit{TrawlExpert} met all of its original goals for this stage of its development, there are several points for improvement and future development of the platform as an all-in-one research tool for watershed research.
\subsubsection{Improvements on Development Process}
Most of the changes that would benefit the \textit{TrawlExpert} involve its development requirements. The original goals for this section were quite extensive; however, one aspect that was overlooked was file organization. Although GitLab was used for version control, confusion still occurred over which packages certain classes belonged to. For example, there were instances in the project where a search class would be located in the graph package. Adding a requirement for file organization would make the project more accessible during development and would also yield a more efficient workflow, because less time would be dedicated to searching for a desired class.
\subsubsection{Future Functionality}
Functionally, there are many future goals in the development of \textit{TrawlExpert}. This phase of the development process was aimed at providing scientists with an effective tool to search and filter data relevant to their research, as well as some basic statistical tools. However, this only represents the first stage in a larger scientific research pipeline. Often, more advanced tools such as stratified statistical analysis are needed to properly take into account the many variables in trawl survey expeditions \citep{walsh1997efficiency}. Future work includes building these tools into \textit{TrawlExpert} in order to create an all-in-one research platform for trawl surveys.
Additionally, the original requirements specification stated that the program should be able to output subsets of the dataset to .csv files (\textbf{FR2}). This was replaced by the command-line interface in \texttt{Main.java} and the web GUI, but should be explored further in the future as it would be helpful to scientists interested in obtaining raw data for further analysis.
\clearpage
\bibliographystyle{apa}
\bibliography{bib}
\end{document}