Commit 0587a5f4 authored by Christopher Schankula

add figures, edits and reference file finally
\usepackage{mathtools} %for use of := symbol mostly.
\usepackage{booktabs} % used for making tables
\usepackage[export]{adjustbox} % allows frame around figure
\usepackage[round]{natbib}
Statistical and visual tool for the analysis of water ecosystems, based on scientific water trawl data. Provides researchers with tools to analyze large datasets to find patterns in fish populations, including the plotting of historical population data on a map, analyzing population trends over time and finding subpopulations of a certain species, genus, family, etc.
\section{Motivation}
The diminishing of fish populations in the Great Lakes became a problem in the latter half of the 20th century, with the total prey fish biomass declining in Lakes Superior, Michigan, Huron and Ontario between 1978 and 2015 \citep{michigan2017}. Annual bottom trawl surveys involve using specialized equipment to sweep an area and are used to determine the relative temporal variation in stock size, mortality and birth rates of different fish species \citep{walsh1997efficiency}. These surveys are performed annually and often have hundreds of thousands of records, making manual analysis infeasible. The ongoing protection and development of the Great Lakes water basins is considered an important topic for scientists in both Canada and the United States, as evidenced by grants such as the \textit{Michigan Sea Grant} \citep{michseagr2018}.
TrawlExpert will give researchers tools to filter through these large amounts of data by allowing them to search based on class, order, family, genus or species. This will support scientific researchers and fishing companies as they study fish populations in order to launch preservation initiatives and perform their business in an environmentally friendly way. As more data is collected on an annual basis, it can easily be loaded into TrawlExpert, which will adjust and scale accordingly, combining the new data with the old for continued analysis.
TrawlExpert will also analyze the trawl data to find connected subpopulations within the data, giving researchers tools to analyze the portions of the water body that contain different populations and even track these specific subpopulations over time.
The focus of the project will be to develop these unique data searching and querying tools as a first step in a complete trawl survey analysis. For a complete analysis, tools like stratified statistical analysis would be required by the researcher \citep{walsh1997efficiency}. To keep the scope of this project manageable, the implementation of advanced trawl survey scientific and statistical analysis tools will be relegated to future development.
\section{Prior Work}
From preliminary research, past software has been developed to analyze trawl survey data. One such program is MIXFISH, part of the LFSA package written in BASIC in 1987 \citep{sparre1987computer}. MIXFISH / LFSA has been used by several researchers (e.g. \cite{levi1993analysis}, \cite{chakraborty1996stock}) while performing trawl survey analyses. Most of these analyses were confined to smaller samples of data (e.g. 1000 samples for a given species); however, with modern hardware and a modern programming language, TrawlExpert will allow processing of much larger datasets.
Other studies, e.g. \cite{swartzman1992spatial}, make use of software programs to perform statistical analysis but do not use software specialized for analyzing trawl survey data. Although the main focus of this semester's project will be the searching and selection of data, TrawlExpert's goal is to give researchers an all-in-one platform for each step of the analysis pipeline.
\section{Input / Output and Proposed Solutions}
\subsection{Datasets}\label{sec:out}
For a given species over a given timeframe, the program will output the recorded locations of all records matching that query.
\subsubsection{Output 3: Geographical Subgroupings}\label{sec:subgroup}
The third main type of output will be to classify a given search into subgroupings of highly clustered populations (see section \ref{sec:graphalgs} and figure \ref{fig:Groupings}).
\subsection{Family Subgroupings: A Use Case Example}\label{sec:case}
The dataset presented in section \ref{sec:out} will be used as the main input in order to generate these outputs. The location and temporal data given in the dataset will be used to help generate outputs 1 and 2. In order to illustrate this, the following example use case will be presented and analyzed:
A researcher is studying the decline of all species in the family \textit{Cyprinidae}. She has a large dataset of trawl data and wishes to obtain information about the related subpopulations in the data. Therefore, she uses TrawlExpert to generate an output of type 3, which will give her the output she needs for her research. By recursing down a pre-built tree structure built from the data (described in more detail in section \ref{sec:graphalgs}), the program will determine which genera, species and potentially subspecies are included in this family of organisms. It will locate all entries for this expanded search. These entries each contain the latitude and longitude at which the sightings were made. The program will then cluster geographically close results and return a list of these subgroupings, either in text format or visualized as a figure.
The researcher can now continue her scientific analysis, having easily and intelligently narrowed the dataset down to the relevant records.
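The tree recursion in this use case could be sketched as follows. This is a minimal, hypothetical illustration in Java (the class and method names are not from the proposal): starting at a family node, the program collects every genus, species and subspecies beneath it, which then drives the expanded record search.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a biological classification tree node.
// Recursing from a family node yields all taxa in its subtree.
public class TaxonNode {
    public final String name;
    public final List<TaxonNode> children = new ArrayList<>();

    public TaxonNode(String name) { this.name = name; }

    // Add a child taxon and return it, so levels can be chained.
    public TaxonNode addChild(String childName) {
        TaxonNode child = new TaxonNode(childName);
        children.add(child);
        return child;
    }

    // Names of all taxa in this subtree, including this node itself.
    public List<String> subtreeNames() {
        List<String> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(TaxonNode node, List<String> out) {
        out.add(node.name);
        for (TaxonNode c : node.children) collect(c, out);
    }
}
```

A search for the family \textit{Cyprinidae} would then expand to the names returned by `subtreeNames()` on the corresponding node.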
\section{Algorithmic Challenges}
As this project is data-driven and should be able to handle the very large datasets needed for scientific analysis, it presents many algorithmic challenges which can be broken into three categories: searching, sorting and graph algorithms.
\subsection{Searching Algorithms}
A modified form of binary search will be used for quickly locating the first of a given key in the large dataset and will be a crucial building block of all three main types of output. This will allow all entries of that type to be found to the right of that result.
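The leftmost-match modification can be sketched as follows, in Java since the records will be parsed into Java objects. The class and method names here are hypothetical: on an equal comparison the search records the hit and keeps moving left, so it returns the first index holding the key.

```java
// Hypothetical sketch: binary search for the FIRST occurrence of a key
// in a sorted array, e.g. species names sorted alphabetically.
public class FirstKeySearch {
    // Returns the index of the leftmost element equal to key, or -1 if absent.
    public static int findFirst(String[] sorted, String key) {
        int lo = 0, hi = sorted.length - 1, result = -1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = key.compareTo(sorted[mid]);
            if (cmp < 0) {
                hi = mid - 1;
            } else if (cmp > 0) {
                lo = mid + 1;
            } else {
                result = mid;  // record the match, keep searching left
                hi = mid - 1;
            }
        }
        return result;
    }
}
```

From the returned index, a linear walk to the right gathers every record with the same key.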
\subsection{Sorting Algorithms}
Sorting will be crucial for both ordering historical data in chronological order as well as being the basis for binary search to work, since it requires data to be sorted. The mergesort algorithm will be advantageous due to its fast and predictable runtime ($N\lg N$).
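As a minimal sketch of the planned sorting module (the class name is hypothetical), a standard top-down mergesort over string keys; the merge takes from the left half on ties, which keeps the sort stable so records with equal keys retain, for example, their chronological order.

```java
// Hypothetical sketch: stable top-down mergesort over string keys,
// e.g. species names, as a basis for the binary search module.
public class MergeSorter {
    public static void sort(String[] a) {
        if (a.length > 1) sort(a, new String[a.length], 0, a.length - 1);
    }

    private static void sort(String[] a, String[] aux, int lo, int hi) {
        if (lo >= hi) return;
        int mid = lo + (hi - lo) / 2;
        sort(a, aux, lo, mid);       // sort left half
        sort(a, aux, mid + 1, hi);   // sort right half
        merge(a, aux, lo, mid, hi);  // merge the two sorted halves
    }

    private static void merge(String[] a, String[] aux, int lo, int mid, int hi) {
        System.arraycopy(a, lo, aux, lo, hi - lo + 1);
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid)                                a[k] = aux[j++];
            else if (j > hi)                            a[k] = aux[i++];
            else if (aux[j].compareTo(aux[i]) < 0)      a[k] = aux[j++];
            else                                        a[k] = aux[i++]; // ties: left half first (stable)
        }
    }
}
```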
\subsection{Graph Algorithms}\label{sec:graphalgs}
Graph algorithms will be used for the advanced searching features of the program. Firstly, the biological classification of each organism forms a tree from which species in the same genus, for example, can be located. Secondly, a graph algorithm will be used to find connected components for generating output of type 3 described in section \ref{sec:subgroup}. Entries form nodes which can be connected together based on their distance to surrounding points and a breadth-first search algorithm will be used to determine connected components \citep{broder2000graph}. This is visualized in figure \ref{fig:Groupings}.
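The grouping step could be sketched as below: a minimal, hypothetical version that links sightings lying within a user-chosen radius of each other and labels connected components with breadth-first search. Planar Euclidean distance is assumed here for simplicity; real latitude/longitude data would call for a great-circle distance.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

// Hypothetical sketch of output type 3: sightings within `radius` of each
// other are implicitly connected, and BFS assigns each connected component
// a distinct group label.
public class SightingClusters {
    // Returns a component label (0, 1, 2, ...) for each {x, y} point.
    public static int[] label(double[][] points, double radius) {
        int n = points.length;
        int[] comp = new int[n];
        Arrays.fill(comp, -1);           // -1 means "not yet visited"
        int next = 0;
        for (int s = 0; s < n; s++) {
            if (comp[s] != -1) continue; // already labelled
            comp[s] = next;
            Queue<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {   // BFS over the radius graph
                int v = queue.remove();
                for (int w = 0; w < n; w++) {
                    if (comp[w] == -1 && dist(points[v], points[w]) <= radius) {
                        comp[w] = next;
                        queue.add(w);
                    }
                }
            }
            next++;
        }
        return comp;
    }

    // Euclidean distance; a production version would use great-circle distance.
    private static double dist(double[] p, double[] q) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]);
    }
}
```

This pairwise scan is quadratic in the number of sightings; a spatial index could restrict each BFS step to nearby points only.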
\begin{figure*}[t]
\centering
\begin{tabular}{ p{32mm}p{32mm}p{32mm}p{32mm} }
a) & b) & c) & d)\\
\includegraphics[width=35mm,frame=0.01cm]{Fig.png} & \includegraphics[width=35mm,frame=0.01cm]{Fig2.png} & \includegraphics[width=35mm,frame=0.01cm]{Fig3.png} & \includegraphics[width=35mm,frame=0.01cm]{Fig4.png}
\end{tabular}
\caption{A visualization of the proposed algorithm for output 3. In a), there are some species sightings that the user wishes to group. In b), a radius of similarity is chosen by the user. In c), overlapping regions determine connected points, creating a graph structure with groups of connected points. In d), the points are classified into groups by finding connected components of the graph structure. }
\label{fig:Groupings}
\end{figure*}
\clearpage
\section{Project Plan}
The following milestones will help inform our progress towards completing the goals. The team will be divided into two subteams for maximum efficiency:
\begin{table}[h]
\centering
\begin{tabular}{p{28mm}p{58mm}p{58mm}}
\toprule
\textbf{Milestone} & \textbf{Subteam A} & \textbf{Subteam B}\\
\midrule
Milestone 1 \newline (``Bedrock'') \newline(End of Week 1)
& Finished parsing module for .csv data to create Java objects that can be used for analysis; start data cleansing
& General binary search module underway \\
\midrule
Milestone 2 \newline (``Quartz'') \newline(End of Week 2)
& Cleansed the data to remove or correct entries not containing all of the columns; start generating biological classification tree module
& Finished and tested binary search; start mergesort \\
\midrule
Milestone 3 \newline (``Granite'') \newline(End of Week 4)
& Finished and tested classification tree; start data visualization or formatted text output tools
& Finished and tested mergesort; start writing query module for output types 1 and 2, using mergesort and binary search to get results from the data \\
\midrule
Milestone 4 \newline (``Sandstone'') \newline(End of Week 6)
& Continue data visualization or text output tools; start helping subteam B if needed
& Continue query module: finished output 1 \& 2; start output type 3\\
\midrule
Milestone 5 \newline (``Diamond'') \newline(End of Week 8)
& Finished data visualization or text output tools; start using them to display outputs; work on keynote presentation
& Finished output type 3; work on keynote presentation \\
\bottomrule
\end{tabular}
\end{table}
\subsection{End of Week 1: Data Parsing}
By the end of week 1, one small team should be able to parse our data into a Java-usable data structure. Another small team should start writing the general sorting and searching modules as well.
\subsection{End of Week 2: Data Cleaning}
Some entries in the data are not perfectly formatted in the correct columns. By the end of week 2, the data parsing team should have finished coming up with a plan to deal with this data, whether that means attempting to correct it or simply disregarding these datapoints.
While this schedule provides a good reference and a way to monitor progress, it may have to be modified throughout the project. For example, if a milestone is reached before its given date, the next milestone should start development early. Approximately 1--2 weeks have been purposely left as padding at the end in case of unforeseen circumstances.
\clearpage
\bibliographystyle{apa}
\bibliography{bib}
%% This BibTeX bibliography file was created using BibDesk.
%% http://bibdesk.sourceforge.net/
%% Created for Mac Outreach Admin at 2018-01-28 16:27:03 -0500
%% Saved with string encoding Unicode (UTF-8)
@article{swartzman1992spatial,
	Author = {Swartzman, Gordon and Huang, Chisheng and Kaluzny, Stephen},
	Journal = {Canadian Journal of Fisheries and Aquatic Sciences},
	Number = {7},
	Pages = {1366--1378},
	Publisher = {NRC Research Press},
	Title = {Spatial analysis of Bering Sea groundfish survey data using generalized additive models},
	Volume = {49},
	Year = {1992}}

@article{chakraborty1996stock,
	Author = {Chakraborty, S. K.},
	Journal = {Journal of the Indian Fisheries Association},
	Pages = {9--15},
	Title = {Stock assessment of sin croaker Johnieops sina (Cuvier) from Bombay waters},
	Volume = {26},
	Year = {1996}}

@article{levi1993analysis,
	Author = {Levi, Dino and Andreoli, M. G. and Giusto, G. B.},
	Journal = {Fisheries Research},
	Number = {3-4},
	Pages = {333--341},
	Publisher = {Elsevier},
	Title = {An analysis based on trawl-survey data of the state of the `Italian' stock of Mullus barbatus in the Sicilian Channel, including management advice},
	Volume = {17},
	Year = {1993}}

@book{sparre1987computer,
	Author = {Sparre, Per},
	Publisher = {Food \& Agriculture Org.},
	Title = {Computer programs for fish stock assessment: Length-based fish stock assessment for Apple II computers},
	Volume = {2},
	Year = {1987}}

@misc{michseagr2018,
	Author = {{University of Michigan} and {Michigan State University}},
	Title = {Michigan Sea Grant},
	Url = {http://www.miseagrant.umich.edu/},
	Year = {2018}}

@article{broder2000graph,
	Author = {Broder, Andrei and Kumar, Ravi and Maghoul, Farzin and Raghavan, Prabhakar and Rajagopalan, Sridhar and Stata, Raymie and Tomkins, Andrew and Wiener, Janet},
	Journal = {Computer Networks},
	Number = {1-6},
	Pages = {309--320},
	Publisher = {Elsevier},
	Title = {Graph structure in the web},
	Volume = {33},
	Year = {2000}}

@misc{usgs2018,
	Month = {1},
	Url = {https://www1.usgs.gov/obis-usa/ipt/resource?r=usgs_glsc_rvcat_trawl},
	Year = {2018}}

@article{walsh1997efficiency,
	Author = {Walsh, Stephen J.},
	Journal = {Oceanographic Literature Review},
	Number = {44},
	Pages = {748},
	Title = {Efficiency of bottom sampling trawls in deriving survey abundance indices},
	Volume = {7},
	Year = {1997}}

@misc{michigan2017,
	Author = {Kinnunen, Ronald},
	Howpublished = {Michigan State University Extension},
	Month = {February},
	Title = {Great Lakes prey fish populations declining},
	Url = {http://msue.anr.msu.edu/news/great_lakes_prey_fish_populations_declining_msg17_kinnunen17},
	Year = {2017}}