The 2017 IUPUI REU Workshop will showcase the research projects conducted during the 2017 NSF/DoD Research Experience for Undergraduates (REU) program at Indiana University-Purdue University Indianapolis. The focus of the workshop is data science and cybersecurity. The workshop will be hosted in ET-BLAH on August 4th from 9:30 a.m. to 12:30 p.m.

Review Process
Papers should be submitted by July 28 at 11:59 p.m. to Tyler Phillips, Susie Song, and Ravynne Jenkins. All submissions will be peer reviewed by three reviewers. The review process will end on August 1 at 11:59 p.m.

Paper Format
Papers should adhere to the IEEE manuscript standard and be submitted as PDFs.


General Co-Chairs:
Darshan Sangani and Claire Lee
Website Chairs:
Kelby Erickson and Mark Miller
Technical Program Chairs:
Susie Song, Ravynne Jenkins, and Tyler Phillips
Publicity Chairs:
Joevita Weah and Nash Fry
Poster Chairs:
Ivana Terziyska and Brandon Johnson


Predicting Zillow Estimation Error Using Linear Regression and Gradient Boosting

Kelby Erickson and Darshan Sangani

Abstract: Owning property is one of the most important investments that a person can make in their lifetime. Therefore, knowing the real-time value of a property accurately is crucial to making wise sales and purchases. Since the online real estate database company Zillow first developed a machine learning system to predict property sale prices in real time, it has continually worked to improve the accuracy of this prediction mechanism. In this paper, we describe our work to decrease the error of Zillow's price estimation by examining the effectiveness of several machine learning models at making property-related forecasts. Specifically, we used property data to train linear regression and gradient boosting models, with which we then made predictions about other properties. Since the gradient boosting model has numerous parameters, each with a wide range of possible values, we used grid search to optimize these parameters. Finally, we examined the effectiveness of data preprocessing techniques such as normalization, dimensionality reduction, and flattening categorical features into binary ones. While previous research in machine learning has found that normalization and dimensionality reduction generally improve forecast accuracy, we found that they did not improve the accuracy of predictions for this particular problem.
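
Two of the preprocessing steps named in the abstract, normalization and flattening categorical features into binary ones, can be illustrated with a minimal sketch. This is not the authors' code: the function names `min_max_normalize` and `one_hot`, the choice of min-max scaling, and the toy columns are all assumptions made for illustration.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Flatten a categorical column into one binary column per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical toy columns, not taken from the Zillow data.
square_feet = [1000, 1500, 2000]
heating = ["gas", "electric", "gas"]

normalized = min_max_normalize(square_feet)
binary_columns = one_hot(heating)
```

Here `normalized` holds the rescaled numeric column, and `binary_columns` maps each category to its own 0/1 column.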


Detection, Tracking and Analysis of Qbot Family of Botnets

Nash Fry

Abstract: Botnets are an immediate and serious threat to current internet infrastructure and network security. A botnet is a network of computers that have been infected with a malicious program. This malicious program allows an attacker, or "botmaster", to control an infected computer, or "bot". For a botmaster to control their bot network, bots must find and connect to a command and control (C2) server to receive commands. In this paper, we discuss a methodology for finding and tracking the Qbot family of botnets. The methodology is broken down into three steps: (1) Using third-party applications, we collect a list of IPs, ports, and hostnames of suspected C2 servers that exhibit specific properties we have identified. (2) Using the collected information, we run a custom script that acts as a fake bot, connecting to and communicating with the suspected C2 servers. (3) Finally, we use the information collected from our fake bot's connections and communications to analyze and monitor each C2 server and its activities, to better understand how C2 servers communicate with the bots in a botnet.


Development of a Mobile App for Pseudo Real-Time Peer-to-Peer Communication for Supply Chain Management

Ravynne Jenkins

Abstract: The management and visibility of supply chain events and transactions in pseudo real-time are critical for managing the substantial order volumes, consumer availability and complexity of events and transactions. The convenience and affordability of a mobile application make it ideal for managing a massive system containing various types of peers, such as the supply chain. In this paper, we implement a framework for real-time data sharing using peer-to-peer communication in an Android mobile application, Hybrid Peer-to-Peer Physical Distribution (HP3D). This implementation uses blockchain technology, public ledgers and private ledgers to ensure the security of supply chain activities as order details are exchanged from one peer to another. We present the design of the app, the methods and tools used in its development, and the overall functionality of HP3D.
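
The abstract does not describe the ledger design, so the following is only a generic sketch of the hash-chaining idea behind blockchain ledgers: each block commits to the hash of the previous one, so a later change to an earlier order record becomes detectable. The function names, the block layout, and the order fields are hypothetical and are not taken from HP3D.

```python
import hashlib
import json


def block_hash(index, prev_hash, payload):
    """Hash a block's contents deterministically with SHA-256."""
    body = json.dumps(
        {"index": index, "prev_hash": prev_hash, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(ledger, payload):
    """Append a block that commits to the hash of the previous block."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    index = len(ledger)
    ledger.append({
        "index": index,
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": block_hash(index, prev_hash, payload),
    })


def is_valid(ledger):
    """Return True if every block's hash and back-link are consistent."""
    prev_hash = "0" * 64
    for block in ledger:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["prev_hash"], block["payload"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


# Hypothetical order events exchanged between peers.
ledger = []
append_block(ledger, {"order_id": "A-1", "status": "shipped"})
append_block(ledger, {"order_id": "A-1", "status": "received"})
```

Calling `is_valid(ledger)` succeeds on the untouched ledger and fails once any stored payload is altered, because the recomputed hash no longer matches the stored one.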


Using Cancellable Electrocardiographic Templates to Authenticate and Encrypt Users

Brandon Johnson

Abstract: Biometric authentication offers a unique method of verifying a user's identity through the analysis of physiological characteristics. There has been a recent proliferation of devices that use biometric information as a means to authenticate users, such as iris scanners, fingerprint readers and electrocardiogram (ECG) readers. An ECG measures the electrical activity produced by a heart and is unique to every person. The continuous nature of a heartbeat allows ECG biometric authentication to retain user access to a device for the duration of device operation. This property increases convenience, but it also exposes user biometric information to being uncovered in an attack. Analysis of stolen ECG data could reveal sensitive user information such as an illness. This paper explores the viability of using a revocable encryption mechanism, developed by researchers from Indiana University-Purdue University Indianapolis, as a means of preserving the privacy of a user's ECG characteristics. This encryption mechanism is known as the "BioCapsule" scheme. Using "The PTB Diagnostic ECG Database," we assess the encryption performance of the BioCapsule through the implementation of various tests.


Preserving Key Features of Online Social Network Graphs Using Persistent Homology

Claire Lee

Abstract: Online Social Networks (OSNs) are simple, undirected graphs used to store information in the context of social media and emails. A common issue with OSNs with regard to graph publication is balancing the utility of a graph while satisfying the criteria of differential privacy. Thus, it is necessary to alter the data without changing the key characteristics of an OSN. Previous methods of sustaining significant features such as clustering coefficient, degree distribution, and various other graph metrics fail to give an accurate depiction of the original OSN without compromising differential privacy. Persistent homology provides a viable method for a comprehensive visual representation of the information stored in network graphs. By translating a network graph to a persistent homology barcode format, we will observe the correlation in key features between the two figures. This paper will analyze the persistent homology barcodes of OSNs across several social media and email platforms. Furthermore, it will test the stability of the generated persistence diagrams by adding small perturbations to the original network graph.
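
As a rough illustration of what a persistent homology barcode records, the sketch below computes only the 0-dimensional barcode (connected components) of a weighted graph using union-find. The abstract does not say which filtration the author uses; ordering edges by a numeric weight is an assumption here, as are the function name `h0_barcode` and the toy graph. Higher-dimensional features such as loops are not covered.

```python
import math


def h0_barcode(num_vertices, weighted_edges):
    """Compute the 0-dimensional persistence barcode of a weighted graph.

    All vertices are born at filtration value 0. Edges enter in order of
    increasing weight; each edge that merges two components ends one bar.
    """
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    bars = []
    for u, v, weight in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            bars.append((0.0, float(weight)))
    # Components that never merge persist forever.
    survivors = len({find(x) for x in range(num_vertices)})
    bars.extend([(0.0, math.inf)] * survivors)
    return bars


# Hypothetical toy graph: a weighted triangle (0, 1, 2) plus an isolated vertex 3.
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0)]
barcode = h0_barcode(4, edges)
```

Each tuple in `barcode` is a (birth, death) interval. The third triangle edge ends no bar because it joins vertices that are already connected, and the two infinite bars correspond to the triangle's component and the isolated vertex.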


Implementation and Analysis of a Revocable Fingerprint Biometric Authentication Scheme

Mark Miller

Abstract: The use of biometric authentication has become prominent in recent years and, as its adoption becomes more widespread, it is important that the biometric authentication systems in place are secure. The security of these systems is of great importance because a user's biometric information, for the most part, cannot be recreated or modified, unlike other authentication credentials such as PINs, passwords, or even identification cards, so the theft of unique biometric information could be dangerous for users. Recently, researchers at Indiana University-Purdue University Indianapolis created a new scheme for masking a user's biometric information, called the "BioCapsule" scheme. In this paper, we present an implementation of the BioCapsule scheme on a fingerprint authentication system to mask a user's sensitive fingerprint information. In this fingerprint authentication system, the original fingerprint image is preprocessed and converted into a masked fingerprint image to obfuscate a user's biometric information. Our fingerprint authentication system was tested using several fingerprint databases, both with and without the BioCapsule scheme embedded into our system.


A Cancellable and Privacy-Preserving Facial Biometric Authentication Scheme

Tyler Phillips

Abstract: In recent years, biometric, or "who you are", authentication has grown rapidly in acceptance and use. Biometric authentication offers users the convenience of not having to carry a password, PIN, smartcard, etc. Instead, users rely on their inherent biometric traits for authentication and, as a result, risk their biometric information being stolen. The security of users' biometric information is of critical importance within a biometric authentication scheme, as compromised data can reveal sensitive information: race, gender, illness, etc. A cancellable biometric scheme, the "BioCapsule" scheme, proposed by researchers from Indiana University-Purdue University Indianapolis, aims to mask users' biometric information and preserve users' privacy. In this paper, we present a facial authentication system which employs several cutting-edge techniques. We test our proposed system on several face databases, both with and without the BioCapsule scheme embedded into our system. By comparing our results, we quantify the effects that the BioCapsule scheme, and its security benefits, have on the accuracy of our facial authentication system.


Digital Immunization Surveillance: Monitoring Influenza Vaccination Rates Using Twitter

Susie Song

Abstract: Widespread and timely influenza immunizations within the United States are critical for preventing deadly and costly outbreaks. However, annual flu vaccination coverage across the United States remains low. Thus, sustained flu vaccination surveillance is important for detecting faltering vaccination rates. Early detection of anomalies in flu vaccination rates can allow faster communication between public health agencies and local governments, facilitating rapid deployment of flu vaccination campaigns. Traditional flu vaccination rate tracking systems suffer information lags because federal agencies rely on reports submitted by medical practices and because surveillance data is publicly updated only once a week. Furthermore, these approaches have limited scope, as data collection is confined to Medicare beneficiaries. Yet more patients are turning to social media platforms, such as the microblogging service Twitter, to casually report their vaccination experiences. When vaccination rates spike during times of frequent vaccine campaigns and imminent influenza spread, vaccinated individuals often use Twitter to speak about "getting a flu shot". In this paper, we offer a low-cost and fast alternative surveillance method to the flu vaccination rate surveillance system of the United States Department of Health and Human Services (HHS). We evaluate the level of concordance between the rate of Twitter posts mentioning flu vaccinations and HHS data on flu vaccination rate.
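
The abstract does not name the measure of concordance used. One common choice, assumed here purely for illustration, is the Pearson correlation coefficient between two weekly series. The series below are placeholder numbers, not Twitter or HHS data, and the variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers for illustration only; not data from the study.
weekly_tweet_rate = [1.0, 2.0, 3.0, 4.0]
weekly_vaccination_rate = [10.0, 20.0, 30.0, 40.0]

r = pearson(weekly_tweet_rate, weekly_vaccination_rate)
```

A value of `r` near 1 would indicate that the two series rise and fall together, while a value near 0 would indicate little linear relationship.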


Misinformation Trends in Social Media and the Global Terrorism Database

Ivana Terziyska

Abstract: Social media plays an important role in shaping an audience's beliefs and sentiments regarding current issues. Up to 65% of adults obtain their news from social media, and "fake news" articles are just as likely to go viral as accurate accounts, greatly impacting general perception and understanding. A comparison between significant features of objectively accurate, holistic datasets and user features of social media data would give insight into the spread of misinformation as well as ways to inform users and prevent it. We use the Global Terrorism Database (GTD), a focus of previous comparisons to diverse datasets, to highlight trends in terrorism, an extensive issue that is frequently referred to in media, and we use data obtained with the Twitter API to assess features of social media data pertinent to relevant queries. We present a connection between terrorism as recorded in the GTD and terrorism as spoken about in tweets to determine discrepancies in the trends and features of both datasets. We further extract user features, such as location, sentiment, popularity, and keywords used about recent fake news articles, to determine which features are most likely to account for discussion and propagation of misinformation. Depending on user sentiment, features are extracted to classify users who spread misinformation as well as how they view it. We create a self-organizing map (SOM) of each dataset to identify clusters present in both the GTD and Twitter data, linking our findings to broader conclusions about media bias and the spread of misinformation.


Using Twitter Streaming API to Gauge the Effects of Air Quality on Climate Change Sentiments

Joevita Weah

Abstract: Evidence has shown that climate change is a growing problem of the 21st century; however, many individuals do not fully accept this assessment. Understanding an individual's feelings towards climate change may aid in better educating the public about improved mitigation strategies for the future. In this paper, we examined the impact of health factors, such as air quality and asthma, on an individual's sentiment towards climate change. Microblogging social media services, such as Twitter, allow users to express varying levels of emotion through emoticon usage, retweeting, replying, sharing, and other actions. To analyze sentiments, we gathered data from Twitter in the form of tweets that mentioned 'climate change' and assigned each tweet a sentiment score. We categorized sentiments as positive, negative or neutral. Using a self-organizing map (SOM), we clustered related features with varying sentiments. Our approach defined features as toxins, prevalence of asthma, and sentiment. We compared different features related to air quality and asthma, as well as how these features related to sentiment scores. To further analyze the health and Twitter data, we looked at the features by state. This allowed us to compare the data and assess how health quality at the state level correlates with an individual's sentiments on social media.
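
The scoring step can be pictured with a minimal lexicon-based sketch. The abstract does not say how sentiment scores were computed, so the tiny word list, the zero thresholds, and the example tweets below are all hypothetical stand-ins rather than the author's method or data.

```python
# Hypothetical mini-lexicon; a real study would use a much larger, validated one.
LEXICON = {"good": 1, "hope": 1, "great": 1, "bad": -1, "hoax": -1, "scary": -1}


def sentiment_score(text):
    """Sum the lexicon values of the words in a tweet."""
    words = text.lower().split()
    return sum(LEXICON.get(word.strip(".,!?#'\""), 0) for word in words)


def categorize(score):
    """Map a numeric score to positive, negative, or neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up example tweets, not collected data.
tweets = [
    "Great news on climate change policy, there is hope!",
    "Climate change is a hoax.",
    "Reading about climate change today.",
]
labels = [categorize(sentiment_score(t)) for t in tweets]
```

Each tweet receives one of the three categories named in the abstract; the scores themselves could then serve as one input feature alongside the air quality and asthma data.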