Pages:
3 pages/≈825 words
Sources:
Check Instructions
Style:
APA
Subject:
Mathematics & Economics
Type:
Statistics Project
Language:
English (U.S.)
Document:
MS Word
Topic:

Excel data aggregation and validation Economics Statistics Project

Statistics Project Instructions:

The assignment is due before 11 p.m. EST on April 7th. The task is to create an Excel form. The rules are in the file I sent. If you have any questions, please log in to my website and select Spring 2020-Information in the 21st Century, then Course Content, Week 10 in the course selection. Website: Blackboard (Albany). Login info is available upon request to support.

 

 

Assignment #4 - Data Aggregation, Validation & Cleaning

 

Description:  In this homework assignment, we will get familiar with the process of transforming data to solve real-world problems.  First, we’ll start with a process of collecting data.  We’ll learn a common technique for collecting data from users.  Next, we’ll look at how we can aggregate this data to answer specific real-world questions.  Remember, data by itself serves no purpose.  We need to transform data into information and knowledge to drive decision-making and real-world value.

 

You will also familiarize yourself with how to prevent “garbage” data from being used in your data analysis or research.  You will first attempt to prevent erroneous data from being entered by a user by applying data validation.  Next, you will assume that garbage data has been submitted anyway and apply data cleaning techniques to fix known problems.

 

Readings: This homework is primarily based on the following chapters.

●     Chapter 2: About Data Aggregation

●     Chapter 7: Getting the Data Ready for Cleaning

●     Chapter 8: Data Cleaning

 

Learning Objectives:

●     Learn how to collect external data on the web (via a web form)

●     Apply data aggregation techniques to make the collected raw data more usable

●     Apply pivot tables / data aggregation to solve real-world problems

●     Understand the trade-offs between data validation and post-process data cleaning

●     Apply regular expressions to prevent erroneous data entry during data validation

●     Understand and apply data cleaning at various stages (entry, validation and post-processing)

●     Leverage and apply sorts and filters to manage large data sets.

●     Leverage functions to complete tasks

●     Read and apply technical documentation to achieve goals

 

 

Need Help???

If you are ever stuck or confused on an assignment, remember that help is available. Check out the Ask-A-Question section on Blackboard for the resources.  However, please be considerate and follow the two guidelines below:

  1. Make a serious attempt to understand and complete the assignment before seeking help. Our suggestions will make a lot more sense if you have taken a serious look and attempt at the assignment. 
  2. Do not wait until the last minute.  It takes a lot of time to provide quality responses. If you wait until an assignment is almost due, it may be difficult/impossible to provide a quality response. 

 

Part #1: Build a Google Form

 

Imagine that you are hosting a public event.  For this event, you need to know who is attending, their contact information, and some details about their registration. It is incredibly difficult, complex, and time-consuming to manage this data over email, especially if you have a large number of potential attendees.

 

Based on this scenario, let’s look at a more efficient way to collect this data.  In this section, you’ll build a Google Form that will allow you to collect data from anyone over the web. 

 

First, if you have a Google Account, go to Google Drive and log on using your account.

http://drive.google.com

 

If you do not have a Google Account go to the following link to create it before proceeding.

https://support.google.com/accounts/answer/27441?hl=en

 

After your account is created, go to Google Drive and log on using your account.

http://drive.google.com

 

 

Once on Google Drive, click the Create button and select “Form”.

 

 

 

 

 

 

Create a meaningful title for the Google Form.  Make sure your name is included as part of the title. 

 

 

Create a question field for “First Name” as type “Short Answer”

 

 

Now that you have learned to create a question field, create all of the following:

Field Name        | Type               | Options
First Name        | Short Answer       | N/A
Last Name         | Short Answer       | N/A
Role              | Choose from a list | Undergraduate Student; Graduate Student; Faculty
Phone Number      | Short Answer       | N/A
Email             | Short Answer       | N/A
Meal Choice       | Multiple Choice    | Chicken; Beef; Vegetarian
Attendance Dates  | Checkboxes         | Day 1; Day 2; Day 3
Notes / Comments  | Paragraph Text     | N/A

 

 

Once you have created the fields above, click “View Live Form”.

 

 

If you did it correctly, your form should look similar to this:  https://docs.google.com/forms/d/1nujPgdlMjEEflPErunyyVxyuqQlIpbUgVHH65jmCL2w/viewform

 

Enter some fake data into the form fields and hit submit.

 

 

 

Go back to the editable Google Form and click “View Responses”.  You should be able to see the data that you entered in a spreadsheet.

 

 

Web forms are a quick and powerful way of collecting data over the web.  We will expand on this example over the next couple of weeks.  We’ll look deeper into the decisions that occur when collecting data, and how you can avoid potential problems and headaches. Avoiding problems early is critical to saving time and money. For now, proceed to Part #2.

 

 

Part #2: Setting up Data Validation Rules

 

 

Step #1: Edit the question for the “First Name” field.  Make it a required question.  This forces the user to enter data here.  Also, check the option for “Data Validation”.  Select Text, Does Not Match and “[0-9]”.  Provide a meaningful error message.  This prevents the user from submitting a name that contains a digit.  This makes sense, since valid names should not contain any numbers.  In other words, you wouldn’t want to accept the name “B1ll” or “Su3”.
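To see what this rule is meant to accept and reject, here is a minimal Python sketch of the same check. It assumes the rule should reject any name containing a digit, as the step describes; the helper name `is_valid_name` is hypothetical, and treating whitespace-only input as empty is an assumption, not documented Google Forms behavior.

```python
import re

# Same pattern the step enters in the form: any single digit.
DIGIT = re.compile(r"[0-9]")

def is_valid_name(name: str) -> bool:
    """Hypothetical mirror of 'required' plus 'does not match [0-9]'."""
    # Required question: reject empty input (whitespace-only is an assumption).
    if not name.strip():
        return False
    # The name is valid only when no digit appears anywhere in it.
    return DIGIT.search(name) is None

for candidate in ["Bill", "B1ll", "Su3", ""]:
    print(repr(candidate), is_valid_name(candidate))
```

Trying a few good and bad names in the live form in the same way is a quick test of the rule you configured.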

 

 

 

 

Step #2: Repeat the step above for the Last Name Field.

 

Step #3: Force the user to enter a valid phone number.  There are a number of ways to implement this data validation rule.  In this assignment, let’s keep it simple.  Let’s just force the user to enter a 10-digit number. This can be implemented by the rule that the number must be between 1,000,000,000 and 9,999,999,999.
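The range rule can be sketched in Python to show which inputs it lets through. This is only an illustration of the rule as stated; the helper name `is_valid_phone` is hypothetical, and rejecting non-digit input such as dashes is an assumption about how a number-only field behaves.

```python
MIN_PHONE = 1_000_000_000   # smallest 10-digit number
MAX_PHONE = 9_999_999_999   # largest 10-digit number

def is_valid_phone(text: str) -> bool:
    """Hypothetical check: the input must be a number within the range."""
    # Assumption: anything other than plain ASCII digits is not a number here.
    if not (text.isascii() and text.isdigit()):
        return False
    return MIN_PHONE <= int(text) <= MAX_PHONE

for candidate in ["5184421234", "518442123", "0123456789", "518-442-1234"]:
    print(candidate, is_valid_phone(candidate))
```

One consequence of the simple range rule: a 10-digit entry that starts with 0 is read as a 9-digit number and falls below the minimum, so it is rejected.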

 

 

Step #4: For the email field, ensure that it only accepts valid email addresses.  This functionality is built into Google Forms.

 

 

Step #5:  Make sure to test all of your regular expression rules.  The web form should prevent you from entering email addresses such as “billgatesATmicrosoft.com” or “markZ@facebook”. 
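As a rough picture of what an email check does, the sketch below tests the two bad examples from this step. The pattern is a deliberately simple assumption (something, an @ sign, something, a dot, something); it is not the rule Google Forms actually uses, and the helper name `looks_like_email` is hypothetical.

```python
import re

# Simplified shape: text, "@", text, ".", text, with no spaces or extra "@".
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def looks_like_email(text: str) -> bool:
    """Hypothetical shape check; far looser than real email validation."""
    return EMAIL_SHAPE.fullmatch(text) is not None

for candidate in ["billgatesATmicrosoft.com", "markZ@facebook", "markZ@facebook.com"]:
    print(candidate, looks_like_email(candidate))
```

The first example fails because it has no @ sign, and the second fails because nothing after the @ sign contains a dot.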

 

Notice that these rules can prevent a lot of inaccurate and erroneous data from being submitted.  In real-world applications this saves A LOT of time and energy.  However, there will inevitably be some data that gets entered incorrectly, no matter how thoroughly you build your rules.  Although we can catch the basic errors using data validation and regular expressions, let’s look at how we can use post-processing techniques to clean data after it has been submitted.

 

Part #3: Data Cleaning

 

Let’s imagine that you have shared your Google Form with the world.  You are able to collect a large number of conference registrations over the web, without requiring individuals to send emails or place phone calls.

 

Now, based on the data, you want to make some key decisions relating to planning your event. However, before we start drawing conclusions, we need to fix problems with the data.  This is called data cleaning and is the focus of this section of the homework.

 

First, download this sample dataset.  This is basically “fake” data from hundreds of people registering through a Google Form for the event.

 

DATA: https://drive.google.com/file/d/0B93qQUQZO0GJUTc3blZlUGcwWW8/view?usp=sharing

 

Since you want to open this data in Excel, you’ll want to click the download button (shown below)

 

 

After downloading the data, you’ll begin the process of “cleaning” the dataset.

 

Although we want to prevent the user from entering as much invalid data as possible, it is nearly impossible to prevent this from happening 100% of the time.  Therefore, in today’s data-driven economy, it is essential to know how to clean datasets.  In this section, we will learn some applied post-processing techniques for cleaning the data in Excel.

 

Assume: Let’s assume that our data rules from the steps above did not prevent users from entering certain types of invalid email addresses.

 

GOAL:  Therefore, our goal is to remove or fix invalid email addresses.  Let’s apply the following two rules, one to remove bad addresses and one to repair them.

 

            RULE 1: Remove Email Addresses that equal “@albany.edu”

            RULE 2: Fix Email Addresses by removing spaces

 

NOTE:  This does not remove or fix ALL invalid email addresses. However, it allows us to quickly fix known problems in the dataset.  Let’s be clear about the purpose of this part of the homework.  The purpose is to fix or remove “dirty” data.  Dirty data causes problems ranging from incorrect analysis to faulty decision-making.

 

Step #6: Use a Filter to remove invalid email addresses.

Click on the first row on the dataset, then click the Filter icon (which is under the Data tab)

 

 

Step #7: Create a filter that selects only the records that have email addresses equal to “@albany.edu”

 

 

Step #8: Replace the records with some data that clearly indicates that there is no email address for the person on file.  The fastest way to do this is to type “no-email-address” in the first filtered row, then click the fill handle (the tiny square in the lower right-hand corner of the cell) and drag it down to the last row.  This will replace all of the “@albany.edu” records with “no-email-address”.
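Steps #7 and #8 together amount to "find every record whose email equals “@albany.edu” and overwrite it". A minimal Python sketch of that logic follows; the sample rows, the column name `Email`, and the helper name `mark_missing_emails` are hypothetical and are not taken from the assignment dataset.

```python
PLACEHOLDER = "no-email-address"

def mark_missing_emails(rows, column="Email"):
    """Return copies of the rows with bare '@albany.edu' emails replaced."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the original data stays untouched
        # Same condition as the filter: the cell equals "@albany.edu".
        if fixed[column] == "@albany.edu":
            fixed[column] = PLACEHOLDER
        cleaned.append(fixed)
    return cleaned

# Hypothetical sample records, for illustration only.
rows = [
    {"First Name": "Ann", "Email": "ann@albany.edu"},
    {"First Name": "Bob", "Email": "@albany.edu"},
]
for row in mark_missing_emails(rows):
    print(row["First Name"], row["Email"])
```

Only the record whose address is exactly “@albany.edu” changes; complete addresses that merely end in “@albany.edu” are left alone, which is also what the Excel filter should select.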

 

 

Step #9:  Next, let’s fix email addresses that contain spaces.  Let’s make an assumption that by deleting the spaces, we will fix the email addresses.  This type of error often occurs when the data was entered by some sort of automated software code.  It is common for the code to have a slight flaw (or bug) in the logic that enters data incorrectly under certain circumstances.  In this case, the flaw occurs when there is a space in the user’s last name.

 

First, let’s create a new Column that will contain our fixed email addresses.

 

 

 

 

Step #10:  Next, read the documentation on the SUBSTITUTE function.  This will show you how to remove spaces from existing cells.  Implement this in the new column.  Note: the exact way to use the SUBSTITUTE function is intentionally not provided. It is very useful to learn how to read documentation and troubleshoot solutions.

 

Hint:  You want to substitute every space (“ ”) with empty text (“”).

 

https://support.office.com/en-us/article/SUBSTITUTE-function-6434944e-a904-4336-a9b0-1e58df3bc332
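Without giving away the Excel formula, the idea behind the hint can be sketched in Python, where the equivalent operation is `str.replace`. The sample address and the helper name `remove_spaces` are hypothetical.

```python
def remove_spaces(email: str) -> str:
    """Replace every space with empty text, as the hint describes."""
    return email.replace(" ", "")

# Hypothetical address broken by a space in the last name.
broken = "pat.van dyke@albany.edu"
print(remove_spaces(broken))
```

Writing the result to a new column, as Step #9 sets up, keeps the original values available for comparison.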

 

Part #4:  Data Aggregation in Excel

 

After cleaning the data, now you are going to aggregate the data to answer some key questions about the event. 

 

To aggregate the data, create a Pivot Table (shown below).  Recall that a pivot table is a summarization table.  In other words, it holds aggregated data that can answer key questions and aid in the visualization of information.

 

 

Creating a Pivot Table for Mac Users: https://www.youtube.com/watch?v=l3ZhWvQUx5g

 

We want to use the Pivot Table to answer a basic question: how much food do we need to have prepared for the different event days?

 

To answer that question, let’s create the Pivot table with the following settings.

 

 

 

If you did it correctly, it should look similar to the table below.  Note that this table answers the earlier question: how much food do we need to order for each day of the conference?  For example, we need 29+17 = 46 Beef dishes on Day 2 of the conference.
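The counting that the Pivot Table performs can be sketched in Python with `collections.Counter`. The sample rows are hypothetical, and storing several attendance days in one comma-separated cell is an assumption about the data layout; the assignment dataset may be organized differently.

```python
from collections import Counter

# Hypothetical registrations, for illustration only.
rows = [
    {"Meal Choice": "Beef", "Attendance Dates": "Day 1, Day 2"},
    {"Meal Choice": "Beef", "Attendance Dates": "Day 2"},
    {"Meal Choice": "Chicken", "Attendance Dates": "Day 2, Day 3"},
    {"Meal Choice": "Vegetarian", "Attendance Dates": "Day 1"},
]

def meals_per_day(rows):
    """Count how many of each meal are needed on each day."""
    counts = Counter()
    for row in rows:
        # Assumption: one cell lists every day attended, separated by commas.
        for day in row["Attendance Dates"].split(","):
            day = day.strip()
            if day:
                counts[(day, row["Meal Choice"])] += 1
    return counts

counts = meals_per_day(rows)
for (day, meal), needed in sorted(counts.items()):
    print(day, meal, needed)
```

Each (day, meal) pair plays the role of one cell in the Pivot Table, and adding cells together, as in the 29+17 example, gives the total for a meal on a given day.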

 

 

 

This saves a lot of time vs. counting all the rows, right?!?

 

 

Summary: OK, you’ve finished the homework. Let’s recap what you have accomplished.  First, in Part #1, you learned how to collect data from a large number of users.  Next, in Part #2, you learned how to incorporate data validation rules that prevent “garbage” data from being entered by the user.  These rules were mostly implemented using regular expressions.  Then, in Part #3, you learned a couple of methods for cleaning “garbage” data after it has been entered.  As mentioned previously, it is nearly impossible to prevent all dirty data from being entered, although you want to catch as much as reasonably possible.  As you can see from this example, it is much harder to fix problems (e.g., invalid email addresses) during post-processing.  Last, in Part #4, you aggregated the data to provide valuable information.

 

 

Submission Checklist:

●     Include the link to your publicly accessible Google Form on Blackboard (from Parts 1 & 2).  This should be the Google Form page and not the data itself.

●     Make sure the link above can be accessed without being logged in (try it on another computer, or sign out of Google and try the link).  This means setting the permissions to anyone who has the link.

●     Attach the Excel document (from Parts 3 & 4) on Blackboard in the Assignment #4 Upload Link.

 

 

 

Statistics Project Sample Content Preview:
Assignment No. 4 Steps in Data Cleaning and Validation
Credentials:
Email Address:
Password:
Google Drive link:
https://drive.google.com/open?id=1KsfgTDH0tSJ9oesPLv4x7GyhMhuUt8RXM9KWvTBIpCI
Part 1
Step 1. Creating email account and google forms
Step 2. Adding Fake comments:
Step 3. Spreadsheet containing “fake comment” responses
P...