Data Analysis with Python: Zero to Pandas

"Data Analysis with Python: Zero to Pandas" is a practical and beginner-friendly introduction to data analysis covering the basics of Python, Numpy, Pandas, Data Visualization, and Exploratory Data Analysis.

  • Watch hands-on coding-focused video tutorials
  • Practice coding with cloud Jupyter notebooks
  • Build an end-to-end real-world course project
  • Earn a verified certificate of accomplishment

There are no prerequisites for this course.

Lesson 1 - Introduction to Programming with Python Preview

  • First steps with Python & Jupyter notebooks
  • Arithmetic, conditional & logical operators in Python
  • Quick tour with Variables and common data types

Lesson 2 - Next Steps with Python Preview

  • Branching with if, elif, and else
  • Iteration with while and for loops
  • Write reusable code with Functions
  • Scope of variables and exceptions

Assignment 1 - Python Basics Practice Preview

  • Solve word problems using variables & arithmetic operations
  • Manipulate data types using methods & operators
  • Use branching and iterations to translate ideas into code
  • Explore the documentation and get help from the community

Lesson 3 - Numerical Computing with Numpy

  • Going from Python lists to Numpy arrays
  • Working with multi-dimensional arrays
  • Array operations, slicing and broadcasting
  • Working with CSV data files

Assignment 2 - Numpy Array Operations

  • Explore the Numpy documentation website
  • Demonstrate usage of 5 Numpy array operations
  • Publish a Jupyter notebook with explanations
  • Share your work with the course community

Lesson 4 - Analyzing Tabular Data with Pandas

  • Reading and writing CSV data with Pandas
  • Querying, filtering and sorting data frames
  • Grouping and aggregation for data summarization
  • Merging and joining data from multiple sources

Assignment 3 - Pandas Practice

  • Create data frames from CSV files
  • Query and index operations on data frames
  • Group, merge and aggregate data frames
  • Fix missing and invalid values in data

Lesson 5 - Visualization with Matplotlib and Seaborn

  • Basic visualizations with Matplotlib
  • Advanced visualizations with Seaborn
  • Tips for customizing and styling charts
  • Plotting images and grids of charts

Course Project - Exploratory Data Analysis

  • Find a real-world dataset of your choice online
  • Use Numpy & Pandas to parse, clean & analyze data
  • Use Matplotlib & Seaborn to create visualizations
  • Ask and answer interesting questions about the data

Lesson 6 - Exploratory Data Analysis - A Case Study

  • Finding a good real-world dataset for EDA
  • Data loading, cleaning and preprocessing
  • Exploratory analysis and visualization
  • Answering questions and making inferences

Certificate of Accomplishment

Earn a verified certificate of accomplishment (sample) by completing all weekly assignments and the course project. The certificate can be added to your LinkedIn profile, linked from your resume, and downloaded as a PDF.

Instructor - Aakash N S

Aakash N S is the co-founder and CEO of Jovian. Previously, Aakash worked as a software engineer (APIs & Data Platforms) at Twitter in Ireland & San Francisco and graduated from the Indian Institute of Technology, Bombay. He’s also an avid blogger, open-source contributor, and online educator.

Course FAQs

If you have general questions about the course, please browse through this list first. Click/tap on a question to expand it and view the answer. If there’s something that’s not answered here, please reply to this topic with your question. For lecture & assignment related queries, please ask questions on the respective threads.

Data Analysis with Python: Zero to Pandas is an online course intended to provide a coding-first introduction to data analysis.

The course takes a hands-on coding-focused approach and will be taught using live interactive Jupyter notebooks, allowing students to follow along and experiment. Theoretical concepts will be explained in simple terms using code. Participants will receive weekly assignments and work on a project with a real-world dataset to test their skills. Upon successful completion of the course, participants will receive a certificate of completion.

The following topics are covered:

  • Python & Jupyter Fundamentals
  • Numpy for data processing
  • Pandas for working with tabular data
  • Visualization with Matplotlib and Seaborn
  • Exploratory Data Analysis: A Case Study

The course is called “Zero to Pandas” because it assumes no prior knowledge of Python (i.e. you can start from zero), and by the end of the five weeks, you’ll be comfortable performing data analysis with Python.

Access the Course Syllabus for more details.

This course runs for 6 weeks. You can enroll, watch the session recordings, and submit the assignments and course project during this period. The course team will evaluate your submissions, and you will receive the certificate upon successful completion of all assignments and the project.

This is a beginner-friendly course, and no prior knowledge of Data Science or Python is assumed. You DON’T require a college degree (B.Tech, Masters, PhD etc.) to participate in this course.

You do need to have a computer (laptop/desktop) with a good internet connection to watch the video lectures, run the code online, and participate in the forum discussions.

To become eligible for a “Certificate of Completion”, you need to satisfy all of the following criteria:

  • Make valid submissions for all 3 weekly assignments in the course (the course team will evaluate & accept/reject submission)
  • Make a valid submission towards the course project
  • Do not violate the Code of Conduct

More details regarding the assignments and the course project will be shared during the course. Please note that we reserve the right to withhold/cancel any participant’s certificate if we are not satisfied with the quality of their submissions or find them in violation of the Code of Conduct and Academic Honesty Policy .

The Certificate of Completion will be issued by Jovian . Please note that Jovian is not a registered educational institution, and this certificate will not count towards your higher education/college credits. The certificate simply indicates that you have completed all the required coursework for this course. Moreover, Jovian reserves the right to withhold/cancel any participant’s certificate if we are not satisfied with the quality of their submissions or find them in violation of the Code of Conduct.

Video lectures are available on the course page. Go to zerotopandas.com and open the particular lesson. You will find the video inside the lesson page.

No, you do not need to install any additional software on your computer to participate in this course. You just need a computer (laptop/desktop) with a working internet connection and a modern web browser (like Google Chrome or Firefox) to watch the lectures, participate in forum discussions, and complete the assignments.

You will be able to do all the assignments using free online computing platforms that you can access from your web browser. More details about these will be shared during the video lectures and on the individual assignment threads.

No, this course does not require a Graphics Processing Unit (GPU). Even if you do need a GPU later, you don't have to buy one: online platforms like Google Colab provide free access to GPUs for a limited amount of time every week, and the free tier should be sufficient for this course.

The lectures will be taught using Jupyter notebooks, a browser-based interactive programming environment. The lecture notebooks and assignments will be shared using Jovian, a platform for sharing Jupyter notebooks and data science projects. You will be able to run the shared Jupyter notebooks directly from Jovian.

The coursework should not take up more than 8-10 hours per week. If you’re able to do it in less time, that’s great.

In general, even if you’re a full-time student or working professional, you should be able to follow along and complete the coursework comfortably, if you remain motivated.

Sure, you can audit the course by just viewing the video lectures, but we highly recommend that you try out the assignments and put in the work required to earn a certificate. Doing the assignments will help you apply the concepts and get hands-on experience with data analysis. Interactive Jupyter notebooks are a great way to learn & experiment with the code, and we’ve put in a lot of effort to prepare these resources for you. We hope you will find it worthwhile to do the assignments & exercises.

No, there is no textbook for this course. This course is taught entirely using Jupyter notebooks, which include a fair bit of explanation along with code, graphs, links to references, etc. We will provide links to reading material, blog posts & other free resources online.

The instructor for this course is Aakash N S. Aakash is the co-founder and CEO of Jovian, a platform to learn data science & machine learning. Jovian is also a project management and collaboration platform for Jupyter notebooks. Prior to starting Jovian, Aakash worked as a software engineer (APIs & Data Platforms) at Twitter in Ireland & San Francisco and graduated from IIT Bombay. He’s also a Competitions Expert on Kaggle, an avid blogger, open-source contributor, and online educator.

  • Assignments will require completing tasks such as creating a Jovian notebook, writing a blog post, etc.
  • Assignment submission can be done on the assignment page using the Jovian notebook link.
  • Some assignments are automated, which means they will be evaluated automatically. A few assignments will be evaluated by the course team.
  • You will be graded as "PASS" or "FAIL" based on the assignment evaluation. If you receive a "FAIL" grade, you will get a chance to work on the assignment again and resubmit it.

More details about the submission will be provided in the individual topics for each assignment.

Depending on the type of question, please choose one of the following:

If you have questions on any topic covered in a lecture/assignment, you can post them on the respective lesson's discussion page. Someone from the course team or the community will try to answer your question. Before asking, please scroll through the thread to check if your question has already been asked/answered.

If you have questions about the course itself, you can post your question on the discussion page of the course.

  • You can also ask your question in the zerotopandas channel of Jovian community slack group.

If you do not want to ask a question publicly or need more assistance, you can send an email to [email protected] , and someone from the course team will respond to you over email.

We recommend asking questions on the discussion pages, since in many cases other members of the community will be able to answer questions faster than we can, and your question will also be useful for others. Remember, no question is too simple to ask.

Yes, please spread the word and invite your friends to join in.

We expect all participants to follow the Code of Conduct , and we take harassment and abuse very seriously. Please reach out to us at [email protected] if you are a victim of harassment/abuse by another user, and we’ll investigate the matter and take strict action immediately. Once verified, we will remove the participant from the course, and for more serious matters, report it to relevant authorities.


Python Programming

Practice Python Exercises and Challenges with Solutions

Free coding exercises for Python developers. Exercises cover Python basics, data structures, and data analytics. As of now, this page contains 18 exercises.

What is included in these Python exercises?

Each exercise contains questions on a specific Python topic that you need to practice and solve. These free exercises are essentially Python assignments where you solve different programs and challenges.

  • All exercises are tested on Python 3.
  • Each exercise has 10-20 Questions.
  • The solution is provided for every question.
  • Practice each Exercise in Online Code Editor

These Python programming exercises are suitable for all Python developers. If you are a beginner, you will have a better understanding of Python after solving these exercises. Below is the list of exercises.

Select the exercise you want to solve.

Basic Exercise for Beginners

Practice and quickly learn Python’s necessary skills by solving simple questions and problems.

Topics : Variables, Operators, Loops, String, Numbers, List

Python Input and Output Exercise

Solve input and output operations in Python. Also, we practice file handling.

Topics: print() and input(), file I/O

Python Loop Exercise

This Python loop exercise aims to help developers to practice branching and Looping techniques in Python.

Topics: if-else statements, for loops, and while loops.

Python Functions Exercise

Practice how to create a function, nested functions, and use the function arguments effectively in Python by solving different questions.

Topics : Functions arguments, built-in functions.

Python String Exercise

Solve Python String exercise to learn and practice String operations and manipulations.

Python Data Structure Exercise

Practice widely used Python types such as List, Set, Dictionary, and Tuple operations in Python

Python List Exercise

This Python list exercise aims to help Python developers to learn and practice list operations.

Python Dictionary Exercise

This Python dictionary exercise aims to help Python developers to learn and practice dictionary operations.

Python Set Exercise

This exercise aims to help Python developers to learn and practice set operations.

Python Tuple Exercise

This exercise aims to help Python developers to learn and practice tuple operations.

Python Date and Time Exercise

This exercise aims to help Python developers to learn and practice DateTime and timestamp questions and problems.

Topics : Date, time, DateTime, Calendar.

Python OOP Exercise

This Python Object-oriented programming (OOP) exercise aims to help Python developers to learn and practice OOP concepts.

Topics : Object, Classes, Inheritance

Python JSON Exercise

Practice and Learn JSON creation, manipulation, Encoding, Decoding, and parsing using Python

Python NumPy Exercise

Practice NumPy questions such as Array manipulations, numeric ranges, Slicing, indexing, Searching, Sorting, and splitting, and more.

Python Pandas Exercise

Practice Data Analysis using Python Pandas. Practice Data-frame, Data selection, group-by, Series, sorting, searching, and statistics.

Python Matplotlib Exercise

Practice Data visualization using Python Matplotlib. Line plot, Style properties, multi-line plot, scatter plot, bar chart, histogram, Pie chart, Subplot, stack plot.

Random Data Generation Exercise

Practice and Learn the various techniques to generate random data in Python.

Topics : random module, secrets module, UUID module

Python Database Exercise

Practice Python database programming skills by solving the questions step by step.

Use any of MySQL, PostgreSQL, or SQLite to solve the exercises.

Exercises for Intermediate developers

The following practice questions are for intermediate Python developers.

If you have not solved the above exercises, please complete them to understand and practice each topic in detail. After that, you can solve the below questions quickly.

Exercise 1: Reverse each word of a string

Expected Output

  • Use the split() method to split a string into a list of words.
  • Reverse each word from the list.
  • Finally, use the join() function to convert the list back into a string.

Steps to solve this question:

  • Split the given string into a list of words using the split() method
  • Use a list comprehension to create a new list by reversing each word from a list.
  • Use the join() function to convert the new list into a string
  • Display the resultant string
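A minimal sketch of one possible solution; the input sentence below is a hypothetical example, since the exercise doesn't show its input:

```python
# Reverse each word of a string (input string is a hypothetical example)
sentence = "My Name is Jessa"

words = sentence.split()                   # split into a list of words
reversed_words = [w[::-1] for w in words]  # reverse each word
result = " ".join(reversed_words)          # join the list back into a string
print(result)                              # yM emaN si asseJ
```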

Exercise 2: Read text file into a variable and replace all newlines with space

Given: Assume you have the following text file (sample.txt).

Expected Output :

  • First, read a text file.
  • Next, use string replace() function to replace all newlines ( \n ) with space ( ' ' ).

Steps to solve this question:

  • First, open the file in read mode.
  • Next, read all content from the file using the read() function and assign it to a variable.
  • Replace all newline characters with spaces using the replace() function.
  • Display the final string.
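A short sketch of these steps, assuming sample.txt sits in the working directory:

```python
# Read sample.txt and replace every newline with a space
with open("sample.txt", "r") as fp:
    content = fp.read()                 # read the whole file into one string

flattened = content.replace("\n", " ")  # swap newlines for spaces
print(flattened)
```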

Exercise 3: Remove items from a list while iterating

Description :

In this question, You need to remove items from a list while iterating but without creating a different copy of a list.

Remove numbers greater than 50

Expected Output:

  • Get the list's size
  • Iterate list using while loop
  • Check if the number is greater than 50
  • If yes, delete the item using a del keyword
  • Reduce the list size

Solution 1: Using while loop

Solution 2: Using for loop and range()
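A sketch of both approaches; the input list is hypothetical, since the exercise doesn't show one:

```python
# Remove numbers greater than 50 while iterating, without copying the list
numbers = [10, 20, 54, 60, 75, 90, 45]   # hypothetical input

# Solution 1: while loop with an index and del
i = 0
size = len(numbers)
while i < size:
    if numbers[i] > 50:
        del numbers[i]    # delete the item in place
        size -= 1         # the list shrank, so reduce the size
    else:
        i += 1
print(numbers)            # [10, 20, 45]

# Solution 2: for loop over indexes in reverse using range()
numbers = [10, 20, 54, 60, 75, 90, 45]
for i in range(len(numbers) - 1, -1, -1):
    if numbers[i] > 50:
        del numbers[i]
print(numbers)            # [10, 20, 45]
```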

Exercise 4: Reverse Dictionary mapping

Exercise 5: Display all duplicate items from a list

  • Use the Counter() class from the collections module.
  • Create a dictionary that will maintain the count of each item in the list. Next, fetch all keys whose count is greater than 1.

Solution 1: Using collections.Counter()

Solution 2:
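A sketch of both solutions, again with a hypothetical input list:

```python
from collections import Counter

sample = [10, 20, 60, 30, 20, 40, 30, 60, 70, 80]   # hypothetical input

# Solution 1: Counter keeps a count of every item
counts = Counter(sample)
print([item for item, count in counts.items() if count > 1])   # [20, 60, 30]

# Solution 2: build the count dictionary manually
counts = {}
for item in sample:
    counts[item] = counts.get(item, 0) + 1
print([item for item, count in counts.items() if count > 1])   # [20, 60, 30]
```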

Exercise 6: Filter dictionary to contain keys present in the given list

Exercise 7: Print the following number pattern

Refer to Print patterns in Python to solve this question.

  • Use two for loops
  • The outer loop is a reverse for loop from 5 to 0
  • Increment the value of x by 1 in each iteration of the outer loop
  • The inner loop will iterate from 0 to the value of i of the outer loop
  • Print the value of x in each iteration of the inner loop
  • Print a newline at the end of each outer loop iteration
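The expected pattern itself isn't shown here, so the sketch below assumes, from the hints, rows of a repeated incrementing number with one fewer repetition per row:

```python
# Assumed pattern (inferred from the hints above):
# 1 1 1 1 1
# 2 2 2 2
# 3 3 3
# 4 4
# 5
x = 0
for i in range(5, 0, -1):     # outer loop counts down from 5
    x += 1                    # increment x on every outer iteration
    for _ in range(i):        # inner loop runs i times
        print(x, end=" ")
    print()                   # newline at the end of each row
```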

Exercise 8: Create an inner function

Question description:

  • Create an outer function that will accept two strings, x and y (e.g., x = 'Emma' and y = 'Kelly').
  • Create an inner function inside the outer function that will concatenate x and y.
  • Finally, the outer function will join the word 'developer' to the result.
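A minimal sketch; the exact output formatting (spacing around 'developer') is an assumption:

```python
def outer_fun(x, y):
    def inner_fun(a, b):
        return a + b                 # inner function concatenates the strings

    joined = inner_fun(x, y)
    return joined + " developer"     # outer function joins the word 'developer'

print(outer_fun("Emma", "Kelly"))    # EmmaKelly developer
```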

Exercise 9: Modify the element of a nested list inside the following list

Change the element 35 to 3500

Exercise 10: Access the nested key increment from the following dictionary


7 Datasets to Practice Data Analysis in Python


Data analysis is a skill that is becoming more essential in today's data-driven world. One effective way to practice with Python is to take on your own data analysis projects. In this article, we’ll show you 7 datasets you can start working on.

Python is a great tool for data analysis – in fact,  it has become very popular, as we discuss in Python’s Role in Big Data and Analytics . For Python beginners to become proficient in data analysis, they need to develop their programming and analysis knowledge. And the best way to do this is by creating your own data analysis projects.

Doing projects gives you a deep understanding of Python as well as the entire data analysis process. We’ve discussed this process in our Python Exploratory Data Analysis Cheat Sheet . It’s important to learn how to effectively explore different kinds of datasets – numerical, image, text, and even audio data.

But the first step is getting your hands on data, and it isn’t always obvious how to go about this. In this article, we’ll provide you with 7 datasets that you can use to practice data analysis in Python. We’ll explain what the data is, what it can be used for, and show you some code examples to get you on your feet. The examples will range from beginner-friendly to more advanced datasets used for deep learning.

For those looking for some beginner friendly Python learning material, I recommend our Learn Programming with Python track. It bundles together 5 courses, all designed to teach you the fundamentals. For the aspiring data scientists, our Introduction to Python for Data Science course contains 141 interactive exercises. If you just want to try things out, our article 10 Python Practice Exercises for Beginners with Detailed Solutions contains exercises from some of our courses.

7 Free Python Datasets

Diabetes Dataset

The Diabetes dataset from scikit-learn is a collection of 442 patient medical records from a diabetes study conducted in the US. It contains 10 variables, including age, sex, body mass index, average blood pressure, and six blood serum measurements. The data was collected by the National Institute of Diabetes and Digestive and Kidney Diseases.

Here’s how to load the dataset into a pandas DataFrame and print the first couple of rows of some of the variables:
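A minimal sketch of that loading step using scikit-learn's load_diabetes helper (the as_frame option assumes a reasonably recent scikit-learn version):

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)   # return the data as pandas objects
df = diabetes.frame                       # features plus the 'target' column

print(df[["age", "sex", "bmi", "target"]].head())
```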

Here you can see the age, sex and body mass index. These variables have already been preprocessed to have a mean of zero and a standard deviation of one. The target is a quantitative measure of disease progression. To get started with a correlation analysis of some of the features in the dataset, do the following:
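One way to start, assuming the DataFrame df from the previous snippet:

```python
# Pairwise correlations between a few features and the progression target
print(df[["age", "sex", "bmi", "bp", "target"]].corr())
```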

This shows that BMI is positively correlated with disease progression, meaning the higher the BMI, the greater the measured disease progression tends to be. What relationships can you find between other variables in the data?

Forest Cover Types

The Forest covertype dataset , also from scikit-learn, is a collection of data from the US Forest Service (USFS). It includes cartographic variables that measure the forest cover type for 30 x 30 meter cells and includes a total of 54 attributes.

This rich dataset can be used for a variety of projects, such as predicting the forest cover type of a given area, analyzing the relationship between different forest cover types and environmental factors, or creating a model to predict the probability of a certain type of forest cover in a given area. It can also be used to study the effects of human activities on forest cover.

Here’s how to read the data into a DataFrame and print the first 5 rows:
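A sketch using scikit-learn's fetch_covtype (the data is downloaded on the first call; as_frame assumes a recent scikit-learn version):

```python
from sklearn.datasets import fetch_covtype

covtype = fetch_covtype(as_frame=True)   # downloads the data on first use
df = covtype.frame                       # 54 features plus 'Cover_Type'

print(df.head())
```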

You can see the variables include things like elevation, slope, and soil type. The target variable is an integer and corresponds to a forest cover type. Here’s how to print the most common types:
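Continuing with the same DataFrame:

```python
# Count how often each forest cover type occurs
print(df["Cover_Type"].value_counts())
```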

The most commonly occurring type of forest in this dataset is type 2, with 283,301 occurrences. This corresponds to Lodgepole Pine. Type 4, the Cottonwood/Willow type, is the least frequently occurring type.

To get started on an analysis project, first learn more about this data. Since this DataFrame is quite large, with many different variables, check out How to Filter Rows and Select Columns in a Python DataFrame with pandas for some tips on manipulating the data.

Yahoo Finance

Python’s yfinance library is a powerful tool for downloading financial data from the Yahoo Finance website. You’ll need to install this library, which can be done with pip . It allows you to download data in a variety of formats; the data includes variables such as stock prices, dividends, splits, and more. To download data for Microsoft and plot the close price, do the following:
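A sketch using the yfinance Ticker API (the exact column layout can vary a little between yfinance versions):

```python
import yfinance as yf
import matplotlib.pyplot as plt

msft = yf.Ticker("MSFT").history(period="max")    # daily price history

msft.plot(y="Close", title="MSFT closing price")  # built-in DataFrame.plot()
plt.show()
```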

This uses the built-in pandas.DataFrame.plot() method. Running this code produces the following visualization:

[Plot: Microsoft closing stock price over time]

There are many options open for the analysis at this stage. A regression analysis can be used to model the relationship between different financial variables. In the article Regression Analysis in Python , we show an example of how to implement this.

Atmospheric Soundings

Atmospheric sounding data is data collected from weather balloons. A comprehensive dataset is maintained on the University of Wyoming's Upper Air Sounding website . The data includes variables such as temperature, pressure, dew point, wind speed, and wind direction. This data can be used for a variety of projects, such as forecasting temperature and wind speed for your home town. Since it has decades of observations, you could use it to study the effects of climate change on the atmosphere.

Simply select an observation site and choose a time from the web interface. You can highlight the tabular data and copy-paste into a text document. Save it as ‘weather_data.txt’. Then you can read it into Python like this:
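A sketch of the line-by-line parsing; the column positions are an assumption, so adjust them to match the table you copied:

```python
# Parse the copy-pasted sounding table; assumed column order:
# pressure, height, temperature, ...
pressure, height, temperature = [], [], []

with open("weather_data.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) < 3:
            continue                 # skip blank or short lines
        try:
            pressure.append(float(parts[0]))
            height.append(float(parts[1]))
            temperature.append(float(parts[2]))
        except ValueError:
            continue                 # skip header rows
```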

This is a nice example of having to read the data in line by line. Using Matplotlib , you can plot the temperature as a function of height for your data as follows:
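Continuing with the lists built above:

```python
import matplotlib.pyplot as plt

plt.plot(temperature, height)
plt.xlabel("Temperature (°C)")
plt.ylabel("Height (m)")
plt.title("Atmospheric sounding")
plt.show()
```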

[Plot: temperature as a function of height from the sounding data]

Note that your plot may look a little different depending on what site and date you chose to download. If you want to download a large amount of data, you’ll need to write a web scraper. See our article Web Scraping with Python Libraries for more details.

IMDB Movie Reviews

The IMDB Movie Review dataset is a collection of movie reviews from the Internet Movie Database (IMDB). It includes reviews from tens of thousands of movies, with each review consisting of a text review and a sentiment score. The sentiment score is a binary value, either positive or negative, that indicates the sentiment of the review.

This dataset can be used for a variety of projects, such as sentiment analysis – which aims to build models that can predict the sentiment of a review. It can also be used to identify the topics and themes of a movie. You can download the dataset from Kaggle . To read in the CSV data and start preprocessing, do the following:
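A sketch of that preprocessing with pandas and NLTK; the file and column names ('IMDB Dataset.csv', 'review', 'sentiment') are assumptions based on the common Kaggle version:

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")   # one-time downloads
nltk.download("punkt")

df = pd.read_csv("IMDB Dataset.csv")      # assumed columns: review, sentiment
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())  # lowercase and tokenize
    return [t for t in tokens if t.isalpha() and t not in stop_words]

df["tokens"] = df["review"].apply(preprocess)
print(df[["tokens", "sentiment"]].head())
```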

Text data is often quite messy, so cleaning and standardizing it as much as possible is important. See our article The Most Helpful Python Data Cleaning Modules for more information. Here, we have changed all characters to lowercase, removed stopwords (unimportant words), and tokenized the reviews (created a list of words from sentences). The article Null in Python: A Complete Guide has some more examples of working with text data.

There is more cleaning that could be done – for example, removing punctuation. But this could be the starting point of a natural language processing project. Try seeing if there is a correlation between the most frequently occurring words and the sentiment.

Berlin Database of Emotional Speech

The Berlin Database of Emotional Speech (BDES) is a collection of German-language audio recordings of emotional speech. It was generated by having actors read out a set of sentences in different emotional states, such as anger, happiness, sadness, and fear. The data includes audio recordings of the actors' voices as well as annotations of the emotional states. This data can be used to study the acoustic features of emotional speech.

The data is available for download here. Metadata for the type of speech is recorded in the filename. For example, the ‘F’ in the filename ‘03a01Fa.wav’ means ‘Freude’, or happiness. To plot the spectrogram of a happy recording, do the following:
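A sketch using SciPy and Matplotlib to read the WAV file named above and plot its spectrogram:

```python
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

sample_rate, samples = wavfile.read("03a01Fa.wav")   # the 'happy' recording

frequencies, times, spectrogram = signal.spectrogram(samples, fs=sample_rate)

plt.pcolormesh(times, frequencies, spectrogram, shading="auto")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```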

This produces the following plot of frequency against time. The yellow colors indicate higher signal strength.

[Spectrogram: signal frequency against time, with yellow indicating higher signal strength]

Try plotting the same for angry speech, and see how the frequency, speed, and intensity of the speech changes. For more details on working with audio data in Python, check out the article How to Visualize Sound in Python .

MNIST Handwritten Digits

The MNIST dataset is a collection of handwritten digits (ranging from 0 to 9) that is commonly used for training various image processing systems. It was created by the National Institute of Standards and Technology (NIST) and is widely used in machine learning and computer vision. The dataset consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 pixel grayscale image associated with a label from 0 to 9.

To load and start working with this data, you’ll need to install Keras , which is a powerful Python library for deep learning. The easiest way to do this is with a quick pip install keras command from the terminal. You can import the MNIST data and plot some of the digit images like this:
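A sketch of loading the data through Keras and plotting a handful of digits:

```python
import matplotlib.pyplot as plt
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()  # downloads on first call

fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax, image, label in zip(axes, x_train, y_train):
    ax.imshow(image, cmap="gray")   # 28x28 grayscale digit
    ax.set_title(label)
    ax.axis("off")
plt.show()
```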

[Figure: sample MNIST digit images with their labels]

You can see the images of the handwritten digits with their labels above them. This dataset can be used to train a supervised image recognition model. The pixel values are the input data, and the labels are the truth that the model uses to adjust the internal weights. You can see how this is implemented in the Keras code examples section.

Improve Your Analysis Skills with Python Datasets

Getting started is often the hardest part of any challenge. In this article, we shared 7 datasets that you can use to start your next analysis project. The code examples we provided should serve as a starting point and allow you to delve deep into the data. From analyzing financial data to predicting the weather, Python can be used to explore and understand data in a variety of ways.

These datasets were chosen to give you exposure to working with a variety of different data types – numbers, text, and even images and audio. Our article An Introduction to NumPy in Python has more examples of working with numerical data.

With the right resources and practice, you can become an expert in data analysis and use Python datasets to make sense of the world around you. So, take the time to learn Python and start exploring the world of data!



Data Analysis with Python

In this article, we will discuss how to do data analysis with Python. We will cover analyzing numerical data with NumPy, tabular data with Pandas, data visualization with Matplotlib, and exploratory data analysis.


Data Analysis is the technique of collecting, transforming, and organizing data to make future predictions and informed, data-driven decisions. It also helps to find possible solutions for a business problem. There are six steps for Data Analysis. They are:

  • Ask or Specify Data Requirements
  • Prepare or Collect Data
  • Clean and Process
  • Analyze
  • Share
  • Act or Report


Note: To know more about these steps refer to our Six Steps of Data Analysis Process tutorial. 

Analyzing Numerical Data with NumPy

NumPy is an array processing package in Python and provides a high-performance multidimensional array object and tools for working with these arrays. It is the fundamental package for scientific computing with Python.

Arrays in NumPy

A NumPy array is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy, the number of dimensions of the array is called the rank of the array. A tuple of integers giving the size of the array along each dimension is known as the shape of the array.

Creating NumPy Array

NumPy arrays can be created in multiple ways, with various ranks. It can also be created with the use of different data types like lists, tuples, etc. The type of the resultant array is deduced from the type of elements in the sequences. NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation.

Create Array using numpy.empty(shape, dtype=float, order=’C’)
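A minimal sketch of numpy.empty:

```python
import numpy as np

# numpy.empty allocates an array without initializing its values,
# so the contents are whatever happened to be in memory
a = np.empty((2, 3), dtype=float)
print(a)
```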


Create Array using numpy.zeros(shape, dtype = None, order = ‘C’)
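And a matching sketch for numpy.zeros:

```python
import numpy as np

# numpy.zeros creates an array filled with zeros
b = np.zeros((3, 3), dtype=int)
print(b)
```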

Operations on Numpy Arrays

Arithmetic Operations

  • Subtraction:
  • Multiplication:
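A small sketch of element-wise arithmetic on NumPy arrays:

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

print(a - b)   # subtraction: [ 9 18 27]
print(a * b)   # element-wise multiplication: [10 40 90]
```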

For more information, refer to our NumPy – Arithmetic Operations Tutorial

NumPy Array Indexing

Indexing can be done in NumPy by using an array as an index. In the case of a slice, a view or shallow copy of the array is returned, but with an index array a copy of the original array is returned. NumPy arrays can be indexed with other arrays or any other sequence, with the exception of tuples. The last element is indexed by -1, the second last by -2, and so on.

Python NumPy Array Indexing
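A minimal sketch of index arrays and negative indexes:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

print(arr[[0, 2, 4]])    # index with an array -> copy: [10 30 50]
print(arr[-1], arr[-2])  # negative indexes: 50 40
```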

NumPy Array Slicing

Consider the syntax x[obj], where x is the array and obj is the index. The slice object is the index in the case of basic slicing. Basic slicing occurs when obj is:

  • a slice object that is of the form start: stop: step
  • or a tuple of slice objects and integers

All arrays generated by basic slicing are always the view in the original array.

Ellipsis can also be used along with basic slicing. Ellipsis (…) is the number of : objects needed to make a selection tuple of the same length as the dimensions of the array.
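A short sketch of basic slicing and the Ellipsis object:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

print(a[0, 1:3, ::2])   # start:stop:step slicing returns a view
print(a[..., 0])        # Ellipsis expands to the needed number of ':'
```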

NumPy Array Broadcasting

The term broadcasting refers to how NumPy treats arrays with different dimensions during arithmetic operations, which leads to certain constraints: the smaller array is broadcast across the larger array so that they have compatible shapes.

Let’s assume that we have a large data set, where each datum is a list of parameters. In NumPy we have a 2-D array, where each row is a datum and the number of rows is the size of the data set. Suppose we want to apply some sort of scaling to all these data: every parameter gets its own scaling factor, or in other words, every parameter is multiplied by some factor.

Just to have a clear understanding, let’s count calories in foods using a macro-nutrient breakdown. Roughly put, the caloric parts of food are made of fats (9 calories per gram), protein (4 CPG), and carbs (4 CPG). So if we list some foods (our data), and for each food list its macro-nutrient breakdown (parameters), we can then multiply each nutrient by its caloric value (apply scaling) to compute the caloric breakdown of every food item.


With this transformation, we can now compute all kinds of useful information. For example, what is the total number of calories present in some food or, given a breakdown of my dinner know how many calories did I get from protein and so on.

Let’s see a naive way of producing this computation with Numpy:
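A compact sketch of the calorie computation using broadcasting directly (the nutrient numbers are made up for illustration):

```python
import numpy as np

# Rows are foods; columns are grams of (fat, protein, carbs) per serving
macros = np.array([
    [3.0, 25.0, 40.0],
    [10.0, 10.0, 60.0],
    [8.0, 30.0, 20.0],
])

cal_per_gram = np.array([9, 4, 4])   # calories per gram of each nutrient

calories = macros * cal_per_gram     # the 1-D array is broadcast across rows
print(calories)
print(calories.sum(axis=1))          # total calories per food
```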

Broadcasting Rules: Broadcasting two arrays together follow these rules:

  • If the arrays don’t have the same rank then prepend the shape of the lower rank array with 1s until both shapes have the same length.
  • The two arrays are compatible in a dimension if they have the same size in the dimension or if one of the arrays has size 1 in that dimension.
  • The arrays can be broadcast together if they are compatible with all dimensions.
  • After broadcasting, each array behaves as if it had a shape equal to the element-wise maximum of shapes of the two input arrays.
  • In any dimension where one array had a size of 1 and the other array had a size greater than 1, the first array behaves as if it were copied along that dimension.

Note: For more information, refer to our Python NumPy Tutorial .

Analyzing Data Using Pandas

Python Pandas is used for relational or labeled data and provides various data structures for manipulating such data and time series. This library is built on top of the NumPy library. This module is generally imported as:
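That import convention is simply:

```python
import pandas as pd
```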

Here, pd is referred to as an alias for Pandas. However, it is not necessary to import the library using the alias; it just helps in writing less code every time a method or property is called. Pandas generally provides two data structures for manipulating data: Series and DataFrame.

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Pandas Series is essentially a single column of an Excel sheet. Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.


It can be created using the Series() function by loading the dataset from the existing storage like SQL, Database, CSV Files, Excel Files, etc., or from data structures like lists, dictionaries, etc.

Python Pandas Creating Series
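A minimal sketch of creating a Series from a Python list:

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
print(ser)
```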


Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns.


It can be created using the DataFrame() constructor and, just like a Series, it can also be created from different file types and data structures.

Python Pandas Creating Dataframe
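A minimal sketch of building a DataFrame from a plain Python list (the list itself is hypothetical):

```python
import pandas as pd

words = ["data", "analysis", "with", "pandas"]
df = pd.DataFrame(words, columns=["word"])
print(df)
```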


Creating Dataframe from CSV

We can create a dataframe from the CSV files using the read_csv() function.

Python Pandas read CSV
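A sketch of reading a CSV file; the file name Iris.csv is an assumption here (the same Iris data is used later in this article):

```python
import pandas as pd

df = pd.read_csv("Iris.csv")   # any CSV file works here
print(df.head())               # first five rows
```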


Filtering DataFrame

Pandas dataframe.filter() function is used to Subset rows or columns of dataframe according to labels in the specified index. Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Python Pandas Filter Dataframe
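Continuing with the Iris DataFrame, a sketch of filter() (column names assumed from the Kaggle Iris CSV):

```python
# filter() selects by label, not by contents
print(df.filter(["Species", "SepalLengthCm"]).head())
```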


Sorting DataFrame

In order to sort the data frame in pandas, the function sort_values() is used. Pandas sort_values() can sort the data frame in Ascending or Descending order.

Python Pandas Sorting Dataframe in Ascending Order
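A one-line sketch, continuing with the same DataFrame:

```python
# Sort rows by sepal length, smallest first
print(df.sort_values(by="SepalLengthCm", ascending=True).head())
```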


Pandas GroupBy

Groupby is a pretty simple concept. We can create a grouping of categories and apply a function to the categories. In real data science projects, you’ll be dealing with large amounts of data and trying things over and over, so for efficiency, we use the Groupby concept.  Groupby mainly refers to a process involving one or more of the following steps they are:

  • Splitting: It is a process in which we split the data into groups by applying some conditions on the dataset.
  • Applying: It is a process in which we apply a function to each group independently.
  • Combining: It is a process in which we combine the results into a single data structure after applying groupby.

The following steps will help in understanding the process involved in the Groupby concept:

1. Group the unique values from the Team column

2. Now there’s a bucket for each group

3. Toss the other data into the buckets

4. Apply a function on the weight column of each bucket

Python Pandas GroupBy
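A sketch using a small hypothetical Team/Weight table, mirroring the steps above:

```python
import pandas as pd

data = pd.DataFrame({
    "Team":   ["A", "B", "A", "B", "C"],
    "Weight": [60, 72, 65, 80, 55],
})

grouped = data.groupby("Team")    # one bucket per unique Team value
print(grouped["Weight"].mean())   # apply a function to each bucket
```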


Applying a function to groups:

After splitting the data into groups, we apply a function to each group. To do that, we perform one of the following operations:

  • Aggregation: It is a process in which we compute a summary statistic (or statistics) about each group. For example, computing group sums or means.
  • Transformation: It is a process in which we perform some group-specific computations and return a like-indexed object. For example, filling NAs within groups with a value derived from each group.
  • Filtration: It is a process in which we discard some groups, according to a group-wise computation that evaluates to True or False. For example, filtering out data based on the group sum or mean.

Pandas Aggregation

Aggregation is a process in which we compute a summary statistic about each group. An aggregation function returns a single aggregated value for each group. After splitting the data into groups using the groupby() function, several aggregation operations can be performed on the grouped data.

Python Pandas Aggregation
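Continuing with the grouped object from the sketch above:

```python
# Several aggregations at once on the grouped data
print(grouped["Weight"].agg(["sum", "mean", "max"]))
```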


Concatenating DataFrame

In order to concatenate dataframes, we use the concat() function. This function does all the heavy lifting of performing concatenation operations along an axis of Pandas objects, while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.

Python Pandas Concatenate Dataframe
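A minimal sketch with two small hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Amy", "Bob"], "Age": [25, 30]})
df2 = pd.DataFrame({"Name": ["Cal", "Dee"], "Age": [22, 28]})

print(pd.concat([df1, df2], ignore_index=True))   # stack the rows
```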


Merging DataFrame

When we need to combine very large DataFrames, joins serve as a powerful way to perform these operations swiftly. Joins can only be done on two DataFrames at a time, denoted as left and right tables. The key is the common column that the two DataFrames will be joined on. It’s a good practice to use keys that have unique values throughout the column to avoid unintended duplication of row values. Pandas provide a single function, merge() , as the entry point for all standard database join operations between DataFrame objects.

There are four basic ways to handle the join (inner, left, right, and outer), depending on which rows must retain their data.


Python Pandas Merge Dataframe
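A sketch of an inner join on a shared key column (hypothetical data):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "name": ["Amy", "Bob", "Cal"]})
right = pd.DataFrame({"key": [2, 3, 4], "score": [88, 92, 79]})

print(pd.merge(left, right, on="key", how="inner"))
```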


Joining DataFrame

In order to join dataframes, we use the .join() function. This function is used for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

Python Pandas Join Dataframe
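A sketch of .join(), which matches rows on the index (hypothetical data):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]}, index=["r1", "r2", "r3"])
b = pd.DataFrame({"y": [10, 20]}, index=["r1", "r3"])

print(a.join(b, how="left"))   # combine columns on the index
```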


For more information, refer to our Pandas Merging, Joining, and Concatenating tutorial

For a complete guide on Pandas refer to our Pandas Tutorial .

Visualization with Matplotlib

Matplotlib is easy to use and an amazing visualizing library in Python. It is built on NumPy arrays and designed to work with the broader SciPy stack and consists of several plots like line, bar, scatter, histogram, etc. 

Pyplot is a Matplotlib module that provides a MATLAB-like interface. Pyplot provides functions that interact with the figure i.e. creates a figure, decorates the plot with labels, and creates a plotting area in a figure.


A bar plot or bar chart is a graph that represents a category of data with rectangular bars whose lengths and heights are proportional to the values they represent. Bar plots can be plotted horizontally or vertically. A bar chart describes the comparisons between the discrete categories. It can be created using the bar() method.

Python Matplotlib Bar Chart

Here we will use the Iris dataset.
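A sketch of a bar chart of species counts; the file and column names are assumed from the Kaggle Iris CSV:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("Iris.csv")

counts = df["Species"].value_counts()
plt.bar(counts.index, counts.values)   # one bar per species
plt.xlabel("Species")
plt.ylabel("Count")
plt.show()
```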


A histogram is basically used to represent data in the form of some groups. It is a type of bar plot where the X-axis represents the bin ranges while the Y-axis gives information about frequency. To create a histogram the first step is to create a bin of the ranges, then distribute the whole range of the values into a series of intervals, and count the values which fall into each of the intervals. Bins are clearly identified as consecutive, non-overlapping intervals of variables. The hist() function is used to compute and create a histogram of x.

Python Matplotlib Histogram
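Continuing with the same DataFrame:

```python
plt.hist(df["SepalLengthCm"], bins=10)   # distribute values into 10 bins
plt.xlabel("Sepal length (cm)")
plt.ylabel("Frequency")
plt.show()
```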


Scatter Plot

Scatter plots are used to observe relationships between variables and use dots to represent the relationship between them. The scatter() method in the Matplotlib library is used to draw a scatter plot.

Python Matplotlib Scatter Plot
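Continuing with the same DataFrame:

```python
plt.scatter(df["SepalLengthCm"], df["SepalWidthCm"])
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.show()
```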


A boxplot, also known as a box and whisker plot, is a very good visual representation when it comes to measuring the data distribution. It clearly plots the median values, outliers, and the quartiles. Understanding data distribution is another important factor which leads to better model building. If data has outliers, a box plot is a recommended way to identify them and take necessary action. The box and whiskers chart shows how data is spread out. Five pieces of information are generally included in the chart:

  • The minimum is shown at the far left of the chart, at the end of the left ‘whisker’
  • The first quartile, Q1, is the far left edge of the box
  • The median is shown as a line in the center of the box
  • The third quartile, Q3, is the far right edge of the box
  • The maximum is shown at the far right of the chart, at the end of the right ‘whisker’

[Figure: representation of a box plot, showing the interquartile range]

Python Matplotlib Box Plot
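Continuing with the same DataFrame:

```python
plt.boxplot(df["SepalWidthCm"])
plt.ylabel("Sepal width (cm)")
plt.show()
```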


Correlation Heatmaps

A 2-D Heatmap is a data visualization tool that helps to represent the magnitude of the phenomenon in form of colors. A correlation heatmap is a heatmap that shows a 2D correlation matrix between two discrete dimensions, using colored cells to represent data from usually a monochromatic scale. The values of the first dimension appear as the rows of the table while the second dimension is a column. The color of the cell is proportional to the number of measurements that match the dimensional value. This makes correlation heatmaps ideal for data analysis since it makes patterns easily readable and highlights the differences and variation in the same data. A correlation heatmap, like a regular heatmap, is assisted by a colorbar making data easily readable and comprehensible.

Note: The data here has to be passed with corr() method to generate a correlation heatmap. Also, corr() itself eliminates columns that will be of no use while generating a correlation heatmap and selects those which can be used.

Python Matplotlib Correlation Heatmap
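A sketch with plain Matplotlib, passing the data through corr() first as noted above (numeric_only assumes pandas 1.5 or newer):

```python
corr = df.corr(numeric_only=True)   # correlation matrix of numeric columns

plt.imshow(corr, cmap="coolwarm")
plt.colorbar()                      # the colorbar makes the scale readable
plt.xticks(range(len(corr)), corr.columns, rotation=45)
plt.yticks(range(len(corr)), corr.columns)
plt.show()
```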


For more information on data visualization, refer to our tutorials below:

  • Data Visualization using Matplotlib
  • Data Visualization with Python Seaborn
  • Data Visualisation in Python using Matplotlib and Seaborn
  • Using Plotly for Interactive Data Visualization in Python
  • Interactive Data Visualization with Bokeh

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a technique for analyzing data using visual techniques. With this technique, we can get detailed information about the statistical summary of the data. We will also be able to deal with duplicate values and outliers, and also see trends or patterns present in the dataset.

Note: We will be using Iris Dataset.

Getting Information about the Dataset

We will use the shape attribute to get the shape of the dataset.
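A minimal sketch; the file name Iris.csv (with an Id column, four measurements, and Species) is assumed:

```python
import pandas as pd

df = pd.read_csv("Iris.csv")
print(df.shape)   # (150, 6)
```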

Shape of Dataframe 

We can see that the dataframe contains 6 columns and 150 rows.

Now, let’s also see the columns and their data types. For this, we will use the info() method.
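Continuing with the same DataFrame:

```python
df.info()   # column names, non-null counts, and dtypes
```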

Information about Dataset 


We can see that only one column has categorical data and all the other columns are of the numeric type with non-Null entries.

Let’s get a quick statistical summary of the dataset using the describe() method. The describe() function applies basic statistical computations on the dataset, like extreme values, count of data points, standard deviation, etc. Any missing or NaN value is automatically skipped. The describe() function gives a good picture of the distribution of the data.
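Continuing with the same DataFrame:

```python
print(df.describe())   # count, mean, std, min, quartiles, and max per numeric column
```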

Description of dataset 


We can see the count of each column along with their mean value, standard deviation, minimum and maximum values.

Checking Missing Values

We will check if our data contains any missing values or not. Missing values can occur when no information is provided for one or more items or for a whole unit. We will use the isnull() method.
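Continuing with the same DataFrame:

```python
print(df.isnull().sum())   # number of missing values in each column
```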


We can see that no column has any missing value.

Checking Duplicates

Let’s see if our dataset contains any duplicates or not. Pandas drop_duplicates() method helps in removing duplicates from the data frame.
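Continuing with the same DataFrame:

```python
print(df.drop_duplicates(subset="Species"))   # one row per unique species
```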


We can see that there are only three unique species. Let’s see if the dataset is balanced or not i.e. all the species contain equal amounts of rows or not. We will use the Series.value_counts() function. This function returns a Series containing counts of unique values. 
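Continuing with the same DataFrame:

```python
print(df["Species"].value_counts())   # rows per species; equal counts mean a balanced dataset
```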


We can see that all the species contain an equal amount of rows, so we should not delete any entries.

Relation between variables

We will see the relationship between the sepal length and sepal width and also between petal length and petal width.

Comparing Sepal Length and Sepal Width
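A sketch with Seaborn, continuing with the Iris DataFrame:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="SepalLengthCm", y="SepalWidthCm", hue="Species", data=df)
plt.show()
```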


From the above plot, we can infer that – 

  • Species Setosa has smaller sepal lengths but larger sepal widths.
  • Versicolor Species lies in the middle of the other two species in terms of sepal length and width
  • Species Virginica has larger sepal lengths but smaller sepal widths.

Comparing Petal Length and Petal Width
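The same kind of plot for the petal columns:

```python
sns.scatterplot(x="PetalLengthCm", y="PetalWidthCm", hue="Species", data=df)
plt.show()
```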


  • The species Setosa has smaller petal lengths and widths.
  • Versicolor Species lies in the middle of the other two species in terms of petal length and width
  • Species Virginica has the largest petal lengths and widths.

Let’s plot all the column’s relationships using a pairplot. It can be used for multivariate analysis.

Python code for pairplot 
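A one-line sketch (dropping the Id column, which is assumed to exist in this CSV):

```python
sns.pairplot(df.drop(columns=["Id"]), hue="Species")
plt.show()
```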


We can see many types of relationships from this plot, such as that the species Setosa has the smallest petal widths and lengths. It also has the smallest sepal length but larger sepal widths. Such information can be gathered about any other species.

Handling Correlation

Pandas dataframe.corr() is used to find the pairwise correlation of all columns in the dataframe. Any NA values are automatically excluded. Any non-numeric data type columns in the dataframe are ignored.
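Continuing with the same DataFrame (numeric_only assumes pandas 1.5 or newer):

```python
print(df.corr(numeric_only=True))   # pairwise correlation of numeric columns
```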


The heatmap is a data visualization technique that is used to analyze the dataset as colors in two dimensions. Basically, it shows a correlation between all numerical variables in the dataset. In simpler terms, we can plot the above-found correlation using the heatmaps.

python code for heatmap 
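A short sketch with Seaborn:

```python
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```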


From the above graph, we can see that –

  • Petal width and petal length have high correlations.
  • Petal length and sepal width have good correlations.
  • Petal Width and Sepal length have good correlations.

Handling Outliers

An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and the removal process is the same as removing a data item from the pandas dataframe.

Let’s consider the iris dataset and let’s plot the boxplot for the SepalWidthCm column.

python code for Boxplot 
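A one-line sketch with Seaborn:

```python
sns.boxplot(x="SepalWidthCm", data=df)
plt.show()
```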


In the above graph, the values above 4 and below 2 are acting as outliers.

Removing Outliers

For removing outliers, one must follow the same process of removing an entry from the dataset using its exact position, because all the above methods of detecting outliers produce, as their end result, the list of data items that satisfy the outlier definition according to the method used.

We will detect the outliers using IQR and then we will remove them. We will also draw the boxplot to see if the outliers are removed or not.
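A sketch of the IQR fences, continuing with the same DataFrame:

```python
q1 = df["SepalWidthCm"].quantile(0.25)
q3 = df["SepalWidthCm"].quantile(0.75)
iqr = q3 - q1

# Keep only the rows inside the 1.5 * IQR fences
mask = df["SepalWidthCm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[mask]

sns.boxplot(x="SepalWidthCm", data=df_clean)   # check that the outliers are gone
plt.show()
```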


For more information about EDA, refer to our tutorials below:

  • What is Exploratory Data Analysis?
  • Exploratory Data Analysis in Python | Set 1
  • Exploratory Data Analysis in Python | Set 2
  • Exploratory Data Analysis on Iris Dataset


Using Pandas and Python to Explore Your Dataset



Do you have a large dataset that’s full of interesting insights, but you’re not sure where to start exploring it? Has your boss asked you to generate some statistics from it, but they’re not so easy to extract? These are precisely the use cases where pandas and Python can help you! With these tools, you’ll be able to slice a large dataset down into manageable parts and glean insight from that information.

In this tutorial, you’ll learn how to:

  • Calculate metrics about your data
  • Perform basic queries and aggregations
  • Discover and handle incorrect data, inconsistencies, and missing values
  • Visualize your data with plots

You’ll also learn about the differences between the main data structures that pandas and Python use. To follow along, you can get all of the example code in this tutorial at the link below:

Get Jupyter Notebook: Click here to get the Jupyter Notebook you’ll use to explore data with Pandas in this tutorial.

There are a few things you’ll need to get started with this tutorial. First is a familiarity with Python’s built-in data structures, especially lists and dictionaries . For more information, check out Lists and Tuples in Python and Dictionaries in Python .

The second thing you’ll need is a working Python environment . You can follow along in any terminal that has Python 3 installed. If you want to see nicer output, especially for the large NBA dataset you’ll be working with, then you might want to run the examples in a Jupyter notebook .

Note: If you don’t have Python installed at all, then check out Python 3 Installation & Setup Guide . You can also follow along online in a try-out Jupyter notebook .

The last thing you’ll need is pandas and other Python libraries, which you can install with pip :

You can also use the Conda package manager:

If you’re using the Anaconda distribution, then you’re good to go! Anaconda already comes with the pandas Python library installed.

Note: Have you heard that there are multiple package managers in the Python world and are somewhat confused about which one to pick? pip and conda are both excellent choices, and they each have their advantages.

If you’re going to use Python mainly for data science work, then conda is perhaps the better choice. In the conda ecosystem, you have two main alternatives:

  • If you want to get a stable data science environment up and running quickly, and you don’t mind downloading 500 MB of data, then check out the Anaconda distribution .
  • If you prefer a more minimalist setup, then check out the section on installing Miniconda in Setting Up Python for Machine Learning on Windows .

The examples in this tutorial have been tested with Python 3.7 and pandas 0.25.0, but they should also work in older versions. You can get all the code examples you’ll see in this tutorial in a Jupyter notebook by clicking the link below:

Let’s get started!

Now that you’ve installed pandas, it’s time to have a look at a dataset. In this tutorial, you’ll analyze NBA results provided by FiveThirtyEight in a 17MB CSV file . Create a script download_nba_all_elo.py to download the data:
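A sketch of such a script using the requests library (the download URL is an assumption based on FiveThirtyEight's public data repository and may have moved):

    # download_nba_all_elo.py
    import requests

    DOWNLOAD_URL = (
        "https://media.githubusercontent.com/media/"
        "fivethirtyeight/data/master/nba-elo/nba_all_elo.csv"
    )
    TARGET_CSV_PATH = "nba_all_elo.csv"

    # Fetch the CSV and write it to the current working directory
    response = requests.get(DOWNLOAD_URL)
    response.raise_for_status()
    with open(TARGET_CSV_PATH, "wb") as f:
        f.write(response.content)
    print("Download ready.")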

When you execute the script, it will save the file nba_all_elo.csv in your current working directory.

Note: You could also use your web browser to download the CSV file.

However, having a download script has several advantages:

  • You can tell where you got your data.
  • You can repeat the download anytime! That’s especially handy if the data is often refreshed.
  • You don’t need to share the 17MB CSV file with your co-workers. Usually, it’s enough to share the download script.

Now you can use the pandas Python library to take a look at your data:
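In code, that step looks like this:

    import pandas as pd

    nba = pd.read_csv("nba_all_elo.csv")
    type(nba)   # <class 'pandas.core.frame.DataFrame'>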

Here, you follow the convention of importing pandas in Python with the pd alias. Then, you use .read_csv() to read in your dataset and store it as a DataFrame object in the variable nba .

Note: Is your data not in CSV format? No worries! The pandas Python library provides several similar functions like read_json() , read_html() , and read_sql_table() . To learn how to work with these file formats, check out Reading and Writing Files With pandas or consult the docs .

You can see how much data nba contains:
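For example:

    len(nba)    # 126314
    nba.shape   # (126314, 23)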

You use the Python built-in function len() to determine the number of rows. You also use the .shape attribute of the DataFrame to see its dimensionality . The result is a tuple containing the number of rows and columns.

Now you know that there are 126,314 rows and 23 columns in your dataset. But how can you be sure the dataset really contains basketball stats? You can have a look at the first five rows with .head() :

If you’re following along with a Jupyter notebook, then you’ll see a result like this:

Pandas DataFrame .head()

Unless your screen is quite large, your output probably won’t display all 23 columns. Somewhere in the middle, you’ll see a column of ellipses ( ... ) indicating the missing data. If you’re working in a terminal, then that’s probably more readable than wrapping long rows. However, Jupyter notebooks will allow you to scroll. You can configure pandas to display all 23 columns like this:
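One way is via pandas' display options:

    pd.set_option("display.max_columns", None)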

While it’s practical to see all the columns, you probably won’t need six decimal places! Change it to two:
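And likewise for the precision option:

    pd.set_option("display.precision", 2)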

To verify that you’ve changed the options successfully, you can execute .head() again, or you can display the last five rows with .tail() instead:

Now, you should see all the columns, and your data should show two decimal places:

Pandas DataFrame .tail()

You can discover some further possibilities of .head() and .tail() with a small exercise. Can you print the last three lines of your DataFrame ? Expand the code block below to see the solution:

Solution: head & tail

Here’s how to print the last three lines of nba :

Your output should look something like this:

Pandas DataFrame .tail() with parameter

You can see the last three lines of your dataset with the options you’ve set above.

Similar to the Python standard library, functions in pandas also come with several optional parameters. Whenever you bump into an example that looks relevant but is slightly different from your use case, check out the official documentation . The chances are good that you’ll find a solution by tweaking some optional parameters!

Getting to Know Your Data

You’ve imported a CSV file with the pandas Python library and had a first look at the contents of your dataset. So far, you’ve only seen the size of your dataset and its first and last few rows. Next, you’ll learn how to examine your data more systematically.

The first step in getting to know your data is to discover the different data types it contains. While you can put anything into a list, the columns of a DataFrame contain values of a specific data type. When you compare pandas and Python data structures, you’ll see that this behavior makes pandas much faster!

You can display all columns and their data types with .info() :
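That call is simply:

    nba.info()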

This will produce the following output:

Pandas DataFrame .info()

You’ll see a list of all the columns in your dataset and the type of data each column contains. Here, you can see the data types int64 , float64 , and object . pandas uses the NumPy library to work with these types. Later, you’ll meet the more complex categorical data type, which the pandas Python library implements itself.

The object data type is a special one. According to the pandas Cookbook , the object data type is “a catch-all for columns that pandas doesn’t recognize as any other specific type.” In practice, it often means that all of the values in the column are strings.

Although you can store arbitrary Python objects in the object data type, you should be aware of the drawbacks to doing so. Strange values in an object column can harm pandas’ performance and its interoperability with other libraries. For more information, check out the official getting started guide .

Now that you’ve seen what data types are in your dataset, it’s time to get an overview of the values each column contains. You can do this with .describe() :
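For example:

    nba.describe()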

This function shows you some basic descriptive statistics for all numeric columns:

Pandas DataFrame .describe()

.describe() only analyzes numeric columns by default, but you can provide other data types if you use the include parameter:

.describe() won’t try to calculate a mean or a standard deviation for the object columns, since they mostly include text strings. However, it will still display some descriptive statistics:

Pandas DataFrame .describe() with include=np.object

Take a look at the team_id and fran_id columns. Your dataset contains 104 different team IDs, but only 53 different franchise IDs. Furthermore, the most frequent team ID is BOS , but the most frequent franchise ID is Lakers . How is that possible? You’ll need to explore your dataset a bit more to answer this question.

Exploratory data analysis can help you answer questions about your dataset. For example, you can examine how often specific values occur in a column:
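For instance, counting how often each franchise appears:

    nba["fran_id"].value_counts()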

It seems that a team named "Lakers" played 6024 games, but only 5078 of those were played by the Los Angeles Lakers. Find out who the other "Lakers" team is:

Indeed, the Minneapolis Lakers ( "MNL" ) played 946 games. You can even find out when they played those games. For that, you’ll first define a column that converts the value of date_game to the datetime data type. Then you can use the min and max aggregate functions to find the first and last games of the Minneapolis Lakers:
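A sketch of those two steps (the name of the new column, date_played, is an assumption):

    nba["date_played"] = pd.to_datetime(nba["date_game"])
    nba.loc[nba["team_id"] == "MNL", "date_played"].min()
    nba.loc[nba["team_id"] == "MNL", "date_played"].max()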

It looks like the Minneapolis Lakers played between the years of 1948 and 1960. That explains why you might not recognize this team!

You’ve also found out why the Boston Celtics team "BOS" played the most games in the dataset. Let’s also dig into their history a little. Find out how many points the Boston Celtics have scored during all matches contained in this dataset. Expand the code block below for the solution:

Solution: DataFrame intro

Similar to the .min() and .max() aggregate functions, you can also use .sum() :
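For example, the Celtics' total points:

    nba.loc[nba["team_id"] == "BOS", "pts"].sum()   # 626484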

The Boston Celtics scored a total of 626,484 points.

You’ve got a taste for the capabilities of a pandas DataFrame . In the following sections, you’ll expand on the techniques you’ve just used, but first, you’ll zoom in and learn how this powerful data structure works.

Getting to Know pandas’ Data Structures

While a DataFrame provides functions that can feel quite intuitive, the underlying concepts are a bit trickier to understand. For this reason, you’ll set aside the vast NBA DataFrame and build some smaller pandas objects from scratch.

Python’s most basic data structure is the list , which is also a good starting point for getting to know pandas.Series objects. Create a new Series object based on a list:
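That could look like this:

    revenues = pd.Series([5555, 7000, 1980])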

You’ve used the list [5555, 7000, 1980] to create a Series object called revenues . A Series object wraps two components:

  • A sequence of values
  • A sequence of identifiers , which is the index

You can access these components with .values and .index , respectively:

revenues.values returns the values in the Series , whereas revenues.index returns the positional index.

Note: If you’re familiar with NumPy , then it might be interesting for you to note that the values of a Series object are actually n-dimensional arrays:

If you’re not familiar with NumPy, then there’s no need to worry! You can explore the ins and outs of your dataset with the pandas Python library alone. However, if you’re curious about what pandas does behind the scenes, then check out Look Ma, No for Loops: Array Programming With NumPy .

While pandas builds on NumPy, a significant difference is in their indexing . Just like a NumPy array, a pandas Series also has an integer index that’s implicitly defined. This implicit index indicates the element’s position in the Series .

However, a Series can also have an arbitrary type of index. You can think of this explicit index as labels for a specific row:
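A sketch of such a Series (the city names match the examples later in this tutorial; the revenue figures are illustrative):

    city_revenues = pd.Series(
        [4200, 8000, 6500],
        index=["Amsterdam", "Toronto", "Tokyo"],
    )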

Here, the index is a list of city names represented by strings. You may have noticed that Python dictionaries use string indices as well, and this is a handy analogy to keep in mind! You can use the code blocks above to distinguish between two types of Series :

  • revenues : This Series behaves like a Python list because it only has a positional index.
  • city_revenues : This Series acts like a Python dictionary because it features both a positional and a label index.

Here’s how to construct a Series with a label index from a Python dictionary:
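For example (the counts are illustrative; Toronto is deliberately absent, which matters in the DataFrame example below):

    city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})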

The dictionary keys become the index, and the dictionary values are the Series values.

Just like dictionaries, Series also support .keys() and the in keyword :

You can use these methods to answer questions about your dataset quickly.

While a Series is a pretty powerful data structure, it has its limitations. For example, you can only store one attribute per key. As you’ve seen with the nba dataset, which features 23 columns, the pandas Python library has more to offer with its DataFrame . This data structure is a sequence of Series objects that share the same index.

If you’ve followed along with the Series examples, then you should already have two Series objects with cities as keys:

  • city_revenues
  • city_employee_count

You can combine these objects into a DataFrame by providing a dictionary in the constructor. The dictionary keys will become the column names, and the values should contain the Series objects:
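A sketch of that constructor call:

    city_data = pd.DataFrame({
        "revenue": city_revenues,
        "employee_count": city_employee_count,
    })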

Note how pandas replaced the missing employee_count value for Toronto with NaN .

The new DataFrame index is the union of the two Series indices:

Just like a Series , a DataFrame also stores its values in a NumPy array:

You can also refer to the 2 dimensions of a DataFrame as axes :

The axis marked with 0 is the row index , and the axis marked with 1 is the column index . This terminology is important to know because you’ll encounter several DataFrame methods that accept an axis parameter.

A DataFrame is also a dictionary-like data structure, so it also supports .keys() and the in keyword. However, for a DataFrame these don’t relate to the index, but to the columns:

You can see these concepts in action with the bigger NBA dataset. Does it contain a column called "points" , or was it called "pts" ? To answer this question, display the index and the axes of the nba dataset, then expand the code block below for the solution:

Solution: NBA index

Because you didn’t specify an index column when you read in the CSV file, pandas has assigned a RangeIndex to the DataFrame :

nba , like all DataFrame objects, has two axes:

You can check the existence of a column with .keys() :

The column is called "pts" , not "points" .

As you use these methods to answer questions about your dataset, be sure to keep in mind whether you’re working with a Series or a DataFrame so that your interpretation is accurate.

Accessing Series Elements

In the section above, you’ve created a pandas Series based on a Python list and compared the two data structures. You’ve seen how a Series object is similar to lists and dictionaries in several ways. A further similarity is that you can use the indexing operator ( [] ) for Series as well.

You’ll also learn how to use two pandas-specific access methods :

You’ll see that these data access methods can be much more readable than the indexing operator.

Recall that a Series has two indices:

  • A positional or implicit index , which is always a RangeIndex
  • A label or explicit index , which can contain any hashable objects

Next, revisit the city_revenues object:

You can conveniently access the values in a Series with both the label and positional indices:

You can also use negative indices and slices, just like you would for a list:

If you want to learn more about the possibilities of the indexing operator, then check out Lists and Tuples in Python .

The indexing operator ( [] ) is convenient, but there’s a caveat. What if the labels are also numbers? Say you have to work with a Series object like this:
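A Series with numeric labels, reconstructed from the discussion that follows (the exact labels and color values are assumptions consistent with the examples below):

    colors = pd.Series(
        ["red", "purple", "blue", "green", "yellow"],
        index=[1, 2, 3, 5, 8],
    )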

What will colors[1] return? For a positional index, colors[1] is "purple" . However, if you go by the label index, then colors[1] is referring to "red" .

The good news is, you don’t have to figure it out! Instead, to avoid confusion, the pandas Python library provides two data access methods :

  • .loc refers to the label index .
  • .iloc refers to the positional index .

These data access methods are much more readable:

colors.loc[1] returned "red" , the element with the label 1 . colors.iloc[1] returned "purple" , the element with the index 1 .

The following figure shows which elements .loc and .iloc refer to:

Pandas Series iloc vs loc

Again, .loc points to the label index on the right-hand side of the image. Meanwhile, .iloc points to the positional index on the left-hand side of the picture.

It’s easier to keep in mind the distinction between .loc and .iloc than it is to figure out what the indexing operator will return. Even if you’re familiar with all the quirks of the indexing operator, it can be dangerous to assume that everybody who reads your code has internalized those rules as well!

Note: In addition to being confusing for Series with numeric labels, the Python indexing operator has some performance drawbacks . It’s perfectly okay to use it in interactive sessions for ad-hoc analysis, but for production code, the .loc and .iloc data access methods are preferable. For further details, check out the pandas User Guide section on indexing and selecting data .

.loc and .iloc also support the features you would expect from indexing operators, like slicing. However, these data access methods have an important difference. While .iloc excludes the closing element, .loc includes it. Take a look at this code block:

If you compare this code with the image above, then you can see that colors.iloc[1:3] returns the elements with the positional indices of 1 and 2 . The closing item "green" with a positional index of 3 is excluded.

On the other hand, .loc includes the closing element:

This code block says to return all elements with a label index between 3 and 8 . Here, the closing item "yellow" has a label index of 8 and is included in the output.

You can also pass a negative positional index to .iloc :

You start from the end of the Series and return the second element.

Note: There used to be an .ix indexer, which tried to guess whether it should apply positional or label indexing depending on the data type of the index. Because it caused a lot of confusion, it has been deprecated since pandas version 0.20.0.

It’s highly recommended that you do not use .ix for indexing. Instead, always use .loc for label indexing and .iloc for positional indexing. For further details, check out the pandas User Guide .

You can use the code blocks above to distinguish between two Series behaviors:

  • You can use .iloc on a Series similar to using [] on a list .
  • You can use .loc on a Series similar to using [] on a dictionary .

Be sure to keep these distinctions in mind as you access elements of your Series objects.

Accessing DataFrame Elements

Since a DataFrame consists of Series objects, you can use the very same tools to access its elements. The crucial difference is the additional dimension of the DataFrame . You’ll use the indexing operator for the columns and the access methods .loc and .iloc on the rows.

If you think of a DataFrame as a dictionary whose values are Series , then it makes sense that you can access its columns with the indexing operator:

Here, you use the indexing operator to select the column labeled "revenue" .

If the column name is a string, then you can use attribute-style accessing with dot notation as well:

city_data["revenue"] and city_data.revenue return the same output.

There’s one situation where accessing DataFrame elements with dot notation may not work or may lead to surprises. This is when a column name coincides with a DataFrame attribute or method name:

The indexing operation toys["shape"] returns the correct data, but the attribute-style operation toys.shape still returns the shape of the DataFrame . You should only use attribute-style accessing in interactive sessions or for read operations. You shouldn’t use it for production code or for manipulating data (such as defining new columns).

Similar to Series , a DataFrame also provides .loc and .iloc data access methods . Remember, .loc uses the label and .iloc the positional index:

Each line of code selects a different row from city_data :

  • city_data.loc["Amsterdam"] selects the row with the label index "Amsterdam" .
  • city_data.loc["Tokyo": "Toronto"] selects the rows with label indices from "Tokyo" to "Toronto" . Remember, .loc is inclusive.
  • city_data.iloc[1] selects the row with the positional index 1 , which is "Tokyo" .

Alright, you’ve used .loc and .iloc on small data structures. Now, it’s time to practice with something bigger! Use a data access method to display the second-to-last row of the nba dataset. Then, expand the code block below to see a solution:

Solution: NBA accessing rows

The second-to-last row is the row with the positional index of -2 . You can display it with .iloc :

You’ll see the output as a Series object.

For a DataFrame , the data access methods .loc and .iloc also accept a second parameter. While the first parameter selects rows based on the indices, the second parameter selects the columns. You can use these parameters together to select a subset of rows and columns from your DataFrame :

Note that you separate the parameters with a comma ( , ). The first parameter, "Amsterdam":"Tokyo" , says to select all rows between those two labels. The second parameter comes after the comma and says to select the "revenue" column.

It’s time to see the same construct in action with the bigger nba dataset. Select all games between the labels 5555 and 5559 . You’re only interested in the names of the teams and the scores, so select those elements as well. Expand the code block below to see a solution:

Solution: NBA accessing a subset

First, define which rows you want to see, then list the relevant columns:

You use .loc for the label index and a comma ( , ) to separate your two parameters.

You should see a small part of your quite huge dataset:

Pandas DataFrame .loc

The output is much easier to read!

With data access methods like .loc and .iloc , you can select just the right subset of your DataFrame to help you answer questions about your dataset.

You’ve seen how to access subsets of a huge dataset based on its indices. Now, you’ll select rows based on the values in your dataset’s columns to query your data. For example, you can create a new DataFrame that contains only games played after 2010:
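For example (the variable name is illustrative):

    games_after_2010 = nba[nba["year_id"] > 2010]
    games_after_2010.shape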

You now have 24 columns, but your new DataFrame only consists of rows where the value in the "year_id" column is greater than 2010 .

You can also select the rows where a specific field is not null:

This can be helpful if you want to avoid any missing values in a column. You can also use .notna() to achieve the same goal.

You can even access values of the object data type as str and perform string methods on them:

You use .str.endswith() to filter your dataset and find all games where the home team’s name ends with "ers" .

You can combine multiple criteria and query your dataset as well. To do this, be sure to put each one in parentheses and use the logical operators | and & to separate them.

Note: The operators and , or , && , and || won’t work here. If you’re curious as to why, then check out the section on how the pandas Python library uses Boolean operators in Python pandas: Tricks & Features You May Not Know .

Do a search for Baltimore games where both teams scored over 100 points. In order to see each game only once, you’ll need to exclude duplicates:
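A sketch of that query, assuming "BLB" is the Baltimore team ID in this dataset:

    nba[
        (nba["_iscopy"] == 0)
        & (nba["pts"] > 100)
        & (nba["opp_pts"] > 100)
        & (nba["team_id"] == "BLB")
    ]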

Here, you use nba["_iscopy"] == 0 to include only the entries that aren’t copies.

Your output should contain five eventful games:

Pandas DataFrame query with multiple criteria

Try to build another query with multiple criteria. In the spring of 1992, both teams from Los Angeles had to play a home game at another court. Query your dataset to find those two games. Both teams have an ID starting with "LA" . Expand the code block below to see a solution:

Solution: Queries

You can use .str to find the team IDs that start with "LA" , and you can assume that such an unusual game would have some notes:

Your output should show two games on the day 5/3/1992:

Pandas DataFrame query with multiple criteria: solution of the exercise

When you know how to query your dataset with multiple criteria, you’ll be able to answer more specific questions about your dataset.

You may also want to learn other features of your dataset, like the sum, mean, or average value of a group of elements. Luckily, the pandas Python library offers grouping and aggregation functions to help you accomplish this task.

A Series has more than twenty different methods for calculating descriptive statistics. Here are some examples:

The first method returns the total of city_revenues , while the second returns the max value. There are other methods you can use, like .min() and .mean() .

Remember, a column of a DataFrame is actually a Series object. For this reason, you can use these same functions on the columns of nba :

A DataFrame can have multiple columns, which introduces new possibilities for aggregations, like grouping :
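For example, total points per franchise:

    nba.groupby("fran_id", sort=False)["pts"].sum()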

By default, pandas sorts the group keys during the call to .groupby() . If you don’t want to sort, then pass sort=False . This parameter can lead to performance gains.

You can also group by multiple columns:

You can practice these basics with an exercise. Take a look at the Golden State Warriors’ 2014-15 season ( year_id: 2015 ). How many wins and losses did they score during the regular season and the playoffs? Expand the code block below for the solution:

Solution: Aggregation

First, you can group by the "is_playoffs" field, then by the result:

is_playoffs=0 shows the results for the regular season, and is_playoffs=1 shows the results for the playoffs.

In the examples above, you’ve only scratched the surface of the aggregation functions that are available to you in the pandas Python library. To see more examples of how to use them, check out pandas GroupBy: Your Guide to Grouping Data in Python .

You’ll need to know how to manipulate your dataset’s columns in different phases of the data analysis process. You can add and drop columns as part of the initial data cleaning phase, or later based on the insights of your analysis.

Create a copy of your original DataFrame to work with:
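For example:

    df = nba.copy()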

You can define new columns based on the existing ones:
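For instance, the points difference per game:

    df["difference"] = df.pts - df.opp_pts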

Here, you used the "pts" and "opp_pts" columns to create a new one called "difference" . This new column has the same functions as the old ones:

Here, you used an aggregation function .max() to find the largest value of your new column.

You can also rename the columns of your dataset. It seems that "game_result" and "game_location" are too verbose, so go ahead and rename them now:
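A sketch of that rename (the shorter names are an assumption):

    renamed_df = df.rename(
        columns={"game_result": "result", "game_location": "location"}
    )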

Note that there’s a new object, renamed_df . Like several other data manipulation methods, .rename() returns a new DataFrame by default. If you want to manipulate the original DataFrame directly, then .rename() also provides an inplace parameter that you can set to True .

Your dataset might contain columns that you don’t need. For example, Elo ratings may be a fascinating concept to some, but you won’t analyze them in this tutorial. You can delete the four columns related to Elo:
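A sketch of dropping them (the Elo-related column names are assumptions about this dataset):

    elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]
    df.drop(elo_columns, inplace=True, axis=1)
    df.shape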

Remember, you added the new column "difference" in a previous example, bringing the total number of columns to 25. When you remove the four Elo columns, the total number of columns drops to 21.

When you create a new DataFrame , either by calling a constructor or reading a CSV file, pandas assigns a data type to each column based on its values. While it does a pretty good job, it’s not perfect. If you choose the right data type for your columns up front, then you can significantly improve your code’s performance.

Take another look at the columns of the nba dataset:

You’ll see the same output as before:

Ten of your columns have the data type object . Most of these object columns contain arbitrary text, but there are also some candidates for data type conversion . For example, take a look at the date_game column:
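For example, converting it in place:

    df["date_game"] = pd.to_datetime(df["date_game"])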

Here, you use .to_datetime() to specify all game dates as datetime objects.

Other columns contain text that are a bit more structured. The game_location column can have only three different values:

Which data type would you use in a relational database for such a column? You would probably not use a varchar type, but rather an enum . pandas provides the categorical data type for the same purpose:
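For example:

    df["game_location"] = pd.Categorical(df["game_location"])
    df["game_location"].dtype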

categorical data has a few advantages over unstructured text. When you specify the categorical data type, you make validation easier and save a ton of memory, as pandas will only use the unique values internally. The higher the ratio of total values to unique values, the more space savings you’ll get.

Run df.info() again. You should see that changing the game_location data type from object to categorical has decreased the memory usage.

Note: The categorical data type also gives you access to additional methods through the .cat accessor. To learn more, check out the official docs .

You’ll often encounter datasets with too many text columns. An essential skill for data scientists to have is the ability to spot which columns they can convert to a more performant data type.

Take a moment to practice this now. Find another column in the nba dataset that has a generic data type and convert it to a more specific one. You can expand the code block below to see one potential solution:

Solution: Specifying Data Types

game_result can take only two different values:

To improve performance, you can convert it into a categorical column:

You can use df.info() to check the memory usage.

As you work with more massive datasets, memory savings becomes especially crucial. Be sure to keep performance in mind as you continue to explore your datasets.

Cleaning Data

You may be surprised to find this section so late in the tutorial! Usually, you’d take a critical look at your dataset to fix any issues before you move on to a more sophisticated analysis. However, in this tutorial, you’ll rely on the techniques that you’ve learned in the previous sections to clean your dataset.

Have you ever wondered why .info() shows how many non-null values a column contains? The reason is that this is vital information. Null values often indicate a problem in the data-gathering process. They can make several analysis techniques, like different types of machine learning , difficult or even impossible.

When you inspect the nba dataset with nba.info() , you’ll see that it’s quite neat. Only the column notes contains null values for the majority of its rows:

This output shows that the notes column has only 5424 non-null values. That means that over 120,000 rows of your dataset have null values in this column.

Sometimes, the easiest way to deal with records containing missing values is to ignore them. You can remove all the rows with missing values using .dropna() :
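For example (the variable name is illustrative):

    rows_without_missing_data = nba.dropna()
    rows_without_missing_data.shape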

Of course, this kind of data cleanup doesn’t make sense for your nba dataset, because it’s not a problem for a game to lack notes. But if your dataset contains a million valid records and a hundred where relevant data is missing, then dropping the incomplete records can be a reasonable solution.

You can also drop problematic columns if they’re not relevant for your analysis. To do this, use .dropna() again and provide the axis=1 parameter:
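Likewise, with another illustrative variable name:

    data_without_missing_columns = nba.dropna(axis=1)
    data_without_missing_columns.shape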

Now, the resulting DataFrame contains all 126,314 games, but not the sometimes empty notes column.

If there’s a meaningful default value for your use case, then you can also replace the missing values with that:
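A sketch of that replacement:

    data_with_default_notes = nba.copy()
    data_with_default_notes["notes"] = data_with_default_notes["notes"].fillna(
        "no notes at all"
    )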

Here, you fill the empty notes rows with the string "no notes at all" .

Invalid values can be even more dangerous than missing values. Often, you can perform your data analysis as expected, but the results you get are peculiar. This is especially important if your dataset is enormous or was entered manually. Invalid values are often more challenging to detect, but you can implement some sanity checks with queries and aggregations.

One thing you can do is validate the ranges of your data. For this, .describe() is quite handy. Recall that it returns the following output:

The year_id varies between 1947 and 2015. That sounds plausible.

What about pts ? How can the minimum be 0 ? Let’s have a look at those games:
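That query is simply:

    nba[nba["pts"] == 0]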

This query returns a single row:

Pandas DataFrame query

It seems the game was forfeited. Depending on your analysis, you may want to remove it from the dataset.

Sometimes a value would be entirely realistic in and of itself, but it doesn’t fit with the values in the other columns. You can define some query criteria that are mutually exclusive and verify that these don’t occur together.

In the NBA dataset, the values of the fields pts , opp_pts and game_result should be consistent with each other. You can check this using the .empty attribute:
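A sketch of those checks (assuming "W" and "L" are the win/loss codes stored in game_result):

    nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != "W")].empty
    nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != "L")].empty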

Fortunately, both of these queries return an empty DataFrame .

Be prepared for surprises whenever you’re working with raw datasets, especially if they were gathered from different sources or through a complex pipeline. You might see rows where a team scored more points than their opponent, but still didn’t win—at least, according to your dataset! To avoid situations like this, make sure you add further data cleaning techniques to your pandas and Python arsenal.

In the previous section, you’ve learned how to clean a messy dataset. Another aspect of real-world data is that it often comes in multiple pieces. In this section, you’ll learn how to grab those pieces and combine them into one dataset that’s ready for analysis.

Earlier , you combined two Series objects into a DataFrame based on their indices. Now, you’ll take this one step further and use .concat() to combine city_data with another DataFrame . Say you’ve managed to gather some data on two more cities:
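For example (the name further_city_data and the figures are illustrative):

    further_city_data = pd.DataFrame(
        {"revenue": [7000, 3400], "employee_count": [2, 2]},
        index=["New York", "Barcelona"],
    )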

This second DataFrame contains info on the cities "New York" and "Barcelona" .

You can add these cities to city_data using .concat() :
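For example, passing sort explicitly as the note below recommends:

    all_city_data = pd.concat([city_data, further_city_data], sort=False)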

Now, the new variable all_city_data contains the values from both DataFrame objects.

Note: As of pandas version 0.25.0, the sort parameter’s default value is True , but this will change to False soon. It’s good practice to provide an explicit value for this parameter to ensure that your code works consistently in different pandas and Python versions. For more info, consult the pandas User Guide .

By default, concat() combines along axis=0 . In other words, it appends rows. You can also use it to append columns by supplying the parameter axis=1 :

Note how pandas added NaN for the missing values. If you want to combine only the cities that appear in both DataFrame objects, then you can set the join parameter to inner :

While it’s most straightforward to combine data based on the index, it’s not the only possibility. You can use .merge() to implement a join operation similar to the one from SQL:
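A self-contained sketch of such a join; the two tables below are illustrative stand-ins rather than the tutorial's exact city and country data:

    cities = pd.DataFrame({
        "city": ["Amsterdam", "Tokyo", "Toronto", "New York", "Barcelona"],
        "country": ["Holland", "Japan", "Canada", None, None],
    })
    countries = pd.DataFrame(
        {
            "population_millions": [17, 127, 38],
            "continent": ["Europe", "Asia", "North America"],
        },
        index=["Holland", "Japan", "Canada"],
    )

    # Inner join by default: only cities with a matching country survive
    cities.merge(countries, left_on="country", right_index=True)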

Here, you pass the parameter left_on="country" to .merge() to indicate what column you want to join on. The result is a bigger DataFrame that contains not only city data, but also the population and continent of the respective countries:

Pandas merge

Note that the result contains only the cities where the country is known and appears in the joined DataFrame .

.merge() performs an inner join by default. If you want to include all cities in the result, then you need to provide the how parameter:
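Continuing the sketch above:

    cities.merge(countries, left_on="country", right_index=True, how="left")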

With this left join, you’ll see all the cities, including those without country data:

Pandas merge left join

Welcome back, New York & Barcelona!

Data visualization is one of the things that works much better in a Jupyter notebook than in a terminal, so go ahead and fire one up. If you need help getting started, then check out Jupyter Notebook: An Introduction . You can also access the Jupyter notebook that contains the examples from this tutorial by clicking the link below:

Include this line to show plots directly in the notebook:
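In a Jupyter notebook, that line is the usual matplotlib magic:

    %matplotlib inline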

Both Series and DataFrame objects have a .plot() method , which is a wrapper around matplotlib.pyplot.plot() . By default, it creates a line plot . Visualize how many points the Knicks scored throughout the seasons:
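For example:

    nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()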

This shows a line plot with several peaks and two notable valleys around the years 2000 and 2010:

Pandas plot line

You can also create other types of plots, like a bar plot :
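For example, the most common franchises (the cutoff of nine is an illustrative choice):

    nba["fran_id"].value_counts().head(9).plot(kind="bar")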

This will show the franchises with the most games played:

Pandas plot bar

The Lakers are leading the Celtics by a minimal edge, and there are six further teams with a game count above 5000.

Now try a more complicated exercise. In 2013, the Miami Heat won the championship. Create a pie plot showing the count of their wins and losses during that season. Then, expand the code block to see a solution:

Solution: Plot

First, you define criteria to include only the Heat’s games from 2013. Then, you create a plot in the same way as you’ve seen above:
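A sketch of that plot:

    nba[
        (nba["fran_id"] == "Heat") & (nba["year_id"] == 2013)
    ]["game_result"].value_counts().plot(kind="pie")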

Here’s what a champion pie looks like:

Pandas plot pie

The slice of wins is significantly larger than the slice of losses!

Sometimes, the numbers speak for themselves, but often a chart helps a lot with communicating your insights. To learn more about visualizing your data, check out Interactive Data Visualization in Python With Bokeh .

In this tutorial, you’ve learned how to start exploring a dataset with the pandas Python library. You saw how you could access specific rows and columns to tame even the largest of datasets. Speaking of taming, you’ve also seen multiple techniques to prepare and clean your data, by specifying the data type of columns, dealing with missing values, and more. You’ve even created queries, aggregations, and plots based on those.

Now you can:

  • Work with Series and DataFrame objects
  • Subset your data with .loc , .iloc , and the indexing operator
  • Answer questions with queries, grouping, and aggregation
  • Handle missing, invalid, and inconsistent data
  • Visualize your dataset in a Jupyter notebook

This journey using the NBA stats only scratches the surface of what you can do with the pandas Python library. You can power up your project with pandas tricks , learn techniques to speed up pandas in Python, and even dive deep to see how pandas works behind the scenes . There are many more features for you to discover, so get out there and tackle those datasets!

You can get all the code examples you saw in this tutorial by clicking the link below:


About Reka Horvath

Reka is an avid Pythonista and writes for Real Python.

Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are:

Aldren Santos



Keep Learning

Related Topics: basics data-science

Recommended Video Course: Explore Your Dataset With pandas




Exploratory Data Analysis in Python — A Step-by-Step Process

What exploratory analysis is, how it is structured, and how to apply it in Python with the help of pandas and other data analysis and visualization libraries.

Andrea D'Agostino

Towards Data Science

Article last updated: August 2023

Exploratory data analysis ( EDA ) is an especially important activity in the routine of a data analyst or scientist.

It enables an in-depth understanding of the dataset, helps define or discard hypotheses, and supports building predictive models on a solid basis.

It uses data manipulation techniques and several statistical tools to describe and understand the relationship between variables and how these can impact business.

In fact, it’s thanks to EDA that we can ask ourselves meaningful questions that can impact business.

In this article, I will share with you a template for exploratory analysis that I have used over the years and that has proven to be solid for many projects and domains. This is implemented through the use of the Pandas library — an essential tool for any analyst working with Python.

The process consists of several steps:

  • Importing a dataset

Written by Andrea D'Agostino

Data scientist. I write about data science, machine learning and analytics. I also write about career and productivity tips to help you thrive in the field.


Introduction to Python

Master the basics of data analysis with Python in just four hours. This online course will introduce the Python interface and explore popular packages.

Create Your Free Account

Loved by learners at thousands of companies.

Course description: An Introduction to Python. Discover the Python basics, explore Python functions and packages, and get started with NumPy.

Available in the following tracks: Data Analyst with Python, Associate Data Scientist in Python, Python Data Fundamentals.

Python Basics

An introduction to the basic concepts of Python. Learn how to use Python interactively and by using a script. Create your first variables and acquaint yourself with Python's basic data types.

Python Lists

Learn to store, access, and manipulate data in lists: the first step toward efficiently working with huge amounts of data.

Functions and Packages

You'll learn how to use functions, methods, and packages to efficiently leverage the code that brilliant Python developers have written. The goal is to reduce the amount of code you need to solve challenging problems!

NumPy is a fundamental Python package to efficiently practice data science. Learn to work with powerful tools in the NumPy array, and get started with data exploration.


Don’t just take our word for it



easy, simple, great for beginners!

The topics are quite clear and the platform is very interactive.

"I liked it because it was a very practical and objective course."

"The way of teaching is different, which has helped me a lot to deepen the topics of data science."

"Suited for first time self learners. Concepts are being well explained and further reinforced via excercises. I really like data camp."

Is Python OK for beginners?

Python is a popular choice for beginners because it’s readable and relatively simple to use. That’s why many data science beginners choose Python as their first programming language. As Python is free and open source, it also has a large community and extensive library support, so beginners can easily find answers to popular questions and discover pre-made packages to accelerate learning.

What is Python used for?

Python is a versatile programming language used in various fields. It is widely used for data analysis and visualization, with libraries such as pandas, NumPy, matplotlib, and seaborn. Python is also a popular choice for machine learning, software development, web development, and task automation or scripting. Additionally, it finds use in unique applications like monitoring the stock market, web scraping, and creating bots.

How do I get started with Python?

Taking an online course like this one is a great way to get started with Python - you’ll have the opportunity to learn in a structured way with regular exercises to put your learning into practice. You can choose to install Python on your machine or use software like DataCamp's Workspace to practice, collaborate, or work on projects without having to install Python to practice with it.

Why take an online Python course?

Taking an online Python course offers flexibility and convenience, allowing you to learn at your own pace and on your own schedule. DataCamp's courses provide a structured learning path with interactive exercises and real-world examples, making the learning process more effective than self-study.

What skills or experience do I need before taking a Python course?

You don't need any prior skills or experience to start learning Python, as it's often the first language beginners learn due to its user-friendly nature. While no prerequisites are required, basic computer literacy and problem-solving skills can enhance your learning experience. This Introduction to Python course is specifically designed for those with no prior coding experience, making it an ideal starting point.

How long does it take to learn Python?

How long it takes to learn Python varies greatly depending on your prior programming experience, the complexity of the concepts you're trying to grasp, and the time you can dedicate to learning. However, with a structured learning plan and consistent effort, you can grasp the basics in a few weeks and become somewhat proficient in a few months. This introductory Python course aims to kick start your learning journey, providing you with the initial foundations.

What jobs use Python?

Python is widely used in various professions, particularly those focused on data and web development. Direct applications of Python can be found in roles such as data scientist, data analyst, data engineer, machine learning engineer, data journalist, data architect, full-stack web developer, back-end web developer, DevOps engineer, and software engineer. Additionally, business analysts, bankers, and scientists may also use Python for tasks like data analysis, task automation, and market monitoring.

Does DataCamp offer Python Certification?

Yes. DataCamp's industry-recognized Certifications include two Python Certifications: Data Analyst and Data Scientist. Both Certifications are available to take in Python or R.

Join over 14 million learners and start Introduction to Python today!

Answers for Quizzes & Assignments that I have taken

Rice University - Python Data Analysis

Solutions for Rice's Python Data Analysis.

Rice University

Instructors: Joe Warren, Scott Rixner

Course Description

This course will continue the introduction to Python programming that started with Python Programming Essentials and Python Data Representations. We’ll learn about reading, storing, and processing tabular data, which are common tasks. We will also teach you about CSV files and Python’s support for reading and writing them. CSV files are a generic, plain text file format that allows you to exchange tabular data between different programs. These concepts and skills will help you to further extend your Python programming knowledge and allow you to process more complex data.

By the end of the course, you will be comfortable working with tabular data in Python. This will extend your Python programming expertise, enabling you to write a wider range of scripts using Python.


Fractal Analytics

Python for Data Science

This course is part of Fractal Data Science Professional Certificate

Taught in English


Instructor: Fractal Analytics


Recommended experience

Beginner level

No previous experience required

What you'll learn

Explain the significance of Python in data science and its real-world applications.

Apply Python to manipulate and analyze diverse data sources, using Pandas and relevant data types

Create informative data visualizations and draw insights from data distributions and feature relationships

Develop a comprehensive data preparation workflow for machine learning, including data rescaling and feature engineering

Skills you'll gain

  • Data cleaning and preprocessing
  • Data Analysis
  • Feature Engineering
  • Data transformation
  • Exploratory Data Analysis


Build your Data Analysis expertise

  • Learn new concepts from industry experts
  • Gain a foundational understanding of a subject or tool
  • Develop job-relevant skills with hands-on projects
  • Earn a shareable career certificate from Fractal Analytics


Earn a career certificate

Add this credential to your LinkedIn profile, resume, or CV

Share it on social media and in your performance review


There are 5 modules in this course

Understanding the importance of Python as a data science tool is crucial for anyone aspiring to leverage data effectively. This course is designed to equip you with the essential skills and knowledge needed to thrive in the field of data science.

This course teaches the vital skills to manipulate data using pandas, perform statistical analyses, and create impactful visualizations. Learn to solve real-world business problems and prepare data for machine learning applications. Get ready for some challenging assessments in which you'll apply your skills to real-world scenarios, ensuring a rewarding learning experience. Enroll in this course and take a step into the world of data-driven discoveries. No previous experience is required.

Introduction to Python for Data Science

In the first module of the Python for Data Science course, learners will be introduced to the fundamental concepts of Python programming. The module begins with the basics of Python, covering essential topics like an introduction to Python. Next, the module delves into working with Jupyter notebooks, a popular interactive environment for data analysis and visualization. Learners will learn how to set up Jupyter notebooks, create, run, and manage code cells, and integrate text and visualizations using Markdown. Additionally, the module will showcase real-life applications of Python in solving data-related problems. Learners will explore various data science projects and case studies where Python plays a crucial role, such as data cleaning, data manipulation, statistical analysis, and machine learning. By the end of this module, learners will have a good understanding of Python, be proficient in using Jupyter notebooks for data analysis, and comprehend how Python is used to address real-world data science challenges.

What's included

12 videos 6 readings 2 quizzes

12 videos • Total 60 minutes

  • Welcome to python for data science • 5 minutes • Preview module
  • Expert Talk - A data scientist's experience with Python • 3 minutes
  • What is python? • 3 minutes
  • Working with Jupyter notebooks • 7 minutes
  • Introduction to the problem • 4 minutes
  • Solution approach - Preparing tables and charts • 3 minutes
  • Solution approach - Gaining Insights • 4 minutes
  • Solution Approach - Airline traffic analysis • 4 minutes
  • Solution summary • 3 minutes
  • Expert Talk - Why Python is the language of choice for data science professionals • 9 minutes
  • Introduction to the Problem • 4 minutes
  • Exploring the Problem • 4 minutes

6 readings • Total 60 minutes

  • Course syllabus • 10 minutes
  • Installation guide • 10 minutes
  • Working effectively with Jupyter notebooks • 10 minutes
  • Important note! • 10 minutes
  • The Global Problem Statement • 10 minutes
  • Tell us what you think! • 10 minutes

2 quizzes • Total 60 minutes

  • Python fundamentals • 30 minutes
  • Data Analysis • 30 minutes

Data wrangling with Python

By the end of this module, learners will acquire essential skills in working with various types of data. They will have a solid grasp of Python programming fundamentals, including data structures and libraries. They will be proficient in loading, cleaning, and transforming data, and will possess the ability to perform exploratory data analysis, employing data visualization techniques. They will also gain insights into basic statistical concepts, such as probability, distributions, and hypothesis testing.

32 videos 4 readings 6 quizzes 2 programming assignments 5 ungraded labs

32 videos • Total 174 minutes

  • Introduction • 0 minutes • Preview module
  • Diving into CSV Data • 6 minutes
  • Data inspection • 5 minutes
  • Finding missing data in the POS data • 6 minutes
  • Deleting missing data and saving the cleaned data set • 7 minutes
  • Lab data and problem • 3 minutes
  • A note on assessments • 0 minutes
  • Basic data structures - lists and dictionaries • 13 minutes
  • Basic data structures - series • 3 minutes
  • Creating a data frame using lists, dictionaries and series • 3 minutes
  • Slicing with precision • 5 minutes
  • Changing the indices and saving the new DataFrame • 4 minutes
  • Navigating data insights • 5 minutes
  • Selecting data that match certain criteria • 4 minutes
  • Selecting data that match multiple criteria • 3 minutes
  • Expert Talk - Understanding your data • 5 minutes
  • What are the unique products in the POS data set? • 6 minutes
  • Finding specific values in the data • 7 minutes
  • How much did we sell per category? • 5 minutes
  • Finding totals and averages by brand and by category • 6 minutes
  • Grouping by multiple attributes • 5 minutes
  • Displaying aggregated data in a pivot table • 8 minutes
  • Expert talk - How insights and data analysis guide each other • 5 minutes
  • Working with dates • 6 minutes
  • How much did we sell each month? • 6 minutes
  • What is the monthly average of sales? • 5 minutes
  • Were there specific dates when sales were high? • 8 minutes
  • What if we have more than one dataset? • 5 minutes
  • Merging some simple data sets • 5 minutes
  • Merging POS data with the online data • 6 minutes
  • Walkthrough - How to approach a graded assignment • 3 minutes
  • Summary • 1 minute

4 readings • Total 60 minutes

  • Data cleaning with python • 10 minutes
  • Resources - Datasets and Jupyter notebooks • 10 minutes
  • Python statistics fundamentals • 10 minutes
  • Working with dates • 30 minutes

6 quizzes • Total 180 minutes

  • DataFrame essentials • 30 minutes
  • DataFrame operations • 30 minutes
  • Data selection & filtering • 30 minutes
  • Data manipulation & aggregation • 30 minutes
  • Date time operations • 30 minutes
  • Merging & joining dataframes • 30 minutes

2 programming assignments • Total 300 minutes

  • New Programming Assignment • 180 minutes
  • Graded Assignment • 120 minutes

5 ungraded labs • Total 150 minutes

  • Data cleaning & manipulation • 30 minutes
  • Data slicing & manipulations • 30 minutes
  • Data aggregations • 30 minutes
  • Practice Programming Assignment • 30 minutes
  • Merging the data • 30 minutes

Exploratory data analysis

By the end of this module, learners will gain a comprehensive understanding of statistical concepts, data exploration techniques, and visualization methods. Learners will develop the skills to identify patterns, outliers, and relationships in data, making informed decisions and formulating hypotheses. Ultimately, they will emerge with the ability to transform raw data into meaningful insights, effectively communicate their findings through data storytelling, and apply EDA across diverse real-world applications.

34 videos 1 reading 5 quizzes 1 programming assignment 4 ungraded labs

34 videos • Total 205 minutes

  • Expert Talk - Why EDA is a superpower • 6 minutes
  • Finding the average of the data • 6 minutes
  • Understanding the spread of the data • 9 minutes
  • Quantiles - how to understand and visualize them • 7 minutes
  • Exploring variability in the POS data • 6 minutes
  • What shape is my data? • 6 minutes
  • Understanding the distributions of features in the POS data • 6 minutes
  • Understanding Data Distributions • 4 minutes
  • Some other common shapes of data - Part I • 10 minutes
  • Some other common shapes of data - Part I • 6 minutes
  • Some other common shapes of data - Part II • 8 minutes
  • Some other common shapes of data - Part III • 8 minutes
  • What is the chance that revenue falls in a given range? • 3 minutes
  • How are the features related to each other? - Part I • 5 minutes
  • How are the features related to each other? - Part II • 4 minutes
  • How are the features related to each other? - Part II • 5 minutes
  • Visualizing categorical features • 6 minutes
  • Visualizing proportions • 7 minutes
  • Expert Talk - Power of visualization & its importance in storytelling • 7 minutes
  • Using boxplots to compare revenues across segments in the POS data • 7 minutes
  • Making better visuals - Part III • 9 minutes
  • Communicating insights better by creating multiple subplots within the same plot • 2 minutes
  • Comparing the distribution of revenue for each sector by overlaying their KDE plots • 7 minutes
  • Sampling our data - Part I • 5 minutes
  • Sampling our data - Part II • 4 minutes
  • Introduction to hypothesis testing - Part I • 5 minutes
  • Introduction to hypothesis testing - Part II • 4 minutes
  • Hypothesis testing using Z - Test - Part I • 6 minutes
  • Hypothesis testing using Z - Test - Part II • 5 minutes
  • Hypothesis testing using t - Test • 6 minutes
  • Hypothesis testing using Chi-square test • 7 minutes

1 reading • Total 10 minutes

5 quizzes • Total 150 minutes

  • Statistics fundamentals • 30 minutes
  • Data distributions • 30 minutes
  • Understanding relationships between features • 30 minutes
  • Practice Quiz • 30 minutes
  • Practice quiz • 30 minutes

1 programming assignment • Total 120 minutes

4 ungraded labs • Total 120 minutes

  • Understanding data distributions • 30 minutes

Data pre-processing

By the end of this module, learners will acquire the essential skills to transform raw and often messy data into a structured format suitable for advanced analysis. They will master techniques for handling missing values, identifying and dealing with outliers, encoding categorical variables, scaling and normalizing numerical features, and handling textual or unstructured data. Learners will also be proficient in detecting and addressing data inconsistencies, such as duplicates and errors, leaving the data ready for further analysis.
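
A minimal pre-processing sketch covering these steps (again assuming a hypothetical POS file with a numeric revenue column and a categorical category column) might look like this:

    import numpy as np
    import pandas as pd

    pos = pd.read_csv("pos_data.csv")

    # Missing values: impute a numeric column, drop rows missing the category
    pos["revenue"] = pos["revenue"].fillna(pos["revenue"].median())
    pos = pos.dropna(subset=["category"])

    # Binning / discretization and one-hot encoding of categoric features
    pos["revenue_band"] = pd.cut(pos["revenue"], bins=[0, 50, 200, np.inf],
                                 labels=["low", "mid", "high"])
    pos = pd.get_dummies(pos, columns=["category"])

    # Min-max and z-score normalization
    pos["revenue_minmax"] = (pos["revenue"] - pos["revenue"].min()) / (
        pos["revenue"].max() - pos["revenue"].min())
    pos["revenue_zscore"] = (pos["revenue"] - pos["revenue"].mean()) / pos["revenue"].std()

    # Log transformation and IQR-based outlier capping
    pos["revenue_log"] = np.log1p(pos["revenue"])
    q1, q3 = pos["revenue"].quantile([0.25, 0.75])
    iqr = q3 - q1
    pos["revenue_capped"] = pos["revenue"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)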

25 videos 2 readings 3 quizzes 1 programming assignment 3 ungraded labs

25 videos • Total 134 minutes

  • Introduction • 4 minutes • Preview module
  • Expert Talk - Handling missing data • 7 minutes
  • What to do with missing values? • 5 minutes
  • Missing values in the POS data • 2 minutes
  • Missing values within a hierarchy • 7 minutes
  • Missing values within a hierarchy (contd.) • 5 minutes
  • What if parts of the hierarchy are also missing? • 2 minutes
  • Finishing up missing value treatment in the POS data • 5 minutes
  • Missing values - another simpler example • 8 minutes
  • Working with categoric features • 4 minutes
  • Transforming features - binning and discretization • 8 minutes
  • Transforming features - binning and discretization (contd.) • 6 minutes
  • Encoding categoric features - one-hot and label encoding • 9 minutes
  • Encoding features in the POS data • 5 minutes
  • Finishing up the encoding and saving the encoded data • 3 minutes
  • What is data normalization and why do we need it? • 4 minutes
  • Data normalization using min-max scaling • 5 minutes
  • Data normalization using z-score scaling • 4 minutes
  • Other types of data transformation • 4 minutes
  • Applying log transformation to the online data • 4 minutes
  • Finding outlying data • 5 minutes
  • Removing outliers by dropping them • 4 minutes
  • How to deal with outliers - imputation • 6 minutes
  • How to deal with outliers - capping • 4 minutes
  • Summary • 2 minutes

2 readings • Total 40 minutes

  • Data pre-processing • 30 minutes

3 quizzes • Total 90 minutes

  • Missing values • 30 minutes
  • Dealing with categorical data • 30 minutes
  • Data normalization • 30 minutes

3 ungraded labs • Total 90 minutes

  • Handling missing values • 30 minutes
  • Handling categorical features • 30 minutes
  • Data normalization & treating outliers • 30 minutes


Feature engineering

By the end of this module, learners will develop a deep understanding of how to craft and enhance features to optimize the performance of machine learning models. They will be adept at identifying relevant variables, creating new features through techniques such as one-hot encoding, binning, and polynomial expansion, and extracting valuable information from existing data, such as dates or text, using feature extraction and text vectorization. Learners will also grasp feature scaling and normalization to keep feature ranges consistent and comparable. With these skills, they will be able to shape data effectively, amplifying its predictive power and contributing to robust, high-performing machine learning pipelines.
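
A minimal sketch of the dimensionality-reduction part of this module (assuming a hypothetical CSV export of the obesity data set and scikit-learn being available) might look like this:

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("obesity.csv")           # hypothetical file name
    X = pd.get_dummies(df, drop_first=True)   # encode categoric features numerically

    # Scale features so PCA is not dominated by large-valued columns
    X_scaled = StandardScaler().fit_transform(X)

    # Keep enough principal components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)
    print(X.shape, "->", X_reduced.shape)
    print(pca.explained_variance_ratio_)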

11 videos 2 readings 1 quiz 1 programming assignment 1 ungraded lab

11 videos • Total 53 minutes

  • Introduction • 1 minute • Preview module
  • Reducing the dimensionality of data sets • 6 minutes
  • Exploring the features of the obesity data set • 7 minutes
  • What is Principal Component Analysis (PCA)? • 7 minutes
  • Applying PCA to the obesity data • 4 minutes
  • Creating a transformed version of the data through feature engineering • 8 minutes
  • Expert Talk - Gen AI in Python • 4 minutes
  • Introduction to Gen AI in Python for Data science • 3 minutes
  • Some quick data analysis using PandasAI • 4 minutes
  • Some quick data visualization using PandasAI • 3 minutes
  • Complete guide to Feature Engineering • 30 minutes

1 quiz • Total 30 minutes

  • Feature engineering & PCA • 30 minutes

1 ungraded lab • Total 30 minutes

  • Dimensionality reduction, PCA • 30 minutes

Instructor ratings

We asked all learners to give feedback on our instructors based on the quality of their teaching style.

Continuous learning is imperative to stay relevant in the world of Data Analytics and AI. Fractal Analytics Academy is your learning partner for all your learning requirements. We offer a variety of learning solutions, from instructor-led training to blended learning and eLearning, covering consulting and business skills, technical skills, and life skills.

Recommended if you're interested in Data Analysis

Google Cloud

Serverless Data Processing with Dataflow: Foundations (Español)

Fractal Analytics

Insights of Power BI

Foundations of Machine Learning

Advanced Machine Learning Algorithms

Why people choose Coursera for their career

Learner reviews

Showing 3 of 32

Reviewed on Nov 29, 2023

It's a great course if you want to learn how to apply concepts in solving real business problems.

Reviewed on Nov 15, 2023

All the experts shared their knowledge in a comprehensive way; great work.

Reviewed on Feb 19, 2024

Good course. Needs more in-depth details with case studies.

New to Data Analysis? Start here.

Open new doors with Coursera Plus

Unlimited access to 7,000+ world-class courses, hands-on projects, and job-ready certificate programs - all included in your subscription

Advance your career with an online degree

Earn a degree from world-class universities - 100% online

Join over 3,400 global companies that choose Coursera for Business

Upskill your employees to excel in the digital economy

Frequently asked questions

When will I have access to the lectures and assignments?

Access to lectures and assignments depends on your type of enrollment. If you take a course in audit mode, you will be able to see most course materials for free. To access graded assignments and to earn a Certificate, you will need to purchase the Certificate experience, during or after your audit. If you don't see the audit option:

The course may not offer an audit option. You can try a Free Trial instead, or apply for Financial Aid.

The course may offer 'Full Course, No Certificate' instead. This option lets you see all course materials, submit required assessments, and get a final grade. This also means that you will not be able to purchase a Certificate experience.

What will I get if I subscribe to this Certificate?

When you enroll in the course, you get access to all of the courses in the Certificate, and you earn a certificate when you complete the work. Your electronic Certificate will be added to your Accomplishments page - from there, you can print your Certificate or add it to your LinkedIn profile. If you only want to read and view the course content, you can audit the course for free.

What is the refund policy?

If you subscribed, you get a 7-day free trial during which you can cancel at no penalty. After that, we don’t give refunds, but you can cancel your subscription at any time. See our full refund policy.

More questions

COMMENTS

  1. Python Data Analysis Example: A Step-by-Step Guide for Beginners

    Step 3: Exploratory Data Analysis. The next stage is to start analyzing your data by calculating summary statistics, plotting histograms and scatter plots, or performing statistical tests. The goal is to gain a better understanding of the variables, and then use this understanding to guide the rest of the analysis.
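
    A toy illustration of that step (with made-up data standing in for a real dataset) could look like:

        import matplotlib.pyplot as plt
        import pandas as pd

        # made-up data standing in for a real dataset
        df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6]})

        print(df.describe())             # summary statistics
        df["x"].hist(bins=5)             # histogram of one variable
        df.plot.scatter(x="x", y="y")    # scatter plot of two variables
        plt.show()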

  2. Data Analysis with Python

    There are 6 modules in this course. Analyzing data with Python is an essential skill for Data Scientists and Data Analysts. This course will take you from the basics of data analysis with Python to building and evaluating data models. Topics covered include: - collecting and importing data - cleaning, preparing & formatting data - data frame ...

  3. Using Python for Data Analysis

    Data analysis is a broad term that covers a wide range of techniques that enable you to reveal any insights and relationships that may exist within raw data. As you might expect, Python lends itself readily to data analysis. Once Python has analyzed your data, you can then use your findings to make good business decisions, improve procedures, and even make informed predictions based on what ...

  4. Python Data Analysis Example: Ames Housing Dataset

    Step 1: Import Data. To start the analysis, we must import the data into Python. A dataset may come in various formats, e.g. CSV, JSON, or Excel. CSV stands for comma-separated values. It is a text file that stores tabular data, where one line represents one data record.
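
    A minimal version of that import step (with "ames.csv" as a hypothetical local copy of the dataset) might be:

        import pandas as pd

        df = pd.read_csv("ames.csv")   # Pandas also reads JSON (read_json) and Excel (read_excel)
        print(df.head())               # first few records
        print(df.shape)                # (number of rows, number of columns)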

  5. Yash-Kavaiya/data-analytics-with-python-nptel

    Repository for the NPTEL course 'Data Analytics with Python'. Includes code examples, notebooks, and exercises covering data manipulation, visualization, statistics, and machine learning with Python libraries like NumPy, Pandas, and Matplotlib. ... ROC And Regression Analysis Model Building: Week 10: C2 Test And Introduction To Cluster Analysis ...

  6. Data Analysis with Python: Zero to Pandas

    Assignment 1 - Python Basics Practice. ... Data Analysis with Python: Zero to Pandas is an online course intended to provide a coding-first introduction to data analysis. The course takes a hands-on coding-focused approach and will be taught using live interactive Jupyter notebooks, allowing students to follow along and experiment. ...

  7. Introduction to Data Science in Python

    There are 4 modules in this course. This course will introduce the learner to the basics of the python programming environment, including fundamental python programming techniques such as lambdas, reading and manipulating csv files, and the numpy library. The course will introduce data manipulation and cleaning techniques using the popular ...

  8. Data Analysis Using Python

    There are 3 modules in this course. This course provides an introduction to basic data science techniques using Python. Students are introduced to core concepts like Data Frames and joining data, and learn how to use data analysis libraries like pandas, numpy, and matplotlib. This course provides an overview of loading, inspecting, and querying ...

  9. harshimm/Data-Analysis-with-Python-Coursera

    Final Peer Graded Assignment. Contribute to harshimm/Data-Analysis-with-Python-Coursera development by creating an account on GitHub.

  10. ivanleom/IBM---Data-Analysis-with-Python

    This folder contains the Week 6 final assignment of the Coursera course -- Data Analysis with Python offered by IBM. - ivanleom/IBM---Data-Analysis-with-Python

  11. Final Assignment for Data Analysis with Python course on ...

    Final Assignment for Data Analysis with Python course on Coursera provided by IBM - Final Assignment.ipynb

  12. Python For Data Analysis

    Prerequisites: Python data analysis libraries. NumPy: provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. Pandas: offers data structures and operations for manipulating numerical tables and time series, making data cleaning, exploration, and analysis more straightforward.
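
    A tiny, self-contained example of the two libraries named above:

        import numpy as np
        import pandas as pd

        arr = np.array([[1, 2, 3], [4, 5, 6]])          # multi-dimensional array
        print(arr.mean(axis=0))                          # column means

        df = pd.DataFrame(arr, columns=["a", "b", "c"])  # tabular structure on top of the array
        print(df.describe())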

  13. Python Exercises, Practice, Challenges

    Each exercise has 10-20 Questions. The solution is provided for every question. Practice each Exercise in Online Code Editor. These Python programming exercises are suitable for all Python developers. If you are a beginner, you will have a better understanding of Python after solving these exercises. Below is the list of exercises.

  14. 7 Datasets to Practice Data Analysis in Python

    Python is a great tool for data analysis - in fact, it has become very popular, as we discuss in Python's Role in Big Data and Analytics. For Python beginners to become proficient in data analysis, they need to develop their programming and analysis knowledge. And the best way to do this is by creating your own data analysis projects.

  15. Python Data Analytics

    This course introduces the use of the Python programming language to manipulate datasets as an alternative to spreadsheets. You will follow the OSEMN framework of data analysis to pull, clean, manipulate, and interpret data all while learning foundational programming principles and basic Python functions. You will be introduced to the Python ...

  16. Data Analysis with Python

    Data analysis using Python's Pandas library is a powerful process, and its efficiency can be enhanced with specific tricks and techniques. These Python tips will make our code concise, readable, and efficient. The adaptability of Pandas makes it an efficient tool for working with structured data, whether you are a beginner or an experienced data scientist ...

  17. MyTarn/IBM_Data_Analysis_with_Python

    This is a Data Analysis with Python project (part of the Coursera IBM Data Science Professional Certificate). - MyTarn/IBM_Data_Analysis_with_Python. ... Hands-on Assignment: Data Analysis House Sales in King County, USA. Instruction: assigned to a Data Analyst working at a Real Estate Investment Trust; the Trust would like to start investing in residential real ...

  18. Using pandas and Python to Explore Your Dataset

    You can see how much data nba contains:

        >>> len(nba)
        126314
        >>> nba.shape
        (126314, 23)

    You use the Python built-in function len() to determine the number of rows. You also use the .shape attribute of the DataFrame to see its dimensionality. The result is a tuple containing the number of rows and columns.

  19. Exploratory Data Analysis in Python

    Exploratory data analysis (EDA) is an especially important activity in the routine of a data analyst or scientist. ... Learn how to perform a top-tier Exploratory Data Analysis in Python: Exploratory Data Analysis in Python — A Step-by-Step Process; Learn the basics of TensorFlow: Get started with TensorFlow 2.0 — Introduction to deep learning;

  20. Introduction to Python Course

    Introduction to Python • 4.7 • 2,204 reviews • Beginner. Master the basics of data analysis with Python in just four hours. This online course will introduce the Python interface and explore popular packages. 4 hours • 11 videos • 47 exercises • 5,679,092 learners • Statement of Accomplishment.

  21. Prabhu-369/Data-Analysis-with-Python-by-IBM-on-Coursera

  22. Rice University

    Course Description. This course will continue the introduction to Python programming that started with Python Programming Essentials and Python Data Representations. We'll learn about reading, storing, and processing tabular data, which are common tasks. We will also teach you about CSV files and Python's support for reading and writing them.

  23. Python for Data Science

    In the first module of the Python for Data Science course, learners will be introduced to the fundamental concepts of Python programming. The module begins with the basics of Python, covering essential topics like introduction to Python. Next, the module delves into working with Jupyter notebooks, a popular interactive environment for data analysis and visualization.