[Python / pandas] Series and DataFrame Basics

import pandas as pd

pandas is a library built on top of NumPy.
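As a rough illustration of that relationship, the sketch below shows that the values of a Series can be exported as a NumPy array with to_numpy(), and that a NumPy function can be applied to a Series directly. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# The underlying values can be exported as a NumPy array.
arr = s.to_numpy()
print(type(arr))            # <class 'numpy.ndarray'>

# NumPy functions can be applied to a Series element-wise;
# the result is again a Series.
squared = np.square(s)
print(squared.tolist())     # [1, 4, 9, 16, 25]
```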

Series

A Series is a one-dimensional data structure. It can be created from list-like data.

s1 = pd.Series([1,2,3,4,5])
s1

0    1
1    2
2    3
3    4
4    5
dtype: int64

s2 = pd.Series(['one', 'two', 'three', 'four', 'five'])
s2

0      one
1      two
2    three
3     four
4     five
dtype: object
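The leftmost column in the outputs above (0, 1, 2, ...) is the index that pandas assigns by default. A minimal sketch, assuming we want our own labels instead; the labels 'a', 'b', 'c' are made up for illustration:

```python
import pandas as pd

# Assumed example: custom index labels instead of the default 0, 1, 2.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s['b'])           # value by label -> 20
print(s.iloc[0])        # value by position -> 10
print(list(s.index))    # ['a', 'b', 'c']
```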

DataFrame

A DataFrame is a two-dimensional data structure that holds tabular data, that is, data organized in rows and columns.

A DataFrame is made up of Series: each column is a Series.

# Convert a dictionary into a DataFrame with pd.DataFrame().

data = {
    '이름' : ['철수', '영희', '민수', '명수'],   # name
    '나이' : [10, 11, 9, 13],                  # age
    '성별' : ['남', '여', '남', '남'],          # sex ('남' = male, '여' = female)
}

df = pd.DataFrame(data)
df
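To check how the dictionary was laid out, the sketch below rebuilds the same DataFrame and inspects its shape and columns. It also shows that selecting one column gives back a Series. The variable name ages is only for illustration.

```python
import pandas as pd

data = {
    '이름': ['철수', '영희', '민수', '명수'],   # name
    '나이': [10, 11, 9, 13],                  # age
    '성별': ['남', '여', '남', '남'],          # sex
}
df = pd.DataFrame(data)

# Each dictionary key becomes a column; each list becomes that column's values.
print(df.shape)               # (4, 3): 4 rows, 3 columns
print(list(df.columns))       # ['이름', '나이', '성별']

# Selecting a single column returns a Series.
ages = df['나이']
print(type(ages).__name__)    # Series
```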

# Alternatively, build a DataFrame from two Series.

s1 = pd.Series([1,2,3,4,5])
s2 = pd.Series(['one', 'two', 'three', 'four', 'five'])

# The column named 'num' holds the values of s1.
# The column named 'word' holds the values of s2.
df1 = pd.DataFrame(data = dict(num = s1, word = s2))
df1
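Here s1 and s2 share the same default index (0 to 4), so the rows line up one to one. When the indexes differ, pandas aligns the Series on their index labels rather than on position. A small sketch with hypothetical Series a and b:

```python
import pandas as pd

# Hypothetical Series whose index labels only partly overlap.
a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series(['two', 'three', 'four'], index=[1, 2, 3])

df_aligned = pd.DataFrame({'num': a, 'word': b})
print(df_aligned)

# The resulting index is the union of both indexes: 0, 1, 2, 3.
# A label missing from one Series becomes NaN in that column.
print(df_aligned['num'].isna().tolist())    # [False, False, False, True]
print(df_aligned['word'].isna().tolist())   # [True, False, False, False]
```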

# Besides a dictionary, can a plain list also be passed as input? (Yes)

data = [1,2,3,4,5]

df = pd.DataFrame(data)
df

# Convert a 2D list (a list of lists) into a DataFrame.

data = [[1,10], [2,20], [3,30], [4,40], [5,50], [6,60], [7,70], [8,80], [9,90]]

df = pd.DataFrame(data)
df
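A list carries no column names, so pandas labels the columns 0, 1, and so on. A minimal sketch of naming them with the columns parameter; the names 'num' and 'times_ten' are made up for illustration:

```python
import pandas as pd

rows = [[1, 10], [2, 20], [3, 30]]

# Without column names, pandas labels the columns 0, 1, ...
df_default = pd.DataFrame(rows)
print(list(df_default.columns))    # [0, 1]

# Each inner list is one row; columns= names the columns.
df_named = pd.DataFrame(rows, columns=['num', 'times_ten'])
print(list(df_named.columns))      # ['num', 'times_ten']
print(df_named.shape)              # (3, 2)
```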

Accessing Data

# First, create a DataFrame.

data = {
    '이름' : ['철수', '영희', '지수', '웅이'],   # name
    '나이' : [25, 26, 27, 25],                 # age
    '키' : [175, 162, 181, 200],               # height
    '체중' : [70, 60, 80, 100],                # weight
    '주소' : ['안양', '서울', '파주', '인천'],   # address (city)
}

df = pd.DataFrame(data)
df

Suppose we want the values in the '체중' (weight) column. A column can be selected in either of two ways:

df['체중']
df.체중

df['체중']

0     70
1     60
2     80
3    100
Name: 체중, dtype: int64

df.체중

0     70
1     60
2     80
3    100
Name: 체중, dtype: int64
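Both forms return the same Series here, but they are not always interchangeable. The sketch below uses hypothetical column names to show two cases where only bracket access reaches the column: a name that collides with a DataFrame method, and a name containing a space.

```python
import pandas as pd

# Hypothetical column names chosen to show the difference.
df_demo = pd.DataFrame({
    'weight': [70, 60, 80],
    'count': [1, 2, 3],                # same name as the DataFrame.count() method
    'body height': [175, 162, 181],    # contains a space
})

print(df_demo.weight.tolist())           # [70, 60, 80]: attribute access works
print(df_demo['count'].tolist())         # [1, 2, 3]
print(callable(df_demo.count))           # True: this is the method, not the column
print(df_demo['body height'].tolist())   # [175, 162, 181]: brackets are required

# A list of names selects several columns and returns a DataFrame.
subset = df_demo[['weight', 'body height']]
print(subset.shape)                      # (3, 2)
```

Bracket access works for every column name, so it is the safer default.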

Boolean indexing can also be used to select only the rows that satisfy a condition.

Writing DataFrame[condition] performs boolean indexing: only the rows for which the condition is True are kept.

# Select only the rows where 체중 (weight) is 70 or more.

df[df['체중'] >= 70]

# Select only the rows where 체중 (weight) is less than 70.

df[df['체중'] < 70]
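The condition inside the brackets is itself a boolean Series, and several conditions can be combined. A minimal sketch reusing part of the data above; the names df_people, mask, and heavy_and_25 are only for illustration:

```python
import pandas as pd

df_people = pd.DataFrame({
    '이름': ['철수', '영희', '지수', '웅이'],   # name
    '나이': [25, 26, 27, 25],                 # age
    '체중': [70, 60, 80, 100],                # weight
})

# The condition evaluates to one True/False value per row.
mask = df_people['체중'] >= 70
print(mask.tolist())                        # [True, False, True, True]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
heavy_and_25 = df_people[(df_people['체중'] >= 70) & (df_people['나이'] == 25)]
print(heavy_and_25['이름'].tolist())         # ['철수', '웅이']

# ~ negates a condition.
print(df_people[~mask]['이름'].tolist())     # ['영희']
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.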

Loading External Data (CSV Files)

## Load the Titanic CSV file from Kaggle.

df = pd.read_csv('C:\\Users\\Juhee\\Desktop\\titanic\\train.csv')

df
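To try read_csv without the file on disk, the sketch below reads a small in-memory CSV instead. The text is a three-row, five-column excerpt of the Titanic data shown in this post, and io.StringIO stands in for the file path. The index_col and usecols arguments are optional extras, not something the original call uses.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for train.csv (first three rows, five columns).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
3,1,3,female,26.0
"""

# read_csv accepts a file path or any file-like object.
df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small.shape)               # (3, 5)

# index_col uses a column as the row index; usecols keeps only some columns.
df_indexed = pd.read_csv(
    io.StringIO(csv_text),
    index_col='PassengerId',
    usecols=['PassengerId', 'Survived', 'Age'],
)
print(list(df_indexed.columns))     # ['Survived', 'Age']
print(list(df_indexed.index))       # [1, 2, 3]
```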

head() and tail()

# head(): shows the first 5 rows.
df.head()

# tail(): shows the last 5 rows.
df.tail()
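Five is only the default. Both methods take an optional row count, as in this small sketch on a made-up ten-row DataFrame:

```python
import pandas as pd

df_nums = pd.DataFrame({'n': range(10)})    # rows 0..9

print(df_nums.head(3)['n'].tolist())    # [0, 1, 2]
print(df_nums.tail(2)['n'].tolist())    # [8, 9]
print(len(df_nums.head()))              # 5 (the default)
```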

describe()

Shows basic summary statistics for the numeric columns of a DataFrame.

df.describe()
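By default describe() skips non-numeric columns such as Name or Sex. A minimal sketch of that behavior on a two-column excerpt (Age and Sex of the first three passengers); the include='all' argument is an optional extra not used in the original call:

```python
import pandas as pd

df_mixed = pd.DataFrame({
    'age': [22.0, 38.0, 26.0],
    'sex': ['male', 'female', 'female'],
})

# By default only numeric columns are summarized.
print(list(df_mixed.describe().columns))    # ['age']

# include='all' also summarizes non-numeric columns (unique, top, freq).
summary = df_mixed.describe(include='all')
print(list(summary.columns))                # ['age', 'sex']
print(summary.loc['top', 'sex'])            # female
```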

# describe() shows count, mean, std, min, max, and more all at once,
# but each statistic can also be computed on its own.

df['Age'].mean()

29.69911764705882

df['Age'].std()

14.526497332334044
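The Age column has missing values: the describe() output lists a count of 714 for Age against 891 for the other columns. mean() and std() skip NaN by default, which the sketch below shows on a small made-up Series:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, 38.0, np.nan, 26.0])

print(len(ages))                  # 4: total number of entries
print(ages.count())               # 3: count() ignores NaN
print(ages.mean())                # mean of the 3 non-missing values
print(ages.mean(skipna=False))    # nan: NaN propagates when not skipped

# std() uses the sample standard deviation (ddof=1) by default;
# ddof=0 gives the population standard deviation instead.
print(ages.std(), ages.std(ddof=0))
```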

Output of df (first and last 5 rows shown; 891 rows × 12 columns in total):

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

Output of df.head():

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Output of df.tail():

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.00	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.00	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q

Output of df.describe():

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200
