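When you read a file in R, the name of the first column or the first data value sometimes comes out garbled.
This happens when the file is UTF-8 with a BOM (byte order mark): unless told otherwise, R treats the three BOM bytes as part of the first field, so they show up as stray characters.
To fix it, pass fileEncoding so that the BOM is recognized, as in read.csv(fileEncoding = "UTF-8-BOM").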
# The first column name is garbled because of the UTF-8 BOM
> mlbstat = read.csv(file = "mlb-player-stats-Batters.csv", header = TRUE)
> summary(mlbstat)
癤풮layer Team Pos G AB R H X2B
Adeiny Hechavarria: 3 BAL : 28 1B: 76 Min. : 1.00 Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
Jose Bautista : 3 TOR : 28 2B:102 1st Qu.: 25.00 1st Qu.: 56.0 1st Qu.: 7.00 1st Qu.: 12.00 1st Qu.: 2.00
Adam Duvall : 2 LAA : 27 3B: 71 Median : 67.00 Median :189.0 Median : 21.00 Median : 43.00 Median : 9.00
Andrew McCutchen : 2 NYM : 27 C :125 Mean : 72.99 Mean :233.5 Mean : 31.08 Mean : 58.77 Mean :11.88
Asdrubal Cabrera : 2 TB : 27 DH: 9 3rd Qu.:123.00 3rd Qu.:402.0 3rd Qu.: 51.00 3rd Qu.:100.00 3rd Qu.:19.00
Austin Jackson : 2 CIN : 25 OF:229 Max. :162.00 Max. :664.0 Max. :129.00 Max. :192.00 Max. :51.00
(Other) :675 (Other):527 SS: 77
X3B HR RBI SB CS BB SO SH
Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 4.00 1st Qu.: 16.00 1st Qu.: 0.000
Median : 0.000 Median : 4.000 Median : 20.00 Median : 1.000 Median : 0.00 Median : 16.00 Median : 44.00 Median : 0.000
Mean : 1.225 Mean : 8.073 Mean : 29.59 Mean : 3.581 Mean : 1.39 Mean : 22.55 Mean : 56.65 Mean : 0.598
3rd Qu.: 2.000 3rd Qu.:13.000 3rd Qu.: 49.00 3rd Qu.: 4.000 3rd Qu.: 2.00 3rd Qu.: 33.00 3rd Qu.: 88.00 3rd Qu.: 1.000
Max. :12.000 Max. :48.000 Max. :130.00 Max. :45.000 Max. :14.00 Max. :130.00 Max. :217.00 Max. :12.000
SF HBP AVG OBP SLG OPS
Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:0.2020 1st Qu.:0.267 1st Qu.:0.3050 1st Qu.:0.5810
Median : 1.000 Median : 2.000 Median :0.2390 Median :0.307 Median :0.3880 Median :0.6930
Mean : 1.772 Mean : 2.775 Mean :0.2301 Mean :0.299 Mean :0.3713 Mean :0.6703
3rd Qu.: 3.000 3rd Qu.: 4.000 3rd Qu.:0.2670 3rd Qu.:0.337 3rd Qu.:0.4440 3rd Qu.:0.7770
Max. :11.000 Max. :22.000 Max. :1.0000 Max. :1.000 Max. :1.1250 Max. :2.0000
# Declare the encoding so that the UTF-8 BOM is recognized
> mlbstat = read.csv(file = "mlb-player-stats-Batters.csv", header = TRUE, fileEncoding = "UTF-8-BOM")
> summary(mlbstat)
Player Team Pos G AB R H X2B
Adeiny Hechavarria: 3 BAL : 28 1B: 76 Min. : 1.00 Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
Jose Bautista : 3 TOR : 28 2B:102 1st Qu.: 25.00 1st Qu.: 56.0 1st Qu.: 7.00 1st Qu.: 12.00 1st Qu.: 2.00
Adam Duvall : 2 LAA : 27 3B: 71 Median : 67.00 Median :189.0 Median : 21.00 Median : 43.00 Median : 9.00
Andrew McCutchen : 2 NYM : 27 C :125 Mean : 72.99 Mean :233.5 Mean : 31.08 Mean : 58.77 Mean :11.88
Asdrubal Cabrera : 2 TB : 27 DH: 9 3rd Qu.:123.00 3rd Qu.:402.0 3rd Qu.: 51.00 3rd Qu.:100.00 3rd Qu.:19.00
Austin Jackson : 2 CIN : 25 OF:229 Max. :162.00 Max. :664.0 Max. :129.00 Max. :192.00 Max. :51.00
(Other) :675 (Other):527 SS: 77
X3B HR RBI SB CS BB SO SH
Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 4.00 1st Qu.: 16.00 1st Qu.: 0.000
Median : 0.000 Median : 4.000 Median : 20.00 Median : 1.000 Median : 0.00 Median : 16.00 Median : 44.00 Median : 0.000
Mean : 1.225 Mean : 8.073 Mean : 29.59 Mean : 3.581 Mean : 1.39 Mean : 22.55 Mean : 56.65 Mean : 0.598
3rd Qu.: 2.000 3rd Qu.:13.000 3rd Qu.: 49.00 3rd Qu.: 4.000 3rd Qu.: 2.00 3rd Qu.: 33.00 3rd Qu.: 88.00 3rd Qu.: 1.000
Max. :12.000 Max. :48.000 Max. :130.00 Max. :45.000 Max. :14.00 Max. :130.00 Max. :217.00 Max. :12.000
SF HBP AVG OBP SLG OPS
Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:0.2020 1st Qu.:0.267 1st Qu.:0.3050 1st Qu.:0.5810
Median : 1.000 Median : 2.000 Median :0.2390 Median :0.307 Median :0.3880 Median :0.6930
Mean : 1.772 Mean : 2.775 Mean :0.2301 Mean :0.299 Mean :0.3713 Mean :0.6703
3rd Qu.: 3.000 3rd Qu.: 4.000 3rd Qu.:0.2670 3rd Qu.:0.337 3rd Qu.:0.4440 3rd Qu.:0.7770
Max. :11.000 Max. :22.000 Max. :1.0000 Max. :1.000 Max. :1.1250 Max. :2.0000
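The fix can also be tried without the original data file. The following is a minimal sketch, assuming a small made-up CSV written to a temporary file; the sample rows and the variable names `path` and `fixed` are illustrative and are not taken from the MLB data set.

```r
# Build a tiny CSV whose first three bytes are the UTF-8 BOM (EF BB BF).
# The rows below are made-up sample data, used only for illustration.
path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Player,Team,G\nA,BAL,10\nB,TOR,20\n")), path)

# Reading with fileEncoding = "UTF-8-BOM" strips the BOM before parsing,
# so the first column name comes back clean.
fixed <- read.csv(path, header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(fixed))
```

Reading the same file without fileEncoding may or may not show the garbled name, because the result depends on the platform and the locale of the R session.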
Checking the hex dump
Opening the file in a hex viewer shows that its first 3 bytes (ef bb bf) are the UTF-8 BOM.
00000000: efbb bf50 6c61 7965 722c 5465 616d 2c50 6f73 2c47 2c41 422c :...Player,Team,Pos,G,AB,
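The same check can be done from within R instead of a separate hex viewer. This is a rough sketch: `has_utf8_bom` is a hypothetical helper name introduced here, and the two temporary files exist only to demonstrate it.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 3L)
  identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Demonstration files (made up): one with a BOM, one without.
with_bom <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("Player,Team\n")), with_bom)

without_bom <- tempfile(fileext = ".csv")
writeBin(charToRaw("Player,Team\n"), without_bom)

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```

A check like this could be used to decide whether fileEncoding = "UTF-8-BOM" is needed before calling read.csv on a file of unknown origin.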
The MLB player statistics can be downloaded from the following site.
https://www.rotowire.com/baseball/stats.php