数据预处理 sklearn 或者别的库.md

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。

无量纲化

sklearn.preprocessing.MinMaxScaler
数据归一化
（数据-最小值）/极差把数据限制在0-1之间范围可以改 feature_range

from sklearn.preprocessing import MinMaxScaler

data = [[-10,16],[-5,32],[0,48],[5,64]]

scaler = MinMaxScaler(feature_range = [0,2])
scaler = scaler.fit(data)
#scaler = scaler.partial_fit(data)

#进行数据归一化
result = scaler.transform(data)

#利用归一化结果还原原数据
origin = scaler.inverse_transform(result)

#一步
result2 = scaler.fit_transform(data)

sklearn.preprocessing.StandardScaler
数据标准化
中心化后 x-均值（均值为零）/标准差
之后满足均值0方差1 的正态分布

from sklearn.preprocessing import StandardScaler

data = [[-10,16],[-5,32],[0,48],[5,64]]

scaler = StandardScaler()
scaler.fit(data)

#mean_均值, var_方差
print(scaler.mean_, scaler.var_)

result = scaler.transform(data)
print(result.mean(), result.std())
#转化后服从均值为0方差为1的正态分布

origin =  scaler.inverse_transform(result)

#一步
result2 = scaler.fit_transform(data)

sklearn.preprocessing.MaxAbsScaler
所有数据/abs（绝对值最大的值）

from sklearn.preprocessing import MaxAbsScaler

data = [[-10,8],[-5,16],[0,24],[5,32]]

scaler = MaxAbsScaler()
scaler = scaler.fit(data)
result = scaler.transform(data)

origin = scaler.inverse_transform(result)

#一步
result2 = scaler.fit_transform(data)

缺失值

import pandas as pd
import numpy as np

# 从excel表导入数据
df_t = pd.read_excel(r'D:\EdgeDownloadPlace\3dd40612152202ee8440f82a3d277008\train.xlsx')

print(df_t.info())

# 提取列 所有
print(df_t.loc[:,'uid'])

arr_t_uid = df_t.loc[:,'uid'].values

#转换成二维
arr_t_uid = arr_t_uid.reshape(-1,1)

#找到ca列中为？的行
ca_wenhao =df_t[df_t.loc[:,'ca']=='?']

#替换ca列的？为x
df_t['ca'][df_t['ca']=='?'] = 'x'

print('中位数',df_t.loc[:,'age'].median())
print('平均数',df_t.loc[:,'age'].mean())
print('众数',df_t.loc[:,'age'].mode())
cnt =df_t.loc[:,'age'].value_counts() #把所有的值计数 降序排序
idx = cnt.index #结果降序排序

此数据缺失值用？代表，直接把？换了。
如果缺失值为nan 可以直接用pd.drowna() 或pd.fillna()

编码哑变量 ---- 处理分类数据用

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

df_t = pd.read_excel(r'D:\EdgeDownloadPlace\大数据比赛测试示例\训练集.xlsx')
#删掉 ‘Embarked’列有缺失值的行
df_t.dropna(axis = 0 , subset = ['Embarked'], inplace = True)

#一维
LE = LabelEncoder()
LElabel = LE.fit_transform(df_t['Embarked'].astype(np.str))

#多维
OE= OrdinalEncoder()
OElabel = OE.fit_transform(df_t.loc[:,['Sex','Embarked']])

#独热编码
OHE = OneHotEncoder(categories = 'auto')
OHElabel = OHE.fit_transform(df_t.loc[:,['Sex','Embarked']])
df_ttt = pd.concat([df_t,pd.DataFrame(OHElabel.toarray())],axis = 1)
df_ttt.columns = df_t.columns.tolist()+OHE.get_feature_names().tolist()

print(LE.classes_,OE.categories_,OHE.categories_,sep='\n')

rLE = LE.inverse_transform(LElabel)
rOE = OE.inverse_transform(OElabel)
rOHE = OHE.inverse_transform(OHElabel)

二值化分箱 ----连续数据

import pandas as pd
import numpy as np
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import KBinsDiscretizer

df_t = pd.read_excel(r'D:\EdgeDownloadPlace\大数据比赛测试示例\训练集.xlsx')

df_t.fillna(df_t.loc[:,'Age'].value_counts().index[0],axis = 1,inplace = True)
age_t = df_t['Age'].values.reshape(-1,1)

B = Binarizer(threshold= 24)
result = B.fit_transform(age_t)

#参数 encode：ordinal onehot 参考分类
#参数 strategy： uniform 等宽分箱(平分特征)  quantile等位分箱(分箱后每箱数量一致)  kmeans按聚类分箱
KBD_ordinal = KBinsDiscretizer(n_bins = 5, encode = 'ordinal', strategy= 'uniform')
KBDresult_ordinal = KBD_ordinal.fit_transform(age_t)
print(KBDresult_ordinal.ravel())#ravel()的作用是降维

KBD_onehot = KBinsDiscretizer(n_bins = 5, encode = 'onehot', strategy= 'uniform')
KBDresult_onehot = KBD_onehot.fit_transform(age_t)
print(KBDresult_onehot.toarray())

11月
11

欧拉函数.md

算法,数论

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。


void find_prime(int n)
{
    memset(isprime,0,sizeof(isprime));
    sp = 0;
    for(int i = 2;i <= n;i++)
    {
        if(!isprime[i])
        {
            prime[++sp] = i;
            isprime[i] = i;
            phi[i] = i-1;
        }
        for(int j = 1;j <= sp && i * prime[j] <= n;j++) {
            isprime[i*prime[j]] = prime[j];
            if(prime[j] >= isprime[i]) {
                phi[i*prime[j]] = phi[i]*prime[j];
                break;
            }
            phi[i*prime[j]] = phi[i]*(prime[j]-1);
        }
    }
}

11月
11

破事水，一道泯若众题区间dp The Sports Festival.md

算法,DP

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。

The Sports Festival

题目描述

给你一个序列 ( 数据可重复 ) ，你可以将其任意排序，最小化 Σ i = 1 n ( 前 i 个数的极
差 ) 给你一个序列(数据可重复)，你可以将其任意排序，最小化 \\ \Sigma_{i = 1} ^ n (前i个数的极差)
给你一个序列(数据可重复)，你可以将其任意排序，最小化Σi=1n(前i个数的极差)

思路

0.排一下序
1.首先我们不可能先选一个最大的，再选一个最小的，如果这样选就得到了最大值，就g了，除非数据特殊
2.那我们肯定是要循序渐进一点点增大极差，那就必须要选定某一个点开始，然后慢慢把范围往两边阔
3.那我们直接一波区间dp求出所有的情况并选个最小的就行了

#include <bits/stdc++.h>
#define int long long
const int N = 2005;
int dp[N][N], x[N], n;
signed main(){
    scanf("%lld", &n);
    for(int i = 1; i <= n; i ++) scanf("%lld", &x[i]);
    std::sort(x + 1, x + 1 + n);
    for(int s = 1; s < n; s ++) for(int st = 1, ed; (ed = st + s) <= n; st ++){
        dp[st][ed] = std::min(dp[st + 1][ed], dp[st][ed - 1]) + x[ed] - x[st];
    }
    printf("%lld", dp[1][n]);
    return 0;
}

11月
11

线段树 ~乱 ------------------------阿巴阿巴阿巴.md

算法,思想or数据结构

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。

#include <iostream>
#include <cstdio>
#include <cstring>
#include <algorithm>
#include <queue>

const int maxn=5e4+4;
int n,q,tt;
int mmx,mmn;
int cow[maxn];
struct Node
{
    int mx,mn,l,r;
    int getMid() { return (l+r) >> 1; }
}treeN[maxn<<2];

int getMid(int l,int r)
{
    return (l+r) >> 1;
}

void build(int l,int r,int root)
{
    treeN[root].l=l; treeN[root].r=r;
    if(l==r)
    {
        treeN[root].mx=cow[l];
        treeN[root].mn=cow[l];
        return;
    }
    int mmd=getMid(l,r);
    build(l,mmd,root<<1);
    build(mmd+1,r,(root<<1)+1);
    treeN[root].mx=std::max(treeN[root<<1].mx,treeN[(root<<1)+1].mx);
    treeN[root].mn=std::min(treeN[root<<1].mn,treeN[(root<<1)+1].mn);
}

void getAns(int l,int r,int root)
{
    if(treeN[root].mn>=mmn && treeN[root].mx<=mmx) return ;
    if(treeN[root].l==l && treeN[root].r==r)
    {
        mmx=std::max(mmx,treeN[root].mx);
        mmn=std::min(mmn,treeN[root].mn);
        return ;
    }
    int mmd=treeN[root].getMid();
    if(r<=mmd) getAns(l,r,root<<1);
    else if(l>mmd) getAns(l,r,(root<<1)+1);
    else
    {
        getAns(l,mmd,root<<1);
        getAns(mmd+1,r,(root<<1)+1);
    }
}

int main()
{
    scanf("%d%d",&n,&q);
    for(int i=1; i<=n; i++) scanf("%d",&cow[i]);
    build(1,n,1);
    for(int i=0; i<q; i++)
    {
        scanf("%d%d",&n,&tt);
        mmx=-1,mmn=1000005;
        getAns(n,tt,1);
        printf("%d\n",mmx-mmn);
    }

    return 0;
}

#include <iostream>
#include <cstdio>
#include <cmath>
#include <cstdlib>
#include <cstring>
#include <algorithm>
#include <map>
#include <queue>
typedef std::pair<int ,int > PP;
typedef long long ll;
int n,m;
ll ans;
const int maxm=10005;
const int BIG=1e9+7;
int val[maxm];
PP p[maxm];
int treen[maxm<<2],lazy[maxm<<2];
int getMid(int a,int b)
{
    return (a+b)>>1;
}

bool cmp(PP a,PP b)
{
    if(a.second==b.second) return a.first < b.first;
    return a.second < b.second;
}

void build(int l,int r,int rt)
{
    lazy[rt]=0;
    if(l==r) treen[rt]=val[l];
    else
    {
        int mid=getMid(l,r);
        build(l,mid,rt<<1);
        build(mid+1,r,rt<<1|1);
        treen[rt]=std::min(treen[rt<<1],treen[rt<<1|1]);
    }
}

void pushdown(int rt)
{
    if(lazy[rt]==0) return ;
    lazy[rt<<1]+=lazy[rt];
    lazy[rt<<1|1]+=lazy[rt];
    treen[rt<<1]-=lazy[rt];
    treen[rt<<1|1]-=lazy[rt];
    lazy[rt]=0;
}

void update(int l,int r,int sl,int sr,int rt,int val)
{
    if(l>=sl && r<=sr) treen[rt]-=val,lazy[rt]+=val;
    else
    {
        pushdown(rt);
        int mid=getMid(l,r);
        if(mid>=sl) update(l,mid,sl,sr,rt<<1,val);
        if(mid<sr) update(mid+1,r,sl,sr,rt<<1|1,val);
        treen[rt]=std::min(treen[rt<<1],treen[rt<<1|1]);
    }
}

int query(int l,int r,int sl,int sr,int rt)
{
    if(l>=sl && r<=sr) return treen[rt];
    if(l>sr || r<sl) return BIG;
    int mid=getMid(l,r);
    int re=BIG;
    pushdown(rt);
    if(mid<sr) re=std::min(query(mid+1,r,sl,sr,rt<<1|1),re);
    if(mid>=sl) re=std::min(query(l,mid,sl,sr,rt<<1),re);
    return re;
}

int main()
{
    std::cin >> n >> m;

    int sn=n-1;
    for(int i=0; i<sn; i++) std::cin >> val[i];
    build(0,n-2,1);
    for(int i=0; i<m; i++)
    {
        std::cin >> p[i].first >> p[i].second;
        if(p[i].first > p[i].second) std::swap(p[i].first,p[i].second);
        p[i].second-=1;
    }
    std::sort(p,p+m,cmp);
    ans=0;
    for(int i=0; i<m; i++)
    {
        int tt=query(0,n-2,p[i].first,p[i].second,1);
        ans+=tt;
        update(0,n-2,p[i].first,p[i].second,1,tt);
    }
    std::cout << ans;
    return 0;
}

11月
11

筛法-素数.md

算法,数论

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。

埃拉托斯特尼筛法

埃拉托斯特尼筛法（The Sieve of Eratosthenes）：
• 设u[]为筛子，初始时区间中的所有数在筛子u[]中。按递增顺序搜索u[]中的最小数，
将其倍数从u[]中筛去，最终筛中留下的数即为素数。
• int i, j, k; • for (i=2; i<=n; i++) u[i]=true; //初始时所有数在筛中
• for (i=2; i<=n; i++) //顺序搜索筛中的最小数
• if (u[i]) {
• for (j=2; j _i <=n; j++) //将i的倍数从筛中筛去
• u[j_i]=false;
• } • for (i=2; i<=n; i++) if (u[i]) { //将筛中的所有素数放入su[]中 • su[++num]=i; • }
• 上述算法的时间复杂度为O(n * log log n)。算法中合数是作为素数的倍数被筛去的。

欧拉筛法

如果每个合数仅被它最小的质因数筛去，则算法效率可以大幅提升。由此引出一种优化的算法——欧拉筛法（Euler’s Sieve）：
• int i, j, num=1;
• memset(u, true, sizeof(u));
• for (i=2; i<=n; i++){ //顺序分析整数区间的每个数
• if (u[i]) su[num++]=i; //将筛中最小数送入素数表
• for (j=1; j<num; j++) { //搜索素数表的每个数
• if (i _su[j] >n) break; //若i与当前素数的乘积超出范围，则分析下一个整数i • u[i_su[j]]=false;
//将i与当前素数的乘积从筛子中筛去
• if (i%su[j]==0) break; //若当前素数为i的最小素因子，则分析下一个整数i • } • } •
欧拉筛法的时间复杂度可优化至O(n)。

题目1 牛牛的“质因数”

大概题意是把每个数写成若干质因数连续排列的形式，如3=3,7=7，9=33，16=2222，1500=223555，然后给出一个n，然后在这种形式下求sum（2~n）。结果对1e+7取模
此处TP：https://ac.nowcoder.com/acm/contest/9982/I

#include <iostream>
#include <cstdio>
#include <cmath>
#include <cstring>

typedef long long LL;
const int N = 4e6+5;
//const int M = 4e6;
const int mod = 1e9+7;
bool vis[N];
LL dp[N];

int M;

int getmul(int x)
{
    int re = 1;
    while(x)
    {
        x /= 10;
        re *= 10;
    }
    return re;
}

void go_dp()
{
    for(int i=2; i<=M; i++)
    {
        if(!vis[i])
        {
            dp[i] = i;
            int mul = getmul(i);
            for(int j=2; i*j<=M; j++)
            {
                vis[i*j] = true;
                int cnt = 0;
                LL tt = i*j;
                while(tt % i == 0)
                {
                    cnt += 1;
                    tt /= i;
                }
                for(int k=0; k<cnt; k++)
                {
                    dp[i*j] = (dp[i*j] * mul + i) % mod;
                }
            }
        }
    }
}

int main()
{
    LL ans = 0;
    std::cin >> M;
    go_dp();
    for(int i=2; i<=M; i++)
    {
        ans = (ans+dp[i])%mod;
    }
    std::cout << ans;
    return 0;
}

抄欧拉模板的话好像数字的排列方式有问题

11月
11

记忆化搜索（貌似，不太清楚是不是，自己摸索题的时候猜了一个） L3-025 那就别担心了 (30 分).md

算法,搜索

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。

题目

<https://pintia.cn/problem-
sets/994805046380707840/problems/1336215880692482060>
下图转自“英式没品笑话百科”的新浪微博 —— 所以无论有没有遇到难题，其实都不用担心。
![在这里插入图片描述](https://img-
blog.csdnimg.cn/img_convert/c391ed5f2575c951739264f1eb9226e7.png#pic_center)
博主将这种逻辑推演称为“逻辑自洽”，即从某个命题出发的所有推理路径都会将结论引导到同一个最终命题（开玩笑的，千万别以为这是真正的逻辑自洽的定义……）。现给定一个更为复杂的逻辑推理图，本题就请你检查从一个给定命题到另一个命题的推理是否是“逻辑自洽”的，以及存在多少种不同的推理路径。例如上图，从“你遇到难题了吗？”到“那就别担心了”就是一种“逻辑自洽”的推理，一共有
3 条不同的推理路径。

输入格式：
输入首先在一行中给出两个正整数 N（1<N≤500）和 M，分别为命题个数和推理个数。这里我们假设命题从 1 到 N 编号。

接下来 M 行，每行给出一对命题之间的推理关系，即两个命题的编号 S1 S2，表示可以从 S1 推出
S2。题目保证任意两命题之间只存在最多一种推理关系，且任一命题不能循环自证（即从该命题出发推出该命题自己）。

最后一行给出待检验的两个命题的编号 A B。

输出格式：
在一行中首先输出从 A 到 B 有多少种不同的推理路径，然后输出 Yes 如果推理是“逻辑自洽”的，或 No 如果不是。

题目保证输出数据不超过 1e9。

输入样例 1：
7 8
7 6
7 4
6 5
4 1
5 2
5 3
2 1
3 1
7 1
输出样例 1：
3 Yes
输入样例 2：
7 8
7 6
7 4
6 5
4 1
5 2
5 3
6 1
3 1
7 1
输出样例 2：
3 No

代码

一开始暴力dfs最后一个样例过不了，不是开了一个vis数组记录。刚开始瞎记录，把找到终点的路径标记为true，但是会发生路径数与最终答案不同的错误。后来发现vis数组应该开int类型。然后继续摸索。
ac后总结了一下，vis数组记录的是当前节点通往终点的路径条数。
代码：

#include <iostream>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cmath>
#include <queue>
#include <map>
#include <vector>
#include <set>

const int N = 505, M = 250005;

int head[N],dout[N],vis[N];
int idx;
int n,m,u,v,to,zero;
int ans;

struct Edge{
   int to,nxt;
}edge[M];

int dfs(int x){
   if(vis[x]) {
      ans += vis[x];
      return vis[x];
   }
   if(!dout[x]){
      zero += 1;
      dout[x] = -1;
      return 0;
   }
   for(int i=head[x]; i!=-1; i= edge[i].nxt){
      to = edge[i].to;
      vis[x] += dfs(to);
   }
   return vis[x];
}

int main()
{
   idx = -1;
   memset(head,-1,sizeof(head));
   scanf("%d%d",&n,&m);
   for(int i=0; i<m; i++){
      scanf("%d%d",&u,&v);
      edge[++idx] = {v,head[u]}; head[u] = idx;
      dout[u] += 1;
   }
   scanf("%d%d",&u,&v);
   vis[v] = 1;
   dfs(u);
   printf("%d %s",ans,(!zero) ? "Yes" : "No");
   return 0;
}

11月
11

随机森林的应用简例.md

教练我想学挂边躲牛,瞎学机器学习

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。

用jupyter lab 做的太多了懒得排版

%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
df_t = pd.read_excel(r'D:\EdgeDownloadPlace\复赛数据集\train.xlsx',header=None)

import re
with open(r'D:\EdgeDownloadPlace\复赛数据集\features.txt') as f:
    features = re.findall('[0-9] (.*)\n', f.read())
features.insert(0,'uid')
features.append('target')
df_t.columns = features

df_t = df_t.drop(columns = 'uid')
arr_t = df_t.values
import time
start_time = time.time()
param_grid = {'n_estimators' : np.arange(1,201,40)}
rfc = RandomForestClassifier(random_state = 435681971
                            ,criterion = 'entropy')
gs = GridSearchCV(rfc, param_grid, cv=4)
gs.fit(arr_t[:,:-1],arr_t[:,-1])

peak_n_lst = [gs.best_score_, gs.best_params_]
end_time = time.time()
time_span = end_time - start_time
peak_n_lst

[0.9373961218836566, {'n_estimators': 161}]
time_span
107.33542943000793
peak_n = peak_n_lst[1]['n_estimators']
start_time = time.time()
param_grid = {'n_estimators' : np.arange(peak_n-20,peak_n+20)}
rfc = RandomForestClassifier(random_state = 435681971
                            ,criterion = 'entropy')
gs = GridSearchCV(rfc, param_grid, cv=4)
gs.fit(arr_t[:,:-1],arr_t[:,-1])

peak_n_lst = [gs.best_score_, gs.best_params_]
end_time = time.time()
time_span = end_time - start_time
peak_n = peak_n_lst[1]['n_estimators']
peak_n
170
start_time = time.time()
param_grid = {'max_depth' : np.arange(1,561//2,30)}
rfc = RandomForestClassifier(random_state = 435681971
                            ,n_estimators = peak_n
                            ,criterion = 'entropy')
gs = GridSearchCV(rfc, param_grid, cv=4)
gs.fit(arr_t[:,:-1],arr_t[:,-1])

peak_depth_lst = [gs.best_score_, gs.best_params_]
end_time = time.time()
time_span = end_time - start_time
peak_depth_lst
[0.9386426592797784, {'max_depth': 31}]
time_span
407.307009935379
peak_depth = peak_depth_lst[1]['max_depth']
peak_depth
31
start_time = time.time()
param_grid = {'max_depth' : np.arange(peak_depth-20, peak_depth+30)}
rfc = RandomForestClassifier(random_state = 435681971
                            ,n_estimators = peak_n
                            ,criterion = 'entropy')
gs = GridSearchCV(rfc, param_grid, cv=4)
gs.fit(arr_t[:,:-1],arr_t[:,-1])

peak_depth_lst = [gs.best_score_, gs.best_params_]
end_time = time.time()
time_span = end_time - start_time
peak_depth_lst
[0.9393351800554016, {'max_depth': 14}]
time_span/60

31.484332279364267
peak_depth = peak_depth_lst[1]['max_depth']
peak_depth
14
start_time = time.time()
param_grid = {'min_samples_split' : np.arange(2,125,30)}
rfc = RandomForestClassifier(random_state = 435681971
                            ,n_estimators = peak_n
                            ,max_depth = peak_depth
                            ,criterion = 'entropy')
gs = GridSearchCV(rfc, param_grid, cv=4)
gs.fit(arr_t[:,:-1],arr_t[:,-1])

peak_depth_lst = [gs.best_score_, gs.best_params_]
end_time = time.time()
time_span = end_time - start_time
time_span/60
3.157407291730245
#peak_depth_lst 上面忘记换名字了
peak_minss = peak_depth_lst[1]['min_samples_split']
peak_minss
2
start_time = time.time()
param_grid = {'min_samples_split' : np.arange(2,30)}
rfc = RandomForestClassifier(random_state = 435681971
                            ,n_estimators = peak_n
                            ,max_depth = peak_depth
                            ,criterion = 'entropy')
gs = GridSearchCV(rfc, param_grid, cv=4)
gs.fit(arr_t[:,:-1],arr_t[:,-1])

peak_depth_lst = [gs.best_score_, gs.best_params_]
end_time = time.time()
time_span = end_time - start_time
time_span
1053.7646670341492
peak_depth_lst
[0.9393351800554016, {'min_samples_split': 2}]
peak_minss = peak_depth_lst[1]['min_samples_split']
peak_minss
2
import matplotlib.pyplot as plt
peak_n = 170
peak_depth = 14
peak_minss = 3
#rfc = RandomForestClassifier(random_state = 435681971
#                            ,n_estimators = peak_n
#                            ,max_depth = peak_depth
#                            ,min_samples_split = peak_minss
#                            ,oob_score = True)
#rfc.fit(arr_t[:,:-1],arr_t[:,-1])
#rfc.oob_score_
#plt.figure(figsize = [20,5])

#score_lst=[]
#for i in range(30):
#    rfc = RandomForestClassifier(#random_state = 435681971
#                                n_estimators = peak_n
#                                ,max_depth = peak_depth
#                                ,min_samples_split = peak_minss
#                                ,oob_score = True)
#    rfc.fit(arr_t[:,:-1],arr_t[:,-1])
#    score_lst.append(rfc.oob_score_)
#plt.plot(range(1,31),score_lst)

score_lst=[]
for i in range(30):
    rfc = RandomForestClassifier(#random_state = 435681971
                                n_estimators = peak_n
                                ,max_depth = peak_depth
                                ,min_samples_split = peak_minss
                                ,oob_score = True
                                ,criterion = 'entropy')
    rfc.fit(arr_t[:,:-1],arr_t[:,-1])
    score_lst.append(rfc.oob_score_)
plt.plot(range(1,31),score_lst,color = 'red')

plt.show()   

plt.figure(figsize = [15,6])
plt.plot(range(1,31),score_lst,color = 'red')
plt.show()

while True:
    rfc = RandomForestClassifier(#random_state = 435681971
                                n_estimators = peak_n
                                ,max_depth = peak_depth
                                ,min_samples_split = peak_minss
                                ,oob_score = True
                                ,criterion = 'entropy')
    rfc.fit(arr_t[:,:-1],arr_t[:,-1])
    if rfc.oob_score_ > 0.978:
        break
df_a = pd.read_excel(r'D:\EdgeDownloadPlace\复赛数据集\test.xlsx',header=None)

df_a.columns = features[:-1]

df_a = df_a.drop(columns= 'uid')
df_a
tBodyAcc-mean()-X	tBodyAcc-mean()-Y	tBodyAcc-mean()-Z	tBodyAcc-std()-X	tBodyAcc-std()-Y	tBodyAcc-std()-Z	tBodyAcc-mad()-X	tBodyAcc-mad()-Y	tBodyAcc-mad()-Z	tBodyAcc-max()-X	...	fBodyBodyGyroJerkMag-meanFreq()	fBodyBodyGyroJerkMag-skewness()	fBodyBodyGyroJerkMag-kurtosis()	angle(tBodyAccMean,gravity)	angle(tBodyAccJerkMean),gravityMean)	angle(tBodyGyroMean,gravityMean)	angle(tBodyGyroJerkMean,gravityMean)	angle(X,gravityMean)	angle(Y,gravityMean)	angle(Z,gravityMean)
0	0.278	-0.01640	-0.1240	-0.998	-0.9750	-0.960	-0.999	-0.9750	-0.958	-0.9430	...	0.1580	-0.5950	-0.861	0.0535	-0.00743	-0.733	0.7040	-0.845	0.180	-0.0543
1	0.281	-0.00996	-0.1060	-0.995	-0.9730	-0.986	-0.995	-0.9740	-0.986	-0.9400	...	0.2670	0.3400	0.140	-0.0206	-0.12800	-0.483	-0.0707	-0.848	0.190	-0.0344
2	0.277	-0.01470	-0.1070	-0.999	-0.9910	-0.993	-0.999	-0.9910	-0.992	-0.9430	...	0.7400	-0.5640	-0.766	0.1060	-0.09030	-0.132	0.4990	-0.850	0.189	-0.0351
3	0.279	-0.02300	-0.1220	-0.997	-0.9750	-0.983	-0.997	-0.9730	-0.984	-0.9420	...	0.6620	-0.7820	-0.954	-0.1220	-0.02910	-0.013	-0.0569	-0.761	0.263	0.0242
4	0.280	-0.01390	-0.1060	-0.998	-0.9880	-0.990	-0.998	-0.9880	-0.992	-0.9420	...	0.4290	-0.3290	-0.597	-0.0283	0.09240	-0.822	0.3680	-0.759	0.264	0.0297
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3074	0.231	-0.04230	-0.0899	-0.309	-0.0791	-0.152	-0.391	-0.0870	-0.257	0.0562	...	-0.0310	-0.1390	-0.589	0.2730	0.85600	-0.962	0.9530	-0.657	0.276	0.1770
3075	0.357	-0.04460	-0.1300	-0.314	-0.0556	-0.173	-0.386	-0.0575	-0.217	0.0262	...	0.0168	-0.1630	-0.593	-0.7110	-0.06120	-0.706	0.0646	-0.660	0.274	0.1760
3076	0.284	-0.00796	-0.1190	-0.309	-0.0804	-0.211	-0.369	-0.0971	-0.301	-0.1170	...	-0.1100	0.0245	-0.393	-0.0761	-0.23900	0.960	0.0866	-0.657	0.272	0.1830
3077	0.207	0.02460	-0.1040	-0.365	-0.1690	-0.216	-0.449	-0.1860	-0.326	-0.1760	...	-0.2140	-0.3520	-0.734	0.5350	-0.25700	0.927	-0.0843	-0.657	0.267	0.1880
3078	0.331	-0.06400	-0.1170	-0.068	0.1560	-0.317	-0.149	0.0701	-0.291	0.4120	...	-0.0214	-0.0863	-0.468	-0.3510	-0.33600	0.967	-0.7150	-0.810	0.185	0.1210
3079 rows × 561 columns

arr_a = df_a.values
arr_a
array([[ 0.278  , -0.0164 , -0.124  , ..., -0.845  ,  0.18   , -0.0543 ],
       [ 0.281  , -0.00996, -0.106  , ..., -0.848  ,  0.19   , -0.0344 ],
       [ 0.277  , -0.0147 , -0.107  , ..., -0.85   ,  0.189  , -0.0351 ],
       ...,
       [ 0.284  , -0.00796, -0.119  , ..., -0.657  ,  0.272  ,  0.183  ],
       [ 0.207  ,  0.0246 , -0.104  , ..., -0.657  ,  0.267  ,  0.188  ],
       [ 0.331  , -0.064  , -0.117  , ..., -0.81   ,  0.185  ,  0.121  ]])
answer = rfc.predict(arr_a).astype(np.int8).tolist()
len(answer)
3079
answer_df = pd.DataFrame(answer)
answer_df
0
0	5
1	5
2	5
3	5
4	5
...	...
3074	2
3075	2
3076	2
3077	2
3078	3
3079 rows × 1 columns

answer_df.to_excel(r'D:\EdgeDownloadPlace\复赛数据集\ANS\20201104try.xlsx')
print('ok')
ok

11月
11

（强行算法）对2019点美赛D题极端化的算法探讨、设计与遇到的问题（未完,但可能也到此为止）.md

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。

卢浮宫某建筑疏散问题时序无向图？

不考虑残疾人，不考虑工作人员，不考虑一切麻烦事。只考虑一群正常人走出建筑，怎样最快。
核心思路是随着时间的增加不断扩大每个出口的覆盖范围，让出口在覆盖范围内“找人”代表路线的规划

0. 铺垫定义

假设某建筑有好几层，层与层之间有若干通道连接，将这些通道定义为窄道（如楼梯），出口有若干个，可以存在于任意一层，出口与外界之间也存在
窄道（如门）（若人是源源不断的，则通过门和通过楼梯所需的平均时间近似相等）

1.定义单位时间

选一个合适的时间长度当 单位时间

2.定义单位距离

单位时间 内人按某个速度匀速行进的距离作为 单位距离 （eg：走楼梯和走平地在 单位时间 内走过的距离可能不同，但他们同为
单位距离 ，即只算时间），单位时间内能通过窄道的最大人数称为该出口的 最大流量

3.定义宽道窄道

把建筑划分成若干个大小基本相等的多边形（或椭圆形）区域称为节点
（对应图数据结构中的点）。相邻的点之间存在通道（对应图数据结构中的边）（大厅或长廊中从一个点到另一个点、连接两个点的楼梯都是通道），若一个通道不是
窄道，则将其定义为宽道，所有的通道的长度都要求近似符合 单位距离 。
默认窄道人多时会发生拥挤，人流速度受限制，而宽道不会发生拥挤，人流速度为“无限大”（宽道可以拥挤，但其拥挤的原因是
窄道限速，所以认为宽道不拥挤）

4.算法思路+定义

找到所有的出口，刚开始它们的 时间计数 为0。
时间计数决定了出口的覆盖范围。（eg：当 时间计数 为0时，只有在出口区域一部分人可以逃离建筑；当 时间计数 为5时，距离任意一个出口五个
单位距离 的人都可以理论上逃离建筑–无视碰撞体积的情况）

5.算法思路

循环：
{
1.出口的人数减少 最大流量 ，但不能低于0
2. 时间计数 +1
3.若距离出口附近的人数<出口 最大流量 ：出口点可以在距离自己 时间计数 个 单位距离 的范围内找人补充人数至 最大流量
。
注：*递归： * 当出口的覆盖范围中包含窄道时，应优先从窄道另一端的点找人， 见考虑1 ，进行递归。（若窄道
另一端覆盖的范围内存在新的窄道，则这一寻找中优先考虑新的窄道）
}
终止条件: ~~所有的点全部覆盖且~~ 所有人都逃出生天

a.考虑1

为什么窄道另一端优先
eg：2层楼疏散，只有一楼有出口，1、2层之间存在若干窄道，出口与外界之间存在若干窄道。姑且把前者称为 远级窄道 ，后者称为 近级窄道 。
显然疏散的最快速度时取决于 近级窄道 的。若不给 远级窄道 较高的优先级，而恰好所有的 远级窄道 的最大流量和<所有
近级窄道 的最大流量和，就很可能会出现1楼的人已经疏散完，留下大量的出口，而2楼的人无法及时下来的情况。

这只是一种找到每个通道需要走多少人的办法。当走一条通道的总人数固定，哪里的人先走对时间几乎没有影响。

A.问题1

同时涵盖到多条窄道时，如何抉择？(好像遇到坑了，怎么感觉需要花式DP啊）

代码（数组模拟数据结构）（未完成）

#include <iostream>
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <cmath>
#include <queue>
#include <map>
typedef long long ll;

int const INF = 1e7+7;
int const maxf = 10;  //最大楼层数
int const maxn = 20;  //楼层出口最大数目
int const maxd = 40;  //区域距离最远楼层出口的最大距离+1
int const maxv = 1000;//最多节点个数

int inode,iedge,cntout;

struct Node
{
    int num;        //此节点的人数
    int dis[maxn];  //此节点距离楼层中每个出口的距离
    Node *nxt[maxn];//指向跟这个节点与出口i距离相同的下一个节点
    bool narrow;   //这个节点是否是窄道的入点、出点
    Node *special;  //窄道另一端的点 仅在isIN或isOUT为true时生效
}node[maxv];

struct Edge
{
    int to,speed;   //通道速度 单位时间内能通过通道的最大人数 宽道默认很大
    bool narrow;    //是否为窄道
}edge[maxv<<3];

struct FLOOR
{
    int exitNum;        //出口数目
    int nodeNum;        //这层楼的节点总数
    int zero;           //无人节点数
    bool isempty;       //楼层是否为空
    Node *Dnode[maxn][maxd] //与第i个出口距离为j的节点组成的链表的头节点指针
}floor[maxf];

void init()
{
    inode = iedge = cntout = -1;
    memset(head,-1,sizeof(head));

}

int main()
{

    return 0;
}

题目描述

2019 ICM
Problem D: Time to leave the Louvre
The increasing number of terror attacks in France[1]
requires a review of the emergency
evacuation plans at many popular destinations. Your ICM team is helping to
design evacuation
plans at the Louvre in Paris, France. In general, the goal of evacuation is to
have all occupants
leave the building as quickly and safely as possible. Upon notification of a
required evacuation,
individuals egress to and through an optimal exit in order to empty the
building as quickly as
possible.
The Louvre is one of the world’s largest and most visited art museum,
receiving more than 8.1
million visitors in 2017[2]
. The number of guests in the museum varies throughout the day and
year, which provides challenges in planning for regular movement within the
museum. The
diversity of visitors – speaking a variety of languages, groups traveling
together, and disabled
visitors – makes evacuation in an emergency even more challenging.
The Louvre has five floors, two of which are underground.
![在这里插入图片描述](https://img-blog.csdnimg.cn/20210112235110425.png?x-oss-
process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dsZGNtenk=,size_16,color_FFFFFF,t_70#pic_center)
The 380,000 exhibits located on these five floors cover approximately 72,735
square meters,
with building wings as long as 480 meters or 5 city blocks[3]. The pyramid
entrance is the main
and most used public entrance to the museum. However, there are also three
other entrances
usually reserved for groups and individuals with museum memberships: the
Passage Richelieu
entrance, the Carrousel du Louvre entrance, and the Portes Des Lions entrance.
The Louvre has
an online application, “Affluences” (https://www.affluences.com/louvre.php),
that provides realtime updates on the estimated waiting time at each of these
entrances to help facilitate entry to
the museum. Your team might consider how technology, to include apps such as
Affluences, or
others could be used to facilitate your evacuation plan.
Only emergency personnel and museum officials know the actual number of total
available exit
points (service doors, employee entrances, VIP entrances, emergency exits, and
old secret
entrances built by the monarchy, etc.). While public awareness of these exit
points could provide
additional strength to an evacuation plan, their use would simultaneously
cause security concerns
due to the lower or limited security postures at these exits compared with
level of security at the
four main entrances. Thus, when creating your model, your team should consider
carefully when
and how any additional exits might be utilized.
Your supervisor wants your ICM team to develop an emergency evacuation model
that allows
the museum leaders to explore a range of options to evacuate visitors from the
museum, while
also allowing emergency personnel to enter the building as quickly as
possible. It is important to
identify potential bottlenecks that may limit movement towards the exits. The
museum
emergency planners are especially interested in an adaptable model that can be
designed to
address a broad set of considerations and various types of potential threats.
Each threat has the
potential to alter or remove segments of possible routes to safety that may be
essential in a single
optimized route. Once developed, validate your model(s) and discuss how the
Louvre would
implement it.
Based on the results of your work, propose policy and procedural
recommendations for
emergency management of the Louvre. Include any applicable crowd management
and control
procedures that your team believes are necessary for the safety of the
visitors. Additionally,
discuss how you could adapt and implement your model(s) for other large,
crowded structures.
Your submission should consist of:
 One-page Summary Sheet,
 Your solution of no more than 20 pages, for a maximum of 21 pages with your
summary.
 Judges expect a complete list of references with in-text citations, but may
not consider
appendices in the judging process.
 Note: Reference list and any appendices do not count toward the 21-page
limit and
should appear after your completed solution.
References:
[1] Reporters, Telegraph. “Terror Attacks in France: From Toulouse to the
Louvre.” The
Telegraph, Telegraph Media Group, 24 June 2018,
www.telegraph.co.uk/news/0/terrorattacks-france-toulouse-louvre/. [2] “8.1
Million Visitors to the Louvre in 2017.” Louvre Press Release, 25 Jan. 2018,
presse.louvre.fr/8-1-million-visitors-to-the-louvre-in-2017/. [3] “Interactive
Floor Plans.” Louvre - Interactive Floor Plans | Louvre Museum | Paris,
30 June 2016, www.louvre.fr/en/plan.
[4] “Pyramid” Project Launch – The Musée du Louvre is improving visitor
reception
(2014-2016).” Louvre Press Kit, 18 Sept. 2014,
www.louvre.fr/sites/default/files/dp_pyramide 28102014_en.pdf.
[5] “The ‘Pyramid’ Project - Improving Visitor Reception (2014-2016).” Louvre
Press
Release, 6 July 2016, presse.louvre.fr/the-pyramid-project/.
Glossary:
Bottlenecks – places where movement is dramatically slowed or even stopped.
Emergency personnel – people who help in an emergency, such as guards, fire
fighters,
medics, ambulance crews, doctors, and police.

11月
11

【屑】【学习记录】【Python】自己能看明白的关于学习bs4.BeautifulSoup基本知识的总结.md

说明 - 2022-05-05

本篇博客为本人原创, 原发布于CSDN, 在搭建个人博客后使用爬虫批量爬取并挂到个人博客, 出于一些技术原因博客未能完全还原到初始版本(而且我懒得修改), 在观看体验上会有一些瑕疵 ,若有需求会发布重制版总结性新博客。发布时间统一定为1111年11月11日。钦此。

留着日后看
学习网址：https://fishc.com.cn/thread-97807-1-1.html
![page1](https://img-blog.csdnimg.cn/20200222175719367.jpg?x-oss-
process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dsZGNtenk=,size_16,color_FFFFFF,t_70)
![page2](https://img-blog.csdnimg.cn/20200222175804505.jpg?x-oss-
process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dsZGNtenk=,size_16,color_FFFFFF,t_70)
![page3](https://img-blog.csdnimg.cn/20200222175818432.jpg?x-oss-
process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dsZGNtenk=,size_16,color_FFFFFF,t_70)

StarsWhisper

说明 - 2022-05-05

无量纲化

缺失值

编码 哑变量 ---- 处理分类数据用

二值化 分箱 ----连续数据

说明 - 2022-05-05

说明 - 2022-05-05

The Sports Festival

题目描述

思路

说明 - 2022-05-05

说明 - 2022-05-05

埃拉托斯特尼筛法

欧拉筛法

题目1 牛牛的“质因数”

说明 - 2022-05-05

题目

代码

说明 - 2022-05-05

说明 - 2022-05-05

卢浮宫 某建筑疏散问题 时序无向图？

0. 铺垫定义

1.定义 单位时间

2.定义 单位距离

3.定义 宽道 窄道

4.算法思路+定义

5.算法思路

a.考虑1

A.问题1

代码（数组模拟数据结构）（未完成）

题目描述

说明 - 2022-05-05

编码哑变量 ---- 处理分类数据用

二值化分箱 ----连续数据

卢浮宫某建筑疏散问题时序无向图？

1.定义单位时间

2.定义单位距离

3.定义宽道窄道